DSI Project 2

This project focused on Billboard chart data from the year 2000. The data set lists the artist, track, and other key information for each track that appeared on the Billboard Top 100 that year.

We were given the following fictitious scenario:

“On next week’s episode of the ‘Are You Entertained?’ podcast, we’re going to be analyzing the latest generation’s guilty pleasure- the music of the ’00s. Our Data Scientists have poured through Billboard chart data to analyze what made a hit soar to the top of the charts, and how long they stayed there. Tune in next week for an awesome exploration of music and data as we continue to address an omnipresent question in the industry- why do we like what we like?”

After exploring the data I became interested in the question, “Does a shorter track length increase a track’s probability of being on the Billboard Top 100 in the year 2000?”

This was an interesting question to me, and one I believed the data on hand should be able to answer. First, however, I needed to clean the data.

I started by cleaning up the column names. I used a lambda function since the naming convention was the same across the columns, and this accomplished my goal with one line of code, rather than redefining every column name by hand.
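A minimal sketch of that kind of one-line rename, assuming the raw headers use dots as separators (for example `artist.inverted` and `x1st.week`). The column names and the sample row are illustrative assumptions, not the project's exact code:

```python
import pandas as pd

# Toy frame; the column names and values are assumed for illustration.
df = pd.DataFrame(
    [["Artist A", "Track A", "3:38", 78, 63.0]],
    columns=["artist.inverted", "track", "time", "x1st.week", "x2nd.week"],
)

# The lambda is applied to every column label, so no name is spelled out by hand.
df = df.rename(columns=lambda name: name.replace(".", "_"))

print(list(df.columns))
```

Because the same rule applies to every label, the rename keeps working if more weekly columns are added.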


I decided to convert the time column to a single integer representing seconds. I did this by using the str.split method to separate the two numbers representing minutes and seconds, and then using apply with a lambda function to combine the two into a total number of seconds. This would help me with visualizing the data.
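The conversion described above might look roughly like this. It is a sketch that assumes the time column holds strings in a `minutes:seconds` form such as `3:38`; the sample tracks are made up:

```python
import pandas as pd

# Toy data; "M:SS" strings are an assumed format for the time column.
df = pd.DataFrame({
    "track": ["Track A", "Track B", "Track C"],
    "time": ["3:38", "4:07", "2:59"],
})

# str.split yields a [minutes, seconds] list for each row;
# the lambda combines the two parts into a total number of seconds.
df["time"] = df["time"].str.split(":").apply(
    lambda parts: int(parts[0]) * 60 + int(parts[1])
)

print(df["time"].tolist())
```

If the raw strings carried a third field (for example `3:38:00`), the lambda would need to be adjusted to match.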


I also wanted to create a melted data frame that included the time column for my problem statement exploration.
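As a rough sketch of the melt step, assuming one column per chart week: the names `week_1`, `week_2`, `week`, and `rank` are hypothetical, and the values are made up.

```python
import pandas as pd

# Toy wide frame: one row per track, one column per chart week (names assumed).
wide = pd.DataFrame({
    "artist": ["Artist A", "Artist B"],
    "track": ["Track A", "Track B"],
    "time": [218, 247],          # track length in seconds
    "week_1": [78, 15],
    "week_2": [63.0, None],      # None: track not on the chart that week
})

# Keep the identifying columns (including time) fixed and stack the weekly
# columns into week/rank pairs, then drop weeks with no chart position.
melted = pd.melt(
    wide,
    id_vars=["artist", "track", "time"],
    var_name="week",
    value_name="rank",
).dropna(subset=["rank"])

print(melted)
```

Listing `time` in `id_vars` is what carries the track length onto every week/rank row, so length and rank can be compared directly.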


This allowed me to manipulate the data and create some revealing graphs.


This swarm plot attempts to find the correlation between the length of a track and its ranking on the Billboard chart. The chart indicates that the shorter the track, the higher its rank.
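One way to put a number on the relationship the plot suggests is a correlation coefficient. The sketch below uses made-up values and the hypothetical column names `time` and `rank`; it only shows the calculation and says nothing about the real data:

```python
import pandas as pd

# Made-up values for illustration only; not the Billboard data.
melted = pd.DataFrame({
    "time": [180, 200, 220, 240, 260],   # track length in seconds
    "rank": [10, 25, 40, 55, 90],        # chart position (1 is best)
})

# Pearson correlation between track length and chart position.
# A lower rank number is a better position, so a positive value would mean
# shorter tracks tend to sit higher on the chart.
r = melted["time"].corr(melted["rank"])

print(round(r, 3))
```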

A simple histogram of the time data points the same way.


This chart shows that a good number of the tracks on the Top 100 are short to medium in length.

The summary statistics of the time data seem to support this as well. For instance, the standard deviation of the time data is:
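A sketch of how this figure could be computed with pandas, using placeholder values in place of the real time column (the numbers below are not the project's data):

```python
import pandas as pd

# Placeholder track lengths in seconds; substitute the real time column.
times = pd.Series([218, 247, 179], name="time")

# Series.std() returns the sample standard deviation (ddof=1) by default.
time_std = times.std()

print(time_std)
```

A standard deviation is easier to interpret alongside the mean or median, which `times.describe()` reports together.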


The data seem to suggest an answer to the problem statement: the shorter the track, the higher its likelihood of being ranked on the Billboard Top 100 in the year 2000.


