Project 3

My approach for this assignment was to clean the data and narrow it down to the Top 20 stores, then use that subset to inform my recommendations to the store owner, not only on locations but also on other factors that could grow the business.

I went in with the assumption that the top-performing zip codes would have room to expand, an assumption suggested by both the data set and the problem itself.

My problem statement is: “By focusing on the Top 20 best performing stores – as defined by Total Sales – we can optimize the performance of already established business centers and increase our profit margins.”

I believe this problem statement was proven true.

The following chart helps us understand the performance of the top 20 stores in Iowa.  It shows a significant difference in performance between the top 2 stores and their competitors.  As a result, a strong potential recommendation is to model the behavior of these top stores at the bottom 10 stores on this list, greatly improving market performance in areas we already know will do well.
In addition, looking at the Total Volume Sold (in liters) at these locations, we notice that volume sold trends in tandem with performance.  However, the differences are small and will not inform our recommendations a great deal.
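As a rough illustration of how this view could be assembled, here is a minimal pandas sketch. The file name `iowa_liquor_sales.csv` and the column names (`Store Number`, `Sale (Dollars)`, `Volume Sold (Liters)`) are assumptions based on the public Iowa liquor sales data, not the exact code behind the chart.

```python
import pandas as pd

# Assumed file and column names for the Iowa liquor sales data set.
sales = pd.read_csv("iowa_liquor_sales.csv")

# Aggregate total dollar sales and total volume (in liters) per store.
store_totals = sales.groupby("Store Number").agg(
    {"Sale (Dollars)": "sum", "Volume Sold (Liters)": "sum"}
)
store_totals.columns = ["total_sales", "total_volume"]

# Keep the 20 best-performing stores by total sales.
top_20 = store_totals.sort_values("total_sales", ascending=False).head(20)

# A simple bar chart makes the gap between the top 2 stores
# and the rest of the Top 20 easy to see.
top_20["total_sales"].plot(kind="bar", title="Top 20 Stores by Total Sales")
```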
Following that logic we can explore data in regards to the top 20 stores and begin to look for trends.
Interestingly, we find that the top 2 performing locations have a mean price that sits in the middle of the “Top 20.” In contrast, the majority of the “Top 20” either underprice or overprice relative to that mean. This is helpful, because it suggests that for the other stores in the top 20 we can likely impact performance by adjusting prices to more closely match the top 2 performing stores.
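Continuing from the sketch above, here is a hedged example of how that mean-price comparison could be made; the `State Bottle Retail` column is an assumption about which price field was used.

```python
# Mean bottle price per store, restricted to the Top 20 stores.
mean_price = (
    sales[sales["Store Number"].isin(top_20.index)]
    .groupby("Store Number")["State Bottle Retail"]
    .mean()
)

# Average price of the two best performers, and each store's gap from it.
top_2_price = mean_price.loc[top_20.index[:2]].mean()
price_gap = mean_price - top_2_price
print(price_gap.sort_values())
```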
Taking a deeper dive into some predictive analytics centered on Total Sales, Price Per Liter, and Total Volume Sold (in liters), we also find some encouraging results.
Fitting regressions on the variables above, we find a model with an R-squared of 0.92, with both the Mean Absolute Error and Mean Squared Error hovering around 1, for predictive models built on Total Sales and Total Price.
We see that modeled here:
This shows us that we can be confident in building out predictive tools on these values.
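As a hedged sketch of how such a model could be fit and scored with scikit-learn, continuing from the aggregated store data above (this is not the exact model from the project):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Derive Price Per Liter from the aggregates, then predict Total Sales.
store_totals["price_per_liter"] = store_totals["total_sales"] / store_totals["total_volume"]
X = store_totals[["total_volume", "price_per_liter"]]
y = store_totals["total_sales"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

print("R-squared:", r2_score(y_test, predictions))
print("MAE:", mean_absolute_error(y_test, predictions))
print("MSE:", mean_squared_error(y_test, predictions))
```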
Using the data discussed above, I would recommend that the store owner focus on the bottom 10 zip codes identified in our “Top 20” data.  These regions already show an ability to perform at a high level, and the top 2 stores show how drastically that performance could be improved.
I would also recommend using our predictive models to pace the growth in those regions, and applying the pricing pattern we observed in our top 2 stores to adjust prices there.

DSI Project 2

This project focused on Billboard chart data from the year 2000.  It showcased the Artist, Track, and other important information regarding the tracks that were in the Billboard Top 100 in the year 2000.

We were given the following fictitious scenario:

“On next week’s episode of the ‘Are You Entertained?’ podcast, we’re going to be analyzing the latest generation’s guilty pleasure- the music of the ’00s. Our Data Scientists have poured through Billboard chart data to analyze what made a hit soar to the top of the charts, and how long they stayed there. Tune in next week for an awesome exploration of music and data as we continue to address an omnipresent question in the industry- why do we like what we like?”

After exploring the data I became interested in the question, “Does a shorter track length increase a song’s probability of being on the Billboard Top 100 in the year 2000?”

This was an interesting question to me, one that I believed the data on hand should be able to answer.  First, however, I needed to clean the data.

I started by cleaning up the column names. I used a lambda function, since the naming convention was the same across all of the columns, which accomplished my goal with one line of code rather than spelling out every column name.

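The screenshot of this step did not survive; here is a minimal sketch of the idea, assuming the raw Billboard columns use a dotted naming convention like `artist.inverted` (the exact convention is an assumption):

```python
import pandas as pd

billboard = pd.read_csv("billboard.csv")

# One lambda normalizes every column name instead of renaming each one:
# lower-case it and swap periods for underscores.
billboard = billboard.rename(columns=lambda name: name.lower().replace(".", "_"))
```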

I decided to convert the time column to a single integer representing seconds. I did this by using the str.split method to separate the minutes and seconds, then applying a lambda function to combine the two into a total number of seconds. This would help me visualize the data:

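A minimal sketch of that conversion, assuming the column is called `time` and holds strings like `"3:38"` (the separator and column name are assumptions):

```python
# Split "minutes:seconds" strings, then fold each pair into total seconds.
parts = billboard["time"].str.split(":")
billboard["time_seconds"] = parts.apply(lambda ms: int(ms[0]) * 60 + int(ms[1]))
```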

I also wanted to create a melted data frame that included the time column for my problem statement exploration.

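A hedged sketch of that reshaping with `pd.melt`, keeping the new seconds column alongside each weekly rank; the id and week column names here are assumptions:

```python
# Melt the weekly ranking columns into long format so every row is
# one (track, week, rank) observation with the track length attached.
week_cols = [col for col in billboard.columns if "week" in col]
melted = pd.melt(
    billboard,
    id_vars=["artist_inverted", "track", "time_seconds"],
    value_vars=week_cols,
    var_name="week",
    value_name="rank",
).dropna(subset=["rank"])
```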

This allowed me to manipulate the data and create some revealing graphs.


This swarm plot explores the relationship between the length of a track and its ranking on the Billboard chart. The chart indicates that the shorter the track, the higher its rank.
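The plot itself is not reproduced here; a minimal seaborn sketch of that kind of view, built from the melted frame above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Swarm plot of track length against Billboard rank.
sns.swarmplot(x="rank", y="time_seconds", data=melted)
plt.xlabel("Billboard rank")
plt.ylabel("Track length (seconds)")
plt.show()
```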

We see this pattern confirmed again with a simple histogram of the time data.

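A one-line sketch of that histogram (the bin count is arbitrary):

```python
# Distribution of track lengths across the Top 100 tracks.
billboard["time_seconds"].plot(kind="hist", bins=20, title="Track length (seconds)")
```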

This chart shows that most of the tracks on the Top 100 are short to medium in length.

The summary statistics point in the same direction.  For instance, the standard deviation of the time data is:
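A trivial sketch of how that figure is computed from the seconds column:

```python
# Standard deviation (and other summary statistics) of track length in seconds.
print(billboard["time_seconds"].std())
print(billboard["time_seconds"].describe())
```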


The data seems to support the problem statement: the shorter the track, the higher its likelihood of being ranked on the Billboard Top 100 in the year 2000.

Under Construction

This blog will return after a short commercial break.