Showing Some Respect for Data Munging, Part 2: Pitchfork Dataset Cleaning
This article is originally published at https://methodmatters.blogspot.com/
So far on this blog, we've used the data containing information on Pitchfork music reviews (available on Kaggle at this link) for a number of different data analyses. I've found this to be a particularly interesting dataset, and have learned a great deal doing all of these small fun projects.
However, before I could do any data analysis, I had to spend some time and effort on data preparation. Specifically, the raw data from Kaggle are contained in a number of separate tables in an SQLite database, and therefore required a bit of work to clean and transform into a single tidy dataset.
The goal of this post is to give an overview of the data munging process for the Pitchfork data. Data preparation (or data munging) is an important topic in data science that receives far too little attention in most conversations about applied analytics. Most of the posts on this blog focus on the "sexier" aspects of data analysis, but from time to time it's good to highlight all of the important prep work that is necessary before one can actually analyze data.
Code Repo on Github
All the code is available on Github here. The repo contains a Jupyter notebook that reads the data (available on Kaggle - I can't distribute it myself), and performs the extraction, cleaning, and merging of the Pitchfork review data to create a final tidy dataset.
We will not go over all of the code in this blog post. Rather, we will focus on a few of the key steps that illustrate some of the issues which need to be solved in order to create the tidy data structure of our final dataset.
Extracting the Data from SQLite
The first steps are to import the libraries we'll need and define the directory where the data are stored. We then connect to the SQLite database and print the names of the tables contained therein.
# import needed libraries
import sqlite3, datetime
import pandas as pd
import numpy as np
# define directory where the data are stored
in_dir = 'C:\\Directory\\'
# What are the tables in the database?
con = sqlite3.connect(in_dir + 'database.sqlite')
pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';", con)
There are 6 different tables in our SQLite database: reviews, artists, genres, labels, years, and content. We will extract all of the data from each table into a separate data frame, and close the connection to the SQLite database:
# extract the tables from sqlite to Pandas dataframes
reviews = pd.read_sql('SELECT * FROM reviews', con)
artists = pd.read_sql('SELECT * FROM artists', con)
genres = pd.read_sql('SELECT * FROM genres', con)
labels = pd.read_sql('SELECT * FROM labels', con)
years = pd.read_sql('SELECT * FROM years', con)
content = pd.read_sql('SELECT * FROM content', con)
# and close the connection to the sqlite database
con.close()
We now have 6 different data frames, each one containing all of the data from a single table in the original SQLite database.
Munging the Reviews Data
We will first munge the dataset entitled "reviews." The dataframe looks like this (first 5 rows shown):
| | reviewid | title | artist | url | score | best_new_music | author | author_type | pub_date | pub_weekday | pub_day | pub_month | pub_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22703 | mezzanine | massive attack | http://pitchfork.com/reviews/albums/22703-mezz... | 9.3 | 0 | nate patrin | contributor | 2017-01-08 | 6 | 8 | 1 | 2017 |
| 1 | 22721 | prelapsarian | krallice | http://pitchfork.com/reviews/albums/22721-prel... | 7.9 | 0 | zoe camp | contributor | 2017-01-07 | 5 | 7 | 1 | 2017 |
| 2 | 22659 | all of them naturals | uranium club | http://pitchfork.com/reviews/albums/22659-all-... | 7.3 | 0 | david glickman | contributor | 2017-01-07 | 5 | 7 | 1 | 2017 |
| 3 | 22661 | first songs | kleenex, liliput | http://pitchfork.com/reviews/albums/22661-firs... | 9.0 | 1 | jenn pelly | associate reviews editor | 2017-01-06 | 4 | 6 | 1 | 2017 |
| 4 | 22725 | new start | taso | http://pitchfork.com/reviews/albums/22725-new-... | 8.1 | 0 | kevin lozano | tracks coordinator | 2017-01-06 | 4 | 6 | 1 | 2017 |
The code below checks whether there are any duplicate review id's. In fact, there are 4. We then print the data for one of the duplicated review ids:
# how many duplicated review id's are there?
print(reviews.reviewid.duplicated().sum())
# which review ids are they?
print(reviews.reviewid[reviews.reviewid.duplicated()].values)
# look at the first one
reviews[reviews.reviewid == 9417]
Which gives us the following:
| | reviewid | title | artist | url | score | best_new_music | author | author_type | pub_date | pub_weekday | pub_day | pub_month | pub_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12116 | 9417 | radiodread | easy star all-stars | http://pitchfork.com/reviews/albums/9417-radio... | 7.0 | 0 | joe tangari | contributor | 2006-10-11 | 2 | 11 | 10 | 2006 |
| 12120 | 9417 | radiodread | easy star all-stars | http://pitchfork.com/reviews/albums/9417-radio... | 7.0 | 0 | joe tangari | contributor | 2006-10-11 | 2 | 11 | 10 | 2006 |
It looks like, for whatever reason, four of the rows are duplicated in the reviews data. My guess is this has something to do with the scraping program that collected the data from the Pitchfork website, but it's hard to know for sure.
We will simply remove the duplicated rows (and reset the index on the Pandas dataframe). I always print the shapes of data frames when doing this type of operation, just to make sure I've actually done the sub-setting properly. The following code drops the duplicate rows and prints the shapes of the original and sub-set dataframes:
# conclusion: there are 4 duplicated reviewid's in the reviews database,
# but as the rows are identical, we can simply drop the duplicates
# (we also reset the index)
reviews_final = reviews.drop_duplicates().reset_index(drop = True)
# check that we've removed them by comparing the shapes
print(reviews.shape)
print(reviews_final.shape)
# looks ok!
We have in fact removed 4 rows!
Munging the Genres Data
The genres data contain two columns (the review id and the genre) and look like this:
This format seems pretty straightforward. However, I again checked for duplicate review id's (code below), and saw that there were 4,291 of them. Why does this happen? The code below does the duplicate checking, extracts the first 5 duplicate review ids, and displays the data for one of them:
# how many duplicated review id's are there?
print(genres.reviewid.duplicated().sum())
# what are some of the duplicated review ids?
print(genres.reviewid[genres.reviewid.duplicated()].head().values)
# display the data for an example duplicate review id
genres[genres.reviewid == 8005]
Which gives us the following result:
This seems quite obvious, in retrospect. Not all albums contain music with a single genre. When an album contains music with a variety or mix of genres, the album is assigned multiple genres in the data.
We are ultimately interested in producing a dataset with one row per album. We therefore need to transpose these data from the long format (one row per review id-genre combination) to the wide format (one row per review id). We will use the Pandas get_dummies function to produce dummy variables (also sometimes called "one-hot encoding") with this information.
We can produce the matrix of dummy variables, which will put each unique genre value as a column, with the value of "1" when the row contains the genre value, and "0" if the row does not contain the genre value:
# make pandas dummy variables out of the genres
# (still multiple rows per review id here)
dummy_genres = pd.get_dummies(genres.genre)
# the number of rows is still the same as in the genres table
print(genres.shape)
print(dummy_genres.shape)
Which returns the following to the console:
We have kept the same number of rows, but increased the number of columns. Our dummy_genres dataframe looks like this:
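Since the original screenshots are not reproduced here, a minimal toy example (with made-up review ids and genres, not the actual Pitchfork data) shows what get_dummies does to the long-format genre column:

```python
import pandas as pd

# toy long-format data: one row per reviewid-genre combination;
# review id 2 has two genres, mirroring the real data
toy_genres = pd.DataFrame({
    'reviewid': [1, 2, 2, 3],
    'genre': ['rock', 'rock', 'electronic', 'jazz']
})

# one 0/1 column per unique genre; the number of rows is unchanged
toy_dummies = pd.get_dummies(toy_genres.genre)
print(toy_dummies)
```

Each row still corresponds to a single reviewid-genre combination; the aggregation to one row per review id comes in the next step.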
We now need to aggregate the data to the level of the review id - this will give us the genre data with one row per album, which is what we need. Because of the dummy (0/1) format of our data, if we group the dataframe by review id and take the sum of the dummy columns, we should get the data we want: a single row per review id, with binary indicators for genre, including the cases where albums have multiple genres.
The code below goes through the following steps: it concatenates the review id with the dummy variable data, groups the data by review id and sums all of the dummy columns, and checks the maximum value of the dummy variables. If we have understood the data structure and done everything properly, the dummy columns should have a maximum value of 1.
# merge the dummies back into the genre database
genres_wdummies = pd.concat([genres.reviewid, dummy_genres], axis = 1)
# aggregate to the reviewid level; we take the sum of the dummies
# each dummy should only exist once, so we essentially get boolean
# indicators of the genres represented by the album for a given reviewid
genres_wdummies_gb = genres_wdummies.groupby('reviewid').sum()
# check the maximum value of each dummy column
# some have more than 1 entry per genre!
genres_wdummies_gb.max()
The last line of code returns the following result:
Three of the genre variables have maximum values of 2! How can this possibly happen?
Let's check out the review that contains the "global" genre twice:
# which review contains the "global" genre twice?
genres_wdummies_gb.index[genres_wdummies_gb['global'] > 1]
# we recognize this review id - it's one of the ones
# that was repeated in the reviews data above!
This returns the review id 9417. This is one of the duplicate review id's we saw when munging the reviews data above!
What do the raw data look like for this review id?
This seems to be, as we saw with the reviews data above, a problem with duplicate entries in our data. It is again unclear why certain reviews appear multiple times in our database, but this issue of duplicates is something that we have to be attentive to all throughout our data munging process.
The solution I chose here was to keep the data I had produced, and simply binarize the genre columns. In other words, a score of 0 remains 0; all other values are set to 1.
The code below performs this binarization and checks the maximum of the dummy columns to verify that it worked.
# conclusion: we can just binarize everything - if it's 2, it should be 1
# binarize the dataframe: set all values to 0 or 1
genres_final = genres_wdummies_gb.apply(lambda x: np.where(x == 0, 0, 1), axis = 0)
# check the maximum of the dummy columns again
genres_final.max()
# now we're good!
The last line of code returns the following:
It looks like we have solved the issue!
The head of our cleaned genres dataset, called genres_final, looks like this:
The code found at the Github repo contains the code for munging all 6 of the component data sets, and creating a single final dataframe with one row per album/review id.
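The exact merging logic lives in the repo, but as a rough sketch, joining the cleaned component tables on the review id can be done with pd.merge. The frames below are tiny stand-ins for illustration, not the real tables:

```python
import pandas as pd

# small stand-ins for two of the cleaned component tables
reviews_toy = pd.DataFrame({'reviewid': [1, 2], 'score': [7.5, 9.0]})
genres_toy = pd.DataFrame({'reviewid': [1, 2], 'rock': [1, 0], 'jazz': [0, 1]})

# left join on reviewid keeps one row per review;
# repeating this for each component table builds the final dataset
merged = pd.merge(reviews_toy, genres_toy, on='reviewid', how='left')
print(merged.shape)
```

A left join on the reviews table ensures that every review is retained even if a component table has no entry for it (those columns simply come through as NaN).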
The code on Github produces a master dataset containing reviews of 18,389 albums. The head of the final cleaned data looks like this:
|0||22703||mezzanine||massive attack||http://pitchfork.com/reviews/albums/22703-mezz...||9.3||0||nate patrin||contributor||2017-01-08||6||8||1||2017||“Trip-hop” eventually became a ’90s punchline,...||1998.0||NaN||1||0||0||0||0||0||0||0||0||virgin||NaN||NaN||NaN||NaN||massive attack||NaN||NaN||NaN||NaN||NaN||NaN|
|1||22721||prelapsarian||krallice||http://pitchfork.com/reviews/albums/22721-prel...||7.9||0||zoe camp||contributor||2017-01-07||5||7||1||2017||Eight years, five albums, and two EPs in, the ...||2016.0||NaN||0||0||0||0||0||1||0||0||0||hathenter||NaN||NaN||NaN||NaN||krallice||NaN||NaN||NaN||NaN||NaN||NaN|
|2||22659||all of them naturals||uranium club||http://pitchfork.com/reviews/albums/22659-all-...||7.3||0||david glickman||contributor||2017-01-07||5||7||1||2017||Minneapolis’ Uranium Club seem to revel in bei...||2016.0||NaN||0||0||0||0||0||0||0||0||1||static shock||fashionable idiots||NaN||NaN||NaN||uranium club||NaN||NaN||NaN||NaN||NaN||NaN|
|3||22661||first songs||kleenex, liliput||http://pitchfork.com/reviews/albums/22661-firs...||9.0||1||jenn pelly||associate reviews editor||2017-01-06||4||6||1||2017||Kleenex began with a crash. It transpired one ...||2016.0||NaN||0||0||0||0||0||0||0||0||1||kill rock stars||mississippi||NaN||NaN||NaN||kleenex||liliput||NaN||NaN||NaN||NaN||NaN|
|4||22725||new start||taso||http://pitchfork.com/reviews/albums/22725-new-...||8.1||0||kevin lozano||tracks coordinator||2017-01-06||4||6||1||2017||It is impossible to consider a given release b...||2016.0||NaN||1||0||0||0||0||0||0||0||0||teklife||NaN||NaN||NaN||NaN||taso||NaN||NaN||NaN||NaN||NaN||NaN|
This is the dataset that has served as input for all of the different posts I have done using the Pitchfork data!
Summary and Conclusion
This post was completely dedicated to the data munging process. We started with an SQLite database containing 6 different tables, and cleaned them one-by-one, merging them all to produce a final tidy dataset, with one row per album, and all of the information about each album contained in the columns. The complete code is available on Github here and the data are available here.
In popular discussions of data science or statistics, one often starts the narrative with a cleaned dataset that is appropriate for the analysis to be presented. In real life, however, this is rarely the case. For this reason, we focused here on munging 2 of the 6 tables: the reviews table and the genres table, which provide a good illustration of some common pitfalls.
Both the reviews table and the genres table had duplicate data. I wasn't expecting this, and the documentation for the data gave no indication of the issue. Based on the data collection procedure, my assumption is that something went wrong during the scraping of the data, such that 4 of the albums were scraped twice. Odd "surprises" like this are very common when working on applied data problems, and being vigilant and doing checks like the ones described above is critical to producing high-quality, analysis-ready datasets. It is very dangerous to assume that the data someone else gave you (or that you found on the internet) are correct and "ready-to-analyze."
I made a poor assumption in my first check of the genres dataset above. I did indeed check for duplicate review id's, and my assumption was that this duplication was due to the fact that each album can have multiple genres. I assumed that everything was fine and continued with my data transformation process (from long to wide formats). It was only in a subsequent check that I realized that there were problems with the duplicate data in the original table. Because I checked multiple times, I was able to realize the problem and correct it. But I think this serves as a nice example of the importance of checking and testing data throughout the munging process!
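One way to bake this lesson into a munging workflow is a small reusable check that fails loudly when an id column contains duplicates. This is a sketch of the idea, not code from the repo:

```python
import pandas as pd

def assert_unique_ids(df, id_col):
    """Raise a ValueError listing the offending ids if id_col contains duplicates."""
    dupes = df[id_col][df[id_col].duplicated()].unique()
    if len(dupes) > 0:
        raise ValueError(f"duplicate values in '{id_col}': {list(dupes)}")

# example: the check passes once the duplicates have been dropped
df = pd.DataFrame({'reviewid': [1, 2, 2]})
assert_unique_ids(df.drop_duplicates(), 'reviewid')
```

Calling such a check after every major transformation step makes silent duplication problems, like the ones above, surface immediately instead of in a later analysis.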
Coming Up Next
In the next post, we will analyze data on my walking behavior from two different sources: the Accupedo app on my phone, and the Fitbit I wear on my wrist. We will use visualization and basic statistical techniques (in R!) to examine whether these two devices give similar measurements of the number of steps I take each day.