26.4.12

Data exploration tutorial with google refine

Recently, Hugh Stimson published a great article: Data Mining My Old Radio Playlists. His post mix tutorials on php scripting, data cleaning with google refine and data analysis with PostgreSQL.

This answer post demonstrate that data analysis is fully doable in google refine using really basic function (I'll be using GREL function only once for the long tail analysis). I guess also this post is a good illustration of my previous post on data exploration using google refine.

Count number of play by episode and title 

To determine the unique number of different episode, in google refine, facet the episodename column,
facet > text facet. 


For this one I used two different facet on the title column. The first one is facet by blank and we have 1387 non blank and 1 blank title. In google refine:

facet > customized facet > facet by blank
(at the bottom of the drop down list)

The second one is the same as for the episodename describe previously.


You got it? So same old trick for the number of artist and album.


Most played track, artist and album
Select the title column and text facet. Down at the bottom of the list of choices select facet by choice count. Set your limit >3. At the top you can sort result by count instead of name.
Total 10 differents titles. Click on a choice to display all matching records.


Same trick apply to find the most played artist and album.

The Long Tail
The tricky one in google refine. For sure we won't produce graph with google refine, a visualization tool like Tableau Software or Fusion Table will be more helpful there, but let's give it a try:
  1. Add a new column where you add the count of play per track. For this see the facetCount tutorial,
  2. Remove duplicate from the tracks column as describe here. In order to not lost the title with a blank name and put something in it (like //no title//) before removing duplicate,
  3. Facet the title count column by text.

To go further
Add facet together to drill down in the data. In few click I found that:
  • Kleptones - Live'r than you'll ever be have been played a lot during Utterly Utterly Helpful radio 5 of the 12 played, with a track even played twiced during this show!
  • On the other hand Rockity Roll by Mike Doughty enjoye a large number of played in various shows.
(click on the image to enlarge)


On my version of the data, the playtime was not in the right format, but the time facet could be fun to play with to see what have been played when. 

I've tried to reconcile with freebase to add information in your dataset but freebase only found three direct matches from this data set ...