Today I came across another one of those little gems that sometimes come out of the incredible code foundry which is Google. The project is called Google Refine. Admittedly this post looks rather boring and dull, but there is plenty to get excited about! So read on…

Despite the many critics happy to jump on any of the big players for any reason, I fundamentally admire Google’s work philosophy. The fact that they embrace creativity and personal interests in a somewhat anarchic way provides huge potential for interesting projects to develop and take shape independently of the general roadmap of official applications.

In the past few years I have been working with a number of large datasets extracted from a variety of databases or merged from different sources. The one thing they have always had in common is inconsistency. Some inconsistencies are caused by people, some by the systems. For example, people make input mistakes, moving data from one system to another creates spurious errors, and sometimes careless programming leads to odd computations. The result is always a messy and lengthy process of cleaning the data or attempting to augment it from external sources.

There are many ways of approaching the problem; the simplest is to open the data in a spreadsheet and use grouping, sorting and formatting to bring out unusual or odd patterns.
Applying data exploration techniques is the next step: plotting distribution graphs and building summary tables is useful for spotting things, but a proper statistical package often makes certain summaries easier to produce. (That’s a second application, so the data might need to be moved again, as spreadsheets tend to work well only with relatively small datasets!)
Spreadsheets are limited in the number of data points they can handle: for example, MS Excel was limited to about 65,000 rows until the Office 2007 version, which is nowhere near enough for large datasets… Statistical packages are expensive, especially if you are not too keen on tinkering with code in packages like R. However, their user interfaces limit what you can actually do, and even popular choices like SPSS (or PASW in its latest incarnation) and STATA still rely on syntax (i.e. manual coding) for more complex procedures.
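
To make the grouping idea a little more concrete, here is a minimal sketch in Python of what "group and look for odd patterns" can mean in practice. It is only an illustration under my own assumptions: the function names (normalise, group_variants) and the sample column of city names are made up for the example, and this is not how Google Refine or any particular package does it. The idea is to collapse case and whitespace into a common key, then list every key that hides more than one spelling.

```python
from collections import defaultdict


def normalise(value):
    """Lower-case the value and collapse runs of whitespace into single spaces."""
    return " ".join(value.lower().split())


def group_variants(values):
    """Group raw values by their normalised key.

    Returns only the keys that have more than one distinct raw spelling,
    since those are the likely inconsistencies.
    """
    groups = defaultdict(set)
    for value in values:
        groups[normalise(value)].add(value)
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }


# Made-up sample column, purely for illustration.
cities = ["London", "london ", "LONDON", "Paris", "New  York", "new york", "Paris"]

for key, variants in group_variants(cities).items():
    print(key, "->", variants)
```

A pass like this only catches differences in case and spacing; misspellings or abbreviations would need a looser key, which is exactly where the manual spreadsheet approach starts to become painful.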

It is no surprise that I was very intrigued when I came across the Google Refine project. As well as being intuitive to use, the package (Java-based, but running on the desktop) allows powerful transformations with very simple operations. The video screencasts are an excellent showcase of its potential, so I simply had to try it!

In the next few weeks I will play with it and report back; in the meantime, it is a project to watch.
