Scrubbing Data with Hyperdeck

Data scrubbing is, as they say, an open problem. I’ve seen tons of frameworks and tools touted in various places over the years, but as far as I know, none of them have ever broken out to capture the market. Arguably this is because data scrubbing is a problem so broad that it requires a solution equally broad – in my view, the solution is programming itself. The philosophy behind Hyperdeck is that the web platform – javascript, HTML, all the capabilities of the browser taken as whole – constitutes a tool. The web platform is a tool flexible enough to attack the slippery problem of data scrubbing. It only needs a little assist in being more usable, which is what Hyperdeck tries to provide.

I used Hyperdeck to help scrub data collected for a chart of all historical state governors:

https://www.hyperdeck.io/f/11/examples/scrubber-wikitable

My raw data sources were the Wikipedia pages that listed past governors of each state, e.g. https://en.wikipedia.org/wiki/List_of_Governors_of_California. Wikipedia displays this data in a table, and as far as I know, the raw data is not available in Wikidata. The table unfortunately is configured for visual consumption, not for machine reading. Merged rows and columns, multiple data fields within a single table cell, exceptions and anomalies of all sorts. It is very difficult to write a javascript function that parses a single wikitable, let alone a function that correctly parses all 50 wikitables in our dataset. And before we even get to parsing a wikitable, we have to find the correct wikitable to parse, as pages frequently have more than one. Sometimes the list of governors is all in one table, sometimes it’s split into two or more, with different tables for governors under different state constitutions.

What all of this mess means is that fully-automated scraping and scrubbing is a fantasy. Scrubbing will have to be done with a human in the loop, spotting anomalies and making judgement calls. Fortunately, Hyperdeck is designed to keep a human in the loop.

We start by downloading the Wikipedia HTML files, using whatever tool you can. Then we take an HTML file and upload it to the first text component in the workbook, named “rawHTML”. For this tutorial, the raw data is already in place, but you would do this by clicking the Upload button on the component controls and selecting your file.

scrubbing-1.png

The first processing step is to strip scripts, links, and image sources, so that no external assets get loaded when we add the HTML to our page. This isn’t strictly necessary, but scripts can do all sorts of weird things and it just takes time to load everything. To do this, we click the Run button on the “preprocessHTML” js component, which grabs the text from “rawHTML”, runs some regexes, and stores the output in the “preprocessedHTML” txt component.

Now we come to the meat of the scrubbing. Hyperdeck provides a snips component that stores single-line snippets of javascript code that can be executed individually. I’ve set up these snippet commands so that if you run each one in order from top to bottom, you strip down the table to a simple form, step by step.

scrubbing-2.png

If you’re familiar with jQuery, most of these snippets are pretty straightforward. Most of them simply select elements and remove them. But a few notes: the first snippet selects #output (the hardcoded div where Hyperdeck puts workbook output) and names it x for brevity. The second snippet dumps the preprocessed HTML to #output, and the third reduces it to the first table. After that we remove rowspans and colspans, because these are visual elements that screw up the table structure from a scrubbing standpoint. We get rid of styles and simplify borders so that we can clearly see how many cells are in each row. And then we progressively remove unneeded elements until we’re left with a bare-bones table containing only the data we’re interested in.

scrubbing.gif

In this case, we were able to scrub the full table using judiciously chosen snippets, but often you will need to select elements manually. We can do this by using the :hover CSS pseudo-class. Click in the input box containing “x.find(‘td:hover’).remove()” and then hit tab to move the focus to the Run button. Then you can mouse over individual td’s so that the :hover pseudoclass applies, and hit Enter to run the snippet and delete them.

scrubbing-3.png

Hyperdeck also provides a repl component so that you can run one-off commands right in the workbook, without having to open a separate console.

scrubbing-4.png

The scrubbed table is pretty clear and we could just manually copy it out to apply finishing touches in a text editor, but for the sake of completeness, I wrote a little code to transform the table into our desired final form. This function trims the text, splits the dates by that m-dash, creates objects with the desired field names, and stores the objects in a data component.

scrubbing-5.png

The data component is currently displayed as comma-separated values (csv), but by changing the “display” selectbox in the controls, you can see the data in tsv, json, or yaml as well.

scrubbing-6.png

One table down! Forty-nine to go :-/. Scrubbing is tedious work, and there’s no getting around that. But we’ve created a pipeline that hopefully needs minimal tweaks to get it to work on subsequent raw input. The jQuery might need to be executed in slightly different orders (this is why we have a Snippets component rather than just dumping all the commands into a single function). You might need to manually select some components to delete. The final processing may need some adjustment. And sometimes you might screw up and need to start over from the first snippet. What this Hyperdeck workbook tries to do is provide a flexible scaffolding that can be built on or tuned as needed.