I used Hyperdeck to help scrub data collected for a chart of all historical state governors.
What all of this mess means is that fully automated scraping and scrubbing is a fantasy. Scrubbing has to be done with a human in the loop, spotting anomalies and making judgement calls. Fortunately, Hyperdeck is designed to keep a human in the loop.
We start by downloading the Wikipedia HTML files, using whatever tool you like. Then we take an HTML file and upload it to the first text component in the workbook, named “rawHTML”. For this tutorial, the raw data is already in place; otherwise you would click the Upload button on the component controls and select your file.
The first processing step is to strip scripts, links, and image sources, so that no external assets get loaded when we add the HTML to our page. This isn’t strictly necessary, but scripts can do all sorts of weird things, and loading every asset just takes time. To do this, we click the Run button on the “preprocessHTML” js component, which grabs the text from “rawHTML”, runs some regexes, and stores the output in the “preprocessedHTML” txt component.
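Here is a minimal sketch of what that regex pass might look like. It is not the actual code in the “preprocessHTML” component: the function name preprocessHtml is made up, and I'm assuming “links” means link tags (stylesheets, icons) rather than anchors. Regexes are a blunt tool for HTML, so treat this as a first pass that a human checks, not a sanitizer.

```javascript
// Hypothetical helper, not Hyperdeck's actual "preprocessHTML" code.
// Strips the things that would trigger external loads once the HTML is on the page.
function preprocessHtml(rawHtml) {
  return rawHtml
    // Drop <script> blocks entirely, including their contents.
    .replace(/<script\b[\s\S]*?<\/script\s*>/gi, '')
    // Drop <link> tags (assumption: "links" means stylesheet/icon links).
    .replace(/<link\b[^>]*>/gi, '')
    // Remove src and srcset attributes so images are never fetched.
    .replace(/\s(?:src|srcset)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi, '');
}
```

Removing only the src attribute, rather than the whole img tag, leaves the cell structure of the table untouched, which matters for the later steps.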
If you’re familiar with jQuery, most of these snippets are pretty straightforward. Most of them simply select elements and remove them. But a few notes: the first snippet selects #output (the hardcoded div where Hyperdeck puts workbook output) and names it x for brevity. The second snippet dumps the preprocessed HTML to #output, and the third reduces it to the first table. After that we remove rowspans and colspans, because these are layout attributes that screw up the table structure from a scrubbing standpoint. We get rid of styles and simplify borders so that we can clearly see how many cells are in each row. And then we progressively remove unneeded elements until we’re left with a bare-bones table containing only the data we’re interested in.
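To make that sequence concrete, here is a rough sketch of the kind of jQuery involved. The selectors are illustrative guesses rather than the workbook's actual snippets, and the function name scrubFirstTable and its html argument are made up for the example. It assumes jQuery is passed in as $ and that the page has a #output div.

```javascript
// Sketch only: the selectors are illustrative, not the workbook's real snippets.
// `$` is jQuery and `html` is the preprocessed HTML string.
function scrubFirstTable($, html) {
  const x = $('#output');                  // name the output div for brevity
  x.html(html);                            // dump the preprocessed HTML into it
  const table = x.find('table').first().detach();
  x.empty().append(table);                 // reduce the page to its first table
  // Rowspans and colspans hide the real cell count, so strip them.
  x.find('[rowspan], [colspan]').removeAttr('rowspan colspan');
  x.find('[style]').removeAttr('style');   // drop inline styles
  x.find('td, th').css('border', '1px solid gray'); // one plain border per cell
  x.find('sup, img').remove();             // hypothetical clutter: footnotes, images
  return x;
}
```

The steps are bundled into one function only to keep the sketch short. In the workbook each line lives as its own snippet, so you can run them one at a time, look at the table in between, and reorder them for the next page.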
In this case, we were able to scrub the full table using judiciously chosen snippets, but often you will need to select elements manually. We can do this by using the :hover CSS pseudo-class. Click in the input box containing “x.find('td:hover').remove()” and then hit Tab to move the focus to the Run button. Then you can mouse over an individual td so that the :hover pseudo-class applies to it, and hit Enter to run the snippet and delete that cell.
Hyperdeck also provides a repl component so that you can run one-off commands right in the workbook, without having to open a separate console.
The scrubbed table is pretty clean, and we could just manually copy it out to apply finishing touches in a text editor, but for the sake of completeness, I wrote a little code to transform the table into our desired final form. This function trims the text, splits each date range on the em dash, creates objects with the desired field names, and stores the objects in a data component.
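A minimal sketch of that transformation is below. It is not the function from the workbook: the name rowsToRecords, the two-column layout (name, then term), and the field names name, start, and end are all assumptions for illustration. It takes rows that have already been pulled out of the table as arrays of cell text, and it splits on either an em dash or an en dash, since date ranges are written with both.

```javascript
// Hypothetical transform: assumes each row is [name, term], where term looks
// like "March 4, 1801 — March 4, 1805". Field names are illustrative.
function rowsToRecords(rows) {
  return rows.map(([name, term]) => {
    // Trim first, then split on an em dash (U+2014) or en dash (U+2013),
    // swallowing any whitespace around the dash.
    const [start, end = ''] = term.trim().split(/\s*[\u2013\u2014]\s*/);
    return { name: name.trim(), start, end };
  });
}

// With jQuery, the rows could be collected from the scrubbed table like this:
//   const rows = x.find('tr').get().map(
//     (tr) => $(tr).find('td').get().map((td) => $(td).text()));
```

A term with no end date, such as a sitting governor, comes out with an empty end field, which is the sort of anomaly you would then check by eye.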
The data component is currently displayed as comma-separated values (csv), but by changing the “display” selectbox in the controls, you can see the data in tsv, json, or yaml as well.
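These formats are all views of the same list of objects. As a rough illustration (this is not Hyperdeck's serializer, and the name toDelimited is made up), here is how records like ours could be written out as csv or tsv. Dates such as “March 4, 1801” contain commas, so csv has to quote those fields, while tsv can leave them alone.

```javascript
// Hypothetical serializer: turns an array of flat objects into delimited text.
// Column order comes from the keys of the first record.
function toDelimited(records, delimiter = ',') {
  if (records.length === 0) return '';
  const headers = Object.keys(records[0]);
  const quote = (value) => {
    const text = value == null ? '' : String(value);
    // Quote fields containing the delimiter, a double quote, or a line break.
    return /["\n\r]/.test(text) || text.includes(delimiter)
      ? '"' + text.replace(/"/g, '""') + '"'
      : text;
  };
  const lines = [headers.map(quote).join(delimiter)];
  for (const record of records) {
    lines.push(headers.map((header) => quote(record[header])).join(delimiter));
  }
  return lines.join('\n');
}

// JSON needs no helper: JSON.stringify(records, null, 2)
```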
One table down! Forty-nine to go :-/. Scrubbing is tedious work, and there’s no getting around that. But we’ve created a pipeline that hopefully needs only minimal tweaks to work on subsequent raw input. The jQuery might need to be executed in a slightly different order (this is why we have a Snippets component rather than just dumping all the commands into a single function). You might need to manually select some elements to delete. The final processing may need some adjustment. And sometimes you might screw up and need to start over from the first snippet. What this Hyperdeck workbook tries to do is provide a flexible scaffolding that can be built on or tuned as needed.