Scrubbing Data with Hyperdeck

Data scrubbing is, as they say, an open problem. I’ve seen tons of frameworks and tools touted in various places over the years, but as far as I know, none of them have ever broken out to capture the market. Arguably this is because data scrubbing is a problem so broad that it requires a solution equally broad – in my view, the solution is programming itself. The philosophy behind Hyperdeck is that the web platform – javascript, HTML, all the capabilities of the browser taken as whole – constitutes a tool. The web platform is a tool flexible enough to attack the slippery problem of data scrubbing. It only needs a little assist in being more usable, which is what Hyperdeck tries to provide.

I used Hyperdeck to help scrub data collected for a chart of all historical state governors:

https://www.hyperdeck.io/f/11/examples/scrubber-wikitable

My raw data sources were the Wikipedia pages that listed past governors of each state, e.g. https://en.wikipedia.org/wiki/List_of_Governors_of_California. Wikipedia displays this data in a table, and as far as I know, the raw data is not available in Wikidata. The table unfortunately is configured for visual consumption, not for machine reading. Merged rows and columns, multiple data fields within a single table cell, exceptions and anomalies of all sorts. It is very difficult to write a javascript function that parses a single wikitable, let alone a function that correctly parses all 50 wikitables in our dataset. And before we even get to parsing a wikitable, we have to find the correct wikitable to parse, as pages frequently have more than one. Sometimes the list of governors is all in one table, sometimes it’s split into two or more, with different tables for governors under different state constitutions.

What all of this mess means is that fully-automated scraping and scrubbing is a fantasy. Scrubbing will have to be done with a human in the loop, spotting anomalies and making judgement calls. Fortunately, Hyperdeck is designed to keep a human in the loop.

We start by downloading the Wikipedia HTML files, using whatever tool you can. Then we take an HTML file and upload it to the first text component in the workbook, named “rawHTML”. For this tutorial, the raw data is already in place, but you would do this by clicking the Upload button on the component controls and selecting your file.

scrubbing-1.png

The first processing step is to strip scripts, links, and image sources, so that no external assets get loaded when we add the HTML to our page. This isn’t strictly necessary, but scripts can do all sorts of weird things and it just takes time to load everything. To do this, we click the Run button on the “preprocessHTML” js component, which grabs the text from “rawHTML”, runs some regexes, and stores the output in the “preprocessedHTML” txt component.

Now we come to the meat of the scrubbing. Hyperdeck provides a snips component that stores single-line snippets of javascript code that can be executed individually. I’ve set up these snippet commands so that if you run each one in order from top to bottom, you strip down the table to a simple form, step by step.

scrubbing-2.png

If you’re familiar with jQuery, most of these snippets are pretty straightforward. Most of them simply select elements and remove them. But a few notes: the first snippet selects #output (the hardcoded div where Hyperdeck puts workbook output) and names it x for brevity. The second snippet dumps the preprocessed HTML to #output, and the third reduces it to the first table. After that we remove rowspans and colspans, because these are visual elements that screw up the table structure from a scrubbing standpoint. We get rid of styles and simplify borders so that we can clearly see how many cells are in each row. And then we progressively remove unneeded elements until we’re left with a bare-bones table containing only the data we’re interested in.

scrubbing.gif

In this case, we were able to scrub the full table using judiciously chosen snippets, but often you will need to select elements manually. We can do this by using the :hover CSS pseudo-class. Click in the input box containing “x.find(‘td:hover’).remove()” and then hit tab to move the focus to the Run button. Then you can mouse over individual td’s so that the :hover pseudoclass applies, and hit Enter to run the snippet and delete them.

scrubbing-3.png

Hyperdeck also provides a repl component so that you can run one-off commands right in the workbook, without having to open a separate console.

scrubbing-4.png

The scrubbed table is pretty clear and we could just manually copy it out to apply finishing touches in a text editor, but for the sake of completeness, I wrote a little code to transform the table into our desired final form. This function trims the text, splits the dates by that m-dash, creates objects with the desired field names, and stores the objects in a data component.

scrubbing-5.png

The data component is currently displayed as comma-separated values (csv), but by changing the “display” selectbox in the controls, you can see the data in tsv, json, or yaml as well.

scrubbing-6.png

One table down! Forty-nine to go :-/. Scrubbing is tedious work, and there’s no getting around that. But we’ve created a pipeline that hopefully needs minimal tweaks to get it to work on subsequent raw input. The jQuery might need to be executed in slightly different orders (this is why we have a Snippets component rather than just dumping all the commands into a single function). You might need to manually select some components to delete. The final processing may need some adjustment. And sometimes you might screw up and need to start over from the first snippet. What this Hyperdeck workbook tries to do is provide a flexible scaffolding that can be built on or tuned as needed.

Using Google Docs (and more!) with Hyperdeck

In this demo we’re going to check out a few neat options in Hyperdeck and string them together to pull data from Google Sheets and into a document. You might like to have the full working example open alongside this walkthrough.

First let’s start with some terms and basic methods. In Hyperdeck, your document is built from components, which can contain plain text, html, javascript, or even images or sound files. You could organize your project a lot of different ways, but we’re going to make use of two javascript methods available via the global Hyperdeck object to keep things neat and compartmentalized: Hyperdeck.Get(componentName) and Hyperdeck.Set(componentName, contents).

Next we’re going to use an External Libraries component to load a library built for pulling data from Google Sheets,  Tabletop.js– this is available from CDNJS, so we’ll plug that in to our component, making it available to the rest of our workbook. And just for fun, while we’re adding external libraries, let’s pull in Handlebars.js so that we can separate our presentational logic from our data manipulation just a bit. You can experiment with other organizational schemes, other template engines, or other utility libraries like Underscore. Just keep in mind that jQuery is built in, and you may have difficulties if you try to reuse the $ variable.

With those libraries loaded, we’re ready to fetch some data and put it to work. We’ll add a data component to hold the information, and a text component to hold the spreadsheet key (for convenience and reusability’s sake). The data used for this example is held in this Google doc , a copy of this source data. It’s necessary to make a copy because Tabletop.js only works with Google docs that are published. This is important to remember when adapting this example to your own needs. Here’s a screenshot to help you find the relevant options:

Google docs menu

And once the document is published, you’ll need the spreadsheet key. Here’s our example again, in the Publish dialog, with the key highlighted:

Where to find the spreadsheet key

This goes in our component named spreadsheetKey. Then we’ll set up a javascript component, loader, which gets the spreadsheetKey and plugs it into Tabletop.js to load data from the spreadsheet into our data component, data1.

Now that we have a local copy of the data, we can do whatever we want with it! For purposes of this demo, we’ve created another javascript component that uses Handlebars.js to render a rudimentary list of hyperlinks to the representatives’ websites, with some data made available for filtering the list and an example filter button.

Another application might be to load a visualization library (e.g., d3.js) and build some charts, or perhaps an interactive map. In fact, you might want to make a copy of the demo workbook and do exactly that! And if you build something neat off of this example, we’d love to hear about it.

Why Hyperdeck?

The question motivating the development of Hyperdeck is: why isn’t the web the default medium of technical communication? Why are professional documents still done in Word, Powerpoint, or LaTeX? Certainly those programs have a strong legacy position, and LaTeX at least has a certain elegance to it. However, those technologies were developed in the 70s and 80s, when the world was very different. I think we can do better.

The idea behind Hyperdeck is that the browser platform is the correct replacement for the office suite. Note what I’m saying: not a particular web app, but the web platform as a whole. The browser contains all you need in a document platform: a built-in programming language with a vast ecosystem, several different display systems, sandboxing, and linking.

However, there are a number of barriers that prevent people from using the web for documents. One is development and hosting – sure, static hosting is pretty easy if all you need to do is share documents, but if you want people to be able to edit and save, you have to build a full-fledged application. Making editable documents needs to be as easy as opening Excel, if large numbers of people are to use the web as a document platform. Second is that the web isn’t too interested in data. If you want to deal with data in the browser, you have to do everything yourself, pulling in libraries and writing the code to herd the data as needed. Third, HTML, as great and revolutionary as it is, committed the cardinal sin of software: it broke backwards compatibility, specifically, compatibility with paper. People in 1991 needed to print documents, and people in 2016 still need to print documents. And the need for precise control over the design and layout of a document isn’t going away – no matter what happens to paper, PDF will be with us for a long time.

Hyperdeck provides solutions to the first two problems and is working on the third. We enable sharing editable documents by hosting all the documents on our platform. Obviously there are issues with centralization, but it solves the immediate problem pretty well. We make data a first-class citizen of web documents, with built-in facilites for data munging. We do not yet support PDF export, but we’re working on it.

Of course you might say, web documents are a solved problem – Google already has Google Docs – how is this different? The problem with Google Docs (and, more plainly, with Office 365) is that it’s basically Microsoft Office ported to the web – the same interface, but built in HTML/CSS/JS instead. I consider this a cargo cult solution. The web platform is very different than the desktop world that the office suite was designed for, way back in the 80s. It stands to reason that a document suite built on the new platform should have a very different design.

There are a couple of specific deficiencies with the existing design of the office suite that I’d like to highlight. First of all, why are spreadsheets, word processors, and presentation editors different apps? Mostly just path dependence – they started as different apps back in the 70s and 80s, and have stayed that way. Certainly Microsoft has made valiant attempts at getting their office apps to play nicely with each other, but ultimately they’re fighting against fundamental elements of the design. The same is true for scriptability. Visual Basic for Applications was not part of the original Excel – it was added as a sort of separate module later on. It works! The world still runs on Excel/VBA (/.NET, whose separateness is even more pronounced). But VBA is a language of the 80s that is showing its age, and again, getting Excel and VBA to play well together requires fighting the inherent separateness of the apps. Finally there’s the add-on ecosystem – Microsoft Office and Google Docs both have add-on systems, but ultimately extensibility is a problem that is most elegantly solved by the web – script tags that pull in Javascript libraries from any URL. There is a javascript library available on the web for almost anything you can imagine, installable with a simple link.

Hyperdeck seeks to unify the office triad on a platform that has scriptability and extensibility built-in. Does it require reinventing the wheel? A little bit. But I see this as reinventing the wheel for paved roads instead of dirt roads. The world has changed. Your office software should change with it.

Intrigued? Try our tutorial here: https://www.hyperdeck.io/f/11/tutorial.