A tabula rasa from PDFs

One of the most frequently asked questions on the NICAR listserv, an email group for journos interested in data and tools, is how to get data out of a darned PDF. (And, way back in 2002, it was one of the first questions asked.)

That’s because data is all too often provided or published in that nefarious document format, even as reporters are getting more and more familiar with using data themselves.

My latest favorite tool for the job is Tabula, an open-source app made by and for journalists.

While there are a ton of tools out there for getting data out of PDFs (and I’ve compiled a long list here), most of them simply convert an entire PDF into an Excel sheet. Tabula lets you select a data table as if you were taking a screenshot of it, then – click! – you export the data in a variety of formats you can just pop into Excel.

I highlighted Tabula during my talk at Mozfest in London earlier this month, and explained how to use it in this tutorial, so swing over there for instructions. (Hint: there are like three steps.)