Data Journalism (CAR)


PDF scraping

Although some progress has been made in providing information in open formats, too many documents in the UK and elsewhere are still published exclusively as PDFs, certainly not an open format in the data world. It is astonishing how many government agencies still refuse to publish information in a structured form that can be checked and analysed. One official noted that his office had ‘previous experience of mischievous manipulation and misrepresentation of Excel and Word documents,’ and used this as an excuse to provide data only as a scanned PDF. (To reiterate the point made earlier in this book: misrepresented data and sloppy reporting will detract from the cause of open data and transparency. Be fair and be accurate.)

There are a large number of tools available to unlock and scrape data from PDF documents. It is possible to write code to extract data from PDFs using, for example, Ruby or Perl (two programming languages), but there are also a number of free or inexpensive yet powerful software tools that accomplish this task. Essentially, they do the reverse of what a PDF writer tool does. Sometimes, scanned documents first have to be run through an optical character recognition (OCR) tool, which is often built into PDF extraction tools. This makes cracking open PDFs more difficult and time-consuming, but will not, and should not, deter a determined journalist or citizen.

Some tried and tested tools are ABBYY FineReader and UnPdf. To extract data from a PDF using UnPdf, select ‘open’ and navigate to the relevant file. UnPdf recognises the individual cells and places a blue border around them. It’s worth scrolling through the dataset to make sure that the tool has correctly identified the cells. Sometimes, malformed PDFs make it difficult for the software to recognise cells and a manual clean-up is needed.
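For larger tables, scrolling through every row by eye is slow, and part of the check can be automated once the extracted table has been saved as a CSV file. The following is a minimal sketch in Python, not a feature of any of the tools named above. It assumes that a correctly extracted table has the same number of cells in every row, and it flags the rows that do not. The function name and the sample data are hypothetical and used for illustration only.

```python
import csv
import io
from collections import Counter


def find_suspect_rows(csv_text):
    """Return (row_number, cell_count) for rows whose cell count
    differs from the most common cell count in the table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # Skip completely empty rows, which extraction tools often leave behind.
    counts = Counter(len(row) for row in rows if row)
    if not counts:
        return []
    expected = counts.most_common(1)[0][0]
    return [
        (number, len(row))
        for number, row in enumerate(rows, start=1)
        if row and len(row) != expected
    ]


# Hypothetical extraction output: the third row has lost a cell boundary.
sample = """Department,Year,Spend
Department A,2011,1200
Department B 2011,900
Department C,2011,450
"""

for row_number, cell_count in find_suspect_rows(sample):
    print(f"Row {row_number}: {cell_count} cells, check against the original PDF")
```

A check like this only narrows down where to look. Flagged rows still need to be compared with the original PDF and corrected by hand, and a row with the right number of cells can still contain misread values, particularly after optical character recognition.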

