Data Journalism


Section 5. Alive and Well and on New Platforms



Enter: the Data Journalist

Sean McGrath argues that there are essentially three steps involved in data journalism: the gathering of data, the interrogation of data and the visualisation of data. And, as he stresses, each process requires a different skill as well as different tools.

In 2010, the CEO of Google, Eric Schmidt, claimed: “There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every two days” (2010 Atmosphere convention). To place this claim into perspective, let us consider the following. This page consists of approximately 2 kilobytes of data. There are 1,024 kilobytes in a megabyte of data. The complete works of Shakespeare would equate to roughly 10 megabytes. There are then 1,024 megabytes in a gigabyte. An average-sized library, filled with academic journals, would equal roughly 100 gigabytes of data. There are then 1,024 gigabytes in a terabyte. The print collections of the US Library of Congress would equal approximately 10 terabytes. 1,024 terabytes equals 1 petabyte. All the printed material in the world would equate to roughly 200 petabytes. Finally, 1,024 petabytes is equal to 1 exabyte. Recent research suggests that we are now producing 6.8 exabytes of information every two days. Given that print, film, magnetic and optical storage media produced about 5 exabytes of new information in 2002 alone, it would seem that Schmidt might have been slightly sensationalistic with his statistic. However, his overriding point remains valid.
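To see how quickly these units run away from human intuition, the ladder above can be reproduced in a few lines of code. The sketch below is purely illustrative: it simply restates the figures quoted in this section in raw bytes (Python is assumed here; the text itself prescribes no particular language).

    # Each unit of storage is 1,024 times larger than the one below it.
    KB = 1024
    MB = 1024 * KB
    GB = 1024 * MB
    TB = 1024 * GB
    PB = 1024 * TB
    EB = 1024 * PB

    # The figures quoted above, expressed in raw bytes.
    figures = [
        ("This page",                     2 * KB),
        ("Complete works of Shakespeare", 10 * MB),
        ("Average academic library",      100 * GB),
        ("Library of Congress in print",  10 * TB),
        ("All printed material",          200 * PB),
        ("New information in 2002",       5 * EB),
        ("Produced every two days today", 6.8 * EB),
    ]

    for label, size_in_bytes in figures:
        print(f"{label}: {size_in_bytes:,.0f} bytes")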


In the information age, both governments and big businesses are being continually pressured into releasing an increasing amount of data. While this suits the journalist, it may also suit those releasing the data – for where better to hide secrets than in plain view? Amongst the centillions of “0”s and “1”s that make up the digital world we now inhabit, there are infinite journalistic stories waiting to be discovered. All we have to do is find them.

Enter: the data journalist

Data journalism is really an amalgamation of roles, each of which could take a lifetime to master. The perfect data journalist would be a skilled researcher, statistician, programmer and, of course, journalist. In reality, it is unlikely that an aspiring data journalist would possess complete knowledge in each of these areas but, fortunately, it is not necessary. The phrase “data journalism” is still something of a taboo. The juxtaposition of the words “data” and “journalism” in the same sentence inevitably conjures up images of computer programmers huddled over a MacBook, frantically writing code. While this certainly is one element of data journalism, it does not tell us everything. As Martin Moore, Director of the Media Standards Trust, states: “Data journalism is shorthand for being able to cope with information abundance. It is not just about numbers. Neither is it about being a mathmo or a techie. It means being able to combine three things: individual human intelligence, networked intelligence, and computing power.” Let us consider each of Moore’s criteria individually.

Individual human intelligence

To turn raw data into a story, we still need to implement fundamental journalistic techniques. For technology to find an answer, a question must still be asked. There is currently no technology in the world that fully understands the prerequisites of a good story. In other words, the data journalist must rely on the exact same skill set as the journalist. Instinct, experience and training are all fundamental.
Networked intelligence

In October 1997, Distributed Computing Technologies (DCT) successfully cracked a 56-bit key as part of the RSA Secret-Key Challenge. The challenge, set up by RSA Laboratories, was designed to demonstrate the strengths and weaknesses of various encryption methods. However, what DCT revealed in the process was the strength of “grid computing”. The concept was that, instead of using one “supercomputer” to carry out intensive calculations, the Internet enabled the calculations to be spread out among thousands of computers. Operations that would previously have been impossible could now be divided up between anyone with a computer. This technology has expanded to perform tasks such as the Search for Extra-terrestrial Intelligence (SETI) and research into cures for cancers, with millions of users lending their spare computing power to carry out potentially world-altering research. The same Peer-to-Peer (P2P) concept has also seen illegal file, music and video sharing bring their respective industries to their knees. More recently, the notion of network distribution has retreated away from the computers and focused instead on the people operating them. Crowdsourcing could be seen as the twin sister of data journalism, utilising the Internet to bring a network of people together to contribute and investigate.

Computing power

If an abundance of data is the core reason behind the need for data journalism, computing power is the reason that data journalism can now be practised. In 1965, the co-founder of Intel, Gordon E. Moore, wrote a paper observing a trend and predicting its continuation into the future. The prediction, now known as Moore’s Law, claimed that the number of transistors that can be placed on an integrated circuit doubles approximately every two years. Moore’s Law has proven to be remarkably accurate, so much so that the computer industry uses the law to forecast future trends. Essentially, Moore’s Law means that the technology powering a personal computer doubles in power every two years. While it does not necessarily equate to a doubling in computing speed, it goes some way to demonstrating the rapid and exponential increase in computing technology. In 1999, an Intel flagship commercial Central Processing Unit (CPU) was capable of performing 2,054 million instructions per second (MIPS). The CPU is often referred to as the “brain” of a computer; Intel’s latest is capable of performing approximately 159,000 MIPS. What was once a “supercomputer”, reserved for advanced medical research or taking on the grandmasters of chess, is now simply a “computer”, available on the high street for a few hundred pounds. With a little knowledge and a few choice pieces of software, the household computer can now mine, interrogate, analyse and visualise even the most complex data.
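Moore’s Law lends itself to a quick back-of-the-envelope check. The sketch below simply compounds the 1999 figure quoted above at one doubling every two years and compares the result with the roughly 159,000 MIPS cited for a current CPU. It is an illustration of the doubling rule, not a benchmark; Python is assumed, and the twelve-year gap is an assumption about the time of writing.

    # Moore's Law as a rule of thumb: performance doubles roughly every two years.
    mips_1999 = 2054        # Intel flagship CPU in 1999, as quoted in the text
    years_elapsed = 12      # 1999 to roughly the time of writing (an assumption)
    doublings = years_elapsed / 2

    projected = mips_1999 * 2 ** doublings
    print(f"Projected by the doubling rule: {projected:,.0f} MIPS")   # about 131,000 MIPS
    print("Figure quoted for a current CPU: ~159,000 MIPS")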
The process: Learning to work with data

If one were to break down the processes involved in data journalism, there would essentially be three steps: the gathering of data, the interrogation of data and the visualisation of data. Each process requires a different skill as well as different tools.

Data gathering

Of course, it is impossible to work with data until there is data to work with. Renowned data journalist Paul Bradshaw suggests that, instead of becoming overwhelmed by the amount of information available, the aspiring journalist should start with a question and then search for the data that will answer it. As previously mentioned, the sources for reliable and newsworthy data are ever increasing. The following examples are simply the tip of an iceberg. Creative thinking and experience will reveal an unlimited number of data sources.

http://www.data.gov.uk
The government’s official data hub is a fantastic source of primary information, ranging from crime figures to traffic statistics. All data can be used for both private and commercial purposes, although it should be acknowledged by including the attribution statement specified by data.gov.uk, which is “name of data provider” data © Crown copyright and database right.

http://www.data.gov
Although the US counterpart is a good site to become familiar with, the data available is much less forthcoming and, therefore, will probably need to be used in conjunction with other sources.
http://www.factual.com
An open data platform and community with a wealth of data sets covering a wide range of subjects.

http://www.socrata.com
Socrata is geared towards non-technical Internet users. Although quite US-centric, Socrata is an excellent source of government, healthcare, energy, education and environment data.

wikileaks.org
Wikileaks needs no introduction, but is proving to be one of the most valuable sources of data available.

The Freedom of Information Act
Although sites such as data.gov.uk aim to take emphasis away from the FoI Act, freedom of information remains one of the most pertinent tools in data retrieval. Heather Brooke’s Your Right to Know (Pluto Press, 2004) is an essential read in order to maximise the effectiveness of information requests. http://www.whatdotheyknow.com is also a useful tool: not only does it give useful advice, but it also acts as a gathering point for previous requests.

Delving Deeper: Hacks meet hackers

For those who are not content with using the tools freely available for data gathering, there is another option. At the very apex of the data-driven revolution are those who are combining a level of programming knowledge with journalism. Screen scrapers are bespoke pieces of code that are able to trawl the Internet looking for data relevant to the programmer’s specification. For example, if you wanted to crawl every local police authority’s website in the UK, looking for press releases relevant to police corruption, scraping would make this feasible. Not only could it be done, it could be done several times a day, for a year or for five years. Scraping turns the World Wide Web into the tool, and the only restraint is the creative foresight of the journalist.
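As a rough illustration of the idea, the sketch below shows the skeleton of such a scraper in Python, one of the languages discussed shortly. It assumes the third-party requests and BeautifulSoup libraries, and the press-release pages and keyword are hypothetical placeholders rather than real police authority addresses.

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical press-release index pages to check; a real scraper would
    # hold one entry per police authority website.
    PAGES = [
        "http://www.example-police.uk/press-releases",
        "http://www.another-example-police.uk/news",
    ]

    KEYWORD = "corruption"  # the term the journalist is interested in

    def find_matches(url, keyword):
        """Fetch a page and return the links whose text mentions the keyword."""
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        return [
            (link.get_text(strip=True), link["href"])
            for link in soup.find_all("a", href=True)
            if keyword.lower() in link.get_text().lower()
        ]

    if __name__ == "__main__":
        for page in PAGES:
            for title, href in find_matches(page, KEYWORD):
                print(f"{page}: {title} -> {href}")

    # Run from a scheduler (cron, for example) and this check repeats itself
    # several times a day, for as long as the story requires.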

Again, the word “programming” often summons feelings of anxiety for the uninitiated and, in all fairness, even basic programming is not something that can be learnt in a weekend. However, many journalists reading this will be fluent in shorthand and, if interested, should view learning basic programming as on a par in terms of complexity. Python, Ruby, PHP and Javascript are all languages that one could use as a basis for writing a scraper. For those who feel that learning a programming language is simply too daunting a mountain to climb, there is a simpler option that can yield equally impressive results. There is a growing community of programmers who are working in conjunction with journalists, each of them able to bring something different to the table. If you have a good idea for a scraper, you are almost certain to be able to partner with someone able to write it for you. Scraperwiki.com and hackhackers.com are both excellent places to meet communities where such collaborative work is possible.

Interrogation of data

Once the data is collected, it will need to be organised, filtered and cleaned up. While data is readily available, it is an unfortunate fact that it is usually not presented in a way that makes it easy to work with. As Paul Bradshaw advises: “Look out for different names for the same thing, spelling and punctuation errors, poorly formatted fields (e.g. dates that are formatted as text), incorrectly entered data and information that is missing entirely.” There is no one-size-fits-all, automated approach to data interrogation. Many of the inbuilt functions of Microsoft Excel, such as “search and replace” and macro recording, will help. There are also tools such as Freebase Gridworks and Google Refine that can help clean up your data.
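The sort of clean-up Bradshaw describes can also be scripted. The fragment below is a minimal, hypothetical sketch in Python: the file name, column names and the variant spellings it normalises are all invented for illustration, and a tool such as Google Refine would achieve the same result interactively.

    import csv
    from datetime import datetime

    # Hypothetical mapping of variant spellings to a single canonical name.
    CANONICAL = {
        "Lincs Police": "Lincolnshire Police",
        "Lincolnshire Constabulary": "Lincolnshire Police",
    }

    def clean_row(row):
        """Standardise names and turn text dates into a consistent format."""
        name = row["authority"].strip()
        row["authority"] = CANONICAL.get(name, name)
        # Dates often arrive as text such as "03/02/2011"; store them as ISO dates.
        row["date"] = datetime.strptime(row["date"], "%d/%m/%Y").date().isoformat()
        return row

    with open("press_releases.csv", newline="") as src, \
         open("press_releases_clean.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            writer.writerow(clean_row(row))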
Visualise the fruits of your labour

Visualisation is at the heart of all data-driven journalism. The aim of the journalist is to take data which is vast and incomprehensible and turn it into a story that is simple to understand. It is important to recognise at this stage that data journalism is not attempting to reinvent the wheel. Anyone encountering a pie chart has encountered data visualisation. The same can be said for anyone who has watched a weather report on the news or the half-time statistics during a televised football game.
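Even a few lines of code can produce the sort of basic chart described here. The sketch below uses Python’s matplotlib library purely as an illustration (an assumption on my part, not one of the tools discussed below), and the figures plotted are invented.

    import matplotlib.pyplot as plt

    # Invented figures: how a hypothetical council spends each pound.
    labels = ["Education", "Social care", "Transport", "Other"]
    spending = [38, 27, 15, 20]          # percentages

    plt.pie(spending, labels=labels, autopct="%1.0f%%")
    plt.title("Where the council's money goes (illustrative data)")
    plt.savefig("spending_pie.png")       # write the chart to an image file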


There are, however, a growing number of tools that are specifically designed to aid you in your quest to understand your newfound data. Each tool carries different copyright restrictions and, if the visualisation is to be published, this should be considered.

http://www-958.ibm.com
This rather un-catchy web address is the home of IBM’s Many Eyes, a free data visualisation suite. While some of the visualisations are rather basic, it offers a variety of original visualisations. It is worth noting that, in order to see the visualisation, the user must agree to allow it to be published on the site. This means that it may not be suitable for all types of journalistic work.

www.google.com/fusiontables
Google’s Fusion Tables is still in a beta phase but shows signs of promise. Not only are you able to integrate it with Google Documents, but you can also map your data directly on to satellite imagery of the earth.

Conclusion: The era of the network

Although data journalism is being embraced by the journalism community at large, there seems to be the inherent antipathy that comes with any major shift in technique and theory. Conventionalists continue to argue that sitting in front of a computer is not “real journalism”. From the opposite side, there is much talk of data journalism being “the future”, as if some form of digital enlightenment will take place and those who do not have a complete command of the world of data will be left by the wayside. As is often the case with opposing views, neither is entirely correct. One thing that can be said with certainty, though, is that we are resolutely in the era of the network. Google is searched in excess of 300 million times per day, 35 hours of footage are uploaded to YouTube every minute and 110 million tweets are sent per day. There is still an argument to be had for finding a balance between conventional journalistic technique and this new breed of data journalism. However, the information super-highway is rapidly becoming the most powerful tool in the journalistic arsenal. While it is fanciful to say that computer-aided reporting is the future of journalism, it is also naïve to simply reject its potential.



Note on the author

Sean McGrath has recently graduated from the University of Lincoln with a First Class BA (Hons) in Investigative Journalism and Research. He was also the winner of the John Pilger Award for Investigative Journalism. As well as having worked as a researcher for the BBC, he is an active blogger and has a background in social media technology, web design and Search Engine Optimisation. As for being a data journalist… he will, he says, always remain a student.


