Broad Data

VOICES

BROAD DATA: EXPLORING THE EMERGING WEB OF DATA Jim Hendler Department of Computer Science and Cognitive Science Department Rensselaer Polytechnic Institute Troy, New York

Introduction For those of us who work on new technologies for the web, the emergence of new big data challenges is an exciting area of research. The “web of data” is the focus of this column. We will explore technologies, applications, and societal implications of open data, which is increasingly becoming available and reusable. We use the phrase “broad data” to emphasize our focus not on data in a centralized repository or data warehouse, but on the vast and varied array of open data on the World Wide Web.

Three Vs Revisited Big data is generally characterized as being composed of the “three Vs”: significant growth in the volume, velocity, or variety of data.¹ Of these, volume is the facet that is most discussed. The term is used in three main ways: exploring the analysis of relational data found in the enterprise warehouses of large companies; storing and processing the very large amounts of data that come from scientific platforms such as the Large Hadron Collider or many of the new data-intensive astronomical and environmental platforms; or discussing the large unstructured holdings of industry, especially the huge repositories held by the major web companies. Volume is easily quantified: data, especially in a centralized repository, can be measured in terms of tera-, peta-, and exabytes as we continually collect and analyze more information about the world around us. Much of the work in this area is driven by the challenges of handling increasingly large repositories for analysis, data mining, and visualization.

Velocity is a bit harder to quantify, as different analysts use the term in different ways. Those who focus on the data coming from sensor networks and the emerging “internet of things” tend to focus on the bit rate of data streams and to explore challenges such as determining in near real time which data to save or discard, assuming it cannot all be stored; handling the increasing amount of video streaming from now-plentiful low-cost cameras; and developing appropriate technologies for storing and querying this kind of data. Others focus more on the means for handling the speed of growth in a large-scale data repository (Facebook reports growth of over 500 terabytes per day²) and developing computing technologies for querying these large-scale data resources (for example, it has been reported that Google interacts with over 20 petabytes per day³).

This latter aspect of velocity also gives rise to another challenge, variety. The information processed by these large web companies is not structured and neatly formatted. The data holdings include text, images, video, and audio formats. The applications most in demand, especially search-based ones, require the ability to store, query, and integrate results across this variety of information types. Typically, this is the use of the third V in most descriptions of big data.
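To relate the two senses of velocity discussed above, the daily volumes quoted for repository growth and query traffic can be converted into the kind of sustained bit rate used for data streams. The following is a minimal sketch, assuming decimal SI units (the cited sources may count in binary units) and assuming the load is spread evenly over the day; the function and constant names are illustrative only.

```python
# Decimal SI units are an assumption; the cited reports may use binary units.
TB = 10**12
PB = 10**15
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_gbit_per_s(bytes_per_day: int) -> float:
    """Average bit rate, in gigabits per second, implied by a daily byte volume."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e9


# Figures quoted in the text: 500 TB/day of growth, 20 PB/day processed.
growth = sustained_rate_gbit_per_s(500 * TB)
processed = sustained_rate_gbit_per_s(20 * PB)

print(f"500 TB/day is about {growth:.1f} Gbit/s sustained")
print(f"20 PB/day is about {processed:.1f} Gbit/s sustained")
```

Real traffic is bursty, so these averages are only a lower bound on the peak rates such systems would need to absorb.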

The Web of Data In this column, however, we are going to be looking at a different aspect of variety, one that we believe is more challenging, less explored, and, in the long term, more revolutionary. Increasingly, “raw data” is being published on the web in some form of a dump from a structured or semistructured data repository and made available for reuse. The leading edge of this movement is currently open government data (Fig. 1), but more and more repositories are showing up in scientific, consumer, and other sectors.*

BIG DATA MARCH 2013 DOI: 10.1089/big.2013.1506

To those of us who were around in the early days of the web, there’s a lot of excitement as we watch these sorts of numbers grow. Many of us still remember the excitement when the hundredth web page was announced, and the marvel of how rapidly the page count grew. We also remember many of the challenges of that early web: as content increased, the ability to search for pages became a problem. Interoperability between different browsers and languages was an increasingly troublesome issue, and even as browsing became easier and easier, publishing content to the growing web remained beyond the competence of many users. The companies that overcame these challenges went on to become some of the giants of the modern web: Google for search, Netscape (later Firefox) for browsing, and Facebook for letting users share web content without special expertise.

As the web of data grows, we see analogous challenges starting to arise. Given the million government datasets available now, and the millions more projected in the next few years as cities around the world join the open-data movement, how will users find the data they need for their pursuits? Having found a dataset, how can they comprehend the organization of the data and the details of what it represents? How can data providers share their data without violating rules of privacy, confidentiality, and licensing? How can different datasets from different agencies, countries, and cultures, described in many different languages, be usefully integrated and jointly queried, visualized, and analyzed? What new standards for data, metadata, and repositories will be needed on the web? How can items in the different data repositories be linked to one another, in the way web pages link, so as to power the sorts of network-based algorithms so important to large-scale, web-based applications?

This column will be tracking and reporting on key developments in this exciting and emerging area. We will explore issues of technologies for data search outside of centralized repositories, the development of metadata standards for the data web, the explication of the semantics required for linking datasets together, and the economic, legal, and policy issues that arise from the opening of previously closed datasets. In short, we will look at efforts and research, commercial and government, that are exploring the V of broad variety: the broader the better.
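The integration and linking questions raised above can be made concrete with a toy example. The sketch below assumes two hypothetical open datasets from different publishers that describe the same countries under different field names, languages, and letter case; a small piece of invented metadata maps each publisher's fields onto shared terms so the records can be joined. Every field name, mapping, and value here is made up for illustration and is not drawn from any real repository or standard.

```python
# Toy records; every field name and value is invented for illustration.
dataset_a = [
    {"iso_code": "FR", "population": 100},
    {"iso_code": "DE", "population": 200},
]
dataset_b = [
    {"pays": "fr", "jeux_de_donnees": 3},  # French-language field names
    {"pays": "it", "jeux_de_donnees": 5},
]

# Hypothetical metadata: maps each publisher's local field names to shared terms.
FIELD_MAP_A = {"iso_code": "country", "population": "population"}
FIELD_MAP_B = {"pays": "country", "jeux_de_donnees": "dataset_count"}


def normalize(rows, field_map):
    """Rename fields to the shared vocabulary and canonicalize the join key."""
    normalized = []
    for row in rows:
        renamed = {field_map[k]: v for k, v in row.items() if k in field_map}
        renamed["country"] = str(renamed["country"]).upper()
        normalized.append(renamed)
    return normalized


def link(left, right, key="country"):
    """Inner-join two normalized datasets on a shared identifier."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


linked = link(normalize(dataset_a, FIELD_MAP_A), normalize(dataset_b, FIELD_MAP_B))
print(linked)
```

Only the record present in both sources survives the join. The mapping tables stand in for the metadata and semantics that, on the web of data, would have to be published and agreed upon rather than hand-written for each pair of datasets.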

FIG. 1. A map of countries with open datasets (6/2012).

*The web addresses http://datalib.org, http://oad.simmons.edu/oadwiki/Data_Repositories, and http://re3data.org are good starting places to find hundreds of examples covering a broad range of topics.

MARY ANN LIEBERT, INC. VOL. 1 NO. 1 MARCH 2013 BIG DATA


Author Disclosure Statement No competing financial interests exist.

References
1. Dumbill E. What is Big Data? O’Reilly Strata 2012. Available online at http://strata.oreilly.com/2012/01/what-is-big-data.html (Last accessed on December 31, 2012).
2. Facebook Engineering. Under the Hood: Scheduling MapReduce jobs more efficiently with Corona. 2012. Available online at www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920 (Last accessed on December 31, 2012).
3. Gallagher S. The Great Disk Drive in the Sky: How Web giants store big—and we mean big—data. Ars Technica 2012. Available online at http://arstechnica.com/business/2012/01/the-big-disk-drive-in-the-sky-how-the-giants-of-the-web-store-big-data/ (Last accessed on December 31, 2012).

Address correspondence to:
Jim Hendler
Department of Computer Science and Cognitive Science Department
Rensselaer Polytechnic Institute (RPI)
110 8th Street
Troy, NY 12180
E-mail: hendler@cs.rpi.edu
