VOICES
BROAD DATA: EXPLORING THE EMERGING WEB OF DATA

Jim Hendler
Department of Computer Science and Cognitive Science Department, Rensselaer Polytechnic Institute, Troy, New York
Introduction

For those of us who work on new technologies for the web, the emergence of new big data challenges is an exciting area of research. The "web of data" is the focus of this column. We will explore the technologies, applications, and societal implications of open data, which is increasingly being made available for reuse. We use the phrase "broad data" to emphasize that our focus is not on data in a centralized repository or data warehouse, but on the vast array of open data spread across the World Wide Web.
Three Vs Revisited

Big data is generally characterized as being composed of the "three Vs": significant growth in the volume, velocity, or variety of data.1 Of these, volume is the facet most often discussed. The term is used in three main ways: exploring the analysis of relational data found in the enterprise warehouses of large companies; storing and processing the very large amounts of data that come from scientific platforms such as the Large Hadron Collider or many of the new data-intensive astronomical and environmental platforms; or discussing the large unstructured holdings of industry, especially the huge repositories held by the major web companies. Volume is easily quantified: data, especially in a centralized repository, can be measured in terms of tera-, peta-, and exabytes as we continually collect and analyze more information about the world around us. Much of the work in this area is driven by the challenges of handling increasingly large repositories for analysis, data mining, and visualization.

Velocity is a bit harder to quantify, as different analysts use the term in different ways. Those who focus on the data
coming from sensor networks and the emerging "internet of things" tend to focus on the bit rate of data streams and to explore challenges such as determining in near real time which data to save and which to discard (assuming not all of it can be stored), handling the increasing amount of video streaming from now-plentiful low-cost cameras, and developing appropriate technologies for storing and querying this kind of data. Others focus more on handling the speed of growth of a large-scale data repository (Facebook reports growth of over 500 terabytes per day2) and on developing computing technologies for querying these large-scale data resources (for example, it has been reported that Google processes over 20 petabytes of data per day3). This latter aspect of velocity also gives rise to another challenge: variety. The information processed by these large web companies is not structured and neatly formatted; the data holdings include text, images, video, and audio. The applications most in demand, especially search-based ones, require the ability to store, query, and integrate results across this variety of information types. This is typically how the third V is used in most descriptions of big data.
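One of the stream-processing challenges mentioned above, deciding which data to keep when the full stream cannot be stored, has a classic approximate answer in reservoir sampling, which maintains a fixed-size uniform random sample of an unbounded stream. The short Python sketch below is an illustrative example of that general technique, not a method taken from any system discussed in this column; the simulated sensor stream and sample size are hypothetical.

    import random

    def reservoir_sample(stream, k):
        """Keep a uniform random sample of k items from a stream of unknown length (Algorithm R)."""
        reservoir = []
        for n, item in enumerate(stream, start=1):
            if n <= k:
                reservoir.append(item)      # fill the reservoir with the first k items
            else:
                j = random.randrange(n)     # uniform integer in [0, n)
                if j < k:
                    reservoir[j] = item     # keep this item, evicting a random earlier one
        return reservoir

    if __name__ == "__main__":
        # Simulated sensor stream: one million readings we cannot afford to store in full.
        readings = (random.gauss(20.0, 2.5) for _ in range(1_000_000))
        print(reservoir_sample(readings, k=5))

Because each arriving item replaces a random slot with probability k/n, every item seen so far has an equal chance of being in the sample, no matter how long the stream runs.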
The Web of Data

In this column, however, we are going to look at a different aspect of variety, one that we believe is more challenging, less explored, and, in the long term, more revolutionary. Increasingly, "raw data" is being published on the web, often as some form of dump from a structured or semistructured data repository, and made available for reuse. The leading edge of this movement is currently open government data.