New Insights on Big Data


spring 2013


table of contents

4   Credits
5   Boom Factor
6   Connected World Sessions
8   Data Driven Business
10  Big Data for Enterprise IT
12  Beyond Hadoop (exposing new technologies and trends in in-memory distributed systems, stream computing, and Hadoop supporting SQL)
14  Data Science
16  Design Sessions
18  Law, Ethics, and Open Data
20  Ecosystem Snapshot
22  Infographic

Copyright 2013 Orange Silicon Valley. All rights reserved.


report from the 2013 O’Reilly Strata Conference on Big Data

Quo Vadis, Big Data? The Strata Conference has become the watershed event for Big Data in the US and, through its ecosystem’s capillarity, beyond. This year there were two key differences from previous Strata conferences. First, there was a significant presence of the larger players, with Cisco, Intel, VMware, Microsoft, and EMC putting forward solutions and proselytizing their technologies. Second, the subjects covered expanded beyond purely technical and functional topics to include sessions on Law, Ethics, and Open Data as well as Design and Next Gen Data Scientists. Of course, core subjects such as Tools for Hadoop Developers and an Introduction to Apache Drill were also very well represented.

Large enterprise vendors are finding that they can’t escape the decidedly non-proprietary nature of Big Data software and hardware. To address this, they are trying various ways to make their offerings palatable to enterprises seeking footholds in Big Data: addressing the shortage of resources like Data Scientists, enabling co-existence and even integration of seemingly fractious technologies like SQL and NoSQL, and abstracting the technical features of the systems behind new, innovatively designed, user-friendly interfaces.

What does 2013 have in store for Big Data? Certainly we will see consolidation around Hadoop of this currently highly fragmented industry, which has almost as many packaged versions as there are vendors. Arguably we will see an increased tempo of acquisitions of pure-play Hadoop startups such as Cloudera and MapR. Lastly, we see the maturing of the core Hadoop engine and a shift of its creators to adjacent open source projects addressing real-time performance, scalability, and cost, all factors not covered by the original core Hadoop design.

This comprehensive conference report was assembled by the staff of Orange Silicon Valley, combining efforts from the Platform & Middleware and Enterprise groups. The conference organizer, O’Reilly Media, and its founder Tim O’Reilly are widely regarded as the original evangelists of the Big Data trend in IT, and this conference as its annual gathering place of record.



hadoop is ‘big data platform of choice’ During Strata Santa Clara 2013, conference chair Alistair Croll asked the rhetorical question “who knew that the future of NoSQL was SQL?”, underscoring that Big Data technologies could only go mainstream by bridging the gap with legacy systems, most importantly SQL. At this year’s conference, Hadoop was confirmed as the Big Data platform of choice, and gained the support of legacy vendors, opening the doors to a much broader adoption.

“Who knew the future of NoSQL was SQL?”

Alistair Croll co-founded web performance startup Coradiant, and since that time has also launched Rednod, CloudOps, Bitcurrent, Year One Labs, the Bitnorth conference, the International Startup Festival and several other early-stage companies.

credits

Srinivas Chervirala Consumer Group

Asha Vellaikal Director, Consumer Group

Tony Mignot IT & Middleware Group

Shishir Garg Director, IT & Middleware Group

Jameson Buffmire Enterprise Group

Xavier Quintuna IT & Middleware Group

Gabriel Sidhom VP, Technology Development

This yellow elephant is the logo of Hadoop, the industry standard open-source platform for dealing with Big Data.


boom factor Three major announcements were made during Strata Santa Clara 2013; they are bound to finally help the Big Data market cross the chasm between a base of early adopters and the mainstream market.

HortonWorks brings Hadoop to Windows Microsoft Windows servers still represented 73 percent of the server market in 2012, according to IDC. HortonWorks’ move to get Hadoop running on Windows should help its distribution stand out from the competition, and will let enterprises use their favorite analytics tool, Microsoft Excel, to manipulate their data, both structured and unstructured. It also helps Microsoft muscle up its hybrid cloud strategy: HortonWorks’ distribution will run on Windows Server, on premise, and on Windows Azure, in the cloud, and will therefore allow Microsoft to better compete with Amazon Web Services.

EMC/VMware/Greenplum melds analytics database with Hadoop Greenplum announced Pivotal HD, a completely re-architected Hadoop distribution natively fused with Greenplum’s analytics database. Pivotal HD promises impressive performance vis-à-vis Hive (an open-source project that allows SQL queries to run on Hadoop) and Cloudera Impala, and perhaps most importantly, brings a trusted name familiar to enterprises worldwide. Greenplum doesn’t plan on making any significant contribution to Hadoop, as its analytics database is its core product. It was informally confirmed to us that Pivotal HD will be offered as a service via Pivotal’s upcoming public cloud offer.

Intel leads legacy enterprise vendors into the Big Data world In a surprising move, Intel has decided to create its own Hadoop distribution, and is committed to contributing its developments back to the Hadoop community. The chip company is threatened on multiple fronts by ARM, and expanding into software is a defensive move to protect its x86 architecture.

key takeaway A lot of the buzz around Big Data so far has been about harnessing the potential of “unstructured data”, but getting there is easier said than done. On one side, the bulk of organizations have invested in SQL to query their (structured) data; on the other side, an enormous untapped opportunity lies in unstructured data, which is handled with unfamiliar technologies. While a whole ecosystem of vendors has emerged to lower the barriers to “Big Data”, mastering it has been a tall order for many user organizations.


our world connected

As sensors get cheaper and more ubiquitous, we are entering an age of enormous innovation in varied fields combining sensor data with machine learning.

insight

We are in the very early innings of sensor-powered advances, which have the potential to impact business in multiple fields. For the sensors themselves, low cost and low power are the two driving factors making them ubiquitous. The penetration of smartphones that are jam-packed with sensors is another impetus. The current challenge remains combining and analyzing raw sensor data to create actionable insights. Creative design around sensors to create cool consumer products is also a key trend, observed in products such as the Nest thermostat. Big Data around the connected world has its own challenges. On one hand, the vast variability of mobile device features, capabilities, and usage models across different geographies creates unique problems in designing good user experiences. However, this constraint can be converted into a massive opportunity by carefully analyzing usage data and personalizing appropriately.


Current Big Data technologies built around batch processing are not good enough for the connected world. Real-time stream computing systems as well as spatiotemporal data analysis systems are key. On one hand, IBM claims that the technology has evolved enough and is not a barrier; on the other hand, real-time, on-demand location intelligence is still immature. Rather than pursuing complex analysis, it is more important to build simple analyses that can scale to big data in real time. While sensors are machine driven, human-generated social signals are another very important data source. Companies such as Bit.ly and LinkedIn are finding that they can create valuable new product experiences by analyzing raw data such as user profiles and shared links. These companies are starting to have an unfair advantage thanks to the sheer volume of data contributed by users in the course of using their services. It is not surprising that much big data innovation, including many of the most valuable big data open source tools, is happening at companies such as Facebook, Twitter, and LinkedIn. The human in the loop still remains a key component of many Big Data applications, where the complexity of certain tasks is too hard to be characterized by machines. The best approaches are ones where human input is baked into the core of the automated algorithms, à la Twitter. Such scalable social computing systems are also key to many 21st century scientific challenges.


trends There is wide variability in mobile devices around screen size, location, operating system, price point, and many other features. For example, content consumption varies from geography to geography, and the mobile web is more transactional, as opposed to apps that offer richness and interactivity. Designing a product experience optimized for this multitude of devices and usage patterns can be turned into a big opportunity for higher user engagement by measuring usage data effectively. Smartphones are equipped with on-board sensors for location, movement, temperature, and more. Behavio’s technology uses smartphones to unlock insights around social and behavioral dynamics. Combining sensors and analytics can give an unprecedented understanding of how people work and collaborate, yielding actionable insights for building a more effective and productive organization. For example, changing the way call center employees spent their breaks increased performance by 25% while reducing stress; similarly, quantifying the failure of marketing and customer service to communicate led to a more cohesive and profitable organization. The primary sensor used was the wearable Sociometric badge, developed at the MIT Media Lab.

The traditional “store and analyze” approach typically used in data mining is not naturally suited to many problems in the connected world domain, where the temporal dimension differentiates real-time data from conventional Big Data. New technologies such as stream computing have been developed that can continuously analyze data in motion to support real-time decision making.

The human in the loop is an integral part of Big Data technology for a wide variety of applications. For example, Twitter has built a real-time human computation engine to help identify queries as soon as they are trending, sending these queries to humans to be judged. These human annotations are then incorporated back into the core blueprint of the machine-learning based backend models. Crowdsourcing humans to support very large classification tasks using brute-force methods will not scale: repeat classification is wasteful and does not take adaptive learning into account. Mining social media data along with intelligent crowdsourcing and distributed computation can lead to new viral product experiences. Bit.ly, a link-shortening and sharing service, has created Realtime (http://rt.ly), a search engine for real-time trending links, deriving interest graphs from socially shared links. Combining human and machine intelligence in large-scale crowdsourcing to form social computing systems that can scale will be the way forward for scientific big data problems in the 21st century.
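
The human-in-the-loop pattern described above can be sketched in a few lines: route items the model is unsure about to human judges, then fold their labels back into the model. This is a toy illustration, not Twitter’s actual engine; all names, data, and the confidence threshold are invented.

    # Toy human-in-the-loop classification loop: low-confidence items
    # are escalated to human judges, and their labels are folded back
    # into the model. Names and thresholds are hypothetical.
    from collections import defaultdict

    class NaiveCounter:
        """Toy 'model': majority vote per query, learned from human labels."""
        def __init__(self):
            self.votes = defaultdict(lambda: defaultdict(int))

        def predict(self, query):
            labels = self.votes[query]
            if not labels:
                return None, 0.0                    # never seen: no confidence
            label, count = max(labels.items(), key=lambda kv: kv[1])
            return label, count / sum(labels.values())

        def learn(self, query, label):
            self.votes[query][label] += 1           # fold human judgment back in

    def ask_human(query):
        # Stand-in for a crowdsourcing task (e.g., "is this query a news event?").
        return "news" if "crash" in query else "other"

    model = NaiveCounter()
    for query in ["market crash", "cat video", "market crash"]:
        label, confidence = model.predict(query)
        if confidence < 0.8:                        # low confidence: escalate
            label = ask_human(query)
            model.learn(query, label)
        print(query, "->", label)

On the second occurrence of "market crash", the model answers on its own; the human is only consulted where the model is uncertain, which is what keeps the approach scalable.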

“data amplifies desire” Peter Skomoroch, Principal Data Scientist, LinkedIn, discussing how a combination of recommendation algorithms and crowdsourcing can create highly viral effects.

2,500,000,000,000,000,000

2.5 QUINTILLION bytes of data created every day in 2012. Steward Collis, CTO of Awhere



business + big data

Case studies about how real world businesses are using Big Data.

90%

of mobile users keep their phone less than 1 meter away 24 hours a day, 7 days a week.

According to Telefonica Digital

70%

increase from 2010 to 2012 in the number of respondents who think their organization now realizes competitive advantage through information and analytics. Rebecca Shockley, IBM Institute for Business Value


HIGHLIGHTS

01

Altimeter suggests companies prepare for Big Data socially, by having analysts across the organization cohesively share approaches and tools. By starting small and social, organizations can familiarize themselves with Big Data challenges, which will only get worse as more data comes in, hence the sense of urgency.

02

Accenture estimates that all the pieces are now in place for Big Data to really take off, and urges companies to get ready by embedding data analysis into day-to-day operations to make better decisions, and by getting IT to evolve accordingly.


insight

The hype of Big Data is behind us.

Its reality is no longer questionable, and new technologies are constantly emerging to implement Big Data strategies. 2013 will focus on how to really reap the benefits of Big Data: first leveraging internal data, identifying working business cases that have maximum impact, and evolving the corporation into a data-driven mode, transitioning from a hierarchical model to a “heterarchical” one. Starting small is always good, so social media is suggested as a test bed, as it now impacts any organization, spans the whole enterprise, and offers quick wins.

Obama For America Chief Scientist Rayid Ghani touched upon the issue of privacy when it comes to leveraging social data, underscoring that just because data is available doesn’t mean it can legally be used. He gave the example of Facebook, and said his team needed an explicit agreement from users to utilize their data during the last presidential campaign. On the same issue of privacy, Khaled El Emam described how to push the envelope by running secure analytics on encrypted data (additively homomorphic encryption).

Several examples of companies mastering Big Data were given, including Airbnb, which was able to quickly adapt its operations and leverage its data assets to change lives during Hurricane Sandy, when the apartment sharing website connected people in need of shelter with people willing to help.

As more entities familiarize themselves with Big Data, machine learning emerges as a key technology to master. Revolution Analytics proposed a walk through what it takes to go from raw data to usable data with predictive analytics, using the open-source technology called R. An effective implementation leverages not only Big Data through Hadoop, but also more traditional data warehouses.
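
El Emam’s point about analytics on encrypted data can be made concrete with a toy additively homomorphic scheme. Below is a bare-bones, Paillier-style sketch with tiny, insecure parameters, purely for illustration (it assumes Python 3.8+ for the modular inverse): multiplying two ciphertexts adds the underlying plaintexts, so a party can total encrypted values without ever seeing them.

    # Toy additively homomorphic encryption (Paillier-style).
    # Tiny demo parameters: NOT secure, illustration only.
    import random
    from math import gcd

    def keygen(p=293, q=433):                      # small primes, demo only
        n = p * q
        lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
        g = n + 1                                  # standard simplification
        mu = pow(lam, -1, n)                       # modular inverse (Python 3.8+)
        return (n, g), (lam, mu)

    def encrypt(pub, m):
        n, g = pub
        r = random.randrange(1, n)
        while gcd(r, n) != 1:                      # r must be coprime to n
            r = random.randrange(1, n)
        return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

    def decrypt(pub, priv, c):
        n, _ = pub
        lam, mu = priv
        x = pow(c, lam, n * n)
        return ((x - 1) // n) * mu % n

    pub, priv = keygen()
    n = pub[0]
    c1, c2 = encrypt(pub, 42), encrypt(pub, 58)
    # Multiplying ciphertexts adds the plaintexts: 42 + 58 = 100.
    print(decrypt(pub, priv, (c1 * c2) % (n * n)))   # -> 100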

“Data is the commodity, action based on wisdom is the scarce resource.” Jen Van De Meer, of Luminary Labs, urging companies to liberate their data for the greater good.



it, the enterprise + big data

How to create a Big Data strategy, understand the issues of managing data, and learn how data science can be used powerfully.

highlights

01

Big Data is an opportunity to drive Business-IT alignment in a bottom-up fashion. Big Data isn’t an IT problem anymore; it’s a business problem. Businesses need to drive, own, fund, and be the primary user of Big Data systems. IT will move to a purely build, run, and maintain model.


02

A Big Data Business Maturity Model was presented, outlining increasing levels of value derived from data: Business Monitoring → Business Insights → Business Optimization → Data Monetization → Business Metamorphosis.

03

Data science still faces a significant skills shortage. The skill set diverges from that of a traditional Data Analyst or Business Analyst; it is a mix of the two, with a focus on ‘data wrangling’ and data modeling. The result is a highly iterative process that repeats at various stages across this cycle: Discover → Wrangle → Profile → Model → Report.

trends The evolution of structured and unstructured data, query languages, and vendor push vs. Enterprise IT pull. Gain insights from the data you already have. Instrument the data; aim for insights rather than dashboards. Take time to understand the decisions business users are making, make sure those decisions are of high value, and implement some level of data governance.


insight

Machine Learnt rather than Man-Made.

Despite the push from merchant DBMS vendors to integrate and sell Big Data, this isn’t usually an effective approach, as it’s a way for ‘Big Blue’ to sell more of their traditional software and consulting licenses. Big Data really represents the revival of ‘best-of-breed’ for enterprise IT and the need to challenge monolithic enterprise stacks. Most adopters of Big Data are moving from analyzing portions of their data to a complete end-to-end architecture. Architects have been trained to think of IT as a layer cake, typically comprising brand-name tiers. This is challenged by the commodity hardware revolution, combined with new programming models and new query models.

The biggest challenge in the old BI world is change. The traditional model assumes data stability, and that works against the reality of new needs implying constantly changing data. If business users didn’t have a pain point, Shadow IT wouldn’t exist. Focus on attacking this problem and turn it into an opportunity for those business users.

Most enterprise data sets are similar to the data Netflix uses for its recommendation engine. There is no need to over-engineer this data, and simple algorithms yield good enough results: Expertise → Data → Algorithms → Machine Learnt rather than Man-Made. Most companies don’t run internal competitions on data but rely on human intelligence to make decisions, and these decisions sometimes drive the company into deeper trouble. It’s important to ask humans to quantify their decisions with data, which should ideally be backed up by machine learning-based decisions.

Data mining methodologies such as CRISP-DM outline a set of laws that can apply across industries to optimize the use of data. A key difference between data mining and data science is in the expectations of ROI: every data-mining project requires proof of ROI before starting, while data science implies a definition of ROI that is calculated later in the project.
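
The “simple algorithms are good enough” point can be illustrated with item-based co-occurrence counting, one of the simplest recommendation techniques. The sketch below is a hypothetical stand-in for an enterprise data set, not any vendor’s method; the data and item names are invented.

    # Tiny item-based recommender built from co-occurrence counts,
    # illustrating that simple algorithms often suffice.
    from collections import Counter
    from itertools import combinations

    # user -> set of items they interacted with (a toy stand-in for logs)
    history = {
        "u1": {"crm", "email", "chat"},
        "u2": {"crm", "email"},
        "u3": {"crm", "wiki"},
    }

    # Count how often each ordered pair of items co-occurs for a user.
    cooc = Counter()
    for items in history.values():
        for a, b in combinations(sorted(items), 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1

    def recommend(seen, k=2):
        """Score unseen items by co-occurrence with what the user has seen."""
        scores = Counter()
        for item in seen:
            for (a, b), n in cooc.items():
                if a == item and b not in seen:
                    scores[b] += n
        return [item for item, _ in scores.most_common(k)]

    print(recommend({"email"}))  # -> ['crm', 'chat']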

42%

of tech leaders are investing in Big Data projects or planning to spend within the next year, according to a Gartner study.

“Eliminate the human in the process, automate, to take advantage of real-time feeds.” Bill Schmarzo, CTO, EIM Service Line at EMC



Exposing new technologies and trends in in-memory distributed systems, stream computing, and Hadoop supporting SQL.

a better, faster hadoop

HIGHLIGHTS

01

Real-Time and Stream Computing technologies: they are the future for improving development productivity and for adjusting business decisions as events happen. Technologies like Spark/Shark and Storm are still early in the market, but they will consolidate in 2013.

02

Hadoop SQL (distributed databases): The evolution of Hadoop will be from batch processing to a distributed database model with support for SQL. Companies like Drawn to Scale, Cloudera (Impala), and Concurrent (Lingual) are working to reach this goal.

03

Interactive analysis platform: Interactive analysis across different sources is a must. As a result, projects like Drill, inspired by Google Dremel, and Cirro are working to create an interactive analysis platform based on SQL that sits on top of Hadoop and the databases in the warehouse.

04

Time-related analysis: The future technology stack has to provide tools that react in microseconds. This was the message from the industry, where time-based analysis is becoming a main driver for adapting or changing business decisions.

$40 THOUSAND

Cost of the real-time analytics used by Oreo to generate an on-the-spot online marketing campaign during the Super Bowl blackout, which resulted in 14,000 retweets. Steward Collis, CTO of Awhere

100x faster

Improved throughput for query processing from Spark and Shark relative to Hadoop MapReduce

TRENDS Hadoop has been adopted across the industry over the last three years to become the de facto batch-oriented analytics platform. However, the time to get the results of an analysis depends on the size of the cluster, the size of the data set, and the complexity of the analysis. Those factors can delay business decisions, especially in mission-critical applications where timing drives future business actions. In addition, the cost of development is also tied to this wait, because a developer (data scientist) sits idle until he or she gets results from an analysis. The industry and open-source community have started to implement a new wave of technologies focused on stream and real-time analytics, working in conjunction with the current Big Data batch-processing platforms to optimize processing time. Another focus is on integrating Hadoop with existing processes from “legacy” applications, along with the challenge of identifying a qualified workforce. Google was mentioned multiple times as the Big Data equivalent of Oracle; the current mantra in the big data industry when facing any problem is “What would Google do?” Google developed a new system, Spanner & F1, a globally distributed database that uses SQL as the main interface to the platform. In conclusion, the possible answer to the migration and workforce challenges of data analysis is Hadoop supporting SQL.
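
To see why SQL support lowers the barrier, compare a hand-written map/reduce-style aggregation with the same question asked in SQL. The sketch below is a local stand-in (plain Python plus sqlite3), not an actual Hadoop, Hive, or Impala job; the events table is invented.

    # The same aggregation twice: hand-written map/reduce steps,
    # then one declarative line of SQL. Local stand-in only, not a
    # real Hadoop, Hive, or Impala job.
    import sqlite3
    from collections import defaultdict

    events = [("us", 3), ("fr", 5), ("us", 2), ("fr", 1)]  # (country, sales)

    # MapReduce style: map emits key/value pairs, reduce sums per key.
    mapped = [(country, amount) for country, amount in events]
    reduced = defaultdict(int)
    for key, value in mapped:
        reduced[key] += value
    print(dict(reduced))                       # {'us': 5, 'fr': 6}

    # SQL style: the same question, stated declaratively.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (country TEXT, amount INT)")
    db.executemany("INSERT INTO events VALUES (?, ?)", events)
    print(db.execute(
        "SELECT country, SUM(amount) FROM events GROUP BY country"
    ).fetchall())                              # [('fr', 6), ('us', 5)]

For an analyst who already knows SQL, the second form requires no new programming model, which is exactly the adoption argument made above.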

insight Open-source communities have developed software to solve their own needs. The Hadoop ecosystem is attacking the current lag time of batch-processing large amounts of data, now on the order of minutes to hours. To decrease this lag, new tools from academia (Spark/Shark from UC Berkeley) and the web industry (Storm from Twitter) have started to appear in the open source community. These new tools share the goal of delivering complex analyses in real time, allowing business owners to have a clear perspective of their current situation and react quickly to events as they happen. Similarly, shortfalls in developer productivity and in the skilled data science workforce are pushing the industry and the open source community to identify alternative solutions that could be implemented around Hadoop. The alternative solution to those problems is Hadoop supporting SQL. This new trend proposes two different models: Hadoop supports SQL, or Hadoop evolves to become a distributed database à la Google Spanner. For Hadoop supporting SQL, there are already companies providing implementations that sit on top of Hadoop (Impala, Drawn to Scale) or coexist with Hadoop in the same environment (Apache Drill, Hadapt, CirrusDB), but most are in early stages. The other approach, which is creating the most excitement, is that Hadoop will evolve to become a distributed database. The community’s argument is that Google, the inventor of Big Data as we know it, has already moved from batch (BigTable, MegaStore) to ‘distributed’ SQL with tools like Spanner and F1. We predict the larger community will do the same.
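
The “data in motion” model behind tools like Storm and Spark/Shark can be sketched as a sliding-window counter that updates as each event arrives, instead of waiting for a batch job to finish. The window size and event stream below are illustrative, not taken from any of the systems named.

    # Sketch of stream computing: results update per arriving event
    # rather than after a batch job. Parameters are illustrative.
    from collections import Counter, deque

    WINDOW = 3                      # keep counts over the last 3 events

    window = deque()
    counts = Counter()

    def on_event(item):
        """Update rolling counts; called once per arriving event."""
        window.append(item)
        counts[item] += 1
        if len(window) > WINDOW:    # slide: evict the oldest event
            old = window.popleft()
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
        return counts.most_common(1)[0]    # current top item, in real time

    for event in ["buy", "view", "buy", "buy", "view"]:
        print(event, "->", on_event(event))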


the science of data

Inside the world of data practitioners, from the hard science of the latest algorithms to thorny issues of culture and team-building.

trends Machine Learning, or Artificial Intelligence, is really the core of what data scientists do. It can be used to forecast revenues or workloads, or to define products or interfaces, as startup Birchbox discussed. This track was primarily targeted at data scientists, who crunch huge amounts of data in order to make sense of it. A lot of techniques were shared, ranging from emerging and specialized languages like R and Julia, to visualization frameworks like D3.js, to all sorts of algorithms and approaches. It was emphasized that three generations of machine learning tools have led us to a maturity level that will now make it easier to do more with data. The first generation was that of tools like SPSS, Informatica, or SAS, which were very powerful in terms of analytics capabilities but could only scale vertically (i.e., by adding power to the node itself). The second generation, made up of Mahout, RapidMiner, or Pentaho, could scale horizontally for virtually unlimited volume, but only offered a limited number of algorithms, resulting in shallow analytics. Finally, the third generation that’s currently unfolding, with Spark, HaLoop, Twister, Apache Hama, Giraph, or Graph, allows for Big Data analytics on Hadoop, because more algorithms are available, and they can run in parallel and even in real time in a Kafka-Storm integrated environment.


insight AI is the new UI. The science of Big Data is about harnessing the unstructured data pouring out of devices and platforms and crafting meaning that can be used to drive user experiences. This data-driven sense-making is what companies like Nest (home energy management), Square (merchant services), or newcomer Stitch Fix (fashion) are after. They are all creating new meaning and new experiences, which is why some are saying “AI is the new UI.” With three major announcements, from Intel/Cisco/SAP, EMC/VMware/Greenplum, and HortonWorks/Microsoft, Big Data is now beyond the hype stage and going mainstream. The key to broader adoption is melding traditional SQL technology with emerging NoSQL databases to provide a more comprehensive set of tools for enterprises to do more with their data, whether structured (traditional databases) or unstructured (everything else). From a business standpoint, most of the Big Data potential lies in enabling new interfaces, through machine learning, that stimulate usage.

“Data is only a means to an end. There has to be a good understanding of what the data will be used for. Our main focus has to be how we help the customer. Everything else flows naturally from that.” Rajat Taneja, CTO of Electronic Arts

3 GENERATIONS of machine learning tools were necessary to allow for parallel computing that scales horizontally and yet provides all the algorithms necessary for state-of-the-art analytics. Dr. Vijay Srinivas Agneeswaran, Impetus



data by design

User experience, new interfaces, interactivity, and visualization while tracking Big Data.

40%

of emails are read on mobile phones.

Yael Garten, Data Scientist, LinkedIn, on the explosion of data available to data scientists.

HIGHLIGHTS

01

Core problems in Data Design consist of three successive stages: the first is data clean-up, the second is aggregation of large data sets, and the third is visualization.


02

50M

number of URLs shortened daily using bit.ly

Anna Smith, Data Scientist, Bit.ly (while talking about deriving interest graph from social data)

A discussion of Beauty encouraged attendees to “make beauty a priority” and analyzed its different layers: a Visual Layer consisting of typography, color, size, and page layout; an Interaction Layer providing application flow that facilitates user tasks; an Information Architecture Layer providing the organization of the application; and a final Application Layer delivering the overall experience of the application.

03

In thinking about the user, multiple speakers drew on the concept of “embracing empathy with the users,” including “actively listening to customers” and the well-established principle of “dogfooding”: evaluating your product through the lens of your customers. In this sense “everyone is a designer,” and the designer’s role is to orchestrate the collaborative process, taking the best parts of the ideas discussed and expertly putting them together.


trends simple design - less is more.

The theme of the design track was ‘simple design: less is more’. Almost all experts stressed the need for accessible visualization platforms that allow non-technical users to prepare, analyze, and display insights without writing code. At Strata, speakers demonstrated various data analysis and visualization tools such as Pandas, Force Layout, and PhiloGL, a WebGL framework for advanced data visualization, creative coding, and game development. Multiple speakers talked about Dieter Rams, the famous industrial designer of popular consumer products in the 1950s, and his ten principles of simple design. Most of the presentations were technical, with an emphasis on introducing visualization tools. Presenters demonstrated insights drawn from real-world data, such as the recent presidential elections, a political engagement map, and mobility in France.

Network maps: network maps are better than lists for viewing network graphs, since they let users drill down into interesting areas while preserving the bigger picture. The ‘Four pillars of Data Visualization’ are focus, content, structure, and formatting: a visualization should have a clear purpose and focus, contain the right content, have a structure that shows the relationships among the data elements, and offer a great look and feel. On the topic “Does design matter more than math?”: design matters for communicating the information, while math is important for generating the insightful information in the first place. Visual designers should ask themselves key questions while creating compelling user interfaces: ‘Is the purpose well defined?’, ‘Does the content support the purpose?’, ‘Does the structure reveal the content?’, and ‘Does the formatting facilitate consumption?’
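
The three successive stages called out in the highlights (clean-up, aggregation, visualization) map directly onto tools like Pandas. A minimal sketch, assuming the pandas and matplotlib libraries and an invented data set:

    # Clean-up -> aggregation -> visualization, the three stages named
    # above, in a minimal pandas/matplotlib sketch. Data is invented.
    import pandas as pd
    import matplotlib.pyplot as plt

    # 1. Clean-up: drop missing rows, normalize a messy column.
    raw = pd.DataFrame({
        "region": ["West", "west ", None, "East"],
        "sales":  [120, 80, 50, 200],
    })
    clean = raw.dropna(subset=["region"]).assign(
        region=lambda df: df["region"].str.strip().str.title()
    )

    # 2. Aggregation: one number per region.
    by_region = clean.groupby("region")["sales"].sum()

    # 3. Visualization: a simple, purposeful chart (less is more).
    by_region.plot(kind="bar", title="Sales by region")
    plt.tight_layout()
    plt.show()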

“You know what you get if you have design without math? iOS Maps.” Monica Rogati, Senior Data Scientist, LinkedIn, on the importance of getting math as well as design right, using Apple’s recent ‘bad data’ iPhone Maps app as an example.



THE POLITICS OF OPEN DATA

Open data and heightened privacy concerns mean new, and often controversial, thinking.

highlights

01

To retain their trusted relationship status, companies must ensure that customers see value in exposing their data, explicitly opt in to programs that could make their data uniquely vulnerable, and retain control over their preferences for the life of the customer relationship (the ability to change their data exposure level).

68%

of online users would select an easy-to-use Do Not Track mechanism.

14%

of users believe internet companies are being honest about the use of their personal data.

Alysa Zeltzer Hutnik, Partner of Kelley Drye & Warren LLP


02

Rules for the use of data are being written in real time and won’t be settled until industry takes action that is validated by government, or precedent is set through lawsuits.

03

Bringing data to the application is the old model; instead, bring your application to the data. There are many specialized platforms available, but Hadoop can be the single platform where you query, process, and search across the data and scale to thousands of nodes.


“We need to help people think and act in terms of their connections, their observable and inferable behaviors, and their likely actions in the future.” Shelley Evenson, Executive Director, Organizational Evolution, Fjord, discussing how we need to bring consumers into the analytics discussion, rather than employ increasingly more data scientists.

trends we are entering a new era of privacy

but we’re currently only at the frontier. Rules, regulations, norms, and expectations are all being written in real time, and against experiences that push the boundaries of what is permissible and/or allowable. There are examples of “black-hat”, or intentionally harmful, activities in data science and big data analytics, but far more often companies are stumbling in the dark and aren’t aware of when or whether they are violating laws or breaking consumer trust. Companies pursuing a Big Data strategy, or collecting large quantities of consumer data, should take a pro-active approach to informing customers of what data is being recorded and used, and work toward establishing a trusting relationship. They can do this by being pro-active, transparent, and cooperative, and by restricting the flow of information to third parties.

There was some friction throughout the day about how easy it is for a data scientist to “go to the dark side” and corrupt datasets to cheat systems or compromise customer relationships. All the speakers agreed that customer data is increasingly exposed, but there was some disagreement about how much, and to what extent, it should be shared and acted on. The rights and responsibilities of companies are being defined in real time. At the same time, the global trend towards open data means extraordinary sums of data are available to be manipulated and shared by average consumers. Data policies dictating the use, terms of service, extensibility, transferability, and privacy concerns associated with large data sets are being written in real time as data journalists push boundaries.

In the energy and utility sector, data analysis is contributing to more efficient transfer of energy, reducing brownouts and service interruptions. In government, open data is pushing agencies to be more accountable and accurate in their reporting. For example, analysis of Argentina’s consumer price indexes has exposed the government’s manipulation of monetary controls, and the resulting inflation, to citizens.



Ecosystem Snapshot

The most frequently mentioned companies at Strata 2013 that are trying to reshape Big Data.

IBM InfoSphere Streams: Continuously analyze petabytes of data, at rates of up to petabytes per day.

Azul Systems: The only JVM that allows very large in-memory datasets with the predictable performance needed for low latency and high throughput in real-time analytics.

business and big data (p8)

a better, faster hadoop (p12)

Revolution Analytics: An enterprise provider of software and services for the Open Source R predictive analytics platform.

Spark/Shark: Developed by the UC Berkeley AMP Lab, Spark and Shark form an open-source cluster computing system designed to support real-time and interactive analysis.

it, the enterprise + big data (p10)

ClearStory: Focuses on providing solutions around interactive analysis, live situation analysis and automatic analysis.

Data Direct Networks: Launched hScaler, an optimized Hadoop appliance based on RDMA.

Kaggle: A crowd-sourced Data Scientist platform; it runs data science competitions.

Trifacta: Solving the non-IT user’s data wrangling problem, making data appealing to business users.

the science of data (p14)

Revolution Analytics: A consultancy that provides software and services to help enterprises implement the open-source statistical language R, commonly used for predictive analytics.

Forio Online Simulations: Developed an IDE (Integrated Development Environment) for Julia, a high-performance language that provides distributed parallel execution with familiar syntax. It is comparable to R, but performs much better.

Gurobi: A mathematical programming solver that helps companies like Birchbox optimize their operations, but could be used in a variety of settings.


REST Devices: New types of wearable electronics using proprietary low-cost sensor technology.

NEST: Smart learning thermostat.

the politics of open data (p18)

Apache Drill: Drill provides developers with ANSI SQL and generates specific native code to implement the analysis for each data source, be it Hadoop or other databases.

Storm: Distributed and fault-tolerant real-time computation engine, used by Accenture as a key component of IT monitoring.

TransLattice: Specialist in geographically distributed databases for enterprise, cloud, and hybrid environments, in an architecture similar to Google Spanner.

Cirro: Provides an abstraction layer between multiple data sources and the analyst, offering an SQL layer to query multiple sources in the data warehouse.

data by design (p16)

MemSQL: Distributed database for real-time analytics.

Tableau: Provides visual analytics tools for Big Data. Tableau’s drag-and-drop tools help business users quickly understand the data and unearth insights.

Zyxt Labs: A small lab working on making complex data simple: reducing noise in our technology-overwrought lives, data mining, coordinated messaging, social productivity, and summarization.

Spinn3r: Delivers massive blog analytics, tied into the blog ping network provided by Google, Blogger, Ping-o-Matic, WordPress, FeedBurner, and many other content management systems.

Intel: Working with partners such as IP camera vendors to embed HBase-like technology to enable intelligent distributed networks.

Pecan Street: A non-profit organization with $10M in funding to understand home energy usage. It partnered with Intel on a smart meter use case, using big data algorithms to understand patterns in energy usage.

CommonCrawl: A non-profit with the goal of democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable.


startup showcase Notable Companies to Watch from the Strata Conference

ZoomData

Metamarkets

Zoomdata just launched its Big Data visualization software to help companies analyze their data sets in real time and on the go on mobile devices. Zoomdata offers a free iPad client and a server-side stream-processing engine that connects to multiple data feeds, packaged as a virtual machine.

Metamarkets provides real-time advertising analytics. The company’s interactive dashboard provides the ability to get instant insight on the fast changing dynamics of mobile, video and display marketplaces in real time. Metamarkets is headquartered in San Francisco, and backed by Khosla Ventures, AOL Ventures, True Ventures, IA Ventures and Data Collective.

Magna Labs

Behavio

Behavio is another startup out of MIT. Its primary goal is to leverage the power of the phone as a sensor platform. Behavio has launched Funf, an open-source sensing and data processing framework for mobile devices. Funf turns phones into smart sensors of people’s behaviors and surroundings: how people use their phones, how they communicate with others, and the environment around them. Using the Funf SDK, developers can build apps with smart sensors. Currently, the Funf framework is only available for Android devices.

Magna Labs is an early-stage big data startup. Its innovative approach to Hadoop may help give developers and analysts better access to relevant large-scale data sets. From the founder: “Magna is a service that provides developers with a simple, lightning-fast interface for processing their data. It allows any developer to ask complex questions of their data without having to learn a new technology. Magna takes queries that currently take hours down to minutes.”

Concurrent

Concurrent provides an application framework that facilitates and improves development productivity for Hadoop. Development for data analytics is expensive in terms of implementation time, so an abstraction layer is needed to optimize it. As part of that framework, Concurrent provides ANSI SQL support (Lingual) to ease adoption.



big data by the numbers

time is money

54 seconds = $9

Time and compute platform charges needed to process 1 terabyte of data in the latest TeraSort competition, won by MapR running on Google Compute Engine. An in-house IT solution achieving this performance would cost an estimated $5 million. Greg Khairallah, Business Development Manager at Intel

analytics

Percentage of all websites that use Google Analytics to track usage. Douglas van der Molen, Chief User Experience Architect, ClearStory Data

7 minutes versus 4 hours

Time to sort 1 TB of data using Intel’s Apache Hadoop distribution: 7 minutes with Hadoop versus 4 hours with a traditional SQL RDBMS. Greg Khairallah, Business Development Manager at Intel


10,000

Server instances that Apache Drill is designed to scale to in providing an interactive analysis platform across different data sources.

5,600,000,000

Records generated by 400 homes over a period of 15 months.

Greg Khairallah, Business Development Manager at Intel

hadoop versus rdbms Hadoop costs 1/30th what a traditional RDBMS costs to store 1 TB of data: roughly $1K per TB for Hadoop versus $30K for an RDBMS. Charles Zedlewski, VP of Products for Cloudera


the numbers at the conference

2/3 Two-thirds of the presenters at Strata 2013 were vendors or consultants selling Big Data technology or solutions.

one

The number of telecommunication companies or their subsidiaries presenting at Strata 2013: Telefonica, the Spanish telecom provider.

1:6

Women were heavily outnumbered at Strata 2013: of 194 presenters, only 31 (roughly 1 in 6.25) were women.

“Big Data is turning into a Big Relational Database” Tim O’Reilly, founder of conference organizer O’Reilly Media, talking about the convergence of Big Data and SQL, the language of the original relational databases.

scalability

600 trillion

It takes 600 trillion pixels of satellite imagery data to cover the entire earth. Steward Collis, CTO of Awhere

100

Petabytes (a petabyte is one thousand terabytes) of data in Hadoop systems today, proving the scalability of Hadoop. Charles Zedlewski, VP of Products for Cloudera


Introducing Orange™

A new program from Orange Silicon Valley

Insight Onsite is your opportunity to participate in the revolutionary innovations of Silicon Valley. Have one of your enterprise innovation leaders work alongside us in San Francisco to meet new ideas and people, work on a concrete project relevant to your company, and forge lasting relationships between today's cutting-edge startups and your company. Onsite residencies of three months or longer are available to qualified customers of Orange Business Services. To explore a 2013 residency at Orange Silicon Valley, contact Gabriel Sidhom at Gabriel.sidhom@orange.com

