rediscoveringBI Magazine
A Radiant Advisors Publication
April 2013, Issue 7


FEATURES

[P4] The Honeymoon is Over for Big Data
Big data, it turns out, means precisely nothing and imprecisely anything you want it to mean.
[By Dr. Barry Devlin]

[P8] Has the Big Data Bubble Burst?
The BI industry is abuzz with one new question: is big data done?
[By Krish Krishnan]

Twilight of the (DM) Idols
Big data is already a disruptive force: at once democratizing, reconfiguring, and destructive.
[By Stephen Swoyer]

[P15] A Kludge Too Far?
Big Data, Big Impact: Two use cases for big data are having a Big Impact, at least from a data management perspective.
[By Stephen Swoyer]

[P16] Bringing BI and Big Data Together
…things that make a “big” difference when implementing big data.
[By John O’Brien]

SIDEBAR

Big Data Revolution
Are we becoming no more than sentient founts of Data? Mayer-Schönberger and Cukier put the pulse back in the Big Data Conversation.
[By Lindy Ryan]
2 • rediscoveringBI Magazine • #rediscoveringBI

FROM THE EDITOR

“Big data” is being routinely paired with proportionally “big” descriptors: innovative, revolutionary, and even (in this editor’s humble opinion, the mother-of-all-big-descriptors) transformative. With the inexorable momentum of cyber-journalism keeping it affixed atop industry headlines, big data has indeed earned itself quite the reputation, complete with a stalwart following of industry pundits, papers, and conferences. In fact, the whole big-data-thing has mutated into a sort-of larger-than-life caricature of promise and possibility – and, of course, power.

Let’s face it: big data is the Incredible Hulk of BI – gargantuan, brilliant, and, yes, sometimes even a bit hyper-aggressive. For many a mild-mannered, Bruce Banner-esque data analyst out there, big data is, for better or worse, the remarkably regenerative, impulsive alter ego of the industry, eager to show with brute force just how much we can really do with all our data – what the tangible extent of all that big data power is, so to speak. Yet, as with the debut of Stan Lee’s destructive antihero in The Incredible Hulk #1, in early 2013 we still haven’t begun to really see what big data can do. Not even close.

In this month’s edition of RediscoveringBI, authors Dr. Barry Devlin, Krish Krishnan, Stephen Swoyer, and John O’Brien each explore different facets of that very construct, asking: has the big data bubble actually burst, or is its honeymoon phase just over? How much is just hype? And how much is only a precursor to what we’ll continue to see buzzing around and reinventing the industry?

What’s next for BI’s Incredible Hulk antihero, Big Data?

Lindy Ryan
Editor in Chief

Editor in Chief: Lindy Ryan
Contributors: Dr. Barry Devlin, Krish Krishnan, John O’Brien
Distinguished Writer: Stephen Swoyer
Art Director: Brendan Ferguson

For More Information: Radiant Advisors

OPINION

Letters on “Time for An Architectural Reckoning” (rediscoveringBI, March 2013, Issue 6)

Evolution vs. Revolution

This is an excellent, thought-provoking article. I believe that you are correct in the assertion that an Architectural Reckoning is underway. In fact, I believe it has been underway for at least 10 years. To focus on technology in general and Hadoop in particular is, however, to miss the point. The Reckoning is being driven by the intersection of business needs and technology advances. Both sides can be summed up as “faster and smarter” – and they are mutually reinforcing. I call this the “biz-tech ecosystem”. On the technology side, Hadoop is and will be part of it. So will a range of other data management technologies, including relational databases, for sure. And I believe that the various approaches – column, in-memory and more – will be combined into a hybrid approach more powerful than any RDBMS we have today. And we will need that, because I am certain that the Data Warehouse – in a new, more circumscribed role, but central to consistency and reliability for information that must be of high quality – will continue to thrive. (And many thanks for the historical positioning of Paul’s and my paper from 1988!) I call this new role “Core Business Information.”

As you also pointed out, it’s not just about data management. What is happening is also changing application development, as well as process and business modeling and implementation. Collaborative and social computing are also vital components of the mix. So, yes, an inter-disciplinary approach will be needed – not just within IT but across the business and IT.

We are also in somewhat of a positive feedback loop – and as anyone who has ever put a microphone in front of speakers knows, the result rapidly becomes very unpleasant. So, we do need to step back from the hype of big data and recognize the dangers as well as the opportunities.

My bottom line: yes, we are in a time of Architectural Reckoning (is this the same as a Paradigm Shift?) but continuity of thinking and a mindset of evolution rather than revolution are vital. I’m trying to capture this in my long-awaited (by me, anyway) second book.

- Dr. Barry Devlin (Editor's Note: See The Honeymoon is Over for Big Data by Dr. Barry Devlin in this month's issue)

Augmentation of Traditional DWs

I totally agree with Claudia that “not all analytics now belong inside the BI architecture” and that we are in a “very disruptive period of a lot of new technologies flooding in to business intelligence.” I also am not actually that far away from the position Scott Davis takes: I agree that “Hadoop is a hugely transformational technology.” I just think for the short- to medium-term Hadoop et al. are going to augment, rather than replace, traditional data warehouses. Will Hadoop replace a traditional data warehouse database in the long term? Only if it adds a lot of database-like features, and then the argument becomes a lot less interesting – something akin to the “Will Ingres/Informix/Sybase replace Oracle?” debate of yesteryear.

My main concern is how customers are going to embrace this new data landscape, rather than if they are going to. How are organizations going to build a data landscape that includes Teradata, Aster, and Hadoop? How are they going to manage Analysis Services cubes and a smattering of legacy Oracle data warehouses? Data warehouses currently take too long to build and are too hard to change. The new architectural changes are going to make things worse, not better.

Yes, WhereScape does have a stake in the game – although not in the status quo. Regardless of the platform, design, and technology, the need to deliver quickly without compromise remains the same. Who wants to manually build out a multiple-platform data warehouse? A data warehouse automation environment (such as WhereScape RED) helps simplify the approach, and I believe is a key piece of the new architecture.

- Michael Whitehead (Editor's Note: Michael Whitehead is the CEO and Founder of WhereScape)

Have something to say? Send your letters to the editor.

Upcoming Webinar: Inside Analysis

Inside Analysis with Dr. Robin Bloor and John O'Brien, hosted by Eric Kavanagh. Agile and flexible -- those might well be the mantras of Modern Data Platforms. As organizations look to harness the latest advances in analytics and integration technologies, the focus turns quite sharply to architecture: the right data platform can empower companies to harness everything from Big Data to real-time, all without sacrificing data quality and governance. Register for this free Webcast to catch a preview of SPARK!: Modern Data Platforms, a three-day seminar series to be held in Austin, TX, from April 29 - May 1. The seminar will feature a tag-team of experts from Radiant Advisors and The Bloor Group, who will provide detailed instruction on the range of activities associated with modernizing and evolving robust data platforms. John O'Brien of Radiant will focus on Rediscovering BI, while Dr. Robin Bloor of The Bloor Group will discuss the Event-Driven Architecture. Attendees of the Webcast will receive a discount code for $150 off the in-person seminar. Follow the conversation at #sparkevent.

Big Data Boot Camp: May 21-22, New York Hilton

Don't miss Radiant Advisors' John O'Brien as he keynotes the upcoming Big Data Boot Camp. John will offer perspective into the dynamics and current issues being encountered in today's Big Data analytic implementations, as well as the most important and strategic technologies currently emerging to meet the needs of the "Big Data Paradigm." Join John and other Big Data experts as they converge upon New York, and be sure to save an extra $100 off the early bird rate by using this link. Early bird registration ends April 19.



The Honeymoon is Over for Big Data

[The term big data has passed its use-by date]

By Dr. Barry Devlin

BIG DATA IS tumbling into the “Trough of Disillusionment,” according to Gartner’s Svetlana Sicular. If you fear that this means the end of the road for big data, see Mark Beyer’s (co-lead for Gartner big data research) remedial education on the meaning of Gartner’s Hype Cycle curve, although they might have chosen a less alarmist phrase! Let me put it another way: the big data honeymoon is over. Let’s quickly review the history of the romance before looking to the future of the relationship.

For commercial computing, big data “dating” really began in the mid-2000s, when technical people in the burgeoning web business began to consider new ways to handle the exploding amounts and types of data generated by web usage. Before then, big data had been the dream -- or nightmare, actually -- of the scientific community where, from genetics to astrophysics, instrumentation was spewing data. In early 2008, the commercial romance of big data really began to get serious when Hadoop, the yellow poster elephant child of big data, was accepted as a top-level Apache project. The marketing paparazzi began stalking the couple soon after and, true to paparazzi nature, have been publishing a stream of outrageous claims and pictures ever since. By 2012, a shotgun wedding with business


was hastily arranged. By then the gloss had begun to wear off, and the honeymoon was washed out in a brief trip to Atlantic City at the height of a super storm.

Enough Of The Past: Let’s Look Forward!

Big data does offer real and realizable business benefits, but there is one major issue: what actually is big data? The “volume, variety, and velocity” nomenclature, claimed by Doug Laney from a 2001 Meta Group research note, is useful shorthand at best. In reality, each attribute opens up a question of how far on any scale must data be in order to be called big -- how vast, how disparate, how fast? Furthermore, what combination of these three factors should be used in making a call? Big data, it turns out, means precisely nothing and imprecisely anything the Mad Men want it to mean. And, with the various additional “v-words” vaunted by vendors, the value vanishes. (Oops, I veered into the v-v-verge there!)

The extent of this terminology problem was made clear in a big data survey conducted last fall by EMA and myself. Participants were those who declared they were investigating or implementing big data projects, yet almost a third of respondents classed the data source for their projects as “process-mediated data” -- data originating from traditional operational systems. My conclusion: the term big data has passed its use-by date. Big data and “small data” are conceptually one and the same: just data, all data. Or, to be more semantically correct, all information, as I’ll explain in a new book later this year. (Editor's Note: Business Unintelligence: Via Analytics, Big Data and Collaboration to Innovative Business Insight will be published in Q3 2013 by Technics Publications.)

To be clear, I don’t consider that big data has taken us into a dead end. Rather, it has usefully exposed the fact that our traditional business intelligence (BI) view of the information available to and used by business is woefully inadequate. It has caused me to revisit many underlying assumptions about information, and I now see that there exist three domains of information that future business intelligence/analytics must handle, as shown in the accompanying figure: human-sourced information, process-mediated data, and machine-generated data. These domains are fundamentally different in their usage characteristics and in the demands they place on technology. The terms are largely self-explanatory, but more information can be found in my white paper. (See Barry Devlin’s The Big Data Zoo - Taming the Beasts: The need for an integrated platform for enterprise information.) The bottom line is that we need a new architecture for information -- all of it and its entire life cycle in business.
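The three domains can be made concrete with a small sketch. This is purely hypothetical illustration: the domain names come from the article, while the example sources and the `classify` helper are invented here to show how an architecture might tag incoming sources by domain before routing them to different technologies.

```python
# Hypothetical sketch: tagging example data sources with the three
# information domains named in the article. The source lists and the
# lookup helper are illustrative assumptions, not a real taxonomy API.

DOMAINS = {
    "human-sourced information": ["tweets", "call center logs", "email"],
    "process-mediated data": ["orders", "invoices", "CRM records"],
    "machine-generated data": ["sensor readings", "web server logs"],
}

def classify(source):
    # Return the information domain a named source belongs to, since each
    # domain places different demands on the underlying technology.
    for domain, sources in DOMAINS.items():
        if source in sources:
            return domain
    return "unclassified"

print(classify("sensor readings"))   # -> machine-generated data
print(classify("call center logs"))  # -> human-sourced information
```

In a real architecture the routing decision would of course rest on usage characteristics rather than a name lookup; the point is only that the three domains partition all information, big or small.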

The Biz-Tech Ecosystem

Both challenges and opportunities emerge as we shift the view from IT to business. The biggest challenge in the big data/analytics scene is the alleged dearth of so-called “data scientists.” How different are data scientists from the power users we’ve known in BI for decades? Arguably, the only substantive difference is deep statistical skill. The other characteristics mentioned -- data munging, business acumen, and storytelling -- are all common to power users. Statistics, however, is a very specialized skill that should, in principle, be tightly supervised to ensure valid and proper application. The phrase “lies, damn lies, and statistics” indicates the problem: statistics are far too easy to misuse -- deliberately or otherwise.

Moreover, we seem to have blindly accepted an assertion that the exponential growth in data volumes implies a similar growth in hidden nuggets of useful business knowledge. This is unlikely to be true. Most of the good examples of business value coming from big data illustrate this. Real value emerges from a new type or new combination of data; growth in volumes leads to incremental increases in value, at best.

These challenges aside, a focus on novel (big) data use does


drive opportunities for new businesses, business models, or, simply, ways to compete. A useful, cross-industry categorization (courtesy of IBM) of these opportunities is:

• Big Data Exploration: analyze “big data” to identify new business opportunities
• Enhanced 360° View of the Customer: incorporate human-sourced information sources, such as call center logs and social media, into traditional CRM approaches
• Security and Intelligence Extension: lower risk, detect fraud, and monitor cyber security in real-time, machine-generated data
• Operations Analysis: analyze and use machine-generated data to drive immediate business results
• Data Warehouse Augmentation: increase operational efficiency by integrating big data with BI

This focus on (big) data is but the latest stage in the evolution of what I call the biz-tech ecosystem -- the symbiotic relationship between business and IT that drives all successful, modern businesses. Every business advance worth mentioning in the past twenty years has had technology, and almost always information technology, at its core. On the other hand, many of the advances in IT have been driven by business demands. The relative roles of business and IT people may change as the process evolves, but that process is set to continue. And, at its heart are the collection, creation, and use of information, as opposed to data -- big or small -- as mandatory, core competencies of modern business.

Dr. Barry Devlin is Founder and Principal of 9sight Consulting, and is among the foremost authorities on business insight and big data. He is a widely respected analyst, consultant, lecturer, and author.



Big Data Revolution

By Lindy Ryan

…known as “big data” is rapidly reknitting the very fabric of our lives; what we are just now beginning to see and to understand – to appreciate – is how.

Yet, so often our conversations about big data focus on these “how’s” in the abstract – on its benefits, potentials, and opportunities, and likewise, its risks, challenges, and implications – that we overlook the simpler, more primordial question: what’s not changing?

It’s a simple question that requires a simple answer. Us. Sure, we can assert that we’re becoming more data-dependent. We generate more data: last month, social media giant Twitter blogged that its over 200-million active users generate over 400-million tweets per day. We consume more data: a now-outdated University of California report calculated that American households collectively consumed 3.6 zettabytes of information in 2008. Are we – the data-generating organisms that we are – becoming no more than sentient founts of data?

In Big Data: A Revolution That Will Transform How We Live, Work, and Think, authors Viktor Mayer-Schönberger and Kenneth Cukier effectively put the pulse back in the Big Data Conversation: “big data is not an ice-cold world of algorithms and automatons...we [must] carve out a place for the human: to reserve space for intuition, common sense, and serendipity to ensure that they are not crowded out by data and machine-made answers.”

In our brief email exchange, Mayer-Schönberger elaborated a bit more on this idea. “[We] try to understand the (human) dimension between input and output,” he noted. “Not through the jargon-laden sociology of big data, but through what we believe is the flesh and blood of big data as it is done right […]”

With the elegance of an Oxford University professor and The Economist’s data editor – Mayer-Schönberger and Cukier, respectively – Big Data’s authors remind us that it is our human traits of “creativity, intuition, and intellectual ambition” that should be fostered in this brave new world of big data. That the inevitable “messiness” of big data can be directly correlated to the inherent “messiness” of being human. And, most important, that the evolution of big data as a resource and tool derives from (is a function of) the distinctly human capacities of instinct, accident, and error, which manifest, even if unpredictably, in greatness. In that greatness is progress.

That – progress – is the intrinsic value of big data. It’s what’s so compelling about Big Data (both the book and the thing itself): it’s not always about the inputs or outputs, but the space – or, what Mayer-Schönberger calls the “black box,” of […]

Lindy Ryan is Editor in Chief of Radiant Advisors.

Big Data is available on Amazon and the Radiant Advisors eBookshelf. (The University of California report cited is How Much Information?)



Has the Big Data Bubble Burst?

[The BI industry is abuzz with one new question: is big data done?]

By Krish Krishnan

RECENT ARTICLES IN leading business publications, a hype-cycle presentation by Gartner, and a number of blogs have all startled the world of big data by asking one “big” question: are we done? Did the big data bubble burst even quicker than the “dot com” bubble? Has the big data bubble burst? The answer is: not really. If anything, the market for infrastructure is booming, with more vendors distributing commercial versions of open source software (like Hadoop and NoSQL). We are seeing the evolution of new consulting practices focused on analytics and – perhaps most important – traditional database vendors have all either embraced or announced support for big data platforms. So, what is the basis of this notion of failure or disappointment around the big data space?

The Promised Land

In 2004, Google’s announcement of the general availability of MapReduce and the Google File System started a flurry of activity building platforms aimed at solving scalability problems. One of these projects was “Nutch,” a parallel search engine on the open source platform. The team at Nutch succeeded in building the infrastructure that attracted Yahoo to sponsor and incubate the project under its commercial name: Hadoop. Submitted to open source in 2009, Hadoop quickly gained notoriety as the panacea for all data scalability problems. Since then it has become a viable platform for large-scale computing needs and has been adopted as a data storage and processing platform at many companies across the world. Subsequently, the last four years have also seen the evolution of NoSQL databases and multiple other technologies on the Hadoop framework.

Among the potential gaps not understood clearly by adopters:

One size does not fit all: Big data technologies were developed to solve the problems of extreme scalability and sustained performance. While these technologies have certainly overcome the traditional limitations of database-oriented data processing, the same techniques cannot be directly extended to solve every problem in the same realm.

MapReduce skill availability: To effectively use most of the big data platforms, one has to be able to write some amount of MapReduce code; however, this is an area where skills are evolving and (still) scarce.

Programming dependence: Many corporations are unable to adjust to the idea of having teams design and develop code (or data processing) – much like application software development. Standardization of programming techniques for big data is still maturing.

Business case: Most early adopters did not have a robust business case, or, in many cases, the right business case to implement on these platforms. The lack of an end-state solution -- or usage and ROI expectations -- has led to longer development and implementation cycles.

Hype: Continued hype about the technology has caused unrest amongst executives, line-of-business owners, IT, and business users, leading to often misunderstood capabilities of the platform as well as incorrect ROI or TCO expectations.

But wait: it is not “all over” when we talk about big data; rather, we have come to the point in time where the reality of the platform – and how to drive its adoption within corporations – has started settling down. The big data bubble is alive and well; in fact, it’s even progressing in the right direction.
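The MapReduce skills gap is easier to appreciate with a concrete sketch. What follows is an illustrative toy only: plain Python standing in for a real Hadoop job, with every function name invented for this example. It shows the shape developers had to think in -- a map phase emitting key/value pairs, a shuffle grouping values by key, and a reduce phase combining them -- using the canonical word-count example.

```python
from collections import defaultdict

# Illustrative sketch of the MapReduce programming model (the canonical
# word-count example), in plain Python rather than on a real cluster.

def map_phase(document):
    # Emit a (key, value) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group all emitted values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Combine all values observed for one key into a single result.
    return (key, sum(values))

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

print(word_count(["big data is big", "data is data"]))
# -> {'big': 2, 'data': 3, 'is': 2}
```

Even this trivial job must be decomposed into separate phases, which is precisely the mental shift -- and the scarce skill -- the gap above describes.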

The Reality

Hadoop’s early adopters did not fully understand the complexities of the platform until they began implementing the technology, and this lack of understanding inevitably spurred a sense of failure (or disappointment).

How to Integrate Big Data

As corporations begin to see beyond the hype of big data, everyone from the executive sponsor to the implementation team is beginning to recognize the need to dig a better foundation for integrating big data. There are a few subtle yet invaluable pointers in this process:


1. Build the business case and keep it simple
2. Create a data discovery environment that can be used by line-of-business experts
3. Identify the data and patterns that are needed to create a robust foundation for analytics
4. Create the initial analytics based on the data discovery
5. Visualize the data in a mash-up platform using semantic data integration techniques
6. Get the business users to use the outcomes
7. Gain adoption among the users
8. Create a roadmap for the larger program

While the overall process of big data integration seems closely aligned to the integration of any other project, there are key differences that can define the success of the big data bubble in your corporation: data discovery, data analysis, and data visualization. These three integral pillars will clearly identify the basis of how to implement big data and monetize such an exercise.

The Future

Several technology providers have announced their support of big data platforms, including Datastax (Cassandra), Intel, Microsoft, EMC and HP (Hadoop), 10gen (MongoDB), and Cray (YarcData graph analytics). These vendors -- along with existing vendors -- will undoubtedly continue to provide more options and solution platforms for deploying and integrating big data technologies within the enterprise platform.

The big data bubble has not burst; it is still only beginning and will be reaching various levels of maturity over the following years. There are many layers of complexities and intricacies that need to be defined and formalized, but this is where the evolution and opportunities exist.

Krish Krishnan is a globally recognized expert in the strategy, architecture, and implementation of big data. His new book Data Warehousing in the Age of Big Data will be released in August 2013.

Twilight of the (DM) Idols: Big Data vs. Data Management

By Stephen Swoyer

…for big data. Others – a prominent market watcher comes to mind – argue that big data, like so many technologies or trends before it, is simply conforming to well-established patterns: following a period of hype, it’s undergoing a correction. It’s regressing toward a mean.

That was fast.

This doesn’t concern us. Big data is an epistemic shift. It’s going to transform how we know and understand — how we perceive — the world. What’s meant by the term “big data” is a force for destabilizing and reordering existing configurations – much as the Bubonic Plague, or Black Death, was for the Europe of the late-medieval period. It’s an unsettling analogy, but it underscores an important point: the phenomenon of big data, like that of the Black Death, is indifferent to the hopes, prayers, expectations, or innumerate prognostications of human actors. It’s inevitable. It’s going to happen. It’s going to change everything.

Even as the epitaphs are flying, the magic quadrants being plotted, and the opinions mongering, big data is changing (chiefly by challenging) the status quo. This is particularly the case with respect to the domain of data management (DM) and its status quo. Here, big data is already a disruptive force: at once democratizing, reconfiguring, and destructive. We’ll consider its reordering effect through the prism of Hadoop, which, in the software development and data management worlds, has to a real degree become synonymous with what’s meant by “big data.”



The Citadel of Data Management

Big data has been described as a wake-up call for data management (DM) practitioners. If we’re grasping for analogies, the big data phenomenon seems less like a wake-up call than...a grim tableau straight out of 14th-century France.

This was the time of the Black Death, which was to function as an enormous force for social destabilization and reordering. It was also the time of the Hundred Years War, which was fought between England and France on French soil. The manpower shortage of the invading English was exacerbated by the virulence of the Plague, which historians estimate killed between one-third and two-thirds of the European population. Outmanned – and outwoman-ed, for that matter, once Joan D’Arc abrupted onto the scene – the English resorted to a time-tested tactic: the chevauchée. The logic of the chevauchée is fiendishly simple: Edward III’s English forces were resource-constrained; they enjoyed neither the manpower nor the defensive advantages – e.g., castles, towers, or city walls – that accrued (by default) to the French. The English achieved their best outcomes in pitched battle; the French, on the other hand, were understandably reluctant to relinquish their fortifications, fixed or otherwise. The challenge for the English was to draw them out to fight.

Enter the chevauchée. It describes the “tactic” of rampaging and pillaging – among other, far more horrific practices – in the comparatively defenseless French countryside. Left unchecked, the depredations of the chevauchée could ultimately comprise a threat to a ruler’s hegemony: fealty counts for little if it doesn’t at least afford one protection from other would-be conquerors. As a tactical tool, the chevauchée succeeded by challenging the legitimacy of a ruling power.

Hadoop has had a similar effect. For the last two decades, the data management (DM) or data warehousing (DW) Powers That Be have been holed up in their fortified castles, dictating terms of access – dictating terms of ingest; dictating timetables and schedules, almost always to the frustration of the line of business, to say nothing of other IT stakeholders. Though Hadoop wasn’t conceived tactically, its adoption and growth have had a tactical aspect. By running amok in the countryside, pillaging, burning, and destroying stuff – or, by offering an alternative to the data warehouse-driven BI model – the Hottentots of Hadoop have managed to drag the Lords of DM into open battle.

At last year’s Strata + Hadoop World confab in New York, NY, a representative with a prominent data integration (DI) vendor shared the story of a frustrated customer that it says had developed – perforce – an especially ambitious project focusing on Hadoop. The salient point, this vendor representative indicated, was that the business and IT stakeholders behind the project saw in Hadoop an opportunity to upend the power and authority of the rival DM team. “It’s almost like a coup d’etat for them,” he said, explaining that both business stakeholders and software developers were exasperated by the glacial pace of the DM team’s responsiveness. “[T]hey asked how long it would take to get source connectivity [for a proposed application and] they were told nine months. Now they just want to go around them [i.e., the data management group],” this representative said. “[T]hey basically want Hadoop to be their new massive data […]”

The Zero-Sum Scenario

This zero-sum scenario sets up a struggle for information management supremacy. It proposes to isolate DM altogether; eventually it would starve the DM group out of existence. It views DM not as a potential partner for compromise, but as a zero-sum adversary. It’s an extremist position, to be sure; it nevertheless brings into focus the primary antagonism that exists between software-development and data-management stakeholders.

This antagonism must be seen as a factor in the promotion of Hadoop as a general-purpose platform for enterprise data management. Hadoop was created to address the unprecedented challenges associated with developing and managing data-intensive distributed applications. The impetus and momentum behind Hadoop originated with Web or distributed application developers. To some extent, Hadoop and other big data technology projects are still largely programmer-driven efforts. This has implications for their use on an enterprise-wide scale, because software developers and data management practitioners have very different worldviews. Both groups are accustomed to talking past one another. Each suspects the other of giving short shrift to its concerns or requirements.

In short, both groups resent one another. This resentment

the conditions for change and transformation. Big data has

isn’t symmetrical, however; there’s a power imbalance. For a

had a similar effect in data management – chiefly by raising

quarter century now, the DM group hasn’t just managed data

questions about the warehouse’s ability to accommodate

-- it’s been able to dictate the terms and conditions of access

disruptions (e.g., new kinds of data and new analytic use

to the data that it manages. In this capacity, it’s been able to

cases) for which it wasn’t designed. Simply by claiming to

impose its will on multiple internal constituencies: not only

be Something New, big data raised questions about the DM

on software developers, but on line-of-business stakehold-

status quo.

ers, too. The irony is that the per-

This challenge was exploited by

ceived inflexibility and unrespon-

well-established insurgent cur-

siveness – the seeming indifference

rents inside both the line of busi-

– of DM stakeholders has helped to

ness and IT. The former has been

bring together two other nominally

fighting an insurgency against IT

antagonistic camps; in their resent-

for decades; however, in an age

ment of DM, software developers

of pervasive mobility, BYOD, social

and the line of business have been

collaboration, and (specific to the

able to find common cause.

DM space) analytic discovery, this

Few would deny that stakeholders

insurgency has taken on new force

jealously guard their fiefdoms. This

and urgency.

is as true of software developers

IT, for its part, has grappled with

and the line of business as it is of

insurgency in its own ranks: the

their counterparts in the DM world.

agile movement, which most in

Part of the problem is that DM

DM associate with project manage-

is viewed as an unreasonable or

ment, began as a software develop-

uncompromising stakeholder: e.g.,

ment initiative; it explicitly bor-

DM practitioners have been unable

rowed from the language of politi-

to meaningfully communicate the

cal revolution – the seminal agile

logic of their policies; they’ve like-

document is Kent Beck’s “Manifesto

wise been reluctant – or in some cases, unwilling – to revise

for Agile Software Development,” published in 2001 – in

these policies to address changing business requirements. In

championing an alternative to software development’s top-

addition, they’ve been slow to adopt technologies or meth-

down, deterministic status quo.

ods that promise to reduce latencies or which propose to

Agility and insurgency have been slower to catch on in DM.

empower line-of-business users. Finally, DM practitioners are

Nevertheless, insurgent pressure from both the line of busi-

fundamentally uncomfortable with practices – such as ana-

ness and IT is forcing DM stakeholders (and the vendors who

lytic discovery, with its preference for less-than-consistent

nominally service them) to reassess both their strategies and

data – which don’t comport with data management best

their positions.


However far-fetched, the possibility of a Hadoop-led chevau-

Hadoop and Big Data in Context That’s where the zero-sum animus comes from. It explains why some in business and IT champion Hadoop as a technology to replace – or at the very least, to displace – the DM status quo. There’s a much more

chée in the very heart of its enterprise fiefdom – with aid and comfort from a line-of-business class that DM has too often treated more as peasants than as enfranchised citizens – snagged the attention of data management practitioners. Big time.

pragmatic way of looking at what’s going on, however.


This is to see Hadoop in context – i.e., at the nexus of two

The Hadoop chevauchée got the attention of DM practitio-

related trends: viz., a decade-plus, bottom-up insurgency,

ners for another reason.

and a sweeping (if still coalescing) big data epistemic shift.

In its current state, Hadoop is no more suited for use as a

The two are related. Think back to the Bubonic Plague, which

general-purpose, all-in-one platform for reporting, discovery,

had a destabilizing effect on the late-Medieval social order.

and analysis than is the data warehouse. (See Sidebar: A

The depredations of the Plague effectively wiped out many

Kludge Too Far?)

of the practices, customs, and (not to put too fine a point on

Given the maturity of the DW, Hadoop is arguably much less

it) human stakeholders that might otherwise have contested

suited for this role. For all of its shortcomings, the data ware-


house is an inescapably pragmatic solution; (Contiued p21)

The Plague, then, cleared away the ante-status quo, creating

DM practitioners learned what works chiefly by figuring out

14 • rediscoveringBI Magazine • #rediscoveringBI



April 29 - May 1

At the Omni Downtown in Austin

Day One | Designing Modern Data Platforms These sessions provide an approach to confidently assess and make architecture changes, beginning with an understanding of how data warehouse architectures evolve and mature over time, balancing technical and strategic value delivery. We break down best practices into principles for creating new data platforms.

Day Two | Modern Data Integration These sessions provide the knowledge needed for understanding and modeling data integration frameworks to make confident decisions to approach, design, and manage evolving data integration blueprints that leverage agile techniques. We recognize data integration patterns for refactoring into optimized engines.

Day Three | Databases for Analytics These sessions review several of the most significant trends in analytic databases challenging BI architects today. Cutting through the definitions and hype of big data in the market, NoSQL databases offer a solution for a variety of data warehouse requirements. Register now at:

CAN'T MAKE IT? Catch us in San Francisco from May 28-30. Registration opens April 22nd. Use the priority code ReBI to save $150

Featured Keynotes By: John O’Brien, Founder and CEO, Radiant Advisors; Dr. Robin Bloor, Co-Founder and Principal Analyst, The Bloor Group


BIG DATA: BIG IMPACT [STEPHEN SWOYER]

The most common big data use cases tend to be less sexy than mundane. In fact, two use cases for which big data is today having a Big Impact have decidedly sexy implications, at least from a data management (DM) perspective.

Both use cases address long-standing DM problems; both likewise anticipate issues specific to the age of big data. The first involves using big data technologies to super-charge ETL; the second, using Hadoop as a landing zone – i.e., a general-purpose virtual storage locker – for all kinds of data.

Of the two, the first is the more mature: IT technologists have been talking up the potential of super-charged ETL almost from the beginning. Back then, this was framed largely in terms of MapReduce, the mega-scale parallel processing algorithm popularized by Google. Five years on, the emphasis has shifted to Hadoop itself as a platform for massively parallel ETL.

The rub is that performing stuff other than map and reduce operations across a Hadoop cluster is kind of a kludge. (See Sidebar: A Kludge Too Far?) However, because ETL processing can be broken down into sequential map and reduce operations, data integration (DI) vendors have managed to make it work. Some DI players – e.g., Informatica, Pervasive Software, SyncSort, and Talend, among others – market ETL products for Hadoop. Both Informatica and Talend – along with analytic specialist Pentaho Inc. – use Hadoop MapReduce to perform ETL operations. Pervasive and SyncSort, on the other hand, tout libraries that they say can be used as MapReduce replacements. The result, both vendors claim, is ETL processing that’s (a) faster than vanilla Hadoop MapReduce and (b) orders of magnitude faster than traditional enterprise ETL.

This stuff is available now. In the last 12 months, both Informatica and Talend announced “big data” versions of their ETL technologies for Hadoop MapReduce; Pervasive and SyncSort have marketed Hadoop-able versions of their own ETL tools (DataRush and DMExpress, respectively) for slightly longer. In every case, big data ETL tools abstract the complexity of Hadoop: ETL workflows are designed in a GUI design studio; the tools themselves generate jobs in the form of Java code, which can be fed into Hadoop.

Just because the technology’s available doesn’t mean there’s demand for it. Parallel processing ETL technologies have been available for decades; not everybody needs or can afford them, however. David Inbar, senior director of big data products with Pervasive, concedes that demand for mega-scale ETL processing used to be specialized. At the same time, he says, usage patterns are changing; analytic practices and methods are changing. So, too, is the concept of analytic scale: scaling from gigabyte-sized data sets to dozens or hundreds of terabytes – to say nothing of petabytes – is an increase of several orders of magnitude. In the emerging model, rapid iteration is the thing; this means being able to rapidly prepare and crunch data sets for analysis.
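The decomposition the DI vendors rely on – ETL broken down into sequential map and reduce operations – can be sketched in miniature. The toy Python sketch below is illustrative only (it is not any vendor’s tool; the record format and names are invented): parsing/transforming each record is the map step, the per-key aggregation is the reduce step, and a sort stands in for Hadoop’s shuffle phase.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

# Extract: raw delimited records, as they might land in HDFS.
raw = [
    "2013-03-01,widgets,3",
    "2013-03-01,gadgets,5",
    "2013-03-02,widgets,2",
]

# Map (the "transform" step): parse each line into a (key, value) pair.
def mapper(line):
    date, product, qty = line.split(",")
    return (product, int(qty))

# Shuffle/sort: group all values for a key together, as Hadoop would.
mapped = sorted(map(mapper, raw), key=itemgetter(0))

# Reduce: aggregate values per key -- here, total units per product.
totals = {
    key: reduce(lambda acc, kv: acc + kv[1], group, 0)
    for key, group in groupby(mapped, key=itemgetter(0))
}

print(totals)  # {'gadgets': 5, 'widgets': 5}
```

The GUI-driven tools described above generate equivalent Java MapReduce jobs from a designed workflow; the shape of the computation is the same.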




Nor is analysis a one-and-done affair, says Inbar: it’s iterative. “What really matters is not so much if it uses MapReduce code or if it uses some other code; what really matters is does it perform and does it save you operational money – and can you actually iterate and discover patterns in the first place faster than you would be able to otherwise?” he asks. “It’s always possible to write custom code to get stuff done. Ultimately it’s a relatively straightforward [proposition]: [manually] stringing together SQL code [for traditional ETL] or Java code [for Hadoop] can work, but it’s not going to carry you forward.”

However, one of the data warehouse’s (DW) biggest selling points is also its biggest limiting factor. The DW is a schema-mandatory platform. It’s most comfortable speaking SQL. It uses a kludge – i.e., the binary large object (BLOB) – to accommodate unstructured, semi-structured, or non-traditional data types. Hadoop, by contrast, is a schema-optional platform. For this reason, many in DM conceive of Hadoop as a virtual storage locker for big data.

“You can drop any old piece of data on it without having to do any of the upfront work of modeling the data and transforming it [to conform to] your data model,” explains Rick Glick, vice president of technology and architecture with analytic discovery specialist ParAccel. “You can do that [i.e., transform and conform] as you move the data over.”

At a recent industry event, several vendors – viz., Hortonworks, ParAccel, and Teradata – touted Hadoop as a point of ingest for all kinds of information. This “landing zone” scenario is something that customers are adopting right now, says Pervasive’s Inbar; it has the potential to be the most common use case for Hadoop in the enterprise.

“Before you can do all of the amazing/glamorous/groundbreaking analytical work … and innovation, you do actually have to land and ingest and provision the data,” he argues. “Hadoop and HDFS are wonderful in that they let you [store data] without having predefined what it is you think you’re going to get out of it. Traditionally, the data warehouse requires you to predefine what you think you’re going to get out of it in the first place.”

SIDEBAR | A KLUDGE TOO FAR?

The problem with MapReduce – to invoke a shopworn cliché – is that it’s a hammer. From its perspective, any and every distributed processing task wants and needs to be nailed. If Hadoop is to be a useful platform for general-purpose parallel processing, it must be able to perform operations other than synchronous map and reduce jobs. The problem is that MapReduce and Hadoop are tightly coupled: the former has historically functioned as parallel processing yin to the Hadoop Distributed File System’s storage yang.

Enter the still-incubating Apache YARN project (YARN is a bacronym for “Yet Another Resource Negotiator”), which aims to decouple Hadoop from MapReduce. Right now, Hadoop’s Job Tracker facility performs two functions: resource management and job scheduling; YARN breaks Job Tracker into two discrete daemons. From a DM perspective, this will make it possible to perform asynchronous operations in Hadoop; it will also enable pipelining, which – to the extent it’s possible in Hadoop today – is typically supported by vendor-specific […]

YARN’s been a long time coming, however: it’s part of the Hadoop 2.0 framework, which is still in development. Given what’s involved, some in DM say YARN’s going to need seasoning before it can be used to manage mission-critical, production workloads. That said, YARN is hugely important to Hadoop. It has support from all of the Hadoop Heavies: Cloudera, EMC, Hortonworks, Intel, MapR, and others.

“It feels like it’s been coming for quite a while,” concedes David Inbar, senior director of big data products with data integration specialist Pervasive Software. “All of the players … are in favor of it. Customers are going to need it. If as a sysadmin you don’t have a unified view of everything that’s running and consum[ing] resources in your environment, that’s going to be suboptimal,” Inbar continues. “So YARN is a mechanism that’s going to make it easier to manage [Hadoop clusters]. It’s also going to open up the Hadoop distributed data and processing framework to a wider range of compute engines and paradigms.”
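Glick’s “transform and conform as you move the data over” describes what is often called schema-on-read, and it is the essence of the landing-zone scenario. The small Python sketch below is a stand-in, not a real Hadoop or HDFS API: heterogeneous records are dropped in as-is, and a schema (a list of field/default pairs, invented here for illustration) is applied only when the data is read.

```python
import json

# "Landing zone": heterogeneous records dropped in as-is, with no upfront model.
landing_zone = [
    json.dumps({"user": "ann", "clicks": 3}),
    json.dumps({"user": "bob", "clicks": 7, "referrer": "search"}),
    json.dumps({"user": "cam"}),  # missing fields are fine -- schema-optional
]

def read_with_schema(raw_records, schema):
    """Apply structure at read time: project each raw record onto
    (field, default) pairs chosen only when the question is known."""
    for rec in raw_records:
        doc = json.loads(rec)
        yield {field: doc.get(field, default) for field, default in schema}

# The schema is decided at query time, not at ingest time.
rows = list(read_with_schema(landing_zone, [("user", None), ("clicks", 0)]))
print(rows)
```

A schema-mandatory warehouse would have rejected (or forced remodeling for) the third record at load time; here it simply surfaces with a default when queried.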



BRINGING BI AND BIG DATA TOGETHER [JOHN O’BRIEN]

[Three things that make a “big” difference when implementing big data.]

[…] about big data and business analytics creating value, transforming businesses, and gaining new insights. Or, perhaps you’ve spent some time and resources during the past year reading publications or attending industry events, or even launched a small-scale “big data pilot” experiment. In any case, if you’re at the early stages of your company’s journey into big data, there are some important conversations to keep in mind as you continue your path to bringing business intelligence (BI) and your company’s big data together.

Big Data and the Business Intelligence Program

For the most part, big data environments are those that adopt Apache’s Hadoop or one of its variants (like Cloudera, MapR, or HortonWorks) or the NoSQL databases (like MongoDB, Cassandra, or HBase with Hadoop). These data stores have massive scalability and unstructured data flexibility at the best price. No longer reserved for the biggest IT shops, the democratization of big data comes from Hadoop’s ability to enable any company to affordably and easily exploit big data sets, and sometimes go even further with Cloud implementations. Gleaning insights from these vast data sets requires a completely different type of data platform and programming framework for creating insightful analytic routines.

Analytics is not new to BI: the ability to execute statistical models and identify hidden patterns and clusters of data has long allowed for better business decision-making and predictions. What these new BI analytic capabilities have in common is that they work beyond the capabilities of the SQL statements that govern relational database management systems to execute embedded algorithms. No longer are we constrained to sample data sets; advanced analytic tools can now execute their algorithms in parallel at the data layer. For many years, data has been extracted from data warehouses into flat files to be executed outside the RDBMS by data mining software packages (like SPSS, SAS, and Statistica). Both traditional capabilities -- reporting and dimensional analysis -- have always been needed, along with what is now being called “Analytics” in today’s BI programs. Big data analytics are another one of the several BI capabilities required by the business. And, even when big data is not



so “big,” there are other reasons why Hadoop and NoSQL are better solutions than RDBMSs or cubes. Most common is when working with the data is beyond the capabilities of SQL and tends to be more programmatic. The second most common is when the data being captured is constantly changing or is of unknown structure, such that a database schema is difficult to maintain. In this scenario, schema-less Hadoop and key-value data stores are a clear solution. Another is when the data needs to be stored in various data types, such as documents, images, videos, sounds, or other non-record-like data (think, for example, about the metadata to be extracted from a photo image, like date, time, geo-coding, technical photography data, meta-tags, and perhaps even names of people from facial recognition). Most company big data environments today are less than ten terabytes and fewer than eight nodes in the Hadoop cluster because of these other “non-bigness” requirements.

Data Platform = Big Data + Data Warehouse

You might have already discussed what to do now that you have both a Hadoop and a data warehouse system. Should the data warehouse be moved into Hadoop, or should you link them? Do you provide a semantic layer over both of them for users or between the data stores? Most companies are moving forward recognizing that both environments serve different purposes but are part of a complete BI data platform. The traditional hub-and-spoke architecture of data warehouses and data marts is evolving into a modern data platform of three tiers: big data Hadoop, analytic databases, and the traditional RDBMS. Industry analysts are contemplating whether this is a two-tier or three-tier data platform, especially given the expected maturing of Hadoop in the coming years; however, it is safe to say that analytic databases will be the cornerstone of modern BI data platforms for years to come.

The analytic database tier is really for highly-optimized or highly-specialized workloads -- such as columnar, MPP, and in-memory (or vector-based) -- for analytic performance, or text analytics and graph databases for highly-specialized analytic capabilities. Big data governance and analytic lifecycles would encompass semantic and analytic discoveries made in Hadoop, combined with traditional reference data, and then be migrated and productionized in a more controlled, monitored -- and accessible -- analytics tier.



Determining Access

Apache Hive is sometimes called the “data warehouse application on top of Hadoop” because it enables more generalized access for everyday users with its familiar HiveQL format, which SQL-literate users can understand. Hive provides a semantic layer that allows for the definition of familiar tables and columns mapped to key-value pairs found in Hadoop. With virtual tables and columns in place, Hive users can write HiveQL to access data within the Hadoop environment. More recent has been the release of HCatalog, which is making its way into the Apache Hadoop project. HCatalog is a semantic-layer component similar to Hive; it allows for the definition of virtual tables and columns for communication with any application, not just HiveQL. Last summer, data visualization tool Tableau allowed users to work with and visualize Hadoop data for the first time via HCatalog. Today, many analytic databases allow users to work with tables that are views to HCatalog and Hadoop data. Some vendors also choose to leverage Hive as access to Hadoop data by using its semantic layer and converting user SQL statements into HiveQL statements. Expect more BI vendors to follow suit and enable their own connectivity to Hadoop.

There are emerging agile analytic development methodologies and processes that enable the iterative nature of analytics in big data environments for discovery, then couple that with data governance procedures to properly move the analytic models to a faster analytic database with operational controls and access. In this model, companies can store big data cheaply until its value can be determined, and then move it to the appropriate production and valued data platform tier. This could be a MapReduce extract to a relational database data mart (or cube), or it could be executing the analytic program in an MPP, columnar, or in-memory high-performance database.
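The semantic-layer idea behind Hive and HCatalog – familiar tables and columns mapped onto key-value data – reduces to a small mapping exercise. The Python sketch below is a toy model of that mapping (the table definition, record contents, and `select` helper are all invented for illustration; this is not Hive’s metastore API):

```python
# A virtual "table" definition: familiar column names mapped to the
# keys actually present in the raw key-value records.
page_views = {
    "table": "page_views",
    "columns": {"viewer": "user_id", "page": "url"},
}

# Raw key-value records, as they might sit in Hadoop.
records = [
    {"user_id": "ann", "url": "/home", "ua": "firefox"},
    {"user_id": "bob", "url": "/pricing", "ua": "chrome"},
]

def select(table_def, column_names, data):
    """Resolve the requested column names to underlying keys and
    project them -- the essence of answering a SQL-style query
    against schema-less storage."""
    keys = [table_def["columns"][c] for c in column_names]
    return [tuple(rec[k] for k in keys) for rec in data]

result = select(page_views, ["viewer", "page"], records)
print(result)  # [('ann', '/home'), ('bob', '/pricing')]
```

A BI tool speaking SQL never sees the key-value layout; it sees only the virtual `page_views` table with `viewer` and `page` columns.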

More to Come

While big data has come a long way in just a short amount of time, it still has a long road ahead as an industry, as a maturing technology, and as best practices are realized and shared. Don’t compare your company with the mega e-commerce companies (like Yahoo, Facebook, Google, or LinkedIn) that have lived and breathed big data as part of their mission-critical core business functions for many years already. Rather, think of your company as one of the other 99% of companies -- small and large -- found in every industry, exploring opportunities to unlock the hidden value in big data on their own. These companies typically already have a BI program underway, but now must grapple with the challenge of maintaining BI delivery from structured operational data combined with the new integration of big data platforms for business analysts, customers, and internal consumers.

John O’Brien is the Principal and CEO of Radiant Advisors, a strategic advisory and research firm that delivers innovative thought-leadership, publications, and industry news.

(Continued from p12) …what doesn’t work. The genealogy of the data warehouse is encoded in a double-helix of intertwined lineages: the first is a lineage of failure; the second, a lineage of success born of this failure. The latter has been won – at considerable cost – at the expense of the former. A common DM-centric critique of Hadoop (and of big data in general) is that some of its supporters want to throw out the old order and start from scratch. As with the chevauchée – which entailed the destruction of infrastructure, agricultural sustenance, and formative social institutions – many in DM (rightly) see in this a challenge to an entrenched order or configuration.

They likewise see the inevitability of avoidable mistakes – particularly to the extent that Hadoop developers are contemptuous of or indifferent to the finely-honed techniques, methods, and best practices of data management.

“Reinvention is exactly it, … [but] they aren’t inventing data management technology. They don’t understand data management at all,” argues industry veteran Mark Madsen, a principal with information management consultancy Third Nature Inc. Madsen is by no means a Hadoop hater; he notes that, as a schema-optional platform, Hadoop seems tailor-made for the age of big data: it can function as a virtual warehouse – i.e., as a general-purpose storage area – for information of any and every kind.

The DW is schema-mandatory; its design is predicated on a pair of best-of-all-possible-worlds assumptions: firstly, that data and requirements can be known and modeled in advance; secondly, that requirements won’t significantly change. For this very reason, the data warehouse will never be a good general-purpose storage area. Madsen takes issue with Hadoop’s promotion as an information management platform.

Proponents who tout such a vision “understand data processing. They get code, not data,” he argues. “They write code and focus on that, despite the data being important. Their ethos is around data as the expendable item. They think [that] code [is greater than or more important than] data, or maybe [they] believe that [even though they say] the opposite. So they do not understand managing data, data quality, why some data is more important than other data at all times, while other data is variable and/or contextual. They build systems that presume data, simply source and store it, then whack away at it.”

The New Pragmatism

Initially, interest in Hadoop took the form of dismissive assessments. A later move was to co-opt some of the key technologies associated with Hadoop and big data: almost five years ago, for example, Aster Data Systems Inc. and Greenplum Software (both companies have since been acquired by Teradata and EMC, respectively) introduced in-database support for MapReduce, the parallel processing algorithm that search giant Google had first helped to popularize, and which Yahoo helped to democratize – in the guise of Hadoop. Aster and Greenplum effectively excised MapReduce from Hadoop and implemented it (as one algorithm among others) inside their massively parallel processing (MPP) database engines; this gave them the ability to perform mapping/reducing operations across their MPP clusters, on top of their own file systems. Hadoop and its Hadoop Distributed File System (HDFS) were nowhere in the mix.

It was, however, a big part of the backstory. Let’s turn the clock back just a bit more, to early 2008, when Greenplum made a move which hinted at what was to come – announcing API-level support for Hadoop and HDFS. In this way, Greenplum positioned its MPP appliance as a kind of choreographer for external MapReduce jobs: by writing to its Hadoop API, developers could schedule MapReduce jobs to run on Hadoop and HDFS. The resulting data, data sets, or analysis could then be recirculated back to the Greenplum RDBMS.

Today, this is one of the schemes by which many in DM would like to accommodate Hadoop and big data. The difference, at least relative to half a decade ago, is a kind of frank acceptance of the inevitability – and, to some extent, of the desirability – of platform heterogeneity. Part of this has to do with the “big” in big data: as volumes scale into the double- or triple-digit terabyte – or even into the petabyte – range, technologists in every IT domain must reassess what they’re doing and where they’re doing it, along with just how they expect to do it in a timely and cost-effective manner. Bound up with this is acceptance of the fact that DM can no longer simply dictate terms: that it must become more responsive to the concerns and requirements of line-of-business stakeholders, as well as to those of its IT peers; that it must open itself up to new types of data, new kinds of analytics, new ways of doing things.

“The overall strategy is one of cooperative computing,” explains Rick Glick, vice president of technology and architecture with analytic discovery specialist ParAccel Inc. “When you’re dealing with terabytes or petabytes [of data], the challenge is that you want to move as little of it as possible. If you’ve got these other [data processing] platforms, you inevitably say, ‘Where is the cheapest place to do it?’” This means proactively adopting technologies or methods that help to promote agility, reduce latency, and empower line-of-business users. This means running the “right” workloads in the “right” place, with “right” being understood as a function of both timeliness and cost-effectiveness.

Stephen Swoyer is a technology writer with more than 15 years of experience. His writing has focused on business intelligence and data warehousing for almost a decade.

ABOUT RADIANT ADVISORS RESEARCH … ADVISE … DEVELOP … Radiant Advisors is a strategic advisory and research firm that networks with industry experts to deliver innovative thought-leadership, cutting-edge publications, and in-depth industry research.

Visit www.radiantadvisors.com | Follow us on Twitter! @radiantadvisors

rediscoveringBI | April 2013

After the Big Data Party