
Become Efficient or Die: The Story of BackType

Nathan Marz @nathanmarz


BackType helps businesses understand social media and make use of it

BackType Data Services (APIs) Social Media Analytics Dashboard

APIs • Conversational graph for url • Comment search • #Tweets / URL • Influence scores • Top sites • Trending links stream • etc.

URL Profiles

Site comparisons

Influencer Profiles

Twitter Account Analytics

Topic Analysis


BackType stats • >30 TB of data • 100 to 200 machine cluster • Process 100M messages per day • Serve 300 req/sec

BackType stats • 3 full-time employees • 2 interns • $1.4M in funding

How? • Avoid waste • Invest in efficiency

Development philosophy • Waterfall • Agile • Scrum • Kanban

Development philosophy Suffering-oriented programming

Suffering-oriented Programming Don’t add process until you feel the pain of not having it


Example • Growing from 2 people to 3 people • Founders were essentially “one brain” • Cowboy coding led to communication mishaps • Added a biweekly meeting to sync up


Example • Growing from 3 people to 5 people • Moving through tasks a lot faster with 5 people • Needed more frequent prioritization of tasks • Changed the biweekly meeting to a weekly meeting • Added “chat room standups” to facilitate mid-week adjustments

Suffering-oriented Programming Don’t build new technology until you feel the pain of not having it

Suffering-oriented Programming First, make it possible. Then, make it beautiful. Then, make it fast.


Example: Batch processing

Make it possible • Hack things out using MapReduce/Cascading • Learn the ins and outs of batch processing

Make it beautiful • Wrote (and open-sourced) Cascalog • The “perfect interface” to our data

Make it fast • Use it in production • Profile and identify bottlenecks • Optimize

Overengineering Attempting to create beautiful software without a thorough understanding of the problem domain

Premature optimization Optimizing before creating a “beautiful” design, creating unnecessary complexity

Knowledge debt (chart: the gap between your productivity and your potential)

Knowledge debt Use small, independent projects to experiment with new technology

Example • Needed to write a small server to collect records into a distributed filesystem • Wrote it using the Clojure programming language • Huge win: now we use Clojure for most of our systems

Example • Needed to implement social search • Wrote it using Neo4j • Ran into a lot of problems with Neo4j and rewrote it later using Sphinx

Example • Needed an automated deploy for a distributed stream processing system • Wrote it using Pallet • Massive win: anticipate a dramatic reduction in the complexity of administering infrastructure

Knowledge debt

(Crappy job ad)

Knowledge debt Instead of hiring people who share your skill set, hire people with completely different skill sets

(food for thought)

Technical debt Technical debt builds up in a codebase

Technical debt • W needs to be refactored • X deploy should be faster • Y needs more unit tests • Z needs more documentation

Technical debt Never high enough priority to work on, but these issues build up and slow you down

BackSweep • Issues are recorded on a wiki page • We spend one day a month removing items from that wiki page

BackSweep • Keeps our codebase lean • Gives us a way to defer technical debt issues when we don’t have time to deal with them • “Garbage collection for the codebase”

What is a startup? A startup is a human institution designed to deliver a new product or service under conditions of extreme uncertainty. - Eric Ries

How do you decide what to work on?

Don’t want to waste three months building a feature no one cares about This could be fatal!

Product development (diagram: Form hypothesis → Test hypothesis → Valid? → repeat)
Example • The “Pro” product didn’t actually exist yet

Example • We tested different feature combinations and measured click-through rate • Clicking on “sign up” went to a survey page


Hypothesis #1 Customers want analytics on topics being discussed on Twitter

Testing hypothesis #1 • Fake feature -> clicking on a topic goes to a survey page

Testing hypothesis #1 • Do people click on those links? • If not, need to reconsider hypothesis

Hypothesis #2 Customers want to know how often topics are mentioned over time

Testing hypothesis #2 • Build a topic-mentions-over-time graph for the “big topics” our private beta customers are interested in (e.g. “nike”, “microsoft”, “apple”, “kodak”) • Talk to customers

Hypothesis #3 • Customers want to see who’s talking about a topic on a variety of dimensions: recency, influence, num followers, or num retweets

Testing hypothesis #3 • Create a search index on the last 24 hours of data that can sort on all dimensions
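The sortable index above can be sketched in a few lines. This is a hypothetical, in-memory illustration (not BackType's Sphinx setup, and all names are invented): mentions carry the four dimensions the slide lists, and a query sorts on whichever one the caller asks for.

```python
# Hypothetical sketch: an in-memory "index" over recent mentions that can
# sort on any dimension: recency (timestamp), influence, followers, retweets.
from dataclasses import dataclass

@dataclass
class Mention:
    user: str
    timestamp: int    # seconds since epoch
    influence: float
    followers: int
    retweets: int

class MentionIndex:
    def __init__(self):
        self.mentions = []

    def add(self, mention):
        self.mentions.append(mention)

    def top(self, dimension, k=10):
        # Sort descending on the requested dimension.
        return sorted(self.mentions,
                      key=lambda m: getattr(m, dimension),
                      reverse=True)[:k]

idx = MentionIndex()
idx.add(Mention("alice", 1000, 0.9, 120, 3))
idx.add(Mention("bob",   2000, 0.2, 90000, 15))
idx.add(Mention("carol", 1500, 0.7, 450, 40))

print([m.user for m in idx.top("retweets", 2)])   # ['carol', 'bob']
print(idx.top("timestamp", 1)[0].user)            # bob (most recent)
```

A real search engine like Sphinx does the same thing with on-disk indexes and attribute-based sorting; the point here is only that one index can serve several ranking dimensions.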

Lean Startup

Questions? Twitter: @nathanmarz Email: Web:

The Secrets of Building Realtime Big Data Systems Nathan Marz @nathanmarz

Who am I?


(Upcoming book)

BackType • >30 TB of data • Process 100M messages / day • Serve 300 requests / sec • 100 to 200 machine cluster • 3 full-time employees, 2 interns

Built on open source: Thrift, Cascading, Scribe, ZeroMQ, ZooKeeper, Pallet

What is a data system? (diagram: Raw data → View 1, View 2, View 3)

What is a data system? • # Tweets / URL • Influence scores • Trending topics

Everything else (schemas, databases, indexing, etc.) is an implementation detail

Essential properties of a data system

1. Robust to machine failure and human error

2. Low latency reads and updates

3. Scalable

4. General

5. Extensible

6. Allows ad-hoc analysis

7. Minimal maintenance

8. Debuggable

Layered Architecture • Speed Layer • Batch Layer

Let’s pretend temporarily that update latency doesn’t matter

Let’s pretend it’s OK for a view to lag by a few hours

Batch layer • Arbitrary computation • Horizontally scalable • High latency

Batch layer

Not the end-all-be-all of batch computation, but the most general

Hadoop • Distributed Filesystem • MapReduce (diagram: input files → MapReduce → output files)

Hadoop • Express your computation in terms of MapReduce • Get parallelism and scalability “for free”
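The MapReduce model the slide refers to can be shown with a single-process teaching sketch (hypothetical code, not Hadoop): you write only a map function and a reduce function, and the framework handles the grouping (the "shuffle"); on Hadoop the same two functions run in parallel across the cluster.

```python
# Minimal single-process sketch of the MapReduce model: word counting.
from collections import defaultdict

def map_fn(line):
    # Emit (key, value) pairs for each input record.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Combine all values emitted for one key.
    return (word, sum(counts))

def map_reduce(lines, map_fn, reduce_fn):
    groups = defaultdict(list)          # the "shuffle" phase: group by key
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

result = map_reduce(["data is fun", "data is big"], map_fn, reduce_fn)
print(result)  # {'data': 2, 'is': 2, 'fun': 1, 'big': 1}
```

The "for free" part is that map_fn and reduce_fn contain no parallelism logic at all; Hadoop's driver distributes them over many machines unchanged.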

Batch layer • Store master copy of dataset • Master dataset is append-only

Batch layer view = fn(master dataset)
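The equation "view = fn(master dataset)" can be made concrete with a small sketch (the record layout and view here are invented for illustration, not BackType's actual schema): the view is a pure function recomputed from the full, append-only master dataset.

```python
# Sketch of view = fn(master dataset): the "# tweets per URL" view is a
# pure function of the append-only master dataset (hypothetical records).
master_dataset = [
    {"url": "a.com", "event": "tweet"},
    {"url": "a.com", "event": "tweet"},
    {"url": "b.com", "event": "tweet"},
]

def tweets_per_url(dataset):
    view = {}
    for record in dataset:
        view[record["url"]] = view.get(record["url"], 0) + 1
    return view

# Recomputing from scratch makes the view trivially consistent with the data,
# and human errors are fixed by correcting the data and recomputing.
batch_view = tweets_per_url(master_dataset)
print(batch_view)  # {'a.com': 2, 'b.com': 1}
```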

Batch layer (diagram: Master dataset → MapReduce → Batch View 1, Batch View 2, Batch View 3)

Batch layer • In practice, too expensive to fully recompute each view to get updates • A production batch workflow adds the minimum amount of incrementalization necessary for performance

Incremental batch layer (diagram: New data joins All data; the Batch workflow performs View maintenance on Batch View 1, Batch View 2, and Batch View 3, which serve Queries)

Batch layer • Robust and fault-tolerant to both machine and human error • Scalable to increases in data or traffic • Extensible to support new features or related services • Generalizes to diverse types of data and requests • Allows ad hoc queries • Minimal maintenance • Debuggable: can trace how any value in the system came to be • Low latency reads and updates: the one thing the batch layer does not provide

Speed layer Compensate for high latency of updates to batch layer

Speed layer Key point: only needs to compensate for data not yet absorbed into the batch layer

Hours of data instead of years of data

Application-level (diagram: Queries merge results from the Batch Layer and the Speed Layer)

Speed layer Once data is absorbed into batch layer, can discard speed layer results
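The merge step and the discard step can be sketched together (hypothetical views and numbers, not BackType's code): a query sums the batch view, which is hours old, with the realtime view, which covers only data the batch layer has not yet absorbed.

```python
# Sketch of the batch/speed merge for a "# tweets per URL" query.
batch_view = {"a.com": 1000, "b.com": 250}    # computed hours ago
realtime_view = {"a.com": 12, "c.com": 3}     # recent data only

def query(url):
    # Application-level merge: batch result plus realtime compensation.
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

print(query("a.com"))  # 1012

# Once the next batch run absorbs the recent data, the realtime entries are
# simply discarded rather than corrected: the batch layer overrides them,
# which is why speed-layer complexity is transient.
```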

Speed layer • Message passing • Incremental algorithms • Read/Write databases • Riak • Cassandra • HBase • etc.

Speed layer • Significantly more complex than the batch layer • But the batch layer eventually overrides the speed layer • So that complexity is transient

Flexibility in layered architecture • Do a slow and accurate algorithm in the batch layer • Do a fast but approximate algorithm in the speed layer • “Eventual accuracy”
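A classic instance of this split is counting unique users. The sketch below is illustrative (not BackType's code, and linear counting is one possible approximate algorithm, chosen here for brevity): the batch layer computes the exact count from all data, while the speed layer maintains a cheap fixed-size estimate that the next batch run replaces.

```python
# "Eventual accuracy" sketch: exact unique count (batch layer) vs. a fast
# approximate count via linear counting over a fixed-size bitmap (speed layer).
import hashlib
import math

def exact_uniques(items):
    # Batch layer: slow but exact, recomputed from all data.
    return len(set(items))

def approx_uniques(items, m=1024):
    # Speed layer: constant memory, incremental, approximate.
    bitmap = [0] * m
    for item in items:
        h = int(hashlib.md5(item.encode()).hexdigest(), 16)
        bitmap[h % m] = 1
    zeros = bitmap.count(0)
    # Linear counting estimate: n ~= -m * ln(fraction of zero bits).
    return round(-m * math.log(zeros / m))

users = [f"user{i}" for i in range(100)] * 5   # 500 events, 100 unique users
print(exact_uniques(users))    # 100
print(approx_uniques(users))   # close to 100
```

Queries serve the estimate until the batch layer catches up, at which point the exact value overrides it: eventually accurate.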

Data model Every record is a single, discrete fact at a moment in time

Data model • Alice lives in San Francisco as of time 12345 • Bob and Gary are friends as of time 13723 • Alice lives in New York as of time 19827

Data model • Remember: master dataset is append-only • A person can have multiple location records • “Current location” is a view on this data: pick the location with the most recent timestamp
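The "current location" view above is easy to sketch (the fact layout is hypothetical, taken from the slide's Alice example): facts are never updated in place, and the current value is derived by taking the fact with the latest timestamp.

```python
# Sketch of a view over append-only, timestamped facts:
# (person, location, time-of-assertion), as in the slides.
facts = [
    ("Alice", "San Francisco", 12345),
    ("Alice", "New York", 19827),
]

def current_location(person, facts):
    # Derive the current value: the person's fact with the latest timestamp.
    records = [f for f in facts if f[0] == person]
    return max(records, key=lambda f: f[2])[1]

print(current_location("Alice", facts))  # New York
```

Because the San Francisco fact is kept rather than overwritten, the full history stays available for analytics and for recovering from bad writes.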

Data model • Extremely useful to have the full history for each entity • Doing analytics • Recovering from mistakes (like writing bad data)

Data model (graph diagram: Alice has the Property Gender: female; Alice is the Reactor of Tweet 456 (Content: “RT @bob Data is fun!”); Tweet 456 is a Reaction to Tweet 123 (Content: “Data is fun!”), with Reshare: true)

Questions? Twitter: @nathanmarz Email: Web: