
Become Efficient or Die: The Story of BackType

Nathan Marz @nathanmarz


BackType

BackType helps businesses understand social media and make use of it


BackType • Data Services (APIs) • Social Media Analytics Dashboard


APIs • Conversational graph for a URL • Comment search • # Tweets / URL • Influence scores • Top sites • Trending links stream • etc.



URL Profiles


Site comparisons


Influencer Profiles


Twitter Account Analytics


Topic Analysis




BackType stats • >30 TB of data • 100 to 200 machine cluster • Process 100M messages per day • Serve 300 req/sec


BackType stats • 3 full-time employees • 2 interns • $1.4M in funding


How? • Avoid waste • Invest in efficiency


Development philosophy • Waterfall • Agile • Scrum • Kanban


Development philosophy Suffering-oriented programming


Suffering-oriented Programming Don’t add process until you feel the pain of not having it


Example

• Growing from 2 people to 3 people


Example • Founders were essentially “one brain” • Cowboy coding led to communication mishaps

• Added biweekly meeting to sync up


Example

• Growing from 3 people to 5 people


Example • Moving through tasks a lot faster with 5 people

• Needed more frequent prioritization of tasks

• Changed biweekly meeting to weekly meeting

• Added “chat room standups” to facilitate mid-week adjustments


Suffering-oriented Programming Don’t build new technology until you feel the pain of not having it


Suffering-oriented Programming First, make it possible. Then, make it beautiful. Then, make it fast.


Example

• Batch processing


Make it possible • Hack things out using MapReduce/Cascading

• Learn the ins and outs of batch processing


Make it beautiful • Wrote (and open-sourced) Cascalog • The “perfect interface” to our data


Make it fast • Use it in production • Profile and identify bottlenecks • Optimize


Overengineering Attempting to create beautiful software without a thorough understanding of the problem domain


Premature optimization Optimizing before creating a “beautiful” design, which creates unnecessary complexity


Knowledge debt (chart with the series “Your productivity” and “Your potential”; the label “Knowledge debt” appears on the chart)


Knowledge debt Use small, independent projects to experiment with new technology


Example • Needed to write a small server to collect records into a distributed filesystem

• Wrote it using the Clojure programming language

• Huge win: now we use Clojure for most of our systems


Example • Needed to implement social search • Wrote it using Neo4j • Ran into a lot of problems with Neo4j and rewrote it later using Sphinx


Example • Needed an automated deploy for a distributed stream processing system

• Wrote it using Pallet • Massive win: we anticipate a dramatic reduction in the complexity of administering infrastructure


Knowledge debt

(Crappy job ad)


Knowledge debt Instead of hiring people who share your skill set, hire people with completely different skill sets

(food for thought)


Technical debt Technical debt builds up in a codebase


Technical debt • W needs to be refactored • X deploy should be faster • Y needs more unit tests • Z needs more documentation


Technical debt Never high enough priority to work on, but these issues build up and slow you down


BackSweep • Issues are recorded on a wiki page • We spend one day a month removing items from that wiki page


BackSweep • Keeps our codebase lean • Gives us a way to defer technical debt issues when we don’t have time to deal with them

• “Garbage collection for the codebase”


What is a startup? A startup is a human institution designed to deliver a new product or service under conditions of extreme uncertainty. - Eric Ries


How do you decide what to work on?


Don’t want to waste three months building a feature no one cares about. This could be fatal!


Product development (diagram: Form hypothesis -> Test hypothesis -> Learn; Valid? -> Keep; Invalid? -> Discard)


Example


Example

Pro product didn’t actually exist yet


Example • We tested different feature combinations and measured click through rate

• Clicking on “sign up” went to a survey page
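The measurement described above can be sketched in a few lines. This is only an illustration: the variant names and the impression and click counts are made up, and the slides do not say how BackType actually computed or compared click-through rates.

```python
def click_through_rate(clicks, impressions):
    """Fraction of impressions that led to a click (0.0 if nothing was shown)."""
    return clicks / impressions if impressions else 0.0


# Made-up numbers for illustration only; these are not BackType's results.
variants = {
    "variant_a": {"impressions": 1000, "clicks": 30},
    "variant_b": {"impressions": 1000, "clicks": 55},
}

# Click-through rate per feature combination.
rates = {
    name: click_through_rate(v["clicks"], v["impressions"])
    for name, v in variants.items()
}

# The combination that drew the most "sign up" clicks per impression.
best_variant = max(rates, key=rates.get)
```

With real traffic, small differences between variants may be noise, so a comparison like this is a starting point for learning rather than a verdict.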


Example


Hypothesis #1

Customers want analytics on topics being discussed on Twitter


Testing hypothesis #1

• Fake feature -> clicking on topic goes to survey page


Testing hypothesis #1 • Do people click on those links? • If not, need to reconsider hypothesis


Hypothesis #2

Customers want to know how often topics are mentioned over time


Testing hypothesis #2 • Build a graph of topic mentions over time for the “big topics” our private beta customers are interested in (e.g. “nike”, “microsoft”, “apple”, “kodak”)

• Talk to customers


Hypothesis #3 • Customers want to see who’s talking about a topic on a variety of dimensions: recency, influence, num followers, or num retweets


Testing hypothesis #3

• Create a search index on the last 24 hours of data that can sort on all dimensions
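As a rough sketch of what “sort on all dimensions” could mean, the snippet below filters mentions to a 24-hour window and sorts them by a chosen dimension. It uses a plain in-memory list rather than a real search index, and the record fields, the `DIMENSIONS` mapping, and `top_users` are hypothetical names, not BackType's implementation.

```python
from operator import itemgetter

# Hypothetical mention records; timestamps are in seconds.
mentions = [
    {"user": "alice", "timestamp": 300, "influence": 40, "followers": 1200, "retweets": 2},
    {"user": "bob", "timestamp": 100, "influence": 90, "followers": 300, "retweets": 15},
    {"user": "carol", "timestamp": 200, "influence": 10, "followers": 5000, "retweets": 0},
]

# Maps each dimension named on the slide to a (hypothetical) record field.
DIMENSIONS = {
    "recency": "timestamp",
    "influence": "influence",
    "followers": "followers",
    "retweets": "retweets",
}


def top_users(mentions, dimension, now, window_seconds=24 * 60 * 60):
    """Users who mentioned the topic in the window, highest value first."""
    field = DIMENSIONS[dimension]
    recent = [m for m in mentions if now - m["timestamp"] <= window_seconds]
    ranked = sorted(recent, key=itemgetter(field), reverse=True)
    return [m["user"] for m in ranked]
```

For example, `top_users(mentions, "influence", now=400)` ranks the same three users differently than `top_users(mentions, "followers", now=400)`.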



Lean Startup


Questions? Twitter: @nathanmarz Email: nathan.marz@gmail.com Web: http://nathanmarz.com


The Secrets of Building Realtime Big Data Systems

Nathan Marz @nathanmarz


Who am I?



(Upcoming book)


BackType • >30 TB of data • Process 100M messages / day • Serve 300 requests / sec • 100 to 200 machine cluster • 3 full-time employees, 2 interns


Built on open-source • Thrift • Cascading • Scribe • ZeroMQ • ZooKeeper • Pallet


What is a data system? (diagram: Raw data -> View 1, View 2, View 3)


What is a data system? (diagram: Tweets -> # Tweets / URL, Influence scores, Trending topics)


Everything else (schemas, databases, indexing, etc.) is implementation detail


Essential properties of a data system


1. Robust to machine failure and human error


2. Low latency reads and updates


3. Scalable


4. General


5. Extensible


6. Allows ad-hoc analysis


7. Minimal maintenance


8. Debuggable


Layered Architecture • Speed Layer • Batch Layer


Let’s pretend temporarily that update latency doesn’t matter


Let’s pretend it’s OK for a view to lag by a few hours


Batch layer • Arbitrary computation • Horizontally scalable • High latency


Batch layer

Not the end-all-be-all of batch computation, but the most general


Hadoop (diagram: Distributed Filesystem -> Input files -> MapReduce -> Output files -> Distributed Filesystem)


Hadoop • Express your computation in terms of MapReduce

• Get parallelism and scalability “for free”
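To make “express your computation in terms of MapReduce” concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce phases, computing the number of tweets per URL. It is not Hadoop's API: `run_mapreduce`, `map_tweet`, `reduce_counts`, and the tweet records are hypothetical, and everything runs in a single process, so it shows only the programming model, not the parallelism.

```python
from collections import defaultdict


def map_tweet(tweet):
    """Map phase: emit a (url, 1) pair for every URL in a tweet."""
    for url in tweet["urls"]:
        yield url, 1


def reduce_counts(url, counts):
    """Reduce phase: sum all the 1s emitted for a single URL."""
    return url, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for a MapReduce job."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # "shuffle": group values by key
    return dict(reducer(key, values) for key, values in groups.items())


# Hypothetical input records.
tweets = [
    {"id": 1, "urls": ["http://a.example"]},
    {"id": 2, "urls": ["http://a.example", "http://b.example"]},
    {"id": 3, "urls": []},
]

tweets_per_url = run_mapreduce(tweets, map_tweet, reduce_counts)
```

Because the mapper looks at one record at a time and the reducer at one key at a time, a framework is free to spread both phases across many machines, which is where the parallelism “for free” comes from.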


Batch layer • Store master copy of dataset • Master dataset is append-only


Batch layer view = fn(master dataset)
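A small sketch of this idea, assuming a toy in-memory master dataset: each view is a pure function of all the records, so it can be thrown away and recomputed at any time. The `MasterDataset` class, the view functions, and the sample records are hypothetical and only illustrate the principle.

```python
from collections import Counter


class MasterDataset:
    """Append-only store: records can be added but never updated or deleted."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def records(self):
        # Hand out an immutable snapshot so views cannot modify the master copy.
        return tuple(self._records)


def tweets_per_user(records):
    """Batch view 1: number of tweets per user."""
    return dict(Counter(user for user, _text in records))


def retweet_count(records):
    """Batch view 2: number of records that look like retweets."""
    return sum(1 for _user, text in records if text.startswith("RT "))


master = MasterDataset()
master.append(("bob", "Data is fun!"))
master.append(("alice", "RT @bob Data is fun!"))
master.append(("alice", "Hello"))

# Every view is recomputed from scratch from the full master dataset.
view_1 = tweets_per_user(master.records())
view_2 = retweet_count(master.records())
```

If a view function has a bug, fixing the function and recomputing is enough to repair the view, because the master dataset itself was never modified.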


Batch layer (diagram: Master dataset -> MapReduce -> Batch View 1, Batch View 2, Batch View 3)


Batch layer • In practice, it is too expensive to fully recompute each view to get updates

• A production batch workflow adds the minimum amount of incrementalization necessary for performance


Incremental batch layer (diagram: New data -> Batch workflow; Append and Query arrows to All data; View maintenance arrows to Batch View 1, Batch View 2, Batch View 3)
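The difference between full recomputation and incremental view maintenance can be sketched as follows. The record shape and function names are assumptions made for this example; the slides do not describe the actual workflow code.

```python
from collections import Counter


def full_recompute(all_data):
    """Rebuild the '# tweets per URL' view from the entire master dataset."""
    return Counter(record["url"] for record in all_data)


def incremental_update(view, new_data):
    """Fold only the new records into a copy of the existing view."""
    updated = Counter(view)
    updated.update(record["url"] for record in new_data)
    return updated


# Hypothetical data: what the master dataset already holds, and a new batch.
all_data = [{"url": "http://a.example"}, {"url": "http://b.example"}]
new_data = [{"url": "http://a.example"}, {"url": "http://c.example"}]

old_view = full_recompute(all_data)

# Append the new data to the master dataset, then maintain the view.
all_data = all_data + new_data
new_view = incremental_update(old_view, new_data)
```

For a simple count, the incremental result matches a full recompute over the appended dataset. Views that are harder to update in place are where a workflow might still fall back to recomputation.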


Batch layer • Robust and fault-tolerant to both machine and human error • Low latency reads • Low latency updates • Scalable to increases in data or traffic • Extensible to support new features or related services • Generalizes to diverse types of data and requests • Allows ad hoc queries • Minimal maintenance • Debuggable: can trace how any value in the system came to be


Speed layer Compensates for the high latency of updates to the batch layer


Speed layer Key point: Only needs to compensate for data not yet absorbed into the batch layer

Hours of data instead of years of data


Application-level Queries (diagram: Query the Batch Layer and Query the Speed Layer, then Merge the results)
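A minimal sketch of an application-level query, assuming both layers expose a “# tweets per URL” view as a plain dictionary. The view contents and the function name are hypothetical; for a count, merging is just addition, while other kinds of views would need their own merge logic.

```python
def query_tweet_count(url, batch_view, speed_view):
    """Merge the batch layer's count with the speed layer's count for a URL."""
    return batch_view.get(url, 0) + speed_view.get(url, 0)


# Hypothetical views: the batch view covers all absorbed data, while the
# speed view covers only the data that arrived since the last batch run.
batch_view = {"http://a.example": 120, "http://b.example": 7}
speed_view = {"http://a.example": 3, "http://c.example": 1}

count_a = query_tweet_count("http://a.example", batch_view, speed_view)
```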


Speed layer Once data is absorbed into the batch layer, the speed layer results for that data can be discarded


Speed layer • Message passing • Incremental algorithms • Read/write databases (Riak, Cassandra, HBase, etc.)


Speed layer

Significantly more complex than the batch layer


Speed layer

But the batch layer eventually overrides the speed layer


Speed layer

So that complexity is transient


Flexibility in layered architecture • Run a slow but accurate algorithm in the batch layer

• Run a fast but approximate algorithm in the speed layer

• “Eventual accuracy”
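One way to picture “eventual accuracy” is a unique-user count. In the sketch below, the batch layer deduplicates across all data, while the speed layer deduplicates only within the recent window and may therefore double count users the batch layer has already seen. The algorithm choice, function names, and sample events are assumptions made for this illustration, not the approach used at BackType.

```python
def exact_unique_users(all_events):
    """Batch layer: exact, but has to look at every event ever recorded."""
    return len({user for user, _url in all_events})


def approximate_unique_users(batch_count, recent_events):
    """Speed layer: cheap estimate that ignores overlap with absorbed data."""
    recent_users = {user for user, _url in recent_events}
    return batch_count + len(recent_users)


# Hypothetical events: (user, url) pairs.
absorbed = [("alice", "http://a.example"), ("bob", "http://a.example"),
            ("carol", "http://a.example")]
recent = [("alice", "http://a.example"), ("dave", "http://a.example"),
          ("dave", "http://a.example")]

batch_count = exact_unique_users(absorbed)
estimate_now = approximate_unique_users(batch_count, recent)

# After the next batch run absorbs the recent events, the exact answer
# replaces the estimate.
exact_later = exact_unique_users(absorbed + recent)
```

Here the estimate counts “alice” twice, and the error disappears once the batch layer absorbs the recent events.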


Data model Every record is a single, discrete fact at a moment in time


Data model • Alice lives in San Francisco as of time 12345 • Bob and Gary are friends as of time 13723 • Alice lives in New York as of time 19827


Data model • Remember: master dataset is append-only • A person can have multiple location records

• “Current location” is a view on this data:

pick location with most recent timestamp
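The “current location” view can be sketched directly from the example facts. The `LocationFact` type and `current_locations` function are hypothetical names; the Alice records and timestamps come from the slides, and the Bob record is invented for the example.

```python
from typing import NamedTuple


class LocationFact(NamedTuple):
    """A single fact: a person lived in a location as of a timestamp."""
    person: str
    location: str
    timestamp: int


# Append-only facts; Alice's earlier location is never overwritten.
facts = [
    LocationFact("Alice", "San Francisco", 12345),
    LocationFact("Bob", "Chicago", 15000),  # invented record for illustration
    LocationFact("Alice", "New York", 19827),
]


def current_locations(facts):
    """View: for each person, the location with the most recent timestamp."""
    latest = {}
    for fact in facts:
        previous = latest.get(fact.person)
        if previous is None or fact.timestamp > previous.timestamp:
            latest[fact.person] = fact
    return {person: fact.location for person, fact in latest.items()}


locations = current_locations(facts)
```

Because the older fact is still in the dataset, a different view (for example, “where did Alice live at time 13000?”) can be computed later without changing how data is stored.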


Data model • Extremely useful having the full history for each entity

• Doing analytics • Recovering from mistakes (like writing bad data)


Data model (diagram: a graph of facts. Nodes: Alice, Bob, Tweet: 123, Tweet: 456. Edge labels: Reactor, Reaction, Property. Property values: Gender: female; Reshare: true; Content: “Data is fun!”; Content: “RT @bob Data is fun!”)



