
Introducing: MongoDB
David J. C. Beach

Sunday, August 1, 2010


David Beach

Software Consultant (past 6 years)
Python since v1.4 (late 90’s)
Design, Algorithms, Data Structures
Sometimes Database stuff
not a “frameworks” guy
Organizer: Front Range Pythoneers



Outline

Part I: Trends in Databases
Part II: Mongo Basic Usage
Part III: Advanced Features



Part I: Trends in Databases



Database Trends

Past: “Relational” (RDBMS)

WARNING: extreme oversimplification

Data stored in Tables, Rows, Columns
Relationships designated by Primary, Foreign keys
Data is controlled & queried via SQL



Trends: Criticisms of RDBMS

Rigid data model
Hard to scale / distribute
Slow (transactions, disk seeks)
SQL not well standardized
Awkward for modern/dynamic languages

Lots of disagreement over this.
There are points & counterpoints from both sides.
The debate is not over.
Not here to deliver a verdict.
POINT: This is why we see an explosion of new databases.



Trends: Fragmentation

As with so many things in technology, we’re seeing... FRAGMENTATION!

some examples of DB categories:

Relational with ORM (Hibernate, SQLAlchemy)
ODBMS / ORDBMS (push OO-concepts into database)
Key-Value Stores (MemcacheDB, Redis, Cassandra)
Graph (neo4j)
Document Oriented (Mongo, Couch, etc...)

categories are incomplete
some don’t fit neatly into categories



Where Mongo Fits

Mongo’s Tagline (taken from website)

“The Best Features of Document Databases, Key-Value Stores, and RDBMSes.”



What is Mongo

Document-Oriented Database
Produced by 10gen / Implemented in C++
Source Code Available
Runs on Linux, Mac, Windows, Solaris
Database: GNU AGPL v3.0 License
Drivers: Apache License v2.0



many of these taken straight from home page

Mongo Advantages

json-style documents (dynamic schemas)

fast queries (auto-tuning planner)

flexible indexing (B-Tree)

fast insert & deletes (sometimes trade-offs)

replication and high availability (HA)

automatic sharding support (v1.6)*

easy-to-use API

* sharding support available as of v1.6 (late July 2010)


Mongo Language Bindings

C, C++, Java
Python, Ruby, Perl
PHP, JavaScript
(many more community supported ones)



Mongo Disadvantages

No Relational Model / SQL
Can mimic with foreign IDs, but referential integrity not enforced.

No Explicit Transactions / ACID
Operations can only be atomic within single collection. (Generally)

Limited Query API
You can do a lot more with MapReduce and JavaScript!


When to use Mongo

My personal take on this...

Rich semistructured records (Documents)
Transaction isolation not essential
Humongous amounts of data
Need for extreme speed
You hate schema migrations

Caveat: I’ve never used Mongo in Production!


Part II: Mongo Basic Usage

BRIEFLY cover:
- Download, Install, Configure
- connection, creating DB, creating Collection
- CRUD operations (Insert, Query, Update, Delete)



Installing Mongo

Use a 64-bit OS (Linux, Mac, Windows)
32-bit available; not for production
32-bit limits database to 2 GB!
Mongo uses memory-mapped files.

Get Binaries: www.mongodb.org

Run “mongod” process


Installing PyMongo

Download: http://pypi.python.org/pypi/pymongo/1.7
Build with setuptools (includes C extension for speed)

# python setup.py install
# python setup.py --no-ext install    (to compile without extension)



Mongo Anatomy

Mongo Server
    Database
        Collection
            Document



Getting a Connection

Connection required for using Mongo

>>> import pymongo
>>> connection = pymongo.Connection("localhost")



Finding a Database

Databases = logically separate stores
Navigation using properties
Will create DB if not found

>>> db = connection.mydatabase



Using a Collection

Collection is analogous to Table
Contains documents
Will create collection if not found

>>> blog = db.blog



Inserting

collection.insert(document) => document_id

>>> entry1 = {"title": "Mongo Tutorial",
...           "body": "Here's a document to insert."}    # document
>>> blog.insert(entry1)
ObjectId('4c3a12eb1d41c82762000001')


Inserting (contd.)

Documents must have ‘_id’ field
Automatically generated unless assigned
12-byte unique binary value

>>> entry1
{'_id': ObjectId('4c3a12eb1d41c82762000001'),
 'body': "Here's a document to insert.",
 'title': 'Mongo Tutorial'}

You can also assign your own ‘_id’; it can be any unique value.

ID generated by driver. No waiting on DB.

Mongo’s IDs are designed to be unique, even if hundreds of thousands of documents are generated per second, on numerous clustered machines.
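To make the "12-byte unique binary value" concrete, here is a minimal sketch of how a driver could assemble such an id on the client side. It assumes the ObjectId layout of that era (4-byte timestamp, 3-byte machine hash, 2-byte process id, 3-byte counter); the function names `make_object_id` and `split_object_id` are hypothetical and this is not PyMongo's actual implementation.

```python
import hashlib
import os
import socket
import struct
import time

_counter = 0  # per-process counter (hypothetical; a real driver may seed it randomly)


def make_object_id():
    """Build 12 bytes: timestamp (4) + machine (3) + pid (2) + counter (3)."""
    global _counter
    _counter = (_counter + 1) % 0xFFFFFF
    timestamp = struct.pack(">i", int(time.time()))
    # Assumption: any stable hash of the hostname will do for this sketch.
    machine = hashlib.sha256(socket.gethostname().encode("utf-8")).digest()[:3]
    pid = struct.pack(">H", os.getpid() % 0xFFFF)
    counter = struct.pack(">i", _counter)[1:]  # low 3 bytes, big-endian
    return timestamp + machine + pid + counter


def split_object_id(hex_id):
    """Split a 24-character hex id into (timestamp, machine, pid, counter)."""
    raw = bytes.fromhex(hex_id)
    timestamp = struct.unpack(">i", raw[:4])[0]
    pid = struct.unpack(">H", raw[7:9])[0]
    counter = int.from_bytes(raw[9:], "big")
    return timestamp, raw[4:7].hex(), pid, counter


first = make_object_id()
second = make_object_id()
print(first.hex(), second.hex())
print(split_object_id("4c3a12eb1d41c82762000001"))
```

Because every part is available locally, no round trip to the server is needed, which is consistent with the "No waiting on DB" point.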


Inserting (contd.)

Documents may have different properties
Properties may be atomic, lists, dictionaries

>>> entry2 = {"title": "Another Post",
...           "body": "Mongo is powerful",
...           "author": "David",
...           "tags": ["Mongo", "Power"]}    # another document
>>> blog.insert(entry2)
ObjectId('4c3a1a501d41c82762000002')


Indexing

May create index on any field
If field is list => index associates all values

>>> blog.ensure_index("author")    # index by single value
>>> blog.ensure_index("tags")      # by multiple values
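The idea that an index on a list field "associates all values" can be sketched in plain Python. This is only an in-memory illustration with a dictionary, not Mongo's B-Tree; `build_index` and the sample posts are hypothetical.

```python
from collections import defaultdict


def build_index(documents, field):
    """Map each indexed value to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc in documents:
        value = doc.get(field)
        # A list-valued field contributes one index entry per element.
        values = value if isinstance(value, list) else [value]
        for item in values:
            index[item].add(doc["_id"])
    return index


posts = [
    {"_id": 1, "author": "David", "tags": ["Mongo", "Power"]},
    {"_id": 2, "author": "Robot", "tags": ["bulk", "Green"]},
    {"_id": 3, "author": "David", "tags": ["bulk", "Blue"]},
]

by_author = build_index(posts, "author")
by_tag = build_index(posts, "tags")
print(sorted(by_author["David"]))
print(sorted(by_tag["bulk"]))
```

A lookup by any single tag then finds every document whose list contains that tag.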


Bulk Insert

Let’s produce 100,000 fake posts

import random

bulk_entries = []
for i in range(100000):
    entry = {
        "title": "Bulk Entry #%i" % (i + 1),
        "body": "What Content!",
        "author": random.choice(["David", "Robot"]),
        "tags": ["bulk", random.choice(["Red", "Blue", "Green"])],
    }
    bulk_entries.append(entry)



Bulk Insert (contd.)

collection.insert(list_of_documents)
Inserts 100,000 entries into blog
Returns in 2.11 seconds

>>> blog.insert(bulk_entries)
[ObjectId(...), ObjectId(...), ...]



Bulk Insert (contd.)

driver returns early; DB is still working
...unless you specify “safe=True”
returns in 7.90 seconds (vs. 2.11 seconds)

>>> blog.remove()    # clear everything
>>> blog.insert(bulk_entries, safe=True)



Querying

collection.find_one(spec) => document
spec = document of query parameters

>>> blog.find_one({"title": "Bulk Entry #12253"})
{u'_id': ObjectId('4c3a1e411d41c82762018a89'),
 u'author': u'Robot',
 u'body': u'What Content!',
 u'tags': [u'bulk', u'Green'],
 u'title': u'Bulk Entry #99999'}

returned in 0.04s - extremely fast
No index created for “title”!
presumably, need more entries to effectively test index performance...


Querying (Specs)

Multiple conditions on document => “AND”
Value for tags is an “ANY” match

>>> blog.find_one({"title": "Bulk Entry #12253", "tags": "Green"})
{u'_id': ObjectId('4c3a1e411d41c82762018a89'),
 u'author': u'Robot',
 u'body': u'What Content!',
 u'tags': [u'bulk', u'Green'],
 u'title': u'Bulk Entry #99999'}
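The "AND" and "ANY" rules can be illustrated with a small pure-Python matcher. This is a sketch of the semantics only, assuming simple equality specs; the function `matches` is hypothetical and does not reflect how the server evaluates queries.

```python
def matches(doc, spec):
    """Return True if doc satisfies every key/value condition in spec (AND)."""
    for key, wanted in spec.items():
        actual = doc.get(key)
        if isinstance(actual, list):
            # List field: match if ANY element equals the wanted value
            # (or if the whole list is equal to the wanted value).
            if wanted not in actual and wanted != actual:
                return False
        elif actual != wanted:
            return False
    return True


post = {"title": "Bulk Entry #12253", "author": "Robot",
        "tags": ["bulk", "Green"]}

print(matches(post, {"tags": "Green"}))
print(matches(post, {"title": "Bulk Entry #12253", "tags": "Green"}))
print(matches(post, {"author": "David", "tags": "Green"}))
```

The third spec fails because one of its two conditions does not hold, even though the tag matches.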


Querying (Multiple)

collection.find(spec) => cursor
new items are fetched in bulk (behind the scenes)

>>> green_items = []
>>> for item in blog.find({"tags": "Green"}):
...     green_items.append(item)

- or -

>>> green_items = list(blog.find({"tags": "Green"}))



Querying (Counting)

Use the find() method + count()
Returns number of matches found

>>> blog.find({"tags": "Green"}).count()
16646



Updating

collection.update(spec, document)
updates single document matching spec
“multi=True” => updates all matching docs

>>> item = blog.find_one({"title": "Bulk Entry #12253"})
>>> item["tags"].append("New")
>>> blog.update({"_id": item["_id"]}, item)



Deleting

use remove(...)
it works like find(...)

>>> blog.remove({"author": "Robot"}, safe=True)

Example removed approximately 50% of records. Took 2.48 seconds



Part III: Advanced Features



Advanced Querying

Regular Expressions
{"tags": re.compile(r"^(Green|Blue)$")}

Nested Values
{"foo.bar.x": 3}

$where Clause (JavaScript)
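Two details of these specs are easy to try locally: how a dotted key walks into nested values, and how anchors interact with alternation in a regular expression. The sketch below is pure Python; `get_path` and the sample document are hypothetical and only mimic the lookup.

```python
import re


def get_path(doc, dotted):
    """Follow a dotted key such as "foo.bar.x" through nested dictionaries."""
    current = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


doc = {"foo": {"bar": {"x": 3}}, "tags": ["bulk", "Greenish"]}
print(get_path(doc, "foo.bar.x"))

# Without grouping, each anchor binds to only one alternative:
# "^Green|Blue$" means "starts with Green" OR "ends with Blue".
loose = re.compile(r"^Green|Blue$")
strict = re.compile(r"^(Green|Blue)$")
print([bool(loose.search(tag)) for tag in doc["tags"]])
print([bool(strict.search(tag)) for tag in doc["tags"]])
```

The grouped pattern matches only the exact values "Green" or "Blue", which is usually the intent when filtering tags.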



Advanced Querying

$lt, $gt, $lte, $gte, $ne
$in, $nin, $mod, $all, $size, $exists, $type
$or, $not
$elemMatch

>>> blog.find({"$or": [{"tags": "Green"}, {"tags": "Blue"}]})
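To show how operator documents read, here is a small evaluator for a handful of the operators listed ($lt, $lte, $gt, $gte, $ne, $in, $nin, $or). It is a sketch of the semantics under simplifying assumptions, not the server's query engine; `matches`, `field_matches`, and the `votes` field are hypothetical.

```python
import operator

COMPARISONS = {
    "$lt": operator.lt, "$lte": operator.le,
    "$gt": operator.gt, "$gte": operator.ge,
}


def field_matches(actual, condition):
    """Check one field against a plain value or an operator document."""
    if isinstance(condition, dict):
        for op, operand in condition.items():
            if op == "$in":
                ok = actual in operand
            elif op == "$nin":
                ok = actual not in operand
            elif op == "$ne":
                ok = actual != operand
            else:
                # Range operators never match a missing field in this sketch.
                ok = actual is not None and COMPARISONS[op](actual, operand)
            if not ok:
                return False
        return True
    # Plain value: equality, or membership when the field holds a list.
    if isinstance(actual, list):
        return condition in actual
    return actual == condition


def matches(doc, spec):
    """Top-level keys are ANDed; "$or" takes a list of alternative specs."""
    for key, condition in spec.items():
        if key == "$or":
            if not any(matches(doc, alt) for alt in condition):
                return False
        elif not field_matches(doc.get(key), condition):
            return False
    return True


post = {"title": "Another Post", "votes": 7, "tags": ["bulk", "Blue"]}
print(matches(post, {"$or": [{"tags": "Green"}, {"tags": "Blue"}]}))
print(matches(post, {"votes": {"$gt": 5, "$lte": 7}}))
print(matches(post, {"votes": {"$in": [1, 2, 3]}}))
```

Several operators inside one operator document are combined with AND, as in the `$gt` / `$lte` range above.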



Advanced Querying

collection.find(...)
sort(“name”) - sorting
limit(...) & skip(...) [like LIMIT & OFFSET]
distinct(...) [like SQL’s DISTINCT]
collection.group(...) - like SQL’s GROUP BY

>>> blog.find().limit(50)                  # find 50 articles
>>> blog.find().sort("title").limit(30)    # 30 titles
>>> blog.find().distinct("author")         # unique author names

won’t be showing detailed examples of all these...
there are good tutorials online for all of this
let’s move on to something even more interesting


Map/Reduce

Most powerful querying mechanism

collection.map_reduce(mapper, reducer)
ultimate in querying power
distribute across multiple nodes
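The shape of a map/reduce job can be tried in plain Python before writing any JavaScript. The sketch below runs everything in one process and only illustrates the map, group, reduce flow; the helper `map_reduce` and the sample posts are hypothetical, and nothing here talks to Mongo.

```python
from collections import defaultdict


def map_reduce(documents, mapper, reducer):
    """Run mapper over every document, group emitted pairs, reduce each group."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def tag_mapper(doc):
    # Emit (tag, 1) once per tag on the document.
    for tag in doc.get("tags", []):
        yield tag, 1


def count_reducer(key, values):
    # Sum the partial counts emitted for one key.
    return sum(values)


posts = [
    {"title": "Another Post", "tags": ["Mongo", "Power"]},
    {"title": "Bulk Entry #1", "tags": ["bulk", "Green"]},
    {"title": "Bulk Entry #2", "tags": ["bulk", "Blue"]},
]

tag_counts = map_reduce(posts, tag_mapper, count_reducer)
print(tag_counts)
```

Because the reducer only needs the values for one key at a time, the grouping step is what allows the work to be split across nodes.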



Map/Reduce Visualized

[Figure: MapReduce logical data flow]

Diagram Credit: Hadoop: The Definitive Guide by Tom White; O’Reilly Books, Chapter 2, page 20
also see: Map/Reduce : A Visual Explanation


MySQL

    SELECT
      Dim1, Dim2,
      SUM(Measure1) AS MSum,
      COUNT(*) AS RecordCount,
      AVG(Measure2) AS MAvg,
      MIN(Measure1) AS MMin,
      MAX(CASE WHEN Measure2 < 100 THEN Measure2 END) AS MMax
    FROM DenormAggTable
    WHERE (Filter1 IN ('A','B'))
      AND (Filter2 = 'C')
      AND (Filter3 > 123)
    GROUP BY Dim1, Dim2
    HAVING (MMin > 0)
    ORDER BY RecordCount DESC
    LIMIT 4, 8

MongoDB

    db.runCommand({
      mapreduce: "DenormAggCollection",
      query: {
        filter1: { '$in': [ 'A', 'B' ] },
        filter2: 'C',
        filter3: { '$gt': 123 }
      },
      map: function() {
        emit(
          { d1: this.Dim1, d2: this.Dim2 },
          { msum: this.measure1, recs: 1, mmin: this.measure1,
            mmax: this.measure2 < 100 ? this.measure2 : 0 }
        );
      },
      reduce: function(key, vals) {
        var ret = { msum: 0, recs: 0, mmin: 0, mmax: 0 };
        for(var i = 0; i < vals.length; i++) {
          ret.msum += vals[i].msum;
          ret.recs += vals[i].recs;
          if(vals[i].mmin < ret.mmin) ret.mmin = vals[i].mmin;
          if((vals[i].mmax < 100) && (vals[i].mmax > ret.mmax))
            ret.mmax = vals[i].mmax;
        }
        return ret;
      },
      finalize: function(key, val) {
        val.mavg = val.msum / val.recs;
        return val;
      },
      out: 'result1',
      verbose: true
    });
    db.result1.
      find({ mmin: { '$gt': 0 } }).
      sort({ recs: -1 }).
      skip(4).
      limit(8);

Chart annotations:

Grouped dimension columns are pulled out as keys in the map function, reducing the size of the working set.
Measures must be manually aggregated.
Aggregates depending on record counts must wait until finalization.
Measures can use procedural logic.
Filters have an ORM/ActiveRecord-looking style.
Aggregate filtering must be applied to the result set, not in the map/reduce.
Ascending: 1; Descending: -1

Rick Osborne, rickosborne.org
http://rickosborne.org/download/SQL-to-MongoDB.pdf


Map/Reduce Examples

This is me, playing with Map/Reduce



Health Clinic Example

Person registers with the Clinic
Weighs in on the scale
1 year => comes in 100 times



Health Clinic Example

person = {
    "name": "Bob",
    "weighings": [
        {"date": date(2009, 1, 15), "weight": 165.0},
        {"date": date(2009, 2, 12), "weight": 163.2},
        ...
    ]
}



Map/Reduce Insert Script

import datetime
import random

all_people = []
for i in range(N):    # N = number of people to generate
    person = {'name': 'person%04i' % i}
    weighings = person['weighings'] = []
    std_weight = random.uniform(100, 200)
    for w in range(100):
        date = (datetime.datetime(2009, 1, 1) +
                datetime.timedelta(days=random.randint(0, 365)))
        weight = random.normalvariate(std_weight, 5.0)
        weighings.append({'date': date, 'weight': weight})
    weighings.sort(key=lambda x: x['date'])
    all_people.append(person)



Insert Data Performance

[Chart: insert time, LOG-LOG scale; linear scaling]

1k: 3.14s
10k: 29.5s
100k: 292s
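The "linear scaling" reading of the log-log chart can be checked with a little arithmetic: on log-log axes, linear growth shows up as a slope close to 1. The sketch below uses the three insert timings reported for 1k, 10k, and 100k; pairing each time with its size is an assumption based on their order, and `loglog_slope` is a hypothetical helper.

```python
import math

# (problem size, seconds) pairs for the reported insert timings
timings = [(1000, 3.14), (10000, 29.5), (100000, 292.0)]


def loglog_slope(points):
    """Slope between the first and last point on log-log axes."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    return (math.log10(y1) - math.log10(y0)) / (math.log10(x1) - math.log10(x0))


slope = loglog_slope(timings)
print(round(slope, 2))
```

A slope near 1 means the time grows roughly in proportion to the amount of data inserted.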


Map/Reduce Total Weight by Day

from pymongo.code import Code    # PyMongo 1.x location; later releases use bson.code

map_fn = Code("""function () {
    this.weighings.forEach(function(z) {
        emit(z.date, z.weight);
    });
}""")

reduce_fn = Code("""function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i];
    }
    return total;
}""")

result = people.map_reduce(map_fn, reduce_fn)



Map/Reduce Total Weight by Day

>>> for doc in result.find():
...     print doc
{u'_id': datetime.datetime(2009, 1, 1, 0, 0), u'value': 39136.600753163315}
{u'_id': datetime.datetime(2009, 1, 2, 0, 0), u'value': 41685.341024046182}
{u'_id': datetime.datetime(2009, 1, 3, 0, 0), u'value': 38232.326554504165}
... lots more ...



Total Weight by Day Performance

[Chart: MapReduce time]

1k: 4.29s
10k: 38.8s
100k: 384s


Map/Reduce Weight on Day

map_fn = Code("""function () {
    var target_date = new Date(2009, 9, 5);
    var pos = bsearch(this.weighings, "date", target_date);
    var recent = this.weighings[pos];
    emit(this._id, { name: this.name,
                     date: recent.date,
                     weight: recent.weight });
};""")

reduce_fn = Code("""function (key, values) {
    return values[0];
};""")

result = people.map_reduce(map_fn, reduce_fn, scope={"bsearch": bsearch})



Map/Reduce bsearch() function

bsearch = Code("""function(array, prop, value) {
    var min, max, mid, midval;
    for(min = 0, max = array.length - 1; min <= max; ) {
        mid = min + Math.floor((max - min) / 2);
        midval = array[mid][prop];
        if(value === midval) {
            break;
        } else if(value > midval) {
            min = mid + 1;
        } else {
            max = mid - 1;
        }
    }
    return (midval > value) ? mid - 1 : mid;
};""")


Weight on Day Performance

[Chart: MapReduce time]

1k: 1.23s
10k: 10s
100k: 108s


Weight on Day (Python Version)

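import bisect
import datetime

target_date = datetime.datetime(2009, 10, 5)
for person in people.find():
    dates = [w['date'] for w in person['weighings']]
    # index of the most recent weighing on or before target_date
    pos = bisect.bisect_right(dates, target_date) - 1
    val = person['weighings'][max(pos, 0)]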
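The bisect lookup can be tried without a database on a few sample weighings. In this sketch the first two records come from the Health Clinic example and the third is made up; `weight_on` is a hypothetical helper. Note that `bisect_right` returns an insertion point, so the latest weighing on or before the target date sits one position to its left.

```python
import bisect
import datetime

weighings = [
    {"date": datetime.datetime(2009, 1, 15), "weight": 165.0},
    {"date": datetime.datetime(2009, 2, 12), "weight": 163.2},
    {"date": datetime.datetime(2009, 11, 3), "weight": 161.8},  # made-up sample
]


def weight_on(weighings, target_date):
    """Return the latest weighing dated on or before target_date, or None."""
    dates = [w["date"] for w in weighings]
    pos = bisect.bisect_right(dates, target_date)
    return weighings[pos - 1] if pos else None


print(weight_on(weighings, datetime.datetime(2009, 10, 5)))
print(weight_on(weighings, datetime.datetime(2009, 1, 1)))
```

Returning None when no weighing precedes the target date is a design choice for this sketch; a caller could equally fall back to the earliest record.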



Map/Reduce Performance

[Chart: Weight on Day, MapReduce vs. Python]

        MapReduce   Python
1k      1.23s       0.37s
10k     10s         2.2s
100k    108s        26s


Summary



Resources

www.mongodb.org
PyMongo: api.mongodb.org/python
www.10gen.com
MongoDB: The Definitive Guide (O’Reilly)


END OF SLIDES



Chalkboard is not Comic Sans

This is Chalkboard, not Comic Sans.
This isn’t Chalkboard, it’s Comic Sans.

does it matter, anyway?


