Search Appliances for Big Data
Davood Rafiei
University of Alberta
Big Structured Data
2011/2012 top performers
• TPC-C: Oracle / Oracle
• TPC-E: NEC & IBM / SQL Server
• TPC-H: Dell/Exasol and Cisco/VectorWise
Big Scientific Data
• ATLAS detector at CERN
  – 23 PB per second of raw data
  – 10 PB per year of filtered data
  – Used by more than 150 universities and labs
• Sloan Digital Sky Survey catalog archive
  – DR8 covers 35% of the sky
  – Accessed through SkyServer and Microsoft WorldWide Telescope
Big (Everyone's) Data
• Produced every day
• Left behind every day!
• Used every day
• Interacted with every day
• Analyzed every day
Our Four Appliances
• Reputation gauge (TOPIC, 2000)
• Network visualization (ALVIN, 2005)
• Data extraction (DeWild, 2007)
• Result diversification (Diver, 2010)
TOPIC: Toronto Page Influence Computation (Rafiei, Mendelzon, 2000)
• What are these pages known for?
  – www.cnn.com
  – www.cs.helsinki.fi
  – www.w3.org/People/Berners-Lee
  – www.cs.ualberta.ca
  – www.hot107.ca
[Figure: TOPIC system diagram. Labels: URL, TOPIC, Search engines, Back Links, Reputation. Example back-link text: "search engines compared", "my favorite search engines", "a review of search engines"]
Probability of Visiting a Page
[Figure: pages q and p]

$$R^{n}(p,t) = (1-d)\sum_{q \to p} \frac{R^{n-1}(q,t)}{O(q)} + \begin{cases} \dfrac{d}{N_t} & \text{if page } p \text{ is on topic } t \\ 0 & \text{otherwise} \end{cases}$$
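The recurrence above can be iterated directly. The sketch below is a minimal illustration, not the TOPIC implementation. It assumes that d is the probability of jumping to a random on-topic page, N_t is the number of pages on topic t, and O(q) is the number of outgoing links of q; the slide does not define these symbols, so these readings are assumptions. The function name `reputation`, the default `d=0.15`, and the uniform starting vector are also illustrative choices.

```python
def reputation(links, on_topic, d=0.15, iterations=50):
    """Iterate R^n(p,t) for one topic t.

    links    -- dict mapping each page q to the list of pages it links to
    on_topic -- set of pages assumed to be on topic t (must be non-empty)
    d        -- assumed jump probability (illustrative default)
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    n_t = len(on_topic)
    if n_t == 0:
        raise ValueError("on_topic must contain at least one page")

    # Assumed starting point R^0: uniform over the on-topic pages.
    rank = {p: (1.0 / n_t if p in on_topic else 0.0) for p in pages}

    for _ in range(iterations):
        # Second term of the recurrence: d / N_t for on-topic pages, else 0.
        new_rank = {p: (d / n_t if p in on_topic else 0.0) for p in pages}
        # First term: (1 - d) * sum over q -> p of R^{n-1}(q, t) / O(q).
        for q, targets in links.items():
            if not targets:
                continue
            share = (1.0 - d) * rank[q] / len(targets)
            for p in targets:
                new_rank[p] += share
        rank = new_rank
    return rank
```

For example, `reputation({"a": ["b"], "b": ["a"], "c": ["a"]}, {"a"})` gives page "a" the largest value and page "c", which is off topic and has no incoming links, a value of zero.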
Evaluation
• What is page www.macleans.ca known for?
  1 - Maclean's Magazine
  2 - macleans
  3 - Canadian Universities
Personal Home Pages
• www.w3.org/People/Berners-Lee
  – History Of The Internet, Tim Berners-Lee, Internet History, W3C
• www-db.stanford.edu/~ullman
  – Jeffrey D Ullman, Database Systems, Data Mining, Programming Languages
• www.cs.ualberta.ca/~jonathan
  – Jonathan Shaeffer, Chess, Alberta, Games, Computer, University
• www.cs.ualberta.ca/~greiner
  – Machine Learning
News Agencies

               CNN     BBC     ABC     wired.com
Int'l News     0.0237  0.0097  0.0003  0.0044
Weather        0.0121  0.0052  0.0008  0.0006
Sports         0.0070  0.0004  0       0.0028
Entertainment  0.0040  0.0015  0.0013  0.0012
Travel         0.0030  0.0008  0.0012  0.0005
Technology     0.0017  0.0006  0.0006  0.0079
Business       0.0017  0.0006  0.0004  0.0031
ALVIN: Alberta System for Visualizing Large Networks (Rafiei, Curial, 2005)
• Large networks are ubiquitous
  – Internet, telephone, roads, networks of related people, etc.
• Scenarios
  – How is my site related to the Microsoft Web site?
  – My site gets many hits from Russia, and I am wondering about possible relationships
Visualizing the Web Graph (our challenge)
• Must scale up to millions and even billions of nodes and edges
• Should ideally work for an arbitrary network (e.g. not just tree-like)
• Should be able to focus on a part of the network (if needed)
Sampling the Network
• SRS1: take a simple random sample of the nodes and add all edges between them
• SRS2: take a simple random sample of the edges and include all their adjacent nodes
• SRS3: take a sample using SRS2 and add all edges between nodes in the sample
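The three sampling schemes can be sketched in a few lines. This is an illustrative sketch only, not the ALVIN code (which the talk says was written in C++). It assumes the network fits in memory as a non-empty list of (u, v) edge pairs and that "adjacent nodes" of an edge means its two endpoints. The function names and the `fraction` parameter are hypothetical.

```python
import random


def srs1(edges, fraction, rng):
    """SRS1: sample nodes, keep every edge whose endpoints are both sampled."""
    nodes = sorted({v for edge in edges for v in edge})
    k = max(1, round(fraction * len(nodes)))
    kept_nodes = set(rng.sample(nodes, k))
    kept_edges = [(u, v) for u, v in edges if u in kept_nodes and v in kept_nodes]
    return kept_nodes, kept_edges


def srs2(edges, fraction, rng):
    """SRS2: sample edges, keep the endpoints of the sampled edges."""
    k = max(1, round(fraction * len(edges)))
    kept_edges = rng.sample(list(edges), k)
    kept_nodes = {v for edge in kept_edges for v in edge}
    return kept_nodes, kept_edges


def srs3(edges, fraction, rng):
    """SRS3: run SRS2, then add every edge between the sampled nodes."""
    kept_nodes, _ = srs2(edges, fraction, rng)
    kept_edges = [(u, v) for u, v in edges if u in kept_nodes and v in kept_nodes]
    return kept_nodes, kept_edges


if __name__ == "__main__":
    toy = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")]
    print(srs3(toy, 0.4, random.Random(0)))
```

Passing a seeded `random.Random` instance keeps a sample reproducible, which helps when comparing properties of different samples of the same network.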
Network Properties
• Want to keep the sample size small, but still see the topology of the network
• Traits found in a network
  – Degree distribution
  – Connected component size distribution
  – Characteristic path length
  – Clustering coefficient
  – Etc.
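Two of these traits are easy to compute on a sample and compare against the full network. The sketch below assumes an undirected graph given as a list of (u, v) pairs and uses the common definition of the local clustering coefficient (the fraction of a node's neighbor pairs that are themselves linked), averaged over all nodes. The talk does not say which definitions were used, so this is an assumption, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def adjacency(edges):
    """Build an undirected adjacency map, ignoring self-loops."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def degree_distribution(adj):
    """Map each degree to the number of nodes having that degree."""
    return Counter(len(neighbors) for neighbors in adj.values())


def average_clustering(adj):
    """Average of the local clustering coefficients (0 for degree < 2)."""
    if not adj:
        return 0.0
    total = 0.0
    for neighbors in adj.values():
        k = len(neighbors)
        if k < 2:
            continue
        # Count linked pairs among the neighbors, each pair once.
        linked = sum(1 for a in neighbors for b in neighbors
                     if a < b and b in adj[a])
        total += 2.0 * linked / (k * (k - 1))
    return total / len(adj)
```

On a triangle a-b-c with one extra edge c-d, the degree distribution has two nodes of degree 2, one of degree 3 and one of degree 1, and the average clustering coefficient is 7/12.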
Degree distribution with SRS1 (movie database)
Stratified Sampling
A set of growth processes
• Sample I (local growth)
• Sample F (focused growth)
• Sample G (global growth)

F: focus set
I: nodes and edges adjacent to a node or edge in F
Experiments
• System
  – Implemented in C++ using the LEDA class library
  – DB2 is used for backend storage
• Data
  – Snapshot of the Web (taken in 1999): 178 million pages and a billion edges
  – Movie database IMDb: 450,000 actors and 17 million co-acting relationships
Initial Seeds
Seed set
• www.cs.wisc.edu
• www.cs.cornell.edu
• news.bbc.co.uk
• www.foxnews.com
• www.sciencemag.org
Global Growth
0.1% sample using SRS1
Another Global Growth
0.2% sample using SRS1
Seed set: actors in green
Do a focused growth (400 edges)
DeWild: Data Extraction using Wild Cards (Li, Rafiei, 2007)
• Scenario: want to gather a list of
  – Neurosurgeons in Canada
  – Companies acquired by Google
  – Summer movies
• Search, explore, search, ...
  – Tedious and time-consuming
  – Scales poorly
Wild Card Queries
• Let
  – % denote one or more nouns
  – * denote words with the same meanings
• Queries
  – % is a neurosurgeon in Canada
  – % is a summer *blockbuster*
  – Google *acquired* %
  – % invented light bulb
Evaluation Strategies
• Query flattening: % is a summer *blockbuster*
  – % is a summer blockbuster
  – % is a summer movie
  – % is a summer film
• Query expansion: % is a summer movie
  – summer movies such as %
  – summer movies including %
  – such summer movies as %
  – % and other summer movies
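Both strategies amount to rewriting one query string into several. The sketch below is only an illustration of that idea, not DeWild's rewriting rule language. The synonym table is a hypothetical stand-in for whatever lexical resource the system uses, the templates are limited to the patterns shown on the slide, and pluralization is done naively by appending "s".

```python
import re
from itertools import product

# Hypothetical synonym table; a real system would use a lexical resource.
SYNONYMS = {"blockbuster": ["blockbuster", "movie", "film"]}


def flatten(query, synonyms):
    """Replace each *word* by each of its synonyms, one query per combination."""
    # With a capturing group, re.split puts starred words at odd indices.
    parts = re.split(r"\*(\w+)\*", query)
    choices = [synonyms.get(part, [part]) if i % 2 else [part]
               for i, part in enumerate(parts)]
    return ["".join(combo) for combo in product(*choices)]


def expand(query):
    """Rewrite '% is a <class>' into list-style patterns for the same class."""
    match = re.fullmatch(r"% is an? (.+)", query)
    if match is None:
        return [query]
    plural = match.group(1) + "s"  # naive pluralization, an assumption
    return [
        query,
        f"{plural} such as %",
        f"{plural} including %",
        f"such {plural} as %",
        f"% and other {plural}",
    ]


if __name__ == "__main__":
    for flat in flatten("% is a summer *blockbuster*", SYNONYMS):
        print(flat)
    for rewritten in expand("% is a summer movie"):
        print(rewritten)
```

Each rewritten string still contains the % wild card, so every variant can be sent to the search engine and the text matched by % collected as a candidate instance.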
Some Challenges
• Rewritings
  – a rewriting rule language
• Ranking instances
  – reinforcing relationships between instances and rewrites
• Evaluation of the extraction accuracy and coverage
  – Compared to a QA system
  – Ad-hoc list extraction
List of Canadian Writers
• Query: % is a Canadian writer
• 1300 names retrieved
  – 91 of the first 100 were real Canadian writers
  – 156 of the first 200 were real Canadian writers
• Compared to the two most comprehensive lists on the Web
  – Of the 156 real Canadian authors, one list misses 86 names and the other misses 70
  – The two lists combined miss 58 names
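The counts above translate into precision and coverage figures with simple arithmetic. The short check below uses only the numbers on the slide; the variable names are illustrative.

```python
from fractions import Fraction

# Precision of the extracted list at two cut-offs.
precision_at_100 = Fraction(91, 100)
precision_at_200 = Fraction(156, 200)

# Share of the 156 verified authors that each Web list already contains.
real_authors = 156
coverage_list_one = Fraction(real_authors - 86, real_authors)
coverage_list_two = Fraction(real_authors - 70, real_authors)
coverage_combined = Fraction(real_authors - 58, real_authors)

print(float(precision_at_100), float(precision_at_200))
print(round(float(coverage_list_one), 3),
      round(float(coverage_list_two), 3),
      round(float(coverage_combined), 3))
```

So precision is 0.91 at 100 and 0.78 at 200, while even the two reference lists together contain only 98 of the 156 verified authors, about 63%.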
Diver: Diversifying Web Search Results (Rafiei, Bharat, Shukla, 2010)
[Figure: users Joe, Bob, Mary and Cal, each shown the results A B C D E F G H I J; an ordering: A C G ...]
Let's Add Some Cost

Result  A      B      C      D      E      F      G      H      I      J
Cost    $1.00  $0.50  $0.33  $0.25  $0.20  $0.17  $0.14  $0.12  $0.11  $0.10

Users   Results  Subtotal  Total
7 Joe   A D      $8.75
3 Cal   G H      $0.78     $9.53

5 Joe   A D      $6.25
5 Cal   G H      $1.30     $7.55

1 Bob   A E      $1.20
9 Mary  C        $2.97     $4.17
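The totals in this example can be reproduced with a few lines of arithmetic. The sketch below rests on one reading of the slide, stated here as an assumption: each result position carries the listed dollar value, and each group of users contributes its size times the values of the positions holding the results it is interested in. Amounts are kept in cents to avoid floating-point rounding, and the function name is hypothetical.

```python
# Values of result positions 1..10 in cents, copied from the slide.
POSITION_CENTS = [100, 50, 33, 25, 20, 17, 14, 12, 11, 10]


def total_cents(ordering, groups):
    """Sum, over user groups, of group size times the value of wanted results.

    ordering -- string of result labels, best position first
    groups   -- list of (number_of_users, wanted_result_labels) pairs
    """
    position = {label: index for index, label in enumerate(ordering)}
    total = 0
    for count, wanted in groups:
        total += count * sum(POSITION_CENTS[position[label]] for label in wanted)
    return total


ORDERING = "ABCDEFGHIJ"
print(total_cents(ORDERING, [(7, "AD"), (3, "GH")]))  # 953, i.e. $9.53
print(total_cents(ORDERING, [(5, "AD"), (5, "GH")]))  # 755, i.e. $7.55
print(total_cents(ORDERING, [(1, "AE"), (9, "C")]))   # 417, i.e. $4.17
```

With the ordering fixed, the total depends heavily on the user mix, which is why the choice of ordering matters once users differ in what they are looking for.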
Optimization Problem
Find the ordering(s) that achieve a desired expectation e while minimizing the variance (or vice versa).
Random Queries & ODP Categories
Result Relevance
• Selected 42 'good' queries for diversification (out of 427 examined)
  – Not too long
  – Not too specific
• Baselines
  – Google
  – MMR* (MMR with reciprocal ranks from Google for Sim(d,Q))
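For readers unfamiliar with the MMR* baseline, the sketch below shows greedy Maximal Marginal Relevance with the query similarity Sim(d,Q) replaced by the reciprocal of the document's rank in the input list, as the slide describes. Everything else is an assumption made for illustration: the trade-off weight `lam=0.5`, the word-overlap (Jaccard) similarity between documents, and the use of title strings as documents. Documents in the input list are assumed to be distinct.

```python
def jaccard(text_a, text_b):
    """Word-overlap similarity between two strings (illustrative choice)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def mmr_star(ranked_docs, k, lam=0.5):
    """Greedily pick k documents, trading relevance against redundancy.

    ranked_docs -- distinct documents in the order the search engine returned them
    lam         -- assumed weight on relevance; (1 - lam) weighs redundancy
    """
    # Sim(d, Q) is taken to be the reciprocal of the original rank.
    relevance = {doc: 1.0 / rank for rank, doc in enumerate(ranked_docs, start=1)}
    selected = []
    candidates = list(ranked_docs)

    def score(doc):
        # Redundancy is the similarity to the closest already selected document.
        redundancy = max((jaccard(doc, chosen) for chosen in selected), default=0.0)
        return lam * relevance[doc] - (1.0 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

On a toy list such as `["udi manber home page", "udi manber wikipedia", "jeffrey manber wikipedia"]` with `k=2`, the second pick is the third title, because it overlaps less with the first pick than the second title does.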
Results

        Google  Diver  MMR*
Score   0.56    0.61   0.54

Scoring: 0 (non-relevant) ... 1 (relevant with new content)
Query: manber

   Google                       Diver                             MMR*
1  Udi Manber - old home page   Udi Manber - Wikipedia            Udi Manber - old home page
2  Udi Manber - Wikipedia       Jeffrey Manber - Wikipedia        David Manber - imdb.com
3  Udi Manber - publications    Rachel Manber - academic profile  Udi Manber - Wikipedia
Query: sergey

   Google                          Diver                           MMR*
1  Sergey Brin - Wikipedia         Sergey Brin - Google Management Sergey Brin - Wikipedia
2  Sergey Brin - Google Management Sergey Korolyov - Wikipedia     Sergey Brin - Stanford
3  Sergey Brin - Stanford          Sergey Fomin (at U. Mich)       Sergey Brin (at forbes.com)
Query: jaguar

   Google                       Diver                        MMR*
1  Jaguar.com (car)             Jaguar.com (car)             Jaguar.com (car)
2  Jaguarusa.com (car)          Jaguarusa.com (car)          Schrodinger.com (unrelated)
3  Jaguar - Wikipedia (animal)  Jaguar - Wikipedia (animal)  Jaguar.ca (car)
Conclusions
• Covered
  – Big (everyone's) data
  – Four tools
    • TOPIC
    • ALVIN
    • DeWild
    • Diver
• Surge in interest in big data
  – Obama's big data initiative
  – This conference!
Conclusions
• Search is far from being solved
  – And our quest for the holy grail continues
• No shortage of problems
• Want to make sense of your big data?
  – We sure can help!