Davood Rafiei - Search Appliances for Big Data.ppt by Bon Mark Uy

Search Appliances for Big Data
Davood Rafiei, University of Alberta

Big Structured Data

2011/2012 top performers
•  TPC-C: Oracle / Oracle
•  TPC-E: NEC & IBM / SQL Server
•  TPC-H: Dell/Exasol and Cisco/VectorWise

Big Scientific Data
•  ATLAS detector at CERN
    –  23 PB per second of raw data
    –  10 PB per year of filtered data
    –  Used by more than 150 universities and labs
•  Sloan Digital Sky Survey catalog archive
    –  DR8 covers 35% of the sky
    –  Accessed through SkyServer and Microsoft WorldWide Telescope

Big (Everyone’s) Data

Produced every day

Big (Everyone’s) Data

Left behind every day!

Big (Everyone’s) Data

Used every day

Big (Everyone’s) Data

Interacted with every day

Big (Everyone’s) Data

Analyzed every day

Our Four Appliances
•  Reputation gauge (TOPIC, 2000)
•  Network visualization (ALVIN, 2005)
•  Data extraction (DeWild, 2007)
•  Result diversification (Diver, 2010)

TOPIC: Toronto Page Influence Computation (Rafiei, Mendelzon, 2000)

•  What are these pages known for?
    –  www.cnn.com
    –  www.cs.helsinki.fi
    –  www.w3.org/People/Berners-Lee
    –  www.cs.ualberta.ca
    –  www.hot107.ca

[Diagram labels: URL, TOPIC, Search engines, Back Links, Reputation. Example back-link phrases: "search engines compared", "my favorite search engines", "a review of search engines".]

Probability of Visiting a Page

$$
R^{n}(p,t) = (1-d)\sum_{q \to p} \frac{R^{n-1}(q,t)}{O(q)} +
\begin{cases}
\dfrac{d}{N_t} & \text{if page } p \text{ is on topic } t \\
0 & \text{otherwise}
\end{cases}
$$
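The recurrence above can be read as an iteration over the link graph. The following is a minimal Python sketch of that reading, not the authors' implementation. It assumes that d is the probability of jumping to a random on-topic page, N_t is the number of pages on topic t, and O(q) is the number of out-links of q; the slide does not define these symbols. The function name `topic_rank`, the default value of `d`, the uniform starting distribution, and the fixed iteration count are all hypothetical choices.

```python
def topic_rank(links, on_topic, d=0.15, iterations=50):
    """Iterate R^n(p,t) for one topic t.

    links:    dict mapping each page q to the list of pages it links to
    on_topic: set of pages considered to be on topic t (assumed given)
    """
    if not on_topic:
        raise ValueError("need at least one on-topic page")
    pages = set(links) | {p for outs in links.values() for p in outs}
    n_t = len(on_topic)

    # Assumption: start with the mass spread evenly over on-topic pages.
    rank = {p: (1.0 / n_t if p in on_topic else 0.0) for p in pages}

    for _ in range(iterations):
        # The case term: d / N_t for on-topic pages, 0 otherwise.
        new_rank = {p: (d / n_t if p in on_topic else 0.0) for p in pages}
        # The sum term: each q passes (1 - d) * R(q,t) / O(q) to every p it links to.
        for q, outs in links.items():
            if not outs:
                continue  # pages without out-links pass nothing on in this sketch
            share = (1.0 - d) * rank[q] / len(outs)
            for p in outs:
                new_rank[p] += share
        rank = new_rank
    return rank


example_links = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
example_rank = topic_rank(example_links, on_topic={"a", "b"})
```

In this toy graph every page has at least one out-link, so the values keep summing to 1 across iterations; with dangling pages some mass would be lost, which a fuller treatment would have to handle.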

Evaluation
•  What is page www.macleans.ca known for?
    1 - Maclean's Magazine
    2 - macleans
    3 - Canadian Universities

Personal Home Pages
•  www.w3.org/People/Berners-Lee
    –  History Of The Internet, Tim Berners-Lee, Internet History, W3C
•  www-db.stanford.edu/~ullman
    –  Jeffrey D Ullman, Database Systems, Data Mining, Programming Languages
•  www.cs.ualberta.ca/~jonathan
    –  Jonathan Shaeffer, Chess, Alberta, Games, Computer, University
•  www.cs.ualberta.ca/~greiner
    –  Machine Learning

News Agencies

Topic            CNN       BBC       ABC       wired.com
Int'l News       0.0237    0.0097    0.0003    0.0044
Weather          0.0121    0.0052    0.0008    0.0006
Sports           0.0070, 0.0004, 0.0028 (one value missing; column positions unclear)
Entertainment    0.0040    0.0015    0.0013    0.0012
Travel           0.0030    0.0008    0.0012    0.0005
Technology       0.0017, 0.0006, 0.0079 (one value missing; column positions unclear)
Business         0.0017    0.0006    0.0004    0.0031

ALVIN: Alberta System for Visualizing Large Networks (Rafiei, Curial, 2005)

•  Large networks are ubiquitous
    –  Internet, telephone, roads, network of people who are related, etc.
•  Scenarios
    –  How is my site related to the Microsoft Web site?
    –  My site gets many hits from Russia and I am wondering about possible relationships

Visualizing the Web Graph (our challenge)
•  Must scale up to millions and even billions of nodes and edges
•  Should ideally work for an arbitrary network (e.g. not just tree-like)
•  Should be able to focus on a part of the network (if needed)

Sampling the Network
•  SRS1: take a simple random sample of the nodes and add all edges between them
•  SRS2: take a simple random sample of the edges and include all their adjacent nodes
•  SRS3: take a sample using SRS2 and add all edges between nodes in the sample
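The three schemes above can be sketched in a few lines. This is an illustrative Python sketch over an in-memory edge list, not the ALVIN code (which the slides describe as C++ with LEDA and DB2). The function names, the `fraction` parameter, and the explicit `rng` argument are assumptions made for the example.

```python
import random


def srs1(nodes, edges, fraction, rng):
    """SRS1: sample nodes uniformly, then keep every edge between sampled nodes."""
    k = max(1, round(fraction * len(nodes)))
    kept_nodes = set(rng.sample(sorted(nodes), k))
    kept_edges = [(u, v) for u, v in edges if u in kept_nodes and v in kept_nodes]
    return kept_nodes, kept_edges


def srs2(edges, fraction, rng):
    """SRS2: sample edges uniformly, then include the endpoints of those edges."""
    k = max(1, round(fraction * len(edges)))
    kept_edges = rng.sample(list(edges), k)
    kept_nodes = {n for edge in kept_edges for n in edge}
    return kept_nodes, kept_edges


def srs3(edges, fraction, rng):
    """SRS3: run SRS2, then add every edge whose endpoints are both in the sample."""
    kept_nodes, _ = srs2(edges, fraction, rng)
    kept_edges = [(u, v) for u, v in edges if u in kept_nodes and v in kept_nodes]
    return kept_nodes, kept_edges


# Toy input: a ring of 10 nodes (hypothetical data, for illustration only).
ring_nodes = range(10)
ring_edges = [(i, (i + 1) % 10) for i in range(10)]
sample_nodes, sample_edges = srs1(ring_nodes, ring_edges, 0.5, random.Random(0))
```

Passing the random generator in explicitly makes runs repeatable, which helps when comparing how each scheme distorts the properties of the sampled network.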

Network Properties
•  Want to keep the sample size small, but still see the topology of the network
•  Traits found in a network
    –  Degree distribution
    –  Connected component size distribution
    –  Characteristic path length
    –  Clustering coefficient
    –  Etc.
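Two of the traits listed above are easy to compute directly, which makes it possible to compare a sample against the full network. The sketch below is a hypothetical Python illustration for an undirected edge list. It uses the average local clustering coefficient, which is one common definition; the slides do not say which definition was used.

```python
from collections import Counter, defaultdict


def adjacency(edges):
    """Build an undirected adjacency map, ignoring self-loops."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def degree_distribution(adj):
    """Map each degree to the number of nodes that have it."""
    return Counter(len(neighbours) for neighbours in adj.values())


def clustering_coefficient(adj):
    """Average local clustering coefficient; nodes of degree < 2 contribute 0."""
    if not adj:
        return 0.0
    total = 0.0
    for neighbours in adj.values():
        k = len(neighbours)
        if k < 2:
            continue
        # Count the edges that exist among this node's neighbours.
        links = sum(1 for a in neighbours for b in neighbours if a < b and b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


# Toy input: a triangle (1, 2, 3) with one extra node hanging off node 3.
toy_adj = adjacency([(1, 2), (2, 3), (1, 3), (3, 4)])
toy_degrees = degree_distribution(toy_adj)
toy_clustering = clustering_coefficient(toy_adj)
```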

Degree distribution with SRS1 (movie database)

Stratified Sampling
A set of growth processes
•  Sample I (local growth)
•  Sample F (focused growth)
•  Sample G (global growth)

F: Focus set
I: nodes and edges adjacent to a node or edge in F
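As a rough illustration of the definition of I above, the following Python sketch computes the nodes and edges that touch a focus set. It makes two simplifying assumptions that are not in the slides: the focus set F is given as a set of nodes, and an edge with both endpoints in F is treated as belonging to F rather than to I. The function name `local_stratum` is hypothetical.

```python
def local_stratum(focus_nodes, edges):
    """Return (nodes, edges) adjacent to the focus set but not inside it."""
    focus_nodes = set(focus_nodes)
    # An edge is in I when exactly one of its endpoints lies in the focus set.
    local_edges = [(u, v) for u, v in edges if (u in focus_nodes) != (v in focus_nodes)]
    # The nodes of I are the far endpoints of those edges.
    local_nodes = {n for edge in local_edges for n in edge} - focus_nodes
    return local_nodes, local_edges


path_edges = [(1, 2), (2, 3), (3, 4)]
stratum_nodes, stratum_edges = local_stratum({1, 2}, path_edges)
```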

Experiments
•  System
    –  Implemented in C++ using the LEDA class library
    –  DB2 is used for backend storage
•  Data
    –  Snapshot of the Web (taken in 1999): 178 million pages and a billion edges
    –  Movie database imdb: 450,000 actors and 17 million co-actings

Initial Seeds

Seed set
•  www.cs.wisc.edu
•  www.cs.cornell.edu
•  news.bbc.co.uk
•  www.foxnews.com
•  www.sciencemag.org

Global Growth

0.1% sample using SRS1

Another Global Growth

0.2% sample using SRS1

Seed set: actors in green
Do a focused growth (400 edges)

DeWild: Data Extraction using Wild Cards (Li, Rafiei, 2007)

•  Scenario: want to gather a list of
    –  Neurosurgeons in Canada
    –  Companies acquired by Google
    –  Summer movies
•  Search – explore – search …
    –  Tedious and time-consuming
    –  Scales poorly

Wild Card Queries
•  Let
    –  % denote one or more nouns
    –  * denote words with the same meanings
•  Queries
    –  % is a neurosurgeon in Canada
    –  % is a summer *blockbuster*
    –  Google *acquired* %
    –  % invented light bulb

Evaluation Strategies
•  Query flattening
    % is a summer *blockbuster*
    –  % is a summer blockbuster
    –  % is a summer movie
    –  % is a summer film
•  Query expansion
    % is a summer movie
    –  summer movies such as %
    –  summer movies including %
    –  such summer movies as %
    –  % and other summer movies
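The flattening step above, and the matching of a flattened query against text, can be sketched as follows. This is a simplified Python illustration, not the DeWild system. The synonym table is a hypothetical stand-in for whatever lexical resource the system uses, and `%` is approximated by a run of capitalized words, which is a crude assumption in place of real noun detection.

```python
import re

# Hypothetical synonym table, taken from the flattening example on the slide.
SYNONYMS = {"blockbuster": ["blockbuster", "movie", "film"]}


def flatten(query, synonyms):
    """Expand each *word* into one plain query per listed synonym."""
    match = re.search(r"\*(\w+)\*", query)
    if match is None:
        return [query]
    flattened = []
    for alternative in synonyms.get(match.group(1), [match.group(1)]):
        rewritten = query[:match.start()] + alternative + query[match.end():]
        flattened.extend(flatten(rewritten, synonyms))
    return flattened


def to_regex(query):
    """Turn a flattened query into a regex; % becomes a capture group."""
    # Assumption: "one or more nouns" is approximated by capitalized words.
    noun_run = r"((?:[A-Z][\w'-]*)(?:\s+[A-Z][\w'-]*)*)"
    return re.compile(noun_run.join(re.escape(part) for part in query.split("%")))


def extract(query, text):
    """Return the strings that fill the % slot in text."""
    return [m.group(1) for m in to_regex(query).finditer(text)]


queries = flatten("% is a summer *blockbuster*", SYNONYMS)
hits = extract("% is a summer movie", "Critics agree Jurassic Park is a summer movie.")
```

The expansion patterns listed above, such as "summer movies such as %", can be passed to `extract` in the same way, since they are ordinary queries with a single % slot.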

Some Challenges
•  Rewritings
    –  a rewriting rule language
•  Ranking instances
    –  reinforcing relationships between instances and rewrites
•  Evaluation of the extraction accuracy and coverage
    –  Compared to a QA system
    –  Ad-hoc list extraction

List of Canadian Writers
•  Query: % is a Canadian writer
•  1300 names retrieved
    –  91 of the first 100 were real Canadian writers
    –  156 of the first 200 were real Canadian writers
•  Compared to the two most comprehensive lists on the Web
    –  Of the 156 real Canadian authors, one list misses 86 names and the other misses 70
    –  The two lists combined miss 58 names
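The counts above translate into precision and coverage figures with simple arithmetic. The short Python sketch below only restates the slide's numbers; the function names are hypothetical.

```python
def precision_at(correct, retrieved):
    """Fraction of the retrieved names that were real Canadian writers."""
    return correct / retrieved


def coverage(total, missed):
    """Fraction of the verified names that a reference list contains."""
    return (total - missed) / total


precision_100 = precision_at(91, 100)    # 0.91
precision_200 = precision_at(156, 200)   # 0.78
list_one = coverage(156, 86)             # 70 of 156 found
list_two = coverage(156, 70)             # 86 of 156 found
both_lists = coverage(156, 58)           # 98 of 156 found
```

So even the two reference lists together contain only 98 of the 156 verified names, a little under two thirds.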

Diver: Diversifying Web Search Results (Rafiei, Bharat, Shukla, 2010)

[Figure: users Joe, Bob, Mary and Cal, each shown with the results A B C D E F G H I J; an ordering: A C G . . .]

Let's Add Some Cost

Item    A      B      C      D      E      F      G      H      I      J
Cost    $1.00  $0.50  $0.33  $0.25  $0.20  $0.17  $0.14  $0.12  $0.11  $0.10

Users     Items    Subtotal    Total
7 Joe     A D      $8.75
3 Cal     G H      $0.78       $9.53

5 Joe     A D      $6.25
5 Cal     G H      $1.30       $7.55

1 Bob     A E      $1.20
9 Mary    C        $2.97       $4.17
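The subtotals above appear to follow one rule: the number of users in a group multiplied by the summed cost of the items listed for that group, with costs taken from the rounded per-position values on the slide. The Python sketch below checks that reading against the slide's figures. The interpretation and the function names are assumptions, not something the slide states.

```python
# Per-position costs as printed on the slide (rounded values of 1/rank).
COST = dict(zip("ABCDEFGHIJ", [1.00, 0.50, 0.33, 0.25, 0.20, 0.17, 0.14, 0.12, 0.11, 0.10]))


def subtotal(count, items):
    """Group size times the summed cost of the group's items, rounded to cents."""
    return round(count * sum(COST[item] for item in items), 2)


def total(groups):
    """Sum of the subtotals for a list of (count, items) groups."""
    return round(sum(subtotal(count, items) for count, items in groups), 2)


first = total([(7, "AD"), (3, "GH")])    # 7 Joe and 3 Cal
second = total([(5, "AD"), (5, "GH")])   # 5 Joe and 5 Cal
third = total([(1, "AE"), (9, "C")])     # 1 Bob and 9 Mary
```

Under this reading the three totals come out as 9.53, 7.55 and 4.17, matching the slide.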

Optimization Problem
Find the ordering(s) that achieves a desired expectation e while minimizing the variance (or vice versa).
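For a handful of items the problem above can be explored by brute force. The Python sketch below is a toy illustration only and should not be taken as the Diver algorithm. It assumes each user type has a probability and a set of wanted items, that an ordering pays a user 1/rank for each wanted item, and that "achieves a desired expectation" means the mean payoff is at least e. All three are assumptions made for the example.

```python
from itertools import permutations


def payoff(ordering, wanted):
    """Sum of 1/rank over the items this user wants (rank starts at 1)."""
    return sum(1.0 / (ordering.index(item) + 1) for item in wanted)


def mean_and_variance(ordering, users):
    """users is a list of (probability, wanted_items) pairs."""
    mean = sum(p * payoff(ordering, wanted) for p, wanted in users)
    variance = sum(p * (payoff(ordering, wanted) - mean) ** 2 for p, wanted in users)
    return mean, variance


def best_ordering(items, users, min_expectation):
    """Lowest-variance ordering whose mean payoff reaches min_expectation, or None."""
    best = None
    for ordering in permutations(items):
        mean, variance = mean_and_variance(ordering, users)
        if mean >= min_expectation and (best is None or variance < best[2]):
            best = (ordering, mean, variance)
    return best


# Hypothetical user mix: half want A, half want B.
toy_users = [(0.5, {"A"}), (0.5, {"B"})]
toy_best = best_ordering("AB", toy_users, min_expectation=0.7)
```

Enumerating permutations grows factorially, so this only works for very small item sets; it is meant to make the objective concrete, not to scale.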

Random Queries & ODP Categories

Result Relevance
•  Selected 42 'good' queries for diversification (out of 427 examined)
    –  Not too long
    –  Not too specific
•  Baselines
    –  Google
    –  MMR* (MMR with reciprocal ranks from Google for Sim(d,Q))
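The MMR* baseline above can be sketched as standard Maximal Marginal Relevance with the query similarity Sim(d,Q) replaced by the reciprocal of the document's original rank. The Python code below is a minimal sketch under assumptions: the document-to-document similarity is a Jaccard overlap of title words, and the trade-off weight `lam` is 0.5. The slides specify neither, and the input list is made up for illustration rather than taken from the experiments.

```python
def jaccard(a, b):
    """Word-overlap similarity between two titles (an assumed stand-in)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def mmr_star(results, k, lam=0.5, sim=jaccard):
    """Greedy MMR re-ranking; results are titles in the engine's original order."""
    # Sim(d, Q) is the reciprocal rank: 1 for the first result, 1/2 for the second, ...
    relevance = {i: 1.0 / (i + 1) for i in range(len(results))}
    selected, remaining = [], list(range(len(results)))
    while remaining and len(selected) < k:
        def score(i):
            # Penalize similarity to the closest document already selected.
            redundancy = max((sim(results[i], results[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [results[i] for i in selected]


# Illustrative input only; the order is not claimed to be Google's.
titles = [
    "Udi Manber old home page",
    "Udi Manber Wikipedia",
    "Jeffrey Manber Wikipedia",
    "David Manber imdb",
]
reranked = mmr_star(titles, k=3)
```

With an empty selection the redundancy term is zero, so the first pick is always the top-ranked result; later picks trade rank against overlap with what has already been chosen.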

Results

          Google    Diver    MMR*
Score     0.56      0.61     0.54

Scoring: 0 (non-relevant) … 1 (relevant with new content)

Query: manber

Google                         Diver                               MMR*
Udi Manber – old home page     Udi Manber – Wikipedia              Udi Manber – old home page
Udi Manber – Wikipedia         Jeffrey Manber – Wikipedia          David Manber – imdb.com
Udi Manber – publications      Rachel Manber – academic profile    Udi Manber – Wikipedia

Query: sergey

Google                             Diver                              MMR*
Sergey Brin – Wikipedia            Sergey Brin – Google Management    Sergey Brin – Wikipedia
Sergey Brin – Google Management    Sergey Korolyov – Wikipedia        Sergey Brin – Stanford
Sergey Brin – Stanford             Sergey Fomin (at U. Mich)          Sergey Brin (at forbes.com)

Query: jaguar

Google                         Diver                          MMR*
Jaguar.com (car)               Jaguarusa.com (car)            Jaguarusa.com (car)
Schrodinger.com (unrelated)    Jaguar – Wikipedia (animal)    Jaguar.ca (car)

Conclusions
•  Covered
    –  Big (everyone's) data
    –  Four tools: TOPIC, ALVIN, DeWild, Diver
•  Surge in interest in big data
    –  Obama's big data initiative
    –  This conference!

Conclusions
•  Search is far from being solved
    –  And our quest for the holy grail continues
•  No shortage of problems
•  Want to make sense of your big data
    –  We sure can help!