Davood Rafiei - Search Appliances for Big Data.ppt


Search Appliances for Big Data
Davood Rafiei
University of Alberta


Big Structured Data

2011/2012 top performers
•  TPC-C: Oracle / Oracle
•  TPC-E: NEC & IBM / SQL Server
•  TPC-H: Dell/Exasol and Cisco/VectorWise


Big Scientific Data
•  ATLAS detector at CERN
–  23 PB per second raw data
–  10 PB per year filtered data
–  Used by more than 150 universities and labs
•  Sloan Digital Sky Survey catalog archive
–  DR8 covers 35% of the sky
–  Accessed through SkyServer and Microsoft WorldWide Telescope


Big (Everyone’s) Data

Produced every day


Big (Everyone’s) Data

Left behind every day!


Big (Everyone’s) Data

Used every day


Big (Everyone’s) Data

Interacted with every day


Big (Everyone’s) Data

Analyzed every day



Our Four Appliances
•  Reputation gauge (TOPIC, 2000)
•  Network visualization (ALVIN, 2005)
•  Data extraction (DeWild, 2007)
•  Result diversification (Diver, 2010)


TOPIC: Toronto Page Influence Computation (Rafiei, Mendelzon, 2000)
•  What are these pages known for?
–  www.cnn.com
–  www.cs.helsinki.fi
–  www.w3.org/People/Berners-Lee
–  www.cs.ualberta.ca
–  www.hot107.ca


[Figure: URL → TOPIC → Reputation (example topic: search engines)]


Back Links

[Figure: pages such as "search engines compared", "my favorite search engines", and "a review of search engines" link to page p]


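Probability of Visiting a Page

[Figure: page q links to page p]

$$
R^{n}(p,t) = (1-d)\sum_{q \to p} \frac{R^{n-1}(q,t)}{O(q)} +
\begin{cases}
\dfrac{d}{N_t} & \text{if page } p \text{ is on topic } t \\
0 & \text{otherwise}
\end{cases}
$$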
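One way to read the recurrence above is as a fixed-point iteration over the link graph. The sketch below is a minimal illustration, not the TOPIC implementation. It assumes that O(q) is the number of out-links of q, that N_t is the number of pages on topic t, and that d is the probability of jumping to a random on-topic page. The function name `topic_reputation`, the default `d=0.15`, the uniform starting values, and the dictionary-based graph are all hypothetical choices.

```python
def topic_reputation(links, on_topic, d=0.15, iterations=50):
    """Iterate R^n(p,t) = (1-d) * sum_{q->p} R^{n-1}(q,t)/O(q) + (d/N_t if p is on topic t).

    links    -- dict mapping each page q to the list of pages it links to (assumed format)
    on_topic -- set of pages judged to be on topic t (assumed to be given)
    """
    if not on_topic:
        raise ValueError("topic t needs at least one on-topic page")
    pages = set(links) | {p for targets in links.values() for p in targets} | set(on_topic)
    n_t = len(on_topic)
    # Assumed starting point: reputation spread uniformly over the on-topic pages.
    rank = {p: (1.0 / n_t if p in on_topic else 0.0) for p in pages}
    for _ in range(iterations):
        # Second term of the recurrence: d / N_t for on-topic pages, 0 otherwise.
        nxt = {p: (d / n_t if p in on_topic else 0.0) for p in pages}
        # First term: each page q passes (1-d) * R(q,t) / O(q) along every out-link.
        for q, targets in links.items():
            if targets:
                share = (1.0 - d) * rank[q] / len(targets)
                for p in targets:
                    nxt[p] += share
        rank = nxt
    return rank


# Toy graph: two on-topic pages a and b both link to p, and p links back to a.
links = {"a": ["p"], "b": ["p"], "p": ["a"]}
ranks = topic_reputation(links, on_topic={"a", "b"})
```

In the toy graph, p is not on topic itself but still collects reputation on t because on-topic pages link to it, which is the effect the slide's back-link picture suggests.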


Evaluation
•  What is page www.macleans.ca known for?
1 - Maclean's Magazine
2 - macleans
3 - Canadian Universities


Personal Home Pages
•  www.w3.org/People/Berners-Lee
–  History Of The Internet, Tim Berners-Lee, Internet History, W3C
•  www-db.stanford.edu/~ullman
–  Jeffrey D Ullman, Database Systems, Data Mining, Programming Languages
•  www.cs.ualberta.ca/~jonathan
–  Jonathan Schaeffer, Chess, Alberta, Games, Computer, University
•  www.cs.ualberta.ca/~greiner
–  Machine Learning


News Agencies

                 CNN      BBC      ABC      wired.com
Int’l News       0.0237   0.0097   0.0003   0.0044
Weather          0.0121   0.0052   0.0008   0.0006
Sports           0.0070   0.0004   0        0.0028
Entertainment    0.0040   0.0015   0.0013   0.0012
Travel           0.0030   0.0008   0.0012   0.0005
Technology       0.0017   0.0006   0.0006   0.0079
Business         0.0017   0.0006   0.0004   0.0031


ALVIN: Alberta System for Visualizing Large Networks (Rafiei, Curial, 2005)
•  Large networks are ubiquitous
–  Internet, telephone, roads, network of people who are related, etc.
•  Scenarios
–  How is my site related to the Microsoft Web site?
–  My site gets many hits from Russia and I am wondering about possible relationships


Visualizing the Web Graph (our challenge)
•  Must scale up to millions and even billions of nodes and edges
•  Should ideally work for an arbitrary network (e.g. not just tree-like)
•  Should be able to focus on a part of the network (if needed)


Sampling the Network
•  SRS1: take a simple random sample of the nodes and add all edges between them
•  SRS2: take a simple random sample of the edges and include all their adjacent nodes
•  SRS3: take a sample using SRS2 and add all edges between nodes in the sample
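The three schemes can be sketched in a few lines. This is an illustrative reading of the slide's definitions, not ALVIN's C++ code; it assumes the network fits in memory as a set of nodes and a set of `(u, v)` edge tuples, and the function names `srs1`, `srs2`, and `srs3` simply mirror the slide's labels.

```python
import random


def srs1(nodes, edges, k, rng):
    """SRS1: sample k nodes uniformly, then keep every edge between sampled nodes."""
    sample = set(rng.sample(sorted(nodes), k))
    kept = {(u, v) for u, v in edges if u in sample and v in sample}
    return sample, kept


def srs2(edges, k, rng):
    """SRS2: sample k edges uniformly, then include the endpoints of those edges."""
    kept = set(rng.sample(sorted(edges), k))
    sample = {node for edge in kept for node in edge}
    return sample, kept


def srs3(edges, k, rng):
    """SRS3: run SRS2, then add every edge whose two endpoints are both in the sample."""
    sample, _ = srs2(edges, k, rng)
    kept = {(u, v) for u, v in edges if u in sample and v in sample}
    return sample, kept


# Small hypothetical network used only to exercise the functions.
nodes = set(range(8))
edges = {(0, 1), (1, 2), (2, 3), (3, 0), (4, 5), (5, 6), (6, 7), (0, 4)}
sample_nodes, sample_edges = srs1(nodes, edges, 4, random.Random(1))
```

Passing an explicit `random.Random` instance keeps the samples reproducible, which makes it easier to compare the schemes on the same seed.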


Network Properties
•  Want to keep the sample size small, but still see the topology of the network
•  Traits found in a network
–  Degree distribution
–  Connected component size distribution
–  Characteristic path length
–  Clustering coefficient
–  Etc.
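Two of the traits listed above are simple to compute on a sample and compare against the full network. The sketch below assumes an undirected graph given as an edge list with comparable node labels; the helper names are hypothetical, and the clustering coefficient shown is the average of the per-node local coefficients, which is one common definition and may not be the one used in ALVIN.

```python
from collections import Counter, defaultdict


def adjacency(edges):
    """Build an undirected adjacency map, ignoring self-loops."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def degree_distribution(adj):
    """Map each degree value to the number of nodes having that degree."""
    return Counter(len(neighbours) for neighbours in adj.values())


def clustering_coefficient(adj):
    """Average over all nodes of (links among neighbours) / (possible links among neighbours)."""
    if not adj:
        return 0.0
    total = 0.0
    for neighbours in adj.values():
        k = len(neighbours)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        links = sum(1 for a in neighbours for b in neighbours if a < b and b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


# A triangle (1, 2, 3) with one pendant node 4 attached to node 3.
adj = adjacency([(1, 2), (2, 3), (1, 3), (3, 4)])
degrees = degree_distribution(adj)
clustering = clustering_coefficient(adj)
```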


Degree distribution with SRS1 (movie database)


Stratified Sampling
A set of growth processes
•  Sample I (local growth)
•  Sample F (focused growth)
•  Sample G (global growth)

F: focus set
I: nodes and edges adjacent to a node or edge in F


Experiments
•  System
–  Implemented in C++ using the LEDA class library
–  DB2 is used for backend storage
•  Data
–  Snapshot of the Web (taken in 1999): 178 million pages and a billion edges
–  Movie database IMDb: 450,000 actors and 17 million co-actings


Initial Seeds

Seed set
–  www.cs.wisc.edu
–  www.cs.cornell.edu
–  news.bbc.co.uk
–  www.foxnews.com
–  www.sciencemag.org


Global Growth

0.1% sample using SRS1


Another Global Growth

0.2% sample using SRS1


Seed set: actors in green
Do a focused growth (400 edges)


DeWild: Data Extraction using Wild Cards (Li, Rafiei, 2007)
•  Scenario: want to gather a list of
–  Neurosurgeons in Canada
–  Companies acquired by Google
–  Summer movies
•  Search – explore – search …
–  Tedious and time-consuming
–  Scales poorly


Wild Card Queries
•  Let
–  % denote one or more nouns
–  *w* denote the word w or any word with the same meaning
•  Queries
–  % is a neurosurgeon in Canada
–  % is a summer *blockbuster*
–  Google *acquired* %
–  % invented light bulb


Evaluation Strategies
•  Query flattening: % is a summer *blockbuster*
–  % is a summer blockbuster
–  % is a summer movie
–  % is a summer film
•  Query expansion: % is a summer movie
–  summer movies such as %
–  summer movies including %
–  such summer movies as %
–  % and other summer movies
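The two strategies above can be illustrated with plain string rewriting. This is only a sketch of the idea, not DeWild's rewriting rule language: the synonym table is hand-written from the slide's example, the four templates are the ones shown on the slide, and the names `flatten`, `expand`, and `EXPANSION_TEMPLATES` are hypothetical.

```python
import re

# The four expansion patterns shown on the slide, parameterized by a plural noun phrase.
EXPANSION_TEMPLATES = [
    "{plural} such as %",
    "{plural} including %",
    "such {plural} as %",
    "% and other {plural}",
]


def flatten(query, synonyms):
    """Query flattening: replace each *word* marker by every synonym, one query per choice."""
    match = re.search(r"\*(\w+)\*", query)
    if match is None:
        return [query]
    word = match.group(1)
    flattened = []
    for alternative in synonyms.get(word, [word]):
        rewritten = query[:match.start()] + alternative + query[match.end():]
        flattened.extend(flatten(rewritten, synonyms))
    return flattened


def expand(plural_noun_phrase):
    """Query expansion: instantiate each template with the plural noun phrase."""
    return [template.format(plural=plural_noun_phrase) for template in EXPANSION_TEMPLATES]


# Hand-written synonym table (an assumption; a real system would need a lexical resource).
synonyms = {"blockbuster": ["blockbuster", "movie", "film"]}
flat_queries = flatten("% is a summer *blockbuster*", synonyms)
expanded_queries = expand("summer movies")
```

Each resulting string would then be sent to a search engine as a phrase query, with % marking the slot to extract.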


Some Challenges
•  Rewritings
–  a rewriting rule language
•  Ranking instances
–  reinforcing relationships between instances and rewrites
•  Evaluation of the extraction accuracy and coverage
–  Compared to a QA system
–  Ad-hoc list extraction


List of Canadian Writers
•  Query: % is a Canadian writer
•  1300 names retrieved
–  91 of the first 100 were real Canadian writers
–  156 of the first 200 were real Canadian writers
•  Compared to the two most comprehensive lists on the Web
–  Of the 156 real Canadian authors, one list misses 86 and the other misses 70
–  The two lists combined miss 58 names
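The counts above translate directly into precision and coverage figures. The short calculation below uses only the numbers on the slide; the variable names are illustrative.

```python
def ratio(part, whole):
    """Return part / whole as a fraction between 0 and 1."""
    return part / whole


# Precision of the extracted list at two cut-offs.
precision_at_100 = ratio(91, 100)
precision_at_200 = ratio(156, 200)

# How many of the 156 verified writers each existing Web list already contains.
verified = 156
coverage_first_list = ratio(verified - 86, verified)
coverage_second_list = ratio(verified - 70, verified)
coverage_both_lists = ratio(verified - 58, verified)

print(f"precision@100 = {precision_at_100:.0%}, precision@200 = {precision_at_200:.0%}")
print(f"both lists together cover {verified - 58} of {verified} names ({coverage_both_lists:.0%})")
```

So even the two lists together contain only 98 of the 156 verified names.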


Diver: Diversifying Web Search Results (Rafiei, Bharat, Shukla, 2010)

Joe:   A B C D E F G H I J
Bob:   A B C D E F G H I J
Mary:  A B C D E F G H I J
Cal:   A B C D E F G H I J

An ordering: A C G . . . . . . .


Let’s Add Some Cost

Result:  A      B      C      D      E      F      G      H      I      J
Cost:    $1.00  $0.50  $0.33  $0.25  $0.20  $0.17  $0.14  $0.12  $0.11  $0.10

Users     Results   Subtotal   Total
7 Joe     A D       $8.75
3 Cal     G H       $0.78      $9.53

5 Joe     A D       $6.25
5 Cal     G H       $1.30      $7.55

1 Bob     A E       $1.20
9 Mary    C         $2.97      $4.17
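The subtotals and totals on this slide can be reproduced by multiplying each group's size by the summed amounts of the results listed next to it. The sketch below works in cents to avoid rounding noise and takes the ten dollar amounts exactly as printed; they look like 1/rank rounded to cents, but that reading is an inference, and the names `COST_CENTS`, `subtotal_cents`, and `total_cents` are hypothetical.

```python
# Dollar amounts attached to results A..J on the slide, expressed in cents.
COST_CENTS = dict(zip("ABCDEFGHIJ", [100, 50, 33, 25, 20, 17, 14, 12, 11, 10]))


def subtotal_cents(count, results):
    """Group size times the summed amounts of the results listed for that group."""
    return count * sum(COST_CENTS[result] for result in results)


def total_cents(groups):
    """Sum of the subtotals over all (count, results) groups in a scenario."""
    return sum(subtotal_cents(count, results) for count, results in groups)


# The three scenarios on the slide: (number of users, results listed for them).
scenarios = {
    "7 Joe + 3 Cal": [(7, "AD"), (3, "GH")],
    "5 Joe + 5 Cal": [(5, "AD"), (5, "GH")],
    "1 Bob + 9 Mary": [(1, "AE"), (9, "C")],
}
for name, groups in scenarios.items():
    print(f"{name}: ${total_cents(groups) / 100:.2f}")
```

The same fixed ordering A..J yields very different totals depending on who is asking, which is the motivation for choosing the ordering with the user population in mind.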


Optimization Problem
Find the ordering(s) that achieve a desired expectation e while minimizing the variance (or vice versa).
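For a handful of results the problem can be explored by brute force. The sketch below is one possible reading, not Diver's algorithm: it assumes each user's value for an ordering is the sum of 1/rank over the results that user wants, takes the expectation and variance over users, and treats "achieves e" as "expectation at least e". The names `payoff`, `ordering_stats`, and `min_variance_ordering` are hypothetical, and the exhaustive search is only feasible for very small result sets.

```python
from itertools import permutations
from statistics import mean, pvariance


def payoff(ordering, wanted):
    """Assumed per-user value: sum of 1/rank over the results this user wants."""
    return sum(1.0 / (ordering.index(result) + 1) for result in wanted)


def ordering_stats(ordering, users):
    """Expectation and (population) variance of the per-user value for one ordering."""
    payoffs = [payoff(ordering, wanted) for wanted in users]
    return mean(payoffs), pvariance(payoffs)


def min_variance_ordering(results, users, target):
    """Exhaustively find an ordering with expectation >= target and minimal variance."""
    best = None
    for ordering in permutations(results):
        expectation, variance = ordering_stats(ordering, users)
        if expectation >= target and (best is None or variance < best[2]):
            best = (ordering, expectation, variance)
    return best


# Two hypothetical users: one wants result A, the other wants result B.
users = [["A"], ["B"]]
expectation, variance = ordering_stats(("A", "B"), users)
best = min_variance_ordering(("A", "B", "C"), users, 0.7)
```

Raising the target forces the wanted results toward the top of the list, while lowering it lets the search trade expectation for a more even outcome across users.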


Random Queries & ODP Categories


Result Relevance
•  Selected 42 ‘good’ queries for diversification (out of 427 examined)
–  Not too long
–  Not too specific
•  Baselines
–  Google
–  MMR* (MMR with reciprocal ranks from Google for Sim(d,Q))
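The MMR* baseline can be sketched as standard greedy Maximal Marginal Relevance in which the query similarity Sim(d,Q) is replaced by the reciprocal of the document's rank in the original list. Everything else here is an assumption made for illustration: the trade-off weight `lam=0.5`, the word-overlap (Jaccard) similarity between documents, and the function names.

```python
def jaccard(a, b):
    """Word-overlap similarity between two text snippets (an assumed stand-in)."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def mmr_star(ranked_docs, k, lam=0.5, sim=jaccard):
    """Greedy MMR over distinct documents, with Sim(d,Q) = 1 / original rank."""
    relevance = {doc: 1.0 / (rank + 1) for rank, doc in enumerate(ranked_docs)}
    selected = []
    candidates = list(ranked_docs)
    while candidates and len(selected) < k:
        def score(doc):
            # Penalize similarity to the closest document already selected.
            redundancy = max((sim(doc, chosen) for chosen in selected), default=0.0)
            return lam * relevance[doc] - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical snippets standing in for an original ranked result list.
ranked = [
    "udi manber home page",
    "udi manber home page old",
    "jeffrey manber wikipedia",
]
reranked = mmr_star(ranked, k=2)
```

In this toy list the near-duplicate second snippet is skipped in favour of the third, because its overlap with the first selected snippet outweighs its higher original rank.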


Results

         Google   Diver   MMR*
Score    0.56     0.61    0.54

Scoring: 0 (non-relevant) … 1 (relevant with new content)


Query: manber

     Google                       Diver                               MMR*
1    Udi Manber – old home page   Udi Manber – Wikipedia              Udi Manber – old home page
2    Udi Manber – Wikipedia       Jeffrey Manber – Wikipedia          David Manber – imdb.com
3    Udi Manber – publications    Rachel Manber – academic profile    Udi Manber – Wikipedia


Query: sergey

     Google                            Diver                             MMR*
1    Sergey Brin – Wikipedia           Sergey Brin – Google Management   Sergey Brin – Wikipedia
2    Sergey Brin – Google Management   Sergey Korolyov – Wikipedia       Sergey Brin – Stanford
3    Sergey Brin – Stanford            Sergey Fomin (at U. Mich)         Sergey Brin (at forbes.com)


Query: jaguar

     Google                        Diver                         MMR*
1    Jaguar.com (car)              Jaguar.com (car)              Jaguar.com (car)
2    Jaguarusa.com (car)           Jaguarusa.com (car)           Schrodinger.com (unrelated)
3    Jaguar – Wikipedia (animal)   Jaguar – Wikipedia (animal)   Jaguar.ca (car)


Conclusions
•  Covered
–  Big (everyone’s) data
–  Four tools: TOPIC, ALVIN, DeWild, Diver
•  Surge in interest in big data
–  Obama’s big data initiative
–  This conference!


Conclusions
•  Search is far from being solved
–  And our quest for the holy grail continues
•  No shortage of problems
•  Want to make sense of your big data?
–  We sure can help!

