
Search Appliances for Big Data
Davood Rafiei
University of Alberta


Big Structured  Data  

2011/2012 top performers
•  TPC-C: Oracle / Oracle
•  TPC-E: NEC & IBM / SQL Server
•  TPC-H: Dell/Exasol and Cisco/VectorWise


Big Scientific Data
•  Atlas detector at CERN
   –  23 PB per second raw data
   –  10 PB per year filtered data
   –  Used by more than 150 universities and labs
•  Sloan digital sky survey catalog archive
   –  DR8 covers 35% of the sky
   –  Accessed through SkyServer and MS World telescope


Big (Everyone’s)  Data  

Produced every day


Big (Everyone’s)  Data  

Left behind every day!


Big (Everyone’s)  Data  

Used every day


Big (Everyone’s)  Data  

Interacted with every day


Big (Everyone’s)  Data  

Analyzed every day


Our Four Appliances
•  Reputation gauge (TOPIC, 2000)
•  Network visualization (ALVIN, 2005)
•  Data extraction (DeWild, 2007)
•  Result diversification (Diver, 2010)


TOPIC: Toronto Page Influence Computation
(Rafiei, Mendelzon, 2000)

•  What are these pages known for?
   –  www.cnn.com
   –  www.cs.helsinki.fi
   –  www.w3.org/People/Berners-Lee
   –  www.cs.ualberta.ca
   –  www.hot107.ca


[Figure labels: URL, TOPIC, Search engines, Reputation]


Back Links
[Figure: back links to page p, labeled "search engines compared", "my favorite search engines", and "a review of search engines"]


Probability of Visiting a Page
[Figure: pages q and p]

$$
R^{n}(p,t) = (1-d) \sum_{q \to p} \frac{R^{n-1}(q,t)}{O(q)} +
\begin{cases}
\dfrac{d}{N_t} & \text{if page } p \text{ is on topic } t \\
0 & \text{otherwise}
\end{cases}
$$
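The recurrence above can be sketched in a few lines of Python. This is only an illustration on a toy link graph, not the TOPIC implementation: the function name `topic_reputation`, the default `d = 0.15`, the starting values, and the fixed iteration count are all assumptions. Here O(q) is taken to be the number of distinct out-links of q and N_t the number of pages on topic t.

```python
def topic_reputation(links, on_topic, d=0.15, iterations=200):
    """Iterate R^n(p,t) = (1-d) * sum_{q->p} R^{n-1}(q,t)/O(q) + (d/N_t if p is on topic t).

    links:    dict mapping each page q to the pages it links to
    on_topic: set of pages that are on topic t (N_t = len(on_topic))
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    out_links = {q: set(links.get(q, ())) for q in pages}
    n_t = len(on_topic)

    def jump(p):
        # The bracketed term of the formula: d / N_t on-topic, 0 otherwise.
        return d / n_t if p in on_topic else 0.0

    # Assumed starting point; the slide does not state R^0.
    rank = {p: jump(p) for p in pages}
    for _ in range(iterations):
        nxt = {p: jump(p) for p in pages}
        for q in pages:
            for p in out_links[q]:
                nxt[p] += (1 - d) * rank[q] / len(out_links[q])
        rank = nxt
    return rank


# Toy graph (hypothetical): a and b are on topic, and both link to c.
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
toy_rank = topic_reputation(toy_links, on_topic={"a", "b"})
```

In this toy graph page c is not on topic itself, yet it ends up with a higher value than b because on-topic pages link to it.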


Evaluation
•  What is the page www.macleans.ca known for?
   1 - Maclean's Magazine
   2 - macleans
   3 - Canadian Universities


Personal Home Pages
•  www.w3.org/People/Berners-Lee
   –  History Of The Internet, Tim Berners-Lee, Internet History, W3C
•  www-db.stanford.edu/~ullman
   –  Jeffrey D Ullman, Database Systems, Data Mining, Programming Languages
•  www.cs.ualberta.ca/~jonathan
   –  Jonathan Shaeffer, Chess, Alberta, Games, Computer, University
•  www.cs.ualberta.ca/~greiner
   –  Machine Learning


News Agencies

                 CNN      BBC      ABC      wired.com
Int’l News       0.0237   0.0097   0.0003   0.0044
Weather          0.0121   0.0052   0.0008   0.0006
Sports           0.0070   0.0004   0        0.0028
Entertainment    0.0040   0.0015   0.0013   0.0012
Travel           0.0030   0.0008   0.0012   0.0005
Technology       0.0017   0.0006   0.0006   0.0079
Business         0.0017   0.0006   0.0004   0.0031


ALVIN: Alberta System for Visualizing Large Networks
(Rafiei, Curial, 2005)

•  Large networks are ubiquitous
   –  Internet, telephone, roads, network of people who are related, etc.
•  Scenarios
   –  How is my site related to the Microsoft Web site?
   –  My site gets many hits from Russia and I am wondering about possible relationships


Visualizing the Web Graph (our challenge)
•  Must scale up to millions and even billions of nodes and edges
•  Should ideally work for an arbitrary network (e.g. not just tree-like)
•  Should be able to focus on a part of the network (if needed)


Sampling the Network
•  SRS1: take a simple random sample of the nodes and add all edges between them
•  SRS2: take a simple random sample of the edges and include all their adjacent nodes
•  SRS3: take a sample using SRS2 and add all edges between nodes in the sample
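The three schemes can be sketched on an in-memory edge list as follows. This is a minimal illustration and not the ALVIN code (which the deck describes as C++ over LEDA and DB2); the function signatures, the `fraction` parameter, and the rounding of the sample size are assumptions.

```python
import random


def srs1(edges, fraction, rng):
    """SRS1: sample nodes uniformly, then keep every edge between sampled nodes."""
    nodes = sorted({v for edge in edges for v in edge})
    k = max(1, round(fraction * len(nodes)))
    chosen = set(rng.sample(nodes, k))
    kept = [(u, v) for u, v in edges if u in chosen and v in chosen]
    return chosen, kept


def srs2(edges, fraction, rng):
    """SRS2: sample edges uniformly, then include the endpoints of each sampled edge."""
    k = max(1, round(fraction * len(edges)))
    picked = rng.sample(list(edges), k)
    nodes = {v for edge in picked for v in edge}
    return nodes, picked


def srs3(edges, fraction, rng):
    """SRS3: take an SRS2 sample, then add all edges between the sampled nodes."""
    nodes, _ = srs2(edges, fraction, rng)
    induced = [(u, v) for u, v in edges if u in nodes and v in nodes]
    return nodes, induced


# Toy edge list (hypothetical data) and a seeded generator for repeatability.
toy_edges = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]
sample_nodes, sample_edges = srs3(toy_edges, 0.4, random.Random(0))
```

SRS3 always returns at least as many edges as SRS2 for the same edge draw, since the drawn edges are themselves edges between sampled nodes.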


Network Properties
•  Want to keep the sample size small, but still see the topology of the network
•  Traits found in a network
   –  Degree distribution
   –  Connected component size distribution
   –  Characteristic path length
   –  Clustering coefficient
   –  Etc.
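Two of the traits listed above, the degree distribution and the clustering coefficient, can be computed for a small undirected edge list as in the sketch below. Running the same functions on the full network and on a sample is one way to check how well the sample preserves them. The function names are illustrative, and the average of per-node clustering coefficients is one common definition, assumed here.

```python
from collections import Counter, defaultdict
from itertools import combinations


def adjacency(edges):
    """Build an undirected adjacency map, ignoring self-loops."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def degree_distribution(edges):
    """Return {degree: fraction of nodes with that degree}."""
    adj = adjacency(edges)
    counts = Counter(len(neighbours) for neighbours in adj.values())
    return {deg: c / len(adj) for deg, c in sorted(counts.items())}


def average_clustering(edges):
    """Average over nodes of (links among neighbours) / (possible links among neighbours)."""
    adj = adjacency(edges)
    if not adj:
        return 0.0
    total = 0.0
    for neighbours in adj.values():
        k = len(neighbours)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        closed = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
        total += closed / (k * (k - 1) / 2)
    return total / len(adj)


# Toy graph (hypothetical): a triangle 1-2-3 with a pendant node 4 attached to 3.
toy = [(1, 2), (2, 3), (1, 3), (3, 4)]
toy_degrees = degree_distribution(toy)
toy_clustering = average_clustering(toy)
```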


Degree distribution with SRS1 (movie database)


Stratified Sampling
A set of growth processes
•  Sample I (local growth)
•  Sample F (focused growth)
•  Sample G (global growth)

F: focus set
I: nodes and edges adjacent to a node or edge in F


Experiments
•  System
   –  Implemented in C++ using the LEDA class library
   –  DB2 is used for backend storage
•  Data
   –  Snapshot of the Web (taken in 1999): 178 million pages and a billion edges
   –  Movie database imdb: 450,000 actors and 17 million co-actings


Initial Seeds

Seed set
•  www.cs.wisc.edu
•  www.cs.cornell.edu
•  news.bbc.co.uk
•  www.foxnews.com
•  www.sciencemag.org


Global Growth  

0.1% sample  using  SRS1  


Another Global  Growth  

0.2% sample  using  SRS1  


Seed set: actors in green
Do a focused growth (400 edges)


DeWild: Data Extraction using Wild Cards
(Li, Rafiei, 2007)

•  Scenario: want to gather a list of
   –  Neurosurgeons in Canada
   –  Companies acquired by Google
   –  Summer movies
•  Search – explore – search …
   –  Tedious and time-consuming
   –  Poor scale-up


Wild Card Queries
•  Let
   –  % denote one or more nouns
   –  * denote words with the same meanings
•  Queries
   –  % is a neurosurgeon in Canada
   –  % is a summer *blockbuster*
   –  Google *acquired* %
   –  % invented light bulb


Evaluation Strategies
•  Query flattening: % is a summer *blockbuster* becomes
   –  % is a summer blockbuster
   –  % is a summer movie
   –  % is a summer film
•  Query expansion: % is a summer movie becomes
   –  summer movies such as %
   –  summer movies including %
   –  such summer movies as %
   –  % and other summer movies
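A rough sketch of the two strategies is given below. It is not the DeWild rewriting engine: the `SYNONYMS` table, the `TEMPLATES` list, and the naive "add an s" pluralization are all hypothetical stand-ins chosen so that the slide's example can be reproduced.

```python
import re
from itertools import product

# Hypothetical synonym table; a real system would draw on a lexical resource.
SYNONYMS = {"blockbuster": ["blockbuster", "movie", "film"]}

# Hypothetical rewrite templates modelled on the expansion example.
TEMPLATES = [
    "{plural} such as %",
    "{plural} including %",
    "such {plural} as %",
    "% and other {plural}",
]


def flatten(query, synonyms=SYNONYMS):
    """Replace every *word* by each of its synonyms, producing plain % queries."""
    parts = re.split(r"\*(\w+)\*", query)  # odd positions hold the starred words
    choices = [
        synonyms.get(part, [part]) if i % 2 else [part]
        for i, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in product(*choices)]


def expand(query):
    """Rewrite '% is a <noun phrase>' into list-style patterns for the same class."""
    match = re.fullmatch(r"% is an? (.+)", query)
    if not match:
        return [query]
    plural = match.group(1) + "s"  # naive pluralization, an assumption
    return [query] + [t.format(plural=plural) for t in TEMPLATES]


flat = flatten("% is a summer *blockbuster*")
rewrites = expand("% is a summer movie")
```

Each resulting string would then be sent to a search engine, with the text matched by % collected as candidate instances.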


Some Challenges
•  Rewritings
   –  a rewriting rule language
•  Ranking instances
   –  reinforcing relationships between instances and rewrites
•  Evaluation of the extraction accuracy and coverage
   –  Compared to a QA system
   –  Ad-hoc list extraction


List of Canadian Writers
•  Query: % is a Canadian writer
•  1300 names retrieved
   –  91 of the first 100 were real Canadian writers
   –  156 of the first 200 were real Canadian writers
•  Compared to the two most comprehensive lists on the Web
   –  Of 156 real Canadian authors, one list misses 86 and another misses 70 names
   –  Both combined miss 58 names
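The counts above translate into precision and coverage figures as follows. The helper names are illustrative; only the numbers come from the slide.

```python
def precision(correct, retrieved):
    """Fraction of retrieved names that are real Canadian writers."""
    return correct / retrieved


def coverage(found, missed):
    """Fraction of the `found` names that a reference list also contains."""
    return (found - missed) / found


print(f"precision at 100: {precision(91, 100):.2f}")      # 0.91
print(f"precision at 200: {precision(156, 200):.2f}")     # 0.78
print(f"first list covers:  {coverage(156, 86):.2f}")     # 0.45
print(f"second list covers: {coverage(156, 70):.2f}")     # 0.55
print(f"both lists cover:   {coverage(156, 58):.2f}")     # 0.63
```

So even the two lists combined contain only about 63% of the 156 verified names that the wild card query returned in its first 200 results.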


Diver: Diversifying Web Search Results
(Rafiei, Bharat, Shukla, 2010)

[Figure: users Joe, Bob, Mary, and Cal each see the candidate results A B C D E F G H I J; an ordering turns A B C D E F G H I J into A C G . . .]


Let’s Add Some Cost

Result   A      B      C      D      E      F      G      H      I      J
Cost     $1.00  $0.50  $0.33  $0.25  $0.20  $0.17  $0.14  $0.12  $0.11  $0.10

Users    Results    Subtotal   Total
7 Joe    A, D       $8.75
3 Cal    G, H       $0.78      $9.53

5 Joe    A, D       $6.25
5 Cal    G, H       $1.30      $7.55

1 Bob    A, E       $1.20
9 Mary   C          $2.97      $4.17
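The subtotals and totals on this slide can be reproduced with a few lines of arithmetic: each user contributes the amounts attached to the positions of the results marked for them, multiplied by the number of such users. The sketch below works in cents to avoid rounding noise and uses the rounded amounts exactly as printed; the names `POSITION_CENTS`, `WANTS`, and the two functions are illustrative.

```python
# Amount attached to each rank, in cents, as printed on the slide (about 1/rank dollars).
POSITION_CENTS = [100, 50, 33, 25, 20, 17, 14, 12, 11, 10]
ORDERING = "ABCDEFGHIJ"

# Results marked for each user on the slide.
WANTS = {"Joe": "AD", "Cal": "GH", "Bob": "AE", "Mary": "C"}


def subtotal_cents(user, count, ordering=ORDERING):
    """Amount for `count` copies of `user` under the given ordering."""
    per_user = sum(POSITION_CENTS[ordering.index(doc)] for doc in WANTS[user])
    return count * per_user


def total_cents(population, ordering=ORDERING):
    """Sum of subtotals for a population such as {"Joe": 7, "Cal": 3}."""
    return sum(subtotal_cents(user, n, ordering) for user, n in population.items())


for population in ({"Joe": 7, "Cal": 3}, {"Joe": 5, "Cal": 5}, {"Bob": 1, "Mary": 9}):
    print(population, f"${total_cents(population) / 100:.2f}")
```

The same ordering therefore yields very different totals depending on who is asking, which is what motivates looking at both the expectation and the spread.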


Optimization Problem
Find the ordering(s) that achieves a desired expectation e while minimizing the variance (or vice versa).
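To make the trade-off concrete, here is a brute-force sketch under an assumed model; the slide does not spell out the random variable, so every modelling choice here is an assumption. A user type is drawn with some probability, that user's payoff is the sum of 1/rank over the results relevant to them, and "achieves expectation e" is read as "mean payoff at least e". Enumerating permutations is only feasible for a handful of results and is not how Diver works.

```python
from itertools import permutations


def payoff(ordering, relevant):
    """Sum of 1/rank over the results in `ordering` that are relevant to one user."""
    return sum(1.0 / (i + 1) for i, doc in enumerate(ordering) if doc in relevant)


def mean_and_variance(ordering, users):
    """users: list of (probability, set of relevant results); probabilities sum to 1."""
    mean = sum(p * payoff(ordering, rel) for p, rel in users)
    var = sum(p * (payoff(ordering, rel) - mean) ** 2 for p, rel in users)
    return mean, var


def best_ordering(docs, users, target):
    """Among orderings with mean >= target, return (variance, ordering, mean) with least variance."""
    feasible = []
    for ordering in permutations(docs):
        mean, var = mean_and_variance(ordering, users)
        if mean >= target:
            feasible.append((var, ordering, mean))
    return min(feasible, key=lambda item: item[0]) if feasible else None


# Hypothetical population: half the users want A, the other half want B.
users = [(0.5, {"A"}), (0.5, {"B"})]
demanding = best_ordering("ABC", users, target=0.7)
relaxed = best_ordering("ABC", users, target=0.4)
```

With the higher target, A and B must occupy the top two ranks; with the lower target, the minimum-variance ordering pushes both down behind C, treating the two user groups more evenly at the price of a lower mean.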


Random Queries  &  ODP  Categories  


Result Relevance
•  Selected 42 ‘good’ queries for diversification (out of 427 examined)
   –  Not too long
   –  Not too specific
•  Baselines
   –  Google
   –  MMR* (MMR with reciprocal ranks from Google for Sim(d,Q))
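The MMR* baseline can be sketched as greedy maximal marginal relevance in which the query relevance Sim(d,Q) is replaced by the reciprocal of the result's rank in the baseline engine. The trade-off weight `lam`, the word-overlap `jaccard` similarity between results, and the function names are assumptions made for illustration; the slide does not say which document similarity was used.

```python
def jaccard(a, b):
    """Word-overlap similarity between two result titles or snippets (assumed measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def mmr_rerank(ranked_docs, k, lam=0.5, doc_sim=jaccard):
    """Greedy MMR over `ranked_docs` (baseline order, best first).

    Score of a candidate d: lam * (1 / baseline rank of d)
                            - (1 - lam) * max similarity of d to anything already selected.
    """
    relevance = {doc: 1.0 / (i + 1) for i, doc in enumerate(ranked_docs)}
    candidates = list(ranked_docs)
    selected = []
    while candidates and len(selected) < k:
        def score(doc):
            redundancy = max((doc_sim(doc, s) for s in selected), default=0.0)
            return lam * relevance[doc] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=1.0` the redundancy term vanishes and the baseline order is returned unchanged; lowering `lam` favors results that differ from those already chosen.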


Results

         Google   Diver   MMR*
Score    0.56     0.61    0.54

Scoring: 0 (non-relevant) … 1 (relevant with new content)


Query: manber

     Google                        Diver                              MMR*
1    Udi Manber – old home page    Udi Manber – Wikipedia             Udi Manber – old home page
2    Udi Manber – Wikipedia        Jeffrey Manber – Wikipedia         David Manber – imdb.com
3    Udi Manber – publications     Rachel Manber – academic profile   Udi Manber – Wikipedia


Query: sergey

     Google                            Diver                             MMR*
1    Sergey Brin – Wikipedia           Sergey Brin – Google Management   Sergey Brin – Wikipedia
2    Sergey Brin – Google Management   Sergey Korolyov – Wikipedia       Sergey Brin – Stanford
3    Sergey Brin – Stanford            Sergey Fomin (at U. Mich)         Sergey Brin (at forbes.com)


Query: jaguar

     Google                        Diver                         MMR*
1    Jaguar.com (car)              Jaguar.com (car)              Jaguar.com (car)
2    Jaguarusa.com (car)           Jaguarusa.com (car)           Schrodinger.com (unrelated)
3    Jaguar – Wikipedia (animal)   Jaguar – Wikipedia (animal)   Jaguar.ca (car)


Conclusions
•  Covered
   –  Big (everyone’s) data
   –  Four tools
      •  TOPIC
      •  ALVIN
      •  DeWild
      •  Diver
•  Surge in interest in big data
   –  Obama’s big data initiative
   –  This conference!


Conclusions
•  Search is far from being solved
   –  And our quest for the holy grail continues
•  No shortage of problems
•  Want to make sense of your big data?
   –  We sure can help!
