
Migrating to the Hadoop Ecosystem: An Experience Report
Eleni Stroulia
Professor, NSERC/AITF (w. IBM support) IRC on "Service Systems Management"
Computing Science, University of Alberta
April 24, 2012

http://ssrg.cs.ualberta.ca/


Outline
• Background – Why?
• PaaS with "the Hadoop Ecosystem": HDFS, Hadoop, and HBase – What?
• The TAPoR Migration – How?
• Closing Remarks


WHY?


Big Data… Cheap Hardware…
• Data is growing at an unprecedented rate
  – More people use the web and publish data
    • Internet users worldwide: 360 million in 2000; 2 billion in 2011 (about 1/3 of the earth's population)
    • Facebook, in 2009, was uploading 60 TB of images every week
  – Things are on the Internet
    • A jet engine produces 10 TB of data every 30 flight minutes
• Commodity hardware is cheap
• Owning and maintaining hardware is expensive


Internet World Usage
• 2000: 360 million
• 2011: 2 billion (about 1/3 of the earth's population)
• Source: http://www.internetworldstats.com/stats.htm


WHAT?


Cloud Infrastructure: IaaS
• Providers offer on-demand virtual computation, memory, and network resources
• Users install operating system images and application software on the machines
• Computing is billed as a utility (pay per use)

(Diagram: IaaS – Infrastructure as a Service)


Platform Cloud: PaaS
• Providers deliver a solution stack (on top of the infrastructure)
  – e.g., operating system, programming-language environment, database, web server
• Users develop, run, and maintain their applications on this platform
• Some platforms are "elastic", i.e., they adapt the underlying resources to application demands

(Diagram: PaaS – Platform as a Service, on top of IaaS – Infrastructure as a Service)


Software Cloud: SaaS
• Providers install and operate application software in the cloud
• Users access the software through cloud clients
• These applications are elastic
• Work is distributed by load balancers
• Applications can be multitenant (one machine may serve more than one user organization)
• SaaS pricing is typically a flat fee per user (monthly or yearly)

(Diagram: SaaS – Software as a Service, on top of PaaS – Platform as a Service, on top of IaaS – Infrastructure as a Service)

Google's Solution: Scalability through Virtualization
• Key observation: many computations are data parallel
• Solution elements (Google → Apache):
  1. MapReduce → Hadoop
  2. GFS → HDFS
  3. BigTable → HBase


MapReduce/Hadoop
• Inspired by functional programming:
  – Input
  – Map()
  – Copy/Sort
  – Reduce()
  – Output
• The platform takes care of
  – RPC
  – job scheduling
  – data-locality
  – fault tolerance
• Execution steps (as in the figure):
  1. The user program uses the MapReduce library to split the input files into M pieces.
  2. It starts master and worker nodes. The master assigns each of the workers one of the M map tasks and R reduce tasks.
  3. A worker assigned a map task reads the contents of the corresponding input split, parses key/value pairs, and passes each pair to the user-defined map function; the intermediate key/value pairs produced by the map function are buffered in memory.
  4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function.
  5. The master notifies a reduce worker about these locations; the reduce worker uses RPC to read the buffered data from the local worker disks and sorts it by the intermediate keys.
  6. The reduce worker iterates over the sorted intermediate data; for each unique intermediate key, it passes the key and the intermediate values to the user's reduce function. The output of the reduce function is appended to a final output file for this reduce partition.
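To make the Map / Copy-Sort / Reduce pipeline concrete, here is a minimal word-count sketch against Hadoop's org.apache.hadoop.mapreduce API. It is not part of the original slides; the class name and input/output paths are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for every input line, emit (word, 1) for each token.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);  // intermediate pairs are buffered and spilled by the framework
        }
      }
    }
  }

  // Reduce: the framework has already sorted and grouped by key; sum the counts per word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);   // optional local aggregation before the shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The job scheduling, data-locality, and fault-tolerance concerns listed above are entirely handled by the framework; the user code only supplies the map and reduce functions.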

GFS/HDFS
• Distributed file system
• Fault tolerance by replication
• Sequential reads of large data
• Random reads of small data (a few KBs)
• Write once; read multiple times
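A minimal sketch of the "write once, read many times" access pattern through the Hadoop FileSystem API; the file path is a placeholder and the cluster address is assumed to come from the standard core-site.xml configuration.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up fs.defaultFS from core-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/data/example.txt");  // illustrative path

    // Write once: an HDFS file is created, written sequentially, and closed.
    try (FSDataOutputStream out = fs.create(path, true /* overwrite */)) {
      out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read many times: large sequential reads are the efficient access pattern.
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```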


BigTable/HBase
• A distributed, 3-D table data structure
  – time as the third dimension (versioning)
• Rows sorted by a primary key
• Supports
  – updates
  – random reads
  – real-time querying


HBase Tables

• Sorted by RowKey
• A table has one or more "column families"
• A column family is
  – a group of column qualifiers (defined at run time)
  – stored as one file in HDFS
• Sparse tables are supported
• Timestamp: the 3rd dimension
• A cell is identified by Table:RowKey:CF:CQ:timestamp
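A minimal sketch of this data model with the HBase Java client (the 1.x-style API that was current at the time): the column family is declared when the table is created, the qualifier and timestamp are supplied at write time, and a read addresses a cell by row key, family, and qualifier. The table, row, and value names are illustrative, not from the slides.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {

      // Column families are fixed at table-creation time; keep up to 3 versions per cell.
      TableName name = TableName.valueOf("demo");
      HTableDescriptor desc = new HTableDescriptor(name);
      desc.addFamily(new HColumnDescriptor("cf").setMaxVersions(3));
      if (!admin.tableExists(name)) {
        admin.createTable(desc);
      }

      try (Table table = conn.getTable(name)) {
        // Column qualifiers are created at write time; the timestamp is the 3rd dimension.
        Put put = new Put(Bytes.toBytes("row1"));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q1"), 1L, Bytes.toBytes("v1"));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q1"), 2L, Bytes.toBytes("v2"));
        table.put(put);

        // A cell is addressed by table : row key : column family : qualifier : timestamp.
        Get get = new Get(Bytes.toBytes("row1"));
        get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q1"));
        get.setMaxVersions(3);   // return the stored versions, newest first
        Result result = table.get(get);
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q1"))));  // latest version: "v2"
      }
    }
  }
}
```

Because qualifiers exist only where a value was written, rows with different qualifiers coexist in the same table, which is what makes sparse tables cheap to store.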


HOW?


TAPoR (Text Analysis Portal for Research)


Three Migration Stories
• Migrating to IaaS ✗
  1. No architectural changes; deploy the software (with a load balancer) to multiple machines (on Amazon EC2)
     – Improves latency BUT does not address the scalability problem
• Migrating to PaaS ✓
  Using Hadoop, create indices, and
  2. store them on HDFS
  3. store them in HBase


Migrating to PaaS


Indices on HDFS

(Example index, as pictured on the slide: the word "foo" has an occurrence count of #6, appears in doc1, doc4, doc12, and has byte locations 3123, 4223 in doc1 and further locations in doc4; similar entries are shown for "bar" (#234) and "foo2" (#199).)

• An index has, for each word, a count of its occurrences in the collection, a list of the files the word appears in, and the byte locations within each of those files.
• We need to keep key-value pairs sorted by source file.
• Map: each word is emitted as a key, with its byte location and the corresponding document ID as values.
• Reduce: the indices for each word are combined into a collective index, sorted alphabetically.
• A separate index is sorted by word frequency (to support the top-k words operation).
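A minimal sketch of the indexing job described above. It assumes plain-text input where the document ID can be taken from the input file name and the byte offset of each line is the map key; the class names, the "docId:offset" value encoding, and the rough per-word offset arithmetic are illustrative, not taken from the TAPoR implementation.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WordIndex {

  // Map: emit word -> "docId:byteOffset" for every occurrence of the word.
  public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable lineOffset, Text line, Context context)
        throws IOException, InterruptedException {
      // Document ID derived from the input file name (assumption: one document per file).
      String docId = ((FileSplit) context.getInputSplit()).getPath().getName();
      long offset = lineOffset.get();
      for (String word : line.toString().split("\\s+")) {
        if (!word.isEmpty()) {
          context.write(new Text(word), new Text(docId + ":" + offset));
          offset += word.length() + 1;   // rough approximation of the word's byte position
        }
      }
    }
  }

  // Reduce: combine all occurrences of a word into one collective entry:
  //   word <TAB> totalCount <TAB> docId:offset, docId:offset, ...
  // Keys arrive sorted, so the resulting index is alphabetical by word.
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> occurrences, Context context)
        throws IOException, InterruptedException {
      List<String> entries = new ArrayList<>();
      for (Text occurrence : occurrences) {
        entries.add(occurrence.toString());
      }
      context.write(word, new Text(entries.size() + "\t" + String.join(", ", entries)));
    }
  }
}
```

The driver has the same shape as the word-count sketch earlier; the frequency-sorted index mentioned in the last bullet would be a second, small job over this output.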


Indices on HBase

• The row key is the document ID.
• Two column families, "bl" and "spl" ("byte location" and "special keywords").
• Example: the word "foo" occurred twice in Document 1, at byte offsets 3123 and 4223.
• The top-k words are stored in the "spl" column family.
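A minimal sketch of writing and reading the example row above with the HBase client. The table name, the "top-k" qualifier, and the comma-separated encoding of byte offsets are assumptions for illustration; the slide does not specify how TAPoR encodes the values, and the table with families "bl" and "spl" is assumed to exist.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexRowDemo {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table index = conn.getTable(TableName.valueOf("tapor_index"))) {  // illustrative name

      // Row key = document ID; family "bl" holds byte locations per word,
      // family "spl" holds the precomputed top-k ("special") keywords.
      Put put = new Put(Bytes.toBytes("doc1"));
      put.addColumn(Bytes.toBytes("bl"), Bytes.toBytes("foo"), Bytes.toBytes("3123,4223"));
      put.addColumn(Bytes.toBytes("spl"), Bytes.toBytes("top-k"), Bytes.toBytes("foo,bar"));
      index.put(put);

      // Random read: where does "foo" occur in doc1?
      Get get = new Get(Bytes.toBytes("doc1"));
      get.addColumn(Bytes.toBytes("bl"), Bytes.toBytes("foo"));
      Result row = index.get(get);
      System.out.println(Bytes.toString(row.getValue(Bytes.toBytes("bl"), Bytes.toBytes("foo"))));
    }
  }
}
```

Keeping the document ID as the row key gives the sorted-by-source-file property mentioned on the previous slide, since HBase stores rows sorted by row key.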


Results


In Conclusion
• Infrastructure and computation must scale to "Big Data"
• Migration must become more systematic
• Migration to IaaS is simpler but less effective than migration to PaaS
• Migration to PaaS usually requires rearchitecting for
  – data preprocessing and indexing
  – reimplementation of features to rely on the pre-computed indices
• The cost-effectiveness question is application specific


Thank You!
• Eleni Stroulia
• Professor, NSERC/AITF (w. IBM support) IRC on "Service Systems Management"
• Computing Science, University of Alberta
• http://ssrg.cs.ualberta.ca/
• Member of the SAVI Strategic Research Network – http://savinetwork.ca/
