
4/24/12

Migrating to the Hadoop Ecosystem: An experience report
Eleni Stroulia
Professor, NSERC/AITF (w. IBM support) IRC on "Service Systems Management"
Computing Science
University of Alberta

http://ssrg.cs.ualberta.ca/

Eleni Stroulia, CS, UoA (Analytics, Big Data, and the Cloud)


Outline
•  Background
   –  Why?
•  PaaS with “the Hadoop Ecosystem”: HDFS, Hadoop, and HBase
   –  What?
•  The TAPoR Migration
   –  How?
•  Closing Remarks


WHY?


Big Data… Cheap Hardware…
•  Data is growing at an unprecedented rate
   –  More people use the web and publish data
      •  Internet usage around the world: 360m in 2000; 2 billion in 2011 (1/3 of earth population)
      •  Facebook, in 2009, was uploading 60 TB of images every week
   –  Things are on the Internet
      •  A jet engine produces 10 TB of data every 30 flight minutes
•  Commodity hardware is cheap
•  Owning and maintaining hardware is expensive


Internet World Usage
•  2000: 360m
•  2011: 2 billion (1/3 of earth population)
•  Source: http://www.internetworldstats.com/stats.htm


WHAT?


Cloud Infrastructure: IaaS (Infrastructure as a Service)
•  Providers offer on-demand virtual computation, memory and network resources
•  Users install operating system images and application software on the machines
•  Computing is billed as a utility (pay per use)


Platform Cloud: PaaS
•  Providers deliver a solution stack (on top of the infrastructure)
   –  i.e., operating system, programming language environment, database, web server
•  Users develop, run and maintain their applications on this platform
•  Some platforms are “elastic”, i.e., adapt the underlying resources based on application demands

[Diagram: service stack, PaaS (Platform as a Service) on top of IaaS (Infrastructure as a Service)]


Software Cloud: SaaS
•  Providers install and operate application software in the cloud
•  Users use cloud clients to access the software
•  These applications are elastic
•  Work is distributed by load balancers
•  Applications can be multitenant (a machine may serve more than one user organization)
•  SaaS pricing is typically a (monthly or yearly) flat fee per user

[Diagram: service stack, SaaS (Software as a Service) on top of PaaS (Platform as a Service) on top of IaaS (Infrastructure as a Service)]


Google’s Solution
Scalability through Virtualization
•  Key observation: Many computations are data parallel
•  Solution Elements (Google → Apache):
   1.  MapReduce → Hadoop
   2.  GFS → HDFS
   3.  BigTable → HBase


MapReduce/Hadoop
•  Inspired by functional programming:
   –  Input
   –  Map()
   –  Copy/Sort
   –  Reduce()
   –  Output
•  The platform takes care of
   –  job scheduling
   –  data locality
   –  fault tolerance
•  Execution steps (numbered 1 to 6 in the slide's diagram):
   1.  The program uses the MapReduce library to split the input files into M pieces.
   2.  It starts master and worker nodes. The master assigns each of the workers any one of M map tasks and R reduce tasks.
   3.  A worker assigned a map task reads the contents of the corresponding input split, parses key/value pairs, and passes each pair to the user-defined map function. The intermediate key/value pairs produced by the map function are buffered in memory.
   4.  Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function.
   5.  The master notifies a reduce worker about these locations; the reduce worker uses RPC to read the buffered data from the map workers' local disks. It sorts the data by the intermediate keys.
   6.  The reduce worker iterates over the sorted intermediate data; for each unique intermediate key, it passes the key and the intermediate values to the user's reduce function. The output of the reduce function is appended to a final output file for this reduce partition.
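The map, copy/sort, and reduce phases can be illustrated with a single-process word count. This is a minimal sketch, not Hadoop's API: the function names (`map_fn`, `reduce_fn`, `partition`, `run_job`) and the byte-sum partitioning function are assumptions made for illustration, and the real platform distributes this work across worker nodes.

```python
from collections import defaultdict


def map_fn(_key, text):
    """User-defined map: emit (word, 1) for every word in one input split."""
    for word in text.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """User-defined reduce: sum the counts collected for one key."""
    return key, sum(values)


def partition(key, num_regions):
    """Hypothetical partitioning function: assign a key to one of R regions."""
    return sum(key.encode("utf-8")) % num_regions


def run_job(splits, num_regions=2):
    # Map phase: run map_fn on each split and bucket the pairs into R regions.
    regions = [defaultdict(list) for _ in range(num_regions)]
    for split_id, text in enumerate(splits):
        for key, value in map_fn(split_id, text):
            regions[partition(key, num_regions)][key].append(value)
    # Copy/sort and reduce: per region, sort by key, reduce each unique key.
    output = []
    for region in regions:
        output.append([reduce_fn(key, region[key]) for key in sorted(region)])
    return output  # one sorted list of (word, count) per reduce partition
```

Calling `run_job(["foo bar foo", "bar baz"])` yields one result list per reduce partition, mirroring the one-output-file-per-partition behaviour described in the last execution step.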

GFS/HDFS
•  Distributed file system
•  Fault tolerance by replication
•  Sequential reads of large data
•  Random reads of small data (a few KBs)
•  Write once; read multiple times
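To make "fault tolerance by replication" concrete, here is a toy sketch of block placement. It assumes a simple round-robin policy and hypothetical names (`place_blocks`, `readable_blocks`); the actual HDFS placement policy is more involved, so treat this only as an illustration of why replicas on distinct nodes keep data readable after node failures.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Split a file into fixed-size blocks and pick distinct nodes per block."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        # Assumed policy: round-robin over the node list (not HDFS's own policy).
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block
        for block, replicas in placement.items()
        if any(node not in failed_nodes for node in replicas)
    ]
```

With three replicas per block, every block survives the loss of any two nodes, because at least one replica always sits on a third node.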


BigTable/HBase
•  A distributed, 3-D table data structure
   –  time as the third dimension (versioning)
•  Rows sorted based on a primary key
•  Supports
   –  updates
   –  random reads
   –  real-time querying


HBase Tables
•  Sorted by RowKey
•  Table has one or more “column families”.
•  A column family is
   –  A group of column qualifiers (defined at run time)
   –  Stored as one file in HDFS
•  Sparse tables are supported
•  Timestamp: 3rd dimension
•  A cell is identified by Table:Rowkey:CF:CQ:timestamp
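The addressing scheme Table:Rowkey:CF:CQ:timestamp can be sketched with a small in-memory model. This is an illustrative stand-in, not the HBase client API; the class name `SparseTable` and its methods are hypothetical.

```python
class SparseTable:
    """Toy in-memory stand-in for one table; not the HBase client API."""

    def __init__(self, name, families):
        self.name = name
        self.families = set(families)  # column families are fixed up front
        self.cells = {}  # (row_key, family, qualifier) -> {timestamp: value}

    def put(self, row_key, family, qualifier, value, timestamp):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Qualifiers are created on first write, i.e. defined at run time.
        versions = self.cells.setdefault((row_key, family, qualifier), {})
        versions[timestamp] = value

    def get(self, row_key, family, qualifier, timestamp=None):
        versions = self.cells.get((row_key, family, qualifier))
        if not versions:
            return None  # sparse: cells that were never written are not stored
        if timestamp is None:
            timestamp = max(versions)  # default to the newest version
        return versions.get(timestamp)

    def scan(self):
        """Yield the latest value of each cell, ordered by row key."""
        for key in sorted(self.cells):
            yield key, self.get(*key)
```

Only cells that were written occupy space, which is what makes sparse tables cheap, and older versions stay reachable through an explicit timestamp.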


HOW?


TAPoR


Three Migration Stories
•  Migrating to IaaS ✗
   1.  No architectural changes; deploy the software (with a load balancer) to multiple machines (on Amazon EC2)
       Improves latency BUT does not address the scalability problem
•  Migrating to PaaS ✓
   Using Hadoop, create indices
   2.  store on HDFS
   3.  store to HBase


Migrating to PaaS


Indices on HDFS
[Figure: sample index entries, each with a word, its occurrence count, its document list, and per-document byte locations]
   foo    #6     doc1, doc4, doc12       doc1, 3123, 4223      doc4,
   bar    #234   doc1, doc4, doc12, ..   doc1, 3123, 4223, …   doc4,
   foo2   #199

•  An index has, for each word, a count of its occurrences in the collection, a list of the files that word appears in, and the byte locations for each of those files.
•  We need to keep key-value pairs sorted by source file
•  Map: each word is emitted as a key and its byte location and the corresponding document ID as values.
•  Reduce: the indices for each word are combined into a collective index; sorted alphabetically.
•  A separate index is sorted by word frequency (to support the top-k words operation)
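The map and reduce steps for index construction can be sketched in plain Python. This is a single-process illustration under stated assumptions, not the TAPoR implementation: the function names are hypothetical, words are taken to be runs of ASCII letters, digits, and underscores, and an in-memory dictionary stands in for the framework's copy/sort phase.

```python
import re
from collections import defaultdict


def map_document(doc_id, text):
    """Map: emit (word, (doc_id, byte_offset)) for each word occurrence."""
    data = text.encode("utf-8")
    for match in re.finditer(rb"\w+", data):
        yield match.group().decode("utf-8").lower(), (doc_id, match.start())


def reduce_word(word, postings):
    """Reduce: combine one word's postings into count, files, and offsets."""
    by_doc = defaultdict(list)
    for doc_id, offset in sorted(postings):  # keep pairs sorted by source file
        by_doc[doc_id].append(offset)
    return {"word": word, "count": len(postings), "docs": dict(by_doc)}


def build_indices(documents, k=3):
    grouped = defaultdict(list)  # stands in for the copy/sort phase
    for doc_id, text in documents.items():
        for word, posting in map_document(doc_id, text):
            grouped[word].append(posting)
    # Collective index, sorted alphabetically by word.
    alphabetical = [reduce_word(w, grouped[w]) for w in sorted(grouped)]
    # Separate index sorted by descending frequency, for the top-k operation.
    by_frequency = sorted(alphabetical, key=lambda e: (-e["count"], e["word"]))
    return alphabetical, by_frequency[:k]
```

Keeping a second, frequency-sorted index trades extra storage for a top-k lookup that does not need to rescan the alphabetical index.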


Indices on HBase
•  The row key is the document ID
•  Two column families, “bl” and “spl” (“byte location” and “special keywords”).
•  The word “foo” occurred twice in Document 1, at byte offsets 3123 and 4223.
•  The top K words are stored in the “spl” column family
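One way to picture this row layout is as nested dictionaries keyed by document ID. The sketch below is an assumption-laden illustration: the slide does not say how qualifiers are named, so it assumes one "bl" qualifier per word holding its byte offsets and one "spl" qualifier per top-K word holding its per-document count. The function name `to_rows` is hypothetical.

```python
from collections import Counter, defaultdict


def to_rows(postings, k=2):
    """Turn (doc_id, word, byte_offset) triples into rows keyed by document ID."""
    rows = defaultdict(lambda: {"bl": defaultdict(list), "spl": {}})
    for doc_id, word, offset in postings:
        rows[doc_id]["bl"][word].append(offset)
    for row in rows.values():
        counts = Counter({w: len(offsets) for w, offsets in row["bl"].items()})
        # Assumed layout: one "spl" qualifier per top-k word, holding its count.
        row["spl"] = dict(counts.most_common(k))
        row["bl"] = {w: sorted(offsets) for w, offsets in row["bl"].items()}
    return dict(rows)
```

For the slide's example, the row for Document 1 would carry the offsets 3123 and 4223 under the "bl" qualifier for "foo".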


Results


In Conclusion
•  Infrastructure and computation must scale to “Big Data”
•  Migration must become more systematic
•  Migration to IaaS is simpler but less effective than migration to PaaS
•  Migration to PaaS usually requires rearchitecting for
   –  Data preprocessing and indexing
   –  Reimplementation of features to rely on pre-computed indices
•  The cost-effectiveness question is application specific


Thank You!
•  Eleni Stroulia
•  Professor, NSERC/AITF (w. IBM support) IRC on "Service Systems Management"
•  Computing Science
•  University of Alberta
•  http://ssrg.cs.ualberta.ca/
•  Member of the SAVI Strategic Research Network - http://savinetwork.ca/
