
Introduction to Data Science with Hadoop
Glynn Durham, Senior Instructor, Cloudera
glynn@cloudera.com

© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.


Terms

I will cover:
– Hadoop, Hadoop ecosystem
– HDFS
– MapReduce
– Sqoop
– Flume
– Hive
– Pig
– Mahout
– Machine learning
– Data science using Hadoop

with a few extras:
– YARN
– HBase
– Impala
– Oozie
– data products


Hadoop

Hadoop is:
– a platform for big data
– several Apache Software Foundation (ASF) projects
– free open source software

Major parts:
– Hadoop Core
– Hadoop ecosystem


Hadoop Core Main Features: File System and Batch Programming


Hadoop Core

Hadoop Core consists of:
– HDFS (Hadoop Distributed File System), for storage
– MapReduce, for batch programming


HDFS Writes


HDFS Reads


HDFS Strengths and Weaknesses

HDFS is good at:
– storing enormous files
– storing a lot of data reliably
– throughput on sequential writes
– throughput on sequential reads of a file or part of a file

HDFS is not good at:
– high speed random reads of parts of a file

HDFS cannot:
– update any part of a file once written*

* but you can always write a new file, and/or delete, move, and rename files and directories
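
The footnote on the HDFS slide (write a new file, then move or rename) can be sketched as follows. This is only an illustration on the local filesystem using Python's standard library, not the HDFS API. The helper names `replace_file`, `read_lines`, and `update_record` are hypothetical.

```python
import os
import tempfile

def replace_file(path, new_lines):
    """Replace `path` wholesale: write a new file, then rename it into place."""
    directory = os.path.dirname(path) or "."
    # Write the complete new version next to the target (same filesystem).
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        for line in new_lines:
            tmp.write(line + "\n")
    # Rename over the old file; the old contents are never edited in place.
    os.replace(tmp_path, path)

def read_lines(path):
    """Read the whole file sequentially, one record per line."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f]

def update_record(path, old, new):
    """'Update' by rewriting: read the old file, emit a corrected copy."""
    corrected = [new if line == old else line for line in read_lines(path)]
    replace_file(path, corrected)
```

Changing one record costs a full sequential read and a full sequential write, which are the two things the slide lists as HDFS strengths.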


MapReduce: Programming with Simple Functions


MapReduce Chains


MapReduce at Scale


MapReduce in Hadoop


MapReduce Strengths and Weaknesses

MapReduce is good at:
– processing enormous amounts of data
– scaling out as you add more machines
– continuing to completion, even when some machines die

MapReduce is not good at:
– running any algorithm you can think up
– algorithms that require shared state overall*

* but maybe you can get clever with your algorithm design

MapReduce cannot:
– run in real time: MapReduce jobs are batch jobs
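
To make the "no shared state" point concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps in plain Python. It is a toy stand-in and not Hadoop's API. The names `map_words`, `reduce_counts`, and `run_job` are hypothetical.

```python
from collections import defaultdict

def map_words(line):
    """Map: one input record in, zero or more (key, value) pairs out."""
    for word in line.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce: one key with all of its values in, one result out."""
    return word, sum(counts)

def run_job(records, mapper, reducer):
    """Toy single-process stand-in for the framework's map, shuffle, reduce."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    # reduce phase: each key is handled independently of every other key
    return [reducer(key, groups[key]) for key in sorted(groups)]

lines = ["big data big clusters", "big jobs"]
print(run_job(lines, map_words, reduce_counts))
```

Neither function reads or writes anything outside its arguments. That is why a framework can run many copies on different machines and simply rerun one if a machine dies.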


Detour: YARN, Yet Another Resource Negotiator (near future)


Hadoop Ecosystem

The Hadoop Ecosystem consists of other projects that round out Hadoop Core to make it a useful platform:
– Sqoop, for RDBMS integration
– Flume, for event ingestion
– Hive, for "SQL"-like high-level programming
– Pig, another high-level programming paradigm
– Mahout, a Java library for machine learning in Hadoop

Plus:
– HBase, a "NoSQL" database system
– Oozie, a workflow manager for Hadoop actions
– ....


Sqoop: RDBMS to Hadoop and Back


Flume: Ingesting Continuing Event Data


Detour: General File Input/Output


MapReduce revisited: How to write MapReduce programs?

Java MapReduce API
• The most expressive technique possible
• The most work, by far
• (Can be easier with Hadoop Streaming: a way to use streaming programming such as shell scripting or Python)
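
As a rough illustration of the Hadoop Streaming style mentioned above, the sketch below writes a mapper and a reducer as plain Python functions over lines of text, exchanging tab-separated key/value pairs. The input format (`region,amount` per line) and the script layout are assumptions made for this example. The reducer relies on its input arriving sorted by key.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit 'region<TAB>amount' for each well-formed 'region,amount' line."""
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) == 2:               # silently skip malformed records
            yield f"{fields[0]}\t{fields[1]}"

def reducer(sorted_lines):
    """Sum amounts per region; input must arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for region, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{region}\t{sum(int(amount) for _, amount in group)}"

# When run as a script, choose the role from the first argument.
if __name__ == "__main__" and sys.argv[1:2] in (["map"], ["reduce"]):
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

Saved under a hypothetical name such as `job.py`, the pair can be tried locally with an ordinary shell pipeline: `cat sales.csv | python job.py map | sort | python job.py reduce`. Here `sort` plays the part of the shuffle.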


Hive: MapReduce as "SQL"

• Familiar language and programming paradigm
• Provides interface to many SQL-compliant tools


Detour: Impala, High Speed Analytics in Hadoop

• 5 to 30 times faster than Hive queries (sometimes hundreds of times faster!)
• Cloudera exclusive offering, but Apache licensed, so it's free and open source


Impala Does Not Use MapReduce


Detour: HBase, A NoSQL Database System


Detour: A bit more about HBase

HBase is a NoSQL database system:
– programmers create and use database tables
– high volume, high performance access to individual cells
– much weaker query language than SQL
– lacks ACID-compliant transactions

HBase is not strictly needed to do "data science"
– a resource hog; competes with analytical programs
– often deployed on its own separate cluster
– may be part of your organization's data storage and delivery, so you may need to get or put data into an HBase system*

* (or other NoSQL system)


Pig: Another Language for MapReduce


Mahout: Machine Learning in MapReduce

Mahout is:
– a collection of algorithms, mainly focused on "the three C's" of machine learning
– written in Java
– largely implemented over Hadoop MapReduce
– invocable from the command line
– extensible, with the Java API

Mahout is not:
– a turnkey solution for doing machine learning
– always user-friendly


Machine Learning

"The three C's" of machine learning:
– Classification
– Clustering
– Collaborative filtering (recommenders)
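
For a feel of what one of the three C's involves, here is a toy k-means clustering loop on plain numbers. It is a from-scratch sketch for illustration only. It is not Mahout's implementation or API, and the function name `kmeans_1d` and its parameters are hypothetical.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Toy k-means on numbers: assign each point to its nearest centroid,
    then move each centroid to the mean of its points, and repeat."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centers, groups = kmeans_1d([1, 2, 3, 10, 11, 12], centroids=[0, 5])
print(centers, groups)
```

The assignment step is independent for each point, so it can be spread across machines like a map phase. Recomputing each centroid from its points is a per-key aggregation, like a reduce phase. That makes each iteration a natural fit for one MapReduce job.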


Supervised Machine Learning: Classification


Machine Learning: Clustering


Machine Learning: Collaborative Filtering for Recommenders


Simple Enterprise Deployment: Hadoop as ETL Appliance


Detour: Oozie, Workflow within Hadoop

Simple workflow within Hadoop:
1. Clear out staging directory in HDFS
2. Sqoop import from OLTP tables
3. Hive (or Pig) script to transform data
4. Sqoop export to data warehouse
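
The four steps above can be sketched as an ordered list of command lines with a tiny runner that stops at the first failure. Oozie itself describes workflows in XML, so this plain-Python version is only a stand-in for the idea. Every path, table name, JDBC URL, and script name below is a hypothetical placeholder, and the example prints the commands rather than executing them.

```python
import shlex

# All paths, table names, and JDBC URLs below are hypothetical placeholders.
STAGING_DIR = "/staging/orders"

def build_workflow():
    """Return the four steps as (name, argv) pairs, in execution order."""
    return [
        ("clear staging", ["hdfs", "dfs", "-rm", "-r", "-f", STAGING_DIR]),
        ("sqoop import", ["sqoop", "import",
                          "--connect", "jdbc:mysql://oltp.example.com/shop",
                          "--table", "orders",
                          "--target-dir", STAGING_DIR]),
        ("transform", ["hive", "-f", "transform_orders.hql"]),
        ("sqoop export", ["sqoop", "export",
                          "--connect", "jdbc:mysql://dw.example.com/warehouse",
                          "--table", "orders_summary",
                          "--export-dir", "/warehouse/orders_summary"]),
    ]

def run(workflow, execute):
    """Run steps in order and stop at the first one that fails."""
    for name, argv in workflow:
        if execute(argv) != 0:
            return f"failed at: {name}"
    return "succeeded"

def dry_run(argv):
    """Print the command instead of executing it; always reports success."""
    print(shlex.join(argv))
    return 0

print(run(build_workflow(), dry_run))
```

Passing a different `execute` function, for example one built on `subprocess.call`, would run the commands for real on a machine where those tools are installed.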


Hadoop: The Bigger Picture


Data Science with Hadoop

A data scientist will:
1. Identify internal and external data for potential use (general data wrangling tools).
2. Help build ingestion pipelines to obtain data for use (Flume, Sqoop, other).
3. Examine, clean, and anonymize ingested data (Hive, Impala, Pig, Hadoop Streaming).
4. Shape data into useful formats (Hive, Pig).
5. Explore data sets to gain understanding of problems, trends, reality (Impala, Hive, Pig, statistical programming).
6. Build predictive models using statistical programming, machine learning (Mahout).
7. Contribute to data products: products in the organization that are built in large part from the data itself (Mahout, Sqoop export, general file export).
8. Conduct experiments with data products, quantifying benefits and/or tradeoffs of system changes (Flume, Sqoop, statistical tests).
9. Communicate results and insights to stakeholders (visualization*).


Visualization: Needs Visualization Software


Thank you!

Questions? Contributions?

Glynn Durham, Senior Instructor, Cloudera
glynn@cloudera.com


Dataedge2013 glynndurham