
Introduction to Data Science with Hadoop
Glynn Durham, Senior Instructor, Cloudera
glynn@cloudera.com

© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.



Terms I will cover:
– Hadoop, Hadoop ecosystem
– HDFS
– MapReduce
– Sqoop
– Flume
– Hive
– Pig
– Mahout
– Machine learning
– Data science using Hadoop

with a few extras:
– YARN
– HBase
– Impala
– Oozie
– data products



Hadoop

Hadoop is:
– a platform for big data
– several Apache Software Foundation (ASF) projects
– free open source software

Major parts:
– Hadoop Core
– Hadoop ecosystem



Hadoop Core Main Features: File System and Batch Programming



Hadoop Core  

Hadoop Core consists of:
– HDFS (Hadoop Distributed File System), for storage
– MapReduce, for batch programming



HDFS Writes  



HDFS Reads  



HDFS Strengths and Weaknesses

HDFS is good at:
– storing enormous files
– storing a lot of data reliably
– throughput on sequential writes
– throughput on sequential reads of a file or part of a file

HDFS is not good at:
– high speed random reads of parts of a file

HDFS cannot:
– update any part of a file once written*

* but you can always write a new file, and/or delete, move, and rename files and directories
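The write-once rule suggests a common update pattern: write a complete new file, then rename it over the old one. The following is a minimal sketch of that pattern, using the local filesystem as a stand-in for HDFS. The function name `replace_file` and the file names are hypothetical, and a real cluster would go through an HDFS client rather than `os.replace`.

```python
import os
import tempfile


def replace_file(path, new_lines):
    """'Update' a file the write-once way: write a new file, then rename it.

    Local-filesystem stand-in for the HDFS pattern; this is not an HDFS API.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write the full new contents to a temporary file in the same directory.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        for line in new_lines:
            tmp.write(line + "\n")
    # Swap the new file into place; the old contents are never edited.
    os.replace(tmp_path, path)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    target = os.path.join(workdir, "events.txt")
    replace_file(target, ["a", "b"])
    replace_file(target, ["a", "b", "c"])  # "append" by rewriting everything
    with open(target) as f:
        print(f.read().splitlines())
```

The cost of this pattern is that every change rewrites the whole file, which is why it suits large batch outputs better than frequent small updates.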



MapReduce: Programming with Simple Functions



MapReduce Chains  



MapReduce at Scale



MapReduce in Hadoop



MapReduce Strengths and Weaknesses

MapReduce is good at:
– processing enormous amounts of data
– scaling out as you add more machines
– continuing to completion, even when some machines die

MapReduce is not good at:
– running any algorithm you can think up
– algorithms that require shared state overall*

* but maybe you can get clever with your algorithm design

MapReduce cannot:
– run in real time: MapReduce jobs are batch jobs
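One way to read "get clever with your algorithm design": a per-key mean cannot be computed by averaging partial averages, but it can be computed if each map call emits a (sum, count) pair and the reduce step adds those up. The sketch below simulates map, shuffle, and reduce in a single Python process. The function names and sample records are illustrative assumptions, not Hadoop APIs.

```python
from collections import defaultdict


def map_record(record):
    """Map: turn one 'city,temperature' line into (key, (sum, count))."""
    city, temp = record.split(",")
    yield city, (float(temp), 1)


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_mean(key, values):
    """Reduce: add up partial sums and counts, then divide once at the end."""
    total = sum(s for s, _ in values)
    count = sum(c for _, c in values)
    return key, total / count


def run_job(records):
    """Chain the three phases; no map call needs to see another record."""
    mapped = (pair for record in records for pair in map_record(record))
    return dict(reduce_mean(k, vs) for k, vs in shuffle(mapped).items())


if __name__ == "__main__":
    sample = ["austin,30", "boston,10", "austin,20", "boston,14"]
    print(run_job(sample))
```

Because each map call looks at one record only and each reduce call at one key only, the work can be spread across machines without any shared state.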



Detour: YARN, Yet Another Resource Negotiator (near future)



Hadoop Ecosystem

The Hadoop Ecosystem consists of other projects that round out Hadoop Core to make it a useful platform:
– Sqoop, for RDBMS integration
– Flume, for event ingestion
– Hive, for "SQL"-like high-level programming
– Pig, another high-level programming paradigm
– Mahout, a Java library for machine learning in Hadoop

Plus:
– HBase, a "NoSQL" database system
– Oozie, a workflow manager for Hadoop actions
– ....



Sqoop: RDBMS to Hadoop and Back



Flume: Ingesting Continuing Event Data



Detour: General File Input/Output



MapReduce revisited: How to write MapReduce programs?

Java MapReduce API
• The most expressive technique possible
• The most work, by far
• (Can be easier with Hadoop Streaming: a way to use streaming programming such as shell scripting or Python)
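As a rough illustration of the Hadoop Streaming style, here is a word-count sketch in Python. Streaming programs read lines on standard input and write tab-separated key/value lines on standard output, and the reducer receives its input sorted by key. The function names and the "map"/"reduce" command-line argument are assumptions made for this example.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum counts per word; the input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical convention: pass "map" or "reduce" as the only argument.
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

Locally, the same flow can be imitated with a shell pipeline of the form `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`, where `wordcount.py` and `input.txt` are assumed file names. The `sort` step plays the role of the shuffle.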



Hive: MapReduce as "SQL"
• Familiar language and programming paradigm
• Provides interface to many SQL-compliant tools
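To make the "familiar language" point concrete, the sketch below computes the same per-group count twice: once with a single GROUP BY query and once with hand-written map and reduce steps. Python's built-in `sqlite3` module stands in for Hive here, purely so the example runs anywhere. The table, column names, and rows are made up, and HiveQL differs from SQLite's dialect in its details.

```python
import sqlite3
from collections import Counter

ROWS = [("ann", "eng"), ("bob", "eng"), ("cat", "ops")]


def count_with_sql(rows):
    """Declarative version: one GROUP BY query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, dept TEXT)")
    conn.executemany("INSERT INTO employees VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT dept, COUNT(*) FROM employees GROUP BY dept ORDER BY dept"
    ).fetchall()
    conn.close()
    return result


def count_with_map_reduce(rows):
    """Hand-written version: map each row to its key, then count per key."""
    keys = (dept for _, dept in rows)  # map: emit the grouping key
    return sorted(Counter(keys).items())  # reduce: count values per key


if __name__ == "__main__":
    print(count_with_sql(ROWS))
    print(count_with_map_reduce(ROWS))
```

The query states what result is wanted and leaves the map and reduce plumbing to the engine, which is the convenience the slide attributes to Hive.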



Detour: Impala, High Speed Analytics in Hadoop
• 5 to 30 times faster than Hive queries (sometimes hundreds of times faster!)
• Cloudera exclusive offering, but Apache licensed, so it's free and open source



Impala Does Not Use MapReduce



Detour: HBase, A NoSQL Database System



Detour: A bit more about HBase

HBase is a NoSQL database system:
– programmers create and use database tables
– high volume, high performance access to individual cells
– much weaker query language than SQL
– lacks ACID-compliant transactions

HBase is not strictly needed to do "data science":
– a resource hog; competes with analytical programs
– often deployed on its own separate cluster
– may be part of your organization's data storage and delivery, so you may need to get or put data into an HBase system*

* (or other NoSQL system)



Pig: Another Language for MapReduce



Mahout: Machine Learning in MapReduce

Mahout is:
– a collection of algorithms, mainly focused on "the three C's" of machine learning
– written in Java
– largely implemented over Hadoop MapReduce
– invocable from the command line
– extensible, with the Java API

Mahout is not:
– a turnkey solution for doing machine learning
– always user-friendly



Machine Learning

"The three C's" of machine learning:
– Classification
– Clustering
– Collaborative filtering (recommenders)



Supervised Machine Learning: Classification



Machine Learning: Clustering



Machine Learning: Collaborative Filtering for Recommenders
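A minimal sketch of one simple form of collaborative filtering, assuming item co-occurrence as the similarity signal: items that often appear together in other users' histories are recommended to a user who has only some of them. This is an illustration in plain Python, not Mahout's implementation. The function names and sample data are made up.

```python
from collections import Counter
from itertools import permutations


def cooccurrence(baskets):
    """Count how often each ordered pair of items appears for the same user."""
    counts = Counter()
    for items in baskets.values():
        for a, b in permutations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return counts


def recommend(user, baskets, top_n=2):
    """Score items the user has not seen by co-occurrence with seen items."""
    counts = cooccurrence(baskets)
    seen = set(baskets[user])
    scores = Counter()
    for (a, b), n in counts.items():
        if a in seen and b not in seen:
            scores[b] += n
    # Sort by score (descending), then by name, for a deterministic result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


if __name__ == "__main__":
    history = {
        "ann": ["hive", "pig"],
        "bob": ["hive", "pig", "mahout"],
        "cat": ["hive", "mahout"],
        "dan": ["pig", "flume"],
    }
    print(recommend("ann", history))
```

Counting pairs is an operation that splits naturally into map (emit pairs per user) and reduce (sum per pair), which is one reason recommenders of this kind fit a batch platform.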



Simple Enterprise Deployment: Hadoop as ETL Appliance



Detour: Oozie, Workflow within Hadoop

Simple workflow within Hadoop:
1. Clear out staging directory in HDFS
2. Sqoop import from OLTP tables
3. Hive (or Pig) script to transform data
4. Sqoop export to data warehouse
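The four steps above form a simple sequential workflow: each step runs only if the previous one succeeded. Oozie expresses this in its own workflow definitions. The Python sketch below only imitates the control flow, with placeholder actions standing in for the real HDFS, Sqoop, and Hive invocations. The name `run_workflow` and the step labels are hypothetical.

```python
def run_workflow(steps):
    """Run (name, action) pairs in order; stop at the first failure.

    Returns the list of step names that completed successfully.
    """
    completed = []
    for name, action in steps:
        if not action():
            print(f"step failed, stopping workflow: {name}")
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder actions; real steps would call out to HDFS, Sqoop, or Hive.
    steps = [
        ("clear staging directory", lambda: True),
        ("sqoop import", lambda: True),
        ("hive transform", lambda: True),
        ("sqoop export", lambda: True),
    ]
    print(run_workflow(steps))
```

Stopping at the first failure matters here because a later step, such as the export, would otherwise run against stale or partial staging data.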



Hadoop: The Bigger Picture



Data Science with Hadoop

A data scientist will:
1. Identify internal and external data for potential use (general data wrangling tools).
2. Help build ingestion pipelines to obtain data for use (Flume, Sqoop, other).
3. Examine, clean, and anonymize ingested data (Hive, Impala, Pig, Hadoop Streaming).
4. Shape data into useful formats (Hive, Pig).
5. Explore data sets to gain understanding of problems, trends, reality (Impala, Hive, Pig, statistical programming).
6. Build predictive models using statistical programming, machine learning (Mahout).
7. Contribute to data products: products in the organization that are built in large part from the data itself (Mahout, Sqoop export, general file export).
8. Conduct experiments with data products, quantifying benefits and/or tradeoffs of system changes (Flume, Sqoop, statistical tests).
9. Communicate results and insights to stakeholders (visualization*).



Visualization: Needs Visualization Software



Thank you!
Questions? Contributions?
Glynn Durham, Senior Instructor, Cloudera
glynn@cloudera.com
