
Copyright ©  ClusterChimps.org  2011    

(Why ClusterChimps…? ClusterMonkey was taken!)

All rights reserved

This “book” is intended for developers and researchers that know how to program in C. We will walk you through building a Virtual Supercomputer and teach you how to program it. The information provided is not intended to make you an expert but to expose you to the basic underpinnings. You will have to do more work on your own to become an expert (“bummer”).

Disclaimer: While we discuss products from Nvidia and Parallels we have no relationship with either company other than that of fanboy.

About ClusterChimps.org

ClusterChimps is dedicated to helping bring inexpensive supercomputing to the masses by leveraging emerging technologies coupled with bright ideas and open source software. We do this because we believe it will help advance computation intensive research areas including basic research, engineering, earth science, biology, materials science, and alternative energy research just to name a few.

Throughout this text we have “borrowed” images, code snippets, and text from various Nvidia publications. We have also “borrowed” images and text from Wikipedia, openMPI documentation and PVM documentation. We hope no one gets their “undies in a bind” over this.

About the Author

Dr. Zaius is a well renowned orangutan in the fields of cluster computing and GPGPU programming. He has spent most of his career in the financial industry working at exchanges, investment banks, and hedge funds. He is currently the driving force behind the site ClusterChimps.org. Originally from the island of Borneo, Dr. Zaius now resides in New York City with his wife and 3 children. He can be reached at zaius@clusterchimps.org.


How to Read This Book

The first chapter should provide you with information regarding the general idea behind a Virtual Supercomputer. I would suggest that everyone start here. Once you have wrapped your head around the basic concept, where you go next depends on what you already know and what you want to know.

If you plan on building your own Virtual Supercomputer and you intend to work through the examples you should read through Chapter 2 (Some Assembly Required). This will provide you with information regarding where to get the various software you will need and how to configure your hardware.

If parallel programming concepts are foreign to you be sure to take a look at Chapter 3. It will explain the difference between task parallelism and data parallelism and go over the manager / worker pattern that will be used throughout the book.

Your next stop should be the CUDA and OpenCL intro chapters (4 and 5). These will walk you through the basics of CUDA and OpenCL so that you have an understanding of what they are, how to use them, and what you should use them for. If you already know this then you might want to skip over to Chapters 6 and 7, which will walk you through using MPI and PVM in conjunction with CUDA / OpenCL.

If you are serious about using your Virtual Supercomputer for solving real life computational problems then you should definitely take a look at Chapters 8 (Optimized PVM) and 9 (Fault Tolerant PVM). These chapters will show you how to write PVM based applications for your Virtual Supercomputer that are as parallel as possible and that are resilient to node failures and reactive to grid expansion and contraction.

Table of Contents

Introduction
    Supercomputer Design
    CPU vs. GPU (The Big Smackdown)
    What’s a Virtual Machine
    Hey… What’s The Big Idea
    Programming The Beast
    Virtual Supercomputer vs Beowulf Cluster
    Summary
Some Assembly Required
    Installing Tesla Cards
    Installing PWE
    Linux VM
    Installing CUDA / OpenCL
    Installing MPI
    Installing PVM
    Installing Code Examples
    Summary
Parallel Program Decomposition
    Task vs Data Parallelism
    Granularity
    Manager Worker
    Example: Calculating the Value of a Portfolio
    Gustafson’s Law
    Summary
Intro to CUDA
    CUDA Program Structure
    Matrix Multiplication 1
    Matrix Multiplication 2
    Matrix Multiplication 3
    Summary
Intro to OpenCL
    OpenCL Program Structure
    Matrix Multiplication 1
    OpenCL Compiler
    Matrix Multiplication 2
    Matrix Multiplication 3
    Summary
Intro to MPI
    MPI Overview
    MPI Hello World
    CPU Binomial Options Model
    CUDA Binomial Options Model
    OpenCL Binomial Options Model
    MPI and Fault Tolerance
    Summary
Intro to PVM
    PVM Overview
    PVM Hello World
    Binomial Option Model
    Summary
Optimized PVM
    Optimization
    Implementation Details
    Optimized Manager
    Optimized (Threaded) Worker
    Summary
Fault Tolerant PVM
    Overview
    Determining that “BAD Things” Have Happened
    Implementation Details
    Fault Tolerant Worker
    Fault Tolerant Manager
    A Simple System Monitor
    Summary
Epilogue
Appendix I: OpenCL Compiler
Appendix II: Measuring Flops
Appendix III: Binomial Tree Option Pricing Model
Appendix IV: Serial Binomial Option Pricing Model



Introduction

Remember a few years ago when you bought that brand new computer? It had a 100 MHz processor and 200 MB of RAM. You were the talk of the neighborhood. You were so proud of your system. You droned on and on about it every chance you had. People envied the fact that you could actually run more than one application at a time. Everyone was so jealous of your system that they wished they could be you. And then… one day at a block party… Bob, from down the street, drops a bombshell. Bob just bought a 200 MHz system with 300 MB of RAM. Your system was no longer the fastest in the neighborhood! Remember the shame… the humiliation… how were you going to be able to show your face in public again? You’d show them… you’d show them all! As soon as the 233 MHz models came out you ran to the store and snatched one up. This machine was so powerful that you could almost run the latest version of Windoz. Once again you were the top dog… the alpha male… the head banana… the big cheese! Oh what a wonderful feeling that was.

As time passed you got older and wiser and you realized that this was a sucker’s game that was draining your children’s college fund (actually your wife realized it and you were forced to come to terms with it). So you threw in the towel. You learned to live within your means. You stopped upgrading your system. You lost the computational arms race of the neighborhood. Now all you have are the fading memories of how it felt to have more computational capacity at your fingertips than anyone else in your close-knit circle of friends.
Well what would you say if I told you that you could be top dog again? What if I told you that for a mere $250,000 you could not only have the fastest computer in the neighborhood again but one of the top 500 fastest computers on the planet! Now I’m sure that you wouldn’t think twice about taking out a second mortgage to be top dog of planetary proportions but, unless you’re up for a divorce, perhaps you had better find a different source of funding. Perhaps you can convince the organization you work for to fund your little project? Of course you won’t own the system but just imagine the look on Bob’s face when you tell him you built it. If I’ve piqued your interest then read on. The following pages will provide you with step by step instructions on how to build an inexpensive Virtual Supercomputer from commodity desktops, Nvidia GPUs, and Parallels virtualization software, and how to program it with MPI / PVM, CUDA, and OpenCL.


Supercomputer Design

A supercomputer is defined as a computer that is at the frontline of current processing capacity, particularly speed of calculation. Supercomputers are used for solving highly calculation-intensive problems involving quantum physics, weather forecasting, climate research, molecular modeling, physical simulations and financial modeling (just to name a few). Supercomputers were first introduced in the 1960s by Seymour Cray at Control Data Corporation (CDC). In the 1970s Cray left to form his own company, Cray Research. His new designs took the industry by storm, holding the top spot in supercomputing for over 5 years (1985–1990).

The term supercomputer itself is rather fluid, and today's supercomputer tends to become tomorrow's ordinary computer. CDC's early machines were simply very fast scalar processors, some ten times the speed of the fastest machines offered by other companies. In the 1970s most supercomputers were dedicated to running a vector processor, and many of the newer players developed their own such processors at a lower price to enter the market. The early to mid-1980s saw machines with a modest number of vector processors working in parallel become the standard. Typical numbers of processors were in the range of four to sixteen. In the later 1980s and 1990s, attention turned from vector processors to massively parallel processing systems with thousands of "ordinary" CPUs, most being off the shelf units. Most supercomputers running today were built with "off the shelf" server-class microprocessors, such as the PowerPC, Opteron, or Xeon, and are really just highly-tuned computer clusters using commodity processors combined with high-speed interconnects and high performance file systems.

The only constant in the IT world is “change”. The supercomputer industry is morphing its system designs again. This time it has chosen more of a hybrid approach. Vendors are now designing systems that are composed of clusters of machines consisting of traditional CPUs and vector processors. In this particular case the vector processors are actually manufactured by Nvidia and are masquerading as GPUs (more on that later). The industry’s new design is best exemplified by the Tianhe-1A. As of January 2011, according to www.top500.org, the Tianhe-1A is the fastest supercomputer on the planet with a theoretical peak performance of 4.7 quadrillion floating point operations per second. I don’t know about you but I find it a bit difficult to comprehend numbers that big. To give you a frame of reference, if you had $4.7 quadrillion you could pay off the U.S. national debt… 324 times! The Tianhe-1A has over 14,000 Xeon X5670 processors and 7,000 Tesla M2050 vector processors (GPUs).

Well if we want our virtual supercomputer to be near the top of the pack we should probably follow the industry’s heterogeneous design. Before we copy it, let’s dig a bit deeper to see why it is the best route to go.


CPU vs. GPU (The Big Smackdown)

A CPU (Central Processing Unit) functions by executing a sequence of instructions. These instructions reside in some sort of main memory and typically go through four distinct phases during their CPU lifecycle: fetch, decode, execute, and writeback. During the fetch phase the instruction is retrieved from main memory and loaded onto the CPU. Once the instruction is fetched it is decoded, or broken down into an opcode (the operation to be performed) and operands containing values or memory locations to be operated on by the opcode. Once it is determined what operation needs to be performed the operation is executed. This may involve copying memory to locations specified in the instruction’s operands or having the ALU (arithmetic logic unit) perform a mathematical operation. The final phase is the writeback of the result to either main memory or a CPU register. After the writeback the entire process repeats. This simple form of CPU operation is referred to as subscalar in that the CPU executes one instruction operating on one or two pieces of data at a time. Given the sequential nature of this design it will take four clock cycles to process a single instruction. That is why this type of operation is referred to as “sub”scalar, or less than one instruction per clock cycle.

To make their CPUs faster, chip manufacturers started to create parallel execution paths in their CPUs by pipelining their instructions. Pipelining allows more than one step in the CPU lifecycle to be performed at any given time by breaking down the pathway into discrete stages. This separation can be compared to an assembly line, in which an instruction is made more complete at each stage until it exits the execution pipeline and is retired. While pipelining instructions will result in faster CPUs, the best performance that can be achieved is scalar, or one complete instruction per cycle.

To achieve speeds faster than scalar (that is, superscalar) chip manufacturers started to embed multiple execution units in their designs, increasing their degree of parallelism even more. In a superscalar pipeline, multiple instructions are read and passed to a dispatcher, which decides whether or not the instructions can be executed in parallel. If so, they are dispatched to available execution units, resulting in the ability for several instructions to be executed simultaneously. The more instructions a superscalar CPU is able to dispatch simultaneously to waiting execution units, the more instructions will be completed in a given clock cycle.

As manufacturing techniques reach theoretical limits in miniaturization, increased use of parallel computing in the form of multi-core processors has been employed to improve overall processing performance. A multi-core processor is a single computing component with two or more independent actual processors (called "cores"), which are the units that read and execute program instructions. Designers may couple cores in a multi-core device tightly or loosely. For example, cores may or may not share caches, and they may implement message passing or shared memory inter-core communication methods. Common network topologies to interconnect cores include bus, ring, two-dimensional mesh, and crossbar. Homogeneous multi-core systems include only identical cores; heterogeneous multi-core systems have cores which are not identical. Just as with single-processor systems, cores in multi-core systems may implement superscalar, vector processing, SIMD, or multithreaded architectures. The improvement in performance gained by the use of a multi-core processor depends very much on the software algorithms used and their implementation. Techniques like instruction pipelining, adding multiple execution units, and adding multiple cores have enabled modern day CPUs to significantly increase their degree of instruction parallelism; however, they still lag far behind GPUs.

A GPU (Graphics Processing Unit) is a special purpose processor, known as a stream processor. It is specifically designed to perform a very large number of floating point operations in parallel. These processors may be integrated on the motherboard or attached via a PCI Express card. Today’s high end GPUs typically have gigabytes of dedicated memory and several hundred processor cores capable of running thousands of concurrent threads all dedicated to performing floating point math.

Stream processing is a technique used to achieve a form of parallelism known as data level parallelism. The concepts behind stream processing originated back in the heyday of the supercomputer. Applications running on a stream processor can use multiple computational units, such as the floating-point units on a GPU, without explicitly managing allocation, synchronization, or communication among those units. Not all algorithms can be expressed in terms of a data parallel solution.
Those that can realize significant performance gains by running on a GPU and taking advantage of the massive parallelism of the device (compared to the much more limited degree of parallelism of modern day CPUs).

In computing, most central processing units are labeled in terms of their clock speed expressed in hertz. This number refers to the frequency of the CPU's master clock signal ("clock speed"). This signal is simply an electrical voltage that changes from low to high and back again at regular intervals. Hertz has become the primary unit of measurement accepted by the general populace to determine the speed of a CPU. While it may make sense to compare CPUs of the same architecture with each other in terms of their clock speed, it does not make sense to compare the clock speeds of CPUs with different architectures. For example, if we are comparing a subscalar CPU with a clock speed of 3.2 GHz with a superscalar CPU with a clock speed of 2.2 GHz we cannot really tell from the clock speeds alone which one will be able to execute more instructions in a given time interval, because the superscalar processor will execute multiple instructions in a single clock cycle.

To compare heterogeneous CPU architectures in terms of their speed, we need to use a different metric than their clock speed. The industry accepted measure is FLOPS (FLoating point Operations Per Second). For example, if a system is rated at one GFLOPS of computational capacity it means that the system can perform 10^9 (1,000,000,000) floating point operations per second. The computational capacity of GPUs can also be measured in terms of FLOPS so let’s take a look at how CPUs stack up against GPUs in terms of peak FLOPS.

From the figure below we can see that Intel’s Westmere chips have about a 50 GFLOPS double precision rating while Nvidia’s most powerful card shipping today offers 500 GFLOPS of double precision computational capacity. As you can see from this chart the GPU is 10 times faster than the CPU at double precision floating point math. If we look at single precision computational capacity the GPU is about 15 times faster than the CPU.

Nvidia CEO Jen-Hsun Huang, while speaking at the Hot Chips symposium at Stanford University, predicted that GPU computing will experience a rapid performance boost over the next six years. According to Huang, GPU compute is likely to increase its current capabilities by 570 times, while 'pure' CPU performance will progress by a limited 3 times. Now these are just his predictions but perhaps six years from now we will be talking about "Huang's Law" instead of Moore's (moore on that later). Nvidia provided a glimpse of their technology roadmap at their GPU Technology Conference in 2010. As you can see from the chart below they appear to be on track in turning Huang’s predictions into reality.


The C1060 Tesla card was Nvidia’s first GPU designed from the ground up for performing general-purpose computations. Their second generation GPU (code named Fermi) is 4 times faster than their C1060 cards. As you can see from the graph above their 3rd generation GPU (code named Kepler) promises to be 3 – 4 times faster than Fermi and their 4th generation GPU (code named Maxwell) should be 3 – 4 times faster than that. This should give you a bit of insight into why the supercomputer industry is going down the path of their hybrid heterogeneous CPU / GPU designs.


What’s a Virtual Machine

There is another, seemingly unrelated, innovation that recently occurred in the computing industry. Enter the Virtual Machine. A virtual machine (or VM) is a software implementation of a machine (i.e. computer) that executes programs just like a physical machine. A typical usage for a virtual machine is to consolidate multiple logical (or virtual) machines onto a single physical machine. This provides for faster hardware provisioning and more fully utilized physical hardware resources. Hosting service providers such as Go Daddy use VMs to “slice and dice” their physical hardware into multiple logical machines that they sell to their customers. Cloud computing infrastructures such as Amazon’s Elastic Compute Cloud are also built with VMs. Many datacenters across multiple vertical industries are converting their companies’ infrastructure to run on VMs. This saves them datacenter space and power while more fully utilizing their physical hardware.

Moore’s Law

Gordon Moore, a co-founder of Intel, noted that the number of transistors that can be placed inexpensively on an integrated circuit increases exponentially, doubling every two years. More transistors mean faster, more powerful CPUs. This observation became known as "Moore's Law". Gordon Moore also coined a lesser-known phrase, "Moore's Second Law".

Moore's second law states that the R&D, manufacturing and QA costs associated with semiconductor fabrication also increase exponentially over time. Moore’s second law hypothesized that at some point the increasing fabrication cost coupled with the physical limitations of semiconductor fabrication materials will erect a brick wall halting the exponential advance of CPU clock rates. While Gordon Moore did not predict when this halting would occur, recent products from Intel and AMD would suggest that they (we) have hit it. Both companies seem to have at least temporarily abandoned their relentless quest for faster CPUs in favor of CPUs consisting of many cores.

How can virtualization be leveraged to help us build a supercomputer? Due to the impending collapse of Moore’s Law (see sidebar), Intel and AMD are no longer producing CPUs with significantly faster clock speeds. They have started to produce chips with multiple cores instead. Dual quad core processors have become the norm with dual hexa and dual octa core designs soon to follow. This does not bode well for the likes of Dell and HP who are now building desktop systems with significantly more computational capacity than the user sitting in front of them can ever hope to consume.

Well there’s no reason to let all those spare clock cycles go unused. We can use virtualization software to slice off the extra computational capacity and we can stitch it all together with Open Source distributed programming frameworks.

While Nvidia was busy turning the supercomputer industry on its head, Parallels, a virtual machine software company, was busy as well. Parallels has integrated into their desktop virtualization product (Parallels Workstation Extreme) the ability to dedicate a second network interface card (NIC) and graphics card (GPU) to a VM running on a desktop. While this may not sound like a big deal, their approach to doing it is. Parallels utilizes Intel’s latest virtualization technology that supports directed I/O (VT-d). This differs from the traditional approach of graphics virtualization via pass-through drivers, yielding improved isolation and greater reliability, availability, and performance. The intended usage of Parallels Workstation Extreme is for people who do CAD / CAM or visualization work. Typically these users need two systems. One is a desktop running Windoz for email, spreadsheets, word processing and internet access and the second is a high-end UNIX / Linux workstation for their graphics intensive applications. By using the Parallels Workstation Extreme product they only need one system! That all sounds great, but we are not running CAD / CAM or visualization applications. We want to run computationally intensive scientific applications. Parallels’ innovative approach to GPU virtualization works for building Virtual Supercomputers as well.

Hey… What’s the Big Idea? As I   mentioned   earlier   a   supercomputer   is   really   just   a   bunch   of   calculation   units   (CPUs   &   Vector   Processors)  stitched  together  with  some  sort  of  high-­‐speed  network  transport  with  access  to  local   and/or  distributed  memory  and  high  performance  file  systems.    Workload  communication/control   software   is   used   to   manage   (or   distribute)   the   computational   workload   to   the   calculation   units   and  to  collect  the  results.    The  first  step  in  building  a  Virtual  Supercomputer  is  getting  our  hands   on  some  calculation  units.    Since  I’m  going  to  assume  that  most  of  you  don’t  have  your  own  silicon   fabrication   facilities   I   think   we   should   use   “off-­‐the-­‐shelf”   processors   to   build   our   Virtual   Supercomputer.    We  can  use  commodity  desktops,  virtual  machine  software,  and  GPUs  to  build   our  calculation  units.    We  will  also  need   workload  command  and  control  software,  and  a  high-­‐ speed  network.       Let’s  start  with  our  calculation  units.    I  work  for  a  medium  sized  company.    We  have  close  to  4,000   employees.     Practically   all   of   these   4,000   employees   have   a   desktop   computer   to   read   email,   access  the  internet,  write  documents,  create  presentations,  run  spreadsheets,  and  perform  their   job  specific  duties.    When  PCs  first  came  out  they  were  somewhat  overburdened  by  these  simple   tasks.     Today’s   desktop   machines   typically   have   quad   core   processors   (if   not   dual   quad   core   processors)  and  can  very  easily  keep  up  with  the  trivial  demands  of  the  typical  user’s  workflow.     
Instead of relying on racks of new servers to power our supercomputer, let’s just take 1,000 of our users’ desktops (that we already own) and install a Linux VM on them using the Parallels Workstation Extreme product. We can configure the VM to start at boot time and steal four processor cores without the user even knowing that it is there. The figure below depicts one node of our Virtual Supercomputer. I would recommend using a 64 bit Windows 7 machine with at least 16GB of RAM. This will allow your desktop owners to run all of their MS Windows productivity applications in the host OS and still leave you with plenty of memory for the Linux VM.


Our calculation units are not yet complete because they don’t have any vector processors. To solve that we install an additional Nvidia GPU, which we will use as our vector processor, in each desktop and pin it to the VM. I would recommend an Nvidia C2050 Tesla card. The C2050 will give you roughly 500 GFLOPS of peak double precision computational capacity. If we put a C2050 in 1,000 desktops our Virtual Supercomputer will have a theoretical peak of half a PETAFLOP (1,000 x 500 GFLOPS). You already own the desktops but you are going to have to buy a Parallels Workstation Extreme license and a Tesla card for each node, which can get a bit expensive if you need 1,000 of each, so feel free to adjust the size of your Virtual Supercomputer to suit your needs and wallet.

The figure below depicts a completed calculation node in our Virtual Supercomputer. Notice that there are two graphics cards in the host. One graphics card will drive the monitor of the host OS and the other graphics card (the Tesla C2050) will be directly assigned to the Linux VM.


Depending on what kind of jobs we intend to run on our Virtual Supercomputer we may need a high performance file system to keep disk I/O from becoming a bottleneck. Lustre is a massively parallel distributed file system, generally used for large scale cluster computing. The name Lustre is a portmanteau word derived from Linux and cluster. Lustre’s aim is to provide a file system for clusters of tens of thousands of nodes with petabytes of storage capacity without compromising speed, security or availability. Lustre file systems are used in computer clusters ranging from small workgroup clusters to large-scale, multi-site clusters. At the time of writing, fifteen of the top 30 supercomputers in the world use Lustre file systems, including the world’s fastest supercomputer. We already have a significant amount of ground to cover in learning how to program our Virtual Supercomputer so we will leave the high performance file system for another day (and another book). If the types of jobs that you will be running on your Virtual Supercomputer are disk I/O intensive you can install Lustre as an upgrade after your system is functional.

The only thing missing from our design is a high speed network. Well, as it turns out, a gigabit Ethernet network should be fast enough for our purposes. While we could run our Virtual Supercomputer on a slower network we would find it difficult to get optimum performance out of it and our users’ desktop processing might be degraded from time to time.
With gigabit Ethernet you will get good performance out of your Virtual Supercomputer and your users’ desktops will not be adversely impacted. If you want to increase the network capacity of your system just add a second network interface card to each desktop and directly assign it to the Linux VM. Configuring your calculation nodes in this manner provides your VMs with the entire bandwidth of the second NIC, leaving the primary NIC 100% allocated to the host. This would look like:


In the figure above the half height cards are NICs and the full height cards are GPUs. Just assigning a separate NIC to the VM may not be sufficient, however, if your models are performing a significant amount of network I/O. You may need a completely separate subnet for your Virtual Supercomputer network traffic. If you only have one switch, as in the diagram below, the network traffic from your Virtual Supercomputer is mixed in with the network traffic from your users’ desktops.



You can separate the network traffic by adding a second switch. The original switch carries your desktop users’ network traffic and the second switch carries the network traffic for your Virtual Supercomputer, as in the diagram below.

With the above configuration you achieve complete isolation of your Virtual Supercomputer’s network traffic. While this will cost you a bit more I think you will find it money well spent.

Programming the Beast

Well, now that we have figured out what hardware we will use to build our Virtual Supercomputer, how are we going to program it? First we are going to need to decompose the problems that we are trying to solve into their parallel components. Amdahl’s Law, named after Gene Amdahl, a computer architect at IBM, states that the speedup of a program using multiple processors is limited by the time needed for the sequential fraction of the program. In general, if a fraction p of the work can be parallelized across N processors, the best possible speedup is 1 / ((1 - p) + p / N), which approaches 1 / (1 - p) as N grows.

For example, if a program needs 20 hours using a single processor core, and a particular portion of 1 hour cannot be parallelized, while the remaining portion of 19 hours (95%) can be parallelized, then regardless of how many processors we devote to a parallelized execution of this program, the minimal execution time cannot be less than that critical 1 hour. Hence the speedup is limited to 20X. The flip side of this analysis is that if a program needs 20 hours using a single processor core, and a particular portion of 19 hours cannot be parallelized, while the remaining portion of 1 hour (5%) can be parallelized, then regardless of how many processors we devote to a parallelized execution of this program, the minimal execution time cannot be less than that critical 19 hours, so the speedup can never exceed about 1.05X.

My shop teacher in High School once told me that it is very important to pick the right tool for the job. To a hammer the entire world looks like a nail. He would always say, “Don’t let me catch you hammering in screws”. His point was that you need to know what a tool is good at and use it for that, and to know what it is not good at and to NOT use it for that. It seems so obvious when you say “Don’t hammer in screws” but it is not so obvious when you are sifting through a couple of thousand lines of code in a simulation to determine the best way to speed up its execution. Let’s explore what types of algorithms will work best on our Virtual Supercomputer.

Michael Flynn, a professor emeritus at Stanford University, came up with a method of classifying computer architectures. This method of classification became known as “Flynn’s Taxonomy”. The taxonomy is based on the number of concurrent instruction (or control) and data streams available in the architecture. Flynn’s breakdown is as follows:

SISD    Single Instruction, Single Data
SIMD    Single Instruction, Multiple Data
MIMD    Multiple Instruction, Multiple Data
MISD    Multiple Instruction, Single Data


SISD is a sequential computer which exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are the traditional uniprocessor machines like an early PC or mainframe. MIMD is exemplified by a system having multiple autonomous processors simultaneously executing different instructions on different data or the same data. Distributed systems are generally recognized to be MIMD architectures, exploiting either a single shared memory space or a distributed memory space.

SIMD is a computer which exploits multiple data streams against a single instruction stream to perform operations which may be naturally parallelized; GPUs programmed with Nvidia’s CUDA are an example. Let’s dig a little deeper into SIMD. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.

For instance, consider a 2-processor system (CPUs A and B) in a parallel environment, and we wish to do a task on some data D. It is possible to tell CPU A to do that task on one part of D and CPU B on another part simultaneously, thereby reducing the duration of the execution. The data can be assigned using conditional statements. As a specific example, consider multiplying two matrices. In a SIMD implementation, CPU A could compute the top half of the rows of the result matrix, while CPU B could compute the bottom half. Since the two processors work in parallel, the job of performing matrix multiplication would take roughly one half the time of performing the same operation in serial using one CPU alone.

Now imagine that instead of two processors you have 512 (like some modern day GPUs have). You should be able to achieve a 512X speedup of this algorithm by running it on a GPU. Well, that’s the theory. In practice you will have to copy data on and off the device over the PCI Express bus and possibly swap data back and forth between shared memory and global memory on the GPU. Also, while a GPU has significantly more cores than a CPU, these cores have a much slower clock rate, due to power and thermal limitations, so you are not going to actually get a 512X speedup but you will get a very significant speedup.

Our Virtual Supercomputer will be many SIMD systems nested inside of a MIMD system. That may seem a bit abstract so let’s look at a graphic representation. The images below show a high level architectural view of MIMD and SIMD.


Think of each Processing Unit in the MIMD picture as a desktop connected to a network and the entire SIMD picture as a GPU. By installing a GPU in each desktop we get many SIMD systems nestled inside of a MIMD system as exemplified by the image below:


Decomposing a program or algorithm so that it runs well on our MIMD [SIMD] system is, in my opinion, more of an art than a science. The problem space needs to be broken down at a coarse-grained level, and each coarse-grained piece is distributed to a desktop to solve. Once the computation arrives at the desktop it will be broken down to a fine-grained level, which will be distributed across the cores of that desktop’s GPU to solve. We will provide you with a few programming examples to get your feet wet but you will need to look elsewhere for more complete coverage of this topic to become an expert.

One of the major difficulties in distributed parallel computing is facilitating the communication and control aspects of the parallel jobs. How do we start and stop our jobs? How do we get the data that our algorithms need to the distributed desktops and back in a reliable manner? MPI (Message Passing Interface) is a specification for a library of routines that provide the communication and control facilities necessary for building and running distributed parallel programs. MPI is not sanctioned by a formal standards body such as ISO or IEEE (it is maintained by the MPI Forum), but it has achieved widespread adoption in the scientific computing community. Many of today’s supercomputers are powered by MPI. One of the drawbacks to MPI is that it is not very resilient to node failures. As the size of your cluster grows so does the number of node failures.

Another tool that can be used to facilitate the communication and control aspects of our parallel jobs is PVM (Parallel Virtual Machine). PVM grew out of a research project at Oak Ridge National Laboratory that began around 1990. The project’s goals were to develop tools to support high performance distributed computing. One of the major benefits that PVM has over MPI is that it is resilient to node failures. Given that our Virtual Supercomputer will be running on a collection of loosely coupled desktop systems with a potentially wide geographic distribution, we may find that we need the resiliency. Once we have decomposed our program we can use MPI or PVM to distribute the coarse-grained computations to the Linux VMs on our cluster.

Once those computations arrive at their destinations they will again be decomposed into more fine-grained calculations that will run on the GPU. GPGPU computing, or General Purpose computing on Graphics Processing Units, is the technique of using a GPU, which typically handles computation only for computer graphics, to perform computations in applications traditionally handled by the CPU. We will explore two different approaches to programming the GPUs. We will take a look at Nvidia’s CUDA and the industry standard OpenCL. CUDA (an acronym for Compute Unified Device Architecture) is a parallel computing architecture developed by NVIDIA. CUDA is the computing engine in NVIDIA graphics processing units that is accessible to software developers through variants of industry standard programming languages.
Programmers can use 'C for CUDA', compiled with Nvidia’s nvcc compiler, to code algorithms for execution on the GPU. CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs. Through the use of CUDA, NVIDIA GPUs become accessible for general purpose computations like CPUs.

We also mentioned OpenCL. What is OpenCL… you ask? A couple of years ago Apple (you know, that scrappy little hardware company that just passed Exxon Mobil in market capitalization to become the largest company in the U.S.) was trying to figure out a way to position developers on their Mac OS to more easily adapt their code to the multi-core CPU revolution that is currently under way. They came up with something that they called Grand Central. In a nutshell Grand Central allows developers to be freed from many of the difficult tasks involved in multi-threaded parallel programming. Someone at Apple (probably Steve) noticed that the same kind of thinking mapped nicely to GPGPU programming. Out of the goodness of their hearts Apple drafted a specification for this style of programming and submitted it to the Khronos Group (an industry consortium that creates open standards) as OpenCL (Open Computing Language).

The Khronos Group got all the major players (Intel, AMD, ATI, Nvidia, RapidMind, ClearSpeed, Apple, IBM, Texas Instruments, Toshiba, Los Alamos National Laboratory, Motorola, QNX, Bla…, Bla…, Bla…, Yadda..., Yadda..., Yadda...) together and they hammered out the 1.0 version of the OpenCL specification in record time.
Programs written to run on OpenCL are written in a manner that is by definition parallel. This means that not only can they take advantage of the multiple cores on a CPU but that they can also take advantage of the hundreds of cores on a GPU. But it doesn’t stop there. OpenCL drivers are being written for IBM Cell processors (you know… the little fellow that powers your PS3), ATI GPUs, Nvidia GPUs, AMD CPUs, Intel CPUs and a host of other computing devices. We will also be able to program Intel’s MIC chips and AMD’s APUs in OpenCL (at least that’s what my friends at Intel and AMD are currently saying).

Since Nvidia’s CUDA is robust and production ready, and OpenCL is a standard supported by many vendors, we will provide you with an introduction to CUDA and OpenCL. Once you have a basic understanding of CUDA and OpenCL we will show you how to combine them with MPI and PVM. Combining these technologies with virtualization will allow you to amass an amazing amount of computational capacity for a very small cost.

You might wonder why we need the desktop VM. Why don’t we just use MPI / PVM to run our jobs natively on the Windows desktops? There are two important features that our Virtual Supercomputer gets from the desktop VM. First and foremost, most people who are serious about number crunching do it on Linux. I’m sure I’m not the only one out there with a “Serious Number Crunchers do it on Linux” t-shirt… on second thought… maybe I am. While some of the tools you use to crunch your numbers may be available on Windoz… many are not. The second reason for the desktop VM is safety. With the desktop VM we restrict the number crunching processes to a sandbox. Without the desktop VM a runaway program could not only bring down one of your users’ desktops, it could bring down all of your users’ desktops.
By running in a desktop VM the worst that can happen is that a runaway program could bring down your Virtual Supercomputer, leaving your users’ desktops unharmed.

Virtual Supercomputer vs. Beowulf Cluster

In 1994 a group of researchers at NASA built a computing cluster out of commodity desktops using Free and Open Source Software (FOSS). They named their creation Beowulf after the main character in the Old English poem Beowulf. In the poem the hero Beowulf was described as having “thirty men’s heft of grasp in the grip of his hand”. The term Beowulf Cluster is now commonly used to describe a class of computational clusters that have the same attributes as NASA’s original cluster.

Our Virtual Supercomputer has many of the same attributes as a Beowulf Cluster. It is powered by MPI / PVM and Linux, just like a Beowulf Cluster. It is built out of a collection of commodity desktops networked together, just like a Beowulf Cluster. Where a Virtual Supercomputer differs from a Beowulf Cluster is that it is run on desktop hardware that you already own and it is turbo charged with GPUs. This makes a Virtual Supercomputer significantly less expensive than a Beowulf Cluster because it is running on the spare computational capacity of your preexisting desktop hardware. Drop in the GPUs and the FLOPS available for your computations go through the ceiling.

Other problems that need to be addressed with a Beowulf Cluster are nonexistent with a Virtual Supercomputer. A large Beowulf Cluster will take up a significant amount of space in a datacenter and as we all know datacenter space is expensive. A Virtual Supercomputer takes up zero datacenter space. A Virtual Supercomputer will only increase your power consumption by about 250W per node (to power the GPU) whereas a Beowulf Cluster can chew up close to 1000W per node. While the GPUs in a Virtual Supercomputer will give off a significant amount of heat, no additional datacenter cooling cost is incurred since they are distributed throughout your company.

While you could just drop GPUs into your Beowulf Cluster and achieve the same computational capacity, a Virtual Supercomputer is a more compact, more efficient, more elegant, and less expensive solution.


Summary

In this chapter we walked you through the basic concept of a Virtual Supercomputer. By combining the latest advances in virtualization, GPGPU computing and some tried and true Open Source software frameworks we will show you how to build and program an extremely powerful computation engine.

A single node powered by an Nvidia Tesla card will yield a significant amount of computational capacity. When we stitch many nodes together (as in the diagram below) and handle the workflow distribution with MPI / PVM we get an unbelievable amount of computational capacity at a fraction of the cost of a traditional HPC cluster.

In the remaining chapters of this book we will provide you with an introduction to GPGPU programming using the two most widely accepted toolkits for programming GPUs: CUDA and OpenCL. We will also show you how to integrate CUDA and OpenCL based models with MPI and PVM so that you can not only unleash the hidden computational capacity of a single GPU but also combine the computational capacity of hundreds or even thousands of GPUs into a single computational engine or Virtual Supercomputer.

