
The ParaPhrase  Project    

Parallel Patterns for Heterogeneous Multicore Systems
Kevin Hammond, University of St Andrews


Project Details
•  3-year targeted research project (FP7 STReP)
•  Runs from 1/10/11 to 30/9/14
•  Funded by Objective 3.4, "Computing Systems"
•  Project Number ICT-2011-288570

•  9 partners from five countries
•  Strong Italian contribution

•  €3.5M budget, €2.6M EU contribution
•  Coordinated by the University of St Andrews, Scotland, UK



The Dawn  of  a  New  Age?  



Scaling towards Manycore

[Diagram: a manycore system with multiple RAM banks, HyperTransport (HT) links between processors, and peripheral interfaces: PCIe/PCIx, USB, PCI, SATA, PATA]


The Challenge

"Ultimately, developers should start thinking about tens, hundreds, and thousands of cores now in their algorithmic development and deployment pipeline."
Anwar Ghuloum, Principal Engineer, Intel Microprocessor Technology Lab

"The dilemma is that a large percentage of mission-critical enterprise applications will not 'automagically' run faster on multi-core servers. In fact, many will actually run slower. We must make it as easy as possible for applications programmers to exploit the latest developments in multi-core/many-core architectures, while still making it easy to target future (and perhaps unanticipated) hardware developments."
Patrick Leonard, Vice President for Product Development, Rogue Wave Software


Programming Issues
•  We can muddle through on 2-8 cores
   -  maybe even 16
   -  modified sequential code may work
   -  we may be able to use multiple programs to soak up cores
   -  BUT larger systems are much more challenging

•  Fundamentally, programmers must learn to "think parallel"
   -  this requires new high-level programming constructs
   -  you cannot program effectively while worrying about deadlocks etc.: they must be eliminated from the design!
   -  you cannot program effectively while fiddling with communication etc.: this needs to be packaged/abstracted!


A critique of typical current approaches
•  Applications programmers must be systems programmers
   -  insufficient assistance with abstraction
   -  too much complexity to manage
•  Difficult/impossible to scale, unless the problem is simple
•  Difficult/impossible to change fundamentals
   -  scheduling
   -  task structure
   -  migration
•  The approaches provide libraries; they need to provide abstractions


The future: megacore computers?
•  Probably not just scaled versions of today's multicore
   -  Perhaps hundreds of dedicated lightweight integer units
   -  Hundreds of floating-point units (enhanced GPU designs)
   -  A few heavyweight general-purpose cores
   -  Some specialised units for graphics, authentication, network etc.
   -  possibly soft cores (FPGAs etc.)
   -  Highly heterogeneous

•  Probably not uniform shared memory
   -  NUMA is likely, even hardware distributed shared memory
   -  or even message-passing systems on a chip



The Implications for Programming
•  We must program heterogeneous systems in an integrated way
   -  it will be impossible to program each kind of core differently
   -  it will be impossible to take static decisions about placement etc.



Some Possible Language Approaches
•  Pattern-based approaches
•  Parallel stream-based approaches
   (avoid issues such as deadlock etc.)
•  Coordination approaches
•  Direct programming in e.g. Parallel Haskell
   (Parallelism by Construction!)


Pattern-Based Approaches

euler :: Int -> Int
euler n = length (filter (relprime n) [1..(n-1)])

hcf :: Int -> Int -> Int
hcf x 0 = x
hcf x y = hcf y (rem x y)

relprime :: Int -> Int -> Bool
relprime x y = hcf x y == 1

-- boring sequential version
sumEuler :: Int -> Int
sumEuler n = sum (map euler (mkList n))

sumEulerParList :: Int -> Int
sumEulerParList n = sum (map euler (mkList n) `using` parList)

[Figure: speedup of sumEulerParList, plotted for 1-8 cores]
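A self-contained version of this example can be written with the Strategies library from GHC's `parallel` package. This is a sketch, assuming `mkList n = [1..n]` and the modern `parList rseq` form of the strategy (the slide's bare `parList` is the older GpH style):

```haskell
import Control.Parallel.Strategies (parList, rseq, using)

euler :: Int -> Int
euler n = length (filter (relprime n) [1 .. n - 1])

hcf :: Int -> Int -> Int
hcf x 0 = x
hcf x y = hcf y (rem x y)

relprime :: Int -> Int -> Bool
relprime x y = hcf x y == 1

-- Parallel version: the list of totients is evaluated with parList,
-- sparking one lightweight task per list element.
sumEulerPar :: Int -> Int
sumEulerPar n = sum (map euler [1 .. n] `using` parList rseq)

main :: IO ()
main = print (sumEulerPar 1000)
```

Compile with `ghc -threaded` and run with `+RTS -N` to use all cores; because sparks are cheap, granularity can later be tuned (e.g. with `parListChunk`) without touching the algorithm itself.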


MasterWorker pattern
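The slide shows the pattern only as a picture; the following sketch gives the sequential semantics implied by the `masterWorker :: (a -> ([a],b)) -> [a] -> [b]` signature on the patterns slide: each task yields both new tasks and a result, and the pool is processed until empty. The names `masterWorker` and `halve` here are illustrative, and a real implementation would distribute tasks over worker threads rather than recurse sequentially:

```haskell
-- Sequential reference semantics for the masterWorker skeleton
-- (a sketch; the real pattern farms tasks out to worker threads).
masterWorker :: (a -> ([a], b)) -> [a] -> [b]
masterWorker _ []       = []
masterWorker f (t : ts) =
  let (newTasks, result) = f t
  in result : masterWorker f (ts ++ newTasks)

-- Illustrative task function: each task emits its value as a result
-- and, for positive even numbers, a follow-up task for its half.
halve :: Int -> ([Int], Int)
halve n = (if even n && n > 0 then [n `div` 2] else [], n)
```

For example, `masterWorker halve [8]` works through the tasks 8, 4, 2, 1, producing the results in that order.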


Multicore Results

42,473 candidate groups

8-core Dell PowerEdge 2950, 2 × Intel Xeon 5355 quad-core @ 2.66GHz, 16GB fully-buffered 667MHz DIMMs, CentOS Linux 4.5.


Patterns of Symbolic Computation

•  Standard functional algorithmic skeletons

   parMap       :: (a->b)                     -> [a]        -> [b]
   parZipWith   :: (a->b->c)                  -> [a] -> [b] -> [c]
   parReduce    :: (a->b->b) -> b             -> [a]        -> b
   parMapReduce :: (a->b->b) -> (c->[(d,a)])  -> c          -> [(d,b)]
   masterWorker :: (a->([a],b))               -> [a]        -> [b]

•  New parallel domain-specific patterns
   -  orbit calculation: generate unprocessed neighbouring states
   -  duplicate elimination: merge two lists
   -  completion algorithm: generate new objects from any pair
   -  chain reduction: generate new objects from any pair
   -  partition backtracking: search for basis objects

•  others?? search skeleton, classification skeleton, modular skeleton+CRA, backtracking search
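One way to read these signatures is via their sequential denotations: a parallel skeleton is correct when it computes the same result as the corresponding sequential function. A sketch for the first three (the `seq*` names are illustrative, not part of the project):

```haskell
-- Sequential denotations of the first three skeletons: any parallel
-- parMap/parZipWith/parReduce must agree with these on every input.
seqMap :: (a -> b) -> [a] -> [b]
seqMap = map

seqZipWith :: (a -> b -> c) -> [a] -> [b] -> [c]
seqZipWith = zipWith

seqReduce :: (a -> b -> b) -> b -> [a] -> b
seqReduce = foldr
```

Note that a tree-shaped parallel reduction additionally needs an associative combining operator (typically with `a ~ b`); `foldr` alone does not require this, which is the usual side condition on reduce skeletons.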


ParaPhrase Aims

Overall, we aim to produce a new pattern-based approach to programming parallel systems. Specifically, we aim to
1.  develop high-level design and implementation patterns that are capable of easily exposing useful parallelism for a wide range of parallel applications running on heterogeneous multicore/manycore systems.
2.  develop new dynamic mechanisms to support adaptivity for heterogeneous multicore/manycore systems, so demonstrating the general deployability of the pattern-based approach to parallelism.


ParaPhrase Aims (2)

3.  verify that these patterns and adaptivity mechanisms can be used easily and effectively to develop a wide range of real-world applications for heterogeneous multicore/manycore systems.
4.  ensure that there is scope for widespread takeup of the patterns and adaptivity mechanisms that will be developed by the ParaPhrase project.

We are applying our work in two language settings:
•  C/C++ — imperative, general-purpose
•  Erlang — commercial functional, telecom


Project Vision

[Diagram: Application Design → Pattern-based Development/Refactoring → Parallelised Applications → Dynamic Mapping onto a Heterogeneous Hardware Pool of CPUs and GPUs]


ParaPhrase Workpackages
•  WP2: Parallel Patterns
•  WP3: Component Interfaces for Adaptivity
•  WP4: Refactoring Tools
•  WP5: Compilation and Platform-Specific Deployment
•  WP6: Use Cases, Requirements and Evaluation
•  WP7: Community Building


Technical Objectives

1.  To develop high-level parallel design patterns that easily expose parallelism for a wide variety of parallel applications (WP2).
2.  To develop efficient parallel implementations corresponding to these patterns that can be easily re-targeted to different hardware architectures (WP2/5).
3.  To define a virtualisation of the parallel software in the form of a low-level component model defining well-defined component state, life-cycle, and interfaces (use and provide) that allows the re-mapping to, possibly heterogeneous, hardware devices (WP3).


Technical Objectives (2)

4.  To develop refactoring tools that support the parallel design process by allowing the straightforward inclusion of alternative parallelisations for the same software design (WP4).
5.  To develop adaptation mechanisms that dynamically and efficiently re-map software components to the available hardware components (WP3/5).


Pattern-Based Implementation

[Diagram: at design time a Parallel Pattern may be refactored into an alternative Parallel Pattern; each pattern is then implemented in the Software Virtualisation Layer]


Static Mapping

[Diagram: the Implementation in the Software Virtualisation Layer is mapped onto an Implementation on the Heterogeneous Platform]


Dynamic Re-Mapping


Feedback-Directed Compilation

[Diagram: three interacting routes.
Compilation route: Source → Compiler → Object + LPEL + CAL → Mapping → Hardware Platform.
Analysis route: Source + CAL → Statistical Analysis → Markov Model → Dynamic Adaptation.
Feedback route: Runtime Measurement → Measurements + SVP → Performance Aggregation → Aggregated Measurements + SVP, fed back into the analysis.]


Project Consortium


Expertise

Partners: USTAN, RGU, MLNX, SCCH, Erlang Solutions, UNIPI, USTUTT, UNITO, QUB

•  parallel patterns
•  refactoring
•  algorithmic skeletons
•  parallel applications
•  behavioural skeletons
•  adaptive parallel runtime systems
•  graphics processing units
•  multicore/many-core programming
•  high-performance computers
•  compilation

Representative Industrial Applications

Chosen by SCCH and Erlang Solutions. High-performance applications in areas like:
•  Large-scale databases
•  Video streaming
•  3D modelling
•  Advanced data mining
•  Machine learning
•  Renewable energy production


Expected Long-Term Impact
•  Accelerated system development and production
   -  we will take a dynamic approach to mapping code to available resources, using extra-functional properties
   -  virtualisation and the abstraction of coordination improve concurrent code reuse

•  Improved speed to market
   -  we will develop very high-level programming approaches, helping programmers think in parallel

•  Good parallelism at low effort
   -  we aim for at least a tenfold speedup for the applications we are considering


Impact of Formal Methods
•  Cost modelling
   -  for patterns and implementations
   -  capturing statistically valid information
   -  for compilation and optimisation (correct transformation)

•  Refactorings
   -  ensuring correctness
   -  improving behaviour

•  Key question: what is the semantics of parallelism?
   -  How do I know when I have a "better" program?
   -  How do I know that two parallel programs are "the same"?


Conclusions
•  Patterns help programmers learn to "think in parallel"
   -  capture structure, match implementation

•  Cost-directed refactoring
   -  rewrite source to choose the "best" pattern

•  Implementation builds on strongly-virtualised components
   -  hardware and software virtualisation

•  Targets: C/C++ and Erlang



Research Directions
•  What patterns are needed to cover our applications?
   -  standard patterns
   -  domain-specific patterns
   -  special patterns for heterogeneity

•  Can we program heterogeneous systems without knowing the target?
   -  what virtualisation mechanisms do we need?
   -  abstract memory access, communication, state

•  What information is needed to exploit multicore effectively?
   -  metrics: execution time, memory, power
   -  historical v. predicted information

•  Is static or dynamic mapping best?


Determining Resource Bounds from Source

1.  Build formal operational semantics
    -  relates programs to resource use
    -  formal models of complex program structures, real-time constructs
2.  Build mathematical models of resource usage
    -  captures low-level information (PowerPC 603e)
    -  uses AbsInt's aiT tool for WCET information
    -  e.g. Tplus = 1599, Tpush = 939, …
3.  Construct static analyses
    -  based on the mathematical models
    -  e.g. T_init = Tcall + 5*Tpushvar + 3*Tmkint + Tmkvec(2) + … + Tcreateframe + Tmatchrule + …
4.  Analyse program source
    -  extract bounds on resource use
    -  metrics: execution time, stack high watermarks, memory allocations/deallocations
    -  provable bounds on resource use
5.  Compare bounds with measured usage
    -  time and space on actual platform
6.  Adapt software if necessary to improve bounds

Example analysed source (Hume-style):

case asp of
    (Position xs)       -> (PNORM myId send 10 (Response (Position myPos)), (myId,myPos,myPower,0), *)
  | (Power xs)          -> (PNORM myId send 10 (Response (Power myPower)), (myId,myPos,myPower,0), *)
  | (Target (JF angle)) -> (*, (myId,myPos,myPower,0), angle)

13/14 July 2010, SEAS DTC Annual Technical Conference


Other ParaPhrase Talks Today
1.  Horacio Gonzalez-Velez, Robert Gordon Uni., Scotland: Virtualisation at the hardware level
2.  Marco Aldinucci, U. Torino, Italy: FastFlow: High-Performance Compilation
3.  Marco Danelutto, U. Pisa, Italy: Managing Adaptivity
4.  Chris Brown, St Andrews Uni., Scotland: Rule-based Refactoring for Functional Languages


THANK YOU!

http://www.paraphrase-ict.eu
@paraphrase_fp7
