E-Nano Newsletter Special Issue April 2012


The increased degree of parallelism poses a challenge for algorithm design on the one hand and for implementation on the other. While parallelizing an inherently parallel algorithm is often straightforward, parallelizing a conceptually serial algorithm may even be impossible. Consequently, it can be beneficial to substitute an established serial algorithm with a parallel variant: although the parallel algorithm may be less efficient in a serial processing environment, it is able to utilize all the computational resources available on modern architectures, ultimately leading to shorter execution times. In the context of multi-scale methodologies, the plethora of different algorithms employed requires a careful replacement of the bottlenecks caused by serial algorithms. Relying on suitable parallel algorithms from the outset is crucial here, since inherently serial algorithms cannot be implemented in a massively parallel manner. Beyond incorporating increased physical detail into multi-scale simulations, close interdisciplinary collaboration of engineers and physicists with numerical mathematicians on the one hand, and computer scientists on the other, is the key to unlocking the full potential of modern computing architectures in high performance computing and to carrying out frontier developments and research.

A multitude of different approaches for programming parallel hardware has emerged. The message passing interface (MPI) has become a de facto standard on distributed memory machines such as computing clusters. On shared memory machines such as current workstations, OpenMP allows parallelism to be added to existing code using suitable compiler directives. In favourable cases OpenMP yields good performance enhancements while minimizing the probability of breaking working code.
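As a minimal illustration of these two models, the following C sketch combines them in the common hybrid fashion: MPI distributes work across processes, possibly on different cluster nodes, while a single OpenMP directive adds thread parallelism within each process. The compiler wrapper and flags named in the comment are typical but installation-specific.

    /* Hybrid "hello world": MPI across processes, OpenMP within each.
       Compile e.g. with: mpicc -fopenmp hello_hybrid.c -o hello_hybrid */
    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);                 /* start the MPI runtime  */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* id of this process     */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks  */

        /* The pragma below is the only OpenMP-specific line: the block
           is executed by a team of threads within each MPI process. */
        #pragma omp parallel
        {
            printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
                   rank, size, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }
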
Full control over the individual threads is provided by libraries such as POSIX threads (pthreads), which is available on a wide range of platforms; a minimal pthreads sketch is given below. Additional vendor-specific compiler extensions are also available, most notably Cilk for C and C++.

A different approach is automated code generation, which attempts to detect parallelism in existing code automatically and may therefore aid in migrating legacy code. Examples of this approach are HMPP by CAPS Enterprise and the PGI Accelerator compilers. In both cases, as with OpenMP for multi-core programming on shared memory systems, only a few new directives need to be added to the original code as comments, which are then understood only by the specific compilers. This reduces the programming effort, but performance is sometimes poorer, since it depends strongly on the characteristics of the algorithm. With HMPP it is also possible to obtain an automatically generated version of the code that remains amenable to manual optimization afterwards; a directive-based sketch follows the pthreads example below.

In contrast to the vast number of programming approaches for CPUs, only two mainstream approaches for programming GPUs are in use. On the one hand, CUDA is a proprietary language introduced by NVIDIA. On the other hand, OpenCL is a standardized framework for general multi- and many-core architectures; it is maintained by the Khronos Group, and implementations from AMD, Intel and NVIDIA are available. Unlike CUDA, OpenCL can be employed on GPUs and multi-core CPUs from different vendors, as illustrated by the last sketch below. Although the OpenCL standard addresses multiple target devices, full hybrid support of OpenCL implementations for devices from different vendors still needs to evolve, and despite the unified programming approach, hardware-specific implementations are required in order to obtain good performance. OpenCL can in principle also be used with special-purpose hardware such as field-programmable gate arrays (FPGAs).
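
To make the pthreads model concrete, here is a minimal sketch in which each worker thread sums one slice of an array and the main thread joins the workers and combines the partial results. The array size, thread count and the partial_sum helper are illustrative choices, not part of any particular library.

    /* Minimal pthreads sketch: explicit thread creation and joining.
       Compile e.g. with: cc -std=c99 -pthread sum.c -o sum */
    #include <stdio.h>
    #include <pthread.h>

    #define N_THREADS 4
    #define N 1000000

    static double data[N];

    typedef struct { int begin; int end; double sum; } task_t;

    static void *partial_sum(void *arg)
    {
        task_t *t = (task_t *)arg;
        t->sum = 0.0;
        for (int i = t->begin; i < t->end; ++i)
            t->sum += data[i];
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[N_THREADS];
        task_t tasks[N_THREADS];
        int chunk = N / N_THREADS;

        for (int i = 0; i < N; ++i) data[i] = 1.0;

        for (int i = 0; i < N_THREADS; ++i) {
            tasks[i].begin = i * chunk;
            tasks[i].end   = (i == N_THREADS - 1) ? N : (i + 1) * chunk;
            pthread_create(&threads[i], NULL, partial_sum, &tasks[i]);
        }

        double total = 0.0;
        for (int i = 0; i < N_THREADS; ++i) {
            pthread_join(threads[i], NULL);  /* wait for worker i */
            total += tasks[i].sum;
        }
        printf("total = %f\n", total);
        return 0;
    }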

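The directive-based accelerator approach can be sketched as follows. The exact HMPP and PGI directive syntax differs from what is shown here; the pragma below is written in the OpenACC style that emerged from the PGI Accelerator model, and the data clauses are illustrative. A compiler without accelerator support simply ignores the pragma and the loop runs serially on the CPU.

    /* Directive-based offload of a saxpy loop.  The pragma is
       OpenACC-style (in the spirit of the PGI Accelerator model;
       HMPP uses its own #pragma hmpp directives). */
    #include <stddef.h>

    void saxpy(size_t n, float a, const float *x, float *y)
    {
        /* copyin: x is only read on the device; copy: y is read and
           written, so it is transferred in both directions */
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
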
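For the GPU side, a minimal OpenCL vector addition in C is sketched below; error checking is omitted for brevity, and the first available platform with its default device is assumed. The kernel is supplied as a string and compiled at run time for the selected device, which is precisely what allows the same host code to target GPUs and multi-core CPUs from different vendors. A CUDA version would be structurally similar but compiled ahead of time for NVIDIA GPUs only.

    /* Minimal OpenCL vector addition (no error checking). */
    #include <stdio.h>
    #include <CL/cl.h>

    static const char *src =
        "__kernel void vadd(__global const float *a,\n"
        "                   __global const float *b,\n"
        "                   __global float *c) {\n"
        "    int i = get_global_id(0);\n"
        "    c[i] = a[i] + b[i];\n"
        "}\n";

    int main(void)
    {
        enum { N = 1024 };
        float a[N], b[N], c[N];
        for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2.0f * i; }

        cl_platform_id platform; cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

        /* The kernel source is compiled at run time for the device. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "vadd", NULL);

        cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof a, a, NULL);
        cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof b, b, NULL);
        cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);

        clSetKernelArg(k, 0, sizeof da, &da);
        clSetKernelArg(k, 1, sizeof db, &db);
        clSetKernelArg(k, 2, sizeof dc, &dc);

        size_t global = N;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);

        printf("c[0]=%g c[%d]=%g\n", c[0], N - 1, c[N - 1]);
        return 0;
    }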

Moreover, ordinary desktop machines are now equipped with GPUs. It is worth mentioning that a single modern GPU board in 2012 offers the performance of the fastest supercomputers of 1997. This enables numerical experiments on average desktop machines that would have been impossible even on supercomputers 15 years ago, provided that the underlying algorithm allows for a high degree of parallelism.


