The ClusterChimps Guide to Building and Programming a Virtual Supercomputer by Mi Zaius

(Why ClusterChimps…? ClusterMonkey was taken!)

All rights reserved.

This “book” is intended for developers and researchers that know how to program in C. We will walk you through building a Virtual Supercomputer and teach you how to program it. The information provided is not intended to make you an expert but to expose you to the basic underpinnings. You will have to do more work on your own to become an expert (“bummer”).

Disclaimer: While we discuss products from Nvidia and Parallels we have no relationship with either company other than that of fanboy.

About ClusterChimps.org

ClusterChimps is dedicated to helping bring inexpensive supercomputing to the masses by leveraging emerging technologies coupled with bright ideas and open source software. We do this because we believe it will help advance computation intensive research areas including basic research, engineering, earth science, biology, materials science, and alternative energy research just to name a few. Throughout this text we have “borrowed” images, code snippets, and text from various Nvidia publications. We have also “borrowed” images and text from Wikipedia, openMPI documentation and PVM documentation. We hope no one gets their “undies in a bind” over this.

About the Author

Dr. Zaius is a well renowned orangutan in the fields of cluster computing and GPGPU programming. He has spent most of his career in the financial industry working at exchanges, investment banks, and hedge funds. He is currently the driving force behind the site ClusterChimps.org. Originally from the island of Borneo, Dr. Zaius now resides in New York City with his wife and 3 children. He can be reached at zaius@clusterchimps.org.

How to Read This Book

The first chapter should provide you with information regarding the general idea behind a Virtual Supercomputer. I would suggest that everyone start here. Once you have wrapped your head around the basic concept, where you go next depends on what you already know and what you want to know. If you plan on building your own Virtual Supercomputer and you intend to work through the examples you should read through Chapter 2 (Some Assembly Required). This will provide you with information regarding where to get the various software you will need and how to configure your hardware. If parallel programming concepts are foreign to you be sure to take a look at Chapter 3. It will explain the difference between task parallelism and data parallelism and go over the manager / worker pattern that will be used throughout the book. Your next stop should be the CUDA and OpenCL intro chapters (4 and 5). These will walk you through the basics of CUDA and OpenCL so that you have an understanding of what they are, how to use them, and what you should use them for. If you already know this then you might want to skip over to Chapters 6 and 7, which will walk you through using MPI and PVM in conjunction with CUDA / OpenCL. If you are serious about using your Virtual Supercomputer for solving real life computational problems then you should definitely take a look at Chapters 8 (Optimized PVM) and 9 (Fault Tolerant PVM). These chapters will show you how to write PVM based applications for your Virtual Supercomputer that are as parallel as possible and that are resilient to node failures and reactive to grid expansion and contraction.

Table of Contents

Introduction
  Supercomputer Design
  CPU vs. GPU (The Big Smackdown)
  What’s a Virtual Machine
  Hey… What’s The Big Idea
  Programming The Beast
  Virtual Supercomputer vs Beowulf Cluster
  Summary
Some Assembly Required
  Installing Tesla Cards
  Installing PWE
  Linux VM
  Installing CUDA / OpenCL
  Installing MPI
  Installing PVM
  Installing Code Examples
  Summary
Parallel Program Decomposition
  Task vs Data Parallelism
  Granularity
  Manager Worker
  Example: Calculating the Value of a Portfolio
  Gustafson’s Law
  Summary
Intro to CUDA
  CUDA Program Structure
  Matrix Multiplication 1
  Matrix Multiplication 2
  Matrix Multiplication 3
  Summary
Intro to OpenCL
  OpenCL Program Structure
  Matrix Multiplication 1
  OpenCL Compiler
  Matrix Multiplication 2
  Matrix Multiplication 3
  Summary
Intro to MPI
  MPI Overview
  MPI Hello World
  CPU Binomial Options Model
  CUDA Binomial Options Model
  OpenCL Binomial Options Model
  MPI and Fault Tolerance
  Summary
Intro to PVM
  PVM Overview
  PVM Hello World
  Binomial Option Model
  Summary
Optimized PVM
  Optimization
  Implementation Details
  Optimized Manager
  Optimized (Threaded) Worker
  Summary
Fault Tolerant PVM
  Overview
  Determining that “BAD Things” Have Happened
  Implementation Details
  Fault Tolerant Worker
  Fault Tolerant Manager
  A Simple System Monitor
  Summary
Epilogue
Appendix I: OpenCL Compiler
Appendix II: Measuring Flops
Appendix III: Binomial Tree Option Pricing Model
Appendix IV: Serial Binomial Option Pricing Model

Chapter

Introduction

Remember a few years ago when you bought that brand new computer? It had a 100 MHz processor and 200 MB of RAM. You were the talk of the neighborhood. You were so proud of your system. You droned on and on about it every chance you had. People envied the fact that you could actually run more than one application at a time. Everyone was so jealous of your system that they wished they could be you. And then… one day at a block party… Bob, from down the street, drops a bombshell. Bob just bought a 200 MHz system with 300 MB of RAM. Your system was no longer the fastest in the neighborhood! Remember the shame… the humiliation… how were you going to be able to show your face in public again? You’d show them… you’d show them all! As soon as the 233 MHz models came out you ran to the store and snatched one up. This machine was so powerful that you could almost run the latest version of Windoz. Once again you were the top dog… the alpha male… the head banana… the big cheese! Oh what a wonderful feeling that was.

As time passed you got older and wiser and you realized that this was a sucker’s game that was draining your children’s college fund (actually your wife realized it and you were forced to come to terms with it). So you threw in the towel. You learned to live within your means. You stopped upgrading your system. You lost the computational arms race of the neighborhood. Now all you have are the fading memories of how it felt to have more computational capacity at your fingertips than anyone else in your close-knit circle of friends. Well what would you say if I told you that you could be top dog again? What if I told you that for a mere $250,000 you could not only have the fastest computer in the neighborhood again but one of the top 500 fastest computers on the planet!

Now I’m sure that you wouldn’t think twice about taking out a second mortgage to be top dog of planetary proportions but, unless you’re up for a divorce, perhaps you had better find a different source of funding. Perhaps you can convince the organization you work for to fund your little project? Of course you won’t own the system but just imagine the look on Bob’s face when you tell him you built it. If I’ve piqued your interest then read on. The following pages will provide you with step by step instructions on how to build an inexpensive Virtual Supercomputer from commodity desktops, Nvidia GPUs, and Parallels virtualization software, and how to program it with MPI / PVM, CUDA, and OpenCL.

Supercomputer Design

A supercomputer is defined as a computer that is at the frontline of current processing capacity, particularly speed of calculation. Supercomputers are used for solving highly calculation-intensive problems involving quantum physics, weather forecasting, climate research, molecular modeling, physical simulations and financial modeling (just to name a few). Supercomputers were first introduced in the 1960s by Seymour Cray at Control Data Corporation (CDC). In the 1970s Cray left to form his own company, Cray Research. His new designs took the industry by storm, holding the top spot in supercomputing for over 5 years (1985–1990).

The term supercomputer itself is rather fluid, and today's supercomputer tends to become tomorrow's ordinary computer. CDC's early machines were simply very fast scalar processors, some ten times the speed of the fastest machines offered by other companies. In the 1970s most supercomputers were dedicated to running a vector processor, and many of the newer players developed their own such processors at a lower price to enter the market. The early to mid-1980s saw machines with a modest number of vector processors working in parallel become the standard. Typical numbers of processors were in the range of four to sixteen. In the later 1980s and 1990s, attention turned from vector processors to massive parallel processing systems with thousands of "ordinary" CPUs, most being off the shelf units. Most supercomputers running today were built with "off the shelf" server-class microprocessors, such as the PowerPC, Opteron, or Xeon, and are really just highly-tuned computer clusters using commodity processors combined with high-speed interconnects and high performance file systems.

The only constant in the IT world is “change”. The supercomputer industry is morphing their system designs again. This time they have chosen more of a hybrid approach. They are now designing systems that are composed of clusters of machines consisting of traditional CPUs and vector processors. In this particular case the vector processors are actually manufactured by Nvidia and are masquerading as GPUs (more on that later). The industry's new design is best exemplified by the Tianhe-1A. As of January 2011, according to www.top500.org, the Tianhe-1A is the fastest supercomputer on the planet with a theoretical peak performance of 4.7 quadrillion floating point operations per second. I don’t know about you but I find it a bit difficult to comprehend numbers that big. To give you a frame of reference, if you had $4.7 quadrillion you could pay off the U.S. national debt… 324 times! The Tianhe-1A has over 14,000 Xeon X5670 processors and 7,000 Tesla M2050 vector processors (GPUs). Well if we want our virtual supercomputer to be near the top of the pack we should probably follow the industry's heterogeneous design. Before we copy it, let’s dig a bit deeper to see why it is the best route to go.

CPU vs. GPU (The Big Smackdown)

A CPU (Central Processing Unit) functions by executing a sequence of instructions. These instructions reside in some sort of main memory and typically go through four distinct phases during their CPU lifecycle: fetch, decode, execute, and writeback. During the fetch phase the instruction is retrieved from main memory and loaded onto the CPU. Once the instruction is fetched it is decoded, or broken down into an opcode (the operation to be performed) and operands containing values or memory locations to be operated on by the opcode. Once it is determined what operation needs to be performed the operation is executed. This may involve copying memory to locations specified in the instruction's operands or having the ALU (arithmetic logic unit) perform a mathematical operation. The final phase is the writeback of the result to either main memory or a CPU register. After the writeback the entire process repeats. This simple form of CPU operation is referred to as subscalar in that the CPU executes one instruction operating on one or two pieces of data at a time. Given the sequential nature of this design it will take four clock cycles to process a single instruction. That is why this type of operation is referred to as “sub”scalar, or less than one instruction per clock cycle.

To make their CPUs faster, chip manufacturers started to create parallel execution paths in their CPUs by pipelining their instructions. Pipelining allows more than one step in the CPU lifecycle to be performed at any given time by breaking down the pathway into discrete stages. This separation can be compared to an assembly line, in which an instruction is made more complete at each stage until it exits the execution pipeline and is retired. While pipelining instructions will result in faster CPUs, the best performance that can be achieved is scalar, or one complete instruction per cycle.
To achieve speeds faster than scalar (that is, superscalar), chip manufacturers started to embed multiple execution units in their designs, increasing their degree of parallelism even more. In a superscalar pipeline, multiple instructions are read and passed to a dispatcher, which decides whether or not the instructions can be executed in parallel. If so, they are dispatched to available execution units, resulting in the ability for several instructions to be executed simultaneously. The more instructions a superscalar CPU is able to dispatch simultaneously to waiting execution units, the more instructions will be completed in a given clock cycle.

As manufacturing techniques reach theoretical limits in miniaturization, increased use of parallel computing in the form of multi-core processors has been employed to improve overall processing performance. A multi-core processor is a single computing component with two or more independent actual processors (called "cores"), which are the units that read and execute program instructions. Designers may couple cores in a multi-core device tightly or loosely. For example, cores may or may not share caches, and they may implement message passing or shared memory inter-core communication methods. Common network topologies to interconnect cores include bus, ring, two-dimensional mesh, and crossbar. Homogeneous multi-core systems include only identical cores; heterogeneous multi-core systems have cores which are not identical. Just as with single-processor systems, cores in multi-core systems may implement superscalar, vector processing, SIMD, or multithreaded architectures. The improvement in performance gained by the use of a multi-core processor depends very much on the software algorithms used and their implementation. Techniques like instruction pipelining, adding multiple execution units, and adding multiple cores have enabled modern day CPUs to significantly increase their degree of instruction parallelism; however, they still lag far behind GPUs.

A GPU (Graphics Processing Unit) is a special purpose processor, known as a stream processor. It is specifically designed to perform a very large number of floating point operations in parallel. These processors may be integrated on the motherboard or attached via a PCI Express card. Today’s high end GPUs typically have gigabytes of dedicated memory and several hundred processor cores capable of running thousands of concurrent threads, all dedicated to performing floating point math. Stream processing is a technique used to achieve a form of parallelism known as data level parallelism. The concepts behind stream processing originated back in the heyday of the supercomputer. Applications running on a stream processor can use multiple computational units, such as the floating-point units on a GPU, without explicitly managing allocation, synchronization, or communication among those units. Not all algorithms can be expressed in terms of a data parallel solution. The ones that can realize significant performance gains by running on a GPU and taking advantage of the massive parallelism of the device (compared to the much more limited degree of parallelism of modern day CPUs).

In computing, most central processing units are labeled in terms of their clock speed expressed in hertz. This number refers to the frequency of the CPU's master clock signal ("clock speed"). This signal is simply an electrical voltage that changes from low to high and back again at regular intervals.
Hertz has become the primary unit of measurement accepted by the general populace to determine the speed of a CPU. While it may make sense to compare homogeneous architecture CPUs with each other in terms of their clock speed, it does not make sense to compare the clock speed of heterogeneous CPU architectures. For example, if we are comparing a subscalar CPU with a clock speed of 3.2 GHz with a superscalar CPU with a clock speed of 2.2 GHz we cannot really tell which one will be able to execute more instructions in a given time interval, because the superscalar processor will execute multiple instructions in a single clock cycle.

To compare heterogeneous CPU architectures in terms of their speed, we need to use a different metric than their clock speed. The industry accepted measure is FLOPS (FLoating point Operations Per Second). For example, if a system is rated at one GFLOPS of computational capacity it means that the system can perform 10^9 (1,000,000,000) floating point operations per second. The computational capacity of GPUs can also be measured in terms of FLOPS, so let’s take a look at how CPUs stack up against GPUs in terms of peak FLOPS. From the figure below we can see that Intel’s Westmere chips have about a 50 GFLOPS double precision rating while Nvidia’s most powerful card shipping today offers 500 GFLOPS of double precision computational capacity. As you can see from this chart the GPU is 10 times faster than the CPU at double precision floating point math. If we look at single precision computational capacity the GPU is about 15 times faster than the CPU.

Nvidia CEO Jen-Hsun Huang, while speaking at the Hot Chips symposium at Stanford University, predicted that GPU computing will experience a rapid performance boost over the next six years. According to Huang, GPU compute is likely to increase its current capabilities by 570 times, while 'pure' CPU performance will progress by a limited 3 times. Now these are just his predictions but perhaps six years from now we will be talking about "Huang's Law" instead of Moore's (moore on that later). Nvidia provided a glimpse of their technology roadmap at their GPU Technology Conference in 2010. As you can see from the chart below they appear to be on track in turning Huang’s predictions into reality.

The C1060 Tesla card was Nvidia’s first GPU designed from the ground up for performing general-purpose computations. Their second generation GPU (code named Fermi) is 4 times faster than their C1060 cards. As you can see from the graph above their 3rd generation GPU (code named Kepler) promises to be 3 – 4 times faster than Fermi and their 4th generation GPU (code named Maxwell) should be 3 – 4 times faster than that. This should give you a bit of insight into why the supercomputer industry is going down the path of their hybrid heterogeneous CPU / GPU designs.

What’s a Virtual Machine

There is another, seemingly unrelated, innovation that recently occurred in the computing industry. Enter the Virtual Machine. A virtual machine (or VM) is a software implementation of a machine (i.e. computer) that executes programs just like a physical machine. A typical usage for a virtual machine is to consolidate multiple logical (or virtual) machines onto a single physical machine. This provides for faster hardware provisioning and more fully utilized physical hardware resources. Hosting service providers such as Go Daddy use VMs to “slice and dice” their physical hardware into multiple logical machines that they sell to their customers. Cloud computing infrastructures such as Amazon's Elastic Compute Cloud (EC2) are also built with VMs. Many datacenters across multiple vertical industries are converting their companies’ infrastructure to run on VMs. This saves them datacenter space and power while more fully utilizing their physical hardware.

Moore’s Law

Gordon Moore, a co-founder of Intel, noted that the number of transistors that can be placed inexpensively on an integrated circuit increases exponentially, doubling every two years. More transistors mean faster, more powerful CPUs. This observation gave rise to the phrase "Moore's Law". Gordon Moore also coined a lesser-known phrase, "Moore's Second Law". Moore's second law states that the R&D, manufacturing and QA costs associated with semiconductor fabrication also increase exponentially over time. Moore’s second law hypothesized that at some point the increasing fabrication cost coupled with the physical limitations of semiconductor fabrication materials will erect a brick wall halting the exponential advance of CPU clock rates. While Gordon Moore did not predict when this halting would occur, recent products from Intel and AMD would suggest that they (we) have hit it. Both companies seem to have at least temporarily abandoned their relentless quest for faster CPUs in favor of CPUs consisting of many cores.

How can virtualization be leveraged to help us build a supercomputer? Due to the impending collapse of Moore’s Law (see side bar), Intel and AMD are no longer producing CPUs with significantly faster clock speeds. They have started to produce chips with multiple cores instead. Dual quad core processors have become the norm with dual hexa and dual octa core designs soon to follow. This does not bode well for the likes of Dell and HP who are now building desktop systems with significantly more computational capacity than the user sitting in front of them can ever hope to consume. Well there’s no reason to let all those spare clock cycles go unused. We can use virtualization software to slice off the extra computational capacity and we can stitch it all together with Open Source distributed programming frameworks.

While Nvidia was busy turning the supercomputer industry on its head, Parallels, a virtual machine software company, was busy as well. Parallels has integrated into their desktop virtualization product (Parallels Workstation Extreme) the ability to dedicate a second network interface card (NIC) and graphics card (GPU) to a VM running on a desktop. While this may not sound like a big deal, their approach to doing it is. Parallels utilizes Intel’s latest virtualization technology that supports directed I/O (VT-d). This differs from the traditional approach of graphics virtualization via pass-through drivers, yielding improved isolation and greater reliability, availability, and performance. The intended usage of Parallels Workstation Extreme is for people who do CAD / CAM or visualization work.

Typically these users need two systems. One is a desktop running Windoz for email, spreadsheets, word processing and internet access and the second is a high-end UNIX / Linux workstation for their graphics intensive applications. By using the Parallels Workstation Extreme product they only need one system! That all sounds great, but we are not running CAD / CAM or visualization applications. We want to run computationally intensive scientific applications. Parallels' innovative approach to GPU virtualization works for building Virtual Supercomputers as well.

Hey… What’s the Big Idea?

As I mentioned earlier a supercomputer is really just a bunch of calculation units (CPUs & Vector Processors) stitched together with some sort of high-speed network transport with access to local and/or distributed memory and high performance file systems. Workload communication/control software is used to manage (or distribute) the computational workload to the calculation units and to collect the results. The first step in building a Virtual Supercomputer is getting our hands on some calculation units. Since I’m going to assume that most of you don’t have your own silicon fabrication facilities I think we should use “off-the-shelf” processors to build our Virtual Supercomputer. We can use commodity desktops, virtual machine software, and GPUs to build our calculation units. We will also need workload command and control software, and a high-speed network.

Let’s start with our calculation units. I work for a medium sized company. We have close to 4,000 employees. Practically all of these 4,000 employees have a desktop computer to read email, access the internet, write documents, create presentations, run spreadsheets, and perform their job specific duties. When PCs first came out they were somewhat overburdened by these simple tasks. Today’s desktop machines typically have quad core processors (if not dual quad core processors) and can very easily keep up with the trivial demands of the typical user’s workflow. Instead of relying on racks of new servers to power our supercomputer let's just take 1,000 of our users’ desktops (that we already own) and install a Linux VM on them using the Parallels Workstation Extreme product. We can configure the VM to start at boot time and steal four processor cores without the user even knowing that it is there. The figure below depicts one node of our Virtual Supercomputer. I would recommend using a 64 bit Windows 7 machine with at least 16 GB of RAM. This will allow your desktop owners to run all of their MS Windows productivity applications in the host OS and still leave you with plenty of memory for the Linux VM.

Our calculation units are not yet complete because they don’t have any vector processors. To solve that we install an additional Nvidia GPU, which we will use as our vector processor, in each desktop pinning it to the VM. I would recommend an Nvidia C2050 Tesla card. The C2050 will give you 500 GFLOPS of double precision computational capacity. If we put a C2050 in 1,000 desktops our Virtual Supercomputer will have a theoretical peak of half a PETAFLOP. You already own the desktops but you are going to have to buy a Parallels Workstation Extreme license and a Tesla card for each node, which can get a bit expensive if you need 1,000 of each, so feel free to adjust the size of your Virtual Supercomputer to suit your needs and wallet. The figure below depicts a completed calculation node in our Virtual Supercomputer. Notice that there are two graphics cards in the host. One graphics card will drive the monitor of the host OS and the other graphics card will be directly assigned to the Linux VM (the Tesla C2050).

Depending on what kind of jobs we intend to run on our Virtual Supercomputer we may need a high performance file system to keep disk I/O from becoming a bottleneck. Lustre is a massively parallel distributed file system, generally used for large scale cluster computing. The name Lustre is a portmanteau word derived from Linux and cluster. Lustre’s aim is to provide a file system for clusters of tens of thousands of nodes with petabytes of storage capacity without compromising speed, security or availability. Lustre file systems are used in computer clusters ranging from small workgroup clusters to large-scale, multi-site clusters. Fifteen of the top 30 supercomputers in the world use Lustre file systems, including the world's fastest supercomputer. We already have a significant amount of ground to cover in learning how to program our Virtual Supercomputer so we will leave the high performance file system for another day (and another book). If the types of jobs that you will be running on your Virtual Supercomputer are disk I/O intensive you can install Lustre as an upgrade after your system is functional.

The only thing missing from our design is a high speed network. Well, as it turns out, a gigabit Ethernet network should be fast enough for our purposes. While we could run our Virtual Supercomputer on a slower network we would find it difficult to get optimum performance out of it, and our users’ desktop processing might be degraded from time to time. With gigabit Ethernet you will get good performance out of your Virtual Supercomputer and your users’ desktops will not be adversely impacted. If you want to increase the network capacity of your system just add a second network interface card to each desktop and directly assign it to the Linux VM. Configuring your calculation nodes in this manner provides your VMs with the entire bandwidth of the second NIC, leaving the primary NIC 100% allocated to the host. This would look like:

In the figure above the half height cards are NICs and the full height cards are GPUs. Just assigning a separate NIC to the VM may not be sufficient, however, if your models are performing a significant amount of I/O. You may need a completely separate subnet for your Virtual Supercomputer network traffic. If you only have one switch, as in the diagram below, the network traffic from your Virtual Supercomputer is mixed in with the network traffic from your users’ desktops.

You can separate the network traffic by adding a second switch. The original switch is used to route your desktop users' network traffic and the second switch routes the network traffic for your Virtual Supercomputer, as in the diagram below.

With the above configuration you achieve complete isolation of your Virtual Supercomputer's network traffic. While this will cost you a bit more I think you will find it money well spent.

Programming the Beast

Well now that we have figured out what hardware we will use to build our Virtual Supercomputer, how are we going to program it? First we are going to need to decompose the problems that we are trying to solve into their parallel components. Gene Amdahl, a computer architect for IBM, coined the phrase “Amdahl’s Law”. Amdahl’s Law states that the speedup of a program using multiple processors is limited by the time needed for the sequential fraction of the program. For example, if a program needs 20 hours using a single processor core, and a particular portion of 1 hour cannot be parallelized, while the remaining portion of 19 hours (95%) can be parallelized, then regardless of how many processors we devote to a parallelized execution of this program, the minimal execution time cannot be less than that critical 1 hour. Hence the speedup is limited to at most 20X. The flip side of this analysis is that if a program needs 20 hours using a single processor core, and a particular portion of 19 hours cannot be parallelized, while the remaining portion of 1 hour (5%) can be parallelized, then regardless of how many processors we devote to a parallelized execution of this program, the minimal execution time cannot be less than that critical 19 hours.

My shop teacher in High School once told me that it is very important to pick the right tool for the job. To a hammer the entire world looks like a nail. He would always say, “Don’t let me catch you hammering in screws”. His point was that you need to know what a tool is good at and use it for that, and to know what it is not good at and to NOT use it for that. It seems so obvious when you say “Don’t hammer in screws” but it is not so obvious when you are sifting through a couple of thousand lines of code in a simulation to determine the best way to speed up its execution. Let’s explore what types of algorithms will work best on our Virtual Supercomputer.

Michael Flynn, a professor emeritus at Stanford University, came up with a method of classifying computer architectures. This method of classification became known as “Flynn’s Taxonomy”. The taxonomy is based on the number of concurrent instruction (or control) and data streams available in the architecture. Flynn’s breakdown is as follows:

Single Instruction, Single Data (SISD)
Single Instruction, Multiple Data (SIMD)
Multiple Instruction, Multiple Data (MIMD)
Multiple Instruction, Single Data (MISD)

SISD is a sequential computer which exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are the traditional uniprocessor machines like an early PC or mainframe. MIMD is exemplified by a system having multiple autonomous processors simultaneously executing different instructions on different data or the same data. Distributed systems are generally recognized to be MIMD architectures, exploiting either a single shared memory space or a distributed memory space. SIMD is a computer which exploits multiple data streams against a single instruction stream to perform operations which may be naturally parallelized, for example the Nvidia GPUs that we will program with CUDA. MISD is rarely seen in practice and will not concern us here.

Let’s dig a little deeper into SIMD. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code. For instance, consider a 2-processor system (CPUs A and B) in a parallel environment, and we wish to do a task on some data D. It is possible to tell CPU A to do that task on one part of D and CPU B on another part simultaneously, thereby reducing the duration of the execution. The data can be assigned using conditional statements.

As a specific example, consider multiplying two matrices. In a SIMD implementation, CPU A could compute the top half of the rows of the result matrix while CPU B computes the bottom half. Since the two processors work in parallel, the job of performing matrix multiplication would take roughly one half the time of performing the same operation in serial using one CPU alone. Now imagine that instead of two processors you have 512 (like some modern day GPUs have). You should be able to achieve a 512X speedup of this algorithm by running it on a GPU. Well that's the theory. In practice you will have to copy data on and off the device over the PCI Express bus and possibly swap data back and forth between shared memory and global memory on the GPU. Also, while a GPU has significantly more cores than a CPU, these cores have a much slower clock rate, due to power and thermal limitations, so you are not going to actually get a 512X speedup but you will get a very significant speedup.

Our Virtual Supercomputer will be many SIMD systems nested inside of a MIMD system. That may seem a bit abstract so let’s look at a graphic representation. The images below show a high level architectural view of MIMD and SIMD.

Think of each Processing Unit in the MIMD picture as a desktop connected to a network and the entire SIMD picture as a GPU. By installing a GPU in each desktop we get many SIMD systems nested inside of a MIMD system as exemplified by the image below:

Decomposing a program or algorithm so that it runs well on our MIMD [SIMD] system is, in my opinion, more of an art than a science. The problem space needs to be broken down at a coarse-grained level, which will be distributed to each desktop to solve. Once the computation arrives at the desktop it will be broken down to a fine-grained level, which will be distributed to each GPU to solve. We will provide you with a few programming examples to get your feet wet but you will need to look elsewhere for more complete coverage of this topic to become an expert.

One of the major difficulties in distributed parallel computing is facilitating the communication and control aspects of the parallel jobs. How do we start and stop our jobs? How do we get the data that our algorithms need to the distributed desktops and back in a reliable manner? MPI (Message Passing Interface) is a specification for a library of routines that provide the communication and control facilities necessary for building and running distributed parallel programs. MPI is not sanctioned by a formal standards body such as ISO or IEEE (it is maintained by the MPI Forum) but it has achieved widespread adoption in the scientific computing community. Many of today’s supercomputers are powered by MPI. One of the drawbacks to MPI is that it is not very resilient to node failures, and as the size of your cluster grows so does the number of node failures.

Another tool that can be used to facilitate the communication and control aspects of our parallel jobs is PVM (Parallel Virtual Machine). PVM grew out of a research project at Oak Ridge National Laboratory in the 1990s. The project’s goals were to develop tools to support high performance distributed computing. One of the major benefits that PVM has over MPI is that it is resilient to node failures. Given that our Virtual Supercomputer will be running on a collection of loosely coupled desktop systems with a potentially wide geographic distribution, we may find that we need the resiliency.

Once we have decomposed our program we can use MPI or PVM to distribute the coarse-grained computations to the Linux VMs on our cluster. Once those computations arrive at their destinations they will again be decomposed into more fine-grained calculations that will run on the GPU. GPGPU computing, or General-Purpose computing on Graphics Processing Units, is the technique of using a GPU, which typically handles computation only for computer graphics, to perform computations in applications traditionally handled by the CPU. We will explore two different approaches to programming the GPUs. We will take a look at Nvidia’s CUDA and the industry standard OpenCL.

CUDA (an acronym for Compute Unified Device Architecture) is a parallel computing architecture developed by Nvidia. CUDA is the computing engine in Nvidia graphics processing units that is accessible to software developers through variants of industry standard programming languages. Programmers can use 'C for CUDA' (C with Nvidia extensions), compiled with Nvidia’s nvcc compiler, to code algorithms for execution on the GPU. CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs. Through the use of CUDA, Nvidia GPUs become accessible for general purpose computations like CPUs. We also mentioned OpenCL. What is OpenCL… you ask?
A couple of years ago Apple (you know that scrappy little hardware company that just passed Exxon Mobil in market capitalization to become the largest company in the U.S.) was trying to figure out a way to position developers on their Mac OS to more easily adapt their code to the multi-core CPU revolution that is currently under way. They came up with something that they called Grand Central. In a nutshell Grand Central allows developers to be freed from many of the difficult tasks involved in multi-threaded parallel programming. Someone at Apple (probably Steve) noticed that the Grand Central work they were doing mapped nicely to GPGPU programming. Out of the goodness of their hearts Apple wrapped up this work into a specification and submitted it to the Khronos Group (an industry consortium that creates open standards) as OpenCL (Open Computing Language). The Khronos Group got all the major players (Intel, AMD, ATI, Nvidia, RapidMind, ClearSpeed, Apple, IBM, Texas Instruments, Toshiba, Los Alamos National Laboratory, Motorola, QNX, Bla…, Bla…, Bla…, Yadda..., Yadda..., Yadda...) together and they hammered out the 1.0 version of the OpenCL specification in record time.

Programs written to run on OpenCL are written in a manner that is by definition parallel. This means that not only can they take advantage of the multiple cores on a CPU but that they can also take advantage of the hundreds of cores on a GPU. But it doesn’t stop there. OpenCL drivers are being written for IBM Cell processors (you know… the little fellow that powers your PS3), ATI GPUs, Nvidia GPUs, AMD CPUs, Intel CPUs and a host of other computing devices. We will also be able to program Intel’s MIC chips and AMD’s APUs in OpenCL (at least that’s what my friends at Intel and AMD are currently saying).

Since Nvidia’s CUDA is robust and production ready, and OpenCL is a standard supported by many vendors, we will provide you with an introduction to both CUDA and OpenCL. Once you have a basic understanding of CUDA and OpenCL we will show you how to combine them with MPI and PVM. Combining these technologies with virtualization will allow you to amass an amazing amount of computational capacity for a very small cost.

You might wonder why we need the desktop VM. Why don’t we just use MPI / PVM to run our jobs natively on the Windows desktops? There are two important features that our Virtual Supercomputer gets from the desktop VM. First and foremost, most people who are serious about number crunching do it on Linux. I’m sure I’m not the only one out there with a “Serious Number Crunchers do it on Linux” t-shirt… on second thought… maybe I am. While some of the tools you use to crunch your numbers may be available on Windoz… many are not. The second reason for the desktop VM is safety. With the desktop VM we restrict the number crunching processes to a sandbox. Without the desktop VM a runaway program could not only bring down one of your users’ desktops, it could bring down all of your users’ desktops. By running in a desktop VM the worst that can happen is that a runaway program could bring down your Virtual Supercomputer, leaving your users’ desktops unharmed.

Virtual Supercomputer vs. Beowulf Cluster

In 1994 a group of researchers at NASA built a computing cluster out of commodity desktops using Free and Open Source Software (FOSS). They named their creation Beowulf after the main character in the Old English poem Beowulf. In the poem the hero Beowulf was described as having "thirty men’s heft of grasp in the grip of his hand". The term Beowulf Cluster is now commonly used to describe a class of computational clusters that have the same attributes as NASA’s original cluster.

Our Virtual Supercomputer has many of the same attributes as a Beowulf Cluster. It is powered by MPI / PVM and Linux, just like a Beowulf Cluster. It is built out of a collection of commodity desktops networked together, just like a Beowulf Cluster. Where a Virtual Supercomputer differs from a Beowulf Cluster is that it runs on desktop hardware that you already own and it is turbo charged with GPUs. This makes a Virtual Supercomputer significantly less expensive than a Beowulf Cluster because it is running on the spare computational capacity of your preexisting desktop hardware. Drop in the GPUs and the FLOPS available for your computations go through the ceiling.

Other problems that need to be addressed with a Beowulf Cluster are nonexistent with a Virtual Supercomputer. A large Beowulf Cluster will take up a significant amount of space in a datacenter and as we all know datacenter space is expensive. A Virtual Supercomputer takes up zero additional space. A Virtual Supercomputer will only increase your power consumption by an additional 250W per node (to power the GPU) whereas a Beowulf Cluster can chew up close to 1000W per node. While the GPUs in a Virtual Supercomputer will give off a significant amount of heat, no additional cooling cost is incurred since they are distributed throughout your company. While you could just drop GPUs into your Beowulf Cluster and achieve the same computational capacity, a Virtual Supercomputer is a more compact, more efficient, more elegant, and less expensive solution.

Summary

In this chapter we walked you through the basic concept of a Virtual Supercomputer. By combining the latest advances in virtualization, GPGPU computing and some tried and true Open Source software frameworks we will show you how to build and program an extremely powerful computation engine. A single node powered by an Nvidia Tesla card will yield a significant amount of computational capacity. When we stitch many nodes together (as in the diagram below) and handle the workflow distribution with MPI / PVM we get an unbelievable amount of computational capacity at a fraction of the cost of a traditional HPC cluster.

In the remaining chapters of this book we will provide you with an introduction to GPGPU programming using the two most widely accepted toolkits for programming GPUs: CUDA and OpenCL. We will also show you how to integrate CUDA and OpenCL based models with MPI and PVM so that you can not only unleash the hidden computational capacity of a single GPU but also combine the computational capacity of hundreds or even thousands of GPUs into a single computational engine or Virtual Supercomputer.