Page 18

Cover Feature

costs and ameliorate the technical challenges. The PathForward program provides $258m in funding, shared between AMD, Cray, HPE, IBM, Intel and Nvidia, which must also provide additional capital amounting to at least 40 percent of the total project cost.

“PathForward projects are designed to accelerate technologies to, for example, reduce power envelopes and make sure those exascale systems will be cost-competitive and power-competitive,” Diachin said.

For example, IBM’s PathForward projects span ten separate work packages, with a total of 78 milestones, across hardware and software. But for Turek, it is the software that is the real concern.

“I wouldn't worry about the hardware, the node, pretty much at all. I think it's really communications, it's file systems, it's operating systems that are the big issues going forward,” he said.

“A lot of people think of building these supercomputers as an exercise in hardware, but that’s absolutely trivial. Everything else is actually trying to make the software run. And there are subtly complex problems that people don't think about. For example, if you're a systems administrator, how do you manage 10,000 nodes? Do you have a screen that shows 10,000 nodes and what's going on? That's a little hard to deal with. Do you have layers of screens? How do you get alerts? How do you manage the operational behavior of the system when there's so much information out there about the component pieces?”
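The PathForward cost-share figures quoted above imply a floor on the overall investment. The short sketch below works through that arithmetic. It assumes the $258m of government funding and the vendors' own contribution together make up the whole project cost, which is one reading of the 40 percent condition; the function name is purely illustrative.

```python
# Assumption: DOE funding plus the vendors' contribution equals the total
# project cost, and the vendors' share of that total is at least 40 percent.
DOE_FUNDING_M = 258.0      # from the article, in millions of US dollars
MIN_VENDOR_SHARE = 0.40    # from the article


def minimum_totals(doe_funding_m, min_vendor_share):
    """Return (minimum total cost, minimum vendor contribution) in $m."""
    # The DOE money covers at most (1 - share) of the total, so the total
    # must be at least doe_funding / (1 - share).
    total = doe_funding_m / (1.0 - min_vendor_share)
    vendor = total - doe_funding_m
    return total, vendor


total_m, vendor_m = minimum_totals(DOE_FUNDING_M, MIN_VENDOR_SHARE)
print(f"minimum total: ${total_m:.0f}m, minimum vendor share: ${vendor_m:.0f}m")
```

Under that reading, the six vendors would between them be putting in at least roughly $172m, for a combined investment of at least about $430m.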

“Moore's Law has ended, we cannot build our exascale systems with traditional CPUs”

Each of the upcoming exascale systems in the US is budgeted under the CORAL-2 acquisition process, which is open to the same six US vendors involved with PathForward.

“Those are not part of the ECP itself, but rather a separate procurement process,” Diachin explained. “The ECP is a large initiative that is really being designed to make sure that those exascale platforms are going to be ready to go on day one, with respect to running real science applications.

“Our primary focus is on that software stack. It's really about the application, it's about the infrastructure, and the middleware, the runtime systems, the libraries.”

For a future exascale system, IBM is likely to turn to the same partner it has employed for Summit and Sierra - GPU maker Nvidia.

“You can already see us active in the pre-exascale era,” the company’s VP of accelerated computing and head of data centers, Ian Buck, told DCD.

“Traditional Moore's Law has ended, we cannot build our exascale systems, or our supercomputers of the future, with traditional commodity CPU-based solutions,” he said. “We've got to go towards a future of accelerated computing, and that's not just about having an accelerator that has a lot of flops, but looking at all the different tools that accelerated computing brings to the exascale.

“I think the ingredients are now clearly there on the table. We don't need to scale to millions of nodes to achieve exascale. That was an early concern - is there such a thing as an interconnect that can actually deliver or run a single application across millions of nodes?”

Accelerated computing - a broad term for tasks that use non-CPU processors, which Nvidia has successfully rebranded to primarily mean its own GPUs - has cut the number of nodes required to reach exascale.

“The reality is: with accelerating computing, we can make strong nodes, nodes that have four or eight, or sometimes even 16 GPUs in a single OS image, and that allows for technologies like InfiniBand to connect a modest number of nodes together to achieve amazing exascale-class performance,” Buck said.

THE FIRST LIGHT

IBM and Nvidia are together, and separately, vying for several exascale projects. But if everything goes to plan, Intel and Cray will be delivering the first US exascale system - thanks to the past not going to plan.

Previous US plans called for Intel and Cray to build Aurora, a 180 petaflop pre-exascale system, at Argonne National Laboratory. It was supposed to be based on Intel's third-generation ‘Knights Hill’ Xeon Phi processors, and use Cray’s ‘Shasta’ supercomputing architecture. Instead, by the summer of 2018, Knights Hill was discontinued.

“As we were looking at our investments and we continued to invest into accelerators in the Xeon processor, we found that we were able to continue to win with Xeon in HPC,” Jennifer Huffstetler, VP and GM of data center product management at Intel Data Center Group, told DCD by way of explaining why Phi was canceled.

That decision on how to proceed without Phi came as China was announcing a plan to build an exascale system by 2020 - way ahead of the US’ then-goal of 2023. Hoping to remain competitive, the DOE brought Aurora forward to 2021, with the aim of making it the country’s first exascale system.

We still don’t know which processor Intel will use, with the company’s Data Center Group GM Trish Damkroger promising “a new platform and new microarchitecture



DCD>Magazine Issue 31 - Exascale  
