WHITE PAPER
Intel® True Scale Fabric Architecture
Supercomputing

Intel® True Scale Fabric Architecture: Three Labs, One Conclusion
Intel® True Scale Fabric in National Laboratory Systems Changes the View of Interconnects and the Work in Supercomputing

Table of Contents
Executive Summary
When Fast is Not Fast Enough
Intel, a Force in HPC
Key Components to a Fast System
Intel, TLCC, and TLCC2
"Unprecedented Scalability and Performance" – Chama at Sandia National Laboratories
  Key Findings
  Messaging Micro-benchmarks
    - Bandwidth and Latency
    - MPI Message Rate
    - Random Message Bandwidth
    - Global Communications – MPI Allreduce
  Application Testing
    - Cielo Acceptance Benchmarks
  Sandia's Conclusions
"Changing the Way We Work" – Luna at Los Alamos National Laboratory
  Key Findings
  Application Testing
  Communications Micro-benchmarks
    - Node-to-Node Bandwidth and Adapter Contention
    - Global Communications – MPI_Allreduce Results
  Los Alamos' Conclusions
Supreme Scalability – Zin at Lawrence Livermore National Laboratory
Summary and Conclusions

EXECUTIVE SUMMARY
Three United States national laboratories, known for their work in supercomputing, recently benchmarked new systems delivered to each of them. These systems are built on the Intel® Xeon® processor E5-2600 family and Intel® True Scale Fabric, based on InfiniBand* and the open source Performance Scaled Messaging (PSM) interface. The scientists performing the benchmarks concluded in individual reports that Intel True Scale Fabric:
• Contributed to "unprecedented scalability and performance" in their systems and is allowing them to change how they work.
• Outperforms, on some tests, one of the most powerful customized supercomputers in the world, ranked number 18 on the November 2012 Top500 list.1
• Delivers a level of performance they had not previously seen from a commodity interconnect.
The new systems, named Chama (at Sandia National Laboratories), Luna (at Los Alamos National Laboratory), and Zin (at Lawrence Livermore National Laboratory), are part of the Tri-Lab Linux* Capacity Clusters 2 (TLCC2) in the Advanced Simulation and Computing (ASC) program under the National Nuclear Security Administration (NNSA). This paper summarizes the findings of the reports from these three laboratories.

When Fast is Not Fast Enough
InfiniBand Architecture has proven itself over the years as the interconnect technology of choice for high-performance computing (HPC). For a commodity interconnect, it continues to achieve performance advances beyond other industry-standard networks, outperforming them by a significant factor. But when it comes to the demands of HPC and MPI message passing, fast is never fast enough.
While MPI using InfiniBand Verbs delivers fast communications, Verbs and the traditional offload processing on InfiniBand Host Channel Adapters (HCAs) carry a costly overhead that hinders scalability at larger core counts. Intel True Scale Fabric, with its open source Performance Scaled Messaging (PSM) interface and onload traffic processing, was designed from the ground up to accelerate MPI messaging specifically for HPC. Intel True Scale Fabric delivers very high message rates, low MPI latency, and high effective application bandwidth, enabling MPI applications to scale to thousands of nodes. This performance drove the choice of interconnect for the most recent acquisitions in the Advanced Simulation and Computing (ASC) Program's Tri-Lab Linux Capacity Clusters 2 (TLCC2): Chama (at Sandia National Laboratories), Luna (at Los Alamos National Laboratory), and Zin (at Lawrence Livermore National Laboratory).

Intel, a Force in HPC
Intel has a long history in high-performance computing (HPC) systems and the national laboratories that use them. Intel built the first massively parallel processing (MPP) machine to reach one teraFLOP and delivered it in 1996 to the Advanced Simulation and Computing (ASC) Program (formerly ASCI) as Option Red. Intel continues to be a driving force in supercomputing, with Intel® processors in more systems on the Top500 list1 of the world's fastest supercomputers than any other manufacturer. But it takes more than just a fast processor to live among the fastest 500 systems.

Key Components to a Fast System
The fastest systems use more than just Intel processors. Intel provides the components and software tools to help achieve the highest performing codes on some of the nation's most critical computing jobs.
• Intel® Xeon® processors – 377 (75 percent) of the Top500 supercomputers use Intel® Architecture processors.
• Intel® True Scale Fabric – designed specifically for HPC to minimize communications overhead and enable efficient systems, Intel True Scale Fabric enables the fastest clusters based on InfiniBand* Architecture.
• Intel® Xeon Phi™ coprocessors – built on many-core architecture, Intel Xeon Phi coprocessors offer unparalleled acceleration for certain codes.
• Intel® Software Tools – a host of tools support cluster builders and application programmers in making their codes fast and efficient.
• Intel® Storage Systems – HPC demands the fastest components, and Intel storage components deliver both speed and reliability.

Intel, TLCC, and TLCC2
The ASC Program under the National Nuclear Security Administration (NNSA) provides leading-edge, high-end simulation capabilities to support the Administration's mission. Some of the fastest supercomputers in the world are managed under the ASC Program at the three NNSA laboratories: Los Alamos National Laboratory, Sandia National Laboratories, and Lawrence Livermore National Laboratory. These machines include "capacity" and "capability" HPC systems designed for a range of computing jobs and users. Capacity and capability machines are generally distinguished by their differences in size and users. While both categories have grown in computing ability over the years, capability systems are typically dedicated to a smaller group of users and are much larger, comprising as much as an order of magnitude more cores than capacity machines (hundreds of thousands compared to tens of thousands of cores). The Tri-Lab Linux Capacity Clusters (TLCC) contribute to capacity computing at the three NNSA laboratories.
TLCC is designed for scalability, adapting resources to each job's computing requirements while running multiple jobs simultaneously. The systems therefore consist of a number of Scalable Units (SUs), each comprising 162 compute, user, and management nodes and 2,592 cores, and delivering about 50 teraFLOPS per SU. One TLCC procurement included the supercomputer Sierra, built with Intel True Scale Fabric components and housed at Lawrence Livermore National Laboratory. The second procurement of scalable Linux clusters, TLCC2, consists of three large Linux clusters, one housed at each NNSA laboratory:
• Chama – 8 SUs, with 1,296 nodes, located at Sandia National Laboratories in Albuquerque, New Mexico
• Luna – 10 SUs, with 1,620 nodes, located at Los Alamos National Laboratory in Los Alamos, New Mexico
• Zin – 18 SUs, with 2,916 nodes, located at Lawrence Livermore National Laboratory in Livermore, California
All three machines are built around Intel® technologies, including Intel® Xeon® processors and Intel True Scale Fabric HCAs and switches. At all three laboratories, users and laboratory scientists have reported significant performance and scalability improvements over other machines, prompting scientists to take a new look at how their work gets done.

"Unprecedented Scalability and Performance" – Chama at Sandia National Laboratories
Sandia National Laboratories, headquartered in Albuquerque, New Mexico, has, over the last six decades, "delivered essential science and technology to resolve the nation's most challenging security issues."2 Sandia has a long history of high-performance computing: it is the home of the nation's first teraFLOP supercomputer, ASCI Option Red, built by Intel in 1996. As one of the laboratories providing capacity computing to the NNSA ASC program, it received its latest TLCC2 capacity machine, Chama, in 2012.

With the acquisition of Chama, users began reporting 2x to 5x performance improvements on their jobs. Sandia scientists wanted to "understand the characteristics of this new resource," so they ran micro-benchmarks and application tests on Chama and two other systems at Sandia: Red Sky, another capacity computing machine and Chama's predecessor in the TLCC, and Cielo, a capability supercomputer. Their findings are captured in their report.3 Table 1 lists the system configurations for Chama, Red Sky, and Cielo.

Table 1. Sandia National Laboratories Test Systems.

Configuration | Chama | Red Sky | Cielo
Total Computing Nodes | 1,232 | 2,816 | 8,894
Compute Complex
  Processor Architecture | Intel® Architecture formerly codenamed Sandy Bridge | Intel® Architecture formerly codenamed Nehalem | AMD Magny-Cours*
  Cache: L1 (KB) / L2 (KB) / L3 (MB) | 8 x 32 / 8 x 256 / 20 | 4 x 32 / 4 x 256 / 8 | 8 x 64 / 8 x 512 / 10
  Cores/Node | 16 | 8 | 16
  Total Cores | 19,712 | 22,528 | 142,304
  Clock Speed (GHz) | 2.60 | 2.93 | 2.40
  Instruction Set Architecture (ISA) | Intel® AVX | SSE4.2 | SSE4a
  Memory | DDR3 1600 MHz | DDR3 1333 MHz | DDR3 1333 MHz
  Memory/Core (GB) | 2 | 1.5 | 2
  Memory Channels/Socket | 4 | 3 | 4
  Peak Node GFLOPS | 332.8 | 94.76 | 153.6
Interconnect
  Manufacturer | Intel (QLogic) | Mellanox* | Gemini*
  Technology/Rate | InfiniBand* QDR | InfiniBand QDR | Custom
  IB HW Interface | PSM | Verbs | Custom
  Topology | Fat Tree | 3D Torus: 6 x 6 x 8 | 3D Torus: 18 x 12 x 24
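For reference, the Peak Node GFLOPS entries are consistent with cores × clock speed × peak double-precision floating-point operations per cycle per core; this derivation is our assumption about how the column was computed, not a statement from the report. For Chama, 16 × 2.60 GHz × 8 (Intel AVX) = 332.8 GFLOPS; for Cielo, 16 × 2.40 GHz × 4 = 153.6 GFLOPS.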
Figure 1. Sandia Inter-node MPI Performance: (A) bandwidth and (B) latency vs. message size, (C) MPI message rate (a Chama/Cielo comparison) vs. message size, and (D) random messaging bandwidth (bytes/second per MPI task) vs. MPI ranks, for Chama, Red Sky, and Cielo.

Key Findings
Sandia scientists tested the systems across a range of characteristics beyond those affected by the interconnect, including memory performance and contention, processor performance, and more. Chama proved to be a well-balanced system with impressive performance results, outperforming Red Sky and comparing well against Cielo. This paper, however, focuses on the results of the interconnect benchmarks and application testing in order to understand how the interconnect contributes to overall HPC performance. The tests revealed the following about the Intel True Scale Fabric interconnect:
• Chama returned unprecedented results in MPI messaging rate at message sizes up to 1 KB, outperforming even Cielo's custom interconnect.
• Chama delivered random messaging bandwidth the scientists had not yet seen from a commodity interconnect, exceeding Cielo by as much as 30 percent.
• Collectives performance scaling for Chama compares well against the custom interconnect of Cielo, with both outperforming Red Sky by an order of magnitude.
• Chama scaled well against Cielo on three Sandia finite element production applications that revealed severe scaling limitations on Red Sky.
The key findings from these micro-benchmarks and application tests indicate that Chama, with its Intel True Scale Fabric, "has a strong impact on applications," as attested by Chama users.

Messaging Micro-benchmarks
While the standard metrics include inter-node latency and bandwidth, Sandia scientists were keenly interested in Chama's MPI messaging rate and scalable random message bandwidth. Figure 1 shows the benchmark results for these tests.

Bandwidth and Latency
Sandia codes are more sensitive to bandwidth than to latency; this sensitivity drove the choice of Chama's Intel True Scale Fabric interconnect. As shown in Figures 1a and 1b, Chama performed well compared to Cielo's custom Gemini* interconnect, according to Sandia scientists. We note that at sizes well within the typical HPC message-size space, Red Sky's bandwidth climbed much more slowly, remaining about half of Chama's, and its latency began to increase dramatically at just 64-byte messages.
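The bandwidth and latency curves in Figures 1a and 1b come from point-to-point tests between ranks on different nodes. As an illustration only, the following minimal MPI ping-pong sketch (our own, with assumed message sizes and iteration counts; not the benchmark code used in the Sandia report) measures both quantities: small messages trace out the latency curve, large messages the bandwidth curve.

/* Minimal MPI ping-pong sketch (illustrative only, not the benchmark code
 * used in the reports): rank 0 and rank 1, placed on different nodes,
 * bounce a message back and forth; timing gives latency and bandwidth. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000;
    const int msg_size = 1 << 20;   /* 1 MB here; sweep sizes to reproduce the curves */
    char *buf = malloc(msg_size);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0) {
        double one_way_us = elapsed / (2.0 * iters) * 1e6;               /* latency estimate */
        double mbytes_s   = (2.0 * iters * (double)msg_size) / elapsed / 1e6;
        printf("%d bytes: %.2f us one-way, %.1f MB/s\n", msg_size, one_way_us, mbytes_s);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}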
MPI Message Rate
Of particular interest to the testers at Sandia was the ability of the interconnect to process messages as core counts increase. HCA congestion on multi-core nodes is "becoming a significant constraint" in HPC with commodity interconnects, even those based on InfiniBand Architecture. "Therefore, the most important internode behavior for Chama is the significant gain in MPI message rate in comparison to Cielo." For message sizes up to 1 KB, the Intel True Scale Fabric outperformed the custom interconnect of Cielo by 2x to 4x. For Sandia, this was an unprecedented result, which "…can have a significant positive impact on many applications, such as those that employ a sparse solver kernel."

Random Message Bandwidth
Not all inter-node communications are structured. Many applications, such as Charon, induce unstructured communications across the fabric, and understanding node behavior under random message traffic can more readily predict system performance with such codes. Sandia uses a random messaging benchmark to understand scalability in commodity clusters. The test "sends thousands of small messages from all MPI tasks with varying message sizes (100 bytes to 1 KB) to random MPI rank destinations." An aggregate average random messaging bandwidth (Figure 1d) was derived from per-process measurements (a minimal sketch of this kind of test appears at the end of this section). The measurements showed the following results, which the scientists had never seen from commodity interconnects benchmarked against a custom architecture:
• Red Sky, compared to Chama, performed from 10x slower (32 cores) to 220x slower (8,192 cores).
• Chama was 20 to 30 percent faster than Cielo, the capability supercomputer.
Chama's Intel True Scale Fabric scales extremely well with applications that create random traffic on large systems.

Global Communications – MPI Allreduce
To understand the behavior of Chama with applications that are sensitive to collective operations, Sandia averaged scalability performance data from a thousand trials using 8-, 64-, and 1024-byte transfers. As shown in Figure 2, Chama performs competitively with Cielo across all ranks. Both perform an order of magnitude better than Red Sky in some cases, with Red Sky's performance falling off above 1 KB messages.

Figure 2. IMB MPI_Allreduce Performance: average time (microseconds; lower is better) vs. MPI ranks for (A) 8-byte, (B) 64-byte, and (C) 1024-byte messages on Chama, Red Sky, and Cielo.
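To make these messaging micro-benchmarks concrete, the following minimal sketch captures the shape of the random-messaging test quoted above. It is our illustration, not Sandia's benchmark: for simplicity it pairs senders and receivers with a shared pseudo-random offset each iteration so that message counts stay balanced, whereas the actual test targets fully random destinations; the iteration count and size range are assumptions.

/* Sketch of a random-messaging bandwidth test (illustrative only).
 * Each iteration uses a pseudo-random offset and size, identical on all
 * ranks (same seed, same C library assumed), so every rank sends one
 * small message and receives exactly one. Assumes at least two ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 10000;
    const int max_msg = 1024;                    /* 100 B..1 KB range in the report */
    char sendbuf[1024], recvbuf[1024];
    int rank, nranks;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    srand(12345);                                /* same seed on every rank */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    long bytes = 0;

    for (int i = 0; i < iters; i++) {
        int offset = 1 + rand() % (nranks - 1);      /* identical on all ranks */
        int size   = 100 + rand() % (max_msg - 99);  /* 100..1024 bytes */
        int dst = (rank + offset) % nranks;
        int src = (rank - offset + nranks) % nranks;
        MPI_Sendrecv(sendbuf, size, MPI_CHAR, dst, 0,
                     recvbuf, size, MPI_CHAR, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        bytes += size;
    }

    double elapsed = MPI_Wtime() - t0;
    double per_rank_bw = bytes / elapsed;        /* bytes/second sent per MPI task */
    double agg_bw;
    MPI_Reduce(&per_rank_bw, &agg_bw, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("average bytes/s per task: %.3e\n", agg_bw / nranks);

    MPI_Finalize();
    return 0;
}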
Application Testing
With respect to Red Sky, the benchmarks above highlight the findings of previous studies Sandia performed on commodity clusters like Red Sky: poor scalability with applications that use implicit solvers, and poor parallel efficiency with higher amounts of unstructured message traffic (characteristics not exhibited by Chama in the micro-benchmarks). Those results, and other discoveries on previous commodity clusters, had provided a "strong case" for Sandia to invest in more custom MPP machines. However, users of Chama have reported performance improvements with their application codes of 2x to 5x. To further understand these experiences, the scientists proceeded with application testing.

Figure 3 graphs the results for the finite element method tests Aleph, Aria, and Charon; Figure 4 shows the performance for AMG2006. Again, Red Sky exhibits severe scaling limitations, while Chama outperforms Cielo on all tests. Table 2 lists four of the five applications used, along with their results, to show how Chama compared to Red Sky and Cielo at scale. These results are consistent with users' experiences.

Table 2. Sandia Application Scaling Tests (maximum performance improvement at scale).

Application | Science Domain | Key Algorithm | Timing Metric | Chama:Red Sky | Chama:Cielo
Aleph | Plasma simulation | Finite Element Method (FEM) particle move + field solves | Weak scaling, fixed number of steps | 4.2x | 1.3x
AMG2006 | Algebraic multigrid | Laplace solver, preconditioned conjugate gradient | Weak scaling, 100 iterations | 1.5x | 1.75x
Aria | CFD, thermodynamics | Implicit FEM | Strong scaling, 25 time steps | 3.4x | 2.6x
Charon | Semiconductor device simulation | Implicit FEM | Weak scaling, fixed number of iterations | 2.5x | 1.6x

Figure 3. Charon, Aleph, and Aria Application Scaling: (A) Charon (ML/Aztec time per BiCGStab iteration), (B) Aleph, and (C) Sierra/Aria times (seconds) vs. MPI ranks for Chama, Red Sky, and Cielo; lower is better.

Figure 4. AMG2006 Scaling Comparisons: PCG solve time (seconds) vs. MPI ranks for Chama, Red Sky, and Cielo; lower is better.

Cielo Acceptance Benchmarks
A number of other applications were benchmarked on Chama that are not covered by the current Sandia report. However, results for four of the six Tri-Lab Cielo acceptance benchmarks were included; they are shown in Figure 5. While "not as spectacular" as the earlier tests, Sandia scientists considered these results good.

Figure 5. Cielo Acceptance Test Performance Summary: performance gain factors for Chama over Red Sky and for Chama over Cielo on HPCCG, CTH, UMT, SAGE, and AMG2006 at 16, 128, and 1024 MPI tasks.

Sandia's Conclusions
Sandia scientists described the results for Chama's Intel True Scale Fabric performance as "unprecedented" and "never before seen" for a commodity interconnect. With its onload processing and its PSM interface, Chama's Intel True Scale Fabric outperformed Red Sky's Verbs-based InfiniBand communications and was competitive with the capability supercomputer Cielo. MPI profiles revealed that the faster MPI processing of the Intel True Scale Fabric contributed to Chama's scalability and to the 2x to 5x performance improvement experienced by Chama's users.

"The performance gains that Sandia users are experiencing with their applications on…Chama, has resulted in many positive feedbacks from happy users. …we are seeing unprecedented performance and scalability of many key Sandia applications."

"Changing the Way We Work" – Luna at Los Alamos National Laboratory
Los Alamos National Laboratory has a nearly 70-year history of discovery and innovation in science and technology.
Its mission is to "develop and apply science and technology to ensure the safety, security, and reliability of the U.S. nuclear deterrent; to reduce global threats; and solve other emerging national security and energy challenges."4 In 2012, it acquired Luna as part of TLCC2, and "reports from users have been extremely positive." In particular, two directed stockpile work (DSW) problems completed by users Mercer-Smith and Scott ran 3.9x and 4.7x faster on Luna than on other systems. Scientists at Los Alamos were asked to understand why Luna performed so much better. Their research is captured in benchmarks and application testing comparing Luna and Typhoon.5 Table 3 lists the configurations of the two systems used in the evaluation.

Table 3. Los Alamos National Laboratory Test Systems.

Category | Parameter | Typhoon | Luna
CPU core | Make | AMD Magny-Cours* | Intel® Sandy Bridge
CPU core | Model | Opteron* 6128 | Intel® Xeon® E5-2670
CPU core | Clock speed | 2.0 GHz | 2.6 GHz
CPU core | L1 data cache size | 64 KB | 32 KB
CPU core | L2 cache size | 0.5 MB | 0.25 MB
CPU socket | Cores | 8 | 8
CPU socket | Shared L3 cache size | 12 MB | 20 MB
CPU socket | Memory controllers | 4 x DDR3-1333 | 4 x DDR3-1600
Node | Sockets | 4 | 2
Node | Memory capacity | 64 GB | 32 GB
Network | Make | Mellanox* | Intel® True Scale Fabric
Network | Type | QDR InfiniBand (Verbs) | QDR InfiniBand (PSM)
System | Integrator | Appro | Appro
System | Compute nodes | 416 | 1540
System | I/O nodes | 12 | 60
System | Installation date | March 2011 | April 2012

Key Findings
Los Alamos scientists performed application tests to compare performance and scalability, plus micro-benchmarks to help understand what makes the systems perform differently. As at Sandia, the tests were comprehensive across a variety of characteristics; however, this paper focuses on the results of the interconnect micro-benchmarks and application testing.

We note that the authors discovered Typhoon exhibited atypical InfiniBand bandwidth performance during the single-node communication micro-benchmark. This led to a later evaluation of Typhoon's InfiniBand performance and an ensuing report.6 The findings revealed that a configuration problem caused lower than expected InfiniBand performance on Typhoon. When the problem was corrected and xRAGE, one of the applications used in the current tests, was rerun, Typhoon improved by about 21 percent on xRAGE. Whether this handicap carried across all Typhoon tests is unclear. Thus, in this paper, where appropriate, we award Typhoon a 21 percent benefit and present the resulting values in parentheses next to the original report's results.

Nonetheless, Luna generally outperformed Typhoon on every test and micro-benchmark Los Alamos performed, with some variability. The Los Alamos tests revealed the following:
• Across several comparisons, Luna rates from 1.2x to 4.7x faster than Typhoon.
• Luna's interconnect supports nearly full InfiniBand QDR bandwidth with little to no contention scaling to 16 cores, while Typhoon starts out fast and degrades steadily to 32 cores without achieving nearly full InfiniBand speeds.
• At 16 cores, Luna's Intel True Scale Fabric is 2.10x (1.74x) faster than Typhoon; at 32 cores, the difference rises to 2.19x (1.81x).
• Collectives performance showed Luna with an average of 1.95x (1.61x) improvement over Typhoon, but with variability.
The key findings from these micro-benchmarks and application tests indicate that Luna, with its Intel True Scale Fabric, delivers a wide range of performance improvements over Typhoon.
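Concretely, the parenthetical values are consistent with dividing each reported Luna:Typhoon ratio by 1.21 to credit Typhoon with the roughly 21 percent improvement described above; for example, 2.10 ÷ 1.21 ≈ 1.74, 2.19 ÷ 1.21 ≈ 1.81, and 1.95 ÷ 1.21 ≈ 1.61. This arithmetic is our reading of the adjustment, offered for clarity.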
"Luna is the best machine that the laboratory has ever had."

Application Testing
Los Alamos scientists performed four application tests, varying the number of cores and nodes for different tests. They tried to thoroughly understand what drives Luna's significant improvements and attempted to reproduce the improvements Mercer-Smith and Scott experienced. The tests and the sources of other metrics are briefly described in Table 4, along with the results. The extent of their comprehensive testing is beyond the scope of this paper; therefore only the results are summarized below and shown in Figure 6. Using theoretical calculations, actual measurements, and the experiences reported by users, Luna averages about 2.5x faster than Typhoon.

Table 4. Application Test Descriptions.

Application/Source | Luna:Typhoon | Description
Theoretical peak memory bandwidth | 1.2x | The simple ratio of Luna's memory bandwidth to Typhoon's
xRAGE | 1.56x (1.29x) | A collectives-heavy code
EAP test suite | 1.69x (1.40x) | A collection of 332 regression tests from the Eulerian Applications Project (EAP) run nightly on Luna and Typhoon
Mizzen problem | 2.07x (1.71x) | An integrated code representative of the types of codes normally run on Luna and Typhoon
Theoretical compute rate | 2.6x | Calculated maximum theoretical FLOPS
High-Performance Linpack* benchmark | 2.72x | According to the June 2012 Top500 list
Partisn, sn timing | 2.75x (2.28x) | A more communications-active code than xRAGE, with many small message exchanges
ASC1 code (Mercer-Smith & Scott) | 3.9x | DSW problem; not part of the current testing
ASC2 code (Mercer-Smith & Scott) | 4.7x | DSW problem; not part of the current testing

Figure 6. Luna:Typhoon Applications Performance Summary: Luna:Typhoon performance ratios at 128 MPI ranks for the measurements in Table 4.

Communications Micro-benchmarks
As at Sandia, Los Alamos scientists ran several micro-benchmarks to isolate some of the causes of Luna's performance edge over Typhoon. The Los Alamos tests also isolated several improvements at the node and processor architectural levels, but, again, this paper focuses on what the interconnect contributed to the overall performance.

Figure 7. Network Bandwidth as a Function of Contention for the NIC: aggregate communication bandwidth (B/µs) vs. the number of communicating pairs of processes, for Typhoon, Luna, and the theoretical peak.

Figure 8. Ratio of Luna's MPI_Allreduce Latency to Typhoon's for 128 MPI Ranks: Typhoon:Luna latency ratio vs. message size in words (2^2 to 2^20).

Node-to-Node Bandwidth and Adapter Contention
This micro-benchmark exchanges a large volume of data between two nodes, starting with a single core on each node and scaling to all 16 cores on a node (for Luna) or 32 cores (for Typhoon). The test records the bandwidth consumed for each exchange. Figure 7 charts the results.
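As an illustration of this kind of contention test, the sketch below (ours, not the Los Alamos code) pairs rank i on one node with rank i on the other, activates an increasing number of pairs, and reports the aggregate bandwidth. It assumes a block rank placement across exactly two nodes and streams data in one direction only.

/* Sketch of a two-node adapter-contention test (illustrative only).
 * Assumes a block rank placement: ranks 0..N-1 on node A and
 * N..2N-1 on node B, where N = ranks per node. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_SIZE  (4 * 1024 * 1024)   /* 4 MB per message (assumption) */
#define ITERS     100

int main(int argc, char **argv)
{
    int rank, nranks;
    char *buf = malloc(MSG_SIZE);
    memset(buf, 0, MSG_SIZE);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int ppn = nranks / 2;                       /* ranks per node */
    int partner = (rank < ppn) ? rank + ppn : rank - ppn;

    for (int pairs = 1; pairs <= ppn; pairs++) {
        int active = (rank % ppn) < pairs;      /* first 'pairs' ranks on each node */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        if (active) {
            for (int i = 0; i < ITERS; i++) {
                if (rank < ppn) {               /* node A sends, node B receives */
                    MPI_Send(buf, MSG_SIZE, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
                } else {
                    MPI_Recv(buf, MSG_SIZE, MPI_CHAR, partner, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                }
            }
        }
        MPI_Barrier(MPI_COMM_WORLD);
        double elapsed = MPI_Wtime() - t0;

        if (rank == 0) {
            double agg = (double)pairs * ITERS * MSG_SIZE / elapsed / 1e6;
            printf("%d pair(s): aggregate %.0f MB/s\n", pairs, agg);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

The barriers bracket each trial, so the reported time reflects the slowest active pair, which is where adapter contention shows up.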
For Luna, the first exchanges do not saturate the network, but within four cores full speed is achieved at 3,151 B/µs and held across all 16 cores, with little measurable degradation from contention. (Similar behavior was seen at Sandia, where the messaging rate scaled well across many MPI ranks.) Typhoon's network, however, while starting out faster than Luna at 1,879 B/µs, degraded steadily to 1,433 B/µs as the core count increased, indicative of contention as the adapter tries to pass traffic from more cores. The scientists determined that "while Luna's per-core (aggregate divided by number of cores) communication bandwidth is 2.10x [(1.74x)]7 that of Typhoon's at 16 cores/node, this ratio increases to 2.19x [(1.81x)] when comparing a full Luna node to a full Typhoon node."

Global Communications – MPI_Allreduce Results
For collectives performance, the Los Alamos authors created a micro-benchmark that reports the average time per MPI_Allreduce operation for various message sizes across 128 MPI ranks. Figure 8 graphs Luna's performance relative to Typhoon's. The authors note that "…the geometric mean of the measurements indicate that Typhoon takes an average (horizontal line) of 1.95x (1.61x) as long as Luna to perform an MPI_Allreduce…" However, they also draw attention to the variability of the results. They consider it, like other results in their study, evidence that "there is a large set of corner cases where Luna can be many times faster than Typhoon—and some applications may in fact hit these cases—but more modest speedups are the more common case."
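A minimal sketch of this kind of MPI_Allreduce timing loop follows. It is our illustration, not the Los Alamos micro-benchmark itself; the iteration count and the power-of-two sweep of double-precision word counts are assumptions. Run on 128 ranks, it reports an average time per operation for each message size, the quantity behind Figure 8.

/* Sketch of an MPI_Allreduce latency micro-benchmark (illustrative only):
 * report the average time per operation for a sweep of message sizes.
 * In practice, fewer iterations would be used at the largest sizes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000;
    const int max_words = 1 << 20;            /* sweep up to 1 Mi doubles */
    double *in  = calloc(max_words, sizeof(double));
    double *out = calloc(max_words, sizeof(double));
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int n = 1; n <= max_words; n *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double local = (MPI_Wtime() - t0) / iters;

        /* Use the slowest rank's time as the per-operation latency. */
        double worst;
        MPI_Reduce(&local, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("%8d words: %.2f us per MPI_Allreduce\n", n, worst * 1e6);
    }

    free(in);
    free(out);
    MPI_Finalize();
    return 0;
}

Taking the maximum across ranks guards against a single fast rank under-reporting the collective's completion time.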
Los Alamos' Conclusions
Luna outperforms Typhoon by 1.2x to 4.7x, as indicated by both theoretical and measured results. The authors conclude that "…almost all key components of Luna—CPUs, memory and network—are faster than their Typhoon counterparts, but by widely varying amounts (and in nonlinear patterns) based on how these hardware resources are utilized." Indeed, Luna is considered the best machine the Laboratory owns by one set of users. Other user experiences are quite positive, to the point that Luna is having an impact on how some work is done going forward.

"Luna tends to be about twice as fast as Typhoon across the various micro-benchmarks, but there are many outliers."

Supreme Scalability – Zin at Lawrence Livermore National Laboratory
Beginning operations in 1952, Lawrence Livermore National Laboratory has grown into a diverse complex of science, research, and technology, part of which supports the ASC Program and missions of the NNSA. The Terascale Simulation Facility (TSF) at Lawrence Livermore National Laboratory houses TLCC2 clusters and includes the world's second fastest supercomputer, Sequoia, according to the Top500 list.1

In 2011, the Laboratory acquired Zin, the latest addition to its TLCC2. The Zin cluster comprises 2,916 nodes, 46,208 cores, and an Intel True Scale Fabric network. Soon after it was delivered in 2011, it was ranked number 15 on the Top500 list of the fastest supercomputers in the world. A year later, it is still among the 30 fastest systems.

In 2012, Lawrence Livermore scientists ran scalability benchmarks across Zin and several other systems in the Tri-Lab complex, including other TLCC units and capability machines such as Cielo at Sandia National Laboratories. The results were presented at SC12 in November; Figure 9 graphs them. Of the six systems in the comparison, Cielo, Purple, and Dawn are capability MPP machines, while Sierra, Muir, and Zin are capacity clusters, all three using Intel True Scale Fabric networks. In this graph, the lower and flatter the scalability line, the better; a slope of 0 indicates ideal scalability. The three most scalable systems (Sierra, Muir, and Zin) were interconnected with Intel True Scale Fabric components, and Zin outperforms two of the three capability systems built on custom interconnects. We note that Cielo is the capability supercomputer at Sandia against which Chama competed so well.

Figure 9. Scaling Results of Zin and Other Tri-Lab Machines: weak scaling of a 3D radiation problem, showing the average zone-iteration grind time (microseconds per zone-iteration) vs. processors (CPUs); lower and flatter is better, and a slope of 0 is ideal scaling. Measured slopes: Purple 0.000079, Dawn (BG/P) 0.000016, Zin 0.000012, Cielo 0.000010, Sierra 0.000008, Muir 0.000005. Muir (full QDR), Sierra (full QDR), and Zin (full, 16 MPI/node) use Intel® True Scale Fabric; the other curves are Cielo (PGI, full, 16 MPI/node), Purple at retirement (NewComm), and Dawn 2.2. Source: Lawrence Livermore National Laboratory. Intel does not control or audit the design or implementation of third-party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase. See the Top500 list for configuration information at http://top500.org.

Summary and Conclusions
Across three NNSA national laboratories, TLCC and TLCC2 capacity computing systems powered by Intel True Scale Fabric networks and the Intel Xeon processor E5-2600 family outperform other machines, including MPP capability supercomputers. At Sandia National Laboratories, Chama delivers "unprecedented scalability and performance." Users of Luna at Los Alamos National Laboratory call it the "best machine the Laboratory has ever had" and say it is "changing the way we work." Zin at Lawrence Livermore National Laboratory, along with two other TLCC clusters built with Intel True Scale Fabric, dominates the scalability testing, ranking among the most scalable systems in the benchmark. These tests reveal how Intel True Scale Fabric, with PSM and onload processing, outperforms other interconnects used in HPC and drives some of the fastest supercomputers in the world.

Disclaimers
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web site at www.intel.com.

1. www.top500.org/list/2012/11
2. www.sandia.gov/about/index.html
3. Rajan, M., D.W. Doerfler, P.T. Lin, S.D. Hammond, R.F. Barrett, and C.T. Vaughan. "Unprecedented Scalability and Performance of the New NNSA Tri-Lab Linux Capacity Cluster 2," Sandia National Laboratories.
4. www.lanl.gov/mission/index.php
5. Pakin, Scott, and Michael Lang. "Performance Comparison of Luna and Typhoon," Los Alamos National Laboratory High-Performance Computing Division, November 19, 2012.
6. Coulter, Susan, and Daryl W. Grunau. "Typhoon IB Performance," Los Alamos National Laboratory, March 8, 2013.
7. Bracketed values are added by Intel to offset the report results as described earlier.

Copyright © 2013 Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.
Printed in USA 0513/ML/HBD/PDF Please Recycle 328985-001US