WHITE PAPER
Intel® True Scale Fabric Architecture
Supercomputing

Intel® True Scale Fabric Architecture: Three Labs, One Conclusion
Intel® True Scale Fabric in National Laboratory Systems Changes the View of Interconnects and the Work in Supercomputing
Table of Contents
  Executive Summary
  When Fast is Not Fast Enough
  Intel, a Force in HPC
  Key Components to a Fast System
  Intel, TLCC, and TLCC2
  "Unprecedented Scalability and Performance" – Chama at Sandia National Laboratories
    Key Findings
    Messaging Micro-benchmarks
      Bandwidth and Latency
      MPI Message Rate
      Random Message Bandwidth
      Global Communications – MPI Allreduce
    Application Testing
      Cielo Acceptance Benchmarks
    Sandia's Conclusions
  "Changing the Way We Work" – Luna at Los Alamos National Laboratory
    Key Findings
    Application Testing
    Communications Micro-benchmarks
      Node-to-Node Bandwidth and Adapter Contention
      Global Communications – MPI_Allreduce Results
    Los Alamos' Conclusions
  Supreme Scalability – Zin at Lawrence Livermore National Laboratory
  Summary and Conclusions

EXECUTIVE SUMMARY
Three United States national laboratories, known for their work in supercomputing, recently benchmarked new systems delivered to each of them. These systems are built on the Intel® Xeon® processor E5-2600 family and Intel® True Scale Fabric, based on InfiniBand* and the open source Performance Scale Messaging (PSM) interface. The scientists performing the benchmarks concluded in individual reports that Intel True Scale Fabric:
• Contributed to "unprecedented scalability and performance" in their systems and is allowing them to change how they work.
• Outperforms, on some tests, one of the most powerful customized supercomputers in the world, ranked number 18 on the November 2012 Top500 list.1
• Delivers a level of performance they had not previously seen from a commodity interconnect.
The new systems, named Chama (at Sandia National Laboratories), Luna (at Los Alamos National Laboratory), and Zin (at Lawrence Livermore National Laboratory), are part of the Tri-Labs Linux* Capacity Clusters 2 (TLCC2) in the Advanced Simulation and Computing (ASC) program under the National Nuclear Security Administration (NNSA). This paper summarizes the findings of the reports from these three laboratories.
When Fast is Not Fast Enough
InfiniBand Architecture has proven
itself over the years as the interconnect
technology of choice for high-performance
computing (HPC). As a commodity interconnect, it continues to advance its performance ahead of other industry-standard networks, outperforming them by a significant factor. But, when
it comes to the demands of HPC and
MPI message passing, fast is never fast
enough. While MPI, using InfiniBand Verbs,
delivers fast communications, there is a
costly overhead with Verbs and traditional
offload processing on InfiniBand Host
Channel Adapters (HCAs) that hinders
scalability with larger core counts.
Intel True Scale Fabric, with its open
source Performance Scale Messaging (PSM)
interface and onload traffic processing,
was designed from the ground up to accelerate MPI messaging specifically for HPC.
Intel True Scale Fabric delivers very high
message rates, low MPI latency and high
effective application bandwidth, enabling
MPI applications to scale to thousands of
nodes. This performance drove the choice
of interconnect for the most recent acquisitions in the Advanced Simulation and
Computing (ASC) Program’s Tri-Labs Linux
Capacity Clusters 2 (TLCC2): Chama (at
Sandia National Laboratories), Luna (at Los
Alamos National Laboratory), and Zin (at
Lawrence Livermore National Laboratory).
Intel, a Force in HPC
Intel has a long history in high-performance computing (HPC) systems and
the national laboratories that use them. Intel built the first massively parallel
processing (MPP) machine to reach one teraFLOP, and delivered it in 1996 to the
Advanced Simulation and Computing (ASC) Program (formerly ASCI) as Option
Red. Intel continues to be a driving force in supercomputing with Intel® processors in more systems on the Top500 list1 of the world’s fastest supercomputers
than any other manufacturer. But it takes more than just a fast processor to live
among the fastest 500 systems.
Key Components to a Fast System
The fastest systems use more than just Intel processors. Intel provides the
components and software tools to help achieve the highest performing codes
on some of the nation’s most critical computing jobs.
• Intel® Xeon® processors – 377 (75 percent) of the Top500
supercomputers use Intel® Architecture processors.
• Intel® True Scale Fabric – designed specifically for HPC to minimize communications overhead and enable efficient systems, Intel True Scale Fabric enables
the fastest clusters based on InfiniBand* Architecture.
•Intel® Xeon Phi™ coprocessors – built on many-core architecture, Intel Xeon
Phi coprocessors offer unparalleled acceleration for certain codes.
•Intel® Software Tools – a host of tools support cluster builders and application
programmers to make their codes fast and efficient.
• Intel® Storage Systems – HPC demands the fastest components,
and Intel storage components deliver both speed and reliability.
Intel, TLCC, and TLCC2
The ASC Program under the National
Nuclear Security Administration (NNSA)
provides leading-edge, high-end simulation
capabilities to support the Administration’s
mission. Some of the fastest supercomputers in the world are managed under the
ASC at the three NNSA laboratories:
Los Alamos National Laboratory, Sandia
National Laboratories, and Lawrence
Livermore National Laboratory. These
machines include “capacity” and “capability”
HPC systems designed for a range of
computing jobs and users.
Capacity and capability machines are
generally distinguished by their differences
in size and users. While both categories
have grown in computing abilities over
the years, capability systems are typically
dedicated to a smaller group of users and
are much larger, comprising core counts as much as an order of magnitude higher than
capacity machines (hundreds of thousands
compared to tens of thousands of cores).
The Tri-Lab Linux Capacity Clusters (TLCC)
contribute to capacity computing at the
three NNSA laboratories. TLCC is designed
for scalability to adapt the resources to
each job’s computing requirements, while
running multiple jobs simultaneously. Thus,
the systems consist of a number of Scalable Units (SU), each SU comprising 162
compute, user and management nodes,
2,592 cores and delivering about 50 teraFLOPS/SU. One TLCC procurement included
the supercomputer Sierra, built with Intel
True Scale Fabric components, housed at
Lawrence Livermore National Laboratory.
The second procurement of scalable Linux clusters, TLCC2, consists of three large Linux clusters,
one each housed at an NNSA laboratory:
•Chama – 8 SUs, with 1,296 nodes,
located at Sandia National Laboratories
in Albuquerque, New Mexico
•Luna – 10 SUs, with 1,620 nodes, located
at Los Alamos National Laboratory in Los
Alamos, New Mexico
•Zin – 18 SUs with 2,916 nodes, located at
Lawrence Livermore National Laboratory
in Livermore, California
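These node counts follow directly from the 162-node Scalable Unit building block described above:
8 SUs x 162 nodes/SU = 1,296 nodes (Chama)
10 SUs x 162 nodes/SU = 1,620 nodes (Luna)
18 SUs x 162 nodes/SU = 2,916 nodes (Zin)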
All three machines are built around
Intel® technologies, including Intel® Xeon®
processors and Intel True Scale Fabric
HCAs and switches. At all three laboratories, users and laboratory scientists have
reported significant performance and
scalability improvements over other
machines, triggering scientists to take a
new look at how their work gets done.
“Unprecedented Scalability
and Performance” – Chama at
Sandia National Laboratories
Sandia National Laboratories, headquartered in Albuquerque, New Mexico, has,
over the last six decades, “delivered essential science and technology to resolve
the nation’s most challenging security
issues.”2 Sandia has a long history of high-performance computing. It is the home of
the nation’s first teraFLOP supercomputer,
ASCI Option Red, built by Intel in 1996. As
one of the laboratories providing capacity
computing to the NNSA ASC program, it
received its latest TLCC2 capacity machine,
Chama, in 2012.
Configuration | Chama | Red Sky | Cielo
Total Computing Nodes | 1,232 | 2,816 | 8,894
Processor Architecture | Intel® Architecture formerly codenamed Sandy Bridge | Intel® Architecture formerly codenamed Nehalem | AMD Magny-Cours*
Compute Complex: L1 Cache (KB) | 8 x 32 | 4 x 32 | 8 x 64
Compute Complex: L2 Cache (KB) | 8 x 256 | 4 x 256 | 8 x 512
Compute Complex: L3 Cache (MB) | 20 | 8 | 10
Cores/Node | 16 | 8 | 16
Total Cores | 19,712 | 22,528 | 142,304
Clock Speed (GHz) | 2.60 | 2.93 | 2.40
Instruction Set Architecture (ISA) | Intel® AVX | SSE4.2 | SSE4a
Memory | DDR3 1600 MHz | DDR3 1333 MHz | DDR3 1333 MHz
Memory/Core (GB) | 2 | 1.5 | 2
Channels/Socket | 4 | 3 | 4
Peak Node GFLOPS | 332.8 | 94.76 | 153.6
Interconnect: Manufacturer | Intel (QLogic) | Mellanox* | Gemini*
Interconnect: Technology/Rate | InfiniBand* QDR | InfiniBand QDR | Custom
Interconnect: IB HW Interface | PSM | Verbs | Custom
Interconnect: Topology | Fat Tree | 3D Torus: 6 x 6 x 8 | 3D Torus: 18 x 12 x 24
Table 1. Sandia National Laboratories Test Systems.
With the acquisition of Chama, users began
reporting 2x to 5x performance improvement on their jobs. Sandia scientists
wanted to “understand the characteristics
of this new resource.” So, they performed
micro-benchmarks and application program
testing on Chama and two other systems
at Sandia: Red Sky, another capacity
computing machine and predecessor to
Chama in the TLCC, and Cielo, a capability supercomputer. Their findings are
captured in their report.3
Table 1 lists the system configurations
for Chama, Red Sky, and Cielo.
Figure 1. Sandia Inter-node MPI Performance. (A) Bandwidth: MBytes/second versus message size in bytes (higher is better) for Cielo (X&Z), Cielo (Y), Red Sky, and Chama. (B) Latency: microseconds versus message size (lower is better) for the same systems. (C) Message Rate: Chama/Cielo comparison versus message size for 1, 2, 4, 8, and 16 tasks per node (higher is better). (D) Random Messaging Bandwidth: bytes/second/MPI task versus MPI ranks for Cielo, Chama, and Red Sky (higher is better).
Key Findings
Sandia scientists tested the systems
across a range of characteristics beyond
those impacted by interconnect, including
memory performance and contention,
processor performance, and more. Chama
proved to be a well-balanced system
with impressive performance results that
outperformed Red Sky and compared well
against Cielo. However, this paper focuses
on the results of interconnect benchmarks
and application testing to understand how the interconnect contributes to the overall
HPC performance. Thus, the tests
revealed the following about the Intel
True Scale Fabric interconnect:
•Chama returned unprecedented results
in MPI messaging rate at message sizes
up to 1 KB, outperforming even Cielo’s
custom interconnect.
•Chama delivered random messaging
bandwidth the scientists had not yet
seen from a commodity interconnect,
exceeding Cielo by as much as 30 percent.
•Collectives performance scaling for
Chama compares well against the custom
interconnect of Cielo, both outperforming Red Sky by an order of magnitude.
•Chama scaled well against Cielo on three
Sandia Finite Element production applications, which revealed severe scaling
limitations on Red Sky.
The key findings from these micro-benchmarks and application tests indicate
that Chama, with its Intel True Scale Fabric,
“has a strong impact on applications” as
attested by Chama users.
Messaging Micro-benchmarks
While traditional metrics include
inter-node latency and bandwidth, Sandia
scientists were keenly interested in
Chama’s MPI messaging rate and scalable
random message bandwidth performance.
Figure 1 shows the benchmark results
for these tests.
Bandwidth and Latency
Sandia codes are more sensitive to bandwidth than latency; this effect drove the
choice for Chama’s Intel True Scale Fabric
interconnect. As shown in Figures 1a and
1b, Chama performed well compared to Cielo’s custom Gemini* interconnect, according
to Sandia scientists. We note that with
sizes well within the typical HPC message
size space, Red Sky’s bandwidth climbed
much more slowly, remaining about half of
Chama’s, and latency began to dramatically
increase at just 64-byte messages.
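For readers who want a concrete picture of how such inter-node numbers are typically gathered, the following is a minimal MPI ping-pong sketch (illustrative only; it is not the benchmark code Sandia used). The message-size range and iteration count are assumptions.

/* ping_pong.c - minimal MPI ping-pong latency/bandwidth sketch (illustrative only;
 * not the Sandia benchmark). Run with exactly two ranks placed on different nodes:
 *   mpicc ping_pong.c -o ping_pong
 *   mpirun -np 2 --map-by node ./ping_pong
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;                                /* assumed iteration count */
    for (long size = 1; size <= (1L << 22); size *= 2) {   /* 1 byte to 4 MB */
        char *buf = malloc((size_t)size);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {            /* round trip: send, then wait for the echo */
                MPI_Send(buf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double one_way = (MPI_Wtime() - t0) / iters / 2.0;   /* seconds, one direction */
        if (rank == 0)
            printf("%8ld bytes  latency %8.2f us  bandwidth %10.1f MB/s\n",
                   size, one_way * 1e6, (double)size / one_way / 1e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}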
Figure 2. IMB MPI_Allreduce Performance. Average time in microseconds (lower is better) versus MPI ranks (1 to 16,384) for (A) 8-byte, (B) 64-byte, and (C) 1,024-byte transfers on Cielo, Chama, and Red Sky.
MPI Message Rate
Of particular interest to the testers at Sandia was the ability of the interconnect to process messages as core counts
increased. HCA congestion on multi-core
nodes is “becoming a significant constraint”
in HPC with commodity interconnects, even
those based on InfiniBand Architecture.
“Therefore, the most important internode
behavior for Chama is the significant gain in
MPI message rate in comparison to Cielo.”
For message sizes up to 1 KB, the Intel
True Scale Fabric outperformed the custom interconnect of Cielo by 2x to 4x. For
Sandia, this was an unprecedented event,
which “…can have a significant positive
impact on many applications, such as
those that employ a sparse solver kernel.”
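As an illustration of what a message-rate test exercises (this is not Sandia's benchmark), the sketch below has even-numbered ranks stream batches of small non-blocking sends to a partner rank while odd-numbered ranks receive; the window size, message size, and iteration count are assumptions.

/* msg_rate.c - rough MPI small-message rate sketch (illustrative only; not the
 * Sandia benchmark). Run with an even number of ranks; pair sender and receiver
 * ranks across nodes to stress the adapter.
 */
#include <mpi.h>
#include <stdio.h>

#define WINDOW   64      /* assumed outstanding messages per batch */
#define BATCHES  10000   /* assumed number of batches */
#define MSG_SIZE 8       /* assumed message size in bytes */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[WINDOW][MSG_SIZE];
    MPI_Request req[WINDOW];
    int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;   /* even ranks send, odd ranks receive */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int b = 0; b < BATCHES; b++) {
        for (int w = 0; w < WINDOW; w++) {
            if (rank % 2 == 0)
                MPI_Isend(buf[w], MSG_SIZE, MPI_CHAR, partner, 0, MPI_COMM_WORLD, &req[w]);
            else
                MPI_Irecv(buf[w], MSG_SIZE, MPI_CHAR, partner, 0, MPI_COMM_WORLD, &req[w]);
        }
        MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);   /* complete the batch */
    }
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0)   /* per-sender rate; the aggregate grows with the number of sending ranks */
        printf("approx. %.0f messages/s per sending rank\n",
               (double)BATCHES * WINDOW / elapsed);

    MPI_Finalize();
    return 0;
}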
Random Message Bandwidth
Not all inter-node communications are
structured. Indeed, many applications,
such as Charon, induce unstructured
communications across the fabric.
Understanding node behavior using a
measure of random message traffic can
more readily predict system performance
with such codes. Sandia uses a random
messaging benchmark for understanding
scalability in commodity clusters. The test
“sends thousands of small messages from
all MPI tasks with varying message sizes
(100 bytes to 1 KB) to random MPI rank
destinations.” An aggregate average random messaging bandwidth (Figure 1d) was
derived from per process measurements.
The measurements showed the following
results, which the scientists had never
seen with commodity interconnects benchmarked against a custom architecture:
•Red Sky, compared to Chama, performed
from 10x slower (32 cores) to 220x
slower (8,192 cores)
•Chama was 20 to 30 percent faster than
Cielo, the capability supercomputer
Chama’s Intel True Scale Fabric scales extremely well with applications that create
random traffic on large systems.
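As a rough illustration of the kind of traffic this test generates (not Sandia's benchmark code), the sketch below has every rank fire small messages at pseudo-random destinations derived from a shared function, so each rank can pre-compute how many messages it will receive; the message count and size are assumptions.

/* random_msg.c - random-destination small-message sketch (illustrative only; not the
 * Sandia benchmark). All ranks derive the same pseudo-random traffic pattern, so each
 * rank knows in advance how many messages it will receive.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NMSGS    2000    /* assumed messages sent per rank */
#define MSG_SIZE 1024    /* assumed message size; the report cites 100 bytes to 1 KB */

/* deterministic pseudo-random destination for message m of sender s */
static int dest_of(int s, int m, int nranks)
{
    unsigned int x = (unsigned int)s * 2654435761u + (unsigned int)m * 40503u + 12345u;
    x ^= x >> 13; x *= 2654435761u; x ^= x >> 16;
    return (int)(x % (unsigned int)nranks);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int incoming = 0;                           /* messages this rank will receive */
    for (int s = 0; s < nranks; s++)
        for (int m = 0; m < NMSGS; m++)
            if (dest_of(s, m, nranks) == rank) incoming++;

    char *sendbuf = malloc((size_t)NMSGS * MSG_SIZE);
    char *recvbuf = malloc((size_t)(incoming ? incoming : 1) * MSG_SIZE);
    MPI_Request *rreq = malloc(sizeof(MPI_Request) * (incoming ? incoming : 1));
    MPI_Request *sreq = malloc(sizeof(MPI_Request) * NMSGS);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < incoming; i++)          /* pre-post all expected receives */
        MPI_Irecv(recvbuf + (size_t)i * MSG_SIZE, MSG_SIZE, MPI_CHAR,
                  MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &rreq[i]);
    for (int m = 0; m < NMSGS; m++)             /* fire sends at random destinations */
        MPI_Isend(sendbuf + (size_t)m * MSG_SIZE, MSG_SIZE, MPI_CHAR,
                  dest_of(rank, m, nranks), 0, MPI_COMM_WORLD, &sreq[m]);
    MPI_Waitall(NMSGS, sreq, MPI_STATUSES_IGNORE);
    MPI_Waitall(incoming, rreq, MPI_STATUSES_IGNORE);
    double elapsed = MPI_Wtime() - t0;

    /* bytes sent per second by this task; an aggregate average would be formed across tasks */
    printf("rank %d: ~%.3e bytes/s\n", rank, (double)NMSGS * MSG_SIZE / elapsed);

    free(sendbuf); free(recvbuf); free(rreq); free(sreq);
    MPI_Finalize();
    return 0;
}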
Global Communications – MPI Allreduce
To understand the behavior of Chama with
applications that are sensitive to collective
operations, Sandia averaged scalability performance data from a thousand trials using
8, 64, and 1024 byte transfers. As shown
in Figure 2, Chama performs competitively
to Cielo across all ranks. Both perform an
order of magnitude better than Red Sky in
some cases, with Red Sky’s performance
falling off above 1 KB messages.
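Figure 2 comes from the IMB MPI_Allreduce benchmark; purely as an illustration of how such an average is formed, a stripped-down sketch might look like the following (the trial count mirrors the thousand trials mentioned above; everything else is an assumption).

/* allreduce_avg.c - average MPI_Allreduce time sketch (illustrative only; Figure 2
 * itself was produced with the IMB benchmark). Times back-to-back allreduces for the
 * three transfer sizes used above and reports the mean time per operation.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int trials = 1000;                     /* the "thousand trials" cited above */
    const int sizes_bytes[3] = {8, 64, 1024};    /* transfer sizes from Figure 2 */
    double sendbuf[128], recvbuf[128];           /* at most 1,024 bytes of doubles */
    for (int i = 0; i < 128; i++) sendbuf[i] = 1.0;

    for (int s = 0; s < 3; s++) {
        int count = sizes_bytes[s] / (int)sizeof(double);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < trials; i++)
            MPI_Allreduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double avg_us = (MPI_Wtime() - t0) / trials * 1e6;
        if (rank == 0)
            printf("%5d bytes: %.2f us per MPI_Allreduce (average of %d trials)\n",
                   sizes_bytes[s], avg_us, trials);
    }
    MPI_Finalize();
    return 0;
}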
Application Testing
With respect to Red Sky, the benchmarks above echo the findings of previous Sandia studies of commodity clusters like Red Sky: poor scalability with applications that use implicit solvers, and poor parallel efficiency with larger amounts of unstructured message traffic. Chama did not exhibit these characteristics in the micro-benchmarks.
Those results and other findings from previous commodity clusters had provided a "strong case" for Sandia to invest in more custom MPP machines. However, users
of Chama have reported performance
improvements with their application codes
of 2x to 5x. To further understand these
experiences, scientists proceeded with
application testing.
Figure 3 graphs the results for the Finite Element Method tests Aleph, Aria, and Charon; Figure 4 shows the performance for AMG2006. Again, Red Sky exhibits severe scaling limitations, while Chama outperforms Cielo on all tests.
Table 2 lists four of the five applications used, along with their results, to help reveal how Chama compared to Red Sky and Cielo at scale. These results are consistent with users' experiences.

Application | Science Domain | Key Algorithm | Timing Metric | Chama : Red Sky | Chama : Cielo
Aleph | Plasma simulation | Finite Element Method (FEM) particle move + field solves | Weak scaling, fixed number of steps | 4.2x | 1.3x
AMG2006 | Algebraic multigrid | Laplace solver, preconditioned Conjugate Gradient | Weak scaling, 100 iterations | 1.5x | 1.75x
Aria | CFD, Thermodynamics | Implicit FEM | Strong scaling, 25 time steps | 3.4x | 2.6x
Charon | Semiconductor device simulation | Implicit FEM | Weak scaling, fixed number of iterations | 2.5x | 1.6x
Table 2. Sandia Application Scaling Tests. The Chama : Red Sky and Chama : Cielo columns show the maximum performance improvement at scale.

Figure 3. Charon, Aleph, and Aria Application Scaling. Three panels, lower is better: (A) Charon, ML/Aztec time per BiCGStab iteration (seconds) versus MPI ranks; (B) Aleph, time (seconds) versus MPI ranks; (C) Sierra/Aria, time (seconds) versus MPI ranks; each panel compares Cielo, Chama, and Red Sky.
Figure 4. AMG2006 Scaling Comparisons. PCG solve time (seconds, lower is better) versus MPI ranks for Cielo, Chama, and Red Sky.
Cielo Acceptance Benchmarks
A number of other applications were
benchmarked on Chama, not covered
by the current Sandia report. However,
results of four of the six Tri-Lab Cielo
acceptance benchmarks were included.
They are shown in Figure 5. While “not as
spectacular” as the earlier tests, Sandia
scientists considered these results good.
Sandia’s Conclusions
Sandia scientists stated the results for
Chama’s Intel True Scale Fabric performance to be “unprecedented” and “never
before seen” for a commodity interconnect.
With its onload processing and its PSM
interface, Chama’s Intel True Scale Fabric
outperformed Red Sky’s verbs-based InfiniBand communications and was competitive
with the capability supercomputer Cielo.
MPI profiles revealed that Chama’s faster
MPI processing of the Intel True Scale
Fabric contributed to its scalability and to
the 2x to 5x performance improvement
experienced by Chama’s users.
Figure 5. Cielo Acceptance Test Performance Summary. Performance gain factors of Chama over Red Sky (left) and Chama over Cielo (right) for HPCCG, CTH, UMT, SAGE, and AMG2006 at 16, 128, and 1,024 MPI tasks.
“The performance gains that Sandia users are experiencing
with their applications on…Chama, has resulted in many positive
feedbacks from happy users. …we are seeing unprecedented
performance and scalability of many key Sandia applications.”
“Changing the Way We
Work” – Luna at Los Alamos
National Laboratory
Los Alamos National Laboratory has a nearly 70-year history of discovery and
innovation in science and technology. Its
mission is to “develop and apply science
and technology to ensure the safety,
security, and reliability of the U.S. nuclear
deterrent; to reduce global threats; and
solve other emerging national security
and energy challenges.”4
In 2012, it acquired Luna as part of
TLCC2, and “reports from users have
been extremely positive.” In particular,
two directed stockpile work (DSW) problems completed by users Mercer-Smith
and Scott ran 3.9x and 4.7x faster on Luna
than other systems. Scientists at Los Alamos were asked to understand why Luna
performed so much better. Their research
is captured in benchmarks and application
testing between Luna and Typhoon.5
Table 3 lists the configurations of the
two systems used in the evaluation.
Key Findings
Los Alamos scientists performed application tests to compare performance and
scalability plus micro-benchmarks to help
understand what makes the systems perform differently. As at Sandia, the tests
were comprehensive across a variety
of characteristics; however, this paper
focuses on the results of interconnect
micro-benchmarks and application testing.
We note that the authors discovered
Typhoon exhibited atypical InfiniBand bandwidth performance during the single-node communication micro-benchmark.
This led to a later evaluation of Typhoon’s
InfiniBand performance and an ensuing report.6 The findings revealed that
a configuration problem caused a lower
than expected InfiniBand performance
on Typhoon. When the configuration was corrected and the xRAGE application used in the current tests was rerun, Typhoon's xRAGE results improved by about 21 percent. Whether or not this
handicap carried across to all Typhoon
tests is unclear. Thus, in this paper, where
appropriate, we awarded Typhoon a 21 percent benefit and present the resultant values in parentheses next to the original report's results. Nonetheless, Luna generally outperformed Typhoon on every test and micro-benchmark Los Alamos performed, with some variability.

Category | Parameter | Typhoon | Luna
CPU core | Make | AMD Magny-Cours* | Intel® Sandy Bridge
CPU core | Model | Opteron* 6128 | Intel® Xeon E5-2670
CPU core | Clock speed | 2.0 GHz | 2.6 GHz
CPU core | L1 data cache size | 64 KB | 32 KB
CPU core | L2 cache size | 0.5 MB | 0.25 MB
CPU socket | Cores | 8 | 8
CPU socket | Shared L3 cache size | 12 MB | 20 MB
CPU socket | Memory controllers | 4 x DDR3-1333 | 4 x DDR3-1600
Node | Sockets | 4 | 2
Node | Memory capacity | 64 GB | 32 GB
Network | Make | Mellanox* | Intel® True Scale Fabric
Network | Type | QDR InfiniBand (Verbs) | QDR InfiniBand (PSM)
System | Integrator | Appro | Appro
System | Compute nodes | 416 | 1,540
System | I/O nodes | 12 | 60
System | Installation date | March 2011 | April 2012
Table 3. Los Alamos National Laboratory Test Systems.
The Los Alamos tests revealed
the following:
•Across several comparisons, Luna rates
from 1.2x to 4.7x faster than Typhoon.
•Luna’s interconnect supports nearly full
InfiniBand QDR bandwidth with little to
no contention scaling to 16 cores, while
Typhoon starts out fast and degrades
steadily to 32 cores without achieving
nearly full InfiniBand speeds.
•At 16 cores, Luna’s Intel True Scale Fabric
is 2.10x (1.74x) faster than Typhoon; at
32 cores, the difference rises to 2.19x
(1.81x) faster.
•Collectives performance showed Luna
with an average of 1.95x (1.61x) improvement over Typhoon, but with variability.
The key findings from these micro-benchmarks and application tests indicate
that Luna, with its Intel True Scale Fabric,
delivers a wide range of performance
improvements over Typhoon.
“Luna is the best machine
that the laboratory
has ever had.”
Application Testing
Los Alamos scientists performed four
application tests with variations on the
number of cores and nodes for different
tests. They tried to thoroughly understand what drives Luna’s significant
improvements and attempted to repeat
the improvements Mercer-Smith and
Scott experienced. The tests and source
of other metrics are briefly described in
Table 4 along with the results.
The extent of their comprehensive testing is beyond the scope of this paper; therefore, only the results are summarized
below, shown in Figure 6.
Using theoretical calculations, actual measurements and the experiences reported
by users, Luna averages about 2.5x faster
than Typhoon.
Communications Micro-benchmarks
As with Sandia, Los Alamos scientists ran
several micro-benchmarks to isolate some
of the causes of Luna’s performance edge
over Typhoon. Los Alamos tests also isolated several improvements at the node
and processor architectural levels. But,
again, this paper focuses on the results
that the interconnect contributed to the
overall performance.
Application/Source | Luna : Typhoon | Description
Theoretical peak memory bandwidth | 1.2x | The simple ratio of Luna's memory bandwidth to Typhoon's
xRAGE | 1.56x (1.29x) | A collectives-heavy code
EAP test suite | 1.69x (1.40x) | A collection of 332 regression tests from the Eulerian Applications Project (EAP) run nightly on Luna and Typhoon
Mizzen problem | 2.07x (1.71x) | An integrated code representative of the types of codes normally run on Luna and Typhoon
Theoretical compute rate | 2.6x | Calculated maximum theoretical FLOPs
High-Performance Linpack* benchmark | 2.72x | According to the June 2012 Top500 list
Partisn, sn timing | 2.75x (2.28x) | A more communications-active code compared to xRAGE, with many small message exchanges
ASC1 code (Mercer-Smith & Scott) | 3.9x | DSW problem; not part of the current testing
ASC2 code (Mercer-Smith & Scott) | 4.7x | DSW problem; not part of the current testing
Table 4. Application Test Descriptions.
Figure 6. Luna:Typhoon Applications Performance Summary. Luna:Typhoon performance ratios at 128 MPI ranks: theoretical peak memory bandwidth 1.2; xRAGE, asteroid problem (Pakin & Lang) 1.56; EAP test suite, geometric mean (EAP team) 1.69; integrated code, Mizzen problem (Brown) 2.07; theoretical peak compute rate 2.6; High-Performance Linpack benchmark 2.72; Partisn, sn timing (Pakin & Lang) 2.76; ASC1 code/problem (Mercer-Smith & Scott) 3.9; ASC2 code/problem (Mercer-Smith & Scott) 4.7.
Figure 7. Network Bandwidth as a Function of Contention for the NIC: aggregate communication bandwidth (B/µs) versus the number of communicating pairs of processes, for Typhoon, Luna, and the theoretical peak.
Figure 8. Ratio of Luna's MPI_Allreduce Latency to Typhoon's for 128 MPI Ranks, plotted against message size.
Node-to-Node Bandwidth and Adapter Contention
This micro-benchmark exchanges a large volume of data between two nodes, starting with a single core on each node and scaling to all 16 cores on a node (for Luna) or 32 cores (for Typhoon). The test records the bandwidth consumed for each exchange. Figure 7 charts the results.
For Luna, the first exchanges do not saturate the network, but within four cores, full speed is achieved at 3,151 B/μs and held across all 16 cores, with little measurable degradation from contention. This was also seen at Sandia, where the messaging rate scaled well across many MPI ranks. Typhoon's network, however, while starting out faster than Luna at 1,879 B/μs, degraded steadily to 1,433 B/μs as core count increased, indicative of contention as the adapter tries to pass traffic from more cores.
The scientists determined, "while Luna's per-core (aggregate divided by number of cores) communication bandwidth is 2.10x [(1.74x)]7 that of Typhoon's at 16 cores/node, this ratio increases to 2.19x [(1.81x)] when comparing a full Luna node to a full Typhoon node."

Global Communications – MPI_Allreduce Results
For collectives performance, the Los Alamos authors created a micro-benchmark that reports the average time per MPI_Allreduce operation for various message sizes across 128 MPI ranks. Figure 8 graphs Luna's performance relative to Typhoon's. The authors note, "…the geometric mean of the measurements indicate that Typhoon takes an average (horizontal line) of 1.95x (1.61x) as long as Luna to perform an MPI_Allreduce…" However, they also call attention to the variability of the results. They consider it, like other results in their study, evidence that "there is a large set of corner cases where Luna can be many times faster than Typhoon—and some applications may in fact hit these cases—but more modest speedups are the more common case."
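To make the node-to-node adapter-contention measurement above concrete, here is a minimal MPI sketch (illustrative only, not the Los Alamos code) in which each of N processes on one node streams data to a partner process on a second node while the aggregate bandwidth is recorded; the rank placement, message size, and iteration count are assumptions.

/* contention_bw.c - aggregate node-to-node bandwidth under adapter contention
 * (illustrative sketch only; not the Los Alamos benchmark). Launch with an even
 * number of ranks, the first half on node A and the second half on node B (for
 * example with a block mapping). Each rank i in the first half streams data to
 * rank i + npairs; rank 0 reports the aggregate bandwidth.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE (1 << 20)   /* assumed 1 MB messages */
#define ITERS    200         /* assumed repetitions per pair */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    int npairs = nranks / 2;

    char *buf = malloc(MSG_SIZE);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    if (rank < npairs) {                          /* senders on node A */
        for (int i = 0; i < ITERS; i++)
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, rank + npairs, 0, MPI_COMM_WORLD);
    } else if (rank < 2 * npairs) {               /* receivers on node B */
        for (int i = 0; i < ITERS; i++)
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, rank - npairs, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    MPI_Barrier(MPI_COMM_WORLD);                  /* wait until every pair is done */
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0) {
        double total_bytes = (double)npairs * ITERS * MSG_SIZE;
        /* bytes per microsecond (B/us) is numerically the same as MB/s */
        printf("%d pairs: aggregate %.0f B/us\n", npairs, total_bytes / elapsed / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}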
Los Alamos' Conclusions
Luna outperforms Typhoon from 1.2x to
4.7x, as indicated by both theoretical and
actual results. The authors conclude that
“…almost all key components of Luna—
CPUs, memory and network—are faster
than their Typhoon counterparts, but by
widely varying amounts (and in nonlinear
patterns) based on how these hardware
resources are utilized.”
Indeed, one set of users considers Luna the best machine the Laboratory owns. Other user experiences are quite
positive, to the point that it is having an
impact on some work going forward.
“Luna tends to be about twice as fast as
Typhoon across the various micro-benchmarks,
but there are many outliers.”
Supreme Scalability –
Zin at Lawrence Livermore
National Laboratory
Beginning operations in 1952, Lawrence Livermore National Laboratory has grown into a diverse complex of science, research, and technology, part of which supports the ASC Program and missions of the NNSA. The Terascale Simulation Facility (TSF) at Lawrence Livermore National Laboratory houses TLCC2 clusters and includes the world's second fastest supercomputer, Sequoia, according to the Top500 list.1
In 2011, the Laboratory acquired Zin, the latest addition to its TLCC2. The Zin cluster comprises 2,916 nodes, 46,208 cores, and an Intel True Scale Fabric network. Soon after it was delivered in 2011, it was awarded number 15 on the Top500 list of the fastest supercomputers in the world. A year later, it is still among the top 30 fastest systems.
In 2012, Lawrence Livermore scientists ran scalability benchmarks across Zin and several other systems in the Tri-Labs complex, including other TLCC units and capability machines, such as Cielo at Sandia National Laboratories. The results were presented at SC12 in November. Figure 9 graphs the results.
Of the six systems in the comparison,
Cielo, Purple, and Dawn are capability MPP
machines, while Sierra, Muir, and Zin are
capacity clusters—all three using Intel
True Scale Fabric networks.
In this graph, the lower and flatter the
scalability line, the better. A slope of 0
indicates ideal scalability. The three most
scalable systems (Sierra, Muir, and Zin)
were interconnected with Intel True Scale
Fabric components. Zin outperforms the
other two systems built on custom interconnects. We note that Cielo is the capability supercomputer at Sandia against
which Chama competed so well.
Weak Scaling – 3D Radiation Problem's Average Zone-Iteration Grind Time Per Machine
(Grind time in microseconds per zone-iteration versus number of processors (CPUs); lower and flatter is better, and a slope of 0 is ideal scaling. Systems plotted: Cielo (PGI, full, 16 MPI/node), Purple at retirement (NewComm), Dawn (BG/P, 2.2), and the Intel® True Scale Fabric systems Muir (full QDR), Sierra (full QDR), and Zin (full, 16 MPI/node). Measured slopes: Purple 0.000079, Dawn (BG/P) 0.000016, Zin 0.000012, Cielo 0.000010, Sierra 0.000008, Muir 0.000005.)
Source: Lawrence Livermore National Laboratory. Intel does not control or audit the design or implementation of third-party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase. See the Top500 list for configuration information at http://top500.org.
Figure 9. Scaling Results of Zin and Other Tri-Labs Machines.
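To put these slopes in rough perspective (assuming the slope is expressed in microseconds per zone-iteration per processor, consistent with the figure's axes), Zin's slope of 0.000012 corresponds to about 0.000012 x 10,000 = 0.12 microseconds of added grind time across the 10,000-processor range shown, compared with roughly 0.79 microseconds for Purple.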
Summary and Conclusions
Across three NNSA national laboratories,
TLCC and TLCC2 capacity computing
systems powered by Intel True Scale
Fabric networks and the Intel Xeon processor E5-2600 family outperform other machines,
including MPP capability supercomputers. At Sandia National Laboratories,
Chama delivers “unprecedented scalability
and performance.” Users of Luna at Los
Alamos National Laboratory claim it is the
“best machine the Laboratory has ever
had,” and it is “changing the way we work.”
Zin at Lawrence Livermore National Laboratory, along with two other TLCC clusters built with Intel True Scale Fabric, dominates the scalability testing, ranking among the most scalable systems in the benchmark.
These tests reveal how Intel True Scale Fabric, with PSM and onload processing, outperforms other interconnects used in HPC and drives some of the fastest supercomputers in the world.
Disclaimers
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE,
TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH
PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL
PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT,
COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR
INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions
marked “reserved” or “undefined.” Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to
them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents
which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel’s Web site at www.intel.com.
1. www.top500.org/list/2012/11
2. www.sandia.gov/about/index.html
3. Rajan, M., D.W. Doerfler, P.T. Lin, S.D. Hammond, R.F. Barrett, and C.T. Vaughan. "Unprecedented Scalability and Performance of the New NNSA Tri-Lab Linux Capacity Cluster 2," Sandia National Laboratories.
4. www.lanl.gov/mission/index.php
5. Pakin, Scott and Michael Lang. "Performance Comparison of Luna and Typhoon," Los Alamos National Laboratory High-Performance Computing Division, November 19, 2012.
6. Coulter, Susan and Daryl W. Grunau. "Typhoon IB Performance," Los Alamos National Laboratory, March 8, 2013.
7. Bracketed values are added by Intel to offset the report results as described earlier.
Copyright © 2013 Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Printed in USA
0513/ML/HBD/PDF
Please Recycle
328985-001US