Proprietary or Commodity?
Interconnect Performance in Large-scale Supercomputers
Author: Olli-Pekka Lehto
Supervisor: Prof. Jorma Virtamo
Instructor: D.Sc. (Tech.) Jussi Heikonen
Test platforms
Testing methodology
▪ A modern High Performance Computing (HPC) system consists of:
Login nodes, service nodes, I/O nodes and compute nodes
• Usually multicore/SMP
Interconnect network(s) which link all nodes together
▪ Applications are run in parallel
The individual tasks of a parallel application exchange data via the interconnect
The interconnect plays a critical role in the overall performance of the parallel application
▪ Commodity system
Use of off-the-shelf components
Leverages economies of scale involved in the PC industry
“Industry standard” architectures which allow the system to be
extended in a vendor-independent fashion
▪ Proprietary system
A highly integrated system designed specifically for HPC
May contain some off-the-shelf components but cannot be extended in
a vendor-independent fashion
Background: The Cluster Revolution
In the last decade clusters have rapidly become the system architecture of choice in
HPC. The high end of the market is still dominated by proprietary MPP systems.
Background: The Architecture Evolution
The architectures of proprietary and commodity HPC systems have been converging.
Nowadays it's difficult to differentiate between the two.
Increasing R&D costs drive the move towards commodity components
Proprietary system 2002
Power, Custom
AIX, Tru64, Irix,
Operating system
Processor count
Processor type
Competition between AMD and Intel
Competition in specialized cluster interconnects
Proprietary system 2007 Commodity system 2007 Commodity system 2002
AMD Opteron, Intel Xeon, Intel Itanium, Power
Intel Pentium, AMD Athlon
Linux, AIX
Custom, InfiniBand
Gigabit Ethernet,
InfiniBand, Quadrics,
Myrinet (1st gen), Gigabit
The interconnect network is a key differentiating factor between commodity and
proprietary architectures.
Disclaimer: IBM Blue Gene is a notable exception
Problem Statement
Does a supercomputer architecture with a proprietary interconnection
network offer a significant performance advantage compared
to a similarly sized cluster with a commodity network?
Test Platforms
▪ In 2007 CSC - Scientific Computing Ltd. conducted a
10M€ procurement to acquire new HPC systems
▪ The procurement was split into two parts
Lot 1: Capability computing
• Massively parallel “Grand Challenge” computation
Lot 2: Capacity computing
• Sequential and small to medium size parallel problems
• Problems requiring large amounts of memory
Test Platform 1: Louhi
▪ Winner of Lot 1: “capability computing”
▪ Cray XT4
Phase 1 (2007): 2024 2.6GHz AMD Opteron cores (dual core): 10.5TFlop/s
Phase 2 (2008): 6736 2.6GHz AMD Opteron cores (quad core*): 70.1TFlop/s
▪ Unicos/lc operating system
Linux on the login and service nodes
Catamount microkernel on compute nodes
▪ Proprietary “SeaStar2” interconnection network
3-dimensional torus topology
Each node connected to 6 neighbors with 7.6 GByte/s links
Each node has a router integrated into the NIC
NIC connected directly to AMD HyperTransport bus
NIC has an onboard CPU for protocol processing (“protocol offloading”)
Remote Direct Memory Access (RDMA)
* 4 Flops/cycle
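The peak figures quoted for Louhi appear to follow from cores × clock rate × floating-point operations per cycle. A minimal Python sketch of that arithmetic follows. The value of 2 Flops/cycle for the dual-core Opterons is an assumption (the slide states only the 4 Flops/cycle of the quad-core parts), and `peak_tflops` is a hypothetical helper name.

```python
def peak_tflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in TFlop/s: cores x clock (GHz) x Flops per cycle / 1000."""
    return cores * clock_ghz * flops_per_cycle / 1000.0


# Louhi phase 1: dual-core Opterons, 2 Flops/cycle (assumed, not stated on the slide)
phase1 = peak_tflops(2024, 2.6, 2)
# Louhi phase 2: quad-core Opterons, 4 Flops/cycle (from the footnote)
phase2 = peak_tflops(6736, 2.6, 4)

print(f"Phase 1: {phase1:.1f} TFlop/s")  # Phase 1: 10.5 TFlop/s
print(f"Phase 2: {phase2:.1f} TFlop/s")  # Phase 2: 70.1 TFlop/s
```

These are theoretical peaks only; sustained application performance depends on the interconnect and memory system, which is the subject of the measurements that follow.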
Test Platform 1: Louhi
Source: Cray Inc.
Test Platform 2: Murska
▪ Winner of Lot 2: “capacity computing”
▪ HP CP4000BL XC blade cluster
2048 2.6GHz AMD Opteron cores (dual-core): 10.6 TFlop/s
Dual-socket HP BL465c server blades
▪ HPC Linux operating system
HP's turnkey cluster OS based on RHEL
▪ InfiniBand interconnect network
A multipurpose high-speed network
Fat tree topology (blocking)
24-port blade enclosure switches
16 16Gbit/s DDR *) downlinks
8 16Gbit/s DDR uplinks, running at 8Gbit/s
288-port master switch with SDR **) ports (8Gbit/s)
Recently upgraded to DDR
Host Channel Adapters (HCA) connected to 8x PCIe buses
Remote Direct Memory Access (RDMA)
*) Double Data Rate
**) Single Data Rate
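The "blocking" label on the fat tree can be made concrete by comparing the aggregate downlink and uplink bandwidth of one blade enclosure switch. The sketch below is a rough illustration using the link counts and rates listed above. The resulting ratios are derived here and not stated in the source, and the second case assumes that the uplinks run at the full DDR rate after the master switch upgrade. `oversubscription` is a hypothetical helper name.

```python
def oversubscription(down_links: int, down_gbit: float,
                     up_links: int, up_gbit: float) -> float:
    """Ratio of aggregate downlink bandwidth to aggregate uplink bandwidth."""
    return (down_links * down_gbit) / (up_links * up_gbit)


# 16 DDR downlinks at 16 Gbit/s, 8 uplinks limited to 8 Gbit/s by the SDR master switch
with_sdr_master = oversubscription(16, 16.0, 8, 8.0)
# Assumption: after the upgrade the 8 uplinks run at the full 16 Gbit/s
with_ddr_master = oversubscription(16, 16.0, 8, 16.0)

print(f"SDR master switch: {with_sdr_master:.0f}:1")  # SDR master switch: 4:1
print(f"DDR master switch: {with_ddr_master:.0f}:1")  # DDR master switch: 2:1
```

A ratio of 1:1 would correspond to a non-blocking fat tree; anything larger means traffic leaving the enclosure can contend for uplink capacity.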
Test Methodology
▪ Testing of individual parameters with microbenchmarks
End-to-end communication latency and bandwidth
Communication processing overhead
Consistency of performance across the system
▪ Testing of real-world behavior with a scientific application
Gromacs – A popular open source molecular dynamics application
▪ All measurements use the Message Passing Interface (MPI)
MPI is by far the most popular parallel programming application programming
interface (API) for HPC
Murska uses HP's HP-MPI implementation
Louhi uses a Cray-modified version of the MPICH2 implementation
End-to-end Latency
▪ Intel MPI Benchmarks (IMB) PingPong test was used
Measures the latency to send and receive a single point-to-point message as a
function of the message size
Arguably the most popular metric of interconnect performance (esp. short messages)
▪ Murska's HP-MPI has 2 modes of operation (RDMA and SRQ)
RDMA requires 256 kbytes of memory per MPI task while SRQ (Shared Receive
Queue) has a constant memory requirement
SRQ causes a notable latency overhead
Murska using RDMA has a slightly lower latency with short messages
As the message size grows, Louhi outperforms Murska
End-to-end Bandwidth
▪ IMB PingPong test with large message sizes
Bandwidth is the message size divided by the measured latency
Test was done between nearest-neighbor nodes
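The relation between the PingPong latency and the reported bandwidth can be sketched as follows. This assumes the usual ping-pong convention that the one-way time is half of the measured round trip, and it uses decimal megabytes (1 MB = 10^6 bytes), which may differ from the unit of the benchmark version used. The timing value in the example is made up for illustration and is not a measurement from Louhi or Murska.

```python
def one_way_time_us(round_trip_us: float) -> float:
    """Ping-pong convention: one-way latency is half of the round-trip time."""
    return round_trip_us / 2.0


def bandwidth_mb_per_s(message_bytes: int, one_way_us: float) -> float:
    """Bytes per microsecond equals decimal megabytes per second."""
    return message_bytes / one_way_us


# Hypothetical example: a 4 MiB message with an invented 4000 us round trip
size = 4 * 1024 * 1024
latency = one_way_time_us(4000.0)
print(f"{bandwidth_mb_per_s(size, latency):.1f} MB/s")
```

For short messages the fixed per-message latency dominates and the computed bandwidth is low; only large messages approach the link's peak rate.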
Louhi has still not reached peak bandwidth with the largest message size (4 megabytes)
The gap between RDMA and SRQ narrows as the link becomes saturated: SRQ doesn't affect performance with large message sizes
Communication Processing Overhead
▪ Measure of how much communication stresses the CPU
The C in HPC stands for Computing, not Communication ;)
▪ MPI has asynchronous communication routines which overlap
communication and computation
This requires autonomous communication mechanisms
• Murska has RDMA, Louhi has protocol offloading and RDMA
▪ The SMB benchmark from Sandia National Laboratories was used
Measures how much work one process gets done while another process communicates
with it constantly
Result is reported as an application availability percentage
• 100%: communication is performed completely in the background
• 0%: communication is performed completely in the foreground, no work done
Separate results for getting work done at the sender and at the receiver side
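One way to turn the description above into a number is sketched below. It takes three timings: the time for the work loop alone, the time for the work loop with communication in flight, and the time for the message transfer alone. Whether this matches the exact formula used by the SMB benchmark is an assumption; the function and parameter names are hypothetical.

```python
def availability_percent(work_t: float, iter_t: float, base_t: float) -> float:
    """Share of the CPU left for the application while a message is in flight.

    work_t: time for the work loop alone
    iter_t: time for the work loop with communication in progress
    base_t: time for the message transfer alone, with no work
    """
    overhead = iter_t - work_t  # extra time caused by communication
    value = 100.0 * (1.0 - overhead / base_t)
    # Clamp to the 0..100 range to absorb timing noise
    return max(0.0, min(100.0, value))


# Communication fully hidden behind the work: no extra time
print(availability_percent(work_t=10.0, iter_t=10.0, base_t=5.0))  # 100.0
# Communication fully in the foreground: the whole transfer time is added
print(availability_percent(work_t=10.0, iter_t=15.0, base_t=5.0))  # 0.0
```

The two printed cases correspond to the 100% and 0% end points described in the list above; real measurements fall in between.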
Receiver Side Availability
100% Availability: Communication does not interfere with processing at all
Louhi's availability improves with 8k-128k messages
Murska's availability drops significantly between 16k and 32k
0% Availability: Processing communication hogs the CPU completely
Sender Side Availability
Louhi's availability improves dramatically with large messages:
The offload engine can process packets
▪ A popular and mature molecular dynamics simulation
Open Source, downloadable from
▪ Programmed with MPI
Designed to exploit overlapping computation and communication, if
▪ Parallel speedup was measured by using a fixed size
molecular system
Run times for task counts from 16 to 128 were measured
MPI calls were profiled
• How much time spent in communication subroutines?
• Which subroutines were the most time-consuming?
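For a fixed-size problem, speedup and parallel efficiency can be computed relative to the smallest task count measured. The sketch below is a minimal illustration of that calculation. The run times in the example are invented placeholders and are not the Gromacs results from Louhi or Murska; `scaling` is a hypothetical helper name.

```python
def scaling(run_times: dict[int, float]) -> dict[int, tuple[float, float]]:
    """Map each task count to (speedup, efficiency) relative to the smallest count.

    speedup    = T(base) / T(n)
    efficiency = speedup / (n / base), where 1.0 means ideal scaling
    """
    base = min(run_times)
    base_time = run_times[base]
    result = {}
    for tasks, seconds in sorted(run_times.items()):
        speedup = base_time / seconds
        result[tasks] = (speedup, speedup / (tasks / base))
    return result


# Placeholder run times in seconds (invented for illustration only)
example = {16: 100.0, 32: 55.0, 64: 35.0, 128: 40.0}
for tasks, (speedup, efficiency) in scaling(example).items():
    print(f"{tasks:4d} tasks: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

In this invented data set the run time rises again at 128 tasks, which is the kind of behavior described as scaling "stopping" at a given task count.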
Gromacs Run Time
Time spent in MPI routines starts increasing at 64 tasks
Murska’s scaling stops at 32 tasks
Louhi’s scaling stops at 64 tasks
MPI Call Profile
Fraction of the total MPI time spent in a specific MPI call
With small message sizes the MPI_Alltoall (all processes send a message to each other) dominates the time spent in MPI.
MPI_Wait (wait for an asynchronous message transfer to complete) starts dominating time usage on Murska as the task count grows.
MPI_Alltoall starts dominating MPI time usage on Louhi again as the task count grows.
▪ On Murska, a tradeoff has to be made with large parallel problems
SRQ: Sacrifice latency in favor of memory capacity
RDMA: Sacrifice memory capacity in favor of latency
▪ Murska is able to outperform Louhi in some benchmarks
Especially in short message performance in RDMA mode
▪ Louhi was more consistent in providing low processing overhead
Being able to overlap long messages tends to be more important than short
messages as they take more time to complete
▪ Gromacs scaled significantly better on Louhi
Most likely largely due to lower communication processing overhead
▪ A proprietary system still has its place
The interconnect is designed from the ground up to handle MPI communication and
HPC workloads
The streamlined microkernel also helps out
▪ Focusing only on “hero numbers” (e.g. short message latency) can be
Thank you!