Experiences and Achievements in Deploying Ranger, the First NSF "Path to Petascale" System
Tommy Minyard
Associate Director
Texas Advanced Computing Center
The University of Texas at Austin
HPCN Workshop
May 14, 2009
Ranger
Ranger: What is it?
• Ranger is a unique instrument for computational scientific research housed at UT
• Results from over 2½ years of initial planning and deployment efforts beginning Nov. 2005
• Funded by the National Science Foundation as part of a unique program to reinvigorate High Performance Computing in the United States
• Oh yeah, it's a Texas-sized supercomputer
How Much Did it Cost and Who’s Involved?
• TACC selected for the very first NSF 'Track2' HPC system
– $30M system acquisition
– Sun Microsystems is the vendor
– We competed against almost every US open science HPC center
• TACC, Cornell, and Arizona State HPCI are teamed to operate/support the system for 4 years ($29M)
Ranger Project Timeline
Dec 05 – initial planning and system conceptual design
Feb 06 – TACC submits proposal, gets a few nights' sleep
Sep 06 – award, press release, relief
1Q07 – equipment begins arriving
2Q07 – facilities upgrades complete
2Q–4Q07 – construction of system
4Q07 – early users
Jan 08 – acceptance testing
Feb 08 – initial production and operations
Jun 08 – processor upgrade complete, system acceptance
Who Built Ranger?
An international collaboration which leverages a number of freely-available software packages (e.g. Linux, InfiniBand, Lustre)
Ranger System Summary
• Compute power – 579.4 Teraflops
– 3,936 Sun four-socket blades
– 15,744 AMD "Barcelona" processors (quad-core, four flops/clock cycle)
• Memory – 123 Terabytes
– 2 GB/core, 32 GB/node
– 123 TB/s aggregate bandwidth
• Disk subsystem – 1.7 Petabytes
– 72 Sun x4500 "Thumper" I/O servers, 24 TB each
– 50 GB/sec total aggregate I/O bandwidth
– 1 PB raw capacity in largest filesystem
• Interconnect – 1 GB/s, 1.6–2.85 μsec latency, 7.8 TB/s backplane
– Sun Data Center Switches (2), InfiniBand, up to 3,456 4x ports each
– Full non-blocking 7-stage fabric
– Mellanox ConnectX InfiniBand HCAs
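As a quick sanity check on the headline number (my arithmetic from the counts above, assuming the quad-core parts run at roughly 2.3 GHz):

\[
3{,}936 \times 4 = 15{,}744 \ \text{processors}, \qquad 15{,}744 \times 4 = 62{,}976 \ \text{cores}
\]
\[
62{,}976 \ \text{cores} \times 4 \ \tfrac{\text{flops}}{\text{cycle}} \times 2.3 \ \text{GHz} \approx 579.4 \ \text{TF}
\]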
InfiniBand Fabric Connectivity
Ranger Space, Power and Cooling
• Total Power: 3.4 MW!
• System: 2.4 MW
– 96 racks – 82 compute, 12 support, plus 2 switches
– 116 APC In-Row cooling units
– 2,054 sq. ft. footprint (~4,500 sq. ft. including PDUs)
• Cooling: ~1 MW
– In-row units fed by three 350-ton chillers (N+1)
– Enclosed hot aisles by APC
– Supplemental 280 tons of cooling from CRAC units
• Observations:
– Space less an issue than power
– Cooling > 25 kW per rack a challenge
– Power distribution a challenge, almost 1,400 circuits
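A rough per-rack figure implied by the numbers above (my arithmetic, not from the slide):

\[
\frac{2.4 \ \text{MW}}{96 \ \text{racks}} \approx 25 \ \text{kW per rack on average,}
\]

so the densely packed compute racks sit at or above the >25 kW/rack level flagged above as a cooling challenge.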
External Power and Cooling Infrastructure
Last racks delivered: Sept 15, 2007
Switches in place
InfiniBand Cabling in Progress
Hot aisles enclosed
InfiniBand Cabling Complete
InfiniBand Cabling for Ranger
• Sun switch design with reduced cable count, manageable, but still a challenge to cable
– 1,312 InfiniBand 12x to 12x cables
– 78 InfiniBand 12x to three 4x splitter cables
– Cable lengths range from 9–16 m
• 15.4 km total InfiniBand cable length
[Cable photos: one cable type connects the InfiniBand switch to a C48 Network Express Module; the other connects the InfiniBand switch to a standard 4x connector HCA]
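The cable counts and total length above are consistent (my arithmetic):

\[
1{,}312 + 78 = 1{,}390 \ \text{cables}, \qquad \frac{15{,}400 \ \text{m}}{1{,}390 \ \text{cables}} \approx 11 \ \text{m per cable,}
\]

squarely within the quoted 9–16 m range.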
Ranger Cable Envy?
• On a system like Ranger, even designing the cables is a big challenge (1 cable can transfer data ~2,000 times faster than your best-ever wireless connection)
• The cables on Ranger are the first demonstration of their kind for InfiniBand cabling (1 cable is really 3 cables inside), new 12x connector
• Routing them to all the various components is no fun either
Hardware Deployment Challenges
• AMD quad-core processor delays and TLB errata
• Sheer quantity of components, a logistics nightmare
• InfiniBand cable quality
• New hardware, many firmware and BIOS updates
• Scale, scale, scale
Software Deployment Challenges
• Lustre with latest OFED InfiniBand, RAID5 crashes, quota problems
• Sun Grid Engine scalability and performance
• InfiniBand subnet manager and routing algorithms
• MPI collective tuning and large job startup
MPI Scalability and Collective Tuning
[Plot: average collective time (μsec) vs. message size (bytes, 10 B–10 MB) for MVAPICH, MVAPICH-devel, and OpenMPI]
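A minimal sketch of the kind of collective micro-benchmark behind a plot like this (my own illustration, not TACC's actual test code): it sweeps message sizes, times repeated MPI_Allreduce calls, and reports the average per-call time from rank 0.

/* collective_sweep.c - illustrative MPI_Allreduce timing sweep (not the actual benchmark) */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, iters = 100;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (size_t bytes = 16; bytes <= (1 << 22); bytes *= 4) {
        size_t n = bytes / sizeof(double);
        double *in  = calloc(n, sizeof(double));
        double *out = calloc(n, sizeof(double));

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Allreduce(in, out, (int)n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)    /* average time per collective call at this message size */
            printf("%10zu bytes  %12.2f usec/call\n", bytes, 1.0e6 * (t1 - t0) / iters);
        free(in);
        free(out);
    }
    MPI_Finalize();
    return 0;
}

Rebuilding the same source against each MPI module and launching it at increasing task counts is how curves like the ones above would typically be produced.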
Current Ranger Software Stack
• OFED 1.3.1 with Linux 2.6.18.8 kernel
• Lustre 1.6.7.1
• Multiple MPI stacks supported
– MVAPICH 1.1
– MVAPICH2 1.2
– OpenMPI 1.3
• Intel 10.1 and PGI 7.2 compilers
• Modules package to set up the user environment
MPI Tests: P2P Bandwidth
[Plot: point-to-point bandwidth (MB/sec) vs. message size (bytes, 1–1E+08) for Ranger (OFED 1.2, MVAPICH 0.9.9) and Lonestar (OFED 1.1, MVAPICH 0.9.8); effective bandwidth is improved at smaller message sizes]
Ranger: Bisection BW Across 2 Magnums
[Plots: ideal vs. measured bisection bandwidth (GB/sec) and full bisection BW efficiency (%) vs. number of Ranger compute racks]
• Able to sustain ~73% bisection bandwidth efficiency with all nodes communicating (82 racks)
• Subnet routing is key! Using fat-tree routing from OFED 1.3, which has cached routing to minimize the overhead of remaps
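For reference, the efficiency plotted above is simply the ratio of delivered to ideal bisection bandwidth (my restatement of the metric):

\[
\text{efficiency} = \frac{\text{measured bisection BW}}{\text{ideal bisection BW}} \approx 73\% \ \text{at 82 racks.}
\]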
Software Challenges: Large MPI Jobs
Time to run a 16K-task MPI hello world:
– MVAPICH: 50 secs
– OpenMPI: 140 secs
Upgraded performance in Oct. 08
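For context, the "hello world" being timed is essentially the minimal MPI program below (a generic sketch, not necessarily the exact test used); at 16K tasks nearly all of the time is spent in job launch and MPI_Init rather than in the program body.

/* hello.c - minimal MPI program of the kind timed above (illustrative) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);            /* startup cost dominates at 16K tasks */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0)
        printf("hello from %d MPI tasks\n", size);
    MPI_Finalize();
    return 0;
}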
First Year Production Experiences
• Demand for system exceeded expectations
• Applications scaled better than predicted
– Jobs with 16K MPI tasks routine on system now
– Several groups scaling applications to full system
• Filesystem performance very good
• HPC expert user support required to solve some system issues
– Application performance variability
– System noise and OS jitter
Ranger Usage
• Who uses Ranger?
– a community of researchers from around the US (along with international collaborators)
– more than 2,200 allocated users as of Apr 2009
– 450 individual research projects
• Usage to date?
– >700,000 jobs have been run through the queues
– >450 million CPU hours consumed
• How long did it take to fill up the largest Lustre file system?
– We were able to go ~6 months prior to turning on the file purging mechanism
– Steady-state usage allows us to retain data for about 30 days
– Generate ~10–20 TB a day currently
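At that generation rate, 30 days of retention works out to (my arithmetic):

\[
10\text{–}20 \ \tfrac{\text{TB}}{\text{day}} \times 30 \ \text{days} \approx 300\text{–}600 \ \text{TB,}
\]

a substantial fraction of the 1 PB raw capacity of the largest Lustre file system.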
Who Uses Ranger?
Parallel Filesystem Performance
[Plots: $SCRATCH file system throughput and $SCRATCH application performance – write speed (GB/sec) vs. number of writing clients (1–10,000) for stripecount=1 and stripecount=4]
Some applications measuring 35 GB/s of performance
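A sketch of how this kind of throughput curve is typically measured (my illustration, not TACC's benchmark): every MPI task writes one fixed-size block to a shared file on $SCRATCH using collective MPI-IO, and rank 0 reports the aggregate rate. The Lustre stripe count of the target directory (the stripecount=1 vs. 4 comparison above) would be set beforehand, e.g. with Lustre's lfs setstripe utility.

/* scratch_write.c - illustrative aggregate write-bandwidth test (not TACC's benchmark) */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_BYTES (64UL * 1024 * 1024)   /* 64 MB per writing client (illustrative size) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *buf = malloc(BLOCK_BYTES);
    for (size_t i = 0; i < BLOCK_BYTES; i++) buf[i] = (char)i;

    MPI_File fh;                       /* "testfile" stands in for a file in a $SCRATCH directory */
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_File_write_at_all(fh, (MPI_Offset)rank * BLOCK_BYTES,
                          buf, (int)BLOCK_BYTES, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)                     /* aggregate GB/s across all writing clients */
        printf("%d clients: %.2f GB/s\n", size,
               (double)size * BLOCK_BYTES / (t1 - t0) / 1.0e9);
    free(buf);
    MPI_Finalize();
    return 0;
}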
Application Performance Variability
Problem:
• User code running and performing consistently per iteration at 8K and 16K tasks
• Intermittently during a run, iterations would slow down for a while, then resume
• Impact was tracked to be system-wide
• Monitoring InfiniBand error counters isolated the problem to a single node HCA causing congestion
• Users don't have access to the IB switch counters, making this hard to diagnose from within an application
System Noise, OS Jitter Measurement
• Application scales and performs poorly at larger MPI task counts, 2K and above
• Application dominated by Allreduce with a brief amount of computational work in between
– No performance problem when code run 15-way
– Significant difference at 8K between 15-way and 16-way
• Noise isolated to system monitoring processes:
– SGE health monitoring – 30% at 4K
– IPMI daemon – 12% at 4K
Note: other applications not measurably impacted by the health monitoring processes
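A sketch of the style of measurement behind the plot that follows (my illustration): each iteration does a fixed, brief amount of work plus an MPI_Allreduce, and the spread between the fastest and slowest iterations exposes interference from node daemons.

/* jitter.c - illustrative fixed-work Allreduce loop for spotting OS noise */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, iters = 1000;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = rank, global, tmin = 1.0e30, tmax = 0.0, tsum = 0.0;
    for (int i = 0; i < iters; i++) {
        double t0 = MPI_Wtime();
        for (int k = 0; k < 2000000; k++)      /* brief, fixed compute phase */
            local = local * 0.999999 + 1.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double dt = MPI_Wtime() - t0;
        if (dt < tmin) tmin = dt;
        if (dt > tmax) tmax = dt;
        tsum += dt;
    }
    if (rank == 0)   /* a max far above min/avg suggests noise from daemons on some node */
        printf("iteration time: min %.6f  avg %.6f  max %.6f sec\n",
               tmin, tsum / iters, tmax);
    MPI_Finalize();
    return 0;
}

Running such a loop 15-way (i.e. leaving one core per 16-core node free) versus 16-way is what separates the two curves in the next plot.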
OS Jitter Impact on Performance
[Plot: time per iteration (secs) vs. number of cores (up to 10,000) for 15-way and 16-way runs]
Weather Forecasting
• TACC worked with NOAA to produce accurate simulations of Hurricane Ike and new storm surge models
Using up to 40,000 processing cores at once, researchers simulating both global and regional weather predictions received on-demand access to Ranger, enabling not only ensemble forecasting, but also real-time, high-resolution predictions.
Researching the Origins of the Universe
Volker Bromm is investigating the conditions during the formation of the first galaxies in the universe after the Big Bang.
This image shows two separate quantities, temperature and hydrogen density, as the first galaxy is forming and evolving.
Volker Bromm, Thomas Grief, Chris Burns, The University of Texas at Austin
Computing the Earth’s Mantle
• Omar Ghattas is studying convection in the Earth's interior. He is simulating a model mantle convection problem. Images depict a rising temperature plume within the Earth's mantle, indicating the dynamically-evolving mesh required to resolve steep thermal gradients.
• Ranger's speed and memory permit higher-resolution simulations of mantle convection, which will lead to a better understanding of the dynamic evolution of the solid Earth.
Carsten Burstedde, Omar Ghattas, Georg Stadler, Tiankai Tu, Lucas Wilcox, The University of Texas at Austin
Application Example: Earth Sciences
Mantle Convection, AMR Method
Courtesy: Omar Ghattas et al.
Application Example: Earth Sciences
Courtesy: Omar Ghattas et al.
Application Examples: DNS/Turbulence
Courtesy: P.K. Yeung, Diego Donzis, TG 2008
UINTAH Framework
Lessons Learned
• Planning for risks essential, agile deployment schedule a must
• Everything breaks as you increase scale, hard to test until system complete
• Close vendor/customer/3rd-party provider collaboration required for successful deployment
• New hardware needs thorough testing in "real-world" environments
Ongoing Challenges
• Supporting multiple groups running at 32K+ cores
• Continued application and library porting and tuning
• Enhanced MPI job startup reliability and speed beyond 32K tasks, continued tuning of MPI
• Optimal CPU and memory affinity settings
• Better job scheduling and scalability of SGE
• Improving application scalability and user productivity
• Ensuring filesystem and IB fabric stability
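On the affinity point, the basic Linux mechanism is sched_setaffinity; a minimal sketch (my illustration, assuming a hypothetical LOCAL_RANK variable exported by the MPI launcher and one task per core) pins each task to a distinct core on a 16-core Ranger node:

/* pin_core.c - illustrative per-task CPU pinning (mechanism sketch, not Ranger's actual settings) */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* LOCAL_RANK is a placeholder for whatever node-local rank index the launcher provides */
    const char *lr = getenv("LOCAL_RANK");
    int core = lr ? atoi(lr) % 16 : 0;   /* Ranger blades: 4 sockets x 4 cores = 16 cores */

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to core %d\n", core);
    return 0;
}

In practice the MPI stacks expose their own affinity controls; the point is simply that where each task lands relative to its memory matters a great deal on a four-socket NUMA node.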
Spur - Visualization System
• 128 cores, 1.125 TB distributed memory, 32 GPUs
• 1 Sun Fire X4600 server
– 8 AMD dual-core CPUs
– 256 GB memory
– 4 NVIDIA FX5600 GPUs
• 7 Sun Fire X4440 servers
– 4 AMD quad-core CPUs per node
– 128 GB memory per node
– 4 NVIDIA FX5600 GPUs per node
• Shares Ranger InfiniBand and file systems
Summary
• Significant challenges deploying a system at the size of Ranger, especially with new hardware and software
• Application scalability and system usage exceeding expectations
• Collaborative effort with many groups successfully overcoming the challenges posed by a system of this scale
• Open-source software, commodity-hardware-based supercomputing at "petascale" is feasible
More About TACC:
Texas Advanced Computing Center
www.tacc.utexas.edu
info@tacc.utexas.edu
(512) 475-9411