TX-2500: An Interactive, On-Demand Rapid-Prototyping HPC System

Albert Reuther, William Arcand, Tim Currie, Andrew Funk,
Jeremy Kepner, Matthew Hubbell, Andrew McCabe, and Peter Michaleas
HPEC 2007
September 18-20, 2007
This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations,
conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Outline
• Introduction
• Motivation
• HPEC Software Design Workflow
• Interactive, On-Demand
• High Performance
• Summary
LLGrid Applications
Example App: Prototype GMTI & SAR Signal Processing
[Figure: prototype GMTI/SAR processing chain. A research sensor with on-board processing streams data to real-time front-end processing (A/D at 180 MHz BW, subband filter bank, delay & equalize; ~430 GOPS) feeding a RAID disk recorder (SAR, GMTI, …). Non-real-time GMTI processing (adaptive beamforming, pulse compression, Doppler processing, STAP clutter cancellation, subband combining, target detection) runs on an analyst workstation running MATLAB.]
• Airborne research sensor data collected
• Research analysts develop signal processing algorithms in MATLAB® using collected sensor data
• Individual runs can last hours or days on a single workstation
HPEC Software Design Workflow
[Figure: HPEC software design workflow. Algorithm development, verification, and data analysis each progress from serial code on a desktop, to parallel code on a cluster, to embedded code on an embedded computer.]

Requirements:
• Low barrier-to-entry
• Interactive
• Immediate execution
• High performance
Outline
• Introduction
• Interactive, On-Demand
• Technology
• Results
• High Performance
• Summary
Parallel Matlab (pMatlab)
[Figure: layered architecture. An application (input, analysis, output) calls the library layer (pMatlab), which provides the parallel library and user interface: vector/matrix operations, computation, task parallelism, and conduits. The kernel layer supplies math (MATLAB), messaging (MatlabMPI), and cluster launch (gridMatlab) and the hardware interface to the parallel hardware.]

• Layered architecture for parallel computing
• Kernel layer does single-node math & parallel messaging
• Library layer provides a parallel data and computation toolbox to MATLAB users
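To give a feel for the library layer, here is a minimal pMatlab-style sketch (not from the slides; map, local, put_local, agg, and Np follow the published pMatlab API, and the array size is arbitrary) that block-distributes a matrix by rows, does local work, and gathers the result:

```matlab
% Minimal pMatlab-style sketch (illustrative, not from the talk).
% Np (process count), map, local, put_local, agg come from pMatlab.
PARALLEL = 1;                           % set to 0 to debug serially
rowMap = 1;                             % a map of 1 means plain MATLAB
if PARALLEL
    rowMap = map([Np 1], {}, 0:Np-1);   % split rows across Np processes
end
A = zeros(1024, 1024, rowMap);          % distributed array
Aloc = local(A);                        % this process's rows
Aloc = fft(Aloc, [], 2);                % single-node math on local data
A = put_local(A, Aloc);                 % write local results back
Aall = agg(A);                          % gather whole array on the leader
```

Setting the map back to 1 recovers plain serial MATLAB, which is part of what keeps the barrier-to-entry low.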
Batch vs. Interactive, On-Demand Scheduling Tradeoffs
• What is the goal of the system?
– Capability computing
– Capacity computing
• What types of algorithms and code will the system run?
– Stable algorithms
– Algorithm development
• What’s more expensive?
– The system and its maintenance
– The researchers’ time

Capability computing, stable algorithms, and/or high system cost → batch queue(s)
Capacity computing, algorithm development, and/or high researcher cost → interactive, on-demand scheduling
LLGrid User Queues
• Interactive, on-demand queue (interactive only): limit of 64 CPUs per user (more upon request)
• Batch (default) queue (interactive and batch), with high, medium, and low priorities: limit of 96/128 CPUs per user (days / nights and weekends; more upon request)
• Not using the scheduler’s interactive features (lsrun, lsgrun, bsub -I)
• Certain CPUs are reserved for interactive, on-demand jobs only
• CPU allotments will change when upgrading to a larger system
LLGrid Usage
December 2003 – August 2007
[Figure: scatter plot of job duration (seconds) versus processors used per job for all jobs on alphaGrid, betaGrid, and TX-2500. Jobs over 8 CPU hours are infeasible on a desktop; jobs over 8 CPUs require on-demand parallel computing.]

Statistics:
• 280 + 864 CPUs
• 226 users
• 234,658 jobs
• 130,160 CPU days
Usage Statistics
December 2003 – August 2007

[Figure: histograms of the number of jobs versus processors used per job and versus job duration (seconds).]

Total jobs run: 234,658
Median CPUs per job: 11
Mean CPUs per job: 20
Maximum CPUs per job: 856
Total CPU time: 130,160d 15h
Median job duration: 35s
Mean job duration: 40m 41s
Max job duration: 18d 7h 6m

• Most jobs use less than 32 CPUs
• Many jobs with 2, 8, 24, 30, 32, 64, and 100 CPUs
• Very low median job duration
• Modest mean job duration
• Some jobs still take hours on 16 or 32 CPUs
Individuals’ Usage Examples

[Figure: processors used versus time of day for two users’ sessions.]

Tuesday, 15-May-2007 (18:00–21:00):
• Developing non-linear equalization ASIC
• Post-run processing from overnight run
• Debug runs during day
• Prepare for long overnight runs

Saturday, 5-May-2007 (20:00–0:00):
• Simulating laser propagation through atmosphere
• Simulation results direct subsequent algorithm development and parameters
• Many engineering iterations during course of day
Outline
• Introduction
• Interactive, On-Demand
• High Performance
• Technology
• Results
• Summary
TX-2500
[Figure: TX-2500 system diagram. Three service nodes provide shared network storage, the LSF-HPC resource manager/scheduler, and Rocks management (411, web server, Ganglia), with a connection to LLAN.]

432 Dell PowerEdge 2850 compute nodes, each with:
• Dual 3.2 GHz EM64T Xeons (“Irwindale” CPUs on the “Lindenhurst” chipset)
• 8 GB RAM
• Two Gig-E Intel interfaces
• InfiniBand interface
• Six 300-GB disk drives

Totals:
• 432+3 nodes
• 864+6 CPUs
• 3.4 TB RAM
• 0.78 PB of disk
• 28 racks
InfiniBand Topology
• 28 racks, 16 servers each
• 28 Cisco SFS 7000P rack switches (24 InfiniBand 10-Gbps 4X ports each)
• Four Cisco SFS 7008P core switches (96 InfiniBand 10-Gbps 4X ports each)
• Two uplinks from each rack switch to each core switch
• Provides redundancy and minimizes contention
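A quick sanity check on the port budget (my arithmetic from the figures above; the 2:1 oversubscription conclusion is an inference, not stated in the slides):

```matlab
% Port accounting for the TX-2500 InfiniBand fabric described above.
% All counts come from the slide; the oversubscription is inferred.
racks          = 28;
serversPerRack = 16;
coreSwitches   = 4;
uplinksPerCore = 2;    % from each rack switch to each core switch

upPorts = coreSwitches * uplinksPerCore;         % 8 uplink ports
fprintf('Rack switch ports used: %d of 24\n', serversPerRack + upPorts);
fprintf('Down:up oversubscription: %g:1\n', serversPerRack / upPorts);
fprintf('Ports used per core switch: %d of 96\n', racks * uplinksPerCore);
```

The 16 server ports plus 8 uplinks exactly fill each 24-port rack switch.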
Parallel Computing Architecture Issues
[Figure: standard parallel computer architecture and its corresponding memory hierarchy. Nodes of CPUs, RAM, and disks are joined by a network switch. The hierarchy runs from registers, through cache (instruction operands, blocks), local memory, remote memory (messages), and pages, down to disk; capacity and latency increase toward the bottom, programmability and bandwidth toward the top.]

• Standard architecture produces a “steep” multi-layered memory hierarchy
– Programmer must manage this hierarchy to get good performance
• Need to measure each level to determine system performance
HPC Challenge Benchmarks
• Top500: solves a system Ax = b
• STREAM: vector operations A = B + s×C
• FFT: 1D Fast Fourier Transform Z = FFT(X)
• RandomAccess: random updates T(i) = XOR( T(i), r )
• Iozone: read and write to disk (not part of HPC Challenge)

[Figure: each benchmark aligned with the memory-hierarchy level it stresses, from registers (Top500) and cache/local memory (STREAM), through remote memory (FFT for bandwidth, RandomAccess for latency), down to disk (Iozone).]

• HPC Challenge with Iozone measures this hierarchy
• Benchmarks the performance of the architecture
Example SAR Application
[Figure: example SAR application mapped to benchmarks. Front-end sensor processing: raw SAR data → adaptive beamforming (solving linear systems → Top500) → pulse compression and polar interpolation (FFT, IFFT with corner turn → FFT) → SAR image formation. Back-end detection and ID: image change detection (differencing and thresholding large images → STREAM) → target ID (many small correlations on random pieces of a large image → RandomAccess).]
• HPC Challenge benchmarks are similar to pieces of real apps
• Real applications are an average of many different operations
• How do we correlate HPC Challenge with application performance?
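As an example of the STREAM-like back-end step, change detection reduces to differencing and thresholding two large images; a toy MATLAB version (my illustration, with made-up data and threshold):

```matlab
% Toy change detection: difference two registered images, threshold,
% and count flagged pixels. Data and threshold are stand-ins.
I1 = rand(2048);              % stand-in for a registered SAR image
I2 = rand(2048);              % stand-in for the next pass
t  = 0.9;                     % detection threshold (arbitrary)
D  = abs(I1 - I2) > t;        % pixel-wise difference & threshold
fprintf('%d pixels flagged as changed\n', nnz(D));
```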
Flops: Top500
Ethernet vs. InfiniBand
[Figure: Top500 GFLOPS versus CPUs for Ethernet and InfiniBand at a fixed problem size of 4000 MB/node, and versus problem size (MB/node) on InfiniBand.]
• Top500 measures register-to-CPU comms (1.42 TFlops peak)
• Top500 does not stress the network subsystem
Memory Bandwidth: STREAM
Ethernet vs. InfiniBand
[Figure: STREAM performance versus CPUs for Ethernet and InfiniBand at a fixed problem size of 4000 MB/node, and versus problem size (MB/node) on InfiniBand.]
• Memory to CPU bandwidth tested (~2.7 GB/s per node)
• Unaffected by memory footprint of data set
Network Bandwidth: FFT
Ethernet vs. InfiniBand
[Figure: FFT GFLOPS versus CPUs for Ethernet and InfiniBand at a fixed problem size of 4000 MB/node, and versus problem size (MB/node) on InfiniBand.]
• InfiniBand makes gains due to higher bandwidth
• Ethernet plateaus around 256 CPUs
Network Latency: RandomAccess
Ethernet vs. InfiniBand
[Figure: RandomAccess GUPS versus CPUs for Ethernet and InfiniBand at a fixed problem size of 4000 MB/node, and versus problem size (MB/node) on InfiniBand.]
• Low latency of InfiniBand drives GUPS performance
• Strong mapping to embedded interprocessor comms
Iozone Results

[Figure: distributions of read and write bandwidth (MB/s) across the compute nodes’ RAID arrays.]
• 98% of the RAID arrays exceeded 50 MB/s throughput
• Comparable performance to embedded storage
Command used: iozone -r 64k -s 16g -e -w -f /state/partition1/iozone_run
HPC Challenge Comparison
InfiniBand, Problem Size = 4000 MB/node

[Figure: effective bandwidth (words/second, mega through tera) versus CPUs for Top500, STREAM, FFT, and RandomAccess, annotated with the memory-hierarchy level each benchmark exercises (Top500 at the registers, STREAM at cache/local memory, FFT and RandomAccess at remote memory, disk below).]

• All results in words/second
• Highlights memory hierarchy
Summary
[Figure: the HPEC software design workflow (algorithm development, verification, data analysis; serial → parallel → embedded code), annotated with the LLGrid attributes demonstrated in this talk.]

• Low barrier-to-entry
• Interactive
• Immediate execution
• High performance

• Parallel MATLAB and user statistics
• HPC Challenge and results
Backup Slides
TX-2500: Analysis of Potential Bottlenecks
Peak Bandwidths
[Figure: block diagram of a Dell PowerEdge 2850 node and its network paths, annotated with peak bandwidths.]

Per node (Dell PowerEdge 2850):
• Two 3.2 GHz “Irwindale” CPUs on an 800-MHz FSB (6.4 GB/s) to the “Lindenhurst” North Bridge MCH
• DDR2-400 MHz memory, channels A and B at 3.2 GB/s each, four 2-GB banks (8 GB total)
• PCIe x8 links at 4.0 GB/s (InfiniBand HBA, RAID controller); PCIe x4 at 2.0 GB/s
• Hublink at 1.5 GB/s to the South Bridge ICH5 (USB, video, etc.)
• InfiniBand HBA: 1.8 GB/s (10X IB)
• PERC4e/DC RAID controller with 256 MB cache on a U320 SCSI bus (320 MB/s)
• Six 10k-RPM U320 SCSI disks at ~40 MB/s each
• Two GigE interfaces at 1 Gbps (0.18 GB/s) each

Network:
• InfiniBand core switches: Cisco SFS 7008P, 240 GB/s backplane
• InfiniBand rack switches: Cisco SFS 7000P, 60 GB/s backplane
• GigE central switches: Nortel 5530 stack, 5 GB/s backplane
• GigE rack switches: Nortel 5510 stack, 5 GB/s backplane
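One way to read the diagram: any path is limited by its slowest link. A toy min-of-links calculation for the disk path (my simplification using the slide’s peak figures, not the authors’ analysis):

```matlab
% Min-of-links bottleneck estimate for the disk path sketched above.
% Peak figures come from the slide; the model itself is my own.
links = [6*40, 320, 4000];   % MB/s: six ~40 MB/s disks, U320 SCSI, PCIe x8
names = {'six ~40 MB/s disks', 'U320 SCSI bus', 'PCIe x8'};
[bw, k] = min(links);
fprintf('Peak disk-path throughput: %d MB/s (limited by %s)\n', bw, names{k});
```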
Spatial/Temporal Locality Results
[Figure: temporal score versus spatial score, using MetaSim data from Snavely et al. (SDSC). The HPC Challenge kernels (HPL/Top500, STREAM, FFT, RandomAccess/GUPS) sit at the corners of the space, with scientific applications (Avus Large, cth7 Large, Gamess Large, lammps Large, overflow2, wrf) and intelligence, surveillance, reconnaissance applications falling between them.]

• HPC Challenge bounds real applications
– Allows us to map between applications and benchmarks
HPL “Top500” Benchmark
• High Performance Linpack (HPL) solves a system Ax = b
• Core operation is an LU factorization of a large M×M matrix
• Results are reported in floating point operations per second (flops)

[Figure: parallel algorithm. A is factored into L and U using a 2D block-cyclic distribution over processors P0–P3 for load balancing; the kernel lives at the registers/cache end of the memory hierarchy.]
• Linear system solver (requires all-to-all communication)
• Stresses local matrix multiply performance
• DARPA HPCS goal: 2 Petaflops (8x over current best)
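A serial toy version of what HPL times (my illustration, not the HPL code; 2/3·M³ is HPL’s nominal operation count):

```matlab
% Toy serial HPL analogue: LU-factor a random system, solve, report flops.
M = 2000;
A = rand(M); b = rand(M, 1);
tic;
[L, U, P] = lu(A);            % LU factorization with pivoting (P*A = L*U)
x = U \ (L \ (P * b));        % forward/back substitution
t = toc;
nflops = (2/3) * M^3 / t;     % HPL's nominal operation count
fprintf('~%.2f GFLOPS, residual %g\n', nflops / 1e9, norm(A*x - b));
```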
STREAM Benchmark
• Performs scalar multiply and add: A = B + s×C
• Results are reported in bytes/second

[Figure: parallel algorithm. The vectors A, B, and C are split into Np independent local pieces (0, 1, …, Np−1); the kernel lives at the local-memory level of the hierarchy.]
• Basic operations on large vectors (requires no communication)
• Stresses local processor to memory bandwidth
• DARPA HPCS goal: 6.5 Petabytes/second (40x over current best)
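A serial MATLAB analogue of the triad (my illustration, not the STREAM code; 24 bytes move per element: read B, read C, write A):

```matlab
% Toy STREAM triad: time A = B + s*C and report memory bandwidth.
N = 2e7;                        % ~160 MB per double-precision array
B = rand(N, 1); C = rand(N, 1); s = 3.0;
tic;
A = B + s .* C;                 % the triad
t = toc;
fprintf('~%.2f GB/s\n', 3 * 8 * N / t / 1e9);   % 3 arrays x 8 bytes
```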
FFT Benchmark
• 1D Fast Fourier Transform of an N-element complex vector
• Typically done as a parallel 2D FFT
• Results are reported in floating point operations per second (flops)

[Figure: parallel algorithm. The vector is treated as a matrix: each processor FFTs its rows, a corner turn (all-to-all) redistributes the data, then each processor FFTs its columns; the kernel stresses the remote-memory (messages) level of the hierarchy.]
• FFT a large complex vector (requires all-to-all communication)
• Stresses interprocessor communication of large messages
• DARPA HPCS goal: 0.5 Petaflops (200x over current best)
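A serial sketch of the 2D decomposition (my illustration; twiddle factors are omitted, so this shows the data-movement pattern, including the corner turn, rather than a bit-exact 1D FFT):

```matlab
% Sketch of the row-FFT / corner-turn / column-FFT pattern.
N = 2^20; rows = 2^10; cols = N / rows;
X = complex(rand(rows, cols), rand(rows, cols));
Y = fft(X, [], 2);             % FFT each row (local work)
Y = Y.';                       % corner turn: the all-to-all step
Y = fft(Y, [], 2);             % FFT each row of the transpose
```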
RandomAccess Benchmark
• Randomly updates an N-element table of unsigned integers
• Each processor generates indices, sends them to all other processors, and performs XORs
• Results are reported in Giga Updates Per Second (GUPS)

[Figure: parallel algorithm. Each of the Np processors generates random indices into a distributed table and sends them to the owning processors, which XOR and update their local entries; the kernel stresses the remote-memory (messages) level with small messages.]
• Randomly updates memory (requires all-to-all communication)
• Stresses interprocessor communication of small messages
• DARPA HPCS goal: 64,000 GUPS (2000x over current best)
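A serial sketch of the update loop (my illustration; table size and random stream are stand-ins):

```matlab
% Toy RandomAccess: XOR pseudo-random values into random table slots.
N = 2^20;                           % table size
T = uint64(0:N-1);                  % the table
U = 4 * N;                          % number of updates
r = uint64(randi(2^32, U, 1));      % stand-in random stream
i = randi(N, U, 1);                 % random indices
tic;
for k = 1:U
    T(i(k)) = bitxor(T(i(k)), r(k));   % T(i) = XOR( T(i), r )
end
fprintf('~%.4f GUPS\n', U / toc / 1e9);
```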
Iozone File System Benchmark

• File system benchmark tool
• Can measure a large variety of file system characteristics
• We benchmarked:
– Read and write throughput performance (MB/s)
– 16 GB files (to test the RAID, not the caches)
– 64 kB blocks (best performance on this hardware)
– On the six-disk hardware RAID set
– On all 432 compute nodes

[Figure: local-disk memory hierarchy: CPUs → local cache → local memory → disk cache → disk(s). Our iozone tests exercise the disk cache and disks.]
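A rough MATLAB analogue of the write test (my illustration, not iozone; scaled down to ~1 GB, and without iozone’s -e flush-into-timing control the OS page cache will inflate the number):

```matlab
% Toy write-throughput test: stream 64 kB blocks to disk, report MB/s.
blockBytes = 64 * 1024;
block = zeros(blockBytes / 8, 1);       % 64 kB of doubles
nBlocks = 16 * 1024;                    % ~1 GB total (not 16 GB)
fid = fopen('iozone_like.tmp', 'w');
tic;
for k = 1:nBlocks
    fwrite(fid, block, 'double');
end
fclose(fid);                            % note: OS caching skews this
t = toc;
fprintf('~%.1f MB/s write\n', nBlocks * blockBytes / t / 1e6);
delete('iozone_like.tmp');
```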