TX-2500: An Interactive, On-Demand Rapid-Prototyping HPC System
Albert Reuther, William Arcand, Tim Currie, Andrew Funk, Jeremy Kepner, Matthew Hubbell, Andrew McCabe, and Peter Michaleas
MIT Lincoln Laboratory
HPEC 2007, September 18-20, 2007

This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

Slide 2: Outline
• Introduction
  – Motivation
  – HPEC Software Design Workflow
• Interactive, On-Demand
• High Performance
• Summary

Slide 3: LLGrid Applications
[Figure: montage of LLGrid application examples.]

Slide 4: Example App: Prototype GMTI & SAR Signal Processing
[Figure: block diagram of the prototype processing chain. An airborne research sensor performs real-time, on-board front-end processing (A/D conversion, a subband filter bank over a 180 MHz bandwidth, delay and equalization) and streams data to a RAID disk recorder. Non-real-time GMTI back-end processing (adaptive beamforming, pulse compression, Doppler processing, STAP clutter suppression, subband combining, and target detection) and new SAR processing run afterward; the diagram annotates each stage with its throughput in GOPS. Analysts work at a workstation running MATLAB.]
• Airborne research sensor data collected
• Research analysts develop signal processing algorithms in MATLAB® using the collected sensor data
• Individual runs can last hours or days on a single workstation

Slide 5: HPEC Software Design Workflow
• Algorithm development: develop serial code on the desktop, then parallel code on the cluster, then embedded code on the embedded computer
• Verification: parallel code on the cluster, embedded code on the embedded computer
• Data analysis: parallel code on the cluster, embedded code on the embedded computer
• Requirements:
  – Low barrier-to-entry
  – Interactive
  – Immediate execution
  – High performance

Slide 6: Outline
• Introduction
• Interactive, On-Demand
  – Technology
  – Results
• High Performance
• Summary

Slide 7: Parallel Matlab (pMatlab)
• Layered architecture for parallel computing:
  – Application layer: input, computation and analysis, output
  – Library layer (pMatlab): vector/matrix, computation, task, and conduit constructs form the parallel library and user interface
  – Kernel layer: math (MATLAB), messaging (MatlabMPI), and cluster launch (gridMatlab) form the interface to the parallel hardware
• The kernel layer does single-node math and parallel messaging
• The library layer provides a parallel data and computation toolbox to MATLAB users (a minimal usage sketch follows below)
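A minimal data-parallel sketch in the style of pMatlab's documented examples is shown below: distribute a matrix across Np MATLAB processes, operate on the local piece, and gather the result on the leader. The names used (pMatlab_Init, pMatlab_Finalize, Np, Pid, map, local, put_local, agg) follow the pMatlab API as generally documented; treat this as an illustration of the layered model, not code from the TX-2500 deployment.

    pMatlab_Init;                        % start the parallel environment (MatlabMPI + gridMatlab underneath)

    N = 1024;
    Xmap = map([1 Np], {}, 0:Np-1);      % split columns across the Np processes
    X = zeros(N, N, Xmap);               % distributed matrices
    Y = zeros(N, N, Xmap);

    x = local(X);                        % this process's columns only
    y = fft(x);                          % purely local work, e.g. FFT of each local column
    Y = put_local(Y, y);                 % write the local piece back into the distributed array

    Yall = agg(Y);                       % gather the full matrix on the leader process
    if Pid == 0
        disp(size(Yall));
    end

    pMatlab_Finalize;                    % shut down cleanly

The point of the layering shows up directly: the user writes ordinary MATLAB on the local part of a distributed array, while MatlabMPI and gridMatlab handle the messaging and cluster launch underneath.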
Slide 8: Batch vs. Interactive, On-Demand Scheduling Tradeoffs
• What is the goal of the system?
  – Capability computing
  – Capacity computing
• What types of algorithms and code will the system run?
  – Stable algorithms
  – Algorithm development
• What is more expensive?
  – The system and its maintenance
  – The researchers' time
• Capability computing, stable algorithms, and/or high system cost favor batch queues; capacity computing, algorithm development, and/or high researcher cost favor interactive, on-demand scheduling

Slide 9: LLGrid User Queues
• Interactive, on-demand queue: high priority, interactive jobs only; limit of 64 CPUs per user (more upon request)
• Batch (default) queue: medium and low priority, interactive and batch jobs; limit of 96/128 CPUs per user (days / nights and weekends), more upon request
• Not using the scheduler's interactive features (lsrun, lsgrun, bsub -I)
• Certain CPUs are reserved for interactive, on-demand jobs only
• CPU allotments will change when upgrading to the larger system

Slide 10: LLGrid Usage, December 2003 - August 2007
• Statistics: 280 + 864 CPUs, 226 users, 234,658 jobs, 130,160 CPU-days
[Figure: job duration (seconds) versus processors used per job for alphaGrid, betaGrid, and TX-2500; jobs above 8 CPU-hours are infeasible on a desktop, and jobs using more than 8 CPUs require on-demand parallel computing.]

Slide 11: Usage Statistics, December 2003 - August 2007
  Total jobs run:        234,658
  Median CPUs per job:   11
  Mean CPUs per job:     20
  Maximum CPUs per job:  856
  Total CPU time:        130,160 days 15 hours
  Median job duration:   35 s
  Mean job duration:     40 min 41 s
  Maximum job duration:  18 days 7 hours 6 min
[Figure: histograms of the number of jobs versus processors used (1 to 144) and versus job duration (1 s to 10^6 s).]
• Most jobs use fewer than 32 CPUs
• Many jobs use 2, 8, 24, 30, 32, 64, or 100 CPUs
• Very low median job duration and modest mean job duration
• Some jobs still take hours on 16 or 32 CPUs

Slide 12: Individuals' Usage Examples
• Tuesday, 15-May-2007: developing a non-linear equalization ASIC
  – Post-run processing from the overnight run
  – Debug runs during the day
  – Prepare for long overnight runs
• Saturday, 5-May-2007: simulating laser propagation through the atmosphere
  – Simulation results direct subsequent algorithm development and parameters
  – Many engineering iterations during the course of the day
[Figure: processors used by each user's jobs over the course of the evening on both days.]

Slide 13: Outline
• Introduction
• Interactive, On-Demand
• High Performance
  – Technology
  – Results
• Summary

Slide 14: TX-2500
• 432 Dell PowerEdge 2850 compute nodes, each with:
  – Dual 3.2 GHz EM64T Xeon ("Irwindale" CPUs on the "Lindenhurst" chipset)
  – 8 GB RAM
  – Two Gigabit Ethernet (Intel) interfaces
  – One InfiniBand interface
  – Six 300-GB disk drives
• Service nodes: shared network storage, the LSF-HPC resource manager/scheduler, and Rocks management (411, web server, Ganglia), connected to the LLAN
• Totals: 432 + 3 nodes, 864 + 6 CPUs, 3.4 TB RAM, 0.78 PB of disk, 28 racks

Slide 15: InfiniBand Topology
• 28 racks with 16 servers each
• 28 Cisco SFS 7000P rack switches (24 InfiniBand 10-Gb/s 4X ports each)
• Four Cisco SFS 7008P core switches (96 InfiniBand 10-Gb/s 4X ports each)
• Two uplinks from each rack switch to each core switch
• Provides redundancy and minimizes contention (the arithmetic below makes the rack-level bandwidth explicit)
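The port counts above fix the rack-level oversubscription. The short calculation below is derived from those figures rather than taken from the original slides, and assumes every link is a 10 Gb/s InfiniBand 4X link with protocol overhead ignored.

    servers_per_rack = 16;                    % downlinks from each SFS 7000P rack switch
    uplinks_per_rack = 2 * 4;                 % two uplinks to each of the four core switches
    ports_per_rack_switch = servers_per_rack + uplinks_per_rack   % = 24, matches the SFS 7000P port count

    link_Gbps = 10;                           % one InfiniBand 4X link
    down_Gbps = servers_per_rack * link_Gbps  % 160 Gb/s from the servers into the rack switch
    up_Gbps   = uplinks_per_rack * link_Gbps  % 80 Gb/s from the rack switch into the core
    oversubscription = down_Gbps / up_Gbps    % 2:1 at the rack boundary

Traffic that stays within a rack sees full link bandwidth, while traffic crossing racks shares a 2:1 oversubscribed set of uplinks spread across all four core switches, which is consistent with the redundancy and contention points above.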
Slide 16: Parallel Computing Architecture Issues
[Figure: a standard parallel computer architecture (CPUs, RAM, disks, and a network switch) beside the corresponding memory hierarchy: registers and instruction operands, cache and blocks, local memory and messages, remote memory and pages, and disk. Capacity and latency increase toward the bottom of the hierarchy; programmability and bandwidth increase toward the top.]
• The standard architecture produces a "steep", multi-layered memory hierarchy
  – The programmer must manage this hierarchy to get good performance
• Need to measure each level to determine system performance

Slide 17: HPC Challenge Benchmarks
• Top500: solves a system Ax = b (registers and instruction operands)
• STREAM: vector operations, A = B + s×C (cache and local memory)
• FFT: 1D Fast Fourier Transform, Z = FFT(X) (interprocessor bandwidth)
• RandomAccess: random updates, T(i) = XOR(T(i), r) (interprocessor latency)
• Iozone: reads and writes to disk (not part of HPC Challenge)
• HPC Challenge plus Iozone measures this hierarchy
• Benchmarks the performance of the architecture (a serial MATLAB sketch of the four HPC Challenge kernels follows the Iozone results below)

Slide 18: Example SAR Application
• Front-end sensor processing: raw SAR data passes through adaptive beamforming (solving linear systems, as in Top500), then pulse compression and polar interpolation (FFTs, IFFTs, and corner turns, as in the FFT benchmark), then SAR image formation
• Back-end detection and ID: change detection differences and thresholds large images (as in STREAM), and target ID performs many small correlations on random pieces of a large image (as in RandomAccess)
• HPC Challenge benchmarks are similar to pieces of real applications
• Real applications are an average of many different operations
• How do we correlate HPC Challenge with application performance?

Slide 19: Flops: Top500, Ethernet vs. InfiniBand
[Figure: GFLOPS versus CPUs for Ethernet and InfiniBand at a problem size of 4000 MB/node, and versus problem size (MB/node) on InfiniBand.]
• Top500 measures register-to-CPU communication (1.42 TFlops peak)
• Top500 does not stress the network subsystem

Slide 20: Memory Bandwidth: STREAM, Ethernet vs. InfiniBand
[Figure: STREAM results versus CPUs (Ethernet and InfiniBand, 4000 MB/node) and versus problem size (InfiniBand).]
• Tests memory-to-CPU bandwidth (~2.7 GB/s per node)
• Unaffected by the memory footprint of the data set

Slide 21: Network Bandwidth: FFT, Ethernet vs. InfiniBand
[Figure: FFT results versus CPUs (Ethernet and InfiniBand, 4000 MB/node) and versus problem size (InfiniBand).]
• InfiniBand makes gains due to higher bandwidth
• Ethernet plateaus around 256 CPUs

Slide 22: Network Latency: RandomAccess, Ethernet vs. InfiniBand
[Figure: RandomAccess results versus CPUs (Ethernet and InfiniBand, 4000 MB/node) and versus problem size (InfiniBand).]
• The low latency of InfiniBand drives GUPS performance
• Strong mapping to embedded interprocessor communication

Slide 23: Iozone Results
[Figure: read and write bandwidth (MB/s) measured on each node's six-disk RAID array.]
• 98% of the RAID arrays exceeded 50 MB/s throughput
• Comparable performance to embedded storage
• Command used: iozone -r 64k -s 16g -e -w -f /state/partition1/iozone_run
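To make the four HPC Challenge kernels concrete, here is a serial MATLAB sketch of the core operation behind each one, taken from the formulas on the HPC Challenge Benchmarks slide. Sizes are arbitrary and there is no parallelism or timing, so this is an illustration rather than the benchmark code.

    N = 2^20;                                 % vector / table length (arbitrary for illustration)
    M = 1000;                                 % matrix order for the Top500-style solve

    % Top500 (HPL): solve a dense linear system Ax = b via LU factorization
    A = rand(M);  b = rand(M, 1);
    x = A \ b;

    % STREAM: scaled vector add, A = B + s*C  (A2 avoids clobbering the HPL matrix)
    B = rand(N, 1);  C = rand(N, 1);  s = pi;
    A2 = B + s * C;

    % FFT: 1-D complex Fast Fourier Transform, Z = FFT(X)
    X = complex(rand(N, 1), rand(N, 1));
    Z = fft(X);

    % RandomAccess: random XOR updates to a table, T(i) = XOR(T(i), r)
    T = uint64(0:N-1)';
    idx = randi(N, N, 1);                     % random indices into the table
    r = uint64(randi([0, 2^32 - 1], N, 1));   % random update values
    T(idx) = bitxor(T(idx), r);               % duplicate indices are acceptable for a sketch

Each kernel lives at a different level of the memory hierarchy, which is why the measured curves on the preceding slides separate so clearly between Ethernet and InfiniBand for FFT and RandomAccess but not for Top500 and STREAM.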
Slide 24: HPC Challenge Comparison (InfiniBand, Problem Size = 4000 MB/node)
[Figure: effective bandwidth in words/second (mega to tera scale) versus CPUs (1 to 416) for Top500, STREAM, FFT, and RandomAccess, annotated with the memory-hierarchy level each benchmark exercises.]
• All results are in words/second
• Highlights the memory hierarchy

Slide 25: Summary
• HPEC software design workflow: develop serial code on the desktop, parallel code on the cluster, and embedded code on the embedded computer, across algorithm development, verification, and data analysis
• Requirements: low barrier-to-entry, interactive, immediate execution, high performance
• Interactive, on-demand computing: parallel MATLAB (pMatlab) and user statistics
• High performance: HPC Challenge (plus Iozone) and results

Slide 26: Backup Slides

Slide 27: TX-2500: Analysis of Potential Bottlenecks
[Figure: block diagram of a Dell PowerEdge 2850 node and its network paths, annotated with peak bandwidths. Two 3.2 GHz "Irwindale" CPUs share an 800 MHz front-side bus (6.4 GB/s); two DDR2-400 memory channels (3.2 GB/s each) feed 8 GB of RAM through the "Lindenhurst" MCH north bridge, which connects over a 1.5 GB/s Hublink to the ICH5 south bridge (USB, video, etc.). PCI Express x8 (4.0 GB/s) and x4 (2.0 GB/s) links carry an InfiniBand HBA (1.8 GB/s) and a PERC4e/DC RAID controller (256 MB cache) on U320 SCSI (320 MB/s) to six 10k-RPM disks (~40 MB/s each). Dual Gigabit Ethernet interfaces (0.18 GB/s) connect to Nortel 5510 rack switches and Nortel 5530 central switches (5 GB/s stack backplanes); the InfiniBand HBA connects to Cisco SFS 7000P rack switches (60 GB/s backplane) and Cisco SFS 7008P core switches (240 GB/s backplane).]

Slide 28: Spatial/Temporal Locality Results
[Figure: temporal locality score versus spatial locality score (MetaSim data from Snavely et al., SDSC) for the HPC Challenge kernels (HPL/Top500, STREAM, FFT, RandomAccess/GUPS), large scientific applications (Avus, cth7, Gamess, lammps, overflow2, wrf), and intelligence, surveillance, and reconnaissance applications.]
• HPC Challenge bounds real applications
  – Allows us to map between applications and benchmarks

Slide 29: HPL "Top500" Benchmark
• High Performance Linpack (HPL) solves a system Ax = b
• The core operation is an LU factorization of a large M×M matrix
• Results are reported in floating-point operations per second (flops)
[Figure: the parallel algorithm distributes A, L, and U over a 2D block-cyclic layout across processes P0 through P3, mapped against the memory hierarchy from registers down to disk.]
• A 2D block-cyclic distribution is used for load balancing (see the ownership sketch below)
• Linear system solver (requires all-to-all communication)
• Stresses local matrix-multiply performance
• DARPA HPCS goal: 2 petaflops (8x over the current best)
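The following MATLAB sketch is an illustration of the 2D block-cyclic ownership rule named above, not code from HPL: block (i, j) of the blocked matrix is owned by process (mod(i, Pr), mod(j, Pc)) on a Pr-by-Pc process grid, so every process keeps a share of the active trailing submatrix as the factorization sweeps across the matrix.

    Pr = 2;  Pc = 2;                    % 2x2 process grid, as in the slide's figure
    nblk = 8;                           % an 8x8 grid of matrix blocks (arbitrary)

    owner = zeros(nblk, nblk);          % owner(i+1, j+1) = process id owning block (i, j)
    for i = 0:nblk-1
        for j = 0:nblk-1
            owner(i+1, j+1) = mod(i, Pr) + Pr * mod(j, Pc);   % column-major process numbering
        end
    end
    disp(owner)                         % reproduces the P0 P2 / P1 P3 tiling shown in the figure

Because ownership cycles in both dimensions, the panels eliminated early in the factorization do not leave any subset of processes idle, which is the load-balancing point the slide makes.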
Slide 30: STREAM Benchmark
• Performs a scalar multiply and add
• Results are reported in bytes/second
[Figure: the parallel algorithm computes A = B + s×C with each vector split into blocks owned by processes 0 through Np-1, mapped against the memory hierarchy.]
• Basic operations on large vectors (requires no communication)
• Stresses local processor-to-memory bandwidth
• DARPA HPCS goal: 6.5 petabytes/second (40x over the current best)

Slide 31: FFT Benchmark
• 1D Fast Fourier Transform of an N-element complex vector
• Typically done as a parallel 2D FFT
• Results are reported in floating-point operations per second (flops)
[Figure: the parallel algorithm FFTs the rows, performs a corner turn (all-to-all transpose), then FFTs the columns, with the data distributed across processes 0 through Np-1.]
• FFT of a large complex vector (requires all-to-all communication)
• Stresses interprocessor communication of large messages
• DARPA HPCS goal: 0.5 petaflops (200x over the current best)

Slide 32: RandomAccess Benchmark
• Randomly updates an N-element table of unsigned integers
• Each processor generates indices, sends them to all other processors, and performs the XOR updates
• Results are reported in Giga Updates Per Second (GUPS)
[Figure: the parallel algorithm distributes the table across processes 0 through Np-1; each process generates random indices, then sends, XORs, and updates.]
• Randomly updates memory (requires all-to-all communication)
• Stresses interprocessor communication of small messages
• DARPA HPCS goal: 64,000 GUPS (2000x over the current best)

Slide 33: Iozone File System Benchmark
• File system benchmark tool that can measure a large variety of file system characteristics
• We benchmarked:
  – Read and write throughput performance (MB/s)
  – 16 GB files (to test the RAID, not the caches)
  – 64 kB blocks (best performance on this hardware)
  – On the six-disk hardware RAID set
  – On all 432 compute nodes
[Figure: the local-disk memory hierarchy (CPUs, local cache, local memory, disk cache, disks); our Iozone tests exercise the disk cache and disks.]
(A short sizing check follows below.)
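A back-of-the-envelope check of that test sizing, using derived arithmetic rather than figures from the original slides (1 MB is taken as 10^6 bytes for the bandwidth number): a 16 GB file is twice the 8 GB of RAM in each node, so the page cache cannot hold it and the measurement reflects the RAID set itself.

    file_bytes   = 16 * 2^30;             % 16 GB test file
    record_bytes = 64 * 2^10;             % 64 kB record size
    ram_bytes    = 8 * 2^30;              % RAM per node

    n_records      = file_bytes / record_bytes   % 262,144 records per pass
    file_vs_ram    = file_bytes / ram_bytes      % 2x RAM, defeats the page cache
    secs_at_50MBps = file_bytes / 50e6           % ~344 s per pass at the 50 MB/s
                                                 % floor reported on the Iozone results slide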