Slide 1: HPCS HPCchallenge Benchmark Suite
David Koester, Ph.D. (MITRE), Jack Dongarra (UTK), Piotr Luszczek (ICL/UTK)
28 September 2004

Slide 2: Outline
• Brief DARPA HPCS Overview
• Architecture/Application Characterization
• HPCchallenge Benchmarks
• Preliminary Results
• Summary

Slide 3: High Productivity Computing Systems
Goal: create a new generation of economically viable computing systems and a procurement methodology for the security/industrial community (2007-2010).
Impact:
• Performance (time-to-solution): speed up critical national security applications by a factor of 10X to 40X
• Programmability (idea-to-first-solution): reduce the cost and time of developing application solutions
• Portability (transparency): insulate research and operational application software from the system
• Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors
HPCS program focus areas / applications: intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology.
Fill the critical technology and capability gap: from today (late-1980s HPC technology) to the future (quantum/bio computing).

Slide 4: High Productivity Computing Systems: Program Overview
Create a new generation of economically viable computing systems and a procurement methodology for the security/industrial community (2007-2010).
[Program roadmap diagram: Phase 1 (concept study, new evaluation framework); Phase 2 (2003-2005: advanced design and prototypes, test evaluation framework), with a half-way-point Phase 2 technology assessment review; Phase 3 (2006-2010: full-scale development by vendors of petaflop/s-scale systems and a validated procurement evaluation methodology).]

Slide 5: HPCS Program Goals‡
• HPCS overall productivity goals:
  – Execution (sustained performance): 1 Petaflop/sec, scalable to greater than 4 Petaflop/sec. Reference: production workflow.
  – Development: 10X over today's systems. Reference: lone researcher and enterprise workflows.
• Productivity framework:
  – Baselined for today's systems
  – Successfully used to evaluate the vendors' emerging productivity techniques
  – Provides a solid reference for evaluation of the vendors' proposed Phase III designs
• Subsystem performance indicators:
  1) 2+ PF/s LINPACK
  2) 6.5 PB/sec data STREAM bandwidth
  3) 3.2 PB/sec bisection bandwidth
  4) 64,000 GUPS
‡ Bob Graybill (DARPA/IPTO); emphasis added.

Slide 6: Outline (next section: Architecture/Application Characterization)

Slide 7: Processor-Memory Performance Gap
[Chart, 1980-2000, performance scale 1 to 1000: processor performance ("Moore's Law," about 60% per year) versus DRAM performance (about 7% per year); the processor-memory performance gap grows roughly 50% per year.]
• Alpha 21264 full cache miss, in instructions executed: 180 ns / 1.7 ns = 108 clocks, x 4-way issue = 432 instructions
• Caches in the Pentium Pro: 64% of area, 88% of transistors
(Taken from a Patterson-Keeton talk to SIGMOD.)

Slide 8: Processing vs. Memory Access
• Doesn't cache solve this problem?
  – It depends. With small amounts of contiguous data, usually; with large amounts of non-contiguous data, usually not.
  – In most computers the programmer has no control over cache.
  – Often "a few" Bytes/FLOP is considered OK.
• However, consider operations on the transpose of a matrix (e.g., for adjoint problems): Xa = b versus X^T a = b.
  – If X is big enough, 100% cache misses are guaranteed, and we need at least 8 Bytes/FLOP (assuming a and b can be held in cache).
• Latency and limited bandwidth of processor-memory and node-node communications are major limiters of performance for scientific computation.
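The effect described above is easy to reproduce with a minimal C sketch (not part of any benchmark suite; the matrix size, the use of clock(), and all names are illustrative assumptions). The same matrix-vector arithmetic is done twice: once streaming along rows of a row-major matrix (Xa = b, unit stride) and once along columns, i.e., effectively on the transpose (X^T a = b, stride N).

/* A minimal sketch (illustrative only) contrasting the two access patterns
 * discussed above: b = X a walks the row-major matrix with unit stride,
 * while b = X^T a walks it with a stride of N doubles. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096  /* N*N doubles = 128 MB, much larger than any cache */

int main(void)
{
    double *X = malloc((size_t)N * N * sizeof *X);
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    if (!X || !a || !b) return 1;

    for (size_t i = 0; i < N; i++) {
        a[i] = 1.0;
        for (size_t j = 0; j < N; j++)
            X[i * N + j] = (double)(i + j);
    }

    /* b = X a: the inner loop has unit stride, so caches and prefetch help */
    clock_t t0 = clock();
    for (size_t i = 0; i < N; i++) {
        double s = 0.0;
        for (size_t j = 0; j < N; j++)
            s += X[i * N + j] * a[j];
        b[i] = s;
    }
    clock_t t1 = clock();

    /* b = X^T a: the inner loop strides by N doubles, so spatial locality
     * is lost and, with X far larger than cache, references to X mostly miss */
    for (size_t j = 0; j < N; j++) {
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            s += X[i * N + j] * a[i];
        b[j] = s;
    }
    clock_t t2 = clock();

    printf("X   a: %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("X^T a: %.2f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(X); free(a); free(b);
    return 0;
}

On a machine where X is far larger than cache, the second loop typically runs several times slower even though it performs exactly the same arithmetic.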
Slide 9: Processing vs. Memory Access: High Performance LINPACK
Consider another benchmark, Linpack: Ax = b. Solve this linear equation for the vector x, where A is a known matrix and b is a known vector.
• Linpack uses the BLAS routines, which divide A into blocks.
• On average, Linpack requires 1 memory reference for every 2 FLOPs, or 4 Bytes/FLOP.
• Many of these can be cache references.

Slide 10: Processing vs. Memory Access: STREAM TRIAD
Consider the simple benchmark STREAM TRIAD: a(i) = b(i) + q * c(i), where a(i), b(i), and c(i) are vectors and q is a scalar.
• The vector length is chosen to be much longer than the cache size.
• Each iteration performs 2 memory loads, 1 memory store, and 2 FLOPs, i.e., 12 Bytes/FLOP (assuming 64-bit data).
• No computer has enough memory bandwidth to reference 12 Bytes for each FLOP!

Slide 11: Processing vs. Memory Access: RandomAccess
• A table T of 64-bit words, of length 2^n, fills roughly 1/2 of memory.
• A data-driven stream of 64-bit values {a_i}, of length 2^(n+2), is generated; for each a_i the address k is taken from the highest n bits of a_i (bit field <63, 64-n>), and the update is a bit-level exclusive OR: T[k] = T[k] XOR a_i.
• The expected number of accesses per memory location T[k] is E[T[k]] = 2^(n+2) / 2^n = 4.
• Because exclusive OR is commutative and associative, the updates may be processed in any order.
• Acceptable error: 1%. Look-ahead and storage: up to 1024 updates per "node".
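In code, the serial form of this update loop is essentially the sketch below. It is illustrative only: the table is kept small, and the stand-in pseudo-random stream (a simple 64-bit LCG) is an assumption, not the generator specified by the benchmark.

/* A serial sketch of the RandomAccess (GUPS) update loop described above.
 * Illustrative only: the random stream below is NOT the generator used by
 * the benchmark, and the table size n is kept small. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LOG2_TABLE_SIZE 20                       /* n: table has 2^n words */
#define TABLE_SIZE      (1ULL << LOG2_TABLE_SIZE)
#define NUM_UPDATES     (4 * TABLE_SIZE)         /* stream length 2^(n+2)  */

/* Stand-in 64-bit pseudo-random stream (an assumption, not the HPCC
 * generator): a simple LCG is enough to show the access pattern. */
static uint64_t next_ai(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return *state;
}

int main(void)
{
    uint64_t *T = malloc(TABLE_SIZE * sizeof *T);
    if (!T) return 1;

    for (uint64_t k = 0; k < TABLE_SIZE; k++)
        T[k] = k;                                /* initialize the table   */

    uint64_t state = 1;
    for (uint64_t i = 0; i < NUM_UPDATES; i++) {
        uint64_t ai = next_ai(&state);
        uint64_t k  = ai >> (64 - LOG2_TABLE_SIZE);  /* highest n bits     */
        T[k] ^= ai;                              /* data-driven XOR update */
    }

    /* Each location is updated 2^(n+2)/2^n = 4 times on average; because
     * XOR is commutative and associative, updates may be applied in any
     * order (the benchmark allows up to 1024 to be in flight). */
    printf("T[0] = %llu\n", (unsigned long long)T[0]);
    free(T);
    return 0;
}

Measured in giga-updates per second (GUPS), this loop is the kernel behind the 64,000 GUPS subsystem performance indicator on Slide 5.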
Slide 12: Bounding Mission Partner Applications
[Figure: the kernels plotted by spatial and temporal locality. RandomAccess sits at low spatial/low temporal locality, FFT at low spatial/higher temporal locality, STREAM at high spatial/low temporal locality, and HPL and PTRANS at high spatial/high temporal locality; mission partner applications, and the HPCS productivity design points, fall within the region these kernels bound.]

Slide 13: Outline (next section: HPCchallenge Benchmarks)

Slide 14: HPCS HPCchallenge Benchmarks
• The HPCchallenge benchmarks are being developed by Jack Dongarra (ICL/UT).
• Funded by the DARPA High Productivity Computing Systems (HPCS) program (Bob Graybill, DARPA/IPTO).
• Purpose: to examine the performance of High Performance Computer (HPC) architectures using kernels with more challenging memory access patterns than High Performance Linpack (HPL).

Slide 15: HPCchallenge Goals
• To examine the performance of HPC architectures using kernels with more challenging memory access patterns than HPL. HPL works well on all architectures, even cache-based, distributed-memory multiprocessors, due to:
  1. Extensive memory reuse
  2. Scalability with respect to the amount of computation
  3. Scalability with respect to the communication volume
  4. Extensive optimization of the software
• To complement the Top500 list.
• To provide benchmarks that bound the performance of many real applications as a function of memory access characteristics, e.g., spatial and temporal locality.

Slide 16: HPCchallenge Benchmarks
Local benchmarks:
• DGEMM (matrix x matrix multiply)
• STREAM (COPY, SCALE, ADD, TRIAD)
• EP-RandomAccess
• 1D FFT
Global benchmarks:
• High Performance LINPACK (HPL)
• PTRANS (parallel matrix transpose)
• G-RandomAccess
• 1D FFT
• b_eff (interprocessor bandwidth and latency)
HPCchallenge pushes spatial and temporal boundaries and sets performance bounds. Available for download at http://icl.cs.utk.edu/hpcc/.

Slide 17: Web Site (http://icl.cs.utk.edu/hpcc/)
Pages: Home, Rules, News, Download, FAQ, Links, Collaborators, Sponsors, Upload, Results.

Slide 18: Outline (next section: Preliminary Results)

Slide 19: Preliminary Results: Machine List (1 of 2)
Affiliation | Manufacturer | System | Processor Type | Procs
U Tenn | Atipa Cluster AMD 128 procs | Conquest cluster | AMD Opteron | 128
AHPCRC | Cray X1 124 procs | X1 | Cray X1 MSP | 124
AHPCRC | Cray X1 124 procs | X1 | Cray X1 MSP | 124
AHPCRC | Cray X1 124 procs | X1 | Cray X1 MSP | 124
ERDC | Cray X1 60 procs | X1 | Cray X1 MSP | 60
ERDC | Cray X1 60 procs | X1 | Cray X1 MSP | 60
ORNL | Cray X1 252 procs | X1 | Cray X1 MSP | 252
ORNL | Cray X1 252 procs | X1 | Cray X1 MSP | 252
AHPCRC | Cray X1 120 procs | X1 | Cray X1 MSP | 120
ORNL | Cray X1 64 procs | X1 | Cray X1 MSP | 64
AHPCRC | Cray T3E 1024 procs | T3E | Alpha 21164 | 1024
ORNL | HP zx6000 Itanium 2 128 procs | Integrity zx6000 | Intel Itanium 2 | 128
PSC | HP AlphaServer SC45 128 procs | AlphaServer SC45 | Alpha 21264B | 128
ERDC | HP AlphaServer SC45 484 procs | AlphaServer SC45 | Alpha 21264B | 484

Slide 20: Preliminary Results: Machine List (2 of 2)
Affiliation | Manufacturer | System | Processor Type | Procs
IBM | IBM 655 Power4+ 64 procs | eServer pSeries 655 | IBM Power 4+ | 64
IBM | IBM 655 Power4+ 128 procs | eServer pSeries 655 | IBM Power 4+ | 128
IBM | IBM 655 Power4+ 256 procs | eServer pSeries 655 | IBM Power 4+ | 256
NAVO | IBM p690 Power4 504 procs | p690 | IBM Power 4 | 504
ARL | IBM SP Power3 512 procs | RS/6000 SP | IBM Power 3 | 512
ORNL | IBM p690 Power4 256 procs | p690 | IBM Power 4 | 256
ORNL | IBM p690 Power4 64 procs | p690 | IBM Power 4 | 64
ARL | Linux Networx Xeon 256 procs | Powell | Intel Xeon | 256
U Manchester | SGI Altix Itanium 2 32 procs | Altix 3700 | Intel Itanium 2 | 32
ORNL | SGI Altix Itanium 2 128 procs | Altix | Intel Itanium 2 | 128
U Tenn | SGI Altix Itanium 2 32 procs | Altix | Intel Itanium 2 | 32
U Tenn | SGI Altix Itanium 2 32 procs | Altix | Intel Itanium 2 | 32
U Tenn | SGI Altix Itanium 2 32 procs | Altix | Intel Itanium 2 | 32
U Tenn | SGI Altix Itanium 2 32 procs | Altix | Intel Itanium 2 | 32
NASA ASC | SGI Origin 3900 R16K 256 procs | Origin 3900 | SGI MIPS R16000 | 256
U Aachen/RWTH | Sun Fire 15K 128 procs | Sun Fire 15k/6800 SMP-Cluster | Sun UltraSparc III | 128
OSC | Voltaire Cluster Xeon 128 procs | Pinnacle 2X200 Cluster | Intel Xeon | 128
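One note before the result charts: the PTRANS numbers on the later slides measure the operation A = A + B^T with A and B distributed across all processors. A single-process sketch of that operation follows (matrix size and names are illustrative assumptions; the real benchmark's cost is dominated by the all-to-all communication of the distributed transpose, which a serial sketch cannot show).

/* A single-process sketch of the PTRANS operation A <- A + B^T.
 * Illustrative only: the benchmark distributes A and B across processes
 * in a 2-D block layout, so its interesting cost is interprocessor
 * communication, which this serial version omits. */
#include <stdio.h>
#include <stdlib.h>

#define N 2048

int main(void)
{
    double *A = malloc((size_t)N * N * sizeof *A);
    double *B = malloc((size_t)N * N * sizeof *B);
    if (!A || !B) return 1;

    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            A[i * N + j] = 1.0;
            B[i * N + j] = (double)(i * N + j);
        }

    /* A(i,j) += B(j,i): one operand is read with stride N, so even the
     * serial kernel stresses memory bandwidth rather than the FPU. */
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            A[i * N + j] += B[j * N + i];

    printf("A[0] = %g\n", A[0]);
    free(A); free(B);
    return 0;
}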
Slide 21: STREAM TRIAD vs. HPL, 120-128 Processors
[Chart, 0 to 2.5 Tflop/s: EP-STREAM TRIAD (a(i) = b(i) + q * c(i)) versus HPL (Ax = b) for the 120-128 processor systems in the machine list.]

Slide 22: STREAM TRIAD vs. HPL, >=252 Processors
[Chart, 0 to 2.5 Tflop/s: EP-STREAM TRIAD versus HPL for the systems with 252 or more processors in the machine list.]

Slide 23: STREAM ADD vs. PTRANS, 60-128 Processors
[Chart, logarithmic scale 0.1 to 10,000 GB/s: EP-STREAM ADD bandwidth (a(i) = b(i) + c(i)) versus PTRANS bandwidth (A = A + B^T) for the 60-128 processor systems in the machine list.]

Slide 24: STREAM ADD vs. PTRANS, >=252 Processors
[Chart, logarithmic scale 0.1 to 10,000 GB/s: EP-STREAM ADD versus PTRANS bandwidth for the systems with 252 or more processors in the machine list.]

Slide 25: Outline (next section: Summary)

Slide 26: Summary
• DARPA HPCS subsystem performance indicators:
  – 2+ PF/s LINPACK
  – 6.5 PB/sec data STREAM bandwidth
  – 3.2 PB/sec bisection bandwidth
  – 64,000 GUPS
• It is important to understand architecture/application characterization: where did all the lost "Moore's Law" performance go?
• HPCchallenge Benchmarks: http://icl.cs.utk.edu/hpcc/
  – Peruse the results!
  – Contribute!
[Figure repeated from Slide 12: the kernels plotted by spatial and temporal locality, bounding mission partner applications and the HPCS productivity design points.]