An Overview of High Performance Computing and Challenges for the Future
Jack Dongarra
Innovative Computing Laboratory, University of Tennessee
Oak Ridge National Laboratory
University of Manchester
3/23/2016

Outline
• Top500 results
• Four important concepts that will affect math software:
  - Effective use of many-core
  - Exploiting mixed precision in our numerical computations
  - Self-adapting / auto-tuning of software
  - Fault-tolerant algorithms

The TOP500 List (H. Meuer, H. Simon, E. Strohmaier, & JD)
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK, a dense Ax = b problem (TPP performance); a rough sketch of what this measures follows the interconnect summary below
- Updated twice a year: at SC'xy in the States in November and at the meeting in Germany in June
- All data available from www.top500.org

Performance Development
[Chart, 1993-2007, log scale from 100 Mflop/s to beyond 1 Pflop/s: total performance (SUM), the #1 system (N=1), and the #500 system (N=500). In 1993: SUM 1.17 TF/s, N=1 59.7 GF/s (Fujitsu 'NWT'), N=500 0.4 GF/s ('My Laptop' marked for comparison). In 2007: SUM 4.92 PF/s, N=1 281 TF/s (IBM BlueGene/L), N=500 4.0 TF/s. Successive #1 systems marked include Intel ASCI Red, IBM ASCI White, and the NEC Earth Simulator; a 6-8 year lag is annotated on the chart.]

29th List: The TOP10 (June 2007)
1. BlueGene/L eServer Blue Gene (IBM) - DOE/NNSA/LLNL, USA - 2005 - 131,072 procs - Rmax 280.6 TF/s
2. Jaguar Cray XT3/XT4 (Cray) - DOE/ORNL, USA - 2007 - 23,016 procs - 101.7 TF/s
3. Red Storm Cray XT3 (Sandia/Cray) - DOE/NNSA/Sandia, USA - 2006 - 26,544 procs - 101.4 TF/s
4. BGW eServer Blue Gene (IBM) - IBM Thomas Watson, USA - 2005 - 40,960 procs - 91.29 TF/s
5. New York Blue eServer Blue Gene (IBM) - Stony Brook/BNL, USA - 2007 - 36,864 procs - 82.16 TF/s
6. ASC Purple eServer pSeries p575 (IBM) - DOE/NNSA/LLNL, USA - 2005 - 12,208 procs - 75.76 TF/s
7. BlueGene/L eServer Blue Gene (IBM) - Rensselaer Polytechnic Institute/CCNI, USA - 2007 - 32,768 procs - 73.03 TF/s
8. Abe PowerEdge 1955, Infiniband (Dell) - NCSA, USA - 2007 - 9,600 procs - 62.68 TF/s
9. MareNostrum JS21 Cluster, Myrinet (IBM) - Barcelona Supercomputing Center, Spain - 2006 - 12,240 procs - 62.63 TF/s
10. HLRB-II SGI Altix 4700 (SGI) - LRZ, Germany - 2007 - 9,728 procs - 56.52 TF/s
(www.top500.org, 29th list, June 2007)

Performance Projection
[Chart: extrapolation of the SUM, N=1, and N=500 trend lines from 1993 out past 2015, on a log scale running from 100 Mflop/s to 1 Eflop/s.]

Cores per System - June 2007
[Chart: number of systems per core-count bin, from 33-64 cores up to 64k-128k cores.]

[Chart: Rmax (Gflop/s) across the 500 ranked systems. 14 systems exceed 50 Tflop/s, 88 systems exceed 10 Tflop/s, and 326 systems exceed 5 Tflop/s.]

Chips Used in Each of the 500 Systems
[Pie chart: Intel EM64T 46%, Intel IA-32 6%, Intel IA-64 6% (58% Intel overall); IBM Power 17%; AMD x86_64 21%; HP PA-RISC 2%; Sun Sparc 1%; NEC 1%; HP Alpha and Cray under 1%. Intel + IBM + AMD account for 96% of systems.]

Interconnects / Systems
[Chart, 1993-2007: interconnect families across the 500 systems, including Gigabit Ethernet (206 systems in June 2007), Infiniband (128), Myrinet (46), Quadrics, Crossbar, SP Switch, Cray interconnects, and others/N/A. GigE + Infiniband + Myrinet = 74%.]
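To make the Top500 yardstick concrete: the list ranks machines by Rmax, the best rate achieved on the LINPACK dense Ax = b benchmark. The sketch below is mine, not the HPL benchmark code; it simply times a dense solve with NumPy and converts it to Gflop/s using the standard LINPACK operation count. The problem size and random test matrix are arbitrary choices for illustration.

```python
# Rough illustration of the LINPACK yardstick: time a dense LU-based solve
# of Ax = b and report Gflop/s.  The real HPL benchmark is distributed,
# blocked, and far more careful; this only shows the arithmetic involved.
import time
import numpy as np

n = 2000                                  # illustrative size; HPL uses the largest n that fits
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

t0 = time.perf_counter()
x = np.linalg.solve(A, b)                 # LU factorization + triangular solves
t1 = time.perf_counter()

flops = 2.0 / 3.0 * n**3 + 2.0 * n**2     # standard LINPACK operation count
print(f"n = {n}: {flops / (t1 - t0) / 1e9:.1f} Gflop/s")
print("scaled residual:",
      np.linalg.norm(b - A @ x) / (np.linalg.norm(A, 1) * np.linalg.norm(x)))
```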
Countries / Systems (Italian sites on the 29th list)
Rank - Site - Manufacturer - Computer - Procs - Rmax (Gflop/s) - Segment - Interconnect
66 - CINECA - IBM - eServer 326 Opteron Dual - 5,120 - 12,608 - Academic - Infiniband
132 - SCS S.r.l. - HP - Cluster Platform 3000 Xeon - 1,024 - 7,987.2 - Research - Infiniband
271 - Telecom Italia - HP - SuperDome 875 MHz - 3,072 - 5,591 - Industry - Myrinet
295 - Telecom Italia - HP - Cluster Platform 3000 Xeon - 740 - 5,239 - Industry - GigE
305 - Esprinet - HP - Cluster Platform 3000 Xeon - 664 - 5,179 - Industry - GigE

Power is an Industry-Wide Problem
• Google is siting facilities to leverage hydroelectric power (old aluminum plants); it runs more than 500,000 servers worldwide.
• "Hiding in Plain Sight, Google Seeks More Power", John Markoff, New York Times, June 14, 2006; new Google plant in The Dalles, Oregon.

Gflop/s per kWatt in the Top 20
[Chart: Gflop/s per kWatt for each of the Top 20 systems, on a scale from 0 to 300.]

IBM BlueGene/L, #1
• 131,072 cores; a total of 33 systems in the Top500.
• 1.6 MWatts (roughly 1,600 homes); 43,000 ops/s per person.
• Packaging hierarchy:
  - Chip (2 processors): 2.8/5.6 GF/s, 4 MB cache, about 17 watts
  - Compute card (2 chips, 2x1x1): 5.6/11.2 GF/s, 1 GB DDR
  - Node board (32 chips, 4x4x2; 16 compute cards, 64 processors): 90/180 GF/s, 16 GB DDR
  - Rack (32 node boards, 8x8x16; 2,048 processors): 2.9/5.7 TF/s, 0.5 TB DDR
  - System (64 racks, 64x32x32; 131,072 processors): 180/360 TF/s, 32 TB DDR
• The BlueGene/L compute ASIC includes all networking and processor functionality. Each compute ASIC contains two 32-bit superscalar PowerPC 440 embedded cores (L1 cache coherence is not maintained between the cores).
• "Fastest Computer": BG/L at 700 MHz, 131K processors, 64 racks. Peak: 367 Tflop/s; Linpack: 281 Tflop/s (77% of peak). The Linpack run used n = 1.8M and took about 13K seconds (roughly 3.6 hours).

Increase Clock Rate & Transistor Density
• The recipe has been to pack more gates into a tight knot, decrease the cycle time of the processor, and lower the voltage.
• We have seen increasing numbers of gates on a chip and increasing clock speeds.
• Heat is becoming an unmanageable problem; Intel processors now exceed 100 Watts.
• We will not see dramatic increases in clock speed in the future; however, the number of gates on a chip will continue to increase.
[Diagram: a single core with its cache evolving into dual-core and quad-core chips sharing cache.]

Power Cost of Frequency
• Power ∝ Voltage² x Frequency (V²F)
• Frequency ∝ Voltage
• Therefore Power ∝ Frequency³
(A small numerical illustration of this rule appears at the end of this multicore section.)

What's Next?
• Design points being explored: all large cores; mixed large and small cores; many small cores; all small cores; many floating-point cores; plus 3D-stacked memory and SRAM.
• Different classes of chips for different markets: home, games/graphics, business, scientific.

Novel Opportunities in Multicores
• We don't have to contend with uniprocessors.
• It is not your same old multiprocessor problem.
• How does going from multiprocessors to multicores impact programs? What changed, and where is the impact?
  - Communication bandwidth
  - Communication latency

Communication Bandwidth
• How much data can be communicated between two cores?
• What changed: number of wires, clock rate, multiplexing. Roughly a 10,000X change, from about 32 Gbit/s to about 300 Tbit/s.
• Impact on the programming model: massive data exchange is possible; data movement is not the bottleneck; processor affinity is not that important.

Communication Latency
• How long does a round-trip communication take?
• What changed: length of wire, pipeline stages. Roughly a 50X change, from about 200 cycles to about 4 cycles.
• Impact on the programming model: ultra-fast synchronization; real-time apps can run across multiple cores.
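As promised above, a small numerical illustration of the Power ∝ Frequency³ rule from the "Power Cost of Frequency" slide. The specific clock ratios and the two-core comparison are invented for illustration; throughput is idealized as cores times clock.

```python
# Illustrative numbers only: the slide's scaling rule P ~ V^2 * f with V ~ f
# gives P ~ f^3.  Compare one core at full clock against two cores at
# three-quarter clock, assuming throughput scales with cores x clock.
def relative_power(freq_ratio):
    """Dynamic power relative to the baseline clock, using P ~ f^3."""
    return freq_ratio ** 3

single_fast = 1.0 * relative_power(1.00)   # 1 core  @ 1.00x clock
dual_slower = 2.0 * relative_power(0.75)   # 2 cores @ 0.75x clock

print(f"1 core  @ 1.00x: throughput 1.00, power {single_fast:.2f}")
print(f"2 cores @ 0.75x: throughput 1.50, power {dual_slower:.2f}")
# -> roughly 1.5x the throughput for ~0.84x the power, which is the
#    argument for many slower cores instead of one ever-faster core.
```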
Intel's 80-Core Chip
• 1 Tflop/s at 62 Watts
• 1.2 TB/s internal bandwidth

NSF Track 1 - NCSA/UIUC
• $200M
• 10 Pflop/s: 40K 8-core 4 GHz IBM Power7 chips
• 1.2 PB memory; 5 PB/s global bandwidth; interconnect bandwidth of 0.55 PB/s
• 18 PB disk at 1.8 TB/s I/O bandwidth
• For use by a few people

NSF UTK/JICS Track 2 Proposal
• $65M over 5 years for a 1 Pflop/s system
• $30M over 5 years for equipment: 36 cabinets of a Cray XT5 (AMD 8 cores/chip, 12 sockets/board, 3 GHz, 4 flops/cycle/core)
• $35M over 5 years for operations: power cost about $1.1M/year, Cray maintenance about $1M/year
• To be used by the NSF community, 1000's of users; joins UCSD, PSC, TACC (last year's Track 2 award went to U of Texas)

Major Changes to Software
• We must rethink the design of our software; this is another disruptive technology, similar to what happened with cluster computing and message passing.
• Applications, algorithms, and software must be rethought and rewritten.
• Numerical libraries, for example, will change: both LAPACK and ScaLAPACK will undergo major changes to accommodate this.

A New Generation of Software: Parallel Linear Algebra Software for Multicore Architectures (PLASMA)
Algorithms follow hardware evolution in time:
• LINPACK (80's): vector operations; relies on Level-1 BLAS operations.
• LAPACK (90's): blocking, cache friendly; relies on Level-3 BLAS operations.
• PLASMA (00's): new, many-core-friendly algorithms; rely on a DAG/scheduler, block data layout, and some extra kernels.
The new algorithms
• have very fine granularity and scale very well (multicore, petascale computing, ...),
• remove many of the dependencies among tasks (multicore, distributed computing),
• avoid latency (distributed computing, out-of-core),
• and rely on fast kernels.
They need new kernels and efficient scheduling algorithms.

Steps in the LAPACK LU
• DGETF2 (LAPACK): factor a panel
• DLASWP (LAPACK): backward swap
• DLASWP (LAPACK): forward swap
• DTRSM (BLAS): triangular solve
• DGEMM (BLAS): matrix multiply

LU Timing Profile (4-processor system)
[Chart: time spent in each component (DGETF2, DLASWP(L), DLASWP(R), DTRSM, DGEMM) for a threaded LU with no lookahead, 1D decomposition, on an SGI Origin. The run proceeds in bulk-synchronous phases.]

Adaptive Lookahead - Dynamic
• Event-driven multithreading; reorganizing algorithms to use this approach (a toy blocked-LU sketch of the underlying kernel steps follows).
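To connect the DGETF2 / DTRSM / DGEMM steps above to code, here is a toy right-looking blocked LU in NumPy. This is my own sketch, not LAPACK or PLASMA: pivoting (the DLASWP steps), lookahead, and DAG scheduling are all omitted, so it is only valid for matrices that need no pivoting, such as the diagonally dominant test matrix used below.

```python
# Toy right-looking blocked LU mirroring the LAPACK step structure:
# panel factorization, triangular solve of the block row, trailing update.
import numpy as np

def blocked_lu(A, nb=64):
    """Overwrite A with its combined LU factors (L unit lower, U upper)."""
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # DGETF2-like step: unblocked elimination restricted to the panel columns
        for j in range(k, e):
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
        if e < n:
            # DTRSM-like step: solve L11 * U12 = A12 for the block row to the right
            L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            A[k:e, e:] = np.linalg.solve(L11, A[k:e, e:])
            # DGEMM-like step: trailing submatrix update A22 -= L21 @ U12
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A

# Quick check on a diagonally dominant matrix (no pivoting needed).
rng = np.random.default_rng(1)
n = 300
M = rng.standard_normal((n, n)) + n * np.eye(n)
LU = blocked_lu(M.copy())
L = np.tril(LU, -1) + np.eye(n)
U = np.triu(LU)
print("relative factorization error:", np.linalg.norm(M - L @ U) / np.linalg.norm(M))
```

In this fork-join form, each step finishes before the next begins; the PLASMA approach described above instead exposes each panel, solve, and update call as a task in a DAG so independent tasks from different steps can overlap.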
Fork-Join vs. Dynamic Execution
[Diagram: with fork-join parallel BLAS, every step is a parallel region followed by a synchronization, so threads sit idle at each join; with DAG-based dynamic scheduling, tasks from different steps overlap and that idle time is saved.]
Experiments on Intel's quad-core Clovertown with 2 sockets (8 threads).

With the Hype on Cell & PS3 We Became Interested
• The PlayStation 3's CPU is based on the "Cell" processor.
• Each Cell contains a PowerPC processor and 8 SPEs. (An SPE is a processing unit: SPU + DMA engine.) An SPE is a self-contained vector processor which acts independently from the others.
• Each SPE has 4-way SIMD floating-point units capable of 25.6 Gflop/s at 3.2 GHz, for a peak of 204.8 Gflop/s.
• The catch: that is for 32-bit floating point (single precision, SP). 64-bit floating point runs at 14.6 Gflop/s total for all 8 SPEs: divide the SP peak by 14, a factor of 2 because of DP and a factor of 7 because of latency issues.

Performance of Single Precision on Conventional Processors
• We realized we have a similar situation on our commodity processors: SP is about 2X as fast as DP on many systems.
• The Intel Pentium and AMD Opteron have SSE2: 2 flops/cycle in DP, 4 flops/cycle in SP.
• The IBM PowerPC has AltiVec: 8 flops/cycle SP, 4 flops/cycle DP, but no DP in AltiVec itself.
• Single precision is faster because of higher parallelism in the SSE/vector units, reduced data motion, and higher locality in cache.

Processor - SGEMM/DGEMM (n=3000) - SGEMV/DGEMV (n=5000)
AMD Opteron 246 - 2.00 - 1.70
UltraSparc-IIe - 1.64 - 1.66
Intel PIII Coppermine - 2.03 - 2.09
PowerPC 970 - 2.04 - 1.44
Intel Woodcrest - 1.81 - 2.18
Intel XEON - 2.04 - 1.82
Intel Centrino Duo - 2.71 - 2.21

32 or 64 bit Floating Point Precision?
• A long time ago 32-bit floating point was the norm; it is still used in scientific apps, but in a limited way.
• Most apps use 64-bit floating point:
  - Accumulation of round-off error: a 10 Tflop/s computer running for 4 hours performs more than 1 Exaflop (10^18) of operations.
  - Ill-conditioned problems; the IEEE SP exponent has too few bits (8 bits, roughly 10^±38).
  - Critical sections need higher precision; sometimes extended precision (128-bit floating point) is needed.
• However, some codes can get by with 32-bit floating point in some parts.
• Mixed precision is a possibility: approximate in lower precision and then refine or improve the solution to high precision.

Idea Goes Something Like This...
• Exploit 32-bit floating point as much as possible, especially for the bulk of the computation.
• Correct or update the solution with selective use of 64-bit floating point to provide a refined result.
• Intuitively: compute a 32-bit result, calculate a correction to the 32-bit result using selected higher precision, and perform the update of the 32-bit result with the correction in high precision.

Mixed-Precision Iterative Refinement
• Iterative refinement for dense systems, Ax = b, can work this way:
  L U = lu(A)            SINGLE   O(n^3)
  x = L\(U\b)            SINGLE   O(n^2)
  r = b - Ax             DOUBLE   O(n^2)
  WHILE || r || not small enough
      z = L\(U\r)        SINGLE   O(n^2)
      x = x + z          DOUBLE   O(n^1)
      r = b - Ax         DOUBLE   O(n^2)
  END
• Wilkinson, Moler, Stewart, and Higham provide error bounds for the case where the factorization is done in SP and the residuals in DP. It can be shown that this approach computes the solution to 64-bit floating point precision.
• Requires extra storage, 1.5 times the normal total.
• The O(n^3) work is done in lower precision; only O(n^2) work is done in high precision.
• Problems arise if the matrix is ill-conditioned in single precision (condition number around 10^8 or worse).
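A minimal sketch of the refinement loop above in NumPy/SciPy, assuming a well-conditioned matrix: the factorization and the inner solves are done in float32, the residuals and updates in float64. This is illustrative code of mine, not the tuned LAPACK DSGESV routine or the Cell implementation discussed later.

```python
# Mixed-precision iterative refinement: O(n^3) LU in single precision,
# O(n^2) residuals and updates in double precision.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
    A32 = A.astype(np.float32)
    lu, piv = lu_factor(A32)                           # O(n^3) in SINGLE
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for i in range(max_iter):
        r = b - A @ x                                  # residual in DOUBLE
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, i
        z = lu_solve((lu, piv), r.astype(np.float32))  # correction in SINGLE
        x = x + z.astype(np.float64)                   # update in DOUBLE
    return x, max_iter

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n)) + n * np.eye(n)        # well-conditioned test matrix
b = rng.standard_normal(n)
x, iters = mixed_precision_solve(A, b)
print(iters, "refinement steps, relative residual:",
      np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```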
Results for Mixed-Precision Iterative Refinement for Dense Ax = b
• Architectures (BLAS libraries) tested:
  1. Intel Pentium III Coppermine (Goto)
  2. Intel Pentium III Katmai (Goto)
  3. Sun UltraSPARC IIe (Sunperf)
  4. Intel Pentium IV Prescott (Goto)
  5. Intel Pentium IV-M Northwood (Goto)
  6. AMD Opteron (Goto)
  7. Cray X1 (libsci)
  8. IBM PowerPC G5, 2.7 GHz (VecLib)
  9. Compaq Alpha EV6 (CXML)
  10. IBM SP Power3 (ESSL)
  11. SGI Octane (ATLAS)
• Single precision is faster than DP because of higher parallelism within the vector units (usually 4 ops/cycle instead of 2), reduced data motion (32-bit data instead of 64-bit data), and higher locality in cache (more data items fit in cache).
• Distributed-memory results:
  Architecture (BLAS-MPI) - # procs - n - DP Solve / SP Solve - DP Solve / Iter Ref - # iter
  AMD Opteron (Goto, OpenMPI MX) - 32 - 22,627 - 1.85 - 1.79 - 6
  AMD Opteron (Goto, OpenMPI MX) - 64 - 32,000 - 1.90 - 1.83 - 6

What about the Cell?
• The PowerPC element at 3.2 GHz: DGEMM at 5 Gflop/s; AltiVec peak at 25.6 Gflop/s, with 10 Gflop/s achieved in SGEMM.
• The 8 SPUs: 204.8 Gflop/s peak, but only for 32-bit floating point (single precision). 64-bit floating point runs at 14.6 Gflop/s total for all 8 SPEs: divide the SP peak by 14, a factor of 2 because of DP and a factor of 7 because of latency issues.

Moving Data Around on the Cell
• Each SPE has a 256 KB local store; injection bandwidth is 25.6 GB/s.
• Worst case, memory-bound operations (no reuse of data): SAXPY needs 3 data movements (2 in, 1 out) for 2 operations, so the Cell would be limited to about 4.3 Gflop/s (25.6 GB/s x 2 ops / 12 bytes), far below the arithmetic peak. The sketch below reproduces this estimate.
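The bandwidth-bound estimate on the slide above is just arithmetic; the short sketch below redoes it using the slide's 25.6 GB/s figure. The variable names and the comparison to the SP compute peak are mine.

```python
# Reproduces the bandwidth-bound estimate: a SAXPY-like operation moves
# 3 single-precision words per 2 flops, so throughput is limited by
# memory traffic rather than by the SPEs' arithmetic.
bytes_per_word = 4            # single precision
words_per_element = 3         # y = a*x + y: read x, read y, write y
flops_per_element = 2         # one multiply + one add
injection_bw = 25.6e9         # bytes/s, the slide's injection bandwidth

bytes_per_element = words_per_element * bytes_per_word
gflops = injection_bw / bytes_per_element * flops_per_element / 1e9
print(f"bandwidth-bound SAXPY estimate: {gflops:.1f} Gflop/s")   # ~4.3
# versus 8 x 25.6 = 204.8 Gflop/s of SP compute peak: memory-bound kernels
# leave almost all of the arithmetic capability idle.
```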
IBM Cell 3.2 GHz, Ax = b
[Chart: Gflop/s vs. matrix size, up to 4,500. Curves: 8 independent SGEMMs (embarrassingly parallel), SP peak (204 Gflop/s), the IBM SP Ax=b solve (0.30 s), the mixed-precision DSGESV solve (0.47 s), DP peak (15 Gflop/s), and the IBM DP Ax=b solve (3.9 s). The mixed-precision solver is 8.3X faster than the full DP solve while delivering DP accuracy.]

Cholesky on the Cell, Ax = b, A = A^T, x^T A x > 0
[Chart: single precision performance and mixed-precision performance using iterative refinement; the method achieves 64-bit accuracy.]
• On the SPEs: standard C code plus C-language SIMD extensions (intrinsics).

Cholesky Using 2 Cell Chips
[Chart: Cholesky performance across two Cell chips.]

Intriguing Potential
• Exploit lower precision as much as possible; the payoff is performance: faster floating point and less data to move.
• Automatically switch between SP and DP to match the desired accuracy: compute the solution in SP and then a correction to the solution in DP (correction = A\(b - Ax)).
• Potential for GPUs, FPGAs, and special purpose processors. What about 16-bit floating point?
• Use as little precision as you can get away with, then improve the accuracy.
• Applies to sparse direct and iterative linear systems and to eigenvalue and optimization problems where Newton's method is used.

IBM/Mercury Cell Blade
• From IBM or Mercury: 2 Cell chips, each with 8 SPEs; 512 MB per Cell; about $8K-17K; some software available.

Sony PlayStation 3 Cluster (PS3-T)
• Cell blade (from IBM or Mercury): 2 Cell chips, each with 8 SPEs, 512 MB per Cell, about $8K-17K, some software.
• PS3 (from WAL*MART): 1 Cell chip with 6 usable SPEs, 256 MB per PS3, $600, downloadable software, dual boot.

Cell Hardware Overview
• STI Cell at 3.2 GHz: one PowerPC element plus 8 SPEs at 25.6 Gflop/s each.
• 200 GB/s between the SPEs; 25 GB/s injection bandwidth; 512 MiB memory.
• 32-bit peak: 8 x 25.6 = 204.8 Gflop/s. 64-bit peak: 8 x 1.8 = 14.6 Gflop/s.

PS3 Hardware Overview
• Same Cell at 3.2 GHz, but one SPE is disabled (yield issues) and one is reserved by the GameOS hypervisor, leaving 6 SPEs.
• 25 GB/s injection bandwidth; 200 GB/s between SPEs; 1 Gb/s NIC; 256 MiB memory.
• 32-bit peak: 6 x 25.6 = 153.6 Gflop/s. 64-bit peak: 6 x 1.8 = 10.8 Gflop/s.

PlayStation 3 LU Codes
[Chart: Gflop/s vs. matrix size, up to 2,500. Curves: 6 independent SGEMMs (embarrassingly parallel), SP peak (153.6 Gflop/s), the IBM SP Ax=b solve, the mixed-precision DSGESV solve, and DP peak (10.9 Gflop/s).]

Cholesky on the PS3, Ax = b, A = A^T, x^T A x > 0
[Chart: Cholesky performance on the PS3.]

HPC in the Living Room
[Photos of the PS3 cluster.]

Matrix Multiply on a 4-Node PlayStation 3 Cluster
• What's good:
  - Very cheap: about $4 per Gflop/s (of 32-bit theoretical peak).
  - Fast local computations between the SPEs.
  - Perfect overlap between communication and computation is possible (Open MPI running): the PPE does communication via MPI while the SPEs do computation via SGEMMs.
• What's bad:
  - Gigabit network card: 1 Gb/s is too little for such computational power (150 Gflop/s per node); measured bandwidth is low (about 600 Mb/s) and network access latencies are extremely high (about 120 usec).
  - Linux can only run on top of GameOS (hypervisor).
  - Only 256 MB of local memory and only 6 SPEs.
[Timing diagram: gold = computation, 8 ms; blue = communication, 20 ms.]

Users Guide for Scientific Computing on the PS3
• SCOP3: A Rough Guide to Scientific Computing on the PlayStation 3.
• See the webpage for details.

Conclusions
• For the last decade or more, the research investment strategy has been overwhelmingly biased in favor of hardware.
• This strategy needs to be rebalanced: barriers to progress are increasingly on the software side.
• Moreover, the return on investment is more favorable to software: hardware has a half-life measured in years, while software has a half-life measured in decades.
• The high performance computing ecosystem is out of balance: hardware, OS, compilers, software, algorithms, applications.
• There is no Moore's Law for software, algorithms, and applications.

Collaborators / Support
Alfredo Buttari, UTK; Julien Langou, UColorado; Julie Langou, UTK; Piotr Luszczek, MathWorks; Jakub Kurzak, UTK; Stan Tomov, UTK