Where the World Stands on Supercomputing
Jack Dongarra
University of Tennessee / Oak Ridge National Laboratory / University of Manchester
3/14/2014

TOP500 (H. Meuer, H. Simon, E. Strohmaier, & J. Dongarra)
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from the LINPACK benchmark — solve Ax = b for a dense problem, TPP performance
- Updated twice a year: at SC'xy in the States in November, and at the meeting in Germany in June
- All data available from www.top500.org

Performance Development of HPC Over the Last 20 Years
[Chart, 1993-2013. In 2013 the sum of the list is 224 PFlop/s, N=1 is 33.9 PFlop/s, and N=500 is 118 TFlop/s; in 1993 the corresponding figures were 1.17 TFlop/s, 59.7 GFlop/s, and 400 MFlop/s. The N=500 line trails the N=1 line by roughly 6-8 years. For reference: my laptop is about 70 Gflop/s; my iPad 2 and iPhone 4s are about 1.02 Gflop/s.]

State of Supercomputing in 2014
• Pflops computing is fully established, with 31 systems.
• Three technology architecture possibilities are thriving:
  - Commodity (e.g. Intel)
  - Commodity + accelerator (e.g. GPUs)
  - Special-purpose lightweight cores (e.g. IBM BG, ARM)
• Interest in supercomputing is now worldwide, and growing in many new markets (over 50% of Top500 computers are in industry).
• Exascale projects exist in many countries and regions.

November 2013: The TOP10
Rank | Site | Computer | Country | Cores | Rmax [Pflops] | % of Peak | Power [MW] | MFlops/Watt
1 | National University of Defense Technology | Tianhe-2, NUDT, Xeon 12C 2.2 GHz + Intel Xeon Phi (57c) + Custom | China | 3,120,000 | 33.9 | 62 | 17.8 | 1905
2 | DOE / OS Oak Ridge Nat Lab | Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14c) + Custom | USA | 560,640 | 17.6 | 65 | 8.3 | 2120
3 | DOE / NNSA Livermore Nat Lab | Sequoia, BlueGene/Q (16c) + custom | USA | 1,572,864 | 17.2 | 85 | 7.9 | 2063
4 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8c) + Custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 827
5 | DOE / OS Argonne Nat Lab | Mira, BlueGene/Q (16c) + Custom | USA | 786,432 | 8.16 | 85 | 3.95 | 2066
6 | Swiss CSCS | Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler (14c) + Custom | Switzerland | 115,984 | 6.27 | 81 | 2.3 | 2726
7 | Texas Advanced Computing Center | Stampede, Dell Intel (8c) + Intel Xeon Phi (61c) + IB | USA | 204,900 | 2.66 | 61 | 3.3 | 806
8 | Forschungszentrum Juelich (FZJ) | JuQUEEN, BlueGene/Q, Power BQC 16C 1.6 GHz + Custom | Germany | 458,752 | 5.01 | 85 | 2.30 | 2178
9 | DOE / NNSA Livermore Nat Lab | Vulcan, BlueGene/Q, Power BQC 16C 1.6 GHz + Custom | USA | 393,216 | 4.29 | 85 | 1.97 | 2177
10 | Leibniz Rechenzentrum | SuperMUC, Intel (8c) + IB | Germany | 147,456 | 2.90 | 91* | 3.42 | 848
500 | Banking | HP cluster | USA | 22,212 | .118 | 50 | |

Accelerators (53 systems)
[Chart: number of accelerated systems per year, 2006-2013. In November 2013: NVIDIA K20 (16), Intel MIC (13), NVIDIA 2090 (11), NVIDIA 2050 (7), NVIDIA 2070 (4), ATI GPU (2), Clearspeed CSX600 (0), IBM PowerXCell 8i (0).]
By country: US 19, China 9, Japan 6, Russia 4, France 2, Germany 2, India 2, Italy 1, Poland 1, Australia 1, Brazil 2, Saudi Arabia 1, South Korea 1, Spain 1, Switzerland 2, UK 1.

Top500 Performance Share of Accelerators
53 of the 500 systems provide 35% of the accumulated performance.
[Chart: fraction of total TOP500 performance coming from accelerators, 2006-2013, rising from 0% to 35%.]

For the Top 500: Rank at which Half of Total Performance is Accumulated
[Chart, November 2013.] The top 16 computers have half of the computing power of the entire Top 500.
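The TOP500 yardstick above — Rmax from LINPACK — comes from timing a dense Ax = b solve and converting the standard operation count into a flop rate. A minimal illustration of that measurement in Python/NumPy (my own sketch, not the HPL code; the residual check is only in the spirit of HPL's acceptance test):

```python
import time
import numpy as np

def linpack_like(n=2000, seed=0):
    """Time a dense Ax = b solve and convert to Gflop/s using the standard
    LU operation count (2/3 n^3 + 2 n^2), as the LINPACK/HPL yardstick does."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)          # LU factorization plus triangular solves
    elapsed = time.perf_counter() - t0
    gflops = ((2.0 / 3.0) * n**3 + 2.0 * n**2) / elapsed / 1e9
    # Scaled residual check, in the spirit of HPL's acceptance test
    resid = np.linalg.norm(A @ x - b, np.inf) / (
        np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf) * n * np.finfo(float).eps)
    return gflops, resid

print(linpack_like())   # e.g. tens of Gflop/s on a laptop, with a tiny scaled residual
```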
Commodity plus Accelerator Today
• Commodity: Intel Xeon, 8 cores, 3 GHz, 8 x 4 ops/cycle — 96 Gflop/s (DP)
• Accelerator (GPU): Nvidia K20X "Kepler", 2688 "Cuda cores" (192 per SMX), 0.732 GHz, 2688 x 2/3 ops/cycle — 1.31 Tflop/s (DP), 6 GB memory
• Interconnect: PCI-e Gen2/3, 16 lanes — 64 Gb/s (8 GB/s), i.e. about 1 Gword/s

Countries Share (absolute counts)
US: 267, China: 63, Japan: 28, UK: 23, France: 22, Germany: 20.

Linpack Efficiency
[Chart, shown for several list editions: Linpack efficiency (Rmax/Rpeak) of each of the 500 systems, plotted by rank.]

#1 System on the Top500 Over the Past 21 Years (16 machines in that club)
Top500 List(s) | Computer | Rmax (Tflop/s) | n_max
6/93 (1) | TMC CM-5/1024 | 0.060 | 52,224
11/93 (1) | Fujitsu Numerical Wind Tunnel | 0.124 | 31,920
6/94 (1) | Intel XP/S140 | 0.143 | 55,700
11/94 - 11/95 (3) | Fujitsu Numerical Wind Tunnel | 0.170 | 42,000
6/96 (1) | Hitachi SR2201/1024 | 0.220 | 138,240
11/96 (1) | Hitachi CP-PACS/2048 | 0.368 | 103,680
6/97 - 6/00 (7) | Intel ASCI Red | 2.38 | 362,880
11/00 - 11/01 (3) | IBM ASCI White, SP Power3 375 MHz | 7.23 | 518,096
6/02 - 6/04 (5) | NEC Earth-Simulator | 35.9 | 1,000,000
11/04 - 11/07 (7) | IBM BlueGene/L | 478. | 1,000,000
6/08 - 6/09 (3) | IBM Roadrunner, PowerXCell 8i 3.2 GHz | 1,105. | 2,329,599
11/09 - 6/10 (2) | Cray Jaguar, XT5-HE 2.6 GHz | 1,759. | 5,474,272
11/10 (1) | NUDT Tianhe-1A, X5670 2.93 GHz + NVIDIA | 2,566. | 3,600,000
6/11 - 11/11 (2) | Fujitsu K computer, SPARC64 VIIIfx | 10,510. | 11,870,208
6/12 (1) | IBM Sequoia, BlueGene/Q | 16,324. | 12,681,215
11/12 (1) | Cray XK7 Titan, AMD + NVIDIA Kepler | 17,590. | 4,423,680
6/13 - 11/13 (2) | NUDT Tianhe-2, Intel IvyBridge & Xeon Phi | 33,862. | 9,960,000

Performance Development in Top500
[Chart: extrapolation of the sum, N=1, and N=500 trends from 1994 out to 2020; the trend lines reach 1 Eflop/s around 2020.]
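The peak numbers on the "Commodity plus Accelerator" slide above follow from cores x clock x double-precision operations issued per cycle; a two-line check of the figures quoted there:

```python
def peak_dp_gflops(cores, ghz, dp_ops_per_cycle):
    """Peak double-precision rate = cores x clock (GHz) x DP ops per core per cycle."""
    return cores * ghz * dp_ops_per_cycle

xeon = peak_dp_gflops(cores=8, ghz=3.0, dp_ops_per_cycle=4)          # 96 Gflop/s
k20x = peak_dp_gflops(cores=2688, ghz=0.732, dp_ops_per_cycle=2 / 3)  # ~1311 Gflop/s = 1.31 Tflop/s
print(xeon, k20x)
```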
Today's #1 System vs. an Exascale System Architecture (with a cap of $200M and 20 MW)
Systems | 2014 (Tianhe-2) | 2020-2022 | Difference Today & Exa
System peak | 55 Pflop/s | 1 Eflop/s | ~20x
Power | 18 MW (3 Gflops/W) | ~20 MW (50 Gflops/W) | O(1), ~15x in efficiency
System memory | 1.4 PB (1.024 PB CPU + .384 PB CoP) | 32-64 PB | ~50x
Node performance | 3.43 TF/s (.4 CPU + 3 CoP) | 1.2 or 15 TF/s | O(1)
Node concurrency | 24 cores CPU + 171 cores CoP | O(1k) or 10k | ~5x - ~50x
Node interconnect BW | 6.36 GB/s | 200-400 GB/s | ~40x
System size (nodes) | 16,000 | O(100,000) or O(1M) | ~6x - ~60x
Total concurrency | 3.12 M (12.48M threads, 4/core) | O(billion) | ~100x
MTTF | Few / day | Many / day | O(?)

DOE Exascale Computing Initiative — Proposed Timeline
[Timeline chart, FY2012-2024: research & development (Design Forward, Fast Forward; system design phase, prototype build phase, path forward phase); platform acquisitions (P0, P1, P2; node prototype, petascale prototype, exascale prototype); application development — future computer systems as a pathway toward exascale science, engineering and defense applications, with exascale co-design centers driving the design of exascale hardware and software; software technology (programming environment, resiliency, OS & runtimes); extreme-scale research programs (SC/ASCR & NNSA/ASC) providing fundamental technology.]

EU Funded: CRESTA, DEEP, & Mont-Blanc
♦ The CRESTA, DEEP and Mont-Blanc projects, with a combined funding of 25 M Euros, each study different aspects of the exascale challenge using a co-design model spanning hardware, systemware and software applications.
♦ CRESTA focuses on software, not hardware.
♦ DEEP: Computer + Booster nodes co-design.
♦ Mont-Blanc: lightweight, energy-efficient, ARM-based processors.
♦ This funding represents the first step in a sustained investment in exascale research by Europe.
Major Changes to Software & Algorithms
• Must rethink the design of our algorithms and software
  - Another disruptive technology, similar to what happened with cluster computing and message passing
  - Rethink and rewrite the applications, algorithms, and software
• Data movement is expensive; flop/s are cheap, so they are provisioned in excess

Summary
• Major challenges are ahead for extreme computing:
  - Parallelism: O(10^9)
  - Programming issues: hybrid
  - Peak and HPL may be very misleading — nowhere near close to peak for most apps
  - Fault tolerance: today Sequoia's BG/Q node failure rate is 1.25 failures/day
  - Power: 50 Gflops/W needed (today we are at 2 Gflops/W)
• We will need completely new approaches and technologies to reach the exascale level.

Evolution Over the Last 30 Years
♦ Initially, commodity PCs were decentralized systems.
♦ As the chip manufacturing process shrank to less than a micron, features started to be integrated on-die:
  - 1989: FPU (Intel 80486DX)
  - 1999: SRAM (Intel Pentium III)
  - 2009: GPU (AMD Fusion)
  - 2016: DRAM on chip (3D stacking)

Future Systems May Be Composed of Different Kinds of Cores
[Diagram: conventional DRAM chips behind a memory controller (lower latency) versus 3D-stacked DRAM cells on the memory controller (higher bandwidth).]

Future Chip Design
[Diagram only.]

Critical Issues at Peta & Exascale for Algorithm and Software Design
• Synchronization-reducing algorithms — break the fork-join model
• Communication-reducing algorithms — use methods that attain lower bounds on communication
• Mixed precision methods — 2x the speed of operations and 2x the speed of data movement
• Autotuning — today's machines are too complicated; build "smarts" into the software to adapt to the hardware
• Fault resilient algorithms — implement algorithms that can recover from failures/bit flips
• Reproducibility of results — today we can't guarantee this; we understand the issues, but some of our "colleagues" have a hard time with this

HPL - Good Things
♦ Easy to run
♦ Easy to understand
♦ Easy to check results
♦ Stresses certain parts of the system
♦ Historical database of performance information
♦ Good community outreach tool
♦ "Understandable" to the outside world
♦ "If your computer doesn't perform well on the LINPACK Benchmark, you will probably be disappointed with the performance of your application on the computer."

HPL - Bad Things
♦ The LINPACK Benchmark is 36 years old; TOP500 (HPL) is 20.5 years old
♦ Floating-point intensive: performs O(n^3) floating point operations and moves O(n^2) data — no longer so strongly correlated to real apps
♦ Reports near-peak flops (although hybrid systems see only 1/2 to 2/3 of peak)
♦ Encourages poor choices in architectural features
♦ Overall usability of a system is not measured
♦ Used as a marketing tool
♦ Decisions on acquisition are made on one number
♦ Benchmarking for days wastes a valuable resource
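The "O(n^3) flops over O(n^2) data" point above is what makes HPL unrepresentative: its flops-per-word ratio grows with the problem size, while typical sparse or irregular kernels stay near a constant. A quick calculation (operation counts are standard; the 27 nonzeros per row is just an example stencil):

```python
def hpl_flops_per_word(n):
    """HPL: ~(2/3) n^3 flops over ~n^2 matrix words -> intensity grows linearly with n."""
    return (2.0 / 3.0) * n**3 / (n * n)

def spmv_flops_per_word(nnz_per_row=27):
    """Sparse matrix-vector product: ~2 flops per nonzero, each nonzero read once."""
    return 2.0 * nnz_per_row / nnz_per_row   # ~2, independent of problem size

for n in (10_000, 100_000, 1_000_000):
    print(n, hpl_flops_per_word(n))          # 6.7e3, 6.7e4, 6.7e5 flops per matrix word
print("SpMV:", spmv_flops_per_word())        # ~2 flops per word
```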
Goals for a New Benchmark
♦ Augment the TOP500 listing with a benchmark that correlates with important scientific and technical apps not well represented by HPL.
♦ Encourage vendors to focus on architecture features needed for high performance on those important scientific and technical apps:
  - Stress a balance of floating point and communication bandwidth and latency
  - Reward investment in high performance collective ops
  - Reward investment in high performance point-to-point messages of various sizes
  - Reward investment in local memory system performance
  - Reward investment in parallel runtimes that facilitate intra-node parallelism
♦ Provide an outreach/communication tool
  - Easy to understand
  - Easy to optimize
  - Easy to implement, run, and check results
♦ Provide a historical database of performance information
  - The new benchmark should have longevity

Proposal: HPCG
♦ High Performance Conjugate Gradient (HPCG)
♦ Solves Ax = b, with A large and sparse, b known, x computed
♦ An optimized implementation of PCG contains essential computational and communication patterns that are prevalent in a variety of methods for discretization and numerical solution of PDEs
♦ Patterns:
  - Dense and sparse computations
  - Dense and sparse collectives
  - Data-driven parallelism (unstructured sparse triangular solves)
♦ Strong verification and validation properties (via spectral properties of CG)

Collaborators / Software / Support
• PLASMA: http://icl.cs.utk.edu/plasma/
• MAGMA: http://icl.cs.utk.edu/magma/
• QUARK (runtime for shared memory): http://icl.cs.utk.edu/quark/
• PaRSEC (Parallel Runtime Scheduling and Execution Control): http://icl.cs.utk.edu/parsec/
• Collaborating partners: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver

Big Data
"Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it..." — Dan Ariely

Conclusions
♦ For the last decade or more, the research investment strategy has been overwhelmingly biased in favor of hardware.
♦ This strategy needs to be rebalanced — barriers to progress are increasingly on the software side.
• The high performance ecosystem is out of balance: hardware, OS, compilers, software, algorithms, applications — and there is no Moore's Law for software, algorithms and applications.

Broad Community Support and Development of the Exascale Initiative Since 2007
http://science.energy.gov/ascr/news-and-resources/program-documents/
♦ Town Hall Meetings, April-June 2007
♦ Scientific Grand Challenges Workshops, Nov 2008 - Oct 2009:
  - Climate Science (11/08), High Energy Physics (12/08), Nuclear Physics (1/09), Fusion Energy (3/09), Nuclear Energy (5/09), Biology (8/09), Material Science and Chemistry (8/09), National Security (10/09), Cross-cutting technologies (2/10)
♦ Exascale Steering Committee
  - "Denver" vendor NDA visits (8/09), SC09 vendor feedback meetings, Extreme Architecture and Technology Workshop (12/09)
♦ International Exascale Software Project
  - Santa Fe, NM (4/09); Paris, France (6/09); Tsukuba, Japan (10/09); Oxford (4/10); Maui (10/10); San Francisco (4/11); Cologne (10/11); Kobe (4/12)
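Returning to the HPCG proposal above: the benchmark's core is a preconditioned conjugate gradient iteration over a sparse problem. A minimal Jacobi-preconditioned CG in Python/SciPy as a reference for the pattern it exercises — sparse mat-vec, dot products (collectives), preconditioner application. HPCG itself uses a 27-point 3-D stencil and a symmetric Gauss-Seidel preconditioner; the 1-D Poisson matrix and Jacobi preconditioner here are only stand-ins:

```python
import numpy as np
import scipy.sparse as sp

def pcg(A, b, M_inv_diag, tol=1e-8, maxiter=500):
    """Jacobi-preconditioned conjugate gradient for a symmetric positive definite A."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv_diag * r
    p = z.copy()
    rz = r @ z
    for _ in range(maxiter):
        Ap = A @ p                       # sparse mat-vec (neighbor communication)
        alpha = rz / (p @ Ap)            # dot product (global collective)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = M_inv_diag * r               # preconditioner application
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Hypothetical 1-D Poisson test problem, standing in for HPCG's 3-D stencil
n = 1000
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
x = pcg(A, b, M_inv_diag=1.0 / A.diagonal())
print(np.linalg.norm(A @ x - b))         # small residual
```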
Parallelization of QR Factorization
Parallelize the update (dgemm):
• Easy and done in any reasonable software
• This is the 2/3 n^3 term in the FLOPs count
• Can be done "efficiently" with LAPACK + multithreaded BLAS
[Diagram: panel factorization (dgeqf2 + dlarft) followed by the update of the remaining submatrix (dlarfb); fork-join parallelism, bulk synchronous processing.]

Synchronization (in LAPACK LU)
[Diagram: Step 1, Step 2, Step 3, Step 4, ...]
- Fork-join, bulk synchronous processing
- Versus allowing for delayed update, out-of-order, asynchronous, dataflow execution

Data Layout is Critical
• Tile data layout, where each data tile is contiguous in memory
• Decomposed into several fine-grained tasks, which better fit the memory of the small core caches

PLASMA: Parallel Linear Algebra s/w for Multicore Architectures
• Objectives: high utilization of each core; scaling to a large number of cores; shared or distributed memory
• Methodology: dynamic DAG scheduling (QUARK); explicit parallelism; implicit communication; fine granularity / block data layout
• Arbitrary DAG with dynamic scheduling — DAG-scheduled parallelism rather than fork-join parallelism
[Diagram: Cholesky on a 4x4 tile matrix; fork-join trace vs. DAG-scheduled trace over time.]

Synchronization Reducing Algorithms
• Regular trace; factorization steps pipelined; stalling only due to natural load imbalance
• Dynamic, out-of-order execution; fine-grained tasks; independent block operations
[Trace: tile QR factorization, matrix size 4000x4000, tile size 200, on an 8-socket, 6-core (48 cores total) AMD Istanbul at 2.8 GHz; the colored area over the rectangle is the efficiency.]

PowerPack 2.0
The PowerPack platform consists of software and hardware instrumentation.
Kirk Cameron, Virginia Tech; http://scape.cs.vt.edu/software/powerpack-2-0/

Power for QR Factorization
• LAPACK's QR factorization: fork-join based
• MKL's QR factorization: fork-join based
• PLASMA's conventional QR factorization: DAG based
• PLASMA's communication-reducing QR factorization: DAG based
[Measured on a dual-socket quad-core Intel Xeon E5462 (Harpertown) at 2.80 GHz (8 cores total) with MKL BLAS; the matrix is very tall and skinny (m x n = 1,152,000 by 288).]

Performance: Least Squares / Singular Values / Eigenvalues
[Performance charts (flop/s).]

Experiments on Large Core Machines
[Charts.]

Pipelining: Cholesky Inversion
3 steps: factor, invert L, multiply the L's (POTRF, TRTRI and LAUUM); 48 cores; the matrix is 4000 x 4000, tile size 200 x 200.
POTRF+TRTRI+LAUUM: 25 (7t-3); Cholesky factorization alone: 3t-2; pipelined: 18 (3t+6).

Toward a Fast Eigensolver
flops formula: n^3/3/time (higher is faster). Keeneland system, one node: 3 NVIDIA GPUs (M2090 @ 1.1 GHz, 5.4 GB) and 2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB).
• Standard approach: too many BLAS-2 ops; relies on panel factorization; bulk-sync phases; memory-bound algorithm.
• GPU-accelerated variant: the BLAS-2 GEMV is moved to the GPU and all BLAS-3 is done on the GPU; still bulk-sync phases; still memory bound.
A. Haidar, S. Tomov, J. Dongarra, T. Schulthess, and R. Solca, "A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks," ICL Technical Report, 03/2012.
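The tile data layout mentioned above ("Data Layout is Critical") amounts to copying the matrix into contiguous nb x nb blocks so that each fine-grained task touches cache-resident, contiguous memory. A minimal sketch of that copy-in step (illustrative only; PLASMA performs a similar layout translation internally):

```python
import numpy as np

def to_tiles(A, nb):
    """Copy a matrix into contiguous nb x nb tiles (tile data layout).
    Each tile then fits the small per-core caches and can be handed to one task."""
    m, n = A.shape
    assert m % nb == 0 and n % nb == 0, "sketch assumes dimensions divide evenly"
    return {(i, j): np.ascontiguousarray(A[i*nb:(i+1)*nb, j*nb:(j+1)*nb])
            for i in range(m // nb) for j in range(n // nb)}

A = np.arange(64, dtype=float).reshape(8, 8)
tiles = to_tiles(A, nb=4)
print(tiles[(1, 0)].flags["C_CONTIGUOUS"])   # True: each tile is contiguous in memory
```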
Two-Stage Approach to Tridiagonal Form (Communication Reducing)
• Reduction to band: on multicore + GPUs; performance as in the one-sided factorizations (derived from fast Level 3 BLAS)
• Band to tridiagonal: leads to "irregular" (bulge chasing) computation; done very efficiently on multicore; GPUs are used to assemble the orthogonal Q from the transformations (needed to find the eigenvectors)

Toward a Fast Eigensolver (Two-Stage)
flops formula: n^3/3/time (higher is faster); Keeneland system, one node: 3 NVIDIA M2090 GPUs + 2 x 6 Intel X5660 cores.
• Stage 1: BLAS-3, increasing computational intensity
• Stage 2: BLAS-1.5, new cache-friendly kernel
• 4X/12X faster than the standard approach
• Bottleneck: if all eigenvectors are required, there is one extra back-transformation cost
A. Haidar, S. Tomov, J. Dongarra, T. Schulthess, and R. Solca, ICL Technical Report, 03/2012.

Communication Avoiding Algorithms
• Goal: algorithms that communicate as little as possible
• Jim Demmel and company have been working on algorithms that obtain a provable minimum of communication (M. Anderson, yesterday)
• Direct methods (BLAS, LU, QR, SVD, other decompositions): communication lower bounds for all these problems, and algorithms that attain them (all dense linear algebra, some sparse)
• Iterative methods — Krylov subspace methods for Ax = b, Ax = λx: communication lower bounds, and algorithms that attain them (depending on the sparsity structure)
• For QR factorization they can show: [bound given on the slide]

Communication Reducing QR Factorization
[Chart: quad-socket, quad-core Intel Xeon EMT64 E7340 at 2.39 GHz; theoretical peak is 153.2 Gflop/s with 16 cores; matrix size 51200 by 3200.]

Mixed Precision Methods
• Mixed precision: use the lowest precision required to achieve a given accuracy outcome
  - Improves runtime, reduces power consumption, lowers data movement
  - Reformulate to find a correction to the solution, rather than the solution itself: Δx rather than x

The Idea Goes Something Like This…
• Exploit 32-bit floating point as much as possible, especially for the bulk of the computation
• Correct or update the solution with selective use of 64-bit floating point to provide a refined result
• Intuitively:
  - Compute a 32-bit result,
  - Calculate a correction to the 32-bit result using selected higher precision, and
  - Perform the update of the 32-bit result with the correction using high precision.
Mixed-Precision Iterative Refinement
• Iterative refinement for dense systems, Ax = b, can work this way:

  L U = lu(A)               SINGLE   O(n^3)
  x = L\(U\b)               SINGLE   O(n^2)
  r = b - Ax                DOUBLE   O(n^2)
  WHILE || r || not small enough
      z = L\(U\r)           SINGLE   O(n^2)
      x = x + z             DOUBLE   O(n^1)
      r = b - Ax            DOUBLE   O(n^2)
  END

• Wilkinson, Moler, Stewart, & Higham provide an error bound for single-precision floating-point results when using double-precision floating point.
• It can be shown that using this approach we can compute the solution to 64-bit floating point precision:
  - Requires extra storage, total is 1.5 times normal;
  - O(n^3) work is done in lower precision;
  - O(n^2) work is done in high precision;
  - Problems arise if the matrix is ill-conditioned in single precision, i.e. condition number around O(10^8).
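A minimal dense version of the loop above in Python/NumPy (my own sketch, not the LAPACK/PLASMA/MAGMA routine; the function name, test matrix, and tolerance are illustrative choices):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
    """Factor and solve in single precision (the O(n^3) work), then refine with
    double-precision residuals (the O(n^2) work). Assumes A is not too
    ill-conditioned for single precision."""
    lu, piv = lu_factor(A.astype(np.float32))            # SINGLE: LU = lu(A)
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                     # DOUBLE: residual
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = lu_solve((lu, piv), r.astype(np.float32))     # SINGLE: correction
        x += z                                            # DOUBLE: update
    return x

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)           # well-conditioned test matrix
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))      # refined toward DP-level accuracy
```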
Ax = b on Fermi
FERMI Tesla C2050: 448 CUDA cores @ 1.15 GHz; SP/DP peak is 1030 / 515 GFlop/s.
[Chart: Gflop/s vs. matrix size (960 to 13120) for single precision, mixed precision, and double precision; mixed precision tracks single-precision speed while delivering a double-precision solution. Similar results hold for the Cholesky and QR factorizations.]

Reproducibility
• For example, Σ x_i when done in parallel can't guarantee the order of operations.
• Lack of reproducibility is due to floating point non-associativity and algorithmic adaptivity (including autotuning) in efficient production mode.
• Bit-level reproducibility may be unnecessarily expensive most of the time.
• Force routine adoption of uncertainty quantification: given the many unresolvable uncertainties in program inputs, bound the error in the outputs in terms of the errors in the inputs.

Conclusions
• For the last decade or more, the research investment strategy has been overwhelmingly biased in favor of hardware.
• This strategy needs to be rebalanced — barriers to progress are increasingly on the software side.
• The high performance ecosystem is out of balance: hardware, OS, compilers, software, algorithms, applications — and there is no Moore's Law for software, algorithms and applications.

Clusters with GPUs (Cholesky)
Use 12 cores and 3 GPUs per node; input size = 34560 * sqrt(NumberNodes).
[Charts: overall Tflop/s and Tflop/s per node vs. number of nodes (1 to 100), comparing a DGEMM upper bound, the distributed-GPU code, and mkl_scalapack 10.3.]
On the Keeneland system: 100 nodes; each node has two 6-core Intel Westmere CPUs and three Nvidia Fermi GPUs.
Software used: Intel MKL 10.3.5, CUDA 4.0, OpenMPI 1.5.1, PLASMA 2.4.1.

Sparse Direct Solver and Iterative Refinement
MUMPS package, based on a multifrontal approach which generates small dense matrix multiplies.
[Chart: speedup over DP of iterative refinement and of plain single precision, for matrices from Tim Davis's collection (n = 100K - 3M), on an Opteron with the Intel compiler.]

Sparse Iterative Methods (PCG)
• Outer/inner iteration: outer iterations in 64-bit floating point, inner iterations in 32-bit floating point.

Mixed Precision Computations for Sparse Inner/Outer-type Iterative Solvers
[Charts: speedups of mixed precision inner-SP/outer-DP iterative methods vs. DP/DP (CG, GMRES, PCG, and PGMRES with diagonal preconditioner) — higher is better; iteration counts of the mixed precision SP/DP methods vs. DP/DP — lower is better.]
Machine: Intel Woodcrest (3 GHz, 1333 MHz bus); stopping criterion: residual reduction relative to r0 of 10^-12; matrix sizes 6,021 to 240,000, varying condition number.

Standard QR Block Reduction
• We have an m x n matrix A that we want to reduce to upper triangular form: applying block Householder transformations Q1^T, Q2^T, Q3^T panel by panel gives A = Q1 Q2 Q3 R = QR.

Communication Avoiding QR Example
[Diagram: the matrix is split into domains; each domain performs an independent tile QR (Domain_Tile_QR), and the resulting R factors are combined by a reduction tree.]
A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610-1620, Pasadena, CA, Jan. 1988. ACM. Penn. State.
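The communication-avoiding QR pattern above — independent per-domain QR factorizations whose R factors are combined by a reduction — can be sketched for a tall-skinny matrix in a few lines. This is my own illustration of the TSQR arithmetic, with a flat (one-level) combine rather than a binary reduction tree:

```python
import numpy as np

def tsqr(A, num_domains=4):
    """Tall-skinny QR by domains: factor each block row independently, then do a
    second QR on the stacked R factors. Only the small R factors would need to be
    communicated; here everything runs in one process."""
    m, n = A.shape
    blocks = np.array_split(A, num_domains, axis=0)
    Qs, Rs = zip(*(np.linalg.qr(B) for B in blocks))    # independent domain QRs
    Q2, R = np.linalg.qr(np.vstack(Rs))                 # combine the R factors
    # Assemble the full Q (optional; often only R or an implicit Q is needed)
    Q = np.vstack([Qs[i] @ Q2[i*n:(i+1)*n, :] for i in range(num_domains)])
    return Q, R

rng = np.random.default_rng(0)
A = rng.standard_normal((10_000, 8))                    # tall and skinny
Q, R = tsqr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(8)))   # True True
```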
PLASMA / DPLASMA
[Diagram: PLASMA uses the QUARK runtime with an implicit, windowed DAG on shared memory; DPLASMA uses the DAGuE runtime with a parameterized (compact) DAG for distributed memory. The one-sided factorizations map onto a handful of tile kernels — Cholesky: POTRF, TRSM, SYRK, GEMM; LU: GETRF, GESSM, TSTRF, SSSSM; QR: GEQRT, LARFB, TSQRT, SSRFB.]

Example: Cholesky 4x4
[DAG of the tile Cholesky factorization on a 4x4 tile matrix.]
The runtime uses the symbolic information from the compiler to make scheduling, message-passing, and runtime decisions; the data distribution can be regular or irregular; tasks carry priorities. No left-looking or right-looking variant is fixed — the execution is more adaptive or opportunistic.

Software Stack
PLASMA distribution (commercial or Netlib): PLASMA on top of QUARK and core BLAS, with CBLAS, (C)LAPACK, BLAS, hwloc, and POSIX threads underneath.
- QUARK — QUeuing And Runtime for Kernels
- LAPACK — Linear Algebra PACKage
- BLAS — Basic Linear Algebra Subroutines
- hwloc — hardware locality

Big DAGs: No Global Critical Path
• DAGs get very big, very fast
  - So windows of active tasks are used; this means there is no global critical path
  - A matrix of NB x NB tiles leads to ~NB^3 operations
  - NB = 100 gives 1 million tasks

PLASMA Local Scheduling — Dynamic Scheduling: Sliding Window
[Traces: tile LU factorization scheduled through a sliding window of active tasks.]

Exascale (10^18 Flop/s) Systems: Two Possible Swim Lanes
• Lightweight processors (think BG/P): ~1 GHz processor (10^9), ~1 kilo cores/socket (10^3), ~1 mega sockets/system (10^6). Socket level: cores scale out in a planar geometry.
• Hybrid systems (think GPU-based): ~1 GHz processor (10^9), ~10 kilo FPUs/socket (10^4), ~100 kilo sockets/system (10^5). Node level: 3D packaging.

The High Cost of Data Movement
• Flop/s, or percentage of peak flop/s, becomes much less relevant.
Approximate power costs (in picoJoules):
Operation | 2011 | 2018
DP FMADD flop | 100 pJ | 10 pJ
DP DRAM read | 4800 pJ | 1920 pJ
Local interconnect | 7500 pJ | 2500 pJ
Cross system | 9000 pJ | 3500 pJ
(Source: John Shalf, LBNL)
• Algorithms & software: minimize data movement; perform more work per unit of data movement.
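A back-of-the-envelope use of the 2011 figures above shows why the flop count says little about energy: for a bandwidth-bound kernel the DRAM traffic dominates, for a compute-bound one it does not. The operation counts are standard; the perfect cache reuse assumed for DGEMM is idealized:

```python
FLOP_J = 100e-12    # DP FMADD, 2011 column above (joules)
DRAM_J = 4800e-12   # DP DRAM read, 2011 column above (joules)

def daxpy_joules_per_flop(n):
    """y <- a*x + y: 2n flops, ~3n DP words through DRAM (read x, read y, write y)."""
    return (2 * n * FLOP_J + 3 * n * DRAM_J) / (2 * n)

def dgemm_joules_per_flop(n):
    """n x n matrix multiply: 2n^3 flops over ~3n^2 DP words, assuming ideal tile reuse."""
    return (2 * n**3 * FLOP_J + 3 * n**2 * DRAM_J) / (2 * n**3)

print(daxpy_joules_per_flop(10**6))   # ~7.3e-09 J/flop: dominated by data movement
print(dgemm_joules_per_flop(10**4))   # ~1.0e-10 J/flop: close to the bare flop cost
```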
Factors that Necessitate a Redesign of Our Software
• Steepness of the ascent from terascale to petascale to exascale
• Extreme parallelism and hybrid design: preparing for million/billion-way parallelism
• Tightening memory/bandwidth bottleneck: limits on power/clock speed and the implications of multicore; reducing communication will become much more intense; memory per core changes and the byte-to-flop ratio will change
• Necessary fault tolerance: MTTF will drop; checkpoint/restart has limitations; the software infrastructure does not exist today

Emerging Computer Architectures
• Architectures are needed by applications
• Applications are given (as a function of time)
• Architectures are given (as a function of time)
• Algorithms and software must be adapted or created to bridge to the computer architectures for the sake of the complex applications

Three Design Points Today
• Gigascale laptop: uninode-multicore (your iPhone and iPad are Gflop/s devices)
• Terascale deskside: multinode-multicore
• Petascale center: multinode-multicore

Three Design Points for Tomorrow
♦ Terascale laptop: manycore
♦ Petascale deskside: manynode-manycore
♦ Exascale center: manynode-manycore

Challenges of Using GPUs
• High levels of parallelism: many GPU cores [e.g. a Tesla C2050 (Fermi) has 448 CUDA cores]
• Hybrid/heterogeneous architectures: match algorithmic requirements to architectural strengths [e.g. small, non-parallelizable tasks run on the CPU; large, parallelizable ones on the GPU]
• Compute vs. communication gap: an exponentially growing, persistent challenge [processor speed improves 59%/year, memory bandwidth 23%, latency 5.5%] [on all levels — e.g. a Tesla S1070 (4 x C1060) has compute power of O(1,000) Gflop/s but the GPUs communicate through the CPU over an O(1) GB/s connection]

Matrix Algebra on GPU and Multicore Architectures (MAGMA)
MAGMA: a new generation of linear algebra (LA) libraries to achieve the fastest possible time to an accurate solution on hybrid/heterogeneous architectures. Homepage: http://icl.cs.utk.edu/magma/
MAGMA & LAPACK
- MAGMA uses LAPACK and extends its functionality to hybrid systems (with GPUs)
- MAGMA is designed to be similar to LAPACK in functionality, data storage and interface
- MAGMA leverages years of experience in developing open source LA software packages like LAPACK, ScaLAPACK, BLAS, ATLAS and PLASMA
MAGMA developers/collaborators
- U of Tennessee, Knoxville; U of California, Berkeley; U of Colorado, Denver
- INRIA Bordeaux - Sud Ouest & INRIA Paris - Saclay, France; KAUST, Saudi Arabia
- A community effort [similarly to the development of LAPACK / ScaLAPACK]

Hybridization Methodology
MAGMA uses a hybridization methodology based on:
- Representing linear algebra algorithms as collections of TASKS and DATA DEPENDENCIES among them
- Properly SCHEDULING the tasks' execution over the multicore and GPU hardware components
Hybrid CPU+GPU algorithms (small tasks for the multicores and large tasks for the GPUs), successfully applied to fundamental linear algebra algorithms:
- One- and two-sided factorizations and solvers
- Iterative linear and eigen-solvers
Faster, cheaper, better? A high-level approach, leveraging prior developments and exceeding homogeneous solutions in performance.
Accelerating Dense Linear Algebra with GPUs
LU factorization in double precision (DP) [for solving a dense linear system] and Hessenberg factorization in DP [for the general eigenvalue problem]:
- GPU: Fermi C2050 [448 CUDA cores @ 1.15 GHz] + Intel Q9300 [4 cores @ 2.50 GHz]; DP peak 515 + 40 GFlop/s; system cost ~ $3,000; power* ~ 220 W
- CPU: AMD Istanbul [8 sockets x 6 cores (48 cores) @ 2.8 GHz]; DP peak 538 GFlop/s; system cost ~ $30,000; power* ~ 1,022 W
* Computation-consumed power rate (total system rate minus idle rate), measured with a KILL A WATT PS, Model P430.

Architecture of Heterogeneous Multi-core and Multi-GPU Systems
Architecture of a Keeneland compute node: two Intel Xeon 2.8 GHz 6-core X5660 processors (Westmere) and three NVIDIA Fermi M2070 GPUs.

Cholesky Factorization (DP)
• Weak scalability on many nodes (Keeneland)
• Input sizes: 34560, 46080, 69120, 92160, 138240, 184320, 276480, 460800
[Chart: reaches roughly 75 Tflop/s.]

MAGMA Software Stack
From single GPU (MAGMA 1.0, MAGMA BLAS, MAGMA SPARSE on top of LAPACK, BLAS and CUDA), to multi-GPU (MAGNUM / rectangular tile algorithms with the PLASMA / QUARK scheduler), to distributed memory (tile & LAPACK algorithms with DAGuE).
Linux, Windows, Mac OS X | C/C++, Fortran | Matlab, Python.

MAGMA 1.0
• 32 algorithms are developed (122 routines in total): LU, LL^T, QR, LQ, symmetric and non-symmetric eigenvalue problems, SVD
• Every algorithm is in 4 precisions (s/c/d/z, denoted by X)
• There are 3 mixed precision algorithms (zc & ds, denoted by XX)
• These are hybrid algorithms, expressed in terms of BLAS
• Support is for a single CUDA-enabled NVIDIA GPU, either Tesla or Fermi
MAGMA BLAS
• A subset of GPU BLAS, optimized for Tesla and Fermi GPUs

Mixed Precision
• Single precision is 2X faster than double precision (with GP-GPUs, up to 8x)
• Power saving issues
• Reduced data motion: 32-bit data instead of 64-bit data
• Higher locality in cache: more data items in cache
Three Ideas for Fault Tolerant Linear Algebra Algorithms
• Lossless diskless checkpointing for iterative methods
  - A checksum is maintained in active processors
  - On failure, roll back to the checkpoint and continue
  - No lost data
• Lossy approach for iterative methods
  - No checkpoint of computed data is maintained
  - On failure, approximate the missing data and carry on
  - Data is lost, but an approximation is used to recover
• Checkpoint-less methods for dense algorithms
  - A checksum is maintained as part of the computation
  - No roll back needed; no lost data
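The checkpoint-less idea for dense algorithms can be illustrated with a toy checksum scheme: append a column of row sums to the matrix, carry it through the computation, and rebuild a lost column from the survivors. This is a simplified sketch of the general algorithm-based fault tolerance idea, not the scheme used in any particular library:

```python
import numpy as np

def add_checksum(A):
    """Append a checksum column: the row sums of A."""
    return np.hstack([A, A.sum(axis=1, keepdims=True)])

def recover_column(Ac, lost):
    """Rebuild a lost column: each row's missing entry is the checksum minus
    the surviving entries in that row."""
    n = Ac.shape[1] - 1
    survivors = [j for j in range(n) if j != lost]
    Ac[:, lost] = Ac[:, n] - Ac[:, survivors].sum(axis=1)
    return Ac[:, :n]

A = np.arange(12, dtype=float).reshape(3, 4)
Ac = add_checksum(A)
Ac[:, 2] = np.nan                      # simulate losing one column (e.g. a failed node)
print(np.allclose(recover_column(Ac, lost=2), A))   # True
```

The point in the dense-factorization setting is that the update operations preserve the checksum relationship, so a recovery like this remains possible mid-factorization without rolling back.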
PLASMA People
• Current team, past members, and outside contributors: Dulceneia Becker, Henricus Bouwmeester, Jack Dongarra, Mathieu Faverge, Azzam Haidar, Blake Haugen, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Łuszczek, Emmanuel Agullo, Wesley Alvaro, Alfredo Buttari, Bilel Hadri, Fred Gustavson, Lars Karlsson, Bo Kågström
A copy of the slides is on my website. Google "dongarra".

First…
• Thanks to a number of people who have helped with this work: Emmanuel Agullo, George Bosilca, Aurelien Bouteiller, Anthony Danalis, Jim Demmel, Tingxing "Tim" Dong, Mathieu Faverge, Azzam Haidar, Thomas Herault, Mitch Horton, Jakub Kurzak, Julien Langou, Julie Langou, Pierre Lemarinier, Piotr Luszczek, Hatem Ltaief, Stanimire Tomov, Asim YarKhan, …
• Much of what I will describe has been done before, at least in theory.

28 Supercomputers in the UK
Rank | Site | Computer | Cores | Rmax (Tflop/s)
24 | University of Edinburgh | Cray XE6 12-core 2.1 GHz | 44,376 | 279
65 | Atomic Weapons Establishment | Bullx B500 Cluster, Xeon X56xx 2.8 GHz, QDR Infiniband | 12,936 | 124
69 | ECMWF | Power 575, p6 4.7 GHz, Infiniband | 8,320 | 115
70 | ECMWF | Power 575, p6 4.7 GHz, Infiniband | 8,320 | 115
93 | University of Edinburgh | Cray XT4, 2.3 GHz | 12,288 | 95
154 | University of Southampton | iDataPlex, Xeon QC 2.26 GHz, Infiniband, Windows HPC2008 R2 | 8,000 | 66
160 | IT Service Provider | Cluster Platform 4000 BL685c G7, Opteron 12C 2.2 GHz, GigE | 14,556 | 65
186 | IT Service Provider | Cluster Platform 3000 BL460c G7, Xeon X5670 2.93 GHz, GigE | 9,768 | 59
190 | Computacenter (UK) LTD | Cluster Platform 3000 BL460c G1, Xeon L5420 2.5 GHz, GigE | 11,280 | 58
191 | Classified | xSeries x3650 Cluster, Xeon QC GT 2.66 GHz, Infiniband | 6,368 | 58
211 | Classified | BladeCenter HS22 Cluster, WM Xeon 6-core 2.66 GHz, Infiniband | 5,880 | 55
212 | Classified | BladeCenter HS22 Cluster, WM Xeon 6-core 2.66 GHz, Infiniband | 5,880 | 55
213 | Classified | BladeCenter HS22 Cluster, WM Xeon 6-core 2.66 GHz, Infiniband | 5,880 | 55
228 | IT Service Provider | Cluster Platform 4000 BL685c G7, Opteron 12C 2.1 GHz, GigE | 12,552 | 54
233 | Financial Institution | iDataPlex, Xeon X56xx 6C 2.66 GHz, GigE | 9,480 | 53
234 | Financial Institution | iDataPlex, Xeon X56xx 6C 2.66 GHz, GigE | 9,480 | 53
278 | UK Meteorological Office | Power 575, p6 4.7 GHz, Infiniband | 3,520 | 51
279 | UK Meteorological Office | Power 575, p6 4.7 GHz, Infiniband | 3,520 | 51
339 | Computacenter (UK) LTD | Cluster Platform 3000 BL460c, Xeon 54xx 3.0 GHz, GigEthernet | 7,560 | 47
351 | Asda Stores | BladeCenter HS22 Cluster, WM Xeon 6-core 2.93 GHz, GigE | 8,352 | 47
365 | Financial Services | xSeries x3650M2 Cluster, Xeon QC E55xx 2.53 GHz, GigE | 8,096 | 46
404 | Financial Institution | BladeCenter HS22 Cluster, Xeon QC GT 2.53 GHz, GigEthernet | 7,872 | 44
405 | Financial Institution | BladeCenter HS22 Cluster, Xeon QC GT 2.53 GHz, GigEthernet | 7,872 | 44
415 | Bank | xSeries x3650M3, Xeon X56xx 2.93 GHz, GigE | 7,728 | 43
416 | Bank | xSeries x3650M3, Xeon X56xx 2.93 GHz, GigE | 7,728 | 43
482 | IT Service Provider | Cluster Platform 3000 BL460c G6, Xeon L5520 2.26 GHz, GigE | 8,568 | 40
484 | IT Service Provider | Cluster Platform 3000 BL460c G6, Xeon X5670 2.93 GHz, 10G | 4,392 | 40

Programming Model Approaches
• Hierarchical approach (intra-node + inter-node)
• Part I: inter-node model for communicating between nodes
  - MPI scaling to millions of nodes: importance high; risk low
  - One-sided communication scaling: importance medium; risk low
• Part II: intra-node model for on-chip concurrency
  - Overriding risk: no single path for node architecture
  - OpenMP, Pthreads: high risk (may not be feasible with node architectures); high payoff (already in some applications)
  - New API, extended PGAS, or CUDA/OpenCL to handle hierarchies of memories and cores: medium risk (reflects architecture directions); medium payoff (reprogramming of node code)
• Unified approach: a single high-level model for the entire system
  - High risk; high payoff for new codes and new application domains

Programming Models Require a Dual Approach
• The hierarchical (intra-node + inter-node) path combines something old — MPI and one-sided communication between nodes, providing a path for incremental progress — with something new: an intra-node model for on-chip concurrency (OpenMP/Pthreads, or a new API / extended PGAS / CUDA/OpenCL for hierarchies of memories and cores), with the unified single high-level model remaining the high-risk, high-payoff alternative.
Power Profiles
PLASMA LU solver in double precision vs. mixed precision; N = 8400, using 4 cores.
Two dual-core 1.8 GHz AMD Opteron processors; theoretical peak 14.4 Gflop/s per node; DGEMM using 4 threads: 12.94 Gflop/s; PLASMA 2.3.1, GotoBLAS2.
[Power traces over time for system, CPU, memory, disk, and motherboard.]
Metric | PLASMA DP | PLASMA Mixed
Time to solution (s) | 39.5 | 22.8
GFLOPS | 10.01 | 17.37
Accuracy ||Ax - b|| / ((||A|| ||x|| + ||b||) N ε) | 2.0E-02 | 1.3E-01
Iterations | - | 7
System energy (KJ) | 10852.8 | 6314.8

The High Cost of Data Movement
• Flop/s, or percentage of peak flop/s, becomes much less relevant.
Approximate power costs (in picoJoules):
Operation | 2011 | 2018
DP FMADD flop | 100 pJ | 10 pJ
DP DRAM read | 2000 pJ | 1000 pJ
DP copper link traverse (short) | 1000 pJ | 100 pJ
DP optical link traverse (long) | 3000 pJ | 500 pJ

"Nothing you can't spell will ever work." — Will Rogers

Prioritization of Critical-Path and Non-critical Tasks
• DAG scheduling of critical-path tasks
• Allows taking advantage of asynchronicity between major steps and adaptive load balancing for non-critical tasks

Synchronization Avoiding Methods

In the States: Co-Design Centers & Exascale Software Center
• Co-Design Centers: the co-design process is where system architects, application software designers, applied mathematicians, and computer scientists work together to produce a computational science discovery environment.
• Exascale Software Center:
  - Deliver high quality system software for exascale platforms (~2015, ~2018)
  - Identify required software capabilities and identify gaps
  - Design and develop open-source software components — both evolve existing components and develop new ones
  - Includes maintainability, support, verification
  - Ensure functionality, stability, and performance
  - Collaborate with platform vendors to integrate software

Increasing the Level of Asynchronous Behavior
• DAG-level description of methods: express parallelism explicitly in DAGs, so that tasks can be scheduled dynamically, massive parallelism is supported, and common optimization techniques can be applied to increase throughput
• A scheduler is needed
• Standards
• LAPACK-style LU / LL^T / QR proceeds in steps (Step 1, Step 2, Step 3, Step 4, ...): fork-join, bulk synchronous processing
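To make the DAG discussion concrete, here is a sequential sketch of the tile Cholesky kernels (POTRF, TRSM, SYRK, GEMM) whose data dependencies form such a DAG; a runtime like QUARK would execute these same tasks out of order as dependencies allow. This is my own NumPy/SciPy illustration, not PLASMA code:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def tiled_cholesky(A, nb):
    """Right-looking tiled Cholesky (A = L L^T, L lower triangular), expressed as
    the four tile kernels. The tasks here run in one valid dependency order;
    a dynamic scheduler would overlap independent ones."""
    A = A.copy()
    t = A.shape[0] // nb
    T = lambda i, j: (slice(i*nb, (i+1)*nb), slice(j*nb, (j+1)*nb))
    for k in range(t):
        A[T(k, k)] = cholesky(A[T(k, k)], lower=True)               # POTRF
        for i in range(k + 1, t):
            A[T(i, k)] = solve_triangular(                          # TRSM
                A[T(k, k)], A[T(i, k)].T, lower=True).T
        for i in range(k + 1, t):
            A[T(i, i)] -= A[T(i, k)] @ A[T(i, k)].T                 # SYRK
            for j in range(k + 1, i):
                A[T(i, j)] -= A[T(i, k)] @ A[T(j, k)].T             # GEMM
    return np.tril(A)

rng = np.random.default_rng(0)
n, nb = 8, 2
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)           # symmetric positive definite test matrix
L = tiled_cholesky(A, nb)
print(np.allclose(L @ L.T, A))        # True
```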
Tiled Operations & Look Ahead
• Break each task into smaller operations: tiles
• Unwind the outer loop

Scaling for LU
[Chart: performance vs. matrix size.]

If We Had a Small Matrix Problem
• We would generate the DAG, find the critical path and execute it.
• The DAG is too large to generate ahead of time
  - Do not generate it explicitly; dynamically generate the DAG as we go
• Machines will have a large number of cores in a distributed fashion
  - Will have to engage in message passing
  - Distributed management
  - Locally, have a run time system

The DAGs are Large
• Here is the DAG for a factorization on a 20 x 20 matrix [diagram]
• For a large matrix, say O(10^6), the DAG is huge
• Many challenges for the software

PLASMA Scheduling — Dynamic Scheduling: Sliding Window
Tile LU factorization: 10 x 10 tiles, 300 tasks, 100-task window.
[Traces showing the window of active tasks sliding across the DAG.]

DAG and Scheduling
• The DAG is dynamically generated and implicit
• Everything is designed for distributed memory systems
• A runtime system runs on each node or core:
  - Bin 1: see if new data has arrived
  - Bin 2: see if new dependences are satisfied; if so, move the task to Bin 3
  - Bin 3: execute a task that's ready; notify children of completion; send data to children
  - If there is no work, do work stealing

Some Questions
• What's the best way to represent the DAG?
• What's the best approach to dynamically generating the DAG?
• What run time system should we use? We will probably build something that we would target to the underlying system's RTS.
• What about work stealing? Can we do better than nearest-neighbor work stealing?
• What does the program look like?
• Experimenting with Cilk, Charm++, UPC, Intel Threads; I would like to reuse as much of the existing software as possible.
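A toy version of the sliding-window idea above, with dependence checking standing in for the runtime's "bins" (purely illustrative — the task names and window mechanics are simplified assumptions, not QUARK's implementation):

```python
from collections import deque

def run_dag(tasks, deps, window=4):
    """Dynamic DAG execution with a sliding window: only `window` tasks are
    'unrolled' (visible to the scheduler) at a time; a task becomes ready when
    all of its predecessors have completed."""
    unrolled, done, order = deque(), set(), []
    remaining = list(tasks)                           # tasks in sequential program order
    while remaining or unrolled:
        while remaining and len(unrolled) < window:   # slide the window forward
            unrolled.append(remaining.pop(0))
        ready = [t for t in unrolled if deps.get(t, set()) <= done]
        if not ready:
            raise RuntimeError("window too small for this DAG")
        t = ready[0]                                  # a real runtime would pick by priority
        unrolled.remove(t)
        done.add(t)
        order.append(t)
    return order

# Hypothetical dependency chain from a 2x2-tile Cholesky
tasks = ["POTRF0", "TRSM10", "SYRK1", "POTRF1"]
deps = {"TRSM10": {"POTRF0"}, "SYRK1": {"TRSM10"}, "POTRF1": {"SYRK1"}}
print(run_dag(tasks, deps, window=2))
```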
PLASMA Scheduling — Dynamic Scheduling with QUARK
• Sequential algorithm definition
• Side-effect-free tasks
• Directions of arguments (IN, OUT, INOUT)
• Runtime resolution of data hazards (RaW, WaR, WaW)
• Implicit construction of the DAG
• Processing of the tasks by a sliding window
• An old concept: Jade (Stanford University), SMP Superscalar (Barcelona Supercomputing Center), StarPU (INRIA)

PLASMA Scheduling — Dynamic Scheduling: Tile LU Trace
• Regular trace; factorization steps pipelined; stalling only due to natural load imbalance
[Trace on an 8-socket, 6-core (48 cores total) AMD Istanbul at 2.8 GHz.]

Redesign
• Asynchronicity: avoid fork-join (bulk synchronous design)
• Dynamic scheduling: out-of-order execution
• Fine granularity: independent block operations
• Locality of reference: data storage in a block data layout

Communication Reducing Methods

Experimental Results
• On two cluster machines and a Cray XT5 system:
  - Cluster 1: 2 cores per node (Grig at UTK)
  - Cluster 2: 8 cores per node (Newton at UTK)
  - Cray XT5: 12 cores per node (Jaguar at ORNL)
• In comparison with the vendors' ScaLAPACK library
• Take as input a tall and skinny matrix

Strong Scalability on Jaguar
• Fixed-size input for an increasing number of cores (1 to 384)
• Each node has 2 sockets, with a 6-core AMD Opteron 2.6 GHz per socket
[Chart: Gflop/s vs. number of cores for Tile CA-QR and ScaLAPACK.]

Weak Scalability on Jaguar
• The input size increases as the number of cores increases (1 to 3072)
• Each node has 2 sockets, with a 6-core AMD Opteron per socket
[Chart: Gflop/s per core for peak, dgemm, Tile CA-QR, and ScaLAPACK.]

Applying Tile CAQR to General-Size Matrices
• Each node has 2 sockets, with a 6-core AMD Opteron per socket; using 16 nodes on Jaguar (i.e., 192 cores)
[Chart: Gflop/s vs. number of tile columns (8 to 512) for Tile CA-QR and ScaLAPACK; crossover point on matrices with 512 rows.]

GMRES Speedups on 8-core Clovertown
[Chart; diagram note: idle processes P0 and P1.]

Communication-Avoiding Iterative Methods
• Iterative solvers are the dominant cost of many apps (up to 80+% of runtime).
• Exascale challenges for iterative solvers: collectives, synchronization; memory latency/BW; not viable on exascale systems in their present forms.
• Communication-avoiding (s-step) iterative solvers:
  - Idea: perform s steps in bulk (s = 5 or more): s times fewer synchronizations and s times fewer data transfers — better latency/BW.
  - Problem: numerical accuracy of the orthogonalization.
• TSQR implementation:
  - 2-level parallelism (inter- and intra-node); memory hierarchy optimizations; flexible node-level scheduling via Intel Threading Building Blocks; generic scalar data type supporting mixed and extended precision.
  - Compared against LAPACK (serial) and MGS (threaded modified Gram-Schmidt).
• TSQR capability:
  - Critical for exascale solvers; part of the Trilinos scalable multicore capabilities; helps all iterative solvers in Trilinos (available to external libraries, too).
  - Staffing: Mark Hoemmen (lead, post-doc, UC Berkeley), M. Heroux. Part of the Trilinos 10.6 release, Sep 2010.
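The s-step idea above hinges on a matrix-powers kernel: generate a block of Krylov basis vectors in one pass so that s SpMVs share one round of communication. A serial sketch of just the arithmetic (real s-step solvers also need a stabilized basis, e.g. Newton or Chebyshev, which is omitted here):

```python
import numpy as np
import scipy.sparse as sp

def matrix_powers(A, x, s):
    """Compute the monomial basis [x, Ax, ..., A^s x] in one sweep; in a parallel
    s-step Krylov method these s SpMVs would share one communication phase."""
    V = np.empty((x.size, s + 1))
    V[:, 0] = x
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V

n, s = 1000, 5
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
V = matrix_powers(A, np.ones(n), s)
print(V.shape)   # (1000, 6): one block of basis vectors per s steps
```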
Mixed Precision Methods

Exploiting Mixed Precision Computations
• Single precision is faster than double precision because of:
  - Higher parallelism within floating point units: 4 ops/cycle (usually) instead of 2 ops/cycle
  - Reduced data motion: 32-bit data instead of 64-bit data
  - Higher locality in cache: more data items in cache

Fault Overcoming Methods

Autotuning

Automatic Performance Tuning
• Writing high performance software is hard.
• Ideal: get a high fraction of peak performance from one algorithm.
• Reality: the best algorithm (and its implementation) can depend strongly on the problem, computer architecture, compiler, …
  - The best choice can depend on knowing a lot of applied mathematics and computer science
  - It changes with each new hardware and compiler release
• Automatic performance tuning:
  - Use machine time in place of human time for tuning
  - Search over possible implementations
  - Use performance models to restrict the search space
  - Past successes: ATLAS, FFTW, Spiral, Open MPI

How to Deal with Complexity?
• Many parameters in the code need to be optimized.
• Software adaptivity is the key for applications to effectively use available resources whose complexity is exponentially increasing.
[Diagram of the ATLAS approach: detect hardware parameters (L1 size, number of registers, MulAdd, latency); the ATLAS search engine (MMSearch) drives the ATLAS matrix-multiply code generator (MMCase) over parameters NB, MU, NU, KU, xFetch; compile, execute, and measure each MiniMMM source, recording MFLOPS.]

Auto-Tuning
• The best algorithm implementation can depend strongly on the problem, computer architecture, compiler, …
• There are 2 main approaches:
  - Model-driven optimization [analytical models for various parameters; heavily used in the compiler community; may not give optimal results]
  - Empirical optimization [generate a large number of code versions and run them on a given platform to determine the best performing one; effectiveness depends on the chosen parameters to optimize and the search heuristics used]
• The natural approach is to combine them in a hybrid approach [first model-driven, to limit the search space for a second, empirical part]
• Another aspect is adaptivity — to treat cases where tuning cannot be restricted to optimizations at design, installation, or compile time
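A miniature version of the empirical side of autotuning described above: parameterize candidate implementations, time them on the target machine, and keep the fastest. The blocked matmul and candidate tile sizes are illustrative only, not ATLAS's actual search space:

```python
import time
import numpy as np

def blocked_matmul(A, B, nb):
    """Blocked matrix multiply; nb is the tile size to be tuned."""
    n = A.shape[0]
    C = np.zeros_like(A)
    for i in range(0, n, nb):
        for j in range(0, n, nb):
            for k in range(0, n, nb):
                C[i:i+nb, j:j+nb] += A[i:i+nb, k:k+nb] @ B[k:k+nb, j:j+nb]
    return C

def autotune(n=512, candidates=(32, 64, 128, 256)):
    """Empirical search: time each candidate block size on this machine, keep the best."""
    rng = np.random.default_rng(0)
    A, B = rng.random((n, n)), rng.random((n, n))
    best = None
    for nb in candidates:
        t0 = time.perf_counter()
        blocked_matmul(A, B, nb)
        dt = time.perf_counter() - t0
        if best is None or dt < best[1]:
            best = (nb, dt)
    return best

print(autotune())   # the winning block size depends on this machine's cache hierarchy
```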
International Community Effort
• We believe this needs to be an international collaboration for various reasons, including:
  - The scale of investment
  - The need for international input on requirements — the US, Europeans, Asians, and others are working on their own software that should be part of a larger vision for HPC
  - No global evaluation of key missing components
  - Hardware features are uncoordinated with software development
www.exascale.org

Outline
• Push towards exascale
• Science drivers
• IESP and EESI work
• Importance of doing it now
• Worldwide HPC and challenges

Moore's Law Reinterpreted
• The number of cores per chip doubles every 2 years, while clock speed decreases (not increases).
  - Need to deal with systems with millions of concurrent threads
  - Future generations will have billions of threads!
  - Need to be able to easily replace inter-chip parallelism with intra-chip parallelism
• The number of threads of execution doubles every 2 years.

10+ Pflop/s Systems Planned in the States
♦ DOE funded: Titan at ORNL, based on a Cray design with accelerators, 20 Pflop/s, 2012
♦ DOE funded: Sequoia at Lawrence Livermore Nat. Lab, based on IBM's BG/Q, 20 Pflop/s, 2012
♦ DOE funded: BG/Q at Argonne National Lab, based on IBM's BG/Q, 10 Pflop/s, 2012
♦ NSF funded: Blue Waters at University of Illinois UC, based on IBM's Power 7 processor, 10 Pflop/s, 2012

Roadmap Components
www.exascale.org

Exascale Software Center (in 1 slide)
• Scope
  - Deliver high quality system software for exascale platforms (~2015, ~2018)
  - Identify software gaps, research & develop solutions, test and support deployment
  - Increase the productivity and capability, and reduce the risk, of exascale deployments
• Cost
  - Applied R&D: ~10-20 distributed teams of 3 to 7 people each
  - A large, primarily centralized QA, integration, and verification center
• Schedule overview
  - 2010 - Q1 2011: planning and technical reviews
  - April 2011: launch the Exascale Software Center
  - 2014, 2017: software ready for integration for the 2015 and 2018 systems respectively

Scaling
• Strong scaling: fixed problem size — the data on each node decreases as the number of nodes increases.
• Weak scaling: fixed data size on each node — the problem size increases as the number of nodes increases.

Potential System Architecture Targets
System attributes | 2010 | "2015" | "2018"
System peak | 2 Peta | 200 Petaflop/sec | 1 Exaflop/sec
Power | 6 MW | 15 MW | 20 MW
System memory | 0.3 PB | 5 PB | 32-64 PB
Node performance | 125 GF | 0.5 TF or 7 TF | 1 TF or 10 TF
Node memory BW | 25 GB/s | 0.1 TB/sec or 1 TB/sec | 0.4 TB/sec or 4 TB/sec
Node concurrency | 12 | O(100) or O(1,000) | O(1,000) or O(10,000)
System size (nodes) | 18,700 | 50,000 or 5,000 | 1,000,000 or 100,000
Total node interconnect BW | 1.5 GB/s | 20 GB/sec | 200 GB/sec
MTTI | days | O(1 day) | O(1 day)

Moore's Law Reinterpreted
• The number of cores per chip will double every two years.
• Clock speed will not increase (and may decrease) because of power:
  Power ∝ Voltage^2 × Frequency, and Voltage ∝ Frequency, so Power ∝ Frequency^3
• Need to deal with systems with millions of concurrent threads.
• Need to deal with inter-chip parallelism as well as intra-chip parallelism.
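The Power ∝ Frequency^3 relation above is the whole argument for many slower cores instead of fewer faster ones; a two-line check, under the idealized assumption of perfect parallel scaling:

```python
def relative_power(freq_ratio):
    """Power ~ Voltage^2 x Frequency with Voltage ~ Frequency => Power ~ Frequency^3."""
    return freq_ratio ** 3

# Hypothetical comparison: one core at 2x clock vs. two cores at 1x clock,
# both delivering ~2x the baseline throughput if the work parallelizes perfectly.
print(relative_power(2.0))        # 8.0x the power for the single fast core
print(2 * relative_power(1.0))    # 2.0x the power for two slower cores
```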
Future Computer Systems
• Most likely a hybrid design: think standard multicore chips plus accelerators (GPUs)
• Today accelerators are attached; the next generation will be more integrated
• Intel's Larrabee? Now "Knights Ferry", with "Knights Corner" to come — 48 x86 cores
• AMD's Fusion in 2011-2013: multicore with embedded (ATI) graphics
• Nvidia's plans?

What's Next?
[Diagram of different classes of chips — all large cores; mixed large and small cores; many small cores; all small cores; many floating-point cores — aimed at home, games/graphics, business, and scientific markets.]

MAGMA Software
Available through MAGMA's homepage: http://icl.cs.utk.edu/magma/
Included are the 3 one-sided matrix factorizations and an iterative refinement algorithm (mixed precision), with standard (LAPACK) data layout and accuracy, and two LAPACK-style interfaces:
- CPU interface: both input and output are on the CPU
- GPU interface: both input and output are on the GPU
This release is intended for a single GPU.

Today's Fastest Computer: Potential System Architecture Targets with $200M and 20 MW caps
System attributes | 2011 (Fujitsu K) | "2015" | "2018" | Difference 2011 & 2018
System peak | 8.7 Pflop/s | 200 Pflop/s | 1 Eflop/s | O(100) (115x)
Power | 10 MW | 15 MW | ~20 MW |
System memory | 1.6 PB | 5 PB | 32-64 PB | O(10) (20x)
Node performance | 128 GF | 0.5 TF or 7 TF | 1 TF or 10 TF | O(10) - O(100)
Node memory BW | 64 GB/s | 0.1 TB/sec or 1 TB/sec | 0.4 TB/sec or 4 TB/sec | O(100) (62x)
Node concurrency | 8 | O(100) or O(1,000) | O(1,000) or O(10,000) | O(100) - O(1000)
Total concurrency | 548,352 | O(10^8) | O(10^9) | O(1000) (1823x)
Total node interconnect BW | 20 GB/s | 20 GB/sec | 200 GB/sec | O(10)
MTTI | days | O(1 day) | O(1 day) | -O(10)