HPC as a Driver for Computing Technology and Education
Tarek El-Ghazawi
The George Washington University, Washington D.C., USA

NOW – July 2015: The TOP 10 Systems

Rank | Site | Computer | Cores | Rmax [PFlops] | % of Peak | Power [MW] | MFlops/Watt
1 | National Super Computer Center in Guangzhou, China | Tianhe-2, NUDT, Xeon 12C 2.2 GHz + Intel Xeon Phi (57c) + Custom | 3,120,000 | 33.9 | 62 | 17.8 | 1905
2 | DOE/OS, Oak Ridge Nat Lab, USA | Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14c) + Custom | 560,640 | 17.6 | 65 | 8.3 | 2120
3 | DOE/NNSA, Lawrence Livermore Nat Lab, USA | Sequoia, BlueGene/Q (16c) + custom | 1,572,864 | 17.2 | 85 | 7.9 | 2063
4 | RIKEN Advanced Inst for Comp Sci, Japan | K computer, Fujitsu SPARC64 VIIIfx (8c) + Custom | 705,024 | 10.5 | 93 | 12.7 | 827
5 | DOE/OS, Argonne Nat Lab, USA | Mira, BlueGene/Q (16c) + Custom | 786,432 | 8.16 | 85 | 3.95 | 2066
6 | Swiss CSCS | Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler (14c) + Custom | 115,984 | 6.27 | 81 | 2.3 | 2726
7 | KAUST, Saudi Arabia | Shaheen II, Cray XC40, Xeon 16C + Custom | 196,608 | 5.54 | 77 | 4.5 | 1146
8 | TACC, USA | Stampede, Dell, Intel (8c) + Intel Xeon Phi (61c) + IB | 204,900 | 5.17 | 61 | 4.5 | 1489
9 | Forschungszentrum Juelich (FZJ), Germany | JuQUEEN, BlueGene/Q, Power BQC 16C 1.6 GHz + Custom | 458,752 | 5.01 | 85 | 2.30 | 2178
10 | DOE/NNSA, LLNL, USA | Vulcan, BlueGene/Q, Power BQC 16C 1.6 GHz + Custom | 393,216 | 4.29 | 85 | 1.97 | 2177
500 (422) | Software Comp., USA | HP Cluster | 18,896 | 0.309 | 48 | - | -
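The MFlops/Watt column is simply Rmax divided by power draw. As a quick sanity check of the table's arithmetic, here is a minimal sketch; the names and figures below are copied straight from the table, rounded as shown there:

```c
#include <stdio.h>

/* Recompute the MFlops/Watt column of the TOP 10 table:
 * efficiency = Rmax / power, with Rmax in PFlops and power in MW.
 * 1 PFlops / 1 MW = 1e9 MFlops / 1e6 W = 1000 MFlops/Watt. */
int main(void) {
    const char *name[]   = { "Tianhe-2", "Titan", "K computer", "Mira", "Piz Daint" };
    double rmax_pflops[] = { 33.9, 17.6, 10.5, 8.16, 6.27 };  /* from the table */
    double power_mw[]    = { 17.8,  8.3, 12.7, 3.95, 2.3  };  /* from the table */

    for (int i = 0; i < 5; i++) {
        double mflops_per_watt = rmax_pflops[i] / power_mw[i] * 1000.0;
        printf("%-11s %5.0f MFlops/Watt\n", name[i], mflops_per_watt);
    }
    return 0;
}
```

This reproduces the table's efficiency column (to within one unit for Tianhe-2, whose Rmax is itself rounded), and makes plain how far even the most efficient systems stand from exascale-class efficiency, a point taken up below.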
HPC is a Top National Priority!
An Executive Order from the White House established the National Strategic Computing Initiative (NSCI) on 29 July 2015.

National Strategic Computing Initiative
Five strategic themes of the NSCI:
1) Create systems that can apply exaflops of computing power to exabytes of data
2) Keep the United States at the forefront of HPC capabilities
3) Improve HPC application developer productivity
4) Make HPC readily available
5) Establish hardware technology for future HPC systems

Future/Investments – International Exascale HPC Programs

Country | Funding | Year(s) | Remarks
European Union | €700M | 2014-20 | Private-public partnership commitment through the European Technology Platform for HPC (ETP4HPC); €143.4M in 2014-15
European Union | €74M | 2011- | Six dedicated FP7 exascale projects
India | $2B | 2014-20 | Led by IISc (Indian Institute of Science) and ISRO (Indian Space Research Organisation); targeting a 132 ExaFLOP/s machine
India | $750M | 2014-19 | C-DAC (Centre for Development of Advanced Computing) to set up 70 supercomputers over 5 years
Japan | $1.38B | 2013-20 | Post-K computer to be installed at RIKEN; tentatively based on the extreme-SIMD chip "PACS-G"
China | - | - | Due to the U.S. Department of Commerce ban, will use Chinese parts to upgrade the current #1 system

Why is HPC Important?
- It is critical for economic competitiveness (highlighted by Minister Daoudi) because of its wide range of applications, through simulation and intensive data analysis
- It drives computer hardware and software innovations for future conventional computing
- It is becoming ubiquitous, i.e. all computing and information technology is turning parallel!
- Is that why it is turning into an international HPC muscle-flexing contest?

Why is HPC Important? (1) Competitiveness
The traditional product cycle is Design -> Build -> Test, and build again. HPC replaces it with Design -> Model -> Simulate, so designs are evaluated by simulation before anything is built.

Why is HPC Important? (1) Competitiveness – HPC Application Examples
- Molecular dynamics (HIV-1 protease with an inhibitor drug): simulating 2 ns takes 2 weeks on a desktop vs. 6 hours on a supercomputer
- Gene sequence alignment and phylogenetic analysis: 32 days on a desktop vs. 1.5 hours on a supercomputer
- Car crash simulation with 2 million elements: 4 days on a desktop vs. 25 minutes on a supercomputer
- Understanding the fundamental structure of matter requires a billion billion (10^18) calculations per second

Why is HPC Important? (2) The HPC of Today is the Conventional Computing of Tomorrow
- The ASCI Red supercomputer: 9,000 chips for 3 TeraFLOPs in 1997
- The Intel 80-core chip: 1 chip for 1 TeraFLOPs in 2007

Why is HPC Important? (3) HPC Concepts are Becoming Ubiquitous
- The Samsung Galaxy S6 phone has 8 cores
- The Sony PS3 uses the Cell processor, as did Roadrunner, the fastest supercomputer of 2008
- Tile64, a 64-CPU chip, could be in your future laptop
HPC is ubiquitous; all computing is becoming HPC. Can we afford to be bystanders?

How Did We Get Here – Supercomputers in Recent History

Computer | Processor | # Proc. | Year | Rmax (TFlops)
Tianhe-2 (MilkyWay-2) | TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.2 GHz, TH Express-2, Intel Xeon Phi 31S1P | 3,120,000 | 2013-now | 33,862
Titan | Cray XK7, Opteron 16C 2.2 GHz, Nvidia K20X | 560,640 | 2012 | 17,600
K computer, Japan | SPARC64 VIIIfx 2.0 GHz | 705,024 | 2011 | 10,510
Tianhe-1A, China | Intel Xeon X56xx (Westmere-EP) 2.93 GHz (11.72 GFlops) + NVIDIA GPU, FT-1000 8C | 186,368 | 2010 | 2,566
Jaguar, Cray | Cray XT5-HE, Opteron six-core 2.6 GHz | 224,162 | 2009 | 1,759
Roadrunner, IBM | PowerXCell 8i 3.2 GHz (12.8 GFlops) | 122,400 | 2008 | 1,026
BlueGene/L | eServer Blue Gene Solution, IBM PowerPC 440 700 MHz (2.8 GFlops) | 212,992 | 2007 | 478
BlueGene/L | eServer Blue Gene Solution, IBM PowerPC 440 700 MHz (2.8 GFlops) | 131,072 | 2005 | 280
BlueGene/L beta-system | IBM PowerPC 440 700 MHz (2.8 GFlops) | 32,768 | 2004 | 70.7
Earth Simulator, NEC | NEC 1000 MHz (8 GFlops) | 5,120 | 2002 | 35.8
ASCI White, IBM | SP POWER3 375 MHz (1.5 GFlops) | 8,192 | 2001 | 7.2
ASCI White, IBM | SP POWER3 375 MHz (1.5 GFlops) | 8,192 | 2000 | 4.9
ASCI Red, Intel | IA-32 Pentium Pro 333 MHz (0.333 GFlops) | 9,632 | 1999 | 2.4

How Did We Get Here – Supercomputers in Recent History
See: http://spectrum.ieee.org/tech-talk/computing/hardware/china-builds-worlds-fastest-supercomputer

How Did We Get Here – Supercomputers in Recent History
[Figure: TOP500 performance over time, from 1993 (the HPCC era) through 2008-2011: vector machines, then massively parallel processors, then MPPs with multicores and heterogeneous accelerators, first discrete and then integrated; performance grows from TeraFLOPS to PetaFLOPS.]
End of Moore's Law in clocking!

NOW – July 2015: The TOP 10 Systems
(The TOP 10 table from the beginning of the talk, shown again here as the starting point for the discussion that follows.)

How to Make Progress
- Launch a competitive funding cycle or a large national project
- Pose a system challenge: ~33.8 PFLOPS at 17.8 MW provides about 2 GFlops/Watt; to get to exascale using the same total power we need 200 GFlops/Watt
- Pose an application challenge (or several)
- Let the community compete for government funding with innovative ideas
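To make the system-challenge arithmetic concrete, here is a minimal back-of-the-envelope sketch. Note that 1 ExaFlops at exactly 17.8 MW works out to about 56 GFlops/Watt; the 200 GFlops/Watt figure corresponds to a tighter power budget of roughly 5 MW, and that reading is an assumption, not something stated above:

```c
#include <stdio.h>

/* Back-of-the-envelope exascale power arithmetic, using the
 * Tianhe-2 figures from the TOP 10 table. */
int main(void) {
    double rmax_gflops = 33.86e6;  /* 33.86 PFlops expressed in GFlops */
    double power_watts = 17.8e6;   /* 17.8 MW expressed in watts */
    double exa_gflops  = 1.0e9;    /* 1 ExaFlops expressed in GFlops */

    /* Today's efficiency: about 1.9 GFlops/Watt. */
    printf("Today:              %.1f GFlops/Watt\n", rmax_gflops / power_watts);

    /* Exascale at the same 17.8 MW budget: about 56 GFlops/Watt. */
    printf("Exascale @ 17.8 MW: %.0f GFlops/Watt\n", exa_gflops / power_watts);

    /* A 200 GFlops/Watt target implies roughly a 5 MW machine
     * (assumption: that is how the 200 GFlops/Watt goal is read here). */
    printf("Power @ 200 GF/W:   %.0f MW\n", exa_gflops / 200.0 / 1e6);
    return 0;
}
```

Either way, the required improvement over today's roughly 2 GFlops/Watt is one to two orders of magnitude, which is what makes it a genuine system challenge.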
Challenges – The End of Moore's Law
The phenomenon of exponential improvement in processors was first observed in 1965 by Intel co-founder Gordon Moore. It has several formulations:
- The speed of a microprocessor doubles every 18-24 months, assuming the price of the processor stays the same. -> Wrong, not anymore!
- The price of a microchip drops about 48% every 18-24 months, for the same processor speed and on-chip memory capacity. -> OK, for now.
- The number of transistors on a microchip doubles every 18-24 months, assuming the price of the chip stays the same. -> OK, for now.

No Faster Clocking, but More Cores?
[Figure: clock frequencies have flattened while core counts per chip keep rising. Source: Ed Davis, Intel]

Accelerators: Dealing with the Moore's Law Challenge through Parallelism

Device | Process (nm) | Freq (GHz) | # Cores | Peak SP (GFlops) | Peak DP (GFlops) | Power (W) | DP GFlops/W | Memory BW (GB/s) | Memory type
PowerXCell 8i | 65 | 3.2 | 1+8 | 204 | 102.4 | 92 | 1.11 | 25.6 | XDR
Nvidia Kepler K40 | 28 | 0.75 | 2880 | 4290 | 1430 | 235 | 6.1 | 288 | GDDR5
Intel Xeon Phi 7120P | 22 | 1.24 | 61 (244 threads) | 2417 | 1208 | 300 | 4.0 | 352 | GDDR5
Intel Xeon E5-2697v2, 12-core | 22 | 2.7 | 12 | 518.4 | 259.2 | 130 | 1.99 | 59.7 | DDR3-1866
AMD Opteron 6370P (Interlagos) | 32 | 2.5 | 16 | 320 | 160 | 99 | 1.62 | 42.7 | DDR3-1333
Xilinx XC7VX1140T | 28 | - | - | 801 | 241 | 43 | 5.6 | - | -
Xilinx XCUV440 | 20 | - | - | 1306 | 402 | 80* | 5.0* | - | -
Altera Stratix V GSB8 | 28 | - | - | 604 | 296 | 59 | 5.0 | - | -

Accelerators/Heterogeneous Computing
FPGAs, Cell, GPUs, Xeon Phi, ... paired with a microprocessor.

Application | Speedup | Cost savings | Power savings | Size savings
DNA Match | 8723x | 22x | 779x | 253x
DES Breaker | 38514x | 96x | 3439x | 1116x

El-Ghazawi et al., "The Promise of High-Performance Reconfigurable Computing," IEEE Computer, February 2008.

A General Execution Model for Heterogeneous Computers
A host microprocessor (µP) is paired with an accelerator (Cell B.E., GPU, ClearSpeed, FPGA, Intel Xeon Phi). The host transfers control and input data to the accelerator, the accelerator computes, and the output data and control are transferred back to the host. (A code sketch of this offload pattern follows the challenge list below.)

Challenges for Accelerators
1. The application must lend itself to the 90-10 rule, and different accelerators suit different types of computation
2. The programmer partitions the code across the CPU and the accelerator
3. The programmer co-schedules the CPU and the accelerator, and must ensure good utilization of the expensive accelerator resources
4. The programmer explicitly transfers data between the CPU and the accelerator
5. Accelerators are fast compared to the link to the host, an overhead that can render the use of the accelerator useless or even harmful
6. Multiple programming paradigms are needed
7. A new accelerator means learning, and porting code to, a new programming interface
8. Changing the ratio of CPUs to accelerators also requires substantial reprogramming, unless the accelerators are virtualized
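As one concrete instance of this execution model (and of challenges 2-5 above), here is a minimal offload sketch using standard OpenMP target directives; the vector-update kernel and sizes are made-up placeholders, and a real code would use whatever interface its accelerator supports (CUDA, OpenCL, a vendor library, etc.):

```c
#include <stdio.h>

#define N 1000000

int main(void) {
    static float x[N], y[N];
    for (int i = 0; i < N; i++) x[i] = (float)i;

    /* Transfer of control and input data to the accelerator (map(to:)),
     * compute there, then transfer the output and control back
     * (map(from:)). The programmer chose this partitioning, and if the
     * kernel is too small the two transfers dominate the runtime --
     * exactly challenges 2-5 above. With no device present, OpenMP
     * simply falls back to running the loop on the host. */
    #pragma omp target teams distribute parallel for map(to: x) map(from: y)
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + 1.0f;

    printf("y[42] = %f\n", y[42]);  /* control is back on the host */
    return 0;
}
```

Directive-based offload hides the raw data-movement calls, but the partitioning, scheduling, and transfer decisions the list above describes are still the programmer's to make.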
Challenges for Advancing to Exascale
1. Energy efficiency
2. Interconnect technology
3. Memory technology
4. Scalable system software
5. Programming systems
6. Data management
7. Exascale algorithms
8. Algorithms for discovery, design and decision
9. Resilience and correctness
10. Scientific productivity
Most of these are data-movement and/or programming related. (Source: DoE ASCAC Subcommittee Report, February 2014)

Exascale Technological Challenges
- The Power Wall: frequency scaling is no longer possible, as power increases rapidly with frequency
- The Memory Wall: the gap between processor speed and memory speed is widening
- The Interconnect Wall: the available bandwidth per compute operation is dropping, and the power needed for data movement is increasing
- ... plus the Programmability Wall, the Resilience Wall, and more

The Data Movement Challenge
[Figures: bandwidth density vs. system distance, and energy vs. system distance. Source: ASCAC 2014]
Locality matters a great deal: the cost of data movement, in both energy and time, rises rapidly with distance. Locality should be exploited at short distances, and is needed even more at far distances.

Data Movement and the Hierarchical Locality Challenge
Locality is Not Flat Anymore – Chip and System
[A sequence of figure slides: locality is hierarchical at both the chip level and the system level, illustrated with the Tilera Tile64 on-chip mesh and the Cray XC40 system.]

What Does That Mean for Programmers?
Exploiting hierarchical locality, at both the machine level and the chip level, through:
- Hierarchical tiled data structures (see the sketch below)
- Hierarchical locality exploitation by the runtime system (RTS)
- MPI+X
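As a small, concrete illustration of the tiling idea, here is a minimal sketch of a blocked loop nest; these are not the author's specific data structures, and N and TILE are made-up placeholder sizes. Each tile is sized to fit within one level of the memory hierarchy:

```c
#include <stdlib.h>
#include <stdio.h>

#define N    1024
#define TILE 64        /* made-up tile size: pick it so a TILE x TILE
                          block of each matrix fits in the target cache */

/* Blocked (tiled) matrix multiply, C += A * B.
 * The i0/j0/k0 loops walk over tiles (coarse-grained locality);
 * the inner i/k/j loops stay within one TILE x TILE block of each
 * matrix (fine-grained locality: cache or scratchpad). */
static void matmul_tiled(const double *A, const double *B, double *C)
{
    for (size_t i0 = 0; i0 < N; i0 += TILE)
        for (size_t j0 = 0; j0 < N; j0 += TILE)
            for (size_t k0 = 0; k0 < N; k0 += TILE)
                for (size_t i = i0; i < i0 + TILE; i++)
                    for (size_t k = k0; k < k0 + TILE; k++)
                        for (size_t j = j0; j < j0 + TILE; j++)
                            C[i*N + j] += A[i*N + k] * B[k*N + j];
}

int main(void)
{
    double *A = malloc(N * N * sizeof *A);
    double *B = malloc(N * N * sizeof *B);
    double *C = calloc(N * N, sizeof *C);
    if (!A || !B || !C) return 1;
    for (size_t i = 0; i < (size_t)N * N; i++) { A[i] = 1.0; B[i] = 2.0; }

    matmul_tiled(A, B, C);
    printf("C[0] = %g (expect %g)\n", C[0], 2.0 * N);  /* 1*2 summed N times */

    free(A); free(B); free(C);
    return 0;
}
```

The same blocking applies recursively: tiles of tiles map onto successive levels of the hierarchy (caches, scratchpads, NUMA domains, nodes), which is the essence of hierarchical tiled data structures. At the outermost level the tiles would typically be distributed with MPI, while the inner levels use the "X" in MPI+X (threads, OpenMP, and so on).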
General Implications
- In the short term, a programming challenge, and a golden opportunity for the smart programmer
- New hardware advances are needed first, and they will influence software
- Those advances may be silicon-based, or may come from nanotechnologies such as IBM's carbon-nanotube transistors (9 nm); either way, things may stay much the same on the software side for a while

General Implications – the Longer Run
- In the long term, hardware technology may move toward nanophotonics for computing, or quantum computing
- Many new hardware computing innovations may appear first as discrete accelerators, then as on-chip accelerators, and then move ever closer to the processor's internal circuitry (the datapath)

Longer Term
- The bad news: as the limits of silicon are approached, we may see departures from conventional methods of computing that dramatically change the way we conceive of software
- The good news: history has shown that good ideas from the past get resurrected in new ways

Conclusions
- A graduating, intelligent IT workforce can be a golden egg for countries like Morocco
- You can teach skills, but it is imperative to teach and stress concepts in the curriculum: stress parallelism, and stress locality
- See the recommendations by IEEE/NSF and SIAM for incorporating parallelism into Computer Science, Computer Engineering, and Computational Science and Engineering curricula, and add locality to them
- For the very long term, there is nothing better than good foundations in physics and mathematics, even for CS and CE majors

Conclusions (cont.)
- Integrate the teaching of soft skills, as President Ouaouicha said:
  - Communication
  - Entrepreneurship and marketing, individually and in groups
  - Patenting and legal matters