EECS 594, Spring 2011
Lecture 2: Overview of High-Performance Computing

[Figure: Performance development of the Top500, 1993-2010, on a log scale from 100 Mflop/s to 100 Pflop/s. In 2010 the SUM of all 500 systems is 44.16 Pflop/s, N=1 is 2.56 Pflop/s, and N=500 is 31 Tflop/s, versus 1.17 Tflop/s, 59.7 Gflop/s, and 400 Mflop/s in 1993. Roughly 6-8 years separate the N=1 and N=500 curves; my laptop and my iPhone (40 Mflop/s) are marked for comparison.]

Looking at the Gordon Bell Prize (recognizes outstanding achievement in high-performance computing applications and encourages the development of parallel processing):
• 1 Gflop/s, 1988, Cray Y-MP, 8 processors: static finite element analysis
• 1 Tflop/s, 1998, Cray T3E, 1,024 processors: modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method
• 1 Pflop/s, 2008, Cray XT5, 1.5 x 10^5 processors: superconductive materials
• 1 Eflop/s, ~2018, ?, 1 x 10^7 processors (10^9 threads)

[Figure: Performance development in the Top500 extrapolated to 2020 on a log scale up to 1 Eflop/s, showing the SUM, N=1, and N=500 curves, the Gordon Bell Prize winners, and my laptop.]

[Charts: processor share of the Top500: Intel 81% (406 systems), AMD 11% (57), IBM 8% (40); countries share; customer segments.]

[Figure: Performance of the Top20 over 10 years (Jun 2001 to Jun 2010, in Pflop/s), showing the succession of leading systems: ASCI White (LLNL), Earth Simulator, BG/L (LLNL), Roadrunner (LANL), Jaguar (ORNL), Tianhe-1A (NSCC Tianjin).]

Pflop/s Club (11 systems; peak)
Name | Peak Pflop/s | "Linpack" Pflop/s | Country | System
Tianhe-1A | 4.70 | 2.57 | China | NUDT: hybrid Intel/Nvidia/self
Nebulae | 2.98 | 1.27 | China | Dawning: hybrid Intel/Nvidia/IB
Jaguar | 2.33 | 1.76 | US | Cray: AMD/self
Tsubame 2.0 | 2.29 | 1.19 | Japan | HP: hybrid Intel/Nvidia/IB
Roadrunner | 1.38 | 1.04 | US | IBM: hybrid AMD/Cell/IB
Hopper | 1.29 | 1.054 | US | Cray: AMD/self
Tera-100 | 1.25 | 1.050 | France | Bull: Intel/IB
Mole-8.5 | 1.14 | 0.207 | China | CAS: hybrid Intel/Nvidia/IB
Kraken | 1.02 | 0.831 | US | Cray: AMD/self
Cielo | 1.02 | 0.817 | US | Cray: AMD/self
JuGene | 1.00 | 0.825 | Germany | IBM: BG-P/self

[Figure: Performance of countries (aggregate Top500 performance, log scale), built up over four slides: US, then EU, Japan, and China.]

Of the 500 fastest supercomputers worldwide, industrial use is > 60%. Application areas include:
• Aerospace
• Automotive
• Biology
• CFD
• Database
• Defense
• Digital Content Creation
• Digital Media
• Electronics
• Energy
• Environment
• Finance
• Gaming
• Geophysics
• Image Proc./Rendering
• Information Processing Service
• Information Service
• Life Science
• Media
• Medicine
• Pharmaceutics
• Research
• Retail
• Semiconductor
• Telecomm
• Weather and Climate Research
• Weather Forecasting

• Google facilities are leveraging hydroelectric power and old aluminum plants ("Hiding in Plain Sight, Google Seeks More Power", by John Markoff, NYT, June 14, 2006). [Photo: Google plant in The Dalles, Oregon, from the NYT, June 14, 2006.]
• Microsoft and Yahoo are building big data centers upstream in Wenatchee and Quincy, Wash., to keep up with Google, which means they need cheap electricity and readily accessible data networking.
• Microsoft Quincy, Wash.: 470,000 sq ft, 47 MW!
• Facebook: 300,000 sq ft, 1.5 cents per kWh, Prineville, OR
• Microsoft: 700,000 sq ft in Chicago
• Apple: 500,000 sq ft in rural NC, 4 cents per kWh

Top 10 systems in the Top500
Rank | Site | Computer | Country | Cores | Rmax [Pflop/s] | % of Peak | Power [MW] | Mflops/W
1 | Nat. Supercomputer Center in Tianjin | Tianhe-1A: NUDT YH, X5670 2.93 GHz 6C + NVIDIA GPU | China | 186,368 | 2.57 | 55 | 4.04 | 636
2 | DOE/OS Oak Ridge Nat Lab | Jaguar: Cray XT5, six-core 2.6 GHz | USA | 224,162 | 1.76 | 75 | 7.0 | 251
3 | Nat. Supercomputer Center in Shenzhen | Nebulae: Dawning TC3600 blade, Intel X5650 + Nvidia C2050 GPU | China | 120,640 | 1.27 | 43 | 2.58 | 493
4 | GSIC Center, Tokyo Institute of Technology | Tsubame 2.0: HP ProLiant SL390s G7, Xeon 6C X5670 + Nvidia GPU | Japan | 73,278 | 1.19 | 52 | 1.40 | 850
5 | DOE/SC/LBNL/NERSC | Hopper: Cray XE6, 12-core 2.1 GHz | USA | 153,408 | 1.054 | 82 | 2.91 | 362
6 | Commissariat a l'Energie Atomique (CEA) | Tera-100: Bull bullx supernode S6010/S6030 | France | 138,368 | 1.050 | 84 | 4.59 | 229
7 | DOE/NNSA Los Alamos Nat Lab | Roadrunner: IBM BladeCenter QS22/LS21 | USA | 122,400 | 1.04 | 76 | 2.35 | 446
8 | NSF/NICS/U of Tennessee | Kraken: Cray XT5, six-core 2.6 GHz | USA | 98,928 | 0.831 | 81 | 3.09 | 269
9 | Forschungszentrum Juelich (FZJ) | Jugene: IBM Blue Gene/P Solution | Germany | 294,912 | 0.825 | 82 | 2.26 | 365
10 | DOE/NNSA Los Alamos Nat Lab | Cielo: Cray XE6, 8-core 2.4 GHz | USA | 107,152 | 0.817 | 79 | 2.95 | 277

• China has 3 Pflop/s systems:
  - Tianhe-1A (NUDT), located in Tianjin: dual Intel 6-core + Nvidia Fermi with a custom interconnect. Budget 600M RMB (MOST 200M RMB, Tianjin government 400M RMB).
  - Nebulae (Dawning 6000, CIT), located in Shenzhen: dual Intel 6-core + Nvidia Fermi with QDR InfiniBand. Budget 600M RMB (MOST 200M RMB, Shenzhen government 400M RMB).
  - Mole-8.5 cluster: 320 x 2 Intel quad-core Xeon E5520 2.26 GHz + 320 x 6 Nvidia Tesla C2050, QDR InfiniBand.
• A fourth system is planned for Shandong.

• Loongson (Chinese: 龙芯; academic name: Godson, also known as the Dragon chip) is a family of general-purpose MIPS-compatible CPUs developed at the Institute of Computing Technology, Chinese Academy of Sciences.
• The chief architect is Professor Weiwu Hu.
• The 65 nm Loongson 3 (Godson-3) runs at a clock speed between 1.0 and 1.2 GHz, with 4 CPU cores (10 W) first and 8 cores (20 W) later, and is expected to debut in 2010.
• China plans to use this chip as the basis for a petascale system in 2010.

Jaguar at ORNL (DOE Office of Science) was recently upgraded to a 2.3 Pflop/s system with more than 224K processor cores using AMD's 6-core chips.
Peak performance | 2.3 PF
System memory | 300 TB
Disk space | 10 PB
Disk bandwidth | 240+ GB/s
Interconnect bandwidth | 374 TB/s
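The last two columns of the Top10 table are simple derived quantities: % of peak is Rmax divided by the theoretical peak, and the flops-per-watt figure is Rmax divided by total power draw. A minimal sketch in C, using the Tianhe-1A numbers quoted in the tables above (Rmax 2.57 Pflop/s, 4.70 Pflop/s peak from the Pflop/s Club table, 4.04 MW):

```c
#include <stdio.h>

int main(void) {
    /* Tianhe-1A figures taken from the tables above */
    double rmax_pflops  = 2.57;   /* Linpack Rmax, Pflop/s        */
    double rpeak_pflops = 4.70;   /* theoretical peak, Pflop/s    */
    double power_mw     = 4.04;   /* total power, megawatts       */

    /* Efficiency: fraction of peak actually sustained on Linpack */
    double pct_of_peak = 100.0 * rmax_pflops / rpeak_pflops;

    /* Energy efficiency: convert to flop/s and watts, report as Mflop/s per W */
    double mflops_per_watt = (rmax_pflops * 1e15) / (power_mw * 1e6) / 1e6;

    printf("%% of peak   : %.0f %%\n", pct_of_peak);     /* ~55 %  */
    printf("Mflops/Watt : %.0f\n", mflops_per_watt);     /* ~636   */
    return 0;
}
```

The same two lines of arithmetic reproduce the 75%/251 figures for Jaguar and the 850 Mflops/W that makes Tsubame 2.0 the most power-efficient entry in the list.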
University of Tennessee's National Institute for Computational Sciences:
• Housed at ORNL, operated for the NSF, and named Kraken
• Number 8 on the Top500
• Just upgraded to 1 Pflop/s peak: 99,072 cores, AMD 2.6 GHz 6-core chips, with 129 TB of memory

University of Illinois: Blue Waters will be the powerhouse of the National Science Foundation's strategy to support supercomputers for scientists nationwide.
• T1: Blue Waters (NCSA/Illinois), 10 Pflop/s peak and 1 Pflop/s sustained, in 2010
• T2: Kraken (NICS/U of Tennessee), 1 Pflop/s peak; Ranger (TACC/U of Texas), 504 Tflop/s peak
• T3: several sites on campuses across the U.S., 50-100 Tflop/s peak

• In the Top500: 427 systems use quad-core processors, 59 use dual-core, 6 use 9-core, and 2 use 6-core.
• Example multicore chips: Intel Clovertown (4 cores), IBM BG/P (4 cores), AMD Istanbul (6 cores), Sun Niagara2 (8 cores), Fujitsu Venus (8 cores), IBM Power 7 (8 cores), IBM Cell (9 cores), Intel Polaris [experimental] (80 cores).

• Today: a typical server node chip has ~8 cores, so a 1K-node cluster has 8,000 cores; a laptop has ~2 (low-power) cores; the Intel SCC has 48 cores.
• By 2020: a typical server node chip may have ~400 cores, so a 1K-node cluster would have 400,000 cores; a laptop ~100 (low-power) cores.

[Figure: Predicted number of CPU cores per die for server and laptop chips over the next 13 years, assuming continuation of the historical 58% per year density improvement; Intel's 80-core (teraflop) chip and Tilera's 100 general-purpose cores are marked.]

• The future will most likely be a hybrid design: think standard multicore chips plus accelerators (GPUs).
• Today accelerators are attached; the next generation will be more integrated.
• Intel's "Knights Ferry" and "Knights Corner" are to come (48 x86 cores).
• AMD's Fusion in 2011-2013: multicore with embedded (ATI) graphics.
• Nvidia's Project Denver plans an integrated chip using the ARM architecture in 2013.

• High levels of parallelism: many GPU cores, serial kernel execution [e.g., 240 cores in the Nvidia Tesla; up to 512 in Fermi, which adds concurrent kernel execution].
• Hybrid/heterogeneous architectures: match algorithmic requirements to architectural strengths [e.g., small, non-parallelizable tasks run on the CPU; large, parallelizable ones on the GPU].
• Compute vs. communication gap: an exponentially growing, persistent challenge [processor speed improves 59% per year, memory bandwidth 23%, latency 5.5%] [on all levels; e.g., a GPU Tesla S1070 (4 x C1060) has compute power of O(1,000) Gflop/s, but the GPUs communicate through the CPU over an O(1) GB/s connection].
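To put a number on that compute/communication gap: with O(1,000) Gflop/s of GPU compute behind an O(1) GB/s host connection, a kernel must perform on the order of a thousand flops for every byte it moves across the link, or the link becomes the bottleneck. A minimal back-of-the-envelope sketch in C using those round slide numbers, and assuming, for illustration, a square matrix multiply whose operands must cross the link:

```c
#include <stdio.h>

int main(void) {
    /* Round numbers from the slide: order-of-magnitude assumptions, not measurements. */
    double gpu_gflops  = 1000.0;  /* O(1,000) Gflop/s of GPU compute          */
    double link_gbytes = 1.0;     /* O(1) GB/s between host CPU and the GPUs  */

    /* Flops the GPU could do in the time it takes to move one byte.
       A kernel with fewer flops per byte than this is limited by the link. */
    double flops_per_byte_needed = gpu_gflops / link_gbytes;

    /* Square matrix multiply: moving two n x n double matrices costs
       2 * 8 * n^2 bytes and enables about 2n^3 flops, i.e. n/8 flops per
       byte -- so n must reach roughly 8 * 1000 before the GPU, rather
       than the connection, is the limit.                                  */
    double n = 8.0 * flops_per_byte_needed;

    printf("Required arithmetic intensity: %.0f flops per byte moved\n",
           flops_per_byte_needed);
    printf("Break-even matrix dimension (square matmul): n ~ %.0f\n", n);
    return 0;
}
```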
[Figure: Moore's Law is alive and well; transistors per chip (in thousands), 1970-2010, on a log scale. Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović; slide from Kathy Yelick.]

[Figure: But clock-frequency scaling has been replaced by scaling cores per chip; transistors (in thousands), frequency (MHz), and cores, 1970-2010. Fifteen years of exponential growth (~2x per year) has ended. Same data sources; slide from Kathy Yelick.]

[Figure: Performance has also slowed, along with power; transistors (in thousands), frequency (MHz), power (W), and cores, 1970-2010. Power is the root cause of all this; a hardware issue just became a software problem. Same data sources; slide from Kathy Yelick.]

• Power ∝ Voltage² x Frequency (V²F)
• Frequency ∝ Voltage
• Power ∝ Frequency³

• The number of cores per chip doubles every 2 years, while clock speed decreases (not increases).
• We need to deal with systems with millions of concurrent threads, and future generations will have billions of threads!
• We need to be able to easily replace inter-chip parallelism with intra-chip parallelism.
• The number of threads of execution doubles every 2 years.
[Figure: Average number of cores per supercomputer in the Top500, rising toward 100,000.]

• We must rethink the design of our software: this is another disruptive technology, similar to what happened with cluster computing and message passing.
• Rethink and rewrite the applications, algorithms, and software.

Systems | 2009 | 2015 | 2018
System peak | 2 Pflop/s | 100-200 Pflop/s | 1 Eflop/s
System memory | 0.3 PB | 5 PB | 10 PB
Node performance | 125 Gflop/s | 400 Gflop/s | 1-10 Tflop/s
Node memory BW | 25 GB/s | 200 GB/s | >400 GB/s
Node concurrency | 12 | O(100) | O(1000)
Interconnect BW | 1.5 GB/s | 25 GB/s | 50 GB/s
System size (nodes) | 18,700 | 250,000-500,000 | O(10^6)
Total concurrency | 225,000 | O(10^8) | O(10^9)
Storage | 15 PB | 150 PB | 300 PB
I/O | 0.2 TB/s | 10 TB/s | 20 TB/s
MTTI | days | days | O(1 day)
Power | 7 MW | ~10 MW | ~20 MW

• The steepness of the ascent from terascale to petascale to exascale.
• Extreme parallelism and hybrid design: preparing for million/billion-way parallelism.
• A tightening memory/bandwidth bottleneck:
  - Limits on power and clock speed, and their implications for multicore.
  - The need to reduce communication will become much more intense.
  - Memory per core changes, and the byte-to-flop ratio will change (a worked sketch follows below).
• Necessary fault tolerance:
  - MTTF will drop.
  - Checkpoint/restart has limitations.
  - The software infrastructure does not exist today.
www.exascale.org

• For the last decade or more, the research investment strategy has been overwhelmingly biased in favor of hardware.
• This strategy needs to be rebalanced: barriers to progress are increasingly on the software side.
• Moreover, the return on investment is more favorable for software. Hardware has a half-life measured in years, while software has a half-life measured in decades.
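The changing byte-to-flop ratio mentioned in the challenges above can be read directly off the projected-systems table: divide system memory by system peak. A minimal sketch, taking the 2015 peak at the midpoint of the 100-200 Pflop/s range as an assumption so there is a single number to divide by:

```c
#include <stdio.h>

int main(void) {
    /* Values from the projected-systems table (2009 / 2015 / 2018).
       The 2015 peak uses the midpoint of 100-200 Pflop/s as an assumption. */
    const char *year[]   = { "2009", "2015", "2018" };
    double mem_pbytes[]  = { 0.3,    5.0,    10.0   };  /* system memory, PB    */
    double peak_pflops[] = { 2.0,    150.0,  1000.0 };  /* system peak, Pflop/s */

    for (int i = 0; i < 3; i++) {
        /* PB divided by Pflop/s gives bytes per (flop/s): the bytes-per-flop ratio */
        double bytes_per_flop = mem_pbytes[i] / peak_pflops[i];
        printf("%s: %.3f bytes of memory per flop/s of peak\n",
               year[i], bytes_per_flop);
    }
    /* Prints 0.150, 0.033, 0.010: memory per unit of compute shrinks ~15x by 2018. */
    return 0;
}
```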
• The high-performance ecosystem is out of balance: hardware, OS, compilers, software, algorithms, applications.
• There is no Moore's Law for software, algorithms, and applications.

• The simplest and most useful way to classify modern parallel computers is by their memory model: shared memory or distributed memory.

[Diagram: processors sharing one memory over a bus vs. processors with private memories connected by a network.]
• Shared memory: a single address space; all processors have access to a pool of shared memory. (Ex: SGI Origin, Sun E10000)
• Distributed memory: each processor has its own local memory, and message passing must be used to exchange data between processors. (Ex: Cray T3E, IBM SP, clusters)

[Diagram: a single bus with one memory (UMA) vs. several bus/memory groups joined by a network (NUMA).]
• Uniform memory access (UMA): each processor has uniform access to memory; such machines are also known as symmetric multiprocessors. (Sun E10000)
• Non-uniform memory access (NUMA): the time for a memory access depends on the location of the data; local access is faster than non-local access, and these machines are easier to scale than SMPs. (SGI Origin)

• Processor-memory nodes are connected by some type of interconnect network.
  - Massively parallel processor (MPP): tightly integrated, with a single system image.
  - Cluster: individual computers connected by software.
[Diagram: CPU + memory nodes attached to an interconnect network.]

• Latency: how long does it take to start sending a "message"? Measured in microseconds. (Also within processors: how long does it take to output the results of operations, such as floating-point add and divide, which are pipelined?)
• Bandwidth: what data rate can be sustained once the message is started? Measured in Mbytes/sec.

Percentage of peak
• A rule of thumb that often applies: a contemporary processor, across a spectrum of applications, delivers (i.e., sustains) about 10% of peak performance.
• There are exceptions to this rule, in both directions.
• Why such low efficiency?

Why fast machines run slow
• Latency: waiting for access to memory or other parts of the system.
• Overhead: extra work that has to be done to manage program concurrency and parallel resources, beyond the real work you want to perform.
• Starvation: not enough work to do, due to insufficient parallelism or poor load balancing among distributed resources.
• Contention: delays due to fighting over which task gets to use a shared resource next. Network bandwidth is a major constraint.

[Figure: Processor-DRAM memory gap ("Moore's Law"): microprocessor performance improves ~60% per year (2x every 1.5 years), DRAM ~9% per year (2x every 10 years), so the processor-memory performance gap grows ~50% per year.]

Memory hierarchy
[Figure: typical latencies for today's technology at each level of the memory hierarchy.]

My laptop
• 2.13 GHz, 2 double-precision ops per cycle per core, two cores: 8.54 Gflop/s peak.
• FSB at 1.07 GHz with a 64-bit (8-byte) data path: 8.56 GB/s, or, at 8 bytes per double-precision word, 1.07 GW/s from memory.

Intel Clovertown
• Quad-core processor; each core does 4 floating-point ops per cycle.
• At, say, 2.4 GHz: 4 cores x 4 flops/cycle x 2.4 GHz = 38.4 Gflop/s peak.
• FSB at 1.066 GHz: 1.066 GHz x 8 bytes / 8 bytes per word = 1.066 GW/s from memory.
• There's your problem: 38.4 Gflop/s of compute fed by roughly 1 GW/s of operands.

Three types of cache misses
• Compulsory (or cold-start) misses: the first access to the data; can be reduced via bigger cache lines or some prefetching.
• Capacity misses: misses due to the cache not being big enough; can be reduced via a bigger cache.
• Conflict misses: misses due to some other memory line having evicted the needed cache line; can be reduced via higher associativity.

Tuning for caches
1. Preserve locality.
2. Reduce cache thrashing.
3. Loop blocking when out of cache (see the sketch below).
4. Software pipelining.
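Item 3, loop blocking, is easiest to see on matrix multiply: rather than streaming entire rows and columns through the cache, operate on tiles small enough that all three tiles stay cache-resident, so each element is reused many times per miss. A minimal sketch, with the tile size BS = 64 as an assumption to be tuned per machine:

```c
#define BS 64   /* tile size: pick so three BS x BS double tiles fit in cache */

static int imin(int a, int b) { return a < b ? a : b; }

/* Blocked (tiled) C = C + A*B for n x n row-major matrices stored as 1-D arrays. */
void matmul_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                /* one tile-by-tile product: the three tiles stay cache-resident */
                for (int i = ii; i < imin(ii + BS, n); i++)
                    for (int k = kk; k < imin(kk + BS, n); k++) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < imin(jj + BS, n); j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}
```

The inner (i, k, j) order also keeps the accesses to B and C at unit stride, so the sketch combines items 1 and 3 of the list above.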
The Principle of Locality
• A program accesses a relatively small portion of the address space at any instant of time.
• Two different types of locality:
  - Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
  - Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon (e.g., straight-line code, array access).
• For the last 15 years, hardware has relied on locality for speed.

Principles of locality
• Temporal: an item referenced now will be referenced again soon.
• Spatial: an item referenced now causes its neighbors to be referenced soon.
• Lines, not words, are moved between memory levels, so both principles are exploited. There is an optimal line size based on the properties of the data bus and the memory subsystem design.
• Cache lines are typically 32-128 bytes.

Counting cache misses
• For an n x n 2-D array with element size e bytes and cache-line size b bytes:
• Traversing the array in the order it is laid out in memory gives one cache miss for every cache line: n² x e/b misses for n² memory accesses, for a miss rate of e/b. Example: 4 bytes / 64 bytes = 6.25% (unless the array is very small and stays in cache).
• Traversing the array across cache lines (stride-n access) gives one cache miss for every access: a miss rate of 100% (unless the array is very small).

Cache thrashing
• Thrashing occurs when frequently used cache lines repeatedly replace each other. There are three primary causes:
  - Instructions and data can conflict, particularly in unified caches.
  - Too many variables, or arrays too large to fit into cache, are accessed.
  - Indirect addressing, e.g., sparse matrices.
• Machine architects can add sets to the associativity, and users can buy another vendor's machine; however, neither solution is realistic.

HW#2: Look at matrix multiply, C ← C + A × B

for _ = 1, n
  for _ = 1, n
    for _ = 1, n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)

Look at the performance for the various orderings of i, j, and k (a starting-point sketch follows below).
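As a starting point for the homework, the sketch below times two of the six orderings on square row-major matrices: the i,j,k order strides through B by column, while the i,k,j order keeps the inner loop at unit stride. The size N = 512 and the use of clock() are assumptions; extend it to the remaining orderings (j,i,k; j,k,i; k,i,j; k,j,i) and to other sizes.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 512   /* example size; large enough that the matrices exceed cache */

/* C = C + A*B with the i,j,k loop order: the inner loop walks B by column
   (stride N doubles), which is bad for spatial locality in row-major storage. */
void matmul_ijk(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}

/* Same update with the i,k,j order: the inner loop now walks B and C by row
   (unit stride), so each cache line is fully used before being evicted. */
void matmul_ikj(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double aik = A[i*n + k];
            for (int j = 0; j < n; j++)
                C[i*n + j] += aik * B[k*n + j];
        }
}

static double time_one(void (*mm)(int, const double*, const double*, double*),
                       const double *A, const double *B, double *C)
{
    clock_t t0 = clock();
    mm(N, A, B, C);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    double *A = malloc(N*N*sizeof *A), *B = malloc(N*N*sizeof *B),
           *C = malloc(N*N*sizeof *C);
    for (int i = 0; i < N*N; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    printf("ijk: %.2f s\n", time_one(matmul_ijk, A, B, C));
    printf("ikj: %.2f s\n", time_one(matmul_ikj, A, B, C));
    /* Try the remaining orderings (jik, jki, kij, kji) the same way. */

    free(A); free(B); free(C);
    return 0;
}
```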