Slide 1: Runnemede: Disruptive Technologies for UHPC
John Gustafson, Intel Labs
HPC User Forum, Houston, 2011

Slide 2
• "We're going to try to make the entire exascale machine cache-coherent." – Bill Dally, Nvidia
• "Caches are for morons." – Shekhar Borkar, Intel
• The battle lines are drawn…

Slide 3: Intel's UHPC Approach
• Design test chips with the goal of maximizing learning. This is very different from producing product-roadmap processor designs.
• Going from Peta to Exa is nothing like the last few 1000x increases…

Slide 4: Building with Today's Technology
A 1 TFLOP/s machine built today consumes roughly 5 kW:
• Compute: 200 W (200 pJ per FLOP)
• Memory: 150 W (0.1 B/FLOP at 1.5 nJ per Byte)
• Communication: 100 W (100 pJ of communication per FLOP)
• Disk: 100 W (10 TB of disk at 1 TB/disk, 10 W each)
• Decode and control, translations, power supply losses, cooling, etc.: 4,450 W
At these energies the progression is kW for Tera, MW for Peta... GW for Exa?

Slide 5: The Power & Energy Challenge
A TFLOP machine today versus a TFLOP machine then, built with Exa technology:
• Compute: 200 W today → 5 W
• Memory: 150 W today → 2 W
• Communication: 100 W today → ~5 W
• Disk: 100 W today → ~3 W
• Everything else (decode, control, power delivery, cooling): 4,450 W today → ~5 W
• Total: ~5 kW today → ~20 W

Slide 6: Scaling Assumptions
• Technology (high volume): 45 nm (2008), 32 nm (2010), 22 nm (2012), 16 nm (2014), 11 nm (2016), 8 nm (2018), 5 nm (2020)
• Transistor density: 1.75x per generation
• Frequency scaling per generation: 15%, 10%, 8%, 5%, 4%, 3%, 2%
• Vdd scaling per generation: -10%, -7.5%, -5%, -2.5%, -1.5%, -1%, -0.5%
• Source-drain leakage scaling per micron: 1x (optimistic) to 1.43x (pessimistic)
Core plus local memory, scaled across generations:
• 65 nm: DP FP add and multiply, integer core, register file, and router occupy 5 mm² (50%); 0.35 MB of memory occupies 5 mm² (50%). Total: 10 mm², 3 GHz, 6 GF, 1.8 W.
• 8 nm: the same logic shrinks to 0.17 mm² (50%); 0.35 MB of memory to 0.17 mm² (50%). Total: 0.34 mm² (~0.6 mm on a side), 4.6 GHz, 9.2 GF, 0.24 to 0.46 W.

Slide 7: Near-Threshold Logic
[Figure: 65 nm CMOS at 50°C. Maximum frequency, total power, and active leakage power versus supply voltage, down into the subthreshold region at 320 mV; energy efficiency (GOPS/Watt) peaks near threshold, roughly a 9.6x gain over nominal voltage. Source: H. Kaul et al., paper 16.6, ISSCC 2008.]

Slide 8: Revise DRAM Architecture
Signaling energy cost today: ~150 pJ/bit.
Traditional DRAM (page selected by RAS, column by CAS):
• Activates many pages
• Lots of reads and writes (refresh)
• Only a small fraction of the read data is used
• Requires a small number of pins
New DRAM architecture:
• Activates few pages
• Reads and writes (refreshes) only what is needed
• All read data is used
• Requires a large number of I/Os (3D stacking)

Slide 9: Data Locality
• Core-to-core communication on the chip: ~10 pJ per Byte
• Chip-to-chip communication: ~100 pJ per Byte
• Chip-to-memory communication: ~1.5 nJ per Byte today (~150 pJ per Byte with the revised DRAM interface)
• Data movement is expensive, so keep it local: (1) core to core, (2) chip to chip, (3) memory.
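To make the energy arithmetic behind the "Building with Today's Technology" and "Data Locality" slides concrete, the sketch below recomputes the component power of a 1 TFLOP/s machine from the quoted per-operation and per-byte energies. It is only a back-of-the-envelope illustration; the variable names and the assumption of a sustained 1e12 FLOP/s rate are mine, not part of the slides.

```python
# Back-of-the-envelope power budget for a 1 TFLOP/s machine,
# using the per-operation energies quoted on the slides.
# Power (W) = rate (ops/s or bytes/s) * energy per operation (J).

FLOPS = 1e12                    # assumed sustained rate: 1 TFLOP/s

compute_energy = 200e-12        # 200 pJ per FLOP
memory_energy  = 1.5e-9         # 1.5 nJ per byte of memory traffic
memory_bytes   = 0.1 * FLOPS    # 0.1 bytes of memory traffic per FLOP
com_energy     = 100e-12        # 100 pJ of communication per FLOP

compute_W = FLOPS * compute_energy          # -> 200 W
memory_W  = memory_bytes * memory_energy    # -> 150 W
com_W     = FLOPS * com_energy              # -> 100 W
disk_W    = 10 * 10                         # 10 disks (1 TB each) at 10 W -> 100 W

print(f"Compute: {compute_W:.0f} W, Memory: {memory_W:.0f} W, "
      f"Com: {com_W:.0f} W, Disk: {disk_W:.0f} W")

# At ~5 kW per sustained TFLOP/s (including the ~4.45 kW of overhead),
# a naive 1000x scale-up gives ~5 MW per PFLOP/s and ~5 GW per EFLOP/s:
# the "kW Tera, MW Peta, GW Exa?" progression on slide 4.
```

The same arithmetic, rerun with the Exa-technology energies of slide 5, is what brings the per-TFLOP budget down to roughly 20 W.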
Slide 10: Disruptive Approach to Faults
We tend to assume that execution faults (soft errors, hard errors) are rare, and currently that is a valid assumption. Soon, however, hardware designs will need much more paranoia.

Slide 11: Road to Unreliability?
From Peta to Exa, the reliability issues multiply:
• 1,000x parallelism: more hardware for something to go wrong; >1,000x intermittent faults due to soft errors
• Aggressive Vcc scaling to reduce power and energy: gradual faults due to increased variations; more susceptibility to Vcc droops (noise) and to dynamic temperature variations; exacerbated intermittent faults (soft errors)
• Deeply scaled technologies: aging-related faults; lack of burn-in?
Variability increases dramatically. Resiliency will be the cornerstone.

Slide 12: Resiliency
Fault classes and examples:
• Permanent faults: stuck-at 0 and 1
• Gradual faults: variability, temperature
• Intermittent faults: soft errors, voltage droops
• Aging faults: degradation
Faults cause errors, in both data and control:
• Datapath errors: detected by parity/ECC
• Silent data corruption: needs hardware hooks
• Control errors: control lost (blue screen)
Every layer of the stack participates, with minimal overhead for resiliency: applications, system software, programming system, microcode and platform, microarchitecture, circuit and design. Together they provide error detection, fault isolation, fault confinement, reconfiguration, and recovery and adaptation.

Slide 13: Execution Model and Codelets
• Programming models/systems (rich)
• Sea of codelets (a minimal sketch of the firing rule follows the Summary slide)
  – Codelet: code that can be executed non-preemptively with an "event-driven" model
  – Shared-memory model based on LC (Location Consistency, a generalized single-assignment model [GaoSarkar1980])
• Run-time system
• Hardware abstraction layer over the network, cores, and peripherals/devices
• Advanced hardware monitoring throughout

Slide 14: Summary
• Voltage scaling to reduce power and energy
  – Explodes parallelism
  – Cost of communication vs. computation is a critical balance
  – Resiliency to combat the side effects and unreliability
• Programming system for extreme parallelism
• Application-driven, HW/SW co-design approach
• Self-awareness and an execution model to harmonize it all
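As an illustration of the codelet idea on the "Execution Model and Codelets" slide, here is a minimal event-driven sketch: each codelet runs non-preemptively once all of its input events have arrived, and its completion signals downstream codelets. The class and function names are hypothetical and exist only to show the dependence-driven firing rule; the actual Runnemede runtime (location-consistency memory, hardware monitoring, resiliency hooks) is far richer.

```python
from collections import deque

class Codelet:
    """A non-preemptive unit of work that fires when all its input events arrive."""
    def __init__(self, name, num_inputs, body):
        self.name = name
        self.pending = num_inputs    # input events still outstanding
        self.body = body             # plain function, runs to completion
        self.successors = []         # codelets to signal when this one finishes

class CodeletRuntime:
    """Toy runtime: keeps a queue of ready codelets and runs each to completion."""
    def __init__(self):
        self.ready = deque()

    def signal(self, codelet):
        codelet.pending -= 1
        if codelet.pending == 0:     # all inputs arrived: codelet becomes ready
            self.ready.append(codelet)

    def run(self):
        while self.ready:
            c = self.ready.popleft()
            c.body()                 # executes non-preemptively, no blocking inside
            for s in c.successors:   # completion is itself an event
                self.signal(s)

# Usage: two producers feed one consumer; the consumer fires only after both finish.
rt = CodeletRuntime()
consume = Codelet("consume", 2, lambda: print("consume: both inputs ready"))
prod_a = Codelet("prod_a", 0, lambda: print("prod_a: produced A"))
prod_b = Codelet("prod_b", 0, lambda: print("prod_b: produced B"))
prod_a.successors.append(consume)
prod_b.successors.append(consume)
rt.ready.extend([prod_a, prod_b])    # producers have no inputs, so they start ready
rt.run()
```

Because a codelet never blocks once it starts, the runtime can schedule purely on data availability, which is what makes the "sea of codelets" model attractive for machines where communication, not computation, dominates the energy budget.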