The Dark Silicon Implications for Microprocessors Karu Sankaralingam University of Wisconsin-Madison Collaborators: Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, and Doug Burger Multicore Decade? We have relied on multicore scaling for over five years. ? 2000 2005 Pentium Core 2 Extreme Quad-Core Dual-Core 2010 i7 980x Hex-Core 2015 How much longer will it be our primary performance scaling technique? 2 Finding Optimal Multicore Designs Comprehensive design space: Fixed area budget Fixed power budget Two sets of CMOS scaling projections Optimal core and diverse multicore organizations Parallel benchmarks For next 5 technology generations, we find the best performing multicore from a comprehensive design space search for each of the PARSEC benchmarks 3 Symmetric Multicore Projections 20 Speedup 16 18x Target Symmetric 12 8 3.4x in 10 years 4 0 0 2 4 Year 6 8 10 Symmetric multicores alone will not sustain the multicore era. 4 Multicore Solutions 20 16 Speedup Asymmetric Topologies Target Symmetric Asymmetric 12 8 3.5x 4 0 0 2 4 Year 6 8 10 5 Multicore Solutions Speedup 20 Dynamic Topologies Target 16 Symmetric 12 Dynamic Asymmetric 8 3.5x 4 0 0 2 4 Year 6 8 10 [Chakraborty (2008), Suleman et al (2009)] 6 Multicore Solutions Speedup 20 Composed/Fused Topologies Target 16 Symmetric Asymmetric 12 Dynamic Composed 8 3.7x 4 0 0 2 4 Year 6 8 10 [Ipek et al (2007), Kim et al (2007)] 7 Multicore Solutions 20 Symmetric 16 Speedup GPU-Style Cores Target Asymmetric Dynamic 12 Composed GPU 8 2.7x 4 0 0 2 4 Year 6 8 10 8 Multicore Era Projections 20 Speedup 16 18x Target Composed Composed 12 8 3.7x 4 0 0 2 4 Year 6 8 The best designs speed up 14% per year rather than the recent trend of 34% per year 10 9 Why Diminishing Returns? Transistor area is still scaling Voltage and capacitance scaling have slowed Result: designs are power, not area, limited 10 Overview Devices • Find the best case technology scaling Cores • Find the best cores Multicores • Find the best multicore organization Projections • Predict best case multicore performance for each technology generation 11 Device Scaling Projections From 45 nm to 8 nm: Conservative Optimistic Area 32x 32x Power 4.5x 8.3x Frequency 1.3x 3.9x [Borkar 2007] [ITRS 2010] 12 Modeling Ideal Core Power/Perf. 30 Intel Nehalem AMD Shanghai Intel Core Intel Atom Power (TDP, Watts) 25 Nehalem 20 Pareto Frontier includes all optimal power/performance points 15 Repeat using core area for optimal area/performance points 10 5 0 0 Atom 10 20 SPECmark Score 30 40 13 Combining Device and Core Models 30 45 nm Frontier Power (TDP, Watts) 25 20 Device Scaling 32 nm Frontier 15 10 5 0 0 10 20 SPECmark Score 30 40 14 Overview Devices • Find the best case technology scaling Cores • Find the best cores Multicores • Find the best multicore organization Projections • Predict best case multicore performance for each technology generation 15 What belongs in multicore model? Styles Number of Threads, Cache Sizes Topologies Pareto Frontiers Area & Power Budget Architectures Cache & memory latencies, memory bandwidth Area & Power / Performance Tradeoffs Applications PARSEC fparallel, Data Use 16 Multicore Speedup Model Multicore = Speedup 1 1-fparallel fparallel + Serial Speedup Parallel Speedup 17 Multicore Performance Model Performance is limited by: Memory bandwidth BWmax / (instructions per byte from memory) and Computation Ncores (core frequency/CPIexe) core utilization [Guz et al, 2009] 18 Core Utilization Model Core utilization is limited by: Fraction of Time Core is Ready to Issue Number of Threads in Core / Number of Threads to Keep Busy [Guz et al, 2009] 19 Multicore Model & Pareto Frontiers 30 25 100 points 20 A(q), 15 P(q) 10 5 0 0 10 20 q 30 40 20 Translating from SPECmark 1. From q, find core’s SPECmark speedup 2. Frequency linearly distributed from Atom to Nehalem 3. Recall: model predicts benchmark performance as f(benchmark chars, frequency, CPIexe) 4. Compute CPIexe such that Benchmark Speedup = SPECmark Speedup 21 Area and Power Constraints Ncores x A(q) ≤ Area Budget Ncores x P(q) ≤ Power Budget Dark silicon = Ncores / # of cores that fit in chip area 22 Overview Devices • Find the best case technology scaling Cores • Find the best cores Multicores • Find the best multicore organization Projections • Predict best case multicore performance for each technology generation 23 Dark Silicon Percentage Dark Silicon 100% 100% 8 nm: AtAt22 ITRS Conservative 80% 80% 71% 60% 60% 51% 40% 40% Sources of Dark Silicon: Power + Limited Parallelism 20% 20% 0% 0% blacksholes bodytrack canneal ferret streamcluster streamcluster 17% 26% GM GM 24 Overall Performance 20 ITRS: All Topologies Conservative: All Topologies ITRS: Symmetric Conservative: Symmetric Symmetric Conservative: Target Speedup 16 18x 16x fparallel = 0.99 12 8x 6x 3x 8 4 0 0 2 4 6 8 10 Year 25 Conclusions Multicore performance gains are limited Unicore Era Multicore Era ? Need at least 18%-40% per generation from architecture alone without additional power 26 Specialization Shrinking chips Pervasive Efficiency 27