The Yin and Yang of Hardware Heterogeneity: Can Software Survive?
Kathryn S McKinley

Computation
Turing, 1936

The Transistor
Shockley, Bardeen, Brattain, 1947

Virtuous Cycle
Doubling of Transistors: faster, smaller, cheaper, …
Software Innovation, Device Innovation
Software Complexity, Hardware Complexity
Sequential Interface

Hardware

Dennard Scaling is over
Power = Clock Speed × Voltage²
[Figure: Performance and Power]

Dark silicon
Powered fraction of a chip @ 40 mm², 30 W: 50% at 90nm, 18% at 45nm, 9% at 32nm [Goulding et al. Hot Chips 2010]
Electricity costs in U.S. Data Centers: $4.5 billion in 2006, $7.4 billion in 2011 [U.S. EPA 2007]
Battery life

Multicore Hardware
Pentium4 (130), 2003: 130nm, 55M transistors, 131 mm², 1 core, 2-way SMT, Northwood
C2D (65), 2006: 65nm, 291M transistors, 143 mm², 2 cores, no SMT, Conroe
C2Q (65): Kentsfield
i7 (45), 2008: 45nm, 731M transistors, 263 mm², 4 cores, 2-way SMT, Bloomfield
Atom (45), 2008: 45nm, 47M transistors, 36 mm², 1 core, 2-way SMT, Diamondville
C2D (45), 2009: 45nm, 228M transistors, 82 mm², 2 cores, no SMT, Wolfdale
AtomD (45), 2009: 45nm, 176M transistors, 87 mm², 2 cores + GPU, 2-way SMT, Pineview
i5 (32), 2010: 32nm, 382M transistors, 81 mm², 2 cores, 2-way SMT, Clarkdale
Each die shown at correct scale

Virtuous Cycle
Doubling of Transistors: faster, smaller, cheaper, …
End of Dennard Scaling
Software Innovation, Device Innovation
Software Complexity, Hardware Complexity
Sequential Interface, Parallel Interface

Software

PC Software era ending

Software = Mobile + Web + Cloud

Software People
App Writers, Computer Scientists

Languages People Use
PHP

Software
Performance (fast enough), Productivity, Managing complexity, Abstractions

Hardware & Software pulling in opposite directions

Hardware & Software Marriage

How’s this marriage working out?

What should we measure?
energy = ∫ power dt

Workloads
61 benchmarks, grouped as native or managed and as scalable or non-scalable; each group has weight 0.25
Native Non-scalable: 27 benchmarks in C, C++, Fortran (SPEC CPU2006)
Java Non-scalable: 18 benchmarks in Java (SPEC jvm98, DaCapo, pjbb2005)
Native Scalable: 11 benchmarks in C, C++ (PARSEC)
Java Scalable: 5 benchmarks in Java (DaCapo)

Power vs Performance
Power is benchmark dependent
[Figure: Power (W) versus Performance / Reference Performance for Pentium 4 (130), Core 2 Duo (65), Core 2 Duo (45), i7 (45), and i5 (32), 2003 to 2010]

Parallelism did not solve the power problem

Web Page Energy on Windows Phone
[Figure: Load energy (mWh) and data downloaded (bytes) for amazon, apple, baidu, bing, ebay, google, msn, paypal, wikipedia, youtube]

What’s next in hardware?
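As a worked illustration of the energy = ∫ power dt definition used for the measurements above, the following minimal sketch approximates the integral from sampled power readings with the trapezoidal rule. The function name, the (time, watts) input format, and the sample values are all assumptions made for illustration; they are not taken from the talk or its measurement setup.

```python
from typing import Sequence, Tuple


def energy_joules(samples: Sequence[Tuple[float, float]]) -> float:
    """Approximate energy = integral of power dt with the trapezoidal rule.

    samples: (time_seconds, power_watts) pairs sorted by time.
    This input format is a hypothetical choice for the sketch.
    """
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("samples must be sorted by time")
        # Area of one trapezoid: average power times elapsed time.
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


# Made-up readings: a fast, high-power run and a slow, low-power run.
fast_run = [(0.0, 60.0), (1.0, 60.0), (2.0, 60.0)]  # 60 W for 2 s
slow_run = [(0.0, 20.0), (2.5, 20.0), (5.0, 20.0)]  # 20 W for 5 s

print(energy_joules(fast_run))  # 120.0
print(energy_joules(slow_run))  # 100.0
```

In this made-up case the slower run draws a third of the power but takes 2.5 times as long, so it saves only about 17% of the energy. Power and running time both have to be measured to compare configurations.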
Pareto Analysis (45nm)
Workload determines energy efficient architecture
[Figure: Normalized Group Energy versus Group Performance / Group Reference Performance for Average, Native Non-scale, Java Non-scale, Java Scale, Native Scale]

Workload determines energy efficient architecture
[Table: Pareto-efficient configurations (✔) for each workload group: Native non-scalable, Java non-scalable, Native scalable, Java scalable]
Configurations:
I7(45)4C2T@2.7GHz
I7(45)4C2T@2.7GHz no TB
I7(45)4C2T@2.1GHz
I7(45)4C2T@1.6GHz
I7(45)4C1T@2.7GHz
I7(45)4C1T@2.7GHz no TB
I7(45)2C2T@1.6GHz
I7(45)2C1T@1.6GHz
I7(45)1C2T@2.4GHz
I7(45)1C2T@1.6GHz
I7(45)1C1T@2.7GHz
I7(45)1C1T@2.7GHz no TB
Core2D(45)2C1T@3.1GHz
Core2D(45)2C1T@1.6GHz
Atom(45)1C2T@1.7GHz

Parallelism & Heterogeneity
[Figure: big, small, and custom cores]

Motivation: Single ISA Heterogeneity
NVIDIA Tegra3: 5 Cortex A9 (4 x 1.4 GHz, 1 x 500 MHz)
Texas Instruments OMAP5432: 2 Cortex A15 + 2 Cortex M4

Heterogeneous Hardware
Energy Efficiency, Complexity

Heterogeneous parallel hardware + software
July 31, 1922. Train wreck at Laurel, Maryland [Washington Post, August 1, 1922]

Exploiting Heterogeneity
Parallelism, Ubiquity, Differentiation

Case Studies
Mobile UI [UIST’13]
Managed runtime [ISCA’12]
Always awake [NSDI’09]
Interactive cloud [ICAC’13]

User Interface I/O on OMAP big/little
Kihm, Guimbretière, UIST’13

User Interface I/O & Heterogeneity
Characteristics: Parallelism, Ubiquity, Differentiated
Workload: UI I/O

A9+M3 Heterogeneity
[Figure: Battery life increase (0 to 3) for Scrolling, Pen Inking, Virtual Keyboard, Keyboard under A9, A9 + M3 dispatch, A9 + display control, A9 + M3 execute]

VM Services on little cores
Cao et al., ISCA’12

VM Services & Heterogeneity
Characteristics: Parallelism, Ubiquity, Differentiated
Workload: VM Services ?

[Figure: GC, JIT, and GC & JIT, measured (fill) and model (empty): Performance, Power, Energy, PPE normalized to 1 core at 2.8 GHz, for 2.8 GHz AMD + 2.8 GHz AMD and 2.8 GHz AMD + 1.66 GHz Atom]

Somniloquy
Application stubs on little cores wake up big core as needed
Agarwal et al., NSDI’09
Applications: filtering, notifications, downloads, keep alive
Host (big core): RAM, peripherals, …; Apps, Somniloquy daemon, OS, Network stack
Little core: Wakeup filters, App stubs, Embedded OS, Network stack; CPU + DRAM + Flash; Network interface
Big: Lenovo X60 Laptop + Little: gumstix
[Figure: Power Consumption in Watts (0 to 16) for Little sleep, Somniloquy, Big @ low power, Big core]

Interactive Cloud Services
Bing, Finance, Recommendations, Games
Ren et al., ICAC’13
Characteristics: Parallelism, Ubiquity, Differentiated
Workload: Interactive Services ?

Interactive Services
Workload Characterization of Bing Search
1. Responsiveness deadline ~100ms = interactive
2. Partial execution trades quality for responsiveness
[Figure: Completion vs Quality: Quality (0 to 1) versus Completion Ratio (0 to 1)]

Data Centers are Power Limited
~88 Watts/core
Homogeneous: 5 SMT Nehalem cores (i5 670, 32nm) or 22 SMT AtomD cores (Bonnell, 32nm)

Homogeneous Cores
Throughput or quality, but not both!
[Figure: Quality (0.95 to 1.00, with Quality Goal) versus Queries per second (30 to 300) for 5 Nehalems and 22 Atoms]

Interactive Services
Workload Characterization of Bing Search
1. Responsiveness deadline ~100ms = interactive
2. Partial execution trades quality for responsiveness
3. Unknown, highly variable service demand
[Figures: Completion vs Quality; Demand Distribution (ms), demand from 5 to 100 ms]
Differentiated, Heterogeneous: 3 SMT Nehalem + 9 SMT Atom cores

Slow to Fast scheduling
Long jobs: no difference between slow-to-fast & fast-to-slow
Short jobs: slow to fast consumes less energy
Unknown service demand: long jobs migrate to fast cores
[Figure: timelines for a short job and a long job]

FOF: Fast Old & First Scheduler
1. Schedule fastest core first
2. Promote older jobs to faster cores
[Figure: Small, Medium, Big cores]

Data Centers are Power Limited
~88 Watts/core
Homogeneous: 5 SMT Nehalem cores (i5 670, 32nm) or 22 SMT AtomD cores (Bonnell, 32nm)
Heterogeneous: 3 SMT Nehalem + 9 SMT Atom cores

FOF Heterogeneous Cores
Throughput and quality!
[Figure: Quality (0.95 to 1.00, with Quality Goal) versus Queries per second (30 to 300) for 22 Atoms, 5 Nehalems, and Heterogeneous 3 N + 9 Atoms]
50% throughput increase or buy 33% fewer servers

Simultaneous Multithreading = Dynamic Heterogeneity
SMT off: 4 fast
1 SMT core: 3 fast + 2 slow
2 SMT cores: 2 fast + 4 slow
All SMT: 8 slow

FOF Scheduler for SMT
1. Schedule fastest core first
   Fastest = unshared, share with youngest job
2. Promote older jobs to faster cores
   Free core? Find sharing of (oldest, other), move other to free core

Slow to Fast on SMT
Monte Carlo Finance Server, 6 Core 2-way SMT 3.33 GHz Intel Xeon
[Figure: Quality (0.970 to 1.000) versus Queries per second (4 to 55) for SMT off, SMT + Share Old, SMT + FOF]
16% Improvement

Slow to Fast for Energy
Start on slow
Monte Carlo Finance Server, 6 Core 2-way SMT 3.33 GHz Intel Xeon
[Figure: Energy (Joules, 1.40 to 2.40) versus Average Latency (seconds, 0.20 to 0.30) for Homogeneous, Deadline Only, Slow to Fast]

Key Analytical Results
Slow to Fast Theorem: An optimal solution migrates jobs from slower to faster cores to minimize energy
Heterogeneity Theorem: More heterogeneity (ratio of fastest to slowest core) is desirable for higher p in Lp norm (the tail!)
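
The two FOF rules from the scheduler slides (schedule the fastest core first, promote older jobs to faster cores) can be read as an invariant: at any moment the oldest active job occupies the fastest core in use. The sketch below implements only that simplified reading. The `Core` and `Job` types, the relative speeds, and the job names are hypothetical, and the sketch ignores deadlines, partial execution, migration cost, and SMT sharing, so it should not be taken as the scheduler evaluated in the talk.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Core:
    name: str
    speed: float  # relative speed, higher is faster (hypothetical units)


@dataclass(frozen=True)
class Job:
    name: str
    arrival: float  # arrival time in ms; a smaller value means an older job


def fof_assign(cores: List[Core], jobs: List[Job]) -> Dict[str, str]:
    """Map job name -> core name under a simplified reading of FOF.

    Rule 1: the fastest cores are used first.
    Rule 2: older jobs sit on faster cores, so re-running this function
            after a job finishes promotes the remaining jobs.
    Jobs beyond the number of cores are left unassigned (queued).
    """
    fastest_first = sorted(cores, key=lambda c: c.speed, reverse=True)
    oldest_first = sorted(jobs, key=lambda j: j.arrival)
    return {job.name: core.name for job, core in zip(oldest_first, fastest_first)}


cores = [Core("atom0", 1.0), Core("atom1", 1.0), Core("nehalem0", 3.0)]

# Two active jobs: the older one gets the fast core.
jobs = [Job("q1", arrival=0.0), Job("q2", arrival=10.0)]
print(fof_assign(cores, jobs))  # {'q1': 'nehalem0', 'q2': 'atom0'}

# q1 finishes and q3 arrives: q2 is now the oldest and is promoted.
jobs = [Job("q2", arrival=10.0), Job("q3", arrival=25.0)]
print(fof_assign(cores, jobs))  # {'q2': 'nehalem0', 'q3': 'atom0'}
```

Recomputing the whole assignment at each arrival or completion keeps the sketch short. A real scheduler would more likely migrate only the jobs whose core changes.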
Software Challenges in a Power Constrained World
Optimizing performance, power, & energy
Software/Hardware co-design
Software portability
Programming models & abstractions
OOPSLA’13

New Virtuous Cycle?
Doubling of Transistors: faster, smaller, cheaper, …
Specialization
Software Innovation, Device Innovation
Software Complexity, Hardware Complexity
Sequential Interface
New Interfaces & Abstractions

Collaborators on this work
Steve Blackburn, Ting Cao, Hadi Esmaeilzadeh, Ivan Jibaja, Yuxiong He, Shaolei Ren, Sameh Elnikety, Xi Yang, Yong Hun Eom, Todd Mytkowicz, James Bornholt, Aman Kansal

The Future?
Thank you

Bibliography
• Exploiting Processor Heterogeneity in Interactive Systems, S. Ren, Y. He, S. Elnikety, and K. S. McKinley, The USENIX International Conference on Autonomic Computing (ICAC), San Jose, CA, June 2013.
• Asymmetric Cores for Low Power User Interface Systems, Jaeyeon Kihm and Francois Guimbretiere, ACM Symposium on User Interface Software and Technology (UIST), October 2013. Poster.
• The Yin and Yang of Hardware Heterogeneity: Can Software Survive? K. S. McKinley, ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), Indianapolis, IN, October 2013.
• The New Global Ecosystem in Advanced Computing: Implications for U.S. Competitiveness and National Security, Committee on Global Approaches to Advanced Computing (member), Board on Global Science and Technology, Policy and Global Affairs Division, National Research Council of the National Academies, The National Academies Press, Washington, D.C., 2012.
• The Model Is Not Enough: Understanding Energy Consumption in Mobile Devices, J. Bornholt, T. Mytkowicz, and K. S. McKinley, Hot Chips, San Jose, CA, August 2012.
• Looking Back and Looking Forward: Power, Performance, and Upheaval, H. Esmaeilzadeh, T. Cao, X. Yang, S. M. Blackburn, and K. S. McKinley, Communications of the ACM (CACM), Research Highlights, 55(7), July 2012.
• The Yin and Yang of Power and Performance for Asymmetric Hardware and Managed Software, T. Cao, T. Gao, S. M. Blackburn, and K. S. McKinley, ACM/IEEE International Symposium on Computer Architecture (ISCA), Portland, OR, June 2012.
• Somniloquy: Augmenting Network Interfaces to Reduce PC Energy Usage, Yuvraj Agarwal, Steve Hodges, Ranveer Chandra, James Scott, Paramvir Bahl, and Rajesh Gupta, Networked Systems Design & Implementation (NSDI), USENIX, 22 April 2009.

A Byte of My Story
Mentors, Family, ACM Fellow, Congressional Testimony

I fail a lot
Rejected job applications: 1984 (all), 1993 (8 of 11), 2011 (4 of 8)
Failed PhD qualifying exam
Rejected first three grant applications
Rejected 3 times: my most cited paper
Rejected papers, grants, papers, …
I learn & persist