CS 152 Computer Architecture and Engineering
Lecture 7 -- Power and Energy
2014-2-11
John Lazzaro (not a prof - “John” is always OK)
TA: Eric Love
www-inst.eecs.berkeley.edu/~cs152/
CS 152: L7: Power and Energy UC Regents Spring 2014 © UCB

Today: Power and Energy
Metrics: Power and energy (intro)
Short Break.
Metrics: Power and energy (technique)

Power and Energy
Universal: Power and energy units are comparable across all of applied physics.
So, we use automobiles to introduce terminology.

The Watt: Unit of power. A rate of energy (J/s). 1 W = 1 J/s.
A gas pump hose delivers 6 MW.
120 kW: The power delivered by a Tesla Supercharger.

The Joule: Unit of energy. 1 J = 1 W·s.
A 1 gallon gas container holds 130 MJ of energy.
Tesla Model S has a 306 MJ battery (good for 265 miles).

Energy and Performance
Sad fact: Computers turn electrical energy into heat. Computation is a byproduct.
Air or water carries heat away, or chip melts.

The Joule: Unit of energy. Can also be expressed as Watt-Seconds.
Burning 1 Watt for 100 seconds uses 100 Watt-Seconds of energy.
1 Joule heats 1 gram of water 0.24 degree C.
This is how electric tea pots work ...

The Watt: Unit of power. The amount of energy burned in the resistor in 1 second.
[Figure: a 1 V source drives 1 A through a 1 Ohm resistor, producing 1 Joule of heat energy per second.]
20 W rating: Maximum power the package is able to transfer to the air. Exceed rating and resistor ...

Cooling an iPod nano ...
Like the resistor on the last slide, the iPod relies on passive transfer of heat from case to the air.
Why? Users don’t want fans in their pocket ...
To stay “cool to the touch” via passive cooling, power budget of 5 W.
If iPod nano used 5 W all the time, its battery would last 15 minutes ...
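A small sketch of the unit relationships introduced above (1 W = 1 J/s, so energy = power × time). The numbers are the ones quoted on the slides; the function names are made up for illustration, and the charge-time figure assumes a constant 120 kW for the whole charge, which is a simplification not stated on the slides.

```python
SECONDS_PER_HOUR = 3600.0


def energy_joules(power_watts, seconds):
    """Energy used when drawing a constant power for a given time."""
    return power_watts * seconds


def seconds_to_transfer(energy_j, power_watts):
    """Time needed to move a given energy at a constant power."""
    return energy_j / power_watts


def joules_to_watt_hours(energy_j):
    """Convert Joules (Watt-seconds) to Watt-hours."""
    return energy_j / SECONDS_PER_HOUR


if __name__ == "__main__":
    # Burning 1 W for 100 s uses 100 Watt-seconds (Joules).
    print(energy_joules(1.0, 100.0))

    # Idealized time to fill a 306 MJ battery at a constant 120 kW (assumption).
    print(seconds_to_transfer(306e6, 120e3) / 60.0, "minutes")

    # Gallons of gas per second through a 6 MW hose, at 130 MJ per gallon.
    print(6e6 / 130e6, "gallons per second")

    # 5 W for 15 minutes, expressed in Watt-hours.
    print(joules_to_watt_hours(energy_joules(5.0, 15 * 60)), "W-hr")
```

The last line comes out to 1.25 W-hr, which is the size of battery the 15-minute figure implies.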
Powering an iPod nano (2005 edition)
1.2 W-hour battery: Can supply 1.2 watts of power for 1 hour.
1.2 W-hr / 5 W ≈ 15 minutes.
More W-hours require a bigger battery and thus a bigger “form factor” - it wouldn’t be “nano” anymore :-).
Real specs for iPod nano: 14 hours for music, 4 hours for slide shows.
85 mW for music. 300 mW for slides.

Finding the (2005) iPod nano CPU ...
A close relative ...
Two 80 MHz CPUs. One CPU used for audio, one for slides.
Low-power ARM: roughly 1 mW per MHz ... variable clock, sleep modes.

The CPU is only part of power budget! “Amdahl’s Law for Power”
[Chart: power budget of a 2004-era notebook running a full workload, split among CPU, GPU, LCD, LCD backlight, and “other”.]
If our CPU took no power at all to run, that would only double battery life!

What’s happened since 2005?
2010 nano: 0.74 ounces (50% of 2005 nano).
“Up to” 24 hours audio playback. 70% improvement from 2005 nano.
0.39 W-hr (33% of 2005 nano).

Processors and Energy

Moore’s Law
[Plot: marks at 2 Thousand, 1 Million, and 2.6 Billion.]
Main driver: device scaling ...
From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.

1974: Dennard Scaling
If we scale the gate length by a factor κ, how should we scale other aspects of the transistor to get the “best” results?
[Figure: transistor, not scaled vs. κ = 5 scaling.]

Dennard Scaling
Things we do: scale dimensions, doping, Vdd.
What we get: κ² as many transistors at the same power density! Whose gates switch κ times faster!
Power density scaling ended in 2003 (Pentium 4: 3.2 GHz, 82 W, 55M FETs).
Why? We could no longer scale Vdd.

The Power Wall
Why? We can no longer fully scale Vdd ... because MOS transistor leakage current is no longer a negligible part of the power budget.

Switching Energy: Fundamental Physics
Every logic transition dissipates energy.
$E_{0 \to 1} = \frac{1}{2} C V_{dd}^2$, $E_{1 \to 0} = \frac{1}{2} C V_{dd}^2$
Strong result: Independent of technology.
How can we limit switching energy?
(1) Reduce # of clock transitions. But we have work to do ...
(2) Reduce Vdd. But lowering Vdd limits the clock speed ...
(3) Fewer circuits. But more transistors can do more work.
(4) Reduce C per node. One reason why we scale.

Scaling switching energy per gate ...
IC process scaling (“Moore’s Law”)
Due to reducing V and C (length and width of Cs decrease, but plate distance gets smaller).
Recent slope more shallow because V is being scaled less.
From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.

Second Factor: Leakage Currents
Even when a logic gate isn’t switching, it burns power.
Isub: Even when this nFET is off (gate at 0 V), it passes an Ioff leakage current.
We can engineer any Ioff we like, but a lower Ioff also results in a lower Ion, and thus a lower maximum clock speed.
Igate: Ideal capacitors have zero DC current. But modern transistor gates are a few atoms thick, and are not ideal.
Intel’s 2006 processor designs, leakage vs switching power: A lot of work was done to get a ratio this good ... 50/50 is common. Bill Holt, Intel, Hot Chips 17.

Engineering “On” Current at 25 nm ...
We can increase Ion by raising Vdd and/or lowering Vt.
[Plot: Ids vs. Vg. Ion = 1.2 mA at Vdd = 0.7 V; Vt ≈ 0.25 V; Ioff = 0 ???]

Plot on a “Log” Scale to See “Off” Current
We can decrease Ioff by raising Vt - but that lowers Ion.
[Plot, log scale: Ids vs. Vg. Ion = 1.2 mA at Vdd = 0.7 V; Vt ≈ 0.25 V; Ioff ≈ 10 nA.]

Ioff? Ion? Recall: Timing Lecture ...
Ioff: An “off” n-FET while bucket fills. Why open? Gnd << Vt.
Ion: An “on” n-FET empties the bucket. Why on? Vdd >> Vt.
[Plot: Ids = current through n-FET. Ion = 1.2 mA at Vdd = 0.7 V; Vt ≈ 0.25 V; Ioff ≈ 10 nA.]

Device engineers trade speed and power
We can reduce CV² (Pactive) by lowering Vdd.
We can increase speed by raising Vdd and lowering Vt.
We can reduce leakage (Pstandby) by raising Vt.
From: “Silicon Device Scaling to the Sub-10-nm Regime”, Meikei Ieong, Bruce Doris, Jakub Kedzierski, Ken Rim, Min Yang.

Customize processes for product types ...
From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.

Intel 22nm Process
Transistor channel is a raised fin. Gate controls channel from sides and top.
Channel depth is fin width. 12-15nm for L=22nm.
[Plot: Ids vs. Vgs.]

Clock rates have flattened out, but ...

Performance: Put more transistors to work

Moore’s Law
[Plot: marks at 2 Thousand, 1 Million, and 2.6 Billion.]
We still scale to get more transistors per unit area ... but we use design techniques to reduce power.

Takeaway Abstractions From Part I ...

Small circuits can go very fast in standard CMOS ...
This oscillator runs at 210 GHz in a 32 nm SOI CMOS logic process, and consumes 42 mW ...
But if we used these techniques for a CPU, our 150 W air-cooled power limit would limit our design to using about ten thousand transistors ...
... but we are used to using 100s of millions!
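A rough sketch of the arithmetic behind the oscillator slide above. It uses only the slide's numbers (42 mW, 150 W, ten thousand transistors, and 100 million as a stand-in for “100s of millions”); the function names are invented for illustration, and dividing the budget evenly per transistor is a simplifying assumption.

```python
POWER_BUDGET_W = 150.0        # air-cooled power limit, from the slide
OSCILLATOR_POWER_W = 42e-3    # 210 GHz oscillator, from the slide


def copies_within_budget(budget_w, unit_power_w):
    """How many identical circuits fit under a power budget."""
    return int(budget_w // unit_power_w)


def power_per_transistor(budget_w, transistor_count):
    """Average power available to each transistor (even split is an assumption)."""
    return budget_w / transistor_count


if __name__ == "__main__":
    # Number of 42 mW oscillators that fit under 150 W.
    print(copies_within_budget(POWER_BUDGET_W, OSCILLATOR_POWER_W))

    # Per-transistor budget for a ten-thousand-transistor design ...
    fast = power_per_transistor(POWER_BUDGET_W, 10_000)
    # ... versus a design with 100 million transistors (illustrative count).
    dense = power_per_transistor(POWER_BUDGET_W, 100_000_000)
    print(fast, dense, fast / dense)
```

Under these assumptions each transistor in the very fast design gets about 15 mW, while a 100-million-transistor chip can afford only about 1.5 µW per transistor.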
The Power Wall
Dennard scaling stopped working in 2004.
Why? MOSFET off currents became non-negligible.
We limit clock speed to prevent chip from melting.

Dynamic Power: 4 ways to reduce it ...
Every logic transition dissipates energy.
$E_{0 \to 1} = \frac{1}{2} C V_{dd}^2$, $E_{1 \to 0} = \frac{1}{2} C V_{dd}^2$
Strong result: Independent of technology.
How can we limit switching energy?
(1) Reduce # of clock transitions. But we have work to do ...
(2) Reduce Vdd. But lowering Vdd limits the clock speed ...
(3) Fewer circuits. But more transistors can do more work.
(4) Reduce C per node. One reason why we scale.

Static power: We trade off speed for power.
Even when a logic gate isn’t switching, it burns power.
Isub: Even when this nFET is off (gate at 0 V), it passes an Ioff leakage current.
We can engineer any Ioff we like, but a lower Ioff also results in a lower Ion, and thus a lower maximum clock speed.
Igate: Ideal capacitors have zero DC current. But modern transistor gates are a few atoms thick, and are not ideal.
Intel’s 2006 processor designs, leakage vs switching power: A lot of work was done to get a ratio this good ... 50/50 is common. Bill Holt, Intel, Hot Chips 17.

Factor of 60 in leakage vs. 3.5 in speed
Chart shows 9 different NAND gates for an IC process, each with a different speed vs. static power tradeoff.
(40, 45, 50) are transistor channel lengths (in nm).

FO4 (Delay, Power) vs Vdd -- 65nm
FO4 is the delay of one inverter driving four additional inverters.
Power in this plot includes dynamic and static.
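A minimal sketch of the two power components recapped above: switching energy of ½·C·Vdd² per transition, and leakage power as Ioff × Vdd. The 1 fF node capacitance and the transition rate are hypothetical placeholders, not values from the slides; Ion, Ioff, and Vdd are the 25 nm plot values quoted earlier in the lecture.

```python
def switching_energy(c_farads, vdd_volts):
    """Energy dissipated by one 0->1 or 1->0 transition: 1/2 * C * Vdd^2."""
    return 0.5 * c_farads * vdd_volts ** 2


def dynamic_power(c_farads, vdd_volts, transitions_per_second):
    """Average switching power of one node at a given transition rate."""
    return switching_energy(c_farads, vdd_volts) * transitions_per_second


def leakage_power(ioff_amps, vdd_volts):
    """Static power burned by an 'off' transistor passing Ioff across Vdd."""
    return ioff_amps * vdd_volts


if __name__ == "__main__":
    C_NODE = 1e-15   # hypothetical 1 fF node capacitance (assumption)
    VDD = 0.7        # volts, from the 25 nm plot
    ION = 1.2e-3     # amps, from the 25 nm plot
    IOFF = 10e-9     # amps, from the log-scale plot

    # Halving Vdd cuts switching energy to one quarter.
    print(switching_energy(C_NODE, VDD / 2) / switching_energy(C_NODE, VDD))

    # Hypothetical node toggling 1e9 times per second (assumption).
    print(dynamic_power(C_NODE, VDD, 1e9), "W dynamic")

    # Leakage of one off transistor, and the on/off current ratio.
    print(leakage_power(IOFF, VDD), "W static")
    print(ION / IOFF, "Ion / Ioff")
```

The quarter-energy result is why option (2), reducing Vdd, is the strongest lever on the list, and why losing the ability to scale Vdd matters so much.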
Six low-power design techniques
Parallelism and pipelining
Power-down idle transistors
Slow down non-critical paths
Clock gating
Data-dependent processing
Thermal management

Design Technique #1 (of 6)
Trading Hardware for Power via Parallelism and Pipelining ...

And so, we can transform this:
Block processes stereo audio. 1/2 of clocks for “left”, 1/2 for “right”.
$P \sim F \times V_{dd}^2$
$P \sim 1 \times 1^2$
Into this:
Top block processes “left”, bottom “right”.
Gate delay roughly linear with Vdd.
$P \sim \#blks \times F \times V_{dd}^2$
$P \sim 2 \times 1/2 \times 1/4 = 1/4$
CV² power only.
This magic trick brought to you by Cory Hall ...

[Figure panels: Simple, Pipelined.]
From: Chandrakasan & Brodersen (UCB, 1992)

Multiple Cores for Low Power
Trade hardware for power, on a large scale ...

Cell: The PS3 chip (2006)

Cell (PS3 Chip): 1 CPU + 8 “SPUs”
[Figure labels: PowerPC, L2 Cache 512 KB, 8 Synergistic Processing Units (SPUs).]

One Synergistic Processing Unit (SPU)
SPU issues 2 inst/cycle (in order) to 7 execution units.
256 KB Local Store, 128 128-bit Registers.
SPU fills Local Store using DMA to DRAM and network.

A “Shmoo” plot for a Cell SPU ...
The lower Vdd, the less dynamic energy consumption.
$E_{0 \to 1} = \frac{1}{2} C V_{dd}^2$, $E_{1 \to 0} = \frac{1}{2} C V_{dd}^2$
The lower Vdd, the longer the maximum clock period, the slower the clock frequency.

Clock speed alone doesn’t help E/op ...
But, lowering clock frequency while keeping voltage constant spreads the same amount of work over a longer time, so chip stays cooler ...
$E_{0 \to 1} = \frac{1}{2} C V_{dd}^2$, $E_{1 \to 0} = \frac{1}{2} C V_{dd}^2$

Scaling V and f does lower energy/op
1 W to get 2.2 GHz performance. 26 C die temp.
7 W to reliably get 4.4 GHz performance. 47 C die temp.
If a program that needs a 4.4 GHz CPU can be recoded to use two 2.2 GHz CPUs ... big win.

How iPod nano 2005 puts its 2 cores to use ...
Two 80 MHz CPUs. Was used in several nano generations, with one CPU doing audio decoding, the other doing ...

2013 MacBook Air
Haswell CPU/GPU
Voltage range: 0.655 V to 1.041 V ... 2.5x in CV² energy.

Design Technique #2 (of 6)
Powering down idle circuits

Add “sleep” transistors to logic ...
Example: Floating point unit logic. When running fixed-point instructions, put logic “to sleep”.
+++ When “asleep”, leakage power is dramatically reduced.
--- Presence of sleep transistors slows down the clock rate when the logic block is in use.

Intel example: Sleeping cache blocks
A tiny current supplied in “sleep” maintains SRAM state.
From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.

Intel Medfield
Switches 45 power “islands.”
Fine-grained control of leakage power, to track user activity.
“Race to idle” strategy -- finish tasks quickly, to get to power down.
Playing a game ...
Watching a video ...
Looking at phone screen, not doing anything ...
Phone in your pocket, waiting for a call ...
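Looking back at Design Technique #1, the numbers quoted there can be checked with a few lines. This is only a sketch of the proportionality P ~ #blks × F × Vdd² (CV² power only, in relative units); the function names are made up, and pairing the halved clock with a halved Vdd relies on the slide's “gate delay roughly linear with Vdd” approximation.

```python
def relative_cv2_power(blocks, freq, vdd):
    """Relative switching power: P ~ #blks * F * Vdd^2 (leakage ignored)."""
    return blocks * freq * vdd ** 2


def energy_per_cycle(power_watts, freq_hz):
    """Average energy spent per clock cycle."""
    return power_watts / freq_hz


if __name__ == "__main__":
    # One block at full clock and full Vdd, versus two blocks at half each.
    single = relative_cv2_power(blocks=1, freq=1.0, vdd=1.0)
    parallel = relative_cv2_power(blocks=2, freq=0.5, vdd=0.5)
    print(single, parallel)

    # Cell SPU points from the slide: 1 W at 2.2 GHz, 7 W at 4.4 GHz.
    slow = energy_per_cycle(1.0, 2.2e9)
    fast = energy_per_cycle(7.0, 4.4e9)
    print(fast / slow, "x energy per cycle at 4.4 GHz")

    # 2013 MacBook Air voltage range: ratio of CV^2 energy at the extremes.
    print((1.041 / 0.655) ** 2)
```

Two SPUs at 2.2 GHz would draw about 2 W against 7 W for one at 4.4 GHz, which is the “big win” the slide refers to, provided the program can actually be split across two cores.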
Design Technique #3 (of 6)
Slow down “slack paths”

Fact: Most logic on a chip is “too fast”
[Plot: the critical path is marked.]
Most wires have hundreds of picoseconds to spare.
From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.

Use several supply voltages on a chip ...
Why use multi-Vdd? We can reduce dynamic power by using low-power Vdd for logic off the critical path.
What if we can’t do a multi-Vdd design? In a multi-Vt process, we can reduce leakage power on the slow logic by using high-Vt transistors.
From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.

Logical partition into 0.8V and 1.0V nets done manually to meet 350 MHz spec (90nm).
Level-shifter insertion and placement done automatically.
Dynamic power in 0.8V section cut 50% below baseline.
Leakage power in 1.0V section cut 70% below baseline.
From a chapter from new book on ASIC design by Chinnery and Keutzer (UCB).

Design Technique #4 (of 6)
Gating clocks to save power

On a CPU, where does the power go?
Half of the power goes to latches (Flip-Flops).
Most of the time, the latches don’t change state.
So (gasp) gated clocks are a big win. But, done with CAD tools in a disciplined way.
From: Bose, Martonosi, Brooks: Sigmetrics-2001 Tutorial

Synopsys Design Compiler can do this ...
“Up to 70% power savings at the block level, for applicable circuits” Synopsys Data Sheet

Power Compiler also can do this ...
10-20% push-button power savings, using techniques like this one.
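A toy cycle-level model of why gated clocks help when “most of the time, the latches don’t change state”. It counts clock edges delivered to one register, with and without an enable. This is an illustrative sketch only: the function name and input stream are invented, the enable is derived by comparing values purely for simplicity (real designs take it from control logic), and it says nothing about how Design Compiler or Power Compiler insert gating.

```python
def clock_events(values, gated):
    """Count clock edges delivered to a register as it captures `values`.

    Ungated: the register is clocked every cycle.
    Gated: the clock is delivered only when the stored value must change.
    Returns (edge_count, final_stored_value).
    """
    stored = None
    events = 0
    for value in values:
        enable = value != stored
        if not gated or enable:
            events += 1
            stored = value
    return events, stored


if __name__ == "__main__":
    # Hypothetical input stream in which the value rarely changes.
    stream = [3, 3, 3, 7, 7, 7, 7, 2]
    print(clock_events(stream, gated=False))
    print(clock_events(stream, gated=True))
```

Both versions end with the same stored value; the gated one simply delivers fewer clock edges, and each edge avoided is switching energy saved in the clock network and the flip-flop.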
Design Technique #5 (of 6)
Data-Dependent Processing

Example: Video Decode Transform
Most of the time, the inputs flip between small positive and negative integers.
In 2’s complement, wastes power:
+1: 0b00001
-1: 0b11111
Solution: Add bias value to all inputs.
30+% power reduction for a bias of 64.
For this linear transform, correcting the output for the bias is trivial.

Design Technique #6 (of 6)
Thermal Management

Keep chip cool to minimize leakage power
A recipe for thermal runaway

IBM Power 4: How does die heat up?
4 dies on a multi-chip module
2 CPUs per die

115 Watts: Concentrated in “hot spots”
Hot spots: Fixed point units
Cache logic: 66.8 C == 152 F
82 C == 179.6 F

Idea: Monitor temperature, servo clock speed
TDP = Thermal Design Point
Repeatedly running the same benchmark on three Apple products.
iPad Air, iPad Mini retina, and iPhone 5S all use the A7.
The TDP of each form factor dictates how long it can run at “top speed”.

On Thursday
Time to market via chip verification.
Have fun in section!
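Returning to Design Technique #5: a small sketch of why a bias reduces switching activity. It counts how many bits toggle between consecutive samples on a bus, with and without adding 64 to each sample. The 16-bit word width, the sample stream, and the function names are assumptions for illustration; the slide's 30+% figure is a measured power result and is not reproduced by this toy count.

```python
def to_unsigned(value, width):
    """Two's-complement bit pattern of `value` in `width` bits."""
    return value & ((1 << width) - 1)


def bit_toggles(samples, width, bias=0):
    """Total number of bit flips between consecutive (biased) samples."""
    total = 0
    previous = to_unsigned(samples[0] + bias, width)
    for sample in samples[1:]:
        current = to_unsigned(sample + bias, width)
        total += bin(previous ^ current).count("1")
        previous = current
    return total


if __name__ == "__main__":
    # Hypothetical inputs flipping between small positive and negative values.
    stream = [1, -1, 1, -1, 2, -2]
    print(bit_toggles(stream, width=16))           # sign bits flip every sample
    print(bit_toggles(stream, width=16, bias=64))  # all values stay positive
```

With the bias, every sample stays positive, so the upper bits stop flipping with the sign; only the low-order bits near the bias value still toggle.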