ELEC 5270/6270 Spring 2011
Low-Power Design of Electronic Circuits
Power Aware Microprocessors

Vishwani D. Agrawal
James J. Danaher Professor
Dept. of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
vagrawal@eng.auburn.edu
http://www.eng.auburn.edu/~vagrawal/COURSE/E6270_Spr11/course.html

Copyright Agrawal, 2007. ELEC5270/6270 Spring 11, Lecture 14

SIA Roadmap for Processors (1999)

Year                     1999    2002    2005    2008    2011    2014
Feature size (nm)         180     130     100      70      50      35
Logic transistors/cm²    6.2M     18M     39M     84M    180M    390M
Clock (GHz)              1.25     2.1     3.5     6.0    10.0    16.9
Chip size (mm²)           340     430     520     620     750     900
Power supply (V)          1.8     1.5     1.2     0.9     0.6     0.5
High-perf. power (W)       90     130     160     170     175     183

Untrue predictions.
Source: http://www.semichips.org

Power Reduction in Processors

Hardware methods:
  Architecture:
  Voltage reduction for dynamic power
  Dual-threshold devices for leakage reduction
  Clock gating, frequency reduction
  Sleep mode
  Instruction set hardware organization
Software methods

Performance Criteria

Throughput – computations per unit time.
Performance is inverse of time – increasing CPU time indicates lower performance.
Power – computations per watt.
Energy efficiency – performance/joule.

SPEC CPU2006 Benchmarks

Standard Performance Evaluation Corporation (SPEC), http://www.spec.org
Twelve integer and 17 floating point programs, CINT2006 and CFP2006.
Each program run time is normalized to obtain a SPEC ratio with respect to the run time of Sun Ultra Enterprise 2 system with a 296 MHz UltraSPARC II processor.
It takes about 12 days to run all benchmarks on reference system.
CINT2006 and CFP2006 metrics are the geometric means of SPEC ratios:
  Peak metric – each program is individually optimized (aggressive compilation).
  Base metric – common optimization for all programs.
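The normalization and averaging described on the SPEC CPU2006 slide can be sketched in a few lines. This is only an illustration: the run times below are made-up numbers, not SPEC results, and the helper names `spec_ratio` and `geometric_mean` are hypothetical.

```python
import math


def spec_ratio(reference_time_s, measured_time_s):
    """SPEC ratio of one program: reference run time / measured run time."""
    return reference_time_s / measured_time_s


def geometric_mean(values):
    """n-th root of the product of n positive values, computed via logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical run times in seconds (illustrative only, not SPEC data).
reference_times = [9770.0, 9650.0, 8050.0]   # reference machine
measured_times = [977.0, 965.0, 1610.0]      # machine under test

ratios = [spec_ratio(r, m) for r, m in zip(reference_times, measured_times)]
metric = geometric_mean(ratios)

print("SPEC ratios:", ratios)
print(f"Geometric-mean metric: {metric:.2f}")
```

With a geometric mean, the ratio of two machines' metrics does not depend on which machine is chosen as the reference.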
SPEC CINT2006 Results

http://www.spec.org/cpu2006/results/cint2006.html

Dell Inc., PowerEdge R610
  CPU: Intel Xeon X5670, 2.93 GHz
  Number of chips 2, cores 12, threads/core 2
  Performance metric 36.6 base, 39.4 peak

Dell Inc., PowerEdge M905
  CPU: AMD Opteron 8381 HE, 2.50 GHz
  Number of chips 4, cores 16, threads/core 1
  Performance metric 15.8 base, 19.1 peak

SPEC CFP2006 Results

http://www.spec.org/cpu2006/results/cfp2006.html

Dell Inc., PowerEdge R610
  CPU: Intel Xeon X5670, 2.93 GHz
  Number of chips 2, cores 12, threads/core 2
  Performance metric 42.5 base, 45.8 peak

Dell Inc., PowerEdge M905
  CPU: AMD Opteron 8381 HE, 2.50 GHz
  Number of chips 4, cores 16, threads/core 1
  Performance metric 17.4 base, 21.5 peak

Other Benchmarks

LINPACK is a numerically intensive floating-point program that solves a linear system (Ax = b); it is used for benchmarking supercomputers.
SPECpower_ssj2008 measures the power and performance of a computer system. The initial benchmark addresses the performance of server-side Java; additional workloads are planned.
http://www.spec.org/benchmarks.html#power

Second Quarter 2010 SPECpower_ssj2008 Results

http://www.spec.org/power_ssj2008/results/res2010q2/

Apr 7, 2010: Hewlett-Packard ProLiant DL385 G7
  CPU: AMD Opteron 6174, 2.2 GHz
  Number of chips 2, cores 12, threads/core 2
  Total memory 16 GB
  ssj operations @ 100%: 888,819
  Average power @ 100%: 271 W
  Average power @ active idle: 101 W
  Overall ssj operations per watt: 2,355

May 19, 2010: Dell Inc., PowerEdge R610
  CPU: Intel Xeon X5670, 2.93 GHz
  Number of chips 2, cores 12, threads/core 2
  Total memory 12 GB
  ssj operations @ 100%: 914,076
  Average power @ 100%: 244 W
  Average power @ active idle: 62.3 W
  Overall ssj operations per watt: 2,938

Energy SPEC Benchmarks

Energy efficiency mode: Besides the execution time, the energy efficiency of SPEC benchmark programs is also measured.
Energy efficiency of a benchmark program is given by:

$$\text{Energy efficiency} = \frac{1/(\text{Execution time})}{\text{joules consumed}}$$

D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, 4th Edition, Morgan Kaufmann Publishers (Elsevier), 2009.

Energy Efficiency

Efficiency averaged on n benchmark programs:

$$\text{Efficiency} = \left( \prod_{i=1}^{n} \text{Efficiency}_i \right)^{1/n}$$

where $\text{Efficiency}_i$ is the efficiency for program i.

Relative efficiency:

$$\text{Relative efficiency} = \frac{\text{Efficiency of a computer}}{\text{Efficiency of reference computer}}$$

SPEC2000 Relative Energy Efficiency

[Figure: bar chart of relative energy efficiency (scale 0 to 6) on SPECINT2000 and SPECFP2000 for Pentium M @1.6/0.6GHz (energy-efficient processor), Pentium 4-M @2.4GHz (reference), and Pentium III-M @1.2GHz, in three operating modes: always max. clock; laptop adaptive clk.; min. power min. clock.]

Voltage Scaling

Dynamic: Reduce voltage and frequency during idle or low activity periods.
Static: Clustered voltage scaling
  Logic on non-critical paths is given a lower voltage.
  47% power reduction with 10% area increase reported.

M. Igarashi et al., “Clustered Voltage Scaling Techniques for Low-Power Design,” Proc. IEEE Symp. Low Power Design, 1997.

Processor Utilization

Throughput = Operations / second

[Figure: throughput versus time, showing compute-intensive processes at maximum throughput, low throughput (background) processes, and system idle periods.]

Examples of Processes

Compute-intensive: spreadsheet, spelling check, video decoding, scientific computing.
Low throughput: data entry, screen updates, low bandwidth I/O data transfer.
Idle: no computation, no expected output.
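The energy-efficiency definitions given earlier in this lecture (efficiency of one program, geometric mean over n programs, and the ratio to a reference computer) can be combined as in the sketch below. The execution times and energies are invented for illustration, and all function names are hypothetical.

```python
import math


def energy_efficiency(execution_time_s, energy_j):
    """Efficiency of one program: (1 / execution time) / joules consumed."""
    return (1.0 / execution_time_s) / energy_j


def mean_efficiency(measurements):
    """Geometric mean of per-program efficiencies.

    measurements: list of (execution_time_s, energy_j) pairs, one per program.
    """
    effs = [energy_efficiency(t, e) for t, e in measurements]
    return math.exp(sum(math.log(x) for x in effs) / len(effs))


def relative_efficiency(machine, reference):
    """Efficiency of a computer divided by that of the reference computer."""
    return mean_efficiency(machine) / mean_efficiency(reference)


# Hypothetical (time in s, energy in J) pairs for two benchmark programs.
machine = [(10.0, 200.0), (20.0, 300.0)]
reference = [(20.0, 400.0), (40.0, 600.0)]

rel = relative_efficiency(machine, reference)
print(f"Relative energy efficiency: {rel:.2f}")
```

In this made-up case the machine under test runs each program in half the time and with half the energy of the reference, so each per-program efficiency is four times higher and so is the geometric mean.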
Effects of Voltage Reduction

Voltage reduction increases delay, decreases throughput:
  Slow reduction in throughput at first
  Rapid reduction in throughput for VDD ≤ Vth
  Time per operation (TPO) increases
Voltage reduction continues to reduce power consumption:
  Energy per operation (EPO) = Power × TPO

Energy per Operation (EPO)

[Figure: normalized EPO, power, and TPO (scale 0.0 to 1.0) versus VDD / Vth from 1 to 5.]

Dynamic Voltage and Clock

                         Time spent in:
Throughput               Fast mode   Slow mode   Idle mode   Battery life
Always full speed        10%         0%          90%         1 hr
Sometimes full speed     1%          90%         9%          5.3 hrs
Rarely full speed        0.1%        99%         0.9%        9.2 hrs

T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002, pp. 35-36.

Example: Find Minimum Energy Mode

Processor data (rated operation):
  2 GHz clock
  1.5 volt supply voltage
  0.5 volt threshold voltage
Power consumption:
  50 watts dynamic power
  50 watts static power
Maximum clock frequency for V volt supply: $f \propto (V - V_{TH})/V$

Example Cont.

Dynamic power:
  $P_d = C V^2 f = C (1.5)^2 \times 2 \times 10^9 = 50$ W
  $C = 11.11$ nF, capacitance switching/cycle
  $P_d = 11.11\, V^2 f$
Dynamic energy per cycle:
  $E_d = P_d / f = 11.11\, V^2$

Example Cont.

Clock frequency:
  $f = k (V - V_{TH})/V = k (1.5 - 0.5)/1.5 = 2$ GHz
  $k = 3$ GHz, a proportionality constant
  $f = 3 (V - 0.5)/V$ GHz

Example Cont.

Static power:
  $P_s = k' V^2 = k' (1.5)^2 = 50$ W
  $k' = 22.22$ mho, total leakage conductance
  $P_s = 22.22\, V^2$
Static energy per cycle:
  $E_s = P_s / f = 22.22\, V^3 / [3 (V - 0.5)] = 7.41\, V^3 / (V - 0.5)$

Example Cont.
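As a numerical cross-check of the example, the sketch below rebuilds the three constants from the rated data (2 GHz, 1.5 V, 0.5 V threshold, 50 W dynamic and 50 W static power) and scans the supply voltage for the minimum total energy per cycle. The function names and the 1 mV scan step are assumptions made for this illustration; the slides themselves solve the problem analytically.

```python
# Rated operating point taken from the example slides.
V_RATED = 1.5            # supply voltage, volts
V_TH = 0.5               # threshold voltage, volts
F_RATED = 2e9            # clock frequency, Hz
P_DYN_RATED = 50.0       # dynamic power, watts
P_STATIC_RATED = 50.0    # static power, watts

# Model constants, derived exactly as on the slides.
C = P_DYN_RATED / (V_RATED ** 2 * F_RATED)     # switched capacitance, ~11.11 nF
K = F_RATED * V_RATED / (V_RATED - V_TH)       # frequency constant, 3 GHz
G = P_STATIC_RATED / V_RATED ** 2              # leakage conductance, ~22.22 mho


def clock_frequency(v):
    """Maximum clock frequency f = k (V - VTH) / V, in Hz."""
    return K * (v - V_TH) / v


def dynamic_energy(v):
    """Dynamic energy per cycle, Ed = C V^2, in joules."""
    return C * v ** 2


def static_energy(v):
    """Static energy per cycle, Es = Ps / f = k' V^2 / f, in joules."""
    return G * v ** 2 / clock_frequency(v)


def total_energy(v):
    return dynamic_energy(v) + static_energy(v)


# Scan supply voltages from just above VTH up to the rated voltage (1 mV step).
candidates = [V_TH + 0.001 * i for i in range(1, 1001)]
v_min = min(candidates, key=total_energy)

print(f"Minimum-energy voltage: {v_min:.3f} V")
print(f"Clock at that voltage:  {clock_frequency(v_min) / 1e6:.0f} MHz")
print(f"Energy per cycle:       {total_energy(v_min) * 1e9:.2f} nJ")
```

The scan only considers voltages above the threshold, so it lands on the larger of the two analytical solutions (about 0.679 V).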
Total energy per cycle:
  $E = E_d + E_s = 11.11\, V^2 + 7.41\, V^3 / (V - 0.5)$
To minimize E, set $\partial E / \partial V = 0$, or
  $5 V^2 - 4.5 V + 0.75 = 0$
Solutions of the quadratic equation: V = 0.679 volt, 0.221 volt
Discard the second solution, which is lower than the threshold voltage of 0.5 volt.

Example: Result

                         Rated mode   Low energy mode   Reduction (%)
Voltage                  1.5 V        0.679 V           54.7%
Clock frequency          2 GHz        791 MHz           60%
Dynamic energy/cycle     25.00 nJ     5.12 nJ           79.52%
Static energy/cycle      25.00 nJ     12.96 nJ          48.16%
Total energy/cycle       50.0 nJ      18.08 nJ          63.84%
Dynamic power            50.0 W       4.05 W            91.90%
Static power             50.0 W       10.25 W           79.50%
Total power              100.0 W      14.30 W           85.70%

Problem of Process Variation in Nanometer Technologies

[Figure: distribution of the number of chips versus Vth (from lower Vth to higher Vth) at nominal voltage, marking the clock specification, the power specification, yield loss due to high leakage, yield loss due to slow speed, and the regions of lower voltage operation and higher voltage operation.]

From a presentation: Power Reduction using LongRun2 in Transmeta’s Efficeon Processor, by D. Ditzel, May 17, 2006.

Pipeline Gating

A pipeline processor uses speculative execution.
Idea: Stop fetching instructions if a branch hazard is expected:
  Incorrect branch prediction results in pipeline stalls and wasted energy.
  If the count (M) of incorrect predictions exceeds a prespecified number (N), then suspend fetching instructions for some k cycles.

Ref.: S. Manne, A. Klauser and D. Grunwald, “Pipeline Gating: Speculation Control for Energy Reduction,” Proc. 25th Annual International Symp. Computer Architecture, June 1998.

Slack Scheduling

Application: Superscalar, out-of-order execution:
  An instruction is executed as soon as the required data and resources become available.
  A commit unit reorders the results.
Delay the completion of instructions whose results are not immediately needed.
Example of RISC instructions:
  add r0, r1, r2;    (A)
  sub r3, r4, r5;    (B)
  and r9, r1, r9;    (C)
  or  r5, r9, r10;   (D)
  xor r2, r10, r11;  (E)

J. Casmira and D. Grunwald, “Dynamic Instruction Scheduling Slack,” Proc. ACM Kool Chips Workshop, Dec. 2000.

Slack Scheduling Example

Standard scheduling: A B C D E
Slack scheduling:    B C A D E

Slack Scheduling

[Figure: block diagram showing the scheduling logic, the re-order buffer with a slack bit, and low-power execution units.]

Clock Distribution

[Figure: H-tree distributing the clock to the flip-flops.]
Fanout, $\lambda = 4$
Tree depth, $s = \log_\lambda N$
No. of flip-flops = N

Clock Power

$$P_{clk} = C_L V_{DD}^2 f + \frac{C_L V_{DD}^2 f}{\lambda} + \frac{C_L V_{DD}^2 f}{\lambda^2} + \cdots = C_L V_{DD}^2 f \sum_{n=0}^{\text{stages}-1} \frac{1}{\lambda^n}$$

where
  $C_L$ = total load capacitance of N flip-flops
  $\lambda$ = constant fanout at each stage in the distribution network

Clock consumes about 40% of total processor power, because:
  (1) Clock is always active.
  (2) It makes two transitions per cycle (α = 2).
Clock gating is useful: inhibit the clock to unused blocks.

Properties of H-Tree

Balanced clock skew.
Small delay and power consumption.
Requires fine-tuning for complex layout.
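The clock-power series from the Clock Power slide can be evaluated for a given number of flip-flops and fanout. The sketch below is a minimal illustration; the function name `clock_power_factor` is hypothetical, and it assumes the tree depth is the smallest number of fanout-λ stages that reaches N flip-flops, with at least one stage.

```python
def clock_power_factor(n_flipflops, fanout=4):
    """Return (stages, factor), where the clock network power is
    factor * C_L * VDD^2 * f and factor = sum_{n=0}^{stages-1} 1/fanout^n.
    """
    stages = 0
    reachable = 1
    # Count the tree levels needed so that fanout**stages >= n_flipflops.
    while reachable < n_flipflops:
        reachable *= fanout
        stages += 1
    stages = max(stages, 1)   # assumption: a single flip-flop still needs one driver
    factor = sum(1.0 / fanout ** n for n in range(stages))
    return stages, factor


for n in (4, 16, 64, 256):
    stages, factor = clock_power_factor(n)
    print(f"N = {n:3d}: {stages} stage(s), clock power = {factor} x CL*VDD^2*f")
```

For fanout 4 the factor approaches 1/(1 - 1/4) = 4/3 as the tree deepens, so the buffers of the distribution tree add at most about one third to the power delivered to the flip-flop loads.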
Clock Power and Delay

Unit size buffer or inverter delay = d
Total dynamic power supplied to N flip-flops, $P = C_L V_{DD}^2 f$
Total power consumption of clock network:

Flip-flops, N    Clock power per flip-flop    Clock delay
1                P                            d
4                P                            4d
16               1.25P                        8d
64               1.3125P                      12d
256              1.328125P                    16d

Clock Network Examples

Processors: Alpha 21064, Alpha 21164, Alpha 21264
Technology: 0.75μ CMOS, 0.5μ CMOS, 0.35μ CMOS
Frequency (MHz): 200, 300, 600
Total capacitance: 12.5nF
Clock load: 3.25nF, 3.75nF
Clock power: 40%, 40% (20W)
Max. clock skew: 200ps (<10%), 90ps
Clock gating used.
Total power: 80, 110W

D. W. Bailey and B. J. Benschneider, “Clocking Design and Analysis for a 600-MHz Alpha Microprocessor,” IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1627-1633, Nov. 1998.

Power Reduction Example

Alpha 21064: 200MHz @ 3.45V, power dissipation = 26W
  Reduce voltage to 1.5V, power (5.3x) = 4.9W
  Eliminate FP, power (3x) = 1.6W
  Scale 0.75μ → 0.35μ, power (2x) = 0.8W
  Reduce clock load, power (1.3x) = 0.6W
  Reduce frequency 200 → 160MHz, power (1.25x) = 0.5W

J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor,” IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1703-1714, Nov. 1996.

For More on Microprocessors

T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002.
R. Graybill and R. Melhem, Power Aware Computing, New York: Plenum Publishers, 2002.