ELEC 5970-001/6970-001 (Fall 2005)
Special Topics in Electrical Engineering
Low-Power Design of Electronic Circuits
Power Aware Microprocessors

Vishwani D. Agrawal
James J. Danaher Professor
Department of Electrical and Computer Engineering
Auburn University
http://www.eng.auburn.edu/~vagrawal
vagrawal@eng.auburn.edu

11/15/05 ELEC 5970-001/6970-001 Lecture 19

SIA Roadmap for Processors (1999)

Year                     1999   2002   2005   2008   2011   2014
Feature size (nm)         180    130    100     70     50     35
Logic transistors/cm²    6.2M    18M    39M    84M   180M   390M
Clock (GHz)              1.25    2.1    3.5    6.0   10.0   16.9
Chip size (mm²)           340    430    520    620    750    900
Power supply (V)          1.8    1.5    1.2    0.9    0.6    0.5
High-perf. power (W)       90    130    160    170    175    183

Source: http://www.semichips.org

Power Reduction in Processors
• Just about every available technique is used.
• Hardware methods:
  • Voltage reduction for dynamic power
  • Dual-threshold devices for leakage reduction
  • Clock gating, frequency reduction
  • Sleep mode
• Architecture:
  • Instruction set
  • Hardware organization
• Software methods

SPEC CPU2000 Benchmarks
• Twelve integer and 14 floating-point programs, CINT2000 and CFP2000.
• The run time of each program is normalized to obtain a SPEC ratio with respect to its run time on a Sun Ultra 5_10 with a 300MHz processor.
• CINT2000 and CFP2000 summary measurements are the geometric means of the SPEC ratios.

Reference CPU s: Sun Ultra 5_10 300MHz Processor
[Figure: bar chart of reference run times for each benchmark program, vertical axis 0 to 3500. CINT2000: gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf. CFP2000: wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi.]

CINT2000: 3.4GHz Pentium 4, HT Technology (D850MD Motherboard)
SPECint2000_base = 1341
SPECint2000 = 1389
[Figure: base ratio and opt. ratio for each CINT2000 program, vertical axis 0 to 2500.]
Source: www.spec.org

Two Benchmark Results
• Baseline: A uniform configuration, not optimized for any specific program:
  • Same compiler with same settings and flags used for all benchmarks
  • Other restrictions
• Peak: The run is optimized to obtain the peak performance for each benchmark program.

CFP2000: 3.6GHz Pentium 4, HT Technology (D925XCV/AA-400 Motherboard)
SPECfp2000_base = 1627
SPECfp2000 = 1630
[Figure: base ratio and opt. ratio for each CFP2000 program, vertical axis 0 to 3000.]
Source: www.spec.org

CINT2000: 1.7GHz Pentium 4 (D850MD Motherboard)
SPECint2000_base = 579
SPECint2000 = 588
[Figure: base ratio and opt. ratio for each CINT2000 program, vertical axis 0 to 1000.]
Source: www.spec.org

CFP2000: 1.7GHz Pentium 4 (D850MD Motherboard)
SPECfp2000_base = 648
SPECfp2000 = 659
[Figure: base ratio and opt. ratio for each CFP2000 program, vertical axis 0 to 1400.]
Source: www.spec.org

Energy SPEC Benchmarks
• Energy efficiency mode: Besides the execution time, the energy efficiency of SPEC benchmark programs is also measured. The energy efficiency of a benchmark program is given by:

$$\text{Energy efficiency} = \frac{1/(\text{Execution time})}{\text{joules consumed}}$$

Energy Efficiency
• Efficiency averaged over n benchmark programs:

$$\text{Efficiency} = \left( \prod_{i=1}^{n} \text{Efficiency}_i \right)^{1/n}$$

  where Efficiency_i is the efficiency for program i.
• Relative efficiency:

$$\text{Relative efficiency} = \frac{\text{Efficiency of a computer}}{\text{Efficiency of reference computer}}$$

SPEC2000 Relative Energy Efficiency
[Figure: relative energy efficiency, vertical axis 0 to 6, on SPECINT2000 and SPECFP2000 for Pentium M @1.6/0.6GHz (energy-efficient processor), Pentium 4-M @2.4GHz (reference), and Pentium III-M @1.2GHz, in three modes: always max. clock, laptop adaptive clk., and min. power min. clock.]

Voltage Scaling
• Dynamic: Reduce voltage and frequency during idle or low-activity periods.
• Static: Clustered voltage scaling
  • Logic on non-critical paths is given a lower voltage.
  • 47% power reduction with 10% area increase reported.
  • M. Igarashi et al., “Clustered Voltage Scaling Techniques for Low-Power Design,” Proc. IEEE Symp. Low Power Design, 1997.

Pipeline Gating
• A pipeline processor uses speculative execution.
• Incorrect branch prediction results in pipeline stalls and wasted energy.
• Idea: Stop fetching instructions if a branch hazard is expected:
  • If the count (M) of incorrect predictions exceeds a prespecified number (N), then suspend instruction fetching for some k cycles.
• Ref.: S. Manne, A. Klauser and D. Grunwald, “Pipeline Gating: Speculation Control for Energy Reduction,” Proc. 25th Annual International Symp. Computer Architecture, June 1998.

Slack Scheduling
• Application: Superscalar, out-of-order execution:
  • An instruction is executed as soon as the data and resources it needs become available.
  • A commit unit reorders the results.
• Delay the execution of instructions whose result is not immediately needed.
• Example of RISC instructions:
  • add r0, r1, r2; (A)
  • sub r3, r4, r5; (B)
  • and r9, x1, r9; (C)
  • or r5, r9, r10; (D)
  • xor r2, r10, r11; (E)
J. Casmira and D. Grunwald, “Dynamic Instruction Scheduling Slack,” Proc. ACM Kool Chips Workshop, Dec. 2000.
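The slack in the five-instruction example can be worked out with a small dependency analysis. The sketch below is illustrative only and rests on several assumptions that the slides do not state: the last operand of each instruction is taken to be the destination register, every instruction has unit latency, issue width is unlimited, and only read-after-write dependences are tracked. The function names (`dependencies`, `slack_table`) are hypothetical.

```python
# Instructions from the slide as (label, source registers, destination).
# ASSUMPTION: the last operand is the destination register.
INSTRS = [
    ("A", ("r0", "r1"), "r2"),    # add r0, r1, r2
    ("B", ("r3", "r4"), "r5"),    # sub r3, r4, r5
    ("C", ("r9", "x1"), "r9"),    # and r9, x1, r9
    ("D", ("r5", "r9"), "r10"),   # or  r5, r9, r10
    ("E", ("r2", "r10"), "r11"),  # xor r2, r10, r11
]


def dependencies(instrs):
    """Map each instruction to the earlier instructions whose results it reads."""
    last_writer = {}
    deps = {}
    for name, srcs, dst in instrs:
        deps[name] = {last_writer[s] for s in srcs if s in last_writer}
        last_writer[dst] = name
    return deps


def slack_table(instrs):
    """Return {instruction: slack in cycles} assuming unit latency."""
    deps = dependencies(instrs)

    # Earliest cycle: one cycle after the latest producer.
    earliest = {}
    for name, _, _ in instrs:
        earliest[name] = 1 + max((earliest[d] for d in deps[name]), default=0)
    length = max(earliest.values())

    # Latest cycle: one cycle before the earliest consumer.
    users = {name: set() for name, _, _ in instrs}
    for name, producers in deps.items():
        for p in producers:
            users[p].add(name)
    latest = {}
    for name, _, _ in reversed(instrs):
        latest[name] = min((latest[u] for u in users[name]), default=length + 1) - 1

    return {name: latest[name] - earliest[name] for name, _, _ in instrs}


if __name__ == "__main__":
    for name, slack in slack_table(INSTRS).items():
        print(f"{name}: slack = {slack} cycle(s)")
```

Under these assumptions only A has slack: its result is first read by E, which must also wait for D, and D in turn waits for B and C. That is consistent with the order B C A D E on the Slack Scheduling Example slide.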

Slack Scheduling Example
Standard scheduling: A B C D E
Slack scheduling: B C A D E

Slack Scheduling
[Figure: block diagram with scheduling logic, re-order buffer with slack bit, and low-power execution units.]

Parallel Architecture
[Figure: a single processor clocked at f between input and output, compared with two processors each clocked at f/2 whose outputs are combined at f.]

                 Single processor    Two parallel processors
Capacitance      C                   2.2C
Voltage          V                   0.6V
Frequency        f                   0.5f
Power            CV²f                0.396CV²f

Pipeline Architecture
[Figure: input register, processor, and output register clocked at f, compared with the processor split into two ½ Proc. stages separated by a register, all clocked at f.]

                 Single processor    Two-stage pipeline
Capacitance      C                   1.2C
Voltage          V                   0.6V
Frequency        f                   f
Power            CV²f                0.432CV²f

Approximate Trend

                 n-parallel proc.    n-stage pipeline proc.
Capacitance      nC                  C
Voltage          V/n                 V/n
Frequency        f/n                 f
Power            CV²f/n²             CV²f/n²
Chip area        n times             10-20% increase

G. K. Yeap, Practical Low Power Digital VLSI Design, Boston: Kluwer Academic Publishers, 1998.

Clock Distribution
[Figure: clock distribution network.]

Clock Power

$$P_{clk} = C_L V_{DD}^2 f + \frac{C_L V_{DD}^2 f}{\lambda} + \frac{C_L V_{DD}^2 f}{\lambda^2} + \cdots = C_L V_{DD}^2 f \sum_{n=0}^{\text{stages}-1} \frac{1}{\lambda^n}$$

where
  C_L = total load capacitance
  λ = constant fanout at each stage in the distribution network

Clock consumes about 40% of total processor power.

Clock Network Examples

                     Alpha 21064       Alpha 21164    Alpha 21264
Technology           0.75μ CMOS        0.5μ CMOS      0.35μ CMOS
Frequency (MHz)      200               300            600
Total capacitance    12.5nF            -              -
Clock load           3.25nF            3.75nF         -
Clock power          -                 20W            -
Max. clock skew      200ps (<10%)      90ps           -

D. W. Bailey and B. J. Benschneider, “Clocking Design and Analysis for a 600-MHz Alpha Microprocessor,” IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1627-1633, Nov. 1998.
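The geometric series on the Clock Power slide can be evaluated numerically to see how much the buffer stages add on top of the final load. This is a minimal sketch: the function names and the numeric values in the usage example are hypothetical and are not taken from the Alpha processors in the table.

```python
def clock_power(c_load, vdd, freq, fanout, stages):
    """P_clk = C_L * VDD^2 * f * sum_{n=0}^{stages-1} 1 / fanout^n (watts)."""
    series = sum(1.0 / fanout**n for n in range(stages))
    return c_load * vdd**2 * freq * series


def series_closed_form(fanout, stages):
    """Closed form of the same finite geometric series (requires fanout > 1)."""
    return (1.0 - fanout**(-stages)) / (1.0 - 1.0 / fanout)


if __name__ == "__main__":
    # HYPOTHETICAL example values, chosen only to make the arithmetic easy.
    c_load = 1e-9    # 1 nF total clock load
    vdd = 2.0        # 2 V supply
    freq = 100e6     # 100 MHz
    fanout = 4       # lambda
    stages = 3

    load_only = c_load * vdd**2 * freq
    total = clock_power(c_load, vdd, freq, fanout, stages)
    print(f"load only: {load_only:.3f} W, with drivers: {total:.3f} W")
    print(f"series factor: {series_closed_form(fanout, stages):.4f}")
```

As the number of stages grows, the series approaches λ/(λ − 1), so with a fanout of 4 the driver stages add at most one third to the power of the final load in this model.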

Power Reduction Example
• Alpha 21064: 200MHz @ 3.45V, power dissipation = 26W
• Reduce voltage to 1.5V, power (5.3x) = 4.9W
• Eliminate FP, power (3x) = 1.6W
• Scale 0.75→0.35μ, power (2x) = 0.8W
• Reduce clock load, power (1.3x) = 0.6W
• Reduce frequency 200→160MHz, power (1.25x) = 0.5W
• J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor,” IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1703-1714, Nov. 1996.
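The chain of reductions above can be checked with a few lines of arithmetic. The sketch below applies the factors quoted on the slide one after another, and also recomputes the voltage and frequency factors from the CV²f model used earlier in the lecture. The names `STEPS` and `apply_reductions` are hypothetical; the slide rounds each intermediate value, so the unrounded figures differ slightly.

```python
# Reduction factors as quoted on the slide, applied in order.
STEPS = [
    ("Reduce voltage 3.45V -> 1.5V", 5.3),
    ("Eliminate FP", 3.0),
    ("Scale 0.75um -> 0.35um", 2.0),
    ("Reduce clock load", 1.3),
    ("Reduce frequency 200 -> 160MHz", 1.25),
]


def apply_reductions(start_watts, steps):
    """Return [(label, power after this step in watts), ...]."""
    power = start_watts
    trace = []
    for label, factor in steps:
        power /= factor
        trace.append((label, power))
    return trace


if __name__ == "__main__":
    for label, watts in apply_reductions(26.0, STEPS):
        print(f"{label:32s} {watts:5.2f} W")

    # Cross-check two factors against P = C * V^2 * f:
    print(f"voltage factor   (3.45/1.5)^2 = {(3.45 / 1.5) ** 2:.2f}")
    print(f"frequency factor 200/160      = {200 / 160:.2f}")
```

The voltage factor follows from the quadratic dependence of dynamic power on supply voltage, and the frequency factor from its linear dependence on clock rate. The other three factors are taken from the slide as given.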