ELEC 5270/6270 Spring 2009 Low-Power Design of Electronic Circuits Power Aware Microprocessors

ELEC 5270/6270 Spring 2009
Low-Power Design of Electronic Circuits
Power Aware Microprocessors
Vishwani D. Agrawal
James J. Danaher Professor
Dept. of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
vagrawal@eng.auburn.edu
http://www.eng.auburn.edu/~vagrawal/COURSE/E6270_Spr09/course.html
Copyright Agrawal, 2007
ELEC6270 Spring 09, Lecture 12

SIA Roadmap for Processors (1999)
Year                      1999    2002    2005    2008    2011    2014
Feature size (nm)         180     130     100     70      50      35
Logic transistors/cm²     6.2M    18M     39M     84M     180M    390M
Clock (GHz)               1.25    2.1     3.5     6.0     10.0    16.9
Chip size (mm²)           340     430     520     620     750     900
Power supply (V)          1.8     1.5     1.2     0.9     0.6     0.5
High-perf. power (W)      90      130     160     170     175     183
Source: http://www.semichips.org

Power Reduction in Processors
• Just about everything is used.
• Hardware methods:
  • Voltage reduction for dynamic power
  • Dual-threshold devices for leakage reduction
  • Clock gating, frequency reduction
  • Sleep mode
• Architecture:
  • Instruction set
  • Hardware organization
• Software methods

SPEC CPU2000 Benchmarks
• Twelve integer and 14 floating-point programs: CINT2000 and CFP2000.
• Each program's run time is normalized to obtain a SPEC ratio with respect to the run time on a Sun Ultra 5_10 with a 300MHz processor.
• CINT2000 and CFP2000 summary measurements are the geometric means of the SPEC ratios.
• LINPACK is a numerically intensive floating-point linear system (Ax = b) program used for benchmarking supercomputers.
• SPECpower_ssj2008 measures the power and performance of a computer system.
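
The SPEC ratio and its geometric-mean summary can be sketched in Python as follows. The run times are made-up placeholder values, not measured SPEC data, and the program names are hypothetical. The scale factor of 100 assumes the convention that the reference machine itself scores 100.

```python
import math

def spec_ratio(reference_seconds, measured_seconds, scale=100.0):
    """SPEC ratio: reference run time over measured run time, scaled."""
    return scale * reference_seconds / measured_seconds

def geometric_mean(values):
    """n-th root of the product, computed through logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical (reference, measured) run times in seconds for three programs.
runs = {
    "prog_a": (1400.0, 140.0),
    "prog_b": (1900.0, 95.0),
    "prog_c": (1100.0, 220.0),
}
ratios = {name: spec_ratio(ref, t) for name, (ref, t) in runs.items()}
summary = geometric_mean(ratios.values())
print(ratios, round(summary, 1))
```

A geometric mean is used so that the summary does not depend on which machine is chosen as the reference.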

Reference CPU s: Sun Ultra 5_10, 300MHz Processor
[Figure: bar chart of reference CPU time in seconds (vertical axis, 0 to 3500) for each benchmark program. CINT2000: gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf. CFP2000: wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi.]

CINT2000: 3.4GHz Pentium 4, HT Technology (D850MD Motherboard)
SPECint2000_base = 1341
SPECint2000 = 1389
[Figure: bar chart of base and optimized SPEC ratios (vertical axis, 0 to 2500) for gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf.]
Source: www.spec.org

Two Benchmark Results
• Baseline: A uniform configuration not optimized for a specific program:
  • Same compiler with same settings and flags used for all benchmarks
  • Other restrictions
• Peak: Run is optimized for obtaining the peak performance for each benchmark program.

CFP2000: 3.6GHz Pentium 4, HT Technology (D925XCV/AA-400 Motherboard)
SPECfp2000_base = 1627
SPECfp2000 = 1630
[Figure: bar chart of base and optimized SPEC ratios (vertical axis, 0 to 3000) for wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi.]
Source: www.spec.org

CINT2000: 1.7GHz Pentium 4 (D850MD Motherboard)
SPECint2000_base = 579
SPECint2000 = 588
[Figure: bar chart of base and optimized SPEC ratios (vertical axis, 0 to 1000) for gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf.]
Source: www.spec.org

CFP2000: 1.7GHz Pentium 4 (D850MD Motherboard)
SPECfp2000_base = 648
SPECfp2000 = 659
[Figure: bar chart of base and optimized SPEC ratios (vertical axis, 0 to 1400) for wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi.]
Source: www.spec.org

Energy SPEC Benchmarks
• Energy efficiency mode: Besides the execution time, the energy efficiency of SPEC benchmark programs is also measured.
• Energy efficiency of a benchmark program is given by:

$$\text{Energy efficiency} = \frac{1/(\text{Execution time})}{\text{Joules consumed}}$$

Energy Efficiency
• Efficiency averaged over n benchmark programs:

$$\text{Efficiency} = \left( \prod_{i=1}^{n} \text{Efficiency}_i \right)^{1/n}$$

  where $\text{Efficiency}_i$ is the efficiency for program i.
• Relative efficiency:

$$\text{Relative efficiency} = \frac{\text{Efficiency of a computer}}{\text{Efficiency of reference computer}}$$
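
The three definitions above (per-program efficiency, geometric-mean efficiency, relative efficiency) can be sketched in Python as follows. The execution times and energies are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
import math

def energy_efficiency(execution_time_s, energy_joules):
    """Efficiency = (1 / execution time) / joules consumed."""
    return (1.0 / execution_time_s) / energy_joules

def mean_efficiency(runs):
    """Geometric mean of per-program efficiencies; runs = [(seconds, joules), ...]."""
    effs = [energy_efficiency(t, e) for t, e in runs]
    return math.prod(effs) ** (1.0 / len(effs))

def relative_efficiency(machine_runs, reference_runs):
    """Efficiency of a computer divided by that of the reference computer."""
    return mean_efficiency(machine_runs) / mean_efficiency(reference_runs)

# Hypothetical measurements for two benchmark programs on each machine.
machine = [(10.0, 50.0), (20.0, 100.0)]
reference = [(10.0, 200.0), (20.0, 400.0)]
rel = relative_efficiency(machine, reference)
print(f"relative efficiency = {rel:.2f}")
```

In this made-up case both machines take the same time, but the reference uses four times the energy, so the relative efficiency comes out as 4.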

SPEC2000 Relative Energy Efficiency
[Figure: bar chart of relative energy efficiency (vertical axis, 0 to 6) on SPECFP2000 and SPECINT2000 for three processors: Pentium M @1.6/0.6GHz (energy-efficient processor), Pentium 4-M @2.4GHz (reference), and Pentium III-M @1.2GHz. Results are grouped by operating mode: Always (max. clock), Laptop (adaptive clock), and Min. power (min. clock).]

Voltage Scaling
• Dynamic: Reduce voltage and frequency during idle or low activity periods.
• Static: Clustered voltage scaling
  • Logic on non-critical paths is given a lower voltage.
  • 47% power reduction with 10% area increase reported.
  • M. Igarashi et al., “Clustered Voltage Scaling Techniques for Low-Power Design,” Proc. IEEE Symp. Low Power Design, 1997.

Processor Utilization
Throughput = Operations / second
[Figure: throughput versus time, showing compute-intensive processes at maximum throughput, low throughput (background) processes, and system idle periods.]

Examples of Processes
• Compute-intensive: spreadsheet, spelling check, video decoding, scientific computing.
• Low throughput: data entry, screen updates, low bandwidth I/O data transfer.
• Idle: no computation, no expected output.

Effects of Voltage Reduction
• Voltage reduction increases delay, decreases throughput:
  • Slow reduction in throughput at first
  • Rapid reduction in throughput for VDD ≤ Vth
  • Time per operation (TPO) increases
• Voltage reduction continues to reduce power consumption:
  • Energy per operation (EPO) = Power × TPO
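
The relation EPO = Power × TPO can be explored with a small Python sketch. It rests on two assumptions that the slide above does not state: clock frequency proportional to (VDD - Vth)/VDD, which is the model used in the minimum energy example in these slides, and power taken as dynamic power only (proportional to VDD² f, leakage ignored). All curves are normalized to their value at VDD/Vth = 5.

```python
def relative_speed(x):
    """Assumed model: f proportional to (VDD - Vth)/VDD, with x = VDD/Vth."""
    return (x - 1.0) / x

def normalized_curves(x, x_ref=5.0):
    """Return (power, tpo, epo) at x = VDD/Vth, each normalized to x_ref."""
    if x <= 1.0:
        raise ValueError("this model is only defined for VDD > Vth")
    f_rel = relative_speed(x) / relative_speed(x_ref)
    tpo = 1.0 / f_rel                    # time per operation rises as the clock slows
    power = (x / x_ref) ** 2 * f_rel     # assumed: dynamic power ~ VDD^2 * f
    epo = power * tpo                    # EPO = Power x TPO
    return power, tpo, epo

for x in (5.0, 4.0, 3.0, 2.0, 1.5):
    p, t, e = normalized_curves(x)
    print(f"VDD/Vth={x:.1f}  power={p:.3f}  TPO={t:.3f}  EPO={e:.3f}")
```

Under these assumptions EPO falls as the square of the voltage while TPO grows without bound as VDD approaches Vth. Including leakage changes the picture, because static energy per operation grows with TPO.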

Energy per Operation (EPO)
[Figure: normalized EPO, power, and TPO curves (vertical axis, 0.0 to 1.0) plotted against VDD / Vth (horizontal axis, 1 to 5).]

Dynamic Voltage and Clock
                        Time spent in:
Throughput              Fast mode   Slow mode   Idle mode   Battery life
Always full speed       10%         0%          90%         1 hr
Sometimes full speed    1%          90%         9%          5.3 hrs
Rarely full speed       0.1%        99%         0.9%        9.2 hrs
T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002, pp. 35-36.
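
One way to read the table: in every row the fast and idle times keep the same 1:9 ratio as the "always full speed" pattern, and the rest of the time is spent in slow mode. The Python sketch below uses that reading. The value `slow_power_ratio = 0.1` is an assumption picked for illustration (slow mode drawing one tenth of the average power of the full-speed pattern); it is not a figure from the slide or from the cited book.

```python
def battery_life_hours(slow_fraction, full_speed_life_h=1.0, slow_power_ratio=0.1):
    """Battery life when slow_fraction of the time is spent in slow mode.

    The remaining time follows the full-speed pattern (10% fast, 90% idle),
    which on its own drains the battery in full_speed_life_h hours.
    slow_power_ratio is an ASSUMED ratio of slow-mode power to the average
    power of the full-speed pattern.
    """
    full_fraction = 1.0 - slow_fraction
    relative_power = full_fraction + slow_fraction * slow_power_ratio
    return full_speed_life_h / relative_power

for label, slow in (("Always full speed", 0.0),
                    ("Sometimes full speed", 0.90),
                    ("Rarely full speed", 0.99)):
    print(f"{label:22s} {battery_life_hours(slow):.1f} hrs")
```

With the assumed ratio, the computed lifetimes round to the 1, 5.3, and 9.2 hours listed in the table.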

Example: Find Minimum Energy Mode
• Processor data (rated operation):
  • 2 GHz clock
  • 1.5 volt supply voltage
  • 0.5 volt threshold voltage
  • Power consumption:
    • 50 watts dynamic power
    • 50 watts static power
• Maximum clock frequency for a supply of V volts:
  $f \propto (V - V_{TH})/V$

Example Cont.
• Dynamic power:
  $P_d = C V^2 f = C (1.5)^2 \times 2 \times 10^9 = 50$ W
  $C = 11.11$ nF, capacitance switching/cycle
  $P_d = 11.11\, V^2 f$ (in W, with f in GHz)
• Dynamic energy per cycle:
  $E_d = P_d / f = 11.11\, V^2$ (in nJ)

Example Cont.
• Clock frequency:
  $f = k (V - V_{TH})/V = k (1.5 - 0.5)/1.5 = 2$ GHz
  $k = 3$ GHz, a proportionality constant
  $f = 3 (V - 0.5)/V$ GHz

Example Cont.
• Static power:
  $P_s = k' V^2 = k' (1.5)^2 = 50$ W
  $k' = 22.22$ mho, total leakage conductance
  $P_s = 22.22\, V^2$ (in W)
• Static energy per cycle:
  $E_s = P_s / f = 22.22\, V^3 / [3 (V - 0.5)] = 7.41\, V^3 / (V - 0.5)$ (in nJ)

Example Cont.
• Total energy per cycle:
  $E = E_d + E_s = 11.11\, V^2 + 7.41\, V^3 / (V - 0.5)$ (in nJ)
• To minimize E, set $\partial E / \partial V = 0$:
  $22.22\, V + 7.41\, V^2 (2V - 1.5)/(V - 0.5)^2 = 0$
  Dividing by $7.41\, V$ and multiplying by $(V - 0.5)^2$ gives
  $5V^2 - 4.5V + 0.75 = 0$
• Solutions of the quadratic equation:
  V = 0.679 volt, 0.221 volt
• Discard the second solution, which is lower than the threshold voltage of 0.5 volt.

Example: Result
Quantity                Rated mode   Low energy mode   Reduction
Voltage                 1.5 V        0.679 V           54.7%
Clock frequency         2 GHz        791 MHz           60%
Dynamic energy/cycle    25.00 nJ     5.12 nJ           79.52%
Static energy/cycle     25.00 nJ     12.96 nJ          48.16%
Total energy/cycle      50.0 nJ      18.08 nJ          63.84%
Dynamic power           50.0 W       4.05 W            91.90%
Static power            50.0 W       10.25 W           79.50%
Total power             100.0 W      14.30 W           85.70%
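
The minimum energy example can be checked numerically with a short Python sketch. It uses only the rated data given in the example (2 GHz, 1.5 V, 0.5 V threshold, 50 W dynamic, 50 W static) and the same models: dynamic power C V² f, static power k' V², and f = k (V - VTH)/V. The function and variable names are illustrative choices.

```python
import math

V_RATED, F_RATED, V_TH = 1.5, 2.0e9, 0.5   # volts, hertz, volts
P_DYN, P_STAT = 50.0, 50.0                 # watts at rated operation

C = P_DYN / (V_RATED**2 * F_RATED)         # switched capacitance per cycle, farads
K = F_RATED * V_RATED / (V_RATED - V_TH)   # frequency constant k, hertz
G = P_STAT / V_RATED**2                    # leakage conductance k', mho

def frequency(v):
    """Maximum clock frequency at supply voltage v."""
    return K * (v - V_TH) / v

def energy_per_cycle(v):
    """Return (dynamic, static) energy per cycle in joules."""
    return C * v**2, G * v**2 / frequency(v)

def min_energy_voltage():
    """Larger root of dE/dV = 0 for E = C V^2 + (G/K) V^3 / (V - VTH)."""
    a = G / K
    qa = 2 * C + 2 * a
    qb = -(4 * C + 3 * a) * V_TH
    qc = 2 * C * V_TH**2
    return (-qb + math.sqrt(qb**2 - 4 * qa * qc)) / (2 * qa)

v_opt = min_energy_voltage()
e_dyn, e_stat = energy_per_cycle(v_opt)
f_opt = frequency(v_opt)
print(f"V = {v_opt:.3f} V, f = {f_opt / 1e6:.0f} MHz")
print(f"energy/cycle: dynamic {e_dyn * 1e9:.2f} nJ, static {e_stat * 1e9:.2f} nJ")
print(f"power: dynamic {e_dyn * f_opt:.2f} W, static {e_stat * f_opt:.2f} W")
```

Small differences in the last digit relative to the table can appear because the table values were computed from the rounded voltage 0.679 V.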

Problem of Process Variation in Nanometer Technologies
[Figure: distribution of the number of chips versus Vth, from lower Vth to higher Vth, around the nominal voltage. A power specification limit marks yield loss due to high leakage, and a clock specification limit marks yield loss due to slow speed. Lower voltage operation and higher voltage operation are also labeled.]
From a presentation: Power Reduction using LongRun2 in Transmeta’s Efficeon Processor, by D. Ditzel, May 17, 2006.

Pipeline Gating
• A pipeline processor uses speculative execution.
  • Incorrect branch prediction results in pipeline stalls and wasted energy.
• Idea: Stop fetching instructions if a branch hazard is expected:
  • If the count (M) of incorrect predictions exceeds a prespecified number (N), then suspend fetching instructions for some k cycles.
• Ref.: S. Manne, A. Klauser and D. Grunwald, “Pipeline Gating: Speculation Control for Energy Reduction,” Proc. 25th Annual International Symp. Computer Architecture, June 1998.
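
The gating rule described above can be sketched as a small Python model. It follows only the slide's wording (count M, threshold N, stall of k cycles); the cited paper may define the count differently. The class name `FetchGate`, the reset of the counter after gating, and the misprediction cycles in the usage loop are all assumptions made for illustration.

```python
class FetchGate:
    """Suspends instruction fetch when too many mispredictions accumulate."""

    def __init__(self, threshold_n, stall_k):
        self.threshold_n = threshold_n   # N: allowed count before gating
        self.stall_k = stall_k           # k: cycles to suspend fetch
        self.count_m = 0                 # M: running count of incorrect predictions
        self.stall_left = 0              # remaining cycles of suspended fetch

    def record_misprediction(self):
        self.count_m += 1
        if self.count_m > self.threshold_n:
            self.stall_left = self.stall_k
            self.count_m = 0             # assumed: counter restarts after gating

    def fetch_allowed(self):
        """Call once per cycle; returns False while fetch is suspended."""
        if self.stall_left > 0:
            self.stall_left -= 1
            return False
        return True

gate = FetchGate(threshold_n=2, stall_k=3)
trace = []
for cycle in range(10):
    if cycle in (0, 1, 2):               # hypothetical cycles with a misprediction
        gate.record_misprediction()
    trace.append(gate.fetch_allowed())
print(trace)
```

In this run the third misprediction pushes M above N = 2, so fetch is suspended for k = 3 cycles starting in that cycle and then resumes.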

Slack Scheduling
• Application: Superscalar, out-of-order execution:
  • An instruction is executed as soon as the required data and resources become available.
  • A commit unit reorders the results.
• Delay the completion of instructions whose result is not immediately needed.
• Example of RISC instructions:
  add r0, r1, r2;     (A)
  sub r3, r4, r5;     (B)
  and r9, x1, r9;     (C)
  or r5, r9, r10;     (D)
  xor r2, r10, r11;   (E)
• J. Casmira and D. Grunwald, “Dynamic Instruction Scheduling Slack,” Proc. ACM Kool Chips Workshop, Dec. 2000.
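
A much simplified slack test can be sketched in Python for the five instructions above: an instruction is treated as a slack candidate if no later instruction in the window reads its destination register. This criterion, the assumed operand order (destination first, then sources), and the function names are illustrative assumptions; the cited paper's actual slack detection may differ.

```python
def parse(line):
    """Split 'op rd, rs1, rs2' into (op, destination, [sources])."""
    op, operands = line.split(None, 1)
    dest, *sources = [reg.strip() for reg in operands.split(",")]
    return op, dest, sources

def slack_candidates(program):
    """Labels of instructions whose result no later instruction reads."""
    decoded = [(label, *parse(text)) for label, text in program]
    candidates = []
    for i, (label, _op, dest, _srcs) in enumerate(decoded):
        consumed = any(dest in srcs for _l, _o, _d, srcs in decoded[i + 1:])
        if not consumed:
            candidates.append(label)
    return candidates

program = [
    ("A", "add r0, r1, r2"),
    ("B", "sub r3, r4, r5"),
    ("C", "and r9, x1, r9"),
    ("D", "or r5, r9, r10"),
    ("E", "xor r2, r10, r11"),
]
print(slack_candidates(program))
```

Under this criterion only C lacks slack, because D reads r9 right away. A real scheduler would also have to account for name dependences, such as D overwriting r5 after B reads it.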

Slack Scheduling Example
[Figure: two schedules of instructions A to E. Under standard scheduling the labels appear in the order A, B, C, D, E; under slack scheduling they appear in the order B, C, A, D, E.]

Slack Scheduling
[Figure: block diagram with scheduling logic, re-order buffer, slack bit, and low-power execution units.]

Clock Distribution
[Figure: clock distribution network driven from the clock input.]

Clock Power

$$P_{clk} = C_L V_{DD}^2 f + \frac{C_L V_{DD}^2 f}{\lambda} + \frac{C_L V_{DD}^2 f}{\lambda^2} + \cdots = C_L V_{DD}^2 f \sum_{n=0}^{\text{stages}-1} \frac{1}{\lambda^n}$$

where $C_L$ = total load capacitance and $\lambda$ = constant fanout at each stage in the distribution network.
Clock consumes about 40% of total processor power.
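
The clock power series can be evaluated with a few lines of Python. The load capacitance, supply voltage, and frequency below are taken from the Alpha 21064 entries elsewhere in these slides (3.25nF, 3.45V, 200MHz). The fanout of 4 and the 4 stages are assumed values for illustration only; the slides do not give them.

```python
def clock_power(c_load, vdd, freq, fanout, stages):
    """P_clk = C_L * VDD^2 * f * sum over n = 0..stages-1 of 1 / fanout^n."""
    series = sum(1.0 / fanout**n for n in range(stages))
    return c_load * vdd**2 * freq * series

# fanout and stages are ASSUMED; the other values come from the slides.
p_clk = clock_power(c_load=3.25e-9, vdd=3.45, freq=200e6, fanout=4, stages=4)
p_load_only = clock_power(c_load=3.25e-9, vdd=3.45, freq=200e6, fanout=4, stages=1)
print(f"final-stage load only: {p_load_only:.2f} W")
print(f"with driver stages:    {p_clk:.2f} W")
```

Because the series is geometric, the driver stages add at most a factor of λ/(λ - 1) over the power of the final load, which is 4/3 for a fanout of 4.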

Clock Network Examples
                     Alpha 21064     Alpha 21164     Alpha 21264
Technology           0.75μ CMOS      0.5μ CMOS       0.35μ CMOS
Frequency (MHz)      200             300             600
Total capacitance    12.5nF
Clock load           3.25nF          3.75nF
Clock power          40%             40% (20W)
Max. clock skew      200ps (<10%)    90ps
Alpha 21264: Clock gating used. Total power 80-110W.
D. W. Bailey and B. J. Benschneider, “Clocking Design and Analysis for a 600-MHz Alpha Microprocessor,” IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1627-1633, Nov. 1998.

Power Reduction Example
• Alpha 21064: 200MHz @ 3.45V, power dissipation = 26W
• Reduce voltage to 1.5V, power (5.3x) = 4.9W
• Eliminate FP, power (3x) = 1.6W
• Scale 0.75→0.35μ, power (2x) = 0.8W
• Reduce clock load, power (1.3x) = 0.6W
• Reduce frequency 200→160MHz, power (1.25x) = 0.5W
• J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor,” IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1703-1714, Nov. 1996.
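
The chain of reductions above is simple division, which the following Python sketch reproduces from the factors on the slide. It also checks two of the factors against first principles, assuming dynamic power scales with the square of the supply voltage and linearly with frequency. The step labels and function name are illustrative.

```python
steps = [
    ("Reduce voltage 3.45V -> 1.5V", 5.3),
    ("Eliminate FP", 3.0),
    ("Scale 0.75u -> 0.35u", 2.0),
    ("Reduce clock load", 1.3),
    ("Reduce frequency 200 -> 160MHz", 1.25),
]

def apply_reductions(start_watts, steps):
    """Divide the power by each factor in turn, keeping the running value."""
    power = start_watts
    history = []
    for name, factor in steps:
        power /= factor
        history.append((name, power))
    return history

history = apply_reductions(26.0, steps)
for name, watts in history:
    print(f"{name:32s} {watts:.2f} W")

voltage_factor = (3.45 / 1.5) ** 2     # assumed: dynamic power ~ V^2
frequency_factor = 200.0 / 160.0       # assumed: dynamic power ~ f
print(f"voltage factor {voltage_factor:.2f}, frequency factor {frequency_factor:.2f}")
```

Carrying the unrounded value through every step ends at about 0.50 W, in line with the rounded figures on the slide.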

For More on Microprocessors
• T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002.
• R. Graybill and R. Melhem, Power Aware Computing, New York: Plenum Publishers, 2002.