ELEC 5270/6270 Fall 2007 Low-Power Design of Electronic Circuits Power Aware Microprocessors

ELEC 5270/6270 Fall 2007
Low-Power Design of Electronic Circuits
Power Aware Microprocessors
Vishwani D. Agrawal
James J. Danaher Professor
Dept. of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
vagrawal@eng.auburn.edu
http://www.eng.auburn.edu/~vagrawal/COURSE/E6270_Fall07/course.html
Copyright Agrawal, 2007
ELEC6270 Fall 07, Lecture 14
SIA Roadmap for Processors (1999)
Year                      1999   2002   2005   2008   2011   2014
Feature size (nm)          180    130    100     70     50     35
Logic transistors/cm²     6.2M    18M    39M    84M   180M   390M
Clock (GHz)               1.25    2.1    3.5    6.0   10.0   16.9
Chip size (mm²)            340    430    520    620    750    900
Power supply (V)           1.8    1.5    1.2    0.9    0.6    0.5
High-perf. power (W)        90    130    160    170    175    183
Source: http://www.semichips.org
Power Reduction in Processors


• Just about everything is used.
• Hardware methods:
  - Voltage reduction for dynamic power
  - Dual-threshold devices for leakage reduction
  - Clock gating, frequency reduction
  - Sleep mode
• Architecture:
  - Instruction set
  - Hardware organization
• Software methods
SPEC CPU2000 Benchmarks




• Twelve integer and 14 floating point programs, CINT2000 and CFP2000.
• Each program's run time is normalized to obtain a SPEC ratio with respect to the run time on a Sun Ultra 5_10 with a 300MHz processor.
• CINT2000 and CFP2000 summary measurements are the geometric means of SPEC ratios.
• LINPACK is a numerically intensive floating point linear system (Ax = b) program used for benchmarking supercomputers.
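The normalization and averaging described above can be sketched in a few lines of Python. This is an illustration only: the function names and the run times are hypothetical, and the factor of 100 reflects the SPEC CPU2000 convention that the reference machine scores 100.

```python
import math

def spec_ratio(reference_seconds, measured_seconds):
    """SPEC ratio of one program: reference run time over measured run time, scaled by 100."""
    return 100.0 * reference_seconds / measured_seconds

def geometric_mean(values):
    """Geometric mean, computed in the log domain to avoid overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical (reference, measured) run times in seconds; these are not SPEC data.
run_times = [(1000.0, 100.0), (2000.0, 100.0), (1500.0, 75.0)]
ratios = [spec_ratio(ref, meas) for ref, meas in run_times]
summary = geometric_mean(ratios)
print("SPEC ratios:", ratios)
print("Summary (geometric mean):", round(summary, 1))
```

The geometric mean is used rather than the arithmetic mean so that the summary does not depend on which machine is chosen as the reference.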
Reference CPU s: Sun Ultra 5_10 300MHz Processor
[Figure: bar chart of reference run times (vertical axis 0 to 3500) for the CINT2000 programs gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf and the CFP2000 programs wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi.]
CINT2000: 3.4GHz Pentium 4, HT Technology (D850MD Motherboard)
SPECint2000_base = 1341
SPECint2000 = 1389
[Figure: bar chart of base and optimized SPEC ratios (vertical axis 0 to 2500) for gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf.]
Source: www.spec.org
Two Benchmark Results

• Baseline: A uniform configuration not optimized for a specific program:
  - Same compiler with same settings and flags used for all benchmarks
  - Other restrictions
• Peak: Run is optimized for obtaining the peak performance for each benchmark program.
CFP2000: 3.6GHz Pentium 4, HT Technology (D925XCV/AA-400 Motherboard)
SPECfp2000_base = 1627
SPECfp2000 = 1630
[Figure: bar chart of base and optimized SPEC ratios (vertical axis 0 to 3000) for wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi.]
Source: www.spec.org
CINT2000: 1.7GHz Pentium 4 (D850MD Motherboard)
SPECint2000_base = 579
SPECint2000 = 588
[Figure: bar chart of base and optimized SPEC ratios (vertical axis 0 to 1000) for gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf.]
Source: www.spec.org
CFP2000: 1.7GHz Pentium 4 (D850MD Motherboard)
SPECfp2000_base = 648
SPECfp2000 = 659
[Figure: bar chart of base and optimized SPEC ratios (vertical axis 0 to 1400) for wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi.]
Source: www.spec.org
Energy SPEC Benchmarks

• Energy efficiency mode: Besides the execution time, energy efficiency of SPEC benchmark programs is also measured. Energy efficiency of a benchmark program is given by:

$$\text{Energy efficiency} = \frac{1/(\text{Execution time})}{\text{joules consumed}}$$
Energy Efficiency


• Efficiency averaged on n benchmark programs:

$$\text{Efficiency} = \left( \prod_{i=1}^{n} \text{Efficiency}_i \right)^{1/n}$$

  where Efficiency_i is the efficiency for program i.
• Relative efficiency:

$$\text{Relative efficiency} = \frac{\text{Efficiency of a computer}}{\text{Eff. of reference computer}}$$
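A minimal Python sketch of these three definitions follows. The function names and the (execution time, joules) measurements are hypothetical and serve only to show how the per-program efficiencies, their geometric mean, and the relative efficiency fit together.

```python
import math

def energy_efficiency(execution_seconds, joules):
    """Efficiency of one program: (1 / execution time) divided by joules consumed."""
    return (1.0 / execution_seconds) / joules

def mean_efficiency(measurements):
    """Geometric mean of per-program efficiencies over n programs."""
    effs = [energy_efficiency(t, e) for t, e in measurements]
    return math.prod(effs) ** (1.0 / len(effs))

def relative_efficiency(machine, reference):
    """Efficiency of a computer divided by that of the reference computer."""
    return mean_efficiency(machine) / mean_efficiency(reference)

# Hypothetical (seconds, joules) pairs for two programs on each machine.
machine_runs = [(10.0, 50.0), (20.0, 100.0)]
reference_runs = [(10.0, 200.0), (20.0, 400.0)]
print(relative_efficiency(machine_runs, reference_runs))
```

In this made-up case both machines take the same time, so the relative efficiency simply reflects the machine using one quarter of the reference energy.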
SPEC2000 Relative Energy Efficiency
[Figure: bar chart of relative energy efficiency (vertical axis 0 to 6) for SPECINT2000 and SPECFP2000 under three operating modes: always max. clock, laptop adaptive clock, and min. power min. clock. Processors compared: Pentium M @1.6/0.6GHz (energy-efficient processor), Pentium 4-M @2.4GHz (reference), and Pentium III-M @1.2GHz.]
Voltage Scaling
• Dynamic: Reduce voltage and frequency during idle or low activity periods.
• Static: Clustered voltage scaling
  - Logic on non-critical paths given lower voltage.
  - 47% power reduction with 10% area increase reported.
  - M. Igarashi et al., “Clustered Voltage Scaling Techniques for Low-Power Design,” Proc. IEEE Symp. Low Power Design, 1997.
Processor Utilization
Throughput = Operations / second
[Figure: throughput versus time, showing compute-intensive processes at maximum throughput, low throughput (background) processes, and system idle periods.]
Examples of Processes
• Compute-intensive: spreadsheet, spelling check, video decoding, scientific computing.
• Low throughput: data entry, screen updates, low bandwidth I/O data transfer.
• Idle: no computation, no expected output.
Effects of Voltage Reduction

• Voltage reduction increases delay, decreases throughput:
  - Slow reduction in throughput at first
  - Rapid reduction in throughput for VDD ≤ Vth
  - Time per operation (TPO) increases
• Voltage reduction continues to reduce power consumption:
  - Energy per operation (EPO) = Power × TPO
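The relationship EPO = Power × TPO can be explored numerically with a simple model. The sketch below is an assumption-laden illustration, not the model behind the slides: it assumes the alpha-power delay law (delay proportional to VDD / (VDD − Vth)^α, with a hypothetical α = 2), counts dynamic power only (proportional to VDD² × frequency, leakage ignored), and normalizes everything to a hypothetical reference point of VDD = 5 Vth.

```python
def normalized_metrics(vdd_over_vth, alpha=2.0, v_ref=5.0):
    """Return (TPO, power, EPO), each normalized to its value at VDD = v_ref * Vth.

    Assumes the alpha-power delay law and dynamic power only; voltages are
    expressed in units of Vth, so Vth itself is 1.0.
    """
    if vdd_over_vth <= 1.0 or v_ref <= 1.0:
        raise ValueError("this delay model only applies for VDD above Vth")

    def delay(v):
        return v / (v - 1.0) ** alpha

    tpo = delay(vdd_over_vth) / delay(v_ref)         # time per operation
    frequency = 1.0 / tpo                            # clock slows as delay grows
    power = (vdd_over_vth / v_ref) ** 2 * frequency  # dynamic power ~ V^2 * f
    epo = power * tpo                                # energy per operation
    return tpo, power, epo

for ratio in (5.0, 4.0, 3.0, 2.0, 1.5):
    tpo, power, epo = normalized_metrics(ratio)
    print(f"VDD/Vth={ratio:.1f}  TPO={tpo:.2f}  Power={power:.3f}  EPO={epo:.3f}")
```

Under these assumptions EPO falls as the square of VDD regardless of clock rate, while TPO grows slowly at first and then sharply as VDD approaches Vth.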
Energy per Operation (EPO)
[Figure: normalized EPO, power, and TPO curves (vertical axis 0.0 to 1.0) plotted against VDD / Vth from 1 to 5.]
Dynamic Voltage and Clock
                          Time spent in:
Throughput                Fast mode   Slow mode   Idle mode   Battery life
Always full speed         10%         0%          90%         1 hr
Sometimes full speed      1%          90%         9%          5.3 hrs
Rarely full speed         0.1%        99%         0.9%        9.2 hrs

T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002, pp. 35-36.
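The link between time spent in each mode and battery life can be sketched as battery energy divided by the time-weighted average power. The per-mode power values below are hypothetical, normalized so that the battery holds one unit of energy; they are chosen for illustration and are not taken from Burd and Brodersen.

```python
def battery_life_hours(battery_energy, fractions, powers):
    """Battery life = stored energy / average power, where the average power
    weights each mode's power by the fraction of time spent in that mode."""
    average_power = sum(fractions[mode] * powers[mode] for mode in fractions)
    return battery_energy / average_power

# Hypothetical power per mode, in units of (battery energy) per hour.
POWERS = {"fast": 10.0, "slow": 0.1, "idle": 0.0}

# Time fractions from the table above.
PROFILES = {
    "Always full speed":    {"fast": 0.10,  "slow": 0.00, "idle": 0.90},
    "Sometimes full speed": {"fast": 0.01,  "slow": 0.90, "idle": 0.09},
    "Rarely full speed":    {"fast": 0.001, "slow": 0.99, "idle": 0.009},
}

for name, fractions in PROFILES.items():
    hours = battery_life_hours(1.0, fractions, POWERS)
    print(f"{name}: {hours:.1f} hrs")
```

With these assumed values the computed lifetimes round to 1.0, 5.3, and 9.2 hours, matching the trend in the table; this agreement does not imply that the book uses the same power figures.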
Problem of Process Variation and Leakage
[Figure: distribution of the number of chips versus Vth. Chips with lower Vth exceed the power specification (yield loss due to high leakage); chips with higher Vth miss the clock specification (yield loss due to slow speed).]
From a presentation: Power Reduction using LongRun2 in Transmeta’s Efficeon Processor, by D. Ditzel, May 17, 2006.
Pipeline Gating

• A pipeline processor uses speculative execution.
  - Incorrect branch prediction results in pipeline stalls and wasted energy.
• Idea: Stop fetching instructions if a branch hazard is expected:
  - If the count (M) of incorrect predictions exceeds a prespecified number (N), then suspend fetching instructions for some k cycles.
• Ref.: S. Manne, A. Klauser and D. Grunwald, “Pipeline Gating: Speculation Control for Energy Reduction,” Proc. 25th Annual International Symp. Computer Architecture, June 1998.
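The gating rule above (count M, threshold N, suspension of k cycles) can be sketched as a small cycle-by-cycle simulation. This is a simplified illustration, not the mechanism of the cited paper: the function name is hypothetical, each cycle is assumed to carry one flag saying whether an incorrect prediction was detected, and resetting M after each suspension is an added assumption.

```python
def simulate_pipeline_gating(mispredicted, n_threshold, k_cycles):
    """Return a list of per-cycle fetch decisions (True = fetch, False = gated).

    mispredicted: one boolean per cycle, True if an incorrect prediction is
                  detected in that cycle (flags in gated cycles are ignored).
    n_threshold:  N, the count that M must exceed to trigger gating.
    k_cycles:     k, the number of cycles for which fetch is suspended.
    """
    fetch_enabled = []
    m_count = 0         # M: incorrect predictions seen since the last suspension
    gated_left = 0      # remaining cycles of the current suspension
    for miss in mispredicted:
        if gated_left > 0:
            fetch_enabled.append(False)
            gated_left -= 1
            continue
        fetch_enabled.append(True)
        if miss:
            m_count += 1
        if m_count > n_threshold:
            gated_left = k_cycles   # suspend fetch for the next k cycles
            m_count = 0             # assumption: restart the count afterwards
    return fetch_enabled

events = [True, True, True, False, False, False, True]
print(simulate_pipeline_gating(events, n_threshold=2, k_cycles=2))
```

In this run the third incorrect prediction pushes M above N = 2, so fetch is gated for the following two cycles and then resumes.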
Slack Scheduling

• Application: Superscalar, out-of-order execution:
  - An instruction is executed as soon as the required data and resources become available.
  - A commit unit reorders the results.
• Delay the completion of instructions whose result is not immediately needed.
• Example of RISC instructions:
    add r0, r1, r2;      (A)
    sub r3, r4, r5;      (B)
    and r9, x1, r9;      (C)
    or r5, r9, r10;      (D)
    xor r2, r10, r11;    (E)
• J. Casmira and D. Grunwald, “Dynamic Instruction Scheduling Slack,” Proc. ACM Kool Chips Workshop, Dec. 2000.
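Slack for the five instructions above can be estimated by comparing the earliest and latest cycle in which each may start. The sketch below rests on several assumptions that the slides do not state: the last operand of each instruction is read as the destination register, every instruction takes one cycle, only read-after-write dependencies count, and there are enough execution units to start independent instructions together.

```python
# (label, source registers, destination register); the destination-last
# reading of the operands is an assumption made for this illustration.
PROGRAM = [
    ("A", ("r0", "r1"), "r2"),
    ("B", ("r3", "r4"), "r5"),
    ("C", ("r9", "x1"), "r9"),
    ("D", ("r5", "r9"), "r10"),
    ("E", ("r2", "r10"), "r11"),
]

def raw_dependencies(program):
    """Map each instruction to the earlier instructions whose results it reads."""
    last_writer, deps = {}, {}
    for label, sources, dest in program:
        deps[label] = {last_writer[r] for r in sources if r in last_writer}
        last_writer[dest] = label
    return deps

def slack(program, latency=1):
    """Slack = latest start cycle (ALAP) minus earliest start cycle (ASAP)."""
    deps = raw_dependencies(program)
    order = [label for label, _, _ in program]
    asap = {}
    for label in order:
        asap[label] = max((asap[d] + latency for d in deps[label]), default=0)
    last_start = max(asap.values())
    alap = {label: last_start for label in order}
    for label in reversed(order):            # consumers before producers
        for d in deps[label]:
            alap[d] = min(alap[d], alap[label] - latency)
    return {label: alap[label] - asap[label] for label in order}

print(raw_dependencies(PROGRAM))
print(slack(PROGRAM))
```

Under these assumptions D waits for B and C, and E waits for A and D, so A is the only instruction with slack: its result is not needed until E, which must wait for D anyway.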
Slack Scheduling Example
[Figure: execution timelines for instructions A to E. Rows under standard scheduling: A, B, C, D, E. Rows under slack scheduling: B, C, A, D, E.]
Slack Scheduling
[Figure: block diagram showing scheduling logic, re-order buffer, low-power execution units, and a slack bit.]
Clock Distribution
[Figure: clock distribution network; the input is labeled "clock".]
Clock Power
$$P_{clk} = C_L V_{DD}^2 f + \frac{C_L V_{DD}^2 f}{\lambda} + \frac{C_L V_{DD}^2 f}{\lambda^2} + \cdots = C_L V_{DD}^2 f \sum_{n=0}^{\text{stages}-1} \frac{1}{\lambda^n}$$

where C_L = total load capacitance
      λ = constant fanout at each stage in distribution network
Clock consumes about 40% of total processor power.
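The clock power series can be evaluated directly. The sketch below is illustrative: the function name is hypothetical, and the fanout of 4 and the 3 stages are assumed values. The load, voltage, and frequency are borrowed from the Alpha examples elsewhere in these slides (3.25nF, 3.45V, 200MHz), purely to give the numbers a realistic scale.

```python
def clock_power(c_load, vdd, freq, fanout, stages):
    """P_clk = C_L * VDD^2 * f * sum_{n=0}^{stages-1} 1 / fanout^n.

    c_load: total load capacitance at the last stage (farads)
    vdd:    supply voltage (volts)
    freq:   clock frequency (hertz)
    fanout: constant fanout (lambda) of each stage in the distribution network
    stages: number of stages in the distribution network
    """
    series = sum(1.0 / fanout ** n for n in range(stages))
    return c_load * vdd ** 2 * freq * series

load_only = clock_power(3.25e-9, 3.45, 200e6, fanout=4, stages=1)
with_drivers = clock_power(3.25e-9, 3.45, 200e6, fanout=4, stages=3)
print(f"Load only: {load_only:.2f} W, with two driver stages: {with_drivers:.2f} W")
```

Because each earlier stage drives a load that is smaller by a factor of λ, the series converges quickly and the final-stage load dominates the total.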
Clock Network Examples
                     Alpha 21064     Alpha 21164     Alpha 21264
Technology           0.75μ CMOS      0.5μ CMOS       0.35μ CMOS
Frequency (MHz)      200             300             600
Total capacitance    12.5nF
Clock load           3.25nF          3.75nF
Clock power          40%             40% (20W)
Max. clock skew      200ps (<10%)    90ps
Notes                                                Clock gating used. Total power 80-110W

D. W. Bailey and B. J. Benschneider, “Clocking Design and Analysis for a 600-MHz Alpha Microprocessor,” IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1627-1633, Nov. 1998.
Power Reduction Example







• Alpha 21064: 200MHz @ 3.45V, power dissipation = 26W
• Reduce voltage to 1.5V, power (5.3x) = 4.9W
• Eliminate FP, power (3x) = 1.6W
• Scale 0.75→0.35μ, power (2x) = 0.8W
• Reduce clock load, power (1.3x) = 0.6W
• Reduce frequency 200→160MHz, power (1.25x) = 0.5W
• J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor,” IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1703-1714, Nov. 1996.
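The chain of reductions above is straightforward arithmetic and can be checked with a short script. The variable and function names are hypothetical; the factors and the starting power are the ones quoted on the slide. The script also shows where two of the factors come from: the voltage factor is the square of the voltage ratio, and the frequency factor is the ratio of the two clock rates.

```python
START_WATTS = 26.0

# (step description, reduction factor) as quoted on the slide.
STEPS = [
    ("Reduce voltage 3.45V to 1.5V", 5.3),
    ("Eliminate FP", 3.0),
    ("Scale 0.75 to 0.35 micron", 2.0),
    ("Reduce clock load", 1.3),
    ("Reduce frequency 200 to 160MHz", 1.25),
]

def apply_reductions(start_watts, steps):
    """Divide the power by each factor in turn; return the power after every step."""
    results, watts = [], start_watts
    for _, factor in steps:
        watts /= factor
        results.append(watts)
    return results

voltage_factor = (3.45 / 1.5) ** 2   # dynamic power scales with VDD squared
frequency_factor = 200.0 / 160.0     # dynamic power scales linearly with f

for (name, factor), watts in zip(STEPS, apply_reductions(START_WATTS, STEPS)):
    print(f"{name}: /{factor} -> {watts:.1f} W")
print(f"Voltage factor: {voltage_factor:.2f}, frequency factor: {frequency_factor:.2f}")
```

The combined factor is about 52x, which takes the 26W starting point down to roughly 0.5W.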
Parallel Architecture
[Figure: a single processor with input and output at rate f, compared with two parallel processors, each clocked at f/2, with input and output at rate f.]

Single processor:          Capacitance = C,    Voltage = V,    Frequency = f,    Power = CV²f
Two parallel processors:   Capacitance = 2.2C, Voltage = 0.6V, Frequency = 0.5f, Power = 0.396CV²f
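Pipeline Architecture
[Figure: a single processor between input and output registers, compared with a two-stage pipeline of two half-processors ("½ Proc.") separated by a register; both clocked at f.]

Single processor:      Capacitance = C,    Voltage = V,    Frequency = f, Power = CV²f
Two-stage pipeline:    Capacitance = 1.2C, Voltage = 0.6V, Frequency = f, Power = 0.432CV²f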
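The power figures for the parallel and pipelined versions follow from Power = CV²f applied to the scaled capacitance, voltage, and frequency. A short check, with a hypothetical function name and the scaling factors quoted in these slides:

```python
def relative_power(cap_factor, volt_factor, freq_factor):
    """Power relative to the single processor, using P = C * V^2 * f."""
    return cap_factor * volt_factor ** 2 * freq_factor

# Two parallel processors: 2.2C, 0.6V, each clocked at 0.5f.
parallel = relative_power(2.2, 0.6, 0.5)
# Two-stage pipeline: 1.2C, 0.6V, clocked at f.
pipeline = relative_power(1.2, 0.6, 1.0)

print(f"Parallel: {parallel:.3f} CV^2f, pipeline: {pipeline:.3f} CV^2f")
```

In both cases the saving comes from the quadratic dependence on voltage; the extra capacitance (2.2C or 1.2C) only partly offsets it.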
Approximate Trend
               n-parallel proc.    n-stage pipeline proc.
Capacitance    nC                  C
Voltage        V/n                 V/n
Frequency      f/n                 f
Power          CV²f/n²             CV²f/n²
Chip area      n times             10-20% increase
G. K. Yeap, Practical Low Power Digital VLSI Design, Boston: Kluwer
Academic Publishers, 1998.
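The two power entries in the trend table follow from substituting each column's capacitance, voltage, and frequency into P = CV²f. The derivation below is a sketch of that substitution, with the subscripts added here for clarity:

```latex
\begin{align*}
P_{\text{parallel}} &= (nC)\left(\frac{V}{n}\right)^{2}\frac{f}{n} = \frac{CV^{2}f}{n^{2}} \\
P_{\text{pipeline}} &= C\left(\frac{V}{n}\right)^{2} f = \frac{CV^{2}f}{n^{2}}
\end{align*}
```

For n = 2 this idealized trend gives 0.25CV²f, lower than the 0.396CV²f and 0.432CV²f of the two-processor and two-stage examples, which scale the voltage only to 0.6V and include overhead capacitance.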
For More on Microprocessors
• T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002.
• R. Graybill and R. Melhem, Power Aware Computing, New York: Plenum Publishers, 2002.