ELEC 5270/6270 Spring 2011
Low-Power Design of Electronic Circuits
Power Aware Microprocessors
Vishwani D. Agrawal
James J. Danaher Professor
Dept. of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
vagrawal@eng.auburn.edu
http://www.eng.auburn.edu/~vagrawal/COURSE/E6270_Spr11/course.html
Copyright Agrawal, 2007
ELEC5270/6270 Spring 11, Lecture 14
SIA Roadmap for Processors (1999)

Year                      1999    2002    2005    2008    2011    2014
Feature size (nm)          180     130     100      70      50      35
Logic transistors/cm^2    6.2M     18M     39M     84M    180M    390M
Clock (GHz)               1.25     2.1     3.5     6.0    10.0    16.9
Chip size (mm^2)           340     430     520     620     750     900
Power supply (V)           1.8     1.5     1.2     0.9     0.6     0.5
High-perf. power (W)        90     130     160     170     175     183

Untrue predictions.
Source: http://www.semichips.org
Power Reduction in Processors

- Hardware methods:
  - Voltage reduction for dynamic power
  - Dual-threshold devices for leakage reduction
  - Clock gating, frequency reduction
  - Sleep mode
- Architecture:
  - Instruction set
  - Hardware organization
- Software methods
Performance Criteria

- Throughput – computations per unit time.
- Performance is the inverse of execution time – increasing CPU time indicates lower performance.
- Power – computations per watt.
- Energy efficiency – performance per joule.

SPEC CPU2006 Benchmarks






- Standard Performance Evaluation Corporation (SPEC), http://www.spec.org
- Twelve integer and 17 floating point programs, CINT2006 and CFP2006.
- Each program's run time is normalized to obtain a SPEC ratio with respect to the run time on a Sun Ultra Enterprise 2 system with a 296 MHz UltraSPARC II processor.
- It takes about 12 days to run all benchmarks on the reference system.
- CINT2006 and CFP2006 metrics are the geometric means of SPEC ratios:
  - Peak metric – each program is individually optimized (aggressive compilation).
  - Base metric – common optimization for all programs.
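
The ratio-and-geometric-mean calculation described above can be sketched in a few lines of Python. This is only an illustration: the function names and the three (reference time, measured time) pairs are hypothetical and are not taken from any published SPEC result.

```python
import math

def spec_ratio(reference_seconds, measured_seconds):
    """SPEC ratio: reference machine run time divided by measured run time."""
    return reference_seconds / measured_seconds

def geometric_mean(values):
    """Geometric mean, computed through logarithms to avoid overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical (reference time, measured time) pairs in seconds, one per program.
runs = [(9770.0, 500.0), (9650.0, 700.0), (8050.0, 350.0)]

ratios = [spec_ratio(ref, meas) for ref, meas in runs]
metric = geometric_mean(ratios)
print("SPEC ratios:", [round(r, 2) for r in ratios])
print("Composite metric (geometric mean): %.2f" % metric)
```

A larger ratio means a faster machine, and the geometric mean keeps one unusually fast program from dominating the composite metric.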
SPEC CINT2006 Results

http://www.spec.org/cpu2006/results/cint2006.html

Dell Inc., PowerEdge R610
- CPU: Intel Xeon X5670, 2.93 GHz
- Number of chips 2, cores 12, threads/core 2
- Performance metric 36.6 base, 39.4 peak

Dell Inc., PowerEdge M905
- CPU: AMD Opteron 8381 HE, 2.50 GHz
- Number of chips 4, cores 16, threads/core 1
- Performance metric 15.8 base, 19.1 peak
SPEC CFP2006 Results

http://www.spec.org/cpu2006/results/cfp2006.html

Dell Inc., PowerEdge R610
- CPU: Intel Xeon X5670, 2.93 GHz
- Number of chips 2, cores 12, threads/core 2
- Performance metric 42.5 base, 45.8 peak

Dell Inc., PowerEdge M905
- CPU: AMD Opteron 8381 HE, 2.50 GHz
- Number of chips 4, cores 16, threads/core 1
- Performance metric 17.4 base, 21.5 peak
Other Benchmarks


- LINPACK is a numerically intensive floating-point program that solves linear systems (Ax = b); it is used for benchmarking supercomputers.
- SPECpower_ssj2008 measures power and performance of a computer system.
  - The initial benchmark addresses the performance of server-side Java; additional workloads are planned.
  - http://www.spec.org/benchmarks.html#power
Second Quarter 2010
SPECpower_ssj2008 Results

http://www.spec.org/power_ssj2008/results/res2010q2/

Apr 7, 2010: Hewlett-Packard ProLiant DL385 G7
- CPU: AMD Opteron 6174, 2.2 GHz
- Number of chips 2, cores 12, threads/core 2
- Total memory 16 GB
- ssj operations @ 100%: 888,819
- Average power @ 100%: 271 W
- Average power @ active idle: 101 W
- Overall ssj operations per watt: 2,355
Second Quarter 2010
SPECpower_ssj2008 Results

http://www.spec.org/power_ssj2008/results/res2010q2/

May 19, 2010: Dell Inc., PowerEdge R610
- CPU: Intel Xeon X5670, 2.93 GHz
- Number of chips 2, cores 12, threads/core 2
- Total memory 12 GB
- ssj operations @ 100%: 914,076
- Average power @ 100%: 244 W
- Average power @ active idle: 62.3 W
- Overall ssj operations per watt: 2,938
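
As a rough cross-check of the two results, the sketch below computes operations per watt at the 100% load point and the ratio of idle power to full-load power, using only the numbers quoted on these slides. The dictionary keys and function names are illustrative. The published "overall ssj operations per watt" figure combines measurements over all load levels including active idle, so the single-point values computed here are expected to differ from 2,355 and 2,938.

```python
# Figures quoted on the two SPECpower_ssj2008 slides (second quarter 2010).
results = {
    "HP ProLiant DL385 G7 (Opteron 6174)": {
        "ssj_ops_100": 888_819, "watts_100": 271.0, "watts_idle": 101.0,
    },
    "Dell PowerEdge R610 (Xeon X5670)": {
        "ssj_ops_100": 914_076, "watts_100": 244.0, "watts_idle": 62.3,
    },
}

def ops_per_watt_at_full_load(r):
    """ssj operations per watt at the 100% load point only."""
    return r["ssj_ops_100"] / r["watts_100"]

def idle_to_peak_power(r):
    """Fraction of full-load power still drawn at active idle."""
    return r["watts_idle"] / r["watts_100"]

for name, r in results.items():
    print("%s: %.0f ops/W at 100%% load, idle power is %.0f%% of peak"
          % (name, ops_per_watt_at_full_load(r), 100 * idle_to_peak_power(r)))
```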
Energy SPEC Benchmarks

Energy efficiency mode: Besides the execution
time, energy efficiency of SPEC benchmark
programs is also measured. Energy efficiency of
a benchmark program is given by:
$$\text{Energy efficiency} = \frac{1/(\text{Execution time})}{\text{joules consumed}}$$
D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, 4th Edition, Morgan Kaufmann Publishers (Elsevier), 2009.
Energy Efficiency


- Efficiency averaged on n benchmark programs:

$$\text{Efficiency} = \left( \prod_{i=1}^{n} \text{Efficiency}_i \right)^{1/n}$$

  where $\text{Efficiency}_i$ is the efficiency for program i.
- Relative efficiency:

$$\text{Relative efficiency} = \frac{\text{Efficiency of a computer}}{\text{Efficiency of reference computer}}$$
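
The averaging and normalization above can be sketched as follows. The per-program efficiency is taken as (1/execution time)/joules, as defined for the energy efficiency mode. All measurements in the example are hypothetical, chosen only to exercise the formulas.

```python
import math

def energy_efficiency(execution_time_s, energy_joules):
    """Per-program efficiency: (1 / execution time) / joules consumed."""
    return (1.0 / execution_time_s) / energy_joules

def mean_efficiency(efficiencies):
    """Geometric mean of the per-program efficiencies."""
    return math.prod(efficiencies) ** (1.0 / len(efficiencies))

def relative_efficiency(machine_effs, reference_effs):
    """Mean efficiency of a machine divided by that of the reference machine."""
    return mean_efficiency(machine_effs) / mean_efficiency(reference_effs)

# Hypothetical (execution time in s, energy in J) for three programs.
machine = [energy_efficiency(t, e) for t, e in [(100, 1500), (80, 1000), (120, 2000)]]
reference = [energy_efficiency(t, e) for t, e in [(90, 4000), (70, 3500), (110, 5000)]]
print("Relative efficiency: %.2f" % relative_efficiency(machine, reference))
```

A relative efficiency above 1 means the machine does more work per unit time per joule than the reference, averaged over the benchmark set.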
SPEC2000 Relative Energy Efficiency
[Figure: bar chart of relative energy efficiency (vertical axis 0 to 6) for SPECINT2000 and SPECFP2000 on three processors: Pentium M @ 1.6/0.6 GHz (energy-efficient processor), Pentium 4-M @ 2.4 GHz (reference), and Pentium III-M @ 1.2 GHz. Three operating modes are compared: always max. clock, laptop adaptive clock, and min. power min. clock.]
Voltage Scaling
- Dynamic: Reduce voltage and frequency during idle or low activity periods.
- Static: Clustered voltage scaling
  - Logic on non-critical paths given lower voltage.
  - 47% power reduction with 10% area increase reported.
  - M. Igarashi et al., “Clustered Voltage Scaling Techniques for Low-Power Design,” Proc. IEEE Symp. Low Power Design, 1997.
Processor Utilization
Throughput = Operations / second
[Figure: throughput versus time, showing compute-intensive processes running at maximum throughput, low throughput (background) processes, and system idle periods.]
Examples of Processes
- Compute-intensive: spreadsheet, spelling check, video decoding, scientific computing.
- Low throughput: data entry, screen updates, low bandwidth I/O data transfer.
- Idle: no computation, no expected output.

Effects of Voltage Reduction

- Voltage reduction increases delay, decreases throughput:
  - Slow reduction in throughput at first
  - Rapid reduction in throughput for VDD ≤ Vth
  - Time per operation (TPO) increases
- Voltage reduction continues to reduce power consumption:
  - Energy per operation (EPO) = Power × TPO
Energy per Operation (EPO)
[Figure: normalized EPO, Power, and TPO curves (vertical axis 0.0 to 1.0) plotted against VDD / Vth (horizontal axis 1 to 5).]
Dynamic Voltage and Clock
                        Time spent in:
Throughput              Fast mode   Slow mode   Idle mode   Battery life
Always full speed       10%         0%          90%         1 hr
Sometimes full speed    1%          90%         9%          5.3 hrs
Rarely full speed       0.1%        99%         0.9%        9.2 hrs

T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002, pp. 35-36.
Example: Find Minimum Energy Mode

- Processor data (rated operation):
  - 2 GHz clock
  - 1.5 volt supply voltage
  - 0.5 volt threshold voltage
  - Power consumption
    - 50 watts dynamic power
    - 50 watts static power
- Maximum clock frequency for V volt supply:
  $f \propto (V - V_{TH})/V$
Example Cont.
- Dynamic power:
  $P_d = C V^2 f = C (1.5)^2 \times 2 \times 10^9 = 50$ W
  $C = 11.11$ nF, capacitance switching/cycle
  $P_d = 11.11\, V^2 f$ (in W, with f in GHz)
- Dynamic energy per cycle:
  $E_d = P_d / f = 11.11\, V^2$ (in nJ)

Example Cont.

- Clock frequency:
  $f = k (V - V_{TH})/V = k (1.5 - 0.5)/1.5 = 2$ GHz
  $k = 3$ GHz, a proportionality constant
  $f = 3 (V - 0.5)/V$ GHz
Example Cont.
- Static power:
  $P_s = k' V^2 = k' (1.5)^2 = 50$ W
  $k' = 22.22$ mho, total leakage conductance
  $P_s = 22.22\, V^2$
- Static energy per cycle:
  $E_s = P_s / f = 22.22\, V^3 / [3 (V - 0.5)] = 7.41\, V^3 / (V - 0.5)$ (in nJ)

Example Cont.
- Total energy per cycle:
  $E = E_d + E_s = 11.11\, V^2 + 7.41\, V^3 / (V - 0.5)$
- To minimize E, set $\partial E / \partial V = 0$, or
  $5V^2 - 4.5V + 0.75 = 0$
- Solutions of quadratic equation:
  V = 0.679 volt, 0.221 volt
- Discard second solution, which is lower than the threshold voltage of 0.5 volt.
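
The minimum-energy voltage can also be checked numerically. The sketch below rebuilds the constants C, k and k' from the rated operating point and minimizes the total energy per cycle with a ternary search over supply voltages between the threshold and the rated voltage. The search routine and all names are illustrative choices, and the model is the one used in this example (f proportional to (V - VTH)/V, static power proportional to V^2).

```python
V_RATED = 1.5         # volts, rated supply
V_TH = 0.5            # volts, threshold voltage
F_RATED = 2.0e9       # Hz, rated clock
P_DYN_RATED = 50.0    # watts, dynamic power at rated operation
P_STAT_RATED = 50.0   # watts, static power at rated operation

C = P_DYN_RATED / (V_RATED ** 2 * F_RATED)     # about 11.11 nF switched per cycle
K = F_RATED * V_RATED / (V_RATED - V_TH)       # 3 GHz proportionality constant
K_LEAK = P_STAT_RATED / V_RATED ** 2           # about 22.22 mho leakage conductance

def clock_frequency(v):
    """Maximum clock frequency in Hz at supply voltage v."""
    return K * (v - V_TH) / v

def total_energy(v):
    """Dynamic plus static energy per cycle, in joules."""
    dynamic = C * v * v
    static = K_LEAK * v * v / clock_frequency(v)
    return dynamic + static

def minimize(func, lo, hi, tol=1e-9):
    """Ternary search for the minimum of a unimodal function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if func(m1) < func(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.0

v_opt = minimize(total_energy, V_TH + 1e-6, V_RATED)
print("V = %.3f V, f = %.0f MHz, E = %.2f nJ/cycle"
      % (v_opt, clock_frequency(v_opt) / 1e6, total_energy(v_opt) * 1e9))
```

The search starts just above the threshold voltage because the model's clock frequency falls to zero there, which makes the static energy per cycle unbounded.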

Example: Result
                        Rated mode   Low energy mode   Reduction (%)
Voltage                 1.5 V        0.679 V           54.7%
Clock frequency         2 GHz        791 MHz           60%
Dynamic energy/cycle    25.00 nJ     5.12 nJ           79.52%
Static energy/cycle     25.00 nJ     12.96 nJ          48.16%
Total energy/cycle      50.0 nJ      18.08 nJ          63.84%
Dynamic power           50.0 W       4.05 W            91.90%
Static power            50.0 W       10.25 W           79.50%
Total power             100.0 W      14.30 W           85.70%
Problem of Process Variation in
Nanometer Technologies
[Figure: distribution of the number of chips over threshold voltage, from lower Vth to higher Vth, with the nominal Vth marked. Labels on the plot: power specification, with yield loss due to high leakage; clock specification, with yield loss due to slow speed; lower voltage operation; higher voltage operation.]
From a presentation: Power Reduction using LongRun2 in Transmeta’s Efficeon Processor, by D. Ditzel, May 17, 2006.
Pipeline Gating

- A pipeline processor uses speculative execution.
  - Incorrect branch prediction results in pipeline stalls and wasted energy.
- Idea: Stop fetching instructions if a branch hazard is expected:
  - If the count (M) of incorrect predictions exceeds a pre-specified number (N), then suspend fetching instructions for some k cycles.
- Ref.: S. Manne, A. Klauser and D. Grunwald, “Pipeline Gating: Speculation Control for Energy Reduction,” Proc. 25th Annual International Symp. Computer Architecture, June 1998.
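
The gating rule can be illustrated with a small behavioral model. This sketch follows the wording of the slide (a count M compared against a threshold N, then k cycles without fetch); the hardware in the cited paper may differ in detail. The class name, the choice to reset M when gating starts, and the assumption k ≥ 1 are all illustrative, not taken from the source.

```python
class PipelineGate:
    """Behavioral sketch of fetch gating driven by a misprediction count."""

    def __init__(self, threshold_n, gate_cycles_k):
        self.threshold_n = threshold_n        # N: allowed count before gating
        self.gate_cycles_k = gate_cycles_k    # k: cycles with fetch suspended (k >= 1)
        self.mispredict_count_m = 0           # M: running count
        self.gated_cycles_left = 0

    def record_misprediction(self):
        self.mispredict_count_m += 1

    def fetch_allowed(self):
        """Call once per cycle; returns True if fetch may proceed this cycle."""
        if self.gated_cycles_left > 0:
            self.gated_cycles_left -= 1
            return False
        if self.mispredict_count_m > self.threshold_n:
            # Assumption: the counter is cleared when gating begins.
            self.mispredict_count_m = 0
            self.gated_cycles_left = self.gate_cycles_k - 1
            return False
        return True

def simulate(mispredict_cycles, total_cycles, threshold_n, gate_cycles_k):
    """Return a per-cycle list of fetch decisions (True = fetch)."""
    gate = PipelineGate(threshold_n, gate_cycles_k)
    trace = []
    for cycle in range(total_cycles):
        if cycle in mispredict_cycles:
            gate.record_misprediction()
        trace.append(gate.fetch_allowed())
    return trace

# Mispredictions in cycles 1, 2 and 3 with N = 2 and k = 3.
print(simulate({1, 2, 3}, total_cycles=8, threshold_n=2, gate_cycles_k=3))
```

In this run the third misprediction pushes M above N, so fetch is suspended for three cycles and then resumes.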
Slack Scheduling

- Application: Superscalar, out-of-order execution:
  - An instruction is executed as soon as the required data and resources become available.
  - A commit unit reorders the results.
- Delay the completion of instructions whose result is not immediately needed.
- Example of RISC instructions:

    add  r0, r1, r2;     (A)
    sub  r3, r4, r5;     (B)
    and  r9, r1, r9;     (C)
    or   r5, r9, r10;    (D)
    xor  r2, r10, r11;   (E)

J. Casmira and D. Grunwald, “Dynamic Instruction Scheduling Slack,” Proc. ACM Kool Chips Workshop, Dec. 2000.
Slack Scheduling Example
[Figure: instruction schedules for the example. Standard scheduling lists A, B, C, D, E; slack scheduling lists B, C, A, D, E.]
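
One way to see why instruction A can be delayed is to compute the slack of each of the five example instructions from their data dependences. The sketch below makes several assumptions that are not stated in the source: the last operand of each instruction is the destination register, every instruction takes one cycle, execution units are unlimited, and only read-after-write dependences matter (as with register renaming). Under a destination-first reading of the operands the dependences would come out differently.

```python
# (label, opcode, operands); ASSUMPTION: the last operand is the destination.
program = [
    ("A", "add", ("r0", "r1", "r2")),
    ("B", "sub", ("r3", "r4", "r5")),
    ("C", "and", ("r9", "r1", "r9")),
    ("D", "or",  ("r5", "r9", "r10")),
    ("E", "xor", ("r2", "r10", "r11")),
]

def raw_dependencies(prog):
    """Map each instruction to the set of earlier instructions whose results it reads."""
    deps = {label: set() for label, _, _ in prog}
    last_writer = {}
    for label, _, operands in prog:
        for src in operands[:-1]:
            if src in last_writer:
                deps[label].add(last_writer[src])
        last_writer[operands[-1]] = label
    return deps

def slack(prog):
    """Slack = latest start cycle minus earliest start cycle, unit latency."""
    deps = raw_dependencies(prog)
    labels = [label for label, _, _ in prog]
    earliest = {}
    for label in labels:                      # program order is a topological order
        earliest[label] = max((earliest[d] + 1 for d in deps[label]), default=0)
    last_cycle = max(earliest.values())
    latest = {}
    for label in reversed(labels):
        consumers = [c for c in labels if label in deps[c]]
        latest[label] = min((latest[c] - 1 for c in consumers), default=last_cycle)
    return {label: latest[label] - earliest[label] for label in labels}

print(raw_dependencies(program))
print(slack(program))
```

Under these assumptions D waits for B and C, and E waits for A and D, so A is the only instruction with slack: it can start one cycle late, for instance on a slower low-power unit, without delaying E.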
Slack Scheduling
[Figure: block diagram with labeled blocks: scheduling logic, re-order buffer, slack bit, and low-power execution units.]
Clock Distribution H-Tree
- Fanout, λ = 4
- Tree depth, $s = \log_\lambda N$
- No. of flip-flops = N
[Figure: H-tree clock distribution network with the clock input labeled.]
Clock Power
$$P_{clk} = C_L V_{DD}^2 f + \frac{C_L V_{DD}^2 f}{\lambda} + \frac{C_L V_{DD}^2 f}{\lambda^2} + \cdots = C_L V_{DD}^2 f \sum_{n=0}^{\text{stages}-1} \frac{1}{\lambda^n}$$

where $C_L$ = total load capacitance of N flip-flops
      $\lambda$ = constant fanout at each stage in distribution network
Clock consumes about 40% of total processor power, because
(1) Clock is always active
(2) Makes two transitions per cycle, (α = 2)
(3) Clock gating is useful; inhibit clock to unused blocks
Properties of H-Tree
- Balanced clock skew.
- Small delay and power consumption.
- Requires fine-tuning for complex layout.

Clock Power and Delay
- Unit size buffer or inverter delay = d
- Total dynamic power supplied to N flip-flops, $P = C_L V_{DD}^2 f$
- Total power consumption of clock network:

Flip-flops, N    Clock power per flip-flop    Clock delay
1                P                            d
4                P                            4d
16               1.25P                        8d
64               1.3125P                      12d
256              1.328125P                    16d
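
The power and delay entries for a fanout-4 tree follow from the geometric series for clock power. The sketch below evaluates that series and a simple delay model. It assumes that each stage, built from a unit-size buffer driving λ loads, contributes a delay of λ·d, and that a single flip-flop is driven directly with delay d; both assumptions are inferred from the pattern of the tabulated delays rather than stated in the source. The function names are illustrative.

```python
def tree_depth(n_flipflops, fanout=4):
    """Smallest number of stages s such that fanout**s >= n_flipflops."""
    depth, reach = 0, 1
    while reach < n_flipflops:
        reach *= fanout
        depth += 1
    return depth

def clock_power_factor(n_flipflops, fanout=4):
    """Clock network power as a multiple of P = C_L * VDD^2 * f."""
    stages = max(tree_depth(n_flipflops, fanout), 1)
    return sum(1.0 / fanout ** n for n in range(stages))

def clock_delay_units(n_flipflops, fanout=4):
    """Clock delay in units of d (ASSUMED: fanout * d per stage)."""
    stages = tree_depth(n_flipflops, fanout)
    return 1 if stages == 0 else fanout * stages

for n in (1, 4, 16, 64, 256):
    print("N = %3d: power = %.6f P, delay = %2d d"
          % (n, clock_power_factor(n), clock_delay_units(n)))
```

A four-stage tree with fanout 4 serves 4^4 = 256 flip-flops. The power factor approaches 1/(1 - 1/λ) = 4/3 as the tree deepens, so the distribution buffers add at most about a third to the flip-flop load power in this model, while the delay grows linearly with depth.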
Clock Network Examples
                     Alpha 21064      Alpha 21164      Alpha 21264
Technology           0.75μ CMOS       0.5μ CMOS        0.35μ CMOS
Frequency (MHz)      200              300              600
Total capacitance    12.5nF
Clock load           3.25nF           3.75nF
Clock power          40%              40% (20W)
Max. clock skew      200ps (<10%)     90ps
                                                       Clock gating used.
                                                       Total power 80-110W

D. W. Bailey and B. J. Benschneider, “Clocking Design and Analysis for a 600-MHz Alpha Microprocessor,” IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1627-1633, Nov. 1998.
Power Reduction Example







- Alpha 21064: 200 MHz @ 3.45 V, power dissipation = 26 W
- Reduce voltage to 1.5 V, power (5.3x) = 4.9 W
- Eliminate FP, power (3x) = 1.6 W
- Scale 0.75μ → 0.35μ, power (2x) = 0.8 W
- Reduce clock load, power (1.3x) = 0.6 W
- Reduce frequency 200 → 160 MHz, power (1.25x) = 0.5 W
- J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor,” IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1703-1714, Nov. 1996.
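
The chain of reductions can be replayed step by step. The sketch below divides the starting power by each factor in turn, and also checks that the voltage step matches the quadratic dependence of dynamic power on supply voltage, (3.45/1.5)^2 ≈ 5.3. The list structure and names are illustrative; the factors are the ones quoted on the slide.

```python
START_POWER_W = 26.0   # Alpha 21064 at 200 MHz and 3.45 V

# (description, power reduction factor) in the order given on the slide.
steps = [
    ("Reduce voltage 3.45 V -> 1.5 V", 5.3),
    ("Eliminate floating point", 3.0),
    ("Scale 0.75 um -> 0.35 um", 2.0),
    ("Reduce clock load", 1.3),
    ("Reduce frequency 200 -> 160 MHz", 1.25),
]

def apply_reductions(start_power, reductions):
    """Return the power after each successive reduction step."""
    history = []
    power = start_power
    for _, factor in reductions:
        power /= factor
        history.append(power)
    return history

history = apply_reductions(START_POWER_W, steps)
for (name, factor), power in zip(steps, history):
    print("%-34s /%.2f -> %.2f W" % (name, factor, power))

# Dynamic power scales with V^2, which accounts for the first factor.
voltage_factor = (3.45 / 1.5) ** 2
print("Voltage scaling factor: %.2f" % voltage_factor)
```

The frequency factor is likewise just the ratio 200/160 = 1.25, since dynamic power is proportional to clock frequency.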
For More on Microprocessors
- T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002.
- R. Graybill and R. Melhem, Power Aware Computing, New York: Plenum Publishers, 2002.
