ELEC 5270/6270 Spring 2013 Low-Power Design of Electronic Circuits Power Aware Microprocessors

ELEC 5270/6270 Spring 2013
Low-Power Design of Electronic Circuits
Power Aware Microprocessors
Vishwani D. Agrawal
James J. Danaher Professor
Dept. of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
vagrawal@eng.auburn.edu
http://www.eng.auburn.edu/~vagrawal/COURSE/E6270_Spr13/course.html
Copyright Agrawal, 2007
ELEC5270/6270 Spring 13, Lecture 8
1
SIA Roadmap for Processors (1999)

Year                     1999    2002    2005    2008    2011    2014
Feature size (nm)         180     130     100      70      50      35
Logic transistors/cm2    6.2M     18M     39M     84M    180M    390M
Clock (GHz)              1.25     2.1     3.5     6.0    10.0    16.9
Chip size (mm2)           340     430     520     620     750     900
Power supply (V)          1.8     1.5     1.2     0.9     0.6     0.5
High-perf. power (W)       90     130     160     170     175     183

Untrue predictions.
Source: http://www.semichips.org
Power Reduction in Processors

- Hardware methods:
  - Voltage reduction for dynamic power
  - Dual-threshold devices for leakage reduction
  - Clock gating, frequency reduction
  - Sleep mode
- Architecture:
  - Instruction set
  - Hardware organization
- Software methods
Performance Criteria

- Throughput: computations per unit time.
  - Performance is the inverse of time: increasing CPU time indicates lower performance.
- Power: computations per watt.
- Energy efficiency: performance/joule.
SPEC CPU2006 Benchmarks

- Standard Performance Evaluation Corporation (SPEC), http://www.spec.org
- Twelve integer and 17 floating-point programs: CINT2006 and CFP2006.
- Each program's run time is normalized to obtain a SPEC ratio with respect to the run time on a Sun Ultra Enterprise 2 system with a 296 MHz UltraSPARC II processor.
- It takes about 12 days to run all benchmarks on the reference system.
- The CINT2006 and CFP2006 metrics are the geometric means of the SPEC ratios:
  - Peak metric: each program is individually optimized (aggressive compilation).
  - Base metric: common optimization for all programs.
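The SPEC ratio and its geometric mean can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above: the function names and all run times below are hypothetical placeholders, not published SPEC data.

```python
import math


def spec_ratio(reference_time_s, measured_time_s):
    """SPEC ratio: reference-machine run time divided by measured run time."""
    return reference_time_s / measured_time_s


def geometric_mean(values):
    """Geometric mean of positive values, computed via logarithms."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical run times in seconds (assumed for illustration only).
reference_times = [9000.0, 8000.0, 12000.0]
measured_times = [450.0, 500.0, 400.0]

ratios = [spec_ratio(r, m) for r, m in zip(reference_times, measured_times)]
print("SPEC ratios:", ratios)
print("Overall metric (geometric mean):", round(geometric_mean(ratios), 2))
```

The geometric mean is used so that the ranking of two machines does not depend on which machine is chosen as the reference.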
SPEC CINT2006 Results

http://www.spec.org/cpu2006/results/cint2006.html

Dell Inc., PowerEdge R610
- CPU: Intel Xeon X5670, 2.93 GHz
- Number of chips 2, cores 12, threads/core 2
- Performance metric: 36.6 base, 39.4 peak

Dell Inc., PowerEdge M905
- CPU: AMD Opteron 8381 HE, 2.50 GHz
- Number of chips 4, cores 16, threads/core 1
- Performance metric: 15.8 base, 19.1 peak
SPEC CFP2006 Results

http://www.spec.org/cpu2006/results/cfp2006.html

Dell Inc., PowerEdge R610
- CPU: Intel Xeon X5670, 2.93 GHz
- Number of chips 2, cores 12, threads/core 2
- Performance metric: 42.5 base, 45.8 peak

Dell Inc., PowerEdge M905
- CPU: AMD Opteron 8381 HE, 2.50 GHz
- Number of chips 4, cores 16, threads/core 1
- Performance metric: 17.4 base, 21.5 peak
Other Benchmarks

- LINPACK is a numerically intensive floating-point linear system (Ax = b) program used for benchmarking supercomputers.
- SPECpower_ssj2008 measures the power and performance of a computer system.
  - The initial benchmark addresses the performance of server-side Java; additional workloads are planned.
  - http://www.spec.org/benchmarks.html#power
Second Quarter 2010 SPECpower_ssj2008 Results

http://www.spec.org/power_ssj2008/results/res2010q2/

Apr 7, 2010: Hewlett-Packard ProLiant DL385 G7
- CPU: AMD Opteron 6174, 2.2GHz
- Number of chips 2, cores 12, threads/core 2
- Total memory 16GB
- ssj operations @ 100%: 888,819
- Average power @ 100%: 271 W
- Average power @ active idle: 101 W
- Overall ssj operations per watt: 2,355
Second Quarter 2010 SPECpower_ssj2008 Results

http://www.spec.org/power_ssj2008/results/res2010q2/

May 19, 2010: Dell Inc., PowerEdge R610
- CPU: Intel Xeon X5670, 2.93 GHz
- Number of chips 2, cores 12, threads/core 2
- Total memory 12GB
- ssj operations @ 100%: 914,076
- Average power @ 100%: 244 W
- Average power @ active idle: 62.3 W
- Overall ssj operations per watt: 2,938
Energy SPEC Benchmarks

Energy efficiency mode: Besides the execution time, the energy efficiency of SPEC benchmark programs is also measured. The energy efficiency of a benchmark program is given by:

$$\text{Energy efficiency} = \frac{1/(\text{Execution time})}{\text{Average power}}$$

D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, 4th Edition, Morgan Kaufmann Publishers (Elsevier), 2009.
Energy Efficiency

- Efficiency averaged over n benchmark programs:

$$\text{Efficiency} = \left( \prod_{i=1}^{n} \text{Efficiency}_i \right)^{1/n}$$

  where Efficiency_i is the efficiency for program i.
- Relative efficiency:

$$\text{Relative efficiency} = \frac{\text{Efficiency of a computer}}{\text{Efficiency of reference computer}}$$
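As a rough sketch of the definitions above, the following Python computes the energy efficiency of each benchmark program as (1/execution time)/average power, averages it with a geometric mean, and forms the relative efficiency against a reference computer. All function names and measurements are hypothetical and chosen only to show the arithmetic.

```python
import math


def energy_efficiency(execution_time_s, average_power_w):
    """Energy efficiency = (1 / execution time) / average power."""
    return (1.0 / execution_time_s) / average_power_w


def mean_efficiency(efficiencies):
    """Geometric mean of per-program efficiencies."""
    efficiencies = list(efficiencies)
    log_sum = sum(math.log(e) for e in efficiencies)
    return math.exp(log_sum / len(efficiencies))


def relative_efficiency(machine_effs, reference_effs):
    """Mean efficiency of a computer divided by that of the reference computer."""
    return mean_efficiency(machine_effs) / mean_efficiency(reference_effs)


# Hypothetical (execution time in s, average power in W) per benchmark program.
machine = [(100.0, 10.0), (200.0, 12.0)]
reference = [(80.0, 40.0), (150.0, 45.0)]

machine_effs = [energy_efficiency(t, p) for t, p in machine]
reference_effs = [energy_efficiency(t, p) for t, p in reference]
print("Relative efficiency:", round(relative_efficiency(machine_effs, reference_effs), 3))
```

Because execution time multiplied by average power is energy, each per-program efficiency is simply the reciprocal of the energy that program consumed.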
SPEC2000 Relative Energy Efficiency

[Figure: bar chart of relative energy efficiency (vertical scale 0 to 6) for SPECINT2000 and SPECFP2000 on three processors: Pentium M @1.6/0.6GHz (energy-efficient processor), Pentium 4-M @2.4GHz (reference), and Pentium III-M @1.2GHz. Three operating modes are compared: always max. clock, laptop adaptive clock, and min. power min. clock.]
Voltage Scaling

- Dynamic: Reduce voltage and frequency during idle or low-activity periods.
- Static: Clustered voltage scaling
  - Logic on non-critical paths is given a lower voltage.
  - 47% power reduction with 10% area increase reported.
  - M. Igarashi et al., “Clustered Voltage Scaling Techniques for Low-Power Design,” Proc. IEEE Symp. Low Power Design, 1997.
Processor Utilization

Throughput = Operations / second

[Figure: throughput versus time, showing compute-intensive processes at maximum throughput, low-throughput (background) processes, and system idle periods.]
Examples of Processes

- Compute-intensive: spreadsheet, spelling check, video decoding, scientific computing.
- Low throughput: data entry, screen updates, low-bandwidth I/O data transfer.
- Idle: no computation, no expected output.
Effects of Voltage Reduction

- Voltage reduction increases delay and decreases throughput:
  - Slow reduction in throughput at first
  - Rapid reduction in throughput for VDD ≤ Vth
  - Time per operation (TPO) increases
- Voltage reduction continues to reduce power consumption:
  - Energy per operation (EPO) = Power × TPO
Energy per Operation (EPO)

[Figure: normalized EPO, power, and TPO (vertical scale 0.0 to 1.0) plotted against VDD / Vth (horizontal scale 1 to 5).]
Dynamic Voltage and Clock

                        Time spent in:
Throughput              Fast mode   Slow mode   Idle mode   Battery life
Always full speed       10%         0%          90%         1 hr
Sometimes full speed    1%          90%         9%          5.3 hrs
Rarely full speed       0.1%        99%         0.9%        9.2 hrs

T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002, pp. 35-36.
Example: Find Minimum Energy Mode

- Processor data (rated operation):
  - 2 GHz clock
  - 1.5 volt supply voltage
  - 0.5 volt threshold voltage
  - Power consumption:
    - 50 watts dynamic power
    - 50 watts static power
- Maximum clock frequency for a supply of V volts (alpha-power law with α = 1): $f \propto (V - V_{TH})/V$
Alpha-Power Law Model

- Variation of delay with supply voltage:

$$\text{delay} \propto \frac{V_{DD}}{(V_{DD} - V_{TH})^{\alpha}}$$

  - $V_{TH}$ = threshold voltage
  - α = 1 for short-channel devices, ≈ 2 for long-channel devices

- T. Sakurai and A. R. Newton, “Delay analysis of series-connected MOSFET circuits,” IEEE Journal of Solid-State Circuits, Vol. 26, pp. 122-131, Feb. 1991.
- T. Sakurai and A. R. Newton, “A simple MOSFET model for circuit analysis,” IEEE Transactions on Electron Devices, Vol. 38, No. 4, pp. 887-894, Apr. 1991.
- T. Sakurai, “High-speed circuit design with scaled-down MOSFETs and low supply voltage (invited),” Proc. IEEE ISCAS, pp. 1487-1490, Chicago, May 1993.
- T. Sakurai, “Alpha-Power Law MOS Model,” IEEE Solid-State Circuits Society Newsletter, Vol. 9, No. 4, pp. 4-5, Oct. 2004.
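To get a feel for the alpha-power law, the sketch below evaluates the delay at a reduced supply voltage relative to the delay at a reference voltage, for α = 1 and α = 2. The function name is made up for this illustration, and the voltages (0.5 V threshold, 1.5 V reference, 1.0 V reduced supply) are assumed example values.

```python
def relative_delay(vdd, vth, alpha, vdd_ref):
    """Delay at vdd divided by delay at vdd_ref, using delay ~ VDD / (VDD - VTH)**alpha."""
    if vdd <= vth or vdd_ref <= vth:
        raise ValueError("the model applies only above the threshold voltage")

    def delay(v):
        return v / (v - vth) ** alpha

    return delay(vdd) / delay(vdd_ref)


# Assumed example values: VTH = 0.5 V, reference VDD = 1.5 V, reduced VDD = 1.0 V.
for alpha in (1.0, 2.0):
    ratio = relative_delay(vdd=1.0, vth=0.5, alpha=alpha, vdd_ref=1.5)
    print(f"alpha = {alpha}: delay grows by a factor of {ratio:.3f}")
```

The proportionality constant cancels in the ratio, so only the voltages and α are needed.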
Example Cont.

- Dynamic power:
  $P_d = C V^2 f = C (1.5)^2 \times 2 \times 10^9 = 50$ W
  $C = 11.11$ nF, the capacitance switching per cycle
  $P_d = 11.11\, V^2 f$ (in watts, with f in GHz)
- Dynamic energy per cycle:
  $E_d = P_d / f = 11.11\, V^2$ (in nJ)
Example Cont.

- Clock frequency:
  $f = k (V - V_{TH})/V = k (1.5 - 0.5)/1.5 = 2$ GHz
  $k = 3$ GHz, a proportionality constant
  $f = 3 (V - 0.5)/V$ GHz
Example Cont.

- Static power:
  $P_s = k' V^2 = k' (1.5)^2 = 50$ W
  $k' = 22.22$ mho, the total leakage conductance
  $P_s = 22.22\, V^2$ (in watts)
- Static energy per cycle:
  $E_s = P_s / f = 22.22\, V^3 / [3 (V - 0.5)] = 7.41\, V^3 / (V - 0.5)$ (in nJ)
Example Cont.

- Total energy per cycle (in nJ):
  $E = E_d + E_s = 11.11\, V^2 + 7.41\, V^3 / (V - 0.5)$
- To minimize E, set $\partial E / \partial V = 0$:
  $22.22\, V + 7.41\, V^2 (2V - 1.5)/(V - 0.5)^2 = 0$, which simplifies to
  $5V^2 - 4.5V + 0.75 = 0$
- Solutions of the quadratic equation: V = 0.679 volt, 0.221 volt
- Discard the second solution, which is lower than the threshold voltage of 0.5 volt.
Example: Result

                        Rated mode   Low energy mode   Reduction (%)
Voltage                 1.5 V        0.679 V           54.7%
Clock frequency         2 GHz        791 MHz           60%
Dynamic energy/cycle    25.00 nJ     5.12 nJ           79.52%
Static energy/cycle     25.00 nJ     12.96 nJ          48.16%
Total energy/cycle      50.0 nJ      18.08 nJ          63.84%
Dynamic power           50.0 W       4.05 W            91.90%
Static power            50.0 W       10.25 W           79.50%
Total power             100.0 W      14.20 W           85.80%
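The minimum-energy example can be checked numerically. The sketch below rebuilds the model constants from the rated operating point (2 GHz, 1.5 V supply, 0.5 V threshold, 50 W dynamic and 50 W static power) and searches for the supply voltage that minimizes total energy per cycle. The golden-section search and all function names are choices made for this illustration; the slides solve the quadratic directly instead.

```python
import math

# Rated operating point from the example.
V_RATED, V_TH = 1.5, 0.5            # volts
F_RATED_GHZ = 2.0                   # GHz
P_DYN_W, P_STAT_W = 50.0, 50.0      # watts

C_NF = P_DYN_W / (V_RATED ** 2 * F_RATED_GHZ)      # switched capacitance, nF
K_GHZ = F_RATED_GHZ * V_RATED / (V_RATED - V_TH)   # frequency constant, GHz
G_MHO = P_STAT_W / V_RATED ** 2                    # leakage conductance, mho


def frequency_ghz(v):
    """Maximum clock frequency, alpha-power law with alpha = 1."""
    return K_GHZ * (v - V_TH) / v


def total_energy_nj(v):
    """Dynamic plus static energy per cycle in nJ (W / GHz = nJ)."""
    dynamic = C_NF * v ** 2
    static = G_MHO * v ** 2 / frequency_ghz(v)
    return dynamic + static


def golden_section_minimize(func, lo, hi, tol=1e-9):
    """Minimize a unimodal function on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if func(c) < func(d):
            b = d
        else:
            a = c
    return (a + b) / 2.0


# Search just above the threshold voltage up to the rated voltage.
v_opt = golden_section_minimize(total_energy_nj, V_TH + 1e-6, V_RATED)
print(f"V = {v_opt:.3f} V, f = {1000 * frequency_ghz(v_opt):.0f} MHz, "
      f"E = {total_energy_nj(v_opt):.2f} nJ/cycle")
```

Above the threshold voltage the energy curve has a single minimum, which is why a simple bracketing search is sufficient here.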
Cycle Efficiency

- Cycle efficiency is a rating similar to the maximum clock frequency rating.
- Analogy:
  - Cycle efficiency is similar to miles per gallon (mpg)
  - Maximum clock frequency is similar to miles per hour (mph)
- Reference: A. Shinde and V. D. Agrawal, “Managing Performance and Efficiency of a Processor,” Proc. 45th IEEE Southeastern Symp. System Theory, March 2013.
Performance in Time

- Performance is measured with respect to a program.

$$\text{Performance} = \frac{1}{\text{Execution time}}$$

D. A. Patterson and J. L. Hennessy, Computer Organization & Design, the Hardware/Software Interface, Fourth Edition, San Francisco, California: Morgan Kaufmann Publishers, Inc., 2008.
Performance in Energy (Efficiency)

- Efficiency is measured with respect to a program.

$$\text{Efficiency} = \frac{\text{Performance}}{\text{Power consumption}} = \frac{1}{\text{Energy dissipated}}$$

D. A. Patterson and J. L. Hennessy, Computer Organization & Design, the Hardware/Software Interface, Fourth Edition, San Francisco, California: Morgan Kaufmann Publishers, Inc., 2008.
Two Performances

- Time performance:

$$\text{Time performance} = \frac{1}{\text{Execution time}}$$

- Energy performance:

$$\text{Energy performance} = \frac{1}{\text{Energy dissipated}}$$

D. A. Patterson and J. L. Hennessy, Computer Organization & Design, the Hardware/Software Interface, Fourth Edition, San Francisco, California: Morgan Kaufmann Publishers, Inc., 2008.
Time Performance or Clock

- Speed of a processor is measured in cycles per second or clock frequency (f).
- Execution time of a program using C clock cycles = C/f
- Time performance = f/C
- Clock period (1/f) is the time per cycle.
Energy Performance

- Energy efficiency of a processor may be measured in cycles per joule or cycle efficiency (η).
- Energy dissipated by a program using C clock cycles = C/η
- Energy performance = η/C
- 1/η is the energy per cycle (EPC)
Characterizing Device Technology: Speed and Efficiency

- Consider 90nm CMOS technology.
- Use predictive technology model (PTM).
- Example circuit: Eight-bit ripple carry adder.
- Nominal voltage = 1.2 volts.
- Simulation for varying operating conditions (VDD = 100mV through 1.2V) using Spice:
  - With random vectors for energy per cycle (EPC = 1/η).
  - With critical path vectors for clock period (1/f).
- Reference: W. Zhao and Y. Cao, “New Generation of Predictive Technology Model for Sub-45nm Early Design Exploration,” IEEE Trans. Electron Devices, vol. 53, no. 11, pp. 2816-2823, 2006.
Energy per Cycle of 8-Bit Adder

K. Kim, “Ultra Low Power CMOS Design,” PhD Dissertation, Auburn University,
Dept. of ECE, Auburn, Alabama, May 2011.
Cycle Time of 8-Bit Adder

K. Kim, “Ultra Low Power CMOS Design,” PhD Dissertation, Auburn University, Dept.
of ECE, Auburn, Alabama, May 2011.
Pentium M processor

- Published data: H. Hanson, K. Rajamani, S. Keckler, F. Rawson, S. Ghiasi, J. Rubio, “Thermal Response to DVFS: Analysis with an Intel Pentium M,” Proc. International Symp. Low Power Electronics and Design, 2007, pp. 219-224.
- VDD = 1.2V
- Maximum clock rate = 1.8GHz
- Critical path delay, td = 1/1.8GHz = 555.56ps
- Power consumption = 120W
- EPC = 120W/(1.8GHz) = 66.67nJ
Cycle Efficiency and Frequency
Example

For a program that executes in 1.8 billion clock cycles:

Voltage   Frequency f        Cycle efficiency η     Execution time   Total energy   Power
VDD       (megacycles/s)     (megacycles/joule)     (seconds)        consumed       f/η
1.2 V     1800               15                     1.0              120 joules     120 W
0.6 V     277                70                     6.5              25 joules      3.96 W
200 mV    54.5               660                    33               2.72 joules    0.083 W
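The three quantities in this example follow directly from the two ratings: execution time = C/f, energy = C/η, and power = f/η. A minimal Python sketch, using the frequency and cycle-efficiency values quoted for the three supply voltages (the function names are made up for this illustration):

```python
def execution_time_s(cycles, freq_hz):
    """Execution time = C / f."""
    return cycles / freq_hz


def energy_j(cycles, eta_cycles_per_joule):
    """Energy dissipated = C / eta."""
    return cycles / eta_cycles_per_joule


def power_w(freq_hz, eta_cycles_per_joule):
    """Average power = f / eta (cycles/s divided by cycles/joule)."""
    return freq_hz / eta_cycles_per_joule


CYCLES = 1.8e9  # program length in clock cycles

# (supply voltage label, frequency in Hz, cycle efficiency in cycles/joule)
operating_points = [
    ("1.2 V", 1800e6, 15e6),
    ("0.6 V", 277e6, 70e6),
    ("200 mV", 54.5e6, 660e6),
]

for label, f, eta in operating_points:
    print(f"{label:>6}: time = {execution_time_s(CYCLES, f):6.2f} s, "
          f"energy = {energy_j(CYCLES, eta):7.2f} J, "
          f"power = {power_w(f, eta):8.3f} W")
```

Small differences from the rounded figures quoted for each operating point are expected, since the frequency and efficiency values are themselves rounded.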
Cycle Efficiency

- New performance rating: Cycle efficiency η; the unit is cycles per joule.
- Clock frequency f in cycles per second is a similar rating with respect to time.
- Similarity to other popular ratings:
  - η → mpg
  - f → mph
- Two ratings allow effective time and energy management of an electronic system.
Problem of Process Variation in Nanometer Technologies

[Figure: distribution of the number of chips versus threshold voltage, from lower Vth to higher Vth around the nominal voltage Vth. Labels: power specification, yield loss due to high leakage, lower voltage operation; clock specification, yield loss due to slow speed, higher voltage operation.]

From a presentation: Power Reduction using LongRun2 in Transmeta’s Efficeon Processor, by D. Ditzel, May 17, 2006.
Clock Distribution H-Tree

[Figure: H-tree distributing the clock to flip-flops.]

- Fanout, λ = 4
- Tree depth, $s = \log_{\lambda} N$
- No. of flip-flops = N
Clock Power

$$P_{clk} = C_L V_{DD}^2 f + \frac{C_L V_{DD}^2 f}{\lambda} + \frac{C_L V_{DD}^2 f}{\lambda^2} + \cdots = C_L V_{DD}^2 f \sum_{n=0}^{\text{stages}-1} \frac{1}{\lambda^n}$$

where
- $C_L$ = total load capacitance of N flip-flops
- λ = constant fanout at each stage in the distribution network

Clock consumes about 40% of total processor power, because
(1) Clock is always active
(2) Clock makes two transitions per cycle (α = 2)

Clock gating is useful: inhibit the clock to unused blocks.
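The geometric series in the clock power expression converges quickly, so the distribution network adds only a modest overhead on top of the flip-flop load. The sketch below evaluates the multiplier for a tree with constant fanout; the function names, and the capacitance, voltage, and frequency in the usage lines, are hypothetical values chosen for illustration.

```python
def tree_depth(num_flip_flops, fanout=4):
    """Number of buffer stages needed to reach num_flip_flops (at least one stage)."""
    stages, reach = 0, 1
    while reach < num_flip_flops:
        reach *= fanout
        stages += 1
    return max(stages, 1)


def clock_power_factor(num_flip_flops, fanout=4):
    """Sum of 1/fanout**n for n = 0 .. stages - 1."""
    stages = tree_depth(num_flip_flops, fanout)
    return sum(1.0 / fanout ** n for n in range(stages))


def clock_power_w(c_load_f, vdd, freq_hz, num_flip_flops, fanout=4):
    """P_clk = C_L * VDD**2 * f * (series sum)."""
    return c_load_f * vdd ** 2 * freq_hz * clock_power_factor(num_flip_flops, fanout)


for n in (4, 16, 64, 256):
    print(f"N = {n:3d}: P_clk = {clock_power_factor(n):.6f} x C_L VDD^2 f")

# Hypothetical load: 1 nF at 1.2 V and 1 GHz, 256 flip-flops.
print(f"Example clock power: {clock_power_w(1e-9, 1.2, 1e9, 256):.3f} W")
```

For a large tree the multiplier approaches λ/(λ - 1), which is 4/3 for a fanout of 4.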
Properties of H-Tree

- Balanced clock skew.
- Small delay and power consumption.
- Requires fine-tuning for complex layout.
Clock Power and Delay

- Unit size buffer or inverter delay = d
- Total dynamic power supplied to N flip-flops, $P = C_L V_{DD}^2 f$
- Total power consumption of clock network:

Flip-flops, N   Clock power per flip-flop   Clock delay
1               P                           d
4               P                           4d
16              1.25P                       8d
64              1.3125P                     12d
256             1.328125P                   16d
Clock Network Examples

                    Alpha 21064     Alpha 21164     Alpha 21264
Technology          0.75μ CMOS      0.5μ CMOS       0.35μ CMOS
Frequency (MHz)     200             300             600
Total capacitance   12.5nF          -               -
Clock load          3.25nF          3.75nF          -
Clock power         40%             40% (20W)       -
Max. clock skew     200ps (<10%)    90ps            -

Alpha 21264: Clock gating used. Total power 80-110W.

D. W. Bailey and B. J. Benschneider, “Clocking Design and Analysis for a 600-MHz Alpha Microprocessor,” IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1627-1633, Nov. 1998.
Architecture Level: Pipeline Gating

- A pipelined processor uses speculative execution.
  - Incorrect branch prediction results in pipeline stalls and wasted energy.
- Idea: Stop fetching instructions if a branch hazard is expected:
  - If the count (M) of incorrect predictions exceeds a prespecified number (N), then suspend fetching instructions for some k cycles.
- Ref.: S. Manne, A. Klauser and D. Grunwald, “Pipeline Gating: Speculation Control for Energy Reduction,” Proc. 25th Annual International Symp. Computer Architecture, June 1998.
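The gating rule above can be illustrated with a small behavioral model. This is only a sketch of the rule as stated on the slide, not the mechanism of the cited paper: the class name, the choice to reset the counter M once fetch is gated, and the example parameters are all assumptions made for illustration.

```python
class PipelineGate:
    """Behavioral sketch: gate instruction fetch for k cycles when M exceeds N."""

    def __init__(self, threshold_n, stall_cycles_k):
        self.threshold_n = threshold_n        # prespecified number N
        self.stall_cycles_k = stall_cycles_k  # fetch suspension length k
        self.mispredict_count_m = 0           # running count M
        self.stall_remaining = 0

    def record_branch(self, predicted_correctly):
        """Update M after a branch resolves; start gating if M exceeds N."""
        if not predicted_correctly:
            self.mispredict_count_m += 1
        if self.mispredict_count_m > self.threshold_n:
            self.stall_remaining = self.stall_cycles_k
            self.mispredict_count_m = 0  # assumption: restart the count after gating

    def fetch_allowed(self):
        """Call once per cycle; returns False while fetch is suspended."""
        if self.stall_remaining > 0:
            self.stall_remaining -= 1
            return False
        return True


# Hypothetical parameters: N = 1, k = 2.
gate = PipelineGate(threshold_n=1, stall_cycles_k=2)
gate.record_branch(predicted_correctly=False)   # M = 1, not above N
gate.record_branch(predicted_correctly=False)   # M = 2 > N, fetch is gated
print([gate.fetch_allowed() for _ in range(4)])
```

The energy saving comes from the cycles in which no instructions are fetched down a path that is likely to be discarded.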
Slack Scheduling

- Application: Superscalar, out-of-order execution:
  - An instruction is executed as soon as the required data and resources become available.
  - A commit unit reorders the results.
- Delay the completion of instructions whose result is not immediately needed.
- Example of RISC instructions:

    add  r0, r1, r2;      (A)
    sub  r3, r4, r5;      (B)
    and  r9, r1, r9;      (C)
    or   r5, r9, r10;     (D)
    xor  r2, r10, r11;    (E)

J. Casmira and D. Grunwald, “Dynamic Instruction Scheduling Slack,” Proc. ACM Kool Chips Workshop, Dec. 2000.
Slack Scheduling Example

[Figure: two schedule diagrams for instructions A through E, one for standard scheduling and one for slack scheduling.]
Slack Scheduling

[Figure: block diagram with scheduling logic, re-order buffer, low-power execution units, and slack bit.]
Power Reduction Example

- Alpha 21064: 200MHz @ 3.45V, power dissipation = 26W
- Reduce voltage to 1.5V, power (5.3x) = 4.9W
- Eliminate FP, power (3x) = 1.6W
- Scale 0.75μ → 0.35μ, power (2x) = 0.8W
- Reduce clock load, power (1.3x) = 0.6W
- Reduce frequency 200 → 160MHz, power (1.25x) = 0.5W

J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor,” IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1703-1714, Nov. 1996.
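The chain of reductions above is a sequence of divisions, and two of the factors follow from simple scaling: dynamic power scales with the square of the supply voltage and linearly with frequency. A short Python sketch of the arithmetic (the function name is made up for this illustration):

```python
def apply_reductions(initial_power_w, steps):
    """Return the power after each (description, factor) step is applied in turn."""
    powers = []
    power = initial_power_w
    for _, factor in steps:
        power /= factor
        powers.append(power)
    return powers


steps = [
    ("Reduce voltage 3.45 V to 1.5 V", 5.3),
    ("Eliminate FP", 3.0),
    ("Scale 0.75u to 0.35u", 2.0),
    ("Reduce clock load", 1.3),
    ("Reduce frequency 200 to 160 MHz", 1.25),
]

for (description, factor), power in zip(steps, apply_reductions(26.0, steps)):
    print(f"{description:<34} ({factor}x): {power:.2f} W")

# Where two of the factors come from:
print("Voltage factor (3.45/1.5)^2 =", round((3.45 / 1.5) ** 2, 2))
print("Frequency factor 200/160    =", 200 / 160)
```

The intermediate values differ slightly from the rounded figures in the list because each listed power is rounded before the next factor is applied.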
For More on Microprocessors

- T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002.
- R. Graybill and R. Melhem, Power Aware Computing, New York: Plenum Publishers, 2002.