The Elusive Metric for Low-Power Architecture Research
Hsien-Hsin “Sean” Lee
A. Utku Diril
Joshua B. Fryman
Yuvraj S. Dhillon
Center for Experimental Research in Computer Systems
Georgia Institute of Technology
Atlanta, GA 30332
Workshop for Complexity-Effective Design, San Diego, CA, 2003
Background Picture
• Energy-Delay product (EDP) [Gonzalez & Horowitz 96]
  - “Power” is meaningless (∝ frequency)
  - “Energy per instruction” is elusive (∝ CV²)
  - “Energy × Delay” (J/SPEC or J/IPC) is better
  - Using the alpha-power model, $E \propto C V_{dd}^2$ and $D \propto C V_{dd} / (V_{dd} - V_{th})^{\alpha}$, so $ED \propto C^2 V_{dd}^3 / (V_{dd} - V_{th})^{\alpha}$
  - Note that EDP has no “physical” meaning
• Widespread adoption
  - De facto standard in the community
  - Metric for energy and complexity effectiveness
• New architectural techniques have arrived
  - New hardware exploiting low-power opportunities
  - Temperature-aware power detectors
  - Voltage & frequency scaling
  - Multi-threshold voltage
WCED-03
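To make the alpha-power expression concrete, the sketch below evaluates $ED \propto V_{dd}^3/(V_{dd}-V_{th})^{\alpha}$ (capacitance held fixed) over a sweep of supply voltages. The threshold voltage and α are hypothetical placeholders, not values from this talk, and the talk itself does not perform this optimization. Setting the derivative of the logarithm to zero, $3/V_{dd} = \alpha/(V_{dd}-V_{th})$, gives a minimum at $V_{dd} = 3V_{th}/(3-\alpha)$ for α < 3, which a brute-force sweep should agree with.

```python
# Sketch: ED product vs. supply voltage under the alpha-power model.
# V_TH and ALPHA are hypothetical example values (assumptions, not from the slides).

V_TH = 0.3    # threshold voltage in volts (assumed)
ALPHA = 1.3   # velocity-saturation index, typically between 1 and 2 (assumed)


def relative_ed(vdd: float, vth: float = V_TH, alpha: float = ALPHA) -> float:
    """ED up to a constant factor: Vdd^3 / (Vdd - Vth)^alpha, with C fixed."""
    if vdd <= vth:
        raise ValueError("Vdd must exceed Vth")
    return vdd ** 3 / (vdd - vth) ** alpha


def analytic_optimum(vth: float = V_TH, alpha: float = ALPHA) -> float:
    """Vdd minimizing ED, from 3/Vdd = alpha/(Vdd - Vth); valid for alpha < 3."""
    return 3 * vth / (3 - alpha)


def sweep_optimum(vth: float = V_TH, alpha: float = ALPHA,
                  vmax: float = 2.0, steps: int = 100_000) -> float:
    """Brute-force search for the minimizing Vdd on a uniform grid in (Vth, vmax]."""
    best_v, best_ed = vmax, float("inf")
    for i in range(1, steps + 1):
        v = vth + (vmax - vth) * i / steps
        ed = relative_ed(v, vth, alpha)
        if ed < best_ed:
            best_v, best_ed = v, ed
    return best_v


if __name__ == "__main__":
    print(f"analytic optimum: {analytic_optimum():.3f} V")
    print(f"sweep optimum:    {sweep_optimum():.3f} V")
```

Because $V_{dd}^3/(V_{dd}-V_{th})^{\alpha}$ has a single minimum on $(V_{th}, \infty)$, a fine grid is enough to confirm the closed form.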
Outline of the Talk
Potential pitfalls
  Yeah, we all know, it is obvious... but
  Which “E” goes in the ED product?
  Impact of new hardware (more transistors)
  Methodology matters in deep-submicron processes
Observations
Summary
Calculating ED Product
New architectural solutions save energy at the expense of a (supposedly tolerable) performance loss
A number of research results were reported in the following manner:
Technique “X” for the Data Cache
• Reduces Data Cache energy by 50%
• Loses 20% IPC
• EDP = (1-0.5)(1+0.2) = 0.60 → very energy-efficient
Technique “Y” for the Branch Predictor
• Reduces Branch Predictor energy by 10%
• Loses 20% IPC
• EDP = (1-0.1)(1+0.2) = 1.08 → energy-inefficient
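The arithmetic above can be reproduced with a small helper (function names are illustrative). It follows the slide in treating a 20% IPC loss as a 20% delay increase; strictly, at fixed clock frequency and instruction count, delay scales by 1/(1 - 0.2) = 1.25, which gives 0.625 for X and 1.125 for Y and does not change either verdict.

```python
def edp_ratio(energy_saved: float, delay_increase: float) -> float:
    """Normalized EDP of a new design vs. the baseline (1.0 means break-even)."""
    return (1 - energy_saved) * (1 + delay_increase)


def delay_increase_from_ipc_loss(ipc_loss: float) -> float:
    """Delay increase implied by an IPC loss at fixed frequency and instruction count."""
    return 1 / (1 - ipc_loss) - 1


# Slide arithmetic: a 20% IPC loss treated as 20% more delay
print(f"X: {edp_ratio(0.5, 0.2):.3f}")   # 0.600
print(f"Y: {edp_ratio(0.1, 0.2):.3f}")   # 1.080

# Stricter conversion of the same 20% IPC loss
d = delay_increase_from_ipc_loss(0.2)     # 0.25
print(f"X: {edp_ratio(0.5, d):.3f}")      # 0.625
print(f"Y: {edp_ratio(0.1, d):.3f}")      # 1.125
```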
So What is E and What is D in EDP?
• Hypothetical black box
• Battery (i.e., E) shared by the CPU, DRAM, chipsets, graphics, TFT display, Wi-Fi, HDD, and flash disk
• D typically accounts for some system effects, such as DRAM latency
• Improvement proposed:
  - Remove 5% of E from the flash disk
  - No delay incurred
  - Is this a good design decision?
• The flash disk is 10% of total E in the system
  - The improvement amounts to a 0.5% system-level impact
  - An “in-the-noise” improvement
  - Is the “complexity” worth the effort?
[Figure: system block diagram with the battery, graphics card, flash, C.S. (chipset), 802.11, DDR DRAM, HDD, and TFT display]
So, is EDP being used the right way? And is EDP really that important?
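The system-level view amounts to one multiplication: a component's local saving must be weighted by that component's share of total energy before it enters the EDP. A minimal sketch with illustrative function names:

```python
def system_energy_saving(unit_share: float, unit_saving: float) -> float:
    """Fraction of total system energy saved when one component saves
    unit_saving of its own energy and accounts for unit_share of the total."""
    return unit_share * unit_saving


def system_edp(unit_share: float, unit_saving: float,
               delay_increase: float = 0.0) -> float:
    """System-level normalized EDP (1.0 = break-even)."""
    return (1 - system_energy_saving(unit_share, unit_saving)) * (1 + delay_increase)


# Flash disk: 10% of system energy, 5% of its energy removed, no delay
saving = system_energy_saving(0.10, 0.05)
print(f"system energy saved: {saving:.1%}")          # 0.5%
print(f"system EDP: {system_edp(0.10, 0.05):.3f}")   # 0.995
```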
Energy Efficiency: E versus D
[Figure: maximum delay tolerance (log scale, 0.0001 to 100) vs. power distribution of a FU w.r.t. the target system (0 to 1), one curve each for Esaved = 5%, 10%, 30%, 50%, 58%, 90%, and 99%]
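The slide gives the curves without an equation. One way to obtain a maximum delay tolerance, assuming break-even means a system-level EDP of 1, is shown below; $p$ (the unit's share of system energy) and $s$ (the unit's fractional energy saving) are symbols introduced here for illustration, and this is a reconstruction rather than the authors' stated derivation.

```latex
(1 - p\,s)\left(1 + \frac{\Delta D}{D}\right) \le 1
\quad\Longrightarrow\quad
\frac{\Delta D}{D} \le \frac{1}{1 - p\,s} - 1 = \frac{p\,s}{1 - p\,s}
```

With $p = 1$ and $s = 0.99$ this gives 99, consistent with the top of the plotted range (100), and with $p = 0.43$, $s = 0.58$ it gives about 0.33, matching the filter-cache tolerance quoted later for CPU = 100%.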
Example: Energy Efficiency: E vs. D
[Figure: maximum delay tolerance (log scale, 0.0001 to 100) vs. energy distribution w.r.t. the target system (0 to 1), one curve each for Esaved = 5%, 10%, 30%, 50%, 58%, 90%, and 99%; an example point is annotated "Tolerate ~25% performance loss"]
Using EDP: Pentium Pro
[Figure: maximum delay tolerance (0 to 0.3) vs. energy saved for a functional unit u (0 to 1), one curve per unit group, labeled with its share: IFU (22%); IEU (14%); ROB, DCU (11.1%); RS, FPU, Global Clock (7.9%); RAT, MOB (6.3%); BTB (4.7%)]
• Data source: [Brooks et al. 00]
• Assume the CPU is 100% of the system
• A 40% IFU power reduction can tolerate < 10% performance loss
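A sketch of the per-unit tolerance, assuming break-even means a system EDP of 1: a unit with share p that saves fraction s of its energy can absorb at most ps/(1 - ps) extra delay. Applied to the Pentium Pro breakdown with a 40% saving, it gives about 9.6% for the IFU, in line with the slide's "< 10%". The break-even condition and the dictionary layout are assumptions for illustration.

```python
# Shares of Pentium Pro energy as listed on the slide [Brooks et al. 00].
# Units listed together in the legend are assumed to each have the share shown.
PPRO_SHARES = {
    "IFU": 0.22,
    "IEU": 0.14,
    "ROB, DCU": 0.111,
    "RS, FPU, Global Clock": 0.079,
    "RAT, MOB": 0.063,
    "BTB": 0.047,
}


def max_delay_tolerance(unit_share: float, unit_saving: float) -> float:
    """Largest fractional delay increase that keeps system EDP <= 1,
    from (1 - p*s) * (1 + d) <= 1  =>  d <= p*s / (1 - p*s)."""
    saved = unit_share * unit_saving
    return saved / (1 - saved)


if __name__ == "__main__":
    # 40% energy reduction in each unit; CPU assumed to be 100% of the system
    for unit, share in PPRO_SHARES.items():
        print(f"{unit:<22} {max_delay_tolerance(share, 0.40):6.2%}")
```

Even the largest unit, the IFU, leaves less than 10% of delay headroom, and smaller units leave proportionally less.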
But CPU is not 100% of a System
[Figure: maximum delay tolerance (0 to 150), one curve each for CPU = 100%, 75%, 50%, and 25% of the system]
Case Study: Filter Cache
[Kin et al. 97, 00]
The Filter Cache design as reported:
• 58% energy savings in “L1 caches”
• 21% IPC degradation
ED product as shown:
• (1-0.58)(1+0.21) ≈ 0.51 << 1
• suggests this is a winning design
The question is: “which E?”
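To see why "which E?" matters, the sketch below recomputes the filter cache EDP with the 58% saving applied at three scopes: the L1 caches alone, the whole CPU (caches are 43% of StrongARM SA-110 energy, per the next slide), and a hypothetical system in which the CPU draws 50% of the power. In every case the 21% slowdown is applied to the whole system; the function name and the 50% scope are illustrative.

```python
def scoped_edp(scope_share: float, local_saving: float, delay_increase: float) -> float:
    """Normalized EDP (1.0 = break-even) when the local saving is diluted
    by the unit's share of whichever E is being measured."""
    return (1 - scope_share * local_saving) * (1 + delay_increase)


SAVING, SLOWDOWN = 0.58, 0.21    # Filter cache figures as reported [Kin et al.]
CACHE_SHARE_OF_CPU = 0.43        # SA-110: I-cache 27% + D-cache 16%

scopes = {
    "L1 caches only": 1.0,
    "CPU (caches = 43%)": CACHE_SHARE_OF_CPU,
    "System, CPU = 50%": CACHE_SHARE_OF_CPU * 0.50,   # hypothetical system share
}
for name, share in scopes.items():
    print(f"{name:<20} EDP = {scoped_edp(share, SAVING, SLOWDOWN):.3f}")
```

The verdict moves from a clear win (about 0.51) through a modest one (about 0.91) to a net loss (about 1.06) purely by changing which E is measured.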
Filter Cache: E Values
Esaved = 58% [Kin et al. 00]
[Figure: maximum delay tolerance (0 to 1.4) vs. energy distribution for a functional unit u w.r.t. the CPU only (0 to 1), one curve each for CPU = 100%, 70%, 50%, and 25%, with the FilterCache SA-110 point marked at I$+D$ = 43%]
• Use StrongARM 110 (SA-110)
  - 43% of energy consumed by caches
  - 27% in the I-cache
  - 16% in the D-cache
• CPU=X% stands for X% of overall power drawn by the CPU
• Delay tolerance:
  - 33%: CPU=100%
  - 21%: CPU=70%
  - 14%: CPU=50%
  - 6%: CPU=25%
• Filter cache (FC) slowdown: 21%
• Not energy-efficient if CPU < 70%
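The listed tolerances can be reproduced under one assumption: break-even is a system EDP of 1, so a unit with share p of system energy that saves fraction s tolerates at most ps/(1 - ps) extra delay, with p = 0.43 × (CPU share) here. Solving ps ≥ d/(1 + d) for the CPU share gives the break-even point. This is a reconstruction, not necessarily the authors' exact method; it yields about 33%, 21%, 14%, and 6.6% (listed as 6%), and a break-even CPU share of about 69.6%, consistent with "CPU < 70%".

```python
CACHE_SHARE_OF_CPU = 0.43   # SA-110: I$ + D$
FC_SAVING = 0.58            # [Kin et al. 00]
FC_SLOWDOWN = 0.21


def max_delay_tolerance(unit_share: float, unit_saving: float) -> float:
    """Largest fractional delay increase keeping system EDP <= 1."""
    saved = unit_share * unit_saving
    return saved / (1 - saved)


def break_even_cpu_share(unit_share_of_cpu: float, saving: float, slowdown: float) -> float:
    """Smallest CPU share of system energy for which the slowdown is tolerable:
    p*s/(1 - p*s) >= d  <=>  p*s >= d/(1 + d), with p = unit_share_of_cpu * cpu_share."""
    return slowdown / (1 + slowdown) / (unit_share_of_cpu * saving)


for cpu in (1.00, 0.70, 0.50, 0.25):
    tol = max_delay_tolerance(CACHE_SHARE_OF_CPU * cpu, FC_SAVING)
    verdict = "ok" if tol >= FC_SLOWDOWN else "not energy-efficient"
    print(f"CPU={cpu:.0%}: tolerance {tol:.1%} -> {verdict}")

share = break_even_cpu_share(CACHE_SHARE_OF_CPU, FC_SAVING, FC_SLOWDOWN)
print(f"break-even CPU share: {share:.1%}")
```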
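Rethinking EDP: Switching Activity vs. New Hardware
Ignore leakage and short-circuit power
Dynamic switching power is dominant
The “E” would then be as below, where
• a: average switching probability
• T: transistor count
• f: frequency
• $C_{g,avg}$: average gate capacitance per transistor

$P_{dyn} = a \cdot f \cdot C \cdot V_{dd}^2 \approx a \cdot f \cdot C_{g,avg} \cdot T \cdot V_{dd}^2$

For the new design to draw no more dynamic power than the reference (with $C_{g,avg}$ and $V_{dd}$ unchanged):

$P_{dyn,ref} \ge P_{dyn,new} \;\Longleftrightarrow\; a_{ref} \cdot f \cdot T \ge a_{new} \cdot (f + \Delta f) \cdot (T + \Delta T)$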
ED Variables
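The elegant ratio governing E…

$\frac{a_{ref}}{a_{new}} \ge \left(1 + \frac{\Delta f}{f}\right)\left(1 + \frac{\Delta T}{T}\right) = 1 + \frac{\Delta f}{f} + \frac{\Delta T}{T} + \frac{\Delta f \, \Delta T}{f \, T}$

To include the application delay, D (since E = P·D, the ED product scales as P·D²)…

$\frac{a_{ref}}{a_{new}} \ge \left(1 + \frac{\Delta f}{f}\right)\left(1 + \frac{\Delta T}{T}\right)\left(1 + \frac{\Delta D}{D}\right)^2$

This can be applied to macromodeling to determine the trade-off between transistor count and performance degradation.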
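The delay-aware ratio $a_{ref}/a_{new} \ge (1 + \Delta f/f)(1 + \Delta T/T)(1 + \Delta D/D)^2$ can be turned around to give a transistor budget. A minimal sketch, assuming that "X% switching reduced" means $a_{new} = (1 - X)\,a_{ref}$ and that capacitance per transistor and $V_{dd}$ stay fixed; the function name is hypothetical.

```python
def transistor_budget(switching_reduction: float,
                      freq_change: float = 0.0,
                      delay_change: float = 0.0) -> float:
    """Largest fractional transistor-count change (dT/T) that keeps ED no worse
    than the reference, from
        a_ref / a_new >= (1 + df/f) * (1 + dT/T) * (1 + dD/D)**2.
    Assumes a_new = (1 - switching_reduction) * a_ref."""
    activity_ratio = 1 / (1 - switching_reduction)   # a_ref / a_new
    return activity_ratio / ((1 + freq_change) * (1 + delay_change) ** 2) - 1


for red in (0.10, 0.25, 0.30):
    print(f"{red:.0%} switching reduced: "
          f"dT/T <= {transistor_budget(red):+.1%} (no delay change), "
          f"{transistor_budget(red, delay_change=0.10):+.1%} (10% more delay)")
```

For example, a 30% switching reduction leaves room for roughly 43% more transistors when neither frequency nor delay changes, but only about 18% when delay grows by 10%; with a 10% reduction and 10% more delay, the budget is negative, so transistors would have to be removed.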
Impact of Additional Transistor Count
[Figure: two panels, each with curves for 10%, 25%, and 30% switching reduction. LHS: % impact on D vs. % impact on T (frequency unchanged). RHS: % impact on f vs. % impact on T (delay held unchanged by frequency scaling).]
• Given a new average switching probability for the new architecture:
• LHS: trading transistors for delay, with no frequency scaling
• RHS: delay recovered by frequency scaling
Role of Leakage Energy
• The deep-submicron (DSM) era is upon us...
  [Figure annotation: more than 50% of power from leakage]
  Source: Intel Corp., Custom Integrated Circuits Conference 2002
• Ignoring leakage could reverse conclusions
• Early architecture evaluation
  - Leakage cannot be isolated from switching during evaluation
  - Additional HW can be harmful
Evaluate the Leakage when Adding HW in the Early Stage of Architecture Definition
• Example: dual-speed pipeline [Pyreddy and Tyson’01]
  - The idea appears plausible
  - Identify critical instructions [Tune et al. 01] [Seng et al. 01]
  - Two datapaths: fast and slow
  - Critical instructions → fast pipe; the remainder → slow pipe
  - The slow pipe consumes less E than the fast pipe, e.g., via a multi-voltage supply or lower frequency
[Figure: x% of instructions (non-critical) are steered to the slow datapath and 1-x% (critical) to the fast datapath]
• Let’s evaluate, assuming:
  - N instructions
  - x → slow datapath
  - (N-x) → fast datapath
• How does leakage impact efficiency?
• What x value achieves energy efficiency?
Dual Datapath Leakage Impact
[Figure: minimum fraction of instructions sent to the slow datapath (0 to 0.5) vs. static-to-total energy ratio (0 to 1), one curve each for r = 0.2, 0.4, 0.5, 0.6, 0.75, and 0.9; "Today" and "Soon to be" marked on the x-axis]
• “r” is the power ratio of the slow vs. the fast datapath
• A small r → impairs performance: the slow path becomes the critical path
• % of non-critical instructions needed for the slow datapath:
  - Today: ~17%
  - Soon: ~40%
Energy Savings v. # Inst of Slow Path
[Figure: two panels, r = 75% and r = 50%, each plotting % energy saved (20 to -60) vs. fraction of instructions sent to the non-critical datapath (0 to 0.4), one curve each for static-to-total = 1%, 20%, 33%, 50%, 67%, and 75%]
• X-axis: % of instructions sent to the non-critical datapath
• Y-axis: % energy saved
• If 30% of instructions are sent to the non-critical datapath:
  - Only ~5% energy is saved (savings on the datapath only) in DSM for r = 75%
  - More energy is consumed in DSM for r = 50%
• Does the extra complexity pay off?
Observations
It is insufficient to examine the ED product on a microscale; the entire system must be examined.
Adding HW complexity for low energy needs to be evaluated thoroughly.
If the target process is not DSM, the ED product can be examined via simplified ratio analysis.
For a DSM process:
• Leakage must be accounted for in both local and system E
• Additional HW could be overkill
Summary
Low-power architecture research:
• Metric → could be elusive
• Methodology →
  - More susceptible to reversed conclusions than performance research, if not meticulously applied
  - 2nd-order effect today → 1st-order effect tomorrow
• “Complexity” can be ineffective in energy reduction
Purposes of our study
• Provide analytical models and methodology for early evaluation
• No intention to invalidate prior results
  - WCED
  - WDDD
• Raise more discussion
• To get it right in education
That’s All Folks!