A Multi-Core Approach to Addressing
the Energy-Complexity Problem in
Microprocessors
Rakesh Kumar
Keith Farkas (HP Labs)
Norman Jouppi (HP Labs)
Partha Ranganathan (HP Labs)
Dean Tullsen (UCSD)
Motivation
• Power is an important issue for processors
• Power goes up with every successive generation (along with complexity)
  - Up to 150W for the Alpha 21464!
Past Techniques for Power Reduction
• Voltage/frequency scaling
  Limitation: limited by technology; also not possible below a certain feature size
• Architectural adaptation
  - shut off portions of the core when not needed
  - dynamic speculation control
  - reconfigurable caches
  Limitations:
  - very few choices to make
  - only dynamic power is saved
  - has associated overhead
Single-ISA Heterogeneous Multi-Core Architectures
Our Proposal
• Have multiple heterogeneous cores on the same die
• Match each workload (or workload phase) to the core that achieves the best efficiency according to some objective function
  (ensure that the new core has acceptable performance)
• Power down the unused cores
Motivation
• Hypotheses
  - Performance difference between cores varies based on workload or workload phase
  - Different cores have varying relative energy efficiencies for the same workload
• Implication: the “best” core can change dynamically
Goals of the Paper
• Validate the hypotheses
• Get an idea of the design space
• Get an idea of the potential benefits
Outline of Talk
• Motivation
• Past Work
• Our Work
  - Assumptions
  - Decisions
  - Methodology
• Results and Conclusions
• Summary and Future Work
Choice of Cores on the Die
• Five cores on the die:
  - In-order: QED R4700, EV4 (Alpha 21064), EV5 (Alpha 21164)
  - Out-of-order: EV6 (Alpha 21264), “EV8-”
• All cores assumed to be without L2 cache.
• “EV8-”: issue width is the same as EV8 (Alpha 21464)
  - Resources reduced to account for a single thread.
  - Core power dissipation: 100W
Properties of the Cores
Processor   Issue-width   I-Cache       D-Cache       Branch Pred.     MSHR
R4700       1             2-way 16KB    2-way 16KB    None             1
EV4         2             DM, 8KB       DM, 8KB       2KB/1-bit        2
EV5         4             DM, 8KB       DM, 8KB       2K-gshare        4
EV6         6 (OOO)       2-way 64KB    2-way 64KB    Hybrid 2-level   8
EV8-        8 (OOO)       4-way 64KB    4-way 64KB    Hybrid 2-level   16

Notice the gradation!
Properties of Cores (contd.)
• Assume all cores are implemented in 0.1um
  - Area and power scaled accordingly
• Clock speed?
  - All Alpha cores assumed to run at 2.1GHz (EV6 frequency at 0.10 micron)
  - R4700 assumed to run at 1GHz
Core Power and Area

• Peak power of core estimated from data sheets
  - minus that used by L2 caches and pins
  - then scaled for 0.1um process
• Area of core estimated from die photos
  - minus that of I/O pads, wires, L2 cache and control
  - then scaled for 0.1um process
• L2 cache area and power
  - estimated using CACTI
Core Power and Area (contd.)
Processor   Core power (W)   Core area (mm^2)
R4700       0.45             3
EV4         4.97             3
EV5         9.83             5
EV6         17.80            24
EV8-        92.88            260
EV8- consumes about 200 times as much power as R4700!
It is more than 85 times bigger too!
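
The two ratios quoted above can be checked directly against the power and area figures listed for the five cores. The snippet below is only a quick arithmetic sketch; the dictionary layout and variable names are illustrative choices, not part of the original slides.

```python
# Core power (W) and core area (mm^2) as listed on the slide.
CORES = {
    "R4700": {"power_w": 0.45, "area_mm2": 3},
    "EV4": {"power_w": 4.97, "area_mm2": 3},
    "EV5": {"power_w": 9.83, "area_mm2": 5},
    "EV6": {"power_w": 17.80, "area_mm2": 24},
    "EV8-": {"power_w": 92.88, "area_mm2": 260},
}

# Ratio of the largest core to the smallest core.
power_ratio = CORES["EV8-"]["power_w"] / CORES["R4700"]["power_w"]
area_ratio = CORES["EV8-"]["area_mm2"] / CORES["R4700"]["area_mm2"]

print(f"EV8- / R4700 power ratio: {power_ratio:.1f}")
print(f"EV8- / R4700 area ratio:  {area_ratio:.1f}")
```

Both ratios come out slightly above the rounded figures on the slide, so the "about 200 times" and "more than 85 times" claims are consistent with the listed data.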
Methodology
• Simulator used: SMTSIM
• ROB size, active-list size and load-store queue always kept big enough to ensure no conflicts.
• Benchmarks used: 14 chosen randomly out of the SPEC2000 suite
• Fast-forwarded for 2 billion instructions, simulated for 1 billion instructions.
• Data collected after every 1 million instructions.
Validating Hypotheses
• Performance difference between cores varies based on workload or workload phase (IPS)
• Different cores have varying relative energy efficiencies for the same workload (IPS/W)
Performance Variation with Time
[Figure: IPS versus committed instructions (in millions), one curve per core: EV8-, EV6, EV5, EV4, R4700]
Ah! Those clear, distinct phases!
Variation of Energy Efficiency with Time
[Figure: IPS/W versus committed instructions (in millions), one curve per core: R4700, EV4, EV5, EV6, EV8-]
Power dominates IPS/W numbers!
How does a composite objective function fare?
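
To make the two metrics concrete: IPS/W is instructions per joule, while IPS^2/W is the inverse of the energy-delay product per instruction, since (W/IPS) x (1/IPS) = W/IPS^2. The sketch below computes both. The function names are illustrative, the power figures are the peak core powers from the slides, and the IPS values are made-up placeholders rather than measured data.

```python
def ips_per_watt(ips: float, watts: float) -> float:
    """Energy efficiency: instructions per second per watt (instructions per joule)."""
    return ips / watts


def ips2_per_watt(ips: float, watts: float) -> float:
    """Inverse of energy-delay per instruction: energy (W/IPS) times delay (1/IPS)."""
    return ips * ips / watts


# Peak core power is from the slides; the IPS values are hypothetical.
samples = {
    "R4700": {"ips": 0.3, "watts": 0.45},
    "EV6": {"ips": 1.5, "watts": 17.80},
}

for name, s in samples.items():
    print(name,
          "IPS/W =", round(ips_per_watt(s["ips"], s["watts"]), 4),
          "IPS^2/W =", round(ips2_per_watt(s["ips"], s["watts"]), 4))
```

Because IPS enters squared, the composite metric gives performance more weight than IPS/W does, so the advantage of a very low-power core shrinks when it is much slower.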
Energy-delay Product Profile
[Figure: IPS^2/W versus committed instructions (in millions), one curve per core: R4700, EV4, EV5, EV6, EV8-]
So why not run on the “best” core at all points of time??
Choosing Dynamically the Core with Best
Energy-Delay Product (perf. loss<50%)
[Figure: IPS^2/W versus committed instructions (in millions), one curve per core (R4700, EV4, EV5, EV6, EV8-) plus the best-path curve]
Notice the regions where best-path is not
along the best energy-delay product!
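
A minimal sketch of how such a best path could be computed per sampling interval is shown below. It rests on several assumptions that the slides do not spell out: the performance constraint is measured against the fastest core in the same interval, switching cost is ignored, and the per-interval IPS values in `trace` are made up. The names `pick_core` and `best_path` are hypothetical.

```python
from typing import Dict, List, Tuple

# Peak core power in watts, from the slides.
POWER_W = {"R4700": 0.45, "EV4": 4.97, "EV5": 9.83, "EV6": 17.80, "EV8-": 92.88}


def pick_core(ips: Dict[str, float], max_loss: float = 0.5) -> str:
    """Pick the core with the best IPS^2/W among those within the allowed loss.

    Assumption: loss is relative to the fastest core in this interval.
    """
    fastest = max(ips.values())
    eligible = [c for c in ips if ips[c] > (1.0 - max_loss) * fastest]
    return max(eligible, key=lambda c: ips[c] ** 2 / POWER_W[c])


def best_path(trace: List[Dict[str, float]], max_loss: float = 0.5) -> Tuple[List[str], int]:
    """Return the chosen core per interval and the number of core switches."""
    path = [pick_core(sample, max_loss) for sample in trace]
    switches = sum(1 for a, b in zip(path, path[1:]) if a != b)
    return path, switches


# Hypothetical per-interval IPS for each core (not measured data).
trace = [
    {"R4700": 0.2, "EV4": 0.5, "EV5": 0.7, "EV6": 1.2, "EV8-": 1.3},
    {"R4700": 0.3, "EV4": 0.6, "EV5": 0.6, "EV6": 0.7, "EV8-": 0.8},
]

path, switches = best_path(trace)
print(path, switches)
```

In both sample intervals the R4700 has the highest unconstrained IPS^2/W, but it fails the performance constraint, so a larger core is chosen instead. This is the kind of situation in which the best path departs from the best energy-delay curve.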
Choosing Dynamically the Core with Best Energy-Delay
product (perf. loss<50%) [Summary of Results]
            Energy-Delay Savings (%)   Performance Degradation (%)
Maximum     97.9                       8.5
Minimum     0.1                        0.1
Mean        65.4                       18.2

Number of switchings: Maximum = 387 (art), Minimum = 0, Median = 1
Dissecting the Results
• More improvements possible:
  - locally-best decisions are not necessarily globally best
  - there was a performance constraint
  - the choice of cores is not the best for this objective function
  - cache configurations are not necessarily the best
• Even with the present improvements, beats voltage scaling handsomely (44.2% ED2 improvement)
Conclusion
• Enormous potential for power savings
• No leakage-power solution
• Allows considerable IP reuse
• Complexity-appropriate
  - every application is matched to the core of “appropriate” complexity
Tip of the iceberg? Current/Future Work
• Cores can be non-ordered
• Some cores can be multithreaded
• Throughput impact of the architecture
Questions?