Composite Cores:
Pushing Heterogeneity into a Core
Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das,
Faissal M. Sleiman, Ronald Dreslinski, Thomas F. Wenisch,
and Scott Mahlke
University of Michigan
Micro 45
May 8th 2012
High Performance Cores
High energy yields high performance
Low performance DOES NOT yield low energy
[Figure: performance and energy of a high-performance core over time]
High performance cores waste energy on low-performance phases
Core Energy Comparison
[Figure: energy breakdown of an Out-of-Order core vs. an In-Order core (Dally, IEEE Computer '08; Brooks, ISCA '00)]
• Out-of-Order hardware contains performance-enhancing hardware
– Not necessary for correctness
Do we always need the extra hardware?
Previous Solution:
Heterogeneous Multicore
• 2+ Cores
• Same ISA, different implementations
– High performance, but more energy
– Energy efficient, but less performance
• Share memory at a high level
– Share L2 cache (Kumar '04)
– Coherent L2 caches (ARM's big.LITTLE)
• Operating System (or programmer) maps the application to the smallest core that provides the needed performance
Current System Limitations
• Migration between cores incurs high overheads
– ~20K cycles (ARM's big.LITTLE)
• Sample-based schedulers
– Sample each core's performance, then decide whether to reassign the application
– Assume stable performance within a phase
• Phases must be long to be recognized and exploited
– 100M-500M instructions in length
Do finer-grained phases exist? Can we exploit them?
Performance Change in GCC
[Figure: Big-core and Little-core IPC (Instructions/Cycle) vs. instructions executed for GCC]
• Average IPC over a 1M instruction window (Quantum)
• Average IPC over 2K Quanta
Finer Quantum
[Figure: Big-core and Little-core IPC over instructions 160K-180K of GCC]
• 20K instruction window from GCC
What if we could map these to a Little Core?
• Average IPC over 100 instruction quanta
Our Approach: Composite Cores
• Hypothesis: Exploiting fine-grained phases allows
more opportunities to run on a Little core
• Problems
I. How to minimize switching overheads?
II. When to switch cores?
• Questions
I. How fine-grained should we go?
II. How much energy can we save?
Problem I: State Transfer
[Diagram: separate Big (O3) and Little (InOrder) cores, each with its own Fetch, Decode, Execute, iCache, iTLB, Branch Pred, dCache, dTLB, and register file]
• Microarchitectural state (caches, TLBs, branch predictor) totals 10s of KB per core; architectural register state is <1 KB
• State transfer costs can be very high: ~20K cycles (ARM's big.LITTLE)
• Limits switching to coarse granularity: 100M instructions (Kumar '04)
Creating a Composite Core
[Diagram: a Composite Core: the Big (O3) uEngine and the Little (InOrder) uEngine share the front end (iCache, iTLB, Fetch, Branch Pred), the dCache/dTLB, and a switching controller; each uEngine keeps its own decode, register file, and execute backend, and <1 KB of register state moves on a switch]
• Only one uEngine is active at a time
Hardware Sharing Overheads
• Big uEngine needs
– High fetch width
– Complex branch prediction
– Multiple outstanding data cache misses
• Little uEngine wants
– Low fetch width
– Simple branch prediction
– Single outstanding data cache miss
• Must build shared units for the Big uEngine
– Over-provisioned for the Little uEngine
– Little uEngine pays ~8% energy overhead to use the over-provisioned fetch + caches
• Assume clock gating for the inactive uEngine
– Still has static leakage energy
Problem II: When to Switch
• Goal: Maximize time on the Little uEngine, subject to a maximum allowed performance loss
– The maximum loss is user-configurable
• Traditional OS-based schedulers won’t work
– Decisions are needed too frequently
– Decisions must be made in hardware
• Traditional sampling-based approaches won’t work
– Performance not stable for long enough
– Frequent switching just to sample wastes cycles
What uEngine to Pick
[Figure: Big-core vs. Little-core IPC over instructions; quanta where the IPC difference is below ΔCPI_Threshold run on Little, the rest run on Big]
• ΔCPI_Threshold is hard to determine a priori; it depends on the application
– Let the user configure the target value
– Use a controller to learn the appropriate value over time
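For illustration with hypothetical numbers (not measured values): if the Big uEngine would achieve CPI_big = 0.50 on a quantum, the Little uEngine CPI_little = 0.68, and the current ΔCPI_Threshold = 0.25, then CPI_little - CPI_big = 0.18 ≤ 0.25, so the quantum runs on the Little uEngine; had the gap been 0.40, it would run on Big.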
Reactive Online Controller
[Diagram: models of the Big and Little uEngines feed a threshold controller and a switching controller, which pick the active uEngine each quantum]
• CPI_target: derived from the modeled Big-uEngine CPI and the user-selected allowed performance loss
• CPI_error = CPI_target - CPI_observed
• Threshold controller (PI): ΔCPI_Threshold = Kp * CPI_error + Ki * Σ CPI_error
• Switching controller: run on the Little uEngine when CPI_little ≤ CPI_big + ΔCPI_Threshold, otherwise run on Big
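A minimal C sketch of such a controller, assuming illustrative gains (KP, KI), a multiplicative slowdown target, and a once-per-quantum update; the real controller is hardware, and the paper's exact gains and target formulation may differ:

    /* Sketch of the reactive online controller (illustrative only).
     * KP, KI, and the multiplicative slowdown target are assumptions,
     * not the gains or formulation used in the paper. */
    #define KP 0.5   /* proportional gain (assumed) */
    #define KI 0.1   /* integral gain (assumed)     */

    typedef struct {
        double sum_error;   /* accumulated CPI error       */
        double threshold;   /* current delta-CPI threshold */
    } reactive_controller;

    /* Called once per quantum. Returns 1 to run the Little uEngine
     * for the next quantum, 0 to run the Big uEngine. */
    int controller_update(reactive_controller *c,
                          double cpi_big,          /* modeled/observed Big CPI    */
                          double cpi_little,       /* modeled/observed Little CPI */
                          double cpi_observed,     /* CPI actually achieved       */
                          double allowed_slowdown) /* e.g. 0.05 for a 5% loss     */
    {
        double cpi_target = cpi_big * (1.0 + allowed_slowdown);
        double error = cpi_target - cpi_observed;

        c->sum_error += error;                          /* integral term       */
        c->threshold = KP * error + KI * c->sum_error;  /* PI threshold update */

        /* Run on Little when its extra CPI fits under the threshold. */
        return (cpi_little - cpi_big) <= c->threshold;
    }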
uEngine Modeling
• Collect metrics of the active uEngine
– iL1, dL1 cache misses
– L2 cache misses
– Branch mispredicts
– ILP, MLP, CPI
• Use a linear model to estimate the inactive uEngine's performance
[Example: a loop, while(flag){ foo(); flag = bar(); }, with IPC 1.66 on the Little uEngine and 2.15 on the Big uEngine; only the active uEngine's IPC is observed, the other must be estimated]
Evaluation
• Big uEngine: 3-wide O3 @ 1.0 GHz, 12-stage pipeline, 128 ROB entries, 128-entry register file
• Little uEngine: 2-wide InOrder @ 1.0 GHz, 8-stage pipeline, 32-entry register file
• Memory System: 32 KB L1 i/d caches (1-cycle access), 1 MB L2 cache (15-cycle access), 1 GB main memory (80-cycle access)
• Controller: 5% allowed performance loss relative to the all-Big core
Little Engine Utilization
[Figure: Little-engine utilization vs. quantum length (100 to 10M instructions) for each benchmark; fine-grained quanta achieve far higher utilization than a traditional OS-based quantum]
• 3-wide O3 (Big) vs. 2-wide InOrder (Little)
• 5% performance loss relative to all-Big
More time on the Little engine with the same performance loss
Engine Switches
[Figure: switches per million instructions vs. quantum length (100 to 10M instructions) for each benchmark; switching rates range from ~1 switch per 306 instructions at fine quanta to ~1 switch per 2800 instructions at coarser quanta]
Need LOTS of switching to maximize utilization
Performance Relative to Big
[Figure: Composite Core performance relative to the all-Big core vs. quantum length for each benchmark, with the chosen quantum length of 1000 marked]
Switching overheads are negligible for quanta of ~1000 instructions or more
Fine-Grained vs. Coarse-Grained
• Little uEngine’s average power 8% higher
– Due to shared hardware structures
• Fine-Grained can map 41% more instructions to the
Little uEngine over Coarse-Grained.
• Results in overall 27% decrease in average power
over Coarse-Grained
Decision Techniques
1. Oracle
Knows both uEngines' performance for all quanta
2. Perfect Past
Knows both uEngines' past performance perfectly
3. Model
Knows only the active uEngine's past, models the inactive uEngine using default weights
All models target 95% of the all-Big uEngine's performance
Dynamic Instructions On Little
[Figure: fraction of dynamic instructions mapped to the Little uEngine for each benchmark (Oracle, Perfect Past, Model), plus the average]
• High utilization for memory-bound applications
• Issue width dominates for computation-bound applications
• Maps 25% of the dynamic instructions onto the Little uEngine
Energy Savings Relative to Big
[Figure: energy savings relative to the all-Big core for each benchmark (Oracle, Perfect Past, Model), plus the average]
• Includes the overhead of shared hardware structures
• 18% reduction in energy consumption
User-Configured Performance
[Figure: Little-uEngine utilization, overall performance, and energy savings for allowed performance losses of 1%, 5%, 10%, and 20%]
• 1% performance loss yields 4% energy savings
• 20% performance loss yields 44% energy savings
More Details in the Paper
• Estimated uEngine area overheads
• uEngine model accuracy
• Switching timing diagram
• Hardware sharing overheads analysis
Conclusions
Questions?
• Even high performance applications experience
fine-grained phases of low throughput
– Map those to a more efficient core
• Composite Cores allow
– Fine-grained migration between cores
– Low-overhead switching
• 18% energy savings by mapping 25% of the instructions to the Little uEngine, with a 5% performance loss
Back Up
The DVFS Question
• Lower voltage is useful when:
– L2 Miss (stalled on commit)
• Little uArch is useful when:
– Stalled on L2 Miss (stalled at issue)
– Frequent branch mispredicts (shorter pipeline)
– Dependent Computation
http://www.arm.com/files/downloads/big_LITTLE_Final_Final.pdf
Sharing Overheads
[Figure: average power relative to the Big core for the Big uEngine, a standalone Little core, and the Little uEngine]
Performance
[Figure: performance relative to the all-Big core for each benchmark (Oracle, Perfect Past, Model), plus the average]
• 5% performance loss
Model Accuracy
[Figure: histograms of the percent deviation from the actual CPI for the Model and an average-performance baseline, shown for Little -> Big and Big -> Little estimates]
Regression Coefficients
[Figure: relative magnitude of each regression coefficient (L2 misses, branch mispredicts, ILP, L2 hits, MLP, active uEngine cycles, constant) for the Little -> Big and Big -> Little models]
Different Than Kumar et al.
Kumar et al.:
• Coarse-grained switching
• OS managed
• Minimal shared state (L2s)
• Requires sampling
• 6-wide O3 vs. 8-wide O3 (has an InOrder core, but never uses it!)

Composite Cores:
• Fine-grained switching
• Hardware managed
• Maximizes shared state (L2, L1s, branch predictor, TLBs)
• On-the-fly prediction
• 3-wide O3 vs. 2-wide InOrder

Coarse-grained vs. fine-grained
Register File Transfer
[Diagram: architectural register numbers are looked up in the RAT, the corresponding physical register values are read, and the values are written into the other uEngine's register file; commit can update registers during the transfer]
3-stage transfer pipeline:
1. Map to physical register in the RAT
2. Read the physical register
3. Write to the new register file
If commit updates a register during the transfer, repeat
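A minimal C sketch of the transfer loop this pipeline implements; the array sizes and the commit_updated_any() hook are hypothetical stand-ins for hardware signals, not names from the paper:

    /* Illustrative software model of the register transfer above; sizes
     * and commit_updated_any() are hypothetical stand-ins for hardware. */
    #define NUM_ARCH_REGS 32
    #define NUM_PHYS_REGS 128

    extern int       rat[NUM_ARCH_REGS];        /* arch reg -> physical reg     */
    extern long long phys_reg[NUM_PHYS_REGS];   /* active uEngine physical regs */
    extern long long dest_reg[NUM_ARCH_REGS];   /* destination uEngine reg file */
    extern int       commit_updated_any(void);  /* did commit overwrite a reg?  */

    void transfer_registers(void)
    {
        do {
            for (int arch = 0; arch < NUM_ARCH_REGS; arch++) {
                int phys = rat[arch];            /* 1. map through the RAT         */
                long long val = phys_reg[phys];  /* 2. read the physical register  */
                dest_reg[arch] = val;            /* 3. write the new register file */
            }
            /* If commit updated any register while copying, repeat. */
        } while (commit_updated_any());
    }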
uEngine Model
• Linear model: y = a0 + Σ ai * xi
– a0: average uEngine performance
– xi: performance counter value
– ai: weight of performance counter
• Different weights for the Big and Little uEngine models
• Fixed vs. per-application weights?
– Default weights, fixed at design time
– Per-application weights
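A minimal C sketch of this linear model; the counter set and the weight values are illustrative placeholders, not the trained values from the paper:

    /* Sketch of the linear performance model above; the counter set and
     * weights are illustrative, not the trained values from the paper. */
    enum { L2_MISS, BRANCH_MISPREDICT, ILP, MLP, ACTIVE_CYCLES, NUM_METRICS };

    typedef struct {
        double a0;              /* constant: average uEngine performance */
        double a[NUM_METRICS];  /* one weight per performance counter    */
    } linear_model;             /* separate instance per direction (Big <-> Little) */

    /* Estimate the inactive uEngine's CPI for the last quantum from
     * the active uEngine's performance counters x[]. */
    double estimate_inactive_cpi(const linear_model *m, const double x[NUM_METRICS])
    {
        double y = m->a0;
        for (int i = 0; i < NUM_METRICS; i++)
            y += m->a[i] * x[i];   /* y = a0 + sum_i(ai * xi) */
        return y;
    }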