Using Criticality to
Attack Performance Bottlenecks
Brian Fields
UC-Berkeley
(Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)
Bottleneck Analysis

Bottleneck Analysis: determining the performance effect of an event on execution time
An event could be:
• an instruction's execution
• an instruction-window-full stall
• a branch mispredict
• a network request
• inter-processor communication
• etc.
Why is Bottleneck Analysis Important?
Bottleneck Analysis Applications

Run-time Optimization
• Resource arbitration
  • e.g., how to schedule memory accesses?
• Effective speculation
  • e.g., which branches to predicate?
• Dynamic reconfiguration
  • e.g., when to enable hyperthreading?
• Energy efficiency
  • e.g., when to throttle frequency?

Design Decisions
• Overcoming technology constraints
  • e.g., how to mitigate the effect of long wire latencies?

Programmer Performance Tuning
• Where have the cycles gone?
  • e.g., which cache misses should be prefetched?
Why is Bottleneck Analysis Hard?
Current state of the art

Event counts:

Exe. time = (CPU cycles + Mem. cycles) × Clock cycle time
where Mem. cycles = Number of cache misses × Miss penalty

[Figure: two overlapping misses, miss1 (100 cycles) and miss2 (100 cycles)]

2 misses but only 1 miss penalty
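The pitfall in one toy calculation (the 100-cycle penalty matches the picture above; the start times are invented): event counts charge each miss a full penalty, but overlapping misses expose only the union of their stall intervals.

```python
MISS_PENALTY = 100
miss_starts = [10, 12]          # two misses in flight almost together

naive = len(miss_starts) * MISS_PENALTY          # 2 misses * 100 = 200

# union of [start, start + penalty) stall intervals
intervals = sorted((s, s + MISS_PENALTY) for s in miss_starts)
exposed, covered_until = 0, 0
for lo, hi in intervals:
    lo = max(lo, covered_until)                  # skip already-covered cycles
    exposed += max(0, hi - lo)
    covered_until = max(covered_until, hi)

print(naive, exposed)   # 200 vs. 102: "2 misses but only ~1 miss penalty"
```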
Parallelism

Parallelism in systems complicates performance understanding
• Two parallel cache misses
• Two parallel threads
• A branch mispredict and a full-store-buffer stall occur in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing
Criticality Challenges
• Cost
  • How much speedup possible from optimizing an event?
• Slack
  • How much can an event be "slowed down" before increasing execution time?
• Interactions
  • When do multiple events need to be optimized simultaneously?
  • When do we have a choice?
• Exploit in Hardware
Our Approach: Criticality

Bottleneck Analysis: determining the performance effect of an event on execution time

Critical events affect execution time; non-critical events do not.
Defining criticality

Need performance sensitivity (a sketch of this test follows):
• slowing down a "critical" event should slow down the entire program
• speeding up a "non-critical" event should leave execution time unchanged
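A minimal sketch of the sensitivity test, assuming a simulator hook `run(extra_latency=...)` that returns total execution time; the hook is hypothetical, only the test logic is the point.

```python
def is_critical(event, run, delta=1):
    baseline = run(extra_latency={})
    slowed = run(extra_latency={event: +delta})
    return slowed > baseline        # slowing a critical event slows the program

def is_non_critical(event, run, delta=1):
    baseline = run(extra_latency={})
    sped_up = run(extra_latency={event: -delta})
    return sped_up == baseline      # speeding a non-critical event changes nothing
```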
Standard Waterfall Diagram

[Figure: waterfall diagram, cycles 1-15 across the top; each instruction of the loop (R5 = 0; R3 = 0; R1 = #array + R3; R6 = ld[R1]; R3 = R3 + 1; R5 = R6 + R5; cmp R6, 0; bf L1; R5 = R5 + 100; R0 = R5; Ret R0) shown as F (fetch), E (execute), and C (commit) events over time]
Annotated with Dependence Edges

[Figure: the same waterfall with dependence edges drawn between F/E/C events; the branch bf L1 is marked as mispredicted (MISP)]
Annotated with Dependence Edges

[Figure: the waterfall with dependence edges labeled by type: fetch BW (F→F), data dependences (E→E), ROB/window stalls (C→F), and branch misprediction (E→F)]
Edge Weights Added

[Figure: the same dependence edges annotated with latency weights (0-6 cycles per edge)]
Convert to Graph

[Figure: the waterfall collapsed into a pure dependence graph, with one F, E, and C node per instruction and latency-weighted edges; execution time equals the longest (critical) path through the graph]
Smaller graph instance

[Figure: a small F/E/C graph with a 10-cycle icache miss edge on the critical path and a short off-path chain]

Critical icache miss, but how costly?
Non-critical, but how much slack?
Add "hidden" constraints

[Figure: the same graph with implicit machine constraints (in-order fetch and commit edges) added, exposing secondary paths]

Critical icache miss, but how costly?
Non-critical, but how much slack?
Add "hidden" constraints

[Figure: with hidden constraints in place, the critical path is 13 cycles and the longest path avoiding the miss edge is 7 cycles]

Cost = 13 – 7 = 6 cycles
Slack = 13 – 7 = 6 cycles
Slack "sharing"

[Figure: two non-critical edges lying on the same off-critical path, each showing "Slack = 6 cycles"]

Can delay one edge by 6 cycles, but not both!
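Both questions reduce to longest-path arithmetic on the graph. A minimal critical-path-method sketch on a made-up DAG of (src, dst, latency) edges; the edge list is invented, though chosen to echo the 6-cycle answers above.

```python
from collections import defaultdict

EDGES = [("start", "m1", 10), ("start", "x", 3), ("x", "y", 1),
         ("m1", "end", 1), ("y", "end", 1)]
ORDER = ["start", "m1", "x", "y", "end"]          # a topological order

def earliest(edges):
    t = defaultdict(int)                          # earliest completion per node
    for u, v, w in sorted(edges, key=lambda e: ORDER.index(e[0])):
        t[v] = max(t[v], t[u] + w)
    return t

CP = earliest(EDGES)["end"]                       # critical-path length: 11

def cost(edge):
    """Cycles saved if the event's latency were zero."""
    without = [(u, v, 0) if (u, v, w) == edge else (u, v, w)
               for u, v, w in EDGES]
    return CP - earliest(without)["end"]

def slack(edge):
    """Cycles the event can stretch before the critical path grows."""
    u, v, w = edge
    early = earliest(EDGES)
    late = {n: CP for n in ORDER}                 # latest completion per node
    for a, b, wt in sorted(EDGES, key=lambda e: -ORDER.index(e[0])):
        late[a] = min(late[a], late[b] - wt)
    return late[v] - early[u] - w

print(cost(("start", "m1", 10)))   # 6 cycles: the critical miss's cost
print(slack(("x", "y", 1)))        # 6 cycles: a non-critical edge's slack
```

Note the sharing caveat from the slide: two edges on the same off-critical path can each report 6 cycles of slack even though only 6 cycles exist between them.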
Machine Imbalance

[Chart: percent of dynamic instructions (y-axis, 0-100) vs. number of cycles of slack (x-axis, 0-100), for perl; curves for global and apportioned slack]

~80% of instructions have at least 5 cycles of apportioned slack.
Criticality Challenges
• Cost
  • How much speedup possible from optimizing an event?
• Slack
  • How much can an event be "slowed down" before increasing execution time?
• Interactions
  • When do multiple events need to be optimized simultaneously?
  • When do we have a choice?
• Exploit in Hardware
Simple criticality not always enough

Sometimes events have nearly equal criticality:

[Figure: miss #1 (99 cycles) in parallel with miss #2 (100 cycles)]

Want to know:
• how critical is each event?
• how far from critical is each event?
Actually, even that is not enough.

Our solution: measure interactions

Two parallel cache misses:

[Figure: miss #1 (99 cycles) in parallel with miss #2 (100 cycles)]

Cost(miss #1) = 0
Cost(miss #2) = 1
Cost({miss #1, miss #2}) = 100

Aggregate cost (100) > sum of individual costs (0 + 1) ⇒ parallel interaction

icost = aggregate cost – sum of individual costs = 100 – 0 – 1 = 99
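A self-contained sketch of the icost arithmetic, using the latencies from the slides; the parallel case models execution time as the longer of the two misses, and the serial case (anticipating the example a few slides below) as a data-dependent chain overlapped by a 110-cycle ALU path.

```python
def exec_parallel(m1, m2):
    return max(m1, m2)              # parallel: only the longer miss shows

def cost(f, full_args, idealized_args):
    return f(*full_args) - f(*idealized_args)

c1 = cost(exec_parallel, (99, 100), (0, 100))     # 0
c2 = cost(exec_parallel, (99, 100), (99, 0))      # 1
both = cost(exec_parallel, (99, 100), (0, 0))     # 100
print(both - c1 - c2)                             # icost = 99 > 0: parallel

def exec_serial(m1, m2):
    return max(m1 + m2, 110)        # serial misses under a 110-cycle ALU chain

c1 = cost(exec_serial, (100, 100), (0, 100))      # 90
c2 = cost(exec_serial, (100, 100), (100, 0))      # 90
both = cost(exec_serial, (100, 100), (0, 0))      # 90
print(both - c1 - c2)                             # icost = -90 < 0: serial
```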
Interaction cost (icost)

icost = aggregate cost – sum of individual costs

1. Positive icost ⇒ parallel interaction
2. Zero icost ⇒ independent
3. Negative icost ⇒ ?

[Figure: timelines of miss #1 and miss #2, overlapping (parallel) and disjoint (independent)]
Negative icost

Two serial cache misses (data dependent):

[Figure: miss #1 (100 cycles) followed by miss #2 (100 cycles), overlapped by a 110-cycle ALU latency chain]

Cost(miss #1) = 90
Cost(miss #2) = 90
Cost({miss #1, miss #2}) = 90

icost = aggregate cost – sum of individual costs = 90 – 90 – 90 = –90

Negative icost ⇒ serial interaction
Interaction cost (icost)

icost = aggregate cost – sum of individual costs

1. Positive icost ⇒ parallel interaction
2. Zero icost ⇒ independent
3. Negative icost ⇒ serial interaction

[Figure: example timelines for each case; the serial-interaction examples include fetch BW, branch mispredict, load-replay, LSQ stall, trap, and ALU latency events]
Why care about serial interactions?

[Figure: miss #1 (100 cycles), miss #2 (100 cycles), overlapped by a 110-cycle ALU latency chain]

Reason #1: We are over-optimizing!
Prefetching miss #2 doesn't help if miss #1 is already prefetched (but the overhead still costs us).

Reason #2: We have a choice of what to optimize.
Prefetching miss #2 has the same effect as prefetching miss #1.
Icost Case Study: Deep pipelines

[Figure: pipeline latency loops, e.g. a 1-cycle issue loop and a 4-cycle data cache (DL1) access loop]

Looking for serial interactions!
Icost Breakdown (6-wide, 64-entry window)

              gcc      gzip     vortex
DL1           18.3 %   30.5 %   25.8 %
DL1+window    -4.2     -15.3    -24.5
DL1+bw        10.0      6.0      15.5
DL1+bmisp     -7.0     -3.4     -0.3
DL1+dmiss     -1.4     -0.4     -1.4
DL1+alu       -1.6     -8.2     -4.7
DL1+imiss      0.1      0.0      0.4
...           ...      ...      ...
Total         100.0    100.0    100.0
Icost Case Study: Deep pipelines

[Figure: dependence graph for instructions i1-i6 with 5-cycle DL1 access edges and a window edge; successive builds highlight the DL1 edges, the window edge, and the paths they share]
Criticality Challenges
• Cost
  • How much speedup possible from optimizing an event?
• Slack
  • How much can an event be "slowed down" before increasing execution time?
• Interactions
  • When do multiple events need to be optimized simultaneously?
  • When do we have a choice?
• Exploit in Hardware
Exploit in Hardware
• Criticality Analyzer
• Online, fast-feedback
• Limited to critical/not critical
• Replacement for Performance Counters
• Requires offline analysis
• Constructs entire graph
Only last-arriving edges can be critical

Observation: for E of R1 = R2 + R3, if the dependence into R2 is on the critical path, then the value of R2 arrived last.

[Figure: the R2 dependence arrives last; the R3 dependence is resolved early]

critical ⇒ arrives last
arrives last ⇏ critical
Determining last-arrive edges

Observe events within the machine (sketched in code below):

last_arrive[F] = EF if branch misp., CF if ROB stall, FF otherwise
last_arrive[E] = FE if data ready on fetch, EE otherwise (observe arrival order of operands)
last_arrive[C] = EC if commit pointer is delayed, CC otherwise

[Figure: rows of F/E/C nodes with each rule's last-arrive edge highlighted]
Last-arrive edges

The last-arrive rule: the critical path (CP) consists only of "last-arrive" edges.

[Figure: F/E/C graph with only last-arrive edges shown]
Prune the graph

Only need to put last-arrive edges in the graph; no other edges could be on the CP.

[Figure: pruned F/E/C graph, ending at the newest instruction]
…and we've found the critical path!

Backward-propagate along last-arrive edges from the newest instruction:

[Figure: pruned graph with the critical path traced backward]

⇒ Found the CP by only observing last-arrive edges
⇒ but still requires constructing the entire graph
Step 2. Reducing storage requirements

The CP is a "long" chain of last-arrive edges.
⇒ the longer a given chain of last-arrive edges, the more likely it is part of the CP

Algorithm: find sufficiently long last-arrive chains (a software sketch follows)
1. Plant token into a node n
2. Propagate forward, only along last-arrive edges
3. Check for token after several hundred cycles
4. If token alive, n is assumed critical
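A software sketch of the token heuristic, assuming the per-cycle stream of last-arrive edges (producer node → consumer node) is observable; everything else about the graph can be forgotten.

```python
def token_survives(edge_stream, planted_node, horizon=500):
    """edge_stream: iterable of per-cycle lists of (src, dst) last-arrive
    edges. The token lives only while some last-arrive edge extends the
    chain from a node that currently holds it."""
    holders = {planted_node}
    for cycle, edges in enumerate(edge_stream):
        holders = {dst for src, dst in edges if src in holders}
        if not holders:
            return False            # token died: assume non-critical
        if cycle >= horizon:
            break
    return True                     # chain grew long enough: assume critical
```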
Online Criticality Detection

Forward-propagate the token:

[Figure: a token is planted at a node and propagated along last-arrive edges toward the newest instruction; in one case the tokens "die", in the other the token survives]
Putting it all together

[Figure: Prediction path: the PC indexes a CP prediction table, answering "E-critical?". Training path: the token-passing analyzer watches last-arrive edges (producer → retired instruction) from the OOO core]
Results

• Performance (Speed)
  • Scheduling in clustered machines: 10% speedup
  • Selective value prediction
  • Deferred scheduling (Crowe et al.): 11% speedup
  • Heterogeneous cache (Rakvic et al.): 17% speedup
• Energy
  • Non-uniform machine, fast and slow pipelines: ~25% less energy
  • Instruction queue resizing (Sasanka et al.)
  • Multiple frequency scaling (Semeraro et al.): 19% less energy with 3% less performance
  • Selective pre-execution (Petric et al.)
Exploit in Hardware
• Criticality Analyzer
• Online, fast-feedback
• Limited to critical/not critical
• Replacement for Performance Counters
• Requires offline analysis
• Constructs entire graph
Profiling goal

Goal: construct the graph over many dynamic instructions
Constraint: can only sample sparsely
Analogy: "shotgun" genome sequencing

[Figure: a DNA strand cannot be read end to end; instead many short samples are read, and the strand is reconstructed by finding overlaps among the samples]
Mapping "shotgun" to our situation

[Figure: a long stream of dynamic instructions annotated with events: icache miss, dcache miss, branch misp., no event]

Profiler hardware requirements

[Figure: short samples of the stream are compared until their event patterns overlap ("Match!")]
Sources of error

Error Source                           Gcc      Parser   Twolf
Modeling execution as a graph          2.1 %    6.0 %    0.1 %
Errors in graph construction           5.3 %    1.5 %    1.6 %
Sampling only a few graph fragments    4.8 %    6.5 %    7.2 %
Total                                  12.2 %   14.0 %   8.9 %
Conclusion: Grand Challenges

• Cost: how much speedup possible from optimizing an event? (graph modeling)
• Slack: how much can an event be "slowed down" before increasing execution time? (token-passing analyzer)
• Interactions: when do multiple events need to be optimized simultaneously? When do we have a choice? (parallel and serial interactions; shotgun profiling)
Conclusion: Bottleneck Analysis Applications

Run-time Optimization
• Effective speculation: selective value prediction
• Resource arbitration: scheduling and steering in clustered processors
• Dynamic reconfiguration: resize instruction window
• Energy efficiency: non-uniform machines

Design Decisions
• Overcoming technology constraints: helped cope with high-latency dcache

Programmer Performance Tuning
• Where have the cycles gone? Measured cost of cache misses/branch mispredicts
Outline

Simple Criticality
• Definition (ISCA '01)
• Detection (ISCA '01)
• Application (ISCA '01–'02)

Advanced Criticality
• Interpretation (MICRO '03): what types of interactions are possible?
• Hardware Support (MICRO '03, TACO '04): enhancement to performance counters
Backup Slides
Related Work
Criticality Prior Work

Critical-Path Method, PERT charts
• Developed for the Navy's "Polaris" project, 1957
• Used as a project-management tool
• Simple critical-path and slack concepts

"Attribution" Heuristics
• Rosenblum et al.: SOSP 1995, and many others
• Mark the instruction at the head of the ROB as critical, etc.
• Empirically, limited accuracy
• Do not account for interactions between events
Related Work: Microprocessor Criticality

Latency tolerance analysis
• Srinivasan and Lebeck: MICRO 1998
Heuristics-driven criticality predictors
• Tune et al.: HPCA 2001
• Srinivasan et al.: ISCA 2001
"Local" slack detector
• Casmira and Grunwald: Kool Chips Workshop 2000
ProfileMe with pair-wise sampling
• Dean et al.: MICRO 1997
Unresolved Issues

Alternative I: Addressing Unresolved Issues

Modeling and Measurement
• What resources can we model effectively?
  • difficulty with mutual-exclusion-type resources (ALUs)
• Efficient algorithms
• Release tool for measuring cost/slack

Hardware
• Detailed design for criticality analyzer
• Shotgun profiler simplifications
  • gradual path from counters

Optimization
• Explore heuristics for exploiting interactions
Alternative II: Chip-Multiprocessors

Programmer Performance Tuning
Parallelizing applications
• What makes a good division into threads?
• How can we find them automatically, or at least help programmers to find them?

Design Decisions
• Should each core support out-of-order execution?
• Should SMT be supported?
• How many processors are useful?
• What is the effect of inter-processor latency?
Unresolved issues

Modeling and Measurement
• What resources can we model effectively?
  • difficulty with mutual-exclusion-type resources (ALUs)
  • in other words, unanticipated side effects
Original Execution:
1. ld r2, [Mem] (cache miss)
2. add r3 ← r2 + 1
3. ld r4, [Mem] (cache miss)
4. add r6 ← r4 + 1

Altered Execution (to compute cost of inst #3 cache miss):
1. ld r2, [Mem] (cache miss)
2. add r3 ← r2 + 1
3. ld r4, [Mem] (cache hit)
4. add r6 ← r4 + 1

[Figure: the original execution's graph contains an adder-contention edge; in the altered execution there is no contention, so the leftover contention edge ("should not be here") yields an incorrect critical path]
Unresolved issues
Modeling and Measurement (cont.)
• How should processor policies be modeled?
• relationship to icost definition
• Efficient algorithms for measuring icosts
• pairs of events, etc.
• Release tool for measuring cost/slack
Unresolved issues
Hardware
• Detailed design for criticality analyzer
• help to convince industry-types to build it
• Shotgun profiler simplifications
• gradual path from counters
Optimization
• Explore icost optimization heuristics
• icosts are difficult to interpret
Validation

Validation: can we trust our model?

Run two simulations:
• Reduce CP latencies → expect "big" speedup
• Reduce non-CP latencies → expect no speedup
[Chart: speedup per cycle reduced (y-axis, 0-1) for crafty, eon, gcc, gzip, perl, vortex, galgel, mesa; reducing CP latencies yields substantial speedup, reducing non-CP latencies yields almost none]
Validation

Two steps:
1. Increase latencies of instructions by their apportioned slack, for three apportioning strategies:
   1) latency+1
   2) 5 cycles to as many instructions as possible
   3) 12 cycles to as many loads as possible
2. Compare to baseline (no delays inserted)
Validation

[Chart: percent of execution time (0-120%) for ammp, art, gcc, gzip, mesa, parser, perl, vortex, average; bars for baseline, latency+1, five cycles, and 12 cycles to loads are nearly identical]

Worst case: inaccuracy of 0.6%
Slack Measurements

Three slack variants

Local slack: # cycles latency can be increased without delaying any subsequent instructions

Global slack: # cycles latency can be increased without delaying the last instruction in the program

Apportioned slack: distribute global slack among instructions using an apportioning strategy
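The variants fall out of the same earliest/latest-time arithmetic used for cost earlier; a toy computation on an invented DAG of (src, dst, latency) edges:

```python
from collections import defaultdict

EDGES = [("s", "a", 2), ("s", "b", 9), ("a", "c", 1), ("b", "c", 1),
         ("c", "end", 1)]
ORDER = ["s", "a", "b", "c", "end"]

early = defaultdict(int)                 # earliest completion per node
for u, v, w in sorted(EDGES, key=lambda e: ORDER.index(e[0])):
    early[v] = max(early[v], early[u] + w)
CP = early["end"]

late = {n: CP for n in ORDER}            # latest completion per node
for u, v, w in sorted(EDGES, key=lambda e: -ORDER.index(e[0])):
    late[u] = min(late[u], late[v] - w)

def local_slack(u, v, w):     # grow without delaying the consumer at all
    return early[v] - (early[u] + w)

def global_slack(u, v, w):    # grow without delaying the last instruction
    return late[v] - (early[u] + w)

# Edge s->a: no local slack (its consumer started as soon as it could),
# but 7 cycles of global slack; an apportioning strategy would then
# distribute those 7 cycles along the a->c chain.
print(local_slack("s", "a", 2), global_slack("s", "a", 2))   # 0 7
```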
Slack measurements

[Chart: percent of dynamic instructions (0-100) vs. number of cycles of slack (0-100), for perl; curves for local, global, and apportioned slack]

~21% of instructions have at least 5 cycles of local slack
~90% of instructions have at least 5 cycles of global slack
~80% of instructions have at least 5 cycles of apportioned slack

A large amount of exploitable slack exists.
Application-centered Slack Measurements

Load slack: can we tolerate a long-latency L1 hit?

design: wire-constrained machine, e.g. Grid
non-uniformity: multi-latency L1
apportioning strategy: apportion ALL slack to load instructions
Apportion all slack to loads

[Chart: percent of dynamic loads (0-100) vs. number of cycles of slack on load instructions (0-100), for perl, gcc, gzip]

Most loads can tolerate an L2 cache hit.
Multi-speed ALUs

Can we tolerate ALUs running at half frequency?

design: fast/slow ALUs
non-uniformity: multi-latency execution latency, bypass
apportioning strategy: give slack equal to original latency + 1
Latency+1 apportioning

[Chart: percent of dynamic instructions (0-100%) with latency+1 apportioned slack, for ammp, art, gcc, gzip, mesa, parser, perl, vortex, average]

Most instructions can tolerate doubling their latency.
Slack Locality and Prediction

Predicting slack

Two steps to PC-indexed, history-based prediction:
1. Measure slack of a dynamic instruction
2. Store in an array indexed by PC of the static instruction

Two requirements:
1. Locality of slack
2. Ability to measure slack of a dynamic instruction
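A minimal sketch of such a predictor; the table size and the last-value update policy are my placeholders.

```python
class SlackPredictor:
    """PC-indexed, history-based slack predictor (sizes illustrative)."""
    def __init__(self, entries=4096):
        assert entries & (entries - 1) == 0      # power of two for masking
        self.table = [0] * entries               # predicted slack, in cycles
        self.mask = entries - 1

    def train(self, pc, measured_slack):
        self.table[pc & self.mask] = measured_slack   # last-value history

    def predict(self, pc):
        return self.table[pc & self.mask]
```

Training entries come from measuring dynamic instructions, e.g. with the delay-and-observe detector described below.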
Locality of slack

[Chart: percent of (weighted) static instructions (0-100) for ammp, art, gcc, gzip, mesa, parser, perl, vortex, average; series: ideal, and predictions required to hold 90%, 95%, and 100% of the time]

A PC-indexed, history-based predictor can capture most of the available slack.
Slack Detector

"Delay and observe" is effective for a hardware predictor.

Problem #1: iterating repeatedly over the same dynamic instruction
Solution: only sample each dynamic instruction once

Problem #2: determining whether overall execution time increased
Solution: check whether the delay made the instruction critical
Slack Detector: delay and observe

Goal: determine whether an instruction has n cycles of slack (see the sketch below)
1. Delay the instruction by n cycles
2. Check if critical (via the critical-path analyzer)
   • No → instruction has n cycles of slack
   • Yes → instruction does not have n cycles of slack
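As a sketch, with both hardware hooks left as assumed callables:

```python
def has_n_cycles_of_slack(inst, n, delay_instruction, is_critical):
    """delay_instruction and is_critical stand in for the pipeline's
    delay mechanism and the token-passing criticality analyzer."""
    delay_instruction(inst, n)     # 1. delay the instruction by n cycles
    if is_critical(inst):          # 2. did the delay make it critical?
        return False               #    yes -> it lacked n cycles of slack
    return True                    #    no  -> at least n cycles of slack
```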
Slack Application

Fast/slow cluster microarchitecture (P ∝ F²) → save ~37% core power

[Figure: Fetch + Rename feeds a Steer stage, which feeds two 3-wide clusters (each with window, registers, and ALUs), one fast and one slow, sharing a data cache over a bypass bus]

Aggressive non-uniform design:
• Higher execution latencies
• Increased (cross-domain) bypass latency
• Decreased effective issue bandwidth
Picking bins for the slack predictor

Two decisions:
1. Steer to fast/slow cluster
2. Schedule with high/low priority within a cluster

Use an implicit slack predictor with four bins (sketched below):
1. Steer to fast cluster + schedule with high priority
2. Steer to fast cluster + schedule with low priority
3. Steer to slow cluster + schedule with high priority
4. Steer to slow cluster + schedule with low priority
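A sketch of the resulting policy; the bin thresholds are invented placeholders, since the slides leave them implicit.

```python
FAST, SLOW = "fast", "slow"
HIGH, LOW = "high", "low"

def steer_and_schedule(predicted_slack_cycles):
    # Thresholds (4 and 8 cycles) are illustrative, not from the talk.
    if predicted_slack_cycles == 0:
        return FAST, HIGH      # bin 1: critical work goes to the fast cluster
    if predicted_slack_cycles < 4:
        return FAST, LOW       # bin 2
    if predicted_slack_cycles < 8:
        return SLOW, HIGH      # bin 3
    return SLOW, LOW           # bin 4: plenty of slack to hide slow hardware
```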
Slack-based policies

[Chart: normalized IPC (0-1.1) for ammp, art, gcc, gzip, mesa, parser, perl, vortex, average; series: 2 fast high-power clusters, slack-based policy, reg-dep steering]

10% better performance from hiding non-uniformities.
CMP case study

Multithreaded Execution Case Study

Two questions:
• How should a program be divided into threads?
  • what makes a good cutpoint?
  • how can we find them automatically, or at least help programmers find them?
• What should a multiple-core design look like?
  • should each core support out-of-order execution?
  • should SMT be supported?
  • how many processors are useful?
  • what is the effect of inter-processor latency?
Parallelizing an application

Why parallelize a single-thread application?
• Legacy code, large code bases
• Difficult-to-parallelize apps
  • interpreted code, kernels of operating systems
• Desire to use better programming languages
  • Scheme, Java instead of C/C++
Parallelizing an application

Simplifying assumption:
• Program binary unchanged

Simplified problem statement:
• Given a program of length L, find the cutpoint that divides the program into the two threads giving maximum speedup

Must consider:
• data dependences, execution latencies, control dependences, proper load balancing
Parallelizing an application

Naive solution:
• try every possible cutpoint

Our solution:
• efficiently determine the effect of every possible cutpoint
• model execution before and after every cut (see the sketch below)
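A toy version of "model execution before and after every cut", considering only 1-wide in-order fetch per thread and data dependences; the real model also handles control dependences, synchronization, and load balancing.

```python
def modeled_time(latency, deps, cut):
    """latency[i]: execution latency of instruction i; deps[i]: indices of
    i's data-dependence producers; cut: index where thread 2 begins."""
    n = len(latency)
    finish = [0] * n
    next_slot = {0: 0, 1: 0}                     # per-thread fetch slot
    for i in range(n):
        tid = 0 if i < cut else 1
        ready = max((finish[p] for p in deps[i]), default=0)
        start = max(next_slot[tid], ready)
        next_slot[tid] = start + 1               # in-order, one per cycle
        finish[i] = start + latency[i]
    return max(finish)

def best_cutpoint(latency, deps):
    return min(range(1, len(latency)),
               key=lambda c: modeled_time(latency, deps, c))
```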
Solution

[Figure: the F/E/C dependence graph from the first instruction to the last instruction; a cut splits it into two fetch streams whose data dependences still connect them]
Parallelizing an application

Considerations:
• Synchronization overhead
  • add latency to EE edges
  • synchronization may involve turning EE edges into EF edges
• Scheduling of threads
  • additional CF edges

Challenges:
• State behavior (one thread to multiple processors)
  • caches, branch predictor
• Control behavior
  • limits where cutpoints can be made
Parallelizing an application

More general problem:
• Divide a program into N threads: NP-complete

Icost can help:
• icost(p1, p2) << 0 implies p1 and p2 are redundant
• action: move p1 and p2 further apart
Preliminary Results

Experimental Setup
• Simulator, based loosely on SimpleScalar
• Alpha SpecInt binaries

Procedure
1. Assume execution trace is known
2. Look at each 1k run of instructions
3. Test every possible cutpoint using 1k graphs
Dynamic Cutpoints

[Chart: cost distribution of dynamic cutpoints; cumulative percent of cutpoints (0-100) vs. execution time reduction in cycles (0-100), for bzip, crafty, eon, gap, gcc, parser, perl, twolf, vpr]

Only 20% of cuts yield benefits of > 20 cycles.
Usefulness of cost-based policy

[Chart: speedups (0-30%) from parallelizing programs for a two-processor system, for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vpr; series: fixed-interval vs. simple cost-based cutpoints]
Static Cutpoints

[Chart: cost distribution of static cutpoints; cumulative percent of instructions (0-100) vs. average per-dynamic-instance cost of static instructions (0-180 cycles), for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vpr]

Up to 60% of cuts yield benefits of > 20 cycles.
Future Avenues of Research
• Map cutpoints back to actual code
  • compare automatically generated cutpoints to human-generated ones
  • see what the performance gains are in a simulator, as opposed to just on the graph
• Look at the effect of synchronization operations
  • what additional overhead do they introduce?
• Deal with state and control problems
  • might need some technique outside of the graph
Multithreaded Execution Case Study

Two possible questions:
• How should a program be divided into threads?
• What should a multiple-core design look like?
  • should each core support out-of-order execution?
  • should SMT be supported?
  • how many processors are useful?
  • what is the effect of inter-processor latency?
CMP design study

What we can do:
• Try out many configurations quickly
  • dramatic changes in architecture are often only small changes in the graph
• Identify bottlenecks
  • especially interactions
CMP design study: Out-of-orderness

Is out-of-order execution necessary in a CMP?

Procedure:
• model execution with different configurations
  • adjust CD edges
• compute breakdowns
  • notice resources/events interacting with CD edges
CMP design study: Out-of-orderness

[Figure: the F/E/C dependence graph from the first instruction to the last, with the CD (window) edges that limit how far execution can run ahead]
CMP design study: Out-of-orderness

Results summary
• Single-core: performance tops out at 256 entries
• CMP: performance gains up through 1024 entries
  • some benchmarks see gains up to 16k entries

Why more beneficial?
• Use breakdowns to find out…
CMP design study: Out-of-orderness

Components of window cost
• cache misses holding up retirement?
• long strands of data dependencies?
• predictable control flow?

Icost breakdowns give quantitative and qualitative answers.
CMP design study: Out-of-orderness

[Figure: three stacked-bar scenarios for window cost (0-100%) against ALU and cache-miss events: independent (no interaction term), parallel interaction (positive interaction term), serial interaction (negative interaction term); in the last case the terms sum to an equal, zero total:]

cost(window) + icost(window, A) + icost(window, B) + icost(window, AB) = 0
Summary of Preliminary Results
icost(window, ALU operations) << 0
• primarily communication between processors
• window often stalled waiting for data
Implications
• larger window may be overkill
• need a cheap non-blocking solution
• e.g., continual-flow pipelines
CMP design study: SMT?
Benefits
• reduced thread start-up latency
• reduced communication costs
How we could help
• distribution of thread lengths
• breakdowns to understand effect of communication
CMP design study: How many processors?

[Figure: a program's execution alternating between processors #1 and #2 from the start]
CMP design study: Other Questions
What is the effect of inter-processor
communication latency?
• understand hidden vs. exposed communication
Allocating processors to programs
• methodology for O/S to better assign programs to
processors
Waterfall To Graph Story
[Figure sequence, retelling the waterfall-to-graph story: Standard Waterfall Diagram; Annotated with Dependence Edges; Edge Weights Added; Convert to Graph; Find Critical Path; Add Non-last-arriving Edges; Graph Alterations (branch misprediction made correct)]
Token-passing analyzer

Step 1. Observing

Observation: for E of R1 = R2 + R3, if the dependence into R2 is on the critical path, then the value of R2 arrived last (the R3 dependence was resolved early).

critical ⇒ arrives last
arrives last ⇏ critical
Determining last-arrive edges

Observe events within the machine:

last_arrive[F] = EF if branch misp., CF if ROB stall, FF otherwise
last_arrive[E] = FE if data ready on fetch, EE otherwise (observe arrival order of operands)
last_arrive[C] = EC if commit pointer is delayed, CC otherwise
Last-arrive edges: a CPU stethoscope

[Figure: the seven last-arrive edge types (CF, FE, EE, EF, EC, FF, CC) observed as signals coming out of the CPU]
Last-arrive edges

[Figure: the dependence graph with only last-arrive edges shown]

Remove latencies: no need for explicit weights.

The last-arrive rule: the CP consists only of "last-arrive" edges.
Prune the graph

Only need to put last-arrive edges in the graph; no other edges could be on the CP.

…and we've found the critical path!

Backward-propagate along last-arrive edges from the newest instruction:
⇒ Found the CP by only observing last-arrive edges
⇒ but still requires constructing the entire graph
Step 2. Efficient analysis

The CP is a "long" chain of last-arrive edges.
⇒ the longer a given chain of last-arrive edges, the more likely it is part of the CP

Algorithm: find sufficiently long last-arrive chains
1. Plant token into a node n
2. Propagate forward, only along last-arrive edges
3. Check for token after several hundred cycles
4. If token alive, n is assumed critical
Token-passing example

[Figure: 1. plant token; 2. propagate token along last-arrive edges across the ROB; 3. is the token alive? 4. yes → train as critical]

⇒ Found the CP without constructing the entire graph
Implementation: a small SRAM array

[Figure: token queue SRAM, written with the last-arrive producer node (inst id, type) and read at commit (inst id, type)]

Size of SRAM: 3 bits × ROB size, i.e. < 200 bytes.
Simply replicate for additional tokens.
Putting it all together

[Figure: Prediction path: the PC indexes a CP prediction table, answering "E-critical?". Training path: the token-passing analyzer watches last-arrive edges (producer → retired instruction) from the OOO core]
Scheduling and Steering

Case Study #1: Clustered architectures

[Figure: issue window with steering into clusters and scheduling within each cluster]

1. Current state of the art (Base)
2. Base + CP Scheduling
3. Base + CP Scheduling + CP Steering
Current State of the Art

Constant issue width, clock frequency

[Chart: normalized IPC (0.60-1.10) for crafty, eon, gcc, gzip, perl, vortex, galgel, mesa; series: unclustered, 2 clusters, 4 clusters]

⇒ Avg. clustering penalty for 4 clusters: 19%
CP Optimizations

Base + CP Scheduling

[Chart: normalized IPC for the same benchmarks and configurations]
CP Optimizations

Base + CP Scheduling + CP Steering

[Chart: normalized IPC for the same benchmarks and configurations]

⇒ Avg. clustering penalty reduced from 19% to 6%
Token-passing vs. Heuristics

Local vs. Global Analysis

Previous CP predictors: local, resource-sensitive predictions (HPCA '01, ISCA '01)

[Chart: speedup (-5% to 25%) for crafty, eon, gcc, gzip, perl, vortex, galgel, mesa; series: token-passing, oldest-uncommitted, oldest-unissued]

⇒ CP exploitation seems to require global analysis
Icost case study

Icost Case Study: Deep pipelines

Deep pipelines cause long latency loops:
• level-one (DL1) cache access, issue-wakeup, branch misprediction, …

But we can often mitigate them indirectly.

Assume a 4-cycle DL1 access; how to mitigate it? Increase cache ports? Increase window size? Increase fetch BW? Reduce cache misses?

Really, we are looking for serial interactions!
Icost Case Study: Deep pipelines

[Figure: dependence graph for instructions i1-i6 with 5-cycle DL1 access edges and a window edge; successive builds highlight the DL1 edges, the window edge, and the paths they share]
Icost Breakdown (6-wide, 64-entry window)

              gcc      gzip     vortex
DL1           18.3 %   30.5 %   25.8 %
DL1+window    -4.2     -15.3    -24.5
DL1+bw        10.0      6.0      15.5
DL1+bmisp     -7.0     -3.4     -0.3
DL1+dmiss     -1.4     -0.4     -1.4
DL1+alu       -1.6     -8.2     -4.7
DL1+imiss      0.1      0.0      0.4
...           ...      ...      ...
Total         100.0    100.0    100.0
Vortex Breakdowns, enlarging the window

              64       128      256
DL1           25.8     8.9      3.9
DL1+window    -24.5    -7.7     -2.6
DL1+bw        15.5     16.7     13.2
DL1+bmisp     -0.3     -0.6     -0.8
DL1+dmiss     -1.4     -2.1     -2.8
DL1+alu       -4.7     -2.5     -0.4
DL1+imiss     0.4      0.5      0.3
...           ...      ...      ...
Total         100.0    80.8     75.0
Shotgun Profiling
Profiling goal

Goal: construct the graph over many dynamic instructions
Constraint: can only sample sparsely

Analogy: "shotgun" genome sequencing

[Figure: a DNA strand is reconstructed from many short samples by finding overlaps among them; likewise, the profiler reconstructs the instruction stream from short samples annotated with events (icache miss, dcache miss, branch misp., no event), matching samples where their event patterns overlap ("Match!")]
Offline Profiler Algorithm

[Figure: detailed samples are matched against positions in a long sample]

Design issues:
• Choosing signature bits ⇒ identify the microexecution context (two samples match if their signatures are equal)
• Determining PCs (for better detailed-sample matching): from the long sample's start PC (12, 16, 20, 24, 56, 60, …), encode each branch's taken/not-taken bit in the signature
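A sketch of the matching step, with an invented signature encoding (one taken/not-taken bit per branch) and sample format; neither is the TACO '04 hardware's exact scheme.

```python
def signature(taken_bits):
    """Fold a branch taken/not-taken history into one integer signature."""
    sig = 0
    for bit in taken_bits:
        sig = (sig << 1) | bit
    return sig

def place_samples(long_sample_sigs, detailed_samples):
    """long_sample_sigs: signature observed at each position of the long
    sample; detailed_samples: (signature, payload) pairs. Returns a map
    from long-sample position to the detailed payload matched there."""
    placed = {}
    for sig, payload in detailed_samples:
        for pos, long_sig in enumerate(long_sample_sigs):
            if pos not in placed and sig == long_sig:   # contexts match
                placed[pos] = payload
                break
    return placed
```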
Sources of error

Error Source                           Gcc      Parser   Twolf
Building graph fragments               5.3 %    1.5 %    1.6 %
Sampling only a few graph fragments    4.8 %    6.5 %    7.2 %
Modeling execution as a graph          2.1 %    6.0 %    0.1 %
Total                                  12.2 %   14.0 %   8.9 %
Icost vs. Sensitivity Study

Compare Icost and Sensitivity Study

Corollary to the DL1 and ROB serial interaction: as load latency increases, the benefit from enlarging the ROB increases.

[Figure: the i1-i6 dependence graph with a 1-cycle DL1 access, shrinking the DL1 edges relative to the window edges]
Compare Icost and Sensitivity Study

[Chart: speedup (0-25%) vs. ROB size (64-256), one curve per DL1 latency (1-5 cycles); larger ROBs help more at higher DL1 latencies]
Compare Icost and Sensitivity Study

Sensitivity Study Advantages
• More information
  • e.g., concave or convex curves

Interaction Cost Advantages
• Easy (automatic) interpretation
  • sign and magnitude have well-defined meanings
• Concise communication
  • "DL1 and ROB interact serially"
Outline
• Definition (ISCA '01): what does it mean for an event to be critical?
• Detection (ISCA '01): how can we determine what events are critical?
• Interpretation (MICRO '04, TACO '04): what does it mean for two events to interact?
• Application (ISCA '01–'02, TACO '04): how can we exploit criticality in hardware?
Our solution: measure interactions

Two parallel cache misses (each 100 cycles):

[Figure: miss #1 (100 cycles) in parallel with miss #2 (100 cycles)]

Cost(miss #1) = 0
Cost(miss #2) = 0
Cost({miss #1, miss #2}) = 100

Aggregate cost (100) > sum of individual costs (0 + 0) ⇒ parallel interaction

icost = aggregate cost – sum of individual costs = 100 – 0 – 0 = 100
Criticality Analyzer (ISCA '01)

Goal
• Detect criticality of dynamic instructions

Procedure
1. Observe last-arriving edges (uses simple rules)
2. Propagate a token forward along last-arriving edges (at worst, a read-modify-write sequence to a small array)
3. If the token dies, non-critical; otherwise, critical
Slack Analyzer (ISCA '02)

Goal
• Detect likely slack of static instructions

Procedure
1. Delay the instruction by n cycles
2. Check if critical (via the critical-path analyzer)
   • No → instruction has n cycles of slack
   • Yes → instruction does not have n cycles of slack
Shotgun Profiling (TACO '04)

Goal
• Create representative graph fragments

Procedure
• Enhance ProfileMe counters with context
• Use context to piece together counter samples