ppt

advertisement
Diverge-Merge Processor (DMP)
Hyesoon Kim
José A. Joao
Onur Mutlu*
Yale N. Patt
HPS Research Group
University of Texas at Austin
*Microsoft Research
Outline





Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion
2
Predicated Execution
(normal branch code)
(predicated code)
A
if (cond) {
b = 0;
}
else {
b = 1;
}
T
N
C
B
A
B
C
D
D
A
B
C
p1 = (cond)
branch p1, TARGET
A
mov b, 1
jmp JOIN
B
C
TARGET:
mov b, 0
p1 = (cond)
(!p1) mov b, 1
(p1) mov b, 0
Convert control flow dependence to data dependence
3
Benefit of Predicated Execution
 Predicated Execution can be high performance
and energy-efficient.
Predicated Execution
Fetch Decode Rename Schedule RegisterRead Execute
A
F
E
A
D
B
C
C
F
D
E
C
A
B
F
E
C
D
B
A
A
D
B
C
E
F
C
A
B
D
E
F
B
A
D
C
E
F
A
E
F
C
D
B
D
E
B
C
A
F
C
D
A
B
E
B
C
A
D
A
B
C
A
B
A
B
Branch Prediction
D
Fetch Decode Rename Schedule RegisterRead Execute
F
E
Pipeline flush!!
F
4
E
D
B
A
Limitations/Problems of Predication
 ISA: Predicate registers and predicated instructions

Dynamic-Hammock Predication[Klauser’98] can solve this problem
but it is only applicable to simple hammocks.
 Adaptivity: Static predication is not adaptive to run-time
branch behavior.

Branch behavior changes based on input set, phase, control-flow path.

Wish Branches[Kim’05]
 Complex CFG: A large subset of control-flow graphs is not
converted to predicated code.

Function calls, loops, many instructions inside a region,
and complex CFGs

Hyperblock[Mahlke’92] cannot adapt to frequently-executed paths
dynamically.
5
Outline





Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion
6
Diverge-Merge Processor (DMP)
 DMP can dynamically predicate complex branches
(in addition to simple hammocks).
 The compiler identifies
 Diverge branches
 Control-flow merge (CFM) points
 The microarchitecture decides when and what to
predicate dynamically.
7
Dynamic Predication
Low-confidence
A
T
N
C
B
A
(mov R1, 1)
PR10 = 1
B
H
A
B
C
(mov R1, 0)
PR11 = 0
C
p1 = (cond)
branch p1, TARGET
mov R1, 1
jmp JOIN
TARGET:
mov R1, 0
H JOIN:
add R5, R1, 1
select-µops (φ-nodes in SSA)
PR12 = (cond) ? PR11 : PR10
H
Klauser et al.[PACT’98]: Dynamic-hammock predication
8
Diverge-Merge Processor
A
C
Diverge Branch
B
B
D
F
A
C
E
E
G
H
Insert select-µops
H
CFM point
Frequently executed path
Not frequently executed path
9
Diverge-Merge Processor
A
C
A
A
A
A
A
B
D
F
E
A
G
H
Frequently executed path
diverge-branch
Not frequently executed path
10
executed block
CFM point
Control-Flow Graphs
A
A
A
A
A
. . . . . . . . . . .
simple hammock nested hammock
DMP
Dynamic
Hammock
SW pred
Wish br.
Dual-path
11
frequently-hammock
loop
non-merging
Dual-path Execution vs. DMP
Dual-path
A
C
Low-confidence
B
D
E
F
path 1
path 2
DMP
path 1
path 2
C
B
C
B
D
D
CFM
CFM
E
F
E
F
D
E
F
12
Control-Flow Graphs
A
A
A
A
A
. . . . . . . . . . .
simple hammock nested hammock
frequently-hammock
DMP
Dynamichammock
SW pred
sometimes
Wish br.
sometimes
Dual-path
13
loop
non-merging
Distribution of Mispredicted Branches
66% of mispredicted branches can be dynamically
predicated in DMP.
non-merging
12
loop
10
frequently
nested
8
simple
6
4
2
14
88 li
ks
im
am
ea
n
m
go
ij p
eg
af
pa ty
rs
er
e
pe on
rlb
m
k
ga
vo p
rte
x
bz
ip
2
tw
ol
co f
m
p
cr
m
cf
0
gz
ip
vp
r
gc
c
Mispredictions per kilo instructions (MPKI)

Distribution of Mispredicted Branches
66% of mispredicted branches can be dynamically
predicated in DMP.
non-merging
12
loop
10
frequently
nested
8
simple
6
4
2
15
88 li
ks
im
am
ea
n
m
go
ij p
eg
af
t
pa y
rs
er
e
pe on
rlb
m
k
ga
vo p
rte
x
bz
ip
2
tw
ol
co f
m
p
cr
m
cf
0
gz
ip
vp
r
gc
c
Mispredictions per kilo instructions (MPKI)

Outline





Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion
16
Fetch Mechanism
Low Confidence
A
C
Diverge Branch
B
A
B
Round-robin fetch
D
F
E
C
E
G
H
CFM point
H
predicted path
17
Dynamic Predication
A
B
C
E
branch r0, C
add r1  r3, #1
add r1  r2, # -1
branch pr10,C p1 = pr10
add pr21  pr13, #1 (p1)
add pr31  pr12, # -1(!p1)
select-µop pr41 = p1? pr21 : pr31
H
add r4  r1, r3
add pr24  pr41, pr13
Arch.
Phy.
M
R1
PR11
PR41
PR21
1
R2
PR12
R3
PR13
RAT1
Arch.
Phy.
M
R1
PR11
PR31
1
R2
PR12
R3
PR13
RAT2
Forks RAT, RAS, and GHR
18
DMP Support
 ISA Support
 Mark diverge branches/CFM points.
 Compiler Support [CGO’07]
 The compiler identifies diverge branches and the
corresponding CFM points.
 Hardware Support





Confidence estimator
Fetch mechanisms
Load/store processing
Instruction retirement
Dynamic predication
19
Hardware Complexity Analysis
DMP
Dyn. Dual
ham. path
Multi
path
SW
Wish
pred. br.
Front-End

Confidence Estimator

Rename Support


Predicate Registers




Select-Uop Gen.




ST-LD Forwarding






Check Flush/no Flush





20









Outline





Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion
21
Simulation Methodology
 12 SPEC 2000 INT, 5 SPEC 95 INT
 Different input sets for profiling and evaluation
 Alpha ISA execution driven simulator
 Baseline processor configuration

64KB perceptron predictor/O-GEHL (paper)
 Minimum 30-cycle branch misprediction penalty
 8-wide, 512-entry instruction window
 2 KB 12-bit history enhanced JRS confidence
estimator
 Less aggressive processor (paper)
 Power model using Wattch
22
Different CFG types
simple
simple,nested
simple,nested,frequently
simple,nested,frequently,loop
50
40
30
20
10
23
hmean
m88ksim
li
ijpeg
go
comp
twolf
bzip2
vortex
gap
perlbmk
eon
parser
crafty
mcf
gcc
vpr
0
gzip
IPC improvement (%)
60
Performance Improvement
Performance Improvement (%)
25
DMP
dynamic-hammock
dual-path
multipath
limited software predication
wish branches
20
15
10
5
0
24
Energy Consumption
Reduction (%)
10
DMP
dynamic-hammock
dual-path
multipath
limited software predication
wish branches
5
0
-5
25
Outline





Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion
26
Conclusion

DMP introduces the concept of frequently-hammocks and it
dynamically predicates complex CFGs.

DMP can overcome the three major limitations of software
predication: ISA support, adaptivity, complex CFG.
 DMP reduces branch mispredictions energy efficiently

19% performance improvement, 9% less energy
 DMP divides the work between the compiler and the
microarchitecture:


The compiler analyzes the control-flow graphs.
The microarchitecture decides when and what to predicate
dynamically.
27
Thank You!!
Questions?
Handling Mispredictions
Diverge Br.
A
C
D
F
B
Misprediction!
E
G
H
A
A
CFM point
branch pr10,C p1 = pr10
B add pr21  pr13, #1 (p1)
(0)
B
C
C
E
E
(1)
add pr31  pr12, # -1(!p1)
add pr44  pr34, # -1(!p1)
(1)
select-µop
pr41 = p1? pr21 : pr31
D
add pr34  pr31, pr13
H
add pr24  pr41, pr13
D
H
predicted path
30
Flush
Loop Branches
 Exit Condition
 The loop branch is predicted to exit the loop.
 Benefit
 Reduced pipeline flushes: when the predicated
loop is iterated more times than it should be.
 Instructions in the extra iterations of the loop
become NOPs. Instructions after loop-exit can
still be executed.
 Negative Effects
 Increased execution delay of loop-carried
dependencies
 The overhead of select-µops
31
Loop Branches
 Predicate each loop iteration separately
A
B
A add r1 r1, #1
r0 = (cond1)
branch A, r0
A
A
A add r1 r1, #1
r0 = (cond1)
branch A, r0
A add r1 r1, #1
r0 = (cond1)
branch A, r0
branch A, pr10
add pr21  pr11, #1
pr20 = (cond1)
branch A, pr20
p1 = pr10
(p1)
(p1)
(p1) p2 = pr20
select-uop pr22 = p1 ? pr21: pr11
select-uop pr23 = p1? pr20: pr10
A add pr31  pr22, #1
pr30 = (cond1)
branch A, pr30
B add r7 r1, #10
(p2)
(p2)
(p2)
select-uop pr32 = p2 ? pr31: pr22
select-uop pr33 = p2 ? pr30: pr23
Loop br. is predicted to exit the loop
B add pr7  pr32, #10
32
Enhanced Mechanisms
 Multiple CFM points
 The hardware chooses one CFM point for
each instance of dynamic predication.
 Exit Optimizations
 Counter Policy: What if one path does not
reach the CFM point?
 Number of fetched instructions > Threshold
 Yield Policy: What if another low
confidence diverge branch is
encountered in dynamic predication
mode?
 Later low confidence branch is more likely
mispredicted.
33
A
B
G
H
C
D
E
F
Detailed DMP Support
 32 Predicate register ids
 Fetch mechanism




High performance I-Cache
Fetch two cache lines
Predict 3 branches
Fetch stops at the first taken branch
34
35
amean
m88ksim
li
ijpeg
go
comp
twolf
bzip2
vortex
gap
perlbmk
eon
parser
crafty
mcf
gcc
vpr
gzip
Merge (%)
Diverge and Merge?
100%
80%
60%
40%
20%
0%
36
amean
m88ksim
li
ijpeg
go
comp
twolf
bzip2
vortex
gap
perlbmk
eon
parser
crafty
mcf
gcc
vpr
gzip
Diverge branch actually mispredicted (%)
Useful Dynamic Predication Mode
30
25
20
15
10
5
0
Perfect Branch Prediction
4 wide-20 stages-128 window
0
delta (%)
8 wide-30 stages-512 window
-5
100
90
80
70
60
50
40
30
20
10
0
0
Energy
EDP
-10
-10
-20
-15
-30
-20
-40
-25
-50
-30
Performance
-35
-60
-40
-70
37
Maximum Power
Maximum Power Increament (%)
8
DMP
dynamic-hammock
dual-path
6
multipath
software predication
wish branches
4
2
0
38
Branch Predictor Effects
35
IPC delta (%)
30
25
perceptron-dynamic-hammock
perceptron-dual-path
perceptron-multipath
perceptron-DMP
OGEHL-base
OGEHL-dynamic-hammock
OGEHL-dual-path
OGEHL-multipath
OGEHL-DMP
20
15
10
5
0
39
Confidence Estimator Effects
35
IPC delta (%)
30
25
20
512B
2KB
4KB
16KB
perfect
15
10
5
0
dynamic-hammock
multipath
dual-path
40
DMP
ip
vp
r
gc
c
m
c
cr f
af
p a ty
rs
er
eo
pe n
rlb
m
k
ga
vo p
rte
x
bz
ip
2
tw
ol
co f
m
p
go
ijp
eg
m
88 li
ks
i
hm m
ea
n
gz
-5
gz
ip
vp
r
gc
c
m
c
cr f
af
pa ty
rs
er
e
pe on
rlb
m
k
ga
vo p
rte
bz x
ip
2
tw
ol
co f
m
p
go
ijp
eg
m
88 li
ks
am im
ea
n
Execution time normalized to the baseline
IPC delta (%)
Results in Less Aggressive Processors
35
30
dynamic-hammock
25
dual-path
multi-path
20
15
dmp
10
5
0
1.05
1
0.95
0.9
0.85
0.8
0.75
0.7
0.65
0.6
0.55
0.5
limited software predication
wish branches
dmp
41
42
hmean
120
m88ksim
227
li
ijpeg
229
go
comp
twolf
bzip2
vortex
gap
perlbmk
eon
parser
crafty
mcf
gcc
140
vpr
gzip
IPC delta (%)
DMP vs. Perfect Conditional BP
dmp
Perf BP
100
80
60
40
20
0
Enhanced DMP Mechanisms
60
single-cfm
multiple-cfm
mcfm-counter
mcfm-counter-yield
40
30
20
10
43
hmean
m88ksim
li
ijpeg
go
comp
twolf
bzip2
vortex
gap
perlbmk
eon
parser
crafty
mcf
gcc
-10
vpr
0
gzip
IPC delta (%)
50
IPC improvement (%)
40
35
30
25
20
15
10
5
0
-5
47%
mcf
44
hmean
m88ksim
li
ijpeg
go
comp
twolf
bzip2
vortex
gap
perlbmk
eon
parser
crafty
gcc
58%
vpr
gzip
DMP vs. Other Mechanisms
dynamic-hammock
dual-path
multipath
DMP
Comparisons with Predication/Wish Branches
non-predicated
0.8
0.6
0.4
limited software predication
wish branches
DMP
0.2
45
amean
m88ksim
li
ijpeg
go
comp
twolf
bzip2
vortex
gap
perlbmk
eon
parser
crafty
mcf
gcc
vpr
0
gzip
Normalized execution time
1

Average overhead:
 Dynamic-hammock: 4 instructions/entry
 Dual-path: 150 instructions/entry
 Multipath: 200 instructions/entry
 DMP: 20 instructions/entry
46
amean
m88ksim
li
ijpeg
go
comp
twolf
bzip2
vortex
gap
perlbmk
eon
parser
crafty
gcc
vpr
mcf
dynamic-hammock
dual-path
multipath
DMP
80
70
60
50
40
30
20
10
0
gzip
Reduction in pipeline flushes (%)
Reduction in Pipeline Flushes
Handling Nested Diverge Branches
Diverge Br.
 Basic DMP
A
C
 Ignore other low
confidence div.
branches
B
 Enhanced DMP
D
F
E
 Exit dynamic
predication mode
and re-enter from
the younger low
confidence branch
on predicted path
(Yield policy)
G
H
CFM point
47
Compiler Support [CGO’07]
 Compiler analyzes the control flow
and the profile data
 Step1: Identify diverge branch candidates and
CFM points.
 Step2: Select diverge branches based on
(1) the number of instructions between a branch
and the CFM point
(2) the probability of merging at the CFM point
 Heuristics or a cost-benefit model
 Step3: Mark the selected branches/CFM points.
48
Future Research
 Hardware Support
 Better confidence estimators
 Efficient hardware mechanism to detect
diverge branches and CFM points
 Increase hardware complexity but eliminate
the need for ISA/compiler support
 Compiler Support
 Better compiler algorithms [CGO’07]
49
Power Measurement Configurations
 100 nm Technology
 Baseline processor
 4GHZ
 Less aggressive processor
 1.5GHz
 CC3 clock-gating model in Wattch: unused
units dissipate only 10% of their maximum
power
 DMP: one more RAT/RAS/GHR, select-uop
generation module, additional fields in BTB,
predicate registers, CFM registers, loadstore forwarding, instruction retirement
50
350
dynamic-hammock
dual-path
multipath
dmp
300
250
200
150
100
50
51
amean
m88ksim
li
ijpeg
go
comp
twolf
bzip2
vortex
gap
perlbmk
eon
parser
crafty
mcf
gcc
vpr
0
gzip
Wrong-path instructions per entry
Fetched wrong-path instructions per entry into
dynamic-predication/dual-path mode
Fetched/Executed Instructions
5
0
baseline
less-aggressive
delta (%)
-5
-10
fetched instructions
executed instructions
max power
energy
energy-delay product
-15
-20
-25
52
ISA Support
 Example of Diverge Br and CFM
markers
OPCODE
TARGET
00 : normal branch
10 : diverge forward branch
11 : diverge loop branch
CFM = CFM rel address + PC
53
CFM rel address
Entering Dynamic Predication Mode
 Entry condition
 When a diverge branch has low confidence.
 The Front-end
 Stores the address of the CFM point to the CFM
register.
 Forks the RAS, GHR, and RAT.
 Allocates a predicate register.
 Fetch Mechanisms
 Round-robin fetch from two paths
 The processor follows the branch predictor until
it reaches the corresponding CFM point.
54
Exiting Dynamic Predication Mode
 Exit condition
 Both paths of a diverge branch have
reached the corresponding CFM point.
 A diverge branch is resolved.
 Select-µop mechanism
 Similar to φ-node in SSA
 Merges register values from two paths.
55
Multipath Execution
Low-confidence
A
path 2
C
path 3
B
path 4
D
E
F
G
H
H
H
H
H
I
I
I
I
I
C
D
path 1
B
E
F
Low-confidence
G
Instructions after the control-flow merge point are fetched multiple times.
Waste of resources and energy.
56
Modeling Software Predication
 Mark using a binary instrumentation
tool
 All simple and nested hammocks can
be predicated.
 All instruction between a branch and
the control-flow merge point are
fetched.
 All nested branches are predicated.
57
Download