Hardware-based Devirtualization
(VPC Prediction)
Hyesoon Kim, Jose A. Joao, Onur Mutlu,
Chang Joo Lee, Yale N. Patt, Robert Cohn
Outline
• Background and Motivation
• VPC (Virtual Program Counter) Prediction
• Results
• Conclusion
Direct vs. Indirect Branch
Conditional (Direct) Branch:
A: br.cond TARGET
Taken (T): next PC = TARG; not taken (N): next PC = A+1
Indirect Branch:
R1 = MEM[R2]
branch R1
Next PC = ? (one of several possible target addresses)
Indirect branches are costly for processor performance
• Much more difficult to predict than conditional (direct) branches: multiple target addresses
• Indirect branch predictor requires a large structure
Source Code Examples
• Switch structures
• Virtual function calls
Source code:
Shape *s = …;
a = s->area();           // virtual function call
Static assembly code:
R1 = MEM[R2]             // function address lookup
call R1                  // a register-indirect call
Indirect Branch Mispredictions
[Figure: branch mispredictions per kilo-instruction (MPKI, y-axis 0 to 16), split into direct and indirect branches, for a set of applications and their average (AVG).]
Data from Intel Core Duo processor
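Branch Predictor
[Figure: block diagram of a conventional branch predictor. Labelled components: Direction Predictor, GHR (..1001010), PC Addr (0x0800), Hash, Indirect Branch Predictor (TARG2), Branch Target Buffer (BTB) (T, TARG1), "Direct Branch?" and "Indirect Branch?" selectors, PC+1, and the predicted target.]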
VPC Prediction: Basic Idea
• Key idea: Treat an indirect branch as multiple "virtual" conditional branches
  • Only for prediction purposes
  • Use the conditional branch predictor
VPC Branch Predictor
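[Figure: block diagram of the VPC branch predictor, with no separate indirect branch predictor. Labelled components: Direction Predictor, GHR (..1001010), PC Addr (0x0800), Hash, virtual PCs VPC1 and VPC2, Branch Target Buffer entries TARG1 and TARG2, and the predicted target.]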
VPC Prediction: Basic Idea
• Benefits:
  • No separate complex structure
  • Works with any conditional branch prediction algorithm
  • Improving the conditional branch prediction algorithm also improves indirect branch prediction accuracy
Inspiration: Static Devirtualization
Source code:
Shape *s = …;
a = s->area();                    // an indirect call
Optimized source code:
Shape *s = …;
if (s->type == Rectangle)         // a conditional branch at PC: X
a = Rectangle::area();
else if (s->type == Circle)       // a conditional branch at PC: Y
a = Circle::area();
else
a = s->area();                    // an indirect call at PC: Z
Smalltalk ('84), Calder and Grunwald ('94), Garrett et al. ('94), Ishizaki et al. ('00)
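The guarded transformation on this slide can be sketched as compilable C++. This is only an illustration under assumptions: the `Kind` tag, the `Triangle` class, and the function name `area_devirtualized` are hypothetical and do not come from the slides, and a real compiler would pick the guarded types from profiling or class-hierarchy analysis.

```cpp
#include <cassert>

// Hypothetical hierarchy for illustration; the `type` tag mirrors the
// `s->type` test on the slide and is not part of any real library.
enum class Kind { Rectangle, Circle, Triangle };

struct Shape {
    explicit Shape(Kind k) : type(k) {}
    virtual ~Shape() = default;
    virtual double area() const = 0;
    Kind type;
};

struct Rectangle : Shape {
    Rectangle(double width, double height)
        : Shape(Kind::Rectangle), w(width), h(height) {}
    double area() const override { return w * h; }
    double w, h;
};

struct Circle : Shape {
    explicit Circle(double radius) : Shape(Kind::Circle), r(radius) {}
    double area() const override { return 3.141592653589793 * r * r; }
    double r;
};

struct Triangle : Shape {
    Triangle(double base, double height)
        : Shape(Kind::Triangle), b(base), h(height) {}
    double area() const override { return 0.5 * b * h; }
    double b, h;
};

// Guarded devirtualization: the two most likely receiver types are tested
// with conditional branches and called directly (a qualified call bypasses
// virtual dispatch); every other type falls back to the indirect call.
double area_devirtualized(const Shape* s) {
    assert(s != nullptr);
    if (s->type == Kind::Rectangle)
        return static_cast<const Rectangle*>(s)->Rectangle::area();
    else if (s->type == Kind::Circle)
        return static_cast<const Circle*>(s)->Circle::area();
    else
        return s->area();  // still an indirect call
}
```

The direct calls are what make further optimizations such as inlining possible; the final `else` shows why the indirect call cannot be removed entirely when other receiver types may appear.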
VPC Prediction
Source code:
Shape *s = …;
a = s->area();                    // an indirect call
Static assembly code:
R1 = MEM[R2]
call R1                           // PC: L
Dynamic virtual branches (for prediction purposes):
conditional jump TARGET1          // virtual PC = L
conditional jump TARGET2          // virtual PC = L XOR HASHVAL[1]
conditional jump TARGET3          // virtual PC = L XOR HASHVAL[2]
conditional jump TARGET4          // virtual PC = L XOR HASHVAL[3]
Virtual PC Address Generation
• Use original PC address and iteration counter value
[Figure: the iteration counter value indexes a hash value table (entries 0xabcd, 0x018a, 0x7a9c, 0x…); the selected hash value is combined with the PC to form the virtual PC.]
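A minimal sketch of the address generation, assuming a four-entry hash value table. The values 0xabcd, 0x018a, and 0x7a9c are the sample entries visible in the figure; their assignment to iterations 1 to 3, the zero entry for the first iteration (so that the first virtual PC equals the real PC), and the names `kHashVal`, `kMaxIter`, and `virtual_pc` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed table size and contents (see the lead-in): entry 0 is zero so the
// first virtual branch uses the original PC unchanged.
const std::size_t kMaxIter = 4;
const std::uint32_t kHashVal[kMaxIter] = {0x0000, 0xabcd, 0x018a, 0x7a9c};

// Virtual PC for a given iteration: the original PC XOR a per-iteration
// hash value selected by the iteration counter.
inline std::uint32_t virtual_pc(std::uint32_t pc, std::size_t iter) {
    assert(iter < kMaxIter);
    return pc ^ kHashVal[iter];
}
```

Because each iteration uses a different hash value, the virtual branches of one indirect branch map to different predictor and BTB entries.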
VPC Prediction Process-I
Real Instruction:
call R1                  // PC: L
Virtual Instructions:
cond. jump TARG1         // VPC: L
cond. jump TARG2         // VPC: VL2
cond. jump TARG3         // VPC: VL3
cond. jump TARG4         // VPC: VL4
Iteration 1: the Direction Predictor is accessed with PC = L and GHR = 1111 and predicts not taken; the BTB supplies TARG1. Move to the next iteration.
VPC Prediction Process-II
Real Instruction:
call R1                  // PC: L
Iteration 2: the Direction Predictor is accessed with VPC = VL2 and VGHR = 1110 and predicts not taken; the BTB supplies TARG2. Move to the next iteration.
VPC Prediction Process-III
Real Instruction:
call R1                  // PC: L
Iteration 3: the Direction Predictor is accessed with VPC = VL3 and VGHR = 1100 and predicts taken; the BTB supplies TARG3. Predicted Target = TARG3.
VPC Prediction Algorithm
• Access the conditional branch predictor and the BTB with VPCA and VGHR
• Compute VPCA and VGHR for the next iteration
  • VPCA = PC XOR HASHVAL[iter]
  • VGHR = VGHR << 1
• Predicted not taken: Move to the next iteration
• Predicted taken: Use the target in the BTB as the target of the indirect branch
• Give up and stall if
  • Iteration count > MAX_ITER or BTB miss
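The prediction steps above can be sketched as a loop. This is a toy model under stated assumptions, not the hardware design: the direction predictor is stood in for by a map from (VPCA, VGHR) to a taken bit, the BTB by a map from VPCA to target, the history is 4 bits wide to match the 1111 / 1110 / 1100 walk-through, and `ToyPredictor`, `history_key`, `kHashVal`, and `vpc_predict` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

const int kMaxIter = 4;                      // assumed MAX_ITER
const std::uint32_t kGhrMask = 0xF;          // assumed 4-bit history
const std::uint32_t kHashVal[kMaxIter] = {0x0000, 0xabcd, 0x018a, 0x7a9c};

// Combine a virtual PC and a virtual history into one lookup key.
inline std::uint64_t history_key(std::uint32_t vpca, std::uint32_t vghr) {
    return (static_cast<std::uint64_t>(vpca) << 32) | vghr;
}

struct ToyPredictor {
    std::unordered_map<std::uint64_t, bool> direction;      // toy direction predictor
    std::unordered_map<std::uint32_t, std::uint32_t> btb;   // VPCA -> target
};

// Returns true and sets `target` when some virtual branch is predicted taken;
// returns false when the predictor gives up (BTB miss or too many iterations).
bool vpc_predict(const ToyPredictor& p, std::uint32_t pc, std::uint32_t ghr,
                 std::uint32_t& target) {
    std::uint32_t vghr = ghr & kGhrMask;
    for (int iter = 0; iter < kMaxIter; ++iter) {
        const std::uint32_t vpca = pc ^ kHashVal[iter];
        const auto entry = p.btb.find(vpca);
        if (entry == p.btb.end()) return false;              // BTB miss: stall
        const auto dir = p.direction.find(history_key(vpca, vghr));
        const bool taken = dir != p.direction.end() && dir->second;
        if (taken) {                                         // predicted taken
            target = entry->second;
            return true;
        }
        vghr = (vghr << 1) & kGhrMask;                       // predicted not taken
    }
    return false;                                            // iteration limit reached
}
```

With BTB entries for the first three virtual PCs and a taken prediction only at the third (history 1100), the loop reproduces the three-step walk-through and returns the third target.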
VPC Training Algorithm
• An iterative process when an indirect branch is retired (not on the critical path)
• Update the conditional branch predictor
  • Virtual branch has the correct target: Taken
  • Virtual branch has a wrong target: Not-taken
• Update replacement policy bits of the correct target in the BTB
• Insert the correct target into the BTB
  • Conditional branch predictor: taken
  • Replace the least frequently used target (LFU)
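A rough sketch of the training steps, again as a toy model under assumptions: the direction predictor is a map from (VPCA, VGHR) to a taken bit, each BTB entry carries a use counter standing in for the replacement policy bits, an empty slot is treated as the least frequently used one, and the names `BtbEntry`, `ToyPredictor`, `history_key`, and `vpc_train` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

const int kMaxIter = 4;                      // assumed MAX_ITER
const std::uint32_t kGhrMask = 0xF;          // assumed 4-bit history
const std::uint32_t kHashVal[kMaxIter] = {0x0000, 0xabcd, 0x018a, 0x7a9c};

inline std::uint64_t history_key(std::uint32_t vpca, std::uint32_t vghr) {
    return (static_cast<std::uint64_t>(vpca) << 32) | vghr;
}

struct BtbEntry {
    std::uint32_t target;
    unsigned use_count;                      // stand-in for replacement bits
};

struct ToyPredictor {
    std::unordered_map<std::uint64_t, bool> direction;
    std::unordered_map<std::uint32_t, BtbEntry> btb;
};

// Called when the indirect branch at `pc` retires with a known correct target.
void vpc_train(ToyPredictor& p, std::uint32_t pc, std::uint32_t ghr,
               std::uint32_t correct_target) {
    std::uint32_t vghr = ghr & kGhrMask;
    std::uint32_t victim_vpca = 0, victim_vghr = 0;
    unsigned victim_count = 0;
    bool have_victim = false;
    for (int iter = 0; iter < kMaxIter; ++iter) {
        const std::uint32_t vpca = pc ^ kHashVal[iter];
        const auto entry = p.btb.find(vpca);
        if (entry != p.btb.end() && entry->second.target == correct_target) {
            p.direction[history_key(vpca, vghr)] = true;    // correct target: taken
            ++entry->second.use_count;                      // update replacement info
            return;
        }
        p.direction[history_key(vpca, vghr)] = false;       // wrong target: not-taken
        const unsigned count = (entry == p.btb.end()) ? 0u : entry->second.use_count;
        if (!have_victim || count < victim_count) {         // track the LFU slot
            have_victim = true;
            victim_vpca = vpca;
            victim_vghr = vghr;
            victim_count = count;
        }
        vghr = (vghr << 1) & kGhrMask;
    }
    // No virtual branch held the correct target: insert it over the LFU slot
    // and train that virtual branch as taken.
    p.btb[victim_vpca] = BtbEntry{correct_target, 1u};
    p.direction[history_key(victim_vpca, victim_vghr)] = true;
}
```

Training a second, different target for the same branch leaves the first entry in place and fills the next free virtual slot, so later predictions can distinguish the two targets by iteration.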
Hardware Cost and Complexity
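[Figure: block diagram of the hardware added for VPC prediction. Labelled components: GHR and VGHR feeding the Branch Direction Predictor (BP), which outputs Taken/Not Taken; PC, Hash Function, and Iteration counter producing the VPCA; BTB producing the Target Address; control signals "Predict?" and Direct/Indirect.]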
Simulation Methodology
• Pin-based x86 Simulator
• Processor configuration
  • 4K-entry BTB
  • 64KB perceptron conditional branch predictor
  • Minimum 30-cycle branch misprediction penalty
  • 8-wide, 512-entry instruction window
  • Less aggressive processor (in the paper)
  • Gshare, O-GEHL conditional branch predictors
• Indirect branch intensive benchmarks
  • 5 SPEC CPU2000, 5 SPEC CPU2006, 2 other C++
  • IBM server benchmarks (OLTP) (in the paper)
VPC MPKI
[Figure: indirect branch mispredictions (MPKI, y-axis 0 to 16) for baseline and VPC-ITER-2 through VPC-ITER-16, per benchmark (gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx) and AVG.]
VPC Performance
[Figure: % IPC improvement over baseline (y-axis 0 to 110) for VPC-ITER-2 through VPC-ITER-16, per benchmark (gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx) and AVG.]
Different Direction Predictors
[Figure: IPC improvement (%) (y-axis 0 to 35) with the gshare, perceptron, and O-GEHL direction predictors, annotated with conditional branch accuracies of 98%, 98.3%, and 99%.]
Improving conditional branch prediction accuracy also improves indirect branch prediction accuracy!
VPC vs. Static Devirtualization
• Advantages of static devirtualization
  • Enables other compiler optimizations (function inlining)
  • Can reduce the number of mispredictions
• Disadvantages/limitations of static devirtualization
  • Not all indirect branches can be statically devirtualized
  • Requires extensive static analysis/profiling
  • Lack of adaptivity to run-time input set and phase behavior
• VPC prediction can be used with statically devirtualized binaries
  • 10% improvement on top of static devirtualization
Conclusion
• VPC prediction dynamically converts indirect branches into multiple conditional branches and uses the existing conditional branch prediction hardware
• VPC prediction reduces the branch misprediction penalty without significant extra hardware storage
  • Baseline: 26% IPC improvement
  • O-GEHL: 31% IPC improvement
• VPC prediction can be an enabler, encouraging programmers to use object-oriented programming styles
Thank you!
Questions?
VPC vs. Cascaded IBP
[Figure: % IPC improvement over baseline (y-axis -20 to 120) for cascaded predictors of 704B, 1.4KB, 2.8KB, 5.5KB, 11KB, 22KB, 44KB, 88KB, and 176KB and for VPC-ITER-12, per benchmark (gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx) and AVG.]
VPC vs. Other Indirect BP
                      gcc       crafty    eon       perlbmk
Tagged Target Cache   12KB      1.5KB     >192KB    1.5KB
Cascaded              >176KB    2.8KB     >176KB    2.8KB
TTC: Chang et al. ('96)
Cascaded: Driesen and Hölzle ('98)
Iterative prediction
• Iterating does not hurt performance significantly
• Why? Most predictions complete within a few iterations
VPC Hit Iteration Counter
[Figure: stacked bars (0% to 100%) showing the iteration at which VPC prediction hits, in the buckets 1, 2, 3, 4, 5-6, 7-8, 9-10, and 11-12, per benchmark (gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx) and AVG.]
Can the BTB be pipelined?
• Yes
• The next VPC iteration can start before the outcome of the previous iteration is known.
• Consecutive VPC prediction iterations can simply be pipelined.
• If an iteration turns out not to be needed, its prediction is simply discarded.
Is 4K-entry BTB too large?
• Pentium 4 has a 4K-entry BTB
• IBM Z series (z990) has an 8K-entry BTB
• AMD Athlon and Hammer have 2K-entry BTBs
BTB Size Effects
[Figure: indirect branch mispredictions (MPKI, y-axis 0 to 8) for base and vpc, together with % IPC improvement over baseline (y-axis 0 to 40), for BTB sizes of 512, 1024, 2048, and 4096 entries.]
VPC Prediction Accuracy
[Figure: stacked bars (0% to 100%) breaking down VPC accesses (%) into correct, wrong target, and no target, per benchmark (gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx).]
Target Distribution
[Figure: stacked bars (0% to 100%) over the buckets 1, 2, 3, 4, 5, 6-10, 11-15, and 16+, per benchmark (gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx) and AVG.]
VPC vs. Tagged Target Cache
[Figure: % IPC improvement over baseline (y-axis 0 to 120) for tagged target caches of 384B, 768B, 1.5KB, 3KB, 6KB, 12KB, 24KB, 48KB, 96KB, and 192KB and for VPC-ITER-12, per benchmark (gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx) and AVG.]
VPC Prediction Delay Effects
[Figure: % IPC improvement over baseline (y-axis 0 to 120) at 1, 2, 4, 6, 8, and 10 br/cycle, per benchmark (gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx) and AVG.]
VPC with O-GEHL BP
[Figure: % IPC improvement over baseline (y-axis 0 to 120) for tagged target caches of 384B, 768B, 1.5KB, 3KB, 6KB, 12KB, 24KB, and 48KB and for VPC-ITER-12, per benchmark (gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx) and AVG.]
VPC with a Less Aggressive Processor
[Figure: % IPC improvement over baseline (y-axis 0 to 70) for tagged target caches of 384B, 768B, 1.5KB, 3KB, 6KB, 12KB, 24KB, and 48KB and for VPC-ITER-12, per benchmark (gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx) and AVG.]
Server Benchmarks
[Figure: indirect branch mispredictions (MPKI, y-axis 0 to 16) for baseline and VPC-ITER-2 through VPC-ITER-16 on OLTP1, OLTP2, OLTP3, and AVG.]
Server Benchmarks (VPC vs. TTC)
[Figure: indirect branch mispredictions (MPKI, y-axis 0 to 18) for baseline, tagged target caches of 384B, 768B, 1.5KB, 3KB, 6KB, 12KB, 24KB, and 48KB, and VPC-ITER-10 on OLTP1, OLTP2, OLTP3, and AVG.]
VPC Prediction vs. Compiler-Based
Devirtualization (With TTC)
[Figure: % IPC improvement over baseline (y-axis -10 to 90) for tagged target caches of 384B, 768B, 1.5KB, 3KB, 6KB, 12KB, 24KB, and 48KB and for VPC-ITER-12, per benchmark (gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray) and AVG.]
Conditional Br. Prediction Effects
[Figure: conditional branch MPKI (y-axis 0 to 4) for Base and VPC with the gshare, perceptron, and O-GEHL predictors.]
VPC prediction reduces conditional (direction) branch prediction accuracy, but only slightly.
Indirect Branch Mispredictions
[Figure: indirect branches as a percentage of all mispredicted branches (%) (y-axis 0 to 60), for a set of applications and their average (AVG).]
VPC Prediction with Static Devirtualization
[Figure: % IPC improvement over baseline (y-axis 0 to 60) for VPC-ITER-4 through VPC-ITER-12 on statically devirtualized binaries, per benchmark (gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray) and AVG.]
VPC prediction can be used with statically devirtualized binaries.
Not all indirect branches could be devirtualized.
VPC Training: Correct Prediction
Retirement: Real Instruction
call R1                  // PC: L
Known: Correctly predicted, predicted iter = 3

Iter   VPCA   VGHR     Direction BP   BTB
1      L      GHR      Not-taken      -
2      VL2    GHR<<1   Not-taken      -
3      VL3    GHR<<2   Taken          Update replacement
VPC Training: Misprediction
Retirement: Real Instruction
call R1                  // PC: L
Known: Mispredicted, correct target address

Iter   VPCA   VGHR     BTB Access         Train Direction BP   Train BTB
1      L      GHR      TARG != Correct    Not-taken            -
2      VL2    GHR<<1   TARG != Correct    Not-taken            -
3      VL3    GHR<<2   Target = Correct   Taken                Update replacement
VPC Training: Misprediction (No Target)
Retirement: Real Instruction
call R1                  // PC: L
Known: Mispredicted, correct target address

Iter   VPCA   VGHR     BTB Access        Train Direction BP   Train BTB
1      L      GHR      TARG != Correct   Not-taken            -
2      VL2    GHR<<1   TARG != Correct   Not-taken            -
3      VL3    GHR<<2   TARG != Correct   Not-taken            -
VPC Training: Misprediction
Retirement: Real Instruction
call R1                  // PC: L
Known: Mispredicted, correct target address
Replacement

Iter   VPCA   VGHR     BTB Access        Repl. counter   Train BP            Train BTB
1      L      GHR      TARG != Correct   3               Not-taken           -
2      VL2    GHR<<1   TARG != Correct   10              Not-taken / Taken   Insert / Nothing
3      VL3    GHR<<2   TARG != Correct   8               Not-taken           -
Does VPC need an extra BTB port?
• No
• A read from the BTB is only needed when a branch is mispredicted.
• 95% of branches are correctly predicted with VPC.
• The read is performed only when there is an available BTB port.