VPC Prediction: Reducing the Cost of Indirect Branches via Hardware-Based Dynamic Devirtualization
Hyesoon Kim, José A. Joao, Chang Joo Lee, Yale N. Patt (ECE Department, The University of Texas at Austin)
Onur Mutlu (Microsoft Research, Redmond, WA)
Robert Cohn (Intel Corporation, Hudson, MA)
{hyesoon, joao, cjlee, patt}@ece.utexas.edu, onur@microsoft.com, robert.s.cohn@intel.com
ABSTRACT
Indirect branches have become increasingly common in modular programs written in modern object-oriented languages and virtual-machine-based runtime systems. Unfortunately, the prediction accuracy of indirect branches has not improved as much as that of conditional branches. Furthermore, previously proposed indirect branch predictors usually require a significant amount of extra hardware storage and complexity, which makes them less attractive to implement.
This paper proposes a new technique for handling indirect branches, called Virtual Program Counter (VPC) prediction. The key idea of VPC prediction is to treat a single indirect branch as multiple "virtual" conditional branches in hardware for prediction purposes. Our technique predicts each of the virtual conditional branches using the existing conditional branch prediction hardware. Thus, no separate storage structure is required for predicting indirect branch targets.
Our evaluation shows that VPC prediction improves average performance by 26.7% compared to a commonly-used branch target buffer (BTB) based predictor on 12 indirect-branch-intensive applications. VPC prediction achieves the performance improvement provided by at least a 12KB (and usually a 192KB) tagged target cache predictor on half of the examined applications. We show that VPC prediction can be used with any existing conditional branch prediction mechanism and that the accuracy of VPC prediction improves when a more accurate conditional branch predictor is used.
Categories and Subject Descriptors:
C.1.0 [Processor Architectures]: General
C.1.1 [Single Data Stream Architectures]: RISC/CISC, VLIW architectures
C.5.3 [Microcomputers]: Microprocessors
D.3.3 [Language Constructs and Features]: Polymorphism
General Terms: Design, Performance.
Keywords: Indirect branch prediction, virtual functions, devirtualization.
1. INTRODUCTION
ISCA'07, June 9–13, 2007, San Diego, California, USA. Copyright 2007 ACM 978-1-59593-706-3/07/0006.

Object-oriented programs are becoming more common as more programs are written in modern high-level languages such as Java, C++, and C#. These languages support polymorphism [4], which
significantly eases the development and maintenance of large modular software projects. To support polymorphism, modern languages
include dynamically-dispatched function calls (i.e. virtual functions)
whose targets are not known until run-time because they depend on
the dynamic type of the object on which the function is called. Virtual function calls are usually implemented using indirect branch/call
instructions in the instruction set architecture. Previous research has
shown that modern object-oriented languages result in significantly
more indirect branches than traditional C and Fortran languages [3].
Unfortunately, an indirect branch instruction is more costly for processor performance because predicting an indirect branch is more difficult than predicting a conditional branch: it requires predicting a target address rather than a branch direction. Direction prediction is inherently simpler because it is a binary decision (taken or not-taken), whereas indirect target prediction is an N-ary decision, where N is the number of possible target addresses. Hence, with
the increased use of object-oriented languages, indirect branch target mispredictions have become an important performance limiter in
high-performance processors.1 Moreover, the lack of efficient architectural support to accurately predict indirect branches has resulted
in an increased performance difference between programs written in
object-oriented languages and programs written in traditional languages, thereby rendering the benefits of object-oriented languages
unusable by many software developers who are primarily concerned
with the performance of their code [43].
Figure 1 shows the number and fraction of indirect branch mispredictions per 1K retired instructions (MPKI) in different Windows
applications run on an Intel Core Duo T2500 processor [22] which
includes a specialized indirect branch predictor [15]. The data is collected with hardware performance counters using VTune [23]. In the
examined Windows applications, on average 28% of the branch mispredictions are due to indirect branches. In two programs, Virtutech
Simics [32] and Microsoft Excel 2003, almost half of the branch mispredictions are caused by indirect branches. These results show that
indirect branches cause a considerable fraction of all mispredictions
even in today’s relatively small-scale desktop applications.
Previously proposed indirect branch prediction techniques [6, 8,
27, 9, 10, 40] require large hardware resources to store the target
addresses of indirect branches. For example, a 1024-entry gshare
conditional branch predictor [33] requires only 2048 bits, but a 1024-entry gshare-like indirect branch predictor (tagged target cache [6])
needs at least 2048 bytes along with additional tag storage even if the
processor stores only the least significant 16 bits of an indirect branch
1 In the rest of this paper, an "indirect branch" refers to a non-return unconditional branch
instruction whose target is determined by reading a general purpose register or a memory
location. We do not consider return instructions since they are usually very easy to
predict using a hardware return address stack [26].
[Figure 1 charts: left, Mispredictions Per Kilo Instructions (MPKI) for conditional and indirect branches in each examined Windows application; right, percentage of all mispredicted branches that are indirect branches, with an average (AVG) bar.]
Figure 1: Indirect branch mispredictions in Windows applications: MPKI for conditional and indirect branches (left), percentage of
mispredictions due to indirect branches (right)
target address in each entry.2 As such a large hardware storage comes
with an expensive increase in power/energy consumption and complexity, most current high-performance processors do not dedicate
separate hardware but instead use the branch target buffer (BTB) to
predict indirect branches [1, 18, 28], which implicitly –and usually
inaccurately– assumes that the indirect branch will jump to the same
target address it jumped to in its previous execution [6, 27].3 To our
knowledge, only the Intel Pentium M implements specialized hardware
to help the prediction of indirect branches [15], demonstrating that
hardware designers are increasingly concerned with the performance
impact of indirect branches. However, as we showed in Figure 1,
even on a processor based on the Pentium M, indirect branch mispredictions are still relatively frequent.
In order to efficiently support polymorphism in object-oriented
languages without significantly increasing complexity in the processor front-end, a simple and low-cost –yet effective– indirect branch
predictor is necessary. A current high-performance processor already
employs a large and accurate conditional branch predictor. Our goal
is to use this existing conditional branch prediction hardware to also
predict indirect branches instead of building separate, costly indirect
branch prediction structures.
We propose a new indirect branch prediction algorithm: Virtual
Program Counter (VPC) prediction. A VPC predictor treats a single
indirect branch as multiple conditional branches (virtual branches) in
hardware for prediction purposes. Conceptually, each virtual branch
has its own unique target address, and the target address is stored in
the BTB with a unique “fake” PC, which we call virtual PC. The processor uses the outcome of the existing conditional branch predictor
to predict each virtual branch. The processor accesses the conditional
branch predictor and the BTB with the virtual PC address of a virtual
branch. If the prediction for the virtual branch is “taken,” the target
address provided by the BTB is predicted as the next fetch address
(i.e. the predicted target of the indirect branch). If the prediction of
the virtual branch is “not-taken,” the processor moves on to the next
virtual branch: it tries a conditional branch prediction again with a
different virtual PC. The processor repeats this process until the conditional branch predictor predicts a virtual branch as taken. VPC prediction stops if none of the virtual branches is predicted as taken after
a limited number of virtual branch predictions. After VPC prediction
stops, the processor can stall the front-end until the target address of
the indirect branch is resolved.
The VPC prediction algorithm is inspired by a compiler optimization, called receiver class prediction optimization (RCPO) [7, 19,
16, 2] or devirtualization [24]. This optimization statically converts
an indirect branch to multiple direct conditional branches (in other
words, it “devirtualizes” a virtual function call). Unfortunately, devirtualization requires extensive static program analysis or accurate
profiling, and it is applicable to only a subset of indirect branches
with a limited number of targets that can be determined statically [24].
Our proposed VPC prediction mechanism provides the benefit of using conditional branch predictors for indirect branches without requiring static analysis or profiling by the compiler. In other words,
VPC prediction dynamically devirtualizes an indirect branch without
compiler support. Unlike compiler-based devirtualization, VPC prediction can be applied to any indirect branch regardless of the number
and locations of its targets.
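As an illustration, the following C sketch contrasts an indirect call through a function-pointer table with its statically devirtualized form, which replaces the indirect branch with direct conditional branches. The shape types and area functions are invented for this example; only the transformation itself reflects RCPO/devirtualization as described above.

```c
#include <stdint.h>

/* Hypothetical dispatch types, in the style of the GAP jump table. */
enum Type { T_CIRCLE, T_SQUARE, NUM_TYPES };

static int area_circle(int r) { return 3 * r * r; }  /* pi rounded to 3 */
static int area_square(int s) { return s * s; }

typedef int (*AreaFn)(int);
static const AreaFn EvTab[NUM_TYPES] = { area_circle, area_square };

/* Original dispatch: one indirect call whose target depends on t. */
int area_indirect(enum Type t, int x) {
    return EvTab[t](x);                         /* INDIRECT branch */
}

/* Devirtualized dispatch: the common targets become direct conditional
 * branches; the indirect call remains only as a fallback. */
int area_devirtualized(enum Type t, int x) {
    if (t == T_CIRCLE) return area_circle(x);   /* direct call */
    if (t == T_SQUARE) return area_square(x);   /* direct call */
    return EvTab[t](x);                         /* fallback indirect call */
}
```

A static compiler can emit the direct branches only for targets it can prove or profile; VPC prediction achieves the same effect dynamically, in hardware, for any indirect branch.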
The contributions of VPC prediction are as follows:
2 With a 64-bit address space, a conventional indirect branch predictor likely requires even more hardware resources to store the target addresses [27].
3 Previous research has shown that the prediction accuracy of a BTB-based indirect branch predictor, which is essentially a last-target predictor, is low (about 50%) because the target addresses of many indirect branches alternate rather than stay stable for long periods of time [6, 27].
1. To our knowledge, VPC prediction is the first mechanism that
uses the existing conditional branch prediction hardware to
predict the targets of indirect branches, without requiring any
program transformation or compiler support.
2. VPC prediction can be applied using any current as well as future conditional branch prediction algorithm without requiring
changes to the conditional branch prediction algorithm. Since
VPC prediction transforms the problem of indirect branch
prediction into the prediction of multiple virtual conditional
branches, future improvements in conditional branch prediction accuracy can implicitly result in improving the accuracy
of indirect branch prediction.
3. Unlike previously proposed indirect branch prediction schemes,
VPC prediction does not require extra storage structures to
maintain the targets of indirect branches. Therefore, VPC prediction provides a low-cost indirect branch prediction scheme
that does not significantly complicate the front-end of the processor while providing the same performance as more complicated indirect branch predictors that require significant amounts
of storage.
2. BACKGROUND ON INDIRECT BRANCH PREDICTION
We first provide a brief background on how indirect branch predictors work to motivate the similarity between indirect and conditional
branch prediction. There are two types of indirect branch predictors: history-based and precomputation-based [37]. The technique
we introduce in this paper utilizes history information, so we focus
on history-based indirect branch predictors.
2.1 Why Does History-Based Indirect Branch Prediction Work?
History-based indirect branch predictors exploit information about
the control-flow followed by the executing program to differentiate
between the targets of an indirect branch. The insight is that the
control-flow path leading to an indirect branch is strongly correlated
with the target of the indirect branch [6]. This is very similar to modern conditional branch predictors, which operate on the observation
that the control-flow path leading to a branch is correlated with the
direction of the branch [11].
2.1.1 A Source Code Example
The example in Figure 2 shows an indirect branch from the GAP
program [12] to provide insight into why history-based prediction
of indirect branch targets works. GAP implements and interprets
a language that performs mathematical operations. One data structure in the GAP language is a list. When a mathematical function
is applied to a list element, the program first evaluates the value of
the index of the element in the list (line 13 in Figure 2). The index can be expressed in many different data types, and a different
function is called to evaluate the index value based on the data type
(line 10). For example, in expressions L(1), L(n), and L(n+1), the
index is of three different data types: T_INT, T_VAR, and T_SUM,
respectively. An indirect jump through a jump table (EvTab in lines
2, 3 and 10) determines which evaluation function is called based
on the data type of the index. Consider the mathematical function
L2(n) = L1(n) + L1(n+1). For each n, the program calculates three
index values: the Eval_VAR, Eval_SUM, and Eval_VAR functions are called, respectively, to evaluate the index values for L1(n), L1(n+1), and L2(n). The targets of the indirect branch that determines the evaluation function of the index are therefore the addresses of the two distinct evaluation functions (Eval_VAR and Eval_SUM). Hence, the target of this indirect branch
alternates between the two functions, making it unpredictable with a
BTB-based last-target predictor. In contrast, a predictor that uses
branch history information to predict the target easily distinguishes
between the two target addresses because the branch histories followed in the functions Eval_SUM and Eval_VAR are different; hence
the histories leading into the next instance of the indirect branch used
to evaluate the index of the element are different. Note that a combination of the regularity in the input index expressions and the code
structure allows the target address to be predictable using branch history information.
2.2 Previous Work
The indirect branch predictor described by Lee and Smith [30]
used the branch target buffer (BTB) to predict indirect branches. This
scheme predicts that the target of the current instance of the branch
will be the same as the target taken in the last execution of the branch.
This scheme does not work well for indirect branches that frequently
switch between different target addresses. Such indirect branches are
commonly used to implement virtual function calls that act on many
different objects and switch statements with many ‘case’ targets that
are exercised at run-time. Therefore, the BTB-based predictor has
low (about 50%) prediction accuracy [30, 6, 8, 27].
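The failure mode is easy to see in a few lines of C. This is a toy model of a single last-target entry, not the actual BTB hardware:

```c
#include <stdint.h>

/* Toy last-target predictor for one indirect branch: predict whatever
 * target the branch took last time, then remember the new target. */
static uint32_t last_target = 0;

/* Returns 1 if the prediction (the previously seen target) was correct. */
int last_target_predict(uint32_t actual_target) {
    int correct = (last_target == actual_target);
    last_target = actual_target;   /* training: store the latest target */
    return correct;
}
```

A branch that alternates between two targets, as in the GAP example of Section 2.1.1, is mispredicted on every execution, while a branch with a stable target is predicted perfectly after its first execution.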
1: // Set up the array of function pointers (i.e. jump table)
2: EvTab[T_INT] = Eval_INT; EvTab[T_VAR] = Eval_VAR;
3: EvTab[T_SUM] = Eval_SUM;
4: // ...
5:
6: // EVAL evaluates an expression by calling the function
7: // corresponding to the type of the expression
8: // using the EvTab[] array of function pointers
9:
10: #define EVAL(hd) ((*EvTab[TYPE(hd)])((hd))) /*INDIRECT*/
11:
12: TypHandle Eval_LISTELEMENT ( TypHandle hdSel ) {
13:     hdPos = EVAL( hdSel );
14:     // evaluate the index of the list element
15:     // check if index is valid and within bounds
16:     // if within bounds, access the list
17:     // at the given index and return the element
18: }
Figure 2: An indirect branch example from GAP
Chang et al. [6] first proposed to use branch history information to
distinguish between different target addresses accessed by the same
indirect branch. They proposed the “target cache,” which is similar
to a two-level gshare [33] conditional branch predictor. The target
cache is indexed using the XOR of the indirect branch PC and the
branch history register. Each cache entry contains a target address.
Each entry can be tagged, which reduces interference between different indirect branches. The tagged target cache significantly improves
indirect branch prediction accuracy compared to a BTB-based predictor. However, it also requires separate structures for predicting
indirect branches, increasing complexity in the processor front-end.
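The core of such a target cache can be sketched in a few lines of C. The entry count and field widths here are illustrative, not the sizes evaluated in the paper:

```c
#include <stdint.h>

#define TC_BITS 10
#define TC_SIZE (1u << TC_BITS)

/* Tagged target cache: each entry holds the branch PC as a tag and a
 * predicted target; the index mixes the PC with the branch history. */
struct tc_entry { uint32_t tag; uint32_t target; };
static struct tc_entry tcache[TC_SIZE];

static uint32_t tc_index(uint32_t pc, uint32_t ghr) {
    return (pc ^ ghr) & (TC_SIZE - 1);      /* gshare-style indexing */
}

/* Returns the predicted target on a tag hit, or 0 on a miss. */
uint32_t tc_lookup(uint32_t pc, uint32_t ghr) {
    const struct tc_entry *e = &tcache[tc_index(pc, ghr)];
    return (e->tag == pc) ? e->target : 0;
}

/* Training: store the resolved target under the current history. */
void tc_update(uint32_t pc, uint32_t ghr, uint32_t target) {
    struct tc_entry *e = &tcache[tc_index(pc, ghr)];
    e->tag = pc;
    e->target = target;
}
```

Because the history participates in the index, the same branch can hold a different target for each control-flow path leading to it, which is exactly what distinguishes the target cache from a last-target BTB entry.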
Later work on indirect branch prediction by Driesen and Hölzle
focused on improving the prediction accuracy by enhancing the indexing functions of two-level predictors [8] and by combining multiple indirect branch predictors using a cascaded predictor [9, 10].
The cascaded predictor is a hybrid of two or more target predictors.
A relatively simple first-stage predictor is used to predict easy-topredict (single-target) indirect branches, whereas a complex secondstage predictor is used to predict hard-to-predict indirect branches.
Driesen and Hölzle [10] concluded that a 3-stage cascaded predictor
performed the best for a particular set of C and C++ benchmarks.
Kalamatianos and Kaeli [27] proposed predicting indirect
branches via data compression. Their predictor uses prediction by
partial matching (PPM) with a set of Markov predictors of decreasing size indexed by the result of hashing a decreasing number of bits
from previous targets. The Markov predictor is a large set of tables
where each table entry contains a single target address and bookkeeping bits. The prediction comes from the highest order table that
can predict, similarly to a cascaded predictor. The PPM predictor
requires significant additional hardware complexity in the indexing
functions, Markov tables, and additional muxes used to select the
predicted target address.
Recently, Seznec and Michaud [40] proposed extending their
TAGE conditional branch predictor to also predict indirect branches.
However, their mechanism also requires additional storage space for
indirect target addresses and additional complexity to handle indirect
branches.
2.3 Our Motivation
All previously proposed indirect branch predictors (except the
BTB-based predictor) require separate hardware structures to store
target addresses in addition to the conditional branch prediction hardware. This not only requires significant die area (which translates
into extra energy/power consumption), but also increases the design
complexity of the processor front-end, which is already a complex
and cycle-critical part of the design.4 Moreover, many of the previously proposed indirect branch predictors are themselves complicated [9, 10, 27, 40], which further increases the overall complexity
and development time of the design. For these reasons, most current
processors do not implement separate structures to predict indirect
branch targets.
Our goal in this paper is to design a low-cost technique that accurately predicts indirect branch targets (by utilizing branch history
information to distinguish between the different target addresses of
a branch) without requiring separate complex structures for indirect
branch prediction. To this end, we propose Virtual Program Counter
(VPC) prediction.
3. VIRTUAL PROGRAM COUNTER (VPC) PREDICTION
3.1 Overview

A VPC predictor treats an indirect branch as a sequence of multiple virtual conditional branches.5 Each virtual branch is predicted in sequence using the existing conditional branch prediction hardware, which consists of the direction predictor and the BTB (Figure 3). If the virtual branch is predicted to be not-taken, the VPC predictor moves on to predict the next virtual branch in the sequence. If the virtual branch is predicted to be taken, VPC prediction uses the target associated with the virtual branch in the BTB as the next fetch address, completing the prediction of the indirect branch. Note that the virtual branches are visible only to the branch prediction hardware.

[Figure 3 diagram: a hash function combines the PC with an iteration counter to produce the VPCA, and the GHR is shifted to produce the VGHR; the VPCA and VGHR index the conditional branch predictor (BP), which produces a taken/not-taken prediction, and the VPCA indexes the BTB, which produces the target address.]
Figure 3: High-level conceptual overview of the VPC predictor

3.2 Prediction Algorithm

The detailed VPC prediction algorithm is shown in Algorithm 1. The key implementation issue in VPC prediction is how to distinguish between different virtual branches. Each virtual branch should access a different location in the direction predictor and the BTB (so that a separate direction and target prediction can be made for each branch). To accomplish this, the VPC predictor accesses the conditional branch prediction structures with a different virtual PC address (VPCA) and a virtual global history register (GHR) value (VGHR) for each virtual branch. VPCA values are distinct for different virtual branches. VGHR values provide the context (branch history) information associated with each virtual branch.

VPC prediction is an iterative process, where each iteration takes one cycle. In the first iteration (i.e. for the first virtual branch), VPCA is the same as the original PC address of the indirect branch and VGHR is the same as the GHR value when the indirect branch is fetched. If the virtual branch is predicted not-taken, the prediction algorithm moves to the next iteration (i.e. the next virtual branch) by updating the VPCA and VGHR. The VPCA value for an iteration (other than the first iteration) is computed by hashing the original PC value with a randomized constant value that is specific to the iteration. In other words, VPCA = PC ⊕ HASHVAL[iter], where HASHVAL is a hard-coded hardware table of randomized numbers that are different from one another. The VGHR is simply left-shifted by one bit at the end of each iteration to indicate that the last virtual branch was predicted not-taken.6

The iterative prediction process stops when a virtual branch is predicted to be taken. Otherwise, the prediction process iterates until either the number of iterations is greater than MAX_ITER or there is a BTB miss (!pred_target in Algorithm 1 denotes a BTB miss).7 If the prediction process stops without predicting a target, the processor stalls until the indirect branch is resolved.

Note that the value of MAX_ITER determines how many attempts will be made to predict an indirect branch. It also dictates how many different target addresses can be stored for an indirect branch at a given time in the BTB.

Algorithm 1 VPC prediction algorithm
  iter ← 1
  VPCA ← PC
  VGHR ← GHR
  done ← FALSE
  while (!done) do
    pred_target ← access_BTB(VPCA)
    pred_dir ← access_conditional_BP(VPCA, VGHR)
    if (pred_target and (pred_dir = TAKEN)) then
      next_PC ← pred_target
      done ← TRUE
    else if (!pred_target or (iter ≥ MAX_ITER)) then
      STALL ← TRUE
      done ← TRUE
    end if
    VPCA ← Hash(PC, iter)
    VGHR ← Left-Shift(VGHR)
    iter++
  end while

4 Using a separate predictor for indirect branch targets adds one more input to the mux that determines the predicted next fetch address. Increasing the delay of this mux can result in increased cycle time, adversely affecting the clock frequency.
5 We call the conditional branches "virtual" because they are not encoded in the program binary. Nor are they micro-operations, since they are visible only to the VPC predictor.

3.2.1 Prediction Example
Figure 4a,b shows an example virtual function call and the corresponding simplified assembly code with an indirect branch. Figure 4c
shows the virtual conditional branches corresponding to the indirect
branch. Even though the static assembly code has only one indirect branch, the VPC predictor treats the indirect branch as multiple
conditional branches that have different targets and VPCAs. Note
that the hardware does not actually generate multiple conditional
branches. The instructions in Figure 4c are shown to demonstrate
VPC prediction conceptually. We assume, for this example, that
MAX_ITER is 3, so there are only 3 virtual conditional branches.
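Under the same assumption (MAX_ITER = 3), the prediction loop of Algorithm 1 can be modeled in a few lines of C. The BTB and direction predictor below are toy table-driven stand-ins, and the HASHVAL constants are invented; only the control flow of the loop follows the algorithm:

```c
#include <stdint.h>

#define MAX_ITER 3

/* Invented hard-coded randomized constants (HASHVAL in the paper). */
static const uint32_t HASHVAL[MAX_ITER] = { 0x1A2B, 0x3C4D, 0x5E6F };

/* Toy BTB: a few (VPCA, target) pairs; target 0 encodes a miss. */
static uint32_t btb_vpca[8], btb_targ[8];
static int btb_n;

/* Toy direction predictor: the one (VPCA, VGHR) pair predicted taken. */
static uint32_t bp_taken_vpca, bp_taken_vghr;

static uint32_t access_BTB(uint32_t vpca) {
    for (int i = 0; i < btb_n; i++)
        if (btb_vpca[i] == vpca) return btb_targ[i];
    return 0;
}
static int access_conditional_BP(uint32_t vpca, uint32_t vghr) {
    return vpca == bp_taken_vpca && vghr == bp_taken_vghr;
}

/* Algorithm 1: returns the predicted target, or 0 to signal a stall. */
uint32_t vpc_predict(uint32_t pc, uint32_t ghr) {
    uint32_t vpca = pc, vghr = ghr;
    for (int iter = 1; iter <= MAX_ITER; iter++) {
        uint32_t target = access_BTB(vpca);
        if (!target) return 0;          /* BTB miss: stop and stall */
        if (access_conditional_BP(vpca, vghr))
            return target;              /* virtual branch predicted taken */
        vpca = pc ^ HASHVAL[iter - 1];  /* VPCA of the next virtual branch */
        vghr <<= 1;                     /* shift in a not-taken outcome */
    }
    return 0;                           /* all not-taken: stall */
}
```

Which of Table 1's five cases occurs depends on which virtual branch, if any, the direction predictor marks taken, and on where the first BTB miss falls.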
Table 1 demonstrates the five possible cases when the indirect branch in Figure 4 is predicted using VPC prediction, by showing the inputs and outputs of the VPC predictor in each iteration. We assume that the GHR is 1111 when the indirect branch is fetched. Cases 1, 2, and 3 correspond to cases where, respectively, the first, second, or third virtual branch is predicted taken by the conditional branch direction predictor (BP). As VPC prediction iterates, VPCA and VGHR values are updated as shown in the table. Case 4 corresponds to the case where all three of the virtual branches are predicted not-taken and therefore the outcome of the VPC predictor is a stall. Case 5 corresponds to a BTB miss for the second virtual branch and thus also results in a stall.

(a) Source code:
  a = s->area ();
(b) Corresponding assembly code with an indirect branch:
  R1 = MEM[R2]
  INDIRECT_CALL R1 // PC: L
(c) Virtual conditional branches (for prediction purposes):
  iter1: cond. br TARG1 // VPCA: L
  iter2: cond. br TARG2 // VPCA: VL2 = L XOR HASHVAL[1]
  iter3: cond. br TARG3 // VPCA: VL3 = L XOR HASHVAL[2]
Figure 4: VPC prediction example: source, assembly, and the corresponding virtual branches

Table 1: Possible VPC predictor states and outcomes when the branch in Figure 4b is predicted

Case | 1st iteration (VPCA VGHR -> BTB BP) | 2nd iteration        | 3rd iteration        | Prediction
 1   | L 1111 -> TARG1 T                   | -                    | -                    | TARG1
 2   | L 1111 -> TARG1 NT                  | VL2 1110 -> TARG2 T  | -                    | TARG2
 3   | L 1111 -> TARG1 NT                  | VL2 1110 -> TARG2 NT | VL3 1100 -> TARG3 T  | TARG3
 4   | L 1111 -> TARG1 NT                  | VL2 1110 -> TARG2 NT | VL3 1100 -> TARG3 NT | stall
 5   | L 1111 -> TARG1 NT                  | VL2 1110 -> MISS -   | -                    | stall

6 Note that VPC addresses (VPCAs) can conflict with real PC addresses in the program, thereby increasing aliasing and contention in the BTB and the direction prediction structures. The processor does not require any special action when aliasing happens. To reduce such aliasing, the processor designer should (1) provide a good randomizing hashing function and values to generate VPCAs and (2) co-design the VPC prediction scheme and the conditional branch prediction structures carefully to minimize the effects of aliasing. Conventional techniques proposed to reduce aliasing in conditional branch predictors [33, 5] can also be used to reduce aliasing due to VPC.
7 The VPC predictor can continue iterating the prediction process even if there is a BTB miss. However, we found that continuing in this case does not improve the prediction accuracy. Hence, to simplify the prediction process, our VPC predictor design stops the prediction process when there is a BTB miss in any iteration.

3.3 Training Algorithm

The VPC predictor is trained when an indirect branch is committed. The detailed VPC training algorithm is shown in Algorithms 2 and 3. Algorithm 2 is used when the VPC prediction was correct and Algorithm 3 is used when the VPC prediction was incorrect. The VPC predictor trains both the BTB and the conditional branch direction predictor for each predicted virtual branch. The key functions of the training algorithm are:

1. to update the direction predictor as not-taken for the virtual branches that have the wrong target (because the targets of those branches were not taken) and to update it as taken for the virtual branch, if any, that has the correct target;
2. to update the replacement policy bits of the correct target in the BTB (if the correct target exists in the BTB);
3. to insert the correct target address into the BTB (if the correct target does not exist in the BTB).

Like prediction, training is also an iterative process. To facilitate training on a correct prediction, an indirect branch carries with it through the pipeline the number of iterations performed to predict the branch (predicted_iter). VPCA and VGHR values for each training iteration are recalculated exactly the same way as in the prediction algorithm. Note that only one virtual branch trains the prediction structures in a given cycle.8

3.3.1 Training on a Correct Prediction

If the predicted target for an indirect branch was correct, all virtual branches except for the last one (i.e. the one that has the correct target) train the direction predictor as not-taken (as shown in Algorithm 2). The last virtual branch trains the conditional branch predictor as taken and updates the replacement policy bits in the BTB entry corresponding to the correctly-predicted target address. Note that Algorithm 2 is a special case of Algorithm 3 in that it is optimized to eliminate unnecessary BTB accesses when the target prediction is correct.

Algorithm 2 VPC training algorithm when the branch target is correctly predicted. Inputs: predicted_iter, PC, GHR
  iter ← 1
  VPCA ← PC
  VGHR ← GHR
  while (iter ≤ predicted_iter) do
    if (iter == predicted_iter) then
      update_conditional_BP(VPCA, VGHR, TAKEN)
      update_replacement_BTB(VPCA)
    else
      update_conditional_BP(VPCA, VGHR, NOT-TAKEN)
    end if
    VPCA ← Hash(PC, iter)
    VGHR ← Left-Shift(VGHR)
    iter++
  end while

3.3.2 Training on a Wrong Prediction

If the predicted target for an indirect branch was wrong, there are two misprediction cases: (1) wrong-target: one of the virtual branches has the correct target stored in the BTB but the direction predictor predicted that branch as not-taken; (2) no-target: none of the virtual branches has the correct target stored in the BTB, so the VPC predictor could not have predicted the correct target. In the no-target case, the correct target address needs to be inserted into the BTB.

To distinguish between the wrong-target and no-target cases, the training logic accesses the BTB for each virtual branch (as shown in Algorithm 3).9 If the target address stored in the BTB for a virtual branch is the same as the correct target address of the indirect branch (wrong-target case), the direction predictor is trained as taken and the replacement policy bits in the BTB entry corresponding to the target address are updated. Otherwise, the direction predictor is trained as not-taken. Similarly to the VPC prediction algorithm, when the training logic finds a virtual branch with the correct target address, it stops training.

If none of the iterations (i.e. virtual branches) has the correct target address stored in the BTB, the training logic inserts the correct target address into the BTB. One design question is what VPCA/VGHR values should be used for the newly inserted target address. Conceptually, the choice of VPCA value determines the order of the newly inserted virtual branch among all virtual branches. To insert the new target in the BTB, our current implementation of the training algorithm uses the VPCA/VGHR values corresponding to the virtual branch that missed in the BTB. If none of the virtual branches missed in the BTB, our implementation uses the VPCA/VGHR values corresponding to the virtual branch whose BTB entry has the smallest least-frequently-used (LFU) value. Note that the virtual branch that missed in the BTB or that has the smallest LFU value in its BTB entry can be determined easily while the training algorithm iterates over the virtual branches (we do not show this computation in Algorithm 3 to keep the algorithm readable).10

Algorithm 3 VPC training algorithm when the branch target is mispredicted. Inputs: PC, GHR, CORRECT_TARGET
  iter ← 1
  VPCA ← PC
  VGHR ← GHR
  found_correct_target ← FALSE
  while ((iter ≤ MAX_ITER) and (found_correct_target = FALSE)) do
    pred_target ← access_BTB(VPCA)
    if (pred_target = CORRECT_TARGET) then
      update_conditional_BP(VPCA, VGHR, TAKEN)
      update_replacement_BTB(VPCA)
      found_correct_target ← TRUE
    else if (pred_target) then
      update_conditional_BP(VPCA, VGHR, NOT-TAKEN)
    end if
    VPCA ← Hash(PC, iter)
    VGHR ← Left-Shift(VGHR)
    iter++
  end while

  /* no-target case */
  if (found_correct_target = FALSE) then
    VPCA ← VPCA corresponding to the virtual branch with a BTB miss or the least-frequently-used target among all virtual branches
    VGHR ← VGHR corresponding to the virtual branch with a BTB miss or the least-frequently-used target among all virtual branches
    insert_BTB(VPCA, CORRECT_TARGET)
    update_conditional_BP(VPCA, VGHR, TAKEN)
  end if

8 It is possible to have more than one virtual branch update the prediction structures by increasing the number of write ports in the BTB and the direction predictor. We do not pursue this option as it would increase the complexity of the prediction structures.
3.4 Supporting Multiple Iterations per Cycle
The iterative prediction process can take multiple cycles. The
number of cycles needed to make an indirect branch prediction with
a VPC predictor can be reduced if the processor already supports
the prediction of multiple conditional branches in parallel [44]. The
prediction logic can perform the calculation of VPCA values for
multiple iterations in parallel since VPCA values do not depend on
previous iterations. VGHR values for multiple iterations can also
be calculated in parallel assuming that previous iterations were “not
taken” since the prediction process stops when an iteration results in
a “taken” prediction. Section 5.4 evaluates the performance impact
of performing multiple prediction iterations in parallel.
3.5 Hardware Cost and Complexity
The extra hardware required by the VPC predictor on top of the existing conditional branch prediction scheme is as follows:

1. Three registers to store iter, VPCA, and VGHR for prediction purposes (Algorithm 1).
2. A hard-coded table, HASHVAL, of 32-bit randomized values. The table has MAX_ITER entries. Our experimental results show that MAX_ITER does not need to be greater than 20. The table is dual-ported to support one prediction and one update concurrently.
3. A predicted_iter value that is carried with each indirect branch throughout the pipeline. This value cannot be greater than MAX_ITER.
4. Three registers to store iter, VPCA, and VGHR for training purposes (Algorithms 2 and 3).
5. Two registers to store the VPCA and VGHR values that may be needed to insert a new target into the BTB (for the no-target case in Algorithm 3).

Note that the cost of the required storage is very small. Unlike previously proposed history-based indirect branch predictors, no large or complex tables are needed to store the target addresses. Instead, target addresses are naturally stored in the existing BTB.

The combinational logic needed to perform the computations required for prediction and training is also simple. Actual PC and GHR values are used to access the branch prediction structures in the first iteration of indirect branch prediction. While an iteration is performed, the VPCA and VGHR values for the next iteration are calculated and loaded into the corresponding registers. Therefore, updating VPCA and VGHR for the next iterations is not on the critical path of the branch predictor access.

The training of the VPC predictor on a misprediction may slightly increase the complexity of the BTB update logic because it requires multiple iterations to access the BTB. However, the VPC training logic needs to access the BTB multiple times only on a target misprediction, which is relatively infrequent, and the update logic of the BTB is not on the critical path of instruction execution.

9
Note that these extra BTB accesses for training are required only on a misprediction and they do not require an extra BTB read-port. An extra BTB access holds only one BTB bank per training-iteration. Even if the access results in a bank conflict with the accesses from the fetch engine for all the mispredicted indirect branches, we found that the performance impact is negligible due to the low frequency of indirect branch mispredictions in the VPC mechanism [29].

10
This scheme does not necessarily find and replace the least frequently used of the targets corresponding to an indirect branch – this is difficult to implement as it requires keeping LFU information on a per-indirect-branch basis across different BTB sets. Rather, our scheme is an approximation that replaces the target that has the lowest value for the LFU-bits (corresponding to the LFU within a set) stored in the BTB entry, assuming the baseline BTB implements an LFU-based replacement policy. Other heuristics are possible to determine the VPCA/VGHR of a new target address (i.e. new virtual branch). We experimented with schemes that select among the VPCA/VGHR values corresponding to the iterated virtual branches randomly, or based on recency information that could be stored in the corresponding BTB entries, and found that LFU performs best with random selection a close second. We do not present these results due to space limitations.

4. EXPERIMENTAL METHODOLOGY
We use a Pin-based [31] cycle-accurate x86 simulator to evaluate VPC prediction. The parameters of our baseline processor are shown in Table 2. The baseline processor uses the BTB to predict indirect branches [30].

Table 2: Baseline processor configuration
Front End: 64KB, 2-way, 2-cycle I-cache; fetch ends at the first predicted-taken branch; fetch up to 3 conditional branches or 1 indirect branch
Branch Predictors: 64KB (64-bit history, 1021-entry) perceptron branch predictor [25]; 4K-entry, 4-way BTB with pseudo-LFU replacement; 64-entry return address stack; minimum branch misprediction penalty is 30 cycles
Execution Core: 8-wide fetch/issue/execute/retire; 512-entry ROB; 384 physical registers; 128-entry LD-ST queue; 4-cycle pipelined wake-up and selection logic; scheduling window partitioned into 8 sub-windows of 64 entries each
On-chip Caches: L1 D-cache: 64KB, 4-way, 2-cycle, 2 ld/st ports; L2 unified cache: 1MB, 8-way, 8 banks, 10-cycle latency; all caches use LRU replacement and have 64B line size
Buses and Memory: 300-cycle minimum memory latency; 32 memory banks; 32B-wide core-to-memory bus at 4:1 frequency ratio
Prefetcher: stream prefetcher with 32 streams and 16-cache-line prefetch distance (lookahead) [42]
The experiments are run using 5 SPEC CPU2000 INT benchmarks,
5 SPEC CPU2006 INT/C++ benchmarks, and 2 other C++ benchmarks. We chose those benchmarks in SPEC INT 2000 and 2006
INT/C++ suites that gain at least 5% performance with a perfect indirect branch predictor. Table 3 provides a brief description of the
other two C++ benchmarks.
We use Pinpoints [36] to select a representative simulation region
for each benchmark using the reference input set. Each benchmark
is run for 200 million x86 instructions. Table 4 shows the characteristics of the examined benchmarks on the baseline processor. All binaries are compiled with Intel's production compiler (ICC) [21] with the -O3 optimization level.

Table 3: Evaluated C++ benchmarks that are not included in SPEC CPU 2000 or 2006
ixx: translator from IDL (Interface Definition Language) to C++
richards: simulates the task dispatcher in the kernel of an operating system [43]
5. RESULTS

5.1 Dynamic Target Distribution
Figure 5 shows the distribution of the number of dynamic targets for executed indirect branches. In eon, gap, and ixx, more than 40% of the executed indirect branches have only one target. These single-target indirect branches are easily predictable with a simple BTB-based indirect branch predictor. However, in gcc (50%), crafty (100%), perlbmk (94%), perlbench (98%), sjeng (100%) and povray (97%), over 50% of the dynamic indirect branches have more than 5 targets. On average, 51% of the dynamic indirect branches in the evaluated benchmarks have more than 5 targets.

Figure 5: Distribution of the number of dynamic targets across executed indirect branches

5.2 Performance of VPC Prediction
Figure 6 (top) shows the performance improvement of VPC prediction over the baseline BTB-based predictor when MAX_ITER is varied from 2 to 16. Figure 6 (bottom) shows the indirect branch MPKI in the baseline and with VPC prediction. In eon, gap, and namd, where over 60% of all executed indirect branches have at most 2 unique targets (as shown in Figure 5), VPC prediction with MAX_ITER=2 eliminates almost all indirect branch mispredictions. Almost all indirect branches in richards have 3 or 4 different targets. Therefore, when the VPC predictor can hold 4 different targets per indirect branch (MAX_ITER=4), indirect branch MPKI is reduced to only 0.7 (from 13.4 in the baseline and 1.8 with MAX_ITER=2). The performance of only perlbmk and perlbench continues to improve significantly as MAX_ITER is increased beyond 6, because at least 65% of the indirect branches in these two benchmarks have at least 16 dynamic targets. (This is due to the large switch-case statements in perl that are used to parse and pattern-match the input expressions. The most frequently executed/mispredicted indirect branch in perlbench belongs to a switch statement with 57 static targets.) Note that even though the number of mispredictions can be further reduced when MAX_ITER is increased beyond 12, the performance improvement actually decreases for perlbench. This is due to two reasons: (1) storing more targets in the BTB via a larger MAX_ITER value starts creating conflict misses, and (2) some correct predictions take longer when MAX_ITER is increased, which increases the idle cycles in which no instructions are fetched.

On average, VPC prediction improves performance by 26.7% over the BTB-based predictor (when MAX_ITER=12), by reducing the average indirect branch MPKI from 4.63 to 0.52. Since a MAX_ITER value of 12 provides the best performance, most later experiments in this section use MAX_ITER=12. We found that using VPC prediction does not significantly impact the prediction accuracy of conditional branches in the benchmark set we examined, as shown in Table 6.

Figure 6: Performance of VPC prediction: IPC improvement (top), indirect branch MPKI (bottom)

Figure 7 shows the distribution of the number of iterations needed to generate a correct target prediction. On average, 44.6% of the correct predictions occur in the first iteration (i.e. zero idle cycles) and 81% of the correct predictions occur within three iterations. Only in perlbmk and sjeng do more than 30% of all correct predictions require at least 5 iterations. Hence, most correct predictions are performed quickly, resulting in few idle cycles during which the fetch engine stalls.

Figure 7: Distribution of the number of iterations (for correct predictions) (MAX_ITER=12)
Table 4: Characteristics of the evaluated benchmarks: language and type of the benchmark (Lang/Type), baseline IPC (BASE IPC), potential IPC improvement with perfect indirect branch prediction (PIBP IPC Δ), static number of indirect branches (Static IB), dynamic number of indirect branches (Dyn. IB), indirect branch prediction accuracy (IBP Acc), indirect branch mispredictions per kilo instructions (IB MPKI), conditional branch mispredictions per kilo instructions (CB MPKI). gcc06 is 403.gcc in CPU2006 and gcc is 176.gcc in CPU2000.

Benchmark   Lang/Type  BASE IPC  PIBP IPC Δ  Static IB  Dyn. IB  IBP Acc (%)  IB MPKI  CB MPKI
gcc         C/int      1.20      23.0%       987        1203K    34.9         3.9      3.0
crafty      C/int      1.71      4.8%        356        195K     34.1         0.6      6.1
eon         C++/int    2.15      16.2%       1857       1401K    72.2         1.9      0.2
perlbmk     C/int      1.29      105.5%      864        2908K    30.0         10.2     0.9
gap         C/int      1.29      55.6%       1640       3454K    55.3         7.7      0.8
perlbench   C/int      1.18      51.7%       1283       1983K    32.6         6.7      3.0
gcc06       C/int      0.66      17.3%       1557       1589K    43.9         4.5      3.7
sjeng       C/int      1.21      18.5%       369        893K     28.8         3.2      9.5
namd        C++/fp     2.62      5.4%        678        517K     83.3         0.4      1.1
povray      C++/fp     1.79      12.1%       1035       1148K    70.8         1.7      2.1
richards    C++/int    0.91      107.1%      266        4533K    40.9         13.4     1.4
ixx         C++/int    1.62      12.8%       1281       252K     80.7         1.4      4.2
AVG         -          1.29      32.5%       -          -        50.6         4.63     3.0

5.3 Comparisons with Other Indirect Branch Predictors
Figure 8 compares the performance and MPKI of VPC prediction with the tagged target cache (TTC) predictor [6]. The size of the 4-way TTC predictor is calculated assuming 4-byte targets and 2-byte tags for each entry.11 On average, VPC prediction provides the performance provided by a 6KB TTC predictor. However, as shown in Table 5, in six benchmarks, the VPC predictor performs at least as well as a 12KB TTC (and on 4 benchmarks better than a 192KB TTC). As shown in Table 5, the size of TTC that provides equivalent performance is negatively correlated with the average number of dynamic targets for each indirect branch in a benchmark: the higher the average number of targets, the smaller the TTC that performs as well as VPC (e.g. in crafty, perlbmk, and perlbench). This is because TTC provides separate storage to cache the large number of dynamic targets in addition to the BTB, whereas VPC prediction uses only the available BTB space. As the average number of targets increases, the contention for space in the BTB also increases, and reducing this contention even with a relatively small separate structure (as TTC does) provides significant performance gains.

Figure 8: Performance of VPC prediction vs. tagged target cache: IPC (top), MPKI (bottom)

11
Note that we simulated full 8-byte tags for TTC and hence our performance results reflect full tags, but we assume that a realistic TTC will not be implemented with full tags so we do not penalize it in terms of area cost. A target cache entry is allocated only on a BTB misprediction for an indirect branch. Our results do not take into account the increase in cycle time that might be introduced due to the addition of the TTC predictor into the processor front-end.

Figure 9 compares the performance of VPC prediction with a 3-stage cascaded predictor [9, 10]. On average, VPC prediction provides the same performance improvement as a 22KB cascaded predictor. As shown in Table 5, in six benchmarks, VPC prediction provides the performance of at least a 176KB cascaded predictor.12

Figure 9: Performance of VPC prediction vs. cascaded predictor

12
We found that a 3-stage cascaded predictor performs slightly worse than an equally-sized TTC predictor. This is because the number of static indirect branches in the evaluated benchmarks is relatively small (10-20) and a cascaded predictor performs better than a TTC when there is a larger number of static branches [9, 10].

Table 5: The sizes of tagged target cache (TTC) and cascaded predictors that provide the same performance as the VPC predictor (MAX_ITER=12) in terms of IPC

Benchmark   TTC size  Cascaded size  Avg. # of targets
gcc         12KB      >176KB         6.1
crafty      1.5KB     2.8KB          8.0
eon         >192KB    >176KB         2.1
perlbmk     1.5KB     2.8KB          15.6
gap         6KB       11KB           1.8
perlbench   512B      1.4KB          17.9
gcc06       12KB      44KB           5.8
sjeng       3KB       5.5KB          9.0
namd        >192KB    >176KB         2.0
povray      >192KB    >176KB         5.9
richards    >192KB    >176KB         3.4
ixx         3KB       >176KB         4.1

5.4 Effect of VPC Prediction Delay
So far we have assumed that a VPC predictor can predict a single virtual branch per cycle. Providing the ability to predict multiple virtual branches per cycle (assuming the underlying conditional branch predictor supports this) would reduce the number of idle cycles spent during multiple VPC prediction iterations. Figure 10 shows the performance impact when multiple iterations can take only one cycle. Supporting, unrealistically, even 10 prediction iterations per cycle further improves the performance benefit of VPC prediction by only 2.2%. As we have already shown in Figure 7, only 19% of all correct predictions require more than 3 iterations. Therefore, supporting multiple iterations per cycle does not provide significant improvement. We conclude that, to simplify the design, the VPC predictor can be implemented to support only one iteration per cycle.

Figure 10: Performance impact of supporting multiple VPC prediction iterations per cycle

5.5 Sensitivity of VPC Prediction to Microarchitecture Parameters

5.5.1 Different Conditional Branch Predictors
We evaluated VPC prediction with various baseline conditional branch predictors. Figure 11 compares the performance of the TTC predictor and the VPC predictor on a baseline processor with a 64KB O-GEHL predictor [39]. On average, VPC prediction improves performance by 31% over the baseline with an O-GEHL predictor. We also found [29] that VPC prediction improves performance by 23.8% on a baseline with a 64KB gshare [33] predictor.

Table 6 summarizes the results of our comparisons. Reducing the conditional branch misprediction rate via a better predictor also reduces the indirect branch misprediction rate with VPC prediction. Hence, as the baseline conditional branch predictor becomes better, the performance improvement provided by VPC prediction increases. We conclude that, with VPC prediction, any research effort that improves conditional branch prediction accuracy will likely result in also improving indirect branch prediction accuracy – without requiring the significant extra effort to design complex and specialized indirect branch predictors.

Figure 11: Performance of VPC prediction vs. TTC on a processor with an O-GEHL conditional branch predictor

Table 6: Effect of different conditional branch predictors

Cond. BP    Baseline cond. MPKI  Baseline indi. MPKI  Baseline IPC  VPC cond. MPKI  VPC indi. MPKI  VPC IPC Δ
gshare      3.70                 4.63                 1.25          3.78            0.65            23.8%
perceptron  3.00                 4.63                 1.29          3.00            0.52            26.7%
O-GEHL      1.82                 4.63                 1.37          1.84            0.32            31.0%

5.5.2 Different BTB Sizes
We evaluated VPC prediction with different BTB sizes: 512, 1024, and 2048 entries. As Table 7 shows, even with smaller BTB sizes VPC prediction still provides significant performance improvements.

Table 7: Effect of different BTB sizes

BTB entries  Baseline indirect MPKI  Baseline IPC  VPC indirect MPKI  VPC IPC Δ
512          4.81                    1.16          1.31               18.5%
1K           4.65                    1.25          0.95               21.7%
2K           4.64                    1.28          0.78               23.8%
4K           4.63                    1.29          0.52               26.7%

5.5.3 VPC Prediction on a Less Aggressive Processor
Figure 12 shows the performance of VPC and TTC predictors on a less aggressive baseline processor that has a 20-stage pipeline, 4-wide fetch/issue/retire rate, 128-entry instruction window, 16KB perceptron branch predictor, 1K-entry BTB, and 200-cycle memory latency. Since the less aggressive processor incurs a smaller penalty for a branch misprediction, improved indirect branch handling provides smaller performance improvements than in the baseline processor. However, VPC prediction still improves performance by 17.6%.

Figure 12: VPC prediction vs. TTC on a less aggressive processor

5.6 Performance of VPC Prediction on Server Applications
We also evaluated the VPC predictor with commercial on-line transaction processing (OLTP) applications [17]. Each OLTP trace is collected from an IBM System 390 zSeries machine [20] for 22M instructions. Unlike the SPEC CPU benchmarks, OLTP applications have a much higher number of static indirect branches (OLTP1: 7601, OLTP2: 7991, and OLTP3: 15733) and very high indirect branch misprediction rates.13 The VPC predictor (MAX_ITER=10) reduces the indirect branch misprediction rate by 28%, from 12.2 MPKI to 8.7 MPKI. The VPC predictor performs better than a 12KB TTC predictor on all applications and almost as well as a 24KB TTC on oltp2. Hence, we conclude that the VPC predictor is also very effective in large-scale server applications.

13
System 390 ISA has both unconditional indirect and conditional indirect branch instructions. For this experiment, we only consider unconditional indirect branches and use a 16K-entry 4-way BTB in the baseline processor.

Figure 13: MPKI of VPC prediction on OLTP applications: effect of MAX_ITER (left) and vs. TTC predictor (right)
6. VPC PREDICTION AND COMPILER-BASED DEVIRTUALIZATION
Devirtualization is the substitution of an indirect method call
with direct method calls in object-oriented languages [7, 19, 16, 2,
24]. Ishizaki et al. [24] classify the devirtualization techniques into
guarded devirtualization and direct devirtualization.
Guarded devirtualization: Figure 14(a) shows an example virtual
function call in the C++ language. In the example, depending on the
actual type of Shape s, different area functions are called at runtime. However, even though there could be many different shapes in
the program, if the types of shapes are mostly either an instance of
the Rectangle class or the Circle class at run-time, the compiler
can convert the indirect call to multiple guarded direct calls [16, 13,
2] as shown in Figure 14(b). This compiler optimization is called Receiver Class Prediction Optimization (RCPO) and the compiler can
perform RCPO based on profiling.
Shape* s = ... ;
a = s->area(); // an indirect call
(a) A virtual function call in C++
Shape * s = ...;
if (s->class == Rectangle)
// a cond. br at PC: X
a = Rectangle::area();
// a direct call
else if (s->class == Circle) // a cond. br at PC: Y
a = Circle::area();
// a direct call
else
a = s->area();
// an indirect call at PC: Z
(b) Devirtualized form of the above virtual function call
but requires whole-program analysis to make sure there is only one
possible target. This approach enables code optimizations that would
otherwise be hindered by the indirect call. However, this approach
cannot be used statically if the language supports dynamic class loading, like Java. Dynamic recompilation can overcome this limitation, but it requires an expensive mechanism called on-stack replacement [24].
6.1
6.1.1
The benefits of this optimization are: (1) It enables other compiler optimizations. The compiler could inline the direct function
calls or perform interprocedural analysis [13]. Removing function
calls also reduces the register save/restore overhead. (2) The processor can predict the virtual function call using a conditional branch
predictor, which usually has higher accuracy than an indirect branch
predictor [2]. However, not all indirect calls can be converted to multiple conditional branches. In order to perform RCPO, the following
conditions need to be fulfilled [13, 2]:
1. The number of frequent target addresses from a caller site
should be small (1-2).
2. The majority of target addresses should be similar across input
sets.
3. The target addresses must be available at compile-time.
Direct devirtualization: Direct devirtualization converts an indirect call into a single unconditional direct call if the compiler can
prove that there is only one possible target for the indirect call. Hence,
direct devirtualization does not require a guard before the direct call,
Need for Static Analysis or Accurate Profiling
The application of devirtualization to large commercial software
bases is limited by the cost and overhead of the static analysis or
profiling required to guide the method call transformation. Devirtualization based on static analysis requires type analysis, which in
turn requires whole program analysis [24], and unsafe languages like
C++ also require pointer alias analysis. Note that these analyses need
to be conservative in order to guarantee correct program semantics.
Guarded devirtualization usually requires accurate profile information, which may be very difficult to obtain for large applications. Due
to the limited applicability of static devirtualization, [24] reports only
an average 40% reduction in the number of virtual method calls on a
set of Java benchmarks, with the combined application of aggressive
guarded and direct devirtualization techniques.
6.1.2
Figure 14: A virtual function call and its devirtualized form
Limitations of Compiler-Based
Devirtualization
Impact on Code Size and Branch Mispredictions
Guarded devirtualization can sometimes reduce performance since
(1) it increases the static code size by converting a single indirect
branch instruction into multiple guard test instructions and direct
calls; (2) it could replace one possibly mispredicted indirect call with
multiple conditional branch mispredictions, if the guard tests become
hard-to-predict branches [41].
6.1.3
Lack of Adaptivity to Run-Time Input-Set and
Phase Behavior
The most frequently-taken targets chosen for devirtualization can
be based on profiling, which averages the whole execution of the program for one particular input set. However, the most frequently-taken
targets can be different across different input sets. Furthermore, the
most frequently-taken targets can change during different phases of
the program. Additionally, dynamic linking and dynamic class loading can introduce new targets at runtime. Compiler-based devirtualization cannot adapt to these changes in program behavior because
the most frequent targets of a method call are determined statically
and encoded in the binary.
Due to these limitations, state-of-the-art compilers either do not
implement any form of devirtualization (e.g. GCC 4.0 [14]14 ) or
14
GCC only implements a form of devirtualization based on class hierarchy analysis in
the ipa-branch experimental branch, but not in the main branch [35].
433
60
30
20
n
ea
ay
vr
TTC-384B
TTC-768B
TTC-1.5KB
TTC-3KB
TTC-6KB
TTC-12KB
TTC-24KB
TTC-48KB
VPC-ITER-12
% IPC improvement over baseline
90
80
70
60
50
40
30
20
hm
ea
n
baseline
TTC-384B
TTC-768B
TTC-1.5KB
TTC-3KB
TTC-6KB
TTC-12KB
TTC-24KB
TTC-48KB
VPC-ITER-12
9
8
7
6
5
4
3
2
n
ea
am
ay
po
vr
d
m
na
6
ng
sje
en
c0
gc
p
rlb
pe
m
rlb
pe
n
eo
ty
af
cr
gc
c
k
0
ch
1
ga
Indirect branch Mispredictions (MPKI)
po
vr
ay
na
m
d
sje
ng
pe
rlb
en
ch
gc
c0
6
ga
p
eo
n
cr
af
ty
0
pe
rlb
m
k
10
gc
c
Since it is not possible to selectively enable only RCPO in ICC, we could not isolate the
impact of RCPO on performance. Hence, we only present the effect of VPC prediction
on binaries optimized with RCPO.
16
RCPO binaries were compiled in two passes with ICC: the first pass is a profiling run
with the train input set (-prof_gen switch), and the second pass optimizes the binaries
based on the profile (we use the -prof_use switch, which enables all profile-guided
optimizations).
hm
d
100
10
A compiler that performs devirtualization reduces the number of
indirect branches and therefore reduces the potential performance
improvement of VPC prediction. This section evaluates the effectiveness of VPC prediction on binaries that are optimized using aggressive profile-guided optimizations, which include RCPO. ICC [21]
performs a form of RCPO [38] when value-profiling feedback is enabled, along with other profile-based optimizations.15
Table 8 shows the number of static/dynamic indirect branches
in the BASE and RCPO binaries run with the full reference input
set. BASE binaries are compiled with the -O3 option. RCPO binaries are compiled with all profile-guided optimizations, including
RCPO.16 Table 8 shows that RCPO binaries reduce the number of
static/dynamic indirect branches by up to 51%/86%.
Figure 15 shows the performance impact of VPC prediction on
RCPO binaries. Even though RCPO binaries have fewer indirect
branches, VPC prediction still reduces indirect branch mispredictions by 80% on average, improving performance by 11.5% over a
BTB-based predictor. Figure 16 shows the performance comparison
of VPC prediction with a tagged target cache on RCPO binaries. The
performance of VPC is better than a tagged predictor of 48KB (for
eon, namd, povray), and equivalent to a tagged predictor of 24KB
(for gap), of 12KB (for gcc), of 3KB (for perlbmk, gcc06, and sjeng),
of 1.5KB (for crafty), and 768B (for perlbench). Hence, a VPC predictor provides the performance of a large and more complicated
tagged target cache predictor even when the RCPO optimization is
used by the compiler.
po
6
ng
m
na
sje
gc
c0
en
p
m
rlb
n
pe
eo
ty
af
Figure 15: Performance of VPC prediction on RCPO binaries
6.3 Performance of VPC Prediction on
Binaries Optimized with Compiler-Based
Devirtualization
15
cr
gc
c
k
0
ch
10
rlb
VPC prediction is essentially a dynamic devirtualization mechanism used for indirect branch prediction purposes. However, VPC’s devirtualization is visible only to the branch prediction structures. VPC has the following advantages over compiler-based devirtualization:
1. Because it is a hardware mechanism, it can be applied to any indirect branch without requiring any static analysis/guarantees or profiling.
2. Adaptivity: unlike compiler-based devirtualization, the dynamic training algorithms allow the VPC predictor to adapt to changes in the most frequently taken targets, or even to new targets introduced by dynamic linking or dynamic class loading.
3. Because virtual conditional branches are visible only to the branch predictor, VPC prediction does not increase the code size, nor can it convert a single indirect branch misprediction into multiple conditional branch mispredictions.
On the other hand, the main advantage of compiler-based devirtualization over VPC prediction is that it enables compile-time code optimizations. However, as we show in the next section, the two techniques can be used in combination, and VPC prediction provides performance benefits on top of compiler-based devirtualization.
6.2 VPC Prediction vs. Compiler-Based Devirtualization
they implement a limited form of direct devirtualization that converts only provably-monomorphic virtual function calls into direct function calls (e.g., the Bartok compiler [41, 34]).
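For readers unfamiliar with the transformation, guarded devirtualization can be sketched as follows. The classes and the "profiled" hot receiver class are invented for illustration, and a real compiler would emit the guard and the inlined direct call in machine code rather than source:

```python
import math

class Shape:
    def area(self):
        raise NotImplementedError

class Circle(Shape):
    def __init__(self, r):
        self.r = r
    def area(self):
        return math.pi * self.r * self.r

class Square(Shape):
    def __init__(self, s):
        self.s = s
    def area(self):
        return self.s * self.s

def total_area_virtual(shapes):
    # Original code: every area() call dispatches indirectly.
    return sum(s.area() for s in shapes)

def total_area_devirtualized(shapes):
    # After profile-guided devirtualization: suppose profiling showed
    # Circle to be the dominant receiver class, so its method body is
    # inlined behind a type guard, keeping indirect dispatch as fallback.
    total = 0.0
    for s in shapes:
        if type(s) is Circle:            # guard on the hot receiver class
            total += math.pi * s.r * s.r  # direct (inlined) call body
        else:
            total += s.area()             # fallback: indirect dispatch
    return total
```

The guard turns the common case into a conditional branch plus straight-line code, which is exactly why a guard misprediction can replace one indirect branch misprediction with a conditional one.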
Figure 16: VPC prediction vs. tagged target cache on RCPO binaries: IPC (top) and MPKI (bottom)
7. CONCLUSION
This paper proposed and evaluated the VPC prediction paradigm. The key idea of VPC prediction is to treat an indirect branch instruction as multiple “virtual” conditional branch instructions for prediction purposes in the microarchitecture. As such, VPC prediction enables the use of existing conditional branch prediction structures to predict the targets of indirect branches, without requiring any extra structures specialized for storing indirect branch targets. Our evaluation shows that VPC prediction, without requiring complicated structures, achieves the performance provided by other indirect branch predictors that require significant extra storage and complexity. We believe the performance impact of VPC prediction will further increase in future applications, which will be written in object-oriented languages that make heavy use of polymorphism, since such languages have been shown to cause significantly more indirect branch mispredictions than traditional C/Fortran-style languages. By making available to indirect branches the rich, accurate, highly-optimized, and continuously-improving hardware used
Table 8: The number of static and dynamic indirect branches in BASE and RCPO binaries

                      gcc   crafty    eon   perlbmk    gap   perlbench   gcc06   sjeng   namd   povray
Static BASE           987      356   1857       864   1640        1283    1557     369    678     1035
Static RCPO           984      358   1854       764   1709         930    1293     369    333      578
Dynamic BASE (M)      144      174    628      1041   2824        8185     304   10130      7     8228
Dynamic RCPO (M)       94      119    619      1005   2030        1136     202   10132      4     7392
to predict conditional branches, VPC prediction can serve as an enabler encouraging programmers (especially those concerned with the performance of their code) to use object-oriented programming styles, thereby improving the quality and ease of software development.
ACKNOWLEDGMENTS
We especially thank Thomas Puzak of IBM for providing the traces
of commercial benchmarks that we used in our experiments. We
thank Joel Emer, Paul Racunas, John Pieper, Robert Cox, David
Tarditi, Veynu Narasiman, Jared Stark, Santhosh Srinath, Thomas
Moscibroda, Bradford Beckmann, members of the HPS research
group, and the anonymous reviewers for their comments and suggestions. We gratefully acknowledge the support of the Cockrell Foundation, Intel Corporation and the Advanced Technology Program of
the Texas Higher Education Coordinating Board.
REFERENCES
[1] Advanced Micro Devices, Inc. AMD Athlon(TM) XP Processor Model 10 Data Sheet, Feb. 2003.
[2] B. Calder and D. Grunwald. Reducing indirect function call overhead
in C++ programs. In POPL-21, 1994.
[3] B. Calder, D. Grunwald, and B. Zorn. Quantifying behavioral
differences between C and C++ programs. Journal of Programming
Languages, 2(4):323–351, 1995.
[4] L. Cardelli and P. Wegner. On understanding types, data abstraction,
and polymorphism. ACM Computing Surveys, 17(4):471–523, Dec.
1985.
[5] P.-Y. Chang, M. Evers, and Y. N. Patt. Improving branch prediction
accuracy by reducing pattern history table interference. In PACT, 1996.
[6] P.-Y. Chang, E. Hao, and Y. N. Patt. Target prediction for indirect
jumps. In ISCA-24, 1997.
[7] L. P. Deutsch and A. M. Schiffman. Efficient implementation of the
Smalltalk-80 system. In POPL, 1984.
[8] K. Driesen and U. Hölzle. Accurate indirect branch prediction. In
ISCA-25, 1998.
[9] K. Driesen and U. Hölzle. The cascaded predictor: Economical and
adaptive branch target prediction. In MICRO-31, 1998.
[10] K. Driesen and U. Hölzle. Multi-stage cascaded prediction. In
European Conference on Parallel Processing, 1999.
[11] M. Evers, S. J. Patel, R. S. Chappell, and Y. N. Patt. An analysis of
correlation and predictability: What makes two-level branch predictors
work. In ISCA-25, 1998.
[12] The GAP Group. GAP System for Computational Discrete Algebra.
http://www.gap-system.org/.
[13] C. Garrett, J. Dean, D. Grove, and C. Chambers. Measurement and
application of dynamic receiver class distributions. Technical Report
UW-CS 94-03-05, University of Washington, Mar. 1994.
[14] GCC-4.0. GNU compiler collection. http://gcc.gnu.org/.
[15] S. Gochman, R. Ronen, I. Anati, A. Berkovits, T. Kurts, A. Naveh,
A. Saeed, Z. Sperber, and R. C. Valentine. The Intel Pentium M
processor: Microarchitecture and performance. Intel Technology
Journal, 7(2), May 2003.
[16] D. Grove, J. Dean, C. Garrett, and C. Chambers. Profile-guided
receiver class prediction. In OOPSLA-10, 1995.
[17] A. Hartstein and T. R. Puzak. The optimum pipeline depth for a
microprocessor. In ISCA-29, 2002.
[18] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and
P. Roussel. The microarchitecture of the Pentium 4 processor. Intel
Technology Journal, Feb. 2001. Q1 2001 Issue.
[19] U. Hölzle and D. Ungar. Optimizing dynamically-dispatched calls with run-time type feedback. In PLDI, 1994.
[20] IBM Corporation. IBM zSeries mainframe servers. http://www.ibm.com/systems/z/.
[21] Intel Corporation. ICC 9.1 for Linux.
http://www.intel.com/cd/software/products/
asmo-na/eng/compilers/284264.htm.
[22] Intel Corporation. Intel Core Duo Processor T2500.
http://processorfinder.intel.com/Details.aspx?
sSpec=SL8VT.
[23] Intel Corporation. Intel VTune Performance Analyzers.
http://www.intel.com/vtune/.
[24] K. Ishizaki, M. Kawahito, T. Yasue, H. Komatsu, and T. Nakatani. A
study of devirtualization techniques for a Java Just-In-Time compiler.
In OOPSLA-15, 2000.
[25] D. A. Jiménez and C. Lin. Dynamic branch prediction with
perceptrons. In HPCA-7, 2001.
[26] D. Kaeli and P. Emma. Branch history table predictions of moving
target branches due to subroutine returns. In ISCA-18, 1991.
[27] J. Kalamatianos and D. R. Kaeli. Predicting indirect branches via data
compression. In MICRO-31, 1998.
[28] R. E. Kessler. The Alpha 21264 microprocessor. IEEE Micro,
19(2):24–36, 1999.
[29] H. Kim, J. A. Joao, O. Mutlu, C. J. Lee, Y. N. Patt, and R. Cohn. VPC
prediction: Reducing the cost of indirect branches via hardware-based
dynamic devirtualization. Technical Report TR-HPS-2007-002, The
University of Texas at Austin, Mar. 2007.
[30] J. K. F. Lee and A. J. Smith. Branch prediction strategies and branch
target buffer design. IEEE Computer, pages 6–22, Jan. 1984.
[31] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney,
S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: Building customized
program analysis tools with dynamic instrumentation. In PLDI, 2005.
[32] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg,
J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full
system simulation platform. IEEE Computer, 35(2):50–58, Feb. 2002.
[33] S. McFarling. Combining branch predictors. Technical Report TN-36,
Digital Western Research Laboratory, June 1993.
[34] Microsoft Research. Bartok compiler.
http://research.microsoft.com/act/.
[35] D. Novillo, Mar. 2007. Personal communication.
[36] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi.
Pinpointing representative portions of large Intel Itanium programs
with dynamic instrumentation. In MICRO-37, 2004.
[37] A. Roth, A. Moshovos, and G. S. Sohi. Improving virtual function call
target prediction via dependence-based pre-computation. In ICS-13,
1999.
[38] D. Sehr, Nov. 2006. Personal communication.
[39] A. Seznec. Analysis of the O-GEometric History Length branch
predictor. In ISCA-32, 2005.
[40] A. Seznec and P. Michaud. A case for (partially) TAgged GEometric
history length branch prediction. Journal of Instruction-Level
Parallelism (JILP), 8, Feb. 2006.
[41] D. Tarditi, Nov. 2006. Personal communication.
[42] J. Tendler, S. Dodson, S. Fields, H. Le, and B. Sinharoy. POWER4
system microarchitecture. IBM Technical White Paper, Oct. 2001.
[43] M. Wolczko. Benchmarking Java with the Richards benchmark.
http://research.sun.com/people/mario/java_
benchmarking/richards/richards.html.
[44] T.-Y. Yeh, D. Marr, and Y. N. Patt. Increasing the instruction fetch rate
via multiple branch prediction and branch address cache. In ICS, 1993.