TAP Prediction: Reusing Conditional Branch Predictor for Indirect

advertisement
TAP Prediction: Reusing Conditional Branch Predictor for Indirect
Branches with Target Address Pointers
Zichao Xie, Dong Tong, Mingkai Huang, Xiaoyin Wang, Qinqing Shi, Xu Cheng
Microprocessor Research & Development Center, Peking University, Beijing, China
{xiezichao, tongdong, huangmingkai, wangxiaoyin, shiqinqing, chengxu}@mprc.pku.edu.cn
They are used to implement several programming constructs,
including virtual function calls, switch-case statements and
function pointers. Realizing the importance of indirectbranch predictions, designers are beginning to develop highperformance indirect-branch predictors in recent commercial
processors [5].
Previously proposed hardware-based indirect-branch predictors use branch history to distinguish various occurrences
of indirect branches. Some predictors place the targets corresponding to various branch occurrences directly into a large
dedicated storage structure. These kinds of predictors [2]
[3] [14], called dedicated-storage-predictors in this paper,
can predict targets very quickly. However, their storage
requirement takes up significant die area, which translates
into extra power consumption. A representative hardwarereusing predictor, VPC Prediction [8], reuses only existing
branch prediction hardware. This method achieves high prediction accuracy, but it may take many cycles to finish the
prediction, thus reducing its performance improvement. The
goal of our technique is to achieve time cost similar to that
of dedicated-storage-predictors, without requiring additional
large amounts of storage.
This paper proposes a hardware-reusing indirect-branch
prediction, called Target Address Pointer (TAP) Prediction.
TAP Prediction reuses the history-based direction predictor to
detect various indirect-branch occurrences, and stores multiple indirect-branch targets in the BTB. It employs the Target
Address Pointers to establish the mapping from the indirectbranch occurrences to the stored indirect-branch targets. As
shown in Figure 2, to generate such pointers according
to the current branch history, TAP prediction reuses the
storage of the existing direction predictor to construct several
small predictors. Each small predictor, called sub-predictor,
performs as well as the original predictor, but uses fewer
branch histories. Its output is defined as one bit of the
target address pointer. When fetching an indirect branch,
TAP Prediction uses the sub-predictors to generate the bits
of the target address pointer, ignoring the result from the
original predictor. Then TAP prediction selects the virtual
address calculated from this pointer to index the BTB for
the predicted target.
Through hardware reusing, TAP Prediction employs target
address pointers to achieve fast indirect-branch prediction
without requiring additional storage. Our experimental results show that for three representative direction predictors–
Abstract— Indirect-branch prediction is becoming more important for modern processors as more programs are written
in object-oriented languages. Previous hardware-based indirectbranch predictors generally require significant hardware storage or use aggressive algorithms which make the processor
front-end more complex. In this paper, we propose a fast and
cost-efficient indirect-branch prediction strategy, called Target
Address Pointer (TAP) Prediction. TAP Prediction reuses the
history-based branch direction predictor to detect occurrences
of indirect branches, and then stores indirect-branch targets
in the Branch Target Buffer (BTB). The key idea of TAP
Prediction is to predict the Target Address Pointers, which
generate virtual addresses to index the targets stored in the
BTB, rather than to predict the indirect-branch targets directly.
TAP Prediction also reuses the branch direction predictor to
construct several small predictors. When fetching an indirect
branch, these small predictors work in parallel to generate
the target address pointer. Then TAP prediction accesses the
BTB to fetch the predicted indirect-branch target using the
generated virtual address. This mechanism could achieve time
cost comparable to that of dedicated-storage-predictors, without
requiring additional large amounts of storage. Our evaluation shows that for three representative direction predictors–
Hybrid, Perceptrons, and O-GEHL–TAP schemes improve
performance by 18.19%, 21.52%, and 20.59%, respectively,
over the baseline processor with the most commonly-used BTB
prediction. Compared with previous hardware-based indirectbranch predictors, the TAP-Perceptrons scheme achieves performance improvement equivalent to that provided by a 48KB
TTC predictor, and it also outperforms the VPC predictor by
14.02%.
I. I NTRODUCTION
Modern high-performance processors employ the branch
prediction component as an essential part to exploit
instruction-level parallelism. Previous branch prediction researchers have focused on predicting conditional or unconditional direct branches accurately [10] [6] [13]. Unlike directbranch prediction, it is more difficult to achieve high prediction accuracy for indirect-branch prediction, as it requires
predicting the target address instead of the branch direction.
A commonly-used Branch Target Buffer (BTB) can predict
only the last-taken targets of indirect branches. Figure 1
shows the poor indirect-branch prediction accuracy for the
simulated benchmarks with a 4-way 4K-entry BTB. Another
trend is that indirect branches are becoming increasingly
common in programs written in object-oriented languages.
This work was supported by the National High Technology Research and
Development Program of China under Grant No. 2009ZX01029-001-002.
978-1-4577-1954-7/11/$26.00 ©2011 IEEE
119
Indirect Branch
70
60
50
40
30
20
10
0
f ty
cr a
Fig. 1.
80
Conditional Branches
Indirect Branches
Prediction Accuracy (%)
Mispredictions Per Kilo Instructions (MPKI)
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
n
eo
k
bm
rl
pe
p
ga
ch
ng
sje rlben
pe
y
6
ge
rds
c0
vr a
ha
era
gc
po
ric
Av
f ty
cr a
n
eo
k
bm
rl
pe
p
ga
ch
ng
sje rlben
pe
y
6
ge
rds
c0
vr a
ha
era
gc
po
ric
Av
MPKI (Left) and indirect-branch prediction accuracy (Right) of 4-way 4K-entry BTB (12KB Hybrid predictor for conditional-branch prediction)
Branch Direction
Predictor
Hybrid predictor [10], Perceptrons predictor [6], and OGEHL predictor [13]–TAP Prediction reduces the indirectbranch MPKI significantly over the commonly-used BTBbased prediction. This results in improving performance by
18.19%, 21.52%, and 20.59%, respectively. Compared with
previously proposed hardware-based predictors, TAP Prediction with Perceptrons Predictor (TAP-Perceptrons scheme)
could achieve performance improvement equivalent to a
48KB TTC predictor [2], and it outperforms the VPC predictor [8] by 14.02%.
The contributions of this paper are as follows:
1) TAP Prediction requires no extra amount of dedicated
storage because it reuses the existing branch prediction
hardware. The branch direction predictor is redefined
to generate the target address pointers for various
indirect-branch occurrences, whereas the BTB stores
multiple indirect-branch targets.
2) TAP Prediction employs hardware pointers to predict
indirect-branch targets quickly. Its time cost is similar
to that of the dedicated-storage-predictors.
3) TAP Prediction is highly extensible to various branch
prediction structures. For three representative branch
direction predictors, all TAP implementations have
shown attractive performance improvements compared
to that of a commonly-used BTB prediction scheme.
Small Predictor
Branch PC
Small Predictor
Current
Branch Histories
Small Predictor
Small Predictor
Target Address Pointer
0/1
0/1
0/1
0/1
Taken/Not-Taken for Conditional Branches
Virtual Address
fTAP(PC)
BTB
Predicted Target
Fig. 2.
Overview of TAP Prediction
predictors. A simple first-stage predictor is used for easy-topredict (single-target) indirect branches, whereas a complex
second-stage predictor is used for hard-to-predict indirect
branches. The IT-TAGE predictor [14] proposed by Seznec
and Michaud is conceptually similar to a multistage cascade
structure. Their mechanism employs a base predictor and a
number of tagged tables indexed by a very long GHR, PC,
and path history. The predicted target comes from the table
with the longest history which the access hits.
Compared to those dedicated-storage predictors, VPC
Prediction [8] has been proposed recently to reduce the
cost of indirect-branch prediction using existing conditionalbranch prediction components. VPC Prediction treats an
indirect branch with T targets as T virtual direct branches,
each with its own unique target address. When fetching an
indirect branch, VPC Prediction accesses the conditionalbranch predictor iteratively. One of T virtual direct branches
is predicted as well as a conditional branch in each iteration.
This iterative process stops when a virtual direct branch is
predicted to be taken, or a predefined maximum iteration
number is reached. This technique, however, may take many
cycles to accomplish the indirect-branch prediction, thus
II. P REVIOUS W ORK
Modern processors usually employ the BTB to predict
indirect branches [9]. However, the BTB records only the
last-taken targets, which cannot distinguish various indirectbranch occurrences, resulting in poor prediction accuracy.
Chang et al. [2] first proposed Tagged Target Cache (TTC)
to use branch history information to distinguish differences
among indirect-branch occurrences. Its concept is similar to
that of a two-level direction predictor [17]. Each TTC entry
contains a target address and a tag field. When fetching
an indirect branch, the TTC predictor is indexed using the
XOR of the branch address (PC) and the global branch
history register (GHR) to provide the predicted target. When
the indirect branch retires, the corresponding TTC entry is
updated with the actual target address.
Driesen et al. [3] proposed another dedicated-storagepredictor, a cascaded predictor, by combining multiple target
120
Percentage of the number of indirect branches targets (%)
reducing its performance efficiency.
III. TARGET A DDRESS P OINTER (TAP) P REDICTION
A. Idea of TAP Prediction
The key idea of TAP Prediction is to predict a pointer,
which points to an indirect-branch target stored in the BTB,
instead of predicting a target address directly. Using pointers
has two benefits: first, it is easy to access the pointed data
when the pointers are obtained; second, it is possible to store
the pointed data in the existing storage, as those data can be
stored in distributed memory locations. Based on these two
advantages, TAP Prediction uses Target Address Pointers
as the intermediate representations of those indirect-branch
targets, separating the indirect-branch prediction flow into
two steps: first generate a target address pointer according
to each indirect-branch occurrence distinguished by branch
histories (Occurrences Mapping), then use this pointer to
fetch the predicted target (Targets Mapping).
100
90
80
70
16+
13-16
9-12
5-8
1-4
60
50
40
30
20
10
0
cr a
fty
Fig. 3.
n
eo
k
bm
rl
pe
p
ga
ch
ng
en
sj e
rl b
pe
c
gc
06
y
rds
vra
ha
po
ri c
Distribution of the number of indirect-branch targets
index the target-entries1 . In such cases, TAP Prediction
keeps the BTB access the same as that of the conventional
BTB. The fTAP (PC) could have multiple implementations.
A simple implementation is to XOR the 4 bits of the target
address pointer with the highest 4 bits and lowest 4 bits of
PC. This function distributes the generated virtual addresses
widely in the BTB, thereby reducing interference with each
other or with conditional branches.
1) Targets Mapping
Unlike conventional BTB structure, an indirect branch
in TAP Prediction occupies multiple BTB entries. Those
entries are classified as the target-entry or the allocationentry. As a program executes, TAP Prediction dynamically
allocates some BTB entries, called target-entries, to store
the encountered targets of this indirect branch. Each targetentry is indexed by a virtual address calculated from the
obtained target address pointer during the prediction. Previous analysis [8] showed that indirect branches have no more
than 16 targets in most simulated programs. In addition, the
statistics of our selected program segments shown in Figure
3 (Configurations are shown in Section IV.) illustrate that the
rate of indirect branches having more than 8 possible targets
is also relatively small. For high extensibility, TAP Prediction
allocates at most 16 target-entries for an indirect branch. If
an indirect branch has more than 16 targets, TAP Prediction
will replace the content of a used target-entry by the newly
encountered target. The allocation-entry, a specific BTB
entry, is employed by each indirect branch to record the
target-entry allocation information. This information is stored
in the lower 16 bits of the target address field. In order to
determine as quickly as possible whether the BTB has targetentries for an indirect branch, TAP Prediction places this
entry into the BTB set indexed by the branch PC.
TAP Prediction creates the target address pointers to
the target-entries of an indirect branch, implementing the
Targets Mapping. Each target-entry of an indirect branch
corresponds to a target address pointer, so there are at most
16 pointers for an indirect branch in TAP Prediction. To
reach the minimum length of those pointers, TAP Prediction
uses 4 bits to represent 16 pointers. The target address
pointers are treated as function pointers to call the function
fTAP (PC), which generates different virtual addresses to
2) Occurrences Mapping
When predicting an indirect branch, TAP Prediction uses
the redefined history-based direction predictor to detect
various indirect-branch occurrences and generate the target
address pointers, implementing the Occurrences Mapping.
To adapt to various kinds of direction predictors, TAP Prediction adopts a simple but effective method to generate target
address pointers, i.e. it reuses the existing direction predictor
to construct several small predictors, each of which predicts
one bit of the target address pointer. These small predictors,
called Sub-Predictors, perform as well as the original predictor, but use fewer branch histories. The prediction result
of each sub-predictor, ’Taken/Not-Taken’, is defined as one
bit of the target address pointer, ’1/0’. Thus, a 4-bit target
address pointer needs 4 sub-predictors. Each sub-predictor
treats the corresponding bit of the target address pointer as
its training goal. Note that constructing sub-predictors does
not change the conditional branch prediction flow. Thus, as
for conditional branches, the direction predictor performs in
the same way as its original mechanism.
This paper chooses three representative direction
predictors–Hybrid Predictor [10], Perceptrons Predictor
[6], and O-GEHL Predictor [13]–to implement the
occurrences mapping. Their detailed implementations are
listed in Table I:
TAP-Hybrid: To this kind of predictor that uses SRAMs
to construct PHTs, the SRAMs are divided into 4 groups.
1 Note that virtual addresses can conflict with conditional-branch addresses, thereby increasing aliasing and interference in the BTB. The
processor does not require any special action when aliasing happens.
Evaluations in Section V show that the adverse effects of TAP Prediction
on the BTB miss rate and conditional branch misprediction rate is tolerable.
121
Sub-Predictor 0
TAP[0] = Sign
Indirect-Branch Prediction
W1
xi = GHR[15:0]
W2
Sub-Predictor 1
TAP[1] = Sign
...
W15
Σ
W16
xi = GHR[15:0]
...
Sub-Predictor 2
TAP[2] = Sign
W31
Σ
W32
...
xi = GHR[15:0]
W47
Sub-Predictor 3
TAP[3] = Sign
W48
Σ
...
xi = GHR[15:0]
Fig. 4.
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
21:
Conditional-Branch Prediction
W0
Σ
Σ
Prediction = Sign
xi = GHR[63:0]
W63
The direction predictor of the TAP-Perceptrons scheme
void
Perl_sortsv(pTHX_ SV **array, size_t nmemb, SVCOMPARE_t cmp)
{
void (*sortsvp)(pTHX_ SV **array,
size_t nmemb,
SVCOMPARE_t cmp,
ldah $29,0($26)
U32 flags)
lda $29,0($29)
= S_mergesortsv;
ldah $1,S_qsortsv($29)
SV *hintsv;
ldq $3,0($0)
I32 hints;
lda $27,S_qsortsv($1)
hints = SORTHINTS(hintsv);
if (hints & HINT_SORT_QUICKSORT) {
sortsvp = S_qsortsv;
}
else {
sortsvp = S_mergesortsv;
}
$L273:
ldah $1,S_mergesortsv($29)
lda $27,S_mergesortsv($1)
br $31,$L276
sortsvp(aTHX_ array, nmemb, cmp, 0);
}
Fig. 5.
Each group constructs a sub-predictor with one-quarter of
PHT entries. A multiplexer is added to each SRAM to select
the corresponding index under two fetching cases, indirectbranch cases or conditional-branch cases, predicting the
target address pointer or the branch direction, respectively.
TAP-Perceptrons: To this computation-based predictor,
four new adders are employed to calculate the results of the
sub-predictors (Figure 4). The weights read from the predictor table are simultaneously calculated using the mechanisms
of the sub-predictors and the original predictor separately.
TAP-OGEHL: Because this predictor uses both PHTs and
computations to predict branch directions, a sub-predictor is
constructed with one-quarter of PHTs, additional adders and
multiplexers. According to different fetching cases, O-GEHL
selects either the original mechanism to predict branch
directions or the sub-predictors to produce the target address
pointers.
The benefits of this method are as follows: first, TAP Prediction transforms the prediction of target address pointers
into the direction predictions of multiple sub-predictors, thus
the improvements in conditional-branch prediction accuracy
can also benefit TAP Prediction. Second, the implementation
is quite simple. It needs only to copy the logic of the
existing direction predictor, including both prediction logic
and updating logic. Third, it obtains the prediction results
quickly, because all sub-predictors work simultaneously.
$L276:
jsr $26,($27),0
Code example of an indirect branch
latencies, the basic prediction steps are the same.):
In the first cycle of fetching an indirect branch, TAP
Prediction accesses the BTB and the direction predictor
simultaneously2 . If there is a BTB hit, it indicates that the
BTB has allocated target-entries for this indirect branch
so that the BTB output is the content of the allocationentry. Otherwise, TAP Prediction stalls the pipeline until
this indirect branch is resolved. Meanwhile, TAP Prediction
uses the sub-predictors to generate the target address pointer
corresponding to the detected indirect-branch occurrence,
and then calculates the virtual address through fTAP (PC).
In the next cycle, TAP Prediction accesses the BTB using
the obtained virtual address for the predicted indirect-branch
target. If there is a BTB hit, the stored target is issued to the
pipeline. Otherwise, TAP Prediction stalls fetching instructions until the actual target is calculated in the pipeline.
C. Prediction Example
In this section, we use a code example to illustrate how
TAP Prediction effectively works with target address pointers. Figure 5 shows a code example of an indirect branch
from the Perlbench benchmark in SPEC CPU 2006 suites
[15]. The indirect branch, ′ jsr$26, ($27), 0′ , implements calling the function pointer, ′ sortsvp′ . As the program executes,
different results of the conditional branch in Line 13 lead
to different values for ′ $27′ , resulting in different function
pointers. If the condition in Line 13 is true (CondT ), the
indirect-branch target is ’S qsortsv’ (TargetT ). Otherwise
(CondF), the target is ’S mergesortsv’(TargetF).
Figure 6 illustrates the prediction flow using the TAPPerceptrons scheme if there is a CondT case. When fetching
the indirect branch, all of the sub-predictors in the Perceptrons predictor use the recent 16 GHRs to calculate the
bits of the target address pointer. In CondT case, the target
address pointer, ’1101’, is generated, and then is treated
as the function pointer, calling the function ’ f1101 (PC)’ to
calculate the virtual address. This virtual address is next
issued to the BTB to fetch the pointed TargetT address.
B. Prediction Flow
A critical issue of TAP Prediction flow is to know the
fetching instruction is an indirect branch. The PPD (Prediction Probe Detector) technique [12], which is implemented
in modern high performance processors for the purpose
of eliminating unnecessary branch predictor accesses using
predecoding bits, can be extended to identify the branch
type at the beginning of instruction fetch stage. However,
some TAP implementations have exceptions, avoiding using
such mechanism. For example, since the TAP-Perceptrons
scheme can predict the branch direction for conditional
branches and the target address pointers for indirect branches
simultaneously, the branch type of the fetching instruction
can be determined after the BTB access. This is the same as
VPC prediction [8], which uses the matched BTB entry to
provide the branch type.
The TAP Prediction flow are as follows (although various
branch prediction implementations lead to different access
2 This step may take several cycles, which depends on the structure of the
direction predictor, until the target address pointer is generated.
122
TABLE I
O CCURRENCES M APPING I MPLEMENTATIONS OF B RANCH D IRECTION P REDICTORS
Structural Description
Original Configuration: 12KB, including a 32K-entry gshare PHT, and 8K-entry Bimodal and Selector, 1-cycle prediction delay.
Construct Sub-Predictors: divides the original PHT into 4 small PHTs, ignoring the outputs of Bimodal and Selector.
Each sub-predictor has 8K-entry PHT indexed by XOR result of the PC and low 13-bit of GHR.
The higher bit of each obtained saturating counter represents the bit of the target address pointer.
Original Configuration: path-based perceptrons [6], 64KB table, using 64-bit GHR as inputs. Each entry has 64 weights (wi ). 2-cycle prediction delay.
Construct Sub-Predictors: divides the 64 weights (wi ) into 4 parts. Each part constructs a sub-predictor, using the low 16-bit of
GHR and the segmented weights to calculate the output. The output is redefined as one bit of the target address pointer.
Original Configuration: 64K-bit, including 8 predictor tables. Tables are indexed with hash functions of branch history, path history,
and PC. The set of used global history lengths forms a geometric series, i.e., L( j) = α j−1 L(1). 2-cycle prediction delay.
Construct Sub-Predictors: divides 8 predictor tables into 4 sub-predictors. Each sub-predictor has 2 predictor tables and captures the recent
20 global branch histories. All tables are extended with two read ports (or implemented with 2 SRAM cells) to provide 4 signed counters
for each sub-predictor. Each sub-predictor performs as well as the O-GEHL predictor to generate one bit of the target address pointer.
Predictor Scheme
TAP-Hybrid
TAP-Perceptrons
TAP-OGEHL
Direction Predictor
PC
Sub-Predictor
Weights 48~63
1 1 0 1
PC Indexed
Σ
Weights 32~47
Σ
Weights 16~31
GHR[15:0]
Σ
Weights 0~15
f1101(PC)
TABLE II
BTB
Target Address Pointer
BASELINE PROCESSOR CONFIGURATION
Allocation-Entry
Pipeline Depth
0010_0000_0000_0010
Instruction Fetch
Target-Entry
Virtual AddressCondT TargetT
Indexed
Execution Engine
Target-Entry
Branch Predictor
TargetF
Σ
Caches
Fig. 6.
Memory
An example of TAP Prediction with the Perceptrons predictor
D. Updating Flow
bit in the allocation-entry should be cleared.
U pdating f or W rong Pointers Cases: If the virtual address hits a target, TAP Prediction compares the target to
the correct target. If they are the same, the corresponding
target address pointer is updated to point to this target-entry.
Each sub-predictor saves the bit of the correct target address
pointer for use as the training goal. Each training process is
the same as the training of the original direction predictor.
U pdating f or Meaningless Pointers Cases: At the end of
traversing, if none of the entries matches the correct target,
TAP Prediction selects a target-entry to store the correct
target. If there are unallocated target-entries, it will randomly
allocate a new target-entry and update the allocation-entry.
Otherwise, it will randomly replace a used target-entry to
store the newly encountered target. Finally, sub-predictors
are trained to generate the correct target address pointer.
The updating in correct TAP Prediction case is the same as
that of the traditional BTB, which updates the Least Recently
Used information of the BTB set containing the used targetentry. If there is a misprediction, TAP Prediction should
update the corresponding target address pointer to make it
map to the BTB entry storing the correct indirect-branch
target. There are two kinds of misprediction cases in TAP
Prediction:
∙
∙
16 stages
4-instruction per cycle; fetch and at first
pred taken branch; using PPD technique
4-wide decode/issue/execution/commit;
512-entry RUU; 128-entry LSQ
4K-entry, 4-way BTB (LRU), 1-cycle pred delay
32-entry return address stack;
Structures of the direction predictors are listed in Table I.
15-cycle min. branch misprediction penalty
32KB, 8-way, 2-cycle L1 DCache & ICache;
1MB, 16-way, 10-cycle unified L2 Cache.
All cache have 64B block size with LRU replacement policy
150-cycle memory latency (first chunk), 15-cycle(rest)
W rong Pointer: one of the target-entries stores the
correct indirect-branch target, but the target address
pointer points to a wrong target-entry. This is primarily
caused by the contentions between branch directions
and target address pointers in direction predictor.
Meaningless Pointer: none of the target-entries is storing the correct indirect-branch target, so the generated
target address pointer is meaningless. This is mainly
caused by compulsory misses of target-entries.
E. Hardware Cost and Complexity
The additions and modifications to the existing branch
prediction hardware are as follows:
∙ Add a register to store the virtual address calculated
from the target address pointer, during prediction and
updating.
∙ Implement the hardware function, f TAP (PC): the design
in Section III.A requires only two 4-bit XORs.
∙ Add a multiplexer to select the PC or the virtual address
to access the BTB.
∙ Add registers carrying the content of the allocation-entry
throughout the pipeline.
∙ Copy the prediction and training logic of the original
direction predictor to construct sub-predictors.
In order to distinguish these two kinds of mispredictions
and update the allocation-entry, the updating mechanism
needs to traverse target-entries in the BTB. The traversing
process may take several cycles. However, it is only necessary to traverse those allocated target-entries (according
to the information stored in the allocation-entry), which
significantly reduces the time costs of updating. During
traversing, TAP Prediction generates the virtual addresses
sequentially to index the allocated target-entries:
U pdating f or the allocation-entry: If a virtual address
misses a target in the BTB, it indicates that this target-entry
has been replaced by some other target, so the corresponding
123
TABLE III
C HARACTERISTICS OF THE EVALUATED BENCHMARKS
Static number of Indirect Branches
Dynamic number of Indirect Branches (K)
Indirect Branches mispredictions per kilo instructions (MPKI)
Indirect Branches Prediction Accuracy (%)
Hybrid Base IPC
Perceptrons Base IPC
O-GEHL Base IPC
crafty
11
210
1.10
48.0
1.67
1.97
1.95
eon
26
557
1.54
72.4
1.74
2.13
2.14
perlbmk
46
1176
8.43
28.4
1.38
1.48
1.49
gap
35
1237
5.51
55.5
1.50
1.53
1.54
sjeng
15
317
1.96
38.2
1.44
1.80
1.77
perlbench
24
1125
8.95
20.5
1.09
1.13
1.14
gcc06
128
393
1.77
55.0
1.46
1.80
1.80
povray
8
696
1.70
75.6
1.99
2.16
2.17
richards
5
898
4.85
46.0
1.58
1.75
1.76
Avg.
38
735
3.98
45.5
1.52
1.72
1.72
TABLE IV
T IMING PARAMETERS OF B RANCH C OMPONENTS UNDER TSMC
D OPHIN L IBRARY 65 NM (0.9V, 125∘ C)
The structure of TAP Prediction has two features: First, no
extra large table is required. Second, the logic required for
TAP Prediction and updating is simple. Especially, copying
the existing logic of the original direction predictor reduces
the design complexity significantly.
SRAM
Components
Data
Tag
Hybrid
Perceptrons
O-GEHL
BTB
IV. E XPERIMENTAL M ETHODOLOGY
We extended SimpleScalar [1] to evaluate TAP Prediction.
Table II shows the parameters of our baseline processor,
which employs the BTB to predict indirect branches. The latencies of various branch prediction strategies have been considered in our experiments. Our workload selects the indirectbranch-sensitive benchmarks, including 4 SPEC CPU 2000
INT benchmarks, 4 SPEC CPU 2006 benchmarks [15], and
one C++ benchmark–Richards [16]. Each benchmark gains
at least 5 percent performance with a perfect indirect branch
predictor. These benchmarks also have been used in previous
related researches [8] [4].
We used SimPoint [11] to select a representative simulation region for each benchmark using the reference input set.
Each benchmark was run for 100 million Alpha instructions.
All binaries were compiled with Compaq C++ V6.5-042 for
Compaq Tru64 Unix V5.1B. Table III shows the characteristics of the examined benchmarks on the baseline processors.
Timing (ns)
Cell
TSetup
TClk−to−Q
128x148
128x120
128x32
128x64
128x16
0.208
0.212
0.233
0.223
0.238
0.878
0.862
0.772
0.790
0.763
Estimated Max Latency
w/o
w/
Sub-pred
TAP
TAP
in TAP
1.112
1.112
-
0.902
1.470
1.433
0.902
1.470
1.433
0.802
1.400
1.163
B. MPKI Impacts and Performance Improvement
Figure 7 shows the indirect branch MPKI of the baseline
and the TAP schemes. All TAP schemes for three direction
predictors reduce the indirect-branch MPKI significantly. For
example, the TAP-Perceptrons scheme reduces the average
MPKI from 3.98 to 1.03. Table V shows the conditionalbranch MPKI impacts. Since TAP Prediction reuses existing
branch components, it is inevitable to have adverse impacts
on conditional branches. However, those impacts are relatively small compared to the TAP’s significant indirectbranch MPKI improvement. The results show that the impacts on conditional branch prediction vary under those three
TAP schemes. This is because the mechanisms in Perceptrons
and O-GEHL, which are inherently used to avoid aliasing
problems and contentions for conditional branches, are also
effective to avoid the interference between predicting branch
directions and target address pointers.
The MPKI reduction leads to attractive IPC improvement,
as shown in Figure 8. Because TAP Prediction does not
increase the latency of the branch prediction, IPC evaluation
generally reflects the performance impact. The IPC improvements of TAP-Hybrid, TAP-Perceptrons and TAP-OGEHL
are 18.19%, 21.52%, and 20.59%, respectively. We have
also experimented with a perfect indirect-branch predictor
for each baseline, which predicts all the indirect-branch
targets with 100% accuracy. The TAP schemes achieve
76.84%, 85.09%, and 82.35%, respectively, of the maximum
performance improvements provided by the ideal indirectbranch predictors.
The effect of the updating mechanism in TAP-Perceptrons
scheme is listed in Table VI. The result shows that it takes
only 1.80 cycles for each indirect-branch update averagely.
Also, if we assume that each update can be finished one cycle
(w/o additional cycles), it can only improve performance
by 0.19% averagely. Therefore, although the target-entries
traversing makes the updating a little complex, it would not
V. R ESULTS AND A NALYSIS
A. Timing Estimation
To evaluate the timing impact of TAP Prediction, we
modeled the TAP schemes by extending the front-end of
a commercial 64-bit superscalar processor. This fully synthesizable processor runs about 1GHz under TSMC 65nm
technology. It can issue and complete three instructions
per clock cycle. Instructions complete in order, but execute
out of order. Its dynamic branch prediction employs a 2Kentry, 2-level direction predictor and a 512-entry, 4-way
set-associative BTB, which has 1-cycle access latency. The
implementations and their timing parameters are listed in
Table IV3 . The evaluation results prove that constructing
sub-predictors does not increase the latency of the branch
prediction hardware. This is because a sub-predictor uses
fewer predictor tables (for fewer GHRs), thereby saving some
multiplexors or reducing the number of adding operands,
compared to the original branch predictor.
3 To estimate the timing of TAP Prediction, we extended the existing components to implement the datapath of various branch prediction components.
Those structures are fully synthesizable designs instead of the traditional
customized designs. Thus, the estimated timing, such as BTB’s latency,
would be longer than that in a real chip.
124
TABLE V
C ONDITIONAL -B RANCH MPKI C OMPARISONS
Configuration
Hybrid Predictor
Perceptrons Predictor
OGEHL Predictor
crafty
7.67
8.46
4.99
5.02
5.04
5.27
5.84
baseline
TAP
baseline
TAP
VPC
baseline
TAP
eon
5.98
6.64
6.33
6.36
6.37
6.25
6.36
perlbmk
5.79
7.37
8.40
8.44
12.70
8.26
8.97
gap
0.94
1.50
2.84
2.86
12.91
2.19
2.76
Baseline
TAP-Hybrid
TAP-Perceptrons
TAP-OGEHL
7
BTB Entries
8K
4K
2K
1K
512
6
MPKI
perlbench
3.22
3.72
1.25
1.25
1.25
0.78
1.27
gcc06
5.80
6.56
4.74
4.85
4.77
4.00
4.70
povray
1.89
2.00
3.49
3.49
7.47
3.14
3.14
richards
4.51
4.63
3.14
3.14
3.15
3.10
3.11
Avg.
5.41
6.14
4.93
4.97
7.01
4.76
5.26
TABLE VII
E FFECT OF DIFFERENT BTB SIZES IN TAP-P ERCEPTRONS
9
8
sjeng
12.90
14.38
9.21
9.33
9.40
9.85
11.23
5
4
Baseline IB MPKI
3.96
3.97
4.02
4.05
4.21
TAP IB MPKI
1.01
1.03
1.15
1.26
1.41
IPC Δ (%)
21.63
21.52
21.09
20.28
18.19
3
2
The TTC predictor [2] is a representative dedicatedstorage-predictor. We simulated various sizes of the TTC
predictor, from 256-entry to 64K-entry. Each entry of a TTC
predictor has a 4-byte target and a 2-byte tag, giving a total
size of 1.5KB to 384KB respectively. On average, the TAPPerceptrons predictor achieves the equivalent performance
provided by a 48KB TTC predictor. In five benchmarks,
gap, s jeng, gcc06, povray, and richards, the TAP predictor
performs at least as well as a 96KB TTC predictor.
We also compared the efficiency of the VPC predictor [8]
and the TAP predictor. In our experiments, the maximum
iteration number of VPC Prediction is set to 12. In addition,
since the LFU training algorithm of VPC prediction is
difficult to implement in hardware, we adopt a simple way for
updating: insert the newly encountered indirect-branch target
sequentially at the tail of the access sequence. We found that
VPC Prediction in our evaluations usually requires 3 to 6
iterations to finish predictions, which is less effective than
the results in [8]. Besides the impacts of the simple training
algorithm, another reason is the differences of the ISA and
the characteristics of the selected program segments. The
efficiency of VPC Prediction depends on the access order
of the frequently-predicted indirect-branch targets. In our
selected program segments, since the frequently-predicted
targets usually appear later than some other targets, they are
inserted into the middle or at the tail of the access sequence.
In our experiments, the TAP predictor outperforms a
VPC predictor by 14.02%. Besides the total indirect-branch
MPKI of VPC Prediction, we also evaluated its MPKI if
predictions are finished in 2 cycles (2-cycle-MPKI). The
results illustrate that although the total indirect-branch MPKI
of VPC Prediction is similar to that of TAP Prediction,
the 2-cycle-MPKI increases significantly, especially for the
benchmarks with higher numbers of dynamic targets. This
finding is similar to the analysis in research [7]. It indicates
that VPC Prediction requires more cycles to reach the same
accuracy as that of TAP Prediction. On the other hand, as
shown in Table V, VPC’s adverse impact on conditional
branch predictions is also greater than that of TAP Prediction.
1
0
f ty
cr a
Fig. 7.
n
eo
mk
rl b
pe
p
ga
ch
ng
sje rlben
pe
c
gc
06
y
ge
rds
vr a
ha
era
po
ric
Av
Indirect-branch MPKI impacts of TAP predictors
TABLE VI
U PDATE P ENALTIES IN TAP-P ERCEPTRONS S CHEME : I NDIRECT
B RANCH (IB), IPC I MPROVEMENT WITHOUT ADDITIONAL CYCLES (IPC
W / IDEAL UPDATE )
crafty
eon
perlbmk
gap
sjeng
perlbench
gcc06
povray
richards
average
IB Update
Num
Cycles
60990
119049
27720
27812
174404
925781
679
365
79359
174622
436313
2654845
50615
108615
49702
39907
5677
8782
35898
64645
Cycles
per IB update
1.95
1.00
5.31
0.54
2.20
6.08
2.15
0.80
1.55
1.80
IPC w/ ideal
update (%)
0.22
0.08
2.05
0.00
0.32
3.81
0.33
0.10
0.07
0.19
affect performance significantly.
C. Sensitivity to Different BTB Sizes
Since indirect-branch targets are stored in the BTB, we
also evaluated the effects of different BTB sizes. We take the
TAP-Perceptrons scheme as an example. The results in Table
VII show that TAP Prediction could provide significantly
performance improvement even with small BTB sizes. This
is because TAP Prediction occupies only the limited number
of BTB entries to store indirect-branch targets dynamically.
D. Comparison with Other Indirect-Branch Predictors
As an example, we choose the TAP-Perceptrons predictor
to compare with other hardware-based indirect-branch predictors, the TTC predictor and the VPC predictor. Figure
9 and 10 illustrate the comparisons of the indirect-branch
MPKI and the IPC improvement.
125
55
55
TAP-Hybrid
Ideal-Hybrid
50
45
40
40
30
25
20
IPC Delta (%)
45
40
35
30
25
20
20
15
10
10
5
5
0
0
n
eo
mk
rlb
pe
ga
ch
ng
sje rlben
pe
p
y
6
ge
rds
c0
vra
ha
er a
gc
po
ric
Av
5
0
fty
c ra
n
eo
mk
rlb
pe
p
ga
ch
ng
sje rlben
pe
7
6
4
3
2
1
0
n
eo
m
rl b
pe
Fig. 9.
k
p
ga
ch
ng
sje rlben
pe
c
gc
06
35
30
20
15
10
5
0
mk
rl b
pe
Fig. 10.
p
ga
sje
ch
en
rl b
pe
ng
ch
ng
sje rlben
pe
y
6
ge
rds
c0
vra
ha
era
gc
po
ric
Av
R EFERENCES
25
n
eo
p
ga
[1] D. Burger, et al. The SimpleScalar Tool Set, Version 2.0. SIGARCH
Comput. Archit. News, 1997.
[2] P. Chang, et al. Target prediction for indirect jumps. In ISCA-24, 1997.
[3] K. Driesen, et al. The cascaded predictor: Economical and adaptive
branch target prediction. In MICRO-31, 1998.
[4] M. Farooq, et al. Value Based BTB Indexing for Indirect Jump
Prediction. In HPCA-2010.
[5] S. Gochman, et al. The Intel Pentium M processor: Microarchitecture
and performance. Intel Technology Journal, 7(2), May 2003.
[6] D. A. Jiménez. Fast Path-Based Neural Branch Prediction. In MICRO36, 2003.
[7] J. A. Joao, et al. Improving the Performance of Object-Oriented
Languages with Dynamic Predication of Indirect Jumps. In ASPLOS13, pages 80-90, 2008.
[8] H. Kim, et al. VPC Prediction: Reducing the Cost of Indirect Branches
via Hardware-Based Dynamic Devirtualization. In ISCA-34, pages
424-435, 2007.
[9] J.K.F. Lee, et al. Branch prediction strategies and branch target buffer
design. IEEE Computer, pages 6-22 January 1984.
[10] S. McFarling. Combining branch predictors. Technical Report TN-36,
Digital Western Research Laboratory, June 1993.
[11] E. Perelman, et al. Using SimPoint for Accurate and Efficient Simulation. In SIGMETRICS ’03, pages 318-319, 2003.
[12] D. Parikh, et al. Power-Aware Branch Prediction: Characterization and
Design. IEEE Transactions on Computers, Vol.53, No.2, Feb. 2004.
[13] A. Seznec. Analysis of the O-GEometric History Length branch
predictor. In ISCA-32, 2005.
[14] A. Seznec et al. A case for (partially) TAgged GEometric history
length branch prediction. JILP, 8, Feb. 2006.
[15] SPEC. Standard Performance Evaluation Corporation. http://
www.spec.org
[16] M. Wolczko. Benchmarking Java with the Richards benchmark. http://research.sun.com/people/mario/java_
benchmarking/richards/richards.html
[17] T.-Y. Yeh et al. Two-Level Adaptive Training Branch Prediction. In
MICRO-24, pages 51-61, 1991.
TTC-1.5KB
TTC-3KB
TTC-6KB
TTC-12KB
TTC-24KB
TTC-48KB
TTC-96KB
TTC-192KB
TTC-384KB
TAP-Perceptrons
VPCPrediction
40
mk
rlb
pe
The authors would like to thank Youn-Long Lin and Helen
Gordon for their help with the paper review.
Comparisons of indirect-branch MPKI
45
n
eo
VII. ACKNOWLEDGMENTS
y
ge
rds
vr a
ha
era
po
ric
Av
50
fty
c ra
an indirect branch. This pointer is used to calculate a virtual
address indexing the predicted target stored in the BTB. With
the help of such pointers, TAP Prediction achieves time cost
similar to that of dedicated-storage-predictors without additional dedicated storage. TAP Prediction is highly extensible
to various branch prediction structures. As targets to three
representative direction predictors, TAP Prediction reduces
the indirect-branch MPKI significantly and has shown attractive performance improvements compared to the commonlyused BTB scheme.
5
ty
y
6
ge
rds
c0
vra
ha
era
gc
po
ric
Av
IPC improvement of TAP Predictors
Baseline-Perceptrons
TTC-1.5KB
TTC-3KB
TTC-6KB
TTC-12KB
TTC-24KB
TTC-48KB
TTC-96KB
TTC-192KB
TTC-384KB
TAP-Perceptrons
VPC Prediction
VPC in 2 cycles
8
MPKI
25
10
9
IPC Delta (%)
30
15
Fig. 8.
f ty
cr a
35
15
fty
c ra
f
cr a
50
45
35
TAP-OGEHL
Ideal-OGEHL
55
TAP-Perceptrons
Ideal-Perceptrons
50
IPC Delta (%)
IPC delta (%)
60
60
60
y
6
ge
rds
c0
vr a
ha
er a
gc
po
ri c
Av
Comparisons of IPC improvements
VPC Prediction generates and predicts a virtual branch in
each cycle. Thus, for one branch occurrence, VPC Prediction
usually occupies more table entries of the direction predictor
for those virtual addresses. In contrast, the TAP-Perceptrons
scheme takes only one entry for an indirect branch.
VI. CONCLUSION
In this paper, we have proposed and evaluated a fast
hardware-reusing mechanism of indirect-branch prediction,
called TAP Prediction. It reuses the original direction predictor to construct several sub-predictors, which are used to
generate the bits of the target address pointer when fetching
126
Download