TAP Prediction: Reusing Conditional Branch Predictor for Indirect Branches with Target Address Pointers
Zichao Xie, Dong Tong, Mingkai Huang, Xiaoyin Wang, Qinqing Shi, Xu Cheng
Microprocessor Research & Development Center, Peking University, Beijing, China
{xiezichao, tongdong, huangmingkai, wangxiaoyin, shiqinqing, chengxu}@mprc.pku.edu.cn

They are used to implement several programming constructs, including virtual function calls, switch-case statements and function pointers. Realizing the importance of indirect-branch prediction, designers are beginning to develop high-performance indirect-branch predictors in recent commercial processors [5]. Previously proposed hardware-based indirect-branch predictors use branch history to distinguish various occurrences of indirect branches. Some predictors place the targets corresponding to various branch occurrences directly into a large dedicated storage structure. These kinds of predictors [2] [3] [14], called dedicated-storage-predictors in this paper, can predict targets very quickly. However, their storage requirement takes up significant die area, which translates into extra power consumption. A representative hardware-reusing predictor, VPC Prediction [8], reuses only existing branch prediction hardware. This method achieves high prediction accuracy, but it may take many cycles to finish a prediction, thus reducing its performance improvement. The goal of our technique is to achieve time cost similar to that of dedicated-storage-predictors, without requiring additional large amounts of storage. This paper proposes a hardware-reusing indirect-branch prediction strategy, called Target Address Pointer (TAP) Prediction. TAP Prediction reuses the history-based direction predictor to detect various indirect-branch occurrences, and stores multiple indirect-branch targets in the BTB. It employs the Target Address Pointers to establish the mapping from the indirect-branch occurrences to the stored indirect-branch targets.
As shown in Figure 2, to generate such pointers according to the current branch history, TAP Prediction reuses the storage of the existing direction predictor to construct several small predictors. Each small predictor, called a sub-predictor, performs as well as the original predictor, but uses fewer branch histories. Its output is defined as one bit of the target address pointer. When fetching an indirect branch, TAP Prediction uses the sub-predictors to generate the bits of the target address pointer, ignoring the result from the original predictor. Then TAP Prediction selects the virtual address calculated from this pointer to index the BTB for the predicted target. Through hardware reusing, TAP Prediction employs target address pointers to achieve fast indirect-branch prediction without requiring additional storage. Our experimental results show that for three representative direction predictors–

Abstract— Indirect-branch prediction is becoming more important for modern processors as more programs are written in object-oriented languages. Previous hardware-based indirect-branch predictors generally require significant hardware storage or use aggressive algorithms which make the processor front-end more complex. In this paper, we propose a fast and cost-efficient indirect-branch prediction strategy, called Target Address Pointer (TAP) Prediction. TAP Prediction reuses the history-based branch direction predictor to detect occurrences of indirect branches, and then stores indirect-branch targets in the Branch Target Buffer (BTB). The key idea of TAP Prediction is to predict the Target Address Pointers, which generate virtual addresses to index the targets stored in the BTB, rather than to predict the indirect-branch targets directly. TAP Prediction also reuses the branch direction predictor to construct several small predictors. When fetching an indirect branch, these small predictors work in parallel to generate the target address pointer.
Then TAP Prediction accesses the BTB to fetch the predicted indirect-branch target using the generated virtual address. This mechanism could achieve time cost comparable to that of dedicated-storage-predictors, without requiring additional large amounts of storage. Our evaluation shows that for three representative direction predictors–Hybrid, Perceptrons, and O-GEHL–TAP schemes improve performance by 18.19%, 21.52%, and 20.59%, respectively, over the baseline processor with the most commonly-used BTB prediction. Compared with previous hardware-based indirect-branch predictors, the TAP-Perceptrons scheme achieves performance improvement equivalent to that provided by a 48KB TTC predictor, and it also outperforms the VPC predictor by 14.02%.

I. INTRODUCTION

Modern high-performance processors employ the branch prediction component as an essential part to exploit instruction-level parallelism. Previous branch prediction researchers have focused on predicting conditional or unconditional direct branches accurately [10] [6] [13]. Unlike direct-branch prediction, it is more difficult to achieve high prediction accuracy for indirect-branch prediction, as it requires predicting the target address instead of the branch direction. A commonly-used Branch Target Buffer (BTB) can predict only the last-taken targets of indirect branches. Figure 1 shows the poor indirect-branch prediction accuracy for the simulated benchmarks with a 4-way 4K-entry BTB. Another trend is that indirect branches are becoming increasingly common in programs written in object-oriented languages.

This work was supported by the National High Technology Research and Development Program of China under Grant No. 2009ZX01029-001-002.
978-1-4577-1954-7/11/$26.00 ©2011 IEEE
Fig. 1. MPKI (left) and indirect-branch prediction accuracy (right) of a 4-way 4K-entry BTB (12KB Hybrid predictor for conditional-branch prediction)

Hybrid predictor [10], Perceptrons predictor [6], and O-GEHL predictor [13]–TAP Prediction reduces the indirect-branch MPKI significantly over the commonly-used BTB-based prediction. This results in improving performance by 18.19%, 21.52%, and 20.59%, respectively. Compared with previously proposed hardware-based predictors, TAP Prediction with the Perceptrons predictor (the TAP-Perceptrons scheme) could achieve performance improvement equivalent to a 48KB TTC predictor [2], and it outperforms the VPC predictor [8] by 14.02%. The contributions of this paper are as follows:
1) TAP Prediction requires no extra amount of dedicated storage because it reuses the existing branch prediction hardware. The branch direction predictor is redefined to generate the target address pointers for various indirect-branch occurrences, whereas the BTB stores multiple indirect-branch targets.
2) TAP Prediction employs hardware pointers to predict indirect-branch targets quickly. Its time cost is similar to that of the dedicated-storage-predictors.
3) TAP Prediction is highly extensible to various branch prediction structures. For three representative branch direction predictors, all TAP implementations have shown attractive performance improvements compared to that of a commonly-used BTB prediction scheme.
Fig. 2. Overview of TAP Prediction

A simple first-stage predictor is used for easy-to-predict (single-target) indirect branches, whereas a complex second-stage predictor is used for hard-to-predict indirect branches. The IT-TAGE predictor [14] proposed by Seznec and Michaud is conceptually similar to a multistage cascade structure. Their mechanism employs a base predictor and a number of tagged tables indexed by a very long GHR, the PC, and the path history. The predicted target comes from the table with the longest history for which the access hits. Compared to those dedicated-storage-predictors, VPC Prediction [8] has been proposed recently to reduce the cost of indirect-branch prediction by using existing conditional-branch prediction components. VPC Prediction treats an indirect branch with T targets as T virtual direct branches, each with its own unique target address. When fetching an indirect branch, VPC Prediction accesses the conditional-branch predictor iteratively. In each iteration, one of the T virtual direct branches is predicted in the same way as a conditional branch. This iterative process stops when a virtual direct branch is predicted to be taken, or when a predefined maximum iteration number is reached. This technique, however, may take many cycles to accomplish the indirect-branch prediction, thus reducing its performance efficiency.

II. PREVIOUS WORK

Modern processors usually employ the BTB to predict indirect branches [9]. However, the BTB records only the last-taken targets, which cannot distinguish various indirect-branch occurrences, resulting in poor prediction accuracy. Chang et al. [2] first proposed the Tagged Target Cache (TTC), which uses branch history information to distinguish differences among indirect-branch occurrences.
Its concept is similar to that of a two-level direction predictor [17]. Each TTC entry contains a target address and a tag field. When fetching an indirect branch, the TTC predictor is indexed using the XOR of the branch address (PC) and the global branch history register (GHR) to provide the predicted target. When the indirect branch retires, the corresponding TTC entry is updated with the actual target address. Driesen et al. [3] proposed another dedicated-storage-predictor, a cascaded predictor, which combines multiple target predictors.

III. TARGET ADDRESS POINTER (TAP) PREDICTION

A. Idea of TAP Prediction

The key idea of TAP Prediction is to predict a pointer, which points to an indirect-branch target stored in the BTB, instead of predicting a target address directly. Using pointers has two benefits: first, it is easy to access the pointed-to data once the pointers are obtained; second, it is possible to store the pointed-to data in existing storage, as those data can be stored in distributed memory locations. Based on these two advantages, TAP Prediction uses Target Address Pointers as intermediate representations of the indirect-branch targets, separating the indirect-branch prediction flow into two steps: first generate a target address pointer according to each indirect-branch occurrence distinguished by branch histories (Occurrences Mapping), then use this pointer to fetch the predicted target (Targets Mapping).

Fig. 3. Distribution of the number of indirect-branch targets (buckets: 1-4, 5-8, 9-12, 13-16, and 16+)

index the target-entries1. In such cases, TAP Prediction keeps the BTB access the same as that of the conventional BTB. The fTAP(PC) function could have multiple implementations.
A simple implementation is to XOR the 4 bits of the target address pointer with the highest 4 bits and the lowest 4 bits of the PC. This function distributes the generated virtual addresses widely in the BTB, thereby reducing interference with each other or with conditional branches.

1) Targets Mapping: Unlike the conventional BTB structure, an indirect branch in TAP Prediction occupies multiple BTB entries. Those entries are classified as target-entries or the allocation-entry. As a program executes, TAP Prediction dynamically allocates some BTB entries, called target-entries, to store the encountered targets of this indirect branch. Each target-entry is indexed by a virtual address calculated from the obtained target address pointer during the prediction. Previous analysis [8] showed that indirect branches have no more than 16 targets in most simulated programs. In addition, the statistics of our selected program segments shown in Figure 3 (configurations are shown in Section IV) illustrate that the rate of indirect branches having more than 8 possible targets is also relatively small. For high extensibility, TAP Prediction allocates at most 16 target-entries for an indirect branch. If an indirect branch has more than 16 targets, TAP Prediction will replace the content of a used target-entry with the newly encountered target. The allocation-entry, a specific BTB entry, is employed by each indirect branch to record the target-entry allocation information. This information is stored in the lower 16 bits of the target address field. In order to determine as quickly as possible whether the BTB has target-entries for an indirect branch, TAP Prediction places this entry into the BTB set indexed by the branch PC. TAP Prediction creates the target address pointers to the target-entries of an indirect branch, implementing the Targets Mapping.
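The XOR-based fTAP(PC) described above can be sketched as follows. This is a minimal illustration, assuming a 32-bit PC and splicing the mixed nibble back into the low address bits; the exact bit widths and wiring are assumptions, not the paper's RTL:

```python
def f_tap(pc: int, pointer: int) -> int:
    """Sketch of f_TAP(PC): XOR the 4-bit target address pointer with the
    highest 4 bits and the lowest 4 bits of the PC.  The 32-bit PC width
    and the way the result folds into the address are assumptions."""
    assert 0 <= pointer < 16      # 4-bit pointer: at most 16 target-entries
    hi4 = (pc >> 28) & 0xF        # highest 4 bits of an assumed 32-bit PC
    lo4 = pc & 0xF                # lowest 4 bits of the PC
    mixed = hi4 ^ lo4 ^ pointer   # the two 4-bit XORs of Section III
    # Splice the mixed nibble into the low bits to form the virtual
    # address used to index the BTB for the pointed target-entry.
    return (pc & ~0xF) | mixed
```

For a fixed branch PC, the 16 possible pointers yield 16 distinct virtual addresses, so the target-entries of one indirect branch do not collide with each other.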
Each target-entry of an indirect branch corresponds to a target address pointer, so there are at most 16 pointers for an indirect branch in TAP Prediction. To reach the minimum length of those pointers, TAP Prediction uses 4 bits to represent the 16 pointers. The target address pointers are treated as function pointers to call the function fTAP(PC), which generates different virtual addresses to

2) Occurrences Mapping: When predicting an indirect branch, TAP Prediction uses the redefined history-based direction predictor to detect various indirect-branch occurrences and generate the target address pointers, implementing the Occurrences Mapping. To adapt to various kinds of direction predictors, TAP Prediction adopts a simple but effective method to generate target address pointers, i.e., it reuses the existing direction predictor to construct several small predictors, each of which predicts one bit of the target address pointer. These small predictors, called Sub-Predictors, perform as well as the original predictor, but use fewer branch histories. The prediction result of each sub-predictor, 'Taken'/'Not-Taken', is defined as one bit of the target address pointer, '1'/'0'. Thus, a 4-bit target address pointer needs 4 sub-predictors. Each sub-predictor treats the corresponding bit of the target address pointer as its training goal. Note that constructing sub-predictors does not change the conditional-branch prediction flow. Thus, for conditional branches, the direction predictor performs in the same way as its original mechanism. This paper chooses three representative direction predictors–Hybrid Predictor [10], Perceptrons Predictor [6], and O-GEHL Predictor [13]–to implement the occurrences mapping. Their detailed implementations are listed in Table I.

TAP-Hybrid: For this kind of predictor, which uses SRAMs to construct PHTs, the SRAMs are divided into 4 groups.
1 Note that virtual addresses can conflict with conditional-branch addresses, thereby increasing aliasing and interference in the BTB. The processor does not require any special action when aliasing happens. Evaluations in Section V show that the adverse effects of TAP Prediction on the BTB miss rate and the conditional-branch misprediction rate are tolerable.

Fig. 4. The direction predictor of the TAP-Perceptrons scheme (four sub-predictors, each taking the sign of a weighted sum over xi = GHR[15:0] with weight segments W0-W15, W16-W31, W32-W47, and W48-W63 to produce TAP[0]-TAP[3], alongside the original conditional-branch prediction, the sign of the full sum over xi = GHR[63:0])

Fig. 5. Code example of an indirect branch. The C source:

 1: void
 2: Perl_sortsv(pTHX_ SV **array, size_t nmemb,
 3:             SVCOMPARE_t cmp)
 4: {
 5:     void (*sortsvp)(pTHX_ SV **array,
 6:                     size_t nmemb,
 7:                     SVCOMPARE_t cmp,
 8:                     U32 flags)
 9:         = S_mergesortsv;
10:     SV *hintsv;
11:     I32 hints;
12:     hints = SORTHINTS(hintsv);
13:     if (hints & HINT_SORT_QUICKSORT) {
14:         sortsvp = S_qsortsv;
15:     }
16:     else {
17:         sortsvp = S_mergesortsv;
18:     }
19:
20:     sortsvp(aTHX_ array, nmemb, cmp, 0);
21: }

and the corresponding Alpha assembly:

    ldah $29,0($26)
    lda $29,0($29)
    ldah $1,S_qsortsv($29)
    ldq $3,0($0)
    lda $27,S_qsortsv($1)
$L273:
    ldah $1,S_mergesortsv($29)
    lda $27,S_mergesortsv($1)
    br $31,$L276
$L276:
    jsr $26,($27),0

Each group constructs a sub-predictor with one-quarter of the PHT entries. A multiplexer is added to each SRAM to select the corresponding index under the two fetching cases, the indirect-branch case and the conditional-branch case, predicting the target address pointer or the branch direction, respectively.

TAP-Perceptrons: For this computation-based predictor, four new adders are employed to calculate the results of the sub-predictors (Figure 4). The weights read from the predictor table are processed simultaneously by the mechanisms of the sub-predictors and of the original predictor.
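The TAP-Perceptrons occurrence mapping just described can be sketched as follows; the bipolar input encoding and list-based weight storage are illustrative assumptions, not the paper's RTL:

```python
def tap_pointer_perceptrons(weights, ghr_bits):
    """Sketch of the TAP-Perceptrons occurrence mapping: the 64 weights of
    one perceptron row are split into 4 groups of 16; each group is a
    sub-predictor that dots its weights with GHR[15:0], and the sign of
    the sum becomes one bit of the 4-bit target address pointer.
    `weights`: 64 signed ints; `ghr_bits`: 16 ints in {0, 1}."""
    assert len(weights) == 64 and len(ghr_bits) == 16
    # Map history bits to the usual bipolar {-1, +1} perceptron inputs.
    x = [1 if b else -1 for b in ghr_bits]
    pointer = 0
    for i in range(4):                       # one sub-predictor per pointer bit
        group = weights[i * 16:(i + 1) * 16]
        s = sum(w * xi for w, xi in zip(group, x))
        bit = 1 if s >= 0 else 0             # 'Taken' -> 1, 'Not-Taken' -> 0
        pointer |= bit << i
    return pointer                           # 4-bit target address pointer
```

All four sums can be computed in parallel in hardware, which is why the pointer is available as quickly as a single direction prediction.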
TAP-OGEHL: Because this predictor uses both PHTs and computations to predict branch directions, a sub-predictor is constructed with one-quarter of the PHTs, additional adders, and multiplexers. According to the fetching case, O-GEHL selects either the original mechanism to predict branch directions or the sub-predictors to produce the target address pointers.

The benefits of this method are as follows: first, TAP Prediction transforms the prediction of target address pointers into the direction predictions of multiple sub-predictors, so improvements in conditional-branch prediction accuracy can also benefit TAP Prediction. Second, the implementation is quite simple: it needs only to copy the logic of the existing direction predictor, including both the prediction logic and the updating logic. Third, it obtains the prediction results quickly, because all sub-predictors work simultaneously.

The TAP Prediction flow is as follows (although various branch prediction implementations lead to different access latencies, the basic prediction steps are the same): In the first cycle of fetching an indirect branch, TAP Prediction accesses the BTB and the direction predictor simultaneously2. If there is a BTB hit, it indicates that the BTB has allocated target-entries for this indirect branch, so the BTB output is the content of the allocation-entry. Otherwise, TAP Prediction stalls the pipeline until this indirect branch is resolved. Meanwhile, TAP Prediction uses the sub-predictors to generate the target address pointer corresponding to the detected indirect-branch occurrence, and then calculates the virtual address through fTAP(PC). In the next cycle, TAP Prediction accesses the BTB using the obtained virtual address for the predicted indirect-branch target. If there is a BTB hit, the stored target is issued to the pipeline. Otherwise, TAP Prediction stalls fetching instructions until the actual target is calculated in the pipeline.
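The two-step flow above can be sketched as follows, with the BTB modeled as a plain dictionary; `gen_pointer` and `f_tap` stand in for the sub-predictors and the hash function, and a `None` return means the front-end stalls:

```python
def tap_predict(pc, btb, gen_pointer, f_tap):
    """Sketch of the two-step TAP prediction flow (dict-based BTB model)."""
    # Cycle 1: probe the BTB at the branch PC while the sub-predictors
    # run in parallel.  A hit returns the allocation-entry, whose low
    # 16 bits record which target-entries have been allocated.
    alloc_entry = btb.get(pc)
    if alloc_entry is None:
        return None                   # no target-entries yet: stall until resolved
    pointer = gen_pointer()           # 4-bit pointer from the sub-predictors
    vaddr = f_tap(pc, pointer)
    # Cycle 2: access the BTB again with the virtual address; a hit
    # supplies the predicted indirect-branch target.
    return btb.get(vaddr)             # None here also means a stall
```

For example, with an allocation-entry at PC 0x40 and one target-entry at virtual address 0x41, a generated pointer of 1 predicts the stored target.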
C. Prediction Example

In this section, we use a code example to illustrate how TAP Prediction works with target address pointers. Figure 5 shows a code example of an indirect branch from the Perlbench benchmark in the SPEC CPU 2006 suite [15]. The indirect branch, 'jsr $26,($27),0', implements calling the function pointer 'sortsvp'. As the program executes, different results of the conditional branch in Line 13 lead to different values of '$27', resulting in different function pointers. If the condition in Line 13 is true (CondT), the indirect-branch target is 'S_qsortsv' (TargetT). Otherwise (CondF), the target is 'S_mergesortsv' (TargetF).

Figure 6 illustrates the prediction flow using the TAP-Perceptrons scheme in the CondT case. When fetching the indirect branch, all of the sub-predictors in the Perceptrons predictor use the most recent 16 GHR bits to calculate the bits of the target address pointer. In the CondT case, the target address pointer '1101' is generated, and is then treated as a function pointer, calling the function 'f1101(PC)' to calculate the virtual address. This virtual address is next issued to the BTB to fetch the pointed TargetT address.

B. Prediction Flow

A critical issue of the TAP Prediction flow is to know whether the fetching instruction is an indirect branch. The PPD (Prediction Probe Detector) technique [12], which is implemented in modern high-performance processors to eliminate unnecessary branch predictor accesses using predecoding bits, can be extended to identify the branch type at the beginning of the instruction fetch stage. However, some TAP implementations can avoid using such a mechanism. For example, since the TAP-Perceptrons scheme can predict the branch direction for conditional branches and the target address pointers for indirect branches simultaneously, the branch type of the fetching instruction can be determined after the BTB access.
This is the same as VPC Prediction [8], which uses the matched BTB entry to provide the branch type.

2 This step may take several cycles, depending on the structure of the direction predictor, until the target address pointer is generated.

TABLE I
OCCURRENCES MAPPING IMPLEMENTATIONS OF BRANCH DIRECTION PREDICTORS

TAP-Hybrid. Original configuration: 12KB, including a 32K-entry gshare PHT and an 8K-entry Bimodal and Selector; 1-cycle prediction delay. Sub-predictor construction: divides the original PHT into 4 small PHTs, ignoring the outputs of the Bimodal and the Selector. Each sub-predictor has an 8K-entry PHT indexed by the XOR of the PC and the low 13 bits of the GHR. The higher bit of each obtained saturating counter represents one bit of the target address pointer.

TAP-Perceptrons. Original configuration: path-based perceptrons [6], 64KB table, using a 64-bit GHR as input; each entry has 64 weights (wi); 2-cycle prediction delay. Sub-predictor construction: divides the 64 weights (wi) into 4 parts. Each part constructs a sub-predictor, using the low 16 bits of the GHR and the segmented weights to calculate its output. The output is redefined as one bit of the target address pointer.

TAP-OGEHL. Original configuration: 64K-bit, including 8 predictor tables. Tables are indexed with hash functions of branch history, path history, and PC. The set of used global history lengths forms a geometric series, i.e., L(j) = α^(j−1) L(1); 2-cycle prediction delay. Sub-predictor construction: divides the 8 predictor tables into 4 sub-predictors. Each sub-predictor has 2 predictor tables and captures the most recent 20 global branch histories. All tables are extended with two read ports (or implemented with 2 SRAM cells) to provide 4 signed counters for each sub-predictor. Each sub-predictor performs as the O-GEHL predictor does to generate one bit of the target address pointer.
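The TAP-Hybrid row of Table I can be sketched as follows; the list-of-counters PHT representation is an illustrative assumption:

```python
def tap_pointer_hybrid(phts, pc, ghr):
    """Sketch of the TAP-Hybrid occurrence mapping: four 8K-entry sub-PHTs,
    each indexed by PC XOR GHR[12:0]; the high bit of the 2-bit saturating
    counter read from sub-PHT i becomes bit i of the target address pointer.
    `phts` is four lists of 8192 counters in the range 0..3."""
    index = (pc ^ (ghr & 0x1FFF)) & 0x1FFF    # 13-bit gshare-style index
    pointer = 0
    for i, pht in enumerate(phts):            # the 4 sub-PHTs are read in parallel
        counter = pht[index]                  # 2-bit saturating counter
        pointer |= (counter >> 1) << i        # MSB ('Taken') -> pointer bit i
    return pointer
```

Training each sub-PHT entry is then the ordinary 2-bit counter update, with the corresponding bit of the resolved pointer as the taken/not-taken goal.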
Fig. 6. An example of TAP Prediction with the Perceptrons predictor (the PC-indexed sub-predictors sum their weight segments over GHR[15:0] to produce the pointer '1101', whose virtual address selects the TargetT target-entry; the allocation-entry bitmap is 0010_0000_0000_0010)

TABLE II
BASELINE PROCESSOR CONFIGURATION
Pipeline Depth: 16 stages
Instruction Fetch: 4 instructions per cycle; fetch ends at the first predicted-taken branch; uses the PPD technique
Execution Engine: 4-wide decode/issue/execute/commit; 512-entry RUU; 128-entry LSQ
Branch Predictor: 4K-entry, 4-way BTB (LRU) with 1-cycle prediction delay; 32-entry return address stack; structures of the direction predictors are listed in Table I; 15-cycle minimum branch misprediction penalty
Caches: 32KB, 8-way, 2-cycle L1 DCache and ICache; 1MB, 16-way, 10-cycle unified L2 cache; all caches have 64B blocks with LRU replacement
Memory: 150-cycle latency (first chunk), 15-cycle (rest)

D. Updating Flow

The updating in the correct TAP Prediction case is the same as that of the traditional BTB, which updates the Least Recently Used information of the BTB set containing the used target-entry. If there is a misprediction, TAP Prediction should update the corresponding target address pointer to make it map to the BTB entry storing the correct indirect-branch target. There are two kinds of misprediction cases in TAP Prediction:
∙ Wrong Pointer: one of the target-entries stores the correct indirect-branch target, but the target address pointer points to a wrong target-entry. This is primarily caused by contention between branch directions and target address pointers in the direction predictor.
∙ Meaningless Pointer: none of the target-entries stores the correct indirect-branch target, so the generated target address pointer is meaningless. This is mainly caused by compulsory misses of target-entries.

In order to distinguish these two kinds of mispredictions and update the allocation-entry, the updating mechanism needs to traverse the target-entries in the BTB. The traversing process may take several cycles.

Updating for Wrong Pointer cases: If the virtual address hits a target, TAP Prediction compares that target to the correct target. If they are the same, the corresponding target address pointer is updated to point to this target-entry. Each sub-predictor saves its bit of the correct target address pointer for use as the training goal. Each training process is the same as the training of the original direction predictor.

Updating for Meaningless Pointer cases: At the end of the traversal, if none of the entries matches the correct target, TAP Prediction selects a target-entry to store the correct target. If there are unallocated target-entries, it will randomly allocate a new target-entry and update the allocation-entry. Otherwise, it will randomly replace a used target-entry to store the newly encountered target. Finally, the sub-predictors are trained to generate the correct target address pointer.

E. Hardware Cost and Complexity

The additions and modifications to the existing branch prediction hardware are as follows:
∙ Add a register to store the virtual address calculated from the target address pointer, during prediction and updating.
∙ Implement the hardware function fTAP(PC): the design in Section III.A requires only two 4-bit XORs.
∙ Add a multiplexer to select the PC or the virtual address to access the BTB.
∙ Add registers carrying the content of the allocation-entry throughout the pipeline.
∙ Copy the prediction and training logic of the original direction predictor to construct the sub-predictors.
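The misprediction-update traversal described in the updating flow can be sketched as follows (the BTB is modeled as a dict and all names are illustrative); the walk returns the pointer to train toward in the wrong-pointer case, or None in the meaningless-pointer case:

```python
def tap_update_walk(pc, correct_target, btb, alloc_bits, f_tap):
    """Sketch of the misprediction-update traversal: visit only the
    allocated target-entries (per the allocation-entry's 16-bit bitmap)
    and look for the correct target.  Returns the pointer the
    sub-predictors should be trained toward (wrong-pointer case), or
    None if no entry matches (meaningless-pointer case), in which case
    a target-entry must be allocated or replaced."""
    for p in range(16):
        if not (alloc_bits >> p) & 1:
            continue                      # unallocated pointer: skip, saving cycles
        vaddr = f_tap(pc, p)              # virtual addresses generated sequentially
        if btb.get(vaddr) == correct_target:
            return p                      # train each sub-predictor toward bit p
    return None
```

Skipping the unallocated pointers is what keeps the traversal short in practice, since most indirect branches have only a few targets.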
However, it is only necessary to traverse the allocated target-entries (according to the information stored in the allocation-entry), which significantly reduces the time cost of updating. During the traversal, TAP Prediction generates the virtual addresses sequentially to index the allocated target-entries.

Updating for the allocation-entry: If a virtual address misses in the BTB, it indicates that this target-entry has been replaced by some other target, so the corresponding bit in the allocation-entry should be cleared.

The structure of TAP Prediction has two features: first, no extra large table is required; second, the logic required for TAP prediction and updating is simple. In particular, copying the existing logic of the original direction predictor reduces the design complexity significantly.

TABLE III
CHARACTERISTICS OF THE EVALUATED BENCHMARKS
(columns: crafty, eon, perlbmk, gap, sjeng, perlbench, gcc06, povray, richards, Avg.)
Static number of indirect branches: 11, 26, 46, 35, 15, 24, 128, 8, 5, 38
Dynamic number of indirect branches (K): 210, 557, 1176, 1237, 317, 1125, 393, 696, 898, 735
Indirect-branch MPKI: 1.10, 1.54, 8.43, 5.51, 1.96, 8.95, 1.77, 1.70, 4.85, 3.98
Indirect-branch prediction accuracy (%): 48.0, 72.4, 28.4, 55.5, 38.2, 20.5, 55.0, 75.6, 46.0, 45.5
Hybrid base IPC: 1.67, 1.74, 1.38, 1.50, 1.44, 1.09, 1.46, 1.99, 1.58, 1.52
Perceptrons base IPC: 1.97, 2.13, 1.48, 1.53, 1.80, 1.13, 1.80, 2.16, 1.75, 1.72
O-GEHL base IPC: 1.95, 2.14, 1.49, 1.54, 1.77, 1.14, 1.80, 2.17, 1.76, 1.72

TABLE IV
TIMING PARAMETERS OF BRANCH COMPONENTS UNDER THE TSMC DOLPHIN LIBRARY, 65NM (0.9V, 125°C)
(SRAM components: BTB data, BTB tag, Hybrid, Perceptrons, O-GEHL; columns: cell, TSetup, TClk-to-Q, and estimated maximum latency w/o TAP, w/ TAP, and for a sub-predictor in TAP)

IV. EXPERIMENTAL METHODOLOGY

We extended SimpleScalar [1] to evaluate TAP Prediction. Table II shows the parameters of our baseline processor, which employs the BTB to predict indirect branches. The latencies of the various branch prediction strategies have been considered in our experiments.
Our workload selects the indirect-branch-sensitive benchmarks, including 4 SPEC CPU 2000 INT benchmarks, 4 SPEC CPU 2006 benchmarks [15], and one C++ benchmark, Richards [16]. Each benchmark gains at least 5 percent performance with a perfect indirect-branch predictor. These benchmarks have also been used in previous related research [8] [4]. We used SimPoint [11] to select a representative simulation region for each benchmark using the reference input set. Each benchmark was run for 100 million Alpha instructions. All binaries were compiled with Compaq C++ V6.5-042 for Compaq Tru64 Unix V5.1B. Table III shows the characteristics of the examined benchmarks on the baseline processors.

Table IV timing data: cells 128x148, 128x120, 128x32, 128x64, 128x16; TSetup (ns): 0.208, 0.212, 0.233, 0.223, 0.238; TClk-to-Q (ns): 0.878, 0.862, 0.772, 0.790, 0.763; estimated maximum latency (ns), w/o TAP / w/ TAP / sub-predictor in TAP: 1.112 / 1.112 / -, 0.902 / 1.470 / 1.433, 0.902 / 1.470 / 1.433, 0.802 / 1.400 / 1.163.

B. MPKI Impacts and Performance Improvement

Figure 7 shows the indirect-branch MPKI of the baseline and the TAP schemes. All TAP schemes for the three direction predictors reduce the indirect-branch MPKI significantly. For example, the TAP-Perceptrons scheme reduces the average MPKI from 3.98 to 1.03. Table V shows the impact on conditional-branch MPKI. Since TAP Prediction reuses existing branch components, some adverse impact on conditional branches is inevitable. However, those impacts are relatively small compared to TAP's significant indirect-branch MPKI improvement. The results show that the impacts on conditional-branch prediction vary across the three TAP schemes. This is because the mechanisms in Perceptrons and O-GEHL, which are inherently used to avoid aliasing problems and contention for conditional branches, are also effective in avoiding interference between predicting branch directions and target address pointers. The MPKI reduction leads to attractive IPC improvement, as shown in Figure 8.
Because TAP Prediction does not increase the latency of branch prediction, the IPC evaluation generally reflects the performance impact. The IPC improvements of TAP-Hybrid, TAP-Perceptrons, and TAP-OGEHL are 18.19%, 21.52%, and 20.59%, respectively. We have also experimented with a perfect indirect-branch predictor for each baseline, which predicts all indirect-branch targets with 100% accuracy. The TAP schemes achieve 76.84%, 85.09%, and 82.35%, respectively, of the maximum performance improvements provided by the ideal indirect-branch predictors. The effect of the updating mechanism in the TAP-Perceptrons scheme is listed in Table VI. The result shows that each indirect-branch update takes only 1.80 cycles on average. Also, if we assume that each update can be finished in one cycle (without additional cycles), performance improves by only 0.19% on average. Therefore, although the target-entry traversal makes the updating a little more complex, it would not noticeably hurt overall performance.

V. RESULTS AND ANALYSIS

A. Timing Estimation

To evaluate the timing impact of TAP Prediction, we modeled the TAP schemes by extending the front-end of a commercial 64-bit superscalar processor. This fully synthesizable processor runs at about 1GHz under TSMC 65nm technology. It can issue and complete three instructions per clock cycle. Instructions complete in order, but execute out of order. Its dynamic branch prediction employs a 2K-entry, 2-level direction predictor and a 512-entry, 4-way set-associative BTB, which has a 1-cycle access latency. The implementations and their timing parameters are listed in Table IV3. The evaluation results prove that constructing sub-predictors does not increase the latency of the branch prediction hardware. This is because a sub-predictor uses fewer predictor tables (for fewer GHR bits), thereby saving some multiplexers or reducing the number of adder operands, compared to the original branch predictor.
³To estimate the timing of TAP Prediction, we extended the existing components to implement the datapaths of the various branch prediction components. These structures are fully synthesizable designs rather than traditional custom designs, so the estimated timing, such as the BTB's latency, is longer than it would be in a real chip.

TABLE V
CONDITIONAL-BRANCH MPKI COMPARISONS

Configuration          crafty  eon    perlbmk  gap    sjeng  perlbench  gcc06  povray  richards  Avg.
Hybrid baseline        7.67    5.98   5.79     0.94   12.90  3.22       5.80   1.89    4.51      5.41
Hybrid TAP             8.46    6.64   7.37     1.50   14.38  3.72       6.56   2.00    4.63      6.14
Perceptrons baseline   4.99    6.33   8.40     2.84   9.21   1.25       4.74   3.49    3.14      4.93
Perceptrons TAP        5.02    6.36   8.44     2.86   9.33   1.25       4.85   3.49    3.14      4.97
Perceptrons VPC        5.04    6.37   12.70    12.91  9.40   1.25       4.77   7.47    3.15      7.01
OGEHL baseline         5.27    6.25   8.26     2.19   9.85   0.78       4.00   3.14    3.10      4.76
OGEHL TAP              5.84    6.36   8.97     2.76   11.23  1.27       4.70   3.14    3.11      5.26

TABLE VII
EFFECT OF DIFFERENT BTB SIZES IN TAP-PERCEPTRONS

BTB Entries        8K     4K     2K     1K     512
Baseline IB MPKI   3.96   3.97   4.02   4.05   4.21
TAP IB MPKI        1.01   1.03   1.15   1.26   1.41
IPC Δ (%)          21.63  21.52  21.09  20.28  18.19

The TTC predictor [2] is a representative dedicated-storage predictor. We simulated TTC predictors of various sizes, from 256 entries to 64K entries. Each TTC entry holds a 4-byte target and a 2-byte tag, giving total sizes from 1.5KB to 384KB. On average, the TAP-Perceptrons predictor achieves performance equivalent to that of a 48KB TTC predictor. In five benchmarks (gap, sjeng, gcc06, povray, and richards), the TAP predictor performs at least as well as a 96KB TTC predictor. We also compared the efficiency of the VPC predictor [8] and the TAP predictor. In our experiments, the maximum iteration number of VPC Prediction is set to 12.
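The TTC storage figures above follow directly from the stated entry layout (4-byte target plus 2-byte tag); a quick arithmetic check (the helper name is ours):

```python
# Each TTC entry: 4-byte target + 2-byte tag = 6 bytes, per the text above.
ENTRY_BYTES = 4 + 2

def ttc_size_kb(entries: int) -> float:
    """Total TTC predictor storage in KB for a given entry count."""
    return entries * ENTRY_BYTES / 1024.0

print(ttc_size_kb(256))        # prints 1.5   (smallest configuration)
print(ttc_size_kb(64 * 1024))  # prints 384.0 (largest configuration)
```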
In addition, since the LFU training algorithm of VPC Prediction is difficult to implement in hardware, we adopt a simpler update policy: insert a newly encountered indirect-branch target at the tail of the access sequence. We found that VPC Prediction in our evaluations usually requires 3 to 6 iterations to finish a prediction, which is less effective than the results reported in [8]. Besides the impact of the simpler training algorithm, another reason is the difference in ISA and in the characteristics of the selected program segments. The efficiency of VPC Prediction depends on the access order of the frequently predicted indirect-branch targets. In our selected program segments, the frequently predicted targets usually appear later than other targets, so they are inserted into the middle or at the tail of the access sequence. In our experiments, the TAP predictor outperforms the VPC predictor by 14.02%. Besides the total indirect-branch MPKI of VPC Prediction, we also evaluated its MPKI when predictions finish within 2 cycles (2-cycle MPKI). The results show that although the total indirect-branch MPKI of VPC Prediction is similar to that of TAP Prediction, its 2-cycle MPKI increases significantly, especially for benchmarks with higher numbers of dynamic targets. This finding is similar to the analysis in [7], and indicates that VPC Prediction requires more cycles to reach the same accuracy as TAP Prediction. On the other hand, as shown in Table V, VPC's adverse impact on conditional-branch prediction is also greater than that of TAP Prediction.
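The simplified tail-insertion update policy just described can be modeled in a few lines. This is our illustrative software sketch of the behavior, not the hardware mechanism itself; target values and the class name are made up:

```python
class VirtualBranchSequence:
    """Per-branch access sequence of known targets under the simplified
    update policy: a newly seen target is appended at the tail, so a hot
    target first encountered late keeps costing extra iterations."""

    def __init__(self, max_iterations=12):  # 12 = the cap used in Sec. V-D
        self.targets = []                   # access-sequence order
        self.max_iterations = max_iterations

    def predict(self, actual_target):
        """Return the 1-based iteration at which the correct target is
        found, or None on a misprediction (target then appended at tail)."""
        for i, t in enumerate(self.targets[:self.max_iterations], start=1):
            if t == actual_target:
                return i
        if actual_target not in self.targets:
            self.targets.append(actual_target)  # simple tail insertion
        return None

seq = VirtualBranchSequence()
seq.predict(0x400100)         # miss: first occurrence, inserted at tail
seq.predict(0x400200)         # miss: appended after the earlier target
print(seq.predict(0x400200))  # found only on the 2nd iteration -> prints 2
```

Under this policy, a frequently taken target that first appears after several cold targets sits deep in the sequence, which is exactly why our evaluations see 3 to 6 iterations per prediction.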
Fig. 7. Indirect-branch MPKI impacts of TAP predictors

TABLE VI
UPDATE PENALTIES IN TAP-PERCEPTRONS SCHEME: INDIRECT BRANCH (IB), IPC IMPROVEMENT WITHOUT ADDITIONAL CYCLES (IPC W/ IDEAL UPDATE)

Benchmark   IB Update Num  Cycles    Cycles per IB update  IPC w/ ideal update (%)
crafty      60990          119049    1.95                  0.22
eon         27720          27812     1.00                  0.08
perlbmk     174404         925781    5.31                  2.05
gap         679            365       0.54                  0.00
sjeng       79359          174622    2.20                  0.32
perlbench   436313         2654845   6.08                  3.81
gcc06       50615          108615    2.15                  0.33
povray      49702          39907     0.80                  0.10
richards    5677           8782      1.55                  0.07
average     35898          64645     1.80                  0.19

C. Sensitivity to Different BTB Sizes

Since indirect-branch targets are stored in the BTB, we also evaluated the effect of different BTB sizes, taking the TAP-Perceptrons scheme as an example. The results in Table VII show that TAP Prediction provides significant performance improvement even with small BTB sizes. This is because TAP Prediction dynamically occupies only a limited number of BTB entries to store indirect-branch targets.

D. Comparison with Other Indirect-Branch Predictors

As an example, we compare the TAP-Perceptrons predictor with two other hardware-based indirect-branch predictors, the TTC predictor and the VPC predictor. Figures 9 and 10 illustrate the comparisons of indirect-branch MPKI and IPC improvement.

Fig. 8. IPC improvement of TAP predictors

Fig. 9. Comparisons of indirect-branch MPKI

Fig. 10. Comparisons of IPC improvements

VPC Prediction generates and predicts a virtual branch in each cycle. Thus, for one branch occurrence, VPC Prediction usually occupies multiple table entries of the direction predictor for the virtual addresses. In contrast, the TAP-Perceptrons scheme takes only one entry per indirect branch.

VI. CONCLUSION

In this paper, we have proposed and evaluated a fast hardware-reusing mechanism for indirect-branch prediction, called TAP Prediction. It reuses the original direction predictor to construct several sub-predictors, which generate the bits of the target address pointer when fetching an indirect branch. This pointer is used to calculate a virtual address indexing the predicted target stored in the BTB. With the help of such pointers, TAP Prediction achieves a time cost similar to that of dedicated-storage predictors without additional dedicated storage. TAP Prediction is also highly extensible to various branch prediction structures. Applied to three representative direction predictors, TAP Prediction reduces indirect-branch MPKI significantly and shows attractive performance improvements compared to the commonly used BTB scheme.

VII. ACKNOWLEDGMENTS

The authors would like to thank Youn-Long Lin and Helen Gordon for their help with the paper review.

REFERENCES

[1] D. Burger, et al. The SimpleScalar Tool Set, Version 2.0. SIGARCH Comput. Archit. News, 1997.
[2] P. Chang, et al. Target prediction for indirect jumps. In ISCA-24, 1997.
[3] K. Driesen, et al. The cascaded predictor: Economical and adaptive branch target prediction. In MICRO-31, 1998.
[4] M. Farooq, et al. Value Based BTB Indexing for Indirect Jump Prediction. In HPCA-2010.
[5] S. Gochman, et al. The Intel Pentium M processor: Microarchitecture and performance. Intel Technology Journal, 7(2), May 2003.
[6] D. A. Jiménez. Fast Path-Based Neural Branch Prediction. In MICRO-36, 2003.
[7] J. A. Joao, et al. Improving the Performance of Object-Oriented Languages with Dynamic Predication of Indirect Jumps. In ASPLOS-13, pages 80-90, 2008.
[8] H. Kim, et al. VPC Prediction: Reducing the Cost of Indirect Branches via Hardware-Based Dynamic Devirtualization. In ISCA-34, pages 424-435, 2007.
[9] J. K. F. Lee, et al. Branch prediction strategies and branch target buffer design. IEEE Computer, pages 6-22, January 1984.
[10] S. McFarling. Combining branch predictors. Technical Report TN-36, Digital Western Research Laboratory, June 1993.
[11] E. Perelman, et al. Using SimPoint for Accurate and Efficient Simulation. In SIGMETRICS '03, pages 318-319, 2003.
[12] D. Parikh, et al. Power-Aware Branch Prediction: Characterization and Design. IEEE Transactions on Computers, Vol. 53, No. 2, Feb. 2004.
[13] A. Seznec. Analysis of the O-GEometric History Length branch predictor. In ISCA-32, 2005.
[14] A. Seznec, et al. A case for (partially) TAgged GEometric history length branch prediction. JILP, 8, Feb. 2006.
[15] SPEC. Standard Performance Evaluation Corporation. http://www.spec.org
[16] M. Wolczko. Benchmarking Java with the Richards benchmark. http://research.sun.com/people/mario/java_benchmarking/richards/richards.html
[17] T.-Y. Yeh, et al. Two-Level Adaptive Training Branch Prediction. In MICRO-24, pages 51-61, 1991.