VPC Prediction: Reducing the Cost of Indirect Branches via Hardware-Based Dynamic Devirtualization Hyesoon Kim José A. Joao Onur Mutlu§ Chang Joo Lee Yale N. Patt Robert Cohn† ECE Department The University of Texas at Austin §Microsoft Research Redmond, WA †Intel Corporation Hudson, MA {hyesoon, joao, cjlee, patt}@ece.utexas.edu onur@microsoft.com robert.s.cohn@intel.com ABSTRACT Indirect branches have become increasingly common in modular programs written in modern object-oriented languages and virtualmachine based runtime systems. Unfortunately, the prediction accuracy of indirect branches has not improved as much as that of conditional branches. Furthermore, previously proposed indirect branch predictors usually require a significant amount of extra hardware storage and complexity, which makes them less attractive to implement. This paper proposes a new technique for handling indirect branches, called Virtual Program Counter (VPC) prediction. The key idea of VPC prediction is to treat a single indirect branch as multiple “virtual” conditional branches in hardware for prediction purposes. Our technique predicts each of the virtual conditional branches using the existing conditional branch prediction hardware. Thus, no separate storage structure is required for predicting indirect branch targets. Our evaluation shows that VPC prediction improves average performance by 26.7% compared to a commonly-used branch target buffer based predictor on 12 indirect branch intensive applications. VPC prediction achieves the performance improvement provided by at least a 12KB (and usually a 192KB) tagged target cache predictor on half of the examined applications. We show that VPC prediction can be used with any existing conditional branch prediction mechanism and that the accuracy of VPC prediction improves when a more accurate conditional branch predictor is used. Categories and Subject Descriptors: C.1.0 [Processor Architectures]: General C.1.1 [Single Data Stream Architectures]: RISC/CISC, VLIW architectures C.5.3 [Microcomputers]: Microprocessors D.3.3 [Language Constructs and Features]: Polymorphism General Terms: Design, Performance. Keywords: Indirect branch prediction, virtual functions, devirtualization. 1. INTRODUCTION Object-oriented programs are becoming more common as more programs are written in modern high-level languages such as Java, Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISCA’07, June 9–13, 2007, San Diego, California, USA. Copyright 2007 ACM 978-1-59593-706-3/07/0006 ...$5.00. C++, and C#. These languages support polymorphism [4], which significantly eases the development and maintenance of large modular software projects. To support polymorphism, modern languages include dynamically-dispatched function calls (i.e. virtual functions) whose targets are not known until run-time because they depend on the dynamic type of the object on which the function is called. Virtual function calls are usually implemented using indirect branch/call instructions in the instruction set architecture. Previous research has shown that modern object-oriented languages result in significantly more indirect branches than traditional C and Fortran languages [3]. Unfortunately, an indirect branch instruction is more costly on processor performance because predicting an indirect branch is more difficult than predicting a conditional branch as it requires the prediction of the target address instead of the prediction of the branch direction. Direction prediction is inherently simpler because it is a binary decision as the branch direction can take only two values (taken or not-taken), whereas indirect target prediction is an N-ary decision where N is the number of possible target addresses. Hence, with the increased use of object-oriented languages, indirect branch target mispredictions have become an important performance limiter in high-performance processors.1 Moreover, the lack of efficient architectural support to accurately predict indirect branches has resulted in an increased performance difference between programs written in object-oriented languages and programs written in traditional languages, thereby rendering the benefits of object-oriented languages unusable by many software developers who are primarily concerned with the performance of their code [43]. Figure 1 shows the number and fraction of indirect branch mispredictions per 1K retired instructions (MPKI) in different Windows applications run on an Intel Core Duo T2500 processor [22] which includes a specialized indirect branch predictor [15]. The data is collected with hardware performance counters using VTune [23]. In the examined Windows applications, on average 28% of the branch mispredictions are due to indirect branches. In two programs, Virtutech Simics [32] and Microsoft Excel 2003, almost half of the branch mispredictions are caused by indirect branches. These results show that indirect branches cause a considerable fraction of all mispredictions even in today’s relatively small-scale desktop applications. Previously proposed indirect branch prediction techniques [6, 8, 27, 9, 10, 40] require large hardware resources to store the target addresses of indirect branches. For example, a 1024-entry gshare conditional branch predictor [33] requires only 2048 bits but a 1024entry gshare-like indirect branch predictor (tagged target cache [6]) needs at least 2048 bytes along with additional tag storage even if the processor stores only the least significant 16 bits of an indirect branch 1 In the rest of this paper, an “indirect branch” refers to a non-return unconditional branch instruction whose target is determined by reading a general purpose register or a memory location. We do not consider return instructions since they are usually very easy to predict using a hardware return address stack [26]. 424 55 50 45 40 35 indirect branches 30 25 20 15 10 sim l ce ic s 0 cy gw sq in lse w rv in ex r p ie lor e xp lo r r em er ac f ir s ef ox vt un pp e na tvi sa ew -w or l ou dwi t de loo nd sk to k pse av arc id h ac a ro re w ad in am w p in dv d A V G 5 ex Percentage of all mispredicted branches (%) cy gw sq in lse w rv in ex r p ie lor e xp lo r r em er ac f ir s ef ox vt un pp e na tvi sa ew -w or l ou dwi de tloo nd sk to k pse av arc id h ac a ro re w ad in am w p in dv d A V G l ce sim ic s conditional indirect ex Mispredictions Per Kilo Instructions (MPKI) 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 Figure 1: Indirect branch mispredictions in Windows applications: MPKI for conditional and indirect branches (left), percentage of mispredictions due to indirect branches (right) target address in each entry.2 As such a large hardware storage comes with an expensive increase in power/energy consumption and complexity, most current high-performance processors do not dedicate separate hardware but instead use the branch target buffer (BTB) to predict indirect branches [1, 18, 28], which implicitly –and usually inaccurately– assumes that the indirect branch will jump to the same target address it jumped to in its previous execution [6, 27].3 To our knowledge, only Intel Pentium M implements specialized hardware to help the prediction of indirect branches [15], demonstrating that hardware designers are increasingly concerned with the performance impact of indirect branches. However, as we showed in Figure 1, even on a processor based on the Pentium M, indirect branch mispredictions are still relatively frequent. In order to efficiently support polymorphism in object-oriented languages without significantly increasing complexity in the processor front-end, a simple and low-cost –yet effective– indirect branch predictor is necessary. A current high-performance processor already employs a large and accurate conditional branch predictor. Our goal is to use this existing conditional branch prediction hardware to also predict indirect branches instead of building separate, costly indirect branch prediction structures. We propose a new indirect branch prediction algorithm: Virtual Program Counter (VPC) prediction. A VPC predictor treats a single indirect branch as multiple conditional branches (virtual branches) in hardware for prediction purposes. Conceptually, each virtual branch has its own unique target address, and the target address is stored in the BTB with a unique “fake” PC, which we call virtual PC. The processor uses the outcome of the existing conditional branch predictor to predict each virtual branch. The processor accesses the conditional branch predictor and the BTB with the virtual PC address of a virtual branch. If the prediction for the virtual branch is “taken,” the target address provided by the BTB is predicted as the next fetch address (i.e. the predicted target of the indirect branch). If the prediction of the virtual branch is “not-taken,” the processor moves on to the next virtual branch: it tries a conditional branch prediction again with a different virtual PC. The processor repeats this process until the conditional branch predictor predicts a virtual branch as taken. VPC prediction stops if none of the virtual branches is predicted as taken after a limited number of virtual branch predictions. After VPC prediction stops, the processor can stall the front-end until the target address of the indirect branch is resolved. The VPC prediction algorithm is inspired by a compiler optimization, called receiver class prediction optimization (RCPO) [7, 19, 16, 2] or devirtualization [24]. This optimization statically converts an indirect branch to multiple direct conditional branches (in other words, it “devirtualizes” a virtual function call). Unfortunately, devirtualization requires extensive static program analysis or accurate profiling, and it is applicable to only a subset of indirect branches with a limited number of targets that can be determined statically [24]. Our proposed VPC prediction mechanism provides the benefit of using conditional branch predictors for indirect branches without requiring static analysis or profiling by the compiler. In other words, VPC prediction dynamically devirtualizes an indirect branch without compiler support. Unlike compiler-based devirtualization, VPC prediction can be applied to any indirect branch regardless of the number and locations of its targets. The contributions of VPC prediction are as follows: 2 With a 64-bit address space, a conventional indirect branch predictor likely requires even more hardware resources to store the target addresses [27]. 3 Previous research has shown that the prediction accuracy of a BTB-based indirect branch predictor, which is essentially a last-target predictor, is low (about 50%) because the target addresses of many indirect branches alternate rather than stay stable for long periods of time [6, 27]. 425 1. To our knowledge, VPC prediction is the first mechanism that uses the existing conditional branch prediction hardware to predict the targets of indirect branches, without requiring any program transformation or compiler support. 2. VPC prediction can be applied using any current as well as future conditional branch prediction algorithm without requiring changes to the conditional branch prediction algorithm. Since VPC prediction transforms the problem of indirect branch prediction into the prediction of multiple virtual conditional branches, future improvements in conditional branch prediction accuracy can implicitly result in improving the accuracy of indirect branch prediction. 3. Unlike previously proposed indirect branch prediction schemes, VPC prediction does not require extra storage structures to maintain the targets of indirect branches. Therefore, VPC prediction provides a low-cost indirect branch prediction scheme that does not significantly complicate the front-end of the processor while providing the same performance as more complicated indirect branch predictors that require significant amounts of storage. 2. BACKGROUND ON INDIRECT BRANCH PREDICTION We first provide a brief background on how indirect branch predictors work to motivate the similarity between indirect and conditional branch prediction. There are two types of indirect branch predictors: history-based and precomputation-based [37]. The technique we introduce in this paper utilizes history information, so we focus on history-based indirect branch predictors. 2.1 Why Does History-Based Indirect Branch Prediction Work? History-based indirect branch predictors exploit information about the control-flow followed by the executing program to differentiate between the targets of an indirect branch. The insight is that the control-flow path leading to an indirect branch is strongly correlated with the target of the indirect branch [6]. This is very similar to modern conditional branch predictors, which operate on the observation that the control-flow path leading to a branch is correlated with the direction of the branch [11]. 2.1.1 A Source Code Example The example in Figure 2 shows an indirect branch from the GAP program [12] to provide insight into why history-based prediction of indirect branch targets works. GAP implements and interprets a language that performs mathematical operations. One data structure in the GAP language is a list. When a mathematical function is applied to a list element, the program first evaluates the value of the index of the element in the list (line 13 in Figure 2). The index can be expressed in many different data types, and a different function is called to evaluate the index value based on the data type (line 10). For example, in expressions L(1), L(n), and L(n+1), the index is of three different data types: T_INT, T_VAR, and T_SUM, respectively. An indirect jump through a jump table (EvTab in lines 2, 3 and 10) determines which evaluation function is called based on the data type of the index. Consider the mathematical function L2(n) = L1(n) + L1(n+1). For each n, the program calculates three index values; Eval_VAR, Eval_SUM, and Eval_VAR functions are called respectively to evaluate index values for L1(n), L1(n+1), and L2(n). The targets of the indirect branch that determines the evaluation function of the index are therefore respectively the addresses of the two evaluation functions. Hence, the target of this indirect branch alternates between the two functions, making it unpredictable with a BTB-based last-target predictor. In contrast, a predictor that uses branch history information to predict the target easily distinguishes between the two target addresses because the branch histories followed in the functions Eval_SUM and Eval_VAR are different; hence the histories leading into the next instance of the indirect branch used to evaluate the index of the element are different. Note that a combination of the regularity in the input index expressions and the code structure allows the target address to be predictable using branch history information. 2.2 Previous Work The indirect branch predictor described by Lee and Smith [30] used the branch target buffer (BTB) to predict indirect branches. This scheme predicts that the target of the current instance of the branch will be the same as the target taken in the last execution of the branch. This scheme does not work well for indirect branches that frequently switch between different target addresses. Such indirect branches are commonly used to implement virtual function calls that act on many different objects and switch statements with many ‘case’ targets that are exercised at run-time. Therefore, the BTB-based predictor has low (about 50%) prediction accuracy [30, 6, 8, 27]. 1: // Set up the array of function pointers (i.e. jump table) 2: EvTab[T_INT] = Eval_INT; EvTab[T_VAR] = Eval_VAR; 3: EvTab[T_SUM] = Eval_SUM; 4: // ... 5: 6: // EVAL evaluates an expression by calling the function 7: // corresponding to the type of the expression 8: // using the EvTab[] array of function pointers 9: 10: #define EVAL(hd) ((*EvTab[TYPE(hd)])((hd))) /*INDIRECT*/ 11: 12: TypHandle Eval_LISTELEMENT ( TypHandle hdSel ) { 13: hdPos = EVAL( hdSel ); 14: // evaluate the index of the list element 15: // check if index is valid and within bounds 16: // if within bounds, access the list 17: // at the given index and return the element 18: } Figure 2: An indirect branch example from GAP Chang et al. [6] first proposed to use branch history information to distinguish between different target addresses accessed by the same indirect branch. They proposed the “target cache,” which is similar to a two-level gshare [33] conditional branch predictor. The target cache is indexed using the XOR of the indirect branch PC and the branch history register. Each cache entry contains a target address. Each entry can be tagged, which reduces interference between different indirect branches. The tagged target cache significantly improves indirect branch prediction accuracy compared to a BTB-based predictor. However, it also requires separate structures for predicting indirect branches, increasing complexity in the processor front-end. Later work on indirect branch prediction by Driesen and Hölzle focused on improving the prediction accuracy by enhancing the indexing functions of two-level predictors [8] and by combining multiple indirect branch predictors using a cascaded predictor [9, 10]. The cascaded predictor is a hybrid of two or more target predictors. A relatively simple first-stage predictor is used to predict easy-topredict (single-target) indirect branches, whereas a complex secondstage predictor is used to predict hard-to-predict indirect branches. Driesen and Hölzle [10] concluded that a 3-stage cascaded predictor performed the best for a particular set of C and C++ benchmarks. Kalamatianos and Kaeli [27] proposed predicting indirect branches via data compression. Their predictor uses prediction by partial matching (PPM) with a set of Markov predictors of decreasing size indexed by the result of hashing a decreasing number of bits from previous targets. The Markov predictor is a large set of tables where each table entry contains a single target address and bookkeeping bits. The prediction comes from the highest order table that can predict, similarly to a cascaded predictor. The PPM predictor requires significant additional hardware complexity in the indexing functions, Markov tables, and additional muxes used to select the predicted target address. Recently, Seznec and Michaud [40] proposed extending their TAGE conditional branch predictor to also predict indirect branches. However, their mechanism also requires additional storage space for indirect target addresses and additional complexity to handle indirect branches. 2.3 Our Motivation All previously proposed indirect branch predictors (except the BTB-based predictor) require separate hardware structures to store target addresses in addition to the conditional branch prediction hardware. This not only requires significant die area (which translates into extra energy/power consumption), but also increases the design complexity of the processor front-end, which is already a complex 426 and cycle-critical part of the design.4 Moreover, many of the previously proposed indirect branch predictors are themselves complicated [9, 10, 27, 40], which further increases the overall complexity and development time of the design. For these reasons, most current processors do not implement separate structures to predict indirect branch targets. Our goal in this paper is to design a low-cost technique that accurately predicts indirect branch targets (by utilizing branch history information to distinguish between the different target addresses of a branch) without requiring separate complex structures for indirect branch prediction. To this end, we propose Virtual Program Counter (VPC) prediction. 3. VIRTUAL PROGRAM COUNTER (VPC) PREDICTION 3.1 Overview A VPC predictor treats an indirect branch as a sequence of multiple virtual conditional branches.5 Each virtual branch is predicted in sequence using the existing conditional branch prediction hardware, which consists of the direction predictor and the BTB (Figure 3). If the virtual branch is predicted to be not-taken, the VPC predictor moves on to predict the next virtual branch in the sequence. If the virtual branch is predicted to be taken, VPC prediction uses the target associated with the virtual branch in the BTB as the next fetch address, completing the prediction of the indirect branch. Note that the virtual branches are visible only to the branch prediction hardware. GHR VGHR Conditional Branch Predictor (BP) branch is fetched. If the virtual branch is predicted not-taken, the prediction algorithm moves to the next iteration (i.e. the next virtual branch) by updating the VPCA and VGHR. The VPCA value for an iteration (other than the first iteration) is computed by hashing the original PC value with a randomized constant value that is specific to the iteration. In other words, V P CA = P C ⊕ HASHV AL[iter], where HASHVAL is a hard-coded hardware table of randomized numbers that are different from one another. The VGHR is simply left-shifted by one bit at the end of each iteration to indicate that the last virtual branch was predicted not taken.6 The iterative prediction process stops when a virtual branch is predicted to be taken. Otherwise, the prediction process iterates until either the number of iterations is greater than MAX_ITER or there is a BTB miss (!pred_target in Algorithm 1 means there is a BTB miss).7 If the prediction process stops without predicting a target, the processor stalls until the indirect branch is resolved. Note that the value of MAX_ITER determines how many attempts will be made to predict an indirect branch. It also dictates how many different target addresses can be stored for an indirect branch at a given time in the BTB. Algorithm 1 VPC prediction algorithm iter ← 1 V P CA ← P C V GHR ← GHR done ← F ALSE while (!done) do pred_target ← access_BTB(V P CA) pred_dir ← access_conditional_BP(V P CA, V GHR) if (pred_target and (pred_dir = T AKEN )) then next_P C ← pred_target done ← T RU E else if (!pred_target or (iter ≥ M AX_IT ER)) then ST ALL ← T RU E done ← T RU E end if V P CA ← Hash(P C, iter) V GHR ← Left-Shift(V GHR) iter++ end while Taken/Not Taken Predict? PC VPCA BTB Cond/Indirect Target Address Hash Function Iteration Counter 3.2.1 Figure 3: High-level conceptual overview of the VPC predictor 3.2 Prediction Algorithm The detailed VPC prediction algorithm is shown in Algorithm 1. The key implementation issue in VPC prediction is how to distinguish between different virtual branches. Each virtual branch should access a different location in the direction predictor and the BTB (so that a separate direction and target prediction can be made for each branch). To accomplish this, the VPC predictor accesses the conditional branch prediction structures with a different virtual PC address (VPCA) and a virtual global history register (GHR) value (VGHR) for each virtual branch. VPCA values are distinct for different virtual branches. VGHR values provide the context (branch history) information associated with each virtual branch. VPC prediction is an iterative prediction process, where each iteration takes one cycle. In the first iteration (i.e. for the first virtual branch), VPCA is the same as the original PC address of the indirect branch and VGHR is the same as the GHR value when the indirect 4 Using a separate predictor for indirect branch targets adds one more input to the mux that determines the predicted next fetch address. Increasing the delay of this mux can result in increased cycle time, adversely affecting the clock frequency. 5 We call the conditional branches “virtual” because they are not encoded in the program binary. Nor are they micro-operations since they are only visible to the VPC predictor. Prediction Example Figure 4a,b shows an example virtual function call and the corresponding simplified assembly code with an indirect branch. Figure 4c shows the virtual conditional branches corresponding to the indirect branch. Even though the static assembly code has only one indirect branch, the VPC predictor treats the indirect branch as multiple conditional branches that have different targets and VPCAs. Note that the hardware does not actually generate multiple conditional branches. The instructions in Figure 4c are shown to demonstrate VPC prediction conceptually. We assume, for this example, that MAX_ITER is 3, so there are only 3 virtual conditional branches. Table 1 demonstrates the five possible cases when the indirect branch in Figure 4 is predicted using VPC prediction, by showing the inputs and outputs of the VPC predictor in each iteration. We 6 Note that VPC addresses (VPCAs) can conflict with real PC addresses in the program, thereby increasing aliasing and contention in the BTB and the direction prediction structures. The processor does not require any special action when aliasing happens. To reduce such aliasing, the processor designer should: (1) provide a good randomizing hashing function and values to generate VPCAs and (2) co-design the VPC prediction scheme and the conditional branch prediction structures carefully to minimize the effects of aliasing. Conventional techniques proposed to reduce aliasing in conditional branch predictors [33, 5] can also be used to reduce aliasing due to VPC. 7 The VPC predictor can continue iterating the prediction process even if there is BTB miss. However, we found that continuing in this case does not improve the prediction accuracy. Hence, to simplify the prediction process, our VPC predictor design stops the prediction process when there is a BTB miss in any iteration. 427 Table 1: Possible VPC Predictor states and outcomes when branch in Figure 4b is predicted Case 1 2 3 4 5 1st iteration inputs outputs VPCA VGHR BTB BP L 1111 TARG1 T L 1111 TARG1 NT L 1111 TARG1 NT L 1111 TARG1 NT L 1111 TARG1 NT 2nd iteration inputs outputs VPCA VGHR BTB BP VL2 1110 TARG2 T VL2 1110 TARG2 NT VL2 1110 TARG2 NT VL2 1110 MISS - a = s->area (); (a) Source code R1 = MEM[R2] INDIRECT_CALL R1 // PC: L (b) Corresponding assembly code with an indirect branch 3rd iteration input output VPCA VGHR BTB BP VL3 1100 TARG3 T VL3 1100 TARG3 NT - Prediction TARG1 TARG2 TARG3 stall stall (c) Virtual conditional branches (for prediction purposes) target) train the direction predictor as not-taken (as shown in Algorithm 2). The last virtual branch trains the conditional branch predictor as taken and updates the replacement policy bits in the BTB entry corresponding to the correctly-predicted target address. Note that Algorithm 2 is a special case of Algorithm 3 in that it is optimized to eliminate unnecessary BTB accesses when the target prediction is correct. Figure 4: VPC prediction example: source, assembly, and the corresponding virtual branches Algorithm 2 VPC training algorithm when the branch target is correctly predicted. Inputs: predicted_iter, P C, GHR iter1: cond. br TARG1 // VPCA: L iter2: cond. br TARG2 // VPCA: VL2 = L XOR HASHVAL[1] iter3: cond. br TARG3 // VPCA: VL3 = L XOR HASHVAL[2] iter ← 1 V P CA ← P C V GHR ← GHR while (iter < predicted_iter) do if (iter == predicted_iter) then update_conditional_BP(V P CA, V GHR, TAKEN) update_replacement_BTB(V P CA) else update_conditional_BP(V P CA, V GHR, NOT-TAKEN) end if V P CA ← Hash(PC, iter) V GHR ← Left-Shift(V GHR) iter++ end while assume that the GHR is 1111 when the indirect branch is fetched. Cases 1, 2, and 3 correspond to cases where respectively the first, second, or third virtual branch is predicted taken by the conditional branch direction predictor (BP). As VPC prediction iterates, VPCA and VGHR values are updated as shown in the table. Case 4 corresponds to the case where all three of the virtual branches are predicted not-taken and therefore the outcome of the VPC predictor is a stall. Case 5 corresponds to a BTB miss for the second virtual branch and thus also results in a stall. 3.3 Training Algorithm The VPC predictor is trained when an indirect branch is committed. The detailed VPC training algorithm is shown in Algorithms 2 and 3. Algorithm 2 is used when the VPC prediction was correct and Algorithm 3 is used when the VPC prediction was incorrect. The VPC predictor trains both the BTB and the conditional branch direction predictor for each predicted virtual branch. The key functions of the training algorithm are: Algorithm 3 VPC training algorithm when the branch target is mispredicted. Inputs: P C, GHR, CORRECT _T ARGET iter ← 1 V P CA ← P C V GHR ← GHR f ound_correct_target ← F ALSE while ((iter ≤ M AX_IT ER) and (f ound_correct_target = F ALSE)) do pred_target ← access_BTB(V P CA) if (pred_target = CORRECT_TARGET) then update_conditional_BP(V P CA, V GHR, TAKEN) update_replacement_BTB(V P CA) f ound_correct_target ← T RU E else if (pred_target) then update_conditional_BP(V P CA, V GHR, NOT-TAKEN) end if V P CA ← Hash(PC, iter) V GHR ← Left-Shift(V GHR) iter++ end while 1. to update the direction predictor as not-taken for the virtual branches that have the wrong target (because the targets of those branches were not taken) and to update it as taken for the virtual branch, if any, that has the correct target. 2. to update the replacement policy bits of the correct target in the BTB (if the correct target exists in the BTB) 3. to insert the correct target address into the BTB (if the correct target does not exist in the BTB) Like prediction, training is also an iterative process. To facilitate training on a correct prediction, an indirect branch carries with it through the pipeline the number of iterations performed to predict the branch (predicted_iter). VPCA and VGHR values for each training iteration are recalculated exactly the same way as in the prediction algorithm. Note that only one virtual branch trains the prediction structures in a given cycle.8 3.3.1 /* no-target case */ if (f ound_correct_target = F ALSE) then V P CA ← VPCA corresponding to the virtual branch with a BTB-Miss or Least-frequently-used target among all virtual branches V GHR ← VGHR corresponding to the virtual branch with a BTB-Miss or Least-frequently-used target among all virtual branches insert_BTB(V P CA, CORRECT_TARGET) update_conditional_BP(V P CA, V GHR, TAKEN) end if Training on a Correct Prediction If the predicted target for an indirect branch was correct, all virtual branches except for the last one (i.e. the one that has the correct 3.3.2 8 It is possible to have more than one virtual branch update the prediction structures by increasing the number of write ports in the BTB and the direction predictor. We do not pursue this option as it would increase the complexity of prediction structures. Training on a Wrong Prediction If the predicted target for an indirect branch was wrong, there are two misprediction cases: (1) wrong-target: one of the virtual 428 branches has the correct target stored in the BTB but the direction predictor predicted that branch as not-taken, (2) no-target: none of the virtual branches has the correct target stored in the BTB so the VPC predictor could not have predicted the correct target. In the notarget case, the correct target address needs to be inserted into the BTB. To distinguish between wrong-target and no-target cases, the training logic accesses the BTB for each virtual branch (as shown in Algorithm 3).9 If the target address stored in the BTB for a virtual branch is the same as the correct target address of the indirect branch (wrong-target case), the direction predictor is trained as taken and the replacement policy bits in the BTB entry corresponding to the target address are updated. Otherwise, the direction predictor is trained as not-taken. Similarly to the VPC prediction algorithm, when the training logic finds a virtual branch with the correct target address, it stops training. If none of the iterations (i.e. virtual branches) has the correct target address stored in the BTB, the training logic inserts the correct target address into the BTB. One design question is what VPCA/VGHR values should be used for the newly inserted target address. Conceptually, the choice of VPCA value determines the order of the newly inserted virtual branch among all virtual branches. To insert the new target in the BTB, our current implementation of the training algorithm uses the VPCA/VGHR values corresponding to the virtual branch that missed in the BTB. If none of the virtual branches missed in the BTB, our implementation uses the VPCA/VGHR values corresponding to the virtual branch whose BTB entry has the smallest least frequently used (LFU) value. Note that the virtual branch that missed in the BTB or that has the smallest LFU-value in its BTB entry can be determined easily while the training algorithm iterates over virtual branches (However, we do not show this computation in Algorithm 3 to keep the algorithm more readable).10 3.4 Supporting Multiple Iterations per Cycle The iterative prediction process can take multiple cycles. The number of cycles needed to make an indirect branch prediction with a VPC predictor can be reduced if the processor already supports the prediction of multiple conditional branches in parallel [44]. The prediction logic can perform the calculation of VPCA values for multiple iterations in parallel since VPCA values do not depend on previous iterations. VGHR values for multiple iterations can also be calculated in parallel assuming that previous iterations were “not taken” since the prediction process stops when an iteration results in a “taken” prediction. Section 5.4 evaluates the performance impact of performing multiple prediction iterations in parallel. 1. Three registers to store iter, V P CA, and V GHR for prediction purposes (Algorithm 1). 2. A hard-coded table, HASHV AL, of 32-bit randomized values. The table has M AX_IT ER number of entries. Our experimental results show that M AX_IT ER does not need to be greater than 20. The table is dual-ported to support one prediction and one update concurrently. 3. A predicted_iter value that is carried with each indirect branch throughout the pipeline. This value cannot be greater than M AX_IT ER. 4. Three registers to store iter, V P CA, and V GHR for training purposes (Algorithms 2 and 3). 5. Two registers to store the V P CA and V GHR values that may be needed to insert a new target into the BTB (for the no-target case in Algorithm 3). Note that the cost of the required storage is very small. Unlike previously proposed history-based indirect branch predictors, no large or complex tables are needed to store the target addresses. Instead, target addresses are naturally stored in the existing BTB. The combinational logic needed to perform the computations required for prediction and training is also simple. Actual PC and GHR values are used to access the branch prediction structure in the first iteration of indirect branch prediction. While an iteration is performed, the VPCA and VGHR values for the next iteration are calculated and loaded into the corresponding registers. Therefore, updating VPCA and VGHR for the next iterations is not on the critical path of the branch predictor access. The training of the VPC predictor on a misprediction may slightly increase the complexity of the BTB update logic because it requires multiple iterations to access the BTB. However, the VPC training logic needs to access the BTB multiple times only on a target misprediction, which is relatively infrequent, and the update logic of the BTB is not on the critical path of instruction execution. 4. EXPERIMENTAL METHODOLOGY We use a Pin-based [31] cycle-accurate x86 simulator to evaluate VPC prediction. The parameters of our baseline processor are shown in Table 2. The baseline processor uses the BTB to predict indirect branches [30]. Table 2: Baseline processor configuration Front End Branch Predictors 3.5 Hardware Cost and Complexity The extra hardware required by the VPC predictor on top of the existing conditional branch prediction scheme is as follows: Execution Core 9 On-chip Caches Note that these extra BTB accesses for training are required only on a misprediction and they do not require an extra BTB read-port. An extra BTB access holds only one BTB bank per training-iteration. Even if the access results in a bank conflict with the accesses from the fetch engine for all the mispredicted indirect branches, we found that the performance impact is negligible due to the low frequency of indirect branch mispredictions in the VPC mechanism [29]. 10 This scheme does not necessarily find and replace the least frequently used of the targets corresponding to an indirect branch – this is difficult to implement as it requires keeping LFU information on a per-indirect branch basis across different BTB sets. Rather, our scheme is an approximation that replaces the target that has the lowest value for LFU-bits (corresponding to the LFU within a set) stored in the BTB entry, assuming the baseline BTB implements an LFU-based replacement policy. Other heuristics are possible to determine the VPCA/VGHR of a new target address (i.e. new virtual branch). We experimented with schemes that select among the VPCA/VGHR values corresponding to the iterated virtual branches randomly, or based on the recency information that could be stored in the corresponding BTB entries and found that LFU performs best with random selection a close second. We do not present these results due to space limitations. Buses and Memory Prefetcher 64KB, 2-way, 2-cycle I-cache; fetch ends at the first predicted-taken br.; fetch up to 3 conditional branches or 1 indirect branch 64KB (64-bit history, 1021-entry) perceptron branch predictor [25]; 4K-entry, 4-way BTB with pseudo-LFU replacement; 64-entry return address stack; min. branch mispred. penalty is 30 cycles 8-wide fetch/issue/execute/retire; 512-entry ROB; 384 physical registers; 128-entry LD-ST queue; 4-cycle pipelined wake-up and selection logic; scheduling window is partitioned into 8 sub-windows of 64 entries each L1 D-cache: 64KB, 4-way, 2-cycle, 2 ld/st ports; L2 unified cache: 1MB, 8-way, 8 banks, 10-cycle latency; All caches use LRU replacement and have 64B line size 300-cycle minimum memory latency; 32 memory banks; 32B-wide core-to-memory bus at 4:1 frequency ratio Stream prefetcher with 32 streams and 16 cache line prefetch distance (lookahead) [42] The experiments are run using 5 SPEC CPU2000 INT benchmarks, 5 SPEC CPU2006 INT/C++ benchmarks, and 2 other C++ benchmarks. We chose those benchmarks in SPEC INT 2000 and 2006 INT/C++ suites that gain at least 5% performance with a perfect indirect branch predictor. Table 3 provides a brief description of the other two C++ benchmarks. We use Pinpoints [36] to select a representative simulation region for each benchmark using the reference input set. Each benchmark 429 Table 3: Evaluated C++ benchmarks that are not included in SPEC CPU 2000 or 2006 translator from IDL (Interface Definition Language) to C++ simulates the task dispatcher in the kernel of an operating system [43] is run for 200 million x86 instructions. Table 4 shows the characteristics of the examined benchmarks on the baseline processor. All binaries are compiled with Intel’s production compiler (ICC) [21] with the -O3 optimization level. 100 RESULTS 90 % IPC improvement over baseline 5.1 Dynamic Target Distribution Figure 5 shows the distribution of the number of dynamic targets for executed indirect branches. In eon, gap, and ixx, more than 40% of the executed indirect branches have only one target. These single-target indirect branches are easily predictable with a simple BTB-based indirect branch predictor. However, in gcc (50%), crafty (100%), perlbmk (94%), perlbench (98%), sjeng (100%) and povray (97%), over 50% of the dynamic indirect branches have more than 5 targets. On average, 51% of the dynamic indirect branches in the evaluated benchmarks have more than 5 targets. hm ea n na m d po vr ay ric ha rd s ix x sje ng pe rlb m k ga p pe rlb en ch gc c0 6 eo n am ea n s x ix ha ric po vr rd ay d m na 6 ng sje ch pe gc rlb ga c0 p en k m eo rlb n ty pe gc Figure 7 shows the distribution of the number of iterations needed to generate a correct target prediction. On average 44.6% of the correct predictions occur in the first iteration (i.e. zero idle cycles) and 81% of the correct predictions occur within three iterations. Only in perlbmk and sjeng more than 30% of all correct predictions require at least 5 iterations. Hence, most correct predictions are performed quickly resulting in few idle cycles during which the fetch engine stalls. 100 11-12 9-10 7-8 5-6 4 3 2 1 90 80 70 60 50 40 30 20 am ea n s rd x ix ay vr ha ric d po ng m na sje 6 c0 gc en pe rlb p ga m rlb pe n k 0 ch 10 c Figure 6 (top) shows the performance improvement of VPC prediction over the baseline BTB-based predictor when MAX_ITER is varied from 2 to 16. Figure 6 (bottom) shows the indirect branch MPKI in the baseline and with VPC prediction. In eon, gap, and namd, where over 60% of all executed indirect branches have at most 2 unique targets (as shown in Figure 5), VPC prediction with MAX_ITER=2 eliminates almost all indirect branch mispredictions. Almost all indirect branches in richards have 3 or 4 different targets. Therefore, when the VPC predictor can hold 4 different targets per indirect branch (MAX_ITER=4), indirect branch MPKI is reduced to only 0.7 (from 13.4 in baseline and 1.8 with MAX_ITER=2). The performance of only perlbmk and perlbench continues to improve significantly as MAX_ITER is increased beyond 6, because at least 65% of the indirect branches in these two benchmarks have at least 16 dynamic targets (This is due to the large switch-case statements in perl that are used to parse and pattern-match the input expressions. The most frequently executed/mispredicted indirect branch in perlbench belongs to a switch statement with 57 static targets). Note that even though the number of mispredictions can be further reduced when MAX_ITER is increased beyond 12, the performance improvement actually decreases for perlbench. This is due to two reasons: (1) storing more targets in the BTB via a larger MAX_ITER value starts creating conflict misses, (2) some correct predictions take gc 5.2 Performance of VPC Prediction Figure 6: Performance of VPC prediction: IPC improvement (top), indirect branch MPKI (bottom) Percent of all correct predictions (%) Figure 5: Distribution of the number of dynamic targets across executed indirect branches cr af ty baseline VPC-ITER-2 VPC-ITER-4 VPC-ITER-6 VPC-ITER-8 VPC-ITER-10 VPC-ITER-12 VPC-ITER-14 VPC-ITER-16 c am ea n s rd x ix ha ric d ay m vr po na 6 ng sje en c0 gc pe rlb p m rlb pe ty n eo af cr c k ch 10 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 af 20 gc 20 gc c 30 0 30 ty 40 40 eo 50 50 cr Indirect branch Mispredictions (MPKI) 60 ga Percent of Executed Indirect Branches (%) 70 60 0 16+ 11-15 6-10 5 4 3 2 1 80 70 10 100 90 VPC-ITER-2 VPC-ITER-4 VPC-ITER-6 VPC-ITER-8 VPC-ITER-10 VPC-ITER-12 VPC-ITER-14 VPC-ITER-16 80 af 5. cr ixx richards longer when MAX_ITER is increased, which increases the idle cycles in which no instructions are fetched. On average, VPC prediction improves performance by 26.7% over the BTB-based predictor (when MAX_ITER=12), by reducing the average indirect branch MPKI from 4.63 to 0.52. Since a MAX_ITER value of 12 provides the best performance, most later experiments in this section use MAX_ITER=12. We found that using VPC prediction does not significantly impact the prediction accuracy of conditional branches in the benchmark set we examined as shown in Table 6. Figure 7: Distribution of the number of iterations (for correct predictions) (MAX_ITER=12) 430 Table 4: Characteristics of the evaluated benchmarks: language and type of the benchmark (Lang/Type), baseline IPC (BASE IPC), potential IPC improvement with perfect indirect branch prediction (PIBP IPC Δ), static number of indirect branches (Static IB), dynamic number of indirect branches (Dyn. IB), indirect branch prediction accuracy (IBP Acc), indirect branch mispredictions per kilo instructions (IB MPKI), conditional branch mispredictions per kilo instructions (CB MPKI). gcc06 is 403.gcc in CPU2006 and gcc is 176.gcc in CPU2000. eon C++/int 2.15 16.2% 1857 1401K 72.2 1.9 0.2 perlbmk gap perlbench gcc06 sjeng C/int C/int C/int C/int C/int 1.29 1.29 1.18 0.66 1.21 105.5% 55.6% 51.7% 17.3% 18.5% 864 1640 1283 1557 369 2908K 3454K 1983K 1589K 893K 30.0 55.3 32.6 43.9 28.8 10.2 7.7 6.7 4.5 3.2 0.9 0.8 3.0 3.7 9.5 5.3 Comparisons with Other Indirect Branch Predictors 100 TTC-384B TTC-768B TTC-1.5KB TTC-3KB TTC-6KB TTC-12KB TTC-24KB TTC-48KB TTC-96KB TTC-192KB VPC-ITER-12 80 70 60 50 40 30 20 hm ea n na m d po vr ay ric ha rd s ix x AVG 1.29 32.5% 50.6 4.63 3.0 60 50 40 30 20 pe rlb m k ga p pe rlb en ch gc c0 6 eo n gc c 0 ea n 5.4 am rd x ix ay s Figure 9: Performance of VPC prediction vs. cascaded predictor ha vr po 70 10 ric d m na 6 ng sje en c0 gc rlb pe m p ga n eo rlb pe ty af cr gc c k ch baseline TTC-384B TTC-768B TTC-1.5KB TTC-3KB TTC-6KB TTC-12KB TTC-24KB TTC-48KB TTC-96KB TTC-192KB VPC-ITER-12 80 cr af ty sje ng eo n pe rlb m k ga p pe rlb en ch gc c0 6 cr af ty gc c 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 ixx C++/int 1.62 12.8% 1281 252K 80.7 1.4 4.2 cascaded-704B cascaded-1.4KB cascaded-2.8KB cascaded-5.5KB cascaded-11KB cascaded-22KB cascaded-44KB cascaded-88KB cascaded-176KB VPC-ITER-12 90 0 Indirect branch Mispredictions (MPKI) richards C++/int 0.91 107.1% 266 4533K 40.9 13.4 1.4 100 10 Figure 8: Performance of VPC prediction vs. tagged target cache: IPC (top), MPKI (bottom) Figure 8 compares the performance and MPKI of VPC prediction with the tagged target cache (TTC) predictor [6]. The size of the 4way TTC predictor is calculated assuming 4-byte targets and 2-byte tags for each entry.11 On average, VPC prediction provides the performance provided by a 6KB TTC predictor. However, as shown in Table 5, in six benchmarks, the VPC predictor performs at least as well as a 12KB TTC (and on 4 benchmarks better than a 192KB TTC). As shown in Table 5, the size of TTC that provides equivalent performance is negatively correlated with the average number of dynamic targets for each indirect branch in a benchmark: the higher the average number of targets the smaller the TTC that performs as well as VPC (e.g. in crafty, perlbmk, and perlbench). This is because TTC provides separate storage to cache the large number of dynamic 11 povray C++/fp 1.79 12.1% 1035 1148K 70.8 1.7 2.1 targets in addition to the BTB whereas VPC prediction uses only the available BTB space. As the average number of targets increases, the contention for space in the BTB also increases, and reducing this contention even with a relatively small separate structure (as TTC does) provides significant performance gains. Figure 9 compares the performance of VPC prediction with a 3stage cascaded predictor [9, 10]. On average, VPC prediction provides the same performance improvement as a 22KB cascaded predictor. As shown in Table 5, in six benchmarks, VPC prediction provides the performance of at least a 176KB cascaded predictor.12 % IPC improvement over baseline % IPC improvement over baseline 90 namd C++/fp 2.62 5.4% 678 517K 83.3 0.4 1.1 hm ea n crafty C/int 1.71 4.8% 356 195K 34.1 0.6 6.1 na m d po vr ay ric ha rd s ix x gcc C/int 1.20 23.0% 987 1203K 34.9 3.9 3.0 sje ng Lang/Type BASE IPC PIBP IPC Δ Static IB Dyn. IB IBP Acc (%) IB MPKI CB MPKI Note that we simulated full 8-byte tags for TTC and hence our performance results reflect full tags, but we assume that a realistic TTC will not be implemented with full tags so we do not penalize it in terms of area cost. A target cache entry is allocated only on a BTB misprediction for an indirect branch. Our results do not take into account the increase in cycle time that might be introduced due to the addition of the TTC predictor into the processor front-end. Effect of VPC Prediction Delay So far we have assumed that a VPC predictor can predict a single virtual branch per cycle. Providing the ability to predict multiple virtual branches per cycle (assuming the underlying conditional branch predictor supports this) would reduce the number of idle cycles spent during multiple VPC prediction iterations. Figure 10 shows the performance impact when multiple iterations can take only one cycle. Supporting, unrealistically, even 10 prediction iterations per cycle further improves the performance benefit of VPC prediction by only 2.2%. As we have already shown in Figure 7, only 19% of all correct predictions require more than 3 iterations. Therefore, supporting multiple iterations per cycle does not provide significant improvement. We conclude that, to simplify the design, the VPC predictor can be implemented to support only one iteration per cycle. 5.5 5.5.1 Sensitivity of VPC Prediction to Microarchitecture Parameters Different Conditional Branch Predictors We evaluated VPC prediction with various baseline conditional branch predictors. Figure 11 compares the performance of the TTC 12 We found that a 3-stage cascaded predictor performs slightly worse than an equallysized TTC predictor. This is because the number of static indirect branches in the evaluated benchmarks is relatively small (10-20) and a cascaded predictor performs better than a TTC when there is a larger number of static branches [9, 10]. 431 Table 5: The sizes of tagged target cache (TTC) and cascaded predictors that provide the same performance as the VPC predictor (MAX_ITER=12) in terms of IPC gcc crafty eon perlbmk gap perlbench gcc06 sjeng namd povray richards ixx TTC size (B) 12KB 1.5KB >192KB 1.5KB 6KB 512B 12KB 3KB >192KB >192KB >192KB 3KB cascaded size (B) >176KB 2.8KB >176KB 2.8KB 11KB 1.4KB 44KB 5.5KB >176KB >176KB >176KB >176KB avg. # of targets 6.1 8.0 2.1 15.6 1.8 17.9 5.8 9.0 2.0 5.9 3.4 4.1 100 5.5.2 80 70 60 50 Different BTB sizes We evaluated VPC prediction with different BTB sizes: 512, 1024, and 2048 entries. As Table 7 shows, even with smaller BTB sizes VPC prediction still provides significant performance improvements. 1 br/cycle 2 br/cycle 4 br/cycle 6 br/cycle 8 br/cycle 10br/cycle Table 7: Effect of different BTB sizes 40 30 BTB entries 20 512 1K 2K 4K hm ea n % IPC improvement over baseline VPC Prediction on a Less Aggressive Processor 100 TTC-384B TTC-768 TTC-1.5KB TTC-3KB TTC-6KB TTC-12KB TTC-24KB TTC-48KB VPC-ITER-12 90 80 70 60 50 40 30 20 90 80 70 60 50 40 30 5.6 hm ea n na m d po vr ay ric ha rd s ix x sje ng cr af ty gc c eo n pe rlb m k ga p pe rlb en ch gc c0 6 10 Figure 11: Performance of VPC prediction vs. TTC on a processor with an O-GEHL conditional branch predictor Table 6: Effect of different conditional branch predictors Cond. BP gshare perceptron O-GEHL Baseline cond. MPKI indi. MPKI 3.70 4.63 3.00 4.63 1.82 4.63 IPC 1.25 1.29 1.37 VPC prediction cond. MPKI indi. MPKI 3.78 0.65 3.00 0.52 1.84 0.32 hm ea n eo n gc c Figure 12: VPC prediction vs. TTC on a less aggressive processor 20 0 na m d po vr ay ric ha rd s ix x 10 0 TTC-384B TTC-768B TTC-1.5KB TTC-3KB TTC-6KB TTC-12KB TTC-24KB TTC-48KB VPC-ITER-12 VPC prediction indirect MPKI IPC Δ 1.31 18.5% 0.95 21.7% 0.78 23.8% 0.52 26.7% Figure 12 shows the performance of VPC and TTC predictors on a less aggressive baseline processor that has a 20-stage pipeline, 4wide fetch/issue/retire rate, 128-entry instruction window, 16KB perceptron branch predictor, 1K-entry BTB, and 200-cycle memory latency. Since the less aggressive processor incurs a smaller penalty for a branch misprediction, improved indirect branch handling provides smaller performance improvements than in the baseline processor. However, VPC prediction still improves performance by 17.6%. 110 100 IPC 1.16 1.25 1.28 1.29 sje ng predictor and the VPC predictor on a baseline processor with a 64KB O-GEHL predictor [39]. On average, VPC prediction improves performance by 31% over the baseline with an O-GEHL predictor. We also found [29] that VPC prediction improves performance by 23.8% on a baseline with a 64KB gshare [33] predictor. Table 6 summarizes the results of our comparisons. Reducing the conditional branch misprediction rate via a better predictor results in also reducing the indirect branch misprediction rate with VPC prediction. Hence, as the baseline conditional branch predictor becomes better, the performance improvement provided by VPC prediction increases. We conclude that, with VPC prediction, any research effort that improves conditional branch prediction accuracy will likely result in also improving indirect branch prediction accuracy – without requiring the significant extra effort to design complex and specialized indirect branch predictors. 5.5.3 Baseline indirect MPKI 4.81 4.65 4.64 4.63 pe rlb m k ga p pe rlb en ch gc c0 6 Figure 10: Performance impact of supporting multiple VPC prediction iterations per cycle % IPC improvement over baseline na m d po vr ay ric ha rd s ix x sje ng cr af ty gc c 0 eo n pe rlb m k ga p pe rlb en ch gc c0 6 10 cr af ty % IPC improvement over baseline 90 IPC Δ 23.8% 26.7% 31.0% Performance of VPC Prediction on Server Applications We also evaluated the VPC predictor with commercial on-line transaction processing (OLTP) applications [17]. Each OLTP trace is collected from an IBM System 390 zSeries machine [20] for 22M instructions. Unlike the SPEC CPU benchmarks, OLTP applications have a much higher number of static indirect branches (OLTP1:7601, OLTP2:7991, and OLTP3:15733) and very high indirect branch misprediction rates.13 The VPC predictor (MAX_ITER=10) reduces the indirect branch misprediction rate by 28%, from 12.2 MPKI to 8.7 MPKI. The VPC predictor performs better than a 12KB TTC predictor on all applications and almost as well as a 24KB TTC on oltp2. Hence, we conclude that the VPC predictor is also very effective in large-scale server applications. 13 System 390 ISA has both unconditional indirect and conditional indirect branch instructions. For this experiment, we only consider unconditional indirect branches and use a 16K-entry 4-way BTB in the baseline processor. 432 baseline VPC-ITER-2 VPC-ITER-4 VPC-ITER-6 VPC-ITER-8 VPC-ITER-10 VPC-ITER-12 VPC-ITER-14 VPC-ITER-16 oltp1 oltp2 oltp3 Indirect branch Mispredictions (MPKI) Indirect branch Mispredictions (MPKI) 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 baseline TTC-384B TTC-768B TTC-1.5KB TTC-3KB TTC-6KB TTC-12KB TTC-24KB TTC-48KB VPC-ITER-10 oltp1 amean oltp2 oltp3 amean Figure 13: MPKI of VPC prediction on OLTP Applications: effect of MAX_ITER (left) and vs. TTC predictor (right) 6. VPC PREDICTION AND COMPILER-BASED DEVIRTUALIZATION Devirtualization is the substitution of an indirect method call with direct method calls in object-oriented languages [7, 19, 16, 2, 24]. Ishizaki et al. [24] classify the devirtualization techniques into guarded devirtualization and direct devirtualization. Guarded devirtualization: Figure 14a shows an example virtual function call in the C++ language. In the example, depending on the actual type of Shape s, different area functions are called at runtime. However, even though there could be many different shapes in the program, if the types of shapes are mostly either an instance of the Rectangle class or the Circle class at run-time, the compiler can convert the indirect call to multiple guarded direct calls [16, 13, 2] as shown in Figure 14(b). This compiler optimization is called Receiver Class Prediction Optimization (RCPO) and the compiler can perform RCPO based on profiling. Shape* s = ... ; a = s->area(); // an indirect call (a) A virtual function call in C++ Shape * s = ...; if (s->class == Rectangle) // a cond. br at PC: X a = Rectangle::area(); // a direct call else if (s->class == Circle) // a cond. br at PC: Y a = Circle::area(); // a direct call else a = s->area(); // an indirect call at PC: Z (b) Devirtualized form of the above virtual function call but requires whole-program analysis to make sure there is only one possible target. This approach enables code optimizations that would otherwise be hindered by the indirect call. However, this approach cannot be used statically if the language supports dynamic class loading, like Java. Dynamic recompilation can overcome this limitation, but it requires an expensive mechanism called on-stack replacement [24]. 6.1 6.1.1 The benefits of this optimization are: (1) It enables other compiler optimizations. The compiler could inline the direct function calls or perform interprocedural analysis [13]. Removing function calls also reduces the register save/restore overhead. (2) The processor can predict the virtual function call using a conditional branch predictor, which usually has higher accuracy than an indirect branch predictor [2]. However, not all indirect calls can be converted to multiple conditional branches. In order to perform RCPO, the following conditions need to be fulfilled [13, 2]: 1. The number of frequent target addresses from a caller site should be small (1-2). 2. The majority of target addresses should be similar across input sets. 3. The target addresses must be available at compile-time. Direct devirtualization: Direct devirtualization converts an indirect call into a single unconditional direct call if the compiler can prove that there is only one possible target for the indirect call. Hence, direct devirtualization does not require a guard before the direct call, Need for Static Analysis or Accurate Profiling The application of devirtualization to large commercial software bases is limited by the cost and overhead of the static analysis or profiling required to guide the method call transformation. Devirtualization based on static analysis requires type analysis, which in turn requires whole program analysis [24], and unsafe languages like C++ also require pointer alias analysis. Note that these analyses need to be conservative in order to guarantee correct program semantics. Guarded devirtualization usually requires accurate profile information, which may be very difficult to obtain for large applications. Due to the limited applicability of static devirtualization, [24] reports only an average 40% reduction in the number of virtual method calls on a set of Java benchmarks, with the combined application of aggressive guarded and direct devirtualization techniques. 6.1.2 Figure 14: A virtual function call and its devirtualized form Limitations of Compiler-Based Devirtualization Impact on Code Size and Branch Mispredictions Guarded devirtualization can sometimes reduce performance since (1) it increases the static code size by converting a single indirect branch instruction into multiple guard test instructions and direct calls; (2) it could replace one possibly mispredicted indirect call with multiple conditional branch mispredictions, if the guard tests become hard-to-predict branches [41]. 6.1.3 Lack of Adaptivity to Run-Time Input-Set and Phase Behavior The most frequently-taken targets chosen for devirtualization can be based on profiling, which averages the whole execution of the program for one particular input set. However, the most frequently-taken targets can be different across different input sets. Furthermore, the most frequently-taken targets can change during different phases of the program. Additionally, dynamic linking and dynamic class loading can introduce new targets at runtime. Compiler-based devirtualization cannot adapt to these changes in program behavior because the most frequent targets of a method call are determined statically and encoded in the binary. Due to these limitations, state-of-the-art compilers either do not implement any form of devirtualization (e.g. GCC 4.0 [14]14 ) or 14 GCC only implements a form of devirtualization based on class hierarchy analysis in the ipa-branch experimental branch, but not in the main branch [35]. 433 60 30 20 n ea ay vr TTC-384B TTC-768B TTC-1.5KB TTC-3KB TTC-6KB TTC-12KB TTC-24KB TTC-48KB VPC-ITER-12 % IPC improvement over baseline 90 80 70 60 50 40 30 20 hm ea n baseline TTC-384B TTC-768B TTC-1.5KB TTC-3KB TTC-6KB TTC-12KB TTC-24KB TTC-48KB VPC-ITER-12 9 8 7 6 5 4 3 2 n ea am ay po vr d m na 6 ng sje en c0 gc p rlb pe m rlb pe n eo ty af cr gc c k 0 ch 1 ga Indirect branch Mispredictions (MPKI) po vr ay na m d sje ng pe rlb en ch gc c0 6 ga p eo n cr af ty 0 pe rlb m k 10 gc c Since it is not possible to selectively enable only RCPO in ICC, we could not isolate the impact of RCPO on performance. Hence, we only present the effect of VPC prediction on binaries optimized with RCPO. 16 RCPO binaries were compiled in two passes with ICC: the first pass is a profiling run with the train input set (-prof_gen switch), and the second pass optimizes the binaries based on the profile (we use the -prof_use switch, which enables all profile-guided optimizations). hm d 100 10 A compiler that performs devirtualization reduces the number of indirect branches and therefore reduces the potential performance improvement of VPC prediction. This section evaluates the effectiveness of VPC prediction on binaries that are optimized using aggressive profile-guided optimizations, which include RCPO. ICC [21] performs a form of RCPO [38] when value-profiling feedback is enabled, along with other profile-based optimizations.15 Table 8 shows the number of static/dynamic indirect branches in the BASE and RCPO binaries run with the full reference input set. BASE binaries are compiled with the -O3 option. RCPO binaries are compiled with all profile-guided optimizations, including RCPO.16 Table 8 shows that RCPO binaries reduce the number of static/dynamic indirect branches by up to 51%/86%. Figure 15 shows the performance impact of VPC prediction on RCPO binaries. Even though RCPO binaries have fewer indirect branches, VPC prediction still reduces indirect branch mispredictions by 80% on average, improving performance by 11.5% over a BTB-based predictor. Figure 16 shows the performance comparison of VPC prediction with a tagged target cache on RCPO binaries. The performance of VPC is better than a tagged predictor of 48KB (for eon, namd, povray), and equivalent to a tagged predictor of 24KB (for gap), of 12KB (for gcc), of 3KB (for perlbmk, gcc06, and sjeng), of 1.5KB (for crafty), and 768B (for perlbench). Hence, a VPC predictor provides the performance of a large and more complicated tagged target cache predictor even when the RCPO optimization is used by the compiler. po 6 ng m na sje gc c0 en p m rlb n pe eo ty af Figure 15: Performance of VPC prediction on RCPO binaries 6.3 Performance of VPC Prediction on Binaries Optimized with Compiler-Based Devirtualization 15 cr gc c k 0 ch 10 rlb VPC prediction is essentially a dynamic devirtualization mechanism used for indirect branch prediction purposes. However, VPC’s devirtualization is visible only to the branch prediction structures. VPC has the following advantages over compiler-based devirtualization: 1. As it is a hardware mechanism, it can be applied to any indirect branch without requiring any static analysis/guarantees or profiling. 2. Adaptivity: Unlike compiler-based devirtualization, the dynamic training algorithms allow the VPC predictor to adapt to changes in the most frequently-taken targets or even to new targets introduced by dynamic linking or dynamic class loading. 3. Because virtual conditional branches are visible only to the branch predictor, VPC prediction does not increase the code size, nor does it possibly convert a single indirect branch misprediction into multiple conditional branch mispredictions. On the other hand, the main advantage of compiler-based devirtualization over VPC prediction is that it enables compile-time code optimizations. However, as we show in the next section, the two techniques can be used in combination and VPC prediction provides performance benefits on top of compiler-based devirtualization. VPC-ITER-4 VPC-ITER-6 VPC-ITER-8 VPC-ITER-10 VPC-ITER-12 40 pe 6.2 VPC Prediction vs. Compiler-Based Devirtualization 50 ga % IPC improvement over baseline they implement a limited form of direct devirtualization that converts only provably-monomorphic virtual function calls into direct function calls (e.g. the Bartok compiler [41, 34]). Figure 16: VPC prediction vs. tagged target cache on RCPO binaries: IPC (top) and MPKI (bottom) 7. CONCLUSION This paper proposed and evaluated the VPC prediction paradigm. The key idea of VPC prediction is to treat an indirect branch instruction as multiple “virtual” conditional branch instructions for prediction purposes in the microarchitecture. As such, VPC prediction enables the use of existing conditional branch prediction structures to predict the targets of indirect branches without requiring any extra structures specialized for storing indirect branch targets. Our evaluation shows that VPC prediction, without requiring complicated structures, achieves the performance provided by other indirect branch predictors that require significant extra storage and complexity. We believe the performance impact of VPC prediction will further increase in future applications that will be written in object-oriented programming languages and that will make heavy use of polymorphism since those languages were shown to result in significantly more indirect branch mispredictions than traditional C/Fortran-style languages. By making available to indirect branches the rich, accurate, highly-optimized, and continuously-improving hardware used 434 Table 8: The number of static and dynamic indirect branches in BASE and RCPO binaries Static BASE Static RCPO Dynamic BASE (M) Dynamic RCPO (M) gcc crafty eon perlbmk gap perlbench gcc06 sjeng namd povray 987 356 1857 864 1640 1283 1557 369 678 1035 984 358 1854 764 1709 930 1293 369 333 578 144 174 628 1041 2824 8185 304 10130 7 8228 94 119 619 1005 2030 1136 202 10132 4 7392 to predict conditional branches, VPC prediction can serve as an en- [19] U. Hölzle and D. Ungar. Optimizing dynamically-dispatched calls with run-time type feedback. In PLDI, 1994. abler encouraging programmers (especially those concerned with the performance of their code) to use object-oriented programming styles, [20] IBM Corporation. IBM zSeries mainframe servers. http://www.ibm.com/systems/z/. thereby improving the quality and ease of software development. ACKNOWLEDGMENTS We especially thank Thomas Puzak of IBM for providing the traces of commercial benchmarks that we used in our experiments. We thank Joel Emer, Paul Racunas, John Pieper, Robert Cox, David Tarditi, Veynu Narasiman, Jared Stark, Santhosh Srinath, Thomas Moscibroda, Bradford Beckmann, members of the HPS research group, and the anonymous reviewers for their comments and suggestions. We gratefully acknowledge the support of the Cockrell Foundation, Intel Corporation and the Advanced Technology Program of the Texas Higher Education Coordinating Board. REFERENCES [1] Advanced Micro Devices, Inc. AMD Athlon(T M ) XP Processor Model 10 Data Sheet, Feb. 2003. [2] B. Calder and D. Grunwald. Reducing indirect function call overhead in C++ programs. In POPL-21, 1994. [3] B. Calder, D. Grunwald, and B. Zorn. Quantifying behavioral differences between C and C++ programs. Journal of Programming Languages, 2(4):323–351, 1995. [4] L. Cardelli and P. Wegner. On understanding types, data abstraction, and polymorphism. ACM Computing Surveys, 17(4):471–523, Dec. 1985. [5] P.-Y. Chang, M. Evers, and Y. N. Patt. Improving branch prediction accuracy by reducing pattern history table interference. In PACT, 1996. [6] P.-Y. Chang, E. Hao, and Y. N. Patt. Target prediction for indirect jumps. In ISCA-24, 1997. [7] L. P. Deutsch and A. M. Schiffman. Efficient implementation of the Smalltalk-80 system. In POPL, 1984. [8] K. Driesen and U. Hölzle. Accurate indirect branch prediction. In ISCA-25, 1998. [9] K. Driesen and U. Hölzle. The cascaded predictor: Economical and adaptive branch target prediction. In MICRO-31, 1998. [10] K. Driesen and U. Hölzle. Multi-stage cascaded prediction. In European Conference on Parallel Processing, 1999. [11] M. Evers, S. J. Patel, R. S. Chappell, and Y. N. Patt. An analysis of correlation and predictability: What makes two-level branch predictors work. In ISCA-25, 1998. [12] The GAP Group. GAP System for Computational Discrete Algebra. http://www.gap-system.org/. [13] C. Garrett, J. Dean, D. Grove, and C. Chambers. Measurement and application of dynamic receiver class distributions. Technical Report UW-CS 94-03-05, University of Washington, Mar. 1994. [14] GCC-4.0. GNU compiler collection. http://gcc.gnu.org/. [15] S. Gochman, R. Ronen, I. Anati, A. Berkovits, T. Kurts, A. Naveh, A. Saeed, Z. Sperber, and R. C. Valentine. The Intel Pentium M processor: Microarchitecture and performance. Intel Technology Journal, 7(2), May 2003. [16] D. Grove, J. Dean, C. Garrett, and C. Chambers. Profile-guided receiver class prediction. In OOPSLA-10, 1995. [17] A. Hartstein and T. R. Puzak. The optimum pipeline depth for a microprocessor. In ISCA-29, 2002. [18] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. The microarchitecture of the Pentium 4 processor. Intel Technology Journal, Feb. 2001. Q1 2001 Issue. [21] Intel Corporation. ICC 9.1 for Linux. http://www.intel.com/cd/software/products/ asmo-na/eng/compilers/284264.htm. [22] Intel Corporation. Intel Core Duo Processor T2500. http://processorfinder.intel.com/Details.aspx? sSpec=SL8VT. [23] Intel Corporation. Intel VTune Performance Analyzers. http://www.intel.com/vtune/. [24] K. Ishizaki, M. Kawahito, T. Yasue, H. Komatsu, and T. Nakatani. A study of devirtualization techniques for a Java Just-In-Time compiler. In OOPSLA-15, 2000. [25] D. A. Jiménez and C. Lin. Dynamic branch prediction with perceptrons. In HPCA-7, 2001. [26] D. Kaeli and P. Emma. Branch history table predictions of moving target branches due to subroutine returns. In ISCA-18, 1991. [27] J. Kalamatianos and D. R. Kaeli. Predicting indirect branches via data compression. In MICRO-31, 1998. [28] R. E. Kessler. The Alpha 21264 microprocessor. IEEE Micro, 19(2):24–36, 1999. [29] H. Kim, J. A. Joao, O. Mutlu, C. J. Lee, Y. N. Patt, and R. Cohn. VPC prediction: Reducing the cost of indirect branches via hardware-based dynamic devirtualization. Technical Report TR-HPS-2007-002, The University of Texas at Austin, Mar. 2007. [30] J. K. F. Lee and A. J. Smith. Branch prediction strategies and branch target buffer design. IEEE Computer, pages 6–22, Jan. 1984. [31] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In PLDI, 2005. [32] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. IEEE Computer, 35(2):50–58, Feb. 2002. [33] S. McFarling. Combining branch predictors. Technical Report TN-36, Digital Western Research Laboratory, June 1993. [34] Microsoft Research. Bartok compiler. http://research.microsoft.com/act/. [35] D. Novillo, Mar. 2007. Personal communication. [36] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi. Pinpointing representative portions of large Intel Itanium programs with dynamic instrumentation. In MICRO-37, 2004. [37] A. Roth, A. Moshovos, and G. S. Sohi. Improving virtual function call target prediction via dependence-based pre-computation. In ICS-13, 1999. [38] D. Sehr, Nov. 2006. Personal communication. [39] A. Seznec. Analysis of the O-GEometric History Length branch predictor. In ISCA-32, 2005. [40] A. Seznec and P. Michaud. A case for (partially) TAgged GEometric history length branch prediction. Journal of Instruction-Level Parallelism (JILP), 8, Feb. 2006. [41] D. Tarditi, Nov. 2006. Personal communication. [42] J. Tendler, S. Dodson, S. Fields, H. Le, and B. Sinharoy. POWER4 system microarchitecture. IBM Technical White Paper, Oct. 2001. [43] M. Wolczko. Benchmarking Java with the Richards benchmark. http://research.sun.com/people/mario/java_ benchmarking/richards/richards.html. [44] T.-Y. Yeh, D. Marr, and Y. N. Patt. Increasing the instruction fetch rate via multiple branch prediction and branch address cache. In ICS, 1993. 435