Regret Minimizing Branch Prediction

Jeremiah Blocki, Erik Zawadzki, Yuan Zhou

September 27, 2010

1 Problem definition

The instruction pipeline is essential to modern, high-performance processors. Efficiently using each stage of the pipeline is a considerable challenge, since some sequences of instructions may require a pipeline stall in order to process correctly. Every stall signifies hardware underutilization, a loss of throughput, and an opportunity for performance improvement. There are three main reasons for a stalled pipeline: data dependency (e.g. a read-after-write dependency), resource contention (e.g. separate instructions that simultaneously need a data store), and control flow dependency. The last stems from conditional jumps, and it is the issue we examine in our project.

A conditional jump instruction causes the current program to branch: either to proceed in sequential order or to jump to a non-sequential branch target. Therefore the next PC value depends on the outcome of a register comparison, and this requires a call to an ALU. This comparison may take several cycles to complete, or hundreds of cycles in the case of a cache miss. Until the outcome is known the processor cannot confidently update the PC and fetch the next instruction. In older processors this uncertainty caused a stall in the pipeline, but most modern processors avoid the problem by predicting which branch will be taken and speculatively continuing execution.

Unfortunately there is no oracle for branch prediction. Most branch prediction schemes (cf. § 2) are heuristics that rely on historical branching information. Backing out of a speculative execution is costly and involves flushing part of the pipeline. Therefore, any measure that improves the accuracy of branch prediction will improve the overall performance of a chip.

Our project suggests a novel branch predictor based on a framework for making decisions under uncertainty known as regret minimization. This is related to but distinct from the work by McFarling [7] on using a portfolio of branch predictors. In particular, we are sensitive to the cost of making decisions.

In the following section we describe previous branch prediction policies. Next we outline our approach: using regret minimization (section 3.1) and modeling the cost of mispredictions (section 3.2). We discuss how our approach can be implemented, as well as the series of experiments that will determine the effectiveness of our approach (section 3.3).

2 Previous work

In this section, we describe some previously suggested prediction policies. These policies are especially interesting because they can be used in our portfolio of predictors. (This is a known technique for branch prediction. See, for example, McFarling [7].) Three of the major existing branch predictors are the bimodal, local and global predictors. These are well-known predictors that have been studied in detail in, for example, Smith [9], Lee and Smith [5], and Yeh and Patt [11]. We may consider adding further branch prediction policies to our portfolio of experts as our project progresses.

2.1 Majority predictor (bimodal predictor)

The behavior of typical branches is far from random. Most branches are strongly biased: they are either almost always taken or almost always not taken. Think of the branch used by a long loop, such as

    for (i = 0; i < 100000; i++) ...;

In this example, the branch is (almost) always taken. The majority predictor [9] is designed to fit cases like this: it takes the majority vote over the recent history of a specific branch. To implement the majority predictor, a k-bit counter is kept for each branch (several branches might share a counter due to lack of space, but this would impair prediction performance). Whenever the branch is taken, the counter is increased by 1 (unless it is already 2^k - 1), and whenever the branch is not taken, the counter is decreased by 1 (unless it is already 0). The highest bit gives the prediction: we take the branch when the highest bit is 1, and do not take it when the highest bit is 0.

2.2 Local predictor

Consider the following two examples.

• The branch used by the if statement.

    for (i = 0; i < 100000; i++) {
        if (i & 1) ...;
        else ...;
    }

• The branch used by the inner for loop.

    for (i = 0; i < 100000; i++) {
        for (j = 0; j < 3; j++) ...;
    }

In both examples, the branch follows a periodic pattern. The length of the first pattern is 2, and the length of the second pattern is 3. A local predictor, suggested in Yeh and Patt [10], assumes that the branch's behavior is periodic. It keeps the recent history of the branch and always predicts according to the history and the length of the period.

A remaining issue with the local predictor is how to know the length of the period. Thanks to the powerful regret minimization algorithms, we can introduce a family of local predictors with various period lengths (say, from 2 to 8). The regret minimization algorithm will decide which length is the correct one.

2.3 Local predictor with longer history

A problem with the local predictor is that the history record cannot be too large. A pattern of length 50 is often not affordable to keep. An alternative is to keep a simpler but longer pattern. Consider the following example.

    for (i = 0; i < 100000; i++) {
        for (j = 0; j < 50; j++) ...;
    }

The branch used by the inner loop follows a periodic pattern of length 50. This length is too small for the bimodal predictor to perform well, but too large for the local predictor to keep the history. The history pattern for this example is simple, however: the branch is taken 49 times and then not taken once. This example inspires us to keep simple patterns of this form: take the branch t times and then do not take it once; or, do not take the branch t times and then take it once. In this way, the predictor only keeps two values for its decision: the distance between the last two branch-taken events (or branch-not-taken events), and how long ago the last branch-taken event (or branch-not-taken event) happened.

We can also generalize this kind of predictor to fit the following example.

    for (i = 0; i < 100000; i++) {
        for (j = 0; j < i; j++) ...;
    }

In the example above, the inner loop does not even follow a fixed pattern. The pattern, however, becomes longer by one unit each time, so we can easily introduce a new predictor with a longer history to deal with this case at little cost.

2.4 Global predictor

Sometimes, the direction taken by the current branch may depend on other recent branches. The following code is a good example.

    if (x < 1) ...;
    if (x > 1) ...;

If we assume x takes the value 1 with only a small probability, then the direction taken by the second branch is highly negatively correlated with that of the first branch. If we keep the history of all the recent branches, then at the second branch we can look up which way the first branch went and predict the opposite direction. To implement this, a family of global predictors has been suggested [11]: the i-th predictor predicts according to the i-th branch before the current one (the actual prediction can be the same as, or the opposite of, the history value).

3 Proposed solution

3.1 Regret minimization

Regret minimization is a powerful technique from machine learning theory, which we believe will be a good model for branch prediction. We illustrate the concept of regret minimization using branch prediction. Consider the following setting: you are a microprocessor and you have just fetched a branch instruction. The branch instruction depends on some operations that have not completed. You have three possible actions to choose from. First, you could wait for the other instructions to finish so that you know for sure what the next instruction is, but then you might waste precious clock cycles. Second, you could predict that the branch will be taken and load the corresponding next instruction. Third, you could predict that the branch will not be taken and load the corresponding instructions. The second and third options are potentially risky because the decision must be made online. Guess right and you will improve performance by keeping the instruction pipeline full of useful instructions. Guess wrong and you will waste even more time and resources.

Each time you load a branch instruction you have several "experts" advising you which action to take. After you make your choice you will soon discover what the cost of each decision was. The correct branch prediction has zero cost, while a misprediction typically has greater cost than waiting. This process is repeated many times. Since you know nothing about the program that is running, you do not know in advance which experts will perform well. Perhaps a global predictor will work well, perhaps it will make too many mispredictions. Perhaps a majority predictor will perform well, perhaps not. Our regret is the cost of our decisions minus the cost of the decisions made by the best expert in hindsight. Using a regret minimization algorithm, such as the randomized weighted majority algorithm (Littlestone and Warmuth [6]), you can guarantee that your performance will be almost as good as that of the best expert in hindsight.

There are several important variations of the regret minimization framework.
Dani and Hayes [2] show how to minimize regret even in the partial information model, where you only see the costs of your own actions. However, their bounds are worse than those in [6]. Another strong variation of regret minimization breaks up the predictions into stages. During each stage the algorithm performs at least as well as the best expert for that particular stage. An excellent summary of many of these results can be found in Blum and Mansour [1].

We believe that regret minimization would have several advantages over static policies, because it could allow the processor to dynamically adapt its branch prediction policy to fit the current program. It would also allow us to develop and use better cost metrics than prediction accuracy (see the discussion in section 3.2).

Of course, regret minimization does not guarantee good performance unless one of the experts performs well. However, branch predictors are typically able to achieve significantly better accuracy than random guessing in practice. Similarly, if we knew in advance what the best branch prediction policy was, then it would probably be better to use that policy instead of regret minimization, since there is less overhead and a regret minimization algorithm would only guarantee that its performance is nearly as good. However, McFarling [7] notes that different branch prediction policies are suited to different types of programs. This implies that we will not know the best policy until runtime. In fact, the "best policy" could change significantly during any given window of time.

The main challenge with using regret minimization for branch prediction is the higher overhead. First, we would have to run the logic for each of our experts for each prediction. However, these steps could easily be performed in parallel. Second, the randomized weighted majority algorithm would have to keep track of and update weights for each of the experts. Traditionally this updating happens immediately after we learn the correct result. Finally, the randomized weighted majority algorithm uses these weights to pick a random expert to follow at each stage. If these computations are too complicated, then the branch predictor may take too many cycles to make its prediction. Of course, the branch predictor will only be useful if it can make a good prediction before the branch path is resolved. We believe that we could allow the updating of weights to be delayed, so that these more expensive computations are moved off the critical path.

3.2 Modeling assumptions and cost model

One of the major ways that we depart from earlier prediction models is that we are not directly interested in the raw accuracy of our prediction algorithm. We are interested in a policy that reduces regret, and this is inherently tied to the cost of decisions. Since not all branch-prediction mistakes have the same cost, we need a cost model to accurately calculate regret.

The cost of an incorrectly predicted branch depends on the details of the underlying system [8]. For example, depending on the details of the system, it might be cheaper to speculatively not take a branch than to speculatively take it. Therefore, it may be reasonable to use a branch predictor that is biased towards sequential speculation (this is an intuition that is absent from all previous work).

Of course, the actual empirical cost of a decision is the best loss function. This ideal loss function may be impractical to implement. Not only does it require additional on-chip instrumentation, but it also requires counterfactual information about the branch not executed: we need to know the cost of both branches in order to calculate our regret. There are regret-minimization techniques that can function with partially known costs (cf. § 3.1), but the theoretical guarantees associated with these techniques are weaker. This indicates that we need a cost model.
An especially simple one is to calculate the mean cost of taking a branch versus not taking a branch, but more sophisticated models may further improve performance. Here are some features that we might want to use in our cost model:

• Instruction caching: instructions are typically fetched in blocks. Mistakenly taking a branch might be exceptionally expensive if the misprediction causes a useless fetch from a slow cache (or worse: memory). Therefore, a large branch offset (or use of an address) might signal an especially expensive jump.

• Branch targeting [5]: when is the target of a branch known? It is possible that it takes k cycles to determine the branch target. This means that it is cheaper to speculatively simulate not taking the branch than taking it.

• Chaining effects: if instructions are prefetched and there is special hardware that identifies all the branch instructions, then it might be more attractive to speculatively execute sequences of instructions that have a low density of branches or have many predictable branches.

• Pipeline depth [3]: in deep pipelines there will be more instructions to back out than in a shallower pipeline.

• Synchronicity [4]: asynchronous processors may be able to retain more work after mispredictions, since some of the instructions in the pipeline will be independent of the branch.

evaluation will be different from the one in McFarling [7]. We are primarily interested in reducing cost, so our comparison will be based both on prediction accuracy and on cost reduction. Unfortunately, since the trace is static, we will be unable to determine whether our branching policy actually reduces cost. We can only report on whether policies tend to select costly decisions according to our cost model. To properly investigate whether our policy reduces cost we would have to integrate our policies into a simulator, but this is a more complex and time-consuming method of evaluation. Therefore, we will consider the tendency to select less costly decisions (according to our cost model) a useful proxy for actual cost reduction.

4

• Milestone 1: For this milestone we expect to have an implementation of the regret-minimization framework and policies completed in a high-level language. We will also have some experimental runs on toy datasets. Furthermore, we will set up trace infrastructure for the later experiments.

• Milestone 2: For this milestone we will complete initial experiments using actual trace data on a variety of benchmarks. This may cause us to re-examine some of our assumptions and rework our approach.
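
To make the ideas of sections 2.1, 3.1 and 3.2 concrete, the following Python sketch combines a k-bit saturating counter with two trivial experts under a randomized weighted majority scheme. It is a minimal illustration under stated assumptions, not the implementation planned for the project: all class and function names are hypothetical, the cost values (1.0 for wrongly predicting taken, 0.5 for wrongly predicting not taken) are placeholders standing in for a real cost model, and the multiplicative update w ← w(1 − η · cost) is only one common variant of the weight update.

```python
import random


class SaturatingCounter:
    """k-bit saturating counter: the majority (bimodal) predictor of section 2.1."""

    def __init__(self, k=2):
        self.max_value = (1 << k) - 1   # counter saturates at 2^k - 1
        self.top_bit = 1 << (k - 1)     # highest bit gives the prediction
        self.value = self.top_bit       # assumption: start at "weakly taken"

    def predict(self):
        return bool(self.value & self.top_bit)

    def update(self, taken):
        if taken:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)


class Constant:
    """Trivial expert that always gives the same advice."""

    def __init__(self, taken):
        self.taken = taken

    def predict(self):
        return self.taken

    def update(self, taken):
        pass


def cost(predicted_taken, taken):
    """Hypothetical cost model in [0, 1]; the numbers are placeholders."""
    if predicted_taken == taken:
        return 0.0
    # Assumption: wrongly speculating on the jump costs more than
    # wrongly speculating on the sequential path.
    return 1.0 if predicted_taken else 0.5


class RandomizedWeightedMajority:
    """Follow a random expert, chosen with probability proportional to weight."""

    def __init__(self, experts, eta=0.5, seed=0):
        self.experts = experts
        self.weights = [1.0] * len(experts)
        self.eta = eta
        self.rng = random.Random(seed)
        self.advice = []

    def predict(self):
        self.advice = [e.predict() for e in self.experts]
        pick = self.rng.choices(range(len(self.experts)), weights=self.weights)[0]
        return self.advice[pick]

    def update(self, taken):
        # Shrink each expert's weight according to the cost of its advice,
        # then let the expert observe the actual outcome.
        for i, expert in enumerate(self.experts):
            self.weights[i] *= 1.0 - self.eta * cost(self.advice[i], taken)
            expert.update(taken)


def run_trace(trace, seed=0):
    """Replay a list of branch outcomes; return our cost, expert costs, weights."""
    experts = [SaturatingCounter(k=2), Constant(True), Constant(False)]
    rwm = RandomizedWeightedMajority(experts, seed=seed)
    total = 0.0
    expert_costs = [0.0] * len(experts)
    for taken in trace:
        total += cost(rwm.predict(), taken)
        for i, advice in enumerate(rwm.advice):
            expert_costs[i] += cost(advice, taken)
        rwm.update(taken)
    return total, expert_costs, rwm.weights


if __name__ == "__main__":
    # Inner loop of section 2.3: taken 49 times, then not taken once.
    trace = ([True] * 49 + [False]) * 20
    total, expert_costs, weights = run_trace(trace)
    print("our cost:", total)
    print("expert costs:", expert_costs)
```

On this synthetic trace the counter and the always-taken expert each accumulate a cost of 20.0 under the placeholder costs, while the always-not-taken expert accumulates 490.0, so its weight shrinks quickly and it is rarely followed. In this sketch the weights are updated immediately after each outcome; delaying the update, as suggested in section 3.1, would move the call to update off the prediction path.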