
Regret Minimizing Branch Prediction
Jeremiah Blocki
Erik Zawadzki
Yuan Zhou
September 27, 2010
1 Problem definition
The instruction pipeline is essential to modern, high-performance processors. Efficiently using each stage of the pipeline is a considerable challenge since some sequences of instructions may require a pipeline stall in order to process correctly. Every stall signifies hardware underutilization, a loss of throughput, and an opportunity for performance improvement. There are three main reasons for a stalled pipeline. Stalls can be classified as being caused by either data dependency (e.g. a read-after-write dependency), resource contention (e.g. separate instructions that simultaneously need a data store), or control flow dependency. The last dependency stems from conditional jumps, and we will examine this issue in our project. A conditional jump instruction causes the current program to branch—either to proceed in sequential order or to jump to a non-sequential branch target. Therefore the next PC value depends on the outcome of a register comparison, and this requires a call to an ALU. This comparison may take several cycles to complete, or in the case of a cache miss it may take hundreds of cycles. Until the outcome is known the processor cannot confidently update the PC and fetch the next instruction. In older processors this uncertainty caused a stall in the pipeline, but most modern processors avoid this problem by predicting the taken branch and speculatively continuing the execution.

Unfortunately there is no oracle for branch prediction. Most branch prediction schemes (cf. § 2) are heuristics that rely on historical branching information. Backing out of a speculative execution is costly and involves flushing part of the pipeline. Therefore, any measure that improves the accuracy of branch prediction will improve the overall performance of a chip. Our project suggests a novel branch predictor that is based on a framework for making decisions under uncertainty known as regret minimization. This is related to, but distinct from, the work by McFarling [7] on using a portfolio of branch predictors. In particular, we are sensitive to the cost of making decisions.

In the following section we will describe previous branch prediction policies. Next we will outline our approach: using regret minimization (section 3.1) and modeling the cost of mispredictions (section 3.2). We discuss how our approach can be implemented, as well as the series of experiments that will determine the effectiveness of our approach (section 3.3).

2 Previous work

In this section, we describe some previously suggested prediction policies. These policies are especially interesting as they can be used in our portfolio of predictors. (This is a known technique for branch prediction. See, for example, McFarling [7].) Three of the major existing branch predictors are the bimodal, local and global predictors. These are well-known predictors that have been studied in detail in, for example, Smith [9], Lee and Smith [5], and Yeh and Patt [11]. We may consider adding additional branch prediction policies to our portfolio of experts as our project progresses.

2.1 Majority predictor (bimodal predictor)

The behavior of typical branches is far from random. Most branches are either usually taken or usually not taken. Think of the branch used by a long loop, such as

for (i = 0; i < 100000; i ++) . . . ;

In this example, the branch is (almost) always taken. The majority predictor [9] is designed to fit cases like this — it takes the majority vote in the recent history for a specific branch. To implement the majority predictor, a k-bit counter is kept for each branch (several branches might share a counter due to the lack of space, but this would impair the prediction performance). Whenever the branch is taken, the counter is increased by 1 (we do not increase it when it is 2^k − 1), and the counter is decreased by 1 when the branch is not taken (similarly, we do not decrease it when it is 0). The highest bit predicts the branching — we take the branch when the highest bit is 1, and do not take the branch when it is 0.
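To make the counter mechanics concrete, here is a minimal Python sketch of the k-bit saturating counter with k = 2 (the class name, the initial counter value, and the toy outcome sequence are our own illustrative choices, not part of any particular hardware design):

```python
# Sketch of the k-bit saturating counter described above, with k = 2.
# The counter saturates at 0 and 2^k - 1; the highest bit gives the prediction.

class BimodalPredictor:
    def __init__(self, k=2):
        self.k = k
        self.max_count = (1 << k) - 1
        self.counter = 1 << (k - 1)     # start at "weakly taken" (our choice)

    def predict(self):
        # Taken iff the highest bit is 1, i.e. counter >= 2^(k-1).
        return self.counter >= (1 << (self.k - 1))

    def update(self, taken):
        if taken:
            self.counter = min(self.counter + 1, self.max_count)  # cap at 2^k - 1
        else:
            self.counter = max(self.counter - 1, 0)               # floor at 0

# A branch that is taken 9 times and then falls through once, as in the loop above.
p = BimodalPredictor()
hits = 0
for outcome in [True] * 9 + [False]:
    if p.predict() == outcome:
        hits += 1
    p.update(outcome)
# hits == 9: only the final not-taken outcome is mispredicted
```

In hardware the counter table would be indexed by low-order PC bits, which is why several branches can alias to one counter as noted above.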
2.2 Local predictor

Consider the following two examples.

• The branch used by the if statement.

for (i = 0; i < 100000; i ++) { if (i & 1) . . . ; else . . . ; }

• The branch used by the inner for loop.

for (i = 0; i < 100000; i ++) { for (j = 0; j < 3; j ++) . . . ; }

In both examples, the branch behaves in a periodic pattern. The length of the first pattern is 2, and the length of the second pattern is 3. A local predictor, suggested in Yeh and Patt [10], assumes the branch's behavior is periodic. It keeps the recent history of the branch and always predicts according to the history and the length of the period. Another issue with the local predictor is how to know the length of the period. Thanks to the powerful regret minimization algorithms, we can introduce a family of local predictors with various period lengths (say, from 2 to 8). The regret minimization algorithm will decide which length is the correct one.

2.3 Local predictor with longer history

A problem with the local predictor is that the record for the history cannot be too large. A pattern with length 50 is often not affordable to keep. An alternative is to keep a simpler, but longer, pattern. Consider the following example.

for (i = 0; i < 100000; i ++) { for (j = 0; j < 50; j ++) . . . ; }

The branch used by the inner loop behaves in a periodic pattern with length 50. This length is too small for the bimodal predictor to get good performance, but too large for the local predictor to keep the history. But the history pattern for this example is simple – the branch is taken 49 times and not taken 1 time. This example inspires us to keep simple patterns of this form – take the branch for t times, and do not take the branch for 1 time; or, do not take the branch for t times, and take the branch for 1 time. In this way, the branch predictor only keeps two values for its decision: the distance between the last two branch-taken events (or branch-not-taken events), and how long ago the last branch-taken event (or branch-not-taken event) happened. We can also generalize this kind of predictor to fit the following example.

for (i = 0; i < 100000; i ++) { for (j = 0; j < i; j ++) . . . ; }

In the example above, the inner loop does not even behave according to a fixed pattern. The pattern, however, becomes longer by one unit each time. But we can easily introduce a new predictor with longer history to deal with this case (and with little cost).

2.4 Global predictor

Sometimes, the direction taken by the current branch may depend on other recent branches. The following code is a good example.

if (x < 1) . . . ;
if (x > 1) . . . ;

If we assume x takes the value 0 with small probability, then the direction taken by the second branch is highly negatively correlated with that of the first branch. If we keep the history of all the recent branches, then at the second branch we can figure out where the first branch went, and predict the opposite direction. To implement this, a family of global predictors has been suggested [11]: the i-th predictor predicts according to the i-th branch before the current one (where the actual prediction can be the same as, or opposite to, the history value).
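The local and global predictors in our portfolio can be sketched as follows (a Python sketch; the class names, the default prediction before history fills, and the toy pattern are our own choices — real implementations use small hardware history tables rather than these data structures):

```python
from collections import deque

class LocalPredictor:
    """Predicts assuming the branch repeats with a fixed period."""
    def __init__(self, period):
        self.period = period
        self.history = deque(maxlen=period)

    def predict(self):
        if len(self.history) < self.period:
            return True               # arbitrary default before history fills
        return self.history[0]        # the outcome from `period` branches ago

    def update(self, taken):
        self.history.append(taken)

class GlobalPredictor:
    """The i-th predictor in the family: predicts the same as (or, with
    negate=True, opposite to) the i-th most recent branch of any address."""
    def __init__(self, i, negate=False):
        self.i = i
        self.negate = negate
        self.history = deque(maxlen=i)

    def predict(self):
        if len(self.history) < self.i:
            return True
        h = self.history[0]           # the i-th branch before the current one
        return (not h) if self.negate else h

    def update(self, taken):
        self.history.append(taken)

# The inner loop `for (j = 0; j < 3; j ++)` yields the repeating outcome
# pattern T, T, F; a period-3 local predictor locks on after one full period.
lp = LocalPredictor(period=3)
correct = 0
for outcome in [True, True, False] * 4:
    if lp.predict() == outcome:
        correct += 1
    lp.update(outcome)
```

Running a family of such predictors with different periods (say 2 through 8) side by side is cheap, and, as described above, the regret minimization algorithm is what arbitrates among them.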
3 Proposed solution

3.1 Regret minimization

Regret minimization is a powerful technique from machine learning theory, which we believe will be a good model for branch prediction. We will illustrate the concept of regret minimization using branch prediction. Consider the following setting: you are a microprocessor and you have just fetched a branch instruction. The branch instruction is dependent on some operations that have not completed. You have three possible actions from which you may choose. First, you could wait for the other instructions to finish so that you know for sure what the next instruction is, but then you might waste precious clock cycles. Second, you could predict that the branch will
be taken and load the corresponding next instruction. Third you can predict that the branch will
not be taken and load the corresponding instructions.
The second and third options are potentially risky
because the decision must be made online. Guess
right and you will improve performance by keeping
the instruction pipeline full with useful instructions.
Guess wrong and you will waste even more time and
resources. Each time you load a branch instruction
you have several "experts" advising you which action
to take. After you make your choice you will soon
discover what the costs for each decision were. The
correct branch prediction will have zero cost, while
a misprediction typically has a greater cost than waiting. This process is repeated many times. Since you
know nothing about the program that is running you
don’t know in advance which experts will perform
well. Perhaps a global predictor will work well, perhaps it will make too many mispredictions. Perhaps
a majority predictor will perform well, perhaps not.
Our regret is the cost of our decisions minus the
cost of the decisions made by the best expert in
hindsight. Using a regret minimization algorithm,
such as the randomized weighted majority algorithm
(Littlestone and Warmuth [6]), you can guarantee
that your performance will be almost as good as the
best expert in hindsight. There are several important
variations of the regret minimization framework.
Dani and Hayes [2] show how to minimize regret
even in the partial information model where you only
see the costs for your particular actions. However,
their bounds are worse than those in [6]. Another
strong variation of regret minimization breaks up
the predictions into stages. During each stage the
algorithm performs at least as well as the best expert
for that particular stage. An excellent summary of
many of these results can be found in Blum and
Mansour [1].
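As a sketch of how the randomized weighted majority algorithm of Littlestone and Warmuth [6] proceeds (the function name, the choice eta = 0.5, and the two toy experts are our own illustrative choices):

```python
import random

def randomized_weighted_majority(expert_losses, eta=0.5, seed=0):
    """expert_losses[t][i] is expert i's loss (in [0, 1]) at round t.
    Returns the algorithm's total realized loss.  The classic guarantee is,
    roughly, expected loss <= (1 + eta) * (best expert's loss) + ln(n) / eta."""
    rng = random.Random(seed)
    n = len(expert_losses[0])
    weights = [1.0] * n
    total = 0.0
    for losses in expert_losses:
        # Follow expert i with probability proportional to its weight.
        i = rng.choices(range(n), weights=weights)[0]
        total += losses[i]
        # Multiplicative update: shrink each weight by (1 - eta)^loss.
        weights = [w * (1.0 - eta) ** l for w, l in zip(weights, losses)]
    return total

# Two toy experts: one always correct (loss 0), one always wrong (loss 1).
losses = [[0.0, 1.0]] * 100
alg_loss = randomized_weighted_majority(losses)
best_loss = min(sum(col) for col in zip(*losses))   # best expert in hindsight
regret = alg_loss - best_loss
```

The bad expert's weight halves every round here, so the algorithm almost always follows the good expert after the first few rounds and the realized regret stays small.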
We believe that regret minimization would have several advantages over static policies because it could allow the processor to dynamically adapt its branch prediction policy to fit the current program. It would also allow us to develop and use better cost metrics than prediction accuracy (see the discussion in section 3.2). Of course, regret minimization does not guarantee good performance unless one of the experts performs well. However, branch predictors are typically able to achieve significantly better accuracy than random guessing in practice. Similarly, if we knew in advance what the best branch prediction policy was, then it would probably be better to use that policy instead of regret minimization, since there is less overhead and a regret minimization algorithm would only guarantee that its performance is nearly as good. However, McFarling [7] notes that different branch prediction policies are suited to different types of programs. This implies that we won't know the best policy until runtime. In fact, the "best policy" could change significantly during any given window of time.

The main challenge with using regret minimization for branch prediction is the higher overhead. For example, we would have to run the logic for each of our experts for each prediction. However, these steps could easily be performed in parallel. Second, the randomized weighted majority algorithm would have to keep track of and update weights for each of the experts. Traditionally this updating happens immediately after we learn the correct result. Finally, the randomized weighted majority algorithm uses these weights to pick a random expert to follow at each stage. If these computations are too complicated, then the branch predictor may take too many cycles to make its prediction. Of course, the branch predictor will only be useful if it can make a good prediction before the branch path is resolved. We believe that we could allow the updating of weights to be delayed so that these more expensive computations are moved off the critical path.

3.2 Modeling assumptions and cost model

One of the major ways that we depart from earlier prediction models is that we are not directly interested in the raw accuracy of our prediction algorithm. We are interested in a policy that reduces regret—this is inherently tied to the cost of decisions. Since not all branch-prediction mistakes have the same cost, we need a cost model to accurately calculate regret. The cost of an incorrectly predicted branch depends on the details of the underlying system [8]. For example, depending on the details of the system, it might be cheaper to speculatively not take a branch than to speculatively take one. Therefore, it may be reasonable to use a branch predictor that is biased towards sequential speculation (this is an intuition that is absent from all previous work). Of course, the actual empirical cost of a decision is the best loss function. This ideal loss function may be impractical to implement in practice. Not only does it require additional on-chip instrumentation, but it also requires counterfactual information about the branch not executed—we need to know the cost of both branches in order to calculate our regret. There are regret-minimization techniques that can function with partially known costs (cf. § 3.1), but the theoretical guarantees associated with these techniques are weaker. This indicates that we need a cost model.
An especially simple one is to calculate the mean
value of taking a branch versus not taking a branch,
but more sophisticated models may further improve
performance. Here are some features that we might
want to use in our cost model:
• Instruction caching: instructions are typically fetched in blocks. Mistakenly taking a branch might be exceptionally expensive if a branch misprediction causes a useless fetch from a slow cache (or worse: memory). Therefore, a large branch offset (or the use of an address) might signal an especially expensive jump.

• Branch targeting [5]: when is the target of a branch known? It is possible that it takes k cycles to determine the branch target. This means that it is cheaper to speculatively simulate not taking the branch than taking it.

• Chaining effects: if instructions are prefetched and there is special hardware that identifies all the branch instructions, then it might be more attractive to speculatively execute sequences of instructions that have a low density of branches or have many predictable branches.

• Pipeline depth [3]: in deep pipelines there will be more instructions to back out of than in a shallower pipeline.

• Synchronicity [4]: asynchronous processors may be able to retain more work after mispredictions since some of the instructions in the pipeline will be independent of the branch. Therefore, if there are enough independent instructions waiting in the reservation station, perhaps we are better off punting the branch decision rather than committing to either branch.

We expect that further architectural considerations for cost will arise as we continue to work on this project.

3.3 Evaluation methodology

To evaluate our policy we will first train our cost model on the execution traces from one dataset, and then run our policy on a second set of traces. One of these will be the SPEC 2000 integer benchmarks, a set of benchmarks similar to the one used in McFarling [7]. We compare our new policy against all the predictors simulated in McFarling [7], including the bimodal, gshare, and combined predictors. Our evaluation will be different from the one in McFarling [7]. We are primarily interested in reducing cost, so our comparison will be based both on prediction accuracy and on cost reduction. Unfortunately, since the trace is static, we will be unable to determine whether our branching policy actually reduces cost. We can only report on whether policies tend to select costly decisions according to our cost model. In order to properly investigate whether our policy reduces cost we would have to integrate our policies into a simulator, but this is a much more complex and time-consuming method of evaluation. Therefore, we will consider the tendency to select less costly decisions (according to our cost model) a useful proxy for actual cost reduction.

4 Milestones

• Milestone 1: For this milestone we expect to have an implementation of the regret-minimization framework and policies completed in a high-level language. We will also have some experimental runs on toy datasets. Furthermore, we will set up trace infrastructure for the later experiments.

• Milestone 2: We will complete, for this milestone, initial experiments using actual trace data on a variety of benchmarks. This may cause us to re-examine some of our assumptions and rework our approach.

References

[1] A. Blum and Y. Mansour. Learning, regret minimization, and equilibria. Algorithmic Game Theory, pages 79–102, 2007.

[2] V. Dani and T.P. Hayes. Robbing the bandit: Less regret in online geometric optimization against an adaptive adversary. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms, page 943. ACM, 2006.

[3] W.W. Hwu, T.M. Conte, and P.P. Chang. Comparing software and hardware schemes for reducing the cost of branches. ACM SIGARCH Computer Architecture News, 17(3):224–233, 1989.

[4] M.S. Lam and R.P. Wilson. Limits of control flow on parallelism. ACM SIGARCH Computer Architecture News, 20(2):46–57, 1992.

[5] J.K.F. Lee and A.J. Smith. Analysis of branch prediction strategies and branch target buffer design. IEEE Computer, 17(1):6–22, 1984.

[6] N. Littlestone and M.K. Warmuth. The weighted majority algorithm. 1989.

[7] S. McFarling. Combining branch predictors. Technical Report TN-36, Digital Western Research Laboratory, 1993.

[8] S. McFarling and J. Hennessy. Reducing the cost of branches. In Proceedings of the 13th annual international symposium on Computer architecture, page 403. IEEE Computer Society Press, 1986.

[9] J.E. Smith. A study of branch prediction strategies. In 25 years of the international symposia on Computer architecture (selected papers), pages 202–215. ACM, 1998.

[10] T.Y. Yeh and Y.N. Patt. Two-level adaptive training branch prediction. In Proceedings of the 24th annual international symposium on Microarchitecture, pages 51–61. ACM, 1991.

[11] T.Y. Yeh and Y.N. Patt. A comparison of dynamic branch predictors that use two levels of branch history. In Proceedings of the 20th annual international symposium on Computer architecture, pages 257–266. ACM, 1993.