Effective Branch Prediction through Caching of Aliasing Branches*

JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 20, 557-574 (2004)
WEI-MING LIN AND AN-YI YANG+
Department of Electrical Engineering
The University of Texas at San Antonio
San Antonio, TX 78249, U.S.A.
E-mail: wlin@utsa.edu
+Faraday Technology Corporation
Hsinchu, 300 Taiwan
E-mail: kevin_y@faraday-tech.com
High-performance CPUs are constantly kept from reaching their expected potential by pipeline delays caused by conditional branches. Precise branch prediction is required to
overcome this performance limitation and is
the key to many techniques for enhancing and exploiting Instruction-Level Parallelism
(ILP). In general, prediction accuracy can be improved by reducing aliases in the prediction table used in a traditional dynamic prediction mechanism. In this paper, we propose
a new technique that significantly reduces the aliases most likely to be destructive by caching
recurring aliasing branches that occur within a small temporal locality of one another.
An extensive pre-simulation analysis on traces supports this claim, showing
that aliasing branches within a small temporal locality account for the majority of all aliases.
This paper further shows that such aliases are more likely to lead to destructive prediction results. Thus, our technique, which adds a small “alias table”
much smaller than the existing prediction table, is capable of
eliminating most aliases, especially the highly performance-degrading repetitive “local” aliases. Our simulation results demonstrate a significant improvement in prediction
accuracy from adopting this technique at very limited extra hardware cost.
In addition, the proposed add-on feature can be easily incorporated into any existing or
advanced dynamic branch predictor to further enhance its prediction performance.
Keywords: branch prediction, instruction-level parallelism, high performance computing,
aliases, dynamic branch prediction
1. INTRODUCTION
In the past decade, by taking advantage of RISC architecture and advanced VLSI
technology, computer designers have been able to exploit more Instruction-Level Parallelism
(ILP) by using deeper pipelines, wider issue rates and superscalar techniques. However,
these techniques suffer from disruption caused by branches during the issue of instructions to functional units. How to mitigate this performance-degrading effect of
branch instructions, which typically make up twenty percent or more of an instruction
stream, deserves more attention.
Received September 19, 2002; revised June 26, 2003; accepted December 10, 2003.
Communicated by Yu-Chee Tseng.
*This research was supported in part by the Office of Naval Research under grants N00014-95-1-0514 and
N00014-96-1-0897, and in part by the Department of Defense/Air Force Office of Scientific Research under
grant F49620-96-1-0472.
Branch prediction is a common technique used to overcome this performance limitation imposed on high-performance architectures and is the key to many techniques for
enhancing ILP. Branch prediction essentially involves a guess at the instruction stream direction that will follow a branch instruction; whenever such a guess is correct, the penalty in pipeline delay is either reduced or completely avoided. Various
branch prediction schemes have been proposed in this area [1, 2, 6-8, 10, 13, 15, 21]. They are usually classified as static or dynamic according to how the prediction is made. Static prediction
schemes always assume the same outcome for any given branch, whereas a dynamic scheme
uses the run-time behavior of branches to adjust its stored information for later predictions. The focus of this
paper is on the dynamic schemes, which usually show far better prediction accuracy than the
static ones.
A typical dynamic prediction mechanism relies on a prediction table to record the
behavior of past branches. One major cause of performance degradation in various dynamic branch prediction schemes is the number of prediction table aliases. An alias
occurs when different branches are mapped to the same entry in this table. Such a problem is unavoidable unless a sufficiently large, and most of the time cost-prohibitive, number of
entries is provided. Reducing the damage caused by aliases without significantly increasing the hardware requirement is therefore a must for effective prediction.
The number of prediction table aliases in different branch prediction approaches may
vary because of their different mapping/hashing schemes; however, aliases are harmful to
prediction accuracy in general. For various reasons, not all aliases lead to penalties.
Some of these reasons come from program behavior, such as the one-time (table initialization) alias between two separate (disjoint) looping branches and non-destructive (or
even constructive) branch behaviors among aliasing branches; others come from special
hardware features, such as the damping effect of two-bit counters. From a probabilistic point of view, alias reduction in general leads to increased prediction accuracy.
In particular, if two aliasing branches alternate in trace order very frequently, the penalty may
become extremely high if such an alias leads to mispredictions. In this paper, we propose
a new technique to significantly alleviate this most destructive aliasing problem by caching aliasing branches that occur within a small temporal locality of one another.
Such a technique, which adds a small “alias table” much
smaller than the prediction table, is capable of removing most aliases. Our argument
for such an investment is based on the observation that, when two aliasing branches are
far apart in time, the alias tends either to be non-destructive or to be masked by
another branch on the same entry within a smaller temporal locality. Thus, we believe
that eliminating repetitive “local” aliases can address most of the alias problems. Our
pre-simulation analysis on traces shows that indeed most aliases happen between
branches within a relatively small temporal locality. Our simulation results demonstrate a
significant improvement in prediction accuracy from adopting this technique at very limited extra cost.
The remainder of the paper is organized as follows. A brief overview of the
well-known counter-based dynamic branch prediction schemes is presented in section 2.
The proposed technique is then described in the following section. In section 4, a trace
analysis for potential improvement is given. It is then followed by our simulation and
performance comparison results. Concluding remarks are given in the last section.
2. DYNAMIC BRANCH PREDICTION
There have been many dynamic branch prediction schemes proposed in the past
decade. A few representative ones are described in the following for the sake of completeness.
• One-Level Predictor:
The prediction table is usually indexed by the lower-order address bits in the program counter (PC), although other portions of the PC have been used as well. Fig. 1 illustrates the design of such a scheme. Each entry in the prediction table (PHT) is used to
provide prediction information for the branch instruction mapped to it, and is implemented by a counter which goes up or down according to the actual outcome of the corresponding branch instruction. Each branch is predicted based on its most recent outcome.
Instead of the simple one-bit counter, a well-known two-bit up-down counter has been
extensively used in this scheme so as to render a damping effect which enhances prediction accuracy for typical reentrant loop constructs. Damage caused by alternating occurrences between two aliasing branches can also be alleviated using the two-bit counters.
Such an observation prompts most later advanced designs to use such a two-bit counter
prediction table as a design base.
Fig. 1. Implementation of simple one-level branch prediction.
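The damping effect of the two-bit counter can be illustrated with a minimal Python sketch. The class name, the zero-initialized counters and the "predict taken when the counter is in its upper half" rule are assumptions made for this illustration; the text does not specify them.

```python
class OneLevelPredictor:
    """Table of saturating up-down counters indexed by low-order PC bits.

    Illustrative sketch only: counters start at 0 (an assumption) and a
    branch is predicted taken when its counter is in the upper half.
    """

    def __init__(self, index_bits, counter_bits=2):
        self.mask = (1 << index_bits) - 1
        self.top = (1 << counter_bits) - 1          # saturation value
        self.threshold = 1 << (counter_bits - 1)    # predict taken at or above
        self.pht = [0] * (1 << index_bits)

    def predict(self, pc):
        return self.pht[pc & self.mask] >= self.threshold

    def update(self, pc, taken):
        i = pc & self.mask
        c = self.pht[i]
        self.pht[i] = min(self.top, c + 1) if taken else max(0, c - 1)


def count_misses(predictor, trace):
    """Run (pc, taken) pairs through a predictor and count mispredictions."""
    misses = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A reentrant loop branch: taken 9 times, then not taken, entered 10 times.
loop_trace = [(0x40, t) for _ in range(10) for t in [True] * 9 + [False]]
one_bit_misses = count_misses(OneLevelPredictor(4, counter_bits=1), loop_trace)
two_bit_misses = count_misses(OneLevelPredictor(4, counter_bits=2), loop_trace)
```

With a one-bit counter the loop branch is mispredicted both at loop exit and at the next loop entry; the two-bit counter absorbs the single not-taken outcome and mispredicts only at the exit once it has warmed up.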
• Correlation-based or Two-Level Adaptive Predictor:
The outcome of a branch is usually affected by some previously executed branches.
Such a correlation could exist among different branch instructions executed temporally
close to one another, or could simply refer to the effect on a branch of its own recent execution behavior. The latter has been partially considered in the simple one-level
two-bit counter design. Such an approach requires a separate table, the so-called history
table, to record the necessary history information. A general design block diagram is
shown in Fig. 2, in which the PHT, organized as a two-dimensional table, is addressed by
two separate indices, the PC index and the history index. History information established
in a history table can be either in per-address (per-branch) format as shown in Fig. 2 or in
global format as shown in Fig. 3. In the per-address case, a per-address history table is
needed, which is also addressed by the PC index. A shift register, the so-called History Register (HR), is usually used to implement each such entry. On the other hand, for the
global format, only one HR is needed, as shown in Fig. 3. This aims at exploiting the
correlation in behavior existing in most programs between recurring identical branches
(as in the per-address case) or between distinct branches adjacent in time (as in the global
case). The HR index and PC index combined are then used to locate the counter in the PHT
for prediction. It is shown in [19] that global history schemes perform well with integer
programs while per-address history schemes are better for floating point programs. Also,
note that the PC index for the history table does not have to come from the same least significant portion of the PC that the PC index for the PHT normally uses. The term
“per-address” refers to the scheme that uses the least significant bits of the PC for such an index,
while “per-set” refers to other selections. In general, such a selection does not
lead to any significant discrepancy in performance.
Fig. 2. Implementation of per-address correlation-based branch prediction.
Fig. 3. Implementation of global correlation-based branch prediction.
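The two-index addressing of the PHT can be sketched in Python as follows. This is only an illustration under stated assumptions: the class and parameter names are hypothetical, the two indices are simply concatenated, counters are two-bit and start at 0, and both history and PC indices use the least significant PC bits.

```python
class TwoLevelPredictor:
    """Two-level predictor sketch with per-address or global history.

    Hypothetical layout: PHT slot = (PC index << hist_bits) | history.
    """

    def __init__(self, pc_bits, hist_bits, per_address=True):
        self.pc_mask = (1 << pc_bits) - 1
        self.hist_mask = (1 << hist_bits) - 1
        self.hist_bits = hist_bits
        self.per_address = per_address
        # One history register per PC index, or a single global register.
        self.history = [0] * ((1 << pc_bits) if per_address else 1)
        self.pht = [0] * (1 << (pc_bits + hist_bits))   # two-bit counters

    def _hist_slot(self, pc):
        return (pc & self.pc_mask) if self.per_address else 0

    def _pht_slot(self, pc):
        history = self.history[self._hist_slot(pc)]
        return ((pc & self.pc_mask) << self.hist_bits) | history

    def predict(self, pc):
        return self.pht[self._pht_slot(pc)] >= 2

    def update(self, pc, taken):
        slot = self._pht_slot(pc)
        c = self.pht[slot]
        self.pht[slot] = min(3, c + 1) if taken else max(0, c - 1)
        h = self._hist_slot(pc)
        # Shift the newest outcome into the history register.
        self.history[h] = ((self.history[h] << 1) | int(taken)) & self.hist_mask


# A branch that strictly alternates taken / not taken.
predictor = TwoLevelPredictor(pc_bits=2, hist_bits=2, per_address=True)
results = []
for k in range(100):
    taken = (k % 2 == 0)
    results.append(predictor.predict(0) == taken)
    predictor.update(0, taken)
```

After a short warm-up the history register distinguishes the two phases of the alternating pattern, so each phase trains its own counter and the branch is predicted correctly, which a single counter per branch cannot achieve.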
• Gshare Predictor:
In the Gshare scheme [9], as shown in Fig. 4, the prediction table is addressed by an
index established by XORing a global history and part of the PC index. The Gshare scheme
does lead to improvement in most cases compared to a simple two-level predictor; however, the exact cause of such an improvement has never been clearly analyzed. (Note
that, in one of the original Gshare designs, the XORing function is performed over the
entire PC index; that is, m is set to be equal to n, which in general leads to worse performance than a simple two-level predictor.)
Fig. 4. Implementation of sharing index branch (Gshare) prediction.
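As a rough illustration of the index computation, the following Python sketch XORs an m-bit global history into an n-bit PC index. Reading n as the PC index width and m as the number of history bits follows the parenthetical remark above; placing the history bits against the high-order end of the index is an assumption of this sketch, since the text does not say which part of the PC index is used.

```python
def gshare_index(pc, history, n, m):
    """Return an n-bit PHT index: m history bits XORed into the PC index.

    Assumption (not from the text): the m history bits are aligned with
    the high-order m bits of the n-bit PC index.
    """
    if not 0 <= m <= n:
        raise ValueError("need 0 <= m <= n")
    pc_index = pc & ((1 << n) - 1)        # least significant n bits of the PC
    hist = history & ((1 << m) - 1)       # most recent m branch outcomes
    return pc_index ^ (hist << (n - m))


partial = gshare_index(0b1011, 0b11, n=4, m=2)     # 1011 xor 1100
full = gshare_index(0b1011, 0b0110, n=4, m=4)      # 1011 xor 0110, the m = n case
plain = gshare_index(0b1011, 0b1111, n=4, m=0)     # no history: plain PC index
```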
• Others:
The possibility of combining different branch predictors is explored by McFarling
in [10]. It comes from the observation that some schemes work well on one type of program but not on another. The selective scheme is implemented with two different
predictors, each making its prediction separately. A third table is then used to
decide between the two prediction outcomes based on various program scenarios. Such
a scheme is claimed to perform well under different circumstances, yet it has a hardware
cost roughly three times that of a non-selective one. A predictor called
LGshare has also been proposed [4] to further improve on Gshare by using both the global and
the per-address history of a branch to predict its behavior. Among many others
in this field, a predictor discussed in [6] is based on Simultaneous Subordinate MicroThreading (SSMT), which provides a new means to improve branch prediction accuracy. SSMT machines run multiple concurrent microthreads in support of the primary
thread to dynamically construct microthreads that can speculatively and accurately
pre-compute branch outcomes along frequently mispredicted paths. Another technique is
introduced in [5] to reduce the pattern history table interference by dynamically identifying some easily predictable branches and inhibiting the pattern history table update for
these branches.
In general, there are a few types of well-known potential problems that would lead
to a misprediction result due to the nature of the predictor employed:
• Initialization −
Every branch instruction that has a predictable behavior needs to have its behavior
history properly established in the prediction table before a meaningful prediction can be
made.
• Alias −
This problem occurs when different branches are mapped to the same entry in the
PHT. Such a problem is unavoidable unless a sufficiently large number of entries to
cover all potential program sizes are provided.
• Undetected Correlation −
Due to the limited size of history register, correlation among branches far apart in
time/trace may not be detected.
• “Random” (Unpredictable) Branch Behavior −
A branch’s behavior, either at times or throughout the life of the program, may be
simply run-time data-dependent which is either completely “random” or unpredictable
based on any of the known branch prediction schemes.
Some of the above problems may further intertwine with each other. For example, if
the overall size of the predictor table is to remain the same, increasing the history
depth (the history register size) to allow more potential correlation to be detected would worsen the alias
problem between different branches. The alias problem remains an elusive factor in many approaches, since whether or not a given alias is
destructive is hard to know in advance. Knowing how to effectively identify predominantly destructive aliases
would help these approaches significantly.
3. PROPOSED TECHNIQUE
3.1 Repetitive Aliases within Small Temporal Locality
As mentioned in the introduction, alias problem occurs when different branches are
mapped to the same entry in the prediction table. The most destructive aliases are the
ones from highly repetitive aliasing branches. Fig. 5 shows a typical example in which
branch instruction A (for the for loop iteration) is assumed to be aliased with branch instruction B (for the if statement inside the loop). Therefore, these two branches will take
turns using the same entry in the prediction table, so that each prediction is instead based
on the other branch’s past behavior. If the loop is assembled such that branch A is
taken only when the end of the loop (i = 1000) is met, then branch A is not taken for the first
999 times. Furthermore, assume that j takes on a value such that branch B is always
taken. Therefore, both branch instructions are going to be using each other’s branch result for prediction, which leads to a highly performance-degrading sequence of aliases. If
a one-bit counter is used, this leads to close to 2,000 consecutive mispredictions (on both A
and B), while about 1,000 mispredictions (on B) occur when a two-bit counter is used,
amounting to misprediction rates of nearly 100% and about 50% over these 2,000 branches, respectively.
for (i = 1; i <= 1000; i++)    /* Branch A */
{
    …
    if (i + j < 2000)          /* Branch B */
        …
    else
        …
}
Fig. 5. An example on potential alias between alternating branches.
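The misprediction counts quoted above can be reproduced with a small Python sketch. It assumes that branch B executes before branch A in each iteration, that counters start at 0 (not taken), and that a branch is predicted taken when its counter is in the upper half; with other initial states the exact figures shift. The function name and trace encoding are hypothetical.

```python
def count_mispredictions(trace, counter_bits, shared_entry):
    """Count mispredictions over a trace of (branch_name, taken) pairs.

    shared_entry=True models the alias: every branch uses one table entry.
    shared_entry=False gives each branch its own entry.
    """
    top = (1 << counter_bits) - 1
    threshold = 1 << (counter_bits - 1)
    counters = {}
    misses = 0
    for name, taken in trace:
        key = "shared" if shared_entry else name
        c = counters.get(key, 0)                 # assumed initial state: 0
        if (c >= threshold) != taken:
            misses += 1
        counters[key] = min(top, c + 1) if taken else max(0, c - 1)
    return misses


# Loop of Fig. 5: B is always taken, A is taken only on the last iteration.
trace = []
for i in range(1, 1001):
    trace.append(("B", True))
    trace.append(("A", i == 1000))

aliased_one_bit = count_mispredictions(trace, 1, shared_entry=True)
aliased_two_bit = count_mispredictions(trace, 2, shared_entry=True)
separate_one_bit = count_mispredictions(trace, 1, shared_entry=False)
separate_two_bit = count_mispredictions(trace, 2, shared_entry=False)
```

Under these assumptions the aliased entry yields 1,999 and 1,001 mispredictions for the one-bit and two-bit counters, while separate entries yield only 2 and 3.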
3.2 Alias Table Technique
The above alias problem can be easily handled by giving an extra “table entry” to B
when alias between A and B is first detected, so that they do not overwrite each other’s
prediction information. Thus, a scheme which is capable of detecting such aliases and
subsequently provides an extra “entry” for the aliasing branches can eliminate all these
mispredictions. Our proposed technique is based on such a notion. Detecting an alias between two instructions requires comparing their indices together with some additional bits of
their PCs, in order to differentiate between a recurring instruction and a distinct one. The index combined with these
additional bits is called the extended index. Similar to the functionality of a cache,
an alias table is proposed to keep track of the most recently accessed branches by maintaining their extended indices for alias detection and their associated prediction information. Fig. 6 illustrates the design of such a scheme. Each entry in the alias table has three
fields in it:
• ext_index: used to store the extended index of the occupying branch instruction for
comparison to identify a “hit” and to detect an alias with other instructions
• index: portion of the ext_index to store the corresponding index value
• LRU_counter: used for replacement policy similar to the typical one adopted for cache
replacement policy, in which the least recently used entry is replaced when a new one
comes in
• pred_bit(s): purpose of which is identical to how the prediction table functions, and a
simple write-back policy is applied to this similar to that used in a cache when replacement occurs
Fig. 6. Design block diagram of the alias table technique.
for the current branch instruction with ext_index and index,
    search the alias table for an ext_index match;
if there exists an ext_index match, say, at entry i
    use alias_table[i].pred_bit(s) for prediction;
        /* use the matched entry for prediction */
    update all LRU counters accordingly;
else /* no match in alias table */
    find the LRU entry to replace, say, entry j;
    pred_table[alias_table[j].index] ← alias_table[j].pred_bit(s);
        /* write the prediction bits of the replaced entry back to the prediction table */
    alias_table[j].pred_bit(s) ← pred_table[index];
        /* retrieve prediction bits from the prediction table for the new entry */
    alias_table[j].ext_index ← ext_index;
        /* store extended index */
    make prediction based on the retrieved bit(s);
    update affected LRU counters;
Fig. 7. The proposed alias table algorithm.
The algorithm for this technique is presented in Fig. 7. In a nutshell, the branch instruction currently issued is first checked for a “hit” in the alias table. If such a hit is
found, prediction is made based on its entry in the alias table leaving the prediction table
untouched. Otherwise, a new entry in the alias table is established for this newly accessed branch. LRU policy is followed to select the one to replace in the alias table and a
simple write back strategy is employed to have the prediction bit(s) of the replaced one
stored back to the prediction table.
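The algorithm of Fig. 7 can be sketched in Python as below. This is an illustrative model, not the hardware design: the class name is hypothetical, an OrderedDict stands in for the LRU counters, entries are filled without eviction until the table is full, and the counter update after the branch resolves (which Fig. 7 does not show) is assumed to be applied to the alias table entry.

```python
from collections import OrderedDict


class AliasTablePredictor:
    """Prediction table fronted by a small LRU alias table (sketch)."""

    def __init__(self, index_bits, ext_bits, entries, counter_bits=2):
        if entries < 1:
            raise ValueError("alias table needs at least one entry")
        self.index_mask = (1 << index_bits) - 1
        self.ext_mask = (1 << (index_bits + ext_bits)) - 1
        self.entries = entries
        self.top = (1 << counter_bits) - 1
        self.threshold = 1 << (counter_bits - 1)
        self.pred_table = [0] * (1 << index_bits)
        # ext_index -> pred_bit(s); order runs from least to most recently used.
        self.alias_table = OrderedDict()

    def predict_and_update(self, pc, taken):
        """Predict the branch at pc, then train on the actual outcome."""
        ext_index = pc & self.ext_mask
        index = ext_index & self.index_mask
        if ext_index in self.alias_table:
            # Hit: use the alias table entry, prediction table untouched.
            self.alias_table.move_to_end(ext_index)
        else:
            if len(self.alias_table) >= self.entries:
                # Evict the LRU entry and write its bits back.
                old_ext, old_bits = self.alias_table.popitem(last=False)
                self.pred_table[old_ext & self.index_mask] = old_bits
            # Retrieve the prediction bits for the new entry.
            self.alias_table[ext_index] = self.pred_table[index]
        bits = self.alias_table[ext_index]
        prediction = bits >= self.threshold
        # Assumed training step: update the cached counter in place.
        self.alias_table[ext_index] = (
            min(self.top, bits + 1) if taken else max(0, bits - 1)
        )
        return prediction


# Branches A and B of Fig. 5 with hypothetical PCs that share the 4-bit
# index (0101) but differ in the single extended-index bit.
PC_A, PC_B = 0b00101, 0b10101
predictor = AliasTablePredictor(index_bits=4, ext_bits=1, entries=2)
misses = 0
for i in range(1, 1001):
    for pc, taken in ((PC_B, True), (PC_A, i == 1000)):
        if predictor.predict_and_update(pc, taken) != taken:
            misses += 1
```

In this run the two aliasing branches keep separate entries in the alias table, so only the warm-up and loop-exit mispredictions remain.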
4. ANALYSIS ON POTENTIAL IMPROVEMENT
Obviously, how effectively the proposed technique eliminates aliases depends on the
size of the alias table, i.e. the number of entries in it. An alias is not considered potentially destructive if the two branches involved do not recur and subsequently use the
wrong information for prediction. With respect to a given alias table size, aliases can be
further divided into three categories according to whether or not the prediction bits of
involved instructions exist and, if so, where they are located at the time when an aliasing
instruction is being accessed. To describe this more precisely, we first define the following two terms:
• active: a branch is considered active when its prediction bit(s) is currently in the alias
table
• valid: a branch is considered valid when its prediction bit(s) is currently valid (present)
in the prediction table
Assuming that A and B are the two aliasing instructions, and A is currently being
accessed with its prediction information potentially tampered by the last access of B, A
can be either active or inactive. The resulting situations are shown in Fig. 8. If A is
active then the alias problem does not exist, as shown in case (1) in the figure. If A is inactive, the outcome depends on whether A still has valid information in the prediction table which
can be retrieved, as shown in case (2). The alias problem exists only when A is inactive and invalid, as shown in case (3).
Fig. 8. Cases of Activeness of aliasing branches.
According to the algorithm in Fig. 7, the activeness and validity of A depend on the
size of the alias table and the exact trace between A and B right before the current access
of A. This can be further illustrated with a pattern of aliases between two branch instructions, for the sake of simplicity in explanation. Assume that A and B are the two aliasing
instructions in a pattern as depicted in a trace example shown in Fig. 9. All occurrences
of instruction A and B in this trace segment are shown. y denotes the “Number of Distinct
Branch Instructions” (NDBI) in the trace between the last occurrence of A and the last
occurrence of B (B inclusive), and x denotes the NDBI between the last occurrence of B
and the current access of A (A inclusive). Let the alias distance of a currently accessed
branch instruction denote the NDBI between it and its most recently accessed alias partner. Let S denote the number of entries in the alias table. Effectiveness of the proposed
technique on such a case depends on the activeness of the two instructions, and thus depends on the relationship among x, y and S, due to the LRU replacement policy adopted.
There are three different cases in terms of the activeness and validity, and note that the
activeness is defined at the time when the current A is being accessed:
1. x < S, y < S and x + y < S: Size of the alias table is large enough to accommodate all
distinct new branches since the last access of A. Therefore, both instructions are active
(as shown in case (1) in Fig. 8), since A still has an active entry in the alias table.
2. x < S and x + y > S: Size of the alias table is not large enough to accommodate all distinct new branches since the last access of A, but is large enough to still retain B.
Therefore, A is inactive (but having a valid entry in the prediction table), while B is
active (as shown in case (2) in Fig. 8), since A’s entry in the alias table has been replaced by B’s but is not overwritten in the prediction table.
3. x > S: Size of the alias table is not large enough to accommodate all distinct new
branches since the last access of B. Therefore, both A and B are inactive and B is occupying the entry in the prediction table (as shown in case (3) in Fig. 8), since both
have been replaced in the alias table and A’s entry has been currently occupied by B.
Fig. 9. A trace example demonstrating a sequence of recurring aliases.
In cases (1) and (2), the alias problem no longer exists, since, when A is accessed, valid
prediction bit(s) can be found either in the alias table or in the prediction table. The alias problem remains in case (3). Thus, we can conclude that, for such a simple alias pattern, if a
recurring aliasing branch instruction has an alias distance (x in the above discussion)
smaller than S, the alias problem associated with this access is eliminated; otherwise the problem remains. Let δA(B) denote such an alias distance
between the two accesses. However, an alias pattern between two branches can be a lot
more complicated than the one just described. For example, in the above trace example,
there may exist more than one access of branch B between the current A and its most
recent access. As shown in the example in Fig. 10, there are a few accesses (three shown)
of B in that duration. Among all these accesses to B, let Z be the set of all NDBI between
every pair of consecutive B accesses. We can see that, if there exists a z ∈ Z such that z >
S, then B is going to overwrite A in the prediction table and thus alias problem occurs at
the time of current A access. Let µA(B) denote the largest one among all such NDBI values. Thus, an alias table of size S can eliminate this alias problem if
S > max{δA(B), µA(B)}    (1)
otherwise the problem remains.
Fig. 10. A trace example demonstrating a sequence of recurring aliases with multiple occurrence.
To cover the general scenario, the case in which an alias exists between the current branch and more than one other distinct branch needs to be addressed. When
more than one instruction is aliased with the currently accessed one, the condition in Eq.
(1) has to be applied to each individual aliasing instruction. That is, in between the most
recent access of A and the current access of it, if none of these aliasing instructions is
replaced and in turn overwrites the information for A in the prediction table, the alias problem no longer exists. Let {A1, A2, …, An} be the set of all branch instructions that are
aliased with branch A during this duration. The condition that needs to be satisfied for
alias elimination becomes
S > max{δA(Ai), µA(Ai) | ∀i, 1 ≤ i ≤ n}
The above maximum value is then referred to as the “Alias-Removing Minimum Entries” (ARME) for branch A. An example is shown in Fig. 11 to illustrate such a general
case, where A is aliased with B, C and D, and the ARME of A equals
max{δA(B), µA(B), δA(C), µA(C), δA(D), µA(D)}
Fig. 11. A trace example demonstrating a sequence of recurring aliases among four instructions.
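One way to compute an ARME value from a trace is sketched below in Python. It follows one reading of the definitions above: the NDBI between two trace positions counts the distinct branches after the first position up to and including the second, δ is measured from the last access of a partner to the current access, and µ is the largest NDBI between consecutive accesses of that partner. The function names, the trace encoding as a list of branch names, and the explicit list of alias partners are assumptions of this sketch.

```python
def ndbi(trace, start, end):
    """Number of distinct branches in trace positions start+1 .. end."""
    return len(set(trace[start + 1:end + 1]))


def arme(trace, current, partners):
    """ARME of the branch accessed at position `current`.

    `partners` names the branches aliased with it. The branch must have
    an earlier occurrence in the trace. Partners that do not occur since
    that earlier occurrence contribute nothing.
    """
    branch = trace[current]
    previous = max(i for i in range(current) if trace[i] == branch)
    worst = 0
    for partner in partners:
        positions = [i for i in range(previous + 1, current)
                     if trace[i] == partner]
        if not positions:
            continue
        delta = ndbi(trace, positions[-1], current)
        mu = max((ndbi(trace, p, q)
                  for p, q in zip(positions, positions[1:])), default=0)
        worst = max(worst, delta, mu)
    return worst


# A recurs at position 6; its alias partner B is accessed twice in between.
example = list("ABCDBEA")
example_delta = ndbi(example, 4, 6)      # distinct branches E, A
example_mu = ndbi(example, 1, 4)         # distinct branches C, D, B
example_arme = arme(example, 6, ["B"])
```

For this trace the sketch gives δ = 2, µ = 3 and an ARME of 3, so by Eq. (1) an alias table with more than three entries is sufficient to remove this alias.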
5. TRACE ANALYSIS AND SIMULATION
A comprehensive trace analysis is first performed on several test programs to show
the potential effectiveness of proposed alias table technique in removing aliases when
tables of different sizes are used. This is then followed by a simulation to confirm the
improvement.
5.1 Simulator and Test Programs
Our trace analysis and simulation are conducted on a SPARC 20 system. Data are
obtained using the Shade (version 5.25) analysis program. Shade is a dynamic code tracer
which combines instruction set simulation, trace generation and custom trace analysis in
a single process. Our test programs include a benchmark program from the Stanford Body
Benchmark Suite, sfloat, and seven standard UNIX utility programs. A brief description
of the branch instructions in these programs is given in Table 1.
Table 1. Testing program description.
program   # instructions   # branches   # taken   # not-taken   branch %
sfloat           7980915       236507    190180         64327       3.0%
gcc               680247       142178     91969         50209      20.9%
grep              490160        99618     54534         45084      20.3%
ls                777372       159900     87542         72358      20.6%
awk               629837       135917     73854         62063      21.6%
chmod             605188       124152     71241         52911      20.5%
pack              167746        23820     13714         10106      14.2%
cc                168385        33677     19155         14522      19.8%
5.2 Trace Analysis on Alias Reduction
A trace analysis is performed on the above eight test programs to show the percentage of alias reduction using different sizes of table. This is achieved by determining the
following two values:
• B: the number of all branch instructions executed that are predicted based on an alias
instruction’s information when no alias table is used, i.e. S = 0.
• R(S): among all the above B branch instructions, the number of those that have an
ARME value smaller than the number of alias table entries, S.
The alias reduction percentage for a given S then becomes R(S)/B. Fig. 12 shows the
analysis results when various numbers of index bits are used. Obviously, the larger the
alias table is, the more aliases can be removed. The results show that about 50% of aliases are eliminated by a table of a mere 25 to 30 entries. This investment is far more effective than enlarging the prediction table. However, the ensuing cost goes up rapidly with
more entries incorporated into the table, including the hardware needed for fast associative comparison for match and alias detection. Thus, one needs to compromise between
cost and performance in selecting a reasonable table size. Another observation leads
us to an important argument: the larger the ARME for a branch instruction becomes,
the less significant its correlation is with its previous branch behavior, and therefore the
less useful the prediction bit(s) in the prediction table become. This in turn implies that
an alias with a larger ARME is less likely to lead to a destructive prediction result even
without the help of an alias table, so there is less advantage in performing alias reduction on aliases with large ARME values. Consequently, for a given alias table size, the
percentage of destructive aliases removed is actually even higher than the alias reduction
rate shown in Fig. 12.
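Given one ARME value for each of the B aliased branch executions, the alias reduction rate R(S)/B can be computed as in the following Python sketch. The function name and the sample ARME values are hypothetical and are not taken from the measured traces.

```python
def alias_reduction_rate(arme_values, table_size):
    """Return R(S)/B for an alias table with table_size (S) entries.

    arme_values holds one ARME per branch execution that would have been
    predicted from an aliasing branch's information with no alias table,
    so B = len(arme_values).
    """
    b = len(arme_values)
    if b == 0:
        return 0.0
    # R(S): executions whose ARME is smaller than the table size.
    r = sum(1 for value in arme_values if value < table_size)
    return r / b


sample_arme = [2, 3, 5, 40, 100]          # made-up values for illustration
rate_small = alias_reduction_rate(sample_arme, 4)
rate_large = alias_reduction_rate(sample_arme, 64)
rate_none = alias_reduction_rate(sample_arme, 0)
```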
Fig. 12. Alias reduction rate versus the alias table.
5.3 Simulation Results
A series of simulation runs are performed by varying the following three parameters
on both one-bit and two-bit counter cases:
• number of index bits
• number of extended index bits
• number of entries in the alias table
Note that, for both the index and extended index, the least significant portion of the PC
is used in our tests. It turns out that one additional bit in the extended index over the
original index is sufficient to provide close to 100% alias detection when a small number of
entries is used in the alias table. Thus all our results presented here are based on this arrangement. Improvement results versus index size and alias table size are shown in Figs. 13
and 14, respectively, using the one-bit counter scheme. Figs. 15 and 16, respectively,
give the corresponding results for the two-bit counter case. The “miss rate improvement
percentage” is defined as:
\[ \frac{M_{orig} - M_{ATT}}{M_{orig}} \]
where M_orig and M_ATT denote the miss rate of the original scheme and that of the proposed scheme,
respectively. Each point in these results corresponds to the average of the results from the
eight test programs.
Fig. 13. Improvement results versus index size for one-bit prediction scheme.
Fig. 14. Improvement results versus alias table size for one-bit prediction scheme.
Our first observation from these results is that the performance improvement from
the proposed technique is a little more significant in the one-bit counter case than in the
two-bit one, due to the extra damping capability already present in the latter. The
improvement percentage decreases, in general, when the number of index bits used is
increased. The reason for this is that fewer aliases remain to be tackled when the
size of the prediction table increases. The effect of increasing the size of the alias table (i.e.
the number of entries in the table) is in general beneficial, while not very significant in
some cases.
Fig. 15. Improvement results versus index size for two-bit prediction scheme.
Fig. 16. Improvement results versus alias table size for two-bit prediction scheme.
6. CONCLUSIONS
In this paper, we presented a very cost-efficient branch prediction scheme capable of
eliminating most destructive aliases. The proposed technique achieves an improvement in
misprediction rate of up to 10% in some cases. More analysis is needed to further identify
other patterns of destructive aliases for better designs.
REFERENCES
1. T. Ball and J. R. Larus, “Branch prediction for free,” in Proceedings of ACM
SIGPLAN 1993 Conference on Programming Language Design and Implementation,
1993, pp. 300-313.
2. B. K. Bray and M. J. Flynn, “Strategies for branch target buffers,” in Proceedings of
24th Workshop on Microprogramming and Microarchitecture, 1991, pp. 42-49.
3. B. Calder and D. Grunwald, “Fast and accurate instruction fetch and branch
prediction,” in Proceedings of 21st International Symposium on Computer
Architecture, 1994, pp. 2-11.
4. M. C. Chang and Y. W. Chou, “Branch prediction using both global and local branch
history information,” Computers and Digital Techniques, IEE Proceedings, Vol. 149,
2002, pp. 33-38.
5. P. Y. Chang, M. Evers, and Y. N. Patt, “Improving branch prediction accuracy by
reducing pattern history table interference,” in Proceedings of International
Conference on Parallel Architectures and Compilation Techniques, 1996, pp. 48-57.
6. R. S. Chappell, F. Tseng, A. Yoaz, and Y. N. Patt, “Difficult-path branch prediction
using subordinate microthreads,” in Proceedings of 29th Annual International
Symposium on Computer Architecture, 2002, pp. 307-317.
7. J. Fisher and S. Freudenberger, “Predicting conditional branch direction from
previous runs of a program,” in Proceedings of 5th Annual International Conference
on Architectural Support for Programming Languages and Operating Systems, 1992,
pp. 85-95.
8. J. K. F. Lee and A. J. Smith, “Branch prediction strategies and branch target buffer
design,” IEEE Computer, 1984, pp. 6-22.
9. W. M. Lin and R. Madhavaram, “Advanced branch prediction based on a
generalized predictor,” in Proceedings of the 18th International Conference on
Computers and Their Applications (CATA), 2003, pp. 304-307.
10. S. McFarling, “Combining branch predictors,” Technical Report TN-36, Digital
Western Research Laboratory, 1993.
11. S. McFarling and J. L. Hennessy, “Reducing the cost of branches,” in Proceedings of
13th Annual International Symposium on Computer Architecture, 1986, pp. 396-403.
12. S. T. Pan, K. So, and J. T. Rahmeh, “Improving the accuracy of dynamic branch
prediction using branch correlation,” in Proceedings of 5th Annual International
Conference on Architectural Support for Programming Languages and Operating
Systems, 1992, pp. 76-84.
13. D. A. Patterson and J. L. Hennessy, Computer Architecture: A Quantitative
Approach, 2nd Edition, Morgan Kaufmann Publishers, Inc., 1995.
14. C. H. Perleberg and A. J. Smith, “Branch target buffer design and optimization,”
IEEE Transactions on Computers, Vol. 42, 1993, pp. 396-412.
15. J. E. Smith, “A study of branch prediction strategies,” in Proceedings of 8th Annual
International Symposium on Computer Architecture, 1981, pp. 135-147.
16. Z. Su and M. Zhou, “A comparative analysis of branch prediction schemes,”
Technical Report, Computer Science Division, University of California at Berkeley,
1995.
17. Shade Manual, Sun Microsystems, 1995.
18. T. Y. Yeh and Y. N. Patt, “Two-level adaptive branch prediction,” in Proceedings of
24th Annual ACM/IEEE International Symposium and Workshop on
Microarchitecture, 1991, pp. 51-61.
19. T. Y. Yeh and Y. N. Patt, “Alternative implementations of two-level adaptive branch
prediction,” in Proceedings of 19th International Symposium on Computer
Architecture, 1992, pp. 124-134.
20. T. Y. Yeh and Y. N. Patt, “A comparison of dynamic branch predictors that use two
levels of branch history,” in Proceedings of 20th Annual International Symposium on
Computer Architecture, 1993, pp. 257-266.
21. T. Y. Yeh and Y. N. Patt, “Two-level adaptive branch prediction and instruction
fetch mechanism for high performance superscalar processors,” Technical Report
CSE-TR-182-93, Computer Science and Engineering Division, University of
Michigan, 1993.
22. C. Young and M. D. Smith, “Improving the accuracy of static branch prediction
using branch correlation,” Technical Report 06-95, Center for Research in
Computing Technology, Harvard University, 1995.
23. C. Young, N. Gloy and M. D. Smith, “A comparative analysis of schemes for
correlated branch prediction,” in Proceedings of 22nd Annual International
Symposium on Computer Architecture, 1995, pp. 276-286.
Wei-Ming Lin (林維明) received the B.S. degree in Electrical Engineering from National Taiwan University, Taipei,
Taiwan, in 1982, and the M.S. and Ph.D. degrees in Electrical Engineering from the University of Southern California, Los Angeles,
in 1986 and 1991, respectively. He was an assistant professor in
the Department of Electrical and Computer Engineering at Mississippi State University before joining the University of Texas at
San Antonio (UTSA) in 1993, and, since 1998, he has been an
associate professor of Electrical Engineering there. Dr. Lin has
published more than 55 technical papers in international journals
and conferences in the areas of distributed and parallel computing and computer architecture. He has served on the program committees of many international conferences and will
serve as the program chair for the International Conference on Computer Applications in
Industry and Engineering. He received the Best Paper Award for “Load Balancing Technique for Parallel Search with Statistical Model” in the 1995 International Phoenix Conference on Computer Communications, and the Best Paper Award (runner up) for “Sorting-Based Rejection Techniques for Fast Random Number Generation” in the Fifth International Conference on Intelligent Systems.
An-Yi Yang (楊安義) received the B.S. degree in Industrial
Engineering from Chung Cheng Institute of Technology, Taiwan,
in 1987, and the M.S. degree in Electrical Engineering from the University of Texas at San Antonio in 2000. He is currently associated with the Faraday Technology Corporation in Taiwan.