JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 20, 557-574 (2004)

Effective Branch Prediction through Caching of Aliasing Branches*

WEI-MING LIN AND AN-YI YANG+
Department of Electrical Engineering
The University of Texas at San Antonio
San Antonio, TX 78249, U.S.A.
E-mail: wlin@utsa.edu
+ Faraday Technology Corporation
Hsinchu, 300 Taiwan
E-mail: kevin_y@faraday-tech.com

High-performance CPUs are constantly kept from reaching their expected potential by pipeline delays caused by conditional branches. Precise branch prediction is required to overcome this limitation and is the key to many techniques for enhancing and exploiting Instruction-Level Parallelism (ILP). In general, prediction accuracy can be improved by reducing aliases in the prediction table used in a traditional dynamic prediction mechanism. In this paper, we propose a new technique that significantly reduces the aliases most likely to be destructive by caching recurring aliasing branches that occur within a small temporal locality of one another. An extensive pre-simulation analysis on traces supports this approach, showing that aliasing branches within a small temporal locality account for the majority of all aliases. This paper further shows that such aliases are more likely to lead to destructive prediction results. Thus, our technique, which adds a small “alias table” much smaller than the prediction table normally used, is capable of eliminating most aliases, especially the highly performance-degrading repetitive “local” aliases. Our simulation results demonstrate a significant improvement in prediction accuracy from adopting this technique at very limited extra hardware cost. In addition, the proposed add-on feature can be easily incorporated into any existing or advanced dynamic branch predictor to further enhance its prediction performance.
Keywords: branch prediction, instruction-level parallelism, high performance computing, aliases, dynamic branch prediction

Received September 19, 2002; revised June 26, 2003; accepted December 10, 2003.
Communicated by Yu-Chee Tseng.
* This research was supported in part by the Office of Naval Research under grants N00014-95-1-0514 and N00014-96-1-0897, and in part by the Department of Defense/Air Force Office of Scientific Research under grant F49620-96-1-0472.

1. INTRODUCTION

In the past decade, by taking advantage of RISC architecture and advanced VLSI technology, computer designers have been able to exploit more Instruction-Level Parallelism (ILP) by using deeper pipelines, wider issue rates and superscalar techniques. However, these techniques suffer from disruption caused by branches during the issue of instructions to functional units. More attention has to be paid to mitigating this performance-degrading effect of branch instructions, which typically make up twenty percent or more of an instruction stream.

Branch prediction is a common technique used to overcome this performance limitation imposed on high performance architectures and is the key to many techniques for enhancing ILP. Branch prediction essentially involves a guess at the direction the instruction stream will take after a branch instruction; whenever such a guess is correct, the penalty in pipeline delay is either reduced or completely avoided. Various branch prediction schemes have been proposed in this area [1, 2, 6-8, 10, 13, 15, 21]. They are usually classified as static or dynamic according to how a prediction is made. A static prediction scheme always assumes the same outcome for any given branch, whereas a dynamic scheme uses the run-time behavior of branches to adjust its database for later predictions. The focus of this paper is on the dynamic schemes, which usually show far better prediction accuracy than the static ones.
A typical dynamic prediction mechanism relies on a prediction table to record the behavior of past branches. One major cause of performance degradation in dynamic branch prediction schemes is aliasing in the prediction table. An alias occurs when different branches are mapped to the same entry in this table. Such a problem is unavoidable unless a sufficiently large number of entries is provided, which is cost-prohibitive most of the time. Reducing the damage caused by aliases without significantly increasing the hardware requirement is therefore essential for effective prediction. The amount of prediction table aliasing may vary among branch prediction approaches because of their different mapping/hashing schemes; however, aliases are harmful to prediction accuracy in general.

For various reasons, not all aliases lead to penalties. Some of these reasons come from program behavior, such as the one-time (table initialization) alias between two separate (disjoint) looping branches and non-destructive (or even constructive) branch behaviors among aliasing branches; others come from special hardware features, such as the damping effect of two-bit counters. From a probabilistic point of view, alias reduction in general leads to higher prediction accuracy. In particular, if two aliasing branches alternate very frequently in trace order, the penalty may become extremely high if such an alias leads to mispredictions. In this paper, we propose a new technique to significantly alleviate this most destructive aliasing problem by caching aliasing branches that occur within a small temporal locality of one another. This technique, which adds a small “alias table” much smaller than the prediction table, is capable of removing most aliases.
Our argument for such an investment is based on the observation that, when two aliasing branches are far apart in time, the alias tends to be non-destructive or is simply masked by another branch using the same entry within a smaller temporal locality. Thus, we believe that eliminating repetitive “local” aliases can address most of the alias problems. Our pre-simulation analysis on traces shows that most aliases indeed happen between branches within a relatively small temporal locality. Our simulation results demonstrate a significant improvement in prediction accuracy from adopting this technique at very limited extra cost.

The remainder of the paper is organized as follows. A brief overview of the well-known counter-based dynamic branch prediction schemes is presented in section 2. The proposed technique is then described in the following section. In section 4, an analysis of the potential improvement is given. It is then followed by our trace analysis, simulation and performance comparison results. Concluding remarks are given in the last section.

2. DYNAMIC BRANCH PREDICTION

Many dynamic branch prediction schemes have been proposed in the past decade. A few representative ones are described in the following for the sake of completeness.

• One-Level Predictor: The prediction table is usually indexed by the lower-order address bits of the program counter (PC), although other portions of the PC have been used as well. Fig. 1 illustrates the design of such a scheme. Each entry in the prediction table (PHT) provides prediction information for the branch instruction mapped to it, and is implemented by a counter which goes up or down according to the actual outcome of the corresponding branch instruction. Each branch is predicted based on its most recent outcome.
Instead of the simple one-bit counter, the well-known two-bit up-down counter has been extensively used in this scheme to provide a damping effect that enhances prediction accuracy for typical reentrant loop constructs. Damage caused by alternating occurrences of two aliasing branches can also be alleviated using two-bit counters. This observation has prompted most later advanced designs to use a two-bit counter prediction table as a design base.

Fig. 1. Implementation of simple one-level branch prediction.

• Correlation-based or Two-Level Adaptive Predictor: The outcome of a branch is usually affected by some previously executed branches. Such a correlation could exist among different branch instructions executed temporally close to one another, or could simply refer to the effect on a branch of its own recent execution behavior. The latter has been partially considered in the simple one-level two-bit counter design. Such an approach requires a separate table, the so-called history table, to record the necessary history information. A general design block diagram is shown in Fig. 2, in which the PHT, organized as a two-dimensional table, is addressed by two separate indices, the PC index and the history index. History information established in a history table can be either in per-address (per-branch) format as shown in Fig. 2 or in global format as shown in Fig. 3. In the per-address case, a per-address history table is needed, which is also addressed by the PC index. A shift register, the so-called History Register (HR), is usually used to implement each such entry. On the other hand, for the global format, only one HR is needed, as shown in Fig. 3.

Fig. 2. Implementation of per-address correlation-based branch prediction.

Fig. 3. Implementation of global correlation-based branch prediction.
This aims at exploiting the correlation in behavior that exists in most programs between recurring identical branches (as in the per-address case) or between distinct branches adjacent in time (as in the global case). The HR index and PC index combined are then used to locate the counter in the PHT for prediction. It is shown in [19] that global history schemes perform well with integer programs while per-address history schemes are better for floating point programs. Also, note that the PC index for the history table does not have to come from the same least significant portion of the PC that the PC index for the PHT normally uses. The so-called “per-address” scheme refers to the one that uses the least significant bits of the PC for such an index, while “per-set” refers to any other selection. In general, this selection does not lead to any significant discrepancy in performance.

• Gshare Predictor: In the Gshare scheme [9], as shown in Fig. 4, the prediction table is addressed by an index established by XORing a global history and part of the PC index. The Gshare scheme does lead to an improvement in most cases compared to a simple two-level predictor; however, the exact cause of this improvement has never been clearly analyzed. (Note that, in one of the original Gshare designs, the XORing function is performed over the entire PC index; that is, m is set equal to n, which in general leads to worse performance than a simple two-level predictor.)

Fig. 4. Implementation of sharing index branch (Gshare) prediction.

• Others: The possibility of combining different branch predictors is exploited by McFarling in [10]. It comes from the observation that some schemes work well on one type of program but not on another. The selective scheme is implemented with two different predictors, each making its prediction separately.
A third table is then used to decide between the two prediction outcomes based on various program scenarios. Such a scheme is claimed to perform well under different circumstances, yet its hardware cost is roughly three times that of a non-selective one. A predictor called LGshare has also been proposed [4] to further improve on Gshare by using both the global and the per-address history of a branch to predict its behavior. Among many others in this field, a predictor discussed in [6] is based on Simultaneous Subordinate MicroThreading (SSMT), which provides a new means to improve branch prediction accuracy. SSMT machines run multiple concurrent microthreads in support of the primary thread, dynamically constructing microthreads that can speculatively and accurately pre-compute branch outcomes along frequently mispredicted paths. Another technique is introduced in [5] to reduce pattern history table interference by dynamically identifying some easily predictable branches and inhibiting the pattern history table update for these branches.

In general, there are a few types of well-known potential problems that lead to mispredictions due to the nature of the predictor employed:

• Initialization: Every branch instruction that has a predictable behavior needs to have its behavior history properly established in the prediction table before a meaningful prediction can be made.
• Alias: This problem occurs when different branches are mapped to the same entry in the PHT. Such a problem is unavoidable unless a number of entries sufficient to cover all potential program sizes is provided.
• Undetected Correlation: Due to the limited size of the history register, correlation among branches far apart in time/trace may not be detected.
• “Random” (Unpredictable) Branch Behavior: A branch’s behavior, either at times or throughout the life of the program, may simply be run-time data-dependent, making it either completely “random” or unpredictable by any of the known branch prediction schemes.

Some of the above problems may further intertwine with each other. For example, if the overall size of the predictor table is to remain the same, increasing the history depth (the history register size) to allow more potential correlation to be detected worsens the alias problem between different branches. The alias problem remains an elusive factor in many approaches because whether a given alias is destructive appears random. Knowing how to effectively identify predominantly destructive aliases would help these approaches significantly.

3. PROPOSED TECHNIQUE

3.1 Repetitive Aliases within Small Temporal Locality

As mentioned in the introduction, the alias problem occurs when different branches are mapped to the same entry in the prediction table. The most destructive aliases are the ones from highly repetitive aliasing branches. Fig. 5 shows a typical example in which branch instruction A (for the for loop iteration) is assumed to be aliased with branch instruction B (for the if statement inside the loop).

    for (i = 1; i <= 1000; i++) {    /* Branch A */
        …
        if (i + j < 2000)            /* Branch B */
            …
        else
            …
    }

Fig. 5. An example on potential alias between alternating branches.

Therefore, these two branches will take turns using the same entry in the prediction table, so that each prediction is instead based on the other branch’s past behavior. If the loop is assembled such that branch A is taken only when the end of the loop (i = 1000) is met, then branch A is not taken for the first 999 times. Furthermore, assume that j takes on a value such that branch B is always taken.
Therefore, both branch instructions are going to use each other’s branch result for prediction, which leads to a highly performance-degrading sequence of aliases. If a one-bit counter is used, this leads to close to 2,000 consecutive mispredictions (on both A and B), while about 1,000 mispredictions (on B) occur when a two-bit counter is used, amounting to misprediction rates of nearly 100% and 50% over these 2,000 branches, respectively.

3.2 Alias Table Technique

The above alias problem can be easily handled by giving an extra “table entry” to B when the alias between A and B is first detected, so that they do not overwrite each other’s prediction information. Thus, a scheme which is capable of detecting such aliases and subsequently provides an extra “entry” for the aliasing branches can eliminate all these mispredictions. Our proposed technique is based on this notion. Detecting an alias between two instructions requires comparing their indices and some additional bits of their PCs to differentiate between recurring instructions and distinct ones. The index together with these additional bits is used for this purpose and is referred to as the extended index. Similar to the functionality of a cache, an alias table is proposed to keep track of the most recently accessed branches by maintaining their extended indices for alias detection and their associated prediction information. Fig. 6 illustrates the design of such a scheme.
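The alias detection described above can be illustrated with a minimal Python sketch. The function names `split_pc` and `classify` and the returned labels are hypothetical, and the sketch assumes that both the index and the extended index are taken from the least significant bits of the PC, with one additional bit in the extended index by default.

```python
def split_pc(pc, index_bits, extra_bits=1):
    """Return (index, ext_index), both taken from the low-order PC bits."""
    index = pc & ((1 << index_bits) - 1)
    ext_index = pc & ((1 << (index_bits + extra_bits)) - 1)
    return index, ext_index


def classify(pc_a, pc_b, index_bits, extra_bits=1):
    """Compare two branch addresses the way an alias detector could.

    'no conflict': the branches map to different prediction table entries.
    'same branch': the extended indices match, so the detector treats the
                   two accesses as a recurrence of one instruction.
    'alias':       same index but different extended index.
    """
    index_a, ext_a = split_pc(pc_a, index_bits, extra_bits)
    index_b, ext_b = split_pc(pc_b, index_bits, extra_bits)
    if index_a != index_b:
        return "no conflict"
    if ext_a == ext_b:
        return "same branch"
    return "alias"
```

With a four-bit index, the addresses 0b00011 and 0b10011 share index 3 but differ in the one extra bit, so they are reported as an alias. Two distinct branches that also agree in the extra bits would go undetected in this sketch, which is the price of keeping the extended index short.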
Each entry in the alias table has the following fields (index is a portion of ext_index rather than separately stored information):

• ext_index: used to store the extended index of the occupying branch instruction for comparison to identify a “hit” and to detect an alias with other instructions
• index: the portion of ext_index that stores the corresponding index value
• LRU_counter: used for a replacement policy similar to the one typically adopted for caches, in which the least recently used entry is replaced when a new one comes in
• pred_bit(s): the purpose of which is identical to that of an entry in the prediction table; a simple write-back policy, similar to that used in a cache, is applied to it when replacement occurs

Fig. 6. Design block diagram of the alias table technique.

    for the current branch instruction with ext_index and index,
        search the alias table for an ext_index match;
    if there exists an ext_index match, say, at entry i
        use alias_table[i].pred_bit(s) for prediction;    /* use the matched entry for prediction */
        update all LRU counters accordingly;
    else    /* no match in alias table */
        find the LRU entry to replace, say, entry j;
        pred_table[alias_table[j].index] ← alias_table[j].pred_bit(s);    /* write the prediction bits of the replaced one back to the prediction table */
        alias_table[j].pred_bit(s) ← pred_table[index];    /* retrieve prediction bits from the prediction table for the new one */
        alias_table[j].ext_index ← ext_index;    /* store extended index */
        make prediction based on the retrieved bit(s);
        update affected LRU counters;

Fig. 7. The proposed alias table algorithm.

The algorithm for this technique is presented in Fig. 7. In a nutshell, the branch instruction currently issued is first checked for a “hit” in the alias table. If such a hit is found, the prediction is made based on its entry in the alias table, leaving the prediction table untouched. Otherwise, a new entry in the alias table is established for this newly accessed branch.
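The steps of Fig. 7 can be sketched in Python as follows. This is an illustrative model rather than the hardware design: the class and function names are hypothetical, the LRU counters are modelled with an ordered dictionary, all counters are assumed to start at 0, the outcome of a branch is assumed to train the cached prediction bits, and an alias table size of 0 is used here to mean "no alias table".

```python
from collections import OrderedDict


class AliasTablePredictor:
    """Counter-based predictor with a small LRU alias table in front of it."""

    def __init__(self, index_bits, extra_bits=1, entries=4, counter_bits=2):
        self.index_mask = (1 << index_bits) - 1
        self.ext_mask = (1 << (index_bits + extra_bits)) - 1
        self.entries = entries                      # S; 0 disables the alias table
        self.max_count = (1 << counter_bits) - 1
        self.threshold = 1 << (counter_bits - 1)    # predict taken at or above this
        self.pred_table = [0] * (1 << index_bits)   # assumed to start at 0
        # ext_index -> [index, counter], ordered from least to most recently used
        self.alias_table = OrderedDict()

    def _step(self, counter, taken):
        """Saturating up/down update of a prediction counter."""
        return min(counter + 1, self.max_count) if taken else max(counter - 1, 0)

    def predict_and_update(self, pc, taken):
        """Predict the branch at pc, then train on its actual outcome."""
        index = pc & self.index_mask
        if self.entries == 0:                       # baseline: prediction table only
            prediction = self.pred_table[index] >= self.threshold
            self.pred_table[index] = self._step(self.pred_table[index], taken)
            return prediction
        ext_index = pc & self.ext_mask
        entry = self.alias_table.get(ext_index)
        if entry is not None:                       # hit: prediction table untouched
            self.alias_table.move_to_end(ext_index)
        else:                                       # miss: allocate an entry
            if len(self.alias_table) >= self.entries:
                # Evict the LRU entry and write its bits back to the prediction table.
                _, (old_index, old_counter) = self.alias_table.popitem(last=False)
                self.pred_table[old_index] = old_counter
            # Retrieve the prediction bits for the newly accessed branch.
            entry = [index, self.pred_table[index]]
            self.alias_table[ext_index] = entry
        prediction = entry[1] >= self.threshold
        entry[1] = self._step(entry[1], taken)      # assumption: train the cached bits
        return prediction


def count_mispredictions(predictor, trace):
    """trace is a list of (pc, taken) pairs in execution order."""
    return sum(predictor.predict_and_update(pc, taken) != taken for pc, taken in trace)


def fig5_trace(pc_a, pc_b, iterations=1000):
    """Fig. 5 pattern: B always taken, A taken only on the last iteration.

    Executing B before A in each iteration is an assumption of this sketch.
    """
    trace = []
    for i in range(1, iterations + 1):
        trace.append((pc_b, True))
        trace.append((pc_a, i == iterations))
    return trace
```

As a usage example, take the assumed addresses 0b00011 for A and 0b10011 for B with a four-bit index. Without the alias table, the Fig. 5 trace yields 1,999 mispredictions with one-bit counters and 1,001 with two-bit counters in this sketch, in line with the figures quoted in section 3.1. With a two-entry alias table the counts drop to 2 and 3, respectively. The two-bit baseline depends on the assumed starting state of the counter, and other starting states can give different counts.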
The LRU policy is followed to select the entry to replace in the alias table, and a simple write-back strategy is employed to store the prediction bit(s) of the replaced entry back to the prediction table.

4. ANALYSIS ON POTENTIAL IMPROVEMENT

Obviously, how effectively the proposed technique eliminates aliases depends on the size of the alias table, i.e. the number of entries in it. An alias is not considered potentially destructive if the two branches involved do not recur and subsequently use the wrong information for prediction. With respect to a given alias table size, aliases can be further divided into three categories according to whether or not the prediction bits of the involved instructions exist and, if so, where they are located at the time when an aliasing instruction is being accessed. To describe this more precisely, we first define the following two terms:

• active: a branch is considered active when its prediction bit(s) is currently in the alias table
• valid: a branch is considered valid when its prediction bit(s) is currently valid (present) in the prediction table

Assuming that A and B are the two aliasing instructions, and A is currently being accessed with its prediction information potentially tampered with by the last access of B, A can be either active or inactive. These two situations are shown in Fig. 8. If A is active, then the alias problem does not exist, as shown in case (1) in the figure. If A is inactive, the outcome still depends on whether A has “valid” information in the prediction table which can be retrieved, as shown in case (2). The alias problem exists only when A is inactive and invalid, as shown in case (3).

Fig. 8. Cases of Activeness of aliasing branches.

According to the algorithm in Fig. 7, the activeness and validity of A depend on the size of the alias table and the exact trace between A and B right before the current access of A.
This can be further illustrated with a pattern of aliases between two branch instructions, for the sake of simplicity in explanation. Assume that A and B are the two aliasing instructions in a pattern as depicted in the trace example shown in Fig. 9. All occurrences of instructions A and B in this trace segment are shown. y denotes the “Number of Distinct Branch Instructions” (NDBI) in the trace between the last occurrence of A and the last occurrence of B (B inclusive), and x denotes the NDBI between the last occurrence of B and the current access of A (A inclusive). Let the alias distance of a currently accessed branch instruction denote the NDBI between it and its most recently accessed alias partner. Let S denote the number of entries in the alias table. The effectiveness of the proposed technique in such a case depends on the activeness of the two instructions, and thus on the relationship among x, y and S, due to the LRU replacement policy adopted. There are three different cases in terms of activeness and validity; note that activeness is defined at the time when the current A is being accessed:

1. x < S, y < S and x + y < S: The size of the alias table is large enough to accommodate all distinct new branches since the last access of A. Therefore, both instructions are active (as shown in case (1) in Fig. 8), since A still has an active entry in the alias table.
2. x < S and x + y > S: The size of the alias table is not large enough to accommodate all distinct new branches since the last access of A, but is large enough to still retain B. Therefore, A is inactive (but has a valid entry in the prediction table), while B is active (as shown in case (2) in Fig. 8), since A’s entry in the alias table has been replaced but B has not overwritten it in the prediction table.
3. x > S: The size of the alias table is not large enough to accommodate all distinct new branches since the last access of B.
Therefore, both A and B are inactive and B is occupying the entry in the prediction table (as shown in case (3) in Fig. 8), since both have been replaced in the alias table and A’s entry in the prediction table is currently occupied by B.

Fig. 9. A trace example demonstrating a sequence of recurring aliases.

In cases (1) and (2), the alias problem no longer exists, since, when A is accessed, valid prediction bit(s) can be found either in the alias table or in the prediction table. The alias problem remains in case (3). Thus, we can conclude that, for such a simple alias pattern, if a recurring aliasing branch instruction has an alias distance (x in the above discussion) smaller than S, the alias problem originally associated with this access is eliminated; otherwise the problem remains. Let $\delta_A(B)$ denote this alias distance between the two accesses.

However, an alias pattern between two branches can be a lot more complicated than the one just described. For example, in the above trace example, there may exist more than one access of branch B between the current A and its most recent access. As shown in the example in Fig. 10, there are a few accesses (three shown) of B in that duration. Among all these accesses to B, let Z be the set of all NDBI values between every pair of consecutive B accesses. We can see that, if there exists a z ∈ Z such that z > S, then B is going to overwrite A in the prediction table and thus the alias problem occurs at the time of the current A access. Let $\mu_A(B)$ denote the largest of all such NDBI values. Thus, an alias table of size S can eliminate this alias problem if

    $S > \max\{\delta_A(B), \mu_A(B)\}$    (1)

otherwise the problem remains.

Fig. 10. A trace example demonstrating a sequence of recurring aliases with multiple occurrence.

To cover the general scenario, the case in which an alias exists between the current branch and more than one other distinct branch needs to be addressed.
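For the two-branch patterns of Figs. 9 and 10, the condition in Eq. (1) can be evaluated mechanically on a trace. The Python sketch below is one possible reading of the definitions: the function names are hypothetical, a trace is assumed to be a list of branch identifiers in execution order, and the NDBI between two trace positions is assumed to count the distinct branches after the earlier position up to and including the later one. Eq. (1) is applied literally as stated.

```python
def ndbi(trace, start, end):
    """Distinct branches in trace positions start+1 .. end (end inclusive)."""
    return len(set(trace[start + 1:end + 1]))


def alias_distance(trace, cur, partner):
    """delta: NDBI from the last access of partner up to the access at cur.

    The partner must have been accessed at least once before cur.
    """
    last = max(i for i in range(cur) if trace[i] == partner)
    return ndbi(trace, last, cur)


def max_partner_gap(trace, cur, partner):
    """mu: largest NDBI between consecutive partner accesses that fall
    between the previous access of trace[cur] and the access at cur."""
    prev = max((i for i in range(cur) if trace[i] == trace[cur]), default=-1)
    hits = [i for i in range(prev + 1, cur) if trace[i] == partner]
    return max((ndbi(trace, a, b) for a, b in zip(hits, hits[1:])), default=0)


def alias_eliminated(trace, cur, partner, table_size):
    """Eq. (1): an alias table with table_size entries removes the alias."""
    delta = alias_distance(trace, cur, partner)
    mu = max_partner_gap(trace, cur, partner)
    return table_size > max(delta, mu)
```

For the hypothetical trace A, c, B, d, e, B, f, A with the last A as the current access, the sketch gives an alias distance of 2 (f and A) and a largest gap between consecutive B accesses of 3 (d, e and B). A four-entry alias table therefore satisfies Eq. (1), while a two-entry table does not.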
When more than one instruction is aliased with the currently accessed one, the condition in Eq. (1) has to be applied to each individual aliasing instruction. That is, if, between the most recent access of A and the current access of it, none of these aliasing instructions is replaced and in turn overwrites the information for A in the prediction table, the alias problem no longer exists. Let $\{A_1, A_2, \ldots, A_n\}$ be the set of all branch instructions that are aliased with branch A during this duration. The condition that needs to be satisfied for alias elimination becomes

    $S > \max\{\delta_A(A_i), \mu_A(A_i) \mid \forall i, 1 \le i \le n\}$

The above maximum value is then referred to as the “Alias-Removing Minimum Entries” (ARME) for branch A. An example is shown in Fig. 11 to illustrate such a general case, where A is aliased with B, C and D, and the ARME of A equals

    $\max\{\delta_A(B), \mu_A(B), \delta_A(C), \mu_A(C), \delta_A(D), \mu_A(D)\}$

Fig. 11. A trace example demonstrating a sequence of recurring aliases among four instructions.

5. TRACE ANALYSIS AND SIMULATION

A comprehensive trace analysis is first performed on several test programs to show the potential effectiveness of the proposed alias table technique in removing aliases when tables of different sizes are used. This is then followed by a simulation to confirm the improvement.

5.1 Simulator and Test Programs

Our trace analysis and simulation are conducted on a SPARC 20 system. Data are obtained using the Shade version 5.25 analyzing program. Shade is a dynamic code tracer which combines instruction set simulation, trace generation and custom trace analysis in one process. Our test programs include a benchmark program from the Stanford Body Benchmark Suite, sfloat, and seven standard UNIX utility programs. A brief description of these programs in terms of various aspects of their branch instructions is given in Table 1.

Table 1. Testing program description.
    program   # instructions   # branches   # taken   # not-taken   branch %
    sfloat           7980915       236507    190180         64327       3.0%
    gcc               680247       142178     91969         50209      20.9%
    grep              490160        99618     54534         45084      20.3%
    ls                777372       159900     87542         72358      20.6%
    awk               629837       135917     73854         62063      21.6%
    chmod             605188       124152     71241         52911      20.5%
    pack              167746        23820     13714         10106      14.2%
    cc                168385        33677     19155         14522      19.8%

5.2 Trace Analysis on Alias Reduction

A trace analysis is performed on the above eight test programs to show the percentage of alias reduction using different table sizes. This is achieved by determining the following two values:

• B: the number of all executed branch instructions that are predicted based on an aliasing instruction’s information when no alias table is used, i.e. S = 0.
• R(S): among all the above B branch instructions, the number of those that have an ARME value smaller than the number of alias table entries, S.

The alias reduction percentage for a given S then becomes $R(S)/B$. Fig. 12 shows the analysis results when various numbers of index bits are used. Obviously, the larger the alias table is, the more aliases can be removed. The results show that about 50% of aliases are eliminated by a table of a mere 25 to 30 entries. This is a far better investment than enlarging the prediction table. However, the ensuing cost goes up rapidly as more entries are incorporated into the table, including the hardware needed for fast associative comparison for match and alias detection. Thus, one needs to compromise between cost and performance in selecting a reasonable table size.

Another observation leads us to an important argument: the larger the ARME for a branch instruction becomes, the less significant its correlation is with its previous branch behavior, and therefore the less useful the prediction bit(s) in the prediction table becomes.
This in turn implies that an alias with a larger ARME is less likely to lead to a destructive prediction result even without the help of an alias table, so there is less advantage in performing alias reduction on aliases with large ARME values. Consequently, for a given alias table size, the percentage of destructive aliases removed is actually even higher than the alias reduction rate shown in Fig. 12.

Fig. 12. Alias reduction rate versus the alias table.

5.3 Simulation Results

A series of simulation runs is performed by varying the following three parameters for both the one-bit and two-bit counter cases:

• number of index bits
• number of extended index bits
• number of entries in the alias table

Note that, for both the index and the extended index, the least significant portion of the PC is used in our tests. It turns out that one additional bit in the extended index over the original index is sufficient to provide close to 100% alias detection when a small number of entries is used in the alias table. Thus all our results presented here are based on this arrangement. Improvement results versus index size and alias table size are shown in Figs. 13 and 14, respectively, using the one-bit counter scheme. Figs. 15 and 16, respectively, give the corresponding results for the two-bit counter case. The “miss rate improvement percentage” is defined as

    $\dfrac{M_{orig} - M_{ATT}}{M_{orig}}$

where $M_{orig}$ and $M_{ATT}$ denote the miss rate of the original scheme and that of the proposed one, respectively. Each point in these results corresponds to the average of the results from the eight test programs.

Fig. 13. Improvement results versus index size for one-bit prediction scheme.

Fig. 14. Improvement results versus alias table size for one-bit prediction scheme.
Our first observation from these results is that the performance improvement from the proposed technique is slightly more significant in the one-bit counter case than in the two-bit one, owing to the extra damping capability already built into the latter. The improvement percentage decreases, in general, when the number of index bits used is increased. The reason is that fewer alias problems remain to be tackled when the size of the prediction table increases. The effect of increasing the size of the alias table (i.e. the number of entries in the table) is in general beneficial, though not very significant in some cases.

Fig. 15. Improvement results versus index size for two-bit prediction scheme.

Fig. 16. Improvement results versus alias table size for two-bit prediction scheme.

6. CONCLUSIONS

In this paper, we presented a very cost-efficient branch prediction scheme capable of eliminating most destructive aliases. The proposed technique yields an improvement in misprediction rate of up to 10% in some cases. More analysis is needed to further identify other patterns of destructive aliases for better designs.

REFERENCES

1. T. Ball and J. R. Larus, “Branch prediction for free,” in Proceedings of ACM SIGPLAN 1993 Conference on Programming Language Design and Implementation, 1993, pp. 300-313.
2. B. K. Bray and M. J. Flynn, “Strategies for branch target buffers,” in Proceedings of 24th Workshop on Microprogramming and Microarchitecture, 1991, pp. 42-49.
3. B. Calder and D. Grunwald, “Fast and accurate instruction fetch and branch prediction,” in Proceedings of 21st International Symposium on Computer Architecture, 1994, pp. 2-11.
4. M. C. Chang and Y. W. Chou, “Branch prediction using both global and local branch history information,” Computers and Digital Techniques, IEE Proceedings, Vol. 149, 2002, pp. 33-38.
5. P. Y. Chang, M. Evers, and Y. N.
Patt, “Improving branch prediction accuracy by reducing pattern history table interference,” in Proceedings of International Conference on Parallel Architectures and Compilation Techniques, 1996, pp. 48-57.
6. R. S. Chappell, F. Tseng, A. Yaoz, and Y. N. Patt, “Difficult-path branch prediction using subordinate microthreads,” in Proceedings of 29th Annual International Symposium on Computer Architecture, 2002, pp. 307-317.
7. J. Fisher and S. Freudenberger, “Predicting conditional branch direction from previous runs of a program,” in Proceedings of 5th Annual International Conference on Architectural Support for Programming Languages and Operating System, 1992, pp. 85-95.
8. J. K. F. Lee and A. J. Smith, “Branch prediction strategies and branch target buffer design,” IEEE Computer, 1984, pp. 6-22.
9. W. M. Lin and R. Madhavaram, “Advanced branch prediction based on a generalized predictor,” in Proceedings of the 18th International Conference on Computers and Their Applications (CATA), 2003, pp. 304-307.
10. S. McFarling, “Combining branch predictor,” Technical Report TN-36, Digital Western Research Laboratory, 1993.
11. S. McFarling and J. L. Hennessy, “Reducing the cost of branches,” in Proceedings of 13th Annual International Symposium of Computer Architecture, 1986, pp. 396-403.
12. S. T. Pan, K. So, and J. T. Rahmeh, “Improving the accuracy of dynamic branch prediction using branch correlation,” in Proceedings of 5th Annual International Conference on Architectural Support for Programming Languages and Operating System, 1992, pp. 76-84.
13. D. A. Patterson and J. L. Hennessy, Computer Architecture: A Quantitative Approach, 2nd Edition, Morgan Kaufmann Publishers, Inc., 1995.
14. C. H. Perleberg and A. J. Smith, “Branch target buffer design and optimization,” IEEE Transactions on Computers, Vol. 42, 1993, pp. 396-412.
15. J. E.
Smith, “A study of branch prediction strategies,” in Proceedings of 8th Annual International Symposium on Computer Architecture, 1981, pp. 135-147.
16. Z. Su and M. Zhou, “A comparative analysis of branch prediction schemes,” Technical Report, Computer Science Division, University of California at Berkeley, 1995.
17. Shade Manual, Sun Microsystems, 1995.
18. T. Y. Yeh and Y. N. Patt, “Two-level adaptive branch prediction,” in Proceedings of 24th Annual ACM/IEEE International Symposium and Workshop on Microarchitecture, 1991, pp. 51-61.
19. T. Y. Yeh and Y. N. Patt, “Alternative implementations of two-level adaptive branch prediction,” in Proceedings of 19th International Symposium on Computer Architecture, 1992, pp. 124-134.
20. T. Y. Yeh and Y. N. Patt, “A comparison of dynamic branch predictors that use two levels of branch history,” in Proceedings of 20th Annual International Symposium on Computer Architecture, 1993, pp. 257-266.
21. T. Y. Yeh and Y. N. Patt, “Two-level adaptive branch prediction and instruction fetch mechanism for high performance superscalar processors,” Technical Report CSE-TR-182-93, Computer Science and Engineering Division, University of Michigan, 1993.
22. C. Young and M. D. Smith, “Improving the accuracy of static branch prediction using branch correlation,” Technical Report 06-95, Center for Research in Computing Technology, Harvard University, 1995.
23. C. Young, N. Gloy and M. D. Smith, “A comparative analysis of schemes for correlated branch prediction,” in Proceedings of 22nd Annual International Symposium on Computer Architecture, 1995, pp. 276-286.

Wei-Ming Lin (林維明) received the B.S. degree in Electrical Engineering from National Taiwan University, Taipei, Taiwan, in 1982, the M.S. and Ph.D. degrees in Electrical Engineering from the University of Southern California, Los Angeles, in 1986 and 1991, respectively.
He was an assistant professor in the Department of Electrical and Computer Engineering at Mississippi State University before joining the University of Texas at San Antonio (UTSA) in 1993, and, since 1998, he has been an associate professor of Electrical Engineering there. Dr. Lin has published more than 55 technical papers in international journals and conferences in the areas of distributed and parallel computing and computer architecture. He has served on the program committees of many international conferences and will serve as the program chair for the International Conference on Computer Applications in Industry and Engineering. He received the Best Paper Award for “Load Balancing Technique for Parallel Search with Statistical Model” in the 1995 International Phoenix Conference on Computer Communications, and the Best Paper Award (runner up) for “Sorting-Based Rejection Techniques for Fast Random Number Generation” in the Fifth International Conference on Intelligent Systems.

An-Yi Yang (楊安義) received the B.S. degree in Industrial Engineering from Chung Cheng Institute of Technology, Taiwan, in 1987, and the M.S. degree in Electrical Engineering from the University of Texas at San Antonio in 2000. He is currently associated with the Faraday Technology Corporation in Taiwan.