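Alternative Implementations of Two-Level Adaptive Branch Prediction

Tse-Yu Yeh and Yale N. Patt
Department of Electrical Engineering and Computer Science
The University of Michigan
Ann Arbor, Michigan 48109-2122

Abstract

As the issue rate and depth of pipelining of high performance Superscalar processors increase, the importance of an excellent branch predictor becomes more vital to delivering the potential performance of a wide-issue, deep pipelined microarchitecture. We propose a new dynamic branch predictor (Two-Level Adaptive Branch Prediction) that achieves substantially higher accuracy than any other scheme reported in the literature. The mechanism uses two levels of branch history information to make predictions: the history of the last k branches encountered, and the branch behavior for the last s occurrences of the specific pattern of these k branches.

We have identified three variations of the Two-Level Adaptive Branch Prediction, depending on how finely we resolve the history information gathered. We compute the hardware costs of implementing each of the three variations, and use these costs in evaluating their relative effectiveness. We measure the branch prediction accuracy of the three variations of Two-Level Adaptive Branch Prediction, along with several other popular proposed dynamic and static prediction schemes, on the SPEC benchmarks. We show that the average prediction accuracy for Two-Level Adaptive Branch Prediction is 97 percent, while the other known schemes achieve at most 94.4 percent average prediction accuracy. We measure the effectiveness of different prediction algorithms and different amounts of history and pattern information. We measure the costs of each variation to obtain the same prediction accuracy.

1 Introduction

As the issue rate and depth of pipelining of high performance Superscalar processors increase, the amount of speculative work due to branch prediction becomes much larger. Since all such work must be thrown away if the prediction is incorrect, an excellent branch predictor is vital to delivering the potential performance of a wide-issue, deep pipelined microarchitecture. Even a prediction miss rate of 5 percent results in a substantial loss in performance due to the number of instructions fetched each cycle and the number of cycles these instructions are in the pipeline before an incorrect branch prediction becomes known.

The literature is full of suggested branch prediction schemes [6, 13, 14, 17]. Some are static in that they use opcode information and profiling statistics to make predictions. Others are dynamic in that they use run-time execution history to make predictions. Static schemes can be as simple as always predicting that the branch will be taken, or can be based on the opcode, or on the direction of the branch, as in "if the branch is backward, predict taken, if forward, predict not taken" [17]. This latter scheme is effective for loop intensive code, but does not work well for programs where the branch behavior is irregular. Also, profiling [6, 13] can be used to predict branches by measuring the tendency of a branch on sample data sets and presetting a static prediction bit in the opcode according to that tendency. Unfortunately, branches for different data sets may have different run-time behavior.

Dynamic branch prediction can be as simple as keeping track only of the last execution of a branch instruction and predicting that the branch will behave the same way, or it can be as elaborate as maintaining very large amounts of history information. In all cases, the fact that the dynamic prediction is being made on the basis of run-time history information implies that substantial additional hardware is required. J. Smith [17] proposed utilizing a branch target buffer to store, for each branch, a two-bit saturating up-down counter which collects and subsequently bases its prediction on branch history information about that branch. Lee and A. Smith proposed [14] a Static Training method which uses statistics gathered prior to execution time coupled with the history pattern of the last k run-time executions of the branch to make the next prediction as to which way that branch will go.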
The major disadvantage of Static Training methods has been mentioned above with respect to profiling; the pattern statistics gathered for the sample data set may not be applicable to the data that appears at run-time.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1992 ACM 0-89791-509-7/92/0005/0124 $1.50
In this paper we propose a new dynamic branch predictor that achieves substantially higher accuracy than any other scheme reported in the literature. The mechanism uses two levels of branch history information to make predictions. The first level is the history of the last k branches encountered. (Variations of our scheme reflect whether this means the actual last k branches encountered, or the last k occurrences of the same branch instruction.) The second level is the branch behavior for the last s occurrences of the specific pattern of these k branches. Prediction is based on the branch behavior for the last s occurrences of the pattern in question.

For example, suppose, for k = 8, the last k branches had the behavior 11100101 (where 1 represents that the branch was taken, 0 that the branch was not taken). Suppose further that s = 6, and that in each of the last six times the previous eight branches had the pattern 11100101, the branch alternated between taken and not taken. Then the second level would contain the history 101010. Our branch predictor would predict "taken."

The history information for level 1 and the pattern information for level 2 are collected at run time, eliminating the above mentioned disadvantages of the Static Training method. We call our method Two-Level Adaptive Branch Prediction. We have identified three variations of Two-Level Adaptive Branch Prediction, depending on how finely we resolve the history information gathered. We compute the hardware costs of implementing each of the three variations, and use these costs in evaluating their relative effectiveness.

Using trace-driven simulation of nine of the ten SPEC benchmarks [1], we measure the branch prediction accuracy of the three variations of Two-Level Adaptive Branch Prediction, along with several other popular proposed dynamic and static prediction schemes. We measure the effectiveness of different prediction algorithms and different amounts of history and pattern information. We measure the costs of each variation to obtain the same prediction accuracy. Finally we compare the Two-Level Adaptive branch predictors to the several popular schemes available in the literature. We show that the average prediction accuracy for Two-Level Adaptive Branch Prediction is about 97 percent, while the other schemes achieve at most 94.4 percent average prediction accuracy.

This paper is organized in six sections. Section two introduces our Two-Level Adaptive Branch Prediction and its three variations. Section three describes the corresponding implementations and computes the associated hardware costs.
Section four discusses the Simulation model and traces used in this study. Section five reports the simulation results and our analysis. Section six contains some concluding remarks.

[1] The Nasa7 benchmark was not simulated because this benchmark consists of seven independent loops. It takes too long to simulate the branch behavior of these seven kernels, so we omitted these loops.

2 Definition of Two-Level Adaptive Branch Prediction

2.1 Overview of Two-Level Adaptive Branch Prediction

Two-Level Adaptive Branch Prediction uses two levels of branch history information to make predictions. The first level is the history of the last k branches encountered. (Variations of our scheme reflect whether this means the actual last k branches encountered, or the last k occurrences of the same branch instruction.) The second level is the branch behavior for the last s occurrences of the specific pattern of these k branches. Prediction is based on the branch behavior for the last s occurrences of the pattern in question.

Figure 1: Structure of Two-Level Adaptive Branch Prediction. [Diagram: the Branch History Register (shifted left when updated) supplies the index into the Pattern History Table (PHT), whose entries run from 00...00 to 11...11; the pattern history bits $S_c$ of the indexed entry feed the prediction of branch B, $\lambda(S_c)$, and the state transition logic for $\delta$.]
To maintain the two levels of information, Two-Level Adaptive Branch Prediction uses two major data structures, the branch history register (HR) and the pattern history table (PHT); see Figure 1. Instead of accumulating statistics by profiling programs, the information on which branch predictions are based is collected at run-time by updating the contents of the history registers and the pattern history bits in the entries of the pattern history table depending on the outcomes of the branches. The history register is a k-bit shift register which shifts in bits representing the branch results of the most recent k branches.

If the branch was taken, then a "1" is recorded; if not, a "0" is recorded. Since there are k bits in the history register, at most $2^k$ different patterns appear in the history register. For each of these $2^k$ patterns, there is a corresponding entry in the pattern history table which contains branch results for the last s times the preceding k branches were represented by that specific content of the history register.

When a conditional branch B is being predicted, the content of its history register, HR, denoted as $R_{c-k}R_{c-k+1}\ldots R_{c-1}$, is used to address the pattern history table. The pattern history bits $S_c$ in the addressed entry $PHT_{R_{c-k}R_{c-k+1}\ldots R_{c-1}}$ of the pattern history table are then used for predicting the branch. The prediction of the branch is

\[ z_c = \lambda(S_c), \tag{1} \]

where $\lambda$ is the prediction decision function.
After the conditional branch is resolved, the outcome $R_c$ is shifted left into the history register HR in the least significant bit position and is also used to update the pattern history bits in the pattern history table entry $PHT_{R_{c-k}R_{c-k+1}\ldots R_{c-1}}$. After being updated, the content of the history register becomes $R_{c-k+1}R_{c-k+2}\ldots R_c$ and the state represented by the pattern history bits becomes $S_{c+1}$. The transition of the pattern history bits in the pattern history table entry is done by the state transition function $\delta$, which takes the old pattern history bits and the outcome of the branch as inputs to generate the new pattern history bits. Therefore, the new pattern history bits $S_{c+1}$ become

\[ S_{c+1} = \delta(S_c, R_c). \tag{2} \]

A straightforward combinational logic circuit is used to implement the function $\delta$ to update the pattern history bits in the entries of the pattern history table. The transition function $\delta$, the predicting function $\lambda$, the pattern history bits $S$ and the outcome $R$ of the branch comprise a finite-state Moore machine, characterized by equations 1 and 2.
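The prediction step of equation (1) and the update step of equation (2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' hardware: it uses one history register and one pattern history table, picks a 2-bit saturating counter for $\delta$ with "counter >= 2" for $\lambda$, and the names `TwoLevelPredictor` and `run` are hypothetical.

```python
class TwoLevelPredictor:
    """One k-bit history register plus one 2^k-entry pattern history table.

    Assumption: each PHT entry is a 2-bit saturating up-down counter.
    """

    def __init__(self, k, init_state=3):
        self.mask = (1 << k) - 1
        self.hr = self.mask                     # history register, all 1s
        self.pht = [init_state] * (1 << k)      # one counter per pattern

    def predict(self):
        # Equation (1): z_c = lambda(S_c), with lambda = "counter >= 2".
        return self.pht[self.hr] >= 2

    def update(self, taken):
        # Equation (2): S_{c+1} = delta(S_c, R_c), a saturating counter step.
        s = self.pht[self.hr]
        self.pht[self.hr] = min(3, s + 1) if taken else max(0, s - 1)
        # Shift the outcome R_c into the least significant bit of HR.
        self.hr = ((self.hr << 1) | int(taken)) & self.mask


def run(predictor, outcomes):
    """Return how many outcomes in the sequence were predicted correctly."""
    correct = 0
    for taken in outcomes:
        correct += int(predictor.predict() == taken)
        predictor.update(taken)
    return correct
```

With k = 2, a branch that strictly alternates between taken and not taken is mispredicted a few times while the counters train and is then predicted correctly on every execution, because the two history patterns 01 and 10 select different table entries.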
State diagrams of the finite-state Moore machines used in this study for updating the pattern history in the pattern history table entry and for predicting which path the branch will take are shown in Figure 2. The automaton Last-Time stores in the pattern history only the outcome of the last execution of the branch when the same history pattern appeared. The next time the same history pattern appears, the prediction will be what happened last time. Only one bit is needed to store that pattern history information. The automaton A1 records the results of the last two times the same history pattern appeared. Only when there is no taken branch recorded will the next execution of the branch with the same history pattern be predicted as not taken; otherwise, the branch will be predicted as taken.

The automaton A2 is a saturating up-down counter, similar to the automaton used in J. Smith's branch target buffer design for keeping branch history [17]. In J. Smith's design the 2-bit saturating up-down counter keeps track of the branch history of a certain branch. The counter is incremented when the branch is taken and is decremented when the branch is not taken. The branch path of the next execution is predicted as taken when the counter value is greater than or equal to two; otherwise, the branch is predicted as not taken. In Two-Level Adaptive Branch Prediction, the 2-bit saturating up-down counter keeps track of the history of a certain history pattern. The counter is incremented when the result of a branch, whose history register content is the same as the pattern history table entry index, is taken; otherwise, the counter is decremented. The next time the branch has the same history register content, which accesses the same pattern history table entry, the branch is predicted taken if the counter value is greater than or equal to two; otherwise, the branch is predicted not taken. Automata A3 and A4 are variations of A2.
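As a hedged illustration of the automata described above, each one can be written as a pair of functions: a transition function (the $\delta$ of equation 2) and a prediction function (the $\lambda$ of equation 1). The state encodings below are assumptions chosen for this sketch: Last-Time keeps one bit, A1 keeps the last two outcomes in a 2-bit shift register, and A2 keeps a counter value from 0 to 3. A3 and A4 are left out because the text only says they are variations of A2. The function and dictionary names are hypothetical.

```python
def delta_last_time(state, taken):
    # Remember only the most recent outcome for this pattern.
    return int(taken)

def lambda_last_time(state):
    return state == 1

def delta_a1(state, taken):
    # Keep the outcomes of the last two occurrences of the pattern.
    return ((state << 1) | int(taken)) & 0b11

def lambda_a1(state):
    # Predict not taken only when no taken outcome is recorded.
    return state != 0

def delta_a2(state, taken):
    # 2-bit saturating up-down counter.
    return min(3, state + 1) if taken else max(0, state - 1)

def lambda_a2(state):
    return state >= 2

AUTOMATA = {
    "LT": (delta_last_time, lambda_last_time),
    "A1": (delta_a1, lambda_a1),
    "A2": (delta_a2, lambda_a2),
}

def trace(name, state, outcomes):
    """Run one automaton over a list of outcomes.

    Returns the list of predictions made before each outcome was known,
    together with the final state.
    """
    delta, lam = AUTOMATA[name]
    predictions = []
    for taken in outcomes:
        predictions.append(lam(state))
        state = delta(state, taken)
    return predictions, state
```

Keeping $\delta$ and $\lambda$ separate mirrors the Moore-machine view in the text: the prediction depends only on the stored state, and the outcome only affects the next state.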
Both Static Training [14] and Two-Level Adaptive Branch Prediction are dynamic branch predictors, because their predictions are based on run-time information, i.e. the dynamic branch history. The major difference between these two schemes is that the pattern history information in the pattern history table changes dynamically in Two-Level Adaptive Branch Prediction but is preset in Static Training from profiling. In Static Training, the input to the prediction decision function, $\lambda$, for a given branch history pattern is known before execution. Therefore, the output of $\lambda$ is determined before execution for a given branch history pattern. That is, the same branch predictions are made if the same history pattern appears at different times during execution. Two-Level Adaptive Branch Prediction, on the other hand, updates the pattern history information kept in the pattern history table with the actual results of branches. As a result, given the same branch history pattern, different pattern history information can be found in the pattern history table; therefore, there can be different inputs to the prediction decision function for Two-Level Adaptive Branch Prediction. Predictions of Two-Level Adaptive Branch Prediction change adaptively as the program executes.

Since the pattern history bits change in Two-Level Adaptive Branch Prediction, the predictor can adjust to the current branch execution behavior of the program to make proper predictions. With these run-time updates, Two-Level Adaptive Branch Prediction can be highly accurate over many different programs and data sets. Static Training, on the contrary, may not predict well if changing data sets brings about different execution behavior.

Figure 2: State diagrams of the finite-state Moore machines used for making prediction and updating the pattern history table entry. [Panels: Last-Time (LT), Automaton A1, Automaton A2, Automaton A3, Automaton A4.]

2.2 Alternative Implementations of Two-Level Adaptive Branch Prediction

There are three alternative implementations of the Two-Level Adaptive Branch Prediction, as shown in Figure 3. They are differentiated as follows:
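Two-Level Adaptive Branch Prediction Using a Global History Register and a Global Pattern History Table (GAg)

In GAg, there is only a single global history register (GHR) and a single global pattern history table (GPHT) used by the Two-Level Adaptive Branch Prediction. All branch predictions are based on the same global history register and global pattern history table, which are updated after each branch is resolved. This variation therefore is called Global Two-Level Adaptive Branch Prediction using a global pattern history table (GAg).

Since the outcomes of different branches update the same history register and the same pattern history table, the information of both branch history and pattern history is influenced by results of different branches. The prediction for a conditional branch in this scheme is actually dependent on the outcomes of other branches.

Two-Level Adaptive Branch Prediction Using a Per-address Branch History Table and a Global Pattern History Table (PAg)

In order to reduce the interference in the first level branch history information, one history register is associated with each distinct static conditional branch to collect branch history information individually. The history registers are contained in a per-address branch history table (PBHT) in which each entry is accessible by one specific static branch instruction and is accessed by branch instruction addresses. Since the branch history is kept for each distinct static conditional branch individually and all history registers access the same global pattern history table, this variation is called Per-address Two-Level Adaptive Branch Prediction using a global pattern history table (PAg).

The execution results of a static conditional branch update the branch's own history register and the global pattern history table. The prediction for a conditional branch is based on the branch's own history and the pattern history bits in the global pattern history table entry indexed by the content of the branch's history register. Since all branches update the same pattern history table, the pattern history interference still exists.

Two-Level Adaptive Branch Prediction Using a Per-address Branch History Table and Per-address Pattern History Tables (PAp)

In order to completely remove the interference in both levels, each static branch has its own pattern history table, a set of which is called a per-address pattern history table (PPHT). Therefore, a per-address history register and a per-address pattern history table are associated with each static conditional branch. All history registers are grouped in a per-address branch history table. Since this variation of Two-Level Adaptive Branch Prediction keeps separate history and pattern information for each distinct static conditional branch, it is called Per-address Two-Level Adaptive Branch Prediction using Per-address pattern history tables (PAp).

Figure 3: Global view of three variations of Two-Level Adaptive Branch Prediction. [Panels: GAg with a Global History Register and a Global Pattern History Table (GPHT); PAg with a Per-address Branch History Table (PBHT) and a Global Pattern History Table; PAp with a Per-address Branch History Table and Per-address Pattern History Tables (PPHT).]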
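The difference between GAg, PAg and PAp comes down to which key selects the history register and which key selects the pattern history table. The sketch below makes that explicit. It is an idealized model under assumptions of this example only: tables are unbounded dictionaries (no tags, no replacement), every pattern entry is a 2-bit saturating counter initialized to 3, and the names `TwoLevel`, `SCHEMES` and `shape` are hypothetical.

```python
from collections import defaultdict

# (per-address history register?, per-address pattern history table?)
SCHEMES = {"GAg": (False, False), "PAg": (True, False), "PAp": (True, True)}


class TwoLevel:
    def __init__(self, scheme, k):
        self.per_addr_hist, self.per_addr_pht = SCHEMES[scheme]
        self.mask = (1 << k) - 1
        self.hist = defaultdict(lambda: self.mask)  # history registers
        self.pht = defaultdict(lambda: 3)           # (table, pattern) -> counter

    def _slot(self, addr):
        # Key 0 stands for "the single global structure".
        hkey = addr if self.per_addr_hist else 0
        tkey = addr if self.per_addr_pht else 0
        return hkey, (tkey, self.hist[hkey])

    def predict(self, addr):
        _, slot = self._slot(addr)
        return self.pht[slot] >= 2

    def update(self, addr, taken):
        hkey, slot = self._slot(addr)
        c = self.pht[slot]
        self.pht[slot] = min(3, c + 1) if taken else max(0, c - 1)
        self.hist[hkey] = ((self.hist[hkey] << 1) | int(taken)) & self.mask

    def shape(self):
        """(number of history registers, number of pattern history tables)."""
        return len(self.hist), len({table for table, _ in self.pht})
```

After two distinct branch addresses have been seen, `shape()` reports one register and one table for GAg, two registers and one table for PAg, and two of each for PAp, which is the interference structure the three descriptions above spell out.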
3 Implementation Considerations

3.1 Pipeline Timing of Branch Prediction and Information Update

Two-Level Adaptive Branch Prediction requires two sequential table accesses to make a prediction. It is difficult to squeeze the two accesses into one cycle. High performance requires that prediction be made within one cycle from the time the branch address is known. To satisfy this requirement, the two sequential accesses are performed in two different cycles as follows: When a branch result becomes known, the branch's history register is updated. In the same cycle, the pattern history table can be accessed for the next prediction with the updated history register contents derived by appending the result to the old history. The prediction fetched from the pattern history table is then stored along with the branch's history in the branch history table. The next time that branch is encountered, the prediction is available as soon as the branch history table is accessed. Therefore, only one cycle latency is incurred from the time the branch address is known to the time the prediction is available.
Sometimes the previous branch results may not be ready before the prediction of a subsequent branch takes place. If the obsolete branch history is used for making the prediction, the accuracy is degraded. In such a case, the predictions of the previous branches can be used to update the branch history. Since the prediction accuracy of Two-Level Adaptive Branch Prediction is very high, prediction is enhanced by updating the branch history speculatively. The update timing for the pattern history table, on the other hand, is not as critical as that of the branch history; therefore, its update can be delayed until the branch result is known. With speculative updating, when a misprediction occurs, the branch history can either be reinitialized or repaired, depending on the hardware budget available to the branch predictor. Also, if two instances of the same static branch occur in consecutive cycles, the latency of prediction can be reduced for the second branch by using the prediction fetched from the pattern history table directly.

3.2 Target Address Caching

After the direction of a branch is predicted, there is still the possibility of a pipeline bubble due to the time it takes to generate the target address.
To eliminate this bubble, we cache the target addresses of branches. One extra field is required in each entry of the branch history table for doing this. When a branch is predicted taken, the target address is used to fetch the following instructions; otherwise, the fall-through address is used. Caching the target addresses makes prediction in consecutive cycles possible without any delay. This also requires the branch history table to be accessed by the fetching address of the instruction block rather than by the address of the branch in the instruction block being fetched, because the branch address is not known until the instruction block is decoded. If the address hits in the branch history table, the prediction of the branch in the instruction block can be made before the instructions are decoded. If the address misses in the branch history table, either there is no branch in the instruction block fetched in that cycle or the branch history information is not present in the branch history table. In this case, the next sequential address is used to fetch new instructions. After the instructions are decoded, if there is a branch in the instruction block and if the instruction block address missed in the branch history table, static branch prediction is used to determine whether or not the new instructions fetched from the next sequential address should be squashed.
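The fetch-address selection described above can be summarized in a short sketch. Everything concrete here is an assumption made for illustration: the branch history table is modelled as a plain dictionary keyed by the fetch address of the instruction block, each entry carries hypothetical `predict_taken`, `target` and `fall_through` fields, and the block size is an arbitrary 16 bytes.

```python
def next_fetch_address(fetch_addr, bht, block_bytes=16):
    """Pick the address of the next instruction block to fetch.

    bht maps a block fetch address to a dict with the (hypothetical) keys
    'predict_taken', 'target' and 'fall_through'.
    """
    entry = bht.get(fetch_addr)
    if entry is None:
        # Miss: no known branch in this block, so fetch sequentially.
        return fetch_addr + block_bytes
    if entry["predict_taken"]:
        # Hit, predicted taken: use the cached target address.
        return entry["target"]
    # Hit, predicted not taken: use the fall-through address.
    return entry["fall_through"]
```

The sketch leaves out the recovery step from the text, in which a branch found at decode time after a table miss falls back on static prediction to decide whether the sequentially fetched instructions are squashed.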
3.3 Per-address Branch History Table Implementation

PAg and PAp branch predictors all use per-address branch history tables in their structure. It is not feasible to have a branch history table large enough to hold all branches' execution history in real implementations. Therefore, a practical approach for the per-address branch history table is proposed here.

The per-address branch history table can be implemented as a set-associative or direct-mapped cache. A fixed number of entries in the table are grouped together as a set. Within a set, a Least-Recently-Used (LRU) algorithm is used for replacement. The lower part of a branch address is used to index into the table and the higher part is stored as a tag in the entry associated with that branch. When a conditional branch is to be predicted, the branch's entry in the branch history table is located first. If the tag in the entry matches the accessing address, the branch information in the entry is used to predict the branch. If the tag does not match the address, a new entry is allocated for the branch.

In this study, both the above practical approach and an Ideal Branch History Table (IBHT), in which there is a history register for each static conditional branch, were simulated for Two-Level Adaptive Branch Prediction. The branch history table was simulated with four configurations: 512-entry 4-way set-associative, 512-entry direct-mapped, 256-entry 4-way set-associative, and 256-entry direct-mapped caches. The IBHT simulation data is provided to show the accuracy loss due to the history interference in a practical branch history table implementation.
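The set-associative organization with LRU replacement described above can be sketched as follows. This is a software model under assumptions, not the hardware design: each set is a Python list ordered from most to least recently used, entries are dictionaries, and a newly allocated history register starts at all 1s (the initialization the paper describes for a table miss). The class name `BranchHistoryTable` and its fields are hypothetical.

```python
class BranchHistoryTable:
    def __init__(self, entries, ways, k):
        assert entries % ways == 0
        self.num_sets = entries // ways
        self.ways = ways
        self.init_history = (1 << k) - 1
        # One list per set, ordered from most to least recently used.
        self.sets = [[] for _ in range(self.num_sets)]

    def lookup(self, addr):
        """Return (entry, hit). A miss allocates a new entry for addr."""
        index = addr % self.num_sets      # lower part of the address
        tag = addr // self.num_sets       # higher part is kept as the tag
        ways = self.sets[index]
        for pos, entry in enumerate(ways):
            if entry["tag"] == tag:
                ways.insert(0, ways.pop(pos))   # now most recently used
                return entry, True
        entry = {"tag": tag, "history": self.init_history}
        ways.insert(0, entry)
        if len(ways) > self.ways:
            ways.pop()                    # evict the least recently used
        return entry, False
```

Setting `ways=1` gives the direct-mapped configurations, and for example `BranchHistoryTable(512, 4, k)` corresponds to the 512-entry 4-way configuration listed above.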
the and and the it is not prediction accessing for history stored whether or not next sequential gorithm the table, storage pattern to index The include the following characterize items MUXes, There is instruction history determine from the branch predictors all tables in their structure. per-address branch if t here if the Therefore, a practical approach branch history table is proposed mented fixed are decoded, to variations. Detailed Assumptions at ion sible The is used proposed table accessing predic- The history and addresses branch pattern decoders because In this Branch and PAp history address get be squashed. branch t ations. branch in the instruction are three information, The and a run-time the history pattern table. block dress for inconsequential. of the and parts. bits Estimates estimates costs table decoders history address cost relative comparators, the instrucin the is not LRU in history missed Per-address ment before required keeping branch branch instruction should hits area tables. or the branch prediction is used to the new instructions fetched address until address misses the instructions in the known chip mechanism major by being in the branch sequential After than of the is no branch cycle present case, the next st ruct ions. address there in that is not If the prediction also block The history by the rather is not can be made If the either fetched mation is decoded. block decoded. table, block instruction address the This Cost tion the in con- delay. Hardware hardware is used. to be accessed in the table, following prediction any instruction branch the address makes table branch 3.4 of branches. to fetch without history of the the addresses branch address is used the fall-through the target secutive the address otherwise, addresses in each entry of the branch When abranch is predicted i- t]} + + + [State-updater.-b, t]} = {hx[(a–i+j) [hxc. +2’x(a– [h XkXC,g. 
In GAg, only one history register and one global pattern history table are used, so h and p are both equal to one. No tag and no branch history table accessing logic are necessary for the single history register. Besides, pattern history state updating logic is small compared to the other two terms in the pattern history table cost. Therefore, the cost estimation function for GAg can be simplified from Function 3 to the following Function:

\[
\begin{aligned}
&\mathrm{Cost}_{GAg}(BHT(1, , k),\; 1 \times PHT(2^k, s)) \\
&= \mathrm{Cost}_{BHT}(1, , k) + 1 \times \mathrm{Cost}_{PHT}(2^k, s) \\
&\approx \{[k + 1] \times C_s + k \times C_{sh}\} + \{2^k \times (s \times C_s + C_d)\}.
\end{aligned}
\tag{4}
\]

It is clear to see that the cost of GAg grows exponentially with respect to the history register length.

In PAg, only one pattern history table is used, so p is equal to one. Since j and s are usually small compared to the other variables, by using Function 3, the estimated cost for PAg using a branch history table is as follows:

\[
\begin{aligned}
&\mathrm{Cost}_{PAg}(BHT(h, j, k),\; 1 \times PHT(2^k, s)) \\
&= \mathrm{Cost}_{BHT}(h, j, k) + 1 \times \mathrm{Cost}_{PHT}(2^k, s) \\
&\approx \{h \times [(a + 2 \times j + k + 1 - i) \times C_s + C_d + k \times C_{sh}]\} + \{2^k \times (s \times C_s + C_d)\}, \qquad a + j \ge i.
\end{aligned}
\tag{5}
\]

The cost of a PAg scheme grows exponentially with respect to the history register length and linearly with respect to the branch history table size.

In a PAp scheme using a branch history table as defined above, h pattern history tables are used, so p is equal to h. By using Function 3, the estimated cost for PAp is as follows:

\[
\begin{aligned}
&\mathrm{Cost}_{PAp}(BHT(h, j, k),\; h \times PHT(2^k, s)) \\
&= \mathrm{Cost}_{BHT}(h, j, k) + h \times \mathrm{Cost}_{PHT}(2^k, s) \\
&\approx \{h \times [(a + 2 \times j + k + 1 - i) \times C_s + C_d + k \times C_{sh}]\} + h \times \{2^k \times (s \times C_s + C_d)\}, \qquad a + j \ge i.
\end{aligned}
\tag{6}
\]

When the history register is sufficiently large, the cost of a PAp scheme grows exponentially with respect to the history register length and linearly with respect to the branch history table size. However, the branch history table size becomes a more dominant factor than it is in a PAg scheme.
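The simplified cost functions (4), (5) and (6) are easy to evaluate numerically, which helps when comparing how the three variations scale with k and h. The sketch below transcribes them directly. The unit base costs ($C_s = C_d = C_{sh} = 1$) and the default of a = 32 address bits are assumptions of this example, not values given in the paper, and the function names are hypothetical.

```python
def log2_exact(n):
    """Integer log2 of a power of two (i = log2 h in the text)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    return n.bit_length() - 1


def cost_pht(k, s, Cs=1, Cd=1):
    # 2^k entries of s bits each, plus a k-bit address decoder.
    return 2**k * (s * Cs + Cd)


def cost_bht(h, j, k, a, Cs=1, Cd=1, Csh=1):
    # h entries, 2^j-way set-associative, k-bit history registers.
    i = log2_exact(h)
    assert a + j >= i
    return h * ((a + 2 * j + k + 1 - i) * Cs + Cd + k * Csh)


def cost_gag(k, s, Cs=1, Cd=1, Csh=1):
    # Equation (4): one history register, one pattern history table.
    return (k + 1) * Cs + k * Csh + cost_pht(k, s, Cs, Cd)


def cost_pag(h, j, k, s, a=32, Cs=1, Cd=1, Csh=1):
    # Equation (5): per-address history, one pattern history table.
    return cost_bht(h, j, k, a, Cs, Cd, Csh) + cost_pht(k, s, Cs, Cd)


def cost_pap(h, j, k, s, a=32, Cs=1, Cd=1, Csh=1):
    # Equation (6): per-address history, h pattern history tables.
    return cost_bht(h, j, k, a, Cs, Cd, Csh) + h * cost_pht(k, s, Cs, Cd)
```

Note that j is the base-2 logarithm of the associativity, so a 512-entry 4-way table with 12-bit history registers and 2-bit pattern entries would be `cost_pag(512, 2, 12, 2)` under these assumed unit costs.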
4 Simulation Model

Trace-driven simulations were used in this study. A Motorola 88100 instruction level simulator is used for generating instruction traces. The instruction and address traces are fed into the branch prediction simulator, which decodes instructions, predicts branches, and verifies the predictions with the branch results to collect statistics for branch prediction accuracy.

4.1 Description of Traces

Nine benchmarks from the SPEC benchmark suite are used in this branch prediction study. Five are floating point benchmarks and four are integer benchmarks. The floating point benchmarks include doduc, fpppp, matrix300, spice2g6 and tomcatv, and the integer ones include eqntott, espresso, gcc, and li. Nasa7 is not included because it takes too long to capture the branch behavior of all seven kernels.

Among the five floating point benchmarks, fpppp, matrix300 and tomcatv have repetitive loop execution; thus, a very high prediction accuracy is attainable, independent of the predictors used. Doduc, spice2g6 and the integer benchmarks are more interesting. They have many conditional branches and irregular branch behavior. Therefore, it is on the integer benchmarks where the mettle of the branch predictor is tested.

Since this study of branch prediction focuses on the prediction for conditional branches, all benchmarks were simulated for twenty million conditional branch instructions except gcc, which finished before twenty million conditional branch instructions were executed. Fpppp, matrix300, and tomcatv were simulated for 100 million instructions because of their very regular branch behavior throughout the programs. The number of static conditional branches in the instruction traces of the benchmarks are listed in Table 1. History register hit rate usually depends on the number of static branches in the benchmarks. The testing and training data sets for each benchmark used in this study are listed in Table 2.
Table 1: Number of static conditional branches in each benchmark. [Columns: Benchmark Name; Number of Static Conditional Branches. Rows: eqntott, espresso, gcc, li, doduc, fpppp, matrix300, spice2g6, tomcatv. Only the value 370 is legible.]

Table 2: Training and testing data sets of benchmarks.

Benchmark    Training Data Set     Testing Data Set
eqntott
espresso     cps                   bca
gcc          cexp.i                dbxout.i
li           tower of hanoi        eight queens
doduc        tiny doducin          doducin
fpppp        NA                    natoms
matrix300    NA                    Built-in
spice2g6     short greycode.in     greycode.in
tomcatv      NA                    Built-in

In the traces generated with the testing data sets, about 24 percent of the dynamic instructions for the integer benchmarks and about 5 percent of the dynamic instructions for the floating point benchmarks are branch instructions. Figure 4 shows that about 80 percent of the dynamic branch instructions are conditional branches; therefore, the prediction mechanism for conditional branches is the most important among the prediction mechanisms for different classes of branches.

Figure 4: Distribution of dynamic branch instructions. [Bar chart per benchmark; legend: Return From Subroutine, Immediate Jump, Jump Through Register, Conditional Branch.]
Cent — GAg(HK(l, 1 ,,.s,), 1 X PHT(2r PAs(BHT (256,1, 1 X PHT(2” r-sr), ,A2), branch instructions. r-bit 256 4 r-bit Characterization of Branch 1 r. bit r-bit 512 4 r-bit 512 4 r-bit pAg(BHT The three variations Prediction were tions. tions in of the 3. schemes tion dynamic also simulated. dynamic s.), we to analyzed, the Scheme( History IBHT, entity, and the PAg(l 512 r-Sr), are ,xsr), r-bit 512 entity 1 for to keep HR specifies Associativity the is the Entry_Content number Asc the (lAg, of the in branch history table entry. When Associativity to 1, the branch history table is direct-mapped. Atm 1 ~. A* III 1 2r Atm LT A2 512 2“ Atui 1 2“ PB 1 2P PB A2 A2 – Table Set-A Table, ble, Stattc IBHT Tame, a Level ta- tory each - Global Tables, Training is set The tern content of an entry in the branch history table can be any automaton shown in Figure 2 or simply a history Entr. – Entr8es, GAg Table a PB Branch Preset sr Shaft 3: Configurations - Table, PAP Pattern LT Branch PSg - History - Ta. Last. Prediction Per-address Per-address Bzt, GSg H8story Infin$te, – – Adapttve Table, Pattern inf Us$ng Global - H8story Adapt%ue Predzctzon a Preset Table, Tab/e, – Branch Config. Two-Level Global Two-Level H%story Des8gn, – Global Pattern Predation BHT Buffer a Preset Hwtory Pattern - Global Ustng Branch Using History Ustng Per-address Adaptsve – Automaton, Target Idea/ - Atm – Branch Tra$nmg PAg Us$ng in ssociativxty, BTB Predactaon Global Aim 4 — Branch register), content 2“ LT Configuration, information associativity specifies T 512 J.]) Pattern( of entries 1 s At III J.i) BTB(BHT(512,4,LT), H%story history Aim AS 73K 4 512 conven- If a predictor naming conven- history (A single 2. A4 r-bit shown example, 1 S, blank. scheme, used Size is left At m S* — Associativity, x 2“ A2 r-bit 7 different naming Size, 1 Al r-bit 4 s, branch the Aim ,, ,LT),[c]) BHT(inf, At m 2, S, Target Buffer design (BTB) Associativity, EntrgX’ontent), for example, or BHT. 
ble, following ConteztSwitch). feature in the Branch Size, is the that (512,4, 2r A2 r-bit 4 [c]) configura- predictors distinguish field specifies PAg, PAp or [17]. In History( of branches, static Pattern.TableSet-.Size corresponding Scheme 512 ,A4), 1 X PHT(2r configura- The History( Entry-Content), not have a certain the several and branch order Entry-Content), tion, Branch 1 1 s, 512 In is used: Size, does Adaptive with known were Table Two-Level simulated Other predictors of Atm s, ,A2),[c]) 1 XPH’1’(2r 2r s, PA~(BHT(512>4,r-st), Predictors 1 AZ 4 1x PHT(22r,A3),[c]) 4.2 A*JII ,, PAg(BHT(512,4,Mr), 1 xPHT(2” 2’ A2 51’2 ,Al),[c]) PAs(BHT(512,4,,. 1 A2 512 [c]) 512,4,,-s,), 1 X PHT(2r At m S, r-sr), 1 X PHT(2r 2T ** wr), PAs(BHT(512,1, of dynamic 1 ,A2),[c]) PAs(BHT( 4: Distribution 256 ,A2),[c]) 1 XPHT(2r 1 S, ,A2),[c]) PA~(BHT(256,4, Figure r-bit Two- Pattern Hzs- Per-address Table, Stattc PHT – Pat- Register. of simulated branch predictors. register. In Pattern-Table-Set-Size Size, Entry-Cent ent ), Pattern_Table?Set_ number of pattern Pattern is the tory the history tables implementation information, implementation, Size for specifies and used in keeping the number Entry_Content The Pattern( Size is the the entries of entries specifies history bits are also initialized in the pattern history at the beginning table of execution. Since taken branches are more likely for those pattern history tables using automata Al, A2, A3, and A4, all scheme, pattern pattern his- entries in tries the 130 are initialized are initialized to state to state 3. For 1 such Last-Time, that the all en- branches at the beginning of execution dicted It taken. history tables In addition and A. is not during execution. to the Two-Level Smith’s Static get Buffer designs, prediction schemes poses. ilar scheme with to the profiling. Training Training using the GAg but tern history a preset the with the as used a fair comparison. 
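The initialization and context-switch rules described above can be sketched for a PAg-style predictor. This is a minimal illustration, not the simulator used in the study: the class name `PAgPredictor` and its methods are hypothetical, the branch history table is idealized as an unbounded dictionary (in the spirit of an IBHT), and automaton A2 is assumed to behave as a two-bit saturating up-down counter that predicts taken in states 2 and 3.

```python
class PAgPredictor:
    """Hypothetical PAg sketch: per-address history registers, one global PHT."""

    def __init__(self, history_bits=12):
        self.mask = (1 << history_bits) - 1
        # One pattern history table with 2^k entries. Every entry starts in
        # state 3 because taken branches are assumed to be more likely.
        self.pht = [3] * (1 << history_bits)
        # Idealized branch history table: branch address -> history register.
        self.bht = {}
        # Branches whose history register is still the all-1's miss value.
        self.awaiting_first_result = set()

    def predict(self, pc):
        if pc not in self.bht:
            # BHT miss: initialize the history register to all 1's.
            self.bht[pc] = self.mask
            self.awaiting_first_result.add(pc)
        # Assumed A2 behavior: states 2 and 3 predict taken.
        return self.pht[self.bht[pc]] >= 2

    def update(self, pc, taken):
        history = self.bht.get(pc, self.mask)
        counter = self.pht[history]
        # Saturating up-down counter update (assumed A2 behavior).
        self.pht[history] = min(3, counter + 1) if taken else max(0, counter - 1)
        if pc in self.awaiting_first_result:
            # The first result after a miss is extended throughout the register.
            self.bht[pc] = self.mask if taken else 0
            self.awaiting_first_result.discard(pc)
        else:
            self.bht[pc] = ((history << 1) | int(taken)) & self.mask

    def context_switch(self):
        # Flush the branch history table; the pattern history table is kept.
        self.bht.clear()
        self.awaiting_first_result.clear()
```

A direct-mapped or set-associative BHT would replace the dictionary with a fixed-size structure and an eviction policy; that is omitted here to keep the sketch short.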
In addition to the Two-Level Adaptive schemes, Lee and A. Smith's Static Training schemes, Branch Target Buffer designs, and some dynamic and static branch prediction schemes were simulated for comparison purposes.

Lee and A. Smith's Static Training scheme is similar in structure to the Per-address Two-Level Adaptive scheme with an IBHT, but with the important difference that the prediction for a given pattern is pre-determined by profiling. In this study, Lee and A. Smith's Static Training is identified as PSg, meaning Per-address Static Training using a preset global pattern history table. Similarly, the scheme which has a similar structure to GAg, but with the difference that the second-level pattern history information is collected from profiling, is abbreviated GSg, meaning Global Static Training using a preset global pattern history table. Per-address Static Training using per-address pattern history tables (PSp) is another application of Static Training to a different structure; however, this scheme requires a lot of storage to keep track of the pattern behavior of all branches statically. Therefore, no PSp schemes were simulated in this study.

Lee and A. Smith's Static Training schemes were simulated with the same branch history table configurations as used by the Two-Level Adaptive schemes for a fair comparison. Static Training is not less expensive to implement than the Two-Level Adaptive scheme, because the branch history table and the pattern history table required by both schemes are similar. In Static Training, before program execution starts, extra time is needed to load the preset pattern prediction bits into the pattern history table.

Branch Target Buffer designs were simulated with automata A2 and Last-Time. The static branch prediction schemes simulated include the Always Taken, Backward Taken and Forward Not Taken (BTFN), and a profiling scheme. The Always Taken scheme predicts taken for all branches. The BTFN scheme predicts taken if a branch is backward and not taken if forward. The BTFN scheme is effective for loop-bound programs, because it mispredicts only once in the execution of a loop. The profiling scheme counts the frequency of taken and not-taken for each static branch in the profiling execution. The predicted direction of a branch is the one the branch takes most frequently. The profiling information of a program executed with a training data set is used for branch predictions of the program executed with testing data sets, thus calculating the prediction accuracy.

5 Branch Prediction Simulation Results

Figures 5 through 11 show the prediction accuracy of the branch predictors described in the previous section on the nine SPEC benchmarks. "Tot GMean" is the geometric mean across all the benchmarks, "Int GMean" is the geometric mean across all the integer benchmarks, and "FP GMean" is the geometric mean across all the floating point benchmarks. The vertical axis shows the prediction accuracy scaled from 76 percent to 100 percent.

5.1 Evaluation of the Parameters of Two-Level Adaptive Branch Prediction

Three variations of Two-Level Adaptive Branch Prediction were simulated with different history register lengths to assess the effectiveness of increasing the recorded history length. The PAg and PAp schemes were each simulated with an ideal branch history table (IBHT) and with practical branch history tables to show the effect of the branch history table hit ratio.

5.1.1 Effect of Pattern History Table Automaton

Figure 5 shows the efficiency of using different finite-state automata. Five automata, A1, A2, A3, A4, and Last-Time, were simulated with a PAg Two-Level Adaptive scheme having 12-bit history registers in a four-way set-associative 512-entry BHT. A1, A2, A3, and A4 all perform better than Last-Time. The four-state automata A1, A2, A3, and A4 maintain more history information than Last-Time, which only records what happened the last time; they are therefore more tolerant to the deviations in the execution history. Among the four-state automata, A1 performs worse than the others. The performance of A2, A3, and A4 is very close to each other; however, A2 usually performs best. In order to show the following figures clearly, each Two-Level Adaptive scheme is shown with automaton A2.

Figure 5: Comparison of Two-Level Adaptive Branch Predictors using different finite-state automata.

5.1.2 Effect of History Register Length

Three Variations Using History Registers of the Same Length

Figure 6 shows the effects of history register length on the prediction accuracy of Two-Level Adaptive schemes. Every scheme in the graph was simulated with the same history register length. Among the variations, PAp performs the best, PAg the second, and GAg the worst. GAg is not effective with 6-bit history registers, because every branch updates the same history register, causing excessive interference. PAg performs better than GAg, because it has a branch history table which reduces the interference in branch history. PAp predicts the best, because the interference in the pattern history is removed.

Figure 6: Comparison of the Two-Level Adaptive schemes using history registers of the same length.
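The "Tot GMean", "Int GMean" and "FP GMean" summaries used in the figures of this section are geometric means over groups of benchmarks. The sketch below shows one way to compute them. The function names are hypothetical and the accuracy values are placeholders chosen only to exercise the code; they are not results from this study.

```python
import math

INTEGER = ("eqntott", "espresso", "gcc", "li")
FLOATING_POINT = ("doduc", "fpppp", "matrix300", "spice2g6", "tomcatv")


def geometric_mean(values):
    """Return the geometric mean of a non-empty sequence of positive numbers."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("values must be positive")
    # Average the logarithms, then exponentiate.
    return math.exp(sum(math.log(v) for v in values) / len(values))


def summarize(accuracy_by_benchmark):
    """Compute the three summary columns from per-benchmark accuracies."""
    ints = [accuracy_by_benchmark[name] for name in INTEGER]
    fps = [accuracy_by_benchmark[name] for name in FLOATING_POINT]
    return {
        "Int GMean": geometric_mean(ints),
        "FP GMean": geometric_mean(fps),
        "Tot GMean": geometric_mean(ints + fps),
    }


# Placeholder accuracies (hypothetical, not measured data).
example = {name: 0.95 for name in INTEGER}
example.update({name: 0.99 for name in FLOATING_POINT})
summary = summarize(example)
```

Because the geometric mean multiplies the accuracies, one poorly predicted benchmark lowers the summary more than it would lower an arithmetic mean.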
Effects of Various History Register Lengths

To further investigate the effect of history register length, Figure 7 shows the accuracy of GAg schemes with various history register lengths. There is an increase of 9 percent in accuracy by lengthening the history register from 6 bits to 18 bits. The effect of history register length is obvious on GAg schemes. The history register length has a smaller effect on PAg schemes and an even smaller effect on PAp schemes, because of the lower interference in the branch history and pattern history and their effectiveness with short history registers.

Figure 7: Effect of various history register lengths on GAg schemes.

5.1.3 Hardware Cost Efficiency of Three Variations

In Figure 6, prediction accuracy for the schemes with the same history register length was compared. However, the various Two-Level Adaptive schemes have different costs. PAp is the most expensive, PAg the second, and GAg the least, as you would expect. When evaluating the three variations of Two-Level Adaptive Branch Prediction, it is useful to know which variation is the least expensive when they predict with approximately the same accuracy.

Figure 8 illustrates three schemes which achieve about 97 percent prediction accuracy. One scheme is chosen for each variation to show the variation's configuration requirements to obtain that prediction accuracy. To achieve 97 percent prediction accuracy, GAg requires an 18-bit history register, PAg requires 12-bit history registers, and PAp requires 6-bit history registers. According to our cost estimates, PAg is the cheapest among these three. GAg's pattern history table is expensive when a long history register is used. PAp is expensive due to the required multiple pattern history tables.

Figure 8: Two-Level Adaptive schemes achieving about 97 percent prediction accuracy.

5.1.4 Effect of Context Switch

Since Two-Level Adaptive Branch Prediction uses the branch history table to keep track of branch history, the table needs to be flushed during a context switch. Figure 9 shows the difference in the prediction accuracy of three schemes simulated with and without context switches. During the simulation, a context switch is simulated whenever a trap occurs in the instruction trace, or every 500,000 instructions if no trap occurs. After a context switch, the pattern history table is not re-initialized, because the pattern history table of the saved process is more likely to be similar to the current process's pattern history table than to a re-initialized pattern history table. The value 500,000 is derived by assuming that context switches occur every 10 ms in a 50 MHz clock machine with 1 IPC.

The average accuracy degradations for the three schemes are all less than 1 percent. The accuracy degradations of gcc when PAg and PAp are used are much greater than those of the other programs because of the large number of traps in gcc. However, the excessive number of traps does not degrade the prediction accuracy of the GAg scheme, because an initialized global history register can be refilled quickly. The prediction accuracy of fpppp using GAg actually increases when context switches are simulated. There are very few conditional branches in fpppp and all the conditional branches have regular behavior; therefore, initializing the global history register helps clear out the noise.
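The 500,000-instruction interval follows directly from the stated assumptions: 50,000,000 cycles per second x 1 instruction per cycle x 0.010 seconds = 500,000 instructions. The sketch below reproduces that arithmetic and marks where context switches would be simulated in a trace. The function names are hypothetical, and resetting the instruction count after a trap-induced switch is an assumption, since the text does not say how the two triggers interact.

```python
def context_switch_interval(clock_hz, ipc, quantum_ms):
    """Instructions executed between context switches (integer arithmetic)."""
    return clock_hz * ipc * quantum_ms // 1000


def context_switch_points(trap_flags, interval):
    """Return trace indices at which a context switch is simulated.

    trap_flags holds one boolean per instruction: True if it is a trap.
    A switch happens at every trap, or once `interval` instructions have
    passed since the last switch (assumed to restart after each switch).
    """
    points = []
    since_last_switch = 0
    for index, is_trap in enumerate(trap_flags):
        since_last_switch += 1
        if is_trap or since_last_switch >= interval:
            points.append(index)
            since_last_switch = 0
    return points


# 50 MHz clock, 1 IPC, a context switch every 10 ms.
interval = context_switch_interval(50_000_000, 1, 10)
```

With these inputs `interval` evaluates to 500,000, matching the value used in the simulation.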
Figure 9: Effect of context switch on prediction accuracy.

5.1.5 Effect of Branch History Table Implementation

Figure 10 illustrates the effects of the size and associativity of the branch history table in the presence of context switches. Four practical branch history table implementations and an ideal branch history table were simulated. The four-way set-associative 512-entry branch history table's performance is very close to that of the ideal branch history table, because most branches in the programs can fit in the table. Prediction accuracy decreases as the table miss rate increases, which is also seen in the PAp schemes.

Figure 10: Effect of branch history table implementation on PAg schemes.

5.2 Comparison of Two-Level Adaptive Branch Prediction and Other Prediction Schemes

Figure 11 compares the branch prediction schemes. The PAg scheme which achieves 97 percent accuracy is chosen for comparison with other well-known schemes, because it costs the least among the three variations of Two-Level Adaptive Branch Prediction. The 4-way set-associative 512-entry BHT is used by all the schemes which keep the first-level branch history information, because it is simple enough to be implemented. The Two-Level Adaptive scheme and the Static Training scheme were chosen on the basis of similar costs.

The top curve is achieved by the Two-Level Adaptive scheme, whose average prediction accuracy is about 97 percent. Since the data for the Static Training schemes are not complete due to the unavailability of appropriate data sets for eqntott, fpppp, matrix300, and tomcatv, the data points for those benchmarks are not graphed. For the benchmarks that are available, PSg is about 1 to 4 percent lower than the top curve, with an average prediction accuracy of 94.4 percent, and GSg is about 4 to 19 percent lower. Note that their accuracy depends greatly on the similarities between the data sets used for training and testing.

The prediction accuracy of the branch target buffer using 2-bit saturating up-down counters [17] is around 93 percent. The Profiling scheme achieves about 91 percent prediction accuracy. The branch target buffer using Last-Time achieves about 89 percent prediction accuracy. Most of the prediction curves of BTFN and Always Taken are below the base line (76 percent). BTFN's average prediction accuracy is about 68.5 percent and Always Taken's is about 62.5 percent. In this figure, the Two-Level Adaptive scheme is superior to the other schemes by at least 2.6 percent.

Figure 11: Comparison of branch prediction schemes.
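The two simplest static baselines in this comparison, Always Taken and BTFN, can be expressed in a few lines. The sketch below is illustrative only: the trace format (tuples of branch address, target address, and outcome), the function names, and the addresses are all hypothetical, and "backward" is taken to mean a target address lower than the branch address.

```python
def always_taken(pc, target):
    """Always Taken: predict taken for every branch."""
    return True


def btfn(pc, target):
    """Backward Taken, Forward Not Taken (backward assumed to be target < pc)."""
    return target < pc


def accuracy(predict, trace):
    """Fraction of branches in the trace whose direction is predicted correctly."""
    correct = sum(1 for pc, target, taken in trace if predict(pc, target) == taken)
    return correct / len(trace)


# A loop-closing backward branch: taken 9 times, then falls through once.
loop_branch = [(0x120, 0x100, True)] * 9 + [(0x120, 0x100, False)]
# A forward branch that is never taken.
forward_branch = [(0x200, 0x240, False)] * 10
trace = loop_branch + forward_branch

btfn_accuracy = accuracy(btfn, trace)
always_taken_accuracy = accuracy(always_taken, trace)
```

On this synthetic trace BTFN mispredicts only the loop exit, while Always Taken also mispredicts every execution of the forward branch, which is the qualitative gap between the two schemes that the comparison describes.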
6 Concluding Remarks

In this paper we have proposed a new dynamic branch predictor (Two-Level Adaptive Branch Prediction) that achieves substantially higher accuracy than any other scheme that we are aware of. We computed the hardware costs of implementing three variations of this scheme and determined that the most effective implementation of Two-Level Adaptive Branch Prediction utilizes a per-address branch history table and a global pattern history table.

We have measured the prediction accuracy of the three variations of Two-Level Adaptive Branch Prediction and several other popular proposed dynamic and static prediction schemes using trace-driven simulation of nine of the ten SPEC benchmarks. We have shown that the average prediction accuracy for Two-Level Adaptive Branch Prediction is about 97 percent, while the other known schemes achieve at most 94.4 percent average prediction accuracy.

We have measured the effects of varying the parameters of the Two-Level Adaptive predictors. We noted the sensitivity to k, the length of the history register, and s, the size of each entry in the pattern history table. We reported on the effectiveness of the various prediction algorithms that use the pattern history table information. We showed the effects of context switching.

Finally, we should point out that we feel our 97 percent prediction accuracy figures are not good enough and that future research in branch prediction is still needed. High performance computing engines in the future will increase the issue rate and the depth of the pipeline, which will combine to increase further the amount of speculative work that will have to be thrown out due to a branch prediction miss. Thus, the 3 percent prediction miss rate needs improvement. We are examining that 3 percent to try to characterize it and hopefully reduce it.

Acknowledgments

The authors wish to acknowledge with gratitude the other members of the HPS research group at Michigan for the stimulating environment they provide, and in particular, for their comments and suggestions on this work. We are also grateful to Motorola Corporation for technical and financial support, and to NCR Corporation for the gift of an NCR Tower, Model No. 32, which was very useful in our work.

References

[1] T-Y Yeh and Y.N. Patt, "Two-Level Adaptive Branch Prediction", Technical Report CSE-TR-117-91, Computer Science and Engineering Division, Department of EECS, The University of Michigan, (Nov. 1991).

[2] T-Y Yeh and Y.N. Patt, "Two-Level Adaptive Branch Prediction", The 24th ACM/IEEE International Symposium and Workshop on Microarchitecture, (Nov. 1991), pp.51-61.

[3] M. Butler, T-Y Yeh, Y.N. Patt, M. Alsup, H. Scales, and M. Shebanow, "Instruction Level Parallelism is Greater Than Two", Proceedings of the 18th International Symposium on Computer Architecture, (May 1991), pp.276-286.

[4] D.R. Kaeli and P.G. Emma, "Branch History Table Prediction of Moving Target Branches Due to Subroutine Returns", Proceedings of the 18th International Symposium on Computer Architecture, (May 1991), pp.34-42.

[5] Motorola Inc., "M88100 User's Manual", Phoenix, Arizona, (March 13, 1989).

[6] W.W. Hwu, T.M. Conte, and P.P. Chang, "Comparing Software and Hardware Schemes for Reducing the Cost of Branches", Proceedings of the 16th International Symposium on Computer Architecture, (May 1989).

[7] N.P. Jouppi and D. Wall, "Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines", Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, (April 1989), pp.272-282.

[8] D.J. Lilja, "Reducing the Branch Penalty in Pipelined Processors", IEEE Computer, (July 1988), pp.47-55.

[9] W.W. Hwu and Y.N. Patt, "Checkpoint Repair for Out-of-order Execution Machines", IEEE Transactions on Computers, (December 1987), pp.1496-1514.

[10] P.G. Emma and E.S. Davidson, "Characterization of Branch and Data Dependencies in Programs for Evaluating Pipeline Performance", IEEE Transactions on Computers, (July 1987), pp.859-876.

[11] J.A. DeRosa and H.M. Levy, "An Evaluation of Branch Architectures", Proceedings of the 14th International Symposium on Computer Architecture, (June 1987), pp.10-16.

[12] D.R. Ditzel and H.R. McLellan, "Branch Folding in the CRISP Microprocessor: Reducing Branch Delay to Zero", Proceedings of the 14th International Symposium on Computer Architecture, (June 1987), pp.2-9.

[13] S. McFarling and J. Hennessy, "Reducing the Cost of Branches", Proceedings of the 13th International Symposium on Computer Architecture, (1986), pp.396-403.

[14] J. Lee and A.J. Smith, "Branch Prediction Strategies and Branch Target Buffer Design", IEEE Computer, (January 1984), pp.6-22.

[15] T.R. Gross and J. Hennessy, "Optimizing Delayed Branches", Proceedings of the 15th Annual Workshop on Microprogramming, (Oct. 1982), pp.114-120.

[16] D.A. Patterson and C.H. Sequin, "RISC-I: A Reduced Instruction Set VLSI Computer", Proceedings of the 8th International Symposium on Computer Architecture, (May 1981), pp.443-458.

[17] J.E. Smith, "A Study of Branch Prediction Strategies", Proceedings of the 8th International Symposium on Computer Architecture, (May 1981), pp.135-148.

[18] T.C. Chen, "Parallelism, Pipelining and Computer Efficiency", Computer Design, Vol. 10, No. 1, (Jan. 1971), pp.69-74.