Two-Level Adaptive Training Branch Prediction

Tse-Yu Yeh and Yale N. Patt
Department of Electrical Engineering and Computer Science
The University of Michigan
Ann Arbor, Michigan 48109-2122

Abstract

High-performance microarchitectures use, among other structures, deep pipelining to help speed up execution. The importance of a good branch predictor to the effectiveness of a deep pipeline in the presence of conditional branches is well-known. In fact, the literature contains proposals for a number of branch prediction schemes. Some are static, in that they use opcode information and profiling statistics to make predictions. Others are dynamic, in that they use run-time execution history to make predictions. This paper proposes a new dynamic branch predictor, Two-Level Adaptive Training, which alters the branch prediction algorithm on the basis of information collected at run-time. Several configurations of the Two-Level Adaptive Training Branch Predictor are introduced, simulated, and compared to simulations of other known static and dynamic branch prediction schemes. Two-Level Adaptive Training Branch Prediction achieves 97 percent accuracy on nine of the ten SPEC benchmarks, compared to less than 93 percent for the other schemes. Since a prediction miss requires flushing the speculative execution already in progress, the relevant metric is the miss rate: 3 percent for Two-Level Adaptive Training (best case) vs. 7 percent for the other schemes. This represents more than a 100 percent improvement in reducing the number of pipeline flushes required.

1 Introduction

Pipelining is one of the most effective ways to improve processor performance [18], and deep pipelines and speculative execution are in use on most high-performance machines. Conditional branches, however, impede the issuing of instructions: an unresolved branch forces the machine either to stall instruction fetch or to continue along a predicted path. There are two sources of loss. First, the branch condition must be resolved before the direction of the branch is known. Second, the target address must be calculated before the target instruction can be fetched. When the number of pipeline stages before execution is large, the number of cycles spent waiting for branches to be resolved is considerable, and the loss becomes greater as pipelines get deeper and issue bandwidth increases.

There are two ways to reduce this penalty. One is to predict the direction of a branch and to prefetch instructions from the predicted target before the branch is resolved, reducing pipeline bubbles as much as possible. The other is to resolve branches as early as possible in the pipeline, for example by calculating the target address during decoding, so that stalls due to waiting for the target address are reduced.

Branch prediction schemes can be classified as static or dynamic. Static schemes make their predictions from information available before execution, such as the opcode or the direction of the branch: certain opcodes tend to be taken more often than not, and backward branches tend to be taken because they frequently close loops. Simple static predictions, such as predicting that all branches are taken, achieve roughly 60 to 68 percent accuracy depending on the benchmark, as reported by Lee and Smith [13]. Dynamic schemes, in contrast, use execution history collected at run-time to make their predictions and can achieve considerably higher accuracy.
Among the dynamic schemes, Smith [16] proposed a branch prediction buffer in which each entry is a saturating 2-bit up-down counter: the counter is incremented when the branch is taken and decremented when it is not, and the branch is predicted taken when the counter value is two or more. Lee and Smith [13] proposed the Branch Target Buffer design, which records the recent behavior and target of each branch, and also the Static Training scheme, in which the prediction for each branch history pattern is determined from statistics collected during a pre-run of the program. Profiling has also been used to preset a prediction bit for each static branch [12, 5]; the branch is then predicted in the direction it took most frequently during the profiling run. Schemes that depend on a pre-run work well for loop-bound programs, whose branches behave the same way over all iterations of the execution, but they do not work as well for programs with irregular branch tendencies, and the statistics collected with certain data sets may not be applicable to runs on other data sets.

There is serious performance degradation in deep-pipelined and/or superscalar machines caused by branch prediction misses, because a large amount of speculative work beyond a mispredicted branch has to be discarded [1, 8]. The motivation for this work is therefore a branch predictor of higher accuracy. This paper proposes a new dynamic scheme, Two-Level Adaptive Training, whose prediction information is collected only at run-time, on the fly of program execution, so that no pre-run is needed and the predictions adapt to the current behavior of the program. Trace-driven simulations are used to compare the new scheme with other known schemes on the SPEC benchmark suite.

Section 2 introduces the concept and structure of Two-Level Adaptive Training Branch Prediction and contrasts it with Static Training. Section 3 discusses implementation methods. Section 4 describes the simulation methodology and the simulation model. Section 5 presents the simulation results for a wide variety of dynamic and static schemes, and Section 6 offers some concluding remarks.

2 Two-Level Adaptive Training Branch Prediction

Two-Level Adaptive Training Branch Prediction has the following characteristics: the prediction of a branch is based on the run-time history of the branches executed during the current execution of the program; the execution information used by the predictor is collected on the fly of program execution, by updating the branch history and the pattern history kept in the predictor; therefore, no pre-runs are necessary, and the predictor adjusts itself to different programs and different data sets.
2.1 Concept

Two-Level Adaptive Training Branch Prediction is so named because it keeps two levels of branch execution history. The first level is the record of the last n executions of each branch, kept in the branch's history register. The second level is the record of the branch behavior for the last s occurrences of each particular history pattern, accumulated in the pattern table. Predictions are made from both levels of information, and both are collected on the fly during the current execution of the program rather than gathered beforehand.

The structure of the scheme is shown in Figure 1. It contains two major data structures: the branch history register table (HRT) and the pattern table (PT). The history register table is accessed by the branch instruction address; each entry contains the history register (HR) of one branch, a shift register which records the outcomes, taken or not taken, of the most recent executions of that branch. The history register table used in this study is per-address: each static conditional branch has its own history register. The pattern table, on the other hand, is shared by all branches: it is indexed by the content of a history register, so each of its entries records how branches have behaved the last few times that particular history pattern appeared.

Figure 1: The structure of the Two-Level Adaptive Training scheme.
When a conditional branch B_i is being predicted, the content of its history register,

    Z_c = R_{i,c-k} R_{i,c-k+1} ... R_{i,c-1},

where k is the length of the history register, c denotes the current execution of the branch, and R_{i,j} is a 1 if the j-th execution of B_i was taken and a 0 if it was not, is used as an index into the pattern table. Since the history register has k bits, there are at most 2^k distinct patterns, and the pattern table has 2^k entries, each addressed by one distinct pattern. The pattern history bits S_c kept in the pattern table entry PT_{Z_c} are used to make the prediction:

    prediction of B_i = λ(S_c),   (1)

where λ is the prediction decision function. After the branch is resolved, the outcome R_{i,c} is shifted left into the history register in the least significant bit position, so the content of the history register becomes R_{i,c-k+1} R_{i,c-k+2} ... R_{i,c}, and the pattern history bits of the entry that was used are updated by the state transition function δ:

    S_{c+1} = δ(S_c, R_{i,c}).   (2)

The prediction decision and the updating of the pattern history bits can therefore be characterized by a finite-state machine in which the pattern history bits S_c are the state, λ is the output function, and δ is the transition function. Since the output depends only on the state, the machine is a Moore machine, and λ and δ can be implemented with simple combinational logic.

The state transition diagrams of the automata used in this study are shown in Figure 2. The Last-Time (LT) automaton uses only one bit, which records what the branch did the last time the history pattern appeared; the branch is predicted to do the same thing again. Automaton A2 is a saturating 2-bit up-down counter, the same counter used in Lee and Smith's Branch Target Buffer design [13]: the counter is incremented when the branch is taken and decremented when it is not, and the branch is predicted taken when the counter value is greater than or equal to two. Automata A1, A3, and A4 are other four-state machines which differ from A2 in some of their state transitions.

Figure 2: State transition diagrams of the Last-Time, A1, A2, A3, and A4 automata.
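To make the mechanism concrete, the following C sketch models a single branch's history register and a shared pattern table of two-bit saturating counters in the style of automaton A2. The register length, table size, initial values, and function names are illustrative assumptions, not part of the scheme's specification; the sketch shows the two-level lookup and update, not a hardware design.

    /* Minimal sketch of Two-Level Adaptive Training prediction for one branch,
     * assuming a k-bit history register and a shared pattern table of two-bit
     * saturating up-down counters (automaton A2). Names and sizes are
     * illustrative, not taken from the paper. */
    #include <stdio.h>

    #define K        12              /* history register length              */
    #define PT_SIZE  (1 << K)        /* one pattern table entry per pattern  */

    static unsigned char pattern_table[PT_SIZE]; /* state S in {0,1,2,3}     */
    static unsigned int  history_reg;            /* k most recent outcomes   */

    /* lambda: predict taken when the counter value is 2 or 3 */
    static int predict(void)
    {
        unsigned int pattern = history_reg & (PT_SIZE - 1);
        return pattern_table[pattern] >= 2;
    }

    /* delta: saturating increment on taken, decrement on not taken;
     * then shift the actual outcome into the history register. */
    static void update(int taken)
    {
        unsigned int pattern = history_reg & (PT_SIZE - 1);
        unsigned char s = pattern_table[pattern];

        if (taken  && s < 3) s++;
        if (!taken && s > 0) s--;
        pattern_table[pattern] = s;

        history_reg = ((history_reg << 1) | (taken ? 1u : 0u)) & (PT_SIZE - 1);
    }

    int main(void)
    {
        /* initialize so that branches are predicted taken at first,
         * as the simulation model in this paper does */
        for (int i = 0; i < PT_SIZE; i++) pattern_table[i] = 2;
        history_reg = PT_SIZE - 1;   /* history register preset to all 1's   */

        /* a loop-closing branch taken 9 times, then not taken once */
        int correct = 0, total = 0;
        for (int iter = 0; iter < 50; iter++) {
            int outcome = (iter % 10) != 9;
            correct += (predict() == outcome);
            total++;
            update(outcome);
        }
        printf("accuracy: %d/%d\n", correct, total);
        return 0;
    }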
Two-Level Adaptive Training and Static Training are similar in that both make their predictions from the history pattern of the branch: the history register supplies the pattern, and the pattern selects the prediction. The major difference is when the prediction for a given pattern is determined. In Static Training, the content of each pattern table entry is preset before execution from statistics gathered during a profiling run, so the prediction for a given pattern never changes while the program runs. In Two-Level Adaptive Training, the pattern table entries are updated dynamically at run-time, so the prediction for a given pattern can change as execution proceeds and the predictor adjusts to the current behavior of the program and its data set. As a result, accuracy is not lost when the program or the data set behaves differently from the profiled behavior.

3 Implementation Methods

3.1 Implementations of the History Register Table

It is not feasible in a real implementation for every static conditional branch to have its own history register; a program such as gcc has thousands of static conditional branches. The ideal history register table (IHRT), in which each static branch does have its own register, was simulated to provide an upper bound on achievable accuracy, and two practical approaches to implementing the history register table are proposed.

The first is the associative history register table (AHRT), implemented as a set-associative cache: a fixed number of history registers are grouped into a set, the lower part of the branch address is used to index into the table, and the higher part of the address is used as a tag. A Least-Recently-Used (LRU) algorithm is used for replacement within a set. When a branch does not have an entry in the AHRT, an entry is allocated for it and its history register is re-initialized. Four-way set-associative AHRTs with 256 and 512 entries were simulated.

The second is the hash history register table (HHRT), in which the branch address is hashed into the table, for example by using its lower bits, and no tags are kept. The cost of the HHRT is lower than that of an AHRT of the same size because no tag store is needed, but collisions can occur: different branches can hash to the same entry and update the same history register, and this interference is expected to lower the prediction accuracy. With an AHRT, a branch whose entry has been replaced loses its history but accumulates it again adaptively once a new entry is allocated; with an HHRT, a branch never loses its entry but may share it with other branches. HHRTs with 256 and 512 entries were simulated.
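The following sketch shows one plausible organization of the AHRT just described: a 512-entry, four-way set-associative array indexed by low address bits, tagged with the remaining bits, and managed with LRU replacement. The sizes, field widths, and the policy of allocating on a miss are illustrative assumptions; an HHRT variant would simply drop the tag compare and hash the address directly to an entry.

    /* Sketch of a 512-entry, 4-way set-associative AHRT: the set index comes
     * from the low bits of the branch address, the tag from the higher bits,
     * and the least-recently-used way in a set is replaced on a miss.
     * Sizes and field layout are illustrative assumptions. */
    #include <stdint.h>
    #include <string.h>

    #define WAYS 4
    #define SETS 128                      /* 128 sets * 4 ways = 512 entries */

    struct hrt_way {
        uint32_t tag;                     /* high bits of the branch address */
        uint16_t history;                 /* 12-bit history shift register   */
        uint8_t  valid;
        uint8_t  lru;                     /* 0 = most recently used          */
    };

    static struct hrt_way hrt[SETS][WAYS];

    static void hrt_init(void)
    {
        memset(hrt, 0, sizeof hrt);
        for (int s = 0; s < SETS; s++)
            for (int w = 0; w < WAYS; w++)
                hrt[s][w].lru = (uint8_t)w;   /* distinct ages per way */
    }

    static void touch(struct hrt_way *set, int way)
    {
        for (int w = 0; w < WAYS; w++)
            if (set[w].lru < set[way].lru) set[w].lru++;
        set[way].lru = 0;
    }

    /* Return the history register for this branch, allocating (and
     * re-initializing) the LRU entry of the set on a miss. */
    uint16_t *ahrt_lookup(uint32_t branch_addr)
    {
        uint32_t set_idx = (branch_addr >> 2) % SETS;  /* low address bits   */
        uint32_t tag     = branch_addr / (SETS * 4);   /* remaining bits     */
        struct hrt_way *set = hrt[set_idx];
        int victim = 0;

        for (int w = 0; w < WAYS; w++) {
            if (set[w].valid && set[w].tag == tag) {   /* hit                */
                touch(set, w);
                return &set[w].history;
            }
            if (set[w].lru >= set[victim].lru) victim = w;
        }
        set[victim].valid   = 1;                       /* miss: re-allocate  */
        set[victim].tag     = tag;
        set[victim].history = 0x0FFF;  /* all 1's: predict taken at first    */
        touch(set, victim);
        return &set[victim].history;
    }

    int main(void)
    {
        hrt_init();
        uint16_t *h1 = ahrt_lookup(0x400);   /* miss: entry allocated        */
        uint16_t *h2 = ahrt_lookup(0x400);   /* hit: same entry returned     */
        return (h1 == h2) ? 0 : 1;
    }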
3.2 Prediction Latency

Making a prediction with the Two-Level Adaptive Training scheme requires two sequential table lookups: the history register table must be accessed to obtain the branch's history register, and the pattern table must then be accessed with that history pattern to obtain the prediction. It is hard to squeeze two lookups into the single cycle available for a prediction in a high-performance machine. Therefore, a prediction bit is stored in each history register entry together with the history register. When a branch result is confirmed and the pattern table is updated, the pattern table entry corresponding to the branch's new history pattern is read and the resulting prediction is recorded in the branch's history register entry; the next time the branch is fetched, its prediction is available from the history register table lookup alone. A second latency issue arises when a branch must be predicted again before the result of its previous execution has been confirmed, which is the usual case for the branch closing a tight loop executed on a deep-pipelined or superscalar machine. Since such a loop branch has a high tendency to be taken, the prediction already recorded in its entry is usually still correct, and prediction need not stall until the previous result is confirmed.
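One way to realize the single-lookup arrangement described above is sketched below: each history register entry carries a cached prediction bit that is refreshed from the pattern table when the branch resolves, so the fetch-time access touches only the history register table. The structure, field names, and use of an A2-style counter are illustrative assumptions rather than the paper's exact hardware design.

    /* Sketch of caching the prediction bit in the history register entry so a
     * prediction needs only one table access at fetch time; the pattern table
     * is consulted only when the branch resolves. Illustrative only. */
    #include <stdbool.h>

    #define K        12
    #define PT_SIZE  (1 << K)

    struct hrt_entry {
        unsigned int history;      /* k-bit history shift register            */
        bool         prediction;   /* cached lambda(S) for the current history */
    };

    static unsigned char pattern_table[PT_SIZE];   /* two-bit counters (A2)  */

    /* Fetch stage: a single HRT access yields the prediction. */
    bool predict(const struct hrt_entry *e)
    {
        return e->prediction;
    }

    /* Resolution stage: update the pattern table entry for the old history,
     * shift in the outcome, then precompute the prediction for the new
     * history so it is ready the next time this branch is fetched. */
    void resolve(struct hrt_entry *e, bool taken)
    {
        unsigned int old = e->history & (PT_SIZE - 1);
        unsigned char s = pattern_table[old];
        if (taken  && s < 3) s++;
        if (!taken && s > 0) s--;
        pattern_table[old] = s;

        e->history    = ((e->history << 1) | (taken ? 1u : 0u)) & (PT_SIZE - 1);
        e->prediction = pattern_table[e->history & (PT_SIZE - 1)] >= 2;
    }

    int main(void)
    {
        for (int i = 0; i < PT_SIZE; i++) pattern_table[i] = 2;
        struct hrt_entry e = { .history = PT_SIZE - 1, .prediction = true };
        for (int i = 0; i < 20; i++) resolve(&e, i % 5 != 4);
        return predict(&e) ? 0 : 1;
    }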
4 Simulation Methodology and Model

Trace-driven simulations were used in this study. A Motorola 88100 instruction-level simulator (ISIM) generates instruction and address traces, which are fed into the branch prediction simulator. The branch prediction simulator decodes the instructions, predicts the branches, verifies each prediction when the branch result becomes known, and collects the prediction statistics. In the M88100 instruction set [4], branch instructions are classified into four classes: conditional branches, unconditional branches, subroutine calls, and subroutine returns; the remaining instructions are non-branch instructions. Conditional branches must wait for the condition to be resolved before a prediction can be verified; their direction is predicted with the schemes under study. Unconditional branches and subroutine calls with immediate offsets can have their target addresses generated immediately by adding the offset to the program counter, while branches on registers have to wait for the register value to become available before the target can be calculated. Subroutine returns are predicted with a return address stack, a scheme proposed by Kaeli and Emma [2]: when a subroutine call is executed, the return address is pushed onto the stack, and when a return is detected, an address is popped off the stack and used as the predicted target. The prediction can miss when the stack overflows.
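The return address stack just described can be pictured with the following sketch; the depth and the wrap-around overflow policy are illustrative assumptions (on overflow the oldest entry is simply lost, which is one possible source of the misses mentioned above).

    /* Sketch of a return address stack for predicting subroutine returns:
     * push the return address on a call, pop it on a return. Depth and
     * overflow handling are illustrative assumptions. */
    #include <stdint.h>
    #include <stdio.h>

    #define RAS_DEPTH 16

    static uint32_t ras[RAS_DEPTH];
    static int      ras_top;                   /* count of pushes minus pops */

    void ras_push(uint32_t return_addr)        /* on a subroutine call       */
    {
        ras[ras_top % RAS_DEPTH] = return_addr; /* oldest entry is lost once */
        ras_top++;                              /* the stack is full         */
    }

    uint32_t ras_pop(void)                     /* on a subroutine return     */
    {
        if (ras_top == 0) return 0;            /* empty: no prediction       */
        ras_top--;
        return ras[ras_top % RAS_DEPTH];
    }

    int main(void)
    {
        ras_push(0x1000);                      /* call site A                */
        ras_push(0x2000);                      /* nested call site B         */
        printf("predicted return: 0x%x\n", (unsigned)ras_pop()); /* 0x2000   */
        printf("predicted return: 0x%x\n", (unsigned)ras_pop()); /* 0x1000   */
        return 0;
    }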
4.1 Description of the Benchmarks

Nine benchmarks from the SPEC suite were used in this study: five floating point benchmarks (doduc, fpppp, matrix300, spice2g6, and tomcatv) and four integer benchmarks (eqntott, espresso, gcc, and li); nasa7 was not included. The number of dynamic instructions executed by the benchmarks ranges from about fifty million to 1.8 billion. The floating point benchmarks tend to have repetitive loop execution, so high conditional branch prediction accuracy is attainable for them; fpppp and gcc show more irregular branch behavior. Table 1 lists the number of static conditional branches in each benchmark.

Table 1: Number of static conditional branches in each benchmark.

    Benchmark Name   Number of Static Cnd. Br.
    eqntott            277
    espresso           556
    gcc               6922
    li                 489
    doduc             1149
    fpppp              653
    matrix300          213
    spice2g6           606
    tomcatv            370

Figure 3 shows the distribution of dynamic instructions: about 24 percent of the dynamic instructions are branch instructions for the integer benchmarks and about 5 percent for the floating point benchmarks. Figure 4 shows the distribution of dynamic branch instructions: about 80 percent of the dynamic branches are conditional branches, which are the class of branches this study focuses on and for which prediction accuracy is reported.

Figure 3: Distribution of dynamic instructions.

Figure 4: Distribution of dynamic branch instructions.

4.2 Simulation Model

Several configurations of the Two-Level Adaptive Training scheme, the Static Training scheme, and Lee and Smith's Branch Target Buffer design were simulated; they are listed in Table 2. The naming convention used to distinguish the configurations is Scheme(HRT(Size, Entry_Content), Pattern(Size, Entry_Content), Data). Scheme specifies the branch prediction scheme: AT for Two-Level Adaptive Training, ST for Static Training, and LS for Lee and Smith's design. HRT specifies the history register table implementation, AHRT, HHRT, or IHRT; Size is the number of entries, and Entry_Content specifies what is kept in each entry, for example 12SR for a 12-bit shift register, A2 or LT for an automaton, or PB for a prediction bit. Pattern specifies the number of pattern table entries and the content of each entry. Data specifies, for the Static Training schemes, whether the training and testing data sets are the same (Same) or different (Diff); a field that is not applicable is left blank. For example, AT(AHRT(512,12SR), PT(2^12,A2),) denotes the Two-Level Adaptive Training scheme using a 512-entry four-way set-associative history register table with a 12-bit shift register in each entry and a 2^12-entry pattern table with automaton A2 in each entry.

Table 2: Configurations of the simulated schemes. AT - Two-Level Adaptive Training, ST - Static Training, LS - Lee and Smith's design; AHRT - Associative History Register Table, HHRT - Hash History Register Table, IHRT - Ideal History Register Table; SR - Shift Register, PB - Prediction Bit, LT - Last-Time, Atm - Automaton. The configurations simulated include:

    AT(AHRT(256,12SR), PT(2^12,A2),)
    AT(AHRT(512,12SR), PT(2^12,A2),)
    AT(AHRT(512,12SR), PT(2^12,A3),)
    AT(AHRT(512,12SR), PT(2^12,A4),)
    AT(AHRT(512,12SR), PT(2^12,LT),)
    AT(AHRT(512,10SR), PT(2^10,A2),)
    AT(AHRT(512,8SR),  PT(2^8,A2),)
    AT(AHRT(512,6SR),  PT(2^6,A2),)
    AT(HHRT(256,12SR), PT(2^12,A2),)
    AT(HHRT(512,12SR), PT(2^12,A2),)
    AT(IHRT(,12SR),    PT(2^12,A2),)
    ST(AHRT(512,12SR), PT(2^12,PB),Same)
    ST(AHRT(512,12SR), PT(2^12,PB),Diff)
    ST(HHRT(512,12SR), PT(2^12,PB),Same)
    ST(HHRT(512,12SR), PT(2^12,PB),Diff)
    ST(IHRT(,12SR),    PT(2^12,PB),Same)
    ST(IHRT(,12SR),    PT(2^12,PB),Diff)
    LS(AHRT(512,A2),,)
    LS(AHRT(512,LT),,)
    LS(HHRT(512,A2),,)
    LS(HHRT(512,LT),,)
    LS(IHRT(,A2),,)
    LS(IHRT(,LT),,)
In addition to the schemes listed in Table 2, Smith's scheme [16], a profiling scheme, the Always Taken scheme, and the Backward Taken and Forward Not-taken (BTFN) scheme were simulated for comparison. The profiling scheme predicts each static branch in the direction it takes most frequently: the direction of every branch is determined by counting, during a profiling run, the number of times the branch is taken and the number of times it is not taken. The prediction accuracy of the profiling scheme is calculated by taking the sum, over all static branches, of the larger of the two counts and dividing it by the total number of dynamic conditional branches executed.

The simulation model makes the following assumptions:

1. When a history register table entry is re-allocated to a different branch, the history register bits of the entry are re-initialized.
2. Since about 60 percent of dynamic conditional branches are taken, a branch is more likely to be taken than not. Accordingly, the history register bits are initialized to all 1's at the beginning of execution and on re-allocation, so that a branch with no recorded history is predicted taken.
3. The pattern table entries are initialized at the beginning of execution so that all patterns predict taken: automata A1, A2, A3, and A4 are initialized to a state from which taken is predicted, and Last-Time entries are initialized to 1.

For the Static Training schemes, the pattern table is preset from the training run before the testing run begins; when no separate training data set is specified, training and testing use the same data set.
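The profiling measurement described above reduces to simple per-branch bookkeeping; the sketch below tallies taken and not-taken counts per static branch over a recorded trace and computes the resulting accuracy as the sum of the larger counts divided by the number of dynamic conditional branches. The trace representation and the bound on the number of static branches are illustrative assumptions.

    /* Sketch of the profiling prediction described above: each static branch
     * is predicted in the direction it took more often during the profiling
     * run, and the accuracy is the sum of the larger of the two per-branch
     * counts divided by the total number of dynamic conditional branches. */
    #include <stdio.h>

    #define MAX_STATIC_BRANCHES 8192

    struct trace_rec { int branch_id; int taken; };

    double profiling_accuracy(const struct trace_rec *trace, int n)
    {
        static long taken[MAX_STATIC_BRANCHES], not_taken[MAX_STATIC_BRANCHES];
        long correct = 0, total = 0;

        for (int i = 0; i < n; i++) {              /* profiling pass          */
            if (trace[i].taken) taken[trace[i].branch_id]++;
            else                not_taken[trace[i].branch_id]++;
        }
        for (int b = 0; b < MAX_STATIC_BRANCHES; b++) {
            long t = taken[b], nt = not_taken[b];
            correct += (t > nt) ? t : nt;          /* larger of the two counts */
            total   += t + nt;
        }
        return total ? (double)correct / (double)total : 0.0;
    }

    int main(void)
    {
        struct trace_rec trace[] = {
            {0, 1}, {0, 1}, {0, 0}, {1, 0}, {1, 0}, {1, 1}, {0, 1},
        };
        printf("profiling accuracy: %.3f\n",
               profiling_accuracy(trace, sizeof trace / sizeof trace[0]));
        return 0;
    }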
5 Simulation Results

Figures 5 through 10 show the prediction accuracy of the simulated schemes. In these figures the horizontal axis lists the benchmarks: "Int G Mean" is the geometric mean of the prediction accuracy across the four integer benchmarks, "FP G Mean" the geometric mean across the five floating point benchmarks, and "Tot G Mean" the geometric mean across all nine benchmarks. The vertical axis shows the prediction accuracy, the ratio of correct predictions to the total number of dynamic conditional branch predictions. This section presents the results for the Two-Level Adaptive Training schemes and the Static Training schemes, then the other schemes, and concludes with a comparison across all of them.

5.1 Two-Level Adaptive Training Schemes

5.1.1 Effect of State Transition Automata

Figure 5 shows the prediction accuracy of Two-Level Adaptive Training schemes using different state transition automata in the pattern table; an ideal history register table is used in order to isolate the effect of the automata. The schemes using the four-state automata A2, A3, and A4 all achieve around 97 percent accuracy, about 1 percent higher than the scheme using Last-Time. The four-state automata perform better because they are more tolerant of noise in the history: a single deviation from a branch's usual behavior does not immediately change the prediction, as it does with Last-Time, which records only what happened the last time the pattern appeared. Automaton A1 was inferior to the other automata and is not included in the figure. Since A2 usually performs best, A2 is used in the following experiments.

Figure 5: Two-Level Adaptive Training schemes using different state transition automata.

5.1.2 Effect of History Register Table Implementation

Figure 6 shows the prediction accuracy of the Two-Level Adaptive Training scheme with different history register table implementations: an IHRT, 512-entry and 256-entry four-way set-associative AHRTs, and 512-entry and 256-entry HHRTs. The scheme with the IHRT performs best, the 512-entry AHRT second, the 256-entry AHRT third, the 512-entry HHRT fourth, and the 256-entry HHRT worst. For the AHRTs the decreasing accuracy follows the decreasing hit ratio of the table; for the HHRTs the interference between branches hashed to the same entry increases as the table becomes smaller, lowering the accuracy further.

Figure 6: Two-Level Adaptive Training schemes with different history register table implementations.

5.1.3 Effect of History Register Length

Figure 7 shows the effect of the history register length on prediction accuracy. Two-Level Adaptive Training schemes with 6-bit, 8-bit, 10-bit, and 12-bit history registers were simulated, each with a pattern table of the corresponding size. Lengthening the history registers improves the prediction accuracy, although the accuracy approaches an asymptote as the length increases; the highest accuracy, about 97 percent, is achieved with 12-bit history registers. The effect of history register length on the Static Training schemes is similar.

Figure 7: Two-Level Adaptive Training schemes using history registers of different lengths.

5.2 Static Training Schemes

Although Static Training requires simpler hardware than Two-Level Adaptive Training, because the pattern table is preset and no state transition logic is needed at run-time, it requires profiling support. A pre-run of the program must be performed to gather the statistics: the history pattern of each branch must be tracked, the taken and not-taken probabilities of every pattern must be calculated, and the pattern table must be preset accordingly before execution. In order to offer a fair comparison, the Static Training schemes were simulated with the same history register table implementations and history register lengths as the Two-Level Adaptive Training schemes.

The Static Training schemes were trained with one data set and then tested both on the same data set and on a different data set; the training and testing data sets of each benchmark are listed in Table 3. For benchmarks for which no alternative data set was available, the different-data-set experiment is marked NA and those benchmarks are excluded from that comparison; the accuracy is also not graphed for benchmarks whose training with a different data set is not complete. Figure 8 shows the prediction accuracy of the Static Training schemes. When the testing data set is the same as the training data set, the accuracy is close to that of Two-Level Adaptive Training, about 97 percent in the best configuration. When the testing data set differs from the training data set, the accuracy drops: the degradation is small, within about 0.5 to 1 percent, for benchmarks whose branch behavior is regular across data sets, but it is larger for integer benchmarks whose branch behavior changes with the input data. Since the branch behavior of the floating point programs is more regular, the degradation is less apparent for them than for the integer programs.

Table 3: Training and testing data sets used for the Static Training schemes (NA indicates that no alternative data set was available).

Figure 8: Prediction accuracy of the Static Training schemes.
5.3 Other Schemes

Figure 9 shows the simulation results of Lee and Smith's Branch Target Buffer designs, the Always Taken scheme, and the Backward Taken and Forward Not-taken (BTFN) scheme. The Branch Target Buffer designs were simulated with the automata A1, A2, A3, A4, and Last-Time and with the IHRT, AHRT, and HHRT implementations; only the configurations using A2 and Last-Time are shown, because the results using A1, A3, and A4 are similar to those using A2. With practical history register table implementations, the accuracy of Lee and Smith's designs is bounded at about 93 percent. The Always Taken scheme predicts well for loop-bound benchmarks; for matrix300 and tomcatv its accuracy is as high as 98 percent. For many of the other benchmarks, however, its accuracy falls markedly, to below 60 percent for some, and the accuracy for a given benchmark changes from one data set to another. Overall, the simple static schemes are much less accurate than the dynamic schemes: the Always Taken and BTFN schemes average roughly 69 to 76 percent across the benchmarks.

The profiling scheme simulated here is fairly simple: the program is run once to accumulate, for each static branch, statistics of how many times the branch is taken and how many times it is not, and the prediction bit of the branch is set or cleared depending on which count is larger. At run-time the branch is predicted according to its prediction bit. The cost of the profiling run is not accounted for in the comparison. The average prediction accuracy of the profiling scheme is about 92.5 percent, higher than the other static schemes but lower than the dynamic schemes.

Figure 9: Prediction accuracy of the Branch Target Buffer, Always Taken, and BTFN schemes.

5.4 Comparison of Schemes

Figure 10 illustrates the comparison between the schemes discussed above. The 512-entry four-way set-associative AHRT was chosen for all the schemes that use a history register table, because it is simple enough to be implemented in a high-performance processor. As can be seen from the graph, the Two-Level Adaptive Training scheme is at the top, with an average prediction accuracy of about 97 percent. The Static Training scheme, shown with its best configuration, is about 1 to 5 percent lower, depending on whether it is tested on the data set it was trained with. Lee and Smith's Branch Target Buffer design and the profiling scheme predict comparably, each averaging in the low 90s, and the Always Taken and BTFN schemes form the bottom curves.

Figure 10: Comparison of branch prediction schemes.
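The training cost referred to in Sections 5.2 and 5.4 amounts to a pre-run pass of the kind sketched below: the profiler follows each branch's history pattern, tallies the outcomes observed for every pattern, and presets one prediction bit per pattern table entry before the production run. The history length, trace format, and names are illustrative assumptions, not the mechanism used in the paper's simulations.

    /* Sketch of a Static Training pre-run: track each branch's history
     * pattern during a profiling pass, accumulate taken/not-taken counts per
     * pattern, and preset the pattern table with one prediction bit per
     * pattern. Sizes and the trace format are illustrative assumptions. */
    #include <stdio.h>

    #define K                   6
    #define PT_SIZE             (1 << K)
    #define MAX_STATIC_BRANCHES 8192

    struct trace_rec { int branch_id; int taken; };

    static unsigned int history[MAX_STATIC_BRANCHES]; /* per-branch history */
    static long taken_cnt[PT_SIZE], not_taken_cnt[PT_SIZE];
    static unsigned char preset_prediction[PT_SIZE];   /* the preset PT     */

    void static_training(const struct trace_rec *trace, int n)
    {
        for (int i = 0; i < n; i++) {
            unsigned int *h = &history[trace[i].branch_id];
            unsigned int pattern = *h & (PT_SIZE - 1);
            if (trace[i].taken) taken_cnt[pattern]++;
            else                not_taken_cnt[pattern]++;
            *h = ((*h << 1) | (trace[i].taken ? 1u : 0u)) & (PT_SIZE - 1);
        }
        /* preset each pattern's prediction to its more frequent outcome */
        for (int p = 0; p < PT_SIZE; p++)
            preset_prediction[p] = taken_cnt[p] >= not_taken_cnt[p];
    }

    int main(void)
    {
        struct trace_rec trace[] = {
            {0, 1}, {0, 1}, {0, 1}, {0, 0},   /* loop taken 3 times, then exit */
            {0, 1}, {0, 1}, {0, 1}, {0, 0},
        };
        static_training(trace, sizeof trace / sizeof trace[0]);
        printf("prediction for pattern 000111: %d\n", preset_prediction[7]);
        return 0;
    }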
6 Concluding Remarks

This paper proposes a new dynamic branch prediction scheme, Two-Level Adaptive Training, which makes its predictions from the pattern history of branch behavior collected at run-time: a history register records the outcomes of the last n executions of a branch, and a pattern table records what branches did the last times each history pattern appeared. The scheme was simulated with three history register table implementations, the IHRT, an ideal table large enough to hold every unique static branch; the AHRT, a set-associative cache; and the HHRT, a hash table which is cheaper but less accurate because of interference, and with various history register lengths. It was compared with Static Training, Lee and Smith's Branch Target Buffer designs, Smith's scheme, a profiling scheme, Always Taken, and Backward Taken Forward Not-taken on nine benchmarks of the SPEC suite.

Using a 512-entry four-way set-associative AHRT with 12-bit history registers, Two-Level Adaptive Training achieves an average prediction accuracy of about 97 percent, higher than any of the other dynamic or static schemes simulated; the accuracy can be improved further by lengthening the history registers or by enlarging the table toward the IHRT bound. Since a prediction miss requires flushing the speculative execution already in progress, the higher accuracy translates into a reduction of more than half in the number of pipeline flushes required. Deep pipelining and superscalar execution exploit instruction-level parallelism to improve processor performance, but their effectiveness depends critically on good branch prediction, because the penalty associated with a mispredicted branch is high. The Two-Level Adaptive Training Branch Predictor is proposed as a way to support such high-performance processors by minimizing the penalty of mispredicted branches.

References

[1] M. Butler, T-Y. Yeh, Y. N. Patt, M. Alsup, H. Scales, and M. Shebanow, "Single Instruction Stream Parallelism is Greater Than Two," Proceedings of the 18th International Symposium on Computer Architecture, (May 1991), pp. 276-286.

[2] D. R. Kaeli and P. G. Emma, "Branch History Table Prediction of Moving Target Branches Due to Subroutine Returns," Proceedings of the 18th International Symposium on Computer Architecture, (May 1991), pp. 34-42.

[3] T-Y. Yeh and Y. N. Patt, "Two-Level Adaptive Training Branch Prediction," Technical Report, University of Michigan, (1991).

[4] Motorola Inc., "M88100 User's Manual," Phoenix, Arizona, (March 13, 1989).

[5] N. P. Jouppi and D. Wall, "Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines," Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, (April 1989), pp. 272-282.

[6] W. W. Hwu, T. M. Conte, and P. P. Chang, "Comparing Software and Hardware Schemes for Reducing the Cost of Branches," Proceedings of the 16th International Symposium on Computer Architecture, (1989).

[7] D. J. Lilja, "Reducing the Branch Penalty in Pipelined Processors," IEEE Computer, (July 1988), pp. 47-55.

[8] W. W. Hwu and Y. N. Patt, "Checkpoint Repair for Out-of-order Execution Machines," IEEE Transactions on Computers, (December 1987), pp. 1496-1514.

[9] P. G. Emma and E. S. Davidson, "Characterization of Branch and Data Dependencies in Programs for Evaluating Pipeline Performance," IEEE Transactions on Computers, (July 1987), pp. 859-876.

[10] J. A. DeRosa and H. M. Levy, "An Evaluation of Branch Architectures," Proceedings of the 14th International Symposium on Computer Architecture, (June 1987), pp. 10-16.

[11] D. R. Ditzel and H. R. McLellan, "Branch Folding in the CRISP Microprocessor: Reducing Branch Delay to Zero," Proceedings of the 14th International Symposium on Computer Architecture, (June 1987), pp. 2-9.
[12] S. McFarling and J. Hennessy, "Reducing the Cost of Branches," Proceedings of the 13th International Symposium on Computer Architecture, (1986), pp. 396-403.

[13] J. Lee and A. J. Smith, "Branch Prediction Strategies and Branch Target Buffer Design," IEEE Computer, (January 1984), pp. 6-22.

[14] T. R. Gross and J. Hennessy, "Optimizing Delayed Branches," Proceedings of the 15th Annual Workshop on Microprogramming, (Oct. 1982), pp. 114-120.

[15] D. A. Patterson and C. H. Sequin, "RISC-I: A Reduced Instruction Set VLSI Computer," Proceedings of the 8th International Symposium on Computer Architecture, (May 1981), pp. 443-458.

[16] J. E. Smith, "A Study of Branch Prediction Strategies," Proceedings of the 8th International Symposium on Computer Architecture, (May 1981).

[17] L. E. Shar and E. S. Davidson, "A Multiminiprocessor System Implemented Through Pipelining," IEEE Computer, (Feb. 1974), pp. 42-51.

[18] T. C. Chen, "Parallelism, Pipelining, and Computer Efficiency," Computer Design, Vol. 10, No. 1, (Jan. 1971), pp. 69-74.