Evaluation of Design Options for the Trae Cahe Feth Mehanism Sanjay Jeram Patel, Daniel Holmes Friendly, and Yale N. Patt Fellow, IEEE Advaned Computer Arhiteture Laboratory Department of Eletrial Engineering and Computer Siene The University of Mihigan Ann Arbor, MI fsanjayp, 48109-2122 ites, pattgees.umih.edu Tel: (313) 936-0404 Fax: (313) 763-4617 ABSTRACT In this paper, we examine some ritial design features of a trae ahe feth engine for a 16-wide issue proessor and evaluate their eets on performane. We evaluate path assoiativity, partial mathing, and inative issue, all of whih are straightforward extensions to the trae ahe. We examine features suh as the ll unit and branh preditor design. In our nal analysis, we show that the trae ahe mehanism attains a 28% performane improvement over an aggressive single blok feth mehanism and a 15% improvement over a sequential multi-blok mehanism. Keywords: high bandwidth feth mehanisms, trae ahe, instrution ahe, wide issue mahines, speulative exeution 1 Introdution A miroproessor is omposed of three fundamental omponents: a means to supply instrutions, a means to supply the data needed by these instrutions, and a means to proess these instrutions. For high performane, the instrutions and data must be delivered at high rate to a proessing ore apable of eetively onsuming them. Delivering instrutions at a high rate is not a straightforward task. Several fators degrade feth engine performane. First, ahe misses ause the feth engine to stall and supply nothing until the miss is resolved. Seond, branh mispreditions ause the feth engine to supply instrutions whih will later be disarded. Third, hanges in ontrol ow inhibit the feth engine from produing a full width of instrutions. Due to the physial struture of instrution ahes, it is diÆult to feth both a taken branh and its target in the same yle. Table 1 shows the average number of instrutions between all branhes and between taken branhes (i.e., taken onditional branhes, jumps, subroutine alls, returns and traps) for the SPECint95 benhmarks. Evident from this table is that if partial fethes due to branhes are not dealt with, then feth bandwidth will be limited to an average limit of 9 instrutions per yle. insts per branh insts per taken branh mprs g go ijpeg li m88k perl vrtx avg 5.57 5.05 6.63 10.40 4.27 4.35 5.21 6.41 5.99 7.78 8.32 9.85 13.51 6.55 5.28 7.94 9.85 8.64 Table 1: Instrution run lengths when terminating fethes on branhes. A straightforward tehnique for dealing with partial fethes is to onstrut an instrution ahe where multiple independent fethes an be performed eah yle. To do this, multiple addresses are used to index into the instrution ahe via multiple read ports. After the ahe lines are read, the requested instrutions must be properly aligned and merged before they are supplied for exeution. Suh a solution adds onsiderable logi omplexity in an already ritial exeution path of the proessor. Either yle time will be aeted or extra pipeline stages will be required. Reently proposed, the trae ahe [20, 11, 22, 19℄ overomes this bandwidth hurdle without requiring exessive logi omplexity in the instrution delivery path. Like an instrution ahe, the trae ahe is aessed using the Program Counter. Unlike an instrution ahe, 2 a trae ahe line ontains instrutions as they appear in exeution order, as opposed to the stati order determined by the ompiler. Two adjaent instrutions in a trae ahe line need not be adjaent in the exeutable image. A trae ahe line stores a segment of the dynami instrution stream. By plaing logially ontiguous instrutions in physially ontiguous storage, the trae ahe is able to supply multiple feth bloks eah yle. A feth blok roughly orresponds to a ompiler basi blok { it is a dynami sequene of instrutions starting at the urrent feth address and ending at the next ontrol instrution. In addition to inreasing the delivered instrution bandwidth by inreasing the eetive feth rate (the average number of orret instrutions issued for eah feth that returns onpath instrutions), ahing instrutions as trae segments makes reproessing them easier. Sine instrutions are written into the trae ahe after being deoded, deode bits are stored along with eah instrution minimizing the proessing required when it is fethed again. In this paper, we evaluate several extensions of the trae ahe suh as partial mathing, path assoiativity, and inative issue. We also examine some of the key omponents of the trae ahe feth mehanism suh as the ll unit and multiple branh preditor. 2 Related Work Cahe organizations for simultaneously fething multiple bloks have been studied by Yeh et al [28℄, Conte et al [4℄ and Sezne et al [24℄. By multiporting the instrution ahe and/or the branh target buer (BTB) and generating multiple feth addresses and branh preditions per yle, these shemes are able to overome the single feth blok bottlenek. There are several ompile-time approahes to solving the feth bottlenek problem. With Blok-Strutured ISAs [14, 8℄, superblok sheduling [9℄ and trae sheduling [5, 3℄, the stati form of the program is organized by the ompiler into longer sequential units omposed of 3 multiple basi bloks. Melvin and Patt [15℄ disuss the performane impliations of the ll unit and the idea of dynamially ombining feth bloks into larger \exeution atomi units" (EAUs) to further inrease the feth bandwidth is rst proposed. In 1994, Peleg and Weiser led a patent on the trae ahe onept [20℄. A similar onept was proposed by Johnson [11℄. The onept was further investigated by Rotenberg et al [22℄. They presented a thorough omparison between the trae ahe sheme and several hardware-based high-bandwidth feth shemes and showed the advantage of using a trae ahe, both in performane and lateny. Extensions and analysis of the trae ahe mehanism were proposed by Patel et al [19, 6, 18℄ and trae ahe impliations on proessor design were presented in [23℄. A similar approah to ahing dynami instrution groups was presented in the DIF ahe by Nair and Hopkins [17℄. 3 The Trae Cahe Feth Mehanism We divide the trae ahe feth mehanism into four major omponents: a trae ahe, a ll unit, a multiple branh preditor, and a onventional instrution ahe. The trae ahe is the main soure of instrution supply and is lled with trae segments by the ll unit. The speulative sequening of segments is performed by the multiple branh preditor. The instrution ahe plays an important but supporting role, handling ases when the required instrutions are not found in the trae ahe. A high-level diagram of the mehanism is shown in gure 1. In this setion, we provide the details of a feth engine for a 16-wide issue dynamiallysheduled mahine. To meet the instrution bandwidth demands of this mahine, 16 instrutions and three onditional branhes an be fethed per yle. 4 3.1 The Trae Cahe The trae ahe stores segments of the dynami instrution stream, exploiting the fat that many branhes tend to favor one outome. If blok A is followed by blok B whih in turn is followed by blok C at a partiular point in the exeution of a program, there is a strong likelihood that they will be exeuted in that order again. After the rst time they are exeuted in this order, they are stored in the trae ahe as a single entry. Subsequent fethes of blok A from the trae ahe provide bloks B and C as well. Figure 2 shows an example of a trae ahe feth yle. A request is made with the address of blok A. The trae ahe responds with a hit and drives out the seleted segment omposed of the bloks A, B, and C. The predition strutures are aessed onurrently with the trae ahe. At the end of the yle, the segment is mathed with the predition. Suppose the preditor selets the path ABD. Then only bloks A and B are supplied. Blok D is requested in the following yle. The onept of supplying only the mathing portion of a trae ahe line is alled partial mathing and its impliations will be examined in setion 5.1. The trae ahe an store segments ontaining up to 16 instrutions, 3 of whih may be onditional branhes. The line is aessed by the address of the rst instrution in the segment. The organization of the trae ahe is similar to that of a onventional instrution ahe, as the lines may be arranged in an assoiative manner. A hit is determined by a tag math. Eah line of the trae ahe ontains: 16 slots for instrutions. Instrutions are stored in deoded form and oupy approxi- mately ve bytes for a typial ISA. Up to three branhes an be stored per line. Eah instrution is marked with a two-bit tag indiating to whih blok it belongs. 5 Fetch Address Fill Unit Trace Cache Multiple Branch Predictor Selection Logic Instruction Cache Path L2 Decoder Next Fetch Address Unified Cache Register Rename HPS Execution Core Figure 1: The trae ahe feth mehanism. A T B NT T Address of A D C Trace Cache A B C Multiple Branch Predictor Selection Logic A T/NT/T B Figure 2: The trae ahe and branh preditor are indexed with the address of blok A. The inset gure shows the ontrol ow from blok A. The preditor selets path ABD. The trae ahe only ontains ABC. AB is supplied. 6 Four target addresses. With three basi bloks per segment and the ability to feth partial segments, there are four possible targets to a segment. The four addresses are expliitly stored allowing immediate generation of the next feth address, even for ases where only a partial segment mathes. Path information. This eld enodes the number and diretions of branhes in the segment and inludes bits to identify whether a segment ends in a branh and whether that branh is a return from subroutine instrution. In the ase of a return instrution, the return address stak provides the next feth address. The total size of a line is around 97 bytes for a typial arhiteture: 5x16 bytes of instrutions, 4x4 bytes of target addresses, and 1 byte of path information. Instrution dependenies within a trae segment are predetermined before the segment is stored in the trae ahe. This additional information allows for minimal deoding when the segment is fethed from the trae ahe. Soure operands produed by an instrution outside the segment are expliitly marked as requiring an external value. Instrutions whih produe a live-out value are marked as requiring a physial register. It is important to note that a omplex dependeny analysis aross 16 instrutions does not need to be performed on segments fethed from the trae ahe. The onept of expliitly marking internal/external register values within a basi blok was rst desribed by Sprangle and Patt [25℄ and later adapted for use with the trae ahe by Vajapeyam and Mitra [27℄. Instrutions within a segment an be arranged in an order that permits quik issue. Beause the dependenies within a segment are expliitly marked, the ordering of instrutions arries no signiane. Instrutions within the ahe line an be arranged to mitigate the routing required to forward instrutions to funtional unit reservation stations. Friendly et al [7℄ examine a sheme where instrutions within a segment are ordered to redue the ommuniation delays assoiated with data forwarding aross many funtional units. 7 3.2 The Fill Unit The ll unit ollets instrutions as they are issued by the proessor and ombines them into trae segments. Coneptually, the instrutions (in units of bloks) are lathed by the ll unit in the order they were fethed. The ll unit merges the arriving bloks with bloks lathed in previous yles. The merge proess involves reating dependeny information and reordering instrutions. The proess ontinues until the segment beomes nalized, at whih point the segment is written into the trae ahe. A segment beomes nalized when 1. it ontains 16 instrutions, or 2. it ontains 3 onditional branhes, or 3. it ontains a single indiret jump, return, or trap instrution, or 4. merging the inoming blok would result in a segment larger than 16 instrutions. Rule 1 is implied by the size of the trae ahe line and rule 2 by the number of preditions supplied per yle by the preditor. Beause their targets vary, return instrutions and indiret jumps ause nalization (rule 3). Unonditional branhes and subroutine alls do not aet trae segment nalization. Beause bloks are being ombined in a greedy fashion, there are ases where the ll unit stores multiple opies of basi bloks in the trae ahe. Figure 3 shows a simple loop omposed of ve basi bloks. The ll unit an potentially reate ve dierent trae segments ontaining portions of this loop, all of whih an be simultaneously resident in the trae ahe. This blok redundany may degrade performane of the trae ahe mehanism by displaing useful lines with redundant information. The tradeo is between higher bandwidth from fething larger segments versus lost bandwidth due to inreased misses in the trae ahe. 8 The problem arises beause of bloks where several exeution paths merge, suh as blok B in gure 3. Rule 4 above treats basi bloks as atomi entities. A basi blok is not split aross two segments unless the blok is larger than 16 instrutions. In addition to exaerbating the blok redundany problem, splitting a basi blok reates an additional blok that may tax other strutures suh as the branh preditor. A study of eetive tehniques for splitting bloks was done by Patel et al [18℄. Three outomes are possible with the arrival of eah new blok of instrutions: (1) the arriving blok is merged with the unnalized segment and the new, larger segment is not nalized. (2) the entire arriving blok annot be merged with the awaiting segment. The awaiting segment is nalized and the arriving blok now oupies the ll unit. (3) the arriving blok is ompletely merged with the awaiting segment and the new, larger segment is nalized. 3.3 The Branh Preditor The branh preditor is a ritial omponent in a high bandwidth feth mehanism. To maintain a high rate of instrution supply, the preditor needs to make multiple aurate branh preditions per yle. In the ase of our trae ahe mehanism, three preditions per yle are required. Two level adaptive branh predition has been demonstrated to ahieve high predition auray over a wide set of appliations [29℄. In a two level sheme, the rst level of history reords the outomes of the most reently exeuted branhes. The seond level of history, stored in the pattern history table (PHT), reords the most likely outome when a partiular pattern in the rst level history is enountered. In typial shemes, the PHT onsists of saturating two-bit ounters. 9 To make three preditions per yle, we expand eah PHT entry from a single two-bit ounter into three two-bit ounters, with eah two-bit ounter providing a predition for a fethed branh. Evaluation of this PHT entry organization versus several others is provided in setion 5.5. The baseline onguration for the trae ahe preditor uses the gshare sheme outlined by MFarling [13℄. The global branh history is XORed with the urrent feth address, forming an index into the PHT. This hashing better utilizes the PHT and improves predition auray over other global history based shemes. The branh preditor also ontains a return address stak to predit the target addresses of return instrutions. 3.4 The Instrution Cahe A onventional instrution ahe supports the trae ahe by supplying instrutions when the trae ahe does not ontain the requested segment. Sine hitting in the trae ahe is the frequently ourring ase, the supporting iahe need not be enhaned for higher bandwidth. The iahe therefore supplies up to one feth blok per yle. The instrution ahe has two read ports to allow adjaent ahe lines to be retrieved eah yle. By fething two ahe lines and realigning instrutions, the feth mehanism overomes partial fethes due to ahe line boundaries. In the ase of a trae ahe miss and an instrution ahe hit, up to a single basi blok is supplied at the end of the yle. If both the trae ahe and instrution ahe miss, then a request for the missing instrution ahe line is made to the seond level ahe. The feth mehanism stalls until the missing line arrives. 10 4 Experimental Setup A pipeline simulator that allows the modeling of wrong path eets was used as the experimental model. The simulator was implemented using the SimpleSalar 2.0 tool suite and instrution set [1℄, whih is a superset of the MIPS-IV ISA. In the exeution model, all instrutions undergo four stages of proessing: feth, issue, shedule, exeute. Eah stage takes at least one yle. The baseline feth engine, apable of supplying up to 16 instrutions per yle, inludes a large 2K entry (approximately 128KB for instrution storage), 4-way set assoiative trae ahe and a 4KB, 4-way instrution ahe. Eah trae ahe line ontains up to 16 instrutions, ontaining at most three onditional branhes. To assist in the generation of target addresses for fethes from the iahe, a 1KB branh target buer (BTB) is inluded. A 1MB unied seond level ahe provides instrution and data with a lateny of 8 yles in the ase of rst level ahe misses. The L2 miss lateny to memory is 50 yles. The baseline branh preditor modeled is a 15-bit version of the gshare preditor desribed in setion 3.3 and provides up to three individual onditional branh preditions eah yle. The size of the pattern history table is xed at 32K entries onsisting of 3 two-bit ounters (24KB of storage). An ideal return address stak is modeled. The exeution engine is omposed of 16 funtional units, eah with a 32-entry reservation station. The funtional units are uniform and apable of all operations. A 64KB L1 data ahe was used. The model uses hekpoint repair [10℄ to reover from branh mispreditions and exeptions. The exeution engine is apable of reating up to three hekpoints eah yle, one for eah feth blok supplied. The memory sheduler waits for addresses to be generated before sheduling memory operations. No memory operation an bypass a store with an unknown address. Sine this study is onerned with the feth engine, many omponents of the exeution engine are modeled at very aggressive design points thereby 11 reduing the potential for the bakend to indue bottleneks whih may obsure bottleneks in the frontend. All experiments were performed on the SPECint95 benhmarks and on a benhmark suite onsisting of several ommon C appliations [26℄. Table 2 lists the number of instrutions simulated and the input set, if the input was derived from a standard input set1 . All simulations were run until ompletion (exept li and ijpeg, 500M instrutions). Benhmark ompress g go ijpeg li m88ksim perl vortex hess ghostsript (gs) pgp plot python sim-outorder (ss) tex Inst Count 95M 157M 151M 500M 500M 493M 41M 214M 119M 180M 322M 284M 220M 100M 164M Input Set test.in jump.i 2stone9.in penguin.ppm train.lsp dhry.test srabbl.pl vortex.in Table 2: Benhmarks and datasets used. All benhmarks, exept li and ijpeg, were simulated to ompletion. 1 Vortex and go were simulated with abbreviated versions of the SPECint95 test input set. Compress was simulated on a modied version of the test input with an initial list of 30000 elements. 12 5 Critial Issues In this setion, we use the experimental baseline to evaluate several design enhanements and ongurations of the trae ahe. 5.1 Partial Mathing Eah feth yle, the preditions made by the branh preditor are used to selet whih bloks within the aessed segment will be issued to the exeution engine. This proess is referred to as partial mathing [22, 6℄. Alternatively, the trae ahe an be designed to signal a hit only if all the bloks within the seleted segment math; otherwise a miss is signaled. Figure 4 shows the performane dierene between a trae ahe that implements partial mathing and one that does not. The average performane improvement for these benhmarks is 14%. With partial mathing, the number of requests that miss both the trae ahe and small iahe drops signiantly beause of the more exible poliy for mathing valid lines. If the iahe were made larger, the relative benet with partial mathing would be expeted to diminish. 5.2 Path Assoiativity Path assoiativity relaxes the onstraint that dierent segments starting from the same feth blok annot be stored in the trae ahe at the same time. Path assoiativity allows segments ABC and ABD (see gure 2) to reside onurrently in the ahe whereas a nonpath assoiative trae ahe allows only one segment starting at A to be resident in the trae ahe at any instane in time. 13 A B Possible segments ABC DEB CDE BCD EBC C D E Figure 3: If the ll unit is able to reate three-blok segments for this path through a loop, then all ve possible segments will be reated and stored in the trae ahe. 7.5 1% 7.0 Instructions Per Cycle Baseline With Partial Matching 14% 6.5 6.0 10% 5.5 16% 5.0 4.5 38% 23% 15% 4.0 3.5 9% 11% 7% 24% 11% 20% 2% 6% 3.0 2.5 2.0 1.5 1.0 0.5 0.0 comp gcc go ijpeg li m88k perl vor ch gs pgp plot py ss tex Benchmarks Figure 4: Partial Mathing. The performane dierene between a trae ahe that partially mathes segments and one that does not is around 14% 14 The data plotted in Figure 5 indiate that path assoiativity has little eet on the performane of the baseline trae ahe. Adding path assoiativity inreases the number of segments that map into a partiular set; thus additional misses may our due to inreased set onits. Therefore the experiment was onduted on a path-assoiative trae ahe with set-assoiativity of 4 and of 8. In both ases, the performane gain from path-assoiativity is slight. Path assoiativity requires extra gates in the trae seletion logi to selet the longest mathing trae segment. It is likely that the line seletion time for the path assoiative ahe will be longer, possibly inreasing ahe yle time. 5.3 Inative Issue Partial mathing inreases the number of instrutions that are issued eah yle but it does not take advantage of the entire segment of instrutions fethed from the trae ahe. Bloks within the trae ahe segment whih do not math the predited path are disarded. As long as the predition is orret, this does not impat the eetive feth rate of the proessor. If the predition is inorret however, an opportunity to issue a greater number of orret instrutions has been missed. With inative issue [6℄ all bloks within a trae ahe line are issued into the proessor ore whether or not they math the predited path. The bloks that do not math the predition are said to be issued inatively. Although these inative instrutions are renamed and reeive physial registers for their destination values, the hanges they make to the register alias table [10℄ are not onsidered valid for subsequent issue yles. Thus instrutions along the predited path view the speulative state of the proessor exatly as if the inative bloks had not been issued. When the branh that ended the last ative blok resolves, if the predition was orret, the inative instrutions are disarded. If the predition was 15 inorret, the proessor has already fethed, issued and possibly exeuted some instrutions along the orret path. Inative issue redues the impat of branh mispreditions. It allows a portion of the branh resolution lateny to be hidden by making some orret path instrutions (whih follow a mispredited branh) available for proessing earlier. When the mispredited branh resolves, the reovery state of the proessor is further along the orret path than it would have been if the inative instrutions had not been issued. To implement inative issue, modiations must be made to the renaming and reovery strutures. Our exeution model uses a hekpointed register alias table to maintain both the arhitetural and speulative state of the proessor. The hanges needed to implement inative issue inlude adding an ative bit to eah hekpoint in the table. As the hekpoints are reated, this bit is set if the instrutions in the orresponding blok are issued atively and the bit is leared if the instrutions are issued inatively. The most reent ative hekpoint is used as the speulative state of the mahine when new instrutions are issued. When a branh resolves and is determined to be mispredited, the inative hekpoints immediately following the resolved hekpoint beome ative and all subsequent hekpoints, orresponding to instrutions along the inorret path, are ushed from the pipelines and the instrution window. The feth proeeds from the target of the newly ativated hekpoint. If the branh was orretly predited, the inative hekpoints are simply invalidated one its outome is known. Figure 6 illustrates the bloks issued from a fethed ahe line by the three dierent poliies. Figure 7 presents the performane benets of inative issue. Over the baseline onguration, inative issue oers a 17% performane boost. However, omparing gure 7 with gure 4, one an notie a slight 3% inrease from inative issue over partial mathing. In16 7.5 Baseline With Path Assoc/4-way With Path Assoc/8-way 7.0 Instructions Per Cycle 6.5 6.0 5.5 5.0 4.5 4.0 3.5 3.0 2.5 2.0 1.5 1.0 0.5 0.0 comp gcc go ijpeg li m88k perl vor ch gs pgp plot py Benchmarks Figure 5: Performane of Path Assoiativity. A B C Predicted path: ABC Fetched segment: ABD D No partial matching: miss Partial matching: AB Inactive Issue: AB (active) D(inactive) Figure 6: An example of the dierent issue poliies. 17 ss tex ative issue is an extension of partial mathing, and designed as a hedge against branh mispreditions. The value of inative issue is greater for less aurate branh predition. For example, inative issue is more helpful on programs whih are harder on the branh preditor as shown in the boost gained on g, go, and pgp. 5.4 5.4.1 Fill Unit Issues Blok Colletion The ll unit ollets bloks of instrutions as they are proessed and produes segments to store into the trae ahe. The ll unit an ollet these bloks at any point in the proessor pipeline. In this experiment, we determine whether the bloks should be olleted as instrutions are issued into the instrution window or when they are retired. Figure 8 shows that the dierenes in performane between the two shemes are slight. For many benhmarks, the speulative segment reation is beneial, while for some (g, go), it degrades performane. The ll unit olleting instrutions at issue time provides inreased traÆ to the trae ahe beause segments olleted while exeuting on a wrong exeution path are also written into the trae ahe. In some ases this generates useful segments, but in other ases it evits useful segments from the trae ahe. The baseline onguration ollets bloks at retire time. A ll unit that ollets at retire time only writes segments from the orret exeution path to the trae ahe. However, it suers from an inreased lateny between the initial feth of a blok and its olletion into a segment and subsequent storage into the trae ahe. This an potentially impat the rst few iterations of a tight loop, whih will be fethed from the instrution ahe until the rst iteration retires. In the next setion we show that this is not a signiant inuene on performane. 18 7.5 4% Instructions Per Cycle 7.0 Baseline With Inactive Issue 14% 6.5 6.0 10% 5.5 17% 5.0 4.5 4.0 42% 23% 8% 14% 11% 20% 27% 28% 3.5 10% 8% 13% 3.0 2.5 2.0 1.5 1.0 0.5 0.0 comp gcc go ijpeg li m88k perl vor ch gs pgp plot py ss tex Benchmarks Figure 7: Performane of Inative Issue. 7.5 Retire Time (Baseline) Issue Time 7.0 Instructions Per Cycle 6.5 6.0 5.5 5.0 4.5 4.0 3.5 3.0 2.5 2.0 1.5 1.0 0.5 0.0 comp gcc go ijpeg li m88k perl vor ch gs pgp plot py ss tex Benchmarks Figure 8: Issue vs. Retire. This plot shows that olleting instrutions at issue time is not very dierent from olleting instrutions at retire time. 19 5.4.2 Fill Unit Lateny As bloks of instrutions are lathed into the ll unit, some proessing is required before the omposite segment an be written to the trae ahe. The dependenies within the arriving blok must be reorded to reet the values produed by the awaiting bloks. Possibly, the instrutions within the segment may need to be reordered so that they an be quikly routed to funtional unit node tables when the segment is refethed. To perform these operations, the ll unit may require several yles. The purpose of this experiment is to determine the sensitivity of the trae ahe feth mehanism to ll unit lateny. Figure 9 shows the results of varying the number of yles from the arrival of the terminal blok to the point it is written into the trae ahe. The results show that a ll unit with a 10-yle lateny has a negligible loss in performane over a single-yle ll unit. There are two reasons for this ounter-intuitive behavior. First, the next feth of that segment usually does not our within the next 10-yles. Seond, if it does, it an be supplied by the iahe (whih is likely to ontain the feth blok as it reently satised the original request for the blok). For some benhmarks (e.g., gs), a longer ll unit lateny results in a slightly higher performane. A longer lateny sometimes delays the replaement of a useful trae ahe segment with a less useful one. 5.5 Branh Preditor Issues The multiple branh preditor is a ruial element of the trae ahe feth mehanism. If the trae ahe is not supported by a preditor apable of making aurate preditions, gains in eetive feth rate will be oset by losses from more disarded fethes, likely resulting in a loss in performane. In this setion we examine several organizations for the pattern history table (PHT). 20 A pattern history table entry ontains the most likely branh outome when a partiular pattern is enountered in the rst level history. For our baseline multiple branh preditor, we are using a pattern history table entry omposed of 3 two-bit saturating ounters. This entry format was derived as a ost-eetive version of the sheme used in [19, 6℄ where a PHT entry is omposed of 7 two-bit ounters. Figure 10 shows how the seven ounters are used to supply three preditions per yle. The rst two-bit ounter supplies the predition for the rst branh and is used to selet whih of two two-bit ounters supplies the predition for the seond branh. Both preditions are used to selet one of four two-bit ounters to supply the predition for the third branh. All three preditions are made with a single aess to the PHT. The ongurations evaluated in this experiment inlude the 7 ounter sheme, the 3 ounter sheme, and a sheme presented by Menezes et al [16℄ where eah PHT entry ontains the most likely path through a program subgraph ontaining 3 branhes. In this sheme, the PHT entries are 4 bits wide: 3 bits to enode the likely path (8 paths are possible) and a fourth bit whih reords the likeliness of this path. Figure 11 shows the performane omparison of the various PHT shemes. The size of the preditor is kept roughly onstant. The 7 ounter sheme uses 14 bits of history and requires 32KB of storage, the 3 ounter sheme uses 15 bits of history and requires 24KB of storage, the likely path sheme uses 16 bits of history and requires 32KB of storage. In general, the 3 ounter sheme slightly outperforms the other two shemes. It outperforms the 7 ounter sheme beause of better utilization: many ounters within an entry in the 7 ounter sheme are likely to go unused beause of the prevalene of biased branhes. The 3 ounter sheme outperforms the likely path sheme beause of more hysteresis. 21 7.5 1 cycle (Baseline) 2 cycles 3 cycles 10 cycles 7.0 Instructions Per Cycle 6.5 6.0 5.5 5.0 4.5 4.0 3.5 3.0 2.5 2.0 1.5 1.0 0.5 0.0 comp gcc go ijpeg li m88k perl vor ch gs pgp plot py ss tex Benchmarks Figure 9: Fill Unit Lateny is not a major inuene on performane. from pattern history table B0 B1 2-bit counter B2 Figure 10: Seven two-bit ounters are used to provide three preditions. B0 is the predition for the rst branh, B1 is the predition of the seond and B2 is for the third. 22 5.6 Comparison to an Aggressive Single Blok Mehanism The experiments thus far have evaluated various design options for the trae ahe feth mehanism. We provide a omparison of the enhanes trae ahe to the urrent dominant tehniques for feth engine design. Rotenberg et al. [22℄ presented a thorough omparison of the trae ahe's performane on the SPECint92 and IBS benhmarks to a few of the hardware-based multiple blok feth tehniques mentioned in setion 2. Here we present a omparison of our baseline onguration with a aggressive single blok feth mehanism and an iahe apable of fething up to the rst taken branh. The omponents of the single feth blok mehanism are approximately the same size and aess omplexity as the trae ahe ounterparts in the baseline onguration. The single blok mehanism onsists of a single yle, 128KB, 4-way set assoiative instrution ahe apable of fething two onseutive ahe lines and supplying up to 16 instrutions or until the rst ontrol ow instrution eah yle. The next feth address is generated with an 8KB branh target buer. The single branh preditor is a hybrid preditor, onsisting of two omponents: a 15-bit PAs preditor and a 15-bit gshare preditor. The seletion between the omponents is done by a 15-bit gshare-style seletor. Combining a per-address preditor with a gshare preditor has been shown to be an eetive way of boosting preditor auray [13, 2℄. This feth mehanism is similar to the one used on the Alpha 21264 [12℄. We also ompare the trae ahe to a mehanism where the instrution ahe is apable of supplying a sequential stream of instrutions beyond onditional branhes whih are predited to be not taken. This onguration requires the use of a multiple branh preditor (up to three branhes an be fethed) and therefore does not benet from the high predition auray attainable with the hybrid preditor. We all this sheme the sequential blok iahe. Figure 12 displays the experimental results. The trae ahe outperforms both shemes 23 for most benhmarks. Listed on the graph are the perentage inreases over the single blok iahe sheme. On average, the trae ahe delivers a performane improvement of 28% over the single blok iahe and a 15% improvement over the sequential blok iahe. The large boost in performane of the trae ahe mehanism omes from a signiant inrease in average eetive feth rate | the average number of instrutions delivered per on-path feth. Our experimental results show (not presented here) that this rate doubles with a trae ahe over the single blok iahe. While the inreased feth rate improves the performane of the trae ahe mehanism, the losses due to branh mispreditions and, to a lesser extent, ahe misses degrade performane. Our experimental data indiates that the onditional branh mispredition rate drops from 6.6% with the trae ahe to 5.5% with the single blok iahe. 6 Conlusions In this paper we have examined some of the ritial design parameters of the trae ahe feth mehanism. The trae ahe supplies multiple feth bloks of instrutions eah yle by storing logially ontiguous instrution sequenes in physially ontiguous storage. We have demonstrated that the ability to partially math a trae segment provides an average 14% performane boost over a onguration whih requires a omplete math. Inative issue is a hedge against branh mispreditions and yields 17% improvement over the baseline and is partiularly helpful on benhmarks whih suer from poor predition auray. We have also demonstrated that trae reation an be done speulatively with no degradation in performane. Creating traes speulatively at issue time may allow for for simpler implementations. In addition, the lateny in reating traes has negligible eets on performane. 24 7.5 3 Counters (Baseline) 7 Counters Likely Path 7.0 Instructions Per Cycle 6.5 6.0 5.5 5.0 4.5 4.0 3.5 3.0 2.5 2.0 1.5 1.0 0.5 0.0 comp gcc go ijpeg li m88k perl vor ch gs pgp plot py ss tex Benchmarks Figure 11: Evaluation of various pattern history table entry ongurations. 7.5 Instructions Per Cycle Single Block ICache Sequential Block ICache Trace Cache w/Inactive Issue 25% 7.0 39% 6.5 6.0 46% 5.5 49% 5.0 39% 27% 33% 16% 4.5 4.0 25% 39% 23% 24% 3.5 21% 9% 10% 3.0 2.5 2.0 1.5 1.0 0.5 0.0 comp gcc go ijpeg li m88k perl vor ch gs pgp plot py ss tex Benchmarks Figure 12: The performane of the baseline trae ahe feth mehanism against two iahe feth shemes. 25 When ompared with an aggressive single blok feth iahe, the trae ahe attains an average performane inrease of 28% and attains a 15% improvement over a sequential blok iahe. Muh of this performane inrease omes from the inrease in eetive feth rate, whih is twie that of the single blok engine. Beause it is a low-omplexity tehnique for delivering high instrution bandwidth, the trae ahe will be an important omponent of future miroproessors [21℄. There remain many important issues whih need resolving. We are urrently fousing on quantifying and reduing instrution redundany in the trae ahe and on developing tehniques for reating longer trae segments. 7 Aknowledgements We would like to thank several other members of the HPS researh group, Rob Chappell, Marius Evers, Peter Kim, Paul Raunas, Jared Stark, and several of its alumni, in partiular Mike Shebanow. We would like to thank our orporate sponsors | Intel, HAL, and NCR | for funding our work. Referenes [1℄ D. Burger, T. Austin, and S. Bennett, \Evaluating future miroproessors: The simplesalar tool set," Tehnial Report 1308, University of Wisonsin - Madison Tehnial Report, July 1996. [2℄ P.-Y. Chang, E. Hao, T.-Y. Yeh, and Y. N. Patt, \Branh lassiation: A new mehanism for improving branh preditor performane," in Proeedings of the 27th Annual ACM/IEEE , pp. 22{31, 1994. International Symposium on Miroarhiteture 26 [3℄ R. P. Colwell, R. P. Nix, J. J. O'Donnell, D. B. Papworth, and P. K. Rodman, \A VLIW arhiteture for a trae sheduling ompiler," , vol. 37, no. IEEE Transations on Computers 8, pp. 967{979, August 1988. [4℄ T. M. Conte, K. N. Menezes, P. M. Mills, and B. A. Patel, \Optimization of instrution feth mehanisms for high issue rates," in Proeedings of the 22nd Annual International Symposium , 1995. on Computer Arhiteture [5℄ J. A. Fisher, \Trae sheduling: A tehnique for global miroode ompation," IEEE Trans, vol. C-30, no. 7, pp. 478{490, July 1981. ations on Computers [6℄ D. H. Friendly, S. J. Patel, and Y. N. Patt, \Alternative feth and issue tehniques from the trae ahe feth mehanism," in Proeedings of the 30th Annual ACM/IEEE International , 1997. Symposium on Miroarhiteture [7℄ D. H. Friendly, S. J. Patel, and Y. N. Patt, \Putting the ll unit to work: Dynami optimizations for trae ahe miroproessors," in Proeedings of the 30th Annual ACM/IEEE , 1997. International Symposium on Miroarhiteture [8℄ E. Hao, P.-Y. Chang, M. Evers, and Y. N. Patt, \Inreasing the instrution feth rate via blok-strutured instrution set arhitetures," in Proeedings of the 29th Annual ACM/IEEE , 1996. International Symposium on Miroarhiteture [9℄ W. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R. G. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery, \The superblok: An eetive tehnique for VLIW and supersalar ompilation," Journal of , vol. 7, no. 9-50, , 1993. Superomputing [10℄ W. W. Hwu and Y. N. Patt, \Chekpoint repair for out-of-order exeution mahines," in , pp. 18{ Proeedings of the 14th Annual International Symposium on Computer Arhiteture 26, 1987. 27 [11℄ J. D. Johnson, \Expansion ahes for supersalar miroproessors," Tehnial Report CSLTR-94-630, Stanford University, Palo Alto CA, June 1994. [12℄ J. Keller, , Digital The 21264: A Supersalar Alpha Proessor with Out-of-Order Exeution Equipment Corporation, Hudson, MA, Otober 1996. [13℄ S. MFarling, \Combining branh preditors," Tehnial Report TN-36, Digital Western Researh Laboratory, June 1993. [14℄ S. Melvin and Y. Patt, \Enhaning instrution sheduling with a blok-strutured ISA," In- , 1994. ternational Journal on Parallel Proessing [15℄ S. W. Melvin and Y. N. Patt, \Performane benets of large exeution atomi units in dynamially sheduled mahines," in Proeedings , pp. 427{432, 1989. of Superomputing '89 [16℄ K. N. Menezes, S. W. Sathaye, and T. M. Conte, \Path predition for high issue-rate proessors," in Proeedings of the 1997 ACM/IEEE Conferene on Parallel Arhitetures and , 1997. Compilation Tehniques [17℄ R. Nair and M. E. Hopkins, \Exploiting instrution level parallelism in proessors by ahing sheduled groups," in Proeedings of the 24th Annual International Symposium on Computer , pp. 13{25, 1997. Arhiteture [18℄ S. J. Patel, M. Evers, and Y. N. Patt, \Improving trae ahe eetiveness with branh promotion and trae paking," in Proeedings of the 25th Annual International Symposium on , 1998. Computer Arhiteture [19℄ S. J. Patel, D. H. Friendly, and Y. N. Patt, \Critial issues regarding the trae ahe feth mehanism," Tehnial Report CSE-TR-335-97, University of Mihigan Tehnial Report, May 1997. [20℄ A. Peleg and U. Weiser. Dynami Flow Instrution Cahe Memory Organized Around Trae Segments Independant of Virtual Address Line 28 . U.S. Patent Number 5,381,533, 1994. [21℄ F. Pollak, Desription of the intel miroproessor roadmap, Press brieng, Otober 1998. [22℄ E. Rotenberg, S. Bennett, and J. E. Smith, \Trae ahe: a low lateny approah to high bandwidth instrution fething," in Proeedings of the 29th Annual ACM/IEEE International , 1996. Symposium on Miroarhiteture [23℄ E. Rotenberg, Q. Jaobsen, Y. Sazeides, and J. E. Smith, \Trae proessors," in Proeedings , 1997. of the 30th Annual ACM/IEEE International Symposium on Miroarhiteture [24℄ A. Sezne, S. Jourdan, P. Sainrat, and P. Mihaud, \Multiple-blok ahead branh preditors," in Proeedings of the 7th International Conferene on Arhitetural Support for Programming , 1996. Languages and Operating Systems [25℄ E. Sprangle and Y. Patt, \Failitating supersalar proessing via a ombined stati/dynami register renaming sheme," in Proeedings of the 27th Annual ACM/IEEE International Sym, pp. 143{147, 1994. posium on Miroarhiteture [26℄ J. Stark, P. Raunas, and Y. N. Patt, \Reduing the performane impat of instrution ahe misses by writing instrutions into the reservation stations out-of-order," in Proeedings of the , pp. 34 { 43, 1997. 30th Annual ACM/IEEE International Symposium on Miroarhiteture [27℄ S. Vajapeyam and T. Mitra, \Improving supersalar instrution dispath and issue by exploiting dynami ode sequenes," in Proeedings of the 24th Annual International Symposium on , pp. 1{12, 1997. Computer Arhiteture [28℄ T.-Y. Yeh, D. Marr, and Y. N. Patt, \Inreasing the instrution feth rate via multiple branh predition and branh address ahe," in Proeedings of the International Conferene on Su- , pp. 67{76, 1993. peromputing [29℄ T.-Y. Yeh and Y. N. Patt, \Two-level adaptive branh predition," in Proeedings of the 24th , pp. 51{61, 1991. Annual ACM/IEEE International Symposium on Miroarhiteture 29 Sanjay J. Patel is urrently a PhD andidate in Computer Siene and Engineering at the University of Mihigan, Ann Arbor. He is investigating tehniques for high bandwidth instrution supply for proessors in the era of 100M and 1B transistors. His researh interest inlude trae ahes, branh predition, memory bandwidth issues, and the performane simulation of miroarhitetures. He reeived a MS from the University of Mihigan and has worked for Digital Equipment Corporation. Dan Friendly is a PhD andidate in omputer siene and engineering at the University of Mihigan. His researh interests inlude high bandwidth feth mehanisms, supersalar arhitetures, and speulative exeution tehniques. He reeived a BA in Amerian soial issues from Maalester College and an MS in omputer siene from the University of Mihigan. Yale N. Patt is Professor of Eletrial Engineering and Computer Siene at the University of Mihigan, where he teahes undergraduate and graduate ourses in omputer arhiteture, and direts the PhD researh of nine graduate students in omputer arhiteture, high-performane proessor design, and omputer systems implementation. Patt earned his BS from Northeastern University and his MS and PhD from Stanford University, all in eletrial engineering. He reeived the 1995 IEEE Emanuel R. Piore Award and the 1996 ACM/IEEE Ekert-Mauhly Award. He is a Fellow of the IEEE. 30