Evaluation of Design Options for the Trace Cache Fetch Mechanism
Sanjay Jeram Patel, Daniel Holmes Friendly, and
Yale N. Patt Fellow, IEEE
Advanced Computer Architecture Laboratory
Department of Electrical Engineering and Computer Science
The University of Michigan
Ann Arbor, MI 48109-2122
{sanjayp, ites, patt}@eecs.umich.edu
Tel: (313) 936-0404
Fax: (313) 763-4617
ABSTRACT
In this paper, we examine some critical design features of a trace cache fetch engine for
a 16-wide issue processor and evaluate their effects on performance. We evaluate path
associativity, partial matching, and inactive issue, all of which are straightforward extensions
to the trace cache. We examine features such as the fill unit and branch predictor design.
In our final analysis, we show that the trace cache mechanism attains a 28% performance
improvement over an aggressive single block fetch mechanism and a 15% improvement over
a sequential multi-block mechanism.
Keywords: high bandwidth fetch mechanisms, trace cache, instruction cache, wide issue
machines, speculative execution
1 Introduction
A microprocessor is composed of three fundamental components: a means to supply
instructions, a means to supply the data needed by these instructions, and a means to process
these instructions. For high performance, the instructions and data must be delivered at a high
rate to a processing core capable of effectively consuming them.
Delivering instructions at a high rate is not a straightforward task. Several factors degrade
fetch engine performance. First, cache misses cause the fetch engine to stall and supply
nothing until the miss is resolved. Second, branch mispredictions cause the fetch engine to
supply instructions which will later be discarded. Third, changes in control flow inhibit the
fetch engine from producing a full width of instructions. Due to the physical structure of
instruction caches, it is difficult to fetch both a taken branch and its target in the same cycle.
Table 1 shows the average number of instructions between all branches and between taken
branches (i.e., taken conditional branches, jumps, subroutine calls, returns and traps) for the
SPECint95 benchmarks. Evident from this table is that if partial fetches due to branches
are not dealt with, then fetch bandwidth will be limited to an average of about 9 instructions
per cycle.
                        cmprs  gcc    go     ijpeg  li     m88k   perl   vrtx   avg
insts per branch        5.57   5.05   6.63   10.40  4.27   4.35   5.21   6.41   5.99
insts per taken branch  7.78   8.32   9.85   13.51  6.55   5.28   7.94   9.85   8.64
Table 1: Instruction run lengths when terminating fetches on branches.
A straightforward technique for dealing with partial fetches is to construct an instruction
cache where multiple independent fetches can be performed each cycle. To do this, multiple
addresses are used to index into the instruction cache via multiple read ports. After the
cache lines are read, the requested instructions must be properly aligned and merged before
they are supplied for execution. Such a solution adds considerable logic complexity in an
already critical execution path of the processor. Either cycle time will be affected or extra
pipeline stages will be required.
The recently proposed trace cache [20, 11, 22, 19] overcomes this bandwidth hurdle without
requiring excessive logic complexity in the instruction delivery path. Like an instruction
cache, the trace cache is accessed using the Program Counter. Unlike an instruction cache,
a trace cache line contains instructions as they appear in execution order, as opposed to
the static order determined by the compiler. Two adjacent instructions in a trace cache
line need not be adjacent in the executable image. A trace cache line stores a segment of
the dynamic instruction stream. By placing logically contiguous instructions in physically
contiguous storage, the trace cache is able to supply multiple fetch blocks each cycle. A fetch
block roughly corresponds to a compiler basic block: it is a dynamic sequence of instructions
starting at the current fetch address and ending at the next control instruction.
In addition to increasing the delivered instruction bandwidth by increasing the effective
fetch rate (the average number of correct instructions issued for each fetch that returns
on-path instructions), caching instructions as trace segments makes reprocessing them easier.
Since instructions are written into the trace cache after being decoded, decode bits are stored
along with each instruction, minimizing the processing required when it is fetched again.
In this paper, we evaluate several extensions of the trace cache such as partial matching,
path associativity, and inactive issue. We also examine some of the key components of the
trace cache fetch mechanism such as the fill unit and multiple branch predictor.
2 Related Work
Cache organizations for simultaneously fetching multiple blocks have been studied by
Yeh et al. [28], Conte et al. [4] and Seznec et al. [24]. By multiporting the instruction cache
and/or the branch target buffer (BTB) and generating multiple fetch addresses and branch
predictions per cycle, these schemes are able to overcome the single fetch block bottleneck.
There are several compile-time approaches to solving the fetch bottleneck problem. With
Block-Structured ISAs [14, 8], superblock scheduling [9] and trace scheduling [5, 3], the static
form of the program is organized by the compiler into longer sequential units composed of
multiple basic blocks.
Melvin and Patt [15] discuss the performance implications of the fill unit; in that work, the
idea of dynamically combining fetch blocks into larger "execution atomic units" (EAUs) to
further increase the fetch bandwidth is first proposed. In 1994, Peleg and Weiser filed a patent on
the trace cache concept [20]. A similar concept was proposed by Johnson [11]. The concept
was further investigated by Rotenberg et al. [22]. They presented a thorough comparison
between the trace cache scheme and several hardware-based high-bandwidth fetch schemes and
showed the advantage of using a trace cache, both in performance and latency. Extensions
and analysis of the trace cache mechanism were proposed by Patel et al. [19, 6, 18] and trace
cache implications on processor design were presented in [23]. A similar approach to caching
dynamic instruction groups was presented in the DIF cache by Nair and Hopkins [17].
3 The Trace Cache Fetch Mechanism
We divide the trace cache fetch mechanism into four major components: a trace cache, a
fill unit, a multiple branch predictor, and a conventional instruction cache. The trace cache
is the main source of instruction supply and is filled with trace segments by the fill unit.
The speculative sequencing of segments is performed by the multiple branch predictor. The
instruction cache plays an important but supporting role, handling cases when the required
instructions are not found in the trace cache. A high-level diagram of the mechanism is
shown in figure 1.
In this section, we provide the details of a fetch engine for a 16-wide issue dynamically
scheduled machine. To meet the instruction bandwidth demands of this machine, 16
instructions and three conditional branches can be fetched per cycle.
3.1 The Trace Cache
The trace cache stores segments of the dynamic instruction stream, exploiting the fact
that many branches tend to favor one outcome. If block A is followed by block B which in
turn is followed by block C at a particular point in the execution of a program, there is a
strong likelihood that they will be executed in that order again. After the first time they
are executed in this order, they are stored in the trace cache as a single entry. Subsequent
fetches of block A from the trace cache provide blocks B and C as well.
Figure 2 shows an example of a trace cache fetch cycle. A request is made with the
address of block A. The trace cache responds with a hit and drives out the selected segment
composed of the blocks A, B, and C. The prediction structures are accessed concurrently
with the trace cache. At the end of the cycle, the segment is matched with the prediction.
Suppose the predictor selects the path ABD. Then only blocks A and B are supplied. Block
D is requested in the following cycle. The concept of supplying only the matching portion
of a trace cache line is called partial matching and its implications will be examined in
section 5.1.
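
The matching step in this example can be sketched in a few lines of Python. This is only an illustration under assumptions: a segment is modeled as a list of blocks, each recording the direction its ending branch took when the segment was built, and the names `Block` and `partial_match` as well as the branch directions are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    name: str
    # Direction taken by the block's ending conditional branch when the
    # segment was built (True = taken); None if there is no such branch.
    branch_taken: Optional[bool]


def partial_match(segment: List[Block], predictions: List[bool]) -> List[Block]:
    """Return the leading blocks of `segment` that agree with the predictions."""
    supplied = []
    remaining = iter(predictions)
    for block in segment:
        supplied.append(block)
        if block.branch_taken is None:
            continue  # nothing to compare for this block
        predicted = next(remaining, None)
        if predicted is None or predicted != block.branch_taken:
            break  # later blocks lie off the predicted path
    return supplied


# Cached segment ABC; the predictor follows A's branch but disagrees at B.
segment_abc = [Block("A", True), Block("B", False), Block("C", None)]
supplied = partial_match(segment_abc, [True, True, False])
print([b.name for b in supplied])  # ['A', 'B']
```

A design without partial matching would instead report a miss whenever the returned list is shorter than the segment.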
The trace cache can store segments containing up to 16 instructions, 3 of which may
be conditional branches. The line is accessed by the address of the first instruction in the
segment. The organization of the trace cache is similar to that of a conventional instruction
cache, as the lines may be arranged in an associative manner. A hit is determined by a tag
match.
Figure 1: The trace cache fetch mechanism.
Figure 2: The trace cache and branch predictor are indexed with the address of block A.
The inset figure shows the control flow from block A. The predictor selects path ABD. The
trace cache only contains ABC. AB is supplied.
Each line of the trace cache contains:
- 16 slots for instructions. Instructions are stored in decoded form and occupy approximately
  five bytes for a typical ISA. Up to three branches can be stored per line. Each
  instruction is marked with a two-bit tag indicating to which block it belongs.
- Four target addresses. With three basic blocks per segment and the ability to fetch
  partial segments, there are four possible targets to a segment. The four addresses are
  explicitly stored, allowing immediate generation of the next fetch address, even for cases
  where only a partial segment matches.
- Path information. This field encodes the number and directions of branches in the
  segment and includes bits to identify whether a segment ends in a branch and whether
  that branch is a return from subroutine instruction. In the case of a return instruction,
  the return address stack provides the next fetch address.
The total size of a line is around 97 bytes for a typical architecture: 5x16 bytes of
instructions, 4x4 bytes of target addresses, and 1 byte of path information.
Instruction dependencies within a trace segment are predetermined before the segment
is stored in the trace cache. This additional information allows for minimal decoding when
the segment is fetched from the trace cache. Source operands produced by an instruction
outside the segment are explicitly marked as requiring an external value. Instructions which
produce a live-out value are marked as requiring a physical register. It is important to note
that a complex dependency analysis across 16 instructions does not need to be performed on
segments fetched from the trace cache. The concept of explicitly marking internal/external
register values within a basic block was first described by Sprangle and Patt [25] and later
adapted for use with the trace cache by Vajapeyam and Mitra [27].
Instructions within a segment can be arranged in an order that permits quick issue.
Because the dependencies within a segment are explicitly marked, the ordering of instructions
carries no significance. Instructions within the cache line can be arranged to mitigate the
routing required to forward instructions to functional unit reservation stations. Friendly
et al. [7] examine a scheme where instructions within a segment are ordered to reduce the
communication delays associated with data forwarding across many functional units.
3.2 The Fill Unit
The fill unit collects instructions as they are issued by the processor and combines them
into trace segments. Conceptually, the instructions (in units of blocks) are latched by the
fill unit in the order they were fetched. The fill unit merges the arriving blocks with blocks
latched in previous cycles. The merge process involves creating dependency information and
reordering instructions. The process continues until the segment becomes finalized, at which
point the segment is written into the trace cache.
A segment becomes finalized when
1. it contains 16 instructions, or
2. it contains 3 conditional branches, or
3. it contains a single indirect jump, return, or trap instruction, or
4. merging the incoming block would result in a segment larger than 16 instructions.
Rule 1 is implied by the size of the trace cache line and rule 2 by the number of predictions
supplied per cycle by the predictor. Because their targets vary, return instructions and
indirect jumps cause finalization (rule 3). Unconditional branches and subroutine calls do
not affect trace segment finalization.
Because blocks are being combined in a greedy fashion, there are cases where the fill
unit stores multiple copies of basic blocks in the trace cache. Figure 3 shows a simple loop
composed of five basic blocks. The fill unit can potentially create five different trace segments
containing portions of this loop, all of which can be simultaneously resident in the trace cache.
This block redundancy may degrade performance of the trace cache mechanism by displacing
useful lines with redundant information. The tradeoff is between higher bandwidth from
fetching larger segments versus lost bandwidth due to increased misses in the trace cache.
The problem arises because of blocks where several execution paths merge, such as block B
in figure 3.
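
The redundancy can be reproduced with a small Python sketch. It assumes the control flow suggested by figure 3 (block A entering a loop B, C, D, E that branches back to B) and simplifies the fill unit to one that always packs exactly three blocks per segment; the function names are hypothetical.

```python
def loop_stream(iterations):
    """Dynamic block stream: A once, then the loop body B C D E repeated."""
    yield "A"
    for _ in range(iterations):
        yield from "BCDE"


def greedy_segments(block_stream, blocks_per_segment=3):
    """Chop a block stream into consecutive segments of a fixed block count."""
    segment = []
    for block in block_stream:
        segment.append(block)
        if len(segment) == blocks_per_segment:
            yield "".join(segment)
            segment = []


distinct = []
for seg in greedy_segments(loop_stream(iterations=10)):
    if seg not in distinct:
        distinct.append(seg)

print(distinct)  # ['ABC', 'DEB', 'CDE', 'BCD', 'EBC']
```

Because the loop body holds four blocks and each segment holds three, the segment boundary rotates through the loop, so every block ends up stored in several distinct segments.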
Rule 4 above treats basic blocks as atomic entities. A basic block is not split across two
segments unless the block is larger than 16 instructions. In addition to exacerbating the
block redundancy problem, splitting a basic block creates an additional block that may tax
other structures such as the branch predictor. A study of effective techniques for splitting
blocks was done by Patel et al. [18].
Three outcomes are possible with the arrival of each new block of instructions: (1) The
arriving block is merged with the unfinalized segment and the new, larger segment is not
finalized. (2) The entire arriving block cannot be merged with the awaiting segment. The
awaiting segment is finalized and the arriving block now occupies the fill unit. (3) The
arriving block is completely merged with the awaiting segment and the new, larger segment
is finalized.
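
The finalization rules and the three outcomes can be tied together in a short Python sketch. It is a simplified model under assumptions, not the paper's implementation: blocks are assumed to hold at most 16 instructions (larger blocks would be split beforehand), dependency marking and reordering are omitted, and `FetchBlock`, `FillUnit`, and `accept` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List

MAX_INSTRUCTIONS = 16   # rules 1 and 4
MAX_COND_BRANCHES = 3   # rule 2


@dataclass
class FetchBlock:
    num_instructions: int
    ends_in_cond_branch: bool = False
    ends_in_indirect: bool = False  # indirect jump, return, or trap (rule 3)


@dataclass
class FillUnit:
    pending: List[FetchBlock] = field(default_factory=list)
    # Stand-in for the trace cache: every finalized segment is appended here.
    written: List[List[FetchBlock]] = field(default_factory=list)

    def _size(self) -> int:
        return sum(b.num_instructions for b in self.pending)

    def _must_finalize(self) -> bool:
        branches = sum(b.ends_in_cond_branch for b in self.pending)
        return (self._size() >= MAX_INSTRUCTIONS
                or branches >= MAX_COND_BRANCHES
                or any(b.ends_in_indirect for b in self.pending))

    def _finalize(self) -> None:
        if self.pending:
            self.written.append(self.pending)
            self.pending = []

    def accept(self, block: FetchBlock) -> int:
        """Latch one arriving block and return the outcome number (1, 2 or 3)."""
        if self._size() + block.num_instructions > MAX_INSTRUCTIONS:
            # Outcome 2 (rule 4): the block does not fit, so the awaiting
            # segment is finalized and the block starts a new segment.
            self._finalize()
            self.pending = [block]
            outcome = 2
        else:
            self.pending.append(block)
            outcome = 3 if self._must_finalize() else 1
        if self._must_finalize():
            # Also covers a block that, after outcome 2, finalizes on its own.
            self._finalize()
        return outcome


fill_unit = FillUnit()
blocks = [FetchBlock(6, ends_in_cond_branch=True),
          FetchBlock(6, ends_in_cond_branch=True),
          FetchBlock(6, ends_in_cond_branch=True),
          FetchBlock(10)]
print([fill_unit.accept(b) for b in blocks])  # [1, 1, 2, 3]
```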
3.3 The Branch Predictor
The branch predictor is a critical component in a high bandwidth fetch mechanism. To
maintain a high rate of instruction supply, the predictor needs to make multiple accurate
branch predictions per cycle. In the case of our trace cache mechanism, three predictions
per cycle are required.
Two level adaptive branch prediction has been demonstrated to achieve high prediction
accuracy over a wide set of applications [29]. In a two level scheme, the first level of history
records the outcomes of the most recently executed branches. The second level of history,
stored in the pattern history table (PHT), records the most likely outcome when a particular
pattern in the first level history is encountered. In typical schemes, the PHT consists of
saturating two-bit counters.
To make three predictions per cycle, we expand each PHT entry from a single two-bit
counter into three two-bit counters, with each two-bit counter providing a prediction for a
fetched branch. Evaluation of this PHT entry organization versus several others is provided
in section 5.5.
The baseline configuration for the trace cache predictor uses the gshare scheme outlined
by McFarling [13]. The global branch history is XORed with the current fetch address,
forming an index into the PHT. This hashing better utilizes the PHT and improves prediction
accuracy over other global history based schemes.
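
A minimal Python sketch of such a predictor follows. The 15-bit history length is the baseline value from the experimental setup; everything else is an assumption made for illustration: the class name `MultiBranchGshare`, dropping the two low address bits before hashing, the initial counter value, and updating the history only when outcomes are known rather than speculatively.

```python
HISTORY_BITS = 15                 # baseline history length
PHT_ENTRIES = 2 ** HISTORY_BITS   # 32K entries
PREDICTIONS_PER_CYCLE = 3


class MultiBranchGshare:
    def __init__(self):
        self.history = 0
        # Three 2-bit saturating counters per entry, initialised to 1
        # (weakly not-taken); the initial value is an assumption.
        self.pht = [[1] * PREDICTIONS_PER_CYCLE for _ in range(PHT_ENTRIES)]

    def index(self, fetch_address):
        # gshare: XOR the global history with the fetch address.
        # Assumption: the two low bits of a 4-byte-aligned address are dropped.
        return (self.history ^ (fetch_address >> 2)) & (PHT_ENTRIES - 1)

    def predict(self, fetch_address):
        """Return three taken/not-taken predictions from a single PHT access."""
        return [counter >= 2 for counter in self.pht[self.index(fetch_address)]]

    def update(self, fetch_address, outcomes):
        """Train the counters with resolved outcomes, then shift the history."""
        entry = self.pht[self.index(fetch_address)]
        for slot, taken in enumerate(outcomes):
            if taken:
                entry[slot] = min(3, entry[slot] + 1)
            else:
                entry[slot] = max(0, entry[slot] - 1)
        for taken in outcomes:
            self.history = ((self.history << 1) | int(taken)) & (PHT_ENTRIES - 1)
```

Calling `predict` with a fetch address returns the three predictions used to select blocks from the fetched segment; `update` would be called with between one and three resolved outcomes for that fetch.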
The branch predictor also contains a return address stack to predict the target addresses
of return instructions.
3.4 The Instruction Cache
A conventional instruction cache supports the trace cache by supplying instructions when
the trace cache does not contain the requested segment. Since hitting in the trace cache is the
frequently occurring case, the supporting icache need not be enhanced for higher bandwidth.
The icache therefore supplies up to one fetch block per cycle.
The instruction cache has two read ports to allow adjacent cache lines to be retrieved
each cycle. By fetching two cache lines and realigning instructions, the fetch mechanism
overcomes partial fetches due to cache line boundaries.
In the case of a trace cache miss and an instruction cache hit, up to a single basic block
is supplied at the end of the cycle. If both the trace cache and instruction cache miss, then
a request for the missing instruction cache line is made to the second level cache. The fetch
mechanism stalls until the missing line arrives.
4 Experimental Setup
A pipeline simulator that allows the modeling of wrong path effects was used as the
experimental model. The simulator was implemented using the SimpleScalar 2.0 tool suite
and instruction set [1], which is a superset of the MIPS-IV ISA. In the execution model, all
instructions undergo four stages of processing: fetch, issue, schedule, execute. Each stage
takes at least one cycle.
The baseline fetch engine, capable of supplying up to 16 instructions per cycle, includes
a large 2K entry (approximately 128KB for instruction storage), 4-way set associative trace
cache and a 4KB, 4-way instruction cache. Each trace cache line holds up to 16 instructions,
of which at most three are conditional branches. To assist in the generation of target
addresses for fetches from the icache, a 1KB branch target buffer (BTB) is included. A 1MB
unified second level cache provides instructions and data with a latency of 8 cycles in the
case of first level cache misses. The L2 miss latency to memory is 50 cycles. The baseline
branch predictor modeled is a 15-bit version of the gshare predictor described in section 3.3
and provides up to three individual conditional branch predictions each cycle. The size of
the pattern history table is fixed at 32K entries, each consisting of 3 two-bit counters (24KB of
storage). An ideal return address stack is modeled.
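
The two storage figures quoted above can be checked with a little arithmetic, sketched here in Python. The 4 bytes per instruction used for the trace cache is an assumption chosen because it reproduces the quoted 128KB; the text does not state the width used for that estimate.

```python
KB = 1024

# Trace cache instruction storage: 2K lines of 16 instruction slots each.
trace_cache_lines = 2 * 1024
slots_per_line = 16
bytes_per_instruction = 4  # assumption, see the note above
trace_cache_bytes = trace_cache_lines * slots_per_line * bytes_per_instruction

# Pattern history table: 15 bits of history, three 2-bit counters per entry.
history_bits = 15
pht_entries = 2 ** history_bits
bits_per_entry = 3 * 2
pht_bytes = pht_entries * bits_per_entry // 8

print(trace_cache_bytes // KB, "KB of trace cache instruction storage")  # 128
print(pht_entries // 1024, "K PHT entries,", pht_bytes // KB, "KB")      # 32, 24
```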
The execution engine is composed of 16 functional units, each with a 32-entry reservation
station. The functional units are uniform and capable of all operations. A 64KB L1 data
cache was used. The model uses checkpoint repair [10] to recover from branch mispredictions
and exceptions. The execution engine is capable of creating up to three checkpoints each
cycle, one for each fetch block supplied. The memory scheduler waits for addresses to
be generated before scheduling memory operations. No memory operation can bypass a
store with an unknown address. Since this study is concerned with the fetch engine, many
components of the execution engine are modeled at very aggressive design points, thereby
reducing the potential for the backend to induce bottlenecks which may obscure bottlenecks
in the frontend.
All experiments were performed on the SPECint95 benchmarks and on a benchmark suite
consisting of several common C applications [26]. Table 2 lists the number of instructions
simulated and the input set, if the input was derived from a standard input set (footnote 1). All
simulations were run until completion (except li and ijpeg, 500M instructions).
Benchmark          Inst Count  Input Set
compress           95M         test.in
gcc                157M        jump.i
go                 151M        2stone9.in
ijpeg              500M        penguin.ppm
li                 500M        train.lsp
m88ksim            493M        dhry.test
perl               41M         scrabbl.pl
vortex             214M        vortex.in
chess              119M
ghostscript (gs)   180M
pgp                322M
plot               284M
python             220M
sim-outorder (ss)  100M
tex                164M
Table 2: Benchmarks and datasets used. All benchmarks, except li and ijpeg, were simulated
to completion.
Footnote 1: Vortex and go were simulated with abbreviated versions of the SPECint95 test
input set. Compress was simulated on a modified version of the test input with an initial
list of 30000 elements.
5 Critical Issues
In this section, we use the experimental baseline to evaluate several design enhancements
and configurations of the trace cache.
5.1 Partial Matching
Each fetch cycle, the predictions made by the branch predictor are used to select which
blocks within the accessed segment will be issued to the execution engine. This process is
referred to as partial matching [22, 6]. Alternatively, the trace cache can be designed to
signal a hit only if all the blocks within the selected segment match; otherwise a miss is
signaled. Figure 4 shows the performance difference between a trace cache that implements
partial matching and one that does not. The average performance improvement for these
benchmarks is 14%.
With partial matching, the number of requests that miss both the trace cache and small
icache drops significantly because of the more flexible policy for matching valid lines. If the
icache were made larger, the relative benefit with partial matching would be expected to
diminish.
5.2 Path Associativity
Path associativity relaxes the constraint that different segments starting from the same
fetch block cannot be stored in the trace cache at the same time. Path associativity allows
segments ABC and ABD (see figure 2) to reside concurrently in the cache whereas a
non-path associative trace cache allows only one segment starting at A to be resident in the trace
cache at any instant in time.
Figure 3: If the fill unit is able to create three-block segments for this path through a loop,
then all five possible segments will be created and stored in the trace cache. (Blocks A
through E; possible segments: ABC, DEB, CDE, BCD, EBC.)
Figure 4: Partial Matching. The performance difference between a trace cache that partially
matches segments and one that does not is around 14%. (Plot of instructions per cycle for
each benchmark: Baseline versus With Partial Matching.)
The data plotted in Figure 5 indicate that path associativity has little effect on the
performance of the baseline trace cache. Adding path associativity increases the number of
segments that map into a particular set; thus additional misses may occur due to increased
set conflicts. Therefore the experiment was conducted on a path-associative trace cache with
set-associativity of 4 and of 8. In both cases, the performance gain from path-associativity
is slight.
Path associativity requires extra gates in the trace selection logic to select the longest
matching trace segment. It is likely that the line selection time for the path associative cache
will be longer, possibly increasing cache cycle time.
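
The extra selection step can be sketched in Python as follows. This is an illustration under assumptions rather than the hardware logic: a set is modeled as a list of segments, each storing the branch directions embedded in it, and `Segment`, `matching_blocks`, and `select_segment` are hypothetical names.

```python
from typing import List, NamedTuple, Optional, Sequence, Tuple


class Segment(NamedTuple):
    start_address: int
    blocks: str                    # one letter per block, e.g. "ABC"
    directions: Tuple[bool, ...]   # outcome of the branch ending each block but the last


def matching_blocks(segment: Segment, predictions: Sequence[bool]) -> int:
    """Count the leading blocks of `segment` that lie on the predicted path."""
    count = 1  # the first block always matches once the tag matches
    for stored, predicted in zip(segment.directions, predictions):
        if stored != predicted:
            break
        count += 1
    return count


def select_segment(cache_set: List[Segment], fetch_address: int,
                   predictions: Sequence[bool]) -> Optional[str]:
    """Pick the resident segment with the longest match and return its matching part."""
    candidates = [s for s in cache_set if s.start_address == fetch_address]
    if not candidates:
        return None  # trace cache miss
    best = max(candidates, key=lambda s: matching_blocks(s, predictions))
    return best.blocks[:matching_blocks(best, predictions)]


# With path associativity, ABC and ABD can be resident at the same time.
cache_set = [Segment(0x100, "ABC", (True, False)),
             Segment(0x100, "ABD", (True, True))]
print(select_segment(cache_set, 0x100, (True, True, False)))  # ABD
```

Without path associativity only one of the two segments could be resident, and the same request would at best return the partial match AB.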
5.3 Inactive Issue
Partial matching increases the number of instructions that are issued each cycle but it
does not take advantage of the entire segment of instructions fetched from the trace cache.
Blocks within the trace cache segment which do not match the predicted path are discarded.
As long as the prediction is correct, this does not impact the effective fetch rate of the
processor. If the prediction is incorrect, however, an opportunity to issue a greater number
of correct instructions has been missed.
With inactive issue [6], all blocks within a trace cache line are issued into the processor
core whether or not they match the predicted path. The blocks that do not match the
prediction are said to be issued inactively. Although these inactive instructions are renamed
and receive physical registers for their destination values, the changes they make to the
register alias table [10] are not considered valid for subsequent issue cycles. Thus instructions
along the predicted path view the speculative state of the processor exactly as if the inactive
blocks had not been issued. When the branch that ended the last active block resolves,
if the prediction was correct, the inactive instructions are discarded. If the prediction was
incorrect, the processor has already fetched, issued and possibly executed some instructions
along the correct path.
Inactive issue reduces the impact of branch mispredictions. It allows a portion of the
branch resolution latency to be hidden by making some correct path instructions (which
follow a mispredicted branch) available for processing earlier. When the mispredicted branch
resolves, the recovery state of the processor is further along the correct path than it would
have been if the inactive instructions had not been issued.
To implement inactive issue, modifications must be made to the renaming and recovery
structures. Our execution model uses a checkpointed register alias table to maintain both
the architectural and speculative state of the processor. The changes needed to implement
inactive issue include adding an active bit to each checkpoint in the table. As the checkpoints
are created, this bit is set if the instructions in the corresponding block are issued actively and
the bit is cleared if the instructions are issued inactively. The most recent active checkpoint is
used as the speculative state of the machine when new instructions are issued. When a branch
resolves and is determined to be mispredicted, the inactive checkpoints immediately following
the resolved checkpoint become active and all subsequent checkpoints, corresponding to
instructions along the incorrect path, are flushed from the pipelines and the instruction
window. The fetch proceeds from the target of the newly activated checkpoint. If the branch
was correctly predicted, the inactive checkpoints are simply invalidated once its outcome is
known.
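
The checkpoint handling described above can be modeled with a short Python sketch. It is a simplification under assumptions: a checkpoint carries only a block name and the active bit, invalidation is modeled by removing the checkpoint from the list, and `Checkpoint`, `speculative_state`, and `resolve_branch` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Checkpoint:
    block: str
    active: bool  # the active bit added to each checkpoint


def speculative_state(checkpoints: List[Checkpoint]) -> Optional[Checkpoint]:
    """The most recent active checkpoint is the state seen by new instructions."""
    for checkpoint in reversed(checkpoints):
        if checkpoint.active:
            return checkpoint
    return None


def resolve_branch(checkpoints: List[Checkpoint], index: int, mispredicted: bool) -> None:
    """Resolve the branch that ends the block of checkpoints[index]."""
    # Find the run of inactive checkpoints immediately following the resolved one.
    end = index + 1
    while end < len(checkpoints) and not checkpoints[end].active:
        end += 1
    if mispredicted:
        # The inactive blocks were on the correct path: activate them and
        # flush everything issued after them along the wrong path.
        for checkpoint in checkpoints[index + 1:end]:
            checkpoint.active = True
        del checkpoints[end:]
    else:
        # The prediction was right: the inactive blocks are simply dropped.
        del checkpoints[index + 1:end]


# A and B issued actively, D inactively, then C fetched along the predicted path.
table = [Checkpoint("A", True), Checkpoint("B", True),
         Checkpoint("D", False), Checkpoint("C", True)]
resolve_branch(table, 1, mispredicted=True)
print([(c.block, c.active) for c in table])  # [('A', True), ('B', True), ('D', True)]
```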
Figure 6 illustrates the blocks issued from a fetched cache line by the three different
policies.
Figure 7 presents the performance benefits of inactive issue. Over the baseline configuration,
inactive issue offers a 17% performance boost. However, comparing figure 7 with
figure 4, one can notice only a slight 3% increase from inactive issue over partial matching.
Inactive issue is an extension of partial matching, and designed as a hedge against branch
mispredictions. The value of inactive issue is greater for less accurate branch prediction. For
example, inactive issue is more helpful on programs which are harder on the branch predictor,
as shown in the boost gained on gcc, go, and pgp.
Figure 5: Performance of Path Associativity. (Plot of instructions per cycle for each
benchmark: Baseline, With Path Assoc/4-way, With Path Assoc/8-way.)
Figure 6: An example of the different issue policies. (Predicted path: ABC. Fetched segment:
ABD. No partial matching: miss. Partial matching: AB. Inactive issue: AB (active), D (inactive).)
5.4 Fill Unit Issues
5.4.1 Block Collection
The fill unit collects blocks of instructions as they are processed and produces segments
to store into the trace cache. The fill unit can collect these blocks at any point in the
processor pipeline. In this experiment, we determine whether the blocks should be collected
as instructions are issued into the instruction window or when they are retired.
Figure 8 shows that the differences in performance between the two schemes are slight.
For many benchmarks, the speculative segment creation is beneficial, while for some (gcc,
go), it degrades performance. The fill unit collecting instructions at issue time provides
increased traffic to the trace cache because segments collected while executing on a wrong
execution path are also written into the trace cache. In some cases this generates useful
segments, but in other cases it evicts useful segments from the trace cache. The baseline
configuration collects blocks at retire time.
A fill unit that collects at retire time only writes segments from the correct execution
path to the trace cache. However, it suffers from an increased latency between the initial
fetch of a block and its collection into a segment and subsequent storage into the trace cache.
This can potentially impact the first few iterations of a tight loop, which will be fetched from
the instruction cache until the first iteration retires. In the next section we show that this
is not a significant influence on performance.
Figure 7: Performance of Inactive Issue. (Plot of instructions per cycle for each benchmark:
Baseline versus With Inactive Issue.)
Figure 8: Issue vs. Retire. This plot shows that collecting instructions at issue time is not
very different from collecting instructions at retire time. (Plot of instructions per cycle for
each benchmark: Retire Time (Baseline) versus Issue Time.)
5.4.2 Fill Unit Latency
As blocks of instructions are latched into the fill unit, some processing is required before
the composite segment can be written to the trace cache. The dependencies within
the arriving block must be recorded to reflect the values produced by the awaiting blocks.
Possibly, the instructions within the segment may need to be reordered so that they can be
quickly routed to functional unit node tables when the segment is refetched. To perform
these operations, the fill unit may require several cycles.
The purpose of this experiment is to determine the sensitivity of the trace cache fetch
mechanism to fill unit latency. Figure 9 shows the results of varying the number of cycles
from the arrival of the terminal block to the point it is written into the trace cache.
The results show that a fill unit with a 10-cycle latency has a negligible loss in performance
over a single-cycle fill unit. There are two reasons for this counter-intuitive behavior. First,
the next fetch of that segment usually does not occur within the next 10 cycles. Second, if it
does, it can be supplied by the icache (which is likely to contain the fetch block as it recently
satisfied the original request for the block). For some benchmarks (e.g., gs), a longer fill
unit latency results in a slightly higher performance. A longer latency sometimes delays the
replacement of a useful trace cache segment with a less useful one.
5.5 Branch Predictor Issues
The multiple branch predictor is a crucial element of the trace cache fetch mechanism. If
the trace cache is not supported by a predictor capable of making accurate predictions, gains
in effective fetch rate will be offset by losses from more discarded fetches, likely resulting in
a loss in performance.
In this section we examine several organizations for the pattern history table (PHT).
A pattern history table entry contains the most likely branch outcome when a particular
pattern is encountered in the first level history. For our baseline multiple branch predictor,
we are using a pattern history table entry composed of 3 two-bit saturating counters.
This entry format was derived as a cost-effective version of the scheme used in [19, 6]
where a PHT entry is composed of 7 two-bit counters. Figure 10 shows how the seven
counters are used to supply three predictions per cycle. The first two-bit counter supplies
the prediction for the first branch and is used to select which of two two-bit counters supplies
the prediction for the second branch. Both predictions are used to select one of four two-bit
counters to supply the prediction for the third branch. All three predictions are made with
a single access to the PHT.
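
The selection among the seven counters can be written out as a small Python sketch. The ordering of the counters inside an entry (root first, then the two second-level counters, then the four third-level counters) is an assumption made for illustration, as is the function name `predict_three`.

```python
def predict_three(counters):
    """Derive three predictions from one PHT entry of seven 2-bit counters.

    Assumed layout: counters[0] predicts the first branch, counters[1:3] hold
    the second-branch counters selected by the first prediction, and
    counters[3:7] hold the third-branch counters selected by the first two.
    """
    def taken(counter):
        return counter >= 2  # upper half of a 2-bit counter means taken

    b0 = taken(counters[0])
    b1 = taken(counters[1 + int(b0)])
    b2 = taken(counters[3 + 2 * int(b0) + int(b1)])
    return b0, b1, b2


print(predict_three([3, 0, 2, 0, 0, 3, 0]))  # (True, True, False)
```

Only three of the seven counters are consulted on any access, which is consistent with the observation that many counters in an entry go unused when branches are strongly biased.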
The configurations evaluated in this experiment include the 7 counter scheme, the 3
counter scheme, and a scheme presented by Menezes et al. [16] where each PHT entry contains
the most likely path through a program subgraph containing 3 branches. In this scheme,
the PHT entries are 4 bits wide: 3 bits to encode the likely path (8 paths are possible)
and a fourth bit which records the likeliness of this path. Figure 11 shows the performance
comparison of the various PHT schemes. The size of the predictor is kept roughly constant.
The 7 counter scheme uses 14 bits of history and requires 32KB of storage, the 3 counter
scheme uses 15 bits of history and requires 24KB of storage, and the likely path scheme uses 16
bits of history and requires 32KB of storage.
In general, the 3 counter scheme slightly outperforms the other two schemes. It outperforms
the 7 counter scheme because of better utilization: many counters within an entry in
the 7 counter scheme are likely to go unused because of the prevalence of biased branches.
The 3 counter scheme outperforms the likely path scheme because of more hysteresis.
Figure 9: Fill Unit Latency is not a major influence on performance. (Plot of instructions
per cycle for each benchmark: 1 cycle (Baseline), 2 cycles, 3 cycles, 10 cycles.)
Figure 10: Seven two-bit counters are used to provide three predictions. B0 is the prediction
for the first branch, B1 is the prediction of the second and B2 is for the third.
5.6
Comparison to an Aggressive Single Blok Mehanism
The experiments thus far have evaluated various design options for the trace cache fetch
mechanism. We now provide a comparison of the enhanced trace cache to the current dominant
techniques for fetch engine design. Rotenberg et al. [22] presented a thorough comparison
of the trace cache's performance on the SPECint92 and IBS benchmarks to a few of the
hardware-based multiple block fetch techniques mentioned in section 2. Here we present a
comparison of our baseline configuration with an aggressive single block fetch mechanism and
an icache capable of fetching up to the first taken branch.
The components of the single fetch block mechanism are approximately the same size
and access complexity as the trace cache counterparts in the baseline configuration. The
single block mechanism consists of a single cycle, 128KB, 4-way set associative instruction
cache capable of fetching two consecutive cache lines and supplying up to 16 instructions or
until the first control flow instruction each cycle. The next fetch address is generated with
an 8KB branch target buffer. The single branch predictor is a hybrid predictor, consisting
of two components: a 15-bit PAs predictor and a 15-bit gshare predictor. The selection
between the components is done by a 15-bit gshare-style selector. Combining a per-address
predictor with a gshare predictor has been shown to be an effective way of boosting predictor
accuracy [13, 2]. This fetch mechanism is similar to the one used on the Alpha 21264 [12].
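The hybrid arrangement described here can be sketched in a few lines. The code below is a simplified illustration, not the simulated predictor: the class name `HybridPredictor`, the table sizes, the use of the raw address as the branch identifier, and the update policy for the selector are all assumptions. In particular, the per-address component indexes a single shared pattern table with local history only, which is closer to PAg than to a true PAs organization.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter (0..3) up or down by one."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class HybridPredictor:
    """Per-address + gshare components with a gshare-indexed selector (sketch)."""

    def __init__(self, bits=15, local_entries=1024):
        self.mask = (1 << bits) - 1
        self.global_hist = 0
        self.local_hist = [0] * local_entries   # one history register per slot
        self.gshare = [2] * (1 << bits)         # 2 = weakly taken
        self.per_addr = [2] * (1 << bits)
        self.selector = [2] * (1 << bits)       # >= 2 means "trust gshare"

    def _lookup(self, pc):
        g_idx = (pc ^ self.global_hist) & self.mask
        l_idx = self.local_hist[pc % len(self.local_hist)] & self.mask
        return g_idx, l_idx

    def predict(self, pc):
        g_idx, l_idx = self._lookup(pc)
        if self.selector[g_idx] >= 2:
            return self.gshare[g_idx] >= 2
        return self.per_addr[l_idx] >= 2

    def update(self, pc, taken):
        g_idx, l_idx = self._lookup(pc)
        g_pred = self.gshare[g_idx] >= 2
        l_pred = self.per_addr[l_idx] >= 2
        # Train the selector only when the two components disagree.
        if g_pred != l_pred:
            self.selector[g_idx] = bump(self.selector[g_idx], g_pred == taken)
        self.gshare[g_idx] = bump(self.gshare[g_idx], taken)
        self.per_addr[l_idx] = bump(self.per_addr[l_idx], taken)
        # Shift the outcome into both history registers.
        self.global_hist = ((self.global_hist << 1) | int(taken)) & self.mask
        slot = pc % len(self.local_hist)
        self.local_hist[slot] = ((self.local_hist[slot] << 1) | int(taken)) & self.mask


# Usage: a branch that alternates taken / not taken is learned from its history.
predictor = HybridPredictor()
pc = 0x400
for i in range(100):
    predictor.update(pc, i % 2 == 0)
correct = 0
for i in range(100, 120):
    taken = i % 2 == 0
    correct += predictor.predict(pc) == taken
    predictor.update(pc, taken)
print(correct, "of 20 predicted correctly")
```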
We also compare the trace cache to a mechanism where the instruction cache is capable
of supplying a sequential stream of instructions beyond conditional branches which are
predicted to be not taken. This configuration requires the use of a multiple branch predictor
(up to three branches can be fetched) and therefore does not benefit from the high prediction
accuracy attainable with the hybrid predictor. We call this scheme the sequential block
icache.
Figure 12 displays the experimental results. The trace cache outperforms both schemes
for most benchmarks. Listed on the graph are the percentage increases over the single block
icache scheme. On average, the trace cache delivers a performance improvement of 28% over
the single block icache and a 15% improvement over the sequential block icache.
The large boost in performance of the trace cache mechanism comes from a significant
increase in average effective fetch rate, that is, the average number of instructions delivered per
on-path fetch. Our experimental results (not presented here) show that this rate doubles
with a trace cache over the single block icache.
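As a concrete reading of this metric, the sketch below averages the instructions delivered over on-path fetches only, ignoring fetches made down a mispredicted path. The function name `effective_fetch_rate` and the `(instructions_delivered, on_path)` record format are hypothetical, chosen only to illustrate the definition, and the sample numbers are made up.

```python
def effective_fetch_rate(fetches):
    """Average instructions delivered per on-path fetch.

    fetches: iterable of (instructions_delivered, on_path) tuples, where
    on_path is False for fetches that were later discarded.
    """
    delivered = [count for count, on_path in fetches if on_path]
    return sum(delivered) / len(delivered) if delivered else 0.0


# Illustrative fetch log: three on-path fetches and one wrong-path fetch.
sample_log = [(16, True), (5, True), (12, False), (9, True)]
rate = effective_fetch_rate(sample_log)
print(rate)
```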
While the increased fetch rate improves the performance of the trace cache mechanism,
the losses due to branch mispredictions and, to a lesser extent, cache misses degrade
performance. Our experimental data indicates that the conditional branch misprediction rate
drops from 6.6% with the trace cache to 5.5% with the single block icache.
6 Conclusions
In this paper we have examined some of the critical design parameters of the trace cache
fetch mechanism. The trace cache supplies multiple fetch blocks of instructions each cycle
by storing logically contiguous instruction sequences in physically contiguous storage.
We have demonstrated that the ability to partially match a trace segment provides an
average 14% performance boost over a configuration which requires a complete match. Inactive
issue is a hedge against branch mispredictions and yields a 17% improvement over the baseline
and is particularly helpful on benchmarks which suffer from poor prediction accuracy.
We have also demonstrated that trace creation can be done speculatively with no
degradation in performance. Creating traces speculatively at issue time may allow for simpler
implementations. In addition, the latency in creating traces has negligible effects on performance.
[Figure 11 plot omitted. Y-axis: Instructions Per Cycle; x-axis: Benchmarks; series: 3 Counters (Baseline), 7 Counters, Likely Path.]
Figure 11: Evaluation of various pattern history table entry configurations.
[Figure 12 plot omitted. Y-axis: Instructions Per Cycle; x-axis: Benchmarks; series: Single Block ICache, Sequential Block ICache, Trace Cache w/Inactive Issue. Per-benchmark labels give the trace cache's percentage increase over the single block icache, ranging from 9% to 49%.]
Figure 12: The performance of the baseline trace cache fetch mechanism against two icache
fetch schemes.
When compared with an aggressive single block fetch icache, the trace cache attains an
average performance increase of 28% and attains a 15% improvement over a sequential block
icache. Much of this performance increase comes from the increase in effective fetch rate,
which is twice that of the single block engine.
Because it is a low-complexity technique for delivering high instruction bandwidth, the
trace cache will be an important component of future microprocessors [21]. There remain
many important issues which need resolving. We are currently focusing on quantifying and
reducing instruction redundancy in the trace cache and on developing techniques for creating
longer trace segments.
7 Acknowledgements
We would like to thank several other members of the HPS research group, Rob Chappell,
Marius Evers, Peter Kim, Paul Racunas, Jared Stark, and several of its alumni, in particular
Mike Shebanow. We would like to thank our corporate sponsors (Intel, HAL, and NCR)
for funding our work.
References
[1] D. Burger, T. Austin, and S. Bennett, "Evaluating future microprocessors: The SimpleScalar
tool set," Technical Report 1308, University of Wisconsin - Madison Technical Report, July
1996.
[2] P.-Y. Chang, E. Hao, T.-Y. Yeh, and Y. N. Patt, "Branch classification: A new mechanism
for improving branch predictor performance," in Proceedings of the 27th Annual ACM/IEEE
International Symposium on Microarchitecture, pp. 22-31, 1994.
[3] R. P. Colwell, R. P. Nix, J. J. O'Donnell, D. B. Papworth, and P. K. Rodman, "A VLIW
architecture for a trace scheduling compiler," IEEE Transactions on Computers, vol. 37, no.
8, pp. 967-979, August 1988.
[4] T. M. Conte, K. N. Menezes, P. M. Mills, and B. A. Patel, "Optimization of instruction fetch
mechanisms for high issue rates," in Proceedings of the 22nd Annual International Symposium
on Computer Architecture, 1995.
[5] J. A. Fisher, "Trace scheduling: A technique for global microcode compaction," IEEE
Transactions on Computers, vol. C-30, no. 7, pp. 478-490, July 1981.
[6] D. H. Friendly, S. J. Patel, and Y. N. Patt, "Alternative fetch and issue techniques from the
trace cache fetch mechanism," in Proceedings of the 30th Annual ACM/IEEE International
Symposium on Microarchitecture, 1997.
[7] D. H. Friendly, S. J. Patel, and Y. N. Patt, "Putting the fill unit to work: Dynamic
optimizations for trace cache microprocessors," in Proceedings of the 30th Annual ACM/IEEE
International Symposium on Microarchitecture, 1997.
[8] E. Hao, P.-Y. Chang, M. Evers, and Y. N. Patt, "Increasing the instruction fetch rate via
block-structured instruction set architectures," in Proceedings of the 29th Annual ACM/IEEE
International Symposium on Microarchitecture, 1996.
[9] W. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann,
R. G. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery,
"The superblock: An effective technique for VLIW and superscalar compilation," Journal
of Supercomputing, vol. 7, no. 9-50, 1993.
[10] W. W. Hwu and Y. N. Patt, "Checkpoint repair for out-of-order execution machines," in
Proceedings of the 14th Annual International Symposium on Computer Architecture, pp. 18-26,
1987.
[11] J. D. Johnson, "Expansion caches for superscalar microprocessors," Technical Report
CSL-TR-94-630, Stanford University, Palo Alto CA, June 1994.
[12] J. Keller, The 21264: A Superscalar Alpha Processor with Out-of-Order Execution, Digital
Equipment Corporation, Hudson, MA, October 1996.
[13] S. McFarling, "Combining branch predictors," Technical Report TN-36, Digital Western
Research Laboratory, June 1993.
[14] S. Melvin and Y. Patt, "Enhancing instruction scheduling with a block-structured ISA,"
International Journal on Parallel Processing, 1994.
[15] S. W. Melvin and Y. N. Patt, "Performance benefits of large execution atomic units in
dynamically scheduled machines," in Proceedings of Supercomputing '89, pp. 427-432, 1989.
[16] K. N. Menezes, S. W. Sathaye, and T. M. Conte, "Path prediction for high issue-rate
processors," in Proceedings of the 1997 ACM/IEEE Conference on Parallel Architectures and
Compilation Techniques, 1997.
[17] R. Nair and M. E. Hopkins, "Exploiting instruction level parallelism in processors by caching
scheduled groups," in Proceedings of the 24th Annual International Symposium on Computer
Architecture, pp. 13-25, 1997.
[18] S. J. Patel, M. Evers, and Y. N. Patt, "Improving trace cache effectiveness with branch
promotion and trace packing," in Proceedings of the 25th Annual International Symposium on
Computer Architecture, 1998.
[19] S. J. Patel, D. H. Friendly, and Y. N. Patt, "Critical issues regarding the trace cache fetch
mechanism," Technical Report CSE-TR-335-97, University of Michigan Technical Report, May
1997.
[20] A. Peleg and U. Weiser, Dynamic Flow Instruction Cache Memory Organized Around Trace
Segments Independent of Virtual Address Line, U.S. Patent Number 5,381,533, 1994.
[21] F. Pollack, Description of the Intel microprocessor roadmap, Press briefing, October 1998.
[22] E. Rotenberg, S. Bennett, and J. E. Smith, "Trace cache: a low latency approach to high
bandwidth instruction fetching," in Proceedings of the 29th Annual ACM/IEEE International
Symposium on Microarchitecture, 1996.
[23] E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. E. Smith, "Trace processors," in Proceedings
of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, 1997.
[24] A. Seznec, S. Jourdan, P. Sainrat, and P. Michaud, "Multiple-block ahead branch predictors,"
in Proceedings of the 7th International Conference on Architectural Support for Programming
Languages and Operating Systems, 1996.
[25] E. Sprangle and Y. Patt, "Facilitating superscalar processing via a combined static/dynamic
register renaming scheme," in Proceedings of the 27th Annual ACM/IEEE International
Symposium on Microarchitecture, pp. 143-147, 1994.
[26] J. Stark, P. Racunas, and Y. N. Patt, "Reducing the performance impact of instruction cache
misses by writing instructions into the reservation stations out-of-order," in Proceedings of the
30th Annual ACM/IEEE International Symposium on Microarchitecture, pp. 34-43, 1997.
[27] S. Vajapeyam and T. Mitra, "Improving superscalar instruction dispatch and issue by
exploiting dynamic code sequences," in Proceedings of the 24th Annual International Symposium on
Computer Architecture, pp. 1-12, 1997.
[28] T.-Y. Yeh, D. Marr, and Y. N. Patt, "Increasing the instruction fetch rate via multiple branch
prediction and branch address cache," in Proceedings of the International Conference on
Supercomputing, pp. 67-76, 1993.
[29] T.-Y. Yeh and Y. N. Patt, "Two-level adaptive branch prediction," in Proceedings of the 24th
Annual ACM/IEEE International Symposium on Microarchitecture, pp. 51-61, 1991.
Sanjay J. Patel is currently a PhD candidate in Computer Science and Engineering at
the University of Michigan, Ann Arbor. He is investigating techniques for high bandwidth
instruction supply for processors in the era of 100M and 1B transistors. His research
interests include trace caches, branch prediction, memory bandwidth issues, and the performance
simulation of microarchitectures. He received an MS from the University of Michigan and has
worked for Digital Equipment Corporation.
Dan Friendly is a PhD candidate in computer science and engineering at the University
of Michigan. His research interests include high bandwidth fetch mechanisms, superscalar
architectures, and speculative execution techniques. He received a BA in American social issues
from Macalester College and an MS in computer science from the University of Michigan.
Yale N. Patt is Professor of Electrical Engineering and Computer Science at the
University of Michigan, where he teaches undergraduate and graduate courses in computer
architecture, and directs the PhD research of nine graduate students in computer architecture,
high-performance processor design, and computer systems implementation. Patt earned his
BS from Northeastern University and his MS and PhD from Stanford University, all in
electrical engineering. He received the 1995 IEEE Emanuel R. Piore Award and the 1996
ACM/IEEE Eckert-Mauchly Award. He is a Fellow of the IEEE.