Limits of Instruction-Level Parallelism

David W. Wall
Digital Equipment Corporation, Western Research Laboratory

Abstract

Growing interest in ambitious multiple-issue machines and heavily-pipelined machines requires a careful examination of how much instruction-level parallelism exists in typical programs. Such an examination is complicated by the wide variety of hardware and software techniques for increasing the parallelism that can be exploited, including branch prediction, register renaming, and alias analysis. By performing simulations based on instruction traces, we can model techniques at the limits of feasibility and even beyond. Our study shows a striking difference between assuming that the techniques we use are perfect and merely assuming that they are impossibly good. Even with impossibly good techniques, average parallelism rarely exceeds 7, with 5 more common.

1. Introduction

There is growing interest in machines that exploit, usually with compiler assistance, the parallelism that programs have at the instruction level. Figure 1 shows an example. The code fragment in 1(a) consists of three instructions that can be executed at the same time, because they do not depend on each other's results.
The code fragment in 1(b) does have dependencies, and so cannot be executed in parallel.

    (a)  r1 := 0[r9]            (b)  r1 := 0[r9]
         r2 := 17                    r2 := r1 + 17
         4[r3] := r6                 4[r2] := r6

         parallelism = 3             parallelism = 1

      Figure 1. Instruction-level parallelism (and the lack thereof).

Architectures have been proposed to take advantage of this kind of parallelism. A superscalar machine [1] is one that can issue multiple independent instructions in the same cycle. A superpipelined machine [7] issues one instruction per cycle, but the cycle time is set much less than the typical instruction latency. A VLIW machine [11] is like a superscalar machine, except that the parallel instructions must be explicitly packed by the compiler into very long instruction words. In each case, the parallelism of a machine running a given program is the number of instructions executed divided by the number of cycles required.

Many machines already have some degree of pipelining, as reflected in operations with latencies of multiple cycles. We can compute the degree of pipelining by multiplying the latency of each operation by its dynamic frequency in typical programs; for the DECStation 5000,* load latencies, delayed branches, and floating-point latencies give the machine a degree of pipelining equal to about 1.5. Adding superscalar capability to a machine with some pipelining is beneficial only if there is more parallelism available than the pipelining already exploits.

But how much parallelism is there to exploit? Popular wisdom, supported by a few studies [7,13,14], suggests that parallelism within a basic block rarely exceeds 3 or 4 on the average. Peak parallelism can be higher, especially for some kinds of numeric programs, but the payoff of high peak parallelism is low if the average is still small.

To increase the instruction-level parallelism that the hardware can exploit, people have explored a variety of techniques. These fall roughly into two categories: one includes techniques for increasing the parallelism within a basic block, the other techniques for using parallelism across several basic blocks. These techniques often interact in ways that have not been adequately explored. We would like to bound the effectiveness of a technique whether it is used in combination with impossibly good companion techniques or with none. A general approach is therefore needed. In this paper, we describe our use of trace-driven simulation to study the importance of register renaming, branch and jump prediction, and alias analysis. In each case we can model a range of possibilities from perfect to non-existent. We will begin with a survey of ambitious techniques for increasing the exploitable instruction-level parallelism.

* DECStation is a trademark of Digital Equipment Corporation.

1.1. Increasing parallelism within basic blocks

Parallelism within a basic block is limited by dependencies between pairs of instructions. Some of these dependencies are real, reflecting the flow of data in the program. Others are false dependencies, accidents of the code generation or results of our lack of precise knowledge about the flow of data.

Allocating registers can lead to a false dependency on a register. For example, in the code sequence

    r1 := 0[r9]
    r2 := r1 + 1
    r1 := ...

we must do the second and third instructions in order, because the third changes the value of r1. However, if the compiler had used r3 instead of r1 in the third instruction, these two instructions would be independent. A smart compiler might pay attention to its allocation of registers so as to maximize the opportunities for parallelism, but this may be troublesome. Current compilers often do not, preferring instead to reuse registers as often as possible so that the number of registers needed is minimized.

An alternative is the hardware solution of register renaming, in which the hardware imposes a level of indirection between the register number appearing in the instruction and the actual register used. Each time an instruction sets a register, the hardware selects an actual register to use for as long as that value is needed. In a sense the hardware does the register allocation dynamically. Register renaming can give better results than static allocation, even if the compiler did the static allocation as well as it could. In addition, register renaming allows the hardware to include more registers than will fit in the instruction format, further reducing false dependencies. Unfortunately, register renaming can also lengthen the machine pipeline, thereby increasing the branch penalties of the machine; here we are concerned only with its effects on parallelism.

False dependencies can also involve memory. We assume that memory locations have meaning to the programmer that registers do not, and hence that hardware renaming of memory locations is not desirable. However, we may still have to make conservative assumptions that lead to false dependencies on memory. For example, in the code sequence

    r1 := 0[r9]
    4[r16] := r3

we may have no way of knowing whether the memory locations referenced in the load and store are the same. If they are the same, then there is a dependency between these two instructions: we cannot store a new value until we have fetched the old. If they are different, there is no dependency. If the compiler or hardware are not sure the locations are different, we must assume conservatively that they are the same. Alias analysis can help a compiler decide when two memory references are independent, but the analysis is imprecise; sometimes we must assume the worst. Hardware can resolve the question at run-time by determining the actual addresses referenced, but this may be too late to affect parallelism decisions.

1.2. Crossing block boundaries

The number of instructions between branches is usually quite small. If we want large parallelism, we must be able to issue instructions from different basic blocks in parallel. But this means we must know in advance whether a conditional branch will be taken, or else we must cope with the possibility that we do not know.

Branch prediction is a hardware technique. In the scheme we used [9,12], the branch predictor maintains a table of two-bit entries. Low-order bits of a branch's address provide the index into this table. Taking a branch causes us to increment its table entry; not taking it causes us to decrement. We do not wrap around when the table entry reaches its maximum or minimum. We predict that a branch will be taken if its table entry is 2 or 3. A good initial value for table entries is 2, barely predicting that each branch will be taken. Under this two-bit scheme, a typical loop branch is mispredicted only once, when the loop is exited. Two branches that map to the same table entry interfere with each other; no "key" identifies the owner of the entry.
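To make the bookkeeping concrete, here is a minimal sketch of such a two-bit saturating predictor in Python. The table size and the hashing of the branch address are illustrative assumptions; the "infinite" variant used later in the paper simply gives every branch its own entry.

    class TwoBitPredictor:
        """Two-bit saturating counters indexed by low-order branch-address bits."""
        def __init__(self, entries=2048):
            self.entries = entries
            self.table = [2] * entries   # initial value 2: barely predict "taken"

        def _index(self, branch_addr):
            return branch_addr % self.entries   # two branches may share (and disturb) an entry

        def predict(self, branch_addr):
            return self.table[self._index(branch_addr)] >= 2   # taken if entry is 2 or 3

        def update(self, branch_addr, taken):
            i = self._index(branch_addr)
            if taken:
                self.table[i] = min(3, self.table[i] + 1)   # saturate, do not wrap
            else:
                self.table[i] = max(0, self.table[i] - 1)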
Branch prediction is often used to keep a pipeline full: we fetch and decode instructions after the branch while we are executing the branch and the instructions before it. To use branch prediction to increase parallel execution, however, we must be able to execute instructions across an unknown branch speculatively. This may involve maintaining shadow registers, whose values are not committed until we are sure we have predicted the branch correctly. It may involve being selective about the instructions we choose: we may not be willing to execute memory stores speculatively, for example. Some of this may be put partly under compiler control by designing an instruction set with explicitly squashable instructions. Each squashable instruction would be tied explicitly to a condition evaluated in another instruction, and would be squashed by the hardware if the condition turns out to be false.

Rather than try to predict the destinations of branches, we might speculatively execute instructions along both possible paths, squashing the wrong path when we know which it is. Some of our parallelism capability is guaranteed to be wasted, but we will never miss out by taking the wrong path. Unfortunately, branches happen quite often in normal code, so for large degrees of parallelism we may encounter another branch before we have resolved the previous one. A real architecture may therefore have limits on the amount of fanout we can tolerate; beyond that limit we must assume that new branches are not explored in parallel.

Branches are not the only instructions that change the flow of control. Many architectures also have two or three kinds of jumps. Branches are conditional and have a destination at some specified offset from the PC. Jumps are unconditional, and may be either direct or indirect: a direct jump is one whose destination is given explicitly in the instruction, while an indirect jump is one whose destination is expressed as an address computation involving a register. In principle we can know the destination of a direct jump well in advance. The same is true of a branch, assuming we know how its condition will turn out. The destination of an indirect jump, however, may require us to wait until the address computation is possible. Little work has been done on predicting the destinations of indirect jumps, but it might pay off in instruction-level parallelism.

This paper considers a very simple (and, it turns out, fairly accurate) jump prediction scheme. A table of destination addresses is maintained. The address of a jump provides the index into this table. Whenever we execute an indirect jump, we put its destination address in the table entry for that jump. We predict that an indirect jump will go to the address in its table entry, that is, to the same place it went last time. As with branch prediction, we do not prevent two jumps from mapping to the same table entry and interfering with each other.
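A minimal sketch of this last-destination jump predictor, under the same illustrative assumptions as the branch-predictor sketch above (the table size and address hashing are ours, not the paper's):

    class LastDestinationPredictor:
        """Predict that an indirect jump goes where it went last time."""
        def __init__(self, entries=2048):
            self.entries = entries
            self.table = {}   # index -> last destination seen

        def _index(self, jump_addr):
            return jump_addr % self.entries   # jumps that share an index interfere

        def predict(self, jump_addr):
            return self.table.get(self._index(jump_addr))   # None means no prediction yet

        def update(self, jump_addr, destination):
            self.table[self._index(jump_addr)] = destination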
Loop unrolling is a simple compiler technique that can also increase parallelism. If we unroll a loop ten times, thereby removing 90% of its branches, we effectively increase the basic block size tenfold. This larger basic block may hold parallelism that had been unavailable because of the branches.

Software pipelining [8] is a compiler technique for moving instructions across branches to increase parallelism. It analyzes the dependencies in a loop body, looking for ways to increase parallelism by moving instructions from one iteration into a previous or later iteration. Thus the dependencies in one iteration can be stretched out across several, effectively executing several iterations in parallel without the code expansion of unrolling.

Trace scheduling [4] is a global scheduling technique that was developed for VLIW machines, where the parallelism needed to fill the long instruction words must be found by the compiler. It uses a profile to find a trace, a sequence of basic blocks that are executed often, and schedules the instructions in these blocks as a whole. In effect, the compiler predicts branches statically, based on the profile. To cope with occasions when a prediction fails and execution enters or leaves the sequence unexpectedly, code is inserted outside the sequence of blocks. This added code may itself be scheduled as part of a later and less heavily executed trace.

2. This and previous work

To better understand this bewildering array of techniques, we have built a system for scheduling the instructions in a program trace. Our system allows us to assume various kinds of branch and jump prediction, register renaming, and alias analysis. In each case the option ranges from perfect, which could not be implemented in reality, to non-existent. It is important to consider the full range in order to bound the effectiveness of the various techniques. For example, it is useful to ask how well a realistic branch prediction scheme could work even with impossibly good alias analysis and register renaming.

This is in contrast to the 1989 study of Jouppi and Wall [7], which considered only static scheduling by the compiler rather than dynamic scheduling within traces. Since they worked only within basic blocks, they did not consider branch prediction or more ambitious program scheduling. The methodology of our paper is more like that of Tjaden and Flynn [14], which also scheduled instructions from a dynamic trace. Like Jouppi and Wall, however, Tjaden and Flynn did not move instructions across branches. Their results were similar to those of Jouppi and Wall, with parallelism rarely above 3, even though the two studies assumed quite different architectures.

Nicolau and Fisher's trace-driven study [11] of the effectiveness of trace scheduling was more liberal, assuming perfect branch prediction and perfect alias analysis.
However, they did not consider more realistic assumptions, arguing instead that they were interested primarily in programs for which realistic implementations would be close to perfect.

The study by Smith, Johnson, and Horowitz [13] was a realistic application of trace-driven simulation that assumed neither too restrictive nor too generous a model. They were interested, however, in validating a particular realistic machine design, one that could consistently exploit a parallelism of only 2. They did not explore the range of techniques discussed in this paper.

We believe our study can provide useful bounds on the behavior not only of hardware techniques like branch prediction and register renaming, but also of compiler techniques like software pipelining and trace scheduling. Unfortunately, we could think of no good way to model loop unrolling. Register renaming can cause much of the computation in a loop to migrate toward the beginning of the loop, providing opportunities for parallelism much like those presented by unrolling. Much of the computation, however, like the repeated incrementing of the loop index, is inherently sequential. We address loop unrolling in an admittedly unsatisfying manner, by unrolling the loops of some numerical programs by hand and comparing the results to those of the normal versions.

3. Our experimental framework

To explore the parallelism available in a particular program, we execute the program to produce a trace of the instructions executed. This trace also includes the addresses referenced and the results of branches and jumps. A greedy algorithm packs these instructions into a sequence of pending cycles; any cycle may contain as many as 64 instructions. We further assume no limits on replicated functional units or ports to registers or memory: all 64 instructions may be multiplies, or even loads. We assume that every operation has a latency of one cycle, so the result of an operation executed in cycle N can be used by an instruction executed in cycle N+1. This includes memory references: we assume there are no cache misses.

We pack the instructions from the trace into cycles as follows. For each instruction in the trace, we start at the end of the cycle sequence, representing the latest pending cycle, and move earlier in the sequence until we find a conflict with the new instruction. Whether a conflict exists depends on which model we are considering. If the conflict is a false dependency (in models allowing them), we assume that we can put the instruction in that cycle but no farther back. Otherwise we assume only that we can put the instruction in the next cycle after the conflicting one. If the correct cycle is full, we put the instruction in the next non-full cycle. If we cannot put the instruction in any pending cycle, we start a new pending cycle at the end of the sequence.

As we add more and more instructions, the sequence of pending cycles gets longer. We assume that hardware and software techniques will have some limit on how many instructions they will consider at once. When the total number of instructions in the sequence of pending cycles reaches this limit, we remove the first cycle from the sequence, whether it is full of instructions or not. This corresponds to retiring the cycle's instructions from the scheduler and passing them on to be executed. When we have exhausted the trace, we divide the number of instructions executed by the number of cycles created. The result is the parallelism.
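The following is a minimal sketch of that greedy packing loop, assuming a conflicts(instr, cycle) predicate and an is_false_dependency(instr, cycle) test supplied by whichever model is being simulated; the helper names and the shape of the instruction records are ours, not the paper's, and misprediction flushes are omitted.

    def schedule(trace, conflicts, is_false_dependency, cycle_width=64, window=2048):
        """Greedy packing of a trace into pending cycles; returns the parallelism."""
        pending = []          # list of cycles, each a list of instructions
        cycles_created = 0
        executed = 0

        for instr in trace:
            executed += 1
            # Scan backward from the latest pending cycle until a conflict is found.
            place = 0
            for i in range(len(pending) - 1, -1, -1):
                if conflicts(instr, pending[i]):
                    # A false dependency lets us share the conflicting cycle;
                    # a true dependency forces us into the following cycle.
                    place = i if is_false_dependency(instr, pending[i]) else i + 1
                    break
            # Use the first non-full cycle at or after the chosen position.
            while place < len(pending) and len(pending[place]) >= cycle_width:
                place += 1
            if place == len(pending):
                pending.append([])
                cycles_created += 1
            pending[place].append(instr)
            # Retire the oldest cycle when the window limit is reached.
            if sum(len(c) for c in pending) >= window:
                pending.pop(0)

        return executed / max(cycles_created, 1)

The per-model conflict rules described in section 3.1, and the flushing of pending cycles on a misprediction, would plug in through the conflicts predicate.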
3.1. Parameters

We can do three kinds of register renaming: perfect, finite, and none. For perfect renaming, we assume that there are infinitely many registers, so that no false register dependencies occur. For finite renaming, we assume a finite register set dynamically allocated using an LRU discipline: when we need a new register, we select the register whose most recent use (measured in cycles rather than in instruction count) is earliest. Finite renaming is normally done with 256 integer registers and 256 floating-point registers. It is also interesting to see what happens when we reduce this to 64 or even 32, the number on our base machine. To simulate no renaming, we simply use the registers specified in the code; this is of course highly dependent on the register strategy of the compiler we use.
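A minimal sketch of the finite-renaming bookkeeping under an LRU discipline; the free-list handling and the data structures are our own illustrative choices.

    class LRURenamer:
        """Map architectural registers onto a finite pool, reclaiming the LRU value."""
        def __init__(self, pool_size=256):
            self.pool_size = pool_size
            self.map = {}        # architectural register -> physical register
            self.last_use = {}   # physical register -> cycle of most recent use
            self.free = list(range(pool_size))

        def read(self, arch_reg, cycle):
            if arch_reg not in self.map:        # first reference: give it a register
                return self.write(arch_reg, cycle)
            phys = self.map[arch_reg]
            self.last_use[phys] = cycle
            return phys

        def write(self, arch_reg, cycle):
            # A newly written value needs a physical register of its own.
            if self.free:
                phys = self.free.pop()
            else:
                # Reclaim the register whose most recent use is earliest.
                phys = min(self.last_use, key=self.last_use.get)
                for a, p in list(self.map.items()):
                    if p == phys:
                        del self.map[a]
            self.map[arch_reg] = phys
            self.last_use[phys] = cycle
            return phys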
We can assume several degrees of branch prediction. One extreme is perfect prediction: we assume that all branches are correctly predicted. Next we can assume the two-bit prediction scheme described before. The two-bit scheme can be either infinite, with a table big enough that two different branches never share a table entry, or finite, with a table of 2048 entries. To model trace scheduling, we can also assume static branch prediction based on a profile; in this case we predict that a branch will always go the way it goes most frequently. And finally, we can assume that no branch prediction occurs; this is the same as assuming that every branch is predicted wrong.

The same choices are available for jump prediction. We can assume that indirect jumps are perfectly predicted. We can assume infinite or finite hardware prediction as described above, predicting that a jump will go where it went last time. We can assume static prediction based on a profile. And we can assume no prediction. In any case we are concerned only with indirect jumps; we assume that direct jumps are always predicted correctly.

The scheduling effect of branch and jump prediction is easy to state. Correctly predicted branches and jumps have no effect on scheduling (except for register dependencies involving their operands). Instructions on opposite sides of an incorrectly predicted branch or jump, however, always conflict. Another way to think of this is that the sequence of pending cycles is flushed whenever an incorrect prediction is made. Note that we generally assume no other penalty for failure. This assumption is optimistic; in most real architectures, a failed prediction causes a bubble of one or more cycles in the pipeline, in which no execution can occur. We will return to this topic later.

We can also allow instructions to move past a certain number of incorrectly predicted branches. This corresponds to architectures that speculatively execute instructions along both possible paths, up to a certain fanout limit. None of the experiments described here used this ability.

Four levels of alias analysis are available. We can assume perfect alias analysis, in which we look at the actual memory address referenced by a load or store; a store conflicts with a load or store only if they access the same location. We can also assume no alias analysis whatsoever, so that a store always conflicts with a load or store. Between these two extremes would be alias analysis as a smart vectorizing compiler might do it. We don't have such a compiler, but we have implemented two intermediate schemes that may give us some insight.

One intermediate scheme is alias analysis by inspection. This is a common technique in compile-time instruction-level code schedulers. We look at the two instructions to see if it is obvious that they are independent; the two ways this might happen are shown in Figure 2.

    (a)  r1 := 0[r9]            (b)  r1 := 0[fp]
         4[r9] := r2                 0[gp] := r2

      Figure 2. Alias analysis by inspection.

The two instructions in 2(a) cannot conflict, because they use the same base register but different displacements. The two instructions in 2(b) cannot conflict, because one is manifestly a reference to the stack and the other is manifestly a reference to the global data area.

The other intermediate scheme is called alias analysis by compiler, even though our own compiler doesn't do it. Under this model, we assume perfect analysis of stack and global references, regardless of which registers are used to make them: a store to an address on the stack conflicts only with a load or store to the same address. Heap references are resolved by inspection. The idea behind alias analysis by compiler is that references outside the heap can often be resolved by the compiler, by doing dataflow analysis and possibly by solving diophantine equations over loop indexes, whereas heap references are often less tractable. Neither of these assumptions is particularly defensible. Many languages allow pointers into the stack and global areas, rendering them as difficult as the heap. Practical considerations such as separate compilation may also keep us from analyzing non-heap references perfectly. On the other hand, even heap references are not as hopeless as this model assumes [2,6]. Nevertheless, our range of four alternatives provides some intuition about the effects of alias analysis on instruction-level parallelism.
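A minimal sketch of the by-inspection check described above, assuming each memory reference is represented by a base register, a displacement, and a load/store flag (the representation and the register names are illustrative):

    from collections import namedtuple

    MemRef = namedtuple("MemRef", "is_store base disp")

    def independent_by_inspection(a, b):
        """True if two memory references obviously cannot touch the same location."""
        if not (a.is_store or b.is_store):
            return True                              # two loads never conflict
        if a.base == b.base and a.disp != b.disp:
            return True                              # same base, different displacement (Figure 2a)
        stack_bases, global_bases = {"fp", "sp"}, {"gp"}
        if (a.base in stack_bases and b.base in global_bases) or \
           (a.base in global_bases and b.base in stack_bases):
            return True                              # one is on the stack, the other is global (Figure 2b)
        return False

    # Example corresponding to Figure 2(a):
    print(independent_by_inspection(MemRef(False, "r9", 0), MemRef(True, "r9", 4)))   # True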
We can also assume no alias analysis, so a store default the window discrete fresh at the size about the effects of parallelism. that can appear in the pending cycles at any instructions, We can we look some intuition window By manage this ability. assume perfect provides alias analysis on instruction-level This of corresponds instructions alternatives to architectures that speculatively execute from both possible paths, up to a certain number ~anout side, even heap references are not as hopeless as this model assumes [2,6]. Nevertheless, our range of four resulting whatsoever at WRL, are shown run on accompanying an official data set. 5000, version is a trademark toy in Figure “short” The programs which benchmarks, and six SPEC has a MIPS 1.31 compilers of MIPS Computer were seven benchmarks. 3. The SPEC test data, but the data set rather than were compiled for a R3000* used. Systems, Inc. processor. 4. ResuIts. static jump We ran these test programs configurations. dynamic for a wide range of The results we have are tabulated in the formance some interesting trends. To provide a framework for our exploration, we defined a series of five increasingly prediction but we will extract instructions. Many of the results show the effects of variations we present It would be interesting starts to degrade, and to explore how well static does if the profile (54L is from a non-identical t ! will pe$etv 56 i .. -’,..- Stupid Fair jump reg alias renaming analysis . . . ,’ ,’ none none none none infinite infinite inspection 256 ,’ ,’ 36 ~ 28 ,’ , ,, >.. ,’ ~ ,. . ,’ , ,’ doduc ,, r ,’ ..- 48 I ~ 3 ~ ,’ ,’ ..-” predict .- -- ;:, i .. .. 52 Note that even the Fair model is quite ambitious. branch i I infinite infinite 256 perfect Great infinite infinite perfect perfeet espresso Cconr egrep 32 w. g& 24 Good run * I 60 on these standard models. predict to of the program. ambitious models spanning the possible range. These five are specified in Figure 4; the window size in each is 2K does a bit better than our simple scheme. explore how small the hardware table can be before per- some of them to show appendix, prediction prediction Lmpack ~~$onome 20 Stanford 16 liwo 12 Perfect perfect ~rfect perfect perfect Whetstones Livennore 8 4 Figure 4. Five increasingly ambitious models. S~pid injinite jinite 96% Linpack Good Fair Great Perfect jumps branches 96% static injinite jinite 95% 99% 99% 96% 19% 77% Figure 6. Parallelism static Livermore 98% 98% 98% 19% Stanford 90% 90% 89% 69% 69% 71% Whetstones 90% 90% 92% 80% 80% 88% sed 97% 97% 97% 96% 96% 97% egrep 90% 90% 91% 98% 98% 98% yacc 95% 95% 92% 75% 75% 71% met 92% 92% 92% 77% 77% 65% 4.2. Parallelism under under the five models. the five models. Figure 6 shows the parallelism each of the five models. of each program for The numeric programs are shown as dotted lines. Unsurprisingly, the Stupid model rarely gets above 2; the lack of branch prediction means that it finds only intra-block parallelism, and the lack of renaming and alias analysis means it won’t find much of grr 85% 84% 82% 67% 66% 64% eco 92% 92% 91% 47% 47% 56% Ccom 90% 90% 90% 55% 54% 64% has parallelism ti 90% 90% 90% 56% 55% 70% tomcatv 99% 99% 99% 58% 58% 72~o branch prediction, perfect alias analysis, and perfect register renaming would lead us down a dangerous gar- doduc 95% 95% 95% 39% 39% 62% den path. espresso 89% 89% 87% 65% 65% 53% feeee 92% 91% 8870 84% 84% 80% gcc 90% 89’70 90% 55% 54% 60% mean 92% 92% 92% 67% 67% 73% that. 
4.1. Branch and jump prediction

The success of two-bit branch prediction has been reported elsewhere [9,10]. Our results were comparable and are shown in Figure 5. It makes little difference whether we use an infinite table or one with only 2K entries, even though several of the programs are more than twenty thousand instructions long. Static branch prediction based on a profile does almost exactly as well as hardware prediction across all of the tests, and static jump prediction does a bit better than our simple hardware scheme. It would be interesting to explore how small the hardware table can be before performance starts to degrade, and to explore how well static prediction does if the profile is from a non-identical run of the program.

                          branches                     jumps
                  infinite  finite  static    infinite  finite  static
    Linpack          96%      96%     95%        99%      99%     96%
    Livermore        98%      98%     98%        19%      19%     77%
    Stanford         90%      90%     89%        69%      69%     71%
    Whetstones       90%      90%     92%        80%      80%     88%
    sed              97%      97%     97%        96%      96%     97%
    egrep            90%      90%     91%        98%      98%     98%
    yacc             95%      95%     92%        75%      75%     71%
    met              92%      92%     92%        77%      77%     65%
    grr              85%      84%     82%        67%      66%     64%
    eco              92%      92%     91%        47%      47%     56%
    ccom             90%      90%     90%        55%      54%     64%
    li               90%      90%     90%        56%      55%     70%
    tomcatv          99%      99%     99%        58%      58%     72%
    doduc            95%      95%     95%        39%      39%     62%
    espresso         89%      89%     87%        65%      65%     53%
    fpppp            92%      91%     88%        84%      84%     80%
    gcc              90%      89%     90%        55%      54%     60%
    mean             92%      92%     92%        67%      67%     73%

      Figure 5. Success rates of branch and jump prediction.

4.2. Parallelism under the five models

Figure 6 shows the parallelism of each program under each of the five models; the numeric programs are shown as dotted lines. Unsurprisingly, the Stupid model rarely gets above 2: the lack of branch prediction means that it finds only intra-block parallelism, and the lack of renaming and alias analysis means it won't find much of that. The Fair model is better, with parallelism between 2 and 4 common. Even the Great model, however, rarely has parallelism above 8.

      Figure 6. Parallelism under the five models.

It is interesting that Whetstones and Livermore, two of our numeric benchmarks, do poorly even under the Perfect model. This is the result of Amdahl's Law: if we compute the parallelism of each Livermore loop independently, the values range from 2.4 to 29.9, with a median around 5. Speeding up a few loops 30-fold simply means that the cycles needed for the less parallel loops dominate the total. A study that assumed perfect branch and jump prediction, perfect alias analysis, and perfect register renaming would lead us down a dangerous garden path. So would a study that included only fpppp and tomcatv, unless that's really all we want to run on our machine.
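The Amdahl's Law effect is easy to see with a small worked example. Suppose, purely for illustration, that two loops each account for half of a program's executed instructions, and one achieves a parallelism of 30 while the other achieves only 2. The aggregate parallelism is total instructions divided by total cycles, and it is dominated by the less parallel loop:

    def aggregate_parallelism(parts):
        """parts: list of (instruction_count, parallelism) for each program region."""
        instructions = sum(n for n, _ in parts)
        cycles = sum(n / p for n, p in parts)
        return instructions / cycles

    # Hypothetical split: two regions of equal size, one highly parallel, one not.
    print(aggregate_parallelism([(1_000_000, 30.0), (1_000_000, 2.0)]))   # ~3.75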
4.3. Effects of unrolling

Loop unrolling should have some effect on the parallelism under the five models. We explored this by unrolling two of the benchmarks by hand. The normal Linpack benchmark is already unrolled four times, so we made a version of it unrolled ten times, and a version in which we rolled the normal unrolling back up. We also unrolled the loops of the Livermore benchmark, by four and by ten. We did the unrolling in two different ways. One is naive unrolling, in which the loop body is simply replicated ten times with suitable adjustments to array indexes and so on. The other is careful unrolling, in which computations involving accumulators (like the scalar product in saxpy) are reassociated to increase parallelism, and in which assignments to array members are delayed until after all the calculations are complete, so that false memory conflicts do not interfere with doing the calculations in parallel.

Figure 7 shows the result; the dotted lines are the 10-unrolled versions. Unrolling helps in the more ambitious models, although unrolling Linpack by 10 doesn't make it much more parallel than unrolling by 4. The difference between unrolling by 4 and by 10 disappears altogether under the Perfect model, which achieves a parallelism of 64 in each case on the saxpy routine; saxpy accounts for 75% of the executed instructions. The aggregate parallelism stays lower because the next most frequently executed code is the loop in the matgen routine. This loop includes an embedded random-number generator, and each iteration is thus very dependent on its predecessor. This confirms the importance of using whole-program traces; studies that considered only the parallelism in saxpy would be quite misleading.

      Figure 7. Unrolling under the five models.

Naive unrolling actually hurts the parallelism slightly under the Fair model. The reason is fairly simple. The Fair model uses alias analysis by inspection, which is not always sufficient to resolve the conflict between a store at the end of one iteration and the loads at the beginning of the next. In naive unrolling the loop body is simply replicated, and these memory conflicts impose the same rigid framework on the dependency structure as they did before unrolling. The unrolled versions, however, have slightly less to do within that framework, because 3/4 or 9/10 of the loop overhead has been removed. As a result, the parallelism goes down slightly. Even when alias analysis by inspection is adequate, unrolling the loops either naively or carefully sometimes causes the compiler to spill some registers. This is even harder for the alias analysis to deal with, because the spill references usually have a different base register than the array references. Loop unrolling is a good way to increase the available parallelism, but it is clear we must integrate unrolling with the rest of our techniques better than we have been able to do here.
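To make the distinction concrete, here is a small sketch (in Python rather than the Fortran of the benchmarks) of a four-way unrolled summation. The naive form keeps a single accumulator, so every addition still depends on the one before it; the careful form reassociates the sum into independent partial accumulators, which is the kind of rewriting described above:

    def naive_unrolled_sum(x):
        # Naive unrolling: one accumulator, so each addition depends on the last.
        s = 0.0
        i = 0
        while i + 4 <= len(x):
            s += x[i]
            s += x[i + 1]
            s += x[i + 2]
            s += x[i + 3]
            i += 4
        return s + sum(x[i:])

    def careful_unrolled_sum(x):
        # Careful unrolling: reassociated into four independent accumulators
        # that could be updated in parallel, combined once at the end.
        s0 = s1 = s2 = s3 = 0.0
        i = 0
        while i + 4 <= len(x):
            s0 += x[i]
            s1 += x[i + 1]
            s2 += x[i + 2]
            s3 += x[i + 3]
            i += 4
        return s0 + s1 + s2 + s3 + sum(x[i:])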
4.4. Effects of window size

Our standard models all have a window size of 2K instructions: the scheduler is allowed to keep that many instructions in pending cycles at one time. Typical superscalar hardware is unlikely to handle windows of that size, but software techniques like trace scheduling for a VLIW machine might. Figure 8 shows the effect of varying the size of a continuous window from 2K instructions down to 4, under the Great and Perfect models. Unsurprisingly, the Perfect model does better the bigger its window. Under the Great model, which does not have perfect branch prediction, most programs do as well with a 32-instruction window as with a larger one; below that, the parallelism drops off quickly. The Good model is not shown, but it looks almost identical to the Great model.

      Figure 8(a). Size of continuous windows under the Great model.
      Figure 8(b). Size of continuous windows under the Perfect model.

4.5. Effects of using discrete windows

Figure 9 is the same as Figure 8, except that we assume discrete rather than continuous windows: the scheduler fetches a window of instructions, schedules them into cycles, executes them, and then starts over with a fresh window. Under the same model we get relatively less parallelism with discrete windows than with continuous ones, but the difference decreases as the window size increases, before the curves level off. If we have very small windows, it might pay to manage them continuously; in other words, continuous management of a small window is about as good as multiplying the window size by 4. As before, the Perfect model does better the larger the window, but even with the 2K window the parallelism from discrete windows is only two-thirds that from continuous windows.

      Figure 9(a). Size of discrete windows under the Great model.
      Figure 9(b). Size of discrete windows under the Perfect model.

4.6. Effects of branch and jump prediction

We have several levels of branch and jump prediction. Figure 10 shows the results of varying these while register renaming and alias analysis stay perfect. Reducing the level of jump prediction can have a large effect, but only if we have perfect branch prediction; otherwise, removing jump prediction altogether has little effect. This graph does not show static or finite prediction: it turns out to make little difference whether prediction is infinite, finite, or static, because they have nearly the same success rates.

      Figure 10. Effect of branch and jump prediction with perfect alias analysis and register renaming.

That jump prediction has little effect on the parallelism under non-Perfect models does not mean that jump prediction is useless. In a real machine, a jump predicted incorrectly (or not at all) may result in a bubble in the pipeline: a series of cycles in which no execution occurs, while the unexpected instructions are fetched, decoded, and started down the execution pipeline. Depending on the penalty, this may have a serious effect on performance. Figure 11 shows the degradation of parallelism under the Great model, assuming that each mispredicted branch or jump adds N cycles with no instructions in them. If we assume instead that all indirect jumps (but not all branches) are mispredicted, the right ends of these curves drop by 10% to 30%. Livermore and Linpack stay relatively horizontal over the entire range: they execute fewer branches and jumps, and their branches are comparatively predictable. Tomcatv and fpppp are above the range of this graph, but their curves have slopes about the same as these.

      Figure 11. Effect of a misprediction cycle penalty on the Great model.
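The penalty model behind Figure 11 amounts to a simple adjustment of the parallelism formula. A minimal sketch, assuming the cycle count and misprediction count are already known from a scheduling run, with purely hypothetical numbers for the example:

    def parallelism_with_penalty(instructions, cycles, mispredictions, penalty):
        """Recompute parallelism when each misprediction adds `penalty` empty cycles."""
        return instructions / (cycles + mispredictions * penalty)

    # Hypothetical run: 1,000,000 instructions in 150,000 cycles with 30,000
    # mispredictions; the achieved parallelism degrades as the penalty grows.
    for penalty in range(0, 11, 2):
        print(penalty, round(parallelism_with_penalty(1_000_000, 150_000, 30_000, penalty), 2))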
4.7. Effects of alias analysis and register renaming

We also ran several experiments varying the alias analysis in isolation, and varying the register renaming in isolation. Figure 12 shows the effect of varying the alias analysis under the Perfect and Great models. We can see that alias analysis by inspection isn't very powerful; it rarely increased the parallelism by more than 0.5. Alias analysis by compiler was (by definition) indistinguishable from perfect alias analysis on programs that do not use the heap, and was somewhat helpful even on those that do. There remains a gap between this analysis and perfection, which suggests that the payoff of further work on heap disambiguation may be significant. Unless branch prediction is perfect, however, even perfect alias analysis usually leaves us with parallelism between 4 and 8.

      Figure 12(a). Effect of alias analysis on the Perfect model.
      Figure 12(b). Effect of alias analysis on the Great model.

Figure 13 shows the effect of varying the register renaming under the Perfect and Great models. Dropping from infinitely many registers to 256 CPU and 256 FPU registers rarely had a large effect unless the other parameters were perfect. Under the Great model, register renaming with 32 registers, the number on the actual machine, yielded parallelism roughly halfway between no renaming and perfect renaming.

      Figure 13(a). Effect of register renaming on the Perfect model.
      Figure 13(b). Effect of register renaming on the Great model.

4.8. Conclusions

Good branch prediction, by hardware or software, is critical to the exploitation of more than modest amounts of instruction-level parallelism. Jump prediction can reduce the penalty for indirect jumps, but has little effect on the parallelism of non-penalty cycles. Register renaming is important as well, though a compiler might be able to do an adequate job with static analysis if it knows it is compiling for parallelism.

Even ambitious models combining the techniques discussed here are disappointing. Figure 14 shows the parallelism achieved by a quite ambitious hardware-style model, with branch and jump prediction using infinite tables, 256 FPU and 256 CPU registers used with LRU renaming, perfect alias analysis, and windows of 64 instructions maintained continuously. The average parallelism is around 7, the median around 5.

      Figure 14. The parallelism from an ambitious hardware model
      (zero-conflict branch and jump prediction, 256 CPU and 256 FPU registers,
      perfect alias analysis, continuous windows of 64 instructions).

Figure 15 shows the parallelism achieved by a quite ambitious software-style model, with static branch and jump prediction, 256 FPU and 256 CPU registers used with LRU renaming, perfect alias analysis, and windows of 2K instructions maintained continuously. The average here is closer to 9, but the median is still around 5. A consistent speedup of 5 would be quite good, but we cannot honestly expect more (at least without developing techniques beyond those discussed here).

      Figure 15. The parallelism from an ambitious software model
      (static branch and jump prediction, 256 CPU and 256 FPU registers,
      perfect alias analysis, continuous windows of 2K instructions).

We must also remember the simplifying assumptions this study makes. We have assumed that all operations have a latency of one cycle; in practice an instruction with a larger latency uses up some of the available parallelism. We have assumed unlimited resources, including a perfect cache; as the memory bottleneck gets worse it may be more helpful to have a bigger on-chip cache than a lot of duplicate functional units. We have assumed that there is no penalty for a missed prediction; in practice the penalty may be many empty cycles. We have assumed a uniform machine cycle time, though adding superscalar capability will surely not decrease the cycle time and may in fact increase it.
We have assumed uniform technology, but an ordinary machine may have a shorter time-to-market and therefore use newer, faster technology. Sadly, any one of these considerations could reduce our expected parallelism by a third; together they could eliminate it completely.

References

[1] Tilak Agerwala and John Cocke. High performance reduced instruction set processors. IBM Thomas J. Watson Research Center Technical Report #55845, March 31, 1987.

[2] David R. Chase, Mark Wegman, and F. Kenneth Zadeck. Analysis of pointers and structures. Proceedings of the SIGPLAN '90 Conference on Programming Language Design and Implementation, pp. 296-310. Published as SIGPLAN Notices 25 (6), June 1990.

[3] Jack J. Dongarra. Performance of various computers using standard linear equations software in a Fortran environment. Computer Architecture News 11 (5), pp. 22-27, December 1983.

[4] Joseph A. Fisher, John R. Ellis, John C. Ruttenberg, and Alexandru Nicolau. Parallel processing: A smart compiler and a dumb machine. Proceedings of the SIGPLAN '84 Symposium on Compiler Construction, pp. 37-47. Published as SIGPLAN Notices 19 (6), June 1984.

[5] John Hennessy. Stanford benchmark suite. Personal communication.

[6] Neil D. Jones and Steven S. Muchnick. A flexible approach to interprocedural data flow analysis and programs with recursive data structures. Ninth Annual ACM Symposium on Principles of Programming Languages, pp. 66-74, January 1982.

[7] Norman P. Jouppi and David W. Wall. Available instruction-level parallelism for superscalar and superpipelined machines. Third International Symposium on Architectural Support for Programming Languages and Operating Systems, pp. 272-282, April 1989. Published as Computer Architecture News 17 (2), Operating Systems Review 23 (special issue), SIGPLAN Notices 24 (special issue). Also available as WRL Research Report 89/7.

[8] Monica Lam. Software pipelining: An effective scheduling technique for VLIW machines. Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, pp. 318-328. Published as SIGPLAN Notices 23 (7), July 1988.

[9] Johnny K. F. Lee and Alan J. Smith. Branch prediction strategies and branch target buffer design. Computer 17 (1), pp. 6-22, January 1984.

[10] Scott McFarling and John Hennessy. Reducing the cost of branches. Thirteenth Annual Symposium on Computer Architecture, pp. 396-403. Published as Computer Architecture News 14 (2), June 1986.

[11] Alexandru Nicolau and Joseph A. Fisher. Measuring the parallelism available for very long instruction word architectures. IEEE Transactions on Computers C-33 (11), pp. 968-976, November 1984.
[12] J. E. Smith. A study of branch prediction strategies. Eighth Annual Symposium on Computer Architecture, pp. 135-148. Published as Computer Architecture News 9 (3), 1981.

[13] Michael D. Smith, Mike Johnson, and Mark A. Horowitz. Limits on multiple instruction issue. Third International Symposium on Architectural Support for Programming Languages and Operating Systems, pp. 290-302, April 1989. Published as Computer Architecture News 17 (2), Operating Systems Review 23 (special issue), SIGPLAN Notices 24 (special issue).

[14] G. S. Tjaden and M. J. Flynn. Detection and parallel execution of independent instructions. IEEE Transactions on Computers C-19 (10), pp. 889-895, October 1970.

Appendix. Parallelism under many models.

On the next two pages are the results of running the test programs under more than 100 different configurations. The columns labelled "Livcu10" and "Lincu10" are the Livermore and Linpack benchmarks unrolled carefully 10 times. The configurations are keyed by the following abbreviations:

    bPerf   perfect branch prediction, 100% correct.
    bInf    2-bit branch prediction with infinite table.
    b2K     2-bit branch prediction with 2K-entry table.
    bStat   static branch prediction from profile.
    bNone   no branch prediction.

    jPerf   perfect indirect jump prediction, 100% correct.
    jInf    indirect jump prediction with infinite table.
    j2K     indirect jump prediction with 2K-entry table.
    jStat   static indirect jump prediction from profile.
    jNone   no indirect jump prediction.

    rPerf   perfect register renaming: infinitely many registers.
    rN      register renaming with N cpu and N fpu registers.
    rNone   no register renaming: use registers as compiled.

    aPerf   perfect alias analysis: use actual addresses to decide.
    aComp   "compiler" alias analysis.
    aInsp   alias analysis "by inspection."
    aNone   no alias analysis.

    wN      continuous window of N instructions, default 2K.
    dwN     discrete window of N instructions, default 2K.

      [Appendix tables: parallelism of each test program under each configuration.]