Simultaneous Multithreading: Maximizing On-Chip Parallelism

Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195

Abstract

This paper examines simultaneous multithreading, a technique permitting several independent threads to issue instructions to a superscalar's multiple functional units in a single cycle. We present several models of simultaneous multithreading and compare them with alternative organizations: a wide superscalar, a fine-grain multithreaded processor, and single-chip, multiple-issue multiprocessing architectures. Our results show that both (single-threaded) superscalar and fine-grain multithreaded architectures are limited in their ability to utilize the resources of a wide-issue processor. Simultaneous multithreading has the potential to achieve 4 times the throughput of a superscalar, and double that of fine-grain multithreading. We evaluate several cache configurations made possible by this type of organization and evaluate tradeoffs between them. We also show that simultaneous multithreading is an attractive alternative to single-chip multiprocessors; simultaneous multithreaded processors with a variety of organizations outperform conventional multiprocessors with similar execution resources. While simultaneous multithreading has excellent potential to increase processor utilization, it can add substantial complexity to the design. We examine many of these complexities and evaluate alternative organizations in the design space.

1 Introduction

This paper examines simultaneous multithreading (SM), a technique that permits several independent threads to issue to multiple functional units each cycle. In the most general case, the binding between thread and functional unit is completely dynamic. The objective of SM is to substantially increase processor utilization in the face of both long memory latencies and limited available parallelism per thread. Simultaneous multithreading combines the multiple-instruction-issue features of modern superscalar processors with the latency-hiding ability of multithreaded architectures. It also inherits numerous design challenges from these architectures, e.g., achieving high register file bandwidth, supporting high memory access demands, meeting large forwarding requirements, and scheduling instructions onto functional units. In this paper, we (1) introduce several SM models, most of which limit key aspects of the complexity of such a machine, (2) evaluate the performance of those models relative to superscalar execution and fine-grain multithreading, (3) show how to tune the cache hierarchy for SM processors, and (4) demonstrate the potential performance and real-estate advantages of SM architectures over small-scale, on-chip multiprocessors.

Current microprocessors employ various techniques to increase parallelism and processor utilization; however, each technique has its limits. For example, modern superscalars, such as the DEC Alpha 21164 [11], PowerPC 604 [9], MIPS R10000 [24], Sun UltraSparc [25], and HP PA-8000 [26], issue up to four instructions per cycle from a single thread. Multiple instruction issue has the potential to increase performance, but is ultimately limited by instruction dependencies (i.e., the available parallelism) and long-latency operations within the single executing thread. The effects of these are shown as horizontal waste and vertical waste in Figure 1. Multithreaded architectures, on the other hand, such as HEP [28], Tera [3], MASA [15], and Alewife [2], employ multiple threads with fast context switch between threads. Traditional multithreading hides memory and functional unit latencies, attacking vertical waste. In any one cycle, though, these architectures issue instructions from only one thread. The technique is thus limited by the amount of parallelism that can be found in a single thread in a single cycle; and as issue width increases, the ability of traditional multithreading to utilize processor resources will decrease. Simultaneous multithreading, in contrast, attacks both horizontal and vertical waste.
This study evaluates the potential improvement, relative to wide superscalar architectures and conventional multithreaded architectures, of various simultaneous multithreading models. To place our evaluation in the context of modern superscalar processors, we simulate a base architecture derived from the 300 MHz Alpha 21164 [11], enhanced for wider superscalar execution; our SM architectures are extensions of that basic design. Since code scheduling is crucial on wide superscalars, we generate code using the state-of-the-art Multiflow trace scheduling compiler [20].

Our results show the limits of superscalar execution and traditional multithreading to increase instruction throughput in future processors. For example, we show that (1) even an 8-issue superscalar architecture fails to sustain 1.5 instructions per cycle, and (2) a fine-grain multithreaded processor (capable of switching contexts every cycle at no cost) utilizes only about 40% of a wide superscalar, regardless of the number of threads. Simultaneous multithreading, on the other hand, provides significant performance improvements in instruction throughput, and is only limited by the issue bandwidth of the processor.

A more traditional means of achieving parallelism is the conventional multiprocessor. As chip densities increase, single-chip multiprocessors will become a viable design option [7]. The simultaneous multithreaded processor and the single-chip multiprocessor are two close organizational alternatives for increasing on-chip execution resources. We compare these two approaches and show that simultaneous multithreading is potentially superior to multiprocessing in its ability to utilize processor resources. For example, a single simultaneous multithreaded processor with 10 functional units outperforms by 24% a conventional 8-processor multiprocessor with a total of 32 functional units, when they have equal issue bandwidth.

(This research was supported by ONR grants N00014-92-J-1395 and N00014-94-1-1136, NSF grants CCR-9200832 and CDA-9123308, NSF PYI Award MIP-9058439, the Washington Technology Center, Digital Equipment Corporation, and a Microsoft Fellowship.)

Figure 1: Empty issue slots can be defined as either vertical waste or horizontal waste. Vertical waste is introduced when the processor issues no instructions in a cycle; horizontal waste occurs when not all issue slots can be filled in a cycle. Superscalar execution (as opposed to single-issue execution) both introduces horizontal waste and increases the amount of vertical waste. [Diagram of issue slots over consecutive cycles, with full and empty issue slots marked; in the example shown, horizontal waste = 9 slots and vertical waste = 12 slots.]
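To make the waste taxonomy of Figure 1 concrete, the following sketch classifies empty issue slots from a per-cycle issue trace. This is our own minimal illustration of the definitions above, not code from the study; the function name and trace format are assumptions.

```python
def classify_waste(issued_per_cycle, issue_width=8):
    """Split empty issue slots into vertical and horizontal waste.

    issued_per_cycle: entry i is the number of instructions issued
    in cycle i (0 <= entry <= issue_width).
    """
    vertical = horizontal = 0
    for issued in issued_per_cycle:
        if issued == 0:
            vertical += issue_width             # completely idle cycle
        else:
            horizontal += issue_width - issued  # partially filled cycle
    total_slots = issue_width * len(issued_per_cycle)
    return {
        "vertical": vertical,
        "horizontal": horizontal,
        "utilization": 1 - (vertical + horizontal) / total_slots,
    }

# An 8-issue machine over four cycles: one idle cycle wastes a full
# issue width vertically; partially filled cycles waste horizontally.
print(classify_waste([3, 0, 8, 1]))
# {'vertical': 8, 'horizontal': 12, 'utilization': 0.375}
```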
Previous studies have examined architectures that exhibit simultaneous multithreading through various combinations of VLIW, superscalar, and multithreading features, both analytically [34] and through simulation [16, 17, 6, 23]; we discuss these in detail in Section 7. Our work differs from and extends that work in multiple respects: (1) the methodology, including the accuracy and detail of our simulations, the workload, and the wide-issue compiler optimization and scheduling technology; (2) the variety of SM models we simulate; (3) our analysis of cache interactions with simultaneous multithreading; and finally, (4) our comparison and evaluation of multiprocessing and simultaneous multithreading.

This paper is organized as follows. Section 2 defines in detail our basic machine model, the workloads that we measure, and the simulation environment that we constructed. Section 3 evaluates the performance of a single-threaded superscalar architecture; it provides motivation for the simultaneous multithreaded approach. Section 4 presents the performance of a range of SM architectures and compares them to the superscalar architecture, as well as to a fine-grain multithreaded processor. Section 5 explores the effect of cache design alternatives on the performance of simultaneous multithreading. Section 6 compares the SM approach with conventional single-chip multiprocessor architectures. We discuss related work in Section 7, and summarize our results in Section 8.

2 Methodology

Our goal is to evaluate several architectural alternatives as defined in the previous section: wide superscalars, traditional multithreaded processors, simultaneous multithreaded processors, and small-scale, multiple-issue multiprocessors. To do this, we have developed a simulation environment that defines an implementation of a simultaneous multithreaded architecture; that architecture is a straightforward extension of next-generation wide superscalar processors, running a real multiprogrammed workload that is highly optimized for execution on our target machine.

2.1 Simulation Environment

Our simulator uses emulation-based, instruction-level simulation, similar to Tango [8] and g88 [4]. Like g88, it features caching of partially decoded instructions for fast emulated execution. Our simulator models the execution pipelines, the memory hierarchy (both in terms of hit rates and bandwidths), the TLBs, and the branch prediction logic of a wide superscalar processor. It is based on the Alpha AXP 21164, augmented first for wider superscalar execution and then for multithreaded execution. The model deviates from the Alpha in some respects to support increased single-stream parallelism, such as more flexible instruction issue, improved branch prediction, and larger, higher-bandwidth caches.

Because we have speculated on the structure of a simultaneous multithreaded processor, for which an implementation does not yet exist, our architecture may be optimistic in two respects: first, in the number of pipeline stages required for issue; second, in the data cache access time (or load delay cycles) for a shared cache, which affects our comparisons with single-chip multiprocessors. The likely magnitude of these effects is discussed in Sections 2.1 and 6, respectively. Real implementations may see reduced performance due to various design tradeoffs, given the other constraints of our architecture; we intend to explore these implementation issues in future work. Our results thus serve, at the least, as an upper bound on simultaneous multithreading performance.

The typical simulated configuration contains 10 functional units of four types (four integer, two floating point, three load/store, and one branch) and a maximum issue rate of 8 instructions per cycle. We assume that all functional units are completely pipelined. Table 1 shows the instruction latencies used in the simulations, which are derived from the Alpha 21164.

Table 1: Simulated instruction latencies.

  Instruction Class                          Latency
  integer multiply                           8, 16
  conditional move                           2
  compare                                    0
  all other integer                          1
  FP divide                                  17, 30
  all other FP                               4
  load (L1 cache hit, no bank conflicts)     2
  load (L2 cache hit)                        8
  load (L3 cache hit)                        14
  load (memory)                              50
  control hazard (br or jmp predicted)       1
  control hazard (br or jmp mispredicted)    6
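As a reading aid, the sketch below encodes Table 1 as the kind of latency lookup an emulation-based simulator might consult when charging completion times. It is our own illustrative reconstruction, not the authors' simulator; the class names are assumptions, and where Table 1 lists two values we expose the choice as a flag.

```python
# Table 1 as a latency lookup (cycles). Entries with two values in the
# paper (e.g., integer multiply 8/16) presumably depend on the operation
# variant; we model them as a (short, long) pair.
LATENCY = {
    "int_multiply":        (8, 16),
    "cond_move":           (2, 2),
    "compare":             (0, 0),
    "int_other":           (1, 1),
    "fp_divide":           (17, 30),
    "fp_other":            (4, 4),
    "load_l1_hit":         (2, 2),
    "load_l2_hit":         (8, 8),
    "load_l3_hit":         (14, 14),
    "load_memory":         (50, 50),
    "branch_predicted":    (1, 1),
    "branch_mispredicted": (6, 6),
}

def completion_cycle(issue_cycle, instr_class, long_form=False):
    """Cycle at which a fully pipelined functional unit delivers a result."""
    short, long_ = LATENCY[instr_class]
    return issue_cycle + (long_ if long_form else short)

assert completion_cycle(100, "load_l2_hit") == 108
assert completion_cycle(100, "fp_divide", long_form=True) == 130
```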
We assume first- and second-level on-chip caches considerably larger than on the Alpha, for two reasons. First, multithreading puts a larger strain on the cache subsystem; second, we expect larger on-chip caches to be common in the same time-frame in which simultaneous multithreading becomes viable. We also ran simulations with caches closer to current processors; we discuss these experiments as appropriate, but do not show any results. The caches (Table 2) are multi-ported by interleaving them into banks, similar to the design of Sohi and Franklin [30]. An instruction cache access occurs whenever the program counter crosses a 32-byte boundary; otherwise, the instruction is fetched from the prefetch buffer. We model lockup-free caches and TLBs. TLB misses require two full memory accesses and no execution resources.

Table 2: Details of the cache hierarchy.

                      ICache    DCache    L2 Cache    L3 Cache
  Size                64 KB     64 KB     256 KB      2 MB
  Associativity       DM        DM        4-way       DM
  Line Size           32        32        32          32
  Banks               8         4         1           1
  Transfer time/bank  1 cycle   1 cycle   2 cycles    2 cycles

We do not assume any changes to the basic pipeline to accommodate simultaneous multithreading. The Alpha devotes a full pipeline stage to arrange instructions for issue and another to issue. If simultaneous multithreading requires more than two pipeline stages for instruction scheduling, the primary effect would be an increase in the misprediction penalty. We have run experiments showing that a one-cycle increase in the misprediction penalty has less than a 1% impact on instruction throughput in single-threaded mode. With 8 threads, where throughput is more tolerant of misprediction delays, the impact was less than .5%.
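The bank interleaving used to multi-port the caches (Table 2) can be sketched as follows. This is a hypothetical illustration, with helper names of our own, of how banking lets independent same-cycle accesses proceed while colliding accesses conflict; it is not the authors' implementation.

```python
LINE_SIZE = 32  # bytes, from Table 2

def bank_of(addr, n_banks):
    """Bank selected by an address: consecutive cache lines map to
    consecutive banks (interleaved, in the style of Sohi & Franklin [30])."""
    return (addr // LINE_SIZE) % n_banks

def has_bank_conflict(addrs, n_banks=4):
    """True if two same-cycle accesses target the same D-cache bank;
    a conflict delays one access, which is why Table 1 qualifies the
    2-cycle L1 load latency with 'no bank conflicts'."""
    banks = [bank_of(a, n_banks) for a in addrs]
    return len(banks) != len(set(banks))

# Two loads 128 bytes apart collide on a 4-bank cache;
# loads one line apart go to adjacent banks and proceed in parallel.
assert has_bank_conflict([0x1000, 0x1080]) is True
assert has_bank_conflict([0x1000, 0x1020]) is False
```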
We support limited dynamic execution. Dependence-free instructions are issued in-order to an eight-instruction-per-thread scheduling window; from there, instructions can be scheduled onto functional units out of order, depending on functional unit availability. Instructions not scheduled due to functional unit availability have priority in the next cycle. We complement this straightforward issue model with the use of state-of-the-art static scheduling, using the Multiflow trace scheduling compiler [20]. This reduces the benefits that might be gained by full dynamic execution and eliminates a great deal of complexity (e.g., we don't need register renaming unless we need precise exceptions, and we can use a simple 1-bit-per-register scoreboarding scheme) in the replicated register sets and fetch/decode pipes.

A 2048-entry, direct-mapped, 2-bit branch prediction history table [29] supports branch prediction; the table improves coverage of branch addresses relative to the Alpha, which only stores prediction information for branches that remain in the I cache. Conflicts in the table are not resolved. To predict return destinations, we use a 12-entry return stack like the 21164's (one return stack per hardware context). Our compiler does not support Alpha-style hints for computed jumps; we simulate the effect with a 32-entry jump table, which records the last jumped-to destination from a particular address.

For our multithreaded experiments, we assume support is added for up to eight hardware contexts. We support several models of simultaneous multithreaded execution, as discussed in Section 4. In most of our experiments, instructions are scheduled in a strict priority order: context 0 can schedule instructions onto any available functional unit, context 1 can schedule onto any unit unutilized by context 0, etc. Our experiments show that the overall instruction throughput of this scheme and a completely fair scheme are virtually identical for most of our experiments; only the relative speeds of the different threads change. The results from the priority scheme present us with some analytical advantages, as will be seen in Section 4, and the performance of the fair scheme can be extrapolated from the priority scheme results.

2.2 Workload

Our workload is the SPEC92 benchmark suite [10]. To gauge the raw instruction throughput achievable by multithreaded superscalar processors, we chose uniprocessor applications, assigning a distinct program to each thread. This models a parallel workload achieved by multiprogramming rather than parallel processing. In this way, throughput results are not affected by synchronization delays, inefficient parallelization, and other effects that would make it more difficult to see the performance impact of simultaneous multithreading alone.

In the single-thread experiments, all of the benchmarks are run to completion using the default data set(s) specified by SPEC. The multithreaded experiments are more complex; to reduce the effect of benchmark differences, a single data point is composed of B runs, each T * 500 million instructions in length, where T is the number of threads and B is the number of benchmarks. Each of the B runs uses a different ordering of the benchmarks, such that each benchmark is run once in each priority position. To limit the number of permutations, we use a subset of the benchmarks equal in size to the maximum number of threads (8).

We compile each program with the Multiflow trace scheduling compiler, modified to produce Alpha code scheduled for our target machine. The applications were each compiled with several different compiler options; the executable with the lowest single-thread execution time on our target hardware was used for all experiments. By maximizing single-thread parallelism through our compilation system, we avoid overstating the increases in parallelism achieved with simultaneous multithreading.

3 Superscalar Bottlenecks: Where Have All the Cycles Gone?

This section provides motivation for simultaneous multithreading by exposing the limits of wide superscalar execution, identifying the sources of those limitations, and bounding the potential improvement possible from specific latency-hiding techniques.

Using the base single-hardware-context machine, we measured the issue utilization, i.e., the percentage of issue slots that are filled each cycle, for most of the SPEC benchmarks. We also recorded the cause of each empty issue slot. For example, if the next instruction cannot be scheduled in the same cycle as the current instruction, then the remaining issue slots this cycle, as well as all issue slots for idle cycles between the execution of the current instruction and the next (delayed) instruction, are assigned to the cause of the delay. When there are overlapping causes, all cycles are assigned to the cause that delays the instruction the most; if the delays are additive, such as an I tlb miss and an I cache miss, the wasted cycles are divided up appropriately.
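The accounting rule just described can be stated precisely in a short sketch. This is our own rendering of the rule (names and data layout assumed), shown only to clarify how overlapping versus additive delays are charged.

```python
def attribute_waste(empty_slots, causes, additive=False):
    """Charge a stalled instruction's empty issue slots to delay causes.

    causes: dict mapping cause name -> cycles of delay that cause alone
    would impose. Overlapping causes: all slots go to the single longest
    delay. Additive causes (e.g., an I-TLB miss plus an I-cache miss on
    the same fetch): slots are divided in proportion to each delay.
    """
    charged = {cause: 0.0 for cause in causes}
    if additive:
        total = sum(causes.values())
        for cause, delay in causes.items():
            charged[cause] = empty_slots * delay / total
    else:
        worst = max(causes, key=causes.get)
        charged[worst] = float(empty_slots)
    return charged

# Overlapping: a data cache miss hides a shorter load delay entirely.
print(attribute_waste(112, {"dcache_miss": 14, "load_delay": 2}))
# {'dcache_miss': 112.0, 'load_delay': 0.0}

# Additive: TLB and cache miss on the same fetch split the waste 2:1.
print(attribute_waste(60, {"itlb_miss": 100, "icache_miss": 50}, additive=True))
# {'itlb_miss': 40.0, 'icache_miss': 20.0}
```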
Table 3 specifies all possible sources of wasted cycles in our model, and some of the latency-hiding or latency-reducing techniques that might apply to them. Previous work [32, 5, 18], in contrast, quantified barriers to parallelism by removing some of these same effects and measuring the resulting increases in performance.

Our results, shown in Figure 2, demonstrate that the functional units of our wide superscalar processor are highly underutilized. From the composite results bar on the far right, we see a utilization of only 19% (the "processor busy" component of the composite bar of Figure 2), which represents an average execution of less than 1.5 instructions per cycle on our 8-issue machine.

These results also indicate that there is no dominant source of wasted issue bandwidth. Although there are dominant items in individual applications (e.g., mdljsp2, swm, fpppp), the dominant cause is different in each case. In the composite results we see that the largest cause (short FP dependences) is responsible for 37% of the wasted issue bandwidth, but there are six other causes that each account for at least 4.5% of wasted cycles. Even completely eliminating any one factor will not necessarily improve performance to the degree that this graph might imply, because many of the causes overlap.

Figure 2: Sources of all unused issue cycles in an 8-issue superscalar processor, per application and in composite. "Processor busy" represents the utilized issue slots; all others represent wasted issue slots. [Stacked bar chart per application, 0-100% of issue slots; categories: memory conflict, long fp, short fp, long integer, short integer, load delays, control hazards, branch misprediction, dcache miss, icache miss, dtlb miss, itlb miss, processor busy.]

Not only is there no dominant cause of wasted cycles; there also appears to be no dominant solution. It is thus unlikely that any single latency-tolerating technique will produce a dramatic increase in the performance of these programs if it only attacks specific types of latencies. Instruction scheduling targets several important segments of the wasted issue bandwidth, but we expect that our compiler has already achieved most of the available gains in that regard. Current trends have been to devote increasingly larger amounts of on-chip area to caches, yet even if memory latencies are completely eliminated, we cannot achieve 40% utilization of this processor.
If specific latency-hiding techniques are limited, then any dramatic increase in parallelism needs to come from a general latency-hiding solution, of which multithreading is an example. The different types of multithreading have the potential to hide all sources of latency, but to different degrees.

This becomes clearer if we classify wasted cycles as either vertical waste (completely idle cycles) or horizontal waste (unused issue slots in a non-idle cycle), as shown previously in Figure 1. In our measurements, 61% of the wasted cycles are vertical waste; the remainder are horizontal waste. Traditional multithreading (coarse-grain or fine-grain) can fill cycles that contribute to vertical waste. Doing so, however, recovers only a fraction of the vertical waste; because of the inability of a single thread to completely fill the issue slots each cycle, traditional multithreading converts much of the vertical waste to horizontal waste, rather than eliminating it. Simultaneous multithreading has the potential to recover all issue slots lost to both horizontal and vertical waste. The next section provides details on how effectively it does so.

Table 3: All possible causes of wasted issue slots, and the latency-hiding or latency-reducing techniques that can reduce the number of cycles wasted by each cause.

  instruction tlb miss, data tlb miss: decrease the TLB miss rates (e.g., increase the TLB sizes); hardware instruction prefetching; hardware or software data prefetching; faster servicing of TLB misses
  I cache miss: larger, more associative, or faster instruction cache hierarchy; hardware instruction prefetching
  D cache miss: larger, more associative, or faster data cache hierarchy; hardware or software prefetching; improved instruction scheduling; more sophisticated dynamic execution
  branch misprediction: improved branch prediction scheme; lower branch misprediction penalty
  control hazard: speculative execution; more aggressive if-conversion
  load delays (first-level cache hits): shorter load latency; improved instruction scheduling; dynamic scheduling
  short integer delay: improved instruction scheduling
  long integer, short fp, long fp delays: shorter latencies; improved instruction scheduling (multiply is the only long integer operation, divide is the only long floating point operation)
  memory conflict (accesses to the same memory location in a single cycle): improved instruction scheduling

4 Simultaneous Multithreading

This section presents performance results for simultaneous multithreaded processors. We begin by defining several machine models of simultaneous multithreading, spanning a range of hardware complexities. We then show that simultaneous multithreading provides significant performance improvement over both single-thread superscalar and fine-grain multithreaded architectures, both in the limit and under less ambitious hardware assumptions.

4.1 The Machine Models

The following models reflect several possible design choices for a combined multithreaded, superscalar processor. The models differ in how threads can use issue slots and functional units each cycle; in all cases, however, the basic machine is a wide superscalar with 10 functional units capable of issuing 8 instructions per cycle (the same core machine as in Section 3). The models are:

- Fine-Grain Multithreading. Only one thread issues instructions each cycle, but it can use the entire issue width of the processor. This hides all sources of vertical waste, but does not hide horizontal waste. It is the only model that does not feature simultaneous multithreading. Among existing or proposed architectures, it is most similar to the Tera processor [3], which issues one 3-operation LIW instruction per cycle.

- SM:Full Simultaneous Issue. This is a completely flexible simultaneous multithreaded superscalar: all eight threads compete for each of the issue slots each cycle. It is the least realistic model in terms of hardware complexity, but provides insight into the potential of simultaneous multithreading. The following models each represent restrictions to this scheme that decrease hardware complexity.

- SM:Single Issue, SM:Dual Issue, and SM:Four Issue. These three models limit the number of instructions each thread can issue, or have active in the scheduling window, each cycle. For example, in the SM:Dual Issue model, each thread can issue a maximum of 2 instructions per cycle; therefore, a minimum of 4 threads would be required to fill the 8 issue slots in one cycle.

- SM:Limited Connection. Each hardware context is directly connected to exactly one of each type of functional unit. For example, if the hardware supports eight threads and there are four integer units, each integer unit could receive instructions from exactly two threads. The partitioning of functional units among threads is thus less dynamic than in the other models, but each functional unit is still shared (the critical factor in achieving high utilization).
Since the choice of functional units available to a single thread in the limited connection model is different than in our original target machine, we recompiled for a 4-issue (one of each type of functional unit) processor for that model.

Some important differences among the models are summarized in Table 4. Notice that the fine-grain model may not necessarily represent the cheapest implementation. Many of these complexity issues are inherited from our wide superscalar design rather than from multithreading, per se. Even in the SM:Full Simultaneous Issue model, the inter-instruction dependence checking, the ports per register file, and the forwarding logic scale with the issue bandwidth and the number of functional units, rather than with the number of threads. The number of ports needed for each register file and the logic to select instructions for issue in the dual-issue and four-issue models are comparable to those of current four-issue superscalars; scheduling instructions onto functional units is more complex, however, because more threads compete to issue. In the limited connection model, some forwarding could be eliminated. Our choice of ten functional units seems reasonable for an 8-issue processor; current 4-issue processors have between 4 and 9 functional units.

Table 4: A comparison of key hardware complexity features of the various models (H = high, M = medium, L = low complexity).

  Model                        Dependence    Register    Forwarding    Scheduling
                               checking      ports       logic         onto FUs
  Fine-Grain                   H             H           H             None
  SM:Single Issue              None          L           L             M
  SM:Dual Issue                L             M           M             M
  SM:Four Issue                M             M           M             M
  SM:Limited Connection        M             M           H/L*          M
  SM:Full Simultaneous Issue   H             H           H             H

  * We have modeled this scheme with all forwarding intact.
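To illustrate how the SM models restrict issue, the sketch below allocates one cycle's issue slots under a per-thread issue cap, with contexts considered in the strict priority order described in Section 2.1. It is a deliberately simplified abstraction of the models above (it ignores functional unit types and the scheduling window), and the names are our own.

```python
def issue_one_cycle(ready_counts, per_thread_cap, issue_width=8):
    """Fill one cycle's issue slots.

    ready_counts: dependence-free instruction count per thread, indexed
    by context number; context 0 has the highest priority.
    per_thread_cap: 1, 2, or 4 for SM:Single/Dual/Four Issue, or
    issue_width for SM:Full Simultaneous Issue.
    Returns the number of instructions issued per thread this cycle.
    (Fine-grain multithreading would instead give the whole issue
    width to a single thread each cycle.)
    """
    slots_left = issue_width
    issued = []
    for ready in ready_counts:              # strict priority order
        n = min(ready, per_thread_cap, slots_left)
        issued.append(n)
        slots_left -= n
    return issued

ready = [5, 3, 4, 2]                        # four active contexts
print(issue_one_cycle(ready, per_thread_cap=8))  # full:   [5, 3, 0, 0]
print(issue_one_cycle(ready, per_thread_cap=2))  # dual:   [2, 2, 2, 2]
print(issue_one_cycle(ready, per_thread_cap=1))  # single: [1, 1, 1, 1]
```

Note how the restricted models leave issue slots unfilled at low thread counts (the single-issue run above fills only 4 of 8 slots), which is why they need a higher ratio of threads to issue slots to stay competitive.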
4.2 The Performance of Simultaneous Multithreading

Figure 3 shows the performance of the various models as a function of the number of threads. The segments of each bar indicate the throughput component contributed by each thread; the lowest segment of each bar is the contribution of the highest-priority thread to the total throughput. The bar graphs show three interesting points in the multithreaded design space: fine-grained multithreading (only one thread per cycle, but that thread can use all issue slots), SM:Single Issue (many threads per cycle, but each can use only one issue slot), and SM:Full Simultaneous Issue (many threads per cycle, any thread can potentially use any issue slot).

The fine-grain multithreaded architecture (Figure 3(a)) provides a maximum speedup (increase in instruction throughput) of 2.1 over single-thread execution (from 1.5 IPC to 3.2). The graph shows that there is little advantage to adding more than four threads in this model. In fact, with four threads, the vertical waste has been reduced to less than 3%, which bounds any further gains. This result is similar to previous studies [2, 1, 19, 14, 33, 31] for both coarse-grain and fine-grain multithreading on single-issue processors, which have concluded that multithreading is only beneficial for 2 to 5 threads. These limitations do not apply to simultaneous multithreading, however, because of its ability to exploit horizontal waste.

Figures 3(b,c,d) show the advantage of the simultaneous multithreading models, which reach maximum speedups ranging from 3.2 to 4.2 over single-thread execution, with an issue rate as high as 6.3 IPC. The speedups are calculated using the full simultaneous issue, 1-thread result to represent the single-thread superscalar. With SM, it is not necessary for any single thread to be able to utilize the entire resources of the processor in order to get maximum or near-maximum performance. The four-issue model gets nearly the performance of the full simultaneous issue model, and even the dual-issue model is quite competitive, reaching 94% of full simultaneous issue at 8 threads. The limited connection model approaches full simultaneous issue more slowly due to its less flexible scheduling. Each of these models becomes increasingly competitive with full simultaneous issue as the ratio of threads to issue slots increases.
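The headline speedups follow directly from the quoted throughputs; the two-line helper below only makes the arithmetic explicit, assuming the roughly 1.5 IPC one-thread baseline from Section 3.

```python
def speedup(ipc, baseline_ipc=1.5):
    """Speedup relative to the single-thread superscalar, i.e., the
    full simultaneous issue model running one thread (~1.5 IPC)."""
    return ipc / baseline_ipc

print(round(speedup(3.2), 2))  # fine-grain at its best: ~2.1x
print(round(speedup(6.3), 2))  # full simultaneous issue, 8 threads: 4.2x
```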
With the results shown in Figure 3(d), we see the possibility of trading the number of hardware contexts against hardware complexity in other areas. For example, if we wish to execute around four instructions per cycle, we can build a four-issue or full simultaneous issue machine with 3 to 4 hardware contexts, a dual-issue machine with 4 contexts, a limited connection machine with 5 contexts, or a single-issue machine with 6 contexts. Tera [3] is an extreme example of trading pipeline complexity for contexts; it has no forwarding and no data caches, but supports 128 hardware contexts.

The increases in processor utilization are a direct result of threads dynamically sharing processor resources that would otherwise remain idle much of the time; however, sharing also has negative effects. We see (in Figure 3(c)) the effect of competition for issue slots and functional units in the full simultaneous issue model, where the lowest-priority thread (at 8 threads) runs at 55% of the speed of the highest-priority thread. We can also observe the impact of sharing other system resources (caches, TLBs, branch prediction table): with full simultaneous issue, the highest-priority thread, which is fairly immune to competition for issue slots and functional units, degrades significantly as more threads are added (a 35% slowdown at 8 threads). Competition for non-execution resources, then, plays nearly as significant a role in this performance region as the competition for execution resources.

Others have observed that caches are more strained by a multithreaded workload than a single-thread workload, due to a decrease in locality [21, 33, 1, 31]. Our data (not shown) pinpoints the exact areas where sharing degrades performance. Sharing the caches is the dominant effect: the wasted issue cycles (from the perspective of the first thread) due to I cache misses grow from 1% at one thread to 14% at eight threads, while wasted cycles due to data cache misses grow from 12% to 18%. The data TLB waste also increases, from less than 1% to 6%. In the next section, we investigate the cache problem. For the data TLB, we found that, with our workload, increasing the shared data TLB from 64 to 96 entries brings the wasted cycles (with 8 threads) down to 1%, while providing private TLBs of 24 entries reduces it to under 2%.

In summary, our results show that simultaneous multithreading surpasses limits on the performance attainable through either single-thread execution or fine-grain multithreading, when run on a wide superscalar. We have also seen that simplified implementations of SM with limited per-thread capabilities can still attain high instruction throughput. These improvements come without any significant tuning of the architecture for multithreaded execution; in fact, we have found that the instruction throughput of the various SM models is somewhat hampered by the sharing of the caches and TLBs. The next section investigates cache designs that are more resistant to those effects.
Figure 3: Instruction throughput as a function of the number of threads. (a)-(c) show the throughput by thread priority (highest to lowest) for the fine-grain multithreading, SM:Single Issue, and SM:Full Simultaneous Issue models; (d) shows the total throughput for all threads for each of the six machine models (Fine-Grain, SM:Single Issue, SM:Dual Issue, SM:Four Issue, SM:Limited Connection, SM:Full Simultaneous Issue). [Four charts over 1-8 threads: (a) Fine-Grain Multithreading, (b) SM:Single Issue, (c) SM:Full Simultaneous Issue, each stacked from highest- to lowest-priority thread; (d) All Models.]

5 Cache Design for a Simultaneous Multithreaded Processor

Our measurements show a performance degradation due to cache sharing in simultaneous multithreaded processors. In this section, we explore that cache problem further. Our study focuses on the organization of the first-level (L1) caches, comparing the use of private per-thread caches to shared caches for both instructions and data. (We assume that the L2 and L3 caches are shared among all threads.) All experiments use the 4-issue model with up to 8 threads. In Figure 4 the caches are specified as [total I cache size in KB][private or shared].[D cache size][private or shared]. For instance, 64p.64s has eight private 8 KB I caches and a shared 64 KB data cache. Not all of the private caches will be utilized when fewer than eight threads are running.
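Because the configuration notation is compact, a small parser may help the reader; it is purely illustrative (our own helper, not part of the study) and assumes the 8-context machine described above.

```python
def parse_cache_config(label, n_contexts=8):
    """Decode labels like '64p.64s': total size in KB followed by
    'p' (private per-thread) or 's' (shared), first for the I cache,
    then for the D cache."""
    def decode(part):
        total_kb, kind = int(part[:-1]), part[-1]
        per_thread_kb = total_kb // n_contexts if kind == "p" else total_kb
        return {"total_kb": total_kb,
                "shared": kind == "s",
                "kb_per_thread": per_thread_kb}
    icache, dcache = (decode(part) for part in label.split("."))
    return {"icache": icache, "dcache": dcache}

cfg = parse_cache_config("64p.64s")
print(cfg["icache"])  # eight private 8 KB I caches
print(cfg["dcache"])  # one shared 64 KB D cache
```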
Figure 4 exposes several interesting properties of multithreaded caches. We see that shared caches optimize for a small number of threads (where the few threads can use all available cache), while private caches perform better with a large number of threads. For example, the 64s.64s configuration ranks first among all models at 1 thread and last at 8 threads, while the 64p.64p configuration gives nearly the opposite result. However, the tradeoffs are not the same for both instructions and data. A shared data cache outperforms a private data cache over all numbers of threads (e.g., compare 64p.64s with 64p.64p), while instruction caches benefit from private caches at 8 threads. One reason for this is the differing access patterns between instructions and data: private I caches eliminate conflicts between threads in the I cache, while a shared D cache allows a single thread to issue multiple memory instructions to different banks.

There are two configurations that appear to be good choices. Because there is little performance difference at 8 threads, the cost of optimizing for a small number of threads is small, making 64s.64s an attractive option. However, if we expect typically to operate with all or most thread slots full, 64p.64s gives the best performance in that region and is never worse than the second-best performer with fewer threads. The shared data cache in this scheme allows it to take advantage of more flexible cache partitioning, while the private instruction caches make each thread less sensitive to the presence of other threads. Shared data caches also have a significant advantage in a data-sharing environment, allowing sharing at the lowest level of the data cache hierarchy without any special hardware for cache coherence.

It is not necessary to have extremely large caches to achieve the speedups shown in this section. Our experiments with significantly smaller caches (not shown here) reveal that the size of the caches affects 1-thread and 8-thread results equally, making the total speedups relatively constant across a wide range of cache sizes. That is, while 8-thread execution results in lower hit rates than 1-thread execution, the relative effect of changing the cache size is the same for each.

Figure 4: Results for the simulated cache configurations, shown relative to the throughput (instructions per cycle) of the 64s.64p configuration. [Curves for 64s.64s, 64p.64s, 64s.64p, and 64p.64p over 1-8 threads.]
6 Simultaneous Multithreading versus Single-Chip Multiprocessing

As chip densities continue to rise, single-chip multiprocessors provide an obvious means of achieving parallelism with the available real estate. This section compares the performance of simultaneous multithreading to small-scale, single-chip multiprocessing (MP). At the organizational level, the two approaches are extremely similar: both have multiple register sets, multiple functional units, and high issue bandwidth on a single chip. The key difference is in the way those resources are partitioned and scheduled: the multiprocessor statically partitions resources, devoting a fixed number of functional units to each thread; the SM processor allows the partitioning to change every cycle. Clearly, scheduling is more complex for an SM processor; however, we will show that in other areas the SM model requires fewer resources, relative to multiprocessing, to achieve a desired level of performance.

For these experiments, we tried to choose SM and MP configurations that are reasonably equivalent, although in several cases we biased in favor of the MP. For most of the comparisons we keep all or most of the following equal: the number of register sets (i.e., the number of threads for SM and the number of processors for MP), the total issue bandwidth, and the specific functional unit configuration. A consequence of the last item is that the functional unit configuration is often optimized for the multiprocessor and represents an inefficient configuration for simultaneous multithreading. All experiments use 8 KB private instruction and data caches (per thread for SM, per processor for MP), a 256 KB 4-way set-associative shared second-level cache, and a 2 MB direct-mapped third-level cache. We want to keep the caches constant in our comparisons, and this configuration (private I and D caches) is the most natural one for the multiprocessor.

We evaluate MPs with 1, 2, and 4 issues per cycle on each processor. We evaluate SM processors with 4 and 8 issues per cycle; however, we use the SM:Four Issue model (defined in Section 4.1) for all of our SM measurements (i.e., each thread is limited to four issues per cycle). Using this model minimizes some of the inherent complexity differences between the SM and MP architectures. For example, an SM:Four Issue processor is similar to a single-threaded processor with 4 issues per cycle in terms of both the number of ports on each register file and the amount of inter-instruction dependence checking. In each experiment we run the same version of the benchmarks for both configurations (compiled for a 4-issue, 4 functional unit processor, which most closely matches the MP configuration) on both the MP and SM models; this typically favors the MP.

We must note that, while in general we have tried to bias the tests in favor of the MP, the SM results may be optimistic in two respects: the amount of time required to schedule instructions onto functional units, and the shared cache access time. The impact of the former, discussed in Section 2.1, is small. The distance between the load/store units and the data cache can have a large impact on cache access time. The multiprocessor, with private caches and private load/store units, can minimize the distances between them. Our SM processor cannot do so, even with private caches, because the load/store units are shared. However, two alternate configurations could eliminate this difference. Having eight load/store units (one private unit per thread, associated with a private cache) would still allow us to match MP performance with fewer than half the total number of MP functional units (32 vs. 15). Or, with 4 load/store units and 8 threads, we could statically share a single cache/load-store combination among each set of 2 threads. Threads 0 and 1 might share one load/store unit, and all accesses through that unit would go to the same cache, thus minimizing the distance between cache and load/store unit, while still allowing resource sharing.
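The static pairing just described is simple enough to state in code; the sketch below is a hypothetical rendering (with function names of our own) of mapping 8 threads onto 4 cache/load-store pairs.

```python
N_LOAD_STORE_UNITS = 4  # alternate configuration: 4 LSUs, 8 threads

def load_store_unit_for(thread_id):
    """Statically pair threads with load/store units: threads 0 and 1
    share unit 0 (and its cache), threads 2 and 3 share unit 1, and so
    on. Because each unit's accesses stay within one cache, the unit
    can sit physically close to that cache while still being shared."""
    return thread_id // 2

assert [load_store_unit_for(t) for t in range(8)] == [0, 0, 1, 1, 2, 2, 3, 3]
```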
has the same total number has twice the total issue bandwidth. share a single cachelload- the distance between cache and load/store unit, while still allowing 1.94 nrnrc In most cases the SM processor among each set of 2 threads. share one load/store load/store MP. Issue BW = 8 unequal reg sets Figure 5: Results 16 Reg sets = 4 Test G: to a more partitioning multithreading out- in a variety of configurations of functional units. many fewer resources (functional issue slots) to achieve a given performance More imunits and level. For example, a single 8-thread, 8-issue SM processor with 10 functional unit configura- units is 24~o faster than the 8-processor, single-issue MP (Test D), tion used throughout this paper). This configuration outperforms the multiprocessor by nearly as much as test A, even though the but requires 32 functional units; to equal the throughput of that 8-thread 8-issue SM, an MP system SM configuration requires eight 4-issue processors (Test E), which consume 32 func- forwarding has 22 fewer functional units and requires fewer a much larger total issue Finally, In test E, each MP processor can issue 4 instructions per cycle for a total issue bandwidth slightly in issue slots. outperforms there are further advantages of SM over MP that are not shown by the experiments: of 32 across the 8 processors; each SM thread can also issue 4 instructions per cycle, but the 8 threads share only 8 issue slots. The results are similar despite the disparity issue bandwidth tional units and 32 issue slots per cycle. connections. In tests E and F, the MP is allowed bandwidth. which has identical In test F, the 4-thread, ● Performance with few threads — These results show only the performance at maximum utilization. The advantage of SM (over MP) is greater as some of the contexts (processors) 8-issue SM come unutilized. a 4-processor, 4-issue per processor MP, which 400 be- An idle processor leaves l/p of an MP idle, while with SM, the other threads can expand to use the avail- from multiple able resources. This is important vidual clusters. Franklin’s where the degree of parallelism when(1) we run parallel code varies overtime, (2) the perfor- mance of a small number of threads is important environment, or (3) the workload the machine (e.g., 8 threads). in the target is sized for the exact size of Hirata, for execution (e.g., processor and simulate ray-tracing application, their architecture for a multithreaded its performance and flexibility of design — Our configuration op- has no branch prediction using 8 threads. Yamamoto, mechanism. of multithreaded have finer granularity. Their study models perfect branching, would typically add computing With simultaneous a multiprocessor, in units of entire processors. multithreading, or an instruction Keckler all threads throughput tectures that dynamically interleave operations speedups as high as 3,12 for some highly full advantage of the configurability of simultaneous should be able to construct configurations even further out-distance we believe us to put multiple contexts on a single chip, simultaneous efficient organization functional and wide multithreading Daddis and Tomg as a function issue bandwidth [6] plot increases in instruction of the fetch bandwidth all fetched instructions. functional Their units, and unlimited paper. 
In this section, tithreaded we note previous on several traditional architectures, and the Multiscalar simultaneously, work on instruction-level (coarse-grain and on two architectures architecture) studies of architectures neous multithreading of fetch, multithreading. that exhibit a different performance. decoding, Smith, workload based real instruction branching, We look more closely latencies) fine-grain, in detail. We a variety of models of at the reasons for the result- and address the shared cache issue specifically. pare simultaneous with single-thread multithreading processors and com- with other relevant architectures: superscalar multithreaded architectures and single-chip multiprocessors. techniques) and measure dependence-checking, window figuration; Lam and Wilson [18] focus on the interaction size, scheduling register renaming, Summary and branch prediction on ILP; Butler, ef al., [5] examine these limitations and ILP; and Wall [32] examines 8 et al., [28] focus on the effects scheduling policy, and functional scheduling window This paper examined simultaneous plus allows independent unit con- facilities size, branch tures. and aliasing. available simultaneous chitectures. ing of instructions but none features simultaneous from different Tera has LIW are single-issue, (3-wide) In the M-Machine instructions. There is no simultaneous basis similar and single-chip, Our evaluation extended from workload multithreading with wide superscalar, multiple-issue the DEC Alpha mul- fine-grain multiprocessing 21164,, simulation running of SPEC benchmarks, with the Multiflow architec- of simultaneous used execution-driven composed func- combines and multithreaded several models them a technique that to multiple trace scheduling Our results show the benefits of simultaneous when compared to the other architectures, namely: procrasor. units on a cycle-by-cycle and compared for our architecture In Section 4, multithreading [7] each processor cluster schedules LIW onto execution the Tera scheme. programmed rather than super- we extended these results by showing how fine-grain runs on a multiple-lsstse on a model threads during the same cycle. In fact, most of these architectures scalar, although multithreaded, issu- in both superscalar We have presented tithreading multithreading, multithreading, threads to issue instructions tional units in a single cycle. Simultaneous of branches Previous work on coarse-grain [2, 27, 31] and fine-grain [28, 3, 15, 22, 19] multithreading provides the foundation for our work on structions superscalararchitec- multiprogrammed memory, TLB, We go beyond comparisons perspective limitations prediction, We fetch we model all sources of latency (cache, ing performance simulta- ((but limited of throughput. on the SPEC benchmarks; SM execution. contexts active that issues studies on ILP, which remove barriers to parallelism apply real or ideal latency-hiding the resulting issue bandwidth They report a near doubling also extend the previous work in evaluating and contrast our work with these in particular. The data presented in Section 3 provides from previous mul- (the M-Machine that have multiple but do not have simultaneous also discuss previous and fine-grain) window system has two threads, unlimited In contrast to these studies of multithreaded, of sources in this throughput and the size of the dispatch stack. 
7 Related Work

In this section, we note previous work on instruction-level parallelism, on several traditional (coarse-grain and fine-grain) multithreaded architectures, and on two architectures (the M-Machine and the Multiscalar architecture) that have multiple contexts active simultaneously but do not feature simultaneous multithreading. We also discuss previous studies of architectures that exhibit simultaneous multithreading, and contrast our work with these in particular.

The data presented in Section 3 provides a different perspective from previous studies of instruction-level parallelism, which remove barriers to parallelism and apply real or ideal latency-hiding techniques, then measure the resulting performance: Butler, et al. [5] examine the effects of fetch, decoding, and dependence-checking mechanisms, scheduling window size, scheduling policy, and functional unit configuration; Lam and Wilson [18] focus on the interaction of branches and ILP; and Wall [32] examines the limitations imposed by scheduling window size, register renaming, branch prediction, and aliasing. We look more closely at the reasons for the resulting performance; in Section 4, we extended these results by showing how the same workload runs under fine-grain and simultaneous multithreading.

Previous work on coarse-grain [2, 27, 31] and fine-grain [28, 3, 15, 22, 19] multithreading provides the foundation for our work on simultaneous multithreading. These architectures have multiple contexts, but none features simultaneous issuing of instructions from different threads during the same cycle; in fact, most of them are single-issue rather than superscalar, although Tera has LIW (3-wide) instructions. The Multiscalar architecture [13, 12] assigns tasks to processing units, so competition for execution resources (processors, in this case) is at the level of a task rather than an individual instruction. In the M-Machine [7], each processor cluster schedules LIW instructions onto execution units on a cycle-by-cycle basis, similar to the Tera scheme; there is no simultaneous issue of instructions from multiple threads to functional units within individual clusters.

Hirata, et al. [16] present an architecture for a multithreaded superscalar processor and simulate its performance on a parallel ray-tracing application. They do not simulate caches or TLBs, and their architecture has no branch prediction mechanism. They show speedups as high as 5.8 over a single-threaded architecture when using 8 threads. Yamamoto, et al. [34] present an analytical model of multithreaded superscalar performance, backed up by simulation. Their study models perfect branching, perfect caches, and a homogeneous workload (all threads running the same trace); they report increases in instruction throughput of 1.3 to 3 with four threads. Keckler and Dally [17] and Prasadh and Wu [23] describe architectures that dynamically interleave operations from VLIW instructions onto individual functional units. Keckler and Dally report speedups as high as 3.12 for some highly parallel applications. Prasadh and Wu also examine the register file bandwidth requirements for 4 threads scheduled in this manner; they use infinite caches and show a maximum speedup above 3 over single-thread execution for parallel applications. Daddis and Torng [6] plot increases in instruction throughput as a function of the fetch bandwidth and the size of the dispatch stack. The dispatch stack is the global instruction window that issues all fetched instructions. Their system has two threads, unlimited functional units, and unlimited issue bandwidth (but limited fetch bandwidth), and they report a near doubling of throughput. The Hirata, et al. design [16] is closest to our single-issue SM model; the others [34, 17, 23, 6] implement architectures more similar to full simultaneous issue.

In contrast to these studies of multithreaded, superscalar architectures, we use a heterogeneous, multiprogrammed workload based on the SPEC benchmarks; we model all sources of latency (cache, memory, TLB, branching, and real instruction latencies) in detail; and we address the shared cache issue specifically. We also extend the previous work in evaluating a variety of models of simultaneous multithreading, and we go beyond comparisons with single-thread processors, comparing simultaneous multithreading with other relevant architectures: fine-grain superscalar multithreaded architectures and single-chip multiprocessors.
8 Summary

This paper examined simultaneous multithreading, a technique that allows independent threads to issue instructions to multiple functional units in a single cycle. Simultaneous multithreading combines facilities available in both superscalar and multithreaded architectures. We have presented several models of simultaneous multithreading and compared them with wide superscalar, fine-grain multithreaded, and single-chip, multiple-issue multiprocessing architectures. Our evaluation used execution-driven simulation based on a model extended from the DEC Alpha 21164, running a multiprogrammed workload composed of SPEC benchmarks, compiled for our architecture with the Multiflow trace scheduling compiler. Our results show the benefits of simultaneous multithreading when compared to the other architectures, namely:

1. Given our model, a simultaneous multithreaded architecture, properly configured, can achieve 4 times the instruction throughput of a single-threaded wide superscalar with the same issue width (8 instructions per cycle, in our experiments).

2. While fine-grain multithreading (i.e., switching to a new thread every cycle) helps close the gap, the simultaneous multithreading architecture still outperforms fine-grain multithreading by a factor of 2. This is due to the inability of fine-grain multithreading to utilize issue slots lost to horizontal waste.

3. A simultaneous multithreaded architecture is superior in performance to a multiple-issue multiprocessor, given the same total number of register sets and functional units. Moreover, achieving a specific performance goal requires fewer hardware execution resources with simultaneous multithreading.

The advantage of simultaneous multithreading, compared to the other approaches, is its ability to boost utilization by dynamically scheduling functional units among multiple threads. SM also increases hardware design flexibility: a simultaneous multithreaded architecture can trade off functional units, register sets, and issue bandwidth to achieve better performance, and can add resources in a fine-grained manner.

Simultaneous multithreading increases the complexity of instruction scheduling relative to superscalars, and causes shared resource contention, particularly in the memory subsystem. However, we have shown how simplified models of simultaneous multithreading can reach nearly the performance of the most general SM model with complexity in key areas commensurate with that of current superscalars; we have also shown how properly tuning the cache organization can both increase performance and make individual threads less sensitive to multi-thread contention.

Acknowledgments

We would like to thank John O'Donnell of Equator Technologies, Inc. and Tryggve Fossum of Digital Equipment Corporation for access to the source for the Alpha AXP version of the Multiflow trace scheduling compiler. We would also like to thank Burton Smith, Norm Jouppi, and the reviewers for helpful comments and suggestions on the paper and the research.

References

[1] A. Agarwal. Performance tradeoffs in multithreaded processors. IEEE Transactions on Parallel and Distributed Systems, 3(5):525-539, September 1992.
[2] A. Agarwal, B.H. Lim, D. Kranz, and J. Kubiatowicz. APRIL: a processor architecture for multiprocessing. In 17th Annual International Symposium on Computer Architecture, pages 104-114, May 1990.
[3] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith. The Tera computer system. In International Conference on Supercomputing, pages 1-6, June 1990.
[4] R. Bedichek. Some efficient architecture simulation techniques. In Winter 1990 Usenix Conference, pages 53-63, January 1990.
[5] M. Butler, T.Y. Yeh, Y. Patt, M. Alsup, H. Scales, and M. Shebanow. Single instruction stream parallelism is greater than two. In 18th Annual International Symposium on Computer Architecture, pages 276-286, May 1991.
[6] G.E. Daddis, Jr. and H.C. Torng. The concurrent execution of multiple instruction streams on superscalar processors. In International Conference on Parallel Processing, pages I:76-83, August 1991.
[7] W.J. Dally, S.W. Keckler, N. Carter, A. Chang, M. Fillo, and W.S. Lee. M-Machine architecture v1.0. Technical Report MIT Concurrent VLSI Architecture Memo 58, Massachusetts Institute of Technology, March 1994.
[8] H. Davis, S.R. Goldschmidt, and J. Hennessy. Multiprocessor simulation and tracing using Tango. In International Conference on Parallel Processing, pages II:99-107, August 1991.
[9] M. Denman. PowerPC 604. In Hot Chips VI, pages 193-200, August 1994.
[10] K.M. Dixit. New CPU benchmark suites from SPEC. In COMPCON, Spring 1992, pages 305-310, 1992.
[11] J. Edmondson and P. Rubinfeld. An overview of the 21164 AXP microprocessor. In Hot Chips VI, pages 1-8, August 1994.
[12] M. Franklin. The Multiscalar Architecture. PhD thesis, University of Wisconsin, Madison, 1993.
[13] M. Franklin and G.S. Sohi. The expandable split window paradigm for exploiting fine-grain parallelism. In 19th Annual International Symposium on Computer Architecture, pages 58-67, May 1992.
[14] A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W.D. Weber. Comparative evaluation of latency reducing and tolerating techniques. In 18th Annual International Symposium on Computer Architecture, pages 254-263, May 1991.
[15] R.H. Halstead, Jr. and T. Fujita. MASA: A multithreaded processor architecture for parallel symbolic computing. In 15th Annual International Symposium on Computer Architecture, pages 443-451, May 1988.
[16] H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, and T. Nishizawa. An elementary processor architecture with simultaneous instruction issuing from multiple threads. In 19th Annual International Symposium on Computer Architecture, pages 136-145, May 1992.
[17] S.W. Keckler and W.J. Dally. Processor coupling: Integrating compile time and runtime scheduling for parallelism. In 19th Annual International Symposium on Computer Architecture, pages 202-213, May 1992.
[18] M.S. Lam and R.P. Wilson. Limits of control flow on parallelism. In 19th Annual International Symposium on Computer Architecture, pages 46-57, May 1992.
[19] J. Laudon, A. Gupta, and M. Horowitz. Interleaving: A multithreading technique targeting multiprocessors and workstations. In Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 308-318, October 1994.
[20] P.G. Lowney, S.M. Freudenberger, T.J. Karzes, W.D. Lichtenstein, R.P. Nix, J.S. O'Donnell, and J.C. Ruttenberg. The Multiflow trace scheduling compiler. Journal of Supercomputing, 7(1-2):51-142, May 1993.
[21] D.C. McCrackin. The synergistic effect of thread scheduling and caching in multithreaded computers. In COMPCON, Spring 1993, pages 157-164, 1993.
[22] R.S. Nikhil and Arvind. Can dataflow subsume von Neumann computing? In 16th Annual International Symposium on Computer Architecture, pages 262-272, June 1989.
[23] R.G. Prasadh and C.-L. Wu. A benchmark evaluation of a multi-threaded RISC processor architecture. In International Conference on Parallel Processing, pages I:84-91, August 1991.
[24] Microprocessor Report, October 24, 1994.
[25] Microprocessor Report, October 3, 1994.
[26] Microprocessor Report, November 14, 1994.
[27] R.H. Saavedra-Barrera, D.E. Culler, and T. von Eicken. Analysis of multithreaded architectures for parallel computing. In Second Annual ACM Symposium on Parallel Algorithms and Architectures, pages 169-178, July 1990.
[28] B.J. Smith. Architecture and applications of the HEP multiprocessor computer system. In SPIE Real Time Signal Processing IV, pages 241-248, 1981.
[29] J. Smith. A study of branch prediction strategies. In 8th Annual International Symposium on Computer Architecture, pages 135-148, May 1981.
[30] G.S. Sohi and M. Franklin. High-bandwidth data memory systems for superscalar processors. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 53-62, April 1991.
[31] R. Thekkath and S.J. Eggers. The effectiveness of multiple hardware contexts. In Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 328-337, October 1994.
[32] D.W. Wall. Limits of instruction-level parallelism. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 176-188, April 1991.
[33] W.D. Weber and A. Gupta. Exploring the benefits of multiple hardware contexts in a multiprocessor architecture: preliminary results. In 16th Annual International Symposium on Computer Architecture, pages 273-280, June 1989.
[34] W. Yamamoto, M.J. Serrano, A.R. Talcott, R.C. Wood, and M. Nemirosky. Performance estimation of multistreamed, superscalar processors. In Twenty-Seventh Hawaii International Conference on System Sciences, pages I:195-204, January 1994.