An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads

Hiroaki Hirata, Kozo Kimura, Satoshi Nagamine, Yoshiyuki Mochizuki, Akio Nishimura, Yoshimori Nakase and Teiji Nishizawa

Media Research Laboratory, Matsushita Electric Industrial Co., Ltd.
1006 Kadoma, Kadoma, Osaka 571, Japan
(hirata@isl.mei.co.jp)

Abstract

In this paper, we propose a multithreaded processor architecture which improves machine throughput. In our processor architecture, instructions from different threads (not from a single thread) are issued simultaneously to multiple functional units, and they begin execution unless there are functional unit conflicts. This parallel execution scheme greatly improves the utilization of each functional unit. Simulation results show that by executing two and four threads in parallel on a nine-functional-unit processor, a 2.02 and a 3.72 times speed-up, respectively, can be achieved over a conventional RISC processor. Our architecture is also applicable to the efficient execution of a single loop. In order to control functional unit conflicts between loop iterations, we have developed a new static code scheduling technique. Another loop execution scheme, which uses the multiple control flow mechanism of our architecture, makes it possible to parallelize loops that are difficult to parallelize on vector or VLIW machines.

1 Introduction

We are currently developing an integrated visualization system which creates virtual worlds and represents them through high-quality computer graphics. Generating images that model the real world as faithfully as possible requires great computational power, for numerical processing as well as for graphics. Famous image-generation algorithms such as ray-tracing and radiosity are especially computation-intensive. Fortunately, such algorithms can easily be parallelized: in a ray-tracing program, for example, concurrent processes can each be assigned a part of the image, and each process repeatedly executes intersection tests between rays and objects. Such coarse-grained parallel processing, however, raises two problems on a conventional multiprocessor system.

One problem is memory access latency. In a parallel system with shared data, remote memory access is inherent, and its long latency degrades performance. Multithreading is a latency-hiding technique to overcome this obstacle: when a thread encounters a long-latency operation such as a remote memory access, the processor rapidly switches to another thread, so that the latency is overlapped with the execution of other threads.

The other problem is the low utilization of functional units within a processor. Because of control and data dependencies within a single instruction stream, and because of the many conditional branches generated by tests such as the ray-object intersections above, a conventional processor cannot keep its functional units busy. Multithreading helps here as well: when an instruction from one thread cannot be issued because of a dependency, an independent instruction from another thread can be executed instead. This is especially effective in a deeply pipelined processor.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1992 ACM 0-89791-509-7/92/0005/0136 $1.50
Our goal is to design an efficient and cost-effective parallel machine rather than a high-performance uniprocessor. The idea behind our architecture arose from asking how the hardware resources of several elementary processors in a large-scale multiprocessor system could be united and reorganized so that they are more fully used. Figure 1(a) shows a conventional multiprocessor organization, in which each physical processor executes a single instruction stream. Figure 1(b) shows our organization, in which several logical processors are united into one physical processor and share its functional units. The physical hardware in Figure 1(b) is almost the same as in Figure 1(a), but the utilization of the busiest functional unit is greatly improved, because threads that would otherwise leave the unit idle are interleaved on it. The utilization of a functional unit is defined as U = N/C x 100[%], where N is the number of invocations of the unit and C is the total number of execution cycles of the program. For example, if the busiest functional unit of a conventional processor is utilized only about 30% of the time because of dependencies within a single thread, executing three threads on one physical processor could raise its utilization to nearly 30 x 3 = 90%.

Figure 1: Multiprocessor organization: (a) conventional, (b) multithreaded

In this paper we restrict ourselves to the field of numerical computation and concentrate upon architectural techniques. Compiler techniques such as doall or doacross loop parallelization, which are also applicable to our machine, are addressed only as needed. The rest of this paper is organized as follows. In section 2, our multithreaded architecture is presented. Section 3 demonstrates the effectiveness of our architecture with simulation results, compared with superscalar architectures and static code scheduling strategies. Related works are outlined in section 4, and concluding remarks are made in section 5.

2 Processor Architecture

2.1 Machine Organization

The hardware organization of our processor is shown in Figure 2. Each thread slot, which appears to the programmer as a logical processor, physically consists of a program counter and a register set, together with an instruction queue. The instruction fetch unit, the instruction schedule unit, the functional units, an instruction cache and a data cache are shared among the logical processors, while an instruction decode unit is provided for each thread slot.

2.1.1 Sequencer and Memories

The instruction fetch unit fetches C instructions at once for one thread from the instruction cache and fills the instruction queue associated with that thread's slot. Fetching proceeds in an interleaved fashion over the thread slots, so each queue is refilled at most every S cycles and the instruction buffer needs B = S x C words, where S is the number of thread slots. When a thread encounters a branch, its queue is refilled from the branch target once the branch is resolved. The decode unit of each slot takes one instruction from its queue and decodes it; because decoding is done per slot, a stall in one thread does not become a bottleneck for the others.
Instructions decoded in the different slots are checked and scheduled dynamically by the instruction schedule unit and then delivered to the functional units. Dependencies within each thread are detected with a scoreboarding technique: unless a data dependency or a functional unit conflict is indicated, an instruction is issued; otherwise issuing is interlocked. Because the threads are independent of one another, instructions from different threads can be issued simultaneously to different functional units without any cross-thread dependency check; issued instructions compete only for the functional units and for access to the shared data cache.
2.1.2 Register Files

The register file is divided into banks, and one bank is allocated to each logical processor. A bank contains the general-purpose and floating-point registers private to its thread, as well as the special facilities described in section 2.2. Because each instruction is bound to its logical processor, a bank is accessed only by its own thread, so no access conflicts arise between banks and each bank needs only a small number of read and write ports. This keeps the cost of the register file low even though the machine holds multiple thread contexts at once, whereas a single large multiported file would be a disadvantage.

A standby station is provided in front of each functional unit. It resembles a reservation station in Tomasulo's algorithm[2], but is much simpler and cheaper: the binding between a standby station and its functional unit is one-to-one and exclusive, and instructions in a standby station stay in order. An instruction that is issued while its functional unit is occupied is stored in the standby station and executed when the unit becomes free, so a previously issued instruction that is waiting does not block a succeeding instruction from another thread. For example, an add instruction whose operands have already been read can wait in the station and be sent to the ALU as soon as the ALU is free.

2.1.3 Pipeline

Figure 3(b) shows the pipeline of our processor, which is a modification of the pipeline of a superpipelined base RISC processor shown in Figure 3(a). An instruction flows through two fetch stages (IF1, IF2), two decode stages (D1, D2), a schedule stage (S), the execution stages (EX1, EX2, ...), and a write stage (W). In stage D1 the format and type of the instruction are tested; in stage D2 it is tested whether the instruction is issuable or not; in stage S the instruction schedule unit arbitrates among the thread slots and selects instructions for the functional units; results are written back to a register in stage W. The register bank is read in one cycle.

Figure 2: Hardware organization
Figure 3: Pipeline of (a) the base RISC processor and (b) our processor

Table 1 lists the assumed latencies of instructions. The issue latency is the number of cycles before another instruction can be issued to the same unit, and the result latency is the number of EX stages, i.e. the number of cycles before the result can be used as a source operand. The number of EX stages needed to access the data cache is two.

Table 1: Latencies of instructions

  Functional unit   instruction category   issue latency   result latency
                                           (cycles)        (cycles)
  ALU               add/subtract           1               2
                    compare                1               2
  Barrel shifter    shift                  1               2
  Multiplier        multiply               1               6
                    divide                 1               6
  FP adder          add/subtract           1               4
                    compare                1               -
                    absolute/negate        1               -
  Load/store unit   load                   2               -
                    store                  2               -
Because the pipeline is deep, branches are costly for a single thread. Assume that instruction I1 in Figure 3(b) is a conditional branch and that I3 is the instruction at the branch target. The outcome of I1 is unknown until the end of stage D1, so I3 cannot be fetched until four cycles after I1; in the worst case the branch delay reaches five cycles. Likewise, an instruction I2 that uses the result of a preceding data-dependent instruction must wait until that result is written. In a single-threaded processor these delays directly damage throughput. In our processor, however, the threads hide the delays from one another: while one thread waits at a branch point or for a data-dependent result, instructions from the other threads are fetched and executed, so branch delays as well as arithmetic latencies are overlapped. Load/store instructions are also pipelined, and the state of outstanding memory accesses is recorded in a buffer so that requests can be completed or flushed when necessary.

Support of Concurrent Multithreading

Our processor also supports concurrent multithreading, in which the number of threads resident in the system exceeds the number of thread slots. The context of a thread, i.e. its program counter, status registers, and general-purpose and floating-point registers, together with thread-specific state such as outstanding memory requests, is grouped into a context frame. A context frame is conceptually bound to a physical address, and a context switch saves the frame of the running thread to memory and restores that of another. Since switching between a logical processor and a thread is not free, it is important that frames be saved and restored rapidly, that the save/restore traffic be reduced by saving only the registers actually in use, and that the binding between a logical processor and a thread be changed only when necessary.
Thread selection in the instruction schedule unit is controlled by priorities. A unique priority is assigned to each thread slot; when instructions compete for a functional unit, the instruction from the slot with the highest priority is selected. To avoid starvation of low-priority slots, the priorities are rotated: the slot with the highest priority drops to the lowest and all others move up. There are two rotation modes, illustrated in Figure 4. In the implicit-rotation mode, rotation is performed automatically at a fixed rotation interval of cycles. In the explicit-rotation mode, rotation occurs when a change-priority instruction is executed; this privileged instruction lets software, such as code generated by our compiler, control which thread has the highest priority at a given time. The explicit-rotation mode is the basis of the static code scheduling scheme presented in section 2.3.2.

Figure 4: Two thread rotation modes (A, B and C denote thread slots)

Exception handling must be considered with care. When a trap such as a page-absence fault is triggered, instructions from several threads are outstanding in the pipeline, so the fault is imprecise. Our solution, similar to that of other multithreaded machines, is to make instructions restartable: the pipeline is flushed, outstanding load/store requests are completed or cancelled, and the context of the faulting thread is saved as a piece of machine state. When the condition has been handled, execution is resumed by re-executing the indicated instruction.

2.2 Communications among Logical Processors

Logical processors communicate with one another through queue registers rather than through shared memory. This facility is intended chiefly for the parallel execution of loops: in a doacross-type loop, a value computed in one iteration must be passed to the next iteration with as little overhead as possible, and communication through memory would make that overhead large.
The connection topology of the queue registers among logical processors is important when considering hardware cost and its effect. One simple and effective topology is a ring network, as shown in Figure 5: each logical processor writes into the queue register read by the next logical processor. The queue registers are mapped into the register address space, so ordinary instructions such as a move can read and write them, and the full/empty bits attached to them are interpreted by the scoreboarding logic: reading an empty queue register, or writing a full one, interlocks the instruction just as a data dependency would. In this way a value produced in one iteration is relayed to the next iteration without passing through memory. When a loop carries a data dependence whose iteration difference is greater than one, the compiler converts the loop so that the difference becomes one, relaying the value through several logical processors.

Figure 5: Logical processors connected through queue registers
2.3 Parallel Execution of Loops

2.3.1 Overview

A physical processor executes the iterations of a loop in parallel by activating its logical processors. The compiler parallelizes loops of both the doall type, whose iterations are independent, and the doacross type, whose iterations communicate through the queue registers. Iterations are distributed over the logical processors so that the kth logical processor executes the (ni + k)th iterations (i = 0, 1, ...), where n is the number of logical processors, and a fast-fork mechanism initiates the threads with little overhead.

2.3.2 Static Code Scheduling

In the case of numerical programs, the control structure of a loop body is often simple and the latencies of its instructions are known (Table 1), so the execution time of the body can be predicted at compile time. Our compiler therefore schedules the code statically, using the explicit-rotation mode, in order to avoid functional unit conflicts between the logical processors executing different iterations. The scheduler, a variation of the list scheduling algorithm[4], employs a resource reservation table in which each entry corresponds to a functional unit at a given cycle. If the entry for the instruction under scheduling is not marked, the instruction is placed and the entry is marked; if the entry is already marked, a resource conflict would occur, so the scheduler reorders instructions or inserts a NOP until the conflict disappears. The standby stations make the schedule robust: even if a conflict does arise at execution time, the instruction only waits briefly in a standby station instead of blocking its thread. This scheduling differs from software pipelining in that it controls conflicts between threads rather than within a single one.
Our scheduling also benefits from judicious use of conventional techniques such as the loop unrolling technique[3] and software pipelining[5, 6], which are complementary to it.

As an example of a loop that is difficult to parallelize on vector or VLIW machines, Figure 6 shows a simple program fragment which traverses a linear linked list. Although it is a small example, it is representative of our application programs: similar pointer-chasing loops with conditional exits are fundamental to ray-tracing and many other graphics computations.

Figure 6: A sample program (written in C)

    ptr = header;
    while (ptr != NULL) {
        tmp = a * ((ptr->point)->x)
            + b * ((ptr->point)->y) + c;
        if (tmp < 0) break;
        ptr = ptr->next;
    }

2.3.3 Eager Execution of Loop Iterations

The iteration count of such a loop cannot be known in advance: each iteration depends on the previous one through ptr, and any iteration may terminate the whole loop through the break. Our scheme executes the iterations eagerly, as illustrated in Figure 7: each iteration is executed as a thread on a logical processor, and the value of ptr is passed to the next logical processor through the queue registers as soon as it is computed, so that the next iteration can be initiated before the current one completes.

Figure 7: Parallel execution of the while loop
The compiler generates code so that a thread waits for the value of ptr from the preceding iteration through a queue register, passes the value for the next iteration on immediately, and then executes the rest of the iteration body overlapped with its successors. A thread which has executed the break (i.e. whose tmp is negative) must stop the succeeding threads, which were initiated eagerly: it executes kill instructions to terminate them. Thread priorities preserve the semantics of the sequential program. The earliest alive iteration always has the highest priority; a thread performs its store to a global variable such as tmp only after it gets the highest priority; and a thread can be killed only by a thread with a higher priority. This guarantees that the side effects of iterations which should never have been executed do not become visible. The scheme can be seen as a variation of eager evaluation in dataflow machines.

3 Simulation

3.1 Assumptions

We developed a simulator of our architecture based on the instruction set of a commercial RISC processor, including its delayed branches. The application program, written in C, was compiled on a workstation by a commercial compiler with its optimization option, and the object code was translated into code for our simulator and parallelized as described above. As the application we used a ray-tracing program, in which each thread computes one pixel. The caches were assumed to always hit, and the instruction latencies listed in Table 1 were assumed. The speed-up ratio is defined as the ratio of the number of execution cycles of the sequential, single-threaded execution to that of the parallel execution.

3.2 Speed-up by Parallel Multithreading

Table 2 lists the speed-up ratios attained by parallel multithreading, and Table 3 shows the trade-off among the number of thread slots, the number of load/store units, and the standby stations. With two load/store units, executing two and four threads attains speed-ups of 2.02 and 3.72, respectively. With only one load/store unit, the ratios are 1.83 and 2.89 for two and four thread slots: increasing the slots from two to four still improves performance by a factor of 1.58 (2.89/1.83), but with eight slots the single load/store unit is saturated, its utilization reaching 99.7%, and the ratio stops at 3.22, whereas eight threads with two load/store units attain 5.79. Over these configurations the utilization of the functional units ranged from 10.4% to 79.8%.
speed-up the load/store the speed- Addition ratios of by Table 2: Speed-up ratio by parallel multitthreading Table 3: Tradeoff allelism with no. of thread slots one with unit load/store units without with without with standby standby standby standby stat ion stat ion stat ion station -E 2 2.02 1.31 1.79 1.83 2.01 2.02 4 3.72 2.43 4 2.84 2.89 3.68 3.72 8 8 3.22 3.22 5.68 5.79 speed-up ratio stations 0N2.2%. This lelism low within plication We the improvement an instruction programs parallelism, is due stream. whose thread greater improvement also simulated a machine provided with tion units. fetch speed-up. those improve Such organization tively replaced delay due sharing by to an 1.80 5.80. fetch instruction are and This same between the thread also examined intervals application execution cycles, program, influence or sixteen the (2n on the cycles cycles where rotation interval performance, seems slightly the The the did though is main objective with ple instruction similar ing the brid same A hybrid tions processor where which number hardware that half resirlts. Values for hard- a simple have it is clear that D= threading is introduced for the instruction others. On the the hand, is to main the of obj ect ives architectures im- of cost-effectiveness, best into the improve difference of the two 1 is the a more is to enhance other viewpoint tablle of D. architecture This and This of S produces increase ta- are com- a (d, 1)-processor, of multithreading comparison in the processors 3(a) performance. eight support In this 3.2, both We Effect will choice if parallel a processor multi- architecture. compare static 2.3.2. Kernel 1. Code code The The Scheduling scheduling sample source strategies program code dis- is the is shown Liv- in Figure 8. stream us- we discuss and of Static in section ermore multi- section, coarse- is represented a thread slot issues of thread slots. For cost is dxs hy- DO IK=l, fine- 1 N X(K) = Q+Y(K)*(R*Z(K+IO) +T*Z(K+II)) per ports. 
For a fair comparison among these configurations, the maximum instruction issue bandwidth D = d x s is fixed; a (d x 2, s/2)-processor, for example, has the same peak bandwidth as a (d, s)-processor, and every configuration is provided with the same functional units and with register sets of the same total size. Superscalar approaches [7, 8] can exploit the instruction-level parallelism of a single stream, but the hardware required to check the dependencies among the instructions of a window grows as d increases, whereas adding thread slots requires little more than additional register sets. In our simulation, pure superscalar configurations achieve speed-up ratios of 1.52, 2.79 and 4.37 with issue bandwidths of two, four and eight respectively, lower than the ratios achieved by parallel multithreading with the same bandwidth. Because the parallelism within a single instruction stream of our application programs is poor, multithreading makes better use of the available issue slots.

3.4 Effect of Static Code Scheduling

In this section we examine the static code scheduling approach discussed in section 2.3.2. The sample source program is the Livermore Kernel 1 (written in FORTRAN), whose source code is shown in Figure 8. The program was compiled by a commercial compiler, and the object code was then rescheduled by our code schedulers for a simulated machine with one load/store unit and a standby station. Scheduling strategy A represents a simple approach, which reorders the instructions of a thread only to avoid resource conflicts; strategy B, in addition, checks the data dependencies among instructions so that the instruction window is filled with issuable instructions every cycle. Table 4 lists the average number of execution cycles required for one iteration under each strategy.
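The gap between the two strategies can be illustrated with a toy list scheduler. This is a simplified sketch of ours, not the paper's schedulers: it assumes unit latencies, one load/store slot and two ALU slots per cycle, and models strategy A as strict program-order issue (window of one) and strategy B as issue of any ready instruction in a large window.

```python
def schedule(instrs, window):
    # instrs: list of (kind, deps); one "ls" slot and two "alu" slots per
    # cycle, unit latency.  window=1 mimics strategy A (program order only);
    # window=len(instrs) mimics strategy B (any ready instruction).
    finish = {}                          # instr index -> completion cycle
    pending = list(range(len(instrs)))
    cycle = 0
    while pending:
        cycle += 1
        slots = {"ls": 1, "alu": 2}
        issued = []
        for i in pending[:window]:
            kind, deps = instrs[i]
            ready = all(finish.get(d, cycle) < cycle for d in deps)
            if slots[kind] > 0 and ready:
                slots[kind] -= 1
                finish[i] = cycle
                issued.append(i)
        for i in issued:
            pending.remove(i)
    return cycle

# Two independent iterations of a Kernel-1-like body:
# three loads, two multiply/adds, one store (indices give dependencies).
BODY = [("ls", []), ("ls", []), ("ls", []),
        ("alu", [0, 1]), ("alu", [2, 3]), ("ls", [4])]
PROG = BODY + [(kind, [d + 6 for d in deps]) for kind, deps in BODY]

print(schedule(PROG, 1), schedule(PROG, len(PROG)))
```

With a window of one, the second iteration cannot overlap the first; with a full window, its loads slip into the cycles where the first iteration's ALU work is running, so the schedule shortens.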
In the case of application programs whose instruction streams are rich in parallelism, a superscalar organization can achieve good speed-up ratios; for programs such as ours, whose single-stream parallelism is poor, the multithreaded organization is superior.

Table 4 compares the static code scheduling strategies: strategy B achieves an overall performance superior to that of the other strategies by about 19.3%. Since strategy B requires a more expensive scheduler, a hybrid of strategies A and B could be employed to improve cost-effectiveness. Table 5 shows the effect of the loop optimization on the number of execution cycles per iteration.

Table 5: Effect of loop optimization (average cycles per iteration)

  no. of thread slots   1       2       4       8
  non-optimized         50      32.5    21.67   8.83
  optimized             42      21.67   11      8.125

One iteration of the kernel contains three loads and one store; with the load/store unit occupied for two cycles by each of them, at least (3+1) x 2 = 8 cycles are required per iteration on a machine with one load/store unit, and the optimized code, at 8.125 cycles with eight thread slots, comes near this bound.

3.5 Effect of Eager Execution of Loop Iterations

In section 2.3.3, we presented the eager execution scheme for loop iterations. We simulated the scheme with the sample program shown in Figure 8 on a machine with one load/store unit. The value of ptr is passed through a register, and a succeeding iteration can be initiated as soon as that value is available from the preceding one. The sequential version requires 56 cycles for one iteration, while the value of ptr becomes available 17 cycles after an iteration begins; the speed-up is therefore limited to 56/17 = 3.29. As the number of thread slots increases, the execution of the parallel version comes closer to this limit.
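The 56/17 limit can be reproduced with an idealised timing model. This Python sketch is our own approximation, not the authors' simulator: each iteration occupies a thread slot for the 56-cycle body, and a successor may start 17 cycles after its predecessor, once the value of ptr is available.

```python
def eager_cycles(n_iter, slots, dep=17, body=56):
    # start[i] is the cycle at which iteration i begins.  Iteration i may
    # start `dep` cycles after iteration i-1 (ptr available), but only
    # `slots` iterations can be in flight at once.
    start = [0] * n_iter
    for i in range(1, n_iter):
        ready = start[i - 1] + dep                       # ptr from predecessor
        slot_free = start[i - slots] + body if i >= slots else 0
        start[i] = max(ready, slot_free)
    return start[-1] + body

seq = eager_cycles(100, 1)      # one thread slot: fully sequential
par = eager_cycles(100, 8)      # eight thread slots
print(seq, par, seq / par)
```

With enough thread slots the dependency chain dominates, and the speed-up for a long loop approaches 56/17 = 3.29 from below.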
4 Related Works

There have been a number of attempts to hide long latencies by multithreading. The HEP[9], MASA[10], Horizon[11] and P-RISC[12] machines interleave instruction streams on a cycle-by-cycle basis. In such machines a thread cannot keep more than one instruction in the pipeline at a time, so single-thread performance is poor, and the utilization of the processor drops when only a small number of threads is active. APRIL[13] takes the other approach: a thread keeps running until it encounters a long-latency event such as a remote memory access, and only then is the context switched. This block interleaving preserves single-thread performance, but it does not hide the latencies of operations within a thread. A dataflow/von Neumann hybrid architecture has also been proposed by Iannucci[14], and Weber and Gupta[15] examine the benefits of multiple hardware contexts in a shared-memory multiprocessor.

Daddis and Torng[17] propose a superscalar architecture that executes two instruction streams concurrently: instructions from both streams are fetched into a common window and issued together, which increases the throughput of the processor because the window is better filled. Farrens and Pleszkun[16] likewise examine strategies for improving processor throughput by interleaving streams. Prasadh and Wu[18] report on a multithreaded architecture similar to ours, which issues instructions from four threads to multiple functional units; to sustain the utilization of its eight functional units, however, it requires an instruction fetch bandwidth of thirty-two words per cycle, and the cost of such bandwidth must be considered. In our architecture the required fetch bandwidth is far smaller, as discussed in section 3.3.
5 Concluding Comments

In this paper, we proposed a multithreaded processor architecture which issues instructions from multiple threads to multiple functional units simultaneously, and evaluated it by simulating a parallelized ray-tracing program. We ascertained that, using a two-load/store-unit machine as the base, the processor achieves a 2.02 times speed-up over sequential execution with a two-threaded architecture, and a 3.72 times speed-up with a four-threaded architecture. We also examined the effectiveness of static code scheduling and of the eager execution of loop iterations combined with the multithreading technique, and the comparison with superscalar organizations suggests that, for programs whose single-stream parallelism is poor, multithreading is the more cost-effective way to use multiple functional units.

This evaluation, however, is based on a single application program, and the speed-up is achieved only when coarse-grained parallelism is available. One weak point of our scheme is that it parallelizes only loops; many other programs remain difficult to parallelize. We are currently working on a detailed design of the processor, and we should further investigate the effects of caches, static scheduling algorithms best suited to a multithreaded processor, and the possibilities of parallelizing a wider variety of application programs, in order to confirm the effectiveness of the architecture.
References

[1] H. Hirata, Y. Mochizuki, A. Nishimura, Y. Nakase, and T. Nishizawa, "A Multithreaded Processor Architecture with Simultaneous Instruction Issuing," In Proc. of Intl. Symp. on Supercomputing, Fukuoka, Japan, pp. 87-96, November 1991.

[2] R.M. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM Journal of Research and Development, vol. 11, no. 1, pp. 25-33, January 1967.

[3] A. Munshi and B. Simons, "Scheduling Sequential Loops on Parallel Processors," IBM Technical Report 5546.

[4] E.G. Coffman, Computer and Job-Shop Scheduling Theory, John Wiley and Sons, 1976.

[5] R.F. Touzeau, "A Fortran Compiler for the FPS-164 Scientific Computer," In Proc. of the SIGPLAN'84 Symp. on Compiler Construction, pp. 48-57, June 1984.

[6] M. Lam, "Software Pipelining: An Effective Scheduling Technique for VLIW Machines," In Proc. of the SIGPLAN'88 Conf. on Programming Language Design and Implementation, pp. 318-328, June 1988.

[7] M.D. Smith, M.S. Lam, and M.A. Horowitz, "Boosting Beyond Static Scheduling in a Dynamically Scheduled RISC Architecture," In Proc. of the 17th Annual Intl. Symp. on Computer Architecture, pp. 344-354, May 1990.

[8] K. Murakami, N. Irie, M. Kuga, and S. Tomita, "SIMP (Single Instruction Stream / Multiple Instruction Pipelining): A Novel High-Speed Single-Processor Architecture," In Proc. of the 16th Annual Intl. Symp. on Computer Architecture, pp. 78-85, May 1989.

[9] B.J. Smith, "A Pipelined, Shared Resource MIMD Computer," In Proc. of the 1978 Intl. Conf. on Parallel Processing, pp. 6-8, August 1978.

[10] R.H. Halstead Jr. and T. Fujita, "MASA: A Multithreaded Processor Architecture for Parallel Symbolic Computing," In Proc. of the 15th Annual Intl. Symp. on Computer Architecture, pp. 443-451, June 1988.

[11] M.R. Thistle and B.J. Smith, "A Processor Architecture for Horizon," In Proc. of the Supercomputing Conf., pp. 35-41, November 1988.

[12] R.S. Nikhil and Arvind, "Can Dataflow Subsume von Neumann Computing?," In Proc. of the 16th Annual Intl. Symp. on Computer Architecture, pp. 262-272, June 1989.

[13] A. Agarwal, B.H. Lim, D. Kranz, and J. Kubiatowicz, "APRIL: A Processor Architecture for Multiprocessing," In Proc. of the 17th Annual Intl. Symp. on Computer Architecture, pp. 104-114, June 1990.

[14] R.A. Iannucci, "Toward a Dataflow/von Neumann Hybrid Architecture," In Proc. of the 15th Annual Intl. Symp. on Computer Architecture, pp. 131-140, June 1988.

[15] W.D. Weber and A. Gupta, "Exploring the Benefits of Multiple Hardware Contexts in a Multiprocessor Architecture: Preliminary Results," In Proc. of the 16th Annual Intl. Symp. on Computer Architecture, pp. 273-280, June 1989.

[16] M.K. Farrens and A.R. Pleszkun, "Strategies for Achieving Improved Processor Throughput," In Proc. of the 18th Annual Intl. Symp. on Computer Architecture, pp. 362-369, May 1991.

[17] G.E. Daddis Jr. and H.C. Torng, "The Concurrent Execution of Multiple Instruction Streams on Superscalar Processors," In Proc. of the 1991 Intl. Conf. on Parallel Processing, pp. I:76-83, August 1991.

[18] R.G. Prasadh and C. Wu, "A Benchmark Evaluation of a Multi-Threaded RISC Processor Architecture," In Proc. of the 1991 Intl. Conf. on Parallel Processing, pp. I:84-91, August 1991.