Single Instruction Michael Butler, Tse-Yu Department and Computer Mitch Is Greater Alsup, Hunter Microprocessor Since Recent (less studies than have two cycle) streams. Since the should it is important ists. study benchmarks repeat the hardware we get similar all constraints semantics allelism and amount the amount constraints except struction stream is properly to design hand, of these and Johnson measured the performance by the unconstrained much per now, cycle cycle. Finally, single we show that one can sustain on a processor vent if the from that results processor, is rea- today. The increasing research into ements way to to ways use these tional units only lelism present studies [3, case. cluded ecute 14, More that scientific a real 15, recently, insufficient than circuits large chip. suggested some two replicate instructions logic when el- all constraints multiple is exists processors per it. this researchers[l, ef- in that 2] have in real, that will per tions con- allows ex- chines ACM 0-89791 -394-9/91 /0005/0276$1 .50 data pre- on the execution On the except those of 17 instructions the hardware of other we have an execution flow, re- found per cy- is properly rate of 2.0 to 5.8 that an abstract is reasonable stream 3 describes our and 4 reports realistic Section to issues. and discusses 1The Nasa7 benchmark parallelism to experiments: machine results the configuraof our simu- case, and several implement 5 presents remarks 2 dismodel influence of various machine The model used by Smith the unconstrained are the the Section execution instruction and discusses the on performance. which in six sections. Section a brief 6 offers the future work was not simulated mark consists of seven independent straints, we omitted these loops. 276 cycle. program, when benchmarks, Section plementation Permission to copy without fee all or part of this material IS granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright nonce and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific perrmssion. did not constraints on a processor is organized the identified. Q 1991 cycle Section and Johnson[l], cycle. per in excess available tested. lations features non- how could today. paper simulator, fact also on an that parallel are removed one can sustain be exploited. Early was have to quantify benchmarks undue of the we show that cusses restricted paral- warrants that This func- technique of parallelism Finally, design to instructions by the semantics instructions One instruction-level parallelism to support of We benchmarks in order with hand, balanced, motivated numbers This of applications 16] has performance. is to amount applications more utilize single the in 5, VLSI computational elements on if of to improve fective the density mod5] and Jouppi[2]. to sustain two cle. perfor- Hwu[3, of the processor that, than degrees Introduction show it is difficult more quired 1 if the artifacts per- the execution and of these in these have compute-bound measured by Patt model exists programs that it. Our 2.0 in and execution parallelism be exploited in- [l] un- ten of the applications on several studied we have nine have benchmarks those by Smith of par- exploiting We im- degrees for used do this, in optimized, standard when ~equired To influ- to verify parallelism a set of real de facto should it is important exists. We have els, including those balanced, per mance parallelism of available benchmarks. we have found parallelism to 5.8 instructions We of available suite,l the formance show they other become SPEC and code. Group West 78735 parallelism SPEC Shebanow of the processor, a study compiled in the ex- constraints. On the important hardware of the researchers, of 17 instructions most really much dertaken of available execution resource results. how in sin- of the processor, resource are removed in excess sonable the parallelism parallelism previous oft he program, perhaps much differing of the under posed, how work little is available the design we model under the that influence to verify In this that per gle instruction parallelism concluded operations Drive Texas ence the design Michael Two Technology Cannon Austin, Abstract and Memory William 48109-2122 than Incorporated and 6501 of Michigan Michigan Scales, Motorola Engineering Science University Arbor, Parallelism Yeh, and Yale Patt of Electrical The Ann Stream loops. today, discussion some ma- are all of im- concluding we have planned. because Due to this benchtime con- 2 The RDF To exploit whatever stream, facts one that data Processing those instructions namic order they The that from present in of time. execute. appear in the size dynamic the dynamic issue tions can that stream We and call the tion class machine latency the can tency number correspond cpu to design corresponds sue corresponds the niscient, ple, can restriction its The Ia- RDF size and model that all parallelism size the on is- present unbounded, ters level RDF of the in the data parallelism the and (UDF) available exam- RDF hand, model, if we restrict we obtain upper that can support a specific It All of SPEC implementable size, issue execution parameon the floating size and tigate 277 We point the ten effect of latencies, of shorter were several of the instructions. their add, simulated and loads, etc.), a description class. taken The from machines so with M for floating memory latencies, pre– each class is listed however, the t, gcc limitations, and to that pro- with output point L for one machine time) in eqntot million cycles), belong know the instruction (in produced run scalar called all was se- unchanged classes multiply, that all but Johnson[l]. for Each latency instructions cies for the of integer compiler to time (A for floating execution the used Due instruction latencies. and is not and was simulated compiler conventional file. 1.8.5 with shortest run the FORTRAN (i.e. were under benchmarks 2.4 compiler if that code cpp cpp Hills intesuite: matrix300, run The C Rel. MC88100–a as the input and nine SPEC fpppp, (A particular object without its abbreviation point on. for the for Green Data exceptions: are from doduc, benchmark an 1 shows paper architecture. benchmarks benchmark instruction bounds a given on Ii, the Diab turned run Table machine. window influence of the for window compiled set efficient was run an excall the various for most this gee, using or the processor We in the compiled following to exploiting application. flow the several programs tomcatv, instruction cessor.) window of performance that is possible. For example, model gives an upper bound on the performance a machine espresso, when variations we have no impediment an (opti- capability. in point spice2g6, lected its om- For units, presented floating eqntott, its other results optimizations functional to study. functional are also presents with several with performance values unit to of imple- emulating identify differing cost Benchmarks compiler stream. On the Experiments were of instruction unbounded are interesting an unrestricted all above, Nonetheless to the rate unit specifies and realizable. issue specified prediction ecution model model with functional M88000 amount on 3 ger and parame- restriction limitation and The the window the will rate, 3.1 constraints on We models, its changed. on the RDF it is continuin an effort the RDF unbounded, and on supported. the is specified. limitation the instrucand practical reduce al- to the im- Today, group we are concerned parameters benchmarks. the if in addition this is because important to RDF branch is not the be units set architectures, objective, has not in as a speedup brought The In paper, RDF (HPS) its applicability research ultimate of the Substrate architectures. machine, exploits stream. implementation and is corre- cycle. be operation, defined RDF window of operations operation first the of instruc- a single Its RDF In this num- bandwidth. Clearly, of the each The that units, desired a practical buffering memory functional some [3]. to rate of was the operation. every with model set each instant mal) that efficiently developed our that instruction instruction by our capability restrict model Performance performance mentation. be unit that on a specific configuration in a single discovered refined its unit model bound if we further originally of all being improve any can a packet. the with perform RDF cycle specify associated The on latencies one ally an upper an execution for complex we quickly predic- RDF can support a realizable was branch the if functional machine High HPS plementation instruc- as the the mechanism though instruction in that [3]. can dynamic window 1985 in the rate a functional specified those that Finally, parallelism model, that number the the at issue to a realizable We first dy- we obtain we have available RDF a condi- of is less than from flow stream. as long instructions in associated model, ters of graph maximum into group way are issued number flow and and If we couple of a machine to their The stream window be removed data instruction be issued is the entered the into each data the capability predictor, a problem. model sponds an omniscient the maximum in the rate knows dynamic can of instructions by instruction (Instructions RDF stream, when completed. Instructions is the the size. ) The has size not instruc- and retiring is created always will window tions a dynamic for execution execution predictor branch were unit a problem. implementable, instruction into not window a window issuing dynamic stream is such class latencies. of systematically if function were the performance The parameters: have been resolved, after tional parallelism. issue rate tion to a real branch of arti- paradigm by three instructions instruction branch (RDF) instructions scheduling dependencies ber flow a program’s converting devoid of that instruction consists from graph, and in the instruction model utilization It is characterized tions flow the rate, exists an execution restricted size, issue of Execution parallelism needs limit abstract model. Model in we Smith with order of latenand smaller to simulated invesone CedeLkQiWton &da cdlectadh’ M&lk instwoca twe, C cwn~lec VW. FORTSANcom~lec GreenHIII) 100 Description Instruction Exe- Class cution Latency (A) FP Add 6 (M) Multiply 6 FP mul (D) Divide 12 FP dlv and (L) Mem Load 2 Memory loads (S) Mem Store - Memory stores 90 80 FP add, (B) Branch 1 Control (T) Bit 1 Shift, 1 INT Field (I) Integer sub, and convert and INT mul INT div ; and bit testing sub and logic I ❑ S,t Held ❑ Memov Store ❑ MemoryLosd 50 ❑ Branch I 40 a 930i33’ e instmctions add, H Integer P 70 e r 60 c OPS “:”:’:’:’:’ im. ::::::::::: ,,,,,,,,,,: = IN1J7PDIV ❑ INTFPMul ::::::::!: * ■ FPA5d 20 Table 1: Instruction Classes and 0I I I [ I . snT01TEwEs3m FORTRAN com@exGreen W) I I 1 Datacoliactadfrom Ma$kinstmcOon tiawC compIlw Mb, 50 ~ to Latencies I .J .....l ...........I....II I I ..a.. .. . .. .. . . .. . .. .. ...1... i ~ FPWP M;? LtmJ.c Benchmark n ■ ~~ ❑ ES%SS3 Figure 2: Instruction Class Distribution The machine configurations 3. machine Figure Each simulated is specified l__.--l EFFFW tional S MATRIXW retirement units, a Wx2G6 (i.e. L? TWCATV cycle), Branch Memo(y Manwy sit Field Integer Lcxad Store Instruction unit, and 1: Benchmark Code Analysis configuration using floating point Figure 1 shows of each each respond of 33 percent ations; the for to the data can are branches respectively. same nine As seen, simple Figure bars listed integer ALU 28 a stack The Machine instruction tant for the ones the instruction examining we utilization units, instruction of these cle), can the execution we units. functional machine on if functional be of Machine of from unit issued to machines are to Also, configuration (only functional unit contributes to as the two and I. tional de- one incy- upper throughput. 278 can units would service the can be implemented rest of the have multiple 3) the ma- units. One instructions execute from instructions instructions an capabilities from effort was to instruc- In addition, the relative cost was considered. If infrequently could cost, at As be made to service class of instructions frequently machine. functional unit be enhanced. most machines (see Figure execute of of executing of handling configurations used) in hardware mance can functional (frequently remaining can be number is capable capable one contains machines I, one can execute machine to match increase is each and used functional efficient issue I, tion class frequencies, of adding functionality reflect achieve since configuration each such must only unbounded has four units L, S, T, choosing RDF.18 an is ca- machines, of executing For example A, M, D, and B and and are as 4F2M functional chart classes resources classes. the classes imPor- unit Smith/Johnson The that indicate functional of which classes. units 15 particularly the with each identified made is performance frequency on struction frequency simulated. of pendent bound class The functional and Configurations each functional parentheses the is capable as machines In 3.2 configuration, a single UDF one letter. chine benchmark. pipes only instruction oper- in a single so each set of parentheses functional at to the that In described rate, packet load/store class of instructions, all cor- approximately another 2 presents by benchmarks. vertical benchmarks be comprise organized frequency nine nine 8 respectively. dynamic of the the of 3, 3, and divide, the each class, figure. and Iatencies and of of instructions loads percent the class instruction respectively right point multiply, histograms instruction Within the floating add, unit within classes unit in buffer. letters functional issue are issued of the corresponds of executing. each for the instruction pable that to its functional parentheses listed set of func- mechanism, characteristics prefetch are by its size of an instruction of instructions the respect of the Figure the instruction set Class prediction and With NT/FP Div branch mechanism, a group and lNT/FP Mul in Benchmarks Em Eluo-c FPAdd SPCEZ26 mMcAlv the resultant For example, occurring a nominal such, all an addi- at nominal machine perfor- integer ALUs instructions cost relative functional and to the units are Machine Functional Unit Branch Load Irlat.1 I hue Rediction Bate MecL Packet Store Configuration Name PacketRetire #Of S&J4 (A)(M)(D)(L)(SI(B)(TI(f)(f)(L) 85%Syth 4 UtIF ca(A,M,f),fJ,B,’f,~ RDFJSco(~M,D,~~B,TJ) OticfOtdE?1 ~uenti,] hi. No caches Prefetch sion Bufer rates Aligned 4wd, No bank for IO(% hfinb oti~o& I outc40dR lhnliaredFetch All look I Otichk store/load 8 4FZM (A,M,D,[)(SJ Z(L,S,T,I) 1~ 4EZM B85(A,M,D,I)(B,l) 2(L~,T,fJ 85%Byth 1 owdofi 1 un~kmdFetc~ h Order 4 o.t,f orkr untbedFe~h InOhkl 4 OutJD?&r Undigd Fekh same Inode? 6 OutorOr&r Unahgwd FdcII 8 OutofO?fk UnJigiwd Fetch All WAB85 (A,IXMHIVXTN!D 3U.$,TJ)85%Byth 1 h Order 8 OutnlOr&r Undigried Fetch initiate 8P3fJ BE (A,lXfA,f)(Il,f)(T,fXBJ) 2&#,TJ) BedBP in order 8 OutofO&r Unaiigred Fetch ally ● Figure 3: Simulated With respect set io Machine its functional of parentheses unit, and the letters within classes executing. A number exist. In that the RDF.18, unit corresponds tnstruciion ses indicates Configurations that appearing of our abstract an unbounded number indicate unit before copies each the of a set ofparenthe - packet in the simulations 3.3 Smith Silnulation The simulator machines process An is capable as well works (ISIM) unit When a trap must of functional units are rently in the window reads machine trap level the Our The any range of simulation gathers makes renaming anti and achieving execution file rate sim- ● machine, and thus time the can operands, ready Packet Renaming dependencies with the models ● as- branch pre- interrupts. being sim- all instructions cur- and then that execute define limits can exist execu- but the total in the num- machine is considered occupying space until in the it of four waiting for free, executing, at in the window, is retired. states: the size is given Rate into - In- waiting assigned for func- or waiting in the parameter for number of the packet solely by the Buffer buffer machine per A cycle. one cycle are Nor- and the is deter- configuration. - be modeled This machines to the cycle. that a single is set Configuration characteristics. Smith/Johnson in rate issued may per of instructions machine issue indicates can be issued of a group instructions Prefetch This that of and refill we simulated. “re- This the machine is issued Window the prefetch of both, for parameter of packets brought mally elimi- is critical (i.e. instructions execution. parameters be in one consists mined and all to complete, to become Issue packet assumptions: to packets. number simplifying performance unit the statistics. able are mutu- use of clheckpointing instruction it structions processing sim- after precise wait that from number schedul- the window and size - This An the The from to our key time. describes and and simulator. one issue execu- RDF by ISIM. analysis is performed. output high which produced several our very instruction. retirement. MC881OO simulates begins dependency execution simulator and the trace. and then stream data for code in a configuration Register for simulator instruction instruction nates models. object an performs and a wide (i.e. cycle) units issue, ber of instructions machines). the occurs pipelined is encountered, stop are several in the tional to be simulated dynamic ulator of modeling as execution in reads Johnson’s (except to in the win- each for supporting ulated operations performing loads resident completed recovery and as follows: producing ulator and have corresponds miss Process instruction tion, ing, of packet UDF integer of and forwarding are fully are removed Window simple this operation ) in whole There executing hit independent. diction tion of units Instructions the capable stores are both that [6] as a mechanism present. capable found machines, of the functional ver- ly. sumption is capable are if location a new in that functional the parentheses the functional mu!tiple case a single current 100 percent are modeled. modeled functional tired” configuration, to in the assume D caches. forwarding We InOrda We I and conflicts infrequent 1 1 both memory dow. modeled simulator. machines 1 (AJ(M,D,I)(B,93(L3,T,I) 1~ 6F3M 8F3M (.W(KNDJKT,W)W$,TJ) lM% are explicitly of the in An with instruction specified was used only order to size in the match their configurations. No memory stages naming renaming renaming of the at the byte occurred the algorithms renaming simulator, did In we supported granularity level, so infrequently, or the not is performed. actions noticeably but either of the improve the early memory found Functional Unit Configuration capabilities of all functional the configuration file. Each that because compilers, ● reof fined that cuting. performance. 279 by the instruction - The number and units are specified in functional unit is cle- classes it is capable of exe- ● In/out of whether order C)rder instructions for still Execution each allows — i.e. can for slip other This unit. to occur between but functional parameters of the out–of–issue discuss the execution models in turn. functional in–order out–of–order for indicates In–order are executed unit, flag be executed functional operations functional - each respect 4. I Previous For units. Prediction prediction branch accuracy prediction, branches instruction rectly or the This not. Smith/Johnson an actual each of the is made outcome prediction This pointing scribe issue by is stalled models Claas the ecute the compared the until the of clock type used are these latencies given - If a branch is cycles in table 1. machine used This of 3, 3, and 8 for floating the floating point to point add, Units to model of - a machine functional units. are effectively may window allows with an unbounded this functional the Issue whether functional unit they issued are sequence to executing it, that it cannot indicates at most window. at to match unit. all in that between con- time harmonic superscalar pede instructions high ous limit cycle. of four. Each chine work ing benchmark Results was simulated configurations: of Smith no realizable artificial and Johnson[l], constraints obtained faithful (2) on by three the Aside to the previous a model to achieving processor, restricting and the IPC. to four from 280 performance floating and results Johnson, the IPC benchmark machines The performance most independent, for a peak a maximum peak imobvi- is the issue rate of nine allowing per cycle, and of the machine. is comprised since suite. our a particular in report benchmark Smith/Johnson high point performance between unnecessarily. However, execution of four execution in- has been IPC. the four additional a poorly an overly–constrained values the by benchmarks, the entire and units, are issued performance: (3) machine functional are three represent- of of nine reduced sets of ma- details The their of published show show of a scalar performance structions under (1) a model machines rate IPC and prediction processor, executing by the order compilers, integer across a scalar by Smith and com- the results results the Johnson for of due to differ- 12, that comparison machine Several of mean published is divided capable over in Branch with 3.1 IPC and a rough pipelined Simulation Smith will functional all previously and different for is only order after simulations IPC 1.4 and of speedup those is Our 2.1 ex- as in [1]. 4 through are consistent for upon is not possible Figures 1.7 and The executed only completed. comparison and Johnson. there window (synthetic) one notes the instruction unit and be issued issue have a direct To provide if an instruction to a functional instruction The each Thus one in- are set architectures terms or whether stores is 85 percent benchmarks. a particular unit), instructions functional flag to (with a common be assigned following assigned time is constrained to a unique can not This to each functional of assigning issue - are at issue going that Constraint instructions struction instructions Thus, act con- can be placed instructions the are executed the is not in data-dependent ent instruction and Instruction and prefetch issue, cycles). from a mem- boundary. in the though integer model, with machine, even be- of the In this is filled instruction (The removed Loads between the for different instructions Smith aa are cycle. selected configuration. a cycle instructions from was Instruction into ALU our simulations machine, units in are accuracy are pre- 4 instructions unit be scheduled issued buffer one integer integer pletion. la- flag than execute config- results performance models.) be required. by functional in of Smith on a four-instruction-word instructions more SJ4 highest prefetch of model- machine SJ4 figures. as 3 of the not ample, store multiply, This With aa many in any given buffer four only machine aligned as many While Functional simulator words instructions “SF” but 4 through configuration (All the wide Thus, units we exception with Smith/Johnson course ex- 4[1]. in the instruction a single respectively. Unbounded needed smaller to and machine clarity four the (Figures the results demonstrated strained de- latencies only model tencies number The The suffix. divide required of instruction. is the machine latencies it of bringing of a check- These of machine benchmarks simulated, four ory to trace. performance Latency number a given there then by mechanism. Instruction and used as determined fails, resolved. cor- prediction, for cause is generated branch and We will set SJ4 presents machine were sented mechanism nine labeled Johnson’s urations dynamic is predicted real and prediction number the With 1]. With the each Work ing the assumptions configuration in branch is [1]. prediction real branch machine a random whether of branch and real[l the in the types are encountered stream, to determine are two - synthetic is specified As the There supported synthetic file. - machine. for to 12), the curve Branch results units within with unconstrained simulation instruction machine balanced prefetch wide fetch limit, there characteristics that machine configuration, buffer, and limit a sequential load/store execution A significant Smith/Johnson chine rise constraint. source machines configurations. is an The to a performance comprise EQNTW1’I InatructknLevelPamllebm Upper Boond 3(0 lPC of performance degradation imbalance scarcity in If integer operations 38 percent of the instructions integer benchmarks and machine contains integer unit, of much ALU then greater will to expect than because be saturated. our simulations and it is unreasonable two the simply This is in fact — the integer execution time units for the ma- units nearly the the of integer bottleneck. 7- - --_ in by contention Another the bottleneck prefetch respect cache buffer. lines will reduce cussed one per -- busy two prefetch lines from only the i–cache, the sue two window, to fetch new space. The space, instructions. cost of a more features chine to cycle, complex into only the limit the instructions and thus ! the loads occur spect to other strictly all previous the in loads, stores (i.e. been applied in order to maintain 1 p4 memory system. This support of a checkpointed write all in of the a synergistic cantly. changes a machine effect and the with state 1 .“ - A ---------------------------------------- For instance, removing allows greater x 4F2M285 ~t-1 o 9 12345678 from the for 5: Instruction ‘~; 6 the in addition of extra load/store of implementing all of these 30 to 50 percent (.6 to 1.0 IPC) model, across Ii, eqntott, and the four pipes. is a performance integer 13 14 15 Parallelism 16 of ESPRESSO Upper Boud 38 lPC ----------------------------- ..__ m .,.,’ 1 ––– ______________________ KJFB 48$24.4 02$3MR7sc +. Ww Rs --m----------------------------------‘“ c 3 have / signifi- ..__ , ---- 2 * 8F3MP45 I o 6F2M ‘-”------- + ..9=.+::? ::.:: ;:>:.-i,>%<.+?.kg.%:-*-* -*-~--v. ..+..- ,. --, .$, ------A , _- .--,.-— ,-—, —--,-- —.-—-,---,..... ..-. <.--m.. .=- .-. -- -— --- -—. -— -.-. --- load/store #if:-: - 4F2M X 4F2M@85 [ __________________________________ - S&J4 to be gained The P result increase of tested ~1 12345678 over the Smith/Johnson benchmarks 12 * -------- -.. --’ .8__________ 5 are imple- improves Level CCChwtmction Level Padelbm are in as described the sequential 11 before the performance 10 Window Size (packets) , constraint ‘) 8F3M,?45 / - S&J4 re- improvements performance .+.2FWRB after with above ~sF3MlWSF —________________________ of only however, suggested model, * 8F2M 1- ‘4 11 When FD$!8 1 __ —________________________ o 6FW [6] mented _---- ma- constraints buffer Level Panileli,m Up~r Bourn+179uc —__ ——___-__ t is unnecessary, of EQNTOTT - 4F2+4 These order a consistent Parallelism at the issued These Level two execution instructions 16 execution are executed executed. 15 does of the impede issue t4 available buffer. model, sequential and instructions have store) Smith/Johnson 13 ~~ Figure in 12 ,.” is- issue ability ;,-- --_: 5 throughput, Finally, 11 ■ ‘~q ----6 by prefetch- prefetch 10 empty be eliminated controller in the 1 9 4: Instruction ESPSEPSOInfiction it is buffer only unnecessarily issue can of the can - S&J4 567s Figure instance, regardless prefetch is room for prefetch can -X 4F2MB85 Window SJze (packets) improve. buffer inefficiency there If, instructions ::- to is that but the prefetch This ing anytime two next the buffer instructions to a full attempt buffer four ?::::::::, H-J 1234 absence would empty. ::”:: alignment (in the prefetch completely contains due window of the tie-? ‘“- ,.,x’ 1 “z----------------------------------------Y As dis- expanded and perform performance *WW - 4F2M will buffer. were instructions blocks), when buffer not useful shortcoming refilIed buffer -“ o ba- instructions a prefetch :.x-- of and small 2,5 useful by such of the end of basic Another only :__J_J’ instructions distribution cycle four of useful #.,:, *.-.x with middle --_--::_ ~T::&”-----+.’-+*--_+-+ 2 of is aligned -- 8F3M245 ---------------------Jo A;.:ZT=-====:”;::= ~;.~_____._.___ 3 -- ~------ in is the behavior *8FSMR3S.C 0 ff’24JRB ,’ c integer happens in the = F31F.B * 8F3M - _+---------------- resource. buffer the number if the so as to produce ALU ——_________________ ,. I is determined targets a uniform [1], in the branch on average be fetched integer Because any Assuming sic blocks, the to performance to memory, fetched. fetch for --- 5- --------------- “+ largely m-.---v s m-e # the are constantly ,.. 8 ,. m ~,__________________________________ 6- - ----- a speedup the .m, , -.-t” ---_ ——_____ gives only what program 8 in the 9 10 11 12 1S 14 15 16 Window Size (packets) (gee, Figure espresso). 281 6: Instruction Level Parallelism of GCC I LI Instruction Level ParaUeli3mUpper Bound: 162R 8 ,—,—, m m ., ...,.- , . . . .. ,- 6 I y“ ----------------------------------------- i +--__L-----___------_-----___-----___---/ II’ I , P 4 4-i_ W ------______ -s- . . ..m - FDF!3 B 6 e::::::::::i & 8F24A i’ t MATSIX340 Inatmction Level PmlleliamUpperBound 1166WC --- ■ .-* ---m . ...9-* -m 7 7 +--------=~------------------------------- I -. ---------~F2; 1 8 “ -----_ J_---_ --------___-J GeF3h4Fa2f ~8F3MFasJ 5 I ?4 .* 8F3M2$5 c 0 S$w 3 + 6FWJ 3 - 4F2M - 4F2?4 .:::=.::.x:::%% ::r::’~’””~ X 4F2MS8S 2 _-:;;:z’z?’” + 2FW.RB +. 8X$44RB * 8F3W885 c * SF2M --- 2 X 4F2MB45 ~7FG%--------------- &/.i,’,x,,.x4 -S6J4 - S&J4 .,8<-A ------------.. 1 o~ o~ 12345 6789101112131415 16 9 12345678 Window Size (packets) Figure 7: Instruction Level 10 11 12 13 14 15 16 Window Size (packets) Parallelism of XLISP Figure DODUC Instruction Level Parallelism U~er BOUML66 IPC 10: Instruction Level Parallelism of MATRIX300 SPICESG6Ind.mction Lavel Pe.rallelimnUpper Bound 17 IFT 8 , , ... ,. i , ..... , ..,. ... . ..... _____________ _____ ._:::: _______ ,_._ 1 e”’~”’>i’’+—= 7 X mFQF!.3 FCIFL9 ~WSMFSS 5 ,, .-m. ----------------------- --------------- p4 6- %M R8 ““ 3 1 ,# ✍✍✍✍✍✍ . .. ,=. . ..,=F* . . . . . :&’l+?_’:::::’-r ~ ________ 9 10 11 12 13 14 15 ~7’ ~->. --- 8: Instruction FPPPPMruction LA Level Prxallelmm I@r ., 8F3MS85 ✍☛✍☛✍✎ -”>”+ _,.—.,--., 0 6FZM -2”-”+ - 4F% --. ’-;=”=”” --------I X 4F2MB85 -----A-- .<.X-:%’S’”: ..:.:”:”:2:::”:---– -.——— ——---— ——— ——--— ———-. -— ———— --—----— -S3J4 - S&J4 ~ 1 16 12345S78 9 Window Size (packatsj Figure ✿✿✿✿✿✿✿✿✿✿✿✿ .. ..+. ❞ 0 12345678 ✍✍✍ --——————~~:o.+ko”” ,..<.---* -“ 1 o~ ✿✿✿ ✿✿☞✿☛ 2 -+-.*.-.. ..+ ✿✍✍✍✍✍ ✍✍✍ 3 X 4F2MB& ------------------------ -G--’----J , ;’: ✿✿✿✿ .:.. ---------- 2, ‘b- W2MRB ✍✚✍✍ c * 6F?44 - 4F2k4 ,, ~3F3L4R3% ■ I P4 -- SF3Msa5 s’ c * 8F2M 5 ,. s ---- ~----------------------------------- 6 * 8F244 :~ 10 11 12 13 14 15 16 Window Siza (packets) Parallelism of DODUC Figure Bound, 378 WC 11: Instruction Level Parallelism of SPICE2G6 TOMCA3V In&uction Level Pmnllehmo Upper Boud S30I PC 8~ _––______ 7 t ——__--- 6 —______________________________ ,.. .1- .-. a aa t _*-–U––____– _____ 5 t-------------- m I - -— —__--- p4- ■ -----~ —____,-- c I g —______ ---___ --l’_”___ ._____ ‘~ ––__v– - 4F3M X 4F2MS85 _- - S&J4 1 -–-#,,epaf’-”’-”” -&=*_x.. x=:g:x::-:~::?::?:::;;?- m,’ . ------------- ------ 3 –––.+ ,d 2 .,___ -/ ~G3MFB% ‘b- ffW RB —----- ..–. ----——— ——------— –. . . ..–-.. ——---F -. .Z:-.:z:: 9 10 11 12 13 14 15 16 Level Parallelism i –3Z.V2+-L*..:’ ;-_.e .% -. .y:+x===”” ~ ‘. 8F3MSS5 o 6F?M - 4F2M -- —---:x . ..- X 4F2MS85 - S&J4 ‘;:,,Q74.:z:?::::--------------------- / f2345678 Window !lze (packets) 9: Instruction FOFB 1 0/ 12345678 r ~ 8F2M , 1 b~ Figure { ------------------------ ‘:––__‘----------------------- c .“ 2- ------------- – _____________ ‘ ““”””’””’ !---------- p4 + SF?4A -.________ –______ , 0w3MWS$ -J-SF3MZas .’ 3 * 8F2M . *2M Iw —_________________ _,a_–,a 6 ~----------------’ 5 ----{ –––. ~~’” Fcf B 9 10 11 12 13 14 15 16 Window Size (packets) of FPPPP Figure 282 12: Instruction Level Parallelism of TOMCATV We on also the similar to primary that his published limitation on and it parallelism precede that which packets we chose Lack ability to block. integer the floating the Smith/Johnson machines 4,2 The RDF each the of of the highest nine performance a restricted eight eight tion class, move instructions complete on per until capable the of an that Branch results Note of eight model. placed an upper bound model performance for with an the machines issue approaches an size or fetch rate, bound for either tion accuracy ulator provides ticular benchmark. Machine of true This In of all segments absolute upper is the this rate graph, the ranges from parallelism that is value, the that chine 100 percent IPC for the the machine to exists in the The 4.3.1 eight have program units of the Ground the middle machine curves assumptions chine that terest that in schemes. tions drove knowing the Consequently, units (synthetic), and a real model of from are on execution the entered of the a series this a desire set one of to of pick with with of configura- of the 85 following percent removed in these model from the 85 percent 1.5 to 2.8 of ranges (floating that mempredic- 3.6 IPC for the the floating for the significant impact of the per mechanism. limits limita- The average advantage ranges for the prediction, point) integer and of addifrom ap- benchmarks from With 1.6 to 5.0 the real branch 2.4 and 3.4 (integer), is obtained. of floating performance of these cycle the machines in con- character- Performance floating outcome Performance point with point of the Because latencies benchmarks, smaller the experiment for the eight using smaller latencies. With the the branch branch branch branch of between the each 100 and real predicts benchmarks. performance three characteristics: of that one (five) with machines synthetic, issue branch of several These gathered units. point machines history the 5.8 (floating predictor, 283 three branch 3.2 to program. from 1.9 and repeated machine window with mechanism on the and existence experiin real blocks performance predictor. simulated from prediction 1.8 to 5.7 for unit 2.2 to 2.9 IPC predictor, branch from from 85 percent functional for the floating accurate and running suffers we assumed using and ranges dynamically the proximately in- machine rate, and 100 percent prediction The tional conan with branch based size of basic ma- prediction issue ma- ranges to 1.4 to 2.7 IPC presented. prediction. tion The branch size, Smith/Johnson and of better accurate, branch of 12, assumptions. coupled window with percent differs of each 100 instructions based a realistic of functional various selection effect 4 through performance implementable, predictors: The the were are combines ments the under configurations figurations set show configurations Figures with Performance branch (integer) perfor- of the benchmarks configuration synthetic, junction benchmarks, the prediction 85 percent units predic- machine benchmarks. functional of a branch istics nine memory branch balanced are also different percent Middle for use benchmarks. machines each with benchmarks memory and Configurations For point two demonstrate branch is presented Several the traced. 4.3 our effective Performance for the integer floating Performance integer point at IPC units tion. par- constrained 1165 to constraints. 1.9 to 2.2 IPC ory Flow reported 17 than point). sim- Data execution This more percent of a properly buffer with with 100 shown advantages no prefetch from of asymptote Unrestricted case, dependencies. each indicates window an ( UDF). by are or A six-functional-unit restrict of figures, benchmarks it makes machines 85 percent 2.7 to 2.9 IPC Clearly However, and IPC. If we do not top as they are better intensive for Results mance re- is omniscient. machine RDF units, instruction, restrictions performs point sizes because Four-functional-unit is- instruc- Functional of performance of machine of the the window. 4.3.2 rate out-of–order prediction the that full. type other provide issue reduces significant 12, performance an abstract window No approximate only the the regardless is any unrealizable simulation eight. cycle window execution. is This each executing 4 through with cycle. from execution. this shows machine instructions each Figures curve dataflow instructions sues benchmarks, completed mechanism As seen in the machine window of the For benchmarks. the removed is not it impacts on the floating smaller Machine benchmarks, point has by the this to are checkpointing, While in which contrast instruction implementing the is in required in the order instructions as the size. instructions in the This constraint, window all and where as soon for after issued. model This effective only completed, were execution. and a basic units, been out–of–order intro- machine’s have Smith/Johnson ex- I–stream. is within the imposes instruction the the The effectively packet packet Jouppi’s speculative an in limits to IPC), for by that execution two no in issue results execution performance between speculative than In–order based obtained performance dependencies instructions utilize high branches. models and (less execution in–order false simulated data are beyond ducing have Jouppi[2] to a serious of we of impediments machine ecution all note assumptions of on the and latencies, the we functional unit the real branch smaller latency ma- chine ranges marks. from The mains 2.8 to 5.8 on the floating performance on the integer point bench- benchmarks the re- register value issue from unchanged. value the is known. instruction 4.3.3 Lack Analysis ber Performance, clock measured cycle, in instructions can be modeled executed per where clock I is the cycle, is the and dynamic and cycle the dynamic unit compiler chine below of at one issued only p can code are minimized. presented, value branch every issue will issue that clearly of the chine, RDF.18 The and RDF only 3 memory size, and machines The of clock cycles This availability, precliction range model, in which and the first instruction full window three factors and In these have effect factors certain a major circumstances, instructions architectures in a register, play bandwidth carded after The effect may not which the jump a real the since location permit cannot a jump after a prior free). There cache in our which example, of branch sinlula- be discarded. this issue after the of predicting it. There are reduce the frequency branches algorithm. execute. This dis- branch. 11. It is clearly is most branch than 8F3M. greater effects window it is removed a minimum, time The 4-12 issue after so that at small beyond window some efficiency, branch if the instrucThis are likely overflow sizes. by is equal the compiler point, is the window lifetime effect attempt to be avail- or by increasing of window life- instruction it is issued. operands is issued effect of an from having is machine. lifetime can be increased by either code RB instruction instruction for operands wait of prediction Instruction issuance in gcc performance arise when size. from until At the the visible notable omniscient the window by re- of instructions with time or Second, machine as the From instruction misprediction window it instructions, machines: the stalls. until all instruc- 8F3M full issue is issued the exceeds size. must a mispredicted 50 percent must causes issue to branch This However, organization fect. Consider there the win- can be seen in Once other the factors misprediction are code fragments or window the window such losses, size increases Livermore Loops for which will have kernel #5: to DO in 10 i X(i)= to a location be performed data as etc. dominate. un- however, from For beNeu- instruction modeled the number tion static on performance. be known. of branch of reducing 6) for figures prediction machine even eliminating 4, 5, 6, 7, and size increases mis- In a pure no part con- a subsequent (lock-up with prediction to its latency. dow instruction branch cache First, either branch able as an instruction were effectively frequency. this can be minimized the frac- branch arise bank Von mispredicted, stalled by a better retirement. window is ap- factors: num- against allow able to retire effect. into and factor availability, of can in buffer branch branch for remedies measured adds both the of being two Finally, policy, represents by four instruction is assumed. stored this instructions omniscience machine any given for efficiency is determined losses, bounded In for the This traffic effectively instead time window machine reduce is a dichotomy to data branch almost be ma- per issue cycle stall register as a lack memory we can was not to having to schedule execution pointer fetch by of retirement The 8F3M only 8F3M Otherwise, exception can .33 to .72. dynamic issued. RDF is limited as one branch are identical. proximately The the of the data mispredicted (figure were factor and it has been after of view in figures for the performance to that latencies. such efficiency by comparing machine ports. with issue machine operation issue restrictions, tion static seen in the results levels size of the that issued the 8F3M effects there a mispredicted duce the latency performed. The time has the effect ma- optimizations that to deal misprediction the using restriction be emphasized way of misprediction through a superscalar misses, the cache condition is equivalent caused one multiply that no superscalar to access the has missed tions be is- one multiplier) for load the point 1 unless as those will is possible load is determined per can the condition be issued. example, memory Branch a branch be improved such It should branch that such of only It memory From de- machine. beyond has one A compiler reorganize the results most (e.g., Clearly, assistance. could effects of the stream out traffic For This must until tions. by the is program/data restrictions, machines. misses. misses. de- on performance. the maximum conflicts p. by the restrictions due to the presence reduce 6 p and instruction mann can cache We point cache this availability that is no corresponding of instructions Other and Both p is determined (which issue clock per factor, factor. bound factor instruction by functional further the in instructions 1 is determined an upper number 1 instructions. per clock rate no instructions sued in a single the eficiency a restriction and reduce issue stream by example, rate one inclusive. efficiency instruction pendent) clock issue issue issue efficiency and establishes static will zero maximum signer For execution between The static p is the dynamic 6 range The maximum define of instruction etc. machine is issued availability. of instruction tween ipc=I*p*c5 We of instructions flicts, The the jump pointer because as: is known. time 10 until 284 = 2,n Z(i)* CONTINUE (Y(i) - X(i-1)) code relittle ef- This simple loads, recurrence a floating tiply, a store, perhaps degree to execute each a full the state Inside loop, type will The rate dynamic such of latency and fpppp induced to issue the each a minimum itera- For large will out achieve efficiency situations access the Standard presence aspects exhibit this 6 problem. Issues based of implementing on RDF difficult principles issues multaneous data Instruction what delivery delivering them paper multiplier ficult. [I], to the that to issue address a fetch address, of fetching degree four holds cache lines within the Once structions Issue can address should resolved until of branches packet The be issued, in as well branch needs in time problems be decoded caching an instruction in the DIC. instructions, packet can be The DIC, in can store branch the through (DIC) instructions next [10, in- fetch of branch Rather of to thus improving we have to the average per come window increase instructions While we are not decoded window we expect 285 size which the instructions In in this way, window space a different use be increments. in all imple- to eliminate of the one branch This window, would per being limits size of a basic cycle, issued in the available block. increase Allowing the amount numbers of 2.0 to 5.8 instructions data and today. capabilities shows information, limita- instruction instructions the branch a restricted is reasonable stored storing on the perthese is possible only cycle our results from predecoded to is parallelism. bounded prediction assumed with cache, and that performance. branches will on mechanisms valuable it the branch. rates than data order after better the same cycle bandwidth than today. packet With following a limited to more that consume make in- applications execution. and no instructions what up of single the be removed. and Furthermore, a in only of checkpointing, inefficiency simula- substantially. completed instructions waiting this cycle These use have of available tar- fetch. packet two When in whole only capable Our hardware required and vi- [1, 2]. limitations improve window issued, multiple the the removed parallelism address. will in an instruction addition in the of producing the performance are Second, the presence end the 3]. each other of integer and imposed were mentation are not next the next can to implement achieved. be packets completed the many the and to perform cache the the re- techniques processor performance we have from in on large weJave processors while address however, determine circumvented instruction undecoded each can be how what or- per memory excess stated software Packets the in can example, removed sequen- to fetch nol~–sequential to be predicted get calculated can thus the demonstrated stream interests the are removed, stream in fetched, Furthermore, packet as a possibly been constraints, decode. the dif- that two where determine and These after that example, of the fetch has constraints be. For in the today, necessary instructions determining make Another with in the are reasonable with available tions limit shown that formance line. of instructions consistent rates per cycle ex- simultaneously, assuming four would consistency instruction have that, For fetching that lies in quickly packet. engine, size cache a beginning cycle and as proposed lines of the alignment cache a group difficulty next first results Note point delivery technique instructions, guarantees regardless per consistent here. in one clock. four for branches the raw bandwidth instructions cache delivered instruction can with stores. execution 3 instructions and one floating issues cache them, units issues, make several for a superscalar line tion si- determining function these the way of providing several of (e. g., only issue per clock) is a good tial proper caches fully cache Remarks of a single structions these multiple fetching bandwidth restrictions We briefly Given and problem a small be implemented small cache is only processors be fetched, from issue Among delivery is the must Aside lmachine trivial. machine accesses. instructions ecution, are not First, ported Concluding ability a superscalar are instruction to provide less prohibitive, of memory of delivering Several We briefly be used or recode benchmarks Implementation Its ports of providing cyc!le. a conventional would uses one, single unit. This 5 ports. multiple can by cache be used to keep the is to re- instructions multiple quest 6 is bandwidth. backed small sizes. is that per machine memory The ganization of twelve. accesses cache of the packet problem which be used. cost and techniques necessary associative in Table prior machine of the executed doduc two arise. execution for describe given of one clock remedy branch: addresses, implementation a superscalar on the the target Another memory take quickly branch mul- multiple clock will is dependent the only duce the latencies the loop. one of iterations, The and the latencies condition at an issue this .08! just two point On iteration using iteration window number steady each at least compare iteration. it takes However, 1. Since only per require a floating an increment, 8 machine, of 12 clocks ‘n’, and will subtract, 8 instructions iteration. tion, relation point flow issue As rate, levels increase, per cycle suggesting well in that in excess ccmsisteut machine the this 17 to (the 1165 is possible of 5 instructions has with and sizes and In the limit, size and issue rate per that of integration window correspondingly. engine issue our unUDF) range (yet), per [8] cycle. Finally, and perhaps re-emphasizing tained that using sue. code compiler motion support mented today deliverable is that is still that deliver more issue increase than still the instruction will be [10] pro- question. [2] M. Johnson, on Multiple Third N.P. Jouppi, Level tems, for D. R. Ditzel, and Architecture of the Proceedings on Computer “Available on and Patt, Op- [12] and Super- of the Third W. Hwu, New Microarchitecture: Proceedings and a%d Yeh, “, Support Operating [13] Sys- “Adaptive Steve Melvin and Parallelism ware and Software 18th Annual ture, (May Robert Patt, Shebanow, Rationale 18th “HPS, and W. Hwu, Regarding Workshop and HPS, M. A [14] pp.103- Shebanow, High Proceedings Patt, and W. Restricted Minimal 18th “HPSm, Mi- [15] 1985), ture, (June 1986), on Per- computer [16] Hav- Patt, Execution of the on Computers, W%eckpoint (December Repair IEEE 1987), University, Report (June “Super-Scalar No. Processor CSL-TR-89-383, Hard- Proceedings on Computer Nix, of the Architec- John O ‘Donnell, Rodman, computers and “A VLIW Compiler.”, , (August Joseph Available C-33, ArIEEE 1988), Fisher, for Very IEEE 11, (November Edward M. Riseman Inhibition of Jumps.”, IEEE D. J. 967- “Measuring Long Instruction Transactions 1984), for [17] Trans- G. on com- 968-976. Transactions Design”, Stanford 1989). 286 1972), Muraoka, of Operations in Fortran–Like Speedup”, Transactions 1970), 889-895. “The Condition computers and C-21, S–C. M. of on 1972), J. Chen, Simultaneously and Transactions 12, (December and on by Programs IEEE Execution C. Foster, 1405-1411. Y. S. Tjaden IEEE Caxton Parallelism Kuck, Parallel pp.1496- and Potential Number sulting tober Johnson, Fine- Combined Scheduling Architectures”, cutable Architec- 1514. Technical on Nicolau ers C-21, Machines”, “Exploiting Robert for a Trace Alexandru the pp.297-307. and Y.N. Out-of-order a High Architecture Proceedings Symposium Patt, Through and Paul the Parallelism Annual (December Flow Yale “Critical Performance of the Functionality.”, Annunal Hwu Hwu, Data Predic- of Michigan, 1991) 12, (December 13th [7] W.M. pp. 979. Workshop 1985), on Microprogramming, formance actions 1987), Introduc- Annual (December A PP.109-116. W.W. MicroSympo- Branch Techniques”, Colwell, Word croarchitecture.”, [6] (June University Symposium Papworth, puters ing McLellan, CRISP Annual Training Report, Grained David 108. [5] Y.N. H.R. of the l./th Architecture, Technical chitecture M. of the Microprogramming, Issues pp. In- pp.272-282. tion.”, [4] Y.N. Sym- 1987), (1991). Instruction- Architectural Languages 1989), Proceedings Superscalar Conference Tse–Yu Transactions on Annual (June A. D. Berenbaum Hardware tion pp.290-302. Proceedings Machines.”, (April [3] Y.N. of the ldth Architecture, Issue Pipelined on Architec- Languages 1989), D. Wall, Programming Horowitz, Issue”, Programming Parallelism M.A. Conference (April and ternational for Instruction for Systems, pipelined and International Support erating “Instruction Interuptable Proceedings processor.”, [11] tural 1989), 309-319. Smith, “Limits 12, (December S. Vajaperyam, on Computer “The References of the 38, No. and Transactions 27-34. perfor- stream and Processors.”, sium [1]M.D. Parallelism IEEE for High-Performance posium imple- to what G. S. Sohi Logic further. be twice [9] With execu- can Performance”, Vol. Distribution Machine density, further. the limits by single an open pp.1645-1658. processors in [1, 2], and tomorrow on Computers, to Nonuniform and on is- granularity should Effect for superscalar higher larger Its ob- assistance “The P. Jouppi, been to increase [12], performance suggested cessors provide Norman of Instruction-Level is worth have compiler produce line it results optimized to to bottom mance not our can be expected units The of importantly, architecture-specific performance tion all compilers With perform most Flynn, Independent computers “On Exe- Their Re- on comput- 1293-1310. “Detection and Instructions”, C-19, 10, (Oc-