PROCEEDINGS OF THE IEEE, VOL. 54, NO. 12, DECEMBER 1966

Very High-Speed Computing Systems

MICHAEL J. FLYNN, MEMBER, IEEE

Abstract: Very high-speed computers may be classified as follows:

1) Single Instruction Stream-Single Data Stream (SISD)
2) Single Instruction Stream-Multiple Data Stream (SIMD)
3) Multiple Instruction Stream-Single Data Stream (MISD)
4) Multiple Instruction Stream-Multiple Data Stream (MIMD).

"Stream," as used here, refers to the sequence of data or instructions as seen by the machine during the execution of a program.

The constituents of a system (storage, execution, and instruction handling, i.e., branching) are discussed with regard to recent developments and/or systems limitations. The constituents are discussed in terms of concurrent SISD systems (CDC 6600 series and, in particular, IBM Model 90 series), since multiple stream organizations usually do not require any more elaborate components. Representative organizations are selected from each class and the arrangement of the constituents is shown.

Manuscript received June 30, 1966; revised August 16, 1966. This work was performed under the auspices of the U. S. Atomic Energy Commission. The author is with Northwestern University, Evanston, Ill., and Argonne National Laboratory, Argonne, Ill.

INTRODUCTION

Many significant scientific problems require the use of prodigious amounts of computing time. In order to handle these problems adequately, the large-scale scientific computer has been developed. This computer addresses itself to a class of problems characterized by having a high ratio of computing requirement to input/output requirements (a partially de facto situation caused by the unavailability of matching input/output equipment). The complexity of these processors, coupled with the advancement of the state of the computing art they represent, has focused attention on scientific computers. Insight thus gained is frequently a predictor of computer developments on a more universal basis.

This paper is an attempt to explore large scientific computing equipment, reviewing possible organizations starting with the "concurrent" organizations which are presently in operation and then examining the other theoretical organizational possibilities.
ORGANIZATION

The computing process, in its essential form, is the performance of a sequence of instructions on a set of data. Each instruction performs a combinatorial manipulation (although, for economy, subsequencing is also involved) on one or two elements of the data set. If the element were a single bit and only one such bit could be manipulated at any unit of time, we would have a variation of the Turing machine: the strictly serial sequential machine.

The natural extension of this is to introduce a data set whose elements more closely correspond to a "natural" data quantum (character, integer, floating point number, etc.). Since the size of the datum has increased, so too has the number of combinatorial manipulations that can be performed (manipulations on two n-bit arguments have 2^(2n) possible outcomes). Of course, attention is restricted to those operations which have arithmetic or logical significance.

A program consists of an ordered set of instructions. The program has considerably fewer written (or stored) instructions than the number of machine instructions to be performed. The difference is in the recursions or "loops" which are inherent in the program. It is highly advantageous if the algorithm being implemented is highly recursive. The basic mechanism for setting up the loops is the conditional branch instruction.

For convenience we adopt two working definitions: Instruction Stream is the sequence of instructions as performed by the machine; Data Stream is the sequence of data called for by the instruction stream (including input and partial or temporary results). These two concepts are quite useful in categorizing computer organizations in an attempt to avoid the ubiquitous and ambiguous term "parallelism." Organizations will be characterized by the multiplicity of the hardware provided to service the Instruction and Data Streams. The multiplicity is taken as the maximum possible number of simultaneous operations (instructions) or operands (data) being in the same phase of execution at the most constrained component of the organization.

Several questions are immediately evident: what is an instruction; what is an operand; how is the "constraining component" found? These problems can be answered better by establishment of a reference. If the IBM 704 were compared to the Turing machine, the 704 would appear highly parallel. On the other hand, if a definition were made in terms of the "natural" data unit called for by a problem, the situation would be equally untenable, since in many problems one would consider a large matrix of data a unit.

Thus we arbitrarily select a reference organization: the IBM 704-709-7090. This organization is then regarded as the prototype of the class of machines which we label:

1) Single Instruction Stream-Single Data Stream (SISD).

Three additional organizational classes are evident:

2) Single Instruction Stream-Multiple Data Stream (SIMD)
3) Multiple Instruction Stream-Single Data Stream (MISD)
4) Multiple Instruction Stream-Multiple Data Stream (MIMD).

Before continuing, we define two additional useful notions.

Bandwidth is an expression of time-rate of occurrence. In particular, computational or execution bandwidth is the number of instructions processed per second, and storage bandwidth is the retrieval rate of operand and operation memory words (words/second).

Latency or latent period is the total time associated with the processing (from excitation to response) of a particular data unit at a phase in the computing process.

Thus far, categorization has depended on the multiplicity of simultaneous events at the system's component which imposes the most constraints. The ratio of the number of simultaneous instructions being processed to this constrained multiplicity is called the confluence (or concurrence) of the system.

Confluence is illustrated in Fig. 1 for an SISD organization. Its effect is to increase the computational bandwidth (instructions processed/second) by maximizing the utility of the constraining component (or "bottleneck").

[Fig. 1. Concurrency and instruction processing. Each instruction passes through the phases: generate address of instruction, fetch instruction, decode instruction, generate address of operand, fetch operand, execute instruction; instructions 1, 2, 3, ... start on successive cycles.]
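A minimal sketch of this classification and of the confluence ratio, with the stream counts and instruction counts chosen only for illustration:

```python
# Illustrative sketch: classify an organization by the multiplicity of its
# instruction and data streams, and compute confluence as the ratio of
# instructions simultaneously in process to the multiplicity of the most
# constrained component. All names and numbers here are chosen for the example.

def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the taxonomy label implied by the two stream multiplicities."""
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"

def confluence(simultaneous_instructions: int, constrained_multiplicity: int) -> float:
    """Ratio of simultaneous instructions to the bottleneck's multiplicity."""
    return simultaneous_instructions / constrained_multiplicity

if __name__ == "__main__":
    print(flynn_class(1, 1))   # SISD, e.g., the reference organization
    print(flynn_class(1, 8))   # SIMD
    print(flynn_class(4, 1))   # MISD
    print(flynn_class(4, 4))   # MIMD
    # A confluent SISD machine with 5 instructions in different phases of a
    # single-decoder pipeline has confluence 5/1 = 5.
    print(confluence(5, 1))
```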
The processing of the first instruction proceeds in the phases shown. In order to increase the computational speed, we begin processing instruction 2 as soon as instruction 1 completes its first phase. Clearly, it is desirable to minimize the time in each phase, but there is no advantage in minimizing below the time required by a particular phase or mechanism. Suppose, for a given organization, that the instruction decoder is an absolutely serial mechanism with resolution time Δt. If the average instruction preparation and execution time is t_a, then the computational bandwidth may be improved by the factor t_a/Δt. Thus if only one instruction can be handled at a time at the bottleneck, then the maximum achievable performance is 1/Δt (1 being the multiplicity of the constraint).
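A toy model of this overlap, assuming hypothetical phase names and times, with the decoder taken as the constraining mechanism as in the text:

```python
# Illustrative sketch of the overlap (confluence) argument; the phases and
# their times are hypothetical, not measurements of any machine described here.
# The decode phase is made the slowest so that it is the constraining mechanism.
PHASES = [
    ("generate instruction address", 1),
    ("fetch instruction", 2),
    ("decode instruction", 3),   # the assumed serial bottleneck, delta_t
    ("generate operand address", 1),
    ("fetch operand", 2),
    ("execute", 2),
]

def serial_time(n_instructions: int) -> int:
    """Each instruction runs all phases to completion before the next starts."""
    t_a = sum(t for _, t in PHASES)
    return n_instructions * t_a

def confluent_time(n_instructions: int) -> int:
    """A new instruction enters whenever the bottleneck frees up, so the
    steady-state rate is one instruction per bottleneck resolution time."""
    t_a = sum(t for _, t in PHASES)
    delta_t = max(t for _, t in PHASES)  # the constraining phase (decode here)
    return t_a + (n_instructions - 1) * delta_t

if __name__ == "__main__":
    n = 1000
    print(serial_time(n), confluent_time(n))
    # The ratio approaches t_a / delta_t for large n, as argued in the text.
    print(serial_time(n) / confluent_time(n))
```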
In order to process a given number of instructions in a particular unit of time, a certain bandwidth of memory must be achieved to insure an ample supply of operands and operations in the form of instructions. Similarly, the data must be operated on (executed) at a rate consistent with the desired computational rate.

CONSTITUENTS OF THE SYSTEM

We will treat storage, execution, and instruction handling (branching) as the major constituents of a system. Since multiple stream organizations usually represent multiple attachments of one or more of the above constituents, we lose no generality in discussing the constituents of a system as they have evolved in SISD organizations. Indeed, confluent SISD organizations, by their nature, must allow arbitrary interaction between elements of the data stream or instruction stream. Multiple stream organizations may limit this interaction, thus simplifying some of the considerations in an area. Therefore, for now, we shall be mainly concerned with techniques to extend computational performance in the context of a confluent SISD system. We will further assume that the serial mechanism which constrains the organization is the decoding of instructions. Thus, on the average, the processing of one instruction per decode cycle will be an upper limit on performance. The relationship between the constituents is shown in Fig. 2 for the SISD organization.

[Fig. 2. SISD organization: storage supplies the instruction stream to the instruction handling unit and the operand stream to the execution unit, linked by the storage and execution bandwidths.]

A. Storage

The instruction and data streams are assumed sequential. Thus, accessing to storage will also be sequential. If there were available storage whose access mechanism could be operated in one decode cycle and this accessing could be repeated every cycle, the system would be relatively simple. The only interference would develop when operands had to be fetched in a pattern conflicting with the instructions. Unfortunately, accessing in most storage media is considerably slower than decoding. This makes the use of interleaving techniques necessary to achieve the required memory bandwidth (see Fig. 3). In the interleaving scheme, n memories are used. Words are distributed in each of the memories sequentially (modulo n): in memory i is stored word i, n + i, 2n + i, etc. However, the accessing mechanism does not alter the latency situation for the system, and this must be included in the design.

Memory bandwidth must be sufficient to handle the accessing of instructions and operands, the storage of results, and input-output traffic [9]. The amount of input-output bandwidth is problem-dependent. Large memory requirements will necessitate transfer of blocks of data to and from memory. The resulting traffic will be inversely proportional to the overall size of the memory. However, an increase in the size of the memory to minimize this interference also acts to increase the bulk of the memory, and usually the interconnection distance. The result, then, may be an increase in the latency caused by these communications problems.

Figure 3 was generated after the work of Flores [9] and assumes completely random address requests. The "waiting time" is the average additional amount of time (over the access time) required to retrieve an item due to conflicting requests. Since memory requests are usually sequential for instructions and at least regular for data, the result is overly pessimistic for all but heavily branch-dependent problems. Typical programs will experience about half the waiting time shown.

[Fig. 3. Interleaving memory units: each unit's cycle consists of access plus regeneration latency, with n units busy over one memory cycle (n·Δt) [7]; waiting time as a fraction of the memory cycle is plotted against the number of memory units divided by the demand rate, assuming a demand rate of n words per memory cycle and random address requests.]
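A minimal sketch of the modulo-n distribution just described, and of how a conflicting request pattern defeats it; the bank count and addresses are arbitrary:

```python
# Minimal sketch of modulo-n interleaving; bank count and addresses are
# arbitrary choices for illustration, not parameters of any machine.

N_BANKS = 4  # "n memories"

def bank_of(address: int) -> int:
    """Word i is stored in memory i mod n, so consecutive addresses rotate
    through the banks and sequential streams keep all banks busy."""
    return address % N_BANKS

def bank_conflicts(addresses) -> int:
    """Count requests that hit a bank already busy in the same memory cycle;
    a crude stand-in for the 'waiting time' caused by conflicting requests."""
    busy = set()
    conflicts = 0
    for a in addresses:
        b = bank_of(a)
        if b in busy:
            conflicts += 1
        busy.add(b)
    return conflicts

if __name__ == "__main__":
    sequential = list(range(0, 8))       # instruction-like, regular accesses
    strided = list(range(0, 32, 4))      # stride equal to n: worst case
    print(bank_conflicts(sequential[:4]), bank_conflicts(strided[:4]))  # 0 3
```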
This latency in the system greatly increases the complexity of the control mechanism for the storage system. The storage unit must now include queuing mechanisms to organize fetch and store requests which are in process or could not be honored due to conflicts. These queuing registers, containing outstanding service requests, must be continually compared against new requests for service. Extensive control interlocks must be made available to serve at least the following functions [2], [3]:

1) direct the fetched word to the appropriate requestor;
2) prevent out-of-sequence fetch-store or store-fetch in the same memory location;
3) eliminate duplicate requests of the same memory location (especially where the memory location contains more than one instruction or data unit);
4) analyze "keys" or "boundaries" where fetch or store memory protection is used.

In addition, confluent computing systems frequently employ buffers to minimize traffic requirements on the memory [2]. An instruction buffer might contain the next n instructions in the sequence as well as a history of m instructions, together with several levels of alternate paths of branch. The presence of a historical picture of instructions (in the instruction buffer) allows for the opportunity to store small loops, thus avoiding the penalty of reaccessing instructions. An operand buffer is used by the execution unit as its intrinsic storage medium. The instruction unit would restructure instructions that called for operands from memory into instructions which call for the contents of these registers. The instruction unit would, after decoding the instruction, initiate the requests for the appropriate operands and direct their placement into the operand buffer. Thus the execution unit would act as an independent computer, whose storage would be limited to the contents of this buffer, and whose instructions would be in a shortened format.

B. Execution

We center our discussion on the floating-point instructions since they are the most widely used in large scientific computing, and require the most sophisticated execution. In order to achieve the appropriate bandwidth levels, we can repeat the deployment scheme that was used for memory, only here the independent units will be dedicated to servicing one class of instructions. In addition to directly increasing the bandwidth by providing a number of units, the specialization of the unit aids in execution efficiency in several other ways. In particular, each unit might act as a small insular unit of logic, hence minimizing intra-unit wire communication problems. Secondly, the dedicated unit has fewer logical decisions to make than a universal unit. Thirdly, the unit may be implemented in a more efficient fashion.

One may also consider "pipelining" techniques [6] to improve the bandwidth within a particular insular unit, in addition to optimizing an algorithm to minimize latent time. "Pipelining" is a process wherein natural points in the decision-making hardware are sought to latch up intermediate results and resuscitate the use of the unit. The latching may include extra decision elements for storage, but these may also be used in aiding the control of time skew (differences) in the various parallel paths.

1) Floating Add-Subtract Operations: Floating add-subtract operations consist of three basic parts. First, the justification of the fractional parts of the two operands by the amount of the difference of the exponents. Second, the adding of the fractional parts (or appropriate complement). Third, the postnormalization of the fraction if the result has a leading zero in a significant position or an overflow occurs. Because of the decision-making (shifting) problems associated with the exponent handling, a normal latch point for pipelining the two add operations in one unit (duplexing the unit) is at the interface from the preshift into the adder, or after the first level of the adder structure. The floating add class of instruction has been reported [5] to operate in the 120 nanosecond range for a duplexed unit with 56-bit fraction (System/360 format).

2) Floating Multiply [5], [10], [12], [13]: The essential decision process of the multiplication algorithm is the addition of the multiplicand to itself as many times as are indicated by the multiplier. If there are n bits in a multiplier and multiplicand, then the multiplicand must be added to itself n times with a one-place shift before each addition. Standard techniques exist for reducing the number of additions required by encoding the multiplier into a lesser number of signed bits, say n/2 (e.g., the multiplier bits 1111 may be treated as +10000 - 1). Once this is done, Wallace [12] suggests the direct addition of the n/2, n-bit, shifted multiplicands. The basic decision element in adding each bit of the n/2 operands is the so-called binary full adder, which takes three inputs of equal significance and produces two outputs: one of the same significance, the sum, and one of higher significance, the carry. By the associative law, the carry is injected into any available lower level add structure of the appropriate significance. (The net effect is sometimes referred to as a flush adder.) The final result, of course, is the reduction of the n/2 multiples into two result segments, the sums and the carries, which are then assimilated by a conventional carry-propagate adder with an anticipation mechanism.
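A small sketch of this 3-to-2 ("carry-save") reduction, using a bitwise full adder to collapse rows of shifted multiples into a sum word and a carry word before one final propagate addition; the operand values and width are arbitrary:

```python
# Illustrative sketch of the reduction described above: a full adder takes
# three bits of equal weight and yields a sum bit of the same weight plus a
# carry of higher weight, so rows of partial products can be reduced 3-to-2
# until only sums and carries remain for a single propagate addition.

def full_adder(a: int, b: int, c: int) -> tuple:
    """Return (sum_bit, carry_bit) for three bits of equal significance."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry

def carry_save_add(x: int, y: int, z: int, width: int = 16) -> tuple:
    """Reduce three operands to a (sum_word, carry_word) pair, bit by bit."""
    sum_word, carry_word = 0, 0
    for i in range(width):
        s, c = full_adder((x >> i) & 1, (y >> i) & 1, (z >> i) & 1)
        sum_word |= s << i
        carry_word |= c << (i + 1)   # carries are injected one place higher
    return sum_word, carry_word

def reduce_partial_products(rows: list) -> int:
    """Repeatedly apply 3:2 reduction, then assimilate with one ordinary add
    (standing in for the carry-propagate adder with anticipation)."""
    while len(rows) > 2:
        x, y, z, *rest = rows
        s, c = carry_save_add(x, y, z)
        rows = rest + [s, c]
    return sum(rows)

if __name__ == "__main__":
    multiplicand, multiplier = 13, 11           # 13 * 11 = 143
    rows = [multiplicand << i for i in range(4) if (multiplier >> i) & 1]
    print(reduce_partial_products(rows))        # 143
```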
The basic problem is the implementation of this algorithm. For large n, the plane segments that partition the algorithm are inconsistent with conventional packaging sizes and interconnection limitations (see below). Consider a simpler variation [Fig. 4(b)] [5]: here only n/m multiples of the multiplicand are retired per iteration (m iterations are required). The n/m multiples are decoded into n/2m additions of the multiplier. These multiples are then inserted into the first level of the adder tree [Fig. 4(b)]. Now the add assimilation of both carries and sums proceeds as before, with the introduction of intermediate staging points where the results are temporarily latched. The storage points act as skew (relative timing) control and improve the overall execution bandwidth efficiency of the algorithm. After the first set of multiples has been inserted into the first level of the tree and assimilated, it is latched at the first latch point. The storage point serves to decouple the first level from subsequent levels of the hardware; therefore the first set of multiples may now proceed to be absorbed in the second level and, simultaneously, a second set of multiples may be inserted into the first level of the tree. The process continues and, at the bottom of the tree, the carries and sums are assimilated in an adder with carry propagation. Notice that the latency in this variation may be twice that of the original algorithm. Also notice that potential bandwidth in this algorithm has not suffered, since a second multiplication may proceed independently as soon as the last set of multiples is inserted and has passed the first level of the tree. Implementations of this second scheme have yielded a performance of 180 nanoseconds for a floating-point multiply including postshifting and exponent updating (56-bit fraction, System/360 format).

Figure 4, parts (a) and (b), are three-dimensional representations of the second variation of the multiply algorithm. Figure 4(a) is an isometric projection and Fig. 4(b) is a cross-section (or "regular cutting plane") and profile view. This representation was chosen to illustrate some of the difficulties of implementing high-speed systems in general. It is well known [1] that propagation delay and nonuniform transmission line loading are major factors in the switching speed of a logic stage. One would, therefore, desire a physical package consistent with the "natural" communication pattern of the algorithm so that the implementation could be optimized. The difficulties of a planar package are obvious. Communications between "regular cutting planes" [profile view, Fig. 4(b)] must be made axially in the physical implementation. Indeed, there is no assurance that the capacity of the physical package-plane will match the requirements of the regular cutting plane; several package-planes may be required. Notice that the first variation of the algorithm had much more extensive requirements per plane, thus its implementation would very likely be less efficient than the second variation.

[Fig. 4. Outline of a multiplier tree: registers containing the multiples feed a tree of latching full adders and a carry-propagate adder (including carry anticipation), with partial sum accumulation; unit interconnections are shown on the regular cutting plane, and the profile view shows planes i-1, i, i+1.]
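A toy cycle count for the staged tree, assuming hypothetical values for the number of latch levels and iterations, illustrating that the latency roughly doubles while throughput stays at one multiplication per iteration period:

```python
# Toy timing model for the latched multiplier tree discussed above.
# All cycle counts are hypothetical; the point is only that adding latch
# levels lengthens the latency of a single multiplication while a new
# multiplication can still enter once its multiples pass the first level.

TREE_LEVELS = 4        # latched levels in the adder tree (assumed)
ITERATIONS = 4         # m iterations, n/m multiples retired per iteration

def latency_cycles() -> int:
    """One multiplication: feed m sets of multiples, then drain the tree."""
    return ITERATIONS + TREE_LEVELS

def total_cycles(n_multiplies: int) -> int:
    """Back-to-back multiplications: a new one starts every ITERATIONS cycles."""
    return latency_cycles() + (n_multiplies - 1) * ITERATIONS

if __name__ == "__main__":
    print(latency_cycles())           # 8 cycles for a single multiply
    print(total_cycles(100) / 100)    # ~4 cycles per multiply in steady state
```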
3) Floating Point Divide [5], [10], [11]: Historically, divide has been limited by the fact that the iterative process was dependent on previous partial results. Recently, attention has been given to techniques which do not have this limitation. These techniques include the use of Newton-Raphson iterations for series approximations to quotients.

Assume

    Quotient = a/b.

Let

    b = 1 + x.

Then

    1/b = 1/(1 + x) = 1 - x + x^2 - x^3 + ... ± x^n ∓ ...

which can be rewritten as

    1/b = (1 - x)(1 + x^2)(1 + x^4) ... (1 + x^(2^m)).

Thus:

    Quotient = a · (1 - x)(1 + x^2)(1 + x^4) ... (1 + x^(2^m)).

Recall that in binary the factor (1 + x^n) is related to (1 - x^n) by a complementation operation. Thus, the denominator is rewritten as (1 + x), complemented to form (1 - x); the product is (1 - x^2), which is complemented to form (1 + x^2), which continues the development.

Notice that the speed is substantially enhanced by the presence of a high-speed multiplier. In fact, as many as two multiplications might proceed simultaneously, one for the quotient development and one for the next denominator term, if the hardware permits. In forcing quick convergence of the series, usually a table lookup arrangement is used to provide the first approximation for the quotient. Thereafter, subsequent iterations develop double the precision of the first approximation. Floating-point divides under 700 nanoseconds have been reported using this scheme [5].
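A numeric sketch of this development (a simplified iteration without the table-lookup first approximation; the prenormalization assumption is an added convenience):

```python
# Numeric sketch of the series scheme above: write b = 1 + x and repeatedly
# multiply numerator and denominator by the complement-style factor (2 - den),
# driving the denominator toward 1 and the numerator toward the quotient.
# The fixed iteration count and missing table-lookup seed are simplifications.

def series_divide(a: float, b: float, iterations: int = 5) -> float:
    """Approximate a / b for a prenormalized divisor 0.5 <= b < 1."""
    assert 0.5 <= b < 1.0, "assume a prenormalized binary fraction"
    num, den = a, b
    for _ in range(iterations):
        factor = 2.0 - den          # (1 - x) when den = 1 + x; a complement
        num *= factor               # quotient development
        den *= factor               # denominator converges quadratically to 1
    return num

if __name__ == "__main__":
    print(series_divide(0.3, 0.7), 0.3 / 0.7)
```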
4) Algorithms for Achieving Maximum Efficiency in the Concurrent Execution of Independent Units: One of the purposes in having independent units dedicated to individual instruction types is to improve the execution of each of the instruction classes. The other advantage is that these independent units may operate concurrently and serve to increase the overall bandwidth of the execution unit. Tomasulo [4] describes an elegant algorithm to allow the achievement of maximum efficiency in the concurrent execution of units (Fig. 5, illustration from Amdahl [1]). The operation of this algorithm takes advantage of the latent time following the issuing of the request to access the operands from storage into the memory in the execution area (either virtual or addressable buffers). When an instruction is forwarded to the execution area, a word in the execution unit memory is reserved for its operand, but the operand has not yet arrived. For example, if the instruction to be performed consisted of a multiply, then it could be immediately forwarded to the multiplication unit. The multiplier would require a tag representing the operand. The execution area is provided with a common data bus; the tag representing the operand is broadcast on the bus one cycle early, and the execution functional units then examine their queues of required operands and immediately gate in an appropriate one without waiting for it to go to the buffer storage. The tag forwarding frees the buffer storage of having any responsibility for this particular operand. Of course, at the same time, a load could then be executed into that same word. Thus, the second load instruction and the multiply instruction could proceed concurrently in a fashion impossible before. Results, as they become available, might also be forwarded to units (the adder in Fig. 5) requesting action by a similar mechanism, rather than proceeding directly to storage, whether virtual or addressed. This avoids intermediate stops and hence improves the overall efficiency of execution by improving the overlap in the concurrency system.

[Fig. 5. Effect of the Tomasulo concurrency algorithm on a sequence of LOAD, MULT, ADD, STORE, and LOAD instructions involving registers R1 and R2: (a) sequence with conventional dependency; (b) sequence as performed after the tag was forwarded.]
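A much-simplified sketch in the spirit of this tag-forwarding idea (it is not the Model 91 mechanism; the class, station, and tag names are invented for illustration):

```python
# Simplified sketch of tag forwarding over a common data bus: operands in
# flight are named by tags, a producer broadcasts (tag, value), and any
# functional unit waiting on that tag captures the value directly instead of
# waiting for the buffer register to be written first.

class ReservationStation:
    def __init__(self, op, tags):
        self.op = op                        # e.g., "MULT"
        self.waiting = dict.fromkeys(tags)  # tag -> value (None until seen)

    def snoop(self, tag, value):
        """Capture a broadcast operand if this station is waiting for it."""
        if tag in self.waiting and self.waiting[tag] is None:
            self.waiting[tag] = value

    def ready(self):
        return all(v is not None for v in self.waiting.values())

class CommonDataBus:
    def __init__(self, stations):
        self.stations = stations

    def broadcast(self, tag, value):
        for s in self.stations:
            s.snoop(tag, value)

if __name__ == "__main__":
    mult = ReservationStation("MULT", ["T1", "T2"])   # waits on two loads
    add = ReservationStation("ADD", ["T3"])           # waits on MULT's result
    bus = CommonDataBus([mult, add])
    bus.broadcast("T1", 6.0)           # first load arrives from storage
    bus.broadcast("T2", 7.0)           # second load; MULT can now fire
    print(mult.ready(), add.ready())   # True False
    bus.broadcast("T3", 42.0)          # MULT's result forwarded to the adder
    print(add.ready())                 # True
```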
C. Branching

Assume for the moment that memory bandwidth and execution bandwidth have now been arranged so that they more than satisfy the requirement of one instruction processed per decode cycle. What then would limit the performance of an SISD system? If it is assumed that there are a fixed number of data-dependent branch points in a given program, then, by operating in a confluent or other high-speed mode, we essentially bring the branch points closer together. However, the resolution time of the data-dependent branch point is basically fixed for a system with a given execution and/or accessing latency. The basic retrogressive factor is the presence of branch dependencies in the instruction stream. Among the many types of branch instructions, we have [2], [7]:

1) Execute: The operand which is fetched is to be treated as an instruction (or an instruction counter). This presents some problems since the access latency is inserted into the instruction stream. It is essentially the same problem as indirect addressing or a double operand fetch. One could anticipate some of this difficulty by keeping ahead in the instruction stream. However, this particular instruction does not pose a serious degradation problem because its use is normally restricted to linkages.

2) Branch Unconditional: The effective address so generated becomes the contents of the instruction counter. This subclass of instruction presents the same problem as Execute.

3) Branch on Index: The contents of an index register is decremented by one on each iteration until the register is zero, upon which the alternate path is taken. Only the access latency is a factor, since zero can be anticipated.

4) Data-Dependent Branch: The paths of the branch are determined by the condition (sign, bit, status, etc.) of some data cell. Invariably, this condition is dependent on the execution of a previously issued instruction. Here the degradation is serious and unavoidable. First the execution of the condition-generating data must be completed. Then, the test is made and the path is selected. Now, the operands can be fetched and the confluency can be restored. It is presumed that both instruction paths were previously fetched, but note that this is done only at the expense of greater memory bandwidth requirements. Another slight improvement can be made if the operands are fetched for one alternative path, so long as the unresolved branch path is not executed (to avoid serious recovery and reconstruction problems if the guess proves wrong).

Of the two "loop closing" or conditional branch instructions, Branch on Index has less degradation due to resolution of latency than Data-Dependent Branch. When an option exists (as in the performance of a known number of iterations) the programmer should select the former.

Assuming that the Data-Dependent Branch-resolving latency is a constant for any particular machine, we may show its relationship (Fig. 6) to performance by assuming various percentages (of occurrence in executed code) of this type of instruction. The latency represents the sum of average execution time plus operand access time from memory. Figure 6 assumes that the organization under consideration has enough confluency to perform one instruction/cycle on the average without branch-resolution interruptions. To resolve the branch, it is generally necessary to fully execute and test the result of the preceding instruction. During this time, fetching of instructions and data may proceed for one branch path and possibly a few instructions may be fetched for the alternate. Fetching of both paths doubles the bandwidth required and increases the waiting time (Fig. 3). Thus the latency includes an average execution time, test time, and a percentage (wrong path guesses) of the operand-access time.

[Fig. 6. Degradation due to data-dependent branch instructions: effective performance as a fraction of the branch-free rate, plotted against the percentage of occurrence of conditional branch instructions in the instruction stream, for several values of n, the number of cycles of execution and access latency.]

Notice that while we have studied degradation due to latency for the SISD organization, multiple stream organizations may exhibit branch-induced degradation due to either this same phenomenon or an analogous spatial inefficiency in which only one path activates or determines the outcome of a dependency.
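One simple way to model the degradation pictured in Fig. 6 (the exact curves are not reproduced here; the latency values below are hypothetical):

```python
# Simple model consistent with the Fig. 6 discussion: with confluence the
# machine would retire one instruction per cycle, but each data-dependent
# branch adds roughly n cycles of execution-plus-access latency before
# confluence is restored.

def relative_performance(branch_fraction: float, latency_cycles: int) -> float:
    """Fraction of the branch-free instruction rate actually achieved."""
    cycles_per_instruction = 1.0 + branch_fraction * latency_cycles
    return 1.0 / cycles_per_instruction

if __name__ == "__main__":
    for n in (2, 5, 10):   # hypothetical latency values, in cycles
        row = [round(relative_performance(p / 100, n), 2) for p in (0, 5, 10, 20)]
        print(f"n={n}: {row}")   # performance falls quickly as branches grow
```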
CLASSES OF ORGANIZATION

In this section we shall consider each of the organizational classes listed in the first section of the paper. We will remark on or illustrate representative systems that fall into each class. The various configurations are by no means exhaustive.

A. Confluent SISD

The confluent SISD processor (IBM STRETCH [7], CDC 6600 series [8], IBM 360/90 series [2]-[5]) achieves its power by overlapping the various sequential decision processes which make up the execution of the instruction (Figs. 1 and 2). In spite of the various schemes for achieving arbitrarily high memory bandwidth and execution bandwidth, there remains an essential constraint in this type of organization. As we implied before, this bottleneck is the decoding of one instruction in a unit time; thus, no more than one instruction can be retired in the same time quantum, on the average. If one were to try to extend this organization by taking two, three, or n different instructions in the same decode cycle, and no limitations were placed on instruction interdependence, the number of instruction types to be classified would be increased by a combinatorial amount (M different instructions taken n at a time represent M^n different outcomes) and the decoding mechanism would be correspondingly increased in complexity. On the other hand, one could place restrictions on the occurrence of either specified types of instructions or instruction dependencies. This, in turn, narrows the class of problems for which the machine is suitable and/or demands restrictive programming practices. Indeed, this is a characteristic of multiple stream organizations, since the multiplicity (or "parallelism") implies independent simultaneous action.

B. SIMD [14]-[18]

SIMD-type structures have been proposed by Unger [14], Slotnick [15] (SOLOMON, ILLIAC IV), Crane and Githens [16], and, more recently, by Hellerman [17]. SOLOMON is the classic SIMD. There are n universal execution units, each with its own access to operand storage. The single instruction stream acts simultaneously on the n operands without using confluence techniques. Increased performance is gained strictly by using more units. Communication between units is restricted to a predetermined neighborhood pattern and must also proceed in a universal, uniform fashion [Fig. 7(a)]. (Note: SOLOMON has been superseded by ILLIAC IV as a system being actively developed, which is no longer completely SIMD.)

The difficulties with SIMD are:

1) Latency in the instruction stream for SIMD branches is now replaced by latency in the data stream caused by operand communication (forwarding) problems.
2) Presently, the number of classes of problems whose operand streams have the required communication regularity is not well established. SIMD organizations are inconsistent with standard algorithmic techniques (including, and especially, compiler techniques).
3) The universality of the execution units deprives them of maximum efficiency.

C. MISD

These structures have received much less attention [18], [19]. An example of such a structure is shown in Fig. 7(b). It basically employs the high bandwidth dedicated execution unit as described in the confluent SISD section. This unit is then shared by n virtual machines operating on program sequences independent of one another. Each virtual machine has access to the execution hardware once per cycle. Each virtual machine, of course, has its own private instruction memory, and interaction between instruction streams occurs only via the common data memory. Presumably, if there are N instruction units then the bandwidth of the common data storage must be N times greater than that of the individual instruction storage. This requirement could be substantially reduced by use of a modified version of Tomasulo's tag-forwarding algorithm and a separate common forwarding bus.

Another version of this [Fig. 7(c)] would force forwarding of operands. Thus the data stream presented to Execution Unit 2 is the resultant of Execution Unit 1 operating its instruction on the source data stream. The instruction that any unit performs may be fixed (specialized, such that the interconnection or setup of units must be flexible), semi-fixed (such that the function of any unit is fixed for one pass of a data file), or variable (so that the stream of instructions operates at any point on the single data stream). Under such an arrangement, only the first execution unit sees the source data stream, and while it is processing the ith operand the ith execution unit is processing the ith derivation of the first operand of the source stream.
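A sketch of the forced-forwarding arrangement of Fig. 7(c), with three arbitrary unit functions standing in for the fixed instructions; the pipelined overlap between units is not modeled:

```python
# Sketch of forced forwarding: a single source data stream enters Execution
# Unit 1, and each subsequent unit applies its own (here fixed) instruction to
# what the previous unit produced. The unit functions below are arbitrary
# stand-ins chosen for illustration.

from typing import Callable, Iterable, List

def misd_chain(source: Iterable[float],
               units: List[Callable[[float], float]]) -> List[float]:
    """Push each source operand through the chain of execution units in order;
    only the first unit ever sees the raw source data stream."""
    results = []
    for operand in source:
        derived = operand
        for execute in units:      # unit k sees the (k-1)th derivation
            derived = execute(derived)
        results.append(derived)
    return results

if __name__ == "__main__":
    units = [lambda v: v * 2.0,    # Execution Unit 1
             lambda v: v + 1.0,    # Execution Unit 2
             lambda v: v ** 2]     # Execution Unit 3
    print(misd_chain([1.0, 2.0, 3.0], units))   # [9.0, 25.0, 49.0]
```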
[Fig. 7. (a) SIMD: n execution units under one instruction unit, with limited communication between units. (b) MISD: time-shared execution buses serving several instruction units; converts to MIMD if instructions and data are privately maintained (together). (c) MISD with forced forwarding: a source data stream passes through a chain of execution units, each with its own instruction unit, producing a derived data stream.]

D. MIMD

If we reconstruct the organization of Fig. 7(b) so that the data and instruction streams are maintained together in private memories, we have an example of MIMD. There is no interaction, or minimum interaction, allowed between these virtual machines. So long as the latent time for execution of any operation is less than the memory cycle, no problems arise due to branching. Thus, such an arrangement allows maximum advantage of the allowable bandwidths of execution unit and memory unit. Of course, such an approach might well be criticized on the basis that the requirement of independence of the instruction sequences does not address itself to the requirements of large scientific problems. It would be better suited to the needs of the time-sharing and/or utility environment.

This restricted MIMD points up the shortcoming of our organizational definitions. The specifications fail to include a classification of how the streams may interact; thus a restricted MIMD may be organizationally much simpler than a confluent SISD.

General MIMD structures have been more widely described [20]-[23], with large-scale multiplicity being envisioned by Holland [21] and more restricted implementations being undertaken by Burroughs and Univac [23]. In his original proposal, Holland considers an array of processors, each with a one-word storage. The modules are independent and are capable of concurrent execution. The processors have arithmetic ability, but communication is limited by their need to "build a path" to the appropriate operand. In path building, the displacement and direction of the operand is specified and the intervening module processors form a vinculum.

Such an arrangement might solve some of the essential blockage problems of SISD and SIMD, since independent program segments proceed simultaneously. Also, memory bandwidth is always adequate. The difficulties with such an arrangement of MIMD generally include [20]:

1) interconnections between units, which pose serious interference problems;
2) the universal nature of the individual module, which limits its efficiency;
3) the class of problems which could utilize the MIMD organization, which is presently small.
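A toy rendering of the restricted MIMD just described, with threads standing in for independent instruction units and the small programs chosen arbitrarily:

```python
# Toy illustration of restricted MIMD: several "virtual machines", each with
# a private memory and its own instruction sequence, run with no interaction
# between them. Threads stand in for the independent instruction units; the
# mini instruction set and programs are invented for the example.

import threading

def virtual_machine(program, memory):
    """Interpret a private instruction stream against a private memory."""
    for op, reg, value in program:
        if op == "LOAD":
            memory[reg] = value
        elif op == "ADD":
            memory[reg] += value
        elif op == "MUL":
            memory[reg] *= value

if __name__ == "__main__":
    memories = [{"R0": 0} for _ in range(3)]
    programs = [
        [("LOAD", "R0", 5), ("ADD", "R0", 2)],
        [("LOAD", "R0", 3), ("MUL", "R0", 4)],
        [("LOAD", "R0", 1), ("ADD", "R0", 1), ("MUL", "R0", 10)],
    ]
    threads = [threading.Thread(target=virtual_machine, args=(p, m))
               for p, m in zip(programs, memories)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(memories)   # [{'R0': 7}, {'R0': 12}, {'R0': 20}]
```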
SYSTEMS REQUIREMENTS

The effectiveness of the overall computing process must be measured on a basis much larger than performance (nanoseconds per instruction) alone. Even on this primitive basis, it is obvious that different instruction repertoires may imply substantially different effectiveness with the same average nanosecond-per-instruction ratio.

Despite our original assumption that the very large problem does not employ significant input/output, it is soon realized that this requirement is unduly restrictive. Presently, high-speed systems remain so limited. It may be some time before broader effective utilization may be realized through new developments in input/output equipment.

Of course, the overall measure of efficiency of the processor is the number of correctly completed computations and programs done over an extended period of time. Included in this measure is the reliability and the maintainability of the system. Complex systems with very large numbers of components are naturally very difficult to maintain, unless features which provide fault location are included. A major component of such features would be checking (or fault detection) of all operations and data transfers. On detection of error, hardware-aided diagnostics should be provided so that servicing and maintenance might be readily and easily accomplished. If checking is not included in the hardware, then it is, of course, incumbent on the user to program a thorough check of his results. This, of course, represents an overhead which penalizes the effective performance and utilization of the equipment.

Notice that some of the suggested organizations cater to restrictive problem sets, particularly the SIMD, where multiple operands are executed by the same instruction stream. Such organizations are clearly limited for general-purpose processing where there may be many interactions between operand elements. Similarly with the MISD organizations: in the organization shown in Fig. 7(b), a number of independent instruction streams are involved, and their very independence precludes MISD use on one large problem composed of essential dependencies; this, of course, is typical of large scale scientific programming.

CONCLUSIONS

Conventional Single Instruction Stream-Single Data Stream (SISD) processors may be enhanced by concurrency (confluence) of instruction handling, operand acquisition, and execution. There is, however, a limitation of the order of the execution of one instruction per instruction decoding time, and this may be further degraded by the occurrence of "conditional branch" instructions.

By multiplexing the instruction stream or the data stream or both, new classes of processors which do not necessarily share this limit are developed. Their effectiveness depends on the nature of the problem, and it is an open question as to whether new algorithms which will serve to extend their usefulness can be developed.

ACKNOWLEDGMENT

The author is indebted to D. Jacobsohn and R. Aschenbrenner of the Argonne National Laboratory for several valuable discussions on some of the material.

REFERENCES

[1] G. M. Amdahl and M. J. Flynn, "Engineering aspects of large high-speed computer design," Proc. Symp. on Microelectronics and Large Systems. Washington, D. C.: Spartan, 1965, pp. 77-95.
[2] D. W. Anderson, F. J. Sparacio, and R. M. Tomasulo, "The Model 91: machine philosophy and instruction handling," IBM J. Res. and Dev., November 1966.
[3] L. J. Boland, G. D. Granito, A. U. Marcotte, B. M. Messina, and J. W. Smith, "The Model 91 storage system," ibid.
[4] R. M. Tomasulo, "An efficient algorithm for automatic exploitation of multiple execution units," ibid.
[5] S. F. Anderson, J. Earle, R. E. Goldschmidt, and D. M. Powers, "The Model 91 execution unit," ibid.
[6] L. W. Cotton, "Circuit implementation of high-speed pipeline systems," 1965 Proc. AFIPS FJCC, p. 489.
[7] W. Buchholz, Ed., Planning a Computer System. New York: McGraw-Hill, 1962.
[8] J. E. Thornton, "Parallel operation in the Control Data 6600," Proc. AFIPS 1964 FJCC, pt. II, pp. 33-40.
[9] I. Flores, "Derivation of a waiting-time factor for a multiple bank memory," J. ACM, vol. 11, pp. 265-282, July 1964.
[10] M. Lehman, D. Senzig, and J. Lee, "Serial arithmetic techniques," Proc. AFIPS 1965 FJCC, pp. 715-725.
[11] R. E. Goldschmidt, "An algorithm for high-speed division," M.S. thesis, Mass. Inst. Tech., Cambridge, June 1965.
[12] C. S. Wallace, "A suggestion for a fast multiplier," IEEE Trans. on Electronic Computers, vol. EC-13, pp. 14-17, February 1964.
[13] D. Jacobsohn, "A suggestion for a fast multiplier," IEEE Trans. on Electronic Computers (Correspondence), vol. EC-13, p. 754, December 1964.
[14] S. H. Unger, "A computer oriented toward spatial problems," Proc. IRE, pp. 1744-1750, October 1958.
[15] D. L. Slotnick, W. C. Borck, and R. C. McReynolds, "The Solomon Computer - a preliminary report," Proc. 1962 Workshop on Computer Organization. Washington, D. C.: Spartan, 1963.
[16] B. A. Crane and J. A. Githens, "Bulk processing in distributed logic memory," IEEE Trans. on Electronic Computers, vol. EC-14, pp. 186-196, April 1965.
[17] H. Hellerman, "Parallel processing of algebraic expressions," IEEE Trans. on Electronic Computers, vol. EC-15, pp. 82-91, February 1966.
[18] D. N. Senzig and R. V. Smith, "Computer organization for array processing," Proc. AFIPS 1965 FJCC, pp. 117-129.
[19] R. Aschenbrenner and G. Robinson, "Intrinsic multi-processing," Argonne National Laboratory, Argonne, Ill., ANL Tech. Memo. 121, June 1964.
[20] W. Comfort, "Highly parallel machines," Proc. 1962 Workshop on Computer Organization. Washington, D. C.: Spartan, 1963, pp. 126-155.
[21] J. H. Holland, "A universal computer capable of executing an arbitrary number of sub-programs simultaneously," 1959 Proc. EJCC, pp. 108-113.
[22] R. Reiter, "A study of a model for parallel computation," University of Michigan, Ann Arbor, Tech. Rept. ISL-654, July 1965.
[23] D. R. Lewis and G. E. Mellen, "Stretching LARC's capability by 100 - a new multiprocessor system," presented at the 1964 Symp. on Microelectronics and Large Systems, Washington, D. C.