1247 IEEE TRANSACTIONS ON COMPUTERS, VOL. C-33, NO. 12, DECEMBER 1984 Concurrent VLSI Architectures CHARLES L. SEITZ, MEMBER, IEEE (Invited Paper) Abstract -This tutorial paper addresses some of the principles and provides examples of concurrent architectures and designs that have been inspired by VLSI technology. The circuit density offered by VLSI provides the means for implementing systems with very large numbers of computing elements, while its physical characteristics provide an incentive to organize systems so that the elements are relatively loosely coupled. One class of computer architectures that evolve from this reasoning include an interesting and varied class of concurrent machines that adhere to a structural model based on the repetition of regularly connected elements. The systems included under this structural model range from 1) systems that combine storage and logic at a fine grain size, and are typically aimed at computations with images or storage retrieval, to 2) systems that combine registers and arithmetic at a medium grain size to form computational or systolic arrays for signal processing and matrix computations, to 3) arrays of instruction interpreting computers that use teamwork to perform many of the same demanding computations for which we use high-performance single process computers today. Index Terms -Computational arrays, concurrent computation, logic-enhanced memories, microcomputer arrays, multiprocessors, parallel processing, smart memories, systolic arrays, VLSI. I. INTRODUCTION CONCURRENT (or parallel) architectures are not, of \_ course, an idea original to VLSI. See, for example, the IEEE Tutorial on Parallel Processing [36] for an excellent background and collection of papers on this subject. The editors introduce this volume by pointing out, "Whenever a computer designer has reached for a level of performance beyond that provided by his contemporary technology, parallel processing has been hi's apprentice." This comment applies particularly well to VLSI architectures, as we shall see, although in the VLSI domain we usually use the term concurrent to suggest the independence of a collection of computing activities, in preference to the lockstep connotations of the term parallel. In reaching for performance with VLSI, one finds that the digital microelectronic technologies with the highest complexity also have definite performance limitations. The fundamental limitation is the high cost of communication, relative to logic and storage [77]. Communication is expensive in chip area; 'indeed, most of the area of a chip is covered Manuscript received June 30, 1984; revised July 31, 1984. This work was supported in part by the Defense Advanced Research Projects Agency under ARPA Order 3771, and monitored by the Office of Naval Research under Contract N00014-79-C-0597. The author is with the Department of Computer Science, California Institute of Technology, Pasadena, CA 91125. with wires on several levels, with transistor switches rarely taking more than about 5 percent of the area on the lowest levels. Communication is also expensive in sending signals between chips, where package pin limitations, the area used for bonding pads and pad drivers, and the cost of the packages, must be considered in multichip system designs. 
Dynamic power supplied to the chip, and dissipated in the circuits that switch capacitive signal nodes, is typically dominated by the parasitic capacitance of the internal wires, bonding pads, and interchip wires, rather than by the c'apacitance of the transistor gates. In VLSI technologies such as CMOS, in which the static power is negligible, communication then accounts for most of the power consumed and dissipated on a chip. When it comes to performance, communication is expensive in delay, both internally and between chips. In MOS technologies, which exhibit the highest circuit density but a poor relationship between transistor driving capabilities and the wiring parasitics, circuit speeds are dominated by parasitic wiring capacitance. The switching speed of an MOS transistor in modern processes, with one minimum size transistor driving the gate of an adjacent identical transistor, is in the 0.1 ns range, but if one adds a few hundred microns of wiring, the delay is increased to several nanoseconds. Also, the nonzero resistance of the wires, together with the parasitic capacitances of a wire, imposes a delay in the wire itself that is becoming increasingly significant at smaller.geometries. Finally, the disparity between internal signal energies and the macroscopic world of bonding pads, package pins, and interchip wiring is so large that the' delay penalty in amplifying a signal so that it can run between chips is comparable to a clock period. Thus, both the cost and performance metrics of VLSI favor architectures in which communication is localized. This principle of locality is seen at every level of VLSI design. Cells are designed and laid out to minimize area and wiring capacitance. Sections of chips and whole chips are carefully organized by cell placement in semicustom designs, and by floor plans in custom designs, with the objectives of minimizing wire area and of placing close together those parts that must communicate at high bandwidth or with small delay. Finally, partitioning of systems onto multiple chips must adhere to limits on package pins and the signaling speed between chips. These physical design considerations of the geometry and energetics of communication, the effects of scaling, and their architectural implications are discussed in greater detail in Section II. 0018-9340/84/1200-1247$01.00 © 1984 IEEE 1248 The communication limitations outlined above influence all VLSI architectures. Even the "general purpose" sequential processors that fit on single chips, the subject of the companion article [27], exhibit a marked sensitivity of their design to localization of communication. The performance of such systems also depends strongly on the extent to which they exploit covertly the concurrency found in the process of interpreting a single instruction stream. Localization of communication is achieved by consolidating an entire instruction processor onto a single chip together with as much of the lowest levels of the storage hierarchy (registers, stack frames, and instruction and data caches) as possible [60]. This on-chip storage can be very effective in reducing the frequency of relatively slow off-chip storage references. Concurrency in interpreting the sequential instruction stream is achieved by instruction prefetching and by pipelining of instruction execution. 
One might expect these techniques to be even more effective in speeding up micromainframes than when they are applied to conventional mainframes, and indeed are moving the micromainframe into the performance range of mainframes, and at greatly reduced cost. We refer to the approach of exploiting concurrency starting from a sequential program definition as covert because it is successful only if it is hidden, that is, only if the effect of executing the source program, as assembled or compiled, is the same as if it were interpreted sequentially. Unfortunately, the degree of concurrency achieved by such techniques is typically and in aggregate much less than 10, although in some cases the concurrencies that can be discovered in sequential programs [35] are of considerably higher degree. One is left, then, with the "intriguing question" posed by C. Mead and L. Conway [54] in Chapter 8, "Highly Concurrent Systems," of their well-known text: "Does VLSI offer more than inexpensive implementations of conventional computers?" The concurrent VLSI architectures discussed here exploit overtly the high degree of concurrency found in many interesting and computationally demanding applications. Here we label the concurrency approach as overt because the concurrent formulation of the computation is out in the open, rather than hidden in a single process representation from which one discovers the concurrency with a compiler or during execution. In this approach one may think about, formulate, and express a computation in terms of many processes and their interaction. The eventual implementation target of such a formulation may be either a design directly in silicon, or a concurrent program that runs on an array of communicating and concurrently operating programmable computers. It is characteristic of many, but certainly not all, complex and demanding computing problems that they can be formulated for execution with large degrees of concurrency-in the thousands, even in millions of loosely coupled processes. Also, the degree of concurrency in such formulations characteristically grows with the problem size. For example, the number of computing cycles required for the computation of a realistic image by ray-tracing grows with the resolution with which it is displayed, but the concurrency that can be exploited in casting rays grows similarly. Problems in signal and image processing, computer graphics, storage retrieval, matrix operations, solving sets of partial differential equa- IEEE TRANSACTIONS ON COMPUTERS, VOL. c-33, NO. 12, DECEMBER 1984 tions, physical simulation, and graph computations are among those applications that have been intensely studied and have been the object of clever system inventions by VLSI researchers and others. It is clear, however, that such machines are specialized to the application for which they are conceived. Even those concurrent machines that are programmable are not "general purpose" in the usual sense, since not all computing problems have the requisite concurrent formulation or size. It should be noted also that these concurrent VLSI engines, while developing rapidly in experimental prototypes both in industry and in universities, are relatively long range and adventurous efforts that may not have a significant impact on computing for several more years. Even the study of conventional computer architectures, designs, and programming systems is challenging in its diversity. 
One might expect the study of these specialized systems to be a chaos of individual inventions and applications. In fact, the highly concurrent VLSI architectures that have been sufficiently fully developed to use here as examples tend to fall into a consistent pattern that is disciplined by the characteristics of VLSI technology and design practice. Thus, following the discussion of VLSI technology and design in Section II, and before turning to examples, we discuss briefly in Section III a structural taxonomy of these concurrent systems. II. VLSI TECHNOLOGY AND DESIGN Digital systems have traditionally used a variety of technologies and manufacturing processes. The engineering of systems composed of many parts allows for a degree of specialization of the technology to the function to be performed, be it logic, storage, or communication. The progress in microelectronics, particularly in the last decade, has changed this engineering situation dramatically [68]. As the level of integration has increased, and the number of manufactured parts required to accomplish a given function has decreased, the engineering situation has become more uniform. Most of the design and engineering effort for a digital system today occurs in the design of the chips. Once the attention is inside a chip, the technology with which one creates a system is very tidy and consistent, governed by a much smaller set of rules and design paradigms than when many technologies have to be considered, and with no practical ''escape" into other technologies to bail the designer out of a tough problem. A. VLSI Models of Computation A consequence of this uniformity is that VLSI models of computation are quite realistic as a means of quantifying the consequences in silicon area, a measure of-cost, and computing time, of architectural choices within a chip. When a variety of digital technologies had to be considered, each with its- own cost, performance, and functional specialization, such modeling was much less tractable, or could be carried out only at a coarse level. As the scale of a system or subsystem encompassed in a single chip increases, one might hope and expect that architectural tradeoffs will be accom- 1249 SEITZ: CONCURRENT VLSI ARCHITECTURES plished in a less ad hoc fashion. VLSI is indeed a beautiful medium for studying the structure and design of digital systems, and this fact, as much as its economic importance, explains its appeal in the research community. VLSI models of computation [83] have been used extensively for the complexity analysis of concurrent algorithms. Although they are abstractions from the actual complications in chip design, they provide a way to describe concisely several of the essential features and problems of VLSI design that lead one toward concurrent architectures. We require, and hence shall use, only rather crude models here. For area A, it will suffice to use the actual or approximate area on silicon, an existent upper bound. The scalable parameter A [54] is used as the linear unit, where 2A is the feature size of the process, so that the area in A2 units is actually a measure of complexity. The computing time T for an element or system is taken as the latency, the delay from input to output for a "problem instance." Thompson suggests [83], [84] that a system working on p problem instances concurrently should be taken to exhibit an area A that is its total area divided by p. 
For our simpler purposes we shall instead take A simply as area, and T/p as the average interval between problem instances at the input and output. Average throughput is then p/T, and cost/performance corresponds to AT/p. Concurrency may be exploited at any level of a VLSI system design. One of the most common strategies in the logic, organization, and architecture of VLSI systems is to use pipelines - intermediate storage - in computation and communication paths, in order to increase throughput even if it is at the expense of increased latency. Fig. 1(a) is a simple illustration of this approach, the evaluation of an expression (in infix notation): (((Aop1B)op2C)op3D), where the opi are binary operators. Although the operations must be performed sequentially, this evaluation is assumed to be required repetitively. The boxes indicate the temporary storage for the input and output. If this picture were a concurrent program schema, the circles would represent processes, the boxes their input and output queues, and the arcs the message paths. If it were a block diagram, the circles would represent combinational functions, the boxes registers, and the arcs bundles of wires. If it were a logic gate diagram, the circles would be gates, the boxes clocked storage elements, and the arcs wires. If one denotes the respective areas and times of the operators as ai and ti, neglecting the boxes, the areas and times simply accumulate, so that the cost/performance is just (al + a2 + a3)(tl + t2 + t3)A pipelined version of this system, shown in Fig. 1(b), is dealing with three problem instances concurrently. Here one sees in miniature some of the opportunities and problems one faces with concurrent computations. If the operation times were equal, one has a system with only slightly increased area (the boxes), the same communication plan, the same latency time, and with p = 3, three times the throughput. What if the ti were not equal? Clearly the throughput would become 1/tma. When such a system is designed in silicon, one can generally trade off area and time within the operators, or locate the pipeline synchronization, to make the times approximately equal. When this situation appears in programming concurrent computers, it is referred to as load balancing, and is dealt with in ways that depend on both the o A C B S + OP- _ ,' e + OP2 = RESULT D ,,' ' S----} 1 PROBLEM + ~~~2JINSTANCES OP3 RESULT II Fig. 1. (a) Cascade evaluation. (b) Pipeline evaluation. application and on the architecture. Although there is apparent localization of communication within and between the operators, this example is otherwise independent of technology considerations. The case in favor of concurrency in VLSI systems becomes more compelling when one examines the area-time performance of the communication. There are currently in vogue several different time models for the delay of a wire and its driver. Each is physically valid under a range of conditions in a particular technology. In MOS technologies a simple amplifier with input capacitance Ci, will drive a capacitive load C0ut in time T = rjnv(COut/Cin), where rinv is a characteristic of the process, essentially the delay of an inverter with unit fanout and no parasitic load., The area of the amplifier is proportional to Cm. 
By cascading amplifiers in an optimal [54] size ratio C01t/CCjn = e, one can boost a signal from an energy corresponding to capacitance C, to the parasitic capacitance of a wire, CL, in a minimum time T = Tinv loge (CL/CX)The parasitic capacitance of a wire varies with wire length e according to a proportionality constant of the layer, so one can assert that communication time is T = O(log i) in the worst case in which one starts from a minimum signal energy. The numbers for a typical CMOS process of today are 'rinv = 0.6 ns, and the ratio between the capacitance of largest wiring structures and a minimum size transistor gate is about e8, and scales to about e15 in ultimate MOS technologies. Although in typical practice the compromise between driver area and delay dictates a suboptimal driver for long wires or interchip wires, one can, if necessary, achieve O(log f) dependence of the driver delay with wire length. The resistance of a wire together with the parasitic capacitance adds a diffusive propagation term for the wire itself that is O(f2) [65], [4]. The coefficient of this term makes this phenomenon significant in the fairly resistive silicon and polycrystalline silicon wires today, and is an important performance consideration in memory and programmed logic array (PLA) structures in single-level metal processes, since silicon wires must be used in one of the two directions in the 1250 matrix layout. In scaled technologies this problem appears also in metal wires, due to their scaled thickness. For reasons other than performance, such as noise immunity, one would never allow this diffusive term to become seriously dominant. Instead, one can include active repeater amplifiers periodically along a long communication path, thus making the delay 0(t), or one can convey signals long distances on additional metal layers that are thicker. While the dependence of delay time on wire length is quite benign if one uses optimal driving structures, the combination of area cost and this slowly growing delay is a substantial incentive to localization of communication. Communication throughput, or bandwidth, in the VLSI medium is a very meaningful measure that may be taken as a product of the number of parallel wires (the width of the wiring track) between two parts, and the "bit" rate. In signaling schemes in which there is no pipelining along the communication path, there can be only one transition propagating along the path, and the bit rate is bound by the reciprocal of the delay. If one takes the area-time product as the objective function to be minimized in a particular design, one sees that VLSI imposes serious penalties for separating two parts that communicate in this naive way. The wire area, the signal energy, and the area of the optimal driver (about 10 percent of the wire area) is each proportional to the distance, that is, A = 0(t), while T = 0(log C). Accordingly, the aggregate penalty in cost/performance for violating the principle of locality is 0(f log C). The penalty may be still worse for wires that are so long that they must be forced onto the upper, thicker metal layers, since these layers will have coarser design rules, exclusively long wires, and hence more area per wire, and are expected to be a very limited resource [55]. What one sees in the example of pipelining, and from the area-time performance of communication, is that the throughput can be increased if one can devise a way to confine the (physical) diameter of tightly coupled parts of a system. 
The expression "tightly coupled" can be taken in the synchronous design style to describe a system in which a large proportion of the communication must traverse the entire system on each clock period, such as from the input to the output registers in Fig. 1(a). On the other hand, when large systems are composed in a loosely coupled fashion, by which we mean that the parts can operate relatively independently that is, concurrently and can tolerate latency in their communication with other parts, the raw performance and excellent cost/performance that can be achieved in small diameter systems is reflected in large systems. This principle has been applied so thoroughly in the examples discussed later that these systems are arbitrarily extensible in the number of concurrent computing elements, and open-ended in performance. They can be expanded to be as large as desired with each part still operating at the same rate as when it is incorporated into a smaller system. This property is closely associated with the ability of a VLSI architecture to be scaled, that is, to exploit advances in the circuit technology. Another conclusion of this informal analysis is that communication between concurrent computing elements may take only certain forms. Any wide path -many parallel IEEE TRANSACTIONS ON COMPUTERS, VOL. c-33, NO. 12, DECEMBER 1984 wires - must be strictly localized, such as can be achieved in a mesh-connected system of concurrent computing elements. Wiring economics dictate that any long path be narrow, and will necessarily exhibit a significant latency. This latency will be due in part to pipelining in communication. Also, messages of many bits sent between concurrent computing elements must be serialized according to the width of the communication path, and would accordingly exhibit a latency that is dependent on the size of the communication "packet." These two communication paradigms -local and wide versus distant and latent- are indeed recurrent and competing themes in the later examples; the former representing the wavefront or systolic type of computation, in which all communication can be made local, and the latter the queued message routing approach to less regular computations. B. Architectures That Scale Another aspect of this increasing uniformity of digital technology is that the future of at least the silicon technology is believed to be well understood and, but for a few considerations, is not radically different from today's high complexity MOS technologies. Thus, one has a reasonable hope of devising architectures with a longevity that parallels the continued evolution of the technology, or in VLSI jargon, "architectures that scale." The physical consequences of feature size scaling of MOS technologies are generally advantageous. If all physical dimensions and voltages are reduced together [54], so that electric field strength remains constant (and no new materials need be postulated), the electron "velocity" ,uE, where ,u is the mobility and E the electric field, remains constant, and the transit time across the smaller dimension channel is reduced in direct proportion. Capacitances per unit area increase linearly in the scaling, but the areas decrease quadratically, so that the capacitances in a scaled circuit are linearly smaller. One can see that at each circuit node the relation iAt = Aq = CAv is satisfied in such a way that the scaled circuit is an exact current-, voltage-, and time-scaled replica of the original. 
The energy associated with each switching event, (1/2)CAv2, is scaling as the third power of the feature size. If C is taken as the capacitance of the gate of a minimum geometry transistor, this switching energy ES, is equivalent to the product of the power per device and device delay, with the device switching at maximum rate. The power-delay product is the fundamental figure of merit for switching devices, and has a direct relation to the cost/performance of systems implemented with those devices. One notes also that the power per device scales down quadratically while the density scales up quadratically, so that the power per unit area need not be increased over today's levels. This cube law scaling of ESW with feature size is a remarkable incentive for continued advances in MOS technologies. If one were to take an existing single chip in a 2 ,tm CMOS technology, say, an instruction processor about 5 mm square running at a 20 MHz clock rate, and fabricate it at one-tenth of its present feature size, it would take up only Vioo the area, and in principle would operate from an 0.5 V source at V/ioo the power, and at a 200 MHz clock rate. This is certainly an 1251 SEITZ: CONCURRENT VLSI ARCHITECTURES attractive scaling. In practice, there are a few things that go wrong in this scaling that would have some impact on the scaled performance, but they are only difficulties, not disasters, for computing elements that are the complexity of a single chip today. Operation at such small voltages is problematic, and due to short channel effects, the threshold voltages and performance of the transistors would not be quite as good as simple scaling rules imply [31], [79]. Another effect that is significant is that while the capacitance distributed along the wires and transistor gates of a circuit node would be scaled to ¼/io the previous value, the reduced cross section of the wires would cause the resistance of the wire segments, even though they are shorter, to increase by a factor of 10. The RC product, which has dimensions of time, and is the coefficient in the delay in diffusive propagation of signals on wires, unfortunately fails to scale with the other circuit delays. Also, unless temperature is also scaled, the subthreshold conductance of the MOS transistor scales up rather dramatically with reduced threshold voltages, so that the dynamic storage structures that are so widely employed in MOS technologies, such as for dynamic RAM's, cannot be depended on to retain charge for more than a few microseconds. One can get around each of these scaling difficulties by variations in design style. For example, the use of additional metal layers in layout would be a compensation for the problem of wire resistance, and one would expect increased use of static storage structures in place of dynamic storage. So, it appears that variations of the designs of today, and of the design techniques, are feasible at least in small areas of the chips in this futuristic technology. The design of a chip with an area similar to today's chips, but with 100 times higher circuit density, will depend strongly on the additional layers of metal interconnection mentioned above. This situation is not at all unlike the wiring hierarchy employed today with packaged chips on printed circuit boards or chips mounted directly in ceramic carrier modules, but with the next level of wiring absorbed into the chip. 
Even under the optimistic and uniformitarian view that this future technology inherits our present design practices nearly intact, there is no fundamental help in sight for relieving the communication limitations. Indeed, by confining the wiring to two dimensions, we give up a physical dimension of packaging and interconnection. Driving a signal that is equivalent to a cross-chip wire of today has become no easier in scaling, and because of the diffusive delays in the lowest levels of interconnect, will force some connections to higher levels. Driving the long distance interconnect on the upper metal layers is essentially similar to driving bonding pads, package pins, and interchip wiring today. III. STRUCTURAL TAXONOMY The concurrent VLSI architectures to be discussed in the three following sections were selected as examples based on being 1) systems for which at least prototypes exist, and 2) clearly inspired by the opportunities and consistent with the design principles of VLSI. This selection is a family of concurrent systems that I have elsewhere referred to as ensembles [66], [67], and which can be discussed in terms of a simple process model of computation. There are other computational models and concurrent architectures whose VLSI implementations are interesting, such as data-flow and reduction machines, but these subjects would entail a survey in themselves. There is a commonality in the physical structure of this family of systems that is in part a necessity of the VLSI medium, and in part an artifact that we shall try to identify, These examples are all regularly connected direct networks of nominally identical concurrent computing elements. For the following discussion and examples, we shall refer to the computing elements as nodes, as in a computer network, and the, communication paths between them as channels. A. Communication Plans The communication plans of these systems are direct networks, such as the diverse selection illustrated in Fig. 2. In some systems the communication plan is a direct mapping of the communication requirements of an algorithm. In other cases messages are routed to a destination node through intermediate nodes, and the choice of communication plan is a compromise between wirability and performance. For example, a family of hypertorus networks can be represented as a k-coordinate periodic n-dimensional cube (k-ary n-cube) that connects kn = N nodes together such that the maximal shortest path between nodes is kn/2. In order to connect N = 212 nodes, one might choose k = 26 and n 2, an easily wirable two-dimensional mesh with 2 x 212 short channels, for which kn/2 = 64; or k = 2 and n = 12, for a binary (or Boolean) 12-cube with 6 x 212 channels, 1/6 of which are as long as a radius of the system, for which kn/2 = 12; or some intermediate compromise. Similarly, there are many parametric variations of the binary n-cube connected m-cycle for connecting m2n nodes. The width of the communication path is still another engineering variable. In the VLSI engines discussed here, the nodes themselves produce, consume, and in some cases route messages, so that these are what are called direct networks. In systems in which messages are routed, the node could be partitioned into two concurrently operating sections, as illustrated in Fig. 3(a), one section (C) to compute, and the other section (R) to route messages. The network illustrated is a binary 3-cube, and the channels are labeled according to the dimensions 0, 1, 2. 
The message section can be reorganized by the transformation shown into the multistage routing network of interchange boxes shown in Fig. 3(b). This transformation of the direct Boolean n-cube illustrates its essential similarity in structure and message flow performance to the corresponding indirect network, which is the same as the "flip" network used in STARAN [3], and under a rearrangement the same as the Omega network [44] or the banyan [24]. (See Siegel [70] for an insightful survey.) The absence of indirect networks in current experiments with concurrent VLSI systems is probably partly an artifact of the VLSI "lore" that switching networks do not scale well, which is certainly true of some of them, such as crossbar switches. Message switches are not eliminated by direct networks, but rather are partly concealed in this fully distributed form. = 1252 IEEE TRANSACTIONS ON COMPUTERS, VOL. 0 RING TREE SNEP TREE MESH c-33, DIMENSION 1 2 12, NO. DECEMBER 1984 I i ? I DIMENSION HYPERCUBE (BINARY n- CUBE) C1 - Ci C2 - C2 C3 - C3 C4 C4 C5 C5 C6 -c6 07 07 Fig. 3. (a) Direct binary 3-cube, and a transformation from the direct connection to interchange boxes. (b) Indirect binary 3-cube of interchange boxes. process, and to the economies of relatively larger fabrication runs of a smaller set of chip types. SHUFFLE - EXCHANGE CUBE - CONNECTED CYCLES Fig. 2. Typical direct networks. B. Homogeneity Another common characteristic of these experiments with concurrent VLSI architectures is that the nodes are nominally identical. We accordingly refer to these machines as homogeneous [76], meaning that they are of uniform structure. Heterogeneous systems would allow nodes to be specialized for different functions, much as are different computers on a network, or the functional elements in high performance computers. For these early experiments, however, homogeneous machines are certainly logistically simpler to design, test, assemble, and maintain. Homogeneity in programmable machines simplifies the software by giving all parts the same capabilities. Homogeneous machines also conform very well to the design flow of VLSI chips, in which repetition and regularity simplify the layout C. Node Complexity The choice of connection network establishes one dimension of variation in a taxonomy of this family of concurrent machines. Two other interesting and discriminative dimensions are the complexity of the nodes and the number of nodes reasonably contemplated for a given machine. A taxonomy in these two dimensions is shown in Fig. 4. The node complexity, also referred to as the "grain size" of the system, appears on the horizontal axis in Fig. 4 in A2 area units. The complexity of today's single chip, marked by a "*" on this axis, is an interesting point that separates systems of many nodes per chip from those of many chips per node. Today's "commodity" chips routinely reach 5 mm on a side at 2.5 Am feature size, or 4000 A on a side, which translates to 16 MA2, while advanced commercial chips are in the 50-100 MA2 range. Of course, these measures have been doubling approximately every two years over the recent past [57]. The two extreme zones indicated in Fig. 4, storage sys- 1253 SEITZ: CONCURRENT VLSI ARCHITECTURES nodes. Useful systems would include thousands, even millions, of these nodes. The ability to mix logic and storage 1612. economically at a fine grain in a single technology, which was not so attractive in earlier digital technologies [77], is part of what makes these architectures unique to VLSI. 
109Typical applications of these systems are computations LOGIC-ENHANCED MEMORIES with images, such as scan conversion, correlation, and path finding; or database operations such as sorting, association, z COMPUTATIONAL ARRAYS and property inheritance. There is no real theory or com106 t putational model behind the design of these systems. They MICROCOMPUTER \ X tend to be in the nature of specialized individual inventions. 1 o ARRAYS The next zone represents systems for highly concurrent z numerical computations, in which the nodes are capable of \ CONVENTIONAL COMPUTERS operations such as multiplication and addition, and are connected in regular patterns that match the flow of data in the i j \D\ / ,\_~~~~~~~~~~~~~1 computation. These computational arrays, also called sysI tolic arrays [38], [41] because of the rhythmic pumping of 112 l 106 103 1109 data in pipelines, can be implemented in a variety of forms. NODE COMPLEXITY 2) The range of node complexities shown in Fig. 4 assumes that Fig. 4. Taxonomy of concurrent VLSI systems. the sequencing of operations is either built into the nodes, or that the node responds to control signals that are broadcast tems composed of random-access memory (RAM) chips, and into the array in the style of single instruction multiple data conventional single processor computers, are included for (SIMD) machines. However, the systolic algorithms decomparison with the three middle zones. For this com- signed for such machines are also highly efficient concurrent parison, we take the RAM cell and whole computer as the formulations for microcomputer arrays. Thus, the com"node." The three middle zones represent concurrent VLSI putational or systolic array is both an architecture and a comengines whose design, engineering, and applications have putational model, and has stimulated a broad research effort been studied in some depth, and which appear to be reason- in the design of concurrent algorithms for applications such ably distinct classes. The examples in the three following as signal processing, matrix and graph computations, and sections are respectively what are labeled as logic-enhanced sorting. It requires only several MA2, not even a full chip, to imnplememories, computational arrays, and microcomputer arrays. Let us here briefly traverse all five zones, left to right, to ment a minimal instruction processor and a small amount of describe some of the characteristics of each of these classes. storage for program and data. A single chip today is sufficient The RAM systems are composed almost exclusively of [48] for a processor with a rich instruction set and several high complexity chips, and provide a way to gauge the cost thousand bytes of storage. These highly integrated computers of a system if it were so highly integrated and produced in exhibit excellent cost/performance, but the performance and such large quantities. The basic repeated cell for storing one storage comes in fairly small units. Thus, it seems inevitable bit in a RAM varies from about 100 A2 for the densest one- that people are learning to team up myriads of these computtransistor dynamic RAM chips to about 400 A2 for high ers that fit in units approximating one per chip, or many per performance static RAM's. Multichip assemblies of many wafer, to attack demanding computational problems. 
Each identical RAM chips, the larger capacity systems being microcomputer is fitted with a number of communication typically denser and slower, are accordingly shown as an ports, and the array of nodes is connected in a direct network elongated and slightly slanted zone of expected variation. that is, as usual, dictated either by the application or by The selling price of mainframe add-in storage based on message routing and flow performance considerations. 64K dynamic RAM chips is currently somewhat less than Whether one can multiply the computing performance in $20 per RAM chip, packaged and powered. These 64K RAM concurrent execution by the number of nodes, or nearly so, chips are about 10 MA2. The resulting estimate of $2/MA2 is is very much dependent on the problem. All of the concurrent too low for chips produced in small quantities, so we will use formulations for finer grain machines can be mapped very a more conservative estimate of $5/MA2 for today's tech- efficiently onto microcomputer arrays. In addition, machines nology, with the understanding that this measure scales down of this class appear to be capable of performing many of with improvements in the circuit and packaging tech- the same scientific and engineering computations for which nologies. Hyperbolas of constant cost, the product of the cost people today use high-performance vector computers. The last zone, consisting of conventional single- or per node and number of nodes, appear in the log-log plot of Fig. 4 as straight lines labeled with a cost that applies to the several-processor computers, exists in a broad range from the highly integrated implementations that one hopes to achieve single-chip computer to high-speed supercomputers. with VLSI. IV. LOGIC-ENHANCED MEMORIES Logic-enhanced memories, also called "smart memories," We turn now to two interesting examples of specialized are very fine grain systems in which each node contains a few to a few hundred bits of storage associated with logic that can machines that are paradigms of the genus of concurrent VLSI operate on the storage contents and communicate with other architectures that mix logic and storage at a fine grain, variw \ 0 0 LL. 0 \ w 0 C 10 10 (X 1254 ously called logic-enhanced memories or "smart memories." It should be noted that these systems achieve rather remarkable performance per cost, in comparison to the same computation on a general purpose sequential computer, by 1) specialization of the system to the algorithms, 2) concurrent operation of an appreciable subset of the nodes in the system, 3) localization of communication between the stored data and the logic in the node (an unlimited storage bandwidth), and 4) localization of communication in the connection plan between nodes. A. Pathfinder The Pathfinder chip and system was designed to perform the computationally expensive part of two-layer wire routing, such as is employed in printed circuit board design, by an adaptation of the Lee-Moore algorithm. It was one of the first VLSI "smart memory" systems conceived. The original idea for this project was suggested by I. Sutherland in a 1976 internal Caltech memorandum titled "A better mousetrap," and a fully developed system was designed and carried through to a small prototype [13] by C. R. Carroll. The Lee-Moore algorithm [56], [45] finds the shortest path(s) between two points in a rectangular grid, in which path segments are allowed to run only horizontally or vertically, and in Which points on the grid may be blocked. 
The blocked points correspond to wiring area already used, either previously routed wires, component pads, or the edge of the circuit board. Each grid point has a state, either blocked, unoccupied, or, in the original form of the algorithm, an integer representing the distance of this point from a starting point. In practical adaptations of the Lee-Moore algorithm to circuit board routing [71], distance can be generalized to cost. The propagation phase of the algorithm starts with all unblocked points unoccupied, and schedules the neighbors of the starting point to be assigned a cost. Any unblocked neighbor then schedules its neighbors, and so on, so that the propagation phase terminates with all reachable points assigned costs. This information then allows the retrace phase to trace a minimal cost path back from the finishing point, or from any reachable point, back to the starting point. This algorithm is a simple and general tnaze-solving scheme that will find a minimal cost path if a path exists. The propagation phase is the costly part of the algorithm, since for a route involving t steps, as many as t2 points may be examined. The retrace phase is only linear in £. The Lee-Moore algorithm is seldom used today in routing programs, but routing approaches that examine channel areas rather than grid points to determine the sequence of channels through which a wire is routed are descendants of the Lee-Moore approach. The propagation phase is "a natural" for a mesh-connected cellular machine. The operation on the state of a node in response to signals from adjacent nodes can be carried out in a time measured in logic stage delays rather than instruction times. The computations required along the entire wavefront can be performed concurrently, thus reducing the time complexity from O(f2) to O(f). In a preliminary one-sided router or maze solver called the Mazer chip, Carroll experimented with the technique that IEEE TRANSACTIONS ON COMPUTERS, VOL. c-33, NO. 12, DECEMBER 1984 was key to mapping this algorithm onto a cellular array on silicon. Instead of encoding a cost at each grid point, he followed Sutherland's suggestion of encoding a direction, with cost being implicit in the speed of local propagation. The Mazer chip encodes the state of a grid point with a "blocked" bit together with 4 bits represented in Fig. 5 as arrows pointing right, up, left, or down. If the 4 bits are all O's, the cell is unoccupied. The action of a cell, shown in circuit form in Fig. 6, is visualized as a four-way mousetrap, in which tripping any arrow in a cell then trips the corresponding arrow in all adjacent unoccupied cells. At the end of this propagation the direction arrow(s) point from all reachable grid points back toward the starting point. It is possible for two of the arrow bits to become set, representing a bifurcation in the set of minimal paths. This scheme works either synchronously or asynchronously; Fig. 6 applies to either approach, with the storage elements being respectively clocked set-reset flip-flops or latches. Carroll chose the asynchronous approach, which has advantages of speed, no clock to distribute, and a simpler node, and a disadvantage of a difficulty of assuring uniform propagation across chip boundaries. Of course, the Mazer array is also a RAM that allows the grid point -states to be examined and modified, both to set the blocking pattern and to accomplish the retrace phase. 
The asynchronous approach also allowed the Lee-Moore algorithms to be adapted to the heuristics that practical routers use to cope with the actual complications of routing many wires successively on multiple layers, and keeping previously routed wires out of the way of subsequent wires. These adaptations typically involve weighting the edges of the grid to make certain areas or directions more or less costly than others. Thus, one can perturb the wavefronts computed to find the least costly rather than the geometrically shortest path. The usual heuristic for routing two-layer printed circuit boards is to create a bias in favor of vertical wires on one side, horizontal wires on the other side, and to make interlayer "vias" fairly expensive and possible only in certain preassigned locations. Carroll's Pathfinder chip implemented this adaptation of the Lee-Moore algorithm on a two-layer grid by mapping the weights directly into delays. The circuitry used in the basic cell is ingenious, with the delays in the preferred and unpreferred directions in the two layers, and between layers, set by external control voltages. Thus, the pathfinder is a hybrid analog and digital computing network. The interested reader is encouraged to study Carroll's paper [13]. The pathfinder chip implemented in 1980 in 5 /im nMOS technology was a 4 by 8 array of 32 nodes, each about 50 kA2. A more modern process would allow on a single chip a 32 by 32 array of 1024 nodes with addressing logic and pad frame. A useful routing area of, say, 5-12 by 512 grid points would be a 16 by 16 mesh-connected system of 256 chips. Because of the number of pins required per chip - 256 just to deal with the periphery - a Pathfinder system would benefit greatly from the use of advanced hybrid packaging, or wafer scale integration if a suitable redundancy scheme could be devised. By virtue of its specialized organization, such a system attached to a modest host computer is estimated to route a printed circuit board more than 100 times faster than SEITZ: CONCURRENT VLSI ARCHITECTURES 1255 I_ Fig. 5. Mousetrap version of the Lee-Moore algorithm. Fig. 7. Fragment of a semantic network. as sets, situations, physical objects, persons, etc., and the arcs binary relations such as subset inclusion (represented by " s"), set membership (represented by "e"), and other relations as may be required by the knowledge to be represented. More complex relationships can be represented by vertices such as the "occurs-in" vertex in Fig. 7, which represents that the harvest of Granny Smith apples occurs in t-JR RESET Fig. 6. Logic diagram of the Mazer node (without RAM access part). NOT BLOCKED high performance mainframe computer, which, one should note, is a much more expensive although more readily available computing instrument. The analysis and synthesis of images are notoriously demanding computations on sequential computers, but susceptible to decomposition on a pixel-by-pixel basis if the demanding part of the computation can be formulated entirely in terms of local operations. Carroll's pathfinder is a particularly clear example of such a formulation, and the differences between the Mazer and Pathfinder chips illustrate the idiosyncrasies that so often separate the straightforward implementation of a simple algorithm from the complications of useful applications. a B. The Connection Machine The connection machine, or connection memory, is an innovative system for concurrent operations on a knowledge base represented as a semantic network. 
This system [29] was conceived by D. Hillis, working with several other researchers at the M.I T. Artificial Intelligence Laboratory. Machines with on the order of 100 000 nodes are being developed both at the M.I.T. Al Laboratory and by Thinking Machines Corporation. A semantic network [26], such as the example of Fig. 7, is a. directed graph in which the vertices represent objects such . September. Networks such as these can be represented in and manipulated by sequential computers, but if the network is large, the speed of operations such as deduction, matching, sorting, and searching leaves much to be desired. Even if the processor were very fast, the single access point to random access memory fundamentally limits the performance. Thus, the connection machine, like other smart memories, moves the processing into the memory. Many of the operations performed on semantic networks can be performed concurrently, so that both the storage and processing capability are in the best case proportional to the size of the machine. Each node of the connection machine, illustrated in block diagram form in Fig. 8, consists conceptually of a few registers, an arithmetic-logical unit (ALU), a message buffer, and a finite-state machine. The next state and output functions of the finite-state machine are referred to as a rule table, and all the rule tables in a system are identical. A node reacts to an incoming message according to the message type and the internal state, and performs a sequence of steps that may involve arithmetic or storage operations on the message contents and the contents of the registers, sending new messages, and changing its internal state. The registers and state may also be accessed as ordinary random access memory in the address space of a host computer. The rule tables are also RAM, and are presumably loaded by the same mechanism. In the designs currently being developed, a single chip contains 32 or 64 nodes, and a single rule table, ALU, and set of off-chip communication paths is shared by all the nodes in a single chip. The chips can be connected in a variety of communication plans, discussed below. Although in this implementation the chip resembles a node in a microcomputer array, the connection machine is probably better thought of as a "smart memory," in that its 1256 IEEE TRANSACTIONS ON COMPUTERS, VOL. c-33, NO. 12, DECEMBER 1984 SEMANTIC NETWORK I STATE I LTYPEI DATAJ Fig. 8. Conceptual block diagram of a connection machine node. operation and programming is based on the model of the simple nodes of a few registers each. A semantic network is stored in the connection machine as a pattern of virtual connections between cells. This representational approach is similar to LISP, where data are stored as structures of pointers. The registers in the connection machine nodes principally hold pointers to other nodes. The virtual connections map onto the physical wires by the routing of messages through intermediate nodes according to a destination address in the message. It is desirable for performance, but unnecessary to logically correct operation, that nodes that communicate frequently be physically close. The 1ow-level data structures and programming of the M.I.T. version of the connection machine, including the dynamic allocation of nodes, is very well described in the recent M.S. thesis by D. Christman [14]. 
One of the problems of representing a semantic network is that its vertices may be of arbitrary degree, but the number of pointers that can be stored in a node is limited. Thus, each vertex is represented in the connection machine as a balanced binary tree, as shown in Fig. 9. Each node then needs only three virtual connections. The depth of the tree is the log base 2 of the degree of the vertex, and with the root connection reserved, the number of nodes required to represent a vertex of degree n is n - 1. A bit in each node indicates whether the tree below it is left- or right-heavy, so that operations that add connections can leave it balanced, a well-known algorithm. Each arc of the semantic network is also represented as by a node that connects the two related vertices and the arc type. Space permits only a glimpse of what a connection machine can do, so we will offer only a short example of relational operations on sets. The connection memory also has facilities for representing and manipulating functions, that are analogous to, but more complex than those for relations, and also facilities for performing arithmetic. The interested reader is referred to Hillis' memo and Christman's thesis. Set operations are performed in the connection memory by representing membership in sets as individual bits in the state register. A particular bit in each of the N nodes of a connection memory is an N-bit set register. Clearly all of the standard set operations, complement, intersection, union, and difference, can be accomplished in parallel from an instruction broadcast from the host, and without communication between nodes. Sets can be mapped to other sets according to relations of a particular type, an operation requiring propagation of mes- Fig. 9. Representation of a semantic network in the connection machine. sages. For example, B: = APPLY-RELATION(color-of, {Granny Smith}) applies the relation "color-of" to the singleton set {Granny Smith}, and loads set register B with the set {green}. Internally this computation is accomplished by all nodes in the argument set sending messages that propagate only through nodes representing the "color-of" relation, in order to arrive at nodes that set the bit corresponding to set register B. One could have represented the set "apples" as red, and allowed this property to be inherited by any variety of apples in which the property is not explicitly specified. The varieties one might list would then be related to "apples" by a "virtual-copy" relation. Here the process of computing the set of red things is A = APPLY-REVERSE-RELATION(color-of, {red}) B = coMPLEMENT({red}) B = APPLY-REVERSE-RELATION(color-of, B) B: ='COMPLEMENT(B) C = APPLY-REVERSE-RELATION-CLOSURE(virtual-copy, A, B) In the first step, A is loaded with the set of all explicitly red things, e.g., {apples, cherries, fire trucks, etc.}. In the next two steps, B is loaded with the set of all things that have a "color-of" relation, but not to red, and so will include, for example, Granny Smith apples. B is then complemented, so as to include all red or possibly red things, and to exclude all such things that have a color other than red. C is then computed as the closure of the reversed virtual-copy relation starting with all red things (including apples) in the universe B, which specifically excludes inheritance of this property when a nonred color is already specified (including for Granny Smith apples). 
The closure computation requires that messages continue to propagate through all virtual-copy arcs until all such messages are absorbed in vertices that are already in set C. In discussing "how to connect a million processors," Hillis notes that "the most difficult technical problem in construct- 1257 SEITZ: CONCURRENT VLSI ARCHITECTURES ing a connection memory is the communications network." This problem did not appear to be very serious in the Pathfinder because communication could there be confined to neighboring nodes on a mesh. VLSI engines designed for problems that lack this crystalline regularity depend on virtual rather than wired connections, and achieve a useful separation of concerns between the algorithms and the engineering wirability issues. The designers are free to choose the highest performance connection plan that fits current levels of integration and packaging technology. The choice made for the connection machine is the family of hypertorus structures described in-Section III, although the designers of the connection memory discovered this result by a geometric construction with (literally) a twist, shown in Fig. 10. C. Other Logic-Enhanced Memories The Pathfinder and the Connection Machine are representative of differences in node size, wired and virtual connections, and the two applications of logic-enhanced memories that have been most widely studied, image computations and storage operations. Another interesting example of a smart memory for image computations is pixel-planes [23], a system being developed by H. Fuchs and J. Poulton of the University of North Carolina at Chapel Hill, and A. Paeth and A. Bell of Xerox Palo Alto Research Center. Pixel-planes enhances a raster memory with logic that performs scan conversion of abstract polygonal objects, computes shading, and provides a depth priority scheme for displaying only visible surfaces. A VLSI tactile sensing array [81] that also performs discrete two-dimensional convolutions was developed by J. Tanner of Caltech and M. Raibert and R. Eskenazi of Caltech JPL. Convolution is a simple, regular, and local computation on a mesh-connected array. Each node of this chip and system contains two processors, so that if the primary processor includes a defect, a spare processor can be selected in its place. Where this tactile sensing array incorporates pressure transduction with the storage and computation in each node, a number of other projects have incorporated optical sensing into smart memory arrays. R. F. Lyon thus refers to his optical mouse chip [50] as a "smart digital sensor." The optical mouse not only senses an image, but correlates it with the previous image in order to sense and report motion. Lyon's effort has since inspired several refinements and other experiments. J. Tanner and C. Mead developed a correlating optical motion detector [82] that performs the correlation by a hybrid of analog and digital techniques. G. Bishop and H. Fuchs are developing a system they call the self-tracker [5], intended to locate its own position in an environment such as a room. In the area of smart memories for storage operations, one might expect that VLSI would make associative memories feasible and attractive. However, the chip area required for simple associative matching does not compare well with more conventional hashing into word addressed memory. 
In the area of smart memories for storage operations, one might expect that VLSI would make associative memories feasible and attractive. However, the chip area required for simple associative matching does not compare well with more conventional hashing into word addressed memory. System designs in which association is more complex and requires computation, such as finding the closest match, or in which association is combined with concurrent operations on marked items, do start to exploit well the paradigm of mixing logic and memory at a small grain size. The non-Von being developed at Columbia University is a tree-structured medium grain single instruction multiple data (SIMD) machine aimed at such database operations.

Fig. 10. Twisted cube. While the maximal distance between vertices in a 3-cube is 3, in this twisted version it is 2.

VLSI sorting memories have been studied extensively. The simplest types that sort in linear time (in a time equal to that required to load the data) on a linear exchange network have been implemented in great variety as student projects in VLSI design classes. C. Thompson's survey of 13 algorithms for sorting [85], and analysis of their complexity in terms of VLSI models of computation, is highly recommended.
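One well-known scheme that fits a linear exchange network is odd-even transposition sort, sketched below in Python. Each of the n phases performs only neighbor compare-exchanges, all of which could proceed concurrently in hardware, so both the cell count and the phase count are linear in n. This is offered as an illustration of the style, not as the design of any particular sorting memory, which typically overlaps the exchanges with loading.

def linear_array_sort(values):
    a = list(values)
    n = len(a)
    for phase in range(n):
        # even phases exchange cells (0,1), (2,3), ...; odd phases (1,2), (3,4), ...
        for i in range(phase % 2, n - 1, 2):
            if a[i] > a[i + 1]:          # each compare-exchange involves only neighbors
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

assert linear_array_sort([5, 1, 4, 2, 3]) == [1, 2, 3, 4, 5]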
V. COMPUTATIONAL ARRAYS

The two examples of logic-enhanced memories illustrate two different approaches to communication in these fine grain systems. The pathfinder nodes were connected in a mesh that conforms exactly to the communication requirements of the propagation phase of the Lee-Moore algorithm. The hypertorus communication plan of the connection memory was selected based on message flow performance in a system in which messages are routed, since it is not possible to predetermine a particular communication topology. These divergent communication paradigms will reappear in the discussion of microcomputer arrays; however, the examples of computational arrays discussed here adhere to a model of regular pipeline computation in which the communication plan of a real or abstract machine exactly matches the data dependencies of the algorithm, and in the ideal case [37] a node need "never wait" for operands. The virtual connection concept is a valid alternative for machines of this grain size. There is such a close relationship in the way in which data are processed in streams between computational arrays and data-flow machines that computational arrays might be considered the "wired connection" counterpart of the "routed connection" data-flow machine. If one uses a wavefront model of computation [17], [42], [32], [86], one can describe the operation of a computational array in data-flow terms. However, we shall omit here any discussion of VLSI implementations of data-flow machines.

Research in VLSI computational or systolic arrays was pioneered by H. T. Kung and his students and collaborators at Carnegie-Mellon University. Several early publications, notably Section 8.3 of the Mead-Conway text [54], "Algorithms for VLSI processor arrays," contributed by H. T. Kung and C. Leiserson, C. Thompson's Ph.D. dissertation on VLSI models of computation [83], and H. T. Kung's papers on the structure of parallel algorithms and on systolic architectures [38], [40], [41], stimulated what is today a much broader and deeper effort in the design and complexity analysis of algorithms for VLSI systems than we could hope to do justice to here. We offer instead several examples that illustrate some of the range and style of this architecture and computational model. Versions of such systems are commercially available as LSI and VLSI chips and chip sets aimed at signal processing applications, and more general systolic systems have been and are being assembled as testbeds [6], [88], [8], [80], [21], [75].

A. Digital Signal Processing

Digital signal processing lends itself extremely well to iterative computational arrays. It is typical of these applications that one can use fairly short number representations, say, 8-24 bit integers or floating point mantissas, although even less precise representations of sampled data are feasible for certain applications such as synthetic aperture radar. It is a frequent goal that the signal be processed at the same rate at which it is sampled.

The finite impulse response (FIR) filter is a good introductory example of a simple signal processor. From samples X0, X1, X2, ... the FIR filter is to produce an output Y0, Y1, Y2, ... given by

Y_n = Σ_i a_i X_{n-i}.

Thus, the input X = 1, 0, 0, ... produces an output Y = a0, a1, a2, ..., aN, 0, 0, ..., an impulse response of length N. Typical values of N in practice range from about 10 to 50. This is a formulation of the filter that allows one sample between the input and its first influence on the output, the minimum required for a clocking algorithm in the form of input and output registers. One is generally interested only in the throughput of such a system, not in the latency of the computational array.
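Before turning to implementations, a direct Python rendering of the expression may help fix the notation. The second function restates the same filter as a chain of coefficient cells, each with a sample register, in the spirit of Fig. 11(a); the summation here is collapsed into one step, so it does not model the pipelined variants. Both are sketches for checking behavior, with arbitrary example coefficients.

def fir(a, x):
    # Y_n = sum over i of a_i * X_(n-i), with X taken as zero before the first sample
    N = len(a)
    return [sum(a[i] * (x[n - i] if n - i >= 0 else 0.0) for i in range(N))
            for n in range(len(x))]

def fir_cells(a, x):
    # one cell per coefficient: samples shift down the chain, each cell multiplies
    regs = [0.0] * len(a)
    out = []
    for sample in x:
        regs = [sample] + regs[:-1]
        out.append(sum(ai * ri for ai, ri in zip(a, regs)))
    return out

# impulse response check from the text: a unit impulse reproduces the coefficients
a = [3.0, 1.0, 4.0, 1.0, 5.0]
impulse = [1.0] + [0.0] * 9
assert fir(a, impulse) == a + [0.0] * 5
assert fir_cells(a, impulse) == fir(a, impulse)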
A pre-VLSI architecture for a filter of this type would typically consist of a single fast multiply-add element (that could be pipelined) and a small memory to hold the N - 1 previous samples and the N coefficients. The expression would be evaluated in N steps for each sample, with some cunning strategy for rewriting each of the old samples down one position in the memory. Although this approach seemingly requires a minimum number of parts, it demands that the multiply-add step be N times faster than the sample rate. A design for a given sample rate could be expected to serve only up to a certain value of N before one resorts to compound filters or to concurrency in the evaluation of the expression.

A VLSI computational array for this filter can take a variety of forms [16], but since regularity and modularity are particularly desired in VLSI implementations, one might start with the network illustrated in Fig. 11(a). Although it is both modular and correct, the cascade summation violates the principle that the computing rate be independent of the size of the array. Assuming that one is indifferent about the delay imposed by the filter, one might then decide to pipeline the summation network. The pipelined summation tree version shown in Fig. 11(b) achieves this objective, but at some loss of regularity in what started as a strictly linear structure. In another approach, shown in Fig. 11(c), one reverses the direction of the summation network, and reorganizes the pipelining. It may not be immediately obvious that this network is the same filter as that of Fig. 11(a), and that it contains an equivalent defect. Here the input X must be broadcast to all stages, an operation that could not be accomplished in a single computing cycle for large N. Finally, one might insert pipeline stages in the filter of Fig. 11(a) to obtain the design shown in Fig. 11(d), a strictly linear and arbitrarily extensible array whose extensive pipelining imposes N samples of delay in excess of the original formulation.

Fig. 11. (a) FIR filter #1. (b) FIR filter #2. (c) FIR filter #3. (d) FIR filter #4.

The FIR filter is a convolution of the coefficients and input samples. One can generalize this example to convolution, in which some combination of the a_i, inputs, and outputs move in pipelines in the course of the computation. A variety of what H. T. Kung calls "semisystolic" [Fig. 11(a) and (c)] and "pure systolic" [Fig. 11(b) and (d)] versions have been devised for convolution [41]. Since the discrete Fourier transform (DFT) is a convolution, it has the same formulations, although one would generally prefer the efficiency of the fast Fourier transform (FFT) algorithms discussed below. Recursive digital filters can be structured as a linear array with feedback paths, a semisystolic structure at best, but can be converted to two-way pipelines [40]. As one might expect from the polynomial formulation of the FIR filter, multiplication and division of polynomials can be accomplished by arrays that are essentially similar to filters [16]. If the arithmetic is done over finite fields, such networks are attractive architectures for coding algorithms, such as Reed-Solomon encoders [46] and decoders [87].

B. Fast Fourier Transforms

Concurrent versions of the fast Fourier transform (FFT) algorithm [61] have been studied and used for many years, and form an interesting contrast in their sequential, computational array, and microcomputer array implementations.

The FFT is often represented in the flow graph form shown in Fig. 12(a), the vertices representing operations and the arcs the data dependencies. The data items and roots of unity (the W terms) are complex numbers. The basic operation at the pair of vertices at the output side of each "butterfly," shown in Fig. 12(b), involves four multiplications and six additions, with the possibility of several operations taking place concurrently. We shall treat the butterfly as indivisible.

Fig. 12. (a) FFT flow graph. (b) "Butterfly" computation.

The traditional sequential approach to this computation, whether in a program, the microcode of a peripheral array processor, or special purpose hardware, involves applying the butterfly computation to data stored in memory in an order allowed by the flow graph. Each such application requires six memory reads and four writes, and it may in some cases be advantageous to partition the memory [15]. Predictably, the computational array approach is to use many butterfly nodes. There are, however, many ways to trade area and time. The network shown in Fig. 13 is constructed directly by replacing each butterfly structure in the flow graph with a butterfly computing node. It is assumed that each node includes its own pipeline register(s). This construction was carried out with one permutation in location in the second column in order to achieve a connection structure that is the same between each stage. That this could be done is not a surprise to readers familiar with the shuffle network [73], [74], a uniform single stage of an indirect binary n-cube. The rearrangement of signals between stages takes its name from a perfect shuffle of a deck of cards. This shuffle includes a uniform distribution of wire lengths from the width to half the height of the wiring channel between stages, so this network is only "semisystolic." Thompson has shown [83] that to lay out an n-point shuffle on a chip, O(n²/log² n) area is required. For sufficiently large n, the shuffle wiring would be larger than the n associated nodes, but this is a much larger value of n than one contemplates for a single chip.

Fig. 13. A computational array for the FFT using a shuffle between each stage.
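The two building blocks just described are easy to state in isolation. The Python fragment below shows the butterfly of Fig. 12(b) and the perfect-shuffle rearrangement; it does not attempt to reproduce the exact ordering conventions of the network in Fig. 13, and the helper names are ours.

import cmath

def butterfly(x, y, w):
    # one complex multiplication (4 real multiplications, 2 real additions)
    # followed by two complex additions: the 4-multiply, 6-add total cited above
    t = w * y
    return x + t, x - t

def perfect_shuffle(seq):
    # interleave the two halves, like a perfect shuffle of a deck of cards
    half = len(seq) // 2
    out = []
    for i in range(half):
        out.append(seq[i])
        out.append(seq[i + half])
    return out

def root_of_unity(k, n):
    return cmath.exp(-2j * cmath.pi * k / n)     # the W terms of the flow graph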
A consequence of the identical structure of each stage, and of the recursive structure of the shuffle, is that this computational array can be folded to trade off between time and area. In the form shown in Fig. 13, the network processes n points each time unit, but requires (n/2) log2 n butterfly nodes. An iterative form with a single stage would require log2 n time with area n/2. The network can also be folded vertically with the insertion of additional registers to roughly halve the area and double the time on each folding.

C. Arithmetic

The ability to trade time and space in a parameterization of an algorithm leaves a great deal of flexibility in the lower levels of design, particularly in the arithmetic. Many of the efforts in signal processing using computational arrays also employ pipelined serial arithmetic, a clearly synergistic combination [12], [51], [18] that allows both serial communication and highly efficient serial arithmetic. Parallel combinational arithmetic is popular in conventional computers largely to balance arithmetic performance with storage cycles, while in fact it is inefficient in the following sense. When one needs to minimize the time (latency) of an arithmetic operation, one makes different design choices than to optimize throughput. The lower bounds for multiplication of n-bit integers, even under a constant wire delay model [1], [7], show an area-time-squared invariance: AT² = Ω(n²). Even if such multiplier designs [63] spanned an interesting range of time complexity, the complexity analysis shows that one must increase the area rather drastically in order to reduce the latency. In the case of computational arrays, latency is typically not an issue since one is pipelining within and between nodes in any case. One is then free to minimize the cost/performance, AT/p, where performance is measured in throughput. In fact, this problem is not so intellectually appealing as the latency problem, in that optimal cost/performance is easily achieved in almost any highly pipelined multiplier. For example, certain versions of the very efficient carry-save multiplier [49] generate a single 2n-bit product serially with A = n adder cells and T = n(t_sum + t_carry) [75], a minimum (in binary notation without recoding) for AT/p achieved even with p = 1.

D. Matrix Computations

Another well-developed set of applications for combinational arrays are matrix computations, which use two-dimensional arrays, either square or hexagonal. The hexagonal array provides three directions of flow, and beautiful mappings of matrix multiplication and LU decomposition by Gaussian elimination without pivoting [39], [40]. When such a matrix is banded, one can use an array whose size corresponds to the bandwidth.
These methods have been extended to arrays that avoid broadcasting of data and control, that accomplish partial pivoting, and to systems that use an array of given bandwidth for solving systems of equations with larger bandwidth [33], both for Gaussian elimination and for the QR factorization method [34], [25].

When applied to n x n matrices, and using n² nodes, these arrays perform matrix multiplication or the solution of a system of linear equations in O(n) time, an O(n²) reduction from the sequential machine time complexity. While one-dimensional computational arrays fit a single stream of data from a disk or a sensor, the performance of these two-dimensional arrays assumes that input and output occurs in n-wide streams. If these computational arrays were to be used in a computing environment in which the array serves as an attached processor, one must deal with matrix problems of variable size, and the O(n²) I/O requirements of the array outweigh the O(n) time complexity. Thus, the methods that provide for an array of relatively small bandwidth to be used iteratively for larger problems, and for problems of variable size, are crucial to the useful application of these two-dimensional systems.
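As a concrete, if serialized, picture of how n² cells reach O(n) time, the following Python sketch simulates the common square, output-stationary arrangement: operands enter along the west and north edges in skewed order, move one cell per step, and every cell performs one multiply-accumulate per step. This is an illustration of the principle only; the hexagonal arrays of [39], [40] and the banded variants differ in detail.

def systolic_matmul(A, B):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    a = [[0.0] * n for _ in range(n)]   # east-bound operand held at each cell
    b = [[0.0] * n for _ in range(n)]   # south-bound operand held at each cell
    for t in range(3 * n - 2):          # the whole product takes O(n) steps
        # operands advance one cell per step (a moves right, b moves down)
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a[i][j] = a[i][j - 1]
                b[j][i] = b[j - 1][i]
        # skewed injection at the west and north edges
        for i in range(n):
            k = t - i
            a[i][0] = A[i][k] if 0 <= k < n else 0.0
            b[0][i] = B[k][i] if 0 <= k < n else 0.0
        # every cell performs one multiply-accumulate concurrently
        for i in range(n):
            for j in range(n):
                C[i][j] += a[i][j] * b[i][j]
    return C

import random
A = [[random.random() for _ in range(4)] for _ in range(4)]
B = [[random.random() for _ in range(4)] for _ in range(4)]
C = systolic_matmul(A, B)
for i in range(4):                      # agrees with the ordinary triple-loop product
    for j in range(4):
        assert abs(C[i][j] - sum(A[i][k] * B[k][j] for k in range(4))) < 1e-9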
VI. MICROCOMPUTER ARRAYS

The remaining class of concurrent VLSI architectures to be discussed are systems that we label as microcomputer arrays, although they might also be called microprocessor arrays [62], homogeneous machines [76], apiaries [28], ensembles [66], transputers [2], mobs, conspiracies, teams, swarms, or myriad-micros. Whatever descriptive, clever, or poetic name is attached to these machines, the basic structure, as shown in Fig. 14, is a set of N computers that send messages to each other through a communication network. The node complexity ranges from a few Mλ², small enough to place several nodes per chip today, to perhaps 1000 Mλ². We shall assume that these multiple instruction multiple data (MIMD) arrays are homogeneous, and that there is no communication between the computers except through the network.

Fig. 14. Microcomputer array.

A. Message-Passing Structures

There are many approaches to supporting the communication between a large number of concurrently operating instruction processors. The message-passing approach over direct networks, while a direct extrapolation of the attention to locality and of the style of node interaction used in logic-enhanced memories and in computational arrays, is admittedly at one extreme.

Although it is certainly the most thoroughly studied parallel architecture, the shared storage multiprocessor in its simplest form, in which all storage references traverse a switch or other "stunt box" between processors and storage banks, is at the opposite extreme, and succeeds in making all storage references equivalent by making them all equally expensive.

An intermediate approach is to associate the storage with the processors, corresponding again to Fig. 14, and to organize the computation so that a majority of storage references are local, and only occasional references traverse the communication network. If the network is of the indirect type, such as that shown in Fig. 3(b), then all remote references exhibit a similar latency. The Denelcor HEP and the BBN Butterfly are examples of systems with this two-level hierarchy. One may also achieve a many-level hierarchy by an approach such as that used in Cm* [78], in which the switches and storage are organized in a hierarchy that provides access to increasing amounts of storage through each level of switch. Another very effective intermediate approach is to provide a local cache for each processor [59], but in a multiprocessor environment, one must employ some method of maintaining coherence between caches [20] with a tolerable latency. Digital Equipment Corporation is developing a VAX multiprocessor based on the coherent cache approach.

Communication through storage employs a global address, so that the switching network must simulate a complete connection. Even when local storage or cache is used, the network that handles remote references can become congested, depending on both the characteristics and size of the network, and on the locality of references. Thus, it is not possible to put any hard limits on the number of processors such systems might include, but systems in the 10-100 processor range appear to be entirely feasible.

One can also achieve the effect of a complete connection with a direct network by routing messages from node to node. In this case, still assuming that the messages pertain to storage references, the remote references might be to a neighboring or very distant node, so that the remote references fall into a hierarchy of increasing access time to get at an increasing fraction of the working set. Also, many algorithms do not require other than the limited connection plan of a given direct network, as we have seen from a sample of systolic algorithms, and as has been shown in extensive studies of algorithms for ultracomputers [64] based on a shuffle interconnection.

Given a large disparity between local and remote accesses, and variability in the remote accesses, there is little motive to tie communication to storage access. Instead of suspending instruction interpretation in a node while waiting for a reply from a read request to a remote node, one can organize a computation into concurrent processes that communicate by message passing. A process is here a sequential program that includes actions of sending and receiving messages, and may also be able to put itself to sleep pending the completion of a send action or the receipt of a message. There are no global memory addresses; instead, there are references to processes used to direct the messages that pass between them. These messages can be of a variety of types that is limited only by what can be interpreted by the process code. The communication between concurrent processes is then explicit, as in Hoare's communicating sequential processes notation [30], concurrent extensions of object-oriented programming [43], or the actor model [28], rather than being through shared variables.

This direct network message-passing architecture, with its corresponding necessity to distribute a concurrent computation in a way that respects locality, the limits in channel bandwidth, and the latency in the direct network, is the extreme point that these experimental VLSI microcomputer arrays have adopted. Because they are relatively loosely coupled, these systems scale well to thousands of nodes.
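The process and message vocabulary used above can be made concrete with a small sketch. The Python fragment below is not the Caltech programming system; it merely mimics the model with one thread per process and one mailbox per process name, so that send and a blocking receive are the only means of interaction.

import queue, threading

mailboxes = {"ping": queue.Queue(), "pong": queue.Queue()}

def send(dest, msg):
    mailboxes[dest].put(msg)          # messages are directed by process reference, not address

def receive(name):
    return mailboxes[name].get()      # the process sleeps until a message arrives

def ping_body(n):
    for i in range(n):
        send("pong", ("square", i))
        print("ping received", receive("ping"))
    send("pong", ("stop", None))

def pong_body():
    while True:
        kind, value = receive("pong")
        if kind == "stop":
            return
        send("ping", ("result", value * value))

threads = [threading.Thread(target=ping_body, args=(3,)),
           threading.Thread(target=pong_body)]
for t in threads:
    t.start()
for t in threads:
    t.join()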
B. System Examples

There are today quite a few working examples of microcomputer arrays. The system building projects at Caltech [67] evolved from the algorithm, architecture, and programming methods research of S. Browning [9]-[11], B. Locanthi [47], and C. R. Lang [43], and have been strongly influenced by A. Martin's processing surface experiments [52]. The Caltech Mosaic node [48] is a 16 Mλ² custom chip, about 75 percent of which is devoted to storage, and the rest to a fast 16 bit processor and four bidirectional communication channels. Small prototype Mosaic systems are being used for programming experiments, and a 1024-node machine is under construction.

Also in use at Caltech are several "cosmic cube" machines [67], [69], which use commercial parts amounting to about 200 Mλ² per node. The cosmic cube is a hardware simulation of the kind of machine that could be built with single-chip nodes about five years hence, but is useful even today. Numerous scientific and engineering applications [22] have been run on a 64-node binary six-cube connected cosmic cube at about 10 times the performance of a VAX 11/780, thus providing at least an existence proof for the utility and cost-effectiveness of the microcomputer array architecture and the concurrent process message-passing model.

The CHiP (configurable highly parallel) computer project [72], started by L. Snyder at Purdue, is aimed at a system in which the communication can be configured for a computation. Fig. 15 shows two examples, a hexagonal mesh and a binary tree, embedded on a structure of nodes (squares) and simple switches (circles). This approach has the advantage of tailoring the communication to the needs of an algorithm, and may also provide a mechanism for configuring systems to avoid faulty nodes. The CHiP project employs an interesting interactive front end called the Poker parallel processing environment, and two "Pringle" hardware emulators for the CHiP computer are in use at Purdue and at the University of Washington.

Fig. 15. Two configurations for the CHiP computer.

The systolic array testbeds referenced above are really microcomputer arrays that can be programmed to execute systolic algorithms. The Carnegie-Mellon programmable systolic computer (PSC) [21], a single-chip node that has been demonstrated in small arrays, devotes a relatively larger fraction of the area of a node to operations such as multiply, and is programmable at a microcode rather than conventional instruction set level. It accordingly represents a considerably stronger orientation toward performance in systolic algorithms than the microcomputer arrays. INMOS has recently announced [2] the "transputer," an advanced single chip computer with on-chip storage and four communication channels, a node element quite similar to Mosaic. The Occam programming environment for the transputer supports both single transputer programming and concurrent programming of arrays of transputers.

C. Earlier Examples Revisited

The model of communicating processes is quite consistent with the model implicit in the discussion of logic-enhanced memories and computational arrays. In fact, any of the computations already discussed can be mapped onto microcomputer arrays. This mapping can be described as a graph embedding in which the nodes of a finer grain machine are placed as processes within the nodes of a microcomputer array. The microcomputer array cannot compete in performance with its more specialized relatives when using the same algorithms. However, the practical difference between mapping a formulation to silicon, with the necessity and risk of building a machine, versus programming an existing machine, can be expected to favor the microcomputer array. Also, the microcomputer array can generally employ more efficient algorithms.

The choice of mapping a process formulation onto a microcomputer array influences in interesting ways the concurrency and load balancing in the computation. For example, if the same computation performed by the 512 x 512 pathfinder system discussed in Section IV were to be mapped uniformly onto a 32 x 32 toroidal mesh connected microcomputer array, each microcomputer node would need to deal with 256 grid points. It might seem most natural to assign 16 x 16 subgrids to each node. Although this is the best mapping for minimizing the communication between nodes, it is the worst from a load balancing standpoint. The locality of the propagation concentrates segments of wavefronts into nodes, and leaves many nodes idle. A much better mapping for load balancing assigns grid point (x, y) to node (x mod 32, y mod 32). This mapping disperses the wavefront in order to maximize concurrency at the expense of communication. The optimal mapping would likely fall at an intermediate point, and depends on the relationship between the communication and computation performance of the nodes.
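The difference between the two mappings can be seen even in a toy calculation. The Python sketch below (hypothetical helper names) counts, for one diagonal wavefront of the 512 x 512 grid, the largest number of wavefront points that land on any single node of the 32 x 32 array under each assignment.

GRID, ARRAY = 512, 32
SUB = GRID // ARRAY                       # each node owns a 16 x 16 subgrid

def block_node(x, y):
    return (x // SUB, y // SUB)           # minimizes communication between nodes

def interleaved_node(x, y):
    return (x % ARRAY, y % ARRAY)         # disperses the wavefront for load balance

def max_load(mapping, points):
    counts = {}
    for x, y in points:
        node = mapping(x, y)
        counts[node] = counts.get(node, 0) + 1
    return max(counts.values())

# grid points at Manhattan distance 100 from the corner (0, 0)
wavefront = [(x, 100 - x) for x in range(101)]
print(max_load(block_node, wavefront))         # busiest node holds 11 wavefront points
print(max_load(interleaved_node, wavefront))   # load is spread to at most 4 per node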
Since it is still possible for a single node to have many propagation events scheduled, one cannot use the simple representation of time for cost that the pathfinder employed. One can either communicate costs directly in the messages, or propagate a sequence of locally coherent time-step waves from a corner of the mesh. One should notice that the microcomputer array can employ a larger fraction of its nodes concurrently than when the algorithm is expressed in silicon. The microcomputer array can also perform the retrace phase, and can deal with a variety of speedup techniques. One such technique, studied by this author in a simulation, is to perform a computation for a subregion, such as the 16 x 16 region indicated above, prior to any incoming message, such that the node can respond immediately to a message tagged for each of its boundary points. This trick was worth about a factor of 10 in performance. Another speedup scheme is to initiate the propagation phase from all points at once on a many-pin wire.
The following example of a set of formulations of the FFT algorithm for a Mosaic system is due to a recent discussion with my colleague, D. Cohen. Here we will start with a mapping that is natural to a microcomputer implementation, and end up with a computational array. If one wished to compute the Fourier transform in successive windows of n samples, it is a desirable system organization for the transform to be computed at the sample rate. Thus, a pipeline of log2 n nodes, as shown in Fig. 16(a), allows the successive nodes to compute the successive stages of the transform shown in Fig. 12. Assuming that samples are sent between nodes in a top-to-bottom order in each stage, and starting with all queues empty, the first node must queue (n/2) samples before it can start its butterfly computations. As long as the node can perform butterfly computations at a rate of one for each two input messages, its input queue will not be longer than (n/2). After each butterfly, the node can rid itself of the first result immediately, but must queue another (n/2) outputs. At the second stage the situation is similar, except that the queue lengths are (n/4). The last stage needs to queue only one input and one output, but stores (n/2) roots of unity. The ability to queue intermediate results to smooth the computing load is the same extension of the systolic model in its use of geometrically varying queue size as is employed in certain sorting algorithms [85], such as the bitonic sort.

What if the nodes were not fast enough to perform the butterfly computations at the desired sample rate? One could use multiple independent pipelines, but there is another way. The system shown in Fig. 16(b) is intended to double the permissible input sample rate. The input node distributes even samples to the upper pipeline and odd samples to the lower pipeline. Each pipeline operates at half the sample rate and with half the storage requirements. The pipelines are independent until the last stage, at which point the butterfly reappears. Log2 n applications of this doubling bring one to a systolic implementation in which the queues have disappeared.

Fig. 16. (a) Microcomputer array FFT. (b) Throughput doubling by separation into even and odd samples.
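The staging described above can be checked with a short simulation. The Python sketch below implements one pipeline node as a generator in decimation-in-frequency form; for simplicity each node gathers a whole block of 2 x span samples before emitting the span sums followed by the span twiddled differences, which is equivalent to, though cruder than, the butterfly-by-butterfly queueing described in the text. The output emerges in bit-reversed order.

import cmath

def fft_stage(stream, span, n):
    # one node of the log2(n)-stage pipeline of Fig. 16(a); span is n/2 at the
    # first node, n/4 at the second, and so on
    buf = []
    for s in stream:
        buf.append(s)
        if len(buf) == 2 * span:
            step = n // (2 * span)                  # twiddle stride for this stage
            for j in range(span):                   # sums can leave immediately
                yield buf[j] + buf[j + span]
            for j in range(span):                   # differences are the queued outputs
                yield (buf[j] - buf[j + span]) * cmath.exp(-2j * cmath.pi * j * step / n)
            buf = []

def pipeline_fft(samples, n):
    stream = iter(samples)
    span = n // 2
    while span >= 1:
        stream = fft_stage(stream, span, n)
        span //= 2
    return list(stream)                             # bit-reversed order

# check a 4-point window against the DFT definition
x = [1.0, 2.0, 0.5, -1.0]
X = pipeline_fft(x, 4)
dft = [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / 4) for j in range(4)) for k in range(4)]
for out_pos, k in enumerate([0, 2, 1, 3]):          # bit-reversal of 2-bit indices
    assert abs(X[out_pos] - dft[k]) < 1e-9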
D. Irregular Computations

These programmable machines have additional capabilities well beyond those of emulating their finer grain relatives. They are not restricted to static process structures, but can build dynamic structures. Their storage capabilities allow them to queue blocks of data, and their programmability allows a wide choice of efficient algorithms. These are indeed very interesting machines to program!

It is perhaps not very surprising that concurrent computations with regular structures and predictable communication demands [22] are very efficiently performed on microcomputer arrays. These problems are, after all, the usual grist for parallel mills. What is just becoming clear is how far such machines can be extended to deal with computations in which the communication demands are irregular. Two concurrent electrical circuit simulation programs, CONCISE, developed recently at Caltech by S. Mattisson [53] to be run on the cosmic cube, and a parallel version of SPLICE developed at Berkeley by Deutch and Newton [19] for the BBN Butterfly, are paradigms of this irregular class of computation that can still be performed efficiently on microcomputer arrays. Circuit simulation is a numerically demanding computation that involves evaluation of nonlinear circuit models to produce piecewise linear representations, solution of large sets of linear equations of irregular sparsity, time step determination, and integration. Although the message traffic or storage access patterns are irregular, the process structure is static for the course of the computation. Instead of the traditional indirect method of solving the matrix equation for a time step, CONCISE and the parallel SPLICE both use iterative methods, and in all other regards, such as their time step determination and windowing, tend to debunk the theory that concurrent machines are constrained to using unsophisticated and inefficient algorithms.

The next test of the capabilities of microcomputer array architectures is their ability to support computations in which the process structure can change dynamically, and in which processes can distribute and relocate themselves during execution. The idea that with so many machines it should be possible to achieve high performance, and also fault-tolerant operation, is appealing and likely possible. The Japanese Fifth-Generation project [58] and other efforts in high performance architectures for artificial intelligence applications cite the necessity of using VLSI and concurrency to achieve their objectives. These AI problems are typically highly branched and dynamic. Although microcomputer arrays are interesting candidates for vision and knowledge base applications, it remains to be seen whether AI machines can take a similar form.

VII. CONCLUSIONS

The physical design task and the principles to which one must adhere in VLSI and its expected descendents are more demanding but also more uniform than in earlier technologies. The high cost of communication, in area, power, and time, relative to switching performance, constrains the design and engineering of large-scale tightly coupled systems. However, the absolute costs of switching, storage, and communication are dropping so rapidly that one can reasonably think about systems of such large scale that they might have more processing elements than early computers had bits of storage. Loosely coupled concurrent systems place simpler demands on the design and engineering of such large-scale systems, but shift the burden of planning the distribution of processing and communication to the formulation and expression of the algorithms and applications. As J. Schwartz put it [64]: "A central problem is to develop techniques which allow the organization of concurrent computation on a very large scale.... The deepest opportunities inherent in microstructure technology can only be realized if we find effective ways of structuring and using such [highly concurrent computing] assemblages."

This paper has presented an effective and consistent, if somewhat crude and direct, way of structuring highly concurrent computations in terms of processes and messages. Such formulations can be expressed in a single family of concurrent VLSI architectures, either directly in silicon, or in concurrent programs that run on arrays of microcomputers. Although examples of these machines have proved to provide very good performance and excellent cost/performance on certain problems, one can make no pretense that these machines are useful or efficient for all computational problems.

There are fundamental questions of how one organizes the cooperation of so many computers to perform a single computation. How does one formulate and specify such a computation, and verify its correctness? Is it necessary to formulate and direct these computations explicitly, or are there also effective implicit formulations from which the concurrency can be derived? There may be no universal answers to these questions, but the experiments underway may at least be suggestive of some of the problems and opportunities.

REFERENCES

[1] H. Abelson and P. Andreae, "Information transfer and area-time tradeoffs for VLSI multiplication," Commun. Ass. Comput. Mach., vol. 23, no. 1, pp. 20-23, 1980.
[2] I. Barron et al., "Transputer does 5 or more MIPS even when not used in parallel," Electronics, pp. 109-115, Nov. 17, 1983.
[3] K. E. Batcher, "The flip network in STARAN," in Proc. Int. Conf. Parallel Processing, 1976, pp. 65-71.
[4] G. Bilardi, M. Pracchi, and F. P. Preparata, "A critique and an appraisal of VLSI models of computation," in Proc. CMU Conf. VLSI Syst. Comput., 1981. Rockville, MD: Comput. Sci. Press, 1981.
[5] G. Bishop and H. Fuchs, "A smart optical sensor on silicon," in Proc. Conf. Advanced Res. VLSI, Jan. 1984. Dedham, MA: Artech, 1984, pp. 65-73.
[6] J. Blackmer, P. Kuekes, and G. Frank, "A 200 MOPS systolic processor," in Proc. SPIE, vol. 298, Real-Time Signal Processing IV, Soc. Photo-Opt. Instrum. Eng., 1981.
[7] R. P. Brent and H. T. Kung, "The chip complexity of binary arithmetic," in Proc. 12th ACM Symp. Theory Comput., May 1980, pp. 190-200.
[8] K. Bromley et al., "Systolic array processor developments," in Proc. CMU Conf. VLSI Syst. Comput., Oct. 1981. Rockville, MD: Comput. Sci. Press, 1981, pp. 273-284.
[9] S. A. Browning, "The tree machine: A highly concurrent computing environment," Dep. Comput. Sci., California Inst. Technol., Pasadena, Tech. Rep. 3760:TR:80, 1980.
[10] S. A. Browning, sect. 8.4.1 in C. A. Mead and L. Conway, Introduction to VLSI Systems. Reading, MA: Addison-Wesley, 1980.
[11] S. A. Browning and C. L. Seitz, "Communication in a tree machine," in Proc. 2nd Caltech Conf. VLSI, Dep. Comput. Sci., California Inst. Technol., Pasadena, 1981, pp. 509-526.
[12] M. R. Buric and C. Mead, "Bit serial inner product processors in VLSI," in Proc. 2nd Caltech Conf. VLSI, Dep. Comput. Sci., California Inst. Technol., Pasadena, 1981, pp. 155-164.
[13] C. R. Carroll, "A smart memory array processor for two layer path finding," in Proc. 2nd Caltech Conf. VLSI, Dep. Comput. Sci., California Inst. Technol., Pasadena, 1981, pp. 165-195.
[14] D. P. Christman, "Programming the connection machine," M.S. thesis, Dep. Elec. Eng. Comput. Sci., Massachusetts Inst. Technol., Cambridge, 1984.
[15] D. Cohen, "Simplified control of FFT hardware," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, Dec. 1976.
[16] D. Cohen, "Mathematical approach to iterative computation networks," in Proc. 4th Symp. Comput. Arithmetic (also as ISI/RR-78-73, U.S.C./Inform. Sci. Inst., Marina del Rey, CA, Nov. 1978), IEEE Cat. No. 78CH1412-6C, pp. 226-238, Oct. 1978.
[17] A. L. Davis and R. M. Keller, "Data flow program graphs," IEEE Computer, vol. 15, pp. 26-41, Feb. 1982.
[18] P. B. Denyer, "An introduction to bit-serial architectures for VLSI signal processing," in VLSI Architecture, B. Randall and P. C. Treleven, Eds. Englewood Cliffs, NJ: Prentice-Hall, 1983, ch. 20.
[19] J. T. Deutch and A. R. Newton, "A multiprocessor implementation of relaxation-based electrical circuit simulation," in Proc. 21st Design Automat. Conf., 1984, pp. 350-357.
[20] M. Dubois and F. A. Briggs, "Effects of cache coherency in multiprocessors," IEEE Trans. Comput., vol. C-31, Nov. 1982.
[21] A. L. Fisher et al., "The architecture of a programmable systolic chip," J. VLSI Comput. Syst., vol. 1, no. 2, pp. 1-16, 1984.
[22] G. C. Fox and S. W. Otto, "The use of concurrent processors in science and engineering," Phys. Today, May 1984.
[23] H. Fuchs et al., "Developing pixel-planes, a smart memory-based raster graphics system," in Proc. Conf. Advanced Res. VLSI, Massachusetts Inst. Technol., Cambridge, Jan. 1982. Dedham, MA: Artech, 1982.
[24] G. R. Goke and G. J. Lipovski, "Banyan networks for partitioning multiprocessor systems," in Proc. 1st Annu. Symp. Comput. Architecture, 1973, pp. 21-28.
[25] D. E. Heller and I. C. F. Ipsen, "Systolic networks for orthogonal equivalence transformations and their applications," in Proc. Conf. Advanced Res. VLSI, Massachusetts Inst. Technol., Cambridge, Jan. 1982. Dedham, MA: Artech, 1982, pp. 113-122.
[26] G. G. Hendrix, "Encoding knowledge in partitioned networks," in Associative Networks. New York: Academic, 1979.
[27] J. L. Hennessy, "VLSI processor architecture," IEEE Trans. Comput., this issue, pp. 1221-1246.
[28] C. E. Hewitt, "The apiary network architecture for knowledgeable systems," in Conf. Rec. LISP Conf., Stanford, CA, Aug. 1980.
[29] W. D. Hillis, "The connection machine (computer architecture for the new wave)," Massachusetts Inst. Technol., Cambridge, AI Memo 646, Sept. 1981.
[30] C. A. R. Hoare, "Communicating sequential processes," Commun. Ass. Comput. Mach., vol. 21, no. 8, pp. 666-677, 1978.
[31] B. Hoeneisen and C. A. Mead, "Fundamental limitations in microelectronics I. MOS technology," Solid-State Electron., vol. 15, pp. 819-829, 1972.
[32] L. Johnsson and D. Cohen, "A mathematical approach to modeling the flow of data and control in computational networks," in Proc. CMU Conf. VLSI Syst. Comput., Oct. 1981. Rockville, MD: Comput. Sci. Press, 1981, pp. 213-225.
[33] L. Johnsson, "Computational arrays for band matrix equations," Dep. Comput. Sci., California Inst. Technol., Pasadena, Tech. Rep. 4287:TR:81, 1981.
[34] L. Johnsson, "A computational array for the QR-method," in Proc. Conf. Advanced Res. VLSI, Massachusetts Inst. Technol., Cambridge, 1982. Dedham, MA: Artech, 1982, pp. 123-129.
[35] D. J. Kuck et al., "Measurements of parallelism in ordinary FORTRAN programs," IEEE Computer, vol. 7, pp. 37-46, Jan. 1974.
[36] R. H. Kuhn and D. A. Padua, Eds., IEEE Computer Society Tutorial on Parallel Processing, 1981.
[37] H. T. Kung, "Synchronized and asynchronous parallel algorithms for multiprocessors," in Algorithms and Complexity: New Directions and Recent Results, J. F. Traub, Ed. New York: Academic, 1976, pp. 153-200.
[38] H. T. Kung, "Let's design algorithms for VLSI," in Proc. Caltech Conf. VLSI, Dep. Comput. Sci., California Inst. Technol., Pasadena, 1979, pp. 65-90.
[39] H. T. Kung and C. E. Leiserson, "Algorithms for VLSI processor arrays," in C. A. Mead and L. Conway, Introduction to VLSI Systems. Reading, MA: Addison-Wesley, 1980, sect. 8.3, pp. 271-292.
[40] H. T. Kung, "The structure of parallel algorithms," in Advances in Computers, vol. 19. New York: Academic, 1980.
[41] H. T. Kung, "Why systolic architectures?," IEEE Computer, vol. 15, Jan. 1982.
[42] S.-Y. Kung et al., "Wavefront array processor: Language, architecture, and applications," IEEE Trans. Comput., vol. C-31, pp. 1054-1066, Nov. 1982.
[43] C. R. Lang, Jr., "The extension of object-oriented languages to a homogeneous concurrent architecture," Dep. Comput. Sci., California Inst. Technol., Pasadena, Tech. Rep. 5014:TR:82, 1982.
[44] D. H. Lawrie, "Access and alignment of data in an array processor," IEEE Trans. Comput., vol. C-24, pp. 1145-1155, Dec. 1975.
[45] C. Lee, "An algorithm for path connections and its applications," IRE Trans. Electron. Comput., vol. EC-10, pp. 346-365, Sept. 1961.
[46] K. Y. Liu, "Architecture for VLSI design of Reed-Solomon encoders," in Proc. 2nd Caltech Conf. VLSI, Dep. Comput. Sci., California Inst. Technol., Pasadena, 1981, pp. 539-554.
[47] B. Locanthi, "The homogeneous machine," Dep. Comput. Sci., California Inst. Technol., Pasadena, Tech. Rep. 3759:TR:80, 1980.
[48] C. Lutz et al., "Design of the mosaic element," in Proc. Conf. Advanced Res. VLSI, Massachusetts Inst. Technol., Cambridge, 1984. Dedham, MA: Artech, 1984, pp. 1-10.
[49] R. F. Lyon, "Two's complement pipeline multipliers," IEEE Trans. Commun., vol. COM-24, pp. 418-425, Apr. 1976.
[50] R. F. Lyon, "The optical mouse, and an architectural methodology for smart digital sensors," in Proc. CMU Conf. VLSI Syst. Comput., Oct. 1981. Rockville, MD: Comput. Sci. Press, 1981.
[51] R. F. Lyon, "A bit-serial architectural methodology for signal processing," in VLSI 81. New York: Academic, 1981.
[52] A. J. Martin, "The torus: An exercise in constructing a processing surface," in Proc. 2nd Caltech Conf. VLSI, Dep. Comput. Sci., California Inst. Technol., Pasadena, 1981, pp. 527-538.
[53] S. Mattisson, "A concurrent circuit simulator," Dep. Comput. Sci., California Inst. Technol., Pasadena, Tech. Rep. 5142:TR:84, 1984.
[54] C. A. Mead and L. Conway, Introduction to VLSI Systems. Reading, MA: Addison-Wesley, 1980.
[55] C. A. Mead and M. Rem, "Minimum propagation delays in VLSI," IEEE J. Solid-State Circuits, vol. SC-17, pp. 773-775, Aug. 1982.
[56] E. Moore, "Shortest path through a maze," Ann. Comput. Lab. Harvard Univ., vol. 30, pp. 285-292, 1959.
[57] G. E. Moore, "Are we really ready for VLSI?," in Proc. Caltech Conf. VLSI, Dep. Comput. Sci., California Inst. Technol., Pasadena, 1979.
[58] T. Moto-oka and H. S. Stone, "Fifth generation computer systems: A Japanese project," IEEE Computer, vol. 17, pp. 6-13, Mar. 1984.
[59] J. H. Patel, "Analysis of multiprocessors with private cache memories," IEEE Trans. Comput., vol. C-31, pp. 296-304, Apr. 1982.
[60] D. A. Patterson and C. H. Sequin, "Design considerations for single-chip computers of the future," IEEE J. Solid-State Circuits, vol. SC-15, Feb. 1980.
[61] M. C. Pease, III, "An adaptation of the fast Fourier transform for parallel processing," J. Ass. Comput. Mach., vol. 15, pp. 252-264, 1968.
[62] M. C. Pease, III, "The indirect binary n-cube microprocessor array," IEEE Trans. Comput., vol. C-26, pp. 458-473, May 1977.
[63] F. P. Preparata, "A mesh-connected area-time optimal VLSI integer multiplier," in Proc. CMU Conf. VLSI Syst. Comput., 1981. Rockville, MD: Comput. Sci. Press, 1981.
[64] J. T. Schwartz, "Ultracomputers," ACM Trans. Programming Languages Syst., vol. 2, pp. 484-521, Oct. 1980.
[65] C. L. Seitz, "System timing," in C. A. Mead and L. Conway, Introduction to VLSI Systems. Reading, MA: Addison-Wesley, 1980.
[66] C. L. Seitz, "Ensemble architectures for VLSI: A survey and taxonomy," in Proc. Conf. Advanced Res. VLSI, Massachusetts Inst. Technol., Cambridge, Jan. 1982. Dedham, MA: Artech, 1982, pp. 130-135.
[67] C. L. Seitz, "Experiments with VLSI ensemble machines," J. VLSI Comput. Syst., vol. 1, no. 3, 1984.
[68] C. L. Seitz and J. Matisoo, "Engineering limits on computer performance," Phys. Today, pp. 38-45, May 1984.
[69] C. L. Seitz, "The cosmic cube," Commun. Ass. Comput. Mach., to be published, Dec. 1984.
[70] H. J. Siegel, "Interconnection networks for SIMD machines," IEEE Computer, vol. 12, pp. 57-65, June 1979.
[71] C. S. Slemaker, R. C. Mosteller, L. W. Leyking, and A. G. Livitsanos, "A programmable printed wiring router," in Proc. 11th Design Automat. Workshop, June 1974.
[72] L. Snyder, "Introduction to the configurable highly parallel computer," IEEE Computer, vol. 15, pp. 47-56, Jan. 1982.
[73] H. S. Stone, "Parallel processing with the perfect shuffle," IEEE Trans. Comput., vol. C-20, 1971.
[74] H. S. Stone, Ed., Introduction to Computer Architecture. Chicago, IL: Sci. Res. Assoc., 1975, particularly pp. 318-374.
[75] W.-K. Su, "Super mesh," M.S. thesis, Dep. Comput. Sci., California Inst. Technol., Pasadena, Tech. Rep. 5125:TR:84, 1984.
[76] H. Sullivan and T. R. Bashkow, "A large scale homogeneous machine I & II," in Proc. 4th Annu. Symp. Comput. Architecture, 1977, pp. 105-124.
[77] I. E. Sutherland and C. A. Mead, "Microelectronics and computer science," Sci. Amer., vol. 237, pp. 210-229, Sept. 1977.
[78] R. J. Swan et al., "Cm*: A modular multimicroprocessor," in Proc. Nat. Comput. Conf., vol. 46. AFIPS Press, 1977, pp. 637-644.
[79] R. M. Swanson and J. D. Meindl, "Ion-implanted complementary MOS transistors in low-voltage circuits," IEEE J. Solid-State Circuits, vol. SC-7, pp. 146-153, Apr. 1972.
[80] J. J. Symanski, "NOSC systolic processor testbed," Naval Ocean Syst. Cen., Tech. Rep. NOSC TD 588, June 1983.
[81] J. E. Tanner, M. H. Raibert, and R. Eskenazi, "A VLSI tactile sensing array computer," in Proc. 2nd Caltech Conf. VLSI, Dep. Comput. Sci., California Inst. Technol., Pasadena, Jan. 1981, pp. 217-234.
[82] J. E. Tanner and C. Mead, "A correlating optical motion detector," in Proc. Conf. Advanced Res. VLSI, Massachusetts Inst. Technol., Cambridge, Jan. 1984. Dedham, MA: Artech, 1984, pp. 57-64.
[83] C. D. Thompson, "A complexity theory for VLSI," Dep. Comput. Sci., Carnegie-Mellon Univ., Pittsburgh, PA, Tech. Rep. CMU-CS-80-140, Aug. 1980.
[84] C. D. Thompson, "The VLSI complexity of sorting," in Proc. CMU Conf. VLSI Syst. Comput., Computer Science Press, Oct. 1981, pp. 108-118.
[85] C. D. Thompson, "The VLSI complexity of sorting," IEEE Trans. Comput., vol. C-32, pp. 1171-1184, Dec. 1983.
[86] U. Weiser and A. Davis, "A wavefront notation tool for VLSI array design," in Proc. CMU Conf. VLSI Syst. Comput., Computer Science Press, Oct. 1981, pp. 226-234.
[87] D. Whiting, "Bit serial Reed-Solomon decoders," Ph.D. dissertation, Dep. Comput. Sci., California Inst. Technol., June 1984.
[88] D. W. L. Yen and A. V. Kulkarni, "The ESL systolic processor for signal and image processing," in Proc. 1981 IEEE Comput. Soc. Workshop Comput. Arch. Pattern Anal. Image Database Management, Nov. 1981, pp. 265-272.

Charles L. Seitz (S'68-M'69) received the B.S., M.S., and Ph.D. degrees from the Massachusetts Institute of Technology, Cambridge, MA.

He is now a Professor of Computer Science at the California Institute of Technology, Pasadena, CA, where his research and teaching activities are in the areas of VLSI architecture and design, concurrent computation, and self-timed systems. Prior to joining the faculty of the California Institute of Technology, he worked as an Industrial Consultant from 1972 to 1977, principally for the Burroughs Corporation, was an Assistant Professor of Computer Science at the University of Utah from 1970 to 1972, and was a Member of the Technical Staff of the Evans and Sutherland Computer Corporation from 1969 to 1971. While at the Massachusetts Institute of Technology, he was an Instructor of Electrical Engineering.

Dr. Seitz is a member of the Association for Computing Machinery and of the IEEE Computer Society, and was the recipient of the Goodwin Medal for "conspicuously effective teaching" at the Massachusetts Institute of Technology.