IEEE TRANSACTIONS ON COMPUTERS, VOL. C-33, NO. 12, DECEMBER 1984
Concurrent VLSI Architectures
CHARLES L. SEITZ, MEMBER, IEEE
(Invited Paper)
Abstract -This tutorial paper addresses some of the principles
and provides examples of concurrent architectures and designs
that have been inspired by VLSI technology. The circuit density
offered by VLSI provides the means for implementing systems
with very large numbers of computing elements, while its physical
characteristics provide an incentive to organize systems so that the
elements are relatively loosely coupled. One class of computer
architectures that evolves from this reasoning includes an interesting and varied class of concurrent machines that adhere to a
structural model based on the repetition of regularly connected
elements. The systems included under this structural model range
from 1) systems that combine storage and logic at a fine grain size,
and are typically aimed at computations with images or storage
retrieval, to 2) systems that combine registers and arithmetic at a
medium grain size to form computational or systolic arrays for
signal processing and matrix computations, to 3) arrays of
instruction interpreting computers that use teamwork to perform many of the same demanding computations for which we use
high-performance single process computers today.
Index Terms -Computational arrays, concurrent computation, logic-enhanced memories, microcomputer arrays, multiprocessors, parallel processing, smart memories, systolic arrays,
VLSI.
I. INTRODUCTION
CONCURRENT (or parallel) architectures are not, of
course, an idea original to VLSI. See, for example, the
IEEE Tutorial on Parallel Processing [36] for an excellent
background and collection of papers on this subject. The
editors introduce this volume by pointing out, "Whenever a
computer designer has reached for a level of performance beyond that provided by his contemporary technology, parallel
processing has been his apprentice." This comment applies
particularly well to VLSI architectures, as we shall see, although in the VLSI domain we usually use the term concurrent to suggest the independence of a collection of computing
activities, in preference to the lockstep connotations of the
term parallel.
In reaching for performance with VLSI, one finds that
the digital microelectronic technologies with the highest
complexity also have definite performance limitations. The
fundamental limitation is the high cost of communication,
relative to logic and storage [77]. Communication is expensive in chip area; indeed, most of the area of a chip is covered with wires on several levels, with transistor switches rarely taking more than about 5 percent of the area on the lowest levels. Communication is also expensive in sending signals between chips, where package pin limitations, the area used for bonding pads and pad drivers, and the cost of the packages must be considered in multichip system designs. Dynamic power supplied to the chip, and dissipated in the circuits that switch capacitive signal nodes, is typically dominated by the parasitic capacitance of the internal wires, bonding pads, and interchip wires, rather than by the capacitance of the transistor gates. In VLSI technologies such as CMOS, in which the static power is negligible, communication then accounts for most of the power consumed and dissipated on a chip.

Manuscript received June 30, 1984; revised July 31, 1984. This work was supported in part by the Defense Advanced Research Projects Agency under ARPA Order 3771, and monitored by the Office of Naval Research under Contract N00014-79-C-0597. The author is with the Department of Computer Science, California Institute of Technology, Pasadena, CA 91125.
When it comes to performance, communication is expensive in delay, both internally and between chips. In MOS
technologies, which exhibit the highest circuit density but a
poor relationship between transistor driving capabilities and
the wiring parasitics, circuit speeds are dominated by parasitic wiring capacitance. The switching speed of an MOS
transistor in modern processes, with one minimum size transistor driving the gate of an adjacent identical transistor, is in
the 0.1 ns range, but if one adds a few hundred microns of
wiring, the delay is increased to several nanoseconds. Also,
the nonzero resistance of the wires, together with the parasitic capacitances of a wire, imposes a delay in the wire itself
that is becoming increasingly significant at smaller geometries. Finally, the disparity between internal signal energies
and the macroscopic world of bonding pads, package pins,
and interchip wiring is so large that the delay penalty in
amplifying a signal so that it can run between chips is comparable to a clock period.
Thus, both the cost and performance metrics of VLSI favor
architectures in which communication is localized. This principle of locality is seen at every level of VLSI design. Cells
are designed and laid out to minimize area and wiring capacitance. Sections of chips and whole chips are carefully organized by cell placement in semicustom designs, and by
floor plans in custom designs, with the objectives of
minimizing wire area and of placing close together those
parts that must communicate at high bandwidth or with small
delay. Finally, partitioning of systems onto multiple chips
must adhere to limits on package pins and the signaling speed
between chips. These physical design considerations of the
geometry and energetics of communication, the effects of
scaling, and their architectural implications are discussed in
greater detail in Section II.
The communication limitations outlined above influence
all VLSI architectures. Even the "general purpose" sequential processors that fit on single chips, the subject of the
companion article [27], exhibit a marked sensitivity of their
design to localization of communication. The performance of
such systems also depends strongly on the extent to which
they exploit covertly the concurrency found in the process of
interpreting a single instruction stream. Localization of communication is achieved by consolidating an entire instruction
processor onto a single chip together with as much of the
lowest levels of the storage hierarchy (registers, stack
frames, and instruction and data caches) as possible [60].
This on-chip storage can be very effective in reducing the
frequency of relatively slow off-chip storage references.
Concurrency in interpreting the sequential instruction stream
is achieved by instruction prefetching and by pipelining of
instruction execution. One might expect these techniques to
be even more effective in speeding up micromainframes than
when they are applied to conventional mainframes, and indeed they are moving the micromainframe into the performance
range of mainframes, and at greatly reduced cost.
We refer to the approach of exploiting concurrency starting
from a sequential program definition as covert because it is
successful only if it is hidden, that is, only if the effect of
executing the source program, as assembled or compiled, is
the same as if it were interpreted sequentially. Unfortunately,
the degree of concurrency achieved by such techniques is
typically and in aggregate much less than 10, although in
some cases the concurrencies that can be discovered in sequential programs [35] are of considerably higher degree.
One is left, then, with the "intriguing question" posed by
C. Mead and L. Conway [54] in Chapter 8, "Highly Concurrent Systems," of their well-known text: "Does VLSI
offer more than inexpensive implementations of conventional computers?"
The concurrent VLSI architectures discussed here exploit
overtly the high degree of concurrency found in many interesting and computationally demanding applications. Here we
label the concurrency approach as overt because the concurrent formulation of the computation is out in the open,
rather than hidden in a single process representation from
which one discovers the concurrency with a compiler or during execution. In this approach one may think about, formulate, and express a computation in terms of many processes
and their interaction. The eventual implementation target of
such a formulation may be either a design directly in silicon,
or a concurrent program that runs on an array of communicating and concurrently operating programmable computers.
It is characteristic of many, but certainly not all, complex
and demanding computing problems that they can be formulated for execution with large degrees of concurrency-in
the thousands, even in millions of loosely coupled processes.
Also, the degree of concurrency in such formulations characteristically grows with the problem size. For example, the
number of computing cycles required for the computation of
a realistic image by ray-tracing grows with the resolution
with which it is displayed, but the concurrency that can be
exploited in casting rays grows similarly. Problems in signal
and image processing, computer graphics, storage retrieval,
matrix operations, solving sets of partial differential equa-
tions, physical simulation, and graph computations are
among those applications that have been intensely studied
and have been the object of clever system inventions by VLSI
researchers and others.
It is clear, however, that such machines are specialized to
the application for which they are conceived. Even those
concurrent machines that are programmable are not "general
purpose" in the usual sense, since not all computing problems
have the requisite concurrent formulation or size. It should be
noted also that these concurrent VLSI engines, while developing rapidly in experimental prototypes both in industry and
in universities, are relatively long range and adventurous
efforts that may not have a significant impact on computing
for several more years.
Even the study of conventional computer architectures,
designs, and programming systems is challenging in its diversity. One might expect the study of these specialized systems to be a chaos of individual inventions and applications.
In fact, the highly concurrent VLSI architectures that have
been sufficiently fully developed to use here as examples
tend to fall into a consistent pattern that is disciplined by the
characteristics of VLSI technology and design practice.
Thus, following the discussion of VLSI technology and design in Section II, and before turning to examples, we discuss
briefly in Section III a structural taxonomy of these concurrent systems.
II. VLSI TECHNOLOGY AND DESIGN
Digital systems have traditionally used a variety of technologies and manufacturing processes. The engineering of
systems composed of many parts allows for a degree of specialization of the technology to the function to be performed,
be it logic, storage, or communication. The progress in
microelectronics, particularly in the last decade, has changed
this engineering situation dramatically [68]. As the level of
integration has increased, and the number of manufactured
parts required to accomplish a given function has decreased,
the engineering situation has become more uniform. Most of
the design and engineering effort for a digital system today
occurs in the design of the chips. Once the attention is inside
a chip, the technology with which one creates a system
is very tidy and consistent, governed by a much smaller
set of rules and design paradigms than when many technologies have to be considered, and with no practical
"escape" into other technologies to bail the designer out of a
tough problem.
A. VLSI Models of Computation
A consequence of this uniformity is that VLSI models of
computation are quite realistic as a means of quantifying the
consequences in silicon area, a measure of cost, and computing time, of architectural choices within a chip. When a
variety of digital technologies had to be considered, each
with its own cost, performance, and functional specialization, such modeling was much less tractable, or could be
carried out only at a coarse level. As the scale of a system or
subsystem encompassed in a single chip increases, one might
hope and expect that architectural tradeoffs will be accom-
plished in a less ad hoc fashion. VLSI is indeed a beautiful
medium for studying the structure and design of digital systems, and this fact, as much as its economic importance,
explains its appeal in the research community.
VLSI models of computation [83] have been used extensively for the complexity analysis of concurrent algorithms.
Although they are abstractions from the actual complications
in chip design, they provide a way to describe concisely
several of the essential features and problems of VLSI design
that lead one toward concurrent architectures. We require,
and hence shall use, only rather crude models here. For area
A, it will suffice to use the actual or approximate area on
silicon, an existent upper bound. The scalable parameter λ
[54] is used as the linear unit, where 2λ is the feature size of
the process, so that the area in λ² units is actually a measure
of complexity. The computing time T for an element or system is taken as the latency, the delay from input to output for
a "problem instance." Thompson suggests [83], [84] that a
system working on p problem instances concurrently should
be taken to exhibit an area A that is its total area divided by
p. For our simpler purposes we shall instead take A simply as
area, and T/p as the average interval between problem instances at the input and output. Average throughput is then
p/T, and cost/performance corresponds to AT/p.
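To make this bookkeeping concrete, here is a minimal Python sketch (the function name and the numeric values are invented for illustration) that computes throughput p/T and cost/performance AT/p for a block characterized by its area A, latency T, and the number p of problem instances in flight.

    # Minimal sketch of the area-time model described above.
    # Numeric values are illustrative assumptions, not measurements.

    def area_time_metrics(area_lambda2, latency, instances_in_flight):
        """Return (throughput, cost_performance) = (p/T, A*T/p)."""
        throughput = instances_in_flight / latency
        cost_performance = area_lambda2 * latency / instances_in_flight
        return throughput, cost_performance

    # The same logic, first working on one problem instance at a time,
    # then pipelined so that p = 3 instances are in flight:
    print(area_time_metrics(1.0e6, 30.0, 1))
    print(area_time_metrics(1.1e6, 30.0, 3))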
Concurrency may be exploited at any level of a VLSI
system design. One of the most common strategies in the
logic, organization, and architecture of VLSI systems is to
use pipelines - intermediate storage - in computation and
communication paths, in order to increase throughput even if
it is at the expense of increased latency. Fig. 1(a) is a simple
illustration of this approach, the evaluation of an expression
(in infix notation): (((A op_1 B) op_2 C) op_3 D), where the op_i are
binary operators. Although the operations must be performed
sequentially, this evaluation is assumed to be required repetitively. The boxes indicate the temporary storage for the
input and output. If this picture were a concurrent program
schema, the circles would represent processes, the boxes
their input and output queues, and the arcs the message paths.
If it were a block diagram, the circles would represent combinational functions, the boxes registers, and the arcs bundles
of wires. If it were a logic gate diagram, the circles would be
gates, the boxes clocked storage elements, and the arcs
wires. If one denotes the respective areas and times of the
operators as a_i and t_i, neglecting the boxes, the areas and times simply accumulate, so that the cost/performance is just (a_1 + a_2 + a_3)(t_1 + t_2 + t_3).

A pipelined version of this system, shown in Fig. 1(b), is
dealing with three problem instances concurrently. Here one
sees in miniature some of the opportunities and problems one
faces with concurrent computations. If the operation times
were equal, one has a system with only slightly increased
area (the boxes), the same communication plan, the same
latency time, and with p = 3, three times the throughput.
What if the ti were not equal? Clearly the throughput would
become 1/t_max. When such a system is designed in silicon,
one can generally trade off area and time within the operators, or locate the pipeline synchronization, to make the times
approximately equal. When this situation appears in programming concurrent computers, it is referred to as load
balancing, and is dealt with in ways that depend on both the
application and on the architecture.

Fig. 1. (a) Cascade evaluation. (b) Pipeline evaluation.
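The contrast between Fig. 1(a) and (b) can be checked with a few lines of Python; the stage times below are invented, and the sketch neglects the area and delay of the intermediate registers. It shows the point made above: the pipeline keeps the same latency, but its throughput is set by the slowest stage, 1/t_max.

    # Cascade (Fig. 1(a)) versus pipeline (Fig. 1(b)) for ((A op1 B) op2 C) op3 D.
    # The stage times t1, t2, t3 are invented; register overhead is neglected.

    t = [2.0, 5.0, 3.0]                      # operator delays t_1, t_2, t_3

    cascade_latency = sum(t)                 # one problem instance at a time
    cascade_throughput = 1.0 / cascade_latency

    pipeline_latency = sum(t)                # three instances in flight; a new
    pipeline_throughput = 1.0 / max(t)       # result emerges every t_max

    print("cascade :", cascade_latency, cascade_throughput)
    print("pipeline:", pipeline_latency, pipeline_throughput)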
Although there is apparent localization of communication
within and between the operators, this example is otherwise
independent of technology considerations. The case in favor
of concurrency in VLSI systems becomes more compelling when one examines the area-time performance of the
communication.
There are currently in vogue several different time models
for the delay of a wire and its driver. Each is physically valid
under a range of conditions in a particular technology. In
MOS technologies a simple amplifier with input capacitance
C_in will drive a capacitive load C_out in time T = τ_inv(C_out/C_in), where τ_inv is a characteristic of the process, essentially the delay of an inverter with unit fanout and no parasitic load. The area of the amplifier is proportional to C_in. By cascading amplifiers in an optimal [54] size ratio C_out/C_in = e, one can boost a signal from an energy corresponding to capacitance C_x to the parasitic capacitance of a wire, C_L, in a minimum time T = τ_inv log_e (C_L/C_x).

The parasitic capacitance of a wire varies with wire length ℓ according to a proportionality constant of the layer, so one can assert that communication time is T = O(log ℓ) in the worst case in which one starts from a minimum signal energy.
The numbers for a typical CMOS process of today are
τ_inv = 0.6 ns, and the ratio between the capacitance of largest wiring structures and a minimum size transistor gate is
about e^8, and scales to about e^15 in ultimate MOS technologies. Although in typical practice the compromise between driver area and delay dictates a suboptimal driver for
long wires or interchip wires, one can, if necessary, achieve
O(log ℓ) dependence of the driver delay with wire length.
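As a rough check on these numbers, the following Python sketch (a simplification that ignores the suboptimal drivers used in practice) computes the stage count and total delay of an optimally cascaded driver, T = τ_inv log_e (C_L/C_x), for the two capacitance ratios quoted above.

    # Optimally cascaded drivers: stage size ratio e, so the number of stages
    # is ln(C_L/C_x) and the total delay is tau_inv * ln(C_L/C_x).
    import math

    def cascaded_driver_delay(tau_inv_ns, capacitance_ratio):
        stages = math.log(capacitance_ratio)          # natural logarithm
        return stages, tau_inv_ns * stages

    tau_inv = 0.6                                     # ns, as quoted in the text
    for exponent in (8, 15):                          # today's ratio vs. "ultimate" MOS
        stages, delay = cascaded_driver_delay(tau_inv, math.e ** exponent)
        print(f"C_L/C_x = e^{exponent}: {stages:.0f} stages, {delay:.1f} ns")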
The resistance of a wire together with the parasitic capacitance adds a diffusive propagation term for the wire itself that
is O(ℓ²) [65], [4]. The coefficient of this term makes this
phenomenon significant in the fairly resistive silicon and
polycrystalline silicon wires today, and is an important performance consideration in memory and programmed logic
array (PLA) structures in single-level metal processes, since
silicon wires must be used in one of the two directions in the
matrix layout. In scaled technologies this problem appears
also in metal wires, due to their scaled thickness. For reasons
other than performance, such as noise immunity, one would
never allow this diffusive term to become seriously dominant. Instead, one can include active repeater amplifiers periodically along a long communication path, thus making the
delay O(ℓ), or one can convey signals long distances on
additional metal layers that are thicker.
While the dependence of delay time on wire length is quite
benign if one uses optimal driving structures, the combination of area cost and this slowly growing delay is a substantial
incentive to localization of communication. Communication
throughput, or bandwidth, in the VLSI medium is a very
meaningful measure that may be taken as a product of the
number of parallel wires (the width of the wiring track) between two parts, and the "bit" rate. In signaling schemes in
which there is no pipelining along the communication path,
there can be only one transition propagating along the path,
and the bit rate is bound by the reciprocal of the delay. If one
takes the area-time product as the objective function to be
minimized in a particular design, one sees that VLSI imposes
serious penalties for separating two parts that communicate
in this naive way.
The wire area, the signal energy, and the area of the optimal driver (about 10 percent of the wire area) are each proportional to the distance, that is, A = O(ℓ), while T = O(log ℓ). Accordingly, the aggregate penalty in cost/performance for violating the principle of locality is O(ℓ log ℓ). The penalty may be still worse for wires that are
so long that they must be forced onto the upper, thicker metal
layers, since these layers will have coarser design rules,
exclusively long wires, and hence more area per wire, and are
expected to be a very limited resource [55].
What one sees in the example of pipelining, and from the
area-time performance of communication, is that the
throughput can be increased if one can devise a way to confine the (physical) diameter of tightly coupled parts of a
system. The expression "tightly coupled" can be taken in the
synchronous design style to describe a system in which a
large proportion of the communication must traverse the entire system on each clock period, such as from the input to the
output registers in Fig. 1(a).
On the other hand, when large systems are composed in a
loosely coupled fashion, by which we mean that the parts can
operate relatively independently, that is, concurrently, and can tolerate latency in their communication with other
parts, the raw performance and excellent cost/performance
that can be achieved in small diameter systems is reflected in
large systems. This principle has been applied so thoroughly
in the examples discussed later that these systems are arbitrarily extensible in the number of concurrent computing
elements, and open-ended in performance. They can be expanded to be as large as desired with each part still operating
at the same rate as when it is incorporated into a smaller
system. This property is closely associated with the ability of
a VLSI architecture to be scaled, that is, to exploit advances
in the circuit technology.
Another conclusion of this informal analysis is that communication between concurrent computing elements may
take only certain forms. Any wide path -many parallel
wires - must be strictly localized, such as can be achieved in
a mesh-connected system of concurrent computing elements.
Wiring economics dictate that any long path be narrow, and
will necessarily exhibit a significant latency. This latency
will be due in part to pipelining in communication. Also,
messages of many bits sent between concurrent computing
elements must be serialized according to the width of the
communication path, and would accordingly exhibit a latency that is dependent on the size of the communication
"packet." These two communication paradigms -local and
wide versus distant and latent- are indeed recurrent and
competing themes in the later examples; the former representing the wavefront or systolic type of computation, in
which all communication can be made local, and the latter
the queued message routing approach to less regular
computations.
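A crude way to see the two paradigms side by side is to model message latency as a per-hop routing delay plus serialization of the packet over a narrow channel. The Python sketch below does only that; every number in it (cycle time, per-hop delay, packet and channel sizes) is an assumption chosen for illustration, not a figure from any of the machines discussed here.

    # Message latency as per-hop routing delay plus serialization of the packet
    # over a channel of limited width.  All parameters are invented examples.

    def message_latency_ns(packet_bits, channel_width_bits, hops,
                           per_hop_ns=200.0, cycle_ns=100.0):
        cycles_to_serialize = -(-packet_bits // channel_width_bits)   # ceiling
        return hops * per_hop_ns + cycles_to_serialize * cycle_ns

    # A wide, strictly local path versus a narrow path crossing many nodes:
    print(message_latency_ns(packet_bits=256, channel_width_bits=64, hops=1))
    print(message_latency_ns(packet_bits=256, channel_width_bits=8,  hops=12))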
B. Architectures That Scale
Another aspect of this increasing uniformity of digital
technology is that the future of at least the silicon technology
is believed to be well understood and, but for a few considerations, is not radically different from today's high complexity MOS technologies. Thus, one has a reasonable hope
of devising architectures with a longevity that parallels the
continued evolution of the technology, or in VLSI jargon,
"architectures that scale."
The physical consequences of feature size scaling of MOS
technologies are generally advantageous. If all physical dimensions and voltages are reduced together [54], so that
electric field strength remains constant (and no new materials
need be postulated), the electron "velocity" μE, where
μ is the mobility and E the electric field, remains constant,
and the transit time across the smaller dimension channel is
reduced in direct proportion. Capacitances per unit area
increase linearly in the scaling, but the areas decrease quadratically, so that the capacitances in a scaled circuit are
linearly smaller. One can see that at each circuit node the
relation iΔt = Δq = CΔv is satisfied in such a way that the
scaled circuit is an exact current-, voltage-, and time-scaled
replica of the original.
The energy associated with each switching event,
(1/2)CΔv², is scaling as the third power of the feature size.
If C is taken as the capacitance of the gate of a minimum
geometry transistor, this switching energy E_SW is equivalent to
the product of the power per device and device delay, with the
device switching at maximum rate. The power-delay product
is the fundamental figure of merit for switching devices, and
has a direct relation to the cost/performance of systems implemented with those devices. One notes also that the power
per device scales down quadratically while the density scales
up quadratically, so that the power per unit area need not be
increased over today's levels.
This cube law scaling of E_SW with feature size is a remarkable incentive for continued advances in MOS technologies. If one were to take an existing single chip in a 2 μm CMOS technology, say, an instruction processor about 5 mm square running at a 20 MHz clock rate, and fabricate it at one-tenth of its present feature size, it would take up only 1/100 the area, and in principle would operate from an 0.5 V source at 1/100 the power, and at a 200 MHz clock rate. This is certainly an
attractive scaling. In practice, there are a few things that
go wrong in this scaling that would have some impact on
the scaled performance, but they are only difficulties, not
disasters, for computing elements that are the complexity of
a single chip today.
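The ideal constant-field scaling rules can be collected into a small calculator; the sketch below applies them to the example chip above (the 2 W starting power is an invented figure, and the first-order rules deliberately ignore the wire-resistance and subthreshold difficulties discussed next).

    # First-order constant-field scaling, as described above; alpha is the
    # factor by which linear dimensions and voltages shrink together.

    def scale_chip(area_mm2, supply_v, clock_mhz, power_w, alpha):
        return {
            "area_mm2":  area_mm2 / alpha ** 2,   # both dimensions shrink by alpha
            "supply_v":  supply_v / alpha,        # constant electric field
            "clock_mhz": clock_mhz * alpha,       # transit times shrink by alpha
            "power_w":   power_w / alpha ** 2,    # per-device power down by alpha^2,
        }                                         # same number of devices

    # The 2 um, 5 mm square, 20 MHz processor of the text, scaled by 10.
    # (The 2 W starting power is an invented figure.)
    print(scale_chip(area_mm2=25.0, supply_v=5.0, clock_mhz=20.0, power_w=2.0, alpha=10))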
Operation at such small voltages is problematic, and due to
short channel effects, the threshold voltages and performance
of the transistors would not be quite as good as simple scaling
rules imply [31], [79]. Another effect that is significant is
that while the capacitance distributed along the wires and
transistor gates of a circuit node would be scaled to 1/10 the
previous value, the reduced cross section of the wires would
cause the resistance of the wire segments, even though they
are shorter, to increase by a factor of 10. The RC product,
which has dimensions of time, and is the coefficient in the
delay in diffusive propagation of signals on wires, unfortunately fails to scale with the other circuit delays. Also, unless
temperature is also scaled, the subthreshold conductance of
the MOS transistor scales up rather dramatically with reduced
threshold voltages, so that the dynamic storage structures that
are so widely employed in MOS technologies, such as for
dynamic RAM's, cannot be depended on to retain charge
for more than a few microseconds. One can get around each
of these scaling difficulties by variations in design style. For
example, the use of additional metal layers in layout would
be a compensation for the problem of wire resistance, and one
would expect increased use of static storage structures in
place of dynamic storage.
So, it appears that variations of the designs of today, and
of the design techniques, are feasible at least in small areas
of the chips in this futuristic technology. The design of a chip
with an area similar to today's chips, but with 100 times
higher circuit density, will depend strongly on the additional
layers of metal interconnection mentioned above. This situation is not at all unlike the wiring hierarchy employed today
with packaged chips on printed circuit boards or chips
mounted directly in ceramic carrier modules, but with the
next level of wiring absorbed into the chip. Even under the
optimistic and uniformitarian view that this future technology inherits our present design practices nearly intact,
there is no fundamental help in sight for relieving the communication limitations. Indeed, by confining the wiring to two
dimensions, we give up a physical dimension of packaging
and interconnection. Driving a signal that is equivalent to a
cross-chip wire of today has become no easier in scaling, and
because of the diffusive delays in the lowest levels of interconnect, will force some connections to higher levels. Driving the long distance interconnect on the upper metal layers
is essentially similar to driving bonding pads, package pins,
and interchip wiring today.
III. STRUCTURAL TAXONOMY
The concurrent VLSI architectures to be discussed in the
three following sections were selected as examples based on
being 1) systems for which at least prototypes exist, and
2) clearly inspired by the opportunities and consistent with
the design principles of VLSI. This selection is a family of
concurrent systems that I have elsewhere referred to as
ensembles [66], [67], and which can be discussed in terms of
a simple process model of computation. There are other computational models and concurrent architectures whose VLSI
implementations are interesting, such as data-flow and reduction machines, but these subjects would entail a survey in
themselves.
There is a commonality in the physical structure of this
family of systems that is in part a necessity of the VLSI
medium, and in part an artifact that we shall try to identify.
These examples are all regularly connected direct networks
of nominally identical concurrent computing elements. For
the following discussion and examples, we shall refer to the
computing elements as nodes, as in a computer network, and
the communication paths between them as channels.
A. Communication Plans
The communication plans of these systems are direct networks, such as the diverse selection illustrated in Fig. 2. In
some systems the communication plan is a direct mapping of
the communication requirements of an algorithm. In other
cases messages are routed to a destination node through intermediate nodes, and the choice of communication plan is a
compromise between wirability and performance. For example, a family of hypertorus networks can be represented as
a k-coordinate periodic n-dimensional cube (k-ary n-cube) that connects k^n = N nodes together such that the maximal shortest path between nodes is kn/2. In order to connect N = 2^12 nodes, one might choose k = 2^6 and n = 2, an easily wirable two-dimensional mesh with 2 × 2^12 short channels, for which kn/2 = 64; or k = 2 and n = 12, for a binary (or Boolean) 12-cube with 6 × 2^12 channels, 1/6 of which are as long as a radius of the system, for which kn/2 = 12; or some intermediate compromise. Similarly, there are many parametric variations of the binary n-cube connected m-cycle for connecting m·2^n nodes. The width of the communication path is still another engineering variable.
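The following Python sketch (an invented helper, not part of any design discussed here) tabulates these tradeoffs for the k-ary n-cube family, reproducing the two extreme cases quoted above for N = 2^12; the channel count treats each ring of k nodes as k links, or one link when k = 2, since the end-around connection then coincides with the direct one.

    # Node, channel, and diameter counts for a k-ary n-cube with N = k**n nodes.

    def kary_ncube(k, n):
        nodes = k ** n
        rings_per_dimension = nodes // k
        links_per_ring = k if k > 2 else 1    # a 2-node ring degenerates to one link
        channels = n * rings_per_dimension * links_per_ring
        diameter = n * (k // 2)               # maximal shortest path, kn/2 for even k
        return nodes, channels, diameter

    for k, n in ((64, 2), (2, 12)):           # 64-ary 2-cube mesh vs. binary 12-cube
        print(k, n, kary_ncube(k, n))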
In the VLSI engines discussed here, the nodes themselves
produce, consume, and in some cases route messages, so that
these are what are called direct networks. In systems in which
messages are routed, the node could be partitioned into two
concurrently operating sections, as illustrated in Fig. 3(a),
one section (C) to compute, and the other section (R) to route
messages. The network illustrated is a binary 3-cube, and the
channels are labeled according to the dimensions 0, 1, 2. The
message section can be reorganized by the transformation
shown into the multistage routing network of interchange
boxes shown in Fig. 3(b). This transformation of the direct
Boolean n-cube illustrates its essential similarity in structure
and message flow performance to the corresponding indirect
network, which is the same as the "flip" network used in
STARAN [3], and under a rearrangement the same as the
Omega network [44] or the banyan [24]. (See Siegel [70] for
an insightful survey.) The absence of indirect networks in
current experiments with concurrent VLSI systems is probably partly an artifact of the VLSI "lore" that switching networks do not scale well, which is certainly true of some of
them, such as crossbar switches. Message switches are not
eliminated by direct networks, but rather are partly concealed
in this fully distributed form.
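For concreteness, here is a minimal sketch of dimension-order routing in a direct binary n-cube, in which the exclusive-or of the current and destination addresses names the channels (dimensions) still to be crossed. It is a generic illustration under that assumption, not the routing logic of any particular machine described here.

    # Dimension-order routing in a direct binary n-cube: the exclusive-or of the
    # current and destination addresses names the dimensions still to be crossed.

    def route(source, destination, n):
        assert 0 <= source < 2 ** n and 0 <= destination < 2 ** n
        path, node = [source], source
        while node != destination:
            differing = node ^ destination
            dimension = (differing & -differing).bit_length() - 1   # lowest set bit
            node ^= 1 << dimension                                  # cross that channel
            path.append(node)
        return path

    # A route in the binary 3-cube of Fig. 3(a), shown as bit strings:
    print([format(v, "03b") for v in route(0b000, 0b101, n=3)])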
Fig. 2. Typical direct networks: ring, tree, mesh, hypercube (binary n-cube), shuffle-exchange, and cube-connected cycles.

Fig. 3. (a) Direct binary 3-cube, and a transformation from the direct connection to interchange boxes. (b) Indirect binary 3-cube of interchange boxes.
B. Homogeneity
Another common characteristic of these experiments with
concurrent VLSI architectures is that the nodes are nominally
identical. We accordingly refer to these machines as
homogeneous [76], meaning that they are of uniform structure. Heterogeneous systems would allow nodes to be specialized for different functions, much as are different
computers on a network, or the functional elements in high
performance computers. For these early experiments, however, homogeneous machines are certainly logistically
simpler to design, test, assemble, and maintain. Homogeneity in programmable machines simplifies the software
by giving all parts the same capabilities. Homogeneous
machines also conform very well to the design flow of VLSI
chips, in which repetition and regularity simplify the layout process, and to the economies of relatively larger fabrication runs of a smaller set of chip types.
C. Node Complexity
The choice of connection network establishes one dimension of variation in a taxonomy of this family of concurrent
machines. Two other interesting and discriminative dimensions are the complexity of the nodes and the number of nodes
reasonably contemplated for a given machine. A taxonomy in
these two dimensions is shown in Fig. 4.
The node complexity, also referred to as the "grain size" of
the system, appears on the horizontal axis in Fig. 4 in λ² area
units. The complexity of today's single chip, marked by a "*"
on this axis, is an interesting point that separates systems of
many nodes per chip from those of many chips per node.
Today's "commodity" chips routinely reach 5 mm on a side
at 2.5 μm feature size, or 4000 λ on a side, which translates
to 16 Mλ², while advanced commercial chips are in the
50-100 Mλ² range. Of course, these measures have been
doubling approximately every two years over the recent past
[57].
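The λ² bookkeeping is simple enough to verify in a few lines; the sketch below converts the chip dimensions quoted above into λ² units, using the convention that λ is half the feature size.

    # Converting chip dimensions to lambda^2 complexity units (lambda is half
    # the feature size), using the figures quoted in the text.

    def area_in_lambda2(side_mm, feature_size_um):
        lam_um = feature_size_um / 2.0
        side_lambda = side_mm * 1000.0 / lam_um
        return side_lambda ** 2

    # A 5 mm square "commodity" chip at 2.5 um: 4000 lambda on a side, 16 Mlambda^2.
    print(area_in_lambda2(side_mm=5.0, feature_size_um=2.5))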
Fig. 4. Taxonomy of concurrent VLSI systems. (Log-log plot of the number of nodes versus node complexity in λ² units, with zones for logic-enhanced memories, computational arrays, microcomputer arrays, RAM storage systems, and conventional computers.)

The two extreme zones indicated in Fig. 4, storage systems composed of random-access memory (RAM) chips, and conventional single processor computers, are included for comparison with the three middle zones. For this comparison, we take the RAM cell and whole computer as the "node." The three middle zones represent concurrent VLSI engines whose design, engineering, and applications have been studied in some depth, and which appear to be reasonably distinct classes. The examples in the three following sections are respectively what are labeled as logic-enhanced memories, computational arrays, and microcomputer arrays.

Let us here briefly traverse all five zones, left to right, to describe some of the characteristics of each of these classes. The RAM systems are composed almost exclusively of high complexity chips, and provide a way to gauge the cost of a system if it were so highly integrated and produced in such large quantities. The basic repeated cell for storing one bit in a RAM varies from about 100 λ² for the densest one-transistor dynamic RAM chips to about 400 λ² for high performance static RAM's. Multichip assemblies of many identical RAM chips, the larger capacity systems being typically denser and slower, are accordingly shown as an elongated and slightly slanted zone of expected variation. The selling price of mainframe add-in storage based on 64K dynamic RAM chips is currently somewhat less than $20 per RAM chip, packaged and powered. These 64K RAM chips are about 10 Mλ². The resulting estimate of $2/Mλ² is too low for chips produced in small quantities, so we will use a more conservative estimate of $5/Mλ² for today's technology, with the understanding that this measure scales down with improvements in the circuit and packaging technologies. Hyperbolas of constant cost, the product of the cost per node and number of nodes, appear in the log-log plot of Fig. 4 as straight lines labeled with a cost that applies to the highly integrated implementations that one hopes to achieve with VLSI.

Logic-enhanced memories, also called "smart memories," are very fine grain systems in which each node contains a few to a few hundred bits of storage associated with logic that can operate on the storage contents and communicate with other nodes. Useful systems would include thousands, even millions, of these nodes. The ability to mix logic and storage economically at a fine grain in a single technology, which was not so attractive in earlier digital technologies [77], is part of what makes these architectures unique to VLSI. Typical applications of these systems are computations with images, such as scan conversion, correlation, and path finding; or database operations such as sorting, association, and property inheritance. There is no real theory or computational model behind the design of these systems. They tend to be in the nature of specialized individual inventions.

The next zone represents systems for highly concurrent numerical computations, in which the nodes are capable of operations such as multiplication and addition, and are connected in regular patterns that match the flow of data in the computation. These computational arrays, also called systolic arrays [38], [41] because of the rhythmic pumping of data in pipelines, can be implemented in a variety of forms. The range of node complexities shown in Fig. 4 assumes that the sequencing of operations is either built into the nodes, or that the node responds to control signals that are broadcast into the array in the style of single instruction multiple data (SIMD) machines. However, the systolic algorithms designed for such machines are also highly efficient concurrent formulations for microcomputer arrays. Thus, the computational or systolic array is both an architecture and a computational model, and has stimulated a broad research effort in the design of concurrent algorithms for applications such as signal processing, matrix and graph computations, and sorting.

It requires only several Mλ², not even a full chip, to implement a minimal instruction processor and a small amount of storage for program and data. A single chip today is sufficient [48] for a processor with a rich instruction set and several thousand bytes of storage. These highly integrated computers exhibit excellent cost/performance, but the performance and storage comes in fairly small units. Thus, it seems inevitable that people are learning to team up myriads of these computers that fit in units approximating one per chip, or many per wafer, to attack demanding computational problems. Each microcomputer is fitted with a number of communication ports, and the array of nodes is connected in a direct network that is, as usual, dictated either by the application or by message routing and flow performance considerations. Whether one can multiply the computing performance in concurrent execution by the number of nodes, or nearly so, is very much dependent on the problem. All of the concurrent formulations for finer grain machines can be mapped very efficiently onto microcomputer arrays. In addition, machines of this class appear to be capable of performing many of the same scientific and engineering computations for which people today use high-performance vector computers.

The last zone, consisting of conventional single- or several-processor computers, exists in a broad range from the single-chip computer to high-speed supercomputers.

IV. LOGIC-ENHANCED MEMORIES

We turn now to two interesting examples of specialized machines that are paradigms of the genus of concurrent VLSI architectures that mix logic and storage at a fine grain, variously called logic-enhanced memories or "smart memories."
It should be noted that these systems achieve rather remarkable performance per cost, in comparison to the same computation on a general purpose sequential computer, by
1) specialization of the system to the algorithms, 2) concurrent operation of an appreciable subset of the nodes in the
system, 3) localization of communication between the stored
data and the logic in the node (an unlimited storage bandwidth), and 4) localization of communication in the connection plan between nodes.
A. Pathfinder
The Pathfinder chip and system was designed to perform
the computationally expensive part of two-layer wire routing,
such as is employed in printed circuit board design, by an
adaptation of the Lee-Moore algorithm. It was one of the
first VLSI "smart memory" systems conceived. The original
idea for this project was suggested by I. Sutherland in a 1976
internal Caltech memorandum titled "A better mousetrap,"
and a fully developed system was designed and carried
through to a small prototype [13] by C. R. Carroll.
The Lee-Moore algorithm [56], [45] finds the shortest
path(s) between two points in a rectangular grid, in which
path segments are allowed to run only horizontally or vertically, and in which points on the grid may be blocked. The
blocked points correspond to wiring area already used, either
previously routed wires, component pads, or the edge of the
circuit board. Each grid point has a state, either blocked,
unoccupied, or, in the original form of the algorithm, an
integer representing the distance of this point from a starting
point. In practical adaptations of the Lee-Moore algorithm to
circuit board routing [71], distance can be generalized to
cost.
The propagation phase of the algorithm starts with all
unblocked points unoccupied, and schedules the neighbors of
the starting point to be assigned a cost. Any unblocked neighbor then schedules its neighbors, and so on, so that the propagation phase terminates with all reachable points assigned
costs. This information then allows the retrace phase to trace
a minimal cost path back from the finishing point, or from
any reachable point, back to the starting point.
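A minimal sequential sketch of the two phases just described is given below, with unit step costs and an invented blocking pattern; in a cellular machine each wavefront step would of course be performed concurrently across all grid points rather than from a queue.

    # Lee-Moore propagation and retrace on a small grid: blocked points are True,
    # costs are unit steps, and neighbors are the four horizontal/vertical ones.
    from collections import deque

    def lee_moore(blocked, start, finish):
        rows, cols = len(blocked), len(blocked[0])
        dist = [[None] * cols for _ in range(rows)]
        dist[start[0]][start[1]] = 0
        frontier = deque([start])
        while frontier:                                   # propagation phase
            r, c = frontier.popleft()
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if 0 <= nr < rows and 0 <= nc < cols \
                        and not blocked[nr][nc] and dist[nr][nc] is None:
                    dist[nr][nc] = dist[r][c] + 1
                    frontier.append((nr, nc))
        if dist[finish[0]][finish[1]] is None:
            return None                                   # no path exists
        path, (r, c) = [finish], finish                   # retrace phase
        while (r, c) != start:
            r, c = next((nr, nc)
                        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1))
                        if 0 <= nr < rows and 0 <= nc < cols
                        and dist[nr][nc] == dist[r][c] - 1)
            path.append((r, c))
        return list(reversed(path))

    blocked = [[False, False, False, False],
               [False, True,  True,  False],
               [False, False, True,  False],
               [False, False, False, False]]
    print(lee_moore(blocked, start=(0, 0), finish=(3, 3)))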
This algorithm is a simple and general maze-solving
scheme that will find a minimal cost path if a path exists. The
propagation phase is the costly part of the algorithm, since for
a route involving ℓ steps, as many as ℓ² points may be examined. The retrace phase is only linear in ℓ. The Lee-Moore
algorithm is seldom used today in routing programs, but
routing approaches that examine channel areas rather than
grid points to determine the sequence of channels through
which a wire is routed are descendants of the Lee-Moore
approach.
The propagation phase is "a natural" for a mesh-connected
cellular machine. The operation on the state of a node in
response to signals from adjacent nodes can be carried out in
a time measured in logic stage delays rather than instruction
times. The computations required along the entire wavefront
can be performed concurrently, thus reducing the time complexity from O(ℓ²) to O(ℓ).
In a preliminary one-sided router or maze solver called the
Mazer chip, Carroll experimented with the technique that
was key to mapping this algorithm onto a cellular array on
silicon. Instead of encoding a cost at each grid point, he
followed Sutherland's suggestion of encoding a direction,
with cost being implicit in the speed of local propagation.
The Mazer chip encodes the state of a grid point with a
"blocked" bit together with 4 bits represented in Fig. 5 as
arrows pointing right, up, left, or down. If the 4 bits are all
0's, the cell is unoccupied. The action of a cell, shown in
circuit form in Fig. 6, is visualized as a four-way mousetrap,
in which tripping any arrow in a cell then trips the corresponding arrow in all adjacent unoccupied cells. At the end
of this propagation the direction arrow(s) point from all
reachable grid points back toward the starting point. It is
possible for two of the arrow bits to become set, representing
a bifurcation in the set of minimal paths.
This scheme works either synchronously or asynchronously; Fig. 6 applies to either approach, with the storage elements being respectively clocked set-reset flip-flops
or latches. Carroll chose the asynchronous approach, which
has advantages of speed, no clock to distribute, and a simpler
node, and a disadvantage of a difficulty of assuring uniform
propagation across chip boundaries. Of course, the Mazer
array is also a RAM that allows the grid point states to be
examined and modified, both to set the blocking pattern and
to accomplish the retrace phase.
The asynchronous approach also allowed the Lee-Moore
algorithms to be adapted to the heuristics that practical routers use to cope with the actual complications of routing many
wires successively on multiple layers, and keeping previously routed wires out of the way of subsequent wires.
These adaptations typically involve weighting the edges of
the grid to make certain areas or directions more or less costly
than others. Thus, one can perturb the wavefronts computed
to find the least costly rather than the geometrically shortest
path. The usual heuristic for routing two-layer printed circuit
boards is to create a bias in favor of vertical wires on one side,
horizontal wires on the other side, and to make interlayer
"vias" fairly expensive and possible only in certain preassigned locations.
Carroll's Pathfinder chip implemented this adaptation of
the Lee-Moore algorithm on a two-layer grid by mapping the
weights directly into delays. The circuitry used in the basic
cell is ingenious, with the delays in the preferred and unpreferred directions in the two layers, and between layers, set
by external control voltages. Thus, the pathfinder is a hybrid
analog and digital computing network. The interested reader
is encouraged to study Carroll's paper [13].
The pathfinder chip implemented in 1980 in 5 μm nMOS
technology was a 4 by 8 array of 32 nodes, each about
50 kλ². A more modern process would allow on a single chip
a 32 by 32 array of 1024 nodes with addressing logic and pad
frame. A useful routing area of, say, 512 by 512 grid points
would be a 16 by 16 mesh-connected system of 256 chips.
Because of the number of pins required per chip - 256 just
to deal with the periphery - a Pathfinder system would benefit greatly from the use of advanced hybrid packaging, or
wafer scale integration if a suitable redundancy scheme could
be devised. By virtue of its specialized organization, such a
system attached to a modest host computer is estimated to
route a printed circuit board more than 100 times faster than
a high performance mainframe computer, which, one should note, is a much more expensive although more readily available computing instrument.

Fig. 5. Mousetrap version of the Lee-Moore algorithm.
Fig. 6. Logic diagram of the Mazer node (without RAM access part).

The analysis and synthesis of images are notoriously demanding computations on sequential computers, but susceptible to decomposition on a pixel-by-pixel basis if the demanding part of the computation can be formulated entirely in terms of local operations. Carroll's pathfinder is a particularly clear example of such a formulation, and the differences between the Mazer and Pathfinder chips illustrate the idiosyncrasies that so often separate the straightforward implementation of a simple algorithm from the complications of useful applications.

B. The Connection Machine

The connection machine, or connection memory, is an innovative system for concurrent operations on a knowledge base represented as a semantic network. This system [29] was conceived by D. Hillis, working with several other researchers at the M.I.T. Artificial Intelligence Laboratory. Machines with on the order of 100 000 nodes are being developed both at the M.I.T. AI Laboratory and by Thinking Machines Corporation.

A semantic network [26], such as the example of Fig. 7, is a directed graph in which the vertices represent objects such as sets, situations, physical objects, persons, etc., and the arcs binary relations such as subset inclusion (represented by "⊆"), set membership (represented by "∈"), and other relations as may be required by the knowledge to be represented. More complex relationships can be represented by vertices such as the "occurs-in" vertex in Fig. 7, which represents that the harvest of Granny Smith apples occurs in September.

Fig. 7. Fragment of a semantic network.
Networks such as these can be represented in and manipulated by sequential computers, but if the network is large, the
speed of operations such as deduction, matching, sorting,
and searching leaves much to be desired. Even if the processor were very fast, the single access point to random
access memory fundamentally limits the performance. Thus,
the connection machine, like other smart memories, moves
the processing into the memory. Many of the operations performed on semantic networks can be performed concurrently,
so that both the storage and processing capability are in the
best case proportional to the size of the machine.
Each node of the connection machine, illustrated in block
diagram form in Fig. 8, consists conceptually of a few registers, an arithmetic-logical unit (ALU), a message buffer, and
a finite-state machine. The next state and output functions of
the finite-state machine are referred to as a rule table, and all
the rule tables in a system are identical. A node reacts to an
incoming message according to the message type and the
internal state, and performs a sequence of steps that may
involve arithmetic or storage operations on the message contents and the contents of the registers, sending new messages,
and changing its internal state.
The registers and state may also be accessed as ordinary
random access memory in the address space of a host computer. The rule tables are also RAM, and are presumably
loaded by the same mechanism. In the designs currently
being developed, a single chip contains 32 or 64 nodes, and
a single rule table, ALU, and set of off-chip communication
paths is shared by all the nodes in a single chip. The chips can
be connected in a variety of communication plans, discussed
below. Although in this implementation the chip resembles a
node in a microcomputer array, the connection machine is
probably better thought of as a "smart memory," in that its
operation and programming is based on the model of the simple nodes of a few registers each.

Fig. 8. Conceptual block diagram of a connection machine node.
A semantic network is stored in the connection machine as
a pattern of virtual connections between cells. This representational approach is similar to LISP, where data are stored as
structures of pointers. The registers in the connection machine nodes principally hold pointers to other nodes. The
virtual connections map onto the physical wires by the routing of messages through intermediate nodes according to a
destination address in the message. It is desirable for performance, but unnecessary to logically correct operation,
that nodes that communicate frequently be physically close.
The low-level data structures and programming of the M.I.T.
version of the connection machine, including the dynamic
allocation of nodes, is very well described in the recent M.S.
thesis by D. Christman [14].
One of the problems of representing a semantic network is
that its vertices may be of arbitrary degree, but the number of
pointers that can be stored in a node is limited. Thus, each
vertex is represented in the connection machine as a balanced
binary tree, as shown in Fig. 9. Each node then needs only
three virtual connections. The depth of the tree is the log base
2 of the degree of the vertex, and with the root connection
reserved, the number of nodes required to represent a vertex
of degree n is n - 1. A bit in each node indicates whether the
tree below it is left- or right-heavy, so that operations that add
connections can leave it balanced, a well-known algorithm.
Each arc of the semantic network is also represented by a
node that connects the two related vertices and the arc type.
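The fan-out arithmetic is easy to check. In the sketch below (the function name is invented), each cell spends one connection toward the root and at most two away from it, and an arc attaches directly without a cell of its own; the cell count then comes out to n - 1 for a vertex of degree n, with depth about log_2 n.

    # Fan-out tree bookkeeping: each cell spends one virtual connection toward
    # the root and at most two away from it; arcs attach directly to free slots.
    import math

    def fanout_cells(degree):
        if degree <= 1:
            return 0                      # a single arc needs no extra cell
        half = degree // 2
        return 1 + fanout_cells(half) + fanout_cells(degree - half)

    for n in (2, 3, 8, 100):
        print(f"degree {n:3d}: {fanout_cells(n):3d} cells "
              f"(n - 1 = {n - 1}), depth about {math.ceil(math.log2(n))}")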
Space permits only a glimpse of what a connection machine can do, so we will offer only a short example of relational operations on sets. The connection memory also has
facilities for representing and manipulating functions that
are analogous to, but more complex than, those for relations,
and also facilities for performing arithmetic. The interested
reader is referred to Hillis' memo and Christman's thesis.
Set operations are performed in the connection memory by
representing membership in sets as individual bits in the state
register. A particular bit in each of the N nodes of a connection memory is an N-bit set register. Clearly all of the
standard set operations, complement, intersection, union,
and difference, can be accomplished in parallel from an instruction broadcast from the host, and without communication between nodes.
Fig. 9. Representation of a semantic network in the connection machine.

Sets can be mapped to other sets according to relations of a particular type, an operation requiring propagation of messages. For example,
B := APPLY-RELATION(color-of, {Granny Smith})
applies the relation "color-of" to the singleton set {Granny
Smith}, and loads set register B with the set {green}. Internally this computation is accomplished by all nodes in the
argument set sending messages that propagate only through
nodes representing the "color-of" relation, in order to arrive
at nodes that set the bit corresponding to set register B.
One could have represented the set "apples" as red, and
allowed this property to be inherited by any variety of apples
in which the property is not explicitly specified. The varieties
one might list would then be related to "apples" by a
"virtual-copy" relation. Here the process of computing the
set of red things is
A := APPLY-REVERSE-RELATION(color-of, {red})
B := COMPLEMENT({red})
B := APPLY-REVERSE-RELATION(color-of, B)
B := COMPLEMENT(B)
C := APPLY-REVERSE-RELATION-CLOSURE(virtual-copy, A, B)
In the first step, A is loaded with the set of all explicitly red
things, e.g., {apples, cherries, fire trucks, etc.}. In the next
two steps, B is loaded with the set of all things that have a
"color-of" relation, but not to red, and so will include, for
example, Granny Smith apples. B is then complemented, so
as to include all red or possibly red things, and to exclude all
such things that have a color other than red. C is then computed as the closure of the reversed virtual-copy relation
starting with all red things (including apples) in the universe
B, which specifically excludes inheritance of this property
when a nonred color is already specified (including for
Granny Smith apples). The closure computation requires that
messages continue to propagate through all virtual-copy
arcs until all such messages are absorbed in vertices that are
already in set C.
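A sequential model of these operations may help fix their meaning. In the Python sketch below, sets are ordinary Python sets rather than bits distributed over nodes, the relation is a list of (source, relation, target) arcs, and the function names and the "McIntosh" entry are invented to mirror the operations used in the example above.

    # Sequential model of the set and relation operations used in the example.
    # 'arcs' is a list of (source, relation, target) triples.

    def apply_reverse_relation(arcs, relation, targets):
        """All x such that (x, relation, y) holds for some y in targets."""
        return {x for (x, r, y) in arcs if r == relation and y in targets}

    def apply_reverse_relation_closure(arcs, relation, seed, universe):
        """Close 'seed' under the reversed relation, staying inside 'universe'."""
        result, frontier = set(seed), set(seed)
        while frontier:
            frontier = (apply_reverse_relation(arcs, relation, frontier)
                        & universe) - result
            result |= frontier
        return result

    arcs = [("apples", "color-of", "red"),
            ("fire trucks", "color-of", "red"),
            ("Granny Smith", "color-of", "green"),
            ("Granny Smith", "virtual-copy", "apples"),
            ("McIntosh", "virtual-copy", "apples")]
    things = {x for (x, _, _) in arcs} | {y for (_, _, y) in arcs}

    A = apply_reverse_relation(arcs, "color-of", {"red"})
    B = things - apply_reverse_relation(arcs, "color-of", things - {"red"})
    C = apply_reverse_relation_closure(arcs, "virtual-copy", A, B)
    print(sorted(C))   # red things; Granny Smith is excluded by its explicit color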
In discussing "how to connect a million processors," Hillis
notes that "the most difficult technical problem in construct-
ing a connection memory is the communications network."
This problem did not appear to be very serious in the Pathfinder because communication could there be confined to
neighboring nodes on a mesh. VLSI engines designed for
problems that lack this crystalline regularity depend on virtual rather than wired connections, and achieve a useful
separation of concerns between the algorithms and the engineering wirability issues. The designers are free to choose
the highest performance connection plan that fits current
levels of integration and packaging technology. The choice
made for the connection machine is the family of hypertorus
structures described in Section III, although the designers of
the connection memory discovered this result by a geometric
construction with (literally) a twist, shown in Fig. 10.
C. Other Logic-Enhanced Memories
The Pathfinder and the Connection Machine are representative of differences in node size, wired and virtual
connections, and the two applications of logic-enhanced
memories that have been most widely studied, image computations and storage operations.
Another interesting example of a smart memory for image
computations is pixel-planes [23], a system being developed
by H. Fuchs and J. Poulton of the University of North Carolina at Chapel Hill, and A. Paeth and A. Bell of Xerox Palo
Alto Research Center. Pixel-planes enhances a raster memory with logic that performs scan conversion of abstract
polygonal objects, computes shading, and provides a depth
priority scheme for displaying only visible surfaces.
A VLSI tactile sensing array [81] that also performs discrete two-dimensional convolutions was developed by J.
Tanner of Caltech and M. Raibert and R. Eskenazi of Caltech
JPL. Convolution is a simple, regular, and local computation
on a mesh-connected array. Each node of this chip and system
contains two processors, so that if the primary processor
includes a defect, a spare processor can be selected in its
place.
Where this tactile sensing array incorporates pressure
transduction with the storage and computation in each node,
a number of other projects have incorporated optical sensing
into smart memory arrays. R. F. Lyon thus refers to his optical mouse chip [50] as a "smart digital sensor." The optical
mouse not only senses an image, but correlates it with the
previous image in order to sense and report motion. Lyon's
effort has since inspired several refinements and other experiments. J. Tanner and C. Mead developed a correlating optical
motion detector [82] that performs the correlation by a hybrid of analog and digital techniques. G. Bishop and H.
Fuchs are developing a system they call the self-tracker [5],
intended to locate its own position in an environment such as
a room.
In the area of smart memories for storage operations, one
might expect that VLSI would make associative memories
feasible and attractive. However, the chip area required for
simple associative matching does not compare well with
more conventional hashing into word addressed memory.
System designs in which association is more complex and
requires computation, such as finding the closest match, or
when association is combined with concurrent operations on marked items, do start to exploit well the paradigm of mixing logic and memory at a small grain size. The non-Von being developed at Columbia University is a tree-structured medium grain single instruction multiple data (SIMD) machine aimed at such database operations.

Fig. 10. Twisted cube. While the maximal distance between vertices in a 3-cube is 3, in this twisted version it is 2.
VLSI sorting memories have been studied extensively. The
simplest types that sort in linear time (in a time equal to that
required to load the data) on a linear exchange network have
been implemented in great variety as student projects in VLSI
design classes. C. Thompson's survey of 13 algorithms for
sorting [85], and analysis of their complexity in terms of
VLSI models of computation, is highly recommended.
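As a rough illustration of such a linear-time sorting memory, the following sketch simulates, one key per time step, a linear array of cells in which each cell retains the smaller of the key it holds and the key passing through. The cell organization is illustrative only: in a hardware array all cells perform their comparison-exchanges concurrently, so the data are sorted by the time they have been loaded, whereas this sequential simulation performs the same exchanges one at a time.

# A minimal sketch of a linear sorting array, simulated sequentially.
def systolic_sort(keys):
    cells = []                      # cells[i] holds the current key of cell i
    for k in keys:                  # one key enters the array per time step
        for i in range(len(cells)):
            if k < cells[i]:        # keep the smaller key, pass the larger one on
                k, cells[i] = cells[i], k
        cells.append(k)
    return cells

print(systolic_sort([5, 2, 7, 1, 3]))   # [1, 2, 3, 5, 7]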
V. COMPUTATIONAL ARRAYS
The two examples of logic-enhanced memories illustrate
two different approaches to communication in these fine
grain systems. The pathfinder nodes were connected in a
mesh that conforms exactly to the communication requirements of the propagation phase of the Lee-Moore algorithm.
The hypertorus communication plan of the connection
memory was selected based on message flow performance in
a system in which messages are routed, since it is not possible
to predetermine a particular communication topology. These
divergent communication paradigms will reappear in the discussion of microcomputer arrays; however, the examples of
computational arrays discussed here adhere to a model of
regular pipeline computation in which the communication
plan of a real or abstract machine exactly matches the data
dependencies of the algorithm, and in the ideal case [37] a
node need "never wait" for operands.
The virtual connection concept is a valid alternative for
machines of this grain size. There is such a close relationship
in the way in which data are processed in streams between
computational arrays and data-flow machines that computational arrays might be considered the "wired connection"
counterpart of the "routed connection" data-flow machine. If
one uses a wavefront model of computation [17], [42], [32],
[86], one can describe the operation of a computational array
in data-flow terms. However, we shall omit here any discussion of VLSI implementations of data-flow machines.
Research in VLSI computational or systolic arrays was
pioneered by H. T. Kung and his students and collaborators
at Carnegie-Mellon University. Several early publications, notably Section 8.3 of the Mead-Conway text [54],
"Algorithms for VLSI processor arrays," contributed by
H. T. Kung and C. Leiserson, C. Thompson's Ph.D. dissertation on VLSI models of computation [83], and H. T.
Kung's papers on the structure of parallel algorithms and on
systolic architectures [38], [40], [41], stimulated what is today a much broader and deeper effort in the design and complexity analysis of algorithms for VLSI systems than we could hope to do justice to here. We offer instead several examples that illustrate some of the range and style of this architecture and computational model. Existent versions of such systems are commercially available as LSI and VLSI chips and chip sets aimed at signal processing applications, and more general systolic systems have been and are being assembled as testbeds [6], [88], [8], [80], [21], [75].

A. Digital Signal Processing

Digital signal processing lends itself extremely easily to iterative computational arrays. It is typical of these applications that one can use fairly short number representations, say, 8-24 bit integers or floating point mantissas, although even less precise representations of sampled data are feasible for certain applications such as synthetic aperture radar. It is a frequent goal that the signal be processed at the same rate at which it is sampled.

The finite impulse response (FIR) filter is a good introductory example of a simple signal processor. From samples x_0, x_1, x_2, ..., the FIR filter is to produce an output y_0, y_1, y_2, ..., given by

    y_n = Σ_{i=1}^{N} a_i x_{n-i} .

Thus, the input X = 1, 0, 0, ... produces an output Y = 0, a_1, a_2, ..., a_N, 0, 0, ..., an impulse response of length N. Typical values of N in practice range from about 10 to 50. This is a formulation of the filter that allows one sample between the input and its first influence on the output, as required as a minimum for clocking the input and output registers. One is generally interested only in the throughput of such a system, not in the latency.

A pre-VLSI architecture for a filter of this type would typically consist of a single fast multiply-add element (that could be pipelined) and a small memory to hold the N - 1 previous samples and the N coefficients. The expression would be evaluated in N steps for each sample, with some cunning strategy for rewriting each of the old samples down one position in the memory. Although this approach seemingly requires a minimum number of parts, it demands that the multiply-add step be N times faster than the sample rate. A design for a given sample rate could be expected to serve only up to a certain value of N before one resorts to compound filters or to concurrency in the evaluation of the expression.

A VLSI computational array for this filter can take a variety of forms [16], but since regularity and modularity are particularly desired in VLSI implementations, one might start with the network illustrated in Fig. 11(a). Although it is both modular and correct, the cascade summation violates the principle that the computing rate be independent of the size of the array. Assuming that one is indifferent about the delay imposed by the filter, one might then decide to pipeline the summation network. The pipelined summation tree version shown in Fig. 11(b) achieves this objective, but at some loss of regularity in what started as a strictly linear structure. In another approach, shown in Fig. 11(c), one reverses the direction of the summation network, and reorganizes the pipelining. It may not be immediately obvious that this network is the same filter as that of Fig. 11(a), and that it contains an equivalent defect. Here the input X must be broadcast to all stages, an operation that could not be accomplished in a single computing cycle for large N. Finally, one might insert pipeline stages in the filter of Fig. 11(a) to obtain the design shown in Fig. 11(d), a strictly linear and arbitrarily extensible array whose extensive pipelining imposes N samples of delay in excess of the original formulation.

Fig. 11. (a) FIR filter #1. (b) FIR filter #2. (c) FIR filter #3. (d) FIR filter #4.

The FIR filter is a convolution of the coefficients and input samples. One can generalize this example to convolution, in which some combination of the a_i, inputs, and outputs move in pipelines in the course of the computation. A variety of what H. T. Kung calls "semisystolic" [Fig. 11(a) and (c)] and "pure systolic" [Fig. 11(b) and (d)] versions have been devised for convolution [41]. Since the discrete Fourier transform (DFT) is a convolution, it has the same formulations, although one would generally prefer the efficiency of the fast Fourier transform (FFT) algorithms discussed below. Recursive digital filters can be structured as a linear array with feedback paths, a semisystolic structure at best, but can be converted to two-way pipelines [40].

As one might expect from the polynomial formulation of the FIR filter, multiplication and division of polynomials can be accomplished by arrays that are essentially similar to filters [16]. If the arithmetic is done over finite fields, such networks are attractive architectures for coding algorithms, such as Reed-Solomon encoders [46] and decoders [87].

B. Fast Fourier Transforms

Concurrent versions of the fast Fourier transform (FFT) algorithm [61] have been studied and used for many years, and form an interesting contrast in their sequential, computational array, and microcomputer array implementations.

The FFT is often represented in the flow graph form shown in Fig. 12(a), the vertices representing operations and the arcs the data dependencies. The data items and roots of unity (the W terms) are complex numbers. The basic operation at the pair of vertices at the output side of each "butterfly," shown in Fig. 12(b), involves four multiplications and six additions, with the possibility of several operations taking place concurrently. We shall treat the butterfly as indivisible.

Fig. 12. (a) FFT flow graph. (b) "Butterfly" computation.

The traditional sequential approach to this computation, whether in a program, the microcode of a peripheral array processor, or special purpose hardware, involves applying the butterfly computation to data stored in memory in an order allowed by the flow graph. Each such application requires six memory reads and four writes, and it may in some cases be advantageous to partition the memory [15]. Predictably, the computational array approach is to use many butterfly nodes. There are, however, many ways to trade area and time. The network shown in Fig. 13 is constructed directly by replacing each butterfly structure in the flow graph with a butterfly computing node. It is assumed that each node includes its own pipeline register(s).

This construction was carried out with one permutation in location in the second column in order to achieve a connection structure that is the same between each stage. That this could be done is not a surprise to readers familiar with the
shuffle network [73], [74], a uniform single stage of an indirect binary n-cube. The rearrangement of signals between
stages takes its name from a perfect shuffle of a deck of cards.
This shuffle includes a uniform distribution of wire lengths
from the width to half the height of the wiring channel between stages, so this network is only "semisystolic." Thompson has shown [83] that to lay out an n-point shuffle on a chip,
O(n²/log² n) area is required. For sufficiently large n, the
shuffle wiring would be larger than the n associated nodes,
but this is a much larger value of n than one contemplates for
a single chip.
A consequence of the identical structure of each stage, and
of the recursive structure of the shuffle, is that this computational array can be folded to trade off between time and
area. In the form shown in Fig. 13, the network processes n
points each time unit, but requires (n/2) log2 n butterfly
nodes. An iterative form with a single stage would require
log2 n time with area n/2. The network can also be folded
vertically with the insertion of additional registers to roughly
halve the area and double the time on each folding.
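The following sketch may help relate the flow graph of Fig. 12 to the column-per-stage array of Fig. 13. It is an ordinary iterative radix-2 decimation-in-time FFT, written so that each pass of the outer loop corresponds to one column of n/2 butterfly nodes, with the index arithmetic playing the role of the interstage wiring. It is a sequential model of the data dependencies under these assumptions, not a description of any particular array.

import cmath

def fft(x):
    # Iterative radix-2 decimation-in-time FFT.  Each pass of the outer while
    # loop corresponds to one column of n/2 butterfly nodes in an array such
    # as Fig. 13; the index arithmetic stands in for the interstage wiring.
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    bits = n.bit_length() - 1
    a = [x[int(format(i, "0%db" % bits)[::-1], 2)] for i in range(n)]  # bit-reversed load
    span = 1
    while span < n:
        w_span = cmath.exp(-2j * cmath.pi / (2 * span))  # principal root of unity for this column
        for start in range(0, n, 2 * span):
            w = 1 + 0j
            for k in range(span):
                u = a[start + k]
                v = w * a[start + k + span]   # complex multiply: 4 real multiplications, 2 additions
                a[start + k] = u + v          # plus 4 more additions: the indivisible butterfly
                a[start + k + span] = u - v
                w *= w_span
        span *= 2
    return a

print(fft([1, 0, 0, 0, 0, 0, 0, 0]))   # an impulse transforms to eight ones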
Fig. 13. A computational array for the FFT using a shuffle between each stage.

C. Arithmetic

The ability to trade time and space in a parameterization of an algorithm leaves a great deal of flexibility in the lower levels of design, particularly in the arithmetic. Many of the efforts in signal processing using computational arrays also employ pipelined serial arithmetic, a clearly synergistic combination [12], [51], [18] that allows both serial communication and highly efficient serial arithmetic.
Parallel combinational arithmetic is popular in conventional computers largely to balance arithmetic performance with storage cycles, while in fact it is inefficient in
the following sense. When one needs to minimize the time
(latency) of an arithmetic operation, one makes different
design choices than to optimize throughput. The lower
bounds for multiplication of n-bit integers, even under a
constant wire delay model [1], [7], show an area time squared invariance: AT² = Ω(n²). Even if such multiplier
designs [63] spanned an interesting range of time complexity,
the complexity analysis shows that one must increase the area
rather drastically in order to reduce the latency.
In the case of computational arrays, latency is typically not an issue since one is pipelining within and between nodes in any case. One is then free to minimize the cost/performance, AT/p, where performance is measured in throughput. In fact, this problem is not so intellectually appealing as the latency problem, in that optimal cost/performance is easily achieved in almost any highly pipelined multiplier. For example, certain versions of the very efficient carry-save multiplier [49] generate a single 2n-bit product serially with A = n adder cells and T = n(t_sum + t_carry) [75], a minimum (in binary notation without recoding) for AT/p achieved even with p = 1.

D. Matrix Computations

Another well-developed set of applications for combinational arrays are matrix computations, which use two-dimensional arrays, either square or hexagonal. The hexagonal array provides three directions of flow, and beautiful mappings of matrix multiplication and LU decomposition by Gaussian elimination without pivoting [39], [40]. When such a matrix is banded, one can use an array whose size corresponds to the bandwidth. These methods have been extended to arrays that avoid broadcasting of data and control, that accomplish partial pivoting, and to systems that use an array of given bandwidth for solving systems of equations with larger bandwidth [33], both for Gaussian elimination and for the QR factorization method [34], [25].

When applied to n x n matrices, and using n² nodes, these arrays perform matrix multiplication or the solution of a system of linear equations in O(n) time, an O(n²) reduction from the sequential machine time complexity. While one-dimensional computational arrays fit a single stream of data from a disk or a sensor, the performance of these two-dimensional arrays assumes that input and output occurs in n-wide streams. If these computational arrays were to be used in a computing environment in which the array serves as an attached processor, one must deal with matrix problems of variable size, and the O(n²) I/O requirements of the array outweigh the O(n) time complexity. Thus, the methods that provide for an array of relatively small bandwidth to be used iteratively for larger problems, and for problems of variable size, are crucial to the useful application of these two-dimensional systems.

VI. MICROCOMPUTER ARRAYS

The remaining class of concurrent VLSI architectures to be discussed are systems that we label as microcomputer arrays, although they might also be called microprocessor arrays [62], homogeneous machines [76], apiaries [28], ensembles [66], transputers [2], mobs, conspiracies, teams, swarms, or myriad-micros. Whatever descriptive, clever, or poetic name is attached to these machines, the basic structure, as shown in Fig. 14, is a set of N computers that send messages to each other through a communication network. The node complexity ranges from a few Mλ², small enough to place several nodes per chip today, to perhaps 1000 Mλ². We shall assume that these multiple instruction multiple data (MIMD) arrays are homogeneous, and that there is no communication between the computers except through the network.

Fig. 14. Microcomputer array.

A. Message-Passing Structures

There are many approaches to supporting the communication between a large number of concurrently operating instruction processors. The message-passing approach over direct networks, while a direct extrapolation of the attention to locality and of the style of node interaction used in logic-enhanced memories and in computational arrays, is admittedly at one extreme.

Although it is certainly the most thoroughly studied parallel architecture, the shared storage multiprocessor in its simplest form, in which all storage references traverse a switch or other "stunt box" between processors and storage banks, is at the opposite extreme, and succeeds in making all storage references equivalent by making them all equally expensive.

An intermediate approach is to associate the storage with the processors, corresponding again to Fig. 14, and to organize the computation so that a majority of storage references are local, and only occasional references traverse the communication network. If the network is of the indirect type, such as that shown in Fig. 3(b), then all remote references exhibit a similar latency. The Denelcor HEP and the BBN Butterfly are examples of systems with this two-level hierarchy. One may also achieve a many-level hierarchy by an approach such as that used in Cm* [78], in which the switches and storage are organized in a hierarchy that provides access to increasing amounts of storage through each level of switch. Another very effective intermediate approach is to provide a local cache for each processor [59], but in a multiprocessor environment, one must employ some method of maintaining coherence between caches [20] with a tolerable latency. Digital Equipment Corporation is developing a VAX multiprocessor based on the coherent cache approach.

Communication through storage employs a global address, so that the switching network must simulate a complete connection. Even when local storage or cache is used, the network that handles remote references can become congested, depending on both the characteristics and size of the network, and on the locality of references. Thus, it is not possible to put any hard limits on the number of processors such systems might include, but systems in the 10-100 processor range appear to be entirely feasible.

One can also achieve the effect of a complete connection with a direct network by routing messages from node to node. In this case, still assuming that the messages pertain to storage references, the remote references might be to a neighboring or very distant node, so that the remote references fall into a hierarchy of increasing access time to get at an increasing fraction of the working set. Also, many algorithms do not require other than the limited connection plan of a
given direct network, as we have seen from a sample of
systolic algorithms, and as has been shown in extensive studies of algorithms for ultracomputers [64] based on a shuffle
interconnection.
Given a large disparity between local and remote accesses,
and variability in the remote accesses, there is little motive to
tie communication to storage access. Instead of suspending
instruction interpretation in a node while waiting for a reply
from a read request to a remote node, one can organize a
computation into concurrent processes that communicate by
message passing. A process is here a sequential program that
includes actions of sending and receiving messages, and may
also be able to put itself to sleep pending the completion of
a send action or the receipt of a message. There are no global
memory addresses; instead, there are references to processes
used to direct the messages that pass between them. These
messages can be of a variety of types that is limited only by
what can be interpreted by the process code. The communication between concurrent processes is then explicit, as in
Hoare's communicating sequential processes notation [30],
concurrent extensions of object-oriented programming [43],
or the actor model [28], rather than being through shared
variables.
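A minimal sketch of this process-and-message model, with threads standing in for nodes and explicit queues for the communication channels, is given below. The Node class and the trivial process bodies are hypothetical; the point is only that all interaction is by send and receive on referenced processes, with no shared variables.

# A sketch of concurrent processes that communicate only by messages.
import threading
import queue

class Node:
    def __init__(self, ident):
        self.ident = ident
        self.inbox = queue.Queue()          # messages addressed to this process

    def send(self, other, message):
        other.inbox.put((self.ident, message))

    def receive(self):
        return self.inbox.get()             # blocks (the process sleeps) until a message arrives

def worker(node, neighbor, rounds):
    # A trivial process body: forward a counter back and forth a few times.
    for _ in range(rounds):
        sender, value = node.receive()
        node.send(neighbor, value + 1)

a, b = Node("A"), Node("B")
ta = threading.Thread(target=worker, args=(a, b, 3))
tb = threading.Thread(target=worker, args=(b, a, 3))
ta.start(); tb.start()
a.send(b, 0)                                # inject the first message
ta.join(); tb.join()
print(a.inbox.qsize() + b.inbox.qsize())    # 1: one message left undelivered at the end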
This direct network message-passing architecture, with its
corresponding necessity to distribute a concurrent computation in a way that respects locality, the limits in channel
bandwidth, and latency in the direct network, is the extreme
point that these experimental VLSI microcomputer arrays
have adopted. Because they are relatively loosely coupled,
these systems scale well to thousands of nodes.
B. System Examples
There are today quite a few working examples of microcomputer arrays.
The system building projects at Caltech [67] evolved from
the algorithm, architecture, and programming methods research of S. Browning [9]-[11], B. Locanthi [47], and C. R.
Lang [43], and have been strongly influenced by A. Martin's
processing surface experiments [52]. The Caltech Mosaic
node [48] is a 16 Mλ² custom chip, about 75 percent of which
is devoted to storage, and the rest to a fast 16 bit processor
and four bidirectional communication channels. Small prototype Mosaic systems are being used for programming experiments, and a 1024-node machine is under construction. Also
in use at Caltech are several "cosmic cube" machines
[67], [69], which use commercial parts amounting to about
200 Mλ² per node. The cosmic cube is a hardware simulation
of the kind of machine that could be built with single chip
nodes about five years hence, but is useful even today. Numerous scientific and engineering applications [22] have
been run on a 64-node binary six-cube connected cosmic
cube at about 10× the performance of a VAX 11/780, thus
providing at least an existence proof for the utility and costeffectiveness of the microcomputer array architecture and
concurrent process message passing model.
The CHiP (configurable highly parallel) computer project
[72], started by L. Snyder at Purdue, is aimed at a system in
which the communication can be configured for a computation. Fig. 15 shows two examples, a hexagonal mesh and a binary tree, embedded on a structure of nodes (squares) and simple switches (circles). This approach has the advantage both of tailoring the communication to the needs of an algorithm and of providing a mechanism for configuring systems to avoid faulty nodes. The CHiP project employs an interesting interactive front end called the Poker parallel processing environment, and two "Pringle" hardware emulators for the CHiP computer are in use at Purdue and at the University of Washington.

Fig. 15. Two configurations for the CHiP computer: a hexagonal mesh and a binary tree.
The systolic array testbeds referenced above are really
microcomputer arrays that can be programmed to execute
systolic algorithms. The Carnegie-Mellon programmable
systolic computer (PSC) [21], a single-chip node that has
been demonstrated in small arrays, devotes a relatively larger
fraction of the area of a node to operations such as multiply,
and is programmable at a microcode rather than conventional
instruction set level. It accordingly represents a considerably
stronger orientation toward performance in systolic algorithms than the microcomputer arrays.
INMOS has recently announced [2] the "transputer," an
advanced single chip computer with on-chip storage and four
communication channels, a node element quite similar to
Mosaic. The Occam programming environment for the transputer supports both single transputer programming and concurrent programming of arrays of transputers.
C. Earlier Examples Revisited
The model of communicating processes is quite consistent
with the model implicit in the discussion of logic-enhanced
memories and computational arrays. In fact, any of the computations already discussed can be mapped onto microcomputer arrays. This mapping can be described as a graph
embedding in which the nodes of a finer grain machine are
placed as processes within the nodes of a microcomputer array.
The microcomputer array cannot compete in performance
with its more specialized relatives when using the same algorithms. However, the practical difference between mapping
a formulation to silicon, with the necessity and risk of build-
ing a machine, versus programming an existing machine, can
be expected to favor the microcomputer array. Also, the
microcomputer array can generally employ more efficient
algorithms.
The choice of mapping a process formulation onto a microcomputer array influences in interesting ways the concurrency and load balancing in the computation. For example, if
the same computation performed by the 512 x 512 pathfinder system discussed in Section IV were to be mapped
uniformly onto a 32 x 32 toroidal mesh connected microcomputer array, each microcomputer node would need to deal
with 256 grid points.
It might seem most natural to assign 16 x 16 subgrids to
each node. Although this is the best mapping for minimizing
the communication between nodes, it is the worst from a load
balancing standpoint. The locality of the propagation concentrates segments of wavefronts into nodes, and leaves many
nodes idle. A much better mapping for load balancing assigns
grid point (x, y) to node (x mod 32, y mod 32). This mapping
disperses the wavefront in order to maximize concurrency at
the expense of communication. The optimal mapping would
likely fall at an intermediate point, and depends on the relationship between the communication and computation performance of the nodes. Since it is still possible for a single
node to have many propagation events scheduled, one cannot
use the simple representation of time for cost that the pathfinder employed. One can either communicate costs directly
in the messages, or propagate a sequence of locally coherent
time-step waves from a corner of the mesh.
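The contrast between the two mappings can be seen in a few lines of code. The sketch below counts how many of the 32 x 32 nodes are kept busy by a small hypothetical wavefront under the 16 x 16 subgrid mapping and under the (x mod 32, y mod 32) mapping; the wavefront chosen is illustrative, not taken from a real routing problem.

# Comparing the two mappings of a 512 x 512 grid onto a 32 x 32 node array.
GRID, NODES = 512, 32
print((GRID // NODES) ** 2)                 # 256 grid points per node under either mapping

def block_node(x, y):
    # 16 x 16 subgrids: good locality, poor load balance for a wavefront.
    return (x // (GRID // NODES), y // (GRID // NODES))

def modular_node(x, y):
    # (x mod 32, y mod 32): the wavefront is dispersed over many nodes.
    return (x % NODES, y % NODES)

# Count how many distinct nodes a small wavefront touches under each mapping.
wavefront = [(x, 200) for x in range(100, 140)]           # a 40-point front (hypothetical)
print(len({block_node(x, y) for x, y in wavefront}))      # 3 nodes busy
print(len({modular_node(x, y) for x, y in wavefront}))    # 32 nodes busy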
One should notice that the microcomputer array can employ a larger fraction of its nodes concurrently than when the algorithm is expressed in silicon. The microcomputer array
can also perform the retrace phase, and can deal with a variety of speedup techniques. One such technique, studied by
this author in a simulation, is to perform a computation for a
subregion, such as the 16 x 16 region indicated above, prior
to any incoming message, such that the node can respond
immediately to a message tagged for each of its boundary
points. This trick was worth about a factor of 10 in performance. Another speedup scheme is to initiate the propagation phase from all points at once on a nmany-pin wire.
The following example of a set of formulations for the FFT
algorithm for a Mosaic system is due to a recent discussion
with my colleague, D. Cohen. Here we will start with a
mapping that is natural to a microcomputer implementation,
and end up with a computational array.
If one wished to compute the Fourier transform in successive windows of n samples, it is a desirable system organization for the transform to be computed at the sample
rate. Thus, a pipeline of log2 n nodes, as shown in Fig. 16(a),
allows the successive nodes to compute the successive stages
of the transform shown in Fig. 12. Assuming that samples are
sent between nodes in a top-to-bottom order in each stage,
and starting with all queues empty, the first node must queue
(n/2) samples before it can start its butterfly computations.
As long as the node can perform butterfly computations at a
rate of one for each two input messages, its input queue will
not be longer than (n/2). After each butterfly, the node can
rid itself of the first result immediately, but must queue another (n/2) outputs. At the second stage the situation is similar, except that the queue lengths are (n/4). The last stage
needs to queue only one input and one output, but stores
(n/2) roots of unity. The ability to queue intermediate results
to smooth the computing load is the same extension of the
systolic model in its use of geometrically varying queue size
as is employed in certain sorting algorithms [85], such as the
bitonic sort.
What if the nodes were not fast enough to perform the
butterfly computations at the desired sample rate? One could
use multiple independent pipelines, but there is another way.
The system shown in Fig. 16(b) is intended to double the
permissible input sample rate. The input node distributes
even samples to the upper pipeline and odd samples to the
lower pipeline. Each pipeline operates at half the sample rate
and with half the storage requirements. The pipelines are
independent until the last stage, at which point the butterfly
reappears. Log2 n applications of this doubling bring one to
a systolic implementation in which the queues have disap-
peared.
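The doubling scheme of Fig. 16(b) is, in effect, one application of the usual decimation-in-time identity, as the following sketch suggests: two independent half-length transforms over the even and odd samples are combined by a final column of butterflies. A direct DFT is used for the half-length transforms only to keep the sketch self-contained; the pipelining and queueing of the actual array are abstracted away.

import cmath

def dft(x):
    # Direct n-point transform, used here only as a stand-in for a pipeline.
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n)) for k in range(n)]

def doubled_fft(samples):
    n = len(samples)
    even = dft(samples[0::2])           # upper pipeline, half the sample rate
    odd = dft(samples[1::2])            # lower pipeline
    out = [0j] * n
    for k in range(n // 2):             # the butterfly column that "reappears" at the last stage
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

x = [complex(i, 0) for i in range(8)]
print(all(abs(a - b) < 1e-9 for a, b in zip(doubled_fft(x), dft(x))))   # True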
D. Irregular Computations
These programmable machines have additional capabilities well beyond those of emulating their finer grain relatives.
They are not restricted to static process structures, but can
build dynamic structures. Their storage capabilities allow
them to queue blocks of data, and their programmability
allows a wide choice of efficient algorithms. These are indeed very interesting machines to program!
It is perhaps not very surprising that concurrent computations with regular structures and predictable communication demands [22] are very efficiently performed on microcomputer arrays. These problems are, after all, the
usual grist for parallel mills. What is just becoming clear is
how far such machines can be extended to deal with computations in which the communication demands are irregular. Two concurrent electrical circuit simulation programs, CONCISE, developed recently at Caltech by S. Mattisson [53] to be run on the cosmic cube, and a parallel version of SPLICE developed at Berkeley by Deutch and Newton [19] for the BBN Butterfly, are paradigms of this irregular class of computation that can still be performed efficiently on microcomputer arrays. Circuit simulation is a numerically demanding computation that involves evaluation of nonlinear circuit models to produce piecewise linear representations, solution of large sets of linear equations of irregular sparsity, time step determination, and integration. Although the message traffic or storage access patterns are irregular, the process structure is static for the course of the computation. Instead of the traditional direct method of solving the matrix equation for a time step, CONCISE and the parallel SPLICE both use iterative methods, and in all other regards, such as their time step determination and windowing, tend to debunk the theory that concurrent machines are constrained to using unsophisticated and inefficient algorithms.

The next test of the capabilities of microcomputer array architectures is their ability to support computations in which the process structure can change dynamically, and in which processes can distribute and relocate themselves during execution. The idea that with so many machines it should be possible to achieve high performance, and also fault-tolerant operation, is appealing and likely possible. The Japanese Fifth-Generation project [58] and other efforts in high performance architectures for artificial intelligence applications cite the necessity of using VLSI and concurrency to achieve their objectives. These AI problems are typically highly branched and dynamic. Although microcomputer arrays are interesting candidates for vision and knowledge base applications, it remains to be seen whether AI machines can take a similar form.

Fig. 16. (a) Microcomputer array FFT. (b) Throughput doubling by separation into even and odd samples.

VII. CONCLUSIONS

The physical design task and the principles to which one must adhere in VLSI and its expected descendents are more demanding but also more uniform than in earlier technologies. The high cost of communication, in area, power, and time, relative to switching performance, constrains the design and engineering of large-scale tightly coupled systems. However, the absolute costs of switching, storage, and communication are dropping so rapidly that one can reasonably think about systems of such large scale that they might have more processing elements than early computers had bits of storage.

Loosely coupled concurrent systems place simpler demands on the design and engineering of such large-scale systems, but shift the burden of planning the distribution of processing and communication to the formulation and expression of the algorithms and applications. As J. Schwartz put it [64]: "A central problem is to develop techniques which allow the organization of concurrent computation on a very large scale.... The deepest opportunities inherent in microstructure technology can only be realized if we find effective ways of structuring and using such [highly concurrent computing] assemblages."

This paper has presented an effective and consistent, if somewhat crude and direct, way of structuring highly concurrent computations in terms of processes and messages. Such formulations can be expressed in a single family of concurrent VLSI architectures, either directly in silicon, or in concurrent programs that run on arrays of microcomputers. Although examples of these machines have proved to provide very good performance and excellent cost/performance on certain problems, one can make no pretense that these machines are useful or efficient for all computational problems.

There are fundamental questions of how one organizes the cooperation of so many computers to perform a single computation. How does one formulate and specify such a computation, and verify its correctness? Is it necessary to formulate and direct these computations explicitly, or are there also effective implicit formulations from which the concurrency can be derived? There may be no universal answers to these questions, but the experiments underway may at least be suggestive of some of the problems and opportunities.

REFERENCES

[1] H. Abelson and P. Andreae, "Information transfer and area-time tradeoffs for VLSI multiplication," Commun. Ass. Comput. Mach., vol. 23, no. 1, pp. 20-23, 1980.
[2] I. Barron et al., "Transputer does 5 or more MIPS even when not used in parallel," Electronics, pp. 109-115, Nov. 17, 1983.
[3] K. E. Batcher, "The flip network in STARAN," in Proc. Int. Conf. Parallel Processing, 1976, pp. 65-71.
[4] G. Bilardi, M. Pracchi, and F. P. Preparata, "A critique and an appraisal of VLSI models of computation," in Proc. CMU Conf. VLSI Syst. Comput., 1981. Rockville, MD: Comput. Sci. Press, 1981.
[5] G. Bishop and H. Fuchs, "A smart optical sensor on silicon," in Proc. Conf. Advanced Res. VLSI, Jan. 1984. Dedham, MA: Artech, 1984, pp. 65-73.
[6] J. Blackmer, P. Kuekes, and G. Frank, "A 200 MOPS systolic processor," in Proc. SPIE, vol. 298, Real-Time Signal Processing IV, Soc. Photo-Opt. Instrum. Eng., 1981.
[7] R. P. Brent and H. T. Kung, "The chip complexity of binary arithmetic,"
in Proc. 12th ACM Symp. Theory Comput., May 1980, pp. 190-200.
[8] K. Bromley et al., "Systolic array processor developments," in Proc.
CMU Conf. VLSI Syst. Comput., Oct. 1981. Rockville, MD: Comput.
Sci. Press, 1981, pp. 273-284.
[9] S. A. Browning, "The tree machine: A highly concurrent computing
environment," Dep. Comput. Sci., California Inst. Technol., Pasadena,
Tech Rep. 3760:TR:80, 1980.
[10]
, sect. 8.4.1 in C. A. Mead and L. Conway, Introduction to VLSI
Systems. Reading, MA: Addison-Wesley, 1980.
[11] S. A. Browning and C. L. Seitz, "Communication in a tree machine," in
Proc. 2nd Caltech Conf. VLSI, Dep. Comput. Sci., California Inst.
Technol., Pasadena, 1981, pp. 509-526.
[12] M. R. Buric and C. Mead, "Bit serial inner product processors in VLSI,"
in Proc. 2nd Caltech Conf. VLSI, Dep. Comput. Sci., California Inst.
Technol., Pasadena, 1981, pp. 155-164.
[13] C. R. Carroll, "A smart memory array processor for two layer path
finding," in Proc. 2nd Caltech Conf. VLSI, Dep. Comput. Sci., California Inst. Technol., Pasadena, 1981, pp. 165-195.
[14] D. P. Christman, "Programming the connection machine," M.S. thesis,
Dep. Elec. Eng. Comput. Sci., Massachusetts Inst. Technol.,
Cambridge, 1984.
[15] D. Cohen, "Simplified control of FFT hardware," IEEE Trans. Acoust.,
Speech, Signal Processing, vol. ASSP-24, Dec. 1976.
[16]
"Mathematical approach to iterative computation networks," in
Proc. 4th Symp. Comput. Arithmetic (also as ISI/RR-78-73,
U.S.C./Inform. Sci. Inst., Marina del Rey, CA, Nov. 1978), IEEE Cat.
No. 78CH1412-6C, pp. 226-238, Oct. 1978.
[17] A. L. Davis and R. M. Keller, "Data flow program graphs," IEEE Computer, vol. 15, pp. 26-41, Feb. 1982.
[18] P. B. Denyer, "An introduction to bit-serial architectures for VLSI signal
processing," in VLSI Architecture, B. Randall and P. C. Treleven,
Eds. Englewood Cliffs, NJ: Prentice-Hall, 1983, ch. 20.
[19] J. T. Deutch and A. R. Newton, "A multiprocessor implementation of relaxation based electrical circuit simulation," in Proc. 21st Design Automat. Conf., 1984, pp. 350-357.
[20] M. Dubois and F. A. Briggs, "Effects of cache coherency in multiprocessors," IEEE Trans. Comput., vol. C-31, Nov. 1982.
[21] A. L. Fisher et al., "The architecture of a programmable systolic chip,"
J. VLSI Comput. Syst., vol. 1, no. 2, pp. 1-16, 1984.
[22] G. C. Fox and S. W. Otto, "The use of concurrent processors in science
and engineering," Phys. Today, May 1984.
[23] H. Fuchs et al., "Developing pixel-planes, a smart memory-based raster graphics system," in Proc. Conf. Advanced Res. VLSI, Massachusetts
Inst. Technol., Cambridge, Jan. 1982. Dedham, MA: Artech, 1982.
[24] G. R. Goke and G. J. Lipovski, "Banyan networks for partitioning multiprocessor systems," in Proc. 1st Annu. Symp. Comput. Architecture, 1973, pp. 21-28.
[25] D. E. Heller and I. C. F. Ipsen, "Systolic networks for orthogonal equivalence transformations and their applications," in Proc. Conf. Advanced
Res. VLSI, Massachusetts Inst. Technol., Cambridge, Jan. 1982,
Dedham, MA: Artech, 1982, pp. 113-122.
[26] G. G. Hendrix, "Encoding knowledge in partitioned networks," in Associative Networks. New York: Academic, 1979.
[27] J. L. Hennessy, "VLSI processor architecture," IEEE Trans. Comput.,
this issue, pp. 1221-1246.
[28] C. E. Hewitt, "The apiary network architecture for knowledgeable systems," in Conf. Rec. LISP Conf., Stanford, CA, Aug. 1980.
[29] W. D. Hillis, "The connection machine (computer architecture for the
new wave)," Massachusetts Inst. Technol., Cambridge, Al Memo 646,
Sept. 1981.
[30] C. A. R. Hoare, "Communicating sequential processes," Commun. Ass.
Comput. Mach., vol. 21, no. 8, pp. 666-677, 1978.
[31] B. Hoeneisen and C. A. Mead, "Fundamental limitations in microelectronics I, MOS technology," Solid-State Electron., vol. 15,
pp. 819-829, 1972.
[32] L. Johnsson and D. Cohen, "A mathematical approach to modeling the flow of data and control in computational networks," in Proc. CMU Conf. VLSI Syst. Comput., Oct. 1981. Rockville, MD: Comput. Sci. Press, 1981, pp. 213-225.
[33] L. Johnsson, "Computational arrays for band matrix equations," Dep. Comput. Sci., California Inst. Technol., Pasadena, Tech. Rep. 4287:TR:81, 1981.
[34] L. Johnsson, "A computational array for the QR-method," in Proc. Conf. Advanced Res. VLSI, Massachusetts Inst. Technol., Cambridge, Jan. 1982. Dedham, MA: Artech, 1982, pp. 123-129.
[35] D. J. Kuck et al., "Measurements of parallelism in ordinary FORTRAN programs," IEEE Computer, vol. 7, pp. 37-46, Jan. 1974.
[36] R. H. Kuhn and D. A. Padua, Eds., IEEE Computer Society Tutorial on Parallel Processing, 1981.
[37] H. T. Kung, "Synchronized and asynchronous parallel algorithms for multiprocessors," in Algorithms and Complexity: New Directions and Recent Results, J. F. Traub, Ed. New York: Academic, 1976, pp. 153-200.
[38] H. T. Kung, "Let's design algorithms for VLSI," in Proc. Caltech Conf. VLSI, Dep. Comput. Sci., California Inst. Technol., Pasadena, 1979, pp. 65-90.
[39] H. T. Kung and C. E. Leiserson, "Algorithms for VLSI processor arrays," in C. A. Mead and L. Conway, Introduction to VLSI Systems. Reading, MA: Addison-Wesley, 1980, sect. 8.3, pp. 271-292.
[40] H. T. Kung, "The structure of parallel algorithms," in Advances in Computers, vol. 19. New York: Academic, 1980.
[41]
"Why systolic architectures?," IEEE Computer, vol. 15, Jan.
1982.
[42] S. -Y. Kung et al., "Wavefront array processor: Language, architecture,
and applications," IEEE Trans. Comput., vol. C-31, pp. 1054-1066,
Nov. 1982.
[43] C. R. Lang, Jr., "The extension of object-oriented languages to a homogeneous concurrent architecture," Dep. Comput. Sci., California Inst.
Technol., Pasadena, Tech. Rep. 5014:TR:82, 1982.
[44] D. H. Lawrie, "Access and alignment of data in an array processor,"
IEEE Trans. Comput., vol. C-24, pp. 1145-1155, Dec. 1975.
[45] C. Lee, "An algorithm for path connections and its applications," IRE
Trans. Electron. Comput., vol. EC-10, pp. 346-365, Sept. 1961.
[46] K. Y. Liu, "Architecture for VLSI design of Reed-Solomon encoders,"
in Proc. 2nd Caltech Conf. VLSI, Dep. Comput. Sci., California Inst.
Technol., Pasadena, 1981, pp. 539-554.
[47] B. Locanthi, "The homogeneous machine," Dep. Comput. Sci., California Inst. Technol., Pasadena, Tech. Rep. 3759:TR:80, 1980.
[48] C. Lutz et al., "Design of the mosaic element," in Proc. Conf. Advanced
Res. VLSI, Massachusetts Inst. Technol., Cambridge, 1984. Dedham,
MA: Artech, 1984, pp. 1-10.
[49] R. F. Lyon, "Two's complement pipeline multipliers," IEEE Trans. Commun., vol. COM-24, pp. 418-425, Apr. 1976.
[50] R. F. Lyon, "The optical mouse, and an architectural methodology for smart
digital sensors," in Proc. CMU Conf. VLSI Syst. Comput., Oct.
1981. Rockville, MD: Comput. Sci. Press, 1981.
[51] R. F. Lyon, "A bit-serial architectural methodology for signal processing," in VLSI 81. New York: Academic, 1981.
[52] A. J. Martin, "The torus: An exercise in constructing a processing surface," in Proc. 2nd Caltech Conf. VLSI, Dep. Comput. Sci., California
Inst. Technol., Pasadena, 1981, pp. 527-538.
[53] S. Mattisson, "A concurrent circuit simulator," Dep. Comput. Sci., California Inst. Technol., Pasadena, Tech. Rep. 5142:TR:84, 1984.
[54] C. A. Mead and L. Conway, Introduction to VLSI Systems. Reading,
MA: Addison-Wesley, 1980.
[55] C. A. Mead and M. Rem, "Minimum propagation delays in VLSI," IEEE
J. Solid-State Circuits, vol. SC-17, pp. 773-775, Aug. 1982.
[56] E. Moore, "Shortest path through a maze," Ann. Comput. Lab. Harvard
Univ., vol. 30, pp. 285-292, 1959.
[57] G. E. Moore, "Are we really ready for VLSI?," in Proc. Caltech Conf.
VLSI, Dep. Comput. Sci., California Inst. Technol., Pasadena, 1979.
[58] T. Moto-oka and H. S. Stone, "Fifth generation computer systems: A
Japanese project," IEEE Computer, vol. 17, pp. 6-13, Mar. 1984.
[59] J. H. Patel, "Analysis of multiprocessors with private cache memories,"
IEEE Trans. Comput., vol. C-31, pp. 296-304, Apr. 1982.
[60] D. A. Patterson and C. H. Sequin, "Design considerations for single-chip
computers of the future," IEEE J. Solid-State Circuits, vol. SC-15, Feb.
1980.
[61] M. C. Pease,III, "An adaptation of the fast Fourier transform for parallel
processing," J. Ass. Comput. Mach., vol. 15, pp. 252-264, 1968.
[62]
"The indirect binary n-cube microprocessor array," IEEE Trans.
Comput., vol. C-26, pp. 458-473, May 1977.
[63] F. P. Preparata, "A mesh-connected area-time optimal VLSI integer multiplier," in Proc. CMU Conf. VLSI Syst. Comput., 1981. Rockville,
MD: Comput. Sci. Press, 1981.
[64] J. T. Schwartz, "Ultracomputers," ACM Trans. Programming Languages
Syst., vol. 2, pp. 484-521, Oct. 1980.
[65] C. L. Seitz, "System timing," in C. A. Mead and L. Conway, Introduction to VLSI Systems. Reading, MA: Addison-Wesley, 1980.
[66]
,"Ensemble architectures for VLSI -A survey and taxonomy," in
Proc. Conf. Advanced Res. VLSI, Massachusetts Inst. Technol., Cambridge, Jan. 1982. Dedham, MA: Artech, 1982, pp. 130-135.
[67]
,"Experiments with VLSI ensemble machines," J. VLSI Comput.
Syst., vol. 1, no. 3, 1984.
[68] C. L. Seitz and J. Matisoo, "Engineering limits on computer performance," Phys. Today, pp. 38-45, May 1984.
[69] C. L. Seitz, "The cosmic cube," Commun. Ass. Comput. Mach., to be
published, Dec. 1984.
[70] H. J. Siegel, "Interconnection networks for SIMD machines," IEEE
Computer, vol. 12, pp. 57-65, June 1979.
[71] C. S. Slemaker, R. C. Mosteller, L. W. Leyking, and A. G. Livitsanos,
"A programmable printed wiring router," in Proc.11th Design Automat.
Workshop, June 1974.
[72] L. Snyder, "Introduction to the configurable highly parallel computer,"
IEEE Computer, vol. 15, pp. 47-56, Jan. 1982.
[73] H. S. Stone, "Parallel processing with the perfect shuffle," IEEE Trans. Comput., vol. C-20, 1971.
[74] H. S. Stone, Ed., Introduction to Computer Architecture. Chicago, IL: Sci. Res. Assoc., 1975, particularly pp. 318-374.
[75] W.-K. Su, "Super mesh," M.S. thesis, Dep. Comput. Sci., California Inst. Technol., Pasadena, Tech. Rep. 5125:TR:84, 1984.
[76] H. Sullivan and T. R. Brashkow, "A large scale homogeneous machine I & II," in Proc. 4th Annu. Symp. Comput. Architecture, 1977, pp. 105-124.
[77] I. E. Sutherland and C. A. Mead, "Microelectronics and computer science," Sci. Amer., vol. 237, pp. 210-229, Sept. 1977.
[78] R. J. Swan et al., "Cm* - A modular multimicroprocessor," in Proc. Nat. Comput. Conf., vol. 46. AFIPS Press, 1977, pp. 637-644.
[79] R. M. Swanson and J. D. Meindl, "Ion-implanted complementary MOS transistors in low-voltage circuits," IEEE J. Solid-State Circuits, vol. SC-7, pp. 146-153, Apr. 1972.
[80] J. J. Symanski, "NOSC systolic processor testbed," Naval Ocean Syst. Cen., Tech. Rep. TR NOSC TD 588, June 1983.
[81] J. E. Tanner, M. H. Raibert, and R. Eskenazi, "A VLSI tactile sensing array computer," in Proc. 2nd Caltech Conf. VLSI, Dep. Comput. Sci., California Inst. Technol., Pasadena, Jan. 1981, pp. 217-234.
[82] J. E. Tanner and C. Mead, "A correlating optical motion detector," in Proc. Conf. Advanced Res. VLSI, Massachusetts Inst. Technol., Cambridge, Jan. 1984. Dedham, MA: Artech, 1984, pp. 57-64.
[83] C. D. Thompson, "A complexity theory for VLSI," Dep. Comput. Sci., Carnegie-Mellon Univ., Pittsburgh, PA, Tech. Rep. CMU-CS-80-140, Aug. 1980.
[84] C. D. Thompson, "The VLSI complexity of sorting," in Proc. CMU Conf. VLSI Syst. Comput., Computer Science Press, Oct. 1981, pp. 108-118.
[85] C. D. Thompson, "The VLSI complexity of sorting," IEEE Trans. Comput., vol. C-32, pp. 1171-1184, Dec. 1983.
[86] U. Weiser and A. Davis, "A wavefront notation tool for VLSI array design," in Proc. CMU Conf. VLSI Syst. Comput., Computer Science Press, Oct. 1981, pp. 226-234.
[87] D. Whiting, "Bit serial Reed-Solomon decoders," Ph.D. dissertation,
Dep. Comput. Sci., California Inst. Technol., June 1984.
[88] D. W. L. Yen and A. V. Kulkarni, "The ESL systolic processor for signal
& image processing," in Proc. 1981 IEEE Comput. Soc. Workshop Comput. Arch. Pattern Anal. Image Database Management, Nov. 1981,
pp. 265-272.
Charles L. Seitz (S'68-M'69) received the B.S.,
M.S., and Ph.D. degrees from the Massachusetts
Institute of Technology, Cambridge, MA.
He is now a Professor of Computer Science at the
California Institute of Technology, Pasadena, CA,
where his research and teaching activities are in the
areas of VLSI architecture and design, concurrent
computation, and self-timed systems. Prior to joining the faculty of the California Institute of Technology, he worked as an Industrial Consultant from
1972 to 1977, principally for the Burroughs Corporation, was an Assistant Professor of Computer Science at the University
of Utah from 1970 to 1972, and was a Member of the Technical Staff of the
Evans and Sutherland Computer Corporation from 1969 to 1971. While at
the Massachusetts Institute of Technology, he was an Instructor of Electrical
Engineering.
Dr. Seitz is a member of the Association for Computing Machinery and
of the IEEE Computer Society, and was the recipient of the Goodwin Medal
for "conspicuously effective teaching" at the Massachusetts Institute of
Technology.