IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 6, NO. 3, SEPTEMBER 1998
Wave-Pipelining: A Tutorial and Research Survey
Wayne P. Burleson, Member, IEEE, Maciej Ciesielski, Senior Member, IEEE, Fabian Klass, Associate Member, IEEE,
and Wentai Liu, Senior Member, IEEE
Abstract— Wave-pipelining is a method of high-performance
circuit design which implements pipelining in logic without the
use of intermediate latches or registers. The combination of
high-performance integrated circuit (IC) technologies, pipelined
architectures, and sophisticated computer-aided design (CAD)
tools has converted wave-pipelining from a theoretical oddity
into a realistic, although challenging, VLSI design method. This
paper presents a tutorial of the principles of wave-pipelining and
a survey of wave-pipelined VLSI chips and CAD tools for the
synthesis and analysis of wave-pipelined circuits.
Index Terms—Performance optimization, VLSI circuits, wave-pipelining.
I. INTRODUCTION
WAVE-PIPELINING is an example of one of the many
methods currently being used in sophisticated VLSI
designs. As an alternative to pipelining, it provides a method
for significantly reducing clock loads and the associated area,
power and latency while retaining the external functionality
and timing of a synchronous circuit. It is of particular interest today because it involves design and analysis across a
variety of levels (process, layout, circuit, logic, timing, and
architecture) which characterize VLSI design. However, it also
questions some of the fundamental tenets of simplified VLSI
design as popularized in the early 1980’s.
The idea of wave-pipelining was originally introduced by
Cotten [6], who named it maximum rate pipelining. Cotten
observed that the rate at which logic can propagate through
the circuit depends not on the longest path delay but on the
difference between the longest and the shortest path delays.
As a result, several computation “waves,” i.e., logic signals
related to different clock cycles, can propagate through the
logic simultaneously. One can also view wave-pipelining
as a form of virtual pipelining, in which each gate serves as a
virtual storage element.
In an attempt to understand the wave-pipelining phenomenon and turn it into a useful and reliable computer technology,
research focused on the following aspects: 1) developing
correct timing models and analyzing the problem mathematically [36], [8], [9], [17], [11], [12], [26]; 2) developing logic
synthesis techniques and computer-aided design (CAD) tools
Manuscript received February 12, 1996; revised October 30, 1997. This
work was supported by the National Science Foundation under Grant MIP9208267.
W. P. Burleson and M. Ciesielski are with the Department of Electrical
and Computer Engineering, University of Massachusetts, Amherst, MA 01003
USA.
F. Klass is with Sun Microsystems, Inc., Sunnyvale, CA 90095 USA.
W. Liu is with the Department of Electrical and Computer Engineering,
North Carolina State University, Raleigh NC 27607 USA.
Publisher Item Identifier S 1063-8210(98)05977-0.
for wave-pipelined circuits [41], [43], [16], [20], [21], [37];
3) developing new circuit techniques specifically devoted to
wave-pipelining [27]; and 4) testing the wave-pipelining ideas
by building VLSI chips [27], [42], [22]. A comparative study
of existing methods in wave-pipelining can be found in [23].
Wave-pipelining has suffered from a number of myths and
the problems which need to be solved in a wave-pipelined
design are not widely understood. This paper presents a tutorial
with examples to demonstrate the type of design problems
which arise in wave-pipelining. In Section II timing constraints
are derived which ensure the proper operation of a wave-pipelined circuit. Section III discusses the sources of delay
variation which affect the timing constraints and presents
methods for minimizing their impact. Section IV reviews CAD
tools available for synthesizing wave-pipelined circuits and
demonstrates their use with two example circuits. Section V
reviews a number of industrial and academic designs which
employ wave-pipelining. Section VI poses some open research
problems in wave-pipelining and related areas.
The impact of the paper is broader than wave-pipelining alone, as the methods of analysis and synthesis apply
to many other aggressive timing and circuit techniques being
used in industry today. We anticipate that in the future this
type of VLSI design technique will be required to maintain
the steady performance improvements to which we
have become accustomed. VLSI designers, CAD developers
and system designers need to be aware of these trends and their
future impact on design, manufacture, testing and education
in microelectronics.
II. CLOCKING OF WAVE-PIPELINED CIRCUITS
The timing requirements of wave-pipelined circuits will be
defined for a single combinational logic block with registers
attached to its inputs and outputs. In the case of multistage
pipelines, the timing constraints must hold for all stages. In the
sequel, the term conventional pipelining will be used to refer to
a pipeline where only a single set of data propagates between
registers at any given time. The term constructive clock-skew
will be used to refer to a clock-skew that is intentionally
created between two clock signals and that can be adjusted
with predictable effects. This is in contrast to an uncontrolled
clock-skew that exists in the circuit due to delay differences
along the clock lines.
Parameters listed below will be used in the derivation of
the timing constraints. These parameters are defined under
worst case conditions, including manufacturing tolerances,
data-dependent delays, and environmental changes.
1063–8210/98$10.00  1998 IEEE
Fig. 1. Model of wave-pipelined circuit.
• T_max, T_min — maximum and minimum propagation delays in the combinational logic block.
• T_c — clock period.
• T_s, T_h — register setup and hold times.
• T_r — propagation delay of a register.
• Δ — constructive clock skew between the output and input registers.
• δ — worst-case uncontrolled clock skew at a register.
Our analysis assumes that all the input and output registers
have the same setup time T_s, hold time T_h, and propagation
delay T_r. For simplicity and without much loss of generality,
we shall only consider edge-triggered registers. Timing
constraints for other types of registers, including transparent
latches, can be similarly derived.
A. Circuit and Timing Models
The phenomenon of wave-pipelining can be illustrated
using the timing model presented in [12], shown in Fig. 1.
A synchronous logic block is clocked by a set of input
and output registers, with the constructive clock skew equal
to Δ [Fig. 1(a)]. Associated with each block is a pair of
minimum and maximum propagation delays, (T_min, T_max).
These parameters are the delays of the earliest and the latest
signals propagating through the logic along its shortest and
longest paths, respectively.
The spread of signals traveling along the longest and the
shortest paths through the logic block can be visualized as
delay contours. The shaded area between the contour lines
represents an unstable region when signals switch, i.e., when
the computation is being performed by the logic block. For a
simple logic network in Fig. 1(b) the delay contours are shown
in Fig. 1(c). The exact contours depend on the logic and the
delay characteristics of its elements. Following [12], we shall
simplify the delay contours by linearizing the delays along the
logic block to obtain delay “cones,” as shown in Fig. 1(d).
Fig. 2 depicts a combined temporal and spatial diagram for
a wave-pipelined circuit. The horizontal axis represents time,
while the vertical axis represents the logic depth of the circuit.
A point on the logic depth axis represents a node (gate) of
the circuit. The shaded regions denote the time intervals when
computation is being performed and data is unstable, while the
space between two adjacent computation regions corresponds
to stable data. The computation regions are labeled with the
clock cycle during which the data has been latched at the input
register. It can be observed that at any given reference time
t, several “waves” of computation can be present in the logic
simultaneously.
For correct clocking, the output data must be sampled at a
time T_L at which the data is stable, i.e., T_L must fall in the
nonshaded area. Furthermore, at any internal node of the
network, there must be a minimum separation between the
arrival time of the latest signal of a given wave and the earliest
signal of the next wave. The temporal separation between the
waves at an internal node is represented in the figure by the
gap between adjacent shaded regions.
B. Timing Constraints
For a wave-pipelined system to operate correctly, the system
clocking must be such that the output data is clocked after the
latest data has arrived at the outputs and before the earliest
data from the next clock cycle arrives at the outputs. We
shall first derive the conditions for clocking of the latest
and the earliest data propagating in the circuit, referred to
as the register constraints. We introduce the parameter N, which
represents the number of clock cycles needed for a signal to
propagate through the logic block before being latched by the
output register. This parameter serves as an intuitive measure
of the degree of wave-pipelining. The data should be clocked
Fig. 2. Temporal/spatial diagram for wave-pipelining system.
at time T_L by the rising edge of the output register clock, N
cycles after it has been clocked by the input register. Due to a
possible constructive skew Δ (of arbitrary value) between the
output and the input registers, this time can be expressed as

T_L = N T_c + Δ.   (1)

1) Register Constraints:
a) Clocking of the latest data: This constraint requires
that the latest possible signal arrives early enough to be
clocked by the output register during the Nth clock cycle.
Therefore, the lower bound on T_L, which denotes the time at
which the output wave is captured, is given by

T_L ≥ T_max + T_r + T_s + δ.   (2)

b) Clocking of the earliest data: This condition requires
that the arrival of the next wave must not interfere with the
clocking of the current wave. That is, the earliest possible
signal of the next wave must arrive later than the clocking of the
current wave at the output register. This condition is similar to the
race-through constraint in conventional pipelining. Taking the
current wave as latched at the input register at time 0, the
earliest arrival of the next wave is given by T_c + T_r + T_min.
After the clock pulse has been applied to the output register,
an additional hold time T_h must be allowed for the data to remain
steady. In addition, one must account for the uncontrolled
clock-skew δ at the output register. As a result, T_L
is bounded above as follows:

T_L ≤ T_c + T_r + T_min − T_h − δ.   (3)

Combining constraints (2) and (3) gives us the well-known
maximum rate pipelining condition of Cotten:

T_c ≥ (T_max − T_min) + T_s + T_h + 2δ.   (4)

The minimum clock period is limited by the difference in
path delays, T_max − T_min, plus the clocking overhead
T_s + T_h + 2δ resulting from the insertion of clocked
registers.
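As a concrete illustration, the register constraints and Cotten's condition can be evaluated numerically. The following Python sketch is ours, not a tool from the paper; the parameter names are our own rendering of the paper's symbols (t_max/t_min for the extreme logic delays, t_r for the register delay, t_s/t_h for setup and hold, and skew for the worst-case uncontrolled clock skew).

```python
# Illustrative sketch (ours, not a tool from the paper): checking the register
# constraints of Section II-B numerically.

def min_clock_period(t_max, t_min, t_s, t_h, skew):
    """Cotten's maximum-rate condition: T_c >= (t_max - t_min) + t_s + t_h + 2*skew."""
    return (t_max - t_min) + t_s + t_h + 2 * skew

def output_latch_window(t_max, t_min, t_r, t_s, t_h, skew, t_c):
    """Valid output-latching times T_L: the latest data must meet setup
    (lower bound) and the next wave's earliest data must not violate hold
    (upper bound)."""
    lo = t_max + t_r + t_s + skew          # clocking of the latest data
    hi = t_c + t_r + t_min - t_h - skew    # clocking of the earliest data
    return lo, hi

# Example: 10 ns / 7 ns worst/best logic delays.  At the minimum clock period
# the latching window collapses to a single point.
t_c = min_clock_period(10.0, 7.0, t_s=0.5, t_h=0.5, skew=0.25)
lo, hi = output_latch_window(10.0, 7.0, t_r=1.0, t_s=0.5, t_h=0.5, skew=0.25, t_c=t_c)
```

Note how the register delay t_r cancels out of the minimum-period expression, as in the derivation of (4): it shifts both the setup and hold bounds equally.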
2) Internal Node Constraints: Constraint (4) guarantees
that waves do not collide, or overlap, at the output of the
logic block. Additional constraints need to be imposed at the
internal nodes of the combinational block to prevent wave
collision at the individual logic gates. These signal separation
constraints can be informally expressed as follows:
The next earliest possible wave should not arrive at
a node until the latest possible wave has propagated
through.
These constraints, also known as internal node constraints,
were derived, in slightly different forms, by Ekroot [8], Wong
et al. [43], Joy et al. [17], and Gray et al. [12].
Let n be an internal node (the output of a gate) of the logic
network, a point on the logic depth axis in Fig. 2. To help
derive the internal node constraint we define t_max(n) and
t_min(n) as the longest and the shortest propagation delays
from the primary inputs to node n. The following internal node
constraint must be satisfied at each node of the circuit:

T_c ≥ t_max(n) − t_min(n) + t_s(n) + δ   (5)

where t_s(n) is the minimum time that node n must be stable
to correctly propagate a signal through the gate, and δ is
the worst case uncontrolled clock skew at the input register.
Notice the similarity between constraints (5) and (4). While
t_max(n) − t_min(n) is equivalent to T_max − T_min, the
term t_s(n) is equivalent to T_s + T_h (the minimum sampling
time for the data). The factor of two in the clock-skew term
is not present here since the signals are not stored (clocked)
at the internal nodes.

C. Intervals of Valid Clocking

Based on the conditions presented in the previous section,
a linear program (LP) can be readily formulated to find a
minimum value of the clock period T_c for a given value of N
(number of waves). The LP approach minimizes T_c subject
to both register constraints (2), (3) and the internal node
constraints (5) for all nodes in the circuit. This approach,
taken by Fishburn [9] and Joy et al. [17], seems to suggest
that the circuit will operate correctly at any value of clock
cycle above the computed minimum, that is, that the region
of valid clock period is only bounded below. However, as
independently demonstrated by Lam et al. [26] and by Gray
et al. [12], the feasibility region of the valid clock period
is not continuous but is composed of a finite set of disjoint
regions. Furthermore, the degree of wave-pipelining, N, does
not change monotonically with the increase in clock frequency.

To see why this is the case we shall examine again register
constraints (2) and (3) to obtain a two-sided constraint on the
clock period. Recalling that T_L = N T_c + Δ, the latching time
of the data at the output register, we obtain the following
inequality (refer to Fig. 2):

T_max + T_r + T_s + δ ≤ N T_c + Δ ≤ T_c + T_r + T_min − T_h − δ.   (6)

To simplify the interpretation of the above relation let us
introduce two parameters D_max and D_min, where

D_max = T_max + T_r + T_s + δ − Δ   (7)

represents the maximum delay through the logic, including
clocking overhead and clock skews, while

D_min = T_min + T_r − T_h − δ − Δ   (8)

represents the minimum delay through the logic. With this, (6)
can be expressed as follows:

D_max / N ≤ T_c ≤ D_min / (N − 1).   (9)

Analyzing the above result reveals that the feasibility region
of the clock period T_c may not be continuous. For N = 1
the clock period is only bounded below by D_max, as in
conventional pipelining. For larger values of N, however, (9)
shows that the period is also bounded above by D_min/(N − 1).
In addition, if D_min < D_max, then any two successive intervals
[D_max/N, D_min/(N − 1)] and [D_max/(N + 1), D_min/N]
are disjoint. For a given value of N, the size of the interval
which represents the valid clock period grows with the
increase of D_min and the decrease of D_max.
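The disjoint feasibility regions described above can be enumerated directly. The following sketch is ours (not the authors' tool); D_max and D_min denote the effective maximum and minimum delays through the logic, including clocking overhead and skews, and N is the number of waves.

```python
# Sketch (ours): enumerating the disjoint feasibility intervals
# D_max/N <= T_c <= D_min/(N-1) for the valid clock period.

def valid_clock_intervals(d_max, d_min, max_waves):
    """Return {N: (low, high)} of the nonempty valid clock-period intervals."""
    intervals = {}
    for n in range(1, max_waves + 1):
        low = d_max / n
        high = float('inf') if n == 1 else d_min / (n - 1)  # N = 1: only bounded below
        if low <= high:
            intervals[n] = (low, high)
    return intervals

# Example with D_max = 12 and D_min = 10 (arbitrary units): the intervals
# [12, inf), [6, 10], [4, 5], [3, 10/3] are pairwise disjoint.
iv = valid_clock_intervals(12.0, 10.0, max_waves=4)
```

Since D_min < D_max here, each interval's upper end (D_min/N for degree N+1) lies strictly below the next interval's lower end (D_max/N for degree N), so a small change in clock period can make the circuit fall between valid regions.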
D. Timed Boolean Function Approach

An alternative approach to the clocking analysis of wave-pipelined
systems, based on a timed Boolean function representation,
was proposed by Lam et al. [26]. A timed Boolean
function (TBF) is a Boolean function whose variables are
functions of time. Consider a TBF for an output y of the circuit
in terms of its input variables x_1, …, x_n:

y(t) = f(x_1(t − d_1), …, x_n(t − d_n))   (10)

where d_i is the logic delay from input x_i to output y. The
value of this function at time t = k T_c, normalized with respect
to T_c and sampled with the clock period T_c, is

y(k T_c) = f(x_1((k − m_1) T_c), …, x_n((k − m_n) T_c))   (11)

where m_i is the number of clock cycles needed to
propagate a signal from input x_i to output y. Now the condition
for valid computation that satisfies the internal node constraints
can be expressed as follows [26]: the computation is valid if and
only if all variables x_i in the support of f have the
same time argument (k − N) T_c, i.e., if m_i = N for all i.
In this case the latency of the circuit, measured in the number
of clock cycles, is the same as the degree of wave-pipelining,
N. Combining the above constraint with (9) results in the
following theorem for a valid clock period [26]:

The clock period T_c in a circuit is valid if it satisfies (9)
for some positive integer N, and the function computed
at t = k T_c evaluates to f(x_1((k − N) T_c), …, x_n((k − N) T_c)).

The TBF approach provides a convenient way to express the
condition for safe signal separation both at the internal nodes
of the circuit and at the output register. The internal node
constraints are simply accounted for by adding the requirement
that all variables in the support of the logic function have the
same discrete argument (k − N). Lam et al. [26] describe a
procedure to compute all valid clock intervals for a circuit with
given delay parameters and tolerances. The reader is referred
to their paper for details.
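The essence of the TBF validity condition can be sketched in a few lines. This is our simplification, not the procedure of Lam et al.: we assume each input-to-output path delay d_i maps to m_i = ceil(d_i / T_c) clock cycles, a coarse stand-in for the paper's normalization with tolerances.

```python
# Sketch (ours): a simplified TBF-style validity check.  Computation is valid
# only when every variable in the support shares the same discrete time
# argument, i.e., all paths take the same number of clock cycles.

import math

def cycles_per_path(delays, t_c):
    """Number of clock cycles m_i needed by each input-to-output path."""
    return [math.ceil(d / t_c) for d in delays]

def computation_is_valid(delays, t_c):
    """Valid iff all m_i coincide; their common value is the degree N."""
    return len(set(cycles_per_path(delays, t_c))) == 1

# Paths of 7.2, 7.9, and 8.5 ns: valid at T_c = 3 ns (all take three cycles),
# invalid at T_c = 4 ns (the paths straddle a clock boundary).
ok = computation_is_valid([7.2, 7.9, 8.5], 3.0)
bad = computation_is_valid([7.2, 7.9, 8.5], 4.0)
```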
III. SOURCES OF DELAY VARIATIONS
As discussed in Section II, critical speed-limiting factors in
wave-pipelining are the uncontrolled clock-skew, the sampling
time of registers, and the worst case transition time at the
logic outputs. While the minimization of these factors has
been a major challenge in the design of conventional high-speed pipelined systems as well, the equalization of path delays
comes as a new challenge for the design of wave-pipelined
systems. While in theory the path-delay equalization problem
has been solved, the real challenge is to accomplish it in the
presence of a variety of static and dynamic delay tolerances,
some of which are listed below.
1) Gate-Delay Data-Dependence: Gate-delay independence of the input pattern, i.e., constant gate delay,
is not guaranteed in general. It depends on the particular
technology and the structure of logic gates. Some input
patterns may cause significant delay variations.
2) Coupling Capacitance Effects: The effective capacitance
seen by a gate depends on the capacitive coupling
between adjacent wires. This may introduce significant
changes in the gate delay, especially in advanced multilevel metal processes.
3) Power-Supply Induced Noise: Power supply noise is attributed mainly to IR drops, capacitive coupling between
the interconnection wires and the power supply lines,
and the inductance of bonding wires, package trace, and
pins. Noise in the power-supply voltage can cause accumulative delay dispersions as waves propagate through
several logic layers.
4) Process Parameter Variations: Variations of process
parameters during manufacture may result in substantial
gate delay variations. Equivalent circuits from different
fabrication runs may have different propagation delays.
This delay dispersion must be accounted for to establish
the minimum separation between waves.
5) Temperature-Induced Delay Changes: Some process
parameters, such as carrier surface mobility of
metal–oxide–semiconductor (MOS) transistors, are
highly thermally sensitive. Changes in the operational
temperature result in changes in the gate-delay.
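A simple corner-based estimate shows how the variation sources above erode the wave-separation margin. The sketch below is ours: it lumps all five sources into a single ± fractional delay spread applied to otherwise-equalized path delays, and "overhead" stands for the clocking overhead (setup, hold, and skews).

```python
# Sketch (ours): corner-based margin check for Cotten's condition under a
# lumped +/- delay spread modeling the variation sources listed above.

def worst_case_spread(paths, spread):
    """Largest possible t_max - t_min when every delay may vary by +/- spread."""
    return max(paths) * (1 + spread) - min(paths) * (1 - spread)

def separation_margin(paths, t_c, overhead, spread):
    """Margin of Cotten's condition T_c >= (t_max - t_min) + overhead."""
    return t_c - worst_case_spread(paths, spread) - overhead

# Nominally well-balanced paths (9.5-10 ns): a 2% corner still leaves margin,
# but a 10% corner makes the same clock period infeasible.
tight = separation_margin([9.5, 9.8, 10.0], t_c=2.0, overhead=1.0, spread=0.02)
loose = separation_margin([9.5, 9.8, 10.0], t_c=2.0, overhead=1.0, spread=0.10)
```

The point of the exercise: a delay spread that is harmless in a conventional pipeline (which only cares about the slow corner) can invalidate a wave-pipelined clock period, because both the fast and the slow corners enter the constraint.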
A. Data Dependencies
One of the major obstacles initially found in static complementary metal–oxide–semiconductor (CMOS) logic is the strong
dependence of the gate-delay on the input data. Consider,
for example, a static two-input CMOS NAND gate. Depending
on whether one or both of the parallel PMOS devices are
switching on, the delay of the gate can vary by a factor of two.
For gates with several inputs, this factor is even larger. Clearly,
this feature is undesired for wave-pipelining; constant logic
gate delay is a requirement for the equalization of path-delays.
To overcome this problem the use of biased CMOS gates, also
known as pseudo-NMOS gates, has been proposed. In those
devices parallel pullup devices are replaced by a single device
whose gate is connected to a bias voltage [28]. While this
approach reduces the delay dependence problem and makes
CMOS better suited for wave-pipelining, it is achieved at the
expense of increased power dissipation.
B. Process and Environmental Delay Variations
Changes in temperature, supply levels, and process parameters can have a substantial effect on the delay of a CMOS
circuit. How such changes affect the operation of wavepipelining is discussed below. Although the following analysis
is for temperature, it applies to process and power supply
changes as well.
As shown in Section II-C, the clock-period T_c and the
number of waves N in a wave-pipelined circuit are related by
(9), which can be rearranged as

D_max / T_c ≤ N ≤ 1 + D_min / T_c   (12)

where D_max and D_min are specified at nominal temperature.
We assume that the temperature across the die area is
uniform and that every path delay in the circuit is affected in
the same way by changes in temperature. If, for a temperature
above the nominal, D_max and D_min are increased by a factor
k_h ≥ 1, constraint (12) becomes

k_h D_max / T_c ≤ N ≤ 1 + k_h D_min / T_c.   (13)

If, for a temperature below the nominal, D_max and D_min
are reduced by a factor k_l ≤ 1, we have

k_l D_max / T_c ≤ N ≤ 1 + k_l D_min / T_c.   (14)

In order to find a valid clock-period that satisfies constraints
(12), (13), and (14), it is required that

k_h D_max / T_c ≤ N ≤ 1 + k_l D_min / T_c   (15)

or, equivalently,

T_c ≥ k_h D_max − k_l D_min.   (16)

Equation (16) can be interpreted as follows. The clock period
is limited by the difference between D_max at the maximum
temperature and D_min at the minimum temperature.
This is the minimum clock period that assures
correct operation in the entire range between the two
temperatures. (Notice that the use of a fixed zero constructive
clock skew still results in invalid intervals for T_c.)

In conclusion, in addition to the limit imposed by the maximum
difference T_max − T_min, the process- and environment-induced
delay changes impose tighter constraints on the minimum
clock-period (or the maximum number of waves) for
wave-pipelining. Notice that even in the ideal case of
D_max = D_min = D (i.e., under the perfect delay equalization and zero
clocking overhead assumptions), the clock-period is still
limited by T_c ≥ (k_h − k_l)D. While in conventional
pipelining the minimum clock-period can be specified based
on the worst case delay accounting for temperature, voltage,
and process, in wave-pipelining the best case delay must
also be considered. The consequence is a more substantial
performance degradation for wave-pipelined designs.
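The temperature-range bound of (16) is easy to evaluate. The sketch below is ours; k_hot ≥ 1 is the delay scaling at the hottest corner and k_cold ≤ 1 at the coldest, with D_max and D_min at nominal temperature as in Section II-C.

```python
# Sketch (ours): the temperature-range bound T_c >= k_h*D_max - k_l*D_min.
# Even perfectly equalized logic (D_max == D_min == D) remains limited by
# (k_h - k_l) * D.

def min_period_over_temperature(d_max, d_min, k_hot, k_cold):
    """Minimum clock period valid across the whole temperature range."""
    return k_hot * d_max - k_cold * d_min

ideal = min_period_over_temperature(10.0, 10.0, 1.0, 1.0)   # no variation
ranged = min_period_over_temperature(10.0, 10.0, 1.2, 0.9)  # (1.2 - 0.9) * 10
```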
IV. CAD TOOLS FOR WAVE-PIPELINING
Equations (4) and (9) derived in Section II suggest that the
minimum value of clock period as well as the range of valid
clock interval can be adjusted by modifying certain parameters
of the system. The following parameters can be controlled by
the designer: the maximum and minimum delays T_max and T_min
of the combinational logic block, and the constructive clock
skew, Δ, between the output and input registers. Two techniques
are generally adopted to control these parameters. One,
referred to as logic balancing, attempts to minimize the
difference T_max − T_min. It involves logic restructuring, delay
buffer insertion along short paths, and device sizing. The other
buffer insertion along short paths, and device sizing. The other
technique, clock buffer insertion, aims at adjusting the value
of constructive clock skew by inserting precisely controlled
delays along the clock paths.
This section briefly reviews some of the most popular techniques and practical CAD tools for the analysis, optimization,
and synthesis of wave-pipeline systems.
A. Timing Analysis and Optimization
In addition to the theoretical work on modeling and analysis
of wave-pipelining presented in Section II, a number of timing
analysis and verification tools for synchronous pipelined and
wave-pipelined systems have been developed. A notable example of such a tool is a pipeline scheduler, pipe , developed
by Sakallah et al. [36]. This work provides a unified formalism
for describing the timing in pipelines, accounts for multiphase
synchronous clocking, and handles both short and long path
delays. The tool generates correct minimum cycle-time clock
schedules and signal waveforms from a multiphase pipeline
specification. However, wave-pipelining and clock skews are
not taken into account explicitly.
A number of timing optimization tools were developed to
minimize clock period by adjusting constructive clock skews.
These techniques are based on the LP approach proposed by
Fishburn [9] and Joy et al. [16], mentioned in Section II-C.
B. Synthesis Tools
Most of the work in synthesis for wave-pipelining explores
the idea of minimizing the difference in logic path delays
by means of logic balancing. The following approaches are
typically used to achieve logic balancing: logic restructuring,
insertion of delay buffers or latches, device sizing, and controlled placement and routing of both the circuit components
and clock distribution trees.
Wong et al. [41], [43] developed a method and a CAD
tool to minimize path-delay variance in ECL circuits by
inserting delay buffers followed by device fine-tuning. The
delay of each gate is fine-tuned by controlling its tail current,
a feature unique to ECL circuits. This method was tested on
circuits with regular structures, such as adders and multipliers,
achieving a factor of 2.5 increase in throughput at the cost of
a 10 to 50% increase in area [42]. Shenoy et al. [37] expressed
the logic balancing problem as an optimization procedure
with additional short path constraints. This technique was
implemented as an experimental CAD tool that interfaces with
the SIS logic synthesis system [38].
Both of these approaches aim at obtaining the maximum
rate pipelining with the use of a minimum number of delay
buffers. However, for multiple-fan-out gates the delay buffers
are inserted individually for each fan-out. As a result a separate
step is typically required to merge the common buffers for area
recovery. While this is a simple task for ECL logic (the delay
of an ECL gate does not depend on the fan-out), the fan-out
load of CMOS gates significantly affects the delay, and its
impact on timing cannot be ignored. For this reason, Kim et al.
[21] developed a delay buffer insertion method using chains
of buffers, thus eliminating the need for merging common
buffers. The remainder of this section briefly describes logic
balancing for CMOS circuits based on the logic restructuring
and buffer insertion of Kim et al. [21]. The algorithms have been
implemented in the SIS environment [38].

Fig. 3. Node collapsing and decomposition.
1) Logic Restructuring: The goal of logic restructuring is
to equalize signal arrival times at the inputs of each gate.
To facilitate delay estimation during logic restructuring, the
circuit is first decomposed into a “canonical” form composed
of two-input gates and inverters. This is then followed by
selective node collapsing and recursive decomposition. Node
collapsing is a standard logic transformation technique which
combines several nodes of a logic network into a single
node. It facilitates the subsequent transformations, such as
recursive decomposition, which will transform the function
into a network with a desired timing property. This process
is illustrated in Fig. 3(a) and (b), where a node with a delay
of three units is collapsed into its fan-in nodes, creating a
single node with four inputs. The subsequent decomposition
of the node creates a more balanced structure.
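The balancing effect of decomposition can be seen with a toy model. The sketch below is ours and is not the SIS kernel-division procedure: it rebuilds a collapsed node as a tree of two-input gates by always pairing the two earliest-arriving signals first, with hypothetical arrival times and a unit gate delay.

```python
# Illustrative sketch (ours): balanced vs. chain decomposition of a
# collapsed multi-input node into two-input gates.

import heapq

def balanced_tree_arrival(arrivals, gate_delay=1.0):
    """Latest output arrival when the earliest two signals are always paired."""
    heap = list(arrivals)
    heapq.heapify(heap)
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        heapq.heappush(heap, max(a, b) + gate_delay)  # arrival at the new gate's output
    return heap[0]

def chain_arrival(arrivals, gate_delay=1.0):
    """Latest output arrival for a naive linear (chain) decomposition."""
    t = arrivals[0]
    for a in arrivals[1:]:
        t = max(t, a) + gate_delay
    return t

# Eight simultaneous inputs: a balanced tree needs 3 gate delays, a chain 7.
tree = balanced_tree_arrival([0.0] * 8)
chain = chain_arrival([0.0] * 8)
```

The balanced tree both shortens the longest path and narrows the spread of input arrival times at each gate, which is the property wave-pipelining needs.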
The decomposition of the collapsed nodes is accomplished
using the kernel division technique employed in SIS [38]. The
goal is to find a decomposition that minimizes the difference
between the latest arrival times at the inputs to the collapsed node.
This is accomplished by computing, for a given expression f,
all its multiple-cube subexpressions (kernels) and selecting the
one which leads to the most balanced structure. By estimating
the delays of a kernel k, the resulting quotient q, and
the remainder r, the kernel which gives the best balancing
of the expression is selected. The kernel-based decomposition
is applied recursively to the nodes k, q, and r until no further
improvement is possible. Fig. 3(c) and (d) shows the result of
such a decomposition into two-input gates. The details of the
procedure can be found in [20].

Fig. 4. Example of delay buffer insertion.
2) Delay Buffer Insertion: Once the circuit is restructured,
additional balancing can be achieved by means of buffer
insertion. Delay buffers are inserted in appropriate places so
as to further minimize the difference in signal arrival times at
the gate inputs.
The underlying requirement imposed on buffer insertion is
that it must not affect the latency of the circuit. For this reason
the buffers are inserted only along fast paths, without affecting
the slow ones. The amount of the delay to be inserted at an
input to a gate is chosen so that the arrival time at that input
matches the arrival time of the slowest input, but does not
exceed it. As a result, the latest arrival time at the output of
the gate never increases as a result of buffer insertion. This
approach allows the buffer insertion problem to be localized to
that of inserting a single chain of buffers at the output of each
gate, wherever needed. The desired delays at the fan-outs of
the gate are then obtained by providing taps at different stages
of the buffer chain. It is assumed that only one type of buffer,
with a fixed delay value, is available.
Fig. 4 illustrates the process of buffer insertion for three
gates fanning out of a common gate. For simplicity it is
assumed that each buffer contributes 1 unit of delay, and the
load at each fan-out contributes 0.2 units of delay. The numbers
given at the inputs to the gates represent the latest signal
arrival times. First, a chain of two buffers is created to achieve
an input delay of 8.0 units at the first fan-out gate. The input
to the second fan-out gate is then tapped off the end of the
chain to match the arrival time (8.0) of its other input, while
the input to the third is tapped off the first stage of the chain,
which provides a delay of 6.6. Notice that the arrival time
at the end of the chain (8.0) is independent of how the inputs
to the fan-out gates are connected to the chain. The details of the
procedure, including the proper ordering of the gates for buffer
chain construction, can be found in [21].
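The tap-selection step of such a shared buffer chain can be sketched as follows. This is our illustration in the spirit of Kim et al. [21], not their algorithm: each fan-out needs some padding (slack) so that its arrival time approaches, but never exceeds, that of the slowest input of the fan-out gate; with a fixed buffer delay the deepest safe tap is found by integer division, and fan-out loading is ignored for simplicity.

```python
# Sketch (ours): tap selection on a single buffer chain at a gate's output.

def tap_stages(slack_per_fanout, buffer_delay=1.0):
    """Deepest chain stage for each fan-out that does not overshoot its slack."""
    return [int(s // buffer_delay) for s in slack_per_fanout]

# Fan-outs needing 2.0, 1.6, and 0.4 units of padding share one chain:
# taps at stages 2, 1, and 0 (direct connection), so two buffers suffice.
taps = tap_stages([2.0, 1.6, 0.4])
chain_length = max(taps)
```

Sharing one chain with taps, rather than inserting buffers per fan-out, is what removes the separate buffer-merging step mentioned above.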
C. Synthesis Examples
To validate the logic synthesis approach described in the
previous sections a number of combinational circuits were
(a)
(b)
Fig. 5. (4,2) compressor circuit.
synthesized and tested. These included both regular arithmetic
circuits and a set of random logic circuits from the MCNC
benchmark set. This section presents two examples of circuits
synthesized with those tools.
1) (4,2) Compressor: Fig. 5(a) shows a manual design of
a (4,2) compressor circuit, used as a basic multiplier cell in
[24]. Fig. 5(b) has a synthesized version of the same function,
obtained with logic balancing tools described in Section IV-B.
A fair comparison of the two circuits is difficult to make
since each was designed with a different goal in mind, used
different processing technology, and its performance was
measured differently. Circuit (a) is part of a 16 × 16 multiplier,
which was designed, balanced and tuned manually in order
to obtain the fastest running circuit; it was fabricated using
commercial 1-μm CMOS technology; and its path delays were
measured by exhaustive HSPICE simulation. The maximum
and minimum reported delays were 1.46 and 1.19 ns, respectively, resulting in a maximum delay variation of less than
20%. Circuit (b) was automatically synthesized to allow for
degree of wave-pipelining equal to three; it was targeted for
a generic 2-μm MOSIS cell library; finally, its delays were
computed using a simple unit fan-out delay model. The circuit
has a delay of 2.18 ns and a latency of 6.6 ns (three waves).
Notice its balanced logic structure and the inserted buffers.
A simple-minded comparison of circuit areas based on gate
count, and of latencies, based on unit fan-out delay, shows
that synthesis tools can be used effectively to produce designs
comparable with the manually tuned circuits.
2) Random Logic Example: The random logic circuits are
characterized by a very irregular structure, and as such are
not well suited for logic balancing. The example shown below
(circuit b9, taken from the MCNC benchmark set) demonstrates that even for those circuits logic restructuring followed
by buffer insertion can result in a significant improvement in
circuit performance by means of wave-pipelining.
The circuit was automatically synthesized using the logic
balancing procedures described in Section IV-B. Using basic
two-input gates and inverters, and two types of delay buffers
from the MSU standard cell library, gate delay parameters
were constructed for the SIS library and the MOSIS 2-µm
p-well process. The circuit was synthesized to allow for three
waves and laid out using a standard cell design methodology.
The simulated minimum clock period of the circuit
was 2.78 ns, with a latency of 7.37 ns.
The latency of the circuit extracted from the layout was
11.39 ns with the degree of wave-pipelining equal to two.
These differences were due to unaccounted-for physical
effects of placement and routing. Since the layout synthesis
was performed without any timing optimization, it is
reasonable to expect that a physical circuit satisfying the
target constraints could be obtained with more sophisticated
timing-driven layout design tools.
Fig. 6 shows CAzM [7] simulation results for circuit b9,
with inputs applied (a) every 20 ns and (b) every
6 ns. Notice how closely the two waveforms match when
scaled to account for the different clock periods.
V. REAL DESIGNS WITH WAVE-PIPELINING
Many attempts have been made at using wave-pipelining
since Cotten proposed the technique in 1969 [6]. Prior to
1990, most wave-pipelined designs targeted
bipolar ECL technology. These included a floating point unit
[1], an experimental computer [35], and a population counter
[42]. Since then, several university research groups have
demonstrated the feasibility of CMOS implementations. In
addition, industry has begun to apply wave-pipelining to RAM
designs in both BiCMOS and CMOS technology. Table I
lists selected publications in wave-pipelining. The research
spans theoretical formulation, CAD tools research, and design
projects that include dedicated processors and static/dynamic
RAM’s.
A. Dedicated Processors
The main hurdle in designing a wave-pipelined system is
the two-sided constraint on its timing. In addition, designers
must address the various sources of delay imbalance in the
design at the gate level, the circuit level, and the detailed
layout level. The delay imbalance must also be minimized
under environmental variations such as voltage and temperature
fluctuations. Care must also be taken in the design of
an on-chip test structure so that high-speed testing can be
done without requiring high-speed input–output.
This unique feature is employed by most of the university
prototypes presented in this section. Researchers have devised
various circuits to overcome the imbalances in successful
implementations of wave-pipelined systems. Examples include
wave-domino gates, biased-CMOS gates, static CMOS gates,
and multiplexer-based gates.
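The two-sided constraint mentioned above can be sketched as a simple feasibility check (an illustrative simplification with a single global clock and zero skew; the full formulation in [12] also models skew and data-dependent delay): the slowest path of a wave must settle before its sampling edge N cycles after launch, while the fastest path of the following wave must not arrive before the hold time of that same edge.

```python
def valid_wave_clock(t_clk, d_max, d_min, n_waves, t_setup=0.0, t_hold=0.0):
    """Check whether clock period t_clk supports n_waves waves in flight.

    Simplified two-sided wave-pipelining constraints (zero skew assumed):
    a wave launched at time 0 is sampled at n_waves * t_clk.
    """
    # Setup side: the longest path must settle before the sampling edge.
    setup_ok = d_max + t_setup <= n_waves * t_clk
    # Hold side: the next wave, launched one period later, must not reach
    # the output before the hold window of the sampling edge closes.
    hold_ok = (n_waves - 1) * t_clk + t_hold <= d_min
    return setup_ok and hold_ok

# Hypothetical numbers in the spirit of the b9 example (2.78-ns clock,
# three waves): feasible only if the shortest path is long enough.
print(valid_wave_clock(2.78, 7.37, 6.0, 3))
```

The hold-side inequality is what makes wave-pipelining a min-delay as well as a max-delay problem: raising d_min through delay balancing and buffer insertion directly enlarges the set of valid clock periods.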
TABLE I
ACTIVITIES RELATED TO WAVE-PIPELINING
In order to demonstrate the high-speed capabilities of
the wave-pipelining technique using a conservative CMOS
process, most university prototypes share some common
aspects at the architecture, circuit, and layout levels. First, they
are large circuits, in which the effects of circuit design as
well as the properties of the interconnect affect circuit
performance. Second, they can be implemented with regular
structures, so the wave-pipelining technique can be more
easily applied. Third, a major portion of a prototype, such as
the carry tree, usually has a circuit structure in which each cell
has regular fan-out; the cell output capacitances are therefore
dominated mostly by the lengths of the interconnection wires
and the next input gates, which can be predicted thanks to
the circuit regularity. Last, due to a lack of commercial
tools directly applicable to wave-pipelined designs, each
group has more or less developed in-house design analysis
and optimization tools which enable VLSI design using
wave-pipelining.
Wong et al. at Stanford University, Stanford, CA, have
designed a 63-bit population counter in ECL technology [42].
The counter achieves 2.5 waves and is the first bipolar LSI
chip to use wave-pipelining. Liu et al. at North Carolina
State University have reported architecture, circuit, layout,
and testing techniques for achieving high speedup in a CMOS
parallel adder using wave-pipelining [28]. The adder achieves
nine waves at 250 MHz and is implemented with biased-CMOS
gates in 2-µm CMOS technology. Klass et al. at Stanford
University have designed and tested a 16 × 16 wave-pipelined
multiplier, implemented in static 1.0-µm CMOS
technology. It achieves 3.7 waves at clock rates between
330 and 350 MHz despite the substantial effects of data-dependent
delay [24]. Circuit-level tools were important for
this design. Using 2-µm CMOS domino gates, Lien and
Burleson at the University of Massachusetts have designed
a 4 × 4 Wallace tree multiplier that achieves two waves
in the multiplier circuit [27]. Nowka and Flynn of Stanford
University have designed a CMOS VLSI vector unit which
includes a vector register file, an adder, and a multiplier
[33]. The vector unit is implemented in 1-µm CMOS
technology and simulated at 300 MHz. In this design, an
adaptive supply method is used to counteract the effects of
process variation.
B. Dynamic and Static RAM
As the speed of microprocessors increases, so does the
performance gap between DRAM and the microprocessor.
Consequently, high-speed S/DRAM technology is necessary to
boost overall system performance. Wave-pipelining offers
an efficient way to reduce access and cycle times without
incorporating the additional registers required by conventional
pipelining, and without the associated clocking costs. Along
with the reduced access time, wave-pipelining provides the
additional advantages of reduced latency and power dissipation.
Several successful RAM designs using wave-pipelining have been
reported [4], [25], [2], [31], [44], [15].
It is fair to say that the design of wave-pipelined S/DRAM's
has become a trend in the industry. Chappell et al. [4] designed
a “bubble pipelined” RAM chip with a 2-ns cycle time. Wave-pipelining
is used in the cache RAM access of the HP
Snake workstation [25]. NEC has designed a 220-MHz, 16-Mbit
BiCMOS wave-pipelined SRAM [31]. Hitachi [15] has
designed a 300-MHz, 4-Mbit wave-pipelined CMOS SRAM.
Moreover, Hitachi proposed a dual-sensing-latch circuit to
achieve a short cycle time of 2.6 ns [40]. Hyundai [44] has
designed a 150-MHz, 8-bank, 256-Mbit synchronous DRAM.
All of these designs can sustain three to four waves.
VI. CONCLUSIONS AND OPEN PROBLEMS
Despite recent advancement in wave-pipelining research
and other aggressive timing approaches, a number of open
problems still exist. Solutions to these problems may have
broader application than just wave-pipelining.
1) Testing: Wave-pipelining presents additional challenges
in delay testing due to the difficulty of observing internal
points in a circuit which looks like a combinational
circuit but behaves like a sequential one. Testing for the
interaction of long and short paths and for sequential
false paths warrants further study.
2) Low Power: Despite reduced clock loading, wave-pipelining
does not appear to be a low-power technique, since
delay variations tend to worsen at low supply voltages.
However, alternative low-power methods, such
as dynamic power-supply adjustment and power-down
techniques, could favor wave-pipelining.
3) High-Level Synthesis: High-level synthesis tools have
only recently been able to exploit internally pipelined
modules. Wave-pipelined modules can offer such tools an
additional degree of freedom; however, methods for modeling
and verifying the timing properties unique to wave-pipelining
at higher levels still need to be developed.
4) Dynamic Power Supply and Clock Tuning: One approach
to process and environmental variation is the dynamic
tuning of both power supplies and clock rates. This has
been explored in [33], but is still far from being a widely
used technique.
5) Physical Design Issues: Deep submicron technologies
will continue to complicate abstractions of physical
design. As interconnect delays begin to dominate gate
delays, timing optimization will have to be done at the
physical level.
6) Parameter Variation: Advanced technologies tend to prioritize
density and speed at the cost of wider parameter
variation. This will make clock distribution and design
centering more of a challenge and could severely limit
a designer's choice of timing schemes. One solution is
the use of asynchronous circuit techniques. However,
as ever-increasing density and speed outstrip designers'
ability to exploit them, it may be that a truly “advanced”
technology of the future will instead prioritize low parameter
variation, much like the analog processes of today.
REFERENCES
[1] S. Anderson, J. Earle, R. Goldschmidt, and D. Powers, “The IBM System/360 Model 91 floating point execution unit,” IBM J. Res. Develop., Jan. 1967.
[2] P. Bannon and A. Jan, “Today’s microprocessors aim for flexibility,” EE Times, pp. 68–70, 1994.
[3] W. Burleson, E. Tan, and C. Lee, “A 150 MHz wave-pipelined adaptive digital filter in 2-µm CMOS,” in VLSI Signal Processing Workshop, 1994.
[4] T. I. Chappell, B. A. Chappell, S. E. Schuster, J. W. Allan, S. P. Klepner, R. V. Joshi, and R. L. Franch, “A 2-ns cycle, 3.8-ns access 512-kb CMOS ECL SRAM with a fully pipelined architecture,” IEEE J. Solid-State Circuits, pp. 1577–1585, Nov. 1991.
[5] M. Cooperman and P. Andrade, “CMOS gigabit-per-second switching,” IEEE J. Solid-State Circuits, June 1993.
[6] L. Cotten, “Maximum rate pipelined systems,” in Proc. AFIPS Spring Joint Comput. Conf., 1969.
[7] W. M. Coughran, Jr., E. Grosse, and D. J. Rose, “CAzM: A circuit analyzer with macromodeling,” IEEE Trans. Electron Devices, vol. ED-30, pp. 1207–1213, Apr. 1983.
[8] B. Ekroot, “Optimization of pipelined processors by insertion of combinational logic delay,” Ph.D. dissertation, Stanford Univ., Stanford, CA, 1987.
[9] J. Fishburn, “Clock skew optimization,” IEEE Trans. Comput., 1990.
[10] D. Ghosh and S. Nandy, “Design and realization of a high-performance wave-pipelined 8 × 8-b multiplier in CMOS technology,” IEEE Trans. VLSI Syst., vol. 3, pp. 37–48, 1995.
[11] C. T. Gray, T. Hughes, S. Arora, W. Liu, and R. Cavin III, “Theoretical and practical issues in CMOS wave pipelining,” in Proc. VLSI’91, 1991.
[12] C. T. Gray, W. Liu, and R. Cavin III, “Timing constraints for wave-pipelined systems,” IEEE Trans. Computer-Aided Design, vol. 13, pp. 987–1004, Aug. 1994.
[13] C. Gray, W. Liu, W. van Noije, T. Hughes, and R. Cavin, “A sampling technique and its CMOS implementation with 1 Gb/s bandwidth and 25 ps resolution,” IEEE J. Solid-State Circuits, pp. 340–349, Mar. 1994.
[14] S. Gupta, private communication, Nov. 1994.
[15] K. Ishibashi et al., “A 300 MHz 4-Mb wave-pipelined CMOS SRAM using a multi-phase PLL,” in Proc. ISSCC’95, 1995, pp. 308–309.
[16] D. A. Joy and M. J. Ciesielski, “Placement for clock period minimization with multiple wave propagation,” in Proc. 28th Design Automation Conf., 1991.
[17] D. A. Joy and M. J. Ciesielski, “Clock period minimization with wave pipelining,” IEEE Trans. Computer-Aided Design, Apr. 1993.
[18] S. T. Ju and C. W. Jen, “A high speed multiplier design using wave pipelining technique,” in Proc. IEEE APCCAS, Australia, 1992, pp. 502–506.
[19] J. Kang, W. Liu, and R. Cavin, “A monolithic 625-Mb/s data recovery circuit in 1.2-µm CMOS,” in Proc. Custom Integrated Circuits Conf., 1993, pp. 625–628.
[20] T.-S. Kim, W. Burleson, and M. Ciesielski, “Logic restructuring for wave-pipelined circuits,” in Proc. Int. Workshop Logic Synthesis, 1993.
[21] T.-S. Kim, W. Burleson, and M. Ciesielski, “Delay buffer insertion for wave-pipelined circuits,” in Notes of IFIP Int. Workshop Logic Architecture Synthesis, Grenoble, France, Dec. 1993.
[22] F. Klass, M. J. Flynn, and A. J. van de Goor, “Pushing the limits of CMOS technology: A wave-pipelined multiplier,” in Hot Chips Symp., 1992.
[23] F. Klass and M. J. Flynn, “Comparative studies of pipelined circuits,” Stanford Univ., Stanford, CA, Tech. Rep. CSL-TR-93-579, July 1993.
[24] F. Klass, M. J. Flynn, and A. J. van de Goor, “Fast multiplication in VLSI using wave-pipelining,” J. VLSI Signal Processing, 1994.
[25] C. Kohlhardt, “PA-RISC processor for ‘Snake’ workstations,” in Hot Chips Symp., 1991, pp. 1.20–1.31.
[26] W. K. C. Lam, R. K. Brayton, and A. Sangiovanni-Vincentelli, “Valid clocking in wave-pipelined circuits,” in Proc. Int. Conf. Computer-Aided Design, 1992.
[27] W.-H. Lien and W. Burleson, “Wave domino logic: Theory and applications,” in Proc. Int. Symp. Circuits Syst., 1992.
[28] W. Liu, C. Gray, D. Fan, T. Hughes, W. Farlow, and R. Cavin, “A 250-MHz wave pipelined adder in 2-µm CMOS,” IEEE J. Solid-State Circuits, pp. 1117–1128, Sept. 1994.
[29] B. Lockyear and C. Ebeling, “The practical application of retiming to the design of high-performance systems,” in Proc. ICCAD’93, 1993, pp. 288–295.
[30] G. Moyer, M. Clements, W. Liu, T. Schaffer, and R. Cavin, “A technique for high-speed, fine-resolution pattern generation and its CMOS implementation,” in Proc. 16th Conf. Advanced Res. VLSI, 1995, pp. 131–145.
[31] K. Nakamura et al., “A 220-MHz pipelined 16-Mb BiCMOS SRAM with PLL proportional self-timing generator,” IEEE J. Solid-State Circuits, pp. 1317–1322, Nov. 1994.
[32] V. Nguyen, W. Liu, C. T. Gray, and R. K. Cavin, “A CMOS multiplier using wave pipelining,” in Proc. Custom Integrated Circuits Conf., San Diego, CA, May 1993, pp. 12.3.1–12.3.4.
[33] K. Nowka and M. Flynn, “System design using wave-pipelining: A CMOS VLSI vector unit,” in Proc. ISCAS’95, 1995, pp. 2301–2304.
[34] J. Pratt and V. Heuring, “Delay synchronization in time-of-flight optical systems,” Appl. Opt., vol. 31, no. 14, pp. 2430–2437, 1992.
[35] L. Qi and X. Peisu, “The design and implementation of a very fast experimental pipelining computer,” J. Comput. Sci. Technol., vol. 2, no. 1, pp. 1–6, 1988.
[36] K. A. Sakallah, T. N. Mudge, T. M. Burks, and E. S. Davidson, “Synchronization of pipelines,” IEEE Trans. Computer-Aided Design, vol. 12, 1993.
[37] N. V. Shenoy, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, “Minimum padding to satisfy short path constraints,” in Proc. Int. Workshop Logic Synthesis, 1993.
[38] E. M. Sentovich, K. J. Singh, C. Moon, H. Savoj, and R. K. Brayton, “Sequential circuit design using synthesis and optimization,” in Proc. Int. Conf. Comput. Design, 1992.
[39] H. Shin, J. Warnock, M. Immediato, K. Chin, C.-T. Chuang, M. Cribb, D. Heidel, Y.-C. Sun, N. Mazzeo, and S. Brodskyi, “A 5 Gb/s 16 × 16 Si-bipolar crosspoint switch,” in Proc. IEEE Int. Solid-State Circuits Conf., Feb. 1992, pp. 228–229.
[40] S. Tachibana et al., “A 2.6-ns wave-pipelined CMOS SRAM with dual-sensing-latch circuits,” IEEE J. Solid-State Circuits, pp. 487–490, Apr. 1995.
[41] D. Wong, G. De Micheli, and M. Flynn, “Inserting active delay elements to achieve wave pipelining,” in Proc. Int. Conf. Computer-Aided Design, Nov. 1989, pp. 270–273.
[42] D. Wong, G. De Micheli, M. Flynn, and R. Huston, “A bipolar population counter using wave pipelining to achieve 2.5× normal clock frequency,” IEEE J. Solid-State Circuits, vol. 27, May 1992.
[43] D. Wong, G. De Micheli, and M. Flynn, “Designing high-performance digital circuits using wave pipelining: Algorithms and practical experiences,” IEEE Trans. Computer-Aided Design, vol. 12, Jan. 1993.
[44] H. Yoo et al., “A 150 MHz 8-bank 256-Mb synchronous DRAM with wave pipelining methods,” in Proc. ISSCC’95, 1995, pp. 250–251.
Wayne P. Burleson (S’87–M’94) received the B.S.
and M.S. degrees in electrical engineering from
the Massachusetts Institute of Technology (M.I.T.),
Cambridge, in 1983. He received the Ph.D. degree
in electrical engineering from the University of
Colorado, Boulder, in 1989.
From 1983 to 1986, he worked at VLSI Technology, Inc., as a Custom Chip Designer of CMOS
VLSI for DSP (fax modems). He is currently an
Associate Professor of Electrical and Computer Engineering at the University of Massachusetts at
Amherst. He directs research in VLSI signal processing, in particular, VLSI
methods and CAD tools for low power and high speed. The research is being
applied in chips and systems for real-time robotics, wireless LAN’s, and
remote sensing.
Dr. Burleson was a Chair of the 1998 VLSI Signal Processing Workshop.
He has been Guest Editor of two special issues of IEEE journals, has been
an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II, and
is on the Editorial Board of the Journal of VLSI Signal Processing.
Maciej Ciesielski (M’85–SM’95) received the M.S.
degree in electrical engineering from Warsaw Technical University, Poland, in 1974, and the Ph.D.
degree in electrical engineering from the University
of Rochester, Rochester, NY, in 1983.
From 1983 to 1986, he was a Senior Member of
Technical Staff at GTE Laboratories, MA, involved
in a silicon compilation project. Currently, he is
Associate Professor in the Department of Electrical
and Computer Engineering at the University of
Massachusetts, Amherst. His research interests include CAD for VLSI systems, high-level and logic synthesis, layout synthesis
and physical design automation, performance optimization of integrated
circuits, and mathematical optimization methods.
Fabian Klass (A’95) received the B.S.E.E. degree
from Universidad Nacional de Tucuman, Argentina,
in 1985, the M.S.E.E. degree from Technion—Israel
Institute of Technology, Haifa, Israel, in 1989, and
the Ph.D. degree in electrical engineering from Delft
University, The Netherlands, in 1994.
From 1992 to 1994, he was a Visiting Scholar at
Stanford University, Stanford, CA, where he joined
the Computer Systems Laboratory, working in the
area of high-speed CMOS digital circuits (wave-pipelining). In 1994, he joined Sun Microsystems
Inc., Sunnyvale, CA, where he is currently a Circuit Design Leader for the
UltraSparc III microprocessor. His current areas of interest include high-speed
CMOS design, clocking, and computer organization. He has published ten
technical papers on high-speed circuit techniques, holds one patent, and has
ten pending.
Wentai Liu (S’78–M’81–SM’93) received the
B.S.E.E. degree from National Chiao-Tung University, Taiwan, the M.S.E.E. degree from National
Taiwan University, Taipei, and the Ph.D. degree
from the University of Michigan, Ann Arbor.
Since 1983, he has been on the Faculty of North
Carolina State University, Raleigh, where he is
currently a Professor of electrical and computer
engineering. Currently, he leads the widely reported
retinal prosthesis project. His research interests
include visual prosthesis, VLSI design/CAD, and
sensor design. He holds three U.S. patents and has coauthored two books.
Dr. Liu received an IEEE Outstanding Paper Award in 1986. He has served
as an IEEE-CAS representative and is currently an Associate Editor for IEEE
TRANSACTIONS ON CIRCUITS AND SYSTEMS II.