Completion Detection Circuits

Practical Design and Performance Evaluation of Completion
Detection Circuits
Fu-Chiung Cheng
Department of Computer Science
Columbia University
New York, NY 10027
cheng@cs.columbia.edu
Abstract
To achieve the goal of designing high performance
self-timed circuits, one of the key factors is to design
a fast completion detection circuit, which detects the completion of the self-timed circuit. Some recent work by Wuu and Yun [14, 15] on completion detection circuits is reviewed. A new design of high performance completion detection circuits for dual-rail self-timed circuits is presented. The results of our SPICE
simulation show that our computation-completion detection circuit is more than 9 times faster than Wuu’s
and Yun’s, and our reset-completion detection circuit
is 2.7 times faster than Wuu’s.
1. Introduction
Circuits may be classified as synchronous or asynchronous. Synchronous circuits have a clock to synchronize the operations of subsystems while asynchronous circuits do not. Subsystems in asynchronous
circuits usually need their start and completion signals
to synchronize with one another [13].
Because of the potential advantages of asynchronous
design — no clock skew problem, low power consumption, average-case performance, modularity, composability and reusability — interest in asynchronous logic
design is increasing [5, 8].
The promise of high performance is especially attractive. To achieve high performance, one must design a fast self-timed circuit with good average-case
performance and a fast completion detection circuit
that detects the completion of the self-timed circuit. For
the design of high performance self-timed components, see
[3, 2]. This paper addresses the design of fast completion
detection circuits.
The simplest completion detection circuit may be
just a delay element, which can be implemented by an
inverter chain. The delay must be larger than the
worst-case delay of the circuit. This type of completion detection
circuit is used by asynchronous circuits with the
bundled data protocol [12, 1]. Since this bounded-delay
approach does not exploit data dependency and therefore yields
worst-case performance, it is undesirable
for high performance self-timed or delay-insensitive circuits.
Dean and Grass [6, 7] proposed a completion detection method based on observing circuit power consumption.
When activated, the circuit consumes power, and when
it is done, the power consumption falls below a particular threshold. This method is called Current-Sensing
Completion-Detection (CSCD). The CSCD method allows
asynchronous circuits to be designed using single-rail
variable encoding. However, this method may cause
high quiescent currents, which consume substantial
power, and it requires several gate delays.
Figure 1. Two-input Muller C-element
A C-element [11] may be used to implement a
completion detection circuit for self-timed or delay-insensitive circuits [1, 4]. Figure 1 shows a two-input
C-element, with two inputs a and b, and one output c,
and its timing diagram.
If a = b = 1 then c = 1 and if a = b = 0 then c = 0;
otherwise the value of c remains unchanged. This can
be generalized to an n-input C-element. The output of
an n-input C-element is 1 if all inputs are 1 and it is 0 if
all inputs are 0; otherwise its value remains unchanged.
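The state-holding behavior described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the class name `CElement` and the helper `trace` are hypothetical, and the model ignores all timing and transistor-level effects.

```python
class CElement:
    """Behavioral (untimed) model of an n-input Muller C-element."""

    def __init__(self, initial=0):
        self.output = initial  # the element holds state between updates

    def update(self, inputs):
        """Apply one input vector and return the resulting output."""
        if all(v == 1 for v in inputs):
            self.output = 1      # all inputs high: output goes high
        elif all(v == 0 for v in inputs):
            self.output = 0      # all inputs low: output goes low
        # any mixed input vector leaves the output unchanged
        return self.output


def trace(sequence, initial=0):
    """Return the output after each input vector in `sequence`."""
    element = CElement(initial)
    return [element.update(vector) for vector in sequence]


if __name__ == "__main__":
    # Two-input example: a rises, b rises, a falls, b falls.
    print(trace([(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]))
```

In the two-input run, the output rises only once both inputs are 1 and falls only once both are 0 again.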
Figure 2 shows a dual-rail self-timed component with
an n-input C-element. The self-timed component has
completed an operation when the DoneReset signal goes
high and it has been reset when the DoneReset signal goes
low.
Figure 2. Self-timed component with completion detection circuit
This paper focuses on the design of high performance completion detection circuits for dual-rail self-timed components.
This paper is organized as follows: Section 2 reviews some recent work on completion detection circuits for dual-rail self-timed components. Section 3 presents our design of completion detection circuits in detail. The logic complexity of the completion detection circuits presented in Sections 2 and 3 is analyzed in Section 4. Section 5 presents the results of SPICE simulation of these completion detection circuits. Section 6 concludes the paper.
2. Previous Work
A completion detection circuit may be implemented through an n-input C-element, which may in turn be implemented by a tree of two-input C-elements, as shown in Figure 3. This implementation is impractical for high performance self-timed components due to the large delay overhead.
Figure 3. n-input C-element
A better implementation of an n-input C-element, shown in Figure 4, may be defined as follows:
$DoneReset = done + reset \cdot DoneReset$ (1)
where $done = Ack_0 Ack_1 \cdots Ack_{n-1}$ and $reset = Ack_0 + Ack_1 + \cdots + Ack_{n-1}$.
Figure 4. Better n-input C-element
It consists of an n-input AND gate, an n-input OR gate and a two-input C-element. The output of the n-input AND gate can be regarded as the computation-completion signal. That is, the computation is done when all $Ack_i$ signals are turned on. The output of the n-input OR gate can be regarded as the reset-completion signal. That is, the circuit is reset (the OR gate output goes to 0) and ready to accept new input data when all $Ack_i$ signals are turned off.
In the following subsections, some recent work [14, 15] on completion detection circuits is reviewed.
2.1. Wuu’s Design of Multi-Input Muller C-elements
Wuu [14] proposed a fast and area-efficient implementation of a multi-input Muller C-element. The basic idea is to use faster NOR or NAND gates to implement a tree-like structure instead of AND-OR logic. That is, a multi-input NAND gate can be implemented by a tree whose first level consists of two-input NAND gates, with subsequent levels alternating between NOR and NAND gates. A dual arrangement can be used to implement multi-input NOR gates. Equation (1) is rewritten as follows:
$DoneReset = \overline{Nandtree \cdot (Nortree + \overline{DoneReset})}$ (2)
where $Nandtree = \overline{Ack_0 Ack_1 \cdots Ack_{n-1}}$ and $Nortree = \overline{Ack_0 + Ack_1 + \cdots + Ack_{n-1}}$.
Figure 5 shows Wuu’s implementation of an 8-input Nandtree and an 8-input Nortree. Note that if the number of levels of the Nandtree or Nortree is even, then an inverter is needed at the root.
Figure 5. Nandtree and Nortree
Compared to the implementation as a tree of two-input C-elements, Wuu has reported that his 32-input C-element is from 3.9 to 4.4 times faster and has less delay variance.
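The relation between Equation (1) and the Nandtree/Nortree form of Equation (2) can be checked by exhaustive enumeration. The sketch below assumes that Nandtree and Nortree denote the complemented n-input AND and OR of the Ack signals, and that Equation (2) is the De Morgan form of Equation (1); the function names are hypothetical.

```python
from itertools import product


def next_state_eq1(acks, prev):
    """Equation (1): DoneReset = done + reset * DoneReset."""
    done = int(all(acks))    # n-input AND of the Ack signals
    reset = int(any(acks))   # n-input OR of the Ack signals
    return done | (reset & prev)


def next_state_eq2(acks, prev):
    """Assumed form of Equation (2): NOT(Nandtree * (Nortree + NOT DoneReset))."""
    nandtree = 1 - int(all(acks))  # complement of the n-input AND
    nortree = 1 - int(any(acks))   # complement of the n-input OR
    return 1 - (nandtree & (nortree | (1 - prev)))


def forms_agree(n):
    """Compare both forms for every Ack vector and both previous outputs."""
    return all(
        next_state_eq1(acks, prev) == next_state_eq2(acks, prev)
        for acks in product((0, 1), repeat=n)
        for prev in (0, 1)
    )


if __name__ == "__main__":
    print(all(forms_agree(n) for n in range(1, 9)))
```

The check covers only the logic function; it says nothing about the delay advantage of the alternating NAND/NOR tree.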
2.2. Yun’s Design of Completion Detection Circuits
In Yun’s paper [15], a completion detection circuit is implemented by a tree of domino logic gates. The CMOS domino logic implementation of a 32-input completion detection circuit consists of four 8-bit completion detection domino logic modules connected as a tree, as shown in Figure 6.
Figure 6. Yun’s implementation of completion detecting circuit (four 8-bit completion detection domino logic modules, covering bits 0 to 7, 8 to 15, 16 to 23, and 24 to 31)
It works as follows: Initially, the prech signal is set to zero, which precharges the CMOS domino logic. During operation, prech is set to one and the domino logic is floating at this moment. When all the acknowledge signals $(S_i^0 + S_i^1)$ are set to one, the done signal is turned on. The done signal is turned off as soon as any acknowledge signal is turned off, and prech is then set back to zero.
The circuit implements only the computation-completion (i.e. done) signal. No implementation of a reset-completion (i.e. reset) signal was presented. Note that to guarantee that a self-timed component works correctly without a reset-completion signal, a delay assumption has to be made so that no false done signal can be generated.
The delay variance from input to output is quite large. For example, if $S_0^0$ and $S_0^1$ (i.e. $Ack_0$) are the last of all acknowledge signals, at least 5 inverter delays are required to turn on the done signal. Compared to $S_0^0$ and $S_0^1$, the delay required by $S_7^0$ and $S_7^1$ is at least 2 inverter delays less. Careful arrangement of the input bits is necessary since it affects the overall performance. That is, those acknowledge signals expected to change late should be connected to points with the shortest paths to the output.
3. Designs of High Performance Completion Detection Circuits
The completion signal, shown in Figure 4, consists of an n-input AND gate for the done signal and an n-input OR gate for the reset signal. They are then fed to a 2-input C-element, whose output is the DoneReset signal.
The logic equation for the reset signal is rewritten as follows:
$reset = Ack_0 + Ack_1 + \cdots + Ack_{n-1}$
$= (S_0^0 + S_0^1) + (S_1^0 + S_1^1) + \cdots + (S_{n-1}^0 + S_{n-1}^1)$
A reasonably fast multi-input NOR gate may be efficiently implemented by dynamic CMOS logic. The reset circuit, which is a 2n-input OR gate, can therefore be implemented by a 2n-input NOR gate with an inverter. The dynamic CMOS implementation of the reset circuit is shown in Figure 7.
Figure 7. CMOS implementation of reset circuit
The done signal may be rewritten as follows:
$Done = Ack_0 Ack_1 \cdots Ack_{n-1}$
$= (S_0^0 + S_0^1)(S_1^0 + S_1^1) \cdots (S_{n-1}^0 + S_{n-1}^1)$
$= \overline{\overline{(S_0^0 + S_0^1)} + \overline{(S_1^0 + S_1^1)} + \cdots + \overline{(S_{n-1}^0 + S_{n-1}^1)}}$
Thus, the done circuit may be realized through n 2-input NOR gates and an n-input NOR gate. The dynamic CMOS implementation of the done circuit is shown
in Figure 8.
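The NOR-of-NORs realization of the done signal, and the NOR-plus-inverter realization of the reset signal, can be checked against the plain AND/OR definitions at the logic level. This is a minimal sketch: the gate helper `nor` and the function names are hypothetical, and the model says nothing about the dynamic CMOS behavior (precharge, sizing, or timing).

```python
from itertools import product


def nor(*bits):
    """Multi-input NOR gate."""
    return int(not any(bits))


def done_reference(rails):
    """done as the AND of all Ack_i, with Ack_i = S_i^0 + S_i^1."""
    return int(all(s0 | s1 for s0, s1 in rails))


def done_nor_of_nors(rails):
    """done as n two-input NOR gates feeding one n-input NOR gate."""
    return nor(*(nor(s0, s1) for s0, s1 in rails))


def reset_reference(rails):
    """reset as the OR of all Ack_i."""
    return int(any(s0 | s1 for s0, s1 in rails))


def reset_nor_inverter(rails):
    """reset as a 2n-input NOR gate followed by an inverter."""
    return 1 - nor(*(bit for pair in rails for bit in pair))


def realizations_agree(n):
    """Exhaustively compare both realizations over all rail values."""
    pairs = list(product((0, 1), repeat=2))
    return all(
        done_reference(rails) == done_nor_of_nors(rails)
        and reset_reference(rails) == reset_nor_inverter(rails)
        for rails in product(pairs, repeat=n)
    )


if __name__ == "__main__":
    print(realizations_agree(4))
```

The enumeration includes the rail value (1, 1), which a dual-rail code never produces; the identities hold for it as well.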
In the computation cycle of the operation mode, either $S_i^0$ or $S_i^1$ for all 0 ≤ i ≤ n − 1 will eventually be
turned on. For the done signal, the PMOS transistor
will be closed and all NMOS transistors will be open.
Thus, the done signal will be turned on. The reset signal is turned on as soon as any $Ack_i$ signal goes high.
In the reset cycle of the operation mode, $S_i^0$ and $S_i^1$
for all 0 ≤ i ≤ n − 1 will eventually be turned off. Thus,
the done signal is turned off as soon as any $Ack_i$ signal
is turned off, and the reset signal will be turned off
only after all the $Ack_i$ signals are turned off.
Figure 8. CMOS implementation of done circuit
Figure 9. Static and dynamic CMOS implementation of two-input C-element
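The computation and reset cycles described above can be traced with an untimed logic model that feeds done and reset into a two-input C-element following Equation (1). This is a sketch only: the function names and the three-bit `HANDSHAKE` sequence are hypothetical, and real circuits would see these transitions with analog delays.

```python
def completion_signals(rails, prev_done_reset):
    """Return (done, reset, DoneReset) for one vector of dual-rail outputs."""
    acks = [s0 | s1 for s0, s1 in rails]   # Ack_i = S_i^0 + S_i^1
    done = int(all(acks))                  # computation-completion signal
    reset = int(any(acks))                 # reset-completion signal
    # Two-input C-element on (done, reset), as in Equation (1).
    done_reset = done | (reset & prev_done_reset)
    return done, reset, done_reset


def run(states):
    """Trace the three signals over a sequence of dual-rail output vectors."""
    history, done_reset = [], 0
    for rails in states:
        done, reset, done_reset = completion_signals(rails, done_reset)
        history.append((done, reset, done_reset))
    return history


# Hypothetical 3-bit example: outputs become valid one by one, then reset.
HANDSHAKE = [
    [(0, 0), (0, 0), (0, 0)],  # quiescent
    [(1, 0), (0, 0), (0, 0)],  # first bit valid: reset rises
    [(1, 0), (0, 1), (0, 0)],
    [(1, 0), (0, 1), (1, 0)],  # all bits valid: done and DoneReset rise
    [(0, 0), (0, 1), (1, 0)],  # first bit reset: done falls at once
    [(0, 0), (0, 0), (1, 0)],
    [(0, 0), (0, 0), (0, 0)],  # all bits reset: reset and DoneReset fall
]

if __name__ == "__main__":
    for row in run(HANDSHAKE):
        print(row)
```

In this trace, DoneReset stays high through the reset cycle until the last output bit returns to zero, which is the behavior the C-element is meant to provide.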
The PMOS transistor in the pull-up circuit of the done
circuit saves power in the non-operation mode. That is,
in a quiescent state, all $S_i^0$ and $S_i^1$, where 0 ≤ i ≤ n − 1,
must be zero, so all $Ack_i$ ($\overline{Ack_i}$), where 0 ≤ i ≤ n − 1,
are zero (one). All pull-down transistors are closed.
To save power, the PMOS transistor must
be open to cut off the path from Vdd to Ground. The
input to the PMOS transistor can be any $Ack_i$ signal or
any other similar dual-rail signal used in the self-timed
components.
If the low input to the PMOS transistor arrives too
early, power is wasted until all the $Ack_i$ ($\overline{Ack_i}$) signals turn on
(off) and cut off the path from Vdd to Ground. On the
other hand, if the low input to the PMOS transistor
arrives too late, say it comes from the latest acknowledge signal,
then the done circuit is not charged ahead of time and thus
takes longer to turn on the done signal. For
low power consumption, the latest Ack signal is
a good choice. For high performance, any
$Ack_i$ signal that is not the latest may be used.
Note that when conducting, the pull-up path resistances of both done and reset circuits must be at least
five or six times as big as the pull-down resistance when
only one pull-down transistor is conducting. This can
be achieved by properly sizing the transistors.
The functionality of the dynamic CMOS logic shown
in Figure 8 and Figure 7 is equivalent to a dual-rail
multi-input C-element. It may be converted to a single-rail multi-input C-element by using a two-input C-element. Figure 9 shows two economical implementations of a two-input C-element. The left-hand side circuit is a static implementation of a two-input C-element,
presented in [9]. For a high speed self-timed component which takes a few nanoseconds to perform a computation, the weak inverter of the state-holding element may be removed. The resulting circuit, which
is a dynamic CMOS implementation, is shown on the
right-hand side.
Figure 10 shows an n-input delay-insensitive ripple-carry adder [10] with the proposed completion detection circuit. The last carry ($C_n^0$ and $C_n^1$) and all sum
bits ($S_i^0$ and $S_i^1$, for 0 ≤ i < n) are used to produce
the completion signal. When DoneReset goes high,
the addition is complete, and when DoneReset goes
low, the adder is properly reset and ready to accept
the next inputs.
Figure 10. DI ripple-carry adder with completion detection circuit
4. Logic Complexity
The logic complexity consists of two parts: the number of transistors for the done signal and the number of transistors for the done signal + the reset signal.
An n-input completion detection circuit of Wuu’s design consists of n two-input OR gates for the Ack signals, an n-input Nandtree for the done signal, and an n-input Nortree for the reset signal. Thus, the done signal needs 6n + 4(n − 1) = 10n − 4 transistors and the done signal + the reset signal needs 6n + 4(n − 1) + 4(n − 1) = 14n − 8 transistors. Note that the two-input C-element is ignored.
An n-input computation-completion circuit of Yun’s design consists of n/8 8-bit completion detection domino modules and a tree of two-input dynamic AND gates. Therefore, the done signal needs (27/8)n + 5(n/8 − 1) = 4n − 5 transistors.
An n-input completion detection circuit of our design (Cheng), shown in Figure 8 and Figure 7, consists of n two-input NOR gates for the Ack signals, an n-input dynamic NOR gate for the done signal with only one transistor in the pull-up circuit, and a 2n-input dynamic OR gate for the reset signal. Thus, the done signal needs 4n + (n + 1) = 5n + 1 transistors and the done signal + the reset signal needs 4n + (n + 1) + (2n + 4) = 7n + 5 transistors.
The logic complexity of the above mentioned completion detection circuits is summarized in Table 1.

           done                          done + reset
Circuit    n-bit     32-bit   64-bit     n-bit     32-bit   64-bit
Wuu        10n − 4   316      636        14n − 8   440      888
Yun        4n − 5    123      251        N/A       N/A      N/A
Cheng      5n + 1    161      321        7n + 5    229      453
Table 1. Logic complexity of completion detection circuits

For the design of computation-completion circuits, Yun’s design uses the fewest transistors (about 22% fewer than ours). For the design of computation-completion plus reset-completion circuits, the number of transistors required in our design is about 49% less than Wuu’s.
5. SPICE Simulation
Our completion detection circuits are compared with those of Wuu and Yun by means of SPICE simulations using MOSIS 2 micron CMOS, level 2 parameters. The smallest transistor size is assumed.
The simulation contains two parts: First, 38 typical cases are used to analyze the performance of the computation-completion detection circuits proposed by Wuu, Yun and us. The delay measured includes the delay of the OR gate for $S_i^0$ and $S_i^1$. Second, 38 typical cases are used to analyze the performance of the reset-completion detection circuits proposed by Wuu and us. The ith case (0 ≤ i ≤ 31) mimics the situation in which the ith acknowledge signal (i.e. $Ack_i$) is the last of all completion signals. The ith case (32 ≤ i ≤ 37) mimics the situation in which multiple acknowledge signals are the last ones of all completion signals. The SPICE simulation results of these 38 cases are not shown here due to space limitations. Table 2 shows the summary of the minimal, maximal and average values of these 38 cases.

Computation Completion Detection
           32-bit done (ns)            Speed Up
Case       Wuu      Yun      Cheng     Sp1      Sp2
Min        2.18     1.46     0.22      4.1      2.8
Max        2.65     3.36     0.64      10.4     14.3
Average    2.27     2.53     0.28      9.2      10.2

Reset Completion Detection
           32-bit reset (ns)   Speed Up
Case       Wuu      Cheng      Sp3
Min        2.40     0.87       2.0
Max        2.89     1.34       3.1
Average    2.85     0.71       2.7
Table 2. Performance evaluation

The speed-up of circuit 1 against circuit 2 is defined as the ratio of the delay of circuit 2 to the delay of circuit 1. Sp1 and Sp2 in Table 2 are the speed-ups of our computation-completion detection circuit against Wuu’s and Yun’s, respectively. Sp3 in Table 2 is the speed-up of our reset-completion detection circuit against Wuu’s.
The simulation results show that: First, both Wuu’s and our computation-completion and reset-completion detection circuits have very small delay variance from input to output. Second, our computation-completion circuit is 9.2 times faster than Wuu’s design and 10.2 times faster than Yun’s. Third, our reset-completion circuit is 2.7 times faster than Wuu’s design.
Note that wire capacitance cannot be ignored when more than 8 NMOS transistors are connected in parallel. The SPICE results shown in this paper do not include this factor.
For circuits with many transistors connected in parallel, such as our done and reset circuits shown in Figure 10, wire capacitance may have a significant impact on delay. It may be necessary to pay careful attention to the effect of noise.
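The transistor-count formulas of Section 4 can be evaluated directly to reproduce the entries of Table 1 and the quoted percentages. This is a sketch under stated assumptions: the function names are hypothetical, and the reading of Yun's formula as 27 transistors per 8-bit module plus 5 per two-input dynamic AND gate is an interpretation of (27/8)n + 5(n/8 − 1), valid here only for n a multiple of 8.

```python
def wuu_done(n):
    """n two-input OR gates (6n) plus an n-input Nandtree (4(n - 1))."""
    return 6 * n + 4 * (n - 1)


def wuu_done_reset(n):
    """Adds the n-input Nortree (4(n - 1)); the C-element is ignored."""
    return wuu_done(n) + 4 * (n - 1)


def yun_done(n):
    """Assumed reading: 27 per 8-bit module, 5 per two-input AND gate."""
    if n % 8 != 0:
        raise ValueError("n must be a multiple of 8")
    modules = n // 8
    return 27 * modules + 5 * (modules - 1)


def cheng_done(n):
    """n two-input NOR gates (4n) plus the dynamic n-input NOR (n + 1)."""
    return 4 * n + (n + 1)


def cheng_done_reset(n):
    """Adds the dynamic 2n-input OR gate (2n + 4)."""
    return cheng_done(n) + (2 * n + 4)


def fraction_fewer(ours, theirs):
    """Fraction by which `ours` is smaller than `theirs`."""
    return 1 - ours / theirs


if __name__ == "__main__":
    for n in (32, 64):
        print(n, wuu_done(n), yun_done(n), cheng_done(n),
              wuu_done_reset(n), cheng_done_reset(n))
    print(round(fraction_fewer(yun_done(64), cheng_done(64)), 3))
    print(round(fraction_fewer(cheng_done_reset(64), wuu_done_reset(64)), 3))
```

For n = 64 the two fractions correspond to the "about 22%" and "about 49%" figures quoted in Section 4.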
6. Conclusions
This paper presents a new completion detection circuit with very fast computation-completion detection and very fast reset-completion detection for dual-rail self-timed components. The goal of designing high-performance self-timed circuits cannot be achieved without a low-overhead, very fast completion detection circuit.
The SPICE simulation results show that our computation-completion detection circuit is more than 9 times faster than Wuu’s and Yun’s, and our reset-completion detection circuit is 2.7 times faster than Wuu’s. Our completion detection circuit is faster than any other known one and uses about half of the transistors of Wuu’s.
References
[1] Erik Brunvand. Translating Concurrent Communicating Programs into Asynchronous Circuits. PhD thesis, Carnegie Mellon University, 1991.
[2] Fu-Chiung Cheng. Synthesis of high speed delay-insensitive combinational iterative tree circuits. In Proc. International Conf. Computer Design (ICCD), pages 301–306, October 1997.
[3] Fu-Chiung Cheng, Stephen H. Unger, Michael Theobald, and Wen-Chung Cho. Delay-insensitive carry-lookahead adders. In Proc. Int’l. Conf. VLSI Design, pages 322–328. IEEE Computer Society Press, 1997.
[4] Ilana David, Ran Ginosar, and Michael Yoeli. An efficient implementation of boolean functions as self-timed circuits. IEEE Transactions on Computers, 41(1):2–11, January 1992.
[5] Al Davis and Steven M. Nowick. Asynchronous circuit design: Motivation, background, and methods. In Graham Birtwistle and Al Davis, editors, Asynchronous Digital Circuit Design, Workshops in Computing, pages 1–49. Springer-Verlag, 1995.
[6] Mark E. Dean, David L. Dill, and Mark Horowitz. Self-timed logic using current-sensing completion detection (CSCD). Journal of VLSI Signal Processing, 7(1/2):7–16, February 1994.
[7] E. Grass, R. C. S. Morling, and I. Kale. Activity monitoring completion detection (AMCD): A new single rail approach to achieve self-timing. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems. IEEE Computer Society Press, March 1996.
[8] Scott Hauck. Asynchronous design methodologies: An overview. Proceedings of the IEEE, 83(1), January 1995.
[9] Alain J. Martin. Programming in VLSI: From communicating processes to delay-insensitive circuits. In C. A. R. Hoare, editor, Developments in Concurrency and Communication, UT Year of Programming Series, pages 1–64. Addison-Wesley, 1990.
[10] Alain J. Martin. Asynchronous datapaths and the design of an asynchronous adder. Formal Methods in System Design, 1(1):119–137, July 1992.
[11] David E. Muller and W. S. Bartky. A theory of asynchronous circuits. In Proceedings of an International Symposium on the Theory of Switching, pages 204–243. Harvard University Press, April 1959.
[12] Ivan E. Sutherland. Micropipelines. Communications of the ACM, 32(6):720–738, June 1989.
[13] S. H. Unger. A building block approach to unclocked systems. In Proc. Hawaii International Conf. System Sciences, volume I. IEEE Computer Society Press, January 1993.
[14] Tzyh-Yung Wuu and Sarma B. K. Vrudhula. A design of a fast and area efficient multi-input Muller C-element. IEEE Transactions on VLSI Systems, 1(2):215–219, June 1993.
[15] Kenneth Y. Yun, Ayoob E. Dooply, Julio Arceo, Peter A. Beerel, and Vida Vakilotojar. The design and verification of a high-performance low-control-overhead asynchronous differential equation solver. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems. IEEE Computer Society Press, April 1997.