Practical Design and Performance Evaluation of Completion Detection Circuits

Fu-Chiung Cheng
Department of Computer Science
Columbia University
New York, NY 10027
cheng@cs.columbia.edu

Abstract

To achieve the goal of designing high-performance self-timed circuits, one of the key factors is a fast completion detection circuit, which detects the completion of the self-timed circuit. Some recent work by Wuu and Yun [14, 15] on completion detection circuits is reviewed. A new design of high-performance completion detection circuits for dual-rail self-timed circuits is presented. The results of our SPICE simulation show that our computation-completion detection circuit is more than 9 times faster than Wuu's and Yun's, and our reset-completion detection circuit is 2.7 times faster than Wuu's.

1. Introduction

Circuits may be classified as synchronous or asynchronous. Synchronous circuits have a clock to synchronize the operations of subsystems, while asynchronous circuits do not. Subsystems in asynchronous circuits usually need start and completion signals to synchronize with one another [13]. Because of the potential advantages of asynchronous design (no clock skew problem, low power consumption, average-case performance, modularity, composability and reusability), interest in asynchronous logic design is increasing [5, 8]. The promise of high performance is especially attractive.

To achieve high performance, one must design a fast self-timed circuit with good average-case performance and a fast completion detection circuit that detects the completion of the self-timed circuit. For the design of high-performance self-timed components, see [3, 2]. This paper addresses the design of fast completion detection circuits.

The simplest completion detection circuit may be just a delay element, which can be implemented by an inverter chain. The delay must be larger than the worst-case delay of the circuit.
This type of completion detection circuit is used by asynchronous circuits with the bundled data protocol [12, 1]. Since this bounded-delay approach does not exploit data dependency and therefore yields worst-case performance, it is undesirable for high-performance self-timed or delay-insensitive circuits.

Dean and Grass [6, 7] proposed a completion detection method based on observing circuit power consumption. When activated, the circuit consumes power; when it is done, the power consumption falls below a particular threshold. This method is called Current-Sensing Completion Detection (CSCD). The CSCD method allows asynchronous circuits to be designed using single-rail variable encoding. However, it may cause high quiescent currents, which consume substantial power, and it requires several gate delays.

[Figure 1. Two-input Muller C-element]

A C-element [11] may be used to implement a completion detection circuit for self-timed or delay-insensitive circuits [1, 4]. Figure 1 shows a two-input C-element, with two inputs a and b and one output c, and its timing diagram. If a = b = 1 then c = 1, and if a = b = 0 then c = 0; otherwise the value of c remains unchanged. This can be generalized to an n-input C-element. The output of an n-input C-element is 1 if all inputs are 1 and 0 if all inputs are 0; otherwise its value remains unchanged.

Figure 2 shows a dual-rail self-timed component with an n-input C-element. The self-timed component has completed an operation when the DoneReset signal goes high, and it has been reset when the DoneReset signal goes low.

[Figure 2. Self-timed component with completion detection circuit]

A completion detection circuit may be implemented through an n-input C-element, which may in turn be implemented by a tree of two-input C-elements, as shown in Figure 3. This implementation is impractical for high-performance self-timed components because of its large delay overhead.

[Figure 3. n-input C-element]

A better implementation of an n-input C-element, shown in Figure 4, may be defined as follows:

$$\mathit{DoneReset} = \mathit{done} + \mathit{reset} \cdot \mathit{DoneReset} \qquad (1)$$

where $\mathit{done} = Ack_0 \cdot Ack_1 \cdots Ack_{n-1}$ and $\mathit{reset} = Ack_0 + Ack_1 + \cdots + Ack_{n-1}$.

[Figure 4. Better n-input C-element]

It consists of an n-input AND gate, an n-input OR gate and a two-input C-element. The output of the n-input AND gate can be regarded as the computation-completion signal. That is, the computation is done when all $Ack_i$ are turned on. The output of the n-input OR gate can be regarded as the reset-completion signal. That is, the circuit is reset (the OR gate output goes to 0) and ready to accept new input data when all $Ack_i$ are turned off.

This paper focuses on the design of high-performance completion detection circuits for dual-rail self-timed components. It is organized as follows: Section 2 reviews some recent work on completion detection circuits for dual-rail self-timed components. Section 3 presents our design of completion detection circuits in detail. The logic complexity of the completion detection circuits presented in Sections 2 and 3 is analyzed in Section 4. Section 5 presents the results of SPICE simulation of these completion detection circuits. Section 6 concludes the paper.

2. Previous Work

In the following subsections, some recent work [14, 15] on completion detection circuits is reviewed.

2.1. Wuu's Design of Multi-Input Muller C-elements

Wuu [14] proposed a fast and area-efficient implementation of a multi-input Muller C-element. The basic idea is to use faster NOR or NAND gates to implement a tree-like structure instead of AND-OR logic. That is, a multi-input NAND gate can be implemented by a tree whose first level consists of two-input NAND gates, with subsequent levels alternating between NOR and NAND gates. A dual arrangement can be used to implement multi-input NOR gates. Equation (1) is rewritten as follows:

$$\mathit{DoneReset} = \overline{\mathit{Nandtree} \cdot (\mathit{Nortree} + \overline{\mathit{DoneReset}})} \qquad (2)$$

where $\mathit{Nandtree} = \overline{Ack_0 \cdot Ack_1 \cdots Ack_{n-1}}$ and $\mathit{Nortree} = \overline{Ack_0 + Ack_1 + \cdots + Ack_{n-1}}$.

Figure 5 shows Wuu's implementation of an 8-input Nandtree and an 8-input Nortree. Note that if the number of levels of the Nandtree or Nortree is even, then an inverter is needed at the root.

[Figure 5. Nandtree and Nortree]

Compared to the implementation with a tree of two-input C-elements, Wuu has reported that his 32-input C-element is 3.9 to 4.4 times faster and has less delay variance.

2.2. Yun's Design of Completion Detection Circuits

In Yun's paper [15], a completion detection circuit is implemented by a tree of domino logic gates. The CMOS domino logic implementation of a 32-input completion detection circuit consists of four 8-bit completion detection domino logic modules connected as a tree, as shown in Figure 6.

[Figure 6. Yun's implementation of completion detecting circuit: four 8-bit completion detection domino logic modules, each combining the terms $(S_i^0 + S_i^1)$ for i = 0..7, 8..15, 16..23 and 24..31, merged to produce done]

It works as follows. Initially, the prech signal is set to zero, which precharges the CMOS domino logic. During operation, prech is set to one and the domino logic is floating. When all the acknowledge signals $(S_i^0 + S_i^1)$ are set to one, the done signal is turned on. The done signal is turned off as soon as any acknowledge signal is turned off, and prech is set back to zero.

The circuit implements only the computation-completion (i.e., done) signal. No implementation of a reset-completion (i.e., reset) signal was presented. Note that, to guarantee that a self-timed component works correctly without a reset-completion signal, a delay assumption has to be made so that no false done signal can be generated.

The delay variance from input to output is quite large. For example, if $S_0^0$ and $S_0^1$ (i.e., $Ack_0$) form the last of all acknowledge signals, at least 5 inverter delays are required to turn on the done signal. Compared to $S_0^0$ and $S_0^1$, the delay required by $S_7^0$ and $S_7^1$ is at least 2 inverter delays less. Careful arrangement of the input bits is necessary since it affects the overall performance. That is, those acknowledge signals expected to change late should be connected to points with the shortest paths to the output.

3. Designs of High Performance Completion Detection Circuits

The completion signal, shown in Figure 4, consists of an n-input AND gate for the done signal and an n-input OR gate for the reset signal. They are then fed to a two-input C-element, whose output is the DoneReset signal. The logic equation for the reset signal is rewritten as follows:

$$\mathit{reset} = Ack_0 + Ack_1 + \cdots + Ack_{n-1} = (S_0^0 + S_0^1) + (S_1^0 + S_1^1) + \cdots + (S_{n-1}^0 + S_{n-1}^1)$$

A reasonably fast multi-input NOR gate may be efficiently implemented by dynamic CMOS logic. The reset circuit, which is a 2n-input OR gate, can therefore be implemented by a 2n-input NOR gate followed by an inverter. The dynamic CMOS implementation of the reset circuit is shown in Figure 7.

[Figure 7. CMOS implementation of reset circuit]

The done signal may be rewritten as follows:

$$\mathit{done} = Ack_0 \cdot Ack_1 \cdots Ack_{n-1} = (S_0^0 + S_0^1)(S_1^0 + S_1^1) \cdots (S_{n-1}^0 + S_{n-1}^1) = \overline{\overline{(S_0^0 + S_0^1)} + \overline{(S_1^0 + S_1^1)} + \cdots + \overline{(S_{n-1}^0 + S_{n-1}^1)}}$$

Thus, the done circuit may be realized through n two-input NOR gates and an n-input NOR gate. The dynamic CMOS implementation of the done circuit is shown in Figure 8.
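The rewritten done and reset equations can be checked at the logic level with a short sketch. The following Python snippet is only an illustration under stated assumptions: the function names are hypothetical, each gate is modeled as an ideal Boolean function with no delay or charge effects, and the check enumerates every rail combination for a small n rather than proving the identity in general.

```python
from itertools import product

def nor(*bits):
    """Ideal NOR of any number of 0/1 inputs (no timing modeled)."""
    return 0 if any(bits) else 1

def done_and_form(rails):
    """done as the AND of all Ack_i, with Ack_i = S_i^0 + S_i^1."""
    return int(all(s0 | s1 for s0, s1 in rails))

def done_nor_form(rails):
    """done as n two-input NOR gates feeding one n-input NOR gate."""
    return nor(*(nor(s0, s1) for s0, s1 in rails))

def reset_or_form(rails):
    """reset as the OR of all Ack_i."""
    return int(any(s0 | s1 for s0, s1 in rails))

def reset_nor_form(rails):
    """reset as a 2n-input NOR gate followed by an inverter."""
    flat = [bit for pair in rails for bit in pair]
    return 1 - nor(*flat)

def check_equivalence(n):
    """Compare both forms on all 2^(2n) rail assignments; return cases checked."""
    checked = 0
    for bits in product((0, 1), repeat=2 * n):
        rails = [(bits[2 * i], bits[2 * i + 1]) for i in range(n)]
        assert done_and_form(rails) == done_nor_form(rails)
        assert reset_or_form(rails) == reset_nor_form(rails)
        checked += 1
    return checked

if __name__ == "__main__":
    print(check_equivalence(3), "rail assignments agree for n = 3")
```

The enumeration deliberately includes assignments in which both rails of a bit are high, which a dual-rail component should not produce; the two forms agree there as well, since the identity is plain De Morgan.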
In the computation cycle of the operation mode, either $S_i^0$ or $S_i^1$, for all $0 \le i \le n-1$, will eventually be turned on. For the done signal, the PMOS transistor will be closed (conducting) and all NMOS transistors will be open (non-conducting). Thus, the done signal will be turned on. The reset signal is turned on as soon as any $Ack_i$ signal goes high. In the reset cycle of the operation mode, $S_i^0$ and $S_i^1$, for all $0 \le i \le n-1$, will eventually be turned off. Thus, the done signal is turned off as soon as any $Ack_i$ signal is turned off, and the reset signal will be turned off only after all the $Ack_i$ signals are turned off.

[Figure 8. CMOS implementation of done circuit]

[Figure 9. Static and dynamic CMOS implementation of two-input C-element]

The PMOS transistor in the pull-up circuit of the done circuit saves power in the non-operation mode. That is, in a quiescent state, all $S_i^0$ and $S_i^1$, where $0 \le i \le n-1$, must be zero, and all $Ack_i$ ($\overline{Ack_i}$), where $0 \le i \le n-1$, are zero (one). All pull-down transistors are closed. To save power, the PMOS transistor must be open to cut off the path from Vdd to Ground. The input to the PMOS transistor can be any $\overline{Ack_i}$ signal or any other similar dual-rail signal used in the self-timed component. If the low input to the PMOS transistor arrives too early, power is wasted until all the $Ack_i$ ($\overline{Ack_i}$) turn on (off) and cut off the path from Vdd to Ground. On the other hand, if the low input to the PMOS transistor arrives too late, say because it is derived from the latest acknowledge signal, then the done circuit is not charged ahead of time and thus takes longer to turn on the done signal. For the sake of low power consumption, the latest Ack signal is a good choice. For the sake of high performance, any $Ack_i$ signal which is not the latest may be used.
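The computation and reset cycles described above can be traced with a small behavioral sketch. This Python example is an assumption-laden illustration, not a model of the transistor circuit: it treats done as the AND and reset as the OR of the acknowledge signals, applies Equation (1) as the next-state function of the two-input C-element, and ignores all delays. The function names and the sample acknowledge sequence are hypothetical.

```python
def completion_signals(acks):
    """Return (done, reset) for a tuple of 0/1 acknowledge signals."""
    done = int(all(acks))    # high only when every Ack_i is high
    reset = int(any(acks))   # low only when every Ack_i is low
    return done, reset

def c_element(done, reset, previous):
    """Equation (1): DoneReset = done + reset * DoneReset."""
    return done | (reset & previous)

def trace_done_reset(ack_sequence, initial=0):
    """Apply successive Ack vectors and record DoneReset after each one."""
    state = initial
    trace = []
    for acks in ack_sequence:
        done, reset = completion_signals(acks)
        state = c_element(done, reset, state)
        trace.append(state)
    return trace

if __name__ == "__main__":
    # One computation cycle followed by one reset cycle for n = 3.
    cycle = [
        (0, 0, 0),  # quiescent
        (1, 0, 0),  # first acknowledge arrives
        (1, 1, 0),
        (1, 1, 1),  # all acknowledges high: computation complete
        (1, 0, 1),  # reset begins, done drops but DoneReset holds
        (0, 0, 1),
        (0, 0, 0),  # all acknowledges low: reset complete
    ]
    print(trace_done_reset(cycle))
```

In this trace DoneReset rises only once the last acknowledge arrives and falls only once the last one is withdrawn, which is the hysteresis the C-element is meant to provide.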
Note that, when conducting, the pull-up path resistance of both the done and reset circuits must be at least five or six times as large as the pull-down resistance when only one pull-down transistor is conducting. This can be achieved by properly sizing the transistors.

The functionality of the dynamic CMOS logic shown in Figure 7 and Figure 8 is equivalent to a dual-rail multi-input C-element. It may be converted to a single-rail multi-input C-element by using a two-input C-element. Figure 9 shows two economical implementations of a two-input C-element. The left-hand circuit is a static implementation of a two-input C-element, presented in [9]. For a high-speed self-timed component which takes a few nanoseconds to perform a computation, the weak inverter of the state-holding element may be removed. The resulting circuit, which is a dynamic CMOS implementation, is shown on the right-hand side.

Figure 10 shows an n-bit delay-insensitive ripple-carry adder [10] with the proposed completion detection circuit. The last carry ($C_n^0$ and $C_n^1$) and all sum bits ($S_i^0$ and $S_i^1$, for $0 \le i < n$) are used to produce the completion signal. When DoneReset goes high, the addition is complete; when DoneReset goes low, the adder is properly reset and ready to accept the next inputs.

[Figure 10. DI ripple-carry adder with completion detection circuit]

4. Logic Complexity

The logic complexity consists of two parts: the number of transistors for the done signal, and the number of transistors for the done signal plus the reset signal.

An n-input completion detection circuit of Wuu's design consists of n two-input OR gates for the Ack signals, an n-input Nandtree for the done signal, and an n-input Nortree for the reset signal. Thus, the done signal needs $6n + 4(n-1) = 10n - 4$ transistors, and the done signal plus the reset signal needs $6n + 4(n-1) + 4(n-1) = 14n - 8$ transistors. Note that the two-input C-element is ignored.

An n-input computation-completion circuit of Yun's design consists of $n/8$ 8-bit completion detection domino modules and a tree of two-input dynamic AND gates. Therefore, the done signal needs $\frac{27}{8}n + 5(\frac{n}{8} - 1) = 4n - 5$ transistors.

An n-input completion detection circuit of our design (Cheng), shown in Figure 8 and Figure 7, consists of n two-input NOR gates for the Ack signals, an n-input dynamic NOR gate for the done signal with only one transistor in the pull-up circuit, and a 2n-input dynamic OR gate for the reset signal. Thus, the done signal needs $4n + (n+1) = 5n + 1$ transistors, and the done signal plus the reset signal needs $4n + (n+1) + (2n+4) = 7n + 5$ transistors.

The logic complexity of the above-mentioned completion detection circuits is summarized in Table 1.

          |           done            |       done + reset
Circuit   | n-bit     32-bit  64-bit  | n-bit     32-bit  64-bit
Wuu       | 10n - 4   316     636     | 14n - 8   440     888
Yun       | 4n - 5    123     251     | N/A       N/A     N/A
Cheng     | 5n + 1    161     321     | 7n + 5    229     453

Table 1. Logic complexity of completion detection circuits (number of transistors)

For the design of computation-completion circuits, Yun's design uses the fewest transistors (about 22% fewer than ours). For the design of computation-completion plus reset-completion circuits, our design requires about 49% fewer transistors than Wuu's.

5. SPICE Simulation

Our completion detection circuits are compared with those of Wuu and Yun by means of SPICE simulations using MOSIS 2 micron CMOS, level 2 parameters. The smallest transistor size is assumed.

The simulation contains two parts. First, 38 typical cases are used to analyze the performance of the computation-completion detection circuits proposed by Wuu, Yun and us. The delay measured includes the delay of the OR gate for $S_i^0$ and $S_i^1$. Second, 38 typical cases are used to analyze the performance of the reset-completion detection circuits proposed by Wuu and us. The ith case ($0 \le i \le 31$) mimics the situation in which the ith acknowledge signal (i.e., $Ack_i$) is the last of all completion signals. The ith case ($32 \le i \le 37$) mimics the situation in which multiple acknowledge signals are the last of all completion signals. The SPICE simulation results of these 38 cases are not shown here because of space limitations. Table 2 summarizes the minimum, maximum and average values over these 38 cases.

Computation Completion Detection
          |   32-bit done (ns)     |  Speed-up
Case      | Wuu    Yun    Cheng    | Sp1    Sp2
Min       | 2.18   1.46   0.22     | 4.1    2.8
Max       | 2.65   3.36   0.64     | 10.4   14.3
Average   | 2.27   2.53   0.28     | 9.2    10.2

Reset Completion Detection
          | 32-bit reset (ns) | Speed-up
Case      | Wuu    Cheng      | Sp3
Min       | 2.40   0.87       | 2.0
Max       | 2.89   1.34       | 3.1
Average   | 2.85   0.71       | 2.7

Table 2. Performance evaluation

The speed-up of circuit 1 against circuit 2 is defined as the ratio of the delay of circuit 2 to the delay of circuit 1. Sp1 and Sp2 in Table 2 are the speed-ups of our computation-completion detection circuit against Wuu's and Yun's, respectively. Sp3 in Table 2 is the speed-up of our reset-completion detection circuit against Wuu's.

The simulation results show the following. First, both Wuu's and our computation-completion and reset-completion detection circuits have very small delay variance from input to output. Second, our computation-completion circuit is on average 9.2 times faster than Wuu's design and 10.2 times faster than Yun's. Third, our reset-completion circuit is on average 2.7 times faster than Wuu's design.

Note that wire capacitance may not be negligible when more than 8 NMOS transistors are connected in parallel. The SPICE results shown in this paper do not include this factor. For circuits with many transistors connected in parallel, such as our done and reset circuits shown in Figure 10, wire capacitance may have a significant impact on delay. It may also be necessary to pay careful attention to the effect of noise.

6. Conclusions

This paper presents a new completion detection circuit with very fast computation-completion detection and very fast reset-completion detection for dual-rail self-timed components. The goal of designing high-performance self-timed circuits cannot be achieved without a low-overhead, very fast completion detection circuit.

The SPICE simulation results show that our computation-completion detection circuit is more than 9 times faster than Wuu's and Yun's, and our reset-completion detection circuit is 2.7 times faster than Wuu's. Among the designs compared here, our completion detection circuit is the fastest, and it uses about half of the transistors of Wuu's.

References

[1] Erik Brunvand. Translating Concurrent Communicating Programs into Asynchronous Circuits. PhD thesis, Carnegie Mellon University, 1991.

[2] Fu-Chiung Cheng. Synthesis of high speed delay-insensitive combinational iterative tree circuits. In Proc. International Conf. Computer Design (ICCD), pages 301–306, October 1997.

[3] Fu-Chiung Cheng, Stephen H. Unger, Michael Theobald, and Wen-Chung Cho. Delay-insensitive carry-lookahead adders. In Proc. Int'l. Conf. VLSI Design, pages 322–328. IEEE Computer Society Press, 1997.

[4] Ilana David, Ran Ginosar, and Michael Yoeli. An efficient implementation of boolean functions as self-timed circuits. IEEE Transactions on Computers, 41(1):2–11, January 1992.

[5] Al Davis and Steven M. Nowick. Asynchronous circuit design: Motivation, background, and methods. In Graham Birtwistle and Al Davis, editors, Asynchronous Digital Circuit Design, Workshops in Computing, pages 1–49. Springer-Verlag, 1995.

[6] Mark E. Dean, David L. Dill, and Mark Horowitz. Self-timed logic using current-sensing completion detection (CSCD). Journal of VLSI Signal Processing, 7(1/2):7–16, February 1994.

[7] E. Grass, R. C. S. Morling, and I. Kale. Activity monitoring completion detection (AMCD): A new single rail approach to achieve self-timing. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems. IEEE Computer Society Press, March 1996.

[8] Scott Hauck. Asynchronous design methodologies: An overview. Proceedings of the IEEE, 83(1), January 1995.

[9] Alain J. Martin. Programming in VLSI: From communicating processes to delay-insensitive circuits. In C. A. R. Hoare, editor, Developments in Concurrency and Communication, UT Year of Programming Series, pages 1–64. Addison-Wesley, 1990.

[10] Alain J. Martin. Asynchronous datapaths and the design of an asynchronous adder. Formal Methods in System Design, 1(1):119–137, July 1992.

[11] David E. Muller and W. S. Bartky. A theory of asynchronous circuits. In Proceedings of an International Symposium on the Theory of Switching, pages 204–243. Harvard University Press, April 1959.

[12] Ivan E. Sutherland. Micropipelines. Communications of the ACM, 32(6):720–738, June 1989.

[13] S. H. Unger. A building block approach to unclocked systems. In Proc. Hawaii International Conf. System Sciences, volume I. IEEE Computer Society Press, January 1993.

[14] Tzyh-Yung Wuu and Sarma B. K. Vrudhula. A design of a fast and area efficient multi-input Muller C-element. IEEE Transactions on VLSI Systems, 1(2):215–219, June 1993.

[15] Kenneth Y. Yun, Ayoob E. Dooply, Julio Arceo, Peter A. Beerel, and Vida Vakilotojar. The design and verification of a high-performance low-control-overhead asynchronous differential equation solver. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems. IEEE Computer Society Press, April 1997.