VLSI Arithmetic: Adders & Multipliers
Prof. Vojin G. Oklobdzija, University of California
http://www.ece.ucdavis.edu/acsel

Introduction
• Digital computer arithmetic belongs to computer architecture; however, it is also an aspect of logic design.
• The objective of computer arithmetic is to develop algorithms that use the available hardware in the most efficient way.
• Ultimately, speed, power and chip area are the most often used measures, which creates a strong link between the algorithms and the technology of implementation.

Oklobdzija 2004 Computer Arithmetic

Basic Operations
• Addition
• Multiplication
• Multiply-Add
• Division
• Evaluation of Functions
• Multi-Media

Addition of Binary Numbers

Full Adder. The full adder is the fundamental building block of most arithmetic circuits. It takes the inputs $a_i$, $b_i$ and the carry-in $c_i$, and produces the sum $s_i$ and the carry-out $c_{i+1}$.

The sum and carry outputs are described as:
$s_i = \bar{a}_i \bar{b}_i c_i + \bar{a}_i b_i \bar{c}_i + a_i \bar{b}_i \bar{c}_i + a_i b_i c_i$
$c_{i+1} = \bar{a}_i b_i c_i + a_i \bar{b}_i c_i + a_i b_i \bar{c}_i + a_i b_i c_i = a_i b_i + a_i c_i + b_i c_i$

Inputs          Outputs
c_i  a_i  b_i   s_i  c_{i+1}
0    0    0     0    0
0    0    1     1    0         Propagate
0    1    0     1    0         Propagate
0    1    1     0    1         Generate
1    0    0     1    0
1    0    1     0    1         Propagate
1    1    0     0    1         Propagate
1    1    1     1    1         Generate

Full-Adder Implementation
Full-adder operation is defined by the equations:
$s_i = \bar{a}_i \bar{b}_i c_i + \bar{a}_i b_i \bar{c}_i + a_i \bar{b}_i \bar{c}_i + a_i b_i c_i = a_i \oplus b_i \oplus c_i = p_i \oplus c_i$
$c_{i+1} = \bar{a}_i b_i c_i + a_i \bar{b}_i c_i + a_i b_i = g_i + p_i c_i$
Carry-propagate: $p_i = a_i \oplus b_i$, and carry-generate: $g_i = a_i b_i$.
A one-bit adder could be implemented as shown.
[Figure: one-bit adder with inputs a_i, b_i, c_in and outputs s_i, c_out.]

High-Speed Addition
$c_{i+1} = g_i + p_i c_i$, with $g_i = a_i b_i$, $p_i = a_i \oplus b_i$ and $s_i = p_i \oplus c_i$.
A one-bit adder can be implemented more efficiently with a multiplexer in the carry path, because the MUX is faster.
[Figure: one-bit adder with a multiplexer (inputs 0 and 1, select s) producing c_out from c_in.]

The Ripple-Carry Adder
[Figure: four full adders (FA) in series with inputs A0..A3 and B0..B3, sums S0..S3, and carries Ci,0, Co,0 (= Ci,1), Co,1, Co,2, Co,3.]
Worst-case delay is linear
with the number of bits: $t_d = O(N)$
$t_{adder} \approx (N-1)\,t_{carry} + t_{sum}$
Goal: make the carry-path circuit as fast as possible.
From Rabaey

Inversion Property
[Figure: a full adder with inverted inputs produces inverted outputs.]
$\bar{S}(A, B, C_i) = S(\bar{A}, \bar{B}, \bar{C}_i)$
$\bar{C}_o(A, B, C_i) = C_o(\bar{A}, \bar{B}, \bar{C}_i)$
From Rabaey

Minimize Critical Path by Reducing Inverting Stages
[Figure: 4-bit ripple-carry chain of FA' cells alternating between even cells and odd cells.]
Exploit the inversion property.
Note: two different types of cells are needed.
From Rabaey

Ripple Carry Adder
Carry chain of an RCA implemented using a multiplexer from the standard-cell library.
[Figure: three bit slices (i, i+1, i+2) with the critical path running from c_in through c_i and c_{i+1} to c_out.]
Oklobdzija, ISCAS'88

Manchester Carry-Chain Realization of the Carry Path
• Simple and very popular scheme for implementing the carry signal path
[Figure: carry chain from carry-in to carry-out; each node has a generate device, a propagate device, and a predischarge & kill device.]

Original Design
T. Kilburn, D. B. G. Edwards, D. Aspinall, "Parallel Addition in Digital Computers: A New Fast "Carry" Circuit", Proceedings of IEE, Vol. 106, pt. B, p. 464, September 1959.

Manchester Carry Chain (CMOS)
• Implement P with pass-transistors
• Implement G with pull-up, kill (delete) with pull-down
• Use dynamic logic to reduce the complexity and speed up the circuit
[Figure: chain from Ci,0 through pass transistors P0..P4 with generate devices G0..G4, precharged from VDD.]
Kilburn, et al, IEE Proc, 1959.
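The full-adder equations and the carry chain above can be checked with a small behavioural model. The sketch below is illustrative only: the function names `full_adder`, `carry_node` and `carry_chain_add` are made up for this example, and the model captures only the logic of a chain node (generate forces the carry to 1, kill forces it to 0, propagate passes the incoming carry), not any circuit-level behaviour of a Manchester chain.

```python
def full_adder(a, b, c):
    """One-bit full adder written with generate/propagate signals."""
    g = a & b              # carry-generate: g_i = a_i b_i
    p = a ^ b              # carry-propagate: p_i = a_i xor b_i
    s = p ^ c              # s_i = p_i xor c_i
    c_out = g | (p & c)    # c_{i+1} = g_i + p_i c_i
    return s, c_out


def carry_node(g, p, c_in):
    """Logical view of one carry-chain node (hypothetical helper).

    Exactly one of generate, propagate or kill is active for any a_i, b_i.
    """
    if g:
        return 1           # generate device pulls the node to 1
    if p:
        return c_in        # propagate device passes the incoming carry
    return 0               # kill device forces the node to 0


def carry_chain_add(x, y, n, c_in=0):
    """Add two n-bit integers by rippling the carry node by node."""
    c = c_in
    result = 0
    for i in range(n):
        a = (x >> i) & 1
        b = (y >> i) & 1
        g, p = a & b, a ^ b
        result |= (p ^ c) << i
        c = carry_node(g, p, c)
    return result, c
```

Because the carry visits every node in turn, the loop makes the linear worst-case delay of the ripple structure explicit: the number of iterations equals the number of bits.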
Pass-Transistor Realization in DPL
[Figure: DPL full adder built from XOR/XNOR, AND/NAND and OR/NOR pass-transistor stages with multiplexers and buffers, producing S and CO.]

Carry-Skip Adder
MacSorley, Proc IRE 1/61
Lehman, Burla, IRE Trans on Comp, 12/61

Carry-Skip Adder
[Figure: 4-bit group of full adders with propagate signals P0..P3 and generate signals G0..G3; a multiplexer controlled by BP = P0 P1 P2 P3 forms the bypass around the group.]
Idea: if (P0 and P1 and P2 and P3 = 1) then $C_{o,3} = C_{i,0}$, else "kill" or "generate".
From Rabaey

Carry-Skip Adder: N bits, k bits/group, r = N/k groups
[Figure: r groups of k-bit ripple adders; each group j has an AND gate forming its group propagate P_j and an OR gate merging the skipped carry with the group carry G_j.]
Critical path: delay = 2(k-1) + (N/k - 2)

Carry-Skip Adder
[Figure: propagation delay t_p versus N for the ripple adder and the bypass adder, with the crossover marked at N = 4..8.]
$t_d = 2(k-1)\,t_{RCA} + (N/k - 2)\,t_{SKIP}$

Variable Block Adder (Oklobdzija, Barnes: IBM 1985)

Carry-chain of a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
[Figure: carry chain from C_in to C_out through groups 0..m with group signals G_0..G_m and P_0..P_m; the carry signal path alternates between rippling inside a group and skipping over groups.]
Carry-chain of a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
Block sizes: 1, 3, 4, 5, 6, 5, 4, 3, 1
Any-point-to-any-point delay = 9, as compared to 12 for CSKA.

Carry-chain block size determination for a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
[Figure: block size determination.]

Delay Calculation for Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
[Figure: delay model for a 4-bit group with signals P0..P3, G0..G3, Ci,0, Co,3 and bypass BP.]

Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
Variable Group Length
$t_d = c_1 + c_2 \sqrt{N + c_3}$
Oklobdzija, Barnes, Arith'85

Carry-chain of a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
Variable Block Lengths
• No closed-form solution for delay
• It is a dynamic programming problem

Delay Comparison: Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
[Figure: delay (0 to 16) versus size N (4 to 60) for VBA, CLA and multi-level VBA.]

VLSI Arithmetic, Lecture 4
Prof. Vojin G. Oklobdzija, University of California
http://www.ece.ucdavis.edu/acsel

Review Lecture 3

Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
[Figure: carry chain of a 32-bit Variable Block Adder, repeated from Lecture 3.]
The remaining review slides repeat Lecture 3: the carry chain of the 32-bit Variable Block Adder (any-point-to-any-point delay = 9, as compared to 12 for CSKA), block size determination, the delay model, variable group length, and variable block lengths (no closed-form solution for delay; it is a dynamic programming problem).

Delay Comparison: Variable Block Adder
[Figure: delay (0 to 16) versus size N (4 to 60). VBA: square-root dependency. CLA: log dependency. Multi-level VBA is also plotted.]

Circuit Issues
• Adder speed cannot be estimated based on:
– logic gates in the critical path
– number of transistors in the path
– logic levels in the path
• Estimating adder speed is much more complex, and many of the "fast" schemes may be misleading.
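The unit-delay figures quoted for the variable block adder (9, against 12 for a carry-skip adder with fixed groups) can be reproduced with a short calculation. The sketch below rests on assumptions that are not spelled out in the slides: rippling past one bit costs one unit, skipping one whole block costs one unit, and the block sizes in the 32-bit figure are read as 1, 3, 4, 5, 6, 5, 4, 3, 1 (they sum to 32). The function name `worst_case_carry_delay` is made up for this example. As the circuit-issues slide warns, such gate-count models ignore fan-in, fan-out and wiring, so they indicate a trend rather than real speed.

```python
def worst_case_carry_delay(blocks):
    """Longest carry path, in unit delays, for a carry-skip adder.

    `blocks` lists the block sizes from the least significant end.
    Assumed model: a carry generated at the lowest bit of block i ripples to
    the top of that block (size - 1 units), skips every block strictly
    between i and j (1 unit each), then ripples to the top bit of block j
    (size - 1 units).
    """
    worst = max(size - 1 for size in blocks)   # path confined to one block
    for i in range(len(blocks)):
        for j in range(i + 1, len(blocks)):
            ripple_out = blocks[i] - 1
            skips = j - i - 1
            ripple_in = blocks[j] - 1
            worst = max(worst, ripple_out + skips + ripple_in)
    return worst


fixed_blocks = [4] * 8                          # 32-bit CSKA with k = 4
variable_blocks = [1, 3, 4, 5, 6, 5, 4, 3, 1]   # assumed reading of the VBA figure

print(worst_case_carry_delay(fixed_blocks))     # 12
print(worst_case_carry_delay(variable_blocks))  # 9
```

For the fixed grouping the result equals 2(k-1) + (N/k - 2) with N = 32 and k = 4. Small blocks at both ends shorten the ripple segments of the longest paths, which is where the variable-block arrangement gains its advantage under this model.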
Fan-Out Dependency
[Figure: delay as a function of fan-out.]

Fan-In Dependency
[Figure: delay as a function of fan-in.]
This looks like "Logical Effort" (1985)

Delay Comparison: Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
[Figure: delay comparison.]

Carry-Lookahead Adder (Weinberger and Smith, 1958)
[Photo: ARITH-13, presenting the Achievement Award to Arnold Weinberger of IBM, who invented the CLA adder in 1958.]
Ref: A. Weinberger and J. L. Smith, "A Logic for High-Speed Addition", National Bureau of Standards, Circ. 591, p.3-12, 1958.

CLA Definitions: One-bit adder
$c_{i+1} = g_i + p_i c_i$
$g_i = a_i b_i$
$p_i = a_i \oplus b_i$
$s_i = p_i \oplus c_i$
[Figure: one-bit adder with a multiplexer in the carry path.]

CLA Definitions: 4-bit Adder
[Figure: four one-bit cells with inputs a_i..a_{i+3}, b_i..b_{i+3}, signals g, p and carries C_i..C_{i+4}.]
$c_{i+1} = \bar{a}_i b_i c_i + a_i \bar{b}_i c_i + a_i b_i = g_i + p_i c_i$
$c_{i+2} = g_{i+1} + p_{i+1} c_{i+1} = g_{i+1} + p_{i+1}(g_i + p_i c_i) = g_{i+1} + p_{i+1} g_i + p_{i+1} p_i c_i$

Carry-Lookahead Adder: 4 bits
$c_{i+3} = g_{i+2} + p_{i+2} c_{i+2} = g_{i+2} + p_{i+2}(g_{i+1} + p_{i+1} g_i + p_{i+1} p_i c_i) = g_{i+2} + p_{i+2} g_{i+1} + p_{i+2} p_{i+1} g_i + p_{i+2} p_{i+1} p_i c_i$
$c_{i+4} = g_{i+3} + p_{i+3} c_{i+3} = g_{i+3} + p_{i+3}(g_{i+2} + p_{i+2} g_{i+1} + p_{i+2} p_{i+1} g_i + p_{i+2} p_{i+1} p_i c_i) = g_{i+3} + p_{i+3} g_{i+2} + p_{i+3} p_{i+2} g_{i+1} + p_{i+3} p_{i+2} p_{i+1} g_i + p_{i+3} p_{i+2} p_{i+1} p_i c_i$
The first four terms of $c_{i+4}$ form the group generate $G_j$; the product $p_{i+3} p_{i+2} p_{i+1} p_i$ is the group propagate $P_j$.

Carry-Lookahead Adder
$G_j = g_{i+3} + p_{i+3} g_{i+2} + p_{i+3} p_{i+2} g_{i+1} + p_{i+3} p_{i+2} p_{i+1} g_i$
$P_j = p_{i+3} p_{i+2} p_{i+1} p_i$
$c_{4(j+1)} = G_j + P_j c_{4j}$
[Figure: 4-bit group with P, G outputs and internal carries C4j+1, C4j+2, C4j+3.]
• One gate delay to calculate p, g
• One gate delay to calculate P and two for G
• Three gate delays to calculate $C_{4(j+1)}$
Compare that to 8 in RCA!
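The expanded carry equations of the 4-bit lookahead group can be checked exhaustively. This is a minimal sketch with made-up names (`to_bits`, `from_bits`, `cla4`); it evaluates each carry directly from g, p and the group carry-in, as the equations above do, and makes no claim about gate-level structure or timing.

```python
def to_bits(x, n):
    """Integer to a list of n bits, least significant bit first."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    """List of bits (LSB first) back to an integer."""
    return sum(bit << i for i, bit in enumerate(bits))


def cla4(a, b, c0):
    """4-bit carry-lookahead group; a and b are 4-bit lists, LSB first.

    Returns (sum bits, carry out, group generate G, group propagate P).
    """
    g = [ai & bi for ai, bi in zip(a, b)]
    p = [ai ^ bi for ai, bi in zip(a, b)]
    # Each carry is a two-level function of g, p and c0 only.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    P = p[3] & p[2] & p[1] & p[0]
    c4 = G | (P & c0)                      # c_{4(j+1)} = G_j + P_j c_{4j}
    s = [pi ^ ci for pi, ci in zip(p, [c0, c1, c2, c3])]
    return s, c4, G, P
```

For example, `cla4(to_bits(9, 4), to_bits(7, 4), 0)` returns sum bits that decode to 0 with a carry-out of 1, since 9 + 7 = 16.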
Carry-Lookahead Adder (Weinberger and Smith)
$G^*_j = G_{i+3} + P_{i+3} G_{i+2} + P_{i+3} P_{i+2} G_{i+1} + P_{i+3} P_{i+2} P_{i+1} G_i$
$P^*_j = P_{i+3} P_{i+2} P_{i+1} P_i$
$c_{4(j+1)} = G^*_k + P^*_k c_{4j}$
[Figure: carry-lookahead super-block combining the group signals G_j..G_{j+3} and P_j..P_{j+3} into G*, P* and the carries C4j+1, C4j+2, C4j+3, C4(j+1).]
Additional two gate delays: C16 will take a total of 5 vs. 32 for RCA!

32-bit Carry Lookahead Adder
[Figure: three layers. Individual adders generate g_i, p_i and the sum S_i. Carry-lookahead blocks of 4 bits generate G_i, P_i and C_in for the adders. Carry-lookahead super-blocks of 4-bit blocks generate G*_i, P*_i and C_in for the 4-bit blocks (C4, C8, C12, C16, C20, C24, C28). A final group produces C_out and C16.]
Critical path delay = 1 (for g_i, p_i) + 2x2 (for G, P) + 3x2 (for C_in) + 1 XOR delay (for the sum) = approximately 12 gate delays.

Carry-Lookahead Adder (Weinberger and Smith: original derivation, 1958)
[Figures: pages from the original derivation.]

Carry-Lookahead Adder (Weinberger and Smith)
[Figures: original schematic.] Please notice the similarity with parallel-prefix adders!

Motorola: CLA Implementation Example
A. Naini, D. Bearden and W. Anderson, "A 4.5nS 96b CMOS Adder Design", Proceedings of the IEEE Custom Integrated Circuits Conference, May 3-6, 1992.
Critical path in Motorola's 64-bit CLA
[Figure: 64-bit CLA built from PG blocks and carry blocks, with group signals such as P3:0/G3:0, P15:0/G15:0, P47:0/G47:0, P63:0/G63:0 and carries C4 through C64. Annotated delays: 1.05nS, 1.7nS, 2.0nS, 2.35nS, 2.7nS, 3.75nS, 4.8nS.]
Critical path: A, B - G0 - G3:0 - G15:0 - G47:0 - C48 - C60 - C63 - S63

Motorola's 64-bit CLA: conventional PG block
• The carry ripples locally, with 5 transistors in the path.
• No better situation here! Basically, this is MCC performance with carry-skip. One should not expect any better results than VBA.

Motorola's 64-bit CLA: modified PG block
• Intermediate propagate signals Pi:0 are generated to speed up C3.
• The critical path still resembles MCC.

Motorola's 64-bit CLA
[Figure: the same 64-bit CLA with a second set of annotated delays: 1.8nS, 2.2nS, 2.9nS, 3.2nS, 3.55nS, 3.9nS.]
Delay Optimized CLA
B. Lee, V. G. Oklobdzija, Journal of VLSI Signal Processing, Vol.3, No.4, October 1991

Delay Optimized CLA: Lee-Oklobdzija '91
(a.) Fixed groups and levels
(b.) variable-sized groups, fixed levels
(c.) variable-sized groups and fixed levels
(d.) variable-sized groups and levels

Two-Levels of Logic Implementation of the Carry Block
[Figure.]

Two-Levels of Logic Implementation of the Carry-Lookahead Block
[Figure.]

Three-Levels of Logic Implementation of the Carry Block (restricted fan-in)
[Figure.]

Three-Levels of Logic Implementation of the Carry-Lookahead Block (restricted fan-in)
[Figure.]

Delay Optimized CLA: Lee-Oklobdzija '91
[Figures: delay of the three-level BCLA and delay of the two-level BCLA.]

Delay Optimized CLA: Lee-Oklobdzija '91
(a.) 2-level BCLA = 8.5nS
(b.) 3-level BCLA = 8.9nS

Ling's Adder
Huey Ling, "High-Speed Binary Adder", IBM Journal of Research and Development, Vol.25, No.3, 1981.
Used in: IBM 3033, IBM 168, Amdahl V6, HP etc.
Ling's Derivations
Define:
$C_{i+1} = g_i + p_i C_i$
$H_{i+1} = C_{i+1} + C_i$
$g_i$ implies $C_{i+1}$, which implies $H_{i+1}$; thus $g_i = g_i H_{i+1}$.

a_i  b_i  p_i  g_i  t_i
0    0    0    0    0
0    1    1    0    1
1    0    1    0    1
1    1    0    1    1

$p_i C_{i+1} = p_i g_i + p_i p_i C_i = p_i C_i$
$p_i H_{i+1} = p_i C_{i+1} + p_i C_i = p_i C_i$
$C_{i+1} = g_i + p_i C_i = g_i H_{i+1} + p_i C_i = g_i H_{i+1} + p_i H_{i+1} = t_i H_{i+1}$

Ling's Derivations
From $C_{i+1} = g_i + p_i C_i$ and $H_{i+1} = C_{i+1} + C_i$:
$H_{i+1} = g_i + p_i C_i + C_i = g_i + C_i$
Because $C_{i+1} = t_i H_{i+1}$:
$H_{i+1} = g_i + t_{i-1} H_i$ (fundamental expansion)
Now we need to derive the sum equation.

Ling Adder
Variation of CLA; Ling's equations:
$p_i = a_i \oplus b_i$, $t_i = a_i + b_i$, $g_i = a_i b_i$
CLA: $C_{i+1} = g_i + p_i C_i$ and $S_i = p_i \oplus C_i$
Ling: $H_{i+1} = g_i + t_{i-1} H_i$ and $S_i = (t_i \oplus H_{i+1}) + g_i t_{i-1} H_i$
Ling, IBM J. Res. Dev, 5/81

Ling Adder
Variation of CLA:
CLA: $C_{i+1} = g_i + t_i C_i$
Ling's equation: $H_{i+1} = g_i + t_{i-1} H_i$
[Figure: carry-cell transfer functions.]
Ling uses a different transfer function. Four such functions have the desired properties (Ling's is one of them); see: Doran, IEEE Trans on Comp. Vol 37, No.9 Sept. 1988.

Ling Adder
Conventional (fan-in of 5):
$C_4 = g_3 + t_3 g_2 + t_3 t_2 g_1 + t_3 t_2 t_1 g_0 + t_3 t_2 t_1 t_0 C_{in}$
Ling (fan-in of 4):
$H_4 = g_3 + t_2 g_2 + t_2 t_1 g_1 + t_2 t_1 t_0 g_0 + t_2 t_1 t_0 t_{-1} C_{in}$
$H_4 = g_3 + g_2 + t_2 g_1 + t_2 t_1 g_0 + t_2 t_1 t_0 C_{in}$

Advantages of Ling's Adder
• Uniform loading in fan-in and fan-out
• H16 contains 8 terms, as compared to G16, which contains 15.
• H16 can be implemented with one level of logic (in ECL), while G16 cannot.
(Ling's adder takes full advantage of wired-OR, which is of special importance when ECL technology is used.)

VLSI Arithmetic, Lecture 5
Prof. Vojin G. Oklobdzija, University of California
http://www.ece.ucdavis.edu/acsel

Review Lecture 4

Ling's Adder
Huey Ling, "High-Speed Binary Adder", IBM Journal of Research and Development, Vol.25, No.3, 1981.
Used in: IBM 3033, IBM S370/168, Amdahl V6, HP etc.
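Ling's recurrence and sum equation can be checked numerically against ordinary addition. The sketch below is a bit-serial model only, with a made-up function name `ling_add`; the boundary conditions H_0 = C_in and t_{-1} = 1 are assumptions chosen so that C_0 = t_{-1} H_0 equals the carry-in. A real Ling adder evaluates H in parallel, which this loop does not attempt to show.

```python
def ling_add(x, y, n, c_in=0):
    """n-bit addition that propagates Ling's pseudo-carry H instead of C."""
    h = c_in       # H_0, assumed equal to the carry-in
    t_prev = 1     # t_{-1}, assumed 1 so that C_0 = t_{-1} H_0 = c_in
    result = 0
    for i in range(n):
        a = (x >> i) & 1
        b = (y >> i) & 1
        g = a & b                              # g_i = a_i b_i
        t = a | b                              # t_i = a_i + b_i
        h_next = g | (t_prev & h)              # H_{i+1} = g_i + t_{i-1} H_i
        s = (t ^ h_next) | (g & t_prev & h)    # S_i = (t_i xor H_{i+1}) + g_i t_{i-1} H_i
        result |= s << i
        h, t_prev = h_next, t
    c_out = t_prev & h                         # C_n = t_{n-1} H_n
    return result, c_out
```

The true carry never appears inside the loop; it is recovered only at the end from C_n = t_{n-1} H_n, which mirrors how the H recurrence removes one factor from every term of the lookahead expression.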
Ling's Derivations
Define:
$C_{i+1} = g_i + p_i C_i$
$H_{i+1} = C_{i+1} + C_i$
$g_i$ implies $C_{i+1}$, which implies $H_{i+1}$; thus $g_i = g_i H_{i+1}$.

a_i  b_i  p_i  g_i  t_i
0    0    0    0    0
0    1    1    0    1
1    0    1    0    1
1    1    0    1    1

$p_i C_{i+1} = p_i g_i + p_i p_i C_i = p_i C_i$
$p_i H_{i+1} = p_i C_{i+1} + p_i C_i = p_i C_i$
$C_{i+1} = g_i + p_i C_i = g_i H_{i+1} + p_i C_i = g_i H_{i+1} + p_i H_{i+1} = t_i H_{i+1}$

Ling's Derivations
From $C_{i+1} = g_i + p_i C_i$ and $H_{i+1} = C_{i+1} + C_i$:
$H_{i+1} = g_i + p_i C_i + C_i = g_i + C_i$
Because $C_{i+1} = t_i H_{i+1}$:
$H_{i+1} = g_i + t_{i-1} H_i$ (fundamental expansion)
Now we need to derive the sum equation.

Ling Adder
Variation of CLA; Ling's equations:
$p_i = a_i \oplus b_i$, $t_i = a_i + b_i$, $g_i = a_i b_i$
CLA: $C_{i+1} = g_i + p_i C_i$ and $S_i = p_i \oplus C_i$
Ling: $H_{i+1} = g_i + t_{i-1} H_i$ and $S_i = (t_i \oplus H_{i+1}) + g_i t_{i-1} H_i$
Ling, IBM J. Res. Dev, 5/81

Ling Adder
Variation of CLA:
CLA: $C_{i+1} = g_i + t_i C_i$
Ling's equation: $H_{i+1} = g_i + t_{i-1} H_i$
[Figure: two adjacent bit cells. Cell i takes a_i, b_i, forms g_i, t_i and produces H_{i+1}, c_{i+1} and s_i; cell i-1 takes a_{i-1}, b_{i-1}, forms g_{i-1}, t_{i-1} and produces H_i, c_i and s_{i-1}.]
Ling uses a different transfer function. Four such functions have the desired properties (Ling's is one of them); see: Doran, IEEE Trans on Comp. Vol 37, No.9 Sept. 1988.

Ling Adder
Conventional (fan-in of 5):
$C_4 = g_3 + t_3 g_2 + t_3 t_2 g_1 + t_3 t_2 t_1 g_0 + t_3 t_2 t_1 t_0 C_{in}$
Ling (fan-in of 4):
$H_4 = g_3 + t_2 g_2 + t_2 t_1 g_1 + t_2 t_1 t_0 g_0 + t_2 t_1 t_0 t_{-1} C_{in}$
$H_4 = g_3 + g_2 + t_2 g_1 + t_2 t_1 g_0 + t_2 t_1 t_0 C_{in}$

Advantages of Ling's Adder
• Uniform loading in fan-in and fan-out
• H16 contains 8 terms, as compared to G16, which contains 15.
• H16 can be implemented with one level of logic (in ECL), while G16 cannot (with 8-way wired-OR).
(Ling's adder takes full advantage of wired-OR, which is of special importance when ECL technology is used. The limitation at IBM was a fan-in of 4 and a wired-OR of 8.)

Ling: Weinberger Notes
[Figures: three pages of notes.]

Advantage of Ling's Adder
• 32-bit adder used in: IBM 3033, IBM S370/Model 168, Amdahl V6.
• Implements 32-bit addition in 3 levels of logic
• Implements 32-bit AGEN (B + Index + Disp) in 4 levels of logic, rather than 6
• 5 levels of logic for the 64-bit adder used in the HP processor

Implementation of Ling's Adder in CMOS
(S. Naffziger, "A Subnanosecond 64-b Adder", ISSCC '96)

S. Naffziger, ISSCC'96
$H_4 = g_3 + g_2 + t_2 g_1 + t_2 t_1 g_0$
$C_{i+1} = t_i H_{i+1}$
[Figures: circuit diagrams implementing H4.]

S. Naffziger, ISSCC'96
$C_{16} = p_{15} H_{16} = p_{15}\,(g_{15} + g_{11} + t_{11} g_7 + t_{11} t_7 g_0)$
[Figures: further circuit diagrams.]
Ling Adder Critical Path
[Figure.]

Ling Adder: Circuits
[Figure: dynamic circuits, clocked by CK, generating the group signals G1..G4 and P1..P4 from inputs A0..A3, B0..B3, together with the conditional-sum selection between SumL/SumH using C0L, C0H, C1L, C1H, LCL, LCH.]

LCS4 – Critical G Path
[Figure: path from the inputs through 4b (k,p) or (g,p) cells, G3/G4/P4, a 12b stage to C15, and a 32b stage to C47 (alongside C31 and C15), then a 16b stage producing S63, S62, ..., S48.]

LCS4 – Logical Effort Delay
Prefix-4 Ling/Conditional-Sum (Dynamic - Long Carry Path)

Stage          Branch   LE     Parasitic
dg3# (dg3)     4.0      0.98   2.97
g4 (NAND2)     2.0      1.11   1.84
C15# (GG4)     1.0      1.01   1.80
C15 (INV)      1.0      1.00   1.00
C47# (LC)      3.0      1.03   3.32
C47 (INV)      1.0      1.00   1.00
C47#b (INV)    1.0      1.00   1.00
C47b (INV)     1.0      1.00   1.00
S63# (SUM)     16.0     0.86   1.36
S63 (INV)      1.0      1.00   1.00

Total Branch: 3.84E+02; Total LE: 9.73E-01; Total Path Effort: 3.74E+02; fo,opt: 1.81
Effort Delay: 66 ps; Parasitic Delay: 70 ps; Total Delay: 136 ps; Total Delay: 7.2 FO4

Results:
• 0.5u Technology
• Speed: 0.930 nS
• Nominal process, 80C, V=3.3V
See: S. Naffziger, "A Subnanosecond 64-b Adder", ISSCC '96

Prefix Adders and Parallel Prefix Adders
from: Ercegovac-Lang

Prefix Adders
The following recurrence operation is defined:
$(g, p) \circ (g', p') = (g + p g',\; p p')$
such that:
$(G_i, P_i) = (g_0, p_0)$ for $i = 0$
$(G_i, P_i) = (g_i, p_i) \circ (G_{i-1}, P_{i-1})$ for $1 \le i \le n$
$c_{i+1} = G_i$ for $i = 0, 1, \ldots, n$
$c_1 = g_0 + p_0 c_{in}$
$(g_{-1}, p_{-1}) = (c_{in}, c_{in})$
This operation is associative, but not commutative.
It can also span a range of bits (overlapping and adjacent).
[Figure, from Ercegovac-Lang.]

Parallel Prefix Adders: variety of possibilities
[Figure, from Ercegovac-Lang.]

Pyramid Adder: M.
Lehman, "A Comparative Study of Propagation Speed-up Circuits in Binary Arithmetic Units", IFIP Congress, Munich, Germany, 1962.

Parallel Prefix Adders: variety of possibilities
[Figures, from Ercegovac-Lang.]

Hybrid BK-KS Adder
[Figure.]

Parallel Prefix Adders: S. Knowles 1999
• The operation is associative (h > i ≥ j ≥ k).
• The operation is idempotent (h > i ≥ j ≥ k).
• It produces the carry (c_in = 0).

Parallel Prefix Adders: Ladner-Fischer
Exploits associativity, but not idempotency. Produces minimal logical depth.

Parallel Prefix Adders: Ladner-Fischer (16,8,4,2,1)
• Two wires at each level.
• Uniform fan-in of two.
• Large fan-out (of 16; n/2).
• Large capacitive loading combined with the long wires (in the last stages).

Parallel Prefix Adders: Kogge-Stone
• Exploits idempotency to limit the fan-out to 1.
• Dramatic increase in wires.
• The wire span remains the same as in Ladner-Fischer.
• Buffers are needed in both cases: K-S and L-F.

Kogge-Stone Adder
[Figure.]

Parallel Prefix Adders: Brent-Kung
• Sets the fan-out to one
• Avoids the explosion of wires (as in K-S)
• Makes no sense in CMOS:
– the fan-out = 1 limit is arbitrary and extreme
– much of the capacitive load is due to wire anyway
• It is more efficient to insert buffers in L-F than to use the B-K scheme

Brent-Kung Adder
[Figure.]

Parallel Prefix Adders: Han-Carlson
• Is a hybrid synthesis of L-F and K-S
• Trades an increase in logic depth for a reduction in fan-out:
– effectively a higher-radix variant of K-S.
– others do it similarly by serializing the prefix computation at the higher fan-out nodes.
• Others similarly trade logical depth for a reduction of fan-out and wire.

Parallel Prefix Adders: variety of possibilities
[Figure, from Knowles: the family of adders is bounded by L-F and K-S at the ends.]

Parallel Prefix Adders: variety of possibilities (Knowles 1999)
The following rules are used:
• Lateral wires at the jth level span 2^j bits
• Lateral fan-out at the jth level is a power of 2 up to 2^j
• Lateral fan-out at the jth level cannot exceed that at the (j+1)th level.

Parallel Prefix Adders: variety of possibilities (Knowles 1999)
• The number of minimal-depth graphs of this type is given in: [table not recovered]
• At 4 bits there is only K-S and L-F; afterwards there are several new possibilities.

Parallel Prefix Adders: variety of possibilities (Knowles 1999)
[Figures: example of a new 32-bit adder [4,4,2,2,1].]

Parallel Prefix Adders: variety of possibilities (Knowles 1999)
• Delay is given in terms of FO4 inverter delay, worst case (the nominal case is 40-50% faster)
• K-S is the fastest
• K-S adders are wire limited (requiring 80% more area)
• The difference between the examined schemes is less than 15%

Parallel Prefix Adders: variety of possibilities (Knowles 1999)
Conclusion
• Irregular, hybrid schemes are possible
• The speed-up of 15% is achieved at the cost of large wiring, hence area and power
• Circuits close in speed to K-S are available at significantly lower wiring cost

VLSI Arithmetic, Lecture 6
Prof. Vojin G.
Oklobdzija, University of California
http://www.ece.ucdavis.edu/acsel

Review Lecture 5

Prefix Adders and Parallel Prefix Adders
from: Ercegovac-Lang

Prefix Adders
The following recurrence operation is defined:
$(g, p) \circ (g', p') = (g + p g',\; p p')$
such that:
$(G_i, P_i) = (g_0, p_0)$ for $i = 0$
$(G_i, P_i) = (g_i, p_i) \circ (G_{i-1}, P_{i-1})$ for $1 \le i \le n$
$c_{i+1} = G_i$ for $i = 0, 1, \ldots, n$
$c_1 = g_0 + p_0 c_{in}$
$(g_{-1}, p_{-1}) = (c_{in}, c_{in})$
This operation is associative, but not commutative.
It can also span a range of bits (overlapping and adjacent).

Parallel Prefix Adders: S. Knowles 1999
• The operation is associative (h > i ≥ j ≥ k).
• The operation is idempotent (h > i ≥ j ≥ k).
• It produces the carry (c_in = 0).

Parallel Prefix Adders: variety of possibilities
[Figures, from Ercegovac-Lang.]

Kogge-Stone Adder
[Figure.]

Brent-Kung Adder
[Figure.]

Hybrid BK-KS Adder
[Figure.]

Pyramid Adder: M. Lehman, "A Comparative Study of Propagation Speed-up Circuits in Binary Arithmetic Units", IFIP Congress, Munich, Germany, 1962.

Parallel Prefix Adders: Ladner-Fischer
Exploits associativity, but not idempotency. Produces minimal logical depth.

Parallel Prefix Adders: Ladner-Fischer (16,8,4,2,1)
• Two wires at each level.
• Uniform fan-in of two.
• Large fan-out (of 16; n/2).
• Large capacitive loading combined with the long wires (in the last stages).

Parallel Prefix Adders: Kogge-Stone
• Exploits idempotency to limit the fan-out to 1.
• Dramatic increase in wires.
• The wire span remains the same as in Ladner-Fischer.
• Buffers are needed in both cases: K-S and L-F.

Parallel Prefix Adders: Brent-Kung
• Sets the fan-out to one
• Avoids the explosion of wires (as in K-S)
• Makes no sense in CMOS:
– the fan-out = 1 limit is arbitrary and extreme
– much of the capacitive load is due to wire anyway
• It is more efficient to insert buffers in L-F than to use the B-K scheme

Two Parallel Prefix Adder Structures
[Figure: 16-bit Kogge-Stone and Han-Carlson prefix networks, each with stages (G1,P1) to (G4,P4) producing C1..C15 and Cout.]
Kogge-Stone:
• log(bits) carry stages
• Extra wiring
Han-Carlson:
• log(bits) + 1 carry stages
• Reduced wiring and gates

Parallel Prefix Adders: Han-Carlson
• Is a hybrid synthesis of L-F and K-S
• Trades an increase in logic depth for a reduction in fan-out:
– effectively a higher-radix variant of K-S.
– others do it similarly by serializing the prefix computation at the higher fan-out nodes.
• Others similarly trade logical depth for a reduction of fan-out and wire.

Parallel Prefix Adders: variety of possibilities
[Figure, from Knowles: the family of adders is bounded by L-F and K-S at the ends.]

Parallel Prefix Adders: variety of possibilities (Knowles 1999)
The following rules are used:
• Lateral wires at the jth level span 2^j bits
• Lateral fan-out at the jth level is a power of 2 up to 2^j
• Lateral fan-out at the jth level cannot exceed that at the (j+1)th level.

Parallel Prefix Adders: variety of possibilities (Knowles 1999)
• The number of minimal-depth graphs of this type is given in: [table not recovered]
• At 4 bits there is only K-S and L-F; afterwards there are several new possibilities.
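The prefix operator and the Kogge-Stone arrangement reviewed above can be sketched in a few lines. This is an illustrative model, not a description of any particular implementation: the names `prefix_op` and `kogge_stone_add` are made up, and the carry-in is folded into the generate signal of bit 0 (one of several possible conventions) so that G_i is directly the carry out of bit i.

```python
def prefix_op(left, right):
    """(g, p) o (g', p') = (g + p g', p p'); `left` spans the higher bits."""
    g, p = left
    g2, p2 = right
    return (g | (p & g2), p & p2)


def kogge_stone_add(x, y, n, c_in=0):
    """n-bit addition with a Kogge-Stone prefix network."""
    a = [(x >> i) & 1 for i in range(n)]
    b = [(y >> i) & 1 for i in range(n)]
    p = [ai ^ bi for ai, bi in zip(a, b)]
    gp = [(a[i] & b[i], p[i]) for i in range(n)]
    # Assumed convention: absorb the carry-in into bit 0, c_1 = g_0 + p_0 c_in.
    gp[0] = (gp[0][0] | (p[0] & c_in), p[0])
    span = 1
    while span < n:
        # One level: every position i >= span combines with position i - span.
        # The lateral span doubles per level and each node feeds one neighbour.
        gp = [prefix_op(gp[i], gp[i - span]) if i >= span else gp[i]
              for i in range(n)]
        span *= 2
    carries = [c_in] + [g for g, _ in gp]      # c_{i+1} = G_i
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total, carries[n]
```

The `while` loop runs about log2(n) times, which is the logical depth of the network, and every level touches nearly all n positions, which is where the large wire count of Kogge-Stone comes from.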
Parallel Prefix Adders: variety of possibilities (Knowles 1999)
[Figures: example of a new 32-bit adder [4,4,2,2,1].]

Parallel Prefix Adders: variety of possibilities (Knowles 1999)
• Delay is given in terms of FO4 inverter delay, worst case (the nominal case is 40-50% faster)
• K-S is the fastest
• K-S adders are wire limited (requiring 80% more area)
• The difference between the examined schemes is less than 15%

Parallel Prefix Adders: variety of possibilities (Knowles 1999)
Conclusion
• Irregular, hybrid schemes are possible
• The speed-up of 15% is achieved at the cost of large wiring, hence area and power
• Circuits close in speed to K-S are available at significantly lower wiring cost

Possibilities for Further Research
• The logical depth is important (Knowles was right)
• The fan-out is less important than the fan-in (Knowles was wrong):
– It is possible to examine a variety of topologies with restricted and varied fan-in.
• Driving strength and Logical Effort rules were overlooked, or at least neglected:
– It is possible to create a number of topologies taking LE rules into account.
– It is further possible to combine the rules with a compound domino implementation, taking advantage of the two different rules governing "dynamic" and "static".
• It is still possible to produce a better adder!

Other Types of Adders

Conditional Sum Adder
J. Sklansky, "Conditional-Sum Addition Logic", IRE Transactions on Electronic Computers, EC-9, p.226-231, 1960.
Conditional Sum Adder
[Figures, from Ercegovac-Lang.]

Carry-Select Adder
O. J. Bedrij, "Carry-Select Adder", IRE Transactions on Electronic Computers, June 1962, p.340-34

Carry-Select Sum Adder
[Figure, from Ercegovac-Lang.]

Carry-Select Adder
Addition is performed under both assumptions, Cin = 0 and Cin = 1.

Carry Select Adder: combining two 32-b VBAs in select mode
Delay = VBA32 + MUX

Carry-Select Adder
O.J. Bedrij, IBM Poughkeepsie, 1962
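The carry-select idea, computing each block under both carry-in assumptions and letting the real carry drive only a multiplexer, can be sketched as follows. The names `block_add` and `carry_select_add` are made up for this example, and `block_add` uses Python integer addition as a stand-in for whatever block adder is chosen (the slides combine two 32-bit VBAs); only the selection structure is being illustrated.

```python
def block_add(x, y, width, c_in):
    """Stand-in for any `width`-bit adder block: (sum mod 2**width, carry out)."""
    total = x + y + c_in
    return total & ((1 << width) - 1), total >> width


def carry_select_add(x, y, n, k, c_in=0):
    """n-bit addition from k-bit blocks, each evaluated for carry-in 0 and 1."""
    result, carry = 0, c_in
    for lo in range(0, n, k):
        width = min(k, n - lo)                    # last block may be narrower
        mask = (1 << width) - 1
        xs, ys = (x >> lo) & mask, (y >> lo) & mask
        sum0, c0 = block_add(xs, ys, width, 0)    # assumes carry-in = 0
        sum1, c1 = block_add(xs, ys, width, 1)    # assumes carry-in = 1
        # In hardware both results are ready before the carry arrives;
        # the incoming carry only selects between them.
        s, carry = (sum1, c1) if carry else (sum0, c0)
        result |= s << lo
    return result, carry
```

With two blocks, the delay of this structure is one block addition plus one multiplexer, which matches the "Delay = VBA32 + MUX" expression above when each block is a 32-bit VBA.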