Lecture 17: Adders Outline Datapath Computer Arithmetic Principles Single-bit Addition Carry-Ripple Adder Carry-Skip Adder Carry-Lookahead Adder Carry-Select Adder Carry-Increment Adder Tree Adder 17: Adders CMOS VLSI Design 4th Ed. 2 A Generic Digital Processor 17: Adders CMOS VLSI Design 4th Ed. 3 Building Blocks for Digital Architectures Arithmetic unit – Bit sliced data path – adder, multiplier, shifter, comparator, etc. Memory – RAM, ROM, buffers, shift registers Control – Finite state machine (PLA, random logic) – Counters Interconnect – Switches, arbiters, bus 17: Adders CMOS VLSI Design 4th Ed. 4 An Intel Microprocessor 17: Adders CMOS VLSI Design 4th Ed. 5 Bit-Sliced Design 17: Adders CMOS VLSI Design 4th Ed. 6 Bit-Sliced Datapath 17: Adders CMOS VLSI Design 4th Ed. 7 Itanium Integer Datapath 17: Adders CMOS VLSI Design 4th Ed. 8 Motivation Arithmetic units are, among others, core of every data path and addressing unit. Data path is at the core of – microprocessors (CPU) – signal processors (DSP) – data processing application specific IC’s (ASIC) and programmable IC’s (FPGA) Standard arithmetic units available from libraries Design of arithmetic units necessary for – non-standard operations – high performance components – library development 17: Adders CMOS VLSI Design 4th Ed. 9 Naming Conventions Signal busses: A (1-D), Ai, (2-D), ai:k (sub-bus, 1-D) Signals: a, ai (1-D), ai,k (2-D), Ai:k (group signal) Circuit complexity measures: A (Area), T (cycle time, delay), AT (area-time product), L (latency, number of cycles). Arithmetic operators: +, -, •, /, log (=log2) Logic operators: OR, AND, XOR, NOT, … 17: Adders CMOS VLSI Design 4th Ed. 10 Circuit Complexity Measures Unit gate model – Inverter, buffer: A = 0, T = 0 – Simple monotonic 2-input gates (AND, OR, NAND, NOR): A = 1, T = 1 – Simple non-monotonic 2-input gates (XOR, XNOR): A = 2, T = 2 – Simple m-input gates: A = m – 1, T = log m – Wiring not considered – Only for estimation purposes 17: Adders CMOS VLSI Design 4th Ed. 11 Recursive Function Evaluation Given: inputs ai, outputs zi, function f (graph sym. •) Non-recursive functions (n.) – Output zi is a function of input ai zi f ai, x; i 0,...,n 1 – Parallel structure A On, T O1 17: Adders CMOS VLSI Design 4th Ed. 12 Recursive Function Evaluation Recursive functions (r.) – Output zi is a function of all inputs ak, k ≤ i • with a single output z = zn-1 (r.s.): ti f ai ,t i1; i 0,...,n 1 t1 0/1, z t n 1 – f is non-associative (r.s.n) » serial structure A On, T On – f is associative (r.s.a) » serial or single-tree structure 17: Adders A On, T Ologn CMOS VLSI Design 4th Ed. 13 Recursive Function Evaluation – Output zi is a function of all inputs ak, k ≤ i • multiple outputs zi (r.m.) (=> prefix problem) zi f ai,zi1; i 0,...,n 1, z1 0/1 – f is non-associative (r.m.n) » serial structure A On, T On – f is associative (r.m.a) » Serial or multi-tree structure A On 2 , T Ologn » Shared tree structure 17: Adders A Onlogn, T Ologn CMOS VLSI Design 4th Ed. 14 Arithmetic Operations Overview 17: Adders CMOS VLSI Design 4th Ed. 15 Overview of Arithmetic Operations Direct implementation of dedicated units – always: 1 – 5 – in most cases: 6 – sometimes: 7, 8 Sequential implementation using simpler units and several clock cycles (decomposition) – sometimes: 6 – in most cases: 7, 8, 9 Table look-up techniques using ROMs – universal: simple application to all operations – efficient only for single-operand operations of high complexity (8 - 12) and small word length. 17: Adders CMOS VLSI Design 4th Ed. 16 Overview of Arithmetic Operations Approximation using simpler units: 7 – 12 – Taylor series expansion – polynomial and rational approximations – convergence of recursive equation systems – CORDIC (COordinate Rotation DIgital Computer) 17: Adders CMOS VLSI Design 4th Ed. 17 Binary Number Systems Radix-2, binary number system (BNS): irredundant, weighted, positional, monotonic. n-bit number is an ordered sequence of bits (binary A an1,an2,...,a0 2, ai 0,1 digits) Simple and efficient implementation in digital circuits MSB/LSB (most/least significant bit): an-1/a0 Represents an integer or fixed point number, exact. Fixed point numbers: am1,...,a0 . a1,...,amn m-bit integer 17: Adders n-m bit fraction CMOS VLSI Design 4th Ed. 18 Binary Number Systems Unsigned: positive or natural numbers n 1 – Value: A an 12n 1 ... a1 2 a0 ai 2i i0 – Range: 0, 2n 1 Two’s (2’s) complement: standard representation of or integer numbers signed n 2 – Value n 1 i A an 1 2 – Range 17: Adders 2 ai 2 i0 , 2n 1 1 n 1 CMOS VLSI Design 4th Ed. 19 Binary Number Systems – Complement: A 2n A A 1, where A an 1,an 2,...,a0 – Sign: an-1 – Properties: asymmetric range, compatible with unsigned numbers in many arithmetic operations. (same treatment of positive and negative numbers) One’s (1’s) complement: similar to 2’s complement n 2 – Value: A a 2n 1 1 a 2i n 1 – Range: 17: Adders i i0 2 n 1 1, 2 n 1 1 CMOS VLSI Design 4th Ed. 20 Binary Number Systems n – Complement: A 2 A 1 A – Sign: an-1 – Properties: double representation of zero, symmetric range, modulo (2n-1) number system. Sign-magnitude: alternative representation of signed numbers n 2 a – Value: A 1 ai 2i n1 i0 n 1 n 1 2 1 , 2 1 – Range: – Complement: A a ,a ,...,a n1 n2 0 17: Adders CMOS VLSI Design 4th Ed. 21 Binary Number Systems Sign: an-1 Properties: double representation of zero, symmetric range, different treatment of positive and negative numbers in arithmetic operations, no MSB toggles at sign changes around 0 (=> low power) 17: Adders CMOS VLSI Design 4th Ed. 22 Gray Numbers Gray numbers (code): binary, irredundant, nonweighted, non-monotonic. – Property: unit-distance coding. Exactly one-bit toggles between adjacent numbers. – Applications: counters with low output toggle rate (low power busses), representation of continuous signals for low-error sampling (no false numbers due to switching of different bits at different times). – Non-monotonic numbers: difficult arithmetic operations (addition, comparison). 17: Adders CMOS VLSI Design 4th Ed. 23 Gray Numbers – Binary - Gray conversion gi bi1 bi , bn 0; i 0,...,n 1 (n.) – Gray – binary conversion bi bi1 gi , bn 0 i n 1,...,0 (r.m.a) 17: Adders CMOS VLSI Design 4th Ed. 24 Redundant Number Systems Non-binary, redundant, weighted number systems. Digit set larger than radix (typically radix 2) => multiple representations of the same number => redundancy. No carry propagation in adders => more efficient implementation of adder-based units (multipliers, dividers, etc.) Redundancy => no direct implementation of relational operators => conversion to irredundant numbers. Several bits used to represent one digit => higher storage requirements. Expensive conversion to irredundant numbers. Not necessary if redundant input operators are allowed. 17: Adders CMOS VLSI Design 4th Ed. 25 Delayed-Carry Representation Delayed-carry or half adder representation ri 0,1,2 , c i ,si ,ai ,bi 0,1 , ri c i1,si 2c i1 si ai bi , c i1si 0 n 1 R ri 2 i C,S C S A B i0 1 digit holds the sum of 2 bits (no carry out) Example: 01 + 01 = (0,0) (1,0) = 2 17: Adders CMOS VLSI Design 4th Ed. 26 Carry-Save Representation ri 0,1,2,3 , c i ,si ,ai ,bi ,di 0,1 , ri c i1,si 2c i1 si ai bi di ai ri n 1 R ri 2 i C,S C S A R i0 One digit holds the sum of 3 bits or 1 digit and 1 bit. No carry-out digit, carry is saved. Standard redundant number system for fast addition. 17: Adders CMOS VLSI Design 4th Ed. 27 Signed-Digit Representation Signed-digit (SD) or redundant digit (RD) number representation. n 1 ri,si ,ti 1,0,1 1,0,1 , R ri 2i i0 No carry propagation in S = R + T ri t i c i1,ui 2c i1 ui , c i1,ui 1,0,1 c i1,ui is redundant (e.g.,0 1 01 1 1 i c i ,ui s.t. si 1,0,1 One digit holds the sum of two digits. No carry-out. 17: Adders CMOS VLSI Design 4th Ed. 28 Signed-Digit Representation Minimal SD representation: minimal number of nonzero digits. 0111 1010001 0 – Applications: sequential multiplication (less cycles), filters with constant coefficients (less hardware). – Example: 7 0111, 11 11 , 101 1 , 1001 , 11 111, minimal 17: Adders CMOS VLSI Design 4th Ed. 29 Signed-Digit Representation Canonical SD representation: minimal SD. Not two non-zero digits in sequence. 01 1 101001 0 SD -> binary: carry propagation necessary => adder. Other applications: high speed multipliers. to carry-save, simple use for signed Similar numbers. 17: Adders CMOS VLSI Design 4th Ed. 30 Residue Number Systems Non-binary, irredundant, non-weighted number system. Carry-free and fast additions and multiplications. Complex and slow other arithmetic operations (e.g. comparison, sign, and overflow detection) because digits are not weighted. Conversion to weighted mixed-radix or binary system required. Codes for error correction and detection. Possible applications (but hardly used) – Digital filters – Error detection and correction 17: Adders CMOS VLSI Design 4th Ed. 31 Residue Number Systems Base is n-tuple of integers (mn-1, mn-2, …, m0), residues (or moduli). These mi are pairwise prime. A an 1,an 2,,a0 m n1 ,m ns ,,m 0 ai 0,1,,mi 1 n 1 Range : M mi , anywhere inZ i0 ai Amodmi A m i , A mi qi ai Arithmetic operations: each digit computed separately. 17: Adders CMOS VLSI Design 4th Ed. 32 Residue Number Systems z i Z m i f A m f A m i i A B mi A mi B mi A B m i A m i B m i ai m m i ai m i ai1 m aim i 2 m i mi mi f ai mi ai bi m ai bi m mi i i i (Fermats' T heorem) i Best moduli mi are 2k and 2k – 1. – High storage efficiency with k bits. – Simple modular addition k bit adder without cout 17: Adders CMOS VLSI Design 4th Ed. 33 Residue Number Systems Example: m1,m0 3,2, M 6 5 6 A a1,a0 5 3 , 5 2 2,1 4 5 6 1,0 2,1 1 2 3 , 0 1 2 0,1 3 6 4 5 6 1,0 2,1 1 2 3 , 0 1 2 2,0 2 6 17: Adders CMOS VLSI Design 4th Ed. 34 Floating-Point Numbers Larger range, smaller precision than fixed-point representation, inexact, real numbers. Double-number form => discontinuous precision. S | biased exponent E | unsigned norm mantissa M F 1 M E 1 M 2E bias S S Basic arithmetic operations A B 1SA SB M A M B E A E B A B 1 A M A 1 B M B E A E B E A 17: Adders S S CMOS VLSI Design 4th Ed. 35 Floating-Point Numbers Basic arithmetic operations based in fixed point add, multiply, and shift operations. Post-normalization required. Applications: – Processors: real floating point formats (e.g. IEEE standard), large range due to universal use. – ASICs: usually simplified floating-point formats with small exponents, smaller range. Used for range extension of normal fixed-point numbers. IEEE floating point format: 17: Adders CMOS VLSI Design 4th Ed. 36 Logarithmic Number System Alternative representation to floating point (mantissa + integer exponent -> only fixed point exponent). Single number form => continuous precision => higher accuracy, more reliable. S | biased fixed point exponentE L 1 E 1 2E bias S S Basic arithmetic operations: – (A < B) = (EA < EB) additionally consider sign – A + B by approximation or addition in conventional number system and double conversion. 17: Adders CMOS VLSI Design 4th Ed. 37 Logarithmic Number System Basic arithmetic operations A B 1 A S SB A 1 y SA EA EB yE A , y A 1 SA EA y – Simpler multiplication, exponentiation. More complex addition. – Expensive conversion: (anti)logarithms probably by table look-up. – Applications: real-time digital filters. 17: Adders CMOS VLSI Design 4th Ed. 38 Antitetrational Number System Tetration (t.x = 2{ 2 2... 2 and antitetration (a.t.x) x times Larger range, but smaller precision than logarithmic representation. Otherwise, analogous. Note that all these systems can be mixed in composite arithmetic. Choice of number representation should be hidden from the user. The compiler should handle it. Rational numbers can also be represented in floating slash notation. 17: Adders CMOS VLSI Design 4th Ed. 39 Round-Off Schemes Intermediate results with d additional lower bits. This results in higher accuracy. A an1,,a0,a1,,ad Rounding: keeping error e small during final word length reduction: R rn1,,r0 A e Trade-off: numericalaccuracy vs implementation cost. Truncation RTRUNC an1,,a0 – bias 1 1 = average error e d 1 2 2 to nearest (normal rounding) Round 17: Adders RROUND an 1,, a0 , A A 1 A 0.12 2 CMOS VLSI Design 4th Ed. 40 Round-Off Schemes Round to nearest 1 – The error is bias d 1 nearly symmetric 2 – + 0.12 can often be included in a previous operation. Round to nearest even/odd RROUND if a1,, ad 00 RROUND EVEN an 1,, a1,0 otherwise – bias = 0 (symmetric) – Mandatory in IEEE floating-point standard 3guard bits for rounding after floating point operations: guard bit G (postnormalization), round bit R (round to nearest ), sticky bit S (round to nearest even) 17: Adders CMOS VLSI Design 4th Ed. 41 Addition 17: Adders CMOS VLSI Design 4th Ed. 42 Single-Bit Addition A Half Adder S A B B Full Adder S A B C Cout Cout A B S A B Cout Cout MAJ ( A, B, C ) C S A B Cout S A B C Cout S 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 1 1 0 0 1 0 1 0 0 1 1 1 1 0 0 1 1 1 0 1 0 0 0 1 1 0 1 1 0 1 1 0 1 0 1 1 1 1 1 17: Adders CMOS VLSI Design 4th Ed. 43 1-Bit Adders Add up m bits of same magnitude Output the sum as a k-bit number (k log m 1) Or count 1’s at inputs => (m,k) counter – combinational counter. A half adder is a (2,2) counter cout ,s 2cout s a b s a b (sum) c out ab (carry- out) A 3 , T 21 17: Adders CMOS VLSI Design 4th Ed. 44 1-Bit Adders 17: Adders CMOS VLSI Design 4th Ed. 45 1-Bit Adders A full-adder is a (3,2) counter. cout ,s 2cout s a b cin g ab A 7 , T 42 (generat e) c 0 ab p a b (propagate) c1 a b s a b c in p c in c out ab acin bcin ab a bc in g pcin pg pcin pa pcin c inc 0 c inc1 17: Adders CMOS VLSI Design 4th Ed. 46 PGK For a full adder, define what happens to carries (in terms of A and B) – Generate: Cout = 1 independent of C • G=A•B – Propagate: Cout = C • P=AB – Kill: Cout = 0 independent of C • K = ~A • ~B 17: Adders CMOS VLSI Design 4th Ed. 47 Full Adder Design I Brute force implementation from eqns S A B C Cout MAJ ( A, B, C ) A A A S C A B B C C MAJ C B C B B B C A C A A A 17: Adders C B S Cout B A B A B C A B C B B Cout B A CMOS VLSI Design 4th Ed. 48 Full Adder Design II Factor S in terms of Cout S = ABC + (A + B + C)(~Cout) Critical path is usually C to Cout in ripple adder MINORITY A B C Cout S S Cout 17: Adders CMOS VLSI Design 4th Ed. 49 Full Adder Design II Same circuit with sized transistors 17: Adders CMOS VLSI Design 4th Ed. 50 Layout Clever layout circumvents usual line of diffusion – Use wide transistors on critical path – Eliminate output inverters 17: Adders CMOS VLSI Design 4th Ed. 51 Full Adder Design III Complementary Pass Transistor Logic (CPL) – Slightly faster, but more area B B B C B C B C B C A S B C B C B C B C Cout A A S A Cout B B 17: Adders CMOS VLSI Design 4th Ed. 52 Full Adder Design III Transmission gates 17: Adders CMOS VLSI Design 4th Ed. 53 Full Adder Design IV Dual-rail domino – Very fast, but large and power hungry – Used in very fast multipliers C_h A_h B_h A_h C_l B_h A_l C_h B_h A_h C_l B_l Cout _l A_l B_l B_l S_l 17: Adders Cout _h S_h C_h B_h A_l CMOS VLSI Design 4th Ed. 54 (m,k) Counters k 1 m 1 j 0 i0 sk 1,,s0 s j 2 j ai Usually built from full-adders. Associativity of addition allows conversion from linear to tree structure => faster at the same number of FAs. log m A 7 m2 k 7m logm k 1 TLIN 4m 2logm , TTREE 4 log3 m 2logm 17: Adders CMOS VLSI Design 4th Ed. 55 (7,3) Counter Example A 28 , T 14 A 28 , T 10 17: Adders CMOS VLSI Design 4th Ed. 56 Carry Propagate Adders Add two n-bit operands A and B and an optional carry in cin by performing carry propagation. Sum (cout, S) is an irredundant (n+1) bit number cout ,S cout 2n S A B cin 2c i1 si ai bi c i i 0,1,,n 1 c 0 c in , c out c n (r.m.a) 17: Adders CMOS VLSI Design 4th Ed. 57 Carry Propagate Adders N-bit adder called CPA – Each sum bit depends on all previous carries – How do we compute all these carries quickly? AN...1 BN...1 Cout Cout + SN...1 17: Adders Cin Cin 00000 1111 +0000 1111 CMOS VLSI Design 4th Ed. Cout 11111 1111 +0000 0000 Cin carries A4...1 B4...1 S4...1 58 Ripple-Carry Adder(RCA) Serial arrangement of n full adders. Simplest, smallest, and slowest CPA structure. A 7n , T 2n , AT 14 n 2 17: Adders CMOS VLSI Design 4th Ed. 59 Carry-Ripple Adder Simplest design: cascade full adders – Critical path goes from Cin to Cout – Design full adder to have fast carry delay A4 B4 Cout B3 C3 S4 17: Adders A3 A2 B2 C2 S3 A1 B1 Cin C1 S2 CMOS VLSI Design 4th Ed. S1 60 Carry Ripple Adder Note that worst case delay is linear with number of bits. tadder N 1tcarry tsum Goal: Make the fastest possible carry path circuit. 17: Adders CMOS VLSI Design 4th Ed. 61 A Full Adder Circuit 17: Adders CMOS VLSI Design 4th Ed. 62 Inversion Property 17: Adders CMOS VLSI Design 4th Ed. 63 Inversions Critical path passes through majority gate – Built from minority + inverter – Eliminate inverter and use inverting full adder A4 B4 Cout B3 C3 S4 17: Adders A3 A2 B2 C2 S3 A1 B1 Cin C1 S2 S1 CMOS VLSI Design 4th Ed. 64 Mirror Adder 17: Adders CMOS VLSI Design 4th Ed. 65 Mirror Adder 17: Adders CMOS VLSI Design 4th Ed. 66 Mirror Adder The NMOS and PMOS chains are completely symmetrical. A maximum of two series transistors can be observed in the carry generation circuit. When laying out the cell, the most critical issue is the minimization of the capacitance at node Co. The reduction of the diffusion capacitances is particularly important. The capacitance at node Co is composed of four diffusion capacitances, two internal gate capacitances, and six gate capacitances in the connecting adder cell. 17: Adders CMOS VLSI Design 4th Ed. 67 Mirror Adder The transistors connected to Ci are placed closest to the input. Only the transistors in the carry stage have to be optimized for optimal speed. All transistors in the sum stage can be minimal size. 17: Adders CMOS VLSI Design 4th Ed. 68 Transmission Gate FA 17: Adders CMOS VLSI Design 4th Ed. 69 Carry Propagation Speed-up Concatenation of partial CPA’s with fast cin -> cout. Fast carry look-ahead logic for entire range of bits. 17: Adders CMOS VLSI Design 4th Ed. 70 Generate / Propagate Equations often factored into G and P Generate and propagate for groups spanning i:j Gi: j Gi:k Pi:k Pi: j Pi:k Gk 1: j Pk 1: j Base case Gi:i Gi Ai Bi Pi:i Pi Ai Bi 0 GCP 0:00:0 in G0:0 G0 Cin P0:0 P0 0 Sum: Si Pi Gi 1:0 17: Adders CMOS VLSI Design 4th Ed. 71 PG Logic A4 B4 A3 B3 A2 B2 A1 B1 Cin 1: Bitwise PG logic G4 P4 G3 P3 G2 P2 G1 P1 G0 P0 2: Group PG logic G3:0 G2:0 G1:0 G0:0 C3 C2 C1 C0 3: Sum logic C4 Cout 17: Adders S4 S3 S2 S1 CMOS VLSI Design 4th Ed. 72 PG Logic 17: Adders CMOS VLSI Design 4th Ed. 73 Carry-Ripple Revisited Gi:0 Gi Pi Gi 1:0 A4 B4 G4 P4 A3 B3 G3 P3 A2 B2 G2 P2 A1 B1 G1 P1 Cin G0 G3:0 G2:0 G1:0 G0:0 C3 C2 C1 C0 P0 C4 Cout 17: Adders S4 S3 S2 S1 CMOS VLSI Design 4th Ed. 74 Carry-Ripple PG Diagram Bit Position 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 tripple t pg ( N 1)tAO txor Delay 15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0 17: Adders CMOS VLSI Design 4th Ed. 75 PG Diagram Notation Black cell i:k Gray cell k-1:j i:k i:j Gi:k Pi:k Gk-1:j Pk-1:j 17: Adders Buffer k-1:j i:j i:j i:j Gi:j Gi:k Pi:k Gk-1:j Pi:j CMOS VLSI Design 4th Ed. Gi:j Gi:j Gi:j Pi:j Pi:j 76 Manchester Carry Chain 17: Adders CMOS VLSI Design 4th Ed. 77 Manchester Carry Chain 17: Adders CMOS VLSI Design 4th Ed. 78 Manchester Carry Chain 17: Adders CMOS VLSI Design 4th Ed. 79 Carry-Skip Adder Carry-ripple is slow through all N stages Carry-skip allows carry to skip over groups of n bits – Decision based on n-bit propagate signal Cout A16:13 B16:13 A12:9 B12:9 A8:5 B8:5 A4:1 P16:13 P12:9 P8:5 P4:1 1 0 C12 + S16:13 17: Adders 1 0 C8 + S12:9 1 0 C4 + S8:5 CMOS VLSI Design 4th Ed. B4:1 1 0 + Cin S4:1 80 Carry-Skip Adder 17: Adders CMOS VLSI Design 4th Ed. 81 Carry-Skip Adder 17: Adders CMOS VLSI Design 4th Ed. 82 Carry-Skip Adder 17: Adders CMOS VLSI Design 4th Ed. 83 Carry-Skip PG Diagram 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 16:0 15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0 For k n-bit groups (N = nk) tskip t pg 2 n 1 ( k 1) t AO txor 17: Adders CMOS VLSI Design 4th Ed. 84 Variable Group Size 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 16:0 15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0 Delay grows as O(sqrt(N)) 17: Adders CMOS VLSI Design 4th Ed. 85 Carry-Skip Adder Partial CPA with fast ck -> ci c i Pi1:kci Pi1:kc k (bit groupai1,,ak ) Pi1:k pi1 pi2 pk (group propagate) If Pi-1:k = 0 : ck does not become c’i and c’i is selected, becoming ci. If Pi-1:k = 0 : ck becomes c’i, but c’i is skipped. Path ck -> c’i -> ci never sensitized => fast ck -> ci False path => inherent logic redundancy => problems in circuit optimization, timing analysis, and testing. 17: Adders CMOS VLSI Design 4th Ed. 86 Carry-Skip Adder Variable group sizes are faster. – Use larger groups in the middle – Minimize delays a0 -> ck -> si-1 and ak -> ci -> sn-1 Partial CPA type is RCA or CSKA (multilevel CSKA) Medium speed-up at small hardware overhead (+ AND/bit +MUX/group) A 8n , T 4n 1 2 , AT 32n 3 2 17: Adders CMOS VLSI Design 4th Ed. 87 CSKA + Manchester 17: Adders CMOS VLSI Design 4th Ed. 88 Carry-Select Adder Trick for critical paths dependent on late input X – Precompute two possible outputs for X = 0, 1 – Select proper output when X arrives Carry-select adder precomputes n-bit sums – For both possible carries into n-bit group A16:13 B16:13 0 + Cout 1 + B8:5 C8 1 + CMOS VLSI Design 4th Ed. B4:1 C4 + Cin 0 S12:9 A4:1 0 + 1 0 1 0 1 S16:13 A8:5 0 + C12 1 + 17: Adders A12:9 B12:9 S8:5 S4:1 89 Carry-Select Adder Partial CPA with fast ck -> ci and ck -> si-1:k 0 si1:k c k si1:k c k s1i1:k c i c kc i0 c kc1i Two CPA’s compute two possible results (cin = 0/1), group carry-in ck selects correct one afterwards. Variable group sizes are faster; use larger groups at end (MSB). Balance delays a0 -> ck and ak -> ci0 Partial CPA type is RCA, CSLA (multilevel CSLA) or CLA. 17: Adders CMOS VLSI Design 4th Ed. 90 Carry-Select Adder High speed-up at high hardware overhead. – + MUX/bit + (CPA + MUX)/group A 14n , T 2.8n 1 2 , AT 39n 3 2 17: Adders CMOS VLSI Design 4th Ed. 91 Carry-Select Adder 17: Adders CMOS VLSI Design 4th Ed. 92 Carry-Select Adder 17: Adders CMOS VLSI Design 4th Ed. 93 Linear Carry-Select 17: Adders CMOS VLSI Design 4th Ed. 94 Square-Root Carry-Select 17: Adders CMOS VLSI Design 4th Ed. 95 Delay Comparison 17: Adders CMOS VLSI Design 4th Ed. 96 Carry-Increment Adder Partial CPA with fast ck -> ci and ck -> si-1:k si1:k si1:k c k , c i ci Pi1:kc k Pi1:k pi1 pi2 pk (group propagate) Result is incremented after addition if ck = 1 Variable group sizes are faster, use larger groups at end (MSB). Balance delays a0 -> ck and ak -> c’i Partial CPA could be RCA, CIA (multilevel CIA) or CLA. High speed-up at medium hardware overhead (+AND/bit + (incrementer + AND/OR)/group). Logic of CPA and incrementer could be merged. 17: Adders CMOS VLSI Design 4th Ed. 97 Carry-Increment Adder A 10n , T 2.8n 17: Adders 2 1 , AT 28n CMOS VLSI Design 4th Ed. 3 2 98 Carry-Increment Adder Example: gate-level schematic of carry-increment adder (CIA) – Only two different logic cells (bit-slices): IHA and IFA 17: Adders CMOS VLSI Design 4th Ed. 99 Carry-Increment Adder Factor initial PG and final XOR out of carry-select 15 14 13 12 11 10 13:12 8 7 6 9:8 14:12 15:12 9 4 3 2 1 0 5:4 10:8 11:8 5 6:4 7:4 15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0 tincrement t pg n 1 (k 1) t AO txor 17: Adders CMOS VLSI Design 4th Ed. 100 Variable Group Size Also buffer noncritical signals 15 14 13 12 11 10 9 12:11 8 7 6 8:7 13:11 4 5:4 9:7 14:11 5 3 2 1 0 3:2 6:4 10:7 15:11 15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0 15 14 13 12 11 10 9 12:11 7 6 8:7 13:11 14:11 8 9:7 10:7 5 5:4 6:4 4 3 3:2 2 1 0 1:0 3:0 6:0 15:11 15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0 17: Adders CMOS VLSI Design 4th Ed. 101 Conditional-Sum Adder Optimized multilevel CSLA with logn levels 0 1 Correct sum bits si1:k or sik:1 are conditionally selected through logn levels of multiplexers. Bit groups of size 2l at level l. Higher parallelism, more balanced signal paths. Highest speed-up at highest hardware overhead (2RCA + more than logn MUX/bit) 17: Adders CMOS VLSI Design 4th Ed. 102 Conditional-Sum Adder A 3n logn , T 2logn , AT 6n log2 n 17: Adders CMOS VLSI Design 4th Ed. 103 Conditional-Sum Adder 17: Adders CMOS VLSI Design 4th Ed. 104 Conditional-Sum Adder 17: Adders CMOS VLSI Design 4th Ed. 105 Carry-Lookahead Adder Carries look ahead before sum bits are computed c 0 c0 c1 g0 p0c0 c 2 g1 p1g0 p1 p0c0 c 3 g2 p2 g1 p2 p1g0 p2 p1 p0c0 g3 g3 p3 g2 p3 p2 g1 p3 p2 p1g0 p3 p3 p2 p1 p0 1 Hierarchical arrangement using logn levels: g3, p3 2 passed up, c’ passed down between levels. 0 High speed-up at medium hardware overhead. 17: Adders CMOS VLSI Design 4th Ed. 106 Carry-Lookahead Adder A 14n , T 4 log n , AT 56n log n 17: Adders CMOS VLSI Design 4th Ed. 107 Carry-Lookahead Adder 17: Adders CMOS VLSI Design 4th Ed. 108 Carry-Lookahead Adder Carry-lookahead adder computes Gi:0 for many bits in parallel. Uses higher-valency cells with more than two inputs. A16:13 B16:13 Cout G16:13 P16:13 + S16:13 17: Adders C12 A12:9 B12:9 G12:9 P12:9 + S12:9 A8:5 B8:5 C8 A4:1 C4 G8:5 P8:5 B4:1 G4:1 P4:1 + + S8:5 S4:1 CMOS VLSI Design 4th Ed. Cin 109 CLA PG Diagram 17: Adders CMOS VLSI Design 4th Ed. 110 Carry-Lookahead 17: Adders CMOS VLSI Design 4th Ed. 111 Lookahead Tree 17: Adders CMOS VLSI Design 4th Ed. 112 Lookahead Tree 17: Adders CMOS VLSI Design 4th Ed. 113 Higher-Valency Cells i:k k-1:l l-1:m m-1:j i:j Gi:k Pi:k Gk-1:l Pk-1:l Gl-1:m Pl-1:m Gm-1:j Gi:j Pi:j Pm-1:j 17: Adders CMOS VLSI Design 4th Ed. 114 Higher Valency PG Diagram 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 16:0 15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0 17: Adders CMOS VLSI Design 4th Ed. 115 Tree Adder If lookahead is good, lookahead across lookahead! – Recursive lookahead gives O(log N) delay Many variations on tree adders 17: Adders CMOS VLSI Design 4th Ed. 116 Parallel Prefix Adders Universal adder architecture comprising RCA, CIA, CLA, and more (entire range of area-delay trade-offs from slowest RCA to fastest CLA). Preprocessing, carry-lookahead, and postprocessing step. Carries calculated using parallel-prefix algorithms – High regularity: suitable for synthesis and layout – High flexibility: special adders, other arthmetic operations, exchangeable prefix algorithms. – High performance: smallest and fastest adders 17: Adders CMOS VLSI Design 4th Ed. 117 Parallel Prefix Adders A 5n 3A , T 4 2T 17: Adders CMOS VLSI Design 4th Ed. 118 Prefix Problem Inputs (xn-1,…,x0) outputs (yn-1,…,y0), associative binary operator • yn 1,, y0 xn 1 x0,, x1 x0, x0 or y 0 x 0 , y i x i y i1 ; i 1,,n 1 (r.m.a) Associativity of • => tree structures for evaluation x 3 x 2 x1 x 0 x 3 x 2 x1 x 0 14 2 43 14 2 43 14 2 43 1 1 1 y1 Y1:0 Y3:2 y1 Y1:0 1 4 44 2 4 4 43 1 4 4 2 4 43 2 2 y 3 Y3:0 y 2 Y2:0 1 4 44 2 4 4 43 3 y 3 Y3:0 17: Adders CMOS VLSI Design 4th Ed. 119 Prefix Problem l Group variables Yi:k : covers bits (xk,…,xi) at level l. Carry-propagation is prefix problem: Yi:kl Gi:kl ,Pi:kl G G 0 i:i ,Pi:i0 gi , pi l i:k l 1 l 1 ,Pi:kl Gi:l 1j 1,Pi:l 1 ; k ji j 1 G j:k ,P j:k l 1 l 1 l 1 Gi:l 1j 1 Pi:l 1 G ,P j 1 j:k j:k P j:k m c i1 Gi:0 ; i 0,,n 1 , l 1,,m Parallel-prefix algorithms: – Multi-tree structures T = O(n) -> O(logn) – Sharing subtrees A = O(n2) -> O(nlogn) – Different algorithms trading area vs delay. Also consider wirng and fanout. 17: Adders CMOS VLSI Design 4th Ed. 120 Prefix Algorithms Algorithms visualized by directed acyclic graphs (DAG) with array structure (n bits x m levels). Graph vertex symbols Performance measures: – A• : graph size (number of black nodes) – T• : graph depth (number of black nodes on critical path) 17: Adders CMOS VLSI Design 4th Ed. 121 Prefix Algorithms Serial prefix algorithm (RCA) A n 1 , T n 1 , FOmax 2 17: Adders CMOS VLSI Design 4th Ed. 122 Prefix Algorithms Sklansky parallel-prefix algorithm (PPA-SK) – Tree-like collection, parallel redistribution of carries 1 1 A nlogn , T logn , FOmax n 2 2 17: Adders CMOS VLSI Design 4th Ed. 123 Sklansky 15 14 13 12 11 10 15:14 13:12 11:10 15:12 14:12 15:8 14:8 11:8 10:8 13:8 9 9:8 8 7 6 7:6 7:4 5 5:4 6:4 4 3 2 3:2 3:0 1 0 1:0 2:0 12:8 15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0 17: Adders CMOS VLSI Design 4th Ed. 124 Prefix Algorithms Brent-Kung parallel-prefix algorithm (PPA-BK) – Traditional CLA is PPA-BK with 4-bit groups – Tree-like redistribution of carries (fan-out tree) A 2n logn 2 , T 2logn 2 FOmax logn 17: Adders CMOS VLSI Design 4th Ed. 125 Brent-Kung 15 14 13 12 11 10 15:14 13:12 15:12 11:10 9 9:8 11:8 8 7 7:6 6 5 5:4 7:4 15:8 4 3 3:2 2 1 0 1:0 3:0 7:0 11:0 13:0 9:0 5:0 15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0 17: Adders CMOS VLSI Design 4th Ed. 126 Prefix Algorithms Kogge-Stone parallel-prefix algorithm (PPA-KS) – very high wiring requirements A n log n n 1 , T log n , FOmax 2 17: Adders CMOS VLSI Design 4th Ed. 127 Kogge-Stone 15:14 14:13 13:12 12:11 11:10 10:9 9:8 8:7 7:6 6:5 5:4 4:3 3:2 2:1 3:0 2:0 15:12 14:11 13:10 12:9 11:8 10:7 9:6 8:5 7:4 6:3 5:2 4:1 13:6 12:5 11:4 10:3 9:2 8:1 7:0 6:0 5:0 4:0 15:8 14:7 1 2 3 4 5 6 7 8 9 15 14 13 12 11 10 0 1:0 15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0 17: Adders CMOS VLSI Design 4th Ed. 128 Prefix Algorithms Carry-increment parallel-prefix algorithm 1 1 A 2n 1.4n 2 , T 1.4n 2 , FOmax 1.4n 1 2 17: Adders CMOS VLSI Design 4th Ed. 129 Prefix Algorithms Mixed serial/parallel-prefix algorithm (RCA+PPA) – Linear size-depth trade-off using parameter k: 0 k n 2log n 2 – k = 0 : serial prefix graph – k n 2log n 1 : Brent-Kung parallel-prefix graph – Fills the gap between RCA and PPA-BK (CLA) in steps of single •-operations. 17: Adders CMOS VLSI Design 4th Ed. 130 Prefix Algorithms A n 1 k , T n 1 k , FOmax var. 17: Adders CMOS VLSI Design 4th Ed. 131 Prefix Algorithms Example: 4-bit PPA-SK – Efficient AND-OR-prefix circuit for the generate and AND-prefix circuit for the propagate signals – Optimization: alternatingly AOI/OAI- resp. NAND/NOR-gates (inverting gatesare smaller and faster). – Can also be realized using two MUX-prefix circuits 17: Adders CMOS VLSI Design 4th Ed. 132 Prefix Algorithms 17: Adders CMOS VLSI Design 4th Ed. 133 Prefix Algorithms Prefix adders can be synthesized by human or computer as well. Starting from a serial structure, one can use compression rules and expansion rules to obtain new graphs. Can generate all previous graphs except PPA-KS. Universal adder synthesis approach. 17: Adders CMOS VLSI Design 4th Ed. 134 Tree Adder Taxonomy Ideal N-bit tree adder would have – L = log N logic levels – Fanout never exceeding 2 – No more than one wiring track between levels Describe adder with 3-D taxonomy (l, f, t) – Logic levels: L+l – Fanout: 2f + 1 – Wiring tracks: 2t Known tree adders sit on plane defined by l + f + t = L-1 17: Adders CMOS VLSI Design 4th Ed. 135 Tree Adder Taxonomy l (Logic Levels) 3 (7) Brent-Kung f (Fanout) 2 (6) Sklansky 3 (9) 1 (5) 2 (5) 1 (3) 0 (2) 0 (4) 0 (1) 1 (2) 2 (4) Kogge-Stone 3 (8) t (Wire Tracks) 17: Adders CMOS VLSI Design 4th Ed. 136 Han-Carlson 15 14 13 12 11 10 9 8 7 6 5 4 3 15:14 13:12 11:10 9:8 7:6 5:4 3:2 15:12 13:10 11:8 9:6 7:4 5:2 3:0 15:8 13:6 11:4 9:2 7:0 5:0 2 1 0 1:0 15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0 17: Adders CMOS VLSI Design 4th Ed. 137 Knowles [2, 1, 1, 1] 15 14 13 12 11 10 9 8 7 6 5 4 3 2 15:14 14:13 13:12 12:11 11:10 10:9 9:8 8:7 7:6 6:5 5:4 4:3 3:2 2:1 15:12 14:11 13:10 3:0 2:0 15:8 14:7 13:6 12:9 11:8 10:7 9:6 8:5 7:4 6:3 5:2 4:1 12:5 11:4 10:3 9:2 8:1 7:0 6:0 5:0 4:0 1 0 1:0 15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0 17: Adders CMOS VLSI Design 4th Ed. 138 Ladner-Fischer 15 14 13 12 11 10 15:14 13:12 15:12 11:10 9 9:8 11:8 15:8 13:8 15:8 13:0 8 7 7:6 5 5:4 7:4 7:0 11:0 6 4 3 3:2 2 1 0 1:0 3:0 5:0 9:0 15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0 17: Adders CMOS VLSI Design 4th Ed. 139 Taxonomy Revisited (f)Ladner-Fischer (b) Sklansky 15 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 14 15:14 15:14 13:12 11:10 9:8 7:6 5:4 3:2 13 12 15:8 14:8 11:8 10:8 13:8 7:4 6:4 3:0 BrentKung LadnerFischer 15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0 LadnerFischer f (Fanout) 13:12 11:10 4 3 2 15:14 14:13 13:12 12:11 11:10 10:9 9:8 8:7 7:6 6:5 5:4 4:3 3:2 2:1 15:12 14:11 13:10 9:6 8:5 7:4 6:3 5:2 4:1 3:0 2:0 12:9 11:8 10:7 1 15:8 13:8 15:8 13:0 15:14 0 (2) 0 1:0 0 (4) 0 (1) 13:12 11:10 15:12 14:7 13:6 12:5 11:4 10:3 9:2 8:1 7:0 6:0 5:0 4 3 2 7:6 5:4 3:2 7:0 Knowles [4,2,1,1] 9:0 8:0 7:0 6:0 5:0 4:0 7 6 5 4 3 2 3:0 2:0 9 8 1 0 HanCarlson 9:8 7:6 5:4 3:2 7:4 15:8 1:0 3:0 7:0 11:0 9:0 5:0 2 (4) (d) Han-Carlson 15 14 13 (c) Kogge-Stone 15 14 13 12 11 10 13:6 15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0 HanCarlson Knowles [2,1,1,1] 14:7 9 8 7 6 5 4 3 2 9:8 8:7 7:6 6:5 5:4 4:3 3:2 2:1 12:9 11:8 10:7 9:6 8:5 7:4 6:3 5:2 4:1 3:0 2:0 12:5 11:4 10:3 9:2 8:1 7:0 6:0 5:0 4:0 1 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1:0 Kogge3 (8) Stone 15:14 13:12 11:10 9:8 7:6 5:4 3:2 15:12 13:10 11:8 9:6 7:4 5:2 3:0 15:8 13:6 11:4 9:2 7:0 5:0 1:0 t (Wire Tracks) 15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0 17: Adders 0:0 5:0 New (1,1,1) 15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0 15:8 1:0 1:0 1 (2) 15:12 14:11 13:10 0 3:0 4:0 15:14 14:13 13:12 12:11 11:10 10:9 1 9:0 11:8 13:0 15:8 5 7:4 11:0 15 14 13 12 11 10 1 (3) 5 9:8 11:8 15:0 14:0 13:0 12:0 11:0 10:0 2 (5) 6 6 1 (5) (e) Knowles [2,1,1,1] 7 7 (a) Brent-Kung 2 (6) 3 (9) 8 8 3 (7) Sklansky 9 9 l (Logic Levels) 2:0 12:8 15 14 13 12 11 10 10 1:0 15:12 15:12 14:12 11 0 15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0 CMOS VLSI Design 4th Ed. 140 More Adder Issues Multilevel adders – Multilevel versions of adders possible • CSKA, CSLA, CIA Hybrid adders – Arbitrary combination of speed-up techniques possible. – Often used combinations: CLA – CSLA Transistor level adders – Influence of logic styles (dynamic logic, pass transistor logic) – Efficient transistor level implementation of ripple-carry chains (Manchester chain) – Combinations of speed-up techniques make sense. • Much higher design effort – Many efficient implementations exist in the literature. Higher valency (radix) also possible. 17: Adders CMOS VLSI Design 4th Ed. 141 More Adder Issues Higher valency is a poor choice in static CMOS logic since each stage has higher delay. However, if the stages are built using domino logic, it could prove to be an advantage. Nodes with large fanouts or long wires could use buffers. The prefix trees can also be internally pipelined. 17: Adders CMOS VLSI Design 4th Ed. 142 Transistor Level 17: Adders CMOS VLSI Design 4th Ed. 143 Transistor Level 17: Adders CMOS VLSI Design 4th Ed. 144 Transistor Level 17: Adders CMOS VLSI Design 4th Ed. 145 Higher Valency Adders 17: Adders CMOS VLSI Design 4th Ed. 146 Sparse Trees Building a prefix tree to compute carries in every bit is expensive in terms of power. An alternative is to compute carries into short groups such as s = 2,3,8, or 16 bits. Meanwhile, pairs of s-bit adders precompute the sums assuming both carries-in of 0 and 1 to each group. It is a hybrid between a prefix adder and carry select adder. 17: Adders CMOS VLSI Design 4th Ed. 147 Valency-3 BK Adder Sparse tree adder with s = 3 17: Adders CMOS VLSI Design 4th Ed. 148 Carry-Select Implementation 17: Adders CMOS VLSI Design 4th Ed. 149 Sparse Tree Adders Intel Valency-2 Sklansky sparse tree adder with s=4 17: Adders CMOS VLSI Design 4th Ed. 150 Sparse Tree Adders Valency-3 Kogge-Stone sparse tree adder with s=3 17: Adders CMOS VLSI Design 4th Ed. 151 Ling Adders Ling discovered a technique to remove one series transistor from the critical group generate path at the expense of another XOR gate in the sum precomputation. Define a pseudo-generate Hi:j = Gi + Gi-1:j This is a simpler computation. Ki Hi: j KiGi KiGi1: j Gi KiGi1: j Gi: j Define a pseudo-propagate signal I that is a shifted version of propagate. I i: j K i 1: j 1 H i: j H i: k I i: k H k 1: j I i: j I i: k I k 1: j 17: Adders CMOS VLSI Design 4th Ed. 152 Ling Adders Finally, the sums are computed by Si Pi K i1H i1:0 Si H i1:0 Pi K i1 H i1:0 Pi 17: Adders CMOS VLSI Design 4th Ed. 153 Ling Adders 17: Adders CMOS VLSI Design 4th Ed. 154 Comparison Standard-cell implementation, 0.8mm technology 17: Adders CMOS VLSI Design 4th Ed. 155 Comparison 17: Adders CMOS VLSI Design 4th Ed. 156 Summary Adder architectures offer area / power / delay tradeoffs. Choose the best one for your application. Architecture Classification Logic Levels Max Tracks Fanout Cells Carry-Ripple N-1 1 1 N Carry-Skip n=4 N/4 + 5 2 1 1.25N Carry-Inc. n=4 N/4 + 2 4 1 2N Brent-Kung (L-1, 0, 0) 2log2N – 1 2 1 2N Sklansky (0, L-1, 0) log2N N/2 + 1 1 0.5 Nlog2N Kogge-Stone (0, 0, L-1) log2N 2 N/2 Nlog2N 17: Adders CMOS VLSI Design 4th Ed. 157 E vs Delay Trade-off 17: Adders CMOS VLSI Design 4th Ed. 158 E vs Delay Tradeoff 90nm 64 bit domino KS Ling adder with various valency and s 17: Adders CMOS VLSI Design 4th Ed. 159 Area vs Delay Synthesized Adders 17: Adders CMOS VLSI Design 4th Ed. 160