Digital Design Chapter 6: Optimizations and Tradeoffs
Slides to accompany the textbook Digital Design, First Edition, by Frank Vahid, John Wiley and Sons Publishers, 2007. http://www.ddvahid.com
Copyright © 2007 Frank Vahid. Instructors of courses requiring Vahid's Digital Design textbook (published by John Wiley and Sons) have permission to modify and use these slides for customary course-related activities, subject to keeping this copyright notice in place and unmodified. These slides may be posted as unanimated pdf versions on publicly-accessible course websites. PowerPoint source (or pdf with animations) may not be posted to publicly-accessible websites, but may be posted for students on internal protected sites or distributed directly to students by other electronic means. Instructors may make printouts of the slides available to students for a reasonable photocopying charge, without incurring royalties. Any other use requires explicit permission; instructors may obtain the PowerPoint source or special use permissions from Wiley – see http://www.ddvahid.com for information.

6.1 Introduction
• We now know how to build digital circuits
– How can we build better circuits?
• Let's consider two important design criteria
– Delay – the time from inputs changing to new correct stable output
– Size – the number of transistors
– For quick estimation, assume:
• Every gate has a delay of "1 gate-delay"
• Every gate input requires 2 transistors
• Ignore inverters
• Transforming F1 to F2 represents an optimization: better in all criteria of interest
– F1 = wxy + wxy' = wx(y + y') = wx = F2
– F1: 16 transistors, 2 gate-delays; F2: 4 transistors, 1 gate-delay
– (Plot: size in transistors vs. delay in gate-delays – F2 is better on both axes)

Introduction
• Tradeoff – improves some, but worsens other, criteria of interest
• Transforming G1 to G2 represents a tradeoff: some criteria better, others worse
– G1 = wx + wy + z: 14 transistors, 2 gate-delays
– G2 = w(x + y) + z: 12 transistors, 3 gate-delays

Introduction
• Optimizations: all criteria of interest are improved (or at least kept the same)
• Tradeoffs: some criteria of interest are improved, while others are worsened
• We obviously prefer optimizations, but often must accept tradeoffs
– You can't build a car that is the most comfortable, and has the best fuel efficiency, and is the fastest – you have to give up something to gain other things.
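The F1-to-F2 optimization above preserves behavior on every input. A minimal Python sketch that checks this exhaustively (function names are illustrative):

```python
from itertools import product

def f1(w, x, y):
    # F1 = wxy + wxy'  (two-level form, 16 transistors)
    return (w and x and y) or (w and x and not y)

def f2(w, x, y):
    # F2 = wx  (optimized form, 4 transistors)
    return w and x

# An optimization must preserve the function on all 2^3 input combinations.
assert all(bool(f1(*v)) == bool(f2(*v)) for v in product([0, 1], repeat=3))
```

The same exhaustive check applies to the G1/G2 tradeoff: behavior is identical; only size and delay differ.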
6.2 Combinational Logic Optimization and Tradeoffs
• Two-level size optimization using algebraic methods
– Goal: a circuit with only two levels (AND gates feeding an OR gate), with minimum transistors
• Though transistors are getting cheaper (Moore's Law), they still cost something
• Define the problem algebraically
– Sum-of-products yields two levels
• F = abc + abc' is sum-of-products; G = w(xy + z) is not
– Transform the sum-of-products equation to have the fewest literals and terms
• Each literal and term translates to a gate input, each of which translates to about 2 transistors (see Ch. 2)
• Ignore inverters for simplicity
– Example:
F = xyz + xyz' + x'y'z' + x'y'z
F = xy(z + z') + x'y'(z + z')
F = xy*1 + x'y'*1
F = xy + x'y'
• Result: 4 literals + 2 terms = 6 gate inputs = 12 transistors
• Note: Assuming 4-transistor 2-input AND/OR circuits; in reality, only NAND/NOR are so efficient.

Algebraic Two-Level Size Minimization
• The previous example showed a common algebraic minimization method
– (Multiply out to sum-of-products, then)
– Apply the following as much as possible
• ab + ab' = a(b + b') = a*1 = a
• "Combining terms to eliminate a variable"
– (Formally called the "Uniting theorem")
– Duplicating a term sometimes helps
• Note that duplication doesn't change the function
– c + d = c + d + d = c + d + d + d + d ...
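The uniting-theorem derivation above can be confirmed by brute force over the truth table. A quick sketch (names are illustrative):

```python
from itertools import product

def f_original(x, y, z):
    # F = xyz + xyz' + x'y'z' + x'y'z
    return ((x and y and z) or (x and y and not z)
            or (not x and not y and not z) or (not x and not y and z))

def f_minimized(x, y, z):
    # F = xy + x'y'  after applying ab + ab' = a twice
    return (x and y) or (not x and not y)

assert all(bool(f_original(*v)) == bool(f_minimized(*v))
           for v in product([0, 1], repeat=3))
```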
– Example: duplicate a term, then combine
F = x'y'z' + x'y'z + x'yz
F = x'y'z' + x'y'z + x'y'z + x'yz
F = x'y'(z + z') + x'z(y' + y)
F = x'y' + x'z
– Sometimes after combining terms, the resulting terms can be combined again
G = xy'z' + xy'z + xyz + xyz'
G = xy'(z' + z) + xy(z + z')
G = xy' + xy (now do again)
G = x(y' + y)
G = x

Karnaugh Maps for Two-Level Size Minimization
• Easy to miss "seeing" possible opportunities to combine terms
• Karnaugh Maps (K-maps)
– Graphical method to help us find opportunities to combine terms
– Minterms differing in one variable are adjacent in the map (notice the columns are not in binary order: 00, 01, 11, 10)
– Can clearly see opportunities to combine terms – look for adjacent 1s
• For F = x'y'z' + x'y'z + xyz + xyz', clearly two opportunities
• The top-left circle is shorthand for x'y'z' + x'y'z = x'y'(z' + z) = x'y'(1) = x'y'
• Draw a circle, write the term that has all the literals except the one that changes within the circle
– Circle xy: x=1 and y=1 in both cells of the circle, but z changes (z=1 in one cell, 0 in the other)
• Treat the left and right edges of the map as adjacent too
• Minimized function: OR the final terms: F = x'y' + xy
• Easier than all that algebra

K-maps
• Four adjacent 1s means two variables can be eliminated
– Makes intuitive sense – those two variables appear in all combinations, so one must be true
– Draw one big circle – shorthand for the algebraic transformations:
G = xy'z' + xy'z + xyz + xyz'
G = x(y'z' + y'z + yz + yz')
G = x(y'(z' + z) + y(z + z'))
G = x(y' + y)
G = x
– Draw the biggest circle possible, or you'll have more terms than
really needed (e.g., circling xy' and xy separately instead of the single circle x).

K-maps
• Four adjacent cells can be in the shape of a square
– H = x'y'z + x'yz + xy'z + xyz = z (the four 1s form a square; both x and y are eliminated)
• OK to cover a 1 twice
– Just like duplicating a term (remember, c + d = c + d + d)
– Example: I = x'y'z + xy'z' + xy'z + xyz + xyz' = y'z + x
• The two circles are shorthand for:
I = x'y'z + xy'z' + xy'z + xyz + xyz'
I = x'y'z + xy'z + xy'z' + xy'z + xyz + xyz'
I = (x'y'z + xy'z) + (xy'z' + xy'z + xyz + xyz')
I = (y'z) + (x)
• But no need to cover 1s more than once
– Doing so yields extra terms – not minimized (e.g., for J, the circles x'y' and xz suffice; adding the circle y'z yields an unneeded extra term)

K-maps
• Circles can cross the left/right sides
– Remember, the edges are adjacent – minterms there differ in one variable only
– e.g., K = x'y'z + xz' (the xz' circle wraps around)
• Circles must have 1, 2, 4, or 8 cells
– 3, 5, or 7 cells are not allowed – they don't correspond to algebraic transformations that combine terms to eliminate a variable
• Circling all the cells is OK
– The function just equals 1 (E = 1)

K-maps for Four Variables
• A four-variable K-map follows the same principle
– Adjacent cells differ in one variable
– Left/right are adjacent; top/bottom are also adjacent
– e.g., F = w'xy' + yz; G = z
• Two-variable maps exist, but are not very useful – such functions are easy to minimize algebraically by hand
• 5- and 6-variable maps exist, but are hard to use

Two-Level Size Minimization Using K-maps
General K-map method:
1. Convert the function's equation into sum-of-products form
2. Place 1s in the appropriate K-map cells for each term
3.
Cover all 1s by drawing the fewest, largest circles, with every 1 included at least once; write the corresponding term for each circle
4. OR all the resulting terms to create the minimized function

• Example: Minimize G = a + a'b'c' + b*(c' + bc')
1. Convert to sum-of-products: G = a + a'b'c' + bc' + bc'
2. Place 1s in the appropriate cells
3. Cover the 1s: one circle covers the a row, another covers the c' columns
4. OR the terms: G = a + c'

Two-Level Size Minimization Using K-maps – Four-Variable Example
• Minimize: H = a'b'(cd' + c'd') + ab'c'd' + ab'cd' + a'bd + a'bcd'
1. Convert to sum-of-products: H = a'b'cd' + a'b'c'd' + ab'c'd' + ab'cd' + a'bd + a'bcd'
2. Place 1s in K-map cells
3. Cover the 1s (a funny-looking circle covers the four corners, but remember that left/right are adjacent, and top/bottom are adjacent)
4. OR the resulting terms: H = b'd' + a'bc + a'bd

Don't Care Input Combinations
• What if particular input combinations can never occur?
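The four-step result G = a + c' can be confirmed by brute force against the original equation (a sketch; names are illustrative):

```python
from itertools import product

def g_original(a, b, c):
    # G = a + a'b'c' + b(c' + bc')
    return a or (not a and not b and not c) or (b and ((not c) or (b and not c)))

def g_minimized(a, b, c):
    # G = a + c'  read off the two K-map circles
    return a or not c

assert all(bool(g_original(*v)) == bool(g_minimized(*v))
           for v in product([0, 1], repeat=3))
```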
– e.g., Minimize F = xy'z', given that x'y'z' (xyz=000) can never be true, and that xy'z (xyz=101) can never be true
– So it doesn't matter what F outputs when x'y'z' or xy'z is true, because those cases will never occur
– Thus, make F be 1 or 0 for those cases in a way that best minimizes the equation
• On the K-map
– Draw Xs for the don't care combinations
– Include an X in a circle ONLY if it minimizes the equation; don't include other Xs
• Good use of don't cares: circling the X at 000 along with xy'z' gives F = y'z'
• Unnecessary use of don't cares: also circling the X at 101 results in an extra term (xy')

Minimization Example Using Don't Cares
• Minimize: F = a'bc' + abc' + a'b'c, given don't cares a'bc, abc
– F = a'c + b
• Note: Use don't cares with caution
– Must be sure that we really don't care what the function outputs for that input combination
– If we do care, even the slightest, then it's probably safer to set the output to 0

Minimization with Don't Cares Example: Sliding Switch
• Switch with 5 positions (1–5); a 3-bit value xyz gives the position in binary
• Want a circuit (a "2, 3, or 4" detector) that
– Outputs 1 when the switch is in position 2, 3, or 4
– Outputs 0 when the switch is in position 1 or 5
– The 3-bit input can never be binary 0, 6, or 7 – treat those as don't care input combinations
• Without don't cares: G = x'y + xy'z'
• With don't cares: G = y + z'

Automating Two-Level Logic Size Minimization
• Minimizing by hand
– Is hard for functions with 5 or more variables
– May not yield the minimum cover, depending on the order we choose
– Is error prone
• Minimization is thus typically done by automated tools
– Exact algorithm: finds the optimal solution
– Heuristic: finds a good solution, but not
necessarily optimal
– Example: covering I with the four circles y'z', x'y', yz, and xy uses 4 terms, while covering with y'z', x'z, and xy uses only 3 terms

Basic Concepts Underlying Automated Two-Level Logic Minimization
• Definitions
– On-set: all minterms that define when F=1
– Off-set: all minterms that define when F=0
– Implicant: any product term (minterm or other) that when 1 causes F=1
• On a K-map, any legal (but not necessarily largest) circle
• e.g., x'y'z, xyz, xyz', and xy are 4 implicants of F
– Cover: implicant xy covers minterms xyz and xyz'
– Expanding a term: removing a variable (like drawing a larger K-map circle)
• xy is an expansion of xyz
– Prime implicant: a maximally expanded implicant – any further expansion would cover 1s not in the on-set
• x'y'z and xy above; but not xyz or xyz' – they can still be expanded
– Essential prime implicant: the only prime implicant that covers a particular minterm in the function's on-set
• Importance: we must include all essential PIs in a function's cover
• In contrast, some, but not all, nonessential PIs will be included
• e.g., for G as circled: x'y' and xy are essential; y'z and xz are not
• Note: We use K-maps here just for intuitive illustration of concepts; automated tools do not use K-maps.

Automated Two-Level Logic Minimization Method
1. Determine all prime implicants
2. Add the essential PIs to the cover
3. Cover the remaining minterms with nonessential PIs
• Steps 1 and 2 are exact
• Step 3 is hard – checking all possibilities is exact but computationally expensive; checking some but not all is heuristic

Example of Automated Two-Level Minimization
• 1. Determine all prime implicants (x'y', x'z, y'z', xz')
• 2. Add the essential PIs (x'z, xz') to the cover
– Italicized 1s are thus already covered
– Only one uncovered 1 remains
• 3.
Cover the remaining minterms with nonessential PIs
– Pick among the two possible PIs (x'y' or y'z'; here y'z' is chosen)

Problem with Methods that Enumerate all Minterms or Compute all Prime Implicants
• Too many minterms for functions with many variables
– Function with 32 variables: 2^32 = 4 billion possible minterms
– Too much compute time/memory
• Too many computations to generate all prime implicants
– Comparing every minterm with every other minterm, for 32 variables, is on the order of (4 billion)^2 ≈ 10^19 computations
– Functions with many variables could require days, months, years, or more of computation – unreasonable

Solution to the Computation Problem
• Don't generate all minterms or prime implicants
• Instead, just take the input equation, and try to "iteratively" improve it
• Ex: F = abcdefgh + abcdefgh' + jklmnop
– Note: 15 variables, may have thousands of minterms
– But can minimize just by combining the first two terms:
F = abcdefg(h + h') + jklmnop = abcdefg + jklmnop

Two-Level Minimization Using the Iterative Method
• Method: randomly apply "expand" operations, and see if they help
– Expand: remove a variable from a term
• Like expanding a circle's size on a K-map
• e.g., expanding x'z to z is legal, but expanding x'z to z' is not, in the shown function
– After an expand, remove other terms covered by the newly expanded term
– Keep trying (iterate) until expanding no longer helps
• Ex: F = abcdefgh + abcdefgh' + jklmnop
F = abcdefg + abcdefgh' + jklmnop (expand the first term; the second is now covered)
F = abcdefg + jklmnop

Multi-Level Logic Optimization – Performance/Size Tradeoffs
• We don't always need the
speed of two-level logic
– Multiple levels may yield fewer gates
– Example:
• F1 = ab + acd + ace (two-level: 22 transistors, 2 gate-delays)
• F2 = ab + ac(d + e) = a(b + c(d + e)) (multi-level: 16 transistors, 4 gate-delays)
• General technique: factor out literals – xy + xz = x(y + z)

Multi-Level Example
• Q: Use multiple levels to reduce the number of transistors for F1 = abcd + abcef (two-level: 22 transistors, 2 gate-delays)
• A: abcd + abcef = abc(d + ef)
– Has fewer gate inputs, thus fewer transistors: 18 transistors, 3 gate-delays

Multi-Level Example: Non-Critical Path
• Critical path: the longest delay path to an output
• Optimization: reduce the size of the logic on non-critical paths by using multiple levels

Automated Multi-Level Methods
• The main techniques use heuristic iterative methods
– Define various operations
• "Factor out": xy + xz = x(y + z)
• Expand, and others
– Randomly apply an operation, and see if it improves the circuit
• May even accept changes that worsen it, in hopes of eventually reaching an even better equation
– Keep trying until no further improvement is found
– Not guaranteed to find the best circuit, but rather a good one

6.3 State Reduction (State Minimization)
• Goal: reduce the number of states in an FSM without changing its behavior
– Fewer states potentially reduces the size of the state register
• Consider the two FSMs below with x=1, then 1, then 0, 0
– A four-state FSM (S0–S3, outputs y = 0, 1, 0, 1) and a two-state FSM (S0, S1)
– For the same sequence of inputs, the
output of the two FSMs is the same
– if x = 1,1,0,0 then y = 0,1,1,0,0

State Reduction: Equivalent States
• Two states are equivalent if:
1. They assign the same values to outputs
– e.g., S0 and S2 both assign y to 0; S1 and S3 both assign y to 1
2. AND, for all possible sequences of inputs, the FSM outputs will be the same starting from either state
– e.g., say x = 1,1,0,0,...
• starting from S1, y = 1,1,0,0,...
• starting from S3, y = 1,1,0,0,...
– States S0 and S2 are equivalent; states S1 and S3 are equivalent – each pair can be merged

State Reduction: Example with no Equivalencies
• Another example (states S0–S3, outputs y = 0, 1, 1, 1)
• State S0 is not equivalent to any other state, since its output (y=0) differs from the other states' outputs
• Consider states S1 and S3
– Outputs are initially the same (y=1)
– Start from S1, x=0: go to S2, where y=1
– Start from S3, x=0: go to S0, where y=0
– The outputs differ, so S1 and S3 are not equivalent
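The two-part equivalence test above can be exercised by simulating both FSMs on many input sequences. A sketch, with the four-state machine's transitions reconstructed from the slides' discussion (state names T0/T1 for the reduced machine are illustrative):

```python
from itertools import product

# Moore FSMs as {state: (output, {input: next_state})}.
# Four-state FSM from the slides (outputs y = 0,1,0,1) and its 2-state reduction.
big = {'S0': (0, {0: 'S0', 1: 'S1'}), 'S1': (1, {0: 'S2', 1: 'S1'}),
       'S2': (0, {0: 'S2', 1: 'S3'}), 'S3': (1, {0: 'S0', 1: 'S3'})}
small = {'T0': (0, {0: 'T0', 1: 'T1'}), 'T1': (1, {0: 'T0', 1: 'T1'})}

def run(fsm, start, inputs):
    # Return the sequence of Moore outputs produced after each transition.
    state, out = start, []
    for x in inputs:
        state = fsm[state][1][x]
        out.append(fsm[state][0])
    return out

# Same outputs on every input sequence up to length 8.
for n in range(1, 9):
    for seq in product([0, 1], repeat=n):
        assert run(big, 'S0', seq) == run(small, 'T0', seq)
```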
State Reduction with Implication Tables
• State reduction through visual inspection (what we did in the last few slides) isn't reliable and cannot be automated – a more methodical approach is needed: implication tables
• Example (the FSM with states S0–S3 and outputs y = 0, 1, 0, 1):
– To compare every pair of states, construct a table of state pairs
– Remove redundant state pairs, and state pairs along the diagonal, since a state is equivalent to itself
• Mark (with an X) state pairs with different outputs as non-equivalent:
– (S1, S0): At S1, y=1 and at S0, y=0. So S1 and S0 are non-equivalent.
– (S2, S0): At S2, y=0 and at S0, y=0. So we don't mark S2 and S0 now.
– (S2, S1): non-equivalent
– (S3, S0): non-equivalent
– (S3, S1): don't mark
– (S3, S2): non-equivalent
• We can see that S2 & S0 might be equivalent, and S3 & S1 might be equivalent, but only if their next states are equivalent (remember the example from two slides ago)
• We need to check each unmarked state pair's next states
• Start by listing each unmarked state pair's next state pairs for every combination of inputs
– (S2, S0)
• From S2, when x=1, go to S3; from S0, when x=1, go to S1. So we add (S3, S1) as a next state pair.
• From S2, when x=0, go to S2; from S0, when x=0, go to S0. So we add (S2, S0) as a next state pair.
– (S3, S1)
• By a similar process, we add the next state pairs (S3, S1) and (S0, S2)
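The next-state-pair listing in step 2 is mechanical. A Python sketch of it; the transition table below is reconstructed from the slides' discussion, and the variable names are illustrative:

```python
from itertools import combinations

# Moore outputs and transitions of the slides' S0..S3 FSM.
outputs = {'S0': 0, 'S1': 1, 'S2': 0, 'S3': 1}
nxt = {('S0', 0): 'S0', ('S0', 1): 'S1', ('S1', 0): 'S2', ('S1', 1): 'S1',
       ('S2', 0): 'S2', ('S2', 1): 'S3', ('S3', 0): 'S0', ('S3', 1): 'S3'}

# For each unmarked pair (outputs agree), list the implied next-state pairs.
implied = {}
for s, t in combinations(outputs, 2):
    if outputs[s] == outputs[t]:
        implied[(s, t)] = {x: frozenset({nxt[(s, x)], nxt[(t, x)]})
                           for x in (0, 1)}

assert implied[('S0', 'S2')] == {0: frozenset({'S0', 'S2'}),
                                 1: frozenset({'S1', 'S3'})}
assert implied[('S1', 'S3')] == {0: frozenset({'S0', 'S2'}),
                                 1: frozenset({'S1', 'S3'})}
```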
State Reduction with Implication Tables
• Next we check every unmarked state pair's next state pairs
• We mark a state pair if one of its next state pairs is marked
– (S2, S0)
• Next state pair (S3, S1) is not marked; next state pair (S2, S0) is not marked
• So we do nothing and move on
– (S3, S1)
• Next state pair (S3, S1) is not marked; next state pair (S0, S2) is not marked
• So we do nothing and move on
• We just made a pass through the implication table
– Make additional passes until no change occurs
• Then merge the unmarked state pairs – they are equivalent
– Here, S0 merges with S2, and S1 merges with S3, yielding a two-state FSM

State Reduction Example
• Given the FSM on the right (states S0–S3, outputs y = 0, 1, 1, 1)
– Step 1: Mark state pairs having different outputs as non-equivalent
– Step 2: For each unmarked state pair, write the next state pairs for the same input values
• (S2, S1): next state pairs (S2, S2) and (S3, S1)
• (S3, S1): next state pairs (S0, S2) and (S3, S1)
• (S3, S2): next state pairs (S0, S2) and (S3, S3)
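The whole implication-table method is easy to automate for Moore FSMs. A minimal sketch (transitions reconstructed from the earlier S0–S3 example; names are illustrative):

```python
from itertools import combinations

def reduce_states(states, outputs, next_state):
    """Implication-table state reduction; returns the set of equivalent pairs.
    outputs[s] is the Moore output of s; next_state[(s, x)] is the transition."""
    # Step 1: pairs with different outputs are non-equivalent (marked);
    # keep only the unmarked pairs.
    unmarked = {frozenset(p) for p in combinations(states, 2)
                if outputs[p[0]] == outputs[p[1]]}
    # Step 3, repeated: mark a pair if any of its next-state pairs is marked.
    changed = True
    while changed:
        changed = False
        for pair in list(unmarked):
            s, t = tuple(pair)
            for x in (0, 1):
                nsp = frozenset({next_state[(s, x)], next_state[(t, x)]})
                if len(nsp) == 2 and nsp not in unmarked:
                    unmarked.discard(pair)
                    changed = True
                    break
    # Step 4 (merging) would union the surviving pairs.
    return unmarked

outputs = {'S0': 0, 'S1': 1, 'S2': 0, 'S3': 1}
next_state = {('S0', 0): 'S0', ('S0', 1): 'S1', ('S1', 0): 'S2', ('S1', 1): 'S1',
              ('S2', 0): 'S2', ('S2', 1): 'S3', ('S3', 0): 'S0', ('S3', 1): 'S3'}
assert reduce_states(list(outputs), outputs, next_state) == \
    {frozenset({'S0', 'S2'}), frozenset({'S1', 'S3'})}
```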
– Step 3: For each unmarked state pair, mark state pairs having non-equivalent next state pairs as non-equivalent
• Repeat this step until no change occurs, or until all states are marked
– Step 4: Merge the remaining state pairs
• In this example, all state pairs end up marked – there are no equivalent state pairs to merge

A Larger State Reduction Example
• Five-state FSM (S0–S4): S1 and S2 output y=1; S0, S3, and S4 output y=0
– Step 1: Mark state pairs having different outputs as non-equivalent
– Step 2: For each unmarked state pair, write the next state pairs for the same input values
• e.g., (S2, S1) has next state pairs (S3, S4) and (S2, S1)
– Step 3: For each unmarked state pair, mark state pairs having non-equivalent next state pairs as non-equivalent
• Repeat this step until no change occurs, or until all states are marked
– Step 4: Merge the remaining state pairs
• Here, S3 merges with S4, and S1 merges with S2, yielding a three-state FSM: S0 (y=0), S3,S4 (y=0), and S1,S2 (y=1)

Need for Automation
• Automation is needed
– The table for a large FSM is too big for humans to work with
• With n inputs, each state pair can have 2^n next state pairs
• 4 inputs → 2^4 = 16 next state pairs
– 100 states would yield a table with 100*100 = 10,000 state-pair cells
• State reduction is thus typically automated
– Often using heuristics to reduce compute time

State Encoding
• Encoding: assigning a unique bit representation to each state
• Different encodings may optimize size, or trade off size and performance
• Consider the 3-Cycle Laser Timer (Off with x=0; On1, On2, On3 each with x=1)
– Example 3.7's encoding (Off=00, On1=01, On2=10, On3=11): 15 gate inputs
– Try an alternative encoding (Off=00, On1=01, On2=11, On3=10):
• x = s1 + s0
• n1 = s0
• n0 = s1'b + s1's0
• Only 8 gate inputs

State Encoding: One-Hot Encoding
• One-hot encoding
– One bit per state – a bit being '1' corresponds to a particular state
– An alternative to the minimum-bit-width encoding of the previous example
– For states A, B, C, D: A: 0001, B: 0010, C: 0100, D: 1000
• Example: FSM that outputs 0, 1, 1, 1 (states A–D, no inputs)
– Equations with one-hot encoding:
• n3 = s2; n2 = s1; n1 = s0; x = s3 + s2 + s1
– Fewer gates and only one level of logic – less delay than two levels, so a faster clock frequency
– (Plot: one-hot encoding has lower delay than binary encoding)

One-Hot Encoding Example: Three-Cycles-High Laser Timer
• Four states – use a four-bit one-hot encoding (Off=0001, On1=0010, On2=0100, On3=1000)
– The state table leads to the equations:
• x = s3 + s2 + s1
• n3 = s2
• n2 = s1
• n1 = s0*b
• n0 = s0*b' + s3
– Smaller
• 3+0+0+2+(2+2) = 9 gate inputs
• The earlier binary encoding (Ch 3) required 15 gate inputs
– Faster
• Critical path: n0 = s0*b' + s3
• Previously: n0 = s1's0'b + s1s0'
• A 2-input AND is slightly faster than a 3-input AND

Output Encoding
• Output encoding: an encoding method where the state encoding is the same as the output values
– Possible only if there are enough outputs, and all states have unique output values
• Example: states A, B, C, D with outputs xy = 00, 01, 10, 11 – use the output values as the state encoding

Output Encoding Example: Sequence Generator
• Generate the sequence 0001, 0011, 1100, 1000, then repeat
– FSM shown with states A, B, C, D and outputs wxyz = 0001, 0011, 1100, 1000
• Use the output values as the state encoding
• Create the state table
• Derive the equations for the next state:
– n3 = s1 + s2; n2 = s1; n1 = s1's0; n0 = s1's0 + s3s2'

Moore vs.
Mealy FSMs
• FSM implementation architecture
– State register and combinational logic
– More detailed view
• Next-state logic – a function of the present state and the FSM inputs
• Output logic
– If a function of the present state only – Moore FSM
– If a function of the present state and the FSM inputs – Mealy FSM
– Graphically: Mealy shows outputs on arcs (e.g., b/x=1), not in states

Mealy FSMs May Have Fewer States
• Soda dispenser example: initialize, wait until enough, dispense
– Moore: 3 states (Init, Wait, Disp); Mealy: 2 states (Init, Wait, with enough/d=1 on the arc)
– Timing: Moore visits states I, W, W, D, I; Mealy visits I, W, W, I

Mealy vs. Moore
• Q: Which is Moore, and which is Mealy? (the two beeping-wristwatch FSMs)
• A: Mealy on the left, Moore on the right
– Mealy outputs are on arcs, meaning outputs are a function of state AND inputs
– Moore outputs are in states, meaning outputs are a function of state only

Mealy vs.
Moore Example: Beeping Wristwatch
• Button b
– Sequences the mux select lines s1s0 through 00, 01, 10, and 11
• Each value displays a different internal register (Time, Alarm, Date, Stopwatch)
– Each unique button press should cause a 1-cycle beep, with p=1 being the beep
• Must wait for the button to be released (b') and pushed again (b) before sequencing
• Note that Moore requires a unique state to pulse p, while Mealy pulses p on an arc
• Tradeoff: Mealy's pulse on p may not last one full cycle

Mealy vs. Moore Tradeoff
• Mealy outputs change mid-cycle if an input changes
– Note the earlier soda dispenser example
• Mealy had fewer states, but output d was not 1 for a full cycle
– Represents a type of tradeoff

Implementing a Mealy FSM
• Straightforward
– Convert to a state table
– Derive equations for each output
– Key difference from Moore: external outputs (d, clear) may have different values in the same state, depending on the input values

Mealy and Moore can be Combined
• Final note on Mealy/Moore
– They may be combined in the same FSM: a state can assign s1s0 Moore-style while b'/p=0 and b/p=1 appear Mealy-style on its arcs
– Combined
Moore/Mealy FSM for the beeping wristwatch example

6.4 Datapath Component Tradeoffs
• Can make some components faster (but bigger), or smaller (but slower), than the straightforward components we built in Ch 4
• We'll build
– A faster (but bigger) adder than the carry-ripple adder
– A smaller (but slower) multiplier than the array-based multiplier
• Could also do so for the other Ch 4 components

Faster Adder
• Built the carry-ripple adder in Ch 4
– Similar to adding by hand, column by column:
carries:   c3 c2 c1 cin
B:            b3 b2 b1 b0
A:         +  a3 a2 a1 a0
cout  s3 s2 s1 s0
– Con: Slow
• The output is not correct until the carries have rippled to the left
• A 4-bit carry-ripple adder has 4*2 = 8 gate-delays
– Pro: Small
• A 4-bit carry-ripple adder has just 4*5 = 20 gates

Faster Adder
• Faster adder attempt: use the two-level combinational logic design process
– Recall that the 4-bit two-level adder was big
– Pro: Fast
• Two-level: an AND level followed by ORs – 2 gate-delays
– Con: Large
• The truth table would have 2^(4+4) = 256 rows
• A plot shows the 4-bit two-level adder would use about 500 gates
• Is there a compromise design?
– Between 2 and 8 gate-delays
– Between 20 and 500 gates

Faster Adder – (Bad) Attempt at “Lookahead”
• Idea
  – Modify the carry-ripple adder — for a stage’s carry-in, don’t wait for the carry to ripple, but rather directly compute it from the inputs of earlier stages
  – Called “lookahead” because the current stage “looks ahead” at previous stages rather than waiting for the carry to ripple to the current stage
  – Notice: no rippling of the carry

Faster Adder – (Bad) Attempt at “Lookahead”
• Want each stage’s carry-in bit to be a function of external inputs only (a’s, b’s, or c0)
• Recall the full-adder equations:
  – s = a xor b xor c
  – co = bc + ac + ab
• Stage 0: the carry-in is already an external input: c0
• Stage 1: c1 = co0, so c1 = b0c0 + a0c0 + a0b0
• Stage 2: c2 = co1, so c2 = b1c1 + a1c1 + a1b1
  – c2 = b1(b0c0 + a0c0 + a0b0) + a1(b0c0 + a0c0 + a0b0) + a1b1
  – c2 = b1b0c0 + b1a0c0 + b1a0b0 + a1b0c0 + a1a0c0 + a1a0b0 + a1b1
• Continue for c3

Faster Adder – (Bad) Attempt at “Lookahead”
• Carry-lookahead logic is now a function of external inputs — no waiting for the ripple
• Problem: the equations get too big
  – c1 = b0c0 + a0c0 + a0b0
  – c2 = b1b0c0 + b1a0c0 + b1a0b0 + a1b0c0 + a1a0c0 + a1a0b0 + a1b1
  – c3 = b2b1b0c0 + b2b1a0c0 + b2b1a0b0 + b2a1b0c0 + b2a1a0c0 + b2a1a0b0 + b2a1b1 + a2b1b0c0 + a2b1a0c0 + a2b1a0b0 + a2a1b0c0 + a2a1a0c0 + a2a1a0b0 + a2a1b1 + a2b2
  – Not efficient — need a better form of lookahead

Better Form of Lookahead
• Have each stage compute two terms
  – Propagate: P = a xor b
  – Generate: G = ab
• Compute the lookahead from the P and G terms, not from the external inputs
  – Why P & G? Because the logic comes out much simpler
    • Very clever finding, though not particularly obvious
  – Why those names?
    • G: if a and b are 1, the carry-out will be 1 — the stage “generates” a carry-out of 1 in this case (if a0b0 = 1 then c1 = 1)
    • P: if only one of a or b is 1, then the carry-out will equal the carry-in — the stage “propagates” the carry-in to the carry-out (if a0 xor b0 = 1, then c1 = 1 if c0 = 1)

Better Form of Lookahead
• With P & G, the carry-lookahead equations are much simpler
  – Equations before plugging in:
    • c1 = G0 + P0c0
    • c2 = G1 + P1c1
    • c3 = G2 + P2c2
    • cout = G3 + P3c3
  – After plugging in:
    • c1 = G0 + P0c0
    • c2 = G1 + P1c1 = G1 + P1(G0 + P0c0) = G1 + P1G0 + P1P0c0
    • c3 = G2 + P2c2 = G2 + P2(G1 + P1G0 + P1P0c0) = G2 + P2G1 + P2P1G0 + P2P1P0c0
    • cout = G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0c0
  – Much simpler than the “bad” lookahead
• Each stage’s logic — half-adders producing P and G, plus the sum XOR — is called a sum/propagate/generate (SPG) block
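The P/G carry equations above can be sanity-checked in software. Below is a minimal behavioral sketch (the function name and bit-packing are mine, not from the slides) that computes the four lookahead carries exactly as derived above and checks the result against ordinary addition:

```python
# Sanity check of the P/G carry-lookahead equations above.
# A behavioral sketch only -- cla_add_4bit and its bit-packing are mine.

def cla_add_4bit(a, b, c0=0):
    """4-bit add using propagate/generate carry lookahead."""
    abits = [(a >> i) & 1 for i in range(4)]
    bbits = [(b >> i) & 1 for i in range(4)]
    p = [ai ^ bi for ai, bi in zip(abits, bbits)]   # propagate: P = a xor b
    g = [ai & bi for ai, bi in zip(abits, bbits)]   # generate:  G = ab
    # The "after plugging in" equations -- each is a 2-level AND-OR expression:
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
          | (p[2] & p[1] & p[0] & c0))
    cout = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
            | (p[3] & p[2] & p[1] & g[0])
            | (p[3] & p[2] & p[1] & p[0] & c0))
    carries = [c0, c1, c2, c3]
    s = [p[i] ^ carries[i] for i in range(4)]       # sum bit: a xor b xor cin
    return sum(bit << i for i, bit in enumerate(s)), cout

# Exhaustive check against ordinary integer addition:
for a in range(16):
    for b in range(16):
        for c0 in (0, 1):
            s, cout = cla_add_4bit(a, b, c0)
            assert (cout << 4) | s == a + b + c0
```

Note that every carry expression is two levels (ANDs feeding an OR), no matter the stage — which is exactly why the CLA is fast.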
Carry-Lookahead Adder – High-Level View
• Each stage has an SPG block with 2 gate levels
• The carry-lookahead logic quickly computes the carries from the propagate and generate bits using 2 gate levels inside
  – c1 = G0 + P0c0
  – c2 = G1 + P1G0 + P1P0c0
  – c3 = G2 + P2G1 + P2P1G0 + P2P1P0c0
  – cout = G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0c0
• Reasonable number of gates — a 4-bit CLA adder has only 26 gates
• Fast — only 4 gate-delays
• 4-bit adder comparison (gate-delays, gates)
  – Carry-ripple: (8, 20)
  – Two-level: (2, 500)
  – CLA: (4, 26) — a nice compromise

Carry-Lookahead Adder – 32-bit?
• Problem: the gates get bigger in each stage
  – The 4th stage has 5-input gates; a 32nd stage would have 33-input gates
  – Too many inputs for one gate
  – Would require building from smaller gates, meaning more levels (slower) and more gates (bigger)
• One solution: connect 4-bit CLA adders in a ripple manner
  – But slow (4 + 4 + 4 + 4 gate-delays for the 16-bit adder shown)

Hierarchical Carry-Lookahead Adders
• Better solution — rather than rippling the carries, just repeat the carry-lookahead concept
  – Requires a minor modification of the 4-bit CLA adder to output P and G
  – The 4-bit adders use carry-lookahead internally; a second level of carry-lookahead combines their P and G outputs — the same lookahead logic as inside the 4-bit adders

Hierarchical Carry-Lookahead Adders
• The hierarchical CLA concept can be applied for larger adders
• 32-bit hierarchical CLA
  – Only about 8 gate-delays (2 for the SPG block, then 2 per CLA level)
  – Only about 14 gates in each 4-bit CLA logic block
• Q: How many gate-delays for a 64-bit hierarchical CLA, using 4-bit CLA logic?
  A: 16 CLA-logic blocks in the 1st level, 4 in the 2nd, 1 in the 3rd — so still just 8 gate-delays (2 for SPG, and 2+2+2 for CLA logic). CLA is a very efficient method.

Carry Select Adder
• Another way to compose adders
  – High-order stage: compute the result for a carry-in of 1 and for a carry-in of 0, operating in parallel with the low-order stage
  – Select the correct result based on the carry-out of the low-order stage
  – Faster than pure rippling

Adder Tradeoffs
• Size vs. delay: carry-ripple (smallest, slowest), then carry-select, carry-lookahead, and multilevel carry-lookahead (largest, fastest)
• The designer picks the adder that satisfies particular delay and size requirements
  – May use different adder types in different parts of the same design
    • Faster adders on the critical path, smaller adders on non-critical paths

Smaller Multiplier
• The multiplier in Ch 4 was array style
  – Fast, and reasonable size for 4-bit: 4*4 = 16 partial product AND terms, and 3 adders
  – Rather big for 32-bit: 32*32 = 1024 AND terms, and 31 adders
  – A 32-bit version would have 1024 AND gates here ...
and 31 adders here (big ones, too)
  – 4-bit array multiplier: partial products pp1–pp4 (AND terms of a3..a0 with b0..b3), summed by 5-bit, 6-bit, and 7-bit adders, producing p7..p0

Smaller Multiplier – Sequential (Add-and-Shift) Style
• Smaller multiplier: basic idea
  – Don’t compute all the partial products simultaneously
  – Rather, compute one at a time (similar to multiplying by hand), and maintain a running sum
• Example, 0110 * 0011 (running sum shown growing one shifted position per step):
  – Step 1: running sum 0000; multiplier bit is 1, so add partial product 0110 → 00110
  – Step 2: multiplier bit is 1, add 0110 → 010010
  – Step 3: multiplier bit is 0, add 0000 → 0010010
  – Step 4: multiplier bit is 0, add 0000 → 00010010 (the product)

Smaller Multiplier – Sequential (Add-and-Shift) Style
• Design a circuit that computes one partial product at a time and adds it to a running sum
  – Note that shifting the running sum right (relative to the partial product) after each step ensures each partial product is added to the correct running-sum bits
• Datapath: a 4-bit multiplicand register (load mdld), a 4-bit multiplier register (load mrld, bits mr3–mr0), a 4-bit adder, and an 8-bit running sum register (load, clear, shr), plus a controller with a start input; the product is the running sum register’s output

Smaller Multiplier – Sequential Style: Controller
• Waits for start=1, then loads the multiplicand and multiplier registers and clears the running sum (mdld=1, mrld=1, rsclear=1)
• Looks at the multiplier one bit at a time (mr0, then mr1, mr2, mr3)
  – Adds the partial product (the multiplicand) to the running sum if the present multiplier bit is 1 (rsload=1)
  – Then shifts the running sum right one position (rsshr=1)
• Vs. array-style:
  – Pro: small — just three registers, an adder, and a controller
  – Con: slow — 2 cycles per multiplier bit; 32-bit: 32*2 = 64 cycles (plus 1 for initialization)

6.5 RTL Design Optimizations and Tradeoffs
• While creating the datapath during RTL design, there are several optimizations and tradeoffs, involving
  – Pipelining
  – Concurrency
  – Component allocation
  – Operator binding
  – Operator scheduling
  – Moore vs. Mealy high-level state machines

Pipelining
• Intuitive example: washing dishes with a friend — you wash, the friend dries
  – You wash plate 1
  – Then the friend dries plate 1, while you wash plate 2
  – Then the friend dries plate 2, while you wash plate 3; and so on
  – You don’t sit and watch the friend dry; you start on the next plate
  – Without pipelining: W1 D1 W2 D2 W3 D3; with pipelining, W1 W2 W3 overlap with D1 D2 D3 (“Stage 1” washes, “Stage 2” dries)
• Pipelining: break a task into stages, each stage outputs data for the next stage, and all stages operate concurrently (if they have data)

Pipelining Example
• S = W + X + Y + Z, built from 2 ns adders: two adders compute W+X and Y+Z, and a third sums the results
• The datapath on the left has a critical path of 2+2 = 4 ns, so the fastest clock period is 4 ns
  – Can read new data, add, and write the result to S every 4 ns
• The datapath on the right adds pipeline registers between the two adder levels, so the longest path in each stage is only 2 ns and the minimum clock period is 2 ns
  – So it can read new data every 2 ns — doubled performance (sort of...)
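Before moving further into RTL tradeoffs: the add-and-shift multiplier described above is easy to sketch behaviorally. The following is a minimal model (function and variable names are mine; the slides' datapath uses registers and a controller rather than a loop), which mirrors the "add partial product to the upper half, then shift the running sum right" steps:

```python
# Behavioral sketch of the add-and-shift multiplier above (4-bit version).
# Names follow the slides loosely; the code itself is illustrative only.

def add_and_shift_multiply(multiplicand, multiplier, bits=4):
    """One partial product per step; shift the running sum right each step."""
    running_sum = 0                      # models the 2*bits-wide register
    for i in range(bits):                # look at the multiplier one bit at a time
        if (multiplier >> i) & 1:        # present multiplier bit is 1:
            running_sum += multiplicand << bits   # add multiplicand to upper half
        running_sum >>= 1                # shift running sum right one position
    return running_sum

print(add_and_shift_multiply(0b0110, 0b0011))  # → 18 (i.e., 6 * 3)
```

Each loop iteration corresponds to the 2-cycle add/shift pair in the hardware version, which is exactly why the sequential multiplier costs about 2 cycles per multiplier bit.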
Pipelining Example
• Pipelining requires a refined definition of performance
  – Latency: the time for new input data to result in new output data (seconds)
  – Throughput: the rate at which new data can be input (items / second)
  – So pipelining the above system
    • Doubled the throughput, from 1 item / 4 ns to 1 item / 2 ns
    • Latency stayed the same: 4 ns

Pipeline Example: FIR Datapath
• 100-tap FIR filter: a row of 100 concurrent multipliers, followed by a tree of adders
  – Assume 20 ns per multiplier and 14 ns for the entire adder tree
  – Critical path of 20+14 = 34 ns
• Add pipeline registers between the multipliers and the adder tree
  – Longest path is now only 20 ns
  – Clock frequency can be nearly doubled
• Great speedup with minimal extra hardware

Concurrency
• Concurrency: divide a task into subparts, and execute the subparts simultaneously
  – Dishwashing example: divide the stack into 3 substacks and give them to 3 neighbors, who work simultaneously — 3 times speedup (ignoring the time to move dishes to the neighbors’ homes)
  – Concurrency does things side-by-side; pipelining instead uses stages (like a factory line)
  – We already used concurrency in the FIR filter — concurrent multiplications
  – Can do both, too

Concurrency Example: SAD Design Revisited
• Sum-of-absolute-differences video compression example (Ch 5)
  – Compute the sum of absolute differences (SAD) of 256 pairs of pixels
  – Original design: the main loop did 1 sum per iteration, 256 iterations, 2 cycles per iteration
    • The subtract/abs/add is done in 1 cycle, but done 256 times
  – 256 iters. * 2 cycles/iter.
= 512 cycles

Concurrency Example: SAD Design Revisited
• More concurrent design
  – Compute the SAD for 16 pairs concurrently, and do this 16 times to cover all 16*16 = 256 pairs
  – The main loop does 16 sums per iteration, only 16 iterations, still 2 cycles per iteration
  – Datapath: 16 subtractors, 16 absolute-value units, and an adder tree to sum the 16 values into the sum register
    • All 16 subtract/abs/add operations are done in 1 cycle, but done only 16 times
  – New: 16*2 = 32 cycles; original: 256*2 = 512 cycles

Concurrency Example: SAD Design Revisited
• Comparing the two designs
  – Original: 256 iterations * 2 cycles/iter = 512 cycles
  – More concurrent: 16 iterations * 2 cycles/iter = 32 cycles
  – Speedup: 512/32 = 16x
• Versus software
  – Recall: estimated about 6 microprocessor cycles per iteration
    • 256 iterations * 6 cycles per iteration = 1536 cycles
  – Original design’s speedup vs. software: 1536 / 512 = 3x (assuming cycle lengths are equal)
  – Concurrent design’s speedup vs. software: 1536 / 32 = 48x
    • 48x is very significant — the quality of the video may be much better

Component Allocation
• Another RTL tradeoff: component allocation
  – Choosing a particular set of functional units to implement a set of operations
  – e.g., given two states, each with a multiplication (A: t1 = t2*t3; B: t4 = t5*t6)
    • Can use 2 multipliers (FSM actions: A: t1ld=1; B: t4ld=1)
    • OR can instead use 1 multiplier and 2 muxes (A: sl=0, sr=0, t1ld=1; B: sl=1, sr=1, t4ld=1)
    • Smaller size, but slightly longer delay due to the mux delay

Operator Binding
• Another RTL tradeoff: operator binding
  – Mapping a set of operations to a particular component allocation
  – Note: operator/operation mean behavior (multiplication, addition), while component (aka functional unit) means hardware (multiplier, adder)
  – Different bindings may yield different size or delay
  – e.g., three multiplications (t1 = t2*t3, t4 = t5*t6, t7 = t8*t3) bound to 2 allocated multipliers, MULA and MULB
    • Binding 1 needs 2 muxes; Binding 2, which binds the two multiplications sharing t3 (t1 and t7) to the same multiplier, needs only 1 mux — smaller size

Operator Scheduling
• Yet another RTL tradeoff: operator scheduling
  – Introducing or merging states, and assigning operations to those states.
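The serial-vs-concurrent SAD comparison above can also be sketched behaviorally. Both functions below compute the same sum; the second mirrors the concurrent datapath by handling 16 absolute differences per iteration (function names are mine, not from the slides):

```python
# Behavioral sketch of the serial vs. 16-wide SAD designs above.
# Both compute the same 256-pixel SAD; only the loop structure differs.

def sad_serial(A, B):
    """Original design: 1 absolute difference per iteration, 256 iterations."""
    total = 0
    for i in range(256):
        total += abs(A[i] - B[i])
    return total

def sad_concurrent16(A, B):
    """Concurrent design: 16 subtract/abs units plus an adder tree,
    so 16 absolute differences per iteration, only 16 iterations."""
    total = 0
    for i in range(16):
        chunk = range(16 * i, 16 * i + 16)
        total += sum(abs(A[j] - B[j]) for j in chunk)   # models the adder tree
    return total

A = list(range(256))
B = [x ^ 1 for x in A]           # every pixel pair differs by exactly 1
print(sad_serial(A, B), sad_concurrent16(A, B))  # → 256 256
```

In hardware the inner 16-wide sum happens in a single cycle, which is where the 512-cycle to 32-cycle reduction comes from; in software both versions do the same 256 subtractions.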
• e.g., a 3-state schedule (A, B, C) where state B computes both t1 = t2*t3 and t4 = t5*t6 needs 2 multipliers; splitting B into two states (B and B2), each with one multiplication, yields a 4-state schedule needing only 1 multiplier plus 2 muxes
  – Smaller (only 1 *), but more delay due to the muxes

Operator Scheduling Example: Smaller FIR Filter
• The 3-tap FIR filter design in Ch 5 had only one state — the datapath computes a new Y every cycle
  – y(t) = c0*x(t) + c1*x(t-1) + c2*x(t-2)
  – Inputs: X (N bits); Outputs: Y (N bits); Local registers: xt0, xt1, xt2 (N bits)
  – S1: xt0 = X; xt1 = xt0; xt2 = xt1; Y = xt0*c0 + xt1*c1 + xt2*c2
  – Used 3 multipliers and 2 adders; can we reduce the design’s size?

Operator Scheduling Example: Smaller FIR Filter
• Reduce the design’s size by re-scheduling the operations — do only one multiplication operation per state
  – Local registers: xt0, xt1, xt2, sum (N bits)
  – S1: sum = 0; xt0 = X; xt1 = xt0; xt2 = xt1
  – S2: sum = sum + xt0 * c0
  – S3: sum = sum + xt1 * c1
  – S4: sum = sum + xt2 * c2
  – S5: Y = sum

Operator Scheduling Example: Smaller FIR Filter
• With only one multiplication (*) per state, along with the sum (+), the datapath needs only one multiplier and one adder
  – Two 3x1 muxes (selects mul_s1, mul_s0) choose among xt0/xt1/xt2 and c0/c1/c2 as the multiplier’s inputs; the adder accumulates into the sum register, and yreg holds Y
  – A multiplier feeding an accumulating adder is a multiply-accumulate (MAC): a common datapath component

Operator Scheduling Example: Smaller FIR Filter
• Many other options exist between fully-concurrent and fully-serialized
  – e.g., for a 3-tap FIR, can use 1, 2, or 3 multipliers
  – Can also choose fast array-style multipliers (which are concurrent internally) or slower shift-and-add multipliers (which are serialized internally)
  – Each option represents a compromise between the concurrent FIR (fastest, biggest) and the serial FIR (smallest, slowest)

6.6 More on Optimizations and Tradeoffs
• Serial vs. concurrent computation has been a common tradeoff theme at all levels of design
  – Serial: perform tasks one at a time
  – Concurrent: perform multiple tasks simultaneously
• Combinational logic tradeoffs
  – Concurrent: two-level logic (fast but big)
  – Serial: multi-level logic (smaller but slower)
    • abc + abd + ef → (ab)(c+d) + ef — essentially computes ab first (serialized)
• Datapath component tradeoffs
  – Serial: carry-ripple adder (small but slow)
  – Concurrent: carry-lookahead adder (faster but bigger) — computes the carry-in bits concurrently
  – Also the multiplier: concurrent (array-style) vs. serial (shift-and-add)
• RTL design tradeoffs
  – Concurrent: schedule multiple operations in one state
  – Serial: schedule one operation per state

Higher vs. Lower Levels of Design
• Optimizations and tradeoffs at higher levels typically have greater impact than those at lower levels
  – RTL decisions impact size/delay more than gate-level decisions
  – Spotlight analogy: the lower you are, the less of the solution landscape is illuminated (meaning possible)

Algorithm Selection
• The chosen algorithm can have a big impact
  – e.g., which filtering algorithm? FIR is one type, but others require less computation at the expense of lower-quality filtering
• Example: quickly find an item’s address in a 256-word memory
  – One use: data compression. Many others.
• Algorithm 1: “Linear search”
  – Compare the item with M[0], then M[1], M[2], ...
  – 256 comparisons in the worst case
  – (Example 256x32 memory contents: 0: 0x00000000, 1: 0x00000001, 2: 0x0000000F, 3: 0x000000FF, ..., 96: 0x00000F0A, 128: 0x0000FFAA, ..., 255: 0xFFFF0000)
• Algorithm 2: “Binary search” (sort the memory first)
  – Start by considering the entire memory range
    • If M[mid] > item, consider the lower half of M
    • If M[mid] < item, consider the upper half of M
    • Repeat on the new, smaller range
    • Dividing the range by 2 each step; at most 8 such divisions
  – Only 8 comparisons in the worst case
• The choice of algorithm has tremendous impact
  – Far more impact than, say, the choice of comparator type

Power Optimization
• Until now, we’ve focused on size and delay; power is another important design criterion
  – Measured in Watts (energy/second) — the rate at which energy is consumed
  – Increasingly important as more transistors fit on a chip
    • Power is not scaling down at the same rate as size — means more heat per unit area, and cooling is difficult
    • Coupled with batteries not improving at the same rate (energy demand is growing much faster than battery energy density), the battery can’t supply the chip’s power for as long
  – CMOS technology: switching a wire from 0 to 1 consumes power (known as dynamic power)
    • P = k * C * V^2 * f
      – k: constant; C: capacitance of the wires; V: voltage; f: switching frequency
• Power reduction methods
  – Reduce the voltage: but slower, and there’s a limit
  – What else?
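The linear-vs-binary search comparison above is easy to check by counting comparisons. The sketch below is mine (the slides describe the algorithms, not this code); note that this straightforward equality-testing variant uses at most 9 comparisons for 256 words, since it spends one comparison per level while the slides count only the 8 range-halvings:

```python
# Counting comparisons for the linear vs. binary search example above,
# on a sorted 256-word "memory" M. Function names are mine.

def linear_search(M, item):
    """Compare the item with M[0], then M[1], M[2], ..."""
    for count, value in enumerate(M, start=1):
        if value == item:
            return count                 # comparisons used

def binary_search(M, item):
    """Repeatedly halve the range; M must be sorted."""
    lo, hi, count = 0, len(M) - 1, 0
    while lo <= hi:
        count += 1
        mid = (lo + hi) // 2
        if M[mid] == item:
            return count
        elif M[mid] > item:
            hi = mid - 1                 # consider the lower half of M
        else:
            lo = mid + 1                 # consider the upper half of M

M = sorted(range(0, 512, 2))             # a sorted 256-word memory
print(linear_search(M, 510), binary_search(M, 510))  # → 256 9
```

For the worst-case item, linear search examines all 256 words while binary search needs only a handful of probes — the "tremendous impact" the slides refer to.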
Power Optimization using Clock Gating
• P = k * C * V^2 * f
• Much of a chip’s switching f (>30%) is due to clock signals
  – After all, the clock goes to every register
  – In the portion of the FIR filter shown, notice the clock signals n1, n2, n3, n4 — much switching on the clock wires
• Solution: disable clock switching to registers unused in a particular state
  – Achieved using AND gates on the clock wires
  – The FSM sets the 2nd input of an AND gate to 1 only in those states during which the register gets loaded (in the figure, n1–n3 are enabled in state S1 when the x registers load, n4 in S5 when yreg loads)
  – Greatly reduced switching — less power
• Note: an advanced method, usually done by tools, not designers
  – Putting gates on clock wires creates variations in the clock signal (clock skew); must be done with great care

Power Optimization using Low-Power Gates on Non-Critical Paths
• Another method: use low-power gates
  – Multiple versions of gates may exist: fast/high-power and slow/low-power versions
  – Use slow/low-power gates on non-critical paths
    • Reduces power, without increasing delay
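The dynamic-power relation P = k * C * V^2 * f above also explains why voltage reduction is the first lever designers reach for. A tiny sketch (the constant k and all example numbers below are made up for illustration, not from the slides):

```python
# Sketch of the dynamic-power relation P = k * C * V^2 * f from the
# slides. The constant k and the example numbers are hypothetical.

def dynamic_power(k, C, V, f):
    """Dynamic (switching) power in watts."""
    return k * C * V**2 * f

base   = dynamic_power(k=0.5, C=1e-9, V=1.2, f=1e9)   # hypothetical chip
scaled = dynamic_power(k=0.5, C=1e-9, V=0.9, f=1e9)   # reduce only the voltage
# Power falls with the square of V: (0.9/1.2)^2 = 0.5625,
# i.e., a 25% voltage reduction cuts dynamic power by almost half --
# which is why "reduce the voltage" is so effective, up to its limit.
```

Clock gating attacks the same formula from a different direction: it reduces the effective switching frequency f on the gated clock wires rather than V.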