1 EE 209 Multiplication Techniques 2 Digital System Design • End our semester revisiting our key concept: – In digital systems, algorithms can be implemented in hardware, software, or a combination of both Sensors Digital Inputs Outputs Clock Reset Custom Logic (ASIC or FPGA) Interconnect Microprocessors (Software executing on hardware) Interconnect Analog Inputs Analog to Digital Conversion (ADC) Digital Processing Digital to Analog Conversion (DAC) Analog Outputs Digital Outputs 3 Preparing for the Project • You will be designing a hardware engine for something "difficult/slow" to do in software • Then integrate that hardware engine with a processor core 4 Array Multiplier (Combinational) Add and Shift Method (Sequential) MULTIPLICATION TECHNIQUES 5 Unsigned Multiplication Review • Same rules as decimal multiplication • Multiply each bit of Q by M shifting as you go • An m-bit * n-bit mult. produces an m+n bit result (i.e. n-bit * n-bit produces 2*n bit result) • Notice each partial product is a shifted copy of M or 0 (zero) 1010 M (Multiplicand) * 1011 Q (Multiplier) 6 Unsigned Multiplication Review • Same rules as decimal multiplication • Multiply each bit of Q by M shifting as you go • An m-bit * n-bit mult. produces an m+n bit result (i.e. n-bit * n-bit produces 2*n bit result) • Notice each partial product is a shifted copy of M or 0 (zero) 1010 * 1011 1010 1010_ 0000__ + 1010___ 01101110 M (Multiplicand) Q (Multiplier) PP(Partial Products) P (Product) 7 Multiplication Techniques • A multiplier unit can be – Purely Combinational: Each partial product is produced in parallel and fed into an array of adders to generate the product – Sequential and Combinational: Produce and add 1 partial product at a time (per cycle) 8 Combinational Multiplier • Partial Product (PPi) Generation – Multiply Q[i] * M • if Q[i]=0 => PPi = 0 • if Q[i]=1 => PPi = M 9 Combinational Multiplier • Partial Product (PPi) Generation – Multiply Q[i] * M • if Q[i]=0 => PPi = 0 • if Q[i]=1 => PPi = M – AND gates can be used to generate each partial product M[3] M [2 ] 0 0 M[ 1 ] 0 0 M[0 ] 0 0 0 0 M[3] if… Q[ i]=0 1 M[3] M [2 ] 1 M[2] M[ 1 ] 1 M[1] M[0 ] 1 M[0] if… Q[ i]=1 10 Combinational Multiplier • Partial Products must be added together • Combinational multipliers suffer from long propagation delay through the adders – propagation delay is proportional to the number of partial products (i.e. number of bits of input) and the width of each adder 11 Adder Propagation Delay 1111 + 0001 X Y Co FA S X Ci 0 Y Co FA S X Ci 0 Y Co FA S X Ci 0 Y Co FA S Ci 0 12 Adder Propagation Delay 1111 + 0001 1 0 X 1 Y Co FA S 0 X Ci 0 1 Y Co FA S 0 X Ci 0 1 Y Co FA S 1 X Ci 0 Y Co FA S Ci 0 13 Adder Propagation Delay 1111 + 0001 1 0 X 0 1 Y Co FA 0 X Ci 0 1 Y Co FA 0 X Ci 0 1 Y Co FA 1 X Ci 1 Y Co FA S S S S 1 1 1 0 Ci 0 14 Adder Propagation Delay 1111 + 0001 1 0 X 0 1 Y Co FA 0 X Ci 0 1 Y Co FA 0 X Ci 1 1 Y Co FA 1 X Ci 1 Y Co FA S S S S 1 1 0 0 Ci 0 15 Adder Propagation Delay 1111 + 0001 1 0 X 0 1 Y Co FA 0 X Ci 1 1 Y Co FA 0 X Ci 1 1 Y Co FA 1 X Ci 1 Y Co FA S S S S 1 0 0 0 Ci 0 16 Adder Propagation Delay 1111 + 0001 1 0 X 1 1 Y Co FA 0 X Ci 1 1 Y Co FA 0 X Ci 1 1 Y Co FA 1 X Ci 1 Y Co FA S S S S 0 0 0 0 Ci 0 17 Critical Path • Critical Path = Longest possible delay path Assume tsum = 5 ns, tcarry= 4 ns X 16 ns Y Co FA S 17 ns X Ci 12 ns Y Co FA X Ci 8 ns Y Co FA X Ci 4 ns Y Co FA S S S 13 ns 9 ns 5 ns Ci Critical Path 18 Combinational Multiplier 19 Combinational Multiplier 20 Combinational Multiplier 21 Combinational Multiplier 22 Combinational Multiplier 23 Combinational Multiplier 24 Combinational Multiplier 25 Combinational Multiplier 26 Combinational Multiplier 27 Critical Paths Critical Path 1 Critical Path 2 28 Combinational Multiplier Analysis • Large Area due to (n-1) m-bit adders – n-1 because the first adder adds the first two partial products and then each adder afterwards adds one more partial product • Propagation delay is in two dimensions – proportional to m+n 29 Sequential Multiplier • Use 1 adder to add a single partial product per clock cycle keeping a running sum 30 Add and Shift Method • • • • • • Sequential algorithm n-bit * n-bit multiply Adds 1 partial product per clock Shift running sum 1-bit right each clock Three n-bit Registers, 1 Adder At start: – M = Multiplicand – Q = Multiplier – A = Answer => initialized to 0 • After completion – A and Q concatenate to form 2n-bit answer 31 Add and Shift Hardware 1010 = M * 1011 = Q C 0 A 0 0 Q 0 0 Cout 0 Cin 0 1010 M 1 0 1 1 32 Add and Shift Algorithm • C=0, A=0 • Repeat the following n-times – If Q[0] = 0, A = A+0 Else if Q[0] = 1, A= A+M – Shift right 1-bit (0→C→A→Q) 33 1010 * 1011 34 Add and Shift Multiplication 1010 = M * 1011 = Q 01101110 = Ans C 0 A 0 0 0 Q 0 1 0 1 1 C 0 Cout 0 Cin 0 1010 M M = 1010 A 0000 Q 1011 35 Add and Shift Multiplication 1010 C 0 1010 A 0 0 1010 * 1011 + 1010 1010 = M * 1011 = Q 01101110 = Ans 0 Q 0 1 0 1 1 0 Cout ADD Multiplicand 1010 0 Cin 0 1010 M C 0 M = 1010 A 0000 0 1010 Q 1011 1011 Add 36 Add and Shift Multiplication 1010 * 1011 + 1010 1010 1010 = M * 1011 = Q 01101110 = Ans C 0 A 1 0 1 Q 0 1 0 1 1 0 Cout Before Shift Right 0 Cin 0 1010 M C 0 M = 1010 A 0000 0 0 1010 0101 Q 1011 1011 0101 Add Shift 37 Add and Shift Multiplication 1010 * 1011 + 1010 1010 1010 = M * 1011 = Q 01101110 = Ans C 0 A 0 1 0 Q 1 0 1 0 1 1st bit of Product 0 Cout After Shift Right 0 Cin 0 1010 M C 0 M = 1010 A 0000 0 0 1010 0101 Q 1011 1011 0101 Add Shift 38 Add and Shift Multiplication 1111 C 0 A 0 1010 * 1011 1010 1010 + 1010011110 1010 = M * 1011 = Q 01101110 = Ans 1 0 Q 1 0 1 0 1 0 Cout ADD Multiplicand 1111 0 Cin 0 1010 M C 0 M = 1010 A 0000 0 0 1010 0101 1011 0101 Add Shift 0 1111 0101 Add Q 1011 39 Add and Shift Multiplication 1010 * 1011 1010 1010 + 1010011110 1010 = M * 1011 = Q 01101110 = Ans C 0 A 1 1 1 Q 1 0 1 0 1 0 Cout Before Shift Right 0 Cin 0 1010 M C 0 M = 1010 A 0000 0 0 1010 0101 1011 0101 Add Shift 0 0 1111 0111 0101 1010 Add Shift Q 1011 40 Add and Shift Multiplication 1010 * 1011 1010 1010 + 1010011110 1010 = M * 1011 = Q 01101110 = Ans C 0 A 0 1 1 Q 1 1 0 1 0 2nd bit of Product 0 Cout After Shift Right 0 Cin 0 1010 M C 0 M = 1010 A 0000 0 0 1010 0101 1011 0101 Add Shift 0 0 1111 0111 0101 1010 Add Shift Q 1011 41 Add and Shift Multiplication 0111 C 0 A 0 1010 * 1011 + 1010 1010 + 1010011110 + 0000-0011110 1010 = M * 1011 = Q 01101110 = Ans 1 1 Q 1 1 0 1 0 0 Cout ADD Zero 0111 0 Cin 0 1010 M C 0 M = 1010 A 0000 0 0 1010 0101 1011 0101 Add Shift 0 0 1111 0111 0101 1010 Add Shift 0 0111 1010 No Add Q 1011 42 Add and Shift Multiplication 1010 * 1011 + 1010 1010 + 1010011110 + 0000-0011110 1010 = M * 1011 = Q 01101110 = Ans C 0 A 0 1 1 Q 1 1 0 1 0 0 Cout Before Shift Right 0 Cin 0 1010 M C 0 M = 1010 A 0000 0 0 1010 0101 1011 0101 Add Shift 0 0 1111 0111 0101 1010 Add Shift 0 0 0111 0011 1010 1101 No Add Shift Q 1011 43 Add and Shift Multiplication 1010 * 1011 + 1010 1010 + 1010011110 + 0000-0011110 1010 = M * 1011 = Q 01101110 = Ans C 0 A 0 0 1 Q 1 1 1 0 1 3rd bit of Product 0 Cout After Shift Right 0 Cin 0 1010 M C 0 M = 1010 A 0000 0 0 1010 0101 1011 0101 Add Shift 0 0 1111 0111 0101 1010 Add Shift 0 0 0111 0011 1010 1101 No Add Shift Q 1011 44 Add and Shift Multiplication 1101 C 0 A 0 0 1010 * 1011 + 1010 1010 + 1010011110 + 0000-0011110 + 1010--01101110 1010 = M * 1011 = Q 01101110 = Ans 1 Q 1 1 1 0 1 0 Cout ADD Multiplicand 1101 0 Cin 0 1010 M C 0 M = 1010 A 0000 0 0 1010 0101 1011 0101 Add Shift 0 0 1111 0111 0101 1010 Add Shift 0 0 0111 0011 1010 1101 No Add Shift 0 1101 1101 Add Q 1011 45 Add and Shift Multiplication 1010 * 1011 + 1010 1010 + 1010011110 + 0000-0011110 + 1010--01101110 1010 = M * 1011 = Q 01101110 = Ans C 0 A 1 1 0 Q 1 1 1 0 1 0 Cout Before Shift Right 0 Cin 0 1010 M C 0 M = 1010 A 0000 0 0 1010 0101 1011 0101 Add Shift 0 0 1111 0111 0101 1010 Add Shift 0 0 0111 0011 1010 1101 No Add Shift 0 0 1101 0110 1101 1110 Add Shift Q 1011 46 Add and Shift Multiplication 1010 * 1011 + 1010 1010 + 1010011110 + 0000-0011110 + 1010--01101110 1010 = M * 1011 = Q 01101110 = Ans C 0 A 0 1 1 Q 0 1 1 1 0 Final Product 0 Cout After Shift Right 0 Cin 0 1010 M C 0 M = 1010 A 0000 0 0 1010 0101 1011 0101 Add Shift 0 0 1111 0111 0101 1010 Add Shift 0 0 0111 0011 1010 1101 No Add Shift 0 0 1101 0110 1101 1110 Add Shift Q 1011 47 Add and Shift Multiplication 1010 * 1011 + 1010 1010 + 1010011110 + 0000-0011110 + 1010--01101110 1010 = M * 1011 = Q 01101110 = Ans C 0 A 0 1 1 Q 0 1 1 1 0 Final Product 0 Cout Finished 0 Cin 0 1010 M C 0 M = 1010 A 0000 0 0 1010 0101 1011 0101 Add Shift 0 0 1111 0111 0101 1010 Add Shift 0 0 0111 0011 1010 1101 No Add Shift 0 0 1101 0110 1101 1110 Add Shift 0110 1110 = 11010 Q 1011 48 1101 * 0101 Example C 0 A 0 0 Q 0 0 0 1 0 1 C=0 M=1101 A=0000 0 1101 Q=0101 Description 0101 A=A+M Shift Right C,A,Q A=A+0 Cout Shift Right C,A,Q 0 Cin A=A+M Shift Right C,A,Q A=A+0 0 1101 M Shift Right C,A,Q 49 1101 * 0101 Example C 0 A 0 0 Q 0 0 Cout 0 Cin 0 1101 M 0 1 0 1 C=0 M=1101 A=0000 0 1101 0101 A=A+M 0 0110 1010 Shift Right C,A,Q 0 0110 1010 A=A+0 0 0011 0101 Shift Right C,A,Q 1 0000 0101 A=A+M 0 1000 0010 Shift Right C,A,Q 0 1000 0010 A=A+0 0 0100 0001 Shift Right C,A,Q Q=0101 Description 50 Sequential Multiplier Analysis • Pros: – Smaller Area due to the use of only 1 adder • Cons: – Slow to execute (2 cycles per bit of the multiplier) 51 Let's Practice our Design Skills • Break design into control and datapath – This is the datapath – 1 Adder – 2-to-1 mux – 2 shift registers (A/Q) – 1 normal reg (M) – 1 FF w/ Enable (C) C 0 A 0 0 Q 0 0 Cout 0 Cin 0 1010 M 1 0 1 1 52 State Machine Control • From our high level datapath we can arrive at a high-level state diagram On Reset (power on) START ADD WAIT If q[0] = 0 C,A = A+0 If q[0] = 1 C,A = A+M C=A=0 M=Min,Q=Qin CNT=0 START SHIFT C→A→Q CNT=CNT+1 MAXCNT MAXCNT START DONE DONE=1 START 53 Refining our Design • But now we need to refine our design to actual components, specific control bits, etc. 54 Sample Shift Register • Shift registers come in many flavors, we'll just look at one component we can use from Xilinx: SR4CLED • 4-bit Bi-directional Shift Register – ACLR: asynchronous reset – LD: Load/data enable • Allows usage as register w/ enable – CE: Must be 1 to shift, 0 means hold – Left: 0 = Right, 1 = Left – DSL and DSR • Data to shift in from left or right DSR D3 D2 D1 D0 DSL LD ACLR 4-bit Bidirectional CE Shift Register CLK LEFT Q3 Q2 Q1 Q0 ACLR CLK LD CE LEFT Q*[3:0] (case) 1 X X X X 0000 Reset 0 0,1 X X X Q[3:0] 0 ↑ 1 X X D[0:3] Load 0 ↑ 0 0 X Q[3:0] Hold 0 ↑ 0 1 0 DSR,Q[3:1] Right 0 ↑ 0 1 1 Q[2:0],DSL Left Xilinx: SR4CLED 55 Shift Registers CLK ACLR Hold Hold Load Right Hold Right Left Left LD CE LEFT DSR DSL D[3:0] Q[3:0] 1011 1111 0000 1011 0101 1010 0100 1001 56 Complete the DataPath Assume you build the state machine below and produce 4-signals that tell us which state we are in: • Qwait • Qadd • Qsh • Qdone DSR ACLR CLK SET DSL LD Shift Reg. DSR Shift Reg. CLK LEFT DSL LD D[3:0] ACLR CE Q[3:0] 1 Q D[3:0] CE LEFT Q[3:0] D EN CNTR Q0 Q1 ACLR Q2 EN CLR C4 A[3:0] S[3:0] B[3:0] Adder C0 S0 0000 Y 1 Q[3:0] D[3:0] EN /AR CLK 57 Complete the DataPath Qin[3:0] S[3:0] DSR 0 ACLR CLK CLK SET Shift Reg. DSL QAdd LD QSh CE LEFT 0 DSR 0 CLK Q[3:0] 1 Q D[3:0] D EN CLR DSL QWait LD QSh CE D[3:0] ACLR Shift Reg. CLK LEFT 0 Q[3:0] A[0] QAdd CLK A[3:0] CNTR Q0 Q1 ACLR Q2 EN Q[0] QWait C4 A[3:0] S[3:0] S[3:0] PP[3:0] B[3:0] Adder C0 0 S0 0000 Y 1 Q[3:0] D[3:0] EN /AR CLK Min[3:0] QWait 1 CLK MAXCNT 58 PROJECT 59 Square Root Finder • Find the square root of, x, without using sqrt function… • Pick a number, square it and see if it is equal to x • Use a binary search to narrow down the value you pick to square 1.87 Sqrt(5) 0 1.25 2.5 4 5 +∞ 60 Square Root Finder • Given a 16-bit integer, X, in the range [0-64,000], find the square root – Note: The square root of 0-64,000 lies in the range [0-256] – We start our low and high bounds at 0 and 256, pick the midpoint, square it and see how it compares to x – We then adjust either the low or high guess 61 Algorithm Pseudocode int sqrt(int x) // x is 16-bits, return val is 8 { int hi=256, lo=1; // 8-bit values while(hi-lo > 1){ int guess = (hi+lo)/2; if(guess*guess > x) hi = guess; else lo = guess; } return lo; } 62 Practice The Algorithm X=38 X=64 Lo Hi Lo Hi 1 256 1 256 1 128 1 128 1 64 1 64 1 32 1 32 1 16 1 16 1 8 8 16 4 8 8 12 6 8 8 10 6 7 8 9 63 Sqrt Datapath ? hi + ^2 ÷2 Load_Hi > Load_Lo x ? lo = 1 Done