Applications of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review

Ref: Stanley A. White, “Applications of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review,” IEEE ASSP Magazine, July 1989

VSP Lecture 11 Distributed Arithmetic (cwliu@twins.ee.nctu.edu.tw)

Distributed Arithmetic (DA, 1974)
• The most frequently encountered form of computation in DSP:
– Sum of products
– Inner product
– Executed most efficiently by DA

Derivation of DA Technique
• Sum of products:
$$y = \sum_{k=1}^{K} A_k x_k \qquad (1)$$
– where $x_k$ is a 2's-complement binary number scaled such that $|x_k| < 1$, and the $A_k$ are fixed coefficients
• $x_k$: $\{b_{k0}, b_{k1}, b_{k2}, \ldots, b_{k(N-1)}\}$, wordlength = $N$
– where $b_{k0}$ is the sign bit
– Express each $x_k$ as:
$$x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \qquad (2)$$
• Substituting (2) into (1) gives
$$y = \sum_{k=1}^{K} A_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right] \qquad (3)$$

Derivation of DA Technique - continued I
• Critical step: interchange the order of the two summations
$$y = \sum_{k=1}^{K} A_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right] = \sum_{k=1}^{K} A_k \sum_{n=1}^{N-1} b_{kn} 2^{-n} + \sum_{k=1}^{K} A_k (-b_{k0})$$
$$\phantom{y} = \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n} + \sum_{k=1}^{K} A_k (-b_{k0}) \qquad (4)$$
• where $K$ is the number of inputs (or taps) and $N$ is the wordlength of the data

Derivation of DA Technique - continued II
• Now consider equation (4):
$$y = \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n} + \sum_{k=1}^{K} A_k (-b_{k0})$$
– $\left[ \sum_{k=1}^{K} A_k b_{kn} \right]$ has only $2^K$ possible values
– $\sum_{k=1}^{K} A_k (-b_{k0})$ has only $2^K$ possible values
– We can store both in a lookup table (ROM): size = $2 \times 2^K$ words

Technical Overview of DA
• Advantage of DA: efficiency of computing mechanization
• A frequently raised objection:
– Slowness because of its inherent bit-serial nature (this is not true)
• Modifications can increase the speed, at the cost of either:
– more arithmetic operations, or
– exponentially increased memory

Derivation of DA Technique - continued III
• Example
– Let the number of inputs be $K = 4$
– The fixed coefficients are $A_1 = 0.72$, $A_2 = -0.3$, $A_3 = 0.95$, $A_4 = 0.11$
– With Eq. (4), we need a $2 \times 2^K = 32$-word ROM ($K = 4$)

Example
• Unfolding
$$\left[ \sum_{k=1}^{4} A_k b_{kn} \right] = A_1 b_{1n} + A_2 b_{2n} + A_3 b_{3n} + A_4 b_{4n}$$
$$\sum_{k=1}^{4} A_k (-b_{k0}) = A_1 (-b_{10}) + A_2 (-b_{20}) + A_3 (-b_{30}) + A_4 (-b_{40})$$

Example - continued I
• Hardware architecture
[Figure: hardware architecture implementing Eq. (4), with inputs x1, x2, x3, x4]

Example - continued II
• Shorten the table
$$\left[ \sum_{k=1}^{4} A_k b_{kn} \right] = A_1 b_{1n} + A_2 b_{2n} + A_3 b_{3n} + A_4 b_{4n}$$
$$\sum_{k=1}^{4} A_k (-b_{k0}) = -\left[ \sum_{k=1}^{4} A_k b_{k0} \right]$$
– Eq. (4) becomes
$$y = \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n} - \sum_{k=1}^{K} A_k b_{k0} \qquad (5)$$
– The sign-bit term now uses the same table as the other bits, with its output subtracted instead of added, so only 16 words of ROM are required.
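The 16-word table of Eq. (5) can be checked numerically. The following is a minimal software sketch, not the hardware architecture from the slides: the helper names (`build_rom`, `to_bits`, `da_inner_product`) and the integer encoding of the inputs (x_k = q_k / 2^(N-1)) are assumptions made for illustration. It walks the bit slices from the sign bit downward, uses each slice as a ROM address, and subtracts the sign-bit slice as Eq. (5) requires.

```python
# Coefficients from the K = 4 example in the slides.
A = [0.72, -0.3, 0.95, 0.11]

def build_rom(coeffs):
    """ROM[addr] = sum of A_k for every k whose bit is set in addr.

    Bit k (0-indexed) of the address holds b_{(k+1)n}, so the table has
    2**K entries (16 words for K = 4).
    """
    return [sum(a for k, a in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]

def to_bits(q, n_bits):
    """Two's-complement bits of integer q, sign bit (b_k0) first."""
    u = q & ((1 << n_bits) - 1)
    return [(u >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]

def da_inner_product(rom, xs_q, n_bits):
    """Evaluate Eq. (5) one bit slice at a time.

    xs_q are integers (hypothetical encoding): x_k = q_k / 2**(n_bits - 1).
    """
    bits = [to_bits(q, n_bits) for q in xs_q]
    y = 0.0
    for n in range(n_bits):
        # Form the ROM address from bit n of every input word.
        addr = sum(bits[k][n] << k for k in range(len(xs_q)))
        if n == 0:
            y -= rom[addr]               # sign-bit slice is subtracted
        else:
            y += rom[addr] * 2.0 ** (-n)  # data-bit slice, weight 2^-n
    return y

rom = build_rom(A)
xs_q = [100, -37, -127, 5]               # 8-bit inputs, x_k = q_k / 128
y_da = da_inner_product(rom, xs_q, 8)
y_direct = sum(a * q / 128 for a, q in zip(A, xs_q))
```

Here `y_da` and `y_direct` agree up to floating-point rounding. A hardware realization would typically process the least significant bit first with a shift-accumulator; the sketch processes the sign bit first only because that ordering is easier to read against Eq. (5).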
Offset-Binary Coding (OBC)
• Change the input data from “binary” to “signed-digit”
• With $x_k = \{b_{k0}, b_{k1}, \ldots, b_{k(N-1)}\}$, write
$$x_k = \frac{1}{2} \left[ x_k - (-x_k) \right] \qquad (6)$$
• In 2's complement:
$$x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n}$$
$$-x_k = -\bar{b}_{k0} + \sum_{n=1}^{N-1} \bar{b}_{kn} 2^{-n} + 2^{-(N-1)}$$
• Therefore
$$x_k = \frac{1}{2} \left[ -\left( b_{k0} - \bar{b}_{k0} \right) + \sum_{n=1}^{N-1} \left( b_{kn} - \bar{b}_{kn} \right) 2^{-n} - 2^{-(N-1)} \right] \qquad (7)$$

Offset-Binary Coding (OBC) (cont'd)
• Starting from Eq. (7), define
$$c_{kn} = \begin{cases} b_{kn} - \bar{b}_{kn}, & n \neq 0 \\ -\left( b_{kn} - \bar{b}_{kn} \right), & n = 0 \end{cases}$$
• Then
$$x_k = \frac{1}{2} \left[ \sum_{n=0}^{N-1} c_{kn} 2^{-n} - 2^{-(N-1)} \right], \quad \text{where } c_{kn} \in \{-1, 1\}$$
• Substituting into $y = \sum_{k=1}^{K} A_k x_k$:
$$y = \frac{1}{2} \sum_{k=1}^{K} A_k \left[ \sum_{n=0}^{N-1} c_{kn} 2^{-n} - 2^{-(N-1)} \right] = \sum_{n=0}^{N-1} Q(b_n) 2^{-n} + 2^{-(N-1)} Q(0)$$
• where
$$Q(b_n) = \sum_{k=1}^{K} \frac{A_k}{2} c_{kn} \quad \text{and} \quad Q(0) = -\sum_{k=1}^{K} \frac{A_k}{2} \ \text{(a constant)}$$

Offset-Binary Coding (OBC) (cont'd)
• Hardware architecture
[Figure: OBC hardware architecture with inputs b1n, b2n, b3n, b4n, computing $y = \sum_{n=0}^{N-1} Q(b_n) 2^{-n} + 2^{-(N-1)} Q(0)$]

Speed-up of DA Multiplication
• Way I: more arithmetic operations
$$y = \sum_{n=0}^{N-1} Q(b_n) 2^{-n} + 2^{-(N-1)} Q(0)$$
– The term $2^{-(N-1)} Q(0)$ is the initial condition
– The sum is split into an even part (even $n$) and an odd part (odd $n$):
$$\sum_{n=0}^{N-1} Q(b_n) 2^{-n} = Q(b_0) 2^{-0} + Q(b_1) 2^{-1} + \cdots + Q(b_{N-2}) 2^{-(N-2)} + Q(b_{N-1}) 2^{-(N-1)}$$

Speed-up of DA Multiplication
• Way I: at the expense of linearly increased memory and arithmetic operations
[Figure: Way I architecture; labels: odd part, even part, sign, initial condition 1/2*Q(0)]
$$y = \sum_{n=0}^{N-1} Q(b_n) 2^{-n} + 2^{-(N-1)} Q(0)$$

Speed-up of DA Multiplication
• Way II: at the expense of exponentially increased memory
• ROM: 2*7 words 1*128 words

Conclusions
• DA is a very efficient mechanism for computations that are dominated by inner products (convolution).
• It is a good way to trade combinational logic for memory in high-performance computation.
• When many computing methods are compared, DA should be considered. It is not always (but often) the best choice and is never a poor one: it saves around 50% to 80% of the gate count.
• Application: “VLSI implementation of a 16*16 discrete cosine transform,” by M.-T. Sun, T.-C. Chen, A. M. Gottlieb, IEEE Transactions on Circuits and Systems, vol. 36, no. 4, April 1989, pp. 610–617, and many other transforms and DSP kernels.
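Returning to the offset-binary coding slides, the identity $y = \sum_{n=0}^{N-1} Q(b_n) 2^{-n} + 2^{-(N-1)} Q(0)$ can also be verified in a few lines. This is a sketch under stated assumptions rather than the lecture's implementation: the function names (`q_table`, `obc_inner_product`) and the integer input encoding (x_k = q_k / 2^(N-1)) are hypothetical, and the coefficients are those of the K = 4 example.

```python
# Coefficients from the K = 4 example in the slides.
A = [0.72, -0.3, 0.95, 0.11]

def to_bits(q, n_bits):
    """Two's-complement bits of integer q, sign bit (b_k0) first."""
    u = q & ((1 << n_bits) - 1)
    return [(u >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]

def q_table(coeffs):
    """Q[addr] = sum_k (A_k / 2) * c_k, with c_k = +1 if bit k of addr is 1, else -1.

    This is Q(b_n) for n != 0; the n = 0 slice uses the negated entry.
    """
    return [sum(a / 2 * (1 if (addr >> k) & 1 else -1)
                for k, a in enumerate(coeffs))
            for addr in range(1 << len(coeffs))]

def obc_inner_product(coeffs, xs_q, n_bits):
    """Evaluate y = sum_n Q(b_n) 2^-n + 2^-(N-1) Q(0).

    xs_q are integers (hypothetical encoding): x_k = q_k / 2**(n_bits - 1).
    """
    table = q_table(coeffs)
    q0 = -sum(coeffs) / 2                      # Q(0), a constant
    bits = [to_bits(q, n_bits) for q in xs_q]
    y = 2.0 ** (-(n_bits - 1)) * q0            # initial condition
    for n in range(n_bits):
        addr = sum(bits[k][n] << k for k in range(len(xs_q)))
        term = table[addr]
        if n == 0:
            term = -term                       # c_k0 carries the extra minus sign
        y += term * 2.0 ** (-n)
    return y

xs_q = [100, -37, -127, 5]                     # 8-bit inputs, x_k = q_k / 128
y_obc = obc_inner_product(A, xs_q, 8)
y_direct = sum(a * q / 128 for a, q in zip(A, xs_q))
```

The two results agree up to floating-point rounding. In this coding the table entries come in negated pairs: the entry at a complemented address is the negative of the original entry, which is the property that allows the stored table to be smaller than in the plain binary case.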