Fast Convolution

$$
\begin{aligned}
y(n) &= a_0 x(n) + a_1 x(n-1) + \cdots + a_{k-1} x(n-k+1) && [0] \\
y(n+1) &= a_0 x(n+1) + a_1 x(n) + \cdots + a_{k-1} x(n-k+2) && [1] \\
y(n+2) &= a_0 x(n+2) + a_1 x(n+1) + \cdots + a_{k-1} x(n-k+3) && [2] \\
&\;\;\vdots \\
y(n+k-1) &= a_0 x(n+k-1) + a_1 x(n+k-2) + \cdots + a_{k-1} x(n) && [k-1]
\end{aligned}
$$

Evaluating [0] by itself: how many multipliers? ($k$, one per term.)
Evaluating [0] through [k-1] directly requires $k^2$ multipliers.
Somewhat surprising: we can do "better".
FFT-based: $O(k \log_2 k)$, but it comes with lots of hidden costs: wire layout, large constant, numerical stability.

FFT is a fast way of computing the DFT.
DFT: frequency-domain view of signals and systems.
For understanding fast convolution we can IGNORE the frequency-domain interpretation of the DFT.
Instead we will think about the DFT/FFT in terms of polynomials.

Polynomials

$$A(x) = \sum_{j=0}^{n-1} a_j x^j, \qquad B(x) = \sum_{j=0}^{n-1} b_j x^j$$

$a_j$ real or complex; degree bound $n$; the degree is the largest $k$ such that $a_k \neq 0$.
Can add and multiply polynomials.
Add: $A(x) + B(x) = C(x)$, where $c_j = a_j + b_j$.
Multiply: given $A(x)$, $B(x)$ of degree bound $n$, their product $C(x)$ is the polynomial of degree bound $2n-1$ such that $C(x) = A(x) \cdot B(x)$ for all $x$.
Multiply terms, then combine terms with equal powers.

Example: $A(x) = 6x^3 + 7x^2 - 10x + 9$, $B(x) = -2x^3 + 4x - 5$

$$
\begin{array}{rrrrrrr}
 & & & 6x^3 & +\,7x^2 & -\,10x & +\,9 \\
 & & & -2x^3 & & +\,4x & -\,5 \\ \hline
 & & & -30x^3 & -\,35x^2 & +\,50x & -\,45 \\
 & & 24x^4 & +\,28x^3 & -\,40x^2 & +\,36x & \\
-12x^6 & -\,14x^5 & +\,20x^4 & -\,18x^3 & & & \\ \hline
-12x^6 & -\,14x^5 & +\,44x^4 & -\,20x^3 & -\,75x^2 & +\,86x & -\,45
\end{array}
$$

Another way:

$$C(x) = \sum_{j=0}^{2n-2} c_j x^j \quad \text{where} \quad c_j = \sum_{k=0}^{j} a_k b_{j-k} \qquad [1]$$

Relationship between polynomial multiplication and convolution

$$
\begin{aligned}
c(2n-2) &= a(n-1)\,b(n-1) \\
c(2n-3) &= a(n-2)\,b(n-1) + a(n-1)\,b(n-2) \\
&\;\;\vdots \\
c(1) &= a(0)\,b(1) + a(1)\,b(0) \\
c(0) &= a(0)\,b(0)
\end{aligned}
$$

Viewing $a(0), a(1), \dots, a(n-1)$ and $b(0), b(1), \dots, b(n-1)$ as coefficients of degree-bound-$n$ polynomials, the coefficient vector $c$ given by equation [1] is the convolution of $a$ and $b$, denoted $a \otimes b$.

How to perform convolution/polynomial multiplication "faster"?
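As a baseline for that question, equation [1] can be coded directly. The sketch below is illustrative only: the function name `poly_multiply_naive` is made up here, coefficients are stored lowest power first, and the sample polynomials are the worked example with its signs taken as $A(x) = 6x^3 + 7x^2 - 10x + 9$ and $B(x) = -2x^3 + 4x - 5$.

```python
def poly_multiply_naive(a, b):
    """Direct convolution c[j] = sum_k a[k] * b[j-k], as in equation [1].

    a, b: coefficient lists, lowest power first (a[j] multiplies x**j).
    Uses len(a) * len(b) multiplications, i.e. Theta(n^2) for equal lengths.
    """
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj  # x**i * x**j contributes to x**(i+j)
    return c


# Worked example, lowest power first (hypothetical variable names).
A = [9, -10, 7, 6]    # 6x^3 + 7x^2 - 10x + 9
B = [-5, 4, 0, -2]    # -2x^3 + 4x - 5
C = poly_multiply_naive(A, B)
print(C)
```

Every output coefficient is a sum of products, so the nested loop makes the $k^2$ multiplier count from the start of the section explicit.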
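Use a "point-wise" representation at some very specific points. Works for both software and hardware.

Coefficient representation

The coefficient representation of $A(x) = \sum_{j=0}^{n-1} a_j x^j$ is just $(a(0), a(1), \dots, a(n-1))$; we will view it as a vector and do matrix multiplication.
How to evaluate? $\Theta(n^2)$ if each power is plugged in separately; $\Theta(n)$ using Horner's rule:

$$A(x_0) = a_0 + x_0\bigl(a_1 + x_0\bigl(a_2 + \cdots + x_0(a_{n-2} + x_0 a_{n-1}) \cdots \bigr)\bigr)$$

In coefficient form, adding is $\Theta(n)$; multiplying is $\Theta(n^2)$ using equation [1].

Alternate representation: "point-value representation"

A point-value representation of $A(x)$ of degree bound $n$ is a set $\{(x_0, y_0), (x_1, y_1), \dots, (x_{n-1}, y_{n-1})\}$ with distinct $x_k$ such that $y_k = A(x_k)$ for $k = 0, 1, \dots, n-1$.
Many different representations exist, one per choice of points [some are better than others].
Easy to derive a point-value representation from the coefficient form: select $x_0, x_1, \dots, x_{n-1}$ and evaluate at each using Horner's rule, $\Theta(n^2)$ in total. Later: $\Theta(n \log_2 n)$.

Getting coefficients from the point-value representation: "interpolation"

Theorem: For any set $\{(x_0, y_0), (x_1, y_1), \dots, (x_{n-1}, y_{n-1})\}$ of $n$ point-value pairs with distinct $x_k$, there is a unique polynomial $A(x)$ of degree bound $n$ such that $y_k = A(x_k)$ for $k = 0, 1, \dots, n-1$.

Proof idea:

$$
\begin{pmatrix}
1 & x_0 & x_0^2 & \cdots & x_0^{n-1} \\
1 & x_1 & x_1^2 & \cdots & x_1^{n-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n-1} & x_{n-1}^2 & \cdots & x_{n-1}^{n-1}
\end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_{n-1} \end{pmatrix}
=
\begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_{n-1} \end{pmatrix}
$$

The matrix on the left, denoted $V(x_0, x_1, \dots, x_{n-1})$, has determinant

$$\prod_{0 \le j < k \le n-1} (x_k - x_j),$$

which is nonzero when the $x_k$ are distinct; hence the matrix is invertible, implying the existence of a unique solution.

Incidentally, this gives an algorithm for going from point-value representation to coefficient representation: $\Theta(n^3)$ to solve $n$ equations in $n$ unknowns.
A faster approach is Lagrange's formula

$$A(x) = \sum_{k=0}^{n-1} y_k \, \frac{\prod_{j \neq k} (x - x_j)}{\prod_{j \neq k} (x_k - x_j)},$$

from which the coefficients can be computed in $\Theta(n^2)$.

$n$-point evaluation and interpolation are well-defined inverse operations, each taking $\Theta(n^2)$ time [very bad numerically].

Point-value representation is very convenient for many operations.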
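The round trip between the two representations can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the names `horner` and `interpolate` are invented for the example, exact `Fraction` arithmetic is used to sidestep the numerical problems noted above, and the $\Theta(n^2)$ interpolation divides the product $\prod_j (x - x_j)$ by each $(x - x_k)$ in turn, which is one of several ways to apply Lagrange's formula.

```python
from fractions import Fraction


def horner(a, x):
    """Evaluate A(x) = sum a[j] * x**j with Theta(n) multiply-adds."""
    result = 0
    for coeff in reversed(a):
        result = result * x + coeff
    return result


def interpolate(points):
    """Recover coefficients (lowest power first) from n point-value pairs.

    Assumes the x values are distinct. Runs in Theta(n^2) using exact fractions.
    """
    n = len(points)
    # master(x) = product over all j of (x - x_j), built one factor at a time.
    master = [Fraction(1)]
    for xj, _ in points:
        nxt = [Fraction(0)] * (len(master) + 1)
        for i, c in enumerate(master):
            nxt[i + 1] += c       # c * x
            nxt[i] -= c * xj      # c * (-x_j)
        master = nxt
    coeffs = [Fraction(0)] * n
    for xk, yk in points:
        # Synthetic division: q(x) = master(x) / (x - x_k).
        q = [Fraction(0)] * n
        carry = Fraction(0)
        for i in range(n, 0, -1):
            carry = master[i] + carry * xk
            q[i - 1] = carry
        denom = horner(q, xk)     # equals product over j != k of (x_k - x_j)
        scale = Fraction(yk) / denom
        for i in range(n):
            coeffs[i] += scale * q[i]
    return coeffs


a = [9, -10, 7, 6]                          # 6x^3 + 7x^2 - 10x + 9
points = [(x, horner(a, x)) for x in range(4)]
recovered = interpolate(points)
```

With integer sample points the recovered coefficients come back exactly; with floating-point inputs the same procedure can lose accuracy quickly.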
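Addition: $C(x) = A(x) + B(x)$
Given $A$ as $\{(x_0, y_0), (x_1, y_1), \dots, (x_{n-1}, y_{n-1})\}$ and $B$ as $\{(x_0, y_0'), (x_1, y_1'), \dots, (x_{n-1}, y_{n-1}')\}$, the point-value representation of $C$ is $\{(x_0, y_0 + y_0'), (x_1, y_1 + y_1'), \dots, (x_{n-1}, y_{n-1} + y_{n-1}')\}$.
$\Theta(n)$ [both point-value representations must be over the same $n$ points].

Multiplication: $C(x_k) = A(x_k) \cdot B(x_k)$
Problem: the degree bound of $C$ is the sum of the degree bounds of $A$ and $B$, so $n$ points are not enough. Use "extended" point-value representations of $A$ and $B$ with $2n$ points each.

How to evaluate a polynomial given in point-value representation at a new point? The best known approach is to convert to coefficient form first.

Fast multiplication of polynomials in coefficient form
Can exploit the $\Theta(n)$ algorithm for polynomial multiplication in point-value form.
This hinges on being able to go from coefficient form to point-value representation and then from point-value representation back to coefficient form.
We already have $\Theta(n^2)$ for these problems; choosing the evaluation points carefully lets us do both in $\Theta(n \log_2 n)$ time.
Use the "complex roots of unity" as evaluation points to get the point-value representation (the DFT of the coefficient vector); use the inverse DFT to interpolate.
Small detail: degree bounds. Zero-pad the $A$ and $B$ coefficient vectors with $n$ zeros each.

[Figure: graphical view]

Complex roots of unity

$\omega$ is a complex $n$-th root of unity if $\omega^n = 1$. There are exactly $n$ distinct $n$-th roots of unity: $e^{2\pi i k / n}$ for $k = 0, 1, \dots, n-1$.
Interpretation: $e^{iu} = \cos u + i \sin u$.
$\omega_n = e^{2\pi i / n}$ is the principal $n$-th root of unity. The others are powers of $\omega_n$.
$\{\omega_n^0, \omega_n^1, \dots, \omega_n^{n-1}\}$ is closed under multiplication: $\omega_n^j \omega_n^k = \omega_n^{(j+k) \bmod n}$.
Multiplicative inverses exist: $\omega_n^{-1} = \omega_n^{n-1}$.

More properties

Cancellation Lemma: for $d > 0$, $\omega_{dn}^{dk} = \omega_n^k$.
[LHS $= \bigl(e^{2\pi i / (dn)}\bigr)^{dk} = \bigl(e^{2\pi i / n}\bigr)^{k} = \omega_n^k$]
Corollary: if $n > 0$ is even, $\omega_n^{n/2} = \omega_2 = -1$.

Halving Lemma: if $n > 0$ is even, then the squares of the $n$ complex $n$-th roots of unity are the $n/2$ complex $(n/2)$-th roots of unity.
$(\omega_n^k)^2 = \omega_n^{2k} = \omega_{n/2}^k$
$(\omega_n^{k+n/2})^2 = \omega_n^{2k+n} = \omega_n^{2k}\,\omega_n^{n} = \omega_n^{2k} = (\omega_n^k)^2$
i.e. $\omega_n^k$ and $\omega_n^{k+n/2}$ have the same square.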
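The halving lemma is easy to check numerically. The snippet below is a small sanity check rather than a proof: `roots_of_unity` and `squares_match_half_roots` are names made up for this sketch, and the tolerance `tol` is an arbitrary choice to absorb floating-point rounding.

```python
import cmath


def roots_of_unity(n):
    """Return the n complex n-th roots of unity, omega_n**k for k = 0..n-1."""
    return [cmath.exp(2j * cmath.pi * k / n) for k in range(n)]


def squares_match_half_roots(n, tol=1e-9):
    """Check that (omega_n**k)**2 equals omega_{n/2}**(k mod n/2) for even n."""
    roots = roots_of_unity(n)
    half = roots_of_unity(n // 2)
    return all(
        abs(roots[k] ** 2 - half[k % (n // 2)]) < tol
        for k in range(n)
    )


print(squares_match_half_roots(8))
# The corollary omega_n**(n/2) = -1 for even n can be checked the same way:
print(abs(roots_of_unity(8)[4] + 1) < 1e-9)
```

Squaring the 8 roots yields only 4 distinct values, each hit twice, which is exactly why the recursive subproblems shrink by half.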
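Key to divide and conquer [recursive subproblems are only half as large].

Summation Lemma: for any $n \ge 1$ and $k > 0$ such that $k$ is not divisible by $n$,

$$\sum_{j=0}^{n-1} \bigl(\omega_n^k\bigr)^j = 0$$

[Geometric series formula:

$$\sum_{j=0}^{n-1} \bigl(\omega_n^k\bigr)^j = \frac{(\omega_n^k)^n - 1}{\omega_n^k - 1} = \frac{(\omega_n^n)^k - 1}{\omega_n^k - 1} = \frac{1^k - 1}{\omega_n^k - 1} = 0$$

The denominator is nonzero because $k$ is not divisible by $n$.]

The DFT

$A(x) = \sum_{j=0}^{n-1} a_j x^j$, given in coefficient form.
Evaluate $A(x)$ at $\omega_n^0, \omega_n^1, \dots, \omega_n^{n-1}$ [i.e. the $n$ complex $n$-th roots of unity], assuming $A$ has been appropriately zero-padded; also assume $n$ is a power of 2 [more zero padding if needed].

$$Y_k = A(\omega_n^k) = \sum_{j=0}^{n-1} a_j\, \omega_n^{kj}$$

$Y = [Y_0, Y_1, \dots, Y_{n-1}]$ is defined to be the DFT of $a$; we will write $Y = \mathrm{DFT}_n(a)$.

The FFT

Takes advantage of the special properties of the complex roots of unity to compute $\mathrm{DFT}_n(a)$ in $O(n \log_2 n)$.
Use divide and conquer:

$$A^{[0]}(x) = a_0 + a_2 x + a_4 x^2 + \cdots + a_{n-2}\, x^{n/2-1}$$
$$A^{[1]}(x) = a_1 + a_3 x + a_5 x^2 + \cdots + a_{n-1}\, x^{n/2-1}$$

$A^{[0]}$: even-index coefficients of $A$ [lsb of index = 0].
$A^{[1]}$: odd-index coefficients of $A$ [lsb of index = 1].

$$A(x) = A^{[0]}(x^2) + x\, A^{[1]}(x^2) \qquad [2]$$

Hence the problem of evaluating $A(x)$ at $\omega_n^0, \omega_n^1, \dots, \omega_n^{n-1}$ reduces to evaluating $A^{[0]}(x)$ and $A^{[1]}(x)$ at $(\omega_n^0)^2, (\omega_n^1)^2, \dots, (\omega_n^{n-1})^2$ and then combining according to [2].
Look carefully at the list of $n$ points at which to evaluate $A^{[0]}$ and $A^{[1]}$: $(\omega_n^0)^2, (\omega_n^1)^2, \dots, (\omega_n^{n-1})^2$. By the halving lemma there are only $n/2$ distinct values here.
So to compute the $n$-point DFT, do two $n/2$-point DFT computations.

$$T(n) = 2\,T(n/2) + \Theta(n) = \Theta(n \log_2 n)$$

Still need to perform interpolation at the complex roots of unity.

$\mathrm{DFT}_n(a)$ is given in matrix form by $Y = V_n a$:

$$
\begin{pmatrix} Y_0 \\ Y_1 \\ Y_2 \\ \vdots \\ Y_{n-1} \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 1 & \cdots & 1 \\
1 & \omega_n & \omega_n^2 & \cdots & \omega_n^{n-1} \\
1 & \omega_n^2 & \omega_n^4 & \cdots & \omega_n^{2(n-1)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & \omega_n^{n-1} & \omega_n^{2(n-1)} & \cdots & \omega_n^{(n-1)(n-1)}
\end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_{n-1} \end{pmatrix}
$$

The $(k, j)$-th entry of $V_n$ is $\omega_n^{kj}$ for $j, k = 0, 1, \dots, n-1$.

Inverse operation: $a = \mathrm{DFT}_n^{-1}(Y)$

Lemma: the $(j, k)$-th entry of $V_n^{-1}$ is $\omega_n^{-kj} / n$.
Proof: look at entry $(j, j')$ of $V_n^{-1} V_n$:

$$
\bigl[V_n^{-1} V_n\bigr]_{jj'} = \sum_{k=0}^{n-1} \frac{\omega_n^{-kj}}{n}\, \omega_n^{kj'} = \frac{1}{n} \sum_{k=0}^{n-1} \omega_n^{k(j'-j)} =
\begin{cases} 1 & \text{if } j' = j \\ 0 & \text{otherwise (summation lemma)} \end{cases}
$$

Therefore

$$a_j = \frac{1}{n} \sum_{k=0}^{n-1} Y_k\, \omega_n^{-kj}$$

Same approach as the FFT, with $\omega_n$ replaced by $\omega_n^{-1}$ and the result divided by $n$.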
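The matrix form and the inverse lemma can be checked with a direct $\Theta(n^2)$ transform. The following is a sketch under assumptions: `dft_naive` is a name invented here, and it follows these notes' convention $\omega_n = e^{+2\pi i/n}$ for the forward transform, which is the opposite sign from the convention used by many numerical libraries.

```python
import cmath


def dft_naive(a, inverse=False):
    """Theta(n^2) DFT straight from Y_k = sum_j a_j * omega_n**(k*j).

    Forward transform uses omega_n = exp(+2*pi*i/n), as in these notes.
    The inverse uses omega_n**-1 and divides the result by n.
    """
    n = len(a)
    sign = -1 if inverse else 1
    out = []
    for k in range(n):
        total = 0j
        for j in range(n):
            total += a[j] * cmath.exp(sign * 2j * cmath.pi * k * j / n)
        out.append(total / n if inverse else total)
    return out


a = [9, -10, 7, 6]
Y = dft_naive(a)                      # point values at the 4th roots of unity
back = dft_naive(Y, inverse=True)     # interpolation recovers a
print([round(v.real, 6) for v in back])
```

$Y_0$ is $A(1)$, the sum of the coefficients, which gives a quick check on the forward direction before trusting the round trip.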
Efficient FFT implementation

Common subexpression extraction.
Make iterative and not recursive.
Rearrange elements of $a$ so that adjacent pairs are DFTs: bit reversal.

RECURSIVE-FFT(a)
    n <- length[a]
    if n = 1
        then return a
    ω_n <- e^(2πi/n)
    ω <- 1
    a[0] <- (a_0, a_2, ..., a_{n-2})
    a[1] <- (a_1, a_3, ..., a_{n-1})
    y[0] <- RECURSIVE-FFT(a[0])
    y[1] <- RECURSIVE-FFT(a[1])
    for k <- 0 to n/2 - 1 do
        y_k       <- y[0]_k + ω y[1]_k
        y_{k+n/2} <- y[0]_k - ω y[1]_k
        ω <- ω ω_n
    return y

Parallel FFT Circuit

for s <- 1 to log2 n do
    for k <- 0 to n-1 by 2^s do
        combine the two 2^(s-1)-element DFTs in A[k .. k+2^(s-1)-1] and A[k+2^(s-1) .. k+2^s-1]
        into one 2^s-element DFT in A[k .. k+2^s-1]

FFT-BASE(a)
    n <- length[a]
    for s <- 1 to log2 n do
        m <- 2^s
        ω_m <- e^(2πi/m)
        for k <- 0 to n-1 by m do
            ω <- 1
            for j <- 0 to m/2 - 1 do
                t <- ω A[k+j+m/2]
                u <- A[k+j]
                A[k+j] <- u + t
                A[k+j+m/2] <- u - t
                ω <- ω ω_m

ITERATIVE-FFT(a)
    BIT-REVERSE-COPY(a, A)
    n <- length[a]
    for s <- 1 to log2 n do
        m <- 2^s
        ω_m <- e^(2πi/m)
        ω <- 1
        for j <- 0 to m/2 - 1 do
            for k <- j to n-1 by m do
                t <- ω A[k+m/2]
                u <- A[k]
                A[k] <- u + t
                A[k+m/2] <- u - t
            ω <- ω ω_m
    return A

Bit-reverse the inputs.
$\log_2 n$ stages.
$n/2$ butterflies in parallel per stage.

Many other applications
1. Evaluate a polynomial of degree bound $n$ at $n$ points in $O(n \log^2 n)$ time.
2. Multiply large numbers.

Variants: integer-valued coefficients. All arithmetic in $\mathbb{Z}_p$ [rather than $\mathbb{C}$].

Practical issues: layout, cache unfriendly, numerical instability.

Number of calculations
$N \times N$ convolution, direct: $N^2$ multiplies.
FFT-based: $3N \log_2 N$ complex multiplies and adds [taking into account redundancy], i.e. $12 N \log_2 N$ real multiplies.
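Putting the pieces together, the iterative scheme can be sketched in Python and used for fast polynomial multiplication. This is a hedged illustration rather than a tuned implementation: the function names are invented for the example, the loop order follows the FFT-BASE variant (blocks outside, butterflies inside), the forward transform uses $\omega_m = e^{+2\pi i/m}$ as in these notes, and `fft_multiply` rounds the result on the assumption that the input coefficients are integers.

```python
import cmath


def bit_reverse_copy(a):
    """Return a copy of a with element k moved to the bit-reversed index of k."""
    n = len(a)
    bits = n.bit_length() - 1          # log2(n) for a power of two
    out = [0j] * n
    for k in range(n):
        rev = 0
        for i in range(bits):
            if (k >> i) & 1:
                rev |= 1 << (bits - 1 - i)
        out[rev] = a[k]
    return out


def iterative_fft(a, inverse=False):
    """In-order iterative FFT; len(a) must be a power of two."""
    n = len(a)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    A = bit_reverse_copy(a)
    sign = -1 if inverse else 1        # inverse uses omega**-1, then divides by n
    m = 2
    while m <= n:                      # log2(n) stages
        w_m = cmath.exp(sign * 2j * cmath.pi / m)
        for k in range(0, n, m):       # each block of m elements
            w = 1 + 0j
            for j in range(m // 2):    # one butterfly per pair
                t = w * A[k + j + m // 2]
                u = A[k + j]
                A[k + j] = u + t
                A[k + j + m // 2] = u - t
                w *= w_m
        m *= 2
    return [v / n for v in A] if inverse else A


def fft_multiply(a, b):
    """Multiply integer-coefficient polynomials (lowest power first) via FFT."""
    out_len = len(a) + len(b) - 1
    size = 1
    while size < out_len:              # zero-pad up to a power of two
        size *= 2
    fa = iterative_fft(list(a) + [0] * (size - len(a)))
    fb = iterative_fft(list(b) + [0] * (size - len(b)))
    fc = [x * y for x, y in zip(fa, fb)]          # pointwise product, Theta(n)
    c = iterative_fft(fc, inverse=True)
    return [round(v.real) for v in c[:out_len]]   # assumes integer inputs


product = fft_multiply([9, -10, 7, 6], [-5, 4, 0, -2])
print(product)
```

The rounding step hides small floating-point errors; for large inputs or non-integer coefficients those errors are the numerical instability listed under practical issues, and the modular-arithmetic variant avoids them.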