289 IX. Basic Implementation Techniques and Fast Algorithm 9-A 快速演算法設計的原則 Fast Algorithm Design Goals: Saving Computational Time Number of Additions Number of Multiplications Number of Time Cycles Saving the Hardware Cost for Implementation 9-B 對於簡單矩陣快速演算法的設計 如何簡化下面四個運算 (1) y[1] = ax[1] + 2ax[2] y (1) a a x (1) (2) y (2) a a x(2) (3) y (1) a b x (1) y (2) b a x(2) (4) y (1) a b x (1) y (2) c a x(2) 290 291 (4) b a x (1) y (1) a b x(1) a a x(1) 0 y (2) c a x(2) a a x(2) c a 0 x(2) z (1) a a x(1) z (2) a a x(2) b a x(1) z (3) 0 z (4) c a x(2) 0 (i) z(1) = a[x(1) + x(2)], z(2) = z(1) (ii) z(3) = (b−a)x(2), z(4) = (c−a)x(1) (iii) y(1) = z(1) + z(3), y(2) = z(2) + z(4) 292 問題思考:如何對 complex number multiplication 來做 implementation? 293 9-C 一般化的簡化理論: 假設一個 MN sub-rectangular matrix S 可分解為 column vector 及 row vector 相乘 a1 b1 b2 a S 2 a M bN y1 x1 y x 2 S 2 y N xN z b1 x1 b2 x2 bN xN yn an z 若 [a1, a2, …., aM]T 有 M0 個相異的 non-trivial values (am 2k, am 2kah where m h) [b1, b2, …., bN] 有 N0 個相異的 non-trivial values 則 S 共需要 M0 + N0 個乘法 294 z[1] x[1] a1 b1 b2 z[2] x[2] a S 2 z[ N ] x[ N ] a M bN x[1] x[2] x[ N ] Step 1 za = b1x[1] + b2x[2] + …. + bNx[N] Step 2 z[1] = a1 za , z[2] = a2 za , ………, z[N] = aM za 295 簡化理論的變型 a1 b1 b2 a S 2 a M bN S1 S1 也是一個 MN matrix 若 S1 有 P1 個值不等於 0, 則 S 的乘法量上限為 M0 + N0 + P1 a1 b1 b2 a S 2 a M bN c1 d1 d 2 c 2 c M dN S1 以此類推 思考: 對於如下的情形需要多少乘法 y[1] a y[2] e y[3] f y[4] d b f e c c f e b d x[1] e x[2] f x[3] a x[4] 296 297 9-D Examples N 1 DFT: X m x n e j 2 m n N n 0 Without any simplification, the DFT needs 4N2 real multiplications (x[n] may be complex) 3 3 DFT 可以用特殊方法簡化 1 1 1 1 1/ 2 1/ 2 1 1/ 2 1/ 2 0 0 0 j 0 3 / 2 3/2 0 3 / 2 3 / 2 298 5 5 DFT 的例子 1 1 1 1 1 1 a b b a 1 b a a b 1 b a a b 1 a b b a 0 0 0 c j 0 d 0 d 0 c 0 d c c d 0 d c c d 0 c d d c a cos 2 / 5 b cos 4 / 5 c sin 2 / 5 d sin 4 / 5 299 8-point DCT y[0] 0.7010 y[1] 0.9808 y[2] 0.9239 y[3] 0.8315 y[4] 0.7010 y [5] 0.5556 y[6] 0.3827 y [7] 0.1951 0.7010 0.7010 0.7010 0.7010 0.8315 0.3827 0.5556 0.1951 0.1951 0.3827 0.9239 0.9239 0.1951 0.9808 0.5556 0.5556 0.7010 0.7010 0.7010 0.7010 0.9808 0.1951 0.8315 0.8315 0.9239 0.9239 0.3827 0.3827 0.5556 0.8315 0.9808 0.9808 0.7010 x[0] 0.5556 0.8315 0.9808 x[1] 0.3827 0.3827 0.9239 x[2] 0.9808 0.1951 0.8315 x[3] 0.7010 0.7010 0.7010 x[4] 0.1951 0.9808 0.5556 x[5] 0.9239 0.9239 0.3827 x[6] 0.8315 0.5556 0.1951 x[7] 0.7010 觀察對稱性質之後,令 z[0] x[0] x[7] z[1] x[1] x[6] z[2] x[2] x[5] z[3] x[3] x[4] z[4] x[0] x[7] z[5] x[1] x[6] z[6] x[2] x[5] z[7] x[3] x[4] 0.7010 Part 1: y[0] 0.7010 0.7010 0.7010 0.7010 z[0] y[2] 0.9239 0.3827 0.3827 0.9239 z[1] y[4] 0.7010 0.7010 0.7010 0.7010 z[2] y[6] 0.3827 0.9239 0.9239 0.3827 z[3] 300 Part 2: y[1] 0.9808 0.8315 0.5556 0.1951 z[4] y[3] 0.8315 0.1951 0.9808 0.5556 z[5] y[5] 0.5556 0.9808 0.1951 0.8315 z[6] y[7] 0.1951 0.5556 0.8315 0.9808 z[7] 301 [Ref] B. G. Lee, “A new algorithm for computing the discrete cosine transform,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 32, pp. 1243-1245, Dec. 1984. X. Fast Fourier Transform C. S. Burrus and T. W. Parks, “DFT / FFT and convolution algorithms”, John Wiley and Sons, New York, 1985. R. E. Blahut, Fast Algorithm for Digital Signal Processing, Addison Wesley Publishing Company. N 1 X m x n e j 2 m n N n 0 N-point Fourier Transform: 運算量為 N2 FFT (with the Cooley Tukey algorithm): 運算量為 Nlog2N 要學到的概念:(1)快速演算法不是只有 Cooley Tukey algorithm (2) 不是只有 N = 2k 有時候才有快速演算法 302 303 10-A Other DFT Implementation Algorithms (1) Cooley-Tukey algorithm (Butterfly form) (2) Radix-4, 8, 16, …. Algorithms (3) Prime Factor Algorithm (4) Goertzel Algorithm (5) Chirp Z transform (CZT) (6) Winograd algorithm 304 Reference J. W. Cooley and J. W. Tukey, “An algorithm for the machine computation of complex Fourier series,” Mathematics of Computation, vol. 19, pp. 297301, Apr. 1965. (Cooley-Tukey) C. S. Burrus, “Index Mappings for multidimensional formulation of the DFT and convolution,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 25, pp. 1239-242, June 1977. (Prime factor) G. Goertzel, “An algorithm for the evaluation of finite trigonometric series,” American Math. Monthly, vol. 65, pp. 34-35, Jan. 1958. (Goertzel) C. R. Hewes, R. W. Broderson, and D. D. Buss, “Applications of CCD and switched capacitor filter technology,” Proc. IEEE, vol. 67, no. 10, pp. 14031415, Oct. 1979. (CZT) S. Winograd, “On computing the discrete Fourier transform,” Mathematics of Computation, vol. 32, no. 141, pp. 179-199, Jan. 1978. (Winograd) R. E. Blahut, Fast Algorithm for Digital Signal Processing, Reading, Mass., Addison-Wesley, 1985. 305 10-B Cooley Tukey Algorithm When N = 2k N 1 X m x n e j 2 m n N n 0 N / 2 1 x 2n e j 2 m (2 n ) N N / 2 1 n 0 N / 2 1 x n e n 0 1 j x 2n 1 e j 2 m (2 n 1) N n 0 2 m n N /2 e j 2 m N / 2 1 N x n n 0 j 2 2 m n N /2 twiddle factors x1[n] = x[2n], x2[n] = x[2n+1] Therefore, one N-point DFT = two (N/2)-point DFT + twiddle factors 306 8-point DFT X[0] x[0] x[2] X[1] 4-point DFT x[4] X[2] x[6] X[3] 1 x[1] x[3] w 4-point DFT x[5] w2 w3 x[7] we j 2 8 X[4] -1 -1 X[5] -1 X[6] -1 w4 1 X[7] w8 1 307 8-point DFT x[0] x[4] X[0] X[1] -1 1 x[2] x[6] w2 -1 X[2] -1 X[3] -1 x[1] 1 x[5] w -1 x[3] 1 x[7] w2 we j 2 8 X[5] -1 w2 -1 X[6] -1 w3 -1 -1 X[4] -1 -1 X[7] 308 Number of real multiplications 的估算 2k-point DFT 一共有 k 個 stages 每個 stage 和下一個 stage 之間有 2k-1 個 twiddle factors 所以,一共有 2k-1(k-1) 個 twiddle factors 一般而言,每個 twiddle factor 需要 3 個 real multiplications 2k-point DFT 需要 3 32k 1 (k 1) N log 2 N 1 2 個 real multiplications Complexity of the N-point DFT: O(Nlog2N) 309 8-point DFT 只需要 4 個 real multiplications (Why?) 更精確的分析,使用 Cooley-Tukey algorithm 時,N-point DFT 需要 3 N log 2 N 5 N 8 2 (Why?) 個 real multiplications 310 10-C Radix-4 Algorithm 限制: N = 4k or N = 24k (此時 Cooley-Tukey algorithm 和 radix-4 algorithm 並用) N 1 X m x n e j 2 m n N n 0 N /4 1 x 4n e j 2 m n N /4 e j 2 m N /4 1 N x 4n 1 n 0 e j 2 (2 m ) N /4 1 N x 4n 2 e j 2 m n N /4 n 0 j n 0 2 m n N /4 e j 2 (3 m ) N /4 1 N x 4n 3 n 0 twiddle factors One N-point DFT = four (N/4)-point DFTs + twiddle factors j 2 m n N /4 311 Note: (1) radix-4 algorithm 最後可將 N = 4k-point DFT 拆解成 4-point DFTs 的組合 4-point DFTs 不需要任何的乘法 (2) 使用 radix-4 algorithm 時,N-point DFT 需要 9 43 16 N log 4 N N 4 12 3 個 real multiplications 312 Number of real multiplications for the N-point DFT N 乘法數 加法數 N 乘法數 N 乘法數 N 乘法數 1 0 0 12 8 26 104 44 160 2 0 4 13 52 27 114 45 170 3 2 12 14 32 28 64 48 92 4 0 16 15 40 30 80 52 208 5 10 34 16 20 32 72 54 228 6 4 36 18 32 33 160 56 156 7 16 72 20 40 35 150 60 160 8 4 52 21 62 36 64 63 256 9 16 72 22 80 39 182 64 204 10 20 88 24 28 40 100 66 284 11 46 156 25 148 42 124 70 300 313 N 乘法數 N 乘法數 N 乘法數 N 乘法數 72 164 128 560 256 1308 1024 7436 80 260 144 436 288 1160 1152 7088 81 480 160 680 312 1324 1260 7640 84 248 168 580 336 1412 2016 12728 88 412 180 680 360 1540 2048 16836 90 340 192 752 420 2080 2304 15868 96 280 204 976 504 2300 2520 16540 104 468 216 1020 512 3180 4032 29488 108 456 224 1016 560 3100 4096 37516 112 396 240 940 672 3496 4368 35828 120 380 252 1024 1008 5356 5040 36860 附錄十: Complexities 一覽表 N-point DFT: O N log2 N N-point DCT, DST, DHT: O N log2 N Two-dimensional (2-D) Nx Ny-point DFT: O ( N x N y )log 2 ( N x N y ) Convolution of an M-point sequence and an N-point sequence: O (M N 1)log2 (M N 1) O N O M when M/N and N/M are not large, when N >> M and M is a fixed constant. when M >> N and N is a fixed constant. 314 315 2-D Convolution of an (Mx My )-point matrix and an (Nx Ny )-point matrix: O ( M x N x 1)( M y N y 1)log 2 ( M x N x 1)( M y N y 1) when MxMy/NxNy and NxNy/MxMy are not large, OM xM y Nx N y when MxMy >> NxNy or NxNy >> MxMy O Nx N y when MxMy >> NxNy or NxNy >> MxMy , and Mx, My are fixed constants. 316 Four important concepts that should be learned from fast algorithm design: (1) (2) (3) (4)