High-Performance Code Generation for FIR Filters and the Discrete Wavelet Transform Using SPIRAL Aca Gacic Markus Püschel Carnegie Mellon José M. F. Moura SPIRAL project http://www.spiral.net Electrical and Computer Engineering Department Carnegie Mellon University Motivation ?FIR filters - image enhancement, equalization, speech synthesis, biomedical signal processing, … Wavelets - image compression, detection, denoising, signal recovery ? Numerically intensive - significant impact on application performance ? High- Performance Implementations (How-To) ?Choose Fast Algorithm ? Arithmetic cost – only a rough estimate of performance ? Many algorithms with similar cost – which one? Carnegie Mellon ?Design Efficient Code ? Architecture conscious – machine dependent ? Obsolete when platform is changed/upgraded ?Best implementation: implementation Algorithm + Machine + Compiler ? Requires experts in algorithms and computer architecture ? Frequent re-implementation Better way: automatic performance tuning Existing FIR filtering and DWT software Matlab ©, Mathematica, S+Wavelets, WaveLab , Wave++, IPP Application oriented – not machine specific, or Hand-tuned code for specific platform Automatic platform-adapted solutions not available ! Carnegie Mellon We want to close this gap Automatic performance tuning packages ATLAS, PHiPAC, SPARSITY, FFTW, SPIRAL Include several basic linear algebra and DSP functions No filtering and wavelet kernels ! SPIRAL system Generator of optimized DSP transform implementations user goes for a coffee controls algorithm generation fast algorithm as SPL formula SPL Compiler C/Fortran/SIMD code platform-adapted implementation controls implementation options Search Engine SPIRAL Formula Generator specifies runtime on given platform comes back (or an espresso for small transforms) Carnegie Mellon DSP transform SPIRAL’s Four Key Concepts Transform ? DFTn ? e ? 2? jkl / n ? k ,l ? 0 ,....,n ? 1 parameterized matrix Diagonal matrix (twiddles) Rule DFTnm ? ?DFTn ? I m ??Tnnm ??I n ? DFTm ??Lnm m Kronecker product Identity Permutation • a breakdown strategy - product of sparse matrices • captures important structural information DFT8 Carnegie Mellon Rule Tree DFT 4 DFT 2 DFT 2 Formula DFT 2 • recursive application of rules • uniquely defines an algorithm • efficient representation • easy manipulation DFT8 ? ?F2 ? I 4 ??T48 ??I 2 ? ?I 2 ? F2 ? • few constructs and primitives • uniquely defines an algorithm • can be translated into code ???L82 Rule Trees = Formulas = Algorithms DFT 8 DFT 8 Cooley-Tukey Cooley-Tukey DFT 2 Formulas Algorithms DFT 2 Rule Cooley-Tukey DFT 2 Carnegie Mellon DFT 4 DFT 4 Cooley-Tukey DFT 2 DFT 2 DFT 2 DFT8 ? ( DFT4 ? I 2 ) ?T48 ?(I4 ? DFT2 ) ?L82 DFT8 ? (DFT2 ? I4 ) ?T28 ?(I 2 ? DFT4 ) ?L84 (DFT2 ? I 2 ) ?T24 ?(I2 ? DFT2 ) ?L42 x[0] X[0] x[0] X[0] x[4] X[1] x[4] X[1] x[1] X[2] x[2] X[2] X[3] x[6] x[5] WN1 x[2] x[6] X[3] WN2 WN2 X[4] x[1] X[5] x[5] X[4] X[5] WN1 x[3] X[6] x[3] X[6] WN2 WN2 x[7] WN3 WN2 X[7] x[7] X[7] WN2 WN3 Increasing level of abstraction Rule Trees Transform Examples: Rules and Rule Trees Trigonometric Rule DFTn ? CosDFTn ? i ?SinDFTn DCT-Recursive Rule CosDFTn ? S n ? CosDFTn / 2 ? DCTn( /II4) ?S n? ?Ln2 SinDFT n ? S n ? ??SinDFT n/2 Number of rule trees grows exponentially with ? DCTn( II/ 4) ?S n? ?Ln2 Type-Split Rule ) DCTn( II ) ? Pn ? DCTn( II/ 2) ? DCTn( IV /2 ? ? ? ? ? ???I Pn? ?Size of the transform ?Number of applicable rules ? F2 ?Pn ?? n/2 DFT16 Trigonometric Recursive RHT Rule k RHT2 k ? ?RHT2k ?1 ? I 2 k?1 ???F2 ? I 2k ?1 ?L22 SPIRAL also includes search over implementation options, such as the degree of loop unrolling SinDFT 16 CosDFT 16 ...... Carnegie Mellon Type-Conversion Rule DCTn( IV ) ? S n ??DCTn( II ) ??Dn DCT-Recursive SinDFT 8 DCT (II)4 DCT-Recursive SinDFT 4 Type-Split DCT (II)2 DCT(II)2 DCT (IV) 2 FIR Filters and the DWT in SPIRAL: Challenges ?Existing filtering and wavelet algorithm representations do not capture all important structural information ?SPIRAL uses a concise math framework to represent many algorithms for several transforms (DFT, DCTs, DHT, RDFT, …) ?Filtering and wavelet algorithms have considerably different structure from the current SPIRAL transforms ? Current framework not enough to capture algorithms’ structure ? Existing algorithm representations need to be “spiralized” – i.e., captured concisely using the rule formalism Carnegie Mellon ?Need changes on several levels: ? New transforms, constructs, rules, translation templates ? Modified search ? Need to rephrase several concepts of the framework ?Existing framework has to be redesigned to accommodate new constructions and decomposition strategies New framework has to be integrated in SPIRAL FIR Filters: Rules and Algorithms FIR Filter Transform yn ? hn ? xn ? k ?1 ? i? 0 hi ?xn? i standard “sum” notation ? h0 ? ?h ? h0 1 ? ? Baseline Rule ? h ? ? ? 1 Fn ( h ) ? ? hk ? 1 ? ? h0 ? Fn ?h ?? I n ? k ? 1 hT ? ? h h ? k?1 1 ? SPIRAL rule ? ? ? ? captures the structure ? ? h ? k ?1 ? SPIRAL transform notation New constructs Carnegie Mellon A? k B ? A? k B ? column and row overlapped direct sum Is ? k A ? overlapped tensor product Time Domain Methods Overlap-Add Rule Fn ?h ? ? I n / b ? k ? 1 Fb ?h ? concise rule notation • Localized access of the input data • Input is blocked into length b segments Overlap-Save Rule • Localized storage of the output T Fn ?h?? T ?hL ?? k ?1 ?I m/ b ? k ?1 Fb ?h??? k ?1 T ?hR ? • Output segments are computed independently Toeplitz matrix Blocking Rule Fn ?h ? ? I n / b ? v ?k / b ? ? b i ?1 • Multilevel blocking • Control over input & output locality T ?h i ? Frequency Domain Methods Expanded Circulant Rule Carnegie Mellon Cyclic -RDFT Rule ? I ? F n ?h ? ? C ?E k , n ? 1h ?E n , k ? 1 E n ,k ? ? n ? ?0 k x n ? C ?a? ? RDFTn-1 ?X RDFTn a ?RDFTn ? Circulant matrix sparse matrix Radix-2 Cooley Tukey RDFT Rule ? real DFT RDFTn ? ?RDFTn / 2 ? I2 ??X(wkl ) ??I n / 2 ? RDFT2 ??Ln2 Time domain - better data access structure Frequency domain - potentially reduce arithmetic cost DWT: Rules and Algorithms Filter Bank – Mallat’s Algorithm g(z -1) 2 {dj-1,k } g(z-1 ) {xk} = {cj,k } h(z-1) 2 DWT of {xn} 2 {cj-1,k } h(z-1 ) {dj-2,k } g(z-1 ) 2 2 d0,k 2 c0,k {c 1,k } h(z-1) Carnegie Mellon Mallat Rule DWTn (h, g) ? ?DWTn / 2 (h, g ) ? I n / 2 ???? 2 ???Fn / 2 (h ) ? Fused Mallat Rule DWTn (h, g ) ? ?DWTn / 2 (h, g ) ? I n / 2 ??H n (h, g ) ?E *n ,l ,r • Redundant computations removed • Filter structure lost Extension matrix n? k ? 1 Fn / 2 (g )??E*n ,l ,r ? I n / 2 ? k ? 2 hT ? H n (h, g ) ? ? T? ? I n/ 2 ? k ? 2 g ? single-level DWT Polyphase Representation 2 2 g(z -1) {xk } even {xk } {xk } z-1 -1 Polyphase Rule Gateway to frequency domain computation ?? H n (h, g ) ? FnT/ 2 (h o ) ? Carnegie Mellon ?he ( z ? 1 ) ho ( z ? 1 ) ? ? ? ?1 ?1 ? ? ge ( z ) g o ( z ) ? n/2 ? FnT/ 2 (h e ) ? n? k ? 2 Primal (Predict) Lifting Scheme ?F T n /2 (g o ) ? Dual (Update) n /2 ?? FnT/ 2 (g e ) ?Ln2? k ? 2 Lifting steps ?K 0 ??1 si ( z ? 1 )?? 1 0 ? ?1 s0 ( z ? 1 )?? 1 0 ? ?? ? ? ? ?? ? ? ?1 ?? ?1 0 1 / K 0 1 t ( z ) 1 0 1 t ( z ) 1 ? ? ? ? ? ?? ? i ? ? 0 • Asymptotically reduce cost by 50% • In-place computations Lifting Rule ? ?I {xk }odd 2 2 h(z ) ?he (z?1) ho (z?1)? P(z ) ? ? ?1 ?1 ? ?ge (z ) go(z )? ?1 P(z-1) H n (h, g ) ? K ?In / 2 ? n /2 ? n/ 2 ?? 1 / K ?I n / 2 ? I n / 2 ? T F n ? ?t o ? n / 2 (t o ) ? n/ 2 ? n /2 FnT/ 2 (s i ) ? I n / 2 ?Ln2? k ? 2 I n ? ?s i ? n / 2 ?? Examples: FIR Filter and DWT Rule Trees F n(h) DWT n(h,g) Polyphase Overlap-Save Fn/2(he ) DWT n/2(h,g) Fn/2(ge ) Fn/2(ho) Fn/2(go) T(hL ) F n(h) F b(h) T(h R) Overlap-Add Expanded-Circulant F b(h) ........ ........ ........ ........ Expanded-Circulant Blocking C b+l-1(h) Cb+l-1 (h) ..... T(h1) T(hk/b ) Cyclic-RDFT Cyclic-RDFT Blocking Fn (h) Overlap-Add RDFT b+l-1 T(h 1) Fb (h) ..... T(hk/b1 ) Blocking T(h1 ) RDFT Cooley-Tukey ..... RDFT b+l-1 T(hk/b) Blocking T(h 1) ..... T(h k/b1) RDFT Cooley-Tukey F n(h) Overlap-Add RDFT b+l-1/2 DWT n(h,g) RDFT2 F n(h) RDFT b+l-1/2 Mallat F b(h) RDFT 2 Blocking Blocking T(h 1 ) ..... ..... T(h1) T(h k/b) T(hk/b ) Blocking T(h 1 ) ..... T(h k/b1 ) DWT n/2 (h,g) Blocking F n/2(g) Fn/2(h) T(h 1) Blocking T(h1 ) ..... T(hk/b) ........ Exceedingly large number of trees even for modest size transforms ........ Carnegie Mellon F b(h) Overlap-Add ..... T(hk/b1 ) All these rule trees form an extensive search space of alternative algorithms in SPIRAL Time-Domain Methods flops / peak performance FIR Filter • overlap-add, overlap-save, and blocking • all methods have the same cost • baseline method = overlap-add with b=1 • sanity check – percentage of the peak performance 60-70% improvement over baseline Same arithmetic cost Carnegie Mellon comparison of different time-domain methods Pentium 4 unrolled loops SUN 1- level blocking blocks size 2 Frequency Domain vs. Time Domain Circulant FIR Filter Frequency domain – potentially lower cost Time domain – beter data access patterns Carnegie Mellon circulant too sparse “half–dense” circulants emerge in frequency domain methods best time-domain FD methods better on large “fat” circulants Filter lengths > 64 arithmetic cost cross-over runtime cross-over Considerable discrepancy between arithmetic costs and actual runtimes Daubechies D 4 DWT flops / peak performance for D 4 and D 30 Short Wavelet • single-level DWT using D4 • 3 different classes of rule trees based on the initial rule •Fused Mallat Rule (direct method) •Polyphase Rule (frequency domain) •Lifting Rule (lifting scheme) Frequency domain trees not found Lifting scheme – 30% lower cost – slower comparison of runtimes Athlon Carnegie Mellon Xeon direct method cache size reached Daubechies D 30 DWT Experiment Long Wavelet Carnegie Mellon Xeon extension overhead lifting better Athlon temp cache cache size reached Different methods are found for different size ranges Best implementations vary from machine to machine Summary ?Fast automatically generated code for FIR filters and DWT ? Significant contribution – no existing solutions ?Preliminary results ? Arithmetic cost not a reliable predictor of performance ? Best implementation not obvious for a human programmer ? Best implementations vary over different computer platforms What lies ahead? ?Further algorithmic and implementation optimizations ?Fusion of the extension, exploiting in-place structure, scheduling,.. Carnegie Mellon ?Entire class of wavelets and extension methods ? Orthogonal, biorthogonal, frames,… ? Periodic, symmetric, polynomial extension, zero-padding ?Extend to other wavelet transforms ? 2D-DWT, M-band, wavelet packet transform ?Utilize special instruction set extensions ? Single Instruction Multiple Data (SIMD) ? Fused Multiply-Add (FMA)