High - Performance Code Generation for FIR Filters and the Discrete Wavelet

advertisement
High-Performance Code Generation for
FIR Filters and the Discrete Wavelet
Transform Using SPIRAL
Aca Gacic
Markus Püschel
Carnegie Mellon
José M. F. Moura
SPIRAL project
http://www.spiral.net
Electrical and Computer Engineering Department
Carnegie Mellon University
Motivation
?FIR filters - image enhancement, equalization, speech synthesis,
biomedical signal processing, …
Wavelets - image compression, detection, denoising, signal recovery
? Numerically intensive - significant impact on application performance
?
High- Performance Implementations (How-To)
?Choose Fast Algorithm
? Arithmetic cost – only a rough estimate of performance
? Many algorithms with similar cost – which one?
Carnegie Mellon
?Design Efficient Code
? Architecture conscious – machine dependent
? Obsolete when platform is changed/upgraded
?Best implementation:
implementation Algorithm + Machine + Compiler
? Requires experts in algorithms and computer architecture
? Frequent re-implementation
Better way: automatic performance tuning
Existing FIR filtering and DWT software
Matlab ©, Mathematica, S+Wavelets, WaveLab , Wave++, IPP
Application oriented – not machine specific, or
Hand-tuned code for specific platform
Automatic platform-adapted solutions not available !
Carnegie Mellon
We want to close this gap
Automatic performance tuning packages
ATLAS, PHiPAC, SPARSITY, FFTW, SPIRAL
Include several basic linear algebra and DSP functions
No filtering and wavelet kernels !
SPIRAL system
Generator of optimized DSP transform implementations
user
goes for a coffee
controls
algorithm generation
fast algorithm
as SPL formula
SPL Compiler
C/Fortran/SIMD
code
platform-adapted
implementation
controls
implementation options
Search Engine
SPIRAL
Formula Generator
specifies
runtime on given platform
comes back
(or an espresso for small transforms)
Carnegie Mellon
DSP transform
SPIRAL’s Four Key Concepts
Transform
?
DFTn ? e ? 2? jkl / n
?
k ,l ? 0 ,....,n ? 1
parameterized matrix
Diagonal matrix (twiddles)
Rule
DFTnm ?
?DFTn ?
I m ??Tnnm ??I n ? DFTm ??Lnm
m
Kronecker product
Identity
Permutation
• a breakdown strategy - product of sparse matrices
• captures important structural information
DFT8
Carnegie Mellon
Rule Tree
DFT 4
DFT 2
DFT 2
Formula
DFT 2
• recursive application of rules
• uniquely defines an algorithm
• efficient representation
• easy manipulation
DFT8 ? ?F2 ? I 4 ??T48 ??I 2 ? ?I 2 ? F2 ?
• few constructs and primitives
• uniquely defines an algorithm
• can be translated into code
???L82
Rule Trees = Formulas = Algorithms
DFT 8
DFT 8
Cooley-Tukey
Cooley-Tukey
DFT 2
Formulas
Algorithms
DFT 2
Rule
Cooley-Tukey
DFT 2
Carnegie Mellon
DFT 4
DFT 4
Cooley-Tukey
DFT 2
DFT 2
DFT 2
DFT8 ? ( DFT4 ? I 2 ) ?T48 ?(I4 ? DFT2 ) ?L82
DFT8 ? (DFT2 ? I4 ) ?T28 ?(I 2 ? DFT4 ) ?L84
(DFT2 ? I 2 ) ?T24 ?(I2 ? DFT2 ) ?L42
x[0]
X[0]
x[0]
X[0]
x[4]
X[1]
x[4]
X[1]
x[1]
X[2]
x[2]
X[2]
X[3]
x[6]
x[5]
WN1
x[2]
x[6]
X[3]
WN2
WN2
X[4]
x[1]
X[5]
x[5]
X[4]
X[5]
WN1
x[3]
X[6]
x[3]
X[6]
WN2
WN2
x[7]
WN3
WN2
X[7]
x[7]
X[7]
WN2
WN3
Increasing level of abstraction
Rule Trees
Transform
Examples: Rules and Rule Trees
Trigonometric Rule
DFTn ? CosDFTn ? i ?SinDFTn
DCT-Recursive Rule
CosDFTn ? S n ? CosDFTn / 2 ? DCTn( /II4) ?S n? ?Ln2
SinDFT n ? S n
?
??SinDFT
n/2
Number of rule trees grows
exponentially with
? DCTn( II/ 4) ?S n? ?Ln2
Type-Split Rule
)
DCTn( II ) ? Pn ? DCTn( II/ 2) ? DCTn( IV
/2
?
?
?
?
? ???I
Pn?
?Size of the transform
?Number of applicable rules
? F2 ?Pn
??
n/2
DFT16
Trigonometric
Recursive RHT Rule
k
RHT2 k ? ?RHT2k ?1 ? I 2 k?1 ???F2 ? I 2k ?1 ?L22
SPIRAL also includes
search over implementation
options, such as the degree
of loop unrolling
SinDFT 16
CosDFT 16
......
Carnegie Mellon
Type-Conversion Rule
DCTn( IV ) ? S n ??DCTn( II ) ??Dn
DCT-Recursive
SinDFT 8
DCT (II)4
DCT-Recursive
SinDFT 4
Type-Split
DCT (II)2
DCT(II)2
DCT (IV) 2
FIR Filters and the DWT in SPIRAL: Challenges
?Existing filtering and wavelet algorithm representations do not
capture all important structural information
?SPIRAL uses a concise math framework to represent many
algorithms for several transforms (DFT, DCTs, DHT, RDFT, …)
?Filtering and wavelet algorithms have considerably different
structure from the current SPIRAL transforms
? Current framework not enough to capture algorithms’ structure
? Existing algorithm representations need to be “spiralized” – i.e.,
captured concisely using the rule formalism
Carnegie Mellon
?Need changes on several levels:
? New transforms, constructs, rules, translation templates
? Modified search
? Need to rephrase several concepts of the framework
?Existing framework has to be redesigned to accommodate new
constructions and decomposition strategies
New framework has to be integrated in SPIRAL
FIR Filters: Rules and Algorithms
FIR Filter Transform
yn ? hn ? xn ?
k ?1
?
i? 0
hi ?xn? i
standard “sum” notation
? h0
?
?h
?
h0
1
?
?
Baseline Rule
?
h
?
?
?
1
Fn ( h ) ? ? hk ? 1 ? ? h0 ?
Fn ?h ?? I n ? k ? 1 hT
?
?
h
h
?
k?1
1 ?
SPIRAL rule
?
?
? ?
captures the structure
?
?
h
?
k ?1 ?
SPIRAL transform notation
New constructs
Carnegie Mellon
A? k B ?
A? k B ?
column and row overlapped direct sum
Is ? k A ?
overlapped tensor product
Time Domain Methods
Overlap-Add Rule
Fn ?h ? ? I n / b ? k ? 1 Fb ?h ?
concise rule notation
• Localized access of the input data
• Input is blocked into length b segments
Overlap-Save Rule
• Localized storage of the output
T
Fn ?h?? T ?hL ?? k ?1 ?I m/ b ? k ?1 Fb ?h??? k ?1 T ?hR ? • Output segments are computed
independently
Toeplitz matrix
Blocking Rule
Fn ?h ? ? I n / b ?
v
?k / b ?
?
b
i ?1
• Multilevel blocking
• Control over input & output locality
T ?h i ?
Frequency Domain Methods
Expanded Circulant Rule
Carnegie Mellon
Cyclic -RDFT Rule
? I ?
F n ?h ? ? C ?E k , n ? 1h ?E n , k ? 1 E n ,k ? ? n ?
?0 k x n ?
C ?a? ? RDFTn-1 ?X RDFTn a ?RDFTn
?
Circulant matrix sparse matrix
Radix-2 Cooley Tukey RDFT Rule
?
real DFT
RDFTn ? ?RDFTn / 2 ? I2 ??X(wkl ) ??I n / 2 ? RDFT2 ??Ln2
Time domain - better data access structure
Frequency domain - potentially reduce arithmetic cost
DWT: Rules and Algorithms
Filter Bank – Mallat’s Algorithm
g(z -1)
2
{dj-1,k }
g(z-1 )
{xk} = {cj,k }
h(z-1)
2
DWT of {xn}
2
{cj-1,k }
h(z-1 )
{dj-2,k }
g(z-1 )
2
2
d0,k
2
c0,k
{c 1,k }
h(z-1)
Carnegie Mellon
Mallat Rule
DWTn (h, g) ? ?DWTn / 2 (h, g ) ? I n / 2 ???? 2 ???Fn / 2 (h ) ?
Fused Mallat Rule
DWTn (h, g ) ? ?DWTn / 2 (h, g ) ? I n / 2 ??H n (h, g ) ?E *n ,l ,r
• Redundant computations removed
• Filter structure lost
Extension matrix
n? k ? 1
Fn / 2 (g )??E*n ,l ,r
? I n / 2 ? k ? 2 hT ?
H n (h, g ) ? ?
T?
? I n/ 2 ? k ? 2 g ?
single-level DWT
Polyphase Representation
2
2
g(z -1)
{xk } even
{xk }
{xk }
z-1
-1
Polyphase Rule
Gateway to frequency domain computation
??
H n (h, g ) ? FnT/ 2 (h o ) ?
Carnegie Mellon
?he ( z ? 1 ) ho ( z ? 1 ) ?
?
?
?1
?1 ?
? ge ( z ) g o ( z ) ?
n/2
?
FnT/ 2 (h e ) ?
n? k ? 2
Primal
(Predict)
Lifting Scheme
?F
T
n /2
(g o ) ?
Dual
(Update)
n /2
??
FnT/ 2 (g e ) ?Ln2? k ? 2
Lifting steps
?K 0 ??1 si ( z ? 1 )?? 1 0 ? ?1 s0 ( z ? 1 )?? 1 0 ?
?? ?
?
?
??
? ? ?1
?? ?1
0
1
/
K
0
1
t
(
z
)
1
0
1
t
(
z
)
1
?
?
?
?
?
??
? i
?
? 0
• Asymptotically reduce cost by 50%
• In-place computations
Lifting Rule
?
?I
{xk }odd
2
2
h(z )
?he (z?1) ho (z?1)?
P(z ) ? ? ?1
?1 ?
?ge (z ) go(z )?
?1
P(z-1)
H n (h, g ) ? K ?In / 2 ?
n /2 ?
n/ 2
??
1 / K ?I n / 2 ? I n / 2 ?
T
F
n ? ?t o ? n / 2 (t o ) ?
n/ 2
?
n /2
FnT/ 2 (s i ) ?
I n / 2 ?Ln2? k ? 2
I
n ? ?s i ? n / 2
??
Examples: FIR Filter and DWT Rule Trees
F n(h)
DWT n(h,g)
Polyphase
Overlap-Save
Fn/2(he )
DWT n/2(h,g)
Fn/2(ge )
Fn/2(ho)
Fn/2(go)
T(hL )
F n(h)
F b(h)
T(h R)
Overlap-Add
Expanded-Circulant
F b(h)
........
........
........
........
Expanded-Circulant
Blocking
C b+l-1(h)
Cb+l-1 (h)
.....
T(h1)
T(hk/b )
Cyclic-RDFT
Cyclic-RDFT
Blocking
Fn (h)
Overlap-Add
RDFT b+l-1
T(h 1)
Fb (h)
.....
T(hk/b1 )
Blocking
T(h1 )
RDFT Cooley-Tukey
.....
RDFT b+l-1
T(hk/b)
Blocking
T(h 1)
.....
T(h k/b1)
RDFT Cooley-Tukey
F n(h)
Overlap-Add
RDFT b+l-1/2
DWT n(h,g)
RDFT2
F n(h)
RDFT b+l-1/2
Mallat
F b(h)
RDFT 2
Blocking
Blocking
T(h 1 )
.....
.....
T(h1)
T(h k/b)
T(hk/b )
Blocking
T(h 1 )
.....
T(h k/b1 )
DWT n/2 (h,g)
Blocking
F n/2(g)
Fn/2(h)
T(h 1)
Blocking
T(h1 )
.....
T(hk/b)
........
Exceedingly large
number of trees
even for modest
size transforms
........
Carnegie Mellon
F b(h)
Overlap-Add
.....
T(hk/b1 )
All these rule trees form
an extensive search
space of alternative
algorithms in SPIRAL
Time-Domain Methods
flops / peak performance
FIR Filter
• overlap-add, overlap-save, and blocking
• all methods have the same cost
• baseline method = overlap-add with b=1
• sanity check – percentage of the peak
performance
60-70% improvement over baseline
Same arithmetic cost
Carnegie Mellon
comparison of different time-domain methods
Pentium 4
unrolled loops
SUN
1- level blocking
blocks size 2
Frequency Domain vs. Time Domain
Circulant
FIR Filter
Frequency domain – potentially lower cost
Time domain – beter data access patterns
Carnegie Mellon
circulant too sparse
“half–dense” circulants emerge in
frequency domain methods
best time-domain
FD methods better on
large “fat” circulants
Filter lengths > 64
arithmetic cost cross-over
runtime cross-over
Considerable discrepancy between arithmetic costs and actual runtimes
Daubechies D 4 DWT
flops / peak performance for D 4 and D 30
Short Wavelet
• single-level DWT using D4
• 3 different classes of rule trees based
on the initial rule
•Fused Mallat Rule (direct method)
•Polyphase Rule (frequency domain)
•Lifting Rule (lifting scheme)
Frequency domain trees not found
Lifting scheme – 30% lower cost – slower
comparison of runtimes
Athlon
Carnegie Mellon
Xeon
direct method
cache size reached
Daubechies D 30 DWT Experiment
Long Wavelet
Carnegie Mellon
Xeon
extension
overhead
lifting
better
Athlon
temp
cache
cache size reached
Different methods are found for different size ranges
Best implementations vary from machine to machine
Summary
?Fast automatically generated code for FIR filters and DWT
? Significant contribution – no existing solutions
?Preliminary results
? Arithmetic cost not a reliable predictor of performance
? Best implementation not obvious for a human programmer
? Best implementations vary over different computer platforms
What lies ahead?
?Further algorithmic and implementation optimizations
?Fusion of the extension, exploiting in-place structure, scheduling,..
Carnegie Mellon
?Entire class of wavelets and extension methods
? Orthogonal, biorthogonal, frames,…
? Periodic, symmetric, polynomial extension, zero-padding
?Extend to other wavelet transforms
? 2D-DWT, M-band, wavelet packet transform
?Utilize special instruction set extensions
? Single Instruction Multiple Data (SIMD)
? Fused Multiply-Add (FMA)
Download