Spiral Search Space

In Search of the Optimal WHT Algorithm
J. R. Johnson
Drexel University
Markus Püschel
CMU
http://www.ece.cmu.edu/~spiral
Abstract
This presentation describes an approach to implementing and
optimizing fast signal transforms. Algorithms for computing signal
transforms are expressed as symbolic expressions, which can be
automatically generated and translated into programs. Optimizing an
implementation involves searching for the fastest program obtained from
one of the possible expressions. We apply this methodology to the
implementation of the Walsh-Hadamard transform (WHT). An environment,
accessible from MATLAB, is provided for generating and timing WHT
algorithms. These tools are used to search for the fastest WHT algorithm.
The fastest algorithm found is substantially faster than standard
approaches to implementing the WHT.
The work reported in this paper is part of the SPIRAL project (see
http://www.ece.cmu.edu/~spiral), an ongoing project whose goal is to
automate the implementation and optimization of signal processing
algorithms.
The Walsh-Hadamard Transform

$$\mathrm{WHT}_{2^n} = \underbrace{F_2 \otimes F_2 \otimes \cdots \otimes F_2}_{n\text{-fold}},
\qquad
F_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$$

"iterative":
$$\mathrm{WHT}_{2^n} = \prod_{i=1}^{n} \left( I_{2^{i-1}} \otimes F_2 \otimes I_{2^{n-i}} \right)$$

"recursive":
$$\mathrm{WHT}_{2^n} = \left( F_2 \otimes I_{2^{n-1}} \right)\left( I_2 \otimes \mathrm{WHT}_{2^{n-1}} \right)$$
Why WHT?
• Easy structure
• Contains the important tensor-product construct
• Close to the 2-power FFT
Influence of Cache Sizes
(Walsh-Hadamard Transform)

[Figure: relative runtime and runtime quotients runtime(k)/runtime(k-1) of the iterative vs. recursive WHT for k = 0…20; the L1 and L2 cache boundaries are marked in both plots]

|signal| <= |L1|: iterative faster (less overhead)
|signal| > |L1|: recursive faster (fewer cache misses)

Pentium II, Linux
Increased Locality
(Grid Algorithm)

$$\mathrm{WHT}_{2^n} = (F_2 \otimes F_2) \otimes \mathrm{WHT}_{2^{n-2}}
= \big( (F_2 \otimes F_2) \otimes I_{2^{n-2}} \big)\big( I_4 \otimes \mathrm{WHT}_{2^{n-2}} \big)$$

The first factor expands as $(F_2 \otimes I_{2^{n-1}})(I_2 \otimes F_2 \otimes I_{2^{n-2}})$; the second factor, $I_4 \otimes \mathrm{WHT}_{2^{n-2}}$, is computed recursively.

+ Local access pattern
+ Can be generalized to arbitrary tensor products
- Conflict cache misses due to 2-power strides
  (if the cache is not fully associative)
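The factorization is easy to verify numerically; a quick dense NumPy check for n = 4 (so that WHT_{2^{n-2}} = F2 ⊗ F2 and I_{2^{n-2}} = I_4):

```python
import numpy as np

F2 = np.array([[1.0, 1.0], [1.0, -1.0]])

# WHT_{2^{n-2}} for n = 4
W = np.kron(F2, F2)

# (F2 (x) F2) (x) W  ==  ((F2 (x) F2) (x) I_4) (I_4 (x) W)
lhs = np.kron(np.kron(F2, F2), W)
rhs = np.kron(np.kron(F2, F2), np.eye(4)) @ np.kron(np.eye(4), W)
assert np.allclose(lhs, rhs)
```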
Runtime/L1-DCache Misses

[Figure: relative runtime and relative L1 data-cache misses of the recursive, iterative, mixed, and grid 4-step algorithms for k = 0…20]

[Figure: relative runtime and relative cache misses of grid 4-step vs. grid scrambling for k = 0…20]

grid 4-step vs. grid scrambling (dynamic data redistribution)
Effect of Unrolling
(iterative: loops vs. unrolled)

[Figure: relative runtime, number of L1 instruction-cache misses, and number of L2 cache misses for k = 1…9]

Conclusion: compose small, unrolled building blocks.

Pentium II, Linux
Class of WHT Algorithms

Let $N = N_1 \cdots N_t$, $N_i = 2^{n_i}$. Then

$$\mathrm{WHT}_N = \prod_{i=1}^{t} \left( I_{2^{n_1 + \cdots + n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}} \otimes I_{2^{n_{i+1} + \cdots + n_t}} \right)$$

which translates into the triple loop

R = N; S = 1;
for i = 1,…,t
    R = R/Ni;
    for j = 0,…,R-1
        for k = 0,…,S-1
            x(jNiS+k; S; jNiS+k+(Ni-1)S) =
                WHT_Ni * x(jNiS+k; S; jNiS+k+(Ni-1)S);
    S = S*Ni;
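The triple loop can be sketched directly in NumPy as a naive reference version (names are ours, not the package's; `wht_small` builds the small transforms densely, where a real implementation would use unrolled code):

```python
import numpy as np

def wht_small(ni):
    # dense WHT_{2^{ni}} as the ni-fold Kronecker product of F2
    F2 = np.array([[1.0, 1.0], [1.0, -1.0]])
    W = np.array([[1.0]])
    for _ in range(ni):
        W = np.kron(W, F2)
    return W

def wht_split(x, ns):
    # evaluate WHT_N, N = 2^(n1+...+nt), by the triple loop above:
    # factor i applies WHT_{N_i} at stride S to R*S segments of x
    N = len(x)
    R, S = N, 1
    for ni in ns:
        Ni = 2 ** ni
        W = wht_small(ni)
        R //= Ni
        for j in range(R):
            for k in range(S):
                # stride-S segment starting at j*Ni*S + k
                idx = j * Ni * S + k + S * np.arange(Ni)
                x[idx] = W @ x[idx]
        S *= Ni
    return x
```

Any partition of n gives the same transform, e.g. `wht_split(x, [1, 2])` and `wht_split(x, [1, 1, 1])` agree with the dense WHT of size 8.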
Partition Trees

• Each WHT algorithm can be represented by a tree, where a node labeled by n corresponds to WHT_{2^n}.

[Figure: three partition trees with root 4 — iterative: root 4 with leaf children 1, 1, 1, 1; recursive: 4 splits as 1 + 3, 3 as 1 + 2, 2 as 1 + 1; mixed: a tree combining both patterns]
Search Space

• Optimization of the WHT becomes a search, over the space of partition trees, for the fastest algorithm.
• The number of trees satisfies $T_1 = 1$ and, for $n > 1$,

$$T_n = 1 + \sum_{\substack{n_1 + \cdots + n_t = n \\ t > 1}} T_{n_1} \cdots T_{n_t}$$

(the leading 1 counts the unsplit node, computed directly).

n    1  2  3  4   5    6    7     8
T_n  1  2  6  24  112  568  3032  16768
Size of Search Space

• Let $T(z)$ be the generating function for $T_n$:

$$T(z) = \frac{z}{1-z} + \frac{T(z)^2}{1 - T(z)}, \qquad
T_n = \Theta(\alpha^n / n^{3/2}), \quad \alpha = 4 + \sqrt{8} \approx 6.8284$$

• Restricting to binary trees:

$$B(z) = \frac{z}{1-z} + B(z)^2, \qquad
B_n = \Theta(5^n / n^{3/2})$$
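The count recurrence is easy to tabulate. A short sketch (names ours) that reproduces the table of $T_n$ above:

```python
def count_trees(nmax):
    """T[n] = number of WHT partition trees of size n: one unsplit
    node plus all splits (t >= 2) into subtrees.  A[m] counts
    sequences (t >= 1) of subtree sizes summing to m."""
    T, A = {1: 1}, {1: 1}
    for n in range(2, nmax + 1):
        # splits: first subtree of size k, remainder any sequence
        B = sum(T[k] * A[n - k] for k in range(1, n))
        T[n] = 1 + B          # '+1' for the unsplit node
        A[n] = T[n] + B
    return T

print(count_trees(8))  # {1: 1, 2: 2, 3: 6, 4: 24, 5: 112, 6: 568, 7: 3032, 8: 16768}
```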
WHT Package
• Uses a simple grammar for describing different
WHT algorithms (allows for unrolled code and
direct computation for verification)
• WHT expressions are parsed and a data structure
representing the algorithm (partition tree with
control information) is created
• Evaluator computes WHT using tree to control the
algorithm
• MATLAB interface allows experimentation
WHT Package

WHT(n) ::= direct[n]
         | small[n]
         | split[WHT(n1),…,WHT(nt)]    # n1 + … + nt = n

• Iterative:
  split[small[1],small[1],small[1],small[1]]
• Recursive:
  split[small[1],split[small[1],split[small[1],small[1]]]]
• Grid 4-step:
  split[small[2],split[small[2],W(n-4)]]
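To illustrate how such expressions become partition-tree data structures, here is a toy recursive-descent parser for this grammar (our sketch, not the package's actual parser):

```python
def parse_wht(s):
    """Parse 'direct[n]', 'small[n]', or 'split[...]' into nested
    tuples, e.g. ('split', [('small', 1), ('small', 3)])."""
    s = s.replace(' ', '')
    tree, end = _expr(s, 0)
    assert end == len(s), 'trailing input'
    return tree

def _expr(s, i):
    j = s.index('[', i)
    head, i = s[i:j], j + 1
    if head in ('direct', 'small'):        # leaf: a size
        j = s.index(']', i)
        return (head, int(s[i:j])), j + 1
    children = []                          # split: child list
    while True:
        node, i = _expr(s, i)
        children.append(node)
        if s[i] == ',':
            i += 1
        else:                              # closing ']'
            return (head, children), i + 1
```

For example, `parse_wht("split[small[1],split[small[1],small[2]]]")` yields `('split', [('small', 1), ('split', [('small', 1), ('small', 2)])])`.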
Code Generation Strategies

• Recursive vs. iterative data flow (improves register allocation)
• Additional temporary variables to prevent dependencies (aids the C compiler)

[Figure: WHT(2^n), relative runtime of different unrolled code — iterative and recursive data flow, each with 2 or many temporary variables, for n = 1…8]
Dynamic Programming

• Assume the optimal WHT depends only on size, and not on stride parameters or state such as the cache. Then dynamic programming can be used to search for the optimal WHT.
• Consider all possible splits of size n and assume the previously determined optimal algorithm is used for the recursive evaluations.
• There are 2^(n-1) possible splits for W(n) and n-1 possible binary splits.
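Under that assumption, the DP over the n-1 binary splits is only a few lines. In this sketch `time_of` is a hypothetical measurement hook standing in for the package's timer, and `toy_time` is a purely illustrative cost model used only to exercise the search:

```python
def dp_search_binary(nmax, time_of):
    # best[n] = (tree, time) over the unsplit node and the n-1
    # binary splits, reusing the previously found optimal subtrees
    best = {}
    for n in range(1, nmax + 1):
        cands = [('small', n)]
        for k in range(1, n):
            cands.append(('split', [best[k][0], best[n - k][0]]))
        best[n] = min(((t, time_of(t)) for t in cands),
                      key=lambda c: c[1])
    return best

def toy_time(tree):
    # toy cost model (not a real timer): unrolled code is cheap up
    # to size 5, larger unsplit nodes are penalized, splits pay a
    # small overhead on top of their children's costs
    if tree[0] == 'small':
        n = tree[1]
        return n * 2 ** n if n <= 5 else 10 * n * 2 ** n
    return 1 + sum(toy_time(c) for c in tree[1])
```

A real search would replace `toy_time` with actual measured runtimes of the generated programs.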
Generating Splits

• Bijection between splits of W(n) and (n-1)-bit numbers:

000 ↔ 1111    = [4]
001 ↔ 111|1   = [3,1]
010 ↔ 11|11   = [2,2]
011 ↔ 11|1|1  = [2,1,1]
100 ↔ 1|111   = [1,3]
101 ↔ 1|11|1  = [1,2,1]
110 ↔ 1|1|11  = [1,1,2]
111 ↔ 1|1|1|1 = [1,1,1,1]
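The bijection reads off directly: each of the n-1 bits says whether a bar is placed after that position. A sketch (function name ours):

```python
def splits(n):
    # yield all 2^(n-1) splits of W(n); bit i of `bits`, read from
    # the most significant end, places a bar after position i + 1
    for bits in range(2 ** (n - 1)):
        parts, run = [], 1
        for i in range(n - 1):
            if (bits >> (n - 2 - i)) & 1:
                parts.append(run)   # bar: close the current part
                run = 1
            else:
                run += 1            # no bar: extend the part
        parts.append(run)
        yield parts

print(list(splits(4)))
# [[4], [3, 1], [2, 2], [2, 1, 1], [1, 3], [1, 2, 1], [1, 1, 2], [1, 1, 1, 1]]
```

The output order matches the table above, with the (n-1)-bit number counting up from 0.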
Sun Distribution
[Figure: distribution of runtimes over the space of WHT algorithms, 400 MHz UltraSPARC II]

Pentium Distribution
[Figure: distribution of runtimes over the space of WHT algorithms, 233 MHz Pentium II (Linux)]
Optimal Formulas
Pentium
[1], [2], [3], [4], [5], [6]
[7]
[[4],[4]]
[[5],[4]]
[[5],[5]]
[[5],[6]]
[[2],[[5],[5]]]
[[2],[[5],[6]]]
[[2],[[2],[[5],[5]]]]
[[2],[[2],[[5],[6]]]]
[[2],[[2],[[2],[[5],[5]]]]]
[[2],[[2],[[2],[[5],[6]]]]]
[[2],[[2],[[2],[[2],[[5],[5]]]]]]
[[2],[[2],[[2],[[2],[[5],[6]]]]]]
[[2],[[2],[[2],[[2],[[2],[[5],[5]]]]]]]
UltraSPARC
[1], [2], [3], [4], [5], [6]
[[3],[4]]
[[4],[4]]
[[4],[5]]
[[5],[5]]
[[5],[6]]
[[4],[[4],[4]]]
[[[4],[5]],[4]]
[[4],[[5],[5]]]
[[[5],[5]],[5]]
[[[5],[5]],[6]]
[[4],[[[4],[5]],[4]]]
[[4],[[4],[[5],[5]]]]
[[4],[[[5],[5]],[5]]]
[[5],[[[5],[5]],[5]]]
Different Strides

• The dynamic programming assumption is not true: execution time depends on the stride.

[Figure: runtime relative to the stride-1 runtime for W(1)…W(5), plotted against stride 0…20]
Download