In Search of the Optimal WHT Algorithm

J. R. Johnson, Drexel University
Markus Püschel, CMU
http://www.ece.cmu.edu/~spiral

Abstract

This presentation describes an approach to implementing and optimizing fast signal transforms. Algorithms for computing signal transforms are expressed as symbolic expressions, which can be automatically generated and translated into programs. Optimizing an implementation involves searching for the fastest program obtained from one of the possible expressions. We apply this methodology to the implementation of the Walsh-Hadamard transform (WHT). An environment, accessible from MATLAB, is provided for generating and timing WHT algorithms. These tools are used to search for the fastest WHT algorithm. The fastest algorithm found is substantially faster than standard approaches to implementing the WHT. The work reported in this paper is part of the SPIRAL project (see http://www.ece.cmu.edu/~spiral), an ongoing project whose goal is to automate the implementation and optimization of signal processing algorithms.

The Walsh-Hadamard Transform

WHT_{2^n} = F_2 \otimes F_2 \otimes \cdots \otimes F_2  (n-fold),  where  F_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}

"iterative":  WHT_{2^n} = \prod_{i=1}^{n} ( I_{2^{i-1}} \otimes F_2 \otimes I_{2^{n-i}} )

"recursive":  WHT_{2^n} = ( F_2 \otimes I_{2^{n-1}} ) ( I_2 \otimes WHT_{2^{n-1}} )

Why WHT?
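The two factorizations above can be sketched directly in code. The following is a minimal Python sketch (function names are mine, not the WHT package's): the iterative variant sweeps n butterfly stages over the vector in place, one per factor of the product, while the recursive variant transforms the two halves and then combines them with a single butterfly stage.

```python
def wht_iterative(x):
    """WHT_{2^n} as the product of n stages I_{2^{i-1}} (x) F_2 (x) I_{2^{n-i}}."""
    x = list(x)
    n = len(x)                     # must be a power of two
    stride = 1
    while stride < n:              # one in-place pass per factor
        for base in range(0, n, 2 * stride):
            for k in range(base, base + stride):
                a, b = x[k], x[k + stride]
                x[k], x[k + stride] = a + b, a - b   # 2-point butterfly F_2
        stride *= 2
    return x

def wht_recursive(x):
    """WHT_{2^n} = (F_2 (x) I_{2^{n-1}}) (I_2 (x) WHT_{2^{n-1}})."""
    if len(x) == 1:
        return list(x)
    half = len(x) // 2
    lo = wht_recursive(x[:half])   # I_2 (x) WHT_{2^{n-1}}: transform each half
    hi = wht_recursive(x[half:])
    # F_2 (x) I_{2^{n-1}}: butterfly across the two halves
    return [a + b for a, b in zip(lo, hi)] + [a - b for a, b in zip(lo, hi)]
```

Both compute the same transform; for example, both map [1, 2, 3, 4] to [10, -2, -4, 0], which is WHT_4 applied to that vector. They differ only in data flow, which is exactly what the cache measurements below distinguish.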
• Easy structure
• Contains important constructs
• Close to the 2-power FFT

Influence of Cache Sizes (Walsh-Hadamard Transform)

[Figure: relative runtimes and runtime quotients runtime(k)/runtime(k-1) of the iterative and recursive WHT for k = 0, …, 20, with the L1 and L2 cache boundaries marked. Pentium II, Linux.]

• |signal| <= |L1|: iterative is faster (less overhead)
• |signal| > |L1|: recursive is faster (fewer cache misses)

Increased Locality (Grid Algorithm)

WHT_{2^n} = (F_2 \otimes F_2) \otimes WHT_{2^{n-2}} = ( (F_2 \otimes F_2) \otimes I_{2^{n-2}} ) ( I_4 \otimes WHT_{2^{n-2}} )

Instead of the two separate stages F_2 \otimes I_{2^{n-1}} and I_2 \otimes F_2 \otimes I_{2^{n-2}}, the fused factor (F_2 \otimes F_2) \otimes I_{2^{n-2}} is computed.

+ Local access pattern
+ Can be generalized to arbitrary tensor products
- Conflict cache misses due to 2-power stride (if the cache is not fully associative)

Runtime/L1-DCache Misses

[Figure: relative runtime and relative L1 data-cache misses of the recursive, iterative, mixed, grid, and 4-step algorithms for k = 0, …, 20; second row: grid 4-step vs. grid scrambling (dynamic data redistribution).]

Effect of Unrolling

[Figure: relative runtime and L1 instruction-cache misses of unrolled code for k = 1, …, 9; L2 cache misses of the iterative algorithm, loops vs. unrolled. Pentium II, Linux.]

• Compose small, unrolled building blocks

Class of WHT Algorithms

Let N = N_1 \cdots N_t with N_i = 2^{n_i}. Then

WHT_N = \prod_{i=1}^{t} ( I_{2^{n_1 + \cdots + n_{i-1}}} \otimes WHT_{2^{n_i}} \otimes I_{2^{n_{i+1} + \cdots + n_t}} )

which is computed by

R = N; S = 1;
for i = 1, …, t
  R = R/N_i;
  for j = 0, …, R-1
    for k = 0, …, S-1
      x(jN_iS + k : S : jN_iS + k + (N_i - 1)S) = WHT_{N_i} * x(jN_iS + k : S : jN_iS + k + (N_i - 1)S);
  S = S*N_i;

Partition Trees

• Each WHT algorithm can be represented by a tree, where a node labeled n corresponds to WHT_{2^n}.

[Figure: partition trees of 4 for the iterative, recursive, and a mixed algorithm.]

Search Space

• Optimization of the WHT becomes a search, over the space of partition trees, for the fastest algorithm.
• The number of trees satisfies

T_1 = 1,  T_n = 1 + \sum_{n_1 + \cdots + n_t = n, \; t \ge 2} T_{n_1} \cdots T_{n_t}

n:   1  2  3  4   5    6    7     8
T_n: 1  2  6  24  112  568  3032  16768

Size of Search Space

• Let T(z) be the generating function for T_n:

T(z) = z/(1 - z) + T(z)^2/(1 - T(z))

T_n = \Theta(\alpha^n / n^{3/2}),  where  \alpha = 4 + \sqrt{8} \approx 6.8284

• Restricting to binary trees:

B(z) = z/(1 - z) + B(z)^2,  T_n = \Theta(5^n / n^{3/2})

WHT Package

• Uses a simple grammar for describing different WHT algorithms (allows for unrolled code and direct computation for verification)
• WHT expressions are parsed, and a data structure representing the algorithm (a partition tree with control information) is created
• The evaluator computes the WHT, using the tree to control the algorithm
• A MATLAB interface allows experimentation

WHT(n) ::= direct[n] | small[n] | split[WHT(n1), …, WHT(nt)]   # n1 + … + nt = n

• Iterative: split[small[1], small[1], small[1], small[1]]
• Recursive: split[small[1], split[small[1], split[small[1], small[1]]]]
• Grid 4-step: split[small[2], split[small[2], W(n-4)]]

Code Generation Strategies

• Recursive vs. iterative data flow (improves register allocation)
• Additional temporaries to prevent dependencies (aids the C compiler)

[Figure: WHT(2^n), runtime of different unrolled code for n = 1, …, 8: iterative and recursive data flow, each with 2 or many temporary variables.]

Dynamic Programming

• Assume the optimal WHT depends only on the size, and not on stride parameters or state such as the cache. Then dynamic programming can be used to search for the optimal WHT.
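The triple-loop pseudocode above can be made concrete. Below is a minimal Python sketch (names are mine, not the WHT package's): wht_split evaluates a one-level split [n_1, …, n_t], applying a small WHT of size N_i = 2^{n_i} at the appropriate stride in each stage. Any correct small transform can be plugged in; here a plain recursive WHT is used.

```python
def wht_base(x):
    """Recursive WHT used for the small transforms inside each stage."""
    if len(x) == 1:
        return list(x)
    half = len(x) // 2
    lo, hi = wht_base(x[:half]), wht_base(x[half:])
    return [a + b for a, b in zip(lo, hi)] + [a - b for a, b in zip(lo, hi)]

def wht_split(x, comp):
    """Evaluate WHT_N via the split [n_1, ..., n_t], N = 2^(n_1+...+n_t)."""
    x = list(x)
    N = len(x)
    R, S = N, 1
    for ni in comp:
        Ni = 2 ** ni
        R //= Ni
        for j in range(R):
            for k in range(S):
                # stride-S segment of length Ni starting at j*Ni*S + k
                idx = [j * Ni * S + k + m * S for m in range(Ni)]
                seg = wht_base([x[t] for t in idx])
                for t, v in zip(idx, seg):
                    x[t] = v
        S *= Ni
    return x
```

Every split of the same n computes the same transform, so e.g. wht_split(x, [1, 2]), wht_split(x, [2, 1]), and wht_split(x, [3]) agree for any vector x of length 8; only the memory access pattern, and hence the runtime, differs.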
• Consider all possible splits of size n, and assume the previously determined optimal algorithms are used for the recursive evaluations.
• There are 2^{n-1} possible splits of W(n), and n-1 possible binary splits.

Generating Splits

• Bijection between splits of W(n) and (n-1)-bit numbers:

000 ↔ 1111    = [4]
001 ↔ 111|1   = [3,1]
010 ↔ 11|11   = [2,2]
011 ↔ 11|1|1  = [2,1,1]
100 ↔ 1|111   = [1,3]
101 ↔ 1|11|1  = [1,2,1]
110 ↔ 1|1|11  = [1,1,2]
111 ↔ 1|1|1|1 = [1,1,1,1]

Sun Distribution

[Figure: runtime distribution of WHT algorithms on a 400 MHz UltraSPARC II.]

Pentium Distribution

[Figure: runtime distribution of WHT algorithms on a 233 MHz Pentium II (Linux).]

Optimal Formulas

Pentium:
[1], [2], [3], [4], [5], [6], [7]
[[4],[4]]
[[5],[4]]
[[5],[5]]
[[5],[6]]
[[2],[[5],[5]]]
[[2],[[5],[6]]]
[[2],[[2],[[5],[5]]]]
[[2],[[2],[[5],[6]]]]
[[2],[[2],[[2],[[5],[5]]]]]
[[2],[[2],[[2],[[5],[6]]]]]
[[2],[[2],[[2],[[2],[[5],[5]]]]]]
[[2],[[2],[[2],[[2],[[5],[6]]]]]]
[[2],[[2],[[2],[[2],[[2],[[5],[5]]]]]]]

UltraSPARC:
[1], [2], [3], [4], [5], [6]
[[3],[4]]
[[4],[4]]
[[4],[5]]
[[5],[5]]
[[5],[6]]
[[4],[[4],[4]]]
[[[4],[5]],[4]]
[[4],[[5],[5]]]
[[[5],[5]],[5]]
[[[5],[5]],[6]]
[[4],[[[4],[5]],[4]]]
[[4],[[4],[[5],[5]]]]
[[4],[[[5],[5]],[5]]]
[[5],[[[5],[5]],[5]]]

Different Strides

• The dynamic programming assumption does not hold: execution time depends on the stride.

[Figure: runtime relative to stride 1 for W(1), …, W(5) as a function of stride, 0-20.]
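The bijection above maps each (n-1)-bit number to a split by placing a bar wherever a bit is set. A minimal Python sketch (function name is mine, not the package's): enumerating all 2^{n-1} splits this way is how a dynamic-programming search can iterate over the candidate splits of W(n).

```python
def splits(n):
    """All 2^(n-1) ordered splits of n, indexed by (n-1)-bit numbers.

    Reading the bits MSB-first, a set bit i means a bar between
    position i+1 and position i+2 of the n ones.
    """
    result = []
    for bits in range(2 ** (n - 1)):
        comp, part = [], 1
        for i in range(n - 1):
            if (bits >> (n - 2 - i)) & 1:   # bar here: close the current part
                comp.append(part)
                part = 1
            else:                           # no bar: current part grows
                part += 1
        comp.append(part)
        result.append(comp)
    return result
```

For n = 4 this reproduces the table above: splits(4)[0] is [4], splits(4)[1] is [3, 1], splits(4)[4] is [1, 3], and splits(4)[7] is [1, 1, 1, 1].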