Distributed WHT Algorithms

Kang Chen, Jeremy Johnson
Computer Science, Drexel University

Franz Franchetti
Electrical and Computer Engineering, Carnegie Mellon University

http://www.spiral.net

Sponsors: Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting.

Objective
- Generate high-performance implementations of linear computations (signal transforms) from mathematical descriptions.
- Explore alternative implementations and optimize using formula generation, manipulation, and search.
- Prototype transform: the WHT.
  - Build on the existing sequential package.
  - SMP implementation using OpenMP.
  - Distributed-memory implementation using MPI.
- The sequential package was presented at ICASSP'00 and '01; the OpenMP extension was presented at IPDPS'02.

Incorporate into SPIRAL
- Automatic performance tuning for DSP transforms.
- CMU: J. Hoe, J. Moura, M. Püschel, M. Veloso
- Drexel: J. Johnson
- UIUC: D. Padua
- R. W. Johnson
- www.spiral.net

Outline
- Introduction
- Bit permutations
- Distributed WHT algorithms
- Theoretical results
- Experimental results

Walsh-Hadamard Transform
- y = WHT_N x, where x is a signal of size N = 2^n.
- WHT_{2^n} = WHT_2 ⊗ WHT_2 ⊗ ... ⊗ WHT_2 (n factors), where ⊗ denotes the tensor (Kronecker) product and

    WHT_2 = [ 1  1 ]
            [ 1 -1 ]

- For example,

    WHT_4 = WHT_2 ⊗ WHT_2 = [ 1  1  1  1 ]
                            [ 1 -1  1 -1 ]
                            [ 1  1 -1 -1 ]
                            [ 1 -1 -1  1 ]

- Fast WHT algorithms are obtained by factoring the WHT matrix:

    WHT_{2^n} = \prod_{i=1}^{t} ( I_{2^{n_1 + ... + n_{i-1}}} ⊗ WHT_{2^{n_i}} ⊗ I_{2^{n_{i+1} + ... + n_t}} ),   n = n_1 + ... + n_t.

SPIRAL WHT Package
- All WHT algorithms have the same arithmetic cost O(N log N) but different data access patterns and varying amounts of recursion and iteration.
- Small transforms (sizes 2^1 to 2^8) are implemented with straight-line code to reduce overhead.
- The WHT package allows exploration of the O(7^n) different algorithms and implementations using a simple grammar.
- Optimization/adaptation to an architecture is performed by searching for the fastest algorithm:
  - Dynamic programming (DP)
  - Evolutionary algorithm (STEER)
- Johnson and Püschel: ICASSP 2000.
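Each partition tree in this O(7^n) algorithm space corresponds to a nesting of the factorization above. For reference, the following is a minimal C sketch (illustrative only, not SPIRAL-generated code) of the simplest point in that space, the fully recursive binary split WHT_N = (WHT_2 ⊗ I_{N/2})(I_2 ⊗ WHT_{N/2}); function and variable names are ours, and the package instead chooses among all split trees and unrolls small sizes.

#include <stddef.h>
#include <stdio.h>

/* In-place WHT of x[0..n-1], n a power of two, via the binary split
   WHT_n = (WHT_2 (x) I_{n/2}) (I_2 (x) WHT_{n/2}). Illustrative sketch only. */
static void wht(double *x, size_t n)
{
    if (n == 1)
        return;
    size_t half = n / 2;
    wht(x, half);                          /* I_2 (x) WHT_{n/2}: transform each half */
    wht(x + half, half);
    for (size_t i = 0; i < half; i++) {    /* WHT_2 (x) I_{n/2}: butterflies at stride n/2 */
        double a = x[i], b = x[i + half];
        x[i]        = a + b;
        x[i + half] = a - b;
    }
}

int main(void)
{
    double x[8] = { 1, 0, 0, 0, 0, 0, 0, 0 };  /* unit impulse */
    wht(x, 8);                                  /* WHT_8 of the impulse is all ones */
    for (int i = 0; i < 8; i++)
        printf("%g ", x[i]);
    printf("\n");
    return 0;
}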
Performance of WHT Algorithms (II)
- Random algorithms for WHT_{2^16} were generated automatically using SPIRAL.
- Their only difference is the order of the arithmetic instructions, yet their runtimes differ by a factor of 5.

Architecture Dependency
- The best WHT algorithm also depends on the architecture: memory hierarchy, cache structure, cache miss penalty, etc.
- [Figure: best WHT_{2^22} partition trees found on UltraSPARC II v9, POWER3 II, and PowerPC RS64 III; legend: a DDL split node, an IL=1 straight-line WHT_{32} node.]

Bit Permutations
- Definition: let π be a permutation of {0, 1, ..., n-1} and let (b_{n-1} ... b_1 b_0) be the binary representation of an index 0 <= i < 2^n. The bit permutation P_π is the permutation of {0, 1, ..., 2^n - 1} defined by

    (b_{n-1} ... b_1 b_0) -> (b_{π(n-1)} ... b_{π(1)} b_{π(0)}).

- Distributed interpretation: P = 2^p processors, block cyclic data distribution. The leading p bits of an index are the processor id (pid) and the trailing n-p bits are the local offset:

    pid            | offset                        pid              | offset
    (b_{n-1} ... b_{n-p} | b_{n-p-1} ... b_1 b_0) -> (b_{π(n-1)} ... b_{π(n-p)} | b_{π(n-p-1)} ... b_{π(1)} b_{π(0)})

Stride Permutation
- L^8_2 writes consecutive elements at stride 4 (= 8/2); on the bits it acts as the cyclic shift (b_2 b_1 b_0) -> (b_0 b_2 b_1):

    000 -> 000, 001 -> 100, 010 -> 001, 011 -> 101,
    100 -> 010, 101 -> 110, 110 -> 011, 111 -> 111.

- In matrix form, L^8_2 maps (x_0, x_1, ..., x_7) to (x_0, x_2, x_4, x_6, x_1, x_3, x_5, x_7).

Distributed Stride Permutation
- L^64_2 on 8 processors (3 pid bits | 3 offset bits). Communication rules per processor (source pid|offset pattern -> destination pid|offset pattern, * = any bit), combining a processor address mapping with a local address mapping:

    #0: 000|**0 -> 000|0**,  000|**1 -> 100|0**
    #1: 001|**0 -> 000|1**,  001|**1 -> 100|1**
    #2: 010|**0 -> 001|0**,  010|**1 -> 101|0**
    #3: 011|**0 -> 001|1**,  011|**1 -> 101|1**
    #4: 100|**0 -> 010|0**,  100|**1 -> 110|0**
    #5: 101|**0 -> 010|1**,  101|**1 -> 110|1**
    #6: 110|**0 -> 011|0**,  110|**1 -> 111|0**
    #7: 111|**0 -> 011|1**,  111|**1 -> 111|1**

Communication Pattern of L^64_2
- Each PE sends 1/2 of its data to 2 different PEs.
- [Figure: transfers drawn on a ring of the 8 PEs; e.g., PE 0 keeps X(0:2:6) in Y(0:1:3) and sends the other half X(1:2:7) away.]
- The pattern looks nicely regular at first sight, but it is highly irregular.

Communication Pattern of L^64_4
- Each PE sends 1/4 of its data to 4 different PEs.
- The irregularity gets worse for larger stride parameters of L.

Multi-Swap Permutation
- M^8_2 also writes at stride 4, but it exchanges data pairwise; on the bits it acts as the transposition (b_2 b_1 b_0) -> (b_0 b_1 b_2):

    000 -> 000, 001 -> 100, 010 -> 010, 011 -> 110,
    100 -> 001, 101 -> 101, 110 -> 011, 111 -> 111.

- In matrix form, M^8_2 maps (x_0, x_1, ..., x_7) to (x_0, x_4, x_2, x_6, x_1, x_5, x_3, x_7).

Communication Pattern of M^64_2
- Each PE exchanges 1/2 of its data with exactly one other PE (4 all-to-all exchanges of size 2).
- [Figure: the 8 PEs paired on a ring; the labels X(0:2:6) and X(1:2:7) mark the exchanged halves.]

Communication Pattern of M^64_4
- Each PE sends 1/4 of its data to 4 different PEs (2 all-to-all exchanges of size 4).

Communication Scheduling
- The all-to-all exchanges are scheduled with a Latin square built by a simple recursive construction from order-2 squares.
- In step s, processor r communicates with the processor listed in row r, column s of the square.
- Only point-to-point communication is used.
- [Figure: 8 x 8 Latin square used for the schedule on 8 processors.]
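A concrete way to realize such a schedule with point-to-point primitives is sketched below. It is illustrative only and is not the package's actual code: it assumes the XOR square L(r, s) = r xor s (one simple recursive Latin-square construction), which decomposes the all-to-all of P = 2^p equal blocks into P-1 disjoint pairwise exchanges, each carried out with a single MPI_Sendrecv.

#include <mpi.h>
#include <string.h>

/* Hedged sketch: all-to-all exchange of equal-sized blocks, scheduled by the
   Latin square L(rank, step) = rank ^ step, so that every step is a disjoint
   pairwise exchange realized with point-to-point MPI_Sendrecv. */
void alltoall_pairwise(const double *sendbuf, double *recvbuf,
                       int blocklen, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);          /* assumed to be a power of two */

    /* step 0: keep the block destined for this rank */
    memcpy(recvbuf + (size_t)rank * blocklen,
           sendbuf + (size_t)rank * blocklen,
           (size_t)blocklen * sizeof(double));

    for (int step = 1; step < nprocs; step++) {
        int partner = rank ^ step;         /* Latin-square entry for this step */
        MPI_Sendrecv(sendbuf + (size_t)partner * blocklen, blocklen, MPI_DOUBLE,
                     partner, 0,
                     recvbuf + (size_t)partner * blocklen, blocklen, MPI_DOUBLE,
                     partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}

Because rank ^ step is an involution, the send and receive of every step match up into independent pairs, which is exactly the property the Latin-square schedule is meant to provide.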
Parallel WHT Package
- The WHT partition tree is parallelized at the root node.
  - SMP implementation obtained using OpenMP.
  - Distributed-memory implementation using MPI.
- Dynamic programming decides when to use parallelism:
  - DP chooses the best parallel root node.
  - DP builds the partition with the best sequential subtrees.
- Sequential WHT package: Johnson and Püschel, ICASSP 2000 and ICASSP 2001.
- Dynamic data layout: N. Park and V. K. Prasanna, ICASSP 2001.
- OpenMP SMP version: K. Chen and J. Johnson, IPDPS 2002.

Distributed Memory WHT Algorithms
- A distributed split node, d_split, is used as the root node:
  - the data are distributed equally among the processors,
  - a distributed stride permutation exchanges the data (different sequences of permutations are possible),
  - the WHT is applied in parallel to the local data.
- Sequential algorithm:

    WHT_N = \prod_{i=1}^{t} ( I_{N_1 ... N_{i-1}} ⊗ WHT_{N_i} ⊗ I_{N_{i+1} ... N_t} )

- Pease dataflow (stride permutations, parallel local WHTs):

    WHT_N = \prod_{i=1}^{t} L^N_{N_i} ( I_{N/N_i} ⊗ WHT_{N_i} )

- General dataflow (bit permutations P_i, parallel local WHTs):

    WHT_N = \prod_{i=1}^{t} P_i ( I_{N/N_i} ⊗ WHT_{N_i} )

Theoretical Results
- Problem statement: find a sequence of permutations that minimizes communication and congestion.
- Pease dataflow: total bandwidth = N log(N) (1 - 1/P).
- Conjectured optimal: total bandwidth = (N/2) log(P) + N (1 - 1/P).
- The optimal dataflow uses independent pairwise exchanges (except for the last permutation).

Pease Dataflow

    WHT_16 = L^{16}_2 (I_8 ⊗ WHT_2) · L^{16}_2 (I_8 ⊗ WHT_2) · L^{16}_2 (I_8 ⊗ WHT_2) · L^{16}_2 (I_8 ⊗ WHT_2)

Theoretically Optimal Dataflow

    WHT_16 = L^{16}_2 (I_8 ⊗ WHT_2) · P^{16}_{(0,3)} (I_8 ⊗ WHT_2) · P^{16}_{(0,2)} (I_8 ⊗ WHT_2) · P^{16}_{(0,1)} (I_8 ⊗ WHT_2),

  where P^{16}_{(0,k)} is the bit permutation that exchanges bit 0 with bit k (a pairwise exchange).

Experimental Results
- Platform: 32 Pentium III processors (450 MHz), 512 MB 8 ns PCI-100 memory, and two SMC 100 Mbps fast Ethernet cards.
- The distributed WHT package is implemented using MPI.
- Experiments: all-to-all implementations; distributed stride vs. multi-swap permutations; distributed WHT.

All-to-All
- Three different implementations of the all-to-all permutation were compared; point-to-point is the fastest. Runtimes (sec) by log2 of the local data size:

    log2(local N)   Point-to-point   Three-way   MPI_Alltoall
    25              45.39            63.17       140.50
    24              22.67            29.99        78.70
    23              11.33            15.02        44.23
    22               5.76             7.72        20.89
    21               2.93             3.85         8.44
    20               1.62             2.12         5.15
    19               0.68             1.12         2.64

Stride vs. Multi-Swap
- [Figure: runtime (sec) of the distributed stride permutation vs. the multi-swap permutation as a function of ld N.]

Distributed WHT_{2^30}
- [Figure: runtime (sec) of d_split with the stride permutation vs. d_split with the multi-swap permutation as a function of the root split size n_1.]
- The two variants compared are

    L^{2^n}_{2^{n_1}} ( I_{2^{n-n_1}} ⊗ WHT_{2^{n_1}} ) · L^{2^n}_{2^{n-n_1}} ( I_{2^{n_1}} ⊗ WHT_{2^{n-n_1}} )

  vs.

    M^{2^n}_{2^{n_1}} ( I_{2^{n-n_1}} ⊗ WHT_{2^{n_1}} ) · L^{2^n}_{2^{n-n_1}} ( I_{2^{n_1}} ⊗ WHT_{2^{n-n_1}} )

Summary
- A self-adapting WHT package.
- The distributed WHT is optimized over different communication patterns and combinations of sequential code.
- Point-to-point primitives are used for the all-to-all.
- Ongoing work: lower bounds, use of a high-speed interconnect, generalization to other transforms, incorporation into SPIRAL (http://www.spiral.net).
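To connect the formulas with the communication primitives, the following is a minimal, hypothetical sketch, not the package's d_split: it evaluates the distributed factorization WHT_N = (WHT_P ⊗ I_{N/P})(I_P ⊗ WHT_{N/P}) for P = 2^p MPI processes, computing the cross-processor factor with log2(P) pairwise MPI_Sendrecv exchanges. The names dist_wht and wht are ours (wht is the sequential sketch shown earlier). It only illustrates the structure of local WHTs plus point-to-point exchanges; exchanging whole blocks every round does not match the conjectured-optimal bandwidth.

#include <mpi.h>
#include <stdlib.h>

void wht(double *x, size_t n);   /* sequential sketch from the earlier example */

/* Hedged sketch of a distributed WHT: each of the P = 2^p processes owns the
   contiguous block x[pid*M .. pid*M + M - 1], M = N/P (leading bits = pid). */
void dist_wht(double *local, size_t m, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);                /* assumed power of two */

    wht(local, m);                               /* I_P (x) WHT_M: purely local */

    double *recv = malloc(m * sizeof(double));
    for (int bit = 1; bit < nprocs; bit <<= 1) { /* WHT_P (x) I_M: p rounds */
        int partner = rank ^ bit;                /* pairwise exchange partner */
        MPI_Sendrecv(local, (int)m, MPI_DOUBLE, partner, 0,
                     recv,  (int)m, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        for (size_t j = 0; j < m; j++)           /* butterfly across the pair */
            local[j] = (rank & bit) ? recv[j] - local[j] : local[j] + recv[j];
    }
    free(recv);
}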