SPIRAL: Automatic Implementation of Signal Processing Algorithms

Distributed WHT Algorithms
Kang Chen, Jeremy Johnson
Computer Science, Drexel University
Franz Franchetti
Electrical and Computer Engineering, Carnegie Mellon University
http://www.spiral.net
Sponsors
Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting.
Objective
• Generate high-performance implementations of linear computations (signal transforms) from mathematical descriptions
• Explore alternative implementations and optimize using formula generation, manipulation, and search
• Prototype implementation using the WHT
  – Build on the existing sequential package
  – SMP implementation using OpenMP
  – Distributed memory implementation using MPI
  – Sequential package presented at ICASSP '00 & '01; OpenMP extension presented at IPDPS '02
• Incorporate into SPIRAL
Automatic performance tuning for DSP transforms
CMU:
J. Hoe, J. Moura, M. Püschel, M. Veloso
Drexel: J. Johnson
UIUC: D. Padua
R. W. Johnson
www.spiral.net
Outline
• Introduction
• Bit permutations
• Distributed WHT algorithms
• Theoretical results
• Experimental results
Walsh-Hadamard Transform

$y = WHT_N \cdot x$, where $x$ is a signal of size $N = 2^n$.

$WHT_N = WHT_2 \otimes WHT_2 \otimes \cdots \otimes WHT_2$ ($n$ factors), where $\otimes$ is the tensor (Kronecker) product.

$WHT_4 = WHT_2 \otimes WHT_2
 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \otimes \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
 = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix}$

Fast WHT algorithms are obtained by factoring the WHT matrix:

$WHT_{2^n} = \prod_{i=1}^{t} \left( I_{2^{n_1 + \cdots + n_{i-1}}} \otimes WHT_{2^{n_i}} \otimes I_{2^{n_{i+1} + \cdots + n_t}} \right)$, where $n = n_1 + \cdots + n_t$.
SPIRAL WHT Package
• All WHT algorithms have the same arithmetic cost O(N log N) but different data access patterns and varying amounts of recursion and iteration
• Small transforms (sizes 2^1 to 2^8) are implemented with straight-line code to reduce overhead
• The WHT package allows exploration of the O(7^n) different algorithms and implementations using a simple grammar
• Optimization/adaptation to architectures is performed by searching for the fastest algorithm
  – Dynamic Programming (DP)
  – Evolutionary algorithm (STEER)
Johnson and Püschel: ICASSP 2000
Performance of WHT Algorithms (II)
• Automatically generate random algorithms for WHT of size 2^16 using SPIRAL
• Only difference: order of arithmetic instructions
[Figure: runtime distribution of the generated algorithms; runtimes spread by a factor of 5.]
Architecture Dependency
The best WHT algorithm also depends on the architecture:
• Memory hierarchy
• Cache structure
• Cache miss penalty
• …
[Figure: best WHT partition trees found on UltraSPARC II v9, POWER3 II, and PowerPC RS64 III; the trees differ substantially across platforms. Legend: a node labeled 2^22 is a DDL split node; a node labeled 2^5,(1) is an IL=1 straight-line WHT_32 node.]
Bit Permutations
Definition
• $\sigma$: a permutation of $\{0, 1, \ldots, n-1\}$
• $(b_{n-1} \ldots b_1 b_0)$: binary representation of $0 \le i < 2^n$
• $P_\sigma$: permutation of $\{0, 1, \ldots, 2^n - 1\}$ defined by
  $(b_{n-1} \ldots b_1 b_0) \mapsto (b_{\sigma(n-1)} \ldots b_{\sigma(1)} b_{\sigma(0)})$
Distributed interpretation
• $P = 2^p$ processors
• Block cyclic data distribution
• Leading $p$ bits are the pid
• Trailing $(n-p)$ bits are the local offset

(pid | offset) maps to (pid | offset):
$(b_{n-1} \ldots b_{n-p} \mid b_{n-p-1} \ldots b_1 b_0) \mapsto (b_{\sigma(n-1)} \ldots b_{\sigma(n-p)} \mid b_{\sigma(n-p-1)} \ldots b_{\sigma(1)} b_{\sigma(0)})$
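As a concrete illustration, here is a minimal C sketch (my own, not part of the WHT package) that applies $P_\sigma$ to an index and splits the result into pid and local offset; the function name is hypothetical.

    #include <stdio.h>

    /* Apply P_sigma: bit j of the output index is bit sigma[j] of the
       input index, matching the slides' rule
       (b_{n-1} ... b_0) -> (b_{sigma(n-1)} ... b_{sigma(0)}). */
    unsigned apply_bit_perm(unsigned i, const int *sigma, int n) {
        unsigned out = 0;
        for (int j = 0; j < n; j++)
            out |= ((i >> sigma[j]) & 1u) << j;
        return out;
    }

    int main(void) {
        /* sigma for the stride permutation L^8_2: (b2 b1 b0) -> (b0 b2 b1),
           i.e., output bit 2 = b0, output bit 1 = b2, output bit 0 = b1. */
        int sigma[3] = {1, 2, 0};
        int n = 3, p = 1;   /* 2^p processors: pid = leading p bits */
        for (unsigned i = 0; i < 8; i++) {
            unsigned j = apply_bit_perm(i, sigma, n);
            printf("%u -> %u  (pid %u, offset %u)\n",
                   i, j, j >> (n - p), j & ((1u << (n - p)) - 1));
        }
        return 0;
    }

Running it reproduces the $L^8_2$ mapping table on the next slide (e.g., 001 maps to 100).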
Stride Permutation

$L^8_2$: write at stride 4 (= 8/2); on the address bits, $(b_2 b_1 b_0) \mapsto (b_0 b_2 b_1)$.

000 → 000
001 → 100
010 → 001
011 → 101
100 → 010
101 → 110
110 → 011
111 → 111

$L^8_2 \, (x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7)^T = (x_0, x_2, x_4, x_6, x_1, x_3, x_5, x_7)^T$
Distributed Stride Permutation

$L^{64}_2$ on 8 processors. Processor and local address mapping, written as communication rules per processor # (pid | local address, * = any bit):

#0: 000|**0 → 000|0**    000|**1 → 100|0**
#1: 001|**0 → 000|1**    001|**1 → 100|1**
#2: 010|**0 → 001|0**    010|**1 → 101|0**
#3: 011|**0 → 001|1**    011|**1 → 101|1**
#4: 100|**0 → 010|0**    100|**1 → 110|0**
#5: 101|**0 → 010|1**    101|**1 → 110|1**
#6: 110|**0 → 011|0**    110|**1 → 111|0**
#7: 111|**0 → 011|1**    111|**1 → 111|1**
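These rules are just the right rotation of the six global address bits, re-split into pid and offset. A small C sketch (an illustration, not package code) that reproduces the table, printing pid and offset in decimal:

    #include <stdio.h>

    /* L^{2^n}_2 is a right rotation of the n address bits. With the
       leading p bits as pid and the trailing n-p bits as local offset,
       rotating the global address yields the per-processor rules. */
    int main(void) {
        int n = 6, p = 3;
        for (unsigned src = 0; src < (1u << n); src++) {
            unsigned dst = (src >> 1) | ((src & 1u) << (n - 1));
            printf("pid %u, off %u -> pid %u, off %u\n",
                   src >> (n - p), src & ((1u << (n - p)) - 1),
                   dst >> (n - p), dst & ((1u << (n - p)) - 1));
        }
        return 0;
    }

For example, source address 000|001 rotates to 100|000: processor #0's odd elements go to processor #4, as in the table.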
Communication Pattern

$L^{64}_2$: [Figure: bipartite graph of 8 sending PEs and 8 receiving PEs.]
Each PE sends 1/2 of its data to 2 different PEs.
Looks nicely regular…
Communication Pattern

$L^{64}_2$: [Figure: the same pattern drawn on a ring of 8 PEs; e.g., X(0:2:6) → Y(0:1:3) and X(1:2:7) → Y(4:1:7).]
…but is highly irregular…
Communication Pattern

$L^{64}_4$: [Figure: bipartite graph of 8 sending PEs and 8 receiving PEs.]
Each PE sends 1/4 of its data to 4 different PEs…
…and the pattern gets worse for larger stride parameters of L.
Multi-Swap Permutation

$M^8_2$: writes at stride 4 with pairwise exchange of data; on the address bits, $(b_2 b_1 b_0) \mapsto (b_0 b_1 b_2)$ (bits 2 and 0 are swapped).

000 → 000
001 → 100
010 → 010
011 → 110
100 → 001
101 → 101
110 → 011
111 → 111

$M^8_2 \, (x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7)^T = (x_0, x_4, x_2, x_6, x_1, x_5, x_3, x_7)^T$
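The key property is that a multi-swap is an involution: swapping two address bits twice returns the original index, so at the processor level every PE pairs up with exactly one partner. A short C sketch (my own illustration) verifying this for $M^8_2$:

    #include <assert.h>
    #include <stdio.h>

    /* M swaps address bits j and k; if the bits differ, flip both. */
    unsigned swap_bits(unsigned i, int j, int k) {
        unsigned bj = (i >> j) & 1u, bk = (i >> k) & 1u;
        if (bj != bk)
            i ^= (1u << j) | (1u << k);
        return i;
    }

    int main(void) {
        for (unsigned i = 0; i < 8; i++) {
            unsigned m = swap_bits(i, 2, 0);   /* M^8_2: swap b2 and b0 */
            assert(swap_bits(m, 2, 0) == i);   /* involution */
            printf("%u -> %u\n", i, m);
        }
        return 0;
    }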
Communication Pattern

$M^{64}_2$: [Figure: bipartite graph; PEs exchange in pairs.]
Each PE exchanges 1/2 of its data with one other PE (4 all-to-alls of size 2).
Communication Pattern

$M^{64}_2$: [Figure: the same pattern on a ring of 8 PEs; paired PEs exchange X(0:2:6) and X(1:2:7).]
Communication Pattern

$M^{64}_4$: [Figure: bipartite graph.]
Each PE sends 1/4 of its data to 4 different PEs (2 all-to-alls of size 4).
Communication Scheduling
• Order-two Latin square
• Used to schedule the all-to-all permutation
• Uses point-to-point communication
• Simple recursive construction (see the sketch after the table)

0 1 2 3 4 5 6 7
1 0 3 2 5 4 7 6
2 3 0 1 6 7 4 5
3 2 1 0 7 6 5 4
4 5 6 7 0 1 2 3
5 4 7 6 1 0 3 2
6 7 4 5 2 3 0 1
7 6 5 4 3 2 1 0
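This Latin square is exactly the XOR table $L[i][j] = i \oplus j$, and the recursive construction doubles the table at each step. A C sketch (my own illustration, not package code):

    #include <stdio.h>

    #define P 8

    static int L[P][P];

    /* Build the order-two Latin square recursively: given the table of
       size s/2, the size-s table repeats it, with an offset of s/2 in
       the off-diagonal blocks. The result equals L[i][j] = i ^ j. */
    void build(int s) {
        if (s == 1) { L[0][0] = 0; return; }
        build(s / 2);
        for (int i = 0; i < s / 2; i++)
            for (int j = 0; j < s / 2; j++) {
                L[i][j + s/2]       = L[i][j] + s/2;
                L[i + s/2][j]       = L[i][j] + s/2;
                L[i + s/2][j + s/2] = L[i][j];
            }
    }

    int main(void) {
        build(P);
        /* Column j is a perfect matching: in round j, PE r talks to
           L[r][j] = r ^ j, and that partner's entry points back to r,
           so every round is a set of independent pairwise exchanges. */
        for (int i = 0; i < P; i++) {
            for (int j = 0; j < P; j++)
                printf("%d ", L[i][j]);
            printf("\n");
        }
        return 0;
    }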
Parallel WHT Package
• WHT partition tree is parallelized at the root node
• SMP implementation obtained using OpenMP
• Distributed memory implementation using MPI
• Dynamic programming decides when to use parallelism
  – DP decides the best parallel root node
  – DP builds the partition with the best sequential subtrees

Sequential WHT package: Johnson and Püschel, ICASSP 2000 & ICASSP 2001
Dynamic data layout: N. Park and V. K. Prasanna, ICASSP 2001
OpenMP SMP version: K. Chen and J. Johnson, IPDPS 2002
Distributed Memory WHT Algorithms
• Distributed split, d_split, as root node
• Data equally distributed among processes
• Distributed stride permutation to exchange data
• Different sequences of permutations are possible
• Parallel form: WHT transforms on local data (see the sketch after the equations)

Sequential algorithm:
$WHT_N = \prod_{i=1}^{t} \left( I_{N_1 \cdots N_{i-1}} \otimes WHT_{N_i} \otimes I_{N_{i+1} \cdots N_t} \right)$, where $N = N_1 \cdots N_t$

Pease dataflow (stride permutations, parallel local WHT):
$= \prod_{i=1}^{t} \left( L^N_{N_i} \left( I_{N/N_i} \otimes WHT_{N_i} \right) \right)$

General dataflow (bit permutations):
$= \prod_{i=1}^{t} \left( P_i \left( I_{N/N_i} \otimes WHT_{N_i} \right) \right)$
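To make the structure concrete, here is a minimal MPI sketch of a distributed WHT in binary-exchange style: a local WHT followed by one pairwise-exchange butterfly per pid bit. This illustrates the "local work plus bit permutation" dataflow; it is not the package's d_split implementation, and the block size is an assumption.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define LOCAL_N (1 << 10)   /* assumed local block size M = 2^(n-p) */

    /* In-place unnormalized WHT on a local block of size m (power of 2). */
    static void wht_local(double *x, int m) {
        for (int len = 1; len < m; len <<= 1)
            for (int i = 0; i < m; i += 2 * len)
                for (int j = i; j < i + len; j++) {
                    double a = x[j], b = x[j + len];
                    x[j] = a + b;
                    x[j + len] = a - b;
                }
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nproc;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);  /* assumed a power of 2 */

        double *x   = malloc(LOCAL_N * sizeof(double));
        double *tmp = malloc(LOCAL_N * sizeof(double));
        for (int i = 0; i < LOCAL_N; i++)
            x[i] = (rank == 0 && i == 0) ? 1.0 : 0.0;  /* unit impulse */

        /* Step 1: I_P (tensor) WHT_M -- local transform on each block. */
        wht_local(x, LOCAL_N);

        /* Step 2: WHT_P (tensor) I_M -- one pairwise exchange per pid
           bit; each round is a size-2 all-to-all with partner rank^bit. */
        for (int bit = 1; bit < nproc; bit <<= 1) {
            int partner = rank ^ bit;
            MPI_Sendrecv(x, LOCAL_N, MPI_DOUBLE, partner, 0,
                         tmp, LOCAL_N, MPI_DOUBLE, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            for (int i = 0; i < LOCAL_N; i++)
                x[i] = (rank & bit) ? tmp[i] - x[i] : x[i] + tmp[i];
        }

        if (rank == 0)
            printf("x[0] = %f (expect 1.0 for a unit impulse)\n", x[0]);
        free(x); free(tmp);
        MPI_Finalize();
        return 0;
    }

The sketch uses the block distribution from the bit-permutation slides (leading p bits = pid), so it computes $WHT_{2^n} = (WHT_{2^p} \otimes I_M)(I_{2^p} \otimes WHT_M)$ with only pairwise exchanges.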
Theoretical Results
Problem statement: find the sequence of permutations that minimizes communication and congestion.

Pease dataflow:
total bandwidth = N log(N) (1 − 1/P)

Conjectured optimal:
total bandwidth = (N/2) log(P) + N (1 − 1/P)

The optimal dataflow uses independent pairwise exchanges (except for the last permutation).
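A quick sanity check of the two formulas, assuming base-2 logarithms and taking $N = 2^{30}$, $P = 8$ as an example:

    % Pease dataflow: each of the log N stages moves a (1 - 1/P)
    % fraction of the N data points across the network.
    N \log N \,(1 - 1/P) = 2^{30} \cdot 30 \cdot \tfrac{7}{8} = 26.25 \cdot 2^{30}

    % Conjectured optimal: log P pairwise-exchange stages moving N/2
    % each, plus one final all-to-all permutation.
    \tfrac{N}{2}\log P + N(1 - 1/P) = 2^{30}(1.5 + 0.875) = 2.375 \cdot 2^{30}

    % Ratio: 26.25 / 2.375, i.e., roughly 11x less total traffic.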
Pease Dataflow
$WHT_{16} = L^{16}_2 (I_8 \otimes WHT_2) \cdot L^{16}_2 (I_8 \otimes WHT_2) \cdot L^{16}_2 (I_8 \otimes WHT_2) \cdot L^{16}_2 (I_8 \otimes WHT_2)$
Theoretically Optimal Dataflow
$WHT_{16} = L^{16}_2 (I_8 \otimes WHT_2) \cdot P_{(0,3)} (I_8 \otimes WHT_2) \cdot P_{(0,2)} (I_8 \otimes WHT_2) \cdot P_{(0,1)} (I_8 \otimes WHT_2)$
where $P_{(i,j)}$ is the multi-swap permutation exchanging address bits $i$ and $j$.
Experimental Results
Platform:
• 32 Pentium III processors, 450 MHz
• 512 MB 8 ns PC-100 memory
• 2 SMC 100 Mbps Fast Ethernet cards
Distributed WHT package implemented using MPI
Experiments:
• All-to-all
• Distributed stride vs. multi-swap permutations
• Distributed WHT
All-to-All
Three different implementations of the all-to-all permutation; runtimes by local problem size:

log(local N) | Point-to-point | Three-way | MPI_Alltoall
25           | 45.39          | 63.17     | 140.5
24           | 22.67          | 29.99     | 78.7
23           | 11.33          | 15.02     | 44.23
22           | 5.76           | 7.72      | 20.89
21           | 2.93           | 3.85      | 8.44
20           | 1.62           | 2.12      | 5.15
19           | 0.68           | 1.12      | 2.64

Point-to-point is fastest.
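A minimal sketch of the point-to-point variant, scheduling the all-to-all with the XOR Latin square from above (partner = rank ^ round). The buffer layout and function name are my own assumptions, not the package's code:

    #include <mpi.h>
    #include <string.h>

    /* All-to-all via P-1 pairwise MPI_Sendrecv rounds. sendbuf/recvbuf
       hold one block of `blk` doubles per destination/source rank. In
       round r, rank i exchanges blocks with partner i ^ r, so every
       round is a set of independent pairwise exchanges. */
    void alltoall_p2p(const double *sendbuf, double *recvbuf,
                      int blk, MPI_Comm comm) {
        int rank, nproc;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nproc);   /* assumed a power of 2 */

        /* Round 0: the block addressed to ourselves is a local copy. */
        memcpy(recvbuf + (size_t)rank * blk,
               sendbuf + (size_t)rank * blk, blk * sizeof(double));

        for (int r = 1; r < nproc; r++) {
            int partner = rank ^ r;
            MPI_Sendrecv(sendbuf + (size_t)partner * blk, blk, MPI_DOUBLE,
                         partner, 0,
                         recvbuf + (size_t)partner * blk, blk, MPI_DOUBLE,
                         partner, 0, comm, MPI_STATUS_IGNORE);
        }
    }

A call such as alltoall_p2p(send, recv, local_n / nproc, MPI_COMM_WORLD) plays the role of MPI_Alltoall, but keeps each round down to congestion-free pairwise exchanges.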
Stride vs. Multi-Swap
[Figure: runtime (sec, 0 to 80) of the distributed stride permutation vs. the multi-swap permutation, for sizes ld N = 1 … 25.]
Distributed WHT of Size 2^30
[Figure: runtime (sec, 90 to 125) as a function of the root split size n_1 = 5 … 25, for d_split with the stride permutation vs. d_split with the multi-swap permutation.]

Compared factorizations:
$L^{2^n}_{2^{n_1}} \left( I_{2^{n-n_1}} \otimes WHT_{2^{n_1}} \right) \cdot L^{2^n}_{2^{n-n_1}} \left( I_{2^{n_1}} \otimes WHT_{2^{n-n_1}} \right)$
vs.
$M^{2^n}_{2^{n_1}} \left( I_{2^{n-n_1}} \otimes WHT_{2^{n_1}} \right) \cdot L^{2^n}_{2^{n-n_1}} \left( I_{2^{n_1}} \otimes WHT_{2^{n-n_1}} \right)$
Summary
• Self-adapting WHT package
• Optimizes the distributed WHT over different communication patterns and combinations of sequential code
• Uses point-to-point primitives for the all-to-all
Ongoing work:
• Lower bounds
• Use of a high-speed interconnect
• Generalization to other transforms
• Incorporation into SPIRAL
Download: http://www.spiral.net