Communication Lower Bound for the Fast Fourier Transform
Michael Anderson
Communication-Avoiding Algorithms (CS294)
Fall 2011
Sources
• J. W. Hong and H. T. Kung. I/O complexity: The red-blue pebble game. In STOC '81: Proceedings of the thirteenth annual ACM symposium on Theory of computing, pages 326-333, New York, NY, USA, 1981. ACM.
• J. E. Savage. Extending the Hong-Kung model to memory hierarchies. In COCOON, pages 270-281, 1995.
• CS256 Applied Theory of Computation, Brown University. Lecture 18 (http://www.cs.brown.edu/courses/csci2560/lectures/lect.18.MemoryHierarchyIII.pdf)
• J. E. Savage. Models of Computation: Exploring the Power of Computing.
• A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Commun. ACM, 31(9):1116-1127, 1988.
Outline
1. Fast Fourier Transform
2. Lower bound
   1. Two-level pebble game
   2. S-span
3. Upper bound
4. Multilevel pebble game
5. Open Problems
Discrete Fourier Transform
$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-\frac{2\pi j k}{N} n}$$
Here $X$ is the output vector and $x$ is the input vector.
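The definition above can be evaluated directly. The following is a minimal Python sketch, assuming the usual forward-transform sign convention (a negative exponent); the function name `dft` is chosen here for illustration and does not come from the slides.

```python
import cmath

def dft(x):
    """Direct O(N^2) evaluation of X[k] = sum_n x[n] * omega**(k*n)."""
    N = len(x)
    # omega is a primitive N-th root of unity. The minus sign is the
    # common forward-transform convention and is an assumption here.
    omega = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * omega ** (k * n) for n in range(N)) for k in range(N)]

# A unit impulse transforms to a constant vector.
X = dft([1, 0, 0, 0])
```

Each of the N outputs touches all N inputs, so this direct form needs on the order of N^2 multiply-adds.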
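Unroll Output Vector
$$\begin{aligned}
X_0 &= \sum_{n=0}^{N-1} x_n \, \omega^{0 \cdot n} \\
X_1 &= \sum_{n=0}^{N-1} x_n \, \omega^{1 \cdot n} \\
X_2 &= \sum_{n=0}^{N-1} x_n \, \omega^{2 \cdot n} \\
X_3 &= \sum_{n=0}^{N-1} x_n \, \omega^{3 \cdot n} \\
&\;\;\vdots \\
X_{N-2} &= \sum_{n=0}^{N-1} x_n \, \omega^{(N-2) \cdot n} \\
X_{N-1} &= \sum_{n=0}^{N-1} x_n \, \omega^{(N-1) \cdot n}
\end{aligned}$$
where $\omega = e^{-\frac{2\pi j}{N}}$.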
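Unroll Input Vector
$$\begin{aligned}
X_0 &= x_0\,\omega^{0 \cdot 0} + x_1\,\omega^{0 \cdot 1} + x_2\,\omega^{0 \cdot 2} + \cdots + x_{N-2}\,\omega^{0 \cdot (N-2)} + x_{N-1}\,\omega^{0 \cdot (N-1)} \\
X_1 &= x_0\,\omega^{1 \cdot 0} + x_1\,\omega^{1 \cdot 1} + x_2\,\omega^{1 \cdot 2} + \cdots + x_{N-2}\,\omega^{1 \cdot (N-2)} + x_{N-1}\,\omega^{1 \cdot (N-1)} \\
X_2 &= x_0\,\omega^{2 \cdot 0} + x_1\,\omega^{2 \cdot 1} + x_2\,\omega^{2 \cdot 2} + \cdots + x_{N-2}\,\omega^{2 \cdot (N-2)} + x_{N-1}\,\omega^{2 \cdot (N-1)} \\
X_3 &= x_0\,\omega^{3 \cdot 0} + x_1\,\omega^{3 \cdot 1} + x_2\,\omega^{3 \cdot 2} + \cdots + x_{N-2}\,\omega^{3 \cdot (N-2)} + x_{N-1}\,\omega^{3 \cdot (N-1)} \\
X_4 &= x_0\,\omega^{4 \cdot 0} + x_1\,\omega^{4 \cdot 1} + x_2\,\omega^{4 \cdot 2} + \cdots + x_{N-2}\,\omega^{4 \cdot (N-2)} + x_{N-1}\,\omega^{4 \cdot (N-1)} \\
&\;\;\vdots \\
X_{N-2} &= x_0\,\omega^{(N-2) \cdot 0} + x_1\,\omega^{(N-2) \cdot 1} + x_2\,\omega^{(N-2) \cdot 2} + \cdots + x_{N-2}\,\omega^{(N-2) \cdot (N-2)} + x_{N-1}\,\omega^{(N-2) \cdot (N-1)} \\
X_{N-1} &= x_0\,\omega^{(N-1) \cdot 0} + x_1\,\omega^{(N-1) \cdot 1} + x_2\,\omega^{(N-1) \cdot 2} + \cdots + x_{N-2}\,\omega^{(N-1) \cdot (N-2)} + x_{N-1}\,\omega^{(N-1) \cdot (N-1)}
\end{aligned}$$
where $\omega = e^{-\frac{2\pi j}{N}}$.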
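Phrase as Matrix-Vector Multiply
$$
\begin{bmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \\ X_4 \\ \vdots \\ X_{N-2} \\ X_{N-1} \end{bmatrix}
=
\begin{bmatrix}
\omega^{0 \cdot 0} & \omega^{0 \cdot 1} & \omega^{0 \cdot 2} & \cdots & \omega^{0 \cdot (N-2)} & \omega^{0 \cdot (N-1)} \\
\omega^{1 \cdot 0} & \omega^{1 \cdot 1} & \omega^{1 \cdot 2} & \cdots & \omega^{1 \cdot (N-2)} & \omega^{1 \cdot (N-1)} \\
\omega^{2 \cdot 0} & \omega^{2 \cdot 1} & \omega^{2 \cdot 2} & \cdots & \omega^{2 \cdot (N-2)} & \omega^{2 \cdot (N-1)} \\
\omega^{3 \cdot 0} & \omega^{3 \cdot 1} & \omega^{3 \cdot 2} & \cdots & \omega^{3 \cdot (N-2)} & \omega^{3 \cdot (N-1)} \\
\omega^{4 \cdot 0} & \omega^{4 \cdot 1} & \omega^{4 \cdot 2} & \cdots & \omega^{4 \cdot (N-2)} & \omega^{4 \cdot (N-1)} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\omega^{(N-2) \cdot 0} & \omega^{(N-2) \cdot 1} & \omega^{(N-2) \cdot 2} & \cdots & \omega^{(N-2) \cdot (N-2)} & \omega^{(N-2) \cdot (N-1)} \\
\omega^{(N-1) \cdot 0} & \omega^{(N-1) \cdot 1} & \omega^{(N-1) \cdot 2} & \cdots & \omega^{(N-1) \cdot (N-2)} & \omega^{(N-1) \cdot (N-1)}
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_{N-2} \\ x_{N-1} \end{bmatrix}
$$
The left-hand side is the output vector; the rightmost factor is the input vector.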
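The matrix-vector view can be sketched in a few lines of plain Python. The helper names `dft_matrix` and `matvec` are illustrative only, and the negative exponent in `omega` is the same assumed sign convention as in the DFT definition.

```python
import cmath

def dft_matrix(N):
    """N x N matrix whose (k, n) entry is omega**(k*n)."""
    omega = cmath.exp(-2j * cmath.pi / N)   # assumed forward-transform sign
    return [[omega ** (k * n) for n in range(N)] for k in range(N)]

def matvec(F, x):
    """Dense matrix-vector product: one multiply-add per matrix entry."""
    return [sum(f * v for f, v in zip(row, x)) for row in F]

F = dft_matrix(4)
X = matvec(F, [1, 2, 3, 4])
```

The product performs N^2 multiply-adds, which is the cost the factorization on the following slides reduces.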
Factorization
[Figure: the DFT drawn as a single block between the input vector and the output vector.]
[Figure: the DFT split into two smaller DFT blocks together with one column of multiply-add (+*) nodes. The slide labels the results of a pair of +* nodes as $x_0 + x_1\,\omega^k$ and $x_1 + x_0\,\omega^k$.]
[Figure: the smaller DFT blocks split again, giving four DFT blocks together with two columns of multiply-add (+*) nodes.]
FFT
[Figure: the fully factored transform between the input vector and the output vector. No DFT blocks remain: the diagram consists of a shuffle (permutation) phase and a compute phase made only of multiply-add (+*) nodes.]
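The shuffle and compute phases can be illustrated with an iterative radix-2 FFT. This is a sketch under stated assumptions rather than the exact diagram from the slide: it uses the decimation-in-time variant, in which a bit-reversal shuffle comes first and log2(N) columns of butterflies follow, and the names `bit_reverse_shuffle` and `fft` are chosen here for illustration.

```python
import cmath

def bit_reverse_shuffle(x):
    """Shuffle phase: element i moves to the index with i's bits reversed."""
    N = len(x)
    bits = N.bit_length() - 1
    out = [0] * N
    for i in range(N):
        r = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[r] = x[i]
    return out

def fft(x):
    """Compute phase: log2(N) columns of butterflies on the shuffled input."""
    N = len(x)
    assert N > 0 and N & (N - 1) == 0, "length must be a power of two"
    a = bit_reverse_shuffle([complex(v) for v in x])
    size = 2
    while size <= N:                      # one pass per column of the DAG
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, N, size):
            w = 1 + 0j
            for k in range(half):
                u = a[start + k]
                t = w * a[start + k + half]
                a[start + k] = u + t          # one multiply-add node
                a[start + k + half] = u - t   # its partner in the butterfly
                w *= w_step
        size *= 2
    return a
```

Every column performs N multiply-adds, so the whole compute phase has N log2(N) operation nodes, which is the vertex count the lower-bound argument works with.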
Red Blue (2-level) Pebble Game
• Used to analyze communication in straight-line programs (e.g.
Matrix multiply, FFT, matrix transpose)
• Played on a DAG. Vertices represent inputs, intermediate
data, and operations. Edges represent data dependencies
• Pebbles represent cache locations. Pebble color represents a
distinct level of the cache hierarchy. Placing a pebble on a
specific vertex means storing that data element in cache.
• Red pebble = fast memory; blue pebble = slow memory
Rules of the Red Blue Pebble Game
• (Initialization) A blue pebble can be placed on any
input vertex at any time
• (Input) A red pebble may be placed on any vertex that
contains a blue pebble
• (Output) A blue pebble may be placed on any vertex
that contains a red pebble
• (Computation) A red pebble can be placed on any vertex
if all of its immediate predecessors have red pebbles
• (Deletion) A pebble can be removed at any time
• (Goal) All output vertices contain blue pebbles
Playing the Game
• A pebbling strategy is a sequence of steps in which
the rules on the previous slide are used to move pebbles
• The number of red pebbles (size of fast memory) is
limited to S (assume infinite blue pebbles).
• A communication lower bound (or minimum I/O time) is obtained by
bounding from below the number of (Input) and (Output) moves
over all possible pebbling strategies.
• The total number of computation steps should also be
minimized
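The rules and the I/O count can be made concrete with a small simulator. This is a minimal sketch, not code from the referenced papers: the class `RedBluePebbleGame` and its method names are hypothetical, the DAG is given as a map from each vertex to its predecessors, and input vertices are assumed to enter fast memory only through the (Input) rule.

```python
class RedBluePebbleGame:
    """Minimal red-blue pebble game on a DAG given as {vertex: [predecessors]}."""

    def __init__(self, preds, outputs, S):
        self.preds = preds
        self.inputs = {v for v, p in preds.items() if not p}
        self.outputs = set(outputs)
        self.S = S                      # number of red pebbles (fast memory size)
        self.red, self.blue = set(), set()
        self.io = 0                     # count of (Input) and (Output) moves

    def init(self, v):                  # (Initialization)
        assert v in self.inputs
        self.blue.add(v)

    def read(self, v):                  # (Input): blue -> red, one I/O
        assert v in self.blue
        self._place_red(v)
        self.io += 1

    def write(self, v):                 # (Output): red -> blue, one I/O
        assert v in self.red
        self.blue.add(v)
        self.io += 1

    def compute(self, v):               # (Computation)
        assert self.preds[v] and all(u in self.red for u in self.preds[v])
        self._place_red(v)

    def delete_red(self, v):            # (Deletion)
        self.red.discard(v)

    def _place_red(self, v):
        if v not in self.red:
            assert len(self.red) < self.S, "out of red pebbles"
            self.red.add(v)

    def done(self):                     # (Goal)
        return self.outputs <= self.blue

# One butterfly: outputs c and d both depend on inputs a and b.
game = RedBluePebbleGame(
    preds={"a": [], "b": [], "c": ["a", "b"], "d": ["a", "b"]},
    outputs=["c", "d"],
    S=3,
)
for v in ("a", "b"):
    game.init(v)
    game.read(v)
for v in ("c", "d"):
    game.compute(v)
    game.write(v)
    game.delete_red(v)   # free the red pebble for the next output
print(game.io, game.done())
```

A pebbling strategy is then just a sequence of these method calls, and `game.io` is the quantity the lower bound constrains.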
S-span
• The S-span of DAG G, ρ(S,G), is the maximum number of
vertices of G that can be pebbled with S red pebbles in
the red pebble game, maximized over all initial placements of
the S red pebbles.
• The red pebble game is like the red-blue game, but blue
pebbles cannot be placed on intermediate vertices.
[Figure: example DAG; legend: initial red pebble (S = 6), red pebble.]
Using S-span for Lower Bounds
Divide the computation into h sub-pebblings (C1, C2...Ch) that
each communicate no more than S words between level 1 and 2.
Each sub-pebbling has 2S words available (S words initially in
the cache plus S inputs). Therefore, each sub-pebbling can
perform no more than ρ(2S,G) operations.
[Figure: the computation drawn as a sequence of sub-pebblings C1, C2, ..., Ch; each sub-pebbling consists of input, level-1 ops, and output.]
Using S-span for Lower Bounds
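• Theorem: For every pebbling P of G = (V,E) in the red-blue pebble game with S red pebbles, the I/O time used,
T2(S,G,P), satisfies:
$$\left\lceil T_2(S,G,P)/S \right\rceil \, \rho(2S,G) \ge |V| - |\mathrm{In}(G)|$$
Here $\lceil T_2(S,G,P)/S \rceil$ is the number of words moved (in batches of S words), $|V| - |\mathrm{In}(G)|$ is the total number of operations, and $\rho(2S,G)$ is an upper bound on arithmetic intensity (number of operations per 2S words). Equivalently:
$$\left\lceil T_2(S,G,P)/S \right\rceil \ge \frac{|V| - |\mathrm{In}(G)|}{\rho(2S,G)}$$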
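The counting argument behind this theorem can be sketched as follows, writing h for the number of sub-pebblings as on the preceding slide. This is an outline of the reasoning under that partitioning, not a complete proof.

```latex
\begin{align*}
h &= \left\lceil \frac{T_2(S,G,P)}{S} \right\rceil
  && \text{(sub-pebblings, each with at most $S$ I/O operations)} \\
|V| - |\mathrm{In}(G)| &\le h \cdot \rho(2S,G)
  && \text{(each sub-pebbling pebbles at most $\rho(2S,G)$ vertices)} \\
\left\lceil \frac{T_2(S,G,P)}{S} \right\rceil &\ge \frac{|V| - |\mathrm{In}(G)|}{\rho(2S,G)}
  && \text{(divide by $\rho(2S,G)$)}
\end{align*}
```

The middle line uses the fact that every non-input vertex must be pebbled by a computation step at least once.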
What is the S-span of the FFT DAG?
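Lemma 1: The S-span of the FFT DAG on n inputs is no greater than 2 S log(S) when S ≤ n.
Proof: Let num(p) denote the number of moves currently allocated to pebble p. Consider a butterfly whose lower-level nodes u1 and u2 carry pebbles p1 and p2. Move both p1 and p2 to the upper-level nodes v1 and v2. (This is not a legal sequence of moves, but it can only overcount, so it still gives an upper bound.) If num(p1) = num(p2), increment both. Otherwise increment the smaller.
The total number of red pebbling moves is therefore bounded by:
$$2 \sum_{p \in \text{pebbles}} \mathrm{num}(p)$$
[Figure: one butterfly of the FFT DAG, with pebbles p1 and p2 on lower-level nodes u1 and u2, and upper-level nodes v1 and v2.]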
What is the S-span of the FFT DAG?
Lemma 2: For each pebble p on node n in the FFT DAG, the number of
nodes, N(p), that contained a red pebble in the initial
configuration and that are connected by a directed path to n is at
least $2^{\mathrm{num}(p)}$.
Proof (induction on num(p)):
Base case: num(p) = 1. In this case, the node n needed 2 inputs, so N(p) ≥ 2 = $2^1$.
Inductive step: Assume that N(p) is at least $2^{e-1}$ whenever num(p) ≤ e-1. Show that N(p) becomes at least $2^e$ when num(p) is
incremented to e during a butterfly operation.
Case 1: Pebbles p1 and p2 enter a butterfly operation with
num(p1) = num(p2) = e-1. Since u1 and u2 are roots of disjoint trees,
each with at least $2^{e-1}$ initial pebbles, the total number of initial
pebbles is now at least $2 \cdot 2^{e-1} = 2^e$.
Case 2: num(p) < num(partner) in the butterfly. Then num(partner) ≥ e,
so the partner must already be connected to at least $2^e$
initial pebbles, and the node that p moves to inherits that connection.
What is the S-span of the FFT DAG?
There are S pebbles and each pebble can only cover one initial
placement. By Lemma 2 there must be at least $2^{\mathrm{num}(p)}$ initial
pebbles, so $2^{\mathrm{num}(p)} \le S$ and therefore num(p) ≤ log(S).
According to Lemma 1, the total number of pebbling moves is
bounded by:
$$2 \sum_{p \in \text{pebbles}} \mathrm{num}(p) \le 2 \sum_{p \in \text{pebbles}} \log(S) = 2 S \log(S)$$
So the S-span is at most 2 S log(S). QED
FFT Two-level Hierarchy Lower Bound
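$$\left\lceil T_2(S,G,P)/S \right\rceil \, \rho(2S,G) \ge |V| - |\mathrm{In}(G)|$$
$$\left\lceil T_2(S,G,P)/S \right\rceil \ge \frac{N \log(N) - N}{4 S \log(2S)}$$
$$T_2(S,G,P) = \Omega\!\left( \frac{N \log N}{\log S} \right)$$
where $T_2(S,G,P)$ is the number of words moved.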
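To get a feel for the magnitude of this bound, the inequality can be evaluated numerically. The sketch below is illustrative only: the function name `fft_io_lower_bound` is made up here, logarithms are assumed to be base 2, and the operation count N log N - N and the span bound 4 S log(2S) are taken from the slide as written.

```python
import math

def fft_io_lower_bound(N, S):
    """Smallest integer T2 consistent with ceil(T2/S) * rho(2S,G) >= |V| - |In(G)|."""
    ops = N * math.log2(N) - N           # |V| - |In(G)| as used on the slide
    span = 4 * S * math.log2(2 * S)      # rho(2S,G) <= 2 * (2S) * log2(2S)
    batches = math.ceil(ops / span)      # ceil(T2/S) is an integer >= ops / span
    # ceil(T2/S) >= batches means T2 > S * (batches - 1).
    return max(0, S * (batches - 1) + 1)

bound = fft_io_lower_bound(2**20, 2**10)
ratio = bound / (2**20 * 20 / 10)        # compare with N log N / log S
```

For N = 2^20 and S = 2^10 the computed bound is within a constant factor of N log N / log S, which is all the asymptotic statement claims.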
Transpose FFT (Upper Bound)
Suppose the FFT size is a power of 2 ($N = 2^d$). There are
log(N) levels in the FFT DAG.
Divide the large FFT into many FFTs of size S, where S is
the size of fast memory. There are log(N)/log(S) stages of
independent size-S FFTs. After each stage, store the outputs
in slow memory, for a total of O(N log(N)/log(S)) words moved
between fast and slow memory, which matches the lower
bound up to a constant factor.
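The stage count and the resulting traffic can be tallied with a short helper. This is a rough sketch with assumptions spelled out: `blocked_fft_io` is a hypothetical name, N and S are powers of two, logarithms are base 2, and reads and writes are counted separately, so the total is twice the N log(N)/log(S) figure quoted above.

```python
import math

def blocked_fft_io(N, S):
    """Return (stages, words moved) for an FFT run as stages of size-S sub-FFTs."""
    assert S >= 2 and N >= S
    levels = int(math.log2(N))         # butterfly levels in the full FFT DAG
    per_stage = int(math.log2(S))      # levels a size-S sub-FFT finishes in fast memory
    stages = math.ceil(levels / per_stage)
    reads = N * stages                 # each stage loads all N values once
    writes = N * stages                # and stores all N results once
    return stages, reads + writes

stages, words = blocked_fft_io(2**20, 2**10)
```

With N = 2^20 and S = 2^10 there are two stages, so every value crosses the fast/slow boundary only twice in each direction.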
Multilevel Pebble Game
• Red/blue pebble game was for 2 levels (fast and slow)
• For multilevel game, data begins and ends in the
highest level memory (the Lth) and can be transferred
between consecutive levels (l-1 to l or vice versa)
[Figure: memory hierarchy from Level-1 (registers) through Level-2 (on-chip cache), ..., up to Level-L (main memory).]
Rules of the Multilevel Pebble Game
• (Initialization) A level-L pebble can be placed on any input
vertex at any time
• (Computation) A first-level pebble can be placed on any
vertex if all of its immediate predecessors have first-level
pebbles
• (Deletion) Except for level-L pebbles on output vertices, a
pebble at any level can be removed at any time
• (Input from level-l) For 2 ≤ l ≤ L-1, a level-(l-1) pebble
can be placed on any vertex carrying a level-l pebble
• (Output to level-l) For 2 ≤ l ≤ L-1, a level-l pebble can
be placed on any vertex carrying a level-(l-1) pebble
• (Goal) All output vertices contain level-L pebbles
Terminology
• Resource vector $p = (p_1, p_2, p_3, \ldots, p_{L-1})$, where $p_l$ is
the number of pebbles at level l. (The highest level is
assumed infinite.)
• $s_l = p_1 + p_2 + \cdots + p_l$, the number of pebbles available at level l or below
• A minimal pebbling minimizes the number of highest-level
I/O operations first, then the number of I/O
operations at successively lower levels, and finally the
number of computation steps.
• $T_l$ = number of I/O operations at level l
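The cumulative counts s_l are prefix sums of the resource vector. The sketch below follows the "level l or less" reading used in the Multilevel S-Span theorem statement; the function name and the example capacities are hypothetical and chosen only for illustration.

```python
from itertools import accumulate

def pebbles_at_or_below(p):
    """Return [s_1, ..., s_{L-1}], where s_l = p_1 + ... + p_l."""
    return list(accumulate(p))

# Hypothetical capacities for levels 1..3 of a 4-level hierarchy
# (level 4, the highest, is treated as unbounded and is not listed).
p = (32, 4096, 262144)
s = pebbles_at_or_below(p)
```

In the multilevel bound, s_{l-1} plays the role that the fast memory size S plays in the two-level game.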
Multilevel S-Span
Theorem:
Consider a minimal pebbling of the DAG G = (V,E) in the
standard memory hierarchy game with resource vector p
using sl pebbles at level l or less. The following lower
bound must be satisfied:
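$$\left\lceil T_l^{(L)}(p,G) / s_{l-1} \right\rceil \, \rho(2 s_{l-1}, G) \ge |V| - |\mathrm{In}(G)|$$
[Figure: level-l sub-pebblings C1, C2, ..., Ch; each sub-pebbling consists of input, level-(l-1) ops, and output.]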
Relating Multilevel to 2-level
Theorem:
The following inequality holds for 2 ≤ l ≤ L-1 when the
graph G is pebbled in the L-level game with resource
vector p:
$$T_l^{(L)}(p,G) \ge T_2^{(2)}(s_{l-1},G)$$
Review
• The minimum I/O time for the FFT in the 2-level case
is Θ(N log N / log S)
• This was determined by finding the S-span of the FFT
graph and using it to bound the number of words transferred
between memory levels
• The FFT algorithm, blocked into size-S sub-FFTs (the transpose FFT),
achieves this lower bound up to a constant factor (so the lower bound is tight)
• Two-level lower bounds can be generalized to multilevel memory hierarchies
Open Problems
• Communication lower bounds for 2-D and 3-D FFTs
  • I suspect that the S-span argument also holds for the 2-D case
  • What if S is larger than one row?
• Determining the FFT lower bound for the parallel model described in this class
  • Lower bounds for a "parallel hierarchical memory model" using randomized sorting algorithms for communication can be found in: J. S. Vitter and E. A. M. Shriver, "Algorithms for Parallel Memory II: Hierarchical Multilevel Memories"
• Using the pebble game (S-span) method to analyze new algorithms
  • Matrix multiply, sorting, and several other examples can be found in the references listed earlier
Questions?