23rd ACM Symposium on
Parallelism in Algorithms and Architectures
San Jose, California, June 4 - 6, 2011
Graph Expansion
and
Communication Costs
of
Fast Matrix Multiplication
Oded Schwartz1
Joint work with:
Grey Ballard1,
James Demmel1,
Olga Holtz1,2
1. UC-Berkeley
2. TU-Berlin
1
Results by:
• Grey Ballard, UCB
• Aydin Buluc, LBL
• Erin Carson, UCB
• James Demmel, UCB
• Jack Dongarra, UTK
• Ioana Dumitriu, U. Washington
• Andrew Gearhart, UCB
• Laura Grigori, INRIA
• Ming Gu, UCB Math
• Mark Hoemmen, Sandia NL
• Olga Holtz, UCB Math & TU Berlin
• Nicholas Knight, UCB
• Julien Langou, U. Colorado Denver
• Eran Rom, IBM Tel-Aviv
• Edgar Solomonik, UCB
Many others…
2
Motivation
Two kinds of costs:
• Arithmetic (FLOPs)
• Communication: moving data
  – between levels of a memory hierarchy (sequential case)
  – over a network connecting processors (parallel case)
Communication-avoiding algorithms:
Save time, save energy.
[Figure: sequential memory hierarchy: CPU – cache – M1 – M2 – M3 – … – Mk = ∞; parallel machine: CPUs, each with its own RAM, connected by a network.]
3
Motivation: expected bottleneck
Annual hardware improvements:
Exponential growth with large gaps
[Graham, Snir, Patterson 04], [Fuller, Millett 10]
• CPU (msec/flop): 59%
• DRAM: bandwidth (msec/word) 23%, latency (msec/message) 5%
• Network: bandwidth 26%, latency 15%
4
Outline
Algorithms with “flavor” of 3 nested loops
• Lower bounds: Sequential, Hierarchy, Parallel.
[Ballard, Demmel, Holtz, S. 2009],
[Ballard, Demmel, Holtz, S. 2011a] extending
[Hong & Kung 81], [Irony,Toledo,Tiskin 04]
• Algorithms: Sequential, Parallel
Many contributions, mostly new
Strassen-like algorithms
• Lower bounds: Sequential, Hierarchy, Parallel.
This work
• Algorithms: Sequential, Parallel
[Ballard, Demmel, Holtz, Rom, S. 2011]
5
Lower bounds: for algorithms with
“flavor” of 3 nested loops
Matrix Multiplication
[Hong & Kung 81]
• Sequential:
  #words moved = Ω( n³ / M^{1/2} )
[Irony, Toledo, Tiskin 04]
• Sequential and parallel:
  #words moved = Ω( n³ / M^{1/2} )   (sequential),   Ω( n³ / (P · M^{1/2}) )   (parallel)
6
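Written as displayed formulas (the numeric instance below is our own illustration, not from the slides):

```latex
\[
  \text{sequential: } \#\text{words} = \Omega\!\Big(\frac{n^{3}}{\sqrt{M}}\Big),
  \qquad
  \text{parallel: } \#\text{words} = \Omega\!\Big(\frac{n^{3}}{P\sqrt{M}}\Big).
\]
% Illustration (our own numbers): n = 2^{12}, M = 2^{20} words gives
% n^3/\sqrt{M} = 2^{36}/2^{10} = 2^{26} words, far more than the
% 3n^2 = 3\cdot 2^{24} words needed just to touch the inputs and output.
```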
Lower bounds: for algorithms with
“flavor” of 3 nested loops
[Ballard, Demmel, Holtz, S. 2009],
[Ballard, Demmel, Holtz, S. 2011a]
Following [Irony, Toledo, Tiskin 04]
• BLAS, LU, Cholesky, LDLᵀ, and QR factorizations,
  eigenvalues and singular values, i.e.,
  essentially all direct methods of linear algebra.
• Dense or sparse matrices
  (in sparse cases: bandwidth is a function of the number of nonzeros, NNZ).
• Bandwidth and latency.
• Sequential, hierarchical, and
parallel – distributed and shared memory models.
• Compositions of linear algebra operations.
• Certain graph optimization problems
• Tensor contractions [Demmel, Pearson, Poloni, Van Loan 11]
#words moved = Ω( n³ / M^{1/2} )   (sequential)
#words moved = Ω( n³ / (P · M^{1/2}) )   (parallel)
7
Do conventional dense algorithms as implemented in
LAPACK and ScaLAPACK attain these bounds?
Mostly not.
Are there other algorithms that do?
Mostly yes.
8
Motivation: a few example speedups,
Measured and Predicted
[Demmel, Ballard, Hoemmen, Grigori, Langou, Dongarra, Anderson 10]
[Anderson, Ballard, Demmel, Keutzer 10]
[Bhatele, Demmel, Solomonik 11]
Measured: Parallel TSQR
– Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
– Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
– BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
– Tesla C2050 / Fermi: up to 13x (110,592 x 100)
Predicted: Parallel 2.5D LU
– Exascale: up to 4.5x speedup (2^18 nodes, 2^22 x 2^22)
9
Beyond 3-nested loops
How about the communication costs of algorithms
that have a more complex structure?
10
Recall: Strassen’s Fast Matrix Multiplication
[Strassen 69]
• Compute 2 x 2 matrix multiplication
using only 7 multiplications (instead of 8).
• Apply recursively (block-wise)
M1 = (A11 + A22) · (B11 + B22)
M2 = (A21 + A22) · B11
M3 = A11 · (B12 - B22)
M4 = A22 · (B21 - B11)
M5 = (A11 + A12) · B22
M6 = (A21 - A11) · (B11 + B12)
M7 = (A12 - A22) · (B21 + B22)
[Figure: C = A·B in block form, each matrix split into four n/2 x n/2 blocks: C = [C11 C12; C21 C22], A = [A11 A12; A21 A22], B = [B11 B12; B21 B22].]
flops(n) = 7flops(n/2) + O(n2)
flops(n) = (nlog 7)
2
C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6
11
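A minimal runnable sketch of the recursion above (the `cutoff` switch to the classical product is our own addition for practicality; the seven products and four output sums are as on the slide):

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's recursion for n-by-n matrices, n a power of two.
    Falls back to the classical product below `cutoff` (our own choice)."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7   # C11
    C[:h, h:] = M3 + M5             # C12
    C[h:, :h] = M2 + M4             # C21
    C[h:, h:] = M1 - M2 + M3 + M6   # C22
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)
```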
Strassen-like algorithms
• Compute n₀ x n₀ matrix multiplication
  using only n₀^{ω₀} multiplications
  (instead of n₀³).
• Apply recursively (block-wise).
[Figure: C = A·B with each matrix split into (n/n₀) x (n/n₀) blocks.]
flops(n) = n₀^{ω₀} · flops(n/n₀) + O(n²)
flops(n) = Θ( n^{ω₀} )
ω₀:
2.81  [Strassen 69]  (works fast in practice)
2.79  [Pan 78]
2.78  [Bini 79]
2.55  [Schönhage 81]
2.50  [Pan, Romani, Coppersmith, Winograd 84]
2.48  [Strassen 87]
2.38  [Coppersmith, Winograd 90]
2.38  [Cohn, Kleinberg, Szegedy, Umans 05]  (group-theoretic approach)
12
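Unrolling the recurrence above (a standard master-theorem step, added for completeness; ω₀ is as defined on this slide):

```latex
\[
  \mathrm{flops}(n) = n_0^{\,\omega_0}\,\mathrm{flops}(n/n_0) + O(n^{2})
  \;\Longrightarrow\;
  \mathrm{flops}(n) = \Theta\!\big(n^{\omega_0}\big)
  \quad\text{whenever}\quad
  \omega_0 = \log_{n_0}(\#\text{multiplications}) > 2 .
\]
% Example: Strassen has n_0 = 2 and 7 multiplications, so \omega_0 = \log_2 7 \approx 2.81.
```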
New lower bound for
Strassen’s fast matrix multiplication
Main Result:
The communication bandwidth lower bound is:

                     sequential                            parallel
Strassen-like:       Ω( (n / M^{1/2})^{ω₀} · M )           Ω( (n / M^{1/2})^{ω₀} · M / P )
For Strassen's:      Ω( (n / M^{1/2})^{log₂ 7} · M )       Ω( (n / M^{1/2})^{log₂ 7} · M / P )
Recall, for cubic:   Ω( (n / M^{1/2})^{log₂ 8} · M )       Ω( (n / M^{1/2})^{log₂ 8} · M / P )
The parallel lower bounds apply to
one copy of the data: M = Θ( n² / P )
c copies of the data: M = Θ( c · n² / P )
13
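Side by side, after simplification (our own rearrangement of the rows above):

```latex
\[
  \text{Strassen: }
  \Omega\!\Big(\big(\tfrac{n}{\sqrt{M}}\big)^{\log_2 7} M\Big)
  = \Omega\!\Big(\tfrac{n^{\log_2 7}}{M^{\log_2 7/2 - 1}}\Big),
  \qquad
  \text{cubic: }
  \Omega\!\Big(\big(\tfrac{n}{\sqrt{M}}\big)^{3} M\Big)
  = \Omega\!\Big(\tfrac{n^{3}}{\sqrt{M}}\Big).
\]
% The two bounds differ by a factor of (n/\sqrt{M})^{3 - \log_2 7}:
% doing asymptotically fewer flops also permits asymptotically less communication.
```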
Are these lower bounds attainable?
For sequential? Hierarchy?
Yes, existing implementations do!
For parallel?
Yes, we think so
14
Sequential and new parallel
Strassen-like algorithms
Sequential and Hierarchy cases:
Attained by the natural recursive implementation.
Also: LU, QR,… (Black-box use of fast matrix multiplication)
[Ballard, Demmel, Holtz, S., Rom 2011]:
New parallel Strassen-like algorithm.
Attains the lower bound (we think).
This work:
This is as good as it gets.
15
Communication Lower Bounds
Proving that your algorithm/implementation is as good as it gets.
Approaches:
1. Reduction
[Ballard, Demmel, Holtz, S. 2009]
2. Geometric Embedding
[Irony,Toledo,Tiskin 04],
[Ballard, Demmel, Holtz, S. 2011a]
3. Graph Analysis
[Hong & Kung 81],
This work
16
Expansion (3rd approach)
[Ballard, Demmel, Holtz, S. 2011b],
in the spirit of [Hong & Kung 81]
[Figure: a vertex set S and its complement V \ S, with the cut edges E(S, V \ S) between them.]
Let G = (V, E) be a d-regular graph.
Edge expansion:
h = min over S, |S| ≤ |V|/2 of  |E(S, V \ S)| / (d · |S|)
Small-set edge expansion:
h_t = min over S, |S| ≤ t of  |E(S, V \ S)| / (d · |S|)
17
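A brute-force sketch of these definitions (the function name and the 4-cycle example are ours, for illustration only; exhaustive enumeration is exponential and only feasible for tiny graphs):

```python
from itertools import combinations

def edge_expansion(vertices, edges, d, max_size=None):
    """(Small-set) edge expansion of a d-regular graph, by exhaustive search.

    vertices: iterable of hashable vertex names
    edges: set of frozensets {u, v}
    d: the common degree (graph assumed d-regular)
    max_size: cap t on |S|; defaults to |V|/2 (ordinary edge expansion h)
    """
    V = list(vertices)
    cap = max_size if max_size is not None else len(V) // 2
    best = float("inf")
    for k in range(1, cap + 1):
        for subset in combinations(V, k):
            S = set(subset)
            cut = sum(1 for e in edges if len(e & S) == 1)  # edges leaving S
            best = min(best, cut / (d * len(S)))
    return best

# Example: the 4-cycle (2-regular); the worst set is two adjacent vertices,
# giving h = 2 / (2 * 2) = 0.5.
V = [0, 1, 2, 3]
E = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3), (3, 0)]}
print(edge_expansion(V, E, d=2))
```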
Expansion (3rd approach)
The Computation Directed Acyclic Graph
[Figure: a computation DAG: input/output vertices, intermediate-value vertices, and dependency edges. S is a subset of the computation; R_S are its reads (edges into S) and W_S its writes (edges out of S).]
Communication-cost is Graph-expansion
18
What is the CDAG of Strassen’s algorithm?
19
The DAG of Strassen, n = 2
M1 = (A11 + A22) · (B11 + B22)
M2 = (A21 + A22) · B11
M3 = A11 · (B12 - B22)
M4 = A22 · (B21 - B11)
M5 = (A11 + A12) · B22
M6 = (A21 - A11) · (B11 + B12)
M7 = (A12 - A22) · (B21 + B22)
C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6
[Figure: the n = 2 CDAG: Enc₁A reads A11, A12, A21, A22 and Enc₁B reads B11, B12, B21, B22; they feed the seven product vertices M1–M7, which Dec₁C combines into C11, C12, C21, C22.]
20
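The same level-1 CDAG written as a Python adjacency list (our own encoding: the intermediate addition vertices inside Enc₁A, Enc₁B, and Dec₁C are collapsed, so an edge goes directly from an operand to every value that depends on it):

```python
# Edges point from operands to the values that depend on them.
cdag = {
    # Enc_1 A: which products each entry of A feeds
    "A11": ["M1", "M3", "M5", "M6"],
    "A12": ["M5", "M7"],
    "A21": ["M2", "M6"],
    "A22": ["M1", "M2", "M4", "M7"],
    # Enc_1 B
    "B11": ["M1", "M2", "M4", "M6"],
    "B12": ["M3", "M6"],
    "B21": ["M4", "M7"],
    "B22": ["M1", "M3", "M5", "M7"],
    # Dec_1 C: which outputs each product feeds
    "M1": ["C11", "C22"],
    "M2": ["C21", "C22"],
    "M3": ["C12", "C22"],
    "M4": ["C11", "C21"],
    "M5": ["C11", "C12"],
    "M6": ["C22"],
    "M7": ["C11"],
}

# Note the graph is not regular: A11 feeds four products while A12 feeds only two.
from collections import Counter
print(Counter(len(v) for v in cdag.values()))
```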
The DAG of Strassen, n=4
One recursive level:
• Each vertex splits into four.
• Multiply blocks.
[Figure: the level-2 CDAG: each vertex of Dec₁C, Enc₁A, and Enc₁B becomes four vertices, and each of the seven multiplication vertices becomes a block multiplication, i.e., its own copy of the n = 2 CDAG (Enc₁A, Enc₁B, Dec₁C).]
21
The DAG of Strassen: further recursive steps
[Figure: the full CDAG: Enc_{lg n}A and Enc_{lg n}B each encode n² inputs through lg n levels, feeding the product vertices, which Dec_{lg n}C decodes into the n² outputs.]
Recursive construction
Given Dec_i C, construct Dec_{i+1} C:
1. Duplicate 4 times
2. Connect with a cross-layer of Dec_1 C
22
Expansion of a Segment
Main technical challenges:
• Two types of vertices: with / without recursion.
• The graph is not regular.
[Figure: the n = 2 building block (Enc₁A, Enc₁B, the seven products, Dec₁C) whose expansion is analyzed.]
Main lemma: for t = M^{ω₀/2},
h_t = Ω( M / M^{ω₀/2} )
23
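How the lemma feeds the bound (our own one-line recap): a segment of t = Θ(M^{ω₀/2}) vertices has at least h_t · d · t = Ω(M) boundary edges, so it must perform Ω(M) reads/writes, and there are Θ(n^{ω₀} / M^{ω₀/2}) segments:

```latex
\[
  BW \;=\; \Omega(M)\cdot\Theta\!\Big(\frac{n^{\omega_0}}{M^{\omega_0/2}}\Big)
  \;=\; \Omega\!\Big(\Big(\frac{n}{\sqrt{M}}\Big)^{\omega_0} M\Big),
  \qquad \omega_0 = \log_2 7 \ \text{for Strassen}.
\]
```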
Open Problems
Find algorithms that attain the lower bounds:
• sparse matrix algorithms
• for sequential and parallel models
• that auto-tune or are cache oblivious
Address complex heterogeneous hardware:
• Lower bounds and algorithms
[Demmel, Volkov 08],[Ballard, Demmel, Gearhart 11]
Extend the techniques to other algorithms and algorithmic tools:
• Non-uniform recursive structure
Characterize a communication lower bound for a problem rather
than for an algorithm.
24
Graph Expansion
and
Communication Costs
of
Fast Matrix Multiplication
Oded Schwartz1
Joint work with:
Grey Ballard1,
James Demmel1,
Olga Holtz1,2
Thank you!
http://bebop.cs.berkeley.edu/
25
EXTRA SLIDES
26
Upper bounds – Supporting Data Structures
Top (column-major): Full, Old Packed, Rectangular Full Packed.
Bottom (block-contiguous): Blocked, Recursive Format, Recursive Full Packed
[Frigo, Leiserson, Prokop, Ramachandran 99, Ahmed, Pingali 00, Elmroth, Gustavson, Jonsson,
Kagstrom 04].
27
Geometric Embedding (2nd approach)
(1) Generalized form:
For all (i,j) ∈ S,
C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)),
               g_{i,j,k2}(A(i,k2), B(k2,j)),
               …,
               other arguments ),
where k1, k2, … ∈ S_ij.

Thm: [Ballard, Demmel, Holtz, S. 2009b]
If an algorithm agrees with the generalized form, then
BW = Ω( G / M^{1/2} )
BW = Ω( G / (P · M^{1/2}) )   in the P-parallel model,
where G = |{ g_{i,j,k} : (i,j) ∈ S, k ∈ S_ij }|.
28
Geometric Embedding (2nd approach)
For a given run (Algorithm, Machine, Input):
1. Partition the computation into segments of M reads/writes each.
2. Any segment S has O(M) inputs/outputs.
3. Show that S performs ≤ G(M) FLOPs g_ijk.
4. The total communication BW is
   BW = BW of one segment × #segments ≥ M · G / G(M).
[Figure: a timeline of reads, FLOPs, and writes, partitioned into segments S1, S2, S3, …, each containing M = 3 reads/writes.]
29
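Instantiating the count above for classical matrix multiplication (our own worked step; the per-segment bound G(M) = O(M^{3/2}) follows from the Loomis–Whitney argument later in the deck):

```latex
\[
  BW \;\ge\; M \cdot \frac{G}{G(M)}
  \;=\; M \cdot \frac{\Theta(n^{3})}{O(M^{3/2})}
  \;=\; \Omega\!\Big(\frac{n^{3}}{\sqrt{M}}\Big).
\]
```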
Applications
(1) Generalized form:
For all (i,j) ∈ S,
C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)),
               g_{i,j,k2}(A(i,k2), B(k2,j)),
               …,
               other arguments ),
where k1, k2, … ∈ S_ij.
BW = Ω( G / M^{1/2} )
BW = Ω( G / (P · M^{1/2}) )   in the P-parallel model,
where G = |{ g_{i,j,k} : (i,j) ∈ S, k ∈ S_ij }|.
30
Expansion (3rd approach)
Computation DAG
For a given run (Algorithm, Machine, Input)
1. Consider the computation DAG: G = (V, E)
V = set of computations and inputs
E = dependencies
2. Partition G into segments S of Θ( M^{ω₀/2} ) vertices
   (corresponding to adjacency in time / location).
[Figure: the computation DAG with input/output vertices, intermediate values, and dependency edges; a segment S with its read set R_S and write set W_S.]
3. Show that every S has ≥ 3M vertices with incoming / outgoing edges
   ⇒ S performs ≥ M read/writes.
4. The total communication BW is
   BW = BW of one segment × #segments
      = Ω(M) × O( n^{ω₀} ) / Θ( M^{ω₀/2} )
      = Ω( n^{ω₀} / M^{ω₀/2 - 1} )
31
Is it a Good Expander?
Break G into edge-disjoint graphs,
corresponding to the algorithm on M^{1/2} x M^{1/2} matrices.
Consider the expansions of S in each part (they sum up):
|E_G(S, V \ S)| ≥ Σᵢ |E_{Gᵢ}(Sᵢ, Vᵢ \ Sᵢ)| ≥ Σᵢ h(Gᵢ) · d · |Sᵢ| ≥ h(G(M^{1/2})) · d · |S|
BW = Ω(T(n)) · h(G(M^{1/2}))
BW = Ω(T(n)) · h_t(G(M^{1/2}))
[Figure: the CDAG decomposed into edge-disjoint copies G₁, G₂, … of the CDAG on M^{1/2} x M^{1/2} matrices, with a segment S meeting them in S₁, S₂, S₃, S₄, S₅, ….]
We need to show that M^{ω₀/2} vertices expand to Ω(M):
h(G(n)) = Ω( M / M^{ω₀/2} ) for n = Θ( M^{1/2} ).
Namely, for every n, h(G(n)) = Ω( n² / n^{lg 7} ) = Ω( (4/7)^{lg n} ).
32
Estimating the edge expansion – combinatorially
[Figure: a segment S spread across the recursion levels, partitioned into S₁, S₂, S₃, …, S_k; each vertex is "in S", "not in S", or "mixed".]
• Dec₁C is a consistency gadget:
  "Mixed" pays ≥ 1/12 of its edges.
• The fraction of S vertices is consistent
  between the 1st level and the four 2nd levels
  (deviations pay linearly).
33
Estimating the BW - by Spectral-Gap
Estimating the spectrum of recursively constructed graphs is extremely
useful, e.g.,
• The Zig-Zag construction [Reingold, Vadhan, Wigderson 00]
• The PCP proof [Dinur 07]
• The SL = L proof [Reingold 08]
• The Quantum-expander construction [Ben-Aroya, S., Ta-Shma 08]
• The Ramanujan-expander Zig-Zag construction [Ben-Aroya, Ta-Shma 08]
• …
The additional difficulty here is the non-uniformity:
The replacement product is performed differently
on multiplication vertices vs. addition vertices.
34
The DAG of Strassen
[Figure: the CDAG of Strassen: the n² entries of A and B are encoded by Enc_{lg n}A and Enc_{lg n}B (depth lg n), feed the multiplications, and Dec_{lg n}C combines them into the n² entries of C.]
1. Compute weighted sums of A's elements.
2. Compute weighted sums of B's elements.
3. Compute the multiplications m₁, m₂, …, m_{n^{lg 7}}.
4. Compute weighted sums of m₁, m₂, …, m_{n^{lg 7}} to obtain C.
35
Reduction
(1st approach)
[Ballard, Demmel, Holtz, S. 2009a]
Thm:
Cholesky decomposition is
(communication-wise) as hard as matrix-multiplication
Proof:
By a reduction (from matrix-multiplication) that
preserves communication bandwidth, latency, and arithmetic.
Cor:
Any classical O(n3) algorithm for Cholesky decomposition requires:
Bandwidth: Ω( n³ / M^{1/2} )
Latency: Ω( n³ / M^{3/2} )
(similar cor. for the parallel model).
36
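A sketch of the algebra behind the reduction, as we recall it from [Ballard, Demmel, Holtz, S. 2009a] (the exact construction, and the careful accounting of the cost of forming T, are in the paper; the choice of D below is only one way to keep T positive definite):

```latex
\[
  T =
  \begin{pmatrix}
    I      & A^{T}       & -B \\
    A      & I + AA^{T}  & 0  \\
    -B^{T} & 0           & D
  \end{pmatrix}
  =
  \begin{pmatrix}
    I      & 0        & 0 \\
    A      & I        & 0 \\
    -B^{T} & (AB)^{T} & I
  \end{pmatrix}
  \begin{pmatrix}
    I & A^{T} & -B \\
    0 & I     & AB \\
    0 & 0     & I
  \end{pmatrix},
  \qquad D = I + B^{T}B + (AB)^{T}(AB).
\]
% With this D, T is symmetric positive definite and the lower factor above is its
% Cholesky factor; its (3,2) block is (A·B)^T, so any Cholesky routine applied to T
% computes the product A·B along the way.
```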
Geometric Embedding (2nd approach)
[Ballard, Demmel, Holtz, S. 2011a]
Follows [Irony,Toledo,Tiskin 04], based on [Loomis & Whitney 49]
(1) Generalized form:
For all (i,j) ∈ S,
C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)),
               g_{i,j,k2}(A(i,k2), B(k2,j)),
               …,
               other arguments ),
where k1, k2, … ∈ S_ij.
Many algorithms agree with Form (1).
• Some immediately, e.g.,
  classic matrix multiplication (sparse included!),
  LU, Cholesky, LDLᵀ factorizations,
  All-Pairs-Shortest-Paths, Floyd–Warshall, …
• Some need careful arguments to follow Form (1), e.g.,
  QR, SVD, …
37
Geometric Embedding (2nd approach)
[Ballard, Demmel, Holtz, S. 2011a]
Follows [Irony,Toledo,Tiskin 04], based on [Loomis & Whitney 49]
(1) Generalized form:
For all (i,j) ∈ S,
C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)),
               g_{i,j,k2}(A(i,k2), B(k2,j)),
               …,
               other arguments ),
where k1, k2, … ∈ S_ij.
[Figure: a box of dimensions x × y × z inside the computation cube V, together with its projections ("shadows") onto the A, B, and C faces.]
Volume of box:
V = x·y·z = ( xz · zy · yx )^{1/2}
Thm (Loomis & Whitney, 1949):
Volume of a 3D set:
V ≤ ( area(A shadow) · area(B shadow) · area(C shadow) )^{1/2}
38
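Applied to one segment of the matrix-multiplication cube (our own worked step, matching the segment argument earlier): if only O(M) entries of each of A, B, and C are available during a segment, each shadow has area O(M), so

```latex
\[
  \#\{\text{$g_{ijk}$'s computed in one segment}\}
  \;\le\; \sqrt{\,O(M)\cdot O(M)\cdot O(M)\,}
  \;=\; O\!\big(M^{3/2}\big)
  \;\Rightarrow\;
  BW \;=\; \Omega\!\Big(\frac{n^{3}}{\sqrt{M}}\Big).
\]
```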
Geometric Embedding (2nd approach)
[Ballard, Demmel, Holtz, S. 2011a]
Follows [Irony,Toledo,Tiskin 04], based on [Loomis & Whitney 49]
(1) Generalized form:
For all (i,j) ∈ S,
C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)),
               g_{i,j,k2}(A(i,k2), B(k2,j)),
               …,
               other arguments ),
where k1, k2, … ∈ S_ij.
But many algorithms just don’t fit
the generalized form!
For example:
Strassen’s fast matrix
multiplication
39
Dense Linear Algebra: Sequential Model
Lower bounds [Ballard, Demmel, Holtz, S. 11]:
Bandwidth = Ω( n³ / M^{1/2} ),   Latency = Ω( n³ / M^{3/2} )
Attaining algorithms:
• Matrix multiplication: [Frigo, Leiserson, Prokop, Ramachandran 99]
• Cholesky: [Ahmed, Pingali 00], [Ballard, Demmel, Holtz, S. 09]
• LU: [Toledo 97], [DGX 08]
• QR: [EG 98], [DGHL 08a]
• Symmetric eigenvalues: [Ballard, Demmel, Dumitriu 10]
• SVD: [Ballard, Demmel, Dumitriu 10]
• (Generalized) nonsymmetric eigenvalues: [Ballard, Demmel, Dumitriu 10]
40
Dense 2D parallel algorithms
• Assume n x n matrices on P processors, memory per processor = O( n² / P )
• ScaLAPACK assumes best block size b chosen
• Many references (see reports), blue are new
• Recall lower bounds:
  #words_moved = Ω( n² / P^{1/2} )   and   #messages = Ω( P^{1/2} )

Algorithm         Reference     Factor exceeding lower       Factor exceeding lower
                                bound for #words_moved       bound for #messages
Matrix multiply   [Cannon, 69]  1                            1
Cholesky          ScaLAPACK     log P                        log P
LU                [GDX08]       log P                        log P
                  ScaLAPACK     log P                        (N / P^{1/2}) · log P
QR                [DGHL08]      log P                        log³ P
                  ScaLAPACK     log P                        (N / P^{1/2}) · log P
Sym Eig, SVD      [BDD10]       log P                        log³ P
                  ScaLAPACK     log P                        N / P^{1/2}
Nonsym Eig        [BDD10]       log P                        log³ P
                  ScaLAPACK     P^{1/2} · log P              N · log P

Can these be improved?
New 2.5D parallel algorithms:
Matrix multiplication and LU decomposition
c · 3n² = M · P
c is the memory multiplicity factor
(c may be bounded by P):
[Solomonik, Demmel, 2011]:
Distinguished paper EuroPar'11
c^{1/2} times fewer words communicated
than [Cannon 69].
[Irony,Toledo,Tiskin 04],
[Ballard, Demmel, Holtz, S. 2011a]:
This is as good as it gets.
42
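Plugging the replicated memory into the parallel bound (our own arithmetic, using M = Θ(c·n²/P) from the slide):

```latex
\[
  \#\text{words} \;=\; \Omega\!\Big(\frac{n^{3}}{P\sqrt{M}}\Big)
  \;=\; \Omega\!\Big(\frac{n^{3}}{P\sqrt{c\,n^{2}/P}}\Big)
  \;=\; \Omega\!\Big(\frac{n^{2}}{\sqrt{c\,P}}\Big),
\]
% i.e. a factor of c^{1/2} fewer words than the c = 1 (2D / Cannon) case, \Omega(n^2 / P^{1/2}).
```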
The Fifth SIAM workshop on
Combinatorial Scientific Computing
Darmstadt, May 19-21 2011
Graph Expansion
and
Communication Costs
of
Fast Matrix Multiplication
Oded Schwartz1
Joint work with:
Grey Ballard1,
James Demmel1,
Olga Holtz1,2
Thank you!
http://bebop.cs.berkeley.edu/
43