Brief Announcement: Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds
Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz
UC Berkeley
SPAA, June 25, 2012, Pittsburgh, PA
Research supported by Microsoft (Award #024263) and Intel (Award #024894) funding and by matching funding by U.C. Discovery (Award #DIG07-10227). Additional support comes from Par Lab affiliates National Instruments, NEC, Nokia, NVIDIA, and Samsung. Research is also supported by DOE grants DE-SC0003959, DE-SC0004938, and DE-AC02-05CH11231; the Sofja Kovalevskaja programme of the Alexander von Humboldt Foundation; and by the National Science Foundation under agreement DMS-0635607.
Talk Summary
A new method for proving communication lower bounds of parallel algorithms
  Independent of memory size
  Applies to classical matrix multiplication
  Applies to parallel Strassen's matrix multiplication
  And to other algorithms?
These memory-independent bounds impose a limit on strong scaling,
because increasing P at a fixed problem size increases the total memory available relative to the problem.
The strong scaling ranges of the bounds match those attained by recent algorithms.
Communication and Execution Time Models
By communication we mean moving data between processors on a distributed-memory computer.
[Figure: a distributed-memory machine with P processors, each with its own local memory, connected by a network]
P = Number of processors
M = Memory per processor (in words)
n = Matrix dimension
We model execution time as a sum of computation and communication terms, counted along the critical path:
T = (# flops) · γ + (# words moved) · β,
where γ is the time per flop and β is the time per word moved.
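As a minimal sketch, this model is easy to evaluate in Python; the values chosen for γ and β below are hypothetical machine parameters, used only to illustrate how the two terms combine.

```python
# Minimal sketch of the critical-path time model T = flops*gamma + words*beta.
# gamma and beta are hypothetical machine parameters; real values depend
# on the hardware and the network.

def execution_time(flops, words, gamma=1e-11, beta=1e-9):
    """Modeled time: computation term plus communication term."""
    return flops * gamma + words * beta

if __name__ == "__main__":
    n, P = 2**14, 64
    # Classical matrix multiplication does about 2*n^3 flops in total,
    # so roughly 2*n^3 / P per processor along the critical path.
    print(execution_time(flops=2 * n**3 / P, words=n**2 / P**0.5))
```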
Strong scaling of communication for matrix multiplication
[Figure: log-log plot of communication volume vs. P for classical and Strassen matrix multiplication; each curve falls off linearly in 1/P up to a limit, P_min^{3/2} for classical and P_min^{ω0/2} for Strassen, beyond which it falls off more slowly]
Strong scaling
Strong scaling is increasing P for a fixed problem size.

Definition
A parallel algorithm exhibits perfect strong scaling if its execution time is linear in 1/P.

Example
The conventional parallel matrix multiplication algorithms (e.g., Cannon [Can69], SUMMA [vdGW97]) do not exhibit perfect strong scaling:
T_2D = (n³/P)·γ + (n²/√P)·β

Example
The new "2.5D" matrix multiplication algorithm [MT99, SD11] does exhibit perfect strong scaling:
T_2.5D = (n³/P)·γ + (n³/(P√M))·β
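A small numerical sketch of the difference (hypothetical parameter values, Θ-constants ignored): multiplying each modeled time by P shows that T_2.5D·P stays flat as P grows, while the communication term makes T_2D·P grow like √P.

```python
# Compare the 2D and 2.5D time models under strong scaling.
# gamma, beta, n, and M are hypothetical; constant factors are ignored.
from math import sqrt

gamma, beta = 1e-11, 1e-9   # time per flop, time per word (assumed)
n, M = 2**14, 2**22         # matrix dimension, words of memory per processor

for P in [64, 256, 1024, 4096]:
    t2d  = (n**3 / P) * gamma + (n**2 / sqrt(P)) * beta
    t25d = (n**3 / P) * gamma + (n**3 / (P * sqrt(M))) * beta
    # Under perfect strong scaling, T * P is constant in P.
    print(f"P={P:5d}  T_2D*P={t2d * P:.3e}  T_2.5D*P={t25d * P:.3e}")
```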
Strong scaling result: classical matrix multiplication

Let P_min = Θ(n²/M) be the minimum number of processors required to store the input and output matrices.

Theorem ([Here])
A parallel algorithm can exhibit perfect strong scaling only in the range
P_min ≤ P ≤ P_min^{3/2}
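Where P_min comes from (a one-line sketch, with the constant for three n × n matrices suppressed): the inputs and output must fit in the machine's aggregate memory.

```latex
% A, B, and C together occupy about 3n^2 words, which must fit in
% the combined memory of the P processors: P \cdot M \geq 3n^2, so
\[
  P_{\min} = \Theta\!\left(\frac{n^2}{M}\right).
\]
```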
Lemma: classical matrix multiplication

Lemma ([HK81, ITT04])
If a processor does F flops (of classical matrix multiplication), it must have access to Ω(F^{2/3}) data.

Proof using [LW49].
[Figure: the n × n × n computation cube with axes x, y, z and faces corresponding to A, B, and C; the set V of multiplications done by one processor projects onto the three faces]
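A compressed version of the argument from [ITT04], with constants suppressed:

```latex
% Model the F multiplications done by one processor as a set V of
% lattice points (i,j,k) in the n x n x n cube; the projections of V
% onto the three coordinate planes are exactly the entries of A, B,
% and C that the processor must access. Loomis-Whitney [LW49] gives
\[
  F = |V| \le \sqrt{|V_x|\,|V_y|\,|V_z|}.
\]
% If the processor has access to D words, each projection has size
% at most D, so F \le D^{3/2}, i.e.
\[
  D = \Omega\!\left(F^{2/3}\right).
\]
```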
Communication lower bounds: classical matrix multiplication

Theorem (Memory-Dependent [ITT04])
Any parallel algorithm must communicate Ω(n³/(P√M)) words.
Perfect strong scaling is possible!

Theorem (Memory-Independent [ACS90, SD11], [Here])
Any parallel algorithm must communicate Ω(n²/P^{2/3}) words.
Perfect strong scaling is not possible!

Perfect strong scaling is possible only when the memory-dependent bound dominates the memory-independent bound.
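Setting the two bounds equal shows exactly where the crossover happens (constants suppressed):

```latex
% The memory-dependent bound dominates while
%   n^3 / (P \sqrt{M}) \;\geq\; n^2 / P^{2/3},
% i.e. while P^{1/3} \leq n / \sqrt{M}. Cubing both sides gives
\[
  P \;\le\; \left(\frac{n^2}{M}\right)^{3/2} = P_{\min}^{3/2},
\]
% which is the upper end of the perfect strong scaling range.
```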
Extension to Strassen

Lemma (Strassen [Here], extending [BDHS11])
To perform F flops, a processor needs Ω(F^{2/ω0}) data, where ω0 = log2 7 ≈ 2.81 is the exponent of Strassen's algorithm.
Proof by graph expansion analysis (SPAA '11 Best Paper Award).
[Figure: the computation graph of Strassen's algorithm used in the expansion argument]

Lemma (Classical [ITT04])
To perform F flops, a processor needs Ω(F^{2/3}) data.
Proof by geometric embedding.
[Figure: the computation cube of classical matrix multiplication, as on the previous slide]
Extension to Strassen

                         Classical                    Strassen
Memory-Dependent LB      Ω(n³/(P·M^{1/2}))            Ω(n^{ω0}/(P·M^{ω0/2−1}))
Memory-Independent LB    Ω(n²/P^{2/3})                Ω(n²/P^{2/ω0})
Strong Scaling Range     P_min ≤ P ≤ P_min^{3/2}      P_min ≤ P ≤ P_min^{ω0/2}
Attained by              2.5D [MT99, SD11]            CAPS [MT99, BDH+12] (talk tomorrow)
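A short sketch evaluating both strong scaling ranges for a hypothetical problem size and per-processor memory (Θ-constants dropped):

```python
# Strong scaling limits implied by the memory-independent lower bounds.
# n and M are hypothetical; Theta-constants are ignored throughout.
from math import log2

n, M = 2**15, 2**22          # matrix dimension, words per processor (assumed)
omega0 = log2(7)             # exponent of Strassen's algorithm, ~2.81

p_min = n**2 / M             # processors needed just to hold the matrices
print(f"P_min            ~ {p_min:.0f}")
print(f"classical limit  ~ P_min^(3/2)      = {p_min**1.5:.0f}")
print(f"Strassen limit   ~ P_min^(omega0/2) = {p_min**(omega0 / 2):.0f}")
# Strassen's perfect strong scaling range is narrower: omega0/2 < 3/2.
```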
Strong scaling of communication for matrix multiplication
[Figure: log-log plot of communication volume vs. P. The classical curve follows the memory-dependent bound n³/(P·M^{1/2}) until P_min^{3/2}, then the memory-independent bound n²/P^{2/3}; the Strassen curve follows n^{ω0}/(P·M^{ω0/2−1}) until P_min^{ω0/2}, then n²/P^{2/ω0}. Both ranges begin at P_min.]
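The plotted curves are, up to constants, the pointwise maximum of the two lower bounds; a sketch of how one might tabulate them (hypothetical n and M):

```python
# Per-processor communication lower bound as a function of P: the larger
# of the memory-dependent and memory-independent bounds. n, M, and the
# sampled values of P are hypothetical; constants are dropped.
from math import log2

n, M = 2**15, 2**22
omega0 = log2(7)

def classical_words(P):
    return max(n**3 / (P * M**0.5), n**2 / P**(2/3))

def strassen_words(P):
    return max(n**omega0 / (P * M**(omega0/2 - 1)), n**2 / P**(2/omega0))

p_min = n**2 / M
for P in [p_min, p_min**1.25, p_min**1.5, p_min**2]:
    print(f"P={P:12.0f}  classical={classical_words(P):.3e}  "
          f"Strassen={strassen_words(P):.3e}")
```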
Brief Announcement: Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds
Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz
Thank You!
References

[ACS90] A. Aggarwal, A. K. Chandra, and M. Snir. Communication complexity of PRAMs. Theoretical Computer Science, 71(1):3–28, 1990.

[BDH+12] G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz. Communication-optimal parallel algorithm for Strassen's matrix multiplication, 2012. To appear in SPAA.

[BDHS11] G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Minimizing communication in numerical linear algebra. SIAM Journal on Matrix Analysis and Applications, 32(3):866–901, 2011.

[Can69] L. Cannon. A cellular computer to implement the Kalman filter algorithm. PhD thesis, Montana State University, Bozeman, MT, 1969.

[HK81] J. W. Hong and H. T. Kung. I/O complexity: the red-blue pebble game. In STOC '81: Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, pages 326–333, New York, NY, USA, 1981. ACM.

[ITT04] D. Irony, S. Toledo, and A. Tiskin. Communication lower bounds for distributed-memory matrix multiplication. Journal of Parallel and Distributed Computing, 64(9):1017–1026, 2004.

[LW49] L. H. Loomis and H. Whitney. An inequality related to the isoperimetric inequality. Bulletin of the AMS, 55:961–962, 1949.

[MT99] W. F. McColl and A. Tiskin. Memory-efficient matrix multiplication in the BSP model. Algorithmica, 24:287–297, 1999. doi:10.1007/PL00008264.

[SD11] E. Solomonik and J. Demmel. Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In Euro-Par '11: Proceedings of the 17th International European Conference on Parallel and Distributed Computing. Springer, 2011.

[vdGW97] R. A. van de Geijn and J. Watts. SUMMA: scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience, 9(4):255–274, 1997.
Strong scaling performance plot
[Figure: effective performance as a fraction of peak vs. number of cores (about 5e2 to 5e4), comparing CAPS, 2.5D, and 2D matrix multiplication, each with measured performance and its model prediction]