23rd ACM Symposium on Parallelism in Algorithms and Architectures, San Jose, California, June 4-6, 2011

Graph Expansion and Communication Costs of Fast Matrix Multiplication
Oded Schwartz¹
Joint work with: Grey Ballard¹, James Demmel¹, Olga Holtz¹,²
1. UC Berkeley  2. TU Berlin

Results by:
• Grey Ballard, UCB
• Aydin Buluc, LBL
• Erin Carson, UCB
• James Demmel, UCB
• Jack Dongarra, UTK
• Ioana Dumitriu, U. Washington
• Andrew Gearhart, UCB
• Laura Grigori, INRIA
• Ming Gu, UCB Math
• Mark Hoemmen, Sandia NL
• Olga Holtz, UCB Math & TU Berlin
• Nicholas Knight, UCB
• Julien Langou, U. Colorado Denver
• Eran Rom, IBM Tel-Aviv
• Edgar Solomonik, UCB
• Many others…

Motivation
Algorithms have two kinds of costs:
• Arithmetic (FLOPs)
• Communication: moving data
  – between levels of a memory hierarchy (sequential case)
  – over a network connecting processors (parallel case)
Communication-avoiding algorithms save time and save energy.
[Figure: a sequential memory hierarchy (CPU, cache, M1, M2, M3, …, Mk) and a parallel machine (CPUs with local RAM, connected by a network)]

Motivation: the expected bottleneck
Annual hardware improvements show exponential growth, but with large gaps [Graham, Snir, Patterson 04], [Fuller, Millett 10]:

                      Annual improvement
  CPU (time/flop)           59%
                      Bandwidth    Latency
  DRAM                   23%          5%
  Network                26%         15%

Communication, and latency in particular, improves far more slowly than arithmetic, so it is the expected bottleneck.

Outline
Algorithms with the "flavor" of 3 nested loops:
• Lower bounds: sequential, hierarchy, parallel. [Ballard, Demmel, Holtz, S. 2009], [Ballard, Demmel, Holtz, S. 2011a], extending [Hong & Kung 81], [Irony, Toledo, Tiskin 04]
• Algorithms: sequential, parallel. Many contributions, mostly new.
Strassen-like algorithms:
• Lower bounds: sequential, hierarchy, parallel. This work.
• Algorithms: sequential, parallel. [Ballard, Demmel, Holtz, Rom, S. 2011]

Lower bounds for algorithms with the "flavor" of 3 nested loops
Matrix multiplication:
• Sequential [Hong & Kung 81]: BW = Ω(n³ / M^{1/2})
• Sequential and parallel [Irony, Toledo, Tiskin 04]: BW = Ω(n³ / (P·M^{1/2}))

Lower bounds for algorithms with the "flavor" of 3 nested loops
[Ballard, Demmel, Holtz, S. 2009], [Ballard, Demmel, Holtz, S. 2011a], following [Irony, Toledo, Tiskin 04].
The bounds BW = Ω(n³ / M^{1/2}) and BW = Ω(n³ / (P·M^{1/2})) extend to:
• BLAS, LU, Cholesky, LDLᵀ, and QR factorizations, eigenvalues and singular values — i.e., essentially all direct methods of linear algebra.
• Dense or sparse matrices (in sparse cases, the bandwidth bound is a function of NNZ).
• Bandwidth and latency.
• Sequential, hierarchical, and parallel models — distributed and shared memory.
• Compositions of linear algebra operations.
• Certain graph optimization problems.
• Tensor contractions [Demmel, Pearson, Poloni, Van Loan 11].
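The sequential bound Ω(n³ / M^{1/2}) is attained by blocked (tiled) matrix multiplication. The following is a minimal Python sketch, not the tuned implementations the talk refers to: with block size b ≈ (M/3)^{1/2}, the three b×b tiles fit in fast memory, so each of the (n/b)³ block products moves O(b²) words, for Θ(n³ / M^{1/2}) words moved in total.

```python
import numpy as np

def blocked_matmul(A, B, M):
    """Tiled matrix multiplication, C = A @ B (a sketch).

    With block size b = sqrt(M/3), the three b-by-b tiles of A, B, C fit
    in a fast memory of M words. Each block product reads/writes O(b^2)
    words, so the total traffic is (n/b)^3 * O(b^2) = O(n^3 / sqrt(M)),
    matching the [Hong & Kung 81] lower bound.
    """
    n = A.shape[0]
    b = max(1, int((M / 3) ** 0.5))  # three tiles must fit in fast memory
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # one block product: O(b^2) words moved
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C
```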
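As a standard worked step (assuming one copy of the data, M = Θ(n²/P), as in the lower-bound slides later), the parallel bound specializes to the familiar form attained by Cannon's algorithm, which appears in the tables at the end of this deck:

```latex
BW \;=\; \Omega\!\left(\frac{n^3}{P\,\sqrt{M}}\right)
   \;=\; \Omega\!\left(\frac{n^3}{P\,\sqrt{n^2/P}}\right)
   \;=\; \Omega\!\left(\frac{n^2}{\sqrt{P}}\right).
```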
Do conventional dense algorithms, as implemented in LAPACK and ScaLAPACK, attain these bounds?
Mostly not.
Are there other algorithms that do?
Mostly yes.

Motivation: a few example speedups, measured and predicted
[Demmel, Ballard, Hoemmen, Grigori, Langou, Dongarra, Anderson 10], [Anderson, Ballard, Demmel, Keutzer 10], [Bhatele, Demmel, Solomonik 11]
Measured: parallel TSQR
• Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
• Pentium III cluster, Dolphin interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
• BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
• Tesla C2050 / Fermi: up to 13x speedup (110,592 x 100)
Predicted: parallel 2.5D LU
• Exascale: up to 4.5x speedup (2¹⁸ nodes, 2²² x 2²² matrices)

Beyond 3 nested loops
What about the communication costs of algorithms that have a more complex structure?

Recall: Strassen's fast matrix multiplication [Strassen 69]
• Compute a 2 x 2 matrix multiplication using only 7 multiplications (instead of 8).
• Apply recursively (block-wise) to the n/2 x n/2 blocks of A, B, C:

M₁ = (A₁₁ + A₂₂)(B₁₁ + B₂₂)
M₂ = (A₂₁ + A₂₂) B₁₁
M₃ = A₁₁ (B₁₂ − B₂₂)
M₄ = A₂₂ (B₂₁ − B₁₁)
M₅ = (A₁₁ + A₁₂) B₂₂
M₆ = (A₂₁ − A₁₁)(B₁₁ + B₁₂)
M₇ = (A₁₂ − A₂₂)(B₂₁ + B₂₂)

C₁₁ = M₁ + M₄ − M₅ + M₇
C₁₂ = M₃ + M₅
C₂₁ = M₂ + M₄
C₂₂ = M₁ − M₂ + M₃ + M₆

flops(n) = 7·flops(n/2) + O(n²), hence flops(n) = Θ(n^{log₂ 7}).
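The seven products above translate directly into code. A minimal Python sketch of the recursion (for n a power of two, with a naive base case; not the tuned implementations referenced in the talk):

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's recursive matrix multiplication (n a power of two).

    Seven recursive half-size products plus O(n^2) additions per level,
    giving Theta(n^{log2 7}) flops in total.
    """
    n = A.shape[0]
    if n <= cutoff:                      # fall back to classical multiply
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty((n, n))
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```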
Strassen-like algorithms
• Compute an n₀ x n₀ matrix multiplication using only n₀^{ω₀} multiplications (instead of n₀³).
• Apply recursively (block-wise) to the n/n₀ x n/n₀ blocks:
flops(n) = n₀^{ω₀}·flops(n/n₀) + O(n²), hence flops(n) = Θ(n^{ω₀}).
Exponents ω₀:
• 2.81 [Strassen 69] — works fast in practice
• 2.79 [Pan 78]
• 2.78 [Bini 79]
• 2.55 [Schönhage 81]
• 2.50 [Pan, Romani; Coppersmith, Winograd 84]
• 2.48 [Strassen 87]
• 2.38 [Coppersmith, Winograd 90]
• 2.38 [Cohn, Kleinberg, Szegedy, Umans 05] — group-theoretic approach

New lower bound for Strassen's fast matrix multiplication
Main result — the communication bandwidth lower bound for Strassen-like algorithms:
• sequential: BW = Ω((n / M^{1/2})^{ω₀} · M)
• parallel:   BW = Ω((n / M^{1/2})^{ω₀} · M/P)
For Strassen's algorithm (ω₀ = log₂ 7):
• sequential: BW = Ω((n / M^{1/2})^{log₂ 7} · M)
• parallel:   BW = Ω((n / M^{1/2})^{log₂ 7} · M/P)
Recall, for the cubic (classical) algorithm:
• sequential: BW = Ω((n / M^{1/2})^{log₂ 8} · M)
• parallel:   BW = Ω((n / M^{1/2})^{log₂ 8} · M/P)
The parallel lower bound applies to:
• one copy of the data: M = Θ(n²/P)
• c copies of the data: M = Θ(c·n²/P)

Are these bounds attained?
For sequential and hierarchy models? Yes, existing implementations do!
For parallel? Yes, we think so.

Sequential and new parallel Strassen-like algorithms
• Sequential and hierarchy cases: attained by the natural recursive implementation. Also LU, QR, … (black-box use of fast matrix multiplication).
• [Ballard, Demmel, Holtz, S., Rom 2011]: a new parallel Strassen-like algorithm. Attains the lower bound (we think).
• This work: this is as good as it gets.

Communication lower bounds
Proving that your algorithm/implementation is as good as it gets. Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], this work

Expansion (3rd approach)
[Ballard, Demmel, Holtz, S. 2011b], in the spirit of [Hong & Kung 81]
Let G = (V, E) be a d-regular graph.
Edge expansion:
  h(G) = min over S ⊆ V, |S| ≤ |V|/2 of |E(S, V∖S)| / (d·|S|)
Small-set edge expansion:
  h_t(G) = min over S ⊆ V, |S| ≤ t of |E(S, V∖S)| / (d·|S|)
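To make the definitions concrete, here is a brute-force Python sketch (exponential in |V|, for toy graphs only; the example graph is illustrative and not from the paper) that evaluates h_t(G), and hence h(G) when t = |V|/2:

```python
from itertools import combinations

def edge_expansion(vertices, edges, d, t=None):
    """Brute-force h_t(G) = min_{|S| <= t} |E(S, V\\S)| / (d|S|).

    Checks every subset S of size at most t, so it is exponential in |V|:
    an illustration of the definition, not a practical algorithm.
    With t = |V|//2 this computes the ordinary edge expansion h(G).
    """
    V = list(vertices)
    if t is None:
        t = len(V) // 2
    best = float("inf")
    for size in range(1, t + 1):
        for S in combinations(V, size):
            S = set(S)
            # edges crossing the cut (S, V \ S)
            cut = sum(1 for (u, v) in edges if (u in S) != (v in S))
            best = min(best, cut / (d * len(S)))
    return best

# Example: the 4-cycle, a 2-regular graph; h = 2 / (2*2) = 0.5
V = [0, 1, 2, 3]
E = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(edge_expansion(V, E, d=2))
```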
Expansion (3rd approach)
The computation directed acyclic graph (CDAG):
• vertices: inputs/outputs and intermediate values
• edges: dependencies
For a subset S of the computation, R_S denotes the values S reads and W_S the values S writes.
Communication cost is graph expansion.

What is the CDAG of Strassen's algorithm?

The DAG of Strassen, n = 2
[Figure: an encoding layer Enc₁A of A's entries, an encoding layer Enc₁B of B's entries, the seven multiplication vertices M₁, …, M₇, and a decoding layer Dec₁C producing C's entries, with M₁, …, M₇ and C₁₁, …, C₂₂ as defined above]

The DAG of Strassen, n = 4
One recursive level:
• Each vertex splits into four.
• Multiply blocks.
[Figure: Dec₁C, Enc₁A, Enc₁B, each replicated over the four block entries]

The DAG of Strassen: further recursive steps
[Figure: after lg n levels — Enc_{lg n}A and Enc_{lg n}B on n² inputs feed the multiplication vertices, which feed Dec_{lg n}C on n² outputs]
Recursive construction — given Dec_iC, construct Dec_{i+1}C:
1. Duplicate it 4 times.
2. Connect the copies with a cross-layer of Dec₁C.

Expansion of a segment
Main technical challenges:
• Two types of vertices: with and without recursion.
• The graph is not regular.
Main lemma: for t = Θ(M^{ω₀/2}),
  h_t = Ω(M / M^{ω₀/2}).

Open problems
Find algorithms that attain the lower bounds:
• sparse matrix algorithms
• for sequential and parallel models
• that auto-tune or are cache-oblivious
Address complex heterogeneous hardware:
• lower bounds and algorithms [Demmel, Volkov 08], [Ballard, Demmel, Gearhart 11]
Extend the techniques to other algorithms and algorithmic tools:
• non-uniform recursive structure
Characterize a communication lower bound for a problem, rather than for an algorithm.

Graph Expansion and Communication Costs of Fast Matrix Multiplication
Oded Schwartz¹, joint work with Grey Ballard¹, James Demmel¹, Olga Holtz¹,²
Thank you!
http://bebop.cs.berkeley.edu/

EXTRA SLIDES

Upper bounds — supporting data structures
Top (column-major): Full, Old Packed, Rectangular Full Packed.
Bottom (block-contiguous): Blocked, Recursive Format, Recursive Full Packed.
[Frigo, Leiserson, Prokop, Ramachandran 99], [Ahmed, Pingali 00], [Elmroth, Gustavson, Jonsson, Kagstrom 04]

Geometric embedding (2nd approach)
(1) Generalized form: for all (i,j) ∈ S,
C(i,j) = f_ij( g_{i,j,k₁}(A(i,k₁), B(k₁,j)), g_{i,j,k₂}(A(i,k₂), B(k₂,j)), …, other arguments ),  k₁, k₂, … ∈ S_ij.
Thm [Ballard, Demmel, Holtz, S. 2009b]: if an algorithm agrees with the generalized form, then
BW = Ω(G / M^{1/2}),  and  BW = Ω(G / (P·M^{1/2})) in the P-parallel model,
where G = |{ g(i,j,k) : (i,j) ∈ S, k ∈ S_ij }|.

Geometric embedding (2nd approach)
For a given run (algorithm, machine, input):
1. Partition the computation into segments, each performing M reads/writes.
2. Any segment S has O(M) inputs/outputs.
3. Show that a segment performs at most F(M) FLOPs (evaluations of the g_{ijk}).
4. The total communication is then
   BW = (BW of one segment) × (#segments) ≥ M · G / F(M).
[Figure: an execution timeline partitioned into segments S₁, S₂, S₃, … of M reads/writes each; example with M = 3]
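As a worked instance of this segment argument (a standard derivation, not a new claim; it uses the Loomis–Whitney bound stated later in these slides): for classical matrix multiplication G = n³, and a segment with only O(M) available words can perform at most F(M) = O(M^{3/2}) of the g_{ijk} multiplications, so

```latex
BW \;\ge\; M \cdot \frac{G}{F(M)}
   \;=\; M \cdot \frac{n^3}{O(M^{3/2})}
   \;=\; \Omega\!\left(\frac{n^3}{M^{1/2}}\right),
```

recovering the [Hong & Kung 81] bound.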
Applications
Any algorithm that agrees with the generalized form (1) inherits the bounds
BW = Ω(G / M^{1/2}) and BW = Ω(G / (P·M^{1/2})), with G = |{ g(i,j,k) : (i,j) ∈ S, k ∈ S_ij }|.

Expansion (3rd approach)
For a given run (algorithm, machine, input):
1. Consider the computation DAG G = (V, E): V = the computations and inputs, E = the dependencies.
2. Partition G into segments S of Θ(M^{ω₀/2}) vertices (corresponding to adjacency in time/location).
3. Show that every segment S has Ω(M) vertices with incoming/outgoing edges, hence performs Ω(M) reads/writes.
4. The total communication is
   BW = (BW of one segment) × (#segments) = Ω(M) · Θ(n^{ω₀}) / Θ(M^{ω₀/2}) = Ω(n^{ω₀} / M^{ω₀/2 − 1}).

Is it a good expander?
Break G into edge-disjoint subgraphs Gᵢ, each corresponding to the algorithm run on M^{1/2} x M^{1/2} matrices, and consider the expansion of S within each part; the contributions sum up:
  |E_G(S, S̄)| ≥ Σᵢ |E_{Gᵢ}(Sᵢ, S̄ᵢ)| ≥ Σᵢ h(Gᵢ)·d·|Sᵢ|,
so BW = Ω(T(n) · h(G(M^{1/2}))).
[Figure: a segment S spread across parts S₁, S₂, S₃, S₄, S₅]
We need to show that sets of Θ(M^{ω₀/2}) vertices expand to Ω(M):
  h(G(n)) = Ω(M / M^{ω₀/2}) for n = Θ(M^{1/2}).
Namely, for every n,
  h(G(n)) = Ω(n² / n^{ω₀}) = Ω((4/7)^{lg n})  (for Strassen, ω₀ = lg 7).

Estimating the edge expansion — combinatorially
[Figure: the levels S₁, S₂, …, S_k of the recursive decoder, k = lg M, with each part classified as "in S", "not in S", or mixed]
• Dec₁C is a consistency gadget: a mixed part pays at least 1/12 of its edges.
• The fraction of S vertices is consistent between the 1st level and the four 2nd levels (deviations pay linearly).

Estimating the BW — by spectral gap
Estimating the spectrum of recursively constructed graphs is extremely useful, e.g.:
• The Zig-Zag construction [Reingold, Vadhan, Wigderson 00]
• The PCP proof [Dinur 07]
• The SL = L proof [Reingold 08]
• The quantum-expander construction [Ben-Aroya, S., Ta-Shma 08]
• The Ramanujan-expander Zig-Zag construction [Ben-Aroya, Ta-Shma 08]
• …
The additional difficulty here is the non-uniformity: the replacement product is performed differently on multiplication vertices vs. addition vertices.

The DAG of Strassen
[Figure: Enc_{lg n}A and Enc_{lg n}B on the n² inputs, lg n levels, Dec_{lg n}C on the n² outputs]
1. Compute weighted sums of A's elements.
2. Compute weighted sums of B's elements.
3. Compute the multiplications m₁, m₂, …
4. Compute weighted sums of m₁, m₂, … to obtain C.

Reduction (1st approach)
[Ballard, Demmel, Holtz, S. 2009a]
Thm: Cholesky decomposition is (communication-wise) as hard as matrix multiplication.
Proof: by a reduction (from matrix multiplication) that preserves communication bandwidth, latency, and arithmetic.
Cor: any classical O(n³) algorithm for Cholesky decomposition requires:
• Bandwidth: Ω(n³ / M^{1/2})
• Latency: Ω(n³ / M^{3/2})
(A similar corollary holds in the parallel model.)

Geometric embedding (2nd approach)
[Ballard, Demmel, Holtz, S. 2011a], following [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]
Many algorithms agree with form (1):
• Some immediately, e.g., classical matrix multiplication (sparse included!), LU, Cholesky, LDLᵀ factorizations, All-Pairs-Shortest-Paths, Floyd–Warshall, …
• Some need careful arguments to follow form (1), e.g., QR, SVD, …

Geometric embedding (2nd approach)
[Figure: a 3D box of computation points (i, j, k) with its "A shadow", "B shadow", and "C shadow" projections]
Volume of a box: V = x·y·z = (xz · zy · yx)^{1/2}.
Thm (Loomis & Whitney, 1949): for a 3D set V,
  volume(V) ≤ (area(A shadow) · area(B shadow) · area(C shadow))^{1/2}.

But many algorithms just don't fit the generalized form!
For example: Strassen's fast matrix multiplication.

Dense linear algebra: sequential model
Lower bounds [Ballard, Demmel, Holtz, S. 11]: bandwidth Ω(n³ / M^{1/2}), latency Ω(n³ / M^{3/2}).
Attaining algorithms:
• Matrix multiplication: [Frigo, Leiserson, Prokop, Ramachandran 99]
• Cholesky: [Ahmed, Pingali 00], [Ballard, Demmel, Holtz, S. 09]
• LU: [Toledo 97], [DGX08]
• QR: [EG98], [DGHL08a]
• Symmetric eigenvalues: [Ballard, Demmel, Dumitriu 10]
• SVD: [Ballard, Demmel, Dumitriu 10]
• (Generalized) nonsymmetric eigenvalues: [Ballard, Demmel, Dumitriu 10]

Dense 2D parallel algorithms
• Assume n x n matrices on P processors, memory per processor = O(n² / P).
• ScaLAPACK assumes the best block size b is chosen.
• Many references (see the reports); the bracketed non-ScaLAPACK entries are the new results.
• Recall the lower bounds: #words_moved = Ω(n² / P^{1/2}) and #messages = Ω(P^{1/2}).

Algorithm        Reference     Factor exceeding       Factor exceeding
                               bound: #words_moved    bound: #messages
Matrix multiply  [Cannon, 69]  1                      1
Cholesky         ScaLAPACK     log P                  log P
LU               [GDX08]       log P                  log P
                 ScaLAPACK     log P                  (N/P^{1/2}) · log P
QR               [DGHL08]      log P                  log³ P
                 ScaLAPACK     log P                  (N/P^{1/2}) · log P
Sym Eig, SVD     [BDD10]       log P                  log³ P
                 ScaLAPACK     log P                  N / P^{1/2}
Nonsym Eig       [BDD10]       log P                  log³ P
                 ScaLAPACK     P^{1/2} · log P        N · log P

Can these be improved?

New 2.5D parallel algorithms: matrix multiplication and LU decomposition
c · 3n² = M·P, where c is the memory multiplicity factor (c is bounded by P^{1/3}).
[Solomonik, Demmel 2011] (distinguished paper, EuroPar'11): c^{1/2} times fewer words communicated than [Cannon 69].
[Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]: this is as good as it gets.

The Fifth SIAM Workshop on Combinatorial Scientific Computing, Darmstadt, May 19-21, 2011
Graph Expansion and Communication Costs of Fast Matrix Multiplication
Oded Schwartz¹, joint work with Grey Ballard¹, James Demmel¹, Olga Holtz¹,²
Thank you!
http://bebop.cs.berkeley.edu/