Brief Announcement: Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds

Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz
UC Berkeley
SPAA, June 25, 2012, Pittsburgh, PA

Research supported by Microsoft (Award #024263) and Intel (Award #024894) funding and by matching funding by U.C. Discovery (Award #DIG07-10227). Additional support comes from Par Lab affiliates National Instruments, NEC, Nokia, NVIDIA, and Samsung. Research is also supported by DOE grants DE-SC0003959, DE-SC0004938, and DE-AC02-05CH11231; the Sofja Kovalevskaja programme of the Alexander von Humboldt Foundation; and by the National Science Foundation under agreement DMS-0635607.

Talk Summary

- A new method for proving communication lower bounds for parallel algorithms, independent of the local memory size
- Applies to classical matrix multiplication
- Applies to parallel Strassen matrix multiplication
- And to other algorithms?
- These memory-independent bounds impose a limit on strong scaling, because strong scaling increases the memory available relative to the problem size
- The strong scaling ranges they allow match those attained by recent algorithms

Communication and Execution Time Models

By communication we mean moving data between processors on a distributed-memory computer, where each processor has its own local memory.

[Figure: P processors, each with a local memory, connected by a network.]

- P = number of processors
- M = memory per processor (in words)
- n = matrix dimension

We model execution time as a sum of computation and communication terms, counted along the critical path:
$$T = (\#\text{flops}) \cdot \gamma + (\#\text{words moved}) \cdot \beta$$

Strong scaling of communication for matrix multiplication

[Figure: log-log plot of communication volume versus P for classical and Strassen matrix multiplication, with transition points marked at $P_{\min}$, $P_{\min}^{\omega_0/2}$, and $P_{\min}^{3/2}$.]

Strong scaling

Strong scaling means increasing P for a fixed problem size.

Definition. A parallel algorithm exhibits perfect strong scaling if its execution time is linear in 1/P.

Example. The conventional parallel matrix multiplication algorithms (e.g., Cannon [Can69], SUMMA [vdGW97]) do not exhibit perfect strong scaling:
$$T_{2D} = \frac{n^3}{P} \cdot \gamma + \frac{n^2}{\sqrt{P}} \cdot \beta$$

Example. The new "2.5D" matrix multiplication algorithm [MT99, SD11] does exhibit perfect strong scaling:
$$T_{2.5D} = \frac{n^3}{P} \cdot \gamma + \frac{n^3}{P\sqrt{M}} \cdot \beta$$

Strong scaling result: classical matrix multiplication

Let $P_{\min} = \Theta(n^2/M)$ be the minimum number of processors required to store the input and output matrices.

Theorem ([Here]). A parallel algorithm can exhibit perfect strong scaling only in the range
$$P_{\min} \le P \le P_{\min}^{3/2}.$$

Lemma: classical matrix multiplication

Lemma ([HK81, ITT04]). If a processor performs F flops of classical matrix multiplication, it must have access to $\Omega(F^{2/3})$ data.

The proof uses the Loomis-Whitney inequality [LW49]: embed the multiplications in an $n \times n \times n$ cube indexed by $x$, $y$, and $z$; a set of multiplications of volume $V$ projects onto the three faces of the cube corresponding to entries of A, B, and C, and the product of the three projection areas is at least $V^2$.

[Figure: geometric embedding of the computation in a cube; the projections onto the three faces correspond to the required entries of A, B, and C.]
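A minimal numeric sketch (not from the slides) of the surface-to-volume intuition behind this lemma: a processor that computes a $b \times b \times b$ sub-block of the iteration space performs $F = b^3$ multiplications but touches only $3b^2 = 3F^{2/3}$ matrix entries, one $b \times b$ face each of A, B, and C.

```python
# Surface-to-volume illustration of the F^(2/3) lemma:
# a b x b x b block of the n^3 iteration space does F = b^3 flops
# while touching only 3*b^2 = Theta(F^(2/3)) matrix entries.

for b in [10, 100, 1000]:
    flops = b**3          # F: scalar multiplications in the block
    data = 3 * b**2       # entries of A, B, and C the block touches
    print(f"b={b:5d}  F={flops:.1e}  data={data:.1e}  F^(2/3)={flops**(2/3):.1e}")
```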
Communication lower bounds: classical matrix multiplication

Theorem (Memory-Dependent [ITT04]). Any parallel algorithm must communicate
$$\Omega\!\left(\frac{n^3}{P\sqrt{M}}\right) \text{ words,}$$
so perfect strong scaling is possible!

Theorem (Memory-Independent [ACS90, SD11], [Here]). Any parallel algorithm must communicate
$$\Omega\!\left(\frac{n^2}{P^{2/3}}\right) \text{ words,}$$
so perfect strong scaling is not possible!

Perfect strong scaling is possible only while the memory-dependent bound dominates the memory-independent bound. The two bounds cross where $n^3/(P\sqrt{M}) = n^2/P^{2/3}$, that is, at $P = (n^2/M)^{3/2} = P_{\min}^{3/2}$. This proves the theorem above: a parallel algorithm can exhibit perfect strong scaling only in the range $P_{\min} \le P \le P_{\min}^{3/2}$.

Extension to Strassen

Lemma (Strassen, [Here], extending [BDHS11]). To perform F flops of Strassen's algorithm, a processor needs $\Omega(F^{2/\omega_0})$ data, where $\omega_0 = \log_2 7 \approx 2.81$. The proof is by graph expansion analysis of the computation DAG (SPAA '11 Best Paper Award).

Lemma (Classical, [ITT04]). To perform F flops, a processor needs $\Omega(F^{2/3})$ data. The proof is by geometric embedding.

                        | Classical                            | Strassen
Memory-dependent LB     | $\Omega(n^3 / (P M^{1/2}))$          | $\Omega(n^{\omega_0} / (P M^{\omega_0/2 - 1}))$
Memory-independent LB   | $\Omega(n^2 / P^{2/3})$              | $\Omega(n^2 / P^{2/\omega_0})$
Strong scaling range    | $P_{\min} \le P \le P_{\min}^{3/2}$  | $P_{\min} \le P \le P_{\min}^{\omega_0/2}$
Attained by             | 2.5D [MT99, SD11]                    | CAPS [MT99, BDH+12] (talk tomorrow)

Strong scaling of communication for matrix multiplication

[Figure: log-log plot of communication volume versus P. The classical curve follows $n^3/(PM^{1/2})$ up to $P_{\min}^{3/2}$ and $n^2/P^{2/3}$ beyond it; the Strassen curve follows $n^{\omega_0}/(PM^{\omega_0/2-1})$ up to $P_{\min}^{\omega_0/2}$ and $n^2/P^{2/\omega_0}$ beyond it.]
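As an illustration (not from the slides), the sketch below evaluates the four lower bounds in the table with all constants dropped, and checks where the memory-independent bounds take over, ending the perfect strong scaling range. The values of n and M are arbitrary examples.

```python
import math

def classical_bounds(n, M, P):
    """Classical matmul communication lower bounds (constants omitted)."""
    mem_dep = n**3 / (P * math.sqrt(M))    # Omega(n^3 / (P M^(1/2)))
    mem_ind = n**2 / P**(2/3)              # Omega(n^2 / P^(2/3))
    return mem_dep, mem_ind

def strassen_bounds(n, M, P, w0=math.log2(7)):
    """Strassen matmul communication lower bounds (constants omitted)."""
    mem_dep = n**w0 / (P * M**(w0/2 - 1))  # Omega(n^w0 / (P M^(w0/2 - 1)))
    mem_ind = n**2 / P**(2/w0)             # Omega(n^2 / P^(2/w0))
    return mem_dep, mem_ind

n, M = 2**15, 2**20          # example problem size and memory per processor
w0 = math.log2(7)
P_min = n**2 / M             # processors needed just to store the matrices
print(f"P_min = {P_min:.0f}")
print(f"classical scaling limit P_min^(3/2)  = {P_min**1.5:.0f}")
print(f"Strassen  scaling limit P_min^(w0/2) = {P_min**(w0/2):.0f}")

for P in [int(P_min), int(P_min**(w0/2)), int(P_min**1.5), int(P_min**2)]:
    c_dep, c_ind = classical_bounds(n, M, P)
    s_dep, s_ind = strassen_bounds(n, M, P)
    # the effective lower bound is the larger of the two in each column
    print(f"P={P:>10d}  classical LB = {max(c_dep, c_ind):.2e}"
          f"  Strassen LB = {max(s_dep, s_ind):.2e} words")
```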
Thank You!

Brief Announcement: Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds
Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz

References

[ACS90] A. Aggarwal, A. K. Chandra, and M. Snir. Communication complexity of PRAMs. Theoretical Computer Science, 71(1):3–28, 1990.

[BDH+12] G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz. Communication-optimal parallel algorithm for Strassen's matrix multiplication. To appear in SPAA, 2012.

[BDHS11] G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Minimizing communication in numerical linear algebra. SIAM Journal on Matrix Analysis and Applications, 32(3):866–901, 2011.

[Can69] L. Cannon. A cellular computer to implement the Kalman filter algorithm. PhD thesis, Montana State University, Bozeman, MT, 1969.

[HK81] J. W. Hong and H. T. Kung. I/O complexity: the red-blue pebble game. In STOC '81: Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, pages 326–333, New York, NY, USA, 1981. ACM.

[ITT04] D. Irony, S. Toledo, and A. Tiskin. Communication lower bounds for distributed-memory matrix multiplication. Journal of Parallel and Distributed Computing, 64(9):1017–1026, 2004.

[LW49] L. H. Loomis and H. Whitney. An inequality related to the isoperimetric inequality. Bulletin of the AMS, 55:961–962, 1949.

[MT99] W. F. McColl and A. Tiskin. Memory-efficient matrix multiplication in the BSP model. Algorithmica, 24:287–297, 1999.

[SD11] E. Solomonik and J. Demmel. Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In Euro-Par '11: Proceedings of the 17th International European Conference on Parallel and Distributed Computing. Springer, 2011.

[vdGW97] R. A. van de Geijn and J. Watts. SUMMA: scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience, 9(4):255–274, 1997.

Strong scaling performance plot

[Figure: modeled and measured effective performance, as a fraction of peak (axis up to 1.4), versus number of cores (roughly 5e2 to 5e4), for the CAPS, 2.5D, and 2D algorithms, each with a model curve and measured data points.]
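A sketch of the kind of model behind this plot, under the talk's timing model $T = (\#\text{flops})\gamma + (\#\text{words})\beta$. This is not the authors' exact model: "effective performance" is taken here as classical flops ($2n^3$) divided by $P \cdot T$, as a fraction of the peak rate $1/\gamma$, and the machine parameters $\gamma$, $\beta$, $n$, and $M$ below are hypothetical placeholders. CAPS can exceed a fraction of 1.0 because Strassen's algorithm performs fewer than $2n^3$ flops.

```python
import math

gamma, beta = 1e-9, 1e-7   # assumed seconds per flop / per word moved
n, M = 2**14, 2**21        # assumed matrix dimension and words per processor
w0 = math.log2(7)          # Strassen exponent, log2(7) ~ 2.81

def model_time(flops, words):
    # critical-path time model from the talk: T = flops*gamma + words*beta
    return flops * gamma + words * beta

def effective_fraction(P, alg):
    if alg == "2D":        # Cannon/SUMMA: no data replication
        t = model_time(2 * n**3 / P, n**2 / math.sqrt(P))
    elif alg == "2.5D":    # classical flops, communication at the lower bound
        t = model_time(2 * n**3 / P,
                       max(n**3 / (P * math.sqrt(M)), n**2 / P**(2/3)))
    else:                  # CAPS: Strassen flop count and lower-bound traffic
        t = model_time(2 * n**w0 / P,
                       max(n**w0 / (P * M**(w0/2 - 1)), n**2 / P**(2/w0)))
    return (2 * n**3 / P) * gamma / t   # classical-equivalent fraction of peak

for P in [512, 2048, 8192, 32768]:
    print(P, {a: round(effective_fraction(P, a), 2)
              for a in ["2D", "2.5D", "CAPS"]})
```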