Minimizing Communication in Linear Algebra James Demmel 9 Nov 2009 1 Outline • • • • • Background – 2 Big Events Why avoiding communication is important Communication-optimal direct linear algebra Autotuning of sparse matrix codes Communication-optimal iterative linear algebra 2 Collaborators • • • • • • • • • • • • • Grey Ballard, UCB EECS Ioana Dumitriu, U. Washington Laura Grigori, INRIA Ming Gu, UCB Math Mark Hoemmen, UCB EECS Olga Holtz, UCB Math & TU Berlin Julien Langou, U. Colorado Denver Marghoob Mohiyuddin, UCB EECS Oded Schwartz , TU Berlin Hua Xiang, INRIA Sam Williams, LBL Vasily Volkov, UCB EECS Kathy Yelick, UCB EECS & NERSC Thanks to Intel, Microsoft, UC Discovery, NSF, DOE, … • BeBOP group, bebop.cs.berkeley.edu • ParLab, parlab.eecs.berkeley.edu 3 2 Big Events 1. “Multicore revolution”, requiring all software (where performance matters!) to change • • ParLab – Industry/State funded center with 15 faculty, ~50 grad students ( parlab.eecs.berkeley.edu ) Rare synergy between commercial and scientific computing 2. Establishment of a new graduate program in Computational Science and Engineering (CSE) • • 117 Faculty from 22 Departments 60 existing courses, more being developed • CS267 – Applications of Parallel Computers – www.cs.berkeley.edu/~demmel/cs267_Spr09 • 3 Day Parallel “Boot Camp” in August – parlab.eecs.berkeley.edu/2009bootcamp Parallelism Revolution is Happening Now • Chip density continuing to increase ~2x every 2 years • Clock speed is not • Number of processor cores may double instead • There is little or no more hidden parallelism (ILP) to be found • Parallelism must be exposed to and managed by software Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond) 5 The “7 Dwarfs” of High Performance Computing • Phil Colella (LBL) identified 7 kernels out of which most large scale simulation and data-analysis programs are composed: 1. Dense Linear Algebra • Ex: Solve Ax=b or Ax = λx where A is a dense matrix 2. Sparse Linear Algebra • Ex: Solve Ax=b or Ax = λx where A is a sparse matrix (mostly zero) 3. Operations on Structured Grids • Ex: Anew(i,j) = 4*A(i,j) – A(i-1,j) – A(i+1,j) – A(i,j-1) – A(i,j+1) 4. Operations on Unstructured Grids • Ex: Similar, but list of neighbors varies from entry to entry 5. Spectral Methods • Ex: Fast Fourier Transform (FFT) 6. N-Body (aka Particle Methods) • Ex: Compute electrostatic forces using Fast Multiple Method 7. Monte Carlo (aka MapReduce) • Ex: Many independent simulations using different inputs 6 Motif/Dwarf: Common Computational Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 HPC ML Games DB SPEC Embed (Red Hot Blue Cool) Health Image Speech Music Browser Finite State Mach. Combinational Graph Traversal Structured Grid Dense Matrix Sparse Matrix Spectral (FFT) Dynamic Prog N-Body MapReduce Backtrack/ B&B Graphical Models Unstructured Grid www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf 2 Big Events 1. “Multicore revolution”, requiring all software (where performance matters!) to change • • ParLab – Industry/State funded center with 15 faculty, ~50 grad students ( parlab.eecs.berkeley.edu ) Rare synergy between commercial and scientific computing 2. 
Establishment of a new graduate program in Computational Science and Engineering (CSE) at UCB • • 117 Faculty from 22 Departments 60 existing courses, more being developed • CS267 – Applications of Parallel Computers – www.cs.berkeley.edu/~demmel/cs267_Spr09 • 3 Day Parallel “Boot Camp” in August – parlab.eecs.berkeley.edu/2009bootcamp Why avoiding communication is important (1/2) Algorithms have two costs: 1.Arithmetic (FLOPS) 2.Communication: moving data between • levels of a memory hierarchy (sequential case) • processors over a network (parallel case). CPU Cache CPU DRAM CPU DRAM CPU DRAM CPU DRAM DRAM 9 Why avoiding communication is important (2/2) • Running time of an algorithm is sum of 3 terms: • • • # flops * time_per_flop # words moved / bandwidth # messages * latency communication • Time_per_flop << 1/ bandwidth << latency • Gaps growing exponentially with time Annual improvements Time_per_flop 59% Bandwidth Latency Network 26% 15% DRAM 23% 5% • Goal : reorganize linear algebra to avoid communication • Between all memory hierarchy levels • L1 L2 DRAM network, etc • Not just hiding communication (speedup 2x ) • Arbitrary speedups possible 10 Naïve Sequential Matrix Multiply: C = C + A*B for i = 1 to n {read row i of A into fast memory, n2 reads} for j = 1 to n {read C(i,j) into fast memory, n2 reads} {read column j of B into fast memory, n3 reads} for k = 1 to n C(i,j) = C(i,j) + A(i,k) * B(k,j) {write C(i,j) back to slow memory, n2 writes} n3 + O(n2) reads/writes altogether C(i,j) A(i,:) C(i,j) = + * B(:,j) 11 Less Communication with Blocked Matrix Multiply • Blocked Matmul C = A·B explicitly refers to subblocks of A, B and C of dimensions that depend on cache size … Break Anxn, Bnxn, Cnxn into bxb blocks labeled A(i,j), etc … b chosen so 3 bxb blocks fit in cache for i = 1 to n/b, for j=1 to n/b, for k=1 to n/b C(i,j) = C(i,j) + A(i,k)·B(k,j) … b x b matmul, 4b2 reads/writes • (n/b)3 · 4b2 = 4n3/b reads/writes altogether • Minimized when 3b2 = cache size = M, yielding O(n3/M1/2) reads/writes • What if we had more levels of memory? (L1, L2, cache etc)? • Would need 3 more nested loops per level 12 Blocked vs Cache-Oblivious Algorithms • Blocked Matmul C = A·B explicitly refers to subblocks of A, B and C of dimensions that depend on cache size … Break Anxn, Bnxn, Cnxn into bxb blocks labeled A(i,j), etc … b chosen so 3 bxb blocks fit in cache for i = 1 to n/b, for j=1 to n/b, for k=1 to n/b C(i,j) = C(i,j) + A(i,k)·B(k,j) … b x b matmul … another level of memory would need 3 more loops • Cache-oblivious Matmul C = A·B is independent of cache Function C = MM(A,B) If A and B are 1x1 C=A·B else … Break Anxn, Bnxn, Cnxn into (n/2)x(n/2) blocks labeled A(i,j), etc for i = 1 to 2, for j = 1 to 2, for k = 1 to 2 C(i,j) = C(i,j) + MM( A(i,k), B(k,j) ) … n/2 x n/2 matmul 13 Direct linear algebra: Prior Work on Matmul • Assume n3 algorithm (i.e. 
not Strassen-like) • Sequential case, with fast memory of size M • Lower bound on #words moved to/from slow memory = (n3 / M1/2 ) [Hong, Kung, 81] • Attained using blocked or cache-oblivious algorithms • Parallel case on P processors: • Let NNZ be total memory needed; assume load balanced • Lower bound on #words communicated = (n3 /(P· NNZ )1/2 ) [Irony, Tiskin, Toledo, 04] NNZ (name of alg) Lower bound on #words Attained by (“2D alg”) (n2 / P1/2 ) [Cannon, 69] 3n2 P1/3 (“3D alg”) (n2 / P2/3 ) [Johnson,93] 3n2 14 New lower bound for all “direct” linear algebra Let M = “fast” memory size per processor #words_moved by at least one processor = (#flops / M1/2 ) #messages_sent by at least one processor = (#flops / M3/2 ) • Holds for • BLAS, LU, QR, eig, SVD, … • Some whole programs (sequences of these operations, no matter how they are interleaved, eg computing Ak) • Dense and sparse matrices (where #flops << n3 ) • Sequential and parallel algorithms • Some graph-theoretic algorithms (eg Floyd-Warshall) • See [BDHS09] – proof sketch if time! 15 Can we attain these lower bounds? • Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds? • Mostly not • If not, are there other algorithms that do? • Yes • Goals for algorithms: • Minimize #words = (#flops/ M1/2 ) • Minimize #messages = (#flops/ M3/2 ) • Need new data structures • Minimize for multiple memory hierarchy levels • Cache-oblivious algorithms would be simplest • Fewest flops when matrix fits in fastest memory • Cache-oblivious algorithms don’t always attain this • Only a few sparse algorithms so far 16 Minimizing #messages requires new data structures • Need to load/store whole rectangular subblock of matrix with one “message” • Incompatible with conventional columnwise (rowwise) storage • Ex: Rows (columns) not in contiguous memory locations • Blocked storage: store as matrix of bxb blocks, each block stored contiguously • Ok for one level of memory hierarchy, what if more? • Recursive blocked storage: store each block using subblocks 17 Summary of dense sequential algorithms attaining communication lower bounds • Algorithms shown minimizing # Messages use (recursive) block layout • Many references (see reports), only some shown, plus ours • Cache-oblivious are underlined, Green are ours, ? is unknown/future work Algorithm 2 Levels of Memory #Words Moved BLAS-3 and # Messages Usual blocked or recursive algorithms Multiple Levels of Memory #Words Moved and #Messages Usual blocked algorithms (nested), or recursive [Gustavson,97] Cholesky LAPACK (with b = M1/2) [Gustavson 97] [BDHS09] [Gustavson,97] [Ahmed,Pingali,00] [BDHS09] (←same) (←same) LU with pivoting LAPACK (rarely) [Toledo,97] , [GDX 08] [GDX 08] not partial pivoting [Toledo, 97] [GDX 08]? [GDX 08]? QR LAPACK (rarely) [Elmroth,Gustavson,98] [DGHL08] [Frens,Wise,03] but 3x flops [DGHL08] [Elmroth, Gustavson,98] [DGHL08] ? [Frens,Wise,03] [DGHL08] ? 
[BDD09] [BDD09] Eig, SVD Not LAPACK [BDD09] randomized, but more flops Summary of dense 2D parallel algorithms attaining communication lower bounds • • • • Assume nxn matrices on P processors, memory per processor = O(n2 / P) ScaLAPACK assumes best block size b chosen Many references (see reports), Green are ours Recall lower bounds: #words_moved = ( n2 / P1/2 ) and #messages = ( P1/2 ) Algorithm Reference Factor exceeding lower bound for #words_moved Factor exceeding lower bound for #messages Matrix multiply [Cannon, 69] 1 1 Cholesky ScaLAPACK log P log P LU [GDX08] ScaLAPACK log P log P log P ( N / P1/2 ) · log P QR [DGHL08] ScaLAPACK log P log P log3 P ( N / P1/2 ) · log P Sym Eig, SVD [BDD09] ScaLAPACK log P log P log3 P N / P1/2 Nonsym Eig [BDD09] ScaLAPACK log P P1/2 · log P log3 P N · log P QR of a tall, skinny matrix is bottleneck; Use TSQR instead: W0 Q00 R00 Q10 W1 R10 W . W2 Q20 R20 W Q R 30 30 3 R00 R01 R10 Q01 . R Q11 R11 20 R 30 R01 Q02 R02 R11 Q is represented implicitly as a product (tree of factors) [DGHL08] Minimizing Communication in TSQR W0 W = W1 W2 W3 R00 R10 R20 R30 W0 Sequential: W = W1 W2 W3 R00 W0 W = W1 W2 W3 R00 R01 Parallel: Dual Core: R01 R02 R11 R01 R01 R11 R02 R02 R11 R03 R03 Multicore / Multisocket / Multirack / Multisite / Out-of-core: ? Can choose reduction tree dynamically Performance of TSQR vs Sca/LAPACK • Parallel – Intel Clovertown – Up to 8x speedup (8 core, dual socket, 10M x 10) – Pentium III cluster, Dolphin Interconnect, MPICH • Up to 6.7x speedup (16 procs, 100K x 200) – BlueGene/L • Up to 4x speedup (32 procs, 1M x 50) – Both use Elmroth-Gustavson locally – enabled by TSQR • Sequential – OOC on PowerPC laptop • As little as 2x slowdown vs (predicted) infinite DRAM • LAPACK with virtual memory never finished • See UC Berkeley EECS Tech Report 2008-89 • General CAQR = Communication-Avoiding QR in progress Modeled Speedups of CAQR vs ScaLAPACK Petascale up to 22.9x IBM Power 5 up to 9.7x “Grid” up to 11x Petascale machine with 8192 procs, each at 500 GFlops/s, a bandwidth of 4 GB/s. 2 1012 s, 105 s, 2 109 s / word . Randomized Rank-Revealing QR • Needed for eigenvalue and SVD algorithms • func [Q,R,V] = RRQR(A) • [V,R1] = qr(randn(n,n)) … V random with Haar measure • [Q,R] = qr(A · VT ) … so A = Q · R · V • Theorem (DDH). Provided that sr+1 << sr ~ s1 , RRQR produces a rank-revealing decomposition with high probability (when r = O(n), the probability is 1 – O(n-c) ). • Can apply to A = A1+/-1 A2+/-1 … Ak+/-1 • Do RRQR, followed by a series of (k-1) QR / RQ to propagate unitary/orthogonal matrix to the front • Analogous to [Stewart, 94] 24 What about o(n3) algorithms, like Strassen? 
• Thm (DDHK07)
• Given any matmul running in O(n^ω) ops for some ω > 2, it can be made stable and still run in O(n^(ω+η)) ops, for any η > 0
• Current record: ω ≈ 2.38
• Thm (DDH08)
• Given any stable matmul running in O(n^(ω+η)) ops, it is possible to do backward stable dense linear algebra in O(n^(ω+η)) ops
• Also reduces communication to O(n^(ω+η))
• But can do much better (work in progress…)
Direct Linear Algebra: summary and future work
• Communication lower bounds on #words_moved and #messages
• BLAS, LU, Cholesky, QR, eig, SVD, …
• Some whole programs (compositions of these operations, no matter how they are interleaved, eg computing A^k)
• Dense and sparse matrices (where #flops << n^3)
• Sequential and parallel algorithms
• Some graph-theoretic algorithms (eg Floyd-Warshall)
• Algorithms that attain these lower bounds
• Nearly all dense problems (some open problems)
• A few sparse problems
• Speedups in theory and practice
• Future work
• Lots to implement, tune
• Next generation of Sca/LAPACK on heterogeneous architectures (MAGMA)
• Few algorithms in sparse case
• Strassen-like algorithms
• 3D algorithms (only for Matmul so far), may be important for scaling
How hard is hand-tuning matmul, anyway?
• Results of 22 student teams trying to tune matrix multiply, in CS267 Spr09
• Students given “blocked” code to start with
• Still hard to get close to vendor-tuned performance (ACML)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/ (performance figures)
What part of the Matmul Search Space Looks Like (figure; one axis is the number of rows in the register block): a 2-D slice of a 3-D register-tile search space. The dark blue region was pruned. (Platform: Sun Ultra-IIi, 333 MHz, 667 Mflop/s peak, Sun cc v5.0 compiler)
Autotuning Matmul with ATLAS (n = 500), Source: Jack Dongarra (figure: MFLOPS of Vendor BLAS, ATLAS BLAS, and F77 BLAS across architectures from AMD Athlon 600 to UltraSparc 2200)
• ATLAS is faster than all other portable BLAS implementations, and it is comparable with machine-specific libraries provided by the vendor.
• ATLAS written by C. Whaley, inspired by PHiPAC, by Asanovic, Bilmes, Chin, D.
Automatic Performance Tuning
• Goal: let the machine do the hard work of writing fast code
• What does tuning of Matmul, dense BLAS, FFTs, signal processing, … have in common?
• Can do the tuning off-line: once per architecture, algorithm
• Can take as much time as necessary (hours, a week…)
• At run-time, algorithm choice may depend only on a few parameters (matrix dimensions, size of FFT, etc.)
• Can’t always do tuning off-line
• Algorithm and implementation may strongly depend on data only known at run-time
• Ex: Sparse matrix nonzero pattern determines both the best data structure and the implementation of sparse matrix-vector multiplication (SpMV)
• Part of the search for the best algorithm must be done (very quickly!)
at run-time Source: Accelerator Cavity Design Problem (from Ko via Husbands) Linear Programming Matrix … A Sparse Matrix You Use Every Day SpMV with Compressed Sparse Row (CSR) Storage Matrix-vector multiply kernel: y(i) y(i) + A(i,j)*x(j) for each row i for k=ptr[i] to ptr[i+1] do y[i] = y[i] + val[k]*x[ind[k]] Only 2 flops per 2 mem_refs: Limited by getting data from memory Example: The Difficulty of Tuning • n = 21200 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem Example: The Difficulty of Tuning • n = 21200 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem • 8x8 dense substructure: exploit this to limit #mem_refs Speedups on Itanium 2: The Need for Search 8x8 Mflop/s Best: 4x2 Reference Mflop/s Register Profile: Itanium 2 1190 Mflop/s 190 Mflop/s Power3 - 17% 252 Mflop/s Power4 - 16% 820 Mflop/s Register Profiles: IBM and Intel IA-64 122 Mflop/s Itanium 1 - 8% 247 Mflop/s 107 Mflop/s 459 Mflop/s Itanium 2 - 33% 1.2 Gflop/s 190 Mflop/s Ultra 2i - 11% 72 Mflop/s Ultra 3 - 5% 90 Mflop/s Register Profiles: Sun and Intel x86 35 Mflop/s Pentium III - 21% 108 Mflop/s 42 Mflop/s 50 Mflop/s Pentium III-M - 15% 122 Mflop/s 58 Mflop/s Another example of tuning challenges • More complicated non-zero structure in general • N = 16614 • NNZ = 1.1M Zoom in to top corner • More complicated non-zero structure in general • N = 16614 • NNZ = 1.1M 3x3 blocks look natural, but… • More complicated non-zero structure in general • Example: 3x3 blocking • Logical grid of 3x3 cells • But would lead to lots of “fill-in” Extra Work Can Improve Efficiency! • More complicated non-zero structure in general • Example: 3x3 blocking • • • • Logical grid of 3x3 cells Fill-in explicit zeros Unroll 3x3 block multiplies “Fill ratio” = 1.5 • On Pentium III: 1.5x speedup! • Actual mflop rate 1.52 = 2.25 higher Extra Work Can Improve Efficiency! • More complicated non-zero structure in general • Example: 3x3 blocking • • • • Logical grid of 3x3 cells Fill-in explicit zeros Unroll 3x3 block multiplies “Fill ratio” = 1.5 • On Pentium III: 1.5x speedup! 
• Actual mflop rate 1.52 = 2.25 higher • In general: sample matrix to choose rxc block structure using heuristic Source: Accelerator Cavity Design Problem (from Ko via Husbands) Post-RCM Reordering 100x100 Submatrix Along Diagonal “Microscopic” Effect of RCM Reordering Before: Green + Red After: Green + Blue “Microscopic” Effect of Combined RCM+TSP Reordering Before: Green + Red After: Green + Blue Speedups from 1.6x to 3.3x Summary of SpMV Performance Optimizations • Sequential SpMV • • • • • • • • Register blocking (RB): up to 4x over CSR Symmetry: 2.8x over CSR, 2.6x over RB Reordering to create dense structure + splitting: 2x over CSR Variable block splitting: 2.1x over CSR, 1.8x over RB Diagonals: 2x over CSR Cache blocking: 2.8x over CSR Multiple vectors (SpMM): 7x over CSR And combinations… OSKI • Multicore SpMV • NUMA, Affinity, SW Prefetch, Cache/LS/TLB blocking: 4x over naïve Pthreads • Sequential Sparse triangular solve • Hybrid sparse/dense data structure: 1.8x over CSR • Higher-level kernels • A·AT·x, AT·A·x: 4x over CSR, 1.8x over RB • More later … • All optimizations (being) automated in OSKI (bebop.cs.berkeley.edu) • Being adopted by Cray; Python interface in progress Avoiding Communication in Iterative Linear Algebra • k-steps of typical iterative solver for sparse Ax=b or Ax=λx • Does k SpMVs with starting vector • Finds “best” solution among all linear combinations of these k+1 vectors • Many such “Krylov Subspace Methods” • Conjugate Gradients, GMRES, Lanczos, Arnoldi, … • Goal: minimize communication in Krylov Subspace Methods • Assume matrix “well-partitioned,” with modest surface-to-volume ratio • Parallel implementation • Conventional: O(k log p) messages, because k calls to SpMV • New: O(log p) messages - optimal • Serial implementation • Conventional: O(k) moves of data from slow to fast memory • New: O(1) moves of data – optimal • Lots of speed up possible (modeled and measured) • Price: some redundant computation • Much prior work • See bebop.cs.berkeley.edu • CG: [van Rosendale, 83], [Chronopoulos and Gear, 89] • GMRES: [Walker, 88], [Joubert and Carey, 92], [Bai et al., 94] Minimizing Communication of GMRES to solve Ax=b • GMRES: find x in span{b,Ab,…,Akb} minimizing || Ax-b ||2 • Cost of k steps of standard GMRES vs new GMRES Standard GMRES for i=1 to k w = A · v(i-1) MGS(w, v(0),…,v(i-1)) update v(i), H endfor solve LSQ problem with H Communication-avoiding GMRES W = [ v, Av, A2v, … , Akv ] [Q,R] = TSQR(W) … “Tall Skinny QR” Build H from R, solve LSQ problem Sequential: #words_moved = O(k·nnz) from SpMV + O(k2·n) from MGS Parallel: #messages = O(k) from SpMV + O(k2 · log p) from MGS Sequential: #words_moved = O(nnz) from SpMV + O(k·n) from TSQR Parallel: #messages = O(1) from computing W + O(log p) from TSQR •Oops – W from power method, precision lost! 
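To make the two kernels above concrete, here is a minimal numpy sketch (not the CA-GMRES code itself) of building the monomial Krylov basis W and factoring it with TSQR; the flat reduction tree, the function names, and the use of numpy's QR as the local factorization are illustrative assumptions:

import numpy as np

def krylov_basis(A, v, k):
    # W = [v, A·v, ..., A^k·v]: the "monomial" basis, one SpMV per column
    W = np.empty((v.size, k + 1))
    W[:, 0] = v
    for j in range(k):
        W[:, j + 1] = A @ W[:, j]
    return W

def tsqr_R(W, nblocks=4):
    # Tall Skinny QR with a flat reduction tree: factor each row block locally,
    # stack the small R factors, and factor the stack once more.
    # Returns the same R (up to signs) as np.linalg.qr(W); Q stays implicit.
    Rs = [np.linalg.qr(B, mode='r') for B in np.array_split(W, nblocks, axis=0)]
    return np.linalg.qr(np.vstack(Rs), mode='r')

The precision loss noted above shows up here as the columns of W becoming nearly parallel as k grows.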
“Monomial” basis [Ax,…,Akx] fails to converge A different polynomial basis does converge Speed ups of GMRES on 8-core Intel Clovertown [MHDY09] Communication Avoiding Iterative Linear Algebra: Future Work • Lots of algorithms to implement, autotune • Make widely available via OSKI, Trilinos, PETSc, Python, … • Extensions to other Krylov subspace methods • So far just Lanczos/CG and Arnoldi/GMRES • BiCGStab, CGS, QMR, … • Add preconditioning • Block diagonal are easiest • Hierarchical, semiseparable, … • Works if interactions between distant points can be “compressed” Summary Don’t Communic… 58 1/3 2/3 3/3 Extra Slides 62 Proof of Communication Lower Bound on C = A·B (1/5) • Proof from Irony/Toledo/Tiskin (2004) • Original proof, then generalization • Think of instruction stream being executed • Looks like “ … add, load, multiply, store, load, add, …” • Each load/store moves a word between fast and slow memory • We want to count the number of loads and stores, given that we are multiplying n-by-n matrices C = A·B using the usual 2n3 flops, possibly reordered assuming addition is commutative/associative • Assuming that at most M words can be stored in fast memory • Outline: • Break instruction stream into segments, each with M loads and stores • Somehow bound the maximum number of flops that can be done in each segment, call it F • So F · # segments T = total flops = 2·n3 , so # segments T / F • So # loads & stores = M · #segments M · T / F 63 Illustrating Segments, for M=3 Load Load Segment 1 Load FLOP Store FLOP Segment 2 Load Load FLOP FLOP FLOP Store Store FLOP Store Load FLOP ... Time Load Segment 3 Proof of Communication Lower Bound on C = A·B (1/5) • Proof from Irony/Toledo/Tiskin (2004) • Original proof, then generalization • Think of instruction stream being executed • Looks like “ … add, load, multiply, store, load, add, …” • Each load/store moves a word between fast and slow memory • We want to count the number of loads and stores, given that we are multiplying n-by-n matrices C = A·B using the usual 2n3 flops, possibly reordered assuming addition is commutative/associative • Assuming that at most M words can be stored in fast memory • Outline: • Break instruction stream into segments, each with M loads and stores • Somehow bound the maximum number of flops that can be done in each segment, call it F • So F · # segments T = total flops = 2·n3 , so # segments T / F • So # loads & stores = M · #segments M · T / F 65 Proof of Communication Lower Bound on C = A·B (2/5) • Given segment of instruction stream with M loads & stores, how many adds & multiplies (F) can we do? • At most 2M entries of C, 2M entries of A and/or 2M entries of B can be accessed • Use geometry: • Represent n3 multiplications by n x n x n cube • One n x n face represents A • each 1 x 1 subsquare represents one A(i,k) • One n x n face represents B • each 1 x 1 subsquare represents one B(k,j) • One n x n face represents C • each 1 x 1 subsquare represents one C(i,j) • Each 1 x 1 x 1 subcube represents one C(i,j) += A(i,k) · B(k,j) • May be added directly to C(i,j), or to temporary accumulator 66 Proof of Communication Lower Bound on C = A·B (3/5) k “C face” Cube representing C(1,1) += A(1,3)·B(3,1) C(2,3) A(2,1) A(1,1) B(1,3) B(2,1) A(1,2) j B(1,1) A(1,3) B(3,1) C(1,1) i “A face” • If we have at most 2M “A squares”, 2M “B squares”, and 2M “C squares” on faces, how many cubes can we have? 
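The geometric fact that answers this question, the Loomis–Whitney inequality invoked on the next slide (#cubes ≤ (area of A shadow · area of B shadow · area of C shadow)^(1/2)), is easy to sanity-check numerically. A small sketch, with random 3-D index sets standing in for the multiply/adds of one segment; the helper name and parameters are illustrative:

import numpy as np

def check_loomis_whitney(trials=1000, n=6, npts=40, seed=0):
    # For random 3-D index sets S (the (i,j,k) multiply/adds done in a segment),
    # verify |S| <= sqrt(|A shadow| * |B shadow| * |C shadow|).
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        S = {tuple(p) for p in rng.integers(0, n, size=(npts, 3))}
        A = {(i, k) for (i, j, k) in S}   # A(i,k) squares touched
        B = {(k, j) for (i, j, k) in S}   # B(k,j) squares touched
        C = {(i, j) for (i, j, k) in S}   # C(i,j) squares touched
        assert len(S) <= np.sqrt(len(A) * len(B) * len(C)) + 1e-9

check_loomis_whitney()

Capping each of the three shadows at 2M squares per segment then gives the F = (2M)^(3/2) bound used below.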
67 Proof of Communication Lower Bound on C = A·B (4/5) k “C shadow” x y z j y z x “A shadow” i # cubes in black box with side lengths x, y and z = Volume of black box = x·y·z = ( xz · zy · yx)1/2 = (#A□s · #B□s · #C□s )1/2 (i,k) is in “A shadow” if (i,j,k) in 3D set (j,k) is in “B shadow” if (i,j,k) in 3D set (i,j) is in “C shadow” if (i,j,k) in 3D set Thm (Loomis & Whitney, 1949) # cubes in 3D set = Volume of 3D set ≤ (area(A shadow) · area(B shadow) · area(C shadow)) 1/2 68 Proof of Communication Lower Bound on C = A·B (5/5) • Consider one “segment” of instructions with M loads, stores • Can be at most 2M entries of A, B, C available in one segment • Volume of set of cubes representing possible multiply/adds in one segment is ≤ (2M · 2M · 2M)1/2 = (2M) 3/2 ≡ F • # Segments 2n3 / F • # Loads & Stores = M · #Segments M · 2n3 / F = n3 / (2M)1/2 • Parallel Case: apply reasoning to one processor out of P • # Adds and Muls 2n3 / P (at least one proc does this ) • M= n2 / P (each processor gets equal fraction of matrix) • # “Load & Stores” = # words moved from or to other procs M · (2n3 /P) / F= M · (2n3 /P) / (2M) 3/2 = n2 / (2P)1/2 69 How to generalize this lower bound (1/4) • It doesn’t depend on C(i,j) being a matrix entry, just a unique memory location (same for A(i,k) and B(k,j) ) • It doesn’t depend on C and A not overlapping (or C and B, or A and B) • Some reorderings may change answer; still get a lower bound • It doesn’t depend on doing n3 multiply/adds • It doesn’t depend on C(i,j) just being a sum of products C(i,j) = k A(i,k)*B(k,j) For all (i,j) S Mem(c(i,j)) = fij ( gijk ( Mem(a(i,k)) , Mem(b(k,j)) ) for k Sij , some other arguments) 70 How to generalize this lower bound (2/4) For all (i,j) S Mem(c(i,j)) = fij ( gijk ( Mem(a(i,k)) , Mem(b(k,j)) ) for k Sij , some other arguments) (P) • It does assume the presence of an operand generating a load and/or store; how could this not happen? 
• Mem(b(k,j)) could be reused in many more gijk than (P) allows • • Ex: Compute C(m) = A * B.^m (Matlab notation) for m=1 to t Can move t1/2 times fewer words than (#flops / M1/2 ) • We might generate a result during a segment, use it, and discard it, with generating any memory traffic • • Turns out QR, eig, SVD all may do this Need a different analysis for them later… 71 How to generalize this lower bound (3/4) (P) For all (i,j) S Mem(c(i,j)) = fij ( gijk ( Mem(a(i,k)) , Mem(b(k,j)) ) for k Sij , some other arguments) • Need to distinguish Sources, Destinations of each operand in fast memory during a segment: • Possible Sources: S1: Already in fast memory at start of segment, or read; at most 2M S2: Created during segment; no bound without more information • Possible Destinations: D1: Left in fast memory at end of segment, or written; at most 2M D2: Discarded; no bound without more information • Need to assume no S2/D2 arguments; at most 4M of others 72 How to generalize this lower bound (4/4) (P) For all (i,j) S Mem(c(i,j)) = fij ( gijk ( Mem(a(i,k)) , Mem(b(k,j)) ) for k Sij , some other arguments) • Theorem: To evaluate (P) with memory of size M, where • fij and gijk are “nontrivial” functions of their arguments • G is the total number of gijk‘s, • No S2/D2 arguments requires at least G/ (8M)1/2 – M slow memory references • Simpler: #words_moved = (#flops / M1/2 ) • Corollary: To evaluate (P) requires at least G/ (81/2 M3/2 ) – 1 messages (simpler: (#flops / M3/2 ) ) • Proof: maximum message size is M 73 Some Corollaries of this lower bound (1/2) (P) For all (i,j) S Mem(c(i,j)) = fij ( gijk ( Mem(a(i,k)) , Mem(b(k,j)) ) for k Sij , some other arguments) #words_moved = (#flops / M1/2 ) • Theorem applies to dense or sparse, parallel or sequential: • MatMul, including ATA or A2 • Triangular solve C = A-1·X • • C(i,j) = (X(i,j) - k<i A(i,k)*C(k,j)) / A(i,i) … A lower triangular C plays double role of b and c in Model (P) • LU factorization (any pivoting, LU or “ILU”) • L(i,j) = (A(i,j) - k<j L(i,k)*U(k,j)) / U(j,j), U(i,j) = A(i,j) - k<i L(i,k)*U(k,j) • L (and U) play double role as c and a (c and b) in Model (P) • Cholesky (any diagonal pivoting, C or “IC”) • L(i,j) = (A(i,j) - k<j L(i,k)*LT(k,j)) / L(j,j), L(j,j) = (A(j,j) - k<j L(j,k)*LT(k,j))1/2 • L (and LT) plays triple role as a, b and c 74 Some Corollaries of this lower bound (2/2) (P) For all (i,j) S Mem(c(i,j)) = fij ( gijk ( Mem(a(i,k)) , Mem(b(k,j)) ) for k Sij , some other arguments) #words_moved = (#flops / M1/2 ) • Applies to “simplified” operations • Ex: Compute determinant of A(i,j) = func(i,j) using LU, where each func(i,j) called just once; so no inputs and 1 output • Still get (#flops / M1/2 – 2n2) words_moved, by “imposing” loads/stores • Applies to compositions of operations • Ex: Compute Ak by repeated squaring, only input A, output Ak • Still get (#flops / M1/2 – n2 log k ) by “imposing” loads/stores, • Holds for any interleaving of operations • Applies to some graph algorithms • Ex: All-Pairs-Shortest-Path by Floyd-Warshall: gijk = “+” and fij = “min” • Get (F / M1/2) words_moved where F {n4 , n3 log n , n3 } 75 Lower bounds for Orthogonal Transformations (1/4) • Needed for QR, eig, SVD, … • Analyze Blocked Householder Transformations • j=1 to b (I – 2 uj uTj ) = I – U T UT where U = [ u1 , … , ub ] • Treat Givens as 2x2 Householder • Model details and assumptions • Write (I – U T UT )A = A – U(TUT A) = A – UZ where Z = T(UT A) • Only count operations in all A – UZ operations • “Generically” a large fraction of the 
work • Assume “Forward Progress”, that each successive Householder transformation leaves previously created zeros zero; Ok for • • • QR decomposition Reductions to condensed forms (Hessenberg, tri/bidiagonal) – Possible exception: bulge chasing in banded case One sweep of (block) Hessenberg QR iteration 76 Lower bounds for Orthogonal Transformations (2/4) • Perform many A – UZ where Z = T(UT A) • First challenge to applying theorem: need to collect all A-UZ into one big set to which model (P) applies • Write them all as { A(i,j) = A(i,j) - k U(i,k) Z(k,j) } where k = index of k-th transformation, • k not necessarily = index of column of A it comes from • Second challenge: All Z(k,j) may be S2/D2 • Recall: S2/D2 means computed on the fly and discarded • Ex: An x 2n =Qn x n · Rn x 2n where 2n2 = M so A fits in cache • • • • • Represent Q as n(n-1)/2 2x2 Householder (Givens) transforms There are n2(n-1)/2 = (M3/2) nonzero Z(k,j), not O(M) Still only do (M3/2) flops during segment But can’t use Loomis-Whitney to prove it! Need a new idea… 77 Lower bounds for Orthogonal Transformations (3/4) • Dealing with Z(k,j) being S2/D2 • How to bound #flops in A(i,j) = A(i,j) - k U(i,k) Z(k,j) ? • Neither A nor U is S2/D2 • A either turned into R or U, which are output • So at most 2M of each during segment • #flops ≤ ( #U(i,k) ) · ( #columns A(:,j) present ) ≤ ( #U(i,k) ) · ( #A(i,j) / min #nonzeros per column of A) ≤ h · O(M) / r where h = O(M) … How small can r be? • To store h ≡ #U(i,k) Householder vector entries in r rows, there can be at most r in the first column, r-1 in the second, etc., (to maintain “Forward Progress”) so r(r-1)/2 h so r h1/2 • # flops ≤ h · O(M) / r ≤ O(M) h1/2 = O(M3/2) as desired 78 Lower bounds for Orthogonal Transformations (4/4) • Theorem: #words_moved by QR decomposition using (blocked) Householder transformations = ( #flops / M1/2 ) • Theorem: #words_moved by reduction to Hessenberg form, tridiagonal form, bidiagonal form, or one sweep of QR iteration (or block versions of any of these) = ( #flops / M1/2 ) • Assuming Forward Progress (created zeros remain zero) • Model: Merge left and right orthogonal transformations: A(i,j) = A(i,j) - kLUL(i,kL) ·ZL(kL,j) - kRZR(i,kR) ·UR(j,kR) 79 Summary of dense sequential Cholesky algorithms attaining communication lower bounds • Algorithms shown minimizing # Messages assume (recursive) block layout • Many references (see reports), only some shown, plus ours • Oldest reference may or may not include analysis • Cache-oblivious are underlined, Green are ours, ? is unknown/future work • b = block size, M = fast memory size, n = dimension Algorithm 2 Levels of Memory #Words Moved Cholesky LAPACK (with b = M1/2) Multiple Levels of Memory and # Messages #Words Moved and # Messages [BDHS09] like [Gustavson,97] & [Ahmed,Pingali,00] (←same) (←same) [BDHS09] has analysis [BDHS09] has analysis [Gustavson, 97] recursive, Uses recursive assumed good Cholesky, TRSM MatMul available and Matmul [BDHS09] 80 Summary of dense sequential QR algorithms attaining communication lower bounds • Algorithms shown minimizing # Messages assume (recursive) block layout • Many references (see reports), only some shown, plus ours • Oldest reference may or may not include analysis • Cache-oblivious are underlined, Green are ours, ? 
is unknown/future work
• b = block size, M = fast memory size, n = dimension
Algorithm: QR
• 2 Levels of Memory, #Words Moved: [Elmroth, Gustavson,98] (recursive, may do more flops); LAPACK (only for n ≥ M and b = M^(1/2), or n^2 ≤ M); [DGHL08] (implicit Q, but represented differently, ~ same flop count)
• 2 Levels of Memory, #Messages: [Frens,Wise,03] (explicit Q only, 3x more flops); [DGHL08]
• Multiple Levels of Memory, #Words Moved: [Elmroth, Gustavson,98]; [DGHL08]?
• Multiple Levels of Memory, #Messages: [Frens,Wise,03]; [DGHL08]? Can we get the same flop count?
Summary of dense sequential LU algorithms attaining communication lower bounds
• Algorithms shown minimizing # Messages assume (recursive) block layout
• Many references (see reports), only some shown, plus ours
• Oldest reference may or may not include analysis
• Cache-oblivious are underlined, Green are ours, ? is unknown/future work
• b = block size, M = fast memory size, n = dimension
Algorithm: LU with pivoting
• 2 Levels of Memory, #Words Moved: [Toledo, 97] (recursive, assumes good TRSM and Matmul available); LAPACK (only for n ≥ M and b = M^(1/2), or n^2 ≤ M)
• 2 Levels of Memory, #Messages: [GDX08]
• Multiple Levels of Memory, #Words Moved: [Toledo, 97]; [GDX08]?
• Multiple Levels of Memory, #Messages: [GDX08]?
• [GDX08] uses pivoting different than partial pivoting, but is still stable
Same idea for LU, almost [GDX08]
• First idea: use the same reduction tree as in QR, with LU at each node
• Numerically unstable!
• Need a different pivoting scheme:
• At each node in the tree, TSLU selects b pivot rows from the 2b candidates supplied by its 2 child nodes
• At each node, do LU on the 2b original rows selected by the child nodes, not on the U factors from the child nodes
• When TSLU is done, permute the b selected rows to the top of the original matrix, then redo b steps of LU without pivoting (a sketch of this tournament appears below)
• Thm: Same results as partial pivoting on a different input matrix whose entries have the same magnitudes as the original
• CALU – Communication-Avoiding LU for general A
– Use TSLU for panel factorizations
– Apply to rest of matrix
– Cost: redundant panel factorizations
• Benefit:
– Stable in practice, but not the same pivot choice as GEPP
– b times fewer messages overall, hence faster
TSLU: LU factorization of a tall skinny matrix
First try the obvious generalization of TSQR:
W = [W0; W1; W2; W3] = [Π00·L00·U00; Π10·L10·U10; Π20·L20·U20; Π30·L30·U30]
[U00; U10] = Π01·L01·U01,  [U20; U30] = Π11·L11·U11
[U01; U11] = Π02·L02·U02
Growth factor for TSLU-based factorization: unstable for large P and large matrices. When P = # rows, TSLU is equivalent to parallel pivoting. Courtesy of H. Xiang
Growth factor for the better CALU approach: like threshold pivoting with worst-case threshold = 0.33, so |L| <= 3. Testing shows about the same residual as GEPP.
Performance vs ScaLAPACK
• TSLU (Tall-Skinny)
– IBM Power 5: up to 4.37x faster (16 procs, 1M x 150)
– Cray XT4: up to 5.52x faster (8 procs, 1M x 150)
• CALU (Square)
– IBM Power 5: up to 2.29x faster (64 procs, 1000 x 1000)
– Cray XT4: up to 1.81x faster (64 procs, 1000 x 1000)
• See INRIA Tech Report 6523 (2008), paper at SC08
CALU speedup prediction for a Petascale machine – up to 81x faster, P = 8192
Petascale machine with 8192 procs, each at 500 GFlops/s, a bandwidth of 4 GB/s: time per flop 2·10^-12 s, latency 10^-5 s, 2·10^-9 s per word.
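A minimal numpy/scipy sketch of the tournament TSLU uses to pick panel pivot rows. It uses a flat tree over four row blocks rather than the binary tree above, and the function names are illustrative, not the CALU code; each block is assumed to have at least b rows:

import numpy as np
import scipy.linalg as sla

def gepp_top_rows(block, b):
    # indices of the rows that GEPP (partial pivoting) brings to the top b positions
    piv = sla.lu_factor(block)[1]          # LAPACK-style swap list
    order = np.arange(block.shape[0])
    for i, p in enumerate(piv):
        order[i], order[p] = order[p], order[i]
    return order[:b]

def tslu_pivot_rows(W, b, nblocks=4):
    # Tournament pivoting on a tall skinny panel W with b columns:
    # each block nominates b candidate rows, then one more GEPP on the stacked
    # candidates selects the b pivot rows for the whole panel.
    candidates = []
    for rows in np.array_split(np.arange(W.shape[0]), nblocks):
        candidates.extend(rows[gepp_top_rows(W[rows], b)])
    cand = np.array(candidates)
    return cand[gepp_top_rows(W[cand], b)]

CALU then permutes these b rows to the top of the panel and redoes b steps of LU without pivoting, as described above.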
Eigenvalue problems and SVD
• Uses spectral divide and conquer
• Idea goes back to Bulgakov, Godunov, Malyshev
• Roughly: if A^(2^k) = Q·R, the leading columns of Q converge to span the invariant subspace for the eigenvalues of A outside the unit circle, so Q*·A·Q becomes block upper triangular
• Repeated squaring done implicitly
• [Bai, D., Gu, 97]
• Reduced just to Matmul, QR, and Generalized QR with pivoting
• Eliminated need for exp(A), other improvements
• [DDH,07]
• Introduced randomized rank-revealing QR
• [Q’,R’] = qr(randn(n,n)), [Q,R] = qr(A·Q’)
• Just need Matmul and QR
• [BDD,09]
• Shows that randomized GQR is also rank-revealing
• Shows limits of divide-and-conquer, depending on pseudospectrum
• Cache-oblivious, using [Frens,Wise,03]
Implicit Repeated Squaring
Repeat:
  [B_j; A_j] = [Q11, Q12; Q21, Q22] · [R_j; 0]   … QR factorization of the stacked matrix
  A_(j+1) = Q12^H · A_j
  B_(j+1) = Q22^H · B_j
until R_j converges
For each j:  A_j^(-1)·B_j = (A_(j-1)^(-1)·B_(j-1))^2 = (A_0^(-1)·B_0)^(2^j)
Proof: the 1st step gives Q12^H·B_j + Q22^H·A_j = 0; combined with the 2nd/3rd steps,
  A_(j+1)^(-1)·B_(j+1) = A_j^(-1)·(Q12^(-H)·Q22^H)·B_j = (A_j^(-1)·B_j)^2
Sparse Linear Algebra
• Thm (George 73, Hoffman/Martin/Rose 73, Eisenstat/Schultz/Sherman 76): Cholesky on an n^s x n^s sparse matrix whose graph is an s-dimensional nearest-neighbor grid does Θ(n^(3(s-1))) flops; attainable by nested dissection ordering
• Proof: most work is in dense Cholesky on a final submatrix of size Θ(n^(s-1))
• Corollary [BDHS09,G09]: Sequential Sparse Cholesky on such a matrix moves Θ(n^(3(s-1)) / M^(1/2)) words, and Parallel Sparse Cholesky moves Θ(n^(2(s-1)) / (P log n)^(1/2)) words on P procs
• Assumes “2D” parallel algorithm, i.e. minimal memory
• Thm [G09]: Attainable in the parallel case by nested dissection, eg with PSPASES.
Avoiding Communication in Iterative Linear Algebra
• k steps of a typical iterative solver for sparse Ax=b or Ax=λx
• Does k SpMVs with starting vector (eg with b, if solving Ax=b)
• Finds “best” solution among all linear combinations of these k+1 vectors
• Many such “Krylov Subspace Methods”
• Conjugate Gradients, GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication in Krylov Subspace Methods
• Assume matrix “well-partitioned,” with modest surface-to-volume ratio
• Parallel implementation
• Conventional: O(k log p) messages, because of k calls to SpMV
• New: O(log p) messages – optimal
• Serial implementation
• Conventional: O(k) moves of data from slow to fast memory
• New: O(1) moves of data – optimal
• Lots of speedup possible (modeled and measured)
• Price: some redundant computation
Locally Dependent Entries for [x, Ax], [x, Ax, A^2x], …, [x, Ax, …, A^8x], A tridiagonal, 2 processors (figure sequence, Proc 1 / Proc 2): these entries can be computed without communication; k = 8-fold reuse of A
Remotely Dependent Entries for [x, Ax, …, A^8x], A tridiagonal, 2 processors (figure): one message to get the data needed to compute remotely dependent entries, not k = 8 messages • Minimizes number of messages = latency cost • Price: redundant work, proportional to the
“surface/volume ratio”
Remotely Dependent Entries for [x, Ax, A^2x, A^3x], A irregular, multiple processors (figure)
Sequential [x, Ax, …, A^4x], with memory hierarchy (figure): one read of the matrix from slow memory, not k = 4 reads • Minimizes words moved = bandwidth cost • No redundant work (see the sketch below)
Performance results on 8-Core Clovertown (figure)
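A minimal numpy/scipy sketch of the idea behind these figures: gather, once and up front, every row of A and x within graph distance k of the rows one processor owns, then run all k SpMVs locally with no further communication. The function name and the breadth-first search over the graph of A are illustrative assumptions, not the tuned matrix powers kernel:

import numpy as np
import scipy.sparse as sp

def local_matrix_powers(A, x, k, my_rows):
    # This processor's entries of [x, Ax, ..., A^k x], computed after a single
    # gather of everything within distance k of my_rows in the graph of A.
    A = sp.csr_matrix(A)
    need, frontier = set(my_rows), set(my_rows)
    for _ in range(k):                       # grow the distance-k neighborhood
        nxt = set()
        for i in frontier:
            nxt.update(A.indices[A.indptr[i]:A.indptr[i + 1]])
        frontier = nxt - need
        need |= nxt
    cols = np.sort(np.fromiter(need, dtype=np.int64))
    Aloc = A[cols][:, cols]                  # the only read of A: fetched once
    V = np.zeros((k + 1, cols.size))
    V[0] = x[cols]
    for j in range(1, k + 1):                # k local SpMVs, no communication
        V[j] = Aloc @ V[j - 1]
    return V[:, np.isin(cols, my_rows)]      # owned entries of each A^j x are exact

The redundant work is the part of Aloc outside my_rows, i.e. the "surface" of the partition, which is why the well-partitioned assumption above matters.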