Minimizing Communication
in Linear Algebra
James Demmel
9 Nov 2009
1
Outline
• Background – 2 Big Events
• Why avoiding communication is important
• Communication-optimal direct linear algebra
• Autotuning of sparse matrix codes
• Communication-optimal iterative linear algebra
2
Collaborators
• Grey Ballard, UCB EECS
• Ioana Dumitriu, U. Washington
• Laura Grigori, INRIA
• Ming Gu, UCB Math
• Mark Hoemmen, UCB EECS
• Olga Holtz, UCB Math & TU Berlin
• Julien Langou, U. Colorado Denver
• Marghoob Mohiyuddin, UCB EECS
• Oded Schwartz, TU Berlin
• Hua Xiang, INRIA
• Sam Williams, LBL
• Vasily Volkov, UCB EECS
• Kathy Yelick, UCB EECS & NERSC
Thanks to Intel, Microsoft,
UC Discovery, NSF, DOE, …
• BeBOP group, bebop.cs.berkeley.edu
• ParLab, parlab.eecs.berkeley.edu
3
2 Big Events
1. “Multicore revolution”, requiring all software (where performance matters!) to change
• ParLab – Industry/State funded center with 15 faculty, ~50 grad students ( parlab.eecs.berkeley.edu )
• Rare synergy between commercial and scientific computing
2. Establishment of a new graduate program in Computational Science and Engineering (CSE)
• 117 Faculty from 22 Departments
• 60 existing courses, more being developed
• CS267 – Applications of Parallel Computers
– www.cs.berkeley.edu/~demmel/cs267_Spr09
• 3 Day Parallel “Boot Camp” in August
– parlab.eecs.berkeley.edu/2009bootcamp
Parallelism Revolution is Happening Now
• Chip density continuing to increase ~2x every 2 years
• Clock speed is not
• Number of processor cores may double instead
• There is little or no more hidden parallelism (ILP) to be found
• Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
5
The “7 Dwarfs” of High Performance Computing
• Phil Colella (LBL) identified 7 kernels out of which most large scale
simulation and data-analysis programs are composed:
1. Dense Linear Algebra
• Ex: Solve Ax=b or Ax = λx where A is a dense matrix
2. Sparse Linear Algebra
• Ex: Solve Ax=b or Ax = λx where A is a sparse matrix (mostly zero)
3. Operations on Structured Grids
• Ex: Anew(i,j) = 4*A(i,j) – A(i-1,j) – A(i+1,j) – A(i,j-1) – A(i,j+1)
4. Operations on Unstructured Grids
• Ex: Similar, but list of neighbors varies from entry to entry
5. Spectral Methods
• Ex: Fast Fourier Transform (FFT)
6. N-Body (aka Particle Methods)
• Ex: Compute electrostatic forces using Fast Multipole Method
7. Monte Carlo (aka MapReduce)
• Ex: Many independent simulations using different inputs
6
Motif/Dwarf: Common Computational Methods
[Heat map (Red = Hot, Blue = Cool): how often each of 13 motifs appears across application areas – HPC, ML, Games, DB, SPEC, Embed, Health, Image, Speech, Music, Browser. Motifs: Finite State Mach., Combinational, Graph Traversal, Structured Grid, Dense Matrix, Sparse Matrix, Spectral (FFT), Dynamic Prog, N-Body, MapReduce, Backtrack/B&B, Graphical Models, Unstructured Grid.]
www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf
2 Big Events
1. “Multicore revolution”, requiring all software (where performance matters!) to change
• ParLab – Industry/State funded center with 15 faculty, ~50 grad students ( parlab.eecs.berkeley.edu )
• Rare synergy between commercial and scientific computing
2. Establishment of a new graduate program in Computational Science and Engineering (CSE) at UCB
• 117 Faculty from 22 Departments
• 60 existing courses, more being developed
• CS267 – Applications of Parallel Computers
– www.cs.berkeley.edu/~demmel/cs267_Spr09
• 3 Day Parallel “Boot Camp” in August
– parlab.eecs.berkeley.edu/2009bootcamp
Why avoiding communication is important (1/2)
Algorithms have two costs:
1. Arithmetic (FLOPS)
2. Communication: moving data between
• levels of a memory hierarchy (sequential case)
• processors over a network (parallel case)
[Figures: sequential case – CPU, cache, DRAM; parallel case – several CPU+DRAM nodes connected by a network]
9
Why avoiding communication is important (2/2)
• Running time of an algorithm is sum of 3 terms:
• #flops * time_per_flop
• #words moved / bandwidth   } communication
• #messages * latency        }
• Time_per_flop << 1/bandwidth << latency
• Gaps growing exponentially with time

Annual improvements:
  Time_per_flop: 59%
  Network: bandwidth 26%, latency 15%
  DRAM:    bandwidth 23%, latency 5%

• Goal: reorganize linear algebra to avoid communication
• Between all memory hierarchy levels
• L1 ↔ L2 ↔ DRAM ↔ network, etc
• Not just hiding communication (speedup ≤ 2x)
• Arbitrary speedups possible
10
Naïve Sequential Matrix Multiply: C = C + A*B
for i = 1 to n
  {read row i of A into fast memory, n^2 reads}
  for j = 1 to n
    {read C(i,j) into fast memory, n^2 reads}
    {read column j of B into fast memory, n^3 reads}
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory, n^2 writes}
n^3 + O(n^2) reads/writes altogether
[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
11
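As a concrete reference point, here is a minimal NumPy sketch of the naïve triple loop above; the function name and the use of NumPy are my own choices for illustration, not part of the original slides.

```python
import numpy as np

def naive_matmul(A, B, C):
    """C = C + A*B with the i, j, k loop order from the slide.

    Every C(i,j) touches all of row i of A and column j of B, so the
    data traffic is ~n^3 words once the matrices exceed fast memory."""
    n = A.shape[0]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

# tiny usage example
n = 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
C = naive_matmul(A, B, np.zeros((n, n)))
assert np.allclose(C, A @ B)
```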
Less Communication with Blocked Matrix Multiply
• Blocked Matmul C = A·B explicitly refers to subblocks of
A, B and C of dimensions that depend on cache size
… Break A_{n x n}, B_{n x n}, C_{n x n} into b x b blocks labeled A(i,j), etc
… b chosen so 3 b x b blocks fit in cache
for i = 1 to n/b, for j = 1 to n/b, for k = 1 to n/b
  C(i,j) = C(i,j) + A(i,k)·B(k,j)   … b x b matmul, 4b^2 reads/writes
• (n/b)^3 · 4b^2 = 4n^3/b reads/writes altogether
• Minimized when 3b^2 = cache size = M, yielding O(n^3/M^{1/2}) reads/writes
• What if we had more levels of memory (L1, L2, cache etc)?
• Would need 3 more nested loops per level
12
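A minimal sketch of the blocked loop above in NumPy; the block size b is assumed to be chosen so that three b x b blocks fit in fast memory (3b^2 ≤ M), as on the slide, and b is assumed to divide n for simplicity.

```python
import numpy as np

def blocked_matmul(A, B, C, b):
    """C = C + A*B over b x b blocks; each block multiply reads/writes
    about 4*b^2 words, giving ~4*n^3/b total traffic."""
    n = A.shape[0]
    assert n % b == 0, "illustration assumes b divides n"
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

n, b = 8, 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
C = blocked_matmul(A, B, np.zeros((n, n)), b)
assert np.allclose(C, A @ B)
```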
Blocked vs Cache-Oblivious Algorithms
• Blocked Matmul C = A·B explicitly refers to subblocks of
A, B and C of dimensions that depend on cache size
… Break A_{n x n}, B_{n x n}, C_{n x n} into b x b blocks labeled A(i,j), etc
… b chosen so 3 b x b blocks fit in cache
for i = 1 to n/b, for j = 1 to n/b, for k = 1 to n/b
  C(i,j) = C(i,j) + A(i,k)·B(k,j)   … b x b matmul
  … another level of memory would need 3 more loops
• Cache-oblivious Matmul C = A·B is independent of cache
Function C = MM(A,B)
  If A and B are 1x1
    C = A·B
  else … Break A_{n x n}, B_{n x n}, C_{n x n} into (n/2)x(n/2) blocks labeled A(i,j), etc
    for i = 1 to 2, for j = 1 to 2, for k = 1 to 2
      C(i,j) = C(i,j) + MM( A(i,k), B(k,j) )   … n/2 x n/2 matmul
13
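Here is a runnable sketch of the recursive, cache-oblivious version. Recursing all the way to 1x1 as on the slide would be slow in Python, so this sketch stops at a small base case and assumes n is a power of 2; both are my own simplifications for illustration.

```python
import numpy as np

def mm(A, B, C, base=32):
    """Cache-oblivious C += A*B by recursive halving.

    No block size appears: each level of recursion eventually fits
    some level of the memory hierarchy."""
    n = A.shape[0]
    if n <= base:                       # base case (slide recurses to 1x1)
        C += A @ B
        return C
    h = n // 2
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                mm(A[i*h:(i+1)*h, k*h:(k+1)*h],
                   B[k*h:(k+1)*h, j*h:(j+1)*h],
                   C[i*h:(i+1)*h, j*h:(j+1)*h], base)
    return C

n = 128
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(mm(A, B, np.zeros((n, n))), A @ B)
```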
Direct linear algebra:
Prior Work on Matmul
• Assume n^3 algorithm (i.e. not Strassen-like)
• Sequential case, with fast memory of size M
• Lower bound on #words moved to/from slow memory = Ω(n^3 / M^{1/2}) [Hong, Kung, 81]
• Attained using blocked or cache-oblivious algorithms
• Parallel case on P processors:
• Let NNZ be total memory needed; assume load balanced
• Lower bound on #words communicated = Ω(n^3 / (P·NNZ)^{1/2}) [Irony, Tiskin, Toledo, 04]

NNZ (name of alg)         Lower bound on #words   Attained by
3n^2 (“2D alg”)           Ω(n^2 / P^{1/2})        [Cannon, 69]
3n^2 P^{1/3} (“3D alg”)   Ω(n^2 / P^{2/3})        [Johnson, 93]
14
New lower bound for all “direct” linear algebra
Let M = “fast” memory size per processor
#words_moved by at least one processor = Ω(#flops / M^{1/2})
#messages_sent by at least one processor = Ω(#flops / M^{3/2})
• Holds for
• BLAS, LU, QR, eig, SVD, …
• Some whole programs (sequences of these operations, no
matter how they are interleaved, eg computing A^k)
• Dense and sparse matrices (where #flops << n^3)
• Sequential and parallel algorithms
• Some graph-theoretic algorithms (eg Floyd-Warshall)
• See [BDHS09] – proof sketch if time!
15
Can we attain these lower bounds?
• Do conventional dense algorithms as implemented in
LAPACK and ScaLAPACK attain these bounds?
• Mostly not
• If not, are there other algorithms that do?
• Yes
• Goals for algorithms:
• Minimize #words = Ω(#flops / M^{1/2})
• Minimize #messages = Ω(#flops / M^{3/2})
• Need new data structures
• Minimize for multiple memory hierarchy levels
• Cache-oblivious algorithms would be simplest
• Fewest flops when matrix fits in fastest memory
• Cache-oblivious algorithms don’t always attain this
• Only a few sparse algorithms so far
16
Minimizing #messages requires new data structures
• Need to load/store whole rectangular subblock of matrix with
one “message”
• Incompatible with conventional columnwise (rowwise) storage
• Ex: Rows (columns) not in contiguous memory locations
• Blocked storage: store as matrix of bxb blocks, each block
stored contiguously
• Ok for one level of memory hierarchy, what if more?
• Recursive blocked storage: store each block using subblocks
17
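To make the storage idea concrete, here is a small sketch (my own illustration, not code from the talk) that copies a column-major matrix into b x b blocks stored contiguously, so that one block can be moved with a single contiguous transfer; applying the same idea recursively inside each block gives the recursive blocked storage mentioned above.

```python
import numpy as np

def to_blocked(A, b):
    """Return A stored as an (n/b, n/b) array of contiguous b x b blocks.

    Layout: blocks[I, J] holds A[I*b:(I+1)*b, J*b:(J+1)*b] contiguously,
    so loading one block is one long contiguous read ("one message")."""
    n = A.shape[0]
    assert n % b == 0
    nb = n // b
    blocks = np.empty((nb, nb, b, b))
    for I in range(nb):
        for J in range(nb):
            blocks[I, J] = A[I*b:(I+1)*b, J*b:(J+1)*b]
    return blocks

A = np.arange(36.0).reshape(6, 6)
blk = to_blocked(A, 3)
assert blk[1, 0].flags["C_CONTIGUOUS"]        # each block is contiguous
assert np.array_equal(blk[1, 0], A[3:6, 0:3])
```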
Summary of dense sequential algorithms
attaining communication lower bounds
• Algorithms shown minimizing # Messages use (recursive) block layout
• Many references (see reports), only some shown, plus ours
• Cache-oblivious are underlined, Green are ours, ? is unknown/future work

Algorithm: 2 Levels of Memory (#Words Moved; #Messages) → Multiple Levels of Memory (#Words Moved; #Messages)
• BLAS-3: Usual blocked or recursive algorithms → Usual blocked algorithms (nested), or recursive [Gustavson,97]
• Cholesky: LAPACK (with b = M^{1/2}), [Gustavson 97], [BDHS09]; [Gustavson,97], [Ahmed,Pingali,00], [BDHS09] → (←same); (←same)
• LU with pivoting: LAPACK (rarely), [Toledo,97], [GDX 08]; [GDX 08] (not partial pivoting) → [Toledo, 97], [GDX 08]?; [GDX 08]?
• QR: LAPACK (rarely), [Elmroth,Gustavson,98], [DGHL08]; [Frens,Wise,03] (but 3x flops), [DGHL08] → [Elmroth,Gustavson,98], [DGHL08]?; [Frens,Wise,03], [DGHL08]?
• Eig, SVD: Not LAPACK; [BDD09] (randomized, but more flops) → [BDD09]; [BDD09]
Summary of dense 2D parallel algorithms
attaining communication lower bounds
• Assume nxn matrices on P processors, memory per processor = O(n^2 / P)
• ScaLAPACK assumes best block size b chosen
• Many references (see reports), Green are ours
• Recall lower bounds:
  #words_moved = Ω(n^2 / P^{1/2})  and  #messages = Ω(P^{1/2})

Algorithm        Reference     Factor exceeding lower     Factor exceeding lower
                               bound for #words_moved     bound for #messages
Matrix multiply  [Cannon, 69]  1                          1
Cholesky         ScaLAPACK     log P                      log P
LU               [GDX08]       log P                      log P
                 ScaLAPACK     log P                      (N / P^{1/2}) · log P
QR               [DGHL08]      log P                      log^3 P
                 ScaLAPACK     log P                      (N / P^{1/2}) · log P
Sym Eig, SVD     [BDD09]       log P                      log^3 P
                 ScaLAPACK     log P                      N / P^{1/2}
Nonsym Eig       [BDD09]       log P                      log^3 P
                 ScaLAPACK     P^{1/2} · log P            N · log P
QR of a tall, skinny matrix is bottleneck;
Use TSQR instead:

W = \begin{pmatrix} W_0 \\ W_1 \\ W_2 \\ W_3 \end{pmatrix}
  = \begin{pmatrix} Q_{00} & & & \\ & Q_{10} & & \\ & & Q_{20} & \\ & & & Q_{30} \end{pmatrix}
    \begin{pmatrix} R_{00} \\ R_{10} \\ R_{20} \\ R_{30} \end{pmatrix}

\begin{pmatrix} R_{00} \\ R_{10} \\ R_{20} \\ R_{30} \end{pmatrix}
  = \begin{pmatrix} Q_{01} & \\ & Q_{11} \end{pmatrix}
    \begin{pmatrix} R_{01} \\ R_{11} \end{pmatrix}

\begin{pmatrix} R_{01} \\ R_{11} \end{pmatrix} = Q_{02} R_{02}

Q is represented implicitly as a product (tree of factors)
[DGHL08]
Minimizing Communication in TSQR
[Figure: reduction trees over the row blocks W0..W3 of W.
Parallel: binary tree – local QRs give R00, R10, R20, R30; pairs are combined into R01, R11, and those into R02.
Sequential: flat tree – R00 is combined with W1 into R01, then with W2 into R02, then with W3 into R03.
Dual Core: a hybrid of the two trees.]
Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?
Can choose reduction tree dynamically
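A minimal NumPy sketch of parallel-style TSQR with a binary reduction tree, run here as a loop over block R factors rather than on actual processors; it returns only the final R (the tree of local Q factors is the implicit representation of Q mentioned above). This is my own illustration under those assumptions, not the talk's reference code.

```python
import numpy as np

def tsqr_r(W_blocks):
    """R factor of the stacked matrix [W0; W1; ...] via a binary
    reduction tree of local QR factorizations (TSQR)."""
    # leaves: QR of each local block, keep only R
    Rs = [np.linalg.qr(W)[1] for W in W_blocks]
    # reduce: stack pairs of R factors and re-factor until one R remains
    while len(Rs) > 1:
        nxt = []
        for i in range(0, len(Rs) - 1, 2):
            nxt.append(np.linalg.qr(np.vstack([Rs[i], Rs[i + 1]]))[1])
        if len(Rs) % 2:                 # odd block out passes through
            nxt.append(Rs[-1])
        Rs = nxt
    return Rs[0]

m, n, p = 1000, 10, 4                  # tall-skinny: m >> n, p blocks
W = np.random.rand(m, n)
R = tsqr_r(np.array_split(W, p))
# R agrees with a direct QR of W up to the signs of its rows
assert np.allclose(np.abs(R), np.abs(np.linalg.qr(W)[1]))
```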
Performance of TSQR vs Sca/LAPACK
• Parallel
– Intel Clovertown
– Up to 8x speedup (8 core, dual socket, 10M x 10)
– Pentium III cluster, Dolphin Interconnect, MPICH
• Up to 6.7x speedup (16 procs, 100K x 200)
– BlueGene/L
• Up to 4x speedup (32 procs, 1M x 50)
– Both use Elmroth-Gustavson locally – enabled by TSQR
• Sequential
– OOC on PowerPC laptop
• As little as 2x slowdown vs (predicted) infinite DRAM
• LAPACK with virtual memory never finished
• See UC Berkeley EECS Tech Report 2008-89
• General CAQR = Communication-Avoiding QR in progress
Modeled Speedups of CAQR vs ScaLAPACK
Petascale
up to 22.9x
IBM Power 5
up to 9.7x
“Grid”
up to 11x
Petascale machine with 8192 procs, each at 500 GFlops/s, a bandwidth of 4 GB/s:
γ = 2·10^{-12} s/flop, α = 10^{-5} s, β = 2·10^{-9} s/word.
Randomized Rank-Revealing QR
• Needed for eigenvalue and SVD algorithms
• func [Q,R,V] = RRQR(A)
• [V,R1] = qr(randn(n,n)) … V random with Haar measure
• [Q,R] = qr(A · V^T) … so A = Q · R · V
• Theorem (DDH). Provided that σ_{r+1} << σ_r ~ σ_1, RRQR
produces a rank-revealing decomposition with high
probability (when r = O(n), the probability is 1 – O(n^{-c})).
• Can apply to A = A_1^{±1} · A_2^{±1} · … · A_k^{±1}
• Do RRQR, followed by a series of (k-1) QR / RQ to propagate
unitary/orthogonal matrix to the front
• Analogous to [Stewart, 94]
24
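The two-line recipe above translates almost directly into NumPy; this sketch is an illustration of the idea (a Haar-distributed V from QR of a Gaussian matrix, then QR of A·V^T), not the production algorithm, and the function name is mine.

```python
import numpy as np

def randomized_rrqr(A, rng=np.random.default_rng(0)):
    """Return Q, R, V with A = Q @ R @ V, where V is a random
    orthogonal (Haar-distributed) matrix obtained from qr(randn)."""
    n = A.shape[1]
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))  # V random with Haar measure
    Q, R = np.linalg.qr(A @ V.T)                      # A·V^T = Q·R  =>  A = Q·R·V
    return Q, R, V

# a matrix of numerical rank 3: trailing diagonal entries of R should be tiny
A = np.random.rand(50, 3) @ np.random.rand(3, 8)
Q, R, V = randomized_rrqr(A)
assert np.allclose(Q @ R @ V, A)
print(np.abs(np.diag(R)))   # first 3 entries O(1), the rest near zero
```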
What about o(n^3) algorithms, like Strassen?
• Thm (DDHK07)
• Given any matmul running in O(n^ω) ops for some ω > 2,
it can be made stable and still run in O(n^{ω+η}) ops,
for any η > 0
• Current record: ω ≈ 2.38
• Thm (DDH08)
• Given any stable matmul running in O(n^{ω+η}) ops, it is
possible to do backward stable dense linear algebra
in O(n^{ω+η}) ops
• Also reduces communication to O(n^{ω+η})
• But can do much better (work in progress…)
25
Direct Linear Algebra: summary and future work
• Communication lower bounds on #words_moved and #messages
• BLAS, LU, Cholesky, QR, eig, SVD, …
• Some whole programs (compositions of these operations,
no matter how they are interleaved, eg computing Ak)
• Dense and sparse matrices (where #flops << n3 )
• Sequential and parallel algorithms
• Some graph-theoretic algorithms (eg Floyd-Warshall)
• Algorithms that attain these lower bounds
• Nearly all dense problems (some open problems)
• A few sparse problems
• Speed ups in theory and practice
• Future work
• Lots to implement, tune
• Next generation of Sca/LAPACK on heterogeneous architectures (MAGMA)
• Few algorithms in sparse case
• Strassen-like algorithms
• 3D Algorithms (only for Matmul so far), may be important for scaling
26
How hard is hand-tuning matmul, anyway?
• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given “blocked” code to start with
• Still hard to get close to vendor tuned performance (ACML)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/
27
How hard is hand-tuning matmul, anyway?
28
What part of the Matmul Search Space Looks Like
[Figure: a 2-D slice (number of rows vs columns in the register block) of a 3-D register-tile search space; the dark blue region was pruned.
Platform: Sun Ultra-IIi, 333 MHz, 667 Mflop/s peak, Sun cc v5.0 compiler]
29
Autotuning Matmul with ATLAS (n = 500)
Source: Jack Dongarra
[Bar chart: performance in MFLOPS (0–900) of Vendor BLAS, ATLAS BLAS, and reference F77 BLAS on a range of architectures (AMD Athlon, DEC ev5/ev6, IBM Power2/Power3/PowerPC 604, Intel Pentium Pro/II/III, SGI R10000, Sun UltraSparc, …).]
• ATLAS is faster than all other portable BLAS implementations and it is comparable
with machine-specific libraries provided by the vendor.
• ATLAS written by C. Whaley, inspired by PHiPAC, by Asanovic, Bilmes, Chin, D.
30
Automatic Performance Tuning
• Goal: Let machine do hard work of writing fast code
• What do tuning of Matmul, dense BLAS, FFTs, signal processing, have
in common?
• Can do the tuning off-line: once per architecture, algorithm
• Can take as much time as necessary (hours, a week…)
• At run-time, algorithm choice may depend only on few parameters
(matrix dimensions, size of FFT, etc.)
• Can’t always do tuning off-line
• Algorithm and implementation may strongly depend on data only known
at run-time
• Ex: Sparse matrix nonzero pattern determines both best data structure
and implementation of Sparse-matrix-vector-multiplication (SpMV)
• Part of search for best algorithm must be done (very quickly!) at run-time
Source: Accelerator Cavity Design Problem (from Ko via Husbands)
Linear Programming Matrix
…
A Sparse Matrix You Use Every Day
SpMV with Compressed Sparse Row (CSR) Storage
Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)*x(j)
for each row i
  for k = ptr[i] to ptr[i+1]-1 do
    y[i] = y[i] + val[k]*x[ind[k]]
Only 2 flops per 2 mem_refs: limited by getting data from memory
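Here is the same kernel as runnable Python over explicit CSR arrays (val, ind, ptr); the small example matrix is built by hand just for illustration.

```python
import numpy as np

def spmv_csr(val, ind, ptr, x, y):
    """y += A*x for A in CSR: val = nonzeros, ind = their column indices,
    ptr[i]..ptr[i+1]-1 = positions of row i's nonzeros.

    Each nonzero costs 2 flops but also ~2 memory references
    (val[k] and x[ind[k]]), so the kernel is memory-bound."""
    n = len(ptr) - 1
    for i in range(n):
        for k in range(ptr[i], ptr[i + 1]):
            y[i] += val[k] * x[ind[k]]
    return y

# 3x3 example:  [[4, 0, 1],
#                [0, 3, 0],
#                [2, 0, 5]]
val = np.array([4.0, 1.0, 3.0, 2.0, 5.0])
ind = np.array([0, 2, 1, 0, 2])
ptr = np.array([0, 2, 3, 5])
x = np.array([1.0, 1.0, 1.0])
print(spmv_csr(val, ind, ptr, x, np.zeros(3)))   # -> [5. 3. 7.]
```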
Example: The Difficulty of Tuning
• n = 21200
• nnz = 1.5 M
• kernel: SpMV
• Source: NASA
structural analysis
problem
Example: The Difficulty of Tuning
• n = 21200
• nnz = 1.5 M
• kernel: SpMV
• Source: NASA
structural analysis
problem
• 8x8 dense
substructure:
exploit this to limit
#mem_refs
Speedups on Itanium 2: The Need for Search
[Figure: register-blocking profile on Itanium 2; the reference code runs at ~190 Mflop/s, an 8x8 block is the "natural" guess, but the best block found by search is 4x2 at ~1190 Mflop/s.]
Register Profile: Itanium 2
[Figure: full register-blocking profile on Itanium 2, ranging from 190 to 1190 Mflop/s.]
Register Profiles: IBM and Intel IA-64
[Figure: register-blocking profiles; best blocking reaches Power3 17% of peak (252 Mflop/s), Power4 16% (820 Mflop/s), Itanium 1 8% (247 Mflop/s), Itanium 2 33% (1.2 Gflop/s).]
Register Profiles: Sun and Intel x86
[Figure: register-blocking profiles; Ultra 2i 11% of peak (72 Mflop/s), Ultra 3 5% (90 Mflop/s), Pentium III 21% (108 Mflop/s), Pentium III-M 15% (122 Mflop/s).]
Another example of tuning challenges
• More complicated
non-zero structure in
general
• N = 16614
• NNZ = 1.1M
Zoom in to top corner
• More complicated
non-zero structure
in general
• N = 16614
• NNZ = 1.1M
3x3 blocks look natural, but…
• More complicated non-zero
structure in general
• Example: 3x3 blocking
• Logical grid of 3x3 cells
• But would lead to lots of “fill-in”
Extra Work Can Improve Efficiency!
• More complicated non-zero
structure in general
• Example: 3x3 blocking
• Logical grid of 3x3 cells
• Fill-in explicit zeros
• Unroll 3x3 block multiplies
• “Fill ratio” = 1.5
• On Pentium III: 1.5x speedup!
• Actual mflop rate 1.5^2 = 2.25x higher
Extra Work Can Improve Efficiency!
• More complicated non-zero
structure in general
• Example: 3x3 blocking
• Logical grid of 3x3 cells
• Fill-in explicit zeros
• Unroll 3x3 block multiplies
• “Fill ratio” = 1.5
• On Pentium III: 1.5x speedup!
• Actual mflop rate 1.5^2 = 2.25x higher
• In general: sample matrix to
choose rxc block structure using
heuristic
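To illustrate the trade-off, here is a small sketch of SpMV over a dense-block (BCSR-style) format with 3x3 blocks: zeros are filled in explicitly so each block multiply can be unrolled/vectorized, at the cost of the "fill ratio" in extra flops and storage. The format and helper names are my own simplification, not OSKI's API.

```python
import numpy as np

def to_bcsr(A, r=3, c=3):
    """Collect the nonzero r x c blocks of A (zeros inside a block are
    stored explicitly - the source of the fill ratio)."""
    m, n = A.shape
    blocks, bind, bptr = [], [], [0]
    for I in range(0, m, r):
        for J in range(0, n, c):
            blk = A[I:I+r, J:J+c]
            if np.any(blk):
                blocks.append(blk.copy())
                bind.append(J)
        bptr.append(len(blocks))
    return np.array(blocks), np.array(bind), np.array(bptr)

def spmv_bcsr(blocks, bind, bptr, x, y, r=3, c=3):
    """y += A*x; each stored block is one small dense matvec."""
    for bi in range(len(bptr) - 1):            # bi-th block row
        for k in range(bptr[bi], bptr[bi + 1]):
            J = bind[k]
            y[bi*r:(bi+1)*r] += blocks[k] @ x[J:J+c]
    return y

A = np.zeros((6, 6)); A[0, 0] = A[1, 1] = A[3, 4] = A[5, 3] = 2.0
blocks, bind, bptr = to_bcsr(A)
x = np.ones(6)
assert np.allclose(spmv_bcsr(blocks, bind, bptr, x, np.zeros(6)), A @ x)
```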
Source: Accelerator Cavity Design Problem (from Ko via Husbands)
Post-RCM Reordering
100x100 Submatrix Along Diagonal
“Microscopic” Effect of RCM Reordering
Before: Green + Red
After: Green + Blue
“Microscopic” Effect of Combined RCM+TSP Reordering
Before: Green + Red
After: Green + Blue
Speedups from 1.6x to 3.3x
Summary of SpMV Performance Optimizations
• Sequential SpMV
• Register blocking (RB): up to 4x over CSR
• Symmetry: 2.8x over CSR, 2.6x over RB
• Reordering to create dense structure + splitting: 2x over CSR
• Variable block splitting: 2.1x over CSR, 1.8x over RB
• Diagonals: 2x over CSR
• Cache blocking: 2.8x over CSR
• Multiple vectors (SpMM): 7x over CSR
• And combinations… → OSKI
• Multicore SpMV
• NUMA, Affinity, SW Prefetch, Cache/LS/TLB blocking: 4x over naïve Pthreads
• Sequential Sparse triangular solve
• Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
• A·AT·x, AT·A·x: 4x over CSR, 1.8x over RB
• More later …
• All optimizations (being) automated in OSKI (bebop.cs.berkeley.edu)
• Being adopted by Cray; Python interface in progress
Avoiding Communication in Iterative Linear Algebra
• k-steps of typical iterative solver for sparse Ax=b or Ax=λx
• Does k SpMVs with starting vector
• Finds “best” solution among all linear combinations of these k+1 vectors
• Many such “Krylov Subspace Methods”
• Conjugate Gradients, GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication in Krylov Subspace Methods
• Assume matrix “well-partitioned,” with modest surface-to-volume ratio
• Parallel implementation
• Conventional: O(k log p) messages, because k calls to SpMV
• New: O(log p) messages - optimal
• Serial implementation
• Conventional: O(k) moves of data from slow to fast memory
• New: O(1) moves of data – optimal
• Lots of speed up possible (modeled and measured)
• Price: some redundant computation
• Much prior work
• See bebop.cs.berkeley.edu
• CG: [van Rosendale, 83], [Chronopoulos and Gear, 89]
• GMRES: [Walker, 88], [Joubert and Carey, 92], [Bai et al., 94]
Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2
• Cost of k steps of standard GMRES vs new GMRES

Standard GMRES:
for i = 1 to k
  w = A · v(i-1)
  MGS(w, v(0),…,v(i-1))
  update v(i), H
endfor
solve LSQ problem with H

Sequential: #words_moved = O(k·nnz) from SpMV + O(k^2·n) from MGS
Parallel: #messages = O(k) from SpMV + O(k^2 · log p) from MGS

Communication-avoiding GMRES:
W = [ v, Av, A^2v, … , A^kv ]
[Q,R] = TSQR(W) … “Tall Skinny QR”
Build H from R, solve LSQ problem

Sequential: #words_moved = O(nnz) from SpMV + O(k·n) from TSQR
Parallel: #messages = O(1) from computing W + O(log p) from TSQR
• Oops – W from power method, precision lost!
“Monomial” basis [Ax,…,A^kx]
fails to converge
A different polynomial basis does converge
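A toy sketch of the communication-avoiding pattern: build the k+1-vector Krylov basis W in one pass of SpMVs (here the unstable monomial basis, exactly the "oops" noted above; a Newton or Chebyshev basis would be used in practice), then orthogonalize it with a single TSQR/QR instead of k rounds of MGS. Function names and the example matrix are mine.

```python
import numpy as np
import scipy.sparse as sp

def krylov_basis(A, v, k):
    """W = [v, Av, ..., A^k v] computed with k SpMVs (monomial basis;
    fine for illustration, numerically poor for large k)."""
    W = [v / np.linalg.norm(v)]
    for _ in range(k):
        W.append(A @ W[-1])
    return np.column_stack(W)

def ca_gmres_step(A, b, k):
    """One 'outer' step: build the basis, then a single QR replaces k
    MGS orthogonalization (communication) steps."""
    W = krylov_basis(A, b, k)
    Q, R = np.linalg.qr(W)      # stand-in for TSQR on a parallel machine
    return Q, R                 # H and the least-squares solve follow from R

n, k = 100, 4
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
b = np.random.rand(n)
Q, R = ca_gmres_step(A, b, k)
assert np.allclose(Q.T @ Q, np.eye(k + 1), atol=1e-10)
```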
Speed ups of GMRES on 8-core Intel Clovertown
[MHDY09]
Communication Avoiding Iterative Linear Algebra:
Future Work
• Lots of algorithms to implement, autotune
• Make widely available via OSKI, Trilinos, PETSc, Python, …
• Extensions to other Krylov subspace methods
• So far just Lanczos/CG and Arnoldi/GMRES
• BiCGStab, CGS, QMR, …
• Add preconditioning
• Block diagonal are easiest
• Hierarchical, semiseparable, …
• Works if interactions between distant points can be “compressed”
Summary
Don’t Communic…
58
Extra Slides
62
Proof of Communication Lower Bound on C = A·B (1/5)
• Proof from Irony/Toledo/Tiskin (2004)
• Original proof, then generalization
• Think of instruction stream being executed
• Looks like “ … add, load, multiply, store, load, add, …”
• Each load/store moves a word between fast and slow memory
• We want to count the number of loads and stores, given that we are
multiplying n-by-n matrices C = A·B using the usual 2n^3 flops, possibly
reordered assuming addition is commutative/associative
• Assuming that at most M words can be stored in fast memory
• Outline:
• Break instruction stream into segments, each with M loads and stores
• Somehow bound the maximum number of flops that can be done in
each segment, call it F
• So F · #segments ≥ T = total flops = 2·n^3, so #segments ≥ T / F
• So #loads & stores = M · #segments ≥ M · T / F
63
Illustrating Segments, for M=3
[Figure: an instruction stream laid out along a time axis (loads, stores, and FLOPs interleaved), broken into Segment 1, Segment 2, Segment 3, … with each segment containing M = 3 loads and stores.]
Proof of Communication Lower Bound on C = A·B (1/5)
• Proof from Irony/Toledo/Tiskin (2004)
• Original proof, then generalization
• Think of instruction stream being executed
• Looks like “ … add, load, multiply, store, load, add, …”
• Each load/store moves a word between fast and slow memory
• We want to count the number of loads and stores, given that we are
multiplying n-by-n matrices C = A·B using the usual 2n^3 flops, possibly
reordered assuming addition is commutative/associative
• Assuming that at most M words can be stored in fast memory
• Outline:
• Break instruction stream into segments, each with M loads and stores
• Somehow bound the maximum number of flops that can be done in
each segment, call it F
• So F · #segments ≥ T = total flops = 2·n^3, so #segments ≥ T / F
• So #loads & stores = M · #segments ≥ M · T / F
65
Proof of Communication Lower Bound on C = A·B (2/5)
• Given segment of instruction stream with M loads & stores,
how many adds & multiplies (F) can we do?
• At most 2M entries of C, 2M entries of A and/or 2M entries of B
can be accessed
• Use geometry:
• Represent n^3 multiplications by n x n x n cube
• One n x n face represents A
• each 1 x 1 subsquare represents one A(i,k)
• One n x n face represents B
• each 1 x 1 subsquare represents one B(k,j)
• One n x n face represents C
• each 1 x 1 subsquare represents one C(i,j)
• Each 1 x 1 x 1 subcube represents one C(i,j) += A(i,k) · B(k,j)
• May be added directly to C(i,j), or to temporary accumulator
66
Proof of Communication Lower Bound on C = A·B (3/5)
[Figure: the n x n x n cube of multiplications with axes i, j, k; the “A face”, “B face”, and “C face” are labeled, and the 1x1x1 subcube representing C(1,1) += A(1,3)·B(3,1) is highlighted.]
• If we have at most 2M “A squares”, 2M “B squares”, and
2M “C squares” on faces, how many cubes can we have?
67
Proof of Communication Lower Bound on C = A·B (4/5)
[Figure: a 3D set of subcubes with a black bounding box of side lengths x, y, z; its shadows on the A, B, and C faces are shown.]
# cubes in black box with side lengths x, y and z
= Volume of black box
= x·y·z
= ( xz · zy · yx )^{1/2}
= ( #A□s · #B□s · #C□s )^{1/2}
(i,k) is in “A shadow” if (i,j,k) in 3D set
(j,k) is in “B shadow” if (i,j,k) in 3D set
(i,j) is in “C shadow” if (i,j,k) in 3D set
Thm (Loomis & Whitney, 1949)
# cubes in 3D set = Volume of 3D set
≤ (area(A shadow) · area(B shadow) · area(C shadow))^{1/2}
68
Proof of Communication Lower Bound on C = A·B (5/5)
• Consider one “segment” of instructions with M loads, stores
• Can be at most 2M entries of A, B, C available in one segment
• Volume of set of cubes representing possible multiply/adds in
one segment is ≤ (2M · 2M · 2M)^{1/2} = (2M)^{3/2} ≡ F
• # Segments ≥ 2n^3 / F
• # Loads & Stores = M · #Segments ≥ M · 2n^3 / F = n^3 / (2M)^{1/2}
• Parallel Case: apply reasoning to one processor out of P
• # Adds and Muls ≥ 2n^3 / P (at least one proc does this)
• M = n^2 / P (each processor gets equal fraction of matrix)
• # “Loads & Stores” = # words moved from or to other procs
≥ M · (2n^3 /P) / F = M · (2n^3 /P) / (2M)^{3/2} = n^2 / (2P)^{1/2}
69
How to generalize this lower bound (1/4)
• It doesn’t depend on C(i,j) being a matrix entry, just a
unique memory location (same for A(i,k) and B(k,j) )
• It doesn’t depend on C and A not overlapping (or C and
B, or A and B)
• Some reorderings may change answer; still get a
lower bound
• It doesn’t depend on doing n^3 multiply/adds
• It doesn’t depend on C(i,j) just being a sum of products
C(i,j) = Σ_k A(i,k)*B(k,j)

For all (i,j) ∈ S
Mem(c(i,j)) = f_ij ( g_ijk ( Mem(a(i,k)) , Mem(b(k,j)) ) for k ∈ S_ij ,
some other arguments)
70
How to generalize this lower bound (2/4)
(P)
For all (i,j) ∈ S
Mem(c(i,j)) = f_ij ( g_ijk ( Mem(a(i,k)) , Mem(b(k,j)) ) for k ∈ S_ij ,
some other arguments)
• It does assume the presence of an operand generating a
load and/or store; how could this not happen?
• Mem(b(k,j)) could be reused in many more g_ijk than (P) allows
• Ex: Compute C(m) = A * B.^m (Matlab notation) for m = 1 to t
• Can move t^{1/2} times fewer words than Ω(#flops / M^{1/2})
• We might generate a result during a segment, use it, and discard it,
without generating any memory traffic
• Turns out QR, eig, SVD all may do this
• Need a different analysis for them later…
71
How to generalize this lower bound (3/4)
(P)
For all (i,j) ∈ S
Mem(c(i,j)) = f_ij ( g_ijk ( Mem(a(i,k)) , Mem(b(k,j)) ) for k ∈ S_ij ,
some other arguments)
• Need to distinguish Sources, Destinations of each operand
in fast memory during a segment:
• Possible Sources:
S1: Already in fast memory at start of segment, or read; at most 2M
S2: Created during segment; no bound without more information
• Possible Destinations:
D1: Left in fast memory at end of segment, or written; at most 2M
D2: Discarded; no bound without more information
• Need to assume no S2/D2 arguments; at most 4M of others
72
How to generalize this lower bound (4/4)
(P)
For all (i,j) ∈ S
Mem(c(i,j)) = f_ij ( g_ijk ( Mem(a(i,k)) , Mem(b(k,j)) ) for k ∈ S_ij ,
some other arguments)
• Theorem: To evaluate (P) with memory of size M, where
• f_ij and g_ijk are “nontrivial” functions of their arguments
• G is the total number of g_ijk‘s,
• No S2/D2 arguments
requires at least G / (8M)^{1/2} – M slow memory references
• Simpler: #words_moved = Ω(#flops / M^{1/2})
• Corollary: To evaluate (P) requires at least
G / (8^{1/2} M^{3/2}) – 1 messages (simpler: Ω(#flops / M^{3/2}))
• Proof: maximum message size is M
73
Some Corollaries of this lower bound (1/2)
(P)
For all (i,j) ∈ S
Mem(c(i,j)) = f_ij ( g_ijk ( Mem(a(i,k)) , Mem(b(k,j)) )
for k ∈ S_ij , some other arguments)
#words_moved = Ω(#flops / M^{1/2})
• Theorem applies to dense or sparse, parallel or sequential:
• MatMul, including A^T·A or A^2
• Triangular solve C = A^{-1}·X
• C(i,j) = (X(i,j) - Σ_{k<i} A(i,k)*C(k,j)) / A(i,i) … A lower triangular
• C plays double role of b and c in Model (P)
• LU factorization (any pivoting, LU or “ILU”)
• L(i,j) = (A(i,j) - Σ_{k<j} L(i,k)*U(k,j)) / U(j,j), U(i,j) = A(i,j) - Σ_{k<i} L(i,k)*U(k,j)
• L (and U) play double role as c and a (c and b) in Model (P)
• Cholesky (any diagonal pivoting, C or “IC”)
• L(i,j) = (A(i,j) - Σ_{k<j} L(i,k)*L^T(k,j)) / L(j,j), L(j,j) = (A(j,j) - Σ_{k<j} L(j,k)*L^T(k,j))^{1/2}
• L (and L^T) plays triple role as a, b and c
74
Some Corollaries of this lower bound (2/2)
(P)
For all (i,j) ∈ S
Mem(c(i,j)) = f_ij ( g_ijk ( Mem(a(i,k)) , Mem(b(k,j)) )
for k ∈ S_ij , some other arguments)
#words_moved = Ω(#flops / M^{1/2})
• Applies to “simplified” operations
• Ex: Compute determinant of A(i,j) = func(i,j) using LU, where each
func(i,j) called just once; so no inputs and 1 output
• Still get Ω(#flops / M^{1/2} – 2n^2) words_moved, by “imposing” loads/stores
• Applies to compositions of operations
• Ex: Compute A^k by repeated squaring, only input A, output A^k
• Still get Ω(#flops / M^{1/2} – n^2 log k) by “imposing” loads/stores
• Holds for any interleaving of operations
• Applies to some graph algorithms
• Ex: All-Pairs-Shortest-Path by Floyd-Warshall: g_ijk = “+” and f_ij = “min”
• Get Ω(F / M^{1/2}) words_moved where F ∈ {n^4, n^3 log n, n^3}
75
Lower bounds for Orthogonal Transformations (1/4)
• Needed for QR, eig, SVD, …
• Analyze Blocked Householder Transformations
• Π_{j=1..b} (I – 2 u_j u_j^T) = I – U T U^T where U = [ u_1 , … , u_b ]
• Treat Givens as 2x2 Householder
• Model details and assumptions
• Write (I – U T U^T)A = A – U(T U^T A) = A – UZ where Z = T(U^T A)
• Only count operations in all A – UZ operations
• “Generically” a large fraction of the work
• Assume “Forward Progress”, that each successive Householder
transformation leaves previously created zeros zero; Ok for
• QR decomposition
• Reductions to condensed forms (Hessenberg, tri/bidiagonal)
– Possible exception: bulge chasing in banded case
• One sweep of (block) Hessenberg QR iteration
76
Lower bounds for Orthogonal Transformations (2/4)
• Perform many A – UZ where Z = T(U^T A)
• First challenge to applying theorem: need to collect all A–UZ
into one big set to which model (P) applies
• Write them all as { A(i,j) = A(i,j) - Σ_k U(i,k) Z(k,j) } where
k = index of k-th transformation,
• k not necessarily = index of column of A it comes from
• Second challenge: All Z(k,j) may be S2/D2
• Recall: S2/D2 means computed on the fly and discarded
• Ex: A_{n x 2n} = Q_{n x n} · R_{n x 2n} where 2n^2 = M so A fits in cache
• Represent Q as n(n-1)/2 2x2 Householder (Givens) transforms
• There are n^2(n-1)/2 = Θ(M^{3/2}) nonzero Z(k,j), not O(M)
• Still only do O(M^{3/2}) flops during segment
• But can’t use Loomis-Whitney to prove it!
• Need a new idea…
77
Lower bounds for Orthogonal Transformations (3/4)
• Dealing with Z(k,j) being S2/D2
• How to bound #flops in A(i,j) = A(i,j) - Σ_k U(i,k) Z(k,j) ?
• Neither A nor U is S2/D2
• A either turned into R or U, which are output
• So at most 2M of each during segment
• #flops ≤ ( #U(i,k) ) · ( #columns A(:,j) present )
        ≤ ( #U(i,k) ) · ( #A(i,j) / min #nonzeros per column of A )
        ≤ h · O(M) / r   where h = O(M)   … How small can r be?
• To store h ≡ #U(i,k) Householder vector entries in r rows,
there can be at most r in the first column, r-1 in the second, etc.
(to maintain “Forward Progress”), so r(r-1)/2 ≥ h, so r ≥ h^{1/2}
• #flops ≤ h · O(M) / r ≤ O(M) · h^{1/2} = O(M^{3/2}) as desired
78
Lower bounds for Orthogonal Transformations (4/4)
• Theorem: #words_moved by QR decomposition using
(blocked) Householder transformations = Ω(#flops / M^{1/2})
• Theorem: #words_moved by reduction to Hessenberg form,
tridiagonal form, bidiagonal form, or one sweep of QR iteration
(or block versions of any of these) = Ω(#flops / M^{1/2})
• Assuming Forward Progress (created zeros remain zero)
• Model: Merge left and right orthogonal transformations:
A(i,j) = A(i,j) - Σ_{kL} U_L(i,kL)·Z_L(kL,j) - Σ_{kR} Z_R(i,kR)·U_R(j,kR)
79
Summary of dense sequential Cholesky algorithms
attaining communication lower bounds
• Algorithms shown minimizing # Messages assume (recursive) block layout
• Many references (see reports), only some shown, plus ours
• Oldest reference may or may not include analysis
• Cache-oblivious are underlined, Green are ours, ? is unknown/future work
• b = block size, M = fast memory size, n = dimension

Cholesky:
• 2 Levels of Memory, #Words Moved: LAPACK (with b = M^{1/2}); [Gustavson, 97] (recursive, assumed good MatMul available); [BDHS09] (has analysis)
• 2 Levels of Memory, #Messages: [BDHS09], like [Gustavson,97] & [Ahmed,Pingali,00] (has analysis)
• Multiple Levels of Memory, #Words Moved and #Messages: (←same); uses recursive Cholesky, TRSM and Matmul; [BDHS09]
80
Summary of dense sequential QR algorithms
attaining communication lower bounds
• Algorithms shown minimizing # Messages assume (recursive) block layout
• Many references (see reports), only some shown, plus ours
• Oldest reference may or may not include analysis
• Cache-oblivious are underlined, Green are ours, ? is unknown/future work
• b = block size, M = fast memory size, n = dimension

QR:
• 2 Levels of Memory, #Words Moved: LAPACK (only for n ≥ M and b = M^{1/2}, or n^2 ≤ M); [Elmroth, Gustavson, 98] (recursive, may do more flops); [DGHL08] (implicit Q, but represented differently, ~same flop count)
• 2 Levels of Memory, #Messages: [Frens, Wise, 03] (explicit Q only, 3x more flops); [DGHL08]
• Multiple Levels of Memory, #Words Moved: [Elmroth, Gustavson, 98]; [DGHL08]?
• Multiple Levels of Memory, #Messages: [Frens, Wise, 03]; [DGHL08]? (can we get the same flop count?)
81
Summary of dense sequential LU algorithms
attaining communication lower bounds
• Algorithms shown minimizing # Messages assume (recursive) block layout
• Many references (see reports), only some shown, plus ours
• Oldest reference may or may not include analysis
• Cache-oblivious are underlined, Green are ours, ? is unknown/future work
• b = block size, M = fast memory size, n = dimension

LU with pivoting:
• 2 Levels of Memory, #Words Moved: LAPACK (only for n ≥ M and b = M^{1/2}, or n^2 ≤ M); [Toledo, 97] (recursive, assumes good TRSM and Matmul available); [GDX08] (different than partial pivoting, still stable)
• 2 Levels of Memory, #Messages: [GDX08]
• Multiple Levels of Memory, #Words Moved and #Messages: [Toledo, 97]; [GDX08]?; [GDX08]?
82
Same idea for LU, almost
[GDX08]
• First idea: Use same reduction tree as in QR, with LU at each node
• Numerically unstable!
• Need a different pivoting scheme:
• At each node in tree, TSLU selects b pivot rows from 2b candidates
from its 2 child nodes
• At each node, do LU on 2b original rows selected by child nodes,
not U factors from child nodes
• When TSLU done, permute b selected rows to top of original matrix,
redo b steps of LU without pivoting
• Thm: Same results as partial pivoting on different input matrix
whose entries are the same magnitudes as original
• CALU – Communication Avoiding LU for general A
– Use TSLU for panel factorizations
– Apply to rest of matrix
– Cost: redundant panel factorizations
• Benefit:
– Stable in practice, but not same pivot choice as GEPP
– b times fewer messages overall - faster
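A toy sketch of the tournament-pivoting idea in TSLU described above: each block proposes b pivot rows via ordinary GEPP, and pairs of candidate sets play off up a binary tree, always re-factoring the original candidate rows (not U factors). This is my own simplified illustration run as plain loops; it returns only the selected row indices.

```python
import numpy as np

def gepp_pivot_rows(W, b):
    """Indices (into W) of the b rows that Gaussian elimination with
    partial pivoting would permute to the top."""
    W = W.astype(float).copy()
    rows = np.arange(W.shape[0])
    for j in range(b):
        p = j + np.argmax(np.abs(W[j:, j]))          # choose pivot row
        W[[j, p]] = W[[p, j]]; rows[[j, p]] = rows[[p, j]]
        W[j+1:, j] /= W[j, j]                        # eliminate below pivot
        W[j+1:, j+1:] -= np.outer(W[j+1:, j], W[j, j+1:])
    return rows[:b]

def tslu_pivot_rows(A, blocks, b):
    """Tournament pivoting: global indices of b pivot rows for a
    tall-skinny A split into row blocks (list of index arrays)."""
    cands = [blk[gepp_pivot_rows(A[blk], b)] for blk in blocks]  # leaves
    while len(cands) > 1:                            # play off up the tree
        nxt = []
        for i in range(0, len(cands) - 1, 2):
            both = np.concatenate([cands[i], cands[i + 1]])
            nxt.append(both[gepp_pivot_rows(A[both], b)])
        if len(cands) % 2:
            nxt.append(cands[-1])
        cands = nxt
    return cands[0]   # permute these rows to the top, then LU without pivoting

m, b, p = 64, 4, 4
A = np.random.rand(m, b)
print(sorted(tslu_pivot_rows(A, np.array_split(np.arange(m), p), b)))
```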
TSLU: LU factorization of a tall skinny matrix
First try the obvious generalization of TSQR:

W = \begin{pmatrix} W_0 \\ W_1 \\ W_2 \\ W_3 \end{pmatrix}
  = \begin{pmatrix} \Pi_{00} L_{00} U_{00} \\ \Pi_{10} L_{10} U_{10} \\ \Pi_{20} L_{20} U_{20} \\ \Pi_{30} L_{30} U_{30} \end{pmatrix}

\begin{pmatrix} U_{00} \\ U_{10} \\ U_{20} \\ U_{30} \end{pmatrix}
  = \begin{pmatrix} \Pi_{01} L_{01} U_{01} \\ \Pi_{11} L_{11} U_{11} \end{pmatrix}

\begin{pmatrix} U_{01} \\ U_{11} \end{pmatrix} = \Pi_{02} L_{02} U_{02}
Growth factor for TSLU based factorization
Unstable for large P and large matrices.
When P = # rows, TSLU is equivalent to parallel pivoting.
Courtesy of H. Xiang
Growth factor for better CALU approach
Like threshold pivoting with worst case threshold = .33 , so |L| <= 3
Testing shows about same residual as GEPP
Performance vs ScaLAPACK
• TSLU (Tall-Skinny)
– IBM Power 5
• Up to 4.37x faster (16 procs, 1M x 150)
– Cray XT4
• Up to 5.52x faster (8 procs, 1M x 150)
• CALU (Square)
– IBM Power 5
• Up to 2.29x faster (64 procs, 1000 x 1000)
– Cray XT4
• Up to 1.81x faster (64 procs, 1000 x 1000)
• See INRIA Tech Report 6523 (2008), paper at SC08
CALU speedup prediction for a Petascale machine - up to 81x faster
P = 8192
Petascale machine with 8192 procs, each at 500 GFlops/s, a bandwidth of 4 GB/s:
γ = 2·10^{-12} s/flop, α = 10^{-5} s, β = 2·10^{-9} s/word.
Eigenvalue problems and SVD
• Uses spectral divide and conquer
• Idea goes back to Bulgakov, Godunov, Malyshev
• Roughly: if A^{2^k} = QR, leading columns of Q converge to span the
invariant subspace for eigenvalues of A outside the unit circle, so Q*AQ
becomes block upper triangular
• Repeated squaring done implicitly
• [Bai, D., Gu, 97]
• Reduced just to Matmul, QR, and Generalized QR with pivoting
• Eliminated need for exp(A), other improvements
• [DDH,07]
• Introduced randomized rank revealing QR
• [Q’,R’] = qr(randn(n,n)), [Q,R] = qr(A·Q’)
• Just need Matmul and QR
• [BDD,09]
• Shows that randomized GQR also rank revealing
• Shows limits of divide-and-conquer, depending on pseudospectrum
• Cache-oblivious, using [Frens,Wise,03]
89
Implicit Repeated Squaring
Repeat:
  \begin{pmatrix} B_j \\ A_j \end{pmatrix}
    = \begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix}
      \begin{pmatrix} R_j \\ 0 \end{pmatrix}   … QR factorization
  A_{j+1} = Q_{12}^H A_j
  B_{j+1} = Q_{22}^H B_j
until R_j converges

Claim: for each j,
  A_j^{-1} B_j = (A_{j-1}^{-1} B_{j-1})^2 = (A_0^{-1} B_0)^{2^j}

Proof:
from 1st step:       Q_{12}^H B_j + Q_{22}^H A_j = 0, so Q_{12}^{-H} Q_{22}^H = - B_j A_j^{-1}
from 2nd/3rd steps:  A_{j+1}^{-1} B_{j+1} = A_j^{-1} Q_{12}^{-H} Q_{22}^H B_j = (A_j^{-1} B_j)^2
Sparse Linear Algebra
• Thm (George 73, Hoffman/Martin/Rose 73, Eisenstat/
Schultz/Sherman 76): Cholesky on an n^s x n^s sparse matrix
whose graph is an s-dimensional nearest neighbor grid does
Θ(n^{3(s-1)}) flops; attainable by nested dissection ordering
• Proof: most work in dense Cholesky on final submatrix of size Θ(n^{s-1})
• Corollary [BDHS09,G09]: Sequential Sparse Cholesky on
such a matrix moves Ω(n^{3(s-1)} / M^{1/2}) words, and Parallel
Sparse Cholesky moves Ω(n^{2(s-1)} / (P log n)^{1/2}) words on
P procs
• Assumes “2D” parallel algorithm, i.e. minimal memory
• Thm [G09]. Attainable in parallel case by nested dissection,
eg with PSPASES.
91
Avoiding Communication in Iterative Linear Algebra
• k-steps of typical iterative solver for sparse Ax=b or Ax=λx
• Does k SpMVs with starting vector (eg with b, if solving Ax=b)
• Finds “best” solution among all linear combinations of these k+1 vectors
• Many such “Krylov Subspace Methods”
•
Conjugate Gradients, GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication in Krylov Subspace Methods
• Assume matrix “well-partitioned,” with modest surface-to-volume ratio
• Parallel implementation
• Conventional: O(k log p) messages, because k calls to SpMV
• New: O(log p) messages - optimal
• Serial implementation
• Conventional: O(k) moves of data from slow to fast memory
• New: O(1) moves of data – optimal
• Lots of speed up possible (modeled and measured)
• Price: some redundant computation
Locally Dependent Entries for
[x,Ax], A tridiagonal, 2 processors
[Figure: rows of x, Ax, …, A^8x split between Proc 1 and Proc 2; the entries of [x, Ax] each processor can compute from its local data are highlighted.]
Can be computed without communication
Locally Dependent Entries for
[x,Ax,A2x], A tridiagonal, 2 processors
[Figure: rows of x, Ax, …, A^8x split between Proc 1 and Proc 2; the entries of [x, Ax, A^2x] each processor can compute from its local data are highlighted.]
Can be computed without communication
Locally Dependent Entries for
[x,Ax,…,A3x], A tridiagonal, 2 processors
[Figure: rows of x, Ax, …, A^8x split between Proc 1 and Proc 2; the entries of [x, Ax, …, A^3x] each processor can compute from its local data are highlighted.]
Can be computed without communication
Locally Dependent Entries for
[x,Ax,…,A4x], A tridiagonal, 2 processors
[Figure: rows of x, Ax, …, A^8x split between Proc 1 and Proc 2; the entries of [x, Ax, …, A^4x] each processor can compute from its local data are highlighted.]
Can be computed without communication
Locally Dependent Entries for
[x,Ax,…,A8x], A tridiagonal, 2 processors
[Figure: rows of x, Ax, …, A^8x split between Proc 1 and Proc 2; the entries of [x, Ax, …, A^8x] each processor can compute from its local data are highlighted.]
Can be computed without communication
k=8 fold reuse of A
Remotely Dependent Entries for
[x,Ax,…,A8x], A tridiagonal, 2 processors
[Figure: the extra (remotely dependent) entries near the processor boundary that Proc 1 and Proc 2 need in order to compute their parts of [x, Ax, …, A^8x]; they form a small triangular halo.]
One message to get data needed to compute remotely dependent entries, not k=8
Minimizes number of messages = latency cost
Price: redundant work ∝ “surface/volume ratio”
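A toy sketch of the communication-avoiding matrix powers kernel for the tridiagonal case pictured above: each "processor" (here just an array slice) fetches its local rows of x plus k ghost entries on each side in one shot, then computes its part of [x, Ax, …, A^k x] with no further communication, redoing a little boundary work redundantly. The stencil, names, and slicing scheme are mine, chosen only to illustrate the idea.

```python
import numpy as np

def local_matrix_powers(x, lo, hi, k):
    """Rows lo:hi of [x, Ax, ..., A^k x] for the tridiagonal stencil
    (Ax)[i] = -x[i-1] + 2*x[i] - x[i+1] (Dirichlet ends), using a single
    fetch of k ghost entries on each side instead of one fetch per SpMV."""
    n = len(x)
    glo, ghi = max(lo - k, 0), min(hi + k, n)   # local rows plus ghost zone
    v = x[glo:ghi].copy()                       # the one "message"
    out = [v[lo - glo:lo - glo + (hi - lo)].copy()]
    for _ in range(k):
        w = np.zeros_like(v)
        w[1:-1] = -v[:-2] + 2 * v[1:-1] - v[2:]
        # window edges: exact only at true domain boundaries; elsewhere the
        # (wrong) values stay inside the ghost zone and never reach lo:hi
        if glo == 0:
            w[0] = 2 * v[0] - v[1]
        if ghi == n:
            w[-1] = 2 * v[-1] - v[-2]
        v = w
        out.append(v[lo - glo:lo - glo + (hi - lo)].copy())
    return np.column_stack(out)

# check against explicit SpMVs on the full vector
n, k = 16, 4
A = np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
x = np.random.rand(n)
full = np.column_stack([np.linalg.matrix_power(A, j) @ x for j in range(k + 1)])
lo, hi = 0, 8                                  # "Proc 1" owns rows 0..7
assert np.allclose(local_matrix_powers(x, lo, hi, k), full[lo:hi])
```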
Remotely Dependent Entries for [x,Ax,A2x,A3x],
A irregular, multiple processors
Sequential [x,Ax,…,A^4x], with memory hierarchy
One read of matrix from slow memory, not k=4
Minimizes words moved = bandwidth cost
No redundant work
Performance results on 8-Core Clovertown