Minimizing Communication in Numerical Linear Algebra
Multicore and GPU Implementations

Jim Demmel
EECS & Math Departments, UC Berkeley
demmel@cs.berkeley.edu
www.cs.berkeley.edu/~demmel
Outline of Dense Linear Algebra
• Multicore implementations
- DAG scheduling to minimize synchronization
• Optimization on GPUs
- Minimizing data movement between CPU and GPU memories
Multicore: Expressing Parallelism with a DAG
• DAG = Directed Acyclic Graph
- S1 → S2 means statement S2 “depends on” statement S1
- Can execute in parallel any Si whose input dependences are satisfied
• For simplicity, consider Cholesky A = LLT, not LU
- N by N matrix, numbered from A(0,0) to A(N-1,N-1)
- “Left looking” code
for k = 0 to N-1
    for n = 0 to k-1
        A(k,k) = A(k,k) - A(k,n)*A(k,n)
    A(k,k) = sqrt(A(k,k))
    for m = k+1 to N-1
        for n = 0 to k-1
            A(m,k) = A(m,k) - A(m,n)*A(k,n)
        A(m,k) = A(m,k) / A(k,k)
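For concreteness, here is the same loop nest as a minimal NumPy sketch; the function name and the use of NumPy are illustrative assumptions, not part of the original slides.

import numpy as np

def cholesky_left_looking(A):
    # In-place lower-triangular Cholesky following the loop nest above.
    # Assumes A is symmetric positive definite; only the lower triangle is used.
    A = np.array(A, dtype=float)
    N = A.shape[0]
    for k in range(N):
        for n in range(k):
            A[k, k] -= A[k, n] * A[k, n]
        A[k, k] = np.sqrt(A[k, k])
        for m in range(k + 1, N):
            for n in range(k):
                A[m, k] -= A[m, n] * A[k, n]
            A[m, k] /= A[k, k]
    return np.tril(A)

# Quick check against the library routine on a random SPD matrix:
#   X = np.random.rand(6, 6); A = X @ X.T + 6 * np.eye(6)
#   assert np.allclose(cholesky_left_looking(A), np.linalg.cholesky(A))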
Expressing Parallelism with a DAG - Cholesky
for k = 0 to N-1
    for n = 0 to k-1
        S1(k,n):   A(k,k) = A(k,k) - A(k,n)*A(k,n)
    S2(k):     A(k,k) = sqrt(A(k,k))
    for m = k+1 to N-1
        for n = 0 to k-1
            S3(k,m,n): A(m,k) = A(m,k) - A(m,n)*A(k,n)
        S4(k,m):   A(m,k) = A(m,k) · A(k,k)⁻¹
(Figure: the matrix entries read and written by S1(k,n), S2(k), S3(k,m,n), and S4(k,m), indexed by n, k, and m.)
DAG has N3/6 vertices:
S1(k,n)  S2(k) for n=0:k-1
S3(k,m,n)  S4(k,m) for n=0:k-1
S2(k)  S4(k,m) for m=k+1:N
S4(k,m)  S3 (k’,m,k) for k’>k
S4(k,m)  S3(k,m’,k) for m’>m
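To pin the structure down, here is a small Python sketch (an illustration, not from the slides) that enumerates the tasks and the dependence families listed above. The last two edge families encode which later S3 tasks re-read the A(m,k) produced by S4(k,m); the precise index ranges there are my reading of the data flow, so treat them as an assumption.

def cholesky_task_dag(N):
    # Vertices are the tasks S1..S4; edges follow the families listed above.
    tasks, edges = [], []
    for k in range(N):
        tasks += [("S1", k, n) for n in range(k)] + [("S2", k)]
        for m in range(k + 1, N):
            tasks += [("S3", k, m, n) for n in range(k)] + [("S4", k, m)]
    for k in range(N):
        for n in range(k):
            edges.append((("S1", k, n), ("S2", k)))                # S1(k,n) -> S2(k)
        for m in range(k + 1, N):
            edges.append((("S2", k), ("S4", k, m)))                # S2(k) -> S4(k,m)
            for n in range(k):
                edges.append((("S3", k, m, n), ("S4", k, m)))      # S3(k,m,n) -> S4(k,m)
            for kp in range(k + 1, m):                             # A(m,k) reused as the A(m,n) operand
                edges.append((("S4", k, m), ("S3", kp, m, k)))
            for mp in range(m + 1, N):                             # A(m,k) reused as the A(k,n) operand
                edges.append((("S4", k, m), ("S3", m, mp, k)))
    return tasks, edges

# For N = 24, len(tasks) = 2600 while N**3/6 = 2304: the S3 tasks dominate, as claimed.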
Expressing Parallelism with a DAG – Block Cholesky
• Each A[i,j] is a b-by-b block
for k = 0 to N/b-1
    for n = 0 to k-1
        S1(k,n)  (SYRK):   A[k,k] = A[k,k] - A[k,n]*A[k,n]^T
    S2(k)  (POTRF):        A[k,k] = unblocked_Cholesky(A[k,k])
    for m = k+1 to N/b-1
        for n = 0 to k-1
            S3(k,m,n)  (GEMM):  A[m,k] = A[m,k] - A[m,n]*A[k,n]^T
        S4(k,m)  (TRSM):        A[m,k] = A[m,k] · A[k,k]⁻¹
Same DAG as before, but with only (N/b)³/6 vertices
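Below is a NumPy/SciPy sketch of the same tiled loop nest, with library calls standing in for SYRK, POTRF, GEMM, and TRSM. The helper names, the use of scipy.linalg, and the transposed triangular solve in the S4 step (which the slide abbreviates as multiplication by A[k,k]⁻¹) are illustrative assumptions.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def tiled_cholesky(A, b):
    # Lower-triangular tiled Cholesky of an SPD matrix; assumes b divides A.shape[0].
    A = np.array(A, dtype=float)
    Nb = A.shape[0] // b
    def tile(i, j):                      # view of the b-by-b tile (i, j)
        return A[i*b:(i+1)*b, j*b:(j+1)*b]
    for k in range(Nb):
        for n in range(k):                                    # S1, SYRK
            tile(k, k)[:] -= tile(k, n) @ tile(k, n).T
        tile(k, k)[:] = cholesky(tile(k, k), lower=True)      # S2, POTRF (unblocked Cholesky)
        for m in range(k + 1, Nb):
            for n in range(k):                                # S3, GEMM
                tile(m, k)[:] -= tile(m, n) @ tile(k, n).T
            # S4, TRSM: solve X * L[k,k]^T = A[m,k] for the new tile A[m,k]
            tile(m, k)[:] = solve_triangular(tile(k, k), tile(m, k).T, lower=True).T
    return np.tril(A)

# Check: np.allclose(tiled_cholesky(A, b), np.linalg.cholesky(A)) for an SPD A whose order is a multiple of b.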
Sample Cholesky DAG with #blocks in any row or column = N/b = 5
• Note the implied order of summation, from left to right
• Not necessary for correctness, but it reflects what the sequential code does
• Can process the DAG in any order respecting the dependences
Slide courtesy of Jakub Kurzak, UTK
Scheduling options
• Static (pre-assign tasks to processors) or dynamic (idle processors grab ready jobs from a work queue); a minimal dynamic-scheduling sketch follows this list
- If dynamic, does the scheduler take user hints/priorities?
• Respect locality (e.g., a processor must have some of the task's data in its cache) vs. not
• Build and store the entire DAG to schedule it (which may be very large, (N/b)³), or build just the next few “levels” at a time (smaller, but less information for the scheduler)
• Programmer builds the DAG & schedule vs. depending on a compiler or run-time system
- Ease of programming, vs. not exploiting user knowledge
- If compiler, how conservative is the detection of parallelism?
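Below is a minimal sketch of the dynamic option: a work queue of ready tasks executed by a thread pool, with successors released as their dependence counts reach zero. The task granularity, the use of concurrent.futures, and the bookkeeping are illustrative assumptions, not a description of any particular scheduler.

import concurrent.futures as cf

def run_dag(tasks, deps, num_workers=4):
    # tasks: dict name -> zero-argument callable
    # deps:  dict name -> set of names that must finish before it may start
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    successors = {t: set() for t in tasks}
    for t, ds in remaining.items():
        for d in ds:
            successors[d].add(t)
    done = set()
    with cf.ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Seed the work queue with tasks that have no dependencies.
        futures = {pool.submit(tasks[t]): t for t, ds in remaining.items() if not ds}
        while futures:
            finished, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
            for f in finished:
                t = futures.pop(f)
                f.result()                      # re-raise any exception from the task
                done.add(t)
                for s in successors[t]:         # release successors whose deps are all done
                    remaining[s].discard(t)
                    if not remaining[s] and s not in done:
                        futures[pool.submit(tasks[s])] = s
    return done

For the tiled Cholesky above, each task would wrap one SYRK/POTRF/GEMM/TRSM tile update and deps would come from the edge list on the DAG slide; in CPython, true parallelism then depends on the underlying BLAS releasing the GIL.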
Schedulers tested
• Cilk
- programmer-defined parallelism
- spawn: creates independent tasks
- sync: synchronizes a sub-branch of the tree
• SMPSs
- dependency-defined parallelism
- pragma-based annotation of tasks (directionality of the parameters)
• PLASMA (Static Pipeline)
- programmer-defined (hard-coded)
- a priori processing order
- progress tracking
- stalling on dependencies
Slide courtesy of Jakub Kurzak, UTK
Measured Results for Tiled Cholesky
• Measured on Intel Tigerton, 2.4 GHz
• Cilk 1D: one task is a whole panel, but with “look-ahead”
• Cilk 2D: tasks are blocks; scheduler steals work, little locality
• PLASMA works best
Slide courtesy of Jakub Kurzak, UTK
More Measured Results for Tiled Cholesky
• Measured on Intel Tigerton, 2.4 GHz
(Plots compare Cilk, SMPSs, and PLASMA Static Pipeline.)
Slide courtesy of Jakub Kurzak, UTK
Still More Measured Results for Tiled Cholesky
• PLASMA (static pipeline): best
• SMPSs: somewhat worse
• Cilk 2D: worse still
• Cilk 1D: worst
• Quad-socket, quad-core (16 cores total) Intel Tigerton, 2.4 GHz
Slide courtesy of Jakub Kurzak, UTK
Intel’s Clovertown Quad Core
3 Implementations of LU factorization, quad core w/2 sockets per board, w/8 threads (8-core experiments)
1. LAPACK (BLAS fork-join parallelism)
2. ScaLAPACK (message passing using memory copy)
3. DAG-based (dynamic scheduling)
(Plot: Mflop/s, up to about 45000, vs. matrix size from 1000 to 15000; the DAG-based version is fastest, ScaLAPACK next, LAPACK slowest.)
Source: Jack Dongarra
Scheduling on Multicore – Next Steps
• PLASMA 2.0.0 released
- Just Cholesky, QR, LU, using static scheduling
- LU does not do partial pivoting – stability?
- http://icl.cs.utk.edu/plasma/
• Future of PLASMA
- Add dynamic scheduling, similar to SMPSs
  • DAGs for eigenproblems are too complicated to do by hand
- Depend on user annotations and an API, not a compiler
- Still assumes homogeneity of the available cores
• What about GPUs, or mixtures of CPUs and GPUs?
- MAGMA
QR Factorization, Intel 16 cores
Tall Skinny Matrices
(Plot: Gflop/s vs. N for M = 51200 x N matrices, N = 200 to 3200; curves for Theoretical Peak, DGEMM Peak, MKL (10.1), and LAPACK (3.2).)
LAPACK QR
(Figure: the factorization proceeds as Step 1, Step 2, Step 3, Step 4, …)
Parallel Tasks in QR
• Break into smaller tasks and remove dependencies
Parallel Tasks in QR
Step 1: QR of block 1,1
Step 2: Use R to zero A1,2
Step 3: Use R to zero A1,3
...
QR Factorization, Intel 16 cores
Tall Skinny Matrices
(Plot: Gflop/s vs. N for M = 51200 x N matrices, N = 200 to 3200; curves for Theoretical Peak, DGEMM Peak, PLASMA (2.1), MKL (10.1), and LAPACK (3.2).)
Communication Reducing QR Factorization
TS (tall skinny) matrix: MT = 6 and NT = 3, split into 2 domains
3 overlapped steps:
- panel factorization
- updating the trailing submatrix
- merge the domains
(The sequence of figures steps through these operations on the tile grid until the final R is computed.)
Courtesy Jack Dongarra
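A minimal NumPy sketch of the idea (an illustration, not the PLASMA kernels): factor each domain of the tall-skinny panel independently, then merge the domains by taking the QR of the stacked R factors. The resulting R agrees with the R of the whole panel up to row signs.

import numpy as np

def tsqr_r(A, num_domains=2):
    # Communication-reducing QR of a tall-skinny matrix:
    # 1) independent panel factorization in each domain, 2) merge by QR of the stacked R's.
    blocks = np.array_split(A, num_domains, axis=0)
    Rs = [np.linalg.qr(B, mode='r') for B in blocks]    # local factorizations
    return np.linalg.qr(np.vstack(Rs), mode='r')        # merge the domains

# Check (up to row signs):
#   A = np.random.rand(600, 30)
#   assert np.allclose(np.abs(tsqr_r(A)), np.abs(np.linalg.qr(A, mode='r')))

With more domains, the merge step is typically organized as a reduction tree, which is the pattern behind the 4- and 8-domain example on the next slide.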
Example with 4 and 8 Domains
A. Pothen and P. Raghavan, “Distributed orthogonal factorization,” in Proc. 3rd Conference on Hypercube Concurrent Computers and Applications, vol. II (Applications), pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State.
Courtesy Jack Dongarra
Execution Trace
(Trace of a 16-core run.)
Courtesy Jack Dongarra
Communication Reducing QR Factorization
• Quad-socket, quad-core machine: Intel Xeon EMT64 E7340 at 2.39 GHz
• Theoretical peak is 153.2 Gflop/s with 16 cores
• Matrix size 51200 by 3200
(Plot; one of the curves is sequential PLASMA.)
Courtesy Jack Dongarra
Cluster Experiment
• grig.sinrg.cs.utk.edu
• 61 nodes
- Two CPUs per node
- Intel Xeon 3.20 GHz
- Peak performance 6.4 GFLOPS
- Myrinet interconnect (MX 1.0.0)
• Goto BLAS 1.26
- DGEMM performance 5.57 GFLOPS (87%)
• MPICH-MX
• 64-bit gcc
Courtesy Jack Dongarra
Weak Scalability (8 columns of tiles)
• On 1 CPU, the matrix size is 64 x 8 tiles
• On k CPUs, the matrix size is (k·64) x 8 tiles
• Tiles are 200 x 200 blocks
Courtesy Jack Dongarra
Weak Scalability (8 columns of tiles)
(Plot: “Scalability of CAQR on the Grig Cluster (8 tiles per row)”; GFLOPS per core vs. number of cores from 1 to 64, with curves for peak, dgemm, distributed CAQR, and ScaLAPACK.)
Courtesy Jack Dongarra
Dense Linear Algebra on GPUs
• Source: Vasily Volkov’s SC08 paper
- Best Student Paper Award
• New challenges
- More complicated memory hierarchy
- Not like “L1 inside L2 inside …”
  • Need to choose which memory to use carefully
  • Need to move data manually
- GPU does some operations much faster than the CPU, but not all
- CPU and GPU like different data layouts
Motivation
• NVIDIA released CUBLAS 1.0 in 2007: BLAS for GPUs
• This enables a straightforward port of LAPACK to the GPU
• Consider single precision only
(Bar chart, 2007 results, Gflop/s from 0 to 350: GeForce 8800 GTX with CUBLAS 1.1 vs. Core2 Quad 2.4 GHz with MKL 10.0. Impressive sheer compute power in the peak a*b+c rate, not so great in matrix-matrix multiply (BLAS SGEMM), and disappointing performance in a naive LU factorization (LAPACK SGETRF).)
• Goal: understand bottlenecks in the dense linear algebra kernels
• Requires detailed understanding of the GPU architecture
• Result 1: new coding recommendations for high performance on GPUs
• Result 2: new, fast variants of LU, QR, Cholesky, and other routines
GPU Memory Hierarchy
(Diagram labels: 16 KB store, 64 KB vector register file, 16 lanes, crossbar, 64 lanes.)
• The register file is the fastest and the largest on-chip memory
- Constrained to vector operations only
• Shared memory permits indexed and shared access
- However, it is 2-4x smaller and has 4x lower bandwidth than registers
- Only 1 operand in shared memory is allowed, versus 4 register operands
- Some instructions run slower if using shared memory
Memory Latency on GeForce 8800 GTX
Repeat k = A[k], where A[k] = (k + stride) mod array_size
(Plot: latency, in cycles and nanoseconds, vs. stride in bytes from 4 bytes to 16 MB, for array sizes from a few KB up to 128 MB; the curves expose transitions near 5 KB, 5.5 KB, 20 KB, 192 KB, 224 KB, and 768 KB, an 8 KB local memory, and non-cached behavior for 8 MB, 16 MB, 32 MB, and 128 MB arrays.)
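To make the benchmark's definition concrete, here is the pointer-chasing access pattern in plain Python (illustrative only: run natively on the GPU it exposes memory latency, whereas in Python it mostly measures interpreter overhead; indices here count elements, while the plot's x-axis is in bytes).

import time

def pointer_chase(array_size, stride, iters=1_000_000):
    # A[k] = (k + stride) mod array_size; each load depends on the previous one,
    # so the loop measures latency rather than bandwidth.
    A = [(k + stride) % array_size for k in range(array_size)]
    k = 0
    t0 = time.perf_counter()
    for _ in range(iters):
        k = A[k]
    return (time.perf_counter() - t0) / iters, k   # time per access; k keeps the loop live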
(Some new) NVIDIA coding recommendations
• Minimize communication with CPU memory
• Keep as much data in registers as possible
- Largest, fastest on-GPU memory
- Vector-only operations
• Use as little shared memory as possible
- Smaller and slower than registers; use for communication and sharing only
- Speed limit: 66% of peak with one shared-memory argument
• Use vector length VL = 64, not the maximum VL = 512
- Strip-mine longer vectors into shorter ones
• The final matmul code is similar to Cray X1 or IBM 3090 vector codes
__global__ void sgemmNN( const float *A, int lda, const float *B, int ldb, float* C, int ldc, int k, float alpha, float beta )
{
    // Compute pointers to the data
    A += blockIdx.x * 64 + threadIdx.x + threadIdx.y*16;
    B += threadIdx.x + ( blockIdx.y * 16 + threadIdx.y ) * ldb;
    C += blockIdx.x * 64 + threadIdx.x + ( threadIdx.y + blockIdx.y * ldc ) * 16;

    // Declare the on-chip storage
    __shared__ float bs[16][17];
    float c[16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};

    const float *Blast = B + k;
    do
    {
        // Read the next block of B into shared memory
        #pragma unroll
        for( int i = 0; i < 16; i += 4 )
            bs[threadIdx.x][threadIdx.y+i] = B[i*ldb];
        B += 16;
        __syncthreads();

        // The bottleneck: read A's columns and do rank-1 updates
        #pragma unroll
        for( int i = 0; i < 16; i++, A += lda )
        {
            c[0]  += A[0]*bs[i][0];   c[1]  += A[0]*bs[i][1];
            c[2]  += A[0]*bs[i][2];   c[3]  += A[0]*bs[i][3];
            c[4]  += A[0]*bs[i][4];   c[5]  += A[0]*bs[i][5];
            c[6]  += A[0]*bs[i][6];   c[7]  += A[0]*bs[i][7];
            c[8]  += A[0]*bs[i][8];   c[9]  += A[0]*bs[i][9];
            c[10] += A[0]*bs[i][10];  c[11] += A[0]*bs[i][11];
            c[12] += A[0]*bs[i][12];  c[13] += A[0]*bs[i][13];
            c[14] += A[0]*bs[i][14];  c[15] += A[0]*bs[i][15];
        }
        __syncthreads();
    } while( B < Blast );

    // Store C's block to memory
    for( int i = 0; i < 16; i++, C += ldc )
        C[0] = alpha*c[i] + beta*C[0];
}
Our code vs. CUBLAS 1.1
Performance in multiplying two NxN matrices on GeForce 8800 GTX:
(Plot: fraction of peak vs. N from 64 to 4096; multiply-and-add with an operand in shared memory reaches 66%, our implementation reaches 60%, CUBLAS 1.1 reaches 37%.)
The Progress So Far
(Bar chart, Gflop/s from 0 to 350, GeForce 8800 GTX vs. Core2 Quad. Peak in a*b+c: in registers vs. using shared memory, since arithmetic runs slower if using shared memory. BLAS SGEMM: CUBLAS 1.1 vs. our implementation (now in CUBLAS 2.0). LAPACK SGETRF: naive, with CUBLAS 2.0. SGEMM now looks good compared to the new, smaller peak; for SGETRF, where does the time go?)
• We achieved predictable performance in SGEMM
- Which does the O(N³) work in LU factorization
• But LU factorization (naive SGETRF) still underperforms
- Must be due to the remaining O(N²) work done in BLAS1 and BLAS2
- Why does the O(N²) work take so much time?
Row-Pivoting in LU Factorization
Exchange two rows of an NxN matrix (SSWAP in CUBLAS 2.0):
(Plot: time in microseconds, from 4 to 1024, vs. N from 2048 to 16384; the annotation marks a 40x factor.)
• Row pivoting in column-major layout on the GPU is very slow
• This alone consumes half of the runtime in naive SGETRF
BLAS1 Performance
Scale a column of an NxN matrix that fits in GPU memory (assumes aligned, unit-stride access)
(Plot: time in microseconds, 0 to 8, vs. N up to 16384, for GeForce 8600 GTS (peak 32 GB/s) and GeForce GTX 280 (peak 141 GB/s).)
• Peak bandwidth of these GPUs differs by a factor of 4.4
• But the runtimes are similar
• Small tasks on the GPU are overhead bound
Panel Factorization
Factorizing an Nx64 matrix in GPU memory using LAPACK’s SGETF2:
(Plot: Gflop/s, up to about 25, vs. N from 64 to 32768; the bound assumes 4 µs overhead per BLAS call and 127 GB/s bandwidth in memory access, the best sustained numbers.)
• Invoking small BLAS operations on the GPU from the CPU is slow
• Can we call a sequence of BLAS operations from the GPU?
• Requires barrier synchronization after each parallel BLAS operation
• A barrier is possible, but requires sequential consistency for correctness
Optimizing Matrix Factorizations
• Use the GPU to compute matrix-matrix multiplies only
• Factorize panels on the CPU
• Use look-ahead to overlap computation on the CPU and GPU
• Batch pivoting
• Use right-looking algorithms to have more threads in SGEMM
- Better load balance in the GPU workload, better latency hiding
• Use row-major layout on the GPU in LU factorization
- Requires an extra (but fast) matrix transpose for each CPU–GPU transfer
• Replace triangular solves of LX = B (TRSM) with multiplication by L⁻¹ (see the sketch after this list)
- At worst squares the pivot growth factor in the error bound (assumed small anyway)
- Can check ‖L⁻¹‖ and use TRSM if it is too large
• Use two-level and variable-size blocking as finer tuning
- Thicker blocks impose lower bandwidth requirements in SGEMM
- Variable-size blocking improves the CPU/GPU load balance
• Use a column-cyclic layout when computing with two GPUs
- Requires no data exchange between GPUs in pivoting
- The cyclic layout is used on the GPUs only, so it does not affect panel factorization
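A small NumPy sketch of the TRSM substitution from the list above (illustrative assumptions: function names, shapes, and the arbitrary fallback threshold): invert the small triangular diagonal block once, then apply it to the many right-hand sides with a matrix multiply, which maps to SGEMM on the GPU, instead of performing a triangular solve.

import numpy as np
from scipy.linalg import solve_triangular

def apply_trsm_as_gemm(L, B, fallback_threshold=1e6):
    # Solve L X = B. Form inv(L) once (L is only b-by-b), then use a GEMM.
    Linv = solve_triangular(L, np.eye(L.shape[0]), lower=True)   # explicit L^-1
    if np.linalg.norm(Linv, ord=np.inf) > fallback_threshold:    # slide's safeguard: check ||L^-1||
        return solve_triangular(L, B, lower=True)                # fall back to TRSM
    return Linv @ B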
Performance Results
(Plot: Gflop/s vs. order of matrix from 64 to 16384 for QR, Cholesky, and LU; annotations mark 51%, 49%, and 78%.)
• Our solution runs at 50% of the system’s peak
• It is bound by SGEMM, which runs at 60% of the GPU-only peak
Speedup of Factorizations on GPU over CPU
• The GPU is only useful on large enough matrices
(Plot: speedup vs. Core2 Quad for QR, Cholesky, and LU as a function of matrix order from 64 to 16384; the GTX280 reaches about 4.4x and the 8800GTX about 2.7x.)
Slowdown when omitting one optimization from LU on GeForce 8800 GTX
(Plot: slowdown factor, roughly 0.9 to 2.0, vs. matrix order from 64 to 16384, when omitting one of: CPU/GPU overlap, the matrix transpose, TRSM via GEMM, or batch pivoting.)
Time Breakdown for LU on GeForce 8800 GTX
(Stacked plot: fraction of time vs. matrix order from 448 to 11264, split among GPU, CPU/GPU overlap, CPU, look-ahead, transpose, and CPU–GPU transfer.)
LU Factorization using Two GPUs
(Plot: Gflop/s vs. matrix order up to 22500; the values 538, 309, 298, and 179 Gflop/s are marked on the curves.)
• A second GPU allows 1.7x higher rates
• More than half a teraflop using two GPUs
Results for matmul, LU on NVIDIA
(Bar chart, Gflop/s from 0 to 350, GeForce 8800 GTX vs. Core2 Quad: peak in a*b+c, in registers vs. if using shared memory; BLAS SGEMM, CUBLAS 1.1 vs. our implementation (now in CUBLAS 2.0); LAPACK SGETRF, naive vs. our implementation.)
• What we’ve achieved:
- Identified the realistic peak speed of the GPU architecture
- Achieved a large fraction of this peak in matrix multiply
- Achieved a large fraction of the matrix multiply rate in dense factorizations