Minimizing Communication in Numerical Linear Algebra
Multicore and GPU Implementations
Jim Demmel
EECS & Math Departments, UC Berkeley
demmel@cs.berkeley.edu
www.cs.berkeley.edu/~demmel

Outline of Dense Linear Algebra
• Multicore implementations
  - DAG scheduling to minimize synchronization
• Optimization on GPUs
  - Minimizing data movement between CPU and GPU memories

Multicore: Expressing Parallelism with a DAG
• DAG = Directed Acyclic Graph
  - S1 -> S2 means statement S2 "depends on" statement S1
  - Can execute in parallel any Si without input dependencies
• For simplicity, consider Cholesky A = L*L^T, not LU
  - N by N matrix, entries numbered from A(0,0) to A(N-1,N-1)
  - "Left-looking" code:

    for k = 0 to N-1
        for n = 0 to k-1
            A(k,k) = A(k,k) - A(k,n)*A(k,n)
        A(k,k) = sqrt(A(k,k))
        for m = k+1 to N-1
            for n = 0 to k-1
                A(m,k) = A(m,k) - A(m,n)*A(k,n)
            A(m,k) = A(m,k) / A(k,k)

Expressing Parallelism with a DAG - Cholesky
• Label each statement:

    for k = 0 to N-1
        for n = 0 to k-1
            S1(k,n):   A(k,k) = A(k,k) - A(k,n)*A(k,n)
        S2(k):         A(k,k) = sqrt(A(k,k))
        for m = k+1 to N-1
            for n = 0 to k-1
                S3(k,m,n): A(m,k) = A(m,k) - A(m,n)*A(k,n)
            S4(k,m):       A(m,k) = A(m,k) / A(k,k)

[Figure: matrix diagram showing which entries S1(k,n), S2(k), S3(k,m,n), and S4(k,m) read and update, indexed by k, m, n]

• The DAG has ~N^3/6 vertices, with dependences:
  - S1(k,n) -> S2(k)        for n = 0:k-1
  - S3(k,m,n) -> S4(k,m)    for n = 0:k-1
  - S2(k) -> S4(k,m)        for m = k+1:N-1
  - S4(k,m) -> S3(k',m,k)   for k' > k
  - S4(k,m) -> S3(m,m',k)   for m' > m

Expressing Parallelism with a DAG - Block Cholesky
• Each A[i,j] is a b-by-b block:

    for k = 0 to N/b-1
        for n = 0 to k-1
            S1(k,n):   SYRK:  A[k,k] = A[k,k] - A[k,n]*A[k,n]^T
        S2(k):         POTRF: A[k,k] = unblocked_Cholesky(A[k,k])
        for m = k+1 to N/b-1
            for n = 0 to k-1
                S3(k,m,n): GEMM: A[m,k] = A[m,k] - A[m,n]*A[k,n]^T
            S4(k,m):       TRSM: A[m,k] = A[m,k] * A[k,k]^-1

[Figure: same block diagram as before, now indexed by b-by-b blocks]

• Same DAG as before, but only ~(N/b)^3/6 vertices

Sample Cholesky DAG
• #blocks in any row or column = N/b = 5
• Note the implied order of summation, from left to right
  - Not necessary for correctness, but it does reflect what the sequential code does
• Can process the DAG in any order respecting the dependences
Slide courtesy of Jakub Kurzak, UTK

Scheduling options
• Static (pre-assign tasks to processors) or dynamic (idle processors grab ready jobs from a work queue)
  - If dynamic, does the scheduler take user hints/priorities?
• Respect locality (e.g., a processor must have some task data in its cache) vs. not
• Build and store the entire DAG to schedule it (which may be very large, ~(N/b)^3 tasks), or build just the next few "levels" at a time (smaller, but less information for the scheduler)
• Programmer builds the DAG & schedule vs. depending on a compiler or run-time system
  - Ease of programming vs. not exploiting user knowledge
  - If a compiler, how conservative is its detection of parallelism?
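Sketch: tile Cholesky as tasks with dependences
The block Cholesky DAG above maps naturally onto task-based runtimes. Below is a minimal sketch of how the four tile kernels and the S1-S4 dependences could be expressed; it uses OpenMP task dependences purely for illustration (OpenMP is not one of the schedulers compared on the next slides), and the tile kernels syrk_tile, potrf_tile, gemm_tile, trsm_tile are hypothetical serial wrappers around the corresponding BLAS/LAPACK calls, not part of any library discussed here.

    /* Hypothetical serial tile kernels, assumed to wrap the corresponding
       BLAS/LAPACK calls on one b-by-b tile. */
    void syrk_tile(double *Akk, const double *Akn, int b);                    /* S1 */
    void potrf_tile(double *Akk, int b);                                      /* S2 */
    void gemm_tile(double *Amk, const double *Amn, const double *Akn, int b); /* S3 */
    void trsm_tile(double *Amk, const double *Akk, int b);                    /* S4 */

    /* A is an NT-by-NT array of pointers to b-by-b tiles, stored row-major:
       tile (i,j) is A[i*NT + j]. Dependences are declared on the first entry
       of each tile, a common proxy for the whole tile. */
    void tile_cholesky(int NT, int b, double **A)
    {
      #pragma omp parallel
      #pragma omp single
      for (int k = 0; k < NT; k++) {
        for (int n = 0; n < k; n++) {
          /* S1(k,n): SYRK reads tile (k,n), updates tile (k,k) */
          #pragma omp task depend(in: A[k*NT+n][0]) depend(inout: A[k*NT+k][0])
          syrk_tile(A[k*NT+k], A[k*NT+n], b);
        }
        /* S2(k): POTRF of the diagonal tile */
        #pragma omp task depend(inout: A[k*NT+k][0])
        potrf_tile(A[k*NT+k], b);

        for (int m = k + 1; m < NT; m++) {
          for (int n = 0; n < k; n++) {
            /* S3(k,m,n): GEMM reads tiles (m,n) and (k,n), updates tile (m,k) */
            #pragma omp task depend(in: A[m*NT+n][0], A[k*NT+n][0]) \
                             depend(inout: A[m*NT+k][0])
            gemm_tile(A[m*NT+k], A[m*NT+n], A[k*NT+n], b);
          }
          /* S4(k,m): TRSM reads tile (k,k), updates tile (m,k) */
          #pragma omp task depend(in: A[k*NT+k][0]) depend(inout: A[m*NT+k][0])
          trsm_tile(A[m*NT+k], A[k*NT+k], b);
        }
      }
    }

A dynamic runtime is then free to execute any task whose inputs are ready, which is exactly the "process the DAG in any order respecting the dependences" idea above.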
Schedulers tested
• Cilk
  - programmer-defined parallelism
  - spawn - creates independent tasks
  - sync - synchronizes a sub-branch of the tree
• SMPSs
  - dependency-defined parallelism
  - pragma-based annotation of tasks (directionality of the parameters)
• PLASMA (Static Pipeline)
  - programmer-defined (hard-coded)
  - a priori processing order
  - progress tracking
  - stalling on dependencies
Slide courtesy of Jakub Kurzak, UTK

Measured Results for Tiled Cholesky
• Measured on Intel Tigerton 2.4 GHz
• Cilk 1D: one task is a whole panel, but with "look-ahead"
• Cilk 2D: tasks are blocks; the scheduler steals work, little locality
• PLASMA works best
Slide courtesy of Jakub Kurzak, UTK

More Measured Results for Tiled Cholesky
• Measured on Intel Tigerton 2.4 GHz
[Performance plots: Cilk, SMPSs, PLASMA (Static Pipeline)]
Slide courtesy of Jakub Kurzak, UTK

Still More Measured Results for Tiled Cholesky
• PLASMA (static pipeline) - best
• SMPSs - somewhat worse
• Cilk 2D - inferior
• Cilk 1D - still worse
• Quad-socket, quad-core (16 cores total) Intel Tigerton 2.4 GHz
Slide courtesy of Jakub Kurzak, UTK

Intel's Clovertown Quad Core
Three implementations of LU factorization; quad core with 2 sockets per board, with 8 threads:
  1. LAPACK (BLAS fork-join parallelism)
  2. ScaLAPACK (message passing, using memory copy)
  3. DAG-based (dynamic scheduling)
[Plot: Mflop/s (0-45000) vs. problem size (1000-15000) for the 8-core experiments; the DAG-based version achieves the highest rate]
Source: Jack Dongarra

Scheduling on Multicore - Next Steps
• PLASMA 2.0.0 released
  - Just Cholesky, QR, LU, using static scheduling
  - LU does not do partial pivoting - stability?
  - http://icl.cs.utk.edu/plasma/
• Future of PLASMA
  - Add dynamic scheduling, similar to SMPSs
• DAGs for eigenproblems are too complicated to build by hand
  - Depend on user annotations and an API, not a compiler
  - Still assumes homogeneity of the available cores
• What about GPUs, or mixtures of CPUs and GPUs?
  - MAGMA

QR Factorization on Intel 16 cores, Tall Skinny Matrices
[Plot: Gflop/s (0-180) vs. N, where the matrix is M = 51200 x N; curves: Theoretical Peak, DGEMM Peak, MKL (10.1), LAPACK (3.2)]

LAPACK QR
[Figure: LAPACK QR proceeding through steps 1, 2, 3, 4, ...]

Parallel Tasks in QR
• Break into smaller tasks and remove dependencies
• Step 1: QR of block 1,1
• Step 2: Use R to zero A1,2
• Step 3: Use R to zero A1,3
• ...
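Sketch: tile QR task structure
One common way to organize these smaller QR tasks is the tile QR used by PLASMA-style libraries: factor the diagonal tile, apply its reflectors across the row, then couple the diagonal tile with each tile below it and update the corresponding rows. The sketch below shows only the loop structure; the kernel names follow the LAPACK/PLASMA-style GEQRT / ORMQR / TSQRT / TSMQR naming, but the wrappers and their signatures are assumptions, not an actual library API. A task version would add dependences exactly as in the Cholesky sketch earlier.

    /* Hypothetical serial tile kernels. */
    void geqrt_tile(double *Akk, double *Tkk, int b);              /* QR of the diagonal tile      */
    void ormqr_tile(const double *Akk, const double *Tkk,
                    double *Akj, int b);                           /* apply its Q^T to tile (k,j)  */
    void tsqrt_tile(double *Rkk, double *Aik, double *Tik, int b); /* zero tile (i,k) against R    */
    void tsmqr_tile(const double *Aik, const double *Tik,
                    double *Akj, double *Aij, int b);              /* update tiles (k,j) and (i,j) */

    /* MT-by-NT tile matrix A (tall skinny when MT > NT), with an auxiliary
       tile array T holding the block reflectors; tile (i,j) is A[i*NT + j]. */
    void tile_qr(int MT, int NT, int b, double **A, double **T)
    {
      for (int k = 0; k < NT; k++) {
        geqrt_tile(A[k*NT+k], T[k*NT+k], b);                /* factor the diagonal tile */
        for (int j = k + 1; j < NT; j++)
          ormqr_tile(A[k*NT+k], T[k*NT+k], A[k*NT+j], b);   /* update tile row k        */
        for (int i = k + 1; i < MT; i++) {
          tsqrt_tile(A[k*NT+k], A[i*NT+k], T[i*NT+k], b);   /* use R to zero tile (i,k) */
          for (int j = k + 1; j < NT; j++)
            tsmqr_tile(A[i*NT+k], T[i*NT+k],
                       A[k*NT+j], A[i*NT+j], b);            /* update tile rows k and i */
        }
      }
    }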
QR Factorization on Intel 16 cores, Tall Skinny Matrices
[Plot: Gflop/s (0-180) vs. N, where the matrix is M = 51200 x N; curves: Theoretical Peak, DGEMM Peak, PLASMA (2.1), MKL (10.1), LAPACK (3.2)]

Communication Reducing QR Factorization
• TS (tall skinny) matrix, MT = 6 and NT = 3 tiles, split into 2 domains
• 3 overlapped steps:
  - panel factorization
  - updating the trailing submatrix
  - merging the domains
• Final R computed
Courtesy Jack Dongarra
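Sketch: one panel step of domain-based (communication-reducing) QR
A rough structural sketch of one panel step of this algorithm: each domain independently reduces its part of the panel and updates its own trailing tiles, and the domains' triangular factors are then merged. The kernels below (domain_panel_qr, domain_update, merge_domains) are hypothetical placeholders, not an actual library API; each hides a sequence of tile operations like those in the tile QR sketch above, and all tile bookkeeping is omitted.

    void domain_panel_qr(double **A, double **T, int domain, int k, int b);
    void domain_update  (double **A, double **T, int domain, int k, int b);
    void merge_domains  (double **A, double **T, int k, int b);

    /* One panel step for a tall-skinny tile matrix split into 2 domains
       (here, MT = 6 tile rows split into two groups of 3, NT = 3). */
    void caqr_panel_step(double **A, double **T, int k, int b)
    {
      /* Step 1 (independent per domain): reduce each domain's part of panel k
         to a small triangular factor with a local tile QR. */
      for (int d = 0; d < 2; d++)
        domain_panel_qr(A, T, d, k, b);

      /* Step 2 (independent per domain): apply the local reflectors to the
         trailing columns owned by that domain. */
      for (int d = 0; d < 2; d++)
        domain_update(A, T, d, k, b);

      /* Step 3: merge the two domains' triangular factors into the final R for
         panel k, and apply the merge reflectors to the trailing columns. */
      merge_domains(A, T, k, b);
    }

With a DAG scheduler the three steps overlap across panels and domains, which is the "3 overlapped steps" shown in the slides.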
Example with 4 and 8 Domains
[Figure: the same domain-based QR reduction, with the panel split into 4 and into 8 domains]
A. Pothen and P. Raghavan (Penn State). Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610-1620, Pasadena, CA, Jan. 1988. ACM.
Courtesy Jack Dongarra

Execution Trace
[Trace of a 16-core run showing the interleaved tasks]
Courtesy Jack Dongarra

Communication Reducing QR Factorization
• Quad-socket, quad-core machine: Intel Xeon EMT64 E7340 at 2.39 GHz
• Theoretical peak is 153.2 Gflop/s with 16 cores
• Matrix size 51200 by 3200
[Plot: performance on this matrix; curves include Sequential PLASMA]
Courtesy Jack Dongarra

Cluster Experiment
• grig.sinrg.cs.utk.edu
• 61 nodes
  - Two CPUs per node
  - Intel Xeon 3.20 GHz
  - Peak performance 6.4 GFLOPS
  - Myrinet interconnection (MX 1.0.0)
• Goto BLAS 1.26
  - DGEMM performance 5.57 GFLOPS (87%)
• MPICH-MX
• gcc, 64 bits
Courtesy Jack Dongarra

Weak Scalability (8 columns of tiles)
• On 1 CPU, the matrix size is 64x8 tiles
• On k CPUs, the matrix size is (k*64)x8 tiles
• Tiles are 200x200 blocks
[Plot: "Scalability of CAQR on the Grig Cluster (8 tiles per row)" - GFLOPS per core vs. number of cores (1 to 64); curves: peak, dgemm, Distributed CAQR, ScaLAPACK]
Courtesy Jack Dongarra

Dense Linear Algebra on GPUs
• Source: Vasily Volkov's SC08 paper
  - Best Student Paper Award
• New challenges
  - More complicated memory hierarchy
    - Not like "L1 inside L2 inside ..."
    - Need to choose which memory to use carefully
    - Need to move data manually
  - GPU does some operations much faster than the CPU, but not all
  - CPU and GPU like different data layouts

Motivation
• NVIDIA released CUBLAS 1.0 in 2007, which is BLAS for GPUs
• This enables a straightforward port of LAPACK to the GPU
• Consider single precision only
[Bar chart, 2007 results, Gflop/s (0-350): GeForce 8800 GTX with CUBLAS 1.1 vs. Core2 Quad 2.4 GHz with MKL 10.0 - impressive sheer compute power in peak a*b+c; not so great in matrix-matrix multiply (BLAS SGEMM); disappointing performance in (naive) LU factorization (LAPACK SGETRF)]
• Goal: understand the bottlenecks in the dense linear algebra kernels
• Requires detailed understanding of the GPU architecture
• Result 1: New coding recommendations for high performance on GPUs
• Result 2: New, fast variants of LU, QR, Cholesky, and other routines

GPU Memory Hierarchy
[Diagram: 64 KB vector register file (64 lanes) and a 16 KB store (shared memory, 16 lanes) connected by a crossbar]
• The register file is the fastest and the largest on-chip memory
  - Constrained to vector operations only
• Shared memory permits indexed and shared access
  - However, it is 2-4x smaller and has 4x lower bandwidth than registers
• Only 1 operand in shared memory is allowed, versus 4 register operands
  - Some instructions run slower if using shared memory

Memory Latency on GeForce 8800 GTX
• Benchmark: repeat k = A[k], where A[k] = (k + stride) mod array_size
[Plot: latency (in ns and in cycles) vs. stride (4 bytes to 16 MB), for array sizes from 5 KB to a non-cached 128 MB, including 8 KB local memory]
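Sketch: a pointer-chasing latency microbenchmark
A minimal sketch of this kind of benchmark is shown below; it is not the code used for the measurements above, just an illustration of the idea. A single thread repeatedly executes k = A[k] on an array initialized with A[k] = (k + stride) mod array_size, so every load depends on the previous one and the average time per iteration approximates the memory latency for that array size and stride.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // One thread chases pointers through A. Every load depends on the previous
    // one, so the average time per iteration approximates the memory latency.
    // The write to *sink keeps the compiler from removing the chain.
    __global__ void chase(const int *A, int iters, int *sink)
    {
        int k = 0;
        for (int i = 0; i < iters; i++)
            k = A[k];
        *sink = k;
    }

    int main()
    {
        const size_t array_bytes  = 32u << 20;      // one array size, e.g. 32 MB
        const int    stride_bytes = 4096;           // one stride out of the sweep
        const int    n      = (int)(array_bytes / sizeof(int));
        const int    stride = stride_bytes / (int)sizeof(int);
        const int    iters  = 1 << 19;

        int *hA = (int*)malloc(array_bytes);
        for (int k = 0; k < n; k++)
            hA[k] = (k + stride) % n;               // A[k] = (k + stride) mod array_size

        int *dA, *dSink;
        cudaMalloc(&dA, array_bytes);
        cudaMalloc(&dSink, sizeof(int));
        cudaMemcpy(dA, hA, array_bytes, cudaMemcpyHostToDevice);

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);
        cudaEventRecord(t0);
        chase<<<1, 1>>>(dA, iters, dSink);          // a single thread: pure latency
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("array %zu bytes, stride %d bytes: %.1f ns per load\n",
               array_bytes, stride_bytes, ms * 1e6f / iters);

        cudaFree(dA);
        cudaFree(dSink);
        free(hA);
        return 0;
    }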
(Some new) NVIDIA coding recommendations
• Minimize communication with CPU memory
• Keep as much data in registers as possible
  - Largest, fastest on-GPU memory
  - Vector-only operations
• Use as little shared memory as possible
  - Smaller and slower than registers; use it for communication and sharing only
  - Speed limit: 66% of peak with one shared-memory argument
• Use vector length VL = 64, not the maximum VL = 512
  - Strip-mine longer vectors into shorter ones
• The final matmul code is similar to Cray X1 or IBM 3090 vector codes

    __global__ void sgemmNN( const float *A, int lda, const float *B, int ldb,
                             float* C, int ldc, int k, float alpha, float beta )
    {
        // Compute pointers to the data
        A += blockIdx.x * 64 + threadIdx.x + threadIdx.y*16;
        B += threadIdx.x + ( blockIdx.y * 16 + threadIdx.y ) * ldb;
        C += blockIdx.x * 64 + threadIdx.x + (threadIdx.y + blockIdx.y * ldc ) * 16;

        // Declare the on-chip storage
        __shared__ float bs[16][17];
        float c[16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};

        const float *Blast = B + k;
        do {
            // Read next B's block
            #pragma unroll
            for( int i = 0; i < 16; i += 4 )
                bs[threadIdx.x][threadIdx.y+i] = B[i*ldb];
            B += 16;
            __syncthreads();

            // The bottleneck: read A's columns, do rank-1 updates
            #pragma unroll
            for( int i = 0; i < 16; i++, A += lda ) {
                c[0]  += A[0]*bs[i][0];  c[1]  += A[0]*bs[i][1];
                c[2]  += A[0]*bs[i][2];  c[3]  += A[0]*bs[i][3];
                c[4]  += A[0]*bs[i][4];  c[5]  += A[0]*bs[i][5];
                c[6]  += A[0]*bs[i][6];  c[7]  += A[0]*bs[i][7];
                c[8]  += A[0]*bs[i][8];  c[9]  += A[0]*bs[i][9];
                c[10] += A[0]*bs[i][10]; c[11] += A[0]*bs[i][11];
                c[12] += A[0]*bs[i][12]; c[13] += A[0]*bs[i][13];
                c[14] += A[0]*bs[i][14]; c[15] += A[0]*bs[i][15];
            }
            __syncthreads();
        } while( B < Blast );

        // Store C's block to memory
        for( int i = 0; i < 16; i++, C += ldc )
            C[0] = alpha*c[i] + beta*C[0];
    }

Our code vs. CUBLAS 1.1
Performance in multiplying two NxN matrices on GeForce 8800 GTX:
[Plot: fraction of peak vs. N (64 to 4096); multiply-and-add with an operand in shared memory reaches 66% of peak, our implementation 60%, CUBLAS 1.1 37%]

The Progress So Far
[Bar chart, Gflop/s (0-350) on GeForce 8800 GTX vs. Core2 Quad: peak in a*b+c in registers and the smaller peak when using shared memory (arithmetic runs slower if using shared memory); BLAS SGEMM with CUBLAS 1.1, our implementation (now in CUBLAS 2.0), and the Core2 Quad; LAPACK SGETRF naive with CUBLAS 2.0, and the Core2 Quad. Our SGEMM is good compared to the new, smaller peak - so where does the time go?]
• We achieved predictable performance in SGEMM
  - SGEMM does the O(N^3) work in LU factorization
• But LU factorization (naive SGETRF) still underperforms
  - Must be due to the remaining O(N^2) work, done in BLAS1 and BLAS2
  - Why does the O(N^2) work take so much time?
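Sketch: a back-of-envelope model of the O(N^2) cost
The next slides answer this by measuring the BLAS1/BLAS2 pieces directly. As a rough illustration only, the model below combines the 4 microsecond per-BLAS-call overhead and 127 GB/s sustained bandwidth quoted on the Panel Factorization slide that follows with an assumed four BLAS calls per column (ISAMAX, SSWAP, SSCAL, SGER, as in LAPACK's SGETF2) and a crude traffic estimate; it is an assumption-laden sketch, not the bound actually plotted there.

    #include <stdio.h>

    /* Very rough model of factorizing an n-by-64 panel with LAPACK's SGETF2
       when each BLAS call is launched on the GPU from the CPU. */
    int main(void)
    {
        const double overhead = 4e-6;     /* seconds per BLAS call (from the slide) */
        const double bw       = 127e9;    /* bytes/s sustained (from the slide)     */
        const int    nb       = 64;       /* panel width                            */

        for (int n = 1024; n <= 32768; n *= 2) {
            double calls = 4.0 * nb;                        /* ~4 BLAS calls per column */
            double bytes = 2.0 * 4.0 * (double)n * nb * nb; /* panel read+written ~once
                                                               per column by SGER       */
            double flops = (double)n * nb * nb;             /* ~n*64^2 flops            */
            double t     = calls * overhead + bytes / bw;
            printf("n = %6d: modeled time %8.1f us, %5.1f Gflop/s\n",
                   n, t * 1e6, flops / t * 1e-9);
        }
        return 0;
    }

Even before any arithmetic, the per-call launch overhead alone contributes on the order of a millisecond per panel, which is why the small BLAS1/BLAS2 tasks dominate the naive port.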
Row-Pivoting in LU Factorization
Exchange two rows of an NxN matrix (SSWAP in CUBLAS 2.0):
[Plot: time in microseconds (log scale, 4 to 1024) vs. N (2048 to 16384); the annotated gap is about 40x]
• Row pivoting in column-major layout on the GPU is very slow
• This alone consumes half of the runtime in naive SGETRF

BLAS1 Performance
Scale a column of an NxN matrix that fits in GPU memory (assumes aligned, unit-stride access):
[Plot: time in microseconds vs. N for GeForce 8600 GTS (peak = 32 GB/s) and GeForce GTX 280 (peak = 141 GB/s)]
• Peak bandwidth of these GPUs differs by a factor of 4.4
• But the runtimes are similar
• Small tasks on the GPU are overhead-bound

Panel Factorization
Factorizing an Nx64 matrix in GPU memory using LAPACK's SGETF2:
[Plot: Gflop/s vs. N (64 to 32768), measured rate vs. a bound that assumes 4 microseconds of overhead per BLAS call and 127 GB/s bandwidth in memory access (the best sustained numbers)]
• Invoking small BLAS operations on the GPU from the CPU is slow
• Can we call a sequence of BLAS operations from the GPU?
  - Requires barrier synchronization after each parallel BLAS operation
  - A barrier is possible, but requires sequential consistency for correctness

Optimizing Matrix Factorizations
• Use the GPU to compute matrix-matrix multiplies only
• Factorize panels on the CPU
• Use look-ahead to overlap computations on the CPU and GPU
• Batch pivoting
• Use right-looking algorithms to have more threads in SGEMM
  - Better load balance in the GPU workload, better latency hiding
• Use row-major layout on the GPU in LU factorization
  - Requires an extra (but fast) matrix transpose for each CPU-GPU transfer
• Substitute the triangular solves LX = B (TRSM) with multiplication by L^-1
  - At worst this squares the pivot growth factor in the error bound (assumed small anyway)
  - Can check ||L^-1|| and use TRSM if it is too large
• Use two-level and variable-size blocking as finer tuning
  - Thicker blocks impose lower bandwidth requirements in SGEMM
  - Variable-size blocking improves CPU/GPU load balance
• Use a column-cyclic layout when computing with two GPUs
  - Requires no data exchange between GPUs in pivoting
  - The cyclic layout is used on the GPUs only, so it does not affect panel factorization

Performance Results
[Plot: Gflop/s (0-350) vs. order of matrix (64 to 16384) for QR, Cholesky, and LU; annotations 51%, 49%, 78%]
• Our solution runs at 50% of the system's peak
• It is bound by SGEMM, which runs at 60% of the GPU-only peak

Speedup of Factorizations on GPU over CPU
[Plot: speedup vs. Core2 Quad for QR, Cholesky, and LU vs. order of matrix (64 to 16384); up to 4.4x on the GTX 280 and 2.7x on the 8800 GTX]
• The GPU is only useful on large enough matrices

Slowdown when omitting one optimization from LU on GeForce 8800 GTX
[Plot: slowdown (up to ~2.0x) vs. order of matrix (64 to 16384) when omitting each of: overlap CPU/GPU, transpose matrix, TRSM via GEMM, batch pivoting]

Time Breakdown for LU on GeForce 8800 GTX
[Stacked plot: fraction of time vs. order of matrix (448 to 11264), split among GPU, CPU/GPU overlap, CPU, look-ahead, transpose, and CPU-GPU transfer]

LU Factorization using Two GPUs
[Plot: Gflop/s vs. order of matrix (up to 22500); labeled rates include 538, 309, 298, and 179 Gflop/s]
• A second GPU allows 1.7x higher rates
• More than half a teraflop using two GPUs
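Sketch: look-ahead loop structure for LU on CPU + GPU
A structural sketch of the look-ahead idea from the "Optimizing Matrix Factorizations" slide above: the GPU updates the bulk of the trailing matrix for panel k while the CPU factors panel k+1. The wrapper names, signatures, and one-panel look-ahead depth are illustrative assumptions, not Volkov's actual code or the CUBLAS API; GPU routines are assumed to be launched asynchronously (as CUDA kernel launches are), so the CPU work overlaps with the wide trailing update.

    /* Hypothetical wrappers hiding the CUBLAS/LAPACK calls and CPU-GPU copies. */
    void copy_panel_to_cpu (int k);                        /* device -> host            */
    void copy_panel_to_gpu (int k);                        /* host -> device            */
    void cpu_panel_factor  (int k);                        /* SGETF2 + pivoting on CPU  */
    void gpu_apply_pivots  (int k);                        /* batched row swaps on GPU  */
    void gpu_update        (int k, int jfirst, int jlast); /* TRSM + SGEMM on panels j  */
    void gpu_wait          (void);                         /* wait for queued GPU work  */

    void lu_lookahead(int npanels)
    {
      copy_panel_to_cpu(0);
      cpu_panel_factor(0);                             /* first panel on the CPU       */
      for (int k = 0; k < npanels; k++) {
        copy_panel_to_gpu(k);
        gpu_apply_pivots(k);
        if (k + 1 < npanels) {
          gpu_update(k, k + 1, k + 1);                 /* update only the next panel   */
          gpu_wait();
          copy_panel_to_cpu(k + 1);
          gpu_update(k, k + 2, npanels - 1);           /* GPU: wide trailing update... */
          cpu_panel_factor(k + 1);                     /* ...while the CPU factors k+1 */
        }
      }
    }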
Results for matmul and LU on NVIDIA
[Bar chart, Gflop/s (0-350), GeForce 8800 GTX vs. Core2 Quad: peak in a*b+c in registers and if using shared memory; BLAS SGEMM with CUBLAS 1.1, our implementation (now in CUBLAS 2.0), and the Core2 Quad; LAPACK SGETRF naive, our implementation, and the Core2 Quad]
• What we've achieved:
  - Identified the realistic peak speed of the GPU architecture
  - Achieved a large fraction of this peak in matrix multiply
  - Achieved a large fraction of the matrix multiply rate in dense factorizations

Extra Slides