PROGRAMMING SOLUTIONS FOR LINEAR ALGEBRA
Joe Eaton, Ph.D. -- Manager, Sparse Linear Algebra Libraries and Graph Analytics, HPC
Sparse Days at Saint Girons, June 29, 2015

HOW CAN I USE GPUs... WITHOUT HAVING TO LEARN CUDA?

GPU-ACCELERATED LIBRARIES
- CUDA Toolkit libraries: cuSPARSE, cuSOLVER, cuBLAS, cuDNN
- NVIDIA proprietary libraries: AmgX, NAGA (working title)
- Third-party libraries: Trilinos, PETSc, ArrayFire, CHOLMOD
- https://developer.nvidia.com/gpu-accelerated-libraries

AGENDA
1. cuBLAS & nvBLAS
2. cuSPARSE
3. cuSOLVER
4. AmgX & third-party libraries

SPARSE PROBLEMS
- Linear systems: A x = f
- Linear least squares: min ||B y - f||_2
- Eigenvalue problems: A V = V D
- Singular value decomposition: A = U D V^T; truncated SVD (min ||A - U_k D_k V_k^T||_2) for model reduction
- Application areas: computational chemistry, biology, machine learning

cuBLAS AND nvBLAS

NVBLAS = DROP-IN ACCELERATION
- Sits between the application's linear algebra layer and BLAS, automatically detecting calls to GEMM and replacing them with cuBLAS
- No application code changes
- Good for existing libraries where BLAS is heavily used

CUBLAS: (DENSE MATRIX) x (DENSE VECTOR)
- y = alpha * op(A) * x + beta * y, with A a dense matrix and x, y dense vectors
- 2x speedup in CUDA 7.5 for small matrices (< 500 x 500)
- [Figure: the GEMV operation illustrated with a 3 x 5 dense A and a dense x]

BATCHED CUBLAS ROUTINES
- Process many similar matrices at once: factorize (LU), multiply (GEMM)
- Good for FEM codes and multi-scale methods
- Chemistry kinetics with up to 72 species, combined with CVODE

cuSPARSE

CUSPARSE: (DENSE MATRIX) x (SPARSE VECTOR)
- cusparse<T>gemvi(): y = alpha * op(A) * x + beta * y, with A a dense matrix, x a sparse vector, and y a dense vector
- Speeds up natural language processing: the sparse vector could be the frequencies of words in a text sample
- [Figure: the same 3 x 5 dense A multiplied by a sparse x with two nonzeros]
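As a usage sketch of the two matrix-vector products above, the code below calls cublasSgemv for the dense-matrix x dense-vector case and cusparseSgemvi for the dense-matrix x sparse-vector case. The 3 x 5 matrix and the vector values are made up for illustration, and status checking is kept minimal; treat it as a sketch of the call sequence, not production code.

// Sketch: y = alpha*op(A)*x + beta*y with a dense x (cuBLAS) and a sparse x (cuSPARSE).
// Matrix/vector values are illustrative only; real code should check every status code.
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusparse.h>

int main(void) {
    const int m = 3, n = 5;                       // A is m x n, column-major as BLAS expects
    float A[15], x[5] = {3, 2, -5, 9, 1}, y[3] = {0};
    for (int i = 0; i < m * n; ++i) A[i] = (float)(i + 1);
    const float alpha = 1.0f, beta = 0.0f;

    float *dA, *dx, *dy;
    cudaMalloc((void**)&dA, sizeof(A));
    cudaMalloc((void**)&dx, sizeof(x));
    cudaMalloc((void**)&dy, sizeof(y));
    cudaMemcpy(dA, A, sizeof(A), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x, sizeof(x), cudaMemcpyHostToDevice);
    cudaMemset(dy, 0, sizeof(y));

    // Dense matrix x dense vector: cublas<T>gemv
    cublasHandle_t blas;
    cublasCreate(&blas);
    cublasSgemv(blas, CUBLAS_OP_N, m, n, &alpha, dA, m, dx, 1, &beta, dy, 1);
    cudaMemcpy(y, dy, sizeof(y), cudaMemcpyDeviceToHost);
    printf("gemv : %g %g %g\n", y[0], y[1], y[2]);

    // Dense matrix x sparse vector: cusparse<T>gemvi, x passed as (values, indices)
    const int nnz = 2;
    float xVal[2] = {2.0f, 1.0f};                 // e.g. word frequencies
    int   xInd[2] = {0, 3};                       // positions of the nonzeros in x
    float *dxVal, *dyS; int *dxInd; void *dBuf; int bufSize = 0;
    cudaMalloc((void**)&dxVal, sizeof(xVal));
    cudaMalloc((void**)&dxInd, sizeof(xInd));
    cudaMalloc((void**)&dyS, sizeof(y));
    cudaMemcpy(dxVal, xVal, sizeof(xVal), cudaMemcpyHostToDevice);
    cudaMemcpy(dxInd, xInd, sizeof(xInd), cudaMemcpyHostToDevice);
    cudaMemset(dyS, 0, sizeof(y));

    cusparseHandle_t sp;
    cusparseCreate(&sp);
    cusparseSgemvi_bufferSize(sp, CUSPARSE_OPERATION_NON_TRANSPOSE, m, n, nnz, &bufSize);
    cudaMalloc(&dBuf, (size_t)bufSize);
    cusparseSgemvi(sp, CUSPARSE_OPERATION_NON_TRANSPOSE, m, n, &alpha, dA, m,
                   nnz, dxVal, dxInd, &beta, dyS, CUSPARSE_INDEX_BASE_ZERO, dBuf);
    cudaMemcpy(y, dyS, sizeof(y), cudaMemcpyDeviceToHost);
    printf("gemvi: %g %g %g\n", y[0], y[1], y[2]);

    cublasDestroy(blas); cusparseDestroy(sp);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    cudaFree(dxVal); cudaFree(dxInd); cudaFree(dyS); cudaFree(dBuf);
    return 0;
}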
GRAPH COLORING: REALISTIC EXAMPLES
- Test matrices: Offshore (N = 259,789; nnz = 4,242,673) and G3_Circuit (N = 1,585,478; nnz = 7,660,826)
- [Charts: rows per level -- graph coloring produces only about 9 levels (Offshore) and 21 levels (G3_Circuit), while level scheduling produces thousands of levels (roughly 2,500-3,400) for the same matrices]

ILU0 + COLORING = BIG WIN
- [Bar chart: speedup of ILU0 with coloring over the level-scheduling approach, on a 0-6x scale]

cuSOLVER (new in CUDA 7.0, extended in CUDA 7.5)
- Routines for solving sparse or dense linear systems and eigenproblems
- Divided into three APIs for three different use cases:
  - cuSolverDN: a subset of LAPACK for small dense systems (LU, Cholesky, QR, LDL^T, SVD)
  - cuSolverSP: sparse direct solvers and eigensolvers; sparse QR, sparse batched QR, least squares
  - cuSolverRF: fast refactorization solver for sparse matrices (multiscale methods, chemistry)

cuSolverDN API
- Subset of LAPACK (direct solvers for dense matrices), covering only the most popular methods: Cholesky, LU, QR, SVD, Bunch-Kaufman LDL^T, batched QR
- Useful for: computer vision, high-order FEM

cuSolverRF API
- LU-based sparse direct solver; requires the factorization to already be computed (e.g., using KLU or SuperLU)
- Batched version: many small matrices solved in parallel
- Useful for: SPICE, combustion simulation, chemically reacting flow calculations, other types of ODEs, mechanics, multiscale methods, FEM

cuSolverSP API
- Sparse direct solver based on QR factorization
- Linear solver A x = b (QR- or Cholesky-based)
- Least-squares solver: min ||A x - b||
- Eigenvalue solver based on shift-inverse for A x = lambda x; can also find the number of eigenvalues in a box
- Useful for: well models in oil & gas, nonlinear solvers via Newton's method, anywhere a sparse direct solver is required
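A minimal call-sequence sketch for the cuSolverDN path described above, using a tiny hard-coded symmetric positive definite system: factor with cusolverDnDpotrf, then solve with cusolverDnDpotrs. The matrix and right-hand side are made-up values.

// Sketch: dense Cholesky factor-and-solve with cuSolverDN for a 3x3 SPD system.
// Status checks are omitted for brevity; only the final devInfo value is inspected.
#include <stdio.h>
#include <cuda_runtime.h>
#include <cusolverDn.h>

int main(void) {
    const int n = 3, lda = 3;
    double A[9] = {4, 1, 1,   1, 3, 0,   1, 0, 2};   // column-major, SPD
    double b[3] = {6, 4, 3};

    double *dA, *dB, *dWork; int *dInfo, lwork = 0, info = 0;
    cudaMalloc((void**)&dA, sizeof(A));
    cudaMalloc((void**)&dB, sizeof(b));
    cudaMalloc((void**)&dInfo, sizeof(int));
    cudaMemcpy(dA, A, sizeof(A), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b, sizeof(b), cudaMemcpyHostToDevice);

    cusolverDnHandle_t h;
    cusolverDnCreate(&h);

    // Query workspace size, then factor A = L*L^T in place
    cusolverDnDpotrf_bufferSize(h, CUBLAS_FILL_MODE_LOWER, n, dA, lda, &lwork);
    cudaMalloc((void**)&dWork, sizeof(double) * lwork);
    cusolverDnDpotrf(h, CUBLAS_FILL_MODE_LOWER, n, dA, lda, dWork, lwork, dInfo);

    // Solve A*x = b using the factor; the solution overwrites dB
    cusolverDnDpotrs(h, CUBLAS_FILL_MODE_LOWER, n, 1, dA, lda, dB, n, dInfo);

    double xsol[3];
    cudaMemcpy(&info, dInfo, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(xsol, dB, sizeof(xsol), cudaMemcpyDeviceToHost);
    printf("info = %d, x = %g %g %g\n", info, xsol[0], xsol[1], xsol[2]);

    cusolverDnDestroy(h);
    cudaFree(dA); cudaFree(dB); cudaFree(dInfo); cudaFree(dWork);
    return 0;
}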
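And a corresponding sketch for the cuSolverSP linear-solver path: solving a small CSR system with the QR-based routine cusolverSpDcsrlsvqr. The 4x4 tridiagonal matrix is a stand-in chosen only to show the call sequence; the tolerance and reordering settings are illustrative.

// Sketch: sparse direct solve A*x = b with cuSolverSP's QR-based csrlsvqr.
// The 4x4 CSR matrix below is a stand-in; status checks are omitted for brevity.
#include <stdio.h>
#include <cuda_runtime.h>
#include <cusolverSp.h>

int main(void) {
    const int m = 4, nnz = 10;                       // 4x4 tridiagonal, 0-based CSR
    int    rowPtr[5]  = {0, 2, 5, 8, 10};
    int    colInd[10] = {0, 1,  0, 1, 2,  1, 2, 3,  2, 3};
    double val[10]    = {4, -1, -1, 4, -1, -1, 4, -1, -1, 4};
    double b[4] = {1, 2, 3, 4}, x[4];

    int *dRow, *dCol; double *dVal, *dB, *dX;
    cudaMalloc((void**)&dRow, sizeof(rowPtr));
    cudaMalloc((void**)&dCol, sizeof(colInd));
    cudaMalloc((void**)&dVal, sizeof(val));
    cudaMalloc((void**)&dB, sizeof(b));
    cudaMalloc((void**)&dX, sizeof(x));
    cudaMemcpy(dRow, rowPtr, sizeof(rowPtr), cudaMemcpyHostToDevice);
    cudaMemcpy(dCol, colInd, sizeof(colInd), cudaMemcpyHostToDevice);
    cudaMemcpy(dVal, val, sizeof(val), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b, sizeof(b), cudaMemcpyHostToDevice);

    cusolverSpHandle_t h;
    cusparseMatDescr_t descr;
    cusolverSpCreate(&h);
    cusparseCreateMatDescr(&descr);                  // defaults: general matrix, 0-based indexing

    // QR-based direct solve; 'singularity' reports the first near-zero pivot (-1 if none)
    int singularity = 0;
    cusolverSpDcsrlsvqr(h, m, nnz, descr, dVal, dRow, dCol, dB,
                        1e-12, 0 /* reordering flag: 0 = none */, dX, &singularity);

    cudaMemcpy(x, dX, sizeof(x), cudaMemcpyDeviceToHost);
    printf("singularity = %d, x = %g %g %g %g\n", singularity, x[0], x[1], x[2], x[3]);

    cusparseDestroyMatDescr(descr); cusolverSpDestroy(h);
    cudaFree(dRow); cudaFree(dCol); cudaFree(dVal); cudaFree(dB); cudaFree(dX);
    return 0;
}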
CUSOLVER DENSE GFLOPS VS MKL
- [Bar chart: GFLOPS achieved on GPU vs. CPU for the dense routines, axis 0-1800]
- GPU: K40c, M = N = 4096; CPU: Intel Xeon E5-2697 v3 @ 3.60 GHz, 14 cores, MKL v11.04

CUSOLVER SPEEDUP
- cuSolverSP: sparse QR analysis, factorization and solve; cuSolverDN: Cholesky analysis, factorization and solve
- [Bar charts: dense Cholesky (SPOTRF/DPOTRF/CPOTRF/ZPOTRF) speedups of roughly 1.2x-3.7x; sparse QR speedups ranging from about 1.2x to 11.3x]
- GPU: K40c, M = N = 4096; CPU: Intel Xeon E5-2697 v3 @ 3.60 GHz, 14 cores; MKL v11.04 for dense Cholesky, NVIDIA csr-QR implementation for both CPU and GPU

AmgX

NESTED SOLVERS
- Example hierarchy: GMRES -> AMG -> MC-DILU -> Jacobi
- GMRES: local and global operations, no setup
- AMG: setup = graph coarsening and matrix-matrix products; solve = smoothing and matrix-vector products
- Jacobi: simple local (neighbor) operations; smoothing, extract diagonal
- MC-DILU: setup = graph coloring and factorization; solve = local (sub)matrix-vector multiplication

ALGEBRAIC MULTIGRID (AMG)
Solve A x = f, starting from an initial guess x_0. One V-cycle with two levels:
1. Pre-smooth: x_k = x_{k-1} + M^{-1} (f - A x_{k-1})
2. Residual: r_k = f - A x_k
3. Restrict: g_k = P^T r_k
4. Form the coarse operator: B = P^T A P
5. Solve the coarse system: B e_k = g_k
6. Prolongate: d_k = P e_k
7. Correct: x_{k+1} = x_k + d_k
8. Post-smooth: x_{k+2} = x_{k+1} + M^{-1} (f - A x_{k+1})
The cycle ends with the approximate solution x_{k+2}.

GRAPH COARSENING - AGGREGATION
- Fine grid <-> coarse grid, connected by restriction (P^T) and prolongation (P)
- [Diagram: fine-grid nodes aggregated into coarse-grid nodes]

ANSYS FLUENT 16.0 ON NVIDIA GPUs

GPU ACCELERATION OF WATER JACKET ANALYSIS (8 CPU CORES)
ANSYS Fluent 16.0 performance on the pressure-based coupled solver; internal flow model, unsteady RANS, fluid: water; times are for 20 time steps (lower is better).
- 8 CPU cores (AMG) vs. 8 CPU cores + 2 GPUs (AmgX)
- AMG solver time: 4,557 -> 775 (5.9x); Fluent solution time: 6,391 -> 2,520 (2.5x)
- The AMG solver accounts for about 70% of the total solution time

GPU ACCELERATION OF WATER JACKET ANALYSIS (16 CPU CORES)
Same model and solver settings with 16 CPU cores.
- 16 CPU cores (AMG) vs. 16 CPU cores + 2 GPUs (AmgX)
- AMG solver time: 2,048 -> 710 (2.9x); Fluent solution time: 3,062 -> 1,647 (~2x)
- The AMG solver again accounts for about 70% of the total solution time

ANSYS FLUENT POWER CONSUMPTION STUDY
- 2x speedup and 38% energy savings with GPUs

GPU VALUE PROPOSITION FOR FLUENT 16.0
- Simulation productivity with HPC Workgroup, truck body model (14 million cells): 11 jobs/day on CPUs only vs. 33 jobs/day with GPUs (higher is better)
- 200% additional productivity from GPUs for a 40% additional cost of adding them, i.e. roughly 5x additional productivity per dollar spent on GPUs
- Baseline: CPU-only solution cost = 100%, CPU-only productivity = 100%
- All results are based on turbulent flow over a truck case (14 million cells) run to convergence; steady-state, pressure-based coupled solver with double precision. Hardware: Intel Xeon E5-2698 v3 (64 CPU cores on 4 sockets; 2 nodes with InfiniBand interconnect) and 4 Tesla K80 GPUs. License: ANSYS Fluent and ANSYS HPC Workgroup 64. The CPU-only solution cost is approximated and includes both hardware and software license costs; benefit/productivity is based on the number of completed Fluent jobs per day.

AMGX KEY FEATURES
- Multi-GPU support, from workstations to clusters with hundreds of nodes; the sweet spot seems to be 8 GPUs per node
- More solvers, smoothers and preconditioners: Krylov methods, basic iterative solvers, AMG
- Eigenvalue solvers: subspace iteration, restarted Arnoldi (ARPACK), Lanczos, Jacobi-Davidson, LOBPCG

POISSON EQUATION: WEAK SCALING
Aggregation and classical AMG, weak scaling with 8 million DOF per GPU, from 1 to 512 GPUs.
- [Chart: setup time (0-12 s) vs. number of GPUs for AmgX 1.0 classical (PMIS) and aggregation (AGG)]
- [Chart: time per iteration vs. log(P) for classical and aggregation AMG, with linear fits, R^2 = 0.92 and 0.94]
- Titan (Oak Ridge National Laboratory); GPU: NVIDIA K20x (one per node); CPU: 16-core AMD Opteron 6274 @ 2.2 GHz

GPU CODING MADE EASIER & MORE EFFICIENT
- Hyper-Q: 32 MPI jobs per GPU -- easy speedup for legacy MPI applications. [Chart: CP2K (quantum chemistry) speedup vs. a dual-K20 baseline as the number of GPUs grows, K20 with Hyper-Q vs. without Hyper-Q]
- Dynamic Parallelism: the GPU generates its own work -- less effort, higher performance. [Chart: relative Quicksort sorting performance with and without Dynamic Parallelism as the problem size grows to millions of elements]

THIRD PARTY LIBRARIES
- Trilinos (Sandia National Laboratories): discretizations, linear system and eigenvalue solvers, and more; supports NVIDIA GPUs through the Kokkos parallel primitives. http://trilinos.sandia.gov/
- PETSc (Argonne National Laboratory): iterative methods, nonlinear solvers, and more; supports NVIDIA GPUs (iterative methods plus some preconditioners). http://www.mcs.anl.gov/petsc/index.html
- ArrayFire: array abstraction for numerical operations; factorizations, eigenvalue solvers, and more. http://arrayfire.com/
- SuperLU (Sherry Li): GPU acceleration/offload of the supernodal LU factorization
- CHOLMOD (Tim Davis): Cholesky factorization for symmetric positive definite matrices (A = L L^T)

OPENACC: OPEN, SIMPLE, PORTABLE
- Open standard
- Easy, compiler-driven approach:
    main() {
        ... <serial code> ...
        #pragma acc kernels   // compiler hint
        {
            <compute intensive code>
        }
        ...
    }
- Portable across GPUs and Xeon Phi
- CAM-SE climate code: 6x faster on GPU; the top kernel is 50% of the runtime; only 5% of the code modified; one source to maintain

THANK YOU!
Questions, feedback, discussion.

APPENDIX: WEB AND SOCIAL GRAPHS
- Giga-scale memory requirements: 10 GB - 100 GB in binary format
- Power-law distribution of dependencies: ~10^1 - 10^2 nonzero elements per row on average, with a few rows that are almost dense
- Hard to partition effectively

PROPERTIES OF THE WIKIPEDIA ARTICLE GRAPH
- N = 3,721,339 rows; NNZ = 66,454,329; average nnz per row = 17.8; minimum nnz = 0; maximum nnz = 371,662; standard deviation = 343
- 63.15% of the values come from 20% of the variables
- The largest row is 9.99% dense, while 50% of the rows have fewer than 4 elements
- [Chart: nnz vs. row ID, following a power-law curve]
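To show how row-population statistics like the ones above can be gathered, here is a small sketch that derives min/max/average nonzeros per row and the share of rows with fewer than 4 entries from a CSR row-pointer array. The tiny hard-coded row pointer is a placeholder; the real graph would be loaded from its binary format.

// Sketch: per-row nnz statistics from a CSR row pointer (as used for the graph numbers above).
// The row pointer here is a small stand-in for a real web-scale graph.
#include <stdio.h>

int main(void) {
    const int n = 8;
    // rowPtr has n+1 entries; nnz of row i is rowPtr[i+1] - rowPtr[i]
    int rowPtr[9] = {0, 1, 1, 4, 6, 6, 7, 15, 20};

    int minNnz = rowPtr[1] - rowPtr[0], maxNnz = minNnz, shortRows = 0;
    long long total = 0;
    for (int i = 0; i < n; ++i) {
        int nnz = rowPtr[i + 1] - rowPtr[i];
        total += nnz;
        if (nnz < minNnz) minNnz = nnz;
        if (nnz > maxNnz) maxNnz = nnz;
        if (nnz < 4) ++shortRows;          // rows with fewer than 4 elements
    }
    printf("N = %d, NNZ = %lld, avg = %.1f, min = %d, max = %d, rows<4 = %.0f%%\n",
           n, total, (double)total / n, minNnz, maxNnz, 100.0 * shortRows / n);
    return 0;
}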