PROGRAMMING SOLUTIONS FOR LINEAR ALGEBRA
Joe Eaton, Ph.D.
Manager, Sparse Linear Algebra Libraries and Graph Analytics, HPC
Sparse Days at St. Girons
June 29, 2015
How can I use GPUs… without having to learn CUDA?
GPU-Accelerated Libraries
CUDA Toolkit libraries
  cuSPARSE, cuSOLVER, cuBLAS
  cuDNN
NVIDIA proprietary libraries
  AmgX
  NAGA (working title)
Third-party libraries
  Trilinos, PETSc
  ArrayFire, CHOLMOD
https://developer.nvidia.com/gpu-accelerated-libraries
AGENDA
1. cuBLAS & nvBLAS
2. cuSPARSE
3. cuSOLVER
4. AmgX & 3rd Party
Sparse Problems
• Linear systems: A x = f
• Linear least squares: min ||B y - f||_2
• Eigenvalue problems: A V = V D
• Singular value decomposition: A = U D V^T
  – truncated for model reduction: min ||A - U_k D_k V_k^T||_2
Application areas: computational chemistry, biology, machine learning
cuBLAS and nvBLAS
NVBLAS = DROP-IN ACCELERATION
[Diagram: Application -> Linear Algebra -> BLAS -> cuBLAS -> GPU]
Automatically detects and replaces calls to GEMM with cuBLAS
No application code changes
Good for existing libraries where BLAS is heavily used (see the configuration sketch below)
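A minimal drop-in sketch, assuming the standard nvBLAS deployment (LD_PRELOAD plus an nvblas.conf in the working directory); the configuration keys below are the ones I recall from the nvBLAS documentation, and the CPU BLAS path is just an example:

    # nvblas.conf
    NVBLAS_CPU_BLAS_LIB  libopenblas.so     # CPU BLAS used for calls nvBLAS does not offload
    NVBLAS_GPU_LIST      ALL                # use every visible GPU
    NVBLAS_LOGFILE       nvblas.log

    # run an existing BLAS application unchanged
    LD_PRELOAD=libnvblas.so ./my_blas_app

Intercepted Level-3 BLAS calls (GEMM and friends) are routed to cuBLAS when the problem is large enough; everything else falls through to the CPU BLAS named above.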
CUBLAS: (DENSE MATRIX) X (DENSE VECTOR)
2x Speedup in CUDA 7.5 for small matrices <500x500
y = α ∗ op(A)∗x + β∗y
A = dense matrix
x = dense vector
y = dense vector
[y1; y2; y3] = α ∗ [A11 A12 A13 A14 A15; A21 A22 A23 A24 A25; A31 A32 A33 A34 A35] ∗ [3; 2; -5; 9; 1] + β ∗ [y1; y2; y3]
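A minimal sketch of this gemv call through the cuBLAS v2 API (error checking omitted; the matrix values are placeholders and only the vector comes from the picture above; built with something like nvcc file.cu -lcublas):

    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <stdio.h>

    int main(void) {
        const int m = 3, n = 5;                      /* A is m x n, column-major */
        float A[15], x[5] = {3, 2, -5, 9, 1}, y[3] = {0, 0, 0};
        for (int i = 0; i < m * n; ++i) A[i] = 1.0f; /* placeholder values for A */

        float *dA, *dx, *dy;
        cudaMalloc((void **)&dA, sizeof(A));
        cudaMalloc((void **)&dx, sizeof(x));
        cudaMalloc((void **)&dy, sizeof(y));
        cudaMemcpy(dA, A, sizeof(A), cudaMemcpyHostToDevice);
        cudaMemcpy(dx, x, sizeof(x), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, y, sizeof(y), cudaMemcpyHostToDevice);

        cublasHandle_t h;
        cublasCreate(&h);
        const float alpha = 1.0f, beta = 0.0f;
        /* y = alpha * A * x + beta * y  (op(A) = A, leading dimension = m) */
        cublasSgemv(h, CUBLAS_OP_N, m, n, &alpha, dA, m, dx, 1, &beta, dy, 1);

        cudaMemcpy(y, dy, sizeof(y), cudaMemcpyDeviceToHost);
        printf("y = %f %f %f\n", y[0], y[1], y[2]);

        cublasDestroy(h);
        cudaFree(dA); cudaFree(dx); cudaFree(dy);
        return 0;
    }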
BATCHED CUBLAS ROUTINES
Process many similar matrices at once (see the sketch below)
- Factorize (LU)
- Multiply (GEMM)
- Good for FEM codes, multi-scale methods
- Chemistry kinetics: up to 72 species, combined with CVODE
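A minimal sketch of the batched LU path via cublas<T>getrfBatched (the sizes, batch count and matrix contents are illustration values, and error checking is omitted):

    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <stdlib.h>

    int main(void) {
        const int n = 8, batch = 1000;      /* 1000 independent 8x8 LU factorizations (illustration) */
        const size_t slab = (size_t)batch * n * n;

        /* one slab holding all matrices; here 2*I so the factorization is trivial */
        float *hAdata = (float *)calloc(slab, sizeof(float));
        for (int i = 0; i < batch; ++i)
            for (int j = 0; j < n; ++j) hAdata[(size_t)i * n * n + j * n + j] = 2.0f;

        float *A, **dAarr;
        cudaMalloc((void **)&A, slab * sizeof(float));
        cudaMemcpy(A, hAdata, slab * sizeof(float), cudaMemcpyHostToDevice);

        /* cuBLAS batched routines take an array of per-matrix device pointers */
        float **hA = (float **)malloc(batch * sizeof(float *));
        for (int i = 0; i < batch; ++i) hA[i] = A + (size_t)i * n * n;
        cudaMalloc((void **)&dAarr, batch * sizeof(float *));
        cudaMemcpy(dAarr, hA, batch * sizeof(float *), cudaMemcpyHostToDevice);

        int *dPiv, *dInfo;
        cudaMalloc((void **)&dPiv, (size_t)batch * n * sizeof(int));
        cudaMalloc((void **)&dInfo, batch * sizeof(int));

        cublasHandle_t h;
        cublasCreate(&h);
        /* in-place LU with partial pivoting of every matrix in the batch, one call */
        cublasSgetrfBatched(h, n, dAarr, n, dPiv, dInfo, batch);

        cublasDestroy(h);
        cudaFree(A); cudaFree(dAarr); cudaFree(dPiv); cudaFree(dInfo);
        free(hA); free(hAdata);
        return 0;
    }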
cuSPARSE
CUSPARSE: (DENSE MATRIX) X (SPARSE VECTOR)
Speeds up natural language processing
cusparse<T>gemvi()
y = α ∗ op(A)∗x + β∗y
A = dense matrix
x = sparse vector
y = dense vector
[y1; y2; y3] = α ∗ [A11 A12 A13 A14 A15; A21 A22 A23 A24 A25; A31 A32 A33 A34 A35] ∗ x + β ∗ [y1; y2; y3], where the sparse vector x is stored as just its few nonzero values and their indices
The sparse vector could be the frequencies of words in a text sample
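A minimal sketch of cusparse<T>gemvi() for the picture above; the argument order follows my reading of the cuSPARSE documentation for CUDA 7.5, so treat it as a sketch and check the docs for your toolkit version (error checking omitted, dense matrix values are placeholders):

    #include <cuda_runtime.h>
    #include <cusparse.h>

    int main(void) {
        const int m = 3, n = 5, nnz = 2;          /* 3x5 dense A, sparse x with 2 stored entries */
        float hA[15];                             /* column-major, placeholder values */
        for (int i = 0; i < m * n; ++i) hA[i] = 1.0f;
        float hxVal[2] = {2.0f, 1.0f};            /* nonzero values of x (illustration) */
        int   hxInd[2] = {0, 1};                  /* their positions in x (0-based) */
        float hy[3]    = {0, 0, 0};

        float *dA, *dxVal, *dy; int *dxInd;
        cudaMalloc((void **)&dA, sizeof(hA));
        cudaMalloc((void **)&dxVal, sizeof(hxVal));
        cudaMalloc((void **)&dxInd, sizeof(hxInd));
        cudaMalloc((void **)&dy, sizeof(hy));
        cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
        cudaMemcpy(dxVal, hxVal, sizeof(hxVal), cudaMemcpyHostToDevice);
        cudaMemcpy(dxInd, hxInd, sizeof(hxInd), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, sizeof(hy), cudaMemcpyHostToDevice);

        cusparseHandle_t h;
        cusparseCreate(&h);

        int bufSize = 0;
        cusparseSgemvi_bufferSize(h, CUSPARSE_OPERATION_NON_TRANSPOSE, m, n, nnz, &bufSize);
        void *dBuf; cudaMalloc(&dBuf, bufSize);

        const float alpha = 1.0f, beta = 0.0f;
        /* y = alpha * A * x + beta * y, with x given as (values, indices) pairs */
        cusparseSgemvi(h, CUSPARSE_OPERATION_NON_TRANSPOSE, m, n, &alpha, dA, m,
                       nnz, dxVal, dxInd, &beta, dy, CUSPARSE_INDEX_BASE_ZERO, dBuf);

        cudaMemcpy(hy, dy, sizeof(hy), cudaMemcpyDeviceToHost);
        cusparseDestroy(h);
        cudaFree(dA); cudaFree(dxVal); cudaFree(dxInd); cudaFree(dy); cudaFree(dBuf);
        return 0;
    }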
GRAPH COLORING: REALISTIC EXAMPLES
Offshore (N=259,789; nnz=4,242,673)
G3_Circuit (N=1,585,478; nnz=7,660,826)
[Histograms of rows per level for both matrices: graph coloring produces only about 9-21 colors, each containing thousands up to hundreds of thousands of rows, while level-scheduling produces roughly 2,500-3,300 levels with only a handful of rows each, so coloring exposes far more parallelism.]
ILU0 + COLORING = BIG WIN
[Bar chart: speedup of the coloring approach over the level-scheduling approach, roughly 1x-6x depending on the matrix.]
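The coloring step itself is exposed in cuSPARSE as cusparse<T>csrcolor(); a toy sketch is below (the argument list is reproduced from memory of the cuSPARSE documentation of this period, so treat it as an assumption to verify; the resulting color and reordering arrays would then drive the parallel ILU0 factorization and triangular solves):

    #include <cuda_runtime.h>
    #include <cusparse.h>
    #include <stdio.h>

    int main(void) {
        /* tiny 4x4 CSR matrix, just to exercise the call (values are arbitrary) */
        const int m = 4, nnz = 8;
        int   hRow[5] = {0, 2, 4, 6, 8};
        int   hCol[8] = {0, 1, 0, 1, 2, 3, 2, 3};
        float hVal[8] = {4, 1, 1, 4, 4, 1, 1, 4};

        int *dRow, *dCol; float *dVal;
        cudaMalloc((void **)&dRow, sizeof(hRow));
        cudaMalloc((void **)&dCol, sizeof(hCol));
        cudaMalloc((void **)&dVal, sizeof(hVal));
        cudaMemcpy(dRow, hRow, sizeof(hRow), cudaMemcpyHostToDevice);
        cudaMemcpy(dCol, hCol, sizeof(hCol), cudaMemcpyHostToDevice);
        cudaMemcpy(dVal, hVal, sizeof(hVal), cudaMemcpyHostToDevice);

        int *dColoring, *dReorder;
        cudaMalloc((void **)&dColoring, m * sizeof(int));
        cudaMalloc((void **)&dReorder, m * sizeof(int));

        cusparseHandle_t h;       cusparseCreate(&h);
        cusparseMatDescr_t descr; cusparseCreateMatDescr(&descr);
        cusparseColorInfo_t info; cusparseCreateColorInfo(&info);

        int ncolors = 0;
        const float fraction = 1.0f;   /* try to color every row */
        cusparseScsrcolor(h, m, nnz, descr, dVal, dRow, dCol,
                          &fraction, &ncolors, dColoring, dReorder, info);
        printf("colors used: %d\n", ncolors);

        cusparseDestroyColorInfo(info);
        cusparseDestroyMatDescr(descr);
        cusparseDestroy(h);
        cudaFree(dRow); cudaFree(dCol); cudaFree(dVal);
        cudaFree(dColoring); cudaFree(dReorder);
        return 0;
    }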
cuSOLVER
New in CUDA 7.0, Extended in CUDA 7.5
cuSOLVER
Routines for solving sparse or dense linear systems and eigenproblems.
Divided into 3 APIs for 3 different use cases:
cuSolverDN – subset of LAPACK for small dense systems
  LU, Cholesky, QR, LDL^T, SVD
cuSolverSP – sparse direct solvers and eigensolvers
  sparse QR, sparse batched QR, least squares
cuSolverRF – fast refactorization solver for sparse matrices
  multiscale methods, chemistry
cuSolverDN API
Subset of LAPACK (direct solvers for dense matrices)
– Only the most popular methods (see the Cholesky sketch below):
  Cholesky / LU
  QR, SVD
  Bunch-Kaufman LDL^T
  Batched QR
Useful for:
  Computer vision
  High-order FEM
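A minimal dense-Cholesky sketch with cuSolverDN (workspace query followed by the in-place factorization; the 3x3 SPD matrix is an illustration value and error checking is omitted):

    #include <cuda_runtime.h>
    #include <cusolverDn.h>
    #include <stdio.h>

    int main(void) {
        /* small SPD matrix (column-major), n x n */
        const int n = 3;
        float hA[9] = {4, 1, 1,
                       1, 3, 0,
                       1, 0, 2};

        float *dA; int *dInfo;
        cudaMalloc((void **)&dA, sizeof(hA));
        cudaMalloc((void **)&dInfo, sizeof(int));
        cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);

        cusolverDnHandle_t h;
        cusolverDnCreate(&h);

        int lwork = 0;
        cusolverDnSpotrf_bufferSize(h, CUBLAS_FILL_MODE_LOWER, n, dA, n, &lwork);
        float *dWork; cudaMalloc((void **)&dWork, lwork * sizeof(float));

        /* in-place Cholesky: the lower triangle of dA is overwritten with L (A = L*L^T) */
        cusolverDnSpotrf(h, CUBLAS_FILL_MODE_LOWER, n, dA, n, dWork, lwork, dInfo);

        int info = 0;
        cudaMemcpy(&info, dInfo, sizeof(int), cudaMemcpyDeviceToHost);
        printf("potrf info = %d (0 means success)\n", info);

        cusolverDnDestroy(h);
        cudaFree(dA); cudaFree(dInfo); cudaFree(dWork);
        return 0;
    }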
cuSolverRF API
LU-based sparse direct solver; requires the factorization to already be computed (e.g. using KLU or SuperLU)
Batched version – many small matrices solved in parallel
Useful for:
  SPICE
  Combustion simulation
  Chemically reacting flow calculations
  Other types of ODEs, mechanics
  Multiscale methods, FEM
cuSolverSP API
Sparse direct solvers based on QR factorization (see the sketch below)
  Linear solver A*x = b (QR or Cholesky-based)
  Least-squares solver min ||A*x - b||
  Eigenvalue solver based on shift-inverse iteration: A*x = λ*x
  Find the number of eigenvalues in a box
Useful for:
  Well models in oil & gas
  Non-linear solvers via Newton's method
  Anywhere a sparse direct solver is required
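A minimal sketch of the sparse QR linear solver, cusolverSp<T>csrlsvqr(); the tiny CSR matrix and right-hand side are illustration values and error checking is omitted:

    #include <cuda_runtime.h>
    #include <cusolverSp.h>
    #include <cusparse.h>
    #include <stdio.h>

    int main(void) {
        /* tiny CSR system A*x = b, A is 3x3 (illustration values) */
        const int m = 3, nnz = 5;
        int   hRow[4] = {0, 2, 3, 5};
        int   hCol[5] = {0, 2, 1, 0, 2};
        float hVal[5] = {4, 1, 3, 1, 5};
        float hb[3]   = {1, 2, 3}, hx[3];

        int *dRow, *dCol; float *dVal, *db, *dx;
        cudaMalloc((void **)&dRow, sizeof(hRow)); cudaMemcpy(dRow, hRow, sizeof(hRow), cudaMemcpyHostToDevice);
        cudaMalloc((void **)&dCol, sizeof(hCol)); cudaMemcpy(dCol, hCol, sizeof(hCol), cudaMemcpyHostToDevice);
        cudaMalloc((void **)&dVal, sizeof(hVal)); cudaMemcpy(dVal, hVal, sizeof(hVal), cudaMemcpyHostToDevice);
        cudaMalloc((void **)&db, sizeof(hb));     cudaMemcpy(db, hb, sizeof(hb), cudaMemcpyHostToDevice);
        cudaMalloc((void **)&dx, sizeof(hx));

        cusolverSpHandle_t h;      cusolverSpCreate(&h);
        cusparseMatDescr_t descr;  cusparseCreateMatDescr(&descr);

        int singularity = -1;
        /* sparse QR solve on the device; reorder = 0 means no fill-reducing reordering */
        cusolverSpScsrlsvqr(h, m, nnz, descr, dVal, dRow, dCol, db,
                            1e-7f, 0, dx, &singularity);

        cudaMemcpy(hx, dx, sizeof(hx), cudaMemcpyDeviceToHost);
        printf("x = %f %f %f (singularity = %d)\n", hx[0], hx[1], hx[2], singularity);

        cusolverSpDestroy(h);
        cusparseDestroyMatDescr(descr);
        cudaFree(dRow); cudaFree(dCol); cudaFree(dVal); cudaFree(db); cudaFree(dx);
        return 0;
    }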
CUSOLVER DENSE GFLOPS VS MKL
[Bar chart: GFLOPS of cuSOLVER dense routines on the GPU vs. MKL on the CPU; y-axis 0-1800 GFLOPS.]
GPU: K40c, M=N=4096
CPU: Intel(R) Xeon(TM) E5-2697 v3 @ 3.60GHz, 14 cores
MKL v11.04
CUSOLVER SPEEDUP
[Bar charts of GPU speedup over CPU: cuSolver SP sparse QR (analysis, factorization and solve) and cuSolver DN Cholesky (analysis, factorization and solve for SPOTRF/DPOTRF/CPOTRF/ZPOTRF); the reported speedups range from about 1.2x to 11.3x.]
GPU: K40c, M=N=4096
CPU: Intel(R) Xeon(TM) E5-2697 v3 @ 3.60GHz, 14 cores
MKL v11.04 for dense Cholesky, NVIDIA csr-QR implementation for both CPU and GPU
AmgX
NESTED SOLVERS
[Diagram: nested solver stack – GMRES, preconditioned by AMG, which uses MC-DILU and Jacobi as smoothers.]
Example solvers:
  GMRES: local and global operations, no setup
  AMG: setup – graph coarsening and matrix-matrix products; solve – smoothing and matrix-vector products
  Jacobi: simple local (neighbor) smoothing operations, extract diagonal
  MC-DILU: setup – graph coloring and factorization; solve – local (sub)matrix-vector multiplication
(An AmgX configuration sketch for such a stack follows below.)
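In AmgX such a nested stack is usually described in a JSON configuration passed to the solver at creation time; the sketch below (FGMRES preconditioned by aggregation AMG with a multicolor DILU smoother) uses key names as I remember them from the AmgX reference, so verify them against the version you run:

    {
      "config_version": 2,
      "solver": {
        "solver": "FGMRES",
        "max_iters": 100,
        "tolerance": 1e-6,
        "preconditioner": {
          "solver": "AMG",
          "algorithm": "AGGREGATION",
          "cycle": "V",
          "smoother": { "solver": "MULTICOLOR_DILU" },
          "presweeps": 1,
          "postsweeps": 1
        }
      }
    }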
ALGEBRAIC MULTIGRID (AMG)
Solve A*x = f, starting from an initial guess x_0 (V-cycle, 2 levels):
1. Pre-smooth:       x_k = x_{k-1} + M^{-1}(f - A*x_{k-1})
2. Residual:         r_k = f - A*x_k
3. Restrict:         g_k = P^T * r_k
4. Coarse operator:  B = P^T * A * P
5. Coarse solve:     B * e_k = g_k
6. Prolongate:       d_k = P * e_k
7. Correct:          x_{k+1} = x_k + d_k
8. Post-smooth:      x_{k+2} = x_{k+1} + M^{-1}(f - A*x_{k+1})
End with the approximate solution x_{k+2}
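For reference, composing the eight steps gives the textbook two-level error-propagation operator (standard multigrid analysis, not stated on the slide); in LaTeX:

    % e = x_exact - x; steps 1 and 8 each multiply the error by (I - M^{-1}A),
    % and steps 2-7 apply the coarse-grid correction with B = P^T A P:
    e_{\mathrm{new}} = \left(I - M^{-1}A\right)
                       \left(I - P\,(P^{T}AP)^{-1}P^{T}A\right)
                       \left(I - M^{-1}A\right) e_{\mathrm{old}}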
Graph Coarsening - Aggregation
[Diagram: fine graph aggregated into a coarse graph; restriction (P^T) maps fine to coarse, prolongation (P) maps coarse back to fine.]
ANSYS® Fluent 16.0
ANSYS Fluent on NVIDIA GPUs
GPU Acceleration of Water Jacket Analysis
ANSYS FLUENT 16.0 PERFORMANCE ON PRESSURE-BASED COUPLED SOLVER
Internal flow model: unsteady RANS, fluid: water; time for 20 time steps (lower is better); AMG is ~70% of solver time
AMG solver time: 8 CPU cores (AMG) 4557 vs. 2 GPUs (AmgX) 775 – 5.9x
Fluent solution time: 8 CPU cores 6391 vs. 8 CPU cores + 2 GPUs 2520 – 2.5x
GPU Acceleration of Water Jacket Analysis
ANSYS FLUENT 16.0 PERFORMANCE ON PRESSURE-BASED COUPLED SOLVER
Internal flow model: unsteady RANS, fluid: water; time for 20 time steps (lower is better); AMG is ~70% of solver time
AMG solver time: 16 CPU cores (AMG) 2048 vs. 2 GPUs (AmgX) 710 – 2.9x
Fluent solution time: 16 CPU cores 3062 vs. 16 CPU cores + 2 GPUs 1647 – ~2x
ANSYS Fluent power consumption study: 2x speed-up, 38% energy savings
GPU VALUE PROPOSITION FOR FLUENT 16.0
Simulation productivity with HPC Workgroup – truck body model (14 million cells), higher is better
CPU only: 11 jobs/day (100% productivity, 100% solution cost)
CPU + GPU: 33 jobs/day (200% additional productivity for 40% additional cost – roughly 5x additional productivity per dollar spent on GPUs)
All results are based on turbulent flow over a truck case (14 million cells) run until convergence; steady-state, pressure-based coupled solver with double precision. Hardware: Intel Xeon E5-2698 v3 (64 CPU cores on 4 sockets; 2 nodes with InfiniBand interconnect), 4 Tesla K80 GPUs. License: ANSYS Fluent and ANSYS HPC Workgroup 64.
CPU-only solution cost is approximated and includes both hardware and software license costs. Benefit/productivity is based on the number of completed Fluent jobs/day.
AMGX KEY FEATURES
Multi-GPU support
  Workstations and clusters up to hundreds of nodes
  Sweet spot seems to be 8 GPUs/node
More solvers, smoothers and preconditioners
  Krylov methods, basic iterative solvers, AMG
Eigenvalue solvers
  Subspace iteration, restarted Arnoldi (ARPACK)
  Lanczos, Jacobi-Davidson, LOBPCG
POISSON EQUATION
Aggregation and classical AMG weak scaling, 8 million DOF per GPU
[Chart: setup time (s) vs. number of GPUs (1-512) for AmgX 1.0 classical (PMIS) and aggregation (AGG) paths; y-axis 0-12 s.]
Titan (Oak Ridge National Laboratory)
GPU: NVIDIA K20x (one per node)
CPU: 16-core AMD Opteron 6274 @ 2.2GHz
POISSON EQUATION
Aggregation and classical AMG weak scaling, 8 million DOF per GPU
[Chart: time per iteration vs. log(P) for 1-512 GPUs; the classical and aggregation paths each follow a roughly linear trend in log(P) (R² ≈ 0.92 and 0.94), with times between about 0.02 s and 0.16 s.]
Titan (Oak Ridge National Laboratory)
GPU: NVIDIA K20x (one per node)
CPU: 16-core AMD Opteron 6274 @ 2.2GHz
GPU Coding Made Easier & More Efficient
Hyper-Q: 32 MPI jobs per GPU – easy speed-up for legacy MPI apps
Dynamic Parallelism: GPU generates work – less effort, higher performance
[Charts: CP2K quantum chemistry speedup vs. a dual-K20 node for up to 20 GPUs, K20 with vs. without Hyper-Q (y-axis up to 20x); relative Quicksort performance vs. problem size in millions of elements, with vs. without Dynamic Parallelism (y-axis up to 4x).]
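A minimal Dynamic Parallelism sketch (a generic parent kernel launching child kernels on the device; this is an illustration, not the Quicksort code behind the chart, and it assumes a compute capability 3.5+ GPU built with relocatable device code):

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void child(int parent_tid) {
        /* work generated on the GPU by the parent kernel */
        printf("child of parent thread %d, child thread %d\n", parent_tid, threadIdx.x);
    }

    __global__ void parent(void) {
        /* each parent thread launches a child grid without returning to the host */
        child<<<1, 4>>>(threadIdx.x);
    }

    int main(void) {
        parent<<<1, 2>>>();
        cudaDeviceSynchronize();   /* waits for the parent grid and all nested child grids */
        return 0;
    }
    /* build (assumed flags): nvcc -arch=sm_35 -rdc=true dynpar.cu -lcudadevrt */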
Third Party Libraries
THIRD PARTY LIBRARIES
Trilinos (Sandia National Labs)
  Discretizations, linear system & eigenvalue solvers, ...
  Kokkos (parallel primitives)
  Supports NVIDIA GPUs
  http://trilinos.sandia.gov/
PETSc (Argonne National Labs)
  Iterative methods, nonlinear solvers, ...
  Supports NVIDIA GPUs (iterative methods + some preconditioners)
  http://www.mcs.anl.gov/petsc/index.html
THIRD PARTY LIBRARIES
ArrayFire
  Array abstraction for numerical operations
  Factorizations, eigenvalue solvers, ...
  http://arrayfire.com/
SuperLU (Sherry Li)
  GPU acceleration/offload of the supernodal LU
CHOLMOD (Tim Davis)
  Cholesky factorization (s.p.d. A = L*L^T)
OPENACC: OPEN, SIMPLE, PORTABLE
• Open standard
• Easy, compiler-driven approach
• Portable to GPUs and Xeon Phi

#include <stdlib.h>

int main(void) {
    /* ... serial code ... */
    int n = 1 << 20;
    float *x = malloc(n * sizeof *x), *y = malloc(n * sizeof *y);
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    #pragma acc kernels   /* compiler hint: offload the compute-intensive block */
    {
        for (int i = 0; i < n; ++i)
            y[i] = 2.0f * x[i] + y[i];
    }
    /* ... */
    free(x); free(y);
    return 0;
}

CAM-SE climate code: 6x faster on GPU
  Top kernel: 50% of runtime
  Only 5% of code modified
  One source to maintain
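For the OpenACC example above, a typical build with the PGI compiler of that era would be something like pgcc -acc -Minfo=accel main.c; the -Minfo=accel report lists which loops the compiler actually offloaded.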
THANK YOU!
Questions, Feedback, Discussion
Web and social graphs
Giga-scale memory requirements
- 10 GB – 100 GB in binary format
Power-law distribution of dependencies
- ~10^1 – 10^2 non-zero elements per row on average
- A few rows are almost dense
Hard to partition effectively
Properties of Wikipedia Article Graph
63.15% of the values come from 20% of the variables
The largest row is 9.99% dense
50% of the rows have fewer than 4 elements
N = 3,721,339; NNZ = 66,454,329; average nnz per row = 17.8; min nnz = 0; max nnz = 371,662; std. dev. = 343
[Plot: nnz per row vs. row ID, following a power law, with a dense-matrix reference curve.]