GPU Sparse LU Factorization and its Application in Circuit

Department of Electronic Engineering, Tsinghua University
GPU Sparse LU Factorization and Its
Application in Circuit Simulation
Nano-scale Integrated Circuit and System Lab.,
EE Department, Tsinghua University
Ling Ren
Abstract
• First work on GPU sparse LU factorization
  – Algorithm description: elimination graph (EGraph)
  – Algorithm analysis: parallelism in the left-looking algorithm
  – Algorithm implementation: timing order on GPU
• Supplement to OpenCL BLAS
  – Current cl_AMDBLAS has triangular solve but no LU
  – Objective of LU: 𝐴𝑥 = 𝑏 → 𝐿𝑈𝑥 = 𝑏 → 𝐿𝑦 = 𝑏, 𝑈𝑥 = 𝑦
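The two triangular solves that follow an LU factorization can be sketched as below. This is a minimal dense, CPU-only illustration in plain Python, not the GPU code from the talk; the function names and the 2x2 example matrix are assumptions made for this sketch.

```python
def forward_substitution(L, b):
    """Solve L y = b for a lower-triangular L with nonzero diagonal."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - s) / L[i][i]
    return y


def backward_substitution(U, y):
    """Solve U x = y for an upper-triangular U with nonzero diagonal."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x


def solve_with_lu(L, U, b):
    """Ax = b with A = LU: first Ly = b, then Ux = y."""
    return backward_substitution(U, forward_substitution(L, b))


# Hypothetical example: A = [[4, 3], [6, 3]] = L * U
L = [[1.0, 0.0], [1.5, 1.0]]
U = [[4.0, 3.0], [0.0, -1.5]]
x = solve_with_lu(L, U, [10.0, 12.0])
```

Once L and U are known, each new right-hand side b costs only the two triangular solves, which is why the factorization itself is the step worth accelerating.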
Outline




Background
Sparse LU factorization
Dense LU factorization
Summary
Background
• SPICE: the most popular circuit simulator
  – Simulating VLSI (~1 billion transistors) takes several days
  – Bottleneck: sparse LU factorization
• Other application areas: fluid dynamics, structural analysis, economics …
Sparse LU factorization - related works

[SuperLU1999]
• Sequential, multi-threaded, and distributed versions
• Incorporates supernodes; efficient for dense blocks

[Pardiso2002]
• Sequential, multi-threaded, distributed, and GPU [Christen2007] versions
• Adopts supernodes

But supernodes rarely form in circuit matrices.

[KLU2010]
• Optimized for circuit matrices
• Sequential only; uses the G/P left-looking algorithm [G/P 1988]
• Adopts BTF (block triangular form), without supernodes
Sparse LU factorization – left-looking



• Sequentially process each column
• When processing column k, use all the columns on the left (1, 2, ..., k-1) to update column k
• Update = vector multiply-and-add (MAD)
[Figure: vector MAD update on operands a, b, and c, with read and write accesses marked]
• Memory accesses (read + write) outnumber arithmetic operations
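The left-looking scheme described above can be sketched as follows. This is a simplified illustration on a dense list-of-lists matrix, without pivoting and without sparse storage, so it is not the implementation presented in the talk; `left_looking_lu` and the 3x3 example are assumptions for this sketch.

```python
def left_looking_lu(A):
    """Left-looking LU without pivoting; returns (L, U) with unit-diagonal L."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for k in range(n):                       # process columns one by one
        x = [A[i][k] for i in range(n)]      # working copy of column k
        for j in range(k):                   # columns on the left of k
            if x[j] != 0.0:                  # only columns with U(j, k) != 0 contribute
                for i in range(j + 1, n):
                    x[i] -= L[i][j] * x[j]   # vector multiply-and-add (MAD)
        for i in range(k + 1):
            U[i][k] = x[i]                   # upper part becomes column k of U
        for i in range(k + 1, n):
            L[i][k] = x[i] / x[k]            # lower part, scaled by the pivot
    return L, U


# Hypothetical 3x3 example
A = [[4.0, 3.0, 0.0],
     [6.0, 3.0, 1.0],
     [0.0, 2.0, 5.0]]
L, U = left_looking_lu(A)
```

Each inner update reads one column of L and reads and writes the working column while performing only one multiply and one subtract per entry, which is the memory-heavy pattern the slide points out.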
Algorithm description – EGraph


• Every column is updated with several columns on its left
• The nonzero structure of U determines the dependency
[Figure: (a) upper triangular matrix U with its nonzeros marked; (b) the corresponding EGraph]
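As a small illustration of how the nonzero structure of U yields the dependency graph, the sketch below derives, for each column k, the list of columns j < k with U(j, k) != 0. The input layout (one list of row indices per column) and the 4x4 pattern are assumptions chosen for this example.

```python
def build_egraph(u_rows_per_col):
    """u_rows_per_col[k] holds the row indices of the nonzeros in column k of U.

    Returns deps, where deps[k] lists the columns that column k depends on,
    i.e. every j < k with U(j, k) != 0 (an edge j -> k in the EGraph).
    """
    return [sorted(j for j in rows if j < k)
            for k, rows in enumerate(u_rows_per_col)]


# Hypothetical pattern: column 2 has a nonzero in row 0,
# column 3 has nonzeros in rows 1 and 2 (plus the diagonals).
u_pattern = [[0], [1], [0, 2], [1, 2, 3]]
deps = build_egraph(u_pattern)
```

Here columns 0 and 1 have no incoming edges, column 2 depends on column 0, and column 3 depends on columns 1 and 2.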
Nano-scale Integrated Circuit and System Lab.
8
Department of Electronic Engineering, Tsinghua University
Algorithm analysis – two kinds of parallelism
• Divide columns into levels: columns in the same level are independent of each other
  – Cluster mode: many columns factorized in parallel
  – Pipeline mode: overlap columns from different levels
• Pipeline parallelism, along with timing order
[Figure: overlapped factorization in pipeline mode, showing Thread 1 and Thread 2 working on Columns 1 to 4]
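The division of columns into levels can be sketched as a longest-path computation over the dependency lists. The `deps` input format and the example values are assumptions for this illustration; because every edge runs from a lower to a higher column index, visiting columns in index order is already a topological order.

```python
def compute_levels(deps):
    """deps[k] lists the columns that column k depends on (all smaller than k).

    The level of a column is the length of the longest dependency path ending at it.
    """
    level = [0] * len(deps)
    for k, preds in enumerate(deps):
        if preds:
            level[k] = 1 + max(level[j] for j in preds)
    return level


def group_by_level(level):
    """Collect the columns of each level; columns in one group are independent."""
    groups = [[] for _ in range(max(level) + 1)] if level else []
    for k, lv in enumerate(level):
        groups[lv].append(k)
    return groups


# Hypothetical dependencies for four columns
deps = [[], [], [0], [1, 2]]
level = compute_levels(deps)
groups = group_by_level(level)
```

A wide group is a candidate for cluster mode, while a run of levels holding a single column each is where overlapping columns in pipeline mode would matter.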
Sparse LU factorization - workflow
Sparse LU factorization - preprocessing
• Preprocessing: performed only once, on the CPU
  – MC64 to ensure numerical stability [MC64]
  – Approximate Minimum Degree ordering to reduce fill-ins [AMD]
  – Pre-factorization (numeric factorization with partial pivoting) to calculate the symbolic structure of L and U
  – Sorting the nonzeros of L and U (introduced later)
Sparse LU factorization – on GPU
• GPU inputs
  – Locations and values of nonzeros in A
  – Locations of nonzeros in L and U
  – The Escheduler
• GPU outputs
  – Values of nonzeros in L and U
• CSC (Compressed Sparse Column) format for sparse matrices A, L and U
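For readers unfamiliar with CSC, the sketch below converts a small dense matrix into the three CSC arrays. The array names `col_ptr`, `row_idx`, and `values` are conventional choices assumed here, not names taken from the talk.

```python
def dense_to_csc(A):
    """Convert a dense list-of-lists matrix to CSC arrays.

    col_ptr[j] .. col_ptr[j + 1] delimits the entries of column j
    inside row_idx (row positions) and values (numeric values).
    """
    n_rows, n_cols = len(A), len(A[0])
    col_ptr, row_idx, values = [0], [], []
    for j in range(n_cols):
        for i in range(n_rows):
            if A[i][j] != 0:
                row_idx.append(i)
                values.append(A[i][j])
        col_ptr.append(len(row_idx))
    return col_ptr, row_idx, values


A = [[4.0, 0.0, 0.0],
     [6.0, 3.0, 0.0],
     [0.0, 2.0, 5.0]]
col_ptr, row_idx, values = dense_to_csc(A)
```

In this layout the "locations of nonzeros" are `col_ptr` and `row_idx`, and the numeric factorization only has to fill in `values` for L and U.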
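Sparse LU factorization - avoid deadlock
• In traditional GPU programs, some wavefronts are inactive at the beginning (because of limited resources, etc.). They wait for active wavefronts to finish and only then become active.
• In sparse LU, however, we must ensure that all wavefronts are active from the beginning: a wavefront may have to wait for a column handled by another wavefront, and if that wavefront is never scheduled, the program deadlocks.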
Sparse LU factorization - data formats
• Data formats for intermediate results: dense arrays vs. CSC
  – CSC (Compressed Sparse Column)
    • Can be put in local memory
    • Indexed accesses are inconvenient (binary search)
    • Using too much local memory reduces the number of active workgroups, which leads to severe performance loss
  – Dense arrays outperform the CSC format by 2.5x
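The difference in indexed access between the two formats can be illustrated in a few lines. This is a CPU-side sketch with assumed helper names (`csc_lookup`, `scatter_column`); it only shows why a CSC lookup needs a search while a dense work array does not.

```python
from bisect import bisect_left


def csc_lookup(col_ptr, row_idx, values, i, j):
    """Return A(i, j) from CSC storage via binary search.

    Row indices inside each column must be sorted for the search to be valid.
    """
    lo, hi = col_ptr[j], col_ptr[j + 1]
    pos = bisect_left(row_idx, i, lo, hi)
    if pos < hi and row_idx[pos] == i:
        return values[pos]
    return 0.0


def scatter_column(col_ptr, row_idx, values, j, n_rows):
    """Expand column j into a dense work array so that x[i] is a direct index."""
    x = [0.0] * n_rows
    for p in range(col_ptr[j], col_ptr[j + 1]):
        x[row_idx[p]] = values[p]
    return x


# Hypothetical 3x3 matrix in CSC form
col_ptr = [0, 2, 4, 5]
row_idx = [0, 1, 1, 2, 2]
values = [4.0, 6.0, 3.0, 2.0, 5.0]

a_21 = csc_lookup(col_ptr, row_idx, values, 2, 1)   # found by binary search
a_01 = csc_lookup(col_ptr, row_idx, values, 0, 1)   # structural zero
dense_col1 = scatter_column(col_ptr, row_idx, values, 1, 3)
```

The dense array costs memory proportional to the matrix dimension per column in flight, which is the trade-off against the compact CSC form.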
Sparse LU factorization - data locality
• Higher global memory bandwidth if consecutive work-items access consecutive addresses
• Improve data locality
  – Nonzeros of L and U are out of order after preprocessing; sort them according to row indices
• 1.7x speedup, negligible overhead
  – Performed only once, incorporated into preprocessing
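The sorting step can be sketched as a per-column sort of the CSC entries by row index. The function name and the sample arrays are assumptions for this illustration; the talk does not specify how the sort is implemented.

```python
def sort_csc_columns(col_ptr, row_idx, values):
    """Sort the entries of every CSC column by row index.

    col_ptr is unchanged; new row and value arrays are returned.
    """
    new_rows, new_vals = [], []
    for j in range(len(col_ptr) - 1):
        lo, hi = col_ptr[j], col_ptr[j + 1]
        pairs = sorted(zip(row_idx[lo:hi], values[lo:hi]), key=lambda p: p[0])
        new_rows.extend(r for r, _ in pairs)
        new_vals.extend(v for _, v in pairs)
    return new_rows, new_vals


# Hypothetical out-of-order columns
col_ptr = [0, 3, 5]
row_idx = [2, 0, 1, 3, 1]
values = [7.0, 4.0, 6.0, 9.0, 3.0]
sorted_rows, sorted_vals = sort_csc_columns(col_ptr, row_idx, values)
```

With sorted row indices, neighbouring entries of a column map to increasing positions of the dense work array, which is what lets consecutive work-items touch nearby addresses.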
Experimental setups
• CPU
  – 2 Xeon E5405 CPUs (8 cores in total)
  – 2x6 MB L2 cache, 16 GB RAM
• GPU
  – AMD Radeon 5870 GPU
• Testing matrices
  – University of Florida Sparse Matrix Collection [Davis]
    http://www.cise.ufl.edu/research/sparse/matrices/
Sparse LU factorization - Experimental results

GPU speedup is positively related to the number of floating-point operations (flops)
Sparse LU factorization - Experimental results
• Matrices divided into 4 groups
  – First three groups: according to Mflops
    • GPU speedup positively related to Mflops
  – 4th group: denormal floating-point numbers
    • Used to represent extremely small numbers
    • Very slow on CPU, full-speed support on GPU
  – An advantage of GPU in sparse LU and scientific computing
    • Very high speedups for this group
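To make "denormal" concrete, the short sketch below produces one in IEEE 754 double precision using only the Python standard library. The variable names are illustrative, and the snippet demonstrates the number format only; it does not measure the CPU slowdown mentioned above.

```python
import sys

# Smallest positive *normalized* double.
smallest_normal = sys.float_info.min

# Dividing further leaves the normal range: the result is stored as a
# denormal (subnormal) value with reduced precision instead of flushing to 0.
denormal = smallest_normal / 4
is_denormal = 0.0 < denormal < smallest_normal

# The smallest positive denormal double; halving it finally underflows to 0.
smallest_denormal = 5e-324
underflows_to_zero = (smallest_denormal / 2) == 0.0
```

Matrices whose factorization produces many such values fall into the 4th group.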
Sparse LU factorization - Experimental results
• Average speedup of each group

Group   GPU bandwidth (GB/s)   Over 1 CPU   Over 4 CPUs   Over 8 CPUs   Over KLU
1        0.81                   0.41         0.24          0.22          0.58
2       10.97                   2.43         0.85          0.55          3.64
3       52.59                  10.53         3.65          2.58         15.58
4       36.82                  26.86         8.01          4.48         25.61
All     15.91                   4.51         1.64          1.13          6.25
Scalability – BBD
• Problem
  – How to use multiple GPUs?
• Circuit-partition-based simulation algorithm
  – Bordered-block-diagonal (BBD) matrix
  – Diagonal blocks are factorized independently
  – But the block A_n becomes dense, so we need dense LU factorization
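One way to see why A_n becomes dense is to write the BBD matrix out in block form. The border block names B_i and C_i below are assumed notation for this sketch; only A_n appears in the slide.

```latex
A =
\begin{bmatrix}
A_1 &        &         & B_1     \\
    & \ddots &         & \vdots  \\
    &        & A_{n-1} & B_{n-1} \\
C_1 & \cdots & C_{n-1} & A_n
\end{bmatrix},
\qquad
A_n \leftarrow A_n - \sum_{i=1}^{n-1} C_i A_i^{-1} B_i
```

Each diagonal block A_i can be factorized on its own, but every one of them contributes a product term to the corner block, and the accumulated sum generally fills A_n in.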
Dense LU Factorization – blocked algorithm
• Three core operations
  – Dense LU factorization
  – Triangular matrix inversion
  – Matrix multiplication (GEMM)
• Suitable for GPU
  – GEMM is the most frequent operation
  – GEMM is very efficient on GPU
    • 920 Gflop/s (single), 290 Gflop/s (double)
[Figure: blocked LU in progress, with regions labeled "finished", "LU + inverse", and "GEMM"]
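How the three core operations combine can be sketched in plain Python. This is a CPU-only illustration without pivoting, with assumed function names and an assumed block size parameter `bs`; it is not the OpenCL implementation from the talk.

```python
def matmul(A, B):
    """Plain GEMM on list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def lu_unblocked(A):
    """Dense LU without pivoting; returns (L, U) with unit-diagonal L."""
    n = len(A)
    U = [row[:] for row in A]
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            U[i][k] = 0.0
            for j in range(k + 1, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U


def invert_lower(L):
    """Invert a lower-triangular matrix column by column."""
    n = len(L)
    X = [[0.0] * n for _ in range(n)]
    for c in range(n):
        for i in range(c, n):
            rhs = 1.0 if i == c else 0.0
            acc = sum(L[i][j] * X[j][c] for j in range(c, i))
            X[i][c] = (rhs - acc) / L[i][i]
    return X


def invert_upper(U):
    return transpose(invert_lower(transpose(U)))


def blocked_lu(A, bs):
    """Blocked LU: small LU on the diagonal block, triangular inverses, then GEMMs."""
    n = len(A)
    W = [row[:] for row in A]                      # working copy (trailing matrix)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for k in range(0, n, bs):
        e = min(k + bs, n)
        L11, U11 = lu_unblocked([row[k:e] for row in W[k:e]])
        U12 = matmul(invert_lower(L11), [row[e:] for row in W[k:e]])
        L21 = matmul([row[k:e] for row in W[e:]], invert_upper(U11))
        update = matmul(L21, U12)                  # the dominant GEMM
        for i in range(k, e):
            L[i][k:e] = L11[i - k]
            U[i][k:e] = U11[i - k]
            U[i][e:] = U12[i - k]
        for i in range(e, n):
            L[i][k:e] = L21[i - e]
            for j in range(e, n):
                W[i][j] -= update[i - e][j - e]
    return L, U


# Hypothetical diagonally dominant matrix, so skipping pivoting is safe here
A = [[4.0, 1.0, 2.0, 0.0],
     [1.0, 5.0, 0.0, 1.0],
     [2.0, 0.0, 6.0, 1.0],
     [0.0, 1.0, 1.0, 7.0]]
L, U = blocked_lu(A, 2)
```

As the matrix grows relative to the block size, nearly all of the arithmetic lands in the `matmul` calls, which is the property that makes the blocked form attractive on a GPU.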
Dense LU Factorization – performance
• 443 Gflop/s (single), 163 Gflop/s (double)
Dense LU Factorization – related works
• Comparison to previous studies

Performance of dense LU factorization (Gflop/s)

Work            Hardware                       Single   Double
[Galoppo2005]   GTX 7800                        10      --
[Volkov2008]    GTX 8800                       179      --
[Tomov2010]     8 Xeon Harpertown              100      50
[Tomov2010]     GTX 280                        300      --
[Tomov2010]     8 Xeon Harpertown + GTX 280    388      99
Ours            Radeon 5870                    443      163
Dense LU Factorization – further improvement
• CPU BLAS for Gaussian elimination: 100 Gflop/s
• GEMM can be further improved
• Scalability to multiple GPUs
  – Blocked dense LU: independent GEMMs on multiple GPUs
  – Diagonal blocks in BBD on multiple GPUs
  – Linear performance improvement expected
Summary
• First work on GPU sparse LU factorization
  – Exploits the parallelism of the left-looking algorithm
• Blocked dense LU factorization
  – 443 Gflop/s (single), 163 Gflop/s (double)
• Supplement to OpenCL BLAS
• Accelerates SPICE simulators
Reference





[SPICE] L. W. Nagel, “SPICE 2: A computer program to simulate semiconductor circuits,” Ph.D. dissertation, University of California, Berkeley, 1975.
[SuperLU1999] J. W. Demmel, S. C. Eisenstat, J. R. Gilbert, X. S. Li, and J. W. H. Liu, “A supernodal approach to sparse partial pivoting,” SIAM J. Matrix Analysis and Applications, vol. 20, no. 3, pp. 720–755, 1999.
[Pardiso2002] O. Schenk and K. Gärtner, “Solving unsymmetric sparse systems of linear equations with PARDISO,” Computational Science - ICCS 2002, vol. 2330, pp. 355–363, 2002.
[G/P 1988] J. R. Gilbert and T. Peierls, “Sparse partial pivoting in time proportional to arithmetic operations,” SIAM J. Sci. Statist. Comput., vol. 9, pp. 862–874, 1988.
[KLU2010] T. A. Davis and E. Palamadai Natarajan, “Algorithm 907: KLU, a direct sparse solver for circuit simulation problems,” ACM Trans. Math. Softw., vol. 37, pp. 36:1–36:17, September 2010.
[Christen2007] M. Christen, O. Schenk, and H. Burkhart, “General-purpose sparse matrix building blocks using the NVIDIA CUDA technology platform,” 2007.
[Davis] T. A. Davis and Y. Hu, “The University of Florida sparse matrix collection,” to appear in ACM Transactions on Mathematical Software.
[Galoppo2005] N. Galoppo, N. K. Govindaraju, M. Henson, and D. Manocha, “LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware,” SC Conference, vol. 0, p. 3, 2005.
[Volkov2008] V. Volkov and J. Demmel, “LU, QR and Cholesky factorizations using vector capabilities of GPUs,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2008-49, May 2008.
[Tomov2010] S. Tomov, J. Dongarra, and M. Baboulin, “Towards dense linear algebra for hybrid GPU accelerated manycore systems,” Parallel Comput., vol. 36, pp. 232–240, June 2010.
[MC64] I. S. Duff and J. Koster, “The design and use of algorithms for permuting large entries to the diagonal of sparse matrices,” SIAM J. Matrix Anal. and Applics, no. 4, pp. 889–901, 1997.
[AMD] P. R. Amestoy, T. A. Davis, and I. S. Duff, “Algorithm 837: AMD, an approximate minimum degree ordering algorithm,” ACM Trans. Math. Softw., vol. 30, pp. 381–388, September 2004.
Thank you!
Nano-scale Integrated Circuit and System Lab.,
EE Department, Tsinghua University
Sparse LU factorization – Terminology
• Elimination graph (EGraph) definition
  – An edge from j to k iff U(j, k) != 0, for j < k
  – In the following context, node = column
• Level definition
  – The level of a node is the length of the longest path from any source node to that node
  – Source nodes have no incoming edges
Sparse LU factorization - Experimental results
Dense LU factorization – Basic algorithm
• Blocked LU factorization
  – Factorize 𝐴11 to get 𝐿11 and 𝑈11
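Written out, one step of the blocked factorization looks as follows. This is the standard block LU identity, given here as a sketch with block indices i, j ≥ 2; the arrow notation for the in-place update is an assumption of this note.

```latex
\begin{aligned}
A_{11} &= L_{11} U_{11}
  && \text{(dense LU of the diagonal block)} \\
L_{i1} &= A_{i1} U_{11}^{-1}, \qquad U_{1j} = L_{11}^{-1} A_{1j}
  && \text{(triangular inverses applied by GEMM)} \\
A_{ij} &\leftarrow A_{ij} - L_{i1} U_{1j}
  && \text{(GEMM update of the remaining blocks)}
\end{aligned}
```

Applying the same three steps to the updated trailing blocks, starting from the new diagonal block A_22, yields the next block column of L and block row of U.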
Dense LU factorization – Basic algorithm
• Repeat the process to obtain 𝐿𝑖2, 𝑈2𝑗, and so on