Sparse LU Factorization for Parallel Circuit Simulation on GPU

Ling Ren, Xiaoming Chen, Yu Wang, Chenxi Zhang, Huazhong Yang
Department of Electronic Engineering, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University
DAC 2012
Outline
• Introduction
• Preliminaries
• Sparse LU factorization on GPU
• Experimental results
• Conclusions
Introduction
• Flowchart of a SPICE simulator
Introduction (cont.)
• SPICE can take several days or even weeks to simulate modern designs.
– The sparse matrix solver based on LU factorization is invoked iteratively and is hence time-consuming.
• However, the sparse solver is difficult to parallelize because of the heavy data dependency during numeric LU factorization and the irregular structure of circuit matrices.
Introduction (cont.)
• Previous works focus on dense matrices.
– Factorizing a sparse matrix with a highly parallelized dense solver is still much slower than a sequential sparse solver.
• [8]~[13] compute dense blocks on the GPU, but the rest is done on the CPU.
– Still based on the dense idea.
• [15]~[17] apply the G/P left-looking algorithm on FPGAs.
– Scalability is limited by FPGA on-chip resources.
• [18] implements it on a multi-core CPU.
– Scalability is limited by the number of cores.
Introduction (cont.)
• The multi-/many-core era has come.
• Graphics Processing Units (GPUs) can now be used for general-purpose computing.
– They have become popular for parallel processing because of their cost-effectiveness.
• State-of-the-art GPUs provide a possible solution to the limited scalability.
• For example, the latest NVIDIA GeForce GTX 690 has a large number of cores and a large memory.
GeForce GTX 690 official spec
Contributions
• Exposing more parallelism for many-core architectures.
• Ensuring timing order on the GPU.
• Optimizing the memory access pattern.
Preliminaries
• Sparse matrix
• LU factorization (decomposition)
• GPU architecture and CUDA.
Sparse matrix
• The number of nonzero elements in an 𝑁 × 𝑁 sparse matrix is typically 𝑂(𝑁).
• An example: the 8 × 8 adjacency matrix of a graph, shown below.
0 0 0 0 1 0 0 0
0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0
0 0 0 0 0 1 0 0
1 1 0 0 0 0 1 0
0 0 1 1 0 0 1 0
0 0 0 0 1 1 0 1
0 0 0 0 0 0 1 0
LU factorization
• Also called LU decomposition.
• 𝐴 = 𝐿𝑈, where A is a square matrix, L is a lower-triangular matrix, and U is an upper-triangular matrix.
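• For example, a worked 2 × 2 case:
      A = [ 4   3 ]   L = [ 1     0 ]   U = [ 4     3   ]
          [ 6   3 ]       [ 1.5   1 ]       [ 0   -1.5  ]
  Here L·U reproduces A, with L having a unit diagonal.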
CUDA programming
• Compute Unified Device Architecture
• The CPU code does the sequential part.
• The highly parallel part is usually implemented in GPU code, called a kernel.
• Calling a GPU function from CPU code is called a kernel launch.
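A minimal illustrative example (not the paper's code; the kernel, array names and sizes are made up) of a kernel and its launch from host code:

#include <cuda_runtime.h>

// GPU code ("kernel"): each thread scales one element of the vector.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        x[i] *= a;
}

// CPU code: sequential setup, then a "kernel launch" to run the GPU part.
void scale_on_gpu(float *h_x, int n) {
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_x, 2.0f, n);   // kernel launch
    cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
}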
Execution of GPU thread
• Threads are grouped into thread blocks.
• Each thread block is assigned to a streaming multiprocessor (SM), which contains multiple streaming processors (SPs), to be executed.
• The actual execution of threads on SPs is done in groups of 32 threads, called warps.
• SPs execute one warp at a time.
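As a reference (illustrative, not from the slides), a thread can determine which warp it belongs to from its index within the block:

__device__ void warp_position(int *warp_in_block, int *lane) {
    // Warps are formed from 32 consecutive threads of a (1-D) block:
    // threads 0-31 are warp 0, threads 32-63 are warp 1, and so on.
    *warp_in_block = threadIdx.x >> 5;   // threadIdx.x / 32
    *lane          = threadIdx.x & 31;   // threadIdx.x % 32, position in the warp
}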
GPU architecture
• nVidia GeForce 9800 GTX
Sparse LU factorization on GPU
• Overall flow
• Preprocessing
• Exposing more parallelism
• Ensuring timing order
• Optimizing memory access pattern
Overall flow
Preprocessing
• The HSL_MC64 algorithm improves numerical stability.
– It finds a permutation matrix.
• The AMD (Approximate Minimum Degree) algorithm reduces fill-ins.
– It finds a permutation matrix.
• A G/P-algorithm-based pre-factorization (a complete numeric factorization with partial pivoting) computes the symbolic structure of the LU factors.
– It also yields the total flop count.
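• In effect (a hedged sketch; the exact composition may differ in the paper), MC64 yields a row permutation Pr and AMD a fill-reducing permutation P, so the numeric factorization is carried out on a permuted matrix of roughly the form A' = P (Pr A) Pᵀ, and it is A' that is factorized as A' = LU.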
Exposing more parallelism
• Sequential G/P left-looking algorithm
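The slide shows the algorithm as a listing; as a rough reference, a simplified sequential sketch is given below (dense storage, no partial pivoting, no symbolic analysis; the real G/P algorithm visits only the nonzeros of each column):

#include <vector>

// Left-looking LU: column j of L and U is computed by gathering A(:,j) into a
// working vector x and updating it with every already-finished column k < j.
// L and U are assumed zero-initialized; L gets a unit diagonal.
void left_looking_lu(int n, const double *A, double *L, double *U) {
    std::vector<double> x(n);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i)                 // x = A(:, j)
            x[i] = A[i * n + j];
        for (int k = 0; k < j; ++k) {               // update with columns k < j
            double ukj = x[k];                      // this is U(k, j)
            if (ukj != 0.0)
                for (int i = k + 1; i < n; ++i)
                    x[i] -= ukj * L[i * n + k];     // x(k+1:n) -= U(k,j)*L(k+1:n,k)
        }
        for (int i = 0; i <= j; ++i)
            U[i * n + j] = x[i];                    // U(1:j, j) = x(1:j)
        for (int i = j; i < n; ++i)
            L[i * n + j] = x[i] / x[j];             // L(j:n, j) = x(j:n) / U(j,j)
    }
}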
Exposing more parallelism
• Dependency graph and scheduler
Exposing more parallelism (cont.)
• The threads that process the same column are treated as a virtual group.
• In cluster mode:
– Columns are very sparse, so while ensuring enough threads in total, we make virtual groups as small as possible to minimize idle threads.
• In pipeline mode:
– Columns usually contain enough nonzeros for a warp or several warps, so the size of virtual groups matters little for reducing idle threads. We use one warp as one virtual group.
Ensuring timing order
Ensuring timing order (cont.)
• Suppose columns 8, 9 and 10 are being processed and all other columns are finished. Column 9 can first be updated with columns 4, 6 and 7, corresponding to the solid green arrows. But column 9 cannot yet be updated with column 8; it must wait for column 8 to finish. The situation is similar for column 10.
Ensuring timing order (cont.)
• The key is to avoid deadlock.
– Not all warps are active at the beginning.
– If warps are activated in the wrong order in pipeline mode, deadlock will occur.
• There is no inter-warp context switching.
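One way to realize this ordering (an illustrative sketch, not necessarily the paper's exact mechanism) is a per-column "finished" flag in global memory that consumer warps spin on; because resident warps are never preempted, a warp may only spin on columns whose producer warps are guaranteed to be active, otherwise the loop never exits:

// Illustrative sketch; 'finished' is a zero-initialized flag array in global
// memory, one entry per column (the name and layout are assumptions).
__device__ void wait_for_column(volatile const int *finished, int k) {
    while (finished[k] == 0)       // busy-wait until column k's producer is done;
        ;                          // deadlocks if that warp was never activated
    __threadfence();               // make column k's results visible before use
}

__device__ void mark_column_done(volatile int *finished, int j) {
    __threadfence();               // flush column j's results to global memory
    if ((threadIdx.x & 31) == 0)   // one lane of the warp publishes the flag
        finished[j] = 1;
}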
Optimizing memory access pattern
• The main difference between CPU and GPU parallel programming lies in memory access.
• There are two alternative data formats for the intermediate vectors (x in Algorithm 2).
– CSC (Compressed Sparse Column) sparse vectors.
• They save space and can be placed in shared memory.
– Dense arrays.
• They have to reside in global memory.
• There are two reasons to choose dense arrays.
– The CSC format is inconvenient for indexed accesses.
– Using too much shared memory would reduce the number of resident warps per SM and hence degrade performance.
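The first reason can be seen in a small sketch (illustrative; names are assumptions): with a dense working array a thread reads x[row] directly, while with a sparse vector stored as value/row-index pairs it must first search the index list:

// Dense working vector in global memory: one direct load per access.
__device__ double load_dense(const double *x, int row) {
    return x[row];
}

// Sparse vector stored as parallel arrays of values and row indices:
// an indexed access degenerates into a search, which is slow and divergent.
__device__ double load_sparse(const double *val, const int *row_ind,
                              int nnz, int row) {
    for (int p = 0; p < nnz; ++p)
        if (row_ind[p] == row)
            return val[p];
    return 0.0;   // row not present: structural zero
}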
CSC format
• Specified by three arrays {val, row_ind, col_ptr}: val stores the nonzero values, row_ind stores the row index of each nonzero, and col_ptr stores the index in val at which each column of A starts.
• Example
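The slide shows the example as a figure; a small made-up illustration for this transcript (0-based indices):

//         | 10  0  0 |
//     A = |  3  9  0 |
//         |  0  7  8 |
double val[]     = {10, 3, 9, 7, 8};   // nonzero values, column by column
int    row_ind[] = { 0, 1, 1, 2, 2};   // row index of each nonzero
int    col_ptr[] = { 0, 2, 4, 5};      // column j is val[col_ptr[j] .. col_ptr[j+1]-1]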
Improve data locality
• Memory access coalescing.
– Several memory transactions can be coalesced into one transaction when consecutive threads access consecutive memory locations.
– Sort the nonzeros in L and U by their row indices to improve the data locality.
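An illustrative sketch of why sorting helps (array names are assumptions): once a column's nonzeros are stored contiguously and sorted by row index, the 32 threads of a warp touch 32 consecutive elements of the value and index arrays, which the hardware coalesces into a few wide transactions:

// One warp applies the update of column k to the working vector x.
// lval/lrow hold L's nonzeros column by column, sorted by row index, so lane
// t reads element (start + t): consecutive threads hit consecutive addresses.
__device__ void update_with_column(double *x, const double *lval,
                                   const int *lrow, int start, int end,
                                   double xk) {
    int lane = threadIdx.x & 31;                 // position within the warp
    for (int p = start + lane; p < end; p += 32)
        x[lrow[p]] -= xk * lval[p];              // x(i) -= U(k,j) * L(i,k)
}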
Effectiveness of sorting
• The GPU bandwidth increase is about 2.4x on average.
• CPU sparse LU factorization also benefits from sorted nonzeros, but the bandwidth increase is only 1.15x.
Experimental results
• Environments
– 2 Xeon E5405 CPUs, 2 Xeon X5680 CPUs, an AMD Radeon 5870 GPU, and an NVIDIA GTX 580 GPU.
– The CPU experiments are implemented in C on a 64-bit Linux server; the Radeon 5870 is programmed with OpenCL 1.1 and the GTX 580 with CUDA 4.0.
• Benchmarks are from the University of Florida Sparse Matrix Collection.
Device specifications
Performance and speedup
• Group A contains cases under 200 Mflops.
– Results are mostly worse than the CPU version.
• Group B contains cases over 200 Mflops.
• Group C contains cases that produce many denormal numbers during factorization.
– The CPU cannot handle denormals at normal speed, so the GPU achieves a large speedup.
Performance and speedup (cont.)
• The GPU bandwidth is positively correlated with Mflops, which indicates that in sparse LU factorization the high memory bandwidth of the GPU can be exploited only when the problem scale is large enough.
Scalability Analysis
• The average and detailed performance on the four devices are listed in the table and figure, respectively.
Scalability Analysis (cont.)
• The best performance is attained with about 24 resident warps per SM, rather than with the maximum number of resident warps.
• On the GTX 580, at most 74% of the peak bandwidth is achieved.
Scalability Analysis (cont.)
• On the Radeon 5870, at most 45% of the peak bandwidth is achieved (on xenon1). A primary reason is that there are too few active wavefronts on the Radeon 5870 to fully utilize the global memory bandwidth.
• On the two CPUs and the Radeon 5870 GPU, the bandwidth keeps increasing with the number of issued threads (wavefronts).
Hybrid solver for circuit simulation
• Matrices with few flops in factorization are not suitable for GPU acceleration.
– Combine the CPU and GPU versions into a hybrid solver.
– Choose between them based on the flop count (see the sketch below).
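A minimal sketch of such a dispatch (types, function names and the exact threshold are placeholders; the experiments group matrices at roughly 200 Mflops):

struct Matrix;                       // opaque matrix type (placeholder)
void factorize_cpu(Matrix *A);       // hypothetical CPU LU entry point
void factorize_gpu(Matrix *A);       // hypothetical GPU LU entry point

// Pick the CPU or GPU path from the flop count obtained in pre-factorization.
void factorize_hybrid(Matrix *A, double flops) {
    const double kGpuThreshold = 200e6;   // placeholder: ~200 Mflops
    if (flops < kGpuThreshold)
        factorize_cpu(A);                 // small problems: CPU is faster
    else
        factorize_gpu(A);                 // large problems: GPU pays off
}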
Conclusions
• The sparse matrix solver is one of the runtime bottlenecks of SPICE.
• This is the first work on GPU-based sparse LU factorization intended for circuit simulation.
• Experiments demonstrate that the GPU outperforms the CPU on matrices with many floating-point operations in factorization.