Sparse LU Factorization for Parallel Circuit Simulation on GPU Ling Ren, Xiaoming Chen, Yu Wang, Chenxi Zhang, Huazhong Yang Department of Electronic Engineering, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University DAC 2012 Outline • • • • • Introduction Preliminaries Sparse LU factorization on GPU Experimental results Conclusions 1 Introduction • Flowchart of a SPICE simulator 2 Introduction (cont.) • SPICE takes several days or even weeks on simulation for modern designs. – The sparse matrix solver by LU factorization is performed iteratively and hence time-consuming. • However, it is difficult to parallelize the sparse solver because of the high data-dependency during the numeric LU factorization and the irregular structure of circuit matrices. 3 Introduction (cont.) • Previous works focus on dense matrices. – Factorize a sparse matrix by a highly parallelize dense solver is still much slower than sequential sparse solver. • [8]~[13] compute dense blocks on GPU but the rest are on CPU. – Still the dense idea. • [15]~[17] apply G/P left-looking algorithm on FPGA. – Scalability is limited because FPGA on-chip resources. • [18] implements it on multi-core CPU. – Scalability is limited by the number of cores. 4 Introduction (cont.) • Multi/Many-core era has come. • Graphic Processing Unit (GPU) can now be used to perform general computing. – Become popular in parallel processing for its cost-effectiveness. • The state of the art GPUs provide a possible solution to the limited scalability. • For now, the latest nVidia GeForce GTX 690 has large number of cores and memory. 5 GeForce GTX 690 official spec 6 Contributions • Exposing more parallelism for manycore architecture. • Ensuring timing order on GPU. • Optimizing memory access pattern. 7 Preliminaries • Sparse matrix • LU factorization (decomposition) • GPU architecture and CUDA. 8 Sparse matrix • The number of nonzero elements in a π × π sparse matrix is about in π(π). • An example of adjacency matrix. 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 9 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 LU factorization • Or called LU decomposition. • π΄ = πΏπ where A is a square matrix, L is a lower-triangular matrix, and U is a upper-triangular matrix. 10 CUDA programming • Compute Unified Device Architecture • The CPU code does the sequential part. • Highly parallelized part usually implement in the GPU code, called kernel. • Calling GPU function in CPU code is called kernel launch. 11 Execution of GPU thread • Threads are grouped into thread blocks. • Each thread block is assigned to a streaming multiprocessor (SM), which contains multiple streaming processors (SPs), to be executed. • The actual execution of threads on SPs is done in groups of 32 threads, called warps. • SPs execute one warp at a time. 12 GPU architecture • nVidia GeForce 9800 GTX 14 Sparse LU factorization on GPU • • • • • Overall flow Preprocessing Exposing more parallelism Ensuring timing order Optimizing memory access pattern 15 Overall flow 16 Preprocessing • HSL_MC64 algorithm to improve numeric stability. – Find permutation matrix. • AMD (Approximate Minimum Degree) algorithm to reduce fill-ins. – Find permutation matrix. • G/P algorithm based pre-factorization (a complete numeric factorization with partial pivoting) to calculate the symbolic structure of the LU factors. – Can extract the total flops. 17 Exposing more parallelism • Sequential G/P left-looking algorithm 18 19 Exposing more parallelism • Dependency graph and scheduler 20 Exposing more parallelism (cont.) • Treat the threads that process the same column as a virtue group. • In cluster mode – Columns are very sparse, so while ensuring enough threads in total, we make virtue groups as small as possible to minimize idle threads. • In pipeline mode – Columns usually contain enough nonzeros for a warp or several warps. So the size of virtue groups matters little in the sense of reducing idle threads. We use one warp as one virtue group. 21 Ensuring timing order Ensuring timing order (cont.) • Suppose column 8, 9 and 10 are being processed, and other columns are finished. Column 9 can be first updated with column 4, 6, 7, corresponding to the solid green arrows. But currently column 9 can not be updated with column 8. It must wait for column 8 to finish. Similar situation for column 10. 23 Ensuring timing order (cont.) • Key is to avoid deadlock. – Not all warps are active at the beginning. – If we active warps in wrong order in the pipeline mode, deadlock will occur. • There is no inter-warp context switching. 24 Optimizing memory access pattern • The main difference between CPU and GPU parallel programming is the memory access. • Two alternative data formats for the intermediate vectors (x in Algorithm 2). – CSC(Compressed Sparse Column) sparse vectors. • Save space and can be placed in shared memory. – Dense arrays. • Have to reside in global memory. • Two reasons to choose dense arrays. – CSC format is inconvenient for indexed accesses. – Using too much shared memory would reduce the number of resident warps per SM, and hence performance degradation. 25 CSC format • Specified by the 3 arrays {val, row_ind, col_ptr}, where row_ind stores the row indices of each nonzero, and col_ptr stores the index of the elements in val which start a column of A. • Example 26 Improve data locality • Memory access coalescing. – Several memory transactions can be coalesced into one transaction when consecutive threads access consecutive memory locations. – Sort the nonzeros in L and U by their row indices to improve the data locality. 27 Effectiveness of sorting • GPU bandwidth increase is about 2.4x on average. • CPU sparse LU factorization also benefits from sorted nonzeros, but the bandwidth increase is only 1.15x. 28 Experimental results • Environments – 2 Xeon E5405 CPUs, 2 Xeon X5680 CPUs, AMD Radeon 5870 GPU, and NVIDIA GTX 580 GPU. – Experiments on CPU are implemented in C on a 64-bit Linux server. Radeon 5870 is programmed using OpenCL v1.1. GTX580 is programmed with CUDA 4.0. • Benchmarks are from University of Florida Sparse Matrix Collection. 29 Devices specifications 30 Performance and speedup • Group A are cases under 200 Mflops. – Results are mostly worse than CPU version. • Group B are cases over 200 Mflops. • Group C are cases that contains many denormal number during factorization. – CPU cannot handle it in normal speed so that great speedup achieved by GPU. 31 Performance and speedup (cont.) • We can see the GPU bandwidth is positively related to Mflops, which indicates that in sparse LU factorization, the high memory bandwidth of GPU can be exploited only when the problem scale is large enough. 32 Scalability Analysis • The average and detail performance on the four devices are listed in table and figure, respectively. 33 Scalability Analysis (cont.) • The best performance is attained with about 24 resident warps per SM, rather than with maximum resident warps. • On GTX 580, it achieves 74% peak bandwidth at most. 34 Scalability Analysis (cont.) • On Radeon 5870, it achieves 45% peak bandwidth at most (on xenon1). A primary reason is that there are too few active wavefronts on Radeon 5870 to fully utilize the global memory bandwidth. • On the two CPUs and Radeon 5870 GPU, the bandwidth keeps increasing with the issued threads (wavefronts). 35 Hybrid solver for circuit simulation • The matrices with few flops in factorization are not suitable for GPU acceleration. – Combine both CPU/GPU version as a hybrid solver. – Choose one of them based on flops computation. 36 Conclusions • Sparse matrix solver is one of the runtime bottleneck of SPICE. • Propose the first work on GPU-based sparse LU factorization intended for circuit simulation. • Experiments demonstrate that GPU outperforms CPU on matrices with many floating point operations in factorization. 37