By Xinggao Xia and Jong Chul Lee

Applications of Systems of Linear Equations
- Biology
- Physics
- Chemistry

n: the dimension of the matrix (the number of unknowns)

Technique              Additions   Multiplications/Divisions
Gauss-Jordan           n^3/2       n^3/2
Gaussian Elimination   n^3/3       n^3/3
Cramer's Rule          n^4/3       n^4/3

Table 1: Computational Complexity of Various Solving Techniques

Comments:
- Computational complexity increases tremendously as the dimension of the matrix increases.
- The Gaussian Elimination solver has an obvious advantage in complexity as the matrix size increases.

Gaussian Elimination for Ax = B

[Figure: elimination iterations 1-3 on a small example; in each iteration the multipliers m21, m31, m32 zero the entries below the pivot, and a final normalization pass recovers the solution.]

Inter-iteration parallelism
- For iteration i: A[j][.] = A[j][.] - m[j][i] * (pivot row i), for every row j below the pivot.
- The multiplier column m must be determined before each iteration.
- This pattern fits the CUDA architecture perfectly.

[Figure: the multiplier column (m0i, m1i, m2i, m3i, ...) applied against the pivot row in iteration i.]

Modified Gaussian Elimination is considered for the CUDA linear equations solver:
- More parallelism
- No back substitution
- Partial pivoting guarantees the accuracy of the solution

Traditional vs. Modified Gaussian Elimination

[Figure: iteration snapshots of both schemes on the same example. Traditional Gaussian Elimination eliminates only the entries below each pivot and therefore needs back substitution; modified Gaussian Elimination adds eliminations above the pivot in each iteration, driving the matrix toward diagonal form.]

For iteration i: Row j = Row j - m_j * Row i. The added eliminations above the pivot are what distinguish modified Gaussian Elimination from the traditional scheme.

Traditional Gaussian Linear Solver: N-1 elimination iterations followed by back substitution.
Modified Gaussian
Elimination Linear Solver: N elimination iterations followed by a normalization pass; no back substitution needed.

[Figure: iteration sequences of the two solvers, ending in an upper-triangular matrix (traditional) versus a diagonal matrix (modified).]

Modified Gaussian Linear Solver

For (i = 0; i < N; i++) {
    Partial pivoting {
        Transfer the ith column back to host;
        Search the maximum of this column and return the index;  (Host)
        Switch rows if necessary;                                (Device)
    }
    Determine the multiplier column;  (Device)
    Modified Gaussian elimination;    (Device)
}
Normalize the solution;  (Device)
Transfer solution back to host;

Threads Architecture
- Matrix handling (e.g., the modified Gaussian Elimination kernel): each thread handles one update A[j][k] = A[j][k] - m_j * A[i][k] for iteration i; a two-dimensional grid and block is used, for a total of N*N threads in the kernel.
- Row or column handling (partial pivoting and others): each thread handles one element of the row or column; a one-dimensional grid and block is used, for a total of N threads in the kernel.

Partial Pivoting
- cudaMemcpy (device to host) transfers the ith column (d_temp to h_temp); the host searches for the maximum element and returns its index.
- To minimize device-host transfers, the row switch itself is performed by a kernel on the device.

[Figure: partial-pivoting pipeline (Kernel1 through Kernel4) around the host-side maximum search.]

Data Partitioning (iteration i)
- The N x N matrix is tiled into BLOCK_SIZE x BLOCK_SIZE blocks B(i,j), each handled by threads T(0,0) ... T(0,M-1).
- The multiplier column m and the pivot row i are staged in shared memory.
- Each thread then computes A[j][k] = A[j][k] - m_j * A[i][k].

Platform Configuration
- GPU: GeForce 8400 GS (1 SM, 8 cores, clock rate 1.40 GHz)
- CPU: Intel Core 2 Quad Q6600 (clock rate 2.39 GHz)

Comments:
- The GPU implementation (global or shared memory) is much slower than the CPU implementation when only 1 SM is available.
- To mimic a Tesla-class GPU (16 SMs), GPU times are scaled down by 16.

Execution time (ms) vs. matrix size:

Matrix size                                  512    1024    2048     4096
Serial Traditional Gaussian Linear Solver     47     403    5214    46098
Serial Modified Gaussian Linear Solver        71     564    8412    69949
Global Memory (1 SM)                        1718   13488  108916   862580
Shared Memory (1 SM)                         662    4806   38923   312787
Global Memory (scaled by 16)                 107     843    6807    53911
Shared Memory (scaled by 16)                  41     300    2433    19549

[Charts: execution time (ms) vs. matrix size for Traditional GE, Modified GE, Global (16 SM), and Shared (16 SM).]

Comments:
- The CPU prefers the traditional GE solver over the modified GE solver.
- The GPU shared-memory implementation is consistently 2-3 times faster than the global-memory implementation.
- The GPU (16 SM) shared-memory implementation achieves around 2x speedup over the traditional GE solver.
- For the 1024 case (1 SM), the global-memory implementation takes 13488 ms versus 4806 ms for the shared-memory implementation.

Profiler results:

Method             #Calls   GPU (usec)   %GPU time
Global GE_kernel     1024    1.3e+07       99.11
Shared GE_kernel     1024    4.8e+06       97.6

          gld uncoalesced   gld coalesced   %uncoalesced rate
Global         1048576          131072            89
Shared           61440           73728            45

Conclusion:
- A linear equations solver based on modified Gaussian Elimination is implemented on CUDA.
- The shared-memory implementation is about 3 times faster than the global-memory implementation.
- The shared-memory implementation is expected to be about 3 times faster than the serial traditional Gaussian Elimination solver on a 16-SM GPU.
- Partial pivoting guarantees stability and accuracy (error less than 0.001 compared to the serial code).

Problem found: the additional uncoalesced global memory accesses offset the advantage gained from the extra parallelism in modified Gaussian Elimination.