Solving linear systems

Overview

Chapter 12 from Michael J. Quinn, Parallel Programming in C with MPI and OpenMP.

We want to find the vector x = (x_0, x_1, ..., x_{n-1}) that solves a system of linear equations

    Ax = b,

where A is an n × n matrix and b is a vector of length n.

Three topics for today:
- Back substitution
- Gaussian elimination
- Iterative methods for sparse linear systems

A system of linear equations

Example:

    a_{0,0} x_0   + a_{0,1} x_1   + ... + a_{0,n-1} x_{n-1}   = b_0
    a_{1,0} x_0   + a_{1,1} x_1   + ... + a_{1,n-1} x_{n-1}   = b_1
    ...
    a_{n-1,0} x_0 + a_{n-1,1} x_1 + ... + a_{n-1,n-1} x_{n-1} = b_{n-1}

where the a_{i,j} and b_i are constants and the x_j are the unknown values to be found.

Back substitution

- An algorithm for solving Ax = b when A is upper triangular, that is, i > j ⇒ a_{i,j} = 0.
- We shall first look at its serial implementation, then at two possible parallelizations.

Example of back substitution

Starting point:

    1x_0 + 1x_1 − 1x_2 + 4x_3 = 8
          −2x_1 − 3x_2 + 1x_3 = 5
                 2x_2 − 3x_3 = 0
                        2x_3 = 4

After step 1 (x_3 = 4/2 = 2 is substituted into the rows above):

    1x_0 + 1x_1 − 1x_2 = 0
          −2x_1 − 3x_2 = 3
                  2x_2 = 6
                  2x_3 = 4

After step 2 (x_2 = 6/2 = 3):

    1x_0 + 1x_1 = 3
          −2x_1 = 12
           2x_2 = 6
           2x_3 = 4

After step 3 (x_1 = 12/(−2) = −6):

    1x_0 = 9
   −2x_1 = 12
     x_2 = 3
     x_3 = 2

so the solution is x = (9, −6, 3, 2).

Pseudo-code for back substitution

    a[0..n−1, 0..n−1] — coefficient matrix
    b[0..n−1] — constant vector
    x[0..n−1] — solution vector

    for i ← n − 1 down to 0 do
        x[i] ← b[i]/a[i, i]
        for j ← 0 to i − 1 do
            b[j] ← b[j] − x[i] × a[j, i]
            a[j, i] ← 0
        endfor
    endfor

A C translation is sketched below, after the two parallel variants.

Observations about back substitution

- In each i iteration, x[i] must be computed first as b[i]/a[i, i].
- However, b[i] depends on the previous i iterations.
- Therefore, the i for-loop can not be executed in parallel.
- The j for-loop inside each i iteration can be executed in parallel.

Row-oriented parallel back substitution

- The rows of A are distributed among p processes in an interleaved striped decomposition: if i mod p = k, then row i is assigned to process k. The b and x vectors are distributed in the same way.
- In each i iteration, the process responsible for row i computes x_i = b_i / a_{i,i}.
- The newly computed x_i value is then broadcast to all processes.
- Thereafter, each process updates all the b_j values it is responsible for, as b_j = b_j − a_{j,i} x_i.

An MPI sketch of this algorithm also follows below.

Complexity:
- The j loop averages n/(2p) iterations per process per i iteration, so the computational complexity is O(n²/p).
- Complexity of communication latency time: O(n log p).
- Complexity of communication data transmission time: O(n log p).

Column-oriented parallel back substitution

- Alternatively, we can distribute the columns of A in an interleaved striped decomposition.
- During iteration i, the responsible process computes x_i and updates the entire b vector.
- The newly updated b vector must then be sent to the successor process before the next i iteration can start.
- Therefore, the column-oriented parallel back substitution is actually not a parallel algorithm, because there is no computational concurrency.

Complexity:
- Computational complexity: O(n²).
- Complexity of communication latency time: O(n).
- Complexity of communication data transmission time: O(n²).
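Before comparing the two variants, here is how the back-substitution pseudo-code translates into C. This is a minimal sketch; the function name and the flat row-major storage of a are our own choices:

    /* Back substitution for an upper triangular system.
       a is stored row-major as a flat array of n*n doubles;
       b is overwritten during the sweep, x receives the solution. */
    void back_substitution(int n, double *a, double *b, double *x)
    {
        for (int i = n - 1; i >= 0; i--) {
            x[i] = b[i] / a[i*n + i];       /* solve row i for x[i] */
            for (int j = 0; j < i; j++) {
                b[j] -= x[i] * a[j*n + i];  /* remove x[i] from the rows above */
                a[j*n + i] = 0.0;           /* mirrors the pseudo-code; not
                                               strictly needed for correctness */
            }
        }
    }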
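And here is a minimal MPI sketch of the row-oriented parallel variant, assuming the interleaved distribution described above (row i lives on process i mod p as local row i/p) and that the x vector is replicated on every process. Data distribution, error handling, and MPI_Init/MPI_Finalize are omitted, and all names are our own:

    #include <mpi.h>

    /* Row-oriented parallel back substitution.
       a holds only this process's rows (row-major, n columns each);
       b holds the matching entries of the constant vector;
       x (length n) ends up replicated on every process. */
    void row_oriented_back_sub(int n, int id, int p,
                               double *a, double *b, double *x)
    {
        for (int i = n - 1; i >= 0; i--) {
            int root = i % p;                 /* owner of row i */
            if (id == root)
                x[i] = b[i / p] / a[(i / p) * n + i];
            /* everyone needs x[i] before updating its rows */
            MPI_Bcast(&x[i], 1, MPI_DOUBLE, root, MPI_COMM_WORLD);
            /* update the b entries of this process's rows above row i */
            for (int j = id; j < i; j += p)
                b[j / p] -= x[i] * a[(j / p) * n + i];
        }
    }

Each of the n iterations costs one broadcast of a single double, which is where the O(n log p) latency and data transmission terms quoted above come from.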
Comparison

- The column-oriented parallel back substitution is always slower than the sequential substitution.
- The row-oriented parallel back substitution can be faster than the sequential substitution, depending on the values of n and p and on the communication speeds.
- The row-oriented parallel back substitution can also be slower than the column-oriented parallel back substitution, especially when n is relatively small and p is relatively large.

Gaussian elimination

- A well-known algorithm for solving dense linear systems.
- The original system Ax = b is reduced by Gaussian elimination to an upper triangular system Tx = c.
- Then, back substitution can be used to find x.

Example of Gaussian elimination

Starting point:

    4x_0 + 6x_1  + 2x_2 − 2x_3 = 8
    2x_0         + 5x_2 − 2x_3 = 4
   −4x_0 − 3x_1  − 5x_2 + 4x_3 = 1
    8x_0 + 18x_1 − 2x_2 + 3x_3 = 40

After step 1 (column 0 is eliminated below the diagonal):

    4x_0 + 6x_1 + 2x_2 − 2x_3 = 8
          −3x_1 + 4x_2 − 1x_3 = 0
          +3x_1 − 3x_2 + 2x_3 = 9
          +6x_1 − 6x_2 + 7x_3 = 24

After step 2:

    4x_0 + 6x_1 + 2x_2 − 2x_3 = 8
          −3x_1 + 4x_2 − 1x_3 = 0
                 +1x_2 + 1x_3 = 9
                 +2x_2 + 5x_3 = 24

After step 3:

    4x_0 + 6x_1 + 2x_2 − 2x_3 = 8
          −3x_1 + 4x_2 − 1x_3 = 0
                 +1x_2 + 1x_3 = 9
                        +3x_3 = 6

Sequential algorithm of Gaussian elimination

- In total, n − 1 steps are needed for a linear system with an n × n matrix A and an n × 1 vector b.
- During step i, the nonzero elements of A below the diagonal in column i are eliminated by replacing each row j, where i + 1 ≤ j < n, with the sum of row j and −a_{j,i}/a_{i,i} times row i.

Partial pivoting

- During step i of Gaussian elimination, row i is called the pivot row, that is, the row used to drive to zero all nonzero elements below the diagonal in column i.
- However, if a_{i,i} is zero or very close to zero, we will have "trouble": dividing by a tiny pivot destroys numerical accuracy.
- Gaussian elimination with partial pivoting: in step i, rows i through n − 1 are searched for the row whose column i element has the largest absolute value. Then, this row is swapped (pivoted) with row i.

Pseudo-code for Gaussian elimination (row pivoting)

Rows are swapped only logically, through the index array loc: loc[j] is the physical row currently playing the role of row j.

    for i ← 0 to n − 1
        magnitude ← 0
        for j ← i to n − 1
            if |a[loc[j], i]| > magnitude
                magnitude ← |a[loc[j], i]|
                picked ← j
            endif
        endfor
        swap loc[i] and loc[picked]
        for j ← i + 1 to n − 1
            t ← a[loc[j], i]/a[loc[i], i]
            for k ← i + 1 to n − 1
                a[loc[j], k] ← a[loc[j], k] − a[loc[i], k] × t
            endfor
        endfor
    endfor

The corresponding updates to the vector b are carried out in the same way and are not shown. A C translation follows the parallel algorithms below.

Parallel algorithms for Gaussian elimination

- The outermost i loop can not be parallelized.
- Both the innermost k loop and the middle j loop can be executed in parallel.
- Two parallel algorithms, based on two data decompositions.

Row-oriented parallel Gaussian elimination

- Row-wise block striped decomposition of the rows.
- Use of partial pivoting ensures load balancing as the outermost i loop proceeds.
- Determining the pivot row (the value of picked) requires cooperation among processes:
  - Each process first finds its local candidate for picked, together with its value of |a[loc[picked], i]|.
  - Then, MPI_Allreduce is used with operation MPI_MAXLOC and datatype MPI_DOUBLE_INT (see the sketch below).
- More communication is needed per i iteration: when picked is decided, the process in charge of row picked broadcasts a_{loc[picked],i}, a_{loc[picked],i+1}, ..., a_{loc[picked],n−1} to all other processes.
- Then, each process carries out, concurrently, a segment of the middle j loop.
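As promised earlier, here is a C translation of the pivoting pseudo-code. It is a sketch under our own naming and storage conventions (flat row-major a); unlike the pseudo-code, the k loop starts at i, so the eliminated entries really become zero, and the right-hand side b is updated along the way:

    #include <math.h>

    /* Gaussian elimination with partial pivoting.
       Rows are reordered only through the index array loc:
       loc[j] is the physical row currently acting as row j. */
    void gaussian_elimination(int n, double *a, double *b, int *loc)
    {
        for (int i = 0; i < n; i++)
            loc[i] = i;                        /* initial row order */

        for (int i = 0; i < n - 1; i++) {
            /* find the row with the largest |a[loc[j], i]| */
            double magnitude = 0.0;
            int picked = i;
            for (int j = i; j < n; j++)
                if (fabs(a[loc[j] * n + i]) > magnitude) {
                    magnitude = fabs(a[loc[j] * n + i]);
                    picked = j;
                }
            int tmp = loc[i]; loc[i] = loc[picked]; loc[picked] = tmp;

            /* eliminate column i from the rows below the pivot row */
            for (int j = i + 1; j < n; j++) {
                double t = a[loc[j] * n + i] / a[loc[i] * n + i];
                for (int k = i; k < n; k++)
                    a[loc[j] * n + k] -= a[loc[i] * n + k] * t;
                b[loc[j]] -= b[loc[i]] * t;    /* keep b consistent */
            }
        }
    }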
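The cooperative pivot search in the row-oriented parallel algorithm maps onto a single collective call. A sketch of just that step, with our own variable names; each process passes in the magnitude and global row number of its local candidate:

    #include <mpi.h>

    /* Agree on the pivot row for the current column.
       MPI_DOUBLE_INT matches a struct of one double and one int;
       MPI_MAXLOC keeps the largest value and its attached index. */
    int global_pivot_row(double local_best_magnitude, int local_best_row)
    {
        struct { double value; int row; } in, out;
        in.value = local_best_magnitude;
        in.row   = local_best_row;
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC,
                      MPI_COMM_WORLD);
        return out.row;       /* the same answer on every process */
    }

After this call, the process in charge of the winning row broadcasts the tail of that row, as described above.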
Column-oriented parallel Gaussian elimination

- Column-wise interleaved striped decomposition of A.
- During iteration i:
  - The process controlling column i is alone responsible for finding the pivot row.
  - Once the pivot row is identified, the controlling process has to broadcast the value of picked and the column i elements of the unmarked rows.
  - The remaining computations are carried out in parallel by all processes.

Another parallel Gaussian elimination algorithm

- Uses column pivoting.
- The broadcast is replaced by a series of point-to-point message sends and receives.
- The flow of messages is pipelined, giving the possibility of overlap between computation and communication.
- See Section 12.4.6 in the textbook for the details.

Linear systems with a sparse matrix

- Gaussian elimination followed by back substitution is an example of a direct method for solving linear systems.
- Gaussian elimination works well for a dense matrix A.
- When A is sparse, that is, when only a few elements are nonzero, Gaussian elimination is not a good choice: the elimination steps introduce new nonzero elements (fill-in), so the sparsity is quickly lost.
- Iterative methods are better choices for sparse matrices.

Iterative methods

- An iterative method produces a series of approximation vectors: x^(0), x^(1), ...
- Simple iterative methods (such as the Jacobi method) and advanced iterative methods (such as the conjugate gradient method).
- Data decomposition:
  - Row-wise block striped decomposition of the matrix A (only the nonzero elements are stored).
  - Matching block striped decomposition of the vectors b and x.
  - Each process needs to store a few "ghost values" of x, in addition to its own segment of the vector x.
- A parallel matrix-vector multiplication = a local sequential matrix-vector multiplication + communication afterward (a sketch follows below).
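To make the decomposition concrete, here is a sketch of one Jacobi sweep over a local block of rows stored in compressed sparse row (CSR) form. For simplicity, the ghost-value exchange is played by MPI_Allgatherv, which re-assembles the whole x vector on every process; a real code would exchange only the ghost values each process actually references. All names and the CSR layout are our own choices:

    #include <mpi.h>

    /* One Jacobi iteration for this process's block of rows.
       The local rows are stored in CSR form (row_ptr, col_idx, val);
       diag holds the diagonal entry of each local row;
       x_global is the full current iterate, assembled by exchange_x;
       x_new receives the updated local entries. */
    void jacobi_sweep(int local_n, int first_row,
                      const int *row_ptr, const int *col_idx,
                      const double *val, const double *diag,
                      const double *b_local,
                      const double *x_global, double *x_new)
    {
        for (int r = 0; r < local_n; r++) {
            double s = 0.0;
            for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++)
                if (col_idx[k] != first_row + r)   /* skip the diagonal */
                    s += val[k] * x_global[col_idx[k]];
            x_new[r] = (b_local[r] - s) / diag[r]; /* Jacobi update */
        }
    }

    /* Between sweeps, every process needs the x entries referenced by
       its nonzeros; gathering the whole vector stands in for the ghost
       exchange here. counts/displs describe the block layout. */
    void exchange_x(int local_n, double *x_local, double *x_global,
                    const int *counts, const int *displs)
    {
        MPI_Allgatherv(x_local, local_n, MPI_DOUBLE,
                       x_global, counts, displs, MPI_DOUBLE,
                       MPI_COMM_WORLD);
    }

This matches the structure on the slide: jacobi_sweep is the purely local matrix-vector work, and exchange_x is the communication step between iterations.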