BSS 797: Principles of Parallel Computing
Lecture 15: Parallel Linear Algebra II

Algorithm 13: Matrix-Vector operations

The setup: Multiply an $n \times n$ matrix A by an $n \times 1$ vector B to produce a vector C on P processors.

The matrix A is

    a11  a12  ...  a1n
    a21  a22  ...  a2n
    ...
    an1  an2  ...  ann

and the vector B is

    b1
    b2
    ...
    bn

One processor: assume enough memory to fit the matrix.

    for (i = 0; i < n; i++) {
        C[i] = 0.0;                      /* clear the accumulator first */
        for (j = 0; j < n; j++)
            C[i] += A[i][j] * B[j];
    }

Complexity: $T(1,n) = c\,n^2$, where c is a constant characterizing the processor.

Matrix-Vector on P nodes

In the layouts below, x@p means that element x is stored on processor p.

Method I: The most obvious approach: row-partition the matrix and replicate the $n \times 1$ vector on all P processors.

Matrix A is row partitioned as

    a11@1  a12@1  ...  a1n@1
    a21@1  a22@1  ...  a2n@1
    a31@2  a32@2  ...  a3n@2
    a41@2  a42@2  ...  a4n@2
    a51@3  a52@3  ...  a5n@3
    a61@3  a62@3  ...  a6n@3
    ...
    an1@P  an2@P  ...  ann@P

while the vector is allocated on all processors (1<->P means processors 1 through P):

    b1@1<->P
    b2@1<->P
    b3@1<->P
    b4@1<->P
    b5@1<->P
    b6@1<->P
    ...
    bn@1<->P

Therefore, each processor is given n/P rows of the matrix and the entire vector. The actual implementation is easy.

Performance analysis for this row partition:
Each processor multiplies an $(n/P) \times n$ matrix by an $n \times 1$ vector. The cost is $T_m(P,n) = c\,n\,(n/P)$.
The time to distribute the $(n/P) \times 1$ subvectors so that every processor holds a full $n \times 1$ vector is $d\,P\,(n/P) = d\,n$, where d is a machine constant.
Hence $T(P,n) = c\,n^2/P + d\,n$, and the speedup is
$S(P,n) = T(1,n)/T(P,n) = P/(1 + c_1 P/n)$, with $c_1 = d/c$.

Remarks:
1. $h(P,n) \sim P/n$: one more example of overhead that grows with P and shrinks with n.
2. The method is easy to implement.
3. Two defects:
   1. It is rare to find applications of this mapping.
   2. Memory redundancy: the vector is stored P times.

Method II: Row partition both the matrix and the vector.

    a11@1  a12@1  ...  a1n@1
    a21@1  a22@1  ...  a2n@1
    a31@2  a32@2  ...  a3n@2
    a41@2  a42@2  ...  a4n@2
    a51@3  a52@3  ...  a5n@3
    a61@3  a62@3  ...  a6n@3
    ...
    an1@P  an2@P  ...  ann@P

The vector is

    b1@1
    b2@1
    b3@2
    b4@2
    b5@3
    b6@3
    ...
    bn@P

Each processor is given n/P rows of the matrix and n/P rows of the vector.

Implementation Steps

In the step diagrams the indices label blocks rather than single elements: each a entry is an $(n/P) \times (n/P)$ block, each b entry an $(n/P) \times 1$ subvector, and the last index n stands for block P. N marks a block that is not multiplied in the current step.

Step I: Multiply the portions shown, locally and in parallel.

    a11@1  N      N      ...  N
    N      a22@2  N      ...  N
    N      N      a33@3  ...  N
    ...
    N      N      N      ...  ann@P

The vector is

    b1@1
    b2@2
    b3@3
    ...
    bn@P

Step II: Roll the n/P-element subvectors up by one processor (the first goes back to the bottom) and multiply.

    N      a12@1  N      N      ...  N
    N      N      a23@2  N      ...  N
    N      N      N      a34@3  ...  N
    ...
    an1@P  N      N      N      ...  N

The vector is

    b2@1
    b3@2
    b4@3
    ...
    b1@P

Step III: Repeat Step II until all vector elements have visited all processors (P - 1 rolls in total). Done!

Performance analysis
Each processor multiplies an $(n/P) \times n$ matrix by an $n \times 1$ vector, one subvector at a time. The cost is $T_m(P,n) = c\,n\,(n/P)$.
The communication time to roll up the subvectors is proportional to n. Thus, adding the final collection time, the communication costs $d' n$ (d' is a machine constant, bigger than d).
Speedup: $S(P,n) = T(1,n)/T(P,n) = P/(1 + c'_1 P/n)$, with $c'_1 = d'/c$.

Remarks:
1. $h(P,n) \sim P/n$: one more example.
2. Memory is parallelized.
3. This mapping is popular in applications.
4. One defect: it is difficult to implement (communications).

Method III: Column partition both the matrix and the vector.

    a11@1  a12@2  ...  a1n@P
    a21@1  a22@2  ...  a2n@P
    a31@1  a32@2  ...  a3n@P
    a41@1  a42@2  ...  a4n@P
    a51@1  a52@2  ...  a5n@P
    a61@1  a62@2  ...  a6n@P
    ...
    an1@1  an2@2  ...  ann@P

The vector is

    b1@1
    b2@2
    ...
    bn@P

Each processor is given n/P columns of the matrix and n/P rows of the vector.

Implementation steps
Step I: All P processors together multiply sub-row 1 of A by the entire B (each processor using its own columns and its own subvector), then collect a global sum (communication) to produce sub-row 1 of C.
Step II: All P processors multiply sub-row 2 of A by the entire B in the same way and collect a global sum (communication) to produce sub-row 2 of C.
Step III: Repeat Step II until all sub-rows of A are processed.

Performance Analysis
Each processor multiplies an $n \times (n/P)$ matrix by an $(n/P) \times 1$ subvector. The cost is $T_m(P,n) = c\,n\,(n/P)$.
The communication time for the global sums is proportional to n. Thus, adding the final collection time, the communication costs $d'' n$ (d'' is a machine constant, bigger than d).
Speedup: $S(P,n) = T(1,n)/T(P,n) = P/(1 + c''_1 P/n)$, with $c''_1 = d''/c$.

Remarks:
1. $h(P,n) \sim P/n$: one more example.
2. Memory is parallelized.
3. It is reasonably easy to implement (communications).
4. This mapping is often seen in applications.

Algorithm 14: Matrix-Matrix operations

The setup: Multiply an $n \times n$ matrix A by another $n \times n$ matrix B to produce C.

Matrix A:

    a11  a12  ...  a1n
    a21  a22  ...  a2n
    ...
    an1  an2  ...  ann

Matrix B:

    b11  b12  ...  b1n
    b21  b22  ...  b2n
    ...
    bn1  bn2  ...  bnn

One processor: assume enough memory to fit the matrices.

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            C[i][j] = 0.0;               /* clear the accumulator first */
            for (k = 0; k < n; k++)
                C[i][j] += A[i][k] * B[k][j];
        }

Complexity: $T(1,n) = c\,n^3$, where c is a constant characterizing the processor.

Method I: Row-column partition
Name: also known as the Ring Method

The setup: Multiply an $n \times n$ matrix A by another $n \times n$ matrix B to produce C. Each row index of A and each column index of B below stands for a sub-row or sub-column (n/P rows or columns).

Initially, Matrix A is

    a11@1  a12@1  ...  a1n@1
    a21@2  a22@2  ...  a2n@2
    ...
    an1@P  an2@P  ...  ann@P

and Matrix B is

    b11@1  b12@2  ...  b1n@P
    b21@1  b22@2  ...  b2n@P
    b31@1  b32@2  ...  b3n@P
    b41@1  b42@2  ...  b4n@P
    b51@1  b52@2  ...  b5n@P
    b61@1  b62@2  ...  b6n@P
    ...
    bn1@1  bn2@2  ...  bnn@P

Remarks:
1. Row-column and column-row are symmetrical.
2. Intrinsically a 1D partition.

Implementation Steps

Step I: Processor 1 multiplies sub-row 1 of A by sub-column 1 of B to create c11. Processor 2 multiplies sub-row 2 of A by sub-column 2 of B to create c22, and so on, up to Processor P, which multiplies sub-row P of A by sub-column P of B to create cPP. After Step I, the diagonal sub-matrices of C have been created.

Step II: Roll up all sub-rows of A by one processor, and multiply to produce c21, c32, ..., and c1P on Processors 1, 2, ..., P, respectively.

Matrix A:

    a21@1  a22@1  ...  a2n@1
    a31@2  a32@2  ...  a3n@2
    ...
    a11@P  a12@P  ...  a1n@P

Matrix B (unchanged):

    b11@1  b12@2  ...  b1n@P
    b21@1  b22@2  ...  b2n@P
    b31@1  b32@2  ...  b3n@P
    b41@1  b42@2  ...  b4n@P
    b51@1  b52@2  ...  b5n@P
    b61@1  b62@2  ...  b6n@P
    ...
    bn1@1  bn2@2  ...  bnn@P

Step III: Roll up again and again until all sub-rows have visited all processors. Done!

Performance analysis
One processor time: $T(1,n) = c\,n^3\,t_{comp}$.
On P processors, $T(P,n) = T_{roll} + T_{comp}$: the roll-up time plus the submatrix multiplication time.
$T_{roll} = (P-1)\,(n^2/P)\,t_{comm} \approx n^2\,t_{comm}$
$T_{comp} \approx P^2\,c\,(n/P)^3\,t_{comp}$.
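Therefore, $T(P,n) = T_{roll} + T_{comp} \approx n^2\,t_{comm} + P^2\,c\,(n/P)^3\,t_{comp} = n^2\,t_{comm} + (c\,n^3/P)\,t_{comp}$.

Speedup: $S(P,n) = T(1,n)/T(P,n) = P/(1 + [P/(c\,n)]\,t_{comm}/t_{comp})$

Remarks:
1. Overhead: $h(P,n) \approx [P/(c\,n)]\,t_{comm}/t_{comp}$, which scales as $(P/n)\,t_{comm}/t_{comp}$.
2. The universal law is confirmed again.
3. Only large problems benefit.
4. The method is easy to implement.
5. Quite often, it is the natural way to decompose the problem.
6. It is memory efficient.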
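
The roll-and-multiply schedule of the Ring Method can be checked on a single processor. The sketch below is a serial emulation only, not a parallel implementation: there is no message passing, and the loop over p stands in for work that the P processors would do concurrently. It assumes row-major flat arrays, 0-based indices, and that P divides n; the function names ring_matmul, serial_matmul, and ring_vs_serial_max_diff are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Reference: the one-processor triple loop, C = A * B (row-major, n*n). */
void serial_matmul(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

/* Serial emulation of the ring method (assumes P divides n).
 * "Processor" p keeps sub-column p of B for the whole run.
 * After s rolls it holds sub-row (p + s) % P of A and produces the
 * block of C at block-row (p + s) % P, block-column p. */
void ring_matmul(int n, int P, const double *A, const double *B, double *C)
{
    assert(P > 0 && n % P == 0);
    int nb = n / P;                        /* rows (columns) per block */
    for (int s = 0; s < P; s++)            /* s = 0 is Step I, then P-1 rolls */
        for (int p = 0; p < P; p++) {      /* concurrent on a real machine */
            int r = (p + s) % P;           /* sub-row of A held by p now */
            for (int i = r * nb; i < (r + 1) * nb; i++)
                for (int j = p * nb; j < (p + 1) * nb; j++) {
                    double sum = 0.0;
                    for (int k = 0; k < n; k++)
                        sum += A[i * n + k] * B[k * n + j];
                    C[i * n + j] = sum;
                }
        }
}

/* Hypothetical check: largest |serial - ring| entry on integer test data.
 * Returns -1.0 if memory allocation fails. */
double ring_vs_serial_max_diff(int n, int P)
{
    size_t len = (size_t)n * (size_t)n;
    double *A  = (double *)malloc(len * sizeof(double));
    double *B  = (double *)malloc(len * sizeof(double));
    double *C1 = (double *)malloc(len * sizeof(double));
    double *C2 = (double *)malloc(len * sizeof(double));
    double max_diff = -1.0;
    if (A && B && C1 && C2) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                A[i * n + j] = (double)(i + 2 * j + 1);
                B[i * n + j] = (double)(i - j);
            }
        serial_matmul(n, A, B, C1);
        ring_matmul(n, P, A, B, C2);
        max_diff = 0.0;
        for (size_t t = 0; t < len; t++) {
            double d = C1[t] - C2[t];
            if (d < 0.0) d = -d;
            if (d > max_diff) max_diff = d;
        }
    }
    free(A); free(B); free(C1); free(C2);
    return max_diff;
}
```

A call such as ring_vs_serial_max_diff(6, 3) compares the two results for n = 6 on P = 3 emulated processors. Because each processor visits every sub-row exactly once, the P steps together fill all $P^2$ blocks of C, which is the $P^2\,c\,(n/P)^3$ count used in the analysis above.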