BSS 797: Principles of Parallel Computing

Lecture 15
Parallel Linear Algebra II
Algorithm 13: Matrix-Vector operations
The setup: multiply an n*n matrix A by an n*1 vector
B (to produce a vector C) on P processors. The matrix A
is
A11  A12  ...  A1n
A21  A22  ...  A2n
...  ...  ...  ...
An1  An2  ...  Ann
and the vector B is
b1
b2
...
bn
One processor: assume enough memory to fit the matrix.
for (i = 0; i < n; i++) {   /* 0-based C indexing */
    C[i] = 0.0;             /* clear the accumulator first */
    for (j = 0; j < n; j++) C[i] += A[i][j] * B[j];
}
Complexity: T(1,n) = c n^2, where c is a constant
characterizing the processor.
Matrix-Vector on P nodes
Method I: The most obvious approach: row-partition the
matrix and replicate the n*1 vector on all P processors.
Matrix A is row-partitioned as follows (x@p: element x is
stored on processor p):
a11@1  a12@1  ...  a1n@1
a21@1  a22@1  ...  a2n@1
a31@2  a32@2  ...  a3n@2
a41@2  a42@2  ...  a4n@2
a51@3  a52@3  ...  a5n@3
a61@3  a62@3  ...  a6n@3
...    ...    ...  ...
an1@P  an2@P  ...  ann@P
The vector B is replicated on all processors (1<->P means
processors 1 through P):
b1@1<->P
b2@1<->P
b3@1<->P
b4@1<->P
b5@1<->P
b6@1<->P
...
bn@1<->P
Therefore, each processor is given n/P rows of the
matrix and the entire column of the vector.
The actual implementation is easy to do.
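The row partition can be sketched in C as follows. This is a minimal serial simulation, not a distributed implementation: the names row_block_matvec and method1_matvec are hypothetical, the P processors are replaced by a plain loop over ranks, and P is assumed to divide n.

```c
#include <assert.h>
#include <stddef.h>

/* Work done by one simulated processor `rank` in Method I.
 * A is the full n*n matrix in row-major order; in a real distributed
 * code each processor would store only its own n/P rows.
 * B is the replicated vector; C receives this processor's n/P results. */
void row_block_matvec(size_t n, size_t P, size_t rank,
                      const double *A, const double *B, double *C)
{
    assert(P > 0 && n % P == 0 && rank < P);  /* assumption: P divides n */
    size_t rows = n / P;
    size_t first = rank * rows;               /* first row owned by rank */
    for (size_t i = first; i < first + rows; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += A[i * n + j] * B[j];
        C[i] = sum;
    }
}

/* Serial stand-in for the P processors running in parallel. */
void method1_matvec(size_t n, size_t P,
                    const double *A, const double *B, double *C)
{
    for (size_t rank = 0; rank < P; rank++)
        row_block_matvec(n, P, rank, A, B, C);
}
```

Because every rank reads the whole of B, no communication is needed during the multiply itself, which is why the mapping is easy to implement but costs n words of memory for B on every processor.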
Performance analysis for this row partition:
Each processor multiplies an (n/P)*n submatrix by an
n*1 vector. The cost is Tm(P,n) = c n * (n/P). The time
to distribute the (n/P)*1 subvectors so that every
processor holds a full n*1 vector is d * P * (n/P) = d n.
Speedup:
S(P,n) = T(1,n)/T(P,n) = P/(1 + c1 P/n).
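The speedup formula follows from the two costs just stated. A short derivation, assuming the parallel time is simply computation plus communication with no overlap, in which case the constant works out to c_1 = d/c:

```latex
\begin{align*}
T(P,n) &= c\,\frac{n^2}{P} + d\,n, \\
S(P,n) &= \frac{T(1,n)}{T(P,n)}
        = \frac{c\,n^2}{c\,n^2/P + d\,n}
        = \frac{P}{1 + (d/c)\,P/n}
        = \frac{P}{1 + c_1\,P/n}.
\end{align*}
```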
Remarks:
1. h(P,n) ~= P/n: One more example
2. Method is easy to implement
3. Two defects:
   a. rare to find applications of this mapping
   b. memory redundancy
Method II: Row partition both the matrix and vector.
a11@1  a12@1  ...  a1n@1
a21@1  a22@1  ...  a2n@1
a31@2  a32@2  ...  a3n@2
a41@2  a42@2  ...  a4n@2
a51@3  a52@3  ...  a5n@3
a61@3  a62@3  ...  a6n@3
...    ...    ...  ...
an1@P  an2@P  ...  ann@P
The vector is
b1@1
b2@1
b3@2
b4@2
b5@3
b6@3
...
bn@P
Each processor is given n/P rows of the matrix and n/P
rows of the vector.
Implementation Steps
Step I: Multiplying the portions shown locally in
parallel.
(N: block not used in this step)
a11@1  N      N      ...  N
N      a22@2  N      ...  N
N      N      a33@3  ...  N
...    ...    ...    ...  ...
N      N      N      ...  ann@P
The vector is
b1@1
b2@2
b3@3
...
bn@P
Step II: Roll the (n/P)-element subvectors up by one
processor (the first wraps around to the bottom) and
multiply:
N      a12@1  N      N      ...
N      N      a23@2  N      ...
N      N      N      a34@3  ...
...    ...    ...    ...    ...
an1@P  N      N      N      ...
The vector is
b2@1
b3@2
b4@3
...
b1@P
Step III: Repeat Step II until all vector elements have
visited all processors (P-1 rolls in total). Done!
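The three steps can be sketched as a serial simulation in C. This is only an illustration under stated assumptions: method2_matvec is a hypothetical name, the P processors are modeled by a loop, the roll is modeled by shifting one buffer (a real code would send and receive subvectors between neighbors), and P is assumed to divide n.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Serial simulation of Method II. A is n*n row-major; processor p owns
 * rows p*m .. p*m+m-1 (m = n/P) and initially holds block p of B.
 * Returns 0 on success, -1 if the scratch buffers cannot be allocated. */
int method2_matvec(size_t n, size_t P,
                   const double *A, const double *B, double *C)
{
    assert(n > 0 && P > 0 && n % P == 0);     /* assumption: P divides n */
    size_t m = n / P;
    double *held = malloc(n * sizeof *held);  /* held[p*m ..]: subvector on p */
    double *tmp = malloc(m * sizeof *tmp);
    if (!held || !tmp) { free(held); free(tmp); return -1; }
    memcpy(held, B, n * sizeof *held);
    for (size_t i = 0; i < n; i++) C[i] = 0.0;

    for (size_t step = 0; step < P; step++) {
        /* Local multiply: processor p now holds block (p+step) mod P of B. */
        for (size_t p = 0; p < P; p++) {
            size_t blk = (p + step) % P;
            for (size_t i = 0; i < m; i++) {
                const double *row = A + (p * m + i) * n + blk * m;
                for (size_t j = 0; j < m; j++)
                    C[p * m + i] += row[j] * held[p * m + j];
            }
        }
        /* Roll up: processor p takes the subvector of processor p+1;
         * the first subvector wraps around to the last processor. */
        memcpy(tmp, held, m * sizeof *tmp);
        memmove(held, held + m, (n - m) * sizeof *held);
        memcpy(held + (n - m), tmp, m * sizeof *tmp);
    }
    free(held);
    free(tmp);
    return 0;
}
```

The loop performs P multiplies; the roll after the last multiply is not needed for the result and only returns each subvector to its original owner.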
Performance analysis
Each processor multiplies an (n/P)*n submatrix by an
n*1 vector. The cost is Tm(P,n) = c n * (n/P). The
communication time to roll up the subvectors is
proportional to n. Adding the final collection time,
the communication cost is d' n (d' = machine
constant, bigger than d).
Speedup:
S(P,n) = T(1,n)/T(P,n) = P/(1 + c1' P/n).
Remarks:
1. h(P,n) ~= P/n: One more example
2. memory is parallelized
3. this mapping is popular in applications
4. One defect: difficult to implement
(communications)
Method III: Column partition both the matrix and
vector.
a11@1  a12@2  ...  a1n@P
a21@1  a22@2  ...  a2n@P
a31@1  a32@2  ...  a3n@P
a41@1  a42@2  ...  a4n@P
a51@1  a52@2  ...  a5n@P
a61@1  a62@2  ...  a6n@P
...    ...    ...  ...
an1@1  an2@2  ...  ann@P
Vector is
b1@1
b2@2
...
bn@P
Each processor is given n/P columns of the matrix and
n/P rows of the vector.
Implementation steps
Step I: All P processors multiply their part of sub-row 1
of A by their part of B, then collect a global sum
(communication) to produce sub-row 1 of C.
Step II: All P processors multiply their part of sub-row 2
of A by their part of B, then collect a global sum
(communication) to produce sub-row 2 of C.
Step III: Repeat Step II until all sub-rows of A are
processed.
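A minimal serial sketch of the column partition in C follows. The names column_block_partial and method3_matvec are hypothetical, P is assumed to divide n, and the per-sub-row global sums described in the steps are collapsed into one accumulation loop over all rows; a distributed code would perform that sum with a reduction across processors.

```c
#include <assert.h>
#include <stddef.h>

/* Partial product computed by simulated processor `rank`: it owns columns
 * rank*m .. rank*m+m-1 of A (m = n/P) and the matching m entries of B,
 * and writes its contribution to every entry of C into partial[0..n-1]. */
void column_block_partial(size_t n, size_t P, size_t rank,
                          const double *A, const double *B, double *partial)
{
    assert(P > 0 && n % P == 0 && rank < P);  /* assumption: P divides n */
    size_t m = n / P;
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = rank * m; j < (rank + 1) * m; j++)
            sum += A[i * n + j] * B[j];
        partial[i] = sum;
    }
}

/* Serial stand-in for the global sum: adds the P partial vectors into C.
 * `scratch` must have room for n doubles. */
void method3_matvec(size_t n, size_t P, const double *A, const double *B,
                    double *C, double *scratch)
{
    for (size_t i = 0; i < n; i++) C[i] = 0.0;
    for (size_t rank = 0; rank < P; rank++) {
        column_block_partial(n, P, rank, A, B, scratch);
        for (size_t i = 0; i < n; i++)
            C[i] += scratch[i];               /* models the global sum */
    }
}
```

Each rank touches only its own n/P entries of B, so the vector is never replicated; the price is that every entry of C needs contributions from all P processors.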
Performance Analysis
Each processor multiplies an n*(n/P) submatrix by an
(n/P)*1 subvector. The cost is Tm(P,n) = c n * (n/P).
The communication time for the global sums is
proportional to n. Adding the final collection time,
the communication cost is d'' n (d'' = machine constant,
bigger than d).
Speedup:
S(P,n) = T(1,n)/T(P,n) = P/(1 + c1'' P/n).
Remarks:
1. h(P,n) ~= P/n: One more example
2. memory is parallelized
3. reasonably easy to implement (communications)
4. this mapping can be seen often in applications
Algorithm 14: Matrix-Matrix operations
The setup: multiply an n*n matrix A by another n*n
matrix B to produce C.
Matrix A:
a11  a12  ...  a1n
a21  a22  ...  a2n
...  ...  ...  ...
an1  an2  ...  ann
Matrix B:
b11  b12  ...  b1n
b21  b22  ...  b2n
...  ...  ...  ...
bn1  bn2  ...  bnn
One processor: assume enough memory to fit the
matrices.
for (i = 0; i < n; i++)             /* 0-based C indexing */
    for (j = 0; j < n; j++) {
        C[i][j] = 0.0;              /* clear the accumulator first */
        for (k = 0; k < n; k++) C[i][j] += A[i][k] * B[k][j];
    }
Complexity: T(1,n) = c n^3
where c is a constant characterizing the processor.
Method I: Row-column partition
Also known as the Ring Method.
The setup: multiply an n*n matrix A by another n*n
matrix B to produce C.
Initially,
Matrix A
a11@1  a12@1  ...  a1n@1
a21@2  a22@2  ...  a2n@2
...    ...    ...  ...
an1@P  an2@P  ...  ann@P
Matrix B
b11@1  b12@2  ...  b1n@P
b21@1  b22@2  ...  b2n@P
b31@1  b32@2  ...  b3n@P
b41@1  b42@2  ...  b4n@P
b51@1  b52@2  ...  b5n@P
b61@1  b62@2  ...  b6n@P
...    ...    ...  ...
bn1@1  bn2@2  ...  bnn@P
Remarks:
1. Row-column and column-row are symmetrical
2. Intrinsic 1D partition
Implementation Steps
Step I: Processor 1 multiplies sub-row 1 of A by
sub-column 1 of B to create c11. Processor 2 multiplies
sub-row 2 of A by sub-column 2 of B to create c22, and so
on, up to Processor P, which multiplies sub-row P of A by
sub-column P of B to create cPP. After Step I, the
diagonal sub-matrices of C have been created.
Step II: Roll up all sub-rows by one processor, and
multiply to produce c21, c32, ..., and c1P on Processors
1, 2, ..., P, respectively.
Matrix A
a21@1  a22@1  ...  a2n@1
a31@2  a32@2  ...  a3n@2
...    ...    ...  ...
a11@P  a12@P  ...  a1n@P
Matrix B
b11@1  b12@2  ...  b1n@P
b21@1  b22@2  ...  b2n@P
b31@1  b32@2  ...  b3n@P
b41@1  b42@2  ...  b4n@P
b51@1  b52@2  ...  b5n@P
b61@1  b62@2  ...  b6n@P
...    ...    ...  ...
bn1@1  bn2@2  ...  bnn@P
Step III: Roll up again and again until all sub-rows visit
all processors. Done!
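The ring method can be sketched as a serial simulation in C. The name ring_matmul is hypothetical, P is assumed to divide n, and the roll of sub-rows is modeled as an index shift rather than an actual transfer of data between processors.

```c
#include <assert.h>
#include <stddef.h>

/* Serial simulation of the ring method for C = A * B (all n*n, row-major).
 * Processor p owns sub-column p of B (columns p*m .. p*m+m-1, m = n/P).
 * At step s it holds sub-row (p + s) mod P of A, so rolling the sub-rows
 * up by one processor is modeled here as an index shift, not a data copy. */
void ring_matmul(size_t n, size_t P,
                 const double *A, const double *B, double *C)
{
    assert(P > 0 && n % P == 0);              /* assumption: P divides n */
    size_t m = n / P;
    for (size_t step = 0; step < P; step++) {
        for (size_t p = 0; p < P; p++) {      /* "in parallel" over p */
            size_t r = (p + step) % P;        /* sub-row of A held by p */
            for (size_t i = r * m; i < (r + 1) * m; i++)
                for (size_t j = p * m; j < (p + 1) * m; j++) {
                    double sum = 0.0;
                    for (size_t k = 0; k < n; k++)
                        sum += A[i * n + k] * B[k * n + j];
                    C[i * n + j] = sum;       /* entry of block c(r,p) */
                }
        }
    }
}
```

With step = 0 each processor produces its diagonal block, and with step = 1 processor 1 produces c21, processor 2 produces c32, and processor P produces c1P, matching Steps I and II.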
Performance analysis
One processor time:
T(1,n) = c n^3 tcomp
On P processors,
T(P,n) = Troll + Tcomp:
roll up time and submatrix multiplication time.
Troll = (P-1) * (n^2/P) * tcomm ~= n^2 tcomm
Tcomp ~= P^2 c [n/P]^3 tcomp.
Therefore,
T(P,n) = Troll + Tcomp
~= n^2 tcomm + P^2 c [n/P]^3 tcomp.
Speedup:
S(P,n) = P/(1 + [P/(c n)] * tcomm/tcomp)
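The speedup can be checked from the two time estimates. A short derivation, assuming roll-up and computation do not overlap and using the approximation for the roll-up time given in the analysis:

```latex
\begin{align*}
T(P,n) &\approx n^2\,t_{\mathrm{comm}} + P^2 c\left(\frac{n}{P}\right)^{3} t_{\mathrm{comp}}
        = n^2\,t_{\mathrm{comm}} + \frac{c\,n^3}{P}\,t_{\mathrm{comp}}, \\
S(P,n) &= \frac{T(1,n)}{T(P,n)}
        \approx \frac{c\,n^3\,t_{\mathrm{comp}}}{n^2\,t_{\mathrm{comm}} + c\,n^3\,t_{\mathrm{comp}}/P}
        = \frac{P}{1 + \dfrac{P}{c\,n}\,\dfrac{t_{\mathrm{comm}}}{t_{\mathrm{comp}}}}.
\end{align*}
```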
Performance Analysis Remarks:
1. overhead h(P,n) ~= P/n * tcomm/tcomp
2. universal law is proved again
3. only large problems benefit
4. method is easy to implement
5. quite often, natural way to decompose
6. memory efficient