CSC/ECE 506: Architecture of Parallel Computers
Problem Set 2
Due Friday, February 12, 2010
Problems 2, 3, and 4 will be graded; together they are worth 60 points. Note: You must do
all the problems, even the non-graded ones. If you skip any of them, half the points they are
worth will be subtracted from your score on the graded problems.
Problem 1. (15 points) [Solihin 3.5] Code analysis. For the code shown below,
…
for (i=1; i<=N; i++) {
    for (j=1; j<=i; j++) {   // note the index range
S1:     a[i][j] = b[i][j] + c[i][j];
S2:     b[i][j] = a[i+1][j-1] * b[i-1][j-1] * c[i-1][j];
S3:     c[i+1][j] = a[i][j];
    }
}
(a) Draw the iteration-space traversal graph (ITG).
(b) List all the dependences, and clearly indicate which dependence is loop-independent vs. loop-carried.
(c) Draw the loop-carried dependence graph (LDG).
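For reference (an illustrative aside, not part of the assigned code): the same pair of statements can carry both kinds of dependence at once. In the minimal C example below, S1 -> S2 through a[i] is loop-independent (producer and consumer run in the same iteration), while S1 -> S2 through a[i-1] is loop-carried with distance 1 (the value is produced one iteration earlier).

#include <stdio.h>
#define N 8

int main(void) {
    double a[N + 1], b[N + 1], c[N + 1];
    for (int i = 0; i <= N; i++) { a[i] = 0.0; b[i] = (double)i; }

    for (int i = 1; i <= N; i++) {
S1:     a[i] = b[i] + 1.0;        /* writes a[i] */
S2:     c[i] = a[i] + a[i - 1];   /* reads a[i] (same iteration) and a[i-1] (previous iteration) */
    }
    /* S1 -> S2 via a[i]   : true dependence, loop-independent.
       S1 -> S2 via a[i-1] : true dependence, loop-carried (distance 1). */
    printf("c[%d] = %.1f\n", N, c[N]);
    return 0;
}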
Problem 2. (15 points) [Solihin 3.6] Other parallelism.
For the code shown below:
…
for (i=1; i<=N; i++) {
    for (j=1; j<=i; j++) {   // note the index range
S1:     a[i][j] = b[i][j] + c[i][j];
S2:     b[i][j] = a[i-1][j-1];
S3:     c[i][j] = a[i][j];
S4:     d[i][j] = d[i][j-1] + 1;
    }
}
(a) Show the code that exploits function parallelism.
(b) If S4 is removed from the loop, exploit DOACROSS parallelism to the fullest, and show the code.
(c) If S4 is removed from the loop, exploit DOPIPE parallelism to the fullest, and show the code. (Generic skeletons of both patterns are sketched below for reference.)
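For reference, generic skeletons of the two patterns follow (a sketch only, not a solution to (b) or (c)). The post()/wait_for() pair below is a toy stand-in, implemented as a spin on a flag array, for whatever point-to-point synchronization primitive the course examples provide.

#include <stdio.h>
#define N 8

/* Toy point-to-point synchronization: post(i) marks element i ready,
   wait_for(i) spins until it is. Real code would use semaphores. */
static volatile int ready[N + 1];
static void post(int i)     { ready[i] = 1; }
static void wait_for(int i) { while (!ready[i]) { /* spin */ } }

static double a[N + 1], b[N + 1], c[N + 1];

/* DOACROSS: iterations start on different processors; only the
   loop-carried dependence on a[i-1] is serialized, so the independent
   tail of each iteration overlaps with its neighbors. */
void doacross_iteration(int i) {
    if (i > 1) wait_for(i - 1);   /* block only until a[i-1] exists */
    a[i] = a[i - 1] + b[i];       /* the carried dependence         */
    post(i);                      /* release iteration i+1          */
    c[i] = 2.0 * b[i];            /* independent work, fully overlapped */
}

/* DOPIPE: statements become pipeline stages on different processors;
   stage 2 trails stage 1 by one post/wait handshake per iteration. */
void dopipe_stage1(void) {
    for (int i = 1; i <= N; i++) { a[i] = a[i - 1] + b[i]; post(i); }
}
void dopipe_stage2(void) {
    for (int i = 1; i <= N; i++) { wait_for(i); c[i] = a[i]; }
}

int main(void) {
    for (int i = 0; i <= N; i++) { a[i] = 0.0; b[i] = 1.0; }
    dopipe_stage1();   /* run back-to-back here only to show the handshake computes */
    dopipe_stage2();
    printf("c[%d] = %.1f\n", N, c[N]);
    return 0;
}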
Problem 3. (30 points) [Solihin 3.7] Parallelizing linear transformation.
Consider a linear transformation algorithm to compute Y = A x B + C, where Y, A, B, and C have
dimensions n x p, n x m, m x p, and n x p, respectively. Assume that n, m, and p are divisible by
2. The algorithm is shown:
int i, j, k;
float A[n][m], B[m][p], Y[n][p], C[n][p], x;
…
// begin linear transformation
for (i=0; i<n; i++) {
    for (j=0; j<p; j++) {
        x = 0;
        for (k=0; k<m; k++)
            x = x + A[i][k] * B[k][j];
        Y[i][j] = x + C[i][j];
    }
}
As can be seen from the code, all the loops (for i, j, and k) are parallel.
(a) Indicate for each variable (i, j, k, A, B, C, Y, x) whether it should be declared a private or
shared variable when we parallelize:
i) for the i loop only.
ii) for the j loop only.
iii) for the k loop only.
(b) Parallelize the algorithm for the “for i” loop only, using the message passing programming
model on 2 processors. Use send(destination, data) and recv(source, data) as your primitives,
similarly to the example given in class.
(c) Without using OpenMP directives, parallelize the algorithm for the “for i” loop only, using the
shared memory programming model on 2 processors. Use “begin parallel” and “end parallel” to
mark your parallel region, and insert proper synchronization (e.g., locks and barriers). Remove
unnecessary serialization, if appropriate.
(d) Without using OpenMP directives, parallelize the algorithm for the “for k” loop only, using
the shared memory programming model on 2 processors. Use “begin parallel” and “end parallel”
to mark your parallel region, and insert proper synchronization (e.g., locks and barriers). Remove
unnecessary serialization, if appropriate.
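For reference, here is the general shape of the two styles in the problem's own pseudocode. This is a template only, not the solution to (b)-(d); pid, BARRIER, and the payloads handed to send/recv are placeholders for the conventions used in class.

/* Shared memory template, 2 processors: split the i loop by block. */
begin parallel
    pid <= this processor's id (0 or 1);
    for (i = pid*n/2; i < (pid+1)*n/2; i++) {
        ... body of the i loop, producing row i of Y ...
    }
    BARRIER(2);          /* both halves of Y must be complete before use */
end parallel

/* Message-passing template, 2 processors: processor 0 owns the data,
   ships half of it out, and collects the results. */
if (pid == 0) {
    send(1, the operands needed for rows n/2..n-1);
    ... compute rows 0..n/2-1 locally ...
    recv(1, the results for rows n/2..n-1);
} else {
    recv(0, operands);
    ... compute rows n/2..n-1 locally ...
    send(0, results);
}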
Problem 4. (15 points) Write the code for an average_columns function for the data-parallel
model of lecture 8.
The function should divide up the data so that each processor operates on columns of data. The
average of each column of data should be written into the corresponding element of the 1-dimensional array column_averages_array. The calling function is already written for you.
int nprocs;
int row_count;
int col_count;
double **A;
double *column_averages_array;

procedure main()
begin
    read(nprocs);
    read(row_count);
    read(col_count);
    A <= G_MALLOC( row_count * col_count );
    column_averages_array <= G_MALLOC( col_count );
    read(A);
    average_columns(A, column_averages_array, row_count, col_count);
    print_1d_array( "Column Averages:\n", column_averages_array );
end

procedure average_columns(A, col_avgs, row_count, col_count)
double **A, double* col_avgs, int row_count, int col_count;
begin
    *** your code here. ***
end
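For reference, decompositions in this pseudocode style usually compute each processor's slice of the data explicitly. The sketch below shows the idiom for a contiguous block of columns; it assumes pid is this processor's id (0..nprocs-1), does not reproduce the lecture-8 primitives, and is not the finished average_columns body.

/* Each processor owns a contiguous block of columns; the blocks are
   disjoint, so writes to col_avgs need no locks. */
mymin <= pid * (col_count / nprocs);
mymax <= mymin + (col_count / nprocs);   /* a full version must also absorb any remainder */
for j <= mymin to mymax - 1 do
    sum <= 0;
    for i <= 0 to row_count - 1 do
        sum <= sum + A[i][j];            /* walk down column j */
    endfor
    col_avgs[j] <= sum / row_count;
endfor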
Problem 5. (25 points) Gaussian elimination is a well-known technique for solving systems of
simultaneous linear equations. Variables are eliminated one by one until there is only one left, and
then the discovered values of variables are back-substituted to obtain the values of other
variables. In practice, the coefficients of the unknowns in the equation system are represented as
a matrix A, and the matrix is first converted to an upper-triangular matrix (a matrix in which all
elements below the main diagonal are 0). Then back-substitution is used. Let us focus on the
conversion to an upper-triangular matrix by successive variable elimination. Pseudocode for
sequential Gaussian elimination is shown below. The diagonal element for a particular iteration
of the k loop is called the pivot element, and its row is called the pivot row.
procedure Eliminate(A)                   /* triangularize the matrix A */
begin
    for k <= 0 to n-1 do                 /* loop over all diagonal (pivot) elements */
        for j <= k+1 to n-1 do           /* for all elts. in the row of, and to the right of, the pivot elt. */
            A[k,j] = A[k,j] / A[k,k];    /* divide by pivot element */
        endfor
        A[k,k] = 1;
        for i <= k+1 to n-1 do           /* for all rows below the pivot row */
            for j <= k+1 to n-1 do       /* for all elements in the row */
                A[i,j] = A[i,j] - A[i,k]*A[k,j];
            endfor
            A[i,k] = 0;
        endfor
    endfor
end procedure
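As a concreteness check, the pseudocode above translates directly to C (a sketch; the in-place n x n layout and the C99 variable-length-array parameter are assumptions):

/* Convert the n x n matrix A, in place, to upper-triangular form with
   unit pivots; a straight rendering of the Eliminate pseudocode. */
void eliminate(int n, double A[n][n]) {
    for (int k = 0; k < n; k++) {             /* pivot element A[k][k] */
        for (int j = k + 1; j < n; j++)
            A[k][j] = A[k][j] / A[k][k];      /* scale the pivot row   */
        A[k][k] = 1.0;
        for (int i = k + 1; i < n; i++) {     /* rows below the pivot  */
            for (int j = k + 1; j < n; j++)
                A[i][j] = A[i][j] - A[i][k] * A[k][j];
            A[i][k] = 0.0;
        }
    }
}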
(a) Draw a simple figure illustrating the dependences among matrix elements.
(b) Assuming a decomposition into rows and an assignment into blocks of contiguous rows, write
a shared address space parallel version using the primitives used for the equation solver in the
chapter.
(c) Write a message-passing version for the same decomposition and assignment, first using
synchronous message passing and then any form of asynchronous message passing.
(d) Can you see obvious performance problems with this partitioning?
(e) Modify both the shared address space and message-passing versions to use an interleaved
assignment of rows to processes. (The ownership tests for blocked vs. interleaved assignment are sketched after part (f).)
(f) Discuss the trade-offs (programming difficulty and any likely major performance differences) in
programming the shared address space and message-passing versions.
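For reference on part (e), the two assignments differ only in the ownership test applied to each row (a sketch; pid and nprocs are assumed to come from the surrounding parallel harness). Under a blocked assignment, a process owning the top rows falls idle once elimination passes them; interleaving keeps the shrinking active region spread over all processes.

/* Blocked: process pid owns rows pid*(n/nprocs) .. (pid+1)*(n/nprocs) - 1,
   so the ownership test in the i loop is  i / (n/nprocs) == pid.
   Interleaved (cyclic): row i belongs to process i mod nprocs. */
for i <= k+1 to n-1 do
    if (i mod nprocs == pid) then        /* interleaved ownership test */
        for j <= k+1 to n-1 do
            A[i,j] = A[i,j] - A[i,k]*A[k,j];
        endfor
        A[i,k] = 0;
    endif
endfor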