CSC/ECE 506: Architecture of Parallel Computers
Problem Set 2
Due Friday, February 12, 2010

Problems 2, 3, and 4 will be graded. There are 60 points on these problems.

Note: You must do all the problems, even the non-graded ones. If you do not do some of them, half as many points as they are worth will be subtracted from your score on graded problems.

Problem 1. (15 points) [Solihin 3.5] Code analysis. For the code shown below, …

    for (i=1; i<=N; i++) {
      for (j=1; j<=i; j++) {   // note the index range
        S1: a[i][j] = b[i][j] + c[i][j];
        S2: b[i][j] = a[i+1][j-1] * b[i-1][j-1] * c[i-1][j];
        S3: c[i+1][j] = a[i][j];
      }
    }

(a) Draw the iteration-space traversal graph (ITG).
(b) List all the dependences, and clearly indicate which dependences are loop-independent vs. loop-carried.
(c) Draw the loop-carried dependence graph (LDG).

Problem 2. (15 points) [Solihin 3.6] Other parallelism. For the code shown below:

    for (i=1; i<=N; i++) {
      for (j=1; j<=i; j++) {   // note the index range
        S1: a[i][j] = b[i][j] + c[i][j];
        S2: b[i][j] = a[i-1][j-1];
        S3: c[i][j] = a[i][j];
        S4: d[i][j] = d[i][j-1] + 1;
      }
    }

(a) Show the code that exploits function parallelism.
(b) If S4 is removed from the loop, exploit DOACROSS parallelism to the fullest, and show the code.
(c) If S4 is removed from the loop, exploit DOPIPE parallelism to the fullest, and show the code.

Problem 3. (30 points) [Solihin 3.7] Parallelizing linear transformation. Consider a linear transformation algorithm to compute Y = A x B + C, where Y, A, B, and C have dimensions n x p, n x m, m x p, and n x p, respectively. Assume that n, m, and p are divisible by 2. The algorithm is shown:

    int i, j, k;
    float A[n][m], B[m][p], Y[n][p], C[n][p], x;
    …
    // begin linear transformation
    for (i=0; i<n; i++) {
      for (j=0; j<p; j++) {
        x = 0;
        for (k=0; k<m; k++)
          x = x + A[i][k] * B[k][j];
        Y[i][j] = x + C[i][j];
      }
    }

As can be seen from the code, all the loops (for i, j, and k) are parallel.
(a) Indicate for each variable (i, j, k, A, B, C, Y, x) whether it should be declared a private or a shared variable when we parallelize:
    i) the i loop only.
    ii) the j loop only.
    iii) the k loop only.
(b) Parallelize the algorithm for the "for i" loop only, using the message-passing programming model on 2 processors. Use send(destination, data) and recv(source, data) as your primitives, similarly to the example given in class.
(c) Without using OpenMP directives, parallelize the algorithm for the "for i" loop only, using the shared memory programming model on 2 processors. Use "begin parallel" and "end parallel" to mark your parallel region, and insert proper synchronization (e.g., locks and barriers). Remove unnecessary serialization, if appropriate.
(d) Without using OpenMP directives, parallelize the algorithm for the "for k" loop only, using the shared memory programming model on 2 processors. Use "begin parallel" and "end parallel" to mark your parallel region, and insert proper synchronization (e.g., locks and barriers). Remove unnecessary serialization, if appropriate.

Problem 4. (15 points) Write the code for an average_columns function for the data-parallel model of Lecture 8. The function should divide up the data so that each processor operates on columns of data. The average of each column of data should be written into the corresponding element of the 1-dimensional array column_averages_array. The calling function is already written for you.
    int nprocs;
    int row_count;
    int col_count;
    double **A;
    double *column_averages_array;

    procedure main()
    begin
      read(nprocs);
      read(row_count);
      read(col_count);
      A <= G_MALLOC( row_count * col_count );
      column_averages_array <= G_MALLOC( col_count );
      read(A);
      average_columns(A, column_averages_array, row_count, col_count);
      print_1d_array( "Column Averages:\n", column_averages_array );
    end

    procedure average_columns(A, col_avgs, row_count, col_count)
    double **A, double *col_avgs, int row_count, int col_count;
    begin
      *** your code here. ***
    end

Problem 5. (25 points) Gaussian elimination is a well-known technique for solving systems of simultaneous linear equations. Variables are eliminated one by one until only one is left, and then the discovered values of the variables are back-substituted to obtain the values of the other variables. In practice, the coefficients of the unknowns in the equation system are represented as a matrix A, and the matrix is first converted to an upper-triangular matrix (a matrix in which all elements below the main diagonal are 0). Then back-substitution is used. Let us focus on the conversion to an upper-triangular matrix by successive variable elimination. Pseudocode for sequential Gaussian elimination is shown below. The diagonal element for a particular iteration of the k loop is called the pivot element, and its row is called the pivot row.

    procedure Eliminate(A)            /* triangularize the matrix A */
    begin
      for k <= 0 to n-1 do            /* loop over all diagonal (pivot) elements */
        for j <= k+1 to n-1 do        /* for all elts. in the row of, and to the right of, the pivot elt. */
          A[k,j] = A[k,j] / A[k,k];   /* divide by pivot element */
        endfor
        A[k,k] = 1;
        for i <= k+1 to n-1 do        /* for all rows below the pivot row */
          for j <= k+1 to n-1 do      /* for all elements in the row */
            A[i,j] = A[i,j] - A[i,k] * A[k,j];
          endfor
          A[i,k] = 0;
        endfor
      endfor
    end procedure

(a) Draw a simple figure illustrating the dependences among matrix elements.
(b) Assuming a decomposition into rows and an assignment into blocks of contiguous rows, write a shared address space parallel version using the primitives used for the equation solver in the chapter.
(c) Write a message-passing version for the same decomposition and assignment, first using synchronous message passing and then any form of asynchronous message passing.
(d) Can you see obvious performance problems with this partitioning?
(e) Modify both the shared address space and message-passing versions to use an interleaved assignment of rows to processes.
(f) Discuss the trade-offs (programming difficulty and any likely major performance differences) in programming the shared address space and message-passing versions.