COURSE: ACOE401
DATE:
TIME: 2 ½ Hours

INSTRUCTIONS TO CANDIDATES:
Answer any Four (4) questions only.
Students are allowed to use any type of non-programmable calculator.

QUESTION 1:

(a) After analyzing the execution of a program it was found that 10% of the execution time was spent on code segments that cannot be parallelized, while 90% of the execution time was spent on code segments that can be parallelized. Using an execution timeline diagram based on Amdahl's law determine: (9 marks)

    (i) The maximum speedup that can be achieved for this program on a 3-node symmetric multiprocessor.
        (Speedup = 100/(10 + 90/3) = 2.5)

    (ii) The maximum speedup that can be achieved for this program on a 6-node symmetric multiprocessor.
        (Speedup = 100/(10 + 90/6) = 4)

    (iii) The maximum speedup that can be achieved for this program on an infinite-node symmetric multiprocessor.
        (Speedup = 100/(10 + 0) = 10)

(b) Figure Q1 below shows the source code and the execution times for a program tested on an 8-node shared memory system.

    (i) Explain why the parallel execution time with one thread is greater than the sequential execution time. (4 marks)
        (The overhead of creating and managing the parallel threads is added to the computation time, and with a single thread there is no parallel gain to offset it.)

    (ii) Calculate the speedup compared to the sequential time when the number of threads is 8. (2 marks)
        (Speedup = 300/150 = 2)

    (iii) Explain two reasons for the low performance achieved. For each reason suggest a change in the code that will improve the performance. (10 marks)

    Reason 1: Every thread writes to its own element 'count[id]' of the shared array in each loop iteration. Adjacent elements of 'count' share a cache block, so each write invalidates that block in the other caches (false sharing). This problem can be solved by array padding, i.e. using 'count[id*8]' (with the array enlarged accordingly) so that each thread writes to a different cache block.

    Reason 2: The number of iterations in the inner loop is 'i'. Thread 0 executes outer iterations 0 to 1249, while thread 7 executes outer iterations 8750 to 9999.
    Thus thread 7 does much more work than thread 0, resulting in load imbalance. The same applies, to a lesser degree, to all other threads. This problem can be solved by changing the way loop iterations are assigned to threads (e.g. assign to thread 0 iterations 0..100, 1000..1100, 2000..2100, etc.).

    Reason 3: The critical section for 'sum' is executed in each loop iteration, resulting in excessive thread waits. This problem can be solved by promoting 'sum' to a padded array with one element per thread, and using the critical section only once per thread, outside the loops.

    (Give any two of the above reasons.)

    int nthrds, count[8], sum=0;
    #pragma omp parallel num_threads(8)
    {
        int i, k;
        int id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        int first = id * (10000/nthrds);
        int last = first + 10000/nthrds;
        count[id] = 0;
        for (i=first; i<last; i++) {
            for (k=0; k<i; k++) {
                if (list[i][k] > 0) count[id]++;
                if (list[i][k] != i*k) count[id]--;
                #pragma omp critical
                sum += list[i][k];
            }
        }
    }

    Number of Threads    Execution Time
    Sequential           300 ms
    1                    310 ms
    2                    215 ms
    4                    200 ms
    8                    150 ms

    Figure Q1

QUESTION 2:

(a) Specify whether the following are valid states for the MESI protocol. (9 marks)

    Cache1   Cache2   Cache3   Mem   Valid   Justification
    M 6      I 6      I 4      4     Yes     ----
    M 3      I 2      E 3      3     No      Cannot have both M and E
    I 2      I 4      S 2      2     No      There is only one S
    S 5      I 0      S 3      5     No      C3 must be the same as C1 and Mem

(b) A three-processor shared memory multiprocessor employs the MESI protocol to ensure cache coherence. Fill in the tables below to show the state of the caches, the memory and any bus activity during the execution of the listed CPU operations. Use the notation (M←C1 or C2←M or none) to show bus activity.
(Use the table attached.) (16 marks)

                              Cache 1     Cache 2     Cache 3     Memory   Shared Bus
    CPU  Operation            State  X    State  X    State  X    X        Activity
    ---  Initial State →      E      2    I      3    I      9    2        --
    2    Load(reg,X)          S      2    S      2    I      9    2        C2←M
    2    Store(X,3)           I      2    E      3    I      9    3        M←C2

    ---  Initial State →      E      3    I      2    I      4    3        --
    1    Store(X,1)           M      1    I      2    I      4    3        --
    2    Load(reg,X)          S      1    S      1    I      4    1        M←C1, C2←M

    ---  Initial State →      I      6    S      0    S      0    0        --
    3    Store(X,4)           I      6    I      0    E      4    4        M←C3
    3    Store(X,6)           I      6    I      0    M      6    4        --

    ---  Initial State →      S      4    S      4    I      0    4        --
    3    Store(X,4)           I      4    I      4    E      4    4        C3←M, M←C3
    1    Load(reg,X)          S      4    I      4    S      4    4        C1←M

    [Figure: MESI state transition diagram with the states Invalid, Shared, Exclusive and Modified. The transitions are labelled PR, PW, BR and BW, with the conditions /S and /~S on PR and PW.]

QUESTION 3:

(a) A 1024x1024 array, called invals[], contains integers. An error detection algorithm requires that the row checksum (summation of all values in each row) is stored in the array cs_rows[], and the column checksum (summation of all values in each column) is stored in the array cs_cols[], as shown below. Write a program for a shared memory system that implements the above algorithm using OpenMP with the most appropriate worksharing constructs.
(15 marks)

    invals                 cs_rows
    16   24   25   30      95      (16+24+25+30 = 95)
    13   34   14   16      77
    30   14   12   10      66
    14   27   17   18      76

    73   99   68   74      cs_cols

Useful OpenMP functions and directives:

    void omp_set_num_threads(int num_threads)
    int omp_get_num_threads()
    int omp_get_max_threads()
    int omp_get_thread_num()
    int omp_get_num_procs()

    #pragma omp critical [name]
    #pragma omp barrier
    #pragma omp atomic
    #pragma omp parallel [if (scalar_expression), private (list), shared (list), default (shared | none), firstprivate (list), reduction (operator: list), copyin (list)]
    #pragma omp for [schedule (type [,chunk]), ordered, private (list), firstprivate (list), lastprivate (list), shared (list), reduction (op: list)]
        Op = [+, -, *, min, max, &, ^, |, &&, ||]
    #pragma omp sections [private (list), firstprivate (list), lastprivate (list), reduction (operator: list), nowait]
    #pragma omp single [private (list), firstprivate (list), nowait]

(b) For the OpenMP program shown below, explain four reasons that could lead to wrong results when running the program with 6 threads. For each reason suggest a change in the code that will correct the result. (10 marks)

    Reason 1: 'T' is a shared variable written by all threads at any time, which causes errors due to data races. 'T' should be made private to each thread (for example with a private(T) clause, or by declaring it inside the parallel region).

    Reason 2: 'area' is a shared variable written by all threads at any time, which causes errors due to data races. This problem can be solved by promoting 'area' to an array with one element per thread and adding the partial sums after the loop, or by using a reduction(+:area) clause.

    Reason 3: Each thread executes 10000/6 = 1666.7 (i.e. 1666) loop iterations. Thus only 9996 iterations are executed, and the last 4 are never executed. This problem can be solved by assigning the remaining iterations to the last thread.

    Reason 4: The master thread calculates 'res' after completing its own iterations, without waiting for the other threads to complete theirs, so the value of 'area' it uses might not be the final one. This problem can be solved by inserting a barrier directive before the master construct.
    Sequential Code:

    double x=2.7;
    double step = x/(1.08);
    double area=0.0;
    double T=0.0;
    double res;
    for (int i=0; i<10000; i++) {
        T=(i+0.5)*step;
        area+=(sin(T)*sin(T));
    }
    res=sqrt((2*area*step)/x);

    OpenMP Code:

    double x=2.7;
    int nthrds;
    double step = x/(1.08);
    double area=0.0;
    double T=0.0;
    double res;
    #pragma omp parallel num_threads(6)
    {
        int id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        int mystart = id * (10000/nthrds);
        int myend = mystart + 10000/nthrds;
        for (int i=mystart; i<myend; i++) {
            T=(i+0.5)*step;
            area+=(sin(T)*sin(T));
        }
        #pragma omp master
        res=sqrt((2*area*step)/x);
    }

QUESTION 4:

The following code runs on a multiprocessor system, where Thread 1 and Thread 2 run on different processors. X, Y, A1, A2, F1 and F2 are shared memory values, while r1 and r2 are register values (all initialized to 0).

    Thread 1                        Thread 2
    1a: Move r1,1                   2a: Move r1,2
    1b: Store (A1),r1               2b: Store (A2),r1
    1c: Store (F1),r1               2c: Store (F2),r1
    1d: If (F2 == 0) goto 1d        2d: If (F1 == 0) goto 2d
    1e: Load r2,(A2)                2e: Load r2,(A1)
    1f: Store (X),r2                2f: Store (Y),r2
    1g: Store (Y),r1                2g: Store (X),r1

(a) Write down a sequence of instruction execution that will store in X and Y the values 2, 1 respectively, after the execution of both threads. Assume sequential consistency. (5 marks)

    Instr:   1a    1b    1c    2a    2b    2c    1d  1e    1f   2d  2e    2f   1g   2g
    Effect:  r1=1  A1=1  F1=1  r1=2  A2=2  F2=2  T   r2=2  X=2  T   r2=1  Y=1  Y=1  X=2

(b) Write down a sequence of instruction execution that will store in X and Y the values 0, 1 respectively, after the execution of both threads. Assume that the processors employ out-of-order execution with NO speculative execution. (5 marks)

    Instr:   2a    2c    1a    1b    1c    1d  1e    2b    2d  2e    2g   1g   2f   1f
    Effect:  r1=2  F2=2  r1=1  A1=1  F1=1  T   r2=0  A2=2  T   r2=1  X=2  Y=1  Y=1  X=0

(c) Write down a sequence of instruction execution that will store in X and Y the values 0, 0 respectively, after the execution of both threads. Assume that the processors employ out-of-order execution with speculative execution.
(5 marks)

    Instr:   1e    2e    1a    1b    1c    1d  2a    2b    2c    2d  2g   1g   2f   1f
    Effect:  r2=0  r2=0  r1=1  A1=1  F1=1  T   r1=2  A2=2  F2=2  T   X=2  Y=1  Y=0  X=0

(d) If the processors employ out-of-order execution with no speculative execution, specify whether the following instruction sequences are valid. Justify your answer. (5 marks)

    (i) 1a 1c 1b 2a 2b 2c 2d 2e 2f 1d 1e 1g 1f 2g
        Valid. The reordered instructions (1c before 1b, 1g before 1f) are independent of each other, and no instruction is executed ahead of its branch.

    (ii) 1a 1b 1c 2a 2c 2b 2d 1g 2e 2f 1d 1e 1f 2g
        Invalid. 1g is executed before the branch 1d, which would require speculative execution.

(e) If the processors employ out-of-order execution with speculative execution, specify whether the following instruction sequences are valid. Justify your answer. (5 marks)

    (i) 1a 1c 1b 2a 2b 2c 2g 2d 2e 2f 1d 1f 1e 1g
        Invalid. 1f is executed before 1e, but 1f stores the value of r2 that 1e loads.

    (ii) 1a 1c 1b 2a 2b 2c 2f 2d 2e 2g 1d 1f 1e 1g
        Invalid. 2f is executed before 2e and 1f is executed before 1e; in both cases the store uses the value of r2 produced by the load.

QUESTION 5:

(a) Rewrite the following MPI code using the most appropriate collective communication commands. (15 marks)

    int main (int argc, char *argv[]) {
        int myrank, size, sum=0, i, temp, N=10000, arr[N];
        MPI_Status status;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (myrank == 0) {
            get_data(filename, arr);
            /* Scatter */
            for (i=1; i<size; i++)
                MPI_Send(arr+i*(N/size), N/size, MPI_INT, i, i, MPI_COMM_WORLD);
            for (i=0; i<N/size; i++)
                sum = sum + arr[i];
            /* Reduce (sum) */
            for (i=1; i<size; i++) {
                MPI_Recv(&temp, 1, MPI_INT, i, i, MPI_COMM_WORLD, &status);
                sum = sum + temp;
            }
            /* Broadcast */
            for (i=1; i<size; i++)
                MPI_Send(&sum, 1, MPI_INT, i, i+1, MPI_COMM_WORLD);
            for (i=0; i<N/size; i++)
                arr[i] = sum/arr[i];
            /* Gather */
            for (i=1; i<size; i++) {
                MPI_Recv(arr+i*(N/size), N/size, MPI_INT, i, i+1, MPI_COMM_WORLD, &status);
            }
        }
        else {
            MPI_Recv(arr, N/size, MPI_INT, 0, myrank, MPI_COMM_WORLD, &status);
            for (i=0; i<N/size; i++)
                sum = sum + arr[i];
            MPI_Send(&sum, 1, MPI_INT, 0, myrank, MPI_COMM_WORLD);
            MPI_Recv(&sum, 1, MPI_INT, 0, myrank+1, MPI_COMM_WORLD, &status);
            for (i=0; i<N/size; i++)
                arr[i] = sum/arr[i];
            MPI_Send(arr, N/size, MPI_INT, 0, myrank+1, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }

Useful MPI functions:

    MPI_Init(&argc, &argv)
    MPI_Finalize()
    MPI_Comm_rank(comm, p_id)
    MPI_Comm_size(comm, size)
    MPI_Send(s_msg, s_count,
        datatype, dest, tag, comm)
    MPI_Recv(r_msg, r_count, datatype, srce, tag, comm, status)
    MPI_Bcast(message, count, datatype, root, comm)
    MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
    MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
    MPI_Reduce(operand, result, count, datatype, operation, root, comm)
        Where Operation = [MPI_SUM, MPI_MAX, MPI_MIN, MPI_PROD, MPI_LAND, MPI_LOR, MPI_LXOR, MPI_BAND, MPI_BOR, MPI_BXOR, MPI_MAXLOC, MPI_MINLOC]

(b) The figure below shows the timelines of the execution of a program on a four-node message passing system. Calculate the speedup compared to the single-node execution, the efficiency, and the utilization factor for each node. (10 marks)

    [Figure: execution timelines of the four nodes over a time axis from 0 to 56. Legend: CPU Computation Time, CPU Communication Overhead, Communication Delay, CPU Parallelization Overheads, Global Synchronization, CPU Idle Time.]

    Computation 1: 23+4+5 = 32
    Computation 2: 19+4 = 23
    Computation 3: 20+4 = 24
    Computation 4: 18+4 = 22
    Computation total: 32+23+24+22 = 101

    Speedup = 101/56 = 1.8
    Efficiency = (1.8/4) x 100 = 45%

    UF1 = (32/56) x 100 = 57%
    UF2 = (23/56) x 100 = 41%
    UF3 = (24/56) x 100 = 43%
    UF4 = (22/56) x 100 = 39%