ACOE401-FinalSample-1

COURSE: ACOE401
DATE:
TIME: 2 ½ Hours
INSTRUCTIONS TO CANDIDATES:
Answer any Four (4) questions only.
Students are allowed to use any type of non-programmable calculator.
QUESTION 1:
(a) After analyzing the execution of a program, it was found that 10% of the execution time
was spent on code segments that cannot be parallelized, while 90% of the execution
time was spent on code segments that can be parallelized. Using an execution timeline
diagram based on Amdahl's law, determine:
(9 marks)
(i) The maximum speedup that can be achieved for this program on a 3-node
symmetric multiprocessor. (Speedup = 100/(10+90/3) = 2.5)
(ii) The maximum speedup that can be achieved for this program on a 6-node
symmetric multiprocessor. (Speedup = 100/(10+90/6) = 4)
(iii) The maximum speedup that can be achieved for this program on an infinite-node
symmetric multiprocessor. (Speedup = 100/(10+0) = 10)
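The three answers above all apply the same formula, speedup = 1 / (s + (1 - s)/n), where s is the serial fraction (0.10 here) and n is the number of nodes. A minimal sketch in C follows; the function names amdahl_speedup and amdahl_limit are illustrative and not part of the exam.

```c
/* Amdahl's law: speedup on n processors when a fraction f_serial
 * of the execution time cannot be parallelized. */
double amdahl_speedup(double f_serial, int n)
{
    return 1.0 / (f_serial + (1.0 - f_serial) / n);
}

/* Upper bound as n grows without limit: the parallel part takes
 * no time, so only the serial fraction remains. */
double amdahl_limit(double f_serial)
{
    return 1.0 / f_serial;
}
```

Calling amdahl_speedup(0.10, 3) and amdahl_speedup(0.10, 6) reproduces parts (i) and (ii), and amdahl_limit(0.10) reproduces part (iii).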
(b) Figure Q1 below shows the source code and the execution times for a program tested
on an 8-node shared memory system.
(i) Explain why the parallel execution time with one thread is greater than the
sequential execution time. (The time needed to create the parallel threads is high
compared to the actual computation time.)
(4 marks)
(ii) Calculate the speedup compared to the sequential time when the number of threads
is 8. (Speedup = 300/150 = 2)
(2 marks)
(iii) Explain two reasons for the low performance achieved. For each reason suggest a
change in the code that will improve the performance.
(10 marks)
Reason 1: All threads write to the shared array element ‘count[id]’ in each loop iteration. Adjacent
elements lie in the same cache block, so each write invalidates the other threads' copies (false sharing).
This problem can be solved by array padding: assuming a cache block holds 8 integers, enlarging the
array and using ‘count[id*8]’ forces each thread to write to a different cache block.
Reason 2: The number of iterations in the inner loop is ‘i’. Thread 0 does iterations 0 to
1249, while thread 7 does iterations 8750 to 9999. Thus thread 7 does much more
work than thread 0, resulting in load imbalance. The same applies to the other threads. This
problem can be solved by changing the way loop iterations are assigned to threads (e.g. assign
to thread 0 iterations 0..100, 1000..1100, 2000..2100, etc.)
Reason 3: The critical section for ‘sum’ is executed in each loop iteration, resulting in
excessive thread waits. This problem can be solved by promoting ‘sum’ to a padded array
and using the critical section outside the loop.
(Give any two of the above reasons.)
int nthrds, count[8], sum = 0;
#pragma omp parallel num_threads(8)
{
    int i, k;
    int id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    int first = id * (10000 / nthrds);
    int last = first + 10000 / nthrds;
    count[id] = 0;
    for (i = first; i < last; i++) {
        for (k = 0; k < i; k++) {
            if (list[i][k] > 0) count[id]++;
            if (list[i][k] != i * k) count[id]--;
            #pragma omp critical
            sum += list[i][k];
        }
    }
}
Number of Threads    Execution Time
Sequential           300 ms
1                    310 ms
2                    215 ms
4                    200 ms
8                    150 ms
Figure Q1
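One possible way to address all three reasons at once is sketched below. It does not use the padding and manual interleaving named in the model answer. Instead it swaps in standard OpenMP clauses with the same aims: a reduction gives each thread private copies of the counters (no false sharing, no critical section), and a dynamic schedule hands out rows in small chunks (better load balance). The function name count_and_sum, the flat row-major layout of list and the chunk size of 64 are assumptions for illustration. The sketch returns a single total count rather than one count per thread.

```c
/* Sketch: same triangular loop as Figure Q1 over an n x n array
 * stored row-major in 'list'. Compiles with or without OpenMP;
 * without it the pragma is ignored and the loop runs serially. */
void count_and_sum(int n, const int *list, long *count_out, long *sum_out)
{
    long count = 0, sum = 0;

    /* reduction: private per-thread count/sum, combined at the end.
     * schedule(dynamic, 64): rows are handed out in chunks of 64,
     * so threads that get short rows come back for more work. */
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:count, sum)
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < i; k++) {
            int v = list[(long)i * n + k];
            if (v > 0) count++;
            if (v != i * k) count--;
            sum += v;
        }
    }

    *count_out = count;
    *sum_out = sum;
}
```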
QUESTION 2:
(a) Specify whether the following are valid states for the MESI protocol.
(9 marks)
Initial State
Cache1   Cache2   Cache3   Mem   Valid   Justification
M 6      I 6      I 4      4     Yes     ----
M 3      I 2      E 3      3     No      Cannot have both M and E
I 2      I 4      S 2      2     No      There is only one S
S 5      I 0      S 3      5     No      C3 must be the same as C1 and Mem.
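The justifications above can be collected into a small checker. This is only a sketch of the rules as the answer key applies them; the type CacheLine and the function mesi_snapshot_valid are hypothetical names. In particular, rejecting a lone S copy follows the answer key; some real implementations may leave a single S copy behind after another sharer drops its line.

```c
#include <stdbool.h>

/* One cache's view of X: state is 'M', 'E', 'S' or 'I'. */
typedef struct {
    char state;
    int  value;
} CacheLine;

/* Returns true if the combination of cache states and values is a
 * valid MESI snapshot under the rules used in the answer key. */
bool mesi_snapshot_valid(const CacheLine c[], int n, int mem)
{
    int owners = 0;   /* caches in M or E */
    int sharers = 0;  /* caches in S */

    for (int i = 0; i < n; i++) {
        if (c[i].state == 'M' || c[i].state == 'E') owners++;
        else if (c[i].state == 'S') sharers++;
    }

    if (owners > 1) return false;                 /* e.g. M together with E */
    if (owners == 1 && sharers > 0) return false; /* M/E excludes any S copy */
    if (sharers == 1) return false;               /* answer key: S needs a second sharer */

    /* Clean copies (E or S) must hold the same value as memory;
     * an M copy may differ and I copies are ignored. */
    for (int i = 0; i < n; i++) {
        if ((c[i].state == 'E' || c[i].state == 'S') && c[i].value != mem)
            return false;
    }
    return true;
}
```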
(b) A three-processor shared memory multiprocessor employs the MESI protocol to ensure
cache coherence. Fill in the tables below to show the state of the caches, the memory
and any bus activity during the execution of the listed CPU operations. Use the notation
(M←C1 or C2←M or none) to show bus activity. (Use the table attached.)
(16 marks)
CPU  Operation          Cache 1    Cache 2    Cache 3    Memory   Bus
                        State X    State X    State X    X        Activity
     Initial State →    E     2    I     3    I     9    2        --
2    Load(reg,X)        S     2    S     2    I     9    2        C2←M
2    Store(X,3)         I     2    E     3    I     9    3        M←C2

CPU  Operation          Cache 1    Cache 2    Cache 3    Memory   Bus
                        State X    State X    State X    X        Activity
     Initial State →    E     3    I     2    I     4    3        --
1    Store(X,1)         M     1    I     2    I     4    3
2    Load(reg,X)        S     1    S     1    I     4    1        M←C1, C2←M

CPU  Operation          Cache 1    Cache 2    Cache 3    Memory   Bus
                        State X    State X    State X    X        Activity
     Initial State →    I     6    S     0    S     0    0        --
3    Store(X,4)         I     6    I     0    E     4    4        M←C3
3    Store(X,6)         I     6    I     0    M     6    4        --

CPU  Operation          Cache 1    Cache 2    Cache 3    Memory   Bus
                        State X    State X    State X    X        Activity
     Initial State →    S     4    S     4    I     0    4        --
3    Store(X,4)         I     4    I     4    E     4    4        C3←M, M←C3
1    Load(reg,X)        S     4    I     4    S     4    4        C1←M
[Figure: MESI state transition diagram with the states Invalid, Shared, Exclusive and Modified. Transition labels: PR, PW, PR+PW, PR/S, PR/~S, PW/S, PW/~S, BR, BW, BR+BW, PR+BR.]
QUESTION 3:
(a) A 1024×1024 array, called invals[], contains integers. An error detection algorithm
requires that the row checksum (the sum of all values in each row) is stored in the
array cs_rows[], and the column checksum (the sum of all values in each column) is
stored in the array cs_cols[], as shown below. Write a program for a shared memory
system to implement the above algorithm using OpenMP with the most appropriate work-sharing constructs.
(15 marks)
invals               cs_rows
16  24  25  30       95    (16+24+25+30 = 95)
13  34  14  16       77
30  14  12  10       66
14  27  17  18       76

73  99  68  74       cs_cols
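A minimal sketch of one possible answer follows, assuming the array is stored row-major in a flat buffer and using the hypothetical function name checksums. Each row sum and each column sum is written by exactly one loop iteration, so two `omp for` loops are enough and no critical section or atomic is needed. This is one reasonable choice of work-sharing construct, not necessarily the one the marking scheme expects.

```c
/* Sketch: row and column checksums of an n x n integer array stored
 * row-major in 'invals'. For the exam question n would be 1024.
 * Without OpenMP the pragmas are ignored and the code runs serially. */
void checksums(int n, const int *invals, long *cs_rows, long *cs_cols)
{
    #pragma omp parallel
    {
        /* Rows are divided among the threads; each iteration owns
         * cs_rows[r]. nowait: the column loop writes a different
         * array, so threads need not wait here. */
        #pragma omp for nowait
        for (int r = 0; r < n; r++) {
            long s = 0;
            for (int c = 0; c < n; c++)
                s += invals[(long)r * n + c];
            cs_rows[r] = s;
        }

        /* Columns are divided among the threads; each iteration
         * owns cs_cols[c]. */
        #pragma omp for
        for (int c = 0; c < n; c++) {
            long s = 0;
            for (int r = 0; r < n; r++)
                s += invals[(long)r * n + c];
            cs_cols[c] = s;
        }
    }
}
```

The column loop walks memory with a stride of n, which is less cache-friendly than the row loop; it is kept in this form here because it makes the ownership of each output element obvious.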
Useful OpenMP functions and directives:
• void omp_set_num_threads(int num_threads)
• int omp_get_num_threads()
• int omp_get_max_threads()
• int omp_get_thread_num()
• int omp_get_num_procs()
• #pragma omp critical [name]
• #pragma omp barrier
• #pragma omp atomic
• #pragma omp parallel [if (scalar_expression), private (list), shared (list), default
  (shared | none), firstprivate (list), reduction (operator: list), copyin (list)]
• #pragma omp for [schedule (type [,chunk]), ordered, private (list), firstprivate (list),
  lastprivate (list), shared (list), reduction (op: list)] op = [+, -, *, min, max, &, ^, |, &&, ||]
• #pragma omp sections [private (list), firstprivate (list), lastprivate (list), reduction
  (operator: list), nowait]
• #pragma omp single [private (list), firstprivate (list), nowait]
(b) For the OpenMP program shown below, explain four reasons that could lead to wrong
results when running the program with 6 threads. For each reason suggest a change in
the code that will correct the result.
(10 marks)
Reason 1: ‘T’ is a shared variable written by all threads at any time, which causes errors due to data
races. ‘T’ should be made private to each thread (e.g. declared inside the parallel region).
Reason 2: ‘area’ is a shared variable written by all threads at any time, which causes errors
due to data races. This problem can be solved by promoting ‘area’ to an array with one element
per thread, summed once all threads have finished.
Reason 3: Each thread executes 10000/6 = 1666.67, truncated to 1666, loop iterations. Thus only 9996
iterations are executed and the last 4 are skipped. This problem can be solved by assigning
the remaining iterations to the last thread.
Reason 4: The master thread calculates ‘res’ after completing its own iterations, without
waiting for the other threads to complete theirs, so the value of ‘area’ it uses might
not be the final one. This problem can be solved by inserting a barrier before the
master construct.
double x=2.7;
double step = x/(1.08);
double area=0.0;
double T=0.0;
double res;
for (int i=0; i< 10000; i++)
{
T=(i+0.5)*step;
area+=(sin(T)*sin(T));
}
res=sqrt((2*area*step)/x);
Sequential Code
double x = 2.7; int nthrds; double step = x/(1.08);
double area = 0.0; double T = 0.0; double res;
#pragma omp parallel num_threads(6)
{
    int id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    int mystart = id * (10000 / nthrds);
    int myend = mystart + 10000 / nthrds;
    for (int i = mystart; i < myend; i++)
    {
        T = (i + 0.5) * step;
        area += (sin(T) * sin(T));
    }
    #pragma omp master
    res = sqrt((2 * area * step) / x);
}
OpenMP Code
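The four problems can also be removed together with a work-sharing loop. The sketch below is not the set of individual fixes listed in the reasons but a swapped-in idiom: `parallel for` distributes all iterations (no lost remainder), declaring T inside the loop makes it private, `reduction(+:area)` removes the race on area, and the implicit barrier at the end of the loop means the result is complete before it is used. The names parallel_area and integrand_fn are hypothetical, and the integrand is passed as a function pointer purely to keep the sketch self-contained; square is a stand-in integrand for trying it out.

```c
/* Integrand evaluated at each midpoint T. */
typedef double (*integrand_fn)(double);

/* Stand-in integrand for experimentation. */
double square(double t)
{
    return t * t;
}

/* Sketch: sums f((i + 0.5) * step) for i = 0 .. n-1.
 * Without OpenMP the pragma is ignored and the loop runs serially. */
double parallel_area(integrand_fn f, double step, int n)
{
    double area = 0.0;

    #pragma omp parallel for reduction(+:area)
    for (int i = 0; i < n; i++) {
        double T = (i + 0.5) * step;  /* private: declared inside the loop */
        area += f(T);
    }

    /* All threads have finished here, so 'area' is final. */
    return area;
}
```

For the exam program the integrand would compute sin(T)*sin(T), n would be 10000, and res = sqrt((2*area*step)/x) would be evaluated once after the call returns.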
QUESTION 4:
The following code runs in a multi-processor system, where Thread 1 and Thread 2 run on
different processors. X, Y, A1, A2, F1 and F2 are shared memory values, while r1 and r2 are register
values local to each processor (all initialized to 0).
Thread 1                      Thread 2
1a: Move r1,1                 2a: Move r1,2
1b: Store(A1),r1              2b: Store(A2),r1
1c: Store(F1),r1              2c: Store(F2),r1
1d: If (F2 == 0) goto 1d      2d: If (F1 == 0) goto 2d
1e: Load r2,(A2)              2e: Load r2,(A1)
1f: Store(X),r2               2f: Store(Y),r2
1g: Store(Y),r1               2g: Store(X),r1
(a) Write down a sequence of instruction execution that will store in X and Y the values 2, 1
respectively, after the execution of both threads. Assume sequential consistency. (5 marks)
Step:   1a    1b    1c    2a    2b    2c    1d    1e    1f    2d    2e    2f    1g    2g
Effect: r1=1  A1=1  F1=1  r1=2  A2=2  F2=2  T     r2=2  X=2   T     r2=1  Y=1   Y=1   X=2
(b) Write down a sequence of instruction execution that will store in X and Y the values 0, 1
respectively, after the execution of both threads. Assume that the processors employ out-of-order execution with NO speculative execution.
(5 marks)
Step:   2a    2c    1a    1b    1c    1d    1e    2b    2d    2e    2g    1g    2f    1f
Effect: r1=2  F2=2  r1=1  A1=1  F1=1  T     r2=0  A2=2  T     r2=1  X=2   Y=1   Y=1   X=0
(c) Write down a sequence of instruction execution that will store in X and Y the values 0, 0
respectively, after the execution of both threads. Assume that the processors employ out-of-order execution with speculative execution.
(5 marks)
Step:   1e    2e    1a    1b    1c    1d    2a    2b    2c    2d    2g    1g    2f    1f
Effect: r2=0  r2=0  r1=1  A1=1  F1=1  T     r1=2  A2=2  F2=2  T     X=2   Y=1   Y=0   X=0
(d) If the processors employ out-of-order execution with no speculative execution, specify whether the
following instruction sequences are valid. Justify your answer.
(5 marks)
(i) 1a 1c 1b 2a 2b 2c 2d 2e 2f 1d 1e 1g 1f 2g   Valid
(ii) 1a 1b 1c 2a 2c 2b 2d 1g 2e 2f 1d 1e 1f 2g   Invalid: 1g executes before the branch 1d
(e) If the processors employ out-of-order execution with speculative execution, specify whether the
following instruction sequences are valid. Justify your answer.
(5 marks)
(i) 1a 1c 1b 2a 2b 2c 2g 2d 2e 2f 1d 1f 1e 1g   Invalid: 1f executes before 1e, which loads the r2 value it stores
(ii) 1a 1c 1b 2a 2b 2c 2f 2d 2e 2g 1d 1f 1e 1g   Invalid: 2f executes before 2e, and 1f executes before 1e
QUESTION 5:
(a) Rewrite the following MPI code using the most appropriate collective communication
commands.
(15 marks)
int main (int argc, char *argv[])
{
    int myrank, size, sum=0, i, temp, N=10000, arr[N];
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (myrank == 0) {
        get_data(filename, arr);
        /* Scatter */
        for (i=1; i<size; i++)
            MPI_Send(arr+i*(N/size), N/size, MPI_INT, i, i, MPI_COMM_WORLD);
        for (i=0; i<N/size; i++) sum = sum + arr[i];
        /* Reduce_Sum (the tag must match the sender's tag, i.e. its rank) */
        for (i=1; i<size; i++) {
            MPI_Recv(&temp, 1, MPI_INT, i, i, MPI_COMM_WORLD, &status);
            sum = sum + temp;
        }
        /* Broadcast */
        for (i=1; i<size; i++) MPI_Send(&sum, 1, MPI_INT, i, i+1, MPI_COMM_WORLD);
        for (i=0; i<N/size; i++) arr[i] = sum/arr[i];
        /* Gather */
        for (i=1; i<size; i++) {
            MPI_Recv(arr+i*(N/size), N/size, MPI_INT, i, i+1, MPI_COMM_WORLD, &status);
        }
    }
    else {
        MPI_Recv(arr, N/size, MPI_INT, 0, myrank, MPI_COMM_WORLD, &status);
        for (i=0; i<N/size; i++)
            sum = sum + arr[i];
        MPI_Send(&sum, 1, MPI_INT, 0, myrank, MPI_COMM_WORLD);
        MPI_Recv(&sum, 1, MPI_INT, 0, myrank+1, MPI_COMM_WORLD, &status);
        for (i=0; i<N/size; i++) arr[i] = sum/arr[i];
        MPI_Send(arr, N/size, MPI_INT, 0, myrank+1, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
Useful MPI functions:
• MPI_Init(&argc, &argv)
• MPI_Finalize()
• MPI_Comm_rank(comm, p_id)
• MPI_Comm_size(comm, size)
• MPI_Send(s_msg, s_count, datatype, dest, tag, comm)
• MPI_Recv(r_msg, r_count, datatype, srce, tag, comm, status)
• MPI_Bcast(message, count, datatype, root, comm)
• MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
• MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
• MPI_Reduce(operand, result, count, datatype, operation, root, comm)
  where operation = [MPI_SUM, MPI_MAX, MPI_MIN, MPI_PROD, MPI_LAND, MPI_LOR, MPI_LXOR,
  MPI_BAND, MPI_BOR, MPI_BXOR, MPI_MAXLOC, MPI_MINLOC]
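A sketch of how the four annotated steps (Scatter, Reduce_Sum, Broadcast, Gather) might look with collective calls is given below. It assumes the number of processes divides N, that no element is zero, and it uses a hypothetical load_data() helper in place of get_data(), whose definition is not given in the question. It needs an MPI installation to build (e.g. with mpicc) and is not a definitive marking-scheme answer.

```c
#include <mpi.h>
#include <stdlib.h>

#define N 10000

/* Hypothetical stand-in for get_data(): fills arr with non-zero values. */
static void load_data(int *arr, int n)
{
    for (int i = 0; i < n; i++)
        arr[i] = i % 7 + 1;
}

int main(int argc, char *argv[])
{
    int myrank, size, local_sum = 0, sum = 0;
    int *arr = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;                 /* assumes size divides N */
    int *part = malloc(chunk * sizeof *part);

    if (myrank == 0) {                    /* only the root holds the full array */
        arr = malloc(N * sizeof *arr);
        load_data(arr, N);
    }

    /* Scatter: every process, including the root, receives one chunk. */
    MPI_Scatter(arr, chunk, MPI_INT, part, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    for (int i = 0; i < chunk; i++)
        local_sum += part[i];

    /* Reduce: the partial sums are added up on the root. */
    MPI_Reduce(&local_sum, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    /* Broadcast: the root sends the global sum to every process. */
    MPI_Bcast(&sum, 1, MPI_INT, 0, MPI_COMM_WORLD);

    for (int i = 0; i < chunk; i++)
        part[i] = sum / part[i];

    /* Gather: the processed chunks are collected back on the root. */
    MPI_Gather(part, chunk, MPI_INT, arr, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    free(part);
    free(arr);                            /* free(NULL) is a no-op on non-root ranks */
    MPI_Finalize();
    return 0;
}
```

The Reduce and Broadcast pair is kept as two calls because those are the functions listed for the question.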
(b) The figure below shows the timelines of the execution of a program on a four-node message
passing system. Calculate the speedup compared to the single-node execution, the
efficiency, and the utilization factor for each node.
(10 marks)
[Figure: execution timelines of the four nodes over a time axis from 0 to 56. Legend: CPU Computation Time, CPU Communication Overhead, CPU Parallelization Overheads, CPU Idle Time, Communication Delay, Global Synchronization.]
Computation 1: 23+4+5 = 32
Computation 2: 19+4 = 23
Computation 3: 20+4 = 24
Computation 4: 18+4 = 22
Computation total: 32+23+24+22 = 101
Speedup = 101/56 = 1.8
Efficiency = (1.8/4) x100 = 45%
UF1 = (32/56) x 100 = 57%
UF2 = (23/56) x 100 = 41%
UF3 = (24/56) x 100 = 43%
UF4 = (22/56) x 100 = 39%
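The calculation above takes the single-node time to be the sum of the per-node computation times (101) and the parallel time to be the end of the timeline (56). Under that same assumption, the three measures can be sketched as follows; the names ParallelMetrics, metrics and utilization are illustrative only.

```c
typedef struct {
    double speedup;     /* total computation time / parallel time */
    double efficiency;  /* speedup / number of nodes, as a fraction */
} ParallelMetrics;

/* busy[i] is the computation time of node i; parallel_time is the
 * overall length of the parallel run. */
ParallelMetrics metrics(const double busy[], int nodes, double parallel_time)
{
    double total = 0.0;
    for (int i = 0; i < nodes; i++)
        total += busy[i];

    ParallelMetrics m;
    m.speedup = total / parallel_time;
    m.efficiency = m.speedup / nodes;
    return m;
}

/* Utilization factor of one node, as a fraction of the parallel time. */
double utilization(double busy, double parallel_time)
{
    return busy / parallel_time;
}
```

With busy times {32, 23, 24, 22} and a parallel time of 56 this gives a speedup of about 1.8, an efficiency of about 0.45 and utilization factors of about 0.57, 0.41, 0.43 and 0.39.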