CSC/ECE 506: Architecture of Parallel Computers

Summer 2006
Problem Set 2
Due Wednesday, June 28, 2006
Problems 2, 3, and 4 will be graded. There are 60 points on these problems. Note: You must do
all the problems, even the non-graded ones. If you do not do some of them, half as many points
as they are worth will be subtracted from your score on the graded problems.
 
Problem 1. (20 points) [CS&G 7.2] A radix-2 FFT over n complex numbers is implemented as a
sequence of log n completely parallel steps, requiring 5n log n floating-point operations while
reading and writing each element of data log n times. If this data is spread over the memories in
a NUMA design using either a cyclic or block distribution, then log(n/p) of the steps will access
data in the local memory, assuming both n and p are powers of two, and in the remaining log p
steps half of the reads and half of the writes will be local. (The choice of layout determines
which steps are local and which are remote, but the ratio stays the same.) Calculate the
communication-to-computation ratio on a distributed-memory design where each processor has a
local memory, as in Figure 7.2. What communication bandwidth would the network need to
sustain for the machine to deliver 250p MFLOPS on p processors?
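The counts named in Problem 1 can be tabulated with a short script before working out the ratio by hand. This is only a sketch under the problem's stated assumptions (n and p are powers of two, p <= n). The function name fft_traffic and the decision to measure traffic in complex elements rather than bytes are illustrative choices, not part of the problem, so converting the result to a bandwidth still requires picking an element size.

```python
def fft_traffic(n: int, p: int) -> dict:
    """Tabulate step, flop, and remote-access counts for the radix-2 FFT.

    Assumes n and p are powers of two with p <= n (as the problem states).
    Traffic is counted in complex elements, not bytes.
    """
    steps = n.bit_length() - 1          # log2(n) parallel steps
    remote_steps = p.bit_length() - 1   # log2(p) steps touch remote memory
    local_steps = steps - remote_steps  # log2(n/p) steps are fully local
    flops = 5 * n * steps               # 5 n log n floating-point operations
    # Every step reads and writes each element once (2n accesses per step);
    # in a remote step, half of the reads and half of the writes are remote.
    remote_accesses = remote_steps * n
    return {
        "local_steps": local_steps,
        "remote_steps": remote_steps,
        "flops": flops,
        "remote_accesses": remote_accesses,
        "remote_accesses_per_flop": remote_accesses / flops,
    }
```

Calling fft_traffic(1024, 16), for instance, reports 6 local steps and 4 remote steps; the last entry is the number of remotely accessed elements per floating-point operation.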
Problem 2. (15 points) The following instruction sequence is for two processors, P1 and P2.
1 P1: READ 4
2 P2: READ 1
3 P1: READ 1
4 P1: WRITE 4
5 P2: READ 1
6 P1: READ 5
7 P2: WRITE 5
8 P1: READ 5
9 P1: READ 6
10 P2: READ 1
11 P1: WRITE 6
12 P2: READ 2
13 P1: READ 7


Here, READ 4 is a read to location 4 in memory given the following configuration. Block A and
Block B are memory blocks that map to the same cache line. Each block contains 4 words. The
values listed are the memory addresses and not integer values.
A: 1 2 3 4
B: 5 6 7 8
Each processor has a direct-mapped cache with 2 cache lines that have 4 words per cache line.
Block transfers between memory and cache cost 8 clock cycles, while cache-to-cache transfers
cost 2 clock cycles. Invalidations as well as both read and write hits cost 1 cycle. All caches are
initially empty.
Compute the total time needed to execute the given sequence, using the MESI protocol for a
bus-based shared-memory multiprocessor. When a miss happens, categorize it as one of the
following:
- Cold misses occur the first time that a block is referenced.
- Conflict misses are misses that would not occur if the cache were (the same size, but)
  fully associative with LRU replacement.
- Capacity misses occur when the cache size is not sufficient to hold data between
  references.
- Coherence misses are misses caused by invalidations to preserve cache coherence.
For each coherence miss, indicate whether it is due to false sharing or true sharing. True sharing occurs
when one processor writes some words in a cache block, invalidating that block in another
processor’s cache, after which the second processor reads one of the modified words.
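The cold, conflict, and capacity definitions above can be checked mechanically for a single cache. The sketch below is one possible reading of those definitions, not a MESI simulator: it ignores coherence misses entirely, works on block numbers rather than word addresses, and assumes a block maps to line (block number mod number of lines). The name classify_misses is hypothetical.

```python
from collections import OrderedDict

def classify_misses(blocks, num_lines):
    """Label each reference in a block-number trace for a direct-mapped cache.

    A miss is 'cold' on the first reference to a block, 'conflict' if a fully
    associative LRU cache of the same size would have hit, else 'capacity'.
    """
    direct = {}            # line index -> block currently stored there
    lru = OrderedDict()    # reference cache: fully associative, LRU, same size
    seen = set()           # blocks referenced at least once
    labels = []
    for b in blocks:
        line = b % num_lines               # assumed mapping: block mod lines
        dm_hit = direct.get(line) == b
        fa_hit = b in lru
        # Update the fully associative reference cache.
        if fa_hit:
            lru.move_to_end(b)
        else:
            if len(lru) == num_lines:
                lru.popitem(last=False)    # evict least recently used block
            lru[b] = True
        if dm_hit:
            labels.append("hit")
        elif b not in seen:
            labels.append("cold")
        elif fa_hit:
            labels.append("conflict")
        else:
            labels.append("capacity")
        direct[line] = b
        seen.add(b)
    return labels
```

With two lines, the trace [0, 2, 0, 2] yields two cold misses followed by two conflict misses, because blocks 0 and 2 compete for the same line even though both would fit in a fully associative cache of that size.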
Problem 3. (20 points) The write-back strategy and dirty bits are related to the write strategy of a
cache. Assume that we have a memory in which the largest address referenced is FFFF. We
also have a cache that has 16 blocks, or lines.
Consider the following references to this cache:
1. read 04CF
2. read F3C7
3. write 0423
4. write 0433
5. write 2BC4
6. read 04BF
This sequence is repeated 20 times, for a total of 120 accesses.
Assume:
- The cache is initially empty, and the entire block is fetched on a miss.
- The block size is 16 words.
- The cache allocates storage on write misses and uses a write-back policy.
(a) What is the total number of misses (read misses + write misses) for the cache, if the cache is
direct mapped?
(b) What is the total number of write-backs of dirty blocks for this direct-mapped cache?
(c) Find the smallest direct-mapped cache size such that there are no misses besides
compulsory misses.
(d) If the cache is 2-way set-associative, find the smallest cache size such that there are no
misses other than compulsory misses.
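For the direct-mapped case in Problem 3, it can help to split each address into tag, line index, and word offset before tracing the sequence. The helper below is a minimal sketch that assumes the addresses are hexadecimal word addresses, so a 16-word block takes the low 4 bits and a 16-line cache takes the next 4 bits. The name split_address and its parameters are illustrative, not part of the problem.

```python
def split_address(addr, block_words=16, num_lines=16):
    """Split a word address into (tag, line index, word offset).

    Assumes word addressing and a direct-mapped cache in which the line
    index is the block number modulo the number of lines.
    """
    offset = addr % block_words     # word within the block
    block = addr // block_words     # block number in memory
    index = block % num_lines       # cache line the block maps to
    tag = block // num_lines        # remaining high-order bits
    return tag, index, offset
```

For example, split_address(0x04CF) gives tag 0x04, index 0xC, and offset 0xF. Passing a different num_lines shows how the index changes as the cache grows.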
Problem 4. (25 points) For each of the four different kinds of cache organizations, display one
sequence of block references for which the organization outperforms the other three
organizations (i.e., one sequence for which direct mapping performs the best of all organizations,
one sequence for which fully associative is the best of all, etc.; four sequences altogether). To
make your answers short, assume:
- there are four block frames in the cache;
- the set or sector size (where applicable) is 2 block frames;
- in all cases, LRU replacement is used.
None of your examples should require more than about ten block references.
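Candidate sequences for Problem 4 can be checked by counting misses under each organization. The sketch below covers only the direct-mapped, set-associative, and fully associative cases (it does not model a sector cache), and it assumes a block maps to set (block number mod number of sets). The name count_misses and its parameters are hypothetical.

```python
from collections import OrderedDict

def count_misses(refs, num_frames=4, ways=1):
    """Count misses for a block-reference sequence with LRU replacement.

    ways=1 models direct mapping, ways=num_frames models full associativity,
    and anything in between models a set-associative cache.
    """
    num_sets = num_frames // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for b in refs:
        s = sets[b % num_sets]          # assumed mapping: block mod sets
        if b in s:
            s.move_to_end(b)            # mark as most recently used
        else:
            misses += 1
            if len(s) == ways:
                s.popitem(last=False)   # evict the least recently used block
            s[b] = True
    return misses
```

For the sequence [0, 4, 0, 4], count_misses returns 4 with ways=1 and 2 with ways=2 or ways=4, since blocks 0 and 4 collide only in the direct-mapped cache.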
Problem 5. (20 points) Consider a parallel program using a client-server model. This program
has one process acting as a server. The server continuously reads the status of the clients, then
computes and stores results in memory. Each client process continuously reads the server’s result
and computes its local variables. The client then updates its status in the memory.
(a) If this program were to run on a bus-based system, which kind of cache coherence protocol
would provide the best result (invalidation or update)? Explain your answer.
(b) When using invalidation protocols, which type of coherence misses will most likely be
encountered in the above program? How can we reduce them?
(c) Instead of using the client-server method, we can have all the processors compute and
exchange data with each other. What are the advantages and disadvantages of this approach
compared with the client-server model?
(d) Consider protocols that flush data out to the bus when one processor tries to read the data
that is already stored in another processor’s cache. Instead of having to go to memory to fetch
the data block, we can pick the data block up from the bus when it’s being flushed to memory.
This is called cache-to-cache sharing. Is this an advantage over always fetching from memory?
Under what conditions would fetching from memory be preferable?