CMU 18-447 Introduction to Computer Architecture, Spring 2014
HW 7: Prefetching, Runahead, Cache Coherence, Multiprocessors, Faults
Instructor: Prof. Onur Mutlu
TAs: Rachata Ausavarungnirun, Varun Kohli, Xiao Bo Zhao, Paraj Tyle
Assigned: Wed., 4/9/2014
Due: Wed., 4/30/2014 (Midnight)
Handin: /afs/ece/class/ece447/handin/hw7
Please submit as ONE PDF: [andrewID].pdf

1 Prefetching [40 points]

You and your colleague are tasked with designing the prefetcher of a machine your company is designing. The machine has a single core, L1 and L2 caches, and a DRAM memory system. We will examine different prefetcher designs and analyze the trade-offs involved.

• For all parts of this question, we want to compute prefetch accuracy, coverage, and bandwidth overhead after the prefetcher is trained and is in steady state. Therefore, exclude the first six requests from all computations.

• If there is a request already outstanding to a cache block, a new request for the same cache block will not be generated. The new request will be merged with the already outstanding request in the MSHRs.

(a) You first design a stride prefetcher that observes the last three cache block requests. If there is a constant stride between the last three requests, it prefetches the next cache block using that stride. You run an application that has the following access pattern to memory (these are cache block addresses):

    A A+1 A+2 A+7 A+8 A+9 A+14 A+15 A+16 A+21 A+22 A+23 A+28 A+29 A+30 ...

Assume this pattern continues for a long time.

Compute the coverage of your stride prefetcher for this application.

0%. After each group of three requests, a prefetch is triggered by the detected stride of 1 (e.g., A+3 after the accesses to A, A+1, A+2), but the prefetched block is always useless; none of the demand requests are covered by a prefetch.

Compute the accuracy of your stride prefetcher for this application.

0%. None of the prefetched blocks are ever demanded.

(b) Your colleague designs a new prefetcher that, on a cache block access, prefetches the next N cache blocks. The coverage and accuracy of this prefetcher are 66.67% and 50% respectively for the above application. What is the value of N?

N = 2. After (for example) the access to block A+14, the prefetcher prefetches blocks A+15 and A+16. After A+15, it prefetches A+16 (merged with the prefetch that was already issued) and A+17. After A+16, it prefetches A+17 (again merged) and A+18. Hence, two out of every three demand accesses are covered (66.67%), and two of the four distinct prefetches per group (A+15 and A+16, but not A+17 and A+18) are useful (50%).

We define the bandwidth overhead of a prefetcher as

    Bandwidth overhead = (total number of cache block requests with the prefetcher)
                       / (total number of cache block requests without the prefetcher)    (1)

What is the bandwidth overhead of this next-N-block prefetcher for the above application?

5/3. For every group of three consecutive demand accesses, two extra blocks are prefetched. For example, cache blocks A+14, A+15, and A+16 are fetched (the latter two by prefetches); in addition to these, blocks A+17 and A+18 are prefetched. That is 5 cache block requests per group instead of 3.
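These steady-state numbers can be checked with a short simulation. The sketch below is an added illustration, not part of the handout; the function name and the choice of A = 0 are mine. It replays the strided stream through a next-N-block prefetcher, models MSHR merging as "never re-request a block", and measures coverage, accuracy, and bandwidth overhead while excluding the first six requests.

    # Added sketch: replay the stream A, A+1, A+2, A+7, A+8, A+9, ... (A = 0)
    # through a next-N-block prefetcher and measure its steady-state metrics.
    def simulate(N, groups=10_000, warmup=6):
        stream = [7 * g + i for g in range(groups) for i in range(3)]
        requested = set()       # blocks already fetched or outstanding (MSHR merge)
        prefetches = set()      # blocks requested by the prefetcher after warmup
        hits = 0                # demand requests covered by an earlier prefetch
        for t, block in enumerate(stream):
            if t >= warmup and block in prefetches:
                hits += 1
            requested.add(block)
            for p in range(block + 1, block + N + 1):
                if p not in requested:              # otherwise merged: no new request
                    requested.add(p)
                    if t >= warmup:
                        prefetches.add(p)
        demands = len(stream) - warmup
        useful = len(prefetches & set(stream))      # prefetched blocks later demanded
        coverage = hits / demands
        accuracy = useful / len(prefetches)
        bandwidth = (demands - hits + len(prefetches)) / demands
        return coverage, accuracy, bandwidth

    print(simulate(2))   # ~(0.667, 0.500, 1.667): matches part (b) and equation (1)

Running simulate(5) the same way reproduces, to within edge effects at the start and end of the stream, the 100% coverage and 7/3 bandwidth overhead discussed in parts (c) and (d) below.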
(c) Your colleague wants to improve the coverage of her next-N-block prefetcher further for the above application, but is willing to tolerate a bandwidth overhead of at most 2x. Is this possible? NO

Why or why not? Within each group of three requests, the last two are already covered once the first is accessed. To improve coverage, the prefetcher must therefore reach into the next group of three strided requests. For instance, on an access to A+14, blocks A+15 and A+16 are already prefetched; to improve coverage, A+21 (the first request of the next group) would have to be prefetched as well. However, this would require also prefetching the four cache blocks in between (A+17, A+18, A+19, A+20), which pushes the bandwidth overhead beyond 2x.

(d) What is the minimum value of N required to achieve a coverage of 100% for the above application? Remember that you should exclude the first six requests from your computations.

N = 5 (so that A+16 prefetches A+21, then A+21 prefetches A+22, A+23, etc.).

What is the bandwidth overhead at this value of N?

7/3. In steady state every demand access hits on a prefetch, and each group of three demand accesses issues seven distinct prefetches (e.g., A+15 through A+21), versus three demand requests without the prefetcher.

(e) You are not happy with the large bandwidth overhead required to achieve a prefetch coverage of 100% with a next-N-block prefetcher. You aim to design a prefetcher that achieves a coverage of 100% with a 1x bandwidth overhead. Propose a prefetcher design that accomplishes this goal. Be concrete and clear.

1. A prefetcher that learns the pattern of strides: 1, 1, 5, in this case. This can be accomplished by keeping the last three strides and a confidence counter for each pattern of last three strides.

2. A two-delta stride prefetcher that records up to two different strides X and Y, and the number of strides X that are traversed before a stride of Y is traversed. For this sequence, the prefetcher would learn that X = 1, Y = 5, and that two strides of X occur followed by one stride of Y.

2 Running Ahead [55 points]

Consider the following program, running on an in-order processor with no pipelining:

    LD  R1  ← (R3)      // Load A
    ADD R2  ← R4, R6
    LD  R9  ← (R5)      // Load B
    ADD R4  ← R7, R8
    LD  R11 ← (R16)     // Load C
    ADD R7  ← R8, R10
    LD  R12 ← (R11)     // Load D
    ADD R6  ← R8, R15

Assume that all registers are initialized and available prior to the beginning of the shown code. Each load takes 1 cycle to execute, and each add takes 2 cycles to execute. Loads A through D are all cache misses. In addition to the 1 cycle taken to execute each load, these cache misses take 6, 9, 12, and 3 cycles, respectively for Loads A through D, to complete. For now, assume that no penalty is incurred when entering or exiting runahead mode.

Note: Please show all your work for partial credit.

(a) For how many cycles does this program run without runahead execution?

42 cycles. Each load takes 1 cycle plus its miss latency, and each add takes 2 cycles: (1+6) + 2 + (1+9) + 2 + (1+12) + 2 + (1+3) + 2 = 42.

Table 1: Execution timeline for part (b), using runahead execution with no exit penalty and no optimizations

    Cycle | Operations                                        | Enter/Exit RA | Runahead Mode? | Additional RA Instruction?
    1     | Load 1 Issue                                      |               |                |
    2     | Add 1                                             | enter RA      | Yes            | Yes
    3     | Add 1                                             |               | Yes            | Yes
    4     | Load 2 Issue                                      |               | Yes            | Yes
    5     | Add 2                                             |               | Yes            | Yes
    6     | Add 2                                             |               | Yes            | Yes
    7     | Load 3 Issue; Load 1 Finish                       | exit RA       | Yes            | Yes
    8     | Add 1                                             |               |                |
    9     | Add 1                                             |               |                |
    10    | Load 2 Issue, already issued                      |               |                |
    11    | Add 2                                             | enter RA      | Yes            | Yes
    12    | Add 2                                             |               | Yes            | Yes
    13    | Load 3 Issue, already issued; Load 2 Finish       | exit RA       | Yes            | Yes
    14    | Add 2                                             |               |                |
    15    | Add 2                                             |               |                |
    16    | Load 3 Issue, already issued                      |               |                |
    17    | Add 3                                             | enter RA      | Yes            | Yes
    18    | Add 3                                             |               | Yes            | Yes
    19    | Load 4 Issue (no prefetch: dependent on R11 from Load 3, which finishes only at the end of cycle 19, so R11 isn't available now); Load 3 Finish | exit RA | Yes | Yes
    20    | Add 3                                             |               |                |
    21    | Add 3                                             |               |                |
    22    | Load 4 Issue                                      |               |                |
    23    | Add 4                                             | enter RA      | Yes            | Yes
    24    | Add 4                                             |               | Yes            | Yes
    25    | (nothing left to run ahead); Load 4 Finish        | exit RA       | Yes            |
    26    | Add 4                                             |               |                |
    27    | Add 4                                             |               |                |

(b) For how many cycles does this program run with runahead execution?

27 cycles. See Table 1.

(c) How many additional instructions are executed in runahead execution mode?

14 instructions. See Table 1 (the count includes each cycle of execution in runahead mode).

(d) Next, assume that exiting runahead execution mode incurs a penalty of 3 cycles. In this case, for how many cycles does this program run with runahead execution?

There are 4 runahead periods in total, so 27 + 4 × 3 = 39 cycles.
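The cycle counts above follow mechanically from the stated latencies. The snippet below is an added check, not part of the handout; the variable names are mine. It recomputes the no-runahead baseline of part (a), the penalized runtime of part (d), and the break-even exit penalty asked for in part (e) next.

    # Added check: recompute the cycle counts from the stated latencies.
    LD_CYCLES, ADD_CYCLES = 1, 2
    MISS_CYCLES = [6, 9, 12, 3]               # extra cycles for Loads A-D

    # (a) In-order, no pipelining, no runahead: everything serializes.
    no_runahead = sum(LD_CYCLES + m for m in MISS_CYCLES) + 4 * ADD_CYCLES
    print(no_runahead)                        # 42 cycles

    # (d) Runahead: 27 cycles (Table 1) plus an exit penalty per period.
    periods, penalty = 4, 3
    print(27 + periods * penalty)             # 39 cycles

    # (e) Smallest integer penalty X that makes runahead slower: 27 + 4X > 42.
    print(next(x for x in range(100) if 27 + periods * x > no_runahead))   # 4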
(e) At least how many cycles should the runahead exit penalty be, such that enabling runahead execution decreases performance? Please show your work.

4 cycles. With an exit penalty of X cycles, runahead execution takes 27 + 4X cycles; it hurts performance when 27 + 4X > 42, i.e., X > 3.75, so the smallest integer penalty is X = 4.

(f) Which load instructions cause runahead periods? Circle all that did: Load A, Load B, Load C, Load D — all four (see the four runahead periods in Table 1).

For each load that caused a runahead period, tell us if the period generated a prefetch request. If it did not, explain why it did not and what type of a period it caused.

Load A: This period generated prefetches (Loads B and C are issued early).
Load B: No prefetch. Short runahead period / overlapping runahead periods: everything it runs was already run ahead during Load A's period.
Load C: No prefetch. Dependent load / short runahead period: Load D's address depends on R11, which Load C has not yet produced.
Load D: No loads to prefetch in this period. (Useless runahead period.)

(g) For each load that caused a runahead period that did not result in a prefetch, explain how you would best mitigate the inefficiency of runahead execution.

Load A: N/A. A prefetch was generated.
Load B: Do not enter runahead mode if only a few cycles are left before the corresponding cache miss will be filled. Also, do not enter runahead mode if the subsequent instruction(s) have already been executed during a previous runahead period.
Load C: Predict the values of cache-miss address loads (so that dependent loads can still be prefetched).
Load D: Predict whether a period will generate useful cache misses, and do not execute a period that is predicted useless.

(h) If all useless runahead periods were eliminated, how many additional instructions would be executed in runahead mode?

Since only the first runahead period generated prefetches, only that runahead period would be executed. According to Table 1, 6 instructions are executed in the first runahead period.

How does this number compare with your answer from part (c)?

Fewer: 6 instead of 14.

(i) Assume still that the runahead exit penalty is 3 cycles, as in part (d). If all useless runahead execution periods were eliminated (i.e., runahead execution is made efficient), for how many cycles does the program run with runahead execution?

Only one runahead period remains, so 27 + 1 × 3 = 30 cycles.

How does this number compare with your answer from part (d)?

Fewer: 30 instead of 39 cycles.

(j) At least how many cycles should the runahead exit penalty be such that enabling efficient runahead execution decreases performance? Please show your work.

16 cycles. With only one runahead period, 27 + 1 × X > 42 requires X > 15, so the smallest integer penalty is X = 16.

3 Amdahl's Law [30 points]

Consider the following three processors (X, Y, and Z) that are fabricated on a constant silicon area of 16A. Assume that the single-thread performance of a core increases with the square root of its area.

    Processor X: 1 large core of area 16A
    Processor Y: 4 medium cores of area 4A each
    Processor Z: 16 small cores of area A each

On each of the three processors, we will execute a workload where an S fraction of its work is serial, and where the 1 − S fraction of its work is infinitely parallelizable. As a function of S, plot the execution time of the workload on each of the three processors. Assume that it takes time T to execute the entire workload using only one of the small cores of Processor Z. Please label the value of the y-axis intercept and the slope.

[Plots: execution time versus S, for S from 0 to 1.]
Processor X: a flat line at T/4 (y-intercept T/4, slope 0), since the single large core runs 4x faster than a small core regardless of S.
Processor Y: execution time T/8 + (3T/8)·S (y-intercept T/8, slope 3T/8, reaching T/2 at S = 1).
Processor Z: execution time T/16 + (15T/16)·S (y-intercept T/16, slope 15T/16, reaching T at S = 1).

(a) Which processor has the lowest execution time for the widest range of S?

X. Processor Z is fastest for S < 1/9, Processor Y for 1/9 < S < 1/3, and Processor X for all S > 1/3, which is the widest range.

(b) Typically, for a realistic workload, the parallel fraction is not infinitely parallelizable. What are the three fundamental reasons why?

1. Synchronization
2. Load imbalance
3. Resource contention
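The intercepts and slopes above follow from a serial time of S·T/perf and a parallel time of (1 − S)·T/(perf × cores). The sketch below is an added illustration (the function name is mine, not the handout's) that evaluates this model for all three processors.

    # Added sketch: execution time vs. serial fraction S, with T = 1.0 being
    # the whole workload's time on one small core. Core speed ~ sqrt(area).
    import math

    def exec_time(S, core_area, n_cores, T=1.0):
        perf = math.sqrt(core_area)               # speed relative to a small core
        return S * T / perf + (1 - S) * T / (perf * n_cores)

    for name, area, n in [("X", 16, 1), ("Y", 4, 4), ("Z", 1, 16)]:
        t0, t1 = exec_time(0, area, n), exec_time(1, area, n)
        print(f"{name}: intercept {t0}T, slope {t1 - t0}T")
    # X: intercept 0.25T,   slope 0.0T      (flat at T/4)
    # Y: intercept 0.125T,  slope 0.375T    (T/8 + (3T/8)S)
    # Z: intercept 0.0625T, slope 0.9375T   (T/16 + (15T/16)S)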
4 Coherent Caches [40 points]

You just got back a prototype of your latest processor design, which has two cores and uses the MESI cache coherence protocol for each core's private L1 cache.

(a) For this subproblem, assume a 2-core processor where each core contains a private L1 cache. The two caches are coherent and implement the MESI protocol. Assume that each cache can only hold 1 cache block. The pairs below show MESI state transitions of the cache block in CPU 0's private L1 cache. For each transition (pair of adjacent MESI states) below, list all possible actions CPU 0 and CPU 1 could perform to cause that transition. The first transition, where the cache block in CPU 0's cache transitions from Invalid to Exclusive state, has been completed for you as an example. (A next-state sketch covering these transitions appears at the end of this problem.)

    I → E: CPU 0 issued a read. CPU 1 issued nothing.
    I → S:
    E → S:
    E → M:
    S → M:
    S → I:
    M → S:

(b) Scenario 1: Let's say that you discover that there are some bugs in the design of your processor's coherence modules. Specifically, the BusRead and BusWrite signals on the module occasionally do not get asserted when they should have been (but data still gets transferred correctly to the cache). Fill in the table below with a ✓ if, for each MESI state, the missing signal has no effect on correctness. If correctness may be affected, fill in a ✗.

                        M    E    S    I
    BusRead missing     ✗    ✗    ✓    ✓
    BusWrite missing    ✗    ✗    ✗    ✓

(c) Scenario 2: Let's say that instead you discover that there are no bugs in the design of your processor's coherence modules. Instead, however, you find that occasionally cosmic rays strike the MESI state storage in your coherence modules, causing a state to instantaneously change to another. Fill in a cell in the table below with a ✓ if, for a starting MESI state on the top, instantaneously changing the state to the state on the left affects neither correctness nor performance. Fill in a cell with a ○ if correctness is not affected but performance could be affected by the state change. If correctness may be affected, fill in a ✗.

                                 starting state (real status)
    ending state                 M    E    S    I
    (current status)    M        ✓    ○    ✗    ✗
                        E        ✗    ✓    ✗    ✗
                        S        ✗    ○    ✓    ✗
                        I        ✗    ○    ○    ✓
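For part (a), each transition can be read off a next-state function for the block in CPU 0's cache. The model below is an added sketch, not part of the handout; the function and parameter names are mine, and `other_has_copy` stands in for the shared signal observed on a read miss.

    # Added sketch: next state of the block in CPU 0's private L1 under MESI,
    # for an access by either CPU. CPU 1's accesses reach CPU 0 as snooped
    # bus requests (its read -> BusRead, its write -> BusWrite/invalidate).
    def next_state(state, cpu, op, other_has_copy=False):
        assert state in "MESI" and cpu in (0, 1) and op in ("read", "write")
        if cpu == 0:
            if op == "write":
                return "M"                      # I->M, S->M, E->M; M stays M
            if state == "I":                    # read miss
                return "S" if other_has_copy else "E"
            return state                        # read hit: state unchanged
        if op == "write":
            return "I"                          # remote write invalidates: M/E/S -> I
        return "I" if state == "I" else "S"     # remote read downgrades: M/E -> S

    # The transitions listed in part (a):
    print(next_state("I", 0, "read"))                       # E (the given example)
    print(next_state("I", 0, "read", other_has_copy=True))  # S
    print(next_state("E", 1, "read"))                       # S
    print(next_state("E", 0, "write"))                      # M
    print(next_state("S", 0, "write"))                      # M
    print(next_state("S", 1, "write"))                      # I
    print(next_state("M", 1, "read"))                       # S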
5 Parallel Speedup [30 points]

You are a programmer at a large corporation, and you have been asked to parallelize an old program so that it runs faster on modern multicore processors.

(a) You parallelize the program and discover that its speedup over the single-threaded version of the same program is significantly less than the number of processors. You find that many cache invalidations are occurring in each core's data cache. What program behavior could be causing these invalidations (in 20 words or less)?

Multiple threads writing to the same cache blocks (true or false sharing), making the blocks ping-pong between the cores' caches.

(b) You modify the program to fix this first performance issue. However, now you find that the program is slowed down by a global state update that must happen in only a single thread after every parallel computation. In particular, your program performs 90% of its work (measured as processor-seconds) in the parallel portion and 10% of its work in this serial portion. The parallel portion is perfectly parallelizable. What is the maximum speedup of the program if the multicore processor had an infinite number of cores?

10x (Amdahl's Law): with infinite cores the parallel portion takes zero time, so speedup = 1/0.1 = 10.

(c) How many processors would be required to attain a speedup of 4?

6 processors: 1/(0.1 + 0.9/N) = 4 gives 0.9/N = 0.15, so N = 6.

(d) In order to execute your program with parallel and serial portions more efficiently, your corporation decides to design a custom heterogeneous processor.

• This processor will have one large core (which executes code more quickly but also takes greater die area on-chip) and multiple small cores (which execute code more slowly but also consume less area), all sharing one processor die.
• When your program is in its parallel portion, all of its threads execute only on small cores.
• When your program is in its serial portion, the one active thread executes on the large core.
• Performance (execution speed) of a core is proportional to the square root of its area.
• Assume that there are 16 units of die area available. A small core must take 1 unit of die area. The large core may take any number of units of die area n², where n is a positive integer.
• Assume that any area not used by the large core will be filled with small cores.

How large would you make the large core for the fastest possible execution of your program?

With a large core of area L = n², speedup = 1/(0.1/√L + 0.9/(16 − L)). Maximizing the speedup means minimizing 10/n + 90/(16 − n²):
n = 1: 10 + 90/15 = 16
n = 2: 5 + 90/12 = 12.5
n = 3: 10/3 + 90/7 ≈ 16.2
So n = 2: the large core should use L = 4 units of die area, giving a speedup of 100/12.5 = 8.

What would the same program's speedup be if all 16 units of die area were used to build a homogeneous system with 16 small cores, the serial portion ran on one of the small cores, and the parallel portion ran on all 16 small cores?

Speedup = 1/(0.1 + 0.9/16) = 6.4.

Does it make sense to use a heterogeneous system for this program which has 10% of its work in serial sections? YES

Why or why not? The heterogeneous design achieves a speedup of 8 versus 6.4: with a 10% serial fraction, speeding up the serial portion on the large core more than compensates for the four small cores it displaces.

(e) Now you optimize the serial portion of your program and it becomes only 4% of total work (the parallel portion is the remaining 96%). What is the best choice for the size of the large core in this case?

L = 4 again. Minimizing 4/n + 96/(16 − n²): n = 1 gives 4 + 96/15 ≈ 10.4; n = 2 gives 2 + 96/12 = 10; n = 3 gives 4/3 + 96/7 ≈ 15.0. So n = 2 (L = 4) is still the best choice.

What is the program's speedup for this choice of large core size?

1/(0.04/2 + 0.96/12) = 1/0.1 = 10.

What would the same program's speedup be for this 4%/96% serial/parallel split if all 16 units of die area were used to build a homogeneous system with 16 small cores, the serial portion ran on one of the small cores, and the parallel portion ran on all 16 small cores?

1/(0.04 + 0.96/16) = 1/0.1 = 10.

Does it make sense to use a heterogeneous system for this program which has 4% of its work in serial sections? NO

Why or why not? The best heterogeneous configuration only matches the homogeneous system's speedup of 10. When the serial fraction is this small, the large core no longer provides a net benefit, so the simpler homogeneous design is preferable.
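The small optimization in parts (d) and (e) is easy to verify exhaustively. The snippet below is an added check, not part of the handout; the function name is mine. It evaluates every legal large-core size against both serial fractions.

    # Added check: choose the large-core area n^2 that maximizes speedup.
    # Serial code runs on the large core (speed n); parallel code runs on
    # the 16 - n^2 remaining small cores. n = 4 would leave no small cores.
    def speedup(serial_frac, n):
        return 1.0 / (serial_frac / n + (1 - serial_frac) / (16 - n * n))

    for s in (0.10, 0.04):
        best = max((1, 2, 3), key=lambda n: speedup(s, n))
        print(f"serial {s:.0%}: L = {best * best}, speedup = {speedup(s, best):.3f}")
    # serial 10%: L = 4, speedup = 8.000
    # serial 4%:  L = 4, speedup = 10.000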
6 Transient Faults [45 points]

In class and in labs, we have implicitly assumed that the circuits in the processor are always correct. However, in the real world, this is not always the case. One common type of physical problem in a processor is called a transient fault. A transient fault simply means that a bit in a flip-flop or register incorrectly changes (i.e., flips) to the opposite state (e.g., from 0 to 1 or 1 to 0) temporarily.

Consider how a transient fault could affect each of the following processor structures. For each part of this question, answer (i) does a transient fault affect the correctness of the processor (will the processor still give correct results for any program if such a fault occurs in this structure)? (ii) if the fault does not affect correctness, does it affect the performance of the processor? Valid answers for each question are "can affect" or "cannot affect". Answer for performance in a particular situation only if correctness is not affected. When in doubt, state your assumptions.

(a) A bit in the global history register of the global branch predictor:

Flipped from 0 to 1:
Correctness: CANNOT AFFECT
If correctness is affected, why? N/A (the predictor only steers speculation; mispredictions are detected and recovered from)
Performance: CAN AFFECT

Flipped from 1 to 0:
Correctness: CANNOT AFFECT
If correctness is affected, why? N/A
Performance: CAN AFFECT

(b) A bit in a pipeline register (prior to the stage where branches are resolved) that when set to '1' indicates that a branch should trigger a flush due to misprediction, when that pipeline register is currently holding a branch instruction:

Flipped from 0 to 1:
Correctness: CANNOT AFFECT
If correctness is affected, why? N/A
Performance: CAN AFFECT

Flipped from 1 to 0:
Correctness: CAN AFFECT
If correctness is affected, why? A flip to 0 eliminates a flush that was necessary to recover from a misprediction, hence causing incorrect instructions to write back their results. Note that a flip to 1 does not impact correctness because it only causes an extra (unnecessary) flush.
Performance: N/A (correctness is affected)

(c) A bit in the state register of the microsequencer in a microcoded design:

Flipped from 0 to 1:
Correctness: CAN AFFECT
If correctness is affected, why? Such a bit-flip causes the microcode to jump to another state, which will likely cause random/unexpected behavior.
Performance: N/A (correctness is affected)

Flipped from 1 to 0:
Correctness: CAN AFFECT
If correctness is affected, why? See above (the same reason applies to both flip directions).
Performance: N/A (correctness is affected)

(d) A bit in the tag field of a register alias table (RAT) entry in an out-of-order design:

Flipped from 0 to 1:
Correctness: CAN AFFECT
If correctness is affected, why? Such a flip could result in a random (undesired) result being captured in the register file.
Performance: N/A (correctness is affected)

Flipped from 1 to 0:
Correctness: CAN AFFECT
If correctness is affected, why? See above (the same reason applies to both flip directions).
Performance: N/A (correctness is affected)

(e) The dirty bit of a cache block in a write-back cache:

Flipped from 0 to 1:
Correctness: CANNOT AFFECT
If correctness is affected, why? N/A (a clean block is merely written back unnecessarily on eviction)
Performance: CAN AFFECT

Flipped from 1 to 0:
Correctness: CAN AFFECT
If correctness is affected, why? Such a flip causes a modified data block not to be written back to memory, thus losing data.
Performance: N/A (correctness is affected)

(f) A bit in the application id field of an application-aware memory scheduler's request buffer entry:

Flipped from 0 to 1:
Correctness: CANNOT AFFECT
If correctness is affected, why? N/A (the id only influences scheduling priority)
Performance: CAN AFFECT

Flipped from 1 to 0:
Correctness: CANNOT AFFECT
If correctness is affected, why? N/A
Performance: CAN AFFECT

(g) Reference bit in a page table entry:

Flipped from 0 to 1:
Correctness: CANNOT AFFECT
If correctness is affected, why? N/A (the reference bit only guides page replacement decisions)
Performance: CAN AFFECT

Flipped from 1 to 0:
Correctness: CANNOT AFFECT
If correctness is affected, why? N/A
Performance: CAN AFFECT

(h) A bit indicating the processor is in runahead mode:

Flipped from 0 to 1:
Correctness: CAN AFFECT
If correctness is affected, why? A flip to 1 causes the processor to spuriously enter runahead mode. Depending on how the logic to exit runahead mode is implemented, the processor may never exit runahead mode (hence never making forward progress), and even if it does, it will not restart from the correct checkpoint afterward.
Performance: N/A (correctness is affected)

Flipped from 1 to 0:
Correctness: CAN AFFECT
If correctness is affected, why? A flip to 0 causes the processor to behave as if it is in normal mode when it should really be in runahead mode, which results in speculatively executed instructions updating the architectural state (when they should not).
Performance: N/A (correctness is affected)

(i) A bit in the stride field of a stride prefetcher:

Flipped from 0 to 1:
Correctness: CANNOT AFFECT
If correctness is affected, why? N/A (prefetches are only hints; a wrong stride wastes bandwidth but cannot corrupt state)
Performance: CAN AFFECT

Flipped from 1 to 0:
Correctness: CANNOT AFFECT
If correctness is affected, why? N/A
Performance: CAN AFFECT

7 Cache coherence: MESI [20 points]

Assume you are designing a MESI snoopy bus cache coherence protocol for write-back private caches in a multi-core processor. Each private cache is connected to its own processor core as well as a common bus that is shared among the other private caches. There are 4 input commands a private cache may get for a cacheline. Assume that bus commands and core commands will not arrive at the same time:

• CoreRd: Read request from the cache's core
• CoreWr: Write request from the cache's core
• BusGet: Read request from the bus
• BusGetI: Read and Invalidate request from the bus (invalidates shared data in the cache that receives the request)

There are 4 actions a cache can take while transitioning to another state:

• Flush: Write back the cacheline to the lower-level cache
• BusGet: Issue a BusGet request to the bus
• BusGetI: Issue a BusGetI request to the bus
• None: Take no action

There is also an "is shared (S)" signal on the bus. S is asserted upon a BusGet request when at least one of the private caches shares a copy of the data (BusGet (S)). Otherwise, S is deasserted (BusGet (not S)). Assume that upon a BusGet or BusGetI request, the inclusive lower-level cache will eventually supply the data, and that there are no private-cache-to-private-cache transfers.

Rachata drew the MESI state diagram shown below. There are 4 mistakes in his diagram. Please show the mistakes and correct them. You may want to practice on scratch paper first before finalizing your answer. If you made a mess, clearly write down the mistakes and the changes below.

[Figure: Rachata's MESI state diagram over the states Modified (M), Exclusive (E), Shared (S), and Invalid (I), with transitions labeled in input|action form; the labels appearing in the drawing include CoreRd|None, CoreWr|None, CoreRd|BusGet(not S), CoreRd|BusGet(S), CoreWr|BusGetI, BusGet|Flush, BusGet|None, BusGetI|None, and BusGetI|Flush. The drawing, including its 4 deliberate mistakes, is not reproducible in text.]
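Since the drawing cannot be reproduced here, a correct reference is useful for checking it. The table below is my own reconstruction from the rules stated in the question (E upgrades to M silently, and because the inclusive lower-level cache supplies data, only M needs to Flush); it is an added sketch, not Rachata's diagram, and his 4 mistakes must still be found against the original figure.

    # Added reference (my reconstruction, not Rachata's diagram): MESI
    # transitions as (state, input) -> (next_state, action). "(S)"/"(not S)"
    # is the shared signal observed with the BusGet reply on a read miss.
    MESI = {
        ("I", "CoreRd (not S)"): ("E", "BusGet"),
        ("I", "CoreRd (S)"):     ("S", "BusGet"),
        ("I", "CoreWr"):         ("M", "BusGetI"),
        ("I", "BusGet"):         ("I", "None"),
        ("I", "BusGetI"):        ("I", "None"),
        ("E", "CoreRd"):         ("E", "None"),
        ("E", "CoreWr"):         ("M", "None"),    # silent upgrade, no bus action
        ("E", "BusGet"):         ("S", "None"),    # clean: lower level supplies data
        ("E", "BusGetI"):        ("I", "None"),
        ("S", "CoreRd"):         ("S", "None"),
        ("S", "CoreWr"):         ("M", "BusGetI"),
        ("S", "BusGet"):         ("S", "None"),
        ("S", "BusGetI"):        ("I", "None"),
        ("M", "CoreRd"):         ("M", "None"),
        ("M", "CoreWr"):         ("M", "None"),
        ("M", "BusGet"):         ("S", "Flush"),   # supply dirty data via writeback
        ("M", "BusGetI"):        ("I", "Flush"),
    }

    print(MESI[("M", "BusGet")])   # ('S', 'Flush')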