Performance of coherence protocols

Cache misses can be classified into four categories:

• Cold misses (or "compulsory misses") occur the first time that a block is referenced.
• Conflict misses are misses that would not occur if the cache were fully associative with LRU replacement.
• Capacity misses occur when the cache size is not sufficient to hold data between references.
• Coherence misses are misses caused by the coherence protocol.

The first three types occur in uniprocessors. The last is specific to multiprocessors.

Coherence misses can be divided into those caused by true sharing and those caused by false sharing. False-sharing misses are those caused by having a line size larger than one word. Can you explain?

True-sharing misses, on the other hand, occur when

o a processor writes into a cache block, invalidating the block in another processor's cache,
o after which the other processor reads one of the modified words.

How can we attack each of the four kinds of misses?

• To reduce capacity misses, we can
• To reduce conflict misses, we can
• To reduce cold misses, we can
• To reduce coherence misses, we can

What cache line size performs best? Which protocol is best?

Lecture 14 Architecture of Parallel Computers 1

Questions like these can be answered by simulation. Getting the answer right is part art and part science. Parameters need to be chosen for the simulator. Culler & Singh (1998) selected a single-level 4-way set-associative 1 MB cache with 64-byte lines. The simulation assumes an idealized memory model, in which references take constant time. Why is this not realistic?

The simulated workload consists of six parallel programs from the SPLASH-2 suite and one multiprogrammed workload, consisting mainly of serial programs. Look through the graphs below, and name the six parallel programs.

Effect of coherence protocol

Three coherence protocols were compared:

• The Illinois MESI protocol ("Ill", left bar).
• The three-state MSI invalidation protocol (3St) with bus upgrade for S→M transitions. (Instead of rereading the data from main memory when a block moves to the M state, we just issue a bus transaction invalidating the other copies.)
• The three-state MSI invalidation protocol without bus upgrade (3St-RdEx). (This means that when a block moves to the M state, we reread it from main memory with a BusRdX transaction.)

Which protocol would you expect to be best?

Looking at the graph below, which protocol seems to be best for the parallel programs?

© 2012 Edward F. Gehringer CSC/ECE 506 Lecture Notes, Spring 2012 2

[Figures: Traffic in MB/s, split into address-bus and data-bus traffic, under the Ill, 3St, and 3St-RdEx protocols. One graph covers the multiprogrammed workload (Appl-Code, Appl-Data, OS-Code, OS-Data); the other covers the parallel programs Barnes, LU, Ocean, Radiosity, Radix, and Raytrace.]

Look at the multiprogrammed workload, using the graph at the left. Which protocol appears to be best?

Why might this be true?

Invalidate vs. update with respect to miss rate [CS&G §5.4.5]

Which is better, an update or an invalidation protocol? Let's look at real programs.

[Figures: Miss rate in %, broken down into cold, capacity, true-sharing, and false-sharing misses, for LU, Ocean, Radix, and Raytrace under the invalidation (inv), update (upd), and mixed (mix) protocols.]

Where there are many coherence misses,

If there were many capacity misses,

Invalidate vs. update with respect to bus traffic

Compare the

• upgrades in the invalidation protocol

with the

• updates in the update protocol.
Each of these operations produces bus traffic. Which are more frequent?

Which protocol causes more bus traffic?

[Figures: Upgrade/update rate in %, for LU, Ocean, Radix, and Raytrace under the invalidation (inv), update (upd), and mixed (mix) protocols.]

The main problem is that one processor tends to write a block multiple times before another processor reads it. In an update protocol, this causes several bus transactions instead of one, as there would be in an invalidation protocol.

Effect of cache line size on miss rate

If we increase the line size, the number of coherence misses might go up or down. Why?

What happens to the number of false-sharing misses?

What happens to the number of true-sharing misses?

If we increase the line size, what happens to—

capacity misses?

conflict misses?

bus traffic?

So it is not clear which line size will work best.

[Figure: Miss rate in %, broken down into cold, capacity, true-sharing, false-sharing, and upgrade components, for Barnes, LU, and Radiosity with line sizes from 8 to 256 bytes.]

Results for the first three applications seem to show that which line size is best?
[Figure: Miss rate in %, broken down into cold, capacity, true-sharing, false-sharing, and upgrade components, for Ocean, Radix, and Raytrace with line sizes from 8 to 256 bytes.]

For the second set of applications, which do not fit in cache, Radix shows a greatly increasing number of false-sharing misses with increasing block size.

Effect of cache line size on bus traffic

Larger line sizes generate more bus traffic.

[Figures: Bus traffic, split into address-bus and data-bus components, in bytes/instruction (and in bytes/FLOP for some applications, e.g. Ocean), for the applications above with line sizes from 8 to 256 bytes.]

The results are different from those for miss rate: traffic almost always increases with increasing line size. But address-bus traffic moves in the opposite direction from data-bus traffic. With this in mind, which line size would you say is best?

Write propagation in multilevel caches [§8.4.2]

The coherence protocols we have seen so far have been based on one-level caches. Suppose each processor has its own L1 cache and L2 cache, and the L2 caches are coherent. Writes must be propagated upstream and downstream. Define the terms.

Downstream write propagation.

Upstream write propagation.

Which makes downstream propagation simpler, a write-through or a write-back L1 cache? Why?
For upstream write propagation:

o An invalidation/intervention received by the L2 must be propagated to the L1 (in case the L1 has the block).

o The inclusion property cuts down the number of such upstream invalidations/interventions.