Performance of coherence protocols
Cache misses can be classified into four categories:
• Cold misses (or “compulsory misses”) occur the first time
that a block is referenced.
• Conflict misses are misses that would not occur if the
cache were fully associative with LRU replacement.
• Capacity misses occur when the cache size is not sufficient
to hold data between references.
• Coherence misses are misses caused by the coherence
protocol.
The first three types occur in uniprocessors. The last is specific to
multiprocessors.
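The first three categories can be separated mechanically. Below is a small sketch (hypothetical trace and cache size, for illustration only) that replays a trace against a fully associative LRU cache: a miss there is cold if the block was never referenced before and a capacity miss otherwise. A conflict miss, by the definition above, would be a reference that misses in the real cache but hits in this fully associative model.

```python
# Sketch: classify references against a fully associative LRU cache.
# A miss here is "cold" if the block was never referenced before, and
# "capacity" otherwise. A conflict miss (not modeled here) is a reference
# that misses in the real cache but *hits* in this fully associative model.
from collections import OrderedDict

def classify_misses(trace, num_blocks):
    """trace: iterable of block addresses; num_blocks: blocks the cache holds."""
    seen = set()              # every block ever referenced
    lru = OrderedDict()       # fully associative cache in LRU order
    counts = {"hit": 0, "cold": 0, "capacity": 0}
    for blk in trace:
        if blk in lru:
            counts["hit"] += 1
        elif blk not in seen:
            counts["cold"] += 1        # first reference ever: compulsory
        else:
            counts["capacity"] += 1    # misses even with full associativity
        seen.add(blk)
        lru[blk] = True
        lru.move_to_end(blk)           # mark most recently used
        if len(lru) > num_blocks:
            lru.popitem(last=False)    # evict the least recently used block
    return counts

# Cycling through 5 blocks in a 4-block cache: 5 cold misses, then every
# re-reference misses on capacity.
counts = classify_misses([0, 1, 2, 3, 4] * 2, num_blocks=4)
```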
Coherence misses can be divided into those caused by true sharing
and those caused by false sharing.
• False-sharing misses are those caused by having a line size larger than one word. Can you explain?
• True-sharing misses, on the other hand, occur when
o a processor writes into a cache block, invalidating the block in another processor's cache,
o after which the other processor reads one of the modified words.
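The distinction can be made concrete with a toy model. The sketch below (a hypothetical two-processor, single-line model; all names are invented for illustration) records which words of the line the writer actually modified, and classifies the other processor's next read miss accordingly: a miss on a modified word is true sharing; a miss on a word the writer never touched is false sharing.

```python
# Toy model: one cache line shared by two processors (0 and 1).
# Simplification: we track only a per-processor valid bit and the set of
# words written since the last invalidation was serviced.
class Line:
    def __init__(self):
        self.valid = [True, True]    # does each processor hold a valid copy?
        self.dirty_words = set()     # words modified since the last read miss

def access(line, proc, word, is_write):
    """Return 'hit', 'true-sharing miss', or 'false-sharing miss'."""
    other = 1 - proc
    if is_write:
        line.valid[other] = False        # invalidate the other copy
        line.dirty_words.add(word)
        return "hit"                     # assume the writer already had the line
    if line.valid[proc]:
        return "hit"
    line.valid[proc] = True              # read miss: refetch the line
    kind = ("true-sharing miss" if word in line.dirty_words
            else "false-sharing miss")   # miss on a word the writer never wrote
    line.dirty_words.clear()
    return kind

line = Line()
access(line, 0, 0, True)         # P0 writes word 0, invalidating P1's copy
r1 = access(line, 1, 1, False)   # P1 reads only word 1: false-sharing miss
access(line, 0, 0, True)         # P0 writes word 0 again
r2 = access(line, 1, 0, False)   # P1 reads the modified word: true-sharing miss
```

With a one-word line, the first of these two misses could not occur at all, which is why false-sharing misses are attributed to line size.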
How can we attack each of the four kinds of misses?
• To reduce capacity misses, we can
• To reduce conflict misses, we can
• To reduce cold misses, we can
• To reduce coherence misses, we can
What cache line size performs best? Which protocol is best?
Lecture 14
Architecture of Parallel Computers
Questions like these can be answered by simulation. Getting the
answer right is part art and part science.
Parameters need to be chosen for the simulator. Culler & Singh
(1998) selected a single-level 4-way set-associative 1 MB cache with
64-byte lines.
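The geometry of that cache follows directly from the parameters; a quick sketch (the 32-bit address width is an assumption, since the source does not specify one):

```python
# Cache geometry for the Culler & Singh simulation parameters:
# 1 MB total, 4-way set-associative, 64-byte lines.
cache_bytes = 1 << 20                            # 1 MB
line_bytes = 64
ways = 4
num_sets = cache_bytes // (line_bytes * ways)    # 4096 sets
offset_bits = (line_bytes - 1).bit_length()      # 6 block-offset bits
index_bits = (num_sets - 1).bit_length()         # 12 set-index bits
tag_bits = 32 - index_bits - offset_bits         # 14 tag bits, assuming
                                                 # 32-bit addresses
```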
The simulation uses an idealized memory model, which assumes
that references take constant time. Why is this not realistic?
The simulated workload consists of
• six parallel programs from the SPLASH-2 suite and
• one multiprogrammed workload, consisting of mainly serial programs.
Look through the graphs below, and name the six parallel programs.
Effect of coherence protocol
Three coherence protocols were compared:
• The Illinois MESI protocol (“Ill”, left bar).
• The three-state MSI invalidation protocol (3St) with bus
upgrade for S→M transitions. (Instead of rereading data from
main memory when a block moves to the M state, we just issue a
bus transaction invalidating the other copies.)
• The three-state MSI invalidation protocol without bus
upgrade (3St-BusRdX). (This means that when a block moves to
the M state, we reread it from main memory.)
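The difference between the two MSI variants shows up only on the S→M transition. A minimal sketch (simplified: one block, the other caches' snooping actions omitted):

```python
# Simplified MSI write handling: what goes on the bus when a processor
# writes, depending on whether the protocol supports a bus upgrade.
I, S, M = "I", "S", "M"

def cpu_write(state, has_upgrade):
    """Return (new_state, bus_transaction) for a processor write."""
    if state == M:
        return M, None                 # already modified: silent write
    if state == S:
        # S -> M: the data is already present, so BusUpgr need only
        # invalidate other copies; without it, the block is refetched
        # from memory with BusRdX.
        return M, ("BusUpgr" if has_upgrade else "BusRdX")
    return M, "BusRdX"                 # I -> M: must fetch the data anyway
```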
Which protocol would you expect to be best?
Looking at the graph below, which protocol seems to be best for the
parallel programs?
© 2012 Edward F. Gehringer
CSC/ECE 506 Lecture Notes, Spring 2012
[Figure: Address-bus and data-bus traffic (MB/s) under the Ill, 3St, and 3St-RdEx protocols; one graph for the parallel programs (Barnes, LU, Ocean, Radiosity, Radix, Raytrace) and one for the multiprogrammed workload (Appl-Code, Appl-Data, OS-Code, OS-Data).]

Look at the multiprogrammed workload in the graph above. Which protocol appears to be best?

Why might this be true?
Invalidate vs. update
with respect to miss rate
[CS&G §5.4.5] Which is better, an update or an invalidation protocol?
Let’s look at real programs.
[Figure: Miss rate (%) for LU, Ocean, Raytrace, and Radix under the inv, mix, and upd protocols, broken into cold, capacity, true-sharing, and false-sharing misses.]
Where there are many coherence misses,
If there were many capacity misses,
with respect to bus traffic
Compare the
● upgrades in the invalidation protocol with the
● updates in the update protocol.
Each of these operations produces bus traffic. Which are more frequent? Which protocol causes more bus traffic?

[Figure: Upgrade/update rate (%) under the inv, mix, and upd protocols; one graph for LU, Ocean, and Raytrace (0–2.5%) and one for Radix (0–8%).]

The main problem is that one processor tends to write a block multiple times before another processor reads it. This causes several bus transactions instead of one, as there would be in an invalidation protocol.
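This "write run" effect is easy to quantify with a rough counting sketch (a hypothetical model: one bus transaction per update, per upgrade, or per read miss, everything else silent):

```python
# Rough count of bus transactions when one processor writes a shared block
# w times before another processor reads it.
def bus_transactions(w, protocol):
    if protocol == "update":
        return w          # one bus update per write; the reader then hits
    elif protocol == "invalidate":
        return 1 + 1      # one upgrade on the first write + one read miss
    raise ValueError(protocol)

# For a run of 5 writes, update sends 5 transactions but invalidate only 2,
# and the gap widens with longer write runs.
run5_update = bus_transactions(5, "update")
run5_inval = bus_transactions(5, "invalidate")
```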
Effect of cache line size
on miss rate
If we increase the line size, the number of coherence misses might
go up or down. Why?
• What happens to the number of false-sharing misses?
• What happens to the number of true-sharing misses?
If we increase the line size, what happens to—
• capacity misses?
• conflict misses?
• bus traffic?
So it is not clear which line size will work best.
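Whether a given pair of variables can false-share at all is purely a function of line size. A tiny sketch (hypothetical byte offsets, chosen for illustration):

```python
# Two variables can false-share only if they fall in the same cache line.
def same_line(addr_a, addr_b, line_bytes):
    return addr_a // line_bytes == addr_b // line_bytes

# Two per-processor counters placed 8 bytes apart:
small = same_line(0, 8, line_bytes=8)    # separate lines: no false sharing
large = same_line(0, 8, line_bytes=64)   # one shared line: false sharing
```

Larger lines sweep more unrelated variables into the same line, which is why false-sharing misses tend to grow with line size even as cold and true-sharing misses shrink.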
[Figure: Miss rate (%) vs. line size (8–256 bytes) for Barnes, LU, and Radiosity, broken into upgrade, false-sharing, true-sharing, capacity, and cold components.]
Results for the first three applications seem to show that which line
size is best?
[Figure: Miss rate (%) vs. line size (8–256 bytes) for Ocean, Radix, and Raytrace, broken into upgrade, false-sharing, true-sharing, capacity, and cold components.]
For the second set of applications, which do not fit in cache, Radix
shows a greatly increasing number of false-sharing misses with
increasing block size.
on bus traffic
Larger line sizes generate more bus traffic.
[Figure: Address-bus and data-bus traffic vs. line size (8–256 bytes); one graph in bytes/instruction and a second, for LU, Ocean, and Radix, in bytes/FLOP and bytes/instruction.]
The results differ from those for miss rate: traffic almost always increases with increasing line size.
But address-bus traffic moves in the opposite direction from data-bus
traffic.
With this in mind, which line size would you say is best?
Write propagation in multilevel caches
[§8.4.2] The coherence protocols we have seen so far have been
based on one-level caches.
Suppose each processor has its own L1 cache and L2 cache, and the L2 caches are kept coherent.
Writes must be propagated upstream and downstream.
Define the terms.
• Downstream write propagation.
• Upstream write propagation.
Which makes downstream propagation simpler, a write-through or
write-back L1 cache? Why?
For upstream write propagation:
o An invalidation/intervention received by the L2 must be
propagated to the L1 (in case the L1 has the block).
o The inclusion property cuts down the number of such
upstream invalidations/interventions.
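The filtering effect of inclusion can be sketched as follows (a hypothetical set-based model; real hardware would check L2 tags and state bits):

```python
# With the inclusion property, every block in the L1 is also in the L2,
# so a bus invalidation for a block the L2 does not hold can never concern
# the L1 and need not be propagated upstream.
def handle_bus_invalidation(block, l2_contents, l1_contents):
    """Return True if the invalidation had to be propagated up to the L1."""
    if block not in l2_contents:
        return False                  # inclusion: the L1 cannot have it either
    l2_contents.discard(block)
    l1_contents.discard(block)        # propagate upstream in case the L1 has it
    return True

l2 = {1, 2}
l1 = {1}
miss_filtered = handle_bus_invalidation(3, l2, l1)   # filtered by the L2
hit_forwarded = handle_bus_invalidation(1, l2, l1)   # forwarded to the L1
```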