A Low-Overhead Coherence
Solution for Multiprocessors with
Private Cache Memories
Also known as “Snoopy cache”
Paper by: Mark S. Papamarcos and
Janak H. Patel
Presented by: Cameron Mott
3/25/2005
Outline
 Goals
 Examples
 Solutions
 Details on this method
 Results
 Analysis
 Success
 Comments/Questions
Goals
 Reduce bus traffic
 Reduce bus wait
 Increase possible number of processors before
saturation of bus
 Increase processor utilization
 Low cost
 Extensible
 Long length of life for strategy
Structure
 The typical layout for a multi-processor machine: processors with private caches connected to main memory over a single time-shared bus (figure not reproduced).
Difficulties
 Bus speed and saturation limits the processor
utilization (there is a single time-shared bus
with an arbitration mechanism).
 This scheme suffers from the well-known data
consistency or “cache coherence” problem
where two processors have the same writable
data block in their private cache.
Coherence example
 Process communication in shared-memory
multiprocessors can be implemented by exchanging
information through shared variables
 This sharing can result in several copies of a shared
block in one or more caches at the same time.
Time | Event                 | Cache contents for CPU A | Cache contents for CPU B | Memory contents for location X
0    |                       |                          |                          | 1
1    | CPU A reads X         | 1                        |                          | 1
2    | CPU B reads X         | 1                        | 1                        | 1
3    | CPU A stores 0 into X | 0                        | 1                        | 0
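The table can be reproduced with a tiny simulation. This is an illustrative sketch (my own code, not the paper's): two write-through private caches over a shared memory, with no coherence protocol, so CPU B is left holding a stale copy.

```python
class Cache:
    """A trivial private cache over a shared memory dict (write-through)."""
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}                    # address -> cached value

    def read(self, addr):
        if addr not in self.lines:         # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value           # update the local copy
        self.memory[addr] = value          # write through to memory
        # NOTE: no invalidation of other caches -- this is the bug.

memory = {"X": 1}
cache_a, cache_b = Cache(memory), Cache(memory)

cache_a.read("X")          # time 1: A caches X = 1
cache_b.read("X")          # time 2: B caches X = 1
cache_a.write("X", 0)      # time 3: A stores 0; B is never told

# B still sees the stale value 1 even though memory now holds 0.
```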
Enforcing Coherence Styles
 Hardware based
- Uses a global table that keeps track of which memory blocks are held and where.
 “Snoopy” cache
- No need for centralized hardware.
- All processors share the same cache bus.
- Each cache “snoops” on (listens to) cache transactions from other processors.
- Used in centralized shared-memory (CSM) machines built around a bus.
 Snoopy caches
- To solve coherence, a processor broadcasts the address of the block it is writing in its cache; every other processor whose cache contains that block then invalidates its local entry (called broadcast invalidate).
Other Snoopy Methods
 Broadcast-Invalidate
- Any write to cache transmits the address throughout the system. Other caches check their directories and purge the block if it exists locally. This requires no extra status bits, but consumes a great deal of bus time.
 Improvements to the above
- Introduce a bias filter: a small associative memory that stores the most frequently invalidated blocks.
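A minimal sketch of broadcast-invalidate, assuming hypothetical `Bus` and `Cache` classes (none of these names are from the paper): every write puts the address on the bus, and every other cache purges a matching entry.

```python
class Bus:
    """Stand-in for the shared bus: delivers invalidates to every cache."""
    def __init__(self):
        self.caches = []

    def broadcast_invalidate(self, sender, addr):
        for cache in self.caches:
            if cache is not sender:
                cache.lines.pop(addr, None)   # purge the block if present

class Cache:
    def __init__(self, bus, memory):
        self.bus, self.memory, self.lines = bus, memory, {}
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # EVERY write costs bus time, whether or not anyone shares the block.
        self.bus.broadcast_invalidate(self, addr)
        self.lines[addr] = value
        self.memory[addr] = value

memory = {"X": 1}
bus = Bus()
a, b = Cache(bus, memory), Cache(bus, memory)
a.read("X"); b.read("X")
a.write("X", 0)            # B's copy is purged on the spot
# b.read("X") now misses and refetches the up-to-date value 0.
```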
Goodman’s Strategy
 Goodman proposes a strategy for multiple-processor systems with independent caches but a shared bus.
- An invalidate is broadcast only the first time a block is written in cache (thus “write once”). That first write is also written through to main memory. If a block in cache is written more than once, it must be written back to memory before it is replaced.
Write-Once
 Combination of write-through and write-back.
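The combination can be sketched for a single cache line, using Goodman's state names (invalid, valid, reserved, dirty). This is my own minimal illustration, not the paper's design, and it omits miss handling.

```python
class Line:
    def __init__(self):
        self.state, self.value = "invalid", None

def write(line, value, memory, addr):
    """Goodman's write-once for a line already present in the cache."""
    line.value = value
    if line.state == "valid":
        memory[addr] = value      # first write is written through
        line.state = "reserved"   # the block has now been "written once"
    else:                         # reserved or dirty: local update only
        line.state = "dirty"      # must be written back before replacement

memory = {"X": 1}
line = Line()
line.state, line.value = "valid", 1

write(line, 2, memory, "X")   # first write: memory updated, state reserved
write(line, 3, memory, "X")   # second write: local only, state dirty
# Memory still holds 2; the line holds 3 and owes a write-back on eviction.
```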
Example
 Online example
 http://www.cs.tcd.ie/Jeremy.Jones/vivio/caches/writeOnceHelp.htm
 Note that the only browser that displayed this on my computer was IE…
Details
 Two bits in each cache block keep track of the status of that block:
1. Invalid = The data in this line is not present or is not valid.
2. Exclusive-Unmodified (Excl-Unmod) = This is an exclusive cache line. The line is coherent with memory and is held unmodified in only one cache. The cache owns the line and can modify it without having to notify the rest of the system. No other cache in the system may have a copy of this line.
3. Shared-Unmodified (Shared-Unmod) = This is a shared cache line. The line is coherent with memory and may be present in several caches. Caches must notify the rest of the system about any changes to this line. Main memory owns this cache line.
4. Exclusive-Modified (Excl-Mod) = There is modified data in this cache line. The line is incoherent with memory, so the cache is said to own the line. No other cache in the system may have a copy of this line.
 Other papers discuss MESI caches. How does this fit with Papamarcos and Patel’s work?
- M: Exclusive Modified
- E: Exclusive Unmodified
- S: Shared Unmodified
- I: Invalid
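To answer the slide's question directly: the four states are the MESI states under different names, and two status bits suffice to encode them. The bit assignments below are my own illustration; the paper does not mandate a specific encoding.

```python
# One possible 2-bit encoding of the four line states (illustrative only).
STATE_BITS = {
    "Invalid":              0b00,
    "Exclusive-Unmodified": 0b01,
    "Shared-Unmodified":    0b10,
    "Exclusive-Modified":   0b11,
}

# Correspondence to the MESI letters used in later papers.
MESI_LETTER = {
    "Exclusive-Modified":   "M",
    "Exclusive-Unmodified": "E",
    "Shared-Unmodified":    "S",
    "Invalid":              "I",
}

# Two bits are enough: all four encodings are distinct.
assert len(set(STATE_BITS.values())) == 4
```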
Details (cont.)
 Snoopy cache actions:
- Read With Intent to Modify – This is the “write” cycle. If the address on the bus matches a Shared or Exclusive line, the line is invalidated. If the line is Modified, the cache must cause the bus action to abort, write the modified line back to memory, invalidate the line, and then allow the bus read to retry. Alternatively, the owning cache can supply the line directly to the requestor across the bus.
- Read – If the address on the bus matches a Shared line, there is no change. If the line is Exclusive, the state changes to Shared. If the line is Modified, the cache must cause the bus action to abort, write the modified line back to memory, change the line to Shared, and then allow the bus read to retry. Alternatively, the owning cache can supply the line to the requestor directly and change its state to Shared.
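The two bus actions above can be written as a snoop handler for one cache. This is a minimal sketch with hypothetical class and method names; it follows the slide's transitions, using the write-back-to-memory variant (not the cache-to-cache supply alternative).

```python
M, E, S, I = "Excl-Mod", "Excl-Unmod", "Shared-Unmod", "Invalid"

class SnoopingCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}                 # addr -> [state, value]

    def snoop_read(self, addr):
        """Another cache issued a bus Read for addr."""
        if addr not in self.lines:
            return
        line = self.lines[addr]
        if line[0] == M:                # write modified data back ...
            self.memory[addr] = line[1]
            line[0] = S                 # ... then drop to Shared
        elif line[0] == E:
            line[0] = S                 # someone else has a copy now
        # Shared: no change.

    def snoop_rwitm(self, addr):
        """Another cache issued Read-With-Intent-to-Modify for addr."""
        if addr not in self.lines:
            return
        state, value = self.lines[addr]
        if state == M:
            self.memory[addr] = value   # flush modified data first
        del self.lines[addr]            # invalidate in every case

memory = {"X": 1}
cache = SnoopingCache(memory)
cache.lines["X"] = [M, 7]      # this cache holds a modified copy

cache.snoop_read("X")          # remote read: write back, drop to Shared
cache.snoop_rwitm("X")         # remote write: the line is invalidated
```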
Flow diagrams
 Other cache can now provide requested memory.
 This changes the status bit to shared-unmod.
 Block is also written back to memory if another cache
had an Excl-Mod entry for that block. The status of
that block is then changed to shared-unmod after
being written and shared with the other processor.
 Writes cause any other cache to set the
corresponding entry to invalid.
 If memory provided the block, the status becomes
exclusive-unmod.
 No signal is necessary if the status is not shared-unmod.
Problems
 What if?
 A block is Shared-Unmodified and two caches attempt
to change the block at the same time.
 Depending on the implementation, the bus provides
the “sync” mechanism. Only one processor can have
control of the bus at any one time. This provides a
contention mechanism to determine which processor
wins.
 Requires that this operation be indivisible.
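The bus-as-sync-point argument can be sketched with a lock standing in for bus arbitration. Everything here (the names, the two-thread setup) is my own illustration, not from the paper: whichever cache wins arbitration invalidates the other, so exactly one upgrade succeeds.

```python
import threading

bus = threading.Lock()     # only one bus master at a time (the arbiter)
caches = {"A": "Shared-Unmod", "B": "Shared-Unmod"}

def write_attempt(name, other):
    # Holding the bus makes invalidate-and-upgrade indivisible.
    with bus:
        if caches[name] != "Invalid":   # the loser finds its copy gone
            caches[other] = "Invalid"   # ... and must take a fresh miss
            caches[name] = "Excl-Mod"

t1 = threading.Thread(target=write_attempt, args=("A", "B"))
t2 = threading.Thread(target=write_attempt, args=("B", "A"))
t1.start(); t2.start()
t1.join(); t2.join()

# Regardless of which thread won the race, exactly one cache ends up
# Excl-Mod and the other Invalid.
```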
Results
 Results were analyzed using an
approximation algorithm.
 Is this appropriate? Can an approximation be
used to justify the algorithm?
 Accuracy of the approximation: error rate of
less than 5% in certain circumstances
Parameters
Variable label | Number assumed for calculations | Description
N              |           | Number of processors
a              | 90%       | Processor memory reference rate (cache requests)
m              | 5%        | Miss ratio
w              | 20%       | Fraction of memory references that are writes
d              | 50%       | Probability that a block in cache has been locally modified (“dirty”)
u              | 30%       | Fraction of write requests that reference unmodified blocks
s              | 5%        | Fraction of write requests that reference shared blocks
A              | 1         | Number of cycles required for bus arbitration
T              | 2         | Number of cycles for a block transfer
I              | 2         | Number of cycles for a block invalidate
W              |           | Average waiting time per bus request
b              | (derived) | Average number of bus requests per unit of useful processor activity
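The slides with the actual formulas are not reproduced, so the following is a loudly hypothetical back-of-the-envelope of my own (NOT Papamarcos and Patel's queueing model, which also accounts for the waiting time W), just to show how the tabled parameters combine into bus demand. Both `miss_traffic` and `inval_traffic` are assumed accounting, not the paper's terms.

```python
# Rates and cycle costs taken from the Parameters table above.
a, m, w, d, s = 0.90, 0.05, 0.20, 0.50, 0.05
A, T, I = 1, 2, 2

# Assumed accounting: a miss costs arbitration plus a block transfer,
# plus a write-back when the replaced block is dirty; a write to a
# shared block costs an arbitration plus an invalidate.
miss_traffic = m * (1 + d) * (A + T)
inval_traffic = w * s * (A + I)
cycles_per_ref = a * (miss_traffic + inval_traffic)

# Under these deliberately crude serial assumptions, 1 / cycles_per_ref
# bounds how many processors one bus can serve before saturating.
crude_bound = 1 / cycles_per_ref
print(f"bus cycles per processor cycle ~ {cycles_per_ref:.3f}")
```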
Results slides (equations and plots not reproduced): Miss Ratio; Miss Ratio (cont.); Degree of Sharing; Write Back Probability; Block Transfer Time; Cost of Implementing.
Note
 This algorithm and structure have a finite limit on the number of processors they can support: performance shows diminishing returns as the number of processors increases. As a rough estimate, the strategy should not be used in systems of 30 or more processors, although the exact limit depends on the system parameters.
 For a system with a limited number of processors, this strategy is very effective, and it is in use today.
References
 A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories – Mark S. Papamarcos and Janak H. Patel
 “Cache Coherence” – Srini Devadas
 http://csg.csail.mit.edu/u/d/devadas/public_html/6.004/Lectures/lect23/sld001.htm
 “Dynamic Decentralized Cache Schemes for MIMD Parallel Processors” – Tu Phan
 http://www.cs.nmsu.edu/~pfeiffer/classes/573/sem/s03/presentations/Dynamic%20Decentralized%20Cache%20Schemes.ppt
 H&P 3rd Edition – Mark Smotherman
 http://www.cs.clemson.edu/~mark/464/hp3e6.html
 Vivio: Write Once cache coherency protocol – Jeremy Jones
 http://www.cs.tcd.ie/Jeremy.Jones/vivio/caches/writeOnceHelp.htm