CS 152 Computer Architecture
and Engineering
Lecture 20: Snoopy Caches
Krste Asanovic
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.cs.berkeley.edu/~cs152
April 15, 2010
CS152, Spring 2010
Recap: Sequential Consistency
A Memory Model
[Figure: a symmetric multiprocessor, several processors P sharing a single memory M]
“ A system is sequentially consistent if the result of
any execution is the same as if the operations of all
the processors were executed in some sequential
order, and the operations of each individual processor
appear in the order specified by the program”
Leslie Lamport
Sequential Consistency =
arbitrary order-preserving interleaving
of memory references of sequential programs
Recap: Sequential Consistency
Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies (additional SC requirements).
What are these in our example?

T1:
  Store (X), 1    (X = 1)
  Store (Y), 11   (Y = 11)

T2:
  Load R1, (Y)
  Store (Y'), R1  (Y' = Y)
  Load R2, (X)
  Store (X'), R2  (X' = X)
Recap: Mutual Exclusion and Locks
Want to guarantee only one process is active in a critical
section
• Blocking atomic read-modify-write instructions
e.g., Test&Set, Fetch&Add, Swap
vs
• Non-blocking atomic read-modify-write instructions
e.g., Compare&Swap, Load-reserve/Store-conditional
vs
• Protocols based on ordinary Loads and Stores
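As a concrete illustration of the first option (not from the lecture), here is a minimal spin lock built on a blocking atomic swap (Test&Set), sketched in C11 with <stdatomic.h>; the spinlock_t type and function names are assumptions for the example.

#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;   /* 0 = free, 1 = held */

void spin_lock(spinlock_t *l) {
    /* atomic_exchange is the swap: write 1 and get the old value back.
       If the old value was already 1, another process holds the lock; retry. */
    while (atomic_exchange(&l->locked, 1) != 0)
        ;   /* spin */
}

void spin_unlock(spinlock_t *l) {
    atomic_store(&l->locked, 0);   /* mark the lock free again */
}

The default sequentially consistent ordering of these C11 operations supplies the fencing a weaker hardware memory model would otherwise require; the tight spin loop is also the source of the coherence ping-ponging discussed near the end of the lecture.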
Issues in Implementing
Sequential Consistency
[Figure: several processors P sharing a single memory M]
Implementation of SC is complicated by two issues
• Out-of-order execution capability
(may the hardware reorder the pair?)
Load(a); Load(b)     yes
Load(a); Store(b)    yes, if a ≠ b
Store(a); Load(b)    yes, if a ≠ b
Store(a); Store(b)   yes, if a ≠ b
• Caches
Caches can prevent the effect of a store from
being seen by other processors
SC complications motivate architects to consider
weak or relaxed memory models
Memory Fences
Instructions to sequentialize memory accesses
Processors with relaxed or weak memory models (i.e., models that
permit Loads and Stores to different addresses to be reordered)
need to provide memory fence instructions to force the
serialization of memory accesses
Examples of processors with relaxed memory models:
Sparc V8 (TSO,PSO): Membar
Sparc V9 (RMO):
Membar #LoadLoad, Membar #LoadStore
Membar #StoreLoad, Membar #StoreStore
PowerPC (WO): Sync, EIEIO
Memory fences are expensive operations; however, one pays the
cost of serialization only when it is required
Using Memory Fences
[Figure: a producer and a consumer communicating through a buffer in memory, indexed by a tail pointer (producer) and a head pointer (consumer)]

Producer posting Item x:
  Load Rtail, (tail)
  Store (Rtail), x
  MembarSS              (ensures that the tail pointer is not updated before x has been stored)
  Rtail = Rtail + 1
  Store (tail), Rtail

Consumer:
      Load Rhead, (head)
spin: Load Rtail, (tail)
      if Rhead == Rtail goto spin
      MembarLL          (ensures that R is not loaded before x has been stored)
      Load R, (Rhead)
      Rhead = Rhead + 1
      Store (head), Rhead
      process(R)
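A rough C11 translation of the same producer/consumer protocol (not the lecture's code: the names, the 256-entry buffer, the single-producer/single-consumer assumption, and the omitted full-queue check are mine), with explicit fences standing in for MembarSS and MembarLL:

#include <stdatomic.h>

#define QSIZE 256
static int        buf[QSIZE];
static atomic_int head;            /* next slot the consumer will read  */
static atomic_int tail;            /* next slot the producer will write */

void produce(int x) {
    int t = atomic_load_explicit(&tail, memory_order_relaxed);
    buf[t % QSIZE] = x;                                          /* Store (Rtail), x    */
    atomic_thread_fence(memory_order_release);                   /* MembarSS            */
    atomic_store_explicit(&tail, t + 1, memory_order_relaxed);   /* Store (tail), Rtail */
}

int consume(void) {
    int h = atomic_load_explicit(&head, memory_order_relaxed);
    while (atomic_load_explicit(&tail, memory_order_relaxed) == h)
        ;                                                        /* spin: queue is empty */
    atomic_thread_fence(memory_order_acquire);                   /* MembarLL            */
    int x = buf[h % QSIZE];                                      /* Load R, (Rhead)     */
    atomic_store_explicit(&head, h + 1, memory_order_relaxed);   /* Store (head), Rhead */
    return x;
}

The release fence keeps the store of x ahead of the tail update, and the acquire fence keeps the load of the data behind the observation of the new tail, which are exactly the two orderings the Membars enforce.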
Memory Coherence in SMPs
[Figure: CPU-1 and CPU-2, each with a private cache holding A = 100, attached to the CPU-Memory bus; memory also holds A = 100]
Suppose CPU-1 updates A to 200.
write-back: memory and cache-2 have stale values
write-through: cache-2 has a stale value
Do these stale values matter?
What is the view of shared memory for programming?
Write-back Caches & SC
• T1 is executed
prog T1
ST X, 1
ST Y,11
• cache-1 writes back Y
• T2 executed
• cache-1 writes back X
• cache-2 writes back
X’ & Y’
April 15, 2010
cache-1
X= 1
Y=11
memory
X=0
Y =10
X’=
Y’=
X= 1
Y=11
X=0
Y =11
X’=
Y’=
Y=
Y’=
X=
X’=
X= 1
Y=11
X=0
Y =11
X’=
Y’=
X=1
Y =11
X’=
Y’=
Y = 11
Y’= 11
X=0
X’= 0
Y = 11
Y’= 11
X=0
X’= 0
X=1
Y =11
X’= 0
Y’=11
Y =11
Y’=11
X=0
X’= 0
X= 1
Y=11
X= 1
Y=11
CS152, Spring 2010
cache-2
Y=
Y’=
X=
X’=
prog T2
LD Y, R1
ST Y’, R1
LD X, R2
ST X’,R2
9
Write-through Caches & SC
prog T1
ST X, 1
ST Y,11
• T1 executed
• T2 executed
cache-1
X= 0
Y=10
memory
X=0
Y =10
X’=
Y’=
cache-2
Y=
Y’=
X=0
X’=
X= 1
Y=11
X=1
Y =11
X’=
Y’=
Y=
Y’=
X=0
X’=
X= 1
Y=11
X=1
Y =11
X’= 0
Y’=11
Y = 11
Y’= 11
X=0
X’= 0
prog T2
LD Y, R1
ST Y’, R1
LD X, R2
ST X’,R2
Write-through caches don’t preserve
sequential consistency either
Cache Coherence vs.
Memory Consistency
• A cache coherence protocol ensures that all writes
by one processor are eventually visible to other
processors
– i.e., updates are not lost
• A memory consistency model gives the rules on
when a write by one processor can be observed by a
read on another
– Equivalently, what values can be seen by a load
• A cache coherence protocol is not enough to ensure
sequential consistency
– But if sequentially consistent, then caches must be coherent
• Combination of cache coherence protocol plus
processor memory reorder buffer implements a given
machine’s memory consistency model
Maintaining Cache Coherence
Hardware support is required such that
• only one processor at a time has write
permission for a location
• no processor can load a stale copy of
the location after a write
⇒ cache coherence protocols
Warmup: Parallel I/O
[Figure: a processor with its cache, and a DMA engine attached to a disk, both connected to physical memory over the memory bus (address, data, and R/W lines). Page transfers occur while the processor is running; either the cache or the DMA engine can be the bus master and effect transfers.]

(DMA stands for Direct Memory Access; it means the I/O device can read/write memory autonomously from the CPU)
Problems with Parallel I/O
[Figure: cached portions of a page sit in the processor's cache while the DMA engine transfers that page between physical memory and the disk]

Memory → Disk transfer: physical memory may be stale if the cache copy is dirty
Disk → Memory transfer: the cache may hold stale data and not see the memory writes
Snoopy Cache (Goodman 1983)
• Idea: Have cache watch (or snoop upon) DMA
transfers, and then “do the right thing”
• Snoopy cache tags are dual-ported
[Figure: the cache's tag and state array is dual-ported. One address/data/R/W port faces the processor and is used to drive the memory bus when the cache is bus master; the second is a snoopy read port (address, R/W) attached to the memory bus.]
Snoopy Cache Actions for DMA

Observed Bus Cycle           Cache State            Cache Action
DMA Read (Memory → Disk)     Address not cached     No action
                             Cached, unmodified     No action
                             Cached, modified       Cache intervenes
DMA Write (Disk → Memory)    Address not cached     No action
                             Cached, unmodified     Cache purges its copy
                             Cached, modified       ???
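One way to read the table: the snooping cache controller's decision, sketched in C (the enum and function names are assumptions, not part of the lecture).

typedef enum { DMA_READ, DMA_WRITE } bus_cycle_t;                        /* observed bus cycle */
typedef enum { NOT_CACHED, CACHED_CLEAN, CACHED_MODIFIED } line_state_t;
typedef enum { NO_ACTION, INTERVENE, PURGE_COPY, UNDEFINED } snoop_action_t;

snoop_action_t snoop_dma(bus_cycle_t cycle, line_state_t state) {
    if (cycle == DMA_READ) {                            /* memory -> disk transfer */
        return (state == CACHED_MODIFIED) ? INTERVENE   /* cache supplies the dirty data */
                                          : NO_ACTION;
    } else {                                            /* DMA_WRITE: disk -> memory transfer */
        if (state == NOT_CACHED)   return NO_ACTION;
        if (state == CACHED_CLEAN) return PURGE_COPY;   /* invalidate the now-stale copy */
        return UNDEFINED;                               /* cached, modified: the "???" case */
    }
}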
CS152 Administrivia
Shared Memory Multiprocessor
[Figure: processors M1, M2, and M3, each behind its own snoopy cache, share the memory bus with physical memory and a DMA engine connected to disks]
Use snoopy mechanism to keep all processors’
view of memory coherent
Snoopy Cache Coherence Protocols
write miss:
the address is invalidated in all other
caches before the write is performed
read miss:
if a dirty copy is found in some cache, a writeback is performed before the memory is read
Cache State Transition Diagram
The MSI protocol

Each cache line has state bits alongside its address tag:
  M: Modified   S: Shared   I: Invalid

Transitions for a line's state in processor P1's cache:
• I → M on P1's write miss (P1 gets the line from memory)
• I → S on P1's read miss (P1 gets the line from memory)
• S → M when P1 writes (P1 signals intent to write)
• S stays S while the line is read by any processor
• S → I when another processor signals intent to write
• M stays M while P1 reads or writes the line
• M → S when another processor reads (P1 writes back)
• M → I when another processor signals intent to write (P1 writes back)
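The diagram can also be written as a per-line state machine for P1's cache. The sketch below (C, with names chosen for the example; it is not part of the lecture) distinguishes events generated locally by P1 from events snooped off the bus:

#include <stdio.h>

typedef enum { I, S, M } msi_t;
typedef enum { P1_READ, P1_WRITE,              /* local processor accesses  */
               BUS_READ, BUS_INTENT_TO_WRITE   /* snooped from other caches */
             } event_t;

msi_t msi_next(msi_t st, event_t ev, int *writeback) {
    *writeback = 0;
    switch (st) {
    case I:
        if (ev == P1_READ)  return S;             /* read miss: line fetched from memory    */
        if (ev == P1_WRITE) return M;             /* write miss: fetch, others invalidated  */
        return I;
    case S:
        if (ev == P1_WRITE)            return M;  /* upgrade: broadcast intent to write     */
        if (ev == BUS_INTENT_TO_WRITE) return I;  /* another processor will write           */
        return S;                                 /* reads by any processor stay Shared     */
    case M:
        if (ev == BUS_READ)            { *writeback = 1; return S; }
        if (ev == BUS_INTENT_TO_WRITE) { *writeback = 1; return I; }
        return M;                                 /* P1 reads and writes hit locally        */
    }
    return st;
}

int main(void) {
    int wb;
    msi_t st = I;
    st = msi_next(st, P1_READ,  &wb);   /* I -> S */
    st = msi_next(st, P1_WRITE, &wb);   /* S -> M */
    st = msi_next(st, BUS_READ, &wb);   /* M -> S, with a writeback */
    printf("final state %d, writeback %d\n", st, wb);
    return 0;
}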
Two Processor Example
(Reading and writing the same cache line)

Access sequence: P1 reads, P1 writes, P2 reads, P2 writes, P1 reads, P1 writes, P2 writes

[Figure: two copies of the MSI diagram, one for the line's state in P1's cache and one for P2's. In P1's diagram the bus events are P2's: "P2 reads, P1 writes back" (M → S) and "P2 intent to write" (M → I, S → I); in P2's diagram the roles are reversed ("P1 reads, P2 writes back", "P1 intent to write").]
Observation
[Figure: the MSI state-transition diagram from the previous slides]
• If a line is in the M state then no other cache can have
a copy of the line!
– Memory stays coherent, multiple differing copies cannot exist
MESI: An Enhanced MSI Protocol
(increased performance for private data)

Each cache line has state bits alongside its address tag:
  M: Modified Exclusive   E: Exclusive but unmodified   S: Shared   I: Invalid

Transitions for a line's state in processor P1's cache:
• I → M on P1's write miss
• I → E on a read miss when no other cache holds the line ("read miss, not shared")
• I → S on a read miss when the line is shared ("read miss, shared")
• E stays E on P1 reads; E → M on a P1 write (no bus transaction, the line is private)
• E → S when another processor reads; E → I when another processor signals intent to write
• S → M when P1 signals intent to write; S stays S on reads by any processor; S → I on another processor's intent to write
• M stays M on P1 reads or writes; M → S when another processor reads (P1 writes back); M → I when another processor signals intent to write (P1 writes back)
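The part MESI adds over MSI is the E state, which lets a write to a private (unshared) line upgrade silently to M with no bus transaction. A small sketch of just that decision (names are mine, not the lecture's):

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* New state after a local write; *bus_invalidate says whether an
   intent-to-write must be broadcast on the bus first. */
mesi_t mesi_local_write(mesi_t st, int *bus_invalidate) {
    switch (st) {
    case MODIFIED:  *bus_invalidate = 0; return MODIFIED;  /* already exclusive and dirty      */
    case EXCLUSIVE: *bus_invalidate = 0; return MODIFIED;  /* silent upgrade: line is private  */
    case SHARED:    *bus_invalidate = 1; return MODIFIED;  /* other copies must be invalidated */
    default:        *bus_invalidate = 1; return MODIFIED;  /* write miss from INVALID          */
    }
}

In plain MSI the same unshared line would have been held in S, so the write would have paid a bus transaction; avoiding it is the performance gain for private data.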
Optimized Snoop with Level-2 Caches
[Figure: four CPUs, each with a small L1 cache backed by a larger L2 cache; each node's snooper sits on the L2 and watches the shared bus]
• Processors often have two-level caches
• small L1, large L2 (usually both on chip now)
• Inclusion property: entries in L1 must be in L2
invalidation in L2 ⇒ invalidation in L1
• Snooping on L2 does not affect CPU-L1 bandwidth
What problem could occur?
Intervention
[Figure: CPU-1's cache holds the modified value A = 200; CPU-2's cache has no copy; memory still holds the stale value A = 100]
When a read miss for A occurs in cache-2,
a read request for A is placed on the bus
• Cache-1 needs to supply the data & change its state to Shared
• The memory may respond to the request also!
  Does memory know it has stale data?
Cache-1 needs to intervene through the memory
controller to supply the correct data to cache-2
False Sharing
[Cache block layout: state | block address | data0 | data1 | ... | dataN]

A cache block contains more than one word.
Cache coherence is done at the block level, not the word level.
Suppose M1 writes word_i and M2 writes word_k, and both words
have the same block address. What can happen?
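What can happen is false sharing: each store invalidates the other processor's copy of the whole block even though the two processors never touch the same word. A sketch of the effect and the usual fix (C with pthreads; the 64-byte block size and all names are assumptions; compile with -pthread):

#include <pthread.h>
#include <stdint.h>

#define BLOCK 64   /* assumed cache-block size in bytes */

/* If the two counters were adjacent words of one block, every increment by
   one thread would invalidate the other thread's copy of that block (false
   sharing). Padding each counter to a full block gives each thread its own
   block, so the lines stay put in their caches. */
struct padded { int64_t word; char pad[BLOCK - sizeof(int64_t)]; }
    counters[2] __attribute__((aligned(BLOCK)));

static void *bump(void *arg) {
    struct padded *c = arg;
    for (long n = 0; n < 10 * 1000 * 1000; n++)
        c->word++;                       /* each thread stays in its own block */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, bump, &counters[0]);
    pthread_create(&t1, NULL, bump, &counters[1]);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}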
Synchronization and Caches:
Performance Issues
Processors 1, 2, and 3 each spin on the same lock with a blocking swap (Test&Set):

    L: swap (mutex), R;
       if <R> then goto L;
       <critical section>
       M[mutex] ← 0;

[Figure: the three processors' caches sit on the CPU-Memory bus; the cache line holding mutex = 1 bounces between them]

Cache-coherence protocols will cause mutex to ping-pong
between P1's and P2's caches.
Ping-ponging can be reduced by first reading the mutex
location (non-atomically) and executing a swap only if it is
found to be zero (sketched below).
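The "read first, then swap" optimization is test-and-test-and-set. A sketch in C11 (not the lecture's code; names are assumptions): the spin loop uses an ordinary load, so the line sits Shared in the spinning processor's cache and generates no bus traffic until the lock actually looks free.

#include <stdatomic.h>

typedef struct { atomic_int locked; } ttas_lock_t;   /* 0 = free, 1 = held */

void ttas_lock(ttas_lock_t *m) {
    for (;;) {
        /* Test: spin on a plain load; the cached copy stays Shared, no ping-pong. */
        while (atomic_load(&m->locked) != 0)
            ;
        /* Test&Set: only now do the swap, the one access that invalidates
           other copies. Another processor may still win the race, so loop. */
        if (atomic_exchange(&m->locked, 1) == 0)
            return;
    }
}

void ttas_unlock(ttas_lock_t *m) {
    atomic_store(&m->locked, 0);
}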
Load-reserve & Store-conditional
Special register(s) hold the reservation flag and address,
and the outcome of the store-conditional.

Load-reserve R, (a):
  <flag, adr> ← <1, a>;
  R ← M[a];

Store-conditional (a), R:
  if <flag, adr> == <1, a>
  then cancel other procs' reservation on a;
       M[a] ← <R>;
       status ← succeed;
  else status ← fail;

If the snooper sees a store transaction to the address
in the reserve register, the reserve flag is cleared to 0
• Several processors may reserve 'a' simultaneously
• These instructions behave like ordinary loads and stores
  with respect to bus traffic
The reservation can be implemented using cache hit/miss, so no
additional hardware is required (problems?)
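The retry idiom these instructions support can be sketched in C11 using atomic_compare_exchange_weak as a portable stand-in; this is an illustration, not the lecture's code. On load-reserve/store-conditional machines (RISC-V's lr/sc, for example) compilers typically expand the weak compare-and-swap into exactly such a reserve/try/retry loop, and the "weak" form is allowed to fail spuriously, just as a store-conditional can.

#include <stdatomic.h>

/* Atomic fetch-and-add built from a reserve/retry loop. */
int fetch_and_add(atomic_int *a, int delta) {
    int old = atomic_load(a);                 /* like Load-reserve R, (a) */
    while (!atomic_compare_exchange_weak(a, &old, old + delta))
        ;   /* store-conditional failed: 'old' was refreshed, compute and try again */
    return old;
}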
Out-of-Order Loads/Stores & CC
[Figure: the CPU issues loads/stores through load/store buffers into the cache (lines in I/S/E states); across the CPU/Memory interface the cache sends S-req/E-req and pushouts (Wb-rep) to memory and receives S-rep/E-rep, while a snooper watches the bus for Wb-req, Inv-req, and Inv-rep]

Blocking caches:
  One request at a time + CC ⇒ SC

Non-blocking caches:
  Multiple requests (to different addresses) concurrently + CC ⇒ relaxed memory models

CC ensures that all processors observe the same
order of loads and stores to an address
Acknowledgements
• These slides contain material developed and
copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
• MIT material derived from course 6.823
• UCB material derived from course CS252