Lecture 10: Cache Coherence
2015-12-01
Zebo Peng, IDA, LiTH
TDTS 08 – Lecture 10

 Introduction
 Directory protocols
 Snoopy protocols
 L1-L2 consistency
Review: What is a Cache?

 A small, fast memory used to improve memory-system performance.
 It exploits spatial and temporal locality.
 We conceptually have many caches:
 The registers act as a cache for the first-level cache;
 the first-level cache acts as a cache for the second-level cache;
 the second-level cache acts as a cache for main memory; and
 main memory acts as a cache for the hard disk (virtual memory).

[Figure: the memory hierarchy from Processor and Regs through L1-Cache, L2-Cache, and Main Memory down to Disk, Tape, etc.; levels get faster toward the top and bigger toward the bottom.]
What Happens with a Multiprocessor?

[Figure: processors Pi and Pj, each with a private cache ($), sharing memory M and I/O over a bus.]

 Different processors may access values at the same memory location.
 Multiple copies of the same data may exist in different caches.
 How do we ensure data integrity at all times?
 An update by a processor at time t should be available to the other processors at time t+1.
 I/O may also address main memory directly.
 In particular, WRITE operations must be carefully coordinated.
Write Through

 All writes go to main memory as well as to the cache.
 Cache and memory are consistent.
 Writes slow down to main-memory speed.
 OK, since the write percentage is small (ca. 15%).
For a multiprocessor:
 There is additional inconsistency due to other cache copies of the same memory location.
 Multiple CPUs must therefore monitor main-memory traffic to keep their local caches up to date.
 This may lead to lots of traffic and monitoring activity.
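The write-through policy can be condensed into a toy Python sketch (class and variable names are my own, not from the lecture): every store updates both the cache line and the backing memory, so the two never diverge.

```python
class WriteThroughCache:
    """Toy model of a single write-through cache (illustrative names)."""

    def __init__(self, memory):
        self.memory = memory   # backing store: addr -> value
        self.lines = {}        # cached copies: addr -> value

    def read(self, addr):
        if addr not in self.lines:           # miss: fill from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value             # update the cache line ...
        self.memory[addr] = value            # ... and main memory, immediately


mem = {0x10: 1}
cache = WriteThroughCache(mem)
cache.write(0x10, 42)
assert mem[0x10] == 42   # cache and memory stay consistent
```

Note that in a multiprocessor, another cache may still hold the old value 1 for this address; that is exactly the remaining inconsistency the slide points out.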
Write Back

 Updates are initially made in the cache only.
 Cache and memory are not consistent by nature.
 An update bit for the cache slot is set when an update occurs.
 If a block is to be replaced, it is written to main memory only if its update bit is set.
For a multiprocessor:
 The other caches also get out of sync!
 A mechanism must be used to maintain cache coherence.
 I/O must also access main memory through the cache.
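The contrast with write-through can be shown with a similarly minimal sketch (again a toy model with hypothetical names): stores only set the line and its update (dirty) bit, and memory catches up only when a dirty block is replaced.

```python
class WriteBackCache:
    """Toy model of a single write-back cache with an update (dirty) bit."""

    def __init__(self, memory):
        self.memory = memory   # backing store: addr -> value
        self.lines = {}        # addr -> (value, update_bit)

    def read(self, addr):
        if addr not in self.lines:           # miss: fill from memory, clean
            self.lines[addr] = (self.memory[addr], False)
        return self.lines[addr][0]

    def write(self, addr, value):
        self.lines[addr] = (value, True)     # cache only; set the update bit

    def replace(self, addr):
        value, dirty = self.lines.pop(addr)
        if dirty:                            # write back only if modified
            self.memory[addr] = value


mem = {0x10: 1}
cache = WriteBackCache(mem)
cache.write(0x10, 42)
assert mem[0x10] == 1    # memory is stale until the block is replaced
cache.replace(0x10)
assert mem[0x10] == 42
```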
Software Solutions

 Based on code analysis:
 Determine which data items may become unsafe for caching, e.g., global variables used and updated by several processes.
 Mark them, so that they are not cached.
 Alternatively, determine the unsafe periods and insert code to enforce cache coherence.
 The compiler and OS deal with the problem.
 The overhead is transferred to compile time.
 Design complexity is transferred from hardware to software.
 However, software tends to make conservative decisions:
 Inefficient cache utilization.
 Low performance of the memory system.
Hardware Solutions

 Use cache coherence protocols:
 Dynamic recognition of potential problems.
 Run-time solution.
 More efficient use of the cache.
 Transparent to the programmer.
 Two main techniques:
 Directory protocols
 Snoopy protocols
Directory Protocols

 A directory is used to collect and maintain information about the copies of data in the caches.
 The directory is stored in the memory system.
 Memory references are checked against the directory.
 Appropriate transfers are performed.
 A memory access may be avoided by copying the data from another cache rather than fetching it from memory.
 This creates a central bottleneck: the directory.
 We can have multiple directories.
 Effective in large-scale systems with complex interconnection schemes, such as NUMA.
Directory Protocol Example

Each node maintains a directory (the Cache Coherence Directory) for a portion of memory and for cache status.
Directory Protocol Example (Cont’d)

[Figure: a directory entry recording that location 798 resides in node 1’s memory (M1).]
Memory Access Sequence

Node 2 processor 3 (P2-3) requests location 798, which is in the memory of node 1:

 P2-3 issues a read request on the snoopy bus of node 2.
 The directory on node 2 recognizes that the location is on node 1.
 Node 2’s directory sends a request to node 1’s directory.
 Node 1’s directory requests the contents of 798.
 Node 1’s memory puts the data on the (node 1 local) bus.
 Node 1’s directory gets the data from the (node 1 local) bus.
 The data is transferred to node 2’s directory.
 Node 2’s directory puts the data on the (node 2 local) bus.
 The data is picked up, put in P2-3’s cache, and delivered to the processor.
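The request path above can be condensed into a small Python simulation (the node layout, names, and data value are invented for illustration): the home node's directory serves the data and records the new sharer.

```python
# Which node's memory "owns" each location (hypothetical layout):
HOME_OF = {798: 1}
node_memory = {1: {798: "value@798"}}   # node 1's local memory
directory = {1: {}}                     # home directory: addr -> set of sharer nodes


def remote_read(requesting_node, addr):
    home = HOME_OF[addr]                 # requester's directory finds the home node
    value = node_memory[home][addr]      # home memory puts the data on its bus
    sharers = directory[home].setdefault(addr, set())
    sharers.add(requesting_node)         # home directory notes the new copy
    return value                         # data travels back to the requester


# P2-3's read of location 798, issued via node 2:
data = remote_read(2, 798)
assert data == "value@798"
assert directory[1][798] == {2}   # node 1 now knows node 2 holds a copy
```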
Cache Coherence Operations

 Node 1’s directory keeps a note that node 2 has a copy of the data (at location 798).
 If the data is modified in a cache, this is broadcast to the other nodes.
 The local directories monitor such broadcasts and purge their local caches if necessary.
 The local directory also monitors changes to local data in remote caches and marks the memory invalid until write-back.
 The local directory forces a write-back if the memory location is requested by another processor.
Snoopy Protocols

 Cache coherence responsibility is distributed among all the cache controllers.
 The initiating cache controller recognizes that a line is shared.
 Updates are announced to all the other caches.
 Each cache controller “snoops” on the network to observe these broadcast notifications and reacts accordingly.
 Ideally suited to a bus-based multiprocessor system:
 The bus provides a simple means of broadcasting and snooping.
 Increased bus traffic due to broadcasting and snooping.
 Two approaches: write invalidate and write update.

[Figure: several processors, each with a cache ($), snooping a shared bus to memory M.]
Write Invalidate SP

 Suitable for multiple readers, but one writer at a time.
 Generally, a line may be shared among several caches for reading purposes.
 When one of the caches wants to write to the line, it first issues a notice to invalidate the line in all the other caches.
 The writing processor then has exclusive access until the line is required by another processor.
 Used in Pentium and PowerPC systems.
 The state of every cache line is marked as Modified, Exclusive, Shared, or Invalid.
 The MESI protocol.
Snoopy Cache Organization

[Figure: several processors, each with a cache, cache tags, and snoop hardware, connected to memory over a shared address/data bus; the snoop hardware also observes the Dirty signal.]

The MESI states:
 Modified: the line is modified and differs from memory.
 Exclusive: the line is the same as in memory and is in no other cache.
 Shared: the line is the same as in memory and is present in other caches.
 Invalid: the line’s contents are not valid.
MESI State Transition Diagram

[Figure: two state diagrams over the states Invalid, Shared, Exclusive, and Modified. Left: the cache line at the initiating processor, with transitions triggered by processor events: read hit (RH), read miss shared (RMS), write hit (WH), and write miss (WM, a read-with-intent-to-modify). Right: the line in the snooping cache(s), with transitions triggered by bus events: snoop hit on read (SHR) and snoop hit on write (SHW, an invalidate); line-fill and copy-back actions are also shown.]
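The next-state rules of the diagram can be sketched as plain Python functions (a simplification with invented event parameters; it ignores bus timing and the copy-back data movement itself).

```python
M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"


def on_read(state, others_have_copy):
    """Processor read at the initiating cache (RH / RMS)."""
    if state == I:                       # read miss: fill the line
        return S if others_have_copy else E
    return state                         # read hit: state unchanged


def on_write(state):
    """Processor write (WH / WM): invalidate other copies, then modify."""
    return M


def on_snoop(state, bus_event):
    """Reaction of a snooping cache that observes the bus."""
    if state == I:
        return I
    if bus_event == "SHR":               # remote read: share (copy back if Modified)
        return S
    if bus_event == "SHW":               # remote write or invalidate: drop the line
        return I
    return state


assert on_read(I, others_have_copy=False) == E
assert on_write(S) == M          # the Shared copies elsewhere get invalidated
assert on_snoop(M, "SHR") == S   # dirty data is copied back, line becomes Shared
assert on_snoop(S, "SHW") == I
```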
Write Update SP

 Works well with multiple readers and writers.
 The updated word is distributed to all the other processors.

[Figure: several processors, each with a cache ($), connected over a bus to memory M.]

 It may generate many unnecessary updates:
 If a processor just reads a value once and does not need it again; or
 If a processor updates a value many times before it is read by the other processors (bad programming).
Invalidate vs. Update Protocols

 An update protocol may generate many unnecessary cache updates.
 However, if two processors make interleaved reads and updates to a variable, an update protocol is better.
 An invalidate protocol may lead to many memory accesses.
 Both protocols suffer from false-sharing overheads:
 Two words are not actually shared, yet they lie on the same cache line.

[Figure: several processors, each with a cache ($), connected over a bus to memory M.]

 Most modern machines use invalidate protocols, since the usual situation is one writer with many readers.
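False sharing can be made concrete with a toy write-invalidate model (the line size and addresses are invented): coherence is tracked per line, not per word, so a write to one word evicts another processor's cached copy of a different word on the same line.

```python
LINE_SIZE = 64   # bytes per cache line (illustrative)


def line_of(addr):
    return addr // LINE_SIZE   # coherence is tracked per line, not per word


caches = {0: set(), 1: set()}  # which line numbers each CPU currently caches


def read(cpu, addr):
    caches[cpu].add(line_of(addr))


def write(cpu, addr):
    line = line_of(addr)
    caches[cpu].add(line)
    for other, lines in caches.items():    # write-invalidate broadcast
        if other != cpu:
            lines.discard(line)


read(0, 0)      # CPU 0 reads the word at address 0
read(1, 8)      # CPU 1 reads a *different* word on the same 64-byte line
write(0, 0)     # CPU 0 writes its own word ...
assert line_of(8) not in caches[1]   # ... yet CPU 1's copy is invalidated
```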
Directory vs. Snoopy Schemes

 Snoopy caches:
 Each coherence operation is sent to all processors.
 This generates large traffic, which is an inherent limitation.
 Easy to implement on a bus-based system.
 Not feasible for machines with memory distributed across a large number of sub-systems.
 Directory caches:
 The need for a broadcast medium is replaced by the directory.
 The additional information stored in the directory may add significant overhead.
 The underlying network must be able to carry all the coherence requests.
 The directory is a point of contention; therefore, distributed directory schemes are often used.
L1-L2 Cache Consistency

 Cache coherence techniques apply only to caches connected to a bus or other interconnection mechanism, typically the L2 caches.
 However, processors often have L1 caches that are not connected to a bus, so no snoopy protocol can be used for them.
L1-L2 Cache Consistency (Cont’d)

Solution: extend the cache coherence protocols to the L1 caches:

 Each L1 line should keep track of the state of the corresponding L2 line, and L1 should write through to L2.
 This requires that:
 L1 must be a subset of L2.
 The associativity of the L2 cache must be equal to or greater than that of the L1 cache.
 • E.g., if L2 is 2-way set-associative while L1 is 4-way set-associative, it doesn’t work.
 If L1 has a write-back policy, the interaction between L1 and L2 is more complex.

[Figure: processor P with an L1 cache backed by an L2 cache.]
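The arrangement above can be sketched as a toy two-level cache (hypothetical structure, ignoring sets and associativity): L1 writes through to L2, every L1 line is also present in L2 (inclusion), and a bus invalidation observed at L2 is forwarded down to L1.

```python
class TwoLevelCache:
    """Toy model: L1 write-through to L2, with L1 a subset of L2."""

    def __init__(self):
        self.l1 = {}   # addr -> value
        self.l2 = {}   # addr -> value; always a superset of l1

    def write(self, addr, value):
        self.l1[addr] = value
        self.l2[addr] = value      # write-through: L2 sees every L1 write

    def snoop_invalidate(self, addr):
        # A bus invalidation is observed at L2; because L1 is a subset
        # of L2, dropping the L1 copy as well keeps both levels coherent.
        self.l2.pop(addr, None)
        self.l1.pop(addr, None)


c = TwoLevelCache()
c.write(0x20, 7)
assert set(c.l1) <= set(c.l2)   # the inclusion property holds
c.snoop_invalidate(0x20)
assert 0x20 not in c.l1 and 0x20 not in c.l2
```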
An Example: AlphaServer 4100

 A four-processor shared-memory symmetric multiprocessor system.
 Each processor has a three-level cache hierarchy:
 L1 consists of two direct-mapped on-chip caches, one for instructions and one for data.
 • Write-through to L2 with a write buffer.
 L2 is an on-chip three-way set-associative cache with write-back to L3.
 L3 is an off-chip direct-mapped cache with write-back to main memory.
Summary

 Cache coherence in multiprocessor systems is an important issue to be considered.
 Otherwise, performance will suffer.
 Additional hardware is required to coordinate access to data that might have multiple cached copies.
 The underlying technique must guarantee correct semantics.
 Both hardware and software solutions can be used.
 There are several different protocols to choose from for the hardware solutions.