Simple Cache, Cache Coherence, and Parallelism and Memory Hierarchy Summary Notes

SIMPLE CACHE, CACHE COHERENCE
Using a Finite-State Machine to Control a Simple Cache
Simple Cache
- Direct-mapped cache
- Write-back using write allocate
- Block size is four words (16 bytes or 128 bits)
- Cache size is 16 KiB, so it holds 1024 blocks
- 32-bit addresses
- Includes a valid bit and dirty bit per block
The fields of an address for the cache:
- Cache index is 10 bits
- Block offset is 4 bits
- Tag size is 32 - (10 + 4), or 18 bits
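As a quick illustration, here is a small Python sketch (the helper name split_address is hypothetical, not from the text) that pulls these three fields out of a 32-bit byte address using the geometry above:

    OFFSET_BITS = 4     # 16-byte blocks
    INDEX_BITS = 10     # 1024 blocks
    TAG_BITS = 32 - (INDEX_BITS + OFFSET_BITS)   # 18 bits

    def split_address(addr):
        """Split a 32-bit byte address into (tag, index, block offset)."""
        offset = addr & ((1 << OFFSET_BITS) - 1)
        index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
        tag = addr >> (OFFSET_BITS + INDEX_BITS)
        return tag, index, offset

    # Two addresses exactly one cache size (16 KiB) apart map to the same index
    # but have different tags, so they conflict in a direct-mapped cache.
    print(split_address(0x1234))              # (0, 291, 4)
    print(split_address(0x1234 + 16 * 1024))  # (1, 291, 4)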
The signals between the processor and the cache:
- 1-bit Read or Write signal
- 1-bit Valid signal, saying whether there is a cache operation or not
- 32-bit address
- 32-bit data from processor to cache
- 32-bit data from cache to processor
- 1-bit Ready signal, saying the cache operation is complete
The interface between the memory and the cache has the same fields as the one between the processor and the cache, except that the data fields are 128 bits wide. The signals between the memory and the cache:
- 1-bit Read or Write signal
- 1-bit Valid signal, saying whether there is a memory operation or not
- 32-bit address
- 128-bit data from cache to memory
- 128-bit data from memory to cache
- 1-bit Ready signal, saying the memory operation is complete
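One way to picture the two interfaces is as a pair of record types; a minimal Python sketch, with field names that are illustrative rather than taken from the text:

    from dataclasses import dataclass

    @dataclass
    class ProcCacheSignals:      # processor <-> cache interface
        read_write: int          # 1 bit: read or write
        valid: int               # 1 bit: a cache operation is requested
        address: int             # 32-bit address
        data_to_cache: int       # 32-bit data from processor to cache
        data_from_cache: int     # 32-bit data from cache to processor
        ready: int               # 1 bit: the cache operation is complete

    @dataclass
    class CacheMemSignals:       # cache <-> memory interface
        read_write: int          # 1 bit: read or write
        valid: int               # 1 bit: a memory operation is requested
        address: int             # 32-bit address
        data_to_memory: int      # 128-bit block from cache to memory
        data_from_memory: int    # 128-bit block from memory to cache
        ready: int               # 1 bit: the memory operation is complete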
We assume a memory controller that will notify the cache via the Ready signal when the
memory read or write is finished.
Finite-State Machines
The control for a cache must specify both the signals to be set in any step and the next step
in the sequence.
Finite-state machine: A sequential logic function consisting of a set of inputs and outputs, a
next-state function that maps the current state and the inputs to a new state, and an output
function that maps the current state and possibly the inputs to a set of asserted outputs. The
implementation usually assumes that all outputs that are not explicitly asserted are
deasserted.
Next-state function: A combinational function that, given the inputs and the current state,
determines the next state of a finite-state machine.
Mux controls select one of their inputs whether the control value is 0 or 1, so in a finite-state machine we must always specify the setting of all the mux controls.
A finite-state machine can be implemented with a temporary register that holds the current
state and a block of combinational logic that determines both the data-path signals to be
asserted and the next state.
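A minimal sketch of that structure in Python, using a made-up two-state machine purely for illustration: a variable plays the role of the state register, and table lookups play the role of the combinational next-state and output functions.

    # Hypothetical FSM: wait for a request, then acknowledge it for one cycle.
    NEXT_STATE = {
        ("WAIT", 0): "WAIT",
        ("WAIT", 1): "ACK",
        ("ACK", 0): "WAIT",
        ("ACK", 1): "WAIT",
    }
    OUTPUTS = {"WAIT": {"ack": 0}, "ACK": {"ack": 1}}   # unlisted outputs are deasserted

    state = "WAIT"                            # the state register
    for request in [0, 1, 0, 1, 1]:           # one input sample per clock cycle
        print(state, OUTPUTS[state])          # output function of the current state
        state = NEXT_STATE[(state, request)]  # next-state function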
FSM for a Simple Cache Controller
- Idle: This state waits for a valid read or write request from the processor, which moves the FSM to the Compare Tag state.
- Compare Tag: Tests whether the requested read or write is a hit or a miss. If the data in the cache block referred to by the index portion of the address are valid, and the tag portion of the address matches the tag, then it is a hit. The data are read from the selected word if it is a load, or written to it if it is a store. The Cache Ready signal is then set. If it is a write, the dirty bit is set to 1; the valid bit and tag field are also set. If it is a hit and the block is valid, the FSM returns to the Idle state. A miss first updates the cache tag and then goes either to the Write-Back state, if the block at this location has a dirty bit value of 1, or to the Allocate state if it is 0.
- Write-Back: Writes the 128-bit block to memory using the address composed from the tag and cache index. We remain in this state waiting for the Ready signal from memory. When the memory write is complete, the FSM goes to the Allocate state.
- Allocate: The new block is fetched from memory. We remain in this state waiting for the Ready signal from memory. When the memory read is complete, the FSM goes to the Compare Tag state.
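A simplified Python sketch of these transitions (the tag comparison and memory timing are collapsed into boolean inputs, and the function is an assumption for illustration, not the book's control logic):

    def next_state(state, cpu_valid, hit, dirty, mem_ready):
        """Next-state function for the four-state cache controller."""
        if state == "IDLE":
            return "COMPARE_TAG" if cpu_valid else "IDLE"
        if state == "COMPARE_TAG":
            if hit:
                return "IDLE"                  # hit: read/write done, Cache Ready set
            return "WRITE_BACK" if dirty else "ALLOCATE"
        if state == "WRITE_BACK":
            return "ALLOCATE" if mem_ready else "WRITE_BACK"
        if state == "ALLOCATE":
            return "COMPARE_TAG" if mem_ready else "ALLOCATE"
        raise ValueError(state)

    # A miss to a dirty block: Idle -> Compare Tag -> Write-Back -> Allocate -> Compare Tag -> Idle
    s = "IDLE"
    for cpu_valid, hit, dirty, mem_ready in [(1, 0, 1, 0), (0, 0, 1, 0),
                                             (0, 0, 1, 1), (0, 0, 0, 1), (0, 1, 0, 0)]:
        s = next_state(s, cpu_valid, hit, dirty, mem_ready)
        print(s)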
This simple model could easily be extended with more states to try to improve performance. For example, the Compare Tag state does both the compare and the read or write of the cache data in a single clock cycle, which can lengthen the clock cycle; splitting that work across more states could shorten it. Another optimization would be to add a write buffer so that we could save the dirty block and then read the new block first, so that the processor doesn't have to wait for two memory accesses on a dirty miss.
Cache Coherence
Cache coherence problem: Multiple processors on a single chip are likely to share a common physical address space. When they cache shared data, because each processor's view of memory is through its own cache, two processors could end up seeing two different values for the same memory location.
Informally, a memory system is coherent if any read of a data item returns the most recently
written value of that data item.
Two different aspects of memory system behaviour:
1) Coherence: defines what values can be returned by a read.
2) Consistency: determines when a written value will be returned by a read.
A memory system is coherent if:
1. A read by a processor P to a location X that follows a write by P to X, with no writes
of X by another processor occurring between the write and the read by P, always
returns the value written by P. This preserves program order.
2. A read by a processor to location X that follows a write by another processor to X
returns the written value if the read and write are sufficiently separated in time and
no other writes to X occur between the two accesses. Thus, in Figure 5.39, we need a
mechanism so that the value 0 in the cache of CPU B is replaced by the value 1 after
CPU A stores 1 into memory at address X in time step 3. This defines a coherent view
of memory.
3. Writes to the same location are serialized; two writes to the same location by any two
processors are seen in the same order by all processors. For example, if CPU B stores
2 into memory at address X after time step 3, processors can never read the value at
location X as 2 and then later read it as 1.
Basic Schemes for Enforcing Coherence
Caches provide:
- Migration: A data item can be moved to a local cache and used there in a transparent fashion. Migration reduces both the latency to access a shared data item that is allocated remotely and the bandwidth demand on the shared memory.
- Replication: When shared data are being simultaneously read, the caches make a copy of the data item in the local cache. Replication reduces both the latency of access and contention for a read shared data item.
The protocols to maintain coherence for multiple processors are called cache coherence
protocols. Key to implementing a cache coherence protocol is tracking the state of any
sharing of a data block.
The most popular cache coherence protocol is snooping. Every cache that has a copy of the
data from a block of physical memory also has a copy of the sharing status of the block, but
no centralized state is kept. The caches are all accessible via some broadcast medium (a bus
or network), and all cache controllers monitor or snoop on the medium to determine
whether or not they have a copy of a block that is requested on a bus or switch access.
Any communication medium that broadcasts cache misses to all processors can be used to
implement a snooping-based coherence scheme. This broadcasting makes snooping
protocols simple to implement but also limits their scalability.
Snooping Protocols
One method of enforcing coherence is to ensure that a processor has exclusive access to a
data item before it writes that item. This is called a write invalidate protocol because it
invalidates copies in other caches on a write. Exclusive access ensures that no other readable
or writable copies of an item exist when the write occurs: all other cached copies of the item
are invalidated.
To see how this protocol ensures coherence, consider a write followed by a read by another
processor: since the write requires exclusive access, any copy held by the reading processor
must be invalidated. Thus, when the read occurs, it misses in the cache, and the cache is
forced to fetch a new copy of the data.
For a write, the writing processor must have exclusive access, preventing any other processor
from being able to write simultaneously. If two processors do attempt to write the same data
at the same time, one of them wins the race, causing the other processor’s copy to be
invalidated. For the other processor to complete its write, it must obtain a new copy of the
data, which must now contain the updated value (write serialization).
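A toy Python model of write invalidation (a deliberately simplified sketch, not the book's protocol specification; the write also updates memory here just to keep the example short):

    class Cache:
        def __init__(self, memory):
            self.memory = memory
            self.lines = {}                    # address -> value, valid copies only

        def read(self, addr):
            if addr not in self.lines:         # miss: fetch a fresh copy
                self.lines[addr] = self.memory[addr]
            return self.lines[addr]

        def write(self, caches, addr, value):
            for other in caches:               # broadcast the write address
                if other is not self:
                    other.lines.pop(addr, None)  # snoopers invalidate their copies
            self.lines[addr] = value             # now the only cached copy
            self.memory[addr] = value

    memory = {0x10: 0}
    a, b = Cache(memory), Cache(memory)
    print(b.read(0x10))           # CPU B reads 0
    a.write([a, b], 0x10, 1)      # CPU A writes 1 and invalidates B's copy
    print(b.read(0x10))           # CPU B misses and now sees 1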
Block size plays an important role in cache coherency. For example, take the case of snooping on a cache with a block size of eight words, with a single word alternately written and read by two processors. Most protocols exchange full blocks between processors, thereby increasing coherency bandwidth demands.
Large blocks can also cause what is called false sharing: when two unrelated shared
variables are located in the same cache block, the whole block is exchanged between
processors even though the processors are accessing different variables.
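A small sketch of the layout problem (the 64-byte block size and addresses are assumptions for illustration): two counters that are never shared still fall into the same block, so every write by one processor invalidates the other's copy.

    BLOCK_BYTES = 64                      # assumed cache block size

    def block_number(addr):
        return addr // BLOCK_BYTES

    counter_a = 0x1000                    # written only by processor A
    counter_b = 0x1008                    # written only by processor B, 8 bytes away

    print(block_number(counter_a), block_number(counter_b))   # same block: 64 64
    # The whole block ping-pongs between the two caches even though
    # the processors never touch each other's counter.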
PARALLELISM AND MEMORY HIERARCHY
Redundant Arrays of Inexpensive Disks
Accelerating I/O performance was the original motivation of disk arrays. The argument was
that by replacing a few big disks with many small disks, performance would improve because
there would be more read heads. The flaw in the argument was that disk arrays could make
reliability much worse.
The solution was to add redundancy so that the system could cope with disk failures without
losing information.
Redundant Arrays of Inexpensive Disks (RAID): An organization of disks that uses an array
of small and inexpensive disks so as to increase both performance and reliability
(dependability).
No Redundancy (RAID 0)
Striping: Allocation of logically sequential blocks to separate disks to allow higher
performance than a single disk can deliver. Or, simply spreading data over multiple disks.
Striping across a set of disks makes the collection appear to software as a single large disk,
which simplifies storage management. It also improves performance for large accesses, since
many disks can operate at once.
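A sketch of the mapping striping implies, assuming a simple round-robin layout with the stripe unit equal to one block (real arrays pick their own stripe unit):

    NUM_DISKS = 4

    def locate(logical_block):
        """Map a logical block number to (disk, block offset on that disk)."""
        return logical_block % NUM_DISKS, logical_block // NUM_DISKS

    for blk in range(8):
        print("logical block", blk, "->", locate(blk))
    # Logical blocks 0..3 land on disks 0..3, so a large sequential
    # access keeps all four disks busy at the same time.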
Mirroring (RAID 1)
Mirroring: Writing identical data to multiple disks to increase data availability. Uses twice as
many disks as does RAID 0. Whenever data are written to one disk, that data are also written
to a redundant disk, so that there are always two copies of the information. If a disk fails, the
system just goes to the “mirror” and reads it. Mirroring is the most expensive RAID solution,
since it requires the most disks.
Error Detecting and Correcting Code (RAID 2)
RAID 2 borrows an error detection and correction scheme most often used for memories. It’s
fallen into disuse.
Bit-Interleaved Parity (RAID 3)
The cost of higher availability can be reduced to 1/n, where n is the number of disks in a
protection group.
Protection group: The group of data disks or blocks that share a common check disk or
block.
Rather than have a complete copy of the original data for each disk, we need only add
enough redundant information to restore the lost information on a failure. Reads or writes
go to all disks in the group, with one extra disk to hold the check information in case there is
a failure.
When a disk fails, you subtract all the data in the good disks from the parity disk; the remaining information must be the missing information. Many disks must be read to determine the missing data. The assumption is that taking longer to recover from a failure but spending less on redundant storage is a good trade-off.
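Since the check information is the XOR (parity) of the data disks, the "subtraction" is another XOR over everything that survives. A minimal sketch with made-up contents:

    from functools import reduce

    data_disks = [0b1010, 0b0110, 0b1111]                 # three data disks
    parity = reduce(lambda a, b: a ^ b, data_disks)       # the check disk

    # Disk 1 fails: XOR the surviving data disks with the parity disk.
    surviving = [data_disks[0], data_disks[2], parity]
    rebuilt = reduce(lambda a, b: a ^ b, surviving)
    print(bin(rebuilt), bin(data_disks[1]))               # both 0b110: data recovered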
Block-Interleaved Parity (RAID 4)
Uses the same ratio of data disks and check disks as RAID 3 but the parity is stored as blocks
and associated with a set of data blocks.
In RAID 3, every access went to all disks. However, some applications prefer smaller accesses,
allowing independent accesses to occur in parallel.
Since error detection information in each sector is checked on reads to see if the data are
correct, such “small reads” to each disk can occur independently as long as the minimum
access is one sector. In the RAID context, a small access goes to just one disk in a protection
group while a large access goes to all the disks in a protection group.
Each small write would demand that all other disks be accessed to read the rest of the
information needed to recalculate the new parity. A “small write” would require reading the
old data and old parity, adding the new information, and then writing the new parity to the
parity disk and the new data to the data disk.
The key insight to reduce this overhead is that parity is simply a sum of information; by
watching which bits change when we write the new information, we need only change the
corresponding bits on the parity disk. We must read the old data from the disk being written,
compare old data to the new data to see which bits change, read the old parity, change the
corresponding bits, and then write the new data and new parity. Thus, the small write
involves four disk accesses to two disks instead of accessing all disks. This organization is
RAID 4.
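The shortcut boils down to one XOR identity: new parity = old parity XOR old data XOR new data, so only the written disk and the parity disk are touched. A small sketch with made-up values (the other disks appear only to check the identity):

    other_disks = [0b0011, 0b1110]                 # the disks we want to avoid reading
    old_data, new_data = 0b1010, 0b1001
    old_parity = old_data ^ other_disks[0] ^ other_disks[1]

    # Small write: flip only the parity bits that changed in the written data.
    new_parity = old_parity ^ old_data ^ new_data

    # Same result as recomputing parity from scratch over every disk.
    assert new_parity == new_data ^ other_disks[0] ^ other_disks[1]
    print(bin(new_parity))
    # Four disk accesses to two disks: read old data, read old parity,
    # write new data, write new parity.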
Distributed Block-Interleaved Parity (RAID 5)
One drawback to RAID 4: parity disk must be updated on every write, so parity disk is the
bottleneck for back-to-back writes.
To fix it, the parity information can be spread throughout all the disks so that there is no
single bottleneck for writes. This distribution is RAID 5.
In RAID 5 the parity associated with each row of data blocks is no longer restricted to a
single disk. This organization allows multiple writes to occur simultaneously as long as the
parity blocks are not located on the same disk.
For example, a write to block 8 on the right must also access its parity block P2, thereby
occupying the first and third disks. A second write to block 5 on the right, implying an
update to its parity block P1, accesses the second and fourth disks and thus could occur
concurrently with the write to block 8.
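A sketch of one possible rotating-parity placement (the exact rotation pattern is an assumption; implementations vary), showing why writes in different stripes can touch different parity disks:

    NUM_DISKS = 4

    def parity_disk(stripe):
        """Rotate the parity block one disk to the left each stripe."""
        return (NUM_DISKS - 1 - stripe) % NUM_DISKS

    for stripe in range(4):
        print("stripe", stripe, "-> parity on disk", parity_disk(stripe))
    # Because parity moves from stripe to stripe, back-to-back small writes
    # to different stripes no longer queue behind one dedicated parity disk.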
P + Q Redundancy (RAID 6)
Parity-based schemes protect against a single self-identifying failure. When a single failure
correction is not sufficient, parity can be generalized to have a second calculation over the
data and another check disk of information. This second check block allows recovery from a
second failure. Thus, the storage overhead is twice that of RAID 5. The small write shortcut of
Figure e5.11.2 works as well, except now there are six disk accesses instead of four to update
both P and Q information.
RAID Summary
RAID 1 and RAID 5 are widely used in servers.
One weakness of the RAID systems is repair. First, to avoid making the data unavailable
during repair, the array must be designed to allow the failed disks to be replaced without
having to turn off the system. Second, another failure could occur during repair, so the repair
time affects the chances of losing data: the longer the repair time, the greater the chances of
another failure that will lose data.
Some systems include standby spares so that the data can be reconstructed instantly upon
discovery of the failure. The operator can then replace the failed disks in a more leisurely
fashion.
There are questions about how disk technology changes over time. Although disk
manufacturers quote very high MTTF for their products, those numbers are under nominal
conditions.
The calculation of RAID reliability assumes independence between disk failures, but disk failures could be correlated, because damage from a common environmental cause would likely affect all the disks in the array.
Another concern is that since disk bandwidth is growing more slowly than disk capacity, the time to repair a disk in a RAID system is increasing, which in turn increases the chances of a second failure.
Another concern is that reading much more data during reconstruction means increasing the
chance of an uncorrectable read media failure, which would result in data loss.
Other arguments for concern about simultaneous multiple failures are the increasing number
of disks in arrays and the use of higher-capacity disks.
These trends have led to a growing interest in protecting against more than one failure, and
so RAID 6 is increasingly being offered.