SIMPLE CACHE, CACHE COHERENCE

Using a Finite-State Machine to Control a Simple Cache

Simple Cache
- Direct-mapped cache
- Write-back using write allocate
- Block size is four words (16 bytes or 128 bits)
- Cache size is 16 KiB, so it holds 1024 blocks
- 32-bit addresses
- Includes a valid bit and dirty bit per block

The fields of an address for the cache:
- Cache index is 10 bits
- Block offset is 4 bits
- Tag size is 32 - (10 + 4), or 18 bits

The signals between the processor and the cache:
- 1-bit Read or Write signal
- 1-bit Valid signal, saying whether there is a cache operation or not
- 32-bit address
- 32-bit data from processor to cache
- 32-bit data from cache to processor
- 1-bit Ready signal, saying the cache operation is complete

The interface between memory and cache has the same fields as between processor and cache, except that the data fields are 128 bits wide. The signals between memory and cache:
- 1-bit Read or Write signal
- 1-bit Valid signal, saying whether there is a memory operation or not
- 32-bit address
- 128-bit data from cache to memory
- 128-bit data from memory to cache
- 1-bit Ready signal, saying the memory operation is complete

We assume a memory controller that will notify the cache via the Ready signal when the memory read or write is finished.

Finite-State Machines

The control for a cache must specify both the signals to be set in any step and the next step in the sequence.

Finite-state machine: a sequential logic function consisting of a set of inputs and outputs, a next-state function that maps the current state and the inputs to a new state, and an output function that maps the current state and possibly the inputs to a set of asserted outputs. The implementation usually assumes that all outputs that are not explicitly asserted are deasserted.

Next-state function: a combinational function that, given the inputs and the current state, determines the next state of a finite-state machine.

Because mux controls select one of the inputs whether they are set to 0 or 1, in finite-state machines we always specify the setting of all the mux controls.

A finite-state machine can be implemented with a temporary register that holds the current state and a block of combinational logic that determines both the data-path signals to be asserted and the next state.

FSM for a Simple Cache Controller

Idle: This state waits for a valid read or write request from the processor, which moves the FSM to the Compare Tag state.

Compare Tag: Tests whether the requested read or write is a hit or a miss. If the data in the cache block referred to by the index portion of the address are valid, and the tag portion of the address matches the tag, then it is a hit. If it is a load, the data are read from the selected word; if it is a write, the data are written. The Cache Ready signal is then set. If it is a write, the dirty bit is set to 1, and the valid bit and tag field are also set. If it is a hit and the block is valid, the FSM returns to the Idle state. A miss first updates the cache tag and then goes either to the Write-Back state, if the block at this location has a dirty bit value of 1, or to the Allocate state if it is 0.

Write-Back: Writes the 128-bit block to memory using the address composed from the tag and cache index. We remain in this state waiting for the Ready signal from memory. When the memory write is complete, the FSM goes to the Allocate state.

Allocate: The new block is fetched from memory. We remain in this state waiting for the Ready signal from memory. When the memory read is complete, the FSM goes to the Compare Tag state.
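A minimal C sketch of this controller follows. It shows the address-field extraction and the next-state function for the four states above; the type names, helper functions, and the software representation of a block's metadata are assumptions made for illustration, not a definitive description of the hardware (data-path actions appear only as comments).

```c
#include <stdint.h>
#include <stdbool.h>

/* Address fields for the cache described above:
 * [31:14] tag (18 bits) | [13:4] cache index (10 bits) | [3:0] block offset (4 bits) */
enum { OFFSET_BITS = 4, INDEX_BITS = 10 };

uint32_t block_offset(uint32_t addr) { return addr & 0xFu; }
uint32_t cache_index(uint32_t addr)  { return (addr >> OFFSET_BITS) & 0x3FFu; }
uint32_t tag_field(uint32_t addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }

/* The four controller states of the FSM above. */
typedef enum { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE } cache_state_t;

/* Per-block bookkeeping (hypothetical software layout). */
typedef struct {
    bool     valid;
    bool     dirty;
    uint32_t tag;
} block_meta_t;

/* Next-state function: given the current state and the inputs, return the
 * next state.  Reading/writing the data words and driving the Ready signals
 * belong to the data path and are noted only as comments. */
cache_state_t next_state(cache_state_t state, bool cpu_request, bool is_write,
                         uint32_t addr, block_meta_t *blk, bool mem_ready)
{
    switch (state) {
    case IDLE:
        /* Wait for a valid read or write request from the processor. */
        return cpu_request ? COMPARE_TAG : IDLE;

    case COMPARE_TAG:
        if (blk->valid && blk->tag == tag_field(addr)) {
            /* Hit: read or write the selected word and assert Cache Ready.
             * On a write, the dirty bit (and valid bit/tag field) are set. */
            if (is_write)
                blk->dirty = true;
            return IDLE;
        }
        /* Miss: update the cache tag, then either write back a dirty block
         * or go straight to Allocate. */
        blk->tag   = tag_field(addr);
        blk->valid = true;
        return blk->dirty ? WRITE_BACK : ALLOCATE;

    case WRITE_BACK:
        /* Write the dirty 128-bit block to memory; stay here until the
         * memory controller asserts Ready. */
        if (mem_ready) {
            blk->dirty = false;   /* memory now holds the up-to-date block */
            return ALLOCATE;
        }
        return WRITE_BACK;

    case ALLOCATE:
        /* Fetch the new block from memory; stay here until Ready, then
         * re-run the tag comparison. */
        return mem_ready ? COMPARE_TAG : ALLOCATE;
    }
    return IDLE; /* unreachable */
}
```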
This simple model could easily be extended with more states to try to improve performance. For example, the Compare Tag state does both the compare and the read or write of the cache data in a single clock cycle; splitting these into separate states could allow a shorter clock cycle. Another optimization would be to add a write buffer so that we could save the dirty block and then read the new block first, so that the processor doesn't have to wait for two memory accesses on a dirty miss.

Cache Coherence

Cache coherence problem: multiple processors on a single chip likely share a common physical address space. When caching shared data, because the view of memory held by two different processors is through their individual caches, the two processors could end up seeing two different values for the same location.

Informally, a memory system is coherent if any read of a data item returns the most recently written value of that data item. This definition covers two different aspects of memory system behaviour:
1) Coherence: defines what values can be returned by a read.
2) Consistency: determines when a written value will be returned by a read.

A memory system is coherent if:
1. A read by a processor P to a location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P. This preserves program order.
2. A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses. Thus, in Figure 5.39, we need a mechanism so that the value 0 in the cache of CPU B is replaced by the value 1 after CPU A stores 1 into memory at address X in time step 3. This defines a coherent view of memory.
3. Writes to the same location are serialized; two writes to the same location by any two processors are seen in the same order by all processors. For example, if CPU B stores 2 into memory at address X after time step 3, processors can never read the value at location X as 2 and then later read it as 1. (A small sketch after this section illustrates the access pattern these conditions constrain.)

Basic Schemes for Enforcing Coherence

Caches provide:
Migration: a data item can be moved to a local cache and used there in a transparent fashion. Migration reduces both the latency to access a shared data item that is allocated remotely and the bandwidth demand on the shared memory.
Replication: when shared data are being simultaneously read, the caches make a copy of the data item in the local cache. Replication reduces both the latency of access and contention for a read-shared data item.

The protocols to maintain coherence for multiple processors are called cache coherence protocols. The key to implementing a cache coherence protocol is tracking the state of any sharing of a data block. The most popular cache coherence protocol is snooping: every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block, but no centralized state is kept. The caches are all accessible via some broadcast medium (a bus or network), and all cache controllers monitor or snoop on the medium to determine whether or not they have a copy of a block that is requested on a bus or switch access. Any communication medium that broadcasts cache misses to all processors can be used to implement a snooping-based coherence scheme. This broadcasting makes snooping protocols simple to implement but also limits their scalability.
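To make the access pattern concrete, here is a small C/pthreads sketch of the scenario the coherence conditions describe: one thread plays the role of CPU A writing location X while another plays CPU B reading it. On real hardware the coherence protocol keeps this transparent; the sketch (whose names and values are illustrative assumptions, not from the text) simply names the reads and writes that conditions 1-3 constrain.

```c
#include <pthread.h>
#include <stdio.h>

/* Shared location "X" from the discussion above.  This is deliberately a
 * plain (racy) global to mirror the two-CPU example; a real program would
 * use atomics or locks. */
int X = 0;

/* CPU A: stores 1 into X (time step 3 in the example). */
void *cpu_a(void *arg) {
    (void)arg;
    X = 1;
    return NULL;
}

/* CPU B: reads X.  B's cache may hold a copy of X with the old value 0;
 * coherence requires that a sufficiently later read sees 1, and that all
 * processors see writes to X in the same order. */
void *cpu_b(void *arg) {
    (void)arg;
    printf("CPU B read X = %d\n", X);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, cpu_a, NULL);
    pthread_create(&b, NULL, cpu_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("after both finish, X = %d\n", X);  /* must print 1 */
    return 0;
}
```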
Snooping Protocols

One method of enforcing coherence is to ensure that a processor has exclusive access to a data item before it writes that item. This is called a write invalidate protocol because it invalidates copies in other caches on a write. Exclusive access ensures that no other readable or writable copies of an item exist when the write occurs: all other cached copies of the item are invalidated.

To see how this protocol ensures coherence, consider a write followed by a read by another processor: since the write requires exclusive access, any copy held by the reading processor must be invalidated. Thus, when the read occurs, it misses in the cache, and the cache is forced to fetch a new copy of the data. For a write, the writing processor must have exclusive access, preventing any other processor from being able to write simultaneously. If two processors do attempt to write the same data at the same time, one of them wins the race, causing the other processor's copy to be invalidated. For the other processor to complete its write, it must obtain a new copy of the data, which must now contain the updated value (write serialization).

Block size plays an important role in cache coherency. For example, take the case of snooping on a cache with a block size of eight words, with a single word alternately written and read by two processors. Most protocols exchange full blocks between processors, thereby increasing coherency bandwidth demands. Large blocks can also cause what is called false sharing: when two unrelated shared variables are located in the same cache block, the whole block is exchanged between the processors even though the processors are accessing different variables.

PARALLELISM AND MEMORY HIERARCHY

Redundant Arrays of Inexpensive Disks

Accelerating I/O performance was the original motivation of disk arrays. The argument was that by replacing a few big disks with many small disks, performance would improve because there would be more read heads. The flaw in the argument was that disk arrays could make reliability much worse. The solution was to add redundancy so that the system could cope with disk failures without losing information.

Redundant Arrays of Inexpensive Disks (RAID): an organization of disks that uses an array of small and inexpensive disks so as to increase both performance and reliability (dependability).

No Redundancy (RAID 0)

Striping: allocation of logically sequential blocks to separate disks to allow higher performance than a single disk can deliver; or, simply, spreading data over multiple disks. Striping across a set of disks makes the collection appear to software as a single large disk, which simplifies storage management. It also improves performance for large accesses, since many disks can operate at once.

Mirroring (RAID 1)

Mirroring: writing identical data to multiple disks to increase data availability. Uses twice as many disks as RAID 0. Whenever data are written to one disk, those data are also written to a redundant disk, so that there are always two copies of the information. If a disk fails, the system just goes to the "mirror" and reads it. Mirroring is the most expensive RAID solution, since it requires the most disks.

Error Detecting and Correcting Code (RAID 2)

RAID 2 borrows an error detection and correction scheme most often used for memories. It has fallen into disuse.

Bit-Interleaved Parity (RAID 3)

The cost of higher availability can be reduced to 1/n, where n is the number of disks in a protection group.
Protection group: the group of data disks or blocks that share a common check disk or block.

Rather than have a complete copy of the original data for each disk, we need only add enough redundant information to restore the lost information on a failure. Reads or writes go to all disks in the group, with one extra disk to hold the check information in case there is a failure. When a disk fails, you subtract all the data in the good disks from the parity disk; the remaining information must be the missing information. Many disks must be read to determine the missing data. The assumption is that taking longer to recover from failure but spending less on redundant storage is a good trade-off.

Block-Interleaved Parity (RAID 4)

RAID 4 uses the same ratio of data disks and check disks as RAID 3, but the parity is stored as blocks and associated with a set of data blocks. In RAID 3, every access went to all disks. However, some applications prefer smaller accesses, allowing independent accesses to occur in parallel. Since the error detection information in each sector is checked on reads to see if the data are correct, such "small reads" to each disk can occur independently as long as the minimum access is one sector. In the RAID context, a small access goes to just one disk in a protection group, while a large access goes to all the disks in a protection group.

Naively, each small write would demand that all the other disks be accessed to read the rest of the information needed to recalculate the new parity. The key insight that reduces this overhead is that parity is simply a sum of information; by watching which bits change when we write the new information, we need only change the corresponding bits on the parity disk. A "small write" therefore requires reading the old data and the old parity, comparing the old data to the new data to see which bits change, changing the corresponding parity bits, and then writing the new data and the new parity. Thus, the small write involves four disk accesses to two disks instead of accessing all disks (a short sketch of this parity arithmetic appears at the end of the RAID summary below). This organization is RAID 4.

Distributed Block-Interleaved Parity (RAID 5)

One drawback of RAID 4 is that the parity disk must be updated on every write, so the parity disk is the bottleneck for back-to-back writes. To fix this, the parity information can be spread throughout all the disks so that there is no single bottleneck for writes. This distribution is RAID 5. In RAID 5 the parity associated with each row of data blocks is no longer restricted to a single disk. This organization allows multiple writes to occur simultaneously as long as the parity blocks are not located on the same disk. For example, a write to block 8 must also access its parity block P2, thereby occupying the first and third disks. A second write, to block 5, implying an update to its parity block P1, accesses the second and fourth disks and thus could occur concurrently with the write to block 8.

P + Q Redundancy (RAID 6)

Parity-based schemes protect against a single self-identifying failure. When single-failure correction is not sufficient, parity can be generalized to have a second calculation over the data and another check disk of information. This second check block allows recovery from a second failure. Thus, the storage overhead is twice that of RAID 5.
The small write shortcut of Figure e5.11.2 works here as well, except that now there are six disk accesses instead of four to update both the P and Q information.

RAID Summary

RAID 1 and RAID 5 are widely used in servers. One weakness of RAID systems is repair. First, to avoid making the data unavailable during repair, the array must be designed to allow the failed disks to be replaced without having to turn off the system. Second, another failure could occur during repair, so the repair time affects the chances of losing data: the longer the repair time, the greater the chances of another failure that will lose data. Some systems include standby spares so that the data can be reconstructed instantly upon discovery of the failure; the operator can then replace the failed disks in a more leisurely fashion.

There are also questions about how disk technology changes over time. Although disk manufacturers quote very high MTTF for their products, those numbers are under nominal conditions. The calculation of RAID reliability assumes independence between disk failures, but disk failures could be correlated, because damage due to the environment would likely affect all the disks in the array. Another concern is that, since disk bandwidth is growing more slowly than disk capacity, the time to repair a disk in a RAID system is increasing, which in turn increases the chances of a second failure. A further concern is that reading much more data during reconstruction increases the chance of an uncorrectable read media failure, which would result in data loss. Other arguments for concern about simultaneous multiple failures are the increasing number of disks in arrays and the use of higher-capacity disks. These trends have led to a growing interest in protecting against more than one failure, and so RAID 6 is increasingly being offered.
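To make the parity arithmetic used by RAID 3-5 above concrete, here is a minimal C sketch; the word size, disk counts, and function names are assumptions for illustration. It shows parity computed as an XOR "sum", reconstruction of a failed disk by "subtracting" the surviving data from the parity, and the RAID 4/5 small-write shortcut that touches only two disks.

```c
#include <stdint.h>
#include <stddef.h>

/* Parity over a protection group is the bitwise XOR ("sum") of the data on
 * the n data disks, taken one word position at a time. */
uint32_t parity_of_group(const uint32_t *data, size_t n_disks) {
    uint32_t p = 0;
    for (size_t i = 0; i < n_disks; i++)
        p ^= data[i];
    return p;
}

/* RAID 3 recovery: the missing word of the failed disk is the XOR of the
 * parity with the corresponding words of all surviving disks
 * ("subtracting" the good data from the parity). */
uint32_t reconstruct_failed(const uint32_t *surviving, size_t n_surviving,
                            uint32_t parity) {
    uint32_t missing = parity;
    for (size_t i = 0; i < n_surviving; i++)
        missing ^= surviving[i];
    return missing;
}

/* RAID 4/5 small-write shortcut: because parity is just a sum, the new
 * parity differs from the old parity exactly in the bits that changed
 * between the old and new data.  Four disk accesses on two disks:
 * read old data, read old parity, write new data, write new parity. */
uint32_t small_write_new_parity(uint32_t old_data, uint32_t new_data,
                                uint32_t old_parity) {
    return old_parity ^ (old_data ^ new_data);
}
```

For a single data word rewritten on one disk, small_write_new_parity(old_data, new_data, old_parity) yields the same result as recomputing parity_of_group over the whole group with that word replaced, which is why only the data disk and the parity disk need to be accessed.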