William Stallings, Computer Organization and Architecture
Chapter 4 Internal Memory

The four-level memory hierarchy
♦ Computer memory is organized into a hierarchy.
♦ Going down the hierarchy: decreasing cost/bit, increasing capacity, slower access time, and decreasing frequency of access of the memory by the processor.
♦ The cache automatically retains a copy of some of the recently used words from the DRAM.

Memory Hierarchy
Registers: in the CPU.
Internal or main memory: may include one or more levels of cache; "RAM".
External memory: backing store.

4.1 COMPUTER MEMORY SYSTEM OVERVIEW

Characteristics of Memory Systems
Location, capacity, unit of transfer, access method, performance, physical type, physical characteristics, organisation.

Location
The term location refers to whether memory is internal or external to the computer.
CPU: the processor requires its own local memory, in the form of registers.
Internal: main memory, cache.
External: peripheral storage devices, such as disk and tape.

Capacity
Internal memory capacity is typically expressed in terms of bytes (1 byte = 8 bits) or words; external memory capacity is expressed in bytes.
Word: the natural unit of organisation. Word length is usually 8, 16 or 32 bits.
The size of the word is typically equal to the number of bits used to represent a number and to the instruction length. Unfortunately, there are many exceptions.

Unit of Transfer
Internal: usually governed by the data bus width.
External: usually a block, which is much larger than a word.
Addressable unit: the smallest location which can be uniquely addressed, at the word level or the byte level.
In any case, 2^A = N, where A is the length in bits of an address and N is the number of addressable units. (A short numeric illustration follows the Organisation note below.)

Access Methods (1)
Sequential access: start at the beginning and read through in order; access time depends on the location of the data and the previous location, and is variable; e.g. tape.
Direct access: individual blocks have a unique address; access is by jumping to the vicinity plus a sequential search; access time depends on the location and the previous location, and is variable; e.g. disk.

Access Methods (2)
Random: individual addresses identify locations exactly; access time is independent of location or previous access and is constant; e.g. RAM.
Associative: data is located by a comparison with the contents of a portion of the store; access time is independent of location or previous access and is constant; e.g. cache.

Performance Parameters
Access time: for random-access memory, the time it takes to perform a read or write operation, i.e. the time between presenting the address to the memory and getting the valid data; for non-random-access memory, the time it takes to position the read-write mechanism at the desired location.
Memory cycle time: access time plus any additional time required for the memory to "recover" before the next access.
Transfer rate: the rate at which data can be moved.

Physical Types
Semiconductor: RAM.
Magnetic: disk & tape.
Optical: CD (Compact Disk) & DVD (Digital Video Disk).
Others: bubble, hologram.

Physical Characteristics
Decay.
Volatility: in a volatile memory, information decays naturally or is lost when electrical power is switched off; in a nonvolatile memory, no electrical power is needed to retain information, e.g. magnetic-surface memory.
Erasable.
Power consumption.

Organisation
Organisation means the physical arrangement of bits into words; the obvious arrangement is not always used.
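As a quick illustration of the addressable-unit relation 2^A = N from the Unit of Transfer note above, the small Python sketch below (the helper name and example sizes are ours, not from the slides) computes how many address bits A are needed for a few byte-addressable memory sizes.

```python
# Illustration of 2^A = N: A address bits can name N addressable units.
# Helper name and example sizes are illustrative only.
def address_bits(n_units: int) -> int:
    """Smallest A such that 2**A >= n_units."""
    return (n_units - 1).bit_length()

for n in (256, 64 * 1024, 1024 ** 2, 16 * 1024 ** 2):
    print(f"{n:>10} addressable units -> {address_bits(n)} address bits")
# e.g. 16M byte-addressable units -> 24 address bits (the 24-bit address used later in the chapter)
```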
The Bottom Line
The design constraints on a computer's memory: How much? (capacity) How fast? (time is money) How expensive? (cost)
There is a trade-off among the three key characteristics of memory: cost, capacity, and access time.

Hierarchy List
Registers, L1 cache, L2 cache, main memory, disk cache, disk, optical, tape.

Hierarchy List
Across this spectrum of technologies: faster access time means greater cost per bit; greater capacity means smaller cost per bit; greater capacity means slower access time.
From top to bottom: decreasing cost per bit, increasing capacity, increasing access time, decreasing frequency of access of the memory by the processor.

So you want fast?
It is possible to build a computer which uses only static RAM (see later). This would be very fast and would need no cache (how can you cache cache?), but it would cost a very large amount.

Locality of Reference
During the course of the execution of a program, memory references tend to cluster, e.g. loops and subroutines.
Main memory is usually extended with a higher-speed, smaller cache. The cache is a device for staging the movement of data between main memory and processor registers to improve performance.
External memory, called secondary or auxiliary memory, is used to store program and data files and is visible to the programmer only in terms of files and records.

4.2 Semiconductor Main Memory
Table 4.2 Semiconductor Memory Types

Types of Random-Access Semiconductor Memory
RAM is a misnomer, because all of the semiconductor memory types listed in the table are random access.
Read/write; volatile (a RAM must be provided with a constant power supply); temporary storage; static or dynamic.

Dynamic RAM (DRAM)
Bits are stored as charge in capacitors; the charges leak, so the cells need refreshing even when powered.
Simpler construction, smaller per bit, less expensive; needs refresh circuits; slower.
Typical use: main memory.

Static RAM (SRAM)
Bits are stored as on/off switches; there are no charges to leak, so no refreshing is needed when powered.
More complex construction, larger per bit, more expensive; does not need refresh circuits; faster.
Typical use: cache.

Read Only Memory (ROM)
Permanent storage.
Applications: microprogramming (see later), library subroutines, systems programs (BIOS), function tables.

Types of ROM
Written during manufacture: very expensive for small runs.
Programmable (once): PROM; needs special equipment to program.
Read "mostly": Erasable Programmable (EPROM), erased by UV; Electrically Erasable (EEPROM), which takes much longer to write than to read; flash memory, which is intermediate between EPROM and EEPROM in both cost and functionality, and can erase the whole memory electrically or erase blocks of memory.

Organisation in Detail
Memory cell: the basic element of a semiconductor memory. It exhibits two stable states, and it can be written to set the state or read to sense the state.

Chip Logic
One extreme organization: the physical arrangement of cells in the array is the same as the logical arrangement, with the array organized into W words of B bits each; e.g. a 16-Mbit chip can be organised as 1M 16-bit words.
The other extreme is the one-bit-per-chip organization, in which data is read/written one bit at a time; such a system uses 16 chips of 1 Mbit each, with bit 1 of each word in chip 1, and so on.

Chip Logic: Typical Organization of a 16-Mbit DRAM
A 16-Mbit chip can be organised as a 2048 x 2048 x 4-bit array. This reduces the number of address pins: the row address and the column address are multiplexed over 11 pins (2^11 = 2048). Eleven address bits select one of 2048 rows, and a further 11 address bits, presented over the same pins, select one of 2048 columns of 4 bits per column.
Four data lines are used for the input and output of 4 bits to and from a data buffer. On write, the bit driver of each bit line is activated for a 1 or a 0 according to the value of the corresponding data line. On read, the value of each bit line is passed through a sense amplifier and presented to the data lines; the row line selects which row of cells is used for reading or writing.
Adding one more pin devoted to addressing doubles the number of rows and columns, so the size of the chip memory grows by a factor of 4.
Typical 16-Mbit DRAM (4M x 4)
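As a minimal sketch of the row/column address multiplexing just described (assuming the 2048 x 2048 x 4-bit organization above; the function and constant names are ours, not from the slides), the Python snippet below splits a 22-bit cell address into the 11-bit row and 11-bit column values that are presented over the same address pins, first under RAS and then under CAS.

```python
# Sketch of address multiplexing for a 2048 x 2048 x 4-bit DRAM array.
# A cell address needs 22 bits (2^11 rows x 2^11 columns), but only 11 address
# pins are provided: the row and column addresses share them in time.
ROW_BITS = COL_BITS = 11                      # 2^11 = 2048 rows and 2048 columns

def split_address(cell_addr: int) -> tuple[int, int]:
    """Split a 22-bit cell address into the (row, column) pair sent over the pins."""
    row = (cell_addr >> COL_BITS) & ((1 << ROW_BITS) - 1)   # presented first (latched by RAS)
    col = cell_addr & ((1 << COL_BITS) - 1)                 # presented second (latched by CAS)
    return row, col

row, col = split_address(0x2ABCD)
print(f"row = {row}, column = {col}")         # each value selects one of 2048 lines; 4 bits move per access
```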
Refreshing
A refresh circuit is included on the chip: disable the chip, count through the rows, read and write back. This takes time and slows down the apparent performance.

Chip Packaging
EPROM package: a one-word-per-chip, 8-Mbit chip organized as 1M x 8. Its pins carry:
• The address of the word being accessed; for 1M words, a total of 20 address pins are needed (2^20 = 1M).
• The data lines D0–D7.
• The power supply to the chip (Vcc).
• A ground pin (Vss).
• A chip enable (CE) pin, used to indicate whether or not the address is valid for this chip.
• A program voltage (Vpp).

DRAM package: a 16-Mbit chip organized as 4M x 4. Because a RAM chip can be updated, its data pins are input/output, unlike those of a ROM chip. Additional pins:
• Write Enable (WE)
• Output Enable (OE)
• Row Address Select (RAS)
• Column Address Select (CAS)

Module Organisation
If a RAM chip contains only 1 bit per word, then clearly a number of chips equal to the number of bits per word is needed.
e.g. How could a memory module consisting of 256K 8-bit words be organized? 256K = 2^18, so an 18-bit address is needed. The address is presented to eight 256K x 1-bit chips, each of which provides the input/output of 1 bit.
Figure 4.6 256-KByte Memory Organization

Module Organisation (2)
Figure 4.7 1-MByte Memory Organization. (1M x 8)/(256K x 8) = 4 = 2^2.
As shown in Figure 4.7, 1M words by 8 bits per word is organized as four columns of chips, each column containing 256K words arranged as in Figure 4.6.
1M = 2^20, so for 1M words, 20 address lines are needed. The 18 least significant bits are routed to all 32 chips. The high-order 2 bits are input to a group select logic module that sends a chip enable signal to one of the four columns of chips.

Error Correction
Hard failure: a permanent defect.
Soft error: random and non-destructive, with no permanent damage to the memory.
Errors are detected (and corrected) using a Hamming error-correcting code.

Error-Correcting Code Function
• A function f is performed on the data to produce a code.
• When the previously stored word is read out, the code is used to detect and possibly correct errors.
• A new set of K code bits is generated from the M data bits and compared with the fetched code bits.

Even Parity Bits: Figure 4.9 Hamming Error-Correcting Code
Figure 4.9 uses Venn diagrams to illustrate the use of a Hamming code on 4-bit words (M = 4). With three intersecting circles there are seven compartments. We assign the 4 data bits to the inner compartments; the remaining compartments are filled with parity bits. Each parity bit is chosen so that the total number of 1s in its circle is even.

Figure 4.8 Error-Correcting Code
The comparison logic receives as input two K-bit values. A bit-by-bit comparison is done by taking the exclusive-or of the two inputs. The result is called the syndrome word. The syndrome word is therefore K bits wide and has a range between 0 and 2^K - 1. The value 0 indicates that no error was detected, leaving 2^K - 1 values to indicate, if there is an error, which bit was in error (the numerical value of the syndrome indicates the position of the bit in error).
An error could occur on any of the M data bits or the K check bits, so 2^K - 1 >= M + K. (This inequality gives the number of check bits needed to correct a single-bit error in a word containing M data bits.)
Those bit positions whose position numbers are powers of 2 are designated as check bits.
Each check bit operates on every data bit whose position number contains a 1 in the corresponding bit of its binary representation: bit position n is checked by those check bits C_i whose indices i sum to n (for example, bit position 7 is checked by C4, C2 and C1, because 4 + 2 + 1 = 7).
Figure 4.10 Layout of Data Bits and Check Bits (C8 C4 C2 C1)
The check bits are calculated as follows, where ⊕ designates the exclusive-or operation:
C1 = M1 ⊕ M2 ⊕ M4 ⊕ M5 ⊕ M7
C2 = M1 ⊕ M3 ⊕ M4 ⊕ M6 ⊕ M7
C4 = M2 ⊕ M3 ⊕ M4 ⊕ M8
C8 = M5 ⊕ M6 ⊕ M7 ⊕ M8
Assume that the 8-bit input word is 00111001, with data bit M1 in the right-most position. The calculations give C1 = 1, C2 = 1, C4 = 1, C8 = 0.
Suppose data bit 3 sustains an error and is changed from 0 to 1. When the new check bits (C8 C4 C2 C1 = 0001) are compared with the old check bits (0111), the syndrome word is formed: the result is 0110, indicating that bit position 6, which contains data bit 3, is in error.
Figure 4.11 Check Bit Generation
The code just described is a single-error-correcting (SEC) code. More commonly, semiconductor memory is equipped with a single-error-correcting, double-error-detecting (SEC-DED) code. An error-correcting code enhances the reliability of the memory at the cost of added complexity.
Table 4.3 Increase in Word Length with Error Correction
Figure 4.12 Hamming SEC-DED Code
The sequence shows that if two errors occur (Figure 4.12c), the checking procedure goes astray (d) and worsens the problem by creating a third error (e). To overcome the problem, an eighth bit is added and set so that the total number of 1s in the diagram is even.
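The following minimal Python sketch (function and variable names are ours, not from the slides) reproduces the single-error-correcting calculation above for M = 8 data bits and K = 4 check bits laid out in positions 1–12: it computes the check bits for the word 00111001, flips data bit 3, and recovers the syndrome 0110, i.e. bit position 6.

```python
# Minimal sketch of the SEC Hamming code described above: 8 data bits (M1..M8),
# 4 check bits (C1, C2, C4, C8) placed at the power-of-two positions among 1..12.
def check_bits(data_bits):
    """data_bits[0] is M1. Return {1: C1, 2: C2, 4: C4, 8: C8} using even parity."""
    # Lay the data bits into the non-power-of-two positions 3, 5, 6, 7, 9, 10, 11, 12.
    positions = [p for p in range(1, 13) if p & (p - 1) != 0]
    layout = dict(zip(positions, data_bits))
    checks = {}
    for c in (1, 2, 4, 8):
        # C_c covers every position whose binary representation contains bit c.
        checks[c] = 0
        for p, bit in layout.items():
            if p & c:
                checks[c] ^= bit
    return checks

word = [1, 0, 0, 1, 1, 1, 0, 0]            # 00111001 with M1 right-most
stored = check_bits(word)                   # C8 C4 C2 C1 = 0 1 1 1
word[2] ^= 1                                # data bit 3 (M3) flips from 0 to 1
fetched = check_bits(word)                  # check bits recomputed on the fetched word
syndrome = sum(c for c in (1, 2, 4, 8) if stored[c] ^ fetched[c])
print(f"syndrome = {syndrome}")             # 6 -> bit position 6, which holds M3
```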
4.3 CACHE MEMORY
A small amount of fast memory that sits between normal main memory and the CPU. It may be located on the CPU chip or in a module.

Cache Operation - Overview
Figure 4.14 Cache/Main-Memory Structure (P118): (a) a cache of C lines (numbered 0 to C-1), each holding a tag and a block of K words; (b) main memory of 2^n addressable words, viewed as blocks of K words each.
The cache includes tags to identify which block of main memory is in each cache slot. The tag is usually a portion of the main memory address.

Figure 4.15 Cache Read Operation (P119)
• CPU requests the contents of a memory location
• Check the cache for this data
• If present, get it from the cache (fast)
• If not present, read the required block from main memory into the cache
• Then deliver the word from the cache to the CPU
(A short sketch of this flow appears after the Cache Size discussion below.)

Typical Cache Organization
• In this organization, the cache connects to the processor via data, control, and address lines.
• The data and address lines also attach to data and address buffers, which attach to a system bus from which main memory is reached.
• When a cache hit occurs, the data and address buffers are disabled and communication is only between processor and cache, with no system bus traffic.
• When a cache miss occurs, the desired address is loaded onto the system bus and the data are returned through the data buffer to both the cache and the processor.
Figure 4.16 Typical Cache Organization

Elements of Cache Design
Size.
Mapping function: direct, associative, set associative.
Replacement algorithm: least recently used (LRU), first in first out (FIFO), least frequently used (LFU), random.
Write policy: write through, write back, write once.
Block size.
Number of caches: single or two level; unified or split.

Cache Size
A trade-off between cost per bit and access time.
Cost: more cache is expensive.
Speed: more cache is faster (up to a point), but checking the cache for data takes time.
"Optimum" cache sizes suggested: between 1K and 512K words.
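To make the read flow of Figure 4.15 concrete, here is a minimal Python sketch (the class and field names are ours, not from the slides) that ignores capacity limits and replacement: on a hit the word is delivered from the cache; on a miss the containing block is first copied from main memory into the cache and then delivered.

```python
# Minimal sketch of the cache read flow of Figure 4.15 (illustrative names only).
class SimpleCache:
    def __init__(self, block_size, memory):
        self.block_size = block_size          # words per block
        self.memory = memory                  # backing "main memory" (a list of words)
        self.lines = {}                       # block number -> list of words held in the cache

    def read(self, address):
        block_no = address // self.block_size
        if block_no not in self.lines:                            # miss: fetch the whole block
            start = block_no * self.block_size
            self.lines[block_no] = self.memory[start:start + self.block_size]
        return self.lines[block_no][address % self.block_size]    # deliver the word from the cache

main_memory = list(range(64))                 # toy 64-word main memory
cache = SimpleCache(block_size=4, memory=main_memory)
print(cache.read(10), cache.read(11))         # the first read misses; the second hits the same block
```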
Mapping Function
Three techniques: direct, associative, and set associative.
Elements of the running example: a cache of 64 KBytes; a cache block of 4 bytes, i.e. data is transferred between memory and the cache in blocks of 4 bytes each, so the cache holds 16K (2^14) lines of 4 bytes; 16 MBytes of main memory with a 24-bit address (2^24 = 16M), i.e. main memory consists of 4M blocks of 4 bytes each.

Direct Mapping
Each block of main memory maps to only one cache line, i.e. if a block is in the cache, it must be in one specific place.
The address is in two parts: the least significant w bits identify a unique word or byte within a block of main memory; the most significant s bits specify one memory block.
The MSBs are split into a cache line field of r bits and a tag of s - r bits (the most significant bits). The line field of r bits identifies one of the m = 2^r lines of the cache.

Direct Mapping Cache Line Table
Cache line 0: blocks 0, m, 2m, ..., 2^s - m
Cache line 1: blocks 1, m+1, 2m+1, ..., 2^s - m + 1
...
Cache line m-1: blocks m-1, 2m-1, 3m-1, ..., 2^s - 1
(Every row has the same cache line number; every column has the same tag number.)
The mapping is expressed as i = j modulo m, where i = cache line number, j = main memory block number, and m = number of lines in the cache.
No two blocks that map to the same line have the same tag field!

Direct Mapping Cache Organization
The r-bit line number is used as an index into the cache to access a particular line. If the (s - r)-bit tag number matches the tag currently stored in that line, then the w-bit word number is used to select one of the 2^w bytes in that line. Otherwise, the s-bit tag-plus-line field is used to fetch a block from main memory.

Direct Mapping Address Structure (24-bit address)
Tag (s - r) = 8 bits | Line or slot (r) = 14 bits | Word (w) = 2 bits
w = 2-bit word identifier (4-byte block); s = 22-bit block identifier, split into an 8-bit tag (22 - 14) and a 14-bit slot or line.
No two blocks that map to the same line have the same tag field. Check the contents of the cache by finding the line and checking the tag.

Direct Mapping Example
• The cache is organized as 16K = 2^14 lines of 4 bytes each.
• The main memory consists of 16 MBytes, organized as 4M blocks of 4 bytes each.
• i = j modulo m, where i = cache line number, j = main memory block number, and m = number of lines in the cache.
• Note that no two blocks that map into the same line number have the same tag number.

Direct Mapping Pros & Cons
Advantages: simple, inexpensive.
Disadvantages: there is a fixed location for a given block, so if a program repeatedly accesses two blocks that map to the same line, cache misses are very high.

Associative Mapping
A main memory block can load into any line of the cache. The memory address is interpreted as a tag and a word field; the tag uniquely identifies a block of memory, and every line's tag is examined for a match.
Disadvantage of associative mapping: cache searching gets expensive, since complex circuitry is required to examine the tags of all cache lines in parallel.

Fully Associative Cache Organization: Associative Mapping Address Structure
Tag = 22 bits | Word = 2 bits
A 22-bit tag is stored with each 32-bit (4-byte) block of data. Compare the tag field of the address with the tag entries in the cache to check for a hit. The least significant 2 bits of the address identify which byte is required from the 4-byte (32-bit) data block.

Associative Mapping Example
e.g. Address 16339C -> Tag 058CE7, Data FEDCBA98, cache line 0001.

Set Associative Mapping
The cache is divided into a number of sets, each containing a number of lines. A given block maps to any line in a given set, e.g. block B can be in any line of set i.
e.g. with 2 lines per set (2-way associative mapping), a given block can be in one of 2 lines in only one set.
In this case, the cache is divided into v sets, each of which consists of k lines. The relationships are m = v x k and i = j modulo v, where i = cache set number, j = main memory block number, and m = number of lines in the cache. This is referred to as k-way set associative mapping.
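The short Python sketch below (helper names are ours, not from the slides) extracts the address fields for the running example (24-bit addresses, 4-byte blocks, 16K-line cache) under each of the three mapping techniques, using the field widths given above; it reproduces the associative example, where address 16339C yields tag 058CE7.

```python
# Field extraction for the running example: 24-bit address, 4-byte blocks,
# 16K-line cache. Field widths follow the address structures given above.
def direct(addr):          # tag 8 | line 14 | word 2
    return addr >> 16, (addr >> 2) & 0x3FFF, addr & 0x3

def associative(addr):     # tag 22 | word 2
    return addr >> 2, addr & 0x3

def set_associative(addr): # tag 9 | set 13 | word 2  (two-way: 8K sets of 2 lines)
    return addr >> 15, (addr >> 2) & 0x1FFF, addr & 0x3

addr = 0x16339C
print([hex(f) for f in direct(addr)])           # ['0x16', '0xce7', '0x0']
print([hex(f) for f in associative(addr)])      # ['0x58ce7', '0x0']  i.e. tag 058CE7
print([hex(f) for f in set_associative(addr)])  # ['0x2c', '0xce7', '0x0']
```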
Two-Way Set Associative Cache Organization
The d set bits specify one of v = 2^d sets. The s bits of the tag and set fields specify one of the 2^s blocks of main memory.
With k-way set associative mapping, the tag in a memory address is much smaller (than with fully associative mapping) and is only compared to the k tags within a single set.

Set Associative Mapping Example
13-bit set number; the block number in main memory is taken modulo 2^13.
Addresses 000000, 008000, 010000, 018000, ... map to the same set.

Set Associative Mapping Address Structure
Tag = 9 bits | Set = 13 bits | Word = 2 bits
Use the set field to determine which cache set to look in; the tag-plus-set field specifies one of the blocks in main memory. Compare the tag field to see if we have a hit.
e.g. Address 1FF 7FFC -> Tag 1FF, Data 24682468, set number 1FFF.

Two-Way Set Associative Mapping Example
Address 1FF 7FFC -> Tag 1FF, Data 24682468, set number 1FFF
Address 02C 0004 -> Tag 02C, Data 11235813, set number 0001

Replacement Algorithms (1)
When a new block is brought into the cache, one of the existing blocks must be replaced.
Direct mapping: no choice; each block maps to only one line, so replace that line.

Replacement Algorithms (2): Associative & Set Associative
The algorithm must be implemented in hardware (for speed).
Least recently used (LRU): replace the block in the set which has been in the cache longest with no reference to it (hit ratio + time); e.g. in a 2-way set associative cache, which of the 2 blocks is LRU? (A small LRU sketch appears after the Line Size discussion below.)
First in first out (FIFO): replace the block in the set that has been in the cache longest (time).
Least frequently used (LFU): replace the block in the set which has had the fewest hits.
Random (hit ratio).

Write Policy
A cache block must not be overwritten unless main memory is up to date.
Problems to contend with: more than one device may have access to main memory, so data can become inconsistent between memory and cache; multiple CPUs may have individual caches, so data can become inconsistent among the caches.
Write policies: write through, write back, write once.

Write Through
All writes go to main memory as well as to the cache. Any other processor-cache can monitor main memory traffic to keep its local (to the CPU) cache updated.
Disadvantages: lots of traffic; slows down writes.

Write Back
Updates are initially made in the cache only; an update bit for the cache slot is set when an update occurs. If a block in the cache is to be replaced, it is written to main memory only if its update bit is set.
Other caches can get out of sync, and I/O must access main memory through the cache, because portions of main memory may be invalid.

Approaches to Cache Coherency
Bus watching with write through: each cache controller monitors the address lines to detect write operations to memory by other bus masters. This strategy depends on the use of a write-through policy by all cache controllers.
Hardware transparency: additional hardware is used to ensure that all updates to main memory via a cache are reflected in all caches.
Noncachable memory: only a portion of main memory is shared by more than one processor. In such a system, all accesses to shared memory are cache misses, because the shared memory is never copied into the cache. The noncachable memory can be identified using chip-select logic or high-order address bits.

Line Size
The principle of locality: data in the vicinity of a referenced word is likely to be referenced in the near future.
The relationship between block size and hit ratio is complex, depending on the locality characteristics of a particular program, and no definitive optimum value has been found. A block size of two to eight words seems reasonably close to optimum.
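As a small illustration of LRU replacement within one set of a 2-way set associative cache (see Replacement Algorithms (2) above; the class name and access sequence are ours, not from the slides), the sketch below keeps the set's two tags ordered from most to least recently used and evicts the least recently used tag on a miss.

```python
# LRU replacement within one set of a 2-way set associative cache (illustrative).
class TwoWaySet:
    def __init__(self):
        self.tags = []                    # most recently used first, at most 2 entries

    def access(self, tag):
        if tag in self.tags:              # hit: move the tag to the front (most recent)
            self.tags.remove(tag)
            self.tags.insert(0, tag)
            return "hit"
        if len(self.tags) == 2:           # miss with a full set: evict the LRU tag
            self.tags.pop()
        self.tags.insert(0, tag)
        return "miss"

s = TwoWaySet()
for t in (0x1FF, 0x02C, 0x1FF, 0x0AA, 0x02C):
    print(hex(t), s.access(t))
# 0x02C misses at the end: it was the least recently used tag when 0x0AA arrived.
```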
Number of Caches
A single cache or multiple caches; the design issues are the number of levels of caches and the use of unified versus split caches.
Split caches: one cache dedicated to instructions and one dedicated to data.
• Key advantage of split caches: they eliminate contention for the cache between the instruction fetch unit and the execution unit.
Unified cache: a single cache used to store both data and instructions.
For a given cache size, a unified cache has a higher hit rate than split caches because it balances the load between instruction and data fetches automatically.

Number of Caches
The on-chip cache: cache and processor on the same chip.
When the requested instruction or data is found in the on-chip cache, the bus access is eliminated. Because of the short data paths internal to the processor, on-chip cache accesses complete appreciably faster than would even zero-wait-state bus cycles.
Advantages: reduced external bus activity, faster execution times, and increased overall system performance.
A two-level cache: the internal cache is designated level 1 (L1) and the external cache level 2 (L2).

4.4 Pentium Cache
Foreground reading: find out the details of Pentium II cache systems, NOT just from Stallings!

4.5 Newer RAM Technology (1)
Basic DRAM has stayed the same since the first RAM chips. The constraints of the traditional DRAM chip are its internal architecture and its interface to the processor's memory bus.
Enhanced DRAM: contains a small SRAM as well; the SRAM holds the last line read; a comparator stores the 11-bit value of the most recent row address selection.
Cache DRAM (CDRAM): a larger SRAM component, usable as a cache or as a serial buffer.

Newer RAM Technology (2)
Synchronous DRAM (SDRAM): access is synchronized with an external clock, unlike conventional DRAM, which is asynchronous.
The address is presented to the RAM; since the SDRAM moves data in time with the system clock, the CPU knows when the data will be ready, so it does not have to wait and can do something else.
Burst mode allows the SDRAM to set up a stream of data and fire it out in a block.
Internal logic of the SDRAM:
• In burst mode, a series of data bits can be clocked out rapidly after the first bit has been accessed. Burst mode is useful when all the bits to be accessed are in sequence and in the same row of the array as the initial access.
• A dual-bank internal architecture improves opportunities for on-chip parallelism.
• The mode register and associated control logic provide a mechanism to customize the SDRAM to suit specific system needs.

Newer RAM Technology (3)
Foreground reading: check out any other RAM you can find. See Web site: The RAM Guide.

Exercises
P143: 4.4, 4.6, 4.7, 4.8
P145: 4.20
Deadline