Computer Organization
The Memory System
Department of CSE, SSE Mukka
www.bookspar.com | Website for students | VTU NOTES

Chapter Objectives
Basic memory circuits and the organization of the main memory.
The cache memory concept, which shortens the effective memory access time.
The virtual memory mechanism, which increases the apparent size of the main memory.
Secondary storage: magnetic disks, optical disks, magnetic tapes.

Basic Memory Concepts
The maximum size of the Main Memory (MM) that can be used in any computer is determined by its addressing scheme.
For example, a 16-bit computer that generates 16-bit addresses is capable of addressing up to 2^16 = 64K memory locations.
A 32-bit computer with 32-bit addresses can address 2^32 = 4G memory locations, and a 40-bit computer can address 2^40 = 1T memory locations.

Word Addressability and Byte Addressability
If the smallest addressable unit of information is a memory word, the machine is called word-addressable.
If individual memory bytes are assigned distinct addresses, the computer is called byte-addressable.
Most commercial machines are byte-addressable.
For example, in a byte-addressable 32-bit computer, each memory word contains 4 bytes. A possible word-address assignment would be:

Word Address    Byte Addresses
0               0   1   2   3
4               4   5   6   7
8               8   9   10  11

Basic Memory Concepts (contd.)
The word length of a computer is the number of bits actually stored or retrieved in one memory access.
For example, consider a byte-addressable 32-bit computer whose instructions generate 32-bit addresses:
the high-order 30 bits determine which word in memory is accessed, and the low-order 2 bits determine which byte within that word.
Suppose we want to fetch only one byte from a word.
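The word/byte address split just described can be sketched in code (Python is used here purely for illustration; the function name is ours, not from the text):

```python
def decompose(byte_address):
    """Split a 32-bit byte address into (word address, byte offset)
    for a byte-addressable machine with 4-byte words."""
    word_address = byte_address & ~0x3   # high-order 30 bits select the word
    byte_offset = byte_address & 0x3     # low-order 2 bits select the byte in the word
    return word_address, byte_offset

# Byte 10 lives in the word at address 8, at byte offset 2 within that word.
print(decompose(10))
```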
In a Read operation, the other bytes of the word are discarded by the processor.
In a Write operation, care must be taken not to overwrite the other bytes of the word.

Basic Memory Concepts (contd.)
Data transfer between the memory and the processor takes place through two processor registers:
MAR – Memory Address Register
MDR – Memory Data Register
If the MAR is k bits long and the MDR is n bits long, the memory unit may contain up to 2^k addressable locations, and during a memory cycle n bits of data are transferred between the memory and the processor.
Correspondingly, the processor has k address lines and n data lines.
There are additional control lines: read/write (R/W), MFC (Memory Function Completed), number of bytes to be transferred, etc.

Figure 5.1. Connection of the memory to the processor. (The processor's MAR drives a k-bit address bus and its MDR an n-bit data bus to a memory of up to 2^k addressable locations with word length n bits; control lines carry R/W, MFC, etc.)

How does the processor read data from the memory?
It loads the address of the required memory location into the MAR and sets the R/W line to 1.
The memory responds by placing the requested data on the data lines and confirms this action by asserting the MFC signal.
Upon receipt of the MFC signal, the processor loads the data on the data lines into the MDR.

How does the processor write data into the memory?
It loads the address of the location into the MAR, loads the data into the MDR, and indicates a Write operation by setting the R/W line to 0.

Some Concepts
Memory Access Time: a useful measure of the speed of the memory unit. It is the time that elapses between the initiation of an operation and its completion (for example, the time between READ and MFC).
Memory Cycle Time: an important measure of the memory system.
It is the minimum time delay required between the initiations of two successive memory operations (for example, the time between two successive READ operations).
The cycle time is usually slightly longer than the access time.

Random Access Memory (RAM)
A memory unit is called a Random Access Memory if any location can be accessed for a READ or WRITE operation in some fixed amount of time that is independent of the location's address.
Main memory units are of this type.
This distinguishes them from serial, or partly serial, access storage devices such as magnetic tapes and disks, which are used as secondary storage.

Cache Memory
The CPU processes instructions and data faster than they can be fetched from a compatibly priced main memory unit, so the memory cycle time becomes the bottleneck in the system.
One way to reduce the effective memory access time is to use a cache memory: a small, fast memory inserted between the larger, slower main memory and the CPU.
It holds the currently active segments of a program and its data.
Because of the locality of address references, the CPU finds the relevant information mostly in the cache itself (a cache hit) and only infrequently needs access to the main memory (a cache miss).
With a suitable cache size, hit rates of over 90% are possible.

Memory Interleaving
This technique divides the memory system into a number of memory modules and arranges the addressing so that successive words in the address space are placed in different modules.
When requests for memory access involve consecutive addresses, the accesses go to different modules.
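This low-order interleaved placement can be sketched as follows (Python for illustration; the module count of four is an assumption, not from the text):

```python
def module_of(address, num_modules=4):
    """Low-order interleaving: consecutive word addresses map to
    different modules, cycling through them in order."""
    module = address % num_modules            # which module holds this word
    word_in_module = address // num_modules   # location within that module
    return module, word_in_module

# Four consecutive addresses land in four different modules,
# so they can be fetched in parallel.
print([module_of(a) for a in range(4)])
```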
Since parallel access to these modules is possible, the average rate of fetching words from the main memory can be increased.

Virtual Memory
In a virtual memory system, the addresses generated by a program may differ from the actual physical addresses.
The required mapping between the logical address space and physical memory is implemented by a special memory control unit called the memory management unit.
The mapping function may be changed during program execution according to system requirements.
The logical (virtual) address space (the addresses generated by the CPU are referred to as virtual, or logical, addresses) can be as large as the addressing capability of the CPU.
The physical address space, that is, the actual physical memory, can be much smaller.

Virtual Memory (contd.)
Only the active portion of the virtual address space is mapped onto the physical memory; the rest is mapped onto a bulk storage device such as a magnetic (hard) disk.
If the addressed information is in the main memory (MM), it is accessed and execution proceeds.
Otherwise, an exception is generated, in response to which the memory management unit transfers a contiguous block of words containing the desired word from the bulk storage unit to the MM, displacing some block that is currently inactive.

Figure 5.2. Organization of bit cells in a memory chip. (A 16 x 8 array: an address decoder driven by address lines A0-A3 selects one of the word lines W0-W15; the cells b7...b0 of the selected row connect through Sense/Write circuits to the data input/output lines, under control of R/W and CS.)

An Example of Memory Organization
Consider a memory chip consisting of 16 words of 8 bits each, usually referred to as a 16 x 8 organization.
The data input and the data output of each Sense/Write circuit are connected to a single bidirectional data line to reduce the number of pins required.
One control line, the R/W (Read/Write) input, is used to specify the required operation; another control line, the CS (Chip Select) input, is used to select a given chip in a multichip memory system.
This circuit requires 14 external connections (4 address, 8 data, R/W, and CS); allowing 2 more pins for power supply and ground, it can be manufactured in the form of a 16-pin chip.
It can store 16 x 8 = 128 bits.

Figure 5.3. Organization of a 1K x 1 memory chip. (A 10-bit address is split into a 5-bit row address, decoded to select one of the word lines W0-W31 of a 32 x 32 memory cell array, and a 5-bit column address driving a 32-to-1 output multiplexer and input demultiplexer; R/W and CS control the Sense/Write circuitry.)

1K x 1 Memory Chip
The 10-bit address is divided into two groups of 5 bits each to form the row and column addresses for the cell array.
A row address selects a row of 32 cells, all of which are accessed in parallel.
One of these, selected by the column address, is connected to the external data lines by the input and output multiplexers.
This structure can store 1024 bits and can be implemented in a 16-pin chip.

Static Memories
Memories that consist of circuits capable of retaining their state as long as power is applied are called static memories.
Static RAMs can be accessed very quickly, within a few nanoseconds.

Figure 5.4. A static RAM cell. (Two inverters are cross-connected to form a latch, which is connected to the bit lines b and b' through transistors T1 and T2; the word line controls both transistors.)
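Returning to the 1K x 1 chip for a moment, its 10-bit row/column address split can be sketched as (Python for illustration; the function name is ours):

```python
def split_1k_address(addr):
    """Split a 10-bit address into (row, column) for the 32 x 32
    cell array of a 1K x 1 memory chip."""
    assert 0 <= addr < 1024
    row = addr >> 5       # high-order 5 bits select one of 32 rows
    col = addr & 0x1F     # low-order 5 bits select one of 32 columns
    return row, col
```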
When the word line is at ground level, the transistors are turned off and the latch retains its state.

Read and Write Operations in an SRAM
Read: the word line is activated to close switches T1 and T2.
If the cell is in state 1, the signal on bit line b is high and the signal on bit line b' is low; the opposite holds if the cell is in state 0.
Sense/Write circuits at the ends of the bit lines monitor the states of b and b' and set the output accordingly.
Write: the state of the cell is set by placing the appropriate values on bit lines b and b' and then activating the word line.
This forces the cell into the corresponding state.
The required signals on the bit lines are generated by the Sense/Write circuit.

Dynamic RAMs
Static RAMs are fast but come at a higher cost, because their cells require several transistors.
Less expensive RAMs can be implemented with fewer transistors per cell, but such cells cannot retain their state indefinitely; these are called Dynamic RAMs (DRAMs).
Information is stored in the form of a charge on a capacitor.
This charge can be maintained only for tens of milliseconds, so the contents must be refreshed periodically to hold the data.

Figure 5.6. A single-transistor dynamic memory cell. (A transistor T connects the storage capacitor C to the bit line, under control of the word line.)

A 16-Mbit DRAM Chip, Configured as 2M x 8
The cells are organized as a 4K x 4K array.
The 4096 cells in each row are divided into 512 groups of 8, so a row can store 512 bytes of data.
12 address bits are required to select a row, and 9 bits are needed to specify a group of 8 bits in the selected row.
Timing is controlled asynchronously: a specialized memory controller circuit provides the necessary control signals, RAS and CAS, that govern the timing.
The high-order 12 address bits, A20-9, form the row address and are latched under control of RAS; the low-order 9 bits, A8-0, form the column address and are latched under control of CAS.
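The 21-bit address split just described can be sketched as (Python for illustration; the function name is ours):

```python
def dram_address_fields(addr):
    """Split a 21-bit address for a 2M x 8 DRAM organized as a
    4K x 4K cell array (each row holds 512 groups of 8 bits)."""
    assert 0 <= addr < 2**21
    row = addr >> 9          # 12 bits: one of 4096 rows (latched under RAS)
    column = addr & 0x1FF    # 9 bits: one of 512 byte groups (latched under CAS)
    return row, column
```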
Figure 5.7. Internal organization of a 2M x 8 dynamic memory chip. (The row address latch, loaded under control of RAS, drives a row decoder that selects one row of the 4096 x (512 x 8) cell array; the column address latch, loaded under control of CAS, drives a column decoder that connects the selected byte to data lines D7-D0 through the Sense/Write circuits, under control of CS and R/W.)

Because its timing is controlled by externally supplied RAS and CAS signals, this is an asynchronous DRAM.

Fast Page Mode
All bits of a row are sensed, but only 8 bits are placed on the data lines; this byte is selected by the column address bits.
A simple modification makes it possible to access the other bytes of the same row without having to reselect the row: add a latch to the output of the sense amplifier in each column.
The application of a row address then loads the latches corresponding to all bits in the selected row.
The most useful arrangement is to transfer bytes in sequential order; only different column addresses are needed to place the different bytes on the data lines.
A consecutive sequence of column addresses is applied under the control of successive CAS signals.
This scheme allows a block of data to be transferred at a much faster rate than can be achieved for transfers involving random addresses.
This block-transfer capability is called fast page mode.

Synchronous DRAMs
These are DRAMs whose operation is directly synchronized with a clock signal; they are called SDRAMs.
The cell array is the same as in asynchronous DRAMs.
The address and data connections are buffered by means of registers.
The output of each sense amplifier is connected to a latch.
A read operation causes the contents of all cells in the selected row to be loaded into these latches.
If an access is made for refreshing purposes only, it does not change the contents of these latches.
Data held in the latches that correspond to the selected column(s) are transferred into the data output register.

Figure 5.8. Synchronous DRAM. (A refresh counter and the row/column address inputs feed the row address latch and a column address counter; the row and column decoders select cells in the array, whose Read/Write circuits and latches connect to data input and output registers; a mode register and timing-control block, driven by the external clock, generates the internal signals from RAS, CAS, R/W, and CS.)

Synchronous DRAMs (contd.)
SDRAMs have several different modes of operation, selected by writing control information into a mode register; for example, burst operations of different lengths can be specified.
In SDRAMs, it is not necessary to provide externally generated pulses on the CAS line to select successive columns; the necessary signals are generated internally using a column counter and the clock signal.
Hence new data can be placed on the data lines in each clock cycle.
All actions are triggered by the rising edge of the clock.

Figure 5.9. Burst read of length 4 in an SDRAM. (The timing diagram shows the clock, R/W, RAS, CAS, address, and data lines: the row address is followed by the column address, after which data words D0-D3 appear in successive clock cycles.)

Burst Read of Length 4 in an SDRAM
The row address is latched under control of the RAS signal.
The memory takes about 2-3 cycles to activate the selected row.
The column address is then latched under control of the CAS signal.
After a delay of one clock cycle, the first set of data bits is placed on the data lines.
The SDRAM automatically increments the column address to access the next three sets of bits in the selected row, which are placed on the data lines in successive clock cycles.
SDRAMs have built-in refresh circuitry, including a refresh counter that provides the addresses of the rows to be selected for refreshing.
Each row must be refreshed at least every 64 ms.

Latency and Bandwidth
These are the parameters that indicate the performance of a memory.
Memory latency is the amount of time it takes to transfer a word of data to or from the memory.
In block transfers, latency denotes the time it takes to transfer the first word of data; this is longer than the time needed to transfer each subsequent word of the block.
In the preceding diagram, the access cycle begins with the assertion of RAS, and the first word is transferred five cycles later.
Hence the latency is five clock cycles.

Bandwidth
Bandwidth is usually the number of bits or bytes that can be transferred in one second.
It depends on the speed of memory access, the transfer capability of the links (the speed of the bus), and the number of bits that can be accessed in parallel.
Bandwidth is the product of the rate at which data are transferred (and accessed) and the width of the data bus.

Double-Data-Rate SDRAMs (DDR SDRAMs)
The standard SDRAM performs all actions on the rising edge of the clock signal.
DDR SDRAMs access the cell array in the same way, but transfer data on both edges of the clock.
Their latency is the same as that of standard SDRAMs, but since they transfer data on both edges of the clock, their bandwidth is essentially doubled for long burst transfers.
To make this possible, the cell array is organized into two banks; each bank can be accessed separately.
Consecutive words of a given block are stored in different banks.
DDR SDRAMs are used efficiently in applications where block transfers are prevalent, e.g., transfers between the main memory and processor caches.

Questions for Assignment
1. Explain how the processor reads and writes data from and to memory.
2. Explain the organization of a 1K x 1 memory chip.
3. Explain a single SRAM cell with a diagram. How are read and write operations carried out?
4. Explain a DRAM cell with a diagram. How are read and write operations carried out?
5. Explain the 2M x 8 DRAM chip. How can you modify it for fast page mode?
6. Explain SDRAMs with the help of a diagram.
7. Explain the terms latency and bandwidth.
8. Explain the burst read of length 4 in an SDRAM with a timing diagram.
9.
Explain DDR SDRAMs.

Structure of Larger Memories
Memory chips are connected together to form larger memories.
There are two types of memory systems: static memory systems and dynamic memory systems.

Static Memory Systems
Consider the implementation of a 2M x 32 memory using sixteen 512K x 8 static memory chips.
There are four columns of chips, each column containing four chips, to implement one byte position each.
Only the selected chips (via the chip-select input) place data on the output lines.
21 address bits are needed to select a 32-bit word in this memory: the high-order 2 bits determine which of the 4 chip-select signals should be activated, and the remaining 19 bits access specific byte locations inside each chip of the selected row.
The R/W inputs of all chips are tied together to form a single R/W signal.
Dynamic memory systems are organized in much the same manner as static ones; their physical implementation is more conveniently done in the form of memory modules.

Figure 5.10. Organization of a 2M x 32 memory module using 512K x 8 static memory chips. (The 21-bit address is split into a 19-bit internal chip address, A0-A18, and the 2 high-order bits, A19-A20, which drive a 2-bit decoder generating the chip-select signals; the four columns of chips supply data lines D31-24, D23-16, D15-8, and D7-0.)

Memory System Considerations
The choice of a RAM for a given system depends on several factors: cost, speed, power dissipation, and size of the chip.
Static RAMs are used when very fast operation is the primary requirement; they are used mostly in cache memories.
Dynamic RAMs are the predominant choice for computer main memories; the high densities achievable make larger memories economically feasible.

Memory Controller
To reduce the number of pins, dynamic memory chips use multiplexed address inputs.
The address is divided into two parts: the high-order address bits, which select a row in the cell array, are provided first and latched into the memory under control of the RAS signal; the low-order address bits, which select a column, are then provided on the same address pins and latched under control of the CAS signal.
The processor, however, issues all bits of the address at the same time.
The required multiplexing of the address bits is performed by a memory controller circuit.

Figure 5.11. Use of a memory controller. (The processor sends the address, R/W, a Request signal, and the clock to the controller, which forwards the row/column address and generates RAS, CAS, R/W, CS, and the clock for the memory; the data lines connect the processor and the memory directly.)
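The controller's job of splitting the full address and sequencing the control signals can be sketched as follows (a schematic sketch, not cycle-accurate; the signal names follow Figure 5.11, the active-low convention for CS/RAS/CAS is our assumption, and the 12-bit/9-bit split assumes the 2M x 8 chip from the text):

```python
def controller_sequence(addr, read):
    """Schematic list of (signal, value) steps a memory controller
    issues for one access to a 2M x 8 DRAM (12-bit row, 9-bit column)."""
    row, col = addr >> 9, addr & 0x1FF
    return [
        ("CS", 0),               # select the chip (active low assumed)
        ("address", row),        # row bits on the multiplexed address pins
        ("RAS", 0),              # latch the row address
        ("address", col),        # column bits on the same pins
        ("CAS", 0),              # latch the column address
        ("R/W", 1 if read else 0),  # 1 = read, 0 = write, as in the text
    ]

for step in controller_sequence(0x12345, read=True):
    print(step)
```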
Memory Controller Functions
The controller is interposed between the processor and the memory.
The processor sends a Request signal, the complete address, and the R/W signal to the controller.
The controller forwards the row and column portions of the address to the memory and generates the RAS and CAS signals; it also sends the R/W and CS signals to the memory.
The data lines are connected directly between the processor and the memory.
When used with DRAM chips, the memory controller provides all the information needed to control the refreshing process; it contains a refresh counter to refresh all rows within the time limit specified for the device.

RAMBUS Memory
To increase the system bandwidth, we need to increase the system bus width or the system bus speed.
A wide bus is expensive and requires a lot of space on the motherboard; Rambus instead uses a narrow but much faster bus.
Its key feature is the fast signaling method used to transfer information between chips.
It uses the concept of differential signaling: instead of using either 0 volts or Vsupply (5 volts), signals are 0.3-volt differences from a reference voltage called Vref.

Read-Only Memories (ROMs)
Both SRAMs and DRAMs are volatile: they lose their data if power is turned off.
Many applications need to retain data even when power is off; e.g., a hard disk is used to store information, including the operating system (OS).
When the system is turned on, the OS must be loaded from the hard disk into memory, which requires executing a boot program.
Since the boot program is large, it is stored on the disk; the processor must first execute some instructions that load the boot program into memory.
So we need a small amount of nonvolatile memory that holds the instructions needed to load the boot program into the RAM.
Such nonvolatile memories require a special writing process to place information in them; they are called ROMs (Read-Only Memories).

Figure 5.12. A ROM cell. (A transistor T can connect the bit line to ground at point P; the connection is made to store a 0 and left open to store a 1, and the word line controls the transistor.)

ROM
If the transistor is connected to ground at point P, a 0 is stored; otherwise a 1 is stored.
The bit line is connected to a power supply through a resistor.
To read, the word line is activated: if the voltage on the bit line drops, the cell stores a 0; if the voltage remains the same, it stores a 1.

PROM
A PROM allows the data to be loaded by the user.
This is achieved by inserting a fuse at point P in the preceding figure.
Before it is programmed, the memory contains all 0s.
The user can insert 1s at the required locations by burning out the fuses at those locations using high-current pulses.
The process is irreversible.

EPROM
An EPROM allows the stored data to be erased and new data to be loaded; it is an erasable, reprogrammable ROM.
It can be used while a system is being developed, so that changes can be accommodated.
Its cell structure is similar to that of a ROM, but the connection to ground is always made at point P, and a special transistor is used that can function either as a normal transistor or as a disabled transistor that is always turned off.
The transistor can thus be programmed to behave as a permanently open switch.
The chip can be erased by exposing it to ultraviolet light, which dissipates the charges trapped in the transistors of the memory cells.

EEPROM
EPROMs have two disadvantages: the chip must be physically removed from the circuit for reprogramming, and the entire contents are erased by the UV light.
An EEPROM is another version of erasable PROM that can be both programmed and erased electrically.
It need not be removed for erasure, and cell contents can be erased selectively.
Its disadvantage is that different voltages are needed for erasing, writing, and reading the stored data.

Flash Memory
Flash memory uses an approach similar to EEPROM technology.
In an EEPROM it is possible to read and write a single cell; in a flash memory it is possible to read the contents of a single cell, but writes are done only to a block of cells.
Flash devices have greater density, and a flash cell is based on a single transistor controlled by trapped
charge.
Hence flash memories have higher capacity and a lower cost per bit.
They require a single power supply voltage and consume less power in operation.
They are used in portable, battery-driven equipment: handheld computers, cell phones, digital cameras, MP3 players.

Flash Cards and Flash Drives
Single flash chips do not provide sufficient storage capacity, so larger memory modules are required: flash cards and flash drives.
Flash cards: flash chips are mounted on a small card, which is simply plugged into a conveniently accessible slot; cards come in a variety of sizes.
Flash drives: larger modules designed to replace hard disk drives.
They are designed to fully emulate hard disks, which is not yet entirely possible, and their storage capacity is significantly lower.
However, they have shorter seek and access times, hence faster response; they also have lower power consumption and are insensitive to vibration.

Speed, Size and Cost
An ideal memory would be fast, large, and inexpensive.
A very fast memory can be built with SRAM chips, but these chips are expensive, so it is impractical to build a large memory using SRAMs; SRAMs are used in smaller memories such as cache memories.
DRAM chips are cheaper but slower; they are used for the main memory.
The solution for space is to provide large secondary storage devices: very large disks are available at reasonable prices.

Figure 5.13. Memory hierarchy. (From the processor registers through the primary (L1) cache, the secondary (L2) cache, the main memory, and the magnetic-disk secondary memory, size increases while speed and cost per bit decrease.)

Cache Memories
The speed of the main memory is slower than that of modern processors; the processor cannot afford to waste time accessing instructions and data in the main memory.
A cache memory, which is much faster, makes the main memory appear faster to the processor than it really is.
The effectiveness of a cache is based on the locality of reference: many instructions in localized areas of the program are executed repeatedly during some time period, and the remainder of the program is accessed relatively infrequently.
Locality has two aspects:
Temporal: a recently executed instruction is likely to be executed again very soon.
Spatial: instructions in close proximity to a recently executed instruction (with respect to the instruction's address) are likely to be executed soon.

Operation of a Cache
If the active segments of a program can be placed in a fast cache memory, the total execution time can be reduced significantly.
The memory control circuitry is designed to take advantage of the locality of reference.
Temporal: whenever an item (instruction or data) is first needed, it is brought into the cache, where it remains until it is needed again.
Spatial: instead of fetching just one item from the main memory to the cache, several items that reside at adjacent addresses are fetched; such a set of contiguous addresses is referred to as a block, or cache line.
A replacement algorithm decides which block of data is to be moved back from the cache to the main memory so that a new block can be accommodated.

Figure 5.14. Use of a cache memory. (The cache is placed between the processor and the main memory.)
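The benefit of a high hit rate can be made concrete with the usual effective-access-time estimate (the access times below are illustrative assumptions, not figures from the text):

```python
def effective_access_time(hit_rate, t_cache, t_main):
    """Average time per access: hits are served from the cache,
    misses from the main memory."""
    return hit_rate * t_cache + (1 - hit_rate) * t_main

# With a 10 ns cache, a 100 ns main memory, and a 90% hit rate,
# the average access time drops close to the cache speed (about 19 ns).
print(effective_access_time(0.90, 10, 100))
```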
Operation of a Cache (contd.)
On a read request from the processor, the contents of a block of memory words containing the specified location are transferred into the cache, one word at a time.
The cache can store a reasonable number of words, but it is small compared to the main memory.
When the program references any of the locations in this block, the desired contents are read directly from the cache.
The correspondence between main memory blocks and those in the cache is specified by a mapping function.
When the cache is full and a memory word that is not in the cache is referenced, the cache control hardware decides which block must be removed to create space for the newly arrived block; the collection of rules for this decision is called the replacement algorithm.

Cache Operation
The processor does not need to know explicitly about the existence of the cache; it simply issues read and write requests using memory addresses.
The cache control circuitry determines whether the requested word is currently in the cache.
If it is, the read or write operation is performed on the appropriate cache location; a read or write hit is said to have occurred.
For a read hit, the main memory is not involved.
For a write hit, there are two options:
Write-through protocol: both the cache and the main memory are updated simultaneously.
Write-back (or copy-back) protocol: only the cache is updated during the write operation, and the block is marked with a dirty (or modified) bit; the main memory is updated later, when the block is moved back from the cache.
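The two write policies can be contrasted with a minimal sketch (a one-block "cache" with a dirty bit; all names here are illustrative, not from the text):

```python
class TinyCache:
    """One cached word with a dirty bit, to contrast the two write policies."""
    def __init__(self, memory, write_through):
        self.memory = memory            # backing store: a dict {address: value}
        self.write_through = write_through
        self.addr = None
        self.value = None
        self.dirty = False

    def write(self, addr, value):
        self.addr, self.value = addr, value
        if self.write_through:
            self.memory[addr] = value   # write-through: memory updated immediately
        else:
            self.dirty = True           # write-back: only mark the cached copy modified

    def evict(self):
        if self.dirty:                  # write-back: memory updated only on eviction
            self.memory[self.addr] = self.value
            self.dirty = False

mem = {0: 1}
wb = TinyCache(mem, write_through=False)
wb.write(0, 99)
print(mem[0])   # still 1: memory is stale until the block is written back
wb.evict()
print(mem[0])   # now 99
```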
Limitations of the Write-Through and Write-Back Protocols
The write-through protocol is simpler, but it results in unnecessary write operations in the main memory when a given cache word is updated several times during its cache residency.
The write-back protocol may also result in unnecessary write operations, because when a cache block is written back to the memory, all words of the block are written back, even if only a single word was changed while the block was in the cache.

Read Miss
A read miss occurs when the addressed word is not in the cache.
The block of words containing the requested word is copied from the main memory into the cache; after that, the requested word is sent to the processor.
Alternatively, the word may be sent to the processor as soon as it is read from the memory; this approach, called load-through or early restart, reduces the processor's waiting period but requires more complex circuitry.

Write Miss
A write miss occurs if the addressed word is not in the cache.
If the write-through protocol is used, the information is written directly into the main memory.
If the write-back protocol is used, the block containing the addressed word is first brought into the cache, and the desired word in the cache is then overwritten with the new information.

Mapping Functions
Mapping functions specify the correspondence between main memory blocks and those in the cache; there are three techniques: direct mapping, associative mapping, and set-associative mapping.
As a running example, consider a cache of 128 blocks of 16 words each, for a total of 2K (2048) words, and a main memory with a 16-bit address, i.e., 64K words, organized as 4K blocks of 16 words each.
Consecutive addresses refer to consecutive memory locations.

Direct Mapping
The simplest way to determine the cache location in which to store a memory block is the direct-mapping technique.
Block j of the main memory maps onto block j modulo 128 of
the cache (refer to the following figure).
Whenever one of the main memory blocks 0, 128, 256, ... is loaded into the cache, it is stored in cache block 0; blocks 1, 129, 257, ... are stored in cache block 1, and so on.
Since more than one memory block is mapped onto a given cache block position, contention may arise even when the cache is not full.
For example, the instructions of a program may start in block 1 and continue in block 129 (possibly after a branch), and both map to the same cache block.
The contention is resolved by allowing the new block to overwrite the currently resident block.

Figure 5.15. Direct-mapped cache. (Main memory blocks 0 to 4095 map onto cache blocks 0 to 127; each cache block has a tag, and the main memory address is divided into a 5-bit tag, a 7-bit block field, and a 4-bit word field.)

Direct Mapping (contd.)
The placement of a block in the cache is determined from the memory address, which is divided into three fields:
the low-order 4 bits select 1 of the 16 words in a block;
the 7-bit cache block field determines the cache block in which the new block is stored;
the high-order 5 bits are the tag bits associated with the cache location, identifying which of the 32 memory blocks that map onto this cache position is currently resident in the cache.
If the tag bits of the address match the stored tag, the desired word is in the cache; if there is no match, the block containing the required word must first be read from the main memory and loaded into the cache.
Direct mapping is easy to implement but not flexible.

Associative Mapping
In the associative-mapping technique, a main memory block can be placed into any cache block position.
12 tag bits are required to identify a memory block when it is resident in the cache.
The tag bits of an address received from the processor are compared to the tag bits of each block of the cache to see whether the desired block is present.
This gives complete freedom in choosing the cache location in which to place the memory block.
A new block has to replace an
existing block only if the cache is full.
Replacement algorithms are then needed to choose which block to replace.
The cost of this technique is higher than that of direct mapping, because all 128 tag patterns must be searched; this is called an associative search.

Figure 5.16. Associative-mapped cache. (Any of the main memory blocks 0 to 4095 can reside in any of the cache blocks 0 to 127, each identified by a tag; the main memory address is divided into a 12-bit tag and a 4-bit word field.)

Set-Associative Mapping
This is a combination of direct mapping and associative mapping.
The blocks of the cache are grouped into sets, and the mapping allows a block of the main memory to reside in any block of a specific set.
Since there are now a few choices of where to place a block, the contention problem of the direct method is eased; at the same time, the hardware cost is reduced by decreasing the size of the associative search.
The following figure shows an example with 2 blocks per set.
Memory blocks 0, 64, 128, ..., 4032 map into cache set 0, and they can occupy either of the two block positions within this set.
There are 64 sets in total, so 6 bits are needed to choose a set.
The tag field of the address is then compared with the tags of the cache blocks of that set to check whether the desired block is present.

Figure 5.17. Set-associative-mapped cache with two blocks per set. (The 128 cache blocks are grouped into 64 sets of two; the main memory address is divided into a 6-bit tag, a 6-bit set field, and a 4-bit word field.)

Set-Associative Mapping (contd.)
The number of blocks per set is a parameter that can be selected to suit the requirements of the computer.
Four blocks per set can be accommodated by a 5-bit set field, and eight blocks per set by a 4-bit set field.
What about 128 blocks per set?
128 blocks per set requires no set bits at all: this is the fully associative technique, with 12 tag bits. The other extreme, one block per set, is the direct-mapping method. A cache with k blocks per set is called a k-way set-associative cache.

Valid bit and the cache coherence problem
A control bit called the valid bit is provided for each cache block; it indicates whether the block contains valid data.
The valid bit is different from the dirty (or modified) bit; a dirty bit is required only in systems that do not use the write-through method.
The valid bits are all set to 0 when power is applied to the system, and when the main memory is loaded with new programs and data from the disk. Transfers from the disk to the main memory are carried out by a DMA mechanism, and DMA transfers normally bypass the cache (for both cost and performance reasons).

Valid bit and the cache coherence problem (contd.)
The valid bit of a block is set to 1 the first time the block is loaded from main memory.
Whenever a main memory block is updated by a source that bypasses the cache, a check is made to determine whether that block is currently in the cache. If so, its valid bit is cleared to 0. This ensures that stale data does not exist in the cache.
A second situation arises whenever a DMA transfer is made from the main memory to the disk and the cache uses the write-back protocol: the data in the memory might not reflect the changes that have been made in the cached copy.
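The invalidation step in the first case above can be sketched as follows (a minimal sketch; the class and method names are illustrative, not from the text):

```python
# When a source that bypasses the cache (e.g. DMA) updates a main-memory
# block, any cached copy of that block must have its valid bit cleared.

class Cache:
    def __init__(self):
        self.lines = {}                        # mem_block -> (valid, data)

    def load(self, mem_block, data):
        self.lines[mem_block] = (True, data)   # valid bit set on load

    def dma_write_to_memory(self, mem_block):
        # The cached copy is now stale: clear its valid bit so it can
        # no longer be used.
        if mem_block in self.lines:
            _, data = self.lines[mem_block]
            self.lines[mem_block] = (False, data)

c = Cache()
c.load(5, "old contents")
c.dma_write_to_memory(5)
print(c.lines[5][0])   # → False (cached copy marked invalid)
```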
Solution: flush the cache by forcing the dirty data to be written back to memory before the DMA transfer takes place. The need to ensure that two different entities (the processor and DMA, in this case) use the same copies of data is referred to as the cache coherence problem.

Replacement algorithms
In the direct-mapping method, the position of each block is predetermined, so no replacement strategy exists.
In the associative and set-associative methods there is some flexibility. If the cache is full when a new block arrives, the cache controller must decide which of the old blocks to overwrite. This decision is important and affects system performance; the aim is to keep in the cache those blocks that are likely to be referenced in the near future.
Some algorithms: replace the least recently used (LRU) block, the oldest block, or a random block.

Least Recently Used (LRU) replacement algorithm
Uses the property of locality of reference: there is a high probability that blocks referenced recently will be referenced again soon. So when a block needs to be overwritten, overwrite the one that has gone the longest time without being referenced – the least recently used block. The cache controller must track references to all the blocks.
For a set of 4 blocks, a 2-bit counter per block suffices:
When a hit occurs, the referenced block's counter is set to 0; counters with values lower than the referenced block's previous value are incremented by 1; the others are unchanged.
When a miss occurs and the set is not full, the new block is loaded and assigned counter value 0, and the counters of all other blocks in the set are incremented by 1.
When a miss occurs and the set is full, the block with counter value 3 is removed, the new block takes its place with counter value 0, and the other 3 blocks' counters are incremented by 1.

Reading assignment: go through the examples of mapping techniques in the textbook.

Performance considerations
Two key factors in the success of a computer are cost and performance. Performance depends on how fast instructions can be brought into the processor for execution and how fast they can be executed. The objective is to achieve the best possible performance at the lowest possible cost; the design challenge is to improve performance without increasing cost. The measure of success is the price/performance ratio. In this unit we focus on the performance aspect.
In the case of memory we need both a shorter access time and a larger capacity. Given a slower and a faster unit, it is beneficial if data can be transferred at the rate of the faster unit. To achieve this we use parallel access, through a technique called interleaving.

Interleaving
The main memory of a computer is structured as a collection of physically separate modules, each with its own address buffer register (ABR) and data buffer register (DBR). Memory access operations may then proceed in more than one module at the same time.
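The 2-bit-counter LRU bookkeeping described above can be sketched as follows (a minimal sketch for a single 4-block set; the class name and block numbers are illustrative):

```python
class LRUSet:
    """One 4-block cache set; each entry is (memory_block, 2-bit counter).
    Counter 0 = most recently used, 3 = least recently used."""

    def __init__(self):
        self.blocks = []

    def access(self, mem_block):
        for i, (b, c) in enumerate(self.blocks):
            if b == mem_block:                                  # hit
                # bump only counters lower than the hit block's old value
                self.blocks = [(b2, c2 + 1) if c2 < c else (b2, c2)
                               for b2, c2 in self.blocks]
                self.blocks[i] = (mem_block, 0)                 # hit block -> MRU
                return "hit"
        if len(self.blocks) == 4:                               # miss, set full:
            self.blocks = [e for e in self.blocks if e[1] != 3] # evict counter 3
        self.blocks = [(b, c + 1) for b, c in self.blocks]      # age the rest
        self.blocks.append((mem_block, 0))                      # new block -> MRU
        return "miss"

s = LRUSet()
for blk in (10, 20, 30, 40, 10, 50):
    s.access(blk)
# block 20, the least recently used, was evicted to make room for block 50
print(sorted(s.blocks, key=lambda e: e[1]))  # → [(50, 0), (10, 1), (40, 2), (30, 3)]
```

Note that the counter values in a full set are always a permutation of 0–3, so the block to evict can always be found at counter value 3.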
Two ways of implementing interleaving:
(a) The high-order k bits name one of the n = 2^k modules, and the low-order m bits name a particular word in that module. When consecutive locations are accessed, only one module is involved; devices with DMA capability can meanwhile access information in other memory modules.
(b) The low-order k bits select a module, and the high-order m bits name a location within that module. Consecutive addresses are then located in successive modules, giving faster access and higher average utilization. This arrangement is called memory interleaving, and it is the more effective of the two.

[Figure 5.25: Addressing multiple-module memory systems. (a) Consecutive words in a module; (b) consecutive words in consecutive modules.]

Go through the example in the text for better understanding.

Problem
Consider a cache with 8-word blocks. On a read miss, the block that contains the desired word must be copied from main memory into the cache. Assume it takes one clock cycle (cc) to send an address to main memory. The memory is built with DRAM chips: the first word of a block takes 8 cc to access, and subsequent words of the same block can be accessed in 4 cc per word. One cc is needed to send one word to the cache.
Using a single memory module, the time needed to load the desired block into the cache is 1 + 8 + (7 × 4) + 1 = 38 cc.
Using memory interleaving with four modules, 4 words are accessed in 8 cc and transferred in the next 4 cc, one word per cycle, during which the remaining 4 words are read and stored in the DBRs. These 4 words are then transferred to the cache one word at a time. So the time required to transfer a block is 1 + 8 + 4 + 4 = 17 cc.

Hit rate and miss penalty
The number of hits stated as a fraction of all attempted accesses is called the hit rate; the number of misses stated as a fraction of all attempted accesses is called the miss rate. Hit rates well over 0.9 are essential for high-performance computers.
Performance is adversely affected by the actions that must be taken after a miss. The extra time needed to bring the desired information into the cache is called the miss penalty: the time needed to bring a block of data from a slower unit in the memory hierarchy to a faster one. Interleaving can reduce the miss penalty substantially.

Problem
Let h be the hit rate, M the miss penalty (the time to access information in main memory), and C the time to access information in the cache. The average access time is
tave = hC + (1 − h)M
Consider the same parameters as in the previous problem. If the computer has no cache, then with a fast processor and a typical DRAM main memory it takes 10 clock cycles for each memory read access. Suppose instead the computer has a cache that holds 8-word blocks and an interleaved main memory; then, as computed above, it takes 17 cycles to load one block into the cache.
Suppose 30 percent of the instructions in a program perform a read or write operation, giving 130 memory accesses for every 100 instructions. Assume the hit rates are 0.95 for instructions and 0.9 for data, and that the miss penalty is the same for read and write accesses. An estimate of the improvement in performance is
time without cache / time with cache = (130 × 10) / (100(0.95 × 1 + 0.05 × 17) + 30(0.9 × 1 + 0.1 × 17)) = 5.04
So the computer with a cache performs about 5 times better (assuming the processor clock and the system bus have the same speed).

Caches on the processor chip
From the speed point of view, the optimal place for a cache is on the processor chip. Since space on the processor chip is required for many other functions, this limits the size of the cache that can be accommodated. One may use either a combined cache for instructions and data (which offers greater flexibility in mapping) or separate instruction and data caches (which allow parallel access to instructions and data, at the cost of more complex circuitry).
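The block-transfer times and the speedup estimate worked out above can be checked with a few lines (all numbers are taken directly from the problems; nothing new is assumed):

```python
# 1 cycle to send the address, 8 cycles for the first DRAM word,
# 4 cycles per subsequent word, 1 cycle to forward a word to the cache.
single_module = 1 + 8 + 7 * 4 + 1          # 38 cycles for an 8-word block
interleaved   = 1 + 8 + 4 + 4              # 17 cycles with 4-way interleaving

# Average-access estimate tave = h*C + (1-h)*M, applied separately to the
# 100 instruction fetches and 30 data accesses per 100 instructions.
M = interleaved                            # miss penalty in cycles
time_no_cache   = 130 * 10                 # 10 cycles per access, no cache
time_with_cache = 100 * (0.95 * 1 + 0.05 * M) + 30 * (0.9 * 1 + 0.1 * M)
speedup = time_no_cache / time_with_cache
print(single_module, interleaved, round(speedup, 2))   # → 38 17 5.04
```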
Normally 2 levels of caches are used, L1 and L2.
L1 is designed to allow very fast access by the processor; its access time has a very large effect on the clock rate of the processor. L2 can be slower, but it should be much larger to ensure a high hit rate.
A workstation computer may include an L1 cache with a capacity of tens of kilobytes and an L2 cache with a capacity of several megabytes. Including an L2 cache further reduces the impact of main memory speed on the performance of the computer.

Caches on the processor chip (contd.)
The average access time experienced by the processor with 2 levels of caches is
tave = h1C1 + (1 − h1)h2C2 + (1 − h1)(1 − h2)M
where h1 is the hit rate in L1, h2 the hit rate in L2, C1 the time to access information in the L1 cache, C2 the time to access information in the L2 cache, and M the time to access information in main memory. The number of misses in the L2 cache must be very low.

Write buffers
A write buffer is a temporary storage area for write requests.
Usage with the write-through protocol: each write operation results in writing the new value to memory. If the processor waited for each memory write to complete, it would be slowed down, yet the processor does not immediately require the result of a write operation. So instead of waiting, the processor places the write request into the buffer and continues executing the next instruction. The buffered write requests are sent to memory whenever the memory is not servicing read requests; read requests must be serviced immediately, since the processor cannot proceed without the data to be read.
The write buffer may hold a number of write requests, and a read request may refer to data that are still in the buffer. Therefore, the addresses of data to be read from memory are compared with the addresses of the data in the write buffer; in case of a match, the data in the write buffer are used.
Usage with the write-back protocol: write operations are simply performed on the corresponding word in the cache. If a new block of data comes into the cache as the result of a read miss, it may replace an existing block that contains dirty (modified) data, which must be written back to main memory. If the write-back operation were performed first, the processor would wait longer for the new block to be read into the cache. So, to perform the read first, a fast write buffer is provided for temporary storage of the ejected dirty block; its contents are written to memory afterwards.

Prefetching
Normally, new data are brought into the cache when they are first needed, and the processor has to pause until they arrive. To avoid this stalling of the processor, it is possible to prefetch data into the cache before they are needed. Prefetching can be done in software or in hardware.
In software, a separate prefetch instruction is included, which loads the data into the cache in time for their use in the program. This allows accesses to main memory to overlap with computation in the processor. Prefetch instructions can be inserted either by the compiler or by the programmer; compiler insertion is generally better.
In hardware, circuitry is added that attempts to discover a pattern in memory references and prefetches data according to this pattern.

Lock-up free cache
Software prefetching does not work well if it interferes with the normal execution of instructions, that is, if the act of prefetching stops other accesses to the cache until the prefetch is completed. A cache of this type is said to be locked while it services a miss. The remedy is to allow the processor to access the cache while a miss is being serviced. A cache that allows multiple outstanding misses is called a lockup-free cache. Since it can service only one miss at a time, it must include circuitry, such as special registers holding the pertinent information, to keep track of all outstanding misses.

Virtual Memories
Refer to the slides given separately.

Secondary Storage
Magnetic hard disks: organization and accessing of data on a disk, access time, typical disks, data buffer/cache, disk controller, floppy disks
RAID disk arrays, commodity disk considerations
Optical disks: CD technology, CD-ROM, CD-Recordable, CD-Rewritable, DVD technology, DVD-RAM
Magnetic tape systems