TDTS 08 – Lecture 2 (2015-11-03)
Zebo Peng, IDA, LiTH

Lecture 2: Memory Systems
- Basic components
- Memory hierarchy
- Cache memory
- Virtual memory

Internal and External Memories
[Figure: the CPU and the main memory exchange data under control signals; the main memory in turn exchanges data with the secondary memory, again over separate data-transfer and control paths.]

Main Memory Model
[Figure: the main memory is an array of words (8, 16, 32, or 64 bits each, one bit per cell), addressed 0, 1, 2, 3, ... A memory control unit performs address selection and read/write control; the address is supplied by the MAR and the data word passes through the MBR, both registers in the CPU.]

Memory Characteristics
The most important characteristics of a memory:
- speed: as fast as possible;
- size: as large as possible;
- cost: a reasonable price.
They are determined by the technology used for implementation. (Analogy: your personal library.)

Memory Access Bottleneck
[Figure: the path between the CPU and the memory is a bottleneck.] The quantitative measure of the capacity of this bottleneck is the memory bandwidth.

Memory Bandwidth
Memory bandwidth denotes the amount of data that can be accessed from a memory per second:

    M-Bandwidth = amount of data per access / memory cycle time

Ex. MCT = 100 nanoseconds and 4 bytes (a word) per access:
M-Bandwidth = 40 megabytes per second.
There are two basic techniques to increase the bandwidth of a given memory:
- Reduce the memory cycle time
  - expensive;
  - memory size limitation.
- Divide the memory into several banks, each of which has its own control unit (see the sketch after the next figure).

Memory Banks
[Figure: four memory banks, each with its own control unit, connected to the CPU. Program and data are placed in an interleaved fashion: bank 0 holds words 0, 4, 8, 12; bank 1 holds 1, 5, 9, 13; bank 2 holds 2, 6, 10, 14; bank 3 holds 3, 7, 11, 15.]
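To make the bandwidth formula and the interleaved placement concrete, here is a minimal Python sketch; it is not from the slides, and the function names and the four-bank configuration are illustrative.

```python
# Bandwidth formula and interleaved bank placement from the slides above.
# Function names and the 4-bank configuration are illustrative.

def memory_bandwidth(cycle_time_s, bytes_per_access):
    """M-Bandwidth = amount of data per access / memory cycle time."""
    return bytes_per_access / cycle_time_s

def bank_of(word_address, num_banks=4):
    """Interleaved placement: consecutive words land in consecutive banks."""
    return word_address % num_banks

# The slide's example: 100 ns cycle time, 4 bytes per access -> 40 MB/s.
print(memory_bandwidth(100e-9, 4))        # 40000000.0 bytes per second
print([bank_of(a) for a in range(8)])     # [0, 1, 2, 3, 0, 1, 2, 3]
```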
Motivation
What do we need? A memory to store very large programs and to work at a speed comparable to that of the CPU. The reality is:
- the larger a memory, the slower it will be;
- the faster a memory, the greater the cost per bit.
A solution: build a composite memory system which combines a small, fast memory with a large, slow memory, and behaves, most of the time, like a large and fast memory. This two-level principle can be extended to a hierarchy of many levels.

Memory Hierarchy
CPU -> Registers -> Cache -> Main Memory -> Secondary memory of direct access type -> Secondary memory of archive type.

Memory Hierarchy (Cont'd)

    Level                              Access time (example)   Capacity (example)
    Registers                          1-10 ns                 16-256
    Cache                              10-50 ns                4-512K
    Main memory                        40-500 ns               4-256M
    Secondary memory (direct access)   5-100 ms (for 4KB)      40G/unit
    Secondary memory (archive)         0.5-5 s (for 8KB)       50M/tape

As one goes down the hierarchy, the following occur:
- decreasing cost/bit;
- increasing capacity;
- increasing access time;
- decreasing frequency of access by the CPU.

Mismatch of CPU and MM Speeds
[Figure: CPU and main-memory cycle times (nanoseconds, logarithmic scale from 10^0 to 10^4) plotted from 1955 to 2005; both fall steadily, but a speed gap of about one order of magnitude (i.e., 10 times) remains between them.]

Cache Memory
A cache is a very fast memory which is put between the main memory and the CPU, and used to hold segments of program and data of the main memory. [Figure: the CPU sends addresses to the cache and, on a miss, the cache addresses the main memory; instructions and data flow back through the cache to the CPU.]

Zebo's Cache Memory Model
A personal library for a high-speed reader. [Figure: the cache consists of storage cells and a memory controller.] A computer is a "predictable and iterative reader"; therefore a high cache hit ratio, e.g., 96%, is achievable even with a relatively small cache.

Cache Memory Features
- It is transparent to the programmers.
- Only a small part of the program/data in the main memory has its copy in the cache (e.g., an 8KB cache with an 8MB memory).
- If the CPU wants to access program/data not in the cache (called a cache miss), the relevant block of the main memory is copied into the cache.
- Memory accesses in the near future will then usually refer to the same word or words in its neighborhood, and will not have to involve the main memory. This property of program execution is denoted as locality of reference.

Locality of Reference
- Temporal locality: if an item is referenced, it will tend to be referenced again soon.
- Spatial locality: if an item is referenced, items whose addresses are close by will tend to be referenced soon.
This access pattern is referred to as the locality-of-reference principle, which is an intrinsic feature of the von Neumann architecture:
- sequential instruction storage;
- loops and iterations (e.g., subroutine calls);
- sequential data storage (e.g., arrays).

Layered Memory Performance
Average access time:

    AAT = Phit * Tcache_access + (1 - Phit) * (Tmm_access + Tcache_access) * Block_size + Tchecking

where
- Phit = the probability of a cache hit (the cache hit ratio);
- Tcache_access = cache access time;
- Tmm_access = main memory access time;
- Block_size = number of words in a cache block;
- Tchecking = the time needed to check for cache hit or miss.

Ex. A computer has an 8MB main memory with 100 ns access time and an 8KB cache with 10 ns access time; Block_size = 4, Tchecking = 2.1 ns, and Phit = 0.97. AAT will be 25 ns (reproduced in the sketch below).
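As a quick numeric check of the slide's example, here is a minimal Python sketch; the function name is illustrative.

```python
# Average access time of a two-level (cache + main memory) system.
def average_access_time(p_hit, t_cache, t_mm, block_size, t_checking):
    # On a hit: one cache access. On a miss: a whole block is fetched, where
    # each of the block_size words costs one main-memory plus one cache access.
    return (p_hit * t_cache
            + (1 - p_hit) * (t_mm + t_cache) * block_size
            + t_checking)

# The slide's example: 0.97 hit ratio, 10 ns cache, 100 ns main memory,
# 4-word blocks, 2.1 ns hit/miss checking time.
print(average_access_time(0.97, 10, 100, 4, 2.1))   # 25.0 (ns)
```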
Cache Design
The size and nature of the copied block must be carefully designed, as well as the algorithm that decides which block to remove from the cache when it is full:
- cache block size (line size);
- total cache size;
- mapping function;
- replacement method;
- write policy;
- number of caches:
  - single, two-level, or three-level cache;
  - unified vs. split cache.

Split Data and Instruction Caches?
Split caches (Harvard architecture):
+ Competition for the cache between instruction processing and execution units is eliminated.
+ Instruction fetch can proceed in parallel with memory access from the CPU for operands.
- One cache may be overloaded while the other is underutilized.
Unified caches:
+ Balance the load between instruction and data fetches better, depending on the dynamics of the program execution.
+ Design and implementation are cheaper.
- Lower performance.

Direct Mapping Cache
Direct mapping: each block of the main memory is mapped into a fixed cache slot. [Figure: two memory blocks that map to the same slot must take turns occupying it.]

Direct Mapping Cache Example
We have a 10,000-word main memory and a 100-word cache; 10 memory cells are grouped into a block. The memory thus holds 1,000 blocks and the cache 10 slots (slot 0 holds cache words 00-09, slot 1 words 10-19, ..., slot 9 words 90-99). A four-digit decimal memory address is interpreted as a tag (2 digits), a slot number (1 digit), and a word within the block (1 digit): block 0000-0009 maps to slot 0 with tag 00, block 0120-0129 to slot 2 with tag 01, and block 9990-9999 to slot 9 with tag 99.

Direct Mapping Pros & Cons
+ Simple to implement and therefore inexpensive.
- Fixed location for each block: if a program repeatedly accesses two blocks that map to the same cache slot, the cache miss rate is very high.

Associative Mapping
A main memory block can be loaded into any slot of the cache. To determine whether a block is in the cache, a mechanism is needed to examine the tag of every slot simultaneously (an associative memory). [Figure: the block-number part of the address, e.g., 010 for words 0106-0107, is compared in parallel against the tags of all slots, e.g., 287, 001, 297.]

Fully Associative Organization
[Figure: fully associative cache organization.]

Set Associative Organization
- The cache is divided into a number of sets (K); each set contains a number of slots (W).
- A given block maps to any slot in a given set, i.e., block i can be in any slot of set j.
- For example, with 2 slots per set (W = 2) we have 2-way associative mapping: a given block can be in one of 2 slots.
- Direct mapping: W = 1 (no alternative).
- Fully associative: K = 1 (W = total number of slots in the cache; all mappings possible).
(A sketch combining these organizations with LRU replacement follows after the write-policy slides.)

Replacement Algorithms
With direct mapping there is no need for one. With associative mapping, a replacement algorithm is needed in order to determine which block to replace:
- First-in-first-out (FIFO).
- Least-recently used (LRU): replace the block that has been in the cache longest with no reference to it.
- Least-frequently used (LFU): replace the block that has experienced the fewest references.
- Random.
[Figure: each cache slot stores usage information (e.g., a time stamp) together with its tag.]

Write Policy
The problem: how to keep the cache content and the main memory content consistent without losing too much performance?
Write through: all write operations are passed to main memory.
- If the addressed location is currently in the cache, the cache is updated so that it is coherent with the main memory.
- For writes, the processor always slows down to main memory speed.
- Since the percentage of writes is small (ca. 15%), this scheme doesn't lead to a large performance reduction.

Write Policy (Cont'd)
Write through with buffered write: the same as write through but, instead of slowing the processor down by writing directly to main memory, the write address and data are stored in a high-speed write buffer, which transfers the data to main memory while the processor continues its task.
- Higher speed, but more complex hardware.
Write back: write operations update only the cache memory, which is not kept coherent with the main memory.
- When a slot is replaced from the cache, its content has to be copied back to memory.
- Good performance (usually several writes are performed on a cache block before it is replaced), but more complex hardware is needed.
Cache coherence problems are very complex and difficult to solve in multiprocessor systems (to be discussed later)!
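To tie the mapping organizations, the replacement algorithms, and the write-back policy together, here is a small illustrative Python sketch, not taken from the lecture: a W-way set-associative cache with LRU replacement and a write-back dirty flag. The class name, the configuration, and the address trace are invented; with ways = 1 the same code degenerates to direct mapping, and with num_sets = 1 to a fully associative cache.

```python
from collections import OrderedDict

class SetAssociativeCache:
    """Illustrative W-way set-associative cache with LRU replacement."""

    def __init__(self, num_sets=4, ways=2, block_size=4):
        self.num_sets, self.ways, self.block_size = num_sets, ways, block_size
        # One OrderedDict per set, mapping tag -> dirty flag; the ordering
        # tracks recency, with the least recently used block first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, address, write=False):
        block = address // self.block_size
        index = block % self.num_sets        # which set the block maps to
        tag = block // self.num_sets         # identifies the block in that set
        slots = self.sets[index]
        hit = tag in slots
        if hit:
            slots.move_to_end(tag)           # mark as most recently used
        else:
            if len(slots) == self.ways:      # set full: evict the LRU block
                victim, dirty = slots.popitem(last=False)
                if dirty:                    # write back: copy the modified
                    pass                     # victim block to main memory here
            slots[tag] = False               # load the missing block (clean)
        if write:
            slots[tag] = True                # write back: only mark it dirty
        return hit

cache = SetAssociativeCache()
for addr in [0, 16, 0, 32, 16]:              # word addresses
    print(addr, "hit" if cache.access(addr) else "miss")
# 0 miss, 16 miss, 0 hit, 32 miss (evicts the block of 16), 16 miss
```

The trace shows the direct-mapping weakness discussed above disappearing: addresses 0 and 16 map to the same set, yet both can stay resident until a third conflicting block (32) forces an LRU eviction.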
Cache Architecture Examples
- Intel 80486 (introduced 1989): a single on-chip cache of 8 Kbytes; line size: 16 bytes; 4-way set associative organization.
- Intel Pentium (introduced 1993): two on-chip caches, one for data and one for instructions; each cache: 8 Kbytes; line size: 32 bytes; 2-way set associative organization.
- IBM PowerPC 620 (introduced 1995): two on-chip caches, one for data and one for instructions; each cache: 32 Kbytes; line size: 64 bytes; 8-way set associative organization.

Cache Architecture Examples (Cont'd)
Intel Itanium 2 (introduced 2002): three levels of cache:

                   L1              L2             L3
    Contents       Split D and I   Unified D + I  Unified D + I
    Size           16 Kbytes each  256 Kbytes     3 Mbytes
    Line size      64 bytes        128 bytes      128 bytes
    Associativity  4-way           8-way          12-way
    Access time    1 cycle         5-7 cycles     14-17 cycles
    Store policy   Write-through   Write-back     Write-back

Motivation for Virtual Memory
The physical main memory (RAM) is very limited in space:
- It may not be big enough to store all the executing programs at the same time.
- A program may need more memory than the main memory provides, but not all of the program needs to be kept in main memory at the same time.
Virtual memory takes advantage of the fact that, at any given instant of time, an executing program needs only a fraction of the memory that the whole program occupies. The basic idea: load only the pieces of each executing program that are currently needed.

Paging
- Divide programs (processes) into equal-sized, small blocks, called pages.
- Divide the primary memory into equal-sized, small blocks called page frames.
- Allocate the required number of page frames to a program. A program does not require contiguous page frames!
The operating system (OS) is responsible for:
- maintaining a list of free frames;
- using a page table to keep track of the mapping between pages and page frames.

Logical and Physical Addresses
[Figure: a page table maps logical page numbers (0, 1, 2, 3) to physical page frames.] Implementation of the page table:
- in main memory: slow, since an extra memory access is needed;
- in separate registers: fast but expensive;
- in a cache.

Objective of Virtual Memory
To provide the user/programmer with a much bigger memory than the main memory, with the help of the operating system. Virtual memory size >> main memory size. [Figure: program addresses 0000-5000 are mapped partly onto main-memory addresses 0000-3000 and partly onto the secondary memory.]

Page Fault
When a VM page which is not in the main memory is accessed, a page fault occurs. The page must then be loaded from the secondary memory into the main memory by the OS. [Figure: the virtual address is split into a page number and an offset; the page map either yields a page in MM or signals a page fault, an interrupt to the OS.]
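A minimal Python sketch of this translation step, assuming 1024-word pages; the page-table contents and the PageFault exception are invented placeholders, not part of any real OS interface.

```python
PAGE_SIZE = 1024                        # words per page (assumed)

class PageFault(Exception):
    """Raised when the accessed page is not in main memory (interrupt to OS)."""

page_table = {0: 3, 1: 7, 2: None}      # page number -> frame number (None = on disk)

def translate(virtual_address):
    page, offset = divmod(virtual_address, PAGE_SIZE)
    frame = page_table.get(page)
    if frame is None:                   # page fault: the OS must load the page
        raise PageFault(page)           # from secondary memory and retry
    return frame * PAGE_SIZE + offset   # physical address

print(translate(1500))                  # page 1, offset 476 -> 7*1024 + 476 = 7644
```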
Page Replacement
When a page fault occurs and all page frames are occupied, one of them must be replaced. If the replaced page has been modified during its time in the main memory, the updated version must be written back to the secondary memory. We would like to replace the page that will not be accessed for the longest time in the future. Problem: we don't know exactly what will happen in the future. Solution: we predict the future by studying the access patterns up till now ("learn from history").

Replacement Algorithms
- FIFO (first in, first out): replace the page that has been in MM the longest.
- LRU (least recently used): replace the page that has not been accessed for the longest time.
- LFU (least frequently used): replace the page with the smallest number of accesses during the latest time period.
Random replacement (used for caches) is not used for VM!

Summary
- A memory system has to store very large programs and a lot of data, and still provide fast access.
- No single type of memory can provide all the needs of a computer system; usually several different storage mechanisms are organized in a layered hierarchy.
- A cache is a hardware solution that improves memory access and is transparent to the programmers.
- Virtual memory provides a much larger address space than the available physical space, with the help of the OS (a software solution).
- The layered structure works very well due to the locality-of-reference principle.