Datorarkitektur och operativsystem
Lecture 7
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

It is ‘impossible’ to have memory that is both
• unlimited (large in capacity)
• and fast
We create an illusion for the programmer. Before that, let us look at the way programs access memory.

§5.1 Introduction

Principle of Locality
Programs access a small proportion of their address space at any time.
• Temporal locality: items accessed recently are likely to be accessed again soon, e.g., instructions in a loop.
• Spatial locality: items near those accessed recently are likely to be accessed soon, e.g., sequential instruction access, array data.

To Take Advantage of Locality
• Employ a memory hierarchy: use multiple levels of memories.
• Larger distance from the processor means
  • larger size
  • larger access time

Memory Technology
• Static RAM (SRAM): 0.5 ns – 2.5 ns, $2000 – $5000 per GB
• Dynamic RAM (DRAM): 50 ns – 70 ns, $20 – $75 per GB
• Magnetic disk: 5 ms – 20 ms, $0.20 – $2 per GB
• Ideal memory: the access time of SRAM with the capacity and cost/GB of disk

[Figure 5.1: The basic structure of a memory hierarchy. By implementing the memory system as a hierarchy, the user has the illusion of a memory that is as large as the largest level of the hierarchy, but can be accessed as if it were all built from the fastest memory. Flash memory has replaced disks in many embedded devices, and may lead to a new level in the storage hierarchy for desktop and server computers; see Section 6.4.]

Memory Hierarchy Levels
• Store everything on disk.
• Copy recently accessed (and nearby) items from disk to the smaller DRAM memory (the main memory).
• Copy more recently accessed (and nearby) items from DRAM to the smaller SRAM memory (the cache memory attached to the CPU).
• This structure, with the appropriate operating mechanisms, allows the processor to have an access time that is determined primarily by level 1 of the hierarchy and yet have a memory as large as level n. Although the local disk is normally the bottom of the hierarchy, some systems use tape or a file server over a local area network as the next levels of the hierarchy.

Block (aka line): the unit of copying; may be multiple words.
• If the accessed data is present in the upper level, it is a hit: the access is satisfied by the upper level. Hit ratio = hits/accesses.
• If the accessed data is absent, it is a miss: the block is copied from the lower level (the time taken is the miss penalty), and then the accessed data is supplied from the upper level. Miss ratio = misses/accesses = 1 − hit ratio.

§5.2 The Basics of Caches

Cache Memory
• Cache memory: the level of the memory hierarchy closest to the CPU.
• Given accesses X1, …, Xn−1, Xn: how do we know if the data is present? Where do we look?

Direct Mapped Cache
• Location determined by the address. Direct mapped: only one choice,
  (Block address) modulo (#Blocks in cache)
• If #Blocks is a power of 2, the modulo can be computed from the low-order address bits.
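To make the index/tag arithmetic concrete, here is a minimal C sketch. It is not from the slides; the block and cache sizes are assumptions chosen to match the eight-block example that follows.

```c
#include <stdio.h>
#include <stdint.h>

/* Assumed parameters: a direct-mapped cache with 8 one-word (4-byte)
 * blocks, matching the cache example below. */
#define BLOCK_BYTES 4u
#define NUM_BLOCKS  8u   /* must be a power of 2 */

int main(void) {
    uint32_t addr       = 0x12345678;              /* an arbitrary byte address        */
    uint32_t block_addr = addr / BLOCK_BYTES;      /* byte address -> block address    */
    uint32_t index      = block_addr % NUM_BLOCKS; /* low-order bits pick the block    */
    uint32_t tag        = block_addr / NUM_BLOCKS; /* high-order bits stored as tag    */
    printf("index = %u, tag = 0x%x\n", (unsigned)index, (unsigned)tag);
    return 0;
}
```

Because NUM_BLOCKS is a power of two, `block_addr % NUM_BLOCKS` is exactly the low-order three bits of the block address, and the division for the tag is just a shift.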
Tag Bits
• Each cache location can store the contents of more than one memory location. How do we know which particular block is stored in a cache location?
• Add a set of tag bits to the cache. The tag needs only the high-order bits of the address.

Valid Bits
• What if there is no data in a location? A valid bit indicates this: 1 = present, 0 = not present.
• Initially 0, because when the processor starts up the cache does not contain any valid data.

Cache Example
8 blocks, 1 word/block, direct mapped. Initial state: all valid bits are N. The sequence of accesses below fills the cache (a runnable simulation of this sequence follows the write policies at the end of this subsection):

Word addr   Binary addr   Hit/miss   Cache block
22          10 110        Miss       110
26          11 010        Miss       010
22          10 110        Hit        110
26          11 010        Hit        010
16          10 000        Miss       000
3           00 011        Miss       011
16          10 000        Hit        000
18          10 010        Miss       010   (replaces Mem[11010])

Final cache state:

Index   V   Tag   Data
000     Y   10    Mem[10000]
001     N
010     Y   10    Mem[10010]
011     Y   00    Mem[00011]
100     N
101     N
110     Y   10    Mem[10110]
111     N

Cache Misses
• On a cache hit, the CPU proceeds normally.
• On a cache miss: stall the CPU pipeline and fetch the block from the next level of the hierarchy.
  • Instruction cache miss: restart the instruction fetch.
  • Data cache miss: complete the data access.

Write-Through
• On a data-write hit, we could just update the block in the cache, but then the cache and memory would be inconsistent.
• Write-through: also update memory. But this makes writes take longer.

Write Buffer
• Solution: a write buffer that holds data waiting to be written to memory.
• The CPU continues immediately after writing to the write buffer; the buffer entry is freed later, when the memory write completes.
• But the CPU stalls on a write if the write buffer is already full. The buffer can fill up if the rate at which memory completes writes is less than the rate at which write instructions are generated, or if there is a burst of writes.

Write-Back
• Alternative: on a data-write hit, just update the block in the cache and keep track of whether each block is dirty.
• When a dirty block is replaced, write it back to memory.
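The following self-contained C sketch, an addition to these notes rather than part of the slides, replays the eight-access example above on a direct-mapped cache with valid and tag bits. It prints the same hit/miss pattern as the tables:

```c
#include <stdio.h>
#include <stdbool.h>

#define NUM_BLOCKS 8   /* 8 blocks, 1 word per block, direct mapped */

struct line { bool valid; unsigned tag; };

int main(void) {
    struct line cache[NUM_BLOCKS] = {{0}};               /* valid bits start at 0 */
    unsigned trace[] = {22, 26, 22, 26, 16, 3, 16, 18};  /* word addresses        */
    for (int i = 0; i < 8; i++) {
        unsigned addr  = trace[i];
        unsigned index = addr % NUM_BLOCKS;   /* low-order 3 bits  */
        unsigned tag   = addr / NUM_BLOCKS;   /* high-order bits   */
        bool hit = cache[index].valid && cache[index].tag == tag;
        if (!hit) {   /* miss: fetch the block and record its tag */
            cache[index].valid = true;
            cache[index].tag   = tag;
        }
        printf("addr %2u -> block %u: %s\n", addr, index, hit ? "hit" : "miss");
    }
    return 0;
}
```

Note how address 18 maps to block 010, which already holds the block for address 26 with a different tag, so the last access is a miss that replaces Mem[11010].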
§5.3 Measuring and Improving Cache Performance

Measuring Cache Performance
Components of CPU time:
• Program execution cycles (includes cache hit time)
• Memory stall cycles (mainly from cache misses)

With simplifying assumptions:

Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
                    = (Instructions / Program) × (Misses / Instruction) × Miss penalty

Cache Performance Example
Given:
• I-cache miss rate = 2%
• D-cache miss rate = 4%
• Miss penalty = 100 cycles
• Base CPI (ideal cache) = 2
• Loads & stores are 36% of instructions

Miss cycles for all instructions, if I is the instruction count:
• I-cache: I × 0.02 × 100 = 2I
• D-cache: I × 0.36 × 0.04 × 100 = 1.44I
Miss cycles per instruction: 2 + 1.44.
So the actual CPI = 2 + 2 + 1.44 = 5.44.
How much faster would a processor be with a perfect cache that never misses? The ideal CPU is 5.44/2 = 2.72 times faster.
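As a quick check of the arithmetic, here is a small C program (an addition to these notes) that reproduces the example:

```c
#include <stdio.h>

int main(void) {
    /* Figures from the example above. */
    double base_cpi     = 2.0;    /* CPI with an ideal cache        */
    double icache_miss  = 0.02;   /* 2% of instruction fetches miss */
    double dcache_miss  = 0.04;   /* 4% of data accesses miss       */
    double loads_stores = 0.36;   /* fraction of instructions       */
    double miss_penalty = 100.0;  /* cycles per miss                */

    double i_stall = icache_miss * miss_penalty;                /* = 2.00 */
    double d_stall = loads_stores * dcache_miss * miss_penalty; /* = 1.44 */
    double cpi     = base_cpi + i_stall + d_stall;              /* = 5.44 */

    printf("actual CPI = %.2f, perfect-cache speedup = %.2fx\n",
           cpi, cpi / base_cpi);
    return 0;
}
```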
§5.4 Virtual Memory

Virtual Memory
• VM is the technique of using main memory as a “cache” for secondary (disk) storage.
• Managed jointly by the CPU hardware and the operating system (OS).
• Same underlying concept as in caches, but with different terminology:
  • Virtual address: the address produced by the program.
  • Physical address: an address in the main memory.
  • The CPU and OS translate virtual addresses to physical addresses.
  • A VM “block” is called a page.
  • A VM translation “miss” is called a page fault.

Motivation 1
• Multiple programs share main memory, and they can change dynamically.
• To avoid writing into each other’s data, we would like a separate address space for each program.
• With VM, each program gets a private virtual address space holding its frequently used code and data.
• VM translates the virtual address into a physical address, allowing protection from other programs.

Motivation 2
• A large program cannot fit into the main memory.
• VM automatically maps addresses into disk space if the main memory is not sufficient.

Page Fault Penalty
• On a page fault, the page must be fetched from disk. This takes millions of clock cycles: main memory latency is around 100,000 times better than disk latency.
• Try to minimize the page fault rate.
• Smart replacement algorithms are implemented in software in the OS: reading from disk is slow enough that the software overhead is negligible.

Address Translation
• Address translation: the process by which a virtual address is mapped to a physical address.
• A virtual address has two components: the virtual page number and the page offset.
• The page offset does not change during translation, and its number of bits determines the size of a page.
• The number of pages addressable with a virtual address might be larger than the number of pages addressable with a physical address, which gives the illusion of an unbounded amount of virtual memory.

Translation Using a Page Table
• A page table, which resides in memory, is used for address translation. Each program has its own page table.
• A hardware register (the page table register in the CPU) points to the start of the page table in physical memory.
• The page table stores placement information: an array of page table entries, indexed by virtual page number.
• Note the use of a valid bit. If the page is present in memory, the entry stores the physical page number, plus other status bits.
• If the page is not present, a page fault occurs and the OS is given control.
• (A C sketch of these translation steps follows the OS recap below.)

In the next few slides, we recap some OS concepts.

Recap: Process
• A process (a program in execution) has a context defined by the values in its program counter, registers, and page table.
• If another process preempts a running process, this context must be saved. Rather than saving the entire page table, only the page table register is saved.
• To restart the process in the ‘running’ state, the operating system reloads the context.

[Diagram: the five-state process model. States: New, Ready, Running, Blocked, Terminated. Transitions: create → New; admit: New → Ready; switch (schedule new job): Ready → Running; switch (preempt): Running → Ready; exit: Running → Terminated; Blocked holds processes waiting on events between Running and Ready.]

Role of the OS
• The OS is responsible for allocating the physical memory and updating the page tables.
• It ensures that the virtual addresses of different processes do not collide, thus providing protection.
• Page faults are handled by the OS.
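Here is the promised C sketch of the translation steps. The page size and table size are assumptions chosen for illustration; real page tables are far larger, and only the steps matter here.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Assumed sizes: 4 KiB pages and a 16-entry page table. */
#define PAGE_BITS  12   /* 4 KiB page => 12 offset bits */
#define NUM_VPAGES 16

struct pte { bool valid; uint32_t ppn; };   /* one page table entry */

/* Returns true and fills *paddr on success; returning false models a
 * page fault, where the OS would be given control. */
bool translate(const struct pte *page_table, uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;               /* virtual page number */
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);  /* offset is unchanged */
    if (vpn >= NUM_VPAGES || !page_table[vpn].valid)
        return false;                                   /* page fault */
    *paddr = (page_table[vpn].ppn << PAGE_BITS) | offset;
    return true;
}

int main(void) {
    struct pte pt[NUM_VPAGES] = {{0}};
    pt[3] = (struct pte){ .valid = true, .ppn = 7 };    /* map vpage 3 -> ppage 7 */
    uint32_t pa;
    if (translate(pt, 0x3ABC, &pa))
        printf("virtual 0x3ABC -> physical 0x%X\n", (unsigned)pa);  /* 0x7ABC */
    else
        printf("page fault\n");
    return 0;
}
```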
Role of the OS (continued)
• The OS creates a space on the disk (not main memory) for all pages of a process when the process is created, called the swap space.
• The OS also creates a record of where each virtual page of the process is located on the disk.

Page Fault Handled by the OS
• If all pages in main memory are in use (memory is full), the OS must choose a page to replace. The replaced page must be written to the swap space on the disk.
• To reduce the page fault rate, prefer least-recently used (LRU) replacement: predict that a page that was NOT used recently will NOT be used in the near future.

Writes
• Disk writes take millions of cycles, so write-through is impractical; even building a write buffer is impractical here.
• VM therefore uses write-back.

Fast Translation Using a TLB
• Address translation would appear to require extra memory references: one to access the page table itself, then the actual memory access.
• But access to page tables has good locality, so we use a fast cache of recently used translations, called a Translation Look-aside Buffer (TLB). (A lookup sketch appears at the end of these notes.)

Memory Protection
• VM allows different processes to share the same main memory, so we need to protect against errant accesses. This requires OS assistance.
• Hardware support for OS protection:
  • Support two modes: privileged supervisor mode (aka kernel mode), meaning that the OS is running, and user mode.
  • Privileged instructions that only the OS can use, e.g., to write the supervisor bit and the page table pointer.
  • Mechanisms (e.g., special instructions) to switch between supervisor mode and user mode (e.g., syscall in MIPS).
• These features allow the OS to change page tables while preventing a user process from changing them.

§5.12 Concluding Remarks
• Fast memories are small; large memories are slow. We really want fast, large memories, and caching gives this illusion.
• Principle of locality: programs use a small part of their memory space frequently.
• Memory hierarchy: L1 cache, L2 cache, …, DRAM memory, disk.

Exam Questions
• Questions will be in English; you may answer in Swedish or English.
• A dictionary is allowed (Swedish to English, English to Swedish), but an electronic dictionary is NOT allowed.
• Most likely a mixture of problem-oriented questions and subjective questions.
  • Problem-oriented questions: like the questions in the homework.
  • Subjective questions like:
    • What does the control unit do?
    • What is RISC? CISC?
    • Why are fixed-length instructions good?
    • What is the principle of locality?
• Questions from past years are on the course website; you may look at them for ‘some’ inspiration. Note: past years had a different instructor, a different textbook, and different homework, even if there are some common topics.

From the Textbook
Parts from: 5.1, 5.2, 5.3, 5.4
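Finally, the TLB lookup sketch referenced in the “Fast Translation Using a TLB” slide above. This is a hypothetical fully associative TLB with made-up sizes, not a description of any real hardware; on a miss, the page table would be walked as in the translate() sketch earlier and an entry refilled.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Assumed sizes: a 4-entry fully associative TLB and 4 KiB pages. */
#define TLB_ENTRIES 4
#define PAGE_BITS   12

struct tlb_entry { bool valid; uint32_t vpn; uint32_t ppn; };

bool tlb_lookup(const struct tlb_entry *tlb, uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn = vaddr >> PAGE_BITS;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {        /* TLB hit */
            uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
            *paddr = (tlb[i].ppn << PAGE_BITS) | offset;
            return true;
        }
    }
    return false;   /* TLB miss: fall back to the page table walk */
}

int main(void) {
    struct tlb_entry tlb[TLB_ENTRIES] = {{0}};
    tlb[0] = (struct tlb_entry){ .valid = true, .vpn = 3, .ppn = 7 };
    uint32_t pa;
    printf("0x3ABC: %s\n", tlb_lookup(tlb, 0x3ABC, &pa) ? "TLB hit" : "TLB miss");
    printf("0x5000: %s\n", tlb_lookup(tlb, 0x5000, &pa) ? "TLB hit" : "TLB miss");
    return 0;
}
```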