Datorarkitektur och operativsystem
Lecture 7

§5.1 Introduction

Memory
It is "impossible" to have memory that is both unlimited (large in capacity) and fast.
We create an illusion for the programmer.
Before that, let us look at the way programs access memory.

Principle of Locality
Programs access only a small proportion of their address space at any time.
Temporal locality: items accessed recently are likely to be accessed again soon. E.g., instructions in a loop.
Spatial locality: items near those accessed recently are likely to be accessed soon. E.g., sequential instruction access, array data.

To Take Advantage of Locality
Employ a memory hierarchy: use multiple levels of memories.
Larger distance from the processor =>
• larger size
• larger access time
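Locality is easy to observe in practice. As an illustration only (the 4096×4096 array size and the clock()-based timing are my own choices, not from the slides), the following C program sums the same matrix twice. The row-major loop walks consecutive addresses and exploits spatial locality, so on most machines it runs noticeably faster than the column-major loop.

```c
#include <stdio.h>
#include <time.h>

#define N 4096
static int a[N][N];                     /* ~64 MB: far larger than the caches */

int main(void) {
    long long sum = 0;
    clock_t t0, t1;

    for (int i = 0; i < N; i++)         /* touch the array so the loops below */
        for (int j = 0; j < N; j++)     /* really read memory                 */
            a[i][j] = i + j;

    t0 = clock();
    for (int i = 0; i < N; i++)         /* row-major: consecutive addresses,  */
        for (int j = 0; j < N; j++)     /* good spatial locality              */
            sum += a[i][j];
    t1 = clock();
    printf("row-major:    %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (int j = 0; j < N; j++)         /* column-major: stride of N ints,    */
        for (int i = 0; i < N; i++)     /* poor spatial locality              */
            sum += a[i][j];
    t1 = clock();
    printf("column-major: %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    return (int)(sum & 1);              /* use sum so the loops are kept */
}
```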
Memory Technology

  Technology      Access time     Cost per GB
  SRAM            0.5 – 2.5 ns    $2000 – $5000
  DRAM            50 – 70 ns      $20 – $75
  Magnetic disk   5 – 20 ms       $0.20 – $2

Ideal memory: the access time of SRAM with the capacity and cost/GB of disk.

Memory Hierarchy
FIGURE 5.1 The basic structure of a memory hierarchy. By implementing the memory system as a hierarchy, the user has the illusion of a memory that is as large as the largest level of the hierarchy, but can be accessed as if it were all built from the fastest memory. Flash memory has replaced disks in many embedded devices, and may lead to a new level in the storage hierarchy for desktop and server computers; see Section 6.4. Copyright © 2009 Elsevier, Inc. All rights reserved.

Taking Advantage of Locality
Store everything on disk.
Copy recently accessed (and nearby) items from disk to the smaller DRAM memory: the main memory.
Copy more recently accessed (and nearby) items from DRAM to the smaller SRAM memory: the cache memory attached to the CPU.

Memory Hierarchy Levels
Block (aka line): the unit of copying. May be multiple words.
If the accessed data is present in the upper level:
Hit: access satisfied by the upper level.
• Hit ratio: hits/accesses
If the accessed data is absent:
Miss: block copied from the lower level.
• Time taken: miss penalty
• Miss ratio: misses/accesses = 1 − hit ratio
Then the accessed data is supplied from the upper level.

This structure, with the appropriate operating mechanisms, allows the processor to have an access time that is determined primarily by level 1 of the hierarchy and yet have a memory as large as level n. Although the local disk is normally the bottom of the hierarchy, some systems use tape or a file server over a local area network as the next levels of the hierarchy.

Cache Memory
The level of the memory hierarchy closest to the CPU.
Given accesses X1, …, Xn−1, Xn:
How do we know if the data is present? Where do we look?

§5.2 The Basics of Caches

Direct Mapped Cache
Location is determined by the address.
Direct mapped: only one choice:
(Block address) modulo (#Blocks in cache)
If #Blocks is a power of 2, the index is simply the low-order bits of the block address.

Tag Bits
Each cache location can store the contents of more than one memory location.
How do we know which particular block is stored in a cache location?
Add a set of tag bits to the cache.
The tag needs only the high-order bits of the address.

Valid Bits
What if there is no data in a location?
Valid bit: 1 = present, 0 = not present.
Initially 0, because when the processor starts up the cache does not contain any valid data.

Cache Example
8 blocks, 1 word/block, direct mapped.
Initial state:

  Index  V  Tag  Data
  000    N
  001    N
  010    N
  011    N
  100    N
  101    N
  110    N
  111    N

Access sequence (after each miss, the indexed block becomes valid and is filled with the new tag and data):

  Word addr  Binary addr  Hit/miss  Cache block
  22         10 110       Miss      110
  26         11 010       Miss      010
  22         10 110       Hit       110
  26         11 010       Hit       010
  16         10 000       Miss      000
  3          00 011       Miss      011
  16         10 000       Hit       000
  18         10 010       Miss      010  (replaces tag 11, Mem[11010])

Cache state after the whole sequence:

  Index  V  Tag  Data
  000    Y  10   Mem[10000]
  001    N
  010    Y  10   Mem[10010]
  011    Y  00   Mem[00011]
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N
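A minimal sketch of this lookup in C, using the 8-block, 1-word-per-block cache of the example above (the function names cache_lookup and cache_fill are invented for illustration; real hardware does the index/tag split and compare with combinational logic, not software):

```c
#include <stdbool.h>
#include <stdint.h>

#define N_BLOCKS 8                     /* must be a power of 2 */

struct cache_line {
    bool     valid;                    /* valid bit: any data here yet?   */
    uint32_t tag;                      /* high-order bits of the address  */
    uint32_t data;                     /* 1 word per block in this sketch */
};

static struct cache_line cache[N_BLOCKS];

/* Returns true on a hit; on a miss the caller must fetch the block
 * from the next level and install it with cache_fill(). */
bool cache_lookup(uint32_t block_addr, uint32_t *data_out) {
    uint32_t index = block_addr % N_BLOCKS;   /* low-order bits  */
    uint32_t tag   = block_addr / N_BLOCKS;   /* high-order bits */

    if (cache[index].valid && cache[index].tag == tag) {
        *data_out = cache[index].data;        /* hit */
        return true;
    }
    return false;                             /* miss */
}

void cache_fill(uint32_t block_addr, uint32_t data) {
    uint32_t index = block_addr % N_BLOCKS;
    cache[index].valid = true;                /* 0 only until first fill */
    cache[index].tag   = block_addr / N_BLOCKS;
    cache[index].data  = data;                /* old block is overwritten */
}
```

On a miss, a caller would fetch Mem[block_addr] from the next level and install it with cache_fill, which is exactly what happens at each Miss row in the trace above (e.g., address 18 overwrites the block that address 26 installed in index 010).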
Cache Misses
On a cache hit, the CPU proceeds normally.
On a cache miss:
Stall the CPU pipeline.
Fetch the block from the next level of the hierarchy.
Instruction cache miss:
• Restart the instruction fetch.
Data cache miss:
• Complete the data access.

Write-Through
On a data-write hit, we could just update the block in the cache.
But then the cache and memory would be inconsistent.
Write-through: also update memory.
But this makes writes take longer.

Write Buffer
Solution: a write buffer.
Holds data waiting to be written to memory.
The CPU continues immediately after writing to the write buffer.
• The write buffer entry is freed later, when the memory write completes.
But the CPU stalls on a write if the write buffer is already full.
The write buffer can become full, and the processor will stall, if:
the rate at which memory completes writes is lower than the rate at which write instructions are generated,
or there is a burst of writes.

Write-Back
Alternative: on a data-write hit, just update the block in the cache.
Keep track of whether each block is dirty.
When a dirty block is replaced, write it back to memory.

§5.3 Measuring and Improving Cache Performance

Measuring Cache Performance
Components of CPU time:
Program execution cycles
• Includes cache hit time
Memory stall cycles
• Mainly from cache misses
With simplifying assumptions:

  Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
                      = (Instructions / Program) × (Misses / Instruction) × Miss penalty

Cache Performance Example
Given:
I-cache miss rate = 2%
D-cache miss rate = 4%
Miss penalty = 100 cycles
Base CPI (ideal cache) = 2
Loads & stores are 36% of instructions
How much faster would a processor with a perfect cache that never misses be?

Miss cycles for all instructions, if I is the instruction count:
I-cache: I × 0.02 × 100 = 2 I
D-cache: I × 0.36 × 0.04 × 100 = 1.44 I
Miss cycles per instruction: 2 + 1.44
So the actual CPI = 2 + 2 + 1.44 = 5.44
The CPU with an ideal cache is 5.44 / 2 = 2.72 times faster.
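The arithmetic can be checked mechanically. A small C program (the variable names are mine, not the textbook's) reproduces the numbers:

```c
#include <stdio.h>

int main(void) {
    double base_cpi     = 2.0;   /* CPI with an ideal cache         */
    double icache_miss  = 0.02;  /* I-cache miss rate               */
    double dcache_miss  = 0.04;  /* D-cache miss rate               */
    double mem_fraction = 0.36;  /* loads & stores per instruction  */
    double penalty      = 100.0; /* miss penalty in cycles          */

    /* Miss cycles per instruction, following the formula above */
    double i_stalls = icache_miss * penalty;                /* = 2.00 */
    double d_stalls = mem_fraction * dcache_miss * penalty; /* = 1.44 */

    double cpi = base_cpi + i_stalls + d_stalls;            /* = 5.44 */
    printf("actual CPI = %.2f\n", cpi);
    printf("perfect cache is %.2f times faster\n", cpi / base_cpi);
    return 0;
}
```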
§5.4 Virtual Memory

Virtual Memory
VM is the technique of using main memory as a "cache" for secondary (disk) storage.
Managed jointly by CPU hardware and the operating system (OS).
Same underlying concept as in caches, but with different terminology.

VM Terms
Virtual address: the address produced by the program.
Physical address: an address in the main memory.
The CPU and OS translate virtual addresses to physical addresses.
A VM "block" is called a page.
A VM translation "miss" is called a page fault.

Page Fault Penalty
On a page fault, the page must be fetched from disk.
This takes millions of clock cycles: main memory latency is around 100,000 times better than disk latency.
Try to minimize the page fault rate:
Smart replacement algorithms, implemented in software in the OS.
• Reading from disk is slow enough that the software overhead is negligible.

Motivation 1
Multiple programs share main memory, and they can change dynamically.
To avoid writing into each other's data, we would like a separate address space for each program.
With VM, each program gets a private virtual address space holding its frequently used code and data.
VM translates each virtual address into a physical address, providing protection from other programs.

Motivation 2
A large program cannot fit into the main memory.
VM automatically maps addresses into disk space if the main memory is not sufficient.

Address Translation
Address translation: the process by which a virtual address is mapped to a physical address.
A virtual address has two components: the virtual page number and the page offset.
The page offset does not change during translation, and its number of bits determines the page size.
The number of pages addressable with a virtual address may be larger than the number of pages addressable with a physical address, which gives the illusion of an unbounded amount of virtual memory.

Translation Using a Page Table
A page table, which resides in memory, is used for address translation.
Each program has its own page table.
To indicate the location of the page table in memory, a hardware register points to the start of the page table.
Note the use of a valid bit.

Page Tables
Store placement information:
An array of page table entries, indexed by virtual page number.
The page table register in the CPU points to the page table in physical memory.
If the page is present in memory, the entry stores the physical page number, plus other status bits.
If the page is not present, a page fault occurs and the OS is given control.
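A minimal sketch of this translation in C, assuming a single-level page table, 4 KiB pages, and an invented entry layout and function name (real hardware performs these steps itself and traps to the OS on a fault):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_BITS 12                     /* 4 KiB pages: offset = low 12 bits */
#define PAGE_SIZE (1u << PAGE_BITS)

struct pte {                             /* one page table entry */
    bool     valid;                      /* page present in main memory?      */
    bool     dirty;                      /* status bit used by write-back     */
    uint32_t ppn;                        /* physical page number              */
};

/* One page table per process; pt_base plays the role of the
 * page table register. */
uint32_t translate(const struct pte *pt_base, uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;        /* virtual page number      */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);   /* unchanged by translation */

    const struct pte *e = &pt_base[vpn];         /* indexed by VPN */
    if (!e->valid) {
        /* Page fault: trap to the OS, which fetches the page from
         * swap space on disk and updates the page table. */
        abort();                                 /* stand-in for the trap */
    }
    return (e->ppn << PAGE_BITS) | offset;       /* physical address */
}
```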
In the next few slides, we recap some OS concepts.

Recap: Process
A process (a program in execution) has a context, defined by the values in its program counter, registers, and page table.
If another process preempts a running process, this context must be saved.

The 5-State Process Model
(Figure: state diagram.) States: New, Ready, Running, Blocked, Terminated.
create: a new job enters New; admit: New → Ready; switch (schedule): Ready → Running; switch (preempt): Running → Ready; exit: Running → Terminated; a Running process that must wait moves to Blocked and returns to Ready when the wait is over.

Role of the OS
Rather than saving the entire page table on a context switch, only the page table register is saved.
To restart a process in the 'running' state, the operating system reloads its context.

The OS is responsible for allocating the physical memory and updating the page tables.
It ensures that the virtual addresses of different processes do not collide, thus providing protection.
Page faults are handled by the OS.

The OS creates a space on the disk (not in main memory) for all pages of a process when the process is created, called the swap space.
The OS also creates a record of where each virtual page of the process is located on the disk.

Mapping Pages to Storage
(Figure: each virtual page maps either to a page in physical memory or to a page in the swap space on disk.)

Page Fault
Handled by the OS.
If all pages in main memory are in use (memory is full), the OS must choose a page to replace.
The replaced page must be written to the swap space on the disk.
To reduce the page fault rate, prefer least-recently used (LRU) replacement:
predict that a page that was NOT used recently will NOT be used in the near future.

Writes
Disk writes take millions of cycles.
Write-through is impractical, because every write would take millions of cycles to reach the disk.
• Even a write buffer is impractical here.
VM uses write-back.

Fast Translation Using a TLB
Address translation would appear to require extra memory references:
One to access the page table itself,
then the actual memory access.
But access to page tables has good locality.
So use a fast cache of recently used translations, called a Translation Look-aside Buffer (TLB).

Memory Protection
VM allows different processes to share the same main memory.
We need to protect against errant accesses.
This requires OS assistance.
Hardware support for OS protection:
Support two modes:
• Privileged supervisor mode (aka kernel mode), meaning that the OS is running
• User mode
Privileged instructions that only the OS can use:
• Allow it to write the supervisor bit and the page table pointer.
Mechanisms (e.g., special instructions) to switch between supervisor mode and user mode (e.g., syscall in MIPS).
These features allow the OS to change page tables while preventing a user process from changing them.

§5.12 Concluding Remarks

Concluding Remarks
Fast memories are small; large memories are slow.
We really want fast, large memories.
Caching gives this illusion.
Principle of locality: programs use a small part of their memory space frequently.
Memory hierarchy: L1 cache, L2 cache, …, DRAM memory, disk.

Exam
Questions will be in English.
You may answer in Swedish or English.
A dictionary is allowed (Swedish to English, English to Swedish).
But an electronic dictionary is NOT allowed.

Exam Questions
Most likely a mixture of problem-oriented questions and subjective questions.
Problem-oriented questions: like the questions in the homework.
Subjective questions, like:
• What does the control unit do?
• What is RISC? CISC?
• Why are fixed-length instructions good?
• What is the principle of locality?

Questions from past years are on the course website.
You may look at them for 'some' inspiration.
Note: past years had a different instructor, a different textbook, and different homework, even if there are some common topics.

From the Textbook
Parts from: 5.1, 5.2, 5.3, 5.4