Virtual Memory and I/O
Mingsheng Hong

I/O Systems
- Major I/O hardware: hard disks, network adapters, ...
- Problems related to I/O systems:
  - Many types of hardware: device drivers give the OS a unified I/O interface
  - Typically much slower than CPU and memory speed: a system bottleneck
  - Too much CPU involvement in I/O operations

Techniques to Improve I/O Performance
- Buffering (e.g., downloading a file from the network)
- DMA
- Caching (CPU cache, TLB, file cache, ...)

Other Techniques to Improve I/O Performance
- Virtual memory page remapping (IO-Lite)
  - Allows (cached) files and memory to be shared by different processes without extra data copies
- Prefetching data (Software Prefetching and Caching for TLBs)
  - Prefetches and caches page table entries

Summary of First Paper
- IO-Lite: A Unified I/O Buffering and Caching System (Pai et al., Best Paper of 3rd OSDI, 1999)
- A unified I/O system
  - Uses immutable data buffers to store all I/O data (only one physical copy)
  - Uses VM page remapping
  - Spans IPC, the file system (disk files, file cache), and the network subsystem

Summary of Second Paper
- Software Prefetching and Caching for Translation Lookaside Buffers (Bala et al., 1994)
- A software approach to help reduce TLB misses
  - Works well for IPC-intensive systems
  - Bigger performance gain expected on future systems

Features of IO-Lite
- Eliminates redundant data copying
  - Saves CPU work & avoids cache pollution
- Eliminates multiple buffering
  - Saves main memory => improves the hit rate of the file cache
- Enables cross-subsystem optimizations
  - e.g., caching the Internet checksum
- Supports application-specific cache replacement policies

Related Work Before IO-Lite
- I/O APIs should preserve copy semantics
- Memory-mapped files
- Copy-on-write
- Fbufs

Key Data Structures
- Immutable buffers and buffer aggregates

Discussion I
- When we pass a buffer aggregate from process A to process B, how do we efficiently do VM page remapping (modify B's page table entries)?
- Possible approach 1: find any empty entry, and modify the VM address contained in the buffer aggregate
  - Very inefficient
- Possible approach 2: reserve the same range of virtual addresses for buffers in the address space of each process
  - Basically limits the total size of buffers: what about dynamically allocated buffers?

Impact of Immutable I/O Buffers
- Copy-on-write optimization: modified values are stored in a new buffer, as opposed to "in-place modification"
- Three situations, depending on how the data object is modified (see the sketch after Discussion II):
  - Completely modified: allocates a new buffer
  - Partially modified (modification localized): chains the unmodified and modified portions of the data
  - Partially modified (modification not localized): compares the cost of writing the entire object with that of chaining, and chooses the cheaper method

Discussion II
- How do we measure the two costs? Heuristics are needed
  - Fragmented vs. clustered data: chained data increases reading cost
  - Similar to the shadow page technique used in System R
- Should the cost of retrieving data from the buffer also be considered?
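To make the aggregate and copy-on-write machinery concrete, here is a minimal C sketch of a buffer aggregate as a chain of <pointer, length> slices, and of a localized partial modification that chains a freshly written buffer between the unmodified head and tail instead of writing in place. All names and the exact layout are illustrative assumptions, not IO-Lite's actual definitions.

    #include <stdlib.h>
    #include <string.h>

    /* One slice of an immutable buffer: a <pointer, length> pair. */
    struct iol_slice {
        const char       *data;   /* points into an immutable buffer */
        size_t            len;
        struct iol_slice *next;
    };

    /* A buffer aggregate: an ordered chain of slices (illustrative). */
    struct iol_agg {
        struct iol_slice *head;
    };

    static struct iol_slice *slice_new(const char *data, size_t len)
    {
        struct iol_slice *s = malloc(sizeof *s);
        s->data = data;
        s->len  = len;
        s->next = NULL;
        return s;
    }

    /*
     * Copy-on-write update of one slice: bytes [off, off+n) are replaced
     * by `mod` (assumes off + n <= s->len).  The underlying buffer is
     * never touched; the slice is split into
     *     [0, off) -> newly written buffer -> [off+n, len).
     */
    static void slice_cow_modify(struct iol_slice *s, size_t off,
                                 const char *mod, size_t n)
    {
        char *fresh = malloc(n);          /* the only newly written memory */
        memcpy(fresh, mod, n);

        struct iol_slice *mid  = slice_new(fresh, n);
        struct iol_slice *tail = slice_new(s->data + off + n,
                                           s->len - off - n);
        tail->next = s->next;
        mid->next  = tail;
        s->len     = off;                 /* head keeps the unmodified prefix */
        s->next    = mid;
    }

A reader of the aggregate now walks three slices where there was one, which is exactly the chaining-versus-rewriting cost trade-off raised in Discussion II.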
What Does IO-Lite Do?
- Reduces extra data copies in
  - IPC
  - the file system (disk files, file cache)
  - the network subsystem
- Makes cross-subsystem optimization possible

IO-Lite and IPC
- Operations on buffers & aggregates
  - When I/O data is transferred:
    - Related aggregates are passed by value
    - Associated buffers are passed by reference
  - When a buffer is deallocated:
    - Buffer is returned to a memory pool
    - Buffer's VM page mappings persist
  - When a buffer is reused (by the same process):
    - No further VM map changes required
    - Write permission is (temporarily) granted to the associated producer process

IO-Lite and the File System
- IO-Lite I/O APIs provided:
  - IOL_read(int fd, IOL_Agg **aggr, size_t size)
  - IOL_write(int fd, IOL_Agg **aggr)
  - IOL_write operations are atomic -- concurrency support
  - I/O functions in the stdio library reimplemented
- File system cache reorganized
  - Buffer aggregates (pointers to data), instead of file data, are stored in the cache
- Copy semantics ensured
  - Suppose a portion of a cached file is read, and then is overwritten

Copy Semantics Illustrations 1-3
[Figures: the file cache's buffer aggregate and a user process's buffer aggregate initially share Buffer 1; after the user process overwrites part of the data, the modified portion lives in a new Buffer 2 while the file cache's aggregate still references Buffer 1.]

More on File Cache Management & VM Paging
- Cache replacement policy (can be customized)
  - The eviction order is by current reference status & time of last file access
  - Evict one entry when the file cache "appears" to be too large
  - Add one entry on every file cache miss
- When a buffer page is paged out, its data is written back to swap space, and possibly to several other disk locations (for different files)

IO-Lite and the Network Subsystem
- Access control and protection for processes
  - ACLs are associated with buffer pools
  - Must determine the ACL of a data object prior to allocating memory for it
  - An early demultiplexing technique determines the ACL for each incoming packet

A Cross-Subsystem Optimization
- Internet checksum caching (see the sketch below)
  - Cache the computed checksum for each slice of a buffer aggregate
  - Increment a version number when a buffer is reallocated -- can be used to check whether the data changed
- Works well for static files
- Also a big benefit for CGI programs that chain dynamic data with static data
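The version-number check can be sketched in a few lines of C. This is an assumed shape for the mechanism the slide describes (cache a checksum per slice, invalidate by version on buffer reallocation); the struct and function names are hypothetical, and inet_cksum below is the standard RFC 1071 one's complement sum, not IO-Lite's code.

    #include <stddef.h>
    #include <stdint.h>

    /* Per-buffer generation counter, bumped whenever the buffer is
     * reallocated to hold new data. */
    struct iol_buf {
        uint32_t version;
        /* ... immutable data pages ... */
    };

    /* Cached Internet checksum for one aggregate slice. */
    struct cksum_entry {
        const struct iol_buf *buf;
        uint32_t version;    /* buffer version when checksum was computed */
        uint16_t cksum;
        int      valid;
    };

    /* Standard 16-bit one's complement Internet checksum (RFC 1071). */
    static uint16_t inet_cksum(const uint8_t *p, size_t len)
    {
        uint32_t sum = 0;
        for (; len > 1; p += 2, len -= 2)
            sum += ((uint32_t)p[0] << 8) | p[1];
        if (len)
            sum += (uint32_t)p[0] << 8;
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

    /* Recompute only if the buffer was reallocated since the cached
     * value was taken; immutability makes the version check sound. */
    static uint16_t slice_cksum(struct cksum_entry *e,
                                const struct iol_buf *buf,
                                const uint8_t *data, size_t len)
    {
        if (e->valid && e->buf == buf && e->version == buf->version)
            return e->cksum;              /* hit: data provably unchanged */
        e->cksum   = inet_cksum(data, len);
        e->buf     = buf;
        e->version = buf->version;
        e->valid   = 1;
        return e->cksum;
    }

Because buffers are immutable, a matching version number proves the cached checksum is still valid, so the server never re-scans unchanged static file data.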
Performance – Competitors
- Flash web server: a high-performance HTTP server
- Flash-Lite: a modified version of Flash using the IO-Lite API
- Apache 1.3.1: representing the widely used web servers of the day

Performance – Static Content Requests
[Results figure]

Performance – CGI Programs
[Results figure]

Performance – Real Workload
- Average request size: 17 KBytes

Performance – WAN Effects
- Memory for buffers = # clients * Tss

Performance – Other Applications
[Results figure]

Conclusion on IO-Lite
- A unified framework for I/O subsystems
- Impressive performance in web applications due to copy avoidance & checksum caching

Software Prefetching & Caching for TLBs
- Prefetching & caching had never been applied to TLB misses in a software approach
- Improves overall performance by up to 3%
- But has great potential on newer architectures
  - Clock speed: 40 MHz => 200 MHz

Issues in Virtual Memory
- User address space is typically huge
- The TLB caches page table entries
- Software support can help reduce TLB misses

Motivations
- TLB misses occur more frequently in microkernel-based OSes
- RISC computers handle TLB misses in software (trap)
- IPCs have a bigger impact on system performance

Approach
- Use a software approach to prefetch and cache TLB entries
- Experiments done on a MIPS R3000-based (RISC) architecture with Mach 3.0
- Applications chosen from standard benchmarks, as well as a synthetic IPC-intensive benchmark

Discussion
- The way the authors motivate their paper:
  - A right approach for a particular type of system
  - A valid argument for future computer systems regarding performance gain
- Figures of experimental results mostly show the reduced number of TLB misses, instead of overall performance improvement
- A synthetic IPC-intensive application to support their approach

Prefetching: What Entries to Prefetch?
- L1U: user address spaces
- L1K: kernel data structures
- L2: user (L1U) page tables
  - Stack segments
  - Code segments
  - Data segments
- L3: L1K and L2 page tables

Prefetching: Details
- On the first IPC call, probe the hardware TLB on the IPC path and enter the related TLB entries into the PTLB
- On subsequent IPC calls, entries are prefetched into the PTLB by a hashed lookup
- Entries are stored in unmapped, cached physical memory

Prefetching: Performance
[Results figures]
- Rate of TLB misses?

Caching: Software Victim Cache
- Use a region of unmapped, cached memory to cache entries evicted from the hardware TLB
- PTE lookup sequence: hardware TLB => STLB => generic trap handler (see the sketch at the end of this section)

Caching: Benefits
- A faster trap path for TLB misses
  - Avoids the overhead of a context switch
- Eliminates (reduces?) cascaded TLB misses

Caching: Performance
[Results figures]
- Average STLB penalties
- Kernel TLB hit rates

Prefetching + Caching: Performance
- Worse than using the PTLB alone! (The authors' comment justifying this is hard to follow...)

Discussion
- The STLB (caching) is better than the PTLB (prefetching), so using it alone suffices
- Is it possible to improve IPC performance using both VM page remapping and software prefetching & caching?
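To make the victim-cache idea concrete, here is a minimal C sketch of a direct-mapped STLB probed on the TLB-miss trap path before falling back to the generic handler. The size, the hash, and the entry layout are assumptions for illustration, not the paper's implementation; the real structure lives in unmapped, cached physical memory so that probing it cannot itself cause a TLB miss.

    #include <stdint.h>

    #define STLB_SLOTS 4096                  /* assumed size; power of two */

    /* One cached translation: tagged virtual page number -> PTE. */
    struct stlb_entry {
        uint32_t vpn_asid;   /* virtual page number + address space id */
        uint32_t pte;        /* hardware page table entry */
        uint32_t valid;
    };

    static struct stlb_entry stlb[STLB_SLOTS];

    static uint32_t stlb_hash(uint32_t vpn_asid)
    {
        return vpn_asid & (STLB_SLOTS - 1);  /* direct-mapped: low bits index */
    }

    /* Fast path, called from the TLB-miss trap handler.  Returns 1 on a
     * hit (the caller writes the PTE back into the hardware TLB), or 0
     * to fall through to the slow generic handler. */
    int stlb_lookup(uint32_t vpn_asid, uint32_t *pte_out)
    {
        struct stlb_entry *e = &stlb[stlb_hash(vpn_asid)];
        if (e->valid && e->vpn_asid == vpn_asid) {
            *pte_out = e->pte;               /* no context switch needed */
            return 1;
        }
        return 0;
    }

    /* Called when the hardware TLB evicts an entry (the "victim"). */
    void stlb_insert(uint32_t vpn_asid, uint32_t pte)
    {
        struct stlb_entry *e = &stlb[stlb_hash(vpn_asid)];
        e->vpn_asid = vpn_asid;
        e->pte      = pte;
        e->valid    = 1;
    }

A hit is resolved in a handful of instructions inside the trap handler, which is the faster trap path the Caching: Benefits slide refers to; a PTLB prefetch table would be filled analogously on the IPC path rather than on eviction.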