Virtual Memory and I/O
Mingsheng Hong

I/O Systems
- Major I/O hardware: hard disks, network adapters, ...
- Problems related to I/O systems:
  - Many types of hardware: device drivers give the OS a unified I/O interface
  - Typically much slower than CPU and memory speed: a system bottleneck
  - Too much CPU involvement in I/O operations

Techniques to Improve I/O Performance
- Buffering (e.g., downloading a file from the network)
- DMA
- Caching (CPU cache, TLB, file cache, ...)

Other Techniques to Improve I/O Performance
- Virtual memory page remapping (IO-Lite)
  - Allows (cached) files and memory to be shared by different processes without extra data copies
- Prefetching data (Software Prefetching and Caching for TLBs)
  - Prefetches and caches page table entries

Summary of First Paper
- IO-Lite: A Unified I/O Buffering and Caching System (Pai et al., Best Paper of 3rd OSDI, 1999)
- A unified I/O system
  - Uses immutable data buffers to store all I/O data (only one physical copy)
  - Uses VM page remapping
  - Spans IPC, the file system (disk files, file cache), and the network subsystem

Summary of Second Paper
- Software Prefetching and Caching for Translation Lookaside Buffers (Bala et al., 1994)
- A software approach to help reduce TLB misses
  - Works well for IPC-intensive systems
  - Bigger performance gain expected on future systems

Features of IO-Lite
- Eliminates redundant data copying
  - Saves CPU work & avoids cache pollution
- Eliminates multiple buffering
  - Saves main memory => improves the hit rate of the file cache
- Enables cross-subsystem optimizations
  - e.g., caching the Internet checksum
- Supports application-specific cache replacement policies

Related Work Before IO-Lite
- I/O APIs should preserve copy semantics
- Memory-mapped files
- Copy-on-write
- Fbufs

Key Data Structures
- Immutable buffers and buffer aggregates

Discussion I
- When we pass a buffer aggregate from process A to process B, how do we efficiently do VM page remapping (modify B's page table entries)?
- Possible approach 1: find any empty entry, and modify the VM address contained in the buffer aggregate
  - Very inefficient
- Possible approach 2: reserve the same range of virtual addresses for buffers in the address space of each process
  - Basically limits the total size of buffers: what about dynamically allocated buffers?

Impact of Immutable I/O Buffers
- Copy-on-write optimization: modified values are stored in a new buffer, as opposed to "in-place modification"
- Three situations, depending on how the data object is modified (see the sketch after Discussion II):
  - Completely modified: allocates a new buffer
  - Partially modified (modification localized): chains the unmodified and modified portions of the data
  - Partially modified (modification not localized): compares the cost of writing the entire object with that of chaining, and chooses the cheaper method

Discussion II
- How do we measure the two costs? Heuristics are needed
  - Fragmented vs. clustered data: chained data increases reading cost
  - Similar to the shadow page technique used in System R
- Should the cost of retrieving data from the buffer also be considered?
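To make the aggregate and copy-on-write machinery concrete, here is a minimal C sketch of a buffer aggregate as a chain of <pointer, length> slices, and of a localized partial modification that chains a freshly written buffer between the unmodified head and tail instead of writing in place. All names and the exact layout are illustrative assumptions, not IO-Lite's actual definitions.

    #include <stdlib.h>
    #include <string.h>

    /* One slice of an immutable buffer: a <pointer, length> pair. */
    struct iol_slice {
        const char       *data;   /* points into an immutable buffer */
        size_t            len;
        struct iol_slice *next;
    };

    /* A buffer aggregate: an ordered chain of slices (illustrative). */
    struct iol_agg {
        struct iol_slice *head;
    };

    static struct iol_slice *slice_new(const char *data, size_t len)
    {
        struct iol_slice *s = malloc(sizeof *s);
        s->data = data;
        s->len  = len;
        s->next = NULL;
        return s;
    }

    /*
     * Copy-on-write update of one slice: bytes [off, off+n) are replaced
     * by `mod` (assumes off + n <= s->len).  The underlying buffer is
     * never touched; the slice is split into
     *     [0, off) -> newly written buffer -> [off+n, len).
     */
    static void slice_cow_modify(struct iol_slice *s, size_t off,
                                 const char *mod, size_t n)
    {
        char *fresh = malloc(n);          /* the only newly written memory */
        memcpy(fresh, mod, n);

        struct iol_slice *mid  = slice_new(fresh, n);
        struct iol_slice *tail = slice_new(s->data + off + n,
                                           s->len - off - n);
        tail->next = s->next;
        mid->next  = tail;
        s->len     = off;                 /* head keeps the unmodified prefix */
        s->next    = mid;
    }

A reader of the aggregate now walks three slices where there was one, which is exactly the chaining-versus-rewriting cost trade-off raised in Discussion II.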
What Does IO-Lite Do?
- Reduces extra data copies in
  - IPC
  - the file system (disk files, file cache)
  - the network subsystem
- Makes cross-subsystem optimization possible

IO-Lite and IPC
- Operations on buffers & aggregates
  - When I/O data is transferred:
    - Related aggregates are passed by value
    - Associated buffers are passed by reference
  - When a buffer is deallocated:
    - Buffer is returned to a memory pool
    - Buffer's VM page mappings persist
  - When a buffer is reused (by the same process):
    - No further VM map changes required
    - Write permission is (temporarily) granted to the associated producer process

IO-Lite and the File System
- IO-Lite I/O APIs provided:
  - IOL_read(int fd, IOL_Agg **aggr, size_t size)
  - IOL_write(int fd, IOL_Agg **aggr)
  - IOL_write operations are atomic -- concurrency support
  - I/O functions in the stdio library reimplemented
- File system cache reorganized
  - Buffer aggregates (pointers to data), instead of file data, are stored in the cache
- Copy semantics ensured
  - Suppose a portion of a cached file is read, and then is overwritten

Copy Semantics Illustrations 1-3
[Figures: the file cache's buffer aggregate and a user process's buffer aggregate initially share Buffer 1; after the user process overwrites part of the data, the modified portion lives in a new Buffer 2 while the file cache's aggregate still references Buffer 1.]

More on File Cache Management & VM Paging
- Cache replacement policy (can be customized)
  - The eviction order is by current reference status & time of last file access
  - Evict one entry when the file cache "appears" to be too large
  - Add one entry on every file cache miss
- When a buffer page is paged out, its data is written back to swap space, and possibly to several other disk locations (for different files)

IO-Lite and the Network Subsystem
- Access control and protection for processes
  - ACLs are associated with buffer pools
  - Must determine the ACL of a data object prior to allocating memory for it
  - An early demultiplexing technique determines the ACL for each incoming packet

A Cross-Subsystem Optimization
- Internet checksum caching (see the sketch below)
  - Cache the computed checksum for each slice of a buffer aggregate
  - Increment a version number when a buffer is reallocated -- can be used to check whether the data changed
- Works well for static files
- Also a big benefit for CGI programs that chain dynamic data with static data
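The version-number check can be sketched in a few lines of C. This is an assumed shape for the mechanism the slide describes (cache a checksum per slice, invalidate by version on buffer reallocation); the struct and function names are hypothetical, and inet_cksum below is the standard RFC 1071 one's complement sum, not IO-Lite's code.

    #include <stddef.h>
    #include <stdint.h>

    /* Per-buffer generation counter, bumped whenever the buffer is
     * reallocated to hold new data. */
    struct iol_buf {
        uint32_t version;
        /* ... immutable data pages ... */
    };

    /* Cached Internet checksum for one aggregate slice. */
    struct cksum_entry {
        const struct iol_buf *buf;
        uint32_t version;    /* buffer version when checksum was computed */
        uint16_t cksum;
        int      valid;
    };

    /* Standard 16-bit one's complement Internet checksum (RFC 1071). */
    static uint16_t inet_cksum(const uint8_t *p, size_t len)
    {
        uint32_t sum = 0;
        for (; len > 1; p += 2, len -= 2)
            sum += ((uint32_t)p[0] << 8) | p[1];
        if (len)
            sum += (uint32_t)p[0] << 8;
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

    /* Recompute only if the buffer was reallocated since the cached
     * value was taken; immutability makes the version check sound. */
    static uint16_t slice_cksum(struct cksum_entry *e,
                                const struct iol_buf *buf,
                                const uint8_t *data, size_t len)
    {
        if (e->valid && e->buf == buf && e->version == buf->version)
            return e->cksum;              /* hit: data provably unchanged */
        e->cksum   = inet_cksum(data, len);
        e->buf     = buf;
        e->version = buf->version;
        e->valid   = 1;
        return e->cksum;
    }

Because buffers are immutable, a matching version number proves the cached checksum is still valid, so the server never re-scans unchanged static file data.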
Performance – Competitors
- Flash web server: a high-performance HTTP server
- Flash-Lite: a modified version of Flash using the IO-Lite API
- Apache 1.3.1: representing the widely used web servers of the day

Performance – Static Content Requests
[Results figure]

Performance – CGI Programs
[Results figure]

Performance – Real Workload
- Average request size: 17 KBytes

Performance – WAN Effects
- Memory for buffers = # clients * Tss

Performance – Other Applications
[Results figure]

Conclusion on IO-Lite
- A unified framework for I/O subsystems
- Impressive performance in web applications due to copy avoidance & checksum caching

Software Prefetching & Caching for TLBs
- Prefetching & caching had never been applied to TLB misses in a software approach
- Improves overall performance by up to 3%
- But has great potential on newer architectures
  - Clock speed: 40 MHz => 200 MHz

Issues in Virtual Memory
- User address space is typically huge
- The TLB caches page table entries
- Software support can help reduce TLB misses

Motivations
- TLB misses occur more frequently in microkernel-based OSes
- RISC computers handle TLB misses in software (trap)
- IPCs have a bigger impact on system performance

Approach
- Use a software approach to prefetch and cache TLB entries
- Experiments done on a MIPS R3000-based (RISC) architecture with Mach 3.0
- Applications chosen from standard benchmarks, as well as a synthetic IPC-intensive benchmark

Discussion
- The way the authors motivate their paper:
  - A right approach for a particular type of system
  - A valid argument for future computer systems regarding performance gain
- Figures of experimental results mostly show the reduced number of TLB misses, instead of overall performance improvement
- A synthetic IPC-intensive application to support their approach

Prefetching: What Entries to Prefetch?
- L1U: user address spaces
- L1K: kernel data structures
- L2: user (L1U) page tables
  - Stack segments
  - Code segments
  - Data segments
- L3: L1K and L2 page tables

Prefetching: Details
- On the first IPC call, probe the hardware TLB on the IPC path and enter the related TLB entries into the PTLB
- On subsequent IPC calls, entries are prefetched into the PTLB by a hashed lookup
- Entries are stored in unmapped, cached physical memory

Prefetching: Performance
[Results figures]
- Rate of TLB misses?

Caching: Software Victim Cache
- Use a region of unmapped, cached memory to cache entries evicted from the hardware TLB
- PTE lookup sequence: hardware TLB => STLB => generic trap handler (see the sketch at the end of this section)

Caching: Benefits
- A faster trap path for TLB misses
  - Avoids the overhead of a context switch
- Eliminates (reduces?) cascaded TLB misses

Caching: Performance
[Results figures]
- Average STLB penalties
- Kernel TLB hit rates

Prefetching + Caching: Performance
- Worse than using the PTLB alone! (The authors' comment justifying this is hard to follow...)

Discussion
- The STLB (caching) is better than the PTLB (prefetching), so using it alone suffices
- Is it possible to improve IPC performance using both VM page remapping and software prefetching & caching?
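To make the victim-cache idea concrete, here is a minimal C sketch of a direct-mapped STLB probed on the TLB-miss trap path before falling back to the generic handler. The size, the hash, and the entry layout are assumptions for illustration, not the paper's implementation; the real structure lives in unmapped, cached physical memory so that probing it cannot itself cause a TLB miss.

    #include <stdint.h>

    #define STLB_SLOTS 4096                  /* assumed size; power of two */

    /* One cached translation: tagged virtual page number -> PTE. */
    struct stlb_entry {
        uint32_t vpn_asid;   /* virtual page number + address space id */
        uint32_t pte;        /* hardware page table entry */
        uint32_t valid;
    };

    static struct stlb_entry stlb[STLB_SLOTS];

    static uint32_t stlb_hash(uint32_t vpn_asid)
    {
        return vpn_asid & (STLB_SLOTS - 1);  /* direct-mapped: low bits index */
    }

    /* Fast path, called from the TLB-miss trap handler.  Returns 1 on a
     * hit (the caller writes the PTE back into the hardware TLB), or 0
     * to fall through to the slow generic handler. */
    int stlb_lookup(uint32_t vpn_asid, uint32_t *pte_out)
    {
        struct stlb_entry *e = &stlb[stlb_hash(vpn_asid)];
        if (e->valid && e->vpn_asid == vpn_asid) {
            *pte_out = e->pte;               /* no context switch needed */
            return 1;
        }
        return 0;
    }

    /* Called when the hardware TLB evicts an entry (the "victim"). */
    void stlb_insert(uint32_t vpn_asid, uint32_t pte)
    {
        struct stlb_entry *e = &stlb[stlb_hash(vpn_asid)];
        e->vpn_asid = vpn_asid;
        e->pte      = pte;
        e->valid    = 1;
    }

A hit is resolved in a handful of instructions inside the trap handler, which is the faster trap path the Caching: Benefits slide refers to; a PTLB prefetch table would be filled analogously on the IPC path rather than on eviction.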