Buffer Caches
Chapter Four
Digital UNIX Internals II

File System I/O Using a Cache
[Figure: user processes, each with its own buffer, reach on-disk data through an in-memory cache in the kernel, using either read/write or mmap.]

Process Reading One Byte
[Figure: a user process calls read(..., 1); the kernel holds a buffer.]

File System Caches and I/O
• Read-ahead
  – When a file system notices that a file is being read sequentially, it can order the physical read of the next block(s) before the application requests them.
• Write-behind
  – Data blocks do not have to be written to disk immediately. File systems can cluster writes to contiguous disk blocks to improve performance.

File System Caches in Digital UNIX
• (Traditional BSD UNIX) Buffer Cache
  – From BSD
  – Fixed pool of physical memory
• Unified Buffer Cache
  – Similar to SunOS and SVR4
  – Flexible pool of physical memory
  – Supports memory mapping

Example: UFS uses both
[Figure: two vnodes. A directory vnode (v_type = VDIR; fields v_object, v_cleanblkhd, v_dirtyblkhd) is linked to buf structures. A regular-file vnode (v_type = VREG; same fields) is linked through v_object to a vm_vp_object: a vm_object whose ob_memq lists vm_page structures, plus the fields vo_vp, vo_cleanpl, vo_cleanwpl, and vo_dirtywpl.]

Traditional Buffer Cache
• Pool of Memory
  – Allocated at boot time
  – Shared with no other subsystem or allocator
• Buffer Structures
  – Links into access hash chain, LRU, and same-vnode lists
  – Device containing buffer
  – Pointer to vnode
  – Logical block in vnode
  – Pointer to routine called when I/O is done
• Linked lists of Buffers
  – Hash chain bucket, LOCKED, LRU, AGE, and EMPTY lists

struct buf
[Figure: fields of struct buf and the structures they link to.]
  Field                     Role / target
  b_flags
  b_forw, b_back            hash list (other buf structures)
  av_forw, av_back          queue (other buf structures)
  b_blockf, b_blockb        vnode buffer list (other buf structures)
  b_bcount
  b_bufsize
  b_dev
  b_error
  b_un                      buffer
  b_lblkno, b_blkno
  b_resid
  b_proc                    proc
  b_hash_chain              head of hash list
  b_iodone()
  b_pagelist                vm_page
  b_vp, b_rvp               vnode
  b_rcred, b_wcred          credentials (ucred)
  b_dirtyoff, b_dirtyend
  driver fields
  b_lock
  b_iocomplete

Buffer Cache Lists
[Figure: buf structures, backed by buffer memory pages, are reached through the bufhash array of bufhd headers and are queued on one of four free lists.]
  bfreelist[0]   LOCKED
  bfreelist[1]   LRU
  bfreelist[2]   AGE
  bfreelist[3]   EMPTY

To Find a Buffer
1. Calculate the hash index using the disk block number (b_blkno) and vnode (b_vp) (see the BUFHASH macro in /sys/include/sys/buf.h).
2. Index into the hash list.
3. Follow the hash pointer to the first buf structure in the queue.
4. Identify the correct buf structure by comparing vnode and block numbers.
5. If there is no match, follow the hash pointer (b_forw) to the next buf structure in the queue.
6. If you reach the end of the list (it wraps back to the beginning) without finding the buf structure, the block is not cached; allocate a new buf from the free list.

Getting a Buffer
[Figure: call diagram involving bread(), getblk(), getnewbuf(), allocbuf(), and VOP_STRATEGY().]

UBC - Unified Buffer Cache (1)
• Motivation
  – File systems and virtual memory (process management) compete for physical memory.
  – UBC unifies previously separate pools of physical memory.
  – Available memory can be used by file systems (UBC) or VM on a first-come, first-served basis.
  – VM can memory map a file using the same memory object as UBC.
• Utilizes memory from the available pool
  – vm_page_queue_free
  – vm_page_array

Unified Buffer Cache (2)
• Uses memory objects of type OT_UBC
  – includes a pointer to a vnode
  – associates cached pages with a specific file
  – accessed by
    • a file system looking for cached data
    • memory management on a page fault for an mmap'd file
• Utilizes lists:
  – vm_page_buckets to find vm_pages belonging to an object
  – ubc_lru to time-order when pages were cached

UBC Memory Object (OT_UBC)
[Figure: struct vm_ubc_object and its embedded vm_object (vu_object).]
  vm_object fields (within vu_object):
    ob_memq                 vm_page
    <lock>
    ob_ops = u_anon_oop     vm_object_ops
    ob_ref_count
    ob_res_count
    ob_size
    ob_resident_pages
    ob_flags
    ob_type
  vm_ubc_object fields:
    vu_object
    vu_ops                  vfs_ubcops
    vu_vfp
    vu_cleanpl
    vu_cleanwpl
    vu_dirtywpl
    vu_wirecnt
    vu_nsequential
    vu_loffset
    vu_stamp
    vu_seglock
    vu_seglist
    vu_pshared
    vu_freelists

UBC LRU Page Queue
• Least recently used list of UBC pages
  – One per memory affinity domain
    • vm_mads[N].md_ubc.ubc_lru
• Each entry is a struct vm_page
  – vm_page -> vm_ubc_object -> vnode
• For each vnode's VM object:
  – clean page list
  – clean wired page list
  – dirty page list
  – dirty wired page list

UBC Routines (1)
  Routine                   Function
  ubc_object_allocate()     Allocates a vm_ubc_object if the vnode is a regular type
                            and one has not already been allocated.
  ubc_object_free()         Frees the vm_ubc_object when the vnode is about to be reused.
  ubc_page_lookup()         Looks up the page at the specified offset in the specified
                            vm_vp_object.
  ubc_incore()              Looks for resident pages in the specified range.
  ubc_page_alloc()          Allocates a page, or returns a page found in the page hash list.
  ubc_page_release()        Releases a page to the UBC LRU list, or to system memory
                            if possible.

UBC Routines (2)
  Routine                   Function
  ubc_lookup()              Performs a hash search lookup on the page at the specified
                            offset. If found, removes the page from the ubc_lru list
                            and holds it.
  ubc_page_dirty()          Transitions a page from the vnode's clean page list to its
                            dirty page list.
  ubc_msync()               Called for mmap to free all clean pages and write all
                            dirty pages.
  ubc_invalidate()          Invalidates some (or all) resident pages for a vnode.
  ubc_flush_dirty()         Starts I/O on all dirty pages for a vnode. Does not wait
                            for I/O completion if the B_ASYNC flag is used.

UBC Routines (3)
  Routine                   Function
  ubc_dirty_kluster()       Creates a list of sorted pages for a vnode. Assumes the
                            pages are scheduled for writing.
  ubc_bufalloc()            Allocates a buf structure.
  ubc_sync_iodone()         Waits for a synchronous I/O transfer to complete, then
                            frees the buf and pages.
  ubc_async_iodone_lwc()    Called as an LWC when an asynchronous I/O transfer completes.

File System and VM Routines
[Figure: layers on the file I/O path. mmap access enters through the page fault handler rather than the system call layer.]
  System call               read(), write()
  VFS                       VOP_READ, VOP_WRITE
  File system               ufs_read(), ufs_write(), uiomove()
  UBC resident page         ufs_getpage() returns a VM page
  management
  I/O

Finding a UBC page from a file system
  VOP_READ(vnode, ...)
    ufs_read(vnode, ...)
      ufs_getpage(vnode, ...)
        ufs_getapage(vnode, ...)
          ubc_lookup(vnode, ...)
            vm_page_lookup(mem_obj, ...)

Limiting UBC
• ubc_dirty_thread
  – Calls ubc_memory_flushdirty
    • Launders excess dirty pages via calls to FSOP_PUTPAGE()
• vm_pageout thread (pageout daemon)
  – Runs vm_pageout_loop()
  – When the number of free pages is low and UBC has borrowed too many pages,
    • UBC pages are reclaimed from ubc_lru
    • If there are no free pages, vm_page_alloc() may also take pages from ubc_lru.

ubc_memory_purge() Flow
1. Start.
2. Get a page from ubc_lru.
3. Referenced bit on?
   – Yes: turn the bit off and move the page to the tail of ubc_lru.
   – No: go to step 4.
4. Dirty?
   – No: free the page.
   – Yes: move the page from the vm_vp_object dirty list to the clean list and write it out asynchronously (VOP_PUTPAGE()).
5. Freed enough?
   – No: return to step 2.
   – Yes: stop.

Limiting the Amount of Dirty Data in UBC
• UBC limits the percentage of its cached data that is modified
  – improves performance by spreading out the I/O load
  – minimizes data loss if the system crashes
• Managed by a separate kernel daemon thread

ubc_dirty_thread_loop() Flow
1. Start.
2. Sleep on timer.
3. Too many dirty pages?
   – No: return to step 2.
   – Yes: go to step 4.
4. Too many dirty pages? (second check)
   – No: return to step 2.
   – Yes: get a page from ubc_lru.
5. Dirty?
   – No: return to step 4.
   – Yes: remove the page from ubc_lru, move it from the vm_vp_object dirty list to the clean list, and write it out asynchronously (VOP_PUTPAGE()).

UBC Parameters and Thresholds (1)
  Field                     Description
  ubc_pages                 Count of UBC pages.
  ubc_minpages              Smallest number of pages UBC will shrink to.
                            ubc_minpages = (vm_managed_pages * ubc_minpercent)/100
                            where ubc_minpercent is tunable (default = 10).
  ubc_maxpages              Upper limit on the size of UBC.
                            ubc_maxpages = (vm_managed_pages * ubc_maxpercent)/100
                            where ubc_maxpercent is tunable (default = 100).
  ubc_lru_count             Number of pages on the UBC LRU queue.
  ubc_dirty_limit           Determines whether UBC should flush and free dirty pages.
                            ubc_dirty_limit = MAX(ubc_min_dirtypages,
                              ((vm_tune_value(ubcdirtypercent) * ubc_pages)/100))
                            where ubcdirtypercent is tunable (default = 10).

UBC Parameters and Thresholds (2)
  Field                     Description
  ubc_dirty_pages           UBC pages currently dirty; tracked by the system.
  ubc_borrowlimit           Number of pages UBC can hold before it is asked to give
                            pages back. If ubc_pages > ubc_borrowlimit, UBC is asked
                            to free pages.
                            ubc_borrowlimit = (ubc_borrowpercent * vm_managed_pages)/100
                            where ubc_borrowpercent is 10 by default.
  vm_perf.vpf_ubchit        Rate of UBC pages transitioning to the tail of the UBC LRU
                            list because pmap_is_referenced returned TRUE.
  vm_perf.vpf_ubcalloc      Rate of UBC page allocation.
  vm_perf.vpf_ubcpagepushes Rate of pages being evicted from the UBC because of
                            memory reclamation activity.
  vm_free_count             Current count of free pages.

Source Reference (1 of 4)
Buf Cache
• kernel/sys/buf.h
  – definition of struct buf
• kernel/vfs/vfs_bio.c
  – bfreelist[], bufhash, and buf routines (bread() etc.)

Source Reference (2 of 4)
UBC
• kernel/vm/vm_page.h
  – definitions of vm_page, vm_page_array
• kernel/vm/vm_resident.c
  – definition of the vm_page_bucket hashing array
• kernel/vfs/vfs_ubc.c
  – definition of the ubc lru list
• kernel/vm/vm_ubc.h
  – definition of vm_ubc_object
• kernel/vfs/vfs_ubc.c
  – implementation of the UBC interface routines

Source Reference (3 of 4)
Reading Data From a UBC Cached UFS File
• kernel/ufs/ufs_vnops.c
  – ufs_read()
  – ufs_getpage()
  – ufs_getapage()
• kernel/vfs/vfs_ubc.c
  – ubc_lookup()
• kernel/vm/vm_resident.c
  – vm_page_lookup()

Source Reference (4 of 4)
Pagefaulting on a UBC MMAPed Page
• kernel/arch/alpha/locore.s
  – XentMM
• kernel/arch/alpha/trap.c
  – trap()
• kernel/vm/vm_fault.c
  – vm_fault()
• kernel/vm/vm_umap.c
  – u_map_fault()
• kernel/vm/u_mape_vp.c
  – u_vp_fault()
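
Worked sketch: buffer lookup
The six steps under "To Find a Buffer" amount to a hash-chain search that falls back to the free list on a miss. The Python model below is only an illustration under stated assumptions: the bucket count, the hash function, and the names Buf, BufferCache, and find_buffer are hypothetical, and the real BUFHASH macro and kernel routines may differ.

```python
from dataclasses import dataclass, field

NBUCKETS = 8  # hypothetical bucket count, chosen only for this sketch


@dataclass
class Buf:
    vp: int      # stands in for b_vp (identity of the vnode)
    blkno: int   # stands in for b_blkno (disk block number)


@dataclass
class BufferCache:
    # One chain per hash bucket; the kernel links bufs with b_forw/b_back.
    buckets: list = field(default_factory=lambda: [[] for _ in range(NBUCKETS)])
    # Unused buffers; stands in for the free lists.
    free: list = field(default_factory=list)

    def bufhash(self, vp, blkno):
        # Hypothetical hash function; not the real BUFHASH macro.
        return (vp + blkno) % NBUCKETS

    def find_buffer(self, vp, blkno):
        """Return (buf, hit); hit is True when the block was already cached."""
        chain = self.buckets[self.bufhash(vp, blkno)]  # steps 1 and 2
        for bp in chain:                               # steps 3 and 5
            if bp.vp == vp and bp.blkno == blkno:      # step 4
                return bp, True
        # Step 6: end of chain reached, so take a buffer from the free list.
        # Sketch-only shortcut: create one if the free list is empty.
        bp = self.free.pop(0) if self.free else Buf(vp, blkno)
        bp.vp, bp.blkno = vp, blkno
        chain.append(bp)
        return bp, False
```

Calling find_buffer(7, 100) twice on a fresh BufferCache misses the first time and hits the second time, returning the same Buf object. A request for (7, 108) lands in the same bucket under this hash but fails the comparison in step 4, so it gets its own buffer.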
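
Worked sketch: ubc_memory_purge() scan
The "ubc_memory_purge() Flow" slide describes a second-chance scan of ubc_lru: referenced pages get their bit cleared and move to the tail, clean pages are freed, and dirty pages are written out asynchronously. The Python model below follows those boxes. It is a sketch, not the kernel routine: Page and purge are hypothetical names, the scan limit exists only so the example terminates, and it assumes that a page being written leaves the LRU while its write is in flight.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Page:
    name: str
    referenced: bool = False
    dirty: bool = False


def purge(lru, target):
    """Scan from the head of lru until `target` pages have been freed.

    Returns (freed, written): names of the pages freed and of the pages
    handed off for an asynchronous write.
    """
    freed, written = [], []
    scans_left = 2 * len(lru)  # sketch-only guard so the loop always ends
    while lru and len(freed) < target and scans_left > 0:
        scans_left -= 1
        page = lru.popleft()           # get ubc_lru page
        if page.referenced:            # referenced bit on?
            page.referenced = False    # turn it off ...
            lru.append(page)           # ... and move the page to the tail
        elif page.dirty:               # dirty?
            page.dirty = False         # dirty list -> clean list
            written.append(page.name)  # stands in for VOP_PUTPAGE()
        else:
            freed.append(page.name)    # free the page
    return freed, written
```

For an LRU holding a referenced page a, a dirty page b, and two clean pages c and d, purge(lru, 2) frees c and d, queues b for writing, and leaves a on the list with its referenced bit cleared.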
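
Worked sketch: UBC thresholds
The formulas on the "UBC Parameters and Thresholds" slides can be checked with a few lines of arithmetic. The Python function below evaluates them with the default percentages given on the slides. It assumes truncating integer division, treats vm_tune_value(ubcdirtypercent) as the plain value of that tunable, and takes ubc_min_dirtypages as an input because the slides do not give its value; the function name ubc_thresholds is hypothetical.

```python
def ubc_thresholds(vm_managed_pages, ubc_pages, ubc_min_dirtypages,
                   ubc_minpercent=10, ubc_maxpercent=100,
                   ubcdirtypercent=10, ubc_borrowpercent=10):
    """Evaluate the slide formulas using integer (truncating) division."""
    borrowlimit = (ubc_borrowpercent * vm_managed_pages) // 100
    return {
        "ubc_minpages": (vm_managed_pages * ubc_minpercent) // 100,
        "ubc_maxpages": (vm_managed_pages * ubc_maxpercent) // 100,
        "ubc_dirty_limit": max(ubc_min_dirtypages,
                               (ubcdirtypercent * ubc_pages) // 100),
        "ubc_borrowlimit": borrowlimit,
        # UBC is asked to free pages when it holds more than the borrow limit.
        "over_borrowlimit": ubc_pages > borrowlimit,
    }
```

With 16384 managed pages, 4000 UBC pages, and a hypothetical ubc_min_dirtypages of 256, the defaults give ubc_minpages = 1638, ubc_maxpages = 16384, ubc_dirty_limit = 400, and ubc_borrowlimit = 1638, so UBC is over its borrow limit.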