Files and Buffer Manager 9:00 11:00 13:30 15:30 18:00 Aug. 2 Intro & terminology Reliability Fault tolerance Transaction models Reception Aug. 3 Aug. 4 Aug. 5 Aug. 6 TP mons Logging & Files & Structured & ORBs res. Mgr. Buffer Mgr. files Locking Res. Mgr. & COM+ Access paths theory Trans. Mgr. Locking CICS & TP CORBA/ Groupware techniques & Internet EJB + TP Queueing Advanced Replication Performance Trans. Mgr. & TPC Workflow Cyberbricks Party FREE Chapter 15 Abstractions Provided by the File Manager Device independence: The file manager turns the large variety of external storage devices, such as disks (with their different numbers of cylinders, tracks, arms, and read/write heads), ram-disks, tapes, and so on, into simple abstract data types. Allocation independence: The file manager does its own space management for storing the data objects presented by the client. It may store the same objects in more than one place (replication). Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 2 Abstractions Provided by the File Manager Address independence: Whereas objects in main memory are always accessed through their addresses, the file manager provides mechanisms for associative access. Thus, for example, the client can request access to all records with a specified value in some field of the record. Support for associative access comes in many flavors, from simple mechanisms yielding fast retrieval via the primary key up to the expressive power of the SQL select statement. Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 3 External Storage vs. Main Memory Capacity: Main memory is usually limited to a size that is some orders of magnitude smaller than what large databases need. Economics: External storage holds large volumes of data at reasonable cost. Durability: Main memory is volatile. External storage devices such as magnetic or optical disks are inherently durable and therefore are appropriate for storing persistent objects. After a crash, recovery starts with what is found in durable storage. Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 4 External Storage vs. Main Memory Speed: External storage devices are some orders of magnitude slower than main memory. As a result, it is more costly, both in terms of latency and in terms of pathlength, to get data from external storage to the CPU than to load data from main memory. Functionality: Data cannot be processed directly on external storage: they can neither be compared nor modified “out there.” Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 5 The Storage Pyramid current data main memory stale data online external storage Electronic RAM and bulk storage Magnetic / optical disks near line (archive) storage Automated archives (e.g. optical disk jukeboxes, tape robots, etc.) typical capacity Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 6 Interfacing to External Memory: Read-Write Mapping External Storage File A File C File D File B read object w read object x /write object x read object y Main Memory Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 7 Interfacing to External Memory: File Mapping External Storage File A File C File D File B map File C unmap File C Main Memory Jim Gray, Andreas Reuter File C Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 8 Interfacing to External Memory: Single-Level Storage External Storage File A File C File D File B Virtual memory File A File B File C File D Explicit mapping Main Memory Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 9 Locality and Cacheing The movement of data through the pyramid is guided by the principle of locality: Locality of active data: Data that have recently been referenced will very likely be referenced again. Locality of passive data: Data that have not been referenced recently will most likely not be referenced in the future. Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 10 Levels of Abstraction in a File and Database Manager Application Transaction programs sort, join,... setoriented DBMS access Application database access read, write modules tuple oriented access main memory manages Tuple management, associative access logging recovery databaseb uffer mgr. Buffer management media and file manager File management archive manager Archive management block oriented access online external memory nearline external memory Jim Gray, Andreas Reuter manages device oriented access manages Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 11 Operations of the Basic File System STATUS STATUS STATUS STATUS STATUS STATUS STATUS create(filename, allocparmp) delete(filename) open(filename, ACCESSMODE, FILEID); close(FILEID) extend(FILEID, allocparmp) read(FILEID, BLOCKID, BLOCKP) readc(FILEID, BLOCKID, blockcount, BLOCKP) STATUS write(FILEID, BLOCKID, BLOCKP) STATUS writec(FILEID, BLOCKID, blockcount, BLOCKP) Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 12 Mapping Files To Disk File D File E File C Disk 1 File B The File A Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques denote the address mapping between disk and files. WICS August 2 - 6, 1999 13 Issues in Managing Disk Space Initial allocation: When a file is created, how many contiguous slots should be allocated to it? Incremental expansion: If an existing file grows beyond the number of slots currently allocated, how many additional contiguous blocks should be assigned to that file? Reorganization: When and how should the free space on the disk be reorganized? Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 14 Extent-Based Allocation file directory disk-id extent index accum-length primary Disk-A 14 100 A Jim Gray, Andreas Reuter 1.secondary 2.secondary 3.secondary Disk-A Disk-A Disk-B 187 3 214 350 600 850 extent directory Transaction Processing - Concepts and Techniques B WICS August 2 - 6, 1999 15 Buddy Systems Slots (shaded extents are free) 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1 1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1 1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1 0 1 2 3 Buddy types Type Free extents Jim Gray, Andreas Reuter 0 1 2 3 0110 0010 1100 Transaction Processing - Concepts and Techniques 1011 WICS August 2 - 6, 1999 16 Simple Mapping of Relations To Disks File system segments real disks extents relations file system & operating system Jim Gray, Andreas Reuter database system Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 17 A Usual Way of Mapping of Relations To Disks relations logical disks OS-files real disks extents segments tabelspaces operating system Jim Gray, Andreas Reuter database system Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 18 Principles of the Database Buffer process of access module Give me page P bufferfix (P , ...) return frame address F directory 1 5 buffer is accessible from the caller's process (shared memory) 3 determine FILEID and block number Jim Gray, Andreas Reuter 2 find frame in buffer readdirect 4 buffer manager interface buffer storage area Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 19 Design Options for the Buffer Manager Buffer per file: Each file has its own private buffer pool.. Buffer per page size: In systems with different page (and block) sizes, there is usually at least one buffer for each page size. Buffer per file type: There are files like indices, which are accessed in a significantly different way from other files. Therefore, some systems dedicate buffers to files depending on the access pattern and try to manage each of them in a way that is optimal for the respective file organization. Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 20 Logic of the Buffer Manager Search in buffer: Check if the requested page is in the buffer. If found, return the address F of this frame to the caller. Find free frame: If the page is not in the buffer, find a frame that holds no valid page. Determine replacement victim: If no such frame exists, determine a page that can be removed from the buffer (in order to reuse its frame). Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 21 Logic of the Buffer Manager Write modified page: If replacement page has been changed, write it. Establish frame address: Denote the start address of the frame as F. Determine block address: Translate the requested PAGEID P into a FILEID and a block number. Read the block into the frame selected. Return: Return the frame address F to the caller. Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 22 Synchronization in the Buffer process A process B bufferfix (P ,...) bufferfix (P ,...) process A process B change P to P' change P to P" rewrite page give copy to caller rewrite page give copy to caller ? a) Access module in process A requests access to page P ; gets private copy. Jim Gray, Andreas Reuter b) Access module in process B requests access to page P ; gets private copy. c) Both processes try to rewrite an updated version of the page, but these versions are different. Only the version written last will be on disk; this is the "lost update" anomaly. Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 23 What the Buffer Manager Does for Synchronization Sharing: Pages are made addressable to all processes that run the database code. Semaphore protection: Each requestor gets the address of a semaphor protecting the page. Durable storage: The access modules inform the buffer manager if their page access has resulted in an update of the page; the actual write operation, however, is issued by the buffer manager, probably at a time when the update transaction is long gone. Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 24 The Interface to the Buffer Manager typedef struct {PAGEID pageid; /* id of page in file */ PAGEPTR pageaddr; /* base addr. in buffer */ int index; /* record within page */ semaphore * pagesem; /* pointer to the sem. */ Boolean modified; /* caller modif. page */ Boolean invalid; /* destroyed page */ } BUFFER_ACC_CB, *BUFFER_ACC_CBP; /* control block for buffer access */ Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 25 The Need for Fix and Unfix transaction X transaction Y transaction Y bufferfix(Q,...) bufferfix(Q,...) bufferfix(P,...) return base address in buffer page P not in buffer read block containing page P a) Transaction X requests access to page P; gets base address in buffer. Jim Gray, Andreas Reuter return base address in buffer page Q not in buffer; replace P b) Transaction Y requests access to page Q; buffer mgr. decides to replace page P Transaction Processing - Concepts and Techniques read block containing page Q c) Transaction Y gets the base address of Q in the buffer - is same as P's. WICS August 2 - 6, 1999 26 The Fix-Use-Unfix Protocol I FIX: The client requests access to a page using the bufferfix interface. USE: The client uses the page and the pointer to the frame containing the page will remain valid. UNFIX: The client explicitly waives further usage of the frame pointer; that is, it tells the buffer manager that it no longer wants to use that page. Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 27 The Fix-Use-Unfix Protocol II page P use page R fix page P use use unfix page P fix page R page Q use use unfix page R use unfix page Q fix page Q use use Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 28 Structure of the Buffer Manager bufferpool mru_page lru_page hash table p a g e s frame_index buffer control block buffer control block buffer control block buffer control block buffer control block first_bcb next_in_hclass buffer control block frames buffer access control block to and from client index of frame holding the page (address pointer in case of buffer access control block) chain of buffer control blocks in same hash class LRU - chain Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 29 Logging and Recovery from the Buffer Manager's Perspective I Transaction Buffer running running A B A B OK; old state in DB A B A B OK; old state in DB running A B A B database corrupted running A B A B conflicting view on TA committed committed A B A B OK; Read-only TA A B A B DB not in new state committed A B A B committed A B A B database corrupted OK; new state in DB Jim Gray, Andreas Reuter Database Remark Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 30 Logging and Recovery from the Buffer Manager's Perspective II state of transaction TA state of page A in database result of recovery using operation log aborted old wrong tuple might be deleted aborted new inverse operation succeeds committed old operation succeeds committed new duplicate of tuple is inserted Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 31 The Log and Page LSNs page A LSN1 page A LSN3 write to disk write to disk buffer manager log manager log record for page A LSN1 log record for page A LSN2 log record for page A LSN3 log record for page A LSN4 time Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 32 Different Buffer Management Policies Steal policy: When the buffer manager needs space, it can decide to replace dirty pages. No-Steal policy: Pages can be replaced only if they are clean. Force policy: At end of transaction, all modified pages are forced to disk in a series of synchronous write operations. No-Force policy: No modified page is forced during commit. REDO log records are written to the log. Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 33 The Problem of Hotspot Pages durable storage force page A bufferpool update operations TA1 TA2 TA3 TA4 TA5 TA6 TA7 TA8 log The dotted arrows indicate an update of the page by the respective transaction.The arrows at 45 degrees indicate the forced writing of the page during commit processing.The downward arrows indicate the writing of log records for the respective transaction. Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 34 The Basic Checkpoint Algorithm Quiesce: Delay all incoming update DML calls until all fixes with exclusive semaphores have been released. Flush the buffer: Write all modified pages. Log the checkpoint: Write a record to the log, saying that a checkpoint has been generated. Resume normal operation: The bufferfix requests for updates that have been delayed in order to take the checkpoint can now be processed again. Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 35 The Case for Indirect Checkpointing checkpoints c1 c2 c3 1 2 3 4 5 6 7 8 9 10 framenumbers log When taking a checkpoint, the PAGEIDs of the pages currently in buffer are written to the log. Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 36 The Indirect Checkpointing Algorithm Record TOC: Log the list of PAGEIDs. Compare with prev. ckpt: See if any modified pages have not been replaced since last ckpt. Force lazy pages: Schedule the writing of those pages during the next checkpoint interval. Low-water mark: Find the LSN of the oldest still-volatile update; write it to the log. Write “Checkpoint done” record Resume normal operation Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 37 Further Possibilities for Optimization Pre-flushing can be performed by an asynchronous process that scans the buffer for "old" modified pages. Writing is done under semaphore protection. Pre-fetching can, among other things, be used to make restart more efficient. If page reads are logged one can use the recent checkpoint plus the log to prime the bufferpool, i.e. it will look almost exactly like at the moment of the crash. Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 38 Further Possibilities for Optimization Transaction scheduling and buffer management can take hints from the query optimizer: This relation will be scanned sequentially. This is a sequential scan of the leaves of a Btree. This is the traversal of a B-tree, starting at the root. This is a nested-loop join, where the inner relation is scanned in physically sequential order. Jim Gray, Andreas Reuter Transaction Processing - Concepts and Techniques WICS August 2 - 6, 1999 39