Data Storage and Disk Access: the memory hierarchy, hard disks (architecture, processing requests, writing to disk), hard disk reliability and efficiency, RAID, solid state drives, buffer management, and data storage.

A DBMS sits on top of a storage hierarchy: cache, main memory (virtual memory), disk (the file system) and tertiary storage. Primary memory is volatile: main memory and cache. Secondary memory is non-volatile: Solid State Drives (SSD) and magnetic disks (Hard Disk Drive, HDD). Tertiary memory is also non-volatile: CD/DVD and tape (sequential access), usually used as backup or for long-term storage. Moving down the hierarchy, both speed and cost per byte fall.

Speed: main memory is much faster than secondary memory ▪ 10 – 100 nanoseconds (0.00001 to 0.0001 milliseconds) to move data in main memory ▪ 10 milliseconds to read a block from an HDD ▪ 0.1 milliseconds to read a block from an SSD

Cost: main memory is around 100 times more expensive than secondary memory, and SSDs are more expensive than HDDs.

System limitations: on a 64-bit system only 2^64 bytes can be directly referenced ▪ Many databases are larger than that

Volatility: data must be maintained between program executions, which requires non-volatile memory ▪ Non-volatile storage retains its contents when the device is turned off, or if there is a power failure. Main memory is volatile; secondary storage is not.

Database data is usually stored on disks. A database will often be too large to be retained in main memory, so when a query is processed data will need to be retrieved from storage. Data is stored in disk blocks, also referred to simply as blocks or, in relation to the OS, pages. A block is a contiguous sequence of bytes and is the unit in which data is written to, and read from, disk. Block size is typically between 4 and 16 kilobytes.

A hard disk consists of a number of platters. A platter can store data on one or both of its surfaces, and so is referred to as single-sided or double-sided. Surfaces are composed of concentric rings called tracks, and the set of all tracks with the same diameter is called a cylinder. Sectors are arcs of a track and are typically 4 kilobytes in size. Block size is set when the disk is initialized, usually as a small multiple of the sector size (hence 4 to 16 kilobytes). The drive statistics used in the examples that follow are for a Western Digital Caviar Black 1 TB hard drive.

Data is transferred to or from a surface by a disk head. There is one disk head for each surface, and the heads are moved as a unit (called a disk head array), so all the heads are in identical positions with respect to their surfaces. To read or write a block a disk head must be positioned over it, and only one disk head can read or write at a time. The platters spin constantly, at around 7,200 rpm, while the head array moves in and out across the tracks.

Disk drives are controlled by a processor called a disk controller, which controls the actuator that moves the head assembly, selects sectors and determines when the disk has rotated to a sector, and transfers data between the disk and main memory. Some controllers buffer data from whole tracks in the expectation that the data will be required.

Processing a request: the disk constantly spins (7,200 rpm for the Caviar Black); the head pivots over the desired track; the desired block is read as it passes underneath the head. Access time therefore has three parts:
The disk head is moved in or out to the track. This seek time is typically 10 milliseconds ▪ WD Caviar Black 1 TB: 8.9 ms
Wait until the block rotates under the disk head. This rotational delay is typically 4 milliseconds ▪ WD Caviar Black 1 TB: 4.2 ms
The data on the block is transferred to memory. This transfer time is the time it takes for the block to completely rotate past the disk head, and is typically less than 1 millisecond.
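Putting the three components together, here is a minimal sketch of the access-time arithmetic. The seek time and rpm are the WD Caviar Black figures quoted above; the number of blocks per track is an assumption made purely for illustration.

```python
# A minimal sketch of the access-time arithmetic described above.
# seek_ms and rpm are the quoted WD Caviar Black 1 TB figures;
# blocks_per_track = 64 is an assumption made purely for illustration.

def rotational_delay_ms(rpm):
    """On average the disk must make half a revolution before the block arrives."""
    return 0.5 * 60_000 / rpm

def transfer_time_ms(rpm, blocks_per_track, blocks=1):
    """Time for `blocks` blocks to rotate past the head."""
    return blocks * (60_000 / rpm) / blocks_per_track

def access_time_ms(seek_ms, rpm, blocks_per_track, blocks=1):
    """Access time = seek time + rotational delay + transfer time."""
    return seek_ms + rotational_delay_ms(rpm) + transfer_time_ms(rpm, blocks_per_track, blocks)

seek_ms, rpm, blocks_per_track = 8.9, 7_200, 64
print(f"rotational delay: {rotational_delay_ms(rpm):.1f} ms")                        # ~4.2 ms
print(f"one block:        {access_time_ms(seek_ms, rpm, blocks_per_track):.1f} ms")  # ~13 ms
print(f"whole track:      {seek_ms + rotational_delay_ms(rpm) + 60_000 / rpm:.1f} ms")
```

With these figures a single block costs roughly 13 ms, which is in line with the typical access time quoted below.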
The seek time and rotational delay depend on where the disk head is before the request, which track is being requested, and how far the disk has to rotate. The transfer time depends on the request size ▪ The transfer time (in ms) for one block equals (60,000 / disk rpm) / blocks per track ▪ The transfer time (in ms) for an entire track equals 60,000 / disk rpm

Typical access time for a block on a hard disk: 15 milliseconds. Typical access time for a main memory frame: 60 nanoseconds. What's the difference? 1 millisecond is 1,000,000 nanoseconds, so 60 ns = 0.000060 ms: accessing a hard drive is around 250,000 times slower than accessing main memory.

Disk latency (access time) has three components: seek time + rotational delay + transfer time. The overall access time can be shortened by reducing, or even eliminating, seek time and rotational delay, which is why related data should be stored in close proximity. Accessing two records in adjacent blocks on a track ▪ Seek the desired track, rotate to the first block, and transfer two blocks = 10 + 4 + 2*1 = 16 ms. Accessing two records on different tracks ▪ Seek the desired track, rotate to the block, and transfer the block, then repeat = (10 + 4 + 1)*2 = 30 ms.

What does it mean to say that related data should be stored close to each other? The term close refers not to physical proximity but to how the access time is affected. In order of closeness: same block; adjacent blocks on the same track; same track; same cylinder, but different surfaces; adjacent cylinders; and so on. For example, take a block 1, a block 2 on the adjacent track, and a block 3 on a different surface of the same cylinder as 1. Block 2 is clearly physically closer to 1, but the disk head must be moved to access it; block 3 is in the same cylinder as 1, so the disk head does not have to be moved, which is why 3 is "closer".

A fair scheduling algorithm would take a first-come, first-served (FIFO) approach: insert requests in a queue and process them in the order in which they are received. Suppose requests arrive for cylinders 2,000, 6,000 and 14,000 at time 0, for cylinder 4,000 at time 10, for 16,000 at time 20 and for 10,000 at time 30, with the heads starting at cylinder 0 (times in ms, movement in cylinders):

Cylinder   Received   Complete   Moved     Total moved
 2,000         0          5        2,000        2,000
 6,000         0         14        4,000        6,000
14,000         0         27        8,000       14,000
 4,000        10         43       10,000       24,000
16,000        20         60       12,000       36,000
10,000        30         72        6,000       42,000

The elevator algorithm usually performs better than FIFO. Requests are buffered and the disk head moves in one direction, processing requests as it reaches their cylinders; the arm then reverses direction. For the same requests:

Cylinder   Received   Complete   Moved     Total moved
 2,000         0          5        2,000        2,000
 6,000         0         14        4,000        6,000
14,000         0         27        8,000       14,000
16,000        20         35        2,000       16,000
10,000        30         46        6,000       22,000
 4,000        30         58        6,000       28,000

The elevator algorithm gives much better performance than FIFO on average, and is a relatively fair algorithm, but it is not optimal. The shortest-seek-first algorithm is closer to optimal but can result in a high variance in response time ▪ And may even result in starvation for distant requests. In some cases the elevator algorithm can perform worse than FIFO.

To modify an existing record on disk the following steps must be taken: read the record, modify the record in main memory, and write the modified record back to disk. It is important to remember that the smallest unit of transfer to or from a disk is a block, and a single disk block usually contains many records. For example, to change the record Landis#winner#Phonak#... to Landis#disq.#none#..., the whole block containing that record (and many other records) is read into main memory, the one record is modified there, and the whole block is written back to disk.

Now consider creating a new record. The user enters the data for the record through some application interface, the record is created in main memory, and it is then written to disk. Does this process require a read-modify-write cycle? Yes, because otherwise the existing contents of the disk block would be overwritten.

Disks can fail in several ways. Intermittent failure: multiple attempts are required to read or write a sector. Media decay: a bit or a number of bits are permanently corrupted and it is impossible to read a sector. Write failure: a sector cannot be written to or retrieved ▪ Often caused by a power failure during a write. Disk crash: the entire disk becomes unreadable.

An intermittent failure may result in incorrect data being read by the disk controller. Such incorrect data can be detected by a checksum: each sector contains additional bits whose values are based on the data bits in the sector. A simple single-bit checksum is to maintain an even parity on the sector ▪ If there is an odd number of 1s the parity is odd ▪ If there is an even number of 1s the parity is even. Assume that there are seven data bits and a single checksum bit. For the data bits 0111011 the parity is odd ▪ The checksum bit is set to 1 so that the overall parity is even. A single checksum bit allows errors of only one bit to be detected reliably. Several checksum bits can be maintained to reduce the chance of failing to notice an error, e.g. maintain 8 checksum bits, one for each bit position in the data bytes. Checksums can detect errors but can't correct them.

Stable storage can be implemented on a disk to allow errors to be corrected. Sectors are paired, with each pair representing a single sector; the two copies are usually referred to as Left and Right ▪ Errors in a sector (L or R) are detected using checksums. Stable storage can cope with media failures and write failures. For writing: write the value of some sector X into XL; check that the value is correct (using checksums); if the value is not correct after a given number of attempts then assume that the sector has failed ▪ A spare sector should be substituted for XL; then repeat the process for XR. For reading, XL and XR are read in turn until a correct value is returned.
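A minimal sketch of the parity checksums just described, using tiny seven-bit sectors purely for illustration:

```python
# A minimal sketch of parity checksums: a single even-parity bit per sector,
# plus the stronger variant of one parity bit per bit position of the data bytes.
# The sector contents are tiny and purely illustrative.

def parity_bit(bits):
    """Checksum bit chosen so that the overall number of 1s (data + checksum) is even."""
    return sum(bits) % 2

def bytewise_parity(data):
    """Eight checksum bits, one for each bit position in the data bytes."""
    return [parity_bit([(byte >> pos) & 1 for byte in data]) for pos in range(8)]

sector = [0, 1, 1, 1, 0, 1, 1]      # the 0111011 example: odd parity
stored_check = parity_bit(sector)   # so the checksum bit is 1
assert stored_check == 1

# An intermittent failure flips one bit; the recomputed parity no longer matches.
corrupted = sector.copy()
corrupted[2] ^= 1
print("error detected:", parity_bit(corrupted) != stored_check)   # True

# Flipping a second bit goes unnoticed: checksums detect errors, they do not correct them.
corrupted[5] ^= 1
print("error detected:", parity_bit(corrupted) != stored_check)   # False

print("per-position parity bits:", bytewise_parity(b"disk sector data"))
```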
Hard disks act as bottlenecks for processing: DB data is stored on disks and must be fetched into main memory to be processed, and disk access is considerably slower than main memory processing. There are also reliability issues with disks, which contain mechanical components that are more prone to failure than electronic components. One solution is to use multiple disks. With multiple disks, each containing multiple platters, the disks can be read in parallel and different disks can read from different cylinders: the first disk can access data from cylinder 6,000 while the second disk is accessing data from cylinder 11,000. With a single disk, even one with multiple platters, the disk heads are always over the same cylinder.

Using multiple disks to store data improves efficiency because the disks can be read in parallel. To satisfy a request the physical disks and disk blocks that the data resides on must be identified; the data may be on a single disk, or it may be split over multiple disks. The way in which data is distributed over the disks affects the cost of accessing it, in the same way that related data should be stored close to each other on a single disk.

A disk array gives the user the abstraction of a single, large disk. When an I/O request is issued the physical disk blocks to be retrieved have to be identified, and how the data is distributed over the disks in the array affects how many disks are involved in an I/O request. Data is divided into partitions called striping units; the striping unit is usually either a block or a bit. Striping units are distributed over the disks using a round-robin algorithm. For example, if a notional file is divided into striping units 1, 2, 3, 4, ... and distributed over four disks, disk 1 holds units 1, 5, 9, 13, ..., disk 2 holds units 2, 6, 10, 14, ..., disk 3 holds units 3, 7, 11, 15, ..., and disk 4 holds units 4, 8, 12, 16, ...

The size of the striping unit has an impact on the behaviour of the system. Assume that a file is to be distributed across a four-disk RAID system and that, purely for the sake of illustration, the block size is just one byte (eight bits); number the individual bits of the file 1, 2, 3, ... With BLOCK striping the blocks are the striping units, so disk 1 holds bits 1–8, 33–40, 65–72, ...; disk 2 holds bits 9–16, 41–48, 73–80, ...; disk 3 holds bits 17–24, 49–56, 81–88, ...; and disk 4 holds bits 25–32, 57–64, 89–96, ... With BIT striping the individual bits are the striping units, so disk 1 holds bits 1, 5, 9, 13, ...; disk 2 holds bits 2, 6, 10, 14, ...; disk 3 holds bits 3, 7, 11, 15, ...; and disk 4 holds bits 4, 8, 12, 16, ..., so each one-byte block is spread across all four disks.

Assume that a disk array consists of D disks and that data is distributed across the disks using data striping. How does it perform compared to a single disk?
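As a small illustration of the round-robin placement just described, here is a minimal sketch that maps a striping unit number to a disk and a position on that disk (disks and units are numbered from 1, matching the four-disk examples above):

```python
# A minimal sketch of round-robin striping: unit i goes to disk ((i-1) mod D) + 1
# and becomes the ((i-1) // D + 1)-th unit stored on that disk.

def place(unit, num_disks):
    disk = (unit - 1) % num_disks + 1
    position = (unit - 1) // num_disks + 1
    return disk, position

D = 4
for unit in range(1, 13):
    disk, position = place(unit, D)
    print(f"striping unit {unit:2d} -> disk {disk}, position {position}")

# With block striping a single block lives on one disk, so one random read touches
# one disk; with bit striping every block is spread over all D disks, so even a
# single-block read involves every disk in the array.
```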
To answer this question we must specify the kinds of requests that will be made ▪ Random read – reading multiple, unrelated records ▪ Random write ▪ Sequential read – reading a number of records (such as one file or table), stored on more than D blocks ▪ Sequential write Use all D disks to improve efficiency, and distribute data using block striping Random read performance Very good – up to D different records can be read at once ▪ Depending on which disks the records reside on Random write performance – same as read performance Sequential read performance Very good – as related data are distributed over all D disks performance is D times faster than a single disk Sequential write performance – same as read performance But what about reliability … Hard disks contain mechanical components and are less reliable than other, purely electronic, components Increasing the number of hard disks decreases reliability, reducing the mean-time-to-failure (MTTF) ▪ The MTTF of a hard disk is 50,000 hours, or 5.7 years In a disk array the overall MTTF decreases Because the number of disks is greater MTTF of a 100 disk array is 21 days – (50,000/100) / 24 ▪ This assumes that failures occur independently and ▪ The failure probability does not change over time Reliability is improved by storing redundant data Reliability of a disk array can be improved by storing redundant data If a disk fails the redundant data can be used to reconstruct the data lost on the failed disk The data can either be stored on a separate check disk or Distributed uniformly over all the disks Redundant data is typically stored using one of two methods Mirroring, where each disk is duplicated A parity scheme, where sufficient redundant data is maintained to recreate the data in any one disk Other redundancy schemes provide greater reliability For each bit on the data disks there is a parity bit on a check disk If the sum of the data disks bits is even the parity bit is set to zero If the sum of the bits is odd the parity bit is set to one The data on any one failed disk can be recreated bit by bit 0 1 1 0 1 1 1 1 0 0 1 1 0 0 1 0 1 1 0 1 1 0 0 1 … 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 1 0 0 1 0 1 0 0 … 0 0 0 1 1 1 0 1 0 0 1 1 0 0 0 1 1 0 1 1 1 0 0 1 … 0 1 1 0 0 0 1 0 0 1 0 1 0 0 1 0 0 1 0 1 0 0 1 1 … 0 0 1 0 1 0 0 1 1 1 … 4 data disk system showing individual bit values 1 0 1 1 1 0 1 0 1 1 1 5th check disk containing parity data 1 1 0 Reading The parity scheme does not affect reading Writing A naïve approach would be to calculate the new value of the parity bit from all the data disks A better approach is to compare the old and new values of the disk that is written to ▪ And change the value of a parity bit if the corresponding bits have changed A RAID system consists of several disks organized to increase performance and improve reliability Performance is improved through data striping Reliability is improved through redundancy RAID stands for Redundant Arrays of Independent Disks There are several RAID schemes or levels The levels differ in terms of their ▪ Read and write performance, ▪ Reliability, and ▪ Cost All D disks are used to improve efficiency, and data is distributed using block striping No redundant information is kept Read and write performance is very good But, reliability is poor Unless data is regularly backed up a RAID 0 system should only be used when the data is not important A RAID 0 system is the cheapest of all RAID levels As there are no disks used for storing redundant data An identical copy is kept of each disk in 
the system, hence the term mirroring Read performance is similar to a single disk No data striping, but parallel reads of the duplicate disks can be made which improves random read performance Write performance is worse than a single disk as the duplicate disk has to be written to Writes to the original and mirror should not be performed simultaneously in case there is a global system failure But write performance is superior to most other RAID levels Very reliable but costly With D data disks, a level 1 RAID system has 2D disks Sometimes referred to as RAID level 10, combines both striping and mirroring Very good read performance Similar to RAID level 0 2D times the speed of a single disk for sequential reads Up to 2D times the speed of a single disk for random reads Allows parallel reads of blocks that, conceptually, reside on the same disk Poor write performance Similar to RAID level 1 Very reliable but the most expensive RAID level Writing data is the Achilles heel of RAID systems Data and check disks should not be written to simultaneously Parity information may have to be read before check disks can be written to In many RAID systems writing is less efficient than with a single disk! Sequential writes, or random writes in a RAID system using bit striping: Write to all D data disks, using a read-modify-write cycle Calculate the parity information from the written data Write to the check disk(s) ▪ A read-modify-write cycle is not required Random writes in a system using block striping: Write to the data disk using a read-modify-write cycle Read the check disk(s), and calculate the new parity data Write to the check disk(s) A RAID system with D disks can read data up to D times faster than a single disk system For sequential reads there is no performance difference between bit striping and block striping Block striping is more efficient for random reads With bit striping all D disks have to be read to recreate a single record (and block) of the data file With block striping, a complete record is stored on one disk, so only one disk is required to satisfy a single random read Write performance is similar except that it is also affected by the parity scheme Level 2 does not use the standard parity scheme Uses a scheme that allows the failed disk to be identified increasing the number of disks required However the failed disk can be detected by the disk controller so this is unnecessary Can tolerate the loss of a single disk Level 3 is Bit Interleaved Parity The striping unit is a single bit Random read and write performance is poor as all disks have to be accessed for each request Can tolerate the loss of a single disk Uses block striping to distribute data over disks Uses one redundant disk containing parity data The ith block on the redundant disk contains parity checks for the ith blocks of all data disks Good sequential read performance D times single disk speed Very good random read performance Disks can be read independently, up to D times single disk speed When data is written the affected block and the redundant disk must both be written to To calculate the new value of the redundant disk Read the old value of the changed block Read the corresponding redundant disk block Write the new data block Recalculate the block of the redundant disk To recalculate the redundant data consider the changes in the bit pattern of the written data block Cost is moderate Only one check disk is required The system can tolerate the loss of one drive Write performance is poor for random writes Where 
different data disks are written independently. For each such write a write to the redundant disk is also required. Performance can be improved by distributing the redundant data across all disks – RAID level 5.

The dedicated check disk in RAID level 4 tends to act as a bottleneck for random writes. RAID level 5 does not have a dedicated check disk but distributes the parity data across all disks. This removes the bottleneck, increasing the performance of random writes; sequential write performance is similar to level 4. Cost is moderate, with the same effective space utilization as level 4, and the system can tolerate the loss of one drive.

RAID levels 4 and 5 can only cope with single disk crashes, so if multiple disks crash at the same time (or before a failed disk can be replaced) data will be lost. RAID level 6 allows systems to deal with multiple disk crashes. These systems use more sophisticated error-correcting codes; one of the simpler error-correcting codes is the Hamming code.

Consider a system with seven disks, identified by the numbers 1 to 7. Four of the disks are data disks (disks 1 to 4) and three are redundant disks (disks 5 to 7). Each of the three check disks contains parity data for three of the four data disks: disk 5 for disks 1, 2 and 3; disk 6 for disks 1, 2 and 4; disk 7 for disks 1, 3 and 4. Three example rows of bit values:

Disk:        1  2  3  4  |  5 (1,2,3)   6 (1,2,4)   7 (1,3,4)
             1  1  1  0  |      1           0           0
             1  1  0  1  |      0           1           0
             1  0  1  1  |      0           0           1

Reads are performed as normal; only the data disks are used. Writes are performed in a similar way to RAID level 4, except that multiple redundant disks may be involved. Cost is high, as more check disks are required.

If one disk fails, use the parity data to restore the failed disk, as in level 4. If two disks fail then both disks can be rebuilt using three of the other disks, e.g. If disks 1 and 2 fail ▪ Rebuild disk 1 using disks 3, 4 and 7 ▪ Rebuild disk 2 using disks 1, 3 and 5. If disks 3 and 5 fail ▪ Rebuild disk 3 using disks 1, 4 and 7 ▪ Rebuild disk 5 using disks 1, 2 and 3.

In real-life RAID systems the disk array is partitioned into reliability groups. A reliability group consists of a set of data disks and a set of check disks, and the number of check disks depends on the reliability level that is selected. Consider a RAID system with 100 data disks and 10 check disks, i.e. 10 reliability groups: the MTTF is increased from 21 days to 250 years!
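Before comparing the levels, the rebuild step for a lost disk in a single-parity group (RAID levels 4 and 5, or any single failure at level 6) can be sketched with bitwise XOR. The block contents below are made up purely for illustration; a real system applies the same operation block by block across the whole disks.

```python
# A minimal sketch of single-parity recovery in a reliability group: the parity
# block is the XOR of the data blocks, so any one lost block can be rebuilt by
# XOR-ing the survivors.
from functools import reduce

def xor_blocks(blocks):
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"\x6d\x0b", b"\x51\x3c", b"\x0e\xc4", b"\x62\x01"]   # four data disks
parity = xor_blocks(data)                                      # the check disk

# Disk 3 (index 2) fails: rebuild it from the surviving data disks plus the parity disk.
failed = 2
survivors = [block for i, block in enumerate(data) if i != failed] + [parity]
assert xor_blocks(survivors) == data[failed]

# Updating parity on a random write: XOR out the old value and XOR in the new one,
# rather than re-reading every data disk.
old, new = data[0], b"\xff\x00"
parity = xor_blocks([parity, old, new])
data[0] = new
assert xor_blocks(data) == parity
```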
Level 0 improves performance at the lowest cost but does not improve reliability. Level 1+0 is better than level 1 and has the best write performance. Levels 2 and 4 are always inferior to 3 and 5. Level 3 is good for large transfer requests of several contiguous blocks but bad for many small requests of a single disk block. Level 5 is a good general-purpose solution, and level 6 is appropriate if higher reliability is required. In practice the choice is usually between 0, 1, and 5.

The table below compares RAID levels using RAID level 0 as a baseline. Comparisons of RAID systems vary depending on the metric used and how it is measured; the three primary metrics are reliability, performance and cost, which can be measured in I/Os per second, bytes per second, response time and so on. This comparison uses throughput per dollar for systems of equivalent file capacity, where file capacity is the amount of information that can be stored on the system, excluding redundant data. G refers to the number of disks in a reliability group (both data disks and check disks); RAID levels 10 and 2 are not shown.

Level     Random Read   Random Write    Sequential Read   Sequential Write   Storage Efficiency
RAID 0        1             1                 1                 1                  1
RAID 1        1             ½                 1                 ½                  ½
RAID 3        1/G           1/G               (G-1)/G           (G-1)/G            (G-1)/G
RAID 4        (G-1)/G       max(1/G, ¼)       (G-1)/G           (G-1)/G            (G-1)/G
RAID 5        1             max(1/G, ¼)       1                 (G-1)/G            (G-1)/G
RAID 6        1             max(1/G, 1/6)     1                 (G-2)/G            (G-2)/G

Solid State Drives (SSDs) use NAND flash memory and do not contain moving parts like an HDD. Accessing an SSD does not require seek time or rotational latency, so SSDs are considerably faster. Flash memory is non-volatile memory that is used by smart-phones, mp3 players and thumb (or USB) drives. NAND flash architecture is similar to a NAND (negated AND) logic gate, hence the name, and is only able to read and write data one page at a time.

There are two types of SSD: multi-level cell (MLC) and single-level cell (SLC). MLC cells can store multiple different charge levels and therefore more than one bit ▪ With four charge levels a cell can store 2 bits. Multiple threshold voltages make reading more complex but allow more data to be stored per cell. MLC SSDs are cheaper than SLC SSDs, but their write performance is worse and their lifetimes are shorter. SLC cells can only store a single charge level; they are therefore on or off, and can contain only one bit. SLC drives are less complex, more reliable and have a lower error rate, and they are faster since it is easier to read or write a single charge value. SLC drives are more expensive and are typically used for enterprise rather than home use.

SSD reads are much faster than HDD reads since there are no moving parts. Writes are also faster than on an HDD; however, flash memory must be erased before it is written, and entire blocks must be erased ▪ Referred to as write amplification. The performance increase is greatest for random reads.

A DBMS is built in layers: query evaluation, the transaction and lock manager, the recovery manager, the file and access code, the buffer manager and the disk space manager, sitting above the database itself. When an SQL command is evaluated a request may be made for a DB record. Such a request is passed to the buffer manager; if the record is not stored in the (main memory) buffer the page must be fetched from disk. The disk space manager provides routines for allocating, de-allocating, reading and writing pages.

The disk space manager (DSM) keeps track of available disk space. It is the lowest level of the DBMS architecture and supports the allocation and de-allocation of disk pages. Pages are abstract units of storage, mapped to disk blocks.
Reading and writing to a page is performed in one disk I/O Sequences of pages are allocated to a contiguous sequence of blocks to increase access speed The DSM hides the underlying details of storage Allowing higher level processes to consider the data to be a collection of pages A DB increases and decreases in size over time In addition to mapping pages to blocks the DSM has to record which disk blocks are in use As time goes on, gaps in sequences of allocated blocks appear Free blocks need to be recorded so that they can be allocated in the future, using either A linked list, the head points to the first free block A bitmap, each bit corresponds to a single block ▪ Allows for fast identification, and therefore allocation, of contiguous areas of free space An OS is required to manage space on a disk Typically an OS abstracts a file as a sequence of bytes While possible to build a DSM with the OS many DBMS perform their own disk management This makes the DBMS more portable across platforms Using the OS may impose technical limitations such as maximum file size In addition, OS files cannot typically be stored on separate disks, which may be necessary in a DBMS Attributes, or fields, must be organized within records Information that is common to all records of a particular type is stored in the system catalog ▪ Including the number and type of fields Records of a single table can vary from each other In addition to differences in data (obviously) Different records may contain different number of fields, or Fields of varying length INTEGER, represented by two or four bytes FLOAT, represented by four or eight bytes CHAR(n), fixed length character strings of n bytes Unused characters are occupied with a pad character ▪ e.g. if a CHAR(5) stored "elm" it would be stored as elm VARCHAR(n), character strings of varying lengths Stored as arrays of n+1 bytes ▪ i.e. 
even though a VARCHAR's contents can vary, n+1 bytes are dedicated to them The length of a VARCHAR is stored in the first byte, or Its end is specified by a null character Fields are a fixed length, and the number of fields is fixed Fields may then be stored consecutively And given the address of a record, the address of a particular field can be found ▪ By referring to the field size in the system catalog It is common to begin all fields at a multiple of 4 or 8 bytes Consider an employee record: {name CHAR(30), address VARCHAR(255), salary FLOAT} Fields can be found by looking up the field size in the schema and performing an offset calculation pointer to schema length timestamp name 0 12 header address 44 salary 300 308 In the relational model each record contains the same number of fields However fields may be of variable length If a record contains both fixed and variable length fields, store the fixed length fields first The fixed length fields are easy to locate To store variable length fields include additional information in the record header The length of the record Pointers to the beginning of each variable length field A pointer to the end of the record Consider an employee record with name, and salary being fixed length and address being variable length The pointer to the first variable length field may be omitted other header information record length address pointer end of record name header salary address Records in a relational DB have the same number of fields But it is possible to have repeating fields For example a many to many relationship in a record that represents an object References to other objects will have to be stored ▪ The references (or pointers) to other objects suggest that different records will have different lengths There are three alternatives for recording such data Store the entire record in one block Maintain a pointer to the first reference Store the fixed length portion in one location, and the variable length portion in another The header contains a pointer to the variable length portion (the references to other objects), and The number of such objects Store a fixed length record with an fixed number of occurrences of the repeating fields, and A pointer to (and count of) any additional occurrences There are many advantages of keeping records (and therefore fields) fixed length More efficient for search Lower overhead (the header contains less data) Easier to move records around The main advantage of using variable length fields is that it can save space This can result in fewer disk I/O operations Modifying a variable field in a record may make it larger Later fields in the same record have to be moved, and Other records may also have to be moved When a variable field is modified the record’s size may increase to the extent that it no longer fits on the page The record must then be moved to another page, but A "forwarding address" has to be maintained on the old page, so that external references to the rid are still valid A record may grow larger than the page size The record must then be broken into sections and connected by pointers Forwarding addresses may need to be maintained When a record grows too large, or When records are maintained in order (clustered) When maintaining ordered data Provide a forwarding address if a record has to be moved to a new page to maintain the ordering Delete records by inserting a NULL value, or tombstone pointer in the header The record slot can be re-used when another record is inserted 1 3 The numbers 
represent the primary keys of the records Insert 6,17, and 21 8 1 8 3 17 6 8 21 1 3 Delete 3,and insert 5 8 1 1 5 8 8 There are other data types that requires special treatment in terms of record storage Pointers, and reference variables Large objects such as text, images, video, sound etc. If a record represents an object, the object may contain pointers to or addresses of some other object Such pointers need to be managed by a DBMS A data item may have two addresses Database address on disk, usually 8 bytes Memory address in main memory, usually 4 bytes When an item is on the disk (i.e. secondary storage) its database address must be used And when an item is in the buffer pool it can be referred to by either its database or memory address It is more efficient to use the memory address Database addresses of items in main memory should be translated to their current memory addresses To avoid unnecessary disk I/O It is possible to create a translation table that maps database addresses to memory addresses However when using such a table addresses may have to be repeatedly translated Whenever a pointer of a record in main memory is accessed the translation table must be used Pointer swizzling is used to avoid repeated translation table look-up Whenever a block is moved from secondary to main memory pointers in that block may be swizzled i.e. translated from the database to the memory address A pointer in main memory consists of A bit that indicates whether the pointer is a database or a memory address, and The memory address (four byte) or database address (8 byte) as appropriate ▪ Space is always reserved for the database address There are several strategies to decide when a pointer should be swizzled Disk … Memory Read into memory swizzled Block 1 unswizzled Block 2 When a new block is brought into main memory, pointers related to that block may be swizzled The block may contain pointers to records in the same block, or other blocks, and Pointers in records in other blocks, already in main memory, may point to records in the newly copied block There are four main swizzling strategies Automatic swizzling Swizzling on demand No swizzling – i.e. 
just use the translation table Programmer controlled swizzling – when access patterns are known Enter the address of the block and its records into the translation table Enter the address of any pointers in the records in the block into the translation table If such an address is already in the table, swizzle the pointer giving it the appropriate memory address If the address is not already in the table, copy its block into memory and swizzle the pointer This ensures that all pointers in the new block are swizzled when the block is loaded, which may save time However, it is possible that some of the pointers may not be followed, hence time spent swizzling them is wasted Enter the address of the block and its records into the translation table Leave all pointers in the block unswizzled When an unswizzled pointer is followed, look up the address in the translation table If the address is in the table, swizzle the pointer If the address is not in the table, copy the appropriate block into main memory, and swizzle the pointer Unlike automatic swizzling, this strategy does not result in unnecessary swizzling When a block is written to disk its pointer must first be unswizzled That is, the pointers to memory addresses must be replaced by the appropriate database addresses The translation table can be searched (by memory address) to find the database address This is potentially time consuming The translation table should therefore be indexed to allow efficient lookup of both memory and database addresses Pointer swizzling may result in blocks being pinned A block is pinned if it cannot safely be written back to disk A block that is pointed to by a swizzled pointer should be pinned Otherwise, the pointer can no longer be followed to the block at the specified memory address If a block is unpinned pointers to it must be unswizzled The translation table must also include the memory addresses of pointers that refer to an entry ▪ As a linked list attached to an entry in the translation table, or ▪ As a (pointer to a) linked list in the record's pointer field How are large data objects stored in records? 
Video clips, or sound files or the text from a book LOB data types store and manipulate large blocks of unstructured data Tables can contain multiple LOB columns The maximum size of a LOB is large ▪ At least 8 terabytes in Oracle 10g LOB data must be processed by application programs LOB data is stored as either binary or character data BLOB – unstructured binary data CLOB, NCLOB – character data BFILE – unstructured binary data in OS files LOB's have to be stored on a sequence of blocks Ideally the blocks should be contiguous for efficient retrieval, but It is possible to store the LOB on a linked list of blocks ▪ Where each block contains a pointer to the next block If fast retrieval of LOBs is required they can be striped across multiple disks for parallel access It may be necessary to provide an index to a LOB For example indexing by seconds for a movie to allow a client to request small portions of the movie Records are organized on pages Pages can be thought of as a collection of slots, each of which contains a single record A record can be identified by its record id (rid) ▪ The rid is the {page ID, slot number} pair Before considering different organizations for managing slots it is important to know if Records are fixed length or Variable length There are two organizations based on how records are deleted Records are stored consecutively in slots When a record is deleted the last record on the page is moved to its location Records are found by an offset calculation All empty space is at the bottom of the page But the rid includes the slot number slot 1 slot 2 … slot N free space N As records are moved external references become invalid number of records The page header contains a bitmap slot 1 Each bit represents a single slot 2 slot A slot's bit is turned off when the slot is empty slot 3 … New records are inserted in empty slots slot M A record's slot number 1 … 0 1 0 M doesn't change M 3 2 1 number of slots bitmap showing slot occupancy With variable length records a page cannot be divided into fixed length slots If a new record is larger than the slot it cannot be inserted If a new record is smaller it wastes space To avoid wasting space, records must be moved so that all the free space is contiguous without changing the rids One solution is to maintain a directory of page slots at the end of each page which contains A pointer (an offset value) to each record and The length of each record Pointers are offsets to records Moving a record on the page has length = 24 no impact on its rid Its pointer changes but its slot number does not A pointer to the start of the free space is required length = 16 length = 20 pointer to start of free free space space Records are deleted by setting the offset to -1 New records can be inserted in vacant slots Pages should be periodically reorganized to remove gaps The directory "grows" into the free space 16 … 24 20 N N 2 1 slot directory number of slots A page can be considered as a collection of records Pages containing related records are organized into collections, or files One file usually represents a single table One file may span several pages It is therefore necessary to be able to access all of the pages that make up a file The basic file structure is a heap file Heap files are not ordered in any way But they do guarantee that all of the records in a file can be retrieved by repeatedly requesting the next record Each record in a file has a unique record ID (rid) And each page in the file is the same size Heap files support the following 
operations: Creating and destroying files Inserting and deleting records Scanning all the records in the file To support these operations it is necessary to: Keep track of the pages in the file Keep track of which of those pages contain free space Maintain the heap file as a pair of doubly linked lists of pages One list for pages with free space and One list for pages that are full The DBMS can record the first page in the list in a table with one entry for each file If records are of variable length most pages will end up in the list of pages with free space It may be necessary to search several pages on the free space list to find one with enough free space Maintain the heap file as a directory of pages Each directory entry identifies a page (or a sequence of pages) in the heap file The entries are kept in data page order and records for each page: Whether or not the page is full, or The amount of free space ▪ If the amount of free space is recorded there is no need to visit a page to determine if it contains enough space The buffer manager is responsible for bringing pages from disk to main memory as required Main memory is partitioned into a collection of pages called the buffer pool Main memory pages are referred to as frames Other processes must tell the buffer manager if a page is no longer required and whether or not it has been modified A DB may be many times larger than the buffer pool Accessing the entire DB (or performing queries that require joins) can easily fill up the buffer pool ▪ When the buffer pool is full, the buffer manager must decide which pages to replace by following a replacement policy Program 1 Disk Page Program 2 Buffer Manager Free Frame MAIN MEMORY Buffer Pool Database DISK Buffer pool frames are the same size as disk pages The buffer manager records two pieces of information for each frame Dirty Bit Pin Count Data Page dirty bit – on if the page has been modified pin-count – the number of times the page has been requested but not released Main memory frame If the page is already in the buffer pool Increment the frame's pin-count (called pinning) Otherwise Choose a frame to replace (using the policy) ▪ A frame is only chosen for replacement if its pin-count is zero ▪ If there is no frame with a pin-count of zero the transaction must either wait or be aborted ▪ If the chosen frame is dirty write it to the disk Read requested page into replacement frame and set its pin- count to 1 Return the address of the frame When a process releases (de-allocates) a page its pin-count is reduced, known as unpinning The program indicates if the page has been modified, if so the buffer manager sets the dirty bit to on Processes for requesting and releasing pages are affected by concurrency and crash recovery policies These will be discussed at a later date The policy used to replace frames can affect the efficiency of database operations Ideally a frame should not be replaced if it will be needed again in the near future Least Recently Used (LRU) replacement policy Assumes that frames that haven't been used recently are no longer required Uses a queue to keep track of frames with pin-count of zero Replaces the frame at the front of the queue Requires main memory space for the queue A variant of the LRU policy with les overhead Instead of a queue the policy requires one bit per frame, and a single variable, called current Assume that the frames are numbered from 0 to B-1 ▪ Where B is the number of frames Each frame has an associated referenced bit The referenced bit is 
initially set to off, and is set to on when the frame's pin-count reaches zero. current is initially set to 0 and indicates the next frame to be considered for replacement.

Consider the current frame for replacement: If its pin-count is greater than 0, increment current. If its pin-count is 0 and its referenced bit is on ▪ Switch referenced to off and increment current. If its pin-count is 0 and referenced is off ▪ Replace the frame. If current reaches B-1, set it back to 0. Only frames with pin-counts of zero are replaced, and such frames are only replaced after all older candidates are replaced.

LRU and clock replacement are fair schemes, but they are not always the best strategies for a DB system. It is common for some DB operations to require repeated sequential scans of data (e.g. Cartesian products, joins); with LRU such operations may result in sequential flooding. An alternative is the Most Recently Used (MRU) policy, which prevents sequential flooding but is generally poor. Most systems use some variant of LRU; some systems will identify certain operations and apply MRU for those operations.

To see sequential flooding, assume that a process requests repeated sequential scans of a file and that the buffer pool has ten frames. If the file has nine pages (p1 to p9), read page 1 first, then page 2, ... then page 9. All the pages are now in the buffer, so when the next scan of the file is requested no further disk access is required. Now suppose the file has eleven pages (p1 to p11). Read pages 1 to 10, leaving page 11 still to be read. Using LRU, p11 replaces the least recently used frame, which contains p1. The first scan is then complete, and the second scan starts by reading p1 from the file, which replaces the LRU frame, now containing p2. The scan continues by reading p2, which evicts p3, and so on: every page of every scan has to be read from the disk! In this case LRU is the WORST possible replacement policy.

There are similarities between OS virtual memory and DBMS buffer management: both have the goal of accessing more data than will fit in main memory, and both bring pages from disk to main memory as needed and replace unneeded pages. A DBMS nevertheless requires its own buffer management, to increase the efficiency of database operations and to control when a page is written to disk. A DBMS can often predict patterns in the way in which pages are referenced: most page references are generated by processes, such as query processing, with known patterns of page accesses. Knowledge of these patterns allows for a better choice of pages to replace and allows prefetching of pages, where page requests can be anticipated and performed before they are requested. A DBMS also requires the ability to force a page to disk, to ensure that the page is updated on the disk; this is necessary to implement crash recovery protocols, where the order in which pages are written is critical.

Some DBMS buffer managers are able to predict page requests and fetch pages into the buffer before they are requested. The pages are then available in the buffer pool as soon as they are requested, and if the pages to be prefetched are contiguous the retrieval will be faster than if they had been retrieved individually. Even if the pages are not contiguous, retrieval may still be faster as access to them can be efficiently scheduled. The disadvantage of prefetching (aka double-buffering) is that it requires extra main memory buffers.

In summary: Organizing data by cylinders – related data should be stored "close to" each other. Using a RAID system to improve efficiency or reliability – multiple disks and striping improve efficiency, while mirroring or redundancy improves reliability. Scheduling requests using the elevator algorithm – reduces disk access time for random reads and writes, and is most effective when there are many requests waiting. Prefetching (or double-buffering) data in large chunks – speeds up access when needed blocks can be predicted, but requires more main memory buffers.
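Finally, as a closing sketch of the buffer management ideas above, here is the clock replacement policy with a pin count, a dirty bit and a referenced bit per frame. The class and method names are illustrative, and the actual disk reads and writes are indicated only by comments.

```python
# A minimal sketch of a buffer pool using the clock replacement policy described
# above. Names are illustrative; real disk I/O is indicated by comments only.

class Frame:
    def __init__(self):
        self.page_id = None
        self.pin_count = 0
        self.dirty = False
        self.referenced = False

class ClockBufferPool:
    def __init__(self, num_frames):
        self.frames = [Frame() for _ in range(num_frames)]
        self.page_table = {}      # page_id -> frame index
        self.current = 0          # next frame to consider for replacement

    def _choose_victim(self):
        # Sweep the frames at most twice: every unpinned frame has its
        # referenced bit cleared once (a second chance) before it can be chosen.
        for _ in range(2 * len(self.frames)):
            frame = self.frames[self.current]
            if frame.pin_count == 0:
                if frame.referenced:
                    frame.referenced = False          # second chance
                else:
                    return self.current               # replace this frame
            self.current = (self.current + 1) % len(self.frames)
        raise RuntimeError("all frames pinned: wait or abort the transaction")

    def pin(self, page_id):
        """Return the index of a frame holding page_id, reading the page in if needed."""
        if page_id in self.page_table:
            index = self.page_table[page_id]
        else:
            index = self._choose_victim()
            victim = self.frames[index]
            if victim.dirty:
                pass                                   # write victim.page_id back to disk here
            if victim.page_id is not None:
                del self.page_table[victim.page_id]
            victim.page_id, victim.dirty = page_id, False   # read page_id from disk here
            self.page_table[page_id] = index
        self.frames[index].pin_count += 1
        return index

    def unpin(self, page_id, modified):
        """Release a page; the caller says whether it modified the page."""
        frame = self.frames[self.page_table[page_id]]
        frame.pin_count -= 1
        frame.dirty = frame.dirty or modified
        if frame.pin_count == 0:
            frame.referenced = True    # now a replacement candidate, with one second chance

pool = ClockBufferPool(num_frames=3)
frame_index = pool.pin("page_7")
# ... read or modify the page held in that frame ...
pool.unpin("page_7", modified=True)
```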