CS346: Advanced Databases
Graham Cormode, G.Cormode@warwick.ac.uk
Storage, Files and Indexing

Outline
Part 1: Disk properties and file storage
  – File organizations: ordered, unordered, and hashed
  – Storage topics: RAID and storage area networks
  – Chapter: “Disk Storage, Basic File Structures and Hashing” in Elmasri and Navathe
Part 2: Indexes

Why?
Important to understand how high-level abstractions (databases) map down to low-level concepts (disks, files)
Get a sense of the scale of the quantities involved (seek times, overhead of inefficient solutions)
  – Appreciate the difference that smart solutions can bring
  – Understand where the bottlenecks lie
Give a “bottom-up” perspective on data management
  – See the whole picture starting from the low level
  – Demystify some aspects that can seem opaque (B-trees, hashing, file organization)
  – Apply to many areas of computer science (OS, algorithms…)

The Memory Hierarchy
[Figure: the memory hierarchy, including flash storage]

Data on Disks
Databases ultimately rely on non-volatile disk storage
  – Data typically does not fit in (volatile) memory
Physical properties of disks affect performance of the DBMS
  – Need to understand some basics of disks
A few exceptions to disk-based databases:
  – Some real-time applications use “in-memory databases”
  – Some legacy/massive applications use tape storage as well
Different tradeoffs with flash-based storage
  – Much faster to read, but limits on the number of erase (rewrite) cycles
  – No major difference between random access and linear scan
  – “Flash databases” are a niche, but growing area

Rotating Disk
Rotation speed: 5000 to 10000 RPM
Sector size: 0.5KB to 4KB, the basic unit of data transfer from disk
Seek time: move read head into position, currently ~4ms
  – Includes rotational delay: wait for sector to come under read head
  – Random access: (1 / 0.004s) × 4KB = 250 × 4KB ≈ 1MB/second: quite slow
  – Track-to-track move, currently ~0.4ms: 10 times faster
Sustained read/write rate: 100MB/second (caching can improve)

Disk properties: the fundamental contrast
Random access is slow, sequential access is fast
  – By factors of up to 100s
  – Want to design storage of data to avoid or minimize random access and make data access as fast as possible
Buffering can help in multithreaded systems:
  – Work on other processes while waiting for data to arrive
  – Double buffering: maintain two buffers of data; work on the current buffer while the other buffer fills from disk
  – Maximizes parallel utilization, but doesn’t make my thread faster

Records: the basic unit of the database
Databases fundamentally composed of records
  – Each record describes an object with a number of fields
Fields have a type (integer, float, string, time, compound…)
  – Fixed or variable length
Need to know when one field ends and the next begins
  – Field length codes
  – Field separators (special characters)
  – Leads to variable length records
How to effectively search through data with variable length records?
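
To make the “field length codes” idea concrete, here is a minimal sketch in Python. It is an illustration only, not a format used by any particular DBMS: the function names `encode_record` and `decode_record` and the choice of a 2-byte big-endian length prefix are assumptions made for this example.

```python
import struct

def encode_record(fields):
    """Encode a list of string fields, each preceded by a 2-byte length code.

    Hypothetical layout for illustration: [len][bytes][len][bytes]...
    """
    out = bytearray()
    for field in fields:
        data = field.encode("utf-8")
        out += struct.pack(">H", len(data))  # big-endian unsigned 16-bit length
        out += data
    return bytes(out)

def decode_record(buf):
    """Recover the fields by reading each length code, then that many bytes."""
    fields = []
    pos = 0
    while pos < len(buf):
        (length,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        fields.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return fields

record = encode_record(["Smith", "", "Coventry"])
print(len(record))            # 19 bytes: three 2-byte codes plus 13 data bytes
print(decode_record(record))  # ['Smith', '', 'Coventry']
```

Because each field carries its own length, an empty field costs only the 2-byte code, and no character has to be reserved as a separator. The price is that finding the k-th field means walking the earlier length codes, which is part of why searching variable length records is harder than searching fixed length ones.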

Records and Blocks
Records get stored on disks organized into blocks
Small records: pack an integer number into each block
  – Leaves some space left over in blocks
  – Blocking factor: (average) number of records per block
Large records: may not be effective to leave slack
  – Records may span across multiple blocks (spanned organization)
  – May use a pointer at end of block to point to next block

Files
A sequence of records is stored as a file
  – Either using OS file system support, or handled by DBMS
Database requires support for various file operations:
  – Open file, return new file handle
  – Scan for the next record that satisfies a search condition
  – Read the next record from disk into memory
  – Delete the current record and (eventually) update file on disk
  – Modify the current record and (eventually) update file on disk
  – Insert a new record at the current location
  – Close the file, flush any buffers and postponed operations
Need suitable file layout and indices to allow fast scan operation

File organization: unordered
Just dump the records on disk in no particular order
Insert is very efficient: just add to last block
Scan is very inefficient: need to do a linear search
  – Read half the file on average
Delete could be inefficient:
  – Read whole file, write it back with deleted record omitted
  – Instead, just “mark” record as deleted
  – Periodically remove marked records

File organization: ordered
Keep records ordered on some (key) attribute
Can scan through records in that order very easily
Can search for a value (or range of values) by binary search
  – Binary search: log2 b seeks to find desired record out of b blocks
  – Linear search: b/2 seeks on average to find record
Insertion is rather more expensive and complex to do well
  – Keep recent records in “overflow buffer” for periodic merge
If modifying the key field, treat as a deletion and an insertion

File organization: hashed
Use hashing to ensure records with same key are grouped together
Arrange file blocks into M equal sized buckets
  – Often, 1 block = 1 bucket
Apply hash function to key field to determine its bucket
Usual hash table concerns emerge
  – Need to deal with collisions, e.g. by open addressing, or chaining
  – Deletions also get messy, depending on collision method used

External hashing
Don’t store records directly in buckets, store pointers to records
  – Pointers are small, fit more in a block
  – “All problems in computer science can be solved by another level of indirection” (David Wheeler)

External hashing: issues
Aim for 70-90% occupancy of the hash table
  – Not too much wastage, not too many collisions
Hash function should spread records evenly across buckets
  – If very skewed distribution, we lose benefits of hashing
Still costly if access to records ordered by key is required
  – And doesn’t help with accessing records not by key
Main disadvantage: hard to adjust if number of records grows
  – Need to resize the hash table
What if too many records hash to the same bucket?
  – Can handle extra records by “chaining” to overflow buckets

Hashing: Overflow buckets
[Figure: hashing with overflow buckets]

Extendible hashing
Hashing scheme that allows the hash table to grow and shrink
  – Avoid wasted space and avoid excessive collisions
Makes use of a directory of bucket addresses
  – Directory size is a power of two, 2^d
  – So can double or halve the directory size as needed
  – The first d bits of the hash value are used to index into the directory
Directory entries point to disk blocks storing records
  – Contiguous directory entries can point to same disk block
  – Disk blocks can have a local value of d, d’
  – Insertions into a block may cause it to overflow and split in two
  – The directory is then updated accordingly

Extendible hashing example
[Figure: extendible hashing example, with some values of d’ less than the global d]

Extendible Hashing: Updating d
If a bucket becomes full, may need to increase d → d + 1
  – Double the size of the directory
Similarly, if all buckets have local d’ < d, can decrease d → d - 1
  – Halve the size of the directory
Other adaptive hashing variants exist
  – Dynamic hashing: binary tree directory

RAID disk technology
RAID originally a way to combine multiple cheap disks for reliability
  – “Redundant Array of Inexpensive Disks” (1980s)
Now general purpose approach to providing reliability
  – “Redundant Array of Independent Disks”
  – Sets of different levels of replication
RAID 0: spread data over multiple disks (striping)
  – Increases throughput, but increases risk of data loss

Important RAID levels
RAID 1: duplication of data across multiple disks (mirroring)
  – Data copied to 2 (or more) disks
  – Disk reliability measured in “mean time between failures” (MTBF)
  – Typical MTBF is 100K to 1M hours (1M hours is roughly a century)
  – Chance of both disks failing at same time is small
  – So enough time to recover a copy
RAID 5: block level striping and parity coding spread over disks
  – Parity coding: allows recovery of 1 missing disk
[Figure: a row of data bits and the parity bit computed from them]

RAID levels
RAID 6: Reed-Solomon coding allows multiple disk losses
Other RAID levels (2, 3, 4) not in common usage

Storage Area Networks
Storage Area Networks: virtual disks
  – Disks attached to “headless” server
  – Easy to configure, low maintenance overhead
Many advantages to SANs:
  – Flexible configuration: hot-swap new disks in/out
  – Can be physically remote from other network elements
  – Provided on fast (fibre-based) network
  – Separate storage for server configuration, OS updates etc.

Outline
Part 2
  – Indexes: primary and secondary
  – Multilevel indexes and B-trees
  – Chapter: “Indexing Structures for Files” in Elmasri and Navathe

Indexing for Files
Chapter: “Indexing Structures for Files” in Elmasri and Navathe
  – Move focus from how file is stored on disk to how file is accessed / indexed by the DBMS
Index: an auxiliary file that makes it faster to find certain records
  – An index is usually for one field of the record (e.g. index by name)
  – Can have multiple indexes, each for different fields
A basic form of an index is a sorted list of pointers
  – <field value, pointer to record>, ordered by field value
  – “An access path” for the indexed field

Indexes as access paths
Indexes usually take up much less space than the original file
  – Each index entry is much smaller than the full record
  – Just need a field value, and a pointer (few bytes)
Efficient to look up matching records
  – Binary search on the index, then follow pointer
The index may be dense or sparse
  – Dense index: contains an entry for every possible search value
  – Sparse index: contains entries only for some search values
Can have an index on the field that the file is sorted on! Why?
  – Can be faster to search via index than do binary search on file

Primary Index
A primary index applies when the file is ordered by a key field
A sparse index: one entry for each block of the data file
  – An index for the first record in the block (the block anchor)
  – Can be much fewer entries in index than in the data file
Straightforward to search for a record
  – Use the index to find the block that the record should be in
  – Retrieve the block and see if the record is there
Insertion and deletion of records in the main file is a pain!
  – Almost all the pointers change!
Some standard tricks to mitigate the pain
  – Buffer updates in an “overflow” file and check against this
  – Linked list of overflow records for each block as needed
  – Mark records as deleted, and only purge periodically

Indexing Example
Example: Given a data file EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ...)
Suppose that:
  – record size R = 100 bytes (fixed size)
  – block size B = 1024 bytes
  – file size r = 30000 records
Blocking factor Bfr = ⌊B / R⌋ = ⌊1024 / 100⌋ = 10 records/block
Number of file blocks b = ⌈r / Bfr⌉ = 30000 / 10 = 3000 blocks

Indexing Example
For an index on the SSN field, assume the field size VSSN = 9 bytes and the record pointer size PR = 6 bytes.
Then:
  – index entry size RI = (VSSN + PR) = (9 + 6) = 15 bytes
  – index blocking factor BfrI = ⌊B / RI⌋ = ⌊1024 / 15⌋ = 68 entries/block
  – number of index blocks bI = ⌈b / BfrI⌉ = ⌈3000 / 68⌉ = 45 blocks (one index entry per data block)
  – binary search needs ⌈log2(bI)⌉ = ⌈log2(45)⌉ = 6 block accesses
    [In practice, likely that these 45 blocks would end up in cache]
This is compared to an average linear search cost of:
  – (b/2) = 3000/2 = 1500 block accesses
If the file records are ordered, the binary search cost would be:
  – ⌈log2 b⌉ = ⌈log2 3000⌉ = 12 block accesses

Clustering Index
Clustering index applies when data is ordered on a non-key field
  – The field on which data is ordered is called the clustering field
  – The data file is described as a clustered file
  – Clustering index is sorted list of <field value, pointer> pairs
Why make a distinction between clustering and primary index?
  – Field values can appear in many consecutive records
Only one entry in index for each distinct field value
  – No point having multiple entries
  – Index points to first data block containing the matching value
Same issues with insertion and deletion as for primary index

[Figure: cluster index where each distinct value is allocated a whole disk block, with a linked list if more than one block is needed]

Secondary indexes
Secondary indexes provide a secondary means of access to data
  – For when some primary access already exists (e.g. index on key)
A secondary index is on some other field(s)
  – Either other candidate key fields which are unique for every record
  – Or non-key field with duplicate values
Secondary index is an ordered file of <field value, pointer> pairs
  – Pointer can be to a file block, or record within a file
  – A dense index: must be one pointer per record
Many secondary indexes can be created for a file
  – Allowing access based on different fields
  – By contrast, there can be only one primary index

Secondary index with block pointers
[Figure: secondary index with block pointers; unique data values so structure is simple]

Secondary index example
Same set up as previous example: r = 30000 records of size R = 100 bytes, block size B = 1024 bytes
File is stored in 3000 blocks as worked out before
Search for a record based on a field of V = 9 bytes
  – Linear search would read 1500 blocks on average
Secondary index on target attribute: (9 + 6) = 15 bytes/entry
  – Blocking factor for index is ⌊1024/15⌋ = 68 entries per block
Need ⌈30000/68⌉ = 442 blocks to store the (dense) index
  – Binary search on index takes ⌈log2 442⌉ = 9 block accesses
  – Slightly more than the primary index (why?)

Secondary index for non-key, non-ordering field
Secondary index for a non-key non-ordering field
  – I.e. a field that has duplicate values in many records
  – Several possible approaches
1. Include duplicate index entries for the same field value (dense)
2. Have variable length entries in the index: a list of pointers to all blocks containing the target value
3. Use an extra level of indirection: fixed length index entries point to list of pointers, arranged as list of disk blocks
  – Option 3 is most commonly used
All options are painful when data file is subject to insert/deletes

“Option 3” secondary index
[Figure: “Option 3” secondary index]

Single Level Indexing Summary
Primary index: on the field that the data is sorted by
  – Allows faster access than searching the file directly
Secondary index: on any field(s) in the data
  – Can have multiple secondary indexes
  – Typically dense
All indexes require extra effort to maintain if the data is subject to frequent updates (insert/delete operations)

Multilevel Indexing
The indexes described so far miss a trick: they do binary search
But we can read a block of k index records at a time
  – Can do a k-way split instead of a 2-way split
  – Improves cost from log2 N to logk N
Another way to look at it: if index is large, build index on index…
  – Original index is first level index, then there is second level index
  – Can repeat, creating third level index, fourth level index…
  – Until top level of index fits into one disk block
  – For all realistic file sizes, a constant number of levels is needed
Apply this idea to any index type (primary, secondary, cluster)
  – Assume first level index has fixed length, distinct valued entries

Two-level index
[Figure: two-level index]

Example
Convert previous example into a multilevel index
  – Blocking factor for indexes remains 68
  – 442 blocks of first level index
  – Second level index: ⌈442/68⌉ = 7 blocks
  – Third level index fits in 1 block: stop here!
Hence, need three levels of index: three accesses to find (pointer to) target record

Dynamic multilevel indexes
Can we modify our storage of indices to make handling inserts/deletes less painful?
Use tree-structure to directly access data
  – Keep some space in file blocks to reduce cost of updates
Use the language of trees to describe the structure

Search trees
A search tree: a tree where each node contains at most p-1 search values and p pointers, as P1, K1, P2, K2, …, Kq-1, Pq, with q ≤ p
  – The values are in order: K1 < K2 < … < Kq-1
  – Each pointer Pi points to a subtree so that Ki-1 < X ≤ Ki for all keys X in the subtree
Rules allow efficient search for any key value
  – Search within the only subtree it can be in at each level

Search tree example
Leaf-level entries have the full record
Insertion is easier: we can add a new block without having to rewrite the rest of the tree
If tree is unbalanced (some very deep paths), searches are long
  – Try to avoid by using rules to avoid tree getting unbalanced
  – Perform occasional rebalancing or “self-balancing” trees

B-trees and B+-trees
B-trees add the constraint that the tree should be balanced
  – The root to leaf path should be about the same length for all leaves
  – Avoid wasted space: each node should be between half full and full
B+-tree is a slight modification of B-tree that is now the standard
  – B-trees: allow pointers to data at all levels of the tree
  – B+-tree: pointers to data only at the leaf level
  – B+-tree slightly simpler (fewer cases to handle with updates)
The trees can be used for (primary, secondary) multi-level indexes
  – Updates to data can be reflected in tree easily
These trees are widely used in file systems and database systems
  – File systems: NTFS [Windows], NSS, XFS, JFS, for directory entries
  – DBMSs: IBM DB2, Informix, MS SQL Server, Oracle, SQLite

B+-tree
Internal nodes: P1, K1, P2, K2, …, Kq-1, Pq, where p/2 < q ≤ p
Leaf nodes: K1, Pr1, K2, Pr2, …, Kq-1, Prq-1, Pnext, where p/2 < q ≤ p
  – Ki, Pri: Pri points to record with value Ki
  – Pnext points to the next leaf node in the tree (for linear access)

B+-tree: Search
Search on a B+-tree is fairly straightforward
  – Start at root block
  – While not at a leaf block
      Determine between which values in the block the key falls
      Follow the relevant pointer to the new block
  – Search current leaf block for desired value
  – If found, follow pointer to retrieve record

B+-tree: insertion
As with many tree algorithms, insertion is based on search
  – Start by searching for where the record should be
  – If room in the leaf block, insert a pointer to the new record
  – Else, split the leaf block into two, and insert the pointer
Now there are two leaf blocks: need to update parent
  – Similar process to update parent: may need to split parent
  – May propagate back to root
Note that we do not explicitly attempt to keep tree balanced
  – The condition p/2 < q ≤ p ensures that it can’t be too unbalanced
  – Algorithms fans: condition ensures height is O(log n) for n keys
  – Worst case time for {insert, delete, search} is O(log n)

B+-tree: deletion
Essentially the inverse of insertion
  – Find the record to delete from the B+-tree
  – Remove the pointer and if block is still large enough, halt
  – Else, try to redistribute: move entries from sibling block
  – If can’t redistribute, merge the two siblings
  – Then delete one pointer from parent and recurse up tree

Summary
Disk properties and file storage
File organizations: ordered, unordered, and hashed
Storage topics: RAID and storage area networks
Indexes: primary and secondary
Multilevel indexes and B-trees
Chapter: “Disk Storage, Basic File Structures and Hashing” in Elmasri and Navathe
Chapter: “Indexing Structures for Files” in Elmasri and Navathe
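
As a check on the indexing examples worked through in Part 2, the block counts and access costs can be recomputed with a few lines of Python. This is a sketch under the same assumptions as the slides (1024-byte blocks, 100-byte records, 30000 records, 9-byte key, 6-byte pointer). The helper names `ceil_div`, `index_blocks`, `binary_search_cost` and `multilevel_levels` are introduced here for illustration and are not from the course material.

```python
import math

def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)

def index_blocks(entries, block_size, entry_size):
    """Blocks needed to hold `entries` fixed-size index entries."""
    fan_out = block_size // entry_size  # index blocking factor
    return ceil_div(entries, fan_out)

def binary_search_cost(blocks):
    """Worst-case block accesses for a binary search over `blocks` blocks."""
    return math.ceil(math.log2(blocks))

def multilevel_levels(first_level_blocks, fan_out):
    """Count index levels, adding levels until the top one fits in one block."""
    levels, blocks = 1, first_level_blocks
    while blocks > 1:
        blocks = ceil_div(blocks, fan_out)
        levels += 1
    return levels

B, R, r = 1024, 100, 30000  # block size, record size, number of records
V, P = 9, 6                 # key field size, record pointer size (bytes)

bfr = B // R                # records per data block
b = ceil_div(r, bfr)        # data blocks in the file
primary = index_blocks(b, B, V + P)    # sparse: one entry per data block
secondary = index_blocks(r, B, V + P)  # dense: one entry per record

print(bfr, b)                                    # 10 3000
print(b // 2, binary_search_cost(b))             # 1500 12
print(primary, binary_search_cost(primary))      # 45 6
print(secondary, binary_search_cost(secondary))  # 442 9
print(multilevel_levels(secondary, B // (V + P)))  # 3
```

The only difference between the primary and secondary cases is the number of entries passed to `index_blocks`: one per data block for the sparse primary index, one per record for the dense secondary index. That is why the secondary index is about ten times larger and needs a few more accesses under binary search.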