More on Indexes: Secondary Indexes and B-Trees
Source: our textbook, slides by Hector Garcia-Molina

Secondary Indexes
Sometimes we want multiple indexes on a relation.
Ex: search Candies(name, manf) both by name and by manufacturer.
Typically the file is sorted on the key (ex: name), and the primary index is on that field.
The secondary index is on any other attribute (ex: manf).
A secondary index also helps find records, but it cannot rely on the records being sorted on its attribute.

Sparse Secondary Index?
No! Since the records are not sorted on that key, we cannot predict the location of a record from the location of any other record.
Thus secondary indexes are always dense.
[Figure: a sparse index (30, 20, 80, 100, 90, ...) over a file whose sequence field is unsorted (blocks 30 50, 20 70, 80 40, 100 10, 90 60). It does not make sense!]

Design of Secondary Indexes
Always dense, usually with duplicates.
Consists of key-pointer pairs ("key" means search key, not relation key).
Entries in the index file are sorted by key.
Therefore the second-level index is sparse.

Secondary indexes
[Figure: sparse second level (10, 50, 90, ...) over a dense first level (10, 20, 30, 40, 50, 60, 70, ...), which points into a file whose sequence field is unsorted (blocks 30 50, 20 70, 80 40, 100 10, 90 60).]

Secondary Index and Duplicate Keys
The scheme in the previous diagram wastes space in the presence of duplicate keys.
If a search key value appears n times in the data file, then there are n entries for it in the index.

Duplicate values & secondary indexes
One option...
[Figure: index entries 10, 10, 10, 20, 20, 30, 40, 40, 40, 40, ... over a data file with blocks 20 10, 20 40, 10 40, 10 40, 30 40.]
Problem: excess overhead!
• disk space
• search time

Buckets
To avoid repeating values, use a level of indirection.
Put buckets between the secondary index file and the data file.
There is one entry in the index for each search key K; its pointer goes to a location in a "bucket file", called the bucket for K.
The bucket holds pointers to all records with search key K.

Duplicate values & secondary indexes
[Figure: index with one entry per distinct key (10, 20, 30, 40, 50, 60, ...); each entry points to a bucket of pointers into the data file (blocks 20 10, 20 40, 10 40, 10 40, 30 40).]
Buckets save space as long as search keys are larger than pointers and the average key appears at least twice.

Why the "bucket" idea is useful
Indexes: name (primary), dept (secondary), floor (secondary)
Records: Emp(name, dept, floor, ...)

Query:
SELECT name FROM Emp WHERE dept = 'Toy' AND floor = 2
[Figure: the dept index entry for Toy and the floor index entry for 2, each pointing to a bucket of pointers into Emp.]
Intersect the Toy dept bucket and the floor 2 bucket to get the set of matching Emp's.
Saves disk I/O's.

Summary of Indexes So Far
Advantages:
• simple
• index is a sequential file, good for scans
Disadvantages:
• either inserts are expensive or sequentiality is lost (cf. next slide)
Instead, use the B-tree data structure to implement the index.

Example
[Figure: index stored contiguously (sequential) with keys 10, 20, 30, 33, 40, 50, 60, free space, 70, 80, 90; later insertions 39, 31, 35, 36, 32, 38, 34 sit in an overflow area (not sequential).]

B-Trees
Several related data structures.
Key features are:
• automatically adjust the number of levels of indexes as the size of the data file changes
• storage on blocks is managed to keep every block between half full and full => no overflow blocks needed
We'll actually study B+ trees.

B-Tree Structure
An example of a balanced search tree: every root-to-leaf path has the same length.
Each node (vertex) in the tree is a block, which contains search keys and pointers.
Parameter n is the largest value such that n+1 pointers and n keys fit in one block.
Ex: If the block size is 4096 bytes, keys are 4 bytes, and pointers are 8 bytes, then n = 340.
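The value n = 340 follows from requiring that n keys and n+1 pointers fit in one block. A minimal sketch of that arithmetic, assuming (as the slide does) that the block holds nothing else; the function name `max_keys_per_block` is made up for illustration:

```python
def max_keys_per_block(block_size: int, key_size: int, pointer_size: int) -> int:
    """Largest n such that n keys and n+1 pointers fit in one block.

    Solves n*key_size + (n+1)*pointer_size <= block_size for n,
    assuming the block stores no header or other bookkeeping.
    """
    return (block_size - pointer_size) // (key_size + pointer_size)


n = max_keys_per_block(block_size=4096, key_size=4, pointer_size=8)
print(n)  # 340

# Sanity check: n fits, n+1 does not.
assert n * 4 + (n + 1) * 8 <= 4096
assert (n + 1) * 4 + (n + 2) * 8 > 4096
```

With 340 keys and 341 pointers the node uses 4088 of the 4096 bytes; one more key-pointer pair would need 4100 bytes.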
Constraints on B-Tree Nodes
Keys in leaf nodes are copies of keys from the data file, in sorted order.
The root contains between 2 and n+1 index node pointers.
Each internal node contains between ⌈(n+1)/2⌉ and n+1 index node pointers.
Each non-leaf node consists of ptr_1, key_1, ptr_2, key_2, …, key_(m-1), ptr_m, where ptr_i points to the index node with keys between key_(i-1) and key_i.

Constraints (cont'd)
Each leaf contains between ⌊(n+1)/2⌋ and n data record pointers, plus a "next leaf" pointer.
Associated with each data record pointer is a key, and the pointer points to the data record with that key.

Example B-tree nodes with n = 3
[Figure: nodes drawn in textbook notation and in a more concise notation. Leaf with keys 30, 35: pointers to the record with key 30 and to the record with key 35. Non-leaf with key 30: pointers to the part of the tree with keys < 30 and to the part of the tree with keys ≥ 30.]

Sample non-leaf
[Figure: non-leaf node with keys 57, 81, 95; its four pointers lead to keys k < 57, to keys 57 ≤ k < 81, to keys 81 ≤ k < 95, and to keys k ≥ 95.]

Sample leaf node
[Figure: leaf node with keys 57, 81, 95, reached from a non-leaf node; pointers to the records with keys 57, 81, and 95, plus a pointer to the next leaf in sequence.]

[Figure, n = 3: full node vs. min. node. Non-leaf: full node with keys 120, 150, 180; min. node with key 30. Leaf: full node with keys 3, 5, 11; min. node with keys 30, 35. Annotation: counts even if null.]
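B-Tree Example, n = 3
[Figure: root with key 100; second level with nodes (30) and (120, 150, 180); leaves (3, 5, 11), (30, 35), (100, 101, 110), (120, 130), (150, 156, 179), (180, 200), whose pointers lead to records.]

Insert into B+tree
(a) simple case: space available in leaf
(b) leaf overflow
(c) non-leaf overflow
(d) new root

(a) Insert key = 32, n = 3
[Figure: under root key 100, non-leaf (30) with leaves (3, 5, 11) and (30, 31); 32 fits in the second leaf, giving (30, 31, 32).]

(b) Insert key = 7, n = 3
[Figure: leaf (3, 5, 11) is full, so it splits into (3, 5) and (7, 11), and key 7 is added to the parent next to 30.]

(c) Insert key = 160, n = 3
[Figure: leaf (150, 156, 179) splits into (150, 156) and (160, 179); the parent (120, 150, 180) is already full, so it splits into (120, 150) and (180), and key 160 moves up next to 100.]

(d) New root, insert 45, n = 3
[Figure: root (10, 20, 30) over leaves (1, 2, 3), (10, 12), (20, 25), (30, 32, 40); leaf (30, 32, 40) splits into (30, 32) and (40, 45), the full root splits into (10, 20) and (40), and a new root with key 30 is created.]

Deletion from B-tree
(a) Simple case - no example
(b) Coalesce with neighbor (sibling)
(c) Re-distribute keys
(d) Cases (b) or (c) at non-leaf

(b) Coalesce with sibling, n = 4
[Figure: under root key 100, non-leaf (10, 40) with leaves (10, 20, 30) and (40, 50). Delete 50: the remaining key 40 moves into the sibling, giving (10, 20, 30, 40), and key 40 is removed from the parent.]

(c) Redistribute keys, n = 4
[Figure: under root key 100, non-leaf (10, 40) with leaves (10, 20, 30, 35) and (40, 50). Delete 50: key 35 moves over to give leaves (10, 20, 30) and (35, 40), and the parent key 40 becomes 35.]

(d) Non-leaf coalesce, n = 4
[Figure: root (25) over non-leaves (10, 20) and (30, 40), with leaves (1, 3), (10, 14), (20, 22), (25, 26), (30, 37), (40, 45). Delete 37: leaf (30) merges into (25, 26, 30), the non-leaf shrinks to (40), and the two non-leaves coalesce into the new root (10, 20, 25, 40).]

B-tree deletions in practice
Often, coalescing is not implemented.
Too hard and not worth it!

Applications of B-Trees
A B-tree is used to implement indexes.
The data record pointers in the leaves correspond to the data record pointers in sequential indexes.
Some example uses:
• B-tree search key is the primary key for the data file; leaf pointers form a dense index on the file
• B-tree search key is the primary key for the data file; leaf pointers form a sparse index on the file
• B-tree search key is not the primary key; leaf pointers form a dense index on the file

B-Trees with Duplicate Keys
Change the definition of B-tree: if key K appears in an internal node, then K is the smallest "new" key in the subtree S rooted at the pointer that follows K in the node.
"New" means K does not appear in the part of the B-tree to the left of S, but it does appear in S.
Allow a null key in certain situations.

Example B-Tree with Duplicates
[Figure: B-tree with duplicate keys (23 appears in several leaves); one interior node holds a null key followed by 37 and 43.]

Lookup in B-Trees
Assume no duplicate keys. Assume the B-tree is a dense index.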
To find the record with key K, search starting at the root and ending at a leaf:
• if the current node is not a leaf and has keys K1, K2, …, Kn, find the largest key Ki in the sequence that is ≤ K, follow the (i+1)-st pointer to a node at the next level, and repeat; if K is smaller than every key in the node, follow the first pointer
• when a leaf node is reached, find the key with value K and follow the associated pointer to the data record

Range Queries with B-Trees
Range query: a query in which a range of values is sought.
Examples:
SELECT * FROM R WHERE R.k > 40;
SELECT * FROM R WHERE R.k >= 10 AND R.k <= 25;
To find all keys in the range [a,b]:
• Do a lookup on a: this leads to the leaf where a could be
• Search the leaf for all keys ≥ a
• If we find a key > b, we are done
• Else follow the next-leaf pointer and continue searching in the next leaf
• Continue until finding a key > b or running out of leaves

Efficiency of B-Trees
B-trees allow lookup, insertion and deletion of records with very few disk I/Os.
The number of disk I/Os is the number of levels in the B-tree plus the cost of any reorganization.
If n is at least 10, then splitting/merging blocks will be rare and usually limited to the leaves.
For typical sizes of keys, pointers, blocks and files, 3 levels suffice (see next slide).
Also, the root block of the B-tree can be kept in memory.

Size of B-Tree
Assume
• 4096 bytes per block
• 4 bytes per key (e.g., integer)
• 8 bytes per pointer
• no header info in the block
Then n = 340 (can keep n keys and n+1 pointers in a block).
Assume that on average a block has 255 pointers.
Count:
• one node at level 1 (the root)
• 255 nodes at level 2
• 255*255 = 65,025 nodes at level 3 (leaves)
• each leaf has 255 pointers, so the total number of records is more than 16 million
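The lookup and range-query procedures from the earlier slides can be sketched in a few lines. This is an illustrative in-memory model, not how a disk-based B-tree is stored: the class names `Leaf` and `Internal`, the helper `find_leaf`, and the `"rec..."` strings standing in for data record pointers are all assumptions made for the example. The tree built at the bottom is the n = 3 example tree from the slides.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Leaf:
    keys: List[int]
    records: List[str]                  # stand-ins for data record pointers
    next_leaf: Optional["Leaf"] = None  # "next leaf" pointer


@dataclass
class Internal:
    keys: List[int]
    children: list                      # len(children) == len(keys) + 1


def find_leaf(node, k):
    """Descend from the root to the leaf where key k would be."""
    while isinstance(node, Internal):
        # bisect_right counts the keys <= k. If Ki is the largest such key,
        # that count is i, which is the 0-based index of the (i+1)-st pointer.
        # A count of 0 means k is below every key: follow the first pointer.
        node = node.children[bisect_right(node.keys, k)]
    return node


def lookup(root, k):
    """Return the record for key k, or None if k is absent."""
    leaf = find_leaf(root, k)
    for key, rec in zip(leaf.keys, leaf.records):
        if key == k:
            return rec
    return None


def range_query(root, a, b):
    """Return the records with keys in [a, b], in key order."""
    leaf = find_leaf(root, a)
    out = []
    while leaf is not None:
        for key, rec in zip(leaf.keys, leaf.records):
            if key > b:
                return out              # past the range: done
            if key >= a:
                out.append(rec)
        leaf = leaf.next_leaf           # continue in the next leaf
    return out


def make_leaves(groups):
    """Build leaves from lists of keys and chain their next-leaf pointers."""
    leaves = [Leaf(keys=g, records=[f"rec{k}" for k in g]) for g in groups]
    for left, right in zip(leaves, leaves[1:]):
        left.next_leaf = right
    return leaves


leaves = make_leaves([[3, 5, 11], [30, 35], [100, 101, 110],
                      [120, 130], [150, 156, 179], [180, 200]])
root = Internal([100], [Internal([30], leaves[0:2]),
                        Internal([120, 150, 180], leaves[2:6])])

print(lookup(root, 156))            # rec156
print(lookup(root, 7))              # None
print(range_query(root, 30, 120))
```

The range query touches one root-to-leaf path and then only follows next-leaf pointers, which is why the leaves are chained in key order.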