Ch5: Index Structures for Files

Learning Goals:
* File organizations with additional access methods
* Single-level indexes: Primary, Secondary and Clustering
* Multi-level dynamic indexes: B-trees, B+trees (properties, algorithms & data structures)
* Strategies for point queries: scan, index-search
* Strategies for range queries: scan, index-scan
* Using I/O cost models

Summary of Ch#4: Record Storage and File Organization
Storage Hierarchy, Magnetic Disks
Records and Files
* Mapping them to disk blocks
* File operations
Organizing records in a file: unordered, ordered, hashed
* Different cost of Find(), FindNext()
* Additional data structures used in dynamic/extendible hashing
Index = additional data structure to speed up search
* Idea similar to card catalogs (library), indexes in books
* Fast search of index via ordering index entries
  - or multi-level index (search trees)
* Focus of Ch#5

5.1 Index: Basic Concepts and Classification
What is an index? f : Indexing Field Value --> disk page(s)
* Collection of <K(i), P(i)> (1NF)
* Distinct from datafile, though shares the indexing field
* Search procedure: (a) search index : K(i) --> set of P(i)s
  - (b) search contents of the pages P(i)s in main memory
* Search time w/ index: log(bi) + |P(i)s| I/Os, bi = blocks in index (log base 2, binary search)
* Ex. Index (pp. 849)
3 Types of Single-Level Indexes
1. Primary Index (Fig. 5.1, pp 105)
* Datafile ordered by indexing field = primary key
* Nondense, i.e. #index-entries < #data records
* One index entry per disk page of datafile
* Takes less space, faster search but slow insert
  - Insert needs shifting, update of anchor records

Single Level Indexes (Contd.)
2. Clustered Index (Fig. 5.2, pp 108)
* Datafile ordered by indexing field
* Indexing field value NOT UNIQUE for each record
* One entry per distinct value of indexing field
* Q? Is it a dense index?
* Q? Is insert() easier than with a primary index?
  - need shifting or free space per block (Fig. 5.3, pp 109)
3. Secondary Index (Fig. 5.4, pp 110)
* Datafile may not be ordered by indexing field
* Dense index, i.e. an index entry per data record
* Indexing field may be a key or a non-key field
* Q? How to handle duplicate values of indexing field?
  - dense index, variable length index record, or
  - extra level of indirection (see Fig. 5.5)

Single Level Indexes: Exercises
Q? How many indexes of each type can be defined on a datafile?
* Heap, Hashed: Many secondary indexes
* Sorted file: One primary/clustered, many secondary indexes
Fully inverted file = one secondary index per field
Q? Which index provides most gain in search time?
* Secondary: O(bd/2 - log(bi)); Primary: O(log(bd) - log(bi))
Cost Model Ex. 1 & 2 (pp 104, 106): r = 30,000;
  - B = 1 Kbyte; R = 100 bytes; V = 9 bytes; Pointer = 6 bytes
  - Compute I/O cost savings from a Primary index.
  - Compute I/O cost savings from a Secondary index (unordered datafile).
  - Hints: calculate bfr(d), bd, #index entries, bfr(i), bi
Q? What is the cost of FindNext() for different indexes?
* O(1) for all, since index is sorted.
Table 5.1-2 (pp 112-3), give row/col. headers, ask for entries.

Multilevel Indexes (Fig. 5.6, pp 115)
Concept: Speed up Step (a), i.e. index search
* By adding a 2nd level index to 1st level index file
* How many levels can a multilevel index have?
  - about log(bi, base = fo), where fo = fanout = bfr(i)
* Search procedure: level traversal (Alg. 5.1, pp 114)
* Search time: log(bi, base = fo) vs. log(bi) vs. log(bd) vs. bd/2
Commercial ISAM = 2 level primary index, ordered file
Ex. Compute #levels and search time for multilevel index to
  - unordered datafile, dense secondary index with r = 30,000;
  - B = 1 Kbyte; R = 100 bytes; V = 9 bytes; Pointer = 6 bytes
Q? What kind of index is 2nd level index?
  - Primary, Clustered or Secondary?
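
The cost-model hints above can be followed mechanically. The sketch below is one way to do the arithmetic in Python, assuming B = 1 Kbyte means 1024 bytes, unspanned records, binary search on any sorted file, and one extra I/O to fetch the data block. The function and dictionary names (`single_level_costs`, `multilevel_levels`, and so on) are illustrative and do not come from the textbook.

```python
import math

def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)

def single_level_costs(r, B, R, V, P):
    """I/O costs (block accesses) for the single-level index exercises."""
    bfr_d = B // R                    # data records per block
    bd = ceil_div(r, bfr_d)           # blocks in datafile
    bfr_i = B // (V + P)              # index entries per block (= fanout fo)
    bi_primary = ceil_div(bd, bfr_i)  # primary: one entry per data block
    bi_secondary = ceil_div(r, bfr_i) # dense secondary: one entry per record
    return {
        "bfr_d": bfr_d, "bd": bd, "bfr_i": bfr_i,
        "bi_primary": bi_primary, "bi_secondary": bi_secondary,
        # binary search on the ordered datafile, no index
        "ordered_no_index": math.ceil(math.log2(bd)),
        # average linear scan on the unordered datafile, no index
        "unordered_no_index": bd // 2,
        # binary search on the index + 1 data block access
        "with_primary": math.ceil(math.log2(bi_primary)) + 1,
        "with_secondary": math.ceil(math.log2(bi_secondary)) + 1,
    }

def multilevel_levels(first_level_blocks, fo):
    """Number of index levels, counting the first level, until the top fits in one block."""
    levels, blocks = 1, first_level_blocks
    while blocks > 1:
        blocks = ceil_div(blocks, fo)
        levels += 1
    return levels

costs = single_level_costs(r=30_000, B=1024, R=100, V=9, P=6)
levels = multilevel_levels(costs["bi_secondary"], costs["bfr_i"])
print(costs["ordered_no_index"], costs["with_primary"])      # 12 7
print(costs["unordered_no_index"], costs["with_secondary"])  # 1500 10
print(levels, levels + 1)  # 3 levels, 4 block accesses per search
```

Under these assumptions the level count includes the first-level index, so a search costs one I/O per level plus one for the data block.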
Summary: Faster Search, But Slower Insert(), Delete()
* To reduce shifting => leave free space in blocks OR
  - use overflow blocks

5.3 Dynamic Multilevel Index: B-tree, B+tree
Definitions: Tree, nodes, root, leaves, internal nodes, level (node)
  - parent-child relationship, descendants, subtree(node)
  - Fig. 5.7 (pp 116)
Search Trees of order p (see Fig. 5.9, pp 118)
* Variation of multilevel indexes
* Node has at most (p-1) search values and p pointers in order
  <P1, K1, P2, K2, ..., K(q-1), Pq>, q <= p
  where Pi = pointer to child / null; Ki = key value
  For simplicity, assume search values to be unique
* Properties (see Fig. 5.8, pp 117)
  Given key value can occur in a unique child(node)
  1. Within each node, K1 < K2 < ... < K(q-1)
  2. For all X in subtree(Pi): K(i-1) < X < Ki for 1 < i < q;
     X < Ki for i = 1; and K(i-1) < X for i = q

B-tree, B+tree (Fig. 5.10, 5.11 on pp 119, 122)
B+tree used in most commercial databases
B-tree = Search tree + balanced + free space management
* Node = {P1, <K1, Pr1>, ..., P(q-1), <K(q-1), Pr(q-1)>, Pq}
  - Pr(i) = data pointer, Pi = tree pointer or data pointer
* Non-root nodes have ceil(p/2) <= q <= p (avg. 69% full)
* Root has >= 2 pointers, unless it has no children.
* Leaf nodes have Pi = Null.
B+tree is like B-tree, except for
* Leaf and non-leaf nodes have different formats (higher fan-out):
  - Non-leaf nodes do not have Pr(i), Pi is always a tree pointer
  - Leaf nodes do not have Pi (which were Null anyway!)
* Leaf nodes are threaded (for sequential access)
Q? Is B+tree an index file distinct from datafile?

Comparison + Search() on B+trees
Q? Compare B-tree and B+tree on fanout,
  - Avg. #nodes, #entries, #pointers @ levels root, 1, 2, 3 for
  - V = 9 bytes, P = 6 bytes, B = 512 bytes, 69% full
* fo (17,23), Level 2 (4912,11638), Level 3 (83520,267674)
Search(key = k, Page), called with root-page (Alg. 5.2, pp 124)
  binary search on page within main memory
  Case K = Ki in the page => Follow Pr(i)
  Case K in subtree(notNull Pi) => Search(K, page(Pi))
  Case K in subtree(Null Pi) => not in tree
* Time = depth of B+tree = log(r, base = fo), i.e. 2 to 4
Ex. List strategies for point and range search using B+tree
  - as primary (secondary) index.
* Primary: Find first record, scan datafile from there
* Secondary: Find first index-leaf, scan index-leaves

Insert(), Delete() with B+trees
Complex recursive algorithms, better use intuition!
Insert(key = k, ...), see Alg. 5.3 (pp 125)
* Produce a legal B+tree compatible with following steps:
* 1. Search(key = k, ...) to locate right Leaf
* 2. Case Leaf has space => insert in leaf
* 2. Case Leaf is full => insert; split leaf into 2 pages
* 3. Parent(Leaf) is updated to add pointer to new leaf
  - overflow => recursion of {insert, split} up the levels
Ex. Fig. 5.12 (pp 127): Insert 8, 5, 1, 7, 3, 12, 9, 6
  - insert(1) => splits root
  - insert(12), insert(6) => split propagates
Ex. Insert into empty B+tree (order 3): 1, 2, 3, 4, 5, 6, 7, 8
Ex. 5.11 (pp 133)
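
The insert steps above (locate leaf, insert, split on overflow, propagate to the parent) can be tried out with a small in-memory sketch. This is not Alg. 5.3 itself. It assumes keys only (no data pointers), a leaf capacity of order - 1 keys, that the left half keeps the extra key on a leaf split, and that keys <= a separator go to the left child. The names `Node`, `BPlusTree` and `leaves` are hypothetical.

```python
import bisect

class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.keys = []
        self.children = []  # tree pointers (internal nodes only)
        self.next = None    # thread to next leaf (leaf nodes only)

class BPlusTree:
    def __init__(self, order=3):
        self.p = order           # max tree pointers per internal node
        self.p_leaf = order - 1  # max keys per leaf (assumption)
        self.root = Node(leaf=True)

    def search(self, key):
        node = self.root
        while not node.leaf:
            # keys <= K(i) live in child i
            node = node.children[bisect.bisect_left(node.keys, key)]
        return key in node.keys

    def insert(self, key):
        split = self._insert(self.root, key)
        if split is not None:    # root was split: tree grows one level
            sep, right = split
            new_root = Node(leaf=False)
            new_root.keys = [sep]
            new_root.children = [self.root, right]
            self.root = new_root

    def _insert(self, node, key):
        """Insert key below node; return (separator, new_right_node) on split."""
        if node.leaf:
            bisect.insort(node.keys, key)
            if len(node.keys) <= self.p_leaf:
                return None
            mid = (len(node.keys) + 1) // 2  # left leaf keeps the larger half
            right = Node(leaf=True)
            right.keys = node.keys[mid:]
            node.keys = node.keys[:mid]
            right.next, node.next = node.next, right
            return node.keys[-1], right      # copy largest left key up
        i = bisect.bisect_left(node.keys, key)
        split = self._insert(node.children[i], key)
        if split is None:
            return None
        sep, right = split
        node.keys.insert(i, sep)
        node.children.insert(i + 1, right)
        if len(node.children) <= self.p:
            return None
        mid = len(node.keys) // 2            # middle key moves up, not copied
        up = node.keys[mid]
        new = Node(leaf=False)
        new.keys = node.keys[mid + 1:]
        new.children = node.children[mid + 1:]
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return up, new

    def leaves(self):
        """Key lists of all leaves, left to right, via the leaf threads."""
        node = self.root
        while not node.leaf:
            node = node.children[0]
        out = []
        while node is not None:
            out.append(list(node.keys))
            node = node.next
        return out

tree = BPlusTree(order=3)
for k in [1, 2, 3, 4, 5, 6, 7, 8]:
    tree.insert(k)
print(tree.root.keys)  # [4]
print(tree.leaves())   # [[1, 2], [3, 4], [5, 6], [7, 8]]
```

Feeding the same class the sequence 8, 5, 1, 7, 3, 12, 9, 6 is a quick way to check a hand-drawn answer. Other split conventions give differently shaped but equally legal trees.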