Ch5: Index Structures for Files

Learning Goals:
* File organizations with additional access methods
* Single level indexes
- Primary, Secondary and Clustering
* Multi-level dynamic indexes
- B-trees, B+trees: properties, algorithms & data-structures
* Strategies for point queries: scan, index-search
* Strategies for range queries: scan, index-scan
* Using I/O cost models
Summary of Ch#4. Record Storage and File Organization
Storage Hierarchy, Magnetic Disks
Records and Files
* Mapping them to disk blocks
* File operations
Organizing records in a file: unordered, ordered, hashed
* Different cost of Find(), FindNext()
* Additional data-structures used in dynamic/extendible Hashing
Index = additional data-structure to speed up search
* Idea similar to card-catalogs (library), indexes in books
* Fast search of index via ordering index entries
- or multi-level index (search trees)
* Focus of Ch#5
5.1 Index: Basic Concepts and Classification
What is an index? f : Indexing Field Value --> disk page(s)
* Collection of <K(i), P(i)> (1NF)
* Distinct from datafile, though shares the indexing field
* Search procedure: (a) search index : K(i) --> set of P(i)s
- (b) search contents of the Pages P(i)s in main memory
* Search time w/ index: log2(bi) + |P(i)s| I/Os, where bi = #blocks in index
* Ex. Index (pp. 849)
3 Types of Single-Level Indexes
1. Primary Index (Fig. 5.1, pp 105)
* Datafile ordered by indexing field = primary key
* nondense, i.e. #index-entries < #data records
* One index entry per disk page of datafile
* Takes less space, faster search but slow insert
- Insert needs shifting, update of anchor records
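A minimal sketch of a lookup through a nondense primary index, assuming a tiny in-memory stand-in for the datafile. The names `blocks`, `index_keys` and `find` are hypothetical, chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical datafile: disk blocks holding records sorted by primary key.
blocks = [
    [(5, "a"), (8, "b"), (12, "c")],
    [(15, "d"), (21, "e"), (30, "f")],
    [(34, "g"), (40, "h"), (47, "i")],
]

# Nondense primary index: one anchor key per block (the block's first key).
index_keys = [blk[0][0] for blk in blocks]


def find(key):
    """Return the record with this key, or None if it is absent."""
    # Step (a): binary search the index for the last anchor <= key.
    pos = bisect_right(index_keys, key) - 1
    if pos < 0:
        return None  # key is smaller than every anchor
    # Step (b): fetch that single block and search it in main memory.
    for k, payload in blocks[pos]:
        if k == key:
            return (k, payload)
    return None
```

Only one data block is read per lookup, which is why the index needs just one entry per block.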
Single Level Indexes (Contd.)
2. Clustered Index (Fig. 5.2, pp 108)
* Datafile ordered by indexing field
* Indexing field value NOT UNIQUE for each record
* One entry per distinct value of indexing field
* Q? Is it a dense index?
* Q? Are insert() easier than primary index?
- need shifting or free space per block (Fig. 5.3, pp 109)
3. Secondary Index (Fig. 5.4, pp 110)
* Datafile may not be ordered by indexing field
* Dense index, i.e. an index entry per data record
* Indexing field may be a key or a non-key field
* Q? How to handle duplicate values of indexing field?
- dense index, variable length index record, or
- extra level of indirection (see Fig. 5.5)
Single Level Indexes: Exercises
Q? How many indices of each type can be defined on a datafile?
* Heap, Hashed: Many secondary indexes
* Sortedfile: One primary/clustered, Many secondary indexes
Fully inverted file = one secondary index per field
Q? Which index provides most gain in search time ?
* Secondary: O(bd/2 - log(bi)); Primary: O(log(bd) - log(bi))
Cost Model Ex. 1 & 2 (pp 104, 106): r = 30,000;
- B = 1 Kbyte; R = 100 byte; V = 9 byte; Pointer = 6 bytes
- Compute I/O Cost Saving from Primary Index.
- Compute I/O cost savings from a Secondary index (unordered datafile).
- Hints: calculate bfr(d), bd, #index entries, bfr(i), bi
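The hints above can be worked through as a short calculation. This is a sketch under stated assumptions: unspanned fixed-length records, binary search over index blocks plus one access for the data block, and an average linear scan of bd/2 blocks for the unordered file. The variable names are illustrative only.

```python
from math import ceil, log2

r, B, R, V, P = 30_000, 1024, 100, 9, 6

bfr_d = B // R            # records per data block
b_d = ceil(r / bfr_d)     # data blocks
bfr_i = B // (V + P)      # index entries per index block


def index_cost(entries):
    """Return (index blocks, block accesses for one lookup via the index)."""
    b_i = ceil(entries / bfr_i)
    return b_i, ceil(log2(b_i)) + 1  # binary search on index + 1 data block


# Costs without an index.
ordered_no_index = ceil(log2(b_d))   # binary search on the ordered datafile
unordered_no_index = b_d // 2        # average linear scan

# Primary index: one entry per data block. Secondary: one entry per record.
primary_bi, primary_cost = index_cost(b_d)
secondary_bi, secondary_cost = index_cost(r)

print(ordered_no_index, primary_bi, primary_cost)        # 12 45 7
print(unordered_no_index, secondary_bi, secondary_cost)  # 1500 442 10
```

Under these assumptions the primary index saves 12 - 7 = 5 block accesses, while the secondary index saves 1500 - 10 = 1490.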
Q? What is the cost of FindNext() for different indexes?
* O(1) for all, since index is sorted.
Table 5.1-2 (pp 112-3), give row/col. headers, ask for entries.
Multilevel Indexes (Fig. 5.6, pp 115)
Concept: Speedup Step(a), i.e. index search
* By adding a 2nd level index to 1st level index file
* How many levels can a multilevel index have?
- log(bi, base = fo), where fo = fanout = bfr(i)
* Search procedure: level traversal (Alg. 5.1, pp 114)
* Search time: log(bi, base = fo) vs. log(bi) vs. log(bd) vs. bd/2
Commercial ISAM = 2 level primary index, ordered file
Ex. Compute #levels and search time for multilevel index to
- unordered datafile, dense secondary index with r = 30,000;
- B= 1 Kbyte; R= 100 byte; V= 9 byte; Pointer= 6 bytes
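One way to work this exercise is to count index blocks level by level until a single block remains. The sketch assumes every level has the same fan-out fo = bfr(i) and that a lookup reads one block per index level plus one data block. `index_levels` is a hypothetical helper name.

```python
from math import ceil


def index_levels(first_level_entries, fan_out):
    """Return the number of blocks at each index level, bottom level first."""
    blocks = ceil(first_level_entries / fan_out)
    levels = [blocks]
    while blocks > 1:
        # Each higher level holds one entry per block of the level below.
        blocks = ceil(blocks / fan_out)
        levels.append(blocks)
    return levels


fo = 1024 // (9 + 6)               # fan-out = bfr(i)
levels = index_levels(30_000, fo)  # dense index: one entry per record
search_cost = len(levels) + 1      # one block per level + the data block

print(fo, levels, search_cost)     # 68 [442, 7, 1] 4
```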
Q? What kind of index is 2nd level index?
- Primary, Clustered or Secondary?
Summary: Faster Search, But Slower Insert(), Delete()
* To reduce shifting => leave freespace in blocks OR
- use overflow blocks
5.3 Dynamic Multilevel Index: B-tree, B+tree
Definitions: Tree, nodes, root, leafs, internal nodes, level (node)
- parent-child relationship, descendants, subtree(node)
- Fig. 5.7 (pp 116)
Search Trees of order p (see Fig. 5.9/ pp 118)
* variation of multilevel indexes
* Node has up to (p-1) search values and p pointers
- in order <P1, K1, P2, K2, ..., K(q-1), Pq>, q <= p
- where Pi = pointer to child / null; Ki = key value
- For simplicity, assume search values to be unique
* Properties (see Fig. 5.8, pp 117)
- Given key value can occur in a unique child(node)
- 1. Within each node, K1 < K2 < ... < K(q-1)
- 2. all X in subtree(Pi); K(i-1) < X < Ki for 1 < i < q
-    X < Ki for i = 1; and K(i-1) < X for i = q
B-tree, B+tree (Fig. 5.10, 5.11 on pp 119, 122)
B+tree used in most commercial databases
B-tree = Search tree + balanced + free space management
* Node = {P1, <K1, Pr1>, ..., P(q-1), <K(q-1), Pr(q-1)>, Pq}
- Pr(i) = data pointer, Pi = tree pointer or data pointer
* Non-root nodes have ceil(p/2) <= q <= p (avg. 69% full)
* Root has >= 2 pointers, unless it has no children.
* Leaf nodes have Pi = Null.
B+tree is like B-tree, except for
* Leaf and non-leafs have different formats (Higher fan-out):
- Non-leaf nodes do not have Pr(i), Pi is always tree pointer
- Leaf nodes do not have Pi (which were Null anyway!)
* Leaf nodes are threaded (for sequential access)
Q? Is B+tree an index file distinct from datafile?
Comparison + Search() on B+trees
Q? Compare B-tree and B+-tree on fanout,
- Avg. #nodes, #entries, #pointers @ levels root, 1, 2, 3 for
- V= 9 bytes, P= 6 bytes, B= 512 bytes, 69% full
* fo (17,23), Level 2 (4912,11638), Level 3 (83520,267674)
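A sketch of one calculation that reproduces the figures quoted above. It rests on assumptions not stated on the slide: data pointers are the same size as tree pointers (6 bytes), nodes are 69% full on average, the B-tree counts are cumulative over all levels up to the given one, and the B+tree counts are entries at that level alone. The function names are hypothetical.

```python
B, V, P = 512, 9, 6
FILL = 69  # assumed average node occupancy, in percent

# Largest order p that fits in one block:
#   B-tree node:           p*P + (p-1)*(V + P) <= B
#   B+tree non-leaf node:  p*P + (p-1)*V       <= B
btree_p = (B + V + P) // (P + V + P)
bplus_p = (B + V) // (P + V)

# Average fan-out at the assumed occupancy (rounded down).
btree_fo = FILL * btree_p // 100
bplus_fo = FILL * bplus_p // 100


def btree_entries_through(level):
    """Entries in all B-tree nodes from the root (level 0) down to `level`."""
    nodes = sum(btree_fo ** i for i in range(level + 1))
    return nodes * (btree_fo - 1)


def bplus_entries_at(level):
    """Entries held by the B+tree nodes at `level` alone."""
    return bplus_fo ** level * (bplus_fo - 1)


print(btree_p, bplus_p, btree_fo, bplus_fo)            # 25 34 17 23
print(btree_entries_through(2), bplus_entries_at(2))   # 4912 11638
print(btree_entries_through(3), bplus_entries_at(3))   # 83520 267674
```

The difference in counting reflects where data entries live: a B-tree stores them in every node, a B+tree only in its leaves.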
Search(key = K, Page), called with root-page (Alg. 5.2/pp 124)
* binary search on page within main memory
- Case K = Ki in the page => Follow Pr(i)
- Case K in subtree(notNull Pi) => Search(K, page(Pi))
- Case K in subtree(Null Pi) => not in tree
* Time = depth of B+tree = log(r, base = fo), i.e. 2 to 4
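The three cases of Search() can be sketched as follows. The sketch assumes the B-tree node format, where every node carries a data pointer Pr(i) next to each key, and uses plain strings as stand-ins for data pointers. The `Node` class is a hypothetical in-memory layout, not the textbook's page format.

```python
from bisect import bisect_left


class Node:
    def __init__(self, keys, data_ptrs, children=None):
        self.keys = keys            # K1 < K2 < ... < K(q-1)
        self.data_ptrs = data_ptrs  # Pr(i), one per key
        self.children = children    # P1..Pq, or None when all Pi are Null


def search(node, k):
    """Return the data pointer stored for key k, or None if k is absent."""
    while node is not None:
        i = bisect_left(node.keys, k)  # binary search within the page
        if i < len(node.keys) and node.keys[i] == k:
            return node.data_ptrs[i]   # Case K = Ki: follow Pr(i)
        if node.children is None:
            return None                # Case Pi is Null: not in tree
        node = node.children[i]        # Case K in subtree(Pi): descend
    return None


# Small hand-built tree: one root with three leaf children.
leaf1 = Node([1, 3], ["r1", "r3"])
leaf2 = Node([6, 7], ["r6", "r7"])
leaf3 = Node([9, 12], ["r9", "r12"])
root = Node([5, 8], ["r5", "r8"], [leaf1, leaf2, leaf3])

print(search(root, 8), search(root, 7), search(root, 4))  # r8 r7 None
```

Each loop iteration corresponds to one page read, so the number of I/Os is bounded by the depth of the tree.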
Ex. List strategies for point and range search using B+tree
- as primary (secondary) index.
* Primary: Find first record, scan datafile from there
* Secondary: Find first index-leaf, scan index-leafs
Insert(), Delete() with B+trees
Complex recursive algorithms, Better use intuition!
Insert(key = k, ...), see Alg. 5.3 (pp 125)
* Produce a legal B+tree compatible with following steps:
* 1. Search(key = k, ...) to locate right Leaf
* 2. Case Leaf has space => insert in leaf
* 2. Case Leaf is full => insert; split leaf into 2 pages
* 3. Parent(Leaf) is updated to add pointer to new leaf
- overflow => recursion of {insert, split} up the levels
Ex. Fig. 5.12 (pp 127): Insert 8, 5, 1, 7, 3, 12, 9, 6
- insert(1) => splits root
- insert(12) , insert(6) => split propagates
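The insert steps can be sketched in a few dozen lines. This is a simplified illustration, not Alg. 5.3 itself, and it rests on several assumptions: internal order 3, leaf capacity 2, unique keys, no data pointers, a leaf split keeps the larger half on the left and copies its largest key up, and an internal split moves the middle key up. The class and method names are hypothetical.

```python
from bisect import bisect_left, insort


class Node:
    def __init__(self, keys, children=None):
        self.keys = keys          # sorted search values
        self.children = children  # child nodes; None marks a leaf
        self.next = None          # leaf threading for sequential access


class BPlusTree:
    def __init__(self, order=3, leaf_capacity=2):
        self.max_keys = order - 1  # keys per internal node
        self.leaf_capacity = leaf_capacity
        self.root = Node([])

    def insert(self, key):
        split = self._insert(self.root, key)
        if split:  # the root itself split: the tree grows one level
            sep, right = split
            self.root = Node([sep], [self.root, right])

    def _insert(self, node, key):
        """Insert below node; return (separator, new right node) on a split."""
        if node.children is None:                # steps 1-2: reached the leaf
            insort(node.keys, key)
            if len(node.keys) <= self.leaf_capacity:
                return None
            mid = (len(node.keys) + 1) // 2      # left leaf keeps larger half
            right = Node(node.keys[mid:])
            node.keys = node.keys[:mid]
            right.next, node.next = node.next, right
            return node.keys[-1], right          # copy up largest left key
        i = bisect_left(node.keys, key)          # subtree i holds keys <= Ki
        split = self._insert(node.children[i], key)
        if split is None:
            return None
        sep, right = split                       # step 3: update the parent
        node.keys.insert(i, sep)
        node.children.insert(i + 1, right)
        if len(node.keys) <= self.max_keys:
            return None
        mid = len(node.keys) // 2                # overflow: middle key moves up
        up = node.keys[mid]
        new_right = Node(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
        return up, new_right

    def leaves(self):
        """Walk the threaded leaf level from left to right."""
        node = self.root
        while node.children is not None:
            node = node.children[0]
        out = []
        while node is not None:
            out.append(node.keys)
            node = node.next
        return out


tree = BPlusTree()
for k in [8, 5, 1, 7, 3, 12, 9, 6]:
    tree.insert(k)
print(tree.root.keys, tree.leaves())  # [5] [[1, 3], [5], [6, 7], [8], [9, 12]]
```

Other textbooks copy up the smallest key of the right leaf instead. Either convention yields a legal B+tree as long as search uses the matching comparison.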
Ex. Insert into empty B+tree(order 3): 1, 2, 3, 4, 5, 6, 7, 8
Ex. 5.11 (pp 133)