Index tuning: B+ tree overview
• Overview of tree-structured indexes
• Indexed sequential access method (ISAM)
• B+ tree

Data Structures
• Most index data structures can be viewed as trees.
• In general, the root of the tree is kept in main memory, while the leaves are located on disk.
  – The performance of a data structure depends on the number of nodes on the average path from the root to a leaf.
  – Data structures with high fan-out (maximum number of children of an internal node) are thus preferred.

Tree-Structured Indexing
• Tree-structured indexing techniques support both range searches and equality searches.
  – An index file built over the sorted data pages may still be quite large, but we can apply the idea repeatedly and build an index on the index file itself.
  – ISAM: static structure; B+ tree: dynamic, adjusts gracefully under inserts and deletes.

Indexed sequential access method (ISAM)
• Data entries vs. index entries
  – Both belong to the index file.
  – Data entries: <key, record or pointer to record/record list>
  – Index entries: <key, pointer to index entries or data entries>
• In both ISAM and the B+ tree, leaf pages contain data entries.

Example ISAM Tree
• Each node can hold 2 entries; no need for `next-leaf-page’ pointers. (Why?)

Comments on ISAM
• File creation: Leaf (data) pages allocated sequentially, sorted by search key; then index pages allocated, then space for overflow pages.
• Index entries: <search key value, page id>; they `direct’ the search for data entries, which are in leaf pages.
• Search: Start at root; use key comparisons to go to leaf.
  – Cost = log_F N, where F = # entries per index page and N = # leaf pages.
• Insert: Find the leaf the data entry belongs to, and put it there; if the leaf is full, the entry goes to an overflow page.
• Delete: Find and remove from leaf; if an overflow page becomes empty, de-allocate it.
• Static tree structure: inserts/deletes affect only leaf pages; the index pages never change.

After Inserting 23*, 48*, 41*, 42* ... Then Deleting 42*, 51*, 97*
• Note: If a primary leaf page becomes empty, just leave it empty so that the number of allocated primary pages does not change!
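The ISAM insert and delete rules above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the class name `ISAM`, its methods, and the initial key set are assumptions made for the example (the slide's figure did not survive extraction), and the multi-level index is flattened into a single sorted list of separator keys.

```python
from bisect import bisect_right

class ISAM:
    """Toy ISAM file: static primary leaf pages plus per-leaf overflow chains."""

    def __init__(self, sorted_keys, page_size=2):
        self.page_size = page_size
        # Primary leaf pages are allocated once, in search-key order.
        self.primary = [list(sorted_keys[i:i + page_size])
                        for i in range(0, len(sorted_keys), page_size)]
        # Static index (flattened to one level): first key of each leaf
        # except the first. It is never modified after creation.
        self.separators = [page[0] for page in self.primary[1:]]
        # One overflow chain per primary page, modelled as a plain list.
        self.overflow = [[] for _ in self.primary]

    def _leaf(self, key):
        # Key comparisons direct the search to exactly one primary page.
        return bisect_right(self.separators, key)

    def search(self, key):
        i = self._leaf(key)
        return key in self.primary[i] or key in self.overflow[i]

    def insert(self, key):
        i = self._leaf(key)
        if len(self.primary[i]) < self.page_size:
            self.primary[i].append(key)
            self.primary[i].sort()
        else:
            self.overflow[i].append(key)  # primary page full: use overflow

    def delete(self, key):
        i = self._leaf(key)
        if key in self.overflow[i]:
            self.overflow[i].remove(key)
        elif key in self.primary[i]:
            # The primary page may become empty; it is kept allocated.
            self.primary[i].remove(key)

# Assumed initial keys, then the inserts and deletes named on the slide.
tree = ISAM([10, 15, 20, 27, 33, 37, 40, 46, 51, 55, 63, 97])
for k in (23, 48, 41, 42):
    tree.insert(k)
for k in (42, 51, 97):
    tree.delete(k)
```

In this sketch the separator list still contains 51 after 51* is deleted, which mirrors the point that index pages in ISAM are never modified.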
Features of ISAM
• The primary leaf pages are assumed to be allocated sequentially.
  – The number of such pages is known when the tree is created and does not change subsequently under inserts and deletes.
  – So next-page pointers are not needed.

Pros and cons of ISAM
• Cons: loss of sequentiality and balance
  – Due to long overflow chains
  – Leads to poor retrieval performance
  – One solution: keep 20% of each page free when the tree is created.
• Pros
  – No need to lock non-leaf index pages, since they are never modified; this is one of the important advantages of ISAM over the B+ tree.
  – Another: scans over a large range are more efficient than in a B+ tree.

B+ Tree
• A B+ tree is a balanced tree whose leaves contain a sequence of key-pointer pairs.
• Dynamic structure that adjusts gracefully under inserts and deletes.

B+ Tree: The Most Widely Used Index
• Operations (insertion, deletion) keep it height-balanced.
• Searching for a record requires a traversal from the root to a leaf.
  – Equality query, insert/delete at log_F N cost (F = fanout, N = # leaf pages).
• Grows and shrinks dynamically.
• Needs a `next-leaf-pointer’ to chain up the leaf nodes.
  – Leaves are split and merged dynamically, so they are not physically sequential as in ISAM.
• Minimum 50% occupancy (except for root).
  – Each node contains m entries, where d <= m <= 2d. The parameter d is called the order of the tree.
• Data entries at the leaf level are sorted.

Example B+ Tree
• Each node can hold 4 entries (order = 2).
  – Root: 17
  – Index nodes: (5, 13) and (24, 30)
  – Leaves: 2 3 | 5 7 8 | 14 16 | 19 20 22 | 24 27 29 | 33 34 38 39

Node structure
• Non-leaf nodes: P0 | K1 | P1 | K2 | P2 | ... | Km | Pm, where each pair <Ki, Pi> is an index entry.
• Leaf nodes: key-pointer entries, followed by a pointer to the next leaf node.

Searching in B+ Tree
• Search begins at root, and key comparisons direct it to a leaf (as in ISAM).
• Search for 5, 15, all data entries >= 24 ...
  – Root: 13, 17, 24, 30
  – Leaves: 2 3 5 | 14 16 | 19 20 22 | 24 27 29 | 33 34 38 39
• Based on the search for 15*, we know it is not in the tree: the search ends at the leaf holding 14 and 16, which does not contain 15.
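The search procedure above can be sketched as follows. This is an illustrative sketch under stated assumptions, not a full B+ tree: the names `Leaf`, `Inner`, `find_leaf`, `search`, and `range_from` are hypothetical, inserts and deletes are omitted, and the tree is built by hand from the keys on the Example B+ Tree slide as read from the extracted figure. A search key equal to a separator Ki is assumed to follow pointer Pi (the subtree to its right).

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys):
        self.keys = keys      # sorted data entries
        self.next = None      # next-leaf pointer

class Inner:
    def __init__(self, keys, children):
        self.keys = keys          # K1..Km
        self.children = children  # P0..Pm (one more pointer than keys)

def find_leaf(node, key):
    """Follow key comparisons from the root down to a single leaf."""
    while isinstance(node, Inner):
        node = node.children[bisect_right(node.keys, key)]
    return node

def search(root, key):
    """Equality search: one root-to-leaf traversal."""
    return key in find_leaf(root, key).keys

def range_from(root, low):
    """All data entries >= low: locate the first leaf, then follow the chain."""
    leaf = find_leaf(root, low)
    out = []
    while leaf is not None:
        out.extend(k for k in leaf.keys if k >= low)
        leaf = leaf.next
    return out

# Hand-built tree: root 17, index nodes (5, 13) and (24, 30).
leaves = [Leaf([2, 3]), Leaf([5, 7, 8]), Leaf([14, 16]),
          Leaf([19, 20, 22]), Leaf([24, 27, 29]), Leaf([33, 34, 38, 39])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Inner([17], [Inner([5, 13], leaves[0:3]),
                    Inner([24, 30], leaves[3:6])])

print(search(root, 5))       # True
print(search(root, 15))      # False: the search ends at leaf [14, 16]
print(range_from(root, 24))  # [24, 27, 29, 33, 34, 38, 39]
```

The range query shows why the leaf chain matters: after one descent to the leaf containing 24, the remaining entries are read by following next-leaf pointers without revisiting the index nodes.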