B + Tree

advertisement
Index tuning-B+tree
overview
• Overview of tree-structured index
• Indexed sequential access method (ISAM)
• B+tree
Data Structures
• Most index data structures can be viewed as
trees.
• In general, the root of this tree will always be in
main memory, while the leaves will be located
on disk.
– The performance of a data structure depends on the
number of nodes in the average path from the root to
the leaf.
– Data structure with high fan-out (maximum number of
children of an internal node) are thus preferred.
Tree-Structured Indexing
• Tree-structured indexing techniques support both
range searches and equality searches
– index file may still be quite large. But we can
apply the idea repeatedly!
– ISAM: static structure; B+ tree: dynamic, adjusts
gracefully under inserts and deletes.
Data
pages
Indexed sequential access method
(ISAM)
• Data entries vs index entries
– Both belong to index file
– Data entries: <key, records or pointer to records/record list>
– Index entries:<key, pointer to index entries or data entries>
• Both in ISAM and B-tree, leaf pages are data entries
Example ISAM Tree
• Each node can hold 2 entries; no need for `nextleaf-page’ pointers. (Why?)
Comments on ISAM
• File creation: Leaf (data) pages allocated
sequentially, sorted by search key; then index
pages allocated, then space for overflow pages.
• Index entries: <search key value, page id>;
they `direct’ search for data entries, which are
in leaf pages.
• Search: Start at root; use key comparisons to
go to leaf.
• Cost=logFN ; F = # entries/index pg, N = # leaf
pgs
• Insert: Find leaf data entry belongs to, and put it
there.
• Delete: Find and remove from leaf; if empty
overflow page, de-allocate.
• * Static tree structure: inserts/deletes affect
only primary leaf pages.
After Inserting 23*, 48*, 41*, 42*
... Then Deleting 42*, 51*, 97*
* Note
If primary leaf page is empty, just leave it empty so that
the number of allocation primary pages does not change!
Features of ISAM
• The primary leaf pages are assumed to be
allocated sequentially
– Because the number of such pages is known
when the tree is created and does not change
subsequently under inserts and deletes
– So next-page-pointers are not needed
Prons and cons of ISAM
• Cons:losing sequentiality and balance
– Due to long overflow chains
– Leading to poor retrieving performance
– One solution
• Keep 20% of each page free when tree was created
• Prons
– No need to lock non-leaf index pages since we never
modify them, which is one of important advantages of
ISAM over B+tree
– Another is: scans over a large range is more efficient
thant B+tree
B+-Tree
• A B+-Tree is a balanced tree whose leaves contain a
sequence of key-pointer pairs.
• Dynamic structure that adjust gracefully for delete and
insert
+
B
Tree: The Most Widely Used Index
– Operation (insertion, deletion) keep it Heightbalanced.
– Searching for a record requires a traversal from
root to the leaf
• Equality query, Insert/delete at log F N cost (F = fanout,
N = # leaf pages);
– Grow and shrink dynamically.
• Need `next-leaf-pointer’ to chain up the leaf nodes
– To handle cases such as leaf node merging
– Minimum 50% occupancy (except for root).
• Each node contains d <= m <= 2d entries. The
parameter d is called the order of the tree.
– Data entries at leaf are sorted.
Example B+ Tree
• Each node can hold 4 entries (order = 2)
Root
17
5
2
3
24
13
5
7
8
14 16
19 20
22
30
24 27 29
33 34 38 39
Node structure
• Non-leaf nodes
index entry
P
0
K
1
P
1
K 2
P
K 2
P
2
K m
Pm
2
K m
Pm
Leaf nodes
P
0
K
1
P
1
Next leaf
node
Searching in B+ Tree
• Search begins at root, and key comparisons
direct it to a leaf (as in ISAM).
• Search for 5, 15, all data entries >= 24 ...
Root
13
2
3
5
14
16
17
19
20
24
22
30
24
27
29
33
34
38
39
 Based on the search for 15*, we know it is not in the tree!
summarize
Download