Search trees: 2-3-4 tree, B+ tree
Marko Berezovský, Radek Mařík
PAL 2012

To read
- Robert Sedgewick: Algorithms in C++, Parts 1–4: Fundamentals, Data Structures, Sorting, Searching, Third Edition, Addison Wesley Professional, 1998
- http://www.cs.helsinki.fi/u/mluukkai/tirak2010/B-tree.pdf
- (CLRS) Cormen, Leiserson, Rivest, Stein: Introduction to Algorithms, 3rd ed., MIT Press, 2009
See the PAL webpage for references.

Pokročilá Algoritmizace, A4M33PAL, ZS 2012/2013, FEL ČVUT, 12/14

2-3-4 tree: Properties

A 2-3-4 search tree is either empty or it contains three types of nodes:
-- 2-node, with one key, a left link to a tree with smaller keys, and a right link to a tree with larger keys;
-- 3-node, with two keys, a left link to a tree with smaller keys, a middle link to a tree with key values between the node's keys, and a right link to a tree with larger keys;
-- 4-node, with three keys and four links to trees with key values defined by the ranges subtended by the node's keys.
AND: All links to empty trees, i.e. all leaves, are at the same distance from the root, thus the tree is perfectly balanced.
A 2-3-4 search tree is structurally a B-tree of order 4.

2-3-4 tree: Example

[Figure: an example 2-3-4 tree.]
Note the 2-nodes, 3-nodes, and a 4-node, and the same depth of all leaves.

2-3-4 tree: Operations

Find: As in a B-tree.
Insert: As in a B-tree: find the place for the inserted key x in a leaf and store it there. If necessary, split the leaf.
Additional insert rule: On our way down the tree, whenever we reach a 4-node, we split it into two 2-nodes and move the middle key up to the parent node.
This strategy prevents the following from happening: after inserting a key into a B-tree, it may be necessary to split all the nodes on the path from the inserted key back to the root.
Such an outcome is considered time-consuming. Because every 4-node met on the way down is split, the parent of a node being split is never a 4-node, so it always has room for the key moved up. 4-nodes therefore occur only sparsely in the tree, and we never have to split nodes recursively bottom-up.
Delete: As in a B-tree.

2-3-4 tree: Insert, splitting strategy I

[Figure: splitting a 4-node with keys A, B, C and subtrees a, b, c, d. Case "Root": B becomes the new root, with children A (subtrees a, b) and C (subtrees c, d). Case "Not root", parent is a 2-node X: B moves up into the parent, which becomes the 3-node B X or X B, and A and C become its children. Legend: changed nodes; subtrees a, b, c, d are any nodes, incl. empty.]
Note that splitting changes the height of the 2-3-4 tree only when the root is split.

2-3-4 tree: Insert, splitting strategy II

[Figure: splitting a 4-node with keys A, B, C whose parent is a 3-node X Y. B moves up into the parent, which becomes the 4-node B X Y, X B Y, or X Y B depending on the position of the split child; A and C become its children. Legend: changed nodes; subtrees are any nodes, incl. empty.]
Note that splitting changes the height of the 2-3-4 tree only when the root is split.

2-3-4 tree: Insert example I

Insert keys A S E R C H I N G X into an initially empty 2-3-4 tree:
Insert A:  root A
Insert S:  root A S
Insert E:  root A E S
Insert R:  root E;  leaves A | R S
Insert C:  root E;  leaves A C | R S
Insert H:  root E;  leaves A C | H R S

2-3-4 tree: Insert example II

Insert I:  root E R;  leaves A C | H I | S
Insert N:  root E R;  leaves A C | H I N | S
Insert G:  root E I R;  leaves A C | G H | N | S
Insert X:  root I;  children E | R;  leaves A C | G H | N | S X
Note the seemingly unnecessary split of the E I R 4-node during the insert of X.
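The top-down insert rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the lecture's reference implementation: the names Node234, split_child and insert are made up for this sketch, keys are assumed to be distinct, and records attached to keys are omitted.

```python
class Node234:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []            # 1 to 3 sorted keys (empty only in an empty tree)
        self.children = children or []    # empty list for a leaf

    @property
    def is_leaf(self):
        return not self.children


def split_child(parent, i):
    """Split the 4-node parent.children[i]; its middle key moves up."""
    node = parent.children[i]
    left = Node234(node.keys[:1], node.children[:2])
    right = Node234(node.keys[2:], node.children[2:])
    parent.keys.insert(i, node.keys[1])
    parent.children[i:i + 1] = [left, right]


def insert(root, key):
    """Insert key top-down and return the (possibly new) root."""
    if len(root.keys) == 3:               # full root: split it, height grows
        root = Node234(children=[root])
        split_child(root, 0)
    node = root
    while not node.is_leaf:
        i = sum(k < key for k in node.keys)   # index of the child to follow
        if len(node.children[i].keys) == 3:   # 4-node ahead: split it first
            split_child(node, i)
            if key > node.keys[i]:            # the middle key now sits in slot i
                i += 1
        node = node.children[i]
    node.keys.append(key)                     # the leaf is never a 4-node here
    node.keys.sort()
    return root


# Usage: the key sequence from the insert example.
root = Node234()
for key in "ASERCHINGX":
    root = insert(root, key)
```

Since a 4-node is split before it is entered, the leaf that finally receives the key always has a free slot, so no split has to be propagated back up.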
2-3-4 tree: Size example

Results of an experiment with N uniformly distributed random keys from the range {1, ..., 10^9} inserted into an initially empty 2-3-4 tree:

N            Tree depth   2-nodes     3-nodes     4-nodes
10           2            6           2           0
100          4            39          29          1
1 000        7            414         257         24
10 000       10           4 451       2 425       233
100 000      13           43 583      24 871      2 225
1 000 000    15           434 671     248 757     22 605
10 000 000   18           4 356 849   2 485 094   224 321

2-3-4 tree: Relation to R-B tree

Relation of a 2-3-4 tree to a red-black tree
[Figure: each type of 2-3-4 node and its red-black equivalent: a 2-node X; a 3-node X Y (two possible orientations); a 4-node X Y Z.]

[Figure: an original 2-3-4 tree and the equivalent r-b tree.]

B+ tree: Description

A B+ tree is analogous to a B-tree, namely in:
-- being perfectly balanced all the time,
-- nodes not being allowed to be less than half full,
-- operational complexity.
The differences are:
-- Records (or pointers to actual records) are stored only in the leaf nodes,
-- internal nodes store only search key values, which are used only as routers to guide the search.
The leaf nodes of a B+ tree are linked together to form a linked list. This is done so that the records can be retrieved sequentially without accessing the B+ tree index. This also supports fast processing of range-search queries.
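How the linked leaves support range search can be sketched as follows. The sketch is an assumption-laden illustration: the class names Leaf and Inner and the functions find_leaf and range_search are invented here, records are omitted (only keys are kept), and the small tree is built by hand with illustrative key values.

```python
from bisect import bisect_left, bisect_right


class Leaf:
    def __init__(self, keys):
        self.keys = keys      # sorted keys (records omitted in this sketch)
        self.next = None      # link to the next leaf to the right


class Inner:
    def __init__(self, routers, children):
        self.routers = routers      # routers only guide the search
        self.children = children    # len(children) == len(routers) + 1


def find_leaf(node, key):
    """Descend from the root to the leaf that may contain key."""
    while isinstance(node, Inner):
        # Keys equal to a router live in the subtree to its right.
        node = node.children[bisect_right(node.routers, key)]
    return node


def range_search(root, lo, hi):
    """Return all keys k with lo <= k <= hi by walking the leaf list."""
    result = []
    leaf = find_leaf(root, lo)
    while leaf is not None:
        for k in leaf.keys[bisect_left(leaf.keys, lo):]:
            if k > hi:
                return result
            result.append(k)
        leaf = leaf.next          # no further index access is needed
    return result


# Hand-built example tree (illustrative key values).
leaves = [Leaf(ks) for ks in ([5, 10, 15, 20], [28, 30], [50, 55],
                              [60, 65], [75, 80], [85, 90, 95])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Inner([60], [Inner([28, 50], leaves[:3]), Inner([75, 85], leaves[3:])])

found = range_search(root, 30, 76)
```

The index is used only once, to locate the leaf holding the lower bound; everything after that is a sequential walk along the leaf links.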
2-3-4 tree: Operations complexity

Complexities
Searches in an N-node 2-3-4 tree visit at most lg(N) + 1 nodes.
Insertion into an N-node 2-3-4 tree requires fewer than lg(N) + 1 node splits in the worst case, and seems to require less than one node split on average.
Precise analytic results on the average-case performance of 2-3-4 trees have so far eluded the experts**, but it is clear from empirical studies that very few splits are used to balance the trees. The worst case is lg(N), and that is not even approached in practical situations.
** Now, finally, there is some challenge for you!

B+ tree: Description/example

[Figure: example B+ tree]
Root (router):        60
Internal (routers):   28 50 | 75 85
Leaves (keys):        5 10 15 20 | 28 30 | 50 55 | 60 65 | 75 80 | 85 90 95
Figure labels: routers and keys; data records or pointers to them; leaves links.

Values in internal nodes are routers; originally each of them was a key when a record was inserted. Insert and Delete operations split and merge the nodes and thus move the keys and routers around. A router may remain in the tree even after the corresponding record and its key were deleted.
Values in the leaves are actual keys associated with the records and must be deleted when a record is deleted (their router copies may live on).

B+ tree: Insert I

Inserting key K (and its associated data record) into a B+ tree
Find, as in a B-tree, the correct leaf to insert K. Then there are 3 cases:

Case 1
Free slot in the leaf? YES.
Place the key and its associated record in the leaf.

Case 2
Free slot in the leaf? NO. Free slot in the parent node? YES.
1. Consider all keys in the leaf, including K, to be sorted.
2. Insert the middle (median) key M into the parent node in the appropriate slot Y. (If the parent does not exist, first create an empty one = new root.)
3. Split the leaf into two new leaves L1 and L2.
4. The left leaf (L1) from Y contains records with keys smaller than M.
5. The right leaf (L2) from Y contains records with keys equal to or greater than M.
Note: Splitting leaves and inner nodes works in the same way as in B-trees.

B+ tree: Insert II

Inserting key K (and its associated data record) into a B+ tree (continued)

Case 3
Free slot in the leaf? NO. Free slot in the parent node? NO.
1. Split the leaf into two leaves L1 and L2; consider all its keys including K sorted, and denote by M the median of these keys.
2. Records with keys < M go to the left leaf L1.
3. Records with keys >= M go to the right leaf L2.
4. Split the parent node P into nodes P1 and P2; consider all its keys including M sorted, and denote by M1 the median of these keys.
5. Keys < M1 go to P1.
6. Keys > M1 go to P2. M1 itself is not kept in P1 or P2; it moves up in step 7 (compare Insert example III, where the router 60 ends up only in the root).
7. If the parent PP of P is not full, insert M1 into PP and stop. (If PP does not exist, first create an empty one = new root.) Else set M := M1, P := PP and continue splitting parent nodes recursively up the tree, repeating from step 4.

B+ tree: Insert example I

Initial tree
Routers:  25 50 75
Leaves:   5 10 15 20 | 25 30 | 50 55 60 65 | 75 80 85 90

Insert 28
Routers:  25 50 75
Leaves:   5 10 15 20 | 25 28 30 | 50 55 60 65 | 75 80 85 90

Data records and pointers to them are not drawn here for simplicity's sake.
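The three insert cases can be folded into one loop: insert into the leaf, then keep splitting upward while a node is overfull. The following Python sketch is an illustration only. The names Node, BPlusTree and leaf_chain are hypothetical, MAX_KEYS = 4 is an assumption that matches the node capacity visible in the examples, keys are assumed distinct, and records are omitted.

```python
from bisect import bisect_right, insort

MAX_KEYS = 4  # assumption: node capacity as in the slide examples


class Node:
    def __init__(self, keys, children=None):
        self.keys = keys            # sorted keys (leaf) or routers (internal)
        self.children = children    # None for a leaf
        self.next = None            # leaves only: link to the right neighbour

    @property
    def is_leaf(self):
        return self.children is None


class BPlusTree:
    def __init__(self, root=None):
        self.root = root or Node([])

    def insert(self, key):
        # Descend to the leaf, remembering (parent, child index) pairs.
        path, node = [], self.root
        while not node.is_leaf:
            i = bisect_right(node.keys, key)   # keys >= router go right
            path.append((node, i))
            node = node.children[i]
        insort(node.keys, key)                 # Case 1 ends here if there is room
        while len(node.keys) > MAX_KEYS:       # Cases 2 and 3: split upward
            mid = len(node.keys) // 2
            median = node.keys[mid]
            if node.is_leaf:
                # Leaf split: the median also stays in the right leaf.
                right = Node(node.keys[mid:])
                right.next, node.next = node.next, right
            else:
                # Internal split: the median only moves up, as in a B-tree.
                right = Node(node.keys[mid + 1:], node.children[mid + 1:])
                node.children = node.children[:mid + 1]
            node.keys = node.keys[:mid]
            if path:
                parent, i = path.pop()
            else:
                parent, i = Node([], [node]), 0    # create a new root
                self.root = parent
            parent.keys.insert(i, median)
            parent.children.insert(i + 1, right)
            node = parent


def leaf_chain(tree):
    """Collect the key lists of all leaves by following the leaf links."""
    node = tree.root
    while not node.is_leaf:
        node = node.children[0]
    chain = []
    while node is not None:
        chain.append(node.keys)
        node = node.next
    return chain


# Usage: the initial tree of the insert example, built by hand.
leaves = [Node(ks) for ks in ([5, 10, 15, 20], [25, 30],
                              [50, 55, 60, 65], [75, 80, 85, 90])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
tree = BPlusTree(Node([25, 50, 75], leaves))
for key in (28, 70, 95):
    tree.insert(key)
```

Note the one asymmetry in the loop: a leaf split copies the median up and keeps it in the right leaf, while an internal split moves the median up without keeping a copy.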
B+ tree: Insert example II

Initial tree
Routers:  25 50 75
Leaves:   5 10 15 20 | 25 28 30 | 50 55 60 65 | 75 80 85 90

Insert 70 (median = 60)
Routers:  25 50 60 75
Leaves:   5 10 15 20 | 25 28 30 | 50 55 | 60 65 70 | 75 80 85 90

B+ tree: Insert example III

Initial tree
Routers:  25 50 60 75
Leaves:   5 10 15 20 | 25 28 30 | 50 55 | 60 65 70 | 75 80 85 90

Insert 95 (first median = 85, second median = 60)
Root:      60
Internal:  25 50 | 75 85
Leaves:    5 10 15 20 | 25 28 30 | 50 55 | 60 65 70 | 75 80 | 85 90 95

Note the router 60 in the root, detached from its original position in the leaf.

B+ tree: Delete I

Deleting key K (and its associated data record) in a B+ tree
Find, as in a B-tree, key K in a leaf. Then there are 3 cases:

Case 1
Leaf more than half full or leaf == root? YES.
Delete the key and its record from the leaf L. Arrange the keys in the leaf in ascending order to fill the void. If the deleted key K also appears in the parent node P, replace it by the next bigger key K1 from L (explain why it exists) and leave K1 in L as well.

Case 2
Leaf more than half full? NO. Left or right sibling more than half full? YES.
Move one (or more if you wish and the rules permit) key from the sibling S to the leaf L, and reflect the changes in the parent P of L and the parent P2 of the sibling S.
(If S does not exist then L is the root, which may contain any number of keys.)

B+ tree: Delete II

Deleting key K (and its associated data record) in a B+ tree (continued)

Case 3
Leaf more than half full? NO. Left or right sibling more than half full? NO.
1. Consider a sibling S of L which has the same parent P as L.
2. Consider the set M of ordered keys of L and S without K but together with the key K1 in P which separates L and S.
3. Merge: Store M in L, connect L to the other sibling of S (if it exists), destroy S.
4. Set the reference left of K1 to point to L. Delete K1 from P. If P contains K, delete it from P as well. If P is still at least half full, stop; else continue with step 5.
5. If any sibling SP of P is more than half full, move the necessary number of keys from SP to P, adjust the links in P, SP and their parents accordingly, and stop. Else set L := P and continue recursively up the tree (as in a B-tree), repeating from step 1.
Note: Merging leaves and inner nodes works the same way as in B-trees.

B+ tree: Delete example I

Initial tree
Root:      60
Internal:  25 50 | 75 85
Leaves:    5 10 15 20 | 25 28 30 | 50 55 | 60 65 70 | 75 80 | 85 90 95

Delete 70
Root:      60
Internal:  25 50 | 75 85
Leaves:    5 10 15 20 | 25 28 30 | 50 55 | 60 65 | 75 80 | 85 90 95

B+ tree: Delete example II

Initial tree
Root:      60
Internal:  25 50 | 75 85
Leaves:    5 10 15 20 | 25 28 30 | 50 55 | 60 65 | 75 80 | 85 90 95

Delete 25
Root:      60
Internal:  28 50 | 75 85
Leaves:    5 10 15 20 | 28 30 | 50 55 | 60 65 | 75 80 | 85 90 95

B+ tree: Delete example III

Initial tree
Root:      60
Internal:  28 50 | 75 85
Leaves:    5 10 15 20 | 28 30 | 50 55 | 60 65 | 75 80 | 85 90 95

Delete 60 (the leaves 65 and 75 80 are joined, then the two internal nodes are joined with the root key)
Root:      28 50 60 85
Leaves:    5 10 15 20 | 28 30 | 50 55 | 65 75 80 | 85 90 95

B+ tree: Delete example IV

Initial tree
Root:      60
Internal:  28 50 | 75 85
Leaves:    5 10 | 28 30 | 50 55 | 60 65 | 75 80 | 85 90

Delete 75
Progress: the leaf 80 has too few keys and is joined with the leaf 60 65, which leaves the internal node with the single router 85. Too few keys again: join the two internal nodes and bring the key down from the parent (recursively).
... done.
Root:      28 50 60 85
Leaves:    5 10 | 28 30 | 50 55 | 60 65 80 | 85 90

B+ tree: Operations complexity

Complexities
Find, Insert, and Delete all need O(log_b(n)) operations, where n is the number of records in the tree and b is the branching factor or, as it is often understood, the order of the tree.
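To get a feeling for how small log_b(n) is in practice, the number of levels can be estimated with a few lines of Python. This is a rough, hedged estimate: the function name levels_needed is made up for this sketch, and it assumes every node has exactly the same number of children (fanout), ignoring the special status of the root and the exact leaf capacity.

```python
def levels_needed(n, fanout):
    """Smallest h with fanout**h >= n, using integer arithmetic only."""
    h, capacity = 0, 1
    while capacity < n:
        capacity *= fanout
        h += 1
    return h


n = 10**9
full = levels_needed(n, 100)      # all nodes full, b = 100
half = levels_needed(n, 50)       # all nodes only half full
binary = levels_needed(n, 2)      # a perfectly balanced binary tree
```

For a billion records this gives 5 levels with full nodes of fanout 100, 6 levels when the nodes are only half full, and 30 levels for a balanced binary tree.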
Note: Be careful, some authors (e.g. CLRS) define the degree/order of a B-tree as ⌈b/2⌉; there is no unified, precise common terminology.
Range search, thanks to the linked leaves, is performed in time O(log_b(n) + k), where k is the range (number of elements) of the query.

Trees comparison: Example

Ben Pfaff: Performance Analysis of BSTs in System Software
Stanford University, Department of Computer Science

Conclusions:
• ...Unbalanced BSTs are best when randomly ordered input can be relied upon;
• if random ordering is the norm but occasional runs of sorted order are expected, then red-black trees should be chosen.
• On the other hand, if insertions often occur in a sorted order, AVL trees excel when later accesses tend to be random,
• and splay trees perform best when later accesses are sequential or clustered.

Some consequences:
Managing virtual memory areas in OS kernel:
... Many kernels use BSTs for keeping track of VMAs: Linux before 2.4.10 used AVL trees, OpenBSD and later versions of Linux use red-black trees, FreeBSD uses splay trees, and so does Windows NT for its VMA equivalents...

Trees comparison: Example (excerpt)

Time in msec, with the rank of each tree (1 = fastest) in parentheses:

Test                                           BST         AVL        RB         splay
Memory management supporting web browser       15.67 (4)   3.65 (2)   3.78 (3)   2.63 (1)
Artificial uniformly random data               1.63 (1)    1.67 (3)   1.64 (2)   1.94 (4)
Secondary peer cache tree                      3.94 (2)    4.07 (3)   3.78 (1)   7.19 (4)
Processing identifiers cross-references        4.97 (4)    4.47 (3)   4.33 (2)   4.00 (1)