CS305/503, Spring 2009
Self-Balancing Trees
Michael Barnathan

Here's what we'll be learning:
• Data Structures:
  – AVL Trees.
  – B+ Trees.

AVL Trees – The Idea
• We looked at an algorithm for balancing trees using rotations last time.
• This turns out to be a pretty good strategy in general.
  – Rotations are O(1): they affect at most 2 levels of the tree, no matter how deep it is.
  – As in the DSW algorithm, rotations can be used to maintain tree balance.
  – The trick is knowing when to apply them.
• A left rotation will decrease the right subtree's height and increase the left subtree's height.
• A right rotation will do the opposite.
  – Recall: a balanced tree is one in which the depths of the leaves differ by no more than one level.
• We can enforce this condition using rotations!
  – (A refresher sketch of the rotations appears just before the insertion algorithm below.)

Balance Factor
• The difference between the heights of a node's right and left subtrees is called the node's balance factor.
  – Balance Factor = Height(Right) - Height(Left).
  – Some sources define it as Height(Left) - Height(Right), but this changes nothing essential.
• Leaves, having no children, have a balance factor of 0 (Height(right) = Height(left) = 0).
• By the definition of tree balance, a subtree is considered balanced if its balance factor is -1, 0, or 1.
• A left rotation will lower the balance factor.
• A right rotation will raise it.

Balance Factor
[Diagram: two example trees annotated with balance factors. In the first, no node's balance factor is < -1 or > 1: that tree is balanced. In the second, a node has a balance factor of -2: that tree is not balanced, and a right rotation will balance it.]

AVL Trees - Structure
• Small modification to a node's structure:

class BinaryTree {
    int value;
    BinaryTree left;
    BinaryTree right;
}

class AVLTree extends BinaryTree {
    //value is inherited.
    AVLTree left, right; //Hide the inherited links with AVL-typed ones,
                         //so that expressions like root.left.balance type-check.
    int balance;         //The balance factor.
}

AVL Trees: Algorithms
• Insertion and deletion must keep the balance. Access doesn't change.
• Insertion:
  – Insert as in a normal binary search tree, but go back up the tree, updating the balance factor of each node on the path back to the root.
  – "Go back up the tree" -> do something after the recursive call / on the "pop".
  – If a balance factor becomes +2 or -2, rotate to correct it.
  – There are four different cases, involving up to 2 rotations.
• Deletion:
  – Delete as in a normal binary search tree (replacing the node with its inorder successor), but go back up the tree and adjust the balance factors.
  – If a balance factor becomes +2 or -2, rotate to correct it.
  – If a balance factor becomes +1 or -1, we can stop.
    • This indicates that the height of the subtree hasn't changed.
  – If a balance factor becomes 0, we must keep going.
  – The deletion algorithm is very similar to the BST algorithm, so I won't present it formally.

Insertion Cases (Wikipedia)
Note the similarity to tree_to_list:
[Diagram from Wikipedia: the four insertion cases (left-left, left-right, right-right, right-left) and the rotations that repair each one.]

Visual AVL Demonstration
• http://webpages.ull.es/users/jriera/Docencia/AVL/AVL%20tree%20applet.htm
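The Rotations (refresher sketch)
• The rotate functions come from Lecture 9; in case they aren't handy, here is a minimal sketch of one possible version. The Lecture 9 code may differ in detail.
• Because the insertion code below calls these as void methods (rotateRight(root), with no relinking in the parent), this sketch keeps the same node object at the subtree root: it swaps values with the pivot child and relinks the children. Balance factors are adjusted by the caller, as in updateBalance below.

void rotateRight(AVLTree root) {
    AVLTree pivot = root.left;   //The child being rotated up.
    int temp = root.value;       //Swap values, so the same object
    root.value = pivot.value;    //remains the subtree's root.
    pivot.value = temp;
    root.left = pivot.left;      //Pivot's left subtree moves up a level.
    pivot.left = pivot.right;    //Pivot's old right subtree slides left...
    pivot.right = root.right;    //...and it adopts the root's old right subtree.
    root.right = pivot;          //Pivot (holding the old root's value) goes right.
}

void rotateLeft(AVLTree root) {  //The mirror image of rotateRight.
    AVLTree pivot = root.right;
    int temp = root.value;
    root.value = pivot.value;
    pivot.value = temp;
    root.right = pivot.right;
    pivot.right = pivot.left;
    pivot.left = root.left;
    root.left = pivot;
}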
Insertion Algorithm
//Refer to Lecture 9 for the rotate functions.
void insert(AVLTree root, AVLTree newtree) {
    //This can only happen now if the user passes in an empty tree.
    if (root == null)
        root = newtree;                    //Empty. Insert the root.
    else if (newtree.value < root.value) { //Go left if <.
        if (root.left == null)
            root.left = newtree;           //Found a place to insert.
        else
            insert(root.left, newtree);    //Keep traversing.
    } else {                               //Go right if >=.
        if (root.right == null)
            root.right = newtree;          //Found a place to insert.
        else
            insert(root.right, newtree);   //Keep traversing.
    }
    updateBalance(root);
}

void updateBalance(AVLTree root) {
    //Note that a balance factor of -1 or less guarantees a left child exists.
    if (root.balance < -1 && root.left.balance < 0) {
        rotateRight(root);                 //Left-left case: rotate right once.
        root.right.balance = root.balance++;
    } else if (root.balance < -1) {
        rotateLeft(root.left);             //Left-right case: rotate the left child left,
        rotateRight(root);                 //then rotate the root right.
        root.left.balance = -1 * Math.max(root.balance, 0);
        root.right.balance = -1 * Math.min(root.balance, 0);
        root.balance = 0;
    } else if (root.balance > 1 && root.right.balance > 0) {
        rotateLeft(root);                  //Right-right case.
        root.left.balance = root.balance--;
    } else if (root.balance > 1) {
        rotateRight(root.right);           //Right-left case: rotate the right child right,
        rotateLeft(root);                  //then rotate the root left.
        root.left.balance = -1 * Math.max(root.balance, 0);
        root.right.balance = -1 * Math.min(root.balance, 0);
        root.balance = 0;
    }
}

Insertion Analysis
• We go down the tree to insert.
• We go back up the tree and rotate.
• AVL trees are always balanced, so what is the complexity of this operation?

CRUD: AVL Trees
• Insertion: O(log n).
• Access: O(log n).
• Updating an element: O(log n).
• Deleting an element: O(log n).
• Search: O(log n).
• Traversal: O(n).
• This is a winner.
• We have all of the nice BST properties, without having to worry about balance.
• This does, however, require O(n) extra space to store the balance factor for each node.

B+ Trees: Motivation
• Binary search trees are very useful data structures when data lives in memory.
• However, they are not good for disk access.
  – Disk access is very slow compared to memory.
  – Traversing a BST is a mess on disk.
    • If each node is stored somewhere on the disk, even a simple traversal requires a great deal of random access.
    • Random access is difficult to cache.
    • Range queries in particular perform poorly.
  – Nodes do not align to "blocks" on disk.
    • Disks can only read data one "block" at a time. If we need less than one block, we waste time reading data that isn't used.
• A self-balancing tree called a B+ tree can solve these issues.
• B+ trees are used in several popular filesystems, including NTFS, ReiserFS, XFS, and JFS.
• They are also used to index tables in database systems, such as MySQL.

B+ Trees: Idea
• Very different from what we've seen.
• First, they grow UP, not DOWN.
• They are not binary; each node contains an array of n values and points to n+1 children.
• Only leaves hold actual values; interior nodes hold the maximum value in the corresponding leaf. This is used as a means of indexing.
  – Some variations use the minimum or middle.
• They are "threaded":
  – Each leaf points to the next in sorted order.
  – This makes sequential access and range queries fast (see the sketch after the structure below).
• If each value occupies a bytes and your device's block size is b, the optimal size of the array is b / a - 1.
  – One level of the tree then fills exactly one block. For example, with 4096-byte blocks and 4-byte ints, that is 4096 / 4 - 1 = 1023 values per node.

B+ Trees
[Diagram: a B+ tree whose interior node holds 14, 20, and 26 and indexes leaves holding 7, 14, 20, 22, 26, 28, 34, and 97.]
Note that each value in an interior node is the maximum of its leaves' values. The last pointer points to a child containing greater elements.
Advantage: you can tell which leaf to read using only the interior node (one disk read). The other leaves do not have to be read.

B+ Tree Structure
• We define k as the order of a B+ tree.
• Each node is allowed to store up to 2k values and 2k+1 pointers.
• The structure looks like this:

class BPTree {
    static final int ORDER = 2; //This is k. You choose it.
    int[] values = new int[ORDER * 2];
    BPTree[] children = new BPTree[ORDER * 2 + 1];
}
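Threaded Leaves: A Sketch
• The BPTree class above omits the leaf-to-leaf link, so here is a minimal sketch of the "thread" and the range queries it speeds up. The Leaf class, its count field, and rangeQuery are illustrative assumptions, not part of the slide's structure.

import java.util.ArrayList;
import java.util.List;

//Hypothetical leaf with the "thread": each leaf points to the next leaf
//in sorted order. count tracks how many array slots are actually in use,
//since the fixed-size array may be only partly full.
class Leaf {
    int[] values = new int[4]; //Up to 2k sorted values (k = 2 here).
    int count;                 //Number of slots in use.
    Leaf next;                 //Successor leaf: the thread.
}

//Collect every value in [lo, hi]: one root-to-leaf descent finds the
//starting leaf (e.g., via the search function below), then we walk the
//leaf list instead of re-traversing the tree.
List<Integer> rangeQuery(Leaf start, int lo, int hi) {
    List<Integer> out = new ArrayList<>();
    for (Leaf cur = start; cur != null; cur = cur.next) {
        for (int i = 0; i < cur.count; i++) {
            if (cur.values[i] > hi) return out; //Past the range: stop.
            if (cur.values[i] >= lo) out.add(cur.values[i]);
        }
    }
    return out;
}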
• We can use this to locate the child to descend to when searching.

BPTree search(BPTree root, int val) {
    if (root == null)
        return null;
    for (int childidx = root.values.length - 1; childidx >= 0; childidx--) {
        if (root.children[0] == null && root.values[childidx] == val)
            return root;                                     //Found in a leaf.
        if (val > root.values[childidx])
            return search(root.children[childidx + 1], val); //Descend to the right of this value.
    }
    return search(root.children[0], val);                    //val <= every value: descend leftmost.
}

Search Example
[Diagram: the same B+ tree as above, with 14, 20, and 26 in the interior node.]
Search for 17.
  childidx = 2. val > 26? No.
  childidx = 1. val > 20? No.
  childidx = 0. val > 14? Yes. Traverse down to the leaf containing 20 (children[childidx + 1]).
  20 is a leaf. val == 20? No. Return null.

Insertion: Split of a Leaf Node
• If the node is not full, we can just locate the proper position in the node's array of keys and insert there.
• However, if the node is full, we need to split the node. This is how the tree grows.
• Insertion always begins at the leaves.
• When a leaf splits, a new leaf is created, which becomes that leaf's successor.
  – The lower half of the old leaf's values stay, while the upper half moves to the new leaf. The parent value for the old leaf becomes the new maximum of the values remaining in the leaf.
  – The old leaf is then linked to the new leaf. (A sketch of the whole split follows the Root Split slide below.)
• That means the leaf's parent must point to this new node.
  – So we insert the new leaf into the parent.
  – Ack, there's a problem here!

Split of an Internal Node
• What if the parent is also full when we try to insert the new leaf into it?
• We then have to split the parent.
• This is similar to a leaf-node split (cut the node in half, move the maximum up), with one crucial difference:
  – When you move the old maximum to the parent, you remove it from the current node.
  – Internal nodes don't contain values, so this is OK.
• Now we're inserting into this node's parent.
  – And that means that node can split as well!
• When will the insanity end!?

Root Split
• When you reach the root, of course.
• When the root splits in two, a new root is created pointing to the old root and its new sibling, which are now its children.
• This increases the height of the tree by 1.
• So it is possible for one insertion to cascade splits all the way up the tree.
  – What do you think the complexity is, then?
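Leaf Split: A Sketch
• A minimal sketch of the leaf-split step, reusing the hypothetical Leaf structure from the threaded-leaves sketch and assuming a full, sorted leaf of 2k values. The helper name is made up; updating the parent (and any cascading splits) is left to the caller.

//Split a full leaf: keep the lower k values in place, move the upper k
//into a new leaf, and link the new leaf in as the old one's successor.
Leaf splitLeaf(Leaf old) {
    int k = old.count / 2;              //Half of the 2k stored values.
    Leaf fresh = new Leaf();
    for (int i = k; i < old.count; i++) //The upper half moves to the new leaf.
        fresh.values[i - k] = old.values[i];
    fresh.count = old.count - k;
    old.count = k;                      //The lower half stays put.
    fresh.next = old.next;              //Maintain the thread: the new leaf
    old.next = fresh;                   //becomes the old leaf's successor.
    //The caller now inserts the pending value into whichever half it belongs
    //in, updates the old leaf's key in the parent to old.values[k - 1] (its
    //new maximum), and inserts the new leaf into the parent, which may
    //itself split, cascading up to the root.
    return fresh;
}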
B+ Tree Deletion
• As always, deletion is insertion in reverse.
• As a rule, B+ tree nodes should always be at least half full (that is why the order k is half of the maximum number of values).
• If deletion causes a node to fall below this size, we will have to undo splits.
• But first, the easy case:
  – If the leaf we remove from is more than half full, we simply remove the value and we're finished.

"Borrowing" Values
• If we fall below the halfway threshold but the next sibling of this leaf is above the threshold, just move the first value of the sibling into the last slot of the current node.
• Since this is larger than anything currently in the node by definition, update this node's maximum in the parent as well.
• Values can be borrowed from the previous sibling as well.
  – If neither sibling can spare a value, there's no choice but to merge.

Merging
• The inverse of a split is called a merge or coalesce operation.
• The inability to borrow ensures that this node and its sibling are both at most half full.
• Therefore, we can merge the two siblings into one node.
• We then delete the pointer to the sibling from the parent node (and from the linked list of siblings, of course).
• This in turn can cause the parent to underflow…
  – Fortunately, internal nodes and leaves are merged in the same way.
• If the merge propagates to the root, the old root disappears and the height of the tree drops by 1.

Example
• The algorithms for these operations are complex. I haven't decided whether we should discuss them in detail yet.
• First, make sure you understand the idea of what is going on.
• These will help:
  – http://people.ksp.sk/~kuko/bak/index.html
    • That applet demonstrates B-trees, not B+ trees, but I'll point out the differences.
  – http://www.seanster.com/BplusTree/BplusTree.html
    • This is a B+-tree implementation, but it uses the middle element rather than the maximum.

CRUD: B+ Trees
• Insertion: O(log n).
• Access: O(log n).
• Update: O(log n).
• Delete: O(log n).
• Search: O(log n).
• Traversal: O(n).
• These are the same asymptotic performances as in AVL trees.
• The primary advantage of the B+ tree over the AVL tree is in disk performance and indexing.
  – There's also no balance factor to store, but you waste more space on half-empty arrays in the worst case.
• You also have that nice linked list structure to traverse.

Our Last Balancing Act
• We've devoted a lot of time to tree balance.
• Next time, we'll move on to heaps and heapsort, and we'll revisit priority queues.
• The lesson:
  – Automatic solutions save time with repeated use, but often carry a higher initial cost.