Self-balancing binary search trees

CS305/503, Spring 2009
Self-Balancing Trees
Michael Barnathan
Here’s what we’ll be learning:
• Data Structures:
– AVL Trees.
– B+ Trees.
AVL Trees – The Idea
• We looked at an algorithm for balancing trees using
rotations last time.
• This turns out to be a pretty good strategy in general.
– Rotations are O(1): they only affect up to 2 levels of the tree no
matter how deep it is.
– As in the DSW algorithm, rotations can be used to maintain tree
balance.
– The trick is knowing when to apply them.
• A left rotation will decrease the right subtree’s height and increase the
left subtree’s height.
• A right rotation will do the opposite.
– Recall: A balanced tree is one in which the depths of the leaves
differ by no more than one level.
• We can enforce this condition using rotations!
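To make the O(1) claim concrete, here is a minimal sketch of the two rotations. The `Node` class and the choice to return the new subtree root are assumptions made for illustration; the Lecture 9 versions may be written differently.

```java
class RotationSketch {
    static class Node {
        int value;
        Node left, right;
        Node(int value) { this.value = value; }
    }

    // Left rotation: the right child becomes the new subtree root.
    // Only a constant number of pointers change, regardless of tree depth.
    static Node rotateLeft(Node x) {
        Node y = x.right;
        x.right = y.left;
        y.left = x;
        return y;
    }

    // Right rotation: the mirror image of rotateLeft.
    static Node rotateRight(Node y) {
        Node x = y.left;
        y.left = x.right;
        x.right = y;
        return x;
    }

    // Height of an empty subtree is 0; a leaf has height 1.
    static int height(Node n) {
        if (n == null) return 0;
        return 1 + Math.max(height(n.left), height(n.right));
    }
}
```

Rotating a right-leaning chain 1, 2, 3 to the left puts 2 at the top with 1 and 3 as its children, so the height drops from 3 to 2.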
Balance Factor
• The difference between the height of a node’s right
and left subtrees is called the node’s balance factor.
– Balance Factor = Height(Right) - Height(Left).
– Some sources define it as Height(Left) - Height(Right), but
this does not change anything.
• Leaves, having no children, have a balance factor of 0
(Height(right) = Height(left) = 0).
• By the definition of tree balance, a subtree is
considered balanced if its balance factor is -1, 0, or 1.
• A left rotation will lower the balance factor.
• A right rotation will raise it.
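As a minimal sketch of the definition above, the balance factor can be computed directly from subtree heights. The `Node` class and the helper names `height`, `balanceFactor`, and `isBalanced` are assumptions for illustration. An empty subtree is given height 0, which matches the leaf convention on this slide.

```java
class BalanceFactorSketch {
    static class Node {
        int value;
        Node left, right;
        Node(int value) { this.value = value; }
    }

    // Height of an empty subtree is 0; a leaf has height 1.
    static int height(Node n) {
        if (n == null) return 0;
        return 1 + Math.max(height(n.left), height(n.right));
    }

    // Balance Factor = Height(Right) - Height(Left).
    static int balanceFactor(Node n) {
        return height(n.right) - height(n.left);
    }

    // A tree is balanced if every node's balance factor is -1, 0, or 1.
    static boolean isBalanced(Node n) {
        if (n == null) return true;
        int bf = balanceFactor(n);
        return bf >= -1 && bf <= 1 && isBalanced(n.left) && isBalanced(n.right);
    }
}
```

Recomputing heights this way costs O(n) per call, which is exactly why an AVL tree stores the balance factor in each node instead.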
Balance Factor
[Figure: two example trees, each node annotated with its balance factor.]
• One tree has no balance factor < -1 or > 1: this tree is balanced.
• The other tree has a node with a balance factor of -2: this tree is not balanced.
A right rotation will balance this tree.
AVL Trees - Structure
• Small modification to a node’s structure:
class BinaryTree {
int value;
BinaryTree left;
BinaryTree right;
}
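class AVLTree extends BinaryTree {
//value, left, and right are inherited.
//Note: the inherited children are typed BinaryTree, so real Java code needs
//a cast (or AVLTree-typed fields) to read a child's balance factor.
int balance; //Balance factor: Height(right) - Height(left).
}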
AVL Trees: Algorithms
• Insertion and deletion must preserve balance. Access is the same as in a plain BST.
• Insertion:
– Insert as in a normal binary search tree, but go back up the tree and update
the balance factor of each node back towards the root.
– “Go back up the tree” -> do something after the recursive call / on the “pop”.
– If the balance factor becomes +2 or -2, rotate to correct it.
– Four different cases involving up to 2 rotations.
• Deletion:
– Delete as in a normal binary search tree (replacing the node with its inorder
successor), but go back up the tree and adjust the balance factors.
– If the balance factor becomes +2 or -2, rotate to correct it.
– If the balance factor becomes +1 or -1, we can stop.
• This indicates that the height of the subtree hasn’t changed.
– If the balance factor becomes 0, we must keep going.
– The deletion algorithm is very similar to the BST algorithm, so I won’t present
it formally.
Insertion Cases (Wikipedia)
[Figure: the four AVL insertion cases, from Wikipedia.]
Note the similarity to tree_to_list.
Visual AVL Demonstration
• http://webpages.ull.es/users/jriera/Docencia/AVL/AVL%20tree%20applet.htm
Insertion Algorithm
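//Refer to Lecture 9 for the rotate functions.
//Returns the root of the subtree. Java passes references by value, so assigning
//to the parameter alone is invisible to the caller: use root = insert(root, newtree).
AVLTree insert(AVLTree root, AVLTree newtree) {
    //This can only happen now if the user passes in an empty tree.
    if (root == null)
        return newtree;                 //Empty. Insert the root.
    else if (newtree.value < root.value) {  //Go left if <.
        if (root.left == null)
            root.left = newtree;        //Found a place to insert.
        else
            insert(root.left, newtree); //Keep traversing.
    }
    else {                              //Go right if >=.
        if (root.right == null)
            root.right = newtree;       //Found a place to insert.
        else
            insert(root.right, newtree); //Keep traversing.
    }
    //On the way back up ("on the pop"): rotate if this node is out of balance.
    //Assumes root.balance has been brought up to date for the new node.
    updateBalance(root);
    return root;
}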
void updateBalance(AVLTree root) {
    //Assumes the Lecture 9 rotations work in place, so that root still refers
    //to the top of the subtree after each rotation.
    //Note that a balance factor below -1 guarantees a left child exists.
    if (root.balance < -1 && root.left.balance < 0) {
        rotateRight(root);  //Left-left case: rotate right once.
        //After an insertion, the new top and the old root both end up balanced.
        root.balance = 0;
        root.right.balance = 0;
    }
    else if (root.balance < -1) {
        rotateLeft(root.left);  //Left-right case: rotate the left child left, rotate the root right.
        rotateRight(root);
        //root.balance is now the old grandchild's factor; it decides the children's factors.
        root.left.balance = -1 * Math.max(root.balance, 0);
        root.right.balance = -1 * Math.min(root.balance, 0);
        root.balance = 0;
    }
    else if (root.balance > 1 && root.right.balance > 0) {
        rotateLeft(root);   //Right-right case: rotate left once.
        root.balance = 0;
        root.left.balance = 0;
    }
    else if (root.balance > 1) {
        rotateRight(root.right);    //Right-left case: rotate the right child right, rotate the root left.
        rotateLeft(root);
        root.left.balance = -1 * Math.max(root.balance, 0);
        root.right.balance = -1 * Math.min(root.balance, 0);
        root.balance = 0;
    }
}
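For comparison, here is a self-contained sketch of AVL insertion that can be run as-is. It swaps in a different bookkeeping technique from the slides: each node stores its height rather than its balance factor, and the balance factor is derived from the children's heights. The rotations return the new subtree root. All names here (`AvlInsertSketch`, `Node`, `update`, `rebalance`) are assumptions for illustration.

```java
class AvlInsertSketch {
    static class Node {
        int value;
        int height = 1;          // a leaf has height 1; an empty subtree has height 0
        Node left, right;
        Node(int value) { this.value = value; }
    }

    static int height(Node n) { return n == null ? 0 : n.height; }

    // Balance factor = Height(Right) - Height(Left), as on the slides.
    static int balance(Node n) { return height(n.right) - height(n.left); }

    static void update(Node n) {
        n.height = 1 + Math.max(height(n.left), height(n.right));
    }

    static Node rotateLeft(Node x) {
        Node y = x.right;
        x.right = y.left;
        y.left = x;
        update(x);               // x is now below y, so fix x first
        update(y);
        return y;
    }

    static Node rotateRight(Node y) {
        Node x = y.left;
        y.left = x.right;
        x.right = y;
        update(y);
        update(x);
        return x;
    }

    // Inserts value and returns the (possibly new) root of this subtree.
    static Node insert(Node root, int value) {
        if (root == null) return new Node(value);
        if (value < root.value) root.left = insert(root.left, value);
        else root.right = insert(root.right, value);    // go right if >=
        update(root);            // work done "on the pop"
        return rebalance(root);
    }

    static Node rebalance(Node root) {
        int bf = balance(root);
        if (bf < -1) {
            if (balance(root.left) > 0)                  // left-right case: extra rotation first
                root.left = rotateLeft(root.left);
            return rotateRight(root);                    // finishes both left cases
        }
        if (bf > 1) {
            if (balance(root.right) < 0)                 // right-left case: extra rotation first
                root.right = rotateRight(root.right);
            return rotateLeft(root);                     // finishes both right cases
        }
        return root;
    }
}
```

Calling `root = AvlInsertSketch.insert(root, v)` for v = 1 through 7 in sorted order, which would degenerate a plain BST into a list, leaves 4 at the root with a height of 3.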
Insertion Analysis
• We go down the tree to insert.
• We go back up the tree and rotate.
• AVL trees are always balanced, so what is the
complexity of this operation?
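One way to see how tall an AVL tree can get is to count the fewest nodes a tree of height h can have: the sparsest legal tree has one subtree of height h-1 and one of height h-2. The sketch below works from that recurrence. The class and method names are assumptions for illustration, and height is counted so that a single leaf has height 1.

```java
class AvlHeightBound {
    // Fewest nodes in any AVL tree of height h:
    // minNodes(0) = 0, minNodes(1) = 1, minNodes(h) = minNodes(h-1) + minNodes(h-2) + 1.
    static long minNodes(int h) {
        if (h == 0) return 0;
        long prev = 0, cur = 1;
        for (int i = 2; i <= h; i++) {
            long next = cur + prev + 1;
            prev = cur;
            cur = next;
        }
        return cur;
    }

    // Tallest AVL tree that can be built from n nodes.
    static int maxHeight(long n) {
        int h = 0;
        while (minNodes(h + 1) <= n) h++;
        return h;
    }
}
```

Because minNodes grows exponentially in h, the height grows only logarithmically in n: a million-node AVL tree can never be taller than 28 levels, compared with 20 for a perfectly balanced tree.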
CRUD: AVL Trees.
• Insertion: O(log n).
• Access: O(log n).
• Updating an element: O(log n).
• Deleting an element: O(log n).
• Search: O(log n).
• Traversal: O(n).
• This is a winner.
• We have all of the nice BST properties, without having to worry
about balance.
• This does, however, require O(n) extra space to store the balance
factor for each node.
B+ Trees: Motivation
• Binary search trees are very useful data structures when data lives in
memory.
• However, they are not good for disk access.
– Disk access is very slow compared to memory.
– Traversing a BST is a mess on disk.
• If each node is stored somewhere on the disk, even a simple traversal requires a great
deal of random access.
• Random access is difficult to cache.
• Range queries in particular perform poorly.
– Nodes do not align to “blocks” on disk.
• Disks can only read data one “block” at a time. If we need less than one block, we waste
time reading data that isn’t used.
• A self-balancing tree called a B+ tree can solve these issues.
• These are used in several popular filesystems, including NTFS, ReiserFS,
XFS, and JFS.
• They are also used to index tables in database systems, such as MySQL.
B+ Trees: Idea
• Very different from what we’ve seen.
• First, they grow UP, not DOWN.
• They are not binary; each node contains an array of n
values and points to n+1 children.
• Only leaves hold actual values; interior nodes hold the
maximum value in the corresponding leaf. This is used as a
means of indexing.
– Some variations use the minimum or middle.
• They are “threaded”:
– Each leaf points to the next in sorted order.
– This makes sequential access and range queries fast.
• If each variable occupies a bytes and your device’s block
size is b, the optimal size of the array is b / a - 1.
– One node of the tree would then fill one block.
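As a quick worked example of the sizing rule above, the sketch below just evaluates b / a - 1. The 4096-byte block and 8-byte value are illustrative assumptions, not figures from the lecture, and `optimalArraySize` is a hypothetical helper name.

```java
class BlockSizeSketch {
    // The slide's rule of thumb: array size = b / a - 1,
    // where b is the block size and a is the size of one value, both in bytes.
    static int optimalArraySize(int blockBytes, int valueBytes) {
        return blockBytes / valueBytes - 1;
    }
}
```

With those assumed numbers, `optimalArraySize(4096, 8)` gives 511 values per node.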
B+ Trees
[Figure: a B+ tree. The root holds the keys 14, 20, 26; its leaves, linked left to right, hold (7, 14), (20), (22, 26), and (28, 34, 97).]
Note that each value in an interior node is the maximum of its leaves’
values. The last pointer points to a child containing greater elements.
Advantage: you can tell which leaf to read using only the interior node
(one disk read). The other leaves do not have to be read.
B+ Tree Structure
• We define k as the order of a B+ tree.
• Each node is allowed to store up to 2k values and
2k+1 pointers.
• The structure looks like this:
class BPTree {
static final int ORDER=2; //You choose this.
int[] values = new int[ORDER*2];
BPTree[] children = new BPTree[ORDER*2+1];
}
Search
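• Each key in an interior node is the maximum value of the
corresponding child. We can use this to locate the node to
descend to when searching.
BPTree search(BPTree root, int val) {
    if (root == null)
        return null;
    //Scan the keys from largest to smallest.
    //Assumes every slot of values holds a key; a partly filled node would also need a count.
    for (int childidx = root.values.length - 1; childidx >= 0; childidx--) {
        if (root.children[0] == null && root.values[childidx] == val)
            return root;    //Found in a leaf.
        if (val > root.values[childidx])
            return search(root.children[childidx+1], val);
    }
    //val is <= every key in this node: descend to the first child (null in a leaf).
    return search(root.children[0], val);
}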
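Here is a runnable sketch of the same search idea. It departs from the slide's structure in a few labeled ways: a `count` field (an assumption) records how many slots of `values` are in use, a leaf is marked by a null `children` array, and the descent is written as a loop. The `interior` helper is hypothetical and exists only to build a small test tree by hand.

```java
class BPlusSearchSketch {
    static final int ORDER = 2;

    static class Node {
        int[] values = new int[ORDER * 2];
        int count;          // how many slots of values are in use (assumed field)
        Node[] children;    // null in a leaf; count + 1 entries used otherwise

        Node(int... vals) {
            count = vals.length;
            System.arraycopy(vals, 0, values, 0, count);
        }
    }

    // Hypothetical helper: builds an interior node from its keys and children.
    static Node interior(int[] keys, Node... kids) {
        Node n = new Node(keys);
        n.children = new Node[ORDER * 2 + 1];
        System.arraycopy(kids, 0, n.children, 0, kids.length);
        return n;
    }

    // Returns the leaf holding val, or null if val is not in the tree.
    static Node search(Node node, int val) {
        while (node != null && node.children != null) {
            // Key i is the maximum of child i, so take the first child whose
            // maximum is >= val; the last child holds anything greater.
            int i = 0;
            while (i < node.count && val > node.values[i]) i++;
            node = node.children[i];
        }
        if (node == null) return null;
        for (int i = 0; i < node.count; i++)
            if (node.values[i] == val) return node;
        return null;
    }
}
```

On a tree with root keys 14, 20, 26 and leaves (7, 14), (20), (22, 26), (28, 34, 97), searching for 22 returns the third leaf, and searching for 17 reaches the leaf (20) and returns null.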
Search Example
[Figure: B+ tree with root keys 14, 20, 26 and leaves (7, 14), (20), (22, 26), (28, 34, 97).]
Search for 17.
Childidx = 2. Val > 26? No.
Childidx = 1. Val > 20? No.
Childidx = 0. Val > 14? Yes. Traverse down 20 (child[childidx+1]).
20 is a leaf. Val = 20? No. Return null.
Insertion: Split of a Leaf Node
• If the node is not full, we can just locate the proper position in
the node’s array of keys to insert.
• However, if the node is full, we need to split the node. This is
how the tree grows.
• Insertion always begins at the leaves.
• When a leaf splits, a new leaf is created, which becomes that
leaf’s successor.
– The lower half of the old leaf’s values stay, while the upper half move
to the new leaf. The parent value for the old leaf becomes the new
maximum of the values remaining in the leaf.
– The old leaf is then linked to the new leaf.
• That means the leaf’s parent must point to this new node.
– So we insert the new leaf into the parent.
– Ack, there’s a problem here!
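The leaf split described above can be sketched as follows. The `Leaf` class is a simplified stand-in (an assumption) that keeps its values in a sorted list instead of a fixed array, and updating the parent is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;

class LeafSplitSketch {
    static class Leaf {
        List<Integer> values = new ArrayList<>();   // kept in sorted order
        Leaf next;                                  // successor in the leaf chain

        int max() { return values.get(values.size() - 1); }
    }

    // Splits a full leaf: the lower half stays, the upper half moves to a
    // new leaf, which becomes the old leaf's successor. Returns the new leaf.
    static Leaf split(Leaf old) {
        int mid = old.values.size() / 2;
        Leaf fresh = new Leaf();
        fresh.values.addAll(old.values.subList(mid, old.values.size()));
        old.values.subList(mid, old.values.size()).clear();
        fresh.next = old.next;      // keep the chain intact
        old.next = fresh;
        return fresh;
    }
}
```

After the split, the parent's key for the old leaf becomes `old.max()`, and the new leaf has to be inserted into the parent, which is where the trouble starts.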
Split of an Internal Node
• What if the parent is also full when we try to insert
the new leaf into it?
• We then have to split the parent.
• This is similar to a leaf-node split (cut the node in
half, move the maximum up), with one crucial
difference:
– When you move the old maximum to the parent, you
remove it from the current node.
– Internal nodes don’t contain values, so this is OK.
• Now we’re inserting into this node’s parent.
– And that means that node can split as well!
• When will the insanity end!?
Root Split
• When you reach the root, of course.
• When the root splits in two, a new root is
created pointing to the old root and its new
sibling, which are now its children.
• This increases the height of the tree by 1.
• So it is possible for one insertion to cascade
splits all the way up the tree.
– What do you think the complexity is, then?
B+ Tree Deletion
• As always, deletion is insertion in reverse.
• As a rule, B+ tree nodes should always be at
least halfway full (that is why the order is half
of the maximum number of values in a node).
• If deletion causes a node to fall below this
size, we will have to undo splits.
• But first, the easy case:
– If the leaf we remove from is more than half full,
we simply remove the value and we’re finished.
“Borrowing” Values.
• If we fall below the halfway threshold but the next
sibling of this leaf is above the threshold, just move
the first value of the sibling to the end of the
current node.
• Since this is larger than anything currently in the
node by definition, update the parent’s maximum
key for this node as well.
• Values can be borrowed from the previous sibling as
well.
– If neither can spare an element, there’s no choice but to
merge.
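Borrowing from the next sibling can be sketched like this. The `Leaf` class is a simplified stand-in (an assumption) using a sorted list, `minSize` stands for the halfway threshold, and the sketch treats `leaf.next` as the sibling without checking that both share a parent.

```java
import java.util.ArrayList;
import java.util.List;

class BorrowSketch {
    static class Leaf {
        List<Integer> values = new ArrayList<>();   // kept in sorted order
        Leaf next;
    }

    // Tries to repair an underfull leaf by taking the first value of its
    // next sibling. Returns false if the sibling cannot spare a value.
    static boolean borrowFromNext(Leaf leaf, int minSize) {
        Leaf sibling = leaf.next;
        if (sibling == null || sibling.values.size() <= minSize) return false;
        // The sibling's first value is larger than everything in leaf,
        // so appending it keeps leaf sorted.
        leaf.values.add(sibling.values.remove(0));
        return true;
    }
}
```

When the call succeeds, the borrowed value is the leaf's new maximum, so the parent's key for that leaf has to be updated to match. When it returns false for both siblings, the only option left is a merge.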
Merging
• The inverse of a split is called a merge or coalesce operation.
• The inability to borrow ensures that the sibling is exactly half full,
while this node is below half full.
• Therefore, we can merge the two siblings into one node.
• We then delete the pointer to the sibling from the parent
node (and from the linked list of siblings, of course).
• This in turn can cause the parent to underflow…
– Fortunately, internal nodes and leaves are merged in the same way.
• If the merge propagates to the root, the old root disappears
and the height of the tree drops by 1.
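A leaf merge can be sketched in a few lines. As before, the `Leaf` class is a simplified stand-in (an assumption) using a sorted list, and removing the sibling's pointer from the parent is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;

class MergeSketch {
    static class Leaf {
        List<Integer> values = new ArrayList<>();   // kept in sorted order
        Leaf next;
    }

    // Merges leaf with its next sibling: the sibling's values are appended
    // and the sibling is unlinked from the leaf chain.
    static void mergeWithNext(Leaf leaf) {
        Leaf sibling = leaf.next;
        leaf.values.addAll(sibling.values);     // sibling values are all larger
        leaf.next = sibling.next;
    }
}
```

The parent now has one fewer child, so it may underflow in turn and need the same borrow-or-merge treatment.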
Example
• The algorithms for these operations are complex. I
haven’t decided whether we should discuss them in
detail yet.
• First, make sure you understand the idea of what is
going on.
• This will help:
– http://people.ksp.sk/~kuko/bak/index.html
– That applet demonstrates B-trees, not B+ trees, but I’ll
point out the differences.
– http://www.seanster.com/BplusTree/BplusTree.html
– This is a B+-tree implementation, but uses the middle
element rather than the maximum.
CRUD: B+ Trees.
• Insertion: O(log n).
• Access: O(log n).
• Update: O(log n).
• Delete: O(log n).
• Search: O(log n).
• Traversal: O(n).
• This is the same asymptotic performance as AVL trees.
• The primary advantage of the B+ tree over the AVL tree is in disk
performance and indexing.
– There’s also no balance factor, but you waste more space with
half-empty arrays in the worst case.
• You also have that nice linked list structure to traverse.
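The linked leaves are what make range queries cheap: once the first relevant leaf is found, the rest of the range is a straight walk along the chain. A minimal sketch, assuming a simplified `Leaf` class and a hypothetical `range` helper that starts from a given leaf (in practice the starting leaf would come from a search):

```java
import java.util.ArrayList;
import java.util.List;

class LeafScanSketch {
    static class Leaf {
        int[] values;   // sorted values in this leaf
        Leaf next;      // next leaf in sorted order

        Leaf(int... values) { this.values = values; }
    }

    // Collects every value v with lo <= v <= hi by walking the leaf chain.
    static List<Integer> range(Leaf start, int lo, int hi) {
        List<Integer> out = new ArrayList<>();
        for (Leaf leaf = start; leaf != null; leaf = leaf.next) {
            for (int v : leaf.values) {
                if (v > hi) return out;     // values are sorted, so we can stop early
                if (v >= lo) out.add(v);
            }
        }
        return out;
    }
}
```

On the leaves (7, 14), (20), (22, 26), (28, 34, 97), asking for the range 14 to 28 yields 14, 20, 22, 26, 28 without touching any interior node.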
Our Last Balancing Act
• We’ve devoted a lot of time to tree balance.
• Next time, we’ll move on to heaps and
heapsort, and we’ll revisit priority queues.
• The lesson:
– Automatic solutions save time with repeated use,
but often carry a higher initial cost.