TreeSet<T> Implementation: AVL Trees

Total and balanced collections of binary trees

A collection C of binary search trees (whose node values are of type T, say) is said to be total iff for every (finite) set s of elements drawn from T, there is a tree in C representing s. For example, the collection of binary search trees (with values of type Integer, say) all of whose nodes have empty right subtrees is total (each tree in this case is like a sorted linked list). The set {1, 2, 3, 4}, for instance, is represented by the tree in the figure, which is obviously in C.

[Figure: the tree with root 4, whose left child is 3, whose left child is 2, whose left child is 1; every right subtree is empty.]

On the other hand, the collection D of binary search trees (containing values of type Integer, say) where every node has 0 or 2 child nodes is not total: it has no tree representing the set {1, 2, 3, 4}, for example (try it!).

A collection of binary search trees is balanced iff for some constant k every tree in the collection has height at most k * log2 n, where n denotes the number of nodes in the tree (obviously k must be at least 1). Each tree in a balanced collection is also said to be balanced. Informally, a tree is balanced if the tree and all its subtrees have very roughly the same number of nodes on the left and right hand sides. As an example, the collection C mentioned above is clearly not balanced (in fact, each tree has height n).

If a total and balanced collection of binary search trees has insertion and deletion algorithms whose worst case time complexity is proportional to the height of the tree being operated on, then clearly insertion and deletion (and, of course, searching) have O(log n) worst case time complexity. Such a collection offers an efficient implementation of TreeSet. Two such collections are in common use: AVL trees and red-black trees.

AVL trees

A node in a binary tree is said to be AVL if the heights of its left and right subtrees differ by at most 1. A binary tree is said to be AVL if all its nodes are AVL.
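The definition of an AVL node translates almost directly into a recursive check. The following is a minimal sketch only: the names AvlCheck, Node, height and isAvl are hypothetical and do not come from these notes, and height is assumed to be counted in nodes (the empty tree has height 0), which agrees with the remark that a chain of n nodes has height n.

```java
// Minimal sketch: checks the AVL property of a binary tree.
// All names here are illustrative assumptions, not part of the notes.
final class AvlCheck {
    /** A bare binary-tree node; null stands for the empty tree. */
    static final class Node {
        final int value;
        final Node left, right;
        Node(int value, Node left, Node right) {
            this.value = value;
            this.left = left;
            this.right = right;
        }
    }

    /** Height counted in nodes: empty tree 0, single node 1. */
    static int height(Node t) {
        return t == null ? 0 : 1 + Math.max(height(t.left), height(t.right));
    }

    /** True iff at every node the subtree heights differ by at most 1. */
    static boolean isAvl(Node t) {
        if (t == null) return true;
        return Math.abs(height(t.left) - height(t.right)) <= 1
            && isAvl(t.left)
            && isAvl(t.right);
    }

    public static void main(String[] args) {
        // The left-leaning chain 4, 3, 2, 1 from the collection C.
        Node chain = new Node(4,
            new Node(3, new Node(2, new Node(1, null, null), null), null), null);
        // Another binary search tree on the same set {1, 2, 3, 4}.
        Node other = new Node(3,
            new Node(2, new Node(1, null, null), null), new Node(4, null, null));
        System.out.println(isAvl(chain)); // false
        System.out.println(isAvl(other)); // true
    }
}
```

Recomputing heights at every node keeps the sketch close to the definition but is wasteful; it is meant for checking small examples by hand, not as part of an implementation.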
It can be shown that the height of an AVL tree does not exceed about 1.44 log2 n. Hence the AVL collection of binary search trees (whose nodes contain values of some type T, say) is balanced. Moreover, it will become evident shortly (when we exhibit the insertion algorithm) that it is total. Hence the collection of AVL trees is a good candidate for implementing TreeSet.

Searching an AVL tree

Searching an AVL tree is exactly the same as searching a general binary search tree. Since the height of an AVL tree is O(log n), the worst case time complexity of searching is O(log n).

Insertion in an AVL tree

Insertion in an AVL tree proceeds as for insertion in a regular binary search tree (resulting in a new leaf node, recall), but additionally the tree may need re-balancing to keep it AVL.

Balanced Trees 1

Suppose that the values 9, 8, 4, 3, 6, 7 are inserted into an empty AVL tree. 9 and 8 are inserted as usual, but after insertion of 4 node 9 is not AVL (see (i)). It is re-balanced by “rotating” (the tree rooted at) node 9 right (see (ii)). Next 3 is inserted, then 6, but after inserting 7, node 8 is not AVL (see (iii)). Note that the insertion path (shown in bold in the figure) has a “kink” at node 4; this signals that re-balancing needs to proceed in two steps. First, we “rotate left” (the tree rooted at) node 4 (this is the inverse of “rotate right” above), yielding the tree shown in (iv) (if node 6 happened to have had a left subtree in (iii) the subtree would appear in (iv) as the right subtree of node 4). Second, we rotate (the tree rooted at) node 8 right, resulting in the tree shown in (v). Observe that the tree is still a binary search tree, and is AVL.

[Figure: trees (i) to (v). (i) Root 9 with left child 8, which has left child 4. (ii) Root 8 with children 4 and 9. (iii) Root 8 with children 4 and 9; 4 has children 3 and 6; 6 has right child 7. (iv) Root 8 with children 6 and 9; 6 has children 4 and 7; 4 has left child 3. (v) Root 6 with children 4 and 8; 4 has left child 3; 8 has children 7 and 9.]

For more examples, see any of the many websites offering animations of AVL trees, e.g. groups.engin.umd.umich.edu/CIS/course.des/cis350/treetool/ or webdiis.unizar.es/asignaturas/EDA/AVLTree/avltree.html.

In general, re-balancing a tree after insertion requires rotating at most two (sub-)trees.
A rotation is either a right or a left rotation as shown below (the numbers indicate sample heights of the (sub)trees). Check that the property of being a binary search tree is preserved by either rotation.

[Figure: Rotate right / Rotate left. Left tree: root d (height 12) with left child b (height 11) and right subtree F (height 9); b has left subtree B (height 10) and right subtree E (height 9). Right tree: root b (height 11) with left subtree B (height 10) and right child d (height 10); d has left subtree E (height 9) and right subtree F (height 9). Rotate right takes the left tree to the right tree; rotate left is the inverse.]

Call the path in the tree from the point of insertion (of a leaf node) to the root the insertion path (shown in bold where relevant). The following property of AVL trees should be immediately clear:

Inserting a (leaf) node in an AVL tree affects the heights of at most the nodes on the insertion path.

We should expect, therefore, that restoring the AVL property requires at most re-balancing (sub)trees at nodes on this path; we typically say re-balance a node as shorthand for re-balance the (sub)tree rooted at a node.

To describe the re-balancing phase of the insertion algorithm in detail, we define the balance factor of a node to be the height of its left subtree minus the height of its right subtree (so a binary tree is AVL iff the balance factor of every node is 1, 0, or -1). After insertion, some nodes on the insertion path may not be AVL because their balance factor is 2 or -2.

After inserting the new node, we proceed back up the insertion path checking the balance factor at each node. Let node d, say, be the first node we encounter whose balance factor is 2 or -2. We identify the following cases.

(i) Left-left. The balance factor of d is 2 and that of d’s left child is 1; see the left tree in the picture above. Note that node d’s left child b necessarily lies on the insertion path (why?). Observe that a right rotation of node d makes the resulting tree AVL (check!).

(ii) Left-right. The balance factor of d is 2 and that of d’s left child (node b, again on the insertion path) is -1. First picture subtree E expanded, labelling its root node c (you should be able to convince yourself easily that c lies on the insertion path).
Now rotate node b left, yielding the heights indicated. Although the tree is still not AVL, it is not AVL precisely in the manner of case (i) (check this!). So just rotate node d right, and we’re done. In expanding E we attributed a balance factor of -1 to node c (its subtree heights are 8 and 9, resp.); check that the transformation still works if we attribute it a balance factor of 1 (it would also work for a balance factor of 0, but that case can’t arise).

[Figure: the left-right case in three stages. Start: root d (height 12) with left child b (height 11) and right subtree F (height 9); b has left subtree B (height 9) and right child c (height 10); c has left subtree C (height 8) and right subtree D (height 9). After “Rotate b left”: root d (height 12) with left child c (height 11) and right subtree F (height 9); c has left child b (height 10) and right subtree D (height 9); b has subtrees B (height 9) and C (height 8). After “Rotate d right”: root c (height 11) with left child b (height 10) and right child d (height 10); b has subtrees B (height 9) and C (height 8); d has subtrees D (height 9) and F (height 9).]

Convince yourself that the case of d having a balance factor of 2 and d’s left child (node b) having a balance factor of 0 cannot arise.

(iii) Right-right. The balance factor of d is -2 and that of d’s right child is -1. Symmetrical to (i).

(iv) Right-left. The balance factor of d is -2 and that of d’s right child is 1. Symmetrical to (ii).

Now observe the following obvious property regarding insertion in an AVL tree:

If for any node d on the insertion path (starting at the point of insertion) the height of d is the same as it was before the insertion, then so are the heights of the remaining nodes on the path (i.e. up to and including the root).

Hence the balance factors of the remaining nodes are as they were (which is 1, 0, or -1 in each case).

Finally:

At most one node on the insertion path needs re-balancing.

A node is re-balanced because its balance factor has changed from 1 to 2 (or from -1 to -2), and that is because the height of one of its subtrees has increased by 1. However, the act of re-balancing leaves the height of the re-balanced subtree just as it was before the insertion. Check this for the two primary cases above (note that the original height of node d is necessarily 11, and the height of the resulting subtree in either case is again 11). The result then follows from the preceding property.
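The rotations and the four re-balancing cases can be put together into a short insertion routine. This is a sketch under stated assumptions rather than a definitive implementation: the names AvlSketch, Node, update, balance, rotateRight, rotateLeft, rebalance and insert are hypothetical, values are plain ints, each node is assumed to cache the height of its subtree, and height is counted in nodes (a leaf has height 1).

```java
// Sketch of AVL insertion with the four re-balancing cases.
// All names are illustrative assumptions, not taken from the notes.
final class AvlSketch {
    static final class Node {
        int value;
        int height = 1;          // cached height in nodes; a new leaf has height 1
        Node left, right;
        Node(int value) { this.value = value; }
    }

    static int height(Node t) { return t == null ? 0 : t.height; }

    static void update(Node t) {
        t.height = 1 + Math.max(height(t.left), height(t.right));
    }

    /** Balance factor: height of left subtree minus height of right subtree. */
    static int balance(Node t) { return height(t.left) - height(t.right); }

    /** Rotate right at d: d's left child b becomes the root of the subtree. */
    static Node rotateRight(Node d) {
        Node b = d.left;
        d.left = b.right;        // b's right subtree moves under d
        b.right = d;
        update(d);               // d is now the lower node, so update it first
        update(b);
        return b;
    }

    /** Rotate left at b: the inverse of rotateRight. */
    static Node rotateLeft(Node b) {
        Node d = b.right;
        b.right = d.left;
        d.left = b;
        update(b);
        update(d);
        return d;
    }

    /** Restores the AVL property at d, assuming both subtrees are already AVL. */
    static Node rebalance(Node d) {
        update(d);
        if (balance(d) == 2) {
            if (balance(d.left) < 0) d.left = rotateLeft(d.left);     // left-right
            return rotateRight(d);                                    // left-left
        }
        if (balance(d) == -2) {
            if (balance(d.right) > 0) d.right = rotateRight(d.right); // right-left
            return rotateLeft(d);                                     // right-right
        }
        return d;
    }

    /** Ordinary BST insertion, re-balancing on the way back up the insertion path. */
    static Node insert(Node t, int value) {
        if (t == null) return new Node(value);
        if (value < t.value) t.left = insert(t.left, value);
        else if (value > t.value) t.right = insert(t.right, value);
        else return t;           // already present: a set holds no duplicates
        return rebalance(t);
    }

    public static void main(String[] args) {
        Node root = null;
        for (int v : new int[] {9, 8, 4, 3, 6, 7}) root = insert(root, v);
        // Prints the root and its two children: 6 4 8
        System.out.println(root.value + " " + root.left.value + " " + root.right.value);
    }
}
```

Running it on the sequence 9, 8, 4, 3, 6, 7 reproduces tree (v) of the worked example. For simplicity the sketch inspects the balance factor at every node on the way back up; by the properties above it could stop as soon as a node's height is unchanged or one node has been re-balanced.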

Nodes in an AVL tree

Each node t in an AVL tree has a value and two reference fields, just as in a general binary search tree; in addition it has a height field to record the height of the (sub)tree rooted at t. After inserting a new node, the height fields need to be updated only for nodes on the insertion path. Re-balancing and re-calculation of heights, therefore, require only a fixed amount of work at each node on the insertion path. As the maximum length of the insertion path is about 1.44 log2 n, it follows that the time complexity of insertion is O(log n).

As a minor optimisation in storage costs, it is possible to store the balance factor in each node rather than the height. The saving is small: balance factors require only two bits, whereas heights stored as integers typically use 16 or 32 bits.

Deletion in an AVL tree

Deletion is as for general binary search trees, except that we must additionally re-calculate heights and re-balance subtrees along the path from the deletion point to the root. The deletion point here is the position from which the minimum node is removed (refer to deletion in the general binary search tree). The properties of deletion from an AVL tree are not as simple as those for insertion, and so deletion is considerably more complex.
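
The O(log n) bounds above rest on the claim that an AVL tree with n nodes has height at most about 1.44 log2 n. One way to make this plausible is to count the fewest nodes an AVL tree of a given height can have: the sparsest tree of height h consists of a root plus the sparsest trees of heights h-1 and h-2. The sketch below tabulates this recurrence and compares h with 1.44 log2(n + 2), a slightly more careful form of the bound. The names AvlHeightBound, minNodes and log2 are hypothetical, and height is assumed to be counted in nodes.

```java
// Sketch: fewest nodes in an AVL tree of each height, compared with the height bound.
// Names are illustrative assumptions; height is counted in nodes (empty tree = 0).
final class AvlHeightBound {
    /** Fewest nodes an AVL tree of the given height can have. */
    static long minNodes(int height) {
        if (height == 0) return 0;
        long prev = 0;                    // minNodes(0)
        long curr = 1;                    // minNodes(1)
        for (int h = 2; h <= height; h++) {
            long next = curr + prev + 1;  // root + sparsest subtrees of heights h-1, h-2
            prev = curr;
            curr = next;
        }
        return curr;
    }

    static double log2(double x) { return Math.log(x) / Math.log(2.0); }

    public static void main(String[] args) {
        for (int h = 1; h <= 10; h++) {
            long n = minNodes(h);
            System.out.printf("h=%d  minNodes=%d  1.44*log2(n+2)=%.2f%n",
                              h, n, 1.44 * log2(n + 2));
        }
    }
}
```

The node counts grow like the Fibonacci numbers (1, 2, 4, 7, 12, 20, ...), which is where the constant 1.44 comes from: it is approximately 1 / log2 of the golden ratio.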