Binary Search and Binary Tree
• Binary Search
• Heap
• Binary Tree

Search on Data
• Search is one of the fundamental problems in computer science
• It consists of methods to quickly answer the question "is this item in the data?" (called a query)
  → one way is to use buckets and hashing
• Here we approach the problem not from the way the data is stored, but from the search method itself

Consult a Dictionary
• We want to find the position of a word in a dictionary
• How do we do this?
  + check all words one by one from the beginning
    → called a linear scan; O(n) time
  + open an arbitrary page; if the word is not there, check the former/latter half
    → faster than a linear scan; the candidate pages are narrowed down

Binary Search
• For conciseness, we assume that the data is a collection of numbers
• As preparation, sort the data; let s be the position (index) of the first number, and t that of the last
• For a query finding q, we first look at the center
  + if the center is q, answer its position
  + if not, compare q and the center to refine the area to be searched
(figure: s and t on the sorted array 1 3 7 8 9 11 13 17 18 19)

Refine the Search Area
• The center > q
  → q must be in the left side; set t to the position just before the center
• The center < q
  → q must be in the right side; set s to the position just after the center
• When t < s, stop
• The search space is halved again and again, iteratively

Computation Time for Binary Search
• In each iteration, the search area becomes half or less
  → after at most log2 n iterations, the search area has length one and the search terminates
  → the computation time is O(log2 n), which is optimal in the sense of complexity theory
• No large extra memory is needed, just two variables besides the O(n) input data (called "in place")
• So, very good (a C sketch of the procedure is given below)

Exercise
• On the following number sequence, perform a binary search for the queries 8, 17 and 19 (trace the movements of s and t):
  1 3 7 8 9 11 13 17 18 19

Weak Points of Array Data
• An array needs a long time (O(n)) to keep the increasing order under insertion and deletion at an arbitrary position
• If we use a list instead of an array, we can insert/delete in O(1) time, but it takes a long time (O(n)) to find the middle of the order
• In general, it is not trivial to be efficient for both search and insertion/deletion
• … however, there are some ways

Finding the Minimum
• As a first step, we aim at fast insertion/deletion and fast search for the minimum value
• Problem:
  + store several (many) numeric values
  + insertion of a new value into, and deletion of a value from, the data structure has to be done quickly
  + the minimum value among the stored values can be found quickly
• Generally, a data structure with these functions is called a heap

Determine the Winner
• Determine the fastest runner in a school
• They cannot all run at once, so each class determines its fastest runner; then we find the fastest among the class-fastest runners
• The class-fastest is also determined by splitting the students into smaller groups
• For determining the strongest football team, only two teams can play at once, so we use a knockout (tournament) system

Finding the Minimum
• Let us do the same for numeric values (a knockout system)
• … after the tournament is decided, the minimum may change when we modify a value; how can we update it?
• "A non-minimum value gets smaller" is easy; just compare the new value with the minimum, so for this we only have to keep the minimum
• When the minimum value increases (or we delete it), do we have to re-compute everything?
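The binary search described above can be written in a few lines of C. This is a minimal sketch over a sorted int array; the function name binary_search and the convention of returning -1 when q is absent are illustrative choices, not from the original slides.

  /* binary search for q in the sorted array data[0..n-1] */
  int binary_search ( int *data, int n, int q ){
    int s = 0, t = n-1, c;               /* s, t: current search area */
    while ( s <= t ){
      c = (s+t) / 2;                     /* look at the center */
      if ( data[c] == q ) return ( c );  /* found: answer the position */
      if ( data[c] > q ) t = c-1;        /* q must be in the left side */
      else s = c+1;                      /* q must be in the right side */
    }
    return ( -1 );                       /* t < s: q is not in the data */
  }

For the example array 1 3 7 8 9 11 13 17 18 19 and q = 17, the sketch inspects 9 and then 17, and returns index 7.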
Re-computation is NOT Whole
• Where do we have to re-compute when the minimum increases (and becomes non-minimum)? Actually, not everywhere
• Only the results above the modified value can change; the others never do
• Seen the other way: every result that has the modified value below it has to be checked

Time for Re-computation
• How long does the re-computation take?
  → it is linear in the height of the knockout tree (this tree is often called a heap tree)
• The number of nodes doubles at each level going down from the top
• So it takes at most log2 n + 1 steps to reach the bottom level
• The time for re-computation is O(log n)

Insertion and Deletion
• We keep the invariant that, everywhere in the tree, the left branch is never smaller than the right
• To insert a new value into the heap, we put it at the rightmost position of the bottom level (or at the leftmost position of a new level if there is no space)
• To delete a value, assign the value of the rightmost cell of the bottom level to the position to be deleted, and reduce the size by one
• Both need O(log n) time

Realize the Heap
• To realize the heap we need some way to represent the structure; shall we use cells & pointers, as for lists?
• That is indeed a good way: represent the adjacency relations by pointers to the parent, the left child, and the right child
• However, we can also do it without pointers

Structure by Array
• Trace the heap from top to bottom, each level from left to right, and give the nodes indices starting from 0
  → when #leaves is n, the array has 2n-1 cells (the largest index is 2n-2)
• Then the index of the parent/children can be computed arithmetically
(figure: nodes indexed 0 to 12, level by level)

The Index of an Adjacent Cell
• The index of the cell adjacent to cell i:
  + up (parent): (i-1)/2 (rounded down)
  + left-down (left child): i*2+1
  + right-down (right child): i*2+2
• If i ≥ n-1, the cell is a leaf and has no child

Structure of Heap
• The heap structure is composed of an array, the array size, and the current heap size
• A subroutine changes the value of cell i to a

  typedef struct {
    int *h;     // array for values
    int end;    // size of array
    int num;    // current size of heap
  } AHEAP;

  void AHEAP_chg ( AHEAP *H, int i, int a ){
    int j;
    H->h[i] = a;
    while ( i>0 ){
      j = i - 1 + (i%2)*2;          // j := sibling of i
      if ( H->h[j] < a ) a = H->h[j];
      i = (i-1) / 2;                // i := parent of i
      if ( H->h[i] == a ) break;    // no need to update
      H->h[i] = a;
    }
  }

Insert & Delete
• To insert, increase num and change the value of the last cell to a

  void AHEAP_ins ( AHEAP *H, int a ){
    H->num++;
    H->h[H->num*2-3] = H->h[(H->num*2-2)/2];
    AHEAP_chg ( H, H->num*2-2, a );
  }

  void AHEAP_del ( AHEAP *H, int i ){
    AHEAP_chg ( H, i, H->h[H->num*2-2] );
    AHEAP_chg ( H, (H->num*2-2)/2, H->h[H->num*2-3] );
    H->num--;
  }

Find the Cell of the Minimum Value
• Start from the top cell (cell i), and go down toward the child holding the minimum value

  int AHEAP_findmin ( AHEAP *H, int i ){
    if ( H->num <= 0 ) return (-1);
    while ( i < H->num-1 ){
      if ( H->h[i*2+1] == H->h[i] ) i = i*2+1;
      else i = i*2+2;
    }
    return ( i );
  }

Find All Values ≤ Threshold
• Find the leftmost leaf with value ≤ the threshold a

  int AHEAP_findlow_leftmost ( AHEAP *H, int a, int i ){
    if ( H->num <= 0 ) return (-1);
    if ( H->h[0] > a ) return (-1);
    while ( i < H->num-1 ){
      if ( H->h[i*2+1] <= a ) i = i*2+1;
      else i = i*2+2;
    }
    return ( i );
  }

• Find the next leaf to the right of cell i with value ≤ a

  int AHEAP_findlow_nxt ( AHEAP *H, int a, int i ){
    for ( ; i>0 ; i=(i-1)/2 ){
      if ( i%2 == 1 && H->h[i+1] <= a )
        return ( AHEAP_findlow_leftmost ( H, a, i+1 ) );
    }
    return (-1);
  }
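The two routines above can be combined to report every stored value that is at most a given threshold, from left to right. The following is a minimal usage sketch, assuming the heap H has already been filled; the wrapper name AHEAP_report_low is an illustrative choice, not from the original slides.

  #include <stdio.h>

  /* print all leaf values <= a in the knockout heap, left to right */
  void AHEAP_report_low ( AHEAP *H, int a ){
    int i = AHEAP_findlow_leftmost ( H, a, 0 );  /* leftmost leaf <= a, or -1 */
    while ( i != -1 ){
      printf ("%d\n", H->h[i]);
      i = AHEAP_findlow_nxt ( H, a, i );         /* next leaf to the right <= a */
    }
  }

Each reported value costs O(log n) time, so the total time is O(k log n) when k values are reported.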
Example of Usage
• Sort numbers (in increasing order):
  + insert all numbers into a heap
  + extract the minimum number repeatedly
• Clustering on a similarity graph (gather the nearest pairs, iteratively)

Ex. Huffman Tree
• We have n words (or similar items), each with a frequency
  + insert all frequencies into a heap
  + extract the two minimums and merge them into a node whose frequency is their sum (the two are its children, the merged node is their parent)
  + insert the new node into the heap
• Finally, we obtain a tree structure
• Assigning 0 to each left child and 1 to each right child, every word gets a 01 code, obtained by tracing the path from the root to it
• This code gives an optimal code assignment
(figure: Huffman tree for frequencies A=9, B=6, C=5, D=4, E=3, F=8)

Exercise: Heap
• Construct a heap with the following values, then insert the values 7, 2, and 13, one after another:
  4, 6, 8, 9, 11, 15, 17

Memory Efficiency
• 2n-1 cells are used to store n values
  → almost twice the space
• Is there a way to store the values more efficiently?
  → store values on the inner cells as well

Heap on Textbooks
• The heap in usual textbooks is of this type
• In the "usual heap" we keep the condition "a parent has a value smaller than its children"
  → the top cell always holds the minimum value
• We update the heap while keeping this condition, so the minimum stays easy to find

Update Heap
• Modification of a value is done by swapping parent and child whenever they are in the opposite relation, going up (or down) until the condition is satisfied
• Insertion is done by appending a cell at the right end
• Deletion is done by moving the right end cell to the deleted position, and decrementing the size
• Almost the same as the previous heap (a code sketch of insertion and delete-min built on the routines below is given after the exercise)

A Code for Value Change
• The heap structure is the same
• Modify the value of cell i to a

  typedef struct {
    int *h;     // array for values
    int end;    // size of array
    int num;    // current size of heap
  } HEAP;

  void HEAP_chg ( HEAP *H, int i, int a ){
    int aa = H->h[i];
    H->h[i] = a;
    if ( aa > a ) HEAP_up ( H, i );
    if ( aa < a ) HEAP_down ( H, i );
  }

Update Heap (upward)
• Go upward with swapping when the value decreases, and downward otherwise

  void HEAP_up ( HEAP *H, int i ){
    int a;
    while ( i>0 ){
      if ( H->h[(i-1)/2] <= H->h[i] ) break;
      a = H->h[(i-1)/2];  H->h[(i-1)/2] = H->h[i];  H->h[i] = a;
      i = (i-1)/2;
    }
  }

• The position of a value changes during updates, which is a disadvantage if we want to keep track of where each value is stored

Update Heap (downward)
• Increasing a value may violate the parent-child condition below it
• Then we have to swap parent and child; we choose the smaller child, and go further down

  void HEAP_down ( HEAP *H, int i ){
    int ii, a;
    while ( i < H->num/2 ){
      ii = i*2+1;
      if ( i*2+2 < H->num && H->h[ii] > H->h[ii+1] ) ii = ii+1;  // smaller child
      if ( H->h[ii] >= H->h[i] ) break;
      a = H->h[ii];  H->h[ii] = H->h[i];  H->h[i] = a;
      i = ii;
    }
  }

Find Values ≤ Threshold
• Relatively simple by using recursion

  void HEAP_findlow ( HEAP *H, int a, int i ){
    if ( i >= H->num ) return;
    if ( H->h[i] > a ) return;
    printf ("%d\n", H->h[i]);
    HEAP_findlow ( H, a, i*2+1 );
    HEAP_findlow ( H, a, i*2+2 );
  }

Exercise: Heap (2)
• Construct a usual heap with the following numbers, then insert the numbers 7, 2, and 13, one after another:
  4, 6, 8, 9, 11, 15, 17
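The slides describe insertion and deletion of the usual heap only in words; the following is a minimal sketch built on HEAP_up and HEAP_down above. The names HEAP_ins and HEAP_extractmin are illustrative, not from the original, and the sketch assumes the array still has room (H->num < H->end) and that the heap is non-empty when extracting.

  /* append at the right end, then restore the parent <= child condition upward */
  void HEAP_ins ( HEAP *H, int a ){
    H->h[H->num] = a;
    HEAP_up ( H, H->num );
    H->num++;
  }

  /* the top cell holds the minimum: move the right end cell to the top
     and push it down until the condition holds again */
  int HEAP_extractmin ( HEAP *H ){
    int min = H->h[0];
    H->num--;
    H->h[0] = H->h[H->num];
    HEAP_down ( H, 0 );
    return ( min );
  }

With these, the sorting usage described earlier becomes: call HEAP_ins for all n values, then HEAP_extractmin n times; the values come out in increasing order, in O(n log n) total time.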
Column: Speed of Heap in Practice
• A heap needs O(log n) time for one operation
• However, in practice it is 4 or 5 times slower than plain array operations, even with 1,000,000 cells (log2 1,000,000 ≈ 20)
• Why does this happen?

Column: Speed of Heap in Practice (2)
• A heap update involves operations from the root to a leaf
• Once it is done, the accessed cells are stored in cache memory and can be accessed quickly the next time
• After several updates, the upper part of the heap sits in the cache; only the lower part needs long memory access times
• The observed slowdown suggests that this lower, uncached part consists of about 4 or 5 levels

Here, Terminology on Trees
• (In graph theory) the structure composed of vertices (or nodes) and edges connecting pairs of vertices is called a graph
• A connected graph without a cycle (ring, circuit) is called a tree
• A tree with a designated top vertex, called the root, is a rooted tree
• For a vertex x of a rooted tree:
  + the vertices on the path between x and the root are the ancestors of x
  + the vertices having x as an ancestor are the descendants of x
  + the vertex adjacent to x that is an ancestor is the parent of x
  + the other vertices adjacent to x are the children of x
  + the tree composed of all descendants of x is the subtree rooted at x
• A vertex with no child is a leaf
• A vertex with some children is an inner vertex
• The distance to the root is the depth of a vertex
• The maximum depth over all vertices is the height (depth) of the tree
• A tree is a binary tree if every vertex has #children ≤ 2
• A tree is a full binary tree if every vertex has #children = 0 or 2

Find Any Value
• The heap is simple, which is good, but we also want to find an arbitrary value in the data quickly
• To perform binary search, a tree structure like the heap is good, but insertion/deletion take a long time if we must keep the increasing order
• To keep the ordering, we have to be able to delete/insert at any position quickly

When the Order is Kept
• If the values at the leaves are sorted, we can perform a binary search by going down the tree from the top
• To realize this, we write on each node the maximum value among its descendants
  → we can decide whether to go left or right by looking at this value
• This allows quick insertion/deletion, by permitting an ill-formed (unbalanced) tree:
  + to insert, we attach two children to the vertex whose value is the smallest one larger than the inserted value
  + to delete, we copy the sibling vertex onto the parent and delete both children

Skew Would Grow
• Search/update time is linear in the depth of the target leaf
• The operations are fast when the tree is balanced, so that its height is low, but they take a long time when the tree is skewed
  → this happens after many insertions at the same place
• To speed up the operations, we need some additional mechanism

Eliminate the Skew
• The optimal search time is O(log n)
• So we try to bound the time by c log n for some constant c
• Where deep leaves exist, shallow places must exist somewhere else
  → deepen the shallow areas and make the deep places shallow, while keeping the ordering
• This can be done by locally re-forming the tree, by rotating children and their parents

Balancing by Rotation
• Suppose a vertex has two child subtrees whose heights differ by two or more (say the left is higher)
• We swap the positions of the parent and the higher child (a rotation)
• By one rotation, the gap between the heights decreases by two

Bounding the Height
• By repeatedly applying rotations we keep, for every vertex, the heights of its two children differing by at most one
• Can we then say something about the height k?
  + there is at least one vertex of depth k-1 (in another branch, branching off at the root or at the root's child)
  + there are at least two vertices of depth k-2 (branching off at depth 2 or 3)
  + there are at least 2^(h-1) vertices of depth k-h (branching off at depth 2h or 2h+1)
  + ....
  → the number of vertices in the tree is at least 2^(k/3)
• If there are n leaves, the height is at most 3 log2 n = O(log n)
• Such a tree of height O(log n) is called a balanced tree

Time for Search
• "Finding a value" means tracing the path from the root to a leaf
• The time for the search is at most the depth of the tree
• When #leaves is n, the height is ≤ 3 log2 n = O(log n)
  → therefore the search time is O(log n)

Effects of a Rotation
• When we rotate the tree at a vertex x, does any new vertex appear at which we now have to rotate?
  + descendants of x: OK, the heights of their children do not change
  + vertices that are neither ancestors nor descendants of x: also OK
  + for the ancestors of x, the height of one child can change
• … so, if we rotate at a vertex, its ancestors may have to be rotated
  → we therefore rotate from the vertex up to the root, iteratively

Insertion and Deletion
• When we insert or delete a vertex, its ancestors may violate the balance condition
• The height increases/decreases by one, so one rotation per ancestor is sufficient
• Trace the ancestors from the operated vertex and perform a rotation where necessary (we can stop as soon as no rotation is needed at an ancestor)
• The height of the tree is O(log n) and a rotation takes constant time, so insertion and deletion with re-balancing can be done in O(log n) time

Rotation by Other Criteria
• New criterion: if the size of the subtree rooted at a grandchild is more than half of the size of the subtree rooted at the vertex, then rotate
• By rotating, the maximum size among the grandchildren's subtrees decreases by at least one
  → the subtree size is (at least) halved by going down two levels

The Height of the Tree
• Since the size is halved every two levels, we can go down only O(log n) levels
  → the height is at most 4 log2 n = O(log n) when #leaves is n

Insertion and Deletion
• This rotation does not affect any of its ancestors (the number of descendants is not changed by a rotation)
• Trace the ancestors from the operated vertex and perform a rotation where necessary (here we cannot stop even if no rotation is needed at some ancestor)
• The height of the tree is O(log n) and a rotation takes constant time, so insertion and deletion with re-balancing can be done in O(log n) time

Structure for a Binary Tree
• We need pointers in this case, since the shape of the binary tree is not uniform and regular
• For the rotation criteria, we keep the height and the size of the subtree rooted at each vertex
• As with lists, the cells can also be kept in an array (a code sketch of a rotation on this structure is given after the exercise)

  typedef struct btree_ {
    struct btree_ *p;   // -> parent
    struct btree_ *l;   // -> left child
    struct btree_ *r;   // -> right child
    int height;         // height of the subtree
    int size;           // size of the subtree
    int value;          // (max) value
  } BTREE;

Examples of Usage
• Dictionary data, storage for IDs
• Keyword search in a document …

Exercise: Binary Tree
• Rotate the vertices of the following tree where necessary (examine both criteria)
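The rotation itself is only drawn in the slides; below is a minimal sketch of a single right rotation on the BTREE structure above. It assumes the conventions that an empty subtree has height -1 and that size counts all vertices of the subtree; the helper names (BTREE_height, BTREE_size, BTREE_update, BTREE_rotate_right) are illustrative, not from the original. A left rotation is symmetric.

  #include <stddef.h>

  int BTREE_height ( BTREE *v ){ return ( v == NULL ? -1 : v->height ); }
  int BTREE_size   ( BTREE *v ){ return ( v == NULL ?  0 : v->size   ); }

  void BTREE_update ( BTREE *v ){       /* recompute the cached fields of v */
    int hl = BTREE_height ( v->l ), hr = BTREE_height ( v->r );
    v->height = 1 + ( hl > hr ? hl : hr );
    v->size = 1 + BTREE_size ( v->l ) + BTREE_size ( v->r );
  }

  /* rotate right at x: x's left child y moves up, x becomes y's right child,
     and y's old right subtree becomes x's left subtree */
  BTREE *BTREE_rotate_right ( BTREE *x ){
    BTREE *y = x->l, *p = x->p;
    x->l = y->r;  if ( y->r != NULL ) y->r->p = x;
    y->r = x;     x->p = y;
    y->p = p;
    if ( p != NULL ){ if ( p->l == x ) p->l = y; else p->r = y; }
    BTREE_update ( x );   /* x first, since it is now below y */
    BTREE_update ( y );
    return ( y );         /* the new root of this subtree */
  }

Only the cached height and size of the two rotated vertices need to be repaired, which is why one rotation is a constant-time operation, as used in the O(log n) bounds above.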
Many Children
• Each inner vertex of our binary tree always has two children
• Why two?
  + the update cost is optimal
  + search and update have the same cost
  + the operations on children are simple
• Can we gain anything by allowing more than two children?
• The 2-3 tree is an example; #children is 2 or 3
  + the depths of all leaves are the same
  + however, the operations on children are not so simple (choosing the minimum among three, splitting three into two, …)
• Can we increase the number further?

B-tree
• A tree is a B-tree if #children of every vertex is bounded by B
• There are good motivations for this
• Consider an HDD or a tape, where accessing a block is expensive, but reading a whole block takes not much longer than reading a single bit
  → the computation time depends on the number of blocks we access
• Then a simple solution is to increase the maximum number of children so that one node fits in a block (a possible node layout is sketched after the summary)

Update of a B-tree
• If the definition were "all vertices have exactly B children", the memory usage would be efficient; however, we would have to update the structure everywhere, frequently
• On the other hand, the efficiency drops if many vertices have few children
  → bound the number of children between B/2 and B
  → if a parent and its child, or two siblings, have at most B children in total, we merge them into one node
• By applying rotations, the height of the tree is bounded by O(log_{B/2} n)

Summary
• Binary search: the search area is halved, at most log n times
• Heap: simulate the update of a knockout system
• Binary tree: rotate at vertices to re-balance the tree
• B-tree: minimize the number of blocks to be accessed
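For concreteness, here is a minimal sketch (not from the slides) of how a B-tree node could be laid out so that one node fits in a block, together with the descent used for search. The constant B, the field names, and the function BNODE_descend are assumptions for illustration.

  #include <stddef.h>

  #define B 64                       /* assumed maximum number of children */

  typedef struct bnode_ {
    int num;                         /* current number of children at this node */
    int key[B-1];                    /* keys guiding the search: key[i] separates child[i] and child[i+1] */
    struct bnode_ *child[B];         /* child pointers; NULL at the leaves */
  } BNODE;

  /* go down from the root to the leaf whose range may contain q;
     each step reads exactly one node, i.e., one block */
  BNODE *BNODE_descend ( BNODE *v, int q ){
    int i;
    while ( v->child[0] != NULL ){
      for ( i=0 ; i < v->num-1 && v->key[i] < q ; i++ );   /* choose the child */
      v = v->child[i];
    }
    return ( v );
  }

Since the height is O(log_{B/2} n), such a descent touches only O(log_{B/2} n) blocks.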