ON TREE-GROWING SEARCH STRATEGIES George Washington

The Annals of Applied Probability 1996, Vol. 6, No. 4, 1284 ] 1302 ON TREE-GROWING SEARCH STRATEGIES BY JANICE LENT AND HOSAM M. MAHMOUD1 George Washington University Using the concept of a ‘‘tree-growing’’ search strategy, we prove that for most practical insertion sorting algorithms, the number of comparisons needed to sort n keys has asymptotically normal behavior. We prove and apply a sufficient condition for asymptotically normal behavior. The condition specifies a relationship between the variance of the number of comparisons and the rate of growth in height of the sequence of trees that the search strategy ‘‘grows.’’ 1. Introduction. Insertion sort is a popular on-line sorting algorithm. At each stage of the sorting, the data obtained so far make up a sorted array. The algorithm reads in a new datum, searches for its proper position in the array and inserts it. When a data file of size n G 2 is to be sorted, the ith search is for the position of the Ž i q 1.st key, i s 1, 2, . . . . We shall denote the algorithm that performs the ith search by Si and call the collection of search algorithms S s Si 4ìs1 a search strategy. Doberkat Ž1982. and Panny Ž1986. have studied the moments of the number of comparisons required for insertion sort with one particular Žlinear. search strategy. In this paper, we show that if the values in a data array have been randomly selected from a continuous distribution, the number of comparisons needed to sort the values by most practical search strategies has asymptotically normal behavior. Thus we investigate properties of search strategies in general, with insertion sort serving as a conspicuous application. In order to insert a new key in a sorted data array, an implementation of insertion sort selects a sequence of probes: keys in the sorted array to which the algorithm compares the new key. If the new key is larger than a given probe, the algorithm selects a larger probe; if the probe’s value exceeds that of the new key, the search continues in the segment of the data array containing keys smaller than the probe. In general, the search algorithms applied for different keys in an insertion sort may be independent of each other. One may, for example, imagine a search strategy S s Si 4ìs1 in which Sk is binary search and Skq 1 is linear search. Practitioners, however, usually prefer strategies in which all the algorithms applied are coded as a single procedure which is then invoked with different parameters at different stages of the sorting. Most efficient, easy-to-use strategies thus possess certain Received January 1995; revised August 1996. 1 Research partially supported by NSA Grant MDA904-92-H3086. AMS 1991 subject classifications. Primary 60F05, 68P10; secondary 05C05. Key words and phrases. Searching, sorting, limit distributions. 1284 TREE-GROWING STRATEGIES 1285 consistency properties, which we will specify and use to prove our main result. Rudimentarily, when all the search algorithms Si in a strategy S specify probe positions through a reasonable function of the length of the fragment of the data array being searched, we will call S a consistent strategy. To identify a class of reasonable probe-selection functions, we introduce decision trees: deterministic binary trees whose nodes correspond to the positions of the probes selected by a search algorithm. Since the binary tree is therefore a fundamental instrument in our analysis, we briefly review its definition and basic properties. A binary tree is a hierarchical structure comprising nodes organized in levels. A node can have either Ža. no children }in which case the node is called a leaf, Žb. a left child, Žc. a right child or Žd. two distinct children, one left and one right. It is customary to draw a binary tree with the root}the only node with no ancestors}at the top and the leaves at the bottom. The level or depth of a specific node is the length of the path Žthe number of edges. from the node to the root node, which is on level zero. The size of a binary tree is the number of its nodes. If level k of a tree has 2 k nodes}the maximum number possible}it is said to be saturated. A tree in which all existing levels, except possibly the lowest, are saturated is complete. ŽSome authors require that the nodes on the lowest level of a complete tree be positioned as far to the left as possible, but the positioning of the nodes within a level is irrelevant to our analysis.. The height of the tree is the length of the longest root-to-leaf path in the tree. We extend a binary tree by adding enough external nodes Žsquares in our diagrams. to ensure that each original internal node Žbullets in our diagrams. has two children. Each leaf thus has two external nodes as its left and right children. An extended binary tree of size n has n q 1 external nodes. A decision tree representing a search algorithm is constructed as follows: The position of the first probe chosen becomes the root of the decision tree; the two possible positions of the second probe}at most one on each side of the first probe}become its children and so forth. A probe falling at one of the ends of the data array will not have two internal nodes as children; one or both of its children will then be external nodes in the extended tree. The algorithm Si is represented by Ti , a decision tree of size i whose root-to-leaf paths represent the possible probe sequences of Si . We shall call the search strategy S a tree-growing strategy if, for every positive integer i, the shape of Tiq1 may be obtained by adding an internal node in one of the insertion positions}one of the external nodes}of Ti . Since the tree-growing property provides an element of consistency across algorithms in the search strategy, we will restrict our class of consistent strategies to those possessing this property. Assume the ranks of the data elements being sorted form a random permutation of the integers 1, . . . , n4 , as is the case when the data values are selected at random from any continuous distribution. Let the random variable Cn be the total number of times S compares a new key to a probe during the sorting of the first n keys. The class of normal search strategies 1286 J. LENT AND H. M. MAHMOUD comprises strategies for which Cn is asymptotically normal; that is, Cn y E w Cn x ªD N Ž 0, 1 . . Var w Cn x ' We shall see that the normality of a subclass of tree-growing strategies}the consistent strategies}may be established by relating the variance of Cn to the rate of growth in height of the decision trees in the family T s Ti 4ìs1. Our sufficient condition for normality establishes asymptotic normality but does not give the asymptotic mean or variance of Cn , which obviously depend on the particular strategy S . Identifying these requires additional investigation of the strategy. For linear search and binary search, the two most commonly used strategies, we will establish asymptotic normality and calculate the asymptotic means and variances which serve as centering and scaling factors. The next section provides examples of tree-growing search strategies. In Section 3 we identify a subclass of tree-growing strategies as consistent. The remaining sections focus on the normality of consistent strategies. Section 4 gives a sufficient condition for the normality of tree-growing strategies. In Section 5, we use this condition to establish the asymptotic normality for binary search and linear search. We compute the mean and variance for the binary search strategy}the variance includes periodic fluctuations}and use the well-known mean and variance for the linear search strategy w Gonnet and Baeza-Yates Ž1991.x to derive full asymptotic distributions. In Section 6, we use the sufficient condition to prove our main result: all consistent strategies are normal. ŽThis result is further extended in the Appendix to include some tree-growing implementations of Fibonaccian search.. 2. Examples of tree-growing strategies. We illustrate the tree-growing property through simple examples. One commonly used strategy known as linear search from the bottom always selects as a first probe the largest value in the data array being searched. When searching for the position of the Ž i q 1.st key, linear search from the bottom probes positions i, i y 1, i y 2 and so forth, until it discovers a key less than the Ž i q 1.st key. If it finds such a key in position j, it inserts the new key in position j q 1. Though linear search from the bottom is inefficient, it is useful for searching sequential access data files w e.g., data stored on a tape, as discussed by Hu and Wachs Ž1987.x . Figure 1 shows the decision trees of linear search from the bottom. The shape of Tiq1 is obtained from Ti by adjoining a leaf to replace the leftmost external node of Ti . The i nodes of Ti are relabeled in Tiq1 , and it is the shape of Tiq1 that evolves from that of Ti by ‘‘tree-growing.’’ Variants of linear search from the bottom include the obvious linear search from the top as well as a slightly more complex method called c-jump search. Still based on sequential searching, c-jump search starts at the ‘‘top’’ of the data array and advances c positions at a time. When it finds a probe larger than the new key, it reverses direction and performs a linear search ‘‘from the bottom’’ of the relevant fragment Ža fragment of size c y 1. of the array. Three-jump search is illustrated by the decision trees of Figure 2. TREE-GROWING STRATEGIES 1287 FIG. 1. The extended decision trees for linear search from the bottom. FIG. 2. The Ž unextended . decision trees of three-jump search. A more efficient search method also commonly used is binary search, which always probes the middle position of the remaining portion of the list. Thus when the search has been confined to a fragment of the data array between an upper bound u and a lower bound l, binary search chooses the probe in position Ž l q u . r2 . For i s 6, the decision tree representing binary search is that of Figure 3. Though all of the search strategies discussed above lie in the class of consistent strategies defined in the next section, strategies reasonable for some applications lie outside this class. Examples include alternating linear 1288 J. LENT AND H. M. MAHMOUD FIG. 3. The Ž unextended . decision tree of binary search. FIG. 4. The Ž unextended . decision trees of alternating linear search. FIG. 5. The Ž unextended . decision tree of Fibonaccian search in an array of 12 keys. search}a hybrid of linear search from the bottom and linear search from the top}which is represented by the decision trees of Figure 4. Figure 5 shows a decision tree for Fibonaccian search, a method so called because its decision tree is endowed with a recursive partition that follows the Fibonacci number sequence. ŽThe usual definition of the Fibonacci number Fj is Fj s Fjy1 q Fjy2 , with F0 s 0, F1 s 1. Thus F2 s 1, F3 s 2, F4 s 3, F5 s 5 and so forth.. When the tree size is Fkq 1 y 1, where Fj is the jth Fibonacci number, the two subtrees of the root node have sizes Fk y 1 and Fky1 y 1, and the property propagates recursively in the subtrees. Fibonaccian search is sometimes preferred to binary search because Fibonaccian search requires only the operations of addition and subtraction to compute the position of the next probe w see Knuth Ž1973., Section 6.2.1x . Like alternating linear search, Fibonaccian search fails to meet our consistency criteria. We will see, however, that all of the search strategies discussed TREE-GROWING STRATEGIES 1289 here are normal, except possibly some implementations of Fibonaccian search. ŽThe normality of Fibonaccian search may depend on the search strategy used when the number of keys in the array being searched is not of the form Fk y 1. In the Appendix, we discuss two implementations of Fibonaccian search for which we can prove normality.. 3. Consistent strategies. In this section we present a condition on the search strategy S that is equivalent to the tree-growing property and give a rigorous definition of consistency. Let Ti be the decision tree representing the search algorithm Si }the algorithm used to insert the Ž i q 1.st key. Let A i be the sorted array formed by the first i keys. We search for the correct position of the Ž i q 1.st key by successively selecting probes in A i , at each step comparing the Ž i q 1.st key to the selected probe. At the first step in the insertion, we select one probe; at the second step, we again select one probe, but the probe we select depends on the outcome of our comparison of the Ž i q 1.st key to the first probe. Thus there are two possible second probes: one in case the Ž i q 1.st key is larger than the first probe and another in case it is smaller Žunless, of course, the first probe selected falls at one of the ends of A i }in this case, the algorithm would specify only one probe at the second step.. In general then, at the jth step in the insertion, the algorithm specifies up to 2 jy1 possible probes. Let Pi j be the set of all possible positions for the first j probes in the insertion of the Ž i q 1.st key. Then Pi j has at most 2 j y 1 elements, which correspond to the nodes on levels zero through j y 1 of Ti . Let AŽ i, j, 1., AŽ i, j, 2., . . . , AŽ i, j, 2 j . be the subarrays of A i Žin increasing order, though some may be empty. created by the partitioning which results from deleting all elements of Pi j . If we have some adjacent probes or probes at either of the ends of the array, some of the subarrays AŽ i, j, k . will be empty. Let l i jk be the length of AŽ i, j, k ., k s 1, 2, . . . , 2 j w zero if AŽ i, j, k . is emptyx . For the Ž j q 1.st step in the insertion, the algorithm specifies one probe from each of the nonempty subarrays. If AŽ i, j, k . is not empty, let pi jk be the position of the next probe within the subarray AŽ i, j, k ., relative to the other elements of the subarray. It is easy to show that the search strategy S has the tree-growing property if and only if, for all i, j and k such that AŽ i, j, k . is not empty, pi jk s f Ž l i jk , j, k . , for some function f such that f Ž l q 1, j, k . s f Ž l, j, k . q a Ž l, j, k . and a Ž l, j, k . takes values in the set 0, 14 , where l is any natural number. So the tree-growing property implies that Ž1. the probe selected in AŽ i, j, k . at the Ž j q 1.st step cannot depend directly on i, the index of the insertion, and Ž2. if the array AŽ i, j, k . were larger by one element, the position of the probe selected would be either pi jk or pi jk q 1. The tree-growing property thus provides some similarity of probe selection functions within and across the 1290 J. LENT AND H. M. MAHMOUD algorithms Si , i s 1, . . . , n. We restrict our class of consistent strategies to those which possess additional consistency properties: DEFINITION. Let S s Si 4ìs1 be a tree-growing strategy for insertion sort. We call S a consistent strategy if, for all i, j and k such that AŽ i, j, k . is not empty, Si specifies the relative rank w within AŽ i, j, k .x of the probe selected in AŽ i, j, k . by pi jk s w S Ž l i jk . , 1 F pi jk F l i jk , where the function w S has the following property: if g S Ž n. s minŽ w S Ž n. y 1, n y w S Ž n.., then lim n ª` g S Ž n.rn exists. In words, g S Ž n. is the size of the smaller subtree rooted at level one of Tn , and consistency requires that the proportion of nodes belonging to the smaller subtree approach a limit as n approaches infinity. Note also that for consistent strategies, the position of the probe pi jk within AŽ i, j, k . depends only on l i jk , the length of the subarray. Since consistent strategies are tree-growing, we have w S Ž l q 1. s w S Ž l . q a S Ž l . and a S Ž l . takes values in the set 0, 14 . Examples of consistent strategies include c-jump search, which is specified by the function wS Ž l . s ½ c, l, if l ) c, otherwise. In this case, lim n ª` g S Ž n.rn s 0. For alternating linear search, however, the position of the next probe is a function of both l i jk and j: pi jk s ½ 1, l i jk , j odd, j even. Thus pi jk is not a function of the length only}it depends explicitly on j Žthe step number.. So, though alternating linear search is tree-growing Žsee Figure 4., it is not consistent. 4. A sufficient condition for normality of tree-growing strategies. Let Ti be the decision tree for Si , the search algorithm used to insert the Ž i q 1.st key by the tree-growing search strategy S s Si 4ìs1. Let h i denote the height of Ti , let X i denote the number of comparisons made by Si and let 1 Cn s Ý ny is1 X i , the total number of comparisons required by S for sorting n keys. Each insertion is performed independently of all others, so we take the X i ’s to be independent random variables. Let sn2 s Varw Cn x . We also define the symbol V: if x and c are functions of n, we say that x Ž n. s V Ž c Ž n.. if c Ž n. s O Ž x Ž n... LEMMA 1. If h n s oŽ sn ., then S is a normal strategy. TREE-GROWING STRATEGIES 1291 PROOF. For i s 1, . . . , n y 1, let Yi s X i y Ew X i x and let Fi denote the distribution function of Yi . Then P < Yi < ) h n q 1 4 F P max Ž X i , E w X i x . ) h i q 1 4 s 0, since X i cannot exceed h i q 1. Now if h n s oŽ sn . and « ) 0, there is a constant N« such that, for all n ) N« , h n q 1 - « sn . Then for i s 1, . . . , n y 1, H< y <G« s 4 y 2 Fi Ž y . F dF n H< y <)h q14 y 2 Fi Ž y . s 0, dF n because the integral is taken over a set of zero probability. Thus for n ) N« , ny1 Ý is1 H< y <G« s 4 y 2 Fi Ž y . s 0. dF n So lim 1 ny1 Ý 2 nª` s n is1 H< y <G« s 4 y 2 Fi Ž y . s 0 dF n and the normality of S follows from the Lindeberg theorem w see Billingsley Ž1986., page 368x . I Although most practical strategies satisfy the condition of the lemma, the condition does not hold for all tree-growing strategies. Consider, for example, a strategy described by a sequence of decision trees T1 , T2 , . . . that satisfies the following property: let k 1 s 2 and for each integer i G 2, let k i s k iy1 q 2 k iy 1 r2 . The k i ’s form a sequence of even numbers. For each i G 1, each of the trees Tn for n s 2 k iq1 , . . . , 2 k iq1 q 2 k i r2 y 1 has levels 0 through k i saturated, while its remaining n y 2 k iq1 q 1 nodes form a thin branch at the bottom of the tree; that is, levels k i q 1, . . . , n y 2 k iq1 q k i q 1 of Tn each contain only one internal node. In a recursive fashion, the strategy first grows a complete tree of height k i , then grows a thin branch of length 2 k i r2 at the bottom of the complete tree. Finally, it fills out levels k i q 1 through k i q 2 k i r2 to obtain the next complete tree, a complete tree of height k iq1. Let n i s 2 k iq1 q 2 k i r2 y 1, the size of the smallest decision tree whose thin branch is grown to length 2 k i r2 . The height of Tn i is h n i s k i q 2 k i r2 s V Ž 2 k i r 2 . , whereas it can be shown that s n i s O Ž2 k i r 2 . . Thus lim sup i ª` h n irsn i ) 0. While this strategy is not of practical interest, it provides an example of a tree-growing strategy that violates the condition of Lemma 1. It can also be shown, by the Lindeberg]Feller theorem, that this strategy is not normal. 5. Normality of two commonly used search strategies. We illustrate the use of the sufficient condition for normality by showing the normality of two search strategies commonly used for insertion sort: binary search and linear search. We begin by introducing some notation to be used here and in later sections. 1292 J. LENT AND H. M. MAHMOUD NOTATION. Let S s Si 4ìs1 be a consistent strategy and let T s Ti 4ìs1 be the sequence of decision trees representing S . For i s 1, . . . , n, let h i be the height of Ti , and for k s 1, 2, . . . , let L k s min i: h i s k 4 and Uk s max i: h i s k 4 . That is, TL k, TL kq1 , . . . , TUk are the only trees in T that have height k; thus Uk q 1 s L kq1. Define U0 s 1, and for k s 1, 2, . . . , let m k s Uk y Uky1. As before, let Cn denote the number of comparisons needed to sort the first n keys by insertion sort using S and let X i denote the number of comparisons required for inserting the Ž i q 1.st key. We may write ny1 Cn s Ý Xi . is1 So E w Cn x s ny1 Ý E w Xi x is1 and, since we assume the X i ’s are independent, Var w Cn x s ny1 Ý is1 Var w X i x . Binary search is a consistent strategy which, when searching a subarray of length l, selects as a probe the element whose rank Žrelative to the other elements in the subarray. is u lr2 v . This strategy results in a sequence of complete decision trees. The external nodes of the complete decision tree representing the binary search algorithm Si lie only on one or two levels: the lowest level ? log 2 i @ Žif i q 1 is a power of 2. or the two lowest levels ? log 2 i @ and ? log 2 i @ q 1. Thus X i s ? log 2 i @ q Bi , where Bi is a Bernoulli random variable that assumes the value 1 with probability 2 Ž i q 1 y 2 ? log 2 i @ . iq1 , the proportion of external nodes at level ? log 2 i @ q 1, as is easily shown. So E w Xi x s 2 Ž i q 1 y 2 ? log 2 i @ . iq1 q ? log 2 i @ and E w Cn x s ny1 Ý is1 2 Ž i q 1 y 2 ? log 2 i @ . iq1 ny1 q Ý ?log 2 i @ is1 s n log 2 n q O Ž n . . For binary search, Uk s 2 kq 1 y 1, the number of nodes in a complete tree of height k, so m k s 2 k . To compute sU2 j , we may first sum the variances of the 1293 TREE-GROWING STRATEGIES X i ’s corresponding to trees of height k, for k s 1, . . . , j, and then sum over k: 2 ky1 j sU2 j s Ý Ý ks1 is0 Var X L kqi . Again using the result about the number of nodes on levels ?log 2 i @ and ?log 2 i @ q 1, 2 ky1 Ý is0 Var X L kqi s ž 2 ky1 Ý is0 /ž 2 Ž i q 1. 2k q i q 1 1y 2 Ž i q 1. 2k q i q 1 / . Some straightforward algebraic manipulation shows that 2 ky1 Ý is0 Var X L kqi s 6 Ž 2 k . w H2 kq 1 y H2 k x y 4 kq1 H2Ž2.kq 1 y H2Ž2.k y 2 kq1 , where HnŽ m. s Ý nis11ri m , the nth harmonic number of order m. ŽWe omit the superscript when it is 1.. To simplify this expression, we use the asymptotic approximations 1 Hn s ln n q g q O , n ž / where g is Euler’s constant, and HnŽ2. s p2 y 6 1 n qO ž / 1 n2 . Using these we obtain 2 ky1 Ý is0 Var X L kqi s Ž 6 ln 2 y 4 . 2 k q O Ž 1 . . Then sU2 j s Ž 6 ln 2 y 4 . j Ý 2k q OŽ j. ks1 s V Ž Uj . . Also, if 0 F l F 2 y 1, j l Ý Var is0 X L jqi s 4 jq1 ž 1 2 qlq1 j y 1 2j / q 6 Ž 2 j . Ž H2 jqlq1 y H2 j . y 2 l q O Ž 1 . . Thus for general n s L j q i, where 0 F i F 2 j y 1, sL2 jqi s sU2 jy 1 q 4 jq1 ž 1 2 qiq1 j y 1 2j / q 6 Ž 2 j . Ž H2 jqiq1 y H2 j . y 2 i q O Ž 1 . . 1294 J. LENT AND H. M. MAHMOUD Since, in the case of binary search, hU j s O Žln Uj ., there is a positive constant c such that sU j s V ŽUj1r2 . s V ŽexpŽ chU j ... Then for i s 0, . . . , m j y 1, sL jqi ) sU jy 1 ž ž s V exp c h L jqi y 1 ž / // s V exp Ž ch L jqi . . That is, sn s V ŽexpŽ ch n .., so h n s oŽ sn . and the sufficient condition for normality is satisfied. Writing n s L j q i and simplifying the above expression for sL2 jqi gives 6 Ž 1 q f n . ln 2 y 6 sn2 s n 2 fn y2q 4 4 fn q O Ž ln n . , where f n ' log 2 n y ? log 2 n @ . w The sequence f n 4`ns1 is dense on the interval w 0, 1..x Then Cn y n log 2 n 'nA ªD N Ž 0, 1 . , n where 6 Ž 1 q f n . ln 2 y 6 4 . 4 fn The analytic continuation of the function A n s AŽ f n . to the real line, as f n ranges over the interval w 0, 1., attains its minimum value Žapproximately 0.151911. at the point An s 2 fn W Ž y8r3e 2 . q 2 ln 2 y2q y 1 s 0.141469 . . . , where W Ž x . is the principal branch of the omega function w see Fritsch, Schafer and Crowley Ž1973.x . It attains its maximum value Žapproximately 0.174449. at W Ž 0, y8r3e 2 . q 2 ln 2 y 1 s 0.707059 . . . , where W Ž0, x . is the zeroth branch of the omega function. Intuitively, we can think of A n as an asymptotic average of the variances of X 1 , . . . , X n 4 . Thus A n increases in n as long as Varw X n x ) A ny1. This occurs while the numbers of external nodes on the two lowest levels of Tn are close, and thus f n is close to lnŽ4r3.rln 2 Žapproximately 0.415037.. When one level of Tn has many more external nodes than the other, Varw X n x is small and thus A n tends to decrease. As each level of the tree fills up and f n takes on increasing values in the interval w 0, 1., A n completes one cycle: it moves from 6 ln 2 y 4 Žapproximately 0.158883. down to its minimum Žfor that cycle., then up to its TREE-GROWING STRATEGIES 1295 maximum Žfor that cycle. and finally back down to 6 ln 2 y 4. With each new cycle, A n more closely approximates its analytic continuation. A second commonly used consistent strategy is linear search, which, when searching an array of size l, always selects as a probe the data element of rank l Žlinear search from the bottom. or the data element of rank 1 Žlinear search from the top.. For linear search it is known w see Gonnet and BaezaYates Ž1991.x that n2 E w Cn x s q O Ž n. 4 and n3 Var w Cn x ; . 36 Since sn2 s V Ž n 3 . and h n s n y 1, the sufficient condition holds for the linear search strategy; thus Cn y n 2r4 n 3r2 ž ªD N 0, 1 36 / . 6. Normality of consistent strategies. In this section we use the sufficient condition for normality to prove that all consistent strategies are normal. Let S s Si 4ìs1 be a consistent strategy represented by the decision trees T s Ti 4ìs1 and let X i , Cn , sn2 , L k , Uk and m k be as defined. We begin by establishing some properties of consistent strategies. PROPERTY 1. For each positive integer i, the decision tree Ti representing an algorithm from a consistent strategy has at least one external node on each unsaturated level. The right subtree of the tree in Figure 6, for example, has the following shape characteristic: its left sub-subtree is complete down to level two, while its right sub-subtree has neither internal nor external nodes on level two. The third level of the tree is therefore both unsaturated and bereft of external nodes. We will show that this tree cannot be part of a sequence of decision trees representing a consistent strategy. FIG. 6. An extended decision tree that cannot represent an algorithm from a consistent strategy. 1296 J. LENT AND H. M. MAHMOUD PROOF OF PROPERTY 1. The first probe specified by Si , i s 1, 2, . . . , divides an array of size i into two subarrays. Let g S Ž i . denote the size of the smaller of the two subarrays and g XS Ž i . the size of the larger. w Naturally we define g S Ž0. s g XS Ž0. s 0.x For a cleaner notation, we suppress the subscript S . If Ti had an incomplete level with no external nodes, Ti would contain a subtree whose two sub-subtrees Žleft and right. had the following property: one sub-subtree would be complete down to some level l, while the other would have neither internal nor external nodes at level l. Since Si is a consistent algorithm, this would imply that for some number nX F i, the size of the subtree, g Ž l . Ž nX . s 0 and g Ž l. Ž g X Ž nX . . ) 0 w where g Ž p. Ž i . s g Ž g Ž g ??? Ž i ..., the function g composed p times, for a positive integer p x . This is a contradiction, since g must be nondecreasing, but g X Ž nX . - nX . I PROPERTY 2. For consistent strategies, the sequence m k 4`ks1 is nondecreasing in k, so Ukrm k F k q 1. PROOF. Let g and g X be as defined. ŽAgain we suppress the subscript S .. Then for each positive integer i, the larger subtree Žrooted at level one. of Ti has the same shape as Tg X Ž i. , the decision tree for S g X Ž i. . Since height is monotonic in size, the larger subtree of TL kq 1 }recall that TL kq 1 is the smallest decision tree in T having height k q 1}has the shape of TL k, the smallest decision tree in T whose height is k. So g X Ž L kq 1 . s L k . Similarly, g X ŽUkq 1 . s Uk . Also, by the tree-growing property, if i 2 ) i 1 , then g X Ž i 2 . G g X Ž i 1 . and i 2 y g X Ž i 2 . G i 1 y g X Ž i 1 .. Since for all i, g X Ž i . - i, this implies w taking i 1 s g X Ž i . and i 2 s i x that i y gX Ž i . G gX Ž i . y gX Ž gX Ž i . . . Then m kq 1 s Ukq1 y Uk s Ukq 1 y g X Ž Ukq1 . G g X Ž Ukq 1 . y g X Ž g X Ž Ukq1 . . s Uk y Uky1 s mk . Since the m k ’s are nondecreasing in k, Uk mk PROPERTY 3. s 1 q m1 q m 2 q ??? qm k mk F k q 1. For consistent strategies, sn2 s V Ž n.. I TREE-GROWING STRATEGIES 1297 PROOF. Suppose n G 100. First note that, unless Ti Žthe decision tree illustrating Si . has at least 95% of its external nodes at a single level, we have Varw X i x G 0.0475. Hence, unless at least 95% of the trees Ti in the n 2 subsequence Ti 4is1 have this property, we have snq1 G 0.002375n. n Now if 95% or more of the trees Ti in Ti 4is1 have this property Žthat is, at least 95% of their external nodes lie on a single level., then this level, level L say, must be the same for all the integers i, 0.10n F i F n, for which Ti has this property. This is so because if the level L were to shift, say to level L q D n for some integer D / 0, then Ti 4is1 would contain a subsequence of decision trees having fewer than 95% of their external nodes on any single level, and n this subsequence would include more than 5% of the decision trees in Ti 4is1 . Toward a contradiction, suppose Tn has fewer than 0.80n external nodes on level L. Let i 0 s max i F n: Ti has at least 0.95i nodes on level L 4 . We know that i 0 G 0.95n, so we may add at most 0.05n nodes to Ti 0 to obtain Tn . If we add all the n y i 0 internal nodes to level L of Ti 0 Žreducing by as many as possible the number of external nodes on level L ., the number of external nodes on level L of Tn will still be at least 0.8525n, a contradiction. So Tn must have at least 0.80n external nodes on level L. Now since Tg X Ž n. is a subtree of Tn and contains at least 0.50n external nodes, Tg X Ž n. has at least 0.40 g X Ž n. external nodes on level L y 1 Žmeasured from its own root.. Let i 1 s min i ) g X Ž n . : Ti has at least 0.80i nodes on level L 4 . ŽNote that i 1 F n must exist.. Since Tg X Ž n. has at least 0.40 g X Ž n. external i1 X nodes on level L y 1 and g X Ž n. G nr2, the subsequence Ti 4isg Ž n. contains i1 X more than 0.05n trees. This is a contradiction, since all trees in Ti 4isg Ž n. have fewer than 95% of their external nodes on level L. I We now present our main result. THEOREM 1. All consistent strategies are normal. PROOF. If lim n ª` g Ž n.rn s 1r2, the strategy is similar to binary search: the rate of growth in h n is O Žln n., while, by Property 3, the rate of growth in sn2 is V Ž n., so the sufficient condition clearly holds. Now assume lim n ª` g Ž n.rn s c for some constant c g w 0, 1r2.. Let h n be the height and bn the number of complete levels of a decision tree of size n. We first show that in this case, there are positive constants N1 and c1 such that c1 - 1 and for n ) N1 we have b n F c1 h n . If bn is bounded above by some constant, there is nothing to prove, so assume that lim n ª` bn s `. ŽThe limit must exist because bn is nondecreasing in n.. Since lim n ª` g Ž n.rn s c - 1r2, there is a positive constant c 2 - 1r2 and an integer N2 such that for n ) N2 , g Ž n. - c 2 n. That is, the smaller subtree of a 1298 J. LENT AND H. M. MAHMOUD decision tree of size n ) N2 has fewer then c 2 n nodes. Further, each subtree of size n ) N2 also has this property; that is, its smaller sub-subtree has at most c 2 n nodes. Thus there is a constant integer c3 such that for n large enough, g Ž b ny1. Ž n . - c 2b nyc 3 Ž 12 . n, c3 where, by definition, g Ž k . Ž n. s g Ž g Ž ky1. Ž n... Since g Ž b ny1 . Ž n. s 1, we have c 2b nyc 3 Ž 12 . n ) 1. c3 Taking logarithms on both sides of the inequality gives Ž bn y c3 . ln c2 y c3 ln 2 q ln n ) 0, and, rearranging, we obtain bn - ž ln n q c3 1 y ln Ž 1rc2 . ln 2 ln Ž 1rc2 . / . Since h n G ? log 2 n @ and c 2 - 1r2, this implies that there are positive constants N1 and c1 such that c1 - 1 and for n ) N1 , b n F c1 h n . In the notation introduced at the beginning of the previous section, we have shown that there is a constant j0 such that for j ) j0 , bi j F c1 j, where bi j is the number of complete levels of TL jqi , i s 0, 1, . . . , m j y 1. By Property 1, TL jqi has at least one external node on every level below bi j , so for i s 0, . . . , m j y 1, Var X L jqi G G j 1 Ý Lj q i q 1 1 Uj ks b i j k y E X L jqi 2 X j Ý w kX y mX x 2 , X k s1 where jX s j y bi j q 1 and mX s Ew X L jqi x y bi j q 1. Since the sum of squares above may be minimized by setting mX s Ž jX q 1.r2, we have Var X L jqi G s s X j 1 Uj 1 Uj Ý kX y X k s1 ½ jX 3 12 V Ž jX 3 . Uj jX q 1 2 q O Ž jX 2 . . 5 2 TREE-GROWING STRATEGIES 1299 Recalling that jX G Ž1 y c1 . j for sufficiently large j, we have Var X L jqi s V Ž j3 . Uj . Then m jy1 Ý is0 Var X L jqi s s G m jy1 V Ž j3 . is0 Uj Ý ž /Ž. ž /Ž. mj Uj V j3 1 jq1 V j3 s V Ž j2 . , where we have used Property 2 to obtain the inequality. So sU2 j j m ky1 ks1 is0 Ý Ý s Var X L kqi j s Ý VŽ k2 . ks1 s V Ž j3 . . Thus sU j s V Ž j 3r2 ., which implies that j s oŽ sL j . or h n s oŽ sn .. I APPENDIX Normality of other tree-growing search strategies. Properties 1]3, as is clear from their proofs, apply to all tree-growing strategies for which the position of each probe depends only on the length of the data subarray being searched. Here we consider tree-growing strategies which, though not consistent, retain this characteristic and thus possess Properties 1]3. The proof in the previous section clearly works for all such strategies in which the limit lim n ª` g Ž n.rn does not exist, as long as lim sup n ª` g Ž n.rn - 1r2. One example of such a strategy is the following tree-growing implementation of Fibonaccian search. ŽNot all implementations of Fibonaccian search have the tree-growing property w see Knuth Ž1973., Section 6.2.1x . We restrict our discussion to tree-growing implementations.. Let the position of the probe to be selected in an array of size n be specified by ¡n, j Ž n . s~n y F ¢F , 1 k Ž n.y1 q 1, k Ž n.q1 if n F 3, if Fk Ž n.q1 F n F Fk Ž n.q1 q Fk Ž n.y1 y 1, if Fk Ž n.q1 q Fk Ž n.y1 F n F Fk Ž n.q2 y 1, 1300 J. LENT AND H. M. MAHMOUD where, for n G 3, k Ž n. is such that Fk Ž n.q1 F n - Fk Ž n.q2 and Fi is the ith Fibonacci number. Using this strategy, we ‘‘grow’’ the Fibonacci tree of size Fkq 1 y 1 into a Fibonacci tree of size Fkq 2 y 1 in the following way. We first add Fky1 nodes to the larger subtree of TFkq 1y1 , bringing the size of this subtree to Fkq1 y 1. Then we add Fky 2 nodes to the smaller subtree, bringing its size to Fk y 1. The new Fibonacci tree then has size Fkq 1 y 1 q Fky1 q Fky2 s Fkq2 y 1. With this strategy we have, using our previous notation, lim sup g Ž n. n nª` s lim kª` s lim kª` s Fky 1 y 1 Fkq 1 y 1 Ž 1r'5 . f ky 1 Ž 1r'5 . f kq 1 1 f2 s 0.381966 . . . , where fs 1 q '5 2 , the golden ratio w as in Knuth Ž1973., page 416x . This strategy is not consistent, because lim inf nª` g Ž n. n s lim kª` s lim kª` s Fky 1 y 1 Fkq 1 q Fky1 y 1 f ky 1 f kq 1 q f ky1 1 f q1 2 s 0.276393 . . . . Since we do, however, have lim sup n ª` g Ž n.rn - 1r2, this strategy is normal. Note that we may draw this conclusion from our general result, avoiding a somewhat complicated variance calculation. The following proposition further expands the class of strategies for which we can prove normality. PROPOSITION 1. Any search strategy S for which lim inf n ª` g Ž n.rn s k 1 g Ž0, 1r2. is normal. TREE-GROWING STRATEGIES 1301 PROOF. Let g X Ž n. be as defined. Since lim inf n ª` g Ž n.rn s k 1 ) 0, there are constants j0 and k 2 such that k 2 - 1 and for j ) j0 we have g X ŽUj . F k 2 Uj . Then for j ) j0 , g XŽ jyj 0 . Ž Uj . F k 2jyj 0 Uj . For example, for j s j0 q 2 we have g XŽ2. Ž Uj 0q2 . s g X Ž g X Ž Uj 0q2 . . s g X Ž Uj 0q1 . F k 2 Uj 0q1 s k 2 g X Ž Uj 0q2 . F k 22 Uj 0q2 . Since g XŽ jyj 0 . ŽUj . G 1, this implies that 1 F k 2jyj 0 Uj . Thus there is some constant k 3 ) 0 such that for j ) j0 , Uj ) expŽ k 3 j .. Writing j s h n and noting that Ujy1 - n, we have, by Property 3, sn2 s V Ž n . s V Ž exp Ž k 3 h n . . , so h n s oŽ sn .. I Another implementation of Fibonaccian search provides a practical example of a strategy for which the previous proposition applies. Let the position of the probe to be selected in an array of size n be determined by the function ¡n, j Ž n . s~ F , ¢n y F 2 if n F 2, if Fk Ž n.q1 F n F Fk Ž n.q1 q Fk Ž n.y2 y 1, k Ž n. k Ž n. q 1, if Fk Ž n.q1 q Fk Ž n.y2 F n F Fk Ž n.q2 y 1, where Fk Ž n. is as previously defined. This strategy is similar to the one determined by j 1 above, except we add nodes to the smaller subtree first, then to the larger. With this strategy we have, using our previous notation, lim sup g Ž n. nª` n s lim kª` Fk y 1 Fkq 1 q Fky2 y 1 1 . 2 Here, again, we do not have consistency, since s lim inf nª` g Ž n. n s lim kª` s lim kª` s 1 f2 Fky 1 y 1 Fkq 1 y 1 f ky 1 f kq 1 , but the normality of this Fibonaccian strategy follows from Proposition 1. Many other tree-growing strategies are also normal. Alternating linear search, for example, is clearly normal and shares the asymptotic mean and variance of linear search from the bottom Žor top.. Intuitively, if a strategy 1302 J. LENT AND H. M. MAHMOUD corresponds to a set of decision trees T s Ti 4ìs1 for which the height of Ti grows at a steady rate as i grows, the strategy is likely to be normal. In fact, we can prove the normality of all search strategies S whose corresponding decision trees T satisfy one of the following sets of conditions: 1. Ža. m k s O Ž k 1q « . for some « in the interval Ž0, 1., Žb. Ukrm k s O Ž k 1q d . for some d in the interval Ž0, 1. and Žc. the tree Ti is complete down to a certain level; below this level, Ti has at least one external node on every cth level for some constant c. 2. Ža. m k s V Ž k 1q « . for some « ) 0 and Žb. VarŽ X n . s V Ž1.. Acknowledgments. The authors gratefully acknowledge the contribution of Sudip Bose of the George Washington University, whose excellent suggestions substantially improved this paper. In addition, the anonymous referees who reviewed the paper offered many helpful comments. The authors are particularly thankful for the proof of Property 3 and the example of a tree-growing strategy that violates the sufficient condition for normality, which one of the referees provided. REFERENCES BILLINGSLEY, P. Ž1986.. Probability and Measure. Wiley, New York. DOBERKAT, E. Ž1982.. Asymptotic estimates for the higher moments of the expected behavior of straight insertion sort. Inform. Process. Lett. 14 179]182. FRITSCH, F., SCHAFER, R. and CROWLEY, W. Ž1973.. Solution of the transcendental equation v e v s x. Comm. ACM 16 123]124. GONNET, G. and BAEZA-YATES, R. Ž1991.. Handbook of Algorithms and Data Structures, 2nd ed. Addison-Wesley, Reading, MA. HU, T. and WACHS, M. Ž1987.. Binary search on a tape. SIAM J. Comput. 16 573]590. KNUTH, D. Ž1973.. The Art of Computer Programming 3. Addison-Wesley, Reading, MA. PANNY, W. Ž1986.. A note on the higher moments of the expected behavior of straight insertion sort. Inform. Process. Lett. 22 175]177. DEPARTMENT OF STATISTICS GEORGE WASHINGTON UNIVERSITY WASHINGTON , DC 20052 E-MAIL : lent j@bls.gov hosam@gwis2.circ.gwu.edu

ON TREE-GROWING SEARCH STRATEGIES George Washington

Related documents

Products

Support

ON TREE-GROWING SEARCH STRATEGIES George Washington

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib