ON TREE-GROWING SEARCH STRATEGIES George Washington

advertisement
The Annals of Applied Probability
1996, Vol. 6, No. 4, 1284 ] 1302
ON TREE-GROWING SEARCH STRATEGIES
BY JANICE LENT
AND
HOSAM M. MAHMOUD1
George Washington University
Using the concept of a ‘‘tree-growing’’ search strategy, we prove that
for most practical insertion sorting algorithms, the number of comparisons
needed to sort n keys has asymptotically normal behavior. We prove and
apply a sufficient condition for asymptotically normal behavior. The condition specifies a relationship between the variance of the number of comparisons and the rate of growth in height of the sequence of trees that the
search strategy ‘‘grows.’’
1. Introduction. Insertion sort is a popular on-line sorting algorithm.
At each stage of the sorting, the data obtained so far make up a sorted array.
The algorithm reads in a new datum, searches for its proper position in the
array and inserts it. When a data file of size n G 2 is to be sorted, the ith
search is for the position of the Ž i q 1.st key, i s 1, 2, . . . . We shall denote the
algorithm that performs the ith search by Si and call the collection of search
algorithms S s Si 4`is1 a search strategy. Doberkat Ž1982. and Panny Ž1986.
have studied the moments of the number of comparisons required for insertion sort with one particular Žlinear. search strategy. In this paper, we show
that if the values in a data array have been randomly selected from a
continuous distribution, the number of comparisons needed to sort the values
by most practical search strategies has asymptotically normal behavior. Thus
we investigate properties of search strategies in general, with insertion sort
serving as a conspicuous application.
In order to insert a new key in a sorted data array, an implementation of
insertion sort selects a sequence of probes: keys in the sorted array to which
the algorithm compares the new key. If the new key is larger than a given
probe, the algorithm selects a larger probe; if the probe’s value exceeds that of
the new key, the search continues in the segment of the data array containing keys smaller than the probe. In general, the search algorithms applied for
different keys in an insertion sort may be independent of each other. One
may, for example, imagine a search strategy S s Si 4`is1 in which Sk is
binary search and Skq 1 is linear search. Practitioners, however, usually
prefer strategies in which all the algorithms applied are coded as a single
procedure which is then invoked with different parameters at different stages
of the sorting. Most efficient, easy-to-use strategies thus possess certain
Received January 1995; revised August 1996.
1
Research partially supported by NSA Grant MDA904-92-H3086.
AMS 1991 subject classifications. Primary 60F05, 68P10; secondary 05C05.
Key words and phrases. Searching, sorting, limit distributions.
1284
TREE-GROWING STRATEGIES
1285
consistency properties, which we will specify and use to prove our main
result. Rudimentarily, when all the search algorithms Si in a strategy S
specify probe positions through a reasonable function of the length of the
fragment of the data array being searched, we will call S a consistent
strategy.
To identify a class of reasonable probe-selection functions, we introduce
decision trees: deterministic binary trees whose nodes correspond to the
positions of the probes selected by a search algorithm. Since the binary tree is
therefore a fundamental instrument in our analysis, we briefly review its
definition and basic properties. A binary tree is a hierarchical structure
comprising nodes organized in levels. A node can have either Ža. no children
}in which case the node is called a leaf, Žb. a left child, Žc. a right child or
Žd. two distinct children, one left and one right. It is customary to draw a
binary tree with the root}the only node with no ancestors}at the top and
the leaves at the bottom. The level or depth of a specific node is the length of
the path Žthe number of edges. from the node to the root node, which is on
level zero. The size of a binary tree is the number of its nodes. If level k of a
tree has 2 k nodes}the maximum number possible}it is said to be saturated. A tree in which all existing levels, except possibly the lowest, are
saturated is complete. ŽSome authors require that the nodes on the lowest
level of a complete tree be positioned as far to the left as possible, but the
positioning of the nodes within a level is irrelevant to our analysis.. The
height of the tree is the length of the longest root-to-leaf path in the tree. We
extend a binary tree by adding enough external nodes Žsquares in our
diagrams. to ensure that each original internal node Žbullets in our diagrams. has two children. Each leaf thus has two external nodes as its left and
right children. An extended binary tree of size n has n q 1 external nodes.
A decision tree representing a search algorithm is constructed as follows:
The position of the first probe chosen becomes the root of the decision tree;
the two possible positions of the second probe}at most one on each side of
the first probe}become its children and so forth. A probe falling at one of the
ends of the data array will not have two internal nodes as children; one or
both of its children will then be external nodes in the extended tree. The
algorithm Si is represented by Ti , a decision tree of size i whose root-to-leaf
paths represent the possible probe sequences of Si . We shall call the search
strategy S a tree-growing strategy if, for every positive integer i, the shape of
Tiq1 may be obtained by adding an internal node in one of the insertion
positions}one of the external nodes}of Ti . Since the tree-growing property
provides an element of consistency across algorithms in the search strategy,
we will restrict our class of consistent strategies to those possessing this
property.
Assume the ranks of the data elements being sorted form a random
permutation of the integers 1, . . . , n4 , as is the case when the data values are
selected at random from any continuous distribution. Let the random variable Cn be the total number of times S compares a new key to a probe
during the sorting of the first n keys. The class of normal search strategies
1286
J. LENT AND H. M. MAHMOUD
comprises strategies for which Cn is asymptotically normal; that is,
Cn y E w Cn x
ªD N Ž 0, 1 . .
Var w Cn x
'
We shall see that the normality of a subclass of tree-growing strategies}the
consistent strategies}may be established by relating the variance of Cn to
the rate of growth in height of the decision trees in the family T s Ti 4`is1. Our
sufficient condition for normality establishes asymptotic normality but does
not give the asymptotic mean or variance of Cn , which obviously depend on
the particular strategy S . Identifying these requires additional investigation
of the strategy. For linear search and binary search, the two most commonly
used strategies, we will establish asymptotic normality and calculate the
asymptotic means and variances which serve as centering and scaling factors.
The next section provides examples of tree-growing search strategies. In
Section 3 we identify a subclass of tree-growing strategies as consistent. The
remaining sections focus on the normality of consistent strategies. Section 4
gives a sufficient condition for the normality of tree-growing strategies. In
Section 5, we use this condition to establish the asymptotic normality for
binary search and linear search. We compute the mean and variance for the
binary search strategy}the variance includes periodic fluctuations}and use
the well-known mean and variance for the linear search strategy w Gonnet and
Baeza-Yates Ž1991.x to derive full asymptotic distributions. In Section 6, we
use the sufficient condition to prove our main result: all consistent strategies
are normal. ŽThis result is further extended in the Appendix to include some
tree-growing implementations of Fibonaccian search..
2. Examples of tree-growing strategies. We illustrate the tree-growing property through simple examples. One commonly used strategy known
as linear search from the bottom always selects as a first probe the largest
value in the data array being searched. When searching for the position of the
Ž i q 1.st key, linear search from the bottom probes positions i, i y 1, i y 2
and so forth, until it discovers a key less than the Ž i q 1.st key. If it finds
such a key in position j, it inserts the new key in position j q 1. Though
linear search from the bottom is inefficient, it is useful for searching sequential access data files w e.g., data stored on a tape, as discussed by Hu and
Wachs Ž1987.x . Figure 1 shows the decision trees of linear search from the
bottom. The shape of Tiq1 is obtained from Ti by adjoining a leaf to replace
the leftmost external node of Ti . The i nodes of Ti are relabeled in Tiq1 , and
it is the shape of Tiq1 that evolves from that of Ti by ‘‘tree-growing.’’
Variants of linear search from the bottom include the obvious linear search
from the top as well as a slightly more complex method called c-jump search.
Still based on sequential searching, c-jump search starts at the ‘‘top’’ of the
data array and advances c positions at a time. When it finds a probe larger
than the new key, it reverses direction and performs a linear search ‘‘from
the bottom’’ of the relevant fragment Ža fragment of size c y 1. of the array.
Three-jump search is illustrated by the decision trees of Figure 2.
TREE-GROWING STRATEGIES
1287
FIG. 1. The extended decision trees for linear search from the bottom.
FIG. 2. The Ž unextended . decision trees of three-jump search.
A more efficient search method also commonly used is binary search, which
always probes the middle position of the remaining portion of the list. Thus
when the search has been confined to a fragment of the data array between
an upper bound u and a lower bound l, binary search chooses the probe in
position Ž l q u . r2 . For i s 6, the decision tree representing binary search
is that of Figure 3.
Though all of the search strategies discussed above lie in the class of
consistent strategies defined in the next section, strategies reasonable for
some applications lie outside this class. Examples include alternating linear
1288
J. LENT AND H. M. MAHMOUD
FIG. 3. The Ž unextended . decision tree of binary search.
FIG. 4. The Ž unextended . decision trees of alternating linear search.
FIG. 5. The Ž unextended . decision tree of Fibonaccian search in an array of 12 keys.
search}a hybrid of linear search from the bottom and linear search from the
top}which is represented by the decision trees of Figure 4. Figure 5 shows a
decision tree for Fibonaccian search, a method so called because its decision
tree is endowed with a recursive partition that follows the Fibonacci number
sequence. ŽThe usual definition of the Fibonacci number Fj is Fj s Fjy1 q
Fjy2 , with F0 s 0, F1 s 1. Thus F2 s 1, F3 s 2, F4 s 3, F5 s 5 and so forth..
When the tree size is Fkq 1 y 1, where Fj is the jth Fibonacci number, the
two subtrees of the root node have sizes Fk y 1 and Fky1 y 1, and the
property propagates recursively in the subtrees. Fibonaccian search is sometimes preferred to binary search because Fibonaccian search requires only
the operations of addition and subtraction to compute the position of the next
probe w see Knuth Ž1973., Section 6.2.1x .
Like alternating linear search, Fibonaccian search fails to meet our consistency criteria. We will see, however, that all of the search strategies discussed
TREE-GROWING STRATEGIES
1289
here are normal, except possibly some implementations of Fibonaccian search.
ŽThe normality of Fibonaccian search may depend on the search strategy
used when the number of keys in the array being searched is not of the form
Fk y 1. In the Appendix, we discuss two implementations of Fibonaccian
search for which we can prove normality..
3. Consistent strategies. In this section we present a condition on the
search strategy S that is equivalent to the tree-growing property and give a
rigorous definition of consistency. Let Ti be the decision tree representing the
search algorithm Si }the algorithm used to insert the Ž i q 1.st key. Let A i
be the sorted array formed by the first i keys. We search for the correct
position of the Ž i q 1.st key by successively selecting probes in A i , at each
step comparing the Ž i q 1.st key to the selected probe. At the first step in the
insertion, we select one probe; at the second step, we again select one probe,
but the probe we select depends on the outcome of our comparison of the
Ž i q 1.st key to the first probe. Thus there are two possible second probes: one
in case the Ž i q 1.st key is larger than the first probe and another in case it is
smaller Žunless, of course, the first probe selected falls at one of the ends of
A i }in this case, the algorithm would specify only one probe at the second
step.. In general then, at the jth step in the insertion, the algorithm specifies
up to 2 jy1 possible probes.
Let Pi j be the set of all possible positions for the first j probes in the
insertion of the Ž i q 1.st key. Then Pi j has at most 2 j y 1 elements, which
correspond to the nodes on levels zero through j y 1 of Ti . Let
AŽ i, j, 1., AŽ i, j, 2., . . . , AŽ i, j, 2 j . be the subarrays of A i Žin increasing order,
though some may be empty. created by the partitioning which results from
deleting all elements of Pi j . If we have some adjacent probes or probes at
either of the ends of the array, some of the subarrays AŽ i, j, k . will be empty.
Let l i jk be the length of AŽ i, j, k ., k s 1, 2, . . . , 2 j w zero if AŽ i, j, k . is emptyx .
For the Ž j q 1.st step in the insertion, the algorithm specifies one probe from
each of the nonempty subarrays. If AŽ i, j, k . is not empty, let pi jk be the
position of the next probe within the subarray AŽ i, j, k ., relative to the other
elements of the subarray.
It is easy to show that the search strategy S has the tree-growing property
if and only if, for all i, j and k such that AŽ i, j, k . is not empty,
pi jk s f Ž l i jk , j, k . ,
for some function f such that
f Ž l q 1, j, k . s f Ž l, j, k . q a Ž l, j, k .
and a Ž l, j, k . takes values in the set 0, 14 , where l is any natural number. So
the tree-growing property implies that Ž1. the probe selected in AŽ i, j, k . at
the Ž j q 1.st step cannot depend directly on i, the index of the insertion, and
Ž2. if the array AŽ i, j, k . were larger by one element, the position of the probe
selected would be either pi jk or pi jk q 1. The tree-growing property thus
provides some similarity of probe selection functions within and across the
1290
J. LENT AND H. M. MAHMOUD
algorithms Si , i s 1, . . . , n. We restrict our class of consistent strategies to
those which possess additional consistency properties:
DEFINITION. Let S s Si 4`is1 be a tree-growing strategy for insertion sort.
We call S a consistent strategy if, for all i, j and k such that AŽ i, j, k . is not
empty, Si specifies the relative rank w within AŽ i, j, k .x of the probe selected
in AŽ i, j, k . by
pi jk s w S Ž l i jk . ,
1 F pi jk F l i jk ,
where the function w S has the following property: if g S Ž n. s minŽ w S Ž n. y
1, n y w S Ž n.., then lim n ª` g S Ž n.rn exists.
In words, g S Ž n. is the size of the smaller subtree rooted at level one of Tn ,
and consistency requires that the proportion of nodes belonging to the
smaller subtree approach a limit as n approaches infinity. Note also that for
consistent strategies, the position of the probe pi jk within AŽ i, j, k . depends
only on l i jk , the length of the subarray. Since consistent strategies are
tree-growing, we have
w S Ž l q 1. s w S Ž l . q a S Ž l .
and a S Ž l . takes values in the set 0, 14 . Examples of consistent strategies
include c-jump search, which is specified by the function
wS Ž l . s
½
c,
l,
if l ) c,
otherwise.
In this case, lim n ª` g S Ž n.rn s 0. For alternating linear search, however,
the position of the next probe is a function of both l i jk and j:
pi jk s
½
1,
l i jk ,
j odd,
j even.
Thus pi jk is not a function of the length only}it depends explicitly on j Žthe
step number.. So, though alternating linear search is tree-growing Žsee
Figure 4., it is not consistent.
4. A sufficient condition for normality of tree-growing strategies.
Let Ti be the decision tree for Si , the search algorithm used to insert the
Ž i q 1.st key by the tree-growing search strategy S s Si 4`is1. Let h i denote
the height of Ti , let X i denote the number of comparisons made by Si and let
1
Cn s Ý ny
is1 X i , the total number of comparisons required by S for sorting n
keys. Each insertion is performed independently of all others, so we take the
X i ’s to be independent random variables. Let sn2 s Varw Cn x .
We also define the symbol V: if x and c are functions of n, we say that
x Ž n. s V Ž c Ž n.. if c Ž n. s O Ž x Ž n...
LEMMA 1.
If h n s oŽ sn ., then S is a normal strategy.
TREE-GROWING STRATEGIES
1291
PROOF. For i s 1, . . . , n y 1, let Yi s X i y Ew X i x and let Fi denote the
distribution function of Yi . Then
P < Yi < ) h n q 1 4 F P max Ž X i , E w X i x . ) h i q 1 4 s 0,
since X i cannot exceed h i q 1. Now if h n s oŽ sn . and « ) 0, there is a
constant N« such that, for all n ) N« , h n q 1 - « sn . Then for i s 1, . . . , n y 1,
H< y <G« s 4 y
2
Fi Ž y . F
dF
n
H< y <)h q14 y
2
Fi Ž y . s 0,
dF
n
because the integral is taken over a set of zero probability. Thus for n ) N« ,
ny1
Ý
is1
H< y <G« s 4 y
2
Fi Ž y . s 0.
dF
n
So
lim
1
ny1
Ý
2
nª` s n is1
H< y <G« s 4 y
2
Fi Ž y . s 0
dF
n
and the normality of S follows from the Lindeberg theorem w see Billingsley
Ž1986., page 368x . I
Although most practical strategies satisfy the condition of the lemma, the
condition does not hold for all tree-growing strategies. Consider, for example,
a strategy described by a sequence of decision trees T1 , T2 , . . . that satisfies
the following property: let k 1 s 2 and for each integer i G 2, let k i s k iy1 q
2 k iy 1 r2 . The k i ’s form a sequence of even numbers. For each i G 1, each of the
trees Tn for n s 2 k iq1 , . . . , 2 k iq1 q 2 k i r2 y 1 has levels 0 through k i saturated, while its remaining n y 2 k iq1 q 1 nodes form a thin branch at the
bottom of the tree; that is, levels k i q 1, . . . , n y 2 k iq1 q k i q 1 of Tn each
contain only one internal node. In a recursive fashion, the strategy first grows
a complete tree of height k i , then grows a thin branch of length 2 k i r2 at the
bottom of the complete tree. Finally, it fills out levels k i q 1 through k i q
2 k i r2 to obtain the next complete tree, a complete tree of height k iq1. Let
n i s 2 k iq1 q 2 k i r2 y 1, the size of the smallest decision tree whose thin
branch is grown to length 2 k i r2 . The height of Tn i is h n i s k i q 2 k i r2 s
V Ž 2 k i r 2 . , whereas it can be shown that s n i s O Ž2 k i r 2 . . Thus
lim sup i ª` h n irsn i ) 0. While this strategy is not of practical interest, it
provides an example of a tree-growing strategy that violates the condition of
Lemma 1. It can also be shown, by the Lindeberg]Feller theorem, that this
strategy is not normal.
5. Normality of two commonly used search strategies. We illustrate the use of the sufficient condition for normality by showing the normality of two search strategies commonly used for insertion sort: binary search
and linear search. We begin by introducing some notation to be used here and
in later sections.
1292
J. LENT AND H. M. MAHMOUD
NOTATION. Let S s Si 4`is1 be a consistent strategy and let T s Ti 4`is1 be
the sequence of decision trees representing S . For i s 1, . . . , n, let h i be the
height of Ti , and for k s 1, 2, . . . , let L k s min i: h i s k 4 and Uk s max i:
h i s k 4 . That is, TL k, TL kq1 , . . . , TUk are the only trees in T that have height k;
thus Uk q 1 s L kq1. Define U0 s 1, and for k s 1, 2, . . . , let m k s Uk y Uky1.
As before, let Cn denote the number of comparisons needed to sort the first n
keys by insertion sort using S and let X i denote the number of comparisons
required for inserting the Ž i q 1.st key.
We may write
ny1
Cn s
Ý
Xi .
is1
So
E w Cn x s
ny1
Ý E w Xi x
is1
and, since we assume the X i ’s are independent,
Var w Cn x s
ny1
Ý
is1
Var w X i x .
Binary search is a consistent strategy which, when searching a subarray of
length l, selects as a probe the element whose rank Žrelative to the other
elements in the subarray. is u lr2 v . This strategy results in a sequence of
complete decision trees. The external nodes of the complete decision tree
representing the binary search algorithm Si lie only on one or two levels: the
lowest level ? log 2 i @ Žif i q 1 is a power of 2. or the two lowest levels ? log 2 i @
and ? log 2 i @ q 1. Thus X i s ? log 2 i @ q Bi , where Bi is a Bernoulli random
variable that assumes the value 1 with probability
2 Ž i q 1 y 2 ? log 2 i @ .
iq1
,
the proportion of external nodes at level ? log 2 i @ q 1, as is easily shown. So
E w Xi x s
2 Ž i q 1 y 2 ? log 2 i @ .
iq1
q ? log 2 i @
and
E w Cn x s
ny1
Ý
is1
2 Ž i q 1 y 2 ? log 2 i @ .
iq1
ny1
q
Ý ?log 2 i @
is1
s n log 2 n q O Ž n . .
For binary search, Uk s 2 kq 1 y 1, the number of nodes in a complete tree of
height k, so m k s 2 k . To compute sU2 j , we may first sum the variances of the
1293
TREE-GROWING STRATEGIES
X i ’s corresponding to trees of height k, for k s 1, . . . , j, and then sum over k:
2 ky1
j
sU2 j s
Ý Ý
ks1 is0
Var X L kqi .
Again using the result about the number of nodes on levels ?log 2 i @ and
?log 2 i @ q 1,
2 ky1
Ý
is0
Var X L kqi s
ž
2 ky1
Ý
is0
/ž
2 Ž i q 1.
2k q i q 1
1y
2 Ž i q 1.
2k q i q 1
/
.
Some straightforward algebraic manipulation shows that
2 ky1
Ý
is0
Var X L kqi s 6 Ž 2 k . w H2 kq 1 y H2 k x y 4 kq1 H2Ž2.kq 1 y H2Ž2.k y 2 kq1 ,
where HnŽ m. s Ý nis11ri m , the nth harmonic number of order m. ŽWe omit the
superscript when it is 1.. To simplify this expression, we use the asymptotic
approximations
1
Hn s ln n q g q O
,
n
ž /
where g is Euler’s constant, and
HnŽ2. s
p2
y
6
1
n
qO
ž /
1
n2
.
Using these we obtain
2 ky1
Ý
is0
Var X L kqi s Ž 6 ln 2 y 4 . 2 k q O Ž 1 . .
Then
sU2 j s Ž 6 ln 2 y 4 .
j
Ý
2k q OŽ j.
ks1
s V Ž Uj . .
Also, if 0 F l F 2 y 1,
j
l
Ý Var
is0
X L jqi s 4 jq1
ž
1
2 qlq1
j
y
1
2j
/
q 6 Ž 2 j . Ž H2 jqlq1 y H2 j . y 2 l q O Ž 1 . .
Thus for general n s L j q i, where 0 F i F 2 j y 1,
sL2 jqi s sU2 jy 1 q 4 jq1
ž
1
2 qiq1
j
y
1
2j
/
q 6 Ž 2 j . Ž H2 jqiq1 y H2 j . y 2 i q O Ž 1 . .
1294
J. LENT AND H. M. MAHMOUD
Since, in the case of binary search, hU j s O Žln Uj ., there is a positive constant
c such that sU j s V ŽUj1r2 . s V ŽexpŽ chU j ... Then for i s 0, . . . , m j y 1,
sL jqi ) sU jy 1
ž ž
s V exp c h L jqi y 1
ž
/
//
s V exp Ž ch L jqi . .
That is, sn s V ŽexpŽ ch n .., so h n s oŽ sn . and the sufficient condition for
normality is satisfied. Writing n s L j q i and simplifying the above expression for sL2 jqi gives
6 Ž 1 q f n . ln 2 y 6
sn2 s n
2
fn
y2q
4
4 fn
q O Ž ln n . ,
where
f n ' log 2 n y ? log 2 n @ .
w The sequence f n 4`ns1
is dense on the interval w 0, 1..x Then
Cn y n log 2 n
'nA
ªD N Ž 0, 1 . ,
n
where
6 Ž 1 q f n . ln 2 y 6
4
.
4 fn
The analytic continuation of the function A n s AŽ f n . to the real line, as f n
ranges over the interval w 0, 1., attains its minimum value Žapproximately
0.151911. at the point
An s
2
fn
W Ž y8r3e 2 . q 2
ln 2
y2q
y 1 s 0.141469 . . . ,
where W Ž x . is the principal branch of the omega function w see Fritsch,
Schafer and Crowley Ž1973.x . It attains its maximum value Žapproximately
0.174449. at
W Ž 0, y8r3e 2 . q 2
ln 2
y 1 s 0.707059 . . . ,
where W Ž0, x . is the zeroth branch of the omega function. Intuitively, we can
think of A n as an asymptotic average of the variances of X 1 , . . . , X n 4 . Thus
A n increases in n as long as Varw X n x ) A ny1. This occurs while the numbers
of external nodes on the two lowest levels of Tn are close, and thus f n is close
to lnŽ4r3.rln 2 Žapproximately 0.415037.. When one level of Tn has many
more external nodes than the other, Varw X n x is small and thus A n tends to
decrease. As each level of the tree fills up and f n takes on increasing values
in the interval w 0, 1., A n completes one cycle: it moves from 6 ln 2 y 4
Žapproximately 0.158883. down to its minimum Žfor that cycle., then up to its
TREE-GROWING STRATEGIES
1295
maximum Žfor that cycle. and finally back down to 6 ln 2 y 4. With each new
cycle, A n more closely approximates its analytic continuation.
A second commonly used consistent strategy is linear search, which, when
searching an array of size l, always selects as a probe the data element of
rank l Žlinear search from the bottom. or the data element of rank 1 Žlinear
search from the top.. For linear search it is known w see Gonnet and BaezaYates Ž1991.x that
n2
E w Cn x s
q O Ž n.
4
and
n3
Var w Cn x ;
.
36
Since sn2 s V Ž n 3 . and h n s n y 1, the sufficient condition holds for the linear
search strategy; thus
Cn y n 2r4
n
3r2
ž
ªD N 0,
1
36
/
.
6. Normality of consistent strategies. In this section we use the
sufficient condition for normality to prove that all consistent strategies are
normal. Let S s Si 4`is1 be a consistent strategy represented by the decision
trees T s Ti 4`is1 and let X i , Cn , sn2 , L k , Uk and m k be as defined. We begin by
establishing some properties of consistent strategies.
PROPERTY 1. For each positive integer i, the decision tree Ti representing
an algorithm from a consistent strategy has at least one external node on each
unsaturated level.
The right subtree of the tree in Figure 6, for example, has the following
shape characteristic: its left sub-subtree is complete down to level two, while
its right sub-subtree has neither internal nor external nodes on level two. The
third level of the tree is therefore both unsaturated and bereft of external
nodes. We will show that this tree cannot be part of a sequence of decision
trees representing a consistent strategy.
FIG. 6.
An extended decision tree that cannot represent an algorithm from a consistent strategy.
1296
J. LENT AND H. M. MAHMOUD
PROOF OF PROPERTY 1. The first probe specified by Si , i s 1, 2, . . . , divides
an array of size i into two subarrays. Let g S Ž i . denote the size of the smaller
of the two subarrays and g XS Ž i . the size of the larger. w Naturally we define
g S Ž0. s g XS Ž0. s 0.x For a cleaner notation, we suppress the subscript S . If Ti
had an incomplete level with no external nodes, Ti would contain a subtree
whose two sub-subtrees Žleft and right. had the following property: one
sub-subtree would be complete down to some level l, while the other would
have neither internal nor external nodes at level l. Since Si is a consistent
algorithm, this would imply that for some number nX F i, the size of the
subtree,
g Ž l . Ž nX . s 0 and
g Ž l. Ž g X Ž nX . . ) 0
w where g Ž p. Ž i . s g Ž g Ž g ??? Ž i ..., the function g composed p times, for a
positive integer p x . This is a contradiction, since g must be nondecreasing,
but g X Ž nX . - nX . I
PROPERTY 2. For consistent strategies, the sequence m k 4`ks1 is nondecreasing in k, so Ukrm k F k q 1.
PROOF. Let g and g X be as defined. ŽAgain we suppress the subscript S ..
Then for each positive integer i, the larger subtree Žrooted at level one. of Ti
has the same shape as Tg X Ž i. , the decision tree for S g X Ž i. . Since height is
monotonic in size, the larger subtree of TL kq 1 }recall that TL kq 1 is the
smallest decision tree in T having height k q 1}has the shape of TL k, the
smallest decision tree in T whose height is k. So g X Ž L kq 1 . s L k . Similarly,
g X ŽUkq 1 . s Uk . Also, by the tree-growing property, if i 2 ) i 1 , then g X Ž i 2 . G
g X Ž i 1 . and i 2 y g X Ž i 2 . G i 1 y g X Ž i 1 .. Since for all i, g X Ž i . - i, this implies
w taking i 1 s g X Ž i . and i 2 s i x that
i y gX Ž i . G gX Ž i . y gX Ž gX Ž i . . .
Then
m kq 1 s Ukq1 y Uk
s Ukq 1 y g X Ž Ukq1 .
G g X Ž Ukq 1 . y g X Ž g X Ž Ukq1 . .
s Uk y Uky1
s mk .
Since the m k ’s are nondecreasing in k,
Uk
mk
PROPERTY 3.
s
1 q m1 q m 2 q ??? qm k
mk
F k q 1.
For consistent strategies, sn2 s V Ž n..
I
TREE-GROWING STRATEGIES
1297
PROOF. Suppose n G 100. First note that, unless Ti Žthe decision tree
illustrating Si . has at least 95% of its external nodes at a single level, we
have Varw X i x G 0.0475. Hence, unless at least 95% of the trees Ti in the
n
2
subsequence Ti 4is1
have this property, we have snq1
G 0.002375n.
n
Now if 95% or more of the trees Ti in Ti 4is1 have this property Žthat is, at
least 95% of their external nodes lie on a single level., then this level, level L
say, must be the same for all the integers i, 0.10n F i F n, for which Ti has
this property. This is so because if the level L were to shift, say to level L q D
n
for some integer D / 0, then Ti 4is1
would contain a subsequence of decision
trees having fewer than 95% of their external nodes on any single level, and
n
this subsequence would include more than 5% of the decision trees in Ti 4is1
.
Toward a contradiction, suppose Tn has fewer than 0.80n external nodes
on level L. Let
i 0 s max i F n: Ti has at least 0.95i nodes on level L 4 .
We know that i 0 G 0.95n, so we may add at most 0.05n nodes to Ti 0 to obtain
Tn . If we add all the n y i 0 internal nodes to level L of Ti 0 Žreducing by as
many as possible the number of external nodes on level L ., the number of
external nodes on level L of Tn will still be at least 0.8525n, a contradiction.
So Tn must have at least 0.80n external nodes on level L. Now since Tg X Ž n. is
a subtree of Tn and contains at least 0.50n external nodes, Tg X Ž n. has at least
0.40 g X Ž n. external nodes on level L y 1 Žmeasured from its own root.. Let
i 1 s min i ) g X Ž n . : Ti has at least 0.80i nodes on level L 4 .
ŽNote that i 1 F n must exist.. Since Tg X Ž n. has at least 0.40 g X Ž n. external
i1 X
nodes on level L y 1 and g X Ž n. G nr2, the subsequence Ti 4isg
Ž n. contains
i1 X
more than 0.05n trees. This is a contradiction, since all trees in Ti 4isg
Ž n.
have fewer than 95% of their external nodes on level L. I
We now present our main result.
THEOREM 1.
All consistent strategies are normal.
PROOF. If lim n ª` g Ž n.rn s 1r2, the strategy is similar to binary search:
the rate of growth in h n is O Žln n., while, by Property 3, the rate of growth
in sn2 is V Ž n., so the sufficient condition clearly holds. Now assume
lim n ª` g Ž n.rn s c for some constant c g w 0, 1r2.. Let h n be the height and
bn the number of complete levels of a decision tree of size n. We first show
that in this case, there are positive constants N1 and c1 such that c1 - 1 and
for n ) N1 we have
b n F c1 h n .
If bn is bounded above by some constant, there is nothing to prove, so assume
that lim n ª` bn s `. ŽThe limit must exist because bn is nondecreasing in n..
Since lim n ª` g Ž n.rn s c - 1r2, there is a positive constant c 2 - 1r2 and an
integer N2 such that for n ) N2 , g Ž n. - c 2 n. That is, the smaller subtree of a
1298
J. LENT AND H. M. MAHMOUD
decision tree of size n ) N2 has fewer then c 2 n nodes. Further, each subtree
of size n ) N2 also has this property; that is, its smaller sub-subtree has at
most c 2 n nodes. Thus there is a constant integer c3 such that for n large
enough,
g Ž b ny1. Ž n . - c 2b nyc 3 Ž 12 . n,
c3
where, by definition, g Ž k . Ž n. s g Ž g Ž ky1. Ž n... Since g Ž b ny1 . Ž n. s 1, we have
c 2b nyc 3 Ž 12 . n ) 1.
c3
Taking logarithms on both sides of the inequality gives
Ž bn y c3 . ln c2 y c3 ln 2 q ln n ) 0,
and, rearranging, we obtain
bn -
ž
ln n
q c3 1 y
ln Ž 1rc2 .
ln 2
ln Ž 1rc2 .
/
.
Since h n G ? log 2 n @ and c 2 - 1r2, this implies that there are positive constants N1 and c1 such that c1 - 1 and for n ) N1 ,
b n F c1 h n .
In the notation introduced at the beginning of the previous section, we have
shown that there is a constant j0 such that for j ) j0 ,
bi j F c1 j,
where bi j is the number of complete levels of TL jqi , i s 0, 1, . . . , m j y 1. By
Property 1, TL jqi has at least one external node on every level below bi j , so
for i s 0, . . . , m j y 1,
Var X L jqi G
G
j
1
Ý
Lj q i q 1
1
Uj
ks b i j
k y E X L jqi
2
X
j
Ý w kX y mX x 2 ,
X
k s1
where jX s j y bi j q 1 and mX s Ew X L jqi x y bi j q 1. Since the sum of squares
above may be minimized by setting mX s Ž jX q 1.r2, we have
Var X L jqi G
s
s
X
j
1
Uj
1
Uj
Ý
kX y
X
k s1
½
jX 3
12
V Ž jX 3 .
Uj
jX q 1
2
q O Ž jX 2 .
.
5
2
TREE-GROWING STRATEGIES
1299
Recalling that jX G Ž1 y c1 . j for sufficiently large j, we have
Var X L jqi s
V Ž j3 .
Uj
.
Then
m jy1
Ý
is0
Var X L jqi s
s
G
m jy1
V Ž j3 .
is0
Uj
Ý
ž /Ž.
ž /Ž.
mj
Uj
V j3
1
jq1
V j3
s V Ž j2 . ,
where we have used Property 2 to obtain the inequality. So
sU2 j
j
m ky1
ks1
is0
Ý Ý
s
Var X L kqi
j
s
Ý VŽ k2 .
ks1
s V Ž j3 . .
Thus sU j s V Ž j 3r2 ., which implies that j s oŽ sL j . or h n s oŽ sn .. I
APPENDIX
Normality of other tree-growing search strategies. Properties 1]3,
as is clear from their proofs, apply to all tree-growing strategies for which the
position of each probe depends only on the length of the data subarray being
searched. Here we consider tree-growing strategies which, though not consistent, retain this characteristic and thus possess Properties 1]3. The proof in
the previous section clearly works for all such strategies in which the limit
lim n ª` g Ž n.rn does not exist, as long as lim sup n ª` g Ž n.rn - 1r2. One
example of such a strategy is the following tree-growing implementation of
Fibonaccian search. ŽNot all implementations of Fibonaccian search have the
tree-growing property w see Knuth Ž1973., Section 6.2.1x . We restrict our
discussion to tree-growing implementations.. Let the position of the probe to
be selected in an array of size n be specified by
¡n,
j Ž n . s~n y F
¢F ,
1
k Ž n.y1 q 1,
k Ž n.q1
if n F 3,
if Fk Ž n.q1 F n F Fk Ž n.q1 q Fk Ž n.y1 y 1,
if Fk Ž n.q1 q Fk Ž n.y1 F n F Fk Ž n.q2 y 1,
1300
J. LENT AND H. M. MAHMOUD
where, for n G 3, k Ž n. is such that
Fk Ž n.q1 F n - Fk Ž n.q2
and Fi is the ith Fibonacci number.
Using this strategy, we ‘‘grow’’ the Fibonacci tree of size Fkq 1 y 1 into a
Fibonacci tree of size Fkq 2 y 1 in the following way. We first add Fky1 nodes
to the larger subtree of TFkq 1y1 , bringing the size of this subtree to Fkq1 y 1.
Then we add Fky 2 nodes to the smaller subtree, bringing its size to Fk y 1.
The new Fibonacci tree then has size Fkq 1 y 1 q Fky1 q Fky2 s Fkq2 y 1.
With this strategy we have, using our previous notation,
lim sup
g Ž n.
n
nª`
s lim
kª`
s lim
kª`
s
Fky 1 y 1
Fkq 1 y 1
Ž 1r'5 . f ky 1
Ž 1r'5 . f kq 1
1
f2
s 0.381966 . . . ,
where
fs
1 q '5
2
,
the golden ratio w as in Knuth Ž1973., page 416x . This strategy is not consistent, because
lim inf
nª`
g Ž n.
n
s lim
kª`
s lim
kª`
s
Fky 1 y 1
Fkq 1 q Fky1 y 1
f ky 1
f kq 1 q f ky1
1
f q1
2
s 0.276393 . . . .
Since we do, however, have lim sup n ª` g Ž n.rn - 1r2, this strategy is normal. Note that we may draw this conclusion from our general result, avoiding
a somewhat complicated variance calculation.
The following proposition further expands the class of strategies for which
we can prove normality.
PROPOSITION 1. Any search strategy S for which lim inf n ª` g Ž n.rn s
k 1 g Ž0, 1r2. is normal.
TREE-GROWING STRATEGIES
1301
PROOF. Let g X Ž n. be as defined. Since lim inf n ª` g Ž n.rn s k 1 ) 0, there
are constants j0 and k 2 such that k 2 - 1 and for j ) j0 we have g X ŽUj . F k 2 Uj .
Then for j ) j0 ,
g XŽ jyj 0 . Ž Uj . F k 2jyj 0 Uj .
For example, for j s j0 q 2 we have
g XŽ2. Ž Uj 0q2 . s g X Ž g X Ž Uj 0q2 . . s g X Ž Uj 0q1 . F k 2 Uj 0q1 s k 2 g X Ž Uj 0q2 . F k 22 Uj 0q2 .
Since g XŽ jyj 0 . ŽUj . G 1, this implies that
1 F k 2jyj 0 Uj .
Thus there is some constant k 3 ) 0 such that for j ) j0 , Uj ) expŽ k 3 j ..
Writing j s h n and noting that Ujy1 - n, we have, by Property 3,
sn2 s V Ž n . s V Ž exp Ž k 3 h n . . ,
so h n s oŽ sn .. I
Another implementation of Fibonaccian search provides a practical example of a strategy for which the previous proposition applies. Let the position of
the probe to be selected in an array of size n be determined by the function
¡n,
j Ž n . s~ F ,
¢n y F
2
if n F 2,
if Fk Ž n.q1 F n F Fk Ž n.q1 q Fk Ž n.y2 y 1,
k Ž n.
k Ž n.
q 1,
if Fk Ž n.q1 q Fk Ž n.y2 F n F Fk Ž n.q2 y 1,
where Fk Ž n. is as previously defined. This strategy is similar to the one
determined by j 1 above, except we add nodes to the smaller subtree first,
then to the larger. With this strategy we have, using our previous notation,
lim sup
g Ž n.
nª`
n
s lim
kª`
Fk y 1
Fkq 1 q Fky2 y 1
1
.
2
Here, again, we do not have consistency, since
s
lim inf
nª`
g Ž n.
n
s lim
kª`
s lim
kª`
s
1
f2
Fky 1 y 1
Fkq 1 y 1
f ky 1
f kq 1
,
but the normality of this Fibonaccian strategy follows from Proposition 1.
Many other tree-growing strategies are also normal. Alternating linear
search, for example, is clearly normal and shares the asymptotic mean and
variance of linear search from the bottom Žor top.. Intuitively, if a strategy
1302
J. LENT AND H. M. MAHMOUD
corresponds to a set of decision trees T s Ti 4`is1 for which the height of Ti
grows at a steady rate as i grows, the strategy is likely to be normal. In fact,
we can prove the normality of all search strategies S whose corresponding
decision trees T satisfy one of the following sets of conditions:
1. Ža. m k s O Ž k 1q « . for some « in the interval Ž0, 1., Žb. Ukrm k s O Ž k 1q d .
for some d in the interval Ž0, 1. and Žc. the tree Ti is complete down to a
certain level; below this level, Ti has at least one external node on every
cth level for some constant c.
2. Ža. m k s V Ž k 1q « . for some « ) 0 and Žb. VarŽ X n . s V Ž1..
Acknowledgments. The authors gratefully acknowledge the contribution of Sudip Bose of the George Washington University, whose excellent
suggestions substantially improved this paper. In addition, the anonymous
referees who reviewed the paper offered many helpful comments. The authors
are particularly thankful for the proof of Property 3 and the example of a
tree-growing strategy that violates the sufficient condition for normality,
which one of the referees provided.
REFERENCES
BILLINGSLEY, P. Ž1986.. Probability and Measure. Wiley, New York.
DOBERKAT, E. Ž1982.. Asymptotic estimates for the higher moments of the expected behavior of
straight insertion sort. Inform. Process. Lett. 14 179]182.
FRITSCH, F., SCHAFER, R. and CROWLEY, W. Ž1973.. Solution of the transcendental equation
v e v s x. Comm. ACM 16 123]124.
GONNET, G. and BAEZA-YATES, R. Ž1991.. Handbook of Algorithms and Data Structures, 2nd ed.
Addison-Wesley, Reading, MA.
HU, T. and WACHS, M. Ž1987.. Binary search on a tape. SIAM J. Comput. 16 573]590.
KNUTH, D. Ž1973.. The Art of Computer Programming 3. Addison-Wesley, Reading, MA.
PANNY, W. Ž1986.. A note on the higher moments of the expected behavior of straight insertion
sort. Inform. Process. Lett. 22 175]177.
DEPARTMENT OF STATISTICS
GEORGE WASHINGTON UNIVERSITY
WASHINGTON , DC 20052
E-MAIL : lent j@bls.gov
hosam@gwis2.circ.gwu.edu
Download