Paper Review for CSCI356

Spring 1999
David Paulson
Reviewed:
“Fast Algorithms for Sorting and Searching Strings”
by Jon Bentley and Robert Sedgewick
I actually combined two papers for this review, both on structures and algorithms for storing and
sorting sets of strings. I originally reviewed "Ternary Search Trees" by Jon Bentley and Robert
Sedgewick from Dr. Dobb's Journal, April 1998 (the "Algorithms" issue). That article had a Web link to
the paper named in the title above, "Fast Algorithms for Sorting and Searching Strings", by the same authors.
The latter paper seemed more appropriately technical (it includes analysis), whereas the Dr. Dobb's article
had good applied programming tips that I used in my project. The paper is 10 pages and the Dr. Dobb's
article is 5.
This paper presents algorithms for sorting and searching multikey data (strings), with details of C
implementations. It is a very well written, practical, application-oriented paper with direct relevance
to improving sort and search times in real programs. The paper is divided into seven sections, including an
introduction, background (justifying the need for a ternary tree), analysis, coding, 'advanced' applications
(e.g., wildcard searching), and conclusions. The analysis section is a little sparser than I had hoped,
offering derived results with only 'proof sketches' and mostly referring to previous authors' results. The
program samples are in C, are available on the authors' web site, and are quite practical. There are a great
many (27) references to previous work. The additional Dr. Dobb's Journal article breaks the program analysis
down even further, including software optimization techniques.
The sort algorithm presented is a blend of Quicksort and radix sort, and the search algorithm is a
blend of digital 'tries' and binary search trees. Quicksort is introduced as the classical binary partitioning
algorithm that splits the data about some key into less-than and greater-than sets, with a 'well known' need
for also keeping an equals set but no simple way of doing so. Multikey Quicksort acts like normal Quicksort,
partitioning on a character into greater-than and less-than sets, but also acts like a radix sort in that, on
equal characters, the sort moves on to the next character position. Several analysis results for Quicksort are
presented as theorems:
1) Partitioning about a single random element sorts N items in 1.39 N lg N comparisons on average.
2) Partitioning about the median can sort N items in N lg N + O(N) comparisons.
3) In the general worst case, Quicksort with median partitioning sorts N items in cN lg N + O(N) comparisons.
   (This is better than our text's O(N^2) worst case with in-order partitioning.)
4) The average cost of a successful search in a randomly built binary search tree is approximately 1.39 lg N
   comparisons.
The paper then describes an 'isomorphism' (a mapping analogy) between Quicksort and a binary
search tree. An example shows the correspondence between building a tree directly from a data set (taking
the elements in 'input', i.e. left-to-right, order) and the sub-partitions Quicksort produces when it, too,
partitions from the first element on, left to right: the branches of the tree correspond to the Quicksort
sub-partitions. They then note that a (balanced) binary tree can also be searched in O(lg N) comparisons,
analogous to Quicksort. Extending the 'isomorphism' idea, the paper lists the following correspondences:
Algorithm              Data Structure
Quicksort              Binary Search Tree
Multikey Quicksort     Ternary Search Tree
MSD Radix Sort         Digital Tries
The middle pair, Multikey Quicksort and ternary search trees, is what the rest of the paper focuses on. A
'ternary partitioning' algorithm (which the authors trace back to Hoare's original Quicksort paper) sorts a
set 's', or array, of n strings (actually any vector type) of maximum length 'k'. The recursive call at depth
'd' (the character offset into the strings) looks like:
sort(s, n, d)                    // first call is sort(s, n, 1) at the tree's root
    if n <= 1 or d > k return    // k is the maximum possible string length
    choose a partition value v   // could be the first char, a random char, or some other method
    partition s around value v on component d into subsequences
        s<, s=, s>  of sizes  n<, n=, n>
    // the next three recursive calls are a ternary tree walk
    sort(s<, n<, d)
    sort(s=, n=, d+1)            // note: the equal set moves on to the next char position
    sort(s>, n>, d)
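To make the pseudocode concrete, here is a minimal C sketch of multikey Quicksort on an array of
null-terminated strings. This is my own simplified illustration (a plain three-way partition around the
first string's d-th character), not the authors' tuned implementation, which adds median-of-three
partitioning, insertion sort for small subarrays, and pointer arithmetic:

static void swapstr(char **a, char **b) { char *t = *a; *a = *b; *b = t; }

// Sort the n strings in s[], considering characters from offset d onward.
void mkqsort(char **s, int n, int d)
{
    if (n <= 1)
        return;
    int v = s[0][d];                // partition value: d-th char of the first string
    int lt = 0, i = 0, gt = n;      // three-way partition: [0,lt) < v, [lt,gt) == v, [gt,n) > v
    while (i < gt) {
        int c = s[i][d];
        if (c < v)
            swapstr(&s[lt++], &s[i++]);
        else if (c > v)
            swapstr(&s[i], &s[--gt]);
        else
            i++;
    }
    mkqsort(s, lt, d);              // strings whose d-th char is less than v
    if (v != '\0')                  // equal group: advance to the next char,
        mkqsort(s + lt, gt - lt, d + 1);   // unless we just matched the terminator
    mkqsort(s + gt, n - gt, d);     // strings whose d-th char is greater than v
}

A call like mkqsort(words, nwords, 0) sorts an array of C strings (the pseudocode's depth 1 corresponds
to character offset 0 here).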
As the comment in the pseudocode before the three recursive calls notes, those calls correspond to an in-order
walk of a 'ternary' tree with less-than, equal, and greater-than branches. The paper shows an example of a
(balanced) ternary tree for 12 two-letter words (as, at, be, by, he, in, is, it, of, on, or, to). In that tree,
a search for 'by' starts at the root 'i', compares and proceeds down the < branch to 'b', compares and follows
the = branch to the 'e' node, compares and proceeds down its > branch to 'y', and compares to find that 'by' is
in the tree. A search for 'ax' does three compares to get to the 'a' node and two more compares before failing
to find the 'x'.
The paper somewhat glosses over the problem of balancing a ternary tree (references for balancing general
k-ary trees are given, but I did not get those reprints). Instead, the authors describe a way of building a
mostly balanced, searchable tree from pre-sorted data: insert the median element, then recursively insert all
lesser and then all greater elements.
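That build-from-sorted-data idea is simple enough to sketch in a few lines of C. This is my own
illustration, not the paper's code; insert_into_tree() is a hypothetical placeholder for an ordinary
ternary-search-tree insertion routine:

// Build a roughly balanced tree from sorted[lo..hi] (already in sorted order):
// insert the median of the range first, then recurse on the two halves.
void balanced_build(char **sorted, int lo, int hi)
{
    if (lo > hi)
        return;
    int mid = lo + (hi - lo) / 2;          // median of this range
    insert_into_tree(sorted[mid]);         // hypothetical TST insertion routine
    balanced_build(sorted, lo, mid - 1);   // then all lesser elements
    balanced_build(sorted, mid + 1, hi);   // then all greater elements
}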
The analysis section mostly refers to other papers' results and then builds on them somewhat,
presenting 'proof sketches':
5) A search in a balanced ternary tree of N vectors (of length k) takes at most lg N + k comparisons.
6) Multikey Quicksort, partitioning around a median found in cN comparisons, sorts N k-vectors in
   cN(lg N + k) comparisons (this extrapolates from 5 above).
Two other (more complex) results for median-partitioning methods are also given.
The next, fairly lengthy, section gives coding examples for multikey Quicksort and ternary search
trees as applied to string sorting and searching. The authors note that ternary trees have been neglected as a
mostly theoretical construct (for sorting 'k-vectors') but have very practical, direct application to strings,
and they show that their 'tuned' sorting code can usually outperform the C library's version of Quicksort,
qsort.
The paper describes their sample code for multikey Quicksort (the pseudocode algorithm described
earlier). They include two versions: an inefficient, 'directly coded' version, and a second version that uses
tricks like "sorting small subarrays with insertion sort and partitioning around the median of three
elements", as well as converting indexed arrays to pointer arithmetic. The results, in seconds, of running
these samples on a 72,275-word dictionary are:
CPU           qsort   direct   tuned   Radix
Mips-150        .85      .79     .44     .40
Mips-100       1.32     1.30     .68     .62
Pentium-90     1.74      .98     .69     .50
486DX-33       8.20     4.15    2.41    1.74
This shows that even the simple (directly coded) sort is generally faster than the system qsort, and the
tuned version is significantly faster still. The last column represents what they describe as "the fastest
algorithm they know, the highly tuned radix sort of McIlroy, Bostic, and McIlroy", which is apparently quite
complex.
Next the ternary search tree implementation is described and analyzed. The nodes are basically a
structure like:
typedef struct tnode {
    char splitchar;                                  // char this node splits/sorts by
    struct tnode *lo_child, *eq_child, *hi_child;    // pointers to the <, =, > branches
} Tnode;
The code for insert and search is much like binary search tree code, with the exception of the 'equals' case:
on an equal comparison (during search or insert), the code advances to the next character of the string being
searched for or inserted. They then tested various ways of inserting a set of strings (dictionary order,
random, ...) and found that the number of nodes in a ternary tree is the same regardless of insertion order:
- the quantity of equal branches is always the same
- the quantities of lo and hi branches vary (the tree's shape changes)
- the total quantity of nodes is the same
The time to do the insertions varied by a factor of 4 (reverse and random order are worst), whereas in a
binary search tree the slow-down is a factor of 2000, so ternary trees are far more robust with respect to
input order.
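To make the search loop concrete, here is a minimal C sketch over the Tnode structure above. It is my own
illustration rather than the authors' code, and it assumes each string was inserted including its terminating
'\0', so reaching a node whose splitchar is '\0' along the equal path signals a complete word:

// Return 1 if key is in the tree rooted at p, 0 otherwise.
int tst_search(const Tnode *p, const char *key)
{
    while (p) {
        if (*key < p->splitchar)
            p = p->lo_child;            // branch left on a smaller character
        else if (*key > p->splitchar)
            p = p->hi_child;            // branch right on a larger character
        else {                          // equal: this character matches
            if (*key == '\0')
                return 1;               // matched the terminator: the word is present
            key++;                      // consume the character and
            p = p->eq_child;            // continue down the equal branch
        }
    }
    return 0;                           // ran off the tree: the word is absent
}

Insertion is analogous, except that a missing branch is allocated rather than ending the loop.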
The paper also describes timing comparisons of searching, in which a 'highly tuned binary search' on average
took about twice as long to find a string as the ternary tree did. They also compared searching against a hash
table (the one from K&R, but with the hash function inlined). A summary of search performance follows:
               Successful (hit)      Unsuccessful (miss)
CPU            TST      Hash         TST      Hash
Mips-150       .44      .43          .27      .39
Mips-100       .66      .61          .42      .54
Pentium-90     .58      .65          .38      .50
For successful searches (i.e., the string is in the tree), the two methods are very similar; however, the
ternary search tree is much faster in the unsuccessful case (the string is absent). This is because hashing
must generate and compare the whole key, whereas the ternary tree usually detects a miss after only a few
characters (unless the strings differ only near the end). The main drawback of ternary search trees relative
to hash tables is that ternary trees can take more storage space, an average of 3X in the paper's examples. At
Bell Labs the ternary tree method is used to represent English dictionaries; it is much faster than hashing
and easily handles Unicode symbols.
The paper also demonstrates using ternary search trees for advanced searches such as wildcard
(partial-match) matching and 'nearest neighbor' searching. For wildcard search (where a '.' in the pattern
matches any single character), the algorithm is: if the current pattern character is not a '.', proceed as in
a normal search; if it is a '.', recursively search all three branches (lo, equal, and hi), advancing to the
next pattern character on the equal branch. (A sketch of this recursion appears after the table below.) This
works fine, but tends to slow down when the wildcard characters are near the beginning of the pattern. Here is
some example data from their 72,000-word dictionary ternary tree:
Pattern        qty-matching   nodes-searched
television           1               24
tele......          17              265
t.l.v.s..n           1              164
....vision           1            37178
banana               1               17
ban...              15              166
.a.a.a              19             2746
...ana               8            13756
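Here is a rough C sketch of that partial-match recursion, again my own illustration under the same Tnode
conventions as before (strings stored with their terminating '\0'), not the authors' code. It returns the
number of stored words matching the pattern, which corresponds to the qty-matching column above:

// Count the words in the tree rooted at p that match pat, where '.' in pat
// matches any single character.
int tst_pmsearch(const Tnode *p, const char *pat)
{
    int matches = 0;
    if (!p)
        return 0;
    if (*pat == '.' || *pat < p->splitchar)          // a '.' (or a smaller char) explores lo
        matches += tst_pmsearch(p->lo_child, pat);
    if (*pat == '.' || *pat == p->splitchar) {       // a '.' or an exact match takes the equal branch
        if (*pat == '\0' && p->splitchar == '\0')
            matches++;                               // both at the terminator: a complete match
        else if (*pat != '\0' && p->splitchar != '\0')
            matches += tst_pmsearch(p->eq_child, pat + 1);
    }
    if (*pat == '.' || *pat > p->splitchar)          // a '.' (or a larger char) explores hi
        matches += tst_pmsearch(p->hi_child, pat);
    return matches;
}

The pattern ....vision illustrates the slowdown noted above: leading wildcards force the search to explore
most of the tree's upper levels.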
In conclusion, this paper demonstrates that multikey Quicksort is competitive with the best known
string-sorting algorithms. It also demonstrates that ternary search trees are a very efficient string
symbol-table mechanism, especially when the search keys are long strings. These algorithms have many practical
applications and have already been incorporated into several commercial systems.
To summarize:
- Ternary trees do not incur extra overhead for insertion or successful searches.
- Ternary trees are usually substantially faster than hashing for unsuccessful searches.
- Ternary trees gracefully grow and shrink; hash tables need to be rebuilt after large size changes.
- Ternary trees support advanced searches, such as partial-match and near-neighbor search.
- Ternary trees support many other operations, such as traversal to report items in sorted order.
The paper is clearly written, with good examples and sample code snippets; the full code is available on the
authors' web site. While it does not include detailed analysis, it gives sketches of proofs of the performance
bounds and many references that do include the detailed proofs. The only area I felt was somewhat weak was the
lack of any description of re-balancing trees on insertions or deletions, although references on balancing
techniques were given.
The papers are available on-line as:
"Fast Algorithms for Sorting and Searching Strings", Jon Bentley and Robert Sedgewick,
presented at the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms,
New Orleans, January 1997.
http://www.cs.princeton.edu/~rs/strings/paper.pdf
and:
"Ternary Search Trees", Jon Bentley and Bob Sedgewick,
Dr. Dobb's Journal, April 1998.
http://www.ddj.com/articles/1998/9804/9804a/9804a.htm
and the source code samples are at:
http://www.cs.princeton.edu/~rs/strings/demo.c
My own project related to ternary tree performance (Java implementation) is available at:
http://www.ecst.csuchico.edu/~davep/cs356/project.htm