
Paper Review for CSCI356, Spring 1999
David Paulson

Reviewed: “Fast Algorithms for Sorting and Searching Strings” by Jon Bentley and Robert Sedgewick

I actually combined two papers for this review, both on structures and algorithms for storing and sorting sets of strings. I originally reviewed “Ternary Search Trees” by Jon Bentley and Robert Sedgewick from Dr. Dobb’s Journal, April 1998 (the “Algorithms” edition). That article had a Web link to the paper named in the title, “Fast Algorithms for Sorting and Searching Strings” by the same authors. The latter paper seemed more appropriately technical (including analysis), whereas the Dr. Dobb’s article had good applied programming tips that I used in my project. The latter paper is 10 pages and the Dr. Dobb’s article is 5.

This paper presents algorithms for sorting and searching multikey (string) data, with details of ‘C’ implementations. It is a very well written, practical, application-oriented paper with direct applications to improving sort and search times in real cases. The paper is divided into 7 sections, including an intro, background (justifying the need for a ternary tree), analysis, coding, ‘advanced’ applications (e.g. wildcard searching), and conclusions. The analysis section is a little sparser than I had hoped, offering derived results with only ‘proof sketches’ and mostly referring to previous authors’ results. The program samples are in ‘C’, are available on the authors’ web site, and are quite practical. There are a great many (27) references to previous works. The additional Dr. Dobb’s Journal article breaks the program analysis down even further, including software optimization techniques.

The sort algorithm presented is a blend of Quicksort and Radix sort, and the search algorithm is a blend of digital ‘tries’ and binary search trees.
Quicksort is introduced as a classical binary partitioning algorithm that splits the data about some key into less-than and greater-than sets, with a ‘well known’ need for also adding an equals set but no simple way of doing so. The multikey quicksort acts like normal quicksort, partitioning characters into a greater-than and a less-than set, but also acts like a radix sort in that on equal characters the sort moves on to the next field.

Several analysis results for Quicksort are presented as theorems:

1) Partitioning about a single random element sorts N items on average in 1.39N(lgN) comparisons.
2) Partitioning about the median can sort N items in N(lgN)+O(N) comparisons.
3) For the general worst case: Quicksort with median partitioning sorts N items in cN(lgN)+O(N) comparisons. (This is less than our text’s O(n^2) worst case with in-order partitioning.)
4) The average cost of a successful search in a randomly built binary tree is approximately 1.39(lgN) comparisons.

The paper then describes an ‘isomorphism’ (mapping analogy) between Quicksort and a binary search tree by showing an example with a correspondence between the two: a tree is created directly off a data set (inserting it in ‘input’, or left to right, order). They then show how the branches correspond to the various Quicksort sub-partitions that result from partitioning about the first element and on, also left to right. They then mention that a (balanced) binary tree can also be searched in O(lgN) comparisons, analogous to Quicksort.

By extending the ‘isomorphism’ idea, the paper describes the following other correspondences:

    Algorithm             Data Structure
    Quicksort             Binary Search Tree
    Multikey Quicksort    Ternary Search Tree
    MSD Radix Sort        Digital Tries

The middle row (multikey quicksort and ternary search trees) is what the rest of the paper focuses on. A ‘ternary partitioning’ algorithm (described from Hoare’s multikey Quicksort paper) sorts an ‘n’ long set ‘s’, or array, of substrings (actually any vector type) of maximum length ‘k’.
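A short worked note on the constant 1.39 in results 1 and 4: it is consistent with the standard textbook average-case counts for Quicksort and for random binary search trees, converted from natural to base-2 logarithms. This is the usual derivation, not one reproduced from the paper, and lower-order terms are dropped. The symbols C_N (comparisons to sort N items) and S_N (comparisons in a successful search) are names introduced only for this note.

```latex
\begin{align*}
C_N &\approx 2N\ln N = (2\ln 2)\,N\lg N \approx 1.386\,N\lg N \\
S_N &\approx 2\ln N = (2\ln 2)\,\lg N \approx 1.386\,\lg N
\end{align*}
```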
The recursive call at depth (or character offset into the substrings) ‘d’ looks like:

    sort(s,n,d)                     //first call is sort(s,n,1) at tree’s root
        if n <= 1 or d > k return;  //k is max possible substring length
        choose a partition value ‘v’;   //could be first char, random, or other method
        partition ‘s’ around value=‘v’ on component ‘d’ to get
            subsequences s<, s=, s> of sizes n<, n=, n>;
        //next three recursive calls are a ternary tree walk
        sort(s<,n<,d);
        sort(s=,n=,d+1);            //note: here we move to the next char position in the substrings
        sort(s>,n>,d);

As I note in the comment before the last three lines, the recursive sort calls correspond to an in-order tree walk of a ‘ternary’ tree with less-than, equals, and greater-than branches. An example of a (balanced) ternary tree for 12 two letter words (as, at, be, by, he, in, is, it, of, on, or, to) looks like:

    [Figure not reproduced: balanced ternary search tree for the 12 two letter words]

In the above tree, a search for ‘by’ starts at ‘i’, compares and proceeds down the < branch to ‘b’, matches and follows the = branch to the ‘e’ node, compares and proceeds down the > branch to ‘y’, then compares and finds that ‘by’ is in the tree. A search for ‘ax’ does three compares to get to the ‘a’ node and two compares before failing to find the ‘x’.

The paper somewhat glosses over the problem of balancing a ternary tree (references for balancing general k-trees are given but I didn’t get article reprints). Instead they describe a way of building a mostly balanced searchable tree from pre-sorted data: insert the median element, then recursively insert all lesser and then all greater elements.

The analysis section mostly refers to other papers’ results, and then builds on them somewhat by presenting ‘proof sketches’:

5) A search in a balanced ternary tree of N vectors (of length ‘k’) takes at most lgN + k compares.
6) A multikey Quicksort that partitions around a median computed in cN comparisons sorts N k-vectors in cN(lgN+k) comparisons (extrapolates from 5 above).

Two other (complex) results for median partitioning methods are also given.
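The sort(s,n,d) pseudocode above can be sketched in ‘C’ for NUL-terminated strings roughly as follows. This is a minimal illustration under my own assumptions, not the authors’ code: the names mkq_sort and swap_ptrs are hypothetical, the depth ‘d’ is 0-based, the pivot is simply the middle string’s character, and the end-of-string ‘\0’ plays the role of the ‘d > k’ test.

```c
#include <assert.h>
#include <stddef.h>

/* Swap two string pointers in the array (hypothetical helper). */
static void swap_ptrs(const char **a, size_t i, size_t j)
{
    const char *t = a[i];
    a[i] = a[j];
    a[j] = t;
}

/*
 * Multikey quicksort sketch: sort the n strings in a[], all of which are
 * assumed to agree on their first d characters (d is 0-based here).
 */
void mkq_sort(const char **a, size_t n, size_t d)
{
    assert(a != NULL || n == 0);
    if (n <= 1)
        return;

    /* Pivot: character d of the middle string (one choice among many). */
    unsigned char v = (unsigned char)a[n / 2][d];

    /* Three-way partition: a[0..lt) < v, a[lt..i) == v, a[gt..n) > v. */
    size_t lt = 0, i = 0, gt = n;
    while (i < gt) {
        unsigned char c = (unsigned char)a[i][d];
        if (c < v)
            swap_ptrs(a, lt++, i++);
        else if (c > v)
            swap_ptrs(a, i, --gt);
        else
            i++;
    }

    mkq_sort(a, lt, d);                    /* s<: same character position */
    if (v != '\0')                         /* s=: skip if strings have ended */
        mkq_sort(a + lt, gt - lt, d + 1);  /* move to next character */
    mkq_sort(a + gt, n - gt, d);           /* s>: same character position */
}
```

A call such as mkq_sort(words, 12, 0) on an array holding the 12 two letter words should leave them in dictionary order. None of the tuning the paper describes (insertion sort for small subarrays, median-of-three pivots) is attempted here.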
The next, fairly lengthy, section gives coding examples for multikey quicksort and ternary search trees as applied to string searching and sorting. They mention that ternary trees have been neglected as a mostly theoretical construct (for sorting ‘k-vectors’) but that they have very practical direct application to strings, and show that their ‘tuned’ multikey quicksort code can usually outperform the ‘C’ library version of Quicksort, ‘qsort’.

The paper describes their sample code for multikey quicksort (the pseudocode algorithm described earlier). They include two versions: an inefficient ‘directly coded’ version, and a second version which uses various tricks like “sorting small subarrays with insertion sort and partitioning around the median of three elements”, as well as converting indexed arrays to pointer arithmetic based structures. The results, in seconds, of running these samples on a 72275 word dictionary are:

    CPU-speed   Mips-150   Mips-100   Pentium-90   486DX-33
    qsort          .85       1.32        1.74        8.20
    direct         .79       1.30         .98        4.15
    tuned          .44        .68         .69        2.41
    Radix          .40        .62         .50        1.74

This shows that even the simple (direct) coded sort is generally faster than the system qsort, and the tuned version is significantly faster. The last row represents what they describe as “the fastest algorithm they know, the highly tuned radix sort of McIlroy, Bostic, and McIlroy”, which apparently is quite complex.

Next the ternary search tree implementation is described and analyzed. The nodes are basically a structure like:

    typedef struct tnode {
        char splitchar;                                //char to sort this node by
        struct tnode *lo_child, *eq_child, *hi_child;  //pointers to branches
    } Tnode;

The code for insert and search is much like the binary tree code with the exception of the ‘equals’ case. On ‘equals’ (search or insert), the code advances to the next char of the string being searched for or inserted.
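The insert and search logic described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the paper’s tuned code: the function names tst_insert and tst_search are hypothetical, the end of each string is marked by storing its ‘\0’ terminator as a node of its own, no balancing is done, and an out-of-memory condition simply aborts.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct tnode {
    char splitchar;                                /* char to sort this node by */
    struct tnode *lo_child, *eq_child, *hi_child;  /* pointers to branches */
} Tnode;

/* Insert string s below p; returns the (possibly new) root of the subtree. */
Tnode *tst_insert(Tnode *p, const char *s)
{
    assert(s != NULL);
    if (p == NULL) {
        p = malloc(sizeof *p);
        if (p == NULL)
            abort();                 /* out of memory: this sketch gives up */
        p->splitchar = *s;
        p->lo_child = p->eq_child = p->hi_child = NULL;
    }
    if (*s < p->splitchar)
        p->lo_child = tst_insert(p->lo_child, s);
    else if (*s > p->splitchar)
        p->hi_child = tst_insert(p->hi_child, s);
    else if (*s != '\0')             /* equals: advance to the next char */
        p->eq_child = tst_insert(p->eq_child, s + 1);
    return p;
}

/* Return 1 if s was inserted earlier, 0 otherwise. */
int tst_search(const Tnode *p, const char *s)
{
    assert(s != NULL);
    while (p != NULL) {
        if (*s < p->splitchar) {
            p = p->lo_child;
        } else if (*s > p->splitchar) {
            p = p->hi_child;
        } else {
            if (*s == '\0')
                return 1;            /* matched the terminator node */
            s++;                     /* equals: advance to the next char */
            p = p->eq_child;
        }
    }
    return 0;
}
```

Starting from an empty tree (root = NULL), repeated calls of the form root = tst_insert(root, word) build the structure, after which tst_search(root, "by") reports whether ‘by’ is present. A miss is detected as soon as the walk falls off the tree, which is the behavior the search timings later in the review depend on.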
They then tested various ways of inserting a set of substrings (dictionary order, random, ...) and found that the number of nodes in a ternary tree is the same independent of insertion order:

- always the same qty of equal branches
- the qty of lo and hi branches varies, as does the total qty of branches
- the total qty of nodes is the same

The time to do the insertions varied by a factor of 4 (reverse and random order are worst), whereas in a binary search tree the slowdown is a factor of 2000, so ternary trees are more robust to the order of the input data.

The paper also describes simulation comparisons of searching, in which on average a ‘highly tuned binary search’ took about twice as long to search for a string as the ternary tree. They also compared searching with a hash table (from K&R, but with inlined code). A summary of search performance, in seconds, follows:

    CPU-speed              Mips-150      Mips-100      Pentium-90
                           TST   Hash    TST   Hash    TST   Hash
    Successful (hit)       .44   .43     .66   .61     .58   .65
    Unsuccessful (miss)    .27   .39     .42   .54     .38   .50

For successful searches (i.e. the string is in the tree somewhere), the two methods are very similar; however, the ternary search tree is much faster in the unsuccessful case (the string is absent). This is because in hashing the hash of the whole key must be computed before the lookup can begin, whereas in the ternary tree the miss is usually detected after a few characters (unless the strings only differ at the end). The main drawback of ternary search trees compared with hash tables is that ternary trees can take more storage space, an average of 3X in the text’s examples. At Bell Labs the ternary tree method is used to represent English dictionaries; it is much faster than hashing and can easily handle Unicode symbols.

The paper also demonstrates using ternary search trees to do advanced searches like ‘wildcard character matching’ as well as ‘nearest neighbor’ searching.
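The point above about hashing needing the whole key, even on a miss, can be seen in a small chained hash table sketch. The multiplicative hash is written in the style of the K&R example, but everything else here is an assumption for illustration: the names Entry, hashtab, hash_lookup and hash_insert are hypothetical, the table size of 101 is arbitrary, and keys are not copied.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define HASHSIZE 101            /* table size: an arbitrary choice here */

typedef struct entry {
    struct entry *next;         /* next entry on the same chain */
    const char *key;            /* caller-owned string; not copied */
} Entry;

static Entry *hashtab[HASHSIZE];

/* K&R-style multiplicative hash: the loop reads every character of s. */
static unsigned hash(const char *s)
{
    unsigned h = 0;
    for (; *s != '\0'; s++)
        h = (unsigned char)*s + 31 * h;
    return h % HASHSIZE;
}

/* Return 1 if key is in the table, 0 otherwise. */
int hash_lookup(const char *key)
{
    assert(key != NULL);
    for (const Entry *e = hashtab[hash(key)]; e != NULL; e = e->next)
        if (strcmp(key, e->key) == 0)
            return 1;
    return 0;
}

/* Add key if absent; returns 1 on success, 0 if out of memory. */
int hash_insert(const char *key)
{
    if (hash_lookup(key))
        return 1;
    Entry *e = malloc(sizeof *e);
    if (e == NULL)
        return 0;
    unsigned h = hash(key);
    e->key = key;
    e->next = hashtab[h];
    hashtab[h] = e;
    return 1;
}
```

Whatever the outcome, hash_lookup walks the full key once inside hash() before it can look at a chain, so a long absent key costs about as much to reject as a present one costs to find. A ternary tree search can stop at the first character that has no matching branch.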
For wild-card search (where unspecified characters are represented by a ‘.’ that can match anything) the algorithm is: if the current pattern char is not a ‘.’, proceed as in an ordinary search (follow the lo, ‘equals’, or hi branch); if it is a ‘.’, recursively search all three branches, moving on to the next pattern char along the ‘equals’ branch. This works fine, but tends to be slowed when the wild-card chars are near the beginning of the string. Here is some example data from their 72000 word dictionary ternary tree:

    Pattern       qty-matching   nodes-searched
    television          1               24
    tele......         17              265
    t.l.v.s..n          1              164
    ....vision          1            37178
    banana              1               17
    ban...             15              166
    .a.a.a             19             2746
    ...ana              8            13756

In conclusion, this paper demonstrates that multikey quicksort, the sorting counterpart of ternary search trees, is competitive with the best known string sorting algorithms. It also demonstrates that ternary search trees are a very efficient string symbol table storage mechanism, especially when the search keys are long strings. These algorithms have many practical applications and have already been incorporated in several commercial systems. To summarize:

- Ternary trees do not incur extra overhead for insertion or successful searches.
- Ternary trees are usually substantially faster than hashing for unsuccessful searches.
- Ternary trees gracefully grow and shrink; hash tables need to be rebuilt after large size changes.
- Ternary trees support advanced searches, such as partial-match and near-neighbor search.
- Ternary trees support many other operations, such as traversal to report items in sorted order.

The paper is clearly written with good examples and sample code snippets, with the full code sample available on the authors’ web site. While it does not include detailed analysis, it does give sketches of proofs of performance bounds and gives many references which do include the details of the proofs. The only area I felt was somewhat weak was the lack of description of re-balancing trees on adds or deletes, although references on balancing techniques were given.
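Returning to the wild-card search described earlier, the recursion can be sketched as below. This is a hedged illustration, not the authors’ code: the names tst_insert and tst_pm_count are hypothetical, the function only counts matches rather than reporting them, and a plain unbalanced insertion routine is repeated so that the block stands alone. As in the description, ‘.’ matches any single character.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct tnode {
    char splitchar;
    struct tnode *lo_child, *eq_child, *hi_child;
} Tnode;

/* Plain insertion (no balancing); the '\0' terminator gets its own node. */
Tnode *tst_insert(Tnode *p, const char *s)
{
    assert(s != NULL);
    if (p == NULL) {
        p = malloc(sizeof *p);
        if (p == NULL)
            abort();                 /* out of memory: this sketch gives up */
        p->splitchar = *s;
        p->lo_child = p->eq_child = p->hi_child = NULL;
    }
    if (*s < p->splitchar)
        p->lo_child = tst_insert(p->lo_child, s);
    else if (*s > p->splitchar)
        p->hi_child = tst_insert(p->hi_child, s);
    else if (*s != '\0')
        p->eq_child = tst_insert(p->eq_child, s + 1);
    return p;
}

/* Count stored strings matching pattern s ('.' matches any one char). */
int tst_pm_count(const Tnode *p, const char *s)
{
    int hits = 0;
    if (p == NULL)
        return 0;
    /* A '.' forces a visit to both side branches at this position. */
    if (*s == '.' || *s < p->splitchar)
        hits += tst_pm_count(p->lo_child, s);
    /* Match on this char: continue with the next pattern char. */
    if ((*s == '.' || *s == p->splitchar)
            && p->splitchar != '\0' && *s != '\0')
        hits += tst_pm_count(p->eq_child, s + 1);
    /* Pattern and stored string end together: one full match. */
    if (*s == '\0' && p->splitchar == '\0')
        hits++;
    if (*s == '.' || *s > p->splitchar)
        hits += tst_pm_count(p->hi_child, s);
    return hits;
}
```

With the 12 two letter words from the earlier example inserted, a pattern such as ‘i.’ should count the three words in, is and it. A leading ‘.’ makes the search visit every first-level node, which fits the large nodes-searched figures in the table for patterns that begin with wild cards.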
The papers are available on-line as:

“Fast Algorithms for Sorting and Searching Strings”, Jon Bentley and Robert Sedgewick, presented at the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, January 1997.
http://www.cs.princeton.edu/~rs/strings/paper.pdf

and:

“Ternary Search Trees”, Jon Bentley and Bob Sedgewick, Dr. Dobb's Journal, April 1998.
http://www.ddj.com/articles/1998/9804/9804a/9804a.htm

and the source code samples are at:
http://www.cs.princeton.edu/~rs/strings/demo.c

My own project related to ternary tree performance (Java implementation) is available at:
http://www.ecst.csuchico.edu/~davep/cs356/project.htm