Copyright Oliver Serang, 2014

Linear time suffix tree construction

Idea of the data structure

Given a string S, the suffix tree of that string is a compressed trie containing all suffixes S_1 = S[1 . . . n], S_2 = S[2 . . . n], S_3 = S[3 . . . n], . . . , S_n = S[n]. The compressed property ensures that every internal node in the suffix tree has at least two children (that is, an edge (u, v) in the tree will be contracted into a single node uv unless there is some other node w such that there is an edge (u, w)); thus, unlike a standard radix trie, in which each edge represents a single additional character, each edge may now represent a string of characters. Because this data structure is a trie, each path down from the root can be thought of as successively appending labels to a result string. Tries have a uniqueness property that ensures that no two paths (or partial paths, which progress partially down an edge, appending a prefix of the edge's label) will result in the same string. That is, if a node u in the tree has two outward edges (u, v_1) and (u, v_2), then the labels label((u, v_1)) and label((u, v_2)) must begin with different characters. For example, no valid trie would have two edges originating from u sharing the same prefix, such as label((u, v_1)) = "abc123" and label((u, v_2)) = "abc456"; this would correctly be represented by "factoring out" the shared prefix "abc" and adding it on some path above, so that the edges to v_1 and v_2 are labeled "123" and "456", respectively. As a result, searching for a path that starts with some prefix returns a unique location (a node or a position along an edge) in the suffix tree.

Construction

A suffix tree can be constructed by inserting all suffixes in descending order of size; that is, by beginning with a single edge from the root to a leaf, with label S_1, and then inserting S_2, and then S_3, and so on until every suffix has been inserted.
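As a small illustration (a sketch, not part of the original notes), the full set of suffixes stored by the tree can be enumerated directly for a short example string:

```python
def suffixes(S):
    """All suffixes S_1, S_2, ..., S_n of S, longest first."""
    return [S[i:] for i in range(len(S))]

print(suffixes("abac"))  # ['abac', 'bac', 'ac', 'c']
```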
This is, of course, trivial when every suffix begins with a thus far unseen character, as we can observe with the string S = abcdefg. . .. In this case, the root of the tree has an edge to each suffix, because each new suffix inserted begins with a character that has never been the prefix of a label of an edge starting from the root, so no splits occur. The resulting suffix tree will have edges label((root, leaf_1)) = "abcdefg. . .", label((root, leaf_2)) = "bcdefg. . .", label((root, leaf_3)) = "cdefg. . .", and so on. However, when a character is encountered twice, there will be a split. For example, the string S = abac will insert the first two suffixes "abac" and "bac" trivially, because they both begin with new prefixes from the root. Inserting S_3 = "ac" will find that a partial path from the root, along the edge carrying S_1, begins with the same character "a". Thus, to ensure the trie property, we must search downward in the tree for S_i, the suffix we are inserting, and find the split point, where insertion will take place. This is equivalent to finding head_i, which is defined as the longest prefix shared with any already inserted suffix: head_i = longest_{j<i} sharedPrefix(S_i, S_j), where sharedPrefix(S_i, S_j) is the longest prefix shared between the two strings S_i and S_j; that is, S_i[1 : k] where k is the maximum integer such that S_i[1 : k] = S_j[1 : k]. Once we find the location (either a node or a position along an edge) loc(head_i) where the path from the root gives the string head_i, we can then split off a new edge containing the remainder of the string (denoted tail_i, where S_i = head_i tail_i); note that if the location loc(head_i) is on an edge, we need to insert an internal node splitting that edge so that we can fork off of it. Note that we must create a new edge labeled tail_i to a new leaf node, because we defined head_i to be the maximal matching prefix, and thus any remaining characters (i.e.
tail_i) must start with a character that doesn't match any path continuing from loc(head_i) in the current suffix tree. If the string S is terminated with a "sentinel" character $ that is not used in any other part of the string (e.g. S = s_1 s_2 s_3 . . . $), then there will always be at least one remaining character after head_i (i.e. tail_i ≠ ""), and thus there will always be a branch to a new leaf node for every suffix inserted. Also, note that because there is a bijection between any node in the tree and the prefix that corresponds to reaching that node, the length of the prefix reaching any particular location in the tree will always be constant; the prefix length can be stored in every node when it is created, and used to easily compute the prefix length corresponding to any location along an edge originating at that node (as well as to directly cache the prefix length for any location at a node). Thus, given S_i and the location of head_i in the tree (i.e. given loc(head_i)), we can easily compute its length |loc(head_i)|, and thus retrieve head_i = S_i[1 : |loc(head_i)|]. In this way, we see that each insertion consists of finding loc(head_i) and then subsequently inserting the new edge and leaf node.

Achieving linear space construction

First we show that we can achieve a linear number of nodes: as shown above, each insertion of a suffix S_i always adds a new leaf node, and potentially inserts a new internal node. Thus, it is trivial to see that we insert at most 2n nodes (because each of the n suffixes corresponds to at most 2 nodes inserted). Second, we can use the linear bound on the number of nodes to achieve linear space overall.
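The insertion procedure described above can be sketched naively in Python (an illustrative quadratic-time version with plain string edge labels, not yet the linear-time algorithm; insert, suffix_tree, and count_nodes are names invented for this sketch). Each insertion scans down from the root, splits an edge at loc(head_i) when the match ends mid-edge, and adds a new leaf for tail_i:

```python
def insert(node, s):
    """Slowscan-insert suffix s into a trie node (dict: edge label -> child).
    Splits an edge with a new internal node when the match ends mid-edge."""
    for label in list(node):
        k = 0  # length of the prefix shared by this edge label and s
        while k < min(len(label), len(s)) and label[k] == s[k]:
            k += 1
        if k == 0:
            continue                        # edge starts with a different character
        if k == len(label):
            insert(node[label], s[k:])      # head_i continues below this edge
        else:
            child = node.pop(label)         # split: new internal node mid-edge
            node[label[:k]] = {label[k:]: child, s[k:]: {}}
        return
    node[s] = {}                            # tail_i: new edge straight to a new leaf

def suffix_tree(S):
    root = {}
    for i in range(len(S)):                 # insert S_1, S_2, ..., S_n in order
        insert(root, S[i:])
    return root

def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.values())

tree = suffix_tree("abac$")
```

On "abac$" this yields root edges {"a", "bac$", "c$", "$"}, with the "a" node branching to "bac$" and "c$"; the total node count (7, counting the root) respects the 2n bound argued above.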
At first, achieving linear space may intuitively seem impossible: after all, the length of a particular suffix is |S_i| = n − i + 1, and thus the cumulative length of storing all suffix strings explicitly would be Σ_{i=1}^{n} |S_i| = Σ_{i=1}^{n} (n − i + 1) = n² + n − n(n+1)/2 = n(n+1)/2, which is in O(n²); however, each of these strings is a contiguous block in the original string S, and thus, given a single copy of the full string S, we can simply store an integer for the starting point (the ending point of a suffix is always n). Thus, for each suffix, we would simply store the starting point (which is a constant size). Likewise, we use the same strategy to store the labels for the edges in the suffix tree; here we simply need to store the starting and ending points of the label in the string S. For example, in S = "abcghcxyz" the contiguous block "ghc" would be stored as the integer pair (4, 6) to indicate that "ghc" = S[4]S[5]S[6]. Thus, each edge can be encoded using a constant size (two integers). We can also show that there are a linear number of edges (there are linearly many nodes, and each node has only one incoming edge, guaranteed by the fact that the graph is a tree). Thus, we can store the suffix tree in O(n) space.

Achieving linear time construction

"Slowscan": We call the standard method of searching for head_i and inserting a suffix, wherein we scan character-by-character and move down the tree until we find a mismatch, "slowscan". We can see that the runtime of slowscan is linear in the number of characters that must be scanned (although, note that it is sometimes possible to use shortcuts to start further along in the suffix tree, and thus scan a limited number of characters per suffix).

Finding hidden forms of reusable computation: Linear time can be tricky to understand at first, but the reward is worth the difficulty; like a brief glimpse into the heart of the machinery in a fast Fourier transform (FFT), this is a genuine moment of algorithmic magic.
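Before continuing, the constant-space edge encoding from the space argument above can be sketched directly (edge_label is a hypothetical helper; indices are 1-based and inclusive, matching the notes):

```python
S = "abcghcxyz"

def edge_label(S, start, end):
    """Decode an edge label stored as a (start, end) pair of 1-based,
    inclusive indices into the single shared copy of S."""
    return S[start - 1:end]

print(edge_label(S, 4, 6))  # ghc
```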
First, we notice a pattern. We can see this by examining the suffix tree for the string S = BANANA (see the figure on http://en.wikipedia.org/wiki/Suffix_tree); the subtree found at the node loc("A") (i.e. the subtree found after the edge labeled "A" originating from the root) is identical to the subtree below the edge labeled "NA" from the root. This shows us an opportunity to potentially save time by somehow caching a redundant computation. Unfortunately, this example is a special case (we are actually not guaranteed that such subtrees will be identical, but there will always be a similarity). We formalize our intuition about this pattern by introducing the suffix lemma.

Suffix lemma: (Note: we will denote characters with lowercase letters and substrings with uppercase letters; these substrings may be zero-length, i.e. the empty string "".) If |head_i| = k, then there exists a suffix j with j < i such that S[j . . . j + k − 1] = S[i . . . i + k − 1]. By stripping off the first character, we see that S[j + 1 . . . j + k − 1] = S[i + 1 . . . i + k − 1], and thus S_{i+1} matches at least the first k − 1 characters of S_{j+1}; as a result, |head_{i+1}| ≥ k − 1, because there is some suffix j + 1 matching at least the first k − 1 characters. To put it another way, if head_i = xA, then the suffix S_i = xA . . . matches the prefix of some previous suffix S_j = xA . . . (where j < i). Then, when searching for head_{i+1}, we notice that S_{i+1} = A . . . (i.e., the subsequent suffix, which can be found by removing the first character x from the previous suffix S_i) must also match the previous suffix S_{j+1} = A . . .. If |head_i| = k, we are guaranteed that the first k − 1 characters of S_{i+1} match S_{j+1}, and thus |head_{i+1}| ≥ k − 1 (we are not guaranteed equality, because it is possible that another suffix S_{j'} matches even better, for some j' < i).
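The suffix lemma can be checked empirically by computing |head_i| by brute force (head_len and the test string are illustrative, not from the notes):

```python
def head_len(S, i):
    """|head_i| for the 1-based suffix index i: the longest prefix of
    S[i-1:] shared with any earlier suffix S[j-1:], j < i (brute force)."""
    best = 0
    for j in range(1, i):
        k = 0
        while i - 1 + k < len(S) and S[i - 1 + k] == S[j - 1 + k]:
            k += 1
        best = max(best, k)
    return best

S = "mississippi$"
for i in range(2, len(S) + 1):
    # suffix lemma: |head_i| >= |head_{i-1}| - 1
    assert head_len(S, i) >= head_len(S, i - 1) - 1
```

For instance, head_5 for "mississippi$" is "issi" (length 4), matching the earlier suffix S_2 = "ississippi$".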
"Fastscan": We can exploit the suffix lemma to show that if |head_i| = k, then when we insert the subsequent suffix S_{i+1}, we are already guaranteed that the first k − 1 characters of S_{i+1} are already found in the current state of the suffix tree (i.e. that prefix is already found in the suffix tree after inserting the first i suffixes S_1, S_2, . . . , S_i). Since we know that these first k − 1 characters are already in the tree, we simply look at the first character of each edge label in order to decide which edge to take. If the first character of the edge matches the next character in the (k − 1)-length substring, and the edge contains m characters, we simply follow the edge and arrive at the next node, now needing to search for the remaining k − 1 − m characters, having jumped m characters forward in our target substring. Note that if m is ever longer than the number of characters remaining, the target location is in the middle of the edge. Thus, the runtime of this search method, which we call "fastscan", is linear, but rather than being linear in the number of characters (like the slowscan search algorithm), it is linear in the number of edges processed. Again, as in the slowscan algorithm, we note that this search could be performed more efficiently if we are able to start the search further down in the tree.

Using suffix links: After inserting suffix S_i, we have already computed (and added a node for) loc(head_i), and in the previous iteration, we did the same for loc(head_{i−1}); thus, if we store those previous nodes loc(head_j), for all j < i (we will add the link for i during iteration i, after finding loc(head_i)), we can easily add a link link(loc(head_{i−1})) = loc(head_i).
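The fastscan descent described above can be sketched on a hand-built suffix tree of "banana$" (dict nodes mapping edge label to child; the tree layout and the return convention are assumptions of this sketch):

```python
# suffix tree of "banana$", built by hand: each node maps edge label -> child
root = {
    "banana$": {},
    "a": {"na": {"na$": {}, "$": {}}, "$": {}},
    "na": {"na$": {}, "$": {}},
    "$": {},
}

def fastscan(node, target):
    """Locate a string known to be in the tree, inspecting only the first
    character of each edge label.  Returns (node, edge, offset): the location
    is `offset` characters along `edge` out of `node` (edge is None if the
    location is exactly at `node`)."""
    while target:
        for label, child in node.items():
            if label[0] == target[0]:          # one comparison chooses the edge
                if len(label) >= len(target):  # target ends on (or at the end of) this edge
                    return node, label, len(target)
                target = target[len(label):]   # jump |label| characters at once
                node = child
                break
    return node, None, 0

node, edge, offset = fastscan(root, "ana")
```

Here loc("ana") lies at the end of the "na" edge below the "a" node, so the offset equals the edge length; only one character per edge was inspected along the way.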
Unfortunately, we would like to use that link in iteration i in order to find loc(head_i), but we create the link in iteration i only after finding loc(head_i); so thus far, the cache does not help. However, the parent of loc(head_{i−1}) has a link that was created when it was inserted (it was inserted in a previous iteration), and thus we can follow the link link(parent(loc(head_{i−1}))), which will not go directly to loc(head_i), but will reach an ancestor of loc(head_i), partially completing the search for loc(head_i) (because we no longer need to start from the root when searching for loc(head_i); we now start the search further down the tree). Then we will use fastscan to efficiently find the (k − 1)-length prefix that we know must already be in the tree (where k = |head_{i−1}|). And then, after completing fastscan, we perform slowscan to find any additional matching characters (because, once again, the k − 1 matching characters guaranteed by the suffix lemma are a lower bound; there may actually be a longer matching prefix).

Suffix links help us by limiting the cost of fastscan operations. Because the runtime of fastscan is linear in the number of edges taken, the runtime is given by the depth increase in the tree; in particular, the depth increase is measured relative to the starting point (after following the suffix link). Once again, the starting point for fastscan is the target of the suffix link link(parent(loc(head_{i−1}))). We will first determine that the depth decrease from following the suffix link of a parent node is at most 2 nodes (i.e. we will show that taking the parent and then following its suffix link moves at most 2 nodes up the tree). Then, because the depth decreases only a little each time, and the overall maximum depth of the tree is n, we can make an amortized proof that the total depth increase over all fastscan calls (which is equivalent to the total runtime of fastscan calls) must be in O(n). Taking the parent node trivially decreases the depth by 1.
Furthermore, following a suffix link also decreases the depth by at most 1 (in fact, it may potentially increase the depth, but it will never decrease the depth by more than 1). This is because an internal node exists only when there are two suffixes that share a prefix but have different characters after that internal node: given a string S = . . . xyAv . . ., where there is a suffix link from S_j = xyAv . . . to S_i = yAv . . . (where j < i), an internal node at loc(xyA) will only exist if there exists another suffix S_{j'} = xyAw . . ., so that after the shared prefix xyA, one suffix has an edge starting with "v" and the other has an edge starting with "w". It is trivial to observe that the same will be true for the shorter strings: S_{i'} = yAw . . . will also create a split with S_i = yAv . . ., because both share the common prefix yA, but (like their longer counterparts S_j and S_{j'}) there is a split from this internal node where one takes an edge starting with "v" and the other takes an edge starting with "w". Thus, when suffixes are linked, every split (and hence every internal node) added along the longer suffix will result in an internal node added along the shorter suffix. There is one exception: the first character "x" found in S_j and S_{j'} may also add an additional internal node because of a potential split with some other suffix S_{j''} = xz . . .. Thus, with the exception of splits utilizing the first character of S_j (or, equivalently, S_{j'}), every internal node inserted above loc(head_j) will be inserted above loc(head_i), and thus the maximum decrease in depth from following a suffix link is 1 node (because there is at most 1 ancestor of loc(head_j) that does not correspond to an ancestor of loc(head_i)).
Furthermore, the node link(parent(loc(head_{i−1}))) is guaranteed to be an ancestor of loc(head_i) (or, equivalently, of link(loc(head_{i−1}))); this is true because following a link is equivalent to removing the first character on the left, while following a parent edge is equivalent to removing some nonzero number of characters from the right. Thus, link(loc(xA . . . B)) = loc(A . . . B) and link(parent(loc(xA . . . B))) = loc(A), and because A is a prefix of A . . . B, loc(A) is an ancestor of loc(A . . . B).

Runtime of fastscan: During iteration i, the depth after following a suffix link and performing fastscan is equal to the starting node depth (after following the suffix link) plus the runtime of fastscan (i.e. the depth increase compared to the starting node depth): this new depth will be depthAfterFastscan_i ≥ depth(loc(head_{i−1})) − 2 + fastscanCost_i. We also know that depth(loc(head_i)) ≥ depthAfterFastscan_i, because the slowscan operations made after finding the result of fastscan can only increase the depth. Thus, depth(loc(head_i)) ≥ depth(loc(head_{i−1})) − 2 + fastscanCost_i, which can be rewritten to say that fastscanCost_i ≤ depth(loc(head_i)) − depth(loc(head_{i−1})) + 2. The sum of the fastscan costs can thus be bounded by a telescoping sum: Σ_i fastscanCost_i ≤ (depth(loc(head_n)) − depth(loc(head_{n−1})) + 2) + (depth(loc(head_{n−1})) − depth(loc(head_{n−2})) + 2) + . . . + (depth(loc(head_2)) − depth(loc(head_1)) + 2), which collapses to depth(loc(head_n)) − depth(loc(head_1)) + 2n. Furthermore, because the depth of a suffix tree is at most n, depth(loc(head_n)) − depth(loc(head_1)) ≤ n, and thus Σ_i fastscanCost_i ≤ 3n. Intuitively, this states that, given that the depth of the tree can never exceed n, and following suffix links can never decrease the depth by more than 2 per iteration, all fastscan operations together can never increase the depth by more than n plus the maximum total depth decrease 2n.
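In display form, the telescoping bound above reads:

```latex
\mathrm{fastscanCost}_i \le \operatorname{depth}(\mathrm{loc}(head_i)) - \operatorname{depth}(\mathrm{loc}(head_{i-1})) + 2
\;\Longrightarrow\;
\sum_{i=2}^{n} \mathrm{fastscanCost}_i
\le \underbrace{\operatorname{depth}(\mathrm{loc}(head_n)) - \operatorname{depth}(\mathrm{loc}(head_1))}_{\le\, n} + 2n
\le 3n .
```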
Runtime of slowscan: We make a simpler (but similar) argument for the runtime of slowscan: if |head_{i−1}| = k, then following suffix links and performing fastscan will descend k − 1 characters into the tree (there is a (k − 1)-length prefix guaranteed to be in the tree already, thanks to the suffix lemma). Thus, to find loc(head_i), we must process the additional characters in head_i that follow the first k − 1 characters of S_i (which are, once again, guaranteed to be in the tree, and are thus processed with fastscan rather than slowscan). Thus, the slowscan cost is given by the number of remaining characters not known from the suffix lemma: slowscanCost_i = |head_i| − (|head_{i−1}| − 1). And hence, the total cost of slowscan is found by a telescoping sum: Σ_i slowscanCost_i = (|head_n| − |head_{n−1}| + 1) + (|head_{n−1}| − |head_{n−2}| + 1) + . . . + (|head_2| − |head_1| + 1) = |head_n| − |head_1| + n − 1, which is at most 2n and hence in O(n). Thus the total runtime necessary to find loc(head_i) over all iterations is O(n) (i.e. the amortized runtime to find it in each iteration is constant), and thus an edge to a new leaf node can be added trivially to result in an O(n) overall construction time.

Pseudocode:

 1: procedure ConstructSuffixTree(S)
 2:     n ← |S|
 3:     locOfHead ← empty cache
 4:     for i = 1 to n do
            ▷ Find loc(head_i)
 5:         if i − 1 is in the locOfHead cache then
 6:             Start the search for loc(head_i) at parent(locOfHead_{i−1}).link
 7:             Search downward using fastscan (i.e. look only at the first character of each edge label), until the first |head_{i−1}| − 1 characters of S_i have been matched
 8:             Continue searching for loc(head_i) character-by-character (i.e. with slowscan)
 9:         else
10:             Start the search for loc(head_i) at root
11:             Search for loc(head_i) character-by-character (i.e. with slowscan)
12:         end if
            ▷ loc(head_i) has been found. Now insert tail_i, update the cache, and add a suffix link.
13:         Create a new node for loc(head_i) (if none exists yet) by splitting an edge, and add an edge labeled tail_i from it to a new leaf
14:         if loc(head_i) ≠ root then ▷ Don't bother caching root as a starting point
15:             Cache: locOfHead_i ← loc(head_i)
16:             Add a suffix link from loc(head_{i−1}) to loc(head_i): locOfHead_{i−1}.link ← locOfHead_i
17:         end if
18:     end for
19: end procedure

As a linear-time and linear-space solution to the longest common substring problem:

Idea of approach

The idea of the approach is simple: concatenate the two strings (separated by a special character), build the suffix tree of the concatenated string, and then scan its internal nodes for the best solution. Here we will prove the correctness of the approach, but as with the tricky linear-time proofs for the suffix tree, the recipe alone is sufficient to achieve the result (although it is still nice to appreciate the underlying idea).

Proof that the solution occurs on a node

For two strings S^(A) and S^(B), we concatenate them to form a contiguous string S = S^(A) $_1 S^(B) $_2, where $_1 is some unused character and $_2 is another, different unused character. Then we build the suffix tree for S = S^(A) $_1 S^(B) $_2. The longest identical substring in both S^(A) and S^(B) can be defined as the longest prefix shared by two suffixes, one beginning in S^(A) and one beginning in S^(B); we first prove that the solution occurs on a node in the suffix tree of S, and then use simple dynamic programming to find the highest-quality node.

First we demonstrate that the solution is a node in the suffix tree: any substring A found in both S^(A) and S^(B) must begin at least two suffixes of the string S = S^(A) $_1 S^(B) $_2: one beginning after the $_1 (S_i = A . . . $_2) and one beginning before the $_1 (S_j = A . . . $_1 . . . A . . . $_2, where j < i). Thus, the suffixes S_i and S_j share a common prefix A. Furthermore, if A is the maximal matching string with |A| = k, then the character S_i[k + 1] ≠ S_j[k + 1] (or else adding one more character would produce a better substring, contradicting the notion that A is the maximal matching substring).
When this differentiating character is encountered during insertion, it will result in a split in the tree, and the internal node inserted for that split will represent the substring A. We can easily determine whether a node corresponds to a substring found in both strings, because such a node will have at least one descendent leaf containing $_1 and at least one descendent leaf not containing $_1 (i.e. it is found in both strings).

Proof that no internal node represents an invalid string

Because we have concatenated the two strings into S = S^(A) $_1 S^(B) $_2, we must be careful that no string crossing the boundary (i.e. A = A_1 . . . $_1 . . . A_2) will be recognized as a potential solution. This is trivial, because the node corresponding to such a string cannot have a descendent leaf that does not contain $_1: any path containing $_1 must end at a leaf (because $_1 never recurs, so no split can follow it), and every such descendent leaf therefore contains $_1.

Finding the longest common substring

Second, we use a simple dynamic programming method to easily compute which nodes in the tree have descendents both containing $_1 and not containing $_1 (i.e. which nodes correspond to substrings found in both S^(A) and S^(B)): it is trivial to use the inward edge of any leaf to decide if the leaf contains $_1 or not (this can be done in O(1) per leaf node, because the edge stores the lower and upper index in the original string, and it is easy to check if the index of $_1 is within those limits). With the leaves as the base case (i.e. we trivially know the outcome of hasDescendentWith$_1(u) for any leaf node u), it is easy to mark whether a node has a descendent containing the character $_1 through simple recursion:

hasDescendentWith$_1(u) = OR over v ∈ children(u) of hasDescendentWith$_1(v),

for any internal node u in the tree. By memoizing hasDescendentWith$_1(u) for every processed node, and calling hasDescendentWith$_1(root), we compute hasDescendentWith$_1(u) for every node.
We can do the same thing to determine whether a node has a descendent without $_1:

hasDescendentWithout$_1(u) = OR over v ∈ children(u) of hasDescendentWithout$_1(v),

where the base case for leaves defines hasDescendentWithout$_1(u) as the already-known logical opposite of hasDescendentWith$_1(u). Thus the recursion can be memoized in an identical manner. A node u corresponds to a substring found in both S^(A) and S^(B) if and only if hasDescendentWithout$_1(u) ∧ hasDescendentWith$_1(u).

Regarding finding the longest string found in both: we keep track of the number of characters along the path from the root to node u while we construct the tree (as noted above, the length of the prefix reaching node u is constant, and so can be stored in the node when it is created): given that the character count of the path corresponding to node u is k, the character count at v, a child of u, is k + |label((u, v))|, the length of the prefix at node u plus the number of characters along the edge from u to v. Thus, since the solution is guaranteed to be found at a node, we can scan the nodes (the number of nodes is O(n)) to find the node with the highest character count for which hasDescendentWithout$_1(u) ∧ hasDescendentWith$_1(u) is true.
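Putting the pieces together, the whole recipe fits in a short sketch (using a naive quadratic-time tree construction rather than the linear-time algorithm, with '#' and '$' standing in for $_1 and $_2; the function names are invented for illustration). The DFS returns, for each node, the pair (hasDescendentWith$_1, hasDescendentWithout$_1), and tracks the deepest node for which both hold:

```python
def insert(node, s):
    """Naive slowscan insertion of suffix s into a dict-based suffix tree."""
    for label in list(node):
        k = 0
        while k < min(len(label), len(s)) and label[k] == s[k]:
            k += 1
        if k == 0:
            continue
        if k == len(label):
            insert(node[label], s[k:])
        else:
            child = node.pop(label)
            node[label[:k]] = {label[k:]: child, s[k:]: {}}
        return
    node[s] = {}

def longest_common_substring(A, B):
    S = A + "#" + B + "$"          # '#' and '$' play the roles of $_1 and $_2
    root = {}
    for i in range(len(S)):
        insert(root, S[i:])
    best = [""]

    def dfs(node, prefix):
        """Return (has leaf containing '#', has leaf not containing '#')."""
        if not node:                           # a leaf: one whole suffix of S
            return "#" in prefix, "#" not in prefix
        with_m = without_m = False
        for label, child in node.items():
            w, wo = dfs(child, prefix + label)
            with_m, without_m = with_m or w, without_m or wo
        if with_m and without_m and len(prefix) > len(best[0]):
            best[0] = prefix                   # prefix occurs in both A and B
        return with_m, without_m

    dfs(root, "")
    return best[0]

print(longest_common_substring("xabcy", "zabcw"))  # abc
```

Note that, per the proof above, nodes whose path crosses the '#' boundary never qualify: all of their descendent leaves contain '#', so their without-flag stays false.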