Copyright Oliver Serang, 2014
Linear time suffix tree construction
Idea of the data structure
Given a string S, the suffix tree of that string is a compressed trie containing
all suffixes S1 = S[1 . . . n], S2 = S[2 . . . n], S3 = S[3 . . . n], . . . Sn = S[n]. The
compressed property ensures that every internal node in the suffix tree has
at least two children (that is, an edge in the tree (u, v) will be contracted
into a single node uv unless there is some other node w s.t. there is an edge
(u, w)); thus, unlike a standard radix trie, which progresses with each edge
representing an additional character, each edge may now represent a string
of characters.
Because this data structure is a trie, each path down from the root
can be thought of as successively appending edge labels to build a result string. Tries
have a uniqueness property that ensures that no two paths (or partial paths,
which progress partially down an edge, appending a prefix of the edge’s label)
will result in the same string. That is, if a node u in the tree has two outward
edges (u, v1) and (u, v2), then the labels label((u, v1)) and label((u, v2)) must
begin with different characters. For example, no valid trie would have two edges
originating from u sharing a common prefix, such as label((u, v1)) = "abc123",
label((u, v2)) = "abc456"; this would correctly be represented by "factoring
out" the shared prefix "abc" and adding it on some path above, so that the edges
to v1 and v2 are labeled "123" and "456", respectively. As a result, searching for a
path that starts with some prefix should return a unique location (a node or
a position along an edge) in the suffix tree.
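As a small illustration of the trie property above, consider the following sketch (the Node class, its fields, and method names are this sketch's own, not from the text): keying each node's outgoing edges by the first character of the edge label makes it impossible for two outgoing edges to share a prefix.

```python
class Node:
    def __init__(self):
        # first character of edge label -> (full edge label, child node)
        self.children = {}

    def add_edge(self, label, child):
        first = label[0]
        # the trie property: at most one outgoing edge per first character
        assert first not in self.children, \
            "edges leaving one node must begin with different characters"
        self.children[first] = (label, child)

root = Node()
root.add_edge("abc", Node())
root.add_edge("b", Node())
# root.add_edge("a123", Node())  # would violate the trie property
```

In a real suffix tree the labels would be stored as index pairs into S rather than as strings (see the linear-space discussion below); explicit strings are used here only for readability.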
Construction
A suffix tree can be constructed by inserting all suffixes in descending order
of size; that is, by beginning with a single edge from the root to a leaf, with
label S1 , and then inserting S2 , and then S3 , and so on until every suffix has
been inserted.
This is, of course, trivial when every suffix begins with a thus far unseen
character, as we can observe with the string S = abcdefg . . .. In this
case, the root of the tree has an edge to each suffix: each new suffix
inserted begins with a character that has never before been the first character of a
label of an edge leaving the root, so each insertion simply adds a new edge
from the root. The resulting suffix tree will have edges label(root, S1) = "abcdefg . . .",
label(root, S2) = "bcdefg . . .", label(root, S3) = "cdefg . . .", and so on.
However, when a character is encountered twice, there will be a split. For
example, the string S = abac will insert the first two suffixes "abac" and "bac"
trivially, because they both begin with previously unseen characters. Inserting
S3 = "ac" will find that a partial path from the root, along the edge (root, S1),
begins with the same character "a". Thus, to ensure the trie property, we
must search downward in the tree for Si, the suffix we are inserting, and find
the split point, where insertion will take place. This is equivalent to finding
headi, which is defined as the longest prefix of Si shared with any already inserted
suffix:
headi = longest_{j<i} sharedPrefix(Si, Sj),
where sharedPrefix(Si, Sj) is the longest prefix shared between the two
strings Si and Sj; that is, Si[1 : k], where k is the largest integer such that
Si[1 : k] = Sj[1 : k].
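The definition of headi can be computed directly (and slowly) without any tree at all; the following quadratic sketch is useful as a reference implementation (the function names shared_prefix and head are this sketch's own, and Python's 0-based indexing is used in place of the text's 1-based convention):

```python
def shared_prefix(a, b):
    """Longest common prefix of strings a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return a[:k]

def head(S, i):
    """head_i: longest prefix of suffix S[i:] shared with any earlier suffix S[j:], j < i."""
    Si = S[i:]
    best = ""
    for j in range(i):
        p = shared_prefix(Si, S[j:])
        if len(p) > len(best):
            best = p
    return best

print(head("abac", 2))  # suffix "ac" shares prefix "a" with the earlier suffix "abac"
```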
Once we find the location (either a node or an edge) loc(headi) where the
path from the root gives the string headi, we can then split off a new edge
containing the remainder of the string (denoted taili, where Si = headi taili);
note that if the location loc(headi) is on an edge, we need to insert an internal
node splitting that edge so that we can fork off of it. Note that we must create
a new edge labeled taili to a new leaf node, because we defined headi to be
the maximal matching prefix, and thus any remaining characters (i.e. taili)
must start with a character that doesn't match any path in the current suffix
tree. If the string S is terminated with a "sentinel" character $ that is not
used in any other part of the string (e.g. S = s1 s2 s3 . . . $), then there will
always be at least one remaining character after headi (i.e. taili ≠ ""), and
thus there will always be a branch to a new leaf node for every suffix inserted.
Also, note that because there is a bijection between any node in the tree
and the prefix that corresponds to reaching that node, the length of the
prefix reaching any particular location in the tree will always be constant;
the prefix length can be stored in every node when it is created, and used
to easily compute the prefix length corresponding to any location along an
edge originating at that node (as well as directly cache the prefix length for
any location at a node). Thus, given Si and the location of headi in the tree (i.e.
given loc(headi)), we can easily compute its length |loc(headi)|, and thus
retrieve headi = Si [1 : |loc(headi )|]. In this way, we see that each insertion
consists of finding |loc(headi )| and then subsequently inserting the new edge
and leaf node.
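The insertion procedure described so far can be sketched as a naive O(n²) construction (a hedged illustration: the dict-based node representation and names are this sketch's own, it stores explicit label strings rather than the index pairs introduced below, and it omits the suffix links needed for linear time; slowscanning each suffix costs O(n), so the total is quadratic):

```python
def build_suffix_tree(S):
    S = S + "$"                    # sentinel: every tail_i is nonempty
    root = {}                      # node = dict: first char of edge label -> (label, child)
    for i in range(len(S)):
        node, rest = root, S[i:]   # insert suffix S_i by slowscan from the root
        while True:
            edge = node.get(rest[0])
            if edge is None:                 # no edge starts with this character:
                node[rest[0]] = (rest, {})   # hang a new leaf edge labeled tail_i
                break
            label, child = edge
            k = 0                  # slowscan: compare character by character
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):    # consumed the whole edge label: descend
                node, rest = child, rest[k:]
            else:                  # mismatch mid-edge: split at loc(head_i)
                mid = {label[k]: (label[k:], child)}
                node[rest[0]] = (label[:k], mid)
                node, rest = mid, rest[k:]
    return root
```

For S = "abac", this produces a root with edges starting "a", "b", "c", and "$", where the "a" edge was split when inserting S3 = "ac", exactly as in the worked example above.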
Achieving linear space construction
First we show that we can achieve a linear number of nodes: As shown above,
each insertion of a suffix Si corresponds to always adding a new leaf node,
and potentially inserts a new internal node. Thus, it is trivial to see that
we insert at most 2n nodes (because each of the n suffixes corresponds to at
most 2 nodes inserted).
Second, we can use the linear bound on the number of nodes to achieve
linear space overall. At first, achieving linear space may intuitively seem
impossible: after all, the length of a particular suffix is |Si| = n − i + 1, and
thus the cumulative length of storing all suffix strings explicitly would be

Σ_{i=1}^{n} |Si| = Σ_{i=1}^{n} (n − i + 1) = n(n + 1)/2,

which is in O(n²); however, each of these strings is a contiguous block in the
original string S, and thus, given a single copy of the full string S, we can
simply store a single integer index for each suffix (its starting point suffices,
because every suffix ends at position n). Thus, each suffix requires only
constant space.
Likewise, we use the same strategy to store the labels for the edges in the
suffix tree; here we simply need to store the starting and ending points of
the label in the string S. For example, in S = "abcghixyz", the contiguous
block "ghi" would be stored as the integer pair (4, 6) to indicate that "ghi" =
S[4]S[5]S[6] = S[4 : 6]. Thus, each edge can be encoded using a constant
size (two integers). We can also show that there are a linear number of
edges (there are linearly many nodes, and each node other than the root has exactly
one incoming edge, guaranteed by the fact that the graph is a tree). Thus, we
can store the suffix tree in O(n) space.
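The constant-space edge encoding can be shown in a few lines (this sketch uses Python's 0-based, inclusive pair (3, 5) in place of the text's 1-based (4, 6); the variable names are illustrative):

```python
S = "abcghixyz"

# an edge label is stored as a (start, end) index pair into the one shared
# copy of S, here 0-based and inclusive -- two integers, constant space
edge = (3, 5)                    # encodes the contiguous block "ghi"

start, end = edge
label = S[start:end + 1]         # decode the label only when it is needed
print(label)                     # -> "ghi"
```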
Achieving linear time construction
“Slowscan”: We call the standard method of searching for headi and
inserting a suffix, wherein we scan character-by-character and move down the
tree until we find a mismatch, “slowscan”. We can see that the runtime
of slowscan is linear in the number of characters that must be scanned
(although, note that it is sometimes possible to use shortcuts to start further
along in the suffix tree, and thus scan a limited number of characters per
suffix).
Finding hidden forms of reusable computation: Linear time can be
tricky to understand at first, but the reward is worth the difficulty; like a
brief glimpse into the heart of the machinery in a fast Fourier transform
(FFT), this is a genuine moment of algorithmic magic.
First, we notice a pattern. We can see this by examining
the suffix tree for the string S = BANANA (see the figure at
http://en.wikipedia.org/wiki/Suffix_tree): the subtree found at the node
loc(“A”) (i.e. the subtree found after the edge labeled “A” originating from
the root) is identical to the subtree below the edge labeled “NA” from
the root. This shows us an opportunity to potentially save time by somehow
caching a redundant computation. Unfortunately, this example is a special
case: we are not actually guaranteed that such subtrees will be identical,
but there will always be a similarity. We formalize our intuition about
this pattern by introducing the suffix lemma:
Suffix lemma: Note: we will denote single characters with lowercase
letters and substrings with uppercase letters; these substrings may
be zero-length (the empty string “”).
If |headi | = k, then there exists suffix j with j < i such that S[j . . . j +
k − 1] = S[i . . . i + k − 1]. By stripping off the first character, we see that
S[j + 1 . . . j + k − 1] = S[i + 1 . . . i + k − 1], and thus Si+1 matches at least
the first k − 1 characters of Sj+1 ; as a result, |headi+1 | ≥ k − 1, because there
is some suffix j + 1 matching at least the first k − 1 characters.
To put it another way, if headi = xA, then suffix Si = xA . . . matches the
prefix of some previous suffix Sj = xA . . . xA . . . (where j < i). Then, when
searching for headi+1 , we notice that Si+1 = A . . . (i.e., the subsequent suffix,
which can be found by removing the first character x from the previous suffix
Si ), must also match the previous suffix Sj+1 = A . . . xA . . .. If |headi | = k,
we are guaranteed that the first k − 1 characters of Si+1 match Sj+1 , and
thus |headi+1 | ≥ k − 1 (we are not guaranteed strict equality, because it is
possible that another suffix Sj 0 matches even better for some j 0 < i).
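The suffix lemma, |headi+1| ≥ |headi| − 1, can be checked empirically with a direct quadratic computation of the head lengths (a sanity-check sketch; the function name head_len and the test strings are this sketch's own, with 0-based indexing):

```python
def head_len(S, i):
    """|head_i|: length of the longest prefix of S[i:] shared with any S[j:], j < i."""
    best = 0
    for j in range(i):
        k = 0
        while i + k < len(S) and j + k < len(S) and S[i + k] == S[j + k]:
            k += 1
        best = max(best, k)
    return best

S = "mississippi$"
lens = [head_len(S, i) for i in range(len(S))]

# the suffix lemma: each head length drops by at most 1 per iteration
assert all(lens[i + 1] >= lens[i] - 1 for i in range(len(lens) - 1))
print(lens)
```

Note that the lemma only bounds the drop; the head length may also grow by many characters in a single step, which is why only the lower bound telescopes in the runtime analysis below.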
“Fastscan”: We can exploit the suffix lemma to show that if |headi| = k,
when we insert the subsequent suffix Si+1, we are already guaranteed that
the first k − 1 characters of Si+1 are already found in the current state of the
suffix tree (i.e. that prefix is already in the suffix tree after inserting
the first i suffixes S1, S2, . . . , Si).
Since we know that these first k − 1 characters are already in the tree, we
simply look at the first character each time we take an edge, in order to find
which edge to take. Then, if the first character of the edge matches the next
character in the k −1 length substring, and if the edge contains m characters,
we simply follow the edge and then arrive at the next node needing to search
for the remaining k − 1 − m characters, and having jumped m characters
forward in our target substring. Note that if m is ever longer than the
number of characters remaining, the target location is in the middle of the
edge. Thus, the runtime of this insertion method, which we call “fastscan”, is linear,
but rather than being linear in the number of characters (like the slowscan
search algorithm), it is linear in the number of edges processed. Again, as in
the slowscan algorithm, we note that this search could be performed more
efficiently if we are able to start the search further down in the tree.
Using suffix links: After inserting suffix Si, we have already computed
(and added a node for) loc(headi), and in the previous iteration, we
did the same for loc(headi−1); thus, if we store those previous nodes
loc(headj) in an array, for all j < i (we will add the link for i during iteration
i, after finding loc(headi)), we can easily add a link link(loc(headi−1)) =
loc(headi).
Unfortunately, we would like to use that link in iteration i in order to find
loc(headi), but we create the link in iteration i only after finding loc(headi); so
thus far, the cache does not help on its own. However, the parent of loc(headi−1)
has a link that was created when it was inserted (it was inserted in a previous
iteration), and thus we can follow the link link(parent(loc(headi−1))), which
will not go directly to loc(headi), but will reach an ancestor of loc(headi),
partially completing the search for loc(headi): we no longer need to
start from the root when searching for loc(headi); we now start the search further
down the tree.
Then we will use fastscan to efficiently find the k − 1 length prefix that
we know must already be in the tree (where k = |headi−1 |). And then, after
completing fastscan, we perform slowscan to find any additional matching
characters (because once again, the k − 1 matching characters guaranteed by
the suffix lemma is a lower bound; there may actually be a longer matching
prefix).
Suffix links help us by limiting the cost of fastscan operations. Because
the runtime of fastscan is linear in the number of edges taken, that runtime
is given by the depth increase in the tree; in particular, the depth increase
is measured relative to the starting point (after following the suffix link).
Once again, the starting point for fastscan is the target of the suffix link
link(parent(loc(headi−1))).
We will first show that the depth decrease from taking the parent and then
following its suffix link is at most 2 nodes (i.e. following the suffix
link of a parent node moves at most 2 nodes up the tree). Then,
because the depth decreases only a little each time, and the overall maximum
depth of the tree is n, we can make an amortized argument that the total
depth increase over all fastscan calls (which is equivalent to the total runtime
of all fastscan calls) must be in O(n).
Taking the parent node trivially decreases the depth by 1. Furthermore,
taking a suffix link also decreases the depth by at most 1 (in fact, it may
potentially increase the depth, but will never decrease the depth by more
than 1). This is because an internal node will exist only because there are
two suffixes that have the same prefix, but have different characters after
that internal node: Given a string S = . . . xyAv . . ., where there is a suffix
link from Sj = xyAv . . . to Si = yAv . . . (where j < i), an internal node
will only exist if there exists another suffix Sj′ = xyAw . . ., so that after
the shared prefix xyA, one suffix has an edge starting with “v” and the other
has an edge starting with “w”. It is trivial to observe that the same will
be true for the shorter strings: Si′ = yAw . . . will also create a split with
Si = yAv . . ., because both share the common prefix yA, but (like their
longer counterparts Sj and Sj′) there is a split from this internal node where
one takes an edge starting with “v” and the other takes an edge starting with “w”.
Thus, when suffixes are linked, every split (and hence every internal node
added) to the longer suffix will result in an internal node added to a shorter
suffix. There is one exception: the first character “x” found in Sj and Sj′
may also add an additional internal node because of a potential split with
some other suffix Sj′′ = xz . . .. Thus, with the exception of splits utilizing
the first character of Sj (or, equivalently, Sj′), every internal node inserted
above loc(headj) will correspond to an internal node above loc(headi), and
thus the maximum decrease in depth from following a suffix link is 1 node
(it is possible that at most 1 ancestor of loc(headj) does not correspond to
an ancestor of loc(headi)).
Furthermore, the node link(parent(loc(headi−1))) is guaranteed to be an
ancestor of loc(headi) (or, equivalently, of link(loc(headi−1))); this is true
because following a link is equivalent to removing the first character on the left,
while following a parent edge is equivalent to removing some nonzero number
of characters from the right. Thus, link(loc(xA . . . B)) = loc(A . . . B) and
link(parent(loc(xA . . . B))) = loc(A), and because A is a prefix of A . . . B,
loc(A) is an ancestor of loc(A . . . B).
Runtime of fastscan: During iteration i, the depth after following a
suffix link and performing fastscan is equal to the starting node depth (after
following the suffix link) plus the runtime of fastscan (i.e. the depth
increase compared to the starting node depth): this new depth will be
depthAfterFastscani ≥ depth(loc(headi−1)) − 2 + fastscanCosti. We also
know that depth(loc(headi)) ≥ depthAfterFastscani, because the slowscan
operations made after finding the result of fastscan will only increase the
depth. Thus, depth(loc(headi)) ≥ depth(loc(headi−1)) − 2 + fastscanCosti,
which can be rewritten to say that fastscanCosti ≤ depth(loc(headi)) −
depth(loc(headi−1)) + 2. The sum of the fastscan costs can thus be
bounded by a telescoping sum:

Σ_{i=2}^{n} fastscanCosti ≤ depth(loc(headn)) − depth(loc(headn−1)) + 2
    + depth(loc(headn−1)) − depth(loc(headn−2)) + 2
    + . . . + depth(loc(head2)) − depth(loc(head1)) + 2,

which collapses to depth(loc(headn)) − depth(loc(head1)) + 2n. Furthermore,
because the depth of a suffix tree is at most n, we have depth(loc(headn)) −
depth(loc(head1)) ≤ n, and thus Σ_{i} fastscanCosti ≤ 3n.
Intuitively, this states that, given that the depth of the tree can never
exceed n, and following suffix links can never decrease the depth by more than
2 per iteration, then all fastscan operations can never increase the depth by
more than n plus the maximum total depth decrease 2n.
Runtime of slowscan: We make a simpler (but similar) argument
for the runtime of slowscan: if |headi−1| = k, then following suffix links
and performing fastscan will descend k − 1 characters into the tree (a
k − 1 length prefix is guaranteed to be in the tree already, thanks
to the suffix lemma). Thus, to find loc(headi), we must process only the
additional characters in headi that follow the first k − 1 characters of
Si (which are, once again, guaranteed to be in the tree, and are thus
processed with fastscan rather than slowscan). Thus, the slowscan cost
is given by the number of remaining characters not known from the
suffix lemma: slowscanCosti = |headi| − (|headi−1| − 1). And hence, the
total cost of slowscan is found by a telescoping sum:

Σ_{i=2}^{n} slowscanCosti = |headn| − |headn−1| + 1 + |headn−1| − |headn−2| + 1
    + . . . + |head2| − |head1| + 1,

which collapses to |headn| − |head1| + (n − 1), and is thus in O(n).
Thus the total runtime necessary to find loc(headi ) in every iteration is
O(n) (i.e. the amortized runtime to find it in each iteration is constant), and
thus an edge to a new leaf node can be added trivially to result in an O(n)
overall construction time.
Pseudocode:
1: procedure ConstructSuffixTree(S)
2:     n ← |S|
3:     locOfHead ← empty cache
4:     for i = 1 to n do
           ▷ Find loc(headi)
5:         if i − 1 is in the locOfHead cache then
6:             Start the search for loc(headi) at parent(locOfHeadi−1).link:
7:             Search downward using fastscan (i.e. look only at the first character of each edge label), until the first |headi−1| − 1 characters of Si have been matched
8:             Continue searching for loc(headi) character-by-character (i.e. with slowscan)
9:         else
10:            Start the search for loc(headi) at the root:
11:            Search for loc(headi) character-by-character (i.e. with slowscan)
12:        end if
           ▷ loc(headi) has been found. Now insert taili, update the cache, and add a suffix link.
13:        Create a new node for loc(headi) (if none exists yet) by splitting an edge, and add an edge labeled taili from it to a new leaf
14:        if loc(headi) ≠ root then    ▷ Don't bother caching root as the starting point
15:            Cache: locOfHeadi ← loc(headi)
16:            Add a suffix link from loc(headi−1) to loc(headi): locOfHeadi−1.link ← locOfHeadi
17:        end if
18:    end for
19: end procedure
As a linear-time and linear-space solution to the longest
common substring problem:
Idea of approach: The idea of the approach is simple: concatenate the
two strings (separated by a special character), build the suffix tree of the
concatenated string, and then scan its internal nodes for the best solution.
Here we will prove the correctness of the approach; but, as with the tricky
linear-time proofs for the suffix tree, the recipe alone is sufficient to achieve
the result (although it is still nice to appreciate the underlying idea).
Proof that the solution occurs on a node For two strings S (A) and
S (B) , we concatenate them to form a contiguous string: S = S (A) $1 S (B) $2 ,
where $1 is some unused character and $2 is another different unused character. Then we build the suffix tree for S = S (A) $1 S (B) $2 . The longest identical
substring in both S (A) and S (B) can be defined as the longest prefix shared
by two suffixes, one found in S (A) and one found in S (B) ; we first prove that
the solution occurs on a node in the suffix tree of S, and then use simple
dynamic programming to find the highest-quality node.
First we demonstrate that the solution is a node in the suffix tree: any
substring A found in both S(A) and S(B) must be a prefix of at least two
suffixes of the string S = S(A)$1S(B)$2: one beginning after the $1 (Si = A . . . $2)
and one beginning before the $1 (Sj = A . . . $1 . . . A . . . $2, where j < i). Thus, the
suffixes Si and Sj share a common prefix A. Furthermore, if A is the maximal
matching string with |A| = k, then the character Si[k + 1] ≠ Sj[k + 1] (or else
adding one more character would produce a longer shared substring, contradicting
the notion that A is the maximal matching substring). When this differentiating
character is encountered during insertion, it will result in a split in the tree,
and the internal node inserted for that split will represent the substring
A. We can easily determine whether a node's substring occurs in both strings:
such a node will have at least one descendent leaf whose suffix contains $1 (i.e.
a suffix beginning in S(A)) and at least one descendent leaf whose suffix does
not contain $1 (i.e. a suffix beginning in S(B)).
Proof that no internal node represents an invalid string: Because we
have concatenated the two strings into S = S(A)$1S(B)$2, we must be careful
that no string crossing the boundary (i.e. A = A1 . . . $1 . . . A2) will be
recognized as a potential solution. This cannot happen: because $1 occurs
exactly once in S, no two suffixes share a prefix containing $1, so no split
(and hence no internal node) can occur at or below the character $1; any
path containing $1 continues directly to a leaf, and thus no internal node
represents a string crossing the boundary.
Finding the longest common substring: Second, we use a simple
dynamic programming method to easily compute which nodes in the tree have
descendents containing $1 and descendents not containing $1 (i.e. which nodes
correspond to substrings found in both S(A) and S(B)): it is trivial to use the
inward edge of any leaf to decide if the leaf contains $1 or not (this can be
done in O(1) per leaf node, because the edge stores the lower and upper index
into the original string, and it is easy to check if the index of $1 is within those
limits). With the leaves as the base case (i.e. we trivially know the outcome
of hasDescendentWith$1(u) for any leaf node u), it is easy to mark if a node
has a descendent containing the character $1 through simple recursion:

hasDescendentWith$1(u) = ⋁_{v ∈ children(u)} hasDescendentWith$1(v),

for any node u in the tree. By memoizing hasDescendentWith$1(u) for
every processed node, and calling hasDescendentWith$1(root), we compute
hasDescendentWith$1(u) for every node.
We can do the same thing to determine whether a node has descendents
without $1:

hasDescendentWithout$1(u) = ⋁_{v ∈ children(u)} hasDescendentWithout$1(v),

where the base case for leaves defines hasDescendentWithout$1(u) as
the already known logical opposite of hasDescendentWith$1(u). Thus the
recursion can be memoized in an identical manner.
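The two recursions can be sketched on a toy tree (the tuple-based tree encoding and the function names are this sketch's own, purely to illustrate the bottom-up OR; in the real suffix tree the results would be memoized by storing one flag per node, and the base case comes from the leaf's edge indices as described above):

```python
# leaf: True/False = "this leaf's suffix contains $1"
# internal node: tuple of children
tree = ((True, False), (False,), True)

def has_descendant_with_dollar1(u):
    if isinstance(u, bool):        # base case: a leaf
        return u
    # internal node: OR over children, mirroring the recursion in the text
    return any(has_descendant_with_dollar1(v) for v in u)

def has_descendant_without_dollar1(u):
    if isinstance(u, bool):        # base case: logical opposite at a leaf
        return not u
    return any(has_descendant_without_dollar1(v) for v in u)

# the root's first child has leaves on both sides of $1: a candidate node
node = tree[0]
print(has_descendant_with_dollar1(node) and has_descendant_without_dollar1(node))
```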
A node u corresponds to a substring found in both S(A) and S(B) if and
only if hasDescendentWithout$1(u) ∧ hasDescendentWith$1(u).
Regarding finding the longest string found in both, we will actually keep
track of the number of characters corresponding to a particular path from
the root to node u while we construct the tree (as noted above, the length
of the prefix reaching node u is fixed, so it can be stored in the node when
the node is created): given that the character count of the path corresponding
to node u is k characters, the character count at v, a child of u, is
k + |label((u, v))|, the length of the prefix at node u plus the number of
characters along the edge from u to v.
Thus, since the solution is guaranteed to be found at a node, we can scan
the nodes (the number of nodes is O(n)) to find the node with the highest
character count where hasDescendentWithout$1(u) ∧ hasDescendentWith$1(u)
is true.
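The characterization proved above (the longest common substring is the longest prefix shared by one suffix of each string) can be checked directly with brute force; this sketch mirrors the proof, not the linear-time suffix-tree method, and its function name and test strings are illustrative:

```python
def longest_common_substring(A, B):
    """Brute force: longest shared prefix over all pairs of suffixes (A[i:], B[j:])."""
    best = ""
    for i in range(len(A)):
        for j in range(len(B)):
            k = 0
            while i + k < len(A) and j + k < len(B) and A[i + k] == B[j + k]:
                k += 1
            if k > len(best):
                best = A[i:i + k]
    return best

print(longest_common_substring("xabcdy", "zabcdw"))  # -> "abcd"
```

This runs in cubic time; the suffix tree of S(A)$1S(B)$2 with the dynamic program above computes the same answer in O(n).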