Weiner’s Algorithm

Linear Time Suffix Tree Construction
Introduction
There are several well-known algorithms for building suffix trees. We will discuss
Weiner’s algorithm. Weiner’s algorithm (1973) was the first linear time algorithm.
The terminology used is that of Seiferas (1985) who explained Weiner’s
algorithm.
Weiner’s Algorithm
Step 1: Start from the shortest suffix (i.e. the end of the string) and work our way
backward.
Let us define the suffix Si = si si+1 si+2 … sn $, where $ = sn+1 is a terminator that
occurs nowhere else in the string.
Let us define the suffix tree that contains the suffixes Si, Si+1, …, Sn+1 as: T(i)
So the first suffix tree that we will create will be T(n+1), which holds only the suffix $.
[Figure: T(n+1), a root with a single leaf $]
Our goal is to create the full suffix tree T(1).
Step 2: Given T(i+1), construct T(i) by adding the suffix Si = si Si+1 to the tree.
Example:
S=ABCABBA$
We start with T(n+1) (i.e. T(8)) and insert one suffix at a time:
T(8): S8=$
T(7): S7=A$
T(6): S6=BA$
T(5): S5=BBA$
T(4): S4=ABBA$
T(3): S3=CABBA$
T(2): S2=BCABBA$
And finally the full tree:
T(1): S1=ABCABBA$
[Figures: the suffix trees T(8) through T(1), one per insertion]
In T(1) the root has four sons: the leaves $ and CABBA$, and the internal nodes A and B.
Node A has the sons $ and B (the node AB), node AB has the sons BA$ and CABBA$,
and node B has the sons A$, BA$ and CABBA$.
Note: In the actual implementation of the algorithm we don’t store the actual characters of the
strings in the suffix tree, only their indices in the string.
A very important observation: The height of the tree grows by at most one with
each insertion. Since the tree grows from the bottom we don’t want to build the
tree using the naïve approach (i.e. traversing the tree from root to the bottom).
We want a way we can insert a new leaf by going from the leaves up.
Weiner’s Algorithm:
Construct T(n+1)
For i = n downto 1 do
(i) Insert Si into T(i+1) to create T(i)
end
end Algorithm
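The loop above can be sketched in Python as follows. This is only a minimal illustration of the insertion order (shortest suffix first), assuming an uncompressed trie stored as nested dictionaries and a plain root-to-leaf insertion. It takes quadratic time and is not Weiner's linear-time method; the names build_suffix_trie and count_leaves are made up for this sketch.

```python
def build_suffix_trie(s):
    """Insert the suffixes of s, shortest first, into a nested-dict trie.

    Naive O(n^2) illustration of the loop order only (an assumption of this
    sketch, not the linear-time insertion). s must end with a unique
    terminator such as '$'.
    """
    root = {}
    order = []                                  # suffixes in insertion order
    for start in range(len(s) - 1, -1, -1):     # from S_{n+1} down to S_1
        suffix = s[start:]
        node = root
        for ch in suffix:                       # walk down, creating nodes
            node = node.setdefault(ch, {})
        order.append(suffix)
    return root, order


def count_leaves(node):
    """A leaf is a node with no children."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


trie, order = build_suffix_trie("ABCABBA$")
print(order[:3])            # ['$', 'A$', 'BA$']
print(count_leaves(trie))   # 8, one leaf per suffix
```

Because every suffix ends with the unique terminator $, no suffix is a prefix of another, so each one ends in its own leaf.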
How do we build the suffix tree? How do we insert each suffix? We start from the
smallest suffix and work our way backward through the string. The key to
Weiner’s algorithm is how to find head efficiently, and hence quickly insert a
suffix into the tree.
(i) Let us define head as the longest prefix of Si that is in the tree T(i+1) (at
least implicitly*).
*When we say implicitly we mean that it is located in the tree, but it may end in the
middle of an edge rather than in a node of its own.
1. Find head.
2. If head is not a node, then break an edge and make it one.
3. Add Si+|head| (i.e. the rest of the suffix Si once head is removed) as an additional
son of head.
EX:
Found in tree: ABD$
Want to Add: ABC$
Head: AB
[Figure: the edge leading to ABD$ is broken at AB; AB becomes a node with two sons, C$ and D$]
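The definition of head can be checked on the running example with a brute-force sketch. It assumes 1-based suffix indices as in the notes and simply compares Si against every later suffix; the function name find_head_naive is hypothetical, and this is not the fast method the algorithm relies on.

```python
def find_head_naive(s, i):
    """Return (head, rest) for inserting suffix S_i (1-based) into T(i+1).

    head is the longest prefix of S_i that is also a prefix of some later
    suffix S_j with j > i; rest is what remains of S_i after head.
    Brute force, for illustration only.
    """
    suffix = s[i - 1:]
    head = ""
    for j in range(i + 1, len(s) + 1):      # later suffixes S_{i+1} .. S_{n+1}
        other = s[j - 1:]
        k = 0
        while k < len(other) and suffix[k] == other[k]:
            k += 1                          # length of the common prefix
        if k > len(head):
            head = suffix[:k]
    return head, suffix[len(head):]


print(find_head_naive("ABCABBA$", 4))   # ('A', 'BBA$')
print(find_head_naive("ABCABBA$", 1))   # ('AB', 'CABBA$')
print(find_head_naive("ABCABBA$", 3))   # ('', 'CABBA$')
```

The second component is the label of the new leaf edge hanging below head.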
Breaking an edge and inserting the remainder of the suffix (steps 2 and 3) can be
done in constant time. BUT: How can we find “head” quickly?
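Case 1: trivial case (head is empty). This means a new letter has been
introduced. Add a new leaf Si as a son of the root, connected by a single edge.
Example:
Si=DCBA$
[Figure: the root with a new edge labeled DCBA$ leading to the leaf Si]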
Case 2: Head is not empty.
Go from Si+1 to head. HOW?
Si = si Si+1
head = si x, where x = longest prefix of Si+1 such that si x is in the tree T(i+1),
though it does not necessarily end in a node.
Referring back to our first example:
S4=ABBA$
s4=A
S5=BBA$
And head in this case would be “A”.
Let x’ = longest prefix of Si+1 such that si x’ ends in a node in the tree T(i+1).
Since we are traversing the tree from the bottom up, si x’ must not lie farther from the
root than x’ by more than one level, i.e. depth(si x’) ≤ depth(x’) + 1. We must prove
this in order to prove that the algorithm is linear.
Illustration:
[Figure: the path from the leaf Si+1 up through x to x’, and the jump from x’ to si x’;
head = si x lies below si x’, and the new leaf Si = si Si+1 hangs below head]
The following lemma is the basis of the whole algorithm. The idea of this lemma
is that even if one insertion causes us to traverse far up the tree the next run up
will be much shorter.
Lemma: If au ends in a node in T(i), then u ends in a node in T(i+1)
(i.e. depth(si x’) ≤ depth(x’) + 1)
Proof:
If au ends in a leaf then clearly u ends in a leaf, because au is a suffix and u is the
suffix that follows it.
Otherwise au is an internal node:
∃ b,c, b ≠ c, s.t. aub, auc ∈ T(i)
which means that in T(i+1):
∃ b,c, b ≠ c, s.t. ub, uc ∈ T(i+1)
which means that:
∃ node in T(i+1) in which u ends.
The lemma guarantees us that x’ will appear explicitly in the tree.
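The lemma can be checked by brute force on small strings. The sketch below assumes that a node is identified with the string spelled on the path from the root to it, and that the nodes of T(i) are the root, the leaves (the suffixes themselves) and every string followed by at least two different characters. The names tree_nodes and lemma_holds are hypothetical.

```python
from collections import defaultdict


def tree_nodes(s, i):
    """Strings that end in a node of T(i): root, leaves and branching points.

    i is 1-based; T(i) holds the suffixes S_i .. S_{n+1} of s.
    """
    followers = defaultdict(set)
    suffixes = [s[j:] for j in range(i - 1, len(s))]
    for suf in suffixes:
        for k in range(len(suf)):
            followers[suf[:k]].add(suf[k])   # character following prefix suf[:k]
    branching = {w for w, chars in followers.items() if len(chars) >= 2}
    return {""} | set(suffixes) | branching


def lemma_holds(s):
    """Check: if au is a node of T(i), then u is a node of T(i+1)."""
    for i in range(1, len(s)):
        current, previous = tree_nodes(s, i), tree_nodes(s, i + 1)
        for au in current:
            if au and au[1:] not in previous:
                return False
    return True


print(lemma_holds("ABCABBA$"))   # True
print(lemma_holds("DABDAC$"))    # True
```

This only tests the statement on given inputs; it is not a substitute for the proof.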
So all we need now in order to be able to find head very quickly is the following.
For every node u and every letter a ∈ Σ (every member of the alphabet seen up to this
point):
A flag indicating whether au is found in T(i+1) at least implicitly (‘implicitly’, as we have
stated earlier, means even if it ends in the middle of an edge).
A pointer to the node where au ends, if au appears explicitly (i.e. ends in a node).

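Example:
S=DABDAC$
S1=DABDAC$
S2=ABDAC$
S3=BDAC$
S4=DAC$
S5=AC$
S6=C$
[Figure: the suffix tree of DABDAC$. The internal node u = A has the sons BDAC$ and C$;
the internal node w = DA has the sons BDAC$ and C$]
At node u for letter D:
a. The flag for D at node u is true.
b. The pointer for D at node u points to node w.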
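The flag and pointer of this example can be recomputed by brute force. The sketch below assumes nodes are represented by the strings they spell (so u is "A" and w is "DA") and recomputes everything from the suffix list on each call; a real implementation would store the flags and pointers in the nodes. The function names are hypothetical.

```python
from collections import defaultdict


def tree_nodes(suffixes):
    """Strings that end in a node: root, leaves and branching points."""
    followers = defaultdict(set)
    for suf in suffixes:
        for k in range(len(suf)):
            followers[suf[:k]].add(suf[k])
    branching = {w for w, chars in followers.items() if len(chars) >= 2}
    return {""} | set(suffixes) | branching


def flag(suffixes, u, a):
    """True if a+u occurs in the tree, possibly ending inside an edge."""
    return any(suf.startswith(a + u) for suf in suffixes)


def pointer(suffixes, u, a):
    """The node where a+u ends, or None if a+u does not end in a node."""
    return a + u if (a + u) in tree_nodes(suffixes) else None


s = "DABDAC$"
suffixes = [s[j:] for j in range(len(s))]
print(flag(suffixes, "A", "D"), pointer(suffixes, "A", "D"))   # True DA
print(flag(suffixes, "", "B"), pointer(suffixes, "", "B"))     # True None
print(flag(suffixes, "A", "B"), pointer(suffixes, "A", "B"))   # False None
```

The second line shows a flag without a pointer: B occurs in the tree, but only inside the edge BDAC$.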
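The Algorithm:
1. Run up the path from the leaf Si+1 until the si flag appears. The node where it appears is x.
2. Keep running up the path until the si pointer appears, keeping track of the
difference in depth between the flag and the pointer. We need this information in order to
break the edge. The node where the pointer appears is x’.
3. Jump to si x’ by following the pointer.
4. Break the edge below si x’ at head = si x and add Si as a new leaf below head.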
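Steps 1 to 3 can be simulated on the running example. The sketch below is an illustration under assumptions, not the linear-time implementation: nodes are represented by the strings they spell, and the flags and pointers are recomputed by brute force instead of being stored. The name run_up is hypothetical, and returning None for x’ when no pointer is found on the path is a choice of this sketch for a case the notes do not spell out.

```python
from collections import defaultdict


def tree_nodes(suffixes):
    """Strings that end in a node: root, leaves and branching points."""
    followers = defaultdict(set)
    for suf in suffixes:
        for k in range(len(suf)):
            followers[suf[:k]].add(suf[k])
    branching = {w for w, chars in followers.items() if len(chars) >= 2}
    return {""} | set(suffixes) | branching


def run_up(s, i):
    """Simulate steps 1-3 for inserting S_i (1-based) into T(i+1).

    Returns (x, x_prime, head); x is None when head is empty, and
    x_prime is None when no pointer is found on the path.
    """
    a = s[i - 1]                                    # the new letter s_i
    suffixes = [s[j:] for j in range(i, len(s))]    # S_{i+1} .. S_{n+1}
    nodes = tree_nodes(suffixes)
    leaf = suffixes[0]                              # S_{i+1}
    # Nodes on the path from the leaf S_{i+1} up to the root, deepest first.
    path = [leaf[:k] for k in range(len(leaf), -1, -1) if leaf[:k] in nodes]
    x = x_prime = None
    for u in path:
        flagged = any(suf.startswith(a + u) for suf in suffixes)
        if x is None and flagged:
            x = u                                   # step 1: first s_i flag
        if (a + u) in nodes:
            x_prime = u                             # step 2: first s_i pointer
            break
    head = "" if x is None else a + x
    return x, x_prime, head


print(run_up("ABCABBA$", 1))   # ('B', '', 'AB')
print(run_up("ABCABBA$", 4))   # ('', None, 'A')
print(run_up("ABCABBA$", 3))   # (None, None, '')
```

For S1 the flag appears at x = B and the pointer at the root, so the jump lands on the node A and the edge below it is broken one character down, at head = AB.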
It is true that running up from Si+1 is not constant, but the lemma tells us that the
next run will be shorter. Breaking the edge and inserting the suffix can be done
in constant time.
How do we update the pointers and flags?
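Case 1: Trivial case. Head is empty.
1. The si pointer at Si+1 points to Si.
2. All nodes on the path from Si+1 up to the root set their si flag.
3. si is a new symbol, so there are no pointers or flags at Si.
[Figure: the path from the root down to Si+1 with the si flag set at every node, and the
si pointer from Si+1 to the new leaf Si]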
Case 2: Head is not empty.
1. Update the flags that change in T(i+1) because the nodes head and Si are
introduced.
a. Set the si flag at all nodes on the path from Si+1 up to x.
2. Update the pointers that change in T(i+1) because of the new nodes head and
Si.
a. The si pointer at Si+1 points to Si (= si Si+1).
b. The si pointer at x points to head (= si x).
3. Set the flags and pointers at the new nodes head and Si.
a. Si is the longest string in the tree T(i), so it doesn’t get any flags or
pointers.
b. If head was already a node in the tree T(i+1), then we are done.
c. If head is new, every flag and pointer in the node below it sets the
corresponding flag in head. Head gets no pointers. Why? By the lemma: if
a·head ended in a node for some letter a, then head would already have been
a node in T(i+1), which was not the case.

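[Figure: the path from Si+1 up through x to x’ with the si flags set, the si pointer from
x’ to si x’, and below si x’ the node head (si x) with the new leaf si Si+1 (Si)]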
Time Complexity
Informally: At every insertion step the depth increases by at most 1. We may
run up a lot, but we go down only one level. A single run up the tree is not necessarily
constant time, but the amortized time of the algorithm will be linear. If you charge running
up the path against “building” it down, the total will be linear. Remember: breaking the
edge and inserting the suffix can be done in constant time.
Formally:
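Total time ≤ |(depth(Sn+1) - depth(Sn)) +
(depth(Sn) - depth(Sn-1)) +
...
(depth(Si+1) - depth(Si)) +
...
(depth(S2) - depth(S1))| + O(n)
= O(n + maxdepth(T)) = O(n)
The sum telescopes to depth(Sn+1) - depth(S1), whose absolute value is at most the
maximum depth of the tree.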