Weiner's Algorithm

Linear Time Suffix Tree Construction

Introduction

There are several well-known algorithms for building suffix trees. We will discuss Weiner's algorithm (1973), which was the first linear-time algorithm. The terminology used is that of Seiferas (1985), who explained Weiner's algorithm.

Weiner's Algorithm

Step 1: Start from the shortest suffix (i.e. the end of the string) and work backward.

Let Si = si si+1 si+2 ... sn $ denote the suffix that starts at position i, where $ is the end marker at position n+1, so Sn+1 = $.
Let T(i) denote the suffix tree that holds the suffixes Si, Si+1, ..., Sn+1.
The first suffix tree we create is therefore T(n+1), which consists of the root and a single leaf reached by an edge labeled $.
Our goal is to create the full suffix tree T(1).

Step 2: Given T(i+1), construct T(i) by inserting Si = si Si+1 into the tree.

Example: S = ABCABBA$ (n = 7). We start with T(n+1), i.e. T(8), and insert one suffix per step:

  T(8): S8 = $
  T(7): S7 = A$
  T(6): S6 = BA$
  T(5): S5 = BBA$
  T(4): S4 = ABBA$
  T(3): S3 = CABBA$
  T(2): S2 = BCABBA$
  T(1): S1 = ABCABBA$ (the full tree)

[Figures: the trees T(8) through T(1), with edges labeled by substrings of S.]

Note: In an actual implementation we do not store the characters of the edge labels in the suffix tree, only indices into S.

A very important observation: the height of the tree grows by at most one with each insertion. Since the tree grows from the bottom, we do not want to build it with the naive approach (i.e. traversing the tree from the root to the bottom for every suffix). We want a way to insert a new leaf by going from the leaves up.

Weiner's Algorithm:

  Construct T(n+1)
  For i = n downto 1 do
    (i) Insert Si into T(i+1), creating T(i)
  end

Algorithm

How do we build the suffix tree? How do we insert each suffix? We start from the shortest suffix and work our way backward through the string. The key to Weiner's algorithm is how to find head efficiently, and hence how to insert a suffix into the tree quickly.
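To make the insertion order concrete, here is a minimal sketch that inserts the suffixes of the example string from shortest to longest. It is not Weiner's algorithm: it uses an uncompressed trie (one character per edge) and the naive root-down insertion that the notes say we want to avoid, so it runs in quadratic time. The function names are illustrative and do not come from the notes.

```python
def suffixes_shortest_first(s):
    """Yield (i, S_i) for i = n+1 down to 1, using the notes' 1-based indices.

    Assumes s already ends with a unique end marker '$'.
    """
    for i in range(len(s), 0, -1):
        yield i, s[i - 1:]


def insert_naive(trie, suffix):
    """Insert one suffix by walking down from the root (the naive approach)."""
    node = trie
    for ch in suffix:
        node = node.setdefault(ch, {})


def count_leaves(trie):
    """A leaf is a node without children; each suffix ends in its own leaf."""
    if not trie:
        return 1
    return sum(count_leaves(child) for child in trie.values())


s = "ABCABBA$"
trie = {}
order = []
for i, suf in suffixes_shortest_first(s):
    insert_naive(trie, suf)  # after this call the trie plays the role of T(i)
    order.append((i, suf))
    print(f"T({i}): inserted S{i} = {suf}, leaves = {count_leaves(trie)}")
```

Because every suffix ends with the unique marker $, no suffix is a prefix of another, so the number of leaves equals the number of suffixes inserted so far.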
Insertion step (i)

Define head as the longest prefix of Si that is in the tree T(i+1), at least implicitly. "Implicitly" means that the string can be read off starting at the root, but it may end in the middle of an edge rather than at a node of its own.

1. Find head.
2. If head is not a node, break the edge and make it one.
3. Add the rest of the suffix (Si with the prefix head removed, i.e. the string that starts at position i+|head|) as an additional child of head.

Example: The tree contains ABD$ and we want to add ABC$. Head is AB. The edge is broken at AB, which becomes a node with two outgoing edges labeled C$ and D$.

Breaking an edge and inserting the remainder of the suffix (steps 2 and 3) can be done in constant time. But how can we find head quickly?

Case 1: Trivial case (head is empty). This means a new letter has been introduced. Add a leaf for Si and an edge from the root labeled Si.
Example: Si = DCBA$ is added as a single edge labeled DCBA$.

Case 2: Head is not empty. Go from Si+1 to head. How?

  Si = si Si+1
  head = si x, where x is the longest prefix of Si+1 such that si x is in the tree T(i+1), though it does not necessarily end in a node.

Referring back to our first example: S4 = ABBA$, s4 = A, S5 = BBA$. Head in this case is "A", so x is empty.

Let x' be the longest prefix of Si+1 such that si x' ends in a node of the tree T(i+1). We run up the tree from the leaf Si+1 to x' and then cross over to si x'. For the run up to pay for itself, si x' must not lie deeper in the tree than x' (more precisely, it can be at most one level deeper). We must prove this in order to prove that the algorithm is linear.

[Figure: the path from the leaf Si+1 up through x to x', next to the nodes si x' and head on the path to Si = si Si+1.]

The following lemma is the basis of the whole algorithm. The idea is that even if one insertion causes us to traverse far up the tree, the next run up will be much shorter.

Lemma: If au ends in a node of T(i), then u ends in a node of T(i+1). (Consequently depth(si x') ≤ depth(x') + 1.)

Proof: If au ends in a leaf, then au is a suffix, so u is also a suffix and clearly ends in a leaf. Otherwise au is an internal node: there exist letters b ≠ c such that aub and auc are in T(i), which means that ub and uc are in T(i+1), which means that there is a node in T(i+1) in which u ends.
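The definition of head and the lemma can both be checked by brute force on the running example. The sketch below is only a quadratic-time illustration, not the linear-time search described in the notes; `suffix`, `explicit_nodes`, `head` and `lemma_holds` are hypothetical helper names, and the code assumes the string ends with a unique end marker $.

```python
def suffix(s, i):
    """S_i in the notes' 1-based numbering; s must end with a unique '$'."""
    return s[i - 1:]


def explicit_nodes(s, i):
    """Strings that end in a node of T(i): the root, leaves, branching points."""
    sufs = [suffix(s, j) for j in range(i, len(s) + 1)]
    nodes = {""} | set(sufs)  # root and leaves
    for suf in sufs:
        for k in range(1, len(suf)):
            u = suf[:k]
            next_chars = {t[k] for t in sufs if t.startswith(u)}
            if len(next_chars) >= 2:  # u is followed by two different letters
                nodes.add(u)
    return nodes


def head(s, i):
    """Longest prefix of S_i that occurs, at least implicitly, in T(i+1)."""
    si = suffix(s, i)
    later = [suffix(s, j) for j in range(i + 1, len(s) + 1)]
    best = ""
    for k in range(1, len(si) + 1):
        if not any(t.startswith(si[:k]) for t in later):
            break
        best = si[:k]
    return best


def lemma_holds(s):
    """Check: if au is a node of T(i), then u is a node of T(i+1)."""
    n = len(s) - 1
    for i in range(1, n + 1):
        below = explicit_nodes(s, i + 1)
        for au in explicit_nodes(s, i):
            if au and au[1:] not in below:
                return False
    return True


s = "ABCABBA$"
for i in range(len(s) - 1, 0, -1):
    print(f"head for S{i} = {suffix(s, i)!r}: {head(s, i)!r}")
print("lemma holds:", lemma_holds(s))
```

For i = 4 this reproduces the head "A" from the example above, and for i = 3 the head is empty because C is a new letter (Case 1).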
The lemma guarantees that x' appears explicitly in the tree. So all we need in order to find head very quickly is the following, for every node u and every letter σ of the alphabet seen up to this point:

a. A flag indicating whether σu is found in T(i+1), at least implicitly. ("Implicitly", as stated earlier, means even if it ends in the middle of an edge.)
b. A pointer to the node where σu ends, if σu appears explicitly (i.e. ends in a node).

Example: S = DABDAC$, with suffixes S1 = DABDAC$, S2 = ABDAC$, S3 = BDAC$, S4 = DAC$, S5 = AC$, S6 = C$. In its suffix tree, let u be the node A (with outgoing edges BDAC$ and C$) and let w be the node DA (also with outgoing edges BDAC$ and C$). At node u for letter D:
a. The flag for D at node u is true.
b. The pointer for D at node u points to node w.

[Figure: the suffix tree of DABDAC$ with the nodes u and w marked.]

The Algorithm:
1. Run up the path from the leaf Si+1 until a node with the si flag set appears (this node is x).
2. Keep running up the path until a node with an si pointer appears (this node is x'), keeping track of the distance between the flag node and the pointer node. We need this information in order to break the edge.
3. Jump to si x' by following the pointer.
4. Break the edge at head = si x and add Si appropriately.

It is true that running up from Si+1 does not take constant time, but the lemma tells us that the next run will be shorter. Breaking the edge and inserting the suffix can be done in constant time.

How do we update the pointers and flags?

Case 1: Trivial case. Head is empty.
1. The si pointer of Si+1 points to Si.
2. All nodes on the path from the root to Si+1 set their si flag.
3. si is a new symbol, so there are no pointers or flags at Si.

[Figure: the new leaf Si next to the path to Si+1, with si flags along the path and an si pointer from Si+1 to Si.]

Case 2: Head is not empty.
1. Update the flags that change in T(i+1) because the nodes head and Si are introduced.
   a. Set the si flag on all nodes from Si+1 up to x.
2. Update the pointers that change in T(i+1) because of the new nodes head and Si.
   a. The si pointer of Si+1 points to Si (= si Si+1).
   b. The si pointer of x points to head (= si x).
3. Set flags and pointers at the new nodes head and Si.
   a. Si is the longest string in the tree T(i), so it does not get any flags or pointers.
   b. If head was already a node in the tree T(i+1), then we are done.
   c. If head is new, then for every flag or pointer set in the node directly below it, set the corresponding flag in head. Head gets no pointers. Why? By the lemma.
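The flag and pointer definitions can be illustrated by computing them by brute force on the example string DABDAC$. This is only a sketch under assumptions: a real implementation stores the flags and pointers at each node and updates them incrementally as described in the notes, whereas the hypothetical helpers `flag` and `pointer` below recompute everything from the string and take the tree index i as a parameter.

```python
def suffix(s, i):
    """S_i in the notes' 1-based numbering; s must end with a unique '$'."""
    return s[i - 1:]


def explicit_nodes(s, i):
    """Strings that end in a node of T(i): the root, leaves, branching points."""
    sufs = [suffix(s, j) for j in range(i, len(s) + 1)]
    nodes = {""} | set(sufs)
    for suf in sufs:
        for k in range(1, len(suf)):
            u = suf[:k]
            if len({t[k] for t in sufs if t.startswith(u)}) >= 2:
                nodes.add(u)
    return nodes


def flag(s, i, u, c):
    """True if c+u occurs in T(i), possibly ending in the middle of an edge."""
    return any(suffix(s, j).startswith(c + u) for j in range(i, len(s) + 1))


def pointer(s, i, u, c):
    """The node c+u if it is explicit in T(i), otherwise None."""
    return c + u if c + u in explicit_nodes(s, i) else None


s = "DABDAC$"
u = "A"  # the node called u in the example
print("flag for D at u:", flag(s, 1, u, "D"))
print("pointer for D at u:", pointer(s, 1, u, "D"))
# A flag can be set without a pointer: B occurs in the tree, but only
# inside the leaf edge BDAC$, so the root has a B flag and no B pointer.
print("flag for B at root:", flag(s, 1, "", "B"))
print("pointer for B at root:", pointer(s, 1, "", "B"))
```

The case of a flag without a pointer is exactly the situation in which the algorithm has to keep running up the path after the flag first appears.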
If a·head (some letter a followed by head) were a node in T(i), then by the lemma head would be a node in T(i+1), which was not the case.

[Figure: the path x', x, Si+1 with si links leading from x' to si x', from x to head (= si x), and from Si+1 to Si (= si Si+1).]

Time Complexity

Informally: At every insertion step the depth is incremented by at most 1. We may run up a lot, but we go down only one. A single run up the tree does not necessarily take constant time, but the amortized time of the algorithm is linear. If you charge running up the path against "building" it down, the total is linear. Remember: breaking the edge and inserting the suffix can be done in constant time.

Formally:

  Total time ≤ | (depth(Sn+1) - depth(Sn))
               + (depth(Sn) - depth(Sn-1))
               + ...
               + (depth(Si+1) - depth(Si))
               + ...
               + (depth(S2) - depth(S1)) | + O(n)
             = O(n + maxdepth(T))
             = O(n)

The sum telescopes to depth(Sn+1) - depth(S1), whose absolute value is at most the maximum depth of the tree, and that depth is at most n+1.
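The informal claim that the depth grows by at most one per insertion can be checked numerically on the first example. The sketch below assumes that depth(Si) means the number of edges from the root to the leaf Si in T(i), which the notes do not state explicitly; `leaf_depth` is a hypothetical brute-force helper, not part of the algorithm.

```python
def suffix(s, i):
    """S_i in the notes' 1-based numbering; s must end with a unique '$'."""
    return s[i - 1:]


def leaf_depth(s, i):
    """Number of edges from the root to the leaf S_i in T(i)."""
    sufs = [suffix(s, j) for j in range(i, len(s) + 1)]
    si = sufs[0]
    depth = 1  # the edge that enters the leaf itself
    for k in range(1, len(si)):
        next_chars = {t[k] for t in sufs if t.startswith(si[:k])}
        if len(next_chars) >= 2:  # an internal node lies on the path
            depth += 1
    return depth


s = "ABCABBA$"
n = len(s) - 1
depths = {i: leaf_depth(s, i) for i in range(1, n + 2)}
for i in range(n + 1, 0, -1):
    print(f"depth(S{i}) = {depths[i]}")

# Each new leaf is at most one level deeper than the previous one.
assert all(depths[i] <= depths[i + 1] + 1 for i in range(1, n + 1))

# The telescoping sum collapses to depth(S_{n+1}) - depth(S_1).
telescoped = sum(depths[i + 1] - depths[i] for i in range(1, n + 1))
assert telescoped == depths[n + 1] - depths[1]
```

On this string the depths stay small, but the same check can be run on any string that ends with a unique end marker.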
