A Space-Economical Suffix Tree Construction Algorithm
Edward M. McCreight (1976)
Definitions
Suffix trees
Definition (Σ+-tree).
Given an alphabet Σ, a Σ+-tree T is a rooted tree such that
1. (T1) Every edge label is a string from Σ+, i.e., a non-empty string over Σ;
2. (T3) The labels of any two sibling edges begin with different characters.
Definition (locus).
Denote by path(k) the concatenation of the edge labels on the path from the root of T to the node k.
The locus of a string u in a Σ+-tree is the node k such that path(k) = u; we denote this node by ū.
Proposition (uniqueness of path labels).
Path labels are unique, that is, there are no two distinct nodes k ≠ m such that path(k) = path(m).
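As a minimal sketch of these definitions, a Σ+-tree can be held as nodes with a parent pointer and a child map keyed by the first character of each edge label, which enforces T3 directly. The names Node, add_child, path and locus are illustrative choices and do not come from the original notes.

```python
class Node:
    """A node of a Sigma+-tree (illustrative representation)."""

    def __init__(self, parent=None, label=""):
        self.parent = parent
        self.label = label      # label of the edge entering this node ("" for the root)
        self.children = {}      # first character of an edge label -> child node

    def add_child(self, label):
        assert label, "T1: edge labels are non-empty"
        assert label[0] not in self.children, "T3: sibling edges start differently"
        child = Node(self, label)
        self.children[label[0]] = child
        return child


def path(k):
    """Concatenate the edge labels from the root down to node k."""
    parts = []
    while k.parent is not None:
        parts.append(k.label)
        k = k.parent
    return "".join(reversed(parts))


def locus(root, u):
    """Return the node k with path(k) == u, or None if u has no locus."""
    node = root
    while u:
        child = node.children.get(u[0])
        if child is None or not u.startswith(child.label):
            return None          # u ends inside an edge or leaves the tree
        u = u[len(child.label):]
        node = child
    return node
```

Because each step of locus follows the single child selected by the next character, at most one node can be reached for a given string, which is the uniqueness of path labels.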
Definition (words represented in a tree).
A string w occurs in T iff T contains the locus of wu for some string u (possibly empty).
By words(T) we denote the set of strings occurring in T.
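The definition of words(T) can be sketched on a small representation in which a tree is a nested dict mapping each edge label to the subtree below it (an assumption made for brevity; the helper names occurs and words are hypothetical).

```python
def occurs(tree, w):
    """True iff w is a prefix of path(k) for some node k of the tree."""
    if not w:
        return True
    for label, subtree in tree.items():
        if label[0] == w[0]:                 # by T3 at most one edge can match
            n = min(len(label), len(w))
            if label[:n] != w[:n]:
                return False
            return occurs(subtree, w[n:])    # w[n:] is "" if w ends inside the edge
    return False


def words(tree, prefix=""):
    """The set of all strings occurring in the tree, including the empty string."""
    result = {prefix}
    for label, subtree in tree.items():
        for n in range(1, len(label) + 1):
            result.add(prefix + label[:n])
        result |= words(subtree, prefix + label)
    return result
```

For example, with T = {"ab": {"abc": {}, "c": {}}, "babc": {}}, the string "aba" occurs in T although it has no locus, because its extension "ababc" does.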
Definition (Suffix Tree).
A suffix tree for a string S over Σ is a Σ+-tree T such that
1. words(T)={w | w is a substring of S}
2. (T2) Every internal node has at least 2 sons.
Proposition. Let T be a suffix tree for S. Then
1. Every full path in T (from the root to a leaf) represents a suffix of S.
2. Every suffix of S is represented by a unique path in T.
Proposition (uniqueness of suffix trees). Given a string S, its suffix tree is unique up to order among
siblings.
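To make the definition concrete, the following sketch builds a suffix tree by inserting the suffixes one at a time and splitting edges where needed. This is the naive quadratic construction, not McCreight's algorithm. It assumes that the last character of S occurs nowhere else in S, so that no suffix is a prefix of another, and it treats the root as exempt from T2. Trees are nested dicts from edge label to subtree, and the function names are hypothetical.

```python
def insert(tree, s):
    """Insert the string s below the node represented by `tree`."""
    if not s:
        return
    for label in list(tree):
        if label[0] != s[0]:
            continue
        n = 0                                   # length of the common prefix
        while n < min(len(label), len(s)) and label[n] == s[n]:
            n += 1
        if n < len(label):                      # mismatch inside the edge: split it
            subtree = tree.pop(label)
            tree[label[:n]] = {label[n:]: subtree}
            label = label[:n]
        insert(tree[label], s[n:])
        return
    tree[s] = {}                                # no edge starts with s[0]: new leaf


def naive_suffix_tree(S):
    tree = {}
    for i in range(len(S)):
        insert(tree, S[i:])
    return tree


def satisfies_t2(tree, is_root=True):
    """T2: every internal node other than the root has at least two children."""
    if tree and not is_root and len(tree) < 2:
        return False
    return all(satisfies_t2(sub, False) for sub in tree.values())
```

Comparing two such dicts with == ignores the order of keys, which mirrors uniqueness up to the order among siblings.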
Definition (Extended Locus).
The Extended Locus of a string u is the locus of the shortest extension uw of u (w is possibly
empty) such that uw has a locus in T.
Definition (Contracted Locus).
The Contracted Locus of a string u is the locus of the longest prefix x of u (x is possibly empty)
such that x has a locus in T.
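Both loci can be found in a single walk from the root. The sketch below assumes a tree stored as a nested dict from edge label to subtree and returns the path labels of the two loci rather than node objects; loci is a hypothetical helper name.

```python
def loci(tree, u):
    """Return (contracted, extended) as path labels for the string u.

    `extended` is None when u does not occur in the tree at all.
    """
    done, rest = "", u            # done = path label of the node reached so far
    while rest:
        label = next((l for l in tree if l[0] == rest[0]), None)
        if label is None:
            return done, None     # no edge continues with the next character
        n = min(len(label), len(rest))
        if label[:n] != rest[:n]:
            return done, None     # mismatch inside the edge
        if len(rest) < len(label):
            return done, done + label   # u ends inside this edge
        done += label
        rest = rest[len(label):]
        tree = tree[label]
    return done, done             # u itself has a locus
```

With T = {"ab": {"abc": {}, "c": {}}, "babc": {}}, loci(T, "aba") gives ("ab", "ababc"), and loci(T, "b") gives ("", "babc"), so the contracted locus of "b" is the root.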
Strings and Suffixes
Let S be our main string.
Definition (suf_i).
suf_i is the suffix of S beginning at the i-th position (positions are counted from 1, i.e., suf_1 = S).
Definition (head_i).
head_i is the longest prefix of suf_i that is also a prefix of suf_j for some j < i. Hence, head_i is the longest
prefix of suf_i that also occurs starting at some position before i.
Definition (tail_i). tail_i is defined such that suf_i = head_i tail_i.
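These three definitions translate directly into a brute-force computation, sketched here with 1-based positions as in the text. The function names are illustrative; this is a direct reading of the definitions and makes no attempt at efficiency.

```python
def suf(S, i):
    """suf_i: the suffix of S starting at position i (1-based)."""
    return S[i - 1:]


def head(S, i):
    """head_i: the longest prefix of suf_i that is a prefix of suf_j for some j < i."""
    s, best = suf(S, i), 0
    for j in range(1, i):
        t, n = suf(S, j), 0
        while n < len(s) and n < len(t) and s[n] == t[n]:
            n += 1
        best = max(best, n)
    return s[:best]


def tail(S, i):
    """tail_i: what remains of suf_i after head_i."""
    return suf(S, i)[len(head(S, i)):]


heads = [head("ababc", i) for i in range(1, 6)]   # ['', '', 'ab', 'b', '']
```

For S = ababc, head_3 = ab because suf_3 = abc shares the prefix ab with suf_1 = ababc, and tail_3 = c.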
McCreight’s Algorithm
Lemma 1: If head_{i-1} = xu for some character x and some string u (possibly empty), then u is a prefix of
head_i.
Proof. head_{i-1} = xu, hence there is a j < i such that xu is a prefix of both suf_{j-1} and suf_{i-1}.
1. xu is a prefix of suf_{j-1}, therefore u is a prefix of suf_j.
2. xu is a prefix of suf_{i-1}, therefore u is a prefix of suf_i.
By (1) and (2), there is some j < i such that u is a prefix of both suf_j and suf_i.
Hence, by the definition of head_i, u is a prefix of head_i.
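Lemma 1 can be checked empirically on small inputs. The sketch below recomputes head_i by brute force and tests, for every i ≥ 2, that head_{i-1} without its first character is a prefix of head_i. The function names are hypothetical, and the check illustrates the lemma rather than proving it.

```python
def head(S, i):
    """head_i by brute force (positions are 1-based)."""
    s, best = S[i - 1:], 0
    for j in range(1, i):
        t, n = S[j - 1:], 0
        while n < len(s) and n < len(t) and s[n] == t[n]:
            n += 1
        best = max(best, n)
    return s[:best]


def lemma1_holds(S):
    """Check that head_{i-1} = xu implies u is a prefix of head_i, for all i >= 2."""
    for i in range(2, len(S) + 1):
        prev = head(S, i - 1)
        if prev and not head(S, i).startswith(prev[1:]):
            return False
    return True
```

One consequence is that head_i is at most one character shorter than head_{i-1}.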
Definition (res_i). res_i = w v tail_i, where w and v are the strings rescanned and scanned in step i.
Hence, res_i is the suffix of S to which rescanning and scanning are confined during step i.
Definition (int_i). int_i is the number of intermediate nodes (f) encountered while rescanning (in substep B)
during step i.
Running Examples of McCreight’s algorithm
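S = ababc
[Figure: schematic of step i on T_{i-1}, with the root and the strings x, u, w, v, head_{i-1}, head_i and tail_i marked. Annotation: contracted locus of head_{i-1} in T_{i-2}, unless the locus of head_{i-1} in T_{i-2} is the root.]
In the table, ε denotes the empty string, head_{i-1} = x u w and head_i = u w v.

Step i   suf_i   head_i   tail_i   x   u   w   v
Step 1   ababc   ε        ababc
Step 2   babc    ε        babc     ε   ε   ε   ε
Step 3   abc     ab       c        ε   ε   ε   ab
Step 4   bc      b        c        a   ε   b   ε
Step 5   c       ε        c        b   ε   ε   ε

Trees after each step, written as the edge labels leaving the root, with the edges below an internal node in parentheses:
T1: ababc
T2: ababc, babc
T3: ab(abc, c), babc
T4: ab(abc, c), b(abc, c)
T5: ab(abc, c), b(abc, c), c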
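The steps of the running example can be traced with a compact sketch of the algorithm. It follows the three substeps as they are usually presented for McCreight's method (A: split head_{i-1} into x u w and follow a suffix link, B: rescan w, C: scan v), under several assumptions. Edge labels are stored as Python strings for readability, so the sketch does not reflect the space economy of the original. The last character of S is assumed to occur nowhere else in S. The names Node, split and mccreight_rows are hypothetical.

```python
class Node:
    def __init__(self, parent, label):
        self.parent = parent
        self.label = label        # label of the edge entering this node
        self.children = {}        # first character of an edge label -> child
        self.link = None          # suffix link: locus of path(node) minus its first character
        self.depth = (parent.depth if parent else 0) + len(label)   # len(path(node))


def split(child, n):
    """Create a node n characters down the edge entering `child` and return it."""
    parent = child.parent
    mid = Node(parent, child.label[:n])
    parent.children[mid.label[0]] = mid
    child.parent, child.label = mid, child.label[n:]
    mid.children[child.label[0]] = child
    return mid


def mccreight_rows(S):
    """Return one row (i, head_i, tail_i, x, u, w, v) for each step i >= 2."""
    root = Node(None, "")
    root.link = root
    root.children[S[0]] = Node(root, S)       # step 1: T1 holds the single leaf suf_1
    head = root                               # locus of head_{i-1}
    rows = []
    for i in range(2, len(S) + 1):
        prev = S[i - 2:i - 2 + head.depth]    # head_{i-1} as a string
        # Substep A: write head_{i-1} = x u w and choose the node c to continue from.
        if head is root:
            x = u = w = ""
            c = root
        elif head.link is not None:           # the locus of head_{i-1} already has a link
            x, u, w, c = prev[0], prev[1:], "", head.link
        else:                                 # head was created in step i-1: use its parent
            p = head.parent
            x, u, w = prev[0], prev[1:p.depth], prev[max(p.depth, 1):]
            c = p.link
        # Substep B: rescan w from c, skipping whole edges without comparing characters.
        d, rest = c, w
        while rest:
            child = d.children[rest[0]]
            if len(child.label) <= len(rest):
                d, rest = child, rest[len(child.label):]
            else:
                d, rest = split(child, len(rest)), ""
        if head.link is None:
            head.link = d
        # Substep C: scan v from d, comparing characters until suf_i leaves the tree.
        node, k = d, i - 1 + d.depth
        while k < len(S) and S[k] in node.children:
            child = node.children[S[k]]
            n = 0
            while n < len(child.label) and k + n < len(S) and child.label[n] == S[k + n]:
                n += 1
            k += n
            if n < len(child.label):
                node = split(child, n)
                break
            node = child
        v, tail = S[i - 1 + d.depth:k], S[k:]
        node.children[tail[0]] = Node(node, tail)   # new leaf for suf_i
        head = node
        rows.append((i, S[i - 1:k], tail, x, u, w, v))
    return rows
```

For example, mccreight_rows("ababc")[1] is (3, "ab", "c", "", "", "", "ab"), which corresponds to Step 3 of the example: nothing is rescanned and v = ab is found by scanning from the root.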
S = aabaaac
In the table, ε denotes the empty string, head_{i-1} = x u w and head_i = u w v.

Step i   suf_i     head_i   tail_i    x   u   w   v
Step 1   aabaaac   ε        aabaaac
Step 2   abaaac    a        baaac     ε   ε   ε   a
Step 3   baaac     ε        baaac     a   ε   ε   ε
Step 4   aaac      aa       ac        ε   ε   ε   aa
Step 5   aac       aa       c         a   ε   a   a
Step 6   ac        a        c         a   a   ε   ε
Step 7   c         ε        c         a   ε   ε   ε

Trees after each step, written as the edge labels leaving the root, with the edges below an internal node in parentheses:
T1: aabaaac
T2: a(abaaac, baaac)
T3: a(abaaac, baaac), baaac
T4: a(a(baaac, ac), baaac), baaac
T5: a(a(baaac, ac, c), baaac), baaac
T6: a(a(baaac, ac, c), baaac, c), baaac
T7: a(a(baaac, ac, c), baaac, c), baaac, c
