A Space-Economical Suffix Tree Construction Algorithm Edward M. McCreight (1976) Definitions Suffix trees Definition (Σ+-tree) . Given some alphabet Σ, a Σ+-tree T is a rooted tree with 1. (T1) The labels of the edges are from Σ+ ; 2. (T3) Every sibling edges begin in a different character. Definition (locus) . Denote by path(k) the concatenation of the edge labels on the path from the root of T to the node k. The locus of a string u in a Σ+-tree is the node k such that path(k)=u, we denote k by u. proposition (uniqueness of path labels) . Path labels are unique, that is, there is no two nodes km such that path(k)=path(m). Definition (words represented in a tree) . A string w occurs in T iff T contains the node wu (i.e., the locus of wu) for some u. By words(T) we denote the set of strings occurring in T. Definition (Suffix Tree). A suffix tree for a string S over Σ is a Σ+-tree such that 1. words(T)={w | w is a substring of S} 2. (T2) Every internal node has at least 2 sons. Proposition. Let T be a suffix tree for S, then 1. Every (full) path in T represents a suffix of S. 2. Every suffix of S represented by a unique path in T. Proposition (uniqueness of suffix trees). Given a string S, its suffix tree is unique up to order among siblings. Definition (Extended Locus). The Extended Locus of a string u is the locus of the shortest extension of u, uw (w is possibly empty), such that uw is a node in T. Definition (Contracted Locus). The Contracted Locus of a string u is the locus of the longest prefix of u, x (x is possibly empty), such that x is a node in T. 1 Strings and Suffixes Let S be our main string Definition (Sufi) . Sufi is the suffix of S beginning at the ith position (position are counted from 1, i.e., Suf1 = S). Definition (headi) . headi is the longest prefix of sufi that is also a prefix of sufj for some j<i. Hence, headi is a prefix of sufi that occurs also before the ith position. Definition (taili) . taili is defined such that sufi = headi taili . McCreight’s Algorithm Lemma 1: If headi-1 = xu for some character x and some string u (possibly empty), then u is a prefix of headi . Proof. headi-1=xu, hence, there is a j<i s.t. xu is a prefix of both sufj-1 and sufi-1. 1.xu is a prefix of sufj-1 therefore u is a prefix of sufj . 2.xu is a prefix of sufi-1 therefore u is a prefix of sufi . By (1), (2): there is some j<i such that u is a prefix of both sufj and sufi . Hence, by definition of head: u is a prefix of headi. Definition (resi) . resi = wv{taili} in step i . Hence, resi is the suffix of S rescanned and scanned during step i. Definition (inti) . inti = number of intermediate nodes (f ) encountered while rescanning (in substep B) during step i. 2 Running Examples of McCreight’s algorithm contracted locus of headi-1 in Ti-2 unless the locus of headi-1 in Ti-2 is the root S=ababc headi Ti-1 T0 headi taili headi-1 u w Step i sufi Step 1 ababc ababc Step 2 babc babc Step 3 abc ab c ab Step 4 bc b c a b Step 5 c c b c x v root T1 ababc T2 babc ababc T3 ab babc abc c T4 b ab abc T5 c c abc b c ab abc c c abc 3 contracted locus of headi-1 in Ti-2 unless the locus of headi-1 in Ti-2 is the root S=aabaaac headi Ti-1 T0 headi taili headi-1 u w Step i sufi Step 1 aabaaac aabaaac Step 2 abaaac a baaac a Step 3 baaac baaac a Step 4 aaac aa ac aa Step 5 aac aa c a a a Step 6 ac a c a Step 7 c c a x v root T1 aabaaac T2 a abaaac baaac a T3 baaac baaac abaaac a T4 a baaac baaac ac baaac a T5 a baaac baaac ac c baaac c baaac a T6 a c baaac baaac ac T7 a c 4 a c baaac c baaac 5