Let S be a string of length n. For every i, 1 ≤ i ≤ n, S_i denotes the i-th suffix S[i..n] of S.

The suffix array SA of the string S is an array of integers in the range 1 to n specifying the lexicographic ordering of the n suffixes of the string S, that is, it satisfies S_{SA[1]} < S_{SA[2]} < ··· < S_{SA[n]}.

The inverse suffix array ISA is an array of size n such that for any q with 1 ≤ q ≤ n the equality ISA[SA[q]] = q holds.

Example: S = ctatatat$

    i   SA   ISA   S_{SA[i]}
    1    9    5    $
    2    7    9    at$
    3    5    4    atat$
    4    3    8    atatat$
    5    1    3    ctatatat$
    6    8    7    t$
    7    6    2    tat$
    8    4    6    tatat$
    9    2    1    tatatat$

The Burrows-Wheeler transform transforms a string S in three steps:
1. Form a conceptual matrix M′ whose rows are the cyclic shifts of the string S.
2. Compute the matrix M by sorting the rows of M′ lexicographically.
3. Output the last column L of M.

Example: S = ctatatat$ (F is the first column of M, L is the last column)

    M′ (cyclic shifts)    M (sorted)    F   L
    ctatatat$             $ctatatat     $   t
    tatatat$c             at$ctatat     a   t
    atatat$ct             atat$ctat     a   t
    tatat$cta             atatat$ct     a   t
    atat$ctat             ctatatat$     c   $
    tat$ctata             t$ctatata     t   a
    at$ctatat             tat$ctata     t   a
    t$ctatata             tatat$cta     t   a
    $ctatatat             tatatat$c     t   c

The output is L = tttt$aaac.

Definition: For a string S of length n having the sentinel character at the end (and nowhere else), the string BWT[1..n] is defined by BWT[i] = $ if SA[i] = 1 and BWT[i] = S[SA[i] − 1] otherwise.

    i   SA   ISA   BWT   S_{SA[i]}
    1    9    5     t    $
    2    7    9     t    at$
    3    5    4     t    atat$
    4    3    8     t    atatat$
    5    1    3     $    ctatatat$
    6    8    7     a    t$
    7    6    2     a    tat$
    8    4    6     a    tatat$
    9    2    1     c    tatatat$

Computing BWT from SA and the string S

    for i ← 1 to n do
        if SA[i] = 1 then BWT[i] ← $
        else BWT[i] ← S[SA[i] − 1]

[Figure: truncating every row of M after the sentinel $ yields the sorted suffixes of S. Observe that L = BWT.]

Definition: The function LF : {1, ..., n} → {1, ..., n} is defined as follows. If L[i] = c is the k-th occurrence of character c in L = BWT, then LF(i) = j is the index so that F[j] is the k-th occurrence of c in F.

The function LF is called last-to-first mapping because it maps the last column L to the first column F.

    i       1  2  3  4  5  6  7  8  9
    L[i]    t  t  t  t  $  a  a  a  c
    LF(i)   6  7  8  9  1  2  3  4  5
    F[i]    $  a  a  a  c  t  t  t  t

Computing the C-array of string S

The characters of Σ are identified with the numbers 1 to σ, and C[c] is the number of characters in S that are strictly smaller than c.

    cnt[1..σ] ← [0..0]    /* initialize cnt */
    C[1..σ] ← [0..0]      /* initialize C */
    for i ← 1 to n do
        cnt[S[i]] ← cnt[S[i]] + 1
    for j ← 2 to σ do
        C[j] ← cnt[j − 1] + C[j − 1]

Computing LF from BWT

    for all c ∈ Σ do
        count[c] ← C[c]
    for i ← 1 to n do
        c ← BWT[i]
        count[c] ← count[c] + 1
        LF[i] ← count[c]

Lemma: The first row of the matrix M contains the suffix S_n = $. If row i, 2 ≤ i ≤ n, of the matrix M contains the suffix S_j, then row LF(i) of M contains the suffix S_{j−1}.

Computing the string S from BWT and LF

    S[n] ← $
    j ← 1
    for i ← n − 1 downto 1 do
        S[i] ← BWT[j]
        j ← LF(j)

Theorem: If L = BWT is the output of the Burrows-Wheeler transform applied to the string S and LF is the corresponding last-to-first mapping, then the previous algorithm computes S.

[Figure: compression pipeline. String S → Burrows-Wheeler Transform (BWT) → Move-to-Front Coding (MTF) → Huffman Compression → Code S_c]

Move-to-front coding of a string L ∈ Σ* of length n

    Initialize a list containing the characters from Σ in increasing order.
    for i ← 1 to n do
        R[i] ← number of characters preceding character L[i] in list
        move character L[i] to the front of list

Move-to-front decoding of R

    Initialize a list containing the characters from Σ in increasing order.
    for i ← 1 to n do
        L[i] ← character at position R[i] + 1 in list (numbering elements from 1)
        move character L[i] to the front of list

Construction of the suffix array of a string S

The induced sorting algorithm by Nong et al.:
Phase 0: Compute the type array T by a right-to-left scan of S.
Phase I: Compute all LMS-positions in S and sort the corresponding suffixes in ascending lexicographic order (we will elaborate on this phase later).
Phase II: Using the output of phase I, induce the order of all suffixes.

The suffixes of S are classified into two types: Suffix S_i is S-type if S_i < S_{i+1} and it is L-type if S_i > S_{i+1}. The last suffix, the sentinel $, is S-type.

We use a bit array T[1..n + 1] to store the types of the suffixes: T[i] = 0 means that suffix S_i is L-type and T[i] = 1 means that it is S-type. For better readability, we write T[i] = L instead of T[i] = 0 and T[i] = S instead of T[i] = 1.

Lemma: All suffixes can be classified by a right-to-left scan of S. S_i is S-type if (a) S[i] < S[i + 1] or (b) S[i] = S[i + 1] and S_{i+1} is S-type. Analogously, S_i is L-type if (a) S[i] > S[i + 1] or (b) S[i] = S[i + 1] and S_{i+1} is L-type.

Proof: If S[i] ≠ S[i + 1], the first characters already decide the comparison of S_i and S_{i+1}. If S[i] = S[i + 1], then S_i and S_{i+1} compare in the same way as S_{i+1} and S_{i+2}, so S_i has the same type as S_{i+1}. Because the type of the sentinel is known, a right-to-left scan determines each type in constant time.

Example: S = imimmmisismisissiipi$ (LMS-positions are marked with *)

    i      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
    S[i]   i  m  i  m  m  m  i  s  i  s  m  i  s  i  s  s  i  i  p  i  $
    type   S  L  S  L  L  L  S  L  S  L  L  S  L  S  L  L  S  S  L  L  S
    LMS          *           *     *        *     *        *           *

Lemma: An S-type suffix is lexicographically greater than any L-type suffix starting with the same first character.

Corollary: In the suffix array of S, among all suffixes that begin with the same character, the L-type suffixes appear before the S-type suffixes.

The c-bucket is the interval [i..j] with i = C[c] + 1 and j = C[c + 1]. Every c-bucket [i..j] can further be divided into the interval [i..k] containing all L-type suffixes starting with character c and the interval [k + 1..j] containing all S-type suffixes starting with character c. The interval [i..k] is called the L-type region and [k + 1..j] is called the S-type region of the c-bucket (k = i − 1 or k = j is possible, i.e., the L-type region or the S-type region of the bucket may be empty).

A position i, 1 < i ≤ n + 1, in S with T[i − 1] = L and T[i] = S (i.e., suffix S_{i−1} is L-type and suffix S_i is S-type) is called LMS-position (leftmost S-type position).
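The type lemma and the definition of LMS-positions can be tried out with a few lines of Python. This is a minimal sketch and not part of the lecture material: the function names `classify_types` and `lms_positions` are made up for illustration, and the code assumes that the input ends with the sentinel `$` and that `$` compares smaller than every other character (true for ASCII letters).

```python
def classify_types(s):
    """Return the type string (one of 'S'/'L' per position) of s.

    Assumption: s ends with a sentinel that is smaller than all other
    characters. Types are found by a single right-to-left scan.
    """
    n = len(s)
    types = ["S"] * n                      # the sentinel suffix is S-type
    for i in range(n - 2, -1, -1):
        if s[i] > s[i + 1] or (s[i] == s[i + 1] and types[i + 1] == "L"):
            types[i] = "L"
    return "".join(types)


def lms_positions(types):
    """Return the 1-based LMS-positions: T[i-1] = L and T[i] = S."""
    return [i + 1 for i in range(1, len(types))
            if types[i - 1] == "L" and types[i] == "S"]


types = classify_types("imimmmisismisissiipi$")
print(types)                  # SLSLLLSLSLLSLSLLSSLLS
print(lms_positions(types))   # [3, 7, 9, 12, 14, 17, 21]
```

Positions are reported 1-based so that they can be compared directly with the example table for S = imimmmisismisissiipi$.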
Example: in S = imimmmisismisissiipi$ the LMS-positions are 3, 7, 9, 12, 14, 17, and 21.

Phase II:
1. Scan the sorted sequence of LMS-positions from right to left. For each position encountered in the scan, move it to the current end of its bucket in A, and shift the current end of the bucket by one position to the left.
2. Scan the array A from left to right. For each entry A[i] encountered in the scan, if S_{A[i]−1} is an L-type suffix, move its start position A[i] − 1 in S to the current front of its bucket in A, and shift the current front of the bucket by one position to the right.
3. Scan the array A from right to left. For each entry A[i] encountered in the scan, if S_{A[i]−1} is an S-type suffix, move its start position A[i] − 1 in S to the current end of its bucket in A, and shift the current end of the bucket by one position to the left.

Example: Phase II for S = imimmmisismisissiipi$, given the sorted LMS-positions 21, 17, 3, 7, 12, 9, 14 (buckets in the order $, i, m, p, s; "-" marks an empty entry)

    index          1 |  2  3  4  5  6  7  8  9 10 | 11 12 13 14 15 | 16 | 17 18 19 20 21
    bucket         $ |  i                         |  m             |  p |  s
    after step 1  21 |  -  -  - 17  3  7 12  9 14 |  -  -  -  -  - |  - |  -  -  -  -  -
    after step 2  21 | 20  -  - 17  3  7 12  9 14 |  2  6 11  5  4 | 19 | 16  8 13 10 15
    after step 3  21 | 20 17  1  3 18  7 12  9 14 |  2  6 11  5  4 | 19 | 16  8 13 10 15

Phase I:
1. Scan the array T from left to right. For each LMS-position j encountered in the scan:
   • move j to the current end of its bucket in A, and shift the current end of the bucket by one position to the left;
   • move j to the current front of array P, and shift the current front of P by one position to the right.
2. Scan the array A from left to right. For each entry A[i] encountered in the scan, if S_{A[i]−1} is an L-type suffix, move its start position A[i] − 1 in S to the current front of its bucket in A, and shift the current front of the bucket by one position to the right.
3. Scan the array A from right to left. For each entry A[i] encountered in the scan, if S_{A[i]−1} is an S-type suffix, move its start position A[i] − 1 in S to the current end of its bucket in A, and shift the current end of the bucket by one position to the left.
4. Initialize an array LN[1..n + 1] that will contain the new lexicographic names of the LMS-substrings (an LMS-substring is the substring of S between two consecutive LMS-positions, both included; the sentinel alone is an LMS-substring), and initialize the counter i for new lexicographic names to 1. Because the sentinel is the lexicographically smallest LMS-substring, it gets the smallest new lexicographic name, i.e., LN[n + 1] = 1. Furthermore, initialize prev = n + 1. Now, starting at index k = 2, scan the array A from left to right. Whenever an entry A[k] = j is encountered so that j is the start position of an LMS-substring, do the following:
   • Compare the LMS-substring starting at position prev in S with the LMS-substring starting at position j in S (character by character).
   • If they are different, increment i by one.
   • Set LN[j] = i and prev = j.
5. Scan the array P[1..m] from left to right and for each k with 1 ≤ k ≤ m set S̄[k] = LN[P[k]]. The string S̄ of length m is the reduced string.
6. If i = m, then the LMS-substrings are pairwise different. Compute the suffix array SA(S̄) of the string S̄ directly by SA(S̄)[S̄[k]] = k.
7. Otherwise, recursively compute the suffix array SA(S̄) of the string S̄.
8. The ascending lexicographic order of all the suffixes of S that start at an LMS-position is P[SA(S̄)[1]], P[SA(S̄)[2]], ..., P[SA(S̄)[m]].
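The steps of Phase I and Phase II can be put together in a compact Python sketch. It is an illustration under stated assumptions, not the reference implementation by Nong et al.: the names `sais`, `induce`, and `suffix_array` are hypothetical, the code works 0-based internally, it compares LMS-substrings by list slicing, and it expects the input to end with a unique smallest sentinel. The helper `induce` performs steps 1 to 3, which are shared by both phases.

```python
def sais(s, sigma):
    """0-based suffix array of the integer list s over range(sigma).

    Assumption: s[-1] is the unique smallest symbol (the sentinel).
    """
    n = len(s)
    if n == 1:
        return [0]
    # Phase 0: type array by a right-to-left scan (True means S-type).
    is_s = [True] * n
    for i in range(n - 2, -1, -1):
        is_s[i] = s[i] < s[i + 1] or (s[i] == s[i + 1] and is_s[i + 1])
    lms = [i for i in range(1, n) if is_s[i] and not is_s[i - 1]]
    lms_set = set(lms)

    # The bucket of symbol c is A[start[c] : end[c]].
    cnt = [0] * sigma
    for c in s:
        cnt[c] += 1
    start = [0] * sigma
    for c in range(1, sigma):
        start[c] = start[c - 1] + cnt[c - 1]
    end = [start[c] + cnt[c] for c in range(sigma)]

    def induce(seq):
        """Steps 1-3: place seq at the bucket ends, then induce L and S."""
        A = [-1] * n
        tail = end[:]
        for j in reversed(seq):                  # step 1
            tail[s[j]] -= 1
            A[tail[s[j]]] = j
        head = start[:]
        for i in range(n):                       # step 2: L-type suffixes
            j = A[i] - 1
            if A[i] > 0 and not is_s[j]:
                A[head[s[j]]] = j
                head[s[j]] += 1
        tail = end[:]
        for i in range(n - 1, -1, -1):           # step 3: S-type suffixes
            j = A[i] - 1
            if A[i] > 0 and is_s[j]:
                tail[s[j]] -= 1
                A[tail[s[j]]] = j
        return A

    # Phase I: sort the LMS-substrings, name them, build the reduced string.
    A = induce(lms[::-1])
    nxt = dict(zip(lms, lms[1:]))                # end of each LMS-substring
    nxt[lms[-1]] = lms[-1]                       # the sentinel stands alone
    name_of, name, prev = {}, -1, None
    for j in (p for p in A if p in lms_set):
        if prev is None or s[prev:nxt[prev] + 1] != s[j:nxt[j] + 1]:
            name += 1
        name_of[j] = name
        prev = j
    reduced = [name_of[j] for j in lms]
    if name + 1 == len(lms):                     # all names are different
        sa_red = [0] * len(lms)
        for k, r in enumerate(reduced):
            sa_red[r] = k
    else:
        sa_red = sais(reduced, name + 1)
    # Phase II: induce the order of all suffixes from the sorted LMS suffixes.
    return induce([lms[k] for k in sa_red])


def suffix_array(text):
    """1-based suffix array of text (must end with a unique smallest char)."""
    alphabet = sorted(set(text))
    assert text and text[-1] == alphabet[0] and text.count(text[-1]) == 1
    rank = {c: r for r, c in enumerate(alphabet)}
    return [p + 1 for p in sais([rank[c] for c in text], len(alphabet))]


print(suffix_array("imimmmisismisissiipi$"))
# [21, 20, 17, 1, 3, 18, 7, 12, 9, 14, 2, 6, 11, 5, 4, 19, 16, 8, 13, 10, 15]
```

Names start at 0 instead of 1 here, which only shifts the reduced string and does not change its suffix order. Comparing slices keeps the sketch short; an implementation tuned for speed would compare the LMS-substrings character by character in place.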
[Figure: Phase I, including the recursive call, for the example string S = imimmmisismisissiipi$ with P = [3, 7, 9, 12, 14, 17, 21]. The reduced string S̄ = 3 4 5 4 6 2 1 has the types S S L S L L S and the LMS-positions 4 and 7.]

Worst-case time-complexity: It is not difficult to see that each step in phases I and II takes at most O(n) time. By definition, position 1 is not an LMS-position and there must be at least one position in between two consecutive LMS-positions. Hence |S̄| = m ≤ ⌊(n + 1)/2⌋. It follows from the master theorem that the whole induced sorting algorithm has a worst-case time complexity of O(n).

The master theorem says that a recurrence of the form

\[ T(n) = \sum_{i=1}^{m} T(\alpha_i \cdot n) + \Theta(n^k) \]

where 0 ≤ α_i ≤ 1, m ≥ 1, k ≥ 0, has the solution

\[ T(n) = \begin{cases} \Theta(n^k), & \text{if } \sum_{i=1}^{m} \alpha_i^k < 1 \\ \Theta(n^k \log n), & \text{if } \sum_{i=1}^{m} \alpha_i^k = 1 \\ \Theta(n^c), & \text{if } \sum_{i=1}^{m} \alpha_i^k > 1 \end{cases} \]

where c is the solution of the equation \( \sum_{i=1}^{m} \alpha_i^c = 1 \).

For induced sorting there is a single recursive term with α_1 = 1/2 and k = 1, so the sum is 1/2 < 1 and the first case gives T(n) = Θ(n).

Example: result of Phase I, steps 1 to 5, for S = imimmmisismisissiipi$

    i      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
    A[i]  21 20 17  1  3 18 12  7  9 14 11  6  2  5  4 19 16 13  8 10 15

    sorted LMS-substrings (start positions)   21 17  3 12  7  9 14
    lexicographic names                        1  2  3  4  4  5  6

    k       1  2  3  4  5  6  7
    P[k]    3  7  9 12 14 17 21
    S̄[k]    3  4  5  4  6  2  1

The LMS-substrings starting at positions 12 and 7 are equal, so only 6 names are used for m = 7 LMS-substrings and the suffix array of S̄ is computed recursively.

Definition: Let S be a string of length n on an alphabet Σ. The Lempel-Ziv factorization (or LZ-factorization for short) of S is a factorization S = s_1 s_2 ··· s_m so that each factor s_j, 1 ≤ j ≤ m, is either
(a) a letter c ∈ Σ that does not occur in s_1 s_2 ··· s_{j−1}, or
(b) the longest substring of S that occurs at least twice in s_1 s_2 ··· s_j.

The Lempel-Ziv factorization can be represented by a sequence of pairs (PrevOcc_1, LPS_1), ..., (PrevOcc_m, LPS_m), where in case (a) PrevOcc_j = c and LPS_j = 0, and in case (b) PrevOcc_j is a position in s_1 s_2 ··· s_{j−1} at which an occurrence of s_j starts and LPS_j = |s_j|.

Reconstruction of the string S, given its LZ-factorization (PrevOcc_1, LPS_1), ..., (PrevOcc_m, LPS_m)

    i ← 1
    for j ← 1 to m do
        if LPS_j = 0 then
            S[i] ← PrevOcc_j
            i ← i + 1
        else
            for k ← 0 to LPS_j − 1 do
                S[i] ← S[PrevOcc_j + k]
                i ← i + 1

Definition: For a string S of length n, the longest previous substring (LPS) array of size n is defined by LPS[1] = 0 and

    LPS[k] = max{ℓ : 0 ≤ ℓ ≤ n − k + 1; S[k..k + ℓ − 1] is a substring of S[1..k + ℓ − 2]}

for every k with 2 ≤ k ≤ n. That is, LPS[k] is the length of the longest prefix of S_k that has another occurrence in S starting strictly before position k.

The LPS and PrevOcc arrays of the string S = acaaacatat$

    i            1  2  3  4  5  6  7  8  9 10
    S[i]         a  c  a  a  a  c  a  t  a  t
    LPS[i]       0  0  1  2  3  2  1  0  2  1
    PrevOcc[i]   a  c  1  3  1  2  5  t  7  8

Computation of the LZ-factorization based on LPS and PrevOcc

    i ← 1
    while i < n do
        if LPS[i] = 0 then PrevOcc[i] ← S[i]
        output (PrevOcc[i], LPS[i])
        i ← i + max{1, LPS[i]}

Lemma: For every k with 2 ≤ k ≤ n, the following equality holds true:

    LPS[k] = max{|lcp(S_p, S_k)| : 1 ≤ p < k}

where lcp(S_p, S_k) denotes the longest common prefix between S_p and S_k.
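The lemma directly suggests a slow but simple way to compute LPS and PrevOcc, which is enough to try out the factorization and the reconstruction on small inputs. The following Python sketch rests on a few assumptions that are not in the slides: the function names are made up, the string is given without the sentinel, every position 1 to len(s) is factorized, and ties for PrevOcc are broken in favor of the leftmost position (so PrevOcc may differ from other valid choices).

```python
def lcp_len(u, v):
    """Length of the longest common prefix of the strings u and v."""
    k = 0
    while k < len(u) and k < len(v) and u[k] == v[k]:
        k += 1
    return k


def naive_lps(s):
    """LPS and PrevOcc as 1-based lists (index 0 is unused).

    Uses LPS[k] = max |lcp(S_p, S_k)| over 1 <= p < k; far from linear time.
    """
    n = len(s)
    lps = [0] * (n + 1)
    prev_occ = [None] * (n + 1)
    for k in range(2, n + 1):
        for p in range(1, k):
            length = lcp_len(s[p - 1:], s[k - 1:])
            if length > lps[k]:                  # keep the leftmost maximum
                lps[k], prev_occ[k] = length, p
    return lps, prev_occ


def lz_factorize(s):
    """LZ-factorization as a list of pairs (PrevOcc_j, LPS_j)."""
    lps, prev_occ = naive_lps(s)
    factors, i = [], 1
    while i <= len(s):
        if lps[i] == 0:
            factors.append((s[i - 1], 0))        # case (a): a new letter
        else:
            factors.append((prev_occ[i], lps[i]))
        i += max(1, lps[i])
    return factors


def lz_reconstruct(factors):
    """Rebuild the string; copying one character at a time handles overlaps."""
    out = []
    for prev, length in factors:
        if length == 0:
            out.append(prev)
        else:
            for k in range(length):
                out.append(out[prev - 1 + k])
    return "".join(out)


factors = lz_factorize("acaaacatat")
print(factors)
# [('a', 0), ('c', 0), (1, 1), (3, 2), (2, 2), ('t', 0), (7, 2)]
print(lz_reconstruct(factors))   # acaaacatat
```

The factor (3, 2) starting at position 4 overlaps its own previous occurrence, which is why the reconstruction copies character by character instead of slicing.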
With k = SA[i] and p = SA[j], the lemma can be rephrased as:

    LPS[SA[i]] = max{|lcp(S_{SA[j]}, S_{SA[i]})| : SA[j] < SA[i]}    (1)

Definition: For any index i with 1 ≤ i ≤ n, we define

    PSV_SA[i] = max{j : 0 ≤ j < i and SA[j] < SA[i]}

and

    NSV_SA[i] = min{j : i < j ≤ n + 1 and SA[j] < SA[i]}

where SA[0] = SA[n + 1] = 0.

Definition: The LCP-array is an array of size n + 1 with boundary elements LCP[1] = 0 and LCP[n + 1] = 0, and for all i with 2 ≤ i ≤ n we have

    LCP[i] = |lcp(S_{SA[i−1]}, S_{SA[i]})|

Example: S = acaaacatat

     i   SA   LCP   S_{SA[i]}     PSV[i]   NSV[i]   LPS[SA[i]]   PrevOcc[SA[i]]
     0    0
     1    3    0    aaacatat         0        3         1             1
     2    4    2    aacatat          1        3         2             3
     3    1    1    acaaacatat       0       11         0             ⊥
     4    5    3    acatat           3        7         3             1
     5    9    1    at               4        6         2             7
     6    7    2    atat             4        7         1             5
     7    2    0    caaacatat        3       11         0             ⊥
     8    6    2    catat            7       11         2             2
     9   10    0    t                8       10         1             8
    10    8    1    tat              8       11         0             ⊥
    11    0    0    ε

Lemma: For every i with 1 ≤ i ≤ n, the following equality holds

    LPS[SA[i]] = max{|lcp(S_{SA[PSV[i]]}, S_{SA[i]})|, |lcp(S_{SA[i]}, S_{SA[NSV[i]]})|}

where S_0 is the empty string ε.

Consider the (conceptual) graph defined as follows: The graph has n + 2 nodes. For every i with 1 ≤ i ≤ n there is a node labeled with (i, SA[i]); to deal with boundary cases, the graph also contains the nodes labeled with (0, SA[0]) = (0, 0) and (n + 1, SA[n + 1]) = (n + 1, 0). Moreover, consecutive nodes (i, SA[i]) and (i + 1, SA[i + 1]) are connected by an edge labeled with the value LCP[i + 1]. It is instructive to view each node (i, SA[i]) as a point in the plane R².

[Figure: the graph for the example S = acaaacatat, with the nodes (i, SA[i]) drawn as points in the plane and the edges labeled with LCP-values.]

In the graph, |lcp(S_{SA[PSV[i]]}, S_{SA[i]})| is the minimum of the edge labels on the path from node (PSV[i], SA[PSV[i]]) to (i, SA[i]). Analogously, |lcp(S_{SA[i]}, S_{SA[NSV[i]]})| is the minimum of the edge labels on the path from (i, SA[i]) to node (NSV[i], SA[NSV[i]]).

Now consider a peak in the graph, i.e., a node (i, SA[i]) so that SA[i − 1] < SA[i] and SA[i + 1] < SA[i]. In this case, (PSV[i], SA[PSV[i]]) = (i − 1, SA[i − 1]) and (NSV[i], SA[NSV[i]]) = (i + 1, SA[i + 1]). Thus, we have LPS[SA[i]] = max{LCP[i], LCP[i + 1]}. We delete node (i, SA[i]) and its edges from the graph, add an edge labeled with min(LCP[i], LCP[i + 1]) between nodes (i − 1, SA[i − 1]) and (i + 1, SA[i + 1]), and continue in the same fashion with the transformed graph.

[Figures: removal of the first, second, and third peak in the graph, each followed by the transformed graph.]

Computing LPS and PrevOcc from SA and LCP (the stack holds pairs ⟨sa, lcp⟩, and top denotes the topmost pair)

    SA[n + 1] ← 0
    LCP[n + 1] ← 0
    push(⟨0, ⊥⟩)
    for i ← 1 to n + 1 do
        lcp ← LCP[i]
        while SA[i] < top.sa do
            last ← pop
            LPS[last.sa] ← max(last.lcp, lcp)
            lcp ← min(last.lcp, lcp)
            if LPS[last.sa] = 0 then PrevOcc[last.sa] ← ⊥
            else if last.lcp > lcp then PrevOcc[last.sa] ← top.sa
            else PrevOcc[last.sa] ← SA[i]
        push(⟨SA[i], lcp⟩)

Like the suffix array, the LCP-array can be computed in linear time. However, it is not necessary to precompute the LCP-array: the preceding algorithm can be modified so that it computes LPS and PrevOcc using the suffix array only.
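The peak elimination can be checked against the example table with a short Python sketch. The assumptions are labeled here and in the code: the function names are hypothetical, SA and LCP are built naively by sorting (only to keep the example self-contained, not in linear time), `None` stands for ⊥, and the input string is given without the sentinel.

```python
def naive_sa_lcp(s):
    """1-based SA and LCP with boundary entries SA[0] = SA[n+1] = 0.

    Built by plain sorting for illustration only (not linear time).
    """
    n = len(s)
    sa = [0] + sorted(range(1, n + 1), key=lambda p: s[p - 1:]) + [0]
    lcp = [0] * (n + 2)                          # LCP[1] = LCP[n+1] = 0
    for i in range(2, n + 1):
        u, v = s[sa[i - 1] - 1:], s[sa[i] - 1:]
        k = 0
        while k < len(u) and k < len(v) and u[k] == v[k]:
            k += 1
        lcp[i] = k
    return sa, lcp


def lps_prevocc(sa, lcp):
    """Peak elimination with a stack of (sa, lcp) pairs."""
    n = len(sa) - 2
    lps = [0] * (n + 1)
    prev_occ = [None] * (n + 1)                  # None plays the role of ⊥
    stack = [(0, 0)]                             # bottom element, never popped
    for i in range(1, n + 2):
        cur = lcp[i]
        while sa[i] < stack[-1][0]:              # the top of the stack is a peak
            last_sa, last_lcp = stack.pop()
            lps[last_sa] = max(last_lcp, cur)
            cur = min(last_lcp, cur)             # label of the merged edge
            if lps[last_sa] == 0:
                prev_occ[last_sa] = None
            elif last_lcp > cur:
                prev_occ[last_sa] = stack[-1][0]
            else:
                prev_occ[last_sa] = sa[i]
        stack.append((sa[i], cur))
    return lps[1:], prev_occ[1:]


sa, lcp = naive_sa_lcp("acaaacatat")
lps, prev_occ = lps_prevocc(sa, lcp)
print(lps)       # [0, 0, 1, 2, 3, 2, 1, 0, 2, 1]
print(prev_occ)  # [None, None, 1, 3, 1, 2, 5, None, 7, 8]
```

Every suffix array entry is pushed and popped at most once, so the loop itself runs in O(n) time once SA and LCP are available.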