Strings II
Seminar 10
TDDD95: APS, Seminar in Algorithmic Problem Solving
April 15, 2016
Tommy Färnqvist
Department of Computer and Information Science
Linköping University

Outline
1 Suffix Tree and Array

What are suffix trees?
• A suffix tree stores all substrings of a given string in linear space, in a way that makes searching for a substring efficient:
  • time proportional to the length of the query string.
• A suffix tree can be built in linear time.

As a picture
[Figure: the suffix tree for GAAGAT]
• Each edge is labelled with a substring of the original string S.
• A node's label is the concatenation of all edge labels on the path from the root to that node.
• The path from the root to any leaf spells a suffix of S.
• Each internal node has at least 2 children.
• The substrings labelling the edges out of a node must all begin with different letters.
• For every suffix to end at a leaf, the last letter of S must be unique (it occurs nowhere else in S). Otherwise, a suffix may end at an internal node!

Application I. Search for a substring
• Any substring of S is a prefix of some suffix of S.
• Example of using this: is the string x a substring of S?
• Start at the root and follow the path labelled by the characters of x. If you can reach the end of x, then yes, it is.

Application II. Longest Common Substring
• What is the longest substring common to both S1 and S2?
• Build a suffix tree for S = S1 # S2 $, where # and $ are unique characters.
• Every suffix that starts in S1 ends with an edge whose label includes # S2.
• Color all nodes v of the tree:
  • red if v's label is a substring of S1 only
  • blue if it is a substring of S2 only
  • purple if it is a substring of both
• We want the deepest purple node, that is, the purple node with the longest label.
• Linear time, linear space.

Quick Note on Suffix Array
• A suffix tree is not a compact data structure.
  • It needs a lot of pointers.
• A suffix array stores positions in the string. Each position is an integer, so this is an integer array of length n.
• Each position corresponds to the suffix starting at that position.
• The suffix array is sorted according to the lexicographic order of the corresponding suffixes.

As a picture
[Figure: the suffix array for AGAAGAT]

Suffix Array
• Use binary search to find a substring of length m.
  • O(m log n) if implemented straightforwardly
  • O(m + log n) if used with an auxiliary data structure called the longest common prefix (LCP) array
• Constructing the suffix array (together with the LCP array) can be done in linear time.
• Together, the SA and LCP arrays make it possible to simulate traversing the corresponding suffix tree in several different ways.

Next time
Combinatorial Search...
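
As a supplement to the suffix array slides, here is a minimal Python sketch of the ideas above. It is an illustration under stated assumptions, not the linear-time construction the slides refer to: the suffix array is built by plain sorting (roughly O(n^2 log n) in the worst case), the LCP array uses Kasai's algorithm, the search is the straightforward O(m log n) binary search, and the longest common substring is found by scanning adjacent suffix array entries instead of coloring a suffix tree. All function names are hypothetical, and '#' and '$' are assumed not to occur in the input strings.

```python
def build_suffix_array(s):
    """Naive construction: sort start positions by the suffix they begin."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def build_lcp(s, sa):
    """Kasai's algorithm: lcp[r] is the length of the longest common prefix
    of the suffixes at sa[r-1] and sa[r]; lcp[0] is 0."""
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def find_occurrences(s, sa, x):
    """Straightforward O(m log n) binary search for all occurrences of x."""
    m = len(x)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= x
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < x:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > x
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= x:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def longest_common_substring(s1, s2):
    """Longest common substring via SA + LCP on s1 + '#' + s2 + '$'.
    Assumes '#' and '$' do not occur in s1 or s2."""
    s = s1 + "#" + s2 + "$"
    sa = build_suffix_array(s)
    lcp = build_lcp(s, sa)
    best_len, best_pos = 0, 0
    for r in range(1, len(sa)):
        # Only neighbours that start in different input strings count.
        if (sa[r - 1] < len(s1)) != (sa[r] < len(s1)) and lcp[r] > best_len:
            best_len, best_pos = lcp[r], sa[r]
    return s[best_pos:best_pos + best_len]


text = "AGAAGAT"
sa = build_suffix_array(text)
lcp = build_lcp(text, sa)
print(sa)                                # suffix start positions in sorted order
print(lcp)                               # LCP of each suffix with its predecessor
print(find_occurrences(text, sa, "GA"))  # start positions of "GA" in text
print(longest_common_substring("banana", "ananas"))
```

Adjacent entries of the suffix array that start in different input strings play the role of the purple nodes in the suffix tree formulation: the largest LCP value among such pairs is the length of the longest common substring.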