Assignments for lecture “Bioinformatics III”, WS 03/04
Assignment 2, return in room 1.05, Building 17.1, by Nov 18, 2003, 11 am

Your name:
Student ID (Matrikelnummer):
Field of study (Fachrichtung):

General remark: These exercises often ask you to “show” something. This does not require a formal proof; just describe the idea behind the solution.

Please direct questions to: Jörg Niggemann, tel. 302-64176, email: joergn@thoughtblade.com

(1) Draw a suffix tree for the string “mississippi”. Optional: use a shorter string of your choice, and use one of the algorithms of McCreight, Ukkonen or Weiner. Use a different color for each phase. (20 points)

(2) How can you improve the suffix tree so that it takes less space in memory? Hint: You don't have to maintain it as a tree. Look at repetitive branches and think of a DAG (directed acyclic graph) as a result. (10 points)

(3) Construct an infinite family of strings over a fixed alphabet for which the total length of the edge-labels on their suffix trees grows faster than Θ(m), where m is the length of the string. That is, show that linear-time suffix tree algorithms would be impossible if edge-labels were written explicitly on the edges. (15 points)

(4) Consider a generalized suffix tree built for a set of k strings. Additional strings may be added to the set, or entire strings may be deleted from the set. This is the common case when maintaining a generalized suffix tree for biological sequence data. Discuss the problem of maintaining the generalized suffix tree in this dynamic setting. Explain why this problem has a much easier solution than the one in which arbitrary substrings represented in the suffix tree are deleted. (15 points)

(5) a) Show how to count the number of distinct substrings of a string T in O(m) time, where m is the length of T.
b) Show how to enumerate one copy of each distinct substring, all together in time proportional to the total length of all those substrings.
c) Given a substring P of length n, compute the frequency of P in T in O(n) time.
O(n) is just the time needed to walk along P to its end, so the information should be accessible from the end of P in constant time.
Hint for this and the next exercise: O(m) is again the time to construct the suffix tree. Think about how you can accumulate information on the nodes, either while constructing the tree or during one tree traversal after the tree is finished. Recall that tree traversals take linear time, so they don't spoil the O(m) time bound. (20 points)

(6) One way to hunt for “interesting” sequences in a DNA sequence database is to look for substrings in the database that appear much more often than would be predicted by chance alone. This is done today and will become even more attractive when huge amounts of anonymous DNA sequences are available. Assume one has a statistical model that determines how likely it is for any particular substring to occur by chance, and a threshold above which a substring is “interesting”. Show how to efficiently find all interesting substrings in the database. If the database has a total length m, then the method should take time O(m) (= construction of the suffix tree plus traversal) plus time proportional to the number of interesting substrings (= output each one's origin and length). (10 points)

(7) Given a set S of k strings, we want to find every string in S that is a substring of some other string in S. Assuming that the total length of all the strings is n, give an O(n) time algorithm to solve this problem. (10 points)
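
The hint for exercises (5) and (6), accumulating information on the nodes during one traversal, can be illustrated with a small sketch. It is not a model solution: it assumes a naive, uncompressed suffix trie built in O(m^2) time as a stand-in for a real linear-time suffix tree, and the names Node, build_suffix_trie, annotate_leaf_counts and frequency are hypothetical helpers introduced only for this illustration. The point is the single bottom-up pass that stores a leaf count on every node.

```python
class Node:
    """One node of an uncompressed suffix trie (hypothetical helper)."""

    def __init__(self):
        self.children = {}  # character -> Node
        self.count = 0      # filled in later: number of leaves below this node


def build_suffix_trie(text, terminator="$"):
    """Naive O(m^2) construction; assumes the terminator does not occur in text."""
    text = text + terminator  # guarantees that every suffix ends in its own leaf
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
    return root


def annotate_leaf_counts(root):
    """One traversal: store on each node the number of leaves in its subtree."""
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)  # every parent is recorded before its descendants
        stack.extend(node.children.values())
    for node in reversed(order):  # so reversed order handles children first
        if not node.children:
            node.count = 1
        else:
            node.count = sum(child.count for child in node.children.values())


def frequency(root, pattern):
    """Walk along the pattern, then read the stored count in constant time."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count


root = build_suffix_trie("mississippi")
annotate_leaf_counts(root)
print(frequency(root, "ssi"))  # 2
```

Each leaf corresponds to one suffix, so the count stored at the node where a pattern ends equals the number of suffixes that start with that pattern. On a compressed suffix tree the same pass touches only O(m) nodes.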