Algorithms in Computational Biology 236522, Winter 2005-6
Home Assignment No. 2 - Alignments

Publication date: 24.11.05
Due date: 12.12.05 (to Ilan Gronau's mailbox (#117) on the 5th floor).

1. Recall the model presented in class describing the calculation of PAM-r substitution matrices. We introduced the probability measure $p(a,b)$: the probability that a randomly chosen correlated pair of amino acids is {a,b}. Our goal is to estimate these measures using a collection of accepted mutations (pairs of distinct amino acids) and the rate of relative mutability ($r$). We denote by $f$ the total number of accepted mutations in our collection, and by $f_{ab}$ the number of accepted mutations of type {a,b}. In class we introduced the following equation for all pairs of amino acids (a,b):

$$p(a,b) = \frac{r \cdot f_{ab}}{f}$$

a. Why is $f_{ab}$ more complicated to estimate for an identical amino-acid pair (i.e. a=b)?

b. Prove that for a pair of identical amino acids we have:

$$f_{aa} = \frac{q_a \cdot f}{r} - \frac{1}{2} \sum_{b \neq a} f_{ab},$$

where $q_a$ is the frequency of amino acid 'a' in the collection of accepted mutations.

2. Many biological sequences (DNA/protein) contain within themselves homologous subsequences. Consider the following problem of finding inexact repeats within a given sequence. Given a sequence S, we wish to find two distinct subsequences within S which have an optimal alignment between them.

a. Formally define the search space of the problem; in particular, describe in detail the properties of a valid output. Argue why we can't simply use the Smith-Waterman algorithm for local alignment to solve it.

b. Suggest an $O(n^2)$ algorithm for finding such a pair of subsequences when they are allowed to overlap.

c. Suggest an $O(n^3)$ algorithm for finding such a pair of subsequences when they are not allowed to overlap.

In both cases, briefly describe your algorithms. Explain why they are correct, and analyze their time and space complexity.

3. $S = s_1 s_2 \ldots s_n$ is a non-contiguous subsequence of T iff $T = w_0 s_1 w_1 s_2 w_2 \ldots w_{n-1} s_n w_n$ for some sequences $w_0, w_1, \ldots, w_n$ (some of which may be empty). In such a case T is considered a non-contiguous supersequence of S. Suggest an efficient algorithm for finding the shortest common non-contiguous supersequence T of a pair of sequences $S_1$, $S_2$. Explain its time/space complexity and prove its correctness.

4. Recall the generalized DP algorithm for optimal multiple alignment. Let $i = (i_1, \ldots, i_k)$ be a given cell in the k-dimensional matrix used to align k sequences $S_1, \ldots, S_k$. Assume that we know that some alignment of the k sequences has score L. For each pair $1 \le u < v \le k$, let $a(u,v)$ be the score of an optimal pairwise alignment of $S_u$ and $S_v$ which passes through cell $(i_u, i_v)$ (in the 2-dimensional matrix of their pairwise alignment). Prove that no optimal multiple alignment (of the k sequences) passes through cell i when the sum $\sum_{1 \le u < v \le k} a(u,v)$ is strictly worse than L.

5. Recall the 2-approximation algorithm for best SP-score multiple alignment shown in the tutorial (also known as the star algorithm). Show that the choice of center sequence the algorithm makes is not always optimal. Give an example of a set of sequences where the choice of a different sequence as a center (i.e. S1) yields a better multiple alignment under edit distance (i.e. match - 0; indel/mismatch - 1).

Good luck!
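
As an illustration of the definition in question 3, the following Python sketch checks whether one sequence is a non-contiguous subsequence of another, and whether a candidate T is a common non-contiguous supersequence of two sequences. The function names are hypothetical and chosen only for this example. The sketch verifies a given candidate; it does not search for the shortest supersequence.

```python
def is_noncontiguous_subsequence(s: str, t: str) -> bool:
    """Return True if s can be obtained from t by deleting zero or more characters."""
    remaining = iter(t)
    # Each character of s must be found in t after the previous match;
    # `ch in remaining` consumes the iterator up to and including the match.
    return all(ch in remaining for ch in s)


def is_common_supersequence(t: str, s1: str, s2: str) -> bool:
    """Return True if t is a non-contiguous supersequence of both s1 and s2."""
    return (is_noncontiguous_subsequence(s1, t)
            and is_noncontiguous_subsequence(s2, t))
```

For instance, under these definitions "ACGTA" is a common supersequence of "AGT" and "CTA", while "AG" is not a supersequence of "GA" because the order of characters must be preserved.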