
Algorithms in Computational Biology 236522, Winter 2005-6
Home Assignment No. 2 - Alignments
Publication date: 24.11.05
Due date: 12.12.05 (to Ilan Gronau’s mailbox (#117) on the 5th floor).
1. Recall the model presented in class describing the calculation of PAM-r substitution
matrices. We introduced the probability measure p(a,b) – the probability that a randomly
chosen correlated pair of amino-acids is {a,b}. Our goal is to estimate these measures using a
collection of accepted mutations (pairs of distinct amino-acids), and the rate of relative
mutability (r). We denote by f the total number of accepted mutations in our collection, and
by $f_{ab}$ the number of accepted mutations of type {a,b}. In class we introduced the following
equation for all pairs of amino acids (a,b):
$$p(a,b) = \frac{r \cdot f_{ab}}{f}$$
a. Why is $f_{ab}$ more complicated to estimate for a pair of identical amino acids (i.e. a=b)?
b. Prove that for a pair of identical amino acids we have:
$$f_{aa} = \frac{q_a \cdot f}{r} - \frac{1}{2} \sum_{b \neq a} f_{ab},$$
where $q_a$ is the frequency of amino acid ‘a’ in the collection of accepted
mutations.
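As a reminder of how the equation from class is applied to distinct pairs, here is a minimal Python sketch. The function name `pair_probabilities`, the toy counts, and the value of r are hypothetical and chosen only for illustration; they are not taken from any real collection of accepted mutations.

```python
def pair_probabilities(counts, r):
    """Return p(a,b) = r * f_ab / f for every distinct pair {a,b} in `counts`.

    `counts` maps an unordered pair (a frozenset of two amino acids) to f_ab,
    the number of accepted mutations of that type.
    """
    f = sum(counts.values())  # f: total number of accepted mutations
    return {pair: r * f_ab / f for pair, f_ab in counts.items()}


# Hypothetical toy data: three amino acids and made-up mutation counts.
toy_counts = {
    frozenset("AG"): 30,
    frozenset("AS"): 50,
    frozenset("GS"): 20,
}
probs = pair_probabilities(toy_counts, r=0.01)
for pair, p in probs.items():
    print(sorted(pair), p)
```

Since the counts of distinct pairs add up to f, the probabilities computed this way add up to r. The identical pairs {a,a} are deliberately left out of the sketch.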
2. Many biological sequences (DNA/protein) contain within themselves homologous
subsequences. Consider the following problem of finding inexact repeats within a given
sequence. Given a sequence S, we wish to find two distinct subsequences within S which
have optimal alignment between them.
a. Formally define the search space of the problem; in particular describe in detail
the properties of a valid output. Argue why we can’t simply use the Smith-Waterman algorithm for local alignment to solve it.
b. Suggest an O(n^2) algorithm for finding such a pair of subsequences, when they
are allowed to overlap.
c. Suggest an O(n^3) algorithm for finding such a pair of subsequences, when they
are not allowed to overlap.

In both cases, briefly describe your algorithms. Explain why they are correct, and
analyze their time and space complexity.
3. $S = s_1 s_2 \dots s_n$ is a non-contiguous subsequence of T iff $T = w_0 s_1 w_1 s_2 w_2 \dots w_{n-1} s_n w_n$ for some
sequences $w_0, w_1, \dots, w_n$ (some of which may be empty). In such a case T is considered a non-contiguous super-sequence of S.

Suggest an efficient algorithm for finding the shortest common non-contiguous super-sequence T of a pair of sequences $S_1$, $S_2$. Explain its time/space complexity and prove
its correctness.
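The definition above can be checked mechanically. The following Python sketch only tests whether a candidate T is a common non-contiguous super-sequence of two given sequences; it does not search for the shortest one. The function names are hypothetical.

```python
def is_noncontiguous_subsequence(s, t):
    """True iff s can be written as s_1 ... s_n with t = w_0 s_1 w_1 ... s_n w_n."""
    it = iter(t)
    # `ch in it` consumes the iterator up to and including the first match,
    # so the characters of s are matched in t from left to right.
    return all(ch in it for ch in s)


def is_common_supersequence(t, s1, s2):
    """True iff t is a non-contiguous super-sequence of both s1 and s2."""
    return is_noncontiguous_subsequence(s1, t) and is_noncontiguous_subsequence(s2, t)


print(is_noncontiguous_subsequence("ACG", "TACCGA"))  # True
print(is_noncontiguous_subsequence("GA", "ACG"))      # False
print(is_common_supersequence("ACGT", "AGT", "ACT"))  # True
```

A checker like this can be handy for validating the output of whatever algorithm you propose on small inputs.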
4. Recall the generalized DP algorithm for optimal multiple alignment. Let $i = (i_1, \dots, i_k)$ be a
given cell in the k-dimensional matrix used to align k sequences $S_1, \dots, S_k$. Assume that we
know that some alignment of the k sequences has score L. For each pair $1 \le u < v \le k$, let
a(u,v) be the score of an optimal pairwise alignment of $S_u$ and $S_v$ which passes through cell
$(i_u, i_v)$ (in the 2-dimensional matrix of their pairwise alignment). Prove that no optimal
multiple alignment (of the k sequences) passes through cell i when
$$\sum_{1 \le u < v \le k} a(u,v) < L.$$
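To make the quantity a(u,v) concrete, here is a hedged Python sketch that computes, for one pair of sequences, the best global-alignment score among alignments passing through each cell. It assumes scores are maximized and uses an illustrative scoring scheme (match +1, mismatch -1, gap -1) that is not prescribed by the assignment; the function names are hypothetical. The idea is to add a forward DP score for the prefixes to a backward DP score for the suffixes.

```python
def forward_scores(s, t, match=1, mismatch=-1, gap=-1):
    """F[i][j] = best global alignment score of s[:i] against t[:j]."""
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                          F[i - 1][j] + gap,      # s[i-1] against a gap
                          F[i][j - 1] + gap)      # t[j-1] against a gap
    return F


def through_cell_scores(s, t, **scoring):
    """A[i][j] = best score of a global alignment of s and t through cell (i, j)."""
    n, m = len(s), len(t)
    F = forward_scores(s, t, **scoring)
    # Backward scores via the reversed strings: R[n-i][m-j] is the best
    # score of aligning the suffix s[i:] against the suffix t[j:].
    R = forward_scores(s[::-1], t[::-1], **scoring)
    return [[F[i][j] + R[n - i][m - j] for j in range(m + 1)]
            for i in range(n + 1)]


A = through_cell_scores("ACGT", "AGT")
print(A[0][0], A[4][3])  # both corners equal the optimal global score
```

Every cell's value is at most the optimal global score, and cells on some optimal path attain it.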
5. Recall the 2-approximation algorithm for best SP-score multiple alignment shown in the
tutorial (a.k.a. the star algorithm). Show that the choice of center sequence the algorithm makes
is not always optimal. Give an example of a set of sequences where choosing a different
sequence as the center (e.g. $S_1$) yields a better multiple alignment under edit distance (i.e.
match: 0; indel/mismatch: 1).
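When experimenting with candidate examples, a small script can help. The Python sketch below assumes the tutorial's algorithm is the standard center-star method, which picks as center the sequence with the smallest sum of pairwise edit distances to all the others; the function names and the toy input are hypothetical. It only reports the center choice and does not build the multiple alignment.

```python
def edit_distance(s, t):
    """Unit-cost edit distance (match 0; indel/mismatch 1), row by row."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (a != b),  # match or mismatch
                           prev[j] + 1,             # delete a from s
                           cur[j - 1] + 1))         # insert b into s
        prev = cur
    return prev[-1]


def star_center(seqs):
    """Return (index of the chosen center, list of distance sums per sequence)."""
    totals = [sum(edit_distance(s, t) for t in seqs) for s in seqs]
    best = min(range(len(seqs)), key=totals.__getitem__)
    return best, totals


center, totals = star_center(["AC", "AG", "AT", "CC"])
print(center, totals)  # 0 [3, 4, 4, 5]
```

Ties are broken here in favor of the first sequence with the minimal sum, which is an arbitrary choice of this sketch.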
Good luck!