Tel Aviv University
School of Computer Science
Computational Genomics
2013
Exercise 2, published 7/11/13
Notes:
1. General Guidelines: This assignment is part of the exam in the course. It should be
done independently, individually, without any help from others. Duplicated or
copied work will receive a grade of zero. Using articles, books, or web sites is perfectly
acceptable as long as you include the references that you used. If a question
requires a description of an algorithm, you must prove its correctness and analyze
its time and space complexity.
2. Each question gets equal credit. Each subsection gets equal credit in the question.
Bonus sections are optional and give extra credit.
3. Submit to Yaron's box (number 370) or in class. Deadline: 28/11/13, 17:00.
4. Submitted work can be in English or Hebrew, handwritten or printed.
Q. 1: Prove that the consensus of the optimal SOP solution of multiple alignment is equal
to the Steiner string (up to spaces).
Q. 2: A student suggested the following implementation of the Carrillo-Lipman algorithm:
a. Align each pair of sequences using optimal global alignment.
b. For each pair, run the traceback procedure and store the DP matrix cells that the
algorithm visits.
c. Add the hyper-cube cell (0,0,…,0) to an empty queue.
d. Repeat until the MSA is completed:
i. Pop the top cell in the queue.
ii. Update the scores of nearby cells in the hypercube.
iii. Add to the queue every updated cell for which:
1. All adjacent cells have already updated its score.
2. Its projection on some 2D plane is one of the cells found in (b).
Determine whether the implementation is correct or not. If it is correct, explain why.
If not, show an example in which it will return incorrect results.
Q. 3: The genomes of any two human individuals are ~99.9% identical. The difference between
the genomes of different persons lies mostly in single nucleotides – called SNPs (single
nucleotide polymorphisms). To reconstruct a person's genome sequence it is enough to read
those SNPs. In addition, nearby SNPs are highly correlated, and so we can sequence only
some of the SNPs and use them to predict the rest. Our goal is to select a limited number of
SNPs, so that sequencing them will give maximum information on a person's genome.
We assume we are dealing with only one chromosome. We denote the known SNPs in the
chromosome by the numbers 1,2,…,m, according to their order along it. Assume we have a
budget to sequence t SNPs, termed selected SNPs. The other SNPs are predicted, resulting in
some prediction mistakes or errors. We wish to select the t SNPs, so that the expected
number of errors made on the other m-t SNPs is minimum. Assume that the expected error
in predicting a SNP only depends on the two closest selected SNPs.
We are given the following score functions, which are based on a collection of fully
sequenced chromosomes.
scores(i, j) = expected number of errors in bases i+1,…,j-1 if i and j are among the selected
SNPs, i<j, and none of the SNPs k between them (i<k<j) is selected.
score1(i) = expected number of errors in bases 1,…,i-1 if i is the first selected SNP.
score2(j) = expected number of errors in bases j+1,…,m if j is the last selected SNP.
Describe an efficient algorithm to find a set of t selected SNPs such that their sequencing will
minimize the overall expected number of errors in predicting the rest of the SNPs.
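To make the objective concrete, here is a minimal Python sketch of how the total expected error of one candidate selection is assembled from the three score functions. It is not the efficient algorithm the question asks for. The names `total_expected_errors` and `brute_force_selection`, and the parameter name `score_s` (standing for the pairwise score function), are hypothetical; the exhaustive search is an exponential-time baseline, suitable only for checking a faster algorithm on tiny inputs.

```python
from itertools import combinations
from typing import Callable, Sequence, Tuple


def total_expected_errors(
    selected: Sequence[int],
    score1: Callable[[int], float],
    score_s: Callable[[int, int], float],
    score2: Callable[[int], float],
) -> float:
    """Expected errors for a sorted, non-empty list of 1-based selected SNPs."""
    # Errors before the first selected SNP and after the last one.
    cost = score1(selected[0]) + score2(selected[-1])
    # Errors between each pair of consecutive selected SNPs.
    for i, j in zip(selected, selected[1:]):
        cost += score_s(i, j)
    return cost


def brute_force_selection(
    m: int,
    t: int,
    score1: Callable[[int], float],
    score_s: Callable[[int, int], float],
    score2: Callable[[int], float],
) -> Tuple[int, ...]:
    """Exhaustive baseline: try every t-subset of 1..m (tiny m only)."""
    return min(
        combinations(range(1, m + 1), t),
        key=lambda sel: total_expected_errors(sel, score1, score_s, score2),
    )
```

Each selection's cost decomposes into one prefix term, one suffix term, and one term per pair of consecutive selected SNPs, which follows from the assumption that the error at a SNP depends only on its two closest selected SNPs.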
Q. 4: The following question motivates the use of better seeds in BLAST, by introducing
spaces in them. For simplicity assume that we are aligning a binary query string to a binary
reference string. Given a seed present in the reference, we want to find all its occurrences in
the query. The goal is to use seeds that will maximize the number of such occurrences.
We define a spaced seed: 1 stands for a match to a bit 1 and * for a don't-care. BLAST seeds
taught in class were continuous seeds, e.g. 1111, whereas spaced seeds include 1's and *'s,
e.g. 1*1**11 (a * position can match any bit). Note that in both seeds we are checking
four positions.
We denote by W the weight of a seed, i.e., the number of 1's in it. L is the length of R, the
target sequence in which we want to find the seed. M is the seed length (counting 1's and
*'s, e.g., 1*1**11 is of length 7). p is the probability of 1 in the target sequence, and we
assume i.i.d. positions. A hit in position i on the target is a match of the seed to R[i,i+M-1] (i.e., all
the seed's 1-positions match 1's in the sub-sequence of the target). The sensitivity of seed s
is the probability that s has a hit in at least one position of the target sequence.
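The definitions of a hit and of sensitivity can be illustrated with a short Python sketch. The function names `hit_positions` and `sensitivity` are hypothetical, positions are 0-based here (the text uses 1-based positions), and the sensitivity is computed by brute-force enumeration of all binary targets, which is only feasible for short targets and is not a substitute for the derivations requested below.

```python
from itertools import product
from typing import List


def hit_positions(seed: str, target: str) -> List[int]:
    """Return the 0-based positions where the seed hits the target.

    The seed is a string over {'1', '*'}; the target is a string over
    {'0', '1'}. A hit requires a '1' in the target under every '1' of
    the seed; '*' positions match any bit.
    """
    m = len(seed)
    return [
        i
        for i in range(len(target) - m + 1)
        if all(t == "1" for s, t in zip(seed, target[i:i + m]) if s == "1")
    ]


def sensitivity(seed: str, length: int, p: float) -> float:
    """Probability of at least one hit in an i.i.d. target of given length.

    Enumerates all 2**length targets, so use it only for small lengths.
    """
    total = 0.0
    for bits in product("01", repeat=length):
        target = "".join(bits)
        if hit_positions(seed, target):
            ones = target.count("1")
            total += p ** ones * (1 - p) ** (length - ones)
    return total
```

For example, `hit_positions("1*1", "10111")` checks the windows `101`, `011` and `111`, and reports hits at the first and third windows.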
a. What is the expected number of hits in the target sequence? Write it as a function of
L, M, p and W.
b. You are given two seeds: s1=11111*** and s2=11**11*1. What is the expected
number of hits of each of them, given that si has a hit in the first position in the
sequence and the sequence length is 11? Write the expectations as a function of p.
c. Use the equation of E(# hits) = P(s hits first position) * E(# hits | s hits first position)
to show that the sensitivity of the spaced seed in b is greater than that of the
continuous seed (i.e., show that P(s hits the first position) is greater for the spaced
seed).
d. Let b be a string of length M, and let f(i,b) be the probability of s hitting the prefix
R[1,i] of the target sequence, where b=R[i-M+1,i]. If s=b, then f(i,b) = 1. Give a
recursive formula for the case where s≠b.
e. (bonus) Write the probability of s hitting a target sequence of length L as a function
of f(i, b). Hint: write it as a sum over all possible suffixes b.
Q. 5:
We want to solve the tree alignment problem assuming triangle inequality using a restricted
variant of the lifted alignment algorithm. In this variant, which we call “one-sided lifted
alignment”, we are given a binary tree with leaf labels and a representation of the tree so
that the two children of each non-leaf node are labeled “left” and “right”. As in the regular
lifting algorithm the label of a node is one of the labels of its children, but here it must be
selected so that for all nodes in the same level the lifting is from the same side, left or right.
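As an illustration of the constraint only, the following Python sketch evaluates the cost of one fixed one-sided lifting; it does not search for the best one. Several things here are assumptions not stated in the exercise: the tree is encoded as nested `(left, right)` tuples with string leaves, a node's level is taken to be its depth below the root, `sides[d]` gives the side (`"L"` or `"R"`) used at depth `d`, and `hamming` is a stand-in distance for equal-length labels rather than the alignment distance used in the course.

```python
from typing import Callable, Sequence, Tuple, Union

# A tree is either a leaf label (str) or a pair (left subtree, right subtree).
Tree = Union[str, Tuple["Tree", "Tree"]]


def one_sided_lift(
    tree: Tree,
    sides: Sequence[str],
    dist: Callable[[str, str], float],
    depth: int = 0,
) -> Tuple[str, float]:
    """Return (label of this node, total edge cost of its subtree)."""
    if isinstance(tree, str):
        return tree, 0  # a leaf keeps its own label and has no edges below
    left_label, left_cost = one_sided_lift(tree[0], sides, dist, depth + 1)
    right_label, right_cost = one_sided_lift(tree[1], sides, dist, depth + 1)
    # Every node at this depth lifts from the same side.
    label = left_label if sides[depth] == "L" else right_label
    cost = (
        left_cost
        + right_cost
        + dist(label, left_label)
        + dist(label, right_label)
    )
    return label, cost


def hamming(a: str, b: str) -> int:
    """Stand-in distance: number of mismatching positions (equal lengths)."""
    return sum(x != y for x, y in zip(a, b))
```

With the tree `(("AA", "AB"), ("BB", "BA"))` and `sides = ["L", "R"]`, both depth-1 nodes take their right child's label (`AB` and `BA`), and the root then takes the label of its left child (`AB`).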
We define the following transformation from the optimal tree T (optimal in the sense that
the sum of distances over all tree edges is minimal, without the constraint of one-sided lifted
alignment): start from the bottom and go up – in each level choose to lift all labels from the right
or the left child depending on the choice closer to the parent label in the optimal tree.
Denote this tree T'.
a. Prove that the cost of T' is at most twice the cost of T. (Hint: follow the same proof
as with regular lifted alignment but sum separately each level, then prove that each
non-zero edge is bounded by the cost of the path from leaf to node, and that these
paths are disjoint and cover the whole tree).
b. Describe a polynomial algorithm for finding the best one-sided lifted alignment.
Write its running time as a function of k (number of leaves), n (the length of the
sequence in each leaf) and h (the height of the tree).
c. (bonus) Prove that the best one-sided lifted alignment guarantees a 2-approximation.