ex2 5 questions

Tel Aviv University
School of Computer Science
Computational Genomics 2013
Exercise 2, published 7/11/13

Notes:
1. General Guidelines: This assignment is part of the exam in the course. It should be done independently and individually, without any help from others. Duplicated or copied work will receive a grade of zero. Using articles, books, or web sites is perfectly acceptable as long as you include the references that you used. If a question requires a description of an algorithm, you must prove its correctness and analyze its time and space complexity.
2. Each question gets equal credit. Each subsection gets equal credit within its question. Bonus sections are optional and give extra credit.
3. Submit to Yaron's box (number 370) or in class. Deadline: 28/11/13, 17:00.
4. Submitted work can be in English or Hebrew, handwritten or printed.

Q. 1: Prove that the consensus of the optimal SOP solution of multiple alignment is equal to the Steiner string (up to spaces).

Q. 2: A student suggested the following implementation of the Carrillo-Lipman algorithm:
a. Align each pair of sequences using optimal global alignment.
b. For each pair, run the traceback procedure and store the DP matrix cells that the algorithm visits.
c. Add the hypercube cell (0,0,…,0) to an empty queue.
d. Repeat until the MSA is completed:
   i. Pop the top cell in the queue.
   ii. Update the scores of nearby cells in the hypercube.
   iii. Add to the queue every updated cell for which:
      1. All of its adjacent cells have already updated its score.
      2. Its projection on some 2D plane is one of the cells found in (b).
Determine whether the implementation is correct or not. If it is correct, explain why. If not, show an example in which it will return incorrect results.

Q. 3: The genomes of any two human individuals are ~99.9% identical. The differences between the genomes of different people lie mostly in single nucleotides, called SNPs (single nucleotide polymorphisms).
To reconstruct a person's genome sequence it is enough to read those SNPs. In addition, nearby SNPs are highly correlated, so we can sequence only some of the SNPs and use them to predict the rest. Our goal is to select a limited number of SNPs, so that sequencing them will give maximum information on a person's genome. We assume we are dealing with only one chromosome. We denote the known SNPs in the chromosome by the numbers 1,2,…,m, according to their order along it. Assume we have a budget to sequence t SNPs, termed selected SNPs. The other SNPs are predicted, resulting in some prediction mistakes or errors. We wish to select the t SNPs so that the expected number of errors made on the other m-t SNPs is minimum. Assume that the expected error in predicting a SNP depends only on the two closest selected SNPs. We are given the following score functions, which are based on a collection of fully sequenced chromosomes:
scores(i, j) = expected number of errors in bases i+1,…,j-1 if i and j are among the selected SNPs, i<j, and none of the SNPs k between them (i<k<j) is selected.
score1(i) = expected number of errors in bases 1,…,i-1 if i is the first selected SNP.
score2(j) = expected number of errors in bases j+1,…,m if j is the last selected SNP.
Describe an efficient algorithm to find a set of t selected SNPs such that their sequencing will minimize the overall expected number of errors in predicting the rest of the SNPs.

Q. 4: The following question motivates the use of better seeds in BLAST, by introducing spaces in them. For simplicity assume that we are aligning a binary query string to a binary reference string. Given a seed present in the reference, we want to find all its occurrences in the query. The goal is to use seeds that will maximize the number of such occurrences. We define a spaced seed: 1 stands for a match to a bit 1 and * for a don't-care. The BLAST seeds taught in class were continuous seeds, e.g. 1111, whereas spaced seeds include 1's and *'s, e.g. 1*1**11 (a * position can match any bit). Note that in both seeds we are checking four positions.
We denote by W the weight of a seed, i.e., the number of 1's in it. L is the length of R, the target sequence in which we want to find the seed. M is the seed length (counting 1's and *'s, e.g., 1*1**11 is of length 7). p is the probability of 1 in the target sequence, and we assume i.i.d. positions. A hit in position i on the target is a match of the seed to R[i,i+M-1] (i.e., all the seed's 1-positions match 1's in the sub-sequence of the target). The sensitivity of seed s is the probability that s has a hit in at least one position of the target sequence.
a. What is the expected number of hits in the target sequence? Write it as a function of L, M, p and W.
b. You are given two seeds: s1=11111*** and s2=11**11*1. What is the expected number of hits of each of them, given that s_i has a hit in the first position in the sequence and the sequence length is 11? Write the expectations as a function of p.
c. Use the equation E(# hits) = P(s hits the first position) * E(# hits | s hits the first position) to show that the sensitivity of the spaced seed in b is greater than that of the continuous seed (i.e., show that P(s hits the first position) is greater for the spaced seed).
d. Let b be a string of length M, and let f(i,b) be the probability of s hitting the prefix R[1,i] of the target sequence, where b=R[i-M+1,i]. If s=b, then f(i,b) = 1. Give a recursive formula for the case where s≠b.
e. (bonus) Write the probability of s hitting a target sequence of length L as a function of f(i, b). Hint: write it as a sum over all possible suffixes b.

Q. 5: We want to solve the tree alignment problem assuming triangle inequality using a restricted variant of the lifted alignment algorithm.
In this variant, which we call “one sided lifted alignment”, we are given a binary tree with leaf labels and a representation of the tree in which the two children of each non-leaf node are labeled “left” and “right”. As in the regular lifting algorithm, the label of a node is one of the labels of its children, but here it must be selected so that for all nodes in the same level the lifting is from the same side, left or right. We define the following transformation from the optimal tree T (optimal in the sense that the sum of distances over all tree edges is minimal, without the constraint of one sided lifted alignment): start from the bottom and go up; in each level choose to lift all labels from the right or the left child, depending on which choice is closer to the parent label in the optimal tree. Denote the resulting tree T'.
a. Prove that the cost of T' is at most twice the cost of T. (Hint: follow the same proof as with regular lifted alignment but sum each level separately, then prove that each non-zero edge is bounded by the cost of the path from leaf to node, and that these paths are disjoint and cover the whole tree.)
b. Describe a polynomial algorithm for finding the best one sided lifted alignment. Write its running time as a function of k (the number of leaves), n (the length of the sequence in each leaf) and h (the height of the tree).
c. (bonus) Prove that the best one sided lifted alignment guarantees a 2-approximation.
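
To make the definition in Q. 5 concrete, the following is a minimal Python sketch that evaluates the cost of a single one sided lifted alignment for a given choice of side per level. It only illustrates the definition and is not the algorithm requested in part b. Several things here are assumptions that the exercise does not state: the names Node, edit_distance and lifted_cost are hypothetical, unit-cost edit distance stands in for the (unspecified) distance that satisfies the triangle inequality, and the "level" of a node is taken to be its depth from the root.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Node:
    """Binary tree node; only leaves carry a label initially (hypothetical type)."""
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance, used here as an assumed stand-in metric."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def lifted_cost(node: Node, sides: Sequence[str], depth: int = 0) -> Tuple[str, int]:
    """Return (lifted label, total edge cost) of the subtree rooted at node.

    sides[d] is "L" or "R": every internal node at depth d (assumed meaning
    of "level") takes the label of its left or right child, respectively.
    """
    if node.left is None and node.right is None:
        return node.label, 0
    l_label, l_cost = lifted_cost(node.left, sides, depth + 1)
    r_label, r_cost = lifted_cost(node.right, sides, depth + 1)
    label = l_label if sides[depth] == "L" else r_label
    cost = (l_cost + r_cost
            + edit_distance(label, l_label)
            + edit_distance(label, r_label))
    return label, cost


# Toy example: a balanced tree with four leaves.
root = Node(
    left=Node(left=Node(label="ACGT"), right=Node(label="ACGA")),
    right=Node(left=Node(label="TCGA"), right=Node(label="ACGA")),
)
print(lifted_cost(root, ["L", "R"]))  # ('ACGA', 2)
print(lifted_cost(root, ["L", "L"]))  # ('ACGT', 4)
```

In this toy tree, lifting from the right at depth 1 gives both internal nodes the label ACGA, so the two edges into the root cost nothing; lifting from the left at every depth instead leaves a cost-2 edge between the root and its right child.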
