Tel Aviv University
School of Computer Science
Computational Genomics 2013
Exercise 3, published 27/11/13

Notes:
1. General Guidelines: This assignment is part of the final exam in the course. It should be done independently, individually, without any help from others. Duplicated and copied works will be given grade zero. Using articles, books, or web sites is perfectly acceptable as long as you include the references that you used. If a question requires a description of an algorithm, you must prove its correctness and analyze its time and space complexity.
2. Each question gets equal credit. Within a question, each subsection gets equal credit.
3. Submit in class or to Yaron's box (number 370). Deadline: 19/12/13, 17:00.
4. Submitted work can be in English or Hebrew, handwritten or printed.

Q. 1: A transcription factor (TF) is a protein that binds to certain DNA sequences in promoters and affects transcription. The binding sites of a TF that binds to sequences of length n can be described using a 4×n matrix in which the ith column gives the probabilities of the different nucleotides at position i of the binding site.
a. Describe an HMM for a promoter that can have multiple disjoint binding sites for a given TF. You can assume that other nucleotides in the promoter are generated independently at random according to a given background distribution.
b. Given a promoter sequence, explain how to calculate its probability according to the model. Make sure to account for the constraint that a binding site cannot start in the last n-1 positions.

Background for Q2 and Q3: The genomes of any two individuals are identical in 99.9% of the positions. The positions in which they vary are called Single Nucleotide Polymorphisms (SNPs for short). In each SNP, only two nucleotides are possible, e.g. A or G. We denote these two options by 0 and 1. For simplicity, we will deal with one diploid chromosome, which we term the "genome" (i.e., two sequences over {0,1}.
Each sequence represents the bases in the SNPs of one copy of the chromosome). In reading a human genome we see the sum of its two copies: for each SNP we get 0, 1 or 2 according to the bases in the two copies at that position. For 0,0 we see 0; for 1,1 we see 2; for 0,1 or 1,0 we see 1 (the sequencing does not distinguish which copy each base came from).

Q. 2: When sequencing a human genome, in some cases there is uncertainty in the reads, and we see several options for a SNP, e.g. {0,1}, {1,2} or {0,1,2}. The input R to our problem is reads from n different individuals at a single position. The possible reads are the subsets {0}, {1}, {2}, {0,1}, {1,2}, {0,1,2}. We want to learn the probability of having 1 in that position. We denote this probability by p.
a. Write a likelihood function for the observed reads. That is, write a formula for L(p; R), where R are the n reads.
b. Write the Q function for this problem. Write it explicitly, i.e., do not leave an expectation sign or an exponential sum of terms.
c. Give the update rule for p.

Q. 3: For each population (e.g., Africans or Europeans) there is a different probability for 0 or 1 in each SNP. We can describe an African-American genome as two sequences of SNPs; each sequence is composed of segments of a European and an African genome, interwoven (due to recombination and inter-population mating). In a European / African segment, SNP probabilities follow the European / African population, respectively. There is no dependence between segments. We wish to model such a genome using a double HMM (dHMM). Instead of one path, there are two independent paths (with possible transitions between them). The output is the sum of these paths.
a. Describe how a dHMM models the sequencing of the African-American genome. Plot the states and possible transitions and describe the output.
b.
Suppose we are given the sequence of SNPs along a chromosome (i.e., a sequence over {0,1,2}) along with the dHMM model for that chromosome, including the transition and emission probabilities. Describe an algorithm that calculates the most probable partition of the two chromosomal copies into European and African segments.

Q. 4: The pre-mRNA of a eukaryotic gene contains exons interspersed with introns. In a normal splicing process the exons are concatenated in their order, and together they form the mature mRNA. In alternative splicing, the process skips one or more exons, and consequently they are not contained in the mature mRNA. Describe an efficient algorithm that gets as input (1) the normal gene, represented as an ordered list of exon sequences, and (2) a mature mRNA sequence, and identifies which exons are contained in it. Due to sequencing errors, there may be mismatches and indels between the exon sequence in the mature mRNA and the original exon sequence.
Use the following notations: the mature mRNA sequence M(1,…,m). The normal gene sequence E(1,…,n). Starting indices of exons in the sequence: s_1 = 1 < s_2 < … < s_k < n; that is, exon i is E(s_i), …, E(s_{i+1} − 1). Similarity score of nucleotides x, y: (x,y). Indel score: . Assume k << n.

Q. 5: You are given a random sample of similarity values between the expression patterns of pairs of genes. Assume that the genes fall into two clusters, so that the similarity values within a cluster are normally distributed with mean w and standard deviation s, while the similarity values between clusters are normally distributed with mean b and standard deviation s.
a. Write a detailed expression for the likelihood of a given sample (x_1, …, x_n).
b. In order to use an EM algorithm to find parameters that (locally) maximize the likelihood function, calculate the Q function.
c. Derive the update formulas for s, b and w.
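
The data-generating assumption in Q. 5 can be illustrated with a minimal sketch. It only simulates a sample and does not address parts a to c. The function name `sample_similarities` and the parameter `within_fraction` (the probability that a sampled pair of genes lies in the same cluster) are assumptions made for this illustration; the exercise does not specify either.

```python
import random


def sample_similarities(n, w, b, s, within_fraction, seed=0):
    """Draw n similarity values under the two-cluster assumption of Q. 5.

    within_fraction is a hypothetical parameter (not given in the exercise):
    the probability that a sampled gene pair lies within one cluster.
    Returns a list of (value, same_cluster) tuples; in the exercise only the
    values are observed, and same_cluster is the hidden label.
    """
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        # Choose the hidden pair type, then draw from the matching normal.
        same_cluster = rng.random() < within_fraction
        mean = w if same_cluster else b
        sample.append((rng.gauss(mean, s), same_cluster))
    return sample


if __name__ == "__main__":
    for value, same_cluster in sample_similarities(5, w=0.8, b=0.1, s=0.05,
                                                   within_fraction=0.4):
        print(f"{value:.3f}  within-cluster={same_cluster}")
```

Both components share the standard deviation s, as stated in the question; only the means w and b differ.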