
Tel Aviv University
School of Computer Science
Computational Genomics
2013
Exercise 3, published 27/11/13
Notes:
1. General Guidelines: This assignment is part of the final exam in the course. It should
be done independently, individually, without any help from others. Duplicated or
copied work will receive a grade of zero. Using articles, books, or web sites is perfectly
acceptable as long as you include the references that you used. If a question
requires a description of an algorithm, you must prove its correctness and analyze
its time and space complexity.
2. Each question gets equal credit. Within a question, each subsection gets equal credit.
3. Submit in class or to Yaron's box (number 370). Deadline: 19/12/13, 17:00.
4. Submitted work can be in English or Hebrew, handwritten or printed.
Q. 1:
A transcription factor (TF) is a protein that binds to certain DNA sequences in promoters and
affects transcription. The binding sites of a TF that binds to sequences of length n can be
described using a 4×n matrix in which the ith column gives the probabilities for the different
nucleotides at position i of the binding site.
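To make the matrix representation concrete, here is a minimal Python sketch that computes the probability of one candidate site under a 4×n matrix. The matrix values, the row order A, C, G, T, and the name `site_probability` are illustrative assumptions and are not part of the exercise.

```python
# Hypothetical 4x3 binding-site matrix (n = 3); each column sums to 1.
# The row order A, C, G, T is an assumption made for this illustration.
NUCLEOTIDES = "ACGT"
PWM = [
    [0.7, 0.1, 0.2],  # A
    [0.1, 0.1, 0.2],  # C
    [0.1, 0.7, 0.2],  # G
    [0.1, 0.1, 0.4],  # T
]


def site_probability(site, pwm):
    """Probability that the matrix emits `site`, position by position."""
    n = len(pwm[0])
    if len(site) != n:
        raise ValueError("site length must equal the number of columns")
    prob = 1.0
    for i, base in enumerate(site):
        # Column i holds the nucleotide distribution at position i.
        prob *= pwm[NUCLEOTIDES.index(base)][i]
    return prob


print(site_probability("AGT", PWM))  # product of 0.7, 0.7 and 0.4
```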
a. Describe an HMM for a promoter that can have multiple disjoint binding sites for a given
TF. You can assume that other nucleotides in the promoter are generated independently at
random according to a given background distribution.
b. Given a promoter sequence, explain how to calculate its probability according to the
model. Make sure to account for the constraint that a binding site cannot start in the last n-1
positions.
Background for Q2 and Q3:
The genomes of any two individuals are identical in 99.9% of the positions. The positions in
which they vary are called Single Nucleotide Polymorphisms (SNPs for short). At each SNP,
only two nucleotides are possible, e.g. A or G. We denote these two options by 0 and 1. For
simplicity, we will deal with one diploid chromosome, which we term the "genome" (i.e.,
two sequences over {0,1}; each sequence represents the bases at the SNPs of one copy of
the chromosome).
In reading a human genome we can see the sum of its two copies: for each SNP we get 0, 1
or 2 according to the bases in the two copies of that position. For 0,0 we see 0. For 1,1 we
see 2. For 0,1 or 1,0 we see 1 (the sequencing does not distinguish which copy each base
came from).
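The observation model above can be sketched in a few lines of Python. The function name `observed_genotype` and the example copies are hypothetical and serve only as an illustration.

```python
def observed_genotype(copy_a, copy_b):
    """Per-SNP sum of two copies over {0, 1}; the result is over {0, 1, 2}."""
    if len(copy_a) != len(copy_b):
        raise ValueError("both copies must cover the same SNPs")
    return [a + b for a, b in zip(copy_a, copy_b)]


# Two hypothetical copies of a chromosome with four SNPs.
print(observed_genotype([0, 1, 1, 0], [0, 1, 0, 1]))  # [0, 2, 1, 1]
```

Swapping the two copies gives the same output, which is why a read of 1 does not reveal which copy carries the 1.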
Q. 2:
When sequencing a human genome, in some cases, there is some uncertainty in the reads,
and we see several options for a SNP, e.g. {0,1}, {1,2} or {0,1,2}.
The input R to our problem is a set of reads from n different individuals at a single position. The
possible reads are the subsets {0},{1},{2},{0,1},{1,2},{0,1,2}. We want to learn the probability
of having 1 in that position. We denote this probability by p.
a. Write a likelihood function for the observed reads. That is, write a formula for L(p;
R), where R are the n reads.
b. Write the Q function for this problem. Write it explicitly, i.e., do not leave an
expectation sign or an exponentially large sum of terms.
c. Give the update rule for p.
Q. 3:
For each population (i.e., Africans or Europeans) there is a different probability for 0 or 1 in
each SNP. We can describe an African-American genome as two sequences of SNPs; each
sequence is composed of interleaved segments of European and African genomes (due to
recombination and inter-population mating). In the European / African segment, SNP
probabilities are according to the European / African population. There is no dependence
between segments.
We wish to model such a genome using a double HMM (dHMM). Instead of one path, there are two
independent paths (with possible transitions between them). The output is the sum of these
paths.
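As an illustration of the generative picture only, the following Python sketch simulates two independent copies and their summed output. It assumes a single switch probability between populations at every SNP and per-SNP allele probabilities stored in a dictionary. The names `simulate_copy`, `simulate_genome`, `p_one` and `switch_prob` are hypothetical, and the sketch is not a full specification of the model asked for below.

```python
import random


def simulate_copy(p_one, switch_prob, rng):
    """One chromosomal copy: a path over populations 'E'/'A' and its alleles.

    p_one[pop][i] is the assumed probability of allele 1 at SNP i in `pop`.
    """
    n_snps = len(p_one["E"])
    pop = rng.choice("EA")  # assumed uniform choice of the first segment
    path, alleles = [], []
    for i in range(n_snps):
        if i > 0 and rng.random() < switch_prob:
            pop = "A" if pop == "E" else "E"  # start of a new segment
        path.append(pop)
        alleles.append(1 if rng.random() < p_one[pop][i] else 0)
    return path, alleles


def simulate_genome(p_one, switch_prob, seed=0):
    """Two independent copies; the observation is their per-SNP sum."""
    rng = random.Random(seed)
    path1, copy1 = simulate_copy(p_one, switch_prob, rng)
    path2, copy2 = simulate_copy(p_one, switch_prob, rng)
    observed = [a + b for a, b in zip(copy1, copy2)]
    return (path1, path2), observed


# Hypothetical allele-1 probabilities for five SNPs in each population.
p_one = {"E": [0.9, 0.8, 0.9, 0.7, 0.9], "A": [0.1, 0.2, 0.1, 0.3, 0.2]}
paths, observed = simulate_genome(p_one, switch_prob=0.2)
print(paths, observed)
```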
a. Describe how a dHMM models the sequencing of the African-American genome.
Plot the states and possible transitions and describe the output.
b. Suppose we are given the sequence of SNPs along a chromosome (i.e., a sequence
over {0,1,2}) along with the dHMM model for that chromosome, including the
transition and emission probabilities. Describe an algorithm that calculates the most
probable partition of the two chromosomal copies into European and African
segments.
Q. 4:
The pre-mRNA of a eukaryote gene contains exons interspersed with introns. In a normal
splicing process the exons are concatenated in their order, and together they form the
mature mRNA. In alternative splicing, the process skips one or more exons, and
consequently they are not contained in the mature mRNA.
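The difference between normal and alternative splicing can be illustrated with a short Python sketch. The exon sequences and the name `splice` are hypothetical, and sequencing errors are ignored here.

```python
def splice(exons, skipped=()):
    """Concatenate exons in order, leaving out the (0-based) skipped indices."""
    skipped = set(skipped)
    return "".join(seq for i, seq in enumerate(exons) if i not in skipped)


exons = ["AUGGC", "CUUAG", "GGAUC", "UAA"]  # hypothetical exon sequences
print(splice(exons))               # AUGGCCUUAGGGAUCUAA
print(splice(exons, skipped=[1]))  # AUGGCGGAUCUAA
```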
Describe an efficient algorithm that gets as input (1) the normal gene represented as an
ordered list of exon sequences and (2) a mature mRNA sequence, and identifies which exons
are contained in it. Due to sequencing errors, there may be mismatches and indels between
the exon sequence in the mature mRNA and the original exon sequence.
Use the following notation: the mature mRNA sequence is M(1,…,m). The normal gene
sequence is E(1,…,n). The starting indices of the exons in the sequence are s_1 = 1 < s_2 < … < s_k < n, that is, exon i
is E(s_i),…,E(s_{i+1} − 1). The similarity score of nucleotides x, y is σ(x,y), and the indel score is δ. Assume k << n.
Q. 5:
You are given a random sample of similarity values between the expression patterns of pairs of
genes. Assume that the genes fall into two clusters, so that the similarity values within a
cluster are normally distributed with mean w and standard deviation s, while the similarity values
between clusters are normally distributed with mean b and standard deviation s.
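The data-generating assumption can be sketched in Python as follows. The sketch assumes the cluster label of every gene is known, which is not the case in the exercise, and it draws one value for every gene pair. The name `sample_similarities` and the parameter values are hypothetical.

```python
import random
from itertools import combinations


def sample_similarities(labels, w, b, s, seed=0):
    """Draw one similarity value per gene pair.

    Pairs within a cluster ~ Normal(w, s); pairs across clusters ~ Normal(b, s).
    """
    rng = random.Random(seed)
    values = []
    for i, j in combinations(range(len(labels)), 2):
        mean = w if labels[i] == labels[j] else b
        values.append(rng.gauss(mean, s))
    return values


# Five hypothetical genes: three in cluster 0, two in cluster 1.
sample = sample_similarities([0, 0, 0, 1, 1], w=0.8, b=0.1, s=0.05)
print(len(sample))  # 10 pairs: 4 within a cluster, 6 between clusters
```

In the exercise only the values are observed, not which pairs are within a cluster, which is what makes this a missing-data problem suited to EM.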
a. Write a detailed expression of the likelihood of a given sample (x1, …, xn).
b. To use an EM algorithm to find the parameters that (locally) maximize
the likelihood function, calculate the Q function.
c. Derive the update formulas for s, b and w.