
TIK-61.181 BIOINFORMATICS
AUTUMN 2000
EXERCISES
Julián Cobos Aparicio
St. Number: 55810J
julianc@cc.hut.fi
EXERCISES 1
a) Explain in detail how you would align the following sequences with pair HMMs
and with non-probabilistic methods:
AAAG
ACG
To align the sequences with non-probabilistic methods we can use dynamic
programming with the basic global alignment algorithm. Each cell of the score
matrix is filled according to:

$$a[i,j] = \max \begin{cases} a[i-1,j] - 2 \\ a[i-1,j-1] + p(i,j) \\ a[i,j-1] - 2 \end{cases}
\qquad \text{where} \quad
p(i,j) = \begin{cases} +1 & \text{if } s[i] = t[j] \\ -1 & \text{if } s[i] \neq t[j] \end{cases}$$
The resulting matrix from applying the algorithm is the following (AAAG indexes
the columns, ACG the rows):

             A    A    A    G
             1    2    3    4
        0   -2   -4   -6   -8
 A  1  -2    1   -1   -3   -5
 C  2  -4   -1    0   -2   -4
 G  3  -6   -3   -2   -1   -1
Tracing back from the final cell we obtain three possible alignments, all
reaching the optimal score of −1:

AAAG
-ACG

AAAG
AC-G

AAAG
A-CG
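As an illustration, here is a minimal Python sketch of the dynamic programming
fill described above, using the same scoring (match +1, mismatch −1, gap −2);
the function name and layout are my own, and traceback pointers are omitted:

    def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
        """Fill the global-alignment score matrix; rows follow t, columns follow s."""
        n, m = len(t), len(s)
        a = [[0] * (m + 1) for _ in range(n + 1)]
        for j in range(1, m + 1):
            a[0][j] = j * gap                       # leading gaps in t
        for i in range(1, n + 1):
            a[i][0] = i * gap                       # leading gaps in s
            for j in range(1, m + 1):
                p = match if s[j - 1] == t[i - 1] else mismatch
                a[i][j] = max(a[i - 1][j - 1] + p,  # align s[j] with t[i]
                              a[i - 1][j] + gap,    # gap in s
                              a[i][j - 1] + gap)    # gap in t
        return a

    for row in needleman_wunsch("AAAG", "ACG"):
        print(row)

The printed rows reproduce the matrix above, with the optimal score −1 in the
bottom-right cell.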
To find the best alignment of the given sequences with probabilistic methods we
can use a pair HMM with the Viterbi algorithm.
The HMM model would be:
[Pair HMM state diagram: match state M with self-transition 1−2δ and transitions
of probability δ to each of the insert states X and Y; X and Y have
self-transitions ε and return to M with probability 1−ε.]
The Viterbi algorithm calculates the alignment by finding the most probable
path in the model, following these steps:

Initialization: $v^M(0,0) = 1$; all other $v^{\bullet}(i,0)$, $v^{\bullet}(0,j)$ are set to 0.

Recurrence: for $i = 1..n$, $j = 1..m$:

$$v^M(i,j) = p_{x_i y_j} \max \begin{cases} (1-2\delta)\, v^M(i-1,j-1), \\ (1-\varepsilon)\, v^X(i-1,j-1), \\ (1-\varepsilon)\, v^Y(i-1,j-1); \end{cases}$$

$$v^X(i,j) = q_{x_i} \max \begin{cases} \delta\, v^M(i-1,j), \\ \varepsilon\, v^X(i-1,j); \end{cases}
\qquad
v^Y(i,j) = q_{y_j} \max \begin{cases} \delta\, v^M(i,j-1), \\ \varepsilon\, v^Y(i,j-1); \end{cases}$$

Termination:

$$v^E = \max\left(v^M(n,m),\ v^X(n,m),\ v^Y(n,m)\right)$$
To get the best alignment we only have to keep pointers for tracing back,
recording the residues emitted at each step of the traceback path. At the end
we also obtain the probability of this optimal path.
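A compact Python sketch of these recurrences follows; the toy emission
distributions p (joint match probabilities) and q (background), as well as the
δ and ε values, are assumptions chosen only so the code runs, and traceback
pointers are again omitted for brevity:

    import math

    def pair_viterbi(x, y, p, q, delta=0.2, eps=0.1):
        """Log-space Viterbi for the three-state pair HMM (M, X, Y)."""
        n, m = len(x), len(y)
        NEG = float("-inf")
        vM = [[NEG] * (m + 1) for _ in range(n + 1)]
        vX = [[NEG] * (m + 1) for _ in range(n + 1)]
        vY = [[NEG] * (m + 1) for _ in range(n + 1)]
        vM[0][0] = 0.0                                  # log 1
        lm, lc = math.log(1 - 2 * delta), math.log(1 - eps)
        ld, le = math.log(delta), math.log(eps)
        for i in range(n + 1):
            for j in range(m + 1):
                if i > 0 and j > 0:                     # M emits the pair (xi, yj)
                    vM[i][j] = math.log(p[x[i-1], y[j-1]]) + max(
                        lm + vM[i-1][j-1], lc + vX[i-1][j-1], lc + vY[i-1][j-1])
                if i > 0:                               # X emits xi against a gap
                    vX[i][j] = math.log(q[x[i-1]]) + max(ld + vM[i-1][j], le + vX[i-1][j])
                if j > 0:                               # Y emits yj against a gap
                    vY[i][j] = math.log(q[y[j-1]]) + max(ld + vM[i][j-1], le + vY[i][j-1])
        return max(vM[n][m], vX[n][m], vY[n][m])        # termination

    bases = "ACGT"
    p = {(a, b): 0.15 if a == b else 1/30 for a in bases for b in bases}
    q = {a: 0.25 for a in bases}
    print(pair_viterbi("AAAG", "ACG", p, q))            # log-probability of the best path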
b) Analyze briefly the relation of the probabilistic approach and the non-probabilistic approach for sequence alignment.
A Finite State Automaton (FSA) can describe the more complex dynamic
programming pairwise alignment algorithms, and we can also take it as the basis
for a more advanced probabilistic model (HMMs).
[Diagram: on the left, the FSA with states M (+1,+1), X (+1,+0), and Y (+0,+1),
match transitions scored s(xi, yj) and gap transitions penalized −d (open) and
−e (extend); on the right, the corresponding pair HMM with transition
probabilities δ, ε, 1−2δ, and 1−ε.]
Using this probabilistic model and applying to it the full HMM theory and
algorithms, we obtain complementary information about the alignments that we
cannot get with non-probabilistic methods, for example the reliability of the
optimal alignment, or other possible but non-optimal (suboptimal) alignments.
With all this information we can score the similarity of two sequences using
statistical methods. It is also possible to modify the model to recognise local
alignments.
By measuring the similarity of two sequences with the HMM (for example, with
the Forward algorithm), we can avoid the dynamic-programming problem of
identifying the correct alignment when similarity is weak. Another advantage of
HMMs is that we can train the model with pairs of related sequences to find the
parameter values (emission and transition matrices) that maximize the final
probabilities, and then use these parameters when searching with new sequences.
c) Give an example of the possible uses of the MC-model and HMMs in
Bioinformatics. Outline briefly the main differences between the two models.
Some possible uses of Markov chain models and Hidden Markov models (HMMs) are:

• Pairwise alignment of two sequences: finding the best alignment of two sequences and measuring the similarity between them.
• Multiple alignments: finding the best possible alignment of several sequences and giving a score to this alignment.
• Checking whether a particular DNA sequence belongs to a certain family based on common properties of the family, and with what probability.
• Pattern searching in DNA sequences (e.g. CpG islands).
• Identifying α-helix or β-sheet regions in a protein sequence.
• Classification of sequences and fragments within a database.
A HMM can be seen as a finite state machine in which the next state depends
only on the current state (memory of the present, not of the past), and an
observation (parameter) vector is associated with each transition between
states. A Markov model can thus be said to have two associated processes: a
hidden one, not directly observable, corresponding to the transitions between
states, and an observable one (directly related to the first), whose
realizations are the parameter vectors produced from each state, which form the
pattern to be recognized.
In a normal Markov model there is a one-to-one correspondence between states
and symbols; that is, each state corresponds to the emission of one particular
symbol, whereas in a HMM every state is usually able to emit all the symbols.
d) Relax. You are buying a drink from the notorious crazy coke machine while
absent-mindedly solving some exercises. The machine can be in either of two
states: a cola-preferring state (CP) and a mineral-water-preferring state (WP).
When you put in a coin, the output of the machine can be described with the
following probability matrix:

        cola   mineral water   lemonade
 CP     0.6    0.1             0.3
 WP     0.1    0.7             0.2
Draw the state model graph with transition probabilities P(CP→WP) = 0.3 and
P(WP→CP) = 0.5 and calculate the probability of seeing the output sequence
{cola, lemonade} if the machine always starts off in the WP state. What kind of
behaviour would make this HMM a (visible) Markov model?
The state model graph for the HMM is the following:

[State model graph: WP emits cola 0.1, water 0.7, lemonade 0.2; CP emits cola
0.6, water 0.1, lemonade 0.3; transitions WP→CP = 0.5, WP→WP = 0.5,
CP→WP = 0.3, CP→CP = 0.7.]
We can calculate the probability of seeing the sequence {cola, lemonade}
starting in WP by using the forward algorithm as follows:

fWP(cola) = eWP(cola) πWP = 0.1 * 1 = 0.1
fCP(cola) = eCP(cola) πCP = 0.6 * 0 = 0
fWP(lemonade) = (fWP(cola)*aWP→WP + fCP(cola)*aCP→WP) eWP(lemonade) = (0.1*0.5 + 0*0.3)*0.2 = 0.01
fCP(lemonade) = (fWP(cola)*aWP→CP + fCP(cola)*aCP→CP) eCP(lemonade) = (0.1*0.5 + 0*0.7)*0.3 = 0.015
Therefore:
P(cola, lemonade | λ) = fWP(lemonade) + fCP(lemonade) = 0.01 + 0.015 = 0.025
The HMM would reduce to a visible Markov model if each state emitted one
distinct symbol deterministically: the output would then identify the state, so
the state sequence would be directly observable.
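A few lines of Python confirm the forward calculation above; the state encoding
and variable names are my own:

    pi = {"WP": 1.0, "CP": 0.0}                        # always starts in WP
    A = {("WP", "WP"): 0.5, ("WP", "CP"): 0.5,
         ("CP", "WP"): 0.3, ("CP", "CP"): 0.7}
    E = {"WP": {"cola": 0.1, "water": 0.7, "lemonade": 0.2},
         "CP": {"cola": 0.6, "water": 0.1, "lemonade": 0.3}}
    states = ("WP", "CP")

    def forward(obs):
        f = {s: pi[s] * E[s][obs[0]] for s in states}
        for o in obs[1:]:                              # one forward step per symbol
            f = {s: sum(f[r] * A[r, s] for r in states) * E[s][o] for s in states}
        return sum(f.values())

    print(forward(["cola", "lemonade"]))               # 0.025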
f) Compare briefly the FSAs and pairwise HMMs in searching.
In probabilistic modelling we can fit the model to a very large amount of data;
as this amount tends to infinity, the likelihood reaches its maximum for the
model, and the parameters of this optimal model (the emission and transition
matrices) maximize the likelihood of the data.
Because of this, if the parameters of a pair HMM describe the statistics of
pairs of related sequences well, then we should use that model with those
parameter values for searching. We could also use Bayesian model comparison
against another model that gives a good description of the generation of random
sequences.
However, most currently used algorithms for these probabilistic models have two
weaknesses. First, they do not compute the full probability of the pair of
sequences, summing over all alignments, but instead find the best match (the
Viterbi path); in this case a model whose parameters match the data is not
necessarily the best search model. Second, regarded as FSAs, their parameters
may not translate readily into probabilities.
Therefore we can conclude that probabilistic models may underperform standard
alignment methods if Viterbi is used for database searching, but if the forward
algorithm is used to provide a complete score independent of any specific
alignment, then pair HMMs may improve upon the standard methods.
g) Analyze the relation of the profile HMMs, HMMs, and PSSMs.
Profile HMMs are a particular type of HMM especially well suited to modelling
multiple alignments. A particular feature of protein family multiple alignments
is that gaps tend to line up with each other, leaving solid blocks without any
gap.
A first probabilistic model for comparison against these solid blocks specifies
the probability ei(a) of observing amino acid a in position i, compared to the
probability of a under a random model:
$$S = \sum_{i=1}^{L} \log \frac{e_i(x_i)}{q_{x_i}} \quad\rightarrow\quad \text{log-odds score of a new sequence } x$$
These values behave like elements of a score matrix, which is why the method is
called a Position Specific Score Matrix (PSSM); it is used to search for a
match to the sequence x within a longer sequence. PSSMs are useful for finding
conservation information in protein families, but they do not take account of
gaps. The best approach for that is to build a HMM with a repetitive structure
of states but different probabilities in each position. A PSSM can be viewed as
a trivial HMM, to which we can add features that take insertions and deletions
into account.
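Before moving on to the profile HMM, here is a small Python illustration of the
PSSM score S above; the three-column matrix and uniform background are invented
for the example:

    import math

    # hypothetical emission probabilities e[i][a] for three match positions
    e = [{"A": 0.6, "C": 0.1, "G": 0.2, "T": 0.1},
         {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
         {"A": 0.2, "C": 0.1, "G": 0.6, "T": 0.1}]
    q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # random background model

    def pssm_score(x):
        """Log-odds score S of sequence x against the PSSM."""
        return sum(math.log(e[i][a] / q[a]) for i, a in enumerate(x))

    print(pssm_score("ACG"))    # the consensus scores highest
    print(pssm_score("TTT"))    # a poor match scores negatively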
[Profile HMM architecture: for each position j a match state Mj, an insert
state Ij, and a delete state Dj, chained from Begin to End.]
The profile HMM arises from conditioning the pair HMM used for pairwise
alignment (for example, comparing sequences x and y) on emitting the sequence y
as one of the sequences in its alignment. Because of this, the same Viterbi
equations can be used to find the most probable alignment of x to our profile
HMM.
h) Assume we have a number of sequences that are 50 residues long, and that a
pairwise comparison of two such sequences takes one second of CPU time on our
computer. An alignment of four sequences takes $(2L)^{N-2} = 10^{2N-4} = 10^4$
seconds (a few hours). If we had unlimited memory and we were willing to wait
for the answer until just before the sun burns out in five billion years, how
many sequences could our computer align?
We have to align N sequences of length 50. A single pairwise comparison takes 1
second, the alignment of 4 such sequences takes $10^4$ seconds, and we have
five billion ($5 \times 10^9$) years:

$$5 \times 10^9 \text{ years} \approx 1.5768 \times 10^{17} \text{ seconds}$$

Therefore:

$$(2L)^{N-2} = 10^{2N-4} = 1.5768 \times 10^{17}$$

Taking logarithms:

$$(2N-4)\log(10) = \log(1.5768 \times 10^{17})$$

And:

$$N = \left\lfloor \frac{\log_{10}(1.5768 \times 10^{17}) + 4}{2} \right\rfloor = \lfloor 10.60 \rfloor = 10 \text{ sequences}$$

It means that in five billion years we can align 10 sequences of this length
with this computer.
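The arithmetic can be checked in a couple of lines of Python (using a 365-day
year, as above):

    import math

    t = 5e9 * 365 * 24 * 3600                     # five billion years in seconds
    print(t)                                      # 1.5768e+17
    print(math.floor((math.log10(t) + 4) / 2))    # 10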
i) Explain briefly the pros, cons, and the most useful applications of the different
progressive alignment methods and profile HMMs in multiple alignment.
There are several methods that can be used for multiple alignment:

• Feng-Doolittle progressive alignment
• Profile alignment (CLUSTALW)
• Iterative refinement (Barton-Sternberg)
• Profile HMMs
The Feng-Doolittle method is quite fast due to its clustering algorithm (Fitch
& Margoliash), but it uses only pairwise alignments to determine the multiple
alignment. This means that once a group of sequences has been aligned, the
method does not use position-specific information from it to align the next
sequence; when a subalignment has been built up, it does not change during the
rest of the process.
The CLUSTALW method has higher accuracy and flexibility than the previous one,
and it is very simple for linear gap scoring, but it has similar weak points:
subalignments that have already been built remain unchanged until the end of
the alignment, and it always aligns whole sequences, never taking into account
which fragments of the sequences are alignable and which are not.
The Barton-Sternberg method has the advantage of being more flexible than the
progressive alignment methods, but its computational cost is also higher.
The use of profile HMMs for multiple alignment gives us a more biologically
realistic view of the problem; it is semantically better because it takes into
account whether the current fragment of the sequence is meaningfully alignable.
On the other hand, it is difficult to apply because the construction of the
model is not obvious, and estimating a multiple alignment from unaligned
sequences is a very hard task.
j) Explain the reasons why there are methods for fragment assembly and physical
mapping.
The problem of DNA sequencing becomes unapproachable when the continuous
stretches are over a few hundred bases. However, there are methods to split a
long DNA molecule into random pieces and to produce enough copies of the pieces
to sequence. We can therefore sequence fragments of the molecule, but this
leaves us with the problem of assembling the fragments, and that is why
fragment assembly methods exist.
However, even using these methods to sequence DNA, we are restricted to a few
tens of thousands of DNA base pairs, whereas a chromosome is a DNA molecule
with around 10^8 base pairs. This means that these methods only let us analyze
a very small fraction of the chromosome and cannot reveal larger structures in
the DNA molecule. To do this there are methods for creating maps of entire
chromosomes, or of significant parts of them, called techniques for Physical
Mapping of DNA.
l) Explain why two restriction sites for two different restriction enzymes cannot
coincide.
The restriction site mapping technique uses this kind of "fingerprint", which
uniquely describes the information in a given fragment of the DNA molecule cut
by a restriction enzyme. These fingerprints are used to compare the fragments,
mainly by overlapping, to accomplish the DNA mapping.
One restriction enzyme can recognize one or more restriction sites, but two
different restriction enzymes cannot recognize the same restriction site. A
restriction site is inferred from information on the lengths of the fragments;
a fragment's length is thus its fingerprint. Therefore a restriction site
uniquely represents a particular restriction enzyme.
EXERCISES 2
a) Prove that any unoriented permutation over n elements can be sorted with at
most n-1 reversals.
In order to sort an unoriented permutation of n elements we can do reversals
over 2 to n elements:

Reversal over 2 elements: 62374 → 26374
Reversal over 5 elements: 62374 → 47326
This lets us make, in the worst case, one reversal for each element except the
last two, which can be ordered with a single reversal over them. Thus at most
n−1 reversals sort the whole permutation, simply choosing the next element to
be placed as the endpoint of each reversal.
For example, for a permutation of 6 elements:

246531 → 135642 → 124653 → 123564 → 123465 → 123456

Therefore, we have sorted the permutation in 5 reversals.
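A short Python sketch of this strategy, which reproduces the example above (the
function name is my own):

    def sort_by_reversals(perm):
        """Place each value v in turn by reversing from position v to v's location."""
        p, count = list(perm), 0
        for v in range(1, len(p)):          # after n-1 steps the last element is forced
            i = p.index(v)
            if i != v - 1:
                p[v - 1:i + 1] = reversed(p[v - 1:i + 1])
                count += 1
            print(p)                        # intermediate permutations
        return count

    print(sort_by_reversals([2, 4, 6, 5, 3, 1]))    # 5 reversals, i.e. n - 1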
b) Give an example of an RNA sequence where a knot may appear.
With an RNA sequence like:

 1 2 3 4 5 6 7 8 9 10 11 12 13
 A A U U G A A A A C  A  A  G

we may find a knot if the base pairs (r4, r9) and (r5, r10) belong to the set
of pairs S of a possible RNA secondary structure for the given sequence: U4·A9
and G5·C10 are both valid pairs, and their indices interleave (4 < 5 < 9 < 10),
so neither pair is nested inside the other.
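A tiny Python check of the crossing condition i < k < j < l that defines a knot
(the helper name is my own):

    def crossing(p1, p2):
        """True if base pairs p1 and p2 cross, i.e. i < k < j < l."""
        (i, j), (k, l) = sorted([tuple(sorted(p1)), tuple(sorted(p2))])
        return i < k < j < l

    print(crossing((4, 9), (5, 10)))    # True: the two pairs form a knot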
c) Consider a simple HMM that models two kinds of base composition in DNA. The
model has two states fully interconnected by four state transitions. State 1 emits
CG-rich sequence with probabilities (pa, pc, pg, pt) = {0.1, 0.4, 0.4, 0.1} and state 2
emits AT-rich sequence with probabilities (pa, pc, pg, pt) = {0.3, 0.2, 0.2, 0.3}.
a) Draw this HMM.
b) Set the transition probabilities so that the expected length of a run of
state 1s is 1000 bases and the expected length of a run of state 2s is 100
bases.
c) Give the same model in stochastic regular grammar form with terminals,
non-terminals, and production rules with their associated probabilities.
The HMM could be represented as follows, with the transition probabilities set
to a12 = 1/1000 and a21 = 1/100, since the expected length of a run of state k
is the inverse of the probability of leaving it:
[HMM diagram: state S1 (pa 0.1, pc 0.4, pg 0.4, pt 0.1) with self-transition
1−(1/1000) and transition 1/1000 to S2; state S2 (pa 0.3, pc 0.2, pg 0.2,
pt 0.3) with self-transition 1−(1/100) and transition 1/100 to S1.]
This model can be represented as the following stochastic regular grammar,
where a production Sk → x Sl gets probability a_kl · e_l(x):

Terminals = {A, C, G, T}
Non-terminals = {S1, S2}
Production rules:

S1 → A S2   (0.3 · 1/1000)       S1 → A S1   (0.1 · (1 − 1/1000))
S1 → C S2   (0.2 · 1/1000)       S1 → C S1   (0.4 · (1 − 1/1000))
S1 → G S2   (0.2 · 1/1000)       S1 → G S1   (0.4 · (1 − 1/1000))
S1 → T S2   (0.3 · 1/1000)       S1 → T S1   (0.1 · (1 − 1/1000))

S2 → A S1   (0.1 · 1/100)        S2 → A S2   (0.3 · (1 − 1/100))
S2 → C S1   (0.4 · 1/100)        S2 → C S2   (0.2 · (1 − 1/100))
S2 → G S1   (0.4 · 1/100)        S2 → G S2   (0.2 · (1 − 1/100))
S2 → T S1   (0.1 · 1/100)        S2 → T S2   (0.3 · (1 − 1/100))
Note that this satisfies the requirement that the probabilities of all the
productions leaving each non-terminal sum to one.
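A quick Python check of that requirement:

    from itertools import product

    e = {"S1": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
         "S2": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
    a = {("S1", "S1"): 1 - 1/1000, ("S1", "S2"): 1/1000,
         ("S2", "S1"): 1/100,      ("S2", "S2"): 1 - 1/100}

    # P(Sk -> x Sl) = a(k, l) * e_l(x), so each non-terminal's row must sum to 1
    for k in ("S1", "S2"):
        print(k, sum(a[k, l] * e[l][x] for l, x in product(("S1", "S2"), "ACGT")))
        # 1.0 for both states (up to float rounding)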
d) Show that the Nussinov folding algorithm can be trivially extended to find a
maximally scoring structure where a base pair between residues a and b gets a
score s(a,b). (For instance, we might set s(G, C) = 3 and s(A, U) = 2 to better reflect
the increased thermodynamic stability of GC pairs.)
The basic Nussinov algorithm, given a sequence x of length L with symbols
x1,...,xL, scores pairs as follows:

$$\delta(i,j) = \begin{cases} 1 & \text{if } x_i \text{ and } x_j \text{ are a base pair} \\ 0 & \text{if not} \end{cases}$$
So we can modify it in such a way that:

$$\delta(i,j) = \begin{cases} 3 & \text{if } x_i = G \text{ and } x_j = C \text{, or vice versa} \\ 2 & \text{if } x_i = A \text{ and } x_j = U \text{, or vice versa} \\ 0 & \text{otherwise} \end{cases}$$
We then apply the algorithm in the same way; at the end, the score in the upper
right corner of the matrix is the value of the maximally scoring valid RNA
secondary structure for the given sequence.
After that we can perform the traceback exactly as in the basic Nussinov
algorithm and obtain the structure with the highest score.
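A minimal Python sketch of the modified fill follows; no minimum loop length is
imposed, matching the basic algorithm, and the names are my own:

    def nussinov_fill(seq, score):
        """Fill g[i][j] = best score of a structure on seq[i..j] (0-indexed)."""
        L = len(seq)
        g = [[0] * L for _ in range(L)]
        for span in range(1, L):
            for i in range(L - span):
                j = i + span
                inner = g[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(g[i + 1][j],                   # i unpaired
                           g[i][j - 1],                   # j unpaired
                           inner + score(seq[i], seq[j])) # i pairs with j
                for k in range(i + 1, j):                 # bifurcation
                    best = max(best, g[i][k] + g[k + 1][j])
                g[i][j] = best
        return g

    def score(a, b):
        if {a, b} == {"G", "C"}: return 3
        if {a, b} == {"A", "U"}: return 2
        return 0

    g = nussinov_fill("GGGAAAUCC", score)
    print(g[0][-1])                                       # 8

Here g[0][-1] is the upper right corner of the matrix shown below.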
Applying the new algorithm to an example sequence (GGGAAAUCC) we obtain the
following matrix:

      G  G  G  A  A  A  U  C  C
 G    0  0  0  0  0  0  2  5  8
 G    0  0  0  0  0  0  2  5  8
 G       0  0  0  0  0  2  5  5
 A          0  0  0  0  2  2  2
 A             0  0  0  2  2  2
 A                0  0  2  2  2
 U                   0  0  0  0
 C                      0  0  0
 C                         0  0
By tracing back we obtain the following structure of score 8: a hairpin with
stem pairs G2·C9 and G3·C8 (score 3 each) and A4·U7 (score 2), a loop formed by
A5 and A6, and G1 left unpaired.
e) The authors [SSK, Ch. 11] propose a method for computing a substitution
matrix for amino acids. Characterize the differences from the PAM matrices.
Comment on the proposal. Where would you use each?
The method suggested by the authors is to build a substitution matrix (the WAC
matrix) based on statistical descriptions of amino acid environments, taking as
a measure the tolerance of a microenvironment for a new amino acid, based on
the properties of the environment surrounding the replaced amino acid.
Other substitution (score) matrices, like the PAM or BLOSUM matrices, have been
obtained empirically from measurements of the frequency with which alternative
residues substitute for each other at each position in aligned sequences. These
substitution matrices are mostly used for general protein searching and
homology scoring by pairwise alignment, but they have also shown good
performance in detecting protein sequences related to a certain family.