Lecture 3
Identifying Templates: Profile Hidden Markov Models
• Durbin R, Eddy S, Krogh A, and Mitchison G, Biological Sequence Analysis, Cambridge University Press, 2002
• Eddy S, What is a Hidden Markov Model? Nature Biotechnology 22:1315-1316, 2004
• Eddy S, Profile Hidden Markov Models, Bioinformatics 14(9):755-763, 1998
• Numerical examples in this presentation are from: http://web.cs.iastate.edu/~cs544/
What is this lecture about?
• In this lecture we will introduce Hidden Markov Models (HMMs) as an advanced tool to detect distant evolutionary relationships among proteins
• From a practical standpoint, HMMs are similar to Position Specific Substitution matrices in that they try to “learn” the rules of substitution from a precompiled (small) alignment
• Before delving into the technicalities of profile HMMs (I will show you how to use HMMER), we will take a long detour…
Let’s start from far away: stochastic processes

A stochastic process:
• Collection of random variables representing the evolution of a dynamical system
• In discrete time there is a random variable associated with each time t, denoted x_t
• It is random, as opposed to a deterministic time series (e.g. the trajectory of a particle in Newtonian mechanics)
Example: stock market fluctuations
It would be “nice” to know what is going to happen tomorrow based on knowledge of the market’s history
Central question
• Given that a certain time series has been observed, what is the probability of observing a certain outcome x* at time t?

P(x_t = x* | x_{t-1}, x_{t-2}, x_{t-3}, …, x_1) ?

Note: this is a conditional probability (the outcome is conditioned on the history)
Let’s brush up on probability…
A probability is a number assigned to each subset (they are called events) of a sample space Ω, satisfying the following rules:
• For any event A, 0 ≤ P(A) ≤ 1.
• P(Ω) = 1.
• If A_1, A_2, …, A_n is a partition of A, then
  P(A) = P(A_1) + P(A_2) + … + P(A_n)
(A_1, A_2, …, A_n is called a partition of A if A_1 ∪ A_2 ∪ … ∪ A_n = A and A_1, A_2, …, A_n are mutually exclusive.)
Events A and B occur with joint probability P(A∩B)

[Venn diagram: sample space Ω containing events A and B, with overlap A∩B]
Conditional Probability

P(A|B) = P(A∩B) / P(B)

We are restricting the sample space to B (think of P(B) as a normalization factor). In words: what is the probability that A occurs, given that B occurred?

[Venn diagram: sample space Ω containing events A and B, with overlap A∩B]
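As a quick illustration (a worked example added here, not in the original slides), take a single roll of a fair six-sided die:

```latex
% A = "the roll is a 2", B = "the roll is even" (fair six-sided die)
% P(A \cap B) = 1/6 and P(B) = 1/2, so
P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{1/6}{1/2} = \frac{1}{3}
```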
Markov chains

[Chain diagram: x_1 → x_2 → x_3 → …]

• Markov chains are stochastic processes that undergo transitions between a finite set of states in a chain-like manner.
• The system traverses states with probability (see the short sketch after this slide):
  p(x_1, x_2, x_3, …) = p(x_1) p(x_2|x_1) p(x_3|x_2) p(x_4|x_3) …
• Markov chains are memoryless: the probability that the chain is in state x_i at step t depends only on the state at the previous time step, and not on the past history of the states visited before time t−1.
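A minimal sketch of this factorization for a hypothetical two-state chain; the state names and numbers below are illustrative assumptions, not values from the lecture:

```python
# Minimal sketch of the Markov chain factorization
#   p(x1, x2, ..., xL) = p(x1) * p(x2|x1) * p(x3|x2) * ...
# The two states and their probabilities are made-up numbers for illustration.

initial = {"Bull": 0.5, "Bear": 0.5}             # p(x1)
transition = {                                    # p(x_t | x_{t-1})
    "Bull": {"Bull": 0.9, "Bear": 0.1},
    "Bear": {"Bull": 0.3, "Bear": 0.7},
}

def chain_probability(states):
    """Probability of observing a given sequence of states."""
    p = initial[states[0]]
    for prev, curr in zip(states, states[1:]):
        p *= transition[prev][curr]
    return p

print(chain_probability(["Bull", "Bull", "Bear", "Bear"]))  # 0.5*0.9*0.1*0.7 = 0.0315
```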
Markov chains

[Chain diagram: x_1 → x_2 → x_3 → …]

• This specific kind of "memorylessness" is called the Markov property
• The Markov property states that the conditional probability distribution for the system at the next step (and in fact at all future steps) depends only on the current state of the system, and not additionally on the state of the system at previous steps.
Example of a Markov Model

Numbers next to the arrows represent transition probabilities, e.g. P(Bear Market at time t=i | Bull Market at time t=i−1). Probabilities are normalized, which makes it easy to obtain the “self-transition” probabilities. With this model we can at least formalize our question about the financial markets and explore the consequences of a mathematical model.

Image from Wikipedia
Hidden Markov Models
What is special (and most importantly, what is hidden) in an HMM?
• We know the “topology” of the model and all the transition probabilities, but…
• Each state emits a set of symbols and we can only observe these symbols
• Since a given symbol can in principle be emitted by different states, we do not actually know the state of the system; we can only make a probabilistic inference about it
Example: the occasionally dishonest casino

Problem: a casino sometimes uses a fair die and sometimes a loaded one; we observe a sequence of outcomes and we would like to know the sequence of die types

• Symbols
  – {1, 2, 3, 4, 5, 6}
• Transition probabilities
  – Prob(Fair → Loaded) = 0.01
  – Prob(Loaded → Fair) = 0.2
• Emission probabilities:
  – Fair die: Prob(1) = Prob(2) = … = Prob(6) = 1/6
  – Loaded die: Prob(1) = Prob(2) = … = Prob(5) = 1/10, Prob(6) = 1/2
Graphical representation of the model for the occasionally dishonest casino

[State diagram: states F (fair) and L (loaded); self-transitions F→F = 0.99 and L→L = 0.8; transitions F→L = 0.01 and L→F = 0.2]
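A minimal sketch (not part of the original slides) of how this model can be written down as plain Python data structures; the values come from the slide, while the variable names are just illustrative:

```python
# The occasionally dishonest casino HMM, written as plain dictionaries.

states = ["F", "L"]                      # F = fair die, L = loaded die
symbols = [1, 2, 3, 4, 5, 6]

transition = {                           # a[l][m] = P(state m at step i | state l at step i-1)
    "F": {"F": 0.99, "L": 0.01},
    "L": {"F": 0.20, "L": 0.80},
}

emission = {                             # e[l][s] = P(symbol s emitted | state l)
    "F": {s: 1 / 6 for s in symbols},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}
```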
The occasionally dishonest casino

• Assumed:
  – Transition probabilities
  – Emission probabilities
• Hidden states: types of die:
  FLFFFLLFLFLFLFFFFFLF...
• Observed: faces of the die
  63556235243434664514...
• Questions we can ask:
  – When was a fair die used?
  – Is the casino cheating? [Better: what is the probability that the outcomes come from our HMM? Is it bigger than that corresponding to a single fair die?]
Making the inference

• Using the HMM we can assign a probability to each explanation of the observed sequence (a quick numerical check follows this slide):
  P(436|FFL)
  = P(4|F) · P(F→F) · P(3|F) · P(F→L) · P(6|L)
  = 1/6 · 0.99 · 1/6 · 0.01 · 1/2
• Most likely sequence of hidden states: find the path that maximizes the probability of getting the observed sequence as outcome
• Total probability of the observed sequence: consider all the possible (weighted) paths that might have resulted in the observed sequence
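A one-line numerical check of the product above (an illustrative sketch, not part of the slides):

```python
# Probability of the explanation FFL for the observed outcomes 4, 3, 6:
# emissions P(4|F), P(3|F), P(6|L) interleaved with transitions F->F and F->L.
p = (1 / 6) * 0.99 * (1 / 6) * 0.01 * 0.5
print(p)  # ~1.375e-04
```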
Some notation

• In our multivariate probability distribution functions we need to distinguish between two types of variables, x and π, which denote the symbols emitted and the states occupied, respectively. The subscript indicates the time, e.g. x_i is the symbol emitted at time i and π_i is the state occupied at time i. x and π (vectors) indicate an entire (time-ordered) sequence.
• a_lm is the probability of making a transition from state l to state m:
  a_lm = P(π_i = m | π_{i-1} = l)
• e_l(σ) is the probability that symbol σ is emitted when in state l:
  e_l(σ) = P(x_i = σ | π_i = l)
Each explanation of the observations is associated with a path of known probability

[Trellis diagram: at each position x_1, x_2, x_3, …, x_L the hidden state is either F or L]

P(x, π) = P(π_1) · ∏_{i=1..L} e_{π_i}(x_i) · a_{π_i π_{i+1}}
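As an illustration (not part of the slides), this formula can be evaluated directly for the casino model; the 1/2 initial probabilities and the omission of the final END transition are assumptions that mirror the worked example later in the lecture:

```python
# Joint probability P(x, pi) of an observation sequence x and a state path pi:
#   P(x, pi) = P(pi_1) * prod_i e_{pi_i}(x_i) * a_{pi_i, pi_{i+1}}
# The final transition to an END state is omitted for simplicity.

initial = {"F": 0.5, "L": 0.5}          # assumed P(pi_1), matching the Viterbi example later
transition = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.20, "L": 0.80}}
emission = {"F": {s: 1 / 6 for s in range(1, 7)},
            "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}

def joint_probability(x, path):
    """P(x, pi): initial probability, then emission at each step and transition between steps."""
    p = initial[path[0]] * emission[path[0]][x[0]]
    for i in range(1, len(x)):
        p *= transition[path[i - 1]][path[i]] * emission[path[i]][x[i]]
    return p

print(joint_probability([4, 3, 6], ["F", "F", "L"]))  # 0.5 * 1/6 * 0.99 * 1/6 * 0.01 * 0.5 ≈ 6.9e-05
```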
How do we find the most probable path?

π* = argmax_π P(x, π)

It turns out that π* can be found in polynomial time (in L) by using a divide and conquer strategy: consider, among all possible paths emitting x_1, …, x_i and ending in state l, the one of maximum probability, and call its probability v_l(i). The Markov property implies the following recursion relation:

v_l(i) = e_l(x_i) · max_m [ v_m(i−1) · a_ml ]
The Viterbi Algorithm

• Start from the initial state (i = 0), which is (with certainty) the state {BEGIN}:
  v_{BEGIN}(0) = 1,  v_l(0) = 0 for all other states l
• Then use the recursion relation to increment the time (from i = 1 to i = L). Compute v_l(i) for all the states l:
  v_l(i) = e_l(x_i) · max_m [ v_m(i−1) · a_ml ]
• Stop after reaching i = L by making the transition to the {END} state:
  P(x, π*) = max_l [ v_l(L) · a_l,{END} ]

Using a trace-back matrix, as in dynamic programming, it is easy to find π*; a short code sketch follows.
Viterbi: Example

Observed sequence: x = 6, 2, 6

F:
  i = 1:  (1/6)(1/2) = 1/12
  i = 2:  (1/6)·max{ (1/12)·0.99, (1/4)·0.2 } = 0.0138
  i = 3:  (1/6)·max{ 0.01375·0.99, 0.02·0.2 } = 0.0023

L:
  i = 1:  (1/2)(1/2) = 1/4
  i = 2:  (1/10)·max{ (1/12)·0.01, (1/4)·0.8 } = 0.02
  i = 3:  (1/2)·max{ 0.01375·0.01, 0.02·0.8 } = 0.008

v_l(i) = e_l(x_i) · max_m [ v_m(i−1) · a_ml ]
The path maximizing the probability “closely approximates” what happened in reality

Image from Wikipedia
What if we do not know the transition probabilities?

Supervised learning: we use a set of known instances (a training set) to find the optimal parameters for the HMM
Total probability

Several distinct paths can result in observation x. All the paths contribute to the likelihood of an observation, each with its own weight:

P(x) = Σ_π P(x, π)

We can use this total probability to distinguish observations that are likely to be “emitted” by our model from all the others. By tuning the parameters (transition and emission probabilities), we can custom-tailor the model to a specific set of observations
How do we compute the total probability?

P(x) is calculated in a similar way to the probability of the most probable path.

f_l(i) = probability of observing x_1, …, x_i with π_i = l

Also in this case a recursion relation holds:

f_l(i) = e_l(x_i) · Σ_m f_m(i−1) · a_ml
The Forward Algorithm

• Start from the initial state (i = 0), which is (with certainty) the state {BEGIN}:
  f_{BEGIN}(0) = 1,  f_l(0) = 0 for all other states l
• Then use the recursion relation to increment the time (from i = 1 to i = L). Compute f_l(i) for all the states l:
  f_l(i) = e_l(x_i) · Σ_m f_m(i−1) · a_ml
• Stop after reaching i = L by making the transition to the {END} state (see the sketch after this slide):
  P(x) = Σ_l f_l(L) · a_l,{END}
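A matching sketch of the forward recursion, under the same assumptions as the Viterbi sketch above (BEGIN → F and BEGIN → L with probability 1/2 each, END transition ignored):

```python
# Forward-algorithm sketch for the occasionally dishonest casino.

transition = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.20, "L": 0.80}}
emission = {"F": {s: 1 / 6 for s in range(1, 7)},
            "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}
begin = {"F": 0.5, "L": 0.5}

def forward(x):
    """Total probability P(x), summing over all hidden-state paths."""
    f = {l: begin[l] * emission[l][x[0]] for l in transition}
    for xi in x[1:]:
        f = {l: emission[l][xi] * sum(f[m] * transition[m][l] for m in transition)
             for l in transition}
    return sum(f.values())

print(forward([6, 2, 6]))  # ≈ 0.0125: larger than the single best-path probability (≈ 0.008)
```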
The total probability gives a recipe for finding the optimal transition and emission probabilities

• Baum-Welch algorithm
  • Transition and emission probabilities are “guessed”
  • Guesses are iteratively refined to increase the total probability of the observations in the training dataset
• Viterbi training
  • Transition and emission probabilities are “guessed”
  • The most probable paths for the training dataset observations are found using the Viterbi algorithm
  • The Viterbi paths are used to collect statistics about transitions and emissions, and the parameters are updated (a sketch of this counting step follows this slide)
  • The last two steps are repeated until convergence (the paths do not change)
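A minimal sketch (an illustration, not the lecture’s implementation) of the counting step of Viterbi training: given training sequences paired with their current Viterbi paths, transition and emission probabilities are re-estimated from the observed counts. The toy (sequence, path) pair is made up:

```python
# Parameter re-estimation step of Viterbi training (sketch).
# Real implementations also add pseudocounts to avoid zero probabilities.
from collections import Counter, defaultdict

def reestimate(pairs):
    trans_counts = defaultdict(Counter)   # counts of l -> m transitions along the paths
    emit_counts = defaultdict(Counter)    # counts of symbol s emitted while in state l
    for x, path in pairs:
        for l, m in zip(path, path[1:]):
            trans_counts[l][m] += 1
        for s, l in zip(x, path):
            emit_counts[l][s] += 1
    # normalize the counts into probabilities
    transition = {l: {m: c / sum(trans_counts[l].values()) for m, c in trans_counts[l].items()}
                  for l in trans_counts}
    emission = {l: {s: c / sum(emit_counts[l].values()) for s, c in emit_counts[l].items()}
                for l in emit_counts}
    return transition, emission

pairs = [([6, 2, 6, 6, 1], ["F", "F", "L", "L", "F"])]   # toy (sequence, Viterbi path) pair
print(reestimate(pairs))
```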
Profile HMMs (at last!)

• The goal is to create a different model for each phylogenetic family
• An initial “seed” multiple sequence alignment is needed to “train” the model
• The parameters of the model are chosen so that sequences of the family have maximum total probability
• Transition and emission probabilities are position-specific
• For all the sequences of the database one can calculate the total probability (to decide if the sequence belongs to the family) and the optimal path (to find the best alignment with the initial multiple sequence alignment)
Profile HMMs

Taken from Bioinformatics: Sequence and Genome Analysis, David W. Mount
(http://compbio.pbworks.com/w/page/16252909/Multiple%20Sequence%20Alignment)
Profile HMMs: Example
Source: http://www.csit.fsu.edu/~swofford/bioinformatics_spring05/
Pfam (http://pfam.xfam.org/)
• The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
• The Pfam protein families database: towards a more sustainable future. R.D. Finn, P. Coggill, R.Y. Eberhardt, S.R. Eddy, J. Mistry, A.L. Mitchell, S.C. Potter, M. Punta, M. Qureshi, A. Sangrador-Vegas, G.A. Salazar, J. Tate, A. Bateman. Nucleic Acids Research (2016) Database Issue 44:D279-D285