Lecture 3 Identifying Templates: Profile Hidden Markov Models • Durbin R, Eddy. S, Krough A, and Mitchenson G, Biological Sequence Analysis, Cambridge University Press, 2002 • Eddy S, What is a Hidden Markov Model? Nature Biotechnology 22:1315-1316, 2004 • Eddy S, Profile Hidden Markov Models, Bioinformatics Review, vol. 19, no. 8, pp. 755-763, 1998 • Numerical example in this presentation are from: http://web.cs.iastate.edu/~cs544/ What is this lecture about? • In this lecture we will introduce Hidden Markov Models (HMMs) as an advanced tool to detect distant evolutionary relationships among proteins • From a practical standpoint, HMMs are similar to Position Specific Substitution matrices in that the try to “learn” the rules of substitution from a precompiled (small) alignment • Before delving into the technicalities of profile HMMs (I will show you how to use HMMER), we will take a long detour… Let’s start from far away: stochastic processes A stochastic process: • Collection of random variables representing the evolution of a dynamical system • In discrete time there is a random variable associated to each time t, xt • It is random as opposed to a deterministic time series (e.g. trajectory of a particle in Newtonian mechanics) Example: stock market fluctuations It would be “nice” to know what is going to happen tomorrow based on the knowledge of the market history Central question • Given that a certain time series has been observed, what is the probability of observing a certain outcome at time t? P(x*t|xt-1, xt-2, xt-3, …,x1) ? Note: this is a conditional probability (the outcome is conditioned on the history) Let’s brush up on probability… A Probability is a number assigned to each subset (they are called events) of a sample space satisfying the following rules: • For any event A, 0 P(A) 1. • P() =1. • If A1, A2, … An is a partition of A, then P(A) = P(A1) + P(A2) + ...+ P(An) (A1, A2, … An is called a partition of A if A1 A2 … An = A and A1, A2, … An are mutually exclusive.) Events A and B occur with joint probability P(A∩B) A A∩B B Conditional Probability P(A Ç B) P(A | B) = P(B) We are restricting the sample space to B (think of this number as a normalization factor), we say this in words: What is the probability that A occur given that B occurred? A A∩B B Markov chains x1 x2 x3 … • Markov chains are stochastic processes that undergo transitions between a finite series of states in a chainlike manner. • The system transverses states with probability: p(x1, x2, x3, ...) = p(x1) p(x2| x1) p(x3| x2) p(x4| x3)… • Markov chains are memoryless: the probability that the chain is in state xi at step t, depends only on the state at the previous time step and not on the past history of the states visited before time t−1. Markov chains x1 x2 x3 … • This specific kind of "memorylessness" is called Markov property • The Markov property states that the conditional probability distribution for the system at the next step (and in fact at all future steps) depends only on the current state of the system, and not additionally on the state of the system at previous steps. Example of a Markov Model Numbers close to the arrows represent transition probabilities, e.g. P(Bear Market at time t=i | Bull Market at time t=i-1). Probabilities are normalized, this allows to easily obtain the “self-transition” probabilities. With this model we can at least formalize our question on the financial markets and explore the consequences of a mathematical model Image from wikipedia Hidden Markov Models What is special (and most importantly what is hidden) in HMM? • We know the “topology” of the model and all the transition probabilities, but… • Each state emits a set of symbols and we can only observe these symbols • Since a given symbol can in principle be emitted by different states, we do not actually know the state of the system, we can only make a probabilistic inference on it Example: the occasionally dishonest casino Problem: a casino uses sometimes a fair die and sometimes a loaded one, we observe a sequence of outcomes and we would like to know the sequence of die types • Symbols – {1,2,3,4,5,6} • Transition probabilities – Prob(Fair Loaded) = 0.01 – Prob(Loaded Fair) = 0.2 • Emission probabilities: – Fair die: Prob(1) = Prob(2) = . . . = Prob(6) = 1/6 – Loaded die: Prob(1) = Prob(2) = . . . = Prob(5) = 1/10, Prob(6) = ½ Graphical representation of the model for the occasionally dishonest casino 0.99 0.8 0.01 F L 0.2 The occasionally dishonest casino • Assumed: Transition probabilities Emission probabilities • Hidden states: types of die: FLFFFLLFLFLFLFFFFFLF... • Observed: faces of the die 63556235243434664514... • Questions we can ask: When was a fair die used? Is the casino cheating? [Better: what is the probability that outcomes come from our HMM? Is it bigger than that corresponding to a single fair die?] Making the inference • Using the HMM we can assign a probability to each explanation of the observed sequence: P(436|FFL) = P(4|F)·P(FF)·P(3|F)·P(FL)·P(6|L) = 1/6 · 0.99 · 1/6 · 0.01 · ½ • Most likely sequence of hidden states: Find the path that maximizes the probability of getting as outcome the observed sequence • Total probability of the observed sequence: Consider all the possible (weighted) paths that might have resulted in the observed sequence Some notation • In our multivariate probability distribution functions, we need to distinguish between two types of variables, x and , which denote the symbols emitted and the state occupied, respectively. The subscript indicates the time, e.g. xi is the symbol emitted at time i and i is the state occupied at time i. x and (vectors) indicate an entire (time ordered) sequence. • alm is the probability of making a transition from state k to state r: alm = P(p i = m | p i-1 = l) • el(σ) is the probability that symbol σ is emitted when in state k el (s ) = P(xi = s | p i = l) Each explanation of the observations is associated to a path of known probability F F F … F L L L … L x1 x2 x3 L P(x, p ) = P(p1 )Õ ep i (xi )× ap ip i+1 i=1 xL How do we find the most probable path? p = argmax P(x, p ) * *, p It turns out that can be found in polynomial time (in L) by using a divide and conquer strategy: consider, among all possible paths producing x and ending in state l, the one of maximum probability: Markov property implies the following recursion relation vl (i) = el (xi )max ( vm (i -1)aml ) m The Viterbi Algorithm • Start from the initial state (i = 0), which is (with certainty) the state {BEGIN} v{BEGIN } (0) =1, vl (0) = 0 "l • Then use the recursion relation to increment the time (from i = 1 to i=L). Compute vl(i) for all the sates l vl (i) = el (xi )max ( vm (i -1)aml ) m • Stop when after reaching i=L to transition to the {END} state: P(x, p ) = max ( vl (L)al{END} ) * l Using a trace-back matrix, as in in dynamic programming, it is easy to find * Viterbi: Example x 6 2 6 F (1/6)(1/2) = 1/12 (1/6)max{(1/12)0.99, (1/4)0.2} = 0.0138 (1/6)max{0.013750.99, 0.020.2} = 0.0023 L (1/2)(1/2) = 1/4 (1/10)max{(1/12)0.01, (1/4)0.8} = 0.02 (1/2)max{0.013750.01, 0.020.8} = 0.08 vl (i) = el (xi )max ( vm (i -1)aml ) m The path maximizing the probability “closely approximate” what happened in reality Image from wikipedia What if we do not know the transition probabilities? Supervised learning: We use a set of know instances (training set) to find the optimal parameters for the HMM Total probability Several distinct paths result in observation x. All the paths contribute to the likelihood of an observation, each with its own weight: P(x) = å P(x, p ) p We can use this Total Probability to distinguish observations that are likely to be “emitted” by our model from all the others. Tuning the parameters (transition and emission probabilities), we can customtailor the model on a specific set of observations How we compute total probability? P(x) is calculated in a similar way as the probability of the most probable path. fl (i) = Probability of observing x1,… , xi assuming that π i = l Also in this case a recursion relation holds: fl (i) = el (xi )å fm (i -1)aml m The Forward Algorithm • Start from the initial state (i = 0), which is (with certainty) the state {BEGIN} f{BEGIN } (0) =1, fl (0) = 0 "l • Then use the recursion relation to increment the time (from i = 1 to i=L). Compute fl(i) for all the sates l fl (i) = el (xi )å fm (i -1)aml m • Stop when after reaching i=L to transition to the {END} state: P(x) = å fl (L)al{END} l The total probability gives a recipe for finding the optimal transition and emission probabilities • Baum-Welch algorithm • Transition and emission probabilities are “guessed” • Guesses are iteratively refined to increase the total probability of the observations in the training dataset • Viterbi training • Transition and emission probabilities are “guessed” • The most probable paths for the training dataset observations are found using the Viterbi algorithm • The Viterbi paths are used to collect statistics about transitions and emissions (the parameters are updated) • The last two steps are repeated until convergence (the path does not change Profile HMMs (at last!) • The goal is to create a different model for each phylogenetic family • An initial “seed” multiple sequence alignment is needed to “train” the model • The parameters of model are chosen so to that sequences of the family have maximum total probability • Transition and emission probabilities are positionspecific • For all the sequences of the database one can calculate the total probability (to decide if the sequence belong to the family) and the optimal path (to find the best alignment with the initial multiple sequence alignment) Profile HMMs Taken from Bioinformatics Sequence and Genome Analysis, David W. Mount (http://compbio.pbworks.com/w/page/16252909/Multiple%20Sequence%20Alignm ent) Profile HMMs: Example Source: http://www.csit.fsu.edu/~swofford/bioinformatics_spring05/ Pfam (http://pfam.xfam.org/) • The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). • The Pfam protein families database: towards a more sustainable future: R.D. Finn, P. Coggill, R.Y. Eberhardt, S.R. Eddy, J. Mistry, A.L. Mitchell, S.C. Potter, M. Punta, M. Qureshi, A. Sangrador-Vegas, G.A. Salazar, J. Tate, A. Bateman Nucleic Acids Research (2016) Database Issue 44:D279-D285