Phylogenetics: Bayesian Phylogenetic Analysis
COMP 571 - Spring 2016
Luay Nakhleh, Rice University

Bayes Rule

    P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)
                     = P(X = x) P(Y = y | X = x) / Σ_{x'} P(X = x') P(Y = y | X = x')

Bayes Rule Example (from "Machine Learning: A Probabilistic Perspective")

Consider a woman in her 40s who decides to have a mammogram. Question: if the test is positive, what is the probability that she has cancer? The answer depends on how reliable the test is!

Suppose the test has a sensitivity of 80%; that is, if a person has cancer, the test will be positive with probability 0.8. If we denote by x = 1 the event that the mammogram is positive, and by y = 1 the event that the person has breast cancer, then p(x=1 | y=1) = 0.8.

Does the probability that the woman in our example (who tested positive) has cancer equal 0.8? No! That ignores the prior probability of having breast cancer, which, fortunately, is quite low: p(y=1) = 0.004. Further, we need to take into account the fact that the test may give a false positive. Mammograms have a false positive probability of p(x=1 | y=0) = 0.1.

Combining all these facts using Bayes rule, and using p(y=0) = 1 - p(y=1), we get:

    p(y=1 | x=1) = p(x=1 | y=1) p(y=1) / [ p(x=1 | y=1) p(y=1) + p(x=1 | y=0) p(y=0) ]
                 = (0.8 × 0.004) / (0.8 × 0.004 + 0.1 × 0.996)
                 = 0.031

7.2 Bayesian phylogenetic inference

How does Bayesian reasoning apply to phylogenetic inference? Assume we are interested in the relationships between human, gorilla, and chimpanzee. In the standard case, we need an additional species to root the tree, and the orangutan would be an appropriate outgroup here. There are three possible ways of arranging these species in a phylogenetic tree: the chimpanzee is our closest relative, the gorilla is our closest relative, or the chimpanzee and the gorilla are each other's closest relatives (Fig. 7.1).

[Fig. 7.1: A Bayesian phylogenetic analysis. We start the analysis by specifying our prior beliefs about the tree; in the prior panel, the same probability is assigned to each of the three topologies A, B, and C. We then collect data (observations) and use a stochastic evolutionary model and Bayes' theorem to update the prior to a posterior probability distribution. If the data are informative, most of the posterior probability will be focused on one tree (or a small subset of trees in a large tree space).]

Before the analysis, we need to specify our prior beliefs about the relationships. For example, in the absence of background data, a simple solution would be to assign equal probability to the possible trees; this is an uninformative prior.
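The same arithmetic can be checked directly. Below is a minimal sketch (plain Python, no external libraries; the variable names are ours, not from the original slides) that plugs the numbers above into Bayes rule.

```python
# Bayes rule for the mammogram example:
# p(y=1 | x=1) = p(x=1|y=1) p(y=1) / [ p(x=1|y=1) p(y=1) + p(x=1|y=0) p(y=0) ]

sensitivity = 0.8        # p(x=1 | y=1)
prior_cancer = 0.004     # p(y=1)
false_positive = 0.1     # p(x=1 | y=0)

numerator = sensitivity * prior_cancer
denominator = numerator + false_positive * (1.0 - prior_cancer)
posterior = numerator / denominator

print(round(posterior, 3))  # 0.031: a positive test still leaves only about a 3% probability of cancer
```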
To update the prior, we need some data, typically in the form of a molecular sequence alignment, and a stochastic model of the process generating the data on the tree. In principle, Bayes rule is then used to obtain the posterior probability distribution, which is the result of the analysis. The posterior specifies the probability of each tree given the model, the prior, and the data. When the data are informative, most of the posterior probability is typically concentrated on one tree (or a small subset of trees in a large tree space).

To describe the analysis mathematically, consider the matrix of aligned sequences X, the tree topology parameter τ, and the branch lengths of the tree, collected in the vector ν. Typically, there are also some substitution model parameters to be considered but, for now, let us use the Jukes-Cantor substitution model (see below), which does not have any free parameters. Thus, in our case, θ = (τ, ν).

Bayes' theorem allows us to derive the posterior distribution as

    f(θ | X) = f(θ) f(X | θ) / f(X)                                              (7.5)

The denominator is an integral over the parameter values, which evaluates to a summation over the discrete topologies and a multidimensional integration over the possible branch length values:

    f(X) = ∫ f(θ) f(X | θ) dθ = Σ_τ ∫_ν f(ν) f(X | τ, ν) dν                      (7.6, 7.7)
Even though our model is as simple as phylogenetic models come, it is impossible to portray its parameter space accurately in one dimension. However, imagine for a while that we could do just that. Then the parameter axis might have three distinct regions corresponding to the three different tree topologies (Fig. 7.2). Within each region, the different points on the axis would represent different branch length values. The one-dimensional parameter axis allows us to obtain a picture of the posterior probability function or surface. It would presumably have three distinct peaks, each corresponding to an optimal combination of topology and branch lengths.

[Fig. 7.2: Posterior probability distribution for our phylogenetic analysis. The x-axis is an imaginary one-dimensional representation of the parameter space. It falls into three different regions corresponding to the three different topologies. Within each region, a point along the axis corresponds to a particular set of branch lengths on that topology. It is difficult to arrange the space such that optimal branch length combinations for different topologies are close to each other; therefore, the posterior distribution is multimodal. The area under the curve falling in each tree topology region (48%, 32%, and 20%) is the posterior probability of that tree topology.]

The marginal probability distribution on topologies: why are they called marginal probabilities?

Think of the parameter space as a two-dimensional table (Fig. 7.3), with the columns representing the different tree topologies and the rows representing different branch length values. Because branch lengths are continuous, there would in principle be an infinite number of rows, but imagine that we sorted the possible branch length values into discrete bins, so that we get a finite number of rows. For instance, if we considered only short and long branches, one bin would have all branches long, another would have the terminal branches long and the interior branch short, etc.

                  τ_A     τ_B     τ_C   | marginal
        ν_A       0.10    0.07    0.12  |  0.29
        ν_B       0.05    0.22    0.06  |  0.33
        ν_C       0.05    0.19    0.14  |  0.38
        ------------------------------------------
        marginal  0.20    0.48    0.32

[Fig. 7.3: A two-dimensional table representation of parameter space. The columns represent different tree topologies, the rows represent different branch length bins. Each cell in the table holds the joint probability of a particular combination of branch lengths and topology. Summarizing the probabilities along the margins of the table gives the marginal probabilities for the topologies (bottom row) and for the branch length bins (last column).]

Now, assume that we can derive the posterior probability that falls in each of the cells in the table. These are joint probabilities because they represent the joint probability of a particular topology and a particular set of branch lengths. If we summarize all joint probabilities along one axis of the table, we obtain the marginal probabilities for the corresponding parameter. To obtain the marginal probabilities for the topologies, for instance, we sum the entries in each column. It is traditional to write the sums in the margin of the table, hence the term marginal probability (Fig. 7.3). It would also be possible to sum the probabilities in each row of the table. This would give us the marginal probabilities for the branch length combinations (Fig. 7.3). Typically, this distribution is of no particular interest, but the possibility of calculating it illustrates an important property of Bayesian inference: there is no sharp distinction between different types of model parameters.
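As a small illustration of marginalization, here is a sketch (plain Python; the array layout and names are ours) that recovers the marginal probabilities in Fig. 7.3 by summing the joint probabilities over the columns and over the rows.

```python
# Joint posterior probabilities from Fig. 7.3:
# rows = branch-length bins (nu_A, nu_B, nu_C), columns = topologies (tau_A, tau_B, tau_C)
joint = [
    [0.10, 0.07, 0.12],  # nu_A
    [0.05, 0.22, 0.06],  # nu_B
    [0.05, 0.19, 0.14],  # nu_C
]

# Marginal probability of each topology: sum each column.
topology_marginals = [sum(row[j] for row in joint) for j in range(3)]
print([round(m, 2) for m in topology_marginals])   # [0.2, 0.48, 0.32]

# Marginal probability of each branch-length bin: sum each row.
branch_marginals = [sum(row) for row in joint]
print([round(m, 2) for m in branch_marginals])     # [0.29, 0.33, 0.38]
```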
Once the posterior probability distribution is obtained, we can derive any marginal distribution of interest.

Markov chain Monte Carlo Sampling

In most cases, it is impossible to derive the posterior probability distribution analytically. Even worse, we cannot even estimate it by drawing random samples from it. The reason is that most of the posterior probability is likely to be concentrated in a small part of a vast parameter space. The solution is to estimate the posterior probability distribution using Markov chain Monte Carlo sampling, or MCMC for short.

Monte Carlo = random simulation. Markov chain = the state of the simulator depends only on its current state. Irreducible Markov chains (those whose state graph is strongly connected) have the property that they converge towards an equilibrium distribution regardless of the starting point. We just need to set up a Markov chain that converges onto our posterior probability distribution!

Transition probabilities

A Markov chain can be defined by describing the full set of probability statements that give the rules for the state of the chain in the next step (step i + 1) given its current state (in step i). These transition probabilities are analogous to transition rates if we are working with continuous-time Markov processes.

Consider the simplest possible Markov chain: one with two states that operates in discrete time. Figure 1 shows the states in circles and the transition probabilities as arcs connecting the states, with the probabilities next to the lines. The full probability statements that correspond to the graph are:

    P(x_{i+1} = 0 | x_i = 0) = 0.4
    P(x_{i+1} = 1 | x_i = 0) = 0.6
    P(x_{i+1} = 0 | x_i = 1) = 0.9
    P(x_{i+1} = 1 | x_i = 1) = 0.1

[Figure 1: A graphical depiction of a two-state Markov process.]

Note that, because these are probabilities, some of them must sum to one. In particular, if we are in a particular state at step i (call it x_i), then in the next step we must be in some state, so 1 = Σ_j P(x_{i+1} = j | x_i) for every possible x_i.

Note also that the state at step i + 1 depends only on the state at step i. This is the Markov property. More formally, we could state it as

    P(x_{i+1} | x_i) = P(x_{i+1} | x_i, x_{i-k})

where k is a positive integer. What this probability statement is saying is that, conditional on x_i, the state at i + 1 is independent of the state at any point before i. So when working with Markov chains we do not need to concern ourselves with the full history of the chain; merely knowing the state at the previous step is enough.
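To make the "random simulation" part concrete, here is a minimal sketch (plain Python; the function name and structure are ours, not from the course materials) that simulates a single trajectory of the two-state chain defined above.

```python
import random

# Transition probabilities of the two-state chain in Figure 1,
# indexed as trans[current_state][next_state].
trans = {
    0: {0: 0.4, 1: 0.6},
    1: {0: 0.9, 1: 0.1},
}

def simulate_chain(start_state, n_steps, seed=None):
    """Return the list of visited states, starting from start_state."""
    rng = random.Random(seed)
    states = [start_state]
    for _ in range(n_steps):
        current = states[-1]
        # Draw the next state using only the current state (the Markov property).
        next_state = 1 if rng.random() < trans[current][1] else 0
        states.append(next_state)
    return states

walk = simulate_chain(start_state=0, n_steps=10000, seed=1)
# Long-run fraction of steps spent in state 0 (about 0.6; see the
# stationary distribution discussed below).
print(sum(1 for s in walk if s == 0) / len(walk))
```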
What about probabilities more than one step ahead: for an arbitrary step i, what are P(x_i = 0 | x_0 = 0), P(x_i = 1 | x_0 = 0), P(x_i = 0 | x_0 = 1), and P(x_i = 1 | x_0 = 1)?

Clearly, if we know x_i and the transition probabilities, then we can make a probabilistic statement about the state in the next iteration (in fact, the transition probabilities are exactly these probabilistic statements). But we can also think about the probability that the chain will be in a particular state two steps from now:
    P(x_{i+2} = 0 | x_i = 0) = P(x_{i+2} = 0 | x_{i+1} = 1) P(x_{i+1} = 1 | x_i = 0) + P(x_{i+2} = 0 | x_{i+1} = 0) P(x_{i+1} = 0 | x_i = 0)
                             = 0.9 × 0.6 + 0.4 × 0.4
                             = 0.7

Here we are exploiting the fact that the same "rules" (transition probabilities) apply when we consider state changes between i + 1 and i + 2. If the transition probabilities are fixed through the running of the Markov chain, then we are dealing with a time-homogeneous Markov chain. There are second-order Markov processes that depend on the two previous states, third-order Markov processes, and so on, but in this course we will just be dealing with the simplest Markov chains, which depend only on the current state.

More generally, we can propagate the probabilities one step at a time:

    P(x_i = k | x_0 = ℓ) = P(x_i = k | x_{i-1} = 0) P(x_{i-1} = 0 | x_0 = ℓ) + P(x_i = k | x_{i-1} = 1) P(x_{i-1} = 1 | x_0 = ℓ)
Note that:

    P(x_{i+2} = 1 | x_i = 0) = P(x_{i+2} = 1 | x_{i+1} = 1) P(x_{i+1} = 1 | x_i = 0) + P(x_{i+2} = 1 | x_{i+1} = 0) P(x_{i+1} = 0 | x_i = 0)
                             = 0.1 × 0.6 + 0.6 × 0.4
                             = 0.3

So, if we sum over all possible states at i + 2, the relevant probability statements sum to one.

It gets tedious to continue this for a large number of steps into the future, but we can ask a computer to calculate the probability of being in state 0 for a large number of steps by repeating the calculations (see Appendix A for code). Figure 2 shows the probability of the Markov process being in state 0 as a function of the step number for the two possible starting states.

[Figure 2: The probability of being in state 0 as a function of the step number, i, for two different starting states (0 and 1) for the Markov process depicted in Figure 1.]

Note that the probability stabilizes to a steady-state distribution. Knowing whether the chain started in state 0 or 1 tells you very little about the state of the chain at step 15. Technically, x_15 and x_0 are not independent of each other; if you work through the math, the probability of the state at step 15 does depend on the starting state:

    P(x_15 = 0 | x_0 = 0) = 0.599987792969
    P(x_15 = 0 | x_0 = 1) = 0.600018310547

But clearly these two probabilities are very close to being equal. If we consider an even larger number of iterations (e.g., the state at step 100), then the probabilities are so close that they are indistinguishable.
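In the spirit of the Appendix A code mentioned above, here is a minimal sketch (plain Python; not the original appendix code) that propagates the state probabilities forward step by step and reproduces the two step-15 values quoted above.

```python
# Transition probabilities: P[current][next] for the chain in Figure 1.
P = [[0.4, 0.6],
     [0.9, 0.1]]

def prob_state0(start_state, n_steps):
    """Probability of being in state 0 after n_steps, given the starting state."""
    # Probability distribution over states, initially all mass on start_state.
    p = [0.0, 0.0]
    p[start_state] = 1.0
    for _ in range(n_steps):
        # One step of propagation: p_new[j] = sum_k p[k] * P[k][j]
        p = [p[0] * P[0][j] + p[1] * P[1][j] for j in (0, 1)]
    return p[0]

print(prob_state0(0, 15))   # about 0.599987792969
print(prob_state0(1, 15))   # about 0.600018310547
print(prob_state0(0, 100))  # ~0.6, indistinguishable from prob_state0(1, 100)
```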
The chain converges to the same probability of being in state 0, regardless of the starting state!
Where does the 0.6 come from?
Stationary Distribution of a Markov Chain

It is the stationary distribution of the chain (frequently denoted π): the probability of sampling a chain in state 0 is 0.6. We would write this as π = {0.6, 0.4}, or as the pair of statements π_0 = 0.6 and π_1 = 0.4.
After a long enough walk, a Markov chain in which all of the states are connected (an irreducible chain) will converge to its stationary distribution. (A reducible chain, in contrast, has some states that are disconnected from others, so the chain can be broken up, or reduced, into two or more chains that each operate on a subset of the states.)

Many chains or one really long chain

When constructing arguments about Markov chains, we often flip back and forth between thinking about the behavior of an arbitrarily large number of instances of the random process, all obeying the same rules but performing their walks independently, and the behavior we expect if we sparsely sample a single very long chain. The idea is that if we sample sparsely enough, the sampled points from one chain are close to being independent of each other (as Figure 2 was meant to demonstrate). Thus we can think of a very long chain as if we had a large number of independent chains.

How can we show that π = {0.6, 0.4} is the steady-state (equilibrium) distribution? Imagine running infinitely many chains. At equilibrium, the "flux out" of each state must be equal to the "flux into" that state.
If we have an equilibrium probability distribution, we can characterize the probability of a chain leaving state 0 at one particular step in the process. This is the joint probability of being in state 0 at step i and making a 0 → 1 transition at step i:

    P(0 → 1 at i) = P(x_i = 0) P(x_{i+1} = 1 | x_i = 0)

At equilibrium, P(x_i = 0) = π_0, so the flux out of state 0 at step i is π_0 P(x_{i+1} = 1 | x_i = 0). In our simple (two-state) system, the flux into state 0 can only come from state 1, so:

    P(1 → 0 at i) = P(x_i = 1) P(x_{i+1} = 0 | x_i = 1)
                  = π_1 P(x_{i+1} = 0 | x_i = 1)
If we force the "flux out" and the "flux in" to be identical:

    π_0 P(x_{i+1} = 1 | x_i = 0) = π_1 P(x_{i+1} = 0 | x_i = 1)

where π_0 = P(x_i = 0) and π_1 = P(x_i = 1) at equilibrium.
Rearranging gives the ratio of the stationary frequencies:

    π_0 / π_1 = P(x_{i+1} = 0 | x_i = 1) / P(x_{i+1} = 1 | x_i = 0)
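As a quick check that the claimed stationary distribution balances the fluxes, plug in the numbers from Figure 1 (this worked arithmetic is ours, not from the original notes):

    π_0 P(x_{i+1} = 1 | x_i = 0) = 0.6 × 0.6 = 0.36
    π_1 P(x_{i+1} = 0 | x_i = 1) = 0.4 × 0.9 = 0.36

The flux out of state 0 equals the flux into state 0, as required at equilibrium.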
We can then solve for the actual frequencies by recognizing that π is a probability distribution:

    Σ_k π_k = 1, which here means π_0 + π_1 = 1
For the system that we are considering:

    π_0 / π_1 = 0.9 / 0.6 = 1.5
    π_0 = 1.5 π_1
    1.5 π_1 + π_1 = 1.0
    π_1 = 0.4
    π_0 = 0.6

Thus, the transition probabilities determine the stationary distribution. If we can choose the transition probabilities of the Markov chain, then we can construct a sampler that will converge to any distribution that we desire!
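The same algebra can be done numerically. Below is a minimal sketch (plain Python; the variable names are ours) that solves the flux-balance and normalization equations for the two-state chain.

```python
# State-changing transition probabilities of the chain in Figure 1.
p_01 = 0.6  # P(x_{i+1} = 1 | x_i = 0)
p_10 = 0.9  # P(x_{i+1} = 0 | x_i = 1)

# Flux balance: pi_0 * p_01 = pi_1 * p_10, so pi_0 / pi_1 = p_10 / p_01.
ratio = p_10 / p_01          # 1.5
pi_1 = 1.0 / (1.0 + ratio)   # normalization: pi_0 + pi_1 = 1
pi_0 = ratio * pi_1

print(pi_0, pi_1)  # 0.6 0.4 (up to floating-point rounding)
```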
When we have a state set S with more than two states, we can express a general statement about the stationary distribution:

    π_j = Σ_{k∈S} π_k P(x_{i+1} = j | x_i = k)                                   (1)

The connection to the "flux out" and "flux in" argument may not seem obvious, but note that the "flux out" can be quantified in terms of one minus the "self-transition" probability. Let S≠j be the set of states that excludes state j, so the flux out is:

    flux out of j = π_j P(x_{i+1} ∈ S≠j | x_i = j) = π_j [1 − P(x_{i+1} = j | x_i = j)]

The flux in is:

    flux into j = Σ_{k∈S≠j} π_k P(x_{i+1} = j | x_i = k)

Forcing the fluxes to be equal gives us:

    π_j [1 − P(x_{i+1} = j | x_i = j)] = Σ_{k∈S≠j} π_k P(x_{i+1} = j | x_i = k)

    π_j = π_j P(x_{i+1} = j | x_i = j) + Σ_{k∈S≠j} π_k P(x_{i+1} = j | x_i = k)
        = Σ_{k∈S} π_k P(x_{i+1} = j | x_i = k)                                   (2)
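As a quick numerical check of condition (1), the sketch below (plain Python; ours, not from the notes) verifies that π = {0.6, 0.4} satisfies π_j = Σ_k π_k P(x_{i+1} = j | x_i = k) for the two-state chain.

```python
# Full transition matrix, P[k][j] = P(x_{i+1} = j | x_i = k).
P = [[0.4, 0.6],
     [0.9, 0.1]]
pi = [0.6, 0.4]

# Condition (1): pi_j should equal sum_k pi_k * P[k][j] for every state j.
for j in range(2):
    lhs = pi[j]
    rhs = sum(pi[k] * P[k][j] for k in range(2))
    print(j, lhs, rhs)  # the two values agree (0.6 vs 0.6, 0.4 vs 0.4), up to rounding
```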
Figure 2: The probability of being in state 0, Pr(x_i = 0), as a function of the step number i, for the two different starting states (0 and 1) of the Markov process depicted in Figure (1).

Note that the probability stabilizes to a steady state distribution. Knowing whether the chain started in state 0 or 1 tells you very little about the state of the chain at step 15. Technically, x_15 and x_0 are not independent of each other; if you work through the math, the probability of the state at step 15 does depend on the starting state:

P(x_15 = 0 | x_0 = 0) = 0.599987792969
P(x_15 = 0 | x_0 = 1) = 0.600018310547

But clearly these two probabilities are very close to being equal. If we consider an even larger number of iterations (e.g. the state at step 100), then the probabilities are so close that they are indistinguishable.

Mixing

We just saw how changing the transition probabilities will affect the stationary distribution. Perhaps there is a one-to-one mapping between transition probabilities and a stationary distribution. In the form of a question: if we were given the stationary distribution, could we decide what the transition rates must be to achieve that distribution?

It turns out that the answer is "No," which can be seen by examining equation (2). Imagine dropping the "flux out" of each state by a factor of 10. Because the "self-transition" rate would increase by the appropriate amount, we would still end up in the same stationary distribution.

How would such a chain behave? The primary difference is that it would "mix" more slowly: adjacent steps would be more likely to be in the same state, and it would take a larger number of iterations before the chain "forgets" its starting state. Figure (3) depicts the same probability statements as Figure (2), but for a process in which P(x_{i+1} = 1 | x_i = 0) = 0.06 and P(x_{i+1} = 0 | x_i = 1) = 0.09.

Figure 3: The probability of being in state 0 as a function of the step number i, for the two different starting states (0 and 1) of the Markov process depicted in Figure (1), but with the state-changing transition probabilities scaled down by a factor of 10. Compare this to Figure (2).

Thus, the rate of convergence of a chain to its stationary distribution is an aspect of a Markov chain that is separate from what the stationary distribution is. In MCMC we will design a Markov chain such that its stationary distribution is identical to the posterior probability distribution over the space of parameters. We will try to design chains that have high transition probabilities, so that our MCMC approximation will quickly converge to the posterior. But even a slowly mixing chain will (in theory) eventually be capable of providing an adequate sample from its stationary distribution.
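To make the difference in mixing concrete, here is a short simulation sketch (ours, not from the notes) that counts how often adjacent steps sit in the same state for the original chain and for the scaled-down chain:

```python
# Simulate the two-state chain and report the fraction of adjacent steps that
# remain in the same state; slower-mixing chains have a much higher fraction.
import random

def same_state_fraction(p01, p10, n_steps, rng=random):
    """p01 = P(move 0->1), p10 = P(move 1->0)."""
    state, same = 0, 0
    for _ in range(n_steps):
        move_prob = p01 if state == 0 else p10
        new_state = (1 - state) if rng.random() < move_prob else state
        same += (new_state == state)
        state = new_state
    return same / n_steps

print(same_state_fraction(0.6, 0.9, 100_000))    # about 0.28 for the original chain
print(same_state_fraction(0.06, 0.09, 100_000))  # about 0.93 for the slowly mixing chain
```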
Mixing

In MCMC, we will design a Markov chain whose stationary distribution is identical to the posterior probability distribution over the space of parameters. We try to design chains that have high transition probabilities to achieve faster convergence.

Detailed Balance

In practice, the number of states is very large, and setting the transition probabilities so that we have equal flux into and out of every state is tricky. What we use instead is detailed balance: we restrict ourselves to Markov chains that satisfy, for all pairs of states j and k,

π_j P(x_{i+1} = k | x_i = j) = π_k P(x_{i+1} = j | x_i = k)

This can be achieved using several different methods, the most flexible of which is the Metropolis algorithm and its extension, the Metropolis–Hastings method. In the Metropolis–Hastings algorithm, we choose rules for constructing a random walk through the parameter space, and we adopt transition probabilities such that the stationary distribution of our Markov chain is equal to the posterior probability distribution:

π_θ = P(θ | Data)

Writing t_{ℓ,k} for the transition probability from state ℓ to state k, the desired property (detailed balance with the posterior as the stationary distribution) is

t_{ℓ,k} / t_{k,ℓ} = π_k / π_ℓ = [P(D | θ = k) P(θ = k) / P(D)] / [P(D | θ = ℓ) P(θ = ℓ) / P(D)]

P(D) cancels out, so it does not need to be computed! Therefore, we need to set the transition probabilities so that

t_{ℓ,k} P(D | θ = ℓ) P(θ = ℓ) = t_{k,ℓ} P(D | θ = k) P(θ = k)

However, an important problem arises when doing this. When dealing with states k and l, it could be that we need to set t_{k,l} = 1 and t_{l,k} = 0.5; when dealing with states k and m, it could be that we need to set t_{k,m} = 0.3 and t_{m,k} = 0.1. Then we have t_{k,l} + t_{k,m} = 1.3, which violates the fundamental rules of probability!

The same problem appears in our coin example. There, P(D | θ = 1) P(θ = 1) is three times higher than P(D | θ = 0) P(θ = 0), so we need to ensure that the probability of moving from 1 → 0 is 1/3 the probability of moving from 0 → 1. For instance, we could set m_{0,1} = 1 and m_{1,0} = 1/3, or we could choose m_{0,1} = 1/2 and m_{1,0} = 1/6. Which should we use? We would like an MCMC simulation that converges quickly, so we should set the transition probabilities as high as possible; using m_{0,1} = 1 and m_{1,0} = 1/3 sounds best. Similar arguments lead us to the conclusion that m_{1,2} = 1 and m_{2,1} = 3/8. However, this reveals a problem: setting these four transition probabilities in this way means that P(x_{i+1} = 0 | x_i = 1) = 1/3 and P(x_{i+1} = 2 | x_i = 1) = 1. But this violates the fundamental rules of probability: the probabilities of mutually exclusive events must sum to at most 1.0, and with our transition probabilities the total out of state 1 exceeds 1, as the sketch below illustrates.
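A tiny numeric check (ours, using only the m values quoted above) makes the violation explicit:

```python
# Using the detailed-balance ratios directly as transition probabilities breaks
# the rule that the move probabilities out of a state can sum to at most 1.
m = {(0, 1): 1.0, (1, 0): 1 / 3,   # m(0,1) / m(1,0) = 3
     (1, 2): 1.0, (2, 1): 3 / 8}   # m(1,2) / m(2,1) = 8/3

print(m[(1, 0)] + m[(1, 2)])       # 1.333..., i.e. > 1: not a valid set of move probabilities
```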
One solution is to view a transition probability as a joint event: (1) the move is proposed, with probability q, and (2) the move is accepted, with probability α. We denote the probability of proposing a move from j to k as q(j, k), and the acceptance probability of a move from j to k (given that the move was proposed) as α(j, k). If we use x′_{i+1} to denote the state proposed at step i + 1, then

q(j, k) = P(x′_{i+1} = k | x_i = j)
α(j, k) = P(x_{i+1} = k | x_i = j, x′_{i+1} = k)

This means that we can choose proposal probabilities that sum to one over all of the state-changing transitions, and then multiply them by the appropriate acceptance probabilities (keeping those as high as we can, but ≤ 1). For the coin example we get

m_{1,0} / m_{0,1} = q(1, 0) α(1, 0) / [q(0, 1) α(0, 1)] = P(D | θ = 0) P(θ = 0) / [P(D | θ = 1) P(θ = 1)]    (5)

and, in general,

t_{ℓ,k} / t_{k,ℓ} = q(ℓ, k) α(ℓ, k) / [q(k, ℓ) α(k, ℓ)] = P(D | θ = k) P(θ = k) / [P(D | θ = ℓ) P(θ = ℓ)]

We have a great deal of flexibility in selecting how we perform proposals on new states in MCMC. We do have to ensure that q(ℓ, k) > 0 whenever q(k, ℓ) > 0; failing to do this means that the ratio of proposal probabilities would be 0 in one direction and infinite in the other. It is fine for q(ℓ, k) = q(k, ℓ) = 0 for some pairs of states (we already said that it was OK if not all states were adjacent in the Markov chain; note that two states that are not adjacent still fulfill detailed balance because both fluxes are 0).

Once we have chosen a proposal scheme, though, we do not have as much flexibility when choosing whether or not to accept a proposal. Rearranging the terms above to isolate the acceptance probabilities gives

α(ℓ, k) / α(k, ℓ) = P(D | θ = k) P(θ = k) q(k, ℓ) / [P(D | θ = ℓ) P(θ = ℓ) q(ℓ, k)]

acceptance ratio = (likelihood ratio) × (prior ratio) × (Hastings ratio)

The central idea is to make small random changes to some current parameter values, and then accept or reject the changes according to the appropriate probabilities:

1. Start at an arbitrary point θ.
2. Make a small random move, to θ*.
3. Calculate the ratio r of the posterior probability of the new state θ* to that of the old state θ.
   (a) If r > 1, the new state is accepted.
   (b) If r < 1, the new state is accepted with probability r; if the new state is rejected, stay in the old state.
4. Go to step 2.

Fig. 7.4 (the posterior places 48%, 32%, and 20% probability on topologies A, B, and C; uphill moves such as θ → θ*_a are always accepted, downhill moves such as θ → θ*_b only sometimes): The Markov chain Monte Carlo (MCMC) procedure is used to generate a valid sample from the posterior. One first sets up a Markov chain that has the posterior as its stationary distribution. The chain is then started at a random point and run until it converges onto this distribution. In each step (generation) of the chain, a small change is made to the current values of the model parameters (step 2). The ratio r of the posterior probability of the new and current states is then calculated. If r > 1, we are moving uphill and the move is always accepted (3a). If r < 1, we are moving downhill and accept the new state with probability r (3b).
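To make the proposal/acceptance machinery concrete, here is a small Metropolis–Hastings sketch (ours, not from the notes) for a three-state parameter. The unnormalized posterior weights w = [1, 3, 8] are illustrative values consistent with the ratios quoted above (π_1/π_0 = 3 and π_2/π_1 = 8/3), not the actual numbers of the coin example:

```python
# Metropolis-Hastings on a three-state parameter with explicit q and alpha.
import random

w = [1.0, 3.0, 8.0]                 # unnormalized posterior weights (illustrative)
q = {0: [(1, 1.0)],                 # from state 0 we always propose state 1
     1: [(0, 0.5), (2, 0.5)],       # from state 1 we propose 0 or 2, each with prob 1/2
     2: [(1, 1.0)]}                 # from state 2 we always propose state 1

def q_prob(j, k):
    """Proposal probability q(j, k)."""
    return dict(q[j]).get(k, 0.0)

def mh_sample(n_steps, start=0, rng=random):
    counts = [0, 0, 0]
    state = start
    for _ in range(n_steps):
        targets, probs = zip(*q[state])
        new = rng.choices(targets, weights=probs)[0]      # propose a move
        # Metropolis-Hastings acceptance probability alpha(state, new)
        alpha = min(1.0, (w[new] * q_prob(new, state)) /
                         (w[state] * q_prob(state, new)))
        if rng.random() < alpha:
            state = new                                   # accept; otherwise stay put
        counts[state] += 1
    return [c / n_steps for c in counts]

print(mh_sample(200_000))   # roughly [0.083, 0.25, 0.667], i.e. [1/12, 3/12, 8/12]
```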
In symbols, the proposed value θ* is accepted with probability

r = min[1, ( f(θ*|X) / f(θ|X) ) × ( f(θ|θ*) / f(θ*|θ) )]                                      (7.8)
  = min[1, ( f(θ*) f(X|θ*) / f(X) ) / ( f(θ) f(X|θ) / f(X) ) × ( f(θ|θ*) / f(θ*|θ) )]         (7.9)
  = min[1, ( f(θ*) / f(θ) ) × ( f(X|θ*) / f(X|θ) ) × ( f(θ|θ*) / f(θ*|θ) )]                   (7.10)

The three ratios in the last equation are referred to as the prior ratio, the likelihood ratio, and the proposal ratio (or Hastings ratio), respectively. The first two ratios correspond to the ratio of the numerators in Bayes' theorem; note that the complex integral f(X) cancels out, so it never has to be computed.

Box 7.2 Proposal mechanisms

Four types of proposal mechanisms are commonly used to change continuous variables.

The simplest is the sliding window proposal: a continuous uniform distribution of width w is centered on the current value x, and the new value x* is drawn from this distribution. The "window" width w is a tuning parameter; a larger value of w results in more radical proposals and lower acceptance rates, while a smaller value leads to more modest changes and higher acceptance rates.

The normal proposal is similar to the sliding window except that it uses a normal distribution centered on the current value x. The variance σ² of the normal distribution determines how drastic the new proposals are and how often they will be accepted.

Both the sliding window and normal proposals can be problematic when the effect on the likelihood varies over the parameter range. For instance, changing a branch length from 0.01 to 0.03 is likely to have a dramatic effect on the posterior, but changing it from 0.51 to 0.53 will hardly be noticeable. In such situations, the multiplier proposal is appropriate. It is equivalent to a sliding window with width λ on the log scale of the parameter. A random number u is drawn from a uniform distribution on the interval (−0.5, 0.5) and the proposed value is x* = mx, where m = e^{λu}. If the value of λ takes the form 2 ln a, one will pick multipliers m in the interval (1/a, a).

The beta and Dirichlet proposals are used for simplex parameters. They pick new values from a beta or Dirichlet distribution centered on the current values of the simplex. Assume that the current values are (x1, x2). We then multiply them by a value α, which is a tuning parameter, and pick new values from the distribution Beta(αx1, αx2). The higher the value of α, the closer the proposed values will be to the current values. [The figure in Box 7.2 shows the Beta(7, 3) (α = 10) and Beta(70, 30) (α = 100) proposal densities around the current values x = (0.7, 0.3).]
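Minimal sketches of these continuous-parameter proposals (ours, not from the chapter; the function names are our own, while the tuning-parameter names w, σ, λ, and α follow the text):

```python
# Proposal mechanisms for continuous parameters, as described in Box 7.2.
import math
import random

def sliding_window(x, w, rng=random):
    """Propose x* uniformly from a window of width w centred on x."""
    return x + w * (rng.random() - 0.5)

def normal_proposal(x, sigma, rng=random):
    """Propose x* from a normal distribution centred on x with s.d. sigma."""
    return rng.gauss(x, sigma)

def multiplier(x, lam, rng=random):
    """Propose x* = m * x with m = exp(lam * u), u ~ Uniform(-0.5, 0.5).
    Equivalent to a sliding window of width lam on the log scale of x."""
    u = rng.random() - 0.5
    return x * math.exp(lam * u)

def beta_proposal(x1, x2, alpha, rng=random):
    """Propose new simplex values (x1*, x2*) from Beta(alpha*x1, alpha*x2);
    larger alpha keeps the proposal closer to the current values."""
    new_x1 = rng.betavariate(alpha * x1, alpha * x2)
    return new_x1, 1.0 - new_x1
```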
More complex moves are needed to change the tree topology. A common type uses stochastic branch rearrangements such as NNI, SPR, and TBR (see Chapter 8). For instance, the extending subtree pruning and regrafting (extending SPR) move chooses a subtree at random and then moves its attachment point, one branch at a time, until a random number u drawn from a uniform on (0, 1) becomes higher than a specified extension probability p. The extension probability p is a tuning parameter; the higher the value, the more drastic the proposed rearrangements will be.

7.4 Burn-in, mixing and convergence

If the chain is started from a random tree and arbitrarily chosen branch lengths, chances are that the initial likelihood is low. As the chain moves towards the regions in the posterior with high probability mass, the likelihood typically increases very rapidly; in fact, it almost always changes so rapidly that it is necessary to measure it on a log scale (Fig. 7.5). This early phase of the run is known as the burn-in, and the burn-in samples are often discarded because they are so heavily influenced by the starting point.

Fig. 7.5 (a trace plot of ln L against generation, showing the burn-in phase followed by a putative stationary phase): The likelihood values typically increase very rapidly during the initial phase of the run because the starting point is far away from the regions in parameter space with high posterior probability. This initial phase of the Markov chain is known as the burn-in, and the burn-in samples are typically discarded because they are so heavily influenced by the starting point. As the chain converges onto the target distribution, the likelihood values tend to reach a plateau. This phase of the chain is sampled with some thinning, primarily to save disk space.

As the chain approaches its stationary distribution, the likelihood values tend to reach a plateau. This is the first sign that the chain may have converged onto the target distribution.
Therefore, the plot of the likelihood values against the generation of the chain, known as the trace plot (Fig. 7.5), is important in monitoring the performance of an MCMC run. The samples in the burn-in region are discarded. However, it is extremely important to confirm convergence using other diagnostic tools, because it is not sufficient for the chain to reach the region of high probability in the posterior: it must also cover this region adequately. The speed with which the chain covers the interesting regions of the posterior is known as its mixing behavior. The better the mixing, the faster the chain will generate an adequate sample of the posterior.

Fig. 7.6 (three traces of sampled values against generation for the same target distribution: (a) too modest proposals, acceptance rate too high, poor mixing; (b) too bold proposals, acceptance rate too low, poor mixing; (c) moderately bold proposals, intermediate acceptance rate, good mixing): The time it takes for a Markov chain to obtain an adequate sample of the posterior depends critically on its mixing behavior, which can be controlled to some extent by the proposal tuning parameters. If the proposed values are very close to the current ones, all proposed changes are accepted but it takes a long time for the chain to cover the posterior; mixing is poor. If the proposed values tend to be dramatically different from the current ones, most proposals are rejected and the chain will remain on the same value for a long time, again leading to poor mixing. The best mixing is obtained at intermediate values of the tuning parameters, associated with moderate acceptance rates.
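The pattern in Fig. 7.6 is easy to reproduce. The following small sketch (ours; the standard normal target is our choice) samples with a sliding-window Metropolis proposal and shows how the acceptance rate falls as the window width grows:

```python
# Acceptance rate of a sliding-window Metropolis sampler on a N(0,1) target
# for different window widths (cf. Fig. 7.6).
import math
import random

def log_target(x):
    return -0.5 * x * x          # log density of N(0,1), up to a constant

def acceptance_rate(width, n_steps=50_000, rng=random):
    x, accepted = 0.0, 0
    for _ in range(n_steps):
        x_new = x + width * (rng.random() - 0.5)          # sliding window proposal
        # Symmetric proposal, so the Hastings ratio is 1.
        if math.log(rng.random()) < log_target(x_new) - log_target(x):
            x, accepted = x_new, accepted + 1
    return accepted / n_steps

for w in (0.1, 2.0, 50.0):
    print(w, acceptance_rate(w))
# Very small windows: acceptance near 1 but tiny moves (poor mixing).
# Very large windows: almost everything is rejected (poor mixing).
# Intermediate widths give moderate acceptance rates and the best mixing.
```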
The mixing behavior of a Metropolis sampler can be adjusted using its tuning parameter(s). Assume, for instance, that we are sampling from a normal distribution using a sliding window proposal (Fig. 7.6). The sliding window proposal has one tuning parameter, the width of the window. If the width is too small, the proposed value will be very similar to the current one (Fig. 7.6a). The posterior probabilities will also be very similar, so the proposal will tend to be accepted. But each proposal will only move the chain a tiny distance in parameter space, so it will take the chain a long time to cover the entire region of interest; mixing is poor.

In Bayesian MCMC sampling of phylogenetic problems, the tree topology is typically the most difficult parameter to sample from. Therefore, it makes sense to focus on this parameter when monitoring convergence.

Summarizing the results

The stationary phase of the chain is typically sampled with some thinning, for instance every 50th or 100th generation. Once an adequate sample is obtained, it is usually trivial to compute an estimate of the marginal posterior distribution for the parameter(s) of interest. For example, this can take the form of a frequency histogram of the sampled values. When it is difficult to visualize this distribution or when space does not permit it, various summary statistics are used instead. The most common approach to summarizing topology posteriors is to give the frequencies of the most common splits, since there are far fewer splits than topologies; a small sketch of this kind of tally follows.
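A minimal sketch (ours, not from the chapter) of tallying split frequencies from a topology sample. Here each sampled tree is represented simply as a set of clades (frozensets of taxon names); a real analysis would parse the trees from the MCMC output instead:

```python
# Summarize a topology sample by the posterior frequency of each split.
from collections import Counter

sampled_trees = [
    {frozenset({"human", "chimp"}), frozenset({"human", "chimp", "gorilla"})},
    {frozenset({"human", "chimp"}), frozenset({"human", "chimp", "gorilla"})},
    {frozenset({"chimp", "gorilla"}), frozenset({"human", "chimp", "gorilla"})},
]

split_counts = Counter(split for tree in sampled_trees for split in tree)
n = len(sampled_trees)
for split, count in split_counts.most_common():
    print(sorted(split), count / n)   # posterior frequency of each split
```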
Summary

The remaining material summarizes traditional and Bayesian approaches to phylogeny estimation (Source: Nat Rev Genet, 4:275, 2003).

Box 2 | The phylogenetic inference process

[Box 2 figure: a flowchart running Collect data → Retrieve homologous sequences → Multiple sequence alignment → Model selection → input for phylogenetic estimation; traditional approaches then estimate a "best" tree and assess confidence, while Bayesian approaches yield a "best" tree with measures of support; the process ends with hypothesis testing. The example data are shown as a NEXUS-format DNA alignment of primate sequences.]

The flowchart puts phylogenetic estimation into the context of an entire study. After new sequence data are collected, the first step is usually downloading other relevant sequences. Typically, a few outgroup sequences are included in a study to root the tree (that is, to indicate which nodes in the tree are the oldest), provide clues about the early ancestral sequences, and improve the estimates of parameters in the model of evolution.

Insertions and deletions obscure which of the sites are homologous. Multiple-sequence alignment is the process of adding gaps to a matrix of data so that the nucleotides (or amino acids) in one column of the matrix are related to each other by descent from a common ancestral residue (a gap in a sequence indicates that the site has been lost in that species, or that a base was inserted at that position in some of the other species). Although models of sequence evolution that incorporate insertions and deletions have been proposed55–58, most phylogenetic methods proceed using an aligned matrix as the input (see REF. 59 for a review of the interplay between alignment and tree inference).

In addition to the data, the scientist must choose a model of sequence evolution (even if this means just choosing a family of models and letting software infer the parameters of these models). Increasing model complexity improves the fit to the data but also increases the variance in estimated parameters and can lead to computational intractability; model selection60–63 strategies attempt to find the appropriate level of complexity on the basis of the available data. Pragmatic concerns sometimes outweigh statistical ones (for example, NJ and parsimony are mainly justifiable by their speed).

As discussed in BOX 3, the data and a model can be used to create a sample of trees, either through Markov chain Monte Carlo (MCMC) or through multiple tree searches on bootstrapped data (the "traditional" approach). This collection of trees is often summarized using consensus-tree techniques, which show the parts of the tree that are found in most, or all, of the trees in the set. Although useful, CONSENSUS METHODS are just one way of summarizing the information in a group of trees. AGREEMENT SUBTREES are more resistant to "rogue sequences" (one or a few sequences that are difficult to place on the tree); the presence of such sequences can make a consensus tree relatively unresolved, even when there is considerable agreement on the relationships between the other sequences. Sometimes, the bootstrap or MCMC sample might show substantial support for multiple trees that are not topologically similar. In such cases, presenting more than one tree (or more than one consensus of trees) might be the only way to appropriately summarize the data.

Box 1 | Applications of phylogenetic methods

Detection of orthology and paralogy. Phylogenetics is commonly used to sort out the history of gene duplications in gene families. This application is now included in even preliminary examinations of sequence data; for example, the initial analysis of the mouse genome46 included neighbour-joining trees to identify duplications in cytochrome P450 and other gene families.

Estimating divergence times. Bayesian implementations of new models37,38 allowed Aris-Brosou and Yang40 to estimate when animal phyla diverged without assuming a molecular clock.
Reconstructing ancient proteins. Chang et al.47 used maximum likelihood (ML) to reconstruct the sequence of visual pigments in the last common ancestor of birds and alligators; the protein was then synthesized in the laboratory (see REF. 48 for a recent discussion of the methodology of ancestral-character-state reconstruction).

Finding the residues that are important to natural selection. Amino-acid sites on the surface of influenza that are targeted by the immune system can be detected by an excess of non-synonymous substitutions49–51. This information might assist vaccine preparation.

Detecting recombination points. New Bayesian methods52 can help determine which strains of human immunodeficiency virus-1 (HIV-1) arose from recombination.

Identifying mutations likely to be associated with disease. The lack of structural, biochemical and functional data for many genes implicated in disease means it is unclear which missense mutations are important. Fleming et al.53 used Bayesian phylogenetics to identify missense mutations in conserved regions and regions under positive selection in the breast cancer gene BRCA1. These data allowed them to prioritize these mutations for future functional and population studies.

Determining the identity of new pathogens. Phylogenetic analysis is now routinely performed after polymerase chain reaction (PCR) amplification of genomic fragments of previously unknown pathogens. Such analyses made possible the rapid identification of both Hantavirus54 and West Nile virus4,5.

Multiple substitutions at the same site cause distantly related sequences to diverge from each other more slowly than is expected, or even to become more similar to each other at some residues. Many of the sites in a DNA sequence are not helpful for phylogenetic reconstruction: some are functionally constrained so that they are invariant among all known sequences; others evolve rapidly (and, therefore, are not reliable indicators of deep relationships). As a result of such "noise", several phylogenetic hypotheses can often explain the data reasonably well, so the researcher must take this uncertainty into account. Commonly used methods often ignore this uncertainty by producing a single estimate of the tree, which is then treated as the true tree; additional (potentially time-consuming) steps are necessary to measure the strength of support for the groupings of sequences in the tree. Recently, BAYESIAN approaches to phylogenetic inference have provided a method for simultaneously estimating trees and obtaining measurements of uncertainty for every branch. Here, we briefly review the most popular methods of tree estimation (for a more thorough review, see REF. 64) and the newer Bayesian approaches.

Traditional approaches

The most common traditional approaches to reconstructing phylogenies are the neighbour-joining (NJ) algorithm and tree searches that use an optimality criterion such as PARSIMONY or maximum likelihood (ML). TABLE 1 shows a summary of the advantages and disadvantages of these methods, as well as a list of the software packages that implement them. BOX 2 sets these methods in the context of the entire study, from data collection to hypothesis testing.

Neighbour-joining. The extremely popular NJ algorithm7,8 is relatively fast (a matter of seconds for most data sets) and, like all methods, performs well when the divergence between sequences is low.
The first step in the algorithm is converting the DNA or protein sequences into a distance matrix that represents an estimate of the evolutionary distance between sequences (the evolutionary distance is the number of changes that have occurred along the branches between two sequences). Unfortunately, the molecular clock assumption is usually inappropriate for distantly related sequences. In this situation, a correction of the pairwise distances that accounts for multiple "hits" to the same site should be used. There are many models for how sequence evolution occurs, each of which implies a different way to correct pairwise distances (see REF. 10 for some suggestions), so there is considerable debate on which correction to use (see BOX 2 for a discussion of model selection). Furthermore, these corrections have substantial variance when the distances are large.

Tree searches under optimality criteria. The NJ tree is often treated as the starting point for a computationally intensive search for the best phylogeny (TABLE 2). To perform a tree search, a standard must be used for comparing trees (an optimality criterion, in the jargon of phylogenetics). The most popular criteria are parsimony, minimum evolution and ML. Minimum evolution (ME), like NJ, requires that the data be compressed into a distance matrix; therefore, like NJ, its success is dependent on the sequence divergences being adequately corrected for multiple hits (or the tree being so easy to infer that poor distance estimates are sufficient).

PARSIMONY: In systematics, parsimony refers to choosing between trees on the basis of which one requires the fewest possible mutations to explain the data.

BAYESIAN: A branch of statistics that focuses on the posterior probability of hypotheses. The posterior probability is proportional to the product of the prior probability and the likelihood.

Table 1 | Comparison of methods

Neighbour joining
  Advantages: Fast.
  Disadvantages: Information is lost in compressing sequences into distances; reliable estimates of pairwise distances can be hard to obtain for divergent sequences.
  Software: PAUP*, MEGA, PHYLIP.

Parsimony
  Advantages: Fast enough for the analysis of hundreds of sequences; robust if branches are short (closely related sequences or dense sampling).
  Disadvantages: Can perform poorly if there is substantial variation in branch lengths.
  Software: PAUP*, NONA, MEGA, PHYLIP.

Minimum evolution
  Advantages: Uses models to correct for unseen changes.
  Disadvantages: Distance corrections can break down when distances are large.
  Software: PAUP*, MEGA, PHYLIP.

Maximum likelihood
  Advantages: The likelihood fully captures what the data tell us about the phylogeny under a given model.
  Disadvantages: Can be prohibitively slow (depending on the thoroughness of the search and access to computational resources).
  Software: PAUP*, PAML, PHYLIP.

Bayesian
  Advantages: Has a strong connection to the maximum likelihood method; might be a faster way to assess support for trees than maximum likelihood bootstrapping.
  Disadvantages: The prior distributions for parameters must be specified; it can be difficult to determine whether the Markov chain Monte Carlo (MCMC) approximation has run for long enough.
  Software: MrBayes, BAMBE.

Acknowledgment

Material in these slides is based on Chapter 7 of "The Phylogenetic Handbook" (Lemey, Salemi, and Vandamme, Eds.). Some of the material is based on MCMC notes by Prof. Mark Holder.

Questions?