Phylogenetics:
Bayesian Phylogenetic Analysis
COMP 571 - Spring 2016
Luay Nakhleh, Rice University
Bayes Rule
P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)
                 = P(X = x) P(Y = y | X = x) / Σ_{x'} P(X = x') P(Y = y | X = x')
Bayes Rule
Example (from “Machine Learning: A Probabilistic Perspective”):
Consider a woman in her 40s who decides to have a mammogram.
Question: If the test is positive, what is the probability that she has cancer?
The answer depends on how reliable the test is!
Bayes Rule
Suppose the test has a sensitivity of 80%; that is, if a person has cancer, the test will be positive with probability 0.8.
If we denote by x=1 the event that the mammogram is positive, and by y=1 the event that the person has breast cancer, then P(x=1|y=1)=0.8.
Bayes Rule
Does the probability that the woman in our example (who tested positive) has cancer equal 0.8?
Bayes Rule
No! That ignores the prior probability of having breast cancer, which, fortunately, is quite low: p(y=1)=0.004.
Bayes Rule
Further, we need to take into account the fact that the test may yield a false positive. Mammograms have a false positive probability of p(x=1|y=0)=0.1.
Bayes Rule
Combining all these facts using Bayes rule, we get (using p(y=0) = 1 - p(y=1)):

p(y=1 | x=1) = p(x=1|y=1) p(y=1) / [ p(x=1|y=1) p(y=1) + p(x=1|y=0) p(y=0) ]
             = (0.8 × 0.004) / (0.8 × 0.004 + 0.1 × 0.996)
             ≈ 0.031
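The same arithmetic, as a small Python sketch (the variable names are ours):

```python
# Posterior probability of breast cancer given a positive mammogram, via Bayes rule.
sensitivity = 0.8          # p(x=1 | y=1)
false_positive_rate = 0.1  # p(x=1 | y=0)
prior = 0.004              # p(y=1)

numerator = sensitivity * prior
denominator = numerator + false_positive_rate * (1 - prior)
posterior = numerator / denominator   # p(y=1 | x=1)

print(round(posterior, 3))  # 0.031
```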
How does Bayesian reasoning apply to phylogenetic inference?
Assume we are interested in the relationships between human, gorilla, and chimpanzee (with orangutan as an outgroup). There are clearly three possible relationships.
7.2 Bayesian phylogenetic inference
How does Bayesian reasoning apply to phylogenetic inference? Assume we are
interested in the relationships between man, gorilla, and chimpanzee. In the standard case, we need an additional species to root the tree, and the orangutan would
be appropriate here. There are three possible ways of arranging these species in a
phylogenetic tree: the chimpanzee is our closest relative, the gorilla is our closest
relative, or the chimpanzee and the gorilla are each other’s closest relatives (Fig. 7.1).
[Figure 7.1: bar charts of the prior distribution and the posterior distribution over the three tree topologies A, B, and C, with the data (observations) in between; see the caption below.]
Fig. 7.1
A Bayesian phylogenetic analysis. We start the analysis by specifying our prior beliefs about
the tree. In the absence of background knowledge, we might associate the same probability
to each tree topology. We then collect data and use a stochastic evolutionary model and
Bayes’ theorem to update the prior to a posterior probability distribution. If the data are
informative, most of the posterior probability will be focused on one tree (or a small subset
of trees in a large tree space).
Before the analysis, we need to specify our prior beliefs about the relationships. For example, in the absence of background data, a simple solution would be to assign equal probability to the possible trees. (This is an uninformative prior.)
To update the prior, we need some data, typically in the form of a molecular sequence alignment, and a stochastic model of the process generating the data on the tree.
In principle, Bayes rule is then used to obtain the posterior probability distribution, which is the result of the analysis. The posterior specifies the probability of each tree given the model, the prior, and the data.
When the data are informative, most of the posterior probability is typically concentrated on one tree (or a small subset of trees in a large tree space).
To describe the analysis mathematically, consider:
the matrix of aligned sequences X
the tree topology parameter τ
the branch lengths of the tree ν
Typically, there are also some substitution model parameters to be considered; for now, however, let us use the Jukes-Cantor substitution model, which has no free parameters. Thus, in our case, θ = (τ, ν).
Bayes’ theorem allows us to derive the posterior distribution as

f(θ | X) = f(θ) f(X | θ) / f(X)          (7.5)

The denominator is an integral over the parameter values, which evaluates to a summation over discrete topologies and a multidimensional integration over possible branch length values:

f(X) = ∫ f(θ) f(X | θ) dθ          (7.6)
     = Σ_τ ∫_ν f(ν) f(X | τ, ν) dν          (7.7)
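To see the mechanics of equation (7.5) in the simplest possible setting, here is a toy Python sketch. The three likelihood values are made-up placeholders (they are not from the chapter), and we pretend the branch lengths have already been integrated out so that each topology has a single likelihood:

```python
# Hypothetical likelihoods f(X | tau) for the three topologies (placeholder values only).
likelihood = {"A": 2.0e-10, "B": 5.0e-11, "C": 3.0e-11}
prior = {tau: 1.0 / 3.0 for tau in likelihood}   # uniform prior over topologies

marginal = sum(prior[t] * likelihood[t] for t in likelihood)              # f(X)
posterior = {t: prior[t] * likelihood[t] / marginal for t in likelihood}  # f(tau | X)

print(posterior)   # posterior probabilities over the three topologies, summing to 1
```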
[Figure 7.2: the posterior probability plotted over a one-dimensional representation of parameter space, divided into regions for topology A, topology B, and topology C; the areas under the curve in the three regions are 48%, 32%, and 20%, respectively.]
Fig. 7.2 Posterior probability distribution for our phylogenetic analysis. The x-axis is an imaginary one-dimensional representation of the parameter space. It falls into three different regions corresponding to the three different topologies. Within each region, a point along the axis corresponds to a particular set of branch lengths on that topology. It is difficult to arrange the space such that optimal branch length combinations for different topologies are close to each other. Therefore, the posterior distribution is multimodal. The area under the curve falling in each tree topology region is the posterior probability of that tree topology.
The marginal probability distribution on topologies
Even though our model is as simple as phylogenetic models come, it is impossible to portray its parameter space accurately in one dimension. However, imagine for a while that we could do just that. Then the parameter axis might have three distinct regions corresponding to the three different tree topologies (Fig. 7.2). Within each region, the different points on the axis would represent different branch length values. The one-dimensional parameter axis allows us to obtain a picture of the posterior probability function or surface. It would presumably have three distinct peaks, each corresponding to an optimal combination of topology and branch lengths.
Why are they called marginal probabilities?
Topologies (columns) versus branch length vectors (rows); the cell entries are joint probabilities, and the last column and bottom row give the marginal probabilities:

              τ_A     τ_B     τ_C   | marginal
    ν_A       0.10    0.07    0.12  |  0.29
    ν_B       0.05    0.22    0.06  |  0.33
    ν_C       0.05    0.19    0.14  |  0.38
    marginal  0.20    0.48    0.32  |
Fig. 7.3
A two-dimensional table representation of parameter space. The columns represent different tree topologies, the rows represent different branch length bins. Each cell in the
table represents the joint probability of a particular combination of branch lengths and
topology. If we summarize the probabilities along the margins of the table, we get the
marginal probabilities for the topologies (bottom row) and for the branch length bins
(last column).
In principle, there would be an infinite number of rows, but imagine that we sorted the possible branch
length values into discrete bins, so that we get a finite number of rows. For instance,
if we considered only short and long branches, one bin would have all branches
long, another would have the terminal branches long and the interior branch
short, etc.
Now, assume that we can derive the posterior probability that falls in each of
the cells in the table. These are joint probabilities because they represent the joint
probability of a particular topology and a particular set of branch lengths. If we
summarized all joint probabilities along one axis of the table, we would obtain the
marginal probabilities for the corresponding parameter. To obtain the marginal
probabilities for the topologies, for instance, we would summarize the entries in
each column. It is traditional to write the sums in the margin of the table, hence
the term marginal probability (Fig. 7.3).
It would also be possible to summarize the probabilities in each row of the table.
This would give us the marginal probabilities for the branch length combinations
(Fig. 7.3). Typically, this distribution is of no particular interest but the possibility
of calculating it illustrates an important property of Bayesian inference: there is no
sharp distinction between different types of model parameters. Once the posterior
probability distribution is obtained, we can derive any marginal distribution of interest.
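To make the marginalization concrete, here is a small Python sketch that reproduces the row and column sums of Fig. 7.3 (the joint probabilities are taken from the figure; the array layout is ours):

```python
import numpy as np

# Joint probabilities from Fig. 7.3: rows = branch length bins (nu_A, nu_B, nu_C),
# columns = topologies (tau_A, tau_B, tau_C).
joint = np.array([[0.10, 0.07, 0.12],
                  [0.05, 0.22, 0.06],
                  [0.05, 0.19, 0.14]])

topology_marginals = joint.sum(axis=0)       # column sums -> [0.20, 0.48, 0.32]
branch_length_marginals = joint.sum(axis=1)  # row sums    -> [0.29, 0.33, 0.38]

print(topology_marginals, branch_length_marginals, joint.sum())  # total is 1.0
```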
Markov Chain Monte Carlo Sampling
In most cases, it is impossible to derive the posterior probability distribution analytically. Even worse, we cannot even estimate it by drawing random samples directly from it. The reason is that most of the posterior probability is likely to be concentrated in a small part of a vast parameter space.
The solution is to estimate the posterior probability distribution using Markov chain Monte Carlo sampling, or MCMC for short.
Monte Carlo = random simulation
Markov chain = the state of the simulator depends only on the current state
Irreducible Markov chains (chains whose state graph is strongly connected) have the property that they converge towards an equilibrium (stationary) distribution regardless of the starting point. We just need to set up a Markov chain that converges onto our posterior probability distribution!
Transition Probabilities
A Markov chain can be defined by describing the full set of probability statements that define the rules for the state of the chain in the next step (step i+1) given its current state (in step i). These transition probabilities are analogous to transition rates if we are working with continuous-time Markov processes.
Stationary Distribution of a Markov Chain
Consider the simplest possible Markov chain: one with two states that operates in discrete time. The figure shows the states in the graph as circles; the transition probabilities are shown as arcs connecting the states, with the probabilities next to the lines. The full probability statements that correspond to the graph are:

P(x_{i+1} = 0 | x_i = 0) = 0.4
P(x_{i+1} = 1 | x_i = 0) = 0.6
P(x_{i+1} = 0 | x_i = 1) = 0.9
P(x_{i+1} = 1 | x_i = 1) = 0.1

Figure 1: A graphical depiction of a two-state Markov process.

Note that, because these are probabilities, some of them must sum to one. In particular, if we are in a particular state at step i, we can call the state x_i. In the next step, we must have some state, so formally we could state it as 1 = Σ_j P(x_{i+1} = j | x_i) for every possible x_i.

Note that the state at step i+1 only depends on the state at step i. This is the Markov property. More formally, we could state it as

P(x_{i+1} | x_i) = P(x_{i+1} | x_i, x_{i-k})

where k is a positive integer. What this probability statement is saying is that, conditional on x_i, the state at i+1 is independent of the state at any point before i. So when working with Markov chains, we don't need to concern ourselves with the full history of the chain; merely knowing the state at the previous step is enough.

What are P(x_i = 0 | x_0 = 0), P(x_i = 1 | x_0 = 0), P(x_i = 0 | x_0 = 1), and P(x_i = 1 | x_0 = 1)?

Clearly, if we know x_i and the transition probabilities, then we can make a probabilistic statement about the state in the next iteration (in fact, the transition probabilities are these probabilistic statements). But we can also think about the probability that the chain will be in a particular state two steps from now:

P(x_{i+2} = 0 | x_i = 0) = P(x_{i+2} = 0 | x_{i+1} = 1) P(x_{i+1} = 1 | x_i = 0) + P(x_{i+2} = 0 | x_{i+1} = 0) P(x_{i+1} = 0 | x_i = 0)
                         = 0.9 × 0.6 + 0.4 × 0.4
                         = 0.7

Here we are exploiting the fact that the same “rules” (transition probabilities) apply when we consider state changes between i+1 and i+2. If the transition probabilities are fixed through the running of the Markov chain, then we are dealing with a time-homogeneous Markov chain.

There are second-order Markov processes that depend on the two previous states, third-order Markov processes, etc. But in this course, we'll just be dealing with the simplest Markov chains, which only depend on the current state.

More generally, the probability of being in state k at step i, given the starting state ℓ, can be computed recursively:

P(x_i = k | x_0 = ℓ) = P(x_i = k | x_{i-1} = 0) P(x_{i-1} = 0 | x_0 = ℓ) + P(x_i = k | x_{i-1} = 1) P(x_{i-1} = 1 | x_0 = ℓ)
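This two-step bookkeeping can be done mechanically. A small Python sketch (ours, not the handout's Appendix A) that sums over the intermediate state:

```python
# Transition probabilities of the two-state chain in Figure 1:
# P[a][b] = P(x_{i+1} = b | x_i = a)
P = {0: {0: 0.4, 1: 0.6},
     1: {0: 0.9, 1: 0.1}}

def two_step(start, end):
    """P(x_{i+2} = end | x_i = start), summing over the state at step i+1."""
    return sum(P[start][mid] * P[mid][end] for mid in (0, 1))

print(two_step(0, 0))  # 0.4*0.4 + 0.6*0.9 = 0.70
print(two_step(0, 1))  # 0.4*0.6 + 0.6*0.1 = 0.30
```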
Note that:

P(x_{i+2} = 1 | x_i = 0) = P(x_{i+2} = 1 | x_{i+1} = 1) P(x_{i+1} = 1 | x_i = 0) + P(x_{i+2} = 1 | x_{i+1} = 0) P(x_{i+1} = 0 | x_i = 0)
                         = 0.1 × 0.6 + 0.6 × 0.4
                         = 0.3

So, if we sum over all possible states at i+2, the relevant probability statements sum to one.

It gets tedious to continue this for a large number of steps into the future. But we can ask a computer to calculate the probability of being in state 0 for a large number of steps into the future by repeating the calculations (see Appendix A for code). Figure (2) shows the probabilities of the Markov process being in state 0 as a function of the step number for the two possible starting states.

Figure 2: The probability of being in state 0 as a function of the step number, i, for two different starting states (0 and 1) for the Markov process depicted in Figure (1).

Note that the probability stabilizes to a steady-state distribution. Knowing whether the chain started in state 0 or 1 tells you very little about the state of the chain at step 15. Technically, x_15 and x_0 are not independent of each other. If you work through the math, the probability of the state at step 15 does depend on the starting state:

P(x_15 = 0 | x_0 = 0) = 0.599987792969
P(x_15 = 0 | x_0 = 1) = 0.600018310547

But clearly these two probabilities are very close to being equal: essentially the same probability, regardless of starting state! If we consider an even larger number of iterations (e.g. the state at step 100), then the probabilities are so close that they are indistinguishable.
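A minimal sketch of the kind of iteration Appendix A (not reproduced here) refers to, propagating the state distribution forward by repeated multiplication with the transition matrix:

```python
import numpy as np

# Transition matrix of the two-state chain in Figure 1 (row = current state).
T = np.array([[0.4, 0.6],
              [0.9, 0.1]])

def prob_state0(start, steps):
    """P(x_steps = 0 | x_0 = start)."""
    p = np.zeros(2)
    p[start] = 1.0
    for _ in range(steps):
        p = p @ T
    return p[0]

print(prob_state0(0, 15))  # 0.599987792969...
print(prob_state0(1, 15))  # 0.600018310547...
```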
3
t of probability statements that define the
1) given its current state (in step i). These
s if we are working with continuous-time
Note that:
Stationary Distribution
of a Markov Chain
P(xi+2 = 1|xi = 0) = P(xi+2 = 1|xi+1 = 1)P(xi+1 = 1|xi = 0) + P(xi+2 = 1|xi+1 = 0)P(xi+1 = 0|xi = 0)
= 0.1 ⇤ 0.6 + 0.6 ⇤ 0.4
= 0.3
So, if we sum over all possible states at i + 2, the relevant probability statements sum to one.
two states
o the right
are shown
to the line.
raph are:
1.0
0.6
Pr(x_i=0)
0.4
0.2
0.2
0.1
0.9
0.0
1
0.4
0
0.0
0.4
Pr(x_i=0)
0.6
0.8
0.6
0.8
1.0
It gets tedious to continue this for a large number of steps into the future. But we can ask a
computer to calculate the probability of being in state 0 for a large number of steps into the future
by repeating the calculations (see the Appendix A for code). Figure (2) shows the probabilities of
the Markov process being in state 0 as a function of the step number for the two possible starting
states.
0
5
10
15
0
5
i
10
15
i
P(xi = 0|x0 = 0)
P(xi = 0|x0 = 1)
Figure 2: The probability of being in state 0 as a function of the step number, i, for two di↵erent
starting states (0 and 1) for the Markov process depicted in Figure (1).
where does the 0.6
come from?
Note that the probability stabilizes to a steady state distribution. Knowing whether the chain
started in state 0 or 1 tells you very little about the state of the chain in step 15. Technically, x15
and x0 are not independent of each other. If you work through the math the probability of the
state at step 15 does depend on the starting state:
must sum Figure 1: A graphical depicstep i we tion of a two-state Markov
me state so process.
P(x15 = 0|x0 = 0) = 0.599987792969
P(x15 = 0|x0 = 1) = 0.600018310547
But clearly these two probabilities are very close to being equal.
If we consider an even larger number of iterations (e.g. the state at step 100), then the probabilities
are so close that they are indistinguishable.
3
ate at step
state it as:
ement is saying is that, conditional on xi ,
nt before i. So when working with Markov
history of the chain, merely knowing the
hen we can make a probabilistic statement
t of probability
statements
that
define the
sition
probabilities
are these
probabilistic
1)
(ininstep
i). These
litygiven
thatits
thecurrent
chain state
will be
a particular
s if we are working with continuous-time
Note that:
Stationary Distribution
of a Markov Chain
P(xi+2 = 1|xi = 0) = P(xi+2 = 1|xi+1 = 1)P(xi+1 = 1|xi = 0) + P(xi+2 = 1|xi+1 = 0)P(xi+1 = 0|xi = 0)
= 0.1 ⇤ 0.6 + 0.6 ⇤ 0.4
|xi = 0) + P(xi+2 = 0|xi+1 = 0)P(xi+1 = 0|xi = 0)
two states
o the right
are shown
to(transition
the line. probabilities) apply when we
0.6 through the
raph are:
ansition
probabilities are fixed
a time-homogeneous
0.4
0Markov chain. 1
0.1
= 0.3
So, if we sum over all possible states at i + 2, the relevant probability statements sum to one.
1.0
0.8
0.6
Pr(x_i=0)
0.4
0.2
0.0
0.0
wo previous states, and third-order Markov process
0.9
Markov chains which only depend on the current
0.2
0.4
Pr(x_i=0)
0.6
0.8
1.0
It gets tedious to continue this for a large number of steps into the future. But we can ask a
computer to calculate the probability of being in state 0 for a large number of steps into the future
by repeating the calculations (see the Appendix A for code). Figure (2) shows the probabilities of
the Markov process being in state 0 as a function of the step number for the two possible starting
states.
0
5
10
15
0
5
i
10
15
i
P(xi = 0|x0 = 0)
P(xi = 0|x0 = 1)
Figure 2: The probability of being in state 0 as a function of the step number, i, for two di↵erent
starting states (0 and 1) for the Markov process depicted in Figure (1).
must sum Figure 1: A graphical depicstep i we tion of a two-state Markov
me state so process.
where does the 0.6
come from?
Note that the probability stabilizes to a steady state distribution. Knowing whether the chain
started in state 0 or 1 tells you very little about the state of the chain in step 15. Technically, x15
and x0 are not independent of each other. If you work through the math the probability of the
state at step 15 does depend on the starting state:
P(x15 = 0|x0 = 0) = 0.599987792969
P(x15 = 0|x0 = 1) = 0.600018310547
But clearly these two probabilities are very close to being equal.
stationary distribution: π0=0.6 π1=0.4
ate at step
state it as:
ement is saying is that, conditional on xi ,
nt before i. So when working with Markov
history of the chain, merely knowing the
hen we can make a probabilistic statement
sition probabilities are these probabilistic
lity that the chain will be in a particular
If we consider an even larger number of iterations (e.g. the state at step 100), then the probabilities
are so close that they are indistinguishable.
3
scribing the full set of probability statements that define the
next step (step i + 1) given its current state (in step i). These
s to transition rates if we are working with continuous-time
Stationary Distribution
of a Markov Chain
ov chain: one with two states
ime. The figure to the right
sition probabilities are shown
probabilities next to the line.
correspond to the graph are:
0.6
0.4
) = 0.4
0
1
0.1
0.9
) = 0.6
) = 0.9
) = 0.1
ities, some of them must sum Figure 1: A graphical depicparticular state at step i we tion of a two-state Markov
, we must have some state so process.
ble xi .
depends on the state at step
formally we could state it as:
1 |xi , xi k )
is probability statement is saying is that, conditional on xi ,
he state at any point before i. So when working with Markov
selves with the full history of the chain, merely knowing the
ion probabilities, then we can make a probabilistic statement
nscribing
(in factthe
thefull
transition
probabilities
are these
probabilistic
set of probability
statements
that
define the
about
the(step
probability
thatits
thecurrent
chain state
will be
a particular
next
step
i + 1) given
(ininstep
i). These
s to transition rates if we are working with continuous-time
i+1
= 1)P(xi+1 = 1|xi = 0) + P(xi+2 = 0|xi+1 = 0)P(xi+1 = 0|xi = 0)
Stationary Distribution
of a Markov Chain
4 ⇤ 0.4
ov chain: one with two states
ime. The figure to the right
sition
probabilities
shown probabilities) apply when we
the same
“rules” are
(transition
probabilities
next
to the line.
and
i + 2. If the
transition
probabilities are fixed
0.6 through the
the graph
are:
ecorrespond
are dealingtowith
a time-homogeneous
Markov chain.
0.4
0
1
0.1
= depend
0.4 on the two previous states, and third-order Markov process
s) that
the simplest Markov chains which only depend
0.9on the current
)ng with
= 0.6
Imagine infinitely many chains. At equilibrium (steady-state),
the “flux out” of each state must be equal to the “flux into”
that state.
ities, some of them must sum
) = 0.9
) = 0.1 2
Figure 1: A graphical depicparticular state at step i we tion of a two-state Markov
, we must have some state so process.
ble xi .
depends on the state at step
formally we could state it as:
1 |xi , xi k )
is probability statement is saying is that, conditional on xi ,
he state at any point before i. So when working with Markov
selves with the full history of the chain, merely knowing the
ion probabilities, then we can make a probabilistic statement
n (in fact the transition probabilities are these probabilistic
about the probability that the chain will be in a particular
i+1
= 1)P(xi+1 = 1|xi = 0) + P(xi+2 = 0|xi+1 = 0)P(xi+1 = 0|xi = 0)
scribing the full set of probability statements that define the
w canstep
we (step
showithat
this isits
the
steady-state
next
+ 1) given
current
state (in(equilibrium)
step i). Thesedistribution? The flux “out” of each
be equal
to the
“flux”
into thatwith
statecontinuous-time
for a system to be at equilibrium. Here it is helpful
stetomust
transition
rates
if we
are working
think about running infinitely many chains. If we have an equilibrium probability distribution,
n we can characterize the probability of a chain leaving state 0 in at one particular step in the
cess.
This
is with
the joint
probability of being in state 0 at step i and the probability of an 0 ! 1
ov
chain:
one
two states
nsition
at point
ime. The
figure i.toThus,
the right
sition
probabilities
are shown
er a long
enough walk,
a Markov chain in which all of the states are connected2 will converge
P(0
! 1 at i) = P(xi = 0)P(xi+1 = 1|xi = 0)
probabilities
to the
line.
ts
stationarynext
distribution.
0.6
correspond to the graph are:
the equilibrium P(xi = 0) = ⇡0 , so the flux out of 0 at step i is ⇡0 P(xi+1 = 1|xi = 0). In our
0.4
0
system the flux into state 0 can only 1come0.1
from state 1, so:
)ple=(two-state)
0.4
any
chains
or
one
really long chain
0.9
) = 0.6
P(1 ! 0 at i) = P(xi = 1)P(xi+1 = 0|xi = 1)
) = 0.9 Imagine infinitely many chains. At equilibrium (steady-state),
= ⇡1 P(xi+1 = 0|xi = 1)
hen
about
Markov
chains
we be
often
flip to
back
forth
between thinking
) =constructing
0.1
thearguments
“flux out”
of each
state
must
equal
theand
“flux
into”
out
the
behavior
of
an
arbitrarily
large
ofstate.
instances of the random process all obeying the
we force the “flux out” and “flux in” tonumber
bethat
identical:
ities,
some
of performing
them must sum
me
rules
(but
their walks
Figureindependently)
1: A graphicalversus
depic-the behavior we expect if we sparsely
particular
at stepThe
weidea
mple
a very state
long chain.
that
sample
ofi =
aif0)
two-state
Markov
⇡0iP(x
=is1|x
= ⇡sparsely
= 0|xi =the
1) sampled points from one
i+1tion
1 P(x
i+1 enough,
,inweare
must
have
some state
so process.
close
to being
independent
of each⇡0other (as
2 was meant to demonstrate). Thus
P(xFigure
i+1 = 0|xi = 1)
ble
=
i.
canxthink
of a very long chain as if we had
number
of independent
chains.
⇡ a large
P(x
= 1|x
= 0)
Stationary Distribution
of a Markov Chain
1
i+1
i
depends
the isstate
at step A reducible chain has some states that are disconnected from others, so the
The
fancy on
jargon
“irreducible.”
n
can be broken
up (reduced)
into two or more chains that operate on a subset of the states.
formally
we could
state it as:
ationary distribution
1 |xi , xi k )
tice
in the discussion
above
theisprobability
of ending
is probability
statement
is that
saying
that, conditional
on up
xi ,in state 0 as the number of iterations
reased
What
is special
about4this
he
stateapproached
at any point0.6.
before
i. So
when working
withvalue?
Markov
selves with the full history of the chain, merely knowing the
urns out that in the stationary distribution (frequently denoted ⇡), the probability of sampling
hain in state 0 is 0.6. We would write this as ⇡ = {0.6, 0.4} or the pair of statements: ⇡0 =
, ⇡1 probabilities,
= 0.4.
ion
then we can make a probabilistic statement
nscribing
(in factthe
thefull
transition
probabilities
are these
probabilistic
set of probability
statements
that
define the
about
the(step
probability
that
the
chain
will be
instep
a particular
w
canstep
we
show
this
isits
the
steady-state
next
ithat
+ 1) given
current
state
(in(equilibrium)
i). Thesedistribution? The flux “out” of each
be equal
to the
“flux”
into thatwith
statecontinuous-time
for a system to be at equilibrium. Here it is helpful
stetomust
transition
rates
if we
are working
think about running infinitely many chains. If we have an equilibrium probability distribution,
i+1 = 1)P(xi+1 = 1|xi = 0) + P(xi+2 = 0|xi+1 = 0)P(xi+1 = 0|xi = 0)
n we can characterize the probability of a chain leaving state 0 in at one particular step in the
4cess.
⇤ 0.4This is the joint probability of being in state 0 at step i and the probability of an 0 ! 1
ov chain:
one with two states
nsition
at point
ime. The
figure i.toThus,
the right
sition
probabilities
shown probabilities) apply when we
the same “rules” are
(transition
P(0
! 1 at i) = P(xi = 0)P(xi+1 = 1|xi = 0)
probabilities
next
to the
line.
and
i + 2. If the
transition
probabilities are fixed
0.6 through the
towith
the graph
are:
ecorrespond
are
dealing
a
time-homogeneous
Markov
chain.
the equilibrium P(xi = 0) = ⇡0 , so the
flux out
of 0 at step i is ⇡0 P(xi+1 = 1|xi = 0). In our
0.4
0 0 can only 1come0.1
the flux
into
=(two-state)
0.4 on thesystem
s)ple
that
depend
two previous
states,
andstate
third-order
Markov processfrom state 1, so:
Stationary Distribution
of a Markov Chain
the simplest Markov chains which only depend
0.9on the current
)ng with
= 0.6
P(1 ! 0 at i) = P(xi = 1)P(xi+1 = 0|xi = 1)
) = 0.9 Imagine infinitely many chains. At equilibrium (steady-state),
= ⇡1 P(xi+1 = 0|xi = 1)
) = 0.1 2 the “flux out” of each state must be equal to the “flux into”
state.
we force the “flux out” and “flux in” to bethat
identical:
ities, some of them must sum Figure 1: A graphical depicparticular state at step i we tion of a two-state Markov
⇡0 P(xi+1 = 1|xi = 0) = ⇡1 P(xi+1 = 0|xi = 1)
, we must have some state so process.
⇡0
P(xi+1 = 0|xi = 1)
ble xi .
=
⇡1
P(xi+1 = 1|xi = 0)
depends
the isstate
at step A reducible chain has some states that are disconnected from others, so the
The
fancy on
jargon
“irreducible.”
⇡0 = P(xi = 0) ⇡1 = P(xi = 1)
n
can be broken
up (reduced)
into two or more chains that operate on a subset of the states.
formally
we could
state it as:
1 |xi , xi k )
is probability statement is saying is that, conditional on xi ,
4
he state at any point before i. So when working with Markov
selves with the full history of the chain, merely knowing the
ion probabilities, then we can make a probabilistic statement
n (in fact the transition probabilities are these probabilistic
about the probability that the chain will be in a particular
i+1
= 1)P(xi+1 = 1|xi = 0) + P(xi+2 = 0|xi+1 = 0)P(xi+1 = 0|xi = 0)
scribing the full set of probability statements that define the
next step (step i + 1) given its current state (in step i). These
s to transition rates if we are working with continuous-time
Stationary Distribution
of a Markov Chain
ov chain: one with two states
ime. The figure to the right
sition probabilities are shown
After a long enough
Markov chain in which all of the states are connected2 will converge
probabilities
next walk,
to thea line.
0.6
to its stationary
distribution.
correspond
to the
graph are:
0.4
) = 0.4
0
1
chains or one really long chain
) Many
= 0.6
0.1
0.9
) = 0.9
When constructing arguments about Markov chains we often flip back and forth between thinking
about the behavior of an arbitrarily large number of instances of the random process all obeying the
same rules (but performing their walks independently) versus the behavior we expect if we sparsely
ities,
some of them must sum
1: A sparsely
graphical
depicsample a very long chain. The idea isFigure
that if sample
enough,
the sampled points from one
particular
state at step i we of eachofother
a two-state
chain are close to being independent tion
(as Figure 2 Markov
was meant to demonstrate). Thus
, we
wecan
must
have
some
state
so
think of a very long chain as process.
if we had a large number of independent chains.
) = 0.1
ble xi .
depends
ondistribution
the state at step
Stationary
formally we could state it as:
Notice in the discussion above that the probability of ending up in state 0 as the number of iterations
increased approached 0.6. What is special about this value?
1 |xi , xi k )
is probability statement is saying is that, conditional on xi ,
turns out that in the stationary distribution (frequently denoted ⇡), the probability of sampling
heIt state
at any point before i. So when working with Markov
a chain in state 0 is 0.6. We would write this as ⇡ = {0.6, 0.4} or the pair of statements: ⇡0 =
selves
with the full history of the chain, merely knowing the
0.6, ⇡ = 0.4.
1
How can we show that this is the steady-state (equilibrium) distribution? The flux “out” of each
ion
probabilities,
can
make
a probabilistic
statement
state
must be equal then
to the we
“flux”
into
that state
for a system to
be at equilibrium. Here it is helpful
to
running
infinitely
many chains.are
If we
haveprobabilistic
an equilibrium probability distribution,
nscribing
(inthink
factabout
thefull
transition
probabilities
these
the
set of probability
statements
that define the
then wethe
can probability
characterize the
probability
of a will
chain leaving
0 in at one particular step in the
about
that
thecurrent
chain
astate
particular
next
step (step
i + 1) given
its
statebe
(ininstep
i). These
process. This is the joint probability of being in state 0 at step i and the probability of an 0 ! 1
s transition
to transition
rates
if we are working with continuous-time
at point
i. Thus,
i+1
= 1)P(xi+1 = 1|xi = 0) + P(xi+2 = 0|xi+1 = 0)P(xi+1 = 0|xi = 0)
Stationary Distribution
of a Markov Chain
P(0 ! 1 at i) = P(xi = 0)P(xi+1 = 1|xi = 0)
4 ⇤ 0.4
ovAtchain:
one withP(x
two
states
the equilibrium
i = 0) = ⇡0 , so the flux out of 0 at step i is ⇡0 P(xi+1 = 1|xi = 0). In our
ime.
figure system
to thetheright
simpleThe
(two-state)
flux into state 0 can only come from state 1, so:
sition
probabilities
shown probabilities) apply when we
the same
“rules” are
(transition
P(1 line.
! 0 at i) = P(xi = 1)P(xi+1 = 0|xi = 1)
probabilities
next
to the
and
i + 2. If the
transition
probabilities are fixed
through the
= ⇡1 P(xi+1 0.6
= 0|xi = 1)
the graph
are:
ecorrespond
are dealingtowith
a time-homogeneous
Markov chain.
0.4to be identical:
0
1
0.1
If =
we force
“flux out” and “flux in”
0.4 the
s) that
depend
on the two previous states, and third-order Markov process
ng
with
the
simplest
Markov
chains
which
only
depend
on
the
current
) = 0.6
⇡0 P(xi+1 = 1|xi = 0) = ⇡10.9
P(xi+1 = 0|xi = 1)
) = 0.9
) = 0.1 2
2
⇡0
⇡1
=
P(xi+1 = 0|xi = 1)
P(xi+1 = 1|xi = 0)
The fancy jargon is “irreducible.” A reducible chain has some states that are disconnected from others, so the
chain can
be broken
up (reduced)
more chains that operate on a subset of the states.
ities,
some
of them
must into
sumtwo or
Figure
1: A graphical depicparticular state at step i we tion of a two-state Markov
, we must have some state so process.
ble xi .
4
depends on the state at step
formally we could state it as:
1 |xi , xi k )
is probability statement is saying is that, conditional on xi ,
he state at any point before i. So when working with Markov
selves with the full history of the chain, merely knowing the
ion probabilities, then we can make a probabilistic statement
n (in fact the transition probabilities are these probabilistic
about the probability that the chain will be in a particular
i+1
= 1)P(xi+1 = 1|xi = 0) + P(xi+2 = 0|xi+1 = 0)P(xi+1 = 0|xi = 0)
to think about running infinitely many chains. If we have an equilibrium probability distribution,
scribing the full set of probability statements that define the
then we can characterize the probability of a chain leaving state 0 in at one particular step in the
next
stepThis
(step
i +joint
1) given
its current
(in0step
i).i These
process.
is the
probability
of beingstate
in state
at step
and the probability of an 0 ! 1
s transition
to transition
rates
if we are working with continuous-time
at point
i. Thus,
Stationary Distribution
of a Markov Chain
P(0 ! 1 at i) = P(xi = 0)P(xi+1 = 1|xi = 0)
ovAtchain:
one withP(x
two
states
the equilibrium
i = 0) = ⇡0 , so the flux out of 0 at step i is ⇡0 P(xi+1 = 1|xi = 0). In our
ime.
figure system
to thetheright
simpleThe
(two-state)
flux into state 0 can only come from state 1, so:
sition probabilities are shown
P(1
!
0 at i)
= inP(x
0|xi =are
1) connected2 will converge
i = 1)P(x
After a long enough
Markov
chain
which
all ofi+1
the=states
probabilities
next walk,
to thea line.
to its stationary
distribution.
= ⇡1 P(xi+1 0.6
= 0|xi = 1)
correspond
to the
graph are:
0.4
0
1
0.1
we force
) If =
0.4 the “flux out” and “flux in” to be identical:
Many
chains
or
one
really
long
chain
) = 0.6
⇡0 P(xi+1 = 1|xi = 0) = ⇡10.9
P(xi+1 = 0|xi = 1)
⇡0
P(xi+1 = 0|xi = 1)
=
When constructing arguments about Markov
we i+1
often
⇡1chains P(x
= flip
1|xiback
= 0) and forth between thinking
about
the behavior of an arbitrarily large number of instances of the random process all obeying the
2
The fancy jargon is “irreducible.” A reducible chain has some states that are disconnected from others, so the
same
rules
(but performing
independently)operate
thea subset
behavior
we expect if we sparsely
chain can
be broken
up (reduced)
into
two or more chains that
on 1
of the states.
ities,
some
of them
musttheir
sumwalks
⇡0 graphical
+versus
⇡1 =
1: A
depicsample a very long chain. The idea isFigure
that if sample
sparsely enough,
the sampled points from one
particular
state
at
step
i
we
a two-state
chain are close to being independent tion
of eachofother
(as Figure 2 Markov
was meant to demonstrate). Thus
, we
wecan
must
have
state
soas process.
think
of a some
very long
chain
if we had a large number of independent chains.
ble xi .
4
) = 0.9
) = 0.1
depends
ondistribution
the state at step
Stationary
formally we could state it as:
Notice in the discussion above that the probability of ending up in state 0 as the number of iterations
increased approached 0.6. What is special about this value?
1 |xi , xi k )
is probability statement is saying is that, conditional on xi ,
turns out that in the stationary distribution (frequently denoted ⇡), the probability of sampling
heIt state
at any point before i. So when working with Markov
a chain in state 0 is 0.6. We would write this as ⇡ = {0.6, 0.4} or the pair of statements: ⇡0 =
selves
with the full history of the chain, merely knowing the
0.6, ⇡ = 0.4.
1
How can we show that this is the steady-state (equilibrium) distribution? The flux “out” of each
ion
probabilities,
can
make
a probabilistic
statement
state
must be equal then
to the we
“flux”
into
that state
for a system to
be at equilibrium. Here it is helpful
to
running
infinitely
many chains.are
If we
haveprobabilistic
an equilibrium probability distribution,
nscribing
(inthink
factabout
thefull
transition
probabilities
these
the
set of probability
statements
that define the
then wethe
can probability
characterize the
probability
of a will
chain leaving
0 in at one particular step in the
about
that
thecurrent
chain
astate
particular
next
step (step
i + 1) given
its
statebe
(ininstep
i). These
process. This is the joint probability of being in state 0 at step i and the probability of an 0 ! 1
s transition
to transition
rates
if we are working with continuous-time
at point
i. Thus,
i+1
= 1)P(xi+1 = 1|xi = 0) + P(xi+2 = 0|xi+1 = 0)P(xi+1 = 0|xi = 0)
Stationary Distribution
of a Markov Chain
P(0 ! 1 at i) = P(xi = 0)P(xi+1 = 1|xi = 0)
4 ⇤ 0.4
ovAtchain:
one withP(x
two
states
the equilibrium
i = 0) = ⇡0 , so the flux out of 0 at step i is ⇡0 P(xi+1 = 1|xi = 0). In our
ime.
figure system
to thetheright
simpleThe
(two-state)
flux into state 0 can only come from state 1, so:
sition
probabilities
shown probabilities) apply when we
the same
“rules” are
(transition
P(1 line.
! 0 at i) = P(xi = 1)P(xi+1 = 0|xi = 1)
probabilities
next
to the
and
i + 2. If the
transition
probabilities are fixed
through the
= ⇡1 P(xi+1 0.6
= 0|xi = 1)
the graph
are:
ecorrespond
are dealingtowith
a time-homogeneous
Markov chain.
0.4to be identical:
0
1
0.1
If =
we force
“flux out” and “flux in”
0.4 the
s) that
depend
on the two previous states, and third-order Markov process
ng
with
the
simplest
Markov
chains
which
only
depend
on
the
current
) = 0.6
⇡0 P(xi+1 = 1|xi = 0) = ⇡10.9
P(xi+1 = 0|xi = 1)
) = 0.9
⇡
P(x
= 0|x = 1)
0
i+1
i
We can then solve for the actual frequencies
= by recognizing that ⇡ is a probability distribution:
⇡1
P(xi+1 = 1|xi = 0)
X
⇡k = 1
2
The fancy jargon is “irreducible.” A reducible chain has ksome states that are disconnected from others, so the
chain can
be broken
up (reduced)
operate
on 1
a subset of the states.
ities,
some
of them
must into
sumtwo or more chains
⇡0 graphical
+
⇡1 =
1: that
A
depicFor the system that we areFigure
considering:
) = 0.1 2
particular state at step i we tion of a two-state Markov
⇡0
0.9
, we must have some state so process.
=
= 1.5
⇡1
0.6
ble xi .
4 ⇡0 = 1.5⇡1
depends on the state at step
formally we could state it as:
1 |xi , xi
k)
1.5⇡1 + ⇡1 = 1.0
⇡1 = 0.4
⇡0 = 0.6
Thus, the transition probabilities determine the stationary distribution.
is probability
statement
saying
is that,
conditional
xiconstruct
,
If we
can chooseisthe
transition
probabilities
then weon
can
a sampler that will
he state at any
pointtobefore
i. So when
converge
any distribution
thatworking
we wouldwith
like. Markov
selves with the full history of the chain, merely knowing the
When we have a state set S with more than 2 states, we can express a general statement about the
stationary distribution:
X
⇡j =
⇡k P(xi+1 = j|xi = k)
(1)
k2S
ion probabilities, then we can make a probabilistic
statement
connection toprobabilities
the “flux out” and
“flux
in” argument
may not seem obvious, but note that the
n (in fact theThetransition
are
these
probabilistic
“flux out” can be quantified in terms of the 1 - the “self-transition.” Let S6=j be the set of states
about the probability
thatj, the
willis: be in a particular
that excludes state
so thechain
“flux out”
i+1
flux out of j = ⇡j P(xi+1 2 S6=j |xi = j)
= ⇡=
P(xi+1
2 j|x
= j)]
j [10)P(x
= 1)P(xi+1 = 1|xi = 0) + P(xi+2 = 0|xi+1
= i0|x
i+1
i = 0)
Stationary Distribution
of a Markov Chain
If we can choose the transition probabilities of the
Markov chain, then we can construct a sampler
that will converge to any distribution that we desire!
We can then solve for the actual frequencies by recognizing that ⇡ is a probability distribution:
X
We can then solve for the actual frequencies by⇡recognizing
that ⇡ is a probability distribution:
k = 1
k
X
⇡k = 1
For the system that we are considering:
k
⇡0
0.9
=
= 1.5
⇡1
0.6
1.5⇡1that ⇡ is a probability distribution:
⇡⇡00 by =
0.9
We can then solve for the actual frequencies
recognizing
=
= 1.5
0.6
1.5⇡1 +⇡⇡1X
1.0
1 ⇡=
k = 1
⇡⇡0k =
1.5⇡
= 0.4 1
For the system that we are considering:
1
For the system that we are considering:
1.5⇡1 + ⇡⇡10
=
= 1.0
0.6
0.9
⇡1 ⇡0 == 0.4
= 1.5
0.6
⇡1
⇡0 ⇡0== 0.6
1.5⇡1
Thus, the transition probabilities determine
the stationary distribution.
1.5⇡ + ⇡ = 1.0
1
1
⇡1 = 0.4
If wethe
can
choose the
transition
probabilities
can construct a sampler that will
=then
0.6 wedistribution.
Thus,
transition
probabilities
determine
the⇡0stationary
converge to any distribution that we would like.
Thus,the
the transition
probabilities
determine the stationary
distribution.
If we can choose
transition
probabilities
then we
can construct a sampler that will
When weto
have
a state set S withthat
morewe
than
2 states,
we can express a general statement about the
converge
any
would
like.
If wedistribution
can choose the transition
probabilities
then we can construct a sampler that will
stationary distribution:
Xwe would like.
converge to any distribution that
⇡j =
⇡k P(xi+1 = j|xi = k)
When we have aWhen
state
set S with more
thanthan
2 states,
we can express a general statement about (1)
the
we have a state set S with more
2 states, we can express a general statement about the
k2S
stationary distribution:
stationary distribution:
X X
The connection to the “flux out”⇡and
“flux
in”
argument
may
not
seem
obvious,
but
note
that
the
⇡ =
⇡k P(xi+1 =
= j|x
k) k)
(1)
i ==
j|x
(1)
j = j ⇡k P(x
i+1
i
k2S1 - the “self-transition.” Let S
“flux out” can be quantified in terms of
the
6=j be the set of states
k2S
connection
to the
“flux out”
out” and
that excludes The
state
j, so the
“flux
is:“flux in” argument may not seem obvious, but note that the
out”“flux
can be out”
quantified
terms ofin”
the argument
1 - the “self-transition.”
S6=j be
the set ofbut
statesnote that the
The connection “flux
to the
andin “flux
may notLet
seem
obvious,
that excludes state j, so the “flux out” is:
“flux out” can be quantifiedflux
in terms
thei+1
“self-transition.”
out of of
j the
= 1⇡-j P(x
2 S6=j |xi = j) Let S6=j be the set of states
flux outis:
of j = ⇡j P(xi+1 2 S6=j |xi = j)
that excludes state j, so the “flux out”
= =⇡j⇡[1[1 P(x
2 j|xi = j)]
P(x i+1
2 j|x = j)]
Stationary Distribution
of a Markov Chain
For the general case of more than 2
states:
π_j = Σ_{k∈S} π_k P(x_{i+1} = j | x_i = k)

The flux in is:

flux into j = Σ_{k∈S_{≠j}} π_k P(x_{i+1} = j | x_i = k)

Forcing the fluxes to be equal gives us:

π_j [1 − P(x_{i+1} = j | x_i = j)] = Σ_{k∈S_{≠j}} π_k P(x_{i+1} = j | x_i = k)

π_j = π_j P(x_{i+1} = j | x_i = j) + Σ_{k∈S_{≠j}} π_k P(x_{i+1} = j | x_i = k)
    = Σ_{k∈S} π_k P(x_{i+1} = j | x_i = k)    (2)
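Equation (2) is easy to check numerically. In the sketch below, the three-state transition matrix is purely an illustrative assumption; the code computes the stationary distribution and confirms that the flux out of each state equals the flux into it:

import numpy as np

# Illustrative 3-state transition matrix; P[j, k] = P(x_{i+1} = k | x_i = j).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Stationary distribution: the left eigenvector of P with eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
pi /= pi.sum()

# Equation (2): flux out of j equals flux into j for every state j.
for j in range(len(pi)):
    flux_out = pi[j] * (1.0 - P[j, j])
    flux_in = sum(pi[k] * P[k, j] for k in range(len(pi)) if k != j)
    print(j, round(flux_out, 6), round(flux_in, 6))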
Mixing
While setting the transition
probabilities to specific values affects
the stationary distribution, the
transition probabilities cannot be
determined uniquely from the
stationary distribution.
Note that:
P(x_{i+2} = 1 | x_i = 0) = P(x_{i+2} = 1 | x_{i+1} = 1) P(x_{i+1} = 1 | x_i = 0) + P(x_{i+2} = 1 | x_{i+1} = 0) P(x_{i+1} = 0 | x_i = 0)
                        = 0.1 × 0.6 + 0.6 × 0.4
                        = 0.3
So, if we sum over all possible states at i + 2, the relevant probability statements sum to one.

It gets tedious to continue this for a large number of steps into the future. But we can ask a
computer to calculate the probability of being in state 0 for a large number of steps into the future
by repeating the calculations (see Appendix A for code). Figure (2) shows the probabilities of
the Markov process being in state 0 as a function of the step number for the two possible starting
states.

Figure 2: The probability of being in state 0 as a function of the step number, i, for two different
starting states (0 and 1) for the Markov process depicted in Figure (1).

Note that the probability stabilizes to a steady-state distribution. Knowing whether the chain
started in state 0 or 1 tells you very little about the state of the chain in step 15. Technically, x_15
and x_0 are not independent of each other. If you work through the math, the probability of the
state at step 15 does depend on the starting state:

P(x_15 = 0 | x_0 = 0) = 0.599987792969
P(x_15 = 0 | x_0 = 1) = 0.600018310547

But clearly these two probabilities are very close to being equal.

If we consider an even larger number of iterations (e.g. the state at step 100), then the probabilities
are so close that they are indistinguishable.

We just saw how changing transition probabilities will affect the stationary distribution. Perhaps
there is a one-to-one mapping between transition probabilities and a stationary distribution. In the
form of a question: if we were given the stationary distribution, could we decide what the transition
rates must be to achieve that distribution?

It turns out that the answer is “No.” This can be seen if we examine equation (2). Imagine
dropping the “flux out” of each state by a factor of 10. Because the “self-transition” rate would
increase by the appropriate amount, we can still end up with the same stationary distribution.

Mixing

How would such a chain behave? The primary difference is that it would “mix” more slowly.
Adjacent steps would be more likely to be in the same state, and it would take a larger number
of iterations before the chain “forgets” its starting state. Figure (3) depicts the same probability
statements as Figure (2) but for a process in which P(x_{i+1} = 1 | x_i = 0) = 0.06 and
P(x_{i+1} = 0 | x_i = 1) = 0.09.

Figure 3: The probability of being in state 0 as a function of the step number, i, for two different
starting states (0 and 1) for the Markov process depicted in Figure (1) but with the state-changing
transition probabilities scaled down by a factor of 10. Compare this to Figure (2).

Thus, the rate of convergence of a chain to its stationary distribution is an aspect of a Markov
chain that is separate from what the stationary distribution is.

In MCMC we will design a Markov chain such that its stationary distribution will be identical to
the posterior probability distribution over the space of parameters. We will try to design chains
that have high transition probabilities, so that our MCMC approximation will quickly converge to
the posterior. But even a slowly mixing chain will (in theory) eventually be capable of providing
an adequate sample of the posterior.
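The notes point to Appendix A for the code behind these calculations; that appendix is not reproduced here, but a minimal stand-in looks like this (the function name and step counts are illustrative):

# Track P(x_i = 0) step by step for a two-state chain, for both starting
# states and for both parameterizations discussed above (fast and slow mixing).
def prob_state0(p01, p10, start, n_steps):
    p0 = 1.0 if start == 0 else 0.0          # P(x_0 = 0)
    trace = [p0]
    for _ in range(n_steps):
        p0 = p0 * (1.0 - p01) + (1.0 - p0) * p10
        trace.append(p0)
    return trace

for p01, p10 in [(0.6, 0.9), (0.06, 0.09)]:
    from_0 = prob_state0(p01, p10, start=0, n_steps=40)
    from_1 = prob_state0(p01, p10, start=1, n_steps=40)
    # By step 15 the fast chain has nearly forgotten its start (~0.6 either way);
    # the slow chain has not.
    print(p01, p10, from_0[15], from_1[15])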
Mixing
Setting the transition probabilities to
lower values resulted in a chain that
“mixed” more slowly: adjacent steps
would be more likely to be in the same
state and, thus, the chain would require
a larger number of iterations before it
“forgets” its starting state.
Mixing
The rate of convergence of a chain to its
stationary distribution is an aspect of
a Markov chain that is separate from
what the stationary distribution is.
Mixing
In MCMC, we will design a Markov
chain whose stationary distribution is
identical to the posterior probability
distribution over the space of
parameters.
We try to design chains that have high
transition probabilities to achieve
faster convergence.
Detailed Balance
In practice, the number of states is very
large.
Setting the transition probabilities so
that we have equal flux into and out of
any state is tricky.
What we use instead is detailed
balance.
Detailed Balance
We restrict ourselves to Markov chains
that satisfy detailed balance for all
pairs of states j and k:
π_j P(x_{i+1} = k | x_i = j) = π_k P(x_{i+1} = j | x_i = k)

(equivalently: π_j / π_k = P(x_{i+1} = j | x_i = k) / P(x_{i+1} = k | x_i = j))
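For the two-state example used earlier, detailed balance is easy to verify directly; this is just a sketch using the numbers already derived above:

# Detailed balance for the two-state chain: pi_0 * P(0 -> 1) == pi_1 * P(1 -> 0).
pi = [0.6, 0.4]
p01, p10 = 0.6, 0.9
print(pi[0] * p01, pi[1] * p10)   # both 0.36, so detailed balance holds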
This can be achieved using several
different methods, the most flexible of
which is known as the Metropolis
algorithm and its extension, the
Metropolis-Hastings method.
In the Metropolis-Hastings algorithm,
we choose rules for constructing a
random walk through the parameter
space.
We adopt transition probabilities such
that the stationary distribution of our
Markov chain is equal to the posterior
probability distribution:
π_{θ_j} = P(θ_j | Data)
By detailed balance, the transition probabilities t must satisfy

t_{ℓ,k} / t_{k,ℓ} = π_k / π_ℓ

and, since the desired property is that the stationary distribution equals the posterior,

π_k / π_ℓ = [P(D | θ_j = k) P(θ_j = k) / P(D)] / [P(D | θ_j = ℓ) P(θ_j = ℓ) / P(D)]

P(D) cancels out, so it doesn’t need to be computed!
Therefore, we need to set the transition
probabilities so that
t_{ℓ,k} / t_{k,ℓ} = [P(D | θ_j = k) P(θ_j = k)] / [P(D | θ_j = ℓ) P(θ_j = ℓ)]
However, an important problem arises
when doing this, which can be illustrated
as follows:
when dealing with states k and l, it could be
that we need to set t_{k,l} = 1 and t_{l,k} = 0.5;
when dealing with states k and m, it could be
that we need to set t_{k,m} = 0.3 and t_{m,k} = 0.1.
Then, we have t_{k,m} + t_{k,l} = 1.3, which violates
the fundamental rules of probability!
In our coin example, we can see that P(D|θ = 1)P(θ = 1) is three times higher than P(D|θ = 0)P(θ = 0). Thus we need to ensure that the probability of moving from 1 → 0 is 1/3 the probability of moving from 0 → 1. For instance, we could set m_{0,1} = 1 and m_{1,0} = 1/3. Or we could choose m_{0,1} = 1/2 and m_{1,0} = 1/6. Which should we use? We’d like an MCMC simulation that converges quickly, so we should set the transition probabilities as high as possible. So, using m_{0,1} = 1 and m_{1,0} = 1/3 sounds best.

Similar arguments lead us to the conclusion that m_{1,2} = 1 and m_{2,1} = 3/8. However, this reveals a problem: setting these four transition probabilities in this way means that P(x_{i+1} = 0 | x_i = 1) = 1/3 and P(x_{i+1} = 2 | x_i = 1) = 1. But this violates fundamental rules of probability - the probability of all mutually exclusive events must sum to 1.0. With our transition probabilities it exceeds 1.

Solution:
view the transition probability as a joint event: (1) the move is proposed with probability q, and (2) the move is accepted with probability α.
One solution is to view a transition probability as a joint event: the move is proposed and the move
is accepted. We can denote the probability of proposing a move from j to k as q(j, k). We can
denote the acceptance probability of a move from j to k (given that the move was proposed) as
α(j, k). If we use x′_{i+1} to denote the state proposed at step i + 1, then

q(j, k) = P(x′_{i+1} = k | x_i = j)
α(j, k) = P(x_{i+1} = k | x_i = j, x′_{i+1} = k)

This means that we can choose proposal probabilities that sum to one for all of the state-changing
transitions. Then, we can multiply them by the appropriate acceptance probability (keeping that
as high as we can, but ≤ 1). So, we get
m_{1,0} / m_{0,1} = [q(1, 0) α(1, 0)] / [q(0, 1) α(0, 1)] = [P(D | θ = 0) P(θ = 0)] / [P(D | θ = 1) P(θ = 1)]    (5)
We have a great deal of flexibility in selecting how we perform proposals on new states in MCMC.
We have to assure that q(j, k) > 0 whenever q(k, j) > 0; failing to do this means that the ratio
of proposal probabilities will be 0 in one direction and infinite in the other. It is fine for q(j, k) =
q(k, j) = 0 for some pairs of states (we already said that it was OK if not all states were adjacent in
the Markov chain – note that two states that are not adjacent still fulfill detailed balance because
both fluxes are 0).
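A small simulation can make the proposal/acceptance decomposition concrete. The sketch below is not taken from the notes: the unnormalized posterior weights (1, 3, 8) and the adjacent-state proposal scheme are assumptions chosen to be consistent with the ratios quoted above.

import random

# Unnormalized posterior weights for theta in {0, 1, 2} (assumed; consistent
# with "state 1 is 3x state 0" and m_{2,1} = 3/8 above).
target = {0: 1.0, 1: 3.0, 2: 8.0}

def propose(j):
    """Return (proposed state k, q(j,k), q(k,j)) for an adjacent-state proposal."""
    if j == 0:
        return 1, 1.0, 0.5
    if j == 2:
        return 1, 1.0, 0.5
    k = random.choice([0, 2])          # from the middle state, pick a neighbour
    return k, 0.5, 1.0

def step(j):
    k, q_jk, q_kj = propose(j)
    alpha = min(1.0, (target[k] / target[j]) * (q_kj / q_jk))   # acceptance prob.
    return k if random.random() < alpha else j

# Long-run visit frequencies should approach 1/12, 3/12, 8/12.
state, counts = 0, {0: 0, 1: 0, 2: 0}
for _ in range(200_000):
    state = step(state)
    counts[state] += 1
print({s: round(c / 200_000, 3) for s, c in counts.items()})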
We can choose proposal probabilities that
sum to one for all the state-changing
transitions. Then, we can multiply them
by the appropriate acceptance probabilities
(keeping them as high as possible, but ≤ 1).
We get
t_{ℓ,k} / t_{k,ℓ} = [q(ℓ, k) α(ℓ, k)] / [q(k, ℓ) α(k, ℓ)] = [P(D | θ_j = k) P(θ_j = k)] / [P(D | θ_j = ℓ) P(θ_j = ℓ)]
We have flexibility in selecting how we
perform proposals on new states in MCMC.
We have to ensure that q(l,k)>0 whenever
q(k,l)>0 (it is fine if both are 0, but we can’t
have one being 0 and the other greater than
0).
However, once we have chosen a proposal
scheme, we do not have much flexibility in
choosing whether or not to accept a
proposal. Since

t_{ℓ,k} / t_{k,ℓ} = [q(ℓ, k) α(ℓ, k)] / [q(k, ℓ) α(k, ℓ)] = [P(D | θ_j = k) P(θ_j = k)] / [P(D | θ_j = ℓ) P(θ_j = ℓ)]

rearranging to isolate the acceptance probabilities gives

α(ℓ, k) / α(k, ℓ) = [P(D | θ_j = k) P(θ_j = k) q(k, ℓ)] / [P(D | θ_j = ℓ) P(θ_j = ℓ) q(ℓ, k)]

acceptance ratio = (likelihood ratio) × (prior ratio) × (Hastings ratio)
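Put together, one Metropolis-Hastings update can be written in a few lines. This is only a sketch: log_likelihood, log_prior, and the sliding-window width are stand-ins, not anything defined in the text.

import math
import random

def mh_step(theta, log_likelihood, log_prior, window=0.1):
    """One Metropolis-Hastings update for a single continuous parameter."""
    theta_star = theta + random.uniform(-window / 2.0, window / 2.0)
    # The sliding-window proposal is symmetric, so the Hastings ratio is 1
    # and log r reduces to the log likelihood ratio plus the log prior ratio.
    log_r = (log_likelihood(theta_star) + log_prior(theta_star)
             - log_likelihood(theta) - log_prior(theta))
    if math.log(random.random()) < min(0.0, log_r):
        return theta_star   # accept the proposed value
    return theta            # reject: the chain stays where it is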
The central idea is to make small random changes to some current parameter values, and then accept or reject the changes according to the appropriate probabilities.

Markov chain Monte Carlo steps
1. Start at an arbitrary point (q)
2. Make a small random move (to q*)
3. Calculate the height ratio (r) of the new state (q*) to the old state (q)
   (a) r > 1: new state accepted
   (b) r < 1: new state accepted with probability r
       if new state rejected, stay in old state
4. Go to step 2

[Fig. 7.4: bar chart of the posterior probabilities of Topology A (48%), Topology B (32%), and Topology C (20%), with two proposed moves from the current state q (to q*a and q*b), one labelled “always accept” and one labelled “accept sometimes”.]

Fig. 7.4: The Markov chain Monte Carlo (MCMC) procedure is used to generate a valid sample
from the posterior. One first sets up a Markov chain that has the posterior as its stationary
distribution. The chain is then started at a random point and run until it converges onto this
distribution. In each step (generation) of the chain, a small change is made to the current
values of the model parameters (step 2). The ratio r of the posterior probability of the new
and current states is then calculated. If r > 1, we are moving uphill and the move is always
accepted (3a). If r < 1, we are moving downhill and accept the new state with probability
r (3b).
or reject the proposed value with the probability

r = min(1, [f(θ*|X) / f(θ|X)] × [f(θ|θ*) / f(θ*|θ)])                                      (7.8)
  = min(1, ([f(θ*) f(X|θ*) / f(X)] / [f(θ) f(X|θ) / f(X)]) × [f(θ|θ*) / f(θ*|θ)])          (7.9)
  = min(1, [f(θ*) / f(θ)] × [f(X|θ*) / f(X|θ)] × [f(θ|θ*) / f(θ*|θ)])                      (7.10)
The three ratios in the last equation are referred to as the prior ratio, the likelihood
ratio, and the proposal ratio (or Hastings ratio), respectively. The first two ratios
correspond to the ratio of the numerators in Bayes’ theorem; note that the complex
denominator f(X) cancels out.
Box 7.2 (cont.)
An example of a proposal mechanism is
the beta proposal:
Both the sliding window and normal proposals can be problematic when the effect on
the likelihood varies over the parameter range. For instance, changing a branch length
from 0.01 to 0.03 is likely to have a dramatic effect on the posterior but changing it
from 0.51 to 0.53 will hardly be noticeable. In such situations, the multiplier proposal
is appropriate. It is equivalent to a sliding window with width λ on the log scale of the
parameter. A random number u is drawn from a uniform distribution on the interval
(−0.5, 0.5) and the proposed value is x ∗ = mx, where m = e λu . If the value of λ takes
the form 2 ln a, one will pick multipliers m in the interval (1/a, a).
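A sketch of the multiplier proposal just described (the default tuning value is only an example); for this move the proposal ratio works out to the multiplier m itself, a standard result for multiplier moves:

import math
import random

def multiplier_proposal(x, lam=2.0 * math.log(2.0)):
    """Propose x* = m * x with m = exp(lam * u), u ~ Uniform(-0.5, 0.5)."""
    u = random.uniform(-0.5, 0.5)
    m = math.exp(lam * u)        # with lam = 2 ln a, m lies in (1/a, a)
    return m * x, m              # proposed value and its Hastings ratio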
The beta and Dirichlet proposals are used for simplex parameters. They pick new
values from a beta or Dirichlet distribution centered on the current values of the simplex.
Assume that the current values are (x1 , x2 ). We then multiply them with a value α, which
is a tuning parameter, and pick new values from the distribution Beta(αx1 , αx2 ). The
higher the value of α, the closer the proposed values will be to the current values.
Assume the current values are (x1,x2);
Multiply them with a value α;
Pick new values from Beta(αx1,αx2)
[Figure: Beta(7,3) (α = 10) and Beta(70,30) (α = 100) proposal densities centered on the current values x = (0.7, 0.3).]
More complex moves are needed to change topology. A common type uses stochastic
branch rearrangements (see Chapter 8). For instance, the extending subtree pruning
and regrafting (extending SPR) move chooses a subtree at random and then moves
its attachment point, one branch at a time, until a random number u drawn from a
uniform on (0, 1) becomes higher than a specified extension probability p. The extension
probability p is a tuning parameter; the higher the value, the more drastic rearrangements
will be proposed.
Box 7.2 Proposal mechanisms
A simpler proposal mechanism is to
define a continuous uniform distribution
of width w, centered on the current
value x, and the new value x* is drawn
from this distribution.

Four types of proposal mechanisms are commonly used to change continuous variables.
The simplest is the sliding window proposal. A continuous uniform distribution of width
w is centered on the current value x, and the new value x* is drawn from this distribution.
The width w of the “window” is a tuning parameter. A larger value of w results in more radical
proposals and lower acceptance rates, while a smaller value leads to more modest changes
and higher acceptance rates.
The normal proposal is similar to the sliding window except that it uses a normal
distribution centered on the current value x. The variance σ 2 of the normal distribution
determines how drastic the new proposals are and how often they will be accepted.
More complex moves are needed to
change tree topology.
A common type uses operations such as
SPR, TBR, and NNI.
Burn-in, mixing, and
convergence
If the chain is started from a random
tree and arbitrarily chosen branch
lengths, chances are that the initial
likelihood is low.
The early phase of the run in which the
likelihood increases very rapidly
towards regions in the posterior with
high probability mass is known as the
burn-in.
[Trace plot (Fig. 7.5): ln L against generation (0 to 500,000), rising rapidly during the burn-in and then levelling off in a putative stationary phase.]
Fig. 7.5
The likelihood values typically increase very rapidly during the initial phase of the run
because the starting point is far away from the regions in parameter space with high
posterior probability. This initial phase of the Markov chain is known as the burn in. The
burn-in samples are typically discarded because they are so heavily influenced by the starting
point. As the chain converges onto the target distribution, the likelihood values tend to reach
a plateau. This phase of the chain is sampled with some thinning, primarily to save disk
space.
7.4 Burn-in, mixing and convergence
If the chain is started from a random tree and arbitrarily chosen branch lengths,
chances are that the initial likelihood is low. As the chain moves towards the regions
in the posterior with high probability mass, the likelihood typically increases very
rapidly; in fact, it almost always changes so rapidly that it is necessary to measure
it on a log scale (Fig. 7.5). This early phase of the run is known as the burn-in, and
the burn-in samples are often discarded because they are so heavily influenced by
the starting point.
As the chain approaches its stationary distribution, the likelihood values tend to
reach a plateau. This is the first sign that the chain may have converged onto the
target distribution. Therefore, the plot of the likelihood values against the generation
of the chain, known as the trace plot (Fig. 7.5), is important in monitoring
the performance of an MCMC run. However, it is extremely important to confirm
convergence using other diagnostic tools because it is not sufficient for the chain to
reach the region of high probability in the posterior; it must also cover this region
adequately. The speed with which the chain covers the interesting regions of the
posterior is known as its mixing behavior. The better the mixing, the faster the
chain will generate an adequate sample of the posterior.
As the chain approaches its stationary
distribution, the likelihood values tend
to reach a plateau.
This is the first sign that the chain may
have converged onto the target
distribution.
However, it is not sufficient for the chain to
reach the region of high probability in the
posterior; it must also cover this region
adequately.
The speed with which the chain covers the
interesting regions of the posterior is
known as its mixing behavior.
The better the mixing, the faster the chain
will generate an adequate sample of the
posterior.
[Fig. 7.6: sampled values against generation for three settings of the proposal tuning parameter. (a) Too modest proposals: acceptance rate too high, poor mixing. (b) Too bold proposals: acceptance rate too low, poor mixing. (c) Moderately bold proposals: intermediate acceptance rate, good mixing.]
Fig. 7.6
The time it takes for a Markov chain to obtain an adequate sample of the posterior depends
critically on its mixing behavior, which can be controlled to some extent by the proposal
tuning parameters. If the proposed values are very close to the current ones, all proposed
changes are accepted but it takes a long time for the chain to cover the posterior; mixing is
poor. If the proposed values tend to be dramatically different from the current ones, most
proposals are rejected and the chain will remain on the same value for a long time, again
leading to poor mixing. The best mixing is obtained at intermediate values of the tuning
parameters, associated with moderate acceptance rates.
The mixing behavior of a Metropolis sampler can be adjusted using its tuning
parameter(s). Assume, for instance, that we are sampling from a normal distribution using a sliding window proposal (Fig. 7.6). The sliding window proposal has
one tuning parameter, the width of the window. If the width is too small, then the
proposed value will be very similar to the current one (Fig. 7.6a). The posterior
probabilities will also be very similar, so the proposal will tend to be accepted. But
each proposal will only move the chain a tiny distance in parameter space, so it will
take the chain a long time to cover the entire region of interest; mixing is poor.
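The effect of the tuning parameter is easy to reproduce in a toy setting. The sketch below samples a standard normal target (an assumption made purely for illustration) with sliding windows of three different widths and reports the acceptance rates:

import math
import random

def acceptance_rate(width, n_steps=50_000):
    """Metropolis sampling of a standard normal target with a sliding window."""
    x, accepted = 0.0, 0
    for _ in range(n_steps):
        x_star = x + random.uniform(-width / 2.0, width / 2.0)
        log_r = -0.5 * (x_star ** 2 - x ** 2)      # log target ratio
        if math.log(random.random()) < min(0.0, log_r):
            x, accepted = x_star, accepted + 1
    return accepted / n_steps

for width in (0.1, 2.0, 50.0):       # too modest, moderate, too bold
    print(width, acceptance_rate(width))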
In Bayesian MCMC sampling of
phylogenetic problems, the tree topology
is typically the most difficult parameter
to sample from.
Therefore, it makes sense to focus on
this parameter when monitoring
convergence.
Summarizing the results
The stationary phase of the chain is
typically sampled with some thinning,
for instance every 50th or 100th
generation.
Once an adequate sample is obtained, it
is usually trivial to compute an
estimate of the marginal posterior
distribution for the parameter(s) of
interest.
For example, this can take the form of a
frequency histogram of the sampled
values.
When it is difficult to visualize this
distribution or when space does not
permit it, various summary statistics
are used instead.
The most common approach to
summarizing topology posteriors is to
give the frequencies of the most
common splits, since there are far
fewer splits than topologies.
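As one concrete (and purely illustrative) way to do this, if each sampled tree is represented as the set of splits it contains, the split frequencies are simple counts:

from collections import Counter

def split_frequencies(sampled_trees):
    """sampled_trees: iterable of sets of splits (each split a frozenset of taxa)."""
    counts = Counter()
    for splits in sampled_trees:
        counts.update(splits)
    n = len(sampled_trees)
    return {split: c / n for split, c in counts.items()}

# Toy example with three sampled topologies on four taxa:
trees = [
    {frozenset({"human", "chimp"})},
    {frozenset({"human", "chimp"})},
    {frozenset({"human", "gorilla"})},
]
print(split_frequencies(trees))   # the human+chimp split appears in 2/3 of samples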
Unfortunately, the molecular clock assumption is
usually inappropriate for distantly related sequences. In
this situation, a correction of the pairwise distances that
accounts for multiple ‘hits’ to the same site should be
used. There are many models for how sequence evolution
occurs, each of which implies a different way to correct
pairwise distances (see REF. 10 for some suggestions), so
there is considerable debate on which correction to use
(see BOX 2 for a discussion of model selection).
Furthermore, these corrections have substantial variance
when the distances are large.
Tree searches under optimality criteria. The NJ tree is
often treated as the starting point for a computationally
intensive search for the best phylogeny (TABLE 2). To perform a tree search, a standard must be used for comparing trees — an optimality criterion, in the jargon of
phylogenetics. The most popular criteria are parsimony, minimum evolution and ML. Minimum evolution (ME), like NJ, requires that the data be compressed
into a distance matrix; therefore, like NJ, its success is
dependent on the sequence divergences being adequately corrected for multiple hits (or the tree being so
easy to infer that poor distance estimates are sufficient).
Summary
Box 2 | The phylogenetic inference process
Source: Nat Rev Genet, 4:275, 2003

[Box 2 flowchart: Collect data → Retrieve homologous sequences → Multiple sequence alignment → Model selection → Estimate ‘best’ tree → Assess confidence → Hypothesis testing, illustrated with example sequences, a NEXUS-format data matrix, and a primate tree with support values.]

The flowchart puts phylogenetic estimation (shown in the green box) into the context of an
entire study. After new sequence data are collected, the first step is usually downloading other
relevant sequences. Typically, a few outgroup sequences are included in a study to root the
tree (that is, to indicate which nodes in the tree are the oldest), provide clues about the early
ancestral sequences and improve the estimates of parameters in the model of evolution.
Insertions and deletions obscure which of the sites are homologous. Multiple-sequence
alignment is the process of adding gaps to a matrix of data so that the nucleotides (or amino
acids) in one column of the matrix are related to each other by descent from a common ancestral
residue (a gap in a sequence indicates that the site has been lost in that species, or that a base was
inserted at that position in some of the other species). Although models of sequence evolution
that incorporate insertions and deletions have been proposed55–58, most phylogenetic methods
proceed using an aligned matrix as the input (see REF. 59 for a review of the interplay between
alignment and tree inference).
In addition to the data, the scientist must choose a model of sequence evolution (even if this
means just choosing a family of models and letting software infer the parameters of these models).
Increasing model complexity improves the fit to the data but also increases variance in estimated
parameters. Model selection60–63 strategies attempt to find the appropriate level of complexity on
the basis of the available data. Model complexity can often lead to computational intractability,
so pragmatic concerns sometimes outweigh statistical ones (for example, NJ and parsimony are
mainly justifiable by their speed).
As discussed in BOX 3, data and a model can be used to create a sample of trees through either
Markov chain Monte Carlo (MCMC) or multiple tree searches on bootstrapped data (the
‘traditional’ approach). This collection of trees is often summarized using consensus-tree
techniques, which show the parts of the tree that are found in most, or all, of the trees in a set.
Although useful, CONSENSUS METHODS are just one way of summarizing the information in a
group of trees. AGREEMENT SUBTREES are more resistant to ‘rogue sequences’ (one or a few
sequences that are difficult to place on the tree); the presence of such sequences can make a
consensus tree relatively unresolved, even when there is considerable agreement on the
relationships between the other sequences. Sometimes, the bootstrap or MCMC sample might
show substantial support for multiple trees that are not topologically similar. In such cases,
presenting more than one tree (or more than one consensus of trees) might be the only way to
appropriately summarize the data.

Box 1 | Applications of phylogenetic methods
Detection of orthology and paralogy
Phylogenetics is commonly used to sort out the history of gene duplications in gene families.
This application is now included in even preliminary examinations of sequence data; for
example, the initial analysis of the mouse genome46 included neighbour-joining trees to identify
duplications in cytochrome P450 and other gene families.
Estimating divergence times
Bayesian implementations of new models37,38 allowed Aris-Brosou and Yang40 to estimate when
animal phyla diverged without assuming a molecular clock.
Reconstructing ancient proteins
Chang et al.47 used maximum likelihood (ML) to reconstruct the sequence of visual pigments in
the last common ancestor of birds and alligators; the protein was then synthesized in the
laboratory (see REF. 48 for a recent discussion of the methodology of ancestral-character-state
reconstruction).
Finding the residues that are important to natural selection
Amino-acid sites on the surface of influenza that are targeted by the immune system can
NATURE REVIEWS | GENETICS
be detected by an excess of non-synonymous substitutions49–51. This information might
assist vaccine preparation.
Traditional approaches
Detecting recombination points
New Bayesian methods52 can help determine which strains of human
immunodeficiency virus-1 (HIV-1) arose from recombination.
© 2003 Nature Publishing Group
Identifying mutations likely to be associated with disease
The lack of structural, biochemical and functional data from many genes implicated in
disease means it is unclear which missense mutations are important. Fleming et al.53
used Bayesian phylogenetics to identify missense mutations in conserved regions and
regions under positive selection in the breast cancer gene BRCA1. These data allowed
them to prioritize these mutations for future functional and population studies.
Determining the identity of new pathogens
Phylogenetic analysis is now routinely performed after polymerase chain reaction (PCR)
amplification of genomic fragments of previously unknown pathogens. Such analyses
made possible the rapid identification of both Hantavirus54 and West Nile virus4,5.
distantly related sequences to diverge from each other
more slowly than is expected, or even become more
similar to each other at some residues. Many of the sites
in a DNA sequence are not helpful for phylogenetic
The most common traditional approaches to reconstructing phylogenies are the neighbour-joining (NJ)
algorithm and tree searches that use an optimality criterion such as PARSIMONY or maximum likelihood (ML).
TABLE 1 shows a summary of the advantages and disadvantages of these methods, as well as a list of the software packages that implement them. BOX 2 sets these
methods in the context of the entire study, from data
collection to hypothesis testing.
Neighbour-joining. The extremely popular NJ algorithm7,8 is relatively fast (a matter of seconds for most
data sets) and, like all methods, performs well when the
divergence between sequences is low. The first step in
the algorithm is converting the DNA or protein
sequences into a distance matrix that represents an estimate of the evolutionary distance between sequences
(the evolutionary distance is the number of changes that
have occurred along the branches between two
A branch of statistics that
ocuses on the posterior
probability of hypotheses. The
posterior probability is
proportional to the product of
he prior probability and the
ikelihood.
PARSIMONY
Source: Nat Rev Genet, 4:275, 2003
Summary
BAYESIAN
n systematics, parsimony refers
o choosing between trees on the
basis of which one requires the
ewest possible mutations to
explain the data.
276
Hylobates
inference have provided a method for simultaneously
estimating trees and obtaining measurements of
Hypothesis testing
uncertainty for every
branch. Here, we briefly review
the most popular methods of tree estimation (for a
REF. 64)| APRIL
and 2003
the| 2newer
more thorough review, seeVOLUME
77
Bayesian approaches.
Table 1 | Comparison of methods

Neighbour joining
  Advantages: Fast
  Disadvantages: Information is lost in compressing sequences into distances; reliable estimates of pairwise distances can be hard to obtain for divergent sequences
  Software: PAUP*, MEGA, PHYLIP

Parsimony
  Advantages: Fast enough for the analysis of hundreds of sequences; robust if branches are short (closely related sequences or dense sampling)
  Disadvantages: Can perform poorly if there is substantial variation in branch lengths
  Software: PAUP*, NONA, MEGA, PHYLIP

Minimum evolution
  Advantages: Uses models to correct for unseen changes
  Disadvantages: Distance corrections can break down when distances are large
  Software: PAUP*, MEGA, PHYLIP

Maximum likelihood
  Advantages: The likelihood fully captures what the data tell us about the phylogeny under a given model
  Disadvantages: Can be prohibitively slow (depending on the thoroughness of the search and access to computational resources)
  Software: PAUP*, PAML, PHYLIP

Bayesian
  Advantages: Has a strong connection to the maximum likelihood method; might be a faster way to assess support for trees than maximum likelihood bootstrapping
  Disadvantages: The prior distributions for parameters must be specified; it can be difficult to determine whether the Markov chain Monte Carlo (MCMC) approximation has run for long enough
  Software: MrBayes, BAMBE

For a more complete list of software implementations, see online link to Phylogeny Programs. For software URLs, see online links box.
Acknowledgment
Material in these slides is based on Chapter 7 in “The Phylogenetic Handbook”, Lemey, Salemi, Vandamme (Eds.).
Some of the material is based on MCMC notes by Prof. Mark Holder.
Questions?