Phylogenetic Tree Construction using Sequential Monte Carlo Algorithms on Posets

advertisement
Phylogenetic Tree Construction using Sequential
Monte Carlo Algorithms on Posets
Liangliang Wang
Western University, Canada
Joint work with
Alexandre Bouchard-Côté
Arnaud Doucet
Recent Advances in SMC
Sep 19-21, 2012
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
1 / 60
Outline
1
Background
2
Combinatorial Sequential Monte Carlo (CSMC)
3
Particle MCMC
4
Ongoing and Future Work
Background
Phylogenetics
Example: what are the evolutionary relationships among these Cichlid fishes?
(a) Chalinochromis popelini
(d) Lepidiolamprologus
(b) Julidochromis marlieri
(e) Neolamprologus brichardi (f) Neolamprologus
elongatus
Liangliang Wang (Western University)
(c) Lamprologus callipterus
(g) Telmatochromis temporalis
tetracanthus
Bayesian Phylogenetics via SMC
3 / 60
Background
Data: Y
Biological sequences of a set of species
e.g. a DNA sequence is a string of characters from the set of four
nucleotides {A, C, G, T }.
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
4 / 60
Background
Data: Y
Biological sequences of a set of species
e.g. a DNA sequence is a string of characters from the set of four
nucleotides {A, C, G, T }.
An example of aligned DNA sequences.
Nucleotides in the same column were obtained from a shared ancestral
nucleotide
A
CTCTAGCCTTTTTCCACT
B
TTCTAGCCTTTCTCTACT
C
CTCTAGCCTTTCTCTACT
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
4 / 60
Background
A rooted phylogenetic tree, t, and evolution
●
Root: a common ancestor;
●
●
●
●
AGTC..
●
Bayesian Phylogenetics via SMC
AGAC..
AGAC..
Liangliang Wang (Western University)
●
●
ATTT..
AGAC..
●
5 / 60
Background
A rooted phylogenetic tree, t, and evolution
●
Root: a common ancestor;
●
Internal nodes
●
●
●
AGTC..
●
Bayesian Phylogenetics via SMC
AGAC..
AGAC..
Liangliang Wang (Western University)
●
●
ATTT..
AGAC..
●
5 / 60
Background
A rooted phylogenetic tree, t, and evolution
●
Root: a common ancestor;
●
Internal nodes
Branch lengths
●
β5
β4
β3
β6
●
β7
β8
●
●
ATTT..
●
AGAC..
AGAC..
Bayesian Phylogenetics via SMC
●
AGTC..
●
AGAC..
positive real numbers associated
with each edge,
specifying the amount of evolution
between nodes.
Liangliang Wang (Western University)
β2
β1
5 / 60
Background
A likelihood Model: CTMC
A likelihood model: P(Y|t, θ)
Assumption: site independence.
A
CTCTAGCCTTTTTCCACT
B
TTCTAGCCTTTCTCTACT
C
CTCTAGCCTTTCTCTACT
One site
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
6 / 60
Background
A likelihood Model: CTMC
A likelihood model: P(Y|t, θ)
Assumption: site independence.
Likelihood model on each site over one
branch is a Continuous Time Markov
Chain (CTMC): {Y s : s ∈ [0, b]}
The state space of the chain:
Y s ∈ {A, C, G, T }.
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
6 / 60
Background
A likelihood Model: CTMC
A likelihood model: P(Y|t, θ)
Assumption: site independence.
Likelihood model on each site over one
branch is a Continuous Time Markov
Chain (CTMC): {Y s : s ∈ [0, b]}
The state space of the chain:
Y s ∈ {A, C, G, T }.
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
6 / 60
Background
A likelihood Model: CTMC
A likelihood model: P(Y|t, θ)
Assumption: site independence.
Likelihood model on each site over one
branch is a Continuous Time Markov
Chain (CTMC): {Y s : s ∈ [0, b]}
The state space of the chain:
Y s ∈ {A, C, G, T }.
The transition matrix: P(b) = eQ4×4 b for
the branch of length b.
e.g. P1,3 (b) = P(Yb = G|Y0 = A).
Liangliang Wang (Western University)


πC κπG πT 
 
 π
πG κπT 
A


Q = 

πT 
 κπA πC

πA κπC πG
-
Bayesian Phylogenetics via SMC
6 / 60
Background
A likelihood Model: CTMC
A likelihood model: P(Y|t, θ)
Assumption: site independence.
Likelihood model on each site over one
branch is a Continuous Time Markov
Chain (CTMC): {Y s : s ∈ [0, b]}
The state space of the chain:
Y s ∈ {A, C, G, T }.
The transition matrix: P(b) = eQ4×4 b for
the branch of length b.
e.g. P1,3 (b) = P(Yb = G|Y0 = A).
Evolutionary parameters in CTMCs are
denoted by θ.
Liangliang Wang (Western University)


πC κπG πT 
 
 π
πG κπT 
A


Q = 

πT 
 κπA πC

πA κπC πG
-
Bayesian Phylogenetics via SMC
6 / 60
Background
Bayesian phylogenetics
Bayesian phylogenetics
Data: aligned sequences, denoted by Y
θ: evolutionary parameters
t: a phylogenetic tree
Posterior
π(θ, t|Y) =
Liangliang Wang (Western University)
P(Y|t, θ)p(t|θ)p(θ)
P(Y)
Bayesian Phylogenetics via SMC
7 / 60
Background
Posterior expectation of ϕ(t):
Bayesian phylogenetics
Z
π(dt)ϕ(t)
An example of the function ϕ:
ϕ(t) = 1(c ∈ clades(t))
V1
a clade: a group consisting of a
species and all its descendants
clades(t): all the clades of the tree t
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
8 / 60
Background
Bayesian phylogenetics
Difficult inference problem over a huge tree space
A
B
C
Liangliang Wang (Western University)
A
C
B
Bayesian Phylogenetics via SMC
C
B
A
9 / 60
Background
Bayesian phylogenetics
Difficult inference problem over a huge tree space
A
B
C
A
C
B
C
B
A
Tree space for a phylogenetic tree
#Species
3
4
6
10
Liangliang Wang (Western University)
#Topologies
3
15
945
34459425
Bayesian Phylogenetics via SMC
9 / 60
Background
Bayesian phylogenetics
Standard Bayesian phylogenetics using MCMC
MCMC: obtain samples tk ∼ π(·|Y), k = 1, · · · , K
......
Z
Liangliang Wang (Western University)
K
1X
π(dt)ϕ(t) ≈
ϕ(tk )
K k=1
Bayesian Phylogenetics via SMC
10 / 60
Background
Bayesian phylogenetics
Problems with MCMC
The Markov chain doesn’t explore the tree space well
Only small moves are allowed in each iteration
Each step is expensive to compute
MCMC does not scale to large datasets
a large number of taxa
large amount of data for each taxon
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
11 / 60
Background
Bayesian phylogenetics
The Ultimate Goal in Phylogenetics
Infer phylogenetic trees accurately and efficiently
Develop new statistical evolutionary models
Computational algorithms for efficient analysis of large-scale datasets
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
12 / 60
Background
Bayesian phylogenetics
The Ultimate Goal in Phylogenetics
Infer phylogenetic trees accurately and efficiently
Develop new statistical evolutionary models
Current model: CTMC over characters for each site.
Proposed model: a general string-valued CTMC for biological
sequences.
Computational algorithms for efficient analysis of large-scale datasets
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
12 / 60
Background
Bayesian phylogenetics
The Ultimate Goal in Phylogenetics
Infer phylogenetic trees accurately and efficiently
Develop new statistical evolutionary models
Current model: CTMC over characters for each site.
Proposed model: a general string-valued CTMC for biological
sequences.
Computational algorithms for efficient analysis of large-scale datasets
Current methods:
Standard MCMC
SMC for unrealistic phylogenetic trees (Teh et al. 2008; Bouchard-Côté
et al. 2011) for fixed parameters θ
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
12 / 60
Background
Bayesian phylogenetics
The Ultimate Goal in Phylogenetics
Infer phylogenetic trees accurately and efficiently
Develop new statistical evolutionary models
Current model: CTMC over characters for each site.
Proposed model: a general string-valued CTMC for biological
sequences.
Computational algorithms for efficient analysis of large-scale datasets
Current methods:
Standard MCMC
SMC for unrealistic phylogenetic trees (Teh et al. 2008; Bouchard-Côté
et al. 2011) for fixed parameters θ
Proposed methods:
An efficient SMC algorithm for general phylogenetic trees
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
12 / 60
Background
Bayesian phylogenetics
The Ultimate Goal in Phylogenetics
Infer phylogenetic trees accurately and efficiently
Develop new statistical evolutionary models
Current model: CTMC over characters for each site.
Proposed model: a general string-valued CTMC for biological
sequences.
Computational algorithms for efficient analysis of large-scale datasets
Current methods:
Standard MCMC
SMC for unrealistic phylogenetic trees (Teh et al. 2008; Bouchard-Côté
et al. 2011) for fixed parameters θ
Proposed methods:
An efficient SMC algorithm for general phylogenetic trees
PMCMC for joint estimation of t and θ
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
12 / 60
Combinatorial Sequential Monte Carlo (CSMC)
1
Background
2
Combinatorial Sequential Monte Carlo (CSMC)
3
Particle MCMC
4
Ongoing and Future Work
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
13 / 60
Combinatorial Sequential Monte Carlo (CSMC)
SMC algorithm for phylogenetic trees
Target distribution: the posterior π(t|Y) ∝ γ(t|Y)
Z = P(Y|t)p(t)
Interested in the posterior expectation of ϕ(t):
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
π(dt)ϕ(t).
14 / 60
Combinatorial Sequential Monte Carlo (CSMC)
SMC algorithm for phylogenetic trees
Target distribution: the posterior π(t|Y) ∝ γ(t|Y)
Z = P(Y|t)p(t)
Interested in the posterior expectation of ϕ(t):
Input Y
Aligned biological sequences
Liangliang Wang (Western University)
π(dt)ϕ(t).
A
CTCTAGCCTTTTTCCACT
B
TTCTAGCCTTTCTCTACT
C
CTCTAGCCTTTCTCTACT
D
TTCTAGCCTTTTTCTACT
Bayesian Phylogenetics via SMC
14 / 60
Combinatorial Sequential Monte Carlo (CSMC)
SMC algorithm for phylogenetic trees
Target distribution: the posterior π(t|Y) ∝ γ(t|Y)
Z = P(Y|t)p(t)
Interested in the posterior expectation of ϕ(t):
Input Y
Aligned biological sequences
π(dt)ϕ(t).
A
CTCTAGCCTTTTTCCACT
B
TTCTAGCCTTTCTCTACT
C
CTCTAGCCTTTCTCTACT
D
TTCTAGCCTTTTTCTACT
Output:
weighted particles {(tk , Wk )} to approximate the posterior distribution
over trees, π̂(t|Y)
t1
B ●
C●
D●
A
●
t2
BADC
●●●●
t3
t4
A ●
C ●
B ●
D
●
t5
t6
B ●
A ●
D ●
C
●
B ●
A ●
D ●
C
●
B ●
A ●
D ●
C
●
t7
t8
t9
B ●
A ●
D ●
C
●
B ●
A ●
C ●
D
●
B ●
D ●
A ●
C
●
t10
B●
D●
A●
C
●
estimate of the marginal likelihood, P̂(Y).
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
14 / 60
Combinatorial Sequential Monte Carlo (CSMC)
A sequence of partial states
Using ν+ (s0 → t) is not efficient
sr : a partial state (forest) of n − r subtrees (n is
the number of species)
A forward proposal ν+ (sr−1 → sr ): randomly
choose a pair of subtrees of sr−1 to merge.
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
15 / 60
Combinatorial Sequential Monte Carlo (CSMC)
How to define distributions πr over partial states sr ?
πr (sr |Y) ∝ P(Y|sr )p(sr )
P(Y|sr ): the likelihood of the partial state sr
we have likelihood model for trees
consider the trees in a forest to be independent
the product of the likelihood of the subtrees of sr .
πr is represented by K weighted particles, {(srk , Wrk )}
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
16 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Illustration of SMC
The initial partial state
s0 : a forest in which each sequence is a trivial tree with a single leaf.
s0
Liangliang Wang (Western University)
A●
B●
C●
D
●
Bayesian Phylogenetics via SMC
17 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Illustration of SMC
The first partial state s1
Generate particle s11 using ν+ (s0 → ·)
Randomly choose a pair of subtrees, species B and C, to merge.
s1,1
B●
C●
D●
A
●
s0
Liangliang Wang (Western University)
A●
B●
C●
D
●
Bayesian Phylogenetics via SMC
18 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Illustration of SMC
The first partial state s1
Generate particle s12 using ν+ (s0 → ·)
Randomly choose a pair of subtrees, species D and C, to merge.
s1,1
s1,2
B●
C●
D●
A
●
B●
A●
D●
C
●
s0
Liangliang Wang (Western University)
A●
B●
C●
D
●
Bayesian Phylogenetics via SMC
19 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Illustration of SMC
The first partial state s1
Generate K particles s1k using ν+ (s0 → ·)
These particles cannot represent π1 directly
We need to compensate for the discrepancy between the distribution
of interest and the proposed distribution.
s1
s1,1
s1,2
s1,3
s1,4
s1,5
s1,6
s1,7
s1,8
s1,9
s1,10
B●
C●
D●
A
●
B●
A●
D●
C
●
A●
C●
B●
D
●
A●
D●
B●
C
●
B●
A●
D●
C
●
A●
B●
D●
C
●
C●
D●
B●
A
●
B●
A●
C●
D
●
B●
D●
A●
C
●
A●
B●
D●
C
●
s0
Liangliang Wang (Western University)
A●
B●
C●
D
●
Bayesian Phylogenetics via SMC
20 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Illustration of SMC
Update the weight of particles s1k
The rectangle size corresponds to the normalized particle weight
W1 (s1k ).
s1,1
s1
B ●
C ●
D ●
A
●
s1,2
B ADC
●●●●
s1,3
ACBD
●●●●
s1,4
ADBC
●
●●●
s1,5
B ●
A ●
D ●
C
●
s0
Liangliang Wang (Western University)
s1,6
ABDC
●
●●●
s1,7
C●
D●
B ●
A
●
s1,8
s1,9
B ●
A ●
C ●
D
●
B ●
D ●
A ●
C
●
s1,10
ABDC
●
●●●
A●
B●
C●
D
●
Bayesian Phylogenetics via SMC
21 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Illustration of SMC
Resample s1k
Using a multinomial distribution
Purpose: prune unpromising particles
s1,1
s1
B ●
C ●
D ●
A
●
s1,2
B ADC
●●●●
s1,3
ACBD
●●●●
s1,4
ADBC
●
●●●
s1,5
B ●
A ●
D ●
C
●
s0
Liangliang Wang (Western University)
s1,6
ABDC
●
●●●
s1,7
C●
D●
B ●
A
●
s1,8
s1,9
B ●
A ●
C ●
D
●
B ●
D ●
A ●
C
●
s1,10
ABDC
●
●●●
A●
B●
C●
D
●
Bayesian Phylogenetics via SMC
22 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Illustration of SMC
The second partial state s2
Generate the particles s2k using ν+ (s1k → ·)
Update the weights of the particles W2 (s2k )
s2,1
s2
B ●
C ●
D ●
A
●
s1,1
s1
B ●
C ●
D ●
A
●
s2,2
s2,3
B ADC
●●●●
A ●
C ●
B ●
D
●
s1,2
s1,3
B ADC
●●●●
ACBD
●●●●
s2,4
s2,5
B ●
A ●
D ●
C
●
B ●
A ●
D ●
C
●
s1,4
ADBC
●
●●●
s1,5
Liangliang Wang (Western University)
s2,7
B●
A●
D●
C
●
C●
D●
B●
A
●
s1,6
B ●
A ●
D ●
C
●
s0
s2,6
ABDC
●
●●●
s1,7
C●
D●
B ●
A
●
s2,8
s2,9
B ●
A ●
C ●
D
●
B ●
D ●
A ●
C
●
s1,8
s1,9
B ●
A ●
C ●
D
●
B ●
D ●
A ●
C
●
s2,10
B ●
D ●
A ●
C
●
s1,10
ABDC
●
●●●
A●
B●
C●
D
●
Bayesian Phylogenetics via SMC
23 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Illustration of SMC
Resample s2k
Using a multinomial distribution.
Purpose: prune unpromising particles
s2,1
s2
B ●
C ●
D ●
A
●
s1,1
s1
B ●
C ●
D ●
A
●
s2,2
s2,3
B ADC
●●●●
A ●
C ●
B ●
D
●
s1,2
s1,3
B ADC
●●●●
ACBD
●●●●
s2,4
s2,5
B ●
A ●
D ●
C
●
B ●
A ●
D ●
C
●
s1,4
ADBC
●
●●●
s1,5
Liangliang Wang (Western University)
s2,7
B●
A●
D●
C
●
C●
D●
B●
A
●
s1,6
B ●
A ●
D ●
C
●
s0
s2,6
ABDC
●
●●●
s1,7
C●
D●
B ●
A
●
s2,8
s2,9
B ●
A ●
C ●
D
●
B ●
D ●
A ●
C
●
s1,8
s1,9
B ●
A ●
C ●
D
●
B ●
D ●
A ●
C
●
s2,10
B ●
D ●
A ●
C
●
s1,10
ABDC
●
●●●
A●
B●
C●
D
●
Bayesian Phylogenetics via SMC
24 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Illustration of SMC
The final state (full tree)
s3,1
s3
BCDA
●●●●
s2,1
s2
B ●
C ●
D ●
A
●
s1,1
s1
B ●
C ●
D ●
A
●
s3,2
BADC
●●●●
s2,2
s3,3
A●
C●
B●
D
●
s2,3
B ADC
●●●●
A ●
C ●
B ●
D
●
s1,2
s1,3
B ADC
●●●●
ACBD
●●●●
s3,4
B ●
A ●
D ●
C
●
s3,5
s3,6
B●
A●
D●
C
●
B●
A●
D●
C
●
s2,4
s2,5
B ●
A ●
D ●
C
●
B ●
A ●
D ●
C
●
s1,4
ADBC
●
●●●
s1,5
s0
Liangliang Wang (Western University)
s3,8
s3,9
B ●
A ●
D ●
C
●
B ●
A ●
C ●
D
●
B ●
D ●
A ●
C
●
s2,6
s2,7
B●
A●
D●
C
●
C●
D●
B●
A
●
s1,6
B ●
A ●
D ●
C
●
s3,7
ABDC
●
●●●
s1,7
C●
D●
B ●
A
●
s2,8
s2,9
B ●
A ●
C ●
D
●
B ●
D ●
A ●
C
●
s1,8
s1,9
B ●
A ●
C ●
D
●
B ●
D ●
A ●
C
●
s3,10
BDAC
●●●●
s2,10
B ●
D ●
A ●
C
●
s1,10
ABDC
●
●●●
A●
B●
C●
D
●
Bayesian Phylogenetics via SMC
25 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Weight update
The weight update in a standard SMC
wr (sr ) = wr−1 (sr−1 ) ·
γr (sr )
1
γr−1 (sr−1 ) ν+ (sr−1 → sr )
γr (sr ) : unnormalized density of sr
γr−1 (sr−1 ) : unnormalized density of sr−1
ν+ : forward proposal
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
26 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Weight update
The weight update in a standard SMC
wr (sr ) = wr−1 (sr−1 ) ·
γr (sr )
1
γr−1 (sr−1 ) ν+ (sr−1 → sr )
γr (sr ) : unnormalized density of sr
γr−1 (sr−1 ) : unnormalized density of sr−1
ν+ : forward proposal
This weight update of the standard SMC will lead to a biased estimate for
general phylogenetic trees due to an over-counting problem.
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
26 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Weight update
The weight update in a standard SMC
wr (sr ) = wr−1 (sr−1 ) ·
γr (sr )
1
γr−1 (sr−1 ) ν+ (sr−1 → sr )
WRONG for general trees!
γr (sr ) : unnormalized density of sr
γr−1 (sr−1 ) : unnormalized density of sr−1
ν+ : forward proposal
This weight update of the standard SMC will lead to a biased estimate for
general phylogenetic trees due to an over-counting problem.
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
26 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Over-counting problem
Over-counting some samples
Two trees (each 1/2)
A ●
B ●
C ●
D
●
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
A ●
C ●
B ●
D
●
27 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Over-counting problem
Over-counting some samples
Two trees (each 1/2)
A ●
B ●
C ●
D
●
Two copies of the same partial
state: s21 , s22
Two copies of the same full tree:
s31 , s32
s31
A ●
B ●
C ●
D
●
s21
A ●
B ●
C ●
D
●
s11
A ●
B ●
C ●
D
●
A ●
C ●
B ●
D
●
s32
s33
A ●
B ●
C ●
D
●
A ●
C ●
B ●
D
●
s23
s22
A ●
B ●
C ●
D
●
A ●
C ●
B ●
D
●
s13
s12
A ●
B ●
C ●
D
●
A ●
C ●
B ●
D
●
A ●
B ●
C ●
D
●
s0
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
27 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Over-counting problem
Over-counting some samples
Two trees (each 1/2)
A ●
B ●
C ●
D
●
A ●
C ●
B ●
D
●
s31 + s32
Two copies of the same partial
state: s21 , s22
A
●
C
●
A ●
C ●
B ●
D
●
D
●
s21 + s22
Two copies of the same full tree:
s31 , s32
A
●
Their weights are doubled.
This will cause biased
estimates.
B
●
s33
s11
A ●
B ●
C ●
D
●
B
●
C
●
s23
A ●
C ●
B ●
D
●
D
●
s13
s12
A ●
B ●
C ●
D
●
A ●
C ●
B ●
D
●
A ●
B ●
C ●
D
●
s0
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
27 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Over-counting problem
Over-counting some samples
Two trees (each 1/2)
A ●
B ●
C ●
D
●
A ●
C ●
B ●
D
●
s31 + s32
Two copies of the same partial
state: s21 , s22
s33
A ●
B ●
C ●
D
●
Two copies of the same full tree:
A ●
C ●
B ●
D
●
s21 + s22
s31 , s32
s23
A ●
B ●
C ●
D
●
A ●
C ●
B ●
D
●
Their weights are doubled.
This will cause biased
estimates.
s11
A ●
B ●
C ●
D
●
s13
s12
A ●
B ●
C ●
D
●
A ●
C ●
B ●
D
●
Need to downweight the
over-counted partial states.
A ●
B ●
C ●
D
●
s0
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
27 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Over-counting problem
Over-counting some samples
Two trees (each 1/2)
A ●
B ●
C ●
D
●
Two copies of the same partial
state: s21 , s22
s31
A ●
B ●
C ●
D
●
A ●
C ●
B ●
D
●
s32
A ●
B ●
C ●
D
●
Two copies of the same full tree:
s21
s31 , s32
A ●
B ●
C ●
D
●
s22
A ●
B ●
C ●
D
●
s33
A ●
C ●
B ●
D
●
s23
A ●
C ●
B ●
D
●
Their weights are doubled.
This will cause biased
estimates.
s11
A ●
B ●
C ●
D
●
s13
s12
A ●
B ●
C ●
D
●
A ●
C ●
B ●
D
●
Need to downweight the
over-counted partial states.
A ●
B ●
C ●
D
●
s0
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
27 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Over-counting problem
Over-counting some samples
Two trees (each 1/2)
A ●
B ●
C ●
D
●
A ●
C ●
B ●
D
●
s31 + s32
Two copies of the same partial
state: s21 , s22
s33
A ●
B ●
C ●
D
●
Two copies of the same full tree:
A ●
C ●
B ●
D
●
s21 + s22
s31 , s32
s23
A ●
B ●
C ●
D
●
A ●
C ●
B ●
D
●
Their weights are doubled.
This will cause biased
estimates.
s11
A ●
B ●
C ●
D
●
s13
s12
A ●
B ●
C ●
D
●
A ●
C ●
B ●
D
●
Need to downweight the
over-counted partial states.
A ●
B ●
C ●
D
●
s0
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
27 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Over-counting problem
Our correct weight update (in the CSMC algorithm)
wr (sr ) = wr−1 (sr−1 ) ·
γr (sr ) ν (sr → sr−1 )
γr−1 (sr−1 ) ν+ (sr−1 → sr )
The backward proposal ν− is
based on a graded partially ordered set (poset) on an extended
combinatorial space
1: if there is only one way from sr−1 to sr
between 0 and 1 if there are multiple ways from sr−1 to sr
to downweight the over-counted partial states.
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
28 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Convergence results
Convergence results
Under weak conditions, for any bounded real-valued function ϕ : Sr → R,
Strong Law of Large Numbers (SLLN)
 K

Z
X
 a.s.
lim  Wrk ϕ(srk ) − πr (dsr )ϕ(sr ) −→ 0,
K→∞
k=1
K : the number of weighted particles.
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
29 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Convergence results
Illustration of convergence: simulation
0.0 0.1 0.2 0.3 0.4 0.5
y-axis: Total variation distance of π̂ to π
x-axis: K (# particles)
●
●
1e+01
Liangliang Wang (Western University)
Poset−correction
Standard−SMC
●
●
●
1e+03
Bayesian Phylogenetics via SMC
●
1e+05
30 / 60
Combinatorial Sequential Monte Carlo (CSMC)
Experiment on tree inference
# leaves: 10
# sites: 1000
12
y-axis: Partition metric
x-axis: time (in log scale)
●
MB
SMC
0
2
100×: 2 orders of magnitude
●
4
Computationally faster
6
8
# datasets: 1000
1e+02
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
●
●
1e+04
●
●
●
1e+06
31 / 60
Particle MCMC
1
Background
2
Combinatorial Sequential Monte Carlo (CSMC)
3
Particle MCMC
4
Ongoing and Future Work
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
32 / 60
Particle MCMC
Particle MCMC
Inferring both the tree and the evolutionary parameter jointly
π(θ, t|Y)
Particle MCMC (Andrieu et al. 2010)
Each MCMC iteration uses our proposed CSMC algorithm to
approximate the posterior distribution of the phylogenetic tree
Particle marginal Metropolis-Hastings (PMMH)
Particle Independent Metropolis-Hastings (PIMH)
The Particle Gibbs sampler (PGS)
requires a special SMC algorithm, conditional SMC.
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
33 / 60
Particle MCMC
Particle MCMC
Particle MCMC
Advantage
Bolder and more efficient move to update t.
Convergence result
These algorithms converges to the true posterior. (Andrieu et al. 2010)
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
34 / 60
Particle MCMC
Comparing the standard MCMC and the particle MCMC
Cartoon: problem with standard MCMC
Assume θ can only take two
values: θ1 , θ2
Two possible trees: t1 , t2
At each iteration of MCMC, the
chain is at one of 4 states:
(θ1 , t1 ), (θ1 , t2 ), (θ2 , t1 ), (θ2 , t2 )
The square is a joint distribution
A good Markov chain should
move quickly among the states
with high probability mass
Liangliang Wang (Western University)
t1
θ1
θ2
0.1
0.5
0.3
0.1
A ●
B ●
C ●
D
●
t2
A ●
B ●
C ●
D
●
Bayesian Phylogenetics via SMC
35 / 60
Particle MCMC
Comparing the standard MCMC and the particle MCMC
Cartoon: problem with standard MCMC
Assume θ can only take two
values: θ1 , θ2
Two possible trees: t1 , t2
At each iteration of MCMC, the
chain is at one of 4 states:
(θ1 , t1 ), (θ1 , t2 ), (θ2 , t1 ), (θ2 , t2 )
The square is a joint distribution
A good Markov chain should
move quickly among the states
with high probability mass
Liangliang Wang (Western University)
t1
θ1
θ2
0.1
0.5
0.3
0.3
0.1
A ●
B ●
C ●
D
●
t2
A ●
B ●
C ●
D
●
Bayesian Phylogenetics via SMC
35 / 60
Particle MCMC
Comparing the standard MCMC and the particle MCMC
Cartoon: problem with standard MCMC
Assume θ can only take two
values: θ1 , θ2
Two possible trees: t1 , t2
At each iteration of MCMC, the
chain is at one of 4 states:
(θ1 , t1 ), (θ1 , t2 ), (θ2 , t1 ), (θ2 , t2 )
The square is a joint distribution
A good Markov chain should
move quickly among the states
with high probability mass
Liangliang Wang (Western University)
t1
θ1
θ2
0.1
0.5
0.3
0.3
0.1
A ●
B ●
C ●
D
●
t2
A ●
B ●
C ●
D
●
Bayesian Phylogenetics via SMC
35 / 60
Particle MCMC
Comparing the standard MCMC and the particle MCMC
Cartoon: problem with standard MCMC
Assume θ can only take two
values: θ1 , θ2
Two possible trees: t1 , t2
At each iteration of MCMC, the
chain is at one of 4 states:
(θ1 , t1 ), (θ1 , t2 ), (θ2 , t1 ), (θ2 , t2 )
The square is a joint distribution
A good Markov chain should
move quickly among the states
with high probability mass
Liangliang Wang (Western University)
t1
θ1
θ2
0.1
0.5
0.3
0.1
A ●
B ●
C ●
D
●
t2
A ●
B ●
C ●
D
●
Bayesian Phylogenetics via SMC
35 / 60
Particle MCMC
Comparing the standard MCMC and the particle MCMC
Advantage of using particle MCMC
t1
θ1
θ2
0.1
0.5
0.3
0.1
A ●
B ●
C ●
D
●
t2
A ●
B ●
C ●
D
●
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
36 / 60
Particle MCMC
Comparing the standard MCMC and the particle MCMC
Particle marginal Metropolis-Hastings (PMMH)
Each iteration of PMMH
1
sample θ∗ ∼ q(θ → ·),
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
37 / 60
Particle MCMC
Comparing the standard MCMC and the particle MCMC
Particle marginal Metropolis-Hastings (PMMH)
Each iteration of PMMH
1
sample θ∗ ∼ q(θ → ·),
2
run our SMC algorithm targeting πθ∗ (t|Y), sample t∗ ∼ π̂θ∗ (·|Y), and
P̂θ∗ (Y) is the marginal likelihood obtained from SMC.
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
37 / 60
Particle MCMC
Comparing the standard MCMC and the particle MCMC
Particle marginal Metropolis-Hastings (PMMH)
Each iteration of PMMH
1
2
3
sample θ∗ ∼ q(θ → ·),
run our SMC algorithm targeting πθ∗ (t|Y), sample t∗ ∼ π̂θ∗ (·|Y), and
P̂θ∗ (Y) is the marginal likelihood obtained from SMC.
Accept θ∗ and t∗ with the probability
!
P̂θ∗ (Y)p(θ∗ ) q{θ∗ → θ}
min 1,
.
P̂θ (Y)p(θ) q{θ → θ∗ }
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
37 / 60
Particle MCMC
Comparing the standard MCMC and the particle MCMC
The Particle Gibbs sampler (PGS)
For each iteration
Sample θ∗ ∼ p(·|t)
Run the conditional CSMC algorithm targeting πθ∗ (t|Y) conditional on
t and its ancestral lineage.
Sample t∗ ∼ π̂θ∗ (·|Y).
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
38 / 60
Particle MCMC
Estimation of the parameters with PMMH
100
y-axis: Coverage probability
x-axis: Credible intervals
●
Using 100 datasets
Averaged estimate: 1.99
●
●
●
●
●
0
Standard deviation: 0.25
●
●
20
True value: θ = 2
Coverage probability
40
60
80
●
0
Liangliang Wang (Western University)
20
Bayesian Phylogenetics via SMC
40
60
Credible intervals
80
100
39 / 60
Ongoing and Future Work
1
Background
2
Combinatorial Sequential Monte Carlo (CSMC)
3
Particle MCMC
4
Ongoing and Future Work
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
40 / 60
Ongoing and Future Work
Ongoing and Future Work
Harnessing non-Local evolutionary events for tree inference
Slipped strand mispairing (SSM)
Joint estimation of Multiple Sequence Alignment (MSA) and
phylogeny
Inferring large scale trees on Graphics Processing Units (GPUs)
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
41 / 60
Ongoing and Future Work
A General Evolutionary Model for Phylogenetics
An Example of Evolutionary Events
Slipped Strand Mispairing (SSM) ⇒ long indels that depend on their contexts
0
G
T1
T2 A
!:
A
T
T
C
G
T
C
G
T
G
G A T C
C
Starting sequence
Point deletion: G
Point mutation: A->G
T3
SSM insertion: (G T)
T
C
T4
Point insertion: A
G T G T A C
T5
SSM deletion: (G T)
G
T6
T
A
C
v1 :
Liangliang Wang (Western University)
G T
A
C
Bayesian Phylogenetics via SMC
Ending sequence
42 / 60
Ongoing and Future Work
A General Evolutionary Model for Phylogenetics
String-valued Continuous Time Markov Chain (CTMC)
This process is parametrized by the rate of departing from s, λ(s), and
the jumping distribution, J(s → ·).
Waiting time at s: t ∼ Exp(λ(s)); λi = λ(si ).
Strings
s4 G T G T A C
s3 G T G T C
s5 G
T
A
C
s0 G A
T
C
s1 A
T
C
s2 G T
C
J(s2->s3)
!2exp(-!2t2)
0
T1
Liangliang Wang (Western University)
T2
T3
T4 T5
Bayesian Phylogenetics via SMC
T=T6 Branch length
43 / 60
Ongoing and Future Work
Summary
Summary
A combinatorial SMC method
Applicable to Bayesian inference
in combinatorial spaces
Converges to the true posterior
asymptotically
Computationally fast
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
44 / 60
Ongoing and Future Work
Summary
Summary
A combinatorial SMC method
Particle MCMC
Using the proposed SMC within
MCMC iterations
The Markov chain can explore
the combinatorial space
efficiently
Accurate estimate of the
parameters
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
44 / 60
Ongoing and Future Work
Summary
Summary
A combinatorial SMC method
Harnessing non-Local
evolutionary events for tree
inference
Particle MCMC
Joint estimation of MSA and
phylogeny
Future work
Inferring large scale trees on
GPUs
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
44 / 60
Acknowledgements
Co-supervisors
Dr. Alexandre Bouchard-Côté
Dr. Arnaud Doucet
Thank you!
The Natural Sciences and Engineering Research Council of
Canada (NSERC).
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
45 / 60
Bibliography
Andrieu, C., A. Doucet, and R. Holenstein (2010). Particle Markov chain Monte Carlo
methods. J. R. Statist. Soc. B 72(3), 269–342.
Bouchard-Côté, A., S. Sankararaman, and M. I. Jordan (2011). Phylogenetic inference via
sequential Monte Carlo. Systematic Biology.
Teh, Y. W., H. Daumé III, and D. M. Roy (2008). Bayesian agglomerative clustering with
coalescents. In Advances in Neural Information Processing Systems (NIPS).
Liangliang Wang (Western University)
Bayesian Phylogenetics via SMC
46 / 60
Download