Phylogenetic Tree Construction using Sequential Monte Carlo Algorithms on Posets Liangliang Wang Western University, Canada Joint work with Alexandre Bouchard-Côté Arnaud Doucet Recent Advances in SMC Sep 19-21, 2012 Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 1 / 60 Outline 1 Background 2 Combinatorial Sequential Monte Carlo (CSMC) 3 Particle MCMC 4 Ongoing and Future Work Background Phylogenetics Example: what are the evolutionary relationships among these Cichlid fishes? (a) Chalinochromis popelini (d) Lepidiolamprologus (b) Julidochromis marlieri (e) Neolamprologus brichardi (f) Neolamprologus elongatus Liangliang Wang (Western University) (c) Lamprologus callipterus (g) Telmatochromis temporalis tetracanthus Bayesian Phylogenetics via SMC 3 / 60 Background Data: Y Biological sequences of a set of species e.g. a DNA sequence is a string of characters from the set of four nucleotides {A, C, G, T }. Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 4 / 60 Background Data: Y Biological sequences of a set of species e.g. a DNA sequence is a string of characters from the set of four nucleotides {A, C, G, T }. An example of aligned DNA sequences. Nucleotides in the same column were obtained from a shared ancestral nucleotide A CTCTAGCCTTTTTCCACT B TTCTAGCCTTTCTCTACT C CTCTAGCCTTTCTCTACT Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 4 / 60 Background A rooted phylogenetic tree, t, and evolution ● Root: a common ancestor; ● ● ● ● AGTC.. ● Bayesian Phylogenetics via SMC AGAC.. AGAC.. Liangliang Wang (Western University) ● ● ATTT.. AGAC.. ● 5 / 60 Background A rooted phylogenetic tree, t, and evolution ● Root: a common ancestor; ● Internal nodes ● ● ● AGTC.. ● Bayesian Phylogenetics via SMC AGAC.. AGAC.. Liangliang Wang (Western University) ● ● ATTT.. AGAC.. ● 5 / 60 Background A rooted phylogenetic tree, t, and evolution ● Root: a common ancestor; ● Internal nodes Branch lengths ● β5 β4 β3 β6 ● β7 β8 ● ● ATTT.. ● AGAC.. AGAC.. Bayesian Phylogenetics via SMC ● AGTC.. ● AGAC.. positive real numbers associated with each edge, specifying the amount of evolution between nodes. Liangliang Wang (Western University) β2 β1 5 / 60 Background A likelihood Model: CTMC A likelihood model: P(Y|t, θ) Assumption: site independence. A CTCTAGCCTTTTTCCACT B TTCTAGCCTTTCTCTACT C CTCTAGCCTTTCTCTACT One site Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 6 / 60 Background A likelihood Model: CTMC A likelihood model: P(Y|t, θ) Assumption: site independence. Likelihood model on each site over one branch is a Continuous Time Markov Chain (CTMC): {Y s : s ∈ [0, b]} The state space of the chain: Y s ∈ {A, C, G, T }. Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 6 / 60 Background A likelihood Model: CTMC A likelihood model: P(Y|t, θ) Assumption: site independence. Likelihood model on each site over one branch is a Continuous Time Markov Chain (CTMC): {Y s : s ∈ [0, b]} The state space of the chain: Y s ∈ {A, C, G, T }. Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 6 / 60 Background A likelihood Model: CTMC A likelihood model: P(Y|t, θ) Assumption: site independence. Likelihood model on each site over one branch is a Continuous Time Markov Chain (CTMC): {Y s : s ∈ [0, b]} The state space of the chain: Y s ∈ {A, C, G, T }. The transition matrix: P(b) = eQ4×4 b for the branch of length b. e.g. P1,3 (b) = P(Yb = G|Y0 = A). Liangliang Wang (Western University) πC κπG πT π πG κπT A Q = πT κπA πC πA κπC πG - Bayesian Phylogenetics via SMC 6 / 60 Background A likelihood Model: CTMC A likelihood model: P(Y|t, θ) Assumption: site independence. Likelihood model on each site over one branch is a Continuous Time Markov Chain (CTMC): {Y s : s ∈ [0, b]} The state space of the chain: Y s ∈ {A, C, G, T }. The transition matrix: P(b) = eQ4×4 b for the branch of length b. e.g. P1,3 (b) = P(Yb = G|Y0 = A). Evolutionary parameters in CTMCs are denoted by θ. Liangliang Wang (Western University) πC κπG πT π πG κπT A Q = πT κπA πC πA κπC πG - Bayesian Phylogenetics via SMC 6 / 60 Background Bayesian phylogenetics Bayesian phylogenetics Data: aligned sequences, denoted by Y θ: evolutionary parameters t: a phylogenetic tree Posterior π(θ, t|Y) = Liangliang Wang (Western University) P(Y|t, θ)p(t|θ)p(θ) P(Y) Bayesian Phylogenetics via SMC 7 / 60 Background Posterior expectation of ϕ(t): Bayesian phylogenetics Z π(dt)ϕ(t) An example of the function ϕ: ϕ(t) = 1(c ∈ clades(t)) V1 a clade: a group consisting of a species and all its descendants clades(t): all the clades of the tree t Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 8 / 60 Background Bayesian phylogenetics Difficult inference problem over a huge tree space A B C Liangliang Wang (Western University) A C B Bayesian Phylogenetics via SMC C B A 9 / 60 Background Bayesian phylogenetics Difficult inference problem over a huge tree space A B C A C B C B A Tree space for a phylogenetic tree #Species 3 4 6 10 Liangliang Wang (Western University) #Topologies 3 15 945 34459425 Bayesian Phylogenetics via SMC 9 / 60 Background Bayesian phylogenetics Standard Bayesian phylogenetics using MCMC MCMC: obtain samples tk ∼ π(·|Y), k = 1, · · · , K ...... Z Liangliang Wang (Western University) K 1X π(dt)ϕ(t) ≈ ϕ(tk ) K k=1 Bayesian Phylogenetics via SMC 10 / 60 Background Bayesian phylogenetics Problems with MCMC The Markov chain doesn’t explore the tree space well Only small moves are allowed in each iteration Each step is expensive to compute MCMC does not scale to large datasets a large number of taxa large amount of data for each taxon Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 11 / 60 Background Bayesian phylogenetics The Ultimate Goal in Phylogenetics Infer phylogenetic trees accurately and efficiently Develop new statistical evolutionary models Computational algorithms for efficient analysis of large-scale datasets Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 12 / 60 Background Bayesian phylogenetics The Ultimate Goal in Phylogenetics Infer phylogenetic trees accurately and efficiently Develop new statistical evolutionary models Current model: CTMC over characters for each site. Proposed model: a general string-valued CTMC for biological sequences. Computational algorithms for efficient analysis of large-scale datasets Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 12 / 60 Background Bayesian phylogenetics The Ultimate Goal in Phylogenetics Infer phylogenetic trees accurately and efficiently Develop new statistical evolutionary models Current model: CTMC over characters for each site. Proposed model: a general string-valued CTMC for biological sequences. Computational algorithms for efficient analysis of large-scale datasets Current methods: Standard MCMC SMC for unrealistic phylogenetic trees (Teh et al. 2008; Bouchard-Côté et al. 2011) for fixed parameters θ Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 12 / 60 Background Bayesian phylogenetics The Ultimate Goal in Phylogenetics Infer phylogenetic trees accurately and efficiently Develop new statistical evolutionary models Current model: CTMC over characters for each site. Proposed model: a general string-valued CTMC for biological sequences. Computational algorithms for efficient analysis of large-scale datasets Current methods: Standard MCMC SMC for unrealistic phylogenetic trees (Teh et al. 2008; Bouchard-Côté et al. 2011) for fixed parameters θ Proposed methods: An efficient SMC algorithm for general phylogenetic trees Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 12 / 60 Background Bayesian phylogenetics The Ultimate Goal in Phylogenetics Infer phylogenetic trees accurately and efficiently Develop new statistical evolutionary models Current model: CTMC over characters for each site. Proposed model: a general string-valued CTMC for biological sequences. Computational algorithms for efficient analysis of large-scale datasets Current methods: Standard MCMC SMC for unrealistic phylogenetic trees (Teh et al. 2008; Bouchard-Côté et al. 2011) for fixed parameters θ Proposed methods: An efficient SMC algorithm for general phylogenetic trees PMCMC for joint estimation of t and θ Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 12 / 60 Combinatorial Sequential Monte Carlo (CSMC) 1 Background 2 Combinatorial Sequential Monte Carlo (CSMC) 3 Particle MCMC 4 Ongoing and Future Work Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 13 / 60 Combinatorial Sequential Monte Carlo (CSMC) SMC algorithm for phylogenetic trees Target distribution: the posterior π(t|Y) ∝ γ(t|Y) Z = P(Y|t)p(t) Interested in the posterior expectation of ϕ(t): Liangliang Wang (Western University) Bayesian Phylogenetics via SMC π(dt)ϕ(t). 14 / 60 Combinatorial Sequential Monte Carlo (CSMC) SMC algorithm for phylogenetic trees Target distribution: the posterior π(t|Y) ∝ γ(t|Y) Z = P(Y|t)p(t) Interested in the posterior expectation of ϕ(t): Input Y Aligned biological sequences Liangliang Wang (Western University) π(dt)ϕ(t). A CTCTAGCCTTTTTCCACT B TTCTAGCCTTTCTCTACT C CTCTAGCCTTTCTCTACT D TTCTAGCCTTTTTCTACT Bayesian Phylogenetics via SMC 14 / 60 Combinatorial Sequential Monte Carlo (CSMC) SMC algorithm for phylogenetic trees Target distribution: the posterior π(t|Y) ∝ γ(t|Y) Z = P(Y|t)p(t) Interested in the posterior expectation of ϕ(t): Input Y Aligned biological sequences π(dt)ϕ(t). A CTCTAGCCTTTTTCCACT B TTCTAGCCTTTCTCTACT C CTCTAGCCTTTCTCTACT D TTCTAGCCTTTTTCTACT Output: weighted particles {(tk , Wk )} to approximate the posterior distribution over trees, π̂(t|Y) t1 B ● C● D● A ● t2 BADC ●●●● t3 t4 A ● C ● B ● D ● t5 t6 B ● A ● D ● C ● B ● A ● D ● C ● B ● A ● D ● C ● t7 t8 t9 B ● A ● D ● C ● B ● A ● C ● D ● B ● D ● A ● C ● t10 B● D● A● C ● estimate of the marginal likelihood, P̂(Y). Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 14 / 60 Combinatorial Sequential Monte Carlo (CSMC) A sequence of partial states Using ν+ (s0 → t) is not efficient sr : a partial state (forest) of n − r subtrees (n is the number of species) A forward proposal ν+ (sr−1 → sr ): randomly choose a pair of subtrees of sr−1 to merge. Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 15 / 60 Combinatorial Sequential Monte Carlo (CSMC) How to define distributions πr over partial states sr ? πr (sr |Y) ∝ P(Y|sr )p(sr ) P(Y|sr ): the likelihood of the partial state sr we have likelihood model for trees consider the trees in a forest to be independent the product of the likelihood of the subtrees of sr . πr is represented by K weighted particles, {(srk , Wrk )} Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 16 / 60 Combinatorial Sequential Monte Carlo (CSMC) Illustration of SMC The initial partial state s0 : a forest in which each sequence is a trivial tree with a single leaf. s0 Liangliang Wang (Western University) A● B● C● D ● Bayesian Phylogenetics via SMC 17 / 60 Combinatorial Sequential Monte Carlo (CSMC) Illustration of SMC The first partial state s1 Generate particle s11 using ν+ (s0 → ·) Randomly choose a pair of subtrees, species B and C, to merge. s1,1 B● C● D● A ● s0 Liangliang Wang (Western University) A● B● C● D ● Bayesian Phylogenetics via SMC 18 / 60 Combinatorial Sequential Monte Carlo (CSMC) Illustration of SMC The first partial state s1 Generate particle s12 using ν+ (s0 → ·) Randomly choose a pair of subtrees, species D and C, to merge. s1,1 s1,2 B● C● D● A ● B● A● D● C ● s0 Liangliang Wang (Western University) A● B● C● D ● Bayesian Phylogenetics via SMC 19 / 60 Combinatorial Sequential Monte Carlo (CSMC) Illustration of SMC The first partial state s1 Generate K particles s1k using ν+ (s0 → ·) These particles cannot represent π1 directly We need to compensate for the discrepancy between the distribution of interest and the proposed distribution. s1 s1,1 s1,2 s1,3 s1,4 s1,5 s1,6 s1,7 s1,8 s1,9 s1,10 B● C● D● A ● B● A● D● C ● A● C● B● D ● A● D● B● C ● B● A● D● C ● A● B● D● C ● C● D● B● A ● B● A● C● D ● B● D● A● C ● A● B● D● C ● s0 Liangliang Wang (Western University) A● B● C● D ● Bayesian Phylogenetics via SMC 20 / 60 Combinatorial Sequential Monte Carlo (CSMC) Illustration of SMC Update the weight of particles s1k The rectangle size corresponds to the normalized particle weight W1 (s1k ). s1,1 s1 B ● C ● D ● A ● s1,2 B ADC ●●●● s1,3 ACBD ●●●● s1,4 ADBC ● ●●● s1,5 B ● A ● D ● C ● s0 Liangliang Wang (Western University) s1,6 ABDC ● ●●● s1,7 C● D● B ● A ● s1,8 s1,9 B ● A ● C ● D ● B ● D ● A ● C ● s1,10 ABDC ● ●●● A● B● C● D ● Bayesian Phylogenetics via SMC 21 / 60 Combinatorial Sequential Monte Carlo (CSMC) Illustration of SMC Resample s1k Using a multinomial distribution Purpose: prune unpromising particles s1,1 s1 B ● C ● D ● A ● s1,2 B ADC ●●●● s1,3 ACBD ●●●● s1,4 ADBC ● ●●● s1,5 B ● A ● D ● C ● s0 Liangliang Wang (Western University) s1,6 ABDC ● ●●● s1,7 C● D● B ● A ● s1,8 s1,9 B ● A ● C ● D ● B ● D ● A ● C ● s1,10 ABDC ● ●●● A● B● C● D ● Bayesian Phylogenetics via SMC 22 / 60 Combinatorial Sequential Monte Carlo (CSMC) Illustration of SMC The second partial state s2 Generate the particles s2k using ν+ (s1k → ·) Update the weights of the particles W2 (s2k ) s2,1 s2 B ● C ● D ● A ● s1,1 s1 B ● C ● D ● A ● s2,2 s2,3 B ADC ●●●● A ● C ● B ● D ● s1,2 s1,3 B ADC ●●●● ACBD ●●●● s2,4 s2,5 B ● A ● D ● C ● B ● A ● D ● C ● s1,4 ADBC ● ●●● s1,5 Liangliang Wang (Western University) s2,7 B● A● D● C ● C● D● B● A ● s1,6 B ● A ● D ● C ● s0 s2,6 ABDC ● ●●● s1,7 C● D● B ● A ● s2,8 s2,9 B ● A ● C ● D ● B ● D ● A ● C ● s1,8 s1,9 B ● A ● C ● D ● B ● D ● A ● C ● s2,10 B ● D ● A ● C ● s1,10 ABDC ● ●●● A● B● C● D ● Bayesian Phylogenetics via SMC 23 / 60 Combinatorial Sequential Monte Carlo (CSMC) Illustration of SMC Resample s2k Using a multinomial distribution. Purpose: prune unpromising particles s2,1 s2 B ● C ● D ● A ● s1,1 s1 B ● C ● D ● A ● s2,2 s2,3 B ADC ●●●● A ● C ● B ● D ● s1,2 s1,3 B ADC ●●●● ACBD ●●●● s2,4 s2,5 B ● A ● D ● C ● B ● A ● D ● C ● s1,4 ADBC ● ●●● s1,5 Liangliang Wang (Western University) s2,7 B● A● D● C ● C● D● B● A ● s1,6 B ● A ● D ● C ● s0 s2,6 ABDC ● ●●● s1,7 C● D● B ● A ● s2,8 s2,9 B ● A ● C ● D ● B ● D ● A ● C ● s1,8 s1,9 B ● A ● C ● D ● B ● D ● A ● C ● s2,10 B ● D ● A ● C ● s1,10 ABDC ● ●●● A● B● C● D ● Bayesian Phylogenetics via SMC 24 / 60 Combinatorial Sequential Monte Carlo (CSMC) Illustration of SMC The final state (full tree) s3,1 s3 BCDA ●●●● s2,1 s2 B ● C ● D ● A ● s1,1 s1 B ● C ● D ● A ● s3,2 BADC ●●●● s2,2 s3,3 A● C● B● D ● s2,3 B ADC ●●●● A ● C ● B ● D ● s1,2 s1,3 B ADC ●●●● ACBD ●●●● s3,4 B ● A ● D ● C ● s3,5 s3,6 B● A● D● C ● B● A● D● C ● s2,4 s2,5 B ● A ● D ● C ● B ● A ● D ● C ● s1,4 ADBC ● ●●● s1,5 s0 Liangliang Wang (Western University) s3,8 s3,9 B ● A ● D ● C ● B ● A ● C ● D ● B ● D ● A ● C ● s2,6 s2,7 B● A● D● C ● C● D● B● A ● s1,6 B ● A ● D ● C ● s3,7 ABDC ● ●●● s1,7 C● D● B ● A ● s2,8 s2,9 B ● A ● C ● D ● B ● D ● A ● C ● s1,8 s1,9 B ● A ● C ● D ● B ● D ● A ● C ● s3,10 BDAC ●●●● s2,10 B ● D ● A ● C ● s1,10 ABDC ● ●●● A● B● C● D ● Bayesian Phylogenetics via SMC 25 / 60 Combinatorial Sequential Monte Carlo (CSMC) Weight update The weight update in a standard SMC wr (sr ) = wr−1 (sr−1 ) · γr (sr ) 1 γr−1 (sr−1 ) ν+ (sr−1 → sr ) γr (sr ) : unnormalized density of sr γr−1 (sr−1 ) : unnormalized density of sr−1 ν+ : forward proposal Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 26 / 60 Combinatorial Sequential Monte Carlo (CSMC) Weight update The weight update in a standard SMC wr (sr ) = wr−1 (sr−1 ) · γr (sr ) 1 γr−1 (sr−1 ) ν+ (sr−1 → sr ) γr (sr ) : unnormalized density of sr γr−1 (sr−1 ) : unnormalized density of sr−1 ν+ : forward proposal This weight update of the standard SMC will lead to a biased estimate for general phylogenetic trees due to an over-counting problem. Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 26 / 60 Combinatorial Sequential Monte Carlo (CSMC) Weight update The weight update in a standard SMC wr (sr ) = wr−1 (sr−1 ) · γr (sr ) 1 γr−1 (sr−1 ) ν+ (sr−1 → sr ) WRONG for general trees! γr (sr ) : unnormalized density of sr γr−1 (sr−1 ) : unnormalized density of sr−1 ν+ : forward proposal This weight update of the standard SMC will lead to a biased estimate for general phylogenetic trees due to an over-counting problem. Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 26 / 60 Combinatorial Sequential Monte Carlo (CSMC) Over-counting problem Over-counting some samples Two trees (each 1/2) A ● B ● C ● D ● Liangliang Wang (Western University) Bayesian Phylogenetics via SMC A ● C ● B ● D ● 27 / 60 Combinatorial Sequential Monte Carlo (CSMC) Over-counting problem Over-counting some samples Two trees (each 1/2) A ● B ● C ● D ● Two copies of the same partial state: s21 , s22 Two copies of the same full tree: s31 , s32 s31 A ● B ● C ● D ● s21 A ● B ● C ● D ● s11 A ● B ● C ● D ● A ● C ● B ● D ● s32 s33 A ● B ● C ● D ● A ● C ● B ● D ● s23 s22 A ● B ● C ● D ● A ● C ● B ● D ● s13 s12 A ● B ● C ● D ● A ● C ● B ● D ● A ● B ● C ● D ● s0 Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 27 / 60 Combinatorial Sequential Monte Carlo (CSMC) Over-counting problem Over-counting some samples Two trees (each 1/2) A ● B ● C ● D ● A ● C ● B ● D ● s31 + s32 Two copies of the same partial state: s21 , s22 A ● C ● A ● C ● B ● D ● D ● s21 + s22 Two copies of the same full tree: s31 , s32 A ● Their weights are doubled. This will cause biased estimates. B ● s33 s11 A ● B ● C ● D ● B ● C ● s23 A ● C ● B ● D ● D ● s13 s12 A ● B ● C ● D ● A ● C ● B ● D ● A ● B ● C ● D ● s0 Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 27 / 60 Combinatorial Sequential Monte Carlo (CSMC) Over-counting problem Over-counting some samples Two trees (each 1/2) A ● B ● C ● D ● A ● C ● B ● D ● s31 + s32 Two copies of the same partial state: s21 , s22 s33 A ● B ● C ● D ● Two copies of the same full tree: A ● C ● B ● D ● s21 + s22 s31 , s32 s23 A ● B ● C ● D ● A ● C ● B ● D ● Their weights are doubled. This will cause biased estimates. s11 A ● B ● C ● D ● s13 s12 A ● B ● C ● D ● A ● C ● B ● D ● Need to downweight the over-counted partial states. A ● B ● C ● D ● s0 Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 27 / 60 Combinatorial Sequential Monte Carlo (CSMC) Over-counting problem Over-counting some samples Two trees (each 1/2) A ● B ● C ● D ● Two copies of the same partial state: s21 , s22 s31 A ● B ● C ● D ● A ● C ● B ● D ● s32 A ● B ● C ● D ● Two copies of the same full tree: s21 s31 , s32 A ● B ● C ● D ● s22 A ● B ● C ● D ● s33 A ● C ● B ● D ● s23 A ● C ● B ● D ● Their weights are doubled. This will cause biased estimates. s11 A ● B ● C ● D ● s13 s12 A ● B ● C ● D ● A ● C ● B ● D ● Need to downweight the over-counted partial states. A ● B ● C ● D ● s0 Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 27 / 60 Combinatorial Sequential Monte Carlo (CSMC) Over-counting problem Over-counting some samples Two trees (each 1/2) A ● B ● C ● D ● A ● C ● B ● D ● s31 + s32 Two copies of the same partial state: s21 , s22 s33 A ● B ● C ● D ● Two copies of the same full tree: A ● C ● B ● D ● s21 + s22 s31 , s32 s23 A ● B ● C ● D ● A ● C ● B ● D ● Their weights are doubled. This will cause biased estimates. s11 A ● B ● C ● D ● s13 s12 A ● B ● C ● D ● A ● C ● B ● D ● Need to downweight the over-counted partial states. A ● B ● C ● D ● s0 Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 27 / 60 Combinatorial Sequential Monte Carlo (CSMC) Over-counting problem Our correct weight update (in the CSMC algorithm) wr (sr ) = wr−1 (sr−1 ) · γr (sr ) ν (sr → sr−1 ) γr−1 (sr−1 ) ν+ (sr−1 → sr ) The backward proposal ν− is based on a graded partially ordered set (poset) on an extended combinatorial space 1: if there is only one way from sr−1 to sr between 0 and 1 if there are multiple ways from sr−1 to sr to downweight the over-counted partial states. Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 28 / 60 Combinatorial Sequential Monte Carlo (CSMC) Convergence results Convergence results Under weak conditions, for any bounded real-valued function ϕ : Sr → R, Strong Law of Large Numbers (SLLN) K Z X a.s. lim Wrk ϕ(srk ) − πr (dsr )ϕ(sr ) −→ 0, K→∞ k=1 K : the number of weighted particles. Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 29 / 60 Combinatorial Sequential Monte Carlo (CSMC) Convergence results Illustration of convergence: simulation 0.0 0.1 0.2 0.3 0.4 0.5 y-axis: Total variation distance of π̂ to π x-axis: K (# particles) ● ● 1e+01 Liangliang Wang (Western University) Poset−correction Standard−SMC ● ● ● 1e+03 Bayesian Phylogenetics via SMC ● 1e+05 30 / 60 Combinatorial Sequential Monte Carlo (CSMC) Experiment on tree inference # leaves: 10 # sites: 1000 12 y-axis: Partition metric x-axis: time (in log scale) ● MB SMC 0 2 100×: 2 orders of magnitude ● 4 Computationally faster 6 8 # datasets: 1000 1e+02 Liangliang Wang (Western University) Bayesian Phylogenetics via SMC ● ● 1e+04 ● ● ● 1e+06 31 / 60 Particle MCMC 1 Background 2 Combinatorial Sequential Monte Carlo (CSMC) 3 Particle MCMC 4 Ongoing and Future Work Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 32 / 60 Particle MCMC Particle MCMC Inferring both the tree and the evolutionary parameter jointly π(θ, t|Y) Particle MCMC (Andrieu et al. 2010) Each MCMC iteration uses our proposed CSMC algorithm to approximate the posterior distribution of the phylogenetic tree Particle marginal Metropolis-Hastings (PMMH) Particle Independent Metropolis-Hastings (PIMH) The Particle Gibbs sampler (PGS) requires a special SMC algorithm, conditional SMC. Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 33 / 60 Particle MCMC Particle MCMC Particle MCMC Advantage Bolder and more efficient move to update t. Convergence result These algorithms converges to the true posterior. (Andrieu et al. 2010) Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 34 / 60 Particle MCMC Comparing the standard MCMC and the particle MCMC Cartoon: problem with standard MCMC Assume θ can only take two values: θ1 , θ2 Two possible trees: t1 , t2 At each iteration of MCMC, the chain is at one of 4 states: (θ1 , t1 ), (θ1 , t2 ), (θ2 , t1 ), (θ2 , t2 ) The square is a joint distribution A good Markov chain should move quickly among the states with high probability mass Liangliang Wang (Western University) t1 θ1 θ2 0.1 0.5 0.3 0.1 A ● B ● C ● D ● t2 A ● B ● C ● D ● Bayesian Phylogenetics via SMC 35 / 60 Particle MCMC Comparing the standard MCMC and the particle MCMC Cartoon: problem with standard MCMC Assume θ can only take two values: θ1 , θ2 Two possible trees: t1 , t2 At each iteration of MCMC, the chain is at one of 4 states: (θ1 , t1 ), (θ1 , t2 ), (θ2 , t1 ), (θ2 , t2 ) The square is a joint distribution A good Markov chain should move quickly among the states with high probability mass Liangliang Wang (Western University) t1 θ1 θ2 0.1 0.5 0.3 0.3 0.1 A ● B ● C ● D ● t2 A ● B ● C ● D ● Bayesian Phylogenetics via SMC 35 / 60 Particle MCMC Comparing the standard MCMC and the particle MCMC Cartoon: problem with standard MCMC Assume θ can only take two values: θ1 , θ2 Two possible trees: t1 , t2 At each iteration of MCMC, the chain is at one of 4 states: (θ1 , t1 ), (θ1 , t2 ), (θ2 , t1 ), (θ2 , t2 ) The square is a joint distribution A good Markov chain should move quickly among the states with high probability mass Liangliang Wang (Western University) t1 θ1 θ2 0.1 0.5 0.3 0.3 0.1 A ● B ● C ● D ● t2 A ● B ● C ● D ● Bayesian Phylogenetics via SMC 35 / 60 Particle MCMC Comparing the standard MCMC and the particle MCMC Cartoon: problem with standard MCMC Assume θ can only take two values: θ1 , θ2 Two possible trees: t1 , t2 At each iteration of MCMC, the chain is at one of 4 states: (θ1 , t1 ), (θ1 , t2 ), (θ2 , t1 ), (θ2 , t2 ) The square is a joint distribution A good Markov chain should move quickly among the states with high probability mass Liangliang Wang (Western University) t1 θ1 θ2 0.1 0.5 0.3 0.1 A ● B ● C ● D ● t2 A ● B ● C ● D ● Bayesian Phylogenetics via SMC 35 / 60 Particle MCMC Comparing the standard MCMC and the particle MCMC Advantage of using particle MCMC t1 θ1 θ2 0.1 0.5 0.3 0.1 A ● B ● C ● D ● t2 A ● B ● C ● D ● Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 36 / 60 Particle MCMC Comparing the standard MCMC and the particle MCMC Particle marginal Metropolis-Hastings (PMMH) Each iteration of PMMH 1 sample θ∗ ∼ q(θ → ·), Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 37 / 60 Particle MCMC Comparing the standard MCMC and the particle MCMC Particle marginal Metropolis-Hastings (PMMH) Each iteration of PMMH 1 sample θ∗ ∼ q(θ → ·), 2 run our SMC algorithm targeting πθ∗ (t|Y), sample t∗ ∼ π̂θ∗ (·|Y), and P̂θ∗ (Y) is the marginal likelihood obtained from SMC. Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 37 / 60 Particle MCMC Comparing the standard MCMC and the particle MCMC Particle marginal Metropolis-Hastings (PMMH) Each iteration of PMMH 1 2 3 sample θ∗ ∼ q(θ → ·), run our SMC algorithm targeting πθ∗ (t|Y), sample t∗ ∼ π̂θ∗ (·|Y), and P̂θ∗ (Y) is the marginal likelihood obtained from SMC. Accept θ∗ and t∗ with the probability ! P̂θ∗ (Y)p(θ∗ ) q{θ∗ → θ} min 1, . P̂θ (Y)p(θ) q{θ → θ∗ } Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 37 / 60 Particle MCMC Comparing the standard MCMC and the particle MCMC The Particle Gibbs sampler (PGS) For each iteration Sample θ∗ ∼ p(·|t) Run the conditional CSMC algorithm targeting πθ∗ (t|Y) conditional on t and its ancestral lineage. Sample t∗ ∼ π̂θ∗ (·|Y). Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 38 / 60 Particle MCMC Estimation of the parameters with PMMH 100 y-axis: Coverage probability x-axis: Credible intervals ● Using 100 datasets Averaged estimate: 1.99 ● ● ● ● ● 0 Standard deviation: 0.25 ● ● 20 True value: θ = 2 Coverage probability 40 60 80 ● 0 Liangliang Wang (Western University) 20 Bayesian Phylogenetics via SMC 40 60 Credible intervals 80 100 39 / 60 Ongoing and Future Work 1 Background 2 Combinatorial Sequential Monte Carlo (CSMC) 3 Particle MCMC 4 Ongoing and Future Work Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 40 / 60 Ongoing and Future Work Ongoing and Future Work Harnessing non-Local evolutionary events for tree inference Slipped strand mispairing (SSM) Joint estimation of Multiple Sequence Alignment (MSA) and phylogeny Inferring large scale trees on Graphics Processing Units (GPUs) Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 41 / 60 Ongoing and Future Work A General Evolutionary Model for Phylogenetics An Example of Evolutionary Events Slipped Strand Mispairing (SSM) ⇒ long indels that depend on their contexts 0 G T1 T2 A !: A T T C G T C G T G G A T C C Starting sequence Point deletion: G Point mutation: A->G T3 SSM insertion: (G T) T C T4 Point insertion: A G T G T A C T5 SSM deletion: (G T) G T6 T A C v1 : Liangliang Wang (Western University) G T A C Bayesian Phylogenetics via SMC Ending sequence 42 / 60 Ongoing and Future Work A General Evolutionary Model for Phylogenetics String-valued Continuous Time Markov Chain (CTMC) This process is parametrized by the rate of departing from s, λ(s), and the jumping distribution, J(s → ·). Waiting time at s: t ∼ Exp(λ(s)); λi = λ(si ). Strings s4 G T G T A C s3 G T G T C s5 G T A C s0 G A T C s1 A T C s2 G T C J(s2->s3) !2exp(-!2t2) 0 T1 Liangliang Wang (Western University) T2 T3 T4 T5 Bayesian Phylogenetics via SMC T=T6 Branch length 43 / 60 Ongoing and Future Work Summary Summary A combinatorial SMC method Applicable to Bayesian inference in combinatorial spaces Converges to the true posterior asymptotically Computationally fast Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 44 / 60 Ongoing and Future Work Summary Summary A combinatorial SMC method Particle MCMC Using the proposed SMC within MCMC iterations The Markov chain can explore the combinatorial space efficiently Accurate estimate of the parameters Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 44 / 60 Ongoing and Future Work Summary Summary A combinatorial SMC method Harnessing non-Local evolutionary events for tree inference Particle MCMC Joint estimation of MSA and phylogeny Future work Inferring large scale trees on GPUs Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 44 / 60 Acknowledgements Co-supervisors Dr. Alexandre Bouchard-Côté Dr. Arnaud Doucet Thank you! The Natural Sciences and Engineering Research Council of Canada (NSERC). Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 45 / 60 Bibliography Andrieu, C., A. Doucet, and R. Holenstein (2010). Particle Markov chain Monte Carlo methods. J. R. Statist. Soc. B 72(3), 269–342. Bouchard-Côté, A., S. Sankararaman, and M. I. Jordan (2011). Phylogenetic inference via sequential Monte Carlo. Systematic Biology. Teh, Y. W., H. Daumé III, and D. M. Roy (2008). Bayesian agglomerative clustering with coalescents. In Advances in Neural Information Processing Systems (NIPS). Liangliang Wang (Western University) Bayesian Phylogenetics via SMC 46 / 60