Efficient Inference in Phylogenetic InDel Trees Alexandre Bouchard-Côté Michael I. Jordan Dan Klein Computer Science Division University of California at Berkeley c: C - - A A A T - G T C - - - C a a: :C C C application: homology modeling • Solution: always condition on the data G A C A T b: C c: A a: • Approach 1: heuristic A T • Experiments [8] C ? A b: C A ? G ? ? c: ? A T C ? ? ? d: d r: b C ...ACCGTGCGTGCT... • Not irreducible d: ? ? ? ? c [2] ...ACCCTGCGGTGCT... ...ACGCTGCGTGAT... d: ? a: C a ? ? .. .. A T A b: C A c: A c application: the origins of life T A T C T ...ACGCTGTTGTGAT... ...ACGGTGCTGTGCT.. character-valued! C A G G Del ...ACCCTGCGGTGCT... [1] Sub Sub Sub ...ACGCTGCGGTGAT... ...ACGTTGCGGTGAT... • Assuming we have an evolutionary InDel model, both tasks can be cast as an inference problem ...ACGCTGTTGTGAT... r: d: State augmentation: alignments ...ACCGTGCGATGCT... C C ...ACGGTGCTGTGCT.. A Ins Ins T A } } 4 4 • What is the state of the sampler? • Approach 2: naive Gibbs sampling 5 • Contiguous version? (AR) G [2] G C A C d b A C 3 a: 1 1 0 0 2 0 5000 !! derivation C 10000 • How should "vertical slices" be defined? C A 20000 25000 T C string-valued r.v. C T A C T C C C A T C r d C A G G r: A Del A A C T C C C A G A T C C C A r: C d: C A T A G A C r: d: a: a T C C Ins Ins T A a: • Next: how to sample? C c: A T C T C C Obtaining multi-alignments: d: C T C C A T a: C A T A C b: C C A T C T C C C A T C C A G Indexed by a span on an observed word (anchor) A T C C A G C A T C Legend A T C C Problem II: each step is expensive (cubic in the sequence length) C A G observed characters selected characters anchor A G C C A G C C A G C C anchor 10000 12000 CS 0.63 0.77 14000 .ktisktkgaprmPDVYTLPPSRD ..tiakvtvntfpPQVHLLPPPSE .tkltvlgqpkssPSVTLFPPSSE gtkleikr.tvaaPSVFIFPPSDE .ttvtvssasttaPSVYPLAPVSG gc2_cavpo alc_mouse 1yuh18000 16000 1mim 1nld 49 49 49 20000 49 49 asnrvvsekpeykntppieda... lhgn....eelspesylveplkep kvdgtpvtegmettqpskqs.... kvdnalqsgnsqesvteqdsk... nsg..slssgvhtfpavlqs.... gc2_cavpo alc_mouse 1yuh 1mim 1nld 92 95 91 92 87 SP 0.77 0.86 YTCSVMHealhnhvtqkaisrsp YSCMVGHealpmnftqktidrls YSCQVTHeg..htve.kslsra. YACEVTHqglsspvt.ksfnrg. ITCNVAHpasstkvdkkieprg. The taller the tree, the larger the gap: CS 0.63 0.77 100 SP 0.77 90 0.86 100 75 75 25 References 0 SSR 100 AR 90 50 80 50Galtier, N. Tourasse, and M. Gouy. A nonhyperthermophilic common ancestor to extant [1] N. life forms. Science, 283:220–221, 1999. 3 4 5 6 7 70 [2] I. Holmes and W. J. Bruno. Evolutionary HMM: a bayesian approach to multiple alignment. Bioinformatics, 17:803–820, 2001. [3] J. 25Felsenstein. Inferring Phylogenies. Sinauer Associates, 2003. 8 3 4 5 6 7 8 [4] Z. Yang and B. Rannala. Bayesian phylogenetic inference using DNA sequences: A Markov Chain Monte Carlo Method. Mol. Biol. Evol., 1997. [5] B. Mau and M. A. Newton. Phylogenetic inference for binary data on dendrograms using markov chain monte carlo. J. of Comp. and Graphical Statistics, 1997. 0 70 3 4 5 6 7 8 [6] S. Li, D. K. Pearl, and H. Doss. Phylogenetic tree construction using Markov Chain Monte Carlo.3J. of the American 2000.6 4 Statistical Association, 5 7 8 Figure 4: CS and SP XXXCITE are recall metrics XXX Depth C A G C C A T C C C A G C C A T C C A G C C C A G C C A T C C [7] W. J. Bruno, N. D. Socci, and A. L. Halpern. Weighted neighbor joining: A likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol., 17:189–197, 2000. Depth [8] D. G. Higgins and P. M Sharp. Clustal: a package for performing multiple sequence alignment on a microcomputer. Gene, 73:237–244, 1988. [9] J. L. Thorne, H. Kishino, and J. Felsenstein. Inching toward reality: an improved likelihood model of sequence evolution. J. of Mol. Evol., 34:3–16, 1992. The multi-alignment in the litterature that most resembles our lineage sampler is Handel. It models in r : A Tsystem C a probabilistic fashion an InDel evolutionary a treefrom above the observed proteins produces sequencederivation alignment canalong be extracted the infered derivation D as follow:and deem the amino acids x, y ∈ S iff y difference ∈ A0 (D, x). with our approach is that their inference algorithm a multi-alignment asd : described above. aligned The key c: A T C C C A T C is based on SSR rather than the lineageThesampling moves that wesequence advocate in this paper. state-of-the-art for multiple alignment systems based on an evolutionary model is Handel [2]. References A T A C A G C C A G [1] N. Galtier, N. Tourasse, and M. Gouy. A nonhyperthermophilic common ancestor to extant life forms. Science, 283:220–221, 1999. [10] J. L. Thorne, H. Kishino, and J. Felsenstein. An evolutionary model for maximum likelihood alignment of DNA sequences. J. of Mol. Evol., 33:114–124, 1991. [2] I. Holmes and W. J. Bruno. Evolutionary HMM: a bayesian approach to multiple alignment. Bioinformatics, 17:803–820, 2001. [3] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, 2003. [11] G. A Lunter, I. Miklós, Y. S. Song, and J. Hein. An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees. J. Comp. Biol., 10:869–889, 2003. [4] Z. Yang and B. Rannala. Bayesian phylogenetic inference using DNA sequences: A Markov Chain Monte Carlo Method. Mol. Biol. Evol., 1997. [5] B. Mau and M. A. Newton. Phylogenetic inference for binary data on dendrograms using markov chain monte carlo. J. of Comp. and Graphical Statistics, 1997. [6] S. Li, D. K. Pearl, and H. Doss. Phylogenetic tree construction using Markov Chain Monte Carlo. J. of the American Statistical Association, 2000. [12] A. Bouchard-Côté, P. Liang, T. L. Griffiths, and D. Klein. A probabilistic approach to language change. In EMNLP 07, 2007. [13] P. Diaconis, S. Holmes, and R. M. Neal. Analysis of a non-reversible Markov chain sampler. Technical report, Cornell University, 1997. [14] S. Needleman and C. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. of Mol. Biol., 48:443–453, 1970. It is based on TKF91 and produces a multiple sequence alignment as described above. The key difference a: C A T A C b:C A G The BAliBASE multi-alignment dataset is a standard for this type of comparison andmove wasthat we advocate with[13] our approach is that theirbenchmark inference algorithm is based on SSR rather than the AR C A G A T A C inC this paper. the dataset used to assess the performance of Handel in its original paper [8]. While other approaches are A a: 8000 Table 1: XXXXXX—clustalw C d: 6000 80 C A G C A T A C G C C C C AA G C AA TT AC CC C A G C A T A C r: 1 1 1 1 1 Table 1: XXXXXX—clustalw a: G a: A 4000 (b) Fig 1b System SSR (Handel) SSR+AR r: 300 gc2_cavpo alc_mouse 1yuh 1mim 1nld SSR 100 AR d: d: Sub Sub Sub C T 2000 Figure 1: XXXXX evolution (InDel process) C 0 - annotated protein data - source: BAliBASE [16] - loss: Sum-of-Pairs (SP) Time Column-Score (CS) System SSR (Handel) AR (this paper) • A correct definition: r: C 2. Multi-alignment 2 Time Deterministic functions of a random graphical model evolutionary parameters T 30000 3 3 • Not reversible Problem I: random walk behavior T 15000 4 - synthetic 2 - 4 characters types - 124010 character tokens 1 0 - tree has 7 nodes 200 Max -dev = 2 nodes Maxheld-out dev =0 MAX 100 all internal Time - evaluated on root reconstruction - loss: Levenshtein distance (a) Fig 1a (SSR) (High-level) graphical model: AR 5 ... C SSR Error • What are guide trees and ancestral reconstructions? ...ACGTTGCGTGAT... • Non-contiguous lineages a: AR 5 1. Reconstruction resample "thin vertical slices" a C A T • Connected component of the anchor? SSR CS C Error A SP b: C Error • What is a multi-alignment? T C C C A G C [7] W. J. Bruno, N. D. Socci, and A. L. Halpern. Weighted neighbor joining: A likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol., 17:189–197, 2000. [8] D. G. Higgins and P. M Sharp. Clustal: a package for performing multiple sequence alignment on a microcomputer. Gene, 73:237–244, 1988. [9] J. L. Thorne, H. Kishino, and J. Felsenstein. Inching toward reality: an improved likelihood model of sequence evolution. J. of Mol. Evol., 34:3–16, 1992. Cylindric proposals: use a DP [10] J. L. Thorne, H. Kishino, and J. Felsenstein. An evolutionary model for maximum likelihood alignment of DNA sequences. J. of Mol. Evol., 33:114–124, 1991. [11] G. A Lunter, I. Miklós, Y. S. Song, and J. Hein. An efficient algorithm for statistical multiple [15] K. St. John, T. Warnow, B. M. E. Moret, and L. Vawter. Performance study of phylogenetic methods: (unweighted) quartet methods and neighbor-joining. J. of Algorithms, 48:173–193, 2003. [16] J. Thompson, F. Plewniak, and O. Poch. Balibase: A benchmark alignments database for the evaluation of multiple sequence alignment programs. Bioinformatics, 15:87–88, 1999. [17] C. B. Do, M. S. P. Mahabhashyam, M. Brudno, and S. Batzoglou. Probcons: Probabilistic consistency-based multiple sequence alignment. Genome Research, 15:330–340, 2005.