Mostly statistical
Current challenges for molecular phylogenetics
Barbara Holland
School of Mathematics & Physics
University of Tasmania

Charles Darwin and Alfred Russel Wallace
Evolution as descent with modification, implying relationships between organisms by unbroken genetic lines.
Phylogenetics seeks to determine these genetic relationships.
[Figures: portraits of Alfred Russel Wallace and Charles Darwin. Darwin’s sketch: the first phylogenetic tree?]

Ernst Haeckel’s Tree of Life (1866)
Since the publication of On the Origin of Species in 1859, people have been trying to infer the evolutionary “tree of life”.
[Figure: Haeckel’s Pedigree of Man]

Why molecular phylogeny
• Most molecules evolve independently of adaptations affecting morphology.
• It is fairly easy to find genes that are present in all species of interest, e.g., the mitochondrial 12S rRNA molecule is functional across all mammals.
• Useful mathematical models of sequence evolution have been developed that underpin attempts to infer evolutionary trees.
[Figure: where do the mole and the golden mole belong (?). Afrotheria: golden mole, elephant, tenrec. Laurasiatheria: mole, whale, hedgehog.]

A brief and incomplete history of molecular phylogenetics
[Timeline figure, 60s to 00s]
• Data: antibodies; DNA-DNA hybridisation; sequence data (amino acid, then DNA); PCR and MORE sequence data
• Methods: distance based; parsimony; explicit models (maximum likelihood); assessing support (bootstrap); more complex models (Bayesian methods, 90s); population processes, gene trees in species trees (00s)
• People: Sarich, Wilson, Felsenstein
• Perils: systematic bias (the Felsenstein Zone); anomalous gene trees; non-identifiable models

The molecular phylogeny problem
[Figure: sequences such as ACCGCTTA, ACCCCTTA and ACTGCTTA changing through time along a tree whose shape is unknown (?).]
We see the aligned modern-day sequences
…ACCCCTTA…
…ACCCCATA…
…ACTGCTTA…
…ACTGCTAA…
and want to recover the underlying evolutionary tree.

Sequence evolution is modelled as a Markov process
Consider a single edge in a phylogeny, i.e.
evolution of a single species, and the evolution of a single DNA base amongst the possible states {A, C, G, T}.
The probability of mutating from state i to state j over a length of time t depends only on the current state i and the potential future state j, not on any of the previous history of the sequence, and can be written $p_{ij}(t)$.
[Figure: one site moving among the states A, C, G and T along a time axis of length t.]

Continuous time Markov chains
Transition matrix (rows and columns ordered A, C, G, T):
\[
M = \begin{pmatrix}
p_{AA} & p_{AC} & p_{AG} & p_{AT} \\
p_{CA} & p_{CC} & p_{CG} & p_{CT} \\
p_{GA} & p_{GC} & p_{GG} & p_{GT} \\
p_{TA} & p_{TC} & p_{TG} & p_{TT}
\end{pmatrix}
\]
Instantaneous rate matrix:
\[
Q = \begin{pmatrix}
-q_{A*} & q_{AC} & q_{AG} & q_{AT} \\
q_{CA} & -q_{C*} & q_{CG} & q_{CT} \\
q_{GA} & q_{GC} & -q_{G*} & q_{GT} \\
q_{TA} & q_{TC} & q_{TG} & -q_{T*}
\end{pmatrix}
\]
where $q_{i*} = \sum_{j \neq i} q_{ij}$, i.e. rows sum to zero.
\[
M = \exp(Qt)
\]
Typically we restrict to stationary, reversible models, with the stationary distribution denoted by π. So $\pi Q = 0$, and $D(\pi)Q$ is symmetric, where $D(\pi)$ is the diagonal matrix with π on its diagonal.

Models of nucleotide substitution
• Jukes Cantor (JC)
  – All substitutions equally likely
  – Base frequencies equal
• Kimura 2 Parameter (K2P)
  – Transitions and transversions at different rates
  – Base frequencies equal
• HKY model
  – Transitions and transversions at different rates
  – Base frequencies different
• General Time Reversible (GTR)
[Figure: rate diagram on the states A, C, G, T. In the JC build all six arrows carry rate α. In the K2P and HKY builds two arrows carry β and four carry α, separating transitions from transversions.]
[Figure, GTR build: each of the six arrows between A, C, G and T carries its own rate (α, β, γ, δ, ε, ζ).]

Models define probability distributions on site patterns
The model θ consists of: the tree topology, edge weights, Q matrix*, and root distribution π.
[Figure: rooted tree on leaves 1, 2, 3 with edge weights t1, t2, t3, t12. The root y joins the internal node x (edge 12) and leaf 3; x joins leaves 1 and 2.]
\[
M_e = \exp(Q t_e)
\]
\[
p_{ijk} = \sum_{x,y} M_1(x,i)\, M_2(x,j)\, M_{12}(y,x)\, M_3(y,k)\, \pi(y)
\]
*More generally, this could be a set of Q matrices.

Tree estimation using maximum likelihood
• For a given set of parameters θ we can calculate the probability of any particular site pattern.
• The overall probability of an alignment is then taken to be the product of the probabilities for each site (i.i.d. assumption).
• This is the likelihood function, i.e. the probability of the data given the model.
• We can then use optimisation techniques to find the model parameters (tree topology, edge lengths, parameters of the substitution model) that maximise the likelihood.

Extra features of sequence evolution that can be modelled
• Site-to-site rate variation (usually modelled by a gamma distribution)
• Invariant sites
BUT
• Some parts of reality are problematic…
  – Base composition bias
  – The set of sites that are free to vary changes across the tree
  – Non-independence of sites

Likelihood versus parsimony (the Felsenstein Zone)
Prior to the introduction of ML to the phylogenetics community by Joe Felsenstein, Maximum Parsimony (MP) was the most widely used method for estimating phylogenetic trees.
MP chooses the tree that requires the fewest mutations to explain the data.
[Figure: parsimony example on small trees with taxa labelled A to D.]
The MP criterion has been shown to be statistically inconsistent on some trees under the models of nucleotide substitution discussed previously.
Likelihood is statistically consistent (given the correct model).
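The relation M = exp(Qt) and the site-pattern sum p_ijk from the model slides can be checked numerically. The following is a minimal sketch, not the speaker's code: it assumes the Jukes Cantor model with a single rate `alpha`, the three-leaf tree from the slide (leaves 1 and 2 meet at x, and x and leaf 3 meet at the root y), and a simple Taylor-series matrix exponential. All function names are hypothetical.

```python
import math
from itertools import product

STATES = "ACGT"

def jc_rate_matrix(alpha=1.0):
    """Jukes Cantor Q: every substitution has rate alpha; each row sums to zero."""
    return [[-3.0 * alpha if i == j else alpha for j in range(4)] for i in range(4)]

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def transition_matrix(q, t, squarings=10, terms=10):
    """M = exp(Q t), via a truncated Taylor series followed by repeated squaring."""
    n = len(q)
    small = [[q[i][j] * t / 2 ** squarings for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result, term = identity, identity
    for order in range(1, terms + 1):
        term = [[v / order for v in row] for row in mat_mul(term, small)]
        result = [[r + s for r, s in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def site_pattern_prob(pattern, q, pi, t1, t2, t3, t12):
    """p_ijk for leaves 1, 2, 3: sum over the root state y and internal state x."""
    i, j, k = (STATES.index(s) for s in pattern)
    m1, m2, m3, m12 = (transition_matrix(q, t) for t in (t1, t2, t3, t12))
    return sum(m1[x][i] * m2[x][j] * m12[y][x] * m3[y][k] * pi[y]
               for x, y in product(range(4), repeat=2))

def log_likelihood(columns, q, pi, t1, t2, t3, t12):
    """Log of the product of site-pattern probabilities (i.i.d. sites)."""
    return sum(math.log(site_pattern_prob(c, q, pi, t1, t2, t3, t12)) for c in columns)

q = jc_rate_matrix(alpha=1.0)
pi = [0.25] * 4  # stationary distribution under JC
print(site_pattern_prob("AAG", q, pi, t1=0.1, t2=0.1, t3=0.2, t12=0.05))
print(log_likelihood(["AAG", "CCC", "TTA"], q, pi, 0.1, 0.1, 0.2, 0.05))
```

Maximum likelihood estimation would wrap `log_likelihood` in an optimiser over the edge weights and repeat this for each candidate topology. The sketch recomputes the transition matrices for every site, which is wasteful but keeps the correspondence with the formula visible.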
Felsenstein (1978); Hendy & Penny (1989)

Assessing confidence
• It is not just of interest to get a point estimate of the phylogenetic tree.
• We would also like some measure of confidence in our point estimate.
  – Is our tree likely to change if we get more data?
  – How robust is our result to sampling error?
• The bootstrap is a useful tool for answering these sorts of questions.

The bootstrap (Felsenstein 1985)
• For each bootstrap sample:
  – Create a new alignment (of the same length as the original) by resampling, with replacement, the columns of the observed alignment
  – Construct a tree for the ‘bootstrap’ alignment
• The bootstrap support for each edge is the proportion of bootstrap trees in which that edge appears.
[Figure: the observed alignment (columns 1234567: ATATAAA, ATTATAA, TAAAATA, TATAAAT) on taxa a, b, c, d is resampled by column, e.g. columns 1224567, 1334567, 1234567 and 1244567. A tree is built from each resampled alignment, and an edge of the original tree receives support 0.75.]

Example where the bootstrap is useful
• Simulate data on the four-taxon tree below (JC model)
• Use sequence lengths of 100, 1000, and 10000
[Figure: four-taxon tree on a, b, c, d with edge lengths 0.01 and 0.2.]
Sequence length   ((a,b),(c,d))   ((a,c),(b,d))   ((a,d),(b,c))
100               5.7%            42.8%           49.8%
1000              97%             <5%             <5%
10000             100%            0%              0%

Example where it is not so useful
• Simulate data on the two four-taxon trees below (JC model) in the proportion 55%, 45% and concatenate the sequences
• Use total sequence lengths of 100, 1000, and 10000
[Figure: two four-taxon trees with edge lengths 0.05 and 0.1. One pairs a with b and c with d (55% of sites); the other pairs a with c and b with d (45% of sites).]
Sequence length   ((a,b),(c,d))   ((a,c),(b,d))   ((a,d),(b,c))
100               64%             33%             3%
1000              80%             20%             0%
10000             98%             <5%             <5%

Genome-scale phylogeny
• Data sets with many concatenated genes
  – Rokas et al., Nature 2003 (106 genes, 8 taxa)
  – Goremykin et al., MBE 2004 (61 genes, 14 taxa)
• Estimated trees have very high bootstrap support.
• BUT... trees are sensitive to: model used, method used, data-coding.
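The bootstrap procedure from the slides above (Felsenstein 1985) can be sketched in a few lines. This is an illustration under stated assumptions rather than the method used for the simulations: the rows of the slide's 7-column alignment are assumed to be taxa a to d in order, and the tree builder is a stand-in that, for four taxa, picks the split supported by the most parsimony-informative columns. The function names are hypothetical.

```python
import random
from collections import Counter

# The three possible unrooted four-taxon trees, as pairs of row indices.
SPLITS = {
    "ab|cd": ((0, 1), (2, 3)),
    "ac|bd": ((0, 2), (1, 3)),
    "ad|bc": ((0, 3), (1, 2)),
}

def take_columns(alignment, cols):
    """Build a new alignment from the given (0-based) column indices."""
    return ["".join(row[c] for c in cols) for row in alignment]

def quartet_parsimony_split(alignment):
    """Stand-in tree builder: the split backed by the most informative columns.

    Ties go to the split listed first in SPLITS.
    """
    score = {name: 0 for name in SPLITS}
    for col in zip(*alignment):
        for name, ((p, q), (r, s)) in SPLITS.items():
            if col[p] == col[q] and col[r] == col[s] and col[p] != col[r]:
                score[name] += 1
    return max(SPLITS, key=score.get)

def bootstrap_support(alignment, replicates=200, seed=0):
    """Proportion of bootstrap replicates in which each split is recovered."""
    rng = random.Random(seed)
    n_cols = len(alignment[0])
    counts = Counter()
    for _ in range(replicates):
        # Resample columns with replacement, keeping the original length.
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        counts[quartet_parsimony_split(take_columns(alignment, cols))] += 1
    return {name: counts[name] / replicates for name in SPLITS}

observed = ["ATATAAA", "ATTATAA", "TAAAATA", "TATAAAT"]  # rows assumed to be a, b, c, d
print(quartet_parsimony_split(observed))
print(bootstrap_support(observed, replicates=200, seed=0))
```

Calling `take_columns(observed, [0, 1, 1, 3, 4, 5, 6])` reproduces the slide's resample labelled 1224567. In practice the stand-in tree builder would be replaced by whichever method (NJ, MP, ML) produced the original tree.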
Case study: The Amborella Wars
Amborella: a New Caledonian shrub
[Figure labels: Angiosperms, Grasses]
[Plot: NJ bootstrap with ML distances using a GTR + gamma model. y-axis: bootstrap support (0 to 100). x-axis: alpha (gamma shape parameter), 0.1 to 0.75, running from skewed rates to equal rates. Series: Amb+Nym, Grasses, Amb, Nym.]

Sensitivity to model choice
• Phylogenomic datasets may involve hundreds of genes for many species.
• These data sets create challenges for current phylogenetic methods, as different genes have different functions and hence evolve under different processes.
• One question is how best to model this heterogeneity to give reliable phylogenetic estimates of the species tree.

Example
Rokas et al. (2003) produced 106 gene trees for 8 yeast taxa:
S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, S. bayanus, S. kluyveri, S. castellii, C. albicans

Two extremes
• How many parameters do we need to adequately represent the branches of all (unrooted) gene trees? An unrooted tree on 8 taxa has 2 × 8 − 3 = 13 branches, so between 13 (consensus tree) and 13 × 106 = 1378 (a separate tree for every gene).
• Too few parameters introduces bias
• Too many parameters increases the variance

Stochastic partitioning
• Attempts to cluster genes into classes that have evolved in a similar fashion.
• Each class is allowed its own set of parameters (e.g. branch lengths or model of nucleotide substitution)

Algorithm overview
1. Randomly assign the n genes to k classes.
2. Optimise parameters for each class.
3. Compute the posterior probability for each gene with the parameters from each class.
4. Move each gene into the class for which it has highest posterior probability.
5. Go to step 2; when no genes change class, STOP.

How many classes?

Conclusions regarding stochastic partitioning
• Pros
  – AIC/BIC gives a quantitative method for choosing how many parameters are needed.
  – Identifies groups of genes under similar constraints
• Cons
  – Slow
  – Randomized algorithm, so different starting points lead to different partitions.
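The five steps of the stochastic partitioning algorithm overview can be sketched as a small loop. This is a toy illustration, not the published method: each gene is assumed to be summarised by a vector of branch lengths, "optimising" a class just averages its members, and the posterior uses a Gaussian stand-in likelihood with equal class priors and an arbitrary `sigma`. A real implementation would use phylogenetic likelihoods in steps 2 and 3. The names and the optional `initial_labels` argument are hypothetical.

```python
import math
import random

def fit_class(genes):
    """Step 2 stand-in: class parameters are the mean branch-length vector."""
    dim = len(genes[0])
    return [sum(g[d] for g in genes) / len(genes) for d in range(dim)]

def posteriors(gene, params, sigma=0.05):
    """Step 3: posterior over classes (Gaussian stand-in, equal class priors)."""
    logl = [-sum((x - m) ** 2 for x, m in zip(gene, p)) / (2 * sigma ** 2)
            for p in params]
    top = max(logl)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logl]
    total = sum(weights)
    return [w / total for w in weights]

def stochastic_partition(genes, k, seed=0, initial_labels=None, max_iter=100):
    rng = random.Random(seed)
    # Step 1: random assignment, unless a starting partition is supplied.
    if initial_labels is None:
        labels = [rng.randrange(k) for _ in genes]
    else:
        labels = list(initial_labels)
    # A class that starts empty borrows a random gene as its parameters.
    params = [list(genes[rng.randrange(len(genes))]) for _ in range(k)]
    for _ in range(max_iter):
        for c in range(k):  # step 2
            members = [g for g, lab in zip(genes, labels) if lab == c]
            if members:
                params[c] = fit_class(members)
        new_labels = []
        for g in genes:  # steps 3 and 4
            post = posteriors(g, params)
            new_labels.append(max(range(k), key=post.__getitem__))
        if new_labels == labels:  # step 5: no gene changed class
            break
        labels = new_labels
    return labels, params

genes = [[0.10, 0.20], [0.11, 0.19], [0.09, 0.21], [0.10, 0.22],
         [0.50, 0.60], [0.52, 0.58], [0.49, 0.61], [0.51, 0.60]]
labels, params = stochastic_partition(genes, k=2, seed=0)
print(labels)
```

Because step 1 is random, different seeds can yield different partitions, which is the drawback listed under Cons. Running several seeds and comparing the resulting fits is one common way to mitigate that.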
Brief Tour…
• Combinatorics of tree space
• Graph Theory
• Stochastic Models, Inference & Probability Theory
• Algebraic Geometry
• Lie groups, representation theory
• …

Identifiability
Figure 2, Matsen and Steel (2007):
“…the underlying assumption was that mixture model data on one topology can be distinguished from data evolved on an unmixed tree of another topology given enough data and the "correct" method. Here we show that this assumption can be false. For biologists our results imply that, for example, the combined data from two genes whose phylogenetic trees differ only in terms of branch lengths can perfectly fit a tree of a different topology.”

Algebraic geometry approach
Elizabeth Allman, John Rhodes

The boundary of phylogenetics and population genetics
[Figure labels: Fisher-Wright model; phylogenetic tree]

Gene trees in species phylogenies
James Degnan, Noah Rosenberg

Representation theory, Lie groups, Markov invariants, closure of model classes
Jeremy Sumner, Peter Jarvis

http://www.maths.utas.edu.au/phylomania/phylomania2011.htm