10/27/14

Estimating Evolutionary Trees
• if the data are “consistent with infinite sites”, then all methods should yield the same tree
• it gets more complicated when there is homoplasy, i.e., parallel or convergent mutations at the same position
• more than one tree may be equally good as a hypothesis of the genealogical history

Phylogenetic Methods
• UPGMA (single-pass algorithm)
• neighbor-joining (single-pass algorithm)
• parsimony
  ◦ search more or less exhaustively for the tree with the smallest number of steps (mutations) required to explain the data
• maximum likelihood
  ◦ search more or less exhaustively for the tree (topology and branch lengths) that maximizes the likelihood of the observed data
• Bayesian MCMC methods
  ◦ summarize the posterior distribution of trees to estimate the probability of clades in the tree

Does it matter for pop gen?
• we don’t need to know the genealogy for each locus to make inferences or estimate population genetic parameters
• but analyzing data that are not consistent with infinite sites requires more complex coalescent and/or mutation models

Gene Trees versus Species Trees
• “reciprocal monophyly”

Gene Trees versus Species Trees
• “incomplete lineage sorting”

The Lineage Sorting Process
• Speciation at time X
  ◦ ancestral polymorphism retained
• The gene tree is polyphyletic for both species between times X and Y
• The gene tree is paraphyletic within species between times Y and Z
• Reciprocal monophyly at time Z

Gene Trees versus Species Trees
• Incongruence…
  ◦ between gene tree and species tree
  ◦ …and between different gene trees

Probability of Incongruence
$$P(\text{incongruence}) = \frac{2}{3}\, e^{-t/(2N_e)}$$
• for the simple 3-taxon case, where t is the number of generations between speciation events and there is one sample per taxon
• also applies when lineage sorting is complete within each of the terminal taxa
  ◦ incongruence as a result of incomplete lineage sorting in the past

The “lasting effects” of incomplete lineage sorting
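As a rough numeric illustration of the 3-taxon incongruence probability given above, the sketch below evaluates (2/3)·exp(−t/(2N_e)) for a few internode lengths. The function name `incongruence_probability` and the value N_e = 10,000 are hypothetical choices made for this example, not values from the slides.

```python
import math


def incongruence_probability(t_generations, n_e):
    """Probability that a gene tree disagrees with the species tree.

    Three taxa, one sample per taxon: (2/3) * exp(-t / (2 * N_e)),
    where t is the internode length in generations and N_e is the
    effective size of the population along the internode.
    """
    if t_generations < 0 or n_e <= 0:
        raise ValueError("t must be >= 0 and N_e must be > 0")
    return (2.0 / 3.0) * math.exp(-t_generations / (2.0 * n_e))


# Hypothetical effective size; internode lengths are given in units of 2*N_e.
n_e = 10_000
for units in (0.0, 0.5, 1.0, 2.0, 5.0):
    t = units * 2 * n_e
    p = incongruence_probability(t, n_e)
    print(f"t = {units:>4} x 2Ne  P(incongruent) = {p:.3f}  P(match) = {1 - p:.3f}")
```

At t = 0 the value is 2/3, since all three rooted topologies are then equally likely and two of them disagree with the species tree; the probability decays exponentially as the internode lengthens.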
[Figure: gene trees for Species 1, Species 2, and Species 3 (S1, S2, S3) within an ancestral population; probability of mtDNA and nuclear gene trees matching the species tree as a function of internode length. Moore 1995, Evolution 49, 718-726]

Interpreting Single Gene Trees?
• human mtDNA tree
  ◦ consistent with “out of Africa” hypothesis
(Avise et al. 1990, Evolution)

Other causes of incongruence
• hybridization/introgression/horizontal transfer
• balancing selection
• gene duplication and loss

Introgression plus Selective Sweep
[Figure: species tree and gene tree for taxa A, B, and C through time; introgression followed by a selective sweep]

Balancing Selection
• results in a “balanced” allele frequency maintained by frequency-dependent selection
• can maintain pre-existing alleles over long stretches of time
[Figure: trees with tips labeled H, C, and G. From Klein, Takahata, Ayala 1993]

Gene Duplication and Loss
[Figure: actual phylogeny of a, b, c, d with a gene duplication, compared with the apparent phylogeny of a, b, c, d]
[Figure: phylogeny of α subunits of voltage-gated calcium channels. Piontkivska & Hughes, 2003, JME]

Approaches for making inferences/estimating parameters
• direct estimates from summary statistics
  ◦ e.g., $F_{ST} = \dfrac{1}{1 + 4Nm} \;\equiv\; 4Nm = \dfrac{1 - F_{ST}}{F_{ST}}$
  ◦ but this typically requires significant assumptions
  ◦ genetic equilibrium, constant population size, etc.
• simple coalescent simulations to generate confidence intervals
[Figure: distribution of $\theta_S$ estimates, with series for k = 10 and k = 20]
[Figure: distribution of $\theta_\pi$ estimates, with series for k = 10 and k = 20]

More sophisticated approaches for making inferences/estimating parameters
• start with historical model…

MIGRATE-N
  ◦ simulates N populations connected by gene flow
  ◦ estimates population sizes and migration rates (both scaled by N and µ)
  ◦ equilibrium model
  ◦ coalescence of all samples requires migration between demes because populations do not merge as you go back in time

Beerli P, Felsenstein J (1999) Maximum-likelihood estimation of migration rates and effective population numbers in two populations using a coalescent approach. Genetics 152, 763–773.
Beerli P, Felsenstein J (2001) Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach. PNAS 98, 4563–4568.
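The “simple coalescent simulations to generate confidence intervals” step mentioned earlier can be sketched roughly as follows. This is a minimal illustration that assumes a single constant-size population, neutral infinite-sites mutation, and no recombination; the function names, the value θ = 2.0, and the number of replicates are hypothetical and are not taken from the slides.

```python
import random


def poisson(rng, lam):
    """Poisson(lam) draw: count unit-rate arrivals in an interval of length lam."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < lam:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_theta_estimates(k, theta, rng):
    """One coalescent replicate for k sampled alleles.

    Returns (theta_S, theta_pi): the estimate based on the number of
    segregating sites and the estimate based on mean pairwise differences.
    """
    a_k = sum(1.0 / i for i in range(1, k))
    n_pairs = k * (k - 1) / 2
    lineages = [1] * k  # number of sampled alleles descending from each lineage
    seg_sites = 0
    pair_diffs = 0
    while len(lineages) > 1:
        j = len(lineages)
        # Waiting time while j lineages remain, in units of 2N generations.
        t_j = rng.expovariate(j * (j - 1) / 2)
        for d in lineages:
            # Mutations fall on each lineage at rate theta/2 per unit time.
            m = poisson(rng, theta / 2 * t_j)
            seg_sites += m
            # A mutation above d of the k samples separates d * (k - d) pairs.
            pair_diffs += m * d * (k - d)
        a, b = rng.sample(range(j), 2)  # pick two lineages to coalesce
        merged = lineages[a] + lineages[b]
        lineages = [d for i, d in enumerate(lineages) if i not in (a, b)]
        lineages.append(merged)
    return seg_sites / a_k, pair_diffs / n_pairs


def percentile_interval(values, lower=0.025, upper=0.975):
    """Empirical percentile interval from simulated values."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[int(lower * n)], ordered[min(n - 1, int(upper * n))]


rng = random.Random(1)
for k in (10, 20):
    reps = [simulate_theta_estimates(k, theta=2.0, rng=rng) for _ in range(2000)]
    ci_s = percentile_interval([r[0] for r in reps])
    ci_pi = percentile_interval([r[1] for r in reps])
    print(f"k = {k}: 95% interval theta_S = {ci_s}, theta_pi = {ci_pi}")
```

Both estimators have expectation θ under these assumptions, so the simulated intervals show how much scatter to expect around the true value for a given sample size k.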
IM - Isolation with Migration
  ◦ model of population divergence with gene flow
  ◦ estimates population sizes, migration rates and divergence time(s)

Approaches for making inferences/estimating parameters
• Bayesian MCMC analyses to estimate demographic and historical parameters
  ◦ based either on the likelihood computed with the Felsenstein equation or on summary statistics (Approximate Bayesian Computation, ABC)
  ◦ the Felsenstein equation gives the likelihood of the data given a set of model parameters, Θ
$$\Pr(X \mid \Theta) = \int_G \Pr(X \mid G)\, p(G \mid \Theta)\, dG$$
where X is the data, Θ is the set of model parameters, and G is the set of all possible genealogies given Θ

Calculating the likelihood of the data for a given genealogy
• given a model of sequence evolution, a tree (= genealogy) with branch lengths, and observed character states (DNA sequences in the samples)...
• we can calculate the likelihood (probability) of the data at a given sequence position
[Figure: a tree/genealogy with branch lengths t1 to t8, internal nodes x, y, z, w, and the data (A, C, C, C, G) at a single DNA sequence position]
$$\Pr(X_i \mid G) = \sum_x \sum_y \sum_z \sum_w \Pr(A, C, C, C, G, x, y, z, w \mid G)$$
$$= \sum_x \sum_y \sum_z \sum_w \Pr(x)\, \Pr(y \mid x, t_6)\, \Pr(A \mid y, t_1)\, \Pr(C \mid y, t_2)\, \Pr(z \mid x, t_8)\, \Pr(C \mid z, t_3)\, \Pr(w \mid z, t_7)\, \Pr(C \mid w, t_4)\, \Pr(G \mid w, t_5)$$
where Pr(x) is the prior probability of state x at the root
  ◦ in this example, this quantity is summed over 256 (= 4^4) possible combinations of x, y, z, w
  ◦ number of calculations increases exponentially with more taxa, but computational shortcuts are employed

Calculating the likelihood of the data for a given genealogy
• given a model of sequence evolution, a tree (= genealogy) with branch lengths, and observed character states (DNA sequences in the samples)...
• we can calculate the likelihood (probability) of the data at a given sequence position
• the overall likelihood of the data is the product of the likelihoods for individual sites or the sum of the ln likelihoods…
$$L = \Pr(X \mid G) = \prod_{i=1}^{m} \Pr(X_i \mid G) \quad\equiv\quad \ln L = \sum_{i=1}^{m} \ln L_i$$

In practice…
• for a sample of k alleles, draw random coalescence times from the exponential distribution, as appropriate given the historical and demographic model parameters
• estimate the likelihood (probability) of the observed DNA sequences for genealogies generated under the model
$$\Pr(X \mid \Theta) = \int_G \Pr(X \mid G)\, p(G \mid \Theta)\, dG$$
• change a model parameter (according to carefully designed rules), generate a new set of genealogies, and calculate the likelihood
• we now have two results…

In practice…
• if the new result is better, accept the new set of model parameters (x′) and continue the process by taking another step in the Markov chain (i.e., “updating” a model parameter, generating genealogies, etc…)
• if the result is worse, either accept the new set of model parameters (x′) or go back to the previous set of parameters (x), with the “coin flip” probabilities as defined by the Metropolis-Hastings algorithm
$$A(x \to x') = \min\left(1,\; \frac{P(x')\, g(x' \to x)}{P(x)\, g(x \to x')}\right)$$
• repeat millions of times

Markov Chain Monte Carlo methods

Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD (2009) Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genetics 5, e1000695.
[Figure axis label: East African allele frequency (n = 10 birds, 20 alleles)]
• ∂a∂i uses the joint allele frequency distribution (j.a.f.d.) as the observed input data
• uses the diffusion approximation to estimate the expected j.a.f.d.
for a given set of model parameters
• and then calculates the likelihood of the observed data based on the above
[Figure: table of observed joint allele counts, with values 0 to 20 on both axes; horizontal axis: West African allele frequency (n = 10 birds, 20 alleles)]
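The accept/reject rule from the “In practice…” slides can be sketched as a small Metropolis-Hastings sampler. This is only an illustration under stated assumptions: the proposal is a symmetric Gaussian random walk, so g(x′ → x) = g(x → x′) and the acceptance ratio reduces to P(x′)/P(x), and `toy_log_target` is a hypothetical stand-in for the log of the (unnormalized) probability of a parameter value. It does not generate genealogies or evaluate the Felsenstein equation.

```python
import math
import random


def metropolis_hastings(log_target, x0, n_steps, step_sd, rng):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    x = x0
    log_p = log_target(x)
    chain = []
    accepted = 0
    for _ in range(n_steps):
        x_new = rng.gauss(x, step_sd)  # propose x' near the current x
        log_p_new = log_target(x_new)
        log_ratio = log_p_new - log_p  # ln[P(x') / P(x)]; proposal terms cancel
        # Better: always accept. Worse: accept with probability P(x') / P(x).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x, log_p = x_new, log_p_new
            accepted += 1
        chain.append(x)  # a rejected proposal repeats the previous state
    return chain, accepted / n_steps


def toy_log_target(x):
    """Hypothetical target: unnormalized log density of a Normal(3, 1)."""
    return -0.5 * (x - 3.0) ** 2


rng = random.Random(0)
chain, acceptance_rate = metropolis_hastings(
    toy_log_target, x0=0.0, n_steps=20000, step_sd=1.0, rng=rng
)
kept = chain[2000:]  # discard early steps as burn-in
posterior_mean = sum(kept) / len(kept)
print(f"acceptance rate = {acceptance_rate:.2f}, posterior mean = {posterior_mean:.2f}")
```

Working with log probabilities avoids numerical underflow, which matters when likelihoods are products over many sites. With an asymmetric proposal, the ratio g(x′ → x)/g(x → x′) would need to be added back into `log_ratio`.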