Niklas Wahlberg, University of Turku
Jarno Tuimala, Free researcher / Finnish Tax Administration

Schedule
14.4. Tue: Introduction to models (Jarno)
16.4. Thu: Distance-based methods (Jarno)
17.4. Fri: ML analyses (Jarno)
20.4. Mon, 21.4. Tue, 23.4. Thu, 24.4. Fri: Assessing hypotheses (Jarno); Problems with molecular data (Jarno); Problems with molecular data (Jarno); Phylogenomics; Search algorithms, visualization, and other computational aspects (Jarno)

With >100 billion bases in GenBank, we are beginning to understand how DNA sequences evolve
◦ Mitochondrial and nuclear genes differ in mutation dynamics
◦ Different genes have their own mutation dynamics

Hidden evolution in DNA sequences
Ancestor: GGCGCG
Seq 1: AGCGAG
Seq 2: GCGGAC
[Figure: number of changes at each site along the lineages leading to Seq 1 and Seq 2]

Correction for the difference between the true and the observed distance.
[Figure: distance plotted against time]

Models incorporate information about the rates at which each nucleotide is replaced by each alternative nucleotide
◦ For DNA this can be expressed as a 4 x 4 rate matrix (known as the Q matrix)
Other model parameters may include:
◦ Site-by-site rate variation, often modelled as a statistical distribution, for example a gamma distribution

Model parameters
◦ The mean instantaneous substitution rate (= the general mutation rate combined with the rate of fixation in the population)
◦ The relative rates of substitution between each pair of bases
◦ The average frequencies of each base in the dataset
◦ Branch lengths
◦ Topology!

A general model of sequence evolution
[Figure: the four bases, A and G (purines) and C and T (pyrimidines), with frequencies πA, πC, πG, πT, connected by twelve directed substitution rates a to l. Changes within the purines or within the pyrimidines are transitions; changes between a purine and a pyrimidine are transversions.]

If all substitutions were equally likely, the expected ratio (Re) of transitions (P) to transversions (Q) would be about 0.5, because 4 of the 12 possible substitutions are transitions and 8 are transversions:
◦ Re = P / Q ≈ 0.5
In reality this is not the case, and the ratio is usually higher. Some models of sequence evolution take this ratio into account, some don't.
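The expected ratio of 0.5 and the idea of hidden changes can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names are made up for illustration, and it assumes that the three example sequences (Ancestor, Seq 1, Seq 2) are aligned position by position.

```python
from itertools import permutations

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def is_transition(x, y):
    """True if x and y differ but are both purines or both pyrimidines."""
    pair = {x, y}
    return x != y and (pair <= PURINES or pair <= PYRIMIDINES)


def expected_ratio():
    """Count transitions and transversions among the 12 possible substitutions."""
    changes = list(permutations("ACGT", 2))  # all ordered pairs of distinct bases
    transitions = sum(is_transition(x, y) for x, y in changes)
    transversions = len(changes) - transitions
    return transitions, transversions, transitions / transversions


def count_differences(seq1, seq2):
    """Number of aligned positions at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(seq1, seq2))


def observed_counts(seq1, seq2):
    """Observed (transitions, transversions) between two aligned sequences."""
    ts = sum(is_transition(x, y) for x, y in zip(seq1, seq2))
    tv = count_differences(seq1, seq2) - ts
    return ts, tv


# Example sequences from the slide (alignment by position is an assumption).
ancestor, seq1, seq2 = "GGCGCG", "AGCGAG", "GCGGAC"

print(expected_ratio())             # (4, 8, 0.5)
print(observed_counts(seq1, seq2))  # (1, 3)

# Changes along both lineages versus differences visible between the tips.
changes_from_ancestor = (count_differences(ancestor, seq1)
                         + count_differences(ancestor, seq2))
print(changes_from_ancestor, count_differences(seq1, seq2))  # 6 4
```

Under that alignment, at least six changes separate the two sequences from their ancestor, but only four differences are visible between them. The fifth site changed from C to A in both lineages, so it contributes nothing to the observed distance.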
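The general rate matrix (rows: original base; columns: replacement base; order A, C, G, T)

$$
Q = \begin{pmatrix}
-\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\
\mu g\pi_A & -\mu(g\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\
\mu h\pi_A & \mu j\pi_C & -\mu(h\pi_A + j\pi_C + f\pi_T) & \mu f\pi_T \\
\mu i\pi_A & \mu k\pi_C & \mu l\pi_G & -\mu(i\pi_A + k\pi_C + l\pi_G)
\end{pmatrix}
$$

◦ μ = mean instantaneous substitution rate
◦ a, b, c, ..., l = relative rates of substitution
◦ πA = frequency of A (likewise πC, πG, πT)
◦ The product of these terms is the rate parameter

Assumptions
◦ The rate of change from base i to base j is independent of the base that occupied the site prior to i (Markov property)
◦ The substitution rate does not change over time (homogeneity)
◦ The relative frequencies of A, G, C, and T are at equilibrium (stationarity)

The Jukes and Cantor (JC) model is the simplest model

$$
Q = \begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix}
$$

The JC model is a one-parameter model
1) it assumes that all bases are equally frequent (π = 0.25)
2) unless modified, it assumes that all sites can change and that they do so at the same rate

Jukes-Cantor model
[Figure: the four bases A, C, G, T, each pair connected by the same rate α]
◦ α = the rate of substitution from one base to any one other base (for example, from A to G) per unit of time t
◦ The rate of substitution for each nucleotide is 3α
◦ In t steps there will be 3αt changes

Kimura model
[Figure: the four bases, with transitions at rate α and transversions at rate β]
◦ α = rate of transitions
◦ β = rate of transversions

The Kimura (K2P) model has 2 parameters

$$
Q = \begin{pmatrix}
\cdot & \beta & \alpha & \beta \\
\beta & \cdot & \beta & \alpha \\
\alpha & \beta & \cdot & \beta \\
\beta & \alpha & \beta & \cdot
\end{pmatrix}
$$

(Diagonal entries, shown as dots, are set so that each row sums to zero.)

The K2P model is more realistic, but still
1) it assumes that all bases are equally frequent (π = 0.25)
2) unless modified, it assumes that all sites can change and that they do so at the same rate

The Hasegawa-Kishino-Yano (HKY) model

$$
Q = \begin{pmatrix}
\cdot & \beta\pi_C & \alpha\pi_G & \beta\pi_T \\
\beta\pi_A & \cdot & \beta\pi_G & \alpha\pi_T \\
\alpha\pi_A & \beta\pi_C & \cdot & \beta\pi_T \\
\beta\pi_A & \alpha\pi_C & \beta\pi_G & \cdot
\end{pmatrix}
$$

The HKY model takes variable base frequencies into account, but still
1) unless modified, it assumes that all sites can change and that they do so at the same rate

The GTR model
[Figure: the four bases with frequencies πA, πC, πG, πT, connected by six reversible substitution rates a to f]

$$
Q = \begin{pmatrix}
-\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\
\mu a\pi_A & -\mu(a\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\
\mu b\pi_A & \mu d\pi_C & -\mu(b\pi_A + d\pi_C + f\pi_T) & \mu f\pi_T \\
\mu c\pi_A & \mu e\pi_C & \mu f\pi_G & -\mu(c\pi_A + e\pi_C + f\pi_G)
\end{pmatrix}
$$

◦ μ = mean instantaneous substitution rate
◦ a, b, c, ..., f = relative rates of substitution
◦ πA = frequency of A (likewise πC, πG, πT)
◦ The product of these terms is the rate parameter

Almost all models used are special cases of one model:
◦ The general time reversible (GTR) model

The next three slides are from: https://code.google.com/p/jmodeltest2/wiki/TheoreticalBackground
[Figure: example DNA sequence ACAGGTGAGGCTCAGCCAATTTGAGCTTTGTCGATAGGT]
Hypotheses tested are: F = base frequencies; S = substitution type; I = proportion of invariable sites; G = gamma rates.

Model hierarchy
◦ GTR: variable base frequencies, 6 substitution types
◦ TrN: variable base frequencies, 3 substitution types
◦ SYM: equal base frequencies, 6 substitution types
◦ HKY85, F84: variable base frequencies, 2 substitution types
◦ K3ST: equal base frequencies, 3 substitution types
◦ K2P: equal base frequencies, 2 substitution types
◦ F81: variable base frequencies, single substitution type
◦ JC: equal base frequencies, single substitution type

Model parameters can be:
◦ estimated from the data (using a likelihood function)
◦ pre-set based upon assumptions about the data (for example, that for all sequences all sites change at the same rate and all substitutions are equally likely, as in the Jukes and Cantor model)
Wherever possible, avoid assumptions which are violated by the data, because they can lead to incorrect trees.

The most common additional parameters are:
◦ A correction for the proportion of sites which are invariable (parameter I)
◦ A correction for variable site rates at those sites which can change (parameter gamma, G)
All models can be supplemented with these parameters (e.g. GTR+I+G, HKY+I+G)

Invariable sites and gamma-distributed rates
[Figure: frequency plotted against rate for gamma distributions]
◦ α = shape parameter of the gamma distribution (not the substitution rate α of the JC, K2P and HKY matrices)
◦ Using a continuous distribution is computationally difficult
◦ Most programs use discrete rate categories

The parameters I and G covary!
◦ (I + G) can be estimated, but the values of I and G are not easily teased apart
◦ Parameter G takes I into account, so I is not strictly needed
◦ Usually, though, a certain proportion of sites (estimated from the data) is assumed to be invariant, and the rest (the varying sites) are allowed to follow rates drawn from the discrete gamma distribution.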
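The statement that almost all models are special cases of GTR can be made concrete by building the GTR rate matrix and constraining its parameters. The sketch below is an illustration only. The function names and the example parameter values are hypothetical. The layout follows the GTR matrix of this section: a = A-C, b = A-G, c = A-T, d = C-G, e = C-T, f = G-T.

```python
BASES = "ACGT"


def gtr_q(rates, freqs, mu=1.0):
    """Build the 4 x 4 GTR rate matrix Q as a list of rows (order A, C, G, T).

    rates: dict with keys 'a'..'f' (relative rates, one per unordered base pair)
    freqs: dict mapping each base to its equilibrium frequency
    mu:    mean instantaneous substitution rate
    """
    pair_rate = {
        frozenset("AC"): rates["a"],
        frozenset("AG"): rates["b"],
        frozenset("AT"): rates["c"],
        frozenset("CG"): rates["d"],
        frozenset("CT"): rates["e"],
        frozenset("GT"): rates["f"],
    }
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                # Off-diagonal entry: mu * relative rate * frequency of target base.
                q[i][j] = mu * pair_rate[frozenset((x, y))] * freqs[y]
        # Diagonal entry makes the row sum to zero.
        q[i][i] = -sum(q[i])
    return q


def is_time_reversible(q, freqs, tol=1e-12):
    """Check detailed balance: pi_i * q_ij == pi_j * q_ji for all pairs."""
    return all(
        abs(freqs[x] * q[i][j] - freqs[y] * q[j][i]) < tol
        for i, x in enumerate(BASES)
        for j, y in enumerate(BASES)
    )


equal = {b: 0.25 for b in BASES}
unequal = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}  # hypothetical frequencies

# JC: one substitution type, equal frequencies.
jc = gtr_q(dict.fromkeys("abcdef", 1.0), equal)

# K2P and HKY: transitions (b = A-G, e = C-T) twice as fast as transversions
# (the factor 2 is an arbitrary example value).
ti_tv = {"a": 1.0, "b": 2.0, "c": 1.0, "d": 1.0, "e": 2.0, "f": 1.0}
k2p = gtr_q(ti_tv, equal)
hky = gtr_q(ti_tv, unequal)

for name, q in (("JC", jc), ("K2P", k2p), ("HKY", hky)):
    print(name, [round(v, 3) for v in q[0]], is_time_reversible(
        q, equal if name != "HKY" else unequal))
```

The three matrices differ only in which of the six relative rates are forced to be equal and whether the base frequencies are fixed at 0.25, which is the nesting shown in the model hierarchy.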
But the more parameters you estimate from the data, the more time is needed for an analysis and the more sampling error accumulates
◦ One might have a realistic model but large sampling errors
◦ Realism comes at a cost in time and precision!
◦ Fewer parameters may give an inaccurate estimate, but more parameters decrease the precision of the estimate
◦ In general, use the simplest model which fits the data

When models are nested
◦ Likelihood ratio test (LRT)
◦ Test statistic:

$$
-2\ln\left(\frac{L_0}{L_1}\right)
$$

where L0 is the likelihood of the simpler model and L1 the likelihood of the more complex model
◦ Compared to a chi-square distribution with degrees of freedom equal to the difference in the number of free parameters between the two models

When models are not nested
◦ Akaike Information Criterion (AIC): 2k - 2 ln(likelihood), where k is the number of parameters estimated in the model. The best model has the lowest AIC.
◦ Bayesian Information Criterion (BIC): similar to AIC

[Figure, repeated over four slides: the hierarchy of substitution models from GTR (variable base frequencies, 6 substitution types) down to JC (equal base frequencies, single substitution type)]

Yang (1995) has shown that parameter estimates are reasonably stable across tree topologies provided trees are not “too wrong”. Thus one can obtain a tree using a quick method, such as neighbor-joining, and then estimate parameters on that tree. These parameters can then be used to calculate the likelihood of the tree. When the likelihood of the tree has been calculated under all the to-be-compared models, the best-fitting model, as judged by the LRT or by the lowest AIC value, can be selected. The final tree is then estimated using this model.

For both tests, one needs to compute the likelihood of the trees under the models. For now, assume we know the likelihood of the models we want to compare.

Likelihood ratio test

$$
LR = 2(\ln L_1 - \ln L_0)
$$

◦ L1: alternative hypothesis, the more parameter-rich model
◦ L0: null hypothesis, the less parameter-rich model
◦ The LRT statistic approximately follows a chi-square distribution
◦ Degrees of freedom equal the number of extra parameters in the more complex model

Example
◦ HKY85: -lnL = 1787.08
◦ GTR: -lnL = 1784.82
◦ Then LR = 2 (-1784.82 - (-1787.08)) = 2 x 2.26 = 4.52
◦ Degrees of freedom = 4 (GTR adds 4 additional parameters to HKY85)
◦ Critical value (P = 0.05) = 9.49
◦ Since 4.52 < 9.49, GTR does not fit significantly better!
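The worked example above can be reproduced in a few lines. This is a minimal sketch using only the standard library; the function names are made up for illustration. The chi-square tail probability uses the closed form that holds for even degrees of freedom only, which covers the 4 degrees of freedom needed here; a general implementation would use a statistics library instead.

```python
import math


def lrt_statistic(lnl_null, lnl_alt):
    """LR = 2 * (lnL1 - lnL0), with lnL1 from the more parameter-rich model."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_sf_even_df(x, df):
    """Upper tail probability P(X > x) of a chi-square distribution.

    Uses the closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!,
    which is valid for even df only.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch handles positive even df only")
    half = max(x, 0.0) / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Values from the example: the slides report -lnL, so the signs are flipped.
lnl_hky85 = -1787.08   # null model (less parameter-rich)
lnl_gtr = -1784.82     # alternative model (more parameter-rich)
extra_parameters = 4   # GTR adds 4 parameters to HKY85

lr = lrt_statistic(lnl_hky85, lnl_gtr)
p_value = chi2_sf_even_df(lr, extra_parameters)

print(f"LR = {lr:.2f}, df = {extra_parameters}, P = {p_value:.2f}")
print("GTR fits significantly better" if p_value < 0.05
      else "GTR does not fit significantly better")
```

Comparing the P value with 0.05 is equivalent to comparing LR with the critical value 9.49; both lead to keeping HKY85 for these data.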
Akaike Information Criterion (AIC)
A measure of the goodness of fit of a model
◦ Information lost when model M is used to approximate the process of molecular evolution
◦ AIC is an estimate of the expected relative distance between a fitted model, M, and the unknown true mechanism that generated the data

$$
\mathrm{AIC}(M) = -2\ln L(M) + 2K(M)
$$

◦ L(M) is the likelihood of model M and K(M) is the number of estimable parameters of model M
Given a dataset, models can be ranked according to their AIC
The model with the lowest AIC is selected

Bayesian Information Criterion (BIC)
BIC also takes the sample size n into account

$$
\mathrm{BIC}(M) = -2\ln L(M) + K(M)\ln n
$$

◦ K(M) is the number of estimable parameters of model M and n is the number of characters
Kelchner & Thomas 2007, TREE 22:87-94

Model jumping
◦ Allow the data to determine which model is optimal during the analysis
◦ Only available in MrBayes 3.2
[Figure: model jumping among JC, K2P and GTR]

Partitioned analyses
◦ A priori separation of characters into different partitions
◦ Each partition is analyzed with a different model
◦ In addition to allowing heterogeneity across data subsets in overall rate and in substitution model parameters, several programs also allow the user to unlink topology and branch lengths
◦ “Different data subsets can thus have independent branch lengths or even different topologies.” (Ronquist and Huelsenbeck, 2003:1573)

Amino acid models
◦ 20 amino acids
◦ Models are based largely on empirical amino acid replacement matrices
◦ Examples: JTT, WAG, MtREV, Blosum62

Parameters include topology and branch lengths!
How to estimate values for those parameters?
◦ Distance methods
◦ Maximum likelihood methods
◦ Bayesian methods

Optimality criterion: an objective function (score) that quantifies how well the data fit a tree
◦ Used to evaluate and rank alternative trees
Two logical steps for phylogenetic methods that rely on optimality criteria
◦ Definition of the optimality criterion
◦ Maximization (or minimization) of the criterion over alternative trees, for their evaluation and ranking
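AIC and BIC, defined earlier in this section, follow the same two steps as any optimality criterion: define a score, then rank the candidates by it. The sketch below applies both formulas to the HKY85 and GTR log-likelihoods from the LRT example. Several inputs are assumptions rather than values from the slides. The parameter counts include only substitution-model parameters (4 for HKY85, 8 for GTR), which is enough for a comparison on a shared tree. The alignment length n = 1000 is a hypothetical value chosen for illustration. The function names are made up.

```python
import math


def aic(lnl, k):
    """AIC(M) = -2 ln L(M) + 2 K(M)."""
    return -2.0 * lnl + 2.0 * k


def bic(lnl, k, n):
    """BIC(M) = -2 ln L(M) + K(M) ln n, where n is the number of characters."""
    return -2.0 * lnl + k * math.log(n)


def rank_models(models, score):
    """Return model names sorted from best (lowest score) to worst."""
    return sorted(models, key=lambda name: score(*models[name]))


# name -> (lnL, number of substitution-model parameters)
# lnL values are from the LRT example; parameter counts are an assumption
# that ignores branch lengths, which are shared by both models here.
models = {
    "HKY85": (-1787.08, 4),
    "GTR": (-1784.82, 8),
}
n_characters = 1000  # hypothetical alignment length

for name, (lnl, k) in models.items():
    print(f"{name}: AIC = {aic(lnl, k):.2f}, "
          f"BIC = {bic(lnl, k, n_characters):.2f}")

print("Best by AIC:", rank_models(models, aic)[0])
print("Best by BIC:",
      rank_models(models, lambda lnl, k: bic(lnl, k, n_characters))[0])
```

With these inputs both criteria prefer HKY85, in agreement with the likelihood ratio test. Because ln n exceeds 2 for any n above about 7, BIC penalises the extra GTR parameters more heavily than AIC does.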