Phylogenetic Model Testing with jModelTest2
https://www.ebi.ac.uk/~kgori/teaching/modeltest/ (shortlink: https://goo.gl/rm8xLu)

Data: https://www.ebi.ac.uk/~kgori/teaching/modeltest/ (https://goo.gl/rm8xLu)
jModelTest documentation: https://code.google.com/p/jmodeltest2/w/list (https://goo.gl/8pSxkd)
Tree viewers:
    http://www.phylowidget.org/ (https://goo.gl/5m125Q)
    http://ete.cgenomics.org/treeview (https://goo.gl/4ZwUf2)
PhyML: http://www.atgc-montpellier.fr/phyml/ (http://goo.gl/XumE3)

Part 1: Likelihood ratio testing

Likelihood ratio tests can be used to compare the goodness of fit of two models, provided the models are nested. Nested means that one model (H0) has fewer parameters than the other (H1), and that fixing a subset of the complex model's free parameters makes it equivalent to the simple model.

The process in likelihood ratio testing is to:

1. Maximise the likelihood under each model.
2. Calculate the test statistic:
   $\Lambda = L_{H_1} / L_{H_0}$
   $\Delta = \ln(\Lambda) = \ln L_{H_1} - \ln L_{H_0}$
   $2\Delta = 2(\ln L_{H_1} - \ln L_{H_0})$
3. Test $2\Delta$ against a $\chi^2$ distribution with q degrees of freedom, where q is the difference in the number of free parameters between H1 and H0. This is the asymptotic distribution of $2\Delta$ when H0 is correct.

We will start by doing this without using jModelTest.

Instructions

- Download the data file primate-mtDNA.phy. This is an alignment of mitochondrial DNA from twelve primate species.
- Open a command prompt. We will use the program PhyML to calculate phylogenetic trees. PhyML is also available through its website, http://goo.gl/XumE3
- Type the following command:

    phyml -i primate-mtDNA.phy -m JC69 -b 0 -c 1 --run_id=JC69

  This launches PhyML and tells it to calculate a phylogenetic tree under the Jukes-Cantor model of substitution. The tree is written to the file primate-mtDNA.phy_phyml_tree_JC69.txt, and the parameter estimates to primate-mtDNA.phy_phyml_stats_JC69.txt. Make a note of the log-likelihood.
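Once log-likelihoods have been recorded for a pair of nested models, steps 2 and 3 of the procedure above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, the log-likelihood values are hypothetical placeholders (not results from primate-mtDNA.phy), and the critical values are the P = 0.05 column of the χ2 table at the end of this document.

```python
# P = 0.05 critical values, copied from the chi-squared table in this handout.
CHI2_CRIT_05 = {1: 3.84, 2: 5.99, 3: 7.82, 4: 9.49, 5: 11.07,
                6: 12.59, 7: 14.07, 8: 15.51, 9: 16.92, 10: 18.31}


def lrt_statistic(lnl_h0, lnl_h1):
    """Return 2*Delta = 2 * (lnL_H1 - lnL_H0)."""
    return 2.0 * (lnl_h1 - lnl_h0)


def lrt_rejects_h0(lnl_h0, k_h0, lnl_h1, k_h1, crit=CHI2_CRIT_05):
    """True if 2*Delta exceeds the critical value for q = k_h1 - k_h0."""
    q = k_h1 - k_h0  # difference in number of free parameters
    if q not in crit:
        raise ValueError(f"no critical value tabulated for q = {q}")
    return lrt_statistic(lnl_h0, lnl_h1) > crit[q]


# Hypothetical log-likelihoods, for illustration only.
lnl_jc69 = -6000.0  # H0: JC69, 0 free parameters
lnl_k80 = -5900.0   # H1: K80, 1 free parameter

print("2*Delta =", lrt_statistic(lnl_jc69, lnl_k80))
print("reject H0 at P = 0.05:", lrt_rejects_h0(lnl_jc69, 0, lnl_k80, 1))
```

Replace the placeholder values with the log-likelihoods you noted from the PhyML stats files; the parameter counts come from the model table at the end of this document.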
- Now type these commands, and record the log-likelihood from each run:

    phyml -i primate-mtDNA.phy -m K80 -b 0 -c 1 --run_id=K80
    phyml -i primate-mtDNA.phy -m HKY85 -b 0 -c 1 --run_id=HKY85
    phyml -i primate-mtDNA.phy -m GTR -b 0 -c 1 --run_id=GTR

  This repeats the analysis using increasingly complex models. Each model only adds parameters to the previous one, so the models are nested in the scheme:

    GTR ⊃ HKY85 ⊃ K80 ⊃ JC69

  Further details can be found in the tables at the end of this document.
- You now have four trees derived from the same data. You can copy and paste the contents of the tree files into one of the web-based tree viewers (see the links at the start of this document) to see what they look like. For this data set they will most likely differ in branch lengths, but not in topology. They will also differ in likelihood.
- Using the equations given above, calculate $2\Delta$ for these pairs:

    H0 = JC69,  H1 = K80
    H0 = K80,   H1 = HKY85
    H0 = HKY85, H1 = GTR

- Are these values significant? Refer to the χ2 table at the end of this document, and remember that JC69, K80, HKY85 and GTR have 0, 1, 4 and 8 free parameters, respectively.
- Which model does likelihood ratio testing suggest you should use?

Part 2: Likelihood ratio testing in jModelTest2

In this part of the practical we will use models that include a gamma parameter. This is an extra parameter that can be added to the models we have already seen to improve their fit to data in which rates vary among sites. The columns of the input alignment differ from one another in the number of substitutions they have undergone, and we use the Gamma probability distribution to capture this variability. The Gamma distribution has two parameters, alpha and beta, but we fix its mean at 1, so only alpha is free. This means that adding Gamma-distributed rates to a model increases its number of parameters by one.
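To see why fixing the mean leaves only one free parameter, note that a Gamma distribution with shape alpha and rate beta has mean alpha / beta, so a mean of 1 forces beta = alpha. The sketch below draws site rates from such a distribution using only the Python standard library. It is an illustration under assumptions: the function name is made up for this example, and it samples from the continuous distribution, whereas PhyML works with a fixed number of discrete rate categories (the -c option).

```python
import random
import statistics


def sample_site_rates(alpha, n_sites, seed=0):
    """Draw n_sites relative rates from a Gamma distribution with mean 1."""
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale). Scale = 1 / alpha is the
    # same as rate beta = alpha, which fixes the mean at alpha / beta = 1.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


for alpha in (0.5, 5.0):
    rates = sample_site_rates(alpha, 20_000)
    print(f"alpha = {alpha}: "
          f"mean = {statistics.fmean(rates):.2f}, "
          f"variance = {statistics.variance(rates):.2f}")
```

The variance of this distribution is 1 / alpha, so a small alpha means strong rate variation among sites (many slow sites and a few fast ones), while a large alpha means the sites evolve at nearly the same rate.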
Instructions

- Open Windows Explorer and navigate to the jModelTest directory.
- Double-click on runjmodeltest-gui.bat.
- Load the file primate-mtDNA.phy into jModelTest by clicking File > Load DNA alignment.
- Calculate likelihoods for the models used in Part 1 by clicking Analysis > Compute likelihood scores, and then selecting 3 substitution schemes, +F base frequencies, and fixed BIONJ-JC. Click Compute Likelihoods. jModelTest now runs PhyML, just as you did in Part 1, with the advantage that it can run in parallel.
- When PhyML has finished, click Analysis > Do hLRT calculations. Then click Run. Which model is selected?
- Try running the hLRT calculations again, using Backward selection. Which model is selected this time? Is this what you expect?
- Now we will add the Gamma parameter. Click Analysis > Compute likelihood scores, and this time check the +G box. Leave the +I box blank. Set the other options as before. Computing likelihoods will take a little longer this time, as there are more models to test, and estimating trees with the Gamma parameter enabled is more computationally intensive.
- Run Analysis > Do hLRT calculations again. Should you use the gamma parameter?
- What effect does it have on the estimated tree? Run this command in the command prompt:

    phyml -i primate-mtDNA.phy -m GTR -b 0 -c 4 --run_id=GTR+G

  Compare the resulting tree, primate-mtDNA.phy_phyml_tree_GTR+G.txt, to the GTR tree from Part 1.

Part 3: Information Theory

While jModelTest2 can perform likelihood ratio tests, its main means of model selection is through Information Theory. jModelTest2 implements two popular information-theoretic model selection methods: the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). These methods do not require models to be nested, and can compare multiple models simultaneously.

AIC and BIC are penalised likelihood measures: they take the maximised likelihood estimate for a candidate model and apply a penalty term for model complexity.
The model with the smallest penalised score (AIC or BIC) is the preferred one.

Instructions:

- Download the file carnivores_16S.phy. This is an alignment of 16S ribosomal DNA from 31 carnivore species.
- Load this alignment into jModelTest (File > Load DNA alignment).
- Calculate likelihoods under 5 substitution schemes, with Gamma (Analysis > Compute likelihood scores). This alignment is larger than primate-mtDNA.phy, so this may take a long time.
- While waiting for jModelTest we can look at whether different models produce different tree topologies. To check this, run these commands and then look at the trees:

    phyml -i carnivores_16S.phy -m GTR -b 0 -c 4 --run_id=GTR+G
    phyml -i carnivores_16S.phy -m JC69 -b 0 -c 1 --run_id=JC69

  Are there any differences between the trees?
- When the likelihood scores have been calculated, click Analysis > Do AIC Calculations, and check the Use AICc correction box. Which model is selected? How close is the second-place model?
- Now repeat for BIC. BIC penalises complex models more than AIC. Is the same model selected? How close is the second-place model?

Extra: Model Averaging

Information criteria can point to a single 'best' (or most appropriate) model for the data, but jModelTest has another feature: model averaging. This combines a set of models, weighted by their AIC or BIC score, to produce average values for parameters of interest, for example the phylogenetic tree. Let's try this:

- Click Analysis > Model-averaged phylogeny. Select BIC, with majority-rule consensus trees. jModelTest produces a consensus tree from all the models within a confidence interval (here set to 100%, so that all the models are considered), weighting the contribution of each model by its BIC result.
- Repeat the model averaging, this time with a strict consensus. What is the difference between the trees? Given the model weights, do you think strict or majority-rule consensus is more appropriate in this case?
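The model weights used in model averaging can be illustrated with the standard Akaike-weight formula, in which each model's weight is proportional to exp(-d/2), where d is the difference between its score and the best (smallest) score. The sketch below assumes this usual form; this handout does not state the exact formula jModelTest applies, and the function name and BIC values are hypothetical, chosen only for illustration.

```python
import math


def ic_weights(scores):
    """Turn AIC or BIC scores into normalised model weights.

    scores maps model name -> information criterion value (smaller is better).
    """
    best = min(scores.values())
    # Relative likelihood of each model given its distance from the best score.
    relative = {m: math.exp(-0.5 * (s - best)) for m, s in scores.items()}
    total = sum(relative.values())
    return {m: r / total for m, r in relative.items()}


# Hypothetical BIC values, not results from carnivores_16S.phy.
hypothetical_bic = {"HKY85+G": 10000.0, "GTR+G": 10002.0, "JC69": 10250.0}

for model, weight in sorted(ic_weights(hypothetical_bic).items(),
                            key=lambda item: -item[1]):
    print(f"{model:8s} weight = {weight:.4f}")
```

When one or two models carry almost all of the weight, poorly fitting models contribute very little to a weighted consensus, which is worth keeping in mind when comparing strict and majority-rule results.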
Reference Tables

Models of nucleotide substitution

| Model | Reference | Description | Free parameters | Contains |
|-------|-----------|-------------|-----------------|----------|
| JC69 | Jukes and Cantor 1969 | Equal base frequencies, one substitution rate | 0 | - |
| K80 (Kimura's 2-parameter model) | Kimura 1980 | Equal base frequencies, two substitution rates (ts / tv) | 1 | JC69 |
| F81 | Felsenstein 1981 | Unequal base frequencies, one substitution rate | 3 | JC69 |
| HKY85 | Hasegawa, Kishino and Yano 1985 | Unequal base frequencies, two substitution rates (ts / tv) | 4 | F81, K80, JC69 |
| SYM (use "-m 012345, -f 0.25,0.25,0.25,0.25") | Zharkikh 1994 | Equal base frequencies, six substitution rates | 5 | K80, JC69 |
| GTR (General time-reversible, a.k.a. REV) | Tavaré 1986 | Unequal base frequencies, six substitution rates | 8 | SYM, HKY85, F81, K80, JC69 |
| Gamma (additional parameter for above) | Yang 1993 | Fits gamma distribution of rate categories to sites | +1 | - |

PhyML command line:

| Flag | Description |
|------|-------------|
| -i | Input alignment file |
| -d | Datatype (-d=nt is DNA, -d=aa is protein) |
| -m | Substitution model |
| -c | Number of discrete categories for gamma (-c=1: no gamma rates) |
| -o | What to optimise (-o=n: nothing is optimised, BIONJ tree is returned) |
| -u | User tree, in newick format. PhyML calculates likelihood |
| -b | Bootstrapping (-b=0 for no bootstrapping) |
| --run_id | Output suffix: use to distinguish different PhyML runs on the same input |

χ2 table

| df | P = 0.05 | P = 0.01 | P = 0.001 |
|----|----------|----------|-----------|
| 1 | 3.84 | 6.64 | 10.83 |
| 2 | 5.99 | 9.21 | 13.82 |
| 3 | 7.82 | 11.35 | 16.27 |
| 4 | 9.49 | 13.28 | 18.47 |
| 5 | 11.07 | 15.09 | 20.52 |
| 6 | 12.59 | 16.81 | 22.46 |
| 7 | 14.07 | 18.48 | 24.32 |
| 8 | 15.51 | 20.09 | 26.13 |
| 9 | 16.92 | 21.67 | 27.88 |
| 10 | 18.31 | 23.21 | 29.59 |

Information Theory

Akaike Information Criterion (AIC)

The model with the smallest AIC is the model that loses the least information relative to the truth, as measured by the Kullback-Leibler divergence.
The formula for AIC is

$AIC = 2k - 2\ln L$

where k is the number of free parameters and lnL is the maximised log-likelihood.

AIC is only justified asymptotically; when data are limited (as they always are) it is better to use the corrected form,

$AIC_c = AIC + \frac{2k(k+1)}{n - k - 1}$

where n is the number of data points.

Bayesian Information Criterion (BIC)

The model with the smallest BIC corresponds to the model with the highest posterior probability, given a uniform prior probability. The formula for BIC is

$BIC = k \ln(n) - 2\ln L$
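The three formulas above translate directly into code. The following is a minimal sketch; the function names are made up for this example and the input values are hypothetical rather than jModelTest output. Here k is simply whatever parameter count you supply. The example uses the 8 substitution-model parameters listed for GTR in the model table, and whether further parameters (such as branch lengths) should be counted is not covered in this handout.

```python
import math


def aic(lnl, k):
    """AIC = 2k - 2 lnL."""
    return 2 * k - 2 * lnl


def aicc(lnl, k, n):
    """AICc = AIC + 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return aic(lnl, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(lnl, k, n):
    """BIC = k ln(n) - 2 lnL."""
    return k * math.log(n) - 2 * lnl


# Hypothetical values, for illustration only.
lnl, k, n = -5900.0, 8, 1000

print("AIC  =", aic(lnl, k))
print("AICc =", round(aicc(lnl, k, n), 3))
print("BIC  =", round(bic(lnl, k, n), 3))
```

Because ln(n) is greater than 2 for any n above about 7, the BIC penalty per parameter is larger than the AIC penalty for realistic data sets, which is why BIC tends to favour simpler models.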