practical

Phylogenetic Model Testing with jModelTest2
https://www.ebi.ac.uk/~kgori/teaching/modeltest/
(shortlink: https://goo.gl/rm8xLu)
Data:
https://www.ebi.ac.uk/~kgori/teaching/modeltest/
(https://goo.gl/rm8xLu)
jModelTest documentation:
https://code.google.com/p/jmodeltest2/w/list
(https://goo.gl/8pSxkd)
Tree viewers:
http://www.phylowidget.org/ (https://goo.gl/5m125Q)
http://ete.cgenomics.org/treeview (https://goo.gl/4ZwUf2)
PhyML:
http://www.atgc-montpellier.fr/phyml/
(http://goo.gl/XumE3)
Part 1: Likelihood ratio testing
Likelihood ratio tests can be used to test the goodness of fit of models in the case
that the models are nested. That means that, of the models being tested, one
model (H0) has fewer parameters than the other one (H1), and fixing a subset of
the complex model’s free parameters makes it equivalent to the simple model.
The process in likelihood ratio testing is to:
1. Maximise the likelihood under each model
2. Calculate the test statistic,
$$\Lambda = \frac{L_{H_1}}{L_{H_0}}$$
$$\Delta = \ln(\Lambda) = \ln L_{H_1} - \ln L_{H_0}$$
$$2\Delta = 2\,(\ln L_{H_1} - \ln L_{H_0})$$
3. Test 2Δ against a χ2 distribution with q degrees of freedom, where q is the
difference in the number of free parameters between H1 and H0 . This is
the asymptotic distribution of 2Δ when H0 is correct.
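The three steps above can be sketched in a few lines of Python. This is only an illustration, not part of the practical's toolchain: the function name `lrt` is made up for this example, the log-likelihoods passed to it are placeholders rather than results for any data set, and the critical values are the P = 0.05 column of the χ2 table at the end of this document.

```python
# Critical values of the chi-squared distribution at P = 0.05,
# copied from the table at the end of this document (df -> value).
CHI2_CRIT_05 = {1: 3.84, 2: 5.99, 3: 7.82, 4: 9.49, 5: 11.07,
                6: 12.59, 7: 14.07, 8: 15.51, 9: 16.92, 10: 18.31}


def lrt(lnl_h0, k_h0, lnl_h1, k_h1):
    """Return (2*Delta, q, significant at P = 0.05) for two nested models.

    lnl_* are maximised log-likelihoods, k_* are numbers of free parameters.
    """
    two_delta = 2.0 * (lnl_h1 - lnl_h0)  # test statistic 2*Delta
    q = k_h1 - k_h0                      # degrees of freedom
    if q not in CHI2_CRIT_05:
        raise ValueError(f"no critical value tabulated for q = {q}")
    return two_delta, q, two_delta > CHI2_CRIT_05[q]


# Placeholder log-likelihoods (NOT real results for any alignment).
stat, q, significant = lrt(lnl_h0=-1000.0, k_h0=0, lnl_h1=-990.0, k_h1=1)
print(f"2*Delta = {stat:.2f}, q = {q}, reject H0 at P = 0.05: {significant}")
```

If H0 is rejected, the extra parameters of H1 give a significantly better fit and the more complex model is preferred.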
We will start by doing this without using jModelTest.
Instructions
Download the data file primate-mtDNA.phy. This is an alignment of mitochondrial
DNA from twelve primate species.
Open a command prompt. We will use the program PhyML to calculate
phylogenetic trees. PhyML is also available through its website, http://goo.gl/XumE3
Type the following command:
phyml -i primate-mtDNA.phy -m JC69 -b 0 -c 1 --run_id=JC69
This will launch PhyML, telling it to calculate a phylogenetic tree under the Jukes-Cantor model of substitution. The tree will be output to the file primate-mtDNA.phy_phyml_tree_JC69.txt, and parameter estimates to primate-mtDNA.phy_phyml_stats_JC69.txt. Make a note of the log-likelihood.
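Rather than copying the log-likelihood by hand, it can be read from the stats file. The following is a minimal sketch, assuming the stats file contains a line of the form `. Log-likelihood: -1234.56789`; check this against your own PhyML output before relying on it. The helper name `parse_log_likelihood` and the sample text are illustrative only.

```python
import re

# Assumed line format in the PhyML stats file:
#   ". Log-likelihood:    -1234.56789"
LNL_PATTERN = re.compile(r"Log-likelihood:\s*(-?\d+(?:\.\d+)?)")


def parse_log_likelihood(stats_text):
    """Return the first log-likelihood found in the text of a stats file."""
    match = LNL_PATTERN.search(stats_text)
    if match is None:
        raise ValueError("no 'Log-likelihood:' line found")
    return float(match.group(1))


# Made-up sample text standing in for the contents of a stats file.
sample = ". Number of taxa: \t12\n. Log-likelihood: \t\t-1234.56789\n"
print(parse_log_likelihood(sample))
```

For a real run, pass the contents of the stats file instead of `sample`, for example `parse_log_likelihood(open("primate-mtDNA.phy_phyml_stats_JC69.txt").read())`.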
Now type these commands, and record the log-likelihood from each run:
phyml -i primate-mtDNA.phy -m K80 -b 0 -c 1 --run_id=K80
phyml -i primate-mtDNA.phy -m HKY85 -b 0 -c 1 --run_id=HKY85
phyml -i primate-mtDNA.phy -m GTR -b 0 -c 1 --run_id=GTR
This repeats the analysis using increasingly complex models. With each model
we are only adding parameters, so these models are nested in the scheme:
GTR ⊃ HKY85 ⊃ K80 ⊃ JC69
Further details can be found in the tables at the end of this document.
You now have four trees derived from the same data. You can copy and paste the
contents of the tree files into one of the web-based tree viewers (see title page
for links) to see what they look like. For this data set they will most likely differ
in branch lengths, but not topology. They will also differ in likelihoods.
Using the equations given above, calculate Δ, and from it the test statistic 2Δ, for these pairs:
H0 = JC69, H1 = K80
H0 = K80, H1 = HKY85
H0 = HKY85, H1 = GTR
Are these values significant? Refer to the χ2 table at the end of this document,
and remember that JC69, K80, HKY85 and GTR have 0, 1, 4 and 8 parameters,
respectively.
Which model does likelihood ratio testing suggest you should use?
Part 2: Likelihood ratio testing in jModelTest2
In this part of the practical we will be using models that include a gamma
parameter. This is an extra parameter we can add to the models we have
already seen to improve their fit to data with variable rates.
In the input alignment the columns vary from one to another in their number of
mutations. We use the Gamma probability distribution to capture this variability.
The Gamma distribution has two parameters, alpha and beta, but we fix its mean
at 1, so only alpha is free. This means that adding Gamma-distributed rates to a
model increases its number of parameters by one.
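Why fixing the mean leaves a single free parameter can be written out explicitly. The lines below assume the shape-rate parameterisation of the Gamma distribution (the text does not say which parameterisation is meant), with r denoting the relative rate at a site:

```latex
\[
  \mathrm{E}[r] = \frac{\alpha}{\beta} = 1
  \;\Longrightarrow\; \beta = \alpha,
  \qquad
  \mathrm{Var}[r] = \frac{\alpha}{\beta^{2}} = \frac{1}{\alpha}
\]
```

Under this parameterisation a small alpha corresponds to strong rate variation among sites, and a large alpha to nearly equal rates.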
Instructions
Open Windows Explorer and navigate to the jModelTest directory
Double-click on runjmodeltest-gui.bat
Load the file primate-mtDNA.phy into jModelTest by clicking File > Load DNA
alignment
Calculate likelihoods for the models used in Part 1 by clicking Analysis >
Compute likelihood scores, and then selecting 3 substitution schemes, +F base
frequencies, and fixed BIONJ-JC. Click Compute Likelihoods.
jModelTest now runs PhyML, just as you did in Part 1, with the advantage that it
can run several models in parallel.
When PhyML is finished, click Analysis > Do hLRT calculations. Then click Run.
Which model is selected? Try running the hLRT calculations again, using
Backward selection. Which model is selected this time? Is this what you expect?
Now we will add the Gamma parameter. Click Analysis > Compute likelihood
scores, and this time check the +G box. Leave the +I box blank. Set the other
options as before.
Computing likelihoods will take a little longer this time, as there are more
models to test, and estimating trees with the Gamma parameter enabled is more
computer intensive.
Run Analysis > Do hLRT calculations again. Should you use the gamma
parameter? What effect does it have on the estimated tree?
Run this command in the command prompt:
phyml -i primate-mtDNA.phy -m GTR -b 0 -c 4 --run_id=GTR+G
Compare the resulting tree, primate-mtDNA.phy_phyml_tree_GTR+G.txt, to the GTR tree from
Part 1.
Part 3: Information Theory
While jModelTest2 can perform likelihood ratio tests, its main means of model
selection is through Information Theory. jModelTest2 implements two popular
information theoretic model selection methods: the Akaike Information Criterion
(AIC), and the Bayesian Information Criterion (BIC). These methods do not
require models to be nested, and can compare multiple models simultaneously.
AIC and BIC are penalised likelihood measures – they take the maximised
likelihood estimate for a candidate model and apply a penalty term for model
complexity. The model that has the smallest penalised likelihood is the preferred
one.
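The penalised-likelihood idea can be made concrete with the formulas from the Information Theory tables at the end of this document. The sketch below is illustrative only: the model names, log-likelihoods, parameter counts and sample size n are placeholders, not results from either alignment used in this practical.

```python
import math


def aic(lnl, k):
    """Akaike Information Criterion: 2k - 2 lnL."""
    return 2 * k - 2 * lnl


def aicc(lnl, k, n):
    """Small-sample corrected AIC: AIC + 2k(k+1) / (n - k - 1)."""
    return aic(lnl, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(lnl, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 lnL."""
    return k * math.log(n) - 2 * lnl


# Placeholder candidates: name -> (maximised lnL, number of free parameters).
n = 500  # placeholder number of data points
candidates = {"simple": (-1000.0, 1), "complex": (-995.0, 5)}

for name, (lnl, k) in candidates.items():
    print(f"{name}: AIC={aic(lnl, k):.2f} "
          f"AICc={aicc(lnl, k, n):.2f} BIC={bic(lnl, k, n):.2f}")

# The preferred model is the one with the smallest score.
best_by_aic = min(candidates, key=lambda m: aic(*candidates[m]))
best_by_bic = min(candidates, key=lambda m: bic(*candidates[m], n))
print("AIC prefers:", best_by_aic, "| BIC prefers:", best_by_bic)
```

With these placeholder numbers the two criteria disagree, because the BIC penalty k ln(n) grows with the amount of data while the AIC penalty 2k does not.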
Instructions:
Download the file carnivores_16S.phy. This is an alignment of 16S Ribosomal
DNA from 31 carnivore species.
Load this alignment into jModelTest (File > Load DNA Alignment)
Calculate likelihoods under 5 substitution schemes, with Gamma (Analysis >
Compute likelihood scores). This alignment is larger than primate-mtDNA.phy, so
this may take a long time.
While waiting for jModelTest we can look at whether different models will
produce different tree topologies. To check this, run these commands and then
look at the trees:
phyml -i carnivores_16S.phy -m GTR -b 0 -c 4 --run_id=GTR+G
phyml -i carnivores_16S.phy -m JC69 -b 0 -c 1 --run_id=JC69
Are there any differences between the trees?
When the likelihood scores have been calculated, click Analysis > Do AIC
Calculations, and check the Use AICc correction box.
Which model is selected? How close is the second-place model?
Now repeat for BIC. BIC penalises complex models more than AIC. Is the same
model selected? How close is second-place?
Extra: Model Averaging
Information criteria can point to a single ‘best’ (or most appropriate) model for
the data, but jModelTest has another feature – model averaging. This combines a
set of models, weighted by their AIC or BIC score, producing average values for
parameters of interest, for example the phylogenetic tree. Let’s try this:
Click Analysis > Model-averaged phylogeny. Select BIC, with majority-rule
consensus trees.
jModelTest produces a consensus tree from all the models within a Confidence
interval (which we set to 100%, meaning all the models were considered),
weighting the contribution of each model by its BIC result.
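The weighting can be illustrated with the usual information-criterion weights, w_i = exp(-d_i / 2) / sum_j exp(-d_j / 2), where d_i is the difference between model i's score and the best (smallest) score. This is a sketch under the assumption that the weights take this standard form; it does not reproduce jModelTest's own implementation, and the BIC scores and model labels below are made-up placeholders.

```python
import math


def ic_weights(scores):
    """Convert AIC or BIC scores (dict: model -> score) into weights summing to 1."""
    best = min(scores.values())
    # Relative support of each model compared with the best-scoring one.
    rel = {m: math.exp(-0.5 * (s - best)) for m, s in scores.items()}
    total = sum(rel.values())
    return {m: r / total for m, r in rel.items()}


# Placeholder BIC scores for three hypothetical models.
placeholder_bic = {"A": 2000.0, "B": 2002.0, "C": 2010.0}
weights = ic_weights(placeholder_bic)
for model, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {w:.3f}")
```

A model whose score is only slightly worse than the best still receives a substantial weight, which is why the choice of consensus method can matter when several models are nearly tied.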
Repeat the model-averaging, this time with a strict consensus. What is the
difference in the trees? Given the model weights, do you think strict or majority-rule consensus is more appropriate in this case?
Reference Tables
Models of nucleotide substitution
| Model | Reference | Description | Free parameters | Contains |
| JC69 | Jukes and Cantor 1969 | Equal base frequencies, one substitution rate | 0 | - |
| K80 (Kimura's 2-parameter model) | Kimura 1980 | Equal base frequencies, two substitution rates (ts / tv) | 1 | JC69 |
| F81 | Felsenstein 1981 | Unequal base frequencies, one substitution rate | 3 | JC69 |
| HKY85 | Hasegawa, Kishino and Yano 1985 | Unequal base frequencies, two substitution rates (ts / tv) | 4 | F81, K80, JC69 |
| SYM (use "-m 012345, -f 0.25,0.25,0.25,0.25") | Zharkikh 1994 | Equal base frequencies, six substitution rates | 5 | K80, JC69 |
| GTR (General time-reversible, a.k.a. REV) | Tavaré 1986 | Unequal base frequencies, six substitution rates | 8 | SYM, HKY85, F81, K80, JC69 |
| Gamma (additional parameter for above) | Yang 1993 | Fits gamma distribution of rate categories to sites | +1 | - |
PhyML command line:
| Flag | Description |
| -i | Input alignment file |
| -d | Datatype (-d=nt is DNA, -d=aa is protein) |
| -m | Substitution model |
| -c | Number of discrete categories for gamma (-c=1: no gamma rates) |
| -o | What to optimise (-o=n: nothing is optimised, BIONJ tree is returned) |
| -u | User tree, in Newick format. PhyML calculates likelihood |
| -b | Bootstrapping (-b=0 for no bootstrapping) |
| --run_id | Output suffix: use to distinguish different PhyML runs on same input |
χ2 table
| df | P = 0.05 | P = 0.01 | P = 0.001 |
| 1 | 3.84 | 6.64 | 10.83 |
| 2 | 5.99 | 9.21 | 13.82 |
| 3 | 7.82 | 11.35 | 16.27 |
| 4 | 9.49 | 13.28 | 18.47 |
| 5 | 11.07 | 15.09 | 20.52 |
| 6 | 12.59 | 16.81 | 22.46 |
| 7 | 14.07 | 18.48 | 24.32 |
| 8 | 15.51 | 20.09 | 26.13 |
| 9 | 16.92 | 21.67 | 27.88 |
| 10 | 18.31 | 23.21 | 29.59 |
Information Theory
Akaike Information Criterion (AIC)
The model with the smallest AIC is the model with the smallest amount of
information lost compared to the truth, according to the Kullback-Leibler
divergence.
The formula for AIC is
$$\mathrm{AIC} = 2k - 2\ln L$$
where k is the number of free parameters,
and ln L is the maximised log-likelihood.
AIC is justified asymptotically, that is, for large samples; when data are limited (as they always are) it is
better to use the corrected form,
$$\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}$$
where n is the number of data points.
Bayesian Information Criterion (BIC)
The model with the smallest BIC corresponds to the model with the highest
posterior probability, given a uniform prior probability.
The formula for BIC is
$$\mathrm{BIC} = k \cdot \ln(n) - 2\ln L$$