Models - Jarno Tuimala

Niklas Wahlberg
University of Turku
Jarno Tuimala
Free researcher / Finnish Tax Administration






14.4. Tue Introduction to models (Jarno)
16.4. Thu Distance-based methods (Jarno)
17.4. Fri ML analyses (Jarno)
20.4. Mon, 21.4. Tue, 23.4. Thu, 24.4. Fri:
◦ Assessing hypotheses (Jarno)
◦ Problems with molecular data (Jarno)
◦ Problems with molecular data (Jarno)
◦ Phylogenomics
◦ Search algorithms, visualization, and other computational aspects (Jarno)



With >100 billion bases in GenBank, we are
beginning to understand how DNA sequences
evolve
Mitochondrial and nuclear genes differ in
mutation dynamics
Different genes have their own mutation
dynamics
Hidden evolution in DNA
sequences
Ancestor GGCGCG
Seq 1 AGCGAG
Seq 2 GCGGAC
[Figure: number of changes at each site along the lineages leading to Seq 1 and Seq 2]
[Figure: distance plotted against time, showing the correction for the difference between the true and the observed distance]

Models incorporate information about the
rates at which each nucleotide is replaced by
each alternative nucleotide
◦ For DNA this can be expressed as a 4 x 4 rate
matrix (known as the Q matrix)

Other model parameters may include:
◦ Site by site rate variation - often modelled as a
statistical distribution - for example a gamma
distribution





The mean instantaneous substitution rate
(=the general mutation rate + rate of fixation
in population)
The relative rates of substitution between
each base pair
The average frequencies of each base in the
dataset
Branch lengths
Topology!
A general model of sequence evolution
[Figure: the four bases, with frequencies πA, πC, πG and πT, joined by arrows labelled with the relative rates a to l. Changes within the purines (A, G) or within the pyrimidines (C, T) are transitions; changes between a purine and a pyrimidine are transversions.]

If all substitutions were equally likely, the
expected ratio (R) of transitions (P) to
transversions (Q) would be about 0.5, because
each base has one possible transition and two
possible transversions:
◦ Re = P / Q ~ 0.5


In reality, this is not the case, and the ratio is
usually higher.
Some models of sequence evolution take this
ratio into account, some don't.
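
The expected ratio of 0.5 can be checked by simply enumerating the twelve possible base changes. The following is a minimal Python sketch; the variable names are chosen for this example only.

```python
from itertools import permutations

PURINES = set("AG")  # the remaining bases, C and T, are pyrimidines

# All 12 ordered changes from one base to a different base.
changes = list(permutations("ACGT", 2))

# A transition stays within purines or within pyrimidines;
# a transversion switches between the two classes.
transitions = [(x, y) for x, y in changes
               if (x in PURINES) == (y in PURINES)]
transversions = [(x, y) for x, y in changes
                 if (x in PURINES) != (y in PURINES)]

expected_ratio = len(transitions) / len(transversions)
print(len(transitions), len(transversions), expected_ratio)
```

Four of the twelve changes are transitions and eight are transversions, which gives the ratio of 0.5 expected when all changes are equally likely.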
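\[
Q = \begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & -\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\
C & \mu g\pi_A & -\mu(g\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\
G & \mu h\pi_A & \mu j\pi_C & -\mu(h\pi_A + j\pi_C + f\pi_T) & \mu f\pi_T \\
T & \mu i\pi_A & \mu k\pi_C & \mu l\pi_G & -\mu(i\pi_A + k\pi_C + l\pi_G)
\end{array}
\]
μ = mean instantaneous substitution rate
a, b, c,... l = relative rate of substitution
πA = frequency of A
The product of these three terms (e.g. μaπC) is the rate parameter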



Rate of change from base i to base j is
independent of the base that occupied a site
prior to i (Markov property)
Substitution rate does not change over time
(homogeneity)
Relative frequencies of A, G, C, and T are at
equilibrium (stationarity)
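The Jukes and Cantor model is the
simplest model
\[
Q = \begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & -3\alpha & \alpha & \alpha & \alpha \\
C & \alpha & -3\alpha & \alpha & \alpha \\
G & \alpha & \alpha & -3\alpha & \alpha \\
T & \alpha & \alpha & \alpha & -3\alpha
\end{array}
\]
The JC model is a
one-parameter model
1) it assumes that all
bases are equally
frequent (p=0.25)
2) unless modified it
assumes all sites can
change and that they
do so at the same
rate
Jukes-Cantor model
[Figure: the four bases A, C, G and T; each base changes to each of the other three at the same rate α]
α = the rate of substitution (e.g. A changes to G at rate α per time step t)
The rate of substitution for each nucleotide is 3α
In t steps there will be an expected 3αt changes per site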
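
To illustrate the correction between the observed and the true distance, the sketch below applies the standard Jukes-Cantor distance formula, d = -(3/4) ln(1 - 4p/3), to the two example sequences from the "Hidden evolution" slide. The function names are assumptions made for this example, and the six-base sequences are far too short for a real analysis.

```python
import math

def p_distance(seq1, seq2):
    """Observed proportion of sites that differ between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(seq1, seq2) if x != y)
    return differences / len(seq1)

def jc_distance(p):
    """Jukes-Cantor corrected distance (expected substitutions per site)."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: the sequences are saturated")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Seq 1 and Seq 2 from the slide above.
p = p_distance("AGCGAG", "GCGGAC")
d = jc_distance(p)
print(f"observed p = {p:.3f}, JC-corrected d = {d:.3f}")
```

The corrected distance is larger than the observed one because some sites have changed more than once, and those extra changes are hidden in a direct comparison.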
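Kimura model
[Figure: the four bases A, C, G and T; transitions (A↔G, C↔T) occur at rate α and transversions at rate β]
α = transitions
β = transversions
The Kimura model has 2 parameters
\[
Q = \begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & - & \beta & \alpha & \beta \\
C & \beta & - & \beta & \alpha \\
G & \alpha & \beta & - & \beta \\
T & \beta & \alpha & \beta & -
\end{array}
\]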
The K2P model is
more realistic, but
still
1) it assumes that
all bases are equally
frequent (p=0.25)
2) unless modified it
assumes all sites can
change and that
they do so at the
same rate
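
As with the JC model, the K2P model leads to a distance correction, this time treating transitions and transversions separately: d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the observed proportions of transitional and transversional differences. The sketch below is a minimal illustration; the function name and the two 20-base sequences are made up for this example.

```python
import math

PURINES = set("AG")

def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    n = len(seq1)
    transitions = 0
    transversions = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    p_ts = transitions / n
    q_tv = transversions / n
    return -0.5 * math.log((1.0 - 2.0 * p_ts - q_tv) * math.sqrt(1.0 - 2.0 * q_tv))

# Hypothetical sequences: two transitions (A->G, C->T) and one transversion (G->T).
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "GTTTACGTACGTACGTACGT"
d_k2p = k2p_distance(seq_a, seq_b)
print(f"K2P distance = {d_k2p:.4f}")
```

The formula is undefined when the argument of the logarithm is not positive, which happens for very divergent sequences; a production implementation would need to handle that case.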
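The Hasegawa-Kishino-Yano model
\[
Q = \begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & - & \beta\pi_C & \alpha\pi_G & \beta\pi_T \\
C & \beta\pi_A & - & \beta\pi_G & \alpha\pi_T \\
G & \alpha\pi_A & \beta\pi_C & - & \beta\pi_T \\
T & \beta\pi_A & \alpha\pi_C & \beta\pi_G & -
\end{array}
\]
α = transitions, β = transversions, πA = frequency of A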
The HKY model
takes into account
variable base
frequencies, but still
1) unless modified it
assumes all sites
can change and
that they do so at
the same rate
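The GTR model
[Figure: the four bases, with frequencies πA, πC, πG and πT, joined by six reversible relative rates: a (A↔C), b (A↔G), c (A↔T), d (C↔G), e (C↔T), f (G↔T)]
\[
Q = \begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & -\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\
C & \mu a\pi_A & -\mu(a\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\
G & \mu b\pi_A & \mu d\pi_C & -\mu(b\pi_A + d\pi_C + f\pi_T) & \mu f\pi_T \\
T & \mu c\pi_A & \mu e\pi_C & \mu f\pi_G & -\mu(c\pi_A + e\pi_C + f\pi_G)
\end{array}
\]
μ = mean instantaneous substitution rate
a, b, c,... f = relative rate of substitution
πA = frequency of A
The product of these three terms (e.g. μaπC) is the rate parameter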

Almost all models used are special cases of
one model:
◦ The general time reversible model
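
As a rough illustration of the GTR rate matrix and of how a simpler model falls out as a special case, here is a minimal Python sketch. The function name `gtr_rate_matrix`, the dictionary layout and all numerical values are assumptions made for this example, not taken from any particular program.

```python
BASES = "ACGT"

def gtr_rate_matrix(rates, freqs, mu=1.0):
    """Build the GTR Q matrix as a nested dict q[from_base][to_base].

    rates: relative rates keyed by unordered base pair
           ('AC'=a, 'AG'=b, 'AT'=c, 'CG'=d, 'CT'=e, 'GT'=f)
    freqs: equilibrium base frequencies keyed by base
    mu:    mean instantaneous substitution rate
    """
    q = {x: {} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x != y:
                pair = "".join(sorted(x + y))
                q[x][y] = mu * rates[pair] * freqs[y]
        # Each diagonal entry makes its row sum to zero.
        q[x][x] = -sum(q[x][y] for y in BASES if y != x)
    return q

# Hypothetical parameter values, for illustration only.
rates = {"AC": 1.0, "AG": 4.0, "AT": 0.8, "CG": 1.2, "CT": 5.0, "GT": 1.0}
freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
q_gtr = gtr_rate_matrix(rates, freqs)

# Jukes-Cantor as a special case: all rates equal, all frequencies 0.25.
q_jc = gtr_rate_matrix({pair: 1.0 for pair in rates},
                       {base: 0.25 for base in BASES})

for x in BASES:
    print(x, [round(q_gtr[x][y], 3) for y in BASES])
```

Because the same relative rate is used in both directions, the matrix satisfies πi·qij = πj·qji for every pair of bases, which is what makes the model time reversible.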

The next three slides are from:
https://code.google.com/p/jmodeltest2/wiki/TheoreticalBackground

Hypotheses tested are: F = base frequencies; S =
substitution type; I = proportion of invariable
sites; G = gamma rates.
Model    Base frequencies    Substitution types
GTR      variable            6
TrN      variable            3
SYM      equal               6
HKY85    variable            2
F84      variable            2
K3ST     equal               3
K2P      equal               2
F81      variable            1 (single)
JC       equal               1 (single)

Model parameters can be:
◦ estimated from the data (using a likelihood
function)
◦ can be pre-set based upon assumptions about the
data (for example that for all sequences all sites
change at the same rate and all substitutions are
equally likely - e.g. the Jukes and Cantor Model)
◦ wherever possible avoid assumptions which are
violated by the data because they can lead to
incorrect trees

The most common additional parameters are:
◦ A correction for the proportion of sites which are
invariable (parameter I )
◦ A correction for variable site rates at those sites
which can change (parameter gamma, G )

All models can be supplemented with these
parameters (e.g. GTR+I+G, HKY+I+G )
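[Figure: distribution of rates across sites (x-axis: Rate, y-axis: Frequency), with the invariable sites marked; α = shape parameter of the gamma distribution]
Computational difficulties in using a
continuous distribution
Most programs use discrete categories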
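
One common way to discretize the gamma distribution is to cut it into k equally probable categories and represent each category by its mean rate. The sketch below does this in plain Python for a gamma distribution with mean 1 (shape and rate parameter both equal to α). The function names and the choice α = 0.5, k = 4 are assumptions made for this example; individual programs may discretize differently.

```python
import math

def reg_lower_gamma(a, x, max_terms=10000, eps=1e-14):
    """Regularized lower incomplete gamma function P(a, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < eps * total:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def gamma_quantile(prob, alpha):
    """Rate r such that P(rate <= r) = prob, for a mean-1 gamma with shape alpha."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(alpha, alpha * hi) < prob:
        hi *= 2.0
    for _ in range(200):  # plain bisection
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(alpha, alpha * mid) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def discrete_gamma_rates(alpha, k):
    """Mean rate of each of k equal-probability categories."""
    cuts = [gamma_quantile(i / k, alpha) for i in range(1, k)]
    # The integral of r*f(r) up to a cut point equals P(alpha + 1, alpha * cut).
    upper = [reg_lower_gamma(alpha + 1.0, alpha * c) for c in cuts] + [1.0]
    lower = [0.0] + upper[:-1]
    return [k * (u - l) for u, l in zip(upper, lower)]

category_rates = discrete_gamma_rates(alpha=0.5, k=4)
print([round(r, 4) for r in category_rates])
```

With a small α most sites fall in very slow categories and a few in a fast one; the category rates always average to 1, so the overall substitution rate is unchanged.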




The parameters I and G covary!
(I + G ) can be estimated, but the values of I
and G are not easily teased apart
Parameter G takes I into account, I not
needed
Usually though, a certain amount of sites
(estimated from data) are assumed invariant,
and rest (the varying sites) are allowed to
follow the rates drawn from the discrete
gamma distribution.

But the more parameters you estimate from the
data the more time needed for an analysis and
the more sampling error accumulates
◦ One might have a realistic model but large sampling
errors
◦ Realism comes at a cost in time and precision!
◦ Fewer parameters may give an inaccurate estimate,
but more parameters decrease the precision of the
estimate
◦ In general use the simplest model which fits the data

When models are nested
◦ Likelihood ratio test (LRT)
◦ Test statistic:
 -2*ln(likelihood of the simpler model / likelihood of the more complex model)
 Compared to a chi-square distribution with degrees of freedom equal to
the difference in the number of free parameters between the two models

When models are not nested
◦ Akaike Information Criterion (AIC)
 2k-2ln(likelihood), where k is the number of parameters
estimated in the model
 The best model has the lowest AIC
◦ Bayesian Information Criterion (BIC)
 Similar to AIC, but the penalty also depends on the sample size





Yang (1995) has shown that parameter estimates
are reasonably stable across tree topologies
provided trees are not “too wrong”.
Thus one can obtain a tree using a quick method,
such as neighbor-joining, and then estimate
parameters on that tree.
These parameters can then be used to calculate
the likelihood of the tree.
When the likelihood of the tree is calculated
under all the to-be-compared models, the model
giving the highest likelihood or the lowest AIC
value can be selected.
The final tree is then estimated using this model.


For both tests, one needs to compute the
likelihood of the trees under the models.
For now, assume we know the likelihood of
the models we want to compare.
LR = 2*(lnL1-lnL0)
◦ lnL1: alternative hypothesis (the more parameter-rich model)
◦ lnL0: null hypothesis (the less parameter-rich model)
The LRT statistic approximately follows a chi-square distribution
Degrees of freedom equal the number of
extra parameters in the more complex model



HKY85 -lnL = 1787.08
GTR -lnL = 1784.82
Then, LR = 2 (1787.08 - 1784.82) = 4.52
degrees of freedom = 4 (GTR adds 4
additional parameters to HKY85)
critical value (P = 0.05) = 9.49
Since 4.52 < 9.49, GTR does not fit significantly better!
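
The worked example above can be reproduced in a few lines of Python. The sketch uses the closed-form upper-tail probability of the chi-square distribution that holds for an even number of degrees of freedom, so no statistics library is needed; the function names are assumptions made for this example.

```python
import math

def lrt_statistic(lnl_null, lnl_alt):
    """LR = 2 * (lnL1 - lnL0), with lnL1 from the more parameter-rich model."""
    return 2.0 * (lnl_alt - lnl_null)

def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable; valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

lnl_hky85 = -1787.08   # null hypothesis (less parameter-rich)
lnl_gtr = -1784.82     # alternative hypothesis (more parameter-rich)

lr = lrt_statistic(lnl_hky85, lnl_gtr)
p_value = chi2_upper_tail_even_df(lr, df=4)
critical_value = 9.49  # chi-square, 4 df, P = 0.05 (from the slide)
gtr_significantly_better = lr > critical_value

print(f"LR = {lr:.2f}, p = {p_value:.3f}, reject HKY85: {gtr_significantly_better}")
```

The p-value is well above 0.05, which matches the conclusion that the extra parameters of GTR are not justified for this dataset.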

A measure of the goodness of fit of a model
◦ information lost when model M is used to
approximate the process of molecular evolution
◦ AIC is an estimate of the expected relative distance
between a fitted model, M, and the unknown true
mechanism that generated the data

AIC(M) = - 2*Log(Likelihood(M)) + 2*K(M)
◦ K(M) is number of estimable parameters of model M


Given a dataset, models can be ranked
according to their AIC
The model with the lowest AIC is selected


BIC also takes into account the sample size n
BIC(M) = - 2*Log(Likelihood(M)) + K(M)*Log(n)
◦ K(M) is number of estimable parameters of model M
and n is the number of characters
Kelchner & Thomas 2007, TREE 22:87-94
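
The two criteria can be computed directly from the formulas above. The sketch below reuses the HKY85 and GTR likelihoods from the LRT example. Two things are assumptions made for illustration: the parameter counts include only substitution-model parameters (branch lengths are the same for both models on a fixed tree, so they do not change the ranking), and the alignment length of 1000 characters is invented.

```python
import math

def aic(lnl, k):
    """AIC(M) = -2 * lnL + 2 * K."""
    return -2.0 * lnl + 2.0 * k

def bic(lnl, k, n):
    """BIC(M) = -2 * lnL + K * ln(n)."""
    return -2.0 * lnl + k * math.log(n)

# (lnL, K): HKY85 has 1 rate ratio + 3 free base frequencies,
# GTR has 5 free relative rates + 3 free base frequencies.
models = {
    "HKY85": (-1787.08, 4),
    "GTR": (-1784.82, 8),
}
n_characters = 1000  # hypothetical alignment length

aic_scores = {name: aic(lnl, k) for name, (lnl, k) in models.items()}
bic_scores = {name: bic(lnl, k, n_characters) for name, (lnl, k) in models.items()}

best_by_aic = min(aic_scores, key=aic_scores.get)
best_by_bic = min(bic_scores, key=bic_scores.get)

for name in models:
    print(f"{name}: AIC = {aic_scores[name]:.2f}, BIC = {bic_scores[name]:.2f}")
print("lowest AIC:", best_by_aic, "| lowest BIC:", best_by_bic)
```

Under these assumptions both criteria prefer HKY85, in agreement with the likelihood ratio test.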

Model jumping
◦ Allow the data to determine which model is
optimal during the analysis

Only available in MrBayes 3.2
[Figure: models JC, K2P and GTR]




A priori separation of characters into different
partitions
Each partition analyzed with a different model
In addition to allowing heterogeneity across
data subsets in overall rate and in substitution
model parameters, several programs also allow
the user to unlink topology and branch lengths
“Different data subsets can thus have
independent branch lengths or even different
topologies.” (Ronquist and Huelsenbeck,
2003:1573)



21 amino acids
Models are based largely
on empirical aa
replacement matrices
Examples: JTT, WAG,
MtREV, Blosum62


Parameters include topology and branch
lengths!
How to estimate values for those parameters?
◦ Distance methods
◦ Maximum likelihood methods
◦ Bayesian methods



Objective function (score) that quantifies how
well the data fit a tree
Used to evaluate and rank alternative trees
Two logical steps for phylogenetic methods
that rely on optimality criteria
◦ Definition of optimality criterion
◦ Maximization (or minimization) of criterion for
alternative trees for their evaluation and ranking