Statistical Thinking for Hypothesis Testing

Overview
• Fundamentals of statistical hypothesis testing:
• no-one likes a statistics lecture,
but there are some things you have to know
• Likelihood ratio tests in phylogenetics:
• to make inferences about processes of evolution
e.g. patterns of substitution, rates of substitution (Kevin’s section)
• for finding best models of evolution for inferring trees (Kevin)
• for detecting selection (Asif’s section)
• [advanced topic] for comparing trees (topologies),
to give measures of confidence in estimated trees
Likelihood and maximum likelihood
Recall that for model M, parameters θ and data D:
likelihood L(M, θ | D) = Pr(D | M, θ)
and that maximum likelihood inference consists of finding θ̂, the θ that makes
the likelihood as large as possible:
find θ̂ so that L(M, θ̂ | D) ≥ L(M, θ | D) for all other θ
i.e. find the values for parameters θ that make the probability of the data as big as
possible for the model being used — intuitively, it is clear that these are sensible
estimates of the model parameters
Likelihood and hypotheses
Now suppose we have some hypothesis ‘H’ regarding the model and parameters.
Similarly, the likelihood of the hypothesis is:
L(H | D) = Pr(D | H)
Perhaps the hypothesis fully defines the likelihood (no free parameters), or
perhaps there are some free parameters in the hypothesis — in which case we
again maximize the likelihood to find the best value under a hypothesis.
‘Fair coin’ hypothesis
We toss a coin 100 times, and observe 65 Heads and 35 Tails. Our hypothesis ‘H0’
is that each throw is independent, with probability 0.5 of giving Heads.
What is the likelihood of this hypothesis?
L(H0 | D) = Pr(D | H0) = 0.5^65 × 0.5^35 = 7.889 × 10^-31
or ln(L(H0)) = ln(7.889 × 10^-31) = -69.31
This is all very interesting, but what can we do with it?
‘Possibly unfair coin’ hypothesis
We toss a coin 100 times, and observe 65 Heads and 35 Tails. Our hypothesis ‘H1’
is that each throw is independent, with unknown probability p of giving Heads.
What is the likelihood of this hypothesis?
Now we have a free parameter p, the probability of getting Heads. The maximum
likelihood estimate p̂ is exactly the observed proportion of Heads, i.e. 65/100 = 0.65
L(H1 | D) = Pr(D | H1) = 0.65^65 × 0.35^35 = 7.616 × 10^-29
or ln(L(H1)) = ln(7.616 × 10^-29) = -64.74
Comparison of coin hypotheses
L(H0) = 7.889 × 10^-31; ln(L(H0)) = -69.31
L(H1) = 7.616 × 10^-29; ln(L(H1)) = -64.74
L(H1)/L(H0) = 7.616 × 10^-29 / 7.889 × 10^-31 = 96.55
Evidently H1 is better than H0, but is it ‘significantly’ better?
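These numbers are easy to check; a few lines of Python reproduce the two likelihoods and their ratio:

```python
import math

# Likelihoods of the two coin hypotheses for 65 Heads and 35 Tails
L0 = 0.5**65 * 0.5**35    # H0: fair coin
L1 = 0.65**65 * 0.35**35  # H1: p set to its MLE, 0.65

print(L0)       # ~7.889e-31
print(L1)       # ~7.616e-29
print(L1 / L0)  # ~96.5
print(math.log(L0), math.log(L1))  # ~-69.31, ~-64.74
```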
Nested hypotheses
Some terminology: hypothesis H0 is ‘nested’ within hypothesis H1 if forcing a
particular choice of some of the parameters of H1 makes it the same as H0.
For coin tossing, forcing the unknown probability of Heads in H1 to equal 0.5 gives
us exactly H0. H0 is nested in H1.
Many sequence substitution models are nested in others: the more-complicated model H1 ‘contains’ the simpler H0, and must have more parameters to estimate.
Comparison of general hypotheses (I)
Traditional statistical hypothesis testing compares a ‘null hypothesis’ H0 with an
alternative hypothesis H1. Usually H0 is nested in H1, and we will treat H0 as valid
unless the evidence in favour of H1 is much stronger.
The evidence we use for H0 and H1 is their likelihoods, L(H0) and L(H1).
The relative evidence is the ratio of likelihoods:
Λ = L(H1)/L(H0)
or twice the logarithm of this:
2∆ = 2 ln(Λ) = 2[ln(L(H1)) – ln(L(H0))]
Large values of 2∆ mean that the evidence for H1 is greater than for H0.
(Nested models ensure that Λ ≥ 1, i.e. that 2∆ ≥ 0.)
Comparison of general hypotheses (II)
How large a value of 2∆ is big enough? Traditionally, we say that if 2∆ is bigger
than we would expect by chance in 95% (or 99%, or 99.9%...) of cases when H0 is
correct, then we favour H1 over H0.
A useful theorem for doing the necessary calculations:
Suppose H0 is nested in H1, and H0 has d fewer free parameters than H1.
Then, if H0 is correct, 2∆ has a chi-squared distribution with d degrees of freedom:
2∆ = 2[ln(L(H1)) – ln(L(H0))] ~ χ²_d
If 2∆ is greater than the 95% (or 99%, or 99.9%...) point of the χ²_d distribution,
then we reject H0 in favour of H1.
If 2∆ is ‘reasonable’, i.e. less than this, then we have no evidence to reject H0.
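In practice the χ² thresholds come from software rather than printed tables; a minimal sketch using SciPy (assumed available):

```python
from scipy.stats import chi2

# 95% critical value of the chi-squared distribution with 1 degree of freedom
print(round(chi2.ppf(0.95, df=1), 2))  # 3.84

def lrt_reject(two_delta, d, level=0.95):
    """True if 2∆ exceeds the level-point of chi-squared with d degrees
    of freedom, i.e. H0 is rejected in favour of H1."""
    return two_delta > chi2.ppf(level, df=d)
```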
Comparison of general hypotheses (III)
These statistical tests are likelihood ratio tests (LRTs).
This is a very powerful class of statistical hypothesis tests, with very broad
applicability.
Online χ² P-value calculators:
http://www.danielsoper.com/statcalc3/calc.aspx?id=11
http://graphpad.com/quickcalcs/PValue1.cfm
[Figure: scatterplot of C3 vs C1 (axes C1, 0 to 20; C3, -5 to 25), repeated across several slides]
Comparison of coin hypotheses revisited
L(H0) = 7.889 × 10^-31; ln(L(H0)) = -69.31
L(H1) = 7.616 × 10^-29; ln(L(H1)) = -64.74
2∆ = 2[ln(L(H1)) – ln(L(H0))] = 2 x [-64.74 – -69.31] = 2 x [-64.74 + 69.31]
= 9.14
H0 and H1 differ by 1 free parameter (the probability of Heads),
so the degrees of freedom d = 1.
We compare 2∆ = 9.14 with the χ²_1 distribution, and observe a P-value < 0.005
— we conclude that H1 (probability of Heads not necessarily equal to 0.5)
is preferred to H0 (fair coin).
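The same test in code, with the P-value taken from SciPy's upper-tail (survival) function; a sketch assuming SciPy is available:

```python
import math
from scipy.stats import chi2

ln_L0 = 65 * math.log(0.5) + 35 * math.log(0.5)    # ~-69.31
ln_L1 = 65 * math.log(0.65) + 35 * math.log(0.35)  # ~-64.74

two_delta = 2 * (ln_L1 - ln_L0)
p_value = chi2.sf(two_delta, df=1)  # upper-tail probability under H0

print(round(two_delta, 2))  # 9.14
print(p_value < 0.005)      # True: reject the fair-coin hypothesis
```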
Fair dice?
Suppose you roll a die 100 times, and observe the following:
score:   1   2   3   4   5   6
# obs:  15  14  20  13  18  20
L(H0) = (1/6)^15 × (1/6)^14 × (1/6)^20 × (1/6)^13 × (1/6)^18 × (1/6)^20 = 1.531 × 10^-78
L(H1) = (15/100)^15 × (14/100)^14 × (20/100)^20 × (13/100)^13 × (18/100)^18 × (20/100)^20 = 6.376 × 10^-78
2∆ = 2(ln(L(H1)) – ln(L(H0))) = 2 × (-177.75 + 179.18) ≈ 2.85
Comparing this with a χ²_5 distribution, we find the P-value is between 0.7 and 0.8
No statistical evidence to reject H0 in favour of H1
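The dice test follows the same recipe (SciPy assumed for the χ²_5 tail probability):

```python
import math
from scipy.stats import chi2

obs = [15, 14, 20, 13, 18, 20]  # counts for scores 1..6
n = sum(obs)                    # 100 rolls

ln_L0 = sum(c * math.log(1 / 6) for c in obs)  # H0: fair die
ln_L1 = sum(c * math.log(c / n) for c in obs)  # H1: probabilities at their MLEs

two_delta = 2 * (ln_L1 - ln_L0)
p_value = chi2.sf(two_delta, df=5)  # 6 categories, so 5 free parameters

print(round(two_delta, 2))  # 2.85
print(0.7 < p_value < 0.8)  # True: no evidence against H0
```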
Some useful hypothesis tests in phylogenetics:
Comparison of patterns of DNA substitution (I)
e.g.
Jukes-Cantor model vs. Kimura 2-parameter model
J-C: no rate parameters (same rate for all substitutions)
K2P: 1 rate parameter (ratio of transition:transversion rates)
H0: unknown tree relating the sequences; J-C model of substitutions
(parameters: tree shape, branch lengths)
H1: unknown tree relating the sequences; K2P model of substitutions
(parameters: tree shape, branch lengths, ts:tv rate ratio)
The difference in parameters is just the transition:transversion rate ratio, a single
number. Fixing it equal to 1 in K2P gives us the J-C model back. So the models
are nested.
We can perform a hypothesis test between the models by comparing 2∆ with a χ²_1 distribution.
Some useful hypothesis tests in phylogenetics:
Comparison of patterns of DNA substitution (II)
e.g.
Kimura 2-parameter model vs. Hasegawa, Kishino, Yano model
K2P: 1 rate parameter (ratio of transition:transversion rates)
HKY: 4 rate parameters (ratio of transition:transversion rates
AND 3 base frequencies free to vary)
H0: unknown tree relating the sequences; K2P model of substitutions
(parameters: tree shape, branch lengths, ts:tv rate ratio)
H1: unknown tree relating the sequences; HKY model of substitutions
(parameters: tree shape, branch lengths, ts:tv rate ratio, 3 base frequencies)
The difference in parameters is just the 3 base frequencies. Fixing them equal to 1/4
in HKY gives us back the K2P model. So the models are nested.
We can perform a hypothesis test between the models by comparing 2∆ with a χ²_3 distribution.
Some useful hypothesis tests in phylogenetics:
Comparison of patterns of DNA substitution (III)
e.g.
Hasegawa, Kishino, Yano model vs. general time reversible (GTR) model
HKY: 4 parameters (ratio of transition:transversion rates
AND 3 base frequencies free to vary)
GTR: 8 parameters (5 relative rates of change AND 3 base frequencies free to vary)
H0: unknown tree relating the sequences; HKY model of substitutions
(parameters: tree shape, branch lengths, ts:tv rate ratio, 3 base frequencies)
H1: unknown tree relating the sequences; GTR model of substitutions
(parameters: tree shape, branch lengths, 5 relative rates of substitution, 3 base frequencies)
The difference in parameters is the 4 relative rates of change. Fixing them in
appropriate ratios in GTR gives us back HKY. So the models are nested.
We can perform a hypothesis test between the models by comparing 2∆ with a χ²_4 distribution.
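A small helper makes the degrees-of-freedom bookkeeping for this model hierarchy explicit. The parameter counts are those given in the slides; the log-likelihood values in the usage example are purely illustrative, and SciPy is assumed:

```python
from scipy.stats import chi2

# Free substitution-model parameters beyond tree shape and branch lengths,
# as counted in the slides
PARAMS = {"JC": 0, "K2P": 1, "HKY": 4, "GTR": 8}

def lrt_p_value(ln_L0, ln_L1, simpler, richer):
    """P-value for an LRT between two nested models from this hierarchy."""
    d = PARAMS[richer] - PARAMS[simpler]
    two_delta = 2 * (ln_L1 - ln_L0)
    return chi2.sf(two_delta, df=d)

# Hypothetical log-likelihoods: is HKY significantly better than K2P?
p = lrt_p_value(-1234.5, -1230.0, "K2P", "HKY")
print(p < 0.05)  # 2∆ = 9.0 on 3 degrees of freedom
```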
• Studying the models themselves helps us
learn about processes of evolution
• Choosing a good model will also help us make reliable
estimates of phylogenetic trees
• Likelihoods give an intuitive measure of how well models
fit the data we have observed
• so comparing likelihoods is a good way to compare different models
• Likelihood ratio tests (LRTs) enable us to perform robust
statistical tests of which models are better
• test statistic is 2∆ = 2 x log of ratio of likelihoods
• test distribution is a χ2 distribution
• degrees of freedom for the test depend on the difference in
the number of parameters estimated in the models compared
Hypothesis testing in phylogenetics
Further topics: Models that are not nested
We can use the Akaike Information Criterion (AIC):
AIC(model) = 2k – 2[ln(L(model))]
and the model with smallest AIC value is considered to be best
k is the number of parameters estimated in the model
AICc(model) = 2k – 2[ln(L(model))] + 2k(k + 1)/(n – k – 1)
(n is the sample size)
(correction for small sample size or many parameters — good!*)
BIC(model) = k ln(n) – 2[ln(L(model))]
(Bayesian approach; greater penalty for more parameters — bad?*)
*see Burnham & Anderson (2002) Model Selection and Multi-Model Inference. Springer, New York (http://tinyurl.com/3ef8sn7)
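These criteria are simple to compute once the maximized log-likelihoods are known; a sketch in which the log-likelihood, k and n values are illustrative only:

```python
import math

def aic(ln_L, k):
    return 2 * k - 2 * ln_L

def aicc(ln_L, k, n):
    # small-sample correction (Burnham & Anderson 2002)
    return aic(ln_L, k) + 2 * k * (k + 1) / (n - k - 1)

def bic(ln_L, k, n):
    return k * math.log(n) - 2 * ln_L

# Illustrative comparison of two non-nested models on n = 100 sites:
# model A: ln L = -64.74 with k = 1; model B: ln L = -63.90 with k = 3
print(aic(-64.74, 1) < aic(-63.90, 3))  # True: A has the smaller AIC, so A is preferred
```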