
Multiple testing in high-throughput biology
Petter Mostad
Overview
• Review of hypothesis testing
• Multiple testing: What is the issue?
• Types of errors to control
• Adjusting p-values
• A Bayesian alternative
Statistical hypothesis testing
• Fix a null hypothesis H0, and a model for what the data would look like if H0 is true.
• Find a test statistic (something computed from the data) which will be more extreme when H0 is not true.
• Compute the test statistic from the data.
• Compute its p-value: the probability of observing this or a more extreme test statistic if H0 is true.
• Reject H0 if p is very small (e.g., less than 0.05).
Example: testing the mean of a sample
• Assume you have observed values $x_1, x_2, \dots, x_n$.
• Assume they come from a normal distribution with expectation $\mu$ and variance $\sigma^2$. H0: $\mu = \mu_0$.
• Test statistic:
$$t = \frac{\bar{x} - \mu_0}{\sqrt{s_x^2 / n}}$$
• Under the null hypothesis, $t$ has a $t_{n-1}$-distribution, i.e., a t-distribution with $n-1$ degrees of freedom.
• The p-value is the probability of values more extreme than $t$ in this distribution.
• The p-value is below 0.05 if and only if $\mu_0$ is outside the 95% confidence interval for $\mu$.
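A minimal sketch of this test in R (the sample values and mu0 below are invented for illustration):

  x <- c(4.9, 5.6, 5.1, 4.7, 5.3, 5.8, 5.0, 5.4)  # observed values x1,...,xn
  mu0 <- 5                                        # hypothesized mean under H0
  n <- length(x)
  t.stat <- (mean(x) - mu0) / sqrt(var(x) / n)    # the t statistic above
  p <- 2 * pt(-abs(t.stat), df = n - 1)           # two-sided p-value from t(n-1)
  t.test(x, mu = mu0)                             # built-in: same t, p, and CI for mu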
Example: Non-parametric tests
• Tests where the null hypothesis does not specify a particular probability model for the data.
• Example: Are the values in two groups different? To find out, rank all the values jointly.
• Wilcoxon rank sum test statistic: add the ranks of all items in one group, and compare to the sum of the other ranks.
• Compare with the rank sums one would get if one set of items were selected at random. This gives a p-value.
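A sketch of the rank-sum computation in R (the two groups are invented data):

  g1 <- c(1.2, 2.3, 1.9, 3.1, 2.8)
  g2 <- c(2.9, 3.8, 3.3, 4.1, 3.6)
  r <- rank(c(g1, g2))    # rank all values jointly
  W <- sum(r[1:5])        # rank sum of group 1
  wilcox.test(g1, g2)     # same test; R reports the rank sum minus its smallest possible value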
Example: Finding differentially expressed genes
• Assume we want to find differentially expressed genes using 4 cDNA arrays with 6000 genes, comparing "treated" samples with "untreated" ones.
• Assume there are NO differentially expressed genes, only noise, and that all 6000 × 4 values are normally distributed.
• Of the 6000 t-test statistics computed, roughly 300 (5% of 6000) will have values "more extreme" than the 0.05 cutoff, just by chance. These are false positives.
• This problem is a consequence of the hypothesis testing approach and multiple testing, not of the particular method chosen.
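This calculation is easy to check by simulation; a sketch in R, assuming a one-sample t-test of mean zero for each gene on pure noise:

  set.seed(1)
  p <- replicate(6000, t.test(rnorm(4))$p.value)  # 6000 genes, 4 arrays of noise
  sum(p < 0.05)                                   # close to 0.05 * 6000 = 300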
Multiple testing in general
• Multiple testing is in fact always a problem: when testing at a significance level of 5%, 1 in 20 true null hypotheses will be falsely rejected, whether the tests occur in the same or in different research projects.
• The problem seems more acute when a large number of formalized statistical tests are performed, as in modern high-throughput biology.
• What to do?
Ways to deal with the problem
• Adjusting p-values:
  – Adjusting the levels of p-values to account for multiple tests.
  – Changing the interpretation of p-values.
• Reformulating statistics away from hypothesis testing, towards (Bayesian) model choices.
Types of errors

                     # not rejected   # rejected   SUM
  # true null hyp          U              V         M0
  # false null hyp         T              S         M1
  SUM                      W              R         M

V counts the Type 1 errors (false positives); T counts the Type 2 errors (false negatives).
Traditionally: we want a procedure controlling the size of V, while at the same time keeping T as small as possible.
Some error rates:
• Per-comparison error rate: $\mathrm{PCER} = E(V)/M$
• Family-wise error rate: $\mathrm{FWER} = \Pr(V > 0)$
• False discovery rate: $\mathrm{FDR} = E\left[ V/R \mid R > 0 \right] \Pr(R > 0)$
• Positive false discovery rate: $\mathrm{pFDR} = E\left[ V/R \mid R > 0 \right]$
Types of control
• Exact control: Controlling error rates
conditionally on the set of true null
hypotheses.
• Weak control: Controlling error rates under
the condition that all null hypotheses are
true.
• Strong control: Controlling error rates no
matter what hypotheses are true or false.
Controlling FWER with Bonferroni
• Adjust p-values $p_1, p_2, \dots, p_M$ using the Bonferroni method:
$$\tilde{p}_i = \min(M p_i,\ 1)$$
• Provides strong control of FWER
• Very conservative, hard to get any
hypotheses rejected.
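In R this is one line with the base function p.adjust (the raw p-values here are invented):

  p <- c(0.0001, 0.004, 0.019, 0.095, 0.43)
  pmin(length(p) * p, 1)               # the formula above
  p.adjust(p, method = "bonferroni")   # identical result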
Alternative: Sidak adjustment
• Given p-values $p_1, p_2, \dots, p_M$, the adjusted p-values are now
$$\tilde{p}_i = 1 - (1 - p_i)^M$$
• Provides strong control of FWER if the test
statistics are independent (rarely the case)
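Base R's p.adjust has no Sidak option, so a sketch applies the formula directly:

  p <- c(0.0001, 0.004, 0.019, 0.095, 0.43)   # invented raw p-values
  1 - (1 - p)^length(p)                       # Sidak-adjusted p-values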
Holm step-down adjusted p-values
• Order the tests so that the raw p-values are ordered: $p_1 \le p_2 \le \dots \le p_M$
• Compute adjusted p-values:
$$\tilde{p}_i = \max_{k=1,\dots,i} \min\{(M - k + 1)\, p_k,\ 1\}$$
• Provides strong control of FWER
• Somewhat less conservative than Bonferroni
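A sketch of the Holm formula in R, checked against the built-in method (invented p-values, already sorted):

  p <- c(0.0001, 0.004, 0.019, 0.095, 0.43)
  M <- length(p)
  cummax(pmin((M - seq_len(M) + 1) * p, 1))  # max over k <= i of min((M-k+1) p_k, 1)
  p.adjust(p, method = "holm")               # same result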
Permutation testing
• Given items divided into two or more groups.
• Compute a test statistic capturing the way in which your data seem to be extreme.
• Compare with the test statistics computed when the items are permuted. This gives a p-value.
• The Wilcoxon rank sum test can be seen as a permutation test.
• Can be a good way to find statistically significant differences without using parametric assumptions.
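A minimal permutation test in R, using the difference in group means as the statistic (data invented):

  set.seed(1)
  x <- c(rnorm(8, mean = 1), rnorm(8))         # 8 "treated" and 8 "untreated" values
  lab <- rep(1:2, each = 8)
  stat <- function(l) mean(x[l == 1]) - mean(x[l == 2])
  obs <- stat(lab)
  perm <- replicate(10000, stat(sample(lab)))  # statistic under random relabelings
  mean(abs(perm) >= abs(obs))                  # two-sided permutation p-value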
Combining permutation testing with adjusting p-values
• Given a data matrix, with some columns corresponding to "treated" samples, some to "untreated".
• Permute columns to get permutation-derived p-values.
• Adjust these, using for example the Holm step-down method, as sketched below.
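A sketch of the combination, assuming a small invented matrix X with 4 "treated" and 4 "untreated" columns:

  set.seed(1)
  X <- matrix(rnorm(100 * 8), nrow = 100)      # 100 genes x 8 samples, pure noise here
  perm.p <- apply(X, 1, function(x) {
    obs <- mean(x[1:4]) - mean(x[5:8])
    perm <- replicate(1000, { s <- sample(8, 4); mean(x[s]) - mean(x[-s]) })
    mean(abs(perm) >= abs(obs))                # permutation-derived p-value per gene
  })
  p.adjust(perm.p, method = "holm")            # Holm step-down adjustment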
Correcting for multiple testing with permutations (Dudoit / Speed)
• Assumes treatment and control samples have been
compared to a common reference sample.
• Use the test statistic
$$t = \frac{\bar{m}_2 - \bar{m}_1}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$
• By permuting data from treated and control
samples we can approximate the null distribution
of the test statistic.
• Use permutations to address the multiple testing problem, as described below.
Adjusting p-values
• Order the genes so that the $|t_j|$ values are non-increasing.
• Compute $|t_j^{(b)}|$ values by permutations ($b = 1, \dots, B$). The unadjusted p-value $p_j^*$ is the proportion of these that exceed $|t_j|$.
• For each permutation $b$, set $u_j^{(b)}$ to the non-increasing envelope $u_j^{(b)} = \max_{k \ge j} |t_k^{(b)}|$.
• Set $\tilde{p}_j^*$ to the proportion of the $u_j^{(b)}$ that exceed $|t_j|$.
• Make the $\tilde{p}_j^*$ non-decreasing by possibly increasing values.
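A sketch of this step-down (maxT) procedure in R; t.obs holds the observed |t|-values and t.perm is an M x B matrix of permutation |t|-values, both assumed already computed:

  maxT.adjust <- function(t.obs, t.perm) {
    o <- order(t.obs, decreasing = TRUE)           # sort genes by non-increasing |t_j|
    t.sorted <- t.obs[o]
    u <- apply(t.perm[o, , drop = FALSE], 2,
               function(tb) rev(cummax(rev(tb))))  # envelope u_j^(b) = max over k >= j
    p.adj <- rowMeans(u >= t.sorted)               # proportion of envelopes exceeding |t_j|
    p.adj <- cummax(p.adj)                         # enforce non-decreasing p-values
    p.adj[order(o)]                                # back to the original gene order
  }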
QQ-plots
• We may also compute t-values by computing differences within groups instead of between groups.
• Find extreme quantiles of the null distribution, and compare with the first t-values.
• Compare null t-values, or the normal distribution, with the first t-values in a QQ-plot.
• May compute single-gene p-values with quantiles in the null distribution.
Controlling FDR
• Start with ordering the tests so that the raw p-values are ordered: $p_1 \le p_2 \le \dots \le p_M$
• Adjusted p-values can then be computed:
$$\tilde{p}_i = \min_{k=i,\dots,M} \min\left\{\frac{M}{k}\, p_k,\ 1\right\}$$
• This can be interpreted as a step-up procedure, starting with
investigating the largest p-value.
• Provides strong control of FDR under independence of the test statistics, and under certain more general conditions.
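In R, the step-up formula and the built-in equivalent (invented, sorted p-values):

  p <- c(0.0001, 0.004, 0.019, 0.095, 0.43)
  M <- length(p)
  rev(cummin(rev(pmin(M / seq_len(M) * p, 1))))  # min over k >= i of (M/k) p_k
  p.adjust(p, method = "BH")                     # "fdr" is an alias for "BH"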
Interpreting FDR-adjusted p-values
• Remember that FDR-adjusted p-values must be interpreted differently:
• If you select all genes with FDR-adjusted p-values less than 0.05, you expect a proportion 0.05 of these to be false discoveries.
• In contrast, if you select all genes with Holm-adjusted p-values less than 0.05, the probability of seeing any false positive gene at all is about 0.05.
Multiple testing using BioConductor software
• In the limma package, the topTable function can be used to adjust p-values.
• It supports a number of different procedures, for example "bonferroni", "holm", and "fdr".
• The package multtest can be used to perform permutation-based computations.
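A sketch of typical usage; the objects exprs.matrix, design, and groups are placeholders for your own data:

  library(limma)
  fit <- eBayes(lmFit(exprs.matrix, design))       # linear model fitted per gene
  topTable(fit, coef = 2, adjust.method = "holm")  # or "bonferroni", "fdr"

  library(multtest)                                # permutation-based adjustments,
  res <- mt.maxT(exprs.matrix, classlabel = groups, B = 10000)  # e.g. step-down maxT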
Criticism of the FDR approach
• It is "possible to cheat":
  – You have a number of hypotheses you want to reject, but the p-values are not quite good enough.
  – Add to your hypotheses a number of null hypotheses known to be false, which have low p-values.
  – The number of rejections will rise, but not the number of false rejections, so your FDR improves, and you "prove" the hypotheses you care about.
Criticism of hypothesis testing in general
• Assume a pharmaceutical company wants to prove a new drug is effective.
• If they compute new p-values continuously, as new data come in, they can just stop whenever the p-value dips below 0.05.
• This gives an "unfair advantage" compared to testing only after the end of the entire study, as they perform multiple tests.
• "Solution": decide before the study at what times to compute p-values, and adjust them accordingly.
• Result: two companies can have exactly identical data, but one gets its drug approved, the other not.
Example: A Bayesian approach
(Scott and Berger, 2003)
• Objective: Given a large number of values
of normally distributed noise with
expectation zero, and some values which
deviate substantially more from zero, how
to find these deviating values, limiting
errors of types 1 and 2.
• Crucial ingredient: Use data to estimate the
probability that values are deviating.
Hierarchical model used:
$$X_i \mid \mu_i, \sigma^2, \gamma_i \sim N(\gamma_i \mu_i, \sigma^2)$$
$$\mu_i \mid \tau^2 \sim N(0, \tau^2)$$
$$\pi(\sigma^2, \tau^2) \propto (\sigma^2 + \tau^2)^{-2}$$
$$\gamma_i \mid p \sim \mathrm{Ber}(1 - p)$$
$$p \sim \mathrm{Unif}(0, 1)$$
• The model specifies "noise values", for which $\gamma_i = 0$, and "signal values", for which $\gamma_i = 1$.
• We would like to fit the model to data, and thus find the posterior probability for each $\gamma_i$ to be equal to 1.
Example computations:
• We assume we have 10 "signal observations": -8.48, -5.43, -4.81, -2.64, 2.40, 3.32, 4.07, 4.81, 5.81, 6.24.
• We have n = 10, 50, 500, or 5000 "noise observations", normally distributed N(0,1).
• We find posterior probabilities for each observation to be noise or signal: can we fish out the signals from the noise?
Results
Posterior probabilities of being a signal for the central seven "signal" observations, together with the number of "noise" observations given posterior probability above 0.6 (false positives):

  n      -5.4   -4.8   -2.6   -2.4   3.3    4.1    4.81   #(π_i > .6)
  10     1      1      .94    .89    .99    1      1      1
  50     1      1      .71    .59    .94    1      1      0
  500    1      1      .26    .17    .67    .96    1      1
  5000   1.0    .98    .03    .02    .16    .67    .98    2

Note: The 3 most extreme signal observations always have probability 1 of being signals.
Note: The number of "false positives" does not change much with the amount of noise: adjustment for multiple testing is automatic!
Note: The most central observations are "drowned out" when the noise increases.
Multiple testing for microarray data in practice:
• Often too few replicates (2 or 3) of each sample are used to reliably find differentially expressed genes.
• Instead of finding a definite list of differentially expressed genes, the output may be a ranked list, ordering genes according to how likely they are to be differentially expressed. The top of the list is chosen for further validation.
• Combination with other sources of information may be used to find interesting genes.