
Hypothesis Testing and Multiple Testing
Wolfgang Huber
There are four types of result:
True positives
False positives
True negatives
False negatives
Hypothesis testing is much more focussed on avoiding false positives than false
negatives.
Results range from biased to accurate and from dispersed to precise.
There is an asymmetry between verification and falsifiability – one contradicting
result can disprove a theory but any number of positive results can’t prove it.
Hypothesis testing has 4 steps:
set up a model of reality
do an experiment and collect data
calculate the probability of the data given the model
make a decision
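The four steps can be sketched with a coin-tossing example. This is a minimal illustration in Python (standard library only), not part of the original talk: the data (60 heads in 100 tosses), the function name binomial_two_sided_p and the 0.05 significance level are all assumptions chosen for the example.

```python
from math import comb

def binomial_two_sided_p(heads: int, tosses: int, p_null: float = 0.5) -> float:
    """Probability, under the null model, of a head count at least as far
    from the expected count as the one observed (two-sided p-value)."""
    expected = tosses * p_null
    observed_dev = abs(heads - expected)
    p_value = 0.0
    for k in range(tosses + 1):
        if abs(k - expected) >= observed_dev:
            p_value += comb(tosses, k) * p_null**k * (1 - p_null) ** (tosses - k)
    return min(p_value, 1.0)  # guard against floating-point overshoot

# Step 1: model of reality (null hypothesis): the coin is fair.
# Step 2: experiment: 100 tosses, 60 heads (made-up data).
# Step 3: probability of data at least this extreme, given the model.
p = binomial_two_sided_p(heads=60, tosses=100)
# Step 4: decision at a significance level chosen in advance.
alpha = 0.05
reject_null = p < alpha
print(f"p = {p:.4f}, reject null: {reject_null}")
```

The "distance from the expected count" definition of extremeness used here is symmetric only because the null probability is 0.5; other nulls would need a more careful two-sided definition.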
Two groups “are exactly the same” is not a workable null hypothesis – under the null
they are only expected to be approximately the same.
The test statistic needs to be sensitive to interesting deviations from the null
hypothesis.
False positives = type I errors, which occur with probability α
False negatives = type II errors, which occur with probability β
One sample t-test – compare the mean of a sample to a fixed value. Uses the t
distribution with n − 1 degrees of freedom, where n is the sample size.
Two sample t-test – do two samples have the same mean.
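In R:
t.test(x, mu = m) – one group, where m is the fixed value (mu must be named,
otherwise the second argument is taken as a second sample)
t.test(x, y) – two groups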
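To show what sits behind these calls, the test statistics can be computed by hand. The following is a sketch in Python (standard library only) with hypothetical function names and made-up data. The two-sample version assumes the unequal-variance (Welch) form, which is what R's t.test(x, y) uses by default. Turning t and the degrees of freedom into a p-value needs the t distribution, which the standard library does not provide, so that step is left out.

```python
from math import sqrt
from statistics import mean, variance

def one_sample_t(x, mu0):
    """t statistic and degrees of freedom for H0: mean(x) == mu0."""
    n = len(x)
    t = (mean(x) - mu0) / sqrt(variance(x) / n)
    return t, n - 1

def welch_t(x, y):
    """Welch t statistic and Satterthwaite degrees of freedom
    for H0: mean(x) == mean(y), without assuming equal variances."""
    vx = variance(x) / len(x)  # squared standard error of mean(x)
    vy = variance(y) / len(y)  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx**2 / (len(x) - 1) + vy**2 / (len(y) - 1))
    return t, df

# Made-up measurements for illustration only.
x = [5.1, 4.9, 5.6, 5.8, 6.0]
y = [4.2, 4.8, 4.5, 5.0, 4.4]
print(one_sample_t(x, mu0=5.0))
print(welch_t(x, y))
```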
The p-value is the probability of seeing data at least as extreme as those observed
if the null hypothesis is true.
It is not the probability that the null hypothesis is true.
The t-test assumes:
independent observations
normal distribution
Deviations from normality, esp. a wider distribution with heavier tails, lead to reduced
power.
If the t-test is no longer powerful enough use a Wilcoxon test or permutation test.
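A permutation test can be sketched in a few lines. The version below is a minimal Python illustration (standard library only), assuming the absolute difference in means as the test statistic and random sampling of permutations rather than full enumeration. The function name, the number of permutations and the seed are hypothetical choices.

```python
import random
from statistics import mean

def permutation_test(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        # Under the null the group labels are exchangeable, so shuffle them.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(x)]) - mean(pooled[len(x):]))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return (hits + 1) / (n_perm + 1)

# Made-up data: two clearly separated groups.
print(permutation_test([1, 2, 3, 4, 5, 6, 7, 8], [11, 12, 13, 14, 15, 16, 17, 18]))
```

Because it relies only on exchangeability of the labels under the null, this approach does not need the normality assumption, although it still needs independent observations.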
If the data are dependent (they are correlated or show batch effects) the p-values will
be wrong.
If the data are normally, uniformly or gamma distributed, the distribution of p-values
under the null is still approximately uniform – violating the normality assumption in
these cases will have no major effect.
Correlated variables → lots of type I errors
Multiple Testing
To prove something we reject the null hypothesis.
P-values are used for data integration in genomics. Under the null hypothesis they are
uniformly distributed – for each test performed at α = 0.05 there is a 5% chance of a
false positive.
Multiple testing is very common in genomics.
More tests → more false positives → the chance of at least one false positive soon
approaches 100%.
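How quickly this happens can be worked out directly. As a sketch, and assuming the tests are independent and every null hypothesis is true, the chance of at least one false positive in m tests at level α is 1 − (1 − α)^m. The function name below is hypothetical.

```python
def family_wise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false positive in m independent tests
    at level alpha, when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 10, 100, 1000):
    print(m, round(family_wise_error_rate(0.05, m), 4))
```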
Adding a prior gives context to new data. Informative priors can improve decisions.
FPR – false positive rate
FDR – false discovery rate
These are not the same – FPR is the fraction of false positives amongst all genes,
FDR is the fraction amongst hits.
e.g. with 20,000 genes, 100 hits, 10 of which are wrong.
FPR = 0.05%
FDR = 10%
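The arithmetic behind these two figures, as a short Python sketch. It follows the definition used in these notes, where the FPR divides by all genes tested. A stricter definition divides by the number of genes for which the null is true, which is not known in this example. Variable names are illustrative.

```python
n_genes = 20_000      # tests performed
n_hits = 100          # genes called significant
n_false_hits = 10     # hits that are actually false positives

fpr = n_false_hits / n_genes   # false positives amongst all genes
fdr = n_false_hits / n_hits    # false positives amongst the hits

print(f"FPR = {fpr:.2%}, FDR = {fdr:.0%}")  # prints: FPR = 0.05%, FDR = 10%
```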
Bonferroni Correction
If m tests are performed, multiply each p-value by m (capping the result at 1) and see
which p-values remain below α. Equivalently, compare each raw p-value with α / m.
This threshold can be hard to reach.
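The correction itself is a one-liner. Below is a minimal Python sketch with a hypothetical function name and made-up p-values; adjusted values are capped at 1 so that they remain valid probabilities.

```python
def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted p-values and which tests stay below alpha."""
    m = len(p_values)
    adjusted = [min(p * m, 1.0) for p in p_values]
    rejected = [p_adj < alpha for p_adj in adjusted]
    return adjusted, rejected

# Made-up p-values for illustration.
p_values = [0.001, 0.012, 0.03, 0.2]
adjusted, rejected = bonferroni(p_values)
print(adjusted, rejected)
```

With 20,000 genes the same rule would require a raw p-value below 0.05 / 20,000, which illustrates why the threshold is hard to reach.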
Plotting P-values
Controlling the FDR keeps the proportion of mistakes amongst the hits bounded.
Plotting the p-values should show a peak on the left if the null-hypothesis is false –
this shows the effect of the uniform distribution vs the alternative distribution.
Observed p-values are a mixture of samples from a uniform distribution and one or
more distributions concentrated at 0.
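This mixture can be simulated to see the expected shape of the histogram. The sketch below (Python, standard library only) is purely illustrative: the 9,000 / 1,000 split between null and non-null tests and the Beta(0.5, 10) distribution standing in for "concentrated at 0" are assumptions, not values from the talk.

```python
import random

rng = random.Random(1)
n_null, n_alt, n_bins = 9000, 1000, 10

# Null tests: p-values uniform on [0, 1].
p_values = [rng.random() for _ in range(n_null)]
# Non-null tests: p-values concentrated near 0 (assumed Beta(0.5, 10)).
p_values += [rng.betavariate(0.5, 10.0) for _ in range(n_alt)]

# Count p-values per bin; the min() keeps p == 1.0 in the last bin.
counts = [0] * n_bins
for p in p_values:
    counts[min(int(p * n_bins), n_bins - 1)] += 1

# Crude text histogram: one '#' per 50 p-values.
for i, c in enumerate(counts):
    print(f"{i / n_bins:.1f}-{(i + 1) / n_bins:.1f} {'#' * (c // 50)}")
```

The flat part of the histogram comes from the null tests and the excess in the leftmost bin from the non-null ones.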
Peaks are often seen in the middle for differential expression analyses – these are
caused by lowly expressed genes which appear differentially expressed but have
high p-values.
Can filter out tests with insufficient power to ever have a significant p-value – where
the mean normalised count is lower than a certain threshold (which can be
calculated). This can increase detection rates.
Bioconductor genefilter can do this – look at the vignette.
Benjamini-Hochberg Method
Plot the sorted p-values against their rank, together with a line through the origin
with slope α / number of genes. Everything to the left of the last point at which the
p-values lie on or below this line is rejected.
In R this is p.adjust(p, method = "BH").
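The graphical rule above translates directly into code. The following is a minimal Python sketch of the step-up procedure with a hypothetical function name and made-up p-values. It returns reject / keep decisions rather than adjusted p-values.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null is rejected at FDR alpha."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value lies on or below the line
    # rank * alpha / m.
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank
    # Reject the k smallest p-values, even those above their own threshold.
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected

# Made-up p-values for illustration.
p = [0.039, 0.001, 0.205, 0.008, 0.041, 0.042, 0.06, 0.074, 0.212, 0.216]
print(benjamini_hochberg(p))
```

Note the step-up behaviour: once the largest qualifying rank is found, every smaller p-value is rejected too, which is why the rule is phrased as "everything to the left of the intersection".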
Error Rates
Experiment-wide – number of false positives and negatives
Family-wise – probability of one or more false positives.