Hypothesis Testing and Multiple Testing

Wolfgang Huber

There are four types of result:
- True positives
- False positives
- True negatives
- False negatives

Hypothesis testing is much more focussed on avoiding false positives than false negatives.

Results range from biased to accurate and from dispersed to precise.

There is an asymmetry between verification and falsifiability: a single contradicting result can disprove a theory, but no number of positive results can prove it.

Hypothesis testing has 4 steps:
- set up a model of reality
- do an experiment and collect data
- calculate the probability of the data given the model
- make a decision

“Two groups are the same” is not a null hypothesis; the groups are only expected to be approximately the same. The test statistic needs to be sensitive to interesting deviations from the null hypothesis.

- False positives = type I errors, with probability α
- False negatives = type II errors, with probability β

T-Tests

One sample t-test – compares the mean of a sample to a fixed value. Uses the t-distribution with n − 1 degrees of freedom, where n is the sample size.

Two sample t-test – tests whether two samples have the same mean.

In R:
- t.test(x, mu = m) for one group, where m is the fixed value (mu has to be passed by name, because the second positional argument of t.test is y)
- t.test(x, y) for two groups

The p-value is the probability of seeing data at least as extreme as those observed if the null hypothesis is true. It is not the probability that the null hypothesis is true.

The t-test assumes:
- independent observations
- normally distributed data

Deviations from normality, especially a wider distribution with heavier tails, lead to reduced power. If the t-test is no longer powerful enough, use a Wilcoxon test or a permutation test.

If the data are dependent (they are correlated or show batch effects), the p-values will be wrong.

If the data are normal, uniform or gamma distributed, the distribution of p-values under the null remains close to uniform, so violating the normality assumption in these cases has no major effect.

Correlated variables lead to many type I errors.

Multiple Testing

To prove something we reject the null hypothesis. P-values are used for data integration in genomics.
Under the null hypothesis p-values are uniformly distributed, so at a threshold of 0.05 each test that is performed has a 5% chance of giving a false positive. Multiple testing is very common in genomics. The more tests, the more false positives are expected, and the chance of at least one false positive soon approaches 100%.

Adding a prior gives context to new data. Informative priors can improve decisions.

FPR and FDR

- FPR – false positive rate
- FDR – false discovery rate

These are not the same: the FPR is the fraction of false positives amongst all genes tested, the FDR is the fraction of false positives amongst the hits.

e.g. with 20,000 genes and 100 hits, 10 of which are wrong:
- FPR = 10 / 20,000 = 0.05%
- FDR = 10 / 100 = 10%

Bonferroni Correction

If m tests are performed, multiply each p-value by m and see whether any p-value remains below α. This is the same as comparing each raw p-value to α / m, a threshold that can be hard to reach when m is large.

Plotting P-values

Using the FDR, the proportion of mistakes amongst the hits remains controlled.

A histogram of the p-values should show a peak on the left if the null hypothesis is false for some of the tests. The peak sits on top of the flat background contributed by the uniform distribution of the null p-values. Observed p-values are a mixture of samples from a uniform distribution and one or more distributions concentrated at 0.

Peaks are often seen in the middle in differential expression analyses. These are caused by lowly expressed genes which appear differentially expressed but have high p-values.

Tests with insufficient power to ever reach a significant p-value can be filtered out, i.e. those where the mean normalised count is lower than a certain threshold (which can be calculated). This can increase detection rates. The Bioconductor package genefilter can do this – look at the vignette.

Benjamini-Hochberg Method

Plot the sorted p-values against their rank, together with a line through the origin with slope α / number of genes. Everything to the left of the intersection between the two (the last point at which the sorted p-values are at or below the line) is rejected. In R this is p.adjust(p, method = "BH").

Error Rates

- Experiment-wide – number of false positives and negatives
- Family-wide – probability of one or more false positives
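
The error rates and corrections in these notes can be tried out on simulated data. What follows is a minimal sketch in R, not part of the original lecture. The simulated setup (1,000 tests, the first 100 with a true mean shift of 1, 10 observations per test) and all variable names are assumptions made purely for illustration. It uses only base R, namely t.test and p.adjust.

```r
set.seed(1)                 # arbitrary seed, for reproducibility only

alpha <- 0.05
m     <- 1000               # number of tests (assumed for illustration)
n_alt <- 100                # tests with a real effect (assumed)
n_obs <- 10                 # observations per test (assumed)

# Probability of at least one false positive among k independent null tests
fwer <- function(k, alpha) 1 - (1 - alpha)^k
print(fwer(c(1, 10, 100), alpha))

# One-sample t-test per "gene": the first n_alt have true mean 1, the rest 0
true_mean <- c(rep(1, n_alt), rep(0, m - n_alt))
p <- sapply(true_mean, function(mu_true) {
  x <- rnorm(n_obs, mean = mu_true)
  t.test(x, mu = 0)$p.value
})

# Histogram counts in bins of width 0.05 (set plot = TRUE to draw it):
# a flat background from the null tests plus a peak near 0
h <- hist(p, breaks = seq(0, 1, by = 0.05), plot = FALSE)
print(h$counts)

# Number of hits without correction, with Bonferroni, and with Benjamini-Hochberg
hits_raw  <- sum(p < alpha)
hits_bonf <- sum(p.adjust(p, method = "bonferroni") < alpha)
hits_bh   <- sum(p.adjust(p, method = "BH") < alpha)
print(c(raw = hits_raw, bonferroni = hits_bonf, BH = hits_bh))
```

For 1, 10 and 100 independent null tests, fwer gives 0.05, about 0.40 and about 0.994, which is the sense in which the chance of a false positive soon approaches 100%. Whatever the data, the Bonferroni hit count cannot exceed the Benjamini-Hochberg count, which in turn cannot exceed the uncorrected count.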