Statistical Analysis of Genomics Data: Review 3 (Francesca Chiaromonte)
More on inference, tests and random permutations

Null hypothesis on a feature of the population θ = t(F), for instance

  Ho : θ = 0

This is what we would like to refute. Using data from F, we investigate the null in comparison to an alternative, for instance

  Ha : θ ≠ 0 (two-sided)   or   Ha : θ > 0 (one-sided, right; could be left)

This is what we would like to show is supported by evidence in the data. Note that in this case the null specifies one value, while the alternative specifies a range (these are the most common specifications).

As an example, let us consider the mean µ = t(F), and the hypothesis system

  Ho : µ = 0   vs   Ha : µ > 0

e.g. we may want to assess whether we have enough evidence to conclude, based on our sample of n = 100 observations, that the log length ratio for human vs chicken has a positive mean (on average, the length of human is larger than that of chicken on orthologous regions).

We need to use a test statistic, i.e. a function of the data whose distribution under Ho (the null distribution) is known and can thus be used as a reference. Here the situation is again as simple as it gets. We know that the quantity

  (ȳ − µ) / se(ȳ)  ~  N(0,1)  approx., if n is large, regardless of the shape of F (asymptotics)
                   ~  T(n−1)  if F is normal, for any n. Good approximation if F does not depart
                              from normality too markedly (distributional assumption on the population).

Thus, under Ho (say we use the first result, n = 100):

  u = ȳ / se(ȳ)  ~  N(0,1)  approx.

The p-value, or achieved significance level, associated with the observed u is the probability that, under the null, the statistic would take the observed value, or a value even more extreme in the direction (here, right) defined by the alternative:

  p(u_obs) = Pr(u > u_obs | µ = 0)

(the integral under the right tail, beyond u_obs).
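The claim that N(0,1) is a valid reference for u when n is large can be checked by simulation. A minimal sketch (our own illustration, not part of the course material): draw many samples of size n = 100 from a skewed population whose mean is 0, so that Ho holds, compute u for each, and compare an upper-tail probability of the simulated null distribution to the N(0,1) one.

```r
set.seed(1)
n <- 100
B <- 10000
# Skewed population with mean 0: a centered exponential, so Ho: mu = 0 holds
u.null <- replicate(B, {
  y <- rexp(n, rate = 1) - 1          # E[y] = 0
  mean(y) / (sd(y) / sqrt(n))         # the statistic u
})
# Upper-tail probability under the simulated null vs. the N(0,1) reference:
mean(u.null > 1.645)                  # close to the nominal value below
pnorm(1.645, lower.tail = FALSE)      # 0.05
```

With n this large the two tail probabilities are close even though F is clearly non-normal; for small n the agreement degrades, which is why the T(n−1) result is the one to use when F is (close to) normal.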
For the two-sided alternative, the p-value is

  p(u_obs) = Pr(u < −|u_obs| or u > |u_obs| | µ = 0) = 2 Pr(u > |u_obs| | µ = 0)

here, because of the symmetry of the null distribution.

Basic idea: we can reject Ho in favour of Ha if the observed value of the test statistic is very extreme with respect to what one would expect under the null distribution; that is, if the corresponding p-value is small. The smaller the p-value, the stronger the evidence against Ho provided by the data.

In R:

> #Compute the observed value of u:
> u <- mean(chicken_toy[,"y"])/
+ (sd(chicken_toy[,"y"])/sqrt(length(chicken_toy[,"y"])))
> u
[1] 19.57509
> #this is a very large value. Next, we compute the p-value
> #for the right-sided alternative, i.e. the corresponding
> #right tail prob under N(0,1):
> pnorm(19.57509, mean=0, sd=1, lower.tail=FALSE)
[1] 1.260955e-85
> #0 for all practical purposes! Very strong evidence that the mean
> #log length ratio is positive.

Rejection rule: reject Ho if the p-value is < a threshold α, say 0.05. This threshold is called the level, or (target) significance. With this rule, we ensure that

  Pr(rejecting Ho | Ho) < α

i.e. we control the probability of a false positive, or so-called type-I error. The other error we can make is failing to reject Ho when Ha holds:

  Pr(not rejecting Ho | Ha)

This is the probability of a false negative, or so-called type-II error. One minus this probability is called the power of the test, and an explicit expression can be given for each point in the alternative (in our instance Ha is a range). Type-I and type-II error probabilities are in trade-off; test statistics are evaluated for their power behavior, once the level is fixed.

Connections between CIs and tests of hypothesis

The 1−2α coverage CI for the mean

  CI(α) = ȳ ± a(α) se(ȳ)

is the locus of points that could not be rejected as nulls at level α (against a two-sided alternative).
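This duality can be verified numerically. A minimal sketch for the right-sided case (on simulated stand-in data, since chicken_toy is not reproduced here): the level-α test rejects exactly when the lower extreme of the 1−2α CI falls above 0, because p(u_obs) < α ⇔ u_obs > a(α) ⇔ ȳ − a(α) se(ȳ) > 0.

```r
set.seed(2)
y <- rnorm(100, mean = 0.3, sd = 1)    # stand-in sample; chicken_toy not reproduced here
alpha <- 0.05
se <- sd(y) / sqrt(length(y))
u.obs <- mean(y) / se                  # test statistic under Ho: mu = 0
p.val <- pnorm(u.obs, lower.tail = FALSE)
# 1-2*alpha coverage CI: a(alpha) is the upper-alpha quantile of N(0,1)
a <- qnorm(1 - alpha)
ci.lower <- mean(y) - a * se
# The two rejection criteria agree:
(p.val < alpha) == (ci.lower > 0)      # TRUE
```

The agreement here is exact, not approximate: both criteria are algebraic rearrangements of the same inequality.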
Back to the right-sided alternative in the example above, consider the following reasoning: both for the CI and for the test, we exploit

  (ȳ − µ) / se(ȳ)  ~  N(0,1)  approx.

Loosely, when building the CI, we take N(ȳ, se(ȳ)) as rendering the sampling distribution of the sample mean (µ replaced with ȳ). When testing, we take N(0, se(ȳ)) as rendering the sampling distribution of the sample mean under the null µ = 0. With α as tuning parameter, the data support the conclusion µ > 0 if:

• the lower extreme of CI(α) is > 0 (0 not in the interval) … 0 is far enough from the observed sample mean, using N(ȳ, se(ȳ));
• the test rejects the null, p(u_obs) < α, i.e. u_obs > a(α) … the observed sample mean is far enough from 0, using N(0, se(ȳ)).

These are parallel ways of reasoning.

[Figure: the null curve N(0, se(ȳ)) with its rejection threshold, and the curve N(ȳ, se(ȳ)) centered at the sample mean with its CI.] Note how here the two distributions are symmetric and identical except for their location.

It is useful to generalize this parallel. Take a generic population feature t(F), a hypothesis system, say Ho: t(F) = 0 vs Ha: t(F) > 0, and evidence in the data summarized through t(Fn). Now let:

• P_F(n) = sampling distribution of t(Fn)
• P_o(n) = sampling distribution of t(Fn) under the null.

Then

  ∫_{t ≥ t_obs} P_o(n)(dt) = p(t_obs)   the p-value we seek
  ∫_{t ≤ 0} P_F(n)(dt) = f              the parallel construction

If the two distributions are symmetric and not too different in shape, the latter is a good approximation to the former, with the advantage that we can estimate it numerically by bootstrap, whatever the t(F) under consideration:

  f̂ = (1/B) #(t*(b) ≤ 0)   bootstrap-based empirical p-value

In rough terms, this is the logic of bootstrap testing (again, many refinements exist). This approach breaks down if the nature of the variability presented by the statistic under the null is substantially different.
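The bootstrap-based empirical p-value f̂ takes only a few lines of R. A minimal sketch for t(F) = µ, on simulated stand-in data (any data column could be substituted for y):

```r
set.seed(3)
y <- rnorm(50, mean = 0.4, sd = 1)     # stand-in sample
B <- 2000
# Resample with replacement and recompute the plug-in statistic t(Fn) = mean:
t.star <- replicate(B, mean(sample(y, length(y), replace = TRUE)))
# Parallel construction: mass of the bootstrap distribution at or below 0
f.hat <- mean(t.star <= 0)             # = (1/B) #(t*(b) <= 0)
f.hat
```

A small f̂ plays the role of a small p-value: little of the bootstrap mass sits at or below the null value 0, so the data favour µ > 0.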
However, there is an important class of testing problems for which we can obtain p-values numerically without resorting to the parallel construction and the bootstrap. This is because we can simulate P_o(n) itself.

Permutation tests

Consider for instance:

• F = population jointly representing y and x, both quantitative, and t(F) = ρ(y,x) the correlation – e.g. the correlation between log length ratio and log large insertion ratio.
• F = population jointly representing y (quantitative) and c = 1, 2 (categorical, indexed), and t(F) = µ1 − µ2 the difference between the means – e.g. the difference in mean log length ratio between micro+medium and macro chicken chromosomes (creating two groups out of the three here).

In both cases the selected population features are ways of measuring the association between two variables. The values representing no association are 0, and we can imagine testing:

  Ho : ρ(y,x) = 0 (no linear association)      vs   Ha : ρ(y,x) > 0
  Ho : µ1 − µ2 = 0 (no location effect of the class)   vs   Ha : µ1 − µ2 > 0

We surely have:

  y indep x  ⇒  Ho : ρ(y,x) = 0
  y indep c  ⇒  Ho : µ1 − µ2 = 0

Permutation tests exploit the fact that we can simulate independence, and thus, a fortiori, null hypotheses concerning these types of features.

Let us consider the difference-of-means problem:

  y:  y1 … y_n1          y_(n1+1) … y_(n1+n2)
  c:  1  …  1            2  …  2

i.e. a (sub)sample of size n1 from the c=1 subpopulation, and a (sub)sample of size n2 from the c=2 subpopulation (n1 + n2 = n) … often called the 2-sample problem. The plug-in statistic for the difference of means is d = ȳ1 − ȳ2.

Now, keep the y-column fixed, and for b = 1 … B:

1. Create a random permutation of the c-labels in the second column (equivalent to drawing n times from them without replacement). This gives (c1*(b) … cn*(b)).
2.
Using the permuted labels, recompute the difference between the means:

  d*(b) = ȳ1*(b) − ȳ2*(b)

Compute the permutation-based empirical p-value as

  p(d_obs) = (1/B) #(d*(b) ≥ d_obs)

Random permutations simulate sampling from a population in which y and c are left unchanged marginally, but any association existing between them is broken. At the sample level, all the marginal features of y (overall mean, sd, etc.) and c (frequencies) are preserved, because the y and c numbers stay the same, but the connection is "scrambled" – thus the means within the groups change.

Although in many practical cases bootstrap- and permutation-based empirical p-values will be quite close, the latter is, when applicable, the better approach because it simulates the null distribution directly. Note: only one of the data columns is permuted (we could permute both, but it would not help).

To implement permutation tests in R, you can look for ready-to-use functions available on the web (as individual functions or in packages), or you can write your own function. In doing so, you will need to use the function:

> sample(x, size, replace = FALSE, prob = NULL)
> #A random permutation of x is a sample of size length(x)
> #from x, without replacement. No need to specify sampling weights (prob).
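Putting the steps above together, a minimal home-made permutation test for the 2-sample difference of means might look as follows (function and variable names are our own, and the data are simulated stand-ins for a real y-column and c-column):

```r
set.seed(4)
perm.test <- function(y, c, B = 5000) {
  # observed plug-in statistic d = mean of group 1 - mean of group 2
  d.obs <- mean(y[c == 1]) - mean(y[c == 2])
  d.star <- replicate(B, {
    c.star <- sample(c)                    # random permutation of the labels
    mean(y[c.star == 1]) - mean(y[c.star == 2])
  })
  # permutation-based empirical p-value: (1/B) #(d*(b) >= d.obs)
  mean(d.star >= d.obs)
}
# Stand-in data: group 1 shifted upward, so we expect a small p-value
y <- c(rnorm(40, mean = 0.5), rnorm(60, mean = 0))
c <- rep(c(1, 2), times = c(40, 60))
perm.test(y, c)
```

Note that sample(c), with no further arguments, returns a random permutation of c, exactly as described in the text; the y-column is never touched, so all marginal features of y and the label frequencies are preserved in every permuted dataset.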