STAT 101, Module 8: Statistical Testing, null hypotheses, test statistics, p-values (Book: chapter 10)

Motivation

At the end of the last module we talked about values of μ that are compatible with the data in terms of their X̄ and s. In this module we expand on these thoughts and develop the vocabulary and argumentation methods of statistical testing.

Example 1: A manufacturer of consumer electronics would like to know how many households intend to purchase a computer next year. Management hopes that the proportion is greater than 10% in order to justify sales projections. Does the survey that shows 14% willingness to purchase support their claim? Could the good news be the result of chance? Since 10% is a crucial threshold it makes sense to use p = 0.10 as what we call a null hypothesis and check whether the data from their survey is compatible with this assumption.

Example 2: Suppose you sample 25 students from the Penn class of 2006 and observe that their average SAT is 1380 with an SD of 125. An admissions officer claims the average SAT is at least 1420. You are surprised at the inconsistency, but then again, this assertion might be compatible with the data. One could use μ = 1420 as a null hypothesis and check how compatible it is with the numbers from the sample.

Example 3: It is claimed that a coin is fair. A sample of 144 tosses results in 64 heads. Is this compatible with the assumption of a fair coin? The natural null hypothesis is p = 0.5, which defines fairness.

Example 4: At the end of Module 7 we mentioned elections. Again, a rate of 0.50 of favorable likely voters is critical to the claim of being ahead in the polls. Therefore p = 0.50 is a natural null hypothesis.

Example 5: It is known that a standard surgery requires a mean hospital stay of 5.4 days. A new and less invasive type of surgery is said to require only 3.3 days in the hospital on average. How sure are we that the new method actually does require fewer hospital days?
It would appear natural in this case to play “devil’s advocate” and check whether the null hypothesis of a population mean of 5.4 hospital days is actually compatible with the data for the new method that seem to have a sample average of 3.3. If it is compatible with the data, maybe the case for the new method is not strong enough, or one needs more data. Note that because of the long experience with the old type of surgery one can assume N ≈ ∞, hence 5.4 can be seen as a population mean. For the new type of surgery there will be much less experience and N will be small at this exploratory stage, hence 3.3 should be interpreted as X̄.

The Components of Statistical Testing

Statistical testing can be used in several ways:
o Statistical testing can be a Socratic game: Allow someone to make an assertion, and play along till it leads to an apparent absurdity… or not. Example: the admission officer’s assertion about SAT averages of students admitted by Penn.
o Statistical testing can be a devil’s advocate game: Assume an undesirable scenario, and try to show it probably isn’t so. Example: the assumption that the new type of surgery is no better than the old type.
o Statistical testing can be used to check whether a norm is likely to be satisfied. Example: examining fairness of a coin.

Note that we tend to cast statements in vague terms: “probably”, “likely”. The reason is that statistical testing quantifies uncertainty about conclusions. Science never deals in absolute certainties, although some conclusions can reach certainty beyond reasonable doubt.

Null hypotheses: A null hypothesis is an assumption about a population quantity. Note the “population” part. Null hypotheses are never about actually observed statistics computed from data. Population values are the targets estimated by sample values, and the sample values are used for inference about population values.
The two fundamental methods of statistical inference are: 1) confidence intervals and 2) statistical tests.

We consider only population means μ and population proportions p (= probabilities), and the only type of assumption we will consider is that μ or p take on a specific value of interest. What these hypothesized values are depends on the context: If it is about testing fairness of a coin or commanding a majority in the polls, the natural null hypothesis is p = 0.5. If the business plan asks for a minimum demand of 10% of households, then p = 0.1 is the natural null hypothesis. If a new type of surgery is asserted to shorten hospital stays, then the devil’s advocate says the mean reduction is zero.

One could also consider null hypotheses about population standard deviations σ, and this is done, but it is much less important. Below we will consider differences in population means and population proportions between groups, and finally null hypotheses about population slopes in regression.

Notation for null hypotheses:

H0: μ = μ0 and H0: p = p0

where μ0 and p0 are the assumed population values. In the case of the new type of surgery, we could let μ be the population mean of hospital days with the new procedure, so the null hypothesis of no improvement over the old type of surgery is that both types have the same population average:

H0: μ = 5.4

When it comes to testing fairness of coins or claims to majorities in polls, the null hypothesis is:

H0: p = 0.5

o Reminder: H0: X̄ = 5.4 is completely mistaken. The quantity X̄ will lend evidence about μ, but it cannot be the subject of a null hypothesis.
o Why “null” hypothesis? There is another type called “alternative hypothesis”, hence “null” is opposed to “alternative”. The alternative hypothesis is essentially “not the null hypothesis”. There are subtleties about alternative hypotheses that we will not discuss here (two-sided versus one-sided alternatives: Ha: μ ≠ μ0 and Ha: μ > μ0).
Test Statistics: A test statistic computed from data provides evidence for or against the null hypothesis. It is not too difficult for us to devise a test statistic for the above null hypotheses. Similar to confidence intervals, the ideas center on the deep fact that means vary across datasets, that their variation can be quantified by the standard error σ( X̄ ), and that σ( X̄ ) can be estimated from a single dataset by the standard error estimate stderr = s(X)/√N.

Let us have another look at the graphs at the end of Module 7: To play the game of testing a null hypothesis, we assume that it is true and that the data have the hypothesized population mean μ0. We then check how extreme the estimate X̄ of μ0 is in light of the distribution of X̄:
o In the first graph above, X̄ is less than two standard errors away from μ0. This is counted as compatible with the null hypothesis that μ has this particular value.
o In the second figure, X̄ is more than two standard errors away. One judges this X̄ to be too unlikely under the null hypothesis and hence incompatible with it.

In light of the CLT, a good test statistic would be a Z-score formed under the null hypothesis:

$$ z = \frac{\bar{X} - \mu_0}{\sigma(\bar{X})} $$

If z is more extreme than ±2 (that is, > +2 or < –2), we will say: we reject H0. What we really mean is: H0 (the assumption that μ = μ0) is not very compatible with the data.

An obvious problem is that while μ0 is specified by the null hypothesis, the standard deviation σ(X) of the data and hence the standard error σ( X̄ ) = σ(X)/√N are not specified and hence need to be estimated. The result is what is called the t-statistic:

$$ t = \frac{\bar{X} - \mu_0}{\mathrm{stderr}(\bar{X})} $$

where stderr( X̄ ) = s(X)/√N is the standard error estimate as usual.

Comments:
o We can think of the t-statistic as a change of units in X̄: make μ0 the new origin of the scale and make stderr the new unit. If t = 1.5, then X̄ is 1.5 stderr to the right of μ0. Therefore, |t| measures the distance of X̄ from μ0 in multiples of stderr.
o |t| is a measure of evidence against the null hypothesis: if |t| > 2, we “reject the null hypothesis” (although see what follows).

Null Distribution: The probability distribution of the test statistic t assuming H0: μ = μ0 is called the null distribution. Note it is a hypothetical distribution, literally. It is used to judge what values of t and hence of X̄ should be considered as giving evidence for or against μ0. Large values |t| will count as evidence against μ0.

Now that we have replaced the denominator σ( X̄ ) of z with the quantity stderr which is no longer a constant but a random variable, the probability distribution of the resulting t has changed: If the observations themselves are normal, the random variable z is normal, but the random variable t is no longer exactly normal. It has what is called “Student’s t-distribution” (recall the story of “Student” aka Gosset at the Guinness Brewery in 1908). The t-distribution becomes very nearly normal for large N, but for N < 60, the cut-off value, which should be the 97.5% quantile, is greater than 2 and grows as N gets smaller. Here is one more time the table from Module 7, where we included N=∞, which is the normal distribution:

N:        10    15    20    30    40    50    60    75    100   ∞
t0.975:   2.23  2.13  2.09  2.04  2.02  2.01  2.00  1.99  1.98  1.96

Using these “exact” cut-offs, we say we reject H0 when |t| > t0.975. The union of the two intervals (–∞, –t0.975) and (t0.975, +∞) is called the rejection region. The interval (–t0.975, t0.975) is called the “nonrejection region”. Purists are against using the term “acceptance region”, hence it’s “nonrejection region”. Nicer terminology would use the words “incompatible” and “compatible”, which is what μ0 and X̄ are depending on where t falls. In the next graph below, the part of the axis with the gray area is the rejection region, the part in between is the non-rejection region.

The t-statistic is always reported in null hypothesis testing.
When you see it, check it against the rough cut-offs ±2, but be aware that JMP and all other software use the t-quantiles as in the above table; they are exact if the observations are normally distributed. If the data are not normally distributed (as for discrete and skewed distributions), even the t-distribution is only an approximation. Visually, the t-distribution is indistinguishable from the normal distribution, except when N is extremely small. The following figure shows the t-density function for N=20.

Significance Levels: The choice of boundaries at the 2.5% and 97.5% quantiles of the null distribution amounts to a test at the significance level α = 5%, or simply at the 5% level. The significance level α is the tail probability that defines the cut-off values, approximately ±2. In the figures above, the gray areas denote the 5% tail probability α, divided into two areas of α/2 = 2.5% each.

The choice of 5% is a convention that can be changed. The significance level of 5% is the most frequent choice, but when the evidence against the null hypothesis is required to be more stringent in order to reject it, one chooses a significance level of 1% or even lower. In this case, the quantiles for the t-distribution are as follows:

N:        10    15    20    30    40    50    60    75    100   ∞
t0.995:   3.25  2.98  2.86  2.76  2.71  2.68  2.66  2.64  2.63  2.58

It appears that cut-offs ±2⅔ are a good and conservative choice for testing at the 1% significance level. Again, all software, including JMP, uses the “exact” quantiles of the t-distribution.

In general, for a given significance level α, one uses the (1–α/2)-quantile of the t-distribution as a cut-off. That is:

Reject H0 at the significance level α ⇔ |t| > t1–α/2

Comments:
o The lower the significance level α, the larger is the nonrejection region, and the less likely is rejection of the null hypothesis.
o It is possible that we can reject at the 5% level, but not at the 1% significance level.
This is the case if |t| is between 2 and 2.6: 2 < |t| < 2.6 means rejection at the 5% level, but not at the 1% level.
o The significance level α is also called the “Probability of a Type 1 Error”. A Type 1 Error is the rejection of the null hypothesis when it is in fact true. The probability that this happens is exactly α, by construction:

P( rejection of H0 at the level α | H0 is true ) = P( |t| > t1–α/2 | H0 is true ) = α

See the red box above. So far we have discussed α as a tail probability, which it is: it is the probability, computed under the null hypothesis, of seeing a value of the t-statistic more extreme than the cut-off t1–α/2. (There is a notion of Type 2 Error, which is not rejecting H0 when H0 is in fact false. This is a more difficult concept and we will not explain it.)

P-Values: The p-value is the achieved significance level. The idea behind the p-value starts with the following question: What would be the significance level for which the observed X̄ and t would be exactly on the cut-off? That significance level is the p-value. The situation is depicted here:

Comments:
o The p-value is a measure of evidence in favor of H0: μ = μ0. If the p-value falls below α, we say there is insufficient evidence in favor of H0: μ = μ0, hence:

Reject H0 at the significance level α ⇔ p-value < α

o The p-value is a random variable, even though it is calculated as a hypothetical probability assuming H0: μ = μ0 is true.
o The p-value is a transformation of |t| to the 0-1 range:

p-value = 1 ⇔ μ0 = X̄
p-value = 0 ⇔ | μ0 – X̄ | = ∞

o The p-value is the hypothetical probability of observing a value of t more extreme than the one in hand. If this hypothetical probability is small, it means the value of t in hand is extreme under H0. Hence we reject H0.
o Why p-values are so popular: They allow testing a null hypothesis at all conceivable significance levels.
Once we know the p-value, we know how to answer if someone asks for a test at the 5% level, at the 1% level, at the 0.5% level… The answer is always: if the p-value is below the significance level α, we reject H0 at the significance level α.
o We see, therefore, that a p-value of 0.02 allows us to reject at the 5% level, but not at the 1% level.

Confused? That’s ok. Here are handy rules for real life:
o Reject H0 at the 5% significance level if the p-value is below 0.05. This never fails.
o If the t-statistic is < –2 or > +2, expect rejection, but in borderline cases where the t-statistic is very near +2 or –2, recall that the cut-offs ±2 are not exact, hence trust the p-value.
o Keep in mind that statistical testing is a “what if” game. It starts with “what if μ = μ0?” and checks what the consequences are in light of the data. Rejection of μ = μ0 means that this assumption is not compatible with the data.

Confidence Intervals with Coverage Probability 1–α: Logically equivalent to rejection at the 5% significance level is μ0 falling outside the “exact” 95% CI (provided by the software). The rough CI = ( X̄ ± 2 stderr ) usually gives the same answer but may fail in borderline cases when | X̄ – μ0 | ≈ 2 stderr. The “exact” CI with coverage probability 1–α is:

CI1–α = ( X̄ – t1–α/2 · stderr, X̄ + t1–α/2 · stderr )

Therefore, a rough 99% confidence interval is X̄ ± 2⅔ · stderr.
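The testing rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the course software (JMP). The function names `t_pdf`, `t_two_sided_p`, `t_cutoff` and `one_sample_t_test` are made up for this example. The test is assumed to use N – 1 degrees of freedom, the usual convention for a one-sample t-test, and the t-distribution is evaluated by simple numerical integration with the standard library only. The script computes the t-statistic, the two-sided p-value and the "exact" CI for the SAT numbers of Example 2.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_two_sided_p(t, df, steps=2000):
    """P(|T| > |t|) under H0, via Simpson's rule on the density over [0, |t|]."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * total * h / 3  # probability of the interval (-|t|, |t|)
    return max(0.0, 1.0 - central)

def t_cutoff(alpha, df):
    """Cut-off t_{1-alpha/2}: the value c with P(|T| > c) = alpha (bisection)."""
    lo, hi = 0.0, 1.0
    while t_two_sided_p(hi, df) > alpha:  # widen the bracket until it contains c
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_two_sided_p(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def one_sample_t_test(xbar, s, n, mu0, alpha=0.05):
    """t-statistic, two-sided p-value and (1 - alpha) CI from summary statistics."""
    df = n - 1                      # assumption: N - 1 degrees of freedom
    stderr = s / math.sqrt(n)       # standard error estimate s / sqrt(N)
    t = (xbar - mu0) / stderr
    p_value = t_two_sided_p(t, df)
    cut = t_cutoff(alpha, df)
    ci = (xbar - cut * stderr, xbar + cut * stderr)
    return {"t": t, "p_value": p_value, "cutoff": cut, "ci": ci,
            "reject": p_value < alpha}

# Example 2: N = 25 students, sample mean 1380, s = 125, H0: mu = 1420
result = one_sample_t_test(1380, 125, 25, 1420, alpha=0.05)
print(result)
```

For these numbers the three criteria agree: |t| is below the cut-off, the p-value is above 0.05, and 1420 lies inside the interval, so H0 is not rejected at the 5% level.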
The general connection between α-level testing and (1–α)-CIs is:

Reject H0 at the significance level α ⇔ μ0 ∉ CI1–α

Testing Means in JMP:

Analyze > Distribution > (select Y,Columns) > OK
(click tiny red triangle icon, next to variable name) Test Mean > (enter the values μ0 or p0 to be tested in the upper field) > OK

Here is Example 3, the problem of testing fairness of a coin (H0: p=.5) where 64 heads in 144 flips were observed (Sim Dice and Coin Flips.JMP):

Moments
  Mean              0.4444444
  Std Dev           0.4986384
  Std Err Mean      0.0415532
  upper 95% Mean    0.5265823
  lower 95% Mean    0.3623066
  N                 144

Test Mean=value
  Hypothesized Value   0.5
  Actual Estimate      0.44444
  df                   143
  Std Dev              0.49864

t Test
  Test Statistic   -1.3370
  Prob > |t|        0.1834
  Prob > t          0.9083
  Prob < t          0.0917

[Figure: null distribution, horizontal axis from .40 to .60]

JMP gives you a picture of the null distribution with the area of the p-value colored in blue. Note that it is centered at the hypothesized population mean 0.5, shown also in the numeric output. We see the mean or proportion twice: among moments and below the hypothesized value.
o Our two-sided p-value is written as “Prob > |t|”. Its value is 0.1834. Since it is not below 0.05, we do not reject the null hypothesis. Our p-value is followed by two one-sided p-values for which we have no use; they are associated with one-sided alternative hypotheses.
o The “Test Statistic” is the t-statistic (it can be the z-statistic if the standard deviation is known). Its value –1.337 is between ±2, hence again no rejection.
o The CI (0.362, 0.527) contains the hypothesized value 0.5, hence yet again no rejection.

Example 1: Recall the manufacturer’s target is an excess of 10% take rate, and the survey says the rate of self-declared intent of purchase is 14% of the households. Since 10% is the critical border line, we take H0: p=0.10 as the null hypothesis, and the question to be answered is whether the observed proportion p̂ = 0.14 lends evidence against H0.
To proceed, we need one more piece of information: the sample size, which happens to be N = 500. At the end of Module 7 we saw that the standard error estimate for the proportion is

stderr( p̂ ) = √( p̂ (1 – p̂ ) / N ) = √( 0.14 · 0.86 / 500 ) = 0.0155

hence the test statistic is

$$ \frac{\hat{p} - p_0}{\mathrm{stderr}(\hat{p})} = \frac{0.14 - 0.10}{0.0155} = 2.58 $$

Now this is fortunate: 2.58 is greater than 2. Hence we can reject the assumption that the true population proportion is 10%.

Example 2: The null assumption in the Penn student SAT problem is H0: μ = 1420, the assertion made by the admission official. He/she may have made the assertion based on the complete census of Penn students; we wouldn’t know, it’s just a very specific assertion. Our evidence is rather scant: a random sample of N = 25 students with a sample mean X̄ = 1380 and a standard deviation s = 125. Hence the standard error estimate is s/√N = 125/5 = 25. The test statistic is

$$ t = \frac{\bar{X} - \mu_0}{\mathrm{stderr}(\bar{X})} = \frac{1380 - 1420}{25} = -1.6 $$

The value |t| = 1.6 is clearly below 2, hence the assertion that the population mean of SAT scores is 1420 is compatible with the data. A problem is of course that the sample is so small. With a larger sample we’d have a better chance to refute the admission official.

Example 4: Can a candidate with 56% of likely voters in his/her favor brag that he/she has a majority? We need to know the sample size of the survey. If it is N = 961, then stderr = √(.56 · .44 / 961) = 0.016, and the test statistic is (.56 – .50)/.016 = 3.75. The p-value would be 0.0001874, which is smaller than all conventional significance levels, and hence the majority is a pretty sure thing. It means that if the truth is still p = 0.50, then one would find a value as extreme as 56% in fewer than 2 out of 10,000 surveys of size N = 961.
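The arithmetic in Examples 1, 2 and 4 can be re-checked with a short script. This is a sketch under stated assumptions: the helper names `proportion_t`, `mean_t` and `approx_two_sided_p` are hypothetical, the standard error of a proportion is computed from p̂ as in the notes, and the p-value uses the normal approximation from Python's standard library rather than the exact t-distribution that JMP uses. The approximate p-values therefore come out somewhat smaller than t-based ones, noticeably so for the small sample in Example 2.

```python
import math
from statistics import NormalDist

def proportion_t(p_hat, p0, n):
    """Test statistic for H0: p = p0, with stderr = sqrt(p_hat * (1 - p_hat) / n)."""
    stderr = math.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - p0) / stderr, stderr

def mean_t(xbar, s, mu0, n):
    """Test statistic for H0: mu = mu0, with stderr = s / sqrt(n)."""
    stderr = s / math.sqrt(n)
    return (xbar - mu0) / stderr, stderr

def approx_two_sided_p(t):
    """Two-sided p-value from the normal approximation (an assumption; JMP uses t)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

examples = {
    "Example 1 (purchase intent, H0: p = 0.10)": proportion_t(0.14, 0.10, 500),
    "Example 2 (SAT, H0: mu = 1420)": mean_t(1380, 125, 1420, 25),
    "Example 4 (poll, H0: p = 0.50)": proportion_t(0.56, 0.50, 961),
}

for name, (t, stderr) in examples.items():
    # Rough rule from the notes: |t| > 2 means rejection at about the 5% level
    verdict = "reject H0" if abs(t) > 2 else "do not reject H0"
    print(f"{name}: stderr = {stderr:.4f}, t = {t:.2f}, "
          f"approx. p = {approx_two_sided_p(t):.4g}, {verdict}")
```

Because the script does not round the standard error before dividing, its test statistics can differ from the hand calculations in the last digit or so.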