ECON 309 Lecture 7B: Hypothesis Testing

I. Hypotheses

A hypothesis is a claim about a parameter that you're interested in. The simplest hypotheses are about the parameters of a single variable, such as the mean of a population. But there are more complicated hypotheses, as we'll see when we get to regression analysis; those hypotheses are about the parameters that control the relationship between two or more variables.

Some simple hypotheses:
- The average number of customers in this store per day is greater than 10.
- Condoms from this production line will break less than 1% of the time.
- The average number of years it takes to graduate from CSUN is 6.5.

To do a hypothesis test, you will actually have two hypotheses: the null hypothesis and the alternative hypothesis, which are stated in such a way that they are mutually exclusive (you can't have both hypotheses be true). The null hypothesis is the conclusion that is considered the default – you will accept this hypothesis if you fail to find sufficient support for the alternative hypothesis. This is important: it means you are placing the burden of proof on those who support the alternative hypothesis. The null hypothesis is essentially "innocent until proven guilty" – it can be accepted with little support from the evidence, simply because the evidence doesn't strongly indicate something else. For this reason, researchers will usually use the alternative hypothesis to represent their own position – what they wish to prove – in order to put their claim to the strongest test. But sometimes researchers put their own position as the null, in which case they've made things very easy on themselves.

II. One-Tail versus Two-Tail Hypothesis Tests

What if the CSUN administration claims the average number of years to graduate from CSUN is 6.5? There are two ways they could be wrong: the average could be lower, or the average could be higher.
If we wanted to test the claim, we could state the null and alternative hypotheses like so:

H0: μ = 6.5
H1: μ ≠ 6.5

Here we're taking the administration's claim as the null; we are giving them the benefit of the doubt, and will only reject the claim with sufficient evidence to the contrary.

On the other hand, what if the CSUN administration claims the average number of years to graduate from CSUN is no more than 6.5? Then there is only one way they could be wrong: if the average is really higher. We could state the null and alternative hypotheses like so:

H0: μ ≤ 6.5
H1: μ > 6.5

Again, we're giving the administration the benefit of the doubt by putting their claim as the null.

These two kinds of test are different. The first is called a two-tail test, because there are two ways we could reject the null. The second is called a one-tail test, because there is only one way we could reject the null. We can see the difference by looking at the distribution of sample means around the population mean.

[Draw the bell curve centered at 6.5. Show regions to both the left and right of 6.5, indicating two ways to reject the hypothesis that the mean really is 6.5: because the sample mean is especially small or especially large.]

[Draw the same bell curve, but with a somewhat larger region on the right, showing a rejection of the null because the sample mean is especially large. Have no similar region on the left.]

We could have done a different one-tail test. What if CSUN's administration claimed the average graduation time was no less than 6.5? Then we would say:

H0: μ ≥ 6.5
H1: μ < 6.5

Again, this gives the benefit of the doubt to the administration.

III. Significance Levels and Type I and Type II Errors

Remember from the lecture on CIs that we had to choose a significance level, designated α. This was the probability that a CI generated from a sample would not include the true mean.
Now, we'll use the same significance level, or α, for the probability that a hypothesis test will reject the null hypothesis even though it's true. This probability corresponds to the shaded area in the distributions just examined. If the true mean is 6.5, we could still (by chance) get a sample far enough from 6.5 that we reject the null hypothesis. In the two-tail test, this could happen with an especially large or small sample mean. In the one-tail test, it could happen only with an especially large sample mean (for the claim that graduation time is no greater than 6.5).

The level of significance is often set at 0.10, or 10%. For the two-tail test, we need to split this between the tails, for 0.05 or 5% each. For the one-tail test, we put all the weight in a single tail. But we could choose a different significance level. In general, for significance level α, put α/2 in each tail for a two-tail test, and α in the appropriate tail for a one-tail test.

We usually make the significance level relatively small. That's why I say the null hypothesis is the default, the claim being given the benefit of the doubt: you are setting a relatively small chance of rejecting it when it's true. (Again, the trial court analogy is apt: you want a relatively small chance of convicting an innocent man.) We call this kind of error a Type I error. The probability of a Type I error is equal to α.

There is another type of error you could make: accepting a null hypothesis even though it's false. And if you think about it, this type of error is going to be fairly common if you're setting a small α. If you're giving the null hypothesis the benefit of the doubt, you'll often fail to reject it even though it's wrong. (Trial court analogy: by requiring a high standard of proof for guilt, we probably let a large number of guilty people go free.)
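The claim that a Type I error occurs with probability α can be checked by simulation. Here is a quick sketch in Python, using the graduation-time numbers from this lecture (μ = 6.5, σ = 2, n = 49) and a two-tail test at α = 0.10; the critical value 1.645 is a slightly more precise version of the table's 1.64:

```python
import math
import random

random.seed(1)

mu_true = 6.5    # suppose the null H0: mu = 6.5 really is true
sigma = 2.0      # known population standard deviation
n = 49           # sample size
z_crit = 1.645   # two-tail critical value for alpha = 0.10
trials = 10_000

rejections = 0
for _ in range(trials):
    # draw a sample of 49 graduation times from the true population
    sample_mean = sum(random.gauss(mu_true, sigma) for _ in range(n)) / n
    z = abs(sample_mean - mu_true) / (sigma / math.sqrt(n))
    if z > z_crit:   # two-tail rejection rule
        rejections += 1

print(rejections / trials)  # should land close to 0.10 = alpha
```

Even though the null is true in every trial, about 10% of the samples land far enough from 6.5 to reject it; that 10% is exactly the Type I error rate α we chose.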
We call this kind of error – failing to reject the null even though it's false – a Type II error, and we say the probability of a Type II error is β. There is an inverse relationship between the probabilities of Type I and Type II errors: the higher one is, the lower the other.

Why do we usually set such a small probability of Type I error? This is a result of the fact that statistics has largely been developed in scientific applications. Scientists don't like to accept a claim that differs from the existing wisdom, or that asserts the existence of a relationship, unless they have really strong evidence. Notice that I have continually referred to "not rejecting the null" instead of "accepting the null." That's because scientists generally wish to remain agnostic without sufficient evidence: they will simply say we don't know in a wide range of circumstances.

But that may not be appropriate in non-scientific contexts, including business and policy. For instance, if you're thinking of starting a business, the question you might ask is: will I make a profit? But what you really want to know is: should I open the business or not? Now, what level of certainty do you need about the conclusion that you will make a profit? Do you need to be 95% certain of that? Put yourself in the position of a loan officer: would you require 95% certainty that the investment will pay off? That's what you'd be asking for if you set α = 0.05. You might be willing to accept a substantially higher probability of failure. (I've been told that wildcatters searching for sites on which to drill for oil will accept an α as large as 0.8, or 80%, on the proposition that a site has oil. Oil wells are so profitable that you can tolerate a very large number of failed drillings.)

This is why I said, in the CI lecture, that the significance level is not magic. There is nothing special about 0.10 or 0.05 or 0.01. They are just convenient numbers that scientists use, but non-scientists may pick different numbers. IV.
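The inverse relationship between α and β can be made concrete. Suppose (this scenario is my own illustration, not part of the example above) that the true mean graduation time really were 6.9, so the one-tail null H0: μ ≤ 6.5 is false, with σ = 2 and n = 49 as before. Then β is the chance the sample mean still lands below the critical cutoff, and we can compute it for several choices of α:

```python
import math

def phi(z):
    # standard normal CDF: P(Z <= z)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# True mean 6.9 vs. hypothesized 6.5, measured in standard errors:
shift = (6.9 - 6.5) / (2 / math.sqrt(49))  # = 1.4

# One-tail critical values for three common significance levels
for alpha, z_crit in [(0.10, 1.282), (0.05, 1.645), (0.01, 2.326)]:
    beta = phi(z_crit - shift)  # P(fail to reject | null is false)
    print(f"alpha = {alpha:.2f}  ->  beta = {beta:.3f}")
```

Shrinking α from 0.10 to 0.01 pushes β from roughly 0.45 up to roughly 0.82: the harder we make it to convict an innocent null, the more often we let a guilty one go free.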
Performing the Test

To perform a hypothesis test, you must find a z-score based on the value of the parameter specified in the null hypothesis:

z = |x̄ − μ_H0| / σ_x̄

Note that in forming this z-score, we are using the standard error of the mean in the denominator. That's because your sample mean is distributed normally with that standard deviation, not the standard deviation of the population as a whole. We can rewrite the above like so:

z = |x̄ − μ_H0| / (σ/√n)

We will then compare this to a critical value of z from the standard normal table. If it's greater than the z-critical, we reject the null and accept the alternative hypothesis. Otherwise, we do not reject the null, nor do we accept the alternative.

Example: Let's do the two-tail test on CSUN's graduation time. Let's say we know the standard deviation of the population is 2 years, and we sampled 49 CSUN graduates and found a sample mean of 6.9. Then we calculate:

z = |x̄ − μ_H0| / (σ/√n) = |6.9 − 6.5| / (2/√49) = 1.4

We need a z-critical value for a significance level of 0.10. Since this is a two-tail test, we want 0.05 in each tail, so find the value of z in Table 3 that gives you an area as close to 0.95 as possible. This turns out to be 1.64. Since 1.4 < 1.64, we do not reject the null. The administration's claim cannot be rejected.

Example: Now let's do the one-tail test on CSUN's graduation time. All the calculations are the same, except now we want the whole 10% in the right tail. That gives us a z-critical of 1.28. Since 1.4 > 1.28, we reject the null and accept the alternative. We think the administration has underestimated the true mean graduation time.

NOTE: The test we just did is a right-tail test, because the null hypothesis is rejected only for a sufficiently high sample mean. But what if the null hypothesis had been that CSUN's average graduation time was greater than or equal to 6.5? In that case we would have done a left-tail test.
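The two worked examples above take only a few lines to reproduce on a computer; here is a sketch in Python, using the same Table 3 critical values (1.64 and 1.28):

```python
import math

x_bar, mu_0 = 6.9, 6.5   # sample mean and hypothesized mean
sigma, n = 2.0, 49       # known population sd and sample size

# z-score using the standard error of the mean
z = abs(x_bar - mu_0) / (sigma / math.sqrt(n))
print(round(z, 2))  # 1.4

# Compare against the critical values at alpha = 0.10
print("two-tail:", "reject" if z > 1.64 else "do not reject")
print("one-tail:", "reject" if z > 1.28 else "do not reject")
```

As in the lecture, the same z = 1.4 fails to clear the two-tail cutoff of 1.64 but does clear the one-tail cutoff of 1.28.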
In addition to the z-value calculated above being greater than z-critical, you also need to make sure the sample mean is less than the hypothesized mean (6.5 in this case). Alternatively, just calculate the z-value above without the absolute value signs, and then put a negative sign on your z-critical.

Why did we reject in the two-tail case and accept in the one-tail case? Because in the two-tail case, some of the weight of α had to go in the left tail, which turned out to be irrelevant in this case. That meant there was less weight to go in the right tail, and thus less chance of rejecting the null as a result of a high sample mean.

V. Getting Rid of the Bogus Assumptions

We assumed above that the true standard deviation was known. Just as with CIs, this is a weird assumption. Why would we know the true standard deviation but not the true mean? When we have a large sample, we can get away with substituting the sample standard deviation for the true one and continuing to use the z-distribution. This gives us the following z-score formula:

z = |x̄ − μ_H0| / (s/√n)

But what if you don't know the true standard deviation and the sample size is small? Then we have to use the t-distribution. We calculate a t-score instead of a z-score:

t = |x̄ − μ_H0| / (s/√n)

And then we find a t-critical value instead of a z-critical value.

Example: Same example as above, doing a two-tail test. But this time, we don't know that the standard deviation is 2, and our sample size was only 16. Our sample standard deviation turns out to be 1.9, and we use this to find our t-score:

t = |x̄ − μ_H0| / (s/√n) = |6.9 − 6.5| / (1.9/√16) = 0.84

In the t-table, with df = 16 − 1 = 15 and a 90% confidence level, t-critical is 1.75. Since 0.84 < 1.75, we do not reject the null.

If we had wanted a one-tail test, we'd have looked in the column of the table headed by 0.1000 (ignore the 0.8000 confidence level below it, because that assumes a two-tail test). We get 1.341. Since 0.84 < 1.341, we do not reject the null.

VI. P-Values

Remember that we could have picked any value of α.
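The small-sample t version is the same calculation with s in place of σ; a sketch in Python, with the df = 15 table values cited above (1.75 two-tail, 1.341 one-tail) hard-coded since we are working from a printed t-table:

```python
import math

x_bar, mu_0 = 6.9, 6.5   # sample mean and hypothesized mean
s, n = 1.9, 16           # sample sd and (small) sample size

# t-score: same formula as z, but with the sample standard deviation
t = abs(x_bar - mu_0) / (s / math.sqrt(n))
print(round(t, 2))  # 0.84

df = n - 1  # 15 degrees of freedom
print("two-tail:", "reject" if t > 1.75 else "do not reject")
print("one-tail:", "reject" if t > 1.341 else "do not reject")
```

With only 16 observations and the larger t cutoffs, 0.84 falls well short of both critical values, so neither version rejects the null.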
Picking a large one (such as 0.10) makes it more likely you'll reject the null hypothesis; picking a relatively small one (such as 0.01) makes it less likely. So for any given hypothesis test, you might ask: what is the lowest value of α that would still cause me to reject the null hypothesis? The answer is called the p-value.

[Show the standard bell curve for a one-tail test; mark the rejection region as α. Show a z-value that is to the right, so it would result in rejection. Then show the region that would correspond to the p-value: the region in the remaining tail to the right of the z-value.]

Example: In the one-tail test of CSUN graduation times, we found z = 1.4. Table 3 tells us the area to the left of 1.4 is 0.9192, so the area to the right of it is 0.0808. This is the p-value. So any α greater than 0.0808 will lead to rejection of the null; any α less than 0.0808 will lead to non-rejection. Or, to put it another way, 0.0808 is the lowest α that will result in rejection of the null.

But for a two-tail test, remember that the area to the right of your z-value (or to the left, if you have a negative z-value) is only one-half of α. So you need to double the area to the right (left) of it.

Example: In the above example, for the two-tail test, the p-value is 2(0.0808) = 0.1616. That's the lowest α that will lead to rejection of the null in the two-tail test.

It is possible to find p-values when we're using t-statistics as well. But the t-table in a book doesn't give us enough information to find the p-value with much precision. A statistical software program can do it for us, though.
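Instead of reading Table 3, the p-value can be computed directly from the standard normal CDF; here is a sketch in Python using the standard-library math.erf function:

```python
import math

def phi(z):
    # standard normal CDF: P(Z <= z)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = 1.4
p_one_tail = 1 - phi(z)      # area to the right of 1.4
p_two_tail = 2 * p_one_tail  # double it for a two-tail test

print(round(p_one_tail, 4))  # 0.0808
print(round(p_two_tail, 4))  # 0.1615
```

The two-tail value prints as 0.1615 rather than the lecture's 0.1616 only because the lecture doubles the already-rounded 0.0808; the underlying number is the same. The same approach handles t-statistics if you swap in a t-distribution CDF, which is why software has no trouble where the printed t-table runs out of precision.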