Chapter 12: Testing hypotheses about single means (z and t)

Example: Suppose you have the hypothesis that UW undergrads have a higher average IQ than the US population. You know that the IQs of the whole population are normally distributed with a mean of 100 and a standard deviation of 15. How would you test your hypothesis?

The solution is to obtain a random sample of IQs from the UW population, calculate the mean, and compare it to 100. Let’s say we measure the IQs of 25 students and obtain a mean of 106 points. Is a mean of 106 points really different from 100?

We need to compare the mean from our sample to the US population mean of 100 and ask the question: What if the population of UW students really has a mean IQ of 100, like the US population? How unlikely would it be for us to make this observation by chance? More specifically, how likely would it be for us to draw a sample mean that exceeds 100 by 6 or more points by chance? If it’s sufficiently unlikely, we’d consider this evidence in favor of our hypothesis that UW students have higher IQs.

More formally, we call the thing we’re trying to prove wrong the null hypothesis (H0), and the thing we’re trying to show to be true the alternative hypothesis (HA). In our example, the null hypothesis is that there is no difference between the mean IQ scores of UW students and the US population. The alternative hypothesis is that UW students have a higher mean IQ than the US population.

We compute a statistic from a sample and determine how probable a statistic at least as extreme as the one we observed would be if the null hypothesis were true. If this probability is sufficiently low, we reject the null hypothesis. Our criterion probability for rejection is called the ‘alpha (α) value’. Choosing a value of alpha is both complicated and somewhat arbitrary (more on this later). But typical values are α = 0.05 or α = 0.01.
An alpha value of 0.05 means that we reject the null hypothesis when there is less than a .05 probability (1 in 20) of observing a sample statistic at least as extreme as ours if the null hypothesis were true.

Here is a step-by-step recipe for hypothesis testing using our UW IQ example.

Step 1: Define the target population. This is the population that we want to make an inference about. In this case, we want to make an inference about UW undergrad IQs.

Step 2: Specify the null hypothesis (H0). This is the hypothesis we hope to reject. In our example, our null hypothesis is that UW students have a mean IQ of 100. We write this as: H0: μ_X = 100

Step 3: Specify the alternative hypothesis (HA). We must choose between a directional (‘one-tailed’) or non-directional (‘two-tailed’) test here. In our example we are expecting (hoping for) an IQ that is greater than that of the population, so this is a directional, or one-tailed, test. HA: μ_X > 100

Step 4: Specify the ‘level of significance’ (α) to be used as a criterion for the decision. This is the probability with which we will reject the null hypothesis by chance when it is actually true. We’ll choose α = .05 for this example.

Step 5: Decide on a sample size (n) and draw a random sample from our target population. In our example, our sample had 25 students.

Step 6: Calculate your statistic on your sample (the mean in our example). In our example, we obtained a mean IQ of 106 points.

Step 7: Convert your statistic into standard units with respect to your null hypothesis. In our example, we’ll calculate the z-score using the standard error of the mean:

$$\sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}} = \frac{15}{\sqrt{25}} = 3$$

$$z = \frac{\bar{X} - \mu_{hyp}}{\sigma_{\bar{X}}} = \frac{106 - 100}{3} = 2$$

Step 8: Reject H0 if our observed mean is located in the ‘region of rejection’. For the standard normal (z) distribution, the region of rejection is the upper tail containing a proportion of area equal to α = .05. Looking this up in Table A (Column C), this corresponds to a value of z = 1.645.
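The recipe above can be sketched in a few lines of Python. This is only an illustration and not part of the original notes: the variable names are our own, and the critical value is computed with the standard library's `NormalDist` rather than looked up in Table A.

```python
from math import sqrt
from statistics import NormalDist

# Values from the UW IQ example (Steps 2 to 6)
mu_hyp = 100        # mean under the null hypothesis
sigma = 15          # known population standard deviation
n = 25              # sample size
sample_mean = 106   # observed sample mean
alpha = 0.05        # level of significance, one-tailed (upper)

# Step 7: standard error of the mean and z-score
se = sigma / sqrt(n)
z = (sample_mean - mu_hyp) / se

# Step 8: critical z that leaves an area of alpha in the upper tail
z_crit = NormalDist().inv_cdf(1 - alpha)
reject = z > z_crit

print(f"se = {se}, z = {z}, critical z = {z_crit:.3f}, reject H0: {reject}")
```

Changing `alpha` to 0.01 moves the critical value further into the tail, so the same observed z can lead to a different decision.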
[Figure: standard normal distribution with the upper-tail region of rejection shaded (area = α = .05) and the observed z = 2 marked; x-axis: z score]

Our observed mean corresponds to z = 2, which is within the region of rejection. This means that our observation would be unlikely if the null hypothesis were true. We therefore reject the null hypothesis. We say that “our study shows that UW students have statistically significantly higher IQs than the US population using a criterion value of α = .05.”

What if we had chosen a criterion of α = .01 instead of .05? This corresponds to a rejection region for values of z greater than 2.33. Our observation of z = 2 does not fall into this region, so in this case we would fail to reject H0.

[Figure: standard normal distribution with the upper-tail region of rejection shaded (area = α = .01) and the observed z = 2 marked; x-axis: z score]

If our choice of criterion (α) seems arbitrary, that’s because it is. To give the reader more information, we can report the probability of an observation at least as extreme as ours under the null hypothesis. In our example, this is the area under the curve above z = 2, or Pr(z > 2) = .0228. This value is often called the p-value, and we write p = .0228. Note that this p-value falls between our two α values of 0.05 and 0.01.

Another example: Suppose we have a drug that we think can influence IQ values. How would we test if this drug has an effect?

Step 1: Define the target population. We’ll be randomly sampling from the US population this time.

Step 2: Specify the null hypothesis (H0). Like before, our null hypothesis is H0: μ_X = 100

Step 3: Specify the alternative hypothesis (HA). By ‘influence’ we’re not specifically predicting an increase (or decrease) in IQ. So we’ll use a two-tailed test and write: HA: μ_X ≠ 100

Step 4: Specify the ‘level of significance’ (α) to be used as a criterion for the decision. We’ll choose α = .05 again.

Step 5: Decide on a sample size (n). Let’s run our experiment on 100 subjects.

Step 6: Calculate your statistic on your sample (the mean in our example). Suppose we obtained a mean IQ of 96 from our 100 subjects.

Step 7: Convert your statistic into standard units with respect to your null hypothesis.
In our example, we’ll calculate the z-score using the standard error of the mean:

$$\sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}} = \frac{15}{\sqrt{100}} = 1.5$$

$$z = \frac{\bar{X} - \mu_{hyp}}{\sigma_{\bar{X}}} = \frac{96 - 100}{1.5} = -2.67$$

Step 8: Reject H0 if our observed mean is located in the ‘region of rejection’. We want to find the values of z that have an area of α/2 = .025 in each tail. This corresponds to values of z = ±1.96.

[Figure: standard normal distribution with both tails shaded as the region of rejection (area = α/2 = .025 in each tail) and the observed z = -2.67 marked; x-axis: z score]

Our observed mean corresponds to z = -2.67, which is within the region of rejection. We therefore reject the null hypothesis. We conclude that our drug has a significant influence on IQ values at a criterion level of α = .05.

p-values: Calculating a p-value for a two-tailed test corresponds to calculating the area under the standard normal curve in both tails, beyond the absolute value of our observed z in the positive and negative directions.

[Figure: standard normal distribution with the areas below z = -2.67 and above z = +2.67 shaded (area = .0038 in each tail); x-axis: z score]

The area below z = -2.67 is .0038, which is the same as the area above z = +2.67. So our p-value is p = .0038 × 2 = .0076. The p-value is the probability of observing a statistic at least as extreme as ours if H0 is actually true.

Note that if we had decided ahead of time to use a one-tailed test with an alternative hypothesis of HA: μ_X > 100, our region of rejection for α = .05 would include only values of z greater than 1.645. Our observed z = -2.67 is not in that region, so we would have failed to reject H0 and would conclude that our drug did not significantly increase IQs at a criterion level of α = .05.

[Figure: standard normal distribution with the upper-tail region of rejection shaded (area = α = .05) and the observed z = -2.67 marked; x-axis: z score]

The t-distribution: when we don’t know σ

What if we don’t know the standard deviation of the population from which we obtained our sample? This is a much more common situation. How do we estimate this value? Common sense says that we’d use the standard deviation of our sample as an estimate of the population’s standard deviation (and therefore use the standard error of the mean of our sample as an estimate of the population’s standard error of the mean).
This is generally correct, but there are three things to note:

1) We need to change our formula for the standard deviation to use n-1 instead of n:

$$s_X = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$

2) To get our estimate of the population’s standard error of the mean, we still divide by the square root of our sample size:

$$s_{\bar{X}} = \frac{s_X}{\sqrt{n}}$$

3) Our standardized measure no longer comes from a normal distribution. Instead, it comes from what is called a ‘t-distribution’:

$$t = \frac{\bar{X} - \mu_{hyp}}{s_{\bar{X}}}$$

What happened to our normal distribution? Note that now the mean and the standard error of the mean both vary for different samples. This increases the probability of very high and very low values, which fattens the tails of the distribution compared to the normal.

[Figure: t-distributions for n = 2, n = 4, and n = 12, compared with the normal distribution (z, n = ∞); x-axis: t]

Unlike our standard normal distribution, our t-distributions are a ‘family’ of curves, one for each sample size (n). We label each family member not by sample size but by ‘degrees of freedom (df)’, which is equal to n-1 for the examples we’re doing here (comparing a single mean to an expected population mean).

[Figure: the same curves labeled by degrees of freedom: df = 1, df = 3, and df = 11, compared with the normal distribution (z, n = ∞); x-axis: t]

Example: The mean height of the 72 women in our class is 64.5 inches with a standard deviation of 3.28 inches. Is this significantly taller than 64 inches, which is the average height of a woman in the US?

Step 1: Define the target population. We are interested in the heights of the women in our class.

Step 2: Specify the null hypothesis (H0). Our null hypothesis is H0: μ_X = 64

Step 3: Specify the alternative hypothesis (HA). We’ll use a one-tailed test, since we’re asking if our mean is taller. HA: μ_X > 64

Step 4: Specify the ‘level of significance’ (α) to be used as a criterion for the decision. We’ll choose α = .05 again.
Step 5: Decide on a sample size (n). We have a sample of 72 women.

Step 6: Calculate your statistics on your sample. Our sample mean is 64.5 inches and our sample standard deviation is 3.28 inches.

Step 7: Convert your statistic into standard units with respect to your null hypothesis. Since we don’t know the population standard deviation, we’ll use our sample standard deviation and the t-distribution with 72-1 = 71 degrees of freedom:

$$s_{\bar{X}} = \frac{s_X}{\sqrt{n}} = \frac{3.28}{\sqrt{72}} = .3866$$

$$t = \frac{\bar{X} - \mu_{hyp}}{s_{\bar{X}}} = \frac{64.5 - 64}{.3866} = 1.29$$

Step 8: Reject H0 if our observed mean is located in the ‘region of rejection’. We will use Table D, which contains critical values for the t-distribution, and look up the entry for an area of α = .05 in one tail and df = 71. The nearest df in the table is 70. Our region of rejection is for values of t greater than 1.667.

[Figure: t-distribution with the upper-tail region of rejection shaded (area = α = .05) and the observed t = 1.29 marked; x-axis: t]

Our observed mean is not within the region of rejection. We therefore fail to reject the null hypothesis. We conclude that the average height of women in our class is not significantly greater than that of the US population at a criterion of α = .05.

Calculating p-values using Table D is pretty crude since we only have a limited set of alpha values to choose from. It turns out that our value of t = 1.29 with df = 71 is very close to the critical value of t for α = .10, so for this example our p-value is close to 0.1. For a more exact value we can always use the t-statistic calculator in the Excel spreadsheet, which gives p = .1006. This means that if we drew a random sample of 72 heights from women in the US population, there is about a 10% chance that we’d observe a mean as high as or higher than the mean of our class. Our mean is therefore above average, but not exceptionally so.

[Figure: t-distribution (df = 71) with the area above the observed t = 1.29 shaded (area = 0.1006); x-axis: t]

Example: The 21 men in our class have a mean height of 70.3 inches with a standard deviation of 2.61 inches. Is this significantly different from 69.5 inches, the average height of a man in the US?

Step 1: Define the target population.
We are interested in the heights of the men in our class.

Step 2: Specify the null hypothesis (H0). Our null hypothesis is H0: μ_X = 69.5

Step 3: Specify the alternative hypothesis (HA). We’ll use a two-tailed test, since we’re asking if our mean is different. HA: μ_X ≠ 69.5

Step 4: Specify the ‘level of significance’ (α) to be used as a criterion for the decision. We’ll choose α = .05 again.

Step 5: Decide on a sample size (n). We have a sample of 21 men.

Step 6: Calculate your statistics on your sample. Our sample mean is 70.3 inches and our sample standard deviation is 2.61 inches.

Step 7: Convert your statistic into standard units with respect to your null hypothesis. Since we don’t know the population standard deviation, we’ll use our sample standard deviation and the t-distribution with 21-1 = 20 degrees of freedom:

$$s_{\bar{X}} = \frac{s_X}{\sqrt{n}} = \frac{2.61}{\sqrt{21}} = .5695$$

$$t = \frac{\bar{X} - \mu_{hyp}}{s_{\bar{X}}} = \frac{70.3 - 69.5}{.5695} = 1.40$$

Step 8: Reject H0 if our observed mean is located in the ‘region of rejection’. We will use Table D, which contains critical values for the t-distribution, and look up the entry covering two tails for α = .05 and df = 20. The critical t-value for two tails with α = .05 is the same as the critical t-value for one tail with α = .05/2 = .025. This is because for two tails, our total area of .05 needs to be split into two halves. Our region of rejection is for values of t greater than 2.09 or less than -2.09.

[Figure: t-distribution (df = 20) with both tails shaded as the region of rejection (area = 0.025 beyond t = -2.09 and beyond t = 2.09) and the observed t = 1.4 marked; x-axis: t]

Our observed mean is not within the region of rejection. We therefore fail to reject the null hypothesis. We conclude that the average height of men in our class is not significantly different from that of the US population at a criterion of α = .05.

Looking at Table D, our observed value of t = 1.40 for df = 20 falls outside the rejection region for an alpha value of .05 (two-tailed). This means that our p-value is greater than .05.
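A table only brackets the p-value, but it can also be approximated numerically. The sketch below is our own illustration, not the course's Excel calculator: `t_pdf` and `t_upper_tail` are hypothetical helper names, and the tail area is obtained by integrating the t density with Simpson's rule using only the Python standard library.

```python
import math

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))

def t_upper_tail(t, df, steps=2000):
    """Approximate Pr(T > t) for t >= 0 via Simpson's rule on [0, t].

    steps must be even. The density is symmetric about 0, so the area
    above t is 0.5 minus the area between 0 and t.
    """
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Men's height example: n = 21, mean = 70.3, sd = 2.61, H0 mean = 69.5
n, sample_mean, sample_sd, mu_hyp = 21, 70.3, 2.61, 69.5
se = sample_sd / math.sqrt(n)          # estimated standard error of the mean
t_obs = (sample_mean - mu_hyp) / se    # observed t statistic
df = n - 1

# Two-tailed test: double the area beyond |t_obs| in one tail
p_two_tailed = 2 * t_upper_tail(abs(t_obs), df)
print(f"se = {se:.4f}, t = {t_obs:.2f}, df = {df}, p = {p_two_tailed:.4f}")
```

For a one-tailed test, the same helper would be used without the factor of 2.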
The true p-value from our t-test calculator is .0884 + .0884 = 0.1768.

[Figure: t-distribution (df = 20) with the areas below t = -1.4 and above t = 1.4 shaded (area = 0.0884 in each tail); x-axis: t]

Example: in the news.

"Freshman 15" weight gain is a myth, new study finds

Reuters - The idea that college freshmen gain an average of 15 pounds in their first year of school is a myth -- the average is really between 2.4 pounds for women and 3.4 pounds for men, the co-author of a new study said Tuesday. "Not only is there not a 'Freshman 15,' there doesn't appear to be even a 'college 15' for most students," said Jay Zagorsky, research scientist at Ohio State University's Center for Human Resource Research and co-author of a study on college weight gain.

Here’s a table of weight gain (in pounds) from the actual publication (Zagorsky & Smith, Social Science Quarterly, 2011):

                 Male Freshman    Female Freshman
Mean (pounds)         3.1              3.5
sd                   10.1             10.3
n                    2536             2151

Let’s look at the women. We don’t know the population standard deviation, so we’ll use a t-test.

H0: μ_X = 15
HA: μ_X ≠ 15

$$s_{\bar{X}} = \frac{s_X}{\sqrt{n}} = \frac{10.3}{\sqrt{2151}} = .2221$$

$$t = \frac{\bar{X} - \mu_{hyp}}{s_{\bar{X}}} = \frac{3.5 - 15}{.2221} = -51.8$$

If we use α = .01, then with n-1 = 2150 degrees of freedom, our critical value of t for a nondirectional (two-tailed) test is ±2.58. Our observed value of t falls way into the rejection region, so we conclude that college freshmen do not gain an average of 15 lbs.

“Our results indicate that the “Freshman 15” is a media myth. While freshmen do gain weight, the observed average increase of 2.5 to 3.5 pounds falls far short of the ominous 15 pounds.”
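The Freshman 15 calculation can be checked with a short Python sketch. This is our own illustration: `one_sample_t` is a hypothetical helper name, and because the degrees of freedom are so large, the code assumes the normal quantile is an adequate stand-in for the two-tailed critical value of t.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_t(sample_mean, sample_sd, n, mu_hyp):
    """Return (standard error, t statistic, degrees of freedom)."""
    se = sample_sd / sqrt(n)
    t = (sample_mean - mu_hyp) / se
    return se, t, n - 1

# Female freshmen, from the table: mean 3.5 pounds, sd 10.3, n 2151
se, t_obs, df = one_sample_t(3.5, 10.3, 2151, mu_hyp=15)

alpha = 0.01
# Assumption: with df this large the t-distribution is nearly normal,
# so the normal quantile approximates the two-tailed critical t.
crit = NormalDist().inv_cdf(1 - alpha / 2)
reject = abs(t_obs) > crit

print(f"se = {se:.4f}, t = {t_obs:.1f}, df = {df}, "
      f"critical value = ±{crit:.2f}, reject H0: {reject}")
```

The male column of the table can be passed to the same function by swapping in its mean, standard deviation, and sample size.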