Department of Urban Studies and Planning
Massachusetts Institute of Technology

11.220 Quantitative Reasoning and Statistical Methods for Planning I
Spring 1999

Homework Set #5 Solutions
Comparison and Explanation: Experimental Design, Statistical Hypothesis Testing

Question 1 [note 1]

In a particular community, the mean water usage per household for the month of January 1993 was 0.060 acre-feet (an acre-foot is the amount of water necessary to cover one acre to a depth of one foot). This figure is based on the usage of ALL households in the community during that month. During 1994 a countywide water conservation campaign was conducted. In January 1995, a random sample of 50 homes was selected, and water usage was recorded for each home in the sample. The mean water usage per household in the sample was 0.054 acre-feet with a standard deviation (s) of 0.016 acre-feet.

Note 1: This question is adapted from Example 9.10, page 317 of Jay Devore and Roxy Peck, Statistics: The Exploration and Analysis of Data, third edition (Pacific Grove, California: Duxbury Press, 1997).

[6] (a) The county supervisors would like to know whether the data support the claim that mean January water usage has declined. Has it? Conduct an appropriate statistical hypothesis test and draw whatever conclusion you feel is justified by the data.

This is a one-tailed hypothesis-test problem:

$$H_0: \mu = 0.060 \qquad H_a: \mu < 0.060$$

where $\mu$ is the true mean household water usage for January 1995. Since this is a lower-tailed test, we compute the P-value by first calculating the value of the z test statistic:

$$z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{0.054 - 0.060}{0.016/\sqrt{50}} \approx \frac{-0.006}{0.0023} \approx -2.61$$

This result implies that the observed mean of 0.054 is roughly 2.6 standard errors below the hypothesized value. (The standard error is rounded to 0.0023 here; carrying full precision gives z of about -2.65 and a P-value of about 0.0040, which leads to the same conclusions.) The P-value is the area under the z-curve to the left of -2.61. From the Z-table (Table II of Weiss), the P-value = 0.0045. Using a 0.01 significance level, H0 would be rejected (since 0.0045 < 0.01), suggesting that mean usage was lower in January 1995.
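As a rough numerical check of the z test in part (a), the following minimal Python sketch recomputes the statistic and the lower-tail P-value from the summary figures in the problem. The function name `one_sample_z_lower` is illustrative and not part of the course materials, and the sketch assumes the sample is large enough for s to stand in for the population standard deviation. Because it does not round the standard error, it gives z nearer -2.65 than -2.61; the decisions at the 0.01 and 0.001 levels come out the same.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_lower(xbar, mu0, s, n):
    """Return (z, p_value) for a lower-tailed one-sample z test.

    Assumes n is large enough that s can be used in place of sigma.
    """
    se = s / sqrt(n)               # estimated standard error of the mean
    z = (xbar - mu0) / se          # standardized distance from mu0
    p_value = NormalDist().cdf(z)  # area under the z-curve to the left of z
    return z, p_value


# Summary figures from Question 1
z, p = one_sample_z_lower(xbar=0.054, mu0=0.060, s=0.016, n=50)
print(f"z = {z:.2f}, P-value = {p:.4f}")
print("Reject H0 at alpha = 0.01: ", p < 0.01)
print("Reject H0 at alpha = 0.001:", p < 0.001)
```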
Notice that the rejection of H0 would not be justified if a very small significance level such as 0.001 had been selected.

[6] (b) Of course, the county supervisors would like to conclude even more than simply that water usage has gone down. Comment briefly on the experimental design used here. How could it have been improved? What would that improvement have accomplished?

Since water usage is often related to income and education, a simple random sample may not be the best basis for formulating an environmental policy for the area. Indeed, the elasticity of demand for water may vary considerably as the socioeconomic texture of a neighborhood changes. It would thus be more appropriate to use stratified random sampling by household income level or by education.

It is also important to have a control in the study to offer some insight regarding causality. It may therefore be useful to compare the changes in water consumption in a nearby community with similar geographic and hydro-geological characteristics where a formal conservation program was not instituted. Such a control would help us determine whether the conservation program was actually a contributing factor or whether external factors were responsible for the change in water usage.

Question 2

The Internal Revenue Service has completed a study of tax compliance in the New England states. One of its conclusions was the following: “The percentage of taxpayers in Massachusetts that under-report their true incomes is significantly higher than the percentage of taxpayers in Connecticut who under-report their true incomes.” Surprisingly, despite this result, the Internal Revenue Service also announced that it was not going to invest any further resources in trying to decrease under-reporting in Massachusetts.
[6] (a) Explain how the decision not to attempt to decrease under-reporting in Massachusetts might be consistent with the fact that they found a “significant” difference between the two states. Use the notions of practical significance and statistical significance in your answer.

The statement implies that a statistical test was performed on the tax data and that the difference in under-reporting between the two states was “significant” in the sense that the researchers could reject the null hypothesis of no difference. However, the cost of enforcement to improve reporting in Massachusetts may be too high compared with the revenue lost to under-reporting. Hence a practical decision was made not to pursue the matter. In other words, the difference in under-reporting was statistically significant but practically insignificant, given various resource constraints.

Question 3 [note 2]

This problem is in honor of my son, a college freshman and golf team member. Recently Luc bought a new driver (the longest-hitting club). Prior to buying this club he had carefully collected statistics on how far he was able to hit the ball with his old driver. The mean distance was 200 yards. He has just begun using the new driver and is curious whether the average distance of his drives has gone up. He took a simple random sample of his drives using the new driver and came up with the following observations:

205 198 220 210 194
201 213 191 211 203

Note 2: This question is adapted from Exercise 9.53, page 337 of Devore and Peck, Statistics: The Exploration and Analysis of Data.

[6] (a) Based upon these observations can Luc conclude that the length of his drives has gone up, on average, with this new club?

First let us calculate the sample statistics for the data: n = 10, sample mean = 204.60, sample standard deviation (s) = 9.03.

We use a one-sample t test (one-tailed) with H0: mean distance = 200 and Ha: mean distance > 200:

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{204.60 - 200}{9.03/\sqrt{10}} = 1.61$$

Degrees of freedom = n - 1 = 9

The critical value of t for $\alpha = 0.05$ and 9 degrees of freedom is 1.833 (from Weiss, Table IV). Since the critical value of t is greater than the test statistic, we cannot reject our null hypothesis. Thus Luc’s new driver is not hitting significantly longer drives (in a statistical sense) than his old driver.

Alternatively, the P-value = 0.071 (from calculator). Since this value is greater than the usual significance level of 0.05, we fail to reject the null hypothesis. The values in Table IV indicate that we would only reject our null hypothesis if $\alpha$ were set at 0.1 (a rather high risk of Type I error).

[6] (b) What does a normal probability plot of the data tell you about the appropriateness of the significance test you used in part (a)? [We have only discussed this in passing in class, but using the text and Excel you should be able to construct a normal probability plot for these data and answer this question without difficulty.]

To construct the normal probability plot we arrange the data in ascending order and plot them against the normal scores for n = 10 (given in Table III of Weiss):

Shot Distance    Normal Score
191              -1.55
194              -1.00
198              -0.65
201              -0.37
203              -0.12
205               0.12
210               0.37
211               0.65
213               1.00
220               1.55

[Figure: Normal Probability Plot for Luc's Driver Shots. Horizontal axis: Distance; vertical axis: Normal Score.]

The plot is roughly linear, which means that it is reasonable to treat the data as coming from an approximately normal distribution, and hence the use of a t test is reasonable.

[3] (c) What is the most worrisome threat to internal validity in this “experiment?”

The most worrisome threat to internal validity in this problem is “maturation.” In other words, Luc could be performing better over time irrespective of the age or design of the club.
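The calculations in Question 3(a) and the linearity judgment in 3(b) can be checked with a short Python sketch. This is only an illustration under stated assumptions: the helper names (`t_pdf`, `upper_tail_p`, `pearson_r`) are made up for this example, the P-value is obtained by numerically integrating the t density with Simpson's rule rather than from a table or calculator, and the correlation between the ordered data and the normal scores is used here as a rough numerical stand-in for judging the plot by eye (values near 1 suggest a nearly straight plot).

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev

# Drive distances (yards) from Question 3
drives = [205, 198, 220, 210, 194, 201, 213, 191, 211, 203]
MU0 = 200          # mean distance with the old driver
T_CRIT_05 = 1.833  # t table value for alpha = 0.05, df = 9
# Normal scores for n = 10, as listed in the solution to part (b)
NORMAL_SCORES = [-1.55, -1.00, -0.65, -0.37, -0.12,
                 0.12, 0.37, 0.65, 1.00, 1.55]


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def upper_tail_p(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


n = len(drives)
t_stat = (mean(drives) - MU0) / (stdev(drives) / sqrt(n))  # one-sample t
p_value = upper_tail_p(t_stat, df=n - 1)                   # upper-tailed
r = pearson_r(sorted(drives), NORMAL_SCORES)

print(f"mean = {mean(drives):.2f}, s = {stdev(drives):.2f}")
print(f"t = {t_stat:.2f}, one-tailed P-value = {p_value:.3f}")
print("Reject H0 at alpha = 0.05:", t_stat > T_CRIT_05)
print(f"correlation of ordered data with normal scores = {r:.3f}")
```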
Question 4 [note 3]

The following text setting up this problem is taken from this year’s midterm exam:

The McClatchy News Service (San Luis Obispo Telegram-Tribune, June 13, 1991) conducted a study to measure the degree of violence on television. They reported the results of three simple random samples of prime-time television hours. The following table summarizes the information reported for the three major networks:

Network    Mean Number of Violent Acts Per Hour
ABC        15.6
CBS        11.9
NBC        11.0

Suppose that each of these sample means was computed on the basis of viewing 50 randomly selected prime-time hours and that the population standard deviation for each of the networks is known to be $\sigma = 5$.

Note 3: This question is adapted from Question 8.15, pages 282-283 of Devore and Peck, Statistics: The Exploration and Analysis of Data.

[6] (a) The questions that you were asked in the midterm based on these data asked you to focus on them from an estimation point of view. You calculated confidence intervals and looked at the overlap between those confidence intervals to compare the level of violence on ABC with that on the other networks. We now have the mathematical tools to approach this in a more mathematically precise fashion. Conduct a statistical hypothesis test to decide whether or not you can conclude based on these sample results that ABC programs are more violent, on average, than CBS programs.

H0: Mean violent acts per hour are the same for both networks ($\mu_{ABC} = \mu_{CBS}$)
Ha: Mean violent acts per hour for ABC > mean violent acts per hour for CBS ($\mu_{ABC} > \mu_{CBS}$)

We need to do a z test for two population means with independent samples. The z statistic in this case is calculated by the following formula:

$$z = \frac{(\bar{x}_{ABC} - \bar{x}_{CBS}) - (\mu_{ABC} - \mu_{CBS})}{\sqrt{\sigma_{ABC}^2/n_{ABC} + \sigma_{CBS}^2/n_{CBS}}}$$
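The two-sample z statistic defined above can be evaluated numerically with a few lines of Python. This is a minimal sketch using the figures given in the problem (sample means 15.6 and 11.9, known standard deviation 5, and 50 hours per network); the function name `two_sample_z_upper` and the parameter `delta0` (the hypothesized difference in means, zero under H0) are illustrative choices, not part of the course materials.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_upper(xbar1, xbar2, sigma1, sigma2, n1, n2, delta0=0.0):
    """Return (z, p_value) for an upper-tailed two-sample z test.

    Assumes independent samples and known population standard deviations.
    """
    se = sqrt(sigma1**2 / n1 + sigma2**2 / n2)  # standard error of the difference
    z = ((xbar1 - xbar2) - delta0) / se         # standardized observed difference
    p_value = 1 - NormalDist().cdf(z)           # area to the right of z
    return z, p_value


# ABC versus CBS, figures from Question 4
z, p = two_sample_z_upper(xbar1=15.6, xbar2=11.9,
                          sigma1=5, sigma2=5, n1=50, n2=50)
print(f"z = {z:.2f}, P-value = {p:.5f}")
print("Reject H0 at alpha = 0.01:", p < 0.01)
```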
Under the null hypothesis, $\mu_{ABC} - \mu_{CBS} = 0$. The standard error term in the denominator happens to equal 1 in this case, since $\sqrt{25/50 + 25/50} = 1$. Therefore z = 15.6 - 11.9 = 3.7.

This value gives a P-value = 1 - 0.9999 = 0.0001, which means that we reject our null hypothesis. Therefore the difference between ABC and CBS in violent acts per hour is statistically significant.

[3] (b) Without actually performing the calculations, given what you have discovered in your answer to part (a), what can you conclude about whether ABC programs are more violent, on average, than NBC programs?

The P-value for the result in part (a) is very low, and the difference in violence between ABC and NBC is even greater than that between ABC and CBS (with the same sample sizes and standard deviations, so the standard error is unchanged). We would therefore also reject the null hypothesis that the mean violent acts per hour on the two networks are equal, in favor of the alternative that ABC programs average more violent acts per hour than NBC programs.

Question 5 [note 4]

A study compares two groups of mothers with young children who were on welfare two years ago. One group attended a voluntary training program offered free of charge at a local vocational school that was advertised in the local news media. The other group did not choose to attend the training program. A researcher finds out about this “natural experiment” (an experiment that has happened without being designed by a researcher) and is interested in seeing what can be concluded from these results. The researcher takes all of the data available from the vocational training program and calculates the proportion of the mothers who are still on welfare. Furthermore, he decides to treat these data as though they are the result of a simple random sample taken from all of those who might have enrolled in the voluntary training program. [I know that this seems odd, but the fact is that it is often done, though that decision is typically not as explicit as I have made it here.]
He then takes a simple random sample of welfare mothers with children who decided not to participate in the program. [For the purposes of this problem, assume that the researcher can identify this parent population and take a simple random sample from it.] Finally, he conducts a statistical hypothesis test and concludes: “I conclude (1) that the proportion of mothers who enrolled in voluntary training and were still on welfare is lower than the proportion of mothers who did not enroll in the program and were still on welfare and (2) that that difference is statistically significant (P < .01).”

You are on the staff of a member of Congress who is interested in the plight of welfare mothers. She asks for your advice:

[6] (a) Explain to the Congresswoman what statistical hypothesis test the researcher conducted. You will need to explain why a statistical hypothesis test was called for (rather than just simply looking at the observed difference between the two proportions).

The researcher probably conducted a pooled two-sample test for the difference between two proportions (a two-proportion z test) on the two groups of welfare mothers to determine whether or not the observed difference in proportions is statistically significant. A test was called for because two samples will almost always differ somewhat through sampling variability alone. The test provides a probabilistic measure of how likely we are to be correct in rejecting our null hypothesis and asserting that there is a difference beyond just a “chance” occurrence. However, the test does not in itself establish causality, since there may be other exogenous factors at play. Controls need to be instituted in the experimental design to ensure that the difference was not caused by these factors.

Note 4: This question is adapted from Review Exercise 8.59, pages 513-514, of David S. Moore, Statistics: Concepts and Controversies, fourth edition (New York: W. H. Freeman, 1996). Please note that the wording is not identical to the original.

[4] (b) Explain to the Congresswoman in simple language what “statistically significant (P < .01)” means.
“P” stands for the probability that a difference at least as large as the one observed in the study samples would occur if the two populations actually had the same proportion still on welfare. “P < .01” means that this probability is less than 1 in 100, so chance alone is an unlikely explanation for the observed difference.

[4] (c) Is this study good evidence that requiring job training of all welfare mothers would reduce the percent who remain on welfare several years later? In answering this question you may want to think of issues of both internal and external validity.

This evidence is not convincing on its own. Treatments were not randomly assigned, but instead were chosen by the mothers. Mothers who chose to attend a job training program may be more inclined to get themselves off welfare (a self-selection threat to internal validity). For the same reason, results for volunteers may not generalize to mothers who would be required to attend (a threat to external validity).

Question 6

NO GENERIC SOLUTIONS FOR THIS QUESTION

Pick a question that is of interest to you and that you believe could be answered through carefully designing an experiment, conducting the experiment, and conducting a statistical hypothesis test on the collected data in order to draw conclusions.

[2] (a) Clearly pose your question.

[6] (b) Describe the experiment that you would use and explain how its design would help you to eliminate threats to internal and external validity.

[6] (c) Invent data that might have resulted from your experiment and conduct an appropriate statistical hypothesis test.

[4] (d) Draw whatever conclusions are appropriate from your calculations.
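Returning to Question 5(a): the problem gives no counts, so the kind of test the researcher may have run can only be illustrated with made-up numbers. The sketch below runs a pooled two-proportion z test on purely hypothetical counts (40 of 200 trained mothers and 75 of 250 untrained mothers still on welfare); these figures are invented for illustration and do not come from the study, and the function name `two_proportion_z_lower` is likewise an assumption of this example.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_lower(x1, n1, x2, n2):
    """Return (z, p_value) for a lower-tailed pooled two-proportion z test.

    Tests H0: p1 = p2 against Ha: p1 < p2, assuming independent
    simple random samples that are large enough for the normal approximation.
    """
    p1_hat, p2_hat = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                       # common proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = NormalDist().cdf(z)                        # area to the left of z
    return z, p_value


# HYPOTHETICAL counts, invented purely for illustration
z, p = two_proportion_z_lower(x1=40, n1=200,   # trained group, still on welfare
                              x2=75, n2=250)   # untrained group, still on welfare
print(f"z = {z:.2f}, P-value = {p:.4f}")
print("Statistically significant at the 0.01 level:", p < 0.01)
```

Even with a small P-value, a result like this says nothing about why the two proportions differ, which is the point made in the answer to part (c).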