Statistics 101, Section 001: December 14, 2002
Final Exam ANSWERS

Instructions: Write your answers on the exam in the spaces after the questions. For maximum credit, show all work. Writing an answer without showing work may not receive full credit. You are permitted to use four sheets of paper filled with whatever information you put on them. Other notes, texts, or pieces of paper are not permitted. You cannot work with or ask questions of others. If you need clarification on any part of the exam, contact Prof. Reiter.

Provide the information requested below in the adjacent empty spaces.

NAME (print):
LAB SECTION:
Honor Pledge: "I have not given or received assistance on this exam while taking the exam."
SIGNATURE:

Page     Points Possible     Score
3        18
4        12
5        15
6        15
7        15
8        15
9        20
10       20
11       20
Total    150

QUESTIONS 1 – 17 REFER TO THE DATA SET DESCRIBED BELOW

What factors are related to the formation of hurricanes and tropical storms? To assess this question, W. Gray (1998) gathered storm data for each year from 1950 to 1997. The variables include:
--- The number of hurricanes in the year.
--- The number of tropical storms in the year. (Tropical storms are serious but not quite hurricanes.)
--- The value of a commonly used storm index score. A score of 100 is an average year, and a score above 100 is a year when storms are stronger than average.
--- Whether West Africa experienced a wet or dry year.
--- Whether the El Nino effect was cold, neutral, or warm. That is, whether the ocean temperatures in the Pacific were colder than usual, about the same as usual, or warmer than usual.

There are no missing data, so that there are 48 observations.

There are no problems on this page. The problems begin on the next page.

Questions 1 – 10 and 13-17 are worth three points each.

Below are histograms for hurricanes, tropical storms, and storm index.

[Figure: three histograms titled Number of Hurricanes, Number of Tropical Storms, and Storm index]

QUESTIONS BEGIN HERE

1.
Order the three variables by the values of their standard deviations (SD). Write the variable name next to each choice.
Largest SD: storm index
In Between SD: tropical storms
Smallest SD: hurricanes

2. Estimate the following quantities for number of hurricanes in a given year:
Median: 5.5
Mean: 5.75
SD: 2.36

3. Estimate the percentage of years that have between four and eight hurricanes. Include the years with four or eight in your estimate.
68%

4. True or False: The median storm index is larger than 100.
FALSE

5. True or False: A normal probability plot for the storm index would show the points on an approximately straight line.
FALSE

6. Suppose the numbers of tropical storms in 1998-2002 equal 9, 10, 9, 9, and 10. What happens to the SD of number of tropical storms after adding these five values to the 1950-1997 data? Circle one choice.
It decreases (these points are all very close to the mean)

Below is a box plot of hurricanes for the three types of El Nino effects.

[Figure: Oneway analysis of number of hurricanes by El Nino effect; box plots of hurricanes (vertical axis, 0 to 14) for el.nino = cold, neutral, warm]

7. Which of the following statements is true?
TRUE: The typical deviation from the average for cold El Nino years is larger than the typical deviation from the average for warm El Nino years.
____: The typical deviation from the average for cold El Nino years is smaller than the typical deviation from the average for warm El Nino years.

8. Estimate the percentage of neutral El Nino years with five or more hurricanes.
65%

9. Estimate the following differences in median number of hurricanes:
(median for cold – median for neutral): 0 (both 6.5)
(median for cold – median for warm): 3 (6.5 – 3.5)
(median for neutral – median for warm): 3 (6.5 – 3.5)

10. True or False: The data suggest that cold El Nino effects are associated with increased hurricane development and that warm El Nino effects are associated with decreased hurricane development.
TRUE.

Below is a box plot of the relationship between hurricanes and whether West Africa is wet or dry.

[Figure: Oneway Analysis of hurricanes By west.africa; box plots of hurricanes (vertical axis, 0 to 14) for west.africa = dry, wet]

Means and Std Deviations
Level    Number    Mean       Std Dev
Dry      28        5.17857    1.90620
Wet      20        6.55000    2.74293

Consider these 48 years as a random sample of possible hurricane seasons. There is no apparent time trend in the data, so this assumption is reasonable. Assume the Central Limit Theorem holds in each group.

11. (10 points) Researchers theorize that wet years in West Africa have more hurricanes on average than dry years do. Test this claim with a significance test. Report your null and alternative hypotheses, the value of the test statistic, the p-value, and your conclusions. Assume p-values near 0.05 are small.

Let $\mu_d$ be the average number of hurricanes per season in the population of hurricane seasons in which West Africa is dry. Let $\mu_w$ be the average number of hurricanes per season in the population of hurricane seasons in which West Africa is wet.

Ho: $\mu_w = \mu_d$. Ha: $\mu_w > \mu_d$.

We want to see if the difference in sample averages could be plausibly explained by random chance if in fact there is no difference in the two population averages. The test statistic equals:

$t = \dfrac{6.55 - 5.17857}{\sqrt{2.74293^2/20 + 1.9062^2/28}} = 1.93$

The p-value corresponding to this test is the area under the normal curve (actually, it's the area under a t-curve, but we ignore that distinction) to the right of 1.93. This equals 0.027. There is only a 2.7% chance of seeing a difference in the sample averages that is as or more extreme than (6.55 - 5.17857) when the null hypothesis is true. Hence, we reject the null hypothesis. The data suggest that years in which West Africa is wet have more hurricanes on average than years in which it is dry.

12. (5 points) Is it reasonable to expect the Central Limit Theorem to apply within each group? Explain in no more than four sentences.
The problem said to consider these data a simple random sample of hurricane years, since there was no apparent relationship in hurricane amounts across years. Hence, we need to consider whether the data in each group come from approximately normal curves. Looking at the box plots, we see that the data are roughly symmetric, and there are no outliers. Hence, it seems reasonable for a normal curve to describe the data, so that the CLT should kick in.

Below is a scatter plot of number of hurricanes by number of tropical storms.

[Figure: Bivariate Fit of hurricanes By storms; scatter plot with fitted line, hurricanes (vertical axis, 0 to 14) against storms (horizontal axis, 5 to 20)]

13. Estimate the slope and intercept of the regression line:
Slope: 0.61
Intercept: 0.01

14. Estimate the correlation between number of hurricanes and number of tropical storms:
0.83

15. Estimate the typical deviation of hurricane values around the regression line:
1.33

16. For years in which there are ten tropical storms, estimate the chance that there will be seven or more hurricanes.

Since the regression line fits the data well, we can use a normal curve to estimate this probability. The mean of the normal curve equals the predicted value on the regression line when there are 10 storms. This equals: 0.01 + 0.61(10) = 6.11. The SD for the normal curve equals the RMSE from question 15, which is 1.33. Hence, the standardized value of 7 hurricanes equals: (7 - 6.11)/1.33 = 0.67. The area under the normal curve to the right of 0.67 equals about 0.25.

17. True or False: The data suggest that seasons with high numbers of tropical storms also have high numbers of hurricanes.
TRUE (the slope of the line is positive. Answers that said false because extrapolation may not be justified were also given credit.)

18. Hot Streaks (5 points per part)

(i) Suppose a baseball player has a 30% chance of getting a hit in any attempt, and that each attempt is independent of other attempts. The player makes four attempts in a game.
What is the chance that the player will get at least one hit in a game?

Pr(at least one hit in 4 at bats) = 1 – Pr(no hits in 4 at bats) = $1 - (0.7 \times 0.7 \times 0.7 \times 0.7) = 1 - 0.7^4 \approx 0.76$

(ii) Suppose attempts are not independent. What parts of your calculations in part (i) would not be correct? For your answer, write the exact steps in your calculations that would not be correct.

We could not multiply the 0.7s because the chances of getting a hit on attempts other than the first one would not equal 0.7. They would depend on outcomes of previous attempts.

(iii) During the 1978 baseball season, Pete Rose got at least one hit in 44 consecutive games. Assume that, in any attempt, Rose has a 30% chance of getting a hit, and that he makes four attempts per game. Further, assume that each attempt is independent of other attempts. What is the chance that Rose would get at least one hit in 44 consecutive games?

Pr(at least one hit in 44 consecutive games) = $[1 - 0.7^4]^{44} \approx 5.7 \times 10^{-6}$

A small number of people interpreted the problem to be the chance that he gets at least one hit in all his attempts over the 44 games. That is, the chance he gets at least one hit in (44*4) attempts. For those who interpreted it this way, we graded it so that you got credit if you got the problem right under this interpretation.

19. Samples and Sample Averages (3 points per answer)

Using a census list provided by the North Carolina state government, a Stat 101 student selects a random sample of 100 households from North Carolina. She records the number of people living in each household. She then takes a separate random sample of 100 households using the same list (it is possible to pick households from the first sample again). She again records the number living in each household. She repeats this process to obtain 500 samples. The average household size in the population equals 2.6, and the standard deviation of household size in the population equals 1.42.
A histogram looks roughly as follows:

[Figure: histogram of household size in the population]

a) True or False: The percentage of households with more than 6 people will be very close to the area under the standard normal curve to the right of 2.39.

FALSE: The histogram of the population does not look like a normal curve, so that the percentage of households with more than 6 people will not be well approximated by a normal probability. Note that the histogram shows the distribution of household sizes in the *population*, not the samples.

b) True or False: The typical deviation of the 500 sample averages from 2.6 should be very close to 0.06.

FALSE. Since the sample size equals 100, the standard error should be $1.42/\sqrt{100} = 1.42/10 = .142$. Using $0.06 \approx 1.42/\sqrt{500}$ is not correct. Note that the number of sample averages we pick doesn't matter; it's the size of the sample that determines the standard error of a sample average.

c) True or False: The percentage of the 500 sample averages that are less than 2.3 should be very close to the area under the standard normal curve to the left of -2.11.

TRUE. Since 100 is a large sample size, the CLT for the sample average should hold. Therefore, the percentage of sample averages less than 2.3 can be well approximated by a normal probability. The correct mean and SE for the sample average are 2.6 and 1.42/10, so the standardized value of 2.3 is (2.3 - 2.6)/.142 = -2.11.

d) Determine the following quantities for samples of 100 households from this population.

The expected value of 500 sample averages: 2.6. The expected value is the typical value of the sample averages. Since the average in the population equals 2.6, the expected value of the sample averages is 2.6.

The SD of 500 sample averages: 1.42/10. The SD of the 500 sample averages is equivalent to the SE of the sample average, with some small error due to random sampling. The SE is the typical deviation of any sample average from its expected value. The SD of a bunch of numbers is the typical deviation of those numbers from their average.
Plug in the word “sample averages” where you see “numbers” in the last sentence, and you can see that the SE and SD of the 500 sample averages are equivalent.

PS: This is like the JAVA applet on the central limit theorem and the M&M example that we did in class just before Thanksgiving.

20. Two Problems (10 points per part)

a) A poll run by a news organization states that, “The percentage of people who approve of the way President Bush is handling the situation with Iraq equals 62%, plus or minus 3%.” Assuming the poll is a random sample, and that the news organization uses 95% confidence intervals, what is the sample size they used for the poll?

$0.03 = 1.96 \times SE = 1.96 \sqrt{0.62(1 - 0.62)/n}$

Solving for the sample size n, we get:

$n = \dfrac{1.96^2 (0.62)(0.38)}{0.03^2} \approx 1006$

We also accepted answers that used 2 instead of 1.96. When the approach was right, we accepted answers that were near the right answer (i.e., we weren't sticklers about math.)

b) Suppose that 0.5% of all students seeking treatment at Student Health are eventually diagnosed as having mononucleosis. Of those who do have mono, 90% complain of a sore throat. But, 30% of those not having mono also have sore throats. If a student comes to the infirmary and says that he has a sore throat, what is the probability that he has mono?

Let M be the event that you get mono. Let S be the event that you have a sore throat. We want Pr(M|S). We know that Pr(S|M) = .90, and that Pr(S| not M) = .30. Also, we have that Pr(M) = .005. Hence, we can find that

Pr(M|S) = Pr(M and S)/Pr(S) = Pr(S|M)Pr(M) / Pr(S) = (.90)(.005) / [(.90)(.005) + (.30)(.995)] = .0045/.3030 = .0149.

That is about a 1.5% chance.

21. Study Design I

People who get lots of vitamins by eating five or more servings of fresh fruit and vegetables each day (especially cruciferous vegetables like broccoli) have much lower death rates from colon cancer and lung cancer, according to many observational studies.
These studies were so encouraging that two randomized controlled experiments were done: treatment groups were given large doses of vitamin supplements, while people in the control groups just ate their usual diet. One experiment looked at colon cancer, and the other looked at lung cancer. The first experiment found no difference in the death rate from colon cancer between the treated and control group (Greenberg, et al., 1994). The second experiment found that beta carotene (as a diet supplement) increased the death rate from lung cancer (Heinonen, et al., 1994).

a) (5 points) True or false, and justify your choice: The observational studies could have easily reached the wrong conclusions due to confounding. People who eat lots of fruit and vegetables have lifestyles that are different in many other ways, too.

TRUE. Because these are observational studies, we have no assurance that the people with different diets have similar background characteristics.

b) (5 points) True or false, and justify your choice: The experiments could have easily reached the wrong conclusions due to confounding. People who eat lots of fruit and vegetables have lifestyles that are different in many other ways, too.

FALSE. Random assignment should balance the background characteristics, so that the comparisons should be fair. Some people said TRUE because of issues related to the definition of treatments (e.g., placebo effects, double-blind, usual diets could differ). These got most credit (4 points). I'm not sure how one could possibly perform these studies double-blind, and letting people follow their normal diets seems the most realistic of the other treatments.

22. Study Design II

On October 20, 1993, the San Francisco Chronicle reported on a survey of top high school students in the U.S. According to the survey, “Cheating is pervasive. Nearly 80 percent admitted dishonesty, such as copying someone's homework or cheating on an exam.
The survey was sent last spring to 5,000 of the nearly 700,000 high achievers included in the 1993 edition of Who's Who Among American High School Students. The results were based on the 1,957 completed surveys that were returned.”

a) (5 points) Do you think the survey provides evidence that roughly 80% of high school students are cheating? Explain why or why not.

It does not. The sampling frame (high achievers) is in no way representative of all high school students. The nonresponse also weakens the evidence.

b) (5 points) Do you think the survey provides evidence that roughly 80% of the students in Who's Who Among American High School Students are cheating? Explain why or why not.

It does not. The nonresponse could have a strong impact. It may be that people who cheated were not willing to respond, so that the percentage could be even higher.

23. True or False (4 points per part). For each statement, if you think the statement is always true, just say it is true. If you think the statement is always false or sometimes false, say it is false and explain why or when it is false in two or fewer sentences.

a) You get a p-value of 0.34. There is a 66% chance that the alternative hypothesis is true.

FALSE. A p-value is not a probability of the null hypothesis being true, so that 1 - p-value is not a probability that the alternative hypothesis is true.

b) If you increase the sample size, you have a better chance of rejecting a null hypothesis that is false (when all else about the population remains unchanged).

TRUE. A larger sample size means a smaller SE, which means more accuracy to decide if the data are consistent with the null hypothesis. When that null hypothesis is false, this makes it easier to detect that it is false.

c) A large value of the chi-squared independence test statistic suggests that the row and column variables may be independent.

FALSE. A large value of the chi-squared test stat suggests dependence, not independence.
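
To illustrate the answer to part (c), here is a minimal Python sketch of the chi-squared independence statistic for a two-way table of counts. The function name `chi_squared_stat` and both example tables are hypothetical, made up for illustration only; they do not come from the exam or its data.

```python
def chi_squared_stat(table):
    """Chi-squared independence statistic for a two-way table of counts.

    Sums (observed - expected)^2 / expected over all cells, where the
    expected count assumes the row and column variables are independent.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical tables (not exam data).
# Rows are proportional to each other, so counts match independence exactly.
independent_table = [[20, 30], [40, 60]]
# Rows have very different column splits, so counts are far from independence.
dependent_table = [[45, 5], [10, 90]]

print(chi_squared_stat(independent_table))  # statistic near zero
print(chi_squared_stat(dependent_table))    # large statistic
```

The table that matches independence gives a statistic near zero, while the table with very different row proportions gives a large one. That is the sense in which a large statistic points toward dependence.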

d) Two researchers make 95% confidence intervals for the same unknown population average using different samples. The first researcher has a sample with 100 people, and the second researcher has a sample with 5000 people. True or False: the confidence interval based on the 5000 people is more likely to contain the value of the unknown population average than the confidence interval based on the sample of 100 people.

FALSE. By definition, 95% confidence intervals have a 95% chance of containing the population average, regardless of sample size. The interval based on 100 people will be wider than the interval based on 5000 people, but each has a 95% chance of containing the population value.

e) The same two researchers as in part d decide to make Bayesian posterior intervals instead of confidence intervals. They both use the same normal prior distribution. True or False: the prior distribution will have a greater impact on the inferences made from the sample of 5000 than it will on the inferences made from a sample of 100.

FALSE. The prior distribution has a larger effect on inferences for the smaller sample size. You can see this in the weighted average of the data and prior means.
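
The weighted average mentioned in part (e) can be sketched in Python as follows. This is only an illustration under stated assumptions: a normal model for the data with a known data SD (the exam does not specify one), a normal prior, and made-up numbers. The function name `posterior_mean` and all input values are hypothetical.

```python
def posterior_mean(sample_mean, n, data_sd, prior_mean, prior_sd):
    """Posterior mean for a normal mean with a normal prior.

    Assumes the data SD is known. The posterior mean is a weighted
    average of the sample mean and the prior mean, with weights
    proportional to their precisions (1 / variance).
    """
    data_precision = n / data_sd ** 2
    prior_precision = 1 / prior_sd ** 2
    weight_on_data = data_precision / (data_precision + prior_precision)
    mean = weight_on_data * sample_mean + (1 - weight_on_data) * prior_mean
    return mean, weight_on_data


# Hypothetical inputs: same sample mean and prior for both researchers.
mean_100, w_100 = posterior_mean(10.0, 100, 5.0, 0.0, 1.0)
mean_5000, w_5000 = posterior_mean(10.0, 5000, 5.0, 0.0, 1.0)

print(w_100, mean_100)    # weight on the data with n = 100
print(w_5000, mean_5000)  # weight on the data with n = 5000
```

With these made-up inputs, the sample of 5000 puts more weight on the data than the sample of 100 does, so the prior mean pulls the posterior mean less for the larger sample.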