Review Lab on April 4 – Review for Exam + 10 Minutes on Nonparametric Tests

Nonparametric Methods – For your reference and possibly to help with projects

The z-tests, t-tests, and F-test (ANOVA) that we have studied so far are parametric tests. The statistical theory driving each of these techniques ultimately comes down to normal distributions (via sampling distributions), and each technique relies on distributional assumptions (such as nearly normal population(s) or at least 10 successes and 10 failures in your sample).

Nonparametric tests do not require that samples come from populations with normal distributions or any other specific distribution. Hence they are sometimes called distribution-free tests. They may have other, more easily satisfied assumptions, such as requiring that the population distribution is symmetric.

In general, when the parametric assumptions are satisfied, it is better to use parametric test procedures. However, nonparametric test procedures are a valuable tool when those assumptions are not satisfied, and most have fairly high efficiency relative to the corresponding parametric test when the parametric assumptions are satisfied. This means you usually don't lose much testing power if you switch to the nonparametric test even when it is not necessary. The exception is the sign test, which gives up more power.

The following table lists the parametric tests we have covered so far and the corresponding nonparametric tests.

Parametric Test                          Nonparametric Test
One sample z-test                        Exact Binomial
One sample t-test (or paired t-test)     Sign test or Wilcoxon Signed Ranks test
Two independent samples t-test           Wilcoxon Rank-Sum test
ANOVA (F-test)                           Kruskal-Wallis

If you are doing an analysis and your distributional assumptions are not met, you may want to try the nonparametric versions. This is another reason to consult with a statistician, as there are variants on these methods.
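Several of the nonparametric tests in the table replace the original data with signs or ranks. The following is a minimal Python sketch of that idea, not the procedure used in class: it computes a two-sided exact sign test p-value from the binomial distribution and shows how ranks (with ties averaged) are assigned. The function names and the data are made up for illustration, and dropping zero differences is one common convention among several.

```python
from math import comb


def sign_test_p_value(differences):
    """Two-sided exact sign test of H0: median difference = 0.

    Returns (number of positive signs, number of nonzero differences, p-value).
    """
    nonzero = [d for d in differences if d != 0]  # assumption: zeros are dropped
    n = len(nonzero)
    k = sum(1 for d in nonzero if d > 0)          # count of positive signs
    # Under H0, the count of positive signs is Binomial(n, 0.5).
    lower = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n  # P(X <= k)
    upper = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n  # P(X >= k)
    return k, n, min(1.0, 2 * min(lower, upper))


def midranks(values):
    """Rank values from smallest (rank 1) upward, averaging ranks for ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


# Hypothetical paired differences (after minus before), made up for illustration.
diffs = [1.2, 0.4, -0.3, 2.1, 0.0, 0.8, 1.5, -0.1, 0.9, 1.1]
k, n, p = sign_test_p_value(diffs)
print(k, n, round(p, 4))

# Ranks for a small made-up sample with one tie.
print(midranks([3.1, 2.0, 3.1, 5.4]))
```

Note that the sign test uses only whether each difference is positive or negative, not how large it is, which is one way to see why it tends to lose more power than the rank-based tests.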
Many of these methods end up working with ranks rather than the original data, which results in some interesting mathematics. If you find that your distributional assumptions are not met for your projects and you want me to show you how to run these tests, just let me know. Generally, they are under Statistics>Nonparametrics.

Summary

The techniques we have covered in class so far are parametric techniques. There are alternatives called nonparametric techniques.
The difference is that nonparametric techniques do not make assumptions like having a nearly normal population, but they might require an assumption of a nearly symmetric population.
It is better to use parametric tests if the parametric assumptions are satisfied.
Consult with a statistician if your assumptions aren't met and you want to see if another test can be used!

The next pages are old exam questions from previous midterms.

1. Colleges and universities give surveys to their first-year students every year to learn about their habits, likes and dislikes, and goals. For goals, the surveys may list possible goals and ask the students to choose the three they feel are the most important. In light of the recent financial crisis, researchers want to compare the proportion of students who identified the goal of "being well-off financially" as one of their three important goals this year with the proportion from previous years. Suppose a random sample of first-year students at a large university is given such a survey and that 152 out of 200 identify "being well-off financially" as one of their three important goals.

a. Give a point estimate for the population proportion of first-year students at this university who would identify "being well-off financially" as one of their three important goals.

b. Provide a 99% confidence interval for the population proportion of first-year students at this university who would identify "being well-off financially" as one of their three important goals.
Include the check needed to be able to compute the confidence interval.

c. The corresponding 99% confidence interval based on a similar sample the previous year was (.62, .67). Can you conclude that the population proportions from last year and this year are different? Explain.

Yes     No

d. Suppose you wanted a 99% confidence interval for this population proportion with a margin of error of 4% (i.e., m = .04). What sample size should you use to attain that margin of error? (Hint: unknown population proportion means use .5 to get the sample size.)

2. Name that Scenario – For each scenario, determine and define the parameter of interest and determine if a hypothesis test or confidence interval is appropriate. If you choose hypothesis test, provide the null and alternative hypotheses.

a. Research question: A new housing development claims to have lower levels of carbon monoxide on average than an older development nearby. A group of homeowners wants to assess the claim by taking samples from both developments.

b. Research question: The Clean Air Act requires the EPA to set National Ambient Air Quality Standards for lead and five other pollutants. Lead levels in the air, measured quarterly, are safe if their average doesn't surpass 1.5 µg/m³. Marion County, IN wants to determine whether or not its air is safe (it is a county that recently had unsafe levels and worked to attain safe ones).

c. Research question: How much does milk selenium concentration change on average when the pastures of the dairy cows are treated with selenium? Milk selenium levels are available for the cows both before and (nine days) after the treatment of the pastures.

d. Research question: During the course of a year, how often does rainwater have a pH less than 5.3 (5.6 is normal for regular rainwater; less is acidic)? Assume you sample rainwater and simply record whether the pH is less than or greater than 5.3 for each sample.

e.
Research question: Anthropologists want to know whether residents at two sites in an ancient civilization used a specific type of decorative pottery to the same extent. They collect pottery samples from both sites and record the number of the decorative pottery fragments as a fraction of the total number of fragments for each site.

3. (Data from Devore/Peck) An article on the foraging behavior of the Indian False Vampire bat examined many aspects of bat behavior, including the length of bat flights before food was found. The article reported the number of bats of each gender who spent more than 5 minutes in flight before finding food. For females, this was 36 out of 193 female bats, and for males, this was 64 out of 168 male bats. The researchers want to know if there is a significant difference between the proportions of flights longer than 5 minutes for males and females before food was found.

a. Compute estimates of the proportion of female bats who spent longer than 5 min. in flight before finding food and of the proportion of male bats who spent longer than 5 min. in flight before finding food.

b. List hypotheses, significance level, and define parameter(s) of interest to address the researchers' question. Assume the assumptions for the test are satisfied.

c. One quantity needed for the computation of the test statistic is the estimate of the common proportion of bat flights which lasted longer than 5 minutes before food was found. Four choices for that estimate of the common proportion are below. Only one is correct. Circle the correct value, and explain how you could tell it was correct without relying on a hand calculation.

.178     .277     .391     .426

Explanation:

d. Compute a test statistic and p-value.

e. Interpret your value of the test statistic in context.

f. What conclusion do you reach?

4.
A recent article in Conservation Biology investigated "Rapid Evolution in the Wild: Changes in Body Size, Life-History Traits, and Behavior in Hunted Populations of the Japanese Mamushi Snake." The study compared snake populations that had been hunted regularly with populations that had not been hunted. Among the variables recorded was fleeing distance: how far away an approaching human was when a snake fled. Suppose the researchers decided to approach each snake twice, once with hunting weapons and once without (order randomly determined per snake), and record fleeing distance in each case. Suppose this is done for a random sample of n = 90 snakes from a mixture of the hunted and unhunted sites, and they want to know if snakes flee at a greater distance on average when approached with hunting weapons.

a. What parameter should be used to address the researchers' question? Explain your choice in one sentence.

μ_d     μ_1 − μ_2

b. State the hypotheses you would test (be sure to define your order of subtraction).

c. Suppose the distribution of fleeing distance for being approached with weapons is left-skewed and unimodal, and the distribution of fleeing distance for no weapons is also left-skewed and unimodal. How does this impact the conditions you have to check for this situation?

d. Assuming the conditions hold, complete your test procedure. Using the R output appropriate for your choice in a., provide the appropriate test statistic (notation and value) and p-value for your hypotheses.

    Paired t-test (Weapons - NoWeapons)
    t = 14.7274, df = 89, p-value < 2.2e-16
    alternative hypothesis: true difference in means is not equal to 0

    Welch Two Sample t-test (Weapons - NoWeapons)
    t = 15.5724, df = 177.937, p-value < 2.2e-16
    alternative hypothesis: true difference in means is not equal to 0

Test statistic: ______ = ______     p-value: ______

e. What distribution was used to compute the p-value? (Be specific.)

f. Interpret your p-value.

g.
Circle your decision and state your conclusion at a .01 significance level.

Reject H0     Do not reject H0

Conclusion:

5. Name that Scenario – For each scenario, determine and define the parameter of interest and determine if a hypothesis test or confidence interval is most appropriate. If you choose hypothesis test, provide the null and alternative hypotheses. All questions relate to a data set on housing in Boston with the following variables: Nox (amount of nitrous oxides; quantitative, but can also be categorized as high/low), Charles (on river or not), Rooms (number of rooms per dwelling), Age (indicator of whether built before 1940 or not), and Distance (distance to employment centers). Analysis of the data set may affect zoning and environmental regulations for real estate.

a. Research question: The researchers want to know whether homes constructed after 1940 are closer to employment centers on average than homes constructed pre-1940.

b. Research question: Researchers want to know how far on average homes with more than 8 rooms are from employment centers.

c. Research question: Are there more homes built pre-1940 than post-1940 along the river?

d. Research question: Researchers want to know if the level of nitrous oxides is higher on average for homes constructed pre-1940 than post-1940.

e. Research question: A real estate developer wants to estimate the percentage of homes along the river.

f. Research question: Researchers want to estimate the difference in average number of rooms per dwelling comparing pre-1940 to post-1940 housing.