Review Lab on April 4

Review Lab on April 4 – Review for Exam + 10 Minutes on Nonparametric Tests
Nonparametric Methods – For your reference and possibly to help with projects
The z-tests, t-tests, and F-test (ANOVA) that we have studied so far are parametric tests. The
statistical theory behind each of these techniques ultimately rests on normal distributions (via sampling
distributions), which is why each test comes with distributional assumptions (such as nearly normal
population(s) or at least 10 successes/10 failures in your sample). Nonparametric tests do not require
that samples come from populations with normal distributions or any other specific distribution. Hence
they are sometimes called distribution-free tests. They may still have other, more easily satisfied
assumptions, such as requiring that the population distribution is symmetric. In general, when the
parametric assumptions are satisfied, it is better to use parametric test procedures. However,
nonparametric test procedures are a valuable tool when those assumptions are not satisfied, and
most have high efficiency relative to the corresponding parametric test even when the parametric
assumptions are satisfied. This means you usually don't lose much testing power if you switch to the
nonparametric test when it is not strictly necessary. The exception is the sign test, which gives up
more power. The following table lists the parametric tests we have covered so far and the
corresponding nonparametric tests.
Parametric Test                          Nonparametric Test
One sample z-test                        Exact Binomial
One sample t-test (or paired t-test)     Sign test or Wilcoxon Signed Ranks test
Two independent samples t-test           Wilcoxon Rank-Sum test
ANOVA (F-test)                           Kruskal-Wallis
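As a rough illustration of the simplest nonparametric test listed above, here is a minimal Python sketch of the sign test for paired differences, using only the standard library. It drops zero differences, counts the positive ones, and computes an exact two-sided binomial p-value under the null hypothesis that positive and negative differences are equally likely. The function name sign_test and the example differences are made up for illustration; they do not come from the course materials.

```python
from math import comb

def sign_test(differences):
    """Exact two-sided sign test; returns (positives, n_nonzero, p_value)."""
    # Differences of exactly zero carry no sign information, so drop them.
    nonzero = [d for d in differences if d != 0]
    n = len(nonzero)
    positives = sum(1 for d in nonzero if d > 0)
    # Under H0 the number of positive signs is Binomial(n, 0.5).
    k = min(positives, n - positives)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the smaller tail for a two-sided p-value, capped at 1.
    p_value = min(1.0, 2 * tail)
    return positives, n, p_value

# Hypothetical paired differences (illustration only).
diffs = [1.2, 0.8, -0.3, 2.1, 0.0, 1.5, 0.9, -0.1, 1.1, 0.4]
print(sign_test(diffs))
```

Notice that only the signs of the differences are used, not their sizes. That is one way to see why the sign test tends to be less powerful than the other tests in the table.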
If you are doing an analysis and your distributional assumptions are not met, you may want to try the
nonparametric versions. This is another reason to consult with a statistician, as there are variants on
these methods. Many of these methods end up working with ranks rather than the original data, which
results in some interesting mathematics.
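To make the idea of "working with ranks" concrete, the following Python sketch pools two samples, ranks the pooled values (tied values share the average of their ranks), and adds up the ranks belonging to the first sample. That sum is the statistic behind the Wilcoxon Rank-Sum test. The helper names midranks and rank_sum and the data are hypothetical, and the sketch stops at the statistic; it does not compute a p-value.

```python
def midranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def rank_sum(sample_a, sample_b):
    """Sum of the ranks of sample_a within the pooled data."""
    ranks = midranks(list(sample_a) + list(sample_b))
    return sum(ranks[:len(sample_a)])

# Hypothetical samples (illustration only); 3.1 appears in both, creating a tie.
a = [3.1, 4.7, 5.0, 6.2]
b = [2.0, 3.1, 3.9]
print(midranks(a + b))
print(rank_sum(a, b))
```

Because only the ordering of the data matters, an extreme outlier affects the rank sum no more than any other largest value would.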
If you find that your distributional assumptions are not met for your projects and you want me to show
you how to run these tests, just let me know. Generally, they are under Statistics>Nonparametrics.
Summary
• The techniques we have covered in class so far are parametric techniques.
• There are alternatives called nonparametric techniques.
• The difference is that nonparametric techniques do not make assumptions like having a nearly
normal population, but might require an assumption of a nearly symmetric population.
• Better to use parametric tests if the parametric assumptions are satisfied.
• Consult with a statistician if your assumptions aren’t met and you want to see if another test can
be used!
The next pages are old exam questions from previous midterms.
1. Colleges and universities give surveys to their first year students every year to learn about their
habits, likes and dislikes, and goals. For goals, the surveys may list possible goals and ask the students to
choose the three they feel are the most important. In light of the recent financial crisis, researchers
want to compare the proportion of students who identified the goal of “being well-off financially” as
one of their three important goals from this year to previous years. Suppose a random sample of first
year students at a large university is given such a survey and that 152 out of 200 identify “being well-off
financially” as one of their three important goals.
a. Give a point estimate for the population proportion of first year students at this university that would
identify “being well-off financially” as one of their three important goals.
b. Provide a 99% confidence interval for the population proportion of first year students at this
university that would identify “being well-off financially” as one of their three important goals. Include
the check needed to be able to compute the confidence interval.
c. The corresponding 99% confidence interval based on a similar sample the previous year was (.62, .67).
Can you conclude that the population proportions from last year and this year are different? Explain.
Yes          No
d. Suppose you wanted a 99% confidence interval for this population proportion with a margin of error
of 4% (i.e. m= .04). What sample size should you use to attain that margin of error? (Hint: unknown
population proportion means use .5 to get the sample size).
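If you want to check your hand calculations for question 1, the Python sketch below applies the usual large-sample formulas: the interval p-hat ± z* · sqrt(p-hat(1 − p-hat)/n) and the sample size n = (z*/m)² · p(1 − p), rounded up. It is only a sketch under the assumption that the success/failure check holds. The function names proportion_ci and sample_size are invented here, and z* is computed from the standard normal distribution rather than read from a table, so results may differ slightly from table-based answers.

```python
from math import ceil, sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence):
    """Large-sample z-interval for one proportion: p_hat +/- z* * SE."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, (p_hat - margin, p_hat + margin)

def sample_size(margin, confidence, p_guess=0.5):
    """Smallest n with z* * sqrt(p(1-p)/n) <= margin, rounded up."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z_star / margin) ** 2 * p_guess * (1 - p_guess))

successes, n = 152, 200
# Success/failure check: both counts should be at least 10.
print(successes >= 10, n - successes >= 10)
print(proportion_ci(successes, n, 0.99))
print(sample_size(0.04, 0.99))
```

Using p_guess = 0.5 follows the hint in part d: it maximizes p(1 − p), so the resulting sample size is large enough whatever the true proportion is.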
2. Name that Scenario – For each scenario, determine and define the parameter of interest and
determine if a hypothesis test or confidence interval is appropriate. If you choose hypothesis test,
provide the null and alternative hypothesis.
a. Research question: A new housing development claims to have lower levels of carbon monoxide than
an older development nearby on average. A group of homeowners wants to assess the claim by taking
samples from both developments.
b. Research question: The Clean Air Act requires the EPA to set National Ambient Air Quality Standards
for lead and five other pollutants. Lead levels in the air measured quarterly are safe if their average
doesn’t surpass 1.5 µg/m³. Marion County, IN wants to determine whether or not their air is safe (it is
a county which recently had unsafe levels and worked to attain safe ones).
c. Research question: How much does milk selenium concentration change on average when the
pastures of the dairy cows are treated with selenium? Milk selenium levels are available for the cows
both before and (nine days) after the treatment of the pastures.
d. Research question: During the course of a year, how often does rainwater have a pH less than 5.3 (5.6
is normal for regular rainwater, less is acidic)? Assume you sample rainwater and simply record if the pH
is less than or greater than 5.3 for each sample.
e. Research question: Anthropologists want to know whether residents at two sites in an ancient
civilization used a specific type of decorative pottery to the same extent. They collect pottery samples
from both sites and record the number of the decorative pottery fragments as a fraction of the total
number of fragments for each site.
3. (Data from Devore/Peck) An article on the foraging behavior of the Indian False Vampire bat
examined many aspects of bat behavior including the length of bat flights before food was found. The
article reported the number of bats for each gender who spent more than 5 minutes in flight before
finding food. For females, this was 36 out of 193 female bats and for males, this was 64 out of 168 male
bats. The researchers want to know if there is a significant difference between the proportions of flights
longer than 5 minutes for males and females before food was found.
a. Compute estimates of the proportion of female bats who spent longer than 5 min. in flight before
finding food and of the proportion of male bats who spent longer than 5 min. in flight before finding
food.
b. List hypotheses, significance level, and define parameter(s) of interest to address the researcher’s
question. Assume the assumptions for the test are satisfied.
c. One quantity needed for the computation of the test statistic is the estimate of the common
proportion of bat flights which lasted longer than 5 minutes before food was found. Four choices for
that estimate of the common proportion are below. Only one is correct. Circle the correct value, and
explain how you could tell it was correct without relying on a hand calculation.
.178          .277          .391          .426
Explanation:
d. Compute a test statistic and p-value.
e. Interpret your value of the test statistic in context.
f. What conclusion do you reach?
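For checking your work on question 3, here is a hedged Python sketch of the two-proportion z-test with a pooled (common) proportion, assuming the conditions for the test are satisfied and a two-sided alternative. The function name two_proportion_z is invented for illustration. The p-value comes from the standard normal distribution, so it may differ in the last digits from a table-based answer.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test for p1 - p2 using the pooled proportion under H0."""
    p1, p2 = x1 / n1, x2 / n2
    # Pooled estimate: total successes over total observations.
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p1, p2, pooled, z, p_value

# Females: 36 of 193; males: 64 of 168 (order of subtraction: female - male).
p_f, p_m, pooled, z, p_value = two_proportion_z(36, 193, 64, 168)
print(p_f, p_m, pooled)
print(z, p_value)
```

The pooled proportion is a weighted average of the two sample proportions, so it always falls between them. That fact is enough to answer part c without a hand calculation.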
4. A recent article in Conservation Biology investigated “Rapid Evolution in the Wild: Changes in Body
Size, Life-History Traits, and Behavior in Hunted Populations of the Japanese Mamushi Snake.” The study
compared snake populations that had been hunted regularly with populations that had not been
hunted. Among the variables recorded was fleeing distance - how far an approaching human could get
before a snake fled. Suppose the researchers decided to approach each snake twice – once with hunting
weapons and once without (order randomly determined per snake) and record fleeing distance in each
case. Suppose this is done for a random sample of n=90 snakes from a mixture of the hunted and
unhunted sites and they want to know if snakes flee at a greater distance when approached with
hunting weapons on average.
a. What parameter should be used to address the researcher’s question?
μ_d          μ_1 − μ_2
Explain your choice in one sentence.
b. State the hypotheses you would test (be sure to define your order of subtraction).
c. Suppose the distribution of fleeing distance for being approached with weapons is left-skewed and
unimodal, and the distribution of fleeing distance for no weapons is also left-skewed and unimodal. How
does this impact the conditions you have to check for this situation?
d. Assuming the conditions hold, complete your test procedure. Using the R output appropriate for your
choice in a., provide the appropriate test statistic (notation and value) and p-value for your hypotheses.
Paired t-test
(Weapons – NoWeapons)
t = 14.7274, df = 89, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
Welch Two Sample t-test
(Weapons – NoWeapons)
t = 15.5724, df = 177.937, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
Test statistic: ____ = ____
p-value: ____
e. What distribution was used to compute the p-value? (Be specific).
f. Interpret your p-value.
g. Circle your decision and state your conclusion at a .01 significance level.
Reject H0          Do not reject H0
Conclusion:
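To see where the two kinds of R output in question 4 come from, the Python sketch below computes both a paired t statistic (based on the within-subject differences, with n − 1 degrees of freedom) and a Welch two-sample t statistic (treating the two columns as independent samples, with the Welch approximation for degrees of freedom) on the same small data set. The data and the function names paired_t and welch_t are hypothetical; they are not the snake data. The sketch stops at the statistic and degrees of freedom, because the standard library has no t distribution for computing p-values.

```python
from math import sqrt
from statistics import mean, variance

def paired_t(x, y):
    """Paired t statistic for mean(x - y); returns (t, df)."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    t = mean(diffs) / sqrt(variance(diffs) / n)
    return t, n - 1

def welch_t(x, y):
    """Welch two-sample t statistic for mean(x) - mean(y); returns (t, df)."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

# Hypothetical fleeing distances for the same five subjects (illustration only).
weapons = [5.0, 6.5, 4.0, 7.0, 5.5]
no_weapons = [4.0, 5.0, 3.5, 5.0, 5.0]
print(paired_t(weapons, no_weapons))
print(welch_t(weapons, no_weapons))
```

The two procedures use the same difference in sample means in the numerator but different standard errors and degrees of freedom, which is why the two outputs report different t and df values.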
5. Name that Scenario – For each scenario, determine and define the parameter of interest and
determine if a hypothesis test or confidence interval is most appropriate. If you choose hypothesis test,
provide the null and alternative hypothesis. All questions relate to a data set on housing in Boston with
the following variables: Nox (amount of nitrous oxides, quantitative but can be categorized as high/low
also), Charles (on river or not), Rooms (number of rooms per dwelling), Age (indicator if built before
1940 or not), and Distance (distance to employment centers). Analysis of the data set may have impact
on zoning and environmental regulations for real estate.
a. Research question: The researchers want to know if homes constructed after 1940 are closer to
employment centers on average than homes constructed pre-1940.
b. Research question: Researchers want to know how far on average homes with more than 8 rooms
are from employment centers.
c. Research question: Are there more homes built pre-1940 than post-1940 along the river?
d. Research question: Researchers want to know if the level of nitrous oxides is higher on average for
homes constructed pre-1940 than post-1940.
e. Research question: A real estate developer wants to estimate the percentage of homes along the
river.
f. Research question: Researchers want to estimate the difference in average number of rooms per
dwelling comparing pre-1940 to post-1940 housing.