Stat 401 A - Homework 9 – with clarification and additional explanation for 1b.
Conceptual problems: Chapter 7: 8, 9, 10, 11; Chapter 8: 2, 3, 6, 7; Chapter 9: 2, 4, 8
Computational problems:
1) Chapter 7: 15/21, with my versions of the book’s questions and some new ones. The data are in
planet.txt. Note that the asteroids are counted as a planet (number 5 from the sun). Pluto has
recently been demoted from “planet” to “dwarf planet”.
For the entirety of this problem, use only the distances for the first 9 planets (up to and including
Neptune).
a) Fit a regression of Y = log transformed distance vs X=i, the order from the sun. Report the estimated
intercept and slope, both with their standard errors.
b) Bode’s law (late 18th century) in its simplest form claims that the distance from a planet to the sun is
twice the distance of the previous planet to the sun. If distance is log transformed, that means that
log(distance) = constant + (order) * log 2. Test whether the data support Bode’s law. (See the description
in the book and the hint below.) Report your T statistic.
c) Predict the distance at which the 10th planet would be found by extrapolating the regression line
fitted to the first 9 planets.
d) Is the actual distance for Pluto unusually small or unusually large relative to the prediction based on
the other 9 planets? Explain why or why not.
e) Estimate and report the error standard deviation. In the context of this problem, what does the error
standard deviation measure? (Clearly, it is not the error in measuring distance from the sun, since that is
very accurately measured).
f) You believe you have found a new planet orbiting at an extreme distance from our sun. On the scale
used for the data in planet.txt, this new planet's distance is 1350. You wonder whether there might be
a planet in between Pluto (number 10) and this new planet. Report the predicted order for a planet
with distance 1350. If you believe the regression model can be extrapolated out to 1350, does your
result suggest there is another planet waiting to be discovered?
g) Calculate an approximate standard error for the predicted order of that new planet.
Hints: 1) The X variable is i, the order from the sun. You need to estimate the slope coefficient. If Bode's
law is correct, the slope will be log(2), so log(2) will be used as the parameter value in the t-test in part
b. Remember the general form for a t test: (estimate - parameter)/se. Usually parameter is 0. Here it is
log 2.
2) Part c: consider an appropriate interval for the prediction of planet 10.
3) Part f is a calibration problem. Remember the Y variable is log(distance), where log is the natural log.
4) Part g will require predicting the distance for the predicted order from part f.
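The arithmetic behind hints 1, 3 and 4 can be sketched in Python as below. This is only an illustration: the data are made-up numbers (not the values in planet.txt), the function names fit_line, t_stat and calibrate are hypothetical, and the calibration standard error uses the common approximation se(predicted X) = se(prediction of Y at the predicted X) / |slope|, which is assumed here to be the one intended in part g.

```python
import math

def fit_line(x, y):
    """Least squares fit of y = b0 + b1*x, with standard errors."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma = math.sqrt(rss / (n - 2))  # error SD estimate on n - 2 df
    return {
        "b0": b0, "b1": b1, "sigma": sigma,
        "se_b0": sigma * math.sqrt(1 / n + xbar ** 2 / sxx),
        "se_b1": sigma / math.sqrt(sxx),
        "n": n, "xbar": xbar, "sxx": sxx,
    }

def t_stat(estimate, parameter, se):
    """General form of a t statistic: (estimate - parameter) / se."""
    return (estimate - parameter) / se

def calibrate(fit, y0):
    """Inverse prediction: the x at which the fitted line equals y0,
    with an approximate standard error (assumed approximation, see lead-in)."""
    x_hat = (y0 - fit["b0"]) / fit["b1"]
    se_pred_y = fit["sigma"] * math.sqrt(
        1 + 1 / fit["n"] + (x_hat - fit["xbar"]) ** 2 / fit["sxx"]
    )
    return x_hat, se_pred_y / abs(fit["b1"])

# Made-up illustrative data, NOT the planet.txt values.
order = [1, 2, 3, 4, 5, 6]
distance = [3.0, 6.5, 11.0, 25.0, 47.0, 100.0]
log_distance = [math.log(d) for d in distance]  # natural log, as in hint 3

fit = fit_line(order, log_distance)
t = t_stat(fit["b1"], math.log(2), fit["se_b1"])  # null value is log 2, not 0
x_hat, se_x = calibrate(fit, math.log(400.0))     # hypothetical distance of 400

print("slope:", fit["b1"], "se:", fit["se_b1"])
print("t statistic against log(2):", t)
print("calibrated order:", x_hat, "approx. se:", se_x)
```

The t statistic is compared with a t distribution on n - 2 degrees of freedom, the same degrees of freedom used for the error standard deviation.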
2) The data in diversity.txt are described in Chapter 8, problem 22/22. Ecological theory suggests that
the appropriate regression model is diversity = B0 + B1 * log(area).
These data come from an experimental study where the area of the patch was randomly assigned to
plots of land.
a) Estimate the regression coefficients, then test whether the slope is significantly different from 0.
Report the estimates and the test statistic and p-value for the slope test.
b) Write a one sentence interpretation of the effect of area on diversity that includes a number
quantifying that effect.
c) Use the ANOVA lack of fit test to evaluate lack of fit of the regression model.
d) The investigators are interested in using this model to predict the number of species on new patches
of forest. What is the area of a patch that has the smallest standard error for predictions of the mean
number of species?
e) What is the area of a patch that has the smallest standard error for predictions of number of species
in an individual patch?
f) The investigators would really like to make predictions of diversity in an individual patch that have
standard errors less than 20 species. Is that possible with this model and data?
g) The study that produced these data has attracted a lot of international attention. Imagine that this
study can be repeated with 10 times as many plots (160 instead of the 16 here). Imagine that
everything about the data remains the same, except for the sample size: the regression intercept, slope,
and error variance are the same as here. Will this study with 160 plots be able to predict diversity in an
individual patch with a standard error less than 20 species? Briefly explain why or why not.
Hints: for parts e and g: look at the formula for se pred, the standard error of the prediction of an
individual observation.
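As a sketch of the formulas this hint refers to, the two standard errors for simple linear regression can be written as small Python functions. The function names and every number in the demonstration are hypothetical (they are not taken from diversity.txt); sigma is the estimated error standard deviation, xbar the mean of X, and sxx the sum of squared deviations of X from its mean.

```python
import math

def se_mean(sigma, n, x0, xbar, sxx):
    """Standard error of the estimated mean response at X = x0."""
    return sigma * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)

def se_pred(sigma, n, x0, xbar, sxx):
    """Standard error for predicting one new observation at X = x0.
    The extra '1 +' term carries the variability of the new observation."""
    return sigma * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)

# Hypothetical values, for illustration only.
sigma, xbar, x0 = 5.0, 2.0, 3.0
for n in (16, 160, 1600):
    sxx = 1.5 * n  # assumes the same design is simply repeated, so sxx grows with n
    print(n, round(se_mean(sigma, n, x0, xbar, sxx), 3),
          round(se_pred(sigma, n, x0, xbar, sxx), 3))
```

Comparing the two columns as n grows shows which parts of each standard error depend on the sample size and which do not.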
3) The data in wine.txt come from an observational study of the relationship between wine
consumption and death by heart attack. The study is described in Chapter 8:23/23. Each observation is
one country.
a) Plot the data (X=wine consumption, Y=mortality). Will a linear regression model be a reasonable
summary of the relationship between consumption and mortality? Briefly explain why or why not.
b) Fit a linear regression using log(wine consumption) to predict mortality. Describe the association
between wine consumption and mortality, including a number (or numbers) that quantify the
association. Although this is an observational study, you may use causal language, because the wording is
clumsy otherwise. I am looking for your ability to summarize the relationship between wine
consumption and mortality. Note: Do not use log(wine consumption) in your description.
c) Fit a linear regression using log(wine consumption) to predict log(mortality). Again, use the
appropriate information from this regression to describe the association between wine consumption
and mortality. All the notes from b) apply here, with one addition: do not use log(mortality) in your
description.
d) Plot the residuals vs the predicted values for both regressions (parts b and c). Which model best
satisfies the assumptions for linear regression? Briefly explain your choice.
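Describing a relationship without mentioning the logarithm (problem 2b and problems 3b and 3c) comes down to back-transforming the slope. A minimal Python sketch follows, assuming natural logs throughout; the function names and the slope values are hypothetical and are not estimates from diversity.txt or wine.txt.

```python
import math

def change_in_mean_y(b1, factor):
    """Model Y = b0 + b1*log(X): multiplying X by `factor`
    adds b1*log(factor) to the mean of Y."""
    return b1 * math.log(factor)

def multiplier_of_median_y(b1, factor):
    """Model log(Y) = b0 + b1*log(X): multiplying X by `factor`
    multiplies the median of Y by factor**b1."""
    return factor ** b1

# Hypothetical slopes, for illustration only.
print("doubling X changes mean Y by", change_in_mean_y(-3.0, 2))
print("doubling X multiplies median Y by", multiplier_of_median_y(-0.4, 2))
```

Doubling is a convenient choice of factor, but any multiplier (for example a 10-fold increase) works the same way.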