Stat 401 A - Homework 9 - with clarification and additional explanation for 1b.

Conceptual problems:
Chapter 7: 8, 9, 10, 11; Chapter 8: 2, 3, 6, 7; Chapter 9: 2, 4, 8

Computational problems:

1) Chapter 7: 15/21, with my versions of the book’s questions and some new ones. The data are in planet.txt. Note that the asteroids are considered to be a planet (number 5 from the sun). Recently, Pluto has been demoted from “planet” to “dwarf planet”. For the entirety of this problem, use only the distances for the first 9 planets (up to and including Neptune).

a) Fit a regression of Y = log-transformed distance vs. X = i, the order from the sun. Report the estimated intercept and slope, both with their standard errors.

b) Bode’s law (late 18th century) in its simplest form claims that the distance from a planet to the sun is twice the distance of the previous planet to the sun. If distance is log transformed, that means that log(distance) = constant + (order) * log 2. Test whether the data support Bode’s law. (See the description in the book and the hint below.) Report your T statistic.

c) Predict the distance at which the 10th planet would be found by extrapolating the regression line fitted to the first 9 planets.

d) Is the actual distance for Pluto unusually small or unusually large relative to the prediction based on the other 9 planets? Explain why or why not.

e) Estimate and report the error standard deviation. In the context of this problem, what does the error standard deviation measure? (Clearly, it is not the error in measuring distance from the sun, since that is very accurately measured.)

f) You believe you have found a new planet orbiting at an extreme distance from our sun. On the scale used for the data in planet.txt, this new planet's distance is 1350. You wonder whether there might be a planet in between Pluto (number 10) and this new planet. Report the predicted order for a planet with distance 1350.
If you believe the regression model can be extrapolated out to 1350, does your result suggest there is another planet waiting to be discovered?

g) Calculate an approximate standard error for the predicted order of that new planet.

Hints:
   1) The X variable is i, the order from the sun. You need to estimate the slope coefficient. If Bode's law is correct, the slope will be log(2), so log(2) will be used as the parameter value in the t-test in part b. Remember the general form for a t test: (estimate - parameter)/se. Usually parameter is 0. Here it is log 2.
   2) Part c: consider an appropriate interval for the prediction of planet 10.
   3) Part f is a calibration problem. Remember the Y variable is log(distance), where log is the natural log.
   4) Part g will require predicting the distance for the predicted order from part f.

2) The data in diversity.txt are described in Chapter 8, problem 22/22. Ecological theory suggests that the appropriate regression model is diversity = B0 + B1 * log(area). These data come from an experimental study where the area of the patch was randomly assigned to plots of land.

a) Estimate the regression coefficients, then test whether the slope is significantly different from 0. Report the estimates and the test statistic and p-value for the slope test.

b) Write a one sentence interpretation of the effect of area on diversity that includes a number quantifying that effect.

c) Use the ANOVA lack of fit test to evaluate lack of fit of the regression model.

d) The investigators are interested in using this model to predict the number of species on new patches of forest. What is the area of a patch that has the smallest standard error for predictions of the mean number of species?

e) What is the area of a patch that has the smallest standard error for predictions of number of species in an individual patch?

f) The investigators would really like to make predictions of diversity in an individual patch that have standard errors less than 20 species.
Is that possible with this model and data?

g) The study that produced these data has attracted a lot of international attention. Imagine that this study can be repeated with 10 times as many plots (160 instead of the 16 here). Imagine that everything about the data remains the same, except for the sample size: the regression intercept, slope, and error variance are the same as here. Will this study with 160 plots be able to predict diversity in an individual patch with a standard error less than 20 species? Briefly explain why or why not.

Hints: for parts e and g: look at the formula for se pred, the standard error of the prediction of an individual observation.

3) The data in wine.txt come from an observational study of the relationship between wine consumption and death by heart attack. The study is described in Chapter 8: 23/23. Each observation is one country.

a) Plot the data (X = wine consumption, Y = mortality). Will a linear regression model be a reasonable summary of the relationship between consumption and mortality? Briefly explain why or why not.

b) Fit a linear regression using log(wine consumption) to predict mortality. Describe the association between wine consumption and mortality, including a number (or numbers) that quantify the association. Although this is an observational study, you may use causal language, because the wording is clumsy otherwise. I am looking for your ability to summarize the relationship between wine consumption and mortality. Note: Do not use log(wine consumption) in your description.

c) Fit a linear regression using log(wine consumption) to predict log(mortality). Again, use the appropriate information from this regression to describe the association between wine consumption and mortality. All the notes from b) apply here, with one addition: do not use log(mortality) in your description.

d) Plot the residuals vs. the predicted values for both regressions (parts b and c).
Which model best satisfies the assumptions for linear regression? Briefly explain your choice.
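
The hints in problems 1 and 2 rely on a few standard simple-regression formulas: the general t statistic (estimate - parameter)/se, the standard errors for predicting a mean and an individual observation, and calibration (solving the fitted line for X). The Python sketch below shows one way these could be computed. It is an illustration under stated assumptions, not a required method: the function names (fit_line, t_stat, se_mean, se_pred, calibrate, calibrate_se) are hypothetical, the data are made up and do not come from planet.txt, diversity.txt, or wine.txt, and calibrate_se uses one common approximation (the standard error of prediction at the predicted X, divided by the absolute value of the slope).

```python
import math


def fit_line(x, y):
    """Least-squares fit of y = b0 + b1 * x with standard errors."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # error standard deviation, n - 2 df
    return {
        "n": n, "xbar": xbar, "sxx": sxx, "b0": b0, "b1": b1, "s": s,
        "se_b0": s * math.sqrt(1 / n + xbar ** 2 / sxx),
        "se_b1": s / math.sqrt(sxx),
    }


def t_stat(estimate, parameter, se):
    """General t statistic; parameter is the hypothesized value (often 0)."""
    return (estimate - parameter) / se


def se_mean(fit, x0):
    """Standard error of the estimated mean of Y at x0."""
    return fit["s"] * math.sqrt(1 / fit["n"] + (x0 - fit["xbar"]) ** 2 / fit["sxx"])


def se_pred(fit, x0):
    """Standard error for predicting one new observation at x0."""
    return fit["s"] * math.sqrt(
        1 + 1 / fit["n"] + (x0 - fit["xbar"]) ** 2 / fit["sxx"]
    )


def calibrate(fit, y0):
    """Calibration: the X whose fitted value equals y0."""
    return (y0 - fit["b0"]) / fit["b1"]


def calibrate_se(fit, y0):
    """Approximate se of the calibrated X (assumed approximation, see lead-in)."""
    return se_pred(fit, calibrate(fit, y0)) / abs(fit["b1"])


# Made-up data for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]
fit = fit_line(x, y)

# Test the slope against a hypothesized value of 1 rather than 0.
print("slope:", fit["b1"], "se:", fit["se_b1"])
print("t vs slope = 1:", t_stat(fit["b1"], 1.0, fit["se_b1"]))
print("se mean at x = 3:", se_mean(fit, 3.0))
print("se pred at x = 3:", se_pred(fit, 3.0))
print("calibrated x for y = 4:", calibrate(fit, 4.0), "+/-", calibrate_se(fit, 4.0))
```

Comparing se_mean and se_pred at the same x0 shows the extra "1 +" term inside the square root for an individual prediction, which is the part of the formula the hint for problem 2 points to.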