Stat 301 HW 10 Due: 4 Dec / 7 Dec 2015

1. The MTBE data file contains measurements of the chemical MTBE in drinking water wells
in New Hampshire. MTBE is an oxygenate that was added to gasoline to increase the octane
number, help gasoline burn more completely, and reduce air pollution. It can get into drinking
water when gasoline is spilled or a gas station storage tank leaks. MTBE imparts a nasty
taste to water, even at very low concentrations, so its presence in drinking water may make
that water non-potable.
The variable MTBE is the reported MTBE concentration in the well water. Consider predicting that concentration from some subset of 8 variables. It is not necessary to consider
transformations of these variables. It is also not necessary to consider interactions or quadratic
terms. The 8 variables are:
• ph
• SpConduct (specific conductivity of the well water)
• DissOxy (dissolved oxygen of the well water)
• IndPct (the percentage of industrial land use around the well)
• WellClass (public or private well)
• Aquifer (the geological setting of the well)
• Depth (depth of the well)
• Distance (distance to the nearest gas station)
Note that these are not all the variables in the data file.
(a) If you use AICc as the criterion, what variables are included in the model for predicting
MTBE concentration?
(b) Using AICc, how much “worse” is the second best model? Should you ignore the second
best model?
(c) Using BIC as the criterion, what variables are included in the model?
(d) What is the AICc statistic for the model with DissOxy, IndPct, Depth, and Distance?
Should you ignore the model with these four variables?
(e) What is the AICc statistic for the model with only the Depth variable? Should you ignore
this model?
(f) Using RSquare (R²) as the criterion, what variables are included in the model?
(g) Calculate the PRESS statistics for the five models considered up to now (i.e. in parts
1a, 1b, 1d, 1e, and 1f). Which model makes the best “out-of-sample” predictions?
(h) Which model would you suggest be used to predict MTBE concentrations? Briefly support your choice. (Note: there are multiple ways you could correctly answer this question.)
(i) Now, look at the usual model diagnostics (e.g., a residual vs predicted plot, studentized
residuals and Cook’s D) for the model you chose in question 1h. Any concerns?
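For readers who want to see how the criteria in problem 1 are computed, the following is a minimal sketch in Python rather than the course software. It assumes ordinary least squares with an intercept, counts the parameters as k = (number of slopes) + 2 (intercept and error variance), and uses made-up toy data with two hypothetical predictors, x1 and x2, not the MTBE file. Statistical packages may add or drop constants in AICc and BIC, so only differences between models fitted to the same data should be compared. All function names here are illustrative.

```python
import math
from itertools import combinations

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def fit(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)

def predict(beta, x):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def rss(X, y):
    """Residual sum of squares of the fitted model."""
    beta = fit(X, y)
    return sum((yi - predict(beta, xi)) ** 2 for xi, yi in zip(X, y))

def aicc_bic(rss_value, n, k):
    """AICc and BIC from the Gaussian log-likelihood (assumed convention)."""
    neg2loglik = n * (math.log(2 * math.pi) + math.log(rss_value / n) + 1)
    aicc = neg2loglik + 2 * k + 2 * k * (k + 1) / (n - k - 1)
    bic = neg2loglik + k * math.log(n)
    return aicc, bic

def press(X, y):
    """PRESS: sum of squared leave-one-out prediction errors."""
    total = 0.0
    for i in range(len(y)):
        beta = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - predict(beta, X[i])) ** 2
    return total

# Made-up toy data (NOT the MTBE file): two hypothetical predictors and a response.
columns = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0],
}
y = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1, 7.2, 7.9]
n = len(y)

# Fit every subset of predictors, including the intercept-only model.
results = []
names = list(columns)
for size in range(len(names) + 1):
    for subset in combinations(names, size):
        X = [[columns[v][i] for v in subset] for i in range(n)]
        k = len(subset) + 2  # slopes + intercept + error variance (assumed)
        aicc, bic = aicc_bic(rss(X, y), n, k)
        results.append((subset, aicc, bic, press(X, y)))

# Report each model's AICc difference from the best model, its BIC, and its PRESS.
best = min(r[1] for r in results)
for subset, aicc, bic, press_value in sorted(results, key=lambda r: r[1]):
    print(subset, round(aicc - best, 2), round(bic, 2), round(press_value, 3))
```

The printed AICc difference is the quantity part (b) asks about, and a smaller PRESS indicates better leave-one-out predictions. Refitting the model n times is the most transparent way to get PRESS; it can also be computed from a single fit using the leverages.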
2. Many of the MTBE values used in problem 1 are what analytical chemists call “non-detects”.
These are observations where MTBE may be absent or may be present at a very low concentration. These observations are indicated by “Below Limit” in the variable MTBE-Detect.
Higher measured concentrations are indicated by “Detect” in the variable MTBE-Detect.
This problem looks at the proportion of “Detect” observations in the data set. The WellClass
variable indicates whether the well is public or private. There are 78 samples from private
water supply wells and 113 samples from public water supply wells.
(a) What proportion of private water supply wells are “Detect” samples?
(b) What is a 95% confidence interval for that proportion?
(c) Use a Chi-square test to test whether the proportion of “Detect” samples is the same
for public and private wells. Report your test statistic, p-value, and a one-sentence
conclusion.
(d) Is it appropriate to use the Chi-square test with these data? Explain why or why not.
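The calculations in problem 2 can be sketched in Python as follows. This is a minimal illustration, assuming the normal-approximation (Wald) interval for a proportion and the Pearson Chi-square statistic without a continuity correction; the course software may use other variants. The row totals 78 and 113 come from the problem statement, but the “Detect” counts below are hypothetical placeholders, not values from the data file. The function names are illustrative.

```python
import math

def wald_ci(successes, n, z=1.959964):
    """Sample proportion with a normal-approximation (Wald) 95% interval."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half), min(1.0, p + half)

def chi_square_2x2(table):
    """Pearson Chi-square test for a 2x2 table, no continuity correction."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    expected = [[r * c / total for c in col_tot] for r in row_tot]
    stat = sum(
        (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )
    # With 1 degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value, expected

# HYPOTHETICAL "Detect" counts, used only so the sketch runs.
private_detect, private_n = 20, 78   # 78 is from the problem statement
public_detect, public_n = 40, 113    # 113 is from the problem statement

p_hat, lower, upper = wald_ci(private_detect, private_n)
print(f"proportion = {p_hat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")

# Rows: private, public. Columns: Detect, Below Limit.
table = [
    [private_detect, private_n - private_detect],
    [public_detect, public_n - public_detect],
]
stat, p_value, expected = chi_square_2x2(table)
print(f"Chi-square = {stat:.3f}, p-value = {p_value:.4f}")
print("smallest expected count:", round(min(min(row) for row in expected), 2))
```

The smallest expected count is printed because a common rule of thumb for the Chi-square approximation looks at whether every expected cell count is reasonably large (often taken as at least 5), which is relevant to part (d).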