Professor François Nielsen SOCI 252

Professor François Nielsen
SOCI 252-002 Homework 6 – Key
Chapter 27
2. (pg. 702 – drug use)
a) The percentage of 9th graders in these countries who have used other drugs is estimated to be
0.615 percentage points higher for each additional percentage point of 9th graders who have used
marijuana.
b) H0: There is no (linear) relationship between the use of marijuana and other drugs; B1 = 0
HA: There is a relationship; B1 ≠ 0
c) t = 7.95, P-value = 0.0001. With such a low P-value, we reject H0. We are very confident that the
percentage of teens using other drugs is positively related to the percentage using marijuana.
d) Percentage using marijuana accounts for 87.3% of the variation in other drug usage for 9th
graders in these countries.
e) The use of other drugs is associated with marijuana use, but this relationship offers no proof of
causality. There may be a lurking variable (e.g., cultural permissiveness of attitudes toward
drugs may predict a country’s rate of marijuana use and other drug use).
4. (pg. 703 – Saratoga home prices)
a) Predicted price = -3.12 + 94.5*Size
The model suggests that Saratoga houses cost about $94.50 per square foot.
b) The P-value for the intercept is 0.50. That means we cannot discern a difference between the
intercept value and zero. Remember that the intercept is the value of the response variable
(price) when the predictor variable (size) equals zero. A value of $0 for a house of zero size
makes sense.
c) Amounts by which house prices differ from predictions made by this model vary, with a
standard deviation of about $54,000.
d) $2.393 per square foot
e) If we constructed other models based on different samples of houses, we’d expect the slopes of
the regression lines to vary, with a standard deviation of $2.39 per square foot.
6. (second home)
a) The scatterplot looks straight enough; the residuals look random and are nearly normal, and the
residuals don’t display any clear change in variability.
b) I’m 95% confident that Saratoga housing costs increase at a rate of between $89.80 and $99.20
per square foot.
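The interval in 6b) can be checked against the numbers quoted in problem 4. The Python sketch below assumes a slope estimate of 94.5 dollars per square foot (the $94.50 in 4a) and the standard error of 2.393 from 4d. The critical value of 1.96 is an assumption that only suits a large sample; the key does not state the degrees of freedom.

```python
# Sketch: 95% confidence interval for a regression slope.
# slope and se are taken from problem 4 of this key; t_star = 1.96 is an
# assumed large-sample critical value, not a figure given in the key.
def slope_interval(slope, se, t_star):
    """Return (lower, upper) = slope -/+ t_star * se."""
    margin = t_star * se
    return slope - margin, slope + margin

lower, upper = slope_interval(slope=94.5, se=2.393, t_star=1.96)
print(f"95% CI for slope: (${lower:.2f}, ${upper:.2f}) per square foot")
```

Under these assumptions the result agrees with the key's interval to within rounding.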
18. (pg. 705 – SAT scores)
a) H0: There is no (linear) relationship between SAT Verbal and Math scores; B1 = 0.
HA: there is a relationship; B1 ≠ 0
b) Assumptions seem reasonable, since conditions are satisfied. The scatterplot suggests a positive
linear relationship. Residual plot shows no patterns (one outlier); histogram is unimodal and
roughly symmetric.
c) t = 11.9; P-value < 0.0001. These data show evidence of a positive relationship between SAT
Verbal and Math scores (surprise!)
20. (pg. 706 – SAT, part II)
a) 90% confidence interval for the slope of the true line describing the relationship between SAT
Math and Verbal scores: (0.581, 0.769)
b) Based on the sample, we are 90% confident that the average SAT Math scores increase between
0.58 and 0.77 points for each additional point scored on the SAT Verbal section.
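The interval in 20a) can be rebuilt from figures that appear elsewhere in this key. The Python sketch below uses the slope of 0.675 from the regression equation in problem 22, the t-ratio of 11.9 from 18c, and the 90% critical value of 1.65 that the key uses in problem 22. Recovering the standard error as slope divided by t-ratio relies on the t-ratio testing B1 = 0, which is the hypothesis stated in 18a.

```python
# Sketch: rebuild the 90% slope interval from the slope and its t-ratio.
slope = 0.675     # slope of the fitted line (see problem 22)
t_ratio = 11.9    # t-ratio for the slope (problem 18c)
t_star = 1.65     # 90% critical value as used in problem 22

# For H0: B1 = 0 the t-ratio is slope / SE(slope), so SE = slope / t.
se_slope = slope / t_ratio
margin = t_star * se_slope
ci = (slope - margin, slope + margin)

print(f"SE(slope) = {se_slope:.4f}")
print(f"90% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```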
22. (SAT, again) – Optional
Hint: this problem asks for a confidence interval for the predicted mean and a confidence interval for a
predicted individual observation. I recommend that you use the predict function in R with options
interval="confidence" and interval="prediction" to do this. See instructions in the handout
for HW 6.
a) 90% confidence interval for mean SAT Math score of all students with SAT Verbal score of 500:
First, predicted mean SAT Math score = 209.554 + 0.675*500 = 547.1
t-value for 5% area in either tail (so 10% in both tails, leaving 90% confidence in middle): 1.65
b) 90% prediction interval for the SAT Math score of a specific student with SAT Verbal of 710:
Note that this uses the standard error of the prediction itself, s in the regression output = 71.75
First, predicted SAT Math score = 209.554 + 0.675*710 = 688.8
t-value for 5% area in either tail (so 10% in both tails, leaving 90% confidence in middle): 1.65
688.8 ± 1.65*71.75 = 688.8 ± 118.4, so the interval is (570, 800)
Note: the calculated upper limit (807.2) is higher than 800, but SAT sections have a maximum of 800.
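The arithmetic in problem 22 can be checked with a short Python sketch that uses only the key's own numbers. It follows the key's shortcut of taking s alone as the standard error for an individual student, which is an approximation: a full interval for an individual would use a slightly larger standard error that also reflects uncertainty in the fitted line. The function names are illustrative, not from the handout.

```python
# Sketch of the arithmetic in problem 22, using the key's own numbers.
INTERCEPT, SLOPE = 209.554, 0.675   # fitted line from the key
S_RESID = 71.75                     # s from the regression output
T_STAR = 1.65                       # 90% critical value used in the key
SAT_MAX = 800                       # maximum score on an SAT section

def predicted_math(verbal):
    """Predicted SAT Math score for a given SAT Verbal score."""
    return INTERCEPT + SLOPE * verbal

def rough_individual_interval(verbal):
    """Approximate 90% interval for one student: prediction -/+ t* times s.

    The upper limit is capped at the maximum possible section score.
    """
    center = predicted_math(verbal)
    margin = T_STAR * S_RESID
    return center - margin, min(center + margin, SAT_MAX)

print(round(predicted_math(500), 1))
print(round(predicted_math(710), 1))
low, high = rough_individual_interval(710)
print(f"({low:.0f}, {high:.0f})")
```

The sketch reproduces the predicted values 547.1 and 688.8 and the capped interval (570, 800).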
24. (brain size)
a) H0: No linear relationship between Brain Size and IQ; B1 = 0
HA: There is a linear relationship; B1 ≠ 0
t = 1.12, so this will not be significant
b) With R-squared = 6.5%, the relationship is very weak. There seem to be three students with
large brains who also scored high. Without them, there seems to be no association at all.
26. (winter)
• Scatterplot of Temperature against Latitude shows curvature (downward). Histogram of
residuals is right skewed; residual plot shows decreasing variance as predicted values increase.
28. (pg. 707 – climate change and CO2)
a) Predicted temperature = 10.707 + 0.01006*CO2
b) Yes, t = 7.74; P-value < 0.0001
c) The standard deviation of the residuals is 0.0985 degrees C, so we don’t expect the model to
predict with an accuracy better than about ± 0.2 degrees C (roughly two standard deviations).
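The ± 0.2 figure in 28c) is about twice the residual standard deviation. The Python sketch below applies that rule of thumb to the fitted line from 28a. The CO2 value of 370 is a hypothetical input chosen only for illustration; it does not come from the key.

```python
# Sketch: fitted line from 28a with a rough +/- 2s band from 28c.
INTERCEPT, SLOPE = 10.707, 0.01006   # coefficients from the key
S_RESID = 0.0985                     # residual standard deviation, degrees C

def predicted_temp(co2):
    """Predicted mean temperature (degrees C) for a given CO2 level."""
    return INTERCEPT + SLOPE * co2

margin = 2 * S_RESID      # about two residual standard deviations
co2_example = 370         # hypothetical CO2 value, for illustration only
center = predicted_temp(co2_example)

print(f"margin = {margin:.3f} degrees C")
print(f"rough range at CO2 = {co2_example}: "
      f"{center - margin:.2f} to {center + margin:.2f} degrees C")
```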
37. (pg. 709 – grades)
a) The regression model is Midterm2 = 12.005 + 0.721*Midterm1
b) The scatterplot shows a weak, somewhat linear, positive relationship. There are several
outlying points, but removing them only makes the relationship slightly stronger. There is no
obvious pattern in the residual plot. The regression model appears appropriate. The small
P-value for the slope shows that the slope is statistically distinguishable from 0, even though the
R-square value is only 0.199. The value of s = 16.8 points indicates that the model would not be
able to predict performance on Midterm2 very accurately.
38. (pg. 710 – grades?)
a) The regression model is MT_total = 46.062 + 1.58*Homework
b) The scatterplot shows a strong, mostly linear, positive relationship between midterm total and
homework scores. There is a model outlier and a high influence point, but the model is not
significantly changed by deleting both points. There is no obvious pattern in the residual plot.
The regression model appears appropriate. The small P-value for the slope shows that the slope
is statistically distinguishable from 0.
c) The R-square value of 0.507 suggests that the overall relationship is fairly strong. However, this does
not mean that midterm total is accurately predicted from homework scores. The error standard
deviation of 18.3 points indicates that a prediction of midterm total could easily be off by 20 or
30 points or more. If this is a significant number of points for deciding grades, then homework
scores alone will not suffice.
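The contrast in 38c) between a fairly strong R-square and imprecise predictions can be made concrete with a little arithmetic. The Python sketch below uses the model from 38a and the values of R-square and s from 38c. The homework score of 60 is a hypothetical input, and the "two standard deviations" band is a rule of thumb rather than a formal prediction interval.

```python
import math

# Values quoted in problem 38 of this key.
INTERCEPT, SLOPE = 46.062, 1.58   # MT_total = 46.062 + 1.58*Homework
R_SQUARED = 0.507
S_RESID = 18.3                    # error standard deviation, in points

# Correlation is the positive root of R-square because the slope is positive.
r = math.sqrt(R_SQUARED)

homework = 60                     # hypothetical homework score, for illustration
predicted = INTERCEPT + SLOPE * homework
band = 2 * S_RESID                # rough two-standard-deviation band

print(f"r = {r:.3f}")
print(f"predicted MT_total = {predicted:.1f}, give or take about {band:.1f} points")
```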
Chapter 30
4. (pg. 806 – Scottish hill races 2008: men)
a) Predicted Time = 10.372 + 4.04*Distance + 0.0342*Climb
The time it takes to complete a race increases with both distance and climbing elevation
b) 98% of the variability in the men’s record times is accounted for by the regression model on
Distance and Climb.
c) For races of a given distance, we expect the mean Men’s Record time to increase by 0.034
minutes for each additional meter of Climb.
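The interpretation in 4c) is a statement about holding Distance fixed while Climb changes. The Python sketch below illustrates this with the coefficients quoted in 4a. The race used here (distance 10, climb 500) is hypothetical, and only the difference between the two predictions matters.

```python
# Sketch: partial effect of Climb in the men's model from 4a.
def predicted_time(distance, climb):
    """Predicted record time in minutes, using the coefficients quoted in the key."""
    return 10.372 + 4.04 * distance + 0.0342 * climb

# Hypothetical race; Distance is held fixed while Climb rises by one meter.
base = predicted_time(10, 500)
one_more_meter = predicted_time(10, 501)
effect = one_more_meter - base

print(f"extra time for 1 more meter of climb: {effect:.4f} minutes")
```

The difference equals the Climb coefficient whatever distance is chosen, which is what "for races of a given distance" means.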
6. (pg. 807 – more hill races 2008: women)
a) The two models are similar. It appears that both additional Distance and additional Climb lead to
larger increases in average Race Record.
b) The residuals appear to fan out with increasing predicted value. This is a violation of the "Does
the Plot Thicken?" condition. There may be two high outliers as well, but re-expressing Time may
improve both problems.
12. (pg. 809 – breakfast cereals)
a) Calories = -0.88 + 3.605*Protein + 8.569*Fat – 0.309*Fiber + 4.140*Carbo + 4.007*Sugars
b) The R-square says that 93.6% of the variance in Calories is accounted for by the regression
model, and we are told that the assumptions and conditions are met. The model should do a
good job of predicting calories.
Tim’s comment: the model had BETTER do a good job of predicting calories, as it includes
measures of all calorie-possessing nutritional components
c) Scatterplot of residuals vs. predicted values, Normal probability plot of the residuals
d) After allowing for the linear effects of Protein, Fiber, Carbohydrate, and Sugars, each gram of Fat
is, on average, associated with an increase of 8.57 calories.
Tim’s comment: surprise! Fat has approximately 9 calories per gram
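The fitted model in 12a) can be written as a small function, which makes the "per gram" reading of each coefficient in 12d) easy to see. The Python sketch below uses the coefficients from the key. The cereal composition is hypothetical and is there only to exercise the function.

```python
# Sketch: the calorie model from 12a as a function of grams per serving.
INTERCEPT = -0.88
COEFFS = {
    "Protein": 3.605,
    "Fat": 8.569,
    "Fiber": -0.309,
    "Carbo": 4.140,
    "Sugars": 4.007,
}

def predicted_calories(grams):
    """Predicted calories for a dict mapping each nutrient to grams."""
    return INTERCEPT + sum(COEFFS[name] * grams.get(name, 0.0) for name in COEFFS)

# Hypothetical cereal, for illustration only.
cereal = {"Protein": 3, "Fat": 1, "Fiber": 2, "Carbo": 15, "Sugars": 6}
more_fat = dict(cereal, Fat=cereal["Fat"] + 1)   # same cereal with 1 more gram of fat

base_calories = predicted_calories(cereal)
fat_effect = predicted_calories(more_fat) - base_calories

print(f"predicted calories: {base_calories:.1f}")
print(f"effect of 1 more gram of fat: {fat_effect:.3f} calories")
```

Adding a gram of fat while holding the other nutrients fixed changes the prediction by the Fat coefficient, close to the 9 calories per gram mentioned in Tim's comment.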
Chapter 31
4. (pg. 845 – 50 states)
a) Yes, they are influential. Points that have both large leverage and large Studentized residuals are
bound to be influential (though, remember that a point can be influential without large residual)
b) The t-ratios for indicator variables are t-tests of whether those cases fit the regression model
established by the other cases. Both of the indicators have t-ratios that are large enough to be
significant at the 0.05 level.
c) The t-ratios for illiteracy and income are not very large. The coefficient for Income is near zero.
Either predictor might be considered for removal from the regression.
5. (pg. 846 – cereals, part 2)
a) After allowing for the effects of Sodium and Sugars, each gram of Potassium is associated with a
0.019-calorie decrease
b) These points pull the slope of the relationship down. Omitting them should increase the value of
the coefficient of Potassium. It would likely become positive, since the remaining points show a
positive slope in the partial regression plot.
c) These appear to be influential points. They have both high leverage and large residual, and the
partial regression plot shows their influence.
d) If our goal is to understand the relationships among these variables, then it might be best to
omit these cereals because they seem to behave as if they are from a separate sub-group.
6. (Scottish hill races 2008)
a) The scatterplot of the residuals against the predicted values shows a fan shape with the spread
increasing for longer races. The Lairig Ghru race has high leverage.
b) Both races may be outliers. Their larger residuals have inflated the residual standard deviation
and may have reduced the R-square, but we can’t tell if they have affected the coefficients
without examining partial regression plots.
c) The partial regression plots show that these races have had little effect on the coefficients
other than on the intercept (which may have increased). A scatterplot of Distance vs. Climb
shows that the Lairig Ghru race is unusually long and has very little climb (relative to other
races), accounting for its large leverage.
8. (pg. 846 – gourmet pizza)
Hint: if you go back to look at problem 2 (pg.844) be aware that the variable Type in the regression
model there should be labeled Cheese (an indicator that is 1 for cheese and 0 for pepperoni) to be
consistent with later problems.
a) Reggio’s and Michelina’s received lower scores than we would otherwise have expected from
the model.
b) The t-ratio for the indicator variable for Michelina’s is -4.03, which is large. We can reject the
null hypothesis that Michelina’s fits the regression model, and can conclude that it is an outlier.
10. (pg. 848 – another slice of pizza)
a) Cheese and pepperoni pizzas don’t appear to be described by the same model.
b) The slope of taste score on calories, after allowing for the linear effects of fat and removing the
influence of the two outlying pizzas, is estimated to be 1.92 – 0.4615 = 1.45 points per calorie for
cheese pizzas.
c) This should be a better regression model. We’ve identified a consistent difference between
pepperoni and cheese pizzas and incorporated it into the model. All coefficients are significantly
different from zero, and both the R-square and adjusted R-square are higher than the model in
Exercise 8.
12. (the final slice)
a) The coefficient for the indicator for Weight Watchers is not significantly discernible from zero at
the 0.05 level, but with a P-value of 0.09, it may still improve the model. This model has a
slightly higher R-square and adjusted R-square, but is not enough improved to be grounds for
choosing between the models. But the t-ratios are larger and the P-values smaller for a number
of the coefficients. That’s a sign of improvement.
b) Looking at the other coefficients in the model (and especially at the coefficient for Calories – not
too surprising, considering the identity of the newly isolated pizza), the addition of an indicator
for the Weight Watchers pizza has made several of them more significantly different from zero.
This seems to be a cleaner model and one that might lead to better understanding.
c) The tasters score cheese pizzas substantially higher than pepperoni pizzas. Even after allowing
for that, additional fat lowers scores, but higher-calorie pizzas score a bit better.