Professor François Nielsen
SOCI 252-002
Homework 6 – Key

Chapter 27

2. (pg. 702 – drug use)
a) The percentage of 9th graders in these countries who have used other drugs is estimated to increase by 0.615 percentage points for each 1-percentage-point increase in the percentage of 9th graders who have used marijuana.
b) H0: There is no (linear) relationship between the use of marijuana and other drugs; B1 = 0
HA: There is a relationship; B1 ≠ 0
c) t = 7.95, P-value = 0.0001. With such a low P-value, we reject H0. We are very confident that the percentage of teens using other drugs is positively related to the percentage using marijuana.
d) Percentage using marijuana accounts for 87.3% of the variation in other drug usage for 9th graders in these countries.
e) The use of other drugs is associated with marijuana use, but this relationship offers no proof of causality. There may be a lurking variable (e.g., cultural permissiveness toward drugs may predict both a country’s rate of marijuana use and its rate of other drug use).

4. (pg. 703 – Saratoga home prices)
a) Predicted price = -3.12 + 94.5*Size
The model suggests that Saratoga houses cost about $94.50 per square foot.
b) The P-value for the intercept is 0.50. That means we cannot discern a difference between the intercept value and zero. Remember that the intercept is the value of the response variable (price) when the predictor variable (size) equals zero. A value of $0 for a house of zero size makes sense.
c) Amounts by which house prices differ from predictions made by this model vary, with a standard deviation of about $54,000.
d) $2.393 per square foot
e) If we constructed other models based on different samples of houses, we’d expect the slopes of the regression lines to vary, with a standard deviation of about $2.39 per square foot.

6. (second home)
a) The scatterplot looks straight enough; the residuals look random and are nearly normal, and the residuals don’t display any clear change in variability.
b) I’m 95% confident that Saratoga housing costs increase at a rate of between $89.80 and $99.20 per square foot.

18. (pg. 705 – SAT scores)
a) H0: There is no (linear) relationship between SAT Verbal and Math scores; B1 = 0
HA: There is a relationship; B1 ≠ 0
b) Assumptions seem reasonable, since conditions are satisfied. The scatterplot suggests a positive linear relationship. The residual plot shows no patterns (one outlier); the histogram of residuals is unimodal and roughly symmetric.
c) t = 11.9; P-value < 0.0001. These data show evidence of a positive relationship between SAT Verbal and Math scores (surprise!).

20. (pg. 706 – SAT, part II)
a) 90% confidence interval for the slope of the true line describing the relationship between SAT Math and Verbal scores: (0.581, 0.769)
b) Based on the sample, we are 90% confident that the average SAT Math score increases by between 0.58 and 0.77 points for each additional point scored on the SAT Verbal section.

22. (SAT, again) – Optional
Hint: this problem asks for a confidence interval for the predicted mean and a prediction interval for an individual observation. I recommend that you use the predict function in R with options interval="confidence" and interval="prediction" to do this. See instructions in the handout for HW 6.
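For anyone checking the two intervals by hand rather than with predict, here is a minimal Python sketch (not part of the assigned R workflow) of the usual standard-error formulas for a mean response and for a single new observation. The function name and its arguments are illustrative. The values s = 71.75 and t = 1.65 and the fitted line come from this key, and the slope standard error is backed out from the slope 0.675 and its t-ratio 11.9. The values of n and x_bar are placeholders only, since the key does not report them.

```python
import math

def interval_half_widths(x_new, x_bar, n, s, se_slope, t_crit):
    """Half-widths of the interval for the mean response and for one new observation."""
    # Variance of the estimated mean response at x_new:
    # slope uncertainty grows with distance from x_bar, plus uncertainty in the mean.
    var_mean = (se_slope * (x_new - x_bar)) ** 2 + s ** 2 / n
    # A single new observation also carries the residual variance s^2.
    var_pred = var_mean + s ** 2
    return t_crit * math.sqrt(var_mean), t_crit * math.sqrt(var_pred)

# Values taken from this key
s = 71.75                # residual standard deviation from the regression output
t_crit = 1.65            # critical value the key uses for 90% confidence
se_slope = 0.675 / 11.9  # slope divided by its t-ratio

# ASSUMED placeholders: the key does not report n or the mean Verbal score
n, x_bar = 150, 600.0

for verbal in (500, 710):
    fitted = 209.554 + 0.675 * verbal
    ci, pi = interval_half_widths(verbal, x_bar, n, s, se_slope, t_crit)
    print(f"Verbal {verbal}: fit {fitted:.1f}, mean ± {ci:.1f}, individual ± {pi:.1f}")
```

Because the individual half-width can never be smaller than t*s, a shortcut that uses s alone as the standard error of a prediction gives a slightly narrower interval than the full formula.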
a) 90% confidence interval for the mean SAT Math score of all students with an SAT Verbal score of 500:
First, predicted mean SAT Math score = 209.554 + 0.675*500 = 547.1
t-value for 5% area in either tail (so 10% in both tails, leaving 90% confidence in the middle): 1.65
b) 90% prediction interval for the SAT Math score of a specific student with an SAT Verbal score of 710:
Note that this uses s from the regression output (71.75) as the standard error of the prediction itself.
First, predicted SAT Math score = 209.554 + 0.675*710 = 688.8
t-value for 5% area in either tail (so 10% in both tails, leaving 90% confidence in the middle): 1.65
688.8 ± 1.65*71.75 = 688.8 ± 118.4, so the interval is (570, 800)
Note: the calculated upper limit (807.2) was higher than 800, but SAT sections have a maximum of 800.

24. (brain size)
a) H0: There is no linear relationship between Brain Size and IQ; B1 = 0
HA: There is a linear relationship; B1 ≠ 0
t = 1.12, so this will not be significant.
b) With R-squared = 6.5%, the relationship is very weak. There seem to be three students with large brains who also scored high. Without them, there seems to be no association at all.

26. (winter)
• Scatterplot of Temperature against Latitude shows curvature (downward). Histogram of residuals is right skewed; residual plot shows decreasing variance as predicted values increase.

28. (pg. 707 – climate change and CO2)
a) Predicted temperature = 10.707 + 0.01006*CO2
b) Yes, t = 7.74; P-value < 0.0001
c) The standard deviation of the residuals is 0.0985 degrees C, so we shouldn’t expect the model to predict with an accuracy better than about ± 0.2 degrees C.

37. (pg. 709 – grades)
a) The regression model is Midterm2 = 12.005 + 0.721*Midterm1
b) The scatterplot shows a weak, somewhat linear, positive relationship. There are several outlying points, but removing them only makes the relationship slightly stronger. There is no obvious pattern in the residual plot. The regression model appears appropriate.
The small P-value for the slope shows that the slope is statistically distinguishable from 0, even though the R-square value is only 0.199. The value of s = 16.8 points indicates that the model would not predict performance on Midterm2 very accurately.

38. (pg. 710 – grades?)
a) The regression model is MT_total = 46.062 + 1.58*Homework
b) The scatterplot shows a strong, mostly linear, positive relationship between midterm total and homework scores. There is a model outlier and a high influence point, but the model is not significantly changed by deleting both points. There is no obvious pattern in the residual plot. The regression model appears appropriate. The small P-value for the slope shows that the slope is statistically distinguishable from 0.
c) The R-square value of 0.507 suggests that the overall relationship is fairly strong. However, this does not mean that midterm total is accurately predicted from homework scores. The error standard deviation of 18.3 points indicates that a prediction of midterm total could easily be off by 20 or 30 points or more. If this is a significant number of points for deciding grades, then homework scores alone will not suffice.

Chapter 30

4. (pg. 806 – Scottish hill races 2008: men)
a) Predicted Time = 10.372 + 4.04*Distance + 0.0342*Climb
The time it takes to complete a race increases with both distance and climbing elevation.
b) 98% of the variability in the men’s record times is accounted for by the regression model on Distance and Climb.
c) For races of a given distance, we expect the mean Men’s Record time to increase by about 0.034 minutes for each additional meter of Climb.

6. (pg. 807 – more hill races 2008: women)
a) The two models are similar. It appears that both additional Distance and additional Climb lead to larger increases in average Race Record.
b) The residuals appear to fan out with increasing predicted value. This is a violation of the Does the Plot Thicken? Condition.
There may be two high outliers as well, but re-expressing Time may improve both problems.

12. (pg. 809 – breakfast cereals)
a) Calories = -0.88 + 3.605*Protein + 8.569*Fat – 0.309*Fiber + 4.140*Carbo + 4.007*Sugars
b) The R-square says that 93.6% of the variance in Calories is accounted for by the regression model, and we are told that the assumptions and conditions are met. The model should do a good job of predicting calories.
Tim’s comment: the model had BETTER do a good job of predicting calories, as it includes measures of all calorie-possessing nutritional components.
c) Scatterplot of residuals vs. predicted values; Normal probability plot of the residuals
d) After allowing for the linear effects of Protein, Fiber, Carbohydrate, and Sugars, each gram of Fat is, on average, associated with an increase of 8.57 calories.
Tim’s comment: surprise! Fat has approximately 9 calories per gram.

Chapter 31

4. (pg. 845 – 50 states)
a) Yes, they are influential. Points that have both large leverage and large Studentized residuals are bound to be influential (though remember that a point can be influential without a large residual).
b) The t-ratios for indicator variables are t-tests of whether those cases fit the regression model established by the other cases. Both of the indicators have t-ratios that are large enough to be significant at the 0.05 level.
c) The t-ratios for Illiteracy and Income are not very large. The coefficient for Income is near zero. Either predictor might be considered for removal from the regression.

5. (pg. 846 – cereals, part 2)
a) After allowing for the effects of Sodium and Sugars, each gram of Potassium is associated with a 0.019-calorie decrease.
b) These points pull the slope of the relationship down. Omitting them should increase the value of the coefficient of Potassium. It would likely become positive, since the remaining points show a positive slope in the partial regression plot.
c) These appear to be influential points.
They have both high leverage and large residuals, and the partial regression plot shows their influence.
d) If our goal is to understand the relationships among these variables, then it might be best to omit these cereals because they seem to behave as if they are from a separate sub-group.

6. (Scottish hill races 2008)
a) The scatterplot of the residuals against the predicted values shows a fan shape with the spread increasing for longer races. The Lairig Ghru race has high leverage.
b) Both races may be outliers. Their larger residuals have inflated the residual standard deviation and may have reduced the R-square, but we can’t tell if they have affected the coefficients without examining partial regression plots.
c) The partial regression plots show that these races have had little effect on the coefficients other than on the intercept (which may have increased). A scatterplot of Distance vs. Climb shows that the Lairig Ghru race is unusually long and has very little climb (relative to other races), accounting for its large leverage.

8. (pg. 846 – gourmet pizza)
Hint: if you go back to look at problem 2 (pg. 844), be aware that the variable Type in the regression model there should be labeled Cheese (an indicator that is 1 for cheese and 0 for pepperoni) to be consistent with later problems.
a) Reggio’s and Michelina’s received lower scores than we would otherwise have expected from the model.
b) The t-ratio for the indicator variable for Michelina’s is -4.03, which is large. We can reject the null hypothesis that Michelina’s fits the regression model, and can conclude that it is an outlier.

10. (pg. 848 – another slice of pizza)
a) Cheese and pepperoni pizzas don’t appear to be described by the same model.
b) The slope of taste score on calories, after allowing for the linear effects of fat and removing the influence of the two outlying pizzas, is estimated to be 1.92 – 0.4615 ≈ 1.46 points per calorie for cheese pizzas.
c) This should be a better regression model.
We’ve identified a consistent difference between pepperoni and cheese pizzas and incorporated it into the model. All coefficients are significantly different from zero, and both the R-square and adjusted R-square are higher than for the model in Exercise 8.

12. (the final slice)
a) The coefficient for the indicator for Weight Watchers is not statistically distinguishable from zero at the 0.05 level, but with a P-value of 0.09, it may still improve the model. This model has a slightly higher R-square and adjusted R-square, but the improvement is not large enough to be grounds for choosing between the models. However, the t-ratios are larger and the P-values smaller for a number of the coefficients, which is a sign of improvement.
b) Looking at the other coefficients in the model (and especially at the coefficient for Calories – not too surprising, considering the identity of the newly isolated pizza), the addition of an indicator for the Weight Watchers pizza has made several of them more significantly different from zero. This seems to be a cleaner model and one that might lead to better understanding.
c) The tasters score cheese pizzas substantially higher than pepperoni pizzas. Even after allowing for that, additional fat lowers scores, but higher-calorie pizzas score a bit better.
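
As a recap of how the indicator and interaction terms in the pizza exercises combine, here is a minimal Python sketch. It assumes, reading the subtraction in Exercise 10 b), that 1.92 is the Calories coefficient and -0.4615 is the coefficient of a Cheese-by-Calories interaction term; that reading and the function name are illustrative assumptions, not taken from the regression output.

```python
def calorie_slope(b_calories, b_interaction, cheese):
    """Slope of taste score on Calories for one pizza type.

    cheese is the indicator variable: 1 for cheese, 0 for pepperoni.
    """
    # The interaction coefficient only contributes when the indicator is 1.
    return b_calories + b_interaction * cheese

# Coefficients as read from Exercise 10 b) (ASSUMED roles, see the note above)
b_calories, b_interaction = 1.92, -0.4615

pepperoni_slope = calorie_slope(b_calories, b_interaction, cheese=0)
cheese_slope = calorie_slope(b_calories, b_interaction, cheese=1)
print(f"pepperoni: {pepperoni_slope:.2f}, cheese: {cheese_slope:.2f}")
```

The same pattern explains why an indicator for a single case, such as the one for Michelina’s, acts as a test of whether that case fits the model: its coefficient shifts the prediction for that one pizza only.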