CSSS 508 Spring 2005 Homework Week 8 ANSWER KEY

1. Get the data set via read.csv() as shown below. This data set has 8 variables: X1, Y1, X2, Y2, X3, Y3, X4, Y4.

> anscombe <-
+ read.csv("http://www.stat.washington.edu/handcock/536/Data/anscombe.csv")

(a) Use lm() to fit four simple linear regressions: Y1 predicted by X1, Y2 predicted by X2, Y3 predicted by X3, and Y4 predicted by X4.

> fit1 <- lm(Y1~X1, data=anscombe)
> summary(fit1)

Call:
lm(formula = Y1 ~ X1, data = anscombe)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.92127 -0.45577 -0.04136  0.70941  1.83882 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   3.0001     1.1247   2.667  0.02573 * 
X1            0.5001     0.1179   4.241  0.00217 **
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-Squared: 0.6665,     Adjusted R-squared: 0.6295
F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.002170

> fit2 <- lm(Y2~X2, data=anscombe)
> summary(fit2)

Call:
lm(formula = Y2 ~ X2, data = anscombe)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9009 -0.7609  0.1291  0.9491  1.2691 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)    3.001      1.125   2.667  0.02576 * 
X2             0.500      0.118   4.239  0.00218 **
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-Squared: 0.6662,     Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179

> fit3 <- lm(Y3~X3, data=anscombe)
> summary(fit3)

Call:
lm(formula = Y3 ~ X3, data = anscombe)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1586 -0.6146 -0.2303  0.1540  3.2411 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   3.0025     1.1245   2.670  0.02562 * 
X3            0.4997     0.1179   4.239  0.00218 **
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-Squared: 0.6663,     Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176

> fit4 <- lm(Y4~X4, data=anscombe)
> summary(fit4)

Call:
lm(formula = Y4 ~ X4, data = anscombe)

Residuals:
       Min         1Q     Median         3Q        Max 
-1.751e+00 -8.310e-01  1.258e-16  8.090e-01  1.839e+00 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   3.0017     1.1239   2.671  0.02559 * 
X4            0.4999     0.1178   4.243  0.00216 **
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-Squared: 0.6667,     Adjusted R-squared: 0.6297
F-statistic: 18 on 1 and 9 DF,  p-value: 0.002165

(b) Compare these four fits in terms of their R2 and significance of the beta coefficients for intercept and slope.

Based on the R2 values and on the estimates and statistical significance of the intercept and slope coefficients, all four models appear to be essentially identical.

(c) Do a plot of the raw residuals, standardized residuals, and Cook's Distance for each regression. Which model looks best now?

> par(mfrow=c(2,2))
> plot(fit1)

[Figure: diagnostic plots for fit1 (Y1 ~ X1) -- Residuals vs Fitted, Normal Q-Q plot, Scale-Location plot, and Cook's distance plot.]
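The same diagnostics can also be pulled out numerically rather than read off the plots. The following sketch is not part of the original key; it uses the standard stats functions rstandard() and cooks.distance(), which report the same quantities the plot panels display, here for fit1:

## Numeric check of the diagnostics for fit1 (a sketch, not in the original answer):
## standardized residuals and Cook's distances for each observation.
round(rstandard(fit1), 3)
round(cooks.distance(fit1), 3)

## Which observation is most influential, and how large is its Cook's distance?
which.max(cooks.distance(fit1))
max(cooks.distance(fit1))

The same calls applied to fit2, fit3, and fit4 give the Cook's distance values cited below.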
> par(mfrow=c(2,2))
> plot(fit2)

[Figure: diagnostic plots for fit2 (Y2 ~ X2) -- Residuals vs Fitted, Normal Q-Q plot, Scale-Location plot, and Cook's distance plot.]

> par(mfrow=c(2,2))
> plot(fit3)

[Figure: diagnostic plots for fit3 (Y3 ~ X3) -- Residuals vs Fitted, Normal Q-Q plot, Scale-Location plot, and Cook's distance plot.]

[Figure: residuals, standardized residuals, Cook's distance, and fitted model for fit4 (Y4 ~ X4), plotted against observation index.]

Based on the residuals and Cook's distance plots, there appear to be some problems with Y2 ~ X2 (Cook's distance near 0.8), Y3 ~ X3 (1.39), and Y4 ~ X4 (∞). Based on these diagnostics alone, Y1 ~ X1 looks best.

(d) Plot X1 versus Y1, X2 versus Y2, X3 versus Y3, and X4 versus Y4. Include the fitted lines in your plots. What do you observe? What is the moral of this story?

plot.series.c <- function() {
  par(mfrow=c(2,2))   ### arrange the four plots in a 2 x 2 grid
  plot(anscombe$Y1 ~ anscombe$X1, pch=20)
  abline(fit1, col="blue")
  plot(anscombe$Y2 ~ anscombe$X2, pch=20)
  abline(fit2, col="blue")
  plot(anscombe$Y3 ~ anscombe$X3, pch=20)
  abline(fit3, col="blue")
  plot(anscombe$Y4 ~ anscombe$X4, pch=20)
  abline(fit4, col="blue")
  par(mfrow=c(1,1))
}

> plot.series.c()

[Figure: scatterplots of Y1 vs X1, Y2 vs X2, Y3 vs X3, and Y4 vs X4, each with its fitted regression line in blue.]

All of the pairs other than X1 versus Y1 seem to have systematic problems. Y2 ~ X2 has a parabolic shape, which is not appropriate for a simple linear regression without some kind of data transformation. Y3 ~ X3 has a single outlier (a high Y value) that pulls the regression line away from the strongly linear pattern of the rest of the data points. Y4 ~ X4 has X values that all equal 8 except for a single point with a high X and Y value; without this single outlier there would be no line at all, simply a mean Y value for all the identical X values. Only Y1 ~ X1 has a pattern that might properly be described with a linear regression. The moral of the story is to plot and look carefully at your data before blindly reporting significant p-values.

2. Using the data "hw8.dat", fit a multiple linear regression where y is the dependent variable. Look for outliers, remove any that look too influential, and refit the linear regression. Did it make any difference? Include relevant plots and your reasoning in your answer to this question. (Try the pairs() function.)

> two <- read.table("hw8.dat", header = TRUE)
> pairs(two)

[Figure: pairs() scatterplot matrix of y, x1, x2, x3, and x4.]

Some patterns emerge from the pairwise plots: y seems to have a positive linear relationship with x1 and x2, while there seems to be no relationship between y and x3 or x4. Other pairs that seem to have a linear relationship are x1 and x4, and possibly x3 and x4.
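A quick numeric complement to pairs() (a sketch, not part of the original key) is the correlation matrix, which quantifies the pairwise relationships just described:

## Pairwise correlations among y, x1, ..., x4, rounded for readability.
round(cor(two), 2)

## Correlations of y with each predictor only.
round(cor(two)["y", -1], 2)

The entries in the "y" row correspond directly to the impressions from the scatterplot matrix above.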
> fit <- lm(y~x1+x2+x3+x4, data=two)
> summary(fit)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4, data = two)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.2209  -2.4220   0.3521   1.9309  18.2397 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  11.1120     2.2991   4.833 5.16e-06 ***
x1            1.3089     0.6034   2.169 0.032571 *  
x2            2.8176     0.4424   6.368 6.77e-09 ***
x3           -1.7433     0.4643  -3.755 0.000299 ***
x4            1.3264     0.6760   1.962 0.052689 .  
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 3.911 on 95 degrees of freedom
Multiple R-Squared: 0.5657,     Adjusted R-squared: 0.5475
F-statistic: 30.94 on 4 and 95 DF,  p-value: < 2.2e-16

> library(MASS)   # stdres() comes from the MASS package
> plot(stdres(fit))

[Figure: standardized residuals of fit plotted against observation index.]

Based on the standardized residuals, observations 10 and 30 look suspect.

> plot(cooks.distance(fit))

[Figure: Cook's distance for fit plotted against observation index.]

Based on the Cook's distances, observations 30 and 80 are out of line. These visual determinations were verified by looking at the actual values of stdres() and cooks.distance() to make sure these are the observations in question. I then indexed the data set and kept all of the points I wanted (i.e., everything except observations 10, 30, and 80).

> select.two <- two[c(1:9,11:29,31:79,81:100), 1:5]
> fit2 <- lm(y~x1+x2+x3+x4, data=select.two)
> summary(fit2)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4, data = select.two)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.8548 -1.2187  0.5186  1.4796  4.4355 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.5623     1.4778   5.794 9.53e-08 ***
x1            2.1143     0.3856   5.483 3.64e-07 ***
x2            3.2754     0.2803  11.684  < 2e-16 ***
x3           -0.1508     0.3303  -0.457    0.649    
x4            0.2538     0.4419   0.574    0.567    
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 2.449 on 92 degrees of freedom
Multiple R-Squared: 0.7892,     Adjusted R-squared: 0.78
F-statistic: 86.09 on 4 and 92 DF,  p-value: < 2.2e-16

Eliminating observations 10, 30, and 80 appears to have helped quite a lot: the R-squared went from 0.5657 to 0.7892.

Method 2

Let's throw caution to the wind and just run a multiple linear regression on the data set (here the data frame is attached so the variables can be referred to directly):

> attach(two)
> summary(lm(y ~ x1 + x2 + x3 + x4))

Call:
lm(formula = y ~ x1 + x2 + x3 + x4)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.2209  -2.4220   0.3521   1.9309  18.2397 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  11.1120     2.2991   4.833 5.16e-06 ***
x1            1.3089     0.6034   2.169 0.032571 *  
x2            2.8176     0.4424   6.368 6.77e-09 ***
x3           -1.7433     0.4643  -3.755 0.000299 ***
x4            1.3264     0.6760   1.962 0.052689 .  
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 3.911 on 95 degrees of freedom
Multiple R-Squared: 0.5657,     Adjusted R-squared: 0.5475
F-statistic: 30.94 on 4 and 95 DF,  p-value: < 2.2e-16

This shows that under this model x4 has only a weak association with y. Let's drop that variable entirely:

> summary(lm(y ~ x1 + x2 + x3))

Call:
lm(formula = y ~ x1 + x2 + x3)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.2524  -1.9654   0.2407   1.9065  18.4910 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  14.3957     1.5994   9.001 2.09e-14 ***
x1            2.1693     0.4206   5.157 1.35e-06 ***
x2            2.6414     0.4396   6.008 3.36e-08 ***
x3           -1.2366     0.3915  -3.159  0.00212 ** 
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 3.969 on 96 degrees of freedom
Multiple R-Squared: 0.5481,     Adjusted R-squared: 0.534
F-statistic: 38.82 on 3 and 96 DF,  p-value: < 2.2e-16
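A more formal check on whether x4 can be dropped (a sketch, not part of the original answer) is a partial F-test comparing the two nested models with anova():

## Nested-model comparison: does adding x4 to the x1 + x2 + x3 model
## significantly improve the fit? (Uses the full data frame `two` from above.)
fit.full    <- lm(y ~ x1 + x2 + x3 + x4, data = two)
fit.reduced <- lm(y ~ x1 + x2 + x3, data = two)
anova(fit.reduced, fit.full)

Because only a single term is being dropped, the partial F-test is equivalent to the t-test on x4 reported above (p about 0.053).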
Now the slope coefficients are generally more significant without x4, the R2 values are nearly the same, and the p-value on the F statistic is no different. But this is a more parsimonious model.

There appeared to be a few outliers when looking at x1 vs. y:

> plot(x1, y)
> identify(x1, y)
[1] 10 30 50 80

[Figure: scatterplot of y against x1 with observations 10, 30, 50, and 80 labeled via identify().]

So let's try dropping these points:

> dropped <- -c(10,30,50,80)
> summary(lm(y[dropped] ~ x1[dropped] + x2[dropped] + x3[dropped]))

Call:
lm(formula = y[dropped] ~ x1[dropped] + x2[dropped] + x3[dropped])

Residuals:
    Min      1Q  Median      3Q     Max 
-5.4795 -1.1172  0.4216  1.3350  4.2008 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.2102     0.9932   8.266 1.00e-12 ***
x1[dropped]   2.4068     0.2374  10.139  < 2e-16 ***
x2[dropped]   3.2402     0.2468  13.130  < 2e-16 ***
x3[dropped]   0.2559     0.2445   1.047    0.298    
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 2.186 on 92 degrees of freedom
Multiple R-Squared: 0.828,      Adjusted R-squared: 0.8224
F-statistic: 147.6 on 3 and 92 DF,  p-value: < 2.2e-16

Now the overall R2 value increases substantially, from about 0.5 to 0.8! But this also makes x3 no longer significant. Let's try running the model without x3, but still on the subset:

> summary(lm(y[dropped] ~ x1[dropped] + x2[dropped]))

Call:
lm(formula = y[dropped] ~ x1[dropped] + x2[dropped])

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6336 -1.2701  0.4167  1.3210  4.5380 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.0933     0.5242   17.35   <2e-16 ***
x1[dropped]   2.4497     0.2339   10.47   <2e-16 ***
x2[dropped]   3.1691     0.2374   13.35   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 2.187 on 93 degrees of freedom
Multiple R-Squared: 0.826,      Adjusted R-squared: 0.8222
F-statistic: 220.7 on 2 and 93 DF,  p-value: < 2.2e-16

The R2 value is about the same, but the model is certainly more parsimonious. Of course, I don't exactly feel comfortable about dropping observations on an ad hoc basis....

FYI: this is the code that created the data.

make.mvn <- function() {
  require(MASS)
  mvn.sigma <- matrix(c(1.0, 0.3, 0.1, 0.7,
                        0.3, 1.0, 0.0, 0.2,
                        0.1, 0.0, 1.0, 0.5,
                        0.7, 0.2, 0.5, 1.0), byrow=T, ncol=4, nrow=4)
  mvn.dat <- mvrnorm(100, mu=c(1,2,3,4), Sigma=mvn.sigma)
  y <- 10 + 2 * mvn.dat[,1] + 3 * mvn.dat[,2] + rnorm(100, mean=0, sd=2)
  dat <- data.frame(y=y, x1=mvn.dat[,1], x2=mvn.dat[,2], x3=mvn.dat[,3],
                    x4=mvn.dat[,4])
  ### add outliers
  dat$x1[50] <- 2.6
  dat$y[50]  <- 10.7
  dat$x2[10] <- 2.1
  dat$y[10]  <- 33.3
  dat$x3[80] <- 6.1
  dat$y[80]  <- -0.8
  dat$x3[30] <- 0.1
  dat$y[30]  <- 30.2
  write.table(dat, file="hw8.dat", row.names=F, quote=F)
}
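As a usage sketch (the seed value below is arbitrary and not part of the original): because make.mvn() simulates data, the file it writes will differ from run to run unless a seed is set.

## Regenerate a version of hw8.dat; values will not match the original file
## exactly, since the original script did not fix a random seed.
set.seed(508)   # arbitrary seed, purely illustrative
make.mvn()
two <- read.table("hw8.dat", header = TRUE)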