Solution to HW8

CSSS 508 Spring 2005
Homework Week 8
ANSWER KEY
1. Get the data set via read.csv() as shown below. This data set has 8 variables: X1, Y1, X2,
Y2, X3, Y3, X4, Y4.
> anscombe <- read.csv("http://www.stat.washington.edu/handcock/536/Data/anscombe.csv")
(a) Use lm() to fit four simple linear regressions: Y1 predicted by X1, Y2 predicted by
X2, Y3 predicted by X3, and Y4 predicted by X4.
> fit1<-lm(Y1~X1, data=anscombe)
> summary(fit1)
Call:
lm(formula = Y1 ~ X1, data = anscombe)

Residuals:
     Min       1Q   Median       3Q      Max
-1.92127 -0.45577 -0.04136  0.70941  1.83882

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0001     1.1247   2.667  0.02573 *
X1            0.5001     0.1179   4.241  0.00217 **
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-Squared: 0.6665,     Adjusted R-squared: 0.6295
F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.002170
> fit2<-lm(Y2~X2, data=anscombe)
> summary(fit2)
Call:
lm(formula = Y2 ~ X2, data = anscombe)

Residuals:
    Min      1Q  Median      3Q     Max
-1.9009 -0.7609  0.1291  0.9491  1.2691

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    3.001      1.125   2.667  0.02576 *
X2             0.500      0.118   4.239  0.00218 **
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-Squared: 0.6662,     Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179
> fit3<-lm(Y3~X3, data=anscombe)
> summary(fit3)
Call:
lm(formula = Y3 ~ X3, data = anscombe)

Residuals:
    Min      1Q  Median      3Q     Max
-1.1586 -0.6146 -0.2303  0.1540  3.2411

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0025     1.1245   2.670  0.02562 *
X3            0.4997     0.1179   4.239  0.00218 **
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-Squared: 0.6663,     Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176
> fit4<-lm(Y4~X4, data=anscombe)
> summary(fit4)
Call:
lm(formula = Y4 ~ X4, data = anscombe)

Residuals:
       Min         1Q     Median         3Q        Max
-1.751e+00 -8.310e-01  1.258e-16  8.090e-01  1.839e+00

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0017     1.1239   2.671  0.02559 *
X4            0.4999     0.1178   4.243  0.00216 **
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-Squared: 0.6667,     Adjusted R-squared: 0.6297
F-statistic: 18 on 1 and 9 DF,  p-value: 0.002165
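For completeness, the four regressions could also be fit in a loop rather than one at a time. A minimal sketch (the object name fits is arbitrary):

fits <- lapply(1:4, function(i) {
    ## build the formula "Yi ~ Xi" from the variable names
    f <- as.formula(paste("Y", i, " ~ X", i, sep = ""))
    lm(f, data = anscombe)
})
lapply(fits, summary)    # produces the same four summaries as above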
(b) Compare these four fits in terms of their R2 and significance of the beta coefficients
for intercept and slope.
Based on the R2 values and on the estimates and statistical significance of the intercept and slope coefficients, the four fitted models appear to be essentially identical.
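A quick way to see this numerically is to collect the coefficients and R2 values in a single table. A minimal sketch, assuming fit1 through fit4 from part (a) are still in the workspace:

fit.list <- list(fit1, fit2, fit3, fit4)
comparison <- t(sapply(fit.list, function(f)
    c(intercept = unname(coef(f)[1]),
      slope     = unname(coef(f)[2]),
      R2        = summary(f)$r.squared)))
round(comparison, 4)    # four nearly identical rows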
(c) Do a plot of the raw residuals, standardized residuals, and Cook’s Distance for each
regression. Which model looks best now?
> par(mfrow=c(2,2))
> plot(fit1)
[Figure: diagnostic plots for fit1: Residuals vs Fitted, Normal Q-Q plot, Scale-Location plot, Cook's distance plot]
> par(mfrow=c(2,2))
> plot(fit2)
[Figure: diagnostic plots for fit2: Residuals vs Fitted, Normal Q-Q plot, Scale-Location plot, Cook's distance plot]
> par(mfrow=c(2,2))
> plot(fit3)
[Figure: diagnostic plots for fit3: Residuals vs Fitted, Normal Q-Q plot, Scale-Location plot, Cook's distance plot]
[Figure: residuals, standardized residuals, Cook's distance, and fitted-model plots for the Y4 ~ X4 regression]
Based on the residual and Cook's distance plots, there appear to be problems with Y2
~ X2 (Cook's distance near 0.8), Y3 ~ X3 (about 1.39), and Y4 ~ X4 (effectively infinite). By this
criterion alone, Y1 ~ X1 looks best.
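The quoted maxima can be checked numerically. A minimal sketch (for fit4 the leverage-one point makes the value non-finite, which is why it is reported as infinite above):

## largest Cook's distance in each of the four fits
sapply(list(fit1, fit2, fit3, fit4), function(f) max(cooks.distance(f)))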
(d) Plot X1 versus Y1, X2 versus Y2, X3 versus Y3, and X4 versus Y4. Include the
fitted lines in your plots. What do you observe? What is the moral of this story?
plot.series.c <- function()
{
    par(mfrow=c(2,2))    ### 2-by-2 grid of plots
    plot(anscombe$Y1 ~ anscombe$X1, pch=20)
    abline(fit1, col="blue")
    plot(anscombe$Y2 ~ anscombe$X2, pch=20)
    abline(fit2, col="blue")
    plot(anscombe$Y3 ~ anscombe$X3, pch=20)
    abline(fit3, col="blue")
    plot(anscombe$Y4 ~ anscombe$X4, pch=20)
    abline(fit4, col="blue")
    par(mfrow=c(1,1))
}
> plot.series.c()

[Figure: scatterplots of Y1 vs X1, Y2 vs X2, Y3 vs X3, and Y4 vs X4, each with its fitted regression line]
All of the pairs other than X1 versus Y1 show systematic problems. Y2 ~ X2 has a
parabolic shape that is not appropriate for a simple linear regression without some kind of
data transformation. Y3 ~ X3 has a single outlier (a high Y value) that pulls the regression
line away from the otherwise strongly linear pattern of the remaining points. Y4 ~ X4 has
X = 8 for every observation except one with a much larger X value; without that single point
there would be no line at all, just a mean Y value at a single X. Only Y1 ~ X1 shows a
pattern that can reasonably be described by a simple linear regression.
The moral of the story: graph and examine your data carefully before blindly reporting
significant p-values.
2. Using the data “hw8.dat”, fit a multiple linear regression where y is the dependent
variable. Look for outliers, remove any that look too influential, and refit the linear
regression. Did it make any difference? Include relevant plots and your reasoning in
your answer to this question. (Try the pairs() function).
> two <-read.table("hw8.dat", header = TRUE)
> pairs(two)
[Figure: pairs() scatterplot matrix of y, x1, x2, x3, and x4]
Several patterns show up in the pairwise plots: y appears to have a positive linear
relationship with x1 and with x2, and no clear relationship with x3 or x4. Among the
predictors, x1 and x4 (and possibly x3 and x4) also appear linearly related.
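The correlation matrix gives a numerical summary of the same relationships. A minimal sketch (the exact values depend on the simulated data):

round(cor(two), 2)    # pairwise correlations among y, x1, ..., x4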
> fit <- lm(y~x1+x2+x3+x4,data=two)
> summary(fit)
Call:
lm(formula = y ~ x1 + x2 + x3 + x4, data = two)

Residuals:
     Min       1Q   Median       3Q      Max
-14.2209  -2.4220   0.3521   1.9309  18.2397

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  11.1120     2.2991   4.833 5.16e-06 ***
x1            1.3089     0.6034   2.169 0.032571 *
x2            2.8176     0.4424   6.368 6.77e-09 ***
x3           -1.7433     0.4643  -3.755 0.000299 ***
x4            1.3264     0.6760   1.962 0.052689 .
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 3.911 on 95 degrees of freedom
Multiple R-Squared: 0.5657,     Adjusted R-squared: 0.5475
F-statistic: 30.94 on 4 and 95 DF,  p-value: < 2.2e-16
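Before looking at the plots, R can also be asked directly which observations it considers potentially influential. A minimal sketch using the base influence.measures() function:

summary(influence.measures(fit))    # prints the rows flagged as influential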
> library(MASS)    # stdres() comes from the MASS package
> plot(stdres(fit))

[Figure: standardized residuals plotted against observation index]
Based on the standardized residuals, observations 10 and 30 look suspect.
> plot(cooks.distance(fit))

[Figure: Cook's distance plotted against observation index]
Based on the Cook's distance, observations 30 and 80 are out of line.
These visual judgments were verified by looking at the actual values of stdres() and
cooks.distance() to make sure these are indeed the observations in question.
I then indexed the data set to keep only the points I wanted (i.e. everything except
observations 10, 30, and 80).
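One way to confirm the flagged observations numerically, a minimal sketch (the cutoffs 2 and 0.5 are only common rules of thumb):

which(abs(stdres(fit)) > 2)          # large standardized residuals
which(cooks.distance(fit) > 0.5)     # large Cook's distances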
> select.two <- two[c(1:9, 11:29, 31:79, 81:100), ]
> fit2 <- lm(y ~ x1 + x2 + x3 + x4, data = select.two)
> summary(fit2)
Call:
lm(formula = y ~ x1 + x2 + x3 + x4, data = select.two)
Residuals:
    Min      1Q  Median      3Q     Max
-9.8548 -1.2187  0.5186  1.4796  4.4355

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.5623     1.4778   5.794 9.53e-08 ***
x1            2.1143     0.3856   5.483 3.64e-07 ***
x2            3.2754     0.2803  11.684  < 2e-16 ***
x3           -0.1508     0.3303  -0.457    0.649
x4            0.2538     0.4419   0.574    0.567
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 2.449 on 92 degrees of freedom
Multiple R-Squared: 0.7892,     Adjusted R-squared: 0.78
F-statistic: 86.09 on 4 and 92 DF,  p-value: < 2.2e-16
Eliminating observations 10, 30, and 80 appears to have helped quite a lot: the
R-squared went from 0.5657 to 0.7892.
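An equivalent and slightly more compact way to drop those rows is negative indexing; a minimal sketch (fit2b is just an illustrative name):

fit2b <- lm(y ~ x1 + x2 + x3 + x4, data = two[-c(10, 30, 80), ])
summary(fit2b)$r.squared    # matches the 0.7892 above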
Method 2
Let's throw caution to the wind and simply run a multiple linear regression on the full data set (here the columns of two have been made directly available, e.g. via attach(two), so that y, x1, ..., x4 can be referenced by name):
> summary(lm(y ~ x1 + x2 + x3 + x4))
Call:
lm(formula = y ~ x1 + x2 + x3 + x4)

Residuals:
     Min       1Q   Median       3Q      Max
-14.2209  -2.4220   0.3521   1.9309  18.2397

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  11.1120     2.2991   4.833 5.16e-06 ***
x1            1.3089     0.6034   2.169 0.032571 *
x2            2.8176     0.4424   6.368 6.77e-09 ***
x3           -1.7433     0.4643  -3.755 0.000299 ***
x4            1.3264     0.6760   1.962 0.052689 .
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 3.911 on 95 degrees of freedom
Multiple R-Squared: 0.5657,     Adjusted R-squared: 0.5475
F-statistic: 30.94 on 4 and 95 DF,  p-value: < 2.2e-16
This shows that, in the full model, x4 has only a weak association with y (p ≈ 0.053). Let's drop that variable
entirely:
> summary(lm(y ~ x1 + x2 + x3))
Call:
lm(formula = y ~ x1 + x2 + x3)

Residuals:
     Min       1Q   Median       3Q      Max
-16.2524  -1.9654   0.2407   1.9065  18.4910

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  14.3957     1.5994   9.001 2.09e-14 ***
x1            2.1693     0.4206   5.157 1.35e-06 ***
x2            2.6414     0.4396   6.008 3.36e-08 ***
x3           -1.2366     0.3915  -3.159  0.00212 **
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 3.969 on 96 degrees of freedom
Multiple R-Squared: 0.5481,     Adjusted R-squared: 0.534
F-statistic: 38.82 on 3 and 96 DF,  p-value: < 2.2e-16
Now the slope coefficients seem to be generally more significant without x4, the R2 values are
nearly the same, and the p-value on the F statistic is no different. But this is a more
parsimonious model.
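Whether x4 can safely be dropped can also be tested formally with a partial F-test; a minimal sketch, assuming the columns of two are still attached as above:

anova(lm(y ~ x1 + x2 + x3), lm(y ~ x1 + x2 + x3 + x4))    # F-test for adding x4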
There appeared to be a few outliers when looking at x1 vs. y:
> plot (x1, y)
> identify(x1, y)
[1] 10 30 50 80
[Figure: scatterplot of x1 versus y with observations 10, 30, 50, and 80 labelled by identify()]
So let’s try dropping these points:
> dropped <- -c(10,30,50,80)
> summary(lm(y[dropped] ~ x1[dropped] + x2[dropped] + x3[dropped]))
Call:
lm(formula = y[dropped] ~ x1[dropped] + x2[dropped] + x3[dropped])
Residuals:
    Min      1Q  Median      3Q     Max
-5.4795 -1.1172  0.4216  1.3350  4.2008

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.2102     0.9932   8.266 1.00e-12 ***
x1[dropped]   2.4068     0.2374  10.139  < 2e-16 ***
x2[dropped]   3.2402     0.2468  13.130  < 2e-16 ***
x3[dropped]   0.2559     0.2445   1.047    0.298
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 2.186 on 92 degrees of freedom
Multiple R-Squared: 0.828,      Adjusted R-squared: 0.8224
F-statistic: 147.6 on 3 and 92 DF,  p-value: < 2.2e-16
Now the overall R2 value increases substantially, from about 0.55 to 0.83. However, x3 is no longer
significant, so let's refit without x3, still on the reduced data set:
> summary(lm(y[dropped] ~ x1[dropped] + x2[dropped]))
Call:
lm(formula = y[dropped] ~ x1[dropped] + x2[dropped])
Residuals:
    Min      1Q  Median      3Q     Max
-5.6336 -1.2701  0.4167  1.3210  4.5380

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.0933     0.5242   17.35   <2e-16 ***
x1[dropped]   2.4497     0.2339   10.47   <2e-16 ***
x2[dropped]   3.1691     0.2374   13.35   <2e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 2.187 on 93 degrees of freedom
Multiple R-Squared: 0.826,      Adjusted R-squared: 0.8222
F-statistic: 220.7 on 2 and 93 DF,  p-value: < 2.2e-16
The R2 value is about the same but the model is certainly more parsimonious. Of course I don’t
exactly feel comfortable about dropping observations on an ad hoc basis….
FYI: this is the code that generated the data set:
make.mvn <- function()
{
    require(MASS)
    mvn.sigma <- matrix(c(1.0, 0.3, 0.1, 0.7,
                          0.3, 1.0, 0.0, 0.2,
                          0.1, 0.0, 1.0, 0.5,
                          0.7, 0.2, 0.5, 1.0), byrow=T, ncol=4, nrow=4)
    mvn.dat <- mvrnorm(100, mu=c(1,2,3,4), Sigma=mvn.sigma)
    y <- 10 + 2 * mvn.dat[,1] + 3 * mvn.dat[,2] + rnorm(100, mean=0, sd=2)
    dat <- data.frame(y=y, x1=mvn.dat[,1], x2=mvn.dat[,2], x3=mvn.dat[,3],
                      x4=mvn.dat[,4])
    ### add outliers
    dat$x1[50] <- 2.6
    dat$y[50]  <- 10.7
    dat$x2[10] <- 2.1
    dat$y[10]  <- 33.3
    dat$x3[80] <- 6.1
    dat$y[80]  <- -0.8
    dat$x3[30] <- 0.1
    dat$y[30]  <- 30.2
    write.table(dat, file="hw8.dat", row.names=F, quote=F)
}
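To regenerate hw8.dat one would simply call the function, optionally after setting a random seed; a minimal sketch (the seed value is arbitrary):

set.seed(508)    # arbitrary seed, purely for reproducibility
make.mvn()       # writes hw8.dat to the working directory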