CSSS 508: Intro R
2/24/06
Linear Regression
Quick reminder of our regression equation:
y = B0 + B1*x1 + B2*x2 + B3*x3 + ... + Bp*xp
y – the dependent variable; x1, ..., xp – the independent variables
B0 – the intercept; Bi – the slope (or rate of change) for xi
Using R for (Multivariate) Linear Regression:
help(lm)
Arguments:
formula: y ~ x1 + x2 + x3 + ... + xp
data: the data frame containing the variables in the model;
only needed if the variables y, x1, x2, etc. are not already defined in R.
There are several other arguments that give you the option of fine-tuning the
regression – weighting observations, modeling a subset of the data, the fitting method, etc.
(a short sketch of these arguments follows).
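For illustration only (the data frame my.data and its weight column w are made up),
a call using these optional arguments might look like:

# hypothetical data frame "my.data" with columns y, x1, x2 and a column of weights w
fit <- lm(y ~ x1 + x2, data = my.data, weights = w, subset = (x1 > 0))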
Example:
PLEASE NOTE THAT WE ARE GENERATING RANDOM DATA;
YOUR RESULTS WILL BE SLIGHTLY DIFFERENT.
A Known Line Equation:
x<-sample(seq(1,10,by=.5),20,replace=T)
y<- 12 + 3*x
plot(x,y,pch=16)
Our data are of course plotted along a perfect line: y = 12 + 3x
abline(12,3)
In real life, data are much messier, and noise in the data can obscure the line.
y2<-12+3*x + rnorm(20,0,1)
plot(x,y2,pch=16)
Let’s see how well R finds the original line:
lm(y2~x)
Call:
lm(formula = y2 ~ x)

Coefficients:
(Intercept)            x
     11.719        3.038
Note that lm( ) automatically finds an intercept.
If you do not want an intercept, use: lm(y2~x-1)
But for us, lm() was pretty close.
abline(11.719,3.038)
However, lm() doesn't just give back the coefficients.
lm() returns an object of class lm, whose full contents you can summarize with the summary() command.
fit1<-lm(y2~x)
summary(fit1)
Call:
lm(formula = y2 ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-1.4196 -0.5581 -0.2527  0.7194  2.2567

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.71933    0.62115   18.87 2.63e-13 ***
x            3.03752    0.08931   34.01  < 2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.039 on 18 degrees of freedom
Multiple R-Squared: 0.9847,     Adjusted R-squared: 0.9838
F-statistic:  1157 on 1 and 18 DF,  p-value: < 2.2e-16
First, the model you used is returned. Next is a summary of the residuals (the
differences between the y values you have and the y values predicted by the model).
Ideally, our residuals should be scattered around zero. The coefficients are returned with
their standard errors and significance levels (t statistic and p-value); the stars flag the
significance codes. The residual standard error is reported along with the coefficient of
determination R2 (the closer to 1, the better the model). The F-statistic reports the
significance of the model as a whole, i.e. do we do better including x in the model
compared to just having an intercept?
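One way to see that comparison explicitly (a small sketch reusing fit1, defined below) is to
fit the intercept-only model and compare the two fits with anova():

fit0 <- lm(y2 ~ 1)   # intercept-only model
anova(fit0, fit1)    # F-test comparing the two models; matches the F-statistic above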
This is just the summary of the results returned by lm().
We can get the individual pieces of information as well.
The names command returns a list of all the information associated with your object.
names(fit1)
[1] "coefficients" "residuals"
[5] "fitted.values" "assign"
[9] "xlevels"
"call"
"effects"
"qr"
"terms"
"rank"
"df.residual"
"model"
fit1$coefficients
(Intercept)           x
  11.719331    3.037521
fit1$residuals
fit1$fitted.values
etc.
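Equivalently (a side note), R also has extractor functions that return the same pieces:

coef(fit1)       # same as fit1$coefficients
residuals(fit1)  # same as fit1$residuals
fitted(fit1)     # same as fit1$fitted.values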
Let’s look at how well our regression model did.
We’ll use the results from the model to look at some plotting diagnostics.
plot(fit1$residuals)
abline(h=0)
Our residuals are scattered about zero;
we don’t have too many positive or too many negative residuals.
We can also use a qqnorm plot to see if the residuals are distributed normally (with mean
zero). If yes, the plot should show a fairly straight line (add with qqline).
qqnorm(fit1$res)
qqline(fit1$res)
This code gives us a picture of our model, comparing the actual data to the predicted data.
plot(x,y2,pch=16,ylab="y")
abline(fit1$coef[1],fit1$coef[2])
points(x,fit1$fit,pch=16,col=2)
title("Actual vs. Fitted Values")
text(locator(1),paste("y =",round(fit1$coef[1],2),"+",round(fit1$coef[2],2),"*x",sep=""))
legend(locator(1),c("Actual","Predicted"),col=c(1,2),pch=16)
Predicting Points:
lm( ) has a companion function predict( ) that you can use with your lm object.
We pass in the lm object and the points we want predictions for as a dataframe of x’s.
predict(fit1,data.frame(x=4.5))
predict(fit1,data.frame(x=2))
new.data<-c(2.67,8.14,9.67)
new.y<-predict(fit1,data.frame(x=new.data))
points(new.data,new.y,col=6,pch=16)
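If you also want interval estimates rather than just point predictions, predict() takes an
interval argument; for example:

predict(fit1,data.frame(x=new.data),interval="confidence")  # CI for the mean response
predict(fit1,data.frame(x=new.data),interval="prediction")  # interval for a new observation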
Confidence Intervals:
(95% CI: B +/- 1.96*SE(B))
Recall:
summary(fit1)$coef
             Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 11.719331 0.62114662 18.86725 2.628948e-13
x            3.037521 0.08930662 34.01227 8.688654e-18
We can write code to find the confidence intervals for our coefficients.
sum.coef<-summary(fit1)$coef
ci<-cbind(sum.coef[,1],
   round(sum.coef[,1]+1.96*sum.coef[,2],3),
   round(sum.coef[,1]-1.96*sum.coef[,2],3))
colnames(ci)<-c("Coef","Upper CI","Lower CI")
ci
                 Coef Upper CI Lower CI
(Intercept) 11.719331   12.937   10.502
x            3.037521    3.213    2.862
How would we change this to get a 90% CI or a 99% CI?
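One approach (a sketch): replace 1.96 with the appropriate normal quantile, or let the
built-in confint() function do the work (it uses t rather than normal quantiles, so its
numbers will differ slightly from the table above):

qnorm(0.95)                # about 1.645, the multiplier for a 90% CI
qnorm(0.995)               # about 2.576, the multiplier for a 99% CI
confint(fit1, level=0.90)  # built-in 90% confidence intervals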
How can we plot this information?
pt.est<-cbind(seq(1,nrow(ci)),ci[,1])
y.low<-min(-1,min(ci)-0.5)
y.high<-max(ci)+0.5
plot(pt.est,pch=16,cex=1.5,xlim=c(0,nrow(ci)+1),ylim=c(y.low,y.high),
   xaxt="n",xlab="",ylab="Coefficient")
points(cbind(c(1,1),ci[1,2:3]),col=2,pch=16)
lines(cbind(c(1,1),ci[1,2:3]),col=2)
text(1,y.low,"B0")
points(cbind(c(2,2),ci[2,2:3]),col=2,pch=16)
lines(cbind(c(2,2),ci[2,2:3]),col=2)
text(2,y.low,"B1")
title("95% Confidence Intervals")
abline(h=0,lwd=2)
Compared to the intercept, the 95% CI for the slope of the line is very narrow.
Confidence intervals that do not include zero correspond to significant variables.
Higher Order Terms/Interactions:
We have looked at the residuals of our models to assess their fit. We can also use them to
look for patterns indicating the need for a higher-order term.
g<-seq(0,5,by=.1)
h<-g^2+2 + rnorm(length(g),0,1)
plot(g,h)
lm(h~g)
Call:
lm(formula = h ~ g)

Coefficients:
(Intercept)            g
     -2.287        5.013
abline(-2.287,5.013)
fit2<-lm(h~g)
plot(fit2$residuals)
Clearly not a random scatter; there is a pattern among the residuals, suggesting the model
needs an x^2 term. We add this by defining a new variable g2 = g^2.
g2<-g^2
fit3<-lm(h ~ g + g2)
summary(fit3)
Call:
lm(formula = h ~ g + g2)

Residuals:
     Min       1Q   Median       3Q      Max
-2.46341 -0.62599 -0.04959  0.55741  2.50647

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.84659    0.41443   4.456    5e-05 ***
g           -0.04820    0.38332  -0.126      0.9
g2           1.01220    0.07414  13.653   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.025 on 48 degrees of freedom
Multiple R-Squared: 0.9833,     Adjusted R-squared: 0.9826
F-statistic:  1413 on 2 and 48 DF,  p-value: < 2.2e-16
plot(fit3$res)
Now our residuals look better. Note that the linear term (g) is not significant in the above
model. The g2 term is the important part of the model.
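An equivalent way to fit the same model (a side note) is to put the squared term directly in
the formula with I(), without creating a separate g2 variable:

fit3b <- lm(h ~ g + I(g^2))   # same model as fit3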
Let’s say we have two variables and we want to fit an interaction term.
Creating Data with an inherent interaction:
x1<-sample(1:2,60,replace=T)
x2<-rnorm(60,3,1)
y<-rep(NA,60)
#(different models depending on the x1 value)
y[x1==1]<-2*x2[x1==1]-10
y[x1==2]<-10*x2[x1==2]+10
y<-y+rnorm(60,0,1)
summary(lm(y~x1+x2+x1*x2))
Call:
lm(formula = y ~ x1 + x2 + x1 * x2)

Residuals:
    Min      1Q  Median      3Q     Max
-3.0077 -0.5502 -0.1577  0.7501  2.0823

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -30.3836     1.3730  -22.13   <2e-16 ***
x1           20.3724     0.8249   24.70   <2e-16 ***
x2           -5.8004     0.4244  -13.67   <2e-16 ***
x1:x2         7.8371     0.2572   30.47   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.044 on 56 degrees of freedom
Multiple R-Squared: 0.9981,     Adjusted R-squared: 0.998
F-statistic:  9717 on 3 and 56 DF,  p-value: < 2.2e-16
Note that the coefficient for the interaction term (x1:x2) is roughly the difference in the
slopes of the two models used above (10 - 2 = 8).
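A formula side note: x1*x2 already expands to x1 + x2 + x1:x2, so the same model can be
written more compactly:

summary(lm(y ~ x1*x2))   # identical model to y ~ x1 + x2 + x1*x2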
Rebecca Nugent, Department of Statistics, U. of Washington