CSSS 508: Intro R    2/24/06
Linear Regression
Rebecca Nugent, Department of Statistics, U. of Washington

Quick reminder of our regression equation:

y = B0 + B1*x1 + B2*x2 + B3*x3 + ... + Bp*xp

y: the dependent variable; x's: the independent variables
B0: the intercept; Bi: the slope (or rate of change) for xi

Using R for (Multivariate) Linear Regression: help(lm)

Arguments:
Formula: y ~ x1 + x2 + x3 + ... + xp
Data: the data frame containing the variables in the model; only needed if the variables y, x1, x2, etc. are not already individual objects in R.

There are several other arguments that give you the option of fine-tuning the regression: weighting variables, modeling a subset of the data, fitting method, etc.

Example: PLEASE NOTE THAT WE ARE GENERATING RANDOM DATA; YOUR RESULTS WILL BE SLIGHTLY DIFFERENT.

A Known Line Equation:

x<-sample(seq(1,10,by=.5),20,replace=T)
y<-12+3*x
plot(x,y,pch=16)

Our data are of course plotted along a perfect line: y = 12 + 3x

abline(12,3)

In real life, data are much messier, and noise in the data can obscure the line.

y2<-12+3*x+rnorm(20,0,1)
plot(x,y2,pch=16)

Let's see how well R finds the original line:

lm(y2~x)

Call:
lm(formula = y2 ~ x)

Coefficients:
(Intercept)            x
     11.719        3.038

Note that lm() automatically finds an intercept. If you do not want an intercept, use:

lm(y2~x-1)

But for us, lm() was pretty close.

abline(11.719,3.038)

However, lm() doesn't just give back the coefficients. lm() returns an object of class lm that you can explore with the summary() command.

fit1<-lm(y2~x)
summary(fit1)

Call:
lm(formula = y2 ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-1.4196 -0.5581 -0.2527  0.7194  2.2567

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.71933    0.62115   18.87 2.63e-13 ***
x            3.03752    0.08931   34.01  < 2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.039 on 18 degrees of freedom
Multiple R-Squared: 0.9847,     Adjusted R-squared: 0.9838
F-statistic:  1157 on 1 and 18 DF,  p-value: < 2.2e-16

First, the model you used is returned. Next is a summary of the residuals (the differences between the y values you have and the y values predicted by the model). Ideally, our residuals should be scattered around zero. The coefficients are returned with their standard errors and their significance level (t statistic and p-value); the stars flag significant coefficients. The standard error of the residuals is reported along with the coefficient of determination R2 (the closer to 1, the better the model). The F-statistic reports the significance of the model as a whole, i.e. do we do better including x in the model compared to just having an intercept?

This is just the summary of the results returned by lm(). We can get the individual pieces of information as well. The names command returns a list of all the information associated with your object.

names(fit1)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"

fit1$coefficients
(Intercept)           x
  11.719331    3.037521

fit1$residuals
fit1$fitted.values
etc.

Let's look at how well our regression model did. We'll use the results from the model to look at some plotting diagnostics.

plot(fit1$residuals)
abline(h=0)

Our residuals are scattered about zero; we don't have too many positive or too many negative residuals. We can also use a qqnorm plot to see if the residuals are distributed normally (with mean zero). If yes, the plot should show a fairly straight line (add with qqline).

qqnorm(fit1$res)
qqline(fit1$res)
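To see how the pieces of the lm object fit together, we can rebuild the fitted values, residuals, R2, and residual standard error by hand from the coefficients. This is a minimal sketch, assuming x, y2, and fit1 from above are still in your workspace:

b<-fit1$coefficients
yhat<-b[1]+b[2]*x                      # fitted values by hand
max(abs(yhat-fit1$fitted.values))      # should be essentially zero
res<-y2-yhat                           # residuals by hand
1-sum(res^2)/sum((y2-mean(y2))^2)      # R-squared, matches summary(fit1)
sqrt(sum(res^2)/fit1$df.residual)      # residual standard error

The hand-computed values should match what summary(fit1) reports (up to rounding).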
This code gives us a picture of our model, comparing the actual data to the predicted data.

plot(x,y2,pch=16,ylab="y")
abline(fit1$coef[1],fit1$coef[2])
points(x,fit1$fit,pch=16,col=2)
title("Actual vs. Fitted Values")
text(locator(1),paste("y =",round(fit1$coef[1],2),"+",round(fit1$coef[2],2),"*x",sep=""))
legend(locator(1),c("Actual","Predicted"),col=c(1,2),pch=16)

Predicting Points:

lm() has a companion function predict() that you can use with your lm object. We pass in the lm object and the points we want predictions for as a data frame of x's.

predict(fit1,data.frame(x=4.5))
predict(fit1,data.frame(x=2))

new.data<-c(2.67,8.14,9.67)
new.y<-predict(fit1,data.frame(x=new.data))
points(new.data,new.y,col=6,pch=16)

Confidence Intervals: (95% CI: B +/- 1.96*SE(B))

Recall:

summary(fit1)$coef
             Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 11.719331 0.62114662 18.86725 2.628948e-13
x            3.037521 0.08930662 34.01227 8.688654e-18

We can write code to find the confidence intervals for our coefficients.

sum.coef<-summary(fit1)$coef
ci<-cbind(sum.coef[,1],round(sum.coef[,1]+1.96*sum.coef[,2],3),round(sum.coef[,1]-1.96*sum.coef[,2],3))
colnames(ci)<-c("Coef","Upper CI","Lower CI")
ci
                 Coef Upper CI Lower CI
(Intercept) 11.719331   12.937   10.502
x            3.037521    3.213    2.862

How would we change this to get a 90% CI or a 99% CI?

How can we plot this information?

pt.est<-cbind(seq(1,nrow(ci)),ci[,1])
y.low<-min(-1,min(ci)-0.5)
y.high<-max(ci)+0.5
plot(pt.est,pch=16,cex=1.5,xlim=c(0,nrow(ci)+1),ylim=c(y.low,y.high),xaxt="n",xlab="",ylab="Coefficient")
points(cbind(c(1,1),ci[1,2:3]),col=2,pch=16)
lines(cbind(c(1,1),ci[1,2:3]),col=2)
text(1,y.low,"B0")
points(cbind(c(2,2),ci[2,2:3]),col=2,pch=16)
lines(cbind(c(2,2),ci[2,2:3]),col=2)
text(2,y.low,"B1")
title("95% Confidence Intervals")
abline(h=0,lwd=2)

Compared to the intercept, the 95% CI for the slope of the line is very narrow. Confidence intervals that do not include zero correspond to significant variables.

Higher Order Terms/Interactions:

We have looked at the residuals of our models to assess their fit. We can also use them to look for patterns indicating the need for a higher order term.

g<-seq(0,5,by=.1)
h<-g^2+2+rnorm(length(g),0,1)
plot(g,h)
lm(h~g)

Call:
lm(formula = h ~ g)

Coefficients:
(Intercept)            g
     -2.287        5.013

abline(-2.287,5.013)

fit2<-lm(h~g)
plot(fit2$residuals)

Clearly not a random scatter; there is a pattern among the residuals. We should have an x^2 term. We add this by defining a new variable g2 = g^2.

g2<-g^2
fit3<-lm(h ~ g + g2)
summary(fit3)

Call:
lm(formula = h ~ g + g2)

Residuals:
     Min       1Q   Median       3Q      Max
-2.46341 -0.62599 -0.04959  0.55741  2.50647

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.84659    0.41443   4.456    5e-05 ***
g           -0.04820    0.38332  -0.126      0.9
g2           1.01220    0.07414  13.653   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.025 on 48 degrees of freedom
Multiple R-Squared: 0.9833,     Adjusted R-squared: 0.9826
F-statistic:  1413 on 2 and 48 DF,  p-value: < 2.2e-16

plot(fit3$res)

Now our residuals look better. Note that the linear term (g) is not significant in the above model. The g2 term is the important part of the model.
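As an aside, you can get the same quadratic fit without creating g2 by hand: wrapping the squared term in I() inside the formula tells R to treat ^ as arithmetic rather than formula notation. A minimal sketch, assuming g and h from above (fit3b is just a name for the refit):

fit3b<-lm(h ~ g + I(g^2))   # I() protects the ^ so g is squared numerically
summary(fit3b)$coef          # same estimates as fit3 above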
Let's say we have two variables and we want to fit an interaction term.

Creating data with an inherent interaction:

x1<-sample(1:2,60,replace=T)
x2<-rnorm(60,3,1)
y<-rep(NA,60)
# different models depending on the x1 value
y[x1==1]<-2*x2[x1==1]-10
y[x1==2]<-10*x2[x1==2]+10
y<-y+rnorm(60,0,1)

summary(lm(y~x1+x2+x1*x2))

Call:
lm(formula = y ~ x1 + x2 + x1 * x2)

Residuals:
    Min      1Q  Median      3Q     Max
-3.0077 -0.5502 -0.1577  0.7501  2.0823

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -30.3836     1.3730  -22.13   <2e-16 ***
x1           20.3724     0.8249   24.70   <2e-16 ***
x2           -5.8004     0.4244  -13.67   <2e-16 ***
x1:x2         7.8371     0.2572   30.47   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.044 on 56 degrees of freedom
Multiple R-Squared: 0.9981,     Adjusted R-squared: 0.998
F-statistic:  9717 on 3 and 56 DF,  p-value: < 2.2e-16

Note that the coefficient for the interaction (x1:x2) is roughly the difference in slopes (10 - 2 = 8) of the two models used above.
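To see where that difference in slopes lives in the output, we can unpack the fitted model by hand. With x1 coded 1/2, the slope on x2 within a group is the x2 coefficient plus x1 times the x1:x2 coefficient, so the interaction coefficient is the difference in slopes between the two groups. A quick sketch, assuming x1, x2, and y from above (fit4 is just a name for the refit):

fit4<-lm(y ~ x1 + x2 + x1*x2)
coef(fit4)[3]+1*coef(fit4)[4]    # slope on x2 in the x1==1 group (near 2)
coef(fit4)[3]+2*coef(fit4)[4]    # slope on x2 in the x1==2 group (near 10)
plot(x2,y,pch=16,col=x1)         # data colored by group
abline(coef(fit4)[1]+1*coef(fit4)[2],coef(fit4)[3]+1*coef(fit4)[4],col=1)
abline(coef(fit4)[1]+2*coef(fit4)[2],coef(fit4)[3]+2*coef(fit4)[4],col=2)

The two plotted lines should track the two groups of points, with intercepts near -10 and 10 and slopes near 2 and 10.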