Simple linear regression
Thomas Lumley
BIOST 578C

Linear model

Linear regression is usually presented in terms of a model

  Y = α + βX + ε,  ε ∼ N(0, σ²)

because the theoretical analysis is pretty for this model (BIOST 533: Theory of Linear Models). Unfortunately, this leads to people believing these assumptions are necessary.

Drawing a line

The best line through two points (x₁, y₁) and (x₂, y₂) is obvious: it is the line joining them. This line has slope

  β̃₁₂ = (y₂ − y₁) / (x₂ − x₁)

With n points a sensible approach would be to compute all the pairwise slopes and take some sort of summary of them.

Drawing a line

[Figure: scatterplot of ten example points, y against x]

Drawing a line

When xᵢ and xⱼ are close, the line joining them is more likely to go 'the wrong way', so we should give more weight to pairs where xᵢ − xⱼ is large. One sensible possibility is to define weights wᵢⱼ = (xᵢ − xⱼ)² and then take

  β̂ = Σᵢⱼ wᵢⱼ β̃ᵢⱼ / Σᵢⱼ wᵢⱼ

Another might be a weighted median: find β̂ so that Σ wᵢⱼ is the same for the pairs with β̃ᵢⱼ > β̂ as for the pairs with β̃ᵢⱼ < β̂.

These are summaries of the data that don't make any assumptions.

Least squares

Some algebra shows that the weighted-average summary of slope is exactly the usual least squares estimator in a linear regression model. We can check this in an example. First, an example that satisfies the usual assumptions:

x<-1:10
y<-rnorm(10)+x
lsmodel<-lm(y~x)
xdiff<-outer(x,x,"-")
ydiff<-outer(y,y,"-")
wij<-xdiff^2
betaij<-ifelse(xdiff==0, 0, ydiff/xdiff)
sum(wij*betaij)/sum(wij)
weighted.mean(as.vector(betaij), as.vector(wij))
coef(lsmodel)
plot(x,y)
abline(lsmodel)
lines(x,x,lty=2)

Notes

• outer makes a matrix whose (i, j) element is a function applied to the ith element of the first argument and the jth element of the second argument.
• lm is a function for fitting linear models.
• The object returned by lm incorporates a lot of information. Some of this can be extracted with functions such as coef and abline, and with other functions that we didn't use.
• Division by zero in ydiff/xdiff doesn't cause an error, but we can't just multiply by zero again afterwards: a non-zero number divided by zero is Inf or -Inf, and multiplying that by zero gives NaN. That is why betaij is built with ifelse.

[Figure: plot produced by the code above — y against x with the fitted least squares line and the dashed reference line y = x]

and now an example that doesn't satisfy the usual "assumptions":

x<-1:10
y<-x^2+rnorm(10,sd=1:10)
lsmodel<-lm(y~x)
xdiff<-outer(x,x,"-")
ydiff<-outer(y,y,"-")
wij<-xdiff^2
betaij<-ifelse(xdiff==0, 0, ydiff/xdiff)
sum(wij*betaij)/sum(wij)
weighted.mean(as.vector(betaij), as.vector(wij))
coef(lsmodel)
plot(x,y)
abline(lsmodel)
lines(x,x^2,lty=2)

[Figure: plot produced by the code above — y against x with the fitted least squares line and the dashed curve y = x²]

The "true" slope in this second example is the value we would get from a very large sample, or equivalently the value we get if we remove the rnorm error in y:

ytruediff<-outer(x^2,x^2,"-")
betatrueij<-ifelse(xdiff==0, 0, ytruediff/xdiff)
betatrue<-sum(wij*betatrueij)/sum(wij)
alphatrue<-mean(x^2)-mean(x)*betatrue
abline(alphatrue,betatrue,col="red")

[Figure: the same plot with the 'true' average-slope line added in red]

Model assumptions

• If you actually want to predict Y from X then you need an accurate model.
• An average slope may not be a useful summary of the data if you expect the relationship to be very nonlinear.
• The most popular standard error formula assumes that the relationship is linear and the variance is constant, but there are alternatives. One obvious alternative is the bootstrap (a sketch follows below), but there is also a reasonably simple analytic approach.
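The bootstrap idea in the last bullet can be made concrete in a few lines of R. This is a sketch added for illustration, not part of the original notes: it resamples (x, y) pairs with replacement and recomputes the least squares slope for each resample, using B = 1000 replicates and the x, y, and lsmodel objects from the second example above.

# Nonparametric bootstrap of the slope: resample (x, y) pairs with
# replacement and refit the least squares line for each resample.
B<-1000
bootslopes<-replicate(B, {
    i<-sample(length(x), replace=TRUE)   # indices of one resampled data set
    coef(lm(y[i]~x[i]))[2]               # slope from the refitted line
})
sd(bootslopes)                           # bootstrap standard error of the slope
coef(summary(lsmodel))[2,2]              # model-based standard error, for comparison

When the constant-variance assumption fails, as in the second example, the two standard errors can differ noticeably.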
Example

Anscombe (1973) gave four example data sets that give exactly the same slope and intercept summaries (α = 3, β = 0.5) and the same model-based standard errors. Whether the slope of the line is a useful summary will depend on the scientific questions at hand, as well as on the data, but it doesn't look promising for three of the four data sets.

[Figure: "Anscombe's 4 Regression data sets" — scatterplots of y1 against x1, y2 against x2, y3 against x3, and y4 against x4]

Least squares

The usual way to describe this slope summary is in terms of squared errors: we choose the values (α̂, β̂) that minimize

  Σᵢ₌₁ⁿ (Yᵢ − α − βXᵢ)²

The solution to this minimization problem is

  β̂ = cov(X, Y) / var(X)

which turns out to be the same as the weighted average of pairwise slopes. Least squares generalizes more easily to multiple predictors, but the interpretation is less clear.
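As a numerical check (added here, not in the original slides), the three ways of writing the slope can be compared on the first of Anscombe's data sets, which ships with R as the data frame anscombe (columns x1–x4 and y1–y4); the pairwise-slope code is reused from the earlier examples.

# Check on Anscombe's first data set that the least squares slope,
# cov(X,Y)/var(X), and the weighted average of pairwise slopes agree.
x<-anscombe$x1
y<-anscombe$y1
coef(lm(y~x))[2]              # least squares slope
cov(x,y)/var(x)               # covariance over variance
xdiff<-outer(x,x,"-")
ydiff<-outer(y,y,"-")
wij<-xdiff^2
betaij<-ifelse(xdiff==0, 0, ydiff/xdiff)
sum(wij*betaij)/sum(wij)      # weighted average of pairwise slopes
# all three give the same value, about 0.5, as advertised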