Simple linear regression
Thomas Lumley
BIOST 578C
Linear model
Linear regression is usually presented in terms of a model
$$Y = \alpha + \beta X + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$
because the theoretical analysis is pretty for this model (BIOST
533: Theory of Linear Models). Unfortunately, this leads to
people believing these assumptions are necessary.
Drawing a line
The best line between two points (x1, y1) and (x2, y2) is obvious:
it is the line joining them.
The line has slope
$$\tilde\beta_{12} = \frac{y_2 - y_1}{x_2 - x_1}$$
With n points a sensible approach would be to compute all the
pairwise slopes and take some sort of summary of them.
Drawing a line
[Figure: scatterplot of the ten (x, y) data points]
Drawing a line
When xi and xj are close, the line is more likely to go ‘the wrong
way’, so we should give more weight to pairs where |xi − xj| is large.
One sensible possibility is to define weights $w_{ij} = (x_i - x_j)^2$ and
then
$$\hat\beta = \frac{\sum_{i,j} w_{ij}\,\tilde\beta_{ij}}{\sum_{i,j} w_{ij}}$$
Another might be a weighted median: find $\hat\beta$ so that $\sum w_{ij}$ is the
same for $\tilde\beta_{ij} > \hat\beta$ as for $\tilde\beta_{ij} < \hat\beta$.
These are summaries of the data that don’t rely on any model assumptions.
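One possible sketch of the weighted-median summary (the helper name weighted.median.slope is just for illustration), using the betaij and wij matrices of pairwise slopes and weights that are constructed with outer on the next slide:

# Weighted median of the pairwise slopes: sort the slopes and return the
# first one at which the cumulative weight reaches half the total weight.
weighted.median.slope<-function(betaij, wij) {
  b<-as.vector(betaij); w<-as.vector(wij)
  b<-b[w>0]; w<-w[w>0]                  # drop zero-weight (diagonal) pairs
  o<-order(b)
  b<-b[o]; w<-w[o]
  b[which(cumsum(w)/sum(w) >= 0.5)[1]]
}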
Least squares
Some algebra shows that the weighted average summary of slope
is exactly the usual least squares estimator in a linear regression
model.
We can check this in an example. First, an example that satisfies
the usual assumptions:
x<-1:10
y<-rnorm(10)+x                # data generated from the linear model y = x + error
lsmodel<-lm(y~x)              # ordinary least squares fit
xdiff<-outer(x,x,"-")         # matrix of all pairwise differences x_i - x_j
ydiff<-outer(y,y,"-")         # matrix of all pairwise differences y_i - y_j
wij<-xdiff^2                  # weights (x_i - x_j)^2
betaij<-ifelse(xdiff==0, 0, ydiff/xdiff)   # pairwise slopes, set to 0 where x_i = x_j
sum(wij*betaij)/sum(wij)      # weighted average of pairwise slopes
weighted.mean(as.vector(betaij), as.vector(wij))   # same thing via weighted.mean
coef(lsmodel)                 # least squares intercept and slope
plot(x,y)
abline(lsmodel)               # fitted least squares line
lines(x,x,lty=2)              # true line y = x, dashed
Notes
• outer makes a matrix whose (ij) element is a function applied
to the ith element of the first argument and the jth element
of the second argument.
• lm is a function for fitting linear models.
• The object returned by lm incorporates a lot of information.
Some of this can be extracted with functions such as coef and
abline, and with other extractor functions that we didn’t use here.
• Division by zero in ydiff/xdiff doesn’t cause an error, but
we can’t just multiply by zero again to fix it up: a non-zero
number divided by zero is Inf or -Inf, and multiplying that by
zero gives NaN. A quick check of this is shown below.
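For example, at the R prompt:

1/0        # Inf
-1/0       # -Inf
0/0        # NaN (this is what the diagonal of ydiff/xdiff produces)
0*(1/0)    # NaN: multiplying Inf by zero does not give zero back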
Notes
[Figure: scatterplot of y against x for the first example, with the fitted least squares line (solid) and the true line y = x (dashed)]
and now an example that doesn’t satisfy the usual “assumptions”:
x<-1:10
y<-x^2+rnorm(10,sd=1:10)      # nonlinear mean and non-constant variance
lsmodel<-lm(y~x)
xdiff<-outer(x,x,"-")
ydiff<-outer(y,y,"-")
wij<-xdiff^2
betaij<-ifelse(xdiff==0, 0, ydiff/xdiff)   # pairwise slopes, set to 0 where x_i = x_j
sum(wij*betaij)/sum(wij)      # weighted average of pairwise slopes
weighted.mean(as.vector(betaij), as.vector(wij))
coef(lsmodel)                 # least squares estimates agree
plot(x,y)
abline(lsmodel)               # fitted least squares line
lines(x,x^2,lty=2)            # true (nonlinear) mean curve y = x^2, dashed
[Figure: scatterplot of y against x for the second example, with the fitted least squares line (solid) and the true curve y = x² (dashed)]
The “true” slope in this second example is the value we would
get from a very large sample, or equivalently the value we get if
we remove the rnorm error in y:
ytruediff<-outer(x^2,x^2,"-")                      # pairwise differences of the error-free y = x^2
betatrueij<-ifelse(xdiff==0, 0, ytruediff/xdiff)   # error-free pairwise slopes
betatrue<-sum(wij*betatrueij)/sum(wij)             # weighted-average "true" slope
alphatrue<-mean(x^2)-mean(x)*betatrue              # matching intercept through the means
abline(alphatrue,betatrue,col="red")               # add the "true" line to the plot
[Figure: previous scatterplot with the “true” line added in red]
Model assumptions
• If you actually want to predict Y from X then you need an
accurate model
• An average slope may not be a useful summary of the data
if you expect the relationship to be very nonlinear
• The most popular standard error formula assumes that the
relationship is linear and the variance is constant, but there
are alternatives. One obvious alternative is the bootstrap
(sketched below); there is also a reasonably simple analytic approach.
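As a rough sketch of the bootstrap option, resampling (x, y) pairs from the second example above (the number of replicates and the seed are arbitrary choices):

set.seed(578)                              # arbitrary seed, for reproducibility
boot.slopes<-replicate(1000, {
  i<-sample(length(x), replace=TRUE)       # resample observations with replacement
  coef(lm(y[i]~x[i]))[2]                   # slope from the resampled data
})
sd(boot.slopes)                            # bootstrap standard error of the slope
summary(lsmodel)$coefficients[2,2]         # model-based standard error, for comparison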
Example
Anscombe (1973) gave four example data sets that give exactly
the same slope and intercept summaries (α = 3, β = 0.5) and
model-based standard errors.
Whether the slope of the line is a useful summary will depend
on the scientific questions at hand, as well as the data, but it
doesn’t look promising for three of the data sets.
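Anscombe’s data are built into R as the anscombe data frame (columns x1-x4 and y1-y4), so the identical summaries can be checked directly; a minimal sketch:

for (i in 1:4) {
  fit<-lm(as.formula(paste0("y", i, " ~ x", i)), data=anscombe)
  print(coef(fit))                         # each fit gives intercept about 3 and slope about 0.5
}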
Example
[Figure: "Anscombe's 4 Regression data sets", four scatterplots of y1-y4 against x1-x4]
Least squares
The usual way to describe this slope summary is in terms of
squared errors: we choose the values (α̂, β̂) that minimize
$$\sum_{i=1}^{n} (Y_i - \alpha - \beta X_i)^2$$
The solution to this minimization problem is
$$\hat\beta = \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(X)}$$
which turns out to be the same as the weighted average of
pairwise slopes. Least squares generalizes more easily to multiple
predictors, but the interpretation is less clear.
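A quick numerical check of this equivalence, reusing the x, y, wij, and betaij objects from either example above:

cov(x,y)/var(x)                  # sample covariance over sample variance
sum(wij*betaij)/sum(wij)         # weighted average of pairwise slopes
coef(lm(y~x))[2]                 # least squares slope: all three agree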