This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this
material constitutes acceptance of that license and the conditions of use of materials on this site.
Copyright 2006, The Johns Hopkins University and Karl W. Broman. All rights reserved. Use of these materials
permitted only in accordance with license rights granted. Materials provided “AS IS”; no representations or
warranties provided. User assumes all responsibility for use, and all liability related thereto, and must independently
review all materials for accuracy and efficacy. May contain materials owned by others. User is responsible for
obtaining permissions for use from third parties as needed.
Example
Measurements of degradation of heme with different concentrations of hydrogen peroxide (H2O2), for different species of heme.
[Figure: OD versus H2O2 concentration (0, 10, 25, 50). Panels: "pf3d7 and pyoelii" and "pf3d7".]
[Figure: "Degradation". OD (0.0 to 1.0) versus H2O2 concentration (0 to 50).]
[Figure: "Degradation [%]". OD[%] (0 to 100) versus H2O2 concentration (0, 10, 25, 50) for pfhz, pbr, pov, pknow, pviv, pf3d7, pyoelii and pgal.]
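[Figure: four straight lines in the (X, Y) plane: Y = 20 + 15X, Y = 40 + 8X, Y = 70 + 0X and Y = 0 + 5X.]
[Figure: a straight line in the (X, Y) plane, with the intercept β0 and the slope β1 (the change in Y per unit change in X) marked.]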
The regression model
Let X be the predictor and Y be the response. Assume
we have n observations $(x_1, y_1), \dots, (x_n, y_n)$ from X and
Y. The simple linear regression model is
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad \epsilon_i \sim \mathrm{iid}\ N(0, \sigma^2).$$
How do we estimate $\beta_0$, $\beta_1$, $\sigma^2$?
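Fitted values and residuals
We can write
$$\epsilon_i = y_i - \beta_0 - \beta_1 x_i$$
For a pair of estimates $(\hat\beta_0, \hat\beta_1)$ for $(\beta_0, \beta_1)$ we define the fitted values as
$$\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$$
The residuals are
$$\hat\epsilon_i = y_i - \hat{y}_i = y_i - \hat\beta_0 - \hat\beta_1 x_i$$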
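The definitions above can be sketched in a few lines of code. The slides point to R's lm() for real analyses; the plain-Python sketch below only makes the arithmetic explicit. It assumes the pf3d7 measurements from the worked example later in these slides, and the function name `fitted_and_residuals` is a hypothetical helper, not part of any library.

```python
# H2O2 concentrations and OD readings for pf3d7, copied from the example
# table later in these slides (three replicates at each concentration).
x = [0, 10, 25, 50] * 3
y = [0.3399, 0.3168, 0.2460, 0.1535,
     0.3563, 0.3054, 0.2618, 0.1613,
     0.3538, 0.3174, 0.2848, 0.1525]


def fitted_and_residuals(x, y, b0, b1):
    """Return (fitted values, residuals) for the candidate line y = b0 + b1 * x."""
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, residuals


# Candidate estimates: the rounded values reported in the worked example.
fitted, residuals = fitted_and_residuals(x, y, b0=0.353, b1=-0.0039)

for xi, yi, fi, ri in zip(x, y, fitted, residuals):
    print(f"x={xi:2d}  y={yi:.4f}  fitted={fi:.4f}  residual={ri:+.4f}")
```

Any other pair (b0, b1) can be passed in the same way; each choice gives its own set of fitted values and residuals.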
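[Figure: "Residuals". Scatterplot of Y against X with the fitted line Ŷ; each residual ε̂ is the vertical distance between an observed Y and the line.]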
Residual sum of squares
Every pair of values for $\beta_0$ and $\beta_1$ gives a value of
the residual sum of squares:
$$RSS(\beta_0, \beta_1) = \sum_i (y_i - \beta_0 - \beta_1 x_i)^2$$
We can look at RSS as a function of $\beta_0$ and $\beta_1$. We try to minimize
this function, i.e. we try to find
$$(\hat\beta_0, \hat\beta_1) = \arg\min_{\beta_0, \beta_1} RSS(\beta_0, \beta_1)$$
Unsurprisingly, this method is called least squares estimation.
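To see RSS as a function of the two parameters, the sketch below evaluates it at a couple of candidate pairs and then runs a crude grid search. This is only an illustration under stated assumptions: the data are the pf3d7 measurements from the worked example later in these slides, and the grid ranges and step sizes are arbitrary choices. Least squares itself is solved in closed form, not by a grid.

```python
# pf3d7 data from the worked example later in these slides.
x = [0, 10, 25, 50] * 3
y = [0.3399, 0.3168, 0.2460, 0.1535,
     0.3563, 0.3054, 0.2618, 0.1613,
     0.3538, 0.3174, 0.2848, 0.1525]


def rss(b0, b1):
    """Residual sum of squares for the candidate line y = b0 + b1 * x."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))


# Two candidate lines: a flat line at 0.30 and the rounded fit from the slides.
print("flat line :", rss(0.30, 0.0))
print("slide fit :", rss(0.353, -0.0039))

# Crude grid search (grid limits and steps are arbitrary, for illustration).
grid_b0 = [0.300 + 0.001 * i for i in range(101)]     # 0.300 to 0.400
grid_b1 = [-0.0060 + 0.0001 * j for j in range(61)]   # -0.0060 to 0.0000
best_rss, best_b0, best_b1 = min(
    (rss(b0, b1), b0, b1) for b0 in grid_b0 for b1 in grid_b1
)
print("grid minimum:", best_rss, "at b0 =", best_b0, "b1 =", best_b1)
```

The grid minimum should land close to the closed-form estimates, up to the resolution of the grid.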
[Figure: "Residual sum of squares". RSS shown as a surface over (b0, b1), and as a contour plot over (β0, β1).]
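Notation
Assume we have n observations: $(x_1, y_1), \dots, (x_n, y_n)$.
$$\bar{x} = \frac{\sum_i x_i}{n}, \qquad \bar{y} = \frac{\sum_i y_i}{n}$$
$$\mathrm{SXX} = \sum_i (x_i - \bar{x})^2 = \sum_i x_i^2 - n\bar{x}^2$$
$$\mathrm{SYY} = \sum_i (y_i - \bar{y})^2 = \sum_i y_i^2 - n\bar{y}^2$$
$$\mathrm{SXY} = \sum_i (x_i - \bar{x})(y_i - \bar{y}) = \sum_i x_i y_i - n\bar{x}\bar{y}$$
$$\mathrm{RSS} = \sum_i (y_i - \hat{y}_i)^2 = \sum_i \hat\epsilon_i^2$$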
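As a quick check of the notation, the sketch below computes SXX, SYY and SXY both from the centered definitions and from the shortcut formulas. It assumes the pf3d7 data from the worked example later in these slides; the variable names are illustrative.

```python
# pf3d7 data from the worked example later in these slides.
x = [0, 10, 25, 50] * 3
y = [0.3399, 0.3168, 0.2460, 0.1535,
     0.3563, 0.3054, 0.2618, 0.1613,
     0.3538, 0.3174, 0.2848, 0.1525]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Centered definitions.
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Shortcut formulas (algebraically identical to the definitions).
sxx_short = sum(xi ** 2 for xi in x) - n * x_bar ** 2
syy_short = sum(yi ** 2 for yi in y) - n * y_bar ** 2
sxy_short = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar

print("x_bar =", x_bar, " y_bar =", round(y_bar, 4))
print("SXX =", sxx, sxx_short)
print("SYY =", syy, syy_short)
print("SXY =", sxy, sxy_short)
```

Both forms agree up to rounding here. In floating point the centered definitions are usually the safer choice, because the shortcut formulas subtract two large, nearly equal numbers when the mean is large relative to the spread.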
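Parameter estimates
The function
$$RSS(\beta_0, \beta_1) = \sum_i (y_i - \beta_0 - \beta_1 x_i)^2$$
is minimized by
$$\hat\beta_1 = \frac{\mathrm{SXY}}{\mathrm{SXX}}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$$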
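The slides state the minimizers without derivation. One standard way to obtain them, sketched here as a supplement, is to set both partial derivatives of RSS to zero:

```latex
\[
\begin{aligned}
\frac{\partial RSS}{\partial \beta_0}
  &= -2 \sum_i (y_i - \beta_0 - \beta_1 x_i) = 0
  &&\Longrightarrow\quad \beta_0 = \bar{y} - \beta_1 \bar{x}, \\
\frac{\partial RSS}{\partial \beta_1}
  &= -2 \sum_i x_i (y_i - \beta_0 - \beta_1 x_i) = 0 .
\end{aligned}
\]
% Substitute beta_0 = \bar{y} - beta_1 \bar{x} into the second equation:
\[
\sum_i x_i \bigl[(y_i - \bar{y}) - \beta_1 (x_i - \bar{x})\bigr] = 0 .
\]
% Because \sum_i (y_i - \bar{y}) = 0 and \sum_i (x_i - \bar{x}) = 0,
% \sum_i x_i (y_i - \bar{y}) = SXY and \sum_i x_i (x_i - \bar{x}) = SXX, so
\[
\mathrm{SXY} - \beta_1\,\mathrm{SXX} = 0
\quad\Longrightarrow\quad
\hat\beta_1 = \frac{\mathrm{SXY}}{\mathrm{SXX}},
\qquad
\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} .
\]
```

Since RSS is a sum of squares of functions that are linear in the parameters, it is convex, so this stationary point is a minimum (and the unique one whenever SXX > 0).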
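Useful to know
Using the parameter estimates, our best prediction of y at a given x is
$$\hat{y} = \hat\beta_0 + \hat\beta_1 x$$
Hence, at $x = \bar{x}$,
$$\hat\beta_0 + \hat\beta_1 \bar{x} = \bar{y} - \hat\beta_1 \bar{x} + \hat\beta_1 \bar{x} = \bar{y}$$
That means the least squares regression line always goes through the point $(\bar{x}, \bar{y})$.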
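Variance estimates
As variance estimate we use
$$\hat\sigma^2 = \frac{\mathrm{RSS}}{n - 2}$$
This quantity is called the residual mean square. It has the property
$$(n - 2) \times \frac{\hat\sigma^2}{\sigma^2} \sim \chi^2_{n-2}$$
In particular, since a $\chi^2_{n-2}$ variable has mean n − 2, this implies
$$E(\hat\sigma^2) = \sigma^2$$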
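The unbiasedness claim can be checked by simulation. The sketch below is an illustration only: the "true" parameter values, the seed and the number of replications are arbitrary assumptions, and `sigma2_hat` is a hypothetical helper written for this check.

```python
import random


def sigma2_hat(x, y):
    """Residual mean square RSS / (n - 2) from a least squares fit of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return rss / (n - 2)


random.seed(1)                      # fixed seed so the run is reproducible
x = [0, 10, 25, 50] * 3             # same design as the example in the slides
beta0, beta1, sigma = 0.35, -0.004, 0.01   # assumed "true" values

estimates = []
for _ in range(5000):
    # Simulate y_i = beta0 + beta1 * x_i + eps_i with eps_i ~ N(0, sigma^2).
    y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]
    estimates.append(sigma2_hat(x, y))

mean_estimate = sum(estimates) / len(estimates)
print("average of sigma2_hat:", mean_estimate)
print("true sigma^2         :", sigma ** 2)
```

The average of the simulated estimates should be close to the true variance. Dividing RSS by n instead of n − 2 would give an average that is too small by the factor (n − 2)/n.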
Example
H2O2 concentration      0        10       25       50
OD                   0.3399   0.3168   0.2460   0.1535
                     0.3563   0.3054   0.2618   0.1613
                     0.3538   0.3174   0.2848   0.1525
We get
$$\bar{x} = 21.25, \quad \bar{y} = 0.27, \quad \mathrm{SXX} = 4256.25, \quad \mathrm{SXY} = -16.48, \quad \mathrm{RSS} = 0.0013.$$
Therefore
$$\hat\beta_1 = \frac{-16.48}{4256.25} = -0.0039,$$
$$\hat\beta_0 = 0.27 - (-0.0039) \times 21.25 = 0.353,$$
$$\hat\sigma = \sqrt{\frac{0.0013}{12 - 2}} = 0.0115.$$
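The hand calculation above can be reproduced from the raw table. This is a minimal plain-Python sketch with illustrative variable names; it works with unrounded intermediate values, so the last digits can differ slightly from a calculation that uses the rounded figures shown in the slide.

```python
import math

# Data from the table above: three OD readings at each H2O2 concentration.
x = [0, 10, 25, 50] * 3
y = [0.3399, 0.3168, 0.2460, 0.1535,
     0.3563, 0.3054, 0.2618, 0.1613,
     0.3538, 0.3174, 0.2848, 0.1525]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = sxy / sxx                 # slope estimate
b0 = y_bar - b1 * x_bar        # intercept estimate
rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
sigma_hat = math.sqrt(rss / (n - 2))

print(f"x_bar = {x_bar:.2f}, y_bar = {y_bar:.4f}")
print(f"SXX = {sxx:.2f}, SXY = {sxy:.2f}, RSS = {rss:.5f}")
print(f"b1 = {b1:.4f}, b0 = {b0:.3f}, sigma_hat = {sigma_hat:.4f}")
```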
[Figure: pf3d7 data. OD versus H2O2 concentration (0, 10, 25, 50), with the fitted line Y = 0.353 − 0.0039X.]
The R function lm() does all these calculations for you. And more!
Comparing models
We want to test whether $\beta_1 = 0$:
$$H_0: y_i = \beta_0 + \epsilon_i$$
versus
$$H_a: y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
[Figure: y against x, showing the fit under Ha (sloped line) and the fit under H0 (horizontal line).]
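Sum of squares
Under $H_a$:
$$\mathrm{RSS} = \sum_i (y_i - \hat{y}_i)^2 = \mathrm{SYY} - \frac{(\mathrm{SXY})^2}{\mathrm{SXX}} = \mathrm{SYY} - \hat\beta_1^2 \times \mathrm{SXX}$$
Under $H_0$:
$$\sum_i (y_i - \hat\beta_0)^2 = \sum_i (y_i - \bar{y})^2 = \mathrm{SYY}$$
Hence
$$\mathrm{SS}_{\mathrm{reg}} = \mathrm{SYY} - \mathrm{RSS} = \frac{(\mathrm{SXY})^2}{\mathrm{SXX}}$$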
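The identity for RSS under Ha is stated without proof. A short derivation, added here as a supplement, follows from writing the fitted values in centered form using the least squares estimates:

```latex
% Centered form of the fitted values, using \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}:
\[
\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i = \bar{y} + \hat\beta_1 (x_i - \bar{x}) .
\]
% Expand the square and substitute \hat\beta_1 = SXY / SXX:
\[
\begin{aligned}
\mathrm{RSS}
  &= \sum_i \bigl[(y_i - \bar{y}) - \hat\beta_1 (x_i - \bar{x})\bigr]^2 \\
  &= \mathrm{SYY} - 2\hat\beta_1\,\mathrm{SXY} + \hat\beta_1^2\,\mathrm{SXX} \\
  &= \mathrm{SYY} - 2\,\frac{(\mathrm{SXY})^2}{\mathrm{SXX}}
     + \frac{(\mathrm{SXY})^2}{\mathrm{SXX}} \\
  &= \mathrm{SYY} - \frac{(\mathrm{SXY})^2}{\mathrm{SXX}}
   = \mathrm{SYY} - \hat\beta_1^2\,\mathrm{SXX} .
\end{aligned}
\]
```

Under H0 the only parameter is the intercept, whose least squares estimate is the sample mean, which is why the residual sum of squares there equals SYY.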
ANOVA
Source                     df      SS      MS                     F
regression on X            1       SSreg   MSreg = SSreg / 1      MSreg / MSE
residuals for full model   n − 2   RSS     MSE = RSS / (n − 2)
total                      n − 1   SYY
David Sullivan’s pf3d7 data
Source                     df    SS        MS        F
regression on X            1     0.06378   0.06378   484.1
residuals for full model   10    0.00131   0.00013
total                      11    0.06509
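The entries of this table can be recomputed from the raw pf3d7 measurements. The sketch below is plain Python with illustrative variable names; it stops at the F statistic and does not compute a p-value, which would need the F distribution with 1 and n − 2 degrees of freedom from a statistics library.

```python
# pf3d7 data from the example table in these slides.
x = [0, 10, 25, 50] * 3
y = [0.3399, 0.3168, 0.2460, 0.1535,
     0.3563, 0.3054, 0.2618, 0.1613,
     0.3538, 0.3174, 0.2848, 0.1525]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

ss_reg = sxy ** 2 / sxx        # regression sum of squares, 1 df
rss = syy - ss_reg             # residual sum of squares, n - 2 df
ms_reg = ss_reg / 1
mse = rss / (n - 2)
f_stat = ms_reg / mse

# Print the table in the same layout as the slide.
print(f"{'Source':<26}{'df':>4}{'SS':>10}{'MS':>10}{'F':>8}")
print(f"{'regression on X':<26}{1:>4}{ss_reg:>10.5f}{ms_reg:>10.5f}{f_stat:>8.1f}")
print(f"{'residuals for full model':<26}{n - 2:>4}{rss:>10.5f}{mse:>10.5f}")
print(f"{'total':<26}{n - 1:>4}{syy:>10.5f}")
```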
[Figure: pf3d7 data. OD versus H2O2 concentration (0, 10, 25, 50), with the fitted line Y = 0.353 − 0.0039X and the horizontal line Y = 0.271.]
Remember: The R function lm() does the calculations for you!