Chapter 2: Simple Linear Regression
• 2.1 The Model
• 2.2-2.3 Parameter Estimation
• 2.4 Properties of Estimators
• 2.5 Inference
• 2.6 Prediction
• 2.7 Analysis of Variance
• 2.8 Regression Through the Origin
• 2.9 Related Models
2.1 The Model
• The measurement of y (the response) changes in a linear fashion with the setting of the variable x (the predictor):
  y = β0 + β1x + ε
  where β0 + β1x is the linear relation and ε is the noise.
◦ The linear relation is deterministic (non-random).
◦ The noise or error is random.
• Noise accounts for the variability of the observations about the
straight line.
No noise ⇒ relation is deterministic.
Increased noise ⇒ increased variability.
• Experiment with this simulation program:
simple.sim <- function(intercept = 0, slope = 1, x = seq(1, 10), sigma = 1){
    noise <- rnorm(length(x), sd = sigma)        # random errors
    y <- intercept + slope*x + noise             # simulated responses
    title1 <- paste("sigma = ", sigma)
    plot(x, y, pch = 16, main = title1)
    abline(intercept, slope, col = 4, lwd = 2)   # overlay the true line
}
• > source("simple.sim.R")
  > simple.sim(sigma=.01)
  > simple.sim(sigma=.1)
  > simple.sim(sigma=1)
  > simple.sim(sigma=10)
Simulation Examples
[Figure: four simulated scatterplots of y against x (sigma = 0.01, 0.1, 1, 10), each with the true line overlaid; the scatter about the line increases with sigma.]
The Setup
• Assumptions:
  1. E[y|x] = β0 + β1x.
  2. Var(y|x) = Var(β0 + β1x + ε|x) = σ².
• Data: Suppose data y1, y2, . . . , yn are obtained at settings x1, x2, . . . , xn, respectively. Then the model on the data is
  yi = β0 + β1xi + εi
  (εi i.i.d. N(0, σ²) and E[yi|xi] = β0 + β1xi.)
  Either
  1. the x's are fixed values and measured without error (controlled experiment), OR
  2. the analysis is conditional on the observed values of x (observational study).
2.2-2.3 Parameter Estimation, Fitted Values and Residuals
1. Maximum Likelihood Estimation (distributional assumptions are required)
2. Least Squares Estimation (distributional assumptions are not required)
2.2.1 Maximum Likelihood Estimation
◦ Normal assumption is required:
  f(yi|xi) = (1/(√(2π) σ)) exp{−(yi − β0 − β1xi)²/(2σ²)}
◦ Likelihood:
  L(β0, β1, σ) = ∏ᵢ f(yi|xi) ∝ (1/σⁿ) exp{−(1/(2σ²)) Σᵢ₌₁ⁿ (yi − β0 − β1xi)²}
◦ Maximize with respect to β0, β1, and σ².
◦ (β0, β1): equivalent to minimizing
  Σᵢ₌₁ⁿ (yi − β0 − β1xi)²
  (i.e. least squares). σ²: SSE/n, which is biased.
2.2.2 Least Squares Estimation
◦ Assumptions:
  1. E[εi] = 0
  2. Var(εi) = σ²
  3. the εi's are independent.
• Note that normality is not required.
Method
• Minimize
  S(β0, β1) = Σᵢ₌₁ⁿ (yi − β0 − β1xi)²
  with respect to the parameters (regression coefficients) β0 and β1; the minimizers are denoted β̂0 and β̂1.
◦ Justification: we want the fitted line to pass as close to all of the points as possible.
◦ Aim: small residuals (observed minus fitted response values):
  ei = yi − β̂0 − β̂1xi
Look at the following plots:
> source("roller2.plot")
> roller2.plot(a=14, b=0)
> roller2.plot(a=2, b=2)
> roller2.plot(a=12, b=1)
> roller2.plot(a=-2, b=2.67)
[Figure: four plots of depression in lawn (mm) against roller weight (t), one for each candidate line a=14, b=0; a=2, b=2; a=12, b=1; a=−2, b=2.67, showing the data values, the fitted values, and the positive and negative residuals.]
◦ The first three lines above do not pass as close to the plotted
points as the fourth, even though the sum of the residuals is
about the same in all four cases.
◦ Negative residuals cancel out positive residuals.
The Key: minimize the squared residuals.
> source("roller3.plot.R")
> roller3.plot(14,0); roller3.plot(2,2); roller3.plot(12,1)
> roller3.plot(a=-2, b=2.67)   # small SS
[Figure: the same four candidate lines for the roller data, plotted so that the sums of squared residuals can be compared; the line a=−2, b=2.67 gives the smallest sum of squares.]
Unbiased estimate of σ²
  σ̂² = (1/(n − # parameters estimated)) Σᵢ₌₁ⁿ eᵢ² = (1/(n − 2)) Σᵢ₌₁ⁿ eᵢ²
* n observations ⇒ n degrees of freedom
* 2 degrees of freedom are required to estimate the parameters
* the residuals retain n − 2 degrees of freedom
Alternative Viewpoint
* y = (y1, y2, . . . , yn) is a vector in n-dimensional space. (n degrees of freedom)
* The fitted values ŷi = β̂0 + β̂1xi also form a vector in n-dimensional space:
  ŷ = (ŷ1, ŷ2, . . . , ŷn)
  (2 degrees of freedom)
* Least squares seeks to minimize the distance between y and ŷ.
* The distance between n-dimensional vectors u and v is the square root of
  Σᵢ₌₁ⁿ (ui − vi)²
* Thus, the squared distance between y and ŷ is
  Σᵢ₌₁ⁿ (yi − ŷi)² = Σᵢ₌₁ⁿ eᵢ²
Regression Coefficient Estimators
• The minimizers of
  S(β0, β1) = Σᵢ₌₁ⁿ (yi − β0 − β1xi)²
  are
  β̂0 = ȳ − β̂1x̄   and   β̂1 = Sxy/Sxx
  where
  Sxy = Σᵢ₌₁ⁿ (xi − x̄)(yi − ȳ)   and   Sxx = Σᵢ₌₁ⁿ (xi − x̄)²
HomeMade R Estimators
> ls.est <- function(data)
{
    x <- data[,1]
    y <- data[,2]
    xbar <- mean(x)
    ybar <- mean(y)
    Sxy <- S(x, y, xbar, ybar)
    Sxx <- S(x, x, xbar, xbar)
    b <- Sxy/Sxx               # slope estimate
    a <- ybar - xbar*b         # intercept estimate
    list(a = a, b = b, data = data)
}
> S <- function(x, y, xbar, ybar)
{
    sum((x - xbar)*(y - ybar))   # sum of cross-products about the means
}
Calculator Formulas and Other Properties
• Sxy = Σᵢ₌₁ⁿ xiyi − (Σᵢ₌₁ⁿ xi)(Σᵢ₌₁ⁿ yi)/n
  Sxx = Σᵢ₌₁ⁿ xi² − (Σᵢ₌₁ⁿ xi)²/n
• Linearity property:
  Sxy = Σᵢ₌₁ⁿ yi(xi − x̄)
  so β̂1 is linear in the yi's.
• Gauss-Markov Theorem: among linear unbiased estimators of β1 and β0, β̂1 and β̂0 are best (i.e. they have the smallest variance).
• Exercise: find the expected value and variance of (y1 − ȳ)/(x1 − x̄). Is it unbiased for β1? Is there a linear unbiased estimator with smaller variance?
Residuals
ε̂i = ei = yi − β̂0 − β̂1xi
       = yi − ŷi
       = yi − ȳ − β̂1(xi − x̄)
• In R:
> res <- function(ls.object)
{
    a <- ls.object$a
    b <- ls.object$b
    x <- ls.object$data[,1]
    y <- ls.object$data[,2]
    resids <- y - a - b*x      # observed minus fitted
    resids
}
2.3.1 Consequences of Least-Squares
1. ei = yi − ȳ − β̂1(xi − x̄)   (follows from the intercept formula)
2. Σᵢ₌₁ⁿ ei = 0   (follows from 1.)
3. Σᵢ₌₁ⁿ ŷi = Σᵢ₌₁ⁿ yi   (follows from 2.)
4. The regression line passes through the centroid (x̄, ȳ).   (follows from the intercept formula)
5. Σ xiei = 0   (set the partial derivative of S(β0, β1) with respect to β1 equal to 0)
6. Σ ŷiei = 0   (follows from 2. and 5.)
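These identities are easy to verify numerically. A minimal sketch, assuming the roller data frame (columns weight and depression) tabulated in the Example below:

> fit <- lm(depression ~ weight, data = roller)
> e <- resid(fit)
> sum(e)                                      # consequence 2: essentially zero
> sum(fitted(fit)) - sum(roller$depression)   # consequence 3: essentially zero
> sum(roller$weight*e)                        # consequence 5: essentially zero
> sum(fitted(fit)*e)                          # consequence 6: essentially zero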
2.3.2 Estimation of σ²
• The residual sum of squares or error sum of squares is given by
  SSE = Σᵢ₌₁ⁿ eᵢ² = Syy − β̂1Sxy
  Note: SSE = SSRes and Syy = SST.
• An unbiased estimator of the error variance is
  σ̂² = MSE = SSE/(n − 2) = [Syy − β̂1Sxy]/(n − 2)
  Note: MSE = MSRes.
• σ̂ = Residual Standard Error = √MSE
Example
• roller data
   weight depression
1     1.9          2
2     3.1          1
3     3.3          5
4     4.8          5
5     5.3         20
6     6.1         20
7     6.4         23
8     7.6         10
9     9.8         30
10   12.4         25
Hand Calculation
• (y = depression, x = weight)
◦ Σᵢ₌₁¹⁰ xi = 1.9 + 3.1 + · · · + 12.4 = 60.7
◦ x̄ = 60.7/10 = 6.07
◦ Σᵢ₌₁¹⁰ yi = 2 + 1 + · · · + 25 = 141
◦ ȳ = 141/10 = 14.1
◦ Σᵢ₌₁¹⁰ xi² = 1.9² + 3.1² + · · · + 12.4² = 461
◦ Σᵢ₌₁¹⁰ yi² = 4 + 1 + 25 + · · · + 625 = 3009
◦ Σᵢ₌₁¹⁰ xiyi = (1.9)(2) + · · · + (12.4)(25) = 1103
◦ Sxx = 461 − 60.7²/10 = 92.6
◦ Sxy = 1103 − (60.7)(141)/10 = 247
◦ Syy = 3009 − 141²/10 = 1021
◦ β̂1 = Sxy/Sxx = 247/92.6 = 2.67
◦ β̂0 = ȳ − β̂1x̄ = 14.1 − 2.67(6.07) = −2.11
◦ σ̂² = (1/(n − 2))(Syy − β̂1Sxy) = (1/8)(1021 − 2.67(247)) = 45.2 = MSE
• Example summary: the fitted regression line relating depression (y) to weight (x) is
  ŷ = −2.11 + 2.67x
  The error variance is estimated as MSE = 45.2.
R commands (home-made version)
> roller.obj <- ls.est(roller)
> roller.obj[-3]               # intercept and slope estimates
$a
[1] -2.09

$b
[1] 2.67

> res(roller.obj)              # residuals
 [1] -0.98 -5.18 -1.71 -5.71  7.95
 [6]  5.82  8.02 -8.18  5.95 -5.98
> sum(res(roller.obj)^2)/8     # error variance (MSE)
[1] 45.4
R commands (built-in version)
> attach(roller)
> roller.lm <- lm(depression ~ weight)
> summary(roller.lm)

Call:
lm(formula = depression ~ weight)

Residuals:
   Min     1Q Median     3Q    Max
-8.180 -5.580 -1.346  5.920  8.020

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -2.0871     4.7543  -0.439  0.67227
weight        2.6667     0.7002   3.808  0.00518

Residual standard error: 6.735 on 8 degrees of freedom
Multiple R-squared: 0.6445,  Adjusted R-squared: 0.6001
F-statistic: 14.5 on 1 and 8 DF,  p-value: 0.005175

> detach(roller)

• or
> roller.lm <- lm(depression ~ weight, data = roller)
> summary(roller.lm)
Using Extractor Functions
• Partial output:
> coef(roller.lm)
(Intercept)      weight
      -2.09        2.67
> summary(roller.lm)$sigma
[1] 6.74
• From the output,
◦ the slope estimate is β̂1 = 2.67.
◦ the intercept estimate is β̂0 = −2.09.
◦ the Residual standard error is the square root of the MSE: 6.74² = 45.4.
Other R commands
◦ fitted values: predict(roller.lm)
◦ residuals: resid(roller.lm)
◦ diagnostic plots: plot(roller.lm)
(these include a plot of the residuals against the fitted values
and a normal probability plot of the residuals)
◦ Also plot(roller); abline(roller.lm)
(this gives a plot of the data with the fitted line overlaid)
2.4: Properties of Least Squares Estimates
◦ E[β̂1] = β1
◦ E[β̂0] = β0
◦ Var(β̂1) = σ²/Sxx
◦ Var(β̂0) = σ²(1/n + x̄²/Sxx)
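A small simulation can illustrate the unbiasedness and the variance formula. A minimal sketch, reusing the roller weights as the fixed x settings and assuming illustrative true values β0 = −2, β1 = 2.67, σ = 6.7 (roughly the fitted values from above):

> x <- roller$weight                       # fixed x settings
> nsim <- 10000
> b0 <- b1 <- numeric(nsim)
> for (i in 1:nsim) {
      y <- -2 + 2.67*x + rnorm(length(x), sd = 6.7)   # simulate responses about the assumed true line
      b <- coef(lm(y ~ x))
      b0[i] <- b[1]; b1[i] <- b[2]
  }
> c(mean(b0), mean(b1))                    # close to -2 and 2.67 (unbiasedness)
> c(var(b1), 6.7^2/sum((x - mean(x))^2))   # sampling variance of the slope vs sigma^2/Sxx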
Standard Error Estimators
◦ V̂ar(β̂1) = MSE/Sxx, so the standard error (s.e.) of β̂1 is estimated by
  √(MSE/Sxx)
  roller e.g.: MSE = 45.2, Sxx = 92.6
  ⇒ the s.e. of β̂1 is √(45.2/92.6) = .699
◦ V̂ar(β̂0) = MSE(1/n + x̄²/Sxx), so the standard error (s.e.) of β̂0 is estimated by
  √(MSE(1/n + x̄²/Sxx))
  roller e.g.: MSE = 45.2, Sxx = 92.6, x̄ = 6.07, n = 10
  ⇒ the s.e. of β̂0 is √(45.2(1/10 + 6.07²/92.6)) = 4.74
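These standard errors can be computed directly from the formulas and compared with the built-in summary; a short sketch, assuming roller.lm from above:

> MSE <- 45.2; Sxx <- 92.6; xbar <- 6.07; n <- 10
> sqrt(MSE/Sxx)                          # s.e. of the slope, about .699
> sqrt(MSE*(1/n + xbar^2/Sxx))           # s.e. of the intercept, about 4.74
> summary(roller.lm)$coefficients[, "Std. Error"]   # built-in values for comparison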
Distributions of βb1 and βb0
◦ yi is N(β0 + β1xi, σ²), and
  β̂1 = Σᵢ₌₁ⁿ ai yi   with ai = (xi − x̄)/Sxx
  ⇒ β̂1 is N(β1, σ²/Sxx).
  Also, SSE/σ² is χ²ₙ₋₂ (independent of β̂1), so
  (β̂1 − β1)/√(MSE/Sxx) ∼ tₙ₋₂
◦ β̂0 = Σᵢ₌₁ⁿ ci yi   with ci = 1/n − x̄(xi − x̄)/Sxx
  ⇒ β̂0 is N(β0, σ²(1/n + x̄²/Sxx)) and
  (β̂0 − β0)/√(MSE(1/n + x̄²/Sxx)) ∼ tₙ₋₂
2.5 Inferences about the Regression Parameters
• Tests
• Confidence intervals
for β1 and β0 + β1x0
2.5.1 Inference for β1
H0: β1 = β10 vs. H1: β1 ≠ β10
Under H0,
  t0 = (β̂1 − β10)/√(MSE/Sxx)
has a t-distribution on n − 2 degrees of freedom.
  p-value = P(|tₙ₋₂| > |t0|)
e.g. testing the significance of regression for the roller data:
  H0: β1 = 0 vs. H1: β1 ≠ 0
  t0 = (2.67 − 0)/.699 = 3.82
  p-value = P(|t₈| > 3.82) = 2(1 − P(t₈ < 3.82)) = .00509
R command: 2*(1-pt(3.82,8))
(1 − α) Confidence Intervals
◦ slope:
  β̂1 ± t_{n−2,α/2} · s.e.   or   β̂1 ± t_{n−2,α/2} √(MSE/Sxx)
  roller e.g. (95% confidence interval):
  2.67 ± t_{8,.025}(.699)   or   2.67 ± 2.31(.699) = 2.67 ± 1.61
  R command: qt(.975, 8)
◦ intercept:
  β̂0 ± t_{n−2,α/2} · s.e.
  e.g. −2.11 ± 2.31(4.74) = −2.11 ± 10.9
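These intervals are also available directly from the fitted model. A quick check, assuming roller.lm from above:

> confint(roller.lm, level = .95)   # rows for (Intercept) and weight

The weight row reproduces 2.67 ± 1.61 and the (Intercept) row reproduces roughly −2.09 ± 10.96 (the hand calculation above rounds the estimates).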
2.5.2 Confidence Interval for Mean Response
◦ E[y|x0] = β0 + β1x0, where x0 is a possible value that the predictor could take.
◦ Ê[y|x0] = β̂0 + β̂1x0 is a point estimate of the mean response at that value.
◦ To find a 1 − α confidence interval for E[y|x0], we need the variance of ŷ0 = Ê[y|x0]:
  Var(ŷ0) = Var(β̂0 + β̂1x0)
  Writing β̂0 = Σᵢ₌₁ⁿ ciyi and β̂1x0 = Σᵢ₌₁ⁿ bix0yi (with ci as before and bi = (xi − x̄)/Sxx), we have
  ŷ0 = Σᵢ₌₁ⁿ (ci + bix0)yi
  so
  Var(ŷ0) = Σᵢ₌₁ⁿ (ci + bix0)² Var(yi)
          = σ² Σᵢ₌₁ⁿ (ci + bix0)²
          = σ² Σᵢ₌₁ⁿ (1/n + (x0 − x̄)(xi − x̄)/Sxx)²
          = σ² (1/n + (x0 − x̄)²/Sxx)
◦ V̂ar(ŷ0) = MSE(1/n + (x0 − x̄)²/Sxx)
◦ The confidence interval is then
  ŷ0 ± t_{n−2,α/2} √(MSE(1/n + (x0 − x̄)²/Sxx))
◦ e.g. Compute a 95% confidence interval for the expected depression when the weight is 5 tonnes.
  x0 = 5, ŷ0 = −2.11 + 2.67(5) = 11.2
  n = 10, x̄ = 6.07, Sxx = 92.6, MSE = 45.2
  Interval:
  11.2 ± t_{8,.025} √(45.2(1/10 + (5 − 6.07)²/92.6))
  = 11.2 ± 2.31√5.08 = 11.2 ± 5.21 = (5.99, 16.41)
◦ R code:
> predict(roller.lm, newdata = data.frame(weight = 5),
          interval = "confidence")
      fit  lwr  upr
[1,] 11.2 6.04 16.5
◦ Ex. Write your own R function to compute this interval.
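One possible solution sketch (the function name ci.mean and its arguments are ours, not from the notes):

> ci.mean <- function(x0, x, y, level = .95)
{
    n <- length(x)
    fit <- lm(y ~ x)
    MSE <- sum(resid(fit)^2)/(n - 2)
    Sxx <- sum((x - mean(x))^2)
    y0 <- sum(coef(fit)*c(1, x0))                  # estimated mean response at x0
    se <- sqrt(MSE*(1/n + (x0 - mean(x))^2/Sxx))   # its standard error
    y0 + c(-1, 1)*qt(1 - (1 - level)/2, n - 2)*se
}
> ci.mean(5, roller$weight, roller$depression)     # approximately (6.04, 16.5)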
2.6: Predicted Responses
◦ If a new observation were to be taken at x0, we would predict it to be y0 = β0 + β1x0 + ε, where ε is independent noise.
◦ ŷ0 = β̂0 + β̂1x0 is a point prediction of the response at that value.
◦ To find a 1 − α prediction interval for y0, we need the variance of y0 − ŷ0:
  Var(y0 − ŷ0) = Var(β0 + β1x0 + ε − β̂0 − β̂1x0)
               = Var(−β̂0 − β̂1x0) + Var(ε)
               = σ²(1/n + (x0 − x̄)²/Sxx) + σ²
               = σ²(1 + 1/n + (x0 − x̄)²/Sxx)
◦ V̂ar(y0 − ŷ0) = MSE(1 + 1/n + (x0 − x̄)²/Sxx)
◦ The prediction interval is then
  ŷ0 ± t_{n−2,α/2} √(MSE(1 + 1/n + (x0 − x̄)²/Sxx))
◦ e.g. Compute a 95% prediction interval for the depression for a single new observation where the weight is 5 tonnes.
  x0 = 5, ŷ0 = −2.11 + 2.67(5) = 11.2
  n = 10, x̄ = 6.07, Sxx = 92.6, MSE = 45.2
  Interval:
  11.2 ± t_{8,.025} √(45.2(1 + 1/10 + (5 − 6.07)²/92.6))
  = 11.2 ± 2.31√50.3 = 11.2 ± 16.4 = (−5.2, 27.6)

R code for Prediction Intervals
> predict(roller.lm, newdata = data.frame(weight = 5),
          interval = "prediction")
      fit   lwr  upr
[1,] 11.2 -5.13 27.6
◦ Write your own R function to produce this interval.
Degrees of Freedom
• A random sample Y1, . . . , Yn of size n from a normal population with mean µ and variance σ² has n degrees of freedom.
• Each linearly independent restriction reduces the number of degrees of freedom by 1.
• Σᵢ₌₁ⁿ (Yi − µ)²/σ² has a χ²ₙ distribution.
• Σᵢ₌₁ⁿ (Yi − Ȳ)²/σ² has a χ²ₙ₋₁ distribution. (Calculating Ȳ imposes one linear restriction.)
• Σᵢ₌₁ⁿ (Yi − Ŷi)²/σ² has a χ²ₙ₋₂ distribution. (Calculating β̂0 and β̂1 imposes two linearly independent restrictions.)
• Given the xi's, there are 2 degrees of freedom in the quantity β̂0 + β̂1xi.
• Given xi and Ȳ, there is one degree of freedom: ŷi − Ȳ = β̂1(xi − x̄).
2.7: Analysis of Variance: Breaking down (analyzing) Variation
◦ The variation in the data (responses) is summarized by
  Syy = Σᵢ₌₁ⁿ (yi − ȳ)² = TSS   (the total sum of squares)
◦ 2 sources of variation in the responses:
  1. variation due to the straight-line relationship with the predictor
  2. deviation from the line (noise)
◦ Decompose each deviation from the data center into a residual plus the difference between the line and the center:
  yi − ȳ = (yi − ŷi) + (ŷi − ȳ) = ei + (ŷi − ȳ)
so
  Syy = Σ (yi − ȳ)² = Σ (ei + ŷi − ȳ)² = Σ eᵢ² + Σ (ŷi − ȳ)²
since Σ eiŷi = 0 and ȳ Σ ei = 0. Thus
  Syy = SSE + Σ (ŷi − ȳ)² = SSE + SSR
where the last term is the regression sum of squares.

Relation between SSR and β̂1
◦ We saw earlier that
  SSE = Syy − β̂1Sxy
◦ Therefore,
  SSR = β̂1Sxy = β̂1²Sxx
  Note that, for a given set of x's, SSR depends only on β̂1.
  MSR = SSR/d.f. = SSR/1   (1 degree of freedom for the slope parameter)
Expected Sums of Squares
◦ E[SSR] = E[Sxxβ̂1²]
         = Sxx(Var(β̂1) + (E[β̂1])²)
         = Sxx(σ²/Sxx + β1²)
         = σ² + β1²Sxx
         = E[MSR]
  Therefore, if β1 = 0, then SSR is an unbiased estimator of σ².
◦ Development of E[MSE]:
  E[Syy] = E[Σ (yi − ȳ)²] = E[Σ yi² − nȳ²] = Σ E[yi²] − nE[ȳ²]
Development of E[MSE] (cont’d)
Consider the 2 terms on the RHS separately:
1st term:
  E[yi²] = Var(yi) + (E[yi])² = σ² + (β0 + β1xi)²
  Σ E[yi²] = nσ² + nβ0² + 2nβ0β1x̄ + β1² Σ xi²
2nd term:
  E[ȳ²] = Var(ȳ) + (E[ȳ])² = σ²/n + (β0 + β1x̄)²
  nE[ȳ²] = σ² + nβ0² + 2nβ0β1x̄ + nβ1²x̄²
Development of E[MSE] (cont’d)
⇒ E[Syy] = (n − 1)σ² + β1² Σ (xi − x̄)² = E[SST]
◦ E[SSE] = E[Syy] − E[SSR]
         = (n − 1)σ² + β1² Σ (xi − x̄)² − (σ² + β1²Sxx)
         = (n − 2)σ²
  E[MSE] = E[SSE/(n − 2)] = σ²
Another approach to testing H0 : β1 = 0
◦ Under the null hypothesis, both MSE and MSR estimate σ².
◦ Under the alternative, only MSE estimates σ²:
  E[MSR] = σ² + β1²Sxx > σ²
◦ A reasonable test statistic is
  F0 = MSR/MSE ∼ F_{1,n−2}
◦ Large F0 ⇒ evidence against H0.
◦ Note that t_ν² = F_{1,ν}, so this is really the same test, since
  t0² = (β̂1/√(MSE/Sxx))² = β̂1²Sxx/MSE = MSR/MSE
The ANOVA table
Source   df       SS                MS              F
Reg.     1        β̂1²Sxx            β̂1²Sxx          MSR/MSE
Error    n − 2    Syy − β̂1²Sxx      SSE/(n − 2)
Total    n − 1    Syy

roller data example:
> anova(roller.lm)   # R code
Analysis of Variance Table

Response: depression
          Df Sum Sq Mean Sq F value Pr(>F)
weight     1    658     658    14.5 0.0052
Residuals  8    363      45

(Recall that the t-statistic for testing β1 = 0 had been 3.81 = √14.5.)
◦ Ex. Write an R function to compute these ANOVA quantities.
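One possible solution sketch (the function name slr.anova is ours; it reproduces the table for a simple linear regression of y on x):

> slr.anova <- function(x, y)
{
    n <- length(y)
    Sxx <- sum((x - mean(x))^2)
    Sxy <- sum((x - mean(x))*(y - mean(y)))
    Syy <- sum((y - mean(y))^2)
    b1 <- Sxy/Sxx
    SSR <- b1^2*Sxx; SSE <- Syy - SSR      # regression and error sums of squares
    MSR <- SSR/1;    MSE <- SSE/(n - 2)
    F0 <- MSR/MSE
    c(SSR = SSR, SSE = SSE, MSR = MSR, MSE = MSE,
      F = F0, p.value = 1 - pf(F0, 1, n - 2))
}
> slr.anova(roller$weight, roller$depression)   # F about 14.5, p about 0.005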
Confidence Interval for σ²
◦ SSE/σ² ∼ χ²ₙ₋₂, so
  P(χ²_{n−2,1−α/2} ≤ SSE/σ² ≤ χ²_{n−2,α/2}) = 1 − α
  and hence
  P(SSE/χ²_{n−2,α/2} ≤ σ² ≤ SSE/χ²_{n−2,1−α/2}) = 1 − α
e.g. roller data:
  SSE = 363
  χ²_{8,.975} = 2.18   (R code: qchisq(.025, 8))
  χ²_{8,.025} = 17.5   (R code: qchisq(.975, 8))
  Interval: (363/17.5, 363/2.18) = (20.7, 166.5)
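A minimal sketch of this interval in R, assuming roller.lm is the fit with intercept from above:

> SSE <- sum(resid(roller.lm)^2)                  # about 363
> c(SSE/qchisq(.975, 8), SSE/qchisq(.025, 8))     # approximately (20.7, 166.5)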
2.7.1 R² - Coefficient of Determination
◦ R² is the fraction of the response variability explained by the regression:
  R² = SSR/Syy
◦ 0 ≤ R² ≤ 1. Values near 1 imply that most of the variability is explained by the regression.
◦ roller data: SSR = 658 and Syy = 1021, so
  R² = 658/1021 = .644
R output
> summary(roller.lm)
...
Multiple R-Squared: 0.644, ...
◦ Ex. Write an R function which computes R².
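One possible sketch (the function name rsq is ours), computing R² as 1 − SSE/Syy and checking it against the value stored in the summary:

> rsq <- function(x, y)
{
    fit <- lm(y ~ x)
    Syy <- sum((y - mean(y))^2)         # total sum of squares
    SSE <- sum(resid(fit)^2)            # error sum of squares
    1 - SSE/Syy
}
> rsq(roller$weight, roller$depression)   # about 0.644
> summary(roller.lm)$r.squared            # built-in value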
◦ Another interpretation:
  E[R²] ≈ E[SSR]/E[Syy] = (β1²Sxx + σ²)/((n − 1)σ² + β1²Sxx)
        = (β1²Sxx/(n − 1) + σ²/(n − 1)) / (σ² + β1²Sxx/(n − 1))
        ≈ (β1²Sxx/(n − 1)) / (σ² + β1²Sxx/(n − 1))
  for large n. (Note: this differs from the textbook.)
Properties of R²
• Thus, R² increases as
  1. Sxx increases (the x's become more spread out)
  2. σ² decreases
◦ Cautions
  1. R² does not measure the magnitude of the regression slope.
  2. R² does not measure the appropriateness of the linear model.
  3. A large value of R² does not imply that the regression model will be an accurate predictor.
Hazards of Regression
• Extrapolation: predicting y values outside the range of observed
x values. There is no guarantee that a future response would
behave in the same linear manner outside the observed range.
e.g. Consider an experiment with a spring. The spring is
stretched to several different lengths x (in cm) and the restoring
force F (in Newtons) is measured:
x:  3    4    5    6
F:  5.1  6.2  7.9  9.5

> spring.lm <- lm(F ~ x - 1, data = spring)
> summary(spring.lm)
Coefficients:
  Estimate Std. Error t value Pr(>|t|)
x   1.5884     0.0232    68.6  6.8e-06

The fitted model relating F to x is
  F̂ = 1.59x
Can we predict the restoring force for the spring, if it has been
extended to a length of 15 cm?
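Mechanically, R will produce such a prediction; a sketch, assuming the spring data frame above. Nothing guarantees the linear behaviour continues at 15 cm, so the resulting number should be treated with suspicion:

> predict(spring.lm, newdata = data.frame(x = 15),
          interval = "prediction")   # extrapolates far outside the observed 3-6 cm range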
• High leverage observations: x values at the extremes of the
range have more influence on the slope of the regression than
observations near the middle of the range.
• Outliers can distort the regression line. Outliers may be
incorrectly recorded OR may be an indication that the linear
relation or constant variance assumption is incorrect.
• A regression relationship does not mean that there is a
cause-and-effect relationship. e.g. The following data give the
number of lawyers and number of homicides in a given year for a
number of towns:
no. lawyers      1   2   7  10  12  14  15  18
no. homicides    0   0   2   5   6   6   7   8
Note that the number of homicides increases with the number of
lawyers. Does this mean that in order to reduce the number of
homicides, one should reduce the number of lawyers?
• Beware of nonsense relationships. e.g. It is possible to show
that the area of some lakes in Manitoba is related to elevation.
Do you think there is a real reason for this? Or is the apparent
relation just a result of chance?
2.9 - Regression through the Origin
◦ intercept = 0
  yi = β1xi + εi
◦ Maximum likelihood and least squares ⇒ minimize
  Σᵢ₌₁ⁿ (yi − β1xi)²
  giving
  β̂1 = Σ xiyi / Σ xi²
  ei = yi − β̂1xi
  SSE = Σ eᵢ²
◦ Maximum likelihood ⇒
  σ̂² = SSE/n
◦ Unbiased estimator:
  σ̂² = MSE = SSE/(n − 1)
◦ Properties of β̂1:
  E[β̂1] = β1
  Var(β̂1) = σ²/Σ xi²
◦ 1 − α C.I. for β1:
  β̂1 ± t_{n−1,α/2} √(MSE/Σ xi²)
◦ 1 − α C.I. for E[y|x0]:
  ŷ0 ± t_{n−1,α/2} √(MSE x0²/Σ xi²)
  since Var(β̂1x0) = (σ²/Σ xi²) x0²
◦ 1 − α P.I. for y, given x0:
  ŷ0 ± t_{n−1,α/2} √(MSE(1 + x0²/Σ xi²))
◦ R code:
> roller.lm <- lm(depression ~ weight - 1, data = roller)
> summary(roller.lm)
Coefficients:
       Estimate Std. Error t value Pr(>|t|)
weight    2.392      0.299    7.99  2.2e-05

Residual standard error: 6.43 on 9 degrees of freedom
Multiple R-Squared: 0.876,  Adjusted R-squared: 0.863
F-statistic: 63.9 on 1 and 9 DF,  p-value: 2.23e-05

> predict(roller.lm, newdata = data.frame(weight = 5),
          interval = "prediction")
      fit   lwr  upr
[1,] 12.0 -2.97 26.9
Ch. 2.9.2 Correlation
• Bivariate normal distribution:
  f(x, y) = (1/(2πσ1σ2√(1 − ρ²))) exp{−(Y² − 2ρXY + X²)/(2(1 − ρ²))}
  where
  X = (x − µ1)/σ1   and   Y = (y − µ2)/σ2
◦ correlation coefficient: ρ = σ12/(σ1σ2)
◦ µ1 and σ1² are the mean and variance of x
◦ µ2 and σ2² are the mean and variance of y
◦ σ12 = E[(x − µ1)(y − µ2)], the covariance
Conditional distribution of y given x
f(y|x) = (1/(√(2π) σ1.2)) exp{−(1/2)((y − β0 − β1x)/σ1.2)²}
where
  β0 = µ2 − µ1ρ(σ2/σ1)
  β1 = (σ2/σ1)ρ
  σ1.2² = σ2²(1 − ρ²)
An Explanation
◦ Suppose
  y = β0 + β1x + ε
  where ε and x are independent N(0, σ²) and N(µ1, σ1²) random variables.
◦ Define µ2 = E[y] and σ2² = Var(y):
  µ2 = β0 + β1µ1
  σ2² = β1²σ1² + σ²
◦ Define σ12 = Cov(x − µ1, y − µ2):
  σ12 = E[(x − µ1)(y − µ2)]
      = E[(x − µ1)(β1x + ε − β1µ1)]
      = β1E[(x − µ1)²] + E[(x − µ1)ε]
      = β1σ1²
◦ Define ρ = σ12/(σ1σ2):
  ρ = β1σ1²/(σ1σ2) = β1(σ1/σ2)
  Therefore,
  β1 = (σ2/σ1)ρ
  and
  β0 = µ2 − β1µ1 = µ2 − µ1(σ2/σ1)ρ
◦ What is the conditional distribution of y, given x = x?
  y|x=x = β0 + β1x + ε
  must be normal, with mean
  E[y|x = x] = β0 + β1x = µ2 + ρ(σ2/σ1)(x − µ1)
  and variance
  Var(y|x = x) = σ1.2² = σ2² − β1²σ1² = σ2² − ρ²(σ2²/σ1²)σ1² = σ2²(1 − ρ²)
Estimation
◦ Maximum likelihood estimation (using the bivariate normal) gives
  β̂0 = ȳ − β̂1x̄
  β̂1 = Sxy/Sxx
  r = ρ̂ = Sxy/√(SxxSyy)
◦ Note that
  r = β̂1 √(Sxx/Syy)
  so
  r² = β̂1²(Sxx/Syy) = R²
  i.e. the coefficient of determination is the square of the correlation coefficient.
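A quick numerical check of this identity with the roller data:

> fit <- lm(depression ~ weight, data = roller)   # fit with intercept
> cor(roller$weight, roller$depression)^2         # squared correlation, about 0.644
> summary(fit)$r.squared                          # coefficient of determination; the two agree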
Testing H0: ρ = 0 vs. H1: ρ ≠ 0
◦ Conditional approach (equivalent to testing β1 = 0):
  t0 = β̂1/√(MSE/Sxx)
     = β̂1/√((Syy − β̂1²Sxx)/((n − 2)Sxx))
     = √(β̂1²(n − 2)/(Syy/Sxx − β̂1²))
     = √((n − 2)r²/(1 − r²))
  where we have used β̂1² = r²Syy/Sxx.
◦ The above statistic is a t statistic on n − 2 degrees of freedom ⇒ conclude ρ ≠ 0 if the p-value
  P(|tₙ₋₂| > |t0|) is small.
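In R this test is available directly through cor.test; a quick check with the roller data:

> cor.test(roller$weight, roller$depression)

The reported t statistic (on 8 degrees of freedom) and p-value match the earlier test of β1 = 0.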
Testing H0 : ρ = ρ0
◦ The transformed statistic
  Z = (1/2) log((1 + r)/(1 − r))
  has an approximate normal distribution with mean
  µ_Z = (1/2) log((1 + ρ)/(1 − ρ))
  and variance
  σ_Z² = 1/(n − 3)
  for large n.
◦ Thus,
  Z0 = [log((1 + r)/(1 − r)) − log((1 + ρ0)/(1 − ρ0))] / (2√(1/(n − 3)))
  has an approximate standard normal distribution when the null hypothesis is true.
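A minimal sketch of this approximate test in R; the values of r, n, and ρ0 below are hypothetical, chosen only for illustration:

> r <- .7; n <- 30; rho0 <- .5     # hypothetical sample correlation, sample size, null value
> z0 <- (log((1 + r)/(1 - r)) - log((1 + rho0)/(1 - rho0))) / (2*sqrt(1/(n - 3)))
> 2*(1 - pnorm(abs(z0)))           # approximate two-sided p-value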
Confidence Interval for ρ
• A confidence interval for (1/2) log((1 + ρ)/(1 − ρ)) is
  Z ± z_{α/2} √(1/(n − 3))
• Find the endpoints (l, u) of this confidence interval and solve for ρ:
  ((e^{2l} − 1)/(1 + e^{2l}), (e^{2u} − 1)/(1 + e^{2u}))
R code for fossum example
• Find a 95% confidence interval for the correlation between total length and head length:

> source("fossum.R")
> attach(fossum)
> n <- length(totlngth)              # sample size
> r <- cor(totlngth, hdlngth)        # correlation est.
> zci <- .5*log((1+r)/(1-r)) +
         qnorm(c(.025,.975))*sqrt(1/(n-3))
> ci <- (exp(2*zci)-1)/(1+exp(2*zci))   # transformed conf. interval
> ci                                 # conf. interval for the true correlation
[1] 0.62 0.83
> detach(fossum)