
220114 Intro to linear regression

CHL 5202
Biostatistics II
Extending Linear Regression
Prof. Kevin E. Thorpe
Dalla Lana School of Public Health
University of Toronto
Objectives
1. Review the multiple linear regression model.
2. Understand how the multiple regression model can be used to incorporate a non-linear relationship.
3. Understand the concept of the restricted cubic spline and be able to fit one in R.
Introduction
- Recall the simple linear regression model:

  y_i = β0 + β1 x_i + ε_i

  where the error term ε_i ~ N(0, σ²).
- This model assumes that a straight-line relationship exists between X and Y.
- Suppose that is not the case, as in the following simulated data.
[Figure: simulated data, y versus x, showing a clearly non-linear pattern]
Suppose we fit the linear model.
## Linear Regression Model
##
## ols(formula = y ~ x, data = mydat)
##
##                 Model Likelihood     Discrimination
##                    Ratio Test            Indexes
## Obs     100    LR chi2      0.13    R2       0.001
## sigma 1.0867   d.f.            1    R2 adj  -0.009
## d.f.     98    Pr(> chi2) 0.7172    g        0.045
##
## Residuals
##
##     Min      1Q  Median      3Q     Max
## -2.5940 -0.7591 -0.1075  0.6510  2.2604
##
##           Coef    S.E.   t     Pr(>|t|)
## Intercept  0.6875 0.2157  3.19 0.0019
## x         -0.0213 0.0593 -0.36 0.7206
##
Analysis of Variance          Response: y

Factor     d.f.  Partial SS  MS        F    P
x             1   0.1519194 0.1519194 0.13 0.7206
REGRESSION    1   0.1519194 0.1519194 0.13 0.7206
ERROR        98 115.7310879 1.1809295
[Figure: normal Q-Q plot of the residuals]
[Figure: residuals (e) versus x]
[Figure: residuals (e) versus fitted values (yhat)]
[Figure: y versus x with the fitted regression line overlaid]
Extending the Model
- The simple linear model previously fit is clearly not appropriate.
- The residual plots suggest the possibility of a cubic relationship.
- If we could find a way to add additional powers of X to the model, maybe it would fit better.

Extending the Model
Multiple Regression
- Recall the multiple regression model:

  y_i = β0 + β1 x_i1 + β2 x_i2 + · · · + βp x_ip + ε_i,  ε_i ~ N(0, σ²)

- The model makes no statement about what the X variables are, so we could define a cubic relationship between X and Y by:

  y_i = β0 + β1 x_i + β2 x_i^2 + β3 x_i^3 + ε_i,  ε_i ~ N(0, σ²)

- The next few slides show the results of fitting this model. We will see how to do it later.
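The lecture fits this model in R with ols(y ~ pol(x, 3)). Purely to illustrate the mechanics of "adding powers of X", here is a minimal sketch in Python: it simulates hypothetical data with a cubic trend (not the lecture's actual mydat), builds the design matrix with columns 1, x, x², x³, and solves by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data with a cubic trend, loosely mimicking
# the example (not the lecture's actual mydat).
x = rng.uniform(0, 6, 100)
y = 2.0 * x - 1.0 * x**2 + 0.11 * x**3 + rng.normal(0, 1, 100)

# Design matrix with columns 1, x, x^2, x^3: the "extra powers of X".
X = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Ordinary least squares, exactly what ols() does under the hood.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta
print(beta)  # least-squares estimates of (beta0, beta1, beta2, beta3)
```

Note this is still a *linear* model: it is linear in the coefficients, even though the fitted curve in x is cubic.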
Suppose we fit the cubic model.
## Linear Regression Model
##
## ols(formula = y ~ pol(x, 3), data = mydat)
##
##                 Model Likelihood     Discrimination
##                    Ratio Test            Indexes
## Obs     100    LR chi2     29.88    R2       0.258
## sigma 0.9462   d.f.            3    R2 adj   0.235
## d.f.     96    Pr(> chi2) 0.0000    g        0.626
##
## Residuals
##
##      Min       1Q   Median       3Q      Max
## -2.60281 -0.73283 -0.07771  0.58072  2.21026
##
##           Coef    S.E.   t     Pr(>|t|)
## Intercept -0.3266 0.3647 -0.90 0.3728
## x          2.2649 0.5052  4.48 <0.0001
## x^2       -0.9841 0.1874 -5.25 <0.0001
## x^3        0.1093 0.0196  5.58 <0.0001
##
Analysis of Variance          Response: y

Factor     d.f. Partial SS MS         F     P
x             3   29.93298  9.9776594 11.14 <.0001
Nonlinear     2   29.78106 14.8905293 16.63 <.0001
REGRESSION    3   29.93298  9.9776594 11.14 <.0001
ERROR        96   85.95003  0.8953128

[Figure: normal Q-Q plot of the residuals]
[Figure: residuals (e) versus x]
[Figure: residuals (e) versus fitted values (yhat)]
[Figure: y versus x with the fitted cubic curve overlaid]
Extending the Model
Comments
- In theory, more complicated shapes could be fit by adding higher-order terms. This tends not to work very well in practice (tail behaviour tends to be bad).
- Finding the correct degree of polynomial would require fitting multiple options and choosing the best. This data-driven approach will result in overfitting.
- It would be nice if there were a more general way to model a non-linear relationship that behaved better in the tails and did not require a stepwise approach to specify the degree.
Splines
Introduction
- Spline functions are piecewise polynomials.
- More specifically, they are polynomials fit within distinct intervals of X and then connected across the intervals.
- The x axis is split into intervals by selecting 3 or more points, called knots. Then some polynomial function is fit in each interval of the x axis resulting from the knots.
Splines
Linear Spline
- The simplest spline is the linear spline (a straight line fit in each interval).
- Suppose three knots a, b and c were chosen. The linear spline function is given by

  f(X) = β0 + β1 X + β2 (X − a)+ + β3 (X − b)+ + β4 (X − c)+

  where (u)+ = max(0, u).
- This formulation ensures that the lines join at the knots.
[Figure: example of a linear spline, f(X) plotted against X over 0 to 6]
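The truncated-power notation above is easy to verify numerically. A sketch in Python, with arbitrary illustrative knots and coefficients (not taken from the lecture data), checks that the line segments really do join at each knot:

```python
import numpy as np

def pos(u):
    """Truncated power: (u)+ = max(0, u), applied elementwise."""
    return np.maximum(0.0, u)

def linear_spline(x, beta, knots):
    """f(X) = b0 + b1*X + b2*(X-a)+ + b3*(X-b)+ + b4*(X-c)+ (three knots)."""
    a, b, c = knots
    return (beta[0] + beta[1] * x
            + beta[2] * pos(x - a)
            + beta[3] * pos(x - b)
            + beta[4] * pos(x - c))

knots = (1.0, 3.0, 5.0)                        # illustrative knot locations
beta = np.array([0.5, 1.0, -2.0, 2.5, -1.5])   # arbitrary coefficients

# Continuity at each knot: approaching from the left and from the right
# gives (essentially) the same value, so the segments join.
for k in knots:
    left = linear_spline(np.array([k - 1e-9]), beta, knots)[0]
    right = linear_spline(np.array([k + 1e-9]), beta, knots)[0]
    print(k, left, right)
```

Each (X − knot)+ term is zero to the left of its knot and grows linearly to the right, so it changes the slope at the knot without introducing a jump.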
Splines
Cubic Splines
- Cubic polynomials are quite flexible, so if, instead of a line in each interval, we fit a cubic, we get a cubic spline.
- The cubic spline function is made smooth by forcing the first and second derivatives to agree at the knots.
- Such a cubic spline function with three knots a, b and c would be:

  f(X) = β0 + β1 X + β2 X^2 + β3 X^3 + β4 (X − a)+^3 + β5 (X − b)+^3 + β6 (X − c)+^3

- Note that this requires 6 parameters to be estimated. In general, you need to estimate k + 3 parameters for a k-knot cubic spline.
Splines
Restricted Cubic Splines
- The cubic spline can behave badly in the tails. Constraining the function to be linear in the tails gives the restricted cubic spline (or natural spline).
- The restricted cubic spline function with k knots t_1, . . . , t_k is:

  f(X) = β0 + β1 X_1 + β2 X_2 + · · · + β_{k−1} X_{k−1}

  where X_1 = X and, for j = 1, . . . , k − 2,

  X_{j+1} = (X − t_j)+^3 − (X − t_{k−1})+^3 (t_k − t_j)/(t_k − t_{k−1}) + (X − t_k)+^3 (t_{k−1} − t_j)/(t_k − t_{k−1})

- The rms package scales these computations by τ = (t_k − t_1)^2.
- Notice that this only requires estimation of k − 1 parameters.
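The basis formula above can be checked numerically. The Python sketch below builds the basis exactly as the slide defines it (without the rms τ scaling), with illustrative knots, and confirms the two advertised properties: the basis has k − 1 columns, and the function is linear beyond the last knot (its second differences vanish there).

```python
import numpy as np

def pos3(u):
    """Cubed truncated power: (u)+^3."""
    return np.maximum(0.0, u) ** 3

def rcs_basis(x, t):
    """Restricted cubic spline basis X1, ..., X_{k-1} for knots
    t1 < ... < tk, following the slide's formula (unscaled)."""
    t = np.asarray(t, dtype=float)
    k = len(t)
    cols = [x]                                 # X1 = X
    for j in range(k - 2):                     # j = 1, ..., k-2
        tj, tkm1, tk = t[j], t[-2], t[-1]
        col = (pos3(x - tj)
               - pos3(x - tkm1) * (tk - tj) / (tk - tkm1)
               + pos3(x - tk) * (tkm1 - tj) / (tk - tkm1))
        cols.append(col)
    return np.column_stack(cols)

t = [0.5, 2.0, 4.0, 5.5]          # 4 illustrative knots (k = 4)
x = np.linspace(-2, 8, 401)
B = rcs_basis(x, t)
print(B.shape)                     # k - 1 = 3 columns
```

Beyond the last knot the cubic and quadratic terms of each nonlinear column cancel algebraically (the two ratio weights are chosen to make that happen), which is exactly the "linear in the tails" restriction.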
Splines
Restricted Cubic Splines
- Once β0, . . . , β_{k−1} are estimated, the spline function can be expressed as:

  f(X) = β0 + β1 X + β2 (X − t_1)+^3 + · · · + β_{k+1} (X − t_k)+^3

  by dividing β2, . . . , β_{k−1} by τ and computing β_k and β_{k+1} from Equation 2.28 (Harrell).
- A test of linearity in X is obtained by testing:

  H0 : β2 = β3 = · · · = β_{k−1} = 0
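This linearity test is just a nested-model comparison: fit the full model and the model with only the linear term, then compare residual sums of squares with an F statistic (it is what anova() in rms reports on its "Nonlinear" line). A generic Python sketch on simulated data, using simple truncated cubics as stand-ins for the non-linear spline columns:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data with a genuinely non-linear (quadratic) signal.
x = rng.uniform(0, 6, 120)
y = (x - 3.0) ** 2 / 3.0 + rng.normal(0, 0.3, 120)

def fit_rss(X, y):
    """Least-squares fit; return the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

# "Full" model: intercept, x, and two non-linear terms (illustrative
# truncated cubics standing in for the spline's non-linear columns).
nl1 = np.maximum(0.0, x - 2.0) ** 3
nl2 = np.maximum(0.0, x - 4.0) ** 3
X_full = np.column_stack([np.ones_like(x), x, nl1, nl2])
X_lin = np.column_stack([np.ones_like(x), x])

rss_full = fit_rss(X_full, y)
rss_lin = fit_rss(X_lin, y)

# F statistic for H0: both non-linear coefficients are zero.
q = 2                                   # number of restrictions tested
df_resid = len(x) - X_full.shape[1]
F = ((rss_lin - rss_full) / q) / (rss_full / df_resid)
print(F)
```

A large F (compared to an F distribution with q and df_resid degrees of freedom) rejects linearity, as it should for this curved signal.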
Splines
Concerning Knots
- Within this field, some make a distinction between internal knots and boundary knots (at the extremes of X).
- In the rms package, the rcs() function is used to create the restricted cubic spline. When you specify the number of knots in this function, it counts all knots: specifying 3 knots means 1 internal and 2 boundary knots. The boundary knots are also not placed quite at the edges.
- In the splines package, the function ns() can be used. When you specify the number of knots there, it counts internal knots only, and the boundary knots are at the edges of X by default.
Restricted cubic spline with 4 knots.
## Linear Regression Model
##
## ols(formula = y ~ rcs(x, 4), data = mydat)
##
##                 Model Likelihood     Discrimination
##                    Ratio Test            Indexes
## Obs     100    LR chi2     27.41    R2       0.240
## sigma 0.9580   d.f.            3    R2 adj   0.216
## d.f.     96    Pr(> chi2) 0.0000    g        0.612
##
## Residuals
##
##      Min       1Q   Median       3Q      Max
## -2.79219 -0.71792 -0.04407  0.60252  2.18081
##
##           Coef     S.E.   t     Pr(>|t|)
## Intercept   0.0548 0.3191  0.17 0.8639
## x           0.8783 0.2599  3.38 0.0011
## x'         -3.7879 0.8066 -4.70 <0.0001
## x''        12.0017 2.3426  5.12 <0.0001
##
Analysis of Variance          Response: y

Factor     d.f. Partial SS MS         F     P
x             3   27.78090  9.2603004 10.09 <.0001
Nonlinear     2   27.62898 13.8144909 15.05 <.0001
REGRESSION    3   27.78090  9.2603004 10.09 <.0001
ERROR        96   88.10211  0.9177303

[Figure: normal Q-Q plot of the residuals]
[Figure: residuals (e) versus x]
[Figure: residuals (e) versus fitted values (yhat)]
[Figure: y versus x with the fitted restricted cubic spline overlaid]
Splines
Choosing the number of knots
- The more knots used, the more flexible the spline.
- In general, 3, 4 or 5 knots tend to be sufficient. Four is probably the best compromise. Use 3 if the sample size is small and 5 if the sample size is large and you need a bit of extra flexibility.
- Too many knots will start to fit the noise and become too wiggly. The graph on the next slide shows what happens if you fit 10 knots to the example data.
[Figure: 10-knot restricted cubic spline fit to the example data, visibly over-wiggly]
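Knot locations, as opposed to the number of knots, are rarely chosen by hand: by default rcs() places them at fixed quantiles of X (for 4 knots, Harrell's default is the 0.05, 0.35, 0.65 and 0.95 quantiles; verify against your rms version). A quick Python sketch of that placement on simulated data standing in for the example's x:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 6, 100)    # stand-in for the example's x values

# Default quantile placement for 4 knots (per Harrell / the rms package).
probs = [0.05, 0.35, 0.65, 0.95]
knots = np.quantile(x, probs)
print(knots)                   # four increasing knot locations inside the data
```

Placing the outer knots at the 5th and 95th percentiles, rather than at the extremes, is what keeps the boundary knots "not quite at the edges" as noted earlier.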
Splines
Interpretation
- The existence of a non-linear relationship means that the effect size cannot be expressed with a single number.
- A graph is the best way to visualize and understand the relationship as a whole.
- Numerical summaries of the effect can be computed over different intervals of the X variable.
- The graph will often provide useful direction for selecting a few key intervals in which to compute numeric summaries.
- A real example illustrates.
library(rms)  # provides datadist(), ols() and rcs()
load("tipp1.RData")
t.dd <- datadist(tipp1)
options(datadist = "t.dd")
t.fit <- ols(debirthw ~ degender + rcs(degesage, 4) + mage,
             data = tipp1)
t.fit
## Linear Regression Model
##
## ols(formula = debirthw ~ degender + rcs(degesage, 4) + mage,
##     data = tipp1)
##
##                    Model Likelihood     Discrimination
##                       Ratio Test            Indexes
## Obs      1174    LR chi2    447.76    R2       0.317
## sigma 107.7510   d.f.            5    R2 adj   0.314
## d.f.     1168    Pr(> chi2) 0.0000    g       81.807
##
## Residuals
##
##      Min       1Q   Median       3Q      Max
## -353.127  -63.519    7.551   78.557  280.463
##
##                 Coef      S.E.     t     Pr(>|t|)
## Intercept     -700.6421 206.3974 -3.39  0.0007
## degender=Male   26.3054   6.3406  4.15  <0.0001
## degesage        56.7278   8.6145  6.59  <0.0001
## degesage'       24.8321  29.0353  0.86  0.3926
## degesage''    -286.6153 114.2520 -2.51  0.0123
## mage             0.5169   0.5013  1.03  0.3027
##
anova(t.fit)

## Analysis of Variance          Response: debirthw
##
## Factor     d.f.  Partial SS          MS      F      P
## degender      1   199834.45   199834.45  17.21 <.0001
## degesage      3  6222213.92  2074071.31 178.64 <.0001
## Nonlinear     2  1247050.00   623525.00  53.70 <.0001
## mage          1    12345.42    12345.42   1.06 0.3027
## REGRESSION    5  6296709.46  1259341.89 108.47 <.0001
## ERROR      1168 13560798.03    11610.27
# Default summary() result.
# Probably not the most useful
summary(t.fit)

##              Effects              Response : debirthw
##
## Factor                 Low High  Diff. Effect   S.E.   Lower 0.95 Upper 0.95
## degesage                25 27.00  2.00  97.6270 7.8104  82.3030   112.950
## mage                    24 33.75  9.75   5.0396 4.8872  -4.5492    14.628
## degender - Female:Male   2  1.00    NA -26.3050 6.3406 -38.7460   -13.865
plot(Predict(t.fit, degesage))

[Figure: predicted debirthw (roughly 650 to 850) versus Gestational age (degesage, 24 to 30); adjusted to: degender=Male, mage=28.5]
summary(t.fit, degesage=c(24,26))

##              Effects              Response : debirthw
##
## Factor                 Low High  Diff. Effect   S.E.   Lower 0.95 Upper 0.95
## degesage                24 26.00  2.00 123.4300 6.3456 110.9800   135.880
## mage                    24 33.75  9.75   5.0396 4.8872  -4.5492    14.628
## degender - Female:Male   2  1.00    NA -26.3050 6.3406 -38.7460   -13.865
summary(t.fit, degesage=c(28,30))

##              Effects              Response : debirthw
##
## Factor                 Low High  Diff. Effect   S.E.   Lower 0.95 Upper 0.95
## degesage                28 30.00  2.00  -1.1608 8.2892 -17.4240    15.102
## mage                    24 33.75  9.75   5.0396 4.8872  -4.5492    14.628
## degender - Female:Male   2  1.00    NA -26.3050 6.3406 -38.7460   -13.865
Understanding Categorical Variables
- It is generally inappropriate to code a categorical variable as 1, 2, etc.
- Instead, a categorical variable is represented in a model by a set of dummy variables.
- One way to do this (also the default in R) is to select one level as a reference level and code the dummy variables for the remaining levels so that they are interpreted as effects relative to the reference level. There will be 1 fewer dummy variable than categories.
- Since the set of dummy variables collectively represents the effect of the categorical variable, it is also generally inappropriate to interpret p-values on the individual variables.

For example:
contrasts(tipp1$demethni)

##                          Afro-American or African Asian Other
## Caucasian                                       0     0     0
## Afro-American or African                        1     0     0
## Asian                                           0     1     0
## Other                                           0     0     1
shows that the default coding for the ethnicity variable would
choose Caucasian as the reference category and would create 3
dummy variables for the remaining categories in reference to
Caucasian.
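The same reference-level coding can be reproduced outside R. A small Python sketch with pandas, using a hypothetical ethnicity column with the slide's four levels: dropping the first level makes Caucasian the reference, leaving 3 dummy columns for 4 categories.

```python
import pandas as pd

# Hypothetical ethnicity values using the slide's four levels.
levels = ["Caucasian", "Afro-American or African", "Asian", "Other"]
eth = pd.Series(["Caucasian", "Afro-American or African", "Asian",
                 "Other", "Caucasian"],
                dtype=pd.CategoricalDtype(categories=levels))

# drop_first=True drops the first level (Caucasian), making it the
# reference: 3 dummy columns represent the 4 categories.
dummies = pd.get_dummies(eth, drop_first=True)
print(dummies)
```

A row whose category is the reference level has all dummies equal to zero, so each remaining coefficient is an effect relative to Caucasian, matching the contrasts matrix above.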
fit.eth <- ols(debirthw ~ demethni, data = tipp1)
fit.eth

## Linear Regression Model
##
## ols(formula = debirthw ~ demethni, data = tipp1)
##
##                    Model Likelihood     Discrimination
##                       Ratio Test            Indexes
## Obs      1174    LR chi2     11.79    R2       0.010
## sigma 129.6251   d.f.            3    R2 adj   0.007
## d.f.     1170    Pr(> chi2) 0.0081    g       11.931
##
## Residuals
##
##     Min      1Q  Median      3Q     Max
## -288.51  -98.51    6.49  109.57  243.60
##
##                                     Coef     S.E.    t      Pr(>|t|)
## Intercept                          788.5099  4.5546 173.13  <0.0001
## demethni=Afro-American or African  -34.1074 11.2437  -3.03   0.0025
## demethni=Asian                      11.4901 15.8404   0.73   0.4684
## demethni=Other                     -18.6993 12.1670  -1.54   0.1246
##
anova(fit.eth)

## Analysis of Variance          Response: debirthw
##
## Factor     d.f.  Partial SS        MS    F      P
## demethni      3    198392.6  66130.86 3.94 0.0083
## REGRESSION    3    198392.6  66130.86 3.94 0.0083
## ERROR      1170  19659114.9  16802.66
summary(fit.eth) # Caucasian as reference

##              Effects              Response : debirthw
##
## Factor                                        Low High Diff. Effect  S.E.   Lower 0.95 Upper 0.95
## demethni - Afro-American or African:Caucasian   1    2    NA -34.107 11.244 -56.167    -12.0470
## demethni - Asian:Caucasian                      1    3    NA  11.490 15.840 -19.589     42.5690
## demethni - Other:Caucasian                      1    4    NA -18.699 12.167 -42.571      5.1724
summary(fit.eth, demethni="Asian") # Asian as reference

##              Effects              Response : debirthw
##
## Factor                                    Low High Diff. Effect  S.E.   Lower 0.95 Upper 0.95
## demethni - Caucasian:Asian                  3    1    NA -11.490 15.840 -42.569     19.5890
## demethni - Afro-American or African:Asian   3    2    NA -45.597 18.326 -81.553     -9.6415
## demethni - Other:Asian                      3    4    NA -30.189 18.907 -67.284      6.9056