CHL 5202 Biostatistics II
Extending Linear Regression
Prof. Kevin E. Thorpe
Dalla Lana School of Public Health, University of Toronto

Objectives

1. Review the multiple linear regression model.
2. Understand how the multiple regression model can be used to incorporate a non-linear relationship.
3. Understand the concept of restricted cubic splines and be able to fit them in R.

Introduction

- Recall the simple linear regression model:

  yi = β0 + β1 xi + εi,  where the error term εi ~ N(0, σ²)

- This model assumes that a straight-line relationship exists between X and Y.
- Suppose that is not the case, as in the following simulated data.

[Figure: scatterplot of the simulated data, y versus x]

Suppose we fit the linear model.

## Linear Regression Model
##
## ols(formula = y ~ x, data = mydat)
##
##                 Model Likelihood    Discrimination
##                    Ratio Test          Indexes
## Obs     100    LR chi2      0.13    R2       0.001
## sigma 1.0867   d.f.            1    R2 adj  -0.009
## d.f.     98    Pr(> chi2) 0.7172    g        0.045
##
## Residuals
##
##     Min      1Q  Median      3Q     Max
## -2.5940 -0.7591 -0.1075  0.6510  2.2604
##
##           Coef    S.E.   t     Pr(>|t|)
## Intercept  0.6875 0.2157  3.19 0.0019
## x         -0.0213 0.0593 -0.36 0.7206

##                 Analysis of Variance          Response: y
##
##  Factor     d.f. Partial SS    MS         F    P
##  x            1    0.1519194   0.1519194  0.13 0.7206
##  REGRESSION   1    0.1519194   0.1519194  0.13 0.7206
##  ERROR       98  115.7310879   1.1809295

[Figures: normal Q-Q plot of the residuals; residuals (e) versus x; residuals versus fitted values (yhat); fitted line overlaid on the scatterplot of y versus x]

Extending the Model

- The simple linear model fit above is clearly not appropriate.
- The residual plots suggest the possibility of a cubic relationship.
- If we could find a way to add additional powers of X to the model, maybe it would fit better.

Extending the Model: Multiple Regression

- Recall the multiple regression model:

  yi = β0 + β1 xi1 + β2 xi2 + ··· + βp xip + εi,  εi ~ N(0, σ²)

- The model makes no statement about what the X variables are, so we could define a cubic relationship between X and Y by:

  yi = β0 + β1 xi + β2 xi² + β3 xi³ + εi,  εi ~ N(0, σ²)

- The output below shows the results of fitting this model. We will see how to fit it later; a brief code sketch also appears at the end of this section.

Suppose we fit the cubic polynomial model.

## Linear Regression Model
##
## ols(formula = y ~ pol(x, 3), data = mydat)
##
##                 Model Likelihood    Discrimination
##                    Ratio Test          Indexes
## Obs     100    LR chi2     29.88    R2       0.258
## sigma 0.9462   d.f.            3    R2 adj   0.235
## d.f.     96    Pr(> chi2) 0.0000    g        0.626
##
## Residuals
##
##      Min       1Q   Median       3Q      Max
## -2.60281 -0.73283 -0.07771  0.58072  2.21026
##
##           Coef    S.E.   t     Pr(>|t|)
## Intercept -0.3266 0.3647 -0.90 0.3728
## x          2.2649 0.5052  4.48 <0.0001
## x^2       -0.9841 0.1874 -5.25 <0.0001
## x^3        0.1093 0.0196  5.58 <0.0001

##                 Analysis of Variance          Response: y
##
##  Factor      d.f. Partial SS   MS          F     P
##  x             3   29.93298     9.9776594  11.14 <.0001
##   Nonlinear    2   29.78106    14.8905293  16.63 <.0001
##  REGRESSION    3   29.93298     9.9776594  11.14 <.0001
##  ERROR        96   85.95003     0.8953128

[Figures: normal Q-Q plot of the residuals; residuals (e) versus x; residuals versus fitted values (yhat); fitted cubic curve overlaid on the scatterplot of y versus x]

Extending the Model: Comments

- In theory, more complicated shapes could be fit by adding higher-order terms. This tends not to work well in practice (polynomial tail behaviour tends to be poor).
- Finding the correct degree of polynomial would require fitting several candidate models and choosing the best; this data-driven approach is prone to overfitting.
- It would be nice if there were a more general way to model a non-linear relationship that behaved better in the tails and did not require a stepwise approach to specify the degree.
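For reference, here is a minimal sketch of the kind of code behind the output above, assuming the rms package is installed. The simulated data (x, y, and the data frame mydat) are illustrative stand-ins, not the actual data used in the slides; pol() is the rms helper that expands a predictor into polynomial terms.

# A minimal sketch, assuming rms is available; the simulated data are
# illustrative only.
library(rms)
set.seed(1)
x <- runif(100, 0, 6)
y <- 0.11 * (x - 1) * (x - 3) * (x - 5) + rnorm(100, sd = 1)
mydat <- data.frame(x, y)
dd <- datadist(mydat); options(datadist = "dd")

fit1 <- ols(y ~ x, data = mydat)          # straight-line model
fit3 <- ols(y ~ pol(x, 3), data = mydat)  # cubic polynomial model
fit3
anova(fit3)              # the "Nonlinear" row jointly tests the x^2 and x^3 terms
plot(Predict(fit3, x))   # fitted curve with a pointwise confidence band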
Splines: Introduction

- Spline functions are piecewise polynomials.
- More specifically, they are polynomials fit within distinct intervals of X and then joined across the intervals.
- The x axis is split into intervals by selecting 3 or more points, called knots; a polynomial is then fit within each interval defined by the knots.

Splines: Linear Splines

- The simplest spline is the linear spline (a straight line is fit in each interval).
- Suppose three knots a, b and c were chosen. The linear spline function is given by

  f(X) = β0 + β1 X + β2 (X − a)+ + β3 (X − b)+ + β4 (X − c)+,  where (u)+ = max(0, u)

- This formulation ensures that the lines join at the knots.

[Figure: example of a linear spline, f(X) plotted against X]

Splines: Cubic Splines

- Cubic polynomials are quite flexible, so if we fit a cubic rather than a straight line in each interval we get a cubic spline.
- The cubic spline function is made smooth by forcing the first and second derivatives to agree at the knots.
- Such a cubic spline function with three knots a, b and c would be:

  f(X) = β0 + β1 X + β2 X² + β3 X³ + β4 (X − a)³+ + β5 (X − b)³+ + β6 (X − c)³+

- Note that this requires 6 parameters to be estimated. In general, you need to estimate k + 3 parameters for a cubic spline with k knots.

Splines: Restricted Cubic Splines

- The cubic spline can behave badly in the tails. Constraining the function to be linear in the tails gives the restricted cubic spline (or natural spline).
- The restricted cubic spline function with k knots t1, ..., tk is:

  f(X) = β0 + β1 X1 + β2 X2 + ··· + βk−1 Xk−1

  where X1 = X and, for j = 1, ..., k − 2,

  Xj+1 = (X − tj)³+ − (X − tk−1)³+ (tk − tj)/(tk − tk−1) + (X − tk)³+ (tk−1 − tj)/(tk − tk−1)

- The rms package scales these computations by τ = (tk − t1)².
- Notice that this only requires estimation of k − 1 parameters.

Splines: Restricted Cubic Splines (continued)

- Once β0, ..., βk−1 are estimated, the spline function can be expressed in the unrestricted form

  f(X) = β0 + β1 X + β2 (X − t1)³+ + ··· + βk+1 (X − tk)³+

  by dividing β2, ..., βk−1 by τ and computing βk and βk+1 from Equation 2.28 in Harrell.

- A test of linearity in X is obtained by testing

  H0: β2 = β3 = ··· = βk−1 = 0

Splines: Concerning Knots

- A distinction is sometimes made between internal knots and boundary knots (knots at the extremes of X).
- In the rms package, the rcs() function creates the restricted cubic spline basis. The number of knots you specify counts all knots: specifying 3 knots means 1 internal and 2 boundary knots. The boundary knots are also placed slightly inside the extremes of the data rather than right at the edges.
- In the splines package, the function ns() can be used instead. There, the number of knots you specify counts internal knots only, and the boundary knots are placed at the edges of X by default. The sketch below builds the basis by hand and contrasts the two functions.
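To make the basis construction concrete, here is a small sketch that builds the restricted cubic spline variables directly from the formula above (without the τ scaling used by rms). The knot locations, the helper name rcs_basis, and the reuse of the simulated mydat from the earlier sketch are assumptions for illustration; the fit would only reproduce rcs() exactly if given the same knots.

# A hand-rolled restricted cubic spline basis, following the formula above
# (no tau scaling); knot locations are illustrative choices, not defaults.
rcs_basis <- function(x, knots) {
  k   <- length(knots)
  pos <- function(u) pmax(u, 0)            # (u)+ = max(0, u)
  tk  <- knots[k]
  tk1 <- knots[k - 1]
  out <- matrix(x, ncol = 1)               # X1 = X itself
  for (j in 1:(k - 2)) {
    tj <- knots[j]
    Xj <- pos(x - tj)^3 -
          pos(x - tk1)^3 * (tk - tj) / (tk - tk1) +
          pos(x - tk)^3  * (tk1 - tj) / (tk - tk1)
    out <- cbind(out, Xj)
  }
  out                                       # k - 1 columns in total
}

# Four knots give k - 1 = 3 regression variables, matching the x, x', x''
# terms in the rcs(x, 4) output below (up to the tau scaling).
B <- rcs_basis(mydat$x, knots = c(0.5, 2.0, 4.0, 5.5))
fit_hand <- lm(mydat$y ~ B)

# For comparison: ns() from the splines package.  The two knots supplied here
# are *internal*; with the two default boundary knots at range(x) this is also
# a 4-knot natural spline with 3 spline terms.
library(splines)
fit_ns <- lm(y ~ ns(x, knots = c(2.0, 4.0)), data = mydat)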
Restricted cubic spline with 4 knots.

## Linear Regression Model
##
## ols(formula = y ~ rcs(x, 4), data = mydat)
##
##                 Model Likelihood    Discrimination
##                    Ratio Test          Indexes
## Obs     100    LR chi2     27.41    R2       0.240
## sigma 0.9580   d.f.            3    R2 adj   0.216
## d.f.     96    Pr(> chi2) 0.0000    g        0.612
##
## Residuals
##
##      Min       1Q   Median       3Q      Max
## -2.79219 -0.71792 -0.04407  0.60252  2.18081
##
##           Coef     S.E.   t     Pr(>|t|)
## Intercept  0.0548  0.3191  0.17 0.8639
## x          0.8783  0.2599  3.38 0.0011
## x'        -3.7879  0.8066 -4.70 <0.0001
## x''       12.0017  2.3426  5.12 <0.0001

##                 Analysis of Variance          Response: y
##
##  Factor      d.f. Partial SS   MS          F     P
##  x             3   27.78090     9.2603004  10.09 <.0001
##   Nonlinear    2   27.62898    13.8144909  15.05 <.0001
##  REGRESSION    3   27.78090     9.2603004  10.09 <.0001
##  ERROR        96   88.10211     0.9177303

[Figures: normal Q-Q plot of the residuals; residuals (e) versus x; residuals versus fitted values (yhat); fitted restricted cubic spline overlaid on the scatterplot of y versus x]

Splines: Choosing the Number of Knots

- The more knots used, the more flexible the spline.
- In general, 3, 4 or 5 knots tend to be sufficient. Four is probably the best compromise; use 3 if the sample size is small and 5 if the sample size is large and you need a bit of extra flexibility.
- Too many knots will start to fit the noise and the curve becomes too wiggly. The graph below shows what happens if you fit 10 knots to the example data.

[Figure: 10-knot restricted cubic spline overlaid on the scatterplot of y versus x]

Splines: Interpretation

- The existence of a non-linear relationship means that the effect size cannot be expressed with a single number.
- A graph is the best way to visualize and understand the relationship as a whole.
- Numerical summaries of the effect can be computed over different intervals of the X variable.
- The graph will often provide useful direction for selecting a few key intervals in which to compute numeric summaries.
- A real example illustrates this.

load("tipp1.RData")
t.dd <- datadist(tipp1)
options(datadist = "t.dd")
t.fit <- ols(debirthw ~ degender + rcs(degesage, 4) + mage, data = tipp1)
t.fit

## Linear Regression Model
##
## ols(formula = debirthw ~ degender + rcs(degesage, 4) + mage,
##     data = tipp1)
##
##                   Model Likelihood    Discrimination
##                      Ratio Test          Indexes
## Obs      1174    LR chi2    447.76    R2       0.317
## sigma 107.7510   d.f.            5    R2 adj   0.314
## d.f.     1168    Pr(> chi2) 0.0000    g       81.807
##
## Residuals
##
##      Min        1Q    Median        3Q       Max
## -353.127   -63.519     7.551    78.557   280.463
##
##                Coef      S.E.     t     Pr(>|t|)
## Intercept      -700.6421 206.3974 -3.39 0.0007
## degender=Male    26.3054   6.3406  4.15 <0.0001
## degesage         56.7278   8.6145  6.59 <0.0001
## degesage'        24.8321  29.0353  0.86 0.3926
## degesage''     -286.6153 114.2520 -2.51 0.0123
## mage              0.5169   0.5013  1.03 0.3027

anova(t.fit)

##                 Analysis of Variance          Response: debirthw
##
##  Factor      d.f. Partial SS     MS           F      P
##  degender       1    199834.45    199834.45   17.21  <.0001
##  degesage       3   6222213.92   2074071.31  178.64  <.0001
##   Nonlinear     2   1247050.00    623525.00   53.70  <.0001
##  mage           1     12345.42     12345.42    1.06  0.3027
##  REGRESSION     5   6296709.46   1259341.89  108.47  <.0001
##  ERROR       1168  13560798.03     11610.27

# Default summary() result.
# Probably not the most useful
summary(t.fit)

##              Effects              Response : debirthw
##
##  Factor                 Low High  Diff. Effect   S.E.   Lower 0.95 Upper 0.95
##  degesage               25  27.00 2.00   97.6270 7.8104  82.3030   112.950
##  mage                   24  33.75 9.75    5.0396 4.8872  -4.5492    14.628
##  degender - Female:Male  2   1.00   NA  -26.3050 6.3406 -38.7460   -13.865

plot(Predict(t.fit, degesage))

[Figure: predicted debirthw versus gestational age (degesage, 24 to 30 weeks); adjusted to degender=Male, mage=28.5]

summary(t.fit, degesage = c(24, 26))

##              Effects              Response : debirthw
##
##  Factor                 Low High  Diff. Effect   S.E.   Lower 0.95 Upper 0.95
##  degesage               24  26.00 2.00  123.4300 6.3456 110.9800   135.880
##  mage                   24  33.75 9.75    5.0396 4.8872  -4.5492    14.628
##  degender - Female:Male  2   1.00   NA  -26.3050 6.3406 -38.7460   -13.865

summary(t.fit, degesage = c(28, 30))

##              Effects              Response : debirthw
##
##  Factor                 Low High  Diff. Effect   S.E.   Lower 0.95 Upper 0.95
##  degesage               28  30.00 2.00   -1.1608 8.2892 -17.4240    15.102
##  mage                   24  33.75 9.75    5.0396 4.8872  -4.5492    14.628
##  degender - Female:Male  2   1.00   NA  -26.3050 6.3406 -38.7460   -13.865
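The interval-specific summaries above could also be obtained with the contrast() function in rms. This is a sketch of that alternative, assuming the t.fit object and the t.dd datadist from above; the specific calls reflect standard contrast() usage rather than anything shown in the slides.

# A sketch of interval-specific effects via contrast() from rms; predictors not
# named in the lists are held at their datadist reference values (and, in a
# linear model, cancel out of the difference anyway).
contrast(t.fit, list(degesage = 26), list(degesage = 24))  # 24 -> 26 weeks
contrast(t.fit, list(degesage = 30), list(degesage = 28))  # 28 -> 30 weeks

# The 2-d.f. "Nonlinear" row of anova(t.fit) is the formal test of linearity
# in gestational age described earlier.
anova(t.fit)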
Understanding Categorical Variables

- It is generally inappropriate to code a categorical variable as 1, 2, etc.
- Instead, a categorical variable is represented in a model by a set of dummy variables.
- One way to do this (also the default in R) is to select one level as a reference level and code the dummy variables for the remaining levels so that they are interpreted as effects relative to the reference level. There will be one fewer dummy variable than categories.
- Since the set of dummy variables collectively represents the effect of the categorical variable, it is also generally inappropriate to interpret p-values on the individual dummy variables.

For example:

contrasts(tipp1$demethni)

##                          Afro-American or African Asian Other
## Caucasian                                       0     0     0
## Afro-American or African                        1     0     0
## Asian                                           0     1     0
## Other                                           0     0     1

shows that the default coding for the ethnicity variable would choose Caucasian as the reference category and would create 3 dummy variables for the remaining categories, each interpreted relative to Caucasian.

fit.eth <- ols(debirthw ~ demethni, data = tipp1)
fit.eth

## Linear Regression Model
##
## ols(formula = debirthw ~ demethni, data = tipp1)
##
##                  Model Likelihood    Discrimination
##                     Ratio Test          Indexes
## Obs      1174   LR chi2     11.79    R2       0.010
## sigma 129.6251  d.f.            3    R2 adj   0.007
## d.f.     1170   Pr(> chi2) 0.0081    g       11.931
##
## Residuals
##
##     Min      1Q  Median      3Q     Max
## -288.51  -98.51    6.49  109.57  243.60
##
##                                    Coef     S.E.    t      Pr(>|t|)
## Intercept                          788.5099  4.5546 173.13 <0.0001
## demethni=Afro-American or African  -34.1074 11.2437  -3.03  0.0025
## demethni=Asian                      11.4901 15.8404   0.73  0.4684
## demethni=Other                     -18.6993 12.1670  -1.54  0.1246

anova(fit.eth)

##                 Analysis of Variance          Response: debirthw
##
##  Factor     d.f. Partial SS    MS        F    P
##  demethni      3    198392.6   66130.86  3.94 0.0083
##  REGRESSION    3    198392.6   66130.86  3.94 0.0083
##  ERROR      1170  19659114.9   16802.66

summary(fit.eth)  # Caucasian as reference

##              Effects              Response : debirthw
##
##  Factor                                        Low High Diff. Effect  S.E.   Lower 0.95 Upper 0.95
##  demethni - Afro-American or African:Caucasian  1   2    NA   -34.107 11.244 -56.167    -12.0470
##  demethni - Asian:Caucasian                     1   3    NA    11.490 15.840 -19.589     42.5690
##  demethni - Other:Caucasian                     1   4    NA   -18.699 12.167 -42.571      5.1724

summary(fit.eth, demethni = "Asian")  # Asian as reference

##              Effects              Response : debirthw
##
##  Factor                                     Low High Diff. Effect  S.E.   Lower 0.95 Upper 0.95
##  demethni - Caucasian:Asian                  3   1    NA   -11.490 15.840 -42.569     19.5890
##  demethni - Afro-American or African:Asian   3   2    NA   -45.597 18.326 -81.553     -9.6415
##  demethni - Other:Asian                      3   4    NA   -30.189 18.907 -67.284      6.9056
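As a final note on reference levels, the dummy coding itself can be changed rather than relying on summary() to switch the comparison group. The sketch below assumes demethni is stored as a factor in tipp1; relevel() is base R, and the variable demethni2 and the refit are hypothetical, created only for illustration.

# A sketch of changing the reference level directly, assuming demethni is a
# factor; demethni2 is a hypothetical copy with Asian as the reference level.
tipp1$demethni2 <- relevel(tipp1$demethni, ref = "Asian")
contrasts(tipp1$demethni2)   # dummy variables now contrast each group with Asian

fit.eth2 <- ols(debirthw ~ demethni2, data = tipp1)
fit.eth2        # individual coefficients are now relative to Asian
anova(fit.eth2) # the overall 3-d.f. test is unchanged by the choice of reference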