STATS 330: Lecture 10 4/8/2015 330 lecture 10 1 Diagnostics 2 Aim of today’s lecture To describe some more remedies for non-planar data To look at diagnostics and remedies for non-constant scatter. 4/8/2015 330 lecture 10 2 Remedies for non-planar data (cont) Last time we looked at diagnostics for nonplanar data We discussed what to do if the diagnostics indicate a problem. The short answer was: we transform, so that the model fits the transformed data. How to choose a transformation? • Theory • Ladder of powers • Polynomials We illustrate with a few examples 4/8/2015 330 lecture 10 3 Example: Using theory cherry trees A tree trunk is a bit like a cylinder Volume = p (diameter/2)2 height Log volume = log(p/4) + 2 log(diameter) + log(height) so a linear regression using the logged variables should work! In fact R2 increases from 95% to 98%, and residual plots are better 4/8/2015 330 lecture 10 4 Example: cherry trees (cont) > new.reg<- lm(log(volume)~log(diameter)+log(height), data=cherry.df) > summary(new.reg) Call: lm(formula = log(volume) ~ log(diameter) + log(height), data = cherry.df) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -6.63162 0.79979 -8.292 5.06e-09 *** Previously log(diameter) 1.98265 0.07501 26.432 < 2e-16 *** 94.8% log(height) 1.11712 0.20444 5.464 7.81e-06 *** --Residual standard error: 0.08139 on 28 degrees of freedom Multiple R-Squared: 0.9777, Adjusted R-squared: 0.9761 F-statistic: 613.2 on 2 and 28 DF, p-value: < 2.2e-16 4/8/2015 330 lecture 10 5 Example: cherry trees (original) 10 Residuals vs Fitted 31 0 -5 Residuals 5 2 18 10 20 30 40 50 60 70 Fitted values lm(formula = volume ~ diameter + height, data = cherry.df) 4/8/2015 330 lecture 10 6 Example: cherry trees (logs) 0.00 -0.05 -0.15 -0.10 Residuals 0.05 0.10 0.15 Residuals vs Fitted 16 18 -0.20 15 2.5 3.0 3.5 4.0 Fitted values lm(formula = log(volume) ~ log(diameter) + log(height), data = cherry.df) 4/8/2015 330 lecture 10 7 Tyre abrasion data: gam plots 100 50 -100 -100 -50 0 s(tensile,5.42) 50 0 -50 s(hardness,1) 100 150 150 rubber.gam = gam(abloss~s(hardness)+s(tensile), data=rubber.df) par(mfrow=c(1,2)); plot(rubber.gam) 50 60 70 80 90 120 hardness 4/8/2015 140 160 180 200 220 240 tensile 330 lecture 10 8 Tyre abrasion data: polynomial GAM curve is like a polynomial, so fit a polynomial (ie include terms tensile2, tensile3,…) lm(abloss~hardness+poly(tensile,4), data=rubber.df) Degree of polynomial Usually a lot of trial and error involved! We have succeeded when • R2 improves • Residual plots show no pattern 4th deg polynomial works for the rubber data: R2 increases from 84% to 94% 4/8/2015 330 lecture 10 9 Why th 4 degree? > rubber.lm = lm(abloss~poly(tensile,5)+hardness, data=rubber.df) > summary(rubber.lm) Try 5th degree Coefficients: (Intercept) poly(tensile, poly(tensile, poly(tensile, poly(tensile, poly(tensile, hardness 5)1 5)2 5)3 5)4 5)5 Estimate Std. Error t value Pr(>|t|) 615.3617 29.8178 20.637 2.44e-16 *** -264.3933 25.0612 -10.550 2.76e-10 *** 23.6148 25.3437 0.932 0.361129 119.9500 24.6356 4.869 6.46e-05 *** -91.6951 23.6920 -3.870 0.000776 *** 9.3811 23.6684 0.396 0.695495 -6.2608 0.4199 -14.911 2.59e-13 *** Highest significant power Residual standard error: 23.67 on 23 degrees of freedom Multiple R-squared: 0.9427, Adjusted R-squared: 0.9278 F-statistic: 63.11 on 6 and 23 DF, p-value: 3.931e-13 4/8/2015 330 lecture 10 10 Ladder of powers Rather than fit polynomials in some independent variables, guided by gam plots, we can transform the response using the “ladder of powers” (i.e. use yp as the response rather than y for some power p) Choose p either by trial and error using R2 or use a “Box-Cox plot – see later in this lecture 4/8/2015 330 lecture 10 11 Checking for equal scatter The model specifies that the scatter about the regression plane is uniform In practice this means that the scatter doesn’t depend on the explanatory variables or the mean of the response All tests, confidence intervals rely on this 4/8/2015 330 lecture 10 12 Scatter Scatter is measured by the size of the residuals A common problem is where the scatter increases as the mean response increases This is means the big residuals happen when the fitted values are big Recognize this by a “funnel effect” in the residuals versus fitted value plot 4/8/2015 330 lecture 10 13 Example: Education expenditure data Data for the 50 states of the USA Variables are • Per capita expenditure on education (response), variable educ • Per capita Income, variable percap • Number of residents per 1000 under 18, variable under18 • Number of residents per 1000 in urban areas, variable urban • Fit model educ~ percap+under18+urban 4/8/2015 330 lecture 10 14 300 320 340 360 380 300 400 500 600 700 800 900 200 250 300 350 400 450 500 550 500 urban Outlier! 400 response (response) 3500 4000 4500 5000 5500 200 300 educ 340 360 380 percap 300 320 under18 300 4/8/2015 400 500 600 700 800 900 3500 4000 4500 330 lecture 10 5000 5500 15 Outlier, pt 50 (California) 4/8/2015 330 lecture 10 16 Basic fit, outlier in > educ.lm = lm(educ~urban + percap + under18, data=educ.df) > summary(educ.lm) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -555.92562 123.46634 -4.503 4.56e-05 *** urban -0.00476 0.05174 -0.092 0.927 percap 0.07236 0.01165 6.211 1.40e-07 *** under18 1.55134 0.31545 4.918 1.16e-05 *** Residual standard error: 40.53 on 46 degrees of freedom Multiple R-squared: 0.5902, Adjusted R-squared: 0.5634 F-statistic: 22.08 on 3 and 46 DF, p-value: 5.271e-09 R2 is 59% 4/8/2015 330 lecture 10 17 Basic fit, outlier out > educ50.lm = lm(educ~urban + percap + under18, data=educ.df, subset=-50) > summary(educ50.lm) See how we exclude pt 50 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -278.06430 132.61422 -2.097 0.041664 * urban 0.06624 0.04966 1.334 0.188948 percap 0.04827 0.01220 3.958 0.000266 *** under18 0.88983 0.33159 2.684 0.010157 * --Residual standard error: 35.88 on 45 degrees of freedom Multiple R-squared: 0.4947, Adjusted R-squared: 0.461 F-statistic: 14.68 on 3 and 45 DF, p-value: 8.365e-07 R2 is now 49% 4/8/2015 330 lecture 10 18 > par(mfrow=c(1,2)) > plot(educ50.lm, which = c(1,3)) Increasing relationship Scale-Location 1.5 Residuals vs Fitted 15 Standardized residuals 50 0.0 Residuals 0 -50 -100 10 1.0 7 7 220 240 260 280 300 320 340 220 240 260 280 300 320 340 Fitted values Fitted values 4/8/2015 1510 0.5 100 Funnel effect 330 lecture 10 19 Remedies Either Transform the response Or Estimate the variances of the observations and use “weighted least squares” 4/8/2015 330 lecture 10 20 Transforming the response > tr.educ50.lm <- lm(I(1/educ)~urban + percap + under18,data=educ.df[-50,]) > plot(tr.educ50.lm) Transform to reciprocal Residuals vs Fitted 0e+00 -5e-04 Residuals Better! 5e-04 1e-03 10 47 -1e-03 15 0.0025 0.0030 0.0035 0.0040 0.0045 Fitted values lm(I(1/educ) ~ urban + percap + under18) 4/8/2015 330 lecture 10 21 What power to choose? How did we know to use reciprocals? Think of a more general model I(educ^p)~percap + under18 + urban where p is some power Then estimate p from the data using a BoxCox plot 4/8/2015 330 lecture 10 22 Transforming the response (how?) boxcoxplot(educ~urban + percap + under18, educ.df[-50,]) Draws “Box-Cox plot” A “R330” function 270 268 269 Min at about -1 266 267 Profile likelihood 271 272 Box-Cox plot -2 -1 0 1 2 p 4/8/2015 330 lecture 10 23 Weighted least squares Tests are invalid if observations do not have constant variance If the ith observation has variance vis2, then we can get a valid test by using “weighted least squares”, minimising the sum of the weighted squared residuals Sri2/vi rather than the sum of squared residuals Sri2 Need to know the variances vi 4/8/2015 330 lecture 10 24 Finding the weights Step 1: Plot the squared residuals versus the fitted values Step 2: Smooth the plot Step 3: Estimate the variance of an observation by the smoothed squared residual Step 4: weight is reciprocal of smoothed squared residual Rationale: variance is a function of the mean Use “R330” function funnel 4/8/2015 330 lecture 10 25 Doing it in R (1) vars = funnel(educ50.lm)# a“R330” function 6000 4000 0 2000 Squared residuals 3.6 3.4 3.2 3.0 Log std. errors 3.8 Slope 1.7, indicates p=1-1.7=-0.7 5.5 5.6 5.7 5.8 Log means 4/8/2015 220 260 300 340 Fitted values 330 lecture 10 26 > educ50.lm<-lm(educ~urban+percap+under18data=educ.df[-50,]) > summary(educ50.lm) Note pvalues Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -278.06430 132.61422 -2.097 0.041664 * urban 0.06624 0.04966 1.334 0.188948 percap 0.04827 0.01220 3.958 0.000266 *** under18 0.88983 0.33159 2.684 0.010157 * --Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Residual standard error: 35.88 on 45 degrees of freedom Multiple R-Squared: 0.4947, Adjusted R-squared: 0.461 F-statistic: 14.68 on 3 and 45 DF, p-value: 8.365e-07 4/8/2015 330 lecture 10 27 > weighted.lm<-lm(educ~urban+percap+under18, weights=1/vars, data=educ.df[-50,]) Note > summary(weighted.lm) changes! Note reciprocals! Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -270.29363 102.61073 -2.634 0.0115 * urban 0.01197 0.04030 0.297 0.7677 percap 0.05850 0.01027 5.694 8.88e-07 *** under18 0.82384 0.27234 3.025 0.0041 ** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.019 on 45 degrees of freedom Multiple R-squared: 0.629, Adjusted R-squared: 0.6043 F-statistic: 25.43 on 3 and 45 DF, p-value: 8.944e-10 Conclusion: unequal variances matter! Can change results! 4/8/2015 330 lecture 10 28