STATS 330: Lecture 11 (4/7/2015)

Outliers and high-leverage points

An outlier is a point that has a larger or smaller y value than the model would suggest
• Can be due to a genuine large error e
• Can be caused by typographical errors in recording the data
A high-leverage point is a point with extreme values of the explanatory variables

Outliers

The effect of an outlier depends on whether it is also a high-leverage point.
A "high-leverage" outlier
• Can attract the fitted plane, distorting the fit, sometimes extremely
• In extreme cases may not have a big residual
• In extreme cases can increase R²
A "low-leverage" outlier
• Does not distort the fit to the same extent
• Usually has a big residual
• Inflates standard errors, decreases R²

[Figure: four plots of y versus x illustrating the cases: no outliers and no high-leverage points; a low-leverage outlier (big residual); a high-leverage point that is not an outlier; a high-leverage outlier.]

Example: the education data (ignoring urban)

[Figure: 3-D scatterplot of educ against percap and under18; one high-leverage point stands out in the x-space.]

[Figure: residuals(educ.lm) plotted against under18, percap and predict(educ.lm); case 50 has a somewhat extreme residual. An outlier also?]

Measuring leverage

It can be shown (see e.g. STATS 310) that the fitted values are related to the response data y_1, ..., y_n by the equation

    ŷ = Hy, where H = X(X'X)^(-1)X',

so that

    ŷ_i = h_i1 y_1 + ... + h_ii y_i + ... + h_in y_n

The h_ij depend on the explanatory variables. The quantities h_ii are called "hat matrix diagonals" (HMDs) and measure the influence y_i has on the ith fitted value. They can also be interpreted as the distance between the x-data for the ith case and the average x-data for all the cases. Thus, they directly measure how extreme the x-values of each point are.

Interpreting the HMDs

Each HMD lies between 0 and 1
The average HMD is p/n
• (p = number of regression coefficients, p = k + 1)
An HMD of more than 3p/n is considered extreme

Example: the education data

educ.lm <- lm(educ ~ percap + under18, data=educ.df)
> hatvalues(educ.lm)[50]
       50 
0.3428523 

Here n = 50 and p = 3, so 3p/n = 9/50 = 0.18: case 50 is clearly extreme!

Studentized residuals

How can we recognize a big residual? How big is big? The actual size depends on the units in which the y-variable is measured, so we need to standardize: we divide each residual by its standard deviation.
The variance of a typical residual e is var(e) = (1 - h)σ², where h is the hat matrix diagonal for the point.

Studentized residuals (2)

"Internally studentised" (called "standardised" in R):

    e_i / (s √(1 - h_i)),    where s² is the usual estimate of σ²

"Externally studentised" (called "studentised" in R):

    e_i / (s_(i) √(1 - h_i)),    where s_(i)² is the estimate of σ² after deleting the ith data point

Studentized residuals (3)

How big is big? Both types of studentised residual are approximately distributed as standard normals when the model is OK and there are no outliers (in fact the externally studentised one has a t-distribution). Thus, studentised residuals should be between -2 and 2 with approximately 95% probability.
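These quantities are easy to check numerically. A minimal sketch in R on synthetic (invented) data, computing the HMDs from H = X(X'X)^(-1)X' and the internally studentised residuals from e_i / (s √(1 - h_i)), then comparing them with R's built-ins:

```r
# Sketch on synthetic (invented) data: hat matrix diagonals and
# internally studentised residuals computed from first principles,
# then checked against R's built-in functions.
set.seed(1)
n <- 20
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

X <- model.matrix(fit)                  # n x p design matrix
H <- X %*% solve(t(X) %*% X) %*% t(X)   # H = X (X'X)^(-1) X'
h <- diag(H)                            # the HMDs

s     <- summary(fit)$sigma             # usual estimate of sigma
r.int <- residuals(fit) / (s * sqrt(1 - h))  # internally studentised

all.equal(unname(h), unname(hatvalues(fit)))      # agrees
all.equal(unname(r.int), unname(rstandard(fit)))  # agrees
sum(h)                                  # equals p, so the average HMD is p/n
```

In base R, hatvalues() gives the HMDs, rstandard() the internally studentised residuals and rstudent() the externally studentised ones.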
Studentized residuals (4)

Calculating in R:

library(MASS)       # load the MASS library
stdres(educ.lm)     # internally studentised ("standardised" in R)
studres(educ.lm)    # externally studentised ("studentised" in R)

> stdres(educ.lm)[50]
      50 
3.275808 
> studres(educ.lm)[50]
      50 
3.700221 

What does studentised mean?

Recognizing outliers

If a point is a low-leverage outlier, the residual will usually be large, so a large residual together with a low HMD indicates an outlier.
If a point is a high-leverage outlier, a large error will usually cause a large residual. However, in extreme cases a high-leverage outlier may not have a very big residual, depending on how much the point attracts the fitted plane. Thus, if a point has a large HMD and the residual is not particularly big, we can't always tell whether the point is an outlier or not.

High-leverage outlier

[Figure: plot of y versus x for Example 5, showing the fitted line and the true line, next to its residuals-vs-fitted plot. The high-leverage outlier has a small residual in both.]

Leverage-residual plot

We can plot the standardised residuals against the leverages (HMDs): the leverage-residual plot (LR plot).

plot(educ.lm, which=5)

[Figure: "Residuals vs Leverage" plot for lm(educ ~ percap + under18). Point 50 has high leverage and a big residual: it is an outlier.]

Interpreting LR plots

[Figure: schematic LR plot of standardised residual against leverage. Points with residuals beyond ±2 and leverage below 3p/n are low-leverage outliers; points beyond ±2 with leverage above 3p/n are high-leverage outliers; points within ±2 but with leverage above 3p/n are possible high-leverage outliers; everything else is OK.]

Residuals and HMD's

[Figure: Example 1. Plot of y versus x with the fitted and true lines, and the corresponding LR plot.]
No big studentized residuals, no big HMDs (3p/n = 0.2 for this example).

Residuals and HMD's (2)

[Figure: Example 2. Plot of y versus x and the LR plot; point 24 stands out.]
One big studentized residual (point 24), no big HMDs (3p/n = 0.2 for this example). The line moves a bit.

Residuals and HMD's (3)

[Figure: Example 3. Plot of y versus x and the LR plot; point 1 stands out.]
No big studentized residual, one big HMD (point 1; 3p/n = 0.2 for this example). The line hardly moves: point 1 is high leverage but not influential.
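The contrast in these examples can be reproduced in a few lines of R. A sketch on synthetic (invented) data, planting one point with an extreme x-value well off the true line:

```r
# Sketch on synthetic (invented) data: a single high-leverage outlier
# attracts the fitted line, so the slope is badly distorted while the
# point's own raw residual stays much smaller than its distance from
# the true line y = 1 + 2x.
set.seed(2)
x <- runif(30)
y <- 1 + 2 * x + rnorm(30, sd = 0.2)
x[1] <- 3; y[1] <- 1           # true line would give y = 7 at x = 3

fit  <- lm(y ~ x)              # all 30 points
fit2 <- lm(y[-1] ~ x[-1])      # refit with the extreme point deleted

hatvalues(fit)[1]              # well above 3p/n = 3 * 2/30 = 0.2
coef(fit)["x"]                 # slope pulled far below the true value 2
coef(fit2)[2]                  # close to 2 once the point is removed
residuals(fit)[1]              # far smaller in size than the raw gap 1 - 7 = -6
```

Comparing fit with fit2 is exactly the "delete the point and refit" check: the big change in the slope shows the point is influential.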
Residuals and HMD's (4)

[Figure: Example 4. Plot of y versus x and the LR plot; point 1 stands out in both.]
One big studentized residual and one big HMD, both for point 1 (3p/n = 0.2 for this example). The line moves and the residual is large: point 1 is influential.

Residuals and HMD's (5)

[Figure: Example 5. Plot of y versus x and the LR plot; point 1 stands out.]
No big studentized residuals, but one big HMD (3p/n = 0.2 for this example). Point 1 is high leverage and influential.

Influential points

How can we tell whether a high-leverage point or outlier is affecting the regression? By deleting the point and refitting the regression: a large change in the coefficients means the point is affecting the regression.
Such points are called influential points.
We don't want the analysis to be driven by one or two points.

"Leave-one-out" measures

We can calculate a variety of measures by leaving out each data point in turn and looking at the change in key regression quantities, such as
• Coefficients
• Fitted values
• Standard errors
We discuss each in turn.

Example: education data

            With point 50    Without point 50
Const       -557             -278
percap      0.071            0.059
under18     1.555            0.932

Standardized difference in coefficients: DFBETAS

Formula:

    (β̂_j - β̂_j(i)) / s.e.(β̂_j)

where β̂_j(i) is the estimate of β_j with the ith point deleted. Problem when: greater than 1 in absolute value. This is the criterion coded into R.

Standardized difference in fitted values: DFFITS

Formula:

    (ŷ_i - ŷ_i(i)) / s.e.(ŷ_i)

where ŷ_i(i) is the fitted value with the ith point deleted. Problem when: greater than 3√(p/(n - p)) in absolute value (p = number of regression coefficients).
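Both measures are available as built-ins. A sketch on synthetic (invented) data, including a check of the standard identity relating DFFITS to the externally studentised residual and the leverage:

```r
# Sketch on synthetic (invented) data: R's leave-one-out diagnostics
# and the rule-of-thumb cut-offs.
set.seed(3)
n <- 40
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
p   <- length(coef(fit))       # p = 3 coefficients

db <- dfbetas(fit)             # n x p matrix of standardised coefficient changes
dd <- dffits(fit)              # standardised changes in the fitted values

which(abs(db) > 1, arr.ind = TRUE)       # DFBETAS rule: |value| > 1
which(abs(dd) > 3 * sqrt(p / (n - p)))   # DFFITS rule

# DFFITS equals t_i * sqrt(h_i / (1 - h_i)), where t_i is the externally
# studentised residual and h_i the hat matrix diagonal
h <- hatvalues(fit)
all.equal(unname(dd), unname(rstudent(fit) * sqrt(h / (1 - h))))  # agrees
```

With clean simulated data like this, few or no cases should breach either cut-off.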
COVRATIO and Cook's D

• COVRATIO measures the change in the standard errors of the estimated coefficients. Problem indicated when: COVRATIO is more than 1 + 3p/n or less than 1 - 3p/n.
• Cook's D measures the overall change in the coefficients. Problem when: more than qf(0.50, p, n - p) (the lower 50% point of the F distribution), roughly 1 in most cases.

Calculations

> influence.measures(educ.lm)
Influence measures of
  lm(formula = educ ~ under18 + percap, data = educ.df)

     dfb.1_  dfb.un18 dfb.prcp   dffit cov.r   cook.d    hat inf
10  0.06381 -0.02222 -0.16792 -0.3631 0.803 4.05e-02 0.0257   *
44  0.02289 -0.02948  0.00298 -0.0340 1.283 3.94e-04 0.1690   *
50 -2.36876  2.23393  1.50181  2.4733 0.821 1.66e+00 0.3429   *

Here p = 3 and n = 50, so 3p/n = 0.18, 3√(p/(n - p)) = 0.758, and qf(0.5, 3, 47) = 0.8002294.

Plotting influence

# set up plot window with 2 x 4 array of plots
par(mfrow=c(2,4))
# plot dfbetas, dffits, cov ratio, Cook's D, HMD's
influenceplots(educ.lm)

(influenceplots is supplied with the course software; it is not part of base R.)

[Figure: 2 x 4 array of index plots against observation number: dfb.1_, dfb.un18, dfb.prcp, DFFITS, |COV RATIO - 1|, Cook's D and the hat values. Case 50 stands out in every panel.]

Remedies for outliers

Correct typographical errors in the data.
Delete a small number of points and refit (we don't want the fitted regression to be determined by one or two influential points).
Report the existence of outliers separately: they are often of scientific interest.
Don't delete too many points (1 or 2 at most).

Summary: doing it in R

LR plot: plot(educ.lm, which=5)
Full diagnostic display: plot(educ.lm)
Influence measures: influence.measures(educ.lm)
Plots of influence measures:
par(mfrow=c(2,4))
influenceplots(educ.lm)

HMD summary

Hat matrix diagonals:
• Measure the effect of a point on its fitted value
• Measure how outlying the x-values are (how "high-leverage" a point is)
• Are always between 0 and 1, with bigger values indicating higher leverage
• Points with HMDs of more than 3p/n are considered "high leverage"
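The Cook's D cut-off can be tried out on simulated data. A sketch on synthetic (invented) data, using only base R:

```r
# Sketch on synthetic (invented) data: Cook's distance flags a planted
# high-leverage outlier, using the qf(0.5, p, n - p) cut-off.
set.seed(4)
n <- 50
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
x[n] <- 5; y[n] <- 0           # extreme x, far off the true line

fit <- lm(y ~ x)
p   <- length(coef(fit))       # p = 2 coefficients

cd     <- cooks.distance(fit)
cutoff <- qf(0.5, p, n - p)    # about 0.70 here; roughly 1 for larger p

which(cd > cutoff)             # flags the planted case, number 50
```

The same information appears in the cook.d column of influence.measures(fit).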