732A35/732G28

How to:
• Identify outliers
• Influential observations

To identify outliers:
• Scatterplot X or Y
• Boxplot X or Y
• Residuals vs X
• Residuals vs Y
Better tools?

Why bother about outliers?
• May indicate errors in the data
• May strongly affect the regression parameters → wrong conclusions

Types of outliers:
• Outlying Y (1)
• Outlying X (2)
• Both (3)

[Figure: scatterplot of Y against X with cases 1, 2 and 3 marked]

Comment on which are influential.

We studied before:
• Plots with residuals
• Plots with semistudentized residuals (assume the variances are the same, MSE)

However, the variance may differ between residuals.
Estimated standard deviation:

$s(e_i) = \sqrt{MSE\,(1 - h_{ii})}$

where $h_{ii}$ is the i-th diagonal element of $H = X(X'X)^{-1}X'$

Studentized residuals

$r_i = \frac{e_i}{s(e_i)} = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}$

Rule of thumb: if $|r_i| > 2$, the residual is large.
Studentized residuals = Standardized residuals
In Minitab, cases with $|r_i| > 2$ are labeled with an 'R' in the table of unusual observations.

Price (Y) and size (X) for a sample of 150 houses.

Case   Y (Price)   X (Square_Footage)
1      121.87      20.5
2      150.25      22
3      122.78      15.9
4      144.35      18.6
5      116.2       12.1
6      139.49      17.1
7      115.73      16.7
8      140.59      17.8
9      120.29      15.2
. . .
150    114.92      14.2

[Figure: scatterplot of Y (Price) against X (Square footage)]

Deleted residuals
Idea: an outlier may affect the regression coefficients → remove observation i and compute the deleted residual (for each i):

$d_i = Y_i - \hat{Y}_{i(i)}$

which is equivalent to

$d_i = \frac{e_i}{1 - h_{ii}}$

Compute the deleted residuals and search for outliers as before.

Studentized deleted residuals
Idea: studentize the deleted residuals to produce residuals with equal variance:

$t_i = e_i \left[ \frac{n - p - 1}{SSE\,(1 - h_{ii}) - e_i^2} \right]^{1/2}$

Further:
• Examine large deviations
• Test if the observation with the largest $|t_i|$ is an outlier:
  $H_0$: the observation with the largest $|t_i|$ is not an outlier
  $H_a$: the observation with the largest $|t_i|$ is an outlier
  Test statistic $T = t_i$, critical value $t(1 - \alpha/(2n);\, n - p - 1)$, two-sided

Smoother matrix $H = X(X'X)^{-1}X'$
Comments:
• $0 \le h_{ii} \le 1$
• $\sum_{i=1}^{n} h_{ii} = p$
• $h_{ii}$ (leverage) is a measure of the distance between $X_i$ and mean(X) → a measure for detecting outliers!
• $\hat{Y} = HY$: $h_{ii}$ defines the contribution of the i-th observation to the i-th fit
• $s(e_i) = \sqrt{MSE\,(1 - h_{ii})}$: large $h_{ii}$ → small $s(e_i)$ → the fitted $\hat{Y}_i$ is close to $Y_i$

To find outliers,
1. Look for $h_{ii}$ larger than 2 × mean leverage: $h_{ii} > \frac{2p}{n}$
2. Alternative: 0.2-0.5 is moderate leverage, > 0.5 is high leverage
Ex. Define the threshold for our data.

Hidden extrapolations
When predicting Y for a new X, is it within the domain?
◦ Easy in two dimensions
In general, for a new observation $X_N = (1, x_1^{new}, \dots, x_{p-1}^{new})'$, compute

$h_{NN} = X_N'(X'X)^{-1}X_N$

Check if $h_{NN}$ is much larger than the other leverages.

Influential = if added, the case changes the fitted model considerably

Influence on a single fitted value - DFFITS
Measure of the influence that case i has on the fitted value $\hat{Y}_i$.
Defined as

$(DFFITS)_i = \frac{\hat{Y}_i - \hat{Y}_{i(i)}}{\sqrt{MSE_{(i)}\, h_{ii}}}$

Computed as ($t_i$ = studentized deleted residual)

$(DFFITS)_i = t_i \left( \frac{h_{ii}}{1 - h_{ii}} \right)^{1/2}$

To find influential cases,
1. Small and medium data sets: $|DFFITS| > 1$
2. Large data sets: $|DFFITS| > 2\sqrt{p/n}$
Ex. Define the threshold for our data.
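As a rough illustration of the case diagnostics covered so far (leverage, studentized residuals, deleted residuals, studentized deleted residuals and DFFITS), here is a minimal NumPy sketch. The data are simulated and purely hypothetical, not the house-price sample, and the function name `regression_diagnostics` is an illustrative choice rather than anything from the course material. The thresholds are the rules of thumb stated on the slides.

```python
import numpy as np

def regression_diagnostics(X, y):
    """Case diagnostics for a least-squares fit.

    X is the n x p design matrix (including the column of ones),
    y is the response vector of length n.
    """
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T       # H = X (X'X)^{-1} X'
    h = np.diag(H)                             # leverages h_ii
    e = y - H @ y                              # ordinary residuals e_i
    sse = float(e @ e)
    mse = sse / (n - p)
    r = e / np.sqrt(mse * (1.0 - h))           # studentized residuals r_i
    d = e / (1.0 - h)                          # deleted residuals d_i
    # studentized deleted residuals t_i, computed without refitting
    t = e * np.sqrt((n - p - 1) / (sse * (1.0 - h) - e**2))
    dffits = t * np.sqrt(h / (1.0 - h))        # DFFITS_i
    return {"h": h, "e": e, "r": r, "d": d, "t": t, "dffits": dffits}

# Hypothetical simulated data: a straight line plus noise
rng = np.random.default_rng(0)
n = 30
x = rng.uniform(0.0, 10.0, size=n)
y = 5.0 + 3.0 * x + rng.normal(0.0, 2.0, size=n)
x[0], y[0] = 25.0, 20.0                        # case 0 made outlying in both X and Y

X = np.column_stack([np.ones(n), x])
p = X.shape[1]
diag = regression_diagnostics(X, y)

high_leverage = np.where(diag["h"] > 2 * p / n)[0]        # h_ii > 2p/n
large_residual = np.where(np.abs(diag["r"]) > 2)[0]       # |r_i| > 2
influential = np.where(np.abs(diag["dffits"]) > 1)[0]     # |DFFITS| > 1 (small data set)

# Hidden extrapolation check for a hypothetical new point x = 20
x_new = np.array([1.0, 20.0])
h_new = float(x_new @ np.linalg.inv(X.T @ X) @ x_new)

print("high leverage cases:", high_leverage)
print("large studentized residuals:", large_residual)
print("influential by DFFITS:", influential)
print("leverage of new point:", h_new, "max in-sample leverage:", diag["h"].max())
```

The shortcut formulas avoid refitting the model n times; refitting without case i and comparing against `diag["d"][i]` is a simple way to convince yourself that they agree with the definitions.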
Influence on all fitted values - Cook's distance
Measures the influence of the i-th case on all fitted values.
Defined as

$D_i = \frac{\sum_{j=1}^{n} (\hat{Y}_j - \hat{Y}_{j(i)})^2}{p\,MSE}$

Computed as

$D_i = \frac{e_i^2}{p\,MSE} \left[ \frac{h_{ii}}{(1 - h_{ii})^2} \right]$

Look at how $h_{ii}$ and $e_i$ influence $D_i$.

Decision rule: if $D_i \ge F(0.5;\, p,\, n - p)$, the case is influential.
Ex. Define the threshold for our data.

Ch 10.2-10.4, 10.6
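To see how the residual and the leverage combine in Cook's distance, the following minimal NumPy sketch computes $D_i$ with the shortcut formula and checks it against the definition by refitting without case i. The data are simulated and hypothetical, and the function names `cooks_distance` and `cooks_distance_by_refit` are illustrative choices, not part of the course material.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for every case via D_i = e_i^2/(p MSE) * h_ii/(1-h_ii)^2."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)                              # leverages h_ii
    e = y - H @ y                               # residuals e_i
    mse = float(e @ e) / (n - p)
    return (e**2 / (p * mse)) * (h / (1.0 - h) ** 2)

def cooks_distance_by_refit(X, y, i):
    """Cook's distance for case i from the definition: refit without case i."""
    n, p = X.shape
    b_full = np.linalg.lstsq(X, y, rcond=None)[0]
    mse = float(np.sum((y - X @ b_full) ** 2)) / (n - p)
    b_del = np.linalg.lstsq(np.delete(X, i, axis=0), np.delete(y, i), rcond=None)[0]
    # sum over all n cases of (Yhat_j - Yhat_j(i))^2, divided by p * MSE
    return float(np.sum((X @ b_full - X @ b_del) ** 2)) / (p * mse)

# Hypothetical simulated data with one case outlying in both X and Y
rng = np.random.default_rng(1)
n = 30
x = rng.uniform(0.0, 10.0, size=n)
y = 5.0 + 3.0 * x + rng.normal(0.0, 2.0, size=n)
x[0], y[0] = 25.0, 20.0

X = np.column_stack([np.ones(n), x])
D = cooks_distance(X, y)
most_influential = int(np.argmax(D))

print("largest Cook's distance at case", most_influential, "with D =", D[most_influential])
# The cutoff F(0.5; p, n-p) is the median of the F distribution; if SciPy is
# available it can be obtained with scipy.stats.f.ppf(0.5, p, n - p).
```

A case needs both a sizeable residual and a sizeable leverage to get a large $D_i$: a high-leverage point lying close to the line fitted to the other cases, or a large residual at a typical X, contributes much less.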