1 732A35/732G28

How to:
  • Identify outliers
  • Identify influential observations

To identify outliers:
  • Scatterplot: X or Y
  • Boxplot: X or Y
  • Residuals vs X
  • Residuals vs Y
  • Better tools?

Why bother about outliers?
  • They may indicate errors in the data
  • They may affect the regression parameters strongly → wrong conclusions

Types of outliers:
  • Outlying Y (1)
  • Outlying X (2)
  • Both (3)

[Figure: scatterplot of Y (0 to 45) against X (0 to 12) with three outlying points labeled 1, 2 and 3]

Comment on which are influential.

We studied before:
  • Plots with residuals
  • Plots with semistudentized residuals (assuming all residuals have the same variance, estimated by MSE)

However, the variance may differ between residuals:
  • Estimated standard deviation: $s(e_i) = \sqrt{MSE\,(1 - h_{ii})}$, where $h_{ii}$ is the $i$-th diagonal element of $H = X(X'X)^{-1}X'$

Studentized residuals
  • $r_i = \dfrac{e_i}{s(e_i)} = \dfrac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}$
  • Rule of thumb: if $|r_i| > 2$ → large residual
  • Studentized residuals = standardized residuals
  • In Minitab, cases with $|r_i| > 2$ are labeled with an 'R' in the table of unusual observations

Example: price (Y) and size (X) for a sample of 150 houses.

  Case   Y (Price)   X (Square_Footage)
  1      121.87      20.5
  2      150.25      22
  3      122.78      15.9
  4      144.35      18.6
  5      116.2       12.1
  6      139.49      17.1
  7      115.73      16.7
  8      140.59      17.8
  9      120.29      15.2
  ...    ...         ...
  150    114.92      14.2

[Figure: scatterplot of the house data; vertical axis labeled "Y (Price)", horizontal axis labeled "X (Square footage)"]

Deleted residuals
  • Idea: an outlier may affect the regression coefficients → remove observation $i$ and compute the deleted residual (for each $i$):
    $d_i = Y_i - \hat{Y}_{i(i)}$
  • Equivalent form, which needs no refitting:
    $d_i = \dfrac{e_i}{1 - h_{ii}}$
  • Compute the deleted residuals and search for outliers as before

Studentized deleted residuals
  • Idea: studentize the deleted residuals to produce residuals with equal variance:
    $t_i = e_i \left[ \dfrac{n - p - 1}{SSE\,(1 - h_{ii}) - e_i^2} \right]^{1/2}$
  • Further:
    ◦ Examine large deviations
    ◦ Test whether the observation with the largest $|t_i|$ is an outlier:
      $H_0$: the observation with the largest $|t_i|$ is not an outlier
      $H_a$: the observation with the largest $|t_i|$ is an outlier
      Test statistic $T = t_i$, critical value $t(1 - \alpha/(2n);\ n - p - 1)$, two-sided

Smoother matrix $H = X(X'X)^{-1}X'$
Comments:
  • $0 \le h_{ii} \le 1$
  • $\sum_{i=1}^{n} h_{ii} = p$
  • $h_{ii}$ (leverage) is a measure of the distance between $X_i$ and mean(X) → a measure for detecting outliers!
  • $\hat{Y} = HY$ → $h_{ii}$ defines the contribution of the $i$-th observation $Y_i$ to the $i$-th fit $\hat{Y}_i$
  • $s(e_i) = \sqrt{MSE\,(1 - h_{ii})}$ → large $h_{ii}$ → small $s(e_i)$ → the fitted $\hat{Y}_i$ is close to $Y_i$

To find outliers:
  1. Look for $h_{ii}$ larger than 2 × the mean leverage: $h_{ii} > \dfrac{2p}{n}$
  2. Alternative: 0.2 to 0.5 is moderate leverage, above 0.5 is high leverage
Ex. Define the threshold for our data.

Hidden extrapolations
  • When predicting Y for a new X, is the new X within the domain of the data?
    ◦ Easy to check in two dimensions
  • In general, for a new observation $X_N = (1,\ x_1^{new},\ \ldots,\ x_{p-1}^{new})'$, compute
    $h_{NN} = X_N'(X'X)^{-1}X_N$
  • Check whether $h_{NN}$ is much larger than the other leverages

Influential observations
  • Influential = if added, changes the fitted model considerably

Influence on a single fitted value: DFFITS
  • Measure of the influence that case $i$ has on the fitted value $\hat{Y}_i$. Defined as
    $(DFFITS)_i = \dfrac{\hat{Y}_i - \hat{Y}_{i(i)}}{\sqrt{MSE_{(i)}\,h_{ii}}}$
  • Computed as ($t_i$ = studentized deleted residual)
    $(DFFITS)_i = t_i \left( \dfrac{h_{ii}}{1 - h_{ii}} \right)^{1/2}$

To find influential cases:
  1. Small and medium data sets: $|DFFITS| > 1$
  2. Large data sets: $|DFFITS| > 2\sqrt{p/n}$
Ex. Define the threshold for our data.

Influence on all fitted values: Cook's distance
  • Measures the influence of the $i$-th case on all fitted values. Defined as
    $D_i = \dfrac{\sum_{j=1}^{n} (\hat{Y}_j - \hat{Y}_{j(i)})^2}{p \cdot MSE}$
  • Computed as
    $D_i = \dfrac{e_i^2}{p \cdot MSE} \cdot \dfrac{h_{ii}}{(1 - h_{ii})^2}$
  • Look at how $h_{ii}$ and $e_i$ influence $D_i$

Decision rule:
  • If $D_i \ge F(0.5;\ p,\ n - p)$, the case is influential.
Ex. Define the threshold for our data.

Reading: Ch 10.2-10.4, 10.6
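The diagnostics on the slides above can be computed together from one fit. The following is a minimal sketch in Python with NumPy, not the course's Minitab workflow. The function name `influence_diagnostics`, the dictionary keys and the small synthetic data set are all hypothetical and only serve as illustration; the design matrix X is assumed to already contain the column of ones, so p counts the intercept.

```python
import numpy as np


def influence_diagnostics(X, y):
    """Outlier and influence measures for a least-squares fit.

    X is the (n, p) design matrix INCLUDING the column of ones,
    y is the (n,) response. Names and return keys are illustrative.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    H = X @ np.linalg.inv(X.T @ X) @ X.T      # H = X (X'X)^-1 X'
    h = np.diag(H)                            # leverages h_ii
    e = y - H @ y                             # ordinary residuals e_i
    sse = e @ e
    mse = sse / (n - p)

    r = e / np.sqrt(mse * (1.0 - h))          # studentized residuals r_i
    d = e / (1.0 - h)                         # deleted residuals d_i
    # studentized deleted residuals t_i (no refitting needed)
    t = e * np.sqrt((n - p - 1) / (sse * (1.0 - h) - e**2))
    dffits = t * np.sqrt(h / (1.0 - h))       # DFFITS_i
    cook = e**2 / (p * mse) * h / (1.0 - h) ** 2   # Cook's distance D_i

    return {"leverage": h, "residual": e, "studentized": r,
            "deleted": d, "studentized_deleted": t,
            "dffits": dffits, "cook": cook}


# Hypothetical synthetic data (NOT the house data from the slides):
# a straight line with noise and one planted case outlying in X and Y.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 20)
y = 5.0 + 3.0 * x + rng.normal(0.0, 1.0, size=x.size)
x[-1], y[-1] = 12.0, 5.0
X = np.column_stack([np.ones_like(x), x])

diag = influence_diagnostics(X, y)
n, p = X.shape

# Rules of thumb from the slides
high_leverage = np.where(diag["leverage"] > 2 * p / n)[0]
large_residual = np.where(np.abs(diag["studentized"]) > 2)[0]
large_dffits = np.where(np.abs(diag["dffits"]) > 1)[0]   # small data set rule

print("high leverage cases:", high_leverage)
print("large studentized residuals:", large_residual)
print("|DFFITS| > 1:", large_dffits)
print("largest Cook's distance at case:", int(np.argmax(diag["cook"])))
```

The critical values t(1 - α/(2n); n - p - 1) and F(0.5; p, n - p) are not computed here so that the sketch depends on NumPy only; if SciPy is available, `scipy.stats.t.ppf` and `scipy.stats.f.ppf` are one way to obtain them.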
