Stat 301 – Lecture 27

Unusual Points
Outlier box plot identifies potential outliers.

[Figure: Normal Quantile Plot and histogram (Count); horizontal axis from -5 to 15]

Box Plot – Potential Outliers
Vehicle Name                       Highway MPG   Predicted MPG   Residual
Honda Civic HX 2dr                      44            35.4          8.6
Toyota Echo 2dr manual                  43            35.5          7.5
Toyota Prius 4dr (gas/electric)         51            35.3         15.7
Volkswagen Jetta GLS TDI 4dr            46            33.4         12.6

Outlier
How do we determine if a potential outlier identified on the box plot is statistically significant?

Unusual Points in Regression
Outlier: a point with an unusually large residual.
High leverage point: a point with an extreme value for one or more of the explanatory variables.

Influential Points
Does a point influence where the regression line goes?
An outlier can.
A high leverage point can.
Is that point statistically significant in terms of influence?

Simple Linear Regression
Example: 50 mammals
Response variable: gestation (length of pregnancy), in days
Explanatory variable: brain weight

Distributions: Gestation
[Figure: histogram of Gestation (Count); horizontal axis from 0 to 500]

Gestation (days)
Skewed to the right.
Two potential outliers.
Mean = 117.4 days
Median = 65.5 days
Values from 12 days to 440 days.

Distributions: BrainWgt
[Figure: histogram of BrainWgt (Count); horizontal axis from 0 to 1500]

Brain Weight
Highly skewed to the right with several mounds.
Six potential outliers.
Mean = 107.25 g
Median = 16.25 g
Values from 0.14 g to 1320 g.

Simple Linear Regression
Trying to explain variation in the response (gestation) by relating the response to the explanatory variable (brain weight).

Regression Residuals
residual = y - ŷ
Those observations that do not follow the general trend will have residuals that are far from zero, either positive or negative.

Regression Outlier
A residual very far from zero, either negative or positive, will be called an outlier for regression.
An outlier for regression corresponds to a value of the response that does not match the overall trend.

Simple Linear Regression
Predicted Gestation = 85.25 + 0.30*Brain Weight
R² = 0.372, so only 37.2% of the variation in gestation is explained by the linear relationship with brain weight.
RMSE = 85.1 days

Simple Linear Regression
The model is useful: F = 28.49, P-value < 0.0001.
This also indicates that there is a statistically significant linear relationship between brain weight and gestation.

[Figure: plot of Residual (-300 to 300) versus BrainWgt (0 to 1500)]

Unusual Points
The mammal with a brain weight around 1300 g has the residual furthest from zero on the negative side.
There are other mammals with residuals of the same magnitude on the positive side.

Outlier Box Plot
Start with the five-number summary of the residuals:
Minimum = -214.1
25% Quartile = -57.9
50% Median = -31.1
75% Quartile = 36.7
Maximum = 256.1

InterQuartile Range (IQR)
IQR = 75% Quart - 25% Quart
IQR = 36.7 - (-57.9) = 94.6
Upper = 75% Quart + 1.5*IQR
Upper = 36.7 + 141.9 = 178.6
Lower = 25% Quart - 1.5*IQR
Lower = -57.9 - 141.9 = -199.8

Outlier Box Plot
Any point above the Upper or below the Lower will be flagged as a potential outlier.
Lines extend to the most extreme points inside the Lower and Upper bounds.

Distributions: Residual Gestation
[Figure: histogram of Residual Gestation (Count); horizontal axis from -300 to 300]

Regression Outliers
Mammal            Brain Weight   Gestation   Pred     Resid
Brazilian Tapir   169 g          392 days    135.9    256.1
“Man”             1320 g         267 days    481.1   -214.1
Okapi             490 g          440 days    232.2    207.8

Comments
The residual for “Man” is not the most extreme.
The residual for the Brazilian Tapir is the furthest from zero.

Text Definition
Our text defines an outlier in regression as a value with a residual that is more than 3 standard deviations away from zero.
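The two screening rules seen so far, the outlier box plot bounds and the text's 3 standard deviation rule, can be compared on the three flagged mammals. The following is only a sketch: the function name `outlier_fences` is made up for illustration, and it assumes that RMSE = 85.1 days plays the role of the standard deviation of the residuals.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) bounds used by the outlier box plot."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of the residuals, taken from the five-number summary.
lower, upper = outlier_fences(q1=-57.9, q3=36.7)

# Residuals (days) for the three mammals listed under "Regression Outliers".
residuals = {"Brazilian Tapir": 256.1, "Man": -214.1, "Okapi": 207.8}

# Rule 1: flagged on the box plot if outside the Lower/Upper bounds.
box_flag = {name: (r < lower or r > upper) for name, r in residuals.items()}

# Rule 2 (text definition): more than 3 standard deviations from zero.
# ASSUMPTION: RMSE is used as the standard deviation of the residuals.
rmse = 85.1
text_flag = {name: abs(r) > 3 * rmse for name, r in residuals.items()}

print("Lower, Upper:", round(lower, 1), round(upper, 1))
for name in residuals:
    print(name, "| box plot:", box_flag[name], "| text rule:", text_flag[name])
```

All three residuals fall outside the box plot bounds, but only the Brazilian Tapir is more than 3 RMSE from zero, so the box plot is the more liberal of the two screens.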
Standardized Residual
z = residual / RMSE
A standardized residual should follow a standard normal distribution.

Text Definition
Our text defines an outlier in regression as a value with a standardized residual less than -3 or greater than +3.

Standardized Residual
Mammal            Residual     z      Outlier?
Brazilian Tapir     256.1     3.01    Yes
“Man”              -214.1    -2.52    No
Okapi               207.8     2.44    No

Computing a P-value
JMP – Col – Formula
(1 - Normal Distribution(|z|))*2
where |z| is the absolute value of z.

Standardized Residual
Mammal            Residual     z      P-value
Brazilian Tapir     256.1     3.01    0.0026
“Man”              -214.1    -2.52    0.0119
Okapi               207.8     2.44    0.0146

Caution
We are essentially doing 50 tests of hypothesis.
If each test has a chance of error of 5%, then one would expect to see some P-values less than 0.05 just by chance.

Bonferroni Correction
Adjust what is a small P-value.
0.05 / (# of residuals) = 0.05 / 50 = 0.001
If a P-value is less than 0.001, then the standardized residual is statistically significant.

Conclusion
Although some of the residuals were flagged on the outlier box plot, and the residual for the Brazilian Tapir meets the text definition, none were deemed statistically significant once we corrected for doing 50 simultaneous tests.

Comment
Because the 3 potential outliers have residuals far from zero, they inflate the value of RMSE.
Is there a way to evaluate these potential outliers using a value of RMSE more representative of the remaining values?

Adjust RMSE
Variance(residuals) = 7091.2704
Adjust by subtracting off the squared residuals for the potential outliers.

Adjust RMSE
Mammal            Residual    Residual²
Brazilian Tapir     256.1     65587.21
“Man”              -214.1     45838.81
Okapi               207.8     43180.84

Adjust RMSE
Sum of Squares Residual: 7091.2704(49) = 347472.2496
Subtract off (65587.21 + 45838.81 + 43180.84) = 154606.86
Adjusted Sum of Squares Residual: 347472.25 - 154606.86 = 192865.39

Adjust RMSE
New RMSE = 65.467
Calculate new standardized residuals.
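The adjusted RMSE arithmetic can be retraced in a few lines. This is a sketch, not the lecture's own formula: the handout does not show how the adjusted sum of squares is turned into the new RMSE, so the divisor 45 (50 observations, minus 2 for the fitted line, minus the 3 removed residuals) is an assumption, chosen because it reproduces the reported value of 65.467.

```python
import math

n = 50                                   # number of mammals
var_resid = 7091.2704                    # variance of the residuals (from the slides)
outlier_resids = [256.1, -214.1, 207.8]  # Brazilian Tapir, "Man", Okapi

# The sample variance divides by n - 1, so multiplying by 49 recovers
# the residual sum of squares.
sse = var_resid * (n - 1)

# Subtract off the squared residuals of the three potential outliers.
sse_adj = sse - sum(r ** 2 for r in outlier_resids)

# ASSUMPTION: degrees of freedom = 50 - 2 - 3 = 45 (not shown in the source).
df_adj = n - 2 - len(outlier_resids)
rmse_adj = math.sqrt(sse_adj / df_adj)

print(round(sse, 4))       # residual sum of squares
print(round(sse_adj, 2))   # adjusted sum of squares
print(round(rmse_adj, 3))  # 65.467, matching the slide
```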
New Standardized Residual
Mammal            Residual     z      P-value
Brazilian Tapir     256.1     3.91    0.00009
“Man”              -214.1    -3.27    0.00108
Okapi               207.8     3.17    0.00150
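The new standardized residuals and P-values can be recomputed outside JMP. The sketch below uses Python's `math.erfc` in place of JMP's Normal Distribution function, which is a substitution and not the lecture's method. The helper name `two_sided_p` is hypothetical. Comparing the results to the earlier Bonferroni cutoff of 0.001 is also an assumption, since the table itself does not state a conclusion. Because the inputs are rounded, the last digit can differ slightly from the table.

```python
import math

def two_sided_p(z):
    """Two-sided standard normal P-value, 2 * (1 - Phi(|z|))."""
    # For the standard normal, 2 * (1 - Phi(x)) equals erfc(x / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

rmse_adj = 65.467    # new RMSE from the slides
cutoff = 0.05 / 50   # Bonferroni-adjusted cutoff for 50 residuals

residuals = {"Brazilian Tapir": 256.1, "Man": -214.1, "Okapi": 207.8}

results = {}
for name, r in residuals.items():
    z = r / rmse_adj                 # new standardized residual
    p = two_sided_p(z)
    results[name] = (z, p, p < cutoff)
    print(f"{name:16s} z = {z:6.2f}  P-value = {p:.5f}  below 0.001: {p < cutoff}")
```

Under these assumptions only the Brazilian Tapir's P-value falls below 0.001. The values for “Man” and the Okapi sit just above the cutoff.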