Outliers

Outliers are data points that lie outside the general linear pattern whose midline is the regression line. A rule of thumb is that outliers are points whose standardized residual exceeds 3.3 in absolute value (corresponding to the .001 alpha level). Removing outliers from the data set under analysis can at times dramatically affect the performance of a regression model. Outliers should be removed if there is reason to believe that other variables not in the model explain why the outlier cases are unusual; that is, these cases need a separate model. Alternatively, outliers may suggest that additional explanatory variables need to be brought into the model (that is, the model needs respecification). Another alternative is to use robust regression, whose algorithm gives less weight to outliers but does not discard them.

The leverage statistic, h, also called the hat-value, is available to identify cases which influence the regression model more than others.
• Belsley, Kuh, and Welsch (1980) define the leverage (h_i) of the ith observation, for a single independent variable x with sample variance S_x^2, as

  $$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{(n-1)\,S_x^2}$$

  Leverage assesses how far a value of the independent variable is from its mean: the farther away the observation, the more leverage it has. From the definition you can see that leverage is mitigated by a larger sample size (any single point should have less influence) and by a larger variance of the independent variable (again, any single point should have less influence).
• 0 ≤ h ≤ 1. The leverage statistic varies from 0 (no influence on the model) to 1 (completely determines the model). In a model with a constant, the definition above shows that the smallest attainable value is 1/n.
• A rule of thumb is that cases with leverage under .2 are not a problem, but if a case has leverage over .5, the case has undue leverage and should be examined for the possibility of measurement error or the need to model such cases separately.
STATA command: predict h, hat

Cook's distance, D, is another measure of the influence of a case.
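As a quick numerical check of the leverage formula, before turning to Cook's distance in detail, the sketch below evaluates h_i for a single independent variable in Python. The data are hypothetical (nine clustered x values and one far from the mean), and the function name leverage is an assumption of this sketch, not a Stata or library routine.

```python
from statistics import mean, variance

def leverage(x):
    """Hat-values for a regression with one independent variable and a constant."""
    n = len(x)
    xbar = mean(x)
    s2 = variance(x)  # sample variance S_x^2 (divisor n - 1)
    return [1 / n + (xi - xbar) ** 2 / ((n - 1) * s2) for xi in x]

# Hypothetical data: nine clustered values and one far from the mean.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
h = leverage(x)

for xi, hi in zip(x, h):
    note = "examine (h > .5)" if hi > 0.5 else ""
    print(f"x = {xi:>2}   h = {hi:.3f}   {note}")
```

With these made-up values, the point at x = 30 has a leverage of about .91 and every other point stays under .2. This matches the rule of thumb: only the isolated case would be singled out for examination.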
Cook's distance measures the effect of deleting a given observation.
• Observations with larger D values than the rest of the data are those which have unusual influence.
• D > 4/n is the criterion to indicate a possible problem.
STATA command: predict D, cooksd

dfbetas is another statistic for assessing the influence of a case.
• If dfbetas > 0, the case increases the slope; if dfbetas < 0, the case decreases the slope.
• The case may be considered an influential outlier if $|dfbetas| > 2/\sqrt{n}$.
STATA command: dfbeta creates dfbetas for all variables, or predict DFx1, dfbeta(x1) for an individual variable.

DFITS (dfFit) measures how much the estimate (the fitted value for a case) changes as a result of that observation being dropped from the analysis.
• dfits is defined as

  $$dfits_i = Rstudent_i \sqrt{\frac{h_i}{1 - h_i}}$$

  where Rstudent is the studentized residual.
• The case may be considered an influential outlier if $|DFITS| > 2\sqrt{k/n}$.
STATA command: predict DFITS, dfits

Studentized residuals and deleted studentized residuals are also used to detect outliers with high leverage. A "studentized residual" is the observed residual divided by its estimated standard deviation. The "studentized deleted residual," also called the "jackknife residual," is the observed residual divided by the standard deviation computed with the given observation left out of the analysis. Analysis of outliers usually focuses on deleted residuals. Other synonyms include externally studentized residual or, misleadingly, standardized residual. There will be a t value for each residual, with df = n - k - 1, where k is the number of independent variables. When t exceeds the critical value for a given alpha level (e.g., .05), the case is considered an outlier. In a plot of deleted studentized residuals versus ordinary residuals, one may draw lines at plus and minus two standard units to highlight cases outside the range where 95% of the cases normally lie; points substantially off the straight line are potential leverage problems.
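The statistics defined so far can be computed by hand for a regression with one independent variable. The following Python sketch is illustrative only. The data are hypothetical, the function name influence is an assumption, and p counts the estimated parameters (slope plus constant), which is not the same as k in the cutoffs above. It uses the standard closed-form relation between the standardized and deleted studentized residuals instead of refitting the model n times.

```python
from math import sqrt
from statistics import mean

def influence(x, y):
    """Per-case diagnostics for the least-squares fit y = a + b*x."""
    n, p = len(x), 2                       # p = parameters: slope and constant
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)            # equals (n-1)*S_x^2
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    e = [yi - (a + b * xi) for xi, yi in zip(x, y)]    # ordinary residuals
    s2 = sum(ei ** 2 for ei in e) / (n - p)            # residual variance
    rows = []
    for xi, ei in zip(x, e):
        h = 1 / n + (xi - xbar) ** 2 / sxx             # leverage
        r = ei / sqrt(s2 * (1 - h))                    # standardized residual
        t = r * sqrt((n - p - 1) / (n - p - r ** 2))   # deleted studentized
        rows.append({
            "h": h,
            "rstandard": r,
            "rstudent": t,
            "D": r ** 2 * h / (p * (1 - h)),           # Cook's distance
            "DFITS": t * sqrt(h / (1 - h)),
        })
    return rows

# Hypothetical data: a near-linear trend with an unusual last case.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 9.5]
rows = influence(x, y)

n, k = len(x), 1                           # k = independent variables
for i, row in enumerate(rows, start=1):
    flags = []
    if row["D"] > 4 / n:
        flags.append("D")
    if abs(row["DFITS"]) > 2 * sqrt(k / n):
        flags.append("DFITS")
    print(i, {name: round(value, 3) for name, value in row.items()}, flags)
```

The names rstandard, rstudent, D, and DFITS are chosen to mirror the predict options used in these notes. The numbers come from the made-up data, not from any Stata run.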
STATA command: predict student, rstudent or predict standard, rstandard

Partial regression plots, also called partial regression leverage plots or added variable plots, are yet another way of detecting influential sets of cases. Partial regression plots are a series of bivariate regression plots of the dependent variable with each of the independent variables in turn, each after adjusting for the other independent variables. The plots show cases by number or label instead of dots. One looks for cases which are outliers on all or many of the plots.
STATA command: avplots

Example using the Murder.dta data set.

    reg mrdrte exec unem d90 d93

    Source   |  SS          df   MS             Number of obs = 153
    Model    |  977.390644    4  244.347661     F(4, 148)     = 3.05
    Residual |  11867.9475  148  80.1888343     Prob > F      = 0.0190
    Total    |  12845.3381  152  84.5088034     R-squared     = 0.0761
                                                Adj R-squared = 0.0511
                                                Root MSE      = 8.9548

    mrdrte |  Coef.       Std. Err.   t       P>|t|   [95% Conf. Interval]
    exec   |   .1627547   .1939295     0.84   0.403   -.2204738    .5459832
    unem   |   1.390786   .4508653     3.08   0.002    .4998207    2.281751
    d90    |   2.675335   1.816934     1.47   0.143   -.91515      6.26582
    d93    |   1.607317   1.774768     0.91   0.367   -1.899842    5.114476
    _cons  |  -1.864393   3.069517    -0.61   0.545   -7.930134    4.201349

    predict e, residual
    predict yhat
    predict standard, rstandard
    predict student, rstudent
    predict h, hat
    predict D, cooksd
    predict DFITS, dfits
    predict W, welsch
    dfbeta
        DFexec: DFbeta(exec)
        DFunem: DFbeta(unem)
        DFd90: DFbeta(d90)
        DFd93: DFbeta(d93)

    rvplot
    [Figure: Residuals versus Fitted values]

To create a standardized residual plot:
    graph twoway scatter standard yhat, yline(0)
    [Figure: Standardized residuals versus Fitted values]

To identify the outliers:
    graph twoway scatter standard yhat, yline(0) mlabel(state)
    [Figure: Standardized residuals versus Fitted values, points labeled by state]

    avplots, mlabel(state)
    [Figure: added-variable plots of e( mrdrte | X ) against each independent variable, points labeled by state. Panel captions:
        e( exec | X ): coef = .16275467, se = .19392954, t = .84
        e( unem | X ): coef = 1.3907856, se = .45086525, t = 3.08
        e( d90 | X ):  coef = 2.6753348, se = 1.8169343, t = 1.47
        e( d93 | X ):  coef = 1.6073174, se = 1.774768, t = .91]

To graphically measure the influence of observations:
    graph twoway scatter standard yhat [aweight=D], msymbol(oh) yline(0)
    [Figure: Standardized residuals versus Fitted values, markers weighted by Cook's D]

The leverage versus residual squared plot marks the means of leverage and squared residuals. Leverage tells us how much potential for influencing the regression an observation has.
    lvr2plot

To examine the numerical measures for outliers, compare each statistic with its cutoff:
    leverage      h > 2k/n
    Cook's D      D > 4/n
    DFITS         $|DFITS| > 2\sqrt{k/n}$
    Welsch's W    $|W| > 3\sqrt{k}$
    DFBETA        $|DFBETA| > 2/\sqrt{n}$

For example,
    sort D
    list state yhat D DFITS W in -5/l

          state   yhat       D          DFITS       W
    149.  WV      13.15609   .0150447   -.2742131   -3.513339
    150.  DC      6.897557   .0452811    .4927538    6.137678
    151.  TX      15.01208   .0504226   -.500824    -8.794935
    152.  DC      9.990127   .2905912    1.547051    19.3077
    153.  DC      11.5646    .4116747    1.832155    22.9866

Recall that D > 4/n indicates a possible problem. With n = 153, the cutoff is 4/153 = .02614379, so cases 150 to 153 exceed it.
A case with $|DFITS| > 2\sqrt{k/n}$ may be considered an influential outlier. With n = 153 and k = 4, the cutoff is $2\sqrt{4/153} = .3234$, which the same four cases exceed.
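The cutoff arithmetic for this example can be double-checked outside Stata. The short Python sketch below is only a cross-check under stated assumptions: n = 153 and k = 4 are taken from the regression output, the five cases are copied from the listing above, and the names cutoffs, cases, and flagged are introduced here for illustration.

```python
from math import sqrt

n, k = 153, 4  # observations and independent variables in the example

# Rule-of-thumb cutoffs as listed in these notes.
cutoffs = {
    "h": 2 * k / n,
    "D": 4 / n,
    "DFITS": 2 * sqrt(k / n),
    "W": 3 * sqrt(k),
    "DFBETA": 2 / sqrt(n),
}
for name, value in cutoffs.items():
    print(f"{name:<7} cutoff = {value:.7f}")

# (case number, state, D, DFITS, W) copied from the listing of the five largest D.
cases = [
    (149, "WV", 0.0150447, -0.2742131, -3.513339),
    (150, "DC", 0.0452811, 0.4927538, 6.137678),
    (151, "TX", 0.0504226, -0.500824, -8.794935),
    (152, "DC", 0.2905912, 1.547051, 19.3077),
    (153, "DC", 0.4116747, 1.832155, 22.9866),
]

# Record which statistics exceed their cutoff (in absolute value) for each case.
flagged = {}
for obs, state, d, dfits, w in cases:
    values = {"D": d, "DFITS": dfits, "W": w}
    flagged[obs] = [s for s, v in values.items() if abs(v) > cutoffs[s]]
    print(obs, state, ", ".join(flagged[obs]) or "none")
```

Absolute values are compared because DFITS and W are signed, so a large negative value such as the TX case is treated the same as a large positive one.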