WINSORIZING
Kyle Allen & Matthew Whitledge
May 7, 2013
What is it and why could it be inappropriate?

WHAT IS WINSORIZING?
What it isn’t…
  Trimming
  Truncating
  Any other method that completely removes observations from the data
Term first used in 1960
  John W. Tukey; W. J. Dixon
“Numerical value of a wild observation is untrustworthy”
  However, its direction of deviation is important
Decreasing the magnitude of the deviation, retaining its direction

WINSORIZING AN EXAMPLE
Order the observations by value
  Xi1, Xi2, …, Xi100, where i denotes the ith regressor
If Winsorizing at 1% and 99%, then
  The value for Xi1 will be replaced by the value for Xi2
  The value for Xi100 will be replaced by the value for Xi99

Another example: Xi1, Xi2, …, Xi100
Winsorize at 10% (5% from the bottom and 5% from the top)
Beginning Sample:
  Xi1, Xi2, Xi3, Xi4, Xi5, Xi6, …, Xi95, Xi96, Xi97, Xi98, Xi99, Xi100
Winsorized Sample:
  Xi5, Xi5, Xi5, Xi5, Xi5, Xi6, …, Xi95, Xi96, Xi96, Xi96, Xi96, Xi96

Winsorized at 5% and 95%
  Obs.        Original    Winsorized
  Xi1         0.2         6.3
  Xi2         0.9         6.3
  Xi3         3.5         6.3
  Xi4         4.8         6.3
  Xi5         6.3         6.3
  Xi6         7           7
  Xi7         7.1         7.1
  Xi8         7.2         7.2
  Xi9-Xi92    …           …
  Xi93        82          82
  Xi94        83.2        83.2
  Xi95        83.5        83.5
  Xi96        98          98
  Xi97        112         98
  Xi98        114         98
  Xi99        3150        98
  Xi100       6572        98

WINSORIZING ALTERNATIVES
Are the observations really outliers?
  Look at Cook’s D measure
Transform the variables
  Take the log or square root of the variable
  This shouldn’t be done only to increase significance
Median-based estimation
  Quantile regression
  Median absolute deviation
Nonparametric methods

WINSORIZING A SAS EXAMPLE
Lift Index Data
  Workers perform lifting tasks
  Each lift has an amount of stress associated with it
  Measuring the number of days an employee missed based on the lift they were performing
  206 observations

WINSORIZING SAS CODE
  /* Overlay the original (dayslost) and Winsorized (dayslost1) days lost against alr */
  proc sgplot data=isqsdata.lilesmerge;
    scatter y=dayslost x=alr;
    scatter y=dayslost1 x=alr;
  run;

  /* Winsorize by hand: set days lost for subjects 6 and 35 to 27 */
  data isqsdata.lileswin;
    set isqsdata.lileswin;
    if subject = 6 then dayslost = 27;
    if subject = 35 then dayslost = 27;
  run;

  /* Censored regression (lower bound 0) on the original data */
  proc qlim data=isqsdata.liles;
    model dayslost = alr;
    endogenous dayslost ~ censored(lb=0);
  run;

  /* Same model on the Winsorized data */
  proc qlim data=isqsdata.lileswin;
    model dayslost1 = alr;
    endogenous dayslost1 ~ censored(lb=0);
  run;

WINSORIZING LOOK AT YOUR DATA
PROC QLIM (NON-WINSORIZED)
PROC QLIM (WINSORIZED)

WINSORIZING IMPLICATIONS
May impact significance
  The standard errors will decrease
  Depending on how symmetrical the data are, the mean may increase or decrease
    For example, Winsorizing an extremely large positive outlier will decrease the mean
  Significance will be determined by the proportionate change in the estimated coefficient relative to the change in the standard error

WINSORIZING WHY COULD IT BE INAPPROPRIATE?
May be appropriate for
  Ratios
    Book to Market
    Other measures in which the denominator can be extremely small
Never Winsorize valid observations
  Investment returns
  R&D expenditures
Truly exceptional observations
  Large number of biological elements
  Extremely low stress tolerances for mechanical implements
Model should produce data we could actually see

WINSORIZING BIBLIOGRAPHY
Brillinger, David R. “John W. Tukey: His Life and Professional Contributions.” The Annals of Statistics. 30(2002): 1535-75.
Dixon, W. J.
“Simplified Estimation from Censored Normal Samples.” The Annals of Mathematical Statistics. 31(1960): 385-91.
Kafadar, Karen. “John Tukey and Robustness.” Proceedings of the Annual Meeting of the American Statistical Association. 2001.
Kruskal, William, Thomas Ferguson, John W. Tukey, E. J. Gumbel, and F. J. Anscombe. “Discussion of the Papers of Messrs. Anscombe and Daniel.” Technometrics. 2(1960): 157-66.
Tukey, John W. and Donald H. McLaughlin. “Less Vulnerable Confidence and Significance Procedures for Location Based on a Single Sample: Trimming/Winsorization 1.” Sankhyā: The Indian Journal of Statistics. 25(1963): 331-52.
Westfall, Peter H. and Kevin S. S. Henning. Understanding Advanced Statistical Methods. Boca Raton, FL: CRC Press, 2013.
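
SUPPLEMENTARY SKETCH: THE 5% AND 95% EXAMPLE IN CODE
The rank-based replacement in the "Winsorized at 5% and 95%" table can be sketched in a few lines. This is an illustration only, written in Python rather than the SAS used in the slides. The function name winsorize_by_rank and its parameter k are hypothetical, and only the 16 values visible in the table are used (the elided Xi9-Xi92 sit in the middle of the ordering, so leaving them out does not change the two replacement values). Note that percentile-based routines in statistical packages may choose slightly different cut-offs than this rank convention.

```python
def winsorize_by_rank(values, k):
    """Replace the k-1 smallest values with the k-th smallest and the
    k-1 largest values with the k-th largest, keeping the input order.

    Hypothetical helper mirroring the slide's convention, where the
    5th-lowest and 5th-highest observations supply the replacement values.
    """
    n = len(values)
    if k < 1 or 2 * k > n:
        raise ValueError("k must satisfy 1 <= k <= n/2")
    ordered = sorted(values)
    low = ordered[k - 1]   # k-th smallest value (Xi5 in the table)
    high = ordered[n - k]  # k-th largest value (Xi96 in the table)
    # Clamp every observation into [low, high]; nothing is dropped,
    # which is what separates Winsorizing from trimming.
    return [min(max(v, low), high) for v in values]


# The 16 values shown in the table (Xi1-Xi8 and Xi93-Xi100).
values = [0.2, 0.9, 3.5, 4.8, 6.3, 7, 7.1, 7.2,
          82, 83.2, 83.5, 98, 112, 114, 3150, 6572]

winsorized = winsorize_by_rank(values, k=5)
print(winsorized)

# The mean falls because the largest outliers are positive.
mean_before = sum(values) / len(values)
mean_after = sum(winsorized) / len(winsorized)
print(mean_before, mean_after)
```

The four lowest values become 6.3 and the four highest become 98, as in the table, and the sample size is unchanged. The drop in the mean illustrates the point on the implications slide about an extremely positive outlier.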