ROBUST STATISTICS

R. Douglas Martin* and Ruben H. Zamar**
*Professor of Statistics, Univ. of Washington
**Professor of Statistics, Univ. of British Columbia

Key Reference Books
• Huber, P.J. (1981). Robust Statistics, Wiley.
• Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., and Stahel, W.A. (1986). Robust Statistics: The Approach Based on Influence Functions, Wiley.
• Rousseeuw, P.J. and Leroy, A.M. (1987). Robust Regression and Outlier Detection, Wiley.

J. W. Tukey (1979)
“… just which robust/resistant methods you use is not important – what is important is that you use some. It is perfectly proper to use both classical and robust/resistant methods routinely, and only worry when they differ enough to matter. But when they differ, you should think hard.”

J. W. Tukey
“Statistics is a science in my opinion, and it is no more a branch of mathematics than are physics, chemistry and economics; for if its methods fail the test of experience – not the test of logic – they will be discarded”

Recommended reading: Annals of Statistics Tukey Memorial Volume (Fall, 2002)
“John Tukey’s Contributions to Robust Statistics” (P. J. Huber)
“The Life and Professional Contributions of J. W. Tukey” (D. R. Brillinger)

OUTLINE
1. DATA-ORIENTED INTRODUCTION
2. LOCATION AND SCALE ESTIMATES
3. BASIC ROBUSTNESS CONCEPTS
4. ROBUST REGRESSION
5. ROBUST MULTIVARIATE LOCATION AND SCATTER

INTRODUCTION
1. Outliers Examples
2. Classical Parameter Estimates are Not Robust
3. Classical Statistical Inference is Not Robust
4. Data-Oriented Robustness and Examples
5. Simple Robust Location and Scale Estimates
6. Simple Robust Estimates Have Bounded EIFs
7. Outlier Mining One Dimension at a Time

OUTLIERS
– Outliers are atypical observations that are “well” separated from the bulk of the data
  • In isolation or in small clusters
Dimensionality context
  • 1-D (relatively easy to detect)
  • 2-D (harder to detect)
  • Higher-D (very hard to detect)
  • Time Series (special challenges)

Classical Statistics
• PARAMETER ESTIMATES (“Point” Estimates)
  – Sample mean and sample standard deviation
  – Sample correlation and covariance estimates
  – Linear least squares model fits
  – Gaussian maximum likelihood
• STATISTICAL INFERENCE
  – t-statistic and t-interval for an unknown mean
  – Standard errors and t-values for regression coefficients
  – F-tests for regression model hypotheses
  – AIC, BIC, Cp model selection statistics

CLASSICAL STATS ARE NOT ROBUST
Outliers have “unbounded influence” on classical statistics, resulting in:
• Inaccurate parameter estimates and predictions
• Inaccurate statistical inference
  – Standard errors are too large
  – Confidence intervals are too wide
  – t-statistics lack power
  – AIC, BIC, Cp result in wrong models
• Unreliable outlier detection

EMPIRICAL INFLUENCE FUNCTION
$\mathbf{x} = (x_1, x_2, \ldots, x_n)$ is the sample and $x$ is an additional data point.
$\mathrm{EIF}(x; T, \mathbf{x}) = (n+1)\,\bigl(T(\mathbf{x}, x) - T(\mathbf{x})\bigr)$
• The factor $(n+1)$ normalizes across sample size
• Measures the influence of an additional point $x$ on $T$

CLASSICAL ESTIMATES HAVE UNBOUNDED EIF
$\mathrm{EIF}(x; \text{mean}, \mathbf{x}) = x - \bar{x}$
[Figure: Sample Mean. eif plotted against x, both axes running from -4 to 4.]

RESISTANCE (J.W. Tukey’s term)
• A Fundamental Continuity Concept
  - Small changes in the data result in only small changes in the estimate
  - “Change a few, so what” J.W. Tukey (Seattle, 1977)
• “Small Changes” Generalization
  - Small changes in all the data (e.g., rounding errors)
  - Large changes in a small fraction of the data (a few outliers)
• Valuable Consequence
  - A good fit to the bulk of the data
  - Reliable, automatic outlier detection

1-D Outliers: Stock Returns
• Outliers represent locally large losses/gains
• Sometimes you must process thousands of such series
• You need to detect the outliers automatically!
[Figure: density estimate of the returns. Vertical axis: Density; horizontal axis: nobeled.]

1-D Outliers: Density of Earth
Density of Earth Relative to Density of Water
• Cavendish, 1798, measurements.
• Because of the low outlier, the median 5.46 is a better estimate of Earth density than the mean 5.42
[Figure: histogram of the measurements with the low outlier marked. Horizontal axis: Density, 4.0 to 6.0.]

2-D Outliers: Predicting EPS
INVENSYS ANNUAL EPS VERSUS TIME
• You have to predict 2001 EPS!
• You have many of these, e.g., hundreds!
[Figure: EARNINGS PER SHARE against YEAR, 1985 to 2000.]

2-D Outliers: Main Gain Data
TELEPHONE GAIN VS. DIFFERENCE IN NEW HOUSING STARTS
[Figure: tel.gain against diff.hstarts.]

5-D Outliers: Woodmod Data
• A group of 4 outliers shows up in the plots of V1 vs V2 and V4 vs V5
• Corr(V1,V2) = -0.15
• RobCorr(V1,V2) = 0.75
[Figure: scatterplot matrix of V1, V2, V3, V4 and V5.]

LUNATICS IN MASSACHUSETTS
• Population densities in Suffolk and Essex are much larger than those in the other counties
• Correlation = -0.64
• Robust Correlation = -0.97
[Figure: Percentage Treated at Home against Population Density, with SUFFOLK and ESSEX labeled.]

LUNATICS IN MASSACHUSETTS (Continued)
Plot with Suffolk and Essex removed
• Now Nantucket shows up as an outlier
• Correlation = -0.84
• Robust Correlation = -0.93
[Figure: Percentage Treated at Home against Population Density, with NANTUCKET labeled.]

LUNATICS IN MASSACHUSETTS (Continued)
Plot with Suffolk, Essex and Nantucket removed
• Now the data show a clear decreasing trend, with smaller percentages in more populated counties
• Correlation = -0.97
• Robust Correlation = -0.97
[Figure: Percentage Treated at Home against Population Density.]

Time Series with Outliers and Level Shifts
TOBACCO AND RELATED SALES IN THE UK
• Need to detect outliers and level shifts as important, distinct events
• Key aspects of consumer behavior
• Automate the detection of key changes in a few out of many thousands of customers.
[Figure: TOBACCO SALES against TIME, 1955 to 1960, with an outlier and level shifts marked.]

Gene Expression Data
• Microarray experiments are typically used to identify differentially expressed genes.
• DNA probes printed on a glass slide are hybridized to two RNA samples separately labeled with two fluorescent dyes.
• The hybridization intensity values are calculated from the scanned slide using image analysis and are then used to identify differentially expressed genes.

Three Principal Stages of the Technology
• Array fabrication (PCR amplification and clone preparation, reaction clean up, array printing)
• Probe preparation (mRNA extraction, mRNA labeling, probe labeling and purification) and hybridization
• Slide scanning and image processing (gridding, segmentation, intensity extraction)

Gene Expression Data (continued)
Each of the above-mentioned stages may generate several sources of random variation and of systematic error. For example:
• The first one involves variation in the quantity of probe at a spot and in the hybridization efficiency of the probes with their counterparts (mRNA targets)
• The second one includes variation in the quantity of mRNA in a sample applied to the slide and variation in the amount of target hybridized to the probe
• The third one is subject to variation in optical measurements and in fluorescent intensities computed from the scanned image.

Gene Expression Data (continued)
Different substances can be used to increase or damp the level of expression of a gene.
Hughes et al. (2000), “Functional Discovery via a Compendium of Expression Profiles”, Cell 102: 109-126, considered 6068 genes and ten different substances abbreviated as:
cin, cup, spf, vma, fre, yap, mac, yer, sod and ymr

Gene Expression Data (continued)
• The sample exposed to the substance (treatment sample) was labeled “green”.
• The other sample (control sample) was labeled “red”.
• The normalized green intensity of gene “i” in sample “j” is denoted by $X_{ij}$, $i = 1, \ldots, 6068$, $j = 1, \ldots, 10$
• The normalized red intensity of gene “i” in sample “j” is denoted by $Y_{ij}$, $i = 1, \ldots, 6068$, $j = 1, \ldots, 10$

Gene Expression Data (continued)
We will examine the differences between normalized gene expression intensities
$Z_{ij} = Y_{ij} - X_{ij}$, $i = 1, \ldots, 6068$, $j = 1, \ldots, 10$
• The expression levels for most genes are similar. Those will appear as “normal data” in the boxplots.
• There are some genes for which the difference in intensity is large. Those are the genes that are likely to be over- or under-expressed in the “treatment” samples.

Gene Expression Data
GENE EXPRESSION DIFFERENCES FOR TEN SAMPLES (LOG-SCALE)
• Red - Green intensity levels for ten samples
• Similar intensity levels for most genes
• Outliers may correspond to over / under expressed genes
[Figure: boxplots of the differences for cin, cup, fre, mac, sod, spf, vma, yap, yer and ymr. Vertical axis from -6 to 6.]

NORMALIZED MEAN-MEDIAN DIFFERENCE

Sample   Median    Mean     Difference (Normalized)
CIN       0.007    0.001     0.34
CUP       0.013   -0.028     2.61
FRE       0.003    0.012    -0.53
MAC       0.000   -0.007     0.45
SOD       0.003    0.002     0.08
SPF       0.013   -0.012     1.60
VMA       0.003   -0.026     1.83
YAP       0.010   -0.010     1.29
YER       0.003    0.002     0.09
YMR       0.000   -0.003     0.20

Diff = (Med-Mean)/SE(Med)
• In several cases (red rows in the original table, e.g., CUP, SPF, VMA and YAP) the mean and median have different signs.
• Differences are relatively small.
• The positive and negative outliers balance each other, limiting their overall effect on the mean.

NORMALIZED SD - MAD DIFFERENCE

Sample   MAD      S.D.     Normalized Difference
CIN      0.113    0.181     4.28
CUP      0.207    0.367    10.08
FRE      0.163    0.212     3.13
MAC      0.128    0.223     5.96
SOD      0.197    0.280     5.22
SPF      0.207    0.275     4.27
VMA      0.202    0.332     8.22
YAP      0.148    0.310    10.19
YER      0.069    0.086     1.05
YMR      0.113    0.224     6.98

Diff = (SD-MAD)/SE(MAD)
• The outliers have a bigger impact on the standard deviations
• Flagging outliers by using means and SDs becomes more difficult

Standard Deviation vs. MAD
• SD = 1.45 x MAD
• SD is approximately 50% larger than MAD across samples.
[Figure: SD against MAD for the ten samples, with cup, vma and yap labeled.]

Flagging Outliers
• Suppose we have a set of numbers $Z_i$ ($i = 1, 2, \ldots, n$) such that most of them are independent normal random variables with mean $m$ and variance $\sigma^2$
• Suppose that a relatively small fraction of these numbers are expected to be different from the majority.

Flagging Outliers (continued)
• We need reliable and automatic ways for flagging outliers
• We may use the popular $c = 3$ rule
• But a better approach (especially for large datasets) is to use “c” determined by the equation
$P\left( \max_{1 \le i \le n} |Z_i - m| \le c\,\sigma \right) = 0.999$
to reduce the probability of flagging “wrong outliers”.

Flagging Outliers (continued)
It is easy to verify that:
$c = \Phi^{-1}\left( \dfrac{\sqrt[n]{0.999} + 1}{2} \right)$

Flagging Outliers (continued)
For the Gene-Expression data n = 6068 and so:
$c = \Phi^{-1}\left( \dfrac{\sqrt[6068]{0.999} + 1}{2} \right) = 5.24$
For such large datasets it is better to use $c = 5.24$ to reduce the probability of flagging “wrong genes”.
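The cutoff formula follows because, for independent normal $Z_i$, $P(\max_i |Z_i - m| \le c\,\sigma) = (2\Phi(c) - 1)^n$; setting this equal to 0.999 and solving for $c$ gives the expression above. A minimal Python sketch of that calculation is shown below. The function name `outlier_cutoff` and its `coverage` parameter are illustrative names, not part of the slides, and the sketch assumes exactly independent normal data.

```python
import math
from statistics import NormalDist


def outlier_cutoff(n: int, coverage: float = 0.999) -> float:
    """Cutoff c such that P(max_i |Z_i - m| <= c * sigma) = coverage
    for n independent normal observations (illustrative helper)."""
    # Upper-tail probability 1 - Phi(c) = (1 - coverage**(1/n)) / 2.
    # expm1 keeps this accurate when coverage**(1/n) is very close to 1.
    tail = -math.expm1(math.log(coverage) / n) / 2.0
    # By symmetry of the normal distribution, c = -Phi^{-1}(tail).
    return -NormalDist().inv_cdf(tail)


if __name__ == "__main__":
    # Cutoffs for a few sample sizes, including the n = 6068 genes case.
    for n in (10, 100, 1000, 6068):
        print(n, round(outlier_cutoff(n), 2))
```

The cutoff grows slowly with n, which is why a fixed c = 3 rule flags many ordinary observations when thousands of values are screened at once.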
Flagging Outliers (continued)
We can assume that, for each sample,
$X_i = \mathrm{Red}_i - \mathrm{Green}_i$
are (approximately) independent normal with mean $m = 0$ and unknown variance $\sigma^2$

Flagging Outliers (continued)
• Since sigma is unknown it must be estimated from the data
• Robust estimate: MAD
• Classical estimate: SD
• Because of the outliers, the SD will systematically overestimate sigma

SAMPLE   SD     MAD
cin      0.18   0.11
cup      0.37   0.21
fre      0.21   0.16
mac      0.22   0.13
sod      0.28   0.20
spf      0.27   0.21
vma      0.33   0.20
yap      0.31   0.15
yer      0.09   0.07
ymr      0.22   0.11

Flagging Outliers

SAMPLE   SD     MAD    OUT(SD)   OUT(MAD)
cin      0.18   0.11      9         61
cup      0.37   0.21     22        102
fre      0.21   0.16      7         16
mac      0.22   0.13     23         73
sod      0.28   0.20     15         60
spf      0.27   0.21     20         50
vma      0.33   0.20     27         91
yap      0.31   0.15     28        114
yer      0.09   0.07      7         18
ymr      0.22   0.11     12         32

• ymr has relatively few very large outliers, which drastically inflate the SD
• cup and yap have a large number of moderate outliers, which inflate the SD.

“MAD – SD Outliers” vs. “R = SD/MAD”
• ymr (bottom-right corner) appears as an outlier in this plot
• In this case there are relatively few large outliers, which drastically inflate the Standard Deviation.
• Robust Fit: Diff = -95 + 91 x R
• LS Fit: Diff = -51 + 60 x R
[Figure: OUTLIERS(MAD)-OUTLIERS(SD) against SD/MAD, with the ROBUST and LS fitted lines and ymr labeled.]

BEEF SALES IN USA (1925-1941)
• Beef sales dropped sharply around 1930 and showed a steady increase in 1933-41
• High levels of beef consumption in 1925-27 show up as outliers in the plot.
[Figure: CBE against YEAR, 1925 to 1940, with three points marked OUTLIER.]
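Returning to the SD-versus-MAD comparison in the Flagging Outliers slides: the effect of an inflated SD on outlier counts can be sketched on simulated data. Everything in this Python sketch is an assumption made for illustration, not taken from the slides: the contamination pattern (100 points at ±5 added to 1000 standard normal values), the c = 3 rule, the helper names, and the 1.4826 rescaling of the MAD, which is the usual convention for making it estimate sigma under normality.

```python
import random
from statistics import median


def mad_scale(values):
    """Median absolute deviation about the median, rescaled by 1.4826 so
    that it estimates sigma for normal data (assumed convention)."""
    center = median(values)
    return 1.4826 * median([abs(v - center) for v in values])


def sd_about(values, m=0.0):
    """Classical root-mean-square deviation about a known mean m."""
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5


def count_flagged(values, sigma_hat, c=3.0, m=0.0):
    """Number of values with |z - m| > c * sigma_hat."""
    return sum(1 for v in values if abs(v - m) > c * sigma_hat)


# Hypothetical data: 1000 clean N(0, 1) values plus 100 planted outliers at +/-5.
rng = random.Random(0)
clean = [rng.gauss(0.0, 1.0) for _ in range(1000)]
planted = [5.0 * (-1) ** k for k in range(100)]
data = clean + planted

sd_hat = sd_about(data)    # inflated by the planted outliers
mad_hat = mad_scale(data)  # stays close to the clean-data sigma of 1
flagged_sd = count_flagged(data, sd_hat)
flagged_mad = count_flagged(data, mad_hat)

print("SD estimate:", round(sd_hat, 2), "flagged:", flagged_sd)
print("MAD estimate:", round(mad_hat, 2), "flagged:", flagged_mad)
```

In this setup the planted outliers inflate the SD enough that the SD-based threshold sits beyond them, while the MAD-based threshold still flags them. This is the same mechanism behind the OUT(SD) and OUT(MAD) counts in the table.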