Outlier Detection in Interval Data

A. Pedro Duarte Silva (1), Peter Filzmoser (2), Paula Brito (3,*)

1. Faculdade de Economia e Gestão & CEGE, Universidade Católica Portuguesa, Porto, Portugal
2. Institute of Statistics and Mathematical Methods in Economics, Vienna University of Technology, Vienna, Austria
3. Faculdade de Economia & LIAAD-INESC TEC, Universidade do Porto, Porto, Portugal

* Contact author: mpbrito@fep.up.pt

Keywords: Mahalanobis distance, Modeling interval data, Robust estimation, Symbolic Data Analysis

In this work we are interested in identifying outliers in multivariate observations consisting of interval data. The values of an interval-valued variable may be represented by the corresponding lower and upper bounds or, equivalently, by their mid-points and ranges. Parametric models have been proposed that rely on multivariate Normal or Skew-Normal distributions for the mid-points and log-ranges of the interval-valued variables. Different parameterizations of the joint variance-covariance matrix allow taking into account the relations that may or may not exist between the mid-points and log-ranges of the same or different variables (Brito and Duarte Silva, 2012).

Here we use estimates t and C of the joint mean vector and covariance matrix for multivariate outlier detection. The Mahalanobis distances D based on these estimates indicate how far individual multivariate interval observations lie from the mean with respect to the overall covariance structure. A critical value based on the Chi-Square distribution allows distinguishing outliers from regular observations. The outlier diagnostics are particularly interesting when the covariance between the mid-points and the log-ranges is restricted to be zero. Then, Mahalanobis distances can be computed separately for mid-points and log-ranges, and the resulting distance-distance plot identifies outliers that may be due to deviations in the mid-points, in the ranges of the interval data, or both.

However, if t and C are chosen to be the classical sample mean vector and covariance matrix, this procedure is not reliable, since D may be strongly affected by atypical observations. Therefore, the Mahalanobis distances should be computed with robust estimates of location and scatter. Many robust estimators of location and covariance have been proposed. The minimum covariance determinant (MCD) estimator (Rousseeuw, 1984, 1985) uses a subset of the original sample, consisting of the h points in the dataset for which the determinant of the covariance matrix is minimal. Weighted trimmed likelihood estimators (Hadi and Luceño, 1997) are also based on a sample subset, formed by the h observations that contribute most to the likelihood function. In either case, the proportion of data points to be used needs to be specified a priori. For multivariate Gaussian data, the two approaches lead to the same estimators (Hadi and Luceño, 1997; Neykov et al., 2007).

In this work we consider the Gaussian model for interval data and employ the above approach based on minimum (restricted) covariance determinant estimators, with the correction suggested by Pison et al. (2002) and with the trimming percentage selected by a two-stage procedure. We evaluate our proposal with an extensive simulation study for different data structures and outlier contamination levels, showing that the proposed approach generally outperforms the method based on simple maximum likelihood estimators.
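To make the detection rule concrete, the following Python sketch computes robust Mahalanobis distances on the mid-point / log-range representation and flags observations beyond a Chi-Square cutoff. It is only a minimal illustration under simplifying assumptions: the function interval_outliers, its arguments (lower, upper, alpha) and the fixed support fraction are hypothetical, and the unrestricted MCD of scikit-learn is used in place of the restricted-configuration trimmed maximum likelihood estimation, the small-sample correction of Pison et al. (2002), and the two-stage choice of the trimming percentage described above.

import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def interval_outliers(lower, upper, alpha=0.975):
    """Flag outlying interval-valued observations via robust Mahalanobis distances.

    lower, upper: (n, p) arrays with the bounds of p interval-valued variables,
    with upper > lower in every cell (non-degenerate intervals).
    """
    mid = (lower + upper) / 2.0            # mid-points
    logr = np.log(upper - lower)           # log-ranges

    # Joint 2p-dimensional representation; robust location t and scatter C
    # are obtained here from the plain (unrestricted) MCD.
    X = np.hstack([mid, logr])
    mcd = MinCovDet(support_fraction=0.75).fit(X)
    d2 = mcd.mahalanobis(X)                # squared robust Mahalanobis distances

    cutoff = chi2.ppf(alpha, df=X.shape[1])  # Chi-Square critical value
    outlier = d2 > cutoff

    # With the covariance between mid-points and log-ranges restricted to zero,
    # distances can be computed separately on each block; plotting d2_mid
    # against d2_logr gives the distance-distance plot mentioned above.
    d2_mid = MinCovDet().fit(mid).mahalanobis(mid)
    d2_logr = MinCovDet().fit(logr).mahalanobis(logr)
    return d2, outlier, d2_mid, d2_logr

With lower and upper holding the interval bounds of a sample, a call such as d2, flag, d2_mid, d2_logr = interval_outliers(lower, upper) returns the joint squared distances, the outlier flags, and the two blockwise distances used for the distance-distance plot.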
Our methodology is illustrated by an application to a real dataset.

References

Brito, P. and Duarte Silva, A.P. (2012). Modelling interval data with Normal and Skew-Normal distributions. Journal of Applied Statistics 39 (1), 3–20.
Hadi, A.S. and Luceño, A. (1997). Maximum trimmed likelihood estimators: a unified approach, examples, and algorithms. Computational Statistics & Data Analysis 25 (3), 251–272.
Neykov, N., Filzmoser, P., Dimova, R. and Neytchev, P. (2007). Robust fitting of mixtures using the trimmed likelihood estimator. Computational Statistics & Data Analysis 52 (1), 299–308.
Pison, G., Van Aelst, S. and Willems, G. (2002). Small sample corrections for LTS and MCD. Metrika 55 (1-2), 111–123.
Rousseeuw, P.J. (1984). Least median of squares regression. Journal of the American Statistical Association 79 (388), 871–880.
Rousseeuw, P.J. (1985). Multivariate estimation with high breakdown point. Mathematical Statistics and Applications 8, 283–297.