Outlier Detection in Interval Data

A. Pedro Duarte Silva1, Peter Filzmoser2, Paula Brito3,*
1. Faculdade de Economia e Gestão & CEGE, Universidade Católica Portuguesa, Porto, Portugal
2. Institute of Statistics and Mathematical Methods in Economics, Vienna University of Technology, Vienna,
Austria
3. Faculdade de Economia & LIAAD-INESC TEC, Universidade do Porto, Porto, Portugal
* Contact author: mpbrito@fep.up.pt
Keywords: Mahalanobis distance, Modeling interval data, Robust estimation, Symbolic Data
Analysis
In this work we are interested in identifying outliers in multivariate observations consisting
of interval data. The values of an interval-valued variable may be represented by the corresponding
lower and upper bounds or, equivalently, by their mid-points and ranges. Parametric models have
been proposed which rely on multivariate Normal or Skew-Normal distributions for the mid-points
and log-ranges of the interval-valued variables. Different parameterizations of the joint variance-covariance matrix allow taking into account the relation that may or may not exist between
mid-points and log-ranges of the same or different variables (Brito and Duarte Silva, 2012).
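As a minimal sketch of the two equivalent representations (in Python with NumPy, which the abstract itself does not use; the four example intervals are invented for illustration):

```python
import numpy as np

# Hypothetical sample: 4 observations of one interval-valued variable,
# each given by its [lower, upper] bounds.
bounds = np.array([[1.0, 3.0],
                   [2.5, 4.0],
                   [0.5, 5.5],
                   [3.0, 3.5]])

lower, upper = bounds[:, 0], bounds[:, 1]
midpoints = (lower + upper) / 2.0        # centre of each interval
log_ranges = np.log(upper - lower)       # log of each interval's width

# The pair (mid-point, log-range) carries the same information as (lower, upper):
recovered_lower = midpoints - np.exp(log_ranges) / 2.0
recovered_upper = midpoints + np.exp(log_ranges) / 2.0
assert np.allclose(recovered_lower, lower)
assert np.allclose(recovered_upper, upper)
```

Working on the log scale keeps the range positive under any Gaussian model for the transformed variables.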
Here we use the estimates of the joint mean t and covariance matrix C for multivariate outlier
detection. The Mahalanobis distances D based on these estimates indicate how far
individual multivariate interval observations lie from the mean with respect to the overall covariance structure. A critical value based on the Chi-Square distribution allows distinguishing outliers
from regular observations. The outlier diagnostic is particularly interesting when the covariance
between the mid-points and the log-ranges is restricted to be zero. Then, Mahalanobis distances
can be computed separately for mid-points and log-ranges, and the resulting distance-distance plot
identifies outliers that may be due to deviations with respect to the mid-point, with respect to
the range of the interval data, or both. However, if t and C are chosen to be the classical sample
mean vector and covariance matrix, this procedure is not reliable, as D may be strongly affected
by atypical observations. Therefore, the Mahalanobis distances should be computed with robust
estimates of location and scatter.
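The distance-based screening described above can be sketched on ordinary real-valued data (such as the stacked mid-points and log-ranges). The illustration below uses scikit-learn's MinCovDet as one off-the-shelf robust estimator of location and scatter; it stands in for, and is not identical to, the restricted robust estimates proposed here, and the simulated data are invented:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.default_rng(0)
p = 2  # e.g. mid-point and log-range of a single interval-valued variable
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=100)
X[:5] += 6.0  # plant five gross outliers

# Critical value for squared Mahalanobis distances under Gaussianity.
cutoff = chi2.ppf(0.975, df=p)

# Classical estimates: distances can be masked by the outliers themselves.
d2_classical = EmpiricalCovariance().fit(X).mahalanobis(X)

# Robust (MCD) estimates: the outliers stand out clearly.
d2_robust = MinCovDet(random_state=0).fit(X).mahalanobis(X)

outliers = np.flatnonzero(d2_robust > cutoff)
```

Note that scikit-learn's `mahalanobis` method returns squared distances, so they are compared directly against the Chi-Square quantile.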
Many robust estimators for location and covariance have been proposed. The minimum covariance
determinant (MCD) estimator (Rousseeuw, 1984, 1985) uses a subset of the original sample, consisting of the h points in the dataset for which the determinant of the covariance matrix is minimal.
Weighted trimmed likelihood estimators (Hadi and Luceño, 1997) are also based on a sample subset, formed by the h observations that contribute most to the likelihood function. In either case, the
proportion of data points to be used needs to be specified a priori. For multivariate Gaussian data,
the two approaches lead to the same estimators (Hadi and Luceño, 1997; Neykov et al., 2007).
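The defining property of the MCD can be illustrated by brute force on a tiny invented dataset; practical implementations use fast search algorithms rather than the exhaustive enumeration shown here:

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(12, 2))
X[0] = [10.0, 10.0]  # one gross outlier

n, h = len(X), 9  # keep h points; (n - h)/n is the trimming proportion

# MCD by definition: the h-subset whose sample covariance matrix has the
# smallest determinant (exhaustive search is feasible only for tiny n).
best_subset = min(
    combinations(range(n), h),
    key=lambda idx: np.linalg.det(np.cov(X[list(idx)], rowvar=False)),
)

# The gross outlier inflates the determinant of any subset containing it,
# so the optimal subset excludes it.
assert 0 not in best_subset
```

The MCD location and scatter estimates are then the sample mean and (suitably rescaled) sample covariance of `best_subset`.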
In this work we consider the Gaussian model for interval data, and employ the above approach
based on minimum (restricted) covariance determinant estimators, with the correction suggested in Pison et
al. (2002) and with the trimming percentage selected by a two-stage procedure. We evaluate our
proposal in an extensive simulation study for different data structures and outlier contamination
levels, showing that the proposed approach generally outperforms the method based on standard
maximum likelihood estimators. Our methodology is illustrated by an application to a real dataset.
References
Brito, P. and Duarte Silva, A.P. (2012) Modelling interval data with Normal and Skew-Normal
distributions. Journal of Applied Statistics 39 (1), 3–20.
Hadi, A.S., and Luceño, A. (1997) Maximum trimmed likelihood estimators: a unified approach,
examples, and algorithms. Computational Statistics & Data Analysis 25 (3), 251–272.
Neykov, N., Filzmoser, P., Dimova, R. and Neytchev, P. (2007) Robust fitting of mixtures using
the trimmed likelihood estimator. Computational Statistics & Data Analysis 52 (1), 299–308.
Pison, G., Van Aelst, S. and Willems, G. (2002) Small sample corrections for LTS and MCD.
Metrika 55 (1-2), 111–123.
Rousseeuw, P.J. (1984) Least median of squares regression. Journal of the American Statistical
Association 79 (388), 871–880.
Rousseeuw, P.J. (1985) Multivariate estimation with high breakdown point. Mathematical Statistics and Applications 8, 283–297.