Linear Discriminant Analysis
for Interval and Histogram Data
Sónia Dias1,2,*, Paula Amaral3, Paula Brito1,4
1. LIAAD/INESC-TEC, Porto, Portugal
2. School of Technology and Management, Polytechnic Institute of Viana do Castelo, Portugal
3. CMA and Faculty of Science and Engineering, University Nova de Lisboa, Portugal
4. Faculty of Economics, University of Porto, Portugal
* Contact author: sdias@estg.ipvc.pt
Keywords: linear discriminant analysis; quantile functions; Mallows distance; fractional quadratic problems
In recent years, Symbolic Data Analysis has developed concepts and methods that allow statistical studies with histogram-valued and interval-valued variables. Nonetheless, there are only a few studies of discriminant analysis within the symbolic framework, and these focus only on interval-valued variables (Duarte Silva and Brito, 2006, forthcoming).
Dias and Brito (2015) proposed the Distribution and Symmetric Distributions (DSD) linear regression model, which allows predicting distributions from other distributions, represented by quantile functions. From the DSD Model, we define a discriminant function for classifying a set of individuals into two classes. For each individual, a linear combination obtained as in the DSD Model is considered, which defines the score of the individual in the form of a quantile function.
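A minimal sketch of how such a score could be evaluated numerically, assuming the DSD form in which each variable contributes its quantile function Ψ(t) and the quantile function of the symmetric distribution, -Ψ(1 - t), both with non-negative coefficients. The grid discretization, the omission of any constant term, and the names `interval_quantile`, `symmetric` and `dsd_score` are illustrative assumptions, not taken from the original work.

```python
def interval_quantile(lower, upper, grid_size=100):
    """Quantile function of an interval, assuming a uniform distribution
    inside it, sampled at the grid midpoints t_k = (k + 0.5) / grid_size."""
    return [lower + (upper - lower) * (k + 0.5) / grid_size
            for k in range(grid_size)]


def symmetric(q):
    """Quantile function t -> -q(1 - t) of the symmetric distribution.
    On a midpoint grid this is the reversed, negated list."""
    return [-value for value in reversed(q)]


def dsd_score(quantiles, alphas, betas):
    """Score S(t) = sum_j [alpha_j * q_j(t) + beta_j * (-q_j(1 - t))].
    Non-negative coefficients keep S non-decreasing, so S is again
    a quantile function (assumed DSD-style constraint)."""
    if any(c < 0 for c in list(alphas) + list(betas)):
        raise ValueError("coefficients must be non-negative")
    size = len(quantiles[0])
    score = [0.0] * size
    for q, alpha, beta in zip(quantiles, alphas, betas):
        sym = symmetric(q)
        for k in range(size):
            score[k] += alpha * q[k] + beta * sym[k]
    return score


# One interval-valued variable observed as [0, 1], on a coarse 4-point grid.
q = interval_quantile(0.0, 1.0, grid_size=4)
score = dsd_score([q], alphas=[2.0], betas=[1.0])
print(score)
```

The non-negativity check is the important design point in this sketch: a linear combination with a negative weight on Ψ(t) could be decreasing in t and would no longer represent a distribution.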
Irpino and Verde (2006) proved that the total inertia, defined with the Mallows distance and with respect to a barycentric histogram, may be decomposed into within-class and between-class inertia, according to the Huygens theorem. From this decomposition, and similarly to the classical linear discriminant method, it follows that the coefficients of the discriminant function are obtained by maximizing the ratio of the between-class to the within-class inertia. Obtaining these coefficients requires solving a constrained fractional quadratic problem. The solver BARON is used to solve this difficult optimization problem. A solution is obtained, but a certificate of optimality is only possible using conic relaxation techniques (Amaral et al., 2014).
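The decomposition and the ratio being maximized can be illustrated numerically. The sketch below assumes scores are quantile functions sampled on a common uniform grid, so that the squared Mallows distance is approximated by the mean squared difference and the barycenter by the pointwise mean. It only evaluates the objective for given scores; it does not solve the optimization problem, and the function names are hypothetical.

```python
def mallows_sq(p, q):
    """Squared Mallows distance between two quantile functions sampled on
    the same uniform grid: approximates the integral of (p(t) - q(t))^2."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def barycenter(scores):
    """Pointwise mean of the sampled quantile functions."""
    n = len(scores)
    return [sum(column) / n for column in zip(*scores)]


def inertias(scores, labels):
    """Return (between, within, total) inertia of the scores."""
    overall = barycenter(scores)
    between = 0.0
    within = 0.0
    for label in sorted(set(labels)):
        members = [s for s, l in zip(scores, labels) if l == label]
        centre = barycenter(members)
        # Between-class term is weighted by the class size.
        between += len(members) * mallows_sq(centre, overall)
        within += sum(mallows_sq(s, centre) for s in members)
    total = sum(mallows_sq(s, overall) for s in scores)
    return between, within, total


# Four toy scores on a 2-point grid, two per class.
scores = [[0.0, 1.0], [0.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
labels = ["A", "A", "B", "B"]
between, within, total = inertias(scores, labels)
print(between, within, total, between / within)
```

In this toy example the total inertia equals the sum of the between-class and within-class terms, as the Huygens theorem states. Because each score is linear in the coefficients, both inertias are quadratic in them, which is why their ratio leads to a fractional quadratic problem.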
To classify an individual into one of the two groups, the Mallows distance between the score of the individual and the score obtained for the barycentric histogram of each class is computed. The observation is then assigned to the closest class.
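The assignment rule can be sketched as a nearest-score search. The example assumes that the score of the individual and the scores of the two class barycentric histograms are already available as quantile functions sampled on the same uniform grid; `classify` and the toy values are illustrative only.

```python
def mallows_sq(p, q):
    """Squared Mallows distance between two quantile functions sampled on
    the same uniform grid: approximates the integral of (p(t) - q(t))^2."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def classify(score, class_scores):
    """Return the label whose barycentric score is closest to `score`.
    `class_scores` maps each class label to the score obtained for
    that class's barycentric histogram."""
    return min(class_scores,
               key=lambda label: mallows_sq(score, class_scores[label]))


# Hypothetical barycentric scores for the two classes.
class_scores = {"A": [0.0, 2.0], "B": [5.0, 6.0]}
print(classify([1.0, 2.5], class_scores))
print(classify([4.0, 6.5], class_scores))
```

Minimizing the squared distance gives the same assignment as minimizing the distance itself, so the square root is not needed here.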
The proposed linear discriminant method may be particularized to interval-valued variables, which constitute a special case of histogram-valued variables.
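To see why intervals are a special case, consider the quantile function of a histogram under the usual assumption that values are uniformly distributed within each subinterval: it is piecewise linear, and an interval is simply a histogram with a single subinterval of probability 1. The representation of a histogram as a list of `(lower, upper, probability)` triples is an assumption made for this sketch.

```python
def histogram_quantile(bins, t):
    """Quantile function of a histogram at level t in [0, 1].
    `bins` is an ordered list of (lower, upper, probability) triples whose
    probabilities sum to 1; values are assumed uniform within each bin."""
    cumulative = 0.0
    for lower, upper, prob in bins:
        if prob > 0 and t <= cumulative + prob:
            # Linear interpolation inside the bin that contains level t.
            return lower + (upper - lower) * (t - cumulative) / prob
        cumulative += prob
    # Guard against rounding in the cumulative sum: return the upper bound.
    return bins[-1][1]


histogram = [(0.0, 2.0, 0.5), (2.0, 6.0, 0.5)]
interval = [(1.0, 3.0, 1.0)]  # the interval [1, 3] as a one-bin histogram

print(histogram_quantile(histogram, 0.25))
print(histogram_quantile(histogram, 0.75))
print(histogram_quantile(interval, 0.5))
```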
Examples illustrate the behavior of the method.
References
Dias, S. and Brito, P. (2015). Linear Regression Model with Histogram-Valued Variables. Statistical Analysis and Data Mining: The ASA Data Science Journal 8 (2), 75–113.
Irpino, A. and Verde, R. (2006). A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. In: Data Science and Classification, Proc. IFCS’2006, Batagelj et al. (eds.), Ljubljana, Slovenia, 185–192.
Duarte Silva, A.P. and Brito, P. (2006). Linear Discriminant Analysis for Interval Data. Computational Statistics 21 (2), 289–308.
Duarte Silva, A.P. and Brito, P. (forthcoming). Discriminant Analysis of Interval Data: An Assessment of Parametric and Distance-Based Approaches. Journal of Classification.
Amaral, P., Bomze, I. and Júdice, J. (2014). Copositivity and constrained fractional quadratic problems. Mathematical Programming 146 (1–2), 325–350.