Linear Discriminant Analysis for Interval and Histogram Data

Sónia Dias1,2,*, Paula Amaral3, Paula Brito1,4

1. LIAAD/INESC-TEC, Porto, Portugal
2. School of Technology and Management, Polytechnic Institute of Viana do Castelo, Portugal
3. CMA and Faculty of Science and Engineering, University Nova de Lisboa, Portugal
4. Faculty of Economics, University of Porto, Portugal
* Contact author: sdias@estg.ipvc.pt

Keywords: linear discriminant analysis; quantile functions; Mallows distance; fractional quadratic problems

In recent years, Symbolic Data Analysis has developed concepts and methods that allow statistical studies with histogram-valued and interval-valued variables. Nonetheless, there are only a few studies of discriminant analysis under the symbolic framework, and these focus only on interval-valued variables (Duarte Silva and Brito, 2006, forthcoming). Dias and Brito (2015) proposed the Distribution and Symmetric Distributions (DSD) linear regression model, which allows predicting distributions from other distributions, represented by quantile functions. From the DSD Model, we define a discriminant function for the classification of a set of individuals into two classes. For each individual, a linear combination obtained as in the DSD Model is considered, which defines a score for the individual in the form of a quantile function. Irpino and Verde (2006) proved that total inertia, defined with the Mallows distance and with respect to a barycentric histogram, may be decomposed into within-class and between-class inertia, according to the Huygens theorem. From this decomposition, and similarly to the classical linear discriminant method, it is possible to deduce that the coefficients of the discriminant function are obtained by maximizing the ratio of the between-class to the within-class inertia. Obtaining these coefficients requires solving a constrained fractional quadratic problem.
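
As an illustration only, and not the authors' implementation, the quantities named above can be sketched for the interval-valued special case. The sketch assumes that each interval is uniformly distributed, so that its quantile function is c + r(2t - 1) for centre c and half-range r, and that the score is a DSD-style combination sum_j a_j Psi_j(t) - sum_j b_j Psi_j(1 - t) with non-negative coefficients. Under these assumptions every score is itself an interval, and the squared Mallows distance between two intervals reduces to (c1 - c2)^2 + (r1 - r2)^2 / 3. All function names, the toy data and the coefficient values are hypothetical.

```python
import numpy as np

def score(c, r, a, b):
    """Centre and half-range of each individual's score interval.

    Assumption: DSD-style combination with a, b >= 0 applied to uniform
    intervals, which gives centre c @ (a - b) and half-range r @ (a + b).
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    return c @ (a - b), r @ (a + b)

def mallows_sq(c1, r1, c2, r2):
    """Squared Mallows distance between uniform intervals (centre, half-range)."""
    return (c1 - c2) ** 2 + (r1 - r2) ** 2 / 3.0

def inertias(sc, sr, labels):
    """Total, within-class and between-class inertia of the interval scores."""
    gc, gr = sc.mean(), sr.mean()            # global barycentric score
    total = mallows_sq(sc, sr, gc, gr).sum()
    within = between = 0.0
    for k in np.unique(labels):
        m = labels == k
        kc, kr = sc[m].mean(), sr[m].mean()  # barycentric score of class k
        within += mallows_sq(sc[m], sr[m], kc, kr).sum()
        between += m.sum() * mallows_sq(kc, kr, gc, gr)
    return total, within, between

def classify(sc, sr, labels):
    """Assign each score to the class whose barycentric score is closest."""
    classes = np.unique(labels)
    dists = np.column_stack([
        mallows_sq(sc, sr, sc[labels == k].mean(), sr[labels == k].mean())
        for k in classes
    ])
    return classes[dists.argmin(axis=1)]

# Hypothetical toy data: 4 individuals, 2 interval-valued variables.
c = np.array([[1.0, 2.0], [1.5, 2.5], [4.0, 1.0], [4.5, 0.5]])  # centres
r = np.array([[0.5, 1.0], [0.4, 0.8], [1.0, 0.2], [1.2, 0.3]])  # half-ranges
labels = np.array([0, 0, 1, 1])
a, b = [1.0, 0.0], [0.0, 1.0]  # arbitrary non-negative coefficients

sc, sr = score(c, r, a, b)
total, within, between = inertias(sc, sr, labels)
ratio = between / within       # criterion to be maximized over a and b
predicted = classify(sc, sr, labels)
print(total, within + between, ratio, predicted)
```

In this sketch the centre and half-range of each score are linear in the coefficients, so both inertias are quadratic forms in (a, b). Maximizing `ratio` subject to non-negative coefficients is therefore a constrained fractional quadratic problem of the kind referred to above.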
The solver BARON is used to solve this difficult optimization problem. A solution is obtained, but a certificate of optimality is only possible using conic relaxation techniques (Amaral et al., 2014). To classify an individual into one of the two groups, the Mallows distance between the individual's score and the score obtained for the barycentric histogram of each class is computed. The observation is then assigned to the closest class. The proposed linear discriminant method may be particularized to interval-valued variables, which constitute a special case of histogram-valued variables. Examples illustrate the behavior of the method.

References

Dias, S. and Brito, P. (2015). Linear Regression Model with Histogram-Valued Variables. Statistical Analysis and Data Mining: The ASA Data Science Journal 8 (2), 75–113.
Irpino, A. and Verde, R. (2006). A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. In: Data Science and Classification, Proc. IFCS'2006, Batagelj et al. (eds.), Ljubljana, Slovenia, 185–192.
Duarte Silva, A.P. and Brito, P. (2006). Linear Discriminant Analysis for Interval Data. Computational Statistics 21 (2), 289–308.
Duarte Silva, A.P. and Brito, P. (forthcoming). Discriminant Analysis of Interval Data: An Assessment of Parametric and Distance-Based Approaches. Journal of Classification.
Amaral, P., Bomze, I. and Júdice, J. (2014). Copositivity and constrained fractional quadratic problems. Mathematical Programming 146 (1–2), 325–350.