Factor Analysis of Interval Data

Paula Cheira1,3, Paula Brito2,3,*, A. Pedro Duarte Silva4

1. Instituto Politécnico de Viana do Castelo, Viana do Castelo, Portugal
2. Faculdade de Economia, Universidade do Porto, Porto, Portugal
3. LIAAD - INESC TEC, Universidade do Porto, Porto, Portugal
4. Faculdade de Economia e Gestão & CEGE, Universidade Católica Portuguesa, Porto, Portugal
* Contact author: mpbrito@fep.up.pt

Keywords: Factor analysis, Interval data, Mallows distance, Symbolic data analysis

When a large number of variables is measured on each statistical unit, the study of their dependence structure may be of interest. The orthogonal model of factor analysis assumes that there is a smaller set of uncorrelated variables, called factors, that explain the relations between the observed variables. The new variables are expected to give a better understanding of the data being analyzed; moreover, they may be used in subsequent analyses (Johnson & Wichern, 2002).

In this work we present a factor analysis model for symbolic data, focusing on the particular case of interval-valued variables, i.e., where the statistical units are described by variables whose values are intervals of ℝ (Billard & Diday, 2006; Bock & Diday, 2000). The method describes the correlation structure among the measured interval-valued variables in terms of a few underlying, but unobservable, uncorrelated interval-valued variables. Two cases are considered for the distribution assumed within each observed interval: the Uniform distribution and the Triangular distribution.

In our proposal, factors are extracted by principal component analysis, performed on the correlation matrix of the interval-valued variables (Billard & Diday, 2006). To estimate the factor scores, two approaches are considered, both inspired by methods for real-valued data: the Bartlett and the Anderson-Rubin methods (DiStefano et al., 2009).
In both cases, the estimated values are obtained by solving an optimization problem whose objective is to minimize the weighted squared Mallows distance between quantile functions. In the first method, the factor scores are highly correlated with their corresponding factor and weakly (or not at all) correlated with the other factors. However, the estimated scores of different factors may still be correlated with each other. In the second method, the function to be minimized is adapted to ensure that the factor scores are themselves uncorrelated with each other.

The applicability of the method is illustrated using data on the characteristics of cars of different makes and models.

References

Billard, L. & Diday, E. (2006). Symbolic data analysis: Conceptual statistics and data mining. John Wiley and Sons, Ltd, Chichester.
Bock, H.-H. & Diday, E., eds. (2000). Analysis of symbolic data. Exploratory methods for extracting statistical information from complex data. Springer-Verlag, Berlin-Heidelberg.
DiStefano, C., Zhu, M. & Mîndrilă, D. (2009). Understanding and using factor scores: Considerations for the applied researcher. Practical Assessment, Research & Evaluation 14, 1–11.
Johnson, R. A. & Wichern, D. W. (2002). Applied multivariate statistical analysis. Prentice-Hall, New Jersey.
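
Illustrative note. The criterion named in the abstract, the squared Mallows distance between quantile functions, has a simple closed form for two intervals under the distributions mentioned. The Python sketch below is not the authors' implementation; it only illustrates the unweighted distance between two single intervals. It assumes a symmetric Triangular distribution (the abstract does not state the shape), and the names `Interval`, `squared_mallows` and `squared_mallows_numeric` are hypothetical. With c the midpoint and r the half-range, the squared distance is (c1 - c2)^2 + k (r1 - r2)^2, where k = 1/3 for the Uniform case and k = 1/6 for the symmetric Triangular case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A closed interval [lower, upper] (hypothetical helper type)."""
    lower: float
    upper: float

    @property
    def center(self) -> float:
        return (self.lower + self.upper) / 2.0

    @property
    def half_range(self) -> float:
        return (self.upper - self.lower) / 2.0


# Variance of each distribution after rescaling to [-1, 1].
# "triangular" means SYMMETRIC triangular here (an assumption).
_SHAPE_CONSTANT = {"uniform": 1.0 / 3.0, "triangular": 1.0 / 6.0}


def squared_mallows(a: Interval, b: Interval, distribution: str = "uniform") -> float:
    """Closed-form squared Mallows (L2 Wasserstein) distance between two intervals."""
    k = _SHAPE_CONSTANT[distribution]
    return (a.center - b.center) ** 2 + k * (a.half_range - b.half_range) ** 2


def squared_mallows_numeric(a: Interval, b: Interval, n: int = 10_000) -> float:
    """Numerical cross-check for the Uniform case.

    Integrates the squared difference of the two quantile functions
    q(t) = lower + t * (upper - lower) over [0, 1] with the midpoint rule.
    """
    total = 0.0
    for i in range(n):
        t = (i + 0.5) / n
        qa = a.lower + t * (a.upper - a.lower)
        qb = b.lower + t * (b.upper - b.lower)
        total += (qa - qb) ** 2
    return total / n


if __name__ == "__main__":
    x = Interval(0.0, 2.0)
    y = Interval(1.0, 5.0)
    print(squared_mallows(x, y, "uniform"))
    print(squared_mallows(x, y, "triangular"))
    print(squared_mallows_numeric(x, y))
```

The closed form separates a location term (midpoints) from a size term (half-ranges), which is why the choice of within-interval distribution only changes the constant multiplying the half-range difference. The weights in the weighted criterion are not specified in the abstract and are therefore left out of this sketch.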