Factor Analysis of Interval Data
Paula Cheira1,3, Paula Brito2,3,*, A. Pedro Duarte Silva4
1. Instituto Politécnico de Viana do Castelo, Viana do Castelo, Portugal
2. Faculdade de Economia, Universidade do Porto, Porto, Portugal
3. LIAAD - INESC TEC, Universidade do Porto, Porto, Portugal
4. Faculdade de Economia e Gestão & CEGE, Universidade Católica Portuguesa, Porto, Portugal
* Contact author: mpbrito@fep.up.pt
Keywords: Factor analysis, Interval data, Mallows distance, Symbolic data analysis
When a large number of variables is measured on each statistical unit, the dependence structure among those variables is often of interest. The orthogonal factor analysis model assumes that there is
a smaller set of uncorrelated variables, called factors, that explains the relations between the observed variables. These new variables are expected to give a better understanding of the data
being analyzed; moreover, they may be used in subsequent analyses (Johnson and Wichern, 2002).
In this work we present a factor analysis model for symbolic data, focusing on the particular case
of interval-valued variables, i.e., where the statistical units are described by variables whose values are intervals of R (Billard and Diday, 2006; Bock and Diday, 2000). The method describes the correlation structure
among the measured interval-valued variables in terms of a few underlying, but unobservable, uncorrelated interval-valued variables. Two cases are considered for the distribution assumed within
each observed interval: the Uniform distribution and the Triangular distribution.
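
To make the two within-interval assumptions concrete, the following is a minimal sketch of the quantile functions they induce on an interval [lower, upper]. It is an illustration only, not the authors' implementation: the function names and the example interval are hypothetical, and the Triangular case is assumed here to be symmetric (mode at the midpoint), which the abstract does not state.

```python
import numpy as np

def uniform_quantile(t, lower, upper):
    """Quantile function of the Uniform distribution on [lower, upper]."""
    t = np.asarray(t, dtype=float)
    return lower + (upper - lower) * t

def triangular_quantile(t, lower, upper):
    """Quantile function of the symmetric Triangular distribution on
    [lower, upper]. The mode is placed at the midpoint, which is an
    assumption of this sketch rather than something stated in the text."""
    t = np.asarray(t, dtype=float)
    width = upper - lower
    # Left branch: inverts F(x) = 2 (x - lower)^2 / width^2 for t <= 1/2.
    left = lower + width * np.sqrt(t / 2.0)
    # Right branch: inverts F(x) = 1 - 2 (upper - x)^2 / width^2 for t > 1/2.
    right = upper - width * np.sqrt((1.0 - t) / 2.0)
    return np.where(t <= 0.5, left, right)

# Hypothetical observed interval [10, 20], evaluated at a few probability levels.
t = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
print(uniform_quantile(t, 10.0, 20.0))
print(triangular_quantile(t, 10.0, 20.0))
```

Both functions map 0, 1/2 and 1 to the lower bound, the midpoint and the upper bound; they differ in how quickly the quantiles move away from the bounds, with the Triangular case concentrating mass near the centre.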
In our proposal, factors are extracted by principal component analysis, performed on the correlation matrix of the interval-valued
variables (Billard and Diday, 2006). To estimate the factor scores, two approaches are
considered, both inspired by methods for real-valued data: the Bartlett and the Anderson-Rubin
methods (DiStefano et al., 2009). In both cases, the estimated values are obtained by solving an optimization problem whose objective is to minimize the weighted squared Mallows distance
between quantile functions. In the first method, the factor scores are highly correlated with their
corresponding factor and weakly (or not at all) with the other factors. However, the estimated factor
scores of different factors may still be correlated with each other. In the second method, the objective function
is adapted to ensure that the factor scores are uncorrelated with each other.
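
As an illustration of the criterion, the squared Mallows distance between two quantile functions Q_a and Q_b is the integral of (Q_a(t) - Q_b(t))^2 over t in [0, 1]. The sketch below computes the unweighted version for two intervals; the weights used in the proposed method are not given in this abstract, so they are left out. For intervals with centres c and half-ranges r, the distance reduces to (c_a - c_b)^2 + k (r_a - r_b)^2, with k = 1/3 under the Uniform assumption and k = 1/6 under a symmetric Triangular assumption (the symmetry is an assumption of this sketch). All function names and the example intervals are hypothetical.

```python
import numpy as np

def mallows_sq_closed_form(interval_a, interval_b, dist="uniform"):
    """Unweighted squared Mallows distance between two intervals, assuming
    the same within-interval distribution for both ("uniform", or a
    symmetric "triangular")."""
    (la, ua), (lb, ub) = interval_a, interval_b
    dc = (la + ua) / 2.0 - (lb + ub) / 2.0  # difference of centres
    dr = (ua - la) / 2.0 - (ub - lb) / 2.0  # difference of half-ranges
    factor = {"uniform": 1.0 / 3.0, "triangular": 1.0 / 6.0}[dist]
    return dc ** 2 + factor * dr ** 2

def mallows_sq_numeric(q_a, q_b, n=200_000):
    """Midpoint-rule approximation of the integral of
    (q_a(t) - q_b(t))^2 over [0, 1]."""
    t = (np.arange(n) + 0.5) / n
    return np.mean((q_a(t) - q_b(t)) ** 2)

def uniform_q(lower, upper):
    """Return the quantile function of Uniform[lower, upper]."""
    return lambda t: lower + (upper - lower) * t

# Two hypothetical intervals; the closed form and the numerical
# integral should agree under the Uniform assumption.
a, b = (10.0, 20.0), (12.0, 16.0)
exact = mallows_sq_closed_form(a, b, "uniform")
approx = mallows_sq_numeric(uniform_q(*a), uniform_q(*b))
print(exact, approx)
```

The closed form makes clear that the distance penalizes differences in both location (centres) and size (half-ranges), which is what an optimization over interval-valued factor scores would trade off.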
The applicability of the proposed approach is illustrated using data on the characteristics of cars of different makes
and models.
References
Billard, L. & Diday, E. (2006). Symbolic data analysis: Conceptual statistics and data mining. John Wiley and Sons, Ltd, Chichester.
Bock, H.-H. & Diday, E., eds. (2000). Analysis of symbolic data. Exploratory methods for extracting statistical information from complex data. Springer-Verlag, Berlin-Heidelberg.
DiStefano, C., Zhu, M. & Mîndrilă, D. (2009). Understanding and using factor scores: Considerations for the applied researcher. Practical Assessment, Research & Evaluation 14, 1–11.
Johnson, R. A. & Wichern, D. W. (2002). Applied multivariate statistical analysis. Prentice-Hall, New Jersey.