
Population and Robust Symbolic Principal Component Analysis
M. Rosário Oliveira¹, Margarida Vilela¹, Rui Valadas², Paulo Salvador³
1. CEMAT and Departamento de Matemática, Instituto Superior Técnico, Universidade de Lisboa, Portugal
2. DEEC and Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, Portugal
3. DETI and Instituto de Telecomunicações, Universidade de Aveiro, Portugal
Contact author: rosario.oliveira@tecnico.ulisboa.pt
Keywords: Principal component analysis, Interval symbolic data, Robust methods, Internet traffic.
Principal component analysis (PCA) is one of the most used statistical methods in the analysis of
real problems. In the symbolic data analysis (SDA) context there have been several proposals to
extend this methodology. The methods CPCA (centers) and VPCA (vertices), pioneers in symbolic
PCA and proposed by Cazes (1997) (vide Billard, 2006), are the best-known examples of this family
of methods. However, in recent years many other alternatives have emerged in the literature (vide
e.g. Wang, 2012).
In this work, we present the population formulations corresponding to three of the symbolic PCA
algorithms for interval-data: method of the centers (CPCA), method of the vertices (VPCA), and
complete information PCA (CIPCA) (Wang, 2012). The theoretical formulations define a general
method which allows substantial improvements on the existing algorithms in terms of time and
number of operations, making them easily applicable to datasets with a large number of symbolic
variables and a high number of objects. Moreover, this formulation enables the definition of the
population symbolic components even when one or more variables are degenerate.
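As a rough illustration of how the centers and vertices methods differ in their input, the sketch below builds the two data matrices to which classical PCA is then applied: the matrix of interval midpoints (CPCA) and the stacked matrix of all hyperrectangle vertices (VPCA). It follows the standard sample descriptions of these methods, not the population formulations of this work; the toy data and all function names are assumptions made for the example.

```python
from itertools import product

def centers(intervals):
    """Midpoint of each interval: one row per object, one column per variable."""
    return [[(lo + hi) / 2 for lo, hi in row] for row in intervals]

def vertices(intervals):
    """All 2**p hyperrectangle vertices of each object, stacked into one matrix."""
    rows = []
    for row in intervals:
        rows.extend(list(v) for v in product(*row))
    return rows

def covariance(matrix):
    """Sample covariance matrix (divisor n - 1) of a list-of-rows matrix."""
    n, p = len(matrix), len(matrix[0])
    means = [sum(r[j] for r in matrix) / n for j in range(p)]
    return [[sum((r[j] - means[j]) * (r[k] - means[k]) for r in matrix) / (n - 1)
             for k in range(p)] for j in range(p)]

# Hypothetical toy data: 3 objects, 2 interval-valued variables (lower, upper).
data = [[(1.0, 3.0), (2.0, 4.0)],
        [(2.0, 6.0), (5.0, 7.0)],
        [(4.0, 8.0), (6.0, 10.0)]]

C = centers(data)            # 3 x 2 matrix used by the centers method
V = vertices(data)           # 12 x 2 matrix used by the vertices method
S_centers = covariance(C)    # its eigenvectors give the CPCA components
S_vertices = covariance(V)   # its eigenvectors give the VPCA components
```

The vertices matrix has n·2^p rows, so building it explicitly becomes expensive as the number of variables p grows; this is the kind of cost that a formulation working directly with covariance-type matrices can avoid.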
Furthermore, analogously to conventional (non-symbolic) data, we have verified that the existence
of atypical observations could distort the sample symbolic principal components and the corresponding
scores. To overcome this problem in the context of SDA, we defined two families of robust methods
for symbolic PCA: one based on robust covariance matrices (Filzmoser, 2011) and another based
on Projection Pursuit (Croux, 2007).
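To give a feel for the projection-pursuit idea, the sketch below searches for a first robust component among the directions that pass through the (robustly centred) data points, keeping the direction whose projected scores have the largest median absolute deviation (MAD). Applying it to interval centers, using the coordinate-wise median as the location estimate, and using the MAD as the scale are all assumptions made for this example; the abstract does not specify how the robust symbolic methods are actually constructed.

```python
from math import sqrt
from statistics import median

def mad(values):
    """Median absolute deviation, a robust measure of spread."""
    m = median(values)
    return median([abs(v - m) for v in values])

def pp_first_direction(points):
    """Unit direction (through a centred data point) maximising the MAD of the scores."""
    p = len(points[0])
    # Assumption: coordinate-wise median as a simple robust location estimate.
    location = [median([row[j] for row in points]) for j in range(p)]
    centred = [[row[j] - location[j] for j in range(p)] for row in points]
    best_dir, best_scale = None, -1.0
    for row in centred:
        norm = sqrt(sum(x * x for x in row))
        if norm == 0.0:
            continue  # a point at the centre defines no direction
        direction = [x / norm for x in row]
        scores = [sum(a * d for a, d in zip(obs, direction)) for obs in centred]
        scale = mad(scores)
        if scale > best_scale:
            best_dir, best_scale = direction, scale
    return best_dir, best_scale

# Hypothetical interval data: five objects along a diagonal plus one outlier.
intervals = [[(0.0, 2.0), (0.0, 2.0)],
             [(1.0, 3.0), (1.0, 3.0)],
             [(2.0, 4.0), (2.0, 4.0)],
             [(3.0, 5.0), (3.0, 5.0)],
             [(4.0, 6.0), (4.0, 6.0)],
             [(2.0, 4.0), (29.0, 31.0)]]   # atypical object
centers = [[(lo + hi) / 2 for lo, hi in row] for row in intervals]

direction, scale = pp_first_direction(centers)
```

With a non-robust scale such as the standard deviation, the same search would be pulled towards the outlying object; with the MAD, the selected direction stays close to the diagonal along which the bulk of the objects lie.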
To make these new statistical tools easy to use in the analysis of real problems, we also developed a
web application, using the Shiny web application framework for R, which includes several tools to
analyse, represent and perform symbolic (classical and robust) PCA on interval data in an interactive manner. In this app it is possible to compare the classical symbolic PCA methods with all the
new robust approaches proposed in this work; its operation will be illustrated with telecommunications data.
For conventional data, PCA is frequently used as an intermediate step in the analysis of complex
problems (Johnson, 2007), and is commonly used as input for other multivariate methods. To
pursue this goal, we designed R routines to make conversions between different representations
of interval-valued data, making it easier to use several R SDA packages consecutively, in the same
analysis. These packages were developed independently and each one requires reading the data in
a specific format.
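The kind of conversion involved can be sketched as follows. The routines described above are written in R; this Python fragment only illustrates the idea, and the three layouts it uses (lower/upper pairs, center/range pairs, and a flat record with ".min"/".max" keys) are hypothetical and are not the formats of any particular package.

```python
def bounds_to_center_range(intervals):
    """(lower, upper) pairs -> (center, range) pairs."""
    return [[((lo + hi) / 2, hi - lo) for lo, hi in row] for row in intervals]

def center_range_to_bounds(pairs):
    """(center, range) pairs -> (lower, upper) pairs."""
    return [[(c - r / 2, c + r / 2) for c, r in row] for row in pairs]

def to_wide(intervals, names):
    """One flat dict per object, with '<name>.min' and '<name>.max' keys."""
    records = []
    for row in intervals:
        record = {}
        for name, (lo, hi) in zip(names, row):
            record[name + ".min"] = lo
            record[name + ".max"] = hi
        records.append(record)
    return records

# Hypothetical data: 2 objects described by 2 interval-valued variables.
data = [[(1.0, 3.0), (2.0, 4.0)],
        [(2.0, 6.0), (5.0, 7.0)]]

as_center_range = bounds_to_center_range(data)
round_trip = center_range_to_bounds(as_center_range)
wide = to_wide(data, ["x", "y"])
```

Checking that a round trip reproduces the original bounds is a cheap safeguard when chaining tools that each expect a different layout.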
Acknowledgment
This work was partially funded by Fundação para a Ciência e a Tecnologia (FCT) through projects
PTDC/EEI-TEL/5708/2014 and UID/Multi/04621/2013.
References
Billard, L., Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining.
John Wiley and Sons, Chichester.
Cazes, P., Chouakria, A., Diday, E., Schektman, Y. (1997). Extensions de l’analyse en composantes
principales à des données de type intervalle. Revue de Statistique Appliquée 45(3), 5-24.
Croux, C., Filzmoser, P., Oliveira, M. R. (2007). Algorithms for Projection-Pursuit robust principal component analysis. Chemometr. Intell. Lab. 87(2), 218–225.
Filzmoser, P., Todorov, V. (2011). Review of robust multivariate statistical methods in high dimension. Anal. Chim. Acta 705, 2–14.
Johnson, R. A., Wichern, D. W. (2007). Applied Multivariate Statistical Analysis. Prentice-Hall,
Inc., Upper Saddle River, NJ, USA.
Wang, H., Guan, R., Wu, J. (2012). CIPCA: Complete-Information-based Principal Component
Analysis for Interval-valued Data. Neurocomputing 86, 158–169.