Population and Robust Symbolic Principal Component Analysis

M. Rosário Oliveira1,*, Margarida Vilela1, Rui Valadas2, Paulo Salvador3

1. CEMAT and Departamento de Matemática, Instituto Superior Técnico, Universidade de Lisboa, Portugal
2. DEEC and Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, Portugal
3. DETI and Instituto de Telecomunicações, Universidade de Aveiro, Portugal
* Contact author: rosario.oliveira@tecnico.ulisboa.pt

Keywords: Principal component analysis, Internet traffic, Interval symbolic data, Robust methods

Principal component analysis (PCA) is one of the most widely used statistical methods in the analysis of real problems. In the symbolic data analysis (SDA) context there have been several proposals to extend this methodology. The methods CPCA (centers) and VPCA (vertices), pioneers in symbolic PCA and proposed by Cazes et al. (1997) (see Billard and Diday, 2006), are the best-known examples of this family of methods. However, in recent years many other alternatives have emerged in the literature (see e.g. Wang et al., 2012).

In this work, we present the population formulations corresponding to three of the symbolic PCA algorithms for interval data: the method of the centers (CPCA), the method of the vertices (VPCA), and complete-information PCA (CIPCA) (Wang et al., 2012). The theoretical formulations define a general method that allows substantial improvements on the existing algorithms in terms of time and number of operations, making them easily applicable to datasets with a large number of symbolic variables and a large number of objects. Moreover, this formulation enables the definition of the population symbolic components even when one or more variables are degenerate.

Furthermore, analogously to conventional (non-symbolic) data, we have verified that the existence of atypical observations can distort the sample symbolic principal components and the corresponding scores.
To overcome this problem in the context of SDA, we defined two families of robust methods for symbolic PCA: one based on robust covariance matrices (Filzmoser and Todorov, 2011) and another based on Projection Pursuit (Croux et al., 2007). To make these new statistical tools easy to use in the analysis of real problems, we also developed a web application, using the Shiny web application framework for R, which includes several tools to analyse, represent, and perform symbolic (classical and robust) PCA on interval data in an interactive manner. In this app it is possible to compare the classical symbolic PCA methods with all the new robust approaches proposed in this work; its operation will be illustrated with telecommunications data.

For conventional data, PCA is frequently used as an intermediate step in the analysis of complex problems (Johnson and Wichern, 2007), and its output is commonly used as input for other multivariate methods. To pursue the same goal with symbolic data, we designed R routines that convert between different representations of interval-valued data, making it easier to use several R SDA packages consecutively in the same analysis. These packages were developed independently, and each one requires the data to be read in a specific format.

Acknowledgment

This work was partially funded by Fundação para a Ciência e a Tecnologia (FCT) through projects PTDC/EEI-TEL/5708/2014 and UID/Multi/04621/2013.

References

Billard, L., Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. John Wiley and Sons, Chichester.
Cazes, P., Chouakria, A., Diday, E., Schektman, Y. (1997). Extensions de l’analyse en composantes principales à des données de type intervalle. Revue de Statistique Appliquée 45(3), 5–24.
Croux, C., Filzmoser, P., Oliveira, M. R. (2007). Algorithms for Projection-Pursuit robust principal component analysis. Chemometr. Intell. Lab. 87(2), 218–225.
Filzmoser, P., Todorov, V. (2011). Review of robust multivariate statistical methods in high dimension. Anal. Chim. Acta 705, 2–14.
Johnson, R. A., Wichern, D. W. (2007). Applied Multivariate Statistical Analysis. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
Wang, H., Guan, R., Wu, J. (2012). CIPCA: Complete-Information-based Principal Component Analysis for Interval-valued Data. Neurocomputing 86, 158–169.
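
Illustrative sketch

The abstract refers to two representations of an interval-valued object, its centers (used by CPCA) and the vertices of its hyper-rectangle (used by VPCA), and to routines that convert between interval formats. The short Python sketch below only illustrates those ideas; it is not the authors' R code, and the function names and the sample object are hypothetical. It assumes each object is given as a list of (lower, upper) bounds, one pair per variable.

```python
from itertools import product

def to_center_range(intervals):
    """Convert [(lower, upper), ...] to [(center, half_range), ...]."""
    return [((lo + hi) / 2, (hi - lo) / 2) for lo, hi in intervals]

def to_bounds(center_ranges):
    """Inverse conversion: [(center, half_range), ...] to [(lower, upper), ...]."""
    return [(c - r, c + r) for c, r in center_ranges]

def vertices(intervals):
    """Enumerate all 2^p corner points of one object's hyper-rectangle."""
    return list(product(*intervals))

# Hypothetical object with p = 3 interval variables; the last one is
# degenerate (lower bound equals upper bound).
obj = [(1.0, 3.0), (10.0, 20.0), (0.5, 0.5)]

# Centers-based view: one point per object.
centers = [c for c, _ in to_center_range(obj)]

# Vertices-based view: 2^p points per object, with repeated points
# whenever a variable is degenerate.
corners = vertices(obj)
```

For n objects and p interval variables, the vertices representation has n·2^p rows, which gives a sense of why working directly with vertices becomes costly as p grows.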