Principal Curves and Surfaces to Interval Valued Variables

Jorge Arce G. ∗    Oldemar Rodríguez †

June 26, 2015

Abstract

In this paper we propose a generalization to symbolic interval-valued variables of the Principal Curves and Surfaces method proposed by T. Hastie in [4]. Given a data set X with n observations and m continuous variables, the main idea of the Principal Curves and Surfaces method is to generalize the principal component line, providing a smooth one-dimensional curved approximation to a set of data points in R^m. A principal surface is more general, providing a curved manifold approximation of dimension 2 or more. In our case we are interested in finding the main principal curve that best approximates symbolic interval-valued data. In [2] and [3], the authors proposed the Centers and the Vertices Methods to extend the well-known principal component analysis method to a particular kind of symbolic object characterized by multi-valued variables of interval type. In this paper we generalize both the Centers and the Vertices Methods by finding a smooth curve that passes through the middle of the data X in an orthogonal sense. We compare the proposed method with the Centers and the Vertices Methods using the RSDA package on the Ichino and Interval Iris data sets; see [8] and [1]. The comparisons are based on the cumulative variance and the correlation index.

Keywords: Interval-valued variables, Principal Curves and Surfaces, Symbolic Data Analysis.

References

[1] Billard, L. & Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. John Wiley & Sons Ltd, United Kingdom.

[2] Cazes, P., Chouakria, A., Diday, E. & Schektman, Y. (1997). Extension de l'analyse en composantes principales à des données de type intervalle. Rev. Statistique Appliquée, Vol. XLV, No. 3, pp. 5-24, France.
[3] Douzal-Chouakria, A., Billard, L. & Diday, E. (2011). Principal component analysis for interval-valued observations. Statistical Analysis and Data Mining, Vol. 4, No. 2, pp. 229-246. Wiley.

[4] Hastie, T. (1984). Principal Curves and Surfaces. Ph.D. Thesis, Stanford University.

[5] Hastie, T. & Weingessel, A. (2014). princurve - Fits a Principal Curve in Arbitrary Dimension. R package version 1.1-12. [http://cran.r-project.org/web/packages/princurve/index.html]

[6] Hastie, T. & Stuetzle, W. (1989). Principal Curves. Journal of the American Statistical Association, Vol. 84, No. 406, pp. 502-516.

[7] Hastie, T., Tibshirani, R. & Friedman, J. (2008). The Elements of Statistical Learning: Data Mining, Inference and Prediction. New York: Springer.

[8] Rodríguez, O., with contributions from Olger Calderon and Roberto Zuniga (2014). RSDA - R to Symbolic Data Analysis. R package version 1.2. [http://CRAN.R-project.org/package=RSDA]

[9] Rodríguez, O. (2000). Classification et Modèles Linéaires en Analyse des Données Symboliques. Ph.D. Thesis, Paris IX-Dauphine University.

∗ University of Costa Rica, San José, Costa Rica & Banco Nacional de Costa Rica; E-Mail: jarceg@bncr.fi.cr
† University of Costa Rica, San José, Costa Rica; E-Mail: oldemar.rodriguez@ucr.ac.cr