Principal Curves and Surfaces to Interval Valued Variables
Jorge Arce G.∗
Oldemar Rodríguez†
June 26, 2015
Abstract
In this paper we propose a generalization to symbolic interval-valued variables of the
Principal Curves and Surfaces method proposed by T. Hastie in [4]. Given a data set X with n
observations and m continuous variables, the main idea of the Principal Curves and Surfaces
method is to generalize the principal component line, providing a smooth one-dimensional curved
approximation to a set of data points in R^m. A principal surface is more general, providing a
curved manifold approximation of dimension 2 or more. In our case we are interested in finding
the principal curve that best approximates symbolic interval-valued data. In [2] and
[3], the authors proposed the Centers and the Vertices Methods to extend the well-known
principal component analysis method to a particular kind of symbolic objects characterized
by multi-valued variables of interval type. In this paper we generalize both the Centers and
the Vertices Methods by finding a smooth curve that passes through the middle of the data
X in an orthogonal sense. We compare the proposed method with the Centers and the Vertices
Methods using the RSDA package on the Ichino and Interval Iris data sets; see [8] and [1].
The comparisons are based on the cumulative variance and the correlation index.
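
As a point of reference for the Centers Method mentioned above, the following minimal sketch runs an ordinary principal component analysis on the interval midpoints and then projects each interval (hyper-rectangle) onto the principal axes. It is an illustration only, not the authors' implementation and not the RSDA API: the function name centers_pca, the use of the covariance matrix without standardization, and the toy data are all assumptions made for this example.

```python
import numpy as np

def centers_pca(lower, upper):
    """Hypothetical helper: PCA on interval midpoints (Centers-style sketch)."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)

    # Midpoint of every interval: one classical data point per observation.
    centers = (lower + upper) / 2.0
    mean = centers.mean(axis=0)

    # Eigen-decomposition of the covariance matrix of the centers
    # (assumption: no standardization of the variables).
    cov = np.cov(centers - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]          # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Cumulative proportion of variance explained by the leading axes.
    cum_var = np.cumsum(eigvals) / eigvals.sum()

    # A linear form on a hyper-rectangle attains its extremes at vertices:
    # use the lower bound where the loading is positive, the upper bound
    # where it is negative (and the reverse for the maximum).
    lo_c, up_c = lower - mean, upper - mean
    pos = np.clip(eigvecs, 0.0, None)
    neg = np.clip(eigvecs, None, 0.0)
    score_min = lo_c @ pos + up_c @ neg
    score_max = up_c @ pos + lo_c @ neg
    return cum_var, score_min, score_max

# Made-up toy data: 4 observations, 2 interval-valued variables.
lower = np.array([[1.0, 2.0], [2.0, 1.5], [4.0, 5.0], [6.0, 5.5]])
upper = np.array([[2.0, 3.5], [3.5, 2.5], [5.0, 7.0], [7.5, 6.0]])
cum_var, score_min, score_max = centers_pca(lower, upper)
print(cum_var)
print(np.stack([score_min[:, 0], score_max[:, 0]], axis=1))
```

Each row of the second printout is the interval that an observation occupies on the first principal axis. A principal curve replaces that straight axis with a smooth curve, which is the generalization the paper proposes.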
Keywords
Interval-valued variables, Principal Curves and Surfaces, Symbolic Data Analysis.
References
[1] Billard, L. & Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining.
John Wiley & Sons Ltd, United Kingdom.
[2] Cazes, P., Chouakria, A., Diday, E. & Schektman, Y. (1997). Extension de l’analyse en composantes principales à des données de type intervalle. Rev. Statistique Appliquée, Vol. XLV,
No. 3, pp. 5-24, France.
∗ University of Costa Rica, San José, Costa Rica & Banco Nacional de Costa Rica; E-mail: jarceg@bncr.fi.cr
† University of Costa Rica, San José, Costa Rica; E-mail: oldemar.rodriguez@ucr.ac.cr
[3] Douzal-Chouakria, A., Billard, L. & Diday, E. (2011). Principal component analysis for interval-valued observations. Statistical Analysis and Data Mining, Vol. 4, No. 2, pp. 229-246.
Wiley.
[4] Hastie, T. (1984). Principal Curves and Surfaces. Ph.D. Thesis, Stanford University.
[5] Hastie, T. & Weingessel, A. (2014). princurve - Fits a Principal Curve in Arbitrary Dimension.
R package version 1.1-12.
[http://cran.r-project.org/web/packages/princurve/index.html]
[6] Hastie, T. & Stuetzle, W. (1989). Principal Curves. Journal of the American Statistical Association, Vol. 84, No. 406, pp. 502-516.
[7] Hastie, T., Tibshirani, R. & Friedman, J. (2008). The Elements of Statistical Learning: Data
Mining, Inference and Prediction. New York: Springer.
[8] Rodríguez, O., with contributions from Olger Calderon and Roberto Zuniga (2014). RSDA - R to Symbolic Data Analysis. R package version 1.2.
[http://CRAN.R-project.org/package=RSDA]
[9] Rodríguez, O. (2000). Classification et Modèles Linéaires en Analyse des Données Symboliques.
Ph.D. Thesis, Paris IX-Dauphine University.