Symbolic Data Analysis Workshop SDA 2015
November 17 – 19, 2015
University of Orléans, Orléans, France

SPONSORS
CNRS
MAPMO Mathematics Laboratory, University of Orléans
Denis Poisson Federation
Centre-Val de Loire Region Council
Loiret District Council

STEERING COMMITTEE
Paula BRITO, University of Porto, Portugal
Monique NOIRHOMME, University of Namur, Belgium

ORGANISING COMMITTEE
Guillaume CLEUZIOU, Richard EMILION, Christel VRAIN
Secretary: Marie-France GRESPIER
University of Orléans, France

SCIENTIFIC COMMITTEE
Javier ARROYO, Spain
Lynne BILLARD, USA
Paula BRITO, Portugal
Chun-houh CHEN, Taiwan
Guillaume CLEUZIOU, France
Francisco DE CARVALHO, Brazil
Edwin DIDAY, France
Richard EMILION, France
Manabu ICHINO, Japan
Yves LECHEVALLIER, France
Monique NOIRHOMME, Belgium
Rosanna VERDE, Italy
Gilles VENTURINI, France
Christel VRAIN, France
Huiwen WANG, China

Tuesday, November 17
Orléans University Campus, IIIA Computer Science Building

TUTORIAL (1st floor, Room E19)
09:00 - 09:50  Introduction to Symbolic Data Analysis
               Paula BRITO, FEP & LIAAD-INESC TEC, Univ. Porto, Portugal
09:50 - 10:15  Coffee Break
10:15 - 11:05  The Quantile Method for Symbolic Data Analysis
               Manabu ICHINO, SSE, Tokyo Denki University, Japan
11:05 - 11:55  The R SDA Package
               Oldemar RODRIGUEZ, University of Costa Rica
12:00 - 13:45  Welcome, Registration, Lunch - L'Agora Restaurant, Orléans University Campus

IIIA, Herbrand Amphitheatre
13:55 - 14:00  Workshop Opening
14:00 - 17:25  Workshop Talks
19:30          Workshop Dinner. 'Le Martroi' restaurant, 12 Place du Martroi, Orléans. Tram stop: 'De Gaulle' or 'République'

Introduction to Symbolic Data Analysis
Paula Brito
FEP & LIAAD-INESC TEC, Univ. Porto, Portugal

Symbolic Data Analysis, introduced by E. Diday, is concerned with analysing data presenting intrinsic variability, which is to be taken explicitly into account. In classical Statistics and Multivariate Data Analysis, the elements under analysis are generally individual entities for which a single value is recorded for each variable - e.g., individuals described by their age, salary, education level, marital status; cars, each described by its weight, length, power, engine displacement; students, for each of whom the marks in different subjects were recorded. But when the elements of interest are classes or groups of some kind - the citizens living in given towns; teams, consisting of individual players; car models, rather than specific vehicles; classes and not individual students - then there is variability inherent to the data. Reducing this variability by taking central tendency measures - mean values, medians or modes - obviously leads to an excessive loss of information. Symbolic Data Analysis provides a framework allowing the representation of data with variability, using new variable types. Also, methods have been developed which suitably take data variability into account. Symbolic data may be represented using the usual matrix-form data arrays, where each entity is represented in a row and each column corresponds to a different variable - but now the elements of each cell are generally not single real values or categories, as in the classical case, but rather finite sets of values, intervals or, more generally, distributions.
In this talk we shall introduce and motivate the field of Symbolic Data Analysis, present in some detail the new variable types that have been introduced to represent variability, and illustrate them with some examples. We shall furthermore discuss some issues that arise when analysing data that do not follow the usual classical model, and present data representation models for some variable types.

The Quantile Method for Symbolic Data Analysis
Manabu Ichino
School of Science and Engineering, Tokyo Denki University
ichino@mail.dendai.ac.jp

Keywords: Quantiles, Monotonicity, Visualization, PCA, Clustering

Abstract
The quantile method transforms a given (N objects) × (d variables) symbolic data table into a standard (N × (m+1) sub-objects) × (d variables) numerical data table, where m is a preselected integer that controls the granularity of the representation of the symbolic objects. Each symbolic object is therefore represented by a set of (m+1) d-dimensional numerical vectors, called the quantile vectors. Based on the monotonicity of the quantile vectors, we present the following three methods for symbolic data analysis.
Visualization: We visualize each symbolic object by m+1 parallel monotone line graphs [Ichino and Brito 2014]. Each line graph is composed of d-1 line segments accumulating the d zero-one normalized variable values.
PCA: When the given symbolic objects have a monotone structure in the representation space, the structure confines the corresponding quantile vectors to a similar geometrical shape. We apply PCA to the quantile vectors based on rank-order correlation coefficients. We reproduce each symbolic object as m series of arrow lines connecting the minimum quantile vector to the maximum quantile vector in the factor planes [Ichino 2011].
Clustering: We present a hierarchical conceptual clustering based on the quantile vectors. We define the concept sizes of d-dimensional hyper-rectangles spanned by quantile vectors. The concept size plays the role of a similarity measure between sub-objects, i.e., quantile vectors, and it also plays the role of a measure of cluster quality [Ichino and Brito 2015].

References
H-H. Bock and E. Diday (2000). Analysis of Symbolic Data - Exploratory Methods for Extracting Statistical Information from Complex Data. Heidelberg: Springer.
L. Billard and E. Diday (2007). Symbolic Data Analysis - Conceptual Statistics and Data Mining. Chichester: Wiley.
E. Diday and M. Noirhomme-Fraiture (2008). Symbolic Data Analysis and the SODAS Software. Chichester: Wiley.
M. Ichino and P. Brito (2014). The data accumulation graph (DAG) to visualize multi-dimensional symbolic data. Workshop in Symbolic Data Analysis. Taipei, Taiwan.
M. Ichino (2011). The quantile method for symbolic principal component analysis. Statistical Analysis and Data Mining, 4, 2, pp. 184-198.
M. Ichino and P. Brito (2015). A hierarchical conceptual clustering based on the quantile method for mixed feature-type data. (Submitted to the IEEE Trans. SMC).
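For interval-valued variables, the transformation itself is simple to state: for an interval [a, b], the quantile at level i/m is a + (i/m)(b - a), so each object yields m+1 quantile vectors after zero-one normalization of each variable. The following R sketch illustrates only this step, on invented data and with our own naming; the treatment of histogram-valued and categorical multi-valued variables discussed in the tutorial is not covered.

# Interval-valued data: one row per object, columns give lower/upper bounds per variable.
lower <- matrix(c(1, 10,   2, 20,   4, 15), ncol = 2, byrow = TRUE)
upper <- matrix(c(3, 18,   5, 30,   9, 25), ncol = 2, byrow = TRUE)

quantile_vectors <- function(lower, upper, m = 4) {
  # Zero-one normalization of each variable over the union of all intervals
  lo <- apply(lower, 2, min); hi <- apply(upper, 2, max)
  L <- sweep(sweep(lower, 2, lo), 2, hi - lo, "/")
  U <- sweep(sweep(upper, 2, lo), 2, hi - lo, "/")
  # For object n and level i/m, the quantile is L + (i/m) * (U - L)
  lapply(seq_len(nrow(L)), function(n)
    t(sapply(0:m, function(i) L[n, ] + (i / m) * (U[n, ] - L[n, ]))))
}

qv <- quantile_vectors(lower, upper, m = 4)
qv[[1]]  # the (m+1) x d quantile vectors of the first object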
Latest Developments of RSDA: An R Package for Symbolic Data Analysis
Oldemar Rodríguez *
June 26, 2015
* University of Costa Rica, San José, Costa Rica; E-Mail: oldemar.rodriguez@ucr.ac.cr

Abstract
This package aims to execute several models of Symbolic Data Analysis. Symbolic Data Analysis was proposed by Professor E. Diday in 1987 in his paper "Introduction à l'approche symbolique en Analyse des Données", Premières Journées Symbolique-Numérique, Université Paris IX Dauphine, December 1987. A very good reference on symbolic data analysis is "From the Statistics of Data to the Statistics of Knowledge: Symbolic Data Analysis" by L. Billard and E. Diday, published in the Journal of the American Statistical Association, June 2003, Vol. 98. The main purpose of Symbolic Data Analysis is to substitute a set of rows (cases) in a data table by a concept (second-order statistical unit). For example, all of the transactions performed by one person (or any object) are replaced by a single "transaction" that summarizes all the original ones (a Symbolic Object), so that millions of transactions can be summarized into only one that preserves the customary behavior of the person. This is achieved thanks to the fact that the fields of the new transaction contain not only numbers (like current transactions), but also objects such as intervals, histograms, or rules. This representation of an object as a conjunction of properties fits within a data analytic framework concerning symbolic data and symbolic objects, which has proven useful in dealing with big databases. In RSDA version 1.2, methods such as centers interval principal component analysis, histogram principal component analysis, multi-valued correspondence analysis, interval multidimensional scaling (INTERSCAL), symbolic hierarchical clustering, CM, CRM, and Lasso, Ridge and Elastic Net linear regression models for interval variables have been implemented. This new version also includes new features to manipulate symbolic data through a new data structure that implements Symbolic Data Frames, and methods for converting SODAS and XML SODAS files to RSDA files.

Keywords: Symbolic data analysis, R package, RSDA, interval principal components analysis, Lasso, Ridge, Elastic Net, Linear regression.

References
[1] Billard, L., Diday, E. (2003). From the statistics of data to the statistics of knowledge: symbolic data analysis. J. Amer. Statist. Assoc. 98 (462), 470-487.
[2] Billard, L. & Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. John Wiley & Sons Ltd, United Kingdom.
[3] Bock, H.-H., and Diday, E. (eds.) (2000). Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information From Complex Data. Berlin: Springer-Verlag.
[4] Diday, E. (1987). "Introduction à l'approche symbolique en Analyse des Données". Premières Journées Symbolique-Numérique. Université Paris IX Dauphine. Paris, France.
[5] Lima-Neto, E.A., De Carvalho, F.A.T. (2008). Centre and range method to fitting a linear regression model on symbolic interval data. Computational Statistics and Data Analysis 52, 1500-1515.
[6] Lima-Neto, E.A., De Carvalho, F.A.T. (2010). Constrained linear regression models for symbolic interval-valued variables. Computational Statistics and Data Analysis 54, 333-347.
[7] Rodríguez, O. (2000). Classification et Modèles Linéaires en Analyse des Données Symboliques. Ph.D. Thesis, Paris IX-Dauphine University.
[8] Rodríguez, O., with contributions from Olger Calderon and Roberto Zuniga (2014). RSDA - R to Symbolic Data Analysis. R package version 1.2. http://CRAN.R-project.org/package=RSDA
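One of the methods listed above, centers interval principal component analysis (originally due to Cazes et al.), runs a classical PCA on the interval mid-points and then projects the interval bounds onto the resulting axes. The base-R sketch below illustrates that idea only; it is not the RSDA implementation, and the toy data and helper name are ours.

# Toy interval data: 4 objects, 2 interval-valued variables (lower/upper bounds).
lower <- cbind(x = c(1, 2, 4, 3), y = c(10, 20, 15, 12))
upper <- cbind(x = c(3, 5, 9, 6), y = c(18, 30, 25, 16))

centers <- (lower + upper) / 2              # interval mid-points
pca     <- prcomp(centers, scale. = TRUE)   # classical PCA on the centers

# Project the 2^d vertices of each hyper-rectangle to get an interval of scores on PC1.
scores_interval <- function(lo, up, pca) {
  d <- ncol(lo)
  verts <- as.matrix(expand.grid(rep(list(c(0, 1)), d)))   # vertex selectors
  t(sapply(seq_len(nrow(lo)), function(i) {
    v <- verts * matrix(up[i, ], nrow(verts), d, byrow = TRUE) +
         (1 - verts) * matrix(lo[i, ], nrow(verts), d, byrow = TRUE)
    colnames(v) <- colnames(lo)
    pc1 <- predict(pca, v)[, 1]
    c(min = min(pc1), max = max(pc1))
  }))
}
scores_interval(lower, upper, pca)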
Session Speakers
SDA 2015, November 17 – 19, University of Orléans, France

Tuesday, November 17, afternoon
12:00 - 13:45  Welcome, Registration, Lunch - L'Agora Restaurant, Orléans University Campus

IIIA Computer Science Building, Herbrand Amphitheatre
13:55 - 14:00  Workshop Opening

Session 1: VARIABLE DEPENDENCIES - Chair: Rosanna VERDE
14:00 – 14:25  Explanatory Power of a Symbolic Data Table
               Edwin DIDAY, University Paris-Dauphine, France
14:25 – 14:50  Methods for Analyzing Joint Distribution Valued Data and Actual Data Sets
               Masahiro MIZUTA, Hiroyuki MINAMI, IIC, Hokkaido University, Japan
14:50 – 15:15  Advances in regression models for interval variables: a copula based model
               Eufrasio LIMA NETO, Ulisses DOS ANJOS, Univ. Paraiba, Joao Pessoa, Brasil
15:15 – 15:40  Symbolic Bayesian Networks
               Edwin DIDAY, University Paris-Dauphine; Richard EMILION, MAPMO, University of Orléans, France
15:40 – 16:10  Coffee Break

Session 2: STATISTICAL APPROACHES - Chair: Didier CHAUVEAU
16:10 – 16:35  Maximum Likelihood Estimations for Interval-Valued Variables
               Lynne BILLARD, University of Georgia, USA
16:35 – 17:00  Function-valued Image Segmentation using Functional Kernel Density Estimation
               Laurent DELSOL, Cecile LOUCHET, MAPMO, University of Orléans, France
17:00 – 17:25  Outlier Detection in Interval Data
               A. Pedro DUARTE SILVA, UCP Porto, Portugal; Peter FILZMOSER, TU Vienna, Austria; Paula BRITO, FEP & LIAAD-INESC TEC, Univ. Porto, Portugal

19:30  Workshop Dinner. 'Le Martroi' restaurant, 12 Place du Martroi, Orléans. Tram stop: 'De Gaulle' or 'République'

Explanatory Power of a Symbolic Data Table
Edwin Diday (Paris-Dauphine University)

The main aim of this talk is to study the "explanatory" quality of a symbolic data table. We give criteria based on entropy and on discrimination, which are shown to be complementary. We show that, under some conditions, the best descriptive variables of the concepts are also the best predictive ones.

Methods for Analyzing Joint Distribution Valued Data and Actual Data Sets
Masahiro Mizuta1,*, Hiroyuki Minami1
1. Advanced Data Science Laboratory, Information Initiative Center, Hokkaido University, Japan
*Contact author: mizuta@iic.hokudai.ac.jp

Keywords: Simultaneous Distribution Valued Data, Parkinsons Telemonitoring Data

Analysis of distribution valued data is one of the hottest topics in SDA, especially joint distribution (or simultaneous distribution) valued data. In this talk, we introduce methods for such data and present openly available real data sets. We assume that we have n concepts (or objects) and that each concept is described by a distribution. Many methods have been proposed; the key ideas can be summarized as follows: (1) use of distances between concepts, (2) use of parameters of distributions, (3) use of quantile functions. When the concepts are described by joint distributions, approach (1) is natural; Igarashi (2015) proposed a method based on it. There is room to adopt approaches (2) and (3). In order to study methods for data analysis, good real data sets are helpful, but there are not many good data sets of joint distribution valued data. Mizuta (2014) presented one data set: monitoring post data from around Fukushima Prefecture. Another good data set is the Parkinsons Telemonitoring data set, which can be obtained from the Web (https://archive.ics.uci.edu/ml/datasets/Parkinsons+Telemonitoring). I will introduce them.

Acknowledgment: I wish to thank Mr. Igarashi.
A part of this work is based on the results of Igarashi (2015).

References
M. Mizuta (2012). Analysis of Distribution Valued Dissimilarity Data. In Challenges at the Interface of Data Analysis, Computer Science, and Optimization. Springer, 23-28.
M. Mizuta (2014). Symbolic Data Analysis for Big Data. Proceedings of 2014 Workshop in Symbolic Data Analysis, 59.
K. Igarashi, H. Minami, M. Mizuta (2015). Exploratory Methods for Joint Distribution Valued Data and Their Application. Communications for Statistical Applications and Methods, Vol. 22, No. 3, 265–276. DOI: http://dx.doi.org/10.5351/CSAM.2015.22.3.265
A. Irpino, R. Verde (2015). Basic Statistics for Distributional Symbolic Variables: A New Metric-based Approach. Advances in Data Analysis and Classification, Vol. 9, No. 2, 143-175.

Advances in regression models for interval variables: a copula based model
Eufrásio Lima Neto1,*, Ulisses dos Anjos1
1. Department of Statistics, Federal University of Paraíba, João Pessoa, PB, Brazil.
*Contact author: eufrasio@de.ufpb.br

Keywords: Inference, Copulas, Regression, Interval Variable.

Regression models are widely used to solve problems in many fields, and inferential techniques play an important role in validating these models. Recently, several contributions have been presented for fitting regression models to interval-valued variables. We start this talk by discussing some of these techniques. We then state a regression model for interval-valued variables based on copula theory, which allows more flexibility in the model's random component. The main advance of the new approach is that it makes it possible to carry out inferential procedures on the parameter estimates, as well as goodness-of-fit measures and residual analysis, based on a general probabilistic background. A Monte Carlo simulation study demonstrates asymptotic properties of the maximum likelihood estimates obtained from the copula regression model. Applications to real data sets are also considered.

References
Brito, P. and Duarte Silva, A.P. (2012). Modelling interval data with normal and skew-normal distributions. Journal of Applied Statistics 39, 3–20.
Blanco-Fernández, A., Corral, N. and González-Rodríguez, G. (2011). Estimation of a flexible simple linear model for interval data based on set arithmetic. Computational Statistics & Data Analysis 55, 2568–2578.
Diday, E. and Vrac, M. (2005). Mixture decomposition of distributions by copulas in the symbolic data analysis framework. Discrete Applied Mathematics 147(1), 27–41.
Lima Neto, E.A. and Anjos, U.U. (2015). Regression model for interval-valued variables based on copulas. Journal of Applied Statistics, 1–20.
Lima Neto, E.A., Cordeiro, G.M. and De Carvalho, F.A.T. (2011). Bivariate symbolic regression models for interval-valued variables. Journal of Statistical Computation and Simulation 81, 1727–1744.
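Among the earlier techniques the talk starts from, the classical centre-and-range approach of Lima Neto and De Carvalho (2008) fits one linear model to the interval mid-points and another to the ranges. The sketch below shows only that baseline on invented data; it does not implement the copula-based model of this talk.

# Toy interval-valued response y and predictor x: lower/upper bounds per observation.
x_lo <- c(1, 2, 4, 6, 8);   x_up <- c(3, 5, 7, 9, 12)
y_lo <- c(2, 5, 8, 11, 15); y_up <- c(6, 10, 14, 18, 25)

# Centre-and-range representation
xc <- (x_lo + x_up) / 2; xr <- x_up - x_lo
yc <- (y_lo + y_up) / 2; yr <- y_up - y_lo

fit_c <- lm(yc ~ xc)   # model for the mid-points
fit_r <- lm(yr ~ xr)   # model for the ranges

# Predicted intervals: centre prediction +/- half the predicted range
pred_c <- fitted(fit_c); pred_r <- fitted(fit_r)
cbind(lower = pred_c - pred_r / 2, upper = pred_c + pred_r / 2)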
Symbolic Bayesian Networks
Edwin DIDAY1, Richard EMILION2,*
1. CEREMADE, University Paris-Dauphine
2. MAPMO, University of Orléans, France
*Contact author: richard.emilion@univ-orleans.fr

Keywords: Bayesian network, Conditional distribution, Dirichlet distribution, Independence test.

Bayesian networks, see e.g. [1], are probabilistic directed acyclic graphs used for modelling system behavior through conditional distributions. They generally deal with correlated categorical or real-valued random variables. We consider Bayesian networks dealing with probability-distribution-valued random variables.

1. Statistical setting
Let X = (X_1, ..., X_j, ..., X_p) be a random vector, p >= 1 being an integer, each X_j taking values in the space of probability measures defined on a measurable space (V_j, 𝒱_j), j = 1, ..., p. Let (X_{k,1}, ..., X_{k,j}, ..., X_{k,p}), k = 1, ..., K, be a sample of size K of X. Consider k as a row index and j as a column index.

2. Motivation
Actually the sample (X_{k,1}, ..., X_{k,j}, ..., X_{k,p}), k = 1, ..., K, is not observed but only estimated from observed data. In symbolic data analysis (SDA), each observed data point belongs to a class among K disjoint classes, say c_1, ..., c_K. The observations can be either vectors in the product space V_1 x ... x V_p or points in some V_j, as seen in the two examples below, which illustrate two different situations. The empirical distribution of the data in V_j which belong to class c_k is an estimate of the probability distribution X_{k,j}. This distribution is considered as the j-th descriptor of class c_k.

2.1 Paired Samples
In the well-known Fisher iris data set, K = 3, c_1 = 'setosa', c_2 = 'versicolor', c_3 = 'virginica', p = 4. The observations are 50 irises in each of these 3 classes. The observed samples are paired since each iris is described by a vector of 4 values. As an example, X_{3,2} is the probability distribution of sepal width in the 'virginica' class.

2.2 Unpaired Samples
Let c_1, ..., c_K be K students and consider p professors who grade several students' exams. Let X_{k,j} be the distribution of the grades of student c_k given by professor j. The samples are unpaired here since the exams and the number of exams can differ from one professor to another.

2.3 Dependencies
Clearly, in the case of paired samples, within each class the data of descriptor j are correlated with the data of descriptor j', while this correlation is meaningless in the case of unpaired samples. However, considering the K pairs of estimated distributions (X_{k,j}, X_{k,j'}), k = 1, ..., K, j, j' = 1, ..., p, j != j', it is seen that the random distributions X_j and X_j' can be correlated. This motivates us to consider Bayesian networks dealing with probability distributions.

3. The case of finite sets
Assume V_j finite, so that X_{k,j} is a probability vector of frequencies whose size can depend on j. Bayesian networks are then built by testing the independence (resp. the correlation) between the two random vectors X_j and X_j'. We have used the indep.etest() function implemented in the 'energy' package for R [3]. Distributions and conditional distributions are estimated using kernels in the nonparametric case, while Dirichlet distributions are used in the parametric case.

4. The case of densities
Assume that each V_j is a measurable subset of some R^{d_j} and that X_{k,j} has a density f_{k,j} w.r.t. the Lebesgue measure. Independence tests can be performed and conditional distributions can be estimated using functional data analysis methods, either using a finite number of coordinates on some basis, which brings us back to the finite case, or using kernel estimators w.r.t. a distance on a function space [2].

References
[1] Darwiche, A. (2009). Modeling and Reasoning with Bayesian Networks. Cambridge University Press.
[2] Ramsay, J.O., Silverman, B.W. (2005). Functional Data Analysis. Springer.
[3] Szekely, G.J., Rizzo, M.L. (2013). The distance correlation t-test of independence in high dimension. J. Multivariate Anal. 117, 193-213. http://dx.doi.org/10.1016/j.jmva.2013.02.012
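As a concrete illustration of the testing step in Section 3, the sketch below simulates K classes, builds for each class two estimated frequency vectors (the descriptors X_j and X_j'), and tests their independence with a distance-covariance permutation test from the 'energy' package. We use energy::dcov.test here for the sketch; the abstract itself uses indep.etest(), available in earlier releases of the same package. The data and sizes are invented.

library(energy)
set.seed(1)

K <- 40; m <- 5   # K classes, frequency vectors of length m for the two descriptors

# Estimated distribution-valued descriptors: each row is a probability vector.
raw <- matrix(rgamma(K * m, shape = 2), K, m)
Xj  <- raw / rowSums(raw)                        # descriptor j
Xjp <- raw + matrix(rgamma(K * m, shape = 2), K, m)
Xjp <- Xjp / rowSums(Xjp)                        # descriptor j', dependent on Xj

# Distance-covariance permutation test of independence between the two
# random probability vectors, based on the K estimated pairs.
dcov.test(Xj, Xjp, R = 199)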
Maximum Likelihood Estimation for Interval-valued Data
Lynne BILLARD, University of Georgia

Bertrand and Goupil (2000) obtained empirical formulas for the mean and variance of interval-valued observations. Billard (2008) obtained empirical formulas for the covariance of interval-valued observations. These are in effect moment estimators. We show how, under certain probability assumptions, these are the same as the maximum likelihood estimators for the corresponding population parameters.
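For reference, the Bertrand–Goupil empirical formulas referred to here can be written, for n intervals [a_u, b_u] and under a within-interval uniformity assumption, as follows (our transcription; the covariance formula of Billard (2008) follows the same pattern and is omitted):

\bar{X} = \frac{1}{2n}\sum_{u=1}^{n}(a_u + b_u),
\qquad
S^2 = \frac{1}{3n}\sum_{u=1}^{n}\left(a_u^2 + a_u b_u + b_u^2\right)
      - \frac{1}{4n^2}\left[\sum_{u=1}^{n}(a_u + b_u)\right]^2 .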
Function-valued image segmentation using functional kernel density estimation
Laurent Delsol1, Cécile Louchet1,*
1. MAPMO, UMR CNRS 7349, Université d'Orléans, Fédération Denis Poisson FR CNRS 2964
*Contact author: cecile.louchet@univ-orleans.fr

Keywords: Hyperspectral imaging, functional kernel density estimation, minimal partition.

Introduction. More and more image acquisition systems record high-dimensional vectors for each pixel, as is the case for hyperspectral imaging or dynamic PET imaging, to cite only these. In hyperspectral imaging, each pixel is associated with a light spectrum containing up to several hundred radiance values, each corresponding to a narrow spectral band. In dynamic PET images, each pixel is associated with a time activity curve giving a radioactivity measurement throughout the (discretized) time after radiotracer injection. In both cases, the data available for each pixel is a vector coming from the discretization of a function with physical meaning. It is thus interesting to extend classical image processing techniques to function-valued images. Image segmentation is the task of partitioning the pixel grid of an image such that the obtained regions correspond to distinct physical objects; it can be described as an automatic search for regions of interest. It is based on the assumption that each region has a different effect on the image values. Much research has addressed the segmentation of single- or color-valued images, but little has been done on function-valued images. In symbolic data analysis terminology, each pixel is an object whose symbolic description is a function, and image segmentation can be seen as a symbolic clustering method. In our work, we recall the most famous single-valued image segmentation method, namely the minimal partition model (Mumford, Shah, 1989), and adapt it to our case using functional estimation tools, namely functional kernel density estimation, depending on a distance between functions. We discuss the choice of this distance, test it on a hyperspectral image, and give an application to gray-level texture image segmentation.

Function-valued image segmentation. Let y be an image defined on a finite grid \Omega, with values in a function set F. We aim at segmenting y into L regions, that is, at finding x : \Omega -> {1, ..., L} such that each region {s in \Omega : x_s = l} (1 <= l <= L) corresponds to one object. A Bayesian approach is to consider the segmentation \hat{x} that maximizes P(X = x | Y = y), or equivalently, that maximizes f_{Y|X=x}(y) P_X(x) among x : \Omega -> {1, ..., L}. The widely used minimal partition model estimates the data term f_{Y|X=x} using an isotropic Gaussian model. For the regularity term, it uses a parameter \beta > 0 and chooses the Potts (1952) model

    P_X(x) \propto \exp\Big( -\beta \sum_{s \sim t} 1_{x_s \neq x_t} \Big),    (1)

where \sim refers to a neighborhood system in \Omega, typically the 4 or 8 nearest neighbor system. The potential \sum_{s \sim t} 1_{x_s \neq x_t} corresponds to a discrete boundary length between regions: segmentations with short boundaries are preferred. In our context, the isotropic Gaussian assumption on the data term is not admissible because it cannot account for any function regularity. The only assumption we make is that the functions (y_s), s in \Omega, are independent conditionally on the segmentation, yielding f_{Y|X=x}(y) = \prod_{s \in \Omega} f_{Y_s|X_s=x_s}(y_s), each term of which can be estimated using functional kernel density estimation (Dabo-Niang, 2004)

    \hat{f}_{Y_s|X_s=x_s}(y_s) \propto \sum_{t : x_t = x_s} K\big( d(y_t, y_s) / h \big),    (2)

where K is a positive kernel, h > 0 is the bandwidth of K, and d is a distance on the space of functions F. The normalizing factor, not written here, is linked to the small ball probability, but fortunately does not depend on the segmentation.

Results. The details of the maximization algorithm can be found in Delsol, Louchet (2014), so we prefer to focus here on the choice of the distance d. In many hyperspectral images, the L2 distance between the derivatives of the functions (Tsai, Philpot, 1998) accurately measures the discrepancy between functions because it is more robust to vertical shifts of the functions, even if it is more sensitive to noise than the usual L2 distance. This distance was used in Figure 1 (a-c), where a fake lemon slice was well distinguished from a real one. Another application, to gray-level image texture segmentation, is also proposed. A texture is described by the density of gray levels estimated in each pixel's neighborhood: a simple gray-level image becomes a function-valued image, where each function is a density. In this particular case, natural choices of distance are transport distances: they allow for small horizontal shifts and are straightforward to compute in dimension 1. The L2 transport distance was used on a mouse brain MRI to extract the cerebellum region, which is more textured than the others (Figure 1 (d-e)).

Figure 1: (a) One slice of a hyperspectral image (fake and real lemon) (b) Functions attached to a sample of pixels (c) Segmentation using the distance between derivatives (d) MRI of a mouse brain (e) Cerebellum segmentation using the L2 transport distance.

References
S. Dabo-Niang (2004). Kernel density estimator in an infinite-dimensional space with a rate of convergence in the case of diffusion process. Applied Mathematics Letters, 17(4), pp. 381–386.
L. Delsol, C. Louchet (2014). Segmentation of hyperspectral images from functional kernel density estimation. Contributions in infinite-dimensional statistics and related topics, 101–105.
D. Mumford, J. Shah (1989). Optimal approximations by piecewise smooth functions and associated variational problems. Comm. on Pure and Appl. Math., XLII (5), 577–685.
R. B. Potts (1952). Some generalized order-disorder transformations. Mathematical Proceedings, 48(1), 106–109.
F. Tsai, W. Philpot (1998). Derivative analysis of hyperspectral data. Remote Sensing of Environment, 66(1), 41–51.
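A compact sketch of the estimator in equation (2), for discretized functions stored as matrix rows, using a Gaussian kernel and the plain L2 distance (all names and data here are invented; the derivative-based or transport distances discussed above would simply replace d):

set.seed(2)
# Discretized functions: one row per pixel, columns are evaluation points.
curves <- rbind(matrix(rnorm(5 * 50, mean = 0), 5, 50),
                matrix(rnorm(5 * 50, mean = 3), 5, 50))
labels <- rep(1:2, each = 5)            # current segmentation x

l2_dist <- function(f, g) sqrt(sum((f - g)^2))

# Unnormalized functional kernel density estimate of f_{Y_s | X_s = x_s}(y_s):
# sum of K(d(y_t, y_s)/h) over the pixels t currently in the same region as s.
fkde <- function(ys, region, curves, labels, h = 5) {
  same <- which(labels == region)
  sum(sapply(same, function(t) dnorm(l2_dist(curves[t, ], ys) / h)))
}

fkde(curves[1, ], region = 1, curves, labels)  # high: curve fits region 1
fkde(curves[1, ], region = 2, curves, labels)  # low:  curve does not fit region 2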
Outlier Detection in Interval Data
A. Pedro Duarte Silva1, Peter Filzmoser2, Paula Brito3,*
1. Faculdade de Economia e Gestão & CEGE, Universidade Católica Portuguesa, Porto, Portugal
2. Institute of Statistics and Mathematical Methods in Economics, Vienna University of Technology, Vienna, Austria
3. Faculdade de Economia & LIAAD-INESC TEC, Universidade do Porto, Porto, Portugal
*Contact author: mpbrito@fep.up.pt

Keywords: Mahalanobis distance, Modeling interval data, Robust estimation, Symbolic Data Analysis

In this work we are interested in identifying outliers in multivariate observations consisting of interval data. The values of an interval-valued variable may be represented by the corresponding lower and upper bounds or, equivalently, by their mid-points and ranges. Parametric models have been proposed which rely on multivariate Normal or Skew-Normal distributions for the mid-points and log-ranges of the interval-valued variables. Different parameterizations of the joint variance-covariance matrix allow taking into account the relation that might or might not exist between mid-points and log-ranges of the same or different variables (Brito and Duarte Silva, 2012). Here we use the estimates of the joint mean t and covariance matrix C for multivariate outlier detection. The Mahalanobis distances D based on these estimates provide information on how different individual multivariate interval data are from the mean with respect to the overall covariance structure. A critical value based on the Chi-Square distribution allows distinguishing outliers from regular observations. The outlier diagnostics are particularly interesting when the covariance between the mid-points and the log-ranges is restricted to be zero. Then, Mahalanobis distances can be computed separately for mid-points and log-ranges, and the resulting distance-distance plot identifies outliers that may be due to deviations with respect to the mid-point, with respect to the range of the interval data, or both. However, if t and C are chosen to be the classical sample mean vector and covariance matrix, this procedure is not reliable, as D may be strongly affected by atypical observations. Therefore, the Mahalanobis distances should be computed with robust estimates of location and scatter. Many robust estimators of location and covariance have been proposed. The minimum covariance determinant (MCD) estimator (Rousseeuw, 1984, 1985) uses a subset of the original sample, consisting of the h points in the dataset for which the determinant of the covariance matrix is minimal. Weighted trimmed likelihood estimators (Hadi and Luceño, 1997) are also based on a sample subset, formed by the h observations that contribute most to the likelihood function. In either case, the proportion of data points to be used needs to be specified a priori. For multivariate Gaussian data, the two approaches lead to the same estimators (Hadi and Luceño, 1997; Neykov et al., 2007). In this work we consider the Gaussian model for interval data and employ the above approach based on minimum (restricted) covariance estimators, with the correction suggested in Pison et al. (2002) and with the trimming percentage selected by a two-stage procedure. We evaluate our proposal with an extensive simulation study for different data structures and outlier contamination levels, showing that the proposed approach generally outperforms the method based on simple maximum likelihood estimators. Our methodology is illustrated by an application to a real dataset.

References
Brito, P. and Duarte Silva, A.P. (2012). Modelling interval data with Normal and Skew-Normal distributions. Journal of Applied Statistics 39 (1), 3–20.
Hadi, A.S. and Luceño, A. (1997). Maximum trimmed likelihood estimators: a unified approach, examples, and algorithms. Computational Statistics & Data Analysis 25 (3), 251–272.
Neykov, N., Filzmoser, P., Dimova, R. and Neytchev, P. (2007). Robust fitting of mixtures using the trimmed likelihood estimator. Computational Statistics & Data Analysis 52 (1), 299–308.
Pison, G., Van Aelst, S. and Willems, G. (2002). Small sample corrections for LTS and MCD. Metrika 55 (1-2), 111–123.
Rousseeuw, P.J. (1984). Least median of squares regression. Journal of the American Statistical Association 79 (388), 871–880.
Rousseeuw, P.J. (1985). Multivariate estimation with high breakdown point. Mathematical Statistics and Applications 8, 283–297.
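A minimal sketch of the generic robust-Mahalanobis step described above, applied to the mid-point/log-range representation of interval data; it uses the plain MCD from the robustbase package rather than the restricted estimators, corrections and trimming selection of the talk, and all data are invented.

library(robustbase)
set.seed(3)

# Interval data: mid-points and ranges for 100 objects and 2 variables,
# with a few contaminated observations.
mid <- matrix(rnorm(200), 100, 2)
rng <- matrix(rlnorm(200, meanlog = 0.5), 100, 2)
mid[1:5, ] <- mid[1:5, ] + 6                      # mid-point outliers

z <- cbind(mid, log(rng))                         # joint (mid-point, log-range) data

mcd <- covMcd(z)                                  # robust location and scatter (MCD)
d2  <- mahalanobis(z, center = mcd$center, cov = mcd$cov)

cutoff <- qchisq(0.975, df = ncol(z))             # Chi-square critical value
which(d2 > cutoff)                                # flagged outliers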
SDA 2015, November 17 – 19, University of Orléans, France

Wednesday, November 18, Morning
IIIA Computer Science Building, Herbrand Amphitheatre

Session 3: PRINCIPAL COMPONENT ANALYSIS - Chair: Edwin DIDAY
09:00 – 09:25  The Data Accumulation Method
               Manabu ICHINO, SSE, Tokyo Denki University, Japan
09:25 – 09:50  Principal Curves and Surfaces for Interval-Valued Variables
               Jorge Arce G., Oldemar RODRIGUEZ, University of Costa Rica, Costa Rica
09:50 – 10:15  Principal Component Analyses of Interval Data using Patterned Covariance Structures
               Anuradha ROY, University of Texas, USA
10:15 – 10:40  Population and Robust Symbolic PCA
               Rosario OLIVEIRA, Margarida VILELA, Rui VALADAS, IST, University of Lisbon, Portugal; Paulo SALVADOR, TI, University of Aveiro, Portugal
10:40 – 11:05  Coffee Break

Session 4: CLUSTERING 1 - Chair: Yves LECHEVALLIER
11:05 – 11:30  A Topological Clustering Method for Histogram Data
               Guénaël CABANES, Younès BENNANI, LIPN-CNRS, Univ. Paris 13, France; Rosanna VERDE, Antonio IRPINO, Second University, Naples, Italy
11:30 – 11:55  A Co-Clustering Algorithm for Interval-Valued Data
               Francisco DE CARVALHO, Roberto C. FERNANDES, UFPE, Recife, Brazil
11:55 – 12:20  Dynamic Clustering with Hybrid L1, L2 and L∞ Distances
               Leandro SOUZA, Renata M. C. R. SOUZA, Getulio J. A. AMARAL, UFPE, Recife, Brazil
12:20  Lunch. L'Agora restaurant, Campus

The Data Accumulation Method: Dimensionality Reduction, PCA, and PCA-Like Visualization
Manabu Ichino
School of Science and Engineering, Tokyo Denki University
ichino@mail.dendai.ac.jp

Keywords: Data Accumulation, Monotonicity, Dimensionality Reduction, PCA, Visualization

Abstract
The data accumulation method is based on the accumulation of feature values after the zero-one normalization of the feature variables. Let x1, x2, ..., xd be the normalized feature values that describe an object. The data accumulation generates new values y1 = x1, y2 = x1 + x2, ..., yd = x1 + x2 + ... + xd. These new values define a monotone line graph, called the accumulated concept function. We regard the value yd as the maximum concept size of the object. The roughest approximation of the concept function for the object is the selection of the minimum and maximum values y1 and yd. On the other hand, the selection of all d accumulated values yields the finest approximation of the function. By sampling the accumulated values we achieve various approximation levels. From this viewpoint, this paper describes the data accumulation of categorical multi-valued data, PCA and PCA-like visualization for symbolic data, and the dimensionality reduction of high-dimensional data and PCA.

References
H-H. Bock and E. Diday (2000). Analysis of Symbolic Data - Exploratory Methods for Extracting Statistical Information from Complex Data. Heidelberg: Springer.
L. Billard and E. Diday (2007). Symbolic Data Analysis - Conceptual Statistics and Data Mining. Chichester: Wiley.
E. Diday and M. Noirhomme-Fraiture (2008). Symbolic Data Analysis and the SODAS Software. Chichester: Wiley.
M. Ichino and P. Brito (2014). The data accumulation graph (DAG) to visualize multidimensional symbolic data. Workshop in Symbolic Data Analysis. Taipei, Taiwan.
P. Brito and M. Ichino (2014). The data accumulation graph (DAG): visualization of high dimensional complex data. 21st International Conference on Computational Statistics. Geneva, Switzerland.
M. Ichino and P. Brito (2015). A hierarchical conceptual clustering based on the quantile method for mixed feature-type data. (Submitted to the IEEE Trans. SMC).
M. Ichino and P. Brito (2015). The data accumulation PCA to analyze periodically summarized multiple data tables. (Submitted to the IEEE Trans. SMC).
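The accumulation step itself is a cumulative sum over zero-one normalized features, which a few lines of R make explicit (toy numerical data; the categorical multi-valued and symbolic cases discussed in the talk are not covered here):

# Toy numerical data: 5 objects described by 4 features.
X <- matrix(c(2, 10, 0.5, 7,
              4, 30, 0.1, 9,
              6, 20, 0.9, 3,
              8, 40, 0.3, 5,
              3, 25, 0.7, 1), nrow = 5, byrow = TRUE)

# Zero-one normalization of each feature, then accumulation within each object:
# y1 = x1, y2 = x1 + x2, ..., yd = x1 + ... + xd.
Z <- apply(X, 2, function(v) (v - min(v)) / (max(v) - min(v)))
Y <- t(apply(Z, 1, cumsum))

Y[, ncol(Y)]                 # yd: the maximum concept size of each object
matplot(t(Y), type = "l")    # the monotone accumulated concept functions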
Principal Curves and Surfaces for Interval-Valued Variables
Jorge Arce G.*, Oldemar Rodríguez†
June 26, 2015
* University of Costa Rica, San José, Costa Rica & Banco Nacional de Costa Rica; E-Mail: jarceg@bncr.fi.cr
† University of Costa Rica, San José, Costa Rica; E-Mail: oldemar.rodriguez@ucr.ac.cr

Abstract
In this paper we propose a generalization to symbolic interval-valued variables of the Principal Curves and Surfaces method proposed by T. Hastie in [4]. Given a data set X with n observations and m continuous variables, the main idea of the Principal Curves and Surfaces method is to generalize the principal component line, providing a smooth one-dimensional curved approximation to a set of data points in R^m. A principal surface is more general, providing a curved manifold approximation of dimension 2 or more. In our case we are interested in finding the principal curve that best approximates symbolic interval-valued data. In [2] and [3], the authors proposed the Centers and the Vertices Methods to extend the well-known principal component analysis method to a particular kind of symbolic objects characterized by multi-valued variables of interval type. In this paper we generalize both the Centers and the Vertices Methods, finding a smooth curve that passes through the middle of the data X in an orthogonal sense. Some comparisons of the proposed method with the Centers and the Vertices Methods are made; this was done with the RSDA package, using the Ichino and Interval Iris data sets, see [8] and [1]. For these comparisons we have used the cumulative variance and the correlation index.

Keywords: Interval-valued variables, Principal Curves and Surfaces, Symbolic Data Analysis.

References
[1] Billard, L. & Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. John Wiley & Sons Ltd, United Kingdom.
[2] Cazes, P., Chouakria, A., Diday, E. et Schektman, Y. (1997). Extension de l'analyse en composantes principales à des données de type intervalle. Rev. Statistique Appliquée, Vol. XLV, Num. 3, pp. 5-24, France.
[3] Douzal-Chouakria, A., Billard, L., Diday, E. (2011). Principal component analysis for interval-valued observations. Statistical Analysis and Data Mining, Volume 4, Issue 2, pages 229-246. Wiley.
[4] Hastie, T. (1984). Principal Curves and Surfaces. Ph.D. Thesis, Stanford University.
[5] Hastie, T. & Weingessel, A. (2014). princurve - Fits a Principal Curve in Arbitrary Dimension. R package version 1.1-12. http://cran.r-project.org/web/packages/princurve/index.html
[6] Hastie, T. & Stuetzle, W. (1989). Principal Curves. Journal of the American Statistical Association, Vol. 84, No. 406, 502–516.
[7] Hastie, T., Tibshirani, R. and Friedman, J. (2008). The Elements of Statistical Learning: Data Mining, Inference and Prediction. New York: Springer.
[8] Rodríguez, O., with contributions from Olger Calderon and Roberto Zuniga (2014). RSDA - R to Symbolic Data Analysis. R package version 1.2. http://CRAN.R-project.org/package=RSDA
[9] Rodríguez, O. (2000). Classification et Modèles Linéaires en Analyse des Données Symboliques. Ph.D. Thesis, Paris IX-Dauphine University.
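For a sense of the building block being generalized, the sketch below fits an ordinary principal curve to interval mid-points with the princurve package cited in [5]; note that the 1.x releases export principal.curve() while recent releases call it principal_curve(). This is only the classical Hastie–Stuetzle step on the centers, not the authors' interval extension, and the data are invented.

library(princurve)
set.seed(4)

# Interval mid-points of 100 objects lying near a quadratic curve.
u <- runif(100, -2, 2)
centers <- cbind(x = u + rnorm(100, sd = 0.1),
                 y = u^2 + rnorm(100, sd = 0.1))

fit <- principal_curve(centers)   # smooth curve through the middle of the centers
head(fit$s[fit$ord, ])            # fitted points along the curve, in curve order
fit$dist                          # total squared distance to the curve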
Principal component analyses of interval data using patterned covariance structures
Anuradha Roy
Department of Management Science and Statistics, The University of Texas at San Antonio
One UTSA Circle, San Antonio, TX 78249, USA
Contact author: Anuradha.Roy@utsa.edu

Keywords: Equicorrelated covariance structure, Jointly equicorrelated covariance structure, Interval data

New approaches to derive the principal components of interval data (Billard and Diday, 2006) are proposed by using block variance-covariance matrices, namely equicorrelated and jointly equicorrelated covariance structures (Leiva and Roy, 2011). This is accomplished by considering each interval as two repeated measures at the lower and upper bounds of the interval (two-level multivariate data), and then by assuming an equicorrelated covariance structure for the data. That is, for interval data with p intervals, the data no longer belong to the p-dimensional space, but to the 2p-dimensional space, and the set of p lower bounds is correlated with the set of p upper bounds. The second set of p points is another possible realization of the first set of p points. So, we start with random points in dimension p and the computations are done as if we were in dimension 2p. Eigenblocks and eigenmatrices are obtained based on the eigendecomposition of the block variance-covariance matrix of the data. We then analyze these eigenblocks and the corresponding principal vectors together in a suitable sense to get the adjusted eigenvalues and the corresponding eigenvectors of the interval data. It is shown that the (p × 1)-dimensional first principal vector corresponding to the first eigenblock represents the midpoints of the lower bounds and the corresponding upper bounds of the intervals. Similarly, the (p × 1)-dimensional second principal vector corresponding to the second eigenblock represents the midranges of the lower bounds and the corresponding upper bounds of the intervals. We then work independently with these principal vectors (whose components are the midpoints and midranges of the intervals, respectively) and their corresponding variance-covariance matrices, i.e., the corresponding eigenblocks, to get the eigenvalues and eigenvectors of the interval data. If there is some additional information (like brands, etc.) in the interval data, the interval data can be considered as three-level multivariate data and one can analyze it by assuming a jointly equicorrelated covariance structure. In this case, it is shown that the grand midpoints and grand midranges are the first and third principal vectors. The second principal vector turns out to be the additional information in the interval data. The proposed methods are illustrated with a real data set.

References
Billard, L. and Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. John Wiley & Sons Ltd, England.
Leiva, R. and Roy, A. (2011). Linear Discrimination for Multi-level Multivariate Data with Separable Means and Jointly Equicorrelated Covariance Structure. J. Statist. Plann. Inference 141, 1910–1924.

Population and Robust Symbolic Principal Component Analysis
M. Rosário Oliveira1, Margarida Vilela1, Rui Valadas2, Paulo Salvador3
1. CEMAT and Departamento de Matemática, Instituto Superior Técnico, Universidade de Lisboa, Portugal
2. DEEC and Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, Portugal
3. DETI and Instituto de Telecomunicações, Universidade de Aveiro, Portugal
*Contact author: rosario.oliveira@tecnico.ulisboa.pt

Keywords: Principal component analysis, Internet traffic, Interval symbolic data, Robust methods

Principal component analysis (PCA) is one of the most widely used statistical methods in the analysis of real problems. In the symbolic data analysis (SDA) context there have been several proposals to extend this methodology. The methods CPCA (centers) and VPCA (vertices), pioneers in symbolic PCA, proposed by Cazes et al. (1997) (vide Billard, 2006), are the best-known examples of this family of methods. However, in recent years many other alternatives have emerged in the literature (vide e.g. Wang, 2012). In this work, we present the population formulations corresponding to three of the symbolic PCA algorithms for interval data: the method of the centers (CPCA), the method of the vertices (VPCA), and complete-information PCA (CIPCA) (Wang, 2012). The theoretical formulations define a general method which allows substantial improvements on the existing algorithms in terms of time and number of operations, making them easily applicable to data sets with a large number of symbolic variables and a high number of objects. Moreover, this formulation enables the definition of the population symbolic components even when one or more variables are degenerate. Furthermore, analogously to conventional (non-symbolic) data, we have verified that the existence of atypical observations can distort the sample symbolic principal components and the corresponding scores. To overcome this problem in the context of SDA, we define two families of robust methods for symbolic PCA: one based on robust covariance matrices (Filzmoser, 2011) and another based on Projection Pursuit (Croux, 2007). To make these new statistical tools easy to use in the analysis of real problems, we also developed a web application, using the Shiny web application framework for R, which includes several tools to analyse, represent and perform symbolic (classical and robust) PCA on interval data in an interactive manner. In this app it is possible to compare the classical symbolic PCA methods with all the new robust approaches proposed in this work, and its operation will be illustrated with telecommunications data. For conventional data, PCA is frequently used as an intermediate step in the analysis of complex problems (Johnson, 2007), and is commonly used as input for other multivariate methods. To pursue this goal, we designed R routines to make conversions between different representations of interval-valued data, making it easier to use several R SDA packages consecutively in the same analysis. These packages were developed independently and each one requires reading the data in a specific format.

Acknowledgment
This work was partially funded by Fundação para a Ciência e a Tecnologia (FCT) through projects PTDC/EEI-TEL/5708/2014 and UID/Multi/04621/2013.

References
Billard, L., Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Analysis. John Wiley and Sons, Chichester.
Cazes, P., Chouakria, A., Diday, E., Schektman, Y. (1997). Extension de l'analyse en composantes principales à des données de type intervalle. Revue de Statistique Appliquée 45(3), 5-24.
Croux, C., Filzmoser, P., Oliveira, M. R. (2007). Algorithms for Projection-Pursuit robust principal component analysis. Chemometr. Intell. Lab. 87(2), 218–225.
Filzmoser, P., Todorov, V. (2011). Review of robust multivariate statistical methods in high dimension. Anal. Chim. Acta 705, 2–14.
Johnson, R. A. and Wichern, D. W. (2007). Applied Multivariate Statistical Analysis. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
Wang, H., Guan, R., Wu, J. (2012). CIPCA: Complete-Information-based Principal Component Analysis for Interval-valued Data. Neurocomputing 86, 158–169.

A topological clustering method for histogram data
Guénaël Cabanes1, Younès Bennani1, Rosanna Verde2, Antonio Irpino2
1. LIPN-CNRS, UMR 7030, Université Paris 13, France
2. Dip. Scienze Politiche "Jean Monnet", Seconda Università di Napoli, Italia
*Contact author: cabanes@lipn.univ-paris13.fr

Keywords: Clustering, Self-Organising Map, Histogram data, Wasserstein distance

We present a clustering algorithm for histogram data based on Self-Organising Map (SOM) learning. SOM for symbolic data was first proposed by Bock (Bock and Diday, 2000) to visualise the structure of symbolic data in a reduced subspace. Further SOM methods for a particular kind of symbolic data, interval data, have been developed using suitable distances for interval data, such as the Hausdorff distance, the L2 distance and adaptive distances (Irpino and Verde, 2008). For histogram data, another representation of symbolic data through empirical distributions, a SOM based on the L2 Wasserstein distance has been proposed by De Carvalho et al. (2013) for clustering distributions. Adaptive Wasserstein distances have also been developed in this context to automatically find weights for the variables as well as for the clusters. However, most of these methods provide a quantification and a visualization of symbolic data (intervals, histograms), but cannot be used directly to obtain a clustering of the data. The recent algorithm proposed by Cabanes et al. (2013), S2L-SOM learning for interval data, is a two-level clustering algorithm based on SOM that combines the dimension reduction performed by the SOM with the clustering of the data, in the reduced space, into a certain number of homogeneous clusters. Here, we propose an extension of this approach to histogram data. In the clustering phase, the L2 Wasserstein distance is used, following the dynamic clustering algorithm proposed by Verde and Irpino (2006). The number of clusters is not fixed a priori as a parameter of the clustering algorithm, but is found automatically from an estimation of the local density and connectivity of the data in the original space, as in Cabanes et al. (2012).

References
Bock, H.-H. and E. Diday, Eds. (2000). Analysis of Symbolic Data. Exploratory methods for extracting statistical information from complex data. Springer Verlag, Heidelberg.
Cabanes, G., Y. Bennani, R. Destenay, and A. Hardy (2013). A new topological clustering algorithm for interval data. Pattern Recognition 46(11), 3030–3039.
Cabanes, G., Y. Bennani, and D. Fresneau (2012). Enriched topological learning for cluster detection and visualization. Neural Networks 32, 186–195.
De Carvalho, F. d. A., A. Irpino, and R. Verde (2013). Batch self organizing maps for interval and histogram data. pp. 143–154. Curran Associates, Inc.
Irpino, A. and R. Verde (2008). Dynamic clustering of interval data using a Wasserstein-based distance. Pattern Recogn. Lett. 29(11), 1648–1658.
Verde, R. and A. Irpino (2006). Dynamic clustering of histograms using Wasserstein metric. In A. Rizzi and M. Vichi (Eds.), Proceedings in Computational Statistics, COMPSTAT 2006, Heidelberg, pp. 869–876. Physica Verlag.
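For one-dimensional distributions, the L2 Wasserstein distance underlying this approach reduces to the L2 distance between quantile functions, which is easy to approximate from samples or histograms. A small illustrative sketch in R (invented data; this is not the authors' S2L-SOM code):

set.seed(5)
# Two empirical distributions (e.g., the values behind two histogram descriptions).
x <- rnorm(500, mean = 0, sd = 1)
y <- rnorm(500, mean = 2, sd = 1.5)

# L2 Wasserstein distance in 1-D: sqrt of the integral over t of (Qx(t) - Qy(t))^2,
# approximated on a grid of quantile levels.
wasserstein_l2 <- function(x, y, ngrid = 1000) {
  t <- (seq_len(ngrid) - 0.5) / ngrid
  sqrt(mean((quantile(x, t) - quantile(y, t))^2))
}

wasserstein_l2(x, y)
wasserstein_l2(x, x)   # zero, up to the discretization of the grid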
A co-clustering algorithm for interval-valued data
Francisco de A. T. de Carvalho1,*, Roberto C. Fernandes1
1. Centro de Informatica - CIn/UFPE
*Contact author: fatc@cin.ufpe.br

Keywords: Co-clustering, Double k-means, Interval-valued data, Symbolic data analysis

Co-clustering methods, also known as bi-clustering or block clustering, aim to simultaneously cluster the objects and the variables of a data set. They summarize the initial data matrix into a much smaller matrix representing homogeneous blocks, or co-clusters, of similar objects and variables (Govaert, 1995; Govaert and Nadif, 2013). A clear advantage of this approach over the traditional sequential approach is that the simultaneous clustering of objects and variables may provide new insights about the association between isolated clusters of objects and isolated clusters of variables (Mechelen et al., 2004). Co-clustering is used successfully in many different areas such as text mining, bioinformatics, etc. This presentation aims at giving a co-clustering algorithm able to simultaneously cluster objects and interval-valued variables. Interval-valued variables are needed, for example, when an object represents a group of individuals and the variables used to describe it need to assume values which express the variability inherent to the description of a group. Interval-valued data arise in practical situations such as recording monthly interval temperatures at meteorological stations, daily interval stock prices, etc. Another source of interval-valued data is the aggregation of huge databases into a reduced number of groups, the properties of which are described by interval-valued variables. Therefore, tools for interval-valued data analysis are very much required (Bock and Diday, 2000). Symbolic data analysis has provided suitable tools for clustering objects described by interval-valued variables: agglomerative (Gowda and Diday, 1991; Guru et al., 2004) and divisive (Gowda and Ravi, 1995; Chavent, 2000) hierarchical methods, and hard (Chavent and Lechevallier, 2002; De Souza and De Carvalho, 2004; De Carvalho et al., 2006) and fuzzy (Yang et al., 2004; De Carvalho, 2007) partitioning cluster algorithms. However, much less attention has been given to the simultaneous clustering of objects and symbolic variables (Verde and Lechevallier, 2001). This presentation gives a co-clustering algorithm for interval-valued data using suitable Euclidean, City-Block and Hausdorff distances. The presented algorithm is of double k-means type (Govaert, 1995; Govaert and Nadif, 2013). This means that it is an iterative three-step (representation, allocation of variables, allocation of objects) relocation algorithm that looks, simultaneously, for the representatives of the co-clusters (blocks), for a partition of the set of variables into a fixed number of clusters, and for a partition of the set of objects also into a fixed number of clusters, such that a clustering criterion (objective function) measuring the fit between the initial data matrix and the matrix of co-cluster representatives is locally minimized. These steps are repeated until convergence of the algorithm, which can be proved (Diday and Coll, 1980).
In this presentation we give the clustering criterion (objective function) and the main steps of the algorithm: the computation of the co-cluster representatives, the determination of the best partition of the variables, and the determination of the best partition of the objects. The usefulness of the presented co-clustering algorithm is illustrated by its execution on some benchmark interval-valued data sets.

References
H-.H. Bock and E. Diday (2000). Analysis of Symbolic Data. Springer, Berlin et al.
M. Chavent (2000). Criterion-based divisive clustering for symbolic objects. In H.-H. Bock and E. Diday (Eds.), Analysis of symbolic data, exploratory methods for extracting statistical information from complex data (Springer, Berlin), pp. 291–311.
M. Chavent and Y. Lechevallier (2002). Dynamical clustering algorithm of interval data: optimization of an adequacy criterion based on Hausdorff distance. In IFCS 2002, 8th Conference of the International Federation of Classification Societies (Cracow, Poland), pp. 53–59.
F.A.T. De Carvalho (2007). Fuzzy c-means clustering methods for symbolic interval data. Pattern Recognition Letters, 28, 423–437.
F.A.T. De Carvalho, P. Brito, and H.-H. Bock (2006). Dynamic clustering for interval data based on L2 distance. Computational Statistics, 2, 231–250.
R.M.C.R. de Souza and F.A.T. de Carvalho (2004). Clustering of interval data based on City-Block distances. Pattern Recognition Letters, 25, 353–365.
E. Diday and Coll. (1980). Optimisation en Classification Automatique. INRIA, Le Chesnay.
Y. El-Sonbaty and M.A. Ismail (1998). Fuzzy clustering for symbolic data. IEEE Transactions on Fuzzy Systems, 6, 195–204.
G. Govaert (1991). Simultaneous clustering of rows and columns. Control and Cybernetics, 24, 437–458.
G. Govaert and M. Nadif (2013). Co-clustering: models, algorithms and applications. Wiley, New York.
K.C. Gowda and E. Diday (1991). Symbolic clustering using a new dissimilarity measure. Pattern Recognition, 24, 567–578.
K.C. Gowda and T.R. Ravi (1995). Divisive clustering of symbolic objects using the concepts of both similarity and dissimilarity. Pattern Recognition, 28, 1277–1282.
D.S. Guru, B.B. Kiranagi, and P. Nagabhushan (2004). Multivalued type proximity measure and concept of mutual similarity value useful for clustering symbolic patterns. Pattern Recognition Letters, 25, 1203–1213.
I. V. Mechelen, H.-H. Bock, and P. D. Boeck (2004). Two-mode clustering methods: a structured overview. Statistical Methods in Medical Research, 13, 363–394.
R. Verde and Y. Lechevallier (2005). Crossed Clustering Method on Symbolic Data Tables. In Proceedings of the Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society - CLADAG 2003 (Bologna, Italy), pp. 87–94.
M.-S. Yang, P.-Y. Hwang, and D.-H. Chen (2004). Fuzzy clustering algorithms for mixed feature variables. Fuzzy Sets and Systems, 141, 301–317.
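To make the three-step structure concrete, here is a compact double k-means sketch that represents each interval by its mid-point and uses squared Euclidean distance on blocks. It only illustrates the general scheme described above; the distances, representatives and criterion of the actual talk may differ, and the data and function name are invented.

set.seed(6)
# Interval data summarised by mid-points: 20 objects x 6 interval variables.
M <- cbind(matrix(rnorm(20 * 3, 0), 20, 3), matrix(rnorm(20 * 3, 5), 20, 3))
M[11:20, ] <- M[11:20, ] + 3

double_kmeans <- function(M, K, Q, iter = 20) {
  n <- nrow(M); p <- ncol(M)
  oc <- sample(rep_len(1:K, n))      # initial partition of the objects
  vc <- sample(rep_len(1:Q, p))      # initial partition of the variables
  G  <- matrix(0, K, Q)
  for (it in seq_len(iter)) {
    # 1) Representation: one prototype per (object cluster, variable cluster) block
    for (k in 1:K) for (q in 1:Q)
      if (any(oc == k) && any(vc == q))
        G[k, q] <- mean(M[oc == k, vc == q, drop = FALSE])
    # 2) Allocation of objects: assign each object to its best-fitting row of blocks
    oc <- apply(M, 1, function(row)
      which.min(sapply(1:K, function(k) sum((row - G[k, vc])^2))))
    # 3) Allocation of variables: assign each variable to its best-fitting column of blocks
    vc <- apply(M, 2, function(col)
      which.min(sapply(1:Q, function(q) sum((col - G[oc, q])^2))))
  }
  list(objects = oc, variables = vc, prototypes = G)
}

double_kmeans(M, K = 2, Q = 2)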
Dynamic clustering of interval data based on hybrid L1, L2 and L∞ distances
Leandro C. Souza1,2,*, Renata M. C. R. Souza1, Getúlio J. A. Amaral3
1. Universidade Federal de Pernambuco (UFPE), CIn, Recife - PE, Brazil
2. Universidade Federal Rural do Semi-Árido (UFERSA), DCEN, Mossoró - RN, Brazil
3. Universidade Federal de Pernambuco (UFPE), DE, Recife - PE, Brazil
*Contact author: lcs6@cin.ufpe.br

Keywords: Interval Distance, Interval Symbolic Data, Dynamic Clustering

Cluster analysis is a traditional approach for the exploratory discovery of knowledge. The dynamic partitional clustering approach partitions the data and associates a prototype with each partition. Distance measures are necessary to perform the clustering. In the literature, a variety of distances have been proposed for clustering interval data, such as the L1, L2 and L∞ distances. Interval data, however, carry an extra kind of information, not present in point data, related to the variation or uncertainty they represent. We therefore propose a mapping from intervals to points which preserves their location and internal variation, and which allows formulating new hybrid L1, L2 and L∞ distances, all based on the Lq distance for point data. These distances are used to perform dynamic clustering of interval data.

Let Ω = {I1, I2, ..., In, ..., IN} be a set of p-dimensional interval data with N observations. An interval multivariate instance In in Ω is given by

    In = ([a_1n, b_1n], [a_2n, b_2n], ..., [a_pn, b_pn]),    (1)

where n = 1, 2, ..., N and a_jn <= b_jn for j = 1, ..., p. Consider a partition of the set Ω into K clusters. Let Gk be the p-dimensional interval prototype of class k and Ck the k-th class. Partitional dynamic clustering over Ω is proposed to minimize the criterion J defined by

    J = sum_{k=1}^{K} sum_{In in Ck} d(In, Gk),    (2)

where d is a distance function. For a generic interval instance In, the mapping M, which preserves location and internal variation, generates one point and one vector (both p-dimensional):

    M: ([a_1n, b_1n], ..., [a_pn, b_pn]) -> {(a_1n, ..., a_pn), (δ_1n, ..., δ_pn)},    (3)

with δ_jn = b_jn - a_jn. As two different kinds of information are used, a hybridism occurs in the mapping. The hybrid L1 (HL1) distance is given by

    d_HL1(In, Gk) = sum_{j=1}^{p} ( |a_jn - a_jGk| + |δ_jn - δ_jGk| ).    (4)

The hybrid L2 (HL2) distance has the expression

    d_HL2(In, Gk) = sum_{j=1}^{p} ( |a_jn - a_jGk|^2 + |δ_jn - δ_jGk|^2 ).    (5)

The hybrid L∞ (HL∞) distance is proposed as

    d_HL∞(In, Gk) = max_{j=1,...,p} |a_jn - a_jGk| + max_{j=1,...,p} |δ_jn - δ_jGk|,    (6)

where max{·} is the maximum function.

To compare the quality of the clustering results, the adjusted Rand index (ARI) is used on a synthetic dataset. ARI values closer to 1 indicate a stronger agreement between the obtained clusters and a known partition. The bootstrap method is used to construct non-parametric confidence intervals for the mean ARI values of the L1, L2, L∞, HL1, HL2 and HL∞ distances, with 95% confidence. In the synthetic dataset, intervals are constructed by randomly drawing values for centers and ranges, which yields three clusters, two ellipsoidal (with 150 elements each) and a third spherical one (with 50 elements). The centers, with coordinates (cx, cy), follow bivariate normal distributions with mean vector µ = (µx, µy) and diagonal covariance matrix Σ = diag(σx², σy²), with the following values: Cluster 1: µx = 30, µy = 10, σx² = 100 and σy² = 25; Cluster 2: µx = 50, µy = 30, σx² = 36 and σy² = 144; Cluster 3: µx = 30, µy = 35, σx² = 16 and σy² = 16. The ranges are generated using uniform distributions over an interval [v, u], denoted Un(v, u). The rectangle with center coordinates (cxi, cyi) has ranges λxi and λyi for x and y, respectively. The interval data are constructed as

    ([cxi - λxi/2, cxi + λxi/2], [cyi - λyi/2, cyi + λyi/2]).    (7)

A general configuration is used for the ranges, where the uniform distributions differ across clusters and dimensions. Table 1 shows these distributions.
Table 1 shows the uniform distributions used for the ranges.

Table 1: Uniform distributions for interval ranges
Cluster   x distribution   y distribution
1         Un(4, 7)         Un(1, 3)
2         Un(1, 2)         Un(6, 9)
3         Un(2, 3)         Un(3, 6)

Table 2 presents the non-parametric confidence intervals for this synthetic configuration. 100 datasets were generated. For each one, the clustering was applied 100 times. The solution with the lowest criterion was selected, resulting in 100 ARI values. The bootstrap method is applied to the ARI values with 2000 repetitions and a confidence level of 95%.

Table 2: Non-parametric confidence intervals for the comparison of distances
Distance   ARI confidence interval
L1         [0.72, 0.77]
HL1        [0.85, 0.87]
L2         [0.51, 0.56]
HL2        [0.51, 0.55]
L∞         [0.81, 0.84]
HL∞        [0.78, 0.82]

The confidence intervals for the ARI means reveal the better fit of HL1, since its limits are greater than those of the other distances.
References
[1] Chavent, M., Lechevallier, Y. (2002). Dynamical clustering of interval data: Optimization of an adequacy criterion based on Hausdorff distance, Classification, Clustering, and Data Analysis, 53–60.
[2] Souza, R. M. C. R., De Carvalho, F. D. A. T. (2004). Clustering of interval data based on city-block distances, Pattern Recognition Letters 25, 353–365.
[3] De Carvalho, F. D. A. T., Brito, P., Bock, H.-H. (2006). Dynamic clustering for interval data based on L2 distance, Computational Statistics 21, 231–250.

SDA 2015 November 17 – 19 University of Orléans, France
Wednesday, November 18, Afternoon
IIIA Computer Science Building, Herbrand Amphitheatre
Session 5: NETWORKS Chair: Christel VRAIN
14:00 – 14:25 Symbolic Data Analysis for Social Networks Fernando PEREZ, University of Mexico, Mexico
14:25 – 14:50 Symbolic Data Analysis of Large Scale Spatial Network Data Carlo DRAGO, Alessandra REALE, Niccolo Cusano University, Rome & ISTAT, Italy
Session 6: VISUALIZATION Chair: Monique NOIRHOMME
14:50 – 15:15 Matrix Visualization for Big Data Chun-houh CHEN, Chiun-How KAO, Academia Sinica, Taipei, Taiwan; Yin-Jing TIEN, UST, Taiwan
15:15 – 15:40 Exploration on Audio-Visual Mappings Daniel DEFAYS, University of Liège, Belgium
15:40 – 16:10 Coffee Break
Session 7: COMPOSITIONAL DATA Chair: Paula BRITO
16:10 – 16:40 Sample Space Approach of Compositional Data Vera PAWLOWSKY, University of Girona, Spain
16:40 – 17:10 Compositional Data of Contingency Tables Juan José EGOZCUE, University of Catalonia, Spain
17:10 – 17:35 Distributional Modeling Using the Logratio Approach Karel HRON, Palacky University of Olomouc, Czech Republic; Peter FILZMOSER, TU Vienna, Austria; Alessandra MENAFOGLIO, Polytechnic University, Milan, Italy
19:30 Workshop Dinner. 'Au Bon Marché' Restaurant, 12 Place du Châtelet, Orléans. Tram stop: Royale-Châtelet

Symbolic Data Analysis for Social Networks
Fernando Pérez1,*
1. IIMAS, National Autonomous University of Mexico
* Contact author: fernando@sigma.iimas.unam.mx
Keywords: Symbolic Data, Social Network Analysis, Public Health.
Public Health problems are among the most interesting topics in Social Network Analysis, given their repercussions and variety of contexts (Schaefer & Simpkins, 2014). We seek to study "infection" by customs and values, so we will be looking into non-traditional structures. This is interesting because of the unusual nature and the complexity of quantifying a relationship of this abstract nature, but it is precisely this that allows the incorporation of Symbolic Data Analysis (Diday & Noirhomme-Fraiture, 2008; Giordano & Brito, 2014).
References
Giordano, G. & Brito, P. (2014). Social Networks as Symbolic Data.
Analysis and Modeling of Complex Data in Behavioral and Social Sciences, Springer.
Schaefer, D. & Simpkins, S. (2014). Using Social Network Analysis to Clarify the Role of Obesity in Selection of Adolescent Friends. American Journal of Public Health, 104, 1223–1229.
Diday, E. & Noirhomme-Fraiture, M. (2008). Symbolic Data Analysis and the SODAS Software. Wiley, England.

Symbolic Data Analysis of Large Scale Spatial Network Data
Carlo Drago1,*, Alessandra Reale1
1. University of Rome "Niccolò Cusano" and Italian National Institute of Statistics (ISTAT)
* Contact author: c.drago@mclink.it
Keywords: Symbolic Data Analysis, Social Network Analysis, Community Detection, Spatial Data Mining
Modern spatial networks are ubiquitous in many different contexts and are increasingly massive in size. The challenges posed by large-scale spatial networks call for new methodologies and approaches that allow extracting the relevant patterns from the data. In this work we will examine spatial network data, taking into account their characteristics, and we will consider different approaches in order to represent and analyze these networks by means of Symbolic Data. From the representations of the networks, we will show different Symbolic Data Analysis approaches to detect the different patterns that can be found in the data. We will conduct a simulation study and an application on real data.
References
Billard, L., & Diday, E. (2003). From the statistics of data to the statistics of knowledge: symbolic data analysis. Journal of the American Statistical Association, 98(462), 470-487.
Diday, E., & Noirhomme-Fraiture, M. (Eds.). (2008). Symbolic data analysis and the SODAS software. J. Wiley & Sons.
Drago C. (2015). Large Network Analysis: Representing the Community Structure by Means of Interval Data. Fifth International Workshop on Social Network Analysis (ARS 2015), 04/2015.
Giordano G., Brito M. P. (2014). Social Networks as Symbolic Data. In: Analysis and Modeling of Complex Data in Behavioral and Social Sciences, edited by Vicari, D., Okada, A., Ragozini, G., Weihs, C. (Eds.), 06/2014; Springer Series: Studies in Classification, Data Analysis, and Knowledge Organization. ISBN: 978-3-319-06691-2.
Giordano G., Signoriello S. and Vitale M.P. (2008). Comparing Social Networks in the framework of Complex Data Analysis. CLEUP Editore, Padova: pp. 1-2. In: XLIV Riunione Scientifica della Società Italiana di Statistica.

Matrix Visualization for Big Data
Chun-houh Chen1,*, Chiun-How Kao1,2,3, Yin-Jing Tien2
1. Academia Sinica
2. National Taiwan University of Science and Technology
3. Institute for Information Industry
*Contact author: cchen@stat.sinica.edu.tw
Keywords: Exploratory Data Analysis (EDA), Generalized Association Plots (GAP), Symbolic Data Analysis (SDA)
"It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it" (Exploratory Data Analysis: John Tukey, 1977). Data analysts and statistics practitioners nowadays face difficulties in understanding higher and higher dimensional data of a more and more complex nature, while conventional graphics/visualization tools do not meet these needs. Understanding the overall structure of big data sets is an even more difficult challenge, so good and appropriate Exploratory Data Analysis (EDA) practices are going to play ever more important roles in understanding what one can do in the big data era.
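As background for the matrix-visualization discussion that continues below, here is a minimal sketch (assuming SciPy and Matplotlib; data and names are hypothetical) of the basic MV operation: reorder the rows and columns of a data matrix with a clustering-derived permutation and display the permuted matrix as a heatmap.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 12))          # toy data: 60 subjects x 12 variables

row_order = leaves_list(linkage(X, method="average"))    # permute subjects by clustering
col_order = leaves_list(linkage(X.T, method="average"))  # permute variables by clustering

plt.imshow(X[np.ix_(row_order, col_order)], aspect="auto", cmap="viridis")
plt.xlabel("variables (clustered order)")
plt.ylabel("subjects (clustered order)")
plt.savefig("matrix_visualization.png")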
Matrix Visualization (MV) has been shown to be more efficient than conventional EDA tools such as the boxplot, the scatterplot (with dimension reduction techniques), and the parallel coordinate plot for extracting information embedded in moderate to large data sets of binary, continuous, ordinal, and nominal nature. In this study we plan to investigate the feasibility and potential difficulties of applying related MV techniques for visualizing and exploring structure in big data: (1) memory/computation (permutation with clustering) of proximity matrices for variables and subjects; (2) display of data and proximity matrices for variables and subjects. We shall integrate techniques from the Hadoop computing environment, image scaling, and Symbolic Data Analysis into the framework of GAP (Generalized Association Plots) in coming up with an appropriate package for conducting Big Data EDA with visualization.
References
Tukey, John W. (1977). Exploratory Data Analysis. Addison-Wesley.

Exploration on Audio-Visual Mappings
Daniel Defays1
1. University of Liège, ddefays@ulg.ac.be
Keywords: Dissimilarities, Distances, Metric spaces, Isometric mapping, Pattern recognition
Abstract
Which of these 5 visual sequences of images corresponds to the Happy Birthday song?
[Figure 1: five candidate sequences of images, Sequence 1 to Sequence 5. Caption: Which of these visual sequences corresponds best to the Happy Birthday song? And why?]
Most people will probably choose the third sequence, with the cake, glasses of champagne, birds and balloons, which evokes a birthday party. But those familiar with music notation could choose the score, sequence 1 or 2, which code the music of the song. It is unlikely that the last two sequences will be very appealing to anybody, despite the fact that they are linked to the four parts of the song, as will be explained in the talk.
Figure 1 illustrates the topic of an exploration of trans-sensory mappings: how can different sensorial inputs be mapped onto each other? More specifically here, how can a set of images be associated with a piece of music? Numerous facts suggest that different inputs to our sensorial channels share some common patterns, or at least that at some stage of their processing by the nervous system they activate the same areas. If this is the case, matching between songs, images and odours could make some sense. In fact, software already exists to bridge music and images, as in the work of the Analema group or in the Media Player application. The paper will focus on one particular aspect of that exploration: the "mathematical" mapping of images onto songs. The songs are decomposed into segments that are then represented in a multidimensional space through a kind of spectral analysis (using Mel Frequency Cepstral Coefficients) widely used in automatic speech and speaker recognition [Berenzweig et al, 2003].
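A minimal sketch of the kind of segment representation just described (assuming the librosa library; the audio file, segment length and parameters are illustrative choices, not those of the author): split a song into short segments, describe each segment by time-averaged Mel Frequency Cepstral Coefficients, and compute pairwise Euclidean dissimilarities between segments.

import numpy as np
import librosa
from scipy.spatial.distance import pdist, squareform

y, sr = librosa.load("happy_birthday.wav", sr=None)      # hypothetical audio file
seg_len = 2 * sr                                         # 2-second segments
segments = [y[i:i + seg_len] for i in range(0, len(y) - seg_len + 1, seg_len)]

# One feature vector per segment: time-averaged MFCCs.
features = np.array([librosa.feature.mfcc(y=s, sr=sr, n_mfcc=13).mean(axis=1)
                     for s in segments])

D = squareform(pdist(features, metric="euclidean"))      # segment dissimilarity matrix
print(D.shape)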
In the area of automated processing of images, Nguyen-Khang Pham, Annie Morin, Patrick Gros and Quyet-Thang Le have used local descriptors obtained through filters to quantify the content of images, and Factorial Correspondence Analysis (FCA) to reduce the number of dimensions [Pham et al, 2009]. This makes it possible to represent images in Cartesian spaces as well. Once the two sets have been wrapped into structures, a morphism between the two sets can be elaborated. A method will be presented which makes it possible to find the subset of images (S), characterized by their dissimilarities, which matches in an optimal way the musical segments (C) of a song, also characterized by dissimilarities. The quality of the match is assessed by comparing the dissimilarities of the elements of the target C with the corresponding dissimilarities in S. The closer they are, the better the fit. Two different algorithms will be presented and commented on. The images in the 5th sequence of Figure 1 have been extracted from a set of 55 photos using that method.
References
Berenzweig A., Ellis D., Lawrence S. (2003). Anchor space for classification and similarity measurement of music. http://www.ee.columbia.edu/~dpwe/pubs/icme03-anchor.pdf
Defays D. (1978). A short note on a method of seriation. British Journal of Mathematical Psychology, 31, pp. 49-53.
Diday E. and Noirhomme-Fraiture M. (eds) (2008). Symbolic Data Analysis and the SODAS Software, Wiley.
Pham N-K., Morin A., Gros P., Le Q-T. (2009). Utilisation de l'analyse factorielle des correspondances pour la recherche d'images à grande échelle. Actes d'EGC, RNTI-E-15, Revue des Nouvelles Technologies de l'Information - Série Extraction et Gestion des Connaissances, Cépaduès Editions, pp. 283-294.
Widmer G., Dixon S., Goebl W., Pampalk E., Tobudic A. (2003). In search of the Horowitz factor. AI Magazine, Volume 24, Number 3, pp. 111-130.

SAMPLE SPACE APPROACH TO COMPOSITIONAL DATA
Vera PAWLOWSKY-GLAHN
University of Girona, Spain
The analysis of compositional data based on the Euclidean space structure of their sample space is presented. Known as the Aitchison geometry of the simplex, it is based on the group operation known as perturbation, the external multiplication known as powering, and the simplicial inner product with the induced distance and norm. These basic operations are introduced, together with their geometric interpretation. To work with standard statistical methods within this geometry, it is necessary to build, in a sensible way, coordinates which are interpretable. This is done using Sequential Binary Partitions, which can be illustrated with the CoDa-dendrogram. Tools like the biplot and the variation array are used as a previous exploratory analysis to guide this construction.

COMPOSITIONAL ANALYSIS OF CONTINGENCY TABLES
Juan José EGOZCUE and Vera PAWLOWSKY-GLAHN
University of Catalonia and University of Girona, Spain
Contingency tables contain in each cell the counts of the corresponding events. In a multinomial sampling scenario, the joint distribution of the counts in the cells can be parametrised by a table of probabilities, which is a joint probability function of two categorical variables. These probabilities can be assumed to be a composition. The goal of the analysis is to decompose the probability table into an independent table and an interaction table.
The optimal independent table, in the sense of the Aitchison geometry of the simplex, has been shown to be the product of the geometric marginals, better than the traditional arithmetic marginals. The decomposition is unique and the independent part is an orthogonal projection of the probability table onto the subspace of independent tables. The interaction table is analysed using its clr representation, which is directly related to cell interactions. A summary measure of dependence is the simplicial deviance (the squared Aitchison norm of the interaction table). An example is presented.

Distributional modeling using the logratio approach
Karel Hron1, Peter Filzmoser2, Alessandra Menafoglio3
1. Department of Mathematical Analysis and Applications of Mathematics, Palacký University, Olomouc, Czech Republic
2. Institute of Statistics and Mathematical Methods in Economics, Vienna University of Technology, Vienna, Austria
3. MOX - Department of Mathematics, Politecnico di Milano, Milano, Italy
* Contact author: hronk@seznam.cz
Keywords: Symbolic data analysis, Bayes spaces, Compositional data, Logratio coordinates, Functional data analysis
Symbolic data analysis provides a unified approach to analyze distributional data, resulting from capturing the intrinsic variability of groups of individuals as input observations. In parallel to the symbolic data approach, since the early 1980s a concise methodology has been developed to deal with compositional data, i.e., data carrying only relative information (Aitchison, 1986; Pawlowsky-Glahn et al., 2015), like proportions or percentages, through the logratios of their parts. Most methods in compositional data analysis aim to treat multivariate observations which can be identified with probability functions of discrete distributions. Nevertheless, a methodology to capture the specific features of continuous distributions (densities) has also been introduced recently (Egozcue et al., 2006; van den Boogaart et al., 2014). The aim of this contribution is to describe a general setting that includes both the discrete and the continuous cases, and to provide specific details for both frameworks, focusing on the implications for symbolic data analysis. The theoretical developments are illustrated with real-world case studies.
References
Aitchison, J. (1986). The statistical analysis of compositional data. Chapman and Hall, London.
Egozcue, J.J., Díaz-Barrero, J.L., Pawlowsky-Glahn, V. (2006). Hilbert space of probability density functions based on Aitchison geometry. Acta Mathematica Sinica, English Series 22, 1175–1182.
Pawlowsky-Glahn, V., Egozcue, J.J., Tolosana-Delgado, R. (2015). Modeling and analysis of compositional data. Wiley, Chichester.
van den Boogaart, K.G., Egozcue, J.J., Pawlowsky-Glahn, V. (2014). Bayes Hilbert spaces. Australian & New Zealand Journal of Statistics 56, 171–194.
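A minimal numerical sketch of the contingency-table decomposition described in the abstract by Egozcue and Pawlowsky-Glahn above, under the assumption that the independent table is the closed (renormalised) outer product of the geometric row and column marginals and that the simplicial deviance is the squared Aitchison norm of the interaction table computed from its clr coefficients; all function names and the toy table are hypothetical.

import numpy as np

def closure(t):
    # Rescale a positive table so that its entries sum to 1.
    return t / t.sum()

def clr(t):
    # Centred log-ratio coefficients of a closed table viewed as a composition of I*J parts.
    lt = np.log(t)
    return lt - lt.mean()

P = closure(np.array([[8.0, 2.0, 2.0],
                      [3.0, 9.0, 4.0],
                      [1.0, 2.0, 6.0]]))     # toy probability table

g_rows = np.exp(np.log(P).mean(axis=1))      # geometric marginal of the rows
g_cols = np.exp(np.log(P).mean(axis=0))      # geometric marginal of the columns

P_ind = closure(np.outer(g_rows, g_cols))    # assumed independent part
P_int = closure(P / P_ind)                   # interaction part (perturbation-difference)

simplicial_deviance = np.sum(clr(P_int) ** 2)   # squared Aitchison norm of the interaction table
print(np.round(P_ind, 3), simplicial_deviance)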
SDA 2015 November 17 – 19 University of Orléans, France Thursday, November 19, Morning IIIA Building, Herbrand Amphitheatre Session 8: TIME SERIES Chair: Lynne BILLARD 09:00 – 09:25 New Results in Forecasts Combination using PCA with Interval Time Series: The Case of Oil Price Carlos MATE, Comillas Pontifical University, Madrid, Spain Andrea VASEKOVA, Masaryk University, Brno, Czech Republic 09:25 – 09:50 Locally Weighted Learning Methods for Histogram Data Albert MECO, Materia Works, Madrid, Spain Javier ARROYO, Complutense University, Madrid, Spain 09:50 – 10:15 Improving Accuracy of Corporate Financial Distress Prediction by Considering Volatility: an Interval-Data-Based Discriminant Model Rong GUAN, Yu LIU, CUFE, Beijing, China 10:15 – 10:40 Beanplot Analysis Strategies for Financial Data Carlo DRAGO, Niccolo Cusano University, Naples, Italy Carlo LAURO, Germana SCEPI, Frederick II Univ., Naples, Italy 10:40 – 11:05 Coffee Break Session 9: DATA ANALYSIS Chair: Javier ARROYO 11:05 – 11:30 Generalized ANOVA for SDA Vladimir BATAGELJ, IMFM, Ljubljana, Slovenia, Simona KORENJAK-CERNE, Natasa KEJZAR, University of Ljubljana, Slovenia 11:30 – 11:55 Factor Analysis of Interval Data Paula CHEIRA, LIAAD-INESC TEC, Univ. Porto & PI, Viana do Castelo , Portugal Paula BRITO, FEP & LIAAD-INESC TEC, Univ. Porto, Portugal A. Pedro DUARTE SILVA, UCP Porto, Portugal 11:55 – 12:20 Linear Discriminant Analysis for Interval and Histogram Data Sonia DIAS, LIAAD-INESC TEC, Univ. Porto & PI, Viana do Castelo , Portugal Paula AMARAL, New University, Lisbon, Portugal Paula BRITO, FEP & LIAAD-INESC TEC, Univ. Porto, Portugal 12:20 Lunch. L'Agora Restaurant, Campus New results in forecasts combination using PCA with interval time series: The case of oil price ? Carlos Maté1, , Andrea Vašeková2 1. Universidad Pontificia Comillas, Madrid (Spain) 2. Masaryk University, Brno (Czech Republic) ? Contact author: cmate@icai.comillas.edu Keywords: oil price forecasting, principal component analysis, symbolic data analysis This century is much more complex, uncertain and riskier than the previous one. In addition to the global crisis which began in December 2007 and still remains around the world, the year 2015 has brought new problems like the new oil crisis or the China crisis, among others. In this scenario, having accurate forecasts in economics, finance, energy, health and so on is more critical than ever. According to Energy Information Administration, oil is the most consumed source of energy, and thus most of economic activity depends on the evolution of oil prices. Lately it has been observed that the oil prices were very volatile. Alquist et al. (2013) made a critical survey regarding the econometric models (time series models, financial models and structural models) used to predict the oil prices. A very recent review on artificial intelligence methods in oil price forecasting can be found in Sehgal and Pandey (2015). The introduction of interval time series (ITS) concepts and forecasting methods has been proposed in various papers, such as Arroyo et al. (2011), Arroyo and Maté (2006), among others. After more than 40 years of research, there is a general consensus that "combining forecasts reduces the final forecasting error" (see, for example, Clemen (1989) and Timmerman (2006)). As part of this consensus there is also a well-known fact that "a simple average of several forecasts often outperforms complicated weighting schemes" which was named the forecast combination puzzle by Stock and Watson (2004). 
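The combination schemes of Maté (2015) are not detailed in this abstract; the following minimal Python sketch (names and numbers are hypothetical) only illustrates the simplest scheme mentioned above, the equally weighted average, applied to interval forecasts represented by lower and upper bounds.

import numpy as np

def combine_interval_forecasts(forecasts, weights=None):
    # forecasts: array of shape (n_methods, 2) holding [lower, upper] per method.
    # Returns the weighted average interval [lower, upper].
    forecasts = np.asarray(forecasts, dtype=float)
    if weights is None:
        weights = np.full(len(forecasts), 1.0 / len(forecasts))
    return weights @ forecasts

# Three hypothetical one-step-ahead oil-price interval forecasts (USD per barrel).
f = [[44.0, 49.0],
     [46.5, 52.0],
     [43.0, 47.5]]
print(combine_interval_forecasts(f))   # -> [44.5 49.5]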
Very recently, Maté (2015) has proposed several combination schemes for interval time series (ITS) forecasts. In addition, the forecast combination puzzle in the ITS forecasting framework has been analyzed in the context of different accuracy measures. One result of that paper is that the forecast combination puzzle remains in the case of ITS. Another result is that the principal component analysis (PCA) method for interval-valued data has been proposed as an item on the agenda for future research. PCA was one of the first methods extended from single-valued data to interval-valued data. For a review paper on this research field, see Douzal-Chouakria et al. (2011). In this paper we develop a new ITS forecast combination method based on PCA. We will show how this method performs with respect to the forecast combination puzzle. As a case study we analyze the oil market. Further research issues will be proposed.
References
Alquist, R., L. Kilian, and R. J. Vigfusson (2013). Forecasting the price of oil. Handbook of Economic Forecasting 2, 427–507.
Arroyo, J., R. Espínola, and C. Maté (2011). Different approaches to forecast interval time series: a comparison in finance. Computational Economics 37(2), 169–191.
Arroyo, J. and C. Maté (2006). Introducing interval time series: accuracy measures. In Compstat, Proceedings in Computational Statistics, pp. 1139–1146. Heidelberg: Physica-Verlag.
Clemen, R. (1989). Combining forecasts: A review and annotated bibliography. International Journal of Forecasting 5(4), 559–583.
Douzal-Chouakria, A., L. Billard, and E. Diday (2011). Principal component analysis for interval-valued observations. Statistical Analysis and Data Mining: The ASA Data Science Journal 4(2), 229–246.
Maté, C. (July, 2015). Combining interval time series forecasts: An agenda for future research. 1st International Symposium on Interval Data Modelling: Theory and Applications SIDM2015, Beijing, China.
Sehgal, N. and K. K. Pandey (2015). Artificial intelligence methods for oil price forecasting: a review and evaluation. Energy Systems, 1–28.
Stock, J. H. and M. W. Watson (2004). Combination forecasts of output growth in a seven-country data set. Journal of Forecasting 23(6), 405–430.
Timmerman, A. (2006). Forecast Combinations. In Handbook of Economic Forecasting, Elliott, G., Granger, C. W. J. and Timmerman, A. (eds.). Amsterdam: Elsevier.

Locally weighted learning methods for histogram data
Albert Meco1,*, Javier Arroyo2
1. Materia Works, Madrid, Spain.
2. Facultad de Informática, Universidad Complutense de Madrid, Madrid, Spain.
*Contact author: albert.meco@materiaworks.com
Keywords: histogram data, locally weighted regression, kernel regression, k-Nearest Neighbors
Lazy learning is a kind of machine learning that stores all the training instances and defers their processing until a new instance arrives. Locally weighted learning (LWL) is a form of lazy learning that, given a new instance, combines the most relevant training instances to yield a solution for the new one (Atkeson et al., 1997). In other words, instead of estimating a single global model for the whole data set, a new local model is estimated for each new instance using only the information provided by the closest already known instances. LWL methods include k-Nearest Neighbors (k-NN), kernel regression and locally weighted regression, among others. These methods are especially suited for regression problems dealing with complex (non-linear) target functions.
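A minimal sketch of classical (scalar) kernel regression, the locally weighted method discussed above; the histogram-valued adaptation described in the next paragraph replaces the averaged scalar outputs with barycentric histograms. The Gaussian kernel and the bandwidth are illustrative choices, and all names are hypothetical.

import numpy as np

def gaussian_kernel(d, h):
    # Weight that decays with the distance d, with bandwidth h.
    return np.exp(-0.5 * (d / h) ** 2)

def kernel_regression(x_query, X_train, y_train, h=1.0):
    # Nadaraya-Watson estimate: a locally weighted average of the training outputs.
    w = gaussian_kernel(np.abs(X_train - x_query), h)
    return np.sum(w * y_train) / np.sum(w)

rng = np.random.default_rng(1)
X = np.linspace(0.0, 6.0, 50)
y = np.sin(X) + 0.1 * rng.normal(size=X.size)   # noisy non-linear target
print(kernel_regression(2.0, X, y, h=0.5))      # local estimate near x = 2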
In the symbolic context, Arroyo and Maté (2009) adapted the k-NN method to histogram data, and Arroyo (2008) adapted it to interval data. The present work further explores locally weighted learning methods and adapts kernel regression and locally weighted regression to histogram data. Kernel regression is an extension of the k-NN method where all the training instances are averaged to estimate the output, using a kernel as a weighting function. As in the histogram k-NN (Arroyo and Maté, 2009), kernel regression for histogram data uses the histogram barycenter proposed by Irpino and Verde (2006) as the averaging device. Locally weighted regression is similar to kernel regression in that it uses all the training instances, but the output is approximated by linear regressions rather than by averages. In this work, locally weighted regression is adapted to histogram data using the histogram linear regression proposed by Irpino and Verde (2015). In the classical context, kernel and locally weighted regression make it possible to smoothly approximate non-linear functions. This work will analyze both methods and illustrate their potential with the help of synthetic and real-life data.
References
Arroyo, J. (2008). Métodos de predicción para series temporales de intervalos e histogramas. Ph.D. Dissertation, Universidad Pontificia Comillas, Madrid.
Arroyo, J., Maté, C. (2009). Forecasting histogram time series with k-nearest neighbours methods. International Journal of Forecasting 25 (1), 192-20.
Atkeson, C. G., Moore, A.W., Schaal, S. (1997). Locally weighted learning. Artificial Intelligence Review, 11(1–5), 11–73.
Irpino, A., Verde, R. (2006). A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. In Data Science and Classification, pp. 185-192.
Irpino, A., Verde, R. (2015). Linear regression for numeric symbolic variables: a least squares approach based on Wasserstein Distance. Advances in Data Analysis and Classification, 9 (1), 81-106.

Improving accuracy of corporate financial distress prediction by considering volatility: an interval-data-based discriminant model
Rong Guan1,*, Yu Liu2
1. School of Statistics and Mathematics, Central University of Finance and Economics, Beijing, 100081, CHINA
2. Propaganda Department, Central University of Finance and Economics, Beijing, 100081, CHINA
*Contact author: rongguan77@gmail.com
Keywords: Interval Data, Volatility, Financial Distress, Linear Multivariate Discriminant Model
Corporate financial distress prediction aims to tell, a lead time in advance (two years, for instance), whether a corporation will plunge into financial distress, a condition in which promises to the creditors of a corporation are broken or honored with difficulty. Due to its importance for investment decisions, the last three decades have witnessed an increasing research focus on this topic. One of the most well-known achievements belongs to Altman, whose z-score model (Altman, 1968) set the direction of estimating a statistical model using year-end data of financial indicators. Unfortunately, both the z-score model and the subsequent statistical methods have suffered from some limitations in the practical usefulness of their results. One of the major defects is that they are more likely to mistake a distressed corporation for a non-distressed one. One possible cause is that the "distress signal" transmitted by financial indicators becomes increasingly weak, and harder to capture, the longer the lead time before eventual distress.
To tackle this problem, this paper proposes to additionally consider the volatility information of financial indicators in the modeling process. Our motivation comes from the fact that a corporation generally experiences a large decline or fluctuation in its financial situation prior to distress. This leads to a larger volatility of the financial data during the prediction lead time, which we deem to be a significant signal of a high risk of distress. In this paper, we collect corporations' four-quarter financial data, and then use interval data to summarize them by setting the minimum/maximum values of the quarterly records respectively as the lower/upper bounds. In this way, both the central tendency and the volatility included in the quarterly records are taken into account in the prediction process (Billard and Diday, 2003). We then establish an interval-value-based linear multivariate discriminant model (iLMDA: Silva and Brito, 2006) for corporate financial distress prediction. An empirical analysis is carried out to compare iLMDA with the traditional LMDA using year-end numerical data. As expected, the iLMDA model shows much better performance in recognizing corporations at high risk, whereas the traditional LMDA model is more likely to mistake a distressed corporation for a non-distressed one. In summary, the iLMDA model successfully improves model efficiency in financial distress prediction by additionally considering the volatility information of financial indicators.
References
Altman, E. (1968). Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy. The Journal of Finance, 23(4), 589-609.
Billard, L. and Diday, E. (2003). From the Statistics of Data to the Statistics of Knowledge. Journal of the American Statistical Association, 98(462), 470-487.
Silva, A.P.D. and Brito, P. (2006). Linear Discriminant Analysis for Interval Data. Computational Statistics, 21, 289-308.

Beanplot Analysis Strategies for Financial Data
Carlo Drago1,*, Carlo Lauro2, Germana Scepi2
1. Università degli Studi "Niccolò Cusano"
2. Università degli Studi di Napoli "Federico II"
*Contact author: c.drago@mclink.it
Keywords: High frequency data, kernel density estimation, finite mixture models, constrained clustering, change point analysis.
In this work we propose a strategy to analyze complex data, such as financial time series, by using beanplot time series. In particular, beanplots allow visualizing complex time series in terms of the size, variability and shape of daily tick-by-tick financial indicators (prices, volumes exchanged, etc.). In order to synthesize the main information of each beanplot we develop a parametric approach: Finite Mixture Models are used to parameterize the original data and to summarize the information they contain. We introduce a suitable goodness-of-fit index (GOF) which allows assessing the adequacy of the models with respect to the original data. The GOF index can be used to consistently weight different data in order to obtain robust model-based composite indicators over time. Finally, we propose some strategies based on principal components and clustering of the model parameters, with the aim of aggregating individual beanplots and discovering relevant changes in the series, to be used for operational and predictive purposes.
References
Billard, L. and Diday, E. From the Statistics of Data to the Statistics of Knowledge. Journal of the American Statistical Association 98, 470–487 (2003).
Drago C., Scepi G. (2015) Time Series Clustering from High Dimensional Data.
Lecture Notes in Computer Science Series (LNCS), forthcoming, edited by Francesco Masulli, Alfredo Petrosino, Stefano Rovetta, 05/2015; Springer.
Drago C., Lauro C., Scepi G. (2013) "Beanplot Data Analysis in a Temporal Framework". Statistical Models for Data Analysis, edited by Paolo Giudici, Salvatore Ingrassia, Maurizio Vichi, 01/2013: pages 115-122; Springer Berlin Heidelberg. ISBN: 978-3-319-00032-9.
Drago C., Lauro C., Scepi G. (2015) "Visualization and Analysis of Multiple Time Series by Beanplot PCA". Statistical Learning and Data Sciences, Lecture Notes in Computer Science Series, Volume 9047, 02/2015.
Verde R. and Irpino A. Dynamic clustering of Histogram data: using the right metric. In: Brito P., Bertrand P., Cucumel G. and De Carvalho F., "Selected contributions in data analysis and classification" (pp. 123-134). Berlin, Springer Germany (2007).

Generalized ANOVA for SDA
Vladimir Batagelj1, Simona Korenjak-Černe2, Nataša Kejžar3
1. IMFM - Institute of Mathematics, Physics and Mechanics, Ljubljana, Slovenia
2. University of Ljubljana, Faculty of Economics, Ljubljana, Slovenia
3. University of Ljubljana, Faculty of Medicine, Ljubljana, Slovenia
* Contact author: simona.cerne@ef.uni-lj.si
Keywords: Generalized ANOVA, Generalized Ward's method, Generalized Huygens theorem
In Batagelj, 1988, the generalized Huygens theorem for any dissimilarity d (a basis for generalized ANOVA) was proved. It is based on the generalized definition of the cluster error p(C),
p(C) = \frac{1}{2\, w(C)} \sum_{X, Y \in C} w(X) \cdot w(Y) \cdot d(X, Y),
and on the following extension of the dissimilarity to the generalized center \tilde{C} of a cluster C,
d(U, \tilde{C}) = d(\tilde{C}, U) = \frac{1}{w(C)} \Big( \sum_{X \in C} w(X) \cdot d(X, U) - p(C) \Big).
The generalized Huygens theorem takes the form
I = p(E) = \sum_{C \in \mathcal{C}} p(C) + \sum_{C \in \mathcal{C}} w(C) \cdot d(\tilde{C}, \tilde{E}) = I_W + I_B.
Studer et al., 2011, used this to generalize ANOVA to a set of sequences. It is implemented in the R package TraMineR. For the dissimilarity they use optimal matching and therefore call the generalized variance a discrepancy. They also exposed the problem of the nonnegativity of the dissimilarity d(U, \tilde{C}). In Batagelj, 1988, it is shown that d(U, \tilde{C}) is nonnegative if the dissimilarity d between units satisfies the triangle inequality. If a dissimilarity d is not a metric it can be transformed into one using a power transformation (Joly and Le Calvé, 1986). In this paper we study a possible adaptation of the generalized ANOVA for symbolic data analysis. In Batagelj et al., 2015, it is shown that the generalized Huygens theorem holds for the first of six proposed dissimilarities. Here, we propose a more general approach that can be used for any dissimilarity. Due to Joly and Le Calvé, 1986, for each dissimilarity d there exists a real positive number p, called the metric index, such that d^k is a metric for k \le p, and is not a metric for k > p. Therefore, if the triangle inequality does not hold for the selected dissimilarity d, its metric index is less than 1. For such dissimilarities, we can find their metric index and use the transformed dissimilarity in the generalized formulas. In these cases the generalized Huygens theorem can be used. We further follow the procedure proposed in Studer et al., 2011: the part of the discrepancy which is explained by differences between clusters is measured with
R^2 = \frac{I_B}{I}.
Alternatively, the comparison between the inertia between clusters and the inertia within them is made with
F = \frac{I_B / (m - 1)}{I_W / (n - m)},
where m is the number of clusters and n is the number of units.
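A minimal Python sketch (not the TraMineR implementation; unit weights w(X) = 1 are assumed, and names are hypothetical) of the generalized decomposition above, computed from a precomputed dissimilarity matrix and a cluster assignment, with I_B obtained from the generalized Huygens theorem as I - I_W.

import numpy as np

def cluster_error(D, idx):
    # p(C) for unit weights: sum of within-cluster dissimilarities (both orders) over 2*|C|.
    sub = D[np.ix_(idx, idx)]
    return sub.sum() / (2 * len(idx))

def generalized_anova(D, labels):
    # Huygens decomposition I = I_W + I_B, with the associated R^2 and pseudo-F.
    n = D.shape[0]
    clusters = np.unique(labels)
    m = len(clusters)
    I = cluster_error(D, np.arange(n))                     # p(E), E = set of all units
    I_W = sum(cluster_error(D, np.flatnonzero(labels == c)) for c in clusters)
    I_B = I - I_W
    R2 = I_B / I
    F = (I_B / (m - 1)) / (I_W / (n - m))
    return I, I_W, I_B, R2, F

# Toy example: Euclidean dissimilarities of 6 units forming 2 well-separated clusters.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (3, 2)), rng.normal(5, 1, (3, 2))])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(generalized_anova(D, np.array([0, 0, 0, 1, 1, 1])))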
Since in the general case the distribution of F is not the F-distribution, we use a Monte Carlo approach to test the statistical significance. The proposed approach will be demonstrated on real-life data.
References
Batagelj, V. (1988). Generalized Ward and Related Clustering Problems. Classification and Related Methods of Data Analysis, 67–74.
Batagelj, V., Korenjak-Černe, S., and Kejžar, N. (2015). Clustering of Modal Valued Data. Draft of a chapter in Brito, P. (ed.) Analysis of Distributional Data.
Joly, S., and Le Calvé, G. (1986). Etude des puissances d'une distance. Statistique et Analyse de Données, 11, 30–50.
Studer, M., Ritschard, G., Gabadinho, A., and Müller, N. S. (2011). Discrepancy Analysis of State Sequences. Sociological Methods and Research, Vol. 40, Num. 3, 471–510.

Factor Analysis of Interval Data
Paula Cheira1,3, Paula Brito2,3,*, A. Pedro Duarte Silva4
1. Instituto Politécnico de Viana do Castelo, Viana do Castelo, Portugal
2. Faculdade de Economia, Universidade do Porto, Porto, Portugal
3. LIAAD - INESC TEC, Universidade do Porto, Porto, Portugal
4. Faculdade de Economia e Gestão & CEGE, Universidade Católica Portuguesa, Porto, Portugal
* Contact author: mpbrito@fep.up.pt
Keywords: Factor analysis, Interval data, Mallows distance, Symbolic data analysis
When a large number of variables is measured on each statistical unit, the study of their dependence structure may be of interest. The orthogonal model of factor analysis assumes that there is a smaller set of uncorrelated variables, called factors, that explain the relations between the observed variables. With the new variables it is expected to get a better understanding of the data being analyzed; moreover, they may be used in future analyses (Johnson, 2002). In this work we present a factor analysis model for symbolic data, focusing on the particular case of interval-valued variables, i.e., where the statistical units are described by variables whose values are intervals of R (Billard, 2006; Bock, 2000). The method describes the correlation structure among the measured interval-valued variables in terms of a few underlying, but unobservable, uncorrelated interval-valued variables. Two cases are considered for the distribution assumed within each observed interval: the Uniform distribution and the Triangular distribution. In our proposal, factors are extracted by principal component analysis, performed on the interval variables' correlation matrix (Billard, 2006). To estimate the factor scores, two approaches will be considered, inspired by methods for real data: the Bartlett and the Anderson-Rubin methods (DiStefano, 2009). In both cases, the estimated values are obtained by solving an optimization problem that uses as the criterion to be minimized the weighted squared Mallows distance between quantile functions. In the first method the factor scores are highly correlated with their corresponding factor and weakly (or not at all) with the other factors. However, the estimated factor scores of different factors may still be correlated. In the second proposed method, the function to minimize is adapted to ensure that the factor scores are themselves not correlated with each other. The applicability of this method is illustrated using data on characteristics of cars of different makes and models.
References
Billard, L., Diday, E. (2006). Symbolic data analysis: Conceptual statistics and data mining. John Wiley and Sons, Ltd, Chichester.
Bock, H.-H. & Diday, E., eds. (2000).
Analysis of symbolic data. Exploratory methods for extracting statistical information from complex data. Springer-Verlag, Berlin-Heidelberg. DiStefano, C., Zhu, M., Mndrilă, D. (2009). Understanding and using factor scores: Considerations for the applied researcher. Practical Assessment, Research & Evaluation 14, 1–11. Johnson, R. A. & Wichern, D. W. (2002). Applied multivariate statistical analysis. Prentice-Hall, New Jersey. Linear Discriminant Analysis for Interval and Histogram Data ? Sónia Dias1,2, , Paula Amaral3 , Paula Brito1,4 1. LIAAD/INESC-TEC, Porto, Portugal 2. School of Technology and Management, Polytechnic Institute of Viana do Castelo, Portugal 3. CMA and Faculty of Science and Engineering, University Nova de Lisboa, Portugal 4. Faculty of Economics, University of Porto, Portugal ? Contact author: sdias@estg.ipvc.pt Keywords: linear discriminant analysis; quantile functions; Mallows distance; fractional quadratic problems During the last years, Symbolic Data Analysis developed concepts and methods that allow statistical studies with histogram-valued variables and interval-valued variables. Nonetheless, there are only a few studies about discriminant analysis under the symbolic framework and these only focus on interval-valued variables (Duarte Silva and Brito, 2006, forthcoming). Dias and Brito (2015) proposed the Distribution and Symmetric Distributions (DSD) linear regression model, which allows predicting distributions from other distributions, represented by quantile functions. From the DSD Model, we define a discriminant function for the classification of a set of individuals in two classes. For each individual, a linear combination obtained as in the DSD Model is considered, which allows defining a score of the individual in the form of a quantile function. Irpino and Verde (2006) proved that total inertia, defined with the Mallows distance and with respect to a barycentric histogram, may be decomposed into within and between classes inertia, according to the Huygens theorem. From this decomposition, and similarly to the classical linear discriminant method, it is possible to deduce that the coefficients of the discriminant function are obtained by maximizing the ratio of the between to the within classes inertia. To solve the optimization problem that allows obtaining these coefficients, it is necessary to solve a constrained fractional quadratic problem. The solver BARON is used to solve this difficult optimization problem. A solution is obtained but the optimality certificate is only possible using conic relaxation techniques (Amaral et al, 2014). For the classification of an individual in one of the two groups, the Mallows distance between the score of the individual and the score obtained for the barycentric histogram of each class is computed. The observation is then assigned to the closest class. The proposed linear discriminant method may be particularized to interval-valued variables, which constitute a special case of histogram-valued variables. Examples illustrate the behavior of the method. References Dias, S. and Brito, P. (2015). Linear Regression Model with Histogram-Valued Variables. Statistical Analysis and Data Mining: The ASA Data Science Journal 8 (2), 75–113. Irpino, A. and Verde, R. (2006). A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. In: Data Science and Classification, Proc. IFCS’2006, Batagelj et al (eds.), Ljubljana, Slovenia, 185–192. Duarte Silva, A.P. and Brito, P. (2006). 
Linear Discriminant Analysis for Interval Data. Computational Statistics 21 (2), 289-308.
Duarte Silva, A.P. and Brito, P. (forthcoming). Discriminant Analysis of Interval Data: An Assessment of Parametric and Distance-Based Approaches. Journal of Classification.
Amaral, P., Bomze, I. and Júdice, J. (2014). Copositivity and constrained fractional quadratic problems. Math. Program., 146 (1–2), 325–350.

SDA 2015 November 17 – 19 University of Orléans, France
Thursday, November 19, Afternoon
IIIA Computer Science Building, Herbrand Amphitheatre
Session 10: CLUSTERING 2 Chair: Guillaume CLEUZIOU
14:00 – 14:25 Aggregated Symbolic Data with Categorical Variables Nobuo SHIMIZU, Junji NAKANO, Inst. Stat. Math., Japan; Yoshikazu YAMAMOTO, Bunri University, Tokushima, Japan
14:25 – 14:50 Joining Similarity Measures Using Quasi-Arithmetic Means Etienne CUVELIER, ICHEC Management School, Bruxelles, Belgium; Marie-Aude AUFAURE, Ecole Centrale, Paris, France
14:50 – 15:15 Fuzzy Clustering Method for Interval-Valued Data Bruno PIMENTEL, Renata SOUZA, UFPE, Recife, Brazil; Roberta FAGUNDES, UFPE, Caruaru, Brazil
15:15 – 15:45 Coffee Break
15:45 Discussion

Dissimilarity decomposition for aggregated symbolic data with categorical variables
Nobuo Shimizu1,*, Junji Nakano1, Yoshikazu Yamamoto2
1. The Institute of Statistical Mathematics, Japan
2. Tokushima Bunri University, Japan
* Contact author: nobuo@ism.ac.jp
Keywords: Aggregated symbolic data, Clustering, Contingency table, Likelihood ratio test statistics
In the recent "Big Data" era, huge amounts of individual data can sometimes be divided into naturally defined groups. We consider that the groups are concepts which can be expressed by several statistics calculated using information about the marginal distributions and the joint distribution. We call them "aggregated symbolic data" (ASD). In this talk, we consider individual data consisting of several categorical variables, and use contingency tables for pairs of them to describe the characteristics of the categorical variables. For clustering such ASD, we define a dissimilarity measure based on likelihood ratio test statistics to test the similarity between two contingency tables generated by two ASD. We also propose to decompose the dissimilarity to investigate its details. We illustrate the usefulness of our dissimilarity and its decomposition by analyzing real example data.

Joining Similarity Measures Using Quasi-Arithmetic Means
Etienne Cuvelier1,*, Marie-Aude Aufaure2
1. ICHEC, Brussels Management School, Bruxelles, Belgium
2. Ecole Centrale Paris, Paris, France
* Contact author: etienne.cuvelier@ichec.be
Keywords: Similarity Measures, Dissimilarity Measures, Combining Measures, Quasi-Arithmetic Means, Archimedean Generator.
A lot of data analysis methods are based on similarity or dissimilarity measures but, most of the time, these measures are defined for one type of data (real multidimensional data, interval data, functional data, nodes in graphs, ...). This fact implies that all the techniques of knowledge extraction based on such measures can be performed only on the data type for which they are defined. But the description and the modelling of real situations require the joint use of several kinds of data. Symbolic Data Analysis also deals with this situation, describing concepts using real data, interval data, histogram data, set data and/or probability distributions. We propose a new technique for combining different measures into one single result; a small numerical sketch is given below.
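A minimal sketch of the quasi-arithmetic mean M_phi(x_1, ..., x_K) = phi^{-1}(sum_k w_k phi(x_k)) used to merge several normalized similarity values into a single one; the generator shown (phi(t) = -log t, which yields a geometric mean) is only an illustrative choice, not necessarily the generator used by the authors, and all names and values are hypothetical.

import numpy as np

def quasi_arithmetic_mean(values, phi, phi_inv, weights=None):
    # M_phi(x_1, ..., x_K) = phi^{-1}( sum_k w_k * phi(x_k) )
    values = np.asarray(values, dtype=float)
    if weights is None:
        weights = np.full(values.size, 1.0 / values.size)
    return phi_inv(np.sum(weights * phi(values)))

# Illustrative Archimedean-type generator: phi(t) = -log(t), phi^{-1}(s) = exp(-s).
phi = lambda t: -np.log(t)
phi_inv = lambda s: np.exp(-s)

# Similarities of the same pair of concepts computed on three data types
# (e.g. interval, histogram and graph descriptions); values are made up.
similarities = [0.80, 0.65, 0.90]
print(quasi_arithmetic_mean(similarities, phi, phi_inv))   # combined similarity, about 0.78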
The method is based on Quasi-Arithmetic Means using Archimedean Generators. Quasi-Arithmetic Means with this kind of generator have several advantages for computing a resulting measure from several measures (computed on different types of data describing the same concept or individual): they allow choosing whether to emphasize the similarity or the dissimilarity between objects, they have flexible parameters, and it is possible to mix similarities and dissimilarities to compute a resulting similarity (or dissimilarity). The resulting measure (similarity or dissimilarity) can be used in any existing algorithm based on such measures: clustering, supervised classification, etc. We will give some examples of the use of this method on attributed networks and on symbolic data.
References
Stanislawa Ostasiewicz and Walenty Ostasiewicz (2000). Means and their applications. Annals of Operations Research, 97, 337–355.
Arlei Silva, Wagner Meira Jr., and Mohammed J. Zaki (2012). Mining attribute-structure correlated patterns in large attributed graphs. PVLDB, 5(5), 466–477.
Yang Zhou, Hong Cheng, and Jeffrey Xu Yu (2009). Graph clustering based on structural/attribute similarities. VLDB'09, Lyon, France.
J. Kim and L. Billard (2013). Dissimilarity measures for histogram-valued observations. Communications in Statistics - Theory and Methods, 42, 283–303.

Fuzzy Clustering Method in Interval-Valued Scientific Production Data
Bruno Pimentel1, Renata Souza1, Roberta Fagundes2
1. Universidade Federal de Pernambuco (UFPE), Centro de Informática (CIn), Av. Jornalista Anibal Fernandes, s/n - Cidade Universitária 50.740-560, Recife - PE, Brazil
2. Universidade de Pernambuco, Campus Gov. Miguel Arraes de Alencar, Polo Comercial, BR 104, Km 62, Caruaru - PE, Brazil
* Contact author: bap@cin.ufpe.br
Keywords: Clustering, Fuzzy C-Means method, Symbolic Data Analysis, Weighted Multivariate Membership, Scientific Production Data.
In recent decades, many applications have aimed to extract useful information or knowledge from data sets (Han and Kamber, 2006). Clustering methods, for example, are used to group unlabeled data (Jain et al., 1999), aiming to extract information from the data. With the growing interest in automatically understanding, processing and summarizing data, several application domains such as pattern recognition, machine learning, data mining, computer vision and computational biology have used clustering algorithms (Jain et al., 1999). Taxonomically, clustering methods may be divided into two main approaches: hierarchical and partitional (Jain et al., 1999). Hierarchical methods yield a nested sequence of partitions of the input data. Partitional methods seek to obtain a single partition of the input data in a fixed number of clusters. These methods may be classified into two categories: hard and fuzzy clustering algorithms. The first category allocates each object to a single group; one of the most popular hard methods is K-Means (Jain, 2010). In the second category of partitional clustering, objects have membership degrees for all clusters; the most popular method is Fuzzy C-Means (FCM), which is more suitable for overlapping clusters (Pal & Sarkar, 2013). Many methods in the literature discuss clustering involving numeric data only (Bock & Diday, 2000). In classical cluster analysis, objects are often represented as quantitative or qualitative values, where each one represents a variable.
However, this representation may not be adequate to model the more complex information found in real problems (Diday & Noirhomme-Fraiture, 2008). Databases, for example, may be huge, and many clustering methods spend much time trying to extract any information. A solution for executing these methods more efficiently is to summarize the database using symbolic data (Billard & Diday, 2003). Symbolic Data Analysis (SDA) handles this type of data, which may be represented as intervals, histograms, distributions and so on, in order to take into account the variability and/or uncertainty innate to the data (Billard & Diday, 2003). The SDA framework extends standard statistics and data mining tools to symbolic data, such as descriptive statistics, multidimensional data analysis, dissimilarities and clustering (Diday & Noirhomme-Fraiture, 2008). Here, we use clustering and Symbolic Data Analysis on scientific production data. The database of scientific production used in this work contains 141,260 researchers, each described by 33 continuous numerical and 3 categorical variables. The continuous variables are averages of production values computed over three years (2006, 2007 and 2008) for each researcher. The categorical variables are: Institute, Area of knowledge and Subarea of knowledge. In order to obtain interval scientific production data, this original data set is summarized using the institute and subarea of knowledge categorical variables. Thus, a symbolic data set of size 5630 is created, and these data represent new concepts of scientific production (second level of observation). Each unit is a group of researchers representing a profile described by interval symbolic variables. The advantages of using interval scientific production data are therefore: 1. Summarizing the data: the data can be aggregated using one or more categorical variables, and a new data set smaller than the old one can be obtained without losing much information; 2. Ensuring the privacy of individuals: the generalization process ensures the confidentiality of the original data; 3. Using a higher-level category: the aggregated data set is able to represent profiles of scientific production taking into account the variability intrinsic to each profile. In order to obtain groups of knowledge area profiles, the clustering method with weighted multivariate membership degrees proposed by Pimentel & Souza (2013) is applied. This method handles interval data in order to consider the variability and/or uncertainty of the information; for a given object and cluster, there is a membership degree for each variable, weighted according to the importance of the variable. According to CAPES (2015), there are 7 levels of course grades, where levels 1 and 2 mean a poor performance below the minimum standard of quality required. Thus, in this work, the number of clusters is defined as 5. The result obtained by the clustering method applied to this data set shows that only 5.49% of the knowledge area profiles have very high scientific production. On the other hand, knowledge area profiles with very low scientific production represent 49.06% of the data set.
References
Billard, L., & Diday, E. (2003). From the statistics of data to the statistics of knowledge: symbolic data analysis. Journal of the American Statistical Association 98(462), 470–487.
Bock, H.-H., & Diday, E. (2000). Analysis of symbolic data: exploratory methods for extracting statistical information from complex data. Springer.
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) (2015). http://www.capes.gov.br.
Diday, E., & Noirhomme-Fraiture, M. (2008). Symbolic data analysis and the SODAS software. John Wiley and Sons, (Chapter 1).
Han, J., & Kamber, M. (2006). Data mining: concepts and techniques (2nd ed.). San Francisco: Morgan Kaufmann, (Chapter 1).
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys (CSUR) 31(3), 264–323.
Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31(8), 651–666.
Pal, N. R., & Sarkar, K. (2013). What and when can we gain from the kernel versions of c-means algorithm. IEEE Transactions on Fuzzy Systems.
Pimentel, B. A., & Souza, R. M. C. R. (2013). A Weighted Multivariate Fuzzy C-Means Method in Interval-Valued Scientific Production Data. Applied Soft Computing 13(4), 1592–1607.