Fuzzy Clustering Method in Interval-Valued Scientific Production Data

Bruno Pimentel1,*, Renata Souza1, Roberta Fagundes2

1. Universidade Federal de Pernambuco (UFPE), Centro de Informática (CIn), Av. Jornalista Anibal Fernandes, s/n - Cidade Universitária 50.740-560, Recife - PE, Brazil
2. Universidade de Pernambuco, Campus Gov. Miguel Arraes de Alencar, Polo Comercial, BR 104, Km 62 Caruaru - PE, Brazil
* Contact author: bap@cin.ufpe.br

Keywords: Clustering, Fuzzy C-Means method, Symbolic Data Analysis, Weighted Multivariate Membership, Scientific Production Data.

In recent decades, many applications have aimed to extract useful information or knowledge from data sets (Han and Kamber, 2006). Clustering methods, for example, group unlabeled data (Jain et al., 1999) in order to extract information from the data. With the growing interest in automatically understanding, processing and summarizing data, several application domains such as pattern recognition, machine learning, data mining, computer vision and computational biology have used clustering algorithms (Jain et al., 1999). Taxonomically, clustering methods may be divided into two main approaches: hierarchical and partitional (Jain et al., 1999). Hierarchical methods yield a nested sequence of partitions of the input data. Partitional methods seek a single partition of the input data into a fixed number of clusters, and may be classified into two categories: hard and fuzzy clustering algorithms. In the first category, each object is allocated to a single group; one of the most popular hard methods is K-Means (Jain, 2010). In the second category, each object has a membership degree for every cluster; the most popular method is Fuzzy C-Means (FCM), which is more suitable for overlapping clusters (Pal & Sarkar, 2013). Many methods in the literature discuss clustering involving numeric data only (Bock & Diday, 2000).
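To make the distinction between hard and fuzzy allocation concrete, the following minimal Python sketch computes the standard Fuzzy C-Means membership degrees of a single point for fixed cluster centers and contrasts them with a K-Means-style nearest-center assignment. The function names, the toy centers and the fuzzifier value m = 2 are illustrative assumptions; this is the classical FCM membership formula for numeric data, not the weighted multivariate method applied later in this abstract.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Classical FCM membership degrees of one point for each center.

    u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)), with Euclidean distances.
    The fuzzifier m > 1 is an assumed default, not a value from the abstract.
    """
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on one or more centers belongs fully to them.
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
            for d_i in dists]


def hard_assignment(point, centers):
    """Hard (K-Means-style) allocation: index of the nearest center."""
    dists = [math.dist(point, c) for c in centers]
    return dists.index(min(dists))


# Hypothetical toy data: two centers and one point between them.
centers = [(0.0, 0.0), (4.0, 0.0)]
point = (1.0, 0.0)

u = fcm_memberships(point, centers)    # one degree per cluster, summing to 1
label = hard_assignment(point, centers)  # a single cluster index
print(u, label)
```

The hard assignment returns only one cluster index, whereas the fuzzy memberships keep a nonzero degree for the farther cluster, which is the property that makes FCM better suited to overlapping groups.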
In classical cluster analysis, objects are often represented as quantitative or qualitative values, where each one represents a variable. However, this representation may not be adequate to model more complex information found in real problems (Diday & Noirhomme-Fraiture, 2008). Databases, for example, may be huge, and many clustering methods spend a long time trying to extract information from them. One way to run these methods more efficiently is to summarize the database using symbolic data (Billard & Diday, 2003). Symbolic Data Analysis (SDA) handles this type of data, which may be represented as intervals, histograms, distributions and so on, in order to take into account the variability and/or uncertainty inherent in the data (Billard & Diday, 2003). The SDA framework extends standard statistics and data mining tools to symbolic data, including descriptive statistics, multidimensional data analysis, dissimilarities and clustering (Diday & Noirhomme-Fraiture, 2008).

Here, we apply clustering and Symbolic Data Analysis to scientific production data. The database of scientific production used in this work contains 141260 researchers, each described by 33 continuous numerical variables and 3 categorical variables. The continuous variables are averages of production values computed over three years (2006, 2007 and 2008) for each researcher. The categorical variables are: Institute, Area of knowledge and Subarea of knowledge. In order to obtain interval scientific production data, the original data set is summarized using the Institute and Subarea of knowledge categorical variables. Thus, a symbolic data set of size 5630 is created, and these data represent new concepts of scientific production (second level of observation). Each unit is a group of researchers representing a profile described by interval symbolic variables. Therefore, the advantages of using interval scientific production data are: 1.
Summarize data: The data can be aggregated using one or more categorical variables, yielding a new data set that is smaller than the original without losing much information; 2. Ensure the privacy of individuals: The generalization process helps ensure the confidentiality of the original data; 3. Use higher-level categories: The aggregated data set can represent profiles of scientific production, taking into account the variability intrinsic to each profile.

In order to obtain groups of knowledge area profiles, the clustering method with weighted multivariate membership degrees proposed by Pimentel & Souza (2013) is applied. This method handles interval data in order to consider the variability and/or uncertainty of the information. For a given object and cluster, there is a membership degree for each variable, and each degree is weighted according to the importance of the variable. According to CAPES (2015), there are 7 levels of course grades, where levels 1 and 2 indicate poor performance, below the minimum standard of quality required. Thus, in this work, the number of clusters is set to 5, matching the five remaining levels. The result obtained by applying the clustering method to this data set shows that only 5.49% of knowledge area profiles have very high scientific production. On the other hand, knowledge area profiles with very low scientific production represent 49.06% of the data set.

References

Billard, L., & Diday, E. (2003). From the statistics of data to the statistics of knowledge: symbolic data analysis. Journal of the American Statistical Association 98(462), 470–487.
Bock, H. H., & Diday, E. (2000). Analysis of symbolic data: exploratory methods for extracting statistical information from complex data. Springer.
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) (2015). http://www.capes.gov.br.
Diday, E., & Noirhomme-Fraiture, M. (2008). Symbolic data analysis and the SODAS software. John Wiley and Sons, (Chapter 1).
Han, J., & Kamber, M. (2006). Data mining: concepts and techniques (2nd ed.). San Francisco: Morgan Kaufmann, (Chapter 1).
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys (CSUR) 31(3), 264–323.
Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31(8), 651–666.
Pal, N. R., & Sarkar, K. (2013). What and when can we gain from the kernel versions of c-means algorithm. IEEE Transactions on Fuzzy Systems.
Pimentel, B. A., & Souza, R. M. C. R. (2013). A Weighted Multivariate Fuzzy C-Means Method in Interval-Valued Scientific Production Data. Applied Soft Computing 13(4), 1592–1607.
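
The summarization step described in the abstract, in which researcher-level records are grouped by Institute and Subarea of knowledge to form interval-valued profiles, can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not say how the interval bounds are computed, so the sketch assumes the common choice of the minimum and maximum within each group, and the variable names (`articles`, `books`) and records are hypothetical.

```python
from collections import defaultdict


def aggregate_to_intervals(records, group_keys, value_keys):
    """Summarize individual records into interval-valued profiles.

    Each profile is keyed by the values of the categorical variables in
    group_keys and maps every continuous variable to a (lower, upper) pair.
    Using min and max as the bounds is an assumption of this sketch.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in group_keys)].append(rec)

    profiles = {}
    for key, members in groups.items():
        profiles[key] = {
            v: (min(r[v] for r in members), max(r[v] for r in members))
            for v in value_keys
        }
    return profiles


# Hypothetical researcher-level records (three-year production averages).
records = [
    {"institute": "A", "subarea": "Statistics", "articles": 1.0, "books": 0.0},
    {"institute": "A", "subarea": "Statistics", "articles": 3.5, "books": 0.3},
    {"institute": "B", "subarea": "Statistics", "articles": 2.0, "books": 1.0},
]

profiles = aggregate_to_intervals(
    records, ["institute", "subarea"], ["articles", "books"]
)
for key, intervals in profiles.items():
    print(key, intervals)
```

Three individual records collapse into two profiles here, which mirrors on a small scale how the 141260 researchers are reduced to 5630 symbolic units; a group with a single member yields a degenerate interval whose bounds coincide.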