
Fuzzy Clustering Method in Interval-Valued
Scientific Production Data
Bruno Pimentel1, Renata Souza1, Roberta Fagundes2
1. Universidade Federal de Pernambuco (UFPE), Centro de Informática (CIn), Av. Jornalista Anibal Fernandes, s/n - Cidade Universitária 50.740-560, Recife - PE, Brazil
2. Universidade de Pernambuco, Campus Gov. Miguel Arraes de Alencar, Polo Comercial, BR 104, Km 62, Caruaru - PE, Brazil
Contact author: bap@cin.ufpe.br
Keywords: Clustering, Fuzzy C-Means method, Symbolic Data Analysis, Weighted Multivariate Membership, Scientific Production Data.
In recent decades, many applications have aimed to extract useful information or knowledge from data
sets (Han and Kamber, 2006). Clustering methods, for example, group unlabeled
data (Jain et al., 1999) in order to extract information from them. With the growing interest in
automatically understanding, processing and summarizing data, several application domains such
as pattern recognition, machine learning, data mining, computer vision and computational biology
have used clustering algorithms (Jain et al., 1999).
Taxonomically, clustering methods may be divided into two main approaches: hierarchical and
partitional (Jain et al., 1999). Hierarchical methods yield a nested sequence of partitions of the
input data. Partitional methods seek a single partition of the input data into a fixed number of clusters. They fall into two categories: hard and fuzzy clustering
algorithms. Hard methods allocate each object to a single group; one of the most popular
is K-Means (Jain, 2010). In fuzzy methods, each object has a membership degree for every cluster; the most popular is Fuzzy C-Means
(FCM), which is more suitable for overlapping clusters (Pal & Sarkar, 2013).
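The difference between the two categories can be made concrete with a minimal Python sketch. It assumes squared Euclidean distance and the standard FCM membership formula with fuzzifier m; the function names and the toy point and centroids are hypothetical and do not come from this work.

```python
def squared_distance(x, c):
    """Squared Euclidean distance between a point and a centroid."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def hard_assignment(x, centroids):
    """Hard clustering (K-Means style): index of the single nearest centroid."""
    dists = [squared_distance(x, c) for c in centroids]
    return dists.index(min(dists))


def fuzzy_memberships(x, centroids, m=2.0):
    """Fuzzy clustering (standard FCM): one membership degree per cluster.

    With squared distances D_i, u_i = D_i^(-1/(m-1)) / sum_j D_j^(-1/(m-1)),
    so the degrees are non-negative and sum to 1.
    """
    dists = [squared_distance(x, c) for c in centroids]
    # If the point coincides with one or more centroids, share full
    # membership among them to avoid dividing by zero.
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(dists))]
    exponent = 1.0 / (m - 1.0)
    inverse = [(1.0 / d) ** exponent for d in dists]
    total = sum(inverse)
    return [v / total for v in inverse]


# Toy example (hypothetical values): one point between two centroids.
point = (1.0, 0.0)
centroids = [(0.0, 0.0), (3.0, 0.0)]
label = hard_assignment(point, centroids)
degrees = fuzzy_memberships(point, centroids)
print(label, degrees)
```

The hard assignment keeps only the nearest cluster, while the fuzzy memberships retain how close the point is to every cluster, which is what makes the fuzzy approach better suited to overlapping groups.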
Many methods in the literature discuss clustering involving numeric data only (Bock & Diday,
2000). In classical cluster analysis, objects are often described by quantitative or qualitative values, one per variable. However, this representation may not be adequate to
model more complex information found in real problems (Diday & Noirhomme-Fraiture, 2008).
Databases, for example, may be huge, and many clustering methods spend much time trying to extract information from them. One way to run these methods more efficiently is to summarize the
database using symbolic data (Billard & Diday, 2003). Symbolic Data Analysis (SDA) handles this
type of data, which may be represented as intervals, histograms, distributions and so on, in order to take
into account the variability and/or uncertainty inherent in the data (Billard & Diday, 2003). The SDA
framework extends standard statistics and data mining tools, such as descriptive
statistics, multidimensional data analysis, dissimilarities and clustering, to symbolic data (Diday & Noirhomme-Fraiture, 2008).
Here, we apply clustering and Symbolic Data Analysis to scientific production data. The database
of scientific production used in this work contains 141260 researchers, each described by 33
continuous numerical variables and 3 categorical variables. The continuous variables are averages of production values computed over three years (2006, 2007 and 2008) for each researcher. The categorical
variables are Institute, Area of knowledge and Subarea of knowledge. To obtain interval
scientific production data, the original data set is summarized using the Institute and Subarea of
knowledge categorical variables. This yields a symbolic data set of size 5630 whose units
represent new concepts of scientific production (second level of observation). Each unit is a group
of researchers representing a profile described by interval symbolic variables.
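The summarization step can be sketched as follows. The abstract does not state how the interval bounds are computed; this minimal Python sketch assumes the common SDA choice of taking the minimum and maximum of each continuous variable within a group. The function name, the field names `papers` and `books`, and the toy records are hypothetical.

```python
from collections import defaultdict


def aggregate_to_intervals(records, group_keys, value_keys):
    """Summarize classical records into interval-valued profiles.

    Records sharing the same values for `group_keys` form one profile;
    each continuous variable in `value_keys` becomes the interval
    (min, max) observed inside that group (an assumed generalization rule).
    """
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in group_keys)].append(rec)

    profiles = {}
    for key, members in groups.items():
        profiles[key] = {
            v: (min(r[v] for r in members), max(r[v] for r in members))
            for v in value_keys
        }
    return profiles


# Toy researchers (hypothetical values, two continuous variables only).
researchers = [
    {"institute": "A", "subarea": "Statistics", "papers": 1.0, "books": 0.0},
    {"institute": "A", "subarea": "Statistics", "papers": 3.0, "books": 0.5},
    {"institute": "A", "subarea": "Physics", "papers": 2.0, "books": 0.0},
    {"institute": "B", "subarea": "Statistics", "papers": 0.5, "books": 1.0},
]

profiles = aggregate_to_intervals(
    researchers,
    group_keys=("institute", "subarea"),
    value_keys=("papers", "books"),
)
for key, intervals in profiles.items():
    print(key, intervals)
```

Each resulting profile no longer exposes an individual researcher's values, only the range observed in the group, and the number of units drops from the number of researchers to the number of distinct (institute, subarea) pairs.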
Therefore, the advantages of using interval scientific production data are:
1. Summarize data: the data can be aggregated using one or more categorical variables, yielding
a new data set smaller than the original without losing much information;
2. Ensure the privacy of individuals: the generalization process makes it possible to ensure the confidentiality of the original data;
3. Use higher-level categories: the aggregated data set represents profiles of scientific
production, taking into account the variability intrinsic to each profile.
In order to obtain groups of knowledge area profiles, the clustering method with weighted
multivariate membership degrees proposed by Pimentel & Souza (2013) is applied. This method
handles interval data in order to consider the variability and/or uncertainty of the information.
For a given object and cluster, there is a membership degree for each variable, weighted
according to the importance of that variable. According to CAPES (2015), there are 7 levels of
course grades, where levels 1 and 2 indicate poor performance, below the minimum standard of
quality required. This leaves five levels (3 to 7) at or above the minimum standard; thus, in this work, the number of clusters is set to 5. The result obtained by
applying the clustering method to this data set shows that only 5.49% of knowledge area profiles have
very high scientific production. On the other hand, knowledge area profiles with very low scientific
production represent 49.06% of the data set.
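The abstract does not give the formulas of the method. Purely as an illustration of the idea of one membership degree per variable combined through variable weights, and not as the formulation of Pimentel & Souza (2013), the Python sketch below makes three assumptions: intervals are compared through the squared differences of their lower and upper bounds, each variable's memberships follow the standard FCM formula, and the weights are fixed, non-negative and sum to 1. All names and values are hypothetical.

```python
def interval_sq_distance(interval, prototype):
    """Squared distance between two intervals via their bounds (assumed)."""
    (low, high), (p_low, p_high) = interval, prototype
    return (low - p_low) ** 2 + (high - p_high) ** 2


def memberships_from_distances(dists, m=2.0):
    """Standard FCM memberships across clusters for one variable."""
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        # Exact match with one or more prototypes: share full membership.
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(dists))]
    inverse = [(1.0 / d) ** (1.0 / (m - 1.0)) for d in dists]
    total = sum(inverse)
    return [v / total for v in inverse]


def weighted_multivariate_memberships(obj, prototypes, weights, m=2.0):
    """Return per-variable memberships and their weighted aggregate.

    obj:        list of intervals, one per variable
    prototypes: one list of intervals per cluster
    weights:    assumed fixed variable weights summing to 1
    """
    n_vars, n_clusters = len(obj), len(prototypes)
    # per_variable[j][i]: membership of obj in cluster i on variable j
    per_variable = []
    for j in range(n_vars):
        dists = [interval_sq_distance(obj[j], proto[j]) for proto in prototypes]
        per_variable.append(memberships_from_distances(dists, m))
    aggregate = [
        sum(weights[j] * per_variable[j][i] for j in range(n_vars))
        for i in range(n_clusters)
    ]
    return per_variable, aggregate


# Toy profile with two interval variables and two cluster prototypes.
profile = [(1.0, 2.0), (0.0, 1.0)]
prototypes = [
    [(1.0, 2.0), (4.0, 5.0)],  # cluster 0
    [(3.0, 4.0), (0.0, 1.0)],  # cluster 1
]
weights = [0.75, 0.25]
per_variable, aggregate = weighted_multivariate_memberships(
    profile, prototypes, weights
)
print(per_variable, aggregate)
```

In this toy case the profile matches cluster 0 on the first variable and cluster 1 on the second, so the aggregate membership is driven by the weights: the more important variable pulls the profile toward cluster 0.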
References
Billard, L., & Diday, E. (2003). From the statistics of data to the statistics of knowledge: symbolic
data analysis. Journal of the American Statistical Association 98(462), 470–487.
Bock, H. H., & Diday, E. (2000). Analysis of symbolic data: exploratory methods for extracting
statistical information from complex data. Springer.
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) (2015).
http://www.capes.gov.br.
Diday, E., & Noirhomme-Fraiture, M. (2008). Symbolic data analysis and the SODAS software.
John Wiley and Sons, (Chapter 1).
Han, J., & Kamber, M. (2006). Data mining: concepts and techniques. (2nd ed.). San Francisco:
Morgan Kaufmann, (Chapter 1).
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM computing
surveys (CSUR) 31(3), 264–323.
Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31(8),
651–666.
Pal, N. R., & Sarkar, K. (2013). What and when can we gain from the kernel versions of c-means
algorithm. IEEE Transactions on Fuzzy Systems.
Pimentel, B. A., & Souza, R. M. C. R. (2013). A Weighted Multivariate Fuzzy C-Means Method
in Interval-Valued Scientific Production Data. Applied Soft Computing 13(4), 1592–1607.