An Approach to Semi-Supervised Co-Clustering With Side Information in Text Mining

International Journal of Engineering Trends and Technology (IJETT) – Volume 19 Number 6 – Jan 2015
Rucha Bhutada#1, D. A. Borikar#2
#1Research Scholar, #2Assistant Professor,
Department of Computer Science and Engineering,
Shri Ramdeobaba College of Engineering and Management, Nagpur, India
Abstract— Nowadays, in many text mining applications, a significant quantity of the information in a document is present in the form of text. This text is often accompanied by other kinds of information, such as side information or metadata. Such side information is easily obtainable from the text document and may be of distinct types: document provenance information, user access behaviour from web logs, the title of the document, the author name, or any other non-textual attributes attached to the document. These attributes may carry a large amount of information useful for clustering purposes. However, the relative importance of this side information is difficult to estimate. In such cases it becomes risky to cluster such information directly, because the side information contains noise that can either improve the quality of the mining process or add further noise to it. As a result, there is a need for a principled way to carry out the mining process so as to exploit the side information effectively. Co-clustering produces clusters of similar objects with regard to their values as well as clusters of similar features with regard to the objects associated with them. Accordingly, this paper presents a solution that uses side information in a justifiable way, employing techniques that handle different kinds of data.
Keywords— clustering, co-clustering, metadata, side-information,
text mining
I. INTRODUCTION
Data mining is widely used in everyday life, since useful knowledge can be extracted from data. Researchers have applied numerous techniques such as outlier detection, clustering, regression analysis, and classification. Clustering is used in several specialized applications. Clustering is the mechanism of grouping a set of physical or abstract objects into classes of similar objects. The resulting groups, called clusters, consist of objects that are similar to one another and dissimilar to objects in other groups. A cluster of data objects can be treated as a single unit, so clustering may also be viewed as a form of data compression. Representing data by fewer clusters necessarily sacrifices fine detail, but it achieves simplification: many data objects are represented by a few clusters, and hence the data is modelled by its clusters. Data modelling places clustering in a broad perspective rooted in mathematics and numerical analysis.
Due to the enormous amounts of data assembled in databases, cluster analysis has recently become a highly active topic in data mining research. Clustering is a demanding field of research, and its plausible applications pose their own particular requirements. Data mining involves many large databases, which imposes firm demands on the scalability and efficiency of clustering methods. Clustering is also called data segmentation in some applications, because clustering divides large data sets into groups by considering the degree of their resemblance. Cluster analysis has been recognized as a basic task in data mining.
The weakness of a plain hierarchical clustering stems from its inability to adjust the arrangement once a merge or split has been performed. Clustering and co-clustering can be applied to different kinds of data. The technique that simultaneously clusters rows and columns is called co-clustering. Clustering can also be applied to text data, where it is called text clustering. Text clustering has become a progressively more significant technique, as it is applicable to various areas such as fast information retrieval, automatic document extraction, sentiment analysis, and the formulation of granular taxonomies. Representative clustering algorithms are based on statistics and probability, such as Word Term Frequency [9], Word Meaning Frequency [10], and Frequent Itemset [11]. A large number of techniques have recently emerged that suit these requirements and are strongly adapted to substantial clustering and co-clustering problems in data mining. Plain clustering deals with either rows or columns. Han and Kamber give a classic introduction to contemporary data mining clustering techniques [8]. Hierarchical clustering organizes data in a hierarchy in which the input set of object points is recursively partitioned into smaller subgroups until the subgroups reduce to single object points.
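For illustration, the recursive bottom-up grouping just described can be sketched as a minimal agglomerative clustering with single linkage. The points, the Euclidean distance, and the stopping rule are illustrative assumptions, not details taken from this paper:

```python
# A minimal sketch of agglomerative (hierarchical) clustering with
# single linkage on 2-D points; illustrative only, a real system would
# use a library implementation.
from math import dist  # Euclidean distance (Python 3.8+)

def agglomerate(points, k):
    """Repeatedly merge the two closest clusters until k clusters remain."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between closest members
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into i
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(sorted(len(c) for c in agglomerate(points, 2)))  # → [2, 2]
```

Reversing the direction, a divisive variant would instead split one cluster at a time; the merge-based form above matches the recursive partitioning idea in the text.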
In text mining, text clustering addresses the ever-increasing quantity of unstructured data present in many fields, for instance on any network. Such data does not arrive in a clean, well-formed pattern. The text contained in a document carries different kinds of data, such as metadata. Metadata, also called side information, includes links in the document, user access behaviour from web logs, document provenance information, the title, author names, date of publication, and so on. Many web documents have metadata associated with them, such as ownership, location, or even temporal information, which may be informative for the mining process. In a number of network and user-sharing applications, documents may be associated with user tags, which may also be quite informative. However, it can be difficult to merge this kind of information into the clustering process, because it contains noise that can either improve the quality of representation of the mining process or add further noise to it. Therefore, an approach should
ascertain the coherence between the clustering characteristics of the side information and those of the text content. Co-clustering is a way to merge two or more types of data: it refers to simultaneously clustering more than one data type, which helps magnify the clustering effect on both kinds of data. Hierarchical co-clustering refers to concurrently clustering more than two types of data points or object points; in plain words, it attempts to achieve the objectives of both hierarchical clustering and co-clustering. I. S. Dhillon proposed two algorithms, an information-theoretic co-clustering algorithm and a bipartite graph partitioning co-clustering algorithm, which are used to exploit the co-clustering of documents and words [5]. Building on these ideas, the aim is a sound solution that can increase the quality of the mined text information. The goal is to show that the advantages of using side information extend beyond a pure clustering task and can provide competitive advantages for a wider variety of problem scenarios.
The remainder of this paper is organized as follows. Section II discusses the related work. Section III describes the research methodology, that is, the details of pre-processing, clustering, and co-clustering of text data. Section IV formalizes the proposed approach. Section V presents partial implementation results. Section VI draws conclusions from this work.
II. RELATED WORK
There is a strong association between clustering techniques and several other approaches. Clustering is one of the most typical topics in the fields of information retrieval and machine learning. It helps in feature compression and extraction, reducing the dimensionality of the feature vector by coupling related features into clusters, and it has long been applied in statistics. Similarly, text clustering is an important task in mining text data.
The concept of text clustering using side information is briefly described in [1]. The problem of text clustering has been studied widely by the database community [2], [3], [4]. Simultaneous clustering of documents and words is a problem that can be formulated as isoperimetric graph partitioning, bipartite graph partitioning, an optimization problem, and so on; under these formulations, simultaneous clustering, or co-clustering, can be performed. The quality of such simultaneous clustering of documents and words is evaluated using a confusion matrix, from which purity and entropy are also determined [5]. Clustering has been applied to text together with auxiliary attributes by using the COATES algorithm and classification methods, compared against many baseline techniques on real and scientific datasets; the reported results reinforce the distinctiveness of text clustering and classification with side information [6]. Another proposed approach is the constrained information-theoretic co-clustering algorithm. This algorithm performs better than existing ones, since it exploits the co-occurrences of documents and words and
adds some constraints to guide the clustering process. Further work on this system could derive better text features using natural language processing or readily available tools [7]. Detailed discussions of the scalability of text clustering are given in [12], [13], [14].
III. RESEARCH METHODOLOGY
Clustering of text data is one of the text mining tasks. Text mining deals with unstructured data. Text clustering involves the use of descriptors and descriptor extraction; descriptors are sets of words that describe the contents within a cluster. Text clustering, or document clustering, is generally considered a centralized process. Applications of document clustering can be categorized into two types, online and offline; online applications are normally constrained by efficiency problems when compared to offline applications. Text information lacks the imposed structure of a traditional database, although it implicitly carries a very wide range of information. Thus, it is important to transform unstructured data into a structured form, so that appropriate patterns and features can be retrieved from the text.
Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases. Incomplete data can occur for a number of reasons: attributes of interest may not always be available, relevant data may not have been recorded due to equipment malfunctions, or data inconsistent with other recorded data may have been deleted. Thus, pre-processing is again a crucial step in text mining. There are a number of data pre-processing techniques, such as data cleaning to remove noise, data integration to merge data from different sources, data transformation such as normalization, and data reduction to reduce data size. Here, pre-processing that removes punctuation, stop words, and white space, together with stemming, is required to obtain well-structured data. Even after pre-processing, however, the data remains of large dimensionality and needs dimension reduction, and administering further processing requires more effort, as many algorithms require a numerical representation of objects, since such representations facilitate processing and statistical analysis.
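The pre-processing steps named above (punctuation and stop-word removal, whitespace normalization, stemming) can be sketched as follows. The stop-word list and the suffix-stripping rules are illustrative assumptions; a real pipeline would use a full stop-word list and a proper stemmer such as Porter's:

```python
# A minimal pre-processing sketch: lower-casing, punctuation removal,
# stop-word filtering, whitespace normalization, and crude suffix stripping.
import re
import string

STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}  # illustrative

def crude_stem(word):
    # naive suffix stripping; a real system would use e.g. a Porter stemmer
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()
    # drop punctuation, then split on runs of whitespace
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = re.split(r"\s+", text.strip())
    return [crude_stem(t) for t in tokens if t and t not in STOP_WORDS]

print(preprocess("The clusters, being merged: merging is done."))
# → ['cluster', 'being', 'merg', 'merg', 'done']
```

Note how "merged" and "merging" collapse to the same token, which is exactly what makes the subsequent frequency counts more reliable.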
Data pre-processing techniques can improve the quality of the data, thereby helping to improve the accuracy and efficiency of the subsequent mining process. Identifying data anomalies, repairing them early, and reducing the data to be analysed can lead to large payoffs for decision making.
In data mining, generating the feature vector is a special form of dimensionality reduction. Transforming the input data into a set of features is called feature extraction, and the result is a feature vector: a multi-dimensional vector of numerical features that represents some object. Many algorithms in machine learning require a numerical representation of objects, since such representations facilitate processing and statistical analysis.
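As a concrete example of such a numerical representation, pre-processed documents can be turned into tf-idf feature vectors. The toy documents and the plain log idf form are assumptions for illustration; the paper itself only states that tf-idf weights are used:

```python
# A small sketch of turning tokenized documents into tf-idf feature vectors.
import math

def tf_idf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, list of vectors)."""
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    # idf: rarer terms get higher weight
    idf = {w: math.log(n / sum(1 for d in docs if w in d)) for w in vocab}
    vectors = []
    for d in docs:
        tf = {w: d.count(w) / len(d) for w in set(d)}  # term frequency
        vectors.append([tf.get(w, 0.0) * idf[w] for w in vocab])
    return vocab, vectors

docs = [["data", "mining", "text"], ["data", "cluster"], ["cluster", "text"]]
vocab, vecs = tf_idf_vectors(docs)
print(vocab)  # ['cluster', 'data', 'mining', 'text']
```

Each document becomes a fixed-length numerical vector over the shared vocabulary, which is the form the clustering algorithms below operate on.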
Fig.1 Proposed approach

Generation of the feature vector reduces the dimensionality of the pre-processed data. If the generated features are carefully chosen, the feature set is expected to capture the relevant information in the input data, so that the desired task can be performed using this reduced representation instead of the full-size input. The feature vector can be generated using the Gini index. After the feature vector is generated, clustering is performed on it; that is, features that do not satisfy the Gini index condition are not clustered. The input to the co-clustering step is then the clusters produced by this clustering step together with the side information, i.e., all the remaining data, including the features that do not satisfy the Gini index condition. The output of this process is improved co-clusters and clusters.

IV. PROPOSED WORK
The thrust of the approach is to induce clustering and co-clustering in which the text attributes and the side information provide related hints about the behaviour of the underlying clusters and co-clusters, while neglecting those aspects for which conflicting hints are provided. The proposed system also supports using other kinds of attributes in conjunction with text clustering. The auxiliary information is used in order to improve the quality of clustering; in many cases, such auxiliary information may be noisy and may not carry useful information for the clustering process. Conventionally, hierarchical co-clustering consists of two major steps:

a. Clustering: the first phase is an initialization phase, in which clustering is performed on the plain text attributes without side information, so that similar objects fall into the same clusters. Clustering is implemented using the K-means algorithm.

b. Co-clustering: the main phase of the process, executed after the clustering is done. It starts from the initial groups and iteratively reconstructs these clusters to form the co-clusters, using both the text content and the auxiliary information. Co-clustering is performed inside the clusters, so that the simultaneous clustering proceeds in a better manner.

The approach is designed to reward consistency between the text content and the side information, when such consistency is detected. In cases where the text content and the side information do not show coherent behaviour for the clustering process, the effect of those portions of the side information is kept minimal. The focus of the first phase is simply to construct an initialization that provides a good starting point for the co-clustering over the text content; the key technique for integrating content and auxiliary information lies in the second phase.
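The two phases can be sketched in code. This is an illustrative reading of the scheme, not the authors' exact algorithm: phase 1 runs plain k-means on text feature vectors, and phase 2 re-assigns documents under a distance that mixes text features with side-information features. The vectors, the 0.5 mixing weight alpha, and the naive initialization are all assumptions:

```python
# Toy sketch of the two-phase scheme: k-means initialization on text
# features, then refinement using auxiliary (side-information) features.
from math import dist

def kmeans(points, k, iters=20):
    """Plain k-means with naive initialization (adequate for a sketch)."""
    centers = list(points[:k])
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assign

def refine_with_side_info(text_vecs, side_vecs, centers, assign, alpha=0.5):
    """Phase 2: build side-info centroids from the phase-1 clusters, then
    re-assign each document under a mixed text/side-info distance."""
    k = len(centers)
    side_centers = []
    for c in range(k):
        members = [s for s, a in zip(side_vecs, assign) if a == c]
        side_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
    def combined(i, c):
        return (alpha * dist(text_vecs[i], centers[c])
                + (1 - alpha) * dist(side_vecs[i], side_centers[c]))
    return [min(range(k), key=lambda c: combined(i, c))
            for i in range(len(text_vecs))]

text_vecs = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]  # text features
side_vecs = [(1.0,), (1.0,), (9.0,), (9.0,)]                  # side information
centers, assign = kmeans(text_vecs, 2)
print(refine_with_side_info(text_vecs, side_vecs, centers, assign))  # → [0, 0, 1, 1]
```

Because the side information here agrees with the text clusters, the refinement confirms the phase-1 assignment; noisy side information would instead have its influence reduced through the alpha weight.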
V. IMPLEMENTATION
In this section, the partial implementation is discussed. Text data is needed for further processing. The following dataset is used for the text co-clustering.
A. CORA dataset:
The Cora dataset comprises a set of entities and their relations, which allows various machine learning approaches to be applied. The entities are authors and scientific papers. Cora assigns to each paper a set of categories selected from a taxonomy of classes. The dataset contains scientific publications in the computer science domain, and each research paper is classified into a topic hierarchy. At the leaf level there are 70 classes in total; at the second level there are 10 class labels: Artificial Intelligence; Data Structures, Algorithms and Theory; Networking; Hardware and Architecture; Programming; Human Computer Interaction; Information Retrieval; Operating Systems; Databases; and Encryption and Compression.
This dataset contains 8 files in .txt format: the author dataset, author dictionary, authorship, citations, papers dataset, tags list, topics, and words dictionary. The author dataset holds a unique author number and a unique author code from which each author is identified. The author dictionary maps author names to their unique author codes. The authorship file contains the relations between papers and authors, 27020 in total. There are 34648 citations between papers in the citations file. The papers dataset is an important file containing 11881 papers belonging to 80 different categories; in this file, a Gini index value is given that represents the tf-idf weight of each term, and a unique term identifier is assigned to the corresponding term. The tags list file contains all topic names in both short and full form. The topics file holds topic names and paper ids, so that each paper can be recognized easily. The words dictionary is related to the papers dataset, as it contains each term and its identifier. Because this dataset is in .txt format and thus cannot be used directly for the clustering process, the data must be converted into SQL tables, with primary and candidate keys on the author number, term identifier, and paper id. The tables contain many tuples of little use for the task, such as term identifiers with low probability; these can be reduced by generating a feature vector using the Gini index.
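The Gini-based feature selection used here can be sketched as follows. The toy term-class counts, the value of gamma, and the exact form of the Gini computation (the squared-fraction form common in the feature-selection literature) are assumptions for illustration:

```python
# Sketch of Gini-based feature selection: compute a Gini value per term
# from its class distribution, then drop terms whose Gini falls more than
# gamma standard deviations below the average over all terms.
from statistics import mean, stdev

def gini(class_counts):
    """Squared-fraction Gini: high when a term concentrates in one class."""
    total = sum(class_counts)
    return sum((c / total) ** 2 for c in class_counts)

def select_features(term_class_counts, gamma=1.0):
    ginis = {t: gini(c) for t, c in term_class_counts.items()}
    mu, sigma = mean(ginis.values()), stdev(ginis.values())
    # keep terms that are NOT gamma std deviations (or more) below average
    return {t for t, g in ginis.items() if g >= mu - gamma * sigma}

# toy counts of how often each term occurs in each of 3 classes
counts = {
    "database": (9, 1, 0),    # concentrated in one class: high Gini
    "network":  (0, 1, 9),
    "the":      (10, 10, 10), # spread evenly: low Gini, likely dropped
}
print(sorted(select_features(counts, gamma=0.5)))  # → ['database', 'network']
```

Terms spread evenly across classes carry little discriminative signal, so excluding them shrinks the feature vector while preserving the attributes most useful for clustering.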
It is essential to reduce the dimensionality by using the Gini index. The Gini index measures the inequality among values of a frequency distribution. Mathematically, the Gini index is defined as:
G(A) = p1(A)^2 + p2(A)^2 + … + pk(A)^2 ….Eq (1)
where A is a single attribute, k is the number of class labels, and pi(A) is the fraction of the occurrences of attribute A that belong to class i.
In this feature selection process, the Gini index will be computed for each attribute with respect to the class labels, and the following condition will be applied: if the Gini index of an attribute is γ standard deviations (or more) below the average Gini index of all attributes, that attribute is excluded globally and will never be used in the further clustering process.

Fig.2 System flow for the proposed approach

As noted, the papers dataset provides Gini index values representing the tf-idf weights of the terms. These data are grouped together, based on a similarity measure, into different clusters. The similarity between all pairs of clusters and co-clusters is computed to form a similarity matrix. The feature-selected text data and feature metadata can then be clustered and co-clustered into a hierarchical structure suitable for browsing, using the k-means and k-medoids algorithms, although this suffers from efficiency problems. Retrieving information from such groups becomes a considerably more manageable task.

VI. CONCLUSION
Co-clustering of text data in a semi-supervised setting can be parallelized in a large-scale distributed environment. We have shown how, on a scientific publication dataset, a feature vector is generated using the Gini index. These features carry important information for the clustering task, and the expectation is to obtain efficient clusters. In addition, we concentrate on the side information, which will be handled using the K-medoids method. Our approach is expected to yield an improved clustering process that makes the data easier to retrieve, since there will be clusters and co-clusters built from the side information. The focus is not restricted to the Cora dataset; similar datasets are covered as well.

REFERENCES
[1] C. C. Aggarwal, Y. Zhao, and P. S. Yu, "On text clustering with side information," in Proc. IEEE ICDE Conf., 2012, pp. 894–905.
[2] S. Guha, R. Rastogi, and K. Shim, "CURE: An efficient clustering algorithm for large databases," in Proc. ACM SIGMOD Conf., New York, NY, USA, 1998, pp. 73–84.
[3] R. Ng and J. Han, "Efficient and effective clustering methods for spatial data mining," in Proc. VLDB Conf., San Francisco, CA, USA, 1994, pp. 144–155.
[4] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An efficient data clustering method for very large databases," in Proc. ACM SIGMOD Conf., New York, NY, USA, 1996, pp. 103–114.
[5] I. S. Dhillon, "Co-clustering documents and words using bipartite spectral graph partitioning," in Proc. 7th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2001, pp. 269–274.
[6] C. C. Aggarwal, Y. Zhao, and P. S. Yu, "On the use of side information for mining text data," IEEE Trans. Knowledge and Data Engineering, vol. 26, no. 6, June 2014.
[7] Y. Song, S. Pan, S. Liu, F. Wei, M. X. Zhou, and W. Qian, "Constrained text coclustering with supervised and unsupervised constraints," IEEE Trans. Knowledge and Data Engineering, vol. 25, no. 6, June 2013.
[8] J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Francisco, CA, USA: Morgan Kaufmann.
[9] F. Beil, M. Ester, and X. Xu, "Frequent term-based text clustering," in Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2002, pp. 436–442.
[10] Y. Li, S. M. Chung, and J. D. Holt, "Text document clustering based on frequent word meaning sequences," Data and Knowledge Engineering, vol. 64, no. 1, pp. 381–404, 2008.
[11] B. C. M. Fung, K. Wang, and M. Ester, "Hierarchical document clustering using frequent itemsets," in Proc. SIAM Int. Conf. Data Mining, 2003.
[12] C. C. Aggarwal and P. S. Yu, "A framework for clustering massive text and categorical data streams," in Proc. SIAM Conf. Data Mining, 2006, pp. 477–481.
[13] Q. He, K. Chang, E.-P. Lim, and J. Zhang, "Bursty feature representation for clustering text streams," in Proc. SDM Conf., 2007, pp. 491–496.
[14] S. Zhong, "Efficient streaming text clustering," Neural Networks, vol. 18, no. 5–6, pp. 790–798, 2005.