International Journal of Engineering Trends and Technology (IJETT) – Volume 19 Number 6 – Jan 2015, ISSN: 2231-5381

An Approach to Semi-Supervised Co-Clustering With Side Information in Text Mining

Rucha Bhutada#1, D. A. Borikar#2
#1 Research Scholar, #2 Assistant Professor, Department of Computer Science and Engineering, Shri Ramdeobaba College of Engineering and Management, Nagpur, India

Abstract— In many text mining applications, a substantial quantity of the information in a document is present as text, and alongside this text there is often additional information, or metadata, known as side information. Side information is easily obtained from text documents and may be of several types: document provenance, user access behaviour from web logs, the title of the document, author names, or other non-textual attributes embedded in the document. Such attributes can carry a large amount of information useful for clustering. However, the relative importance of this side information is difficult to estimate, which makes it risky to cluster such data directly: side information contains noise, so it may either improve the quality of the mining process or add noise to it. A principled way of carrying out the mining process is therefore needed, one that exploits the value of side information. Co-clustering produces clusters of similar objects with respect to their values, together with clusters of similar features with respect to the objects associated with them. Accordingly, this paper presents an approach that uses side information in a principled way, employing techniques that handle different kinds of data.

Keywords— clustering, co-clustering, metadata, side-information, text mining

I. INTRODUCTION

Data mining is widely used in everyday applications, since useful knowledge can be extracted from almost any collection of data.
Researchers have applied numerous techniques such as outlier detection, clustering, regression analysis, and classification. Clustering is used in many specialized applications. It is the process of grouping a set of physical or abstract objects into classes of similar objects. The resulting groups, called clusters, consist of objects that are similar to one another and dissimilar to objects in other groups. A cluster of data objects can be treated as a single unit, so clustering may also be viewed as a form of data compression: representing the data by fewer clusters loses some fine detail but simplifies analysis, since each data object can be characterized by the cluster it belongs to. Data modelling places clustering in a broad context within mathematics and numerical analysis. Owing to the enormous amounts of data accumulated in databases, cluster analysis has recently become a highly active topic in data mining research, and its practical applications pose their own particular requirements: data mining operates on very large databases, which imposes firm scalability demands on clustering methods. Clustering is also called data segmentation in some applications, because it partitions large data sets into groups according to their similarity. Cluster analysis has long been recognized as a basic task in data mining, though the terminology in the literature varies, reflecting the distinct notions of similarity applied to the objects. A weakness of plain hierarchical clustering is its inability to adjust the arrangement once a merge or split has been performed. Clustering and co-clustering can be applied to different kinds of data. A technique that clusters rows and columns simultaneously is called co-clustering. Clustering can also be applied to text data, in which case it is called text clustering.
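To make the clustering process described above concrete, the following is a minimal sketch of K-means, the standard algorithm that the proposed approach later uses for its initialization phase. The data, function names, and parameters here are illustrative, not taken from the paper.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain K-means: assign each point to its nearest center, then
    recompute each center as the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # initial centers drawn from the data
    assign = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest center by Euclidean distance
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: math.dist(p, centers[c]))
        # update step: move each center to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign, centers

# two obvious groups of 2-d feature vectors
pts = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9]]
labels, centers = kmeans(pts, 2)
```

For well-separated data like this, the two tight groups end up in different clusters regardless of which points are chosen as the initial centers.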
Text clustering has become an increasingly significant technique, as it is applicable to areas such as fast information retrieval, automatic document extraction, sentiment analysis, and the formulation of granular taxonomies. Influential clustering algorithms build on statistics and probability, for example Word Term Frequency [9], Word Meaning Frequency [10], and Frequent Itemset methods [11]. A large number of techniques have emerged recently that suit these requirements and have been strongly adapted to substantial clustering and co-clustering problems in data mining. Plain clustering deals with either rows or columns. Han and Kamber give a classic introduction to contemporary data mining clustering techniques [8]. Hierarchical clustering organizes the data in a hierarchy in which the input set of object points is recursively partitioned into smaller subgroups until each subgroup reduces to a single object point. In text mining, text clustering must cope with a growing quantity of unstructured data, which arises in many settings, for instance on networks, and rarely comes in a clean, well-formed pattern. The text in a document is accompanied by other kinds of data, namely metadata. Metadata, also called side information, includes links in the document, user access behaviour from web logs, document provenance information, titles, author names, dates of publication, and so on. Many web documents carry metadata such as ownership, location, or even temporal information that may be informative for the mining process. In a number of networking and user-sharing applications, documents may be associated with user tags, which can also be quite informative. However, it can be difficult to incorporate this kind of information into the clustering process, because it is noisy: it can either improve the quality of the mining representation or add noise to the process.
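The frequency-based representations cited above start from a bag-of-words model of each document. A minimal sketch of building term-frequency vectors over a shared vocabulary might look as follows (the function name and toy documents are illustrative):

```python
from collections import Counter

def term_frequency_vectors(docs):
    """Build bag-of-words term-frequency vectors over a shared vocabulary.

    Each document becomes a vector whose i-th entry counts how often the
    i-th vocabulary word occurs in it."""
    vocab = sorted({w for d in docs for w in d.split()})
    vectors = []
    for d in docs:
        counts = Counter(d.split())
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors

docs = ["data mining finds patterns", "text mining clusters text"]
vocab, vecs = term_frequency_vectors(docs)
```

Vectors of this form are what the clustering and co-clustering steps later operate on; in practice tf-idf weighting would typically replace raw counts.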
Therefore, an approach should ascertain the coherence between the clustering characteristics of the side information and those of the text content. Co-clustering is a way to merge two or more types of data: it clusters more than one data type simultaneously, which magnifies the clustering effect across both kinds of data. Hierarchical co-clustering refers to the simultaneous clustering of more than two types of data points or object points; in plain words, it attempts to achieve the objectives of both hierarchical clustering and co-clustering. I. S. Dhillon proposed two algorithms, an information-theoretic co-clustering algorithm and a bipartite graph partitioning co-clustering algorithm, which are used to co-cluster documents and words [5]. Building on these ideas, the aim here is to develop a sound solution that can improve the quality of text mining. The goal is to show that the advantages of using side information extend beyond a pure clustering task and can provide competitive advantages in a wider variety of problem scenarios. The remainder of this paper is organized as follows. Section 2 discusses the related work. Section 3 describes the research methodology, that is, the details of pre-processing, clustering, and co-clustering of text data. Section 4 formalizes the proposed approach. Section 5 presents partial implementation results. Section 6 draws conclusions from this work.

II. RELATED WORK

There is a strong association between clustering techniques and several other approaches. Clustering is one of the most typical topics in information retrieval and machine learning. It helps with feature compression and extraction, reducing the dimensionality of the feature vector by coupling related features into clusters, and it has always been applied in statistics.
Similarly, text clustering is an important task in mining text data. The concept of text clustering using side information is briefly described in [1]. The problem of text clustering has been studied widely in the database community [2], [3], [4]. Simultaneous clustering of documents and words is a problem that can be represented as isoperimetric graph partitioning, bipartite graph partitioning, an optimization problem, and so on; under these formulations, simultaneous clustering, or co-clustering, can be performed. The quality of such simultaneous document-word clustering can be assessed with a confusion matrix, from which purity and entropy are also determined [5]. Clustering has been applied to text together with auxiliary attributes using the COATES algorithm and related classification methods, evaluated against many baseline techniques on real and scientific datasets; the results show that side information reinforces the quality of both text clustering and classification [6]. A constrained information-theoretic co-clustering algorithm has also been proposed. It performs better than existing approaches because it exploits the co-occurrences of documents and words and adds constraints to guide the clustering process; further work on this system could derive better text features using natural language processing or readily available tools [7]. Detailed treatments of the scalability of text clustering are given in [12], [13], [14].

III. RESEARCH METHODOLOGY

Clustering of text data is one of the core text mining tasks. Text mining deals with unstructured data. Text clustering involves the use of descriptors and descriptor extraction; descriptors are sets of words that describe the contents of a cluster. Text clustering, or document clustering, is generally considered a centralized process.
Applications of document clustering fall into two types, online and offline; online applications are normally constrained by efficiency requirements that offline applications do not face. Text lacks the imposed structure of a traditional database, even though it plainly carries a very wide range of information. It is therefore important to convert unstructured data into a structured form, so that appropriate patterns and features can be retrieved from the text. Incomplete, noisy, and inconsistent data are commonplace in large real-world databases. Incomplete data can occur for a number of reasons: attributes of interest may not always be available, relevant data may not have been recorded due to equipment malfunctions, and data inconsistent with other recorded data may have been deleted. Pre-processing is thus a crucial step in text mining. Common data pre-processing techniques include data cleaning to remove noise, data integration to merge data from different sources, data transformation such as normalization, and data reduction to reduce data size. For text, pre-processing that removes punctuation, stop words, and extra white space, together with stemming, is required to obtain well-structured data. Even after pre-processing, the data still needs dimensionality reduction, because it remains large and further processing demands effort: many algorithms require a numerical representation of objects, since such representations facilitate processing and statistical analysis. Data pre-processing techniques can improve the quality of the data, thereby helping to improve the accuracy and efficiency of the subsequent mining process. Identifying data anomalies, correcting them early, and reducing the data to be analysed can lead to large payoffs for decision making. In data mining, generating the feature vector is a special form of dimensionality reduction.
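The pre-processing steps named above (punctuation removal, stop word removal, white-space normalization, stemming) can be sketched in a few lines. This is a toy pipeline: the stop word list is a tiny illustrative subset, and the suffix-stripping stemmer is deliberately naive, where a real system would use something like Porter stemming.

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "of", "and", "in", "to"}  # illustrative subset

def naive_stem(word):
    """Very rough suffix stripping; crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # strip punctuation
    tokens = text.split()                  # splitting also collapses white space
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]

print(preprocess("The clusters, in mining!"))   # → ['cluster', 'min']
```

The over-aggressive stemming of "mining" to "min" shows why suffix stripping alone is too crude for production use, but the stage ordering (normalize, tokenize, filter, stem) is the standard one.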
The process of transforming the input data into a set of features is called feature extraction, and its result is a feature vector: a multi-dimensional vector of numerical features that represents some object. Many algorithms in machine learning require a numerical representation of objects, since such representations facilitate processing and statistical analysis.

Fig. 1 Proposed approach

Feature vector generation reduces the dimensionality of the pre-processed data. If the generated features are chosen carefully, the feature set is expected to extract the relevant information from the input data, so that the desired task can be performed using this reduced representation instead of the full-size input. Here the feature vector is generated using the Gini index. Once the feature vector is generated, clustering is performed on it; features that do not satisfy the Gini index condition are not clustered. The input to the co-clustering step then consists of the clusters produced by this clustering step together with the side information, that is, all the remaining data, including the features that did not satisfy the Gini index condition. The output of this process is improved co-clusters and clusters.

IV. PROPOSED WORK

The thrust of the approach is to perform clustering and co-clustering in which the text attributes and the side information provide related hints about the behaviour of the underlying clusters and co-clusters, while neglecting those aspects on which they give conflicting indications. The proposed system also supports using other kinds of attributes in conjunction with text clustering. The auxiliary information is used in order to improve the quality of clustering; in many cases, such auxiliary information may be noisy and may carry no useful information for the clustering process. Conventionally, hierarchical co-clustering consists of two major steps:

a. Clustering: the first phase is an initialization phase in which clustering is performed on the plain text attributes without side information, so that similar objects fall into the same clusters. Clustering is implemented using the K-means algorithm.

b. Co-clustering: the main phase of the process, executed after the clustering is done. This phase starts from the initial groups and iteratively reconstructs these clusters to form co-clusters, using both the text content and the auxiliary information. Co-clustering is performed inside the clusters so that the simultaneous clustering proceeds in a better-informed manner.

The approach is designed to reward consistency between the text content and the side information when such consistency is detected. In cases in which the text content and side information do not show coherent behaviour for the clustering process, the effect of those portions of the side information is kept minimal. The focus of the first phase is simply to construct an initialization that provides a good starting point for the co-clustering process based on the text content; the key technique for integrating content and auxiliary information lies in the second phase.

V. IMPLEMENTATION

In this section, the partial implementation is discussed. Text data is needed for further processing; the following dataset is used for text co-clustering.

A. CORA dataset: The Cora dataset comprises a set of entities and their relations, which supports a variety of machine learning approaches. The entities are authors and scientific papers.
Cora assigns each paper a set of categories selected from a taxonomy of classes. The Cora dataset contains scientific publications in the computer science domain, and each research paper is classified into a topic hierarchy. At the leaf level there are 70 classes in total, and at the second level there are 10 class labels: Artificial Intelligence, Data Structures Algorithms and Theory, Networking, Hardware and Architecture, Programming, Human Computer Interaction, Information Retrieval, Operating Systems, Databases, and Encryption and Compression. The dataset contains 8 files in .txt format: the author dataset, author dictionary, authorship, citations, papers dataset, tags list, topics, and words dictionary. The author dataset holds a unique author number and a unique author code from which each author can be identified. The author dictionary holds author names and their respective unique author codes. The authorship file contains the relations between papers and authors, 27020 in total. There are 34648 citations between papers in the citations file. The papers dataset is an important file holding 11881 papers belonging to 80 different categories; in this file, a Gini index is given that represents the tf-idf weight of each term, and a unique term identifier is assigned to the corresponding term. The tags list file contains all topic names in both short and full form. The topics file holds topic names and paper ids so that each paper can be identified easily. The words dictionary is related to the papers dataset, as it maps each term to its identifier. Because this dataset is in .txt format and cannot be used directly in the clustering process, the data must be converted into SQL tables, with primary and candidate keys defined on the author number, term identifier, and paper id.
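The Gini-based feature selection used in the next step can be sketched as follows. The sketch assumes the common form of the Gini index for text features, the sum of squared per-class fractions of a term's occurrences (higher means more class-discriminative), and applies the paper's rule of dropping terms whose score falls γ standard deviations below the average; the toy term counts and function names are illustrative.

```python
import statistics

def gini(term_class_counts):
    """Gini index of one term: sum of squared fractions of its occurrences
    per class. Higher values mean the term is concentrated in few classes
    (more discriminative); a uniform term scores low."""
    total = sum(term_class_counts)
    return sum((c / total) ** 2 for c in term_class_counts)

def select_features(counts_by_term, gamma=1.0):
    """Keep terms whose Gini index is not gamma standard deviations
    below the mean Gini index over all terms."""
    scores = {t: gini(c) for t, c in counts_by_term.items()}
    mean = statistics.mean(scores.values())
    sd = statistics.pstdev(scores.values())
    return {t for t, g in scores.items() if g >= mean - gamma * sd}

# per-class occurrence counts for three toy terms over three classes
counts = {
    "algorithm": [30, 1, 1],    # concentrated in one class -> high Gini
    "database":  [2, 40, 2],    # also concentrated -> high Gini
    "the":       [20, 20, 20],  # uniform across classes -> low Gini
}
kept = select_features(counts, gamma=0.5)
```

On this toy data the uniform term "the" is excluded while the class-discriminative terms survive, which is exactly the filtering effect the methodology relies on.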
The resulting table has too many tuples of little use, for instance term identifiers with very low probability, so it is reduced by generating a feature vector using the Gini index. The Gini index measures the inequality among the values of a frequency distribution. Mathematically, it is defined as:

Gini(A) = Σ_{i=1..k} p_i(A)²    ….Eq. (1)

where A is a single attribute from the full set of attributes, k is the number of class labels, and p_i(A) is the fraction of the occurrences of A that belong to class i. In this feature selection process, the Gini index is computed for each attribute with respect to the class labels, and the following condition is applied: if the Gini index of an attribute is γ standard deviations (or more) below the average Gini index of all attributes, that attribute is excluded globally and is never used in the subsequent clustering process.

Fig. 2 System flow for proposed approach

Accordingly, in the papers dataset we are given the Gini index representing the tf-idf weights. These data are grouped, based on a similarity measure, into different clusters, and the similarity between all pairs of clusters and co-clusters is computed to form a similarity matrix. The selected text features and feature metadata can then be clustered and co-clustered, using the k-means and k-medoids algorithms, into a hierarchical structure suitable for browsing, although such a structure raises efficiency concerns. Retrieving information from these groups becomes a considerably more manageable task.

VI. CONCLUSION

Co-clustering of text data in a semi-supervised setting can be parallelized in a large-scale distributed environment. We have shown that, on a scientific publication dataset, a feature vector can be generated using the Gini index. These features carry important information for clustering, and the expectation is to obtain efficient clusters. In addition, we concentrate on the side information, which will be handled using the K-medoids method. We expect our approach to yield an improved clustering process, making the data easier to retrieve, since there will be both clusters and co-clusters derived from side information. The focus is not only on the Cora dataset but on similar datasets as well.

REFERENCES

[2] S. Guha, R. Rastogi, and K. Shim, "CURE: An efficient clustering algorithm for large databases," in Proc. ACM SIGMOD Conf., New York, NY, USA, 1998, pp. 73–84.
[3] R. Ng and J. Han, "Efficient and effective clustering methods for spatial data mining," in Proc. VLDB Conf., San Francisco, CA, USA, 1994, pp. 144–155.
[4] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An efficient data clustering method for very large databases," in Proc. ACM SIGMOD Conf., New York, NY, USA, 1996, pp. 103–114.
[5] I. S. Dhillon, "Co-clustering documents and words using bipartite spectral graph partitioning," in Proc. 7th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2001, pp. 269–274.
[6] C. C. Aggarwal, Y. Zhao, and P. S. Yu, "On the use of side information for mining text data," IEEE Trans. Knowledge and Data Engineering, vol. 26, no. 6, June 2014.
[7] Y. Song, S. Pan, S. Liu, F. Wei, M. X. Zhou, and W. Qian, "Constrained text coclustering with supervised and unsupervised constraints," IEEE Trans. Knowledge and Data Engineering, vol. 25, no. 6, June 2013.
[8] J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Francisco, CA, USA: Morgan Kaufmann.
[9] F. Beil, M. Ester, and X. Xu, "Frequent term-based text clustering," in Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2002, pp. 436–442.
[10] Y. Li, S. M. Chung, and J. D. Holt, "Text document clustering based on frequent word meaning sequences," Data and Knowledge Engineering, vol. 64, no. 1, pp. 381–404, 2008.
[11] B. C. M. Fung, K. Wang, and M. Ester, "Hierarchical document clustering using frequent itemsets," in Proc. SIAM Int. Conf. Data Mining, 2003.
[12] C. C. Aggarwal and P. S. Yu, "A framework for clustering massive text and categorical data streams," in Proc. SIAM Conf. Data Mining, 2006, pp. 477–481.
[13] Q. He, K. Chang, E.-P. Lim, and J. Zhang, "Bursty feature representation for clustering text streams," in Proc. SDM Conf., 2007, pp. 491–496.
[14] S. Zhong, "Efficient streaming text clustering," Neural Netw., vol. 18, no. 5–6, pp. 790–798, 2005.