International Journal of Engineering Trends and Technology (IJETT), Volume 9, Number 4, Mar 2014
ISSN: 2231-5381, http://www.ijettjournal.org, Pages 159-161

A Bi-Partite Graph Partition and Link Based Approach for Solving Categorical Data Clustering

G. Rama Lakshmi#1, E. Thenmozhi#2
#1 Postgraduate student, Sathyabama University, Chennai, India.
#2 Faculty, Sathyabama University, Chennai, India.

Abstract: This paper applies a link-based approach to the problem of categorical data clustering through cluster ensembles. The approach generates a set of clusterings from the dataset and combines them into a final clustering; the combination is intended to improve on the quality of the individual clusterings. The ensemble-information matrix presents only cluster-data point relations, with many entries left unknown. The link-based approach discovers these unknown entries through the similarity between clusters. The resulting similarities are applied to a weighted bipartite graph, which is partitioned to obtain the final clustering. We propose the use of Weighted Triple Quality (WTQ), which provides an efficient approximation of the similarity between clusters. The MinHash algorithm is used to eliminate duplicate clusters from the cluster ensemble. Together, these steps are intended to yield an efficient result.

Keywords: clustering, categorical data, cluster ensemble

Introduction

Data clustering is used to define the structure of data. Clustering groups the elements of a data set according to their similarity, such that elements in the same cluster are similar while elements from different clusters are dissimilar. It is used in pattern recognition, information retrieval, data mining, and machine learning. Clustering algorithms such as k-means and PAM are designed for numerical data. Examples of categorical attributes are color = {red, green, blue} and gender = {male, female}.

Although a large number of algorithms have been introduced for clustering categorical data, no single clustering algorithm performs best for all data sets or can discover all types of cluster shapes and structures present in data. Each algorithm has its own strengths and weaknesses. For a particular data set, different algorithms, or even the same algorithm with different parameters, usually provide different solutions. It is therefore difficult for users to decide which algorithm is the proper choice for a given set of data. Recently, cluster ensembles have emerged as an effective solution that is able to overcome these limitations and improve the robustness as well as the quality of clustering results. The main objective of cluster ensembles is to combine different clustering decisions in such a way as to achieve accuracy superior to that of any individual clustering.

Examples of well-known ensemble methods are:
1. The feature-based approach, which treats the cluster ensemble problem as one of clustering categorical data, i.e., cluster labels.
2. The direct approach, which finds the final partition through the base clustering results.
3. Graph-based algorithms, which use a graph partitioning methodology.
4. The pairwise-similarity approach, which uses the co-occurrence relations between data points.

I. Related Work

In [1], a large dataset is partitioned into homogeneous clusters. The k-means algorithm is an efficient method for dividing a dataset into clusters, but it works only on numerical values. Real datasets contain categorical values, so the k-means algorithm is extended to the categorical domain; the extension is called k-modes. The k-modes algorithm operates as follows:
1. Select k initial modes, one for each cluster.
2. Allocate each object to the cluster whose mode is nearest to the object.
3. After all objects have been allocated, retest the dissimilarity of each object against the current modes. If an object is found to be nearer to the mode of another cluster than to that of its current one, reallocate the object and update the modes of both clusters.
4. Repeat step 3 until no object changes cluster.
An advantage of this work is that it describes the characteristics of the clusters, and the k-modes algorithm is reported to be faster than the k-means algorithm. Future work in that paper is to develop a parallel k-modes algorithm for clustering datasets with millions of objects.

In [2], a clustering algorithm is used to partition a large database containing Boolean and categorical attributes. The paper proposes a novel link-based approach to measuring the similarity between pairs of data points: the RObust Clustering using linKs (ROCK) algorithm, a hierarchical method that supports non-metric similarity.

Neighbors: Consider two data points Pi and Pj, and let sim(Pi, Pj) be a similarity function that captures the closeness between them. The function sim may be a distance metric or a non-metric measure. Its values are assumed to lie between 0 and 1, with larger values indicating greater similarity: sim is 1 for identical data points and 0 for totally dissimilar ones. Given a user-defined threshold θ between 0 and 1, Pi and Pj are neighbors if sim(Pi, Pj) >= θ.

Links: The link-based approach is a global approach. It uses information about the neighbors of the data points to establish the relationship between an individual pair of points. Link(Pi, Pj) is the number of neighbors that Pi and Pj have in common.

Procedure:
1. Compute the neighbor list for every point.
2. Find the similarity between data points.
3. Create the links between the data points of each cluster.
The combination of random sampling and labelling enables ROCK to perform quite well on large databases.

In [3], the goal is to find the correspondence between the clusters of different systems. The method proposed in that paper uses Singular Value Decomposition (SVD). The SVD is applied to a matrix R of size D x SC, where D is the number of observations, S is the number of systems, and C is the number of clusters. Each row contains the posteriors from each system for a given observation. A clustering system is evaluated in two steps: first, a one-to-one mapping between clusters and classes is created and scored as a classification; second, the normalized mutual information (NMI) between clusters and classes is computed. NMI measures how much information the cluster assignment shares with the class labels.

In [4], the cluster ensemble problem is solved by bipartite graph partitioning. The cluster ensemble problem is to combine multiple clusterings to yield a final clustering. The paper uses graph partitioning techniques to solve this problem and introduces a new reduction method that constructs a bipartite graph from a given cluster ensemble. Existing systems combine the clusterings through a similarity matrix and then apply an agglomerative clustering algorithm to produce the final clustering.

Graph partitioning: A weighted graph is represented by G = (V, W), where V is the set of vertices and W is a nonnegative, symmetric similarity matrix. The graph G is partitioned into k parts. The k-way partition is chosen to minimize the cut, subject to each part containing roughly the same number of vertices.

Hybrid Bipartite Graph Formulation (HBGF): Given the clusters C = {C1, C2, ..., Ck} of the ensemble, the graph contains one vertex per cluster and one vertex per instance. W(i, j) = 0 if vertices i and j are both clusters or both instances. Otherwise, W(i, j) = W(j, i) = 1 if instance i belongs to cluster j, and 0 if it does not. An advantage of HBGF is that the reduction is lossless: the original cluster ensemble can easily be reconstructed from the HBGF graph.

In [5], the best ensembles were based on k-means individual clusterers. Consensus functions interpreting the consensus matrix of the ensemble as data, rather than similarity, were found to be significantly better than the traditional alternatives, including CSPA and HGPA.

II. Existing System

The ensemble-information matrix presents only cluster-data point relations, with many entries left unknown. Clustering categorical data is a critical issue in data clustering, and the existing system generates the final data partition of a cluster ensemble from incomplete information. One critical issue, determining the relative importance of the data in the computation, is not addressed satisfactorily. Clustering categorical data is neither efficient nor easy, and some entities are ignored rather than included in the final clusters. The link-based similarity technique used to estimate the similarity among data points is inapplicable to large data sets, and existing clustering algorithms do not discover the unknown values. Despite promising findings, this initial framework is based on the data point-data point pairwise-similarity matrix, which is highly expensive to obtain.

III. Proposed System

Within the same database, different clusterings can contain near-identical copies of the same cluster. The MinHash algorithm is therefore applied to avoid this duplication. The data set is clustered with several different clustering methods, each of which gives a different output as its base clustering result. The base clustering results are linked using the link-based similarity approach, and the final clustering is produced with a spectral graph partitioning mechanism. The final data partition of the cluster ensemble is thus generated from complete cluster information.

With this system, categorical data sets can be clustered easily, and entities that the existing system ignores are considered for inclusion in the final clusters. A new link-based algorithm is proposed specifically to generate the similarity measures inexpensively, and it has the ability to discover unknown values. To achieve both accuracy and efficiency, the proposed system combines link-based similarity with a bipartite graph partitioning mechanism.

IV. Conclusion

This paper presents a novel, highly effective link-based cluster ensemble approach to categorical data clustering. A bipartite graph is generated from the Refined Matrix (RM).
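As a rough illustration of the statement above, the following Python sketch builds the symmetric weight matrix of a bipartite graph from a refined matrix. It assumes the RM has one row per data point and one column per cluster, with entries in [0, 1], and it reuses the vertex layout described for HBGF in the related work, with the binary membership weights replaced by RM entries. The function name `bipartite_weights` and the example values are hypothetical; the paper does not specify an implementation.

```python
def bipartite_weights(rm):
    """Build the weight matrix W of a bipartite graph from a refined matrix.

    rm is a list of rows: rm[i][j] is the (assumed) association in [0, 1]
    between data point i and cluster j. Vertices 0..n-1 are data points and
    vertices n..n+p-1 are clusters.
    """
    n, p = len(rm), len(rm[0])
    size = n + p
    w = [[0.0] * size for _ in range(size)]
    for i, row in enumerate(rm):
        for j, value in enumerate(row):
            # Edges exist only between a data point and a cluster;
            # point-point and cluster-cluster entries stay 0.
            w[i][n + j] = value
            w[n + j][i] = value  # mirror the entry so W is symmetric
    return w


# Hypothetical refined matrix: 4 data points, 3 clusters.
rm = [
    [1.0, 0.2, 0.0],
    [1.0, 0.0, 0.3],
    [0.1, 1.0, 0.0],
    [0.0, 0.4, 1.0],
]
w = bipartite_weights(rm)
```

The resulting matrix `w` could then be handed to any graph partitioning routine; which partitioner to use is left open here.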
The RM is constructed using the similarity among data points. The problem of constructing the RM is efficiently resolved by using the similarity among categorical labels. The weights for graph partitioning are calculated using Weighted Triple Quality (WTQ), which improves the clustering result. The MinHash technique is used to avoid duplication between cluster sets.

REFERENCES:
[1] Y. Yang, S. Guan, and J. You, "CLOPE: A Fast and Effective Clustering Algorithm for Transactional Data," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 682-687, 2002.
[2] S. Guha, R. Rastogi, and K. Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes," Information Systems, vol. 25, no. 5, pp. 345-366, 2000.
[3] A.L.N. Fred and A.K. Jain, "Combining Multiple Clusterings Using Evidence Accumulation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 835-850, June 2005.
[4] X.Z. Fern and C.E. Brodley, "Solving Cluster Ensemble Problems by Bipartite Graph Partitioning," Proc. Int'l Conf. Machine Learning (ICML), pp. 36-43, 2004.
[5] L.I. Kuncheva, S.T. Hadjitodorov, and L.P. Todorova, "Experimental Comparison of Cluster Ensemble Methods," Proc. Int'l Conf. Information Fusion (ICIF), July 2006.
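Supplementary sketch of the MinHash step named in the conclusion: the Python code below shows one way duplicate clusters could be removed from an ensemble. It assumes each cluster is represented as a non-empty set of data point indices and that two clusters count as duplicates when their MinHash signatures agree in every position. The helper names, the choice of MD5 as the hash family, the signature length of 64, and the threshold are all assumptions made for illustration; the paper does not describe these details.

```python
import hashlib


def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed (assumed hash family)."""
    digest = hashlib.md5(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(members, num_hashes=64):
    """MinHash signature: for each seeded hash, the minimum over the members."""
    return tuple(min(_hash(seed, m) for m in members) for seed in range(num_hashes))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def drop_duplicate_clusters(clusters, threshold=1.0, num_hashes=64):
    """Keep a cluster only if its estimated similarity to every kept cluster is below threshold."""
    kept, signatures = [], []
    for cluster in clusters:
        sig = minhash_signature(cluster, num_hashes)
        if all(estimated_jaccard(sig, s) < threshold for s in signatures):
            kept.append(cluster)
            signatures.append(sig)
    return kept


# Hypothetical ensemble: the third cluster repeats the first.
ensemble = [{0, 1, 2}, {3, 4}, {2, 1, 0}, {5}]
unique_clusters = drop_duplicate_clusters(ensemble)
```

Lowering `threshold` below 1.0 would also discard near-duplicates, at the cost of occasionally merging clusters that merely overlap heavily.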