International Journal of Engineering Trends and Technology (IJETT) – Volume 9 Number 4 - Mar 2014
A BI-Partite Graph Partition and Link Based Approach for
Solving Categorical data Clustering
G. Rama Lakshmi#1, E. Thenmozhi2
#1 Postgraduate student, Sathyabama University, Chennai, India.
2 Faculty, Sathyabama University, Chennai, India.
Abstract— A link-based approach is used to solve the problem of categorical data clustering through cluster ensembles. It consists of generating a set of clusterings from the dataset and combining them into a final clustering. The combination process aims to improve on the quality of the individual data clusterings. The ensemble-information matrix presents only cluster-data point relations, with many entries being left unknown. The link-based approach discovers the unknown entries through the similarity between clusters. The similarity approach is then applied to a weighted bipartite graph to obtain the final clustering. This paper proposes Weighted Triple Quality (WTQ), which provides an efficient approximation of the similarity between clusters. The Min Hash algorithm is used to eliminate duplicate clusters from the cluster ensemble. It provides an efficient result.
Keywords— clustering, categorical data, cluster ensemble

Introduction
Data clustering is used to define the structure of a data set. Clustering groups similar elements in a data set in accordance with their similarity, such that elements in each cluster are similar while elements from different clusters are dissimilar. It is used in pattern recognition, information retrieval, data mining, and machine learning. Clustering algorithms such as k-means and PAM are designed for numerical data. Examples of categorical attributes are color = {red, green, blue} and gender = {male, female}.
Although a large number of algorithms have been introduced for clustering categorical data, there is no single clustering algorithm that performs best for all data sets and can discover all types of cluster shapes and structures present in data. Each algorithm has its own strengths and weaknesses. For a particular data set, different algorithms, or even the same algorithm with different parameters, usually provide different solutions. Therefore, it is difficult for users to decide which algorithm would be the proper alternative for a given set of data. Recently, cluster ensembles have emerged as an effective solution that is able to overcome these limitations and improve the robustness as well as the quality of clustering results. The main objective of cluster ensembles is to combine different clustering decisions in such a way as to achieve accuracy superior to that of any individual clustering.
Examples of well-known ensemble methods are:
1. the feature-based approach, which transforms the problem of cluster ensembles into clustering categorical data (i.e., cluster labels);
2. the direct approach, which finds the final partition from the base clustering results;
3. graph-based algorithms, which use a graph partitioning methodology;
4. the pairwise-similarity approach, which uses the co-occurrence relations between data points.

I.
Related Work
In [1], a large dataset is partitioned into homogeneous clusters. The k-means algorithm is a widely used method for dividing a dataset into clusters, but it works only on numerical values. Real datasets contain categorical values, so the k-means algorithm is extended to the categorical domain; the extension is called k-modes.
The k-modes algorithm operates as follows:
1. Select k initial modes, one for each cluster.
2. Allocate each object to the cluster whose mode is nearest to the object.
3. After all objects have been allocated to clusters, retest the dissimilarity of the objects against the current modes. If an object is found to belong to another cluster rather than its current one, reallocate it and update the clusters.
4. Repeat step 3 until no object has changed clusters.
The advantages of that work are that it describes the characteristics of the clusters and that the k-modes algorithm is faster than the k-means algorithm. The future work of that paper is to develop a parallel k-modes algorithm to cluster datasets with millions of objects.
In [2], a clustering algorithm is used to partition a large database. In that paper the clustering algorithm is applied to Boolean data and categorical attributes. The proposed system of that paper is a novel link-based approach to finding the similarity between pairs of data points. The RObust Clustering using linKs (ROCK) algorithm, a hierarchical clustering algorithm, measures non-metric similarity.
Clustering Neighbors: Consider the data points Pi and Pj. Let sim(Pi, Pj) be a similarity function that captures the closeness between the data points Pi and Pj. The function sim could be a distance metric, or it could be non-metric. Assume that sim takes values between 0 and 1, with larger values indicating that the points are more similar. Given a threshold ϴ between 0 and 1, Pi and Pj are neighbors if sim(Pi, Pj) >= ϴ, where ϴ is a user-defined parameter. A sim value of 1 indicates identical data points, and a sim value of 0 indicates totally dissimilar data points.
Links: The link-based approach is a global approach. It uses information about neighboring data points to establish the relationship between an individual pair of points. link(Pi, Pj) denotes the number of common neighbors of Pi and Pj.
ISSN: 2231-5381
http://www.ijettjournal.org
Procedure: 1) Compute the neighbor list for every point. 2) Find the similarity between data points. 3) Make the links between the data points of the cluster.
The combination of random sampling and labelling enables ROCK to perform quite well on large databases.
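As a rough illustration of the neighbor and link computation described for ROCK, the following Python sketch builds neighbor lists from a similarity threshold and then counts links. It assumes the Jaccard coefficient as sim, takes link(Pi, Pj) to be the number of neighbors that Pi and Pj share (the definition used by ROCK [2]), and does not count a point as its own neighbor. The function names and the sample records are hypothetical and are not taken from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity of two categorical records, in [0, 1] (assumed choice of sim)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def neighbor_lists(points, theta):
    """For every point, collect the indices of points with sim >= theta."""
    n = len(points)
    nbrs = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if jaccard(points[i], points[j]) >= theta:
                nbrs[i].add(j)
                nbrs[j].add(i)
    return nbrs


def link_counts(nbrs):
    """link(Pi, Pj) = number of common neighbors; pairs with zero links are omitted."""
    links = {}
    n = len(nbrs)
    for i in range(n):
        for j in range(i + 1, n):
            common = len(nbrs[i] & nbrs[j])
            if common:
                links[(i, j)] = common
    return links


# Hypothetical categorical records, each a set of attribute values.
points = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}, {"x", "y"}]
nbrs = neighbor_lists(points, theta=0.5)
links = link_counts(nbrs)
```

In this toy data the first three records are mutual neighbors, so each pair among them shares exactly one common neighbor, while the last record has no neighbors and takes part in no links.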
Such algorithms do not discover the unknown values. Despite promising findings, this initial framework is based on the data point-data point pairwise-similarity matrix, which is highly expensive to obtain.
In [3], the correspondence between the clusters of different systems is determined. The proposed system of that paper is based on Singular Value Decomposition (SVD). A matrix R of size D × SC is constructed, where D is the number of observations, S is the number of systems, and C is the number of clusters. Each row contains the cluster posteriors of the systems for a given observation.
Evaluating a clustering system has two steps: first, create a one-to-one mapping between clusters and classes using classification; second, compute the Normalized Mutual Information (NMI) between clusters and classes. NMI quantifies how much information the cluster assignment shares with the class labels.
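To make the NMI step concrete, here is a minimal Python sketch that computes NMI between a cluster assignment and a set of class labels. It assumes natural logarithms and normalization by the geometric mean of the two entropies, which is one common convention; the source does not state which normalization is used. The function names and the convention of returning 0 when either labelling has zero entropy are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same objects."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    mi = 0.0
    for (x, y), n_xy in joint.items():
        mi += (n_xy / n) * math.log(n_xy * n / (count_a[x] * count_b[y]))
    return mi


def nmi(clusters, classes):
    """NMI normalized by sqrt(H(clusters) * H(classes)) (assumed convention)."""
    if len(clusters) != len(classes):
        raise ValueError("both labelings must cover the same objects")
    h_a, h_b = entropy(clusters), entropy(classes)
    if h_a == 0.0 or h_b == 0.0:
        return 0.0  # assumed convention for a degenerate single-label case
    return mutual_information(clusters, classes) / math.sqrt(h_a * h_b)


# Same grouping with renamed labels versus an unrelated grouping.
same_grouping = nmi([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Because NMI depends only on how the objects are grouped, renaming the cluster labels does not change the score, which is why no explicit label mapping is needed for this measure.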
In [4], the critical problem of cluster ensembles is solved by bipartite graph partitioning. The cluster ensemble problem is to combine multiple clusterings to yield a final clustering. A graph partitioning technique is used to solve the cluster ensemble problem; new reduction methods are used to construct a bipartite graph from a given cluster ensemble.
In the existing approach, the clusterings are combined through a similarity matrix, and an agglomerative clustering algorithm is then applied to produce the final clustering.
Graph partitioning: A weighted graph is represented by G = (V, W), where V is the set of vertices and W is a nonnegative, symmetric similarity matrix. The graph G is partitioned into k parts. A k-way partition is sought that minimizes the cut, with each part containing roughly the same number of vertices.
Hybrid Bipartite Graph Formulation (HBGF): Given the clusters C = {C1, C2, …, Ck} of the ensemble, the graph contains one vertex for each cluster and one vertex for each instance. W(i, j) = 0 if vertices i and j are both clusters or both instances; otherwise, W(i, j) = W(j, i) = 1 if instance i belongs to cluster j, and 0 otherwise.
The advantage of HBGF is that the reduction is lossless: the original cluster ensemble can easily be reconstructed from the HBGF graph.
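The HBGF weight matrix can be sketched in a few lines of Python. The sketch below covers only the graph construction, not the partitioning step, and assumes the ensemble is given as a list of base clusterings, each a list with one label per instance. The function name `build_hbgf`, the vertex ordering (instances first, then clusters), and the sample ensemble are hypothetical choices for illustration.

```python
def build_hbgf(ensemble):
    """Build the HBGF weight matrix W for a cluster ensemble.

    Vertices 0..n-1 are instances; each (clustering index, label) pair
    gets its own cluster vertex numbered from n upward.
    """
    n = len(ensemble[0])
    cluster_ids = {}
    for r, labels in enumerate(ensemble):
        if len(labels) != n:
            raise ValueError("every base clustering must label all instances")
        for label in labels:
            # Assign the next free vertex index the first time a cluster is seen.
            cluster_ids.setdefault((r, label), n + len(cluster_ids))

    size = n + len(cluster_ids)
    W = [[0] * size for _ in range(size)]
    for r, labels in enumerate(ensemble):
        for i, label in enumerate(labels):
            j = cluster_ids[(r, label)]
            W[i][j] = W[j][i] = 1  # instance i belongs to cluster j
    return W, cluster_ids


# Two hypothetical base clusterings of three instances.
ensemble = [[0, 0, 1], [0, 1, 1]]
W, cluster_ids = build_hbgf(ensemble)
```

Each instance row of W has exactly one nonzero entry per base clustering, so the label of every instance in every clustering can be read back from W; this is the lossless property noted above. Entries between two instances or between two clusters stay at 0.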
In [5], the best ensembles were based on k-means individual clusterers. Consensus functions that interpret the consensus matrix of the ensemble as data, rather than as similarity, were found to be significantly better than the traditional alternatives, including CSPA and HGPA.
II.
Existing System
The ensemble-information matrix presents only cluster-data point relations, with many entries being left unknown. Clustering categorical data remains a critical issue in data clustering. The existing system generates the final data partition of cluster ensembles based on incomplete information. One critical issue, determining the relative importance of the data in the computation, is not addressed. Clustering categorical data is neither efficient nor easy. Some entities are ignored and not included in the final cluster. The link-based similarity technique employed to estimate the similarity among data points is inapplicable to large data sets in existing clustering algorithms.
III.
Proposed System
Within the same database, different clustering groups may contain similar copies of the data. The Min Hash algorithm is therefore applied to avoid duplication.
The data set can be clustered by different clustering methods, and the different methods give different outputs as base clustering results. The base clustering results are linked through the link-based similarity approach to produce the final clustering result using a spectral graph partitioning mechanism. This generates a final data partition of the cluster ensemble based on complete cluster information.
Categorical data sets can thus be clustered easily, and the entities are also considered for inclusion in the final cluster. To achieve accuracy and efficiency, two methods are proposed: a link-based similarity algorithm and a bipartite (spectral) graph partitioning mechanism. A new link-based algorithm has been specifically proposed to generate such similarity measures in an inexpensive manner, and the link-based method is able to discover unknown values.
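The paper does not spell out how Min Hash is applied, so the following Python sketch shows only one plausible reading of the duplicate-elimination step. It assumes each cluster is a non-empty set of data point identifiers, uses SHA-1 salted with a seed as the family of hash functions, and drops a cluster when its MinHash signature matches that of an already kept cluster (estimated Jaccard similarity at or above a threshold of 1.0). All function names, the threshold, and the number of hash functions are hypothetical.

```python
import hashlib


def _hash(seed, item):
    """One member of an assumed hash family: SHA-1 of the seed and the item."""
    digest = hashlib.sha1(f"{seed}:{item!r}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(cluster, num_hashes=64):
    """MinHash signature: the minimum hash of the members under each hash function."""
    return tuple(min(_hash(seed, x) for x in cluster) for seed in range(num_hashes))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def drop_duplicate_clusters(clusters, threshold=1.0, num_hashes=64):
    """Keep the first cluster of every group whose estimated similarity >= threshold."""
    kept, kept_signatures = [], []
    for cluster in clusters:
        sig = minhash_signature(cluster, num_hashes)
        if any(estimated_jaccard(sig, s) >= threshold for s in kept_signatures):
            continue  # treated as a duplicate of an earlier cluster
        kept.append(cluster)
        kept_signatures.append(sig)
    return kept


# Hypothetical clusters from different base clusterings; the first two are identical.
clusters = [{1, 2, 3}, {3, 2, 1}, {4, 5}]
unique_clusters = drop_duplicate_clusters(clusters)
```

Identical clusters always yield identical signatures, so exact duplicates are removed deterministically. Lowering the threshold would also merge near-duplicates, at the cost of occasionally discarding clusters that merely overlap heavily.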
IV.
Conclusion
This paper presents a novel, highly effective link-based cluster ensemble approach to categorical data clustering. The bipartite graph is generated from the Refined Matrix (RM), which is constructed using the similarity among data points. The problem of constructing the RM is efficiently resolved by the similarity among categorical labels. The weights for graph partitioning are calculated using Weighted Triple Quality (WTQ), which improves the clustering result. The Min Hash technique is used to avoid duplication between cluster sets.
REFERENCES:
[1] Y. Yang, S. Guan, and J. You, “CLOPE: A Fast and Effective Clustering Algorithm for Transactional Data,” Proc. ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD), pp. 682-687, 2005.
[2] S. Guha, R. Rastogi, and K. Shim, “ROCK: A Robust Clustering Algorithm for Categorical Attributes,” Information Systems, vol. 25, no. 5, pp. 345-366, 2000.
[3] A.L.N. Fred and A.K. Jain, “Combining Multiple Clusterings Using Evidence Accumulation,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 835-850, June 2005.
[4] X.Z. Fern and C.E. Brodley, “Solving Cluster Ensemble Problems by Bipartite Graph Partitioning,” Proc. Int’l Conf. Machine Learning (ICML), pp. 36-43, 2004.
[5] L.I. Kuncheva, S.T. Hadjitodorov, and L.P. Todorova, “Experimental Comparison of Cluster Ensemble Methods,” Information Fusion (ICIF), July 2006.