Comparative Study of Hierarchical Clustering over Partitioning Clustering Algorithm
Devendra Kumar, Asst. Professor, IFTM, Moradabad (U.P.), dev625@yahoo.com
Neha Arora, M. Tech Scholar, IFTM, Moradabad (U.P.), apexneha2009@gmail.com
Abstract.
With the huge amount of information available, it is not possible to take full advantage of the World Wide Web without a proper framework for searching through the available data. The size of the Web has increased exponentially over the past few years, with thousands of documents related to a subject available to the user.
Data clustering is one of the important data mining methods. It is a process of finding classes in a data set such that objects within the same class are most similar and objects in different classes are most dissimilar. Many clustering algorithms have been proposed by researchers. Partitioning clustering and hierarchical clustering are two main approaches to clustering. This paper summarizes the different characteristics of partitioning clustering and hierarchical clustering.
Keywords: Search Engine, Web Mining, Hierarchical Clustering, Partitioning Clustering.
1. INTRODUCTION: There is a tremendous growth of information on the Web. As the number of users of the Web grows rapidly, many challenges of information retrieval arise, and these have become current research topics [1]. It is practically impossible for a user to search through this extremely large database for the information he or she needs. Hence the need for search engines arises. Search engines use crawlers to gather information and store it in a database maintained on the search-engine side. For a given user's query, the search engine searches its local database and very quickly displays the results. This huge amount of information is retrieved using data mining tools. Clustering plays a key role in searching for structures in data. As the number of available documents nowadays is large, hierarchical approaches are better suited because they permit categories to be defined at different levels [2].
Clustering can be considered the most important unsupervised learning problem; it deals with finding a structure in a collection of unlabeled data. The ability to form meaningful groups of objects is one of the most fundamental modes of intelligence. Cluster analysis is a tool for exploring the structure of data. Clustering techniques have been widely used in many areas such as data mining, artificial intelligence, pattern recognition, bioinformatics, segmentation and machine learning [3]. Web mining is the use of data mining techniques to automatically discover and extract information from the Web. The unstructured nature of web data makes web mining more complex. Web mining research is a converging area of several research communities, such as databases, information retrieval and artificial intelligence, as well as psychology and statistics. Clustering methods can be compared along the following orthogonal aspects: the partitioning criteria, the separation of clusters, the similarity measure, and the clustering space. There are various clustering methods: partitioning methods, hierarchical methods, density-based methods and grid-based methods. Web mining is a rapidly growing research area; it consists of web usage mining, web structure mining and web content mining. Web usage mining refers to the discovery of user access patterns from web usage logs. Web structure mining aims to discover useful knowledge from the structure of hyperlinks. Web content mining refers to mining useful information or knowledge from web page contents. Web mining is a technique to discover and analyze useful information from web data. The Web involves three types of data: data on the web, web log data and web structure data.
1.1 Web Mining Process
Web mining may be decomposed into the following subtasks:
• Resource discovery: the task of retrieving the intended Web resources.
• Information pre-processing: the transform process applied to the result of resource discovery.
• Information extraction: automatically extracting specific information from newly discovered Web resources.
• Generalization: uncovering general patterns at individual Web sites and across multiple sites [4].
Fig 1: Web Mining Process (resource discovery → information pre-processing → information extraction → generalization)
1.2 Web Mining Taxonomy
Fig 2: Web Mining Taxonomy
Data mining is the process of extracting previously unknown, valid and useful information from large databases. It is a collection of techniques and tools for handling large amounts of information [5]. Many clustering algorithms have been proposed by researchers. Partitioning clustering and hierarchical clustering are two main approaches to clustering. Some of the clustering algorithms in the literature are K-means, K-medoids, FCM, PAM, CLARA, CLARANS, BIRCH, CURE, ROCK and CHAMELEON. The main steps of a typical clustering process include object representation, definition of an object similarity measure appropriate to the data domain, grouping objects by a given clustering algorithm, and assessment of the output [3].
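As an illustration of these four steps, the following minimal Python sketch clusters a few toy documents; it is not the paper's method, and it assumes scikit-learn is available. The documents, the choice of TF-IDF representation, and the use of the silhouette coefficient for assessment are all illustrative assumptions.

```python
# Illustrative sketch of the typical clustering pipeline named above:
# representation, similarity measure, grouping, and assessment.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

docs = [
    "web mining discovers knowledge from web data",
    "search engines crawl and index web pages",
    "k-means partitions data into k clusters",
    "hierarchical clustering builds a dendrogram",
]

# 1. Object representation: TF-IDF vectors for each document.
X = TfidfVectorizer().fit_transform(docs)

# 2-3. Similarity measure and grouping: Euclidean distance inside K-means.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# 4. Assessment of the output: silhouette coefficient of the clustering.
print(labels, silhouette_score(X, labels))
```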
2. Clustering Algorithms
Clustering routines (either implicitly or explicitly) map a distance matrix D into p cluster labels. Graph-theoretic, model-based and non-parametric approaches have been implemented. The non-parametric clustering methods can be separated into partitioning and hierarchical algorithms. Thus, according to the method adopted to define clusters, the algorithms can be broadly classified into the following types (Jain et al., 1999):
• Partitioning clustering attempts to directly decompose the data set into a set of disjoint clusters. More specifically, such algorithms attempt to determine an integer number of partitions that optimize a certain criterion function. The criterion function may emphasize the local or the global structure of the data, and its optimization is an iterative procedure.
• Hierarchical clustering proceeds successively by either merging smaller clusters into larger ones, or by splitting larger clusters. The result of the algorithm is a tree of clusters, called a dendrogram, which shows how the clusters are related. By cutting the dendrogram at a desired level, a clustering of the data items into disjoint groups is obtained [6], as the sketch below illustrates.
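The following minimal sketch (not from the paper; it assumes scikit-learn and NumPy) contrasts the two families on the same toy data: K-means partitions directly, while the agglomerative method builds a tree and cuts it into the same number of disjoint groups. The blob data is an illustrative assumption.

```python
# Partitioning vs. hierarchical clustering on the same toy data.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
# Three well-separated Gaussian blobs as illustrative data.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# Partitioning: directly decomposes the data into k disjoint clusters.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical: builds a tree; "cutting" it at 3 clusters gives a
# disjoint grouping comparable to the partitioning result.
agglo_labels = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)

print(kmeans_labels[:10], agglo_labels[:10])
```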
2.1 Partitioning Clustering algorithm
Partitioning methods, such as Self-Organizing Maps (Törönen et al. (1999)), K-means, and Partitioning Around Medoids (Kaufman & Rousseeuw (1990)), identify a user-specified number of clusters and assign each element to a cluster.
2.1.1 K-means clustering algorithm
K-means clustering is a well-known partitioning method. In it, objects are classified as belonging to one of K groups. The result of the partitioning method is a set of K clusters, with each object of the data set belonging to exactly one cluster. In each cluster there may be a centroid or a cluster representative. In the case of real-valued data, the arithmetic mean of the attribute vectors for all objects within a cluster provides an appropriate representative; alternative types of centroid may be required in other cases. Example: a cluster of documents can be represented by a list of those keywords that occur in some minimum number of documents within the cluster. If the number of clusters is large, the centroids can be further clustered to produce a hierarchy within a dataset. K-means is a data mining algorithm which performs clustering of the data samples. As mentioned previously, clustering means the division of a dataset into a number of groups such that similar items belong to the same group. In order to cluster the database, the K-means algorithm uses an iterative approach [7].
K-means attempts to minimize the squared or absolute error of points with respect to their cluster centroids; this is often a reasonable criterion and leads to a simple algorithm. K-means is a centroid-based partitioning technique that uses the centroid of a cluster Ci to represent that cluster. The difference between an object p ∈ Ci and ci, the representative of the cluster, is measured by dist(p, ci), where dist(x, y) is the Euclidean distance between two points x and y. The quality of the cluster Ci can be measured by the within-cluster variation, which is the sum of the squared error between all objects in Ci and the centroid ci, defined as

E = \sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(p, c_i)^2

where E is the sum of the squared error for all objects in the data set, p is the point in space representing a given object, and ci is the centroid of cluster Ci.
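As an illustration, here is a minimal NumPy sketch (not from the paper) of Lloyd's K-means iteration, which alternately assigns points to the nearest centroid and recomputes centroids, together with the error E defined above. The initialization scheme and iteration cap are illustrative assumptions.

```python
# Minimal K-means (Lloyd's algorithm) sketch; illustrative only.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids with k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        # (this sketch assumes no cluster becomes empty).
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # E: sum over clusters of squared distances to the cluster centroid.
    E = sum(((X[labels == i] - centroids[i]) ** 2).sum() for i in range(k))
    return labels, centroids, E
```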
In particular, K-means gives poor results when clusters have widely different sizes or non-convex shapes. The K-means objective function is minimized by globular clusters of equal size or by clusters that are well separated. The K-means algorithm is also normally restricted to data in Euclidean spaces, because in many other cases the required means and medians are not meaningful. The K-means method is not guaranteed to converge to the global optimum and often terminates at a local optimum; the results may depend on the initial random selection of cluster centers. K-means can be applied only when the mean of a set of objects is defined. The K-means algorithm is sensitive to outliers, because such objects are far away from the majority of the data and thus, when assigned to a cluster, they dramatically distort the mean value of the cluster. A related technique, K-medoid clustering, does not have this restriction.
2.1.2 K-medoids Clustering Algorithm
The objective of K-medoid clustering is to find non-overlapping clusters such that each cluster has a most representative point, i.e., a point that is most centrally located with respect to some measure, e.g., distance. These representative points are called medoids. Such an approach is not restricted to Euclidean spaces and is likely to be more tolerant of outliers. However, finding a better medoid requires trying all points that are currently not medoids, which is computationally expensive.
A typical K-medoid partitioning algorithm like PAM (Partitioning Around Medoids) works efficiently for small data sets but does not scale well to large data sets. CLARA (Clustering Large Applications) cannot find a good clustering if any of the best sampled medoids is far from the best K medoids. If an object is one of the best K medoids but is not selected during sampling, CLARA will never find the best clustering.
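The following minimal NumPy sketch (an illustration, not PAM as published) captures the swap idea just described: greedily replace a medoid with a non-medoid whenever doing so lowers the total distance. The exhaustive swap loop also makes the quadratic cost visible.

```python
# PAM-style K-medoids sketch: greedy medoid/non-medoid swaps.
import numpy as np

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Precompute all pairwise Euclidean distances.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = list(rng.choice(n, size=k, replace=False))

    def total_cost(meds):
        # Each point contributes its distance to the nearest medoid.
        return dist[:, meds].min(axis=1).sum()

    improved = True
    while improved:
        improved = False
        # Try swapping every medoid with every non-medoid: this
        # exhaustive search is what makes PAM expensive on large data.
        for m in list(medoids):
            for h in range(n):
                if h in medoids:
                    continue
                candidate = [h if x == m else x for x in medoids]
                if total_cost(candidate) < total_cost(medoids):
                    medoids = candidate
                    improved = True
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels
```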
2.2 Hierarchical Clustering
Hierarchical methods are well-known clustering techniques that can be potentially very useful for various data mining tasks, and they are commonly used for clustering in data mining. A hierarchical tree can be agglomerative (i.e., built from the bottom up by recursively combining the elements) or divisive (i.e., built from the top down by recursively partitioning the elements). AGNES (Kaufman & Rousseeuw (1990)) and Cluster (Eisen et al. (1998)) are examples of agglomerative hierarchical algorithms, while DIANA (Kaufman & Rousseeuw (1990)) is an example of a divisive hierarchical algorithm. An agglomerative hierarchical method begins with each object as its own cluster. It then successively merges the most similar clusters together until the entire set of data becomes one group. In order to determine which groups should be merged in agglomerative hierarchical clustering, various linkage methods can be used, such as single linkage, complete linkage, average linkage, Ward's method and the centroid method. Single linkage (Sneath, 1957) merges groups based on the minimum distance between two objects in the two groups [8]. Since hierarchical clustering is a greedy search algorithm based on a local search, the merging decisions made early in the agglomerative process are not necessarily the right ones. One possible solution to this problem is to refine a clustering produced by the agglomerative hierarchical algorithm, to potentially correct the mistakes made early in the agglomerative process. Divisive algorithms start with one all-inclusive cluster and at each step split a cluster until only singleton clusters of individual points remain; in such algorithms one must decide which cluster to split at each step, for example by using a Minimum Spanning Tree.
Hierarchical methods involve constructing a tree of clusters in which the root is a single cluster containing all the elements and the leaves each contain only one element. Hierarchical methods are used when it is of interest to look at clusters at a range of levels of detail, including the final level of the tree, which is an ordered list of the elements. Such a list, in which nearby elements are similar, is often more helpful to the subject-matter scientist than a collection of large, unordered groups. Agglomerative methods can be employed with different types of linkage. In average linkage methods, the distance between two clusters is the average of the dissimilarities between the points in one cluster and the points in the other cluster. In single linkage methods (nearest neighbor methods), the dissimilarity between two clusters is the smallest dissimilarity between a point in the first cluster and a point in the second cluster.
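The two linkage rules just described can be sketched in a few lines, assuming SciPy is available (this is an illustration, not code from the paper; the random data is an assumption):

```python
# Agglomerative clustering with single and average linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))

# Single linkage: merge the two clusters with the smallest *minimum*
# pairwise distance between their members.
Z_single = linkage(X, method="single")

# Average linkage: merge the two clusters with the smallest *average*
# pairwise distance between their members.
Z_average = linkage(X, method="average")

# Cutting either tree yields a disjoint clustering, here into 4 groups.
labels = fcluster(Z_average, t=4, criterion="maxclust")
print(labels)
```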
A convenient way to keep track of the tree structure is to assign each element a label indicating its path
through the tree. At each level, the previous labels are extended with another digit representing the new
cluster label.
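A rough way to mimic this path-style labelling, assuming SciPy (an illustrative sketch, not the authors' procedure), is to cut the tree at successively finer levels and join the per-level labels:

```python
# Path-like labels: one cluster label per element at each tree level.
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

rng = np.random.default_rng(2)
X = rng.normal(size=(12, 2))
Z = linkage(X, method="average")

# One column of labels per requested level: 2, 4 and 8 clusters.
levels = cut_tree(Z, n_clusters=[2, 4, 8])
for element, path in enumerate(levels):
    # Joining the per-level labels gives a coarse-to-fine "path" label.
    print(element, ".".join(str(c) for c in path))
```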
A hierarchical clustering scheme produces a sequence of clusterings in which each clustering is nested into the next clustering in the sequence [9]. An important benefit of a hierarchical tree is that one can look at clusters at increasing levels of detail. The clusters at any level of the tree can be visualized by plotting the distance matrix corresponding to an ordering of the clusters and an ordering of the elements within the clusters. In the hierarchical procedure, a hierarchy or tree-like structure is built which allows one to visualize the relationships among the entities; in a non-hierarchical method, by contrast, a position in the measurement space is taken as the central place (seed) and distances are measured from that central point. Identifying the right central position is therefore a big challenge, and hence non-hierarchical methods are less popular.
Fig 3: Hierarchical Clustering [11]
The advantages of hierarchical clustering are:
• Embedded flexibility regarding the level of granularity.
• Ease of handling any form of similarity or distance.
• Consequently, applicability to any attribute types.
• Hierarchical clustering algorithms are more versatile.
How the algorithms are compared:
• Size of the data set.
• Number of clusters.
• Type of data set.
• Type of software [10].
CONCLUSION. Hierarchical clustering is better than partitioning clustering because it is more suitable for web mining. Hierarchical clustering is also useful for detecting outlier data points or documents. This technique keeps related documents in the same cluster, so that searching for documents becomes more efficient in terms of time complexity. In future work we can also improve the relevancy factor of hierarchical clustering for retrieving web documents.
References.
[1] Rajendra Kumar Roul (rkroul@bits-goa.ac.in) and S. K. Sahay (ssahay@bits-goa.ac.in), "An Effective Web Document Clustering for Information Retrieval", BITS Pilani - K.K. Birla Goa Campus, Zuarinagar, Goa 403726, India.
[2] Bailey, P., Craswell, N., and Hawking, D., "Engineering a multipurpose test collection for Web retrieval experiments", Information Processing and Management, 39, 853–871, 2003.
[3] Jain, A., Murty, M., and Flynn, P., "Data clustering: A review", ACM Computing Surveys, 31(3), pp. 264–323, 1999.
[4] Hillol Kargupta, Anupam Joshi, Krishnamoorthy Sivakumar, and Yelena Yesha, "Data Mining: Next Generation Challenges and Future Directions", MIT Press, USA, 2004.
[5] Han, J., and Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, 2001.
[6] "On Clustering Validation Techniques", Journal of Intelligent Information Systems, 17:2/3, 107–145, 2001, Kluwer Academic Publishers, The Netherlands.
[7] Kehar Singh, Dimple Malik, and Naveen Sharma, "Evolving limitations in K-means algorithm in data mining", International Journal of Computational Engineering & Management (IJCEM), Vol. 12, April 2011.
[8] Laura Ferreira and David B. Hitchcock, "A Comparison of Hierarchical Methods for Clustering Functional Data", Department of Statistics, University of South Carolina, Columbia, South Carolina 29208.
[9] Navjot Kaur, Jaspreet Kaur Sahiwal, and Navneet Kaur, "Efficient K-means Clustering Algorithm Using Ranking Method in Data Mining", International Journal of Advanced Research in Computer Engineering & Technology, Vol. 1, Issue 3, May 2012, ISSN: 2278-1323.
[10] Osama Abu Abbas, "Comparison between Data Clustering Algorithms", The International Arab Journal of Information Technology, Vol. 5, No. 3, July 2008.
[11] "Data Mining Algorithms in R / Clustering / Hybrid Hierarchical", en.wikibooks.org. A dendrogram obtained using a single-link agglomerative clustering algorithm; source: Jain, Murty, Flynn (1999).