Comparative Study of Hierarchical Clustering over Partitioning Clustering Algorithm

Devendra Kumar, Asst. Professor, IFTM, Moradabad (U.P.), dev625@yahoo.com
Neha Arora, M. Tech Scholar, IFTM, Moradabad (U.P.), apexneha2009@gmail.com

Abstract. With the huge amount of information available, it is not possible to take full advantage of the World Wide Web without a proper framework for searching through the available data. The size of the web has increased exponentially over the past few years, with thousands of documents related to a subject available to the user. Data clustering is one of the important data mining methods: it is the process of dividing a data set into classes such that objects in the same class are highly similar and objects in different classes are highly dissimilar. Many clustering algorithms have been proposed by researchers; partitioning clustering and hierarchical clustering are the two main approaches. This paper summarizes the different characteristics of partitioning clustering and hierarchical clustering.

Keywords. Search Engine, Web Mining, Hierarchical Clustering, Partitioning Clustering.

1. INTRODUCTION

There is tremendous growth of information on the web. As the number of web users grows rapidly, many challenges arise for information retrieval, and these have become current research topics [1]. It is practically impossible for a user to search through this extremely large database for the information he or she needs; hence the need for search engines. Search engines use crawlers to gather information and store it in a database maintained on the search engine side. For a given user query, the search engine searches this local database and very quickly displays the results. This huge amount of information is retrieved and processed using data mining tools.

Clustering plays a key role in searching for structure in data. Because the number of available documents is now so large, hierarchical approaches are better suited, since they permit categories to be defined at different levels [2]. Clustering can be considered the most important unsupervised learning problem; it deals with finding structure in a collection of unlabeled data. The ability to form meaningful groups of objects is one of the most fundamental modes of intelligence, and cluster analysis is a tool for exploring the structure of data. Clustering techniques have been widely used in many areas such as data mining, artificial intelligence, pattern recognition, bioinformatics, segmentation and machine learning [3].

Web mining is the use of data mining techniques to automatically discover and extract information from the web. The unstructured nature of web data adds to the complexity of web mining. Web mining research is a converging area drawing on several research communities, such as databases, information retrieval and artificial intelligence, as well as psychology and statistics. Clustering methods can be compared along the following orthogonal aspects: the partitioning criteria, the separation of clusters, the similarity measure and the clustering space. The main clustering methods are partitioning methods, hierarchical methods, density-based methods and grid-based methods. Web mining is a rapidly growing research area; it consists of web usage mining, web structure mining and web content mining.
Web usage mining refers to the discovery of user access patterns from web usage logs. Web structure mining aims to discover useful knowledge from the structure of hyperlinks. Web content mining refers to mining useful information or knowledge from web page contents. Web mining is thus a technique to discover and analyze useful information from web data, which involves three types of data: data on the web, web log data and web structure data.

1.1 Web Mining Process

Web mining may be decomposed into the following subtasks:
• Resource discovery: retrieving the intended documents from web sources.
• Information pre-processing: transforming the results of resource discovery.
• Generalization: discovering general patterns at individual web sites and across multiple sites [4].

Fig 1: Web Mining Process

1.2 Web Mining Taxonomy

Fig 2: Web Mining Taxonomy

Data mining is the process of extracting previously unknown, valid and useful information from large databases. It is a collection of techniques and tools for handling large amounts of information [5]. Many clustering algorithms have been proposed by researchers; partitioning clustering and hierarchical clustering are the two main approaches. Some of the clustering algorithms in the literature are K-means, K-medoids, FCM, PAM, CLARA, CLARANS, BIRCH, CURE, ROCK and CHAMELEON. The main steps of a typical clustering process are object representation, definition of an object similarity measure appropriate to the data domain, grouping objects by a given clustering algorithm, and assessment of the output [3]. Clustering routines (either implicitly or explicitly) map a distance matrix D into p cluster labels. Graph-theoretic, model-based and non-parametric approaches have been implemented, and the non-parametric clustering methods can be separated into partitioning and hierarchical algorithms. Thus, according to the method adopted to define clusters, the algorithms can be broadly classified into the following types (Jain et al., 1999):

2. Clustering Algorithms

• Partitioning clustering attempts to directly decompose the data set into a set of disjoint clusters. More specifically, these algorithms determine an integer number of partitions that optimize a certain criterion function. The criterion function may emphasize the local or the global structure of the data, and its optimization is an iterative procedure.
• Hierarchical clustering proceeds successively by either merging smaller clusters into larger ones or by splitting larger clusters. The result of the algorithm is a tree of clusters, called a dendrogram, which shows how the clusters are related. By cutting the dendrogram at a desired level, a clustering of the data items into disjoint groups is obtained [6].

2.1 Partitioning Clustering Algorithms

Partitioning methods, such as Self-Organizing Maps (Törönen et al. (1999)), K-means and Partitioning Around Medoids (Kaufman & Rousseeuw (1990)), identify a user-specified number of clusters and assign each element to a cluster.

2.1.1 K-means Clustering Algorithm

K-means clustering is a well known partitioning method in which objects are classified as belonging to one of K groups. The result of a partitioning method is a set of K clusters, with each object of the data set belonging to exactly one cluster. Each cluster may be represented by a centroid or a cluster representative.
For real-valued data, the arithmetic mean of the attribute vectors of all objects within a cluster provides an appropriate representative; alternative types of centroid may be required in other cases. For example, a cluster of documents can be represented by a list of those keywords that occur in some minimum number of documents within the cluster. If the number of clusters is large, the centroids can be further clustered to produce a hierarchy within the data set.

K-means is a data mining algorithm which performs clustering of the data samples. As mentioned previously, clustering means the division of a data set into a number of groups such that similar items fall into the same group. In order to cluster the database, the K-means algorithm uses an iterative approach [7]. K-means attempts to minimize the squared (or absolute) error of points with respect to their cluster centroids; this is often a reasonable criterion and leads to a simple algorithm. K-means is a centroid-based partitioning technique that uses the centroid of a cluster Ci to represent that cluster. The difference between an object p ∈ Ci and ci, the representative of the cluster, is measured by dist(p, ci), where dist(x, y) is the Euclidean distance between two points x and y. The quality of the cluster Ci can be measured by the within-cluster variation, which is the sum of the squared error between all objects in Ci and the centroid ci, defined as

E = \sum_{i=1}^{k} \sum_{p \in C_i} dist(p, c_i)^2

where E is the sum of the squared error for all objects in the data set, p is the point in space representing a given object, and ci is the centroid of cluster Ci.

K-means performs poorly when clusters have widely different sizes or non-convex shapes: the K-means objective function is minimized by globular clusters of equal size or by clusters that are well separated. The K-means algorithm is also normally restricted to data in Euclidean spaces, because in many other cases the required means and medians are not meaningful. The K-means method is not guaranteed to converge to the global optimum and often terminates at a local optimum; the results may depend on the initial random selection of cluster centers. K-means can be applied only when the mean of a set of objects is defined. It is also sensitive to outliers, because such objects are far away from the majority of the data and thus, when assigned to a cluster, they can dramatically distort the mean value of the cluster. A related technique, K-medoid clustering, does not have this restriction.

2.1.2 K-medoids Clustering Algorithm

The objective of K-medoid clustering is to find non-overlapping clusters such that each cluster has a most representative point, i.e., a point that is most centrally located with respect to some measure, e.g., distance. These representative points are called medoids. Such an approach is not restricted to Euclidean spaces and is likely to be more tolerant of outliers. However, finding a better medoid requires trying all points that are not currently medoids, which is computationally expensive. A typical K-medoid partitioning algorithm like PAM (Partitioning Around Medoids) works efficiently for small data sets but does not scale well to large data sets. CLARA (Clustering Large Applications) cannot find a good clustering if any of the best sampled medoids is far from the best K medoids: if an object is one of the best K medoids but is not selected during sampling, CLARA will never find the best clustering.
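To make the partitioning approach concrete, below is a minimal K-means sketch in Python/NumPy illustrating the iterative procedure and the squared-error objective E of Section 2.1.1. It is a sketch under stated assumptions, not a reference implementation: the function name, the random initialization and the convergence test are choices made here for illustration.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Cluster the rows of X into k groups; returns (labels, centroids, E)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k points drawn at random from the data set.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid
        # (Euclidean distance, as in the definition of dist(p, ci) above).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the arithmetic mean of its cluster;
        # an empty cluster simply keeps its previous centroid.
        new_centroids = np.array(
            [X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
             for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged, possibly only to a local optimum
        centroids = new_centroids
    # E = sum over all clusters of squared distances of points to their centroid.
    E = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, E

Because the initial centroids are chosen at random, different seeds can terminate at different local optima, which is exactly the initialization sensitivity noted in Section 2.1.1. Replacing the mean update with a search over the data points for the most centrally located object would turn this sketch into the K-medoid scheme of Section 2.1.2.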
2.2 Hierarchical Clustering

Hierarchical methods are well known clustering techniques that can be very useful for various data mining tasks. A hierarchical clustering scheme produces a sequence of clusterings in which each clustering is nested into the next clustering in the sequence [9]. Hierarchical methods are commonly used for clustering in data mining, and hierarchical cluster analysis is an important class of clustering methods.

A hierarchical tree can be agglomerative (i.e., built from the bottom up by recursively combining the elements) or divisive (i.e., built from the top down by recursively partitioning the elements). AGNES (Kaufman & Rousseeuw (1990)) and Cluster (Eisen et al. (1998)) are examples of agglomerative hierarchical algorithms, while DIANA (Kaufman & Rousseeuw (1990)) is an example of a divisive hierarchical algorithm.

An agglomerative hierarchical method begins with each object as its own cluster. It then successively merges the most similar clusters together until the entire data set becomes one group. To determine which groups should be merged, various linkage methods can be used, such as single linkage, complete linkage, average linkage, Ward's method and the centroid method. Single linkage (Sneath, 1957) merges groups based on the minimum distance between two objects in the two groups [8]. Since hierarchical clustering is a greedy algorithm based on local search, the merging decisions made early in the agglomerative process are not necessarily the right ones. One possible solution to this problem is to refine a clustering produced by the agglomerative hierarchical algorithm, in order to correct mistakes made early in the agglomerative process.

Divisive algorithms start with one all-inclusive cluster and at each step split a cluster until only singleton clusters of individual points remain. In these algorithms we need to decide which cluster to split at each step, for example by using a minimum spanning tree.

Hierarchical methods construct a tree of clusters in which the root is a single cluster containing all the elements and each leaf contains only one element. They are used when it is of interest to look at clusters at a range of levels of detail, including the final level of the tree, which is an ordered list of the elements. Such a list, in which nearby elements are similar, is often more helpful to the subject-matter scientist than a collection of large, unordered groups.

Agglomerative methods can be employed with different types of linkage. In average linkage methods, the distance between two clusters is the average of the dissimilarities between the points in one cluster and the points in the other cluster. In single linkage methods (nearest neighbor methods), the dissimilarity between two clusters is the smallest dissimilarity between a point in the first cluster and a point in the second cluster. A convenient way to keep track of the tree structure is to assign each element a label indicating its path through the tree: at each level, the previous labels are extended with another digit representing the new cluster label. An important benefit of a hierarchical tree is that one can look at clusters at increasing levels of detail.
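As a concrete illustration of building an agglomerative tree and then cutting the dendrogram at a desired level, the short Python sketch below uses SciPy's hierarchical clustering routines; the synthetic data set and the choice of single linkage are assumptions made only for this example.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Two synthetic groups of 2-D points.
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 4])

# Build the tree bottom-up. method='single' merges on the minimum pairwise
# distance; 'complete', 'average' and 'ward' select the other linkages
# discussed above.
Z = linkage(X, method='single')

# Cut the dendrogram either by asking for a fixed number of clusters ...
labels_k = fcluster(Z, t=2, criterion='maxclust')
# ... or by cutting at a distance threshold for a coarser or finer grouping.
labels_d = fcluster(Z, t=1.5, criterion='distance')

# dendrogram(Z) plots the full tree, of the kind shown in Fig 3 below.

Cutting at different values of t is what gives hierarchical methods their embedded flexibility regarding the level of granularity, as noted in the list of advantages below.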
We propose to visualize the clusters at any level of the tree by plotting the distance matrix corresponding to an ordering of the clusters and an ordering of the elements within the clusters. In a hierarchical procedure, a hierarchy or tree-like structure is built which allows one to visualize the relationships among the entities. In a non-hierarchical method, by contrast, a position in the measurement space is taken as a central point (seed) and distances are measured from it; identifying the right central position is a big challenge, and hence non-hierarchical methods are less popular.

Fig 3: Hierarchical Clustering [11]

The advantages of hierarchical clustering are:
• Embedded flexibility regarding the level of granularity.
• Ease of handling any form of similarity or distance, and consequently applicability to any attribute type.
• Greater versatility overall.

The algorithms are compared on the following criteria: size of the data set, number of clusters, type of data set and type of software [10].

CONCLUSION

Hierarchical clustering is better suited than partitioning clustering for web mining. Hierarchical clustering is also useful for detecting outlier data points or documents, and it keeps related documents in the same cluster, so that searching for documents becomes more efficient in terms of time complexity. In future work we can also improve the relevancy factor of hierarchical clustering for retrieving web documents.

References

[1] Rajendra Kumar Roul and S. K. Sahay, "An Effective Web Document Clustering for Information Retrieval", BITS, Pilani - K.K. Birla Goa Campus, Zuarinagar, Goa - 403726, India.
[2] P. Bailey, N. Craswell and D. Hawking, "Engineering a multipurpose test collection for Web retrieval experiments", Information Processing and Management, 39, 853–871, 2003.
[3] A. Jain, M. Murty and P. Flynn, "Data clustering: A review", ACM Computing Surveys, 31(3), pp. 264–323, 1999.
[4] Hillol Kargupta, Anupam Joshi, Krishnamoorthy Sivakumar and Yelena Yesha, "Data Mining: Next Generation Challenges and Future Directions", MIT Press, USA, 2004.
[5] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, San Francisco, 2001.
[6] "On Clustering Validation Techniques", Journal of Intelligent Information Systems, 17:2/3, 107–145, Kluwer Academic Publishers, The Netherlands, 2001.
[7] Kehar Singh, Dimple Malik and Naveen Sharma, "Evolving limitations in K-means algorithm in data mining", IJCEM International Journal of Computational Engineering & Management, Vol. 12, April 2011.
[8] Laura Ferreira and David B. Hitchcock, "A Comparison of Hierarchical Methods for Clustering Functional Data", Department of Statistics, University of South Carolina, Columbia, South Carolina 29208.
[9] Navjot Kaur, Jaspreet Kaur Sahiwal and Navneet Kaur, "Efficient K-means Clustering Algorithm Using Ranking Method in Data Mining", International Journal of Advanced Research in Computer Engineering & Technology, Vol. 1, Issue 3, ISSN: 2278-1323, May 2012.
[10] Osama Abu Abbas, "Comparison between Data Clustering Algorithms", The International Arab Journal of Information Technology, Vol. 5, No. 3, July 2008.
[11] "Data Mining Algorithms In R/Clustering/Hybrid Hierarchical", en.wikibooks.org. Dendrogram obtained using a single-link agglomerative clustering algorithm; source: Jain, Murty and Flynn (1999).