International Journal of Research In Science & Engineering, Volume: 1, Issue: 6, November 2015, e-ISSN: 2394-8299, p-ISSN: 2394-8280

REVIEW OF CLASSIFICATION/CLUSTERING TECHNIQUES USED IN WEB DATA MINING

Nutan Borkar 1, Shrikant Kulkarni 2
1 M.Tech Student, Department of Information Technology, Walchand College of Engineering, Sangli, MS, India, nutan.borkar@walchandsangli.ac.in
2 Professor (PG), Department of Information Technology, Walchand College of Engineering, Sangli, MS, India, shrikant.kulkarni@walchandsangli.ac.in

ABSTRACT

Web content mining is the mining, extraction and integration of useful data, information and knowledge from Web page content. Because much of the ever-expanding information on the WWW consists of heterogeneous, loosely structured hyper-texted documents, automated discovery, organization and search using indexing tools remain hard to perfect. Search engines provide some comfort to users, but they generally neither provide structural information nor categorize, filter or interpret documents. Classification and clustering algorithms are the backbone of web search. This paper reviews prominent clustering techniques used in web data mining in terms of their requirements, algorithmic approaches and similarity measures, and evaluates their performance in terms of accuracy, implementation complexity, robustness and scalability.

Keywords: Web-Search Results, Web Data Mining, Web Classification and Data Clustering

----------------------------------------------------------------------------------------------------------------------------

1. INTRODUCTION

Classification is a data mining technique used to predict group membership for data instances. Classification consists of assigning a class label to a set of unclassified cases, which can be done with a priori knowledge of the labels, groups, categories or the set of possible classes (supervised classification). In unsupervised classification, the set of possible classes is not known a priori; after classification, a name must be assigned to each class. Unsupervised classification is also known as clustering and is used to place data elements into related groups without advance knowledge of the group definitions.

1.1 How Does Classification Work?

The data classification process includes two steps.

Building the classifier or model: This is the learning step (or learning phase), in which a classification algorithm builds the classifier from a training set made up of term/keyword tuples and their associated class labels. Each tuple in the training set is assumed to belong to a predefined class; tuples are also referred to as samples, objects or data points. This process is shown in figure 1.

Using the classifier for classification: Test data are used to estimate the accuracy of the classification rules; if the accuracy is considered acceptable, the rules can be applied to new data tuples. This process is shown in figure 2.

This paper focuses on the formalization of classification and clustering techniques suitable for web data mining and IR methods, which can be suitably extended to presenting the results from web search engines. One peculiarity of web data mining using search engines is that accurate document clustering is not implicit, and it therefore demands post-clustering in order to improve the data classification performance. A minimal sketch of the two-step classification process is given below.
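To make the two-step process concrete, the sketch below builds a classifier from a small labelled training set and then uses it on test data to estimate accuracy. It is a minimal illustration only: it assumes the scikit-learn library is available, and the tiny corpus, class labels and choice of a naive Bayes model are hypothetical stand-ins, not anything prescribed in this paper.

```python
# Minimal sketch of the two-step classification process described above.
# Assumes scikit-learn is installed; the tiny corpus and labels are
# hypothetical examples, not data from the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Step 1: build the classifier from a labelled training set.
train_docs = ["cheap flights hotel deals", "football scores league table",
              "stock market shares trading", "match highlights goal replay"]
train_labels = ["travel", "sport", "finance", "sport"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)      # term/keyword tuples
model = MultinomialNB().fit(X_train, train_labels)  # learning step

# Step 2: use the classifier; estimate accuracy on held-out test data.
test_docs = ["league results tonight", "buy shares online"]
test_labels = ["sport", "finance"]
X_test = vectorizer.transform(test_docs)
predictions = model.predict(X_test)
print(predictions, accuracy_score(test_labels, predictions))
```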
The post-classification/clustering is done on the results provided by the search engine (which are themselves the results of pre-clustering the entire corpus, mapping user queries onto the relevant web information associated by the web crawlers).

Fig-1: Process of building a classifier

Fig-2: Using the classifier for classification

2. WEB DATA CLASSIFICATION

2.1 Specific Requirements in Web Classification

Some of the key requirements for web data classification/clustering of search engine results are:

Coherent Clustering: The clustering algorithm should group similar documents together.

Hierarchical Partitioning: The user needs to determine at a glance whether the contents of a cluster are of interest; therefore, the system has to provide concise and accurate cluster labels.

Speed: The clustering system should not introduce a substantial delay before displaying the results.

Four main classes of coefficients are needed in clustering: distance coefficients, association coefficients, probabilistic coefficients and correlation coefficients.

2.2 Similarity Measures

A similarity measure quantifies the relationships existing among data objects. In web document/page classification the objects are hierarchical in nature and can be viewed as compositions of simpler constituents, including keywords, phrases, attributes, links, text and other types of objects such as images and videos. The hierarchy of composition is quite rich: the attributes and web data objects contained in search results can be organized into higher-order structures such as matrices, trees and lattices. In measuring similarity at textual granularity, common IR approaches can be applied to the text: words deemed irrelevant are eliminated (e.g., via a stop list and punctuation removal), and words sharing a common stem are replaced by the stem word; the resulting terms form the basis for similarity comparisons.

Distance Based Measures: Closeness or similarity can be measured as a distance between two web objects if the data sets are represented by numerical terms such as word (or keyword or phrase) counts/frequencies. Distance coefficients, such as the Euclidean distance, are used very extensively in cluster analysis owing to their simple geometric interpretation. The commonly used distance measures are shown in Table-1 below.

Table-1: Distance Based Measures

Linkage Criteria: The linkage criterion can also be considered a measure of relation for classification; it evaluates the distance between sets of observations as a function of the pairwise distances between observations. Linkage criteria between two sets of observations A and B can be evaluated by the formulas in Table-2.

Table-2: Linkage Criteria

External Quality Measures: The external quality measures (shown in Table-3) use an (external) manual classification of the documents. They include the entropy (which measures how the manually tagged classes are distributed within each cluster), the purity (which measures how much a cluster is specialized in a class by dividing its largest class count by its size), and the F-measure, which combines the precision and recall rates into an overall performance measure. A sketch of these measures is given after Table-3.

Table-3: External Quality Measures
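The external quality measures of Table-3 can be stated compactly in code. The sketch below is a minimal, pure-Python illustration; the example cluster and class labels are hypothetical.

```python
# Minimal sketch of the external quality measures in Table-3: entropy,
# purity and F-measure computed against a manual classification.
# The example labels are hypothetical; pure Python, no dependencies.
from collections import Counter
from math import log2

def purity(cluster_labels):
    """Fraction of the cluster taken up by its dominant manual class."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels)

def entropy(cluster_labels):
    """How the manually tagged classes are distributed within the cluster."""
    counts = Counter(cluster_labels)
    n = len(cluster_labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def f_measure(cluster_labels, class_label, n_class_total):
    """Combine precision and recall of a cluster w.r.t. one manual class."""
    hits = sum(1 for label in cluster_labels if label == class_label)
    if hits == 0:
        return 0.0
    precision = hits / len(cluster_labels)
    recall = hits / n_class_total
    return 2 * precision * recall / (precision + recall)

# Manual classes of the five documents in one cluster:
cluster = ["sport", "sport", "sport", "travel", "sport"]
print(purity(cluster))                 # 0.8
print(entropy(cluster))                # ~0.722
print(f_measure(cluster, "sport", 6))  # assumes 6 'sport' docs overall
```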
Other Measures: Other linkage criteria include:
- The sum of all intra-cluster variance.
- The decrease in variance for the cluster being merged (Ward's criterion).
- The probability that candidate clusters spawn from the same distribution function (V-linkage).
- The product of in-degree and out-degree on a k-nearest-neighbour graph (graph degree linkage).
- The increment of some cluster descriptor (i.e., a quantity defined for measuring the quality of a cluster) after merging two clusters.

Association coefficients have also been very widely used for document clustering. The simplest association coefficient is c, the number of terms common to a pair of documents having a and b terms, respectively. Probabilistic coefficients can also be used to form clusters in which the documents have a maximal probability of being jointly co-relevant to a query. Correlation coefficients, however, are rarely used for document clustering; they have potential applicability in associating sets of documents with user queries for search engine result clustering.

2.3 Generic Approaches for Data Clustering

Different algorithms have been proposed for clustering web search results, and most are extensions of the classical hierarchical and partitioning clustering approaches. In bottom-up (agglomerative) approaches, given the term frequencies in the documents, the algorithm finds clusters by initially assigning each document to its own cluster and then repeatedly merging pairs of clusters until a certain stopping criterion is met. The end result can be graphically represented as a tree called a dendrogram, which shows the clusters that have been merged together and the distance between these merged clusters (the length of the branches is proportional to the distance between the merged clusters). Top-down partitioning algorithms, by contrast, find clusters by partitioning the set of documents into either a predetermined or an automatically derived number of clusters: the collection is initially partitioned into clusters whose quality is repeatedly optimized until a stable solution based on a criterion function is found.

Hierarchical clustering (HC) is a method of cluster analysis that seeks to build a hierarchy of clusters. Hierarchical clustering produces clusters of better quality, but its main drawback is quadratic time complexity. For large document collections, the linear time complexity of partitioning techniques has made them more popular, especially in IR systems where clustering is employed for efficiency reasons. A sketch of the agglomerative procedure is given below.
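The agglomerative procedure of Section 2.3 can be sketched in a few lines. The version below is illustrative only: it assumes single linkage on small 2-D vectors standing in for document term vectors, and it stops at a requested number of clusters rather than a more refined stopping criterion.

```python
# Minimal sketch of agglomerative clustering: start with one cluster per
# document vector and repeatedly merge the closest pair until the
# requested number of clusters remains. Single linkage and the toy 2-D
# vectors are illustrative choices, not prescribed by the paper.
from math import dist

def single_linkage(a, b):
    """Distance between clusters = smallest pairwise member distance."""
    return min(dist(p, q) for p in a for q in b)

def agglomerate(points, target_clusters):
    clusters = [[p] for p in points]        # each document in its own cluster
    while len(clusters) > target_clusters:  # stopping criterion
        # find the closest pair of clusters ...
        pairs = [(i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: single_linkage(clusters[ij[0]],
                                                        clusters[ij[1]]))
        # ... and merge them (one level of the dendrogram)
        clusters[i] += clusters.pop(j)
    return clusters

docs = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0), (9.0, 0.2)]
print(agglomerate(docs, 3))
```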
3. APPROACHES SUITABLE FOR WEB DATA CLUSTERING

The difference between the techniques applied for generic data mining and for web mining stems from the underlying structure of, and relationships within, the data. Generic data mining mainly involves large, static documents accessible through catalogues or indexed lists, and the clustering is efficient because it is mostly done under human supervision. The web, on the other hand, contains a rich and dynamic collection of hyperlink information and web pages (small in size) that are mostly accessed through search engines, where pre-clustering is done by the search engine using page-ranking algorithms. Therefore, web data clustering as post-clustering (with or without a priori class references) is needed for specific web mining requirements; it involves the analysis of web server logs, which contain the entire collection of requests made by a potential or current customer through their browser and the responses of the web server. Such classification is used in applications like web site usability, path-to-purchase analysis, dynamic content marketing, user profiling through behaviour analysis, and product affinities. The algorithms proposed for clustering web search results are mostly extensions of the classical hierarchical and partitioning clustering approaches.

3.1 Suffix Tree Clustering

The suffix tree document model and the Suffix Tree Clustering (STC) algorithm form a linear-time clustering algorithm (linear in the size of the document set) based on identifying phrases that are common to groups of documents. A phrase is an ordered sequence of one or more words [1]; a suffix tree stores all possible suffixes of a given string so that many important string operations run more efficiently. The STC algorithm does not treat a document as a collection of words but as a string of words, and it exploits the proximity information between words. STC uses the suffix tree structure to efficiently identify sets of documents that share common phrases and terms, and uses this information to create clusters and to concisely present their contents to the users. STC mainly includes four logical steps: first, document "cleaning"; second, constructing a generalized suffix tree; third, identifying base clusters; and last, combining these base clusters into clusters [2]. For web data clustering, the specific steps involve removal of HTML tags and word stemming. Stemming is a process of linguistic normalization in which the variant forms of a word are reduced to a common form. Stemming algorithms can be classified into three groups: truncating methods, statistical methods and mixed methods; each group has a typical way of finding the stems of word variants. Some stemming algorithms are presented in figure 3.

Fig-3: Classification of stemming algorithms

3.2 Hierarchical Bayesian Clustering

Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical classifiers: they can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. The posterior probability P(H|X) in Bayesian classification is given by the ratio P(X|H)P(H) / P(X), where P(H) is the prior probability, X is a data tuple and H is some hypothesis. A Hierarchical Bayesian Clustering (HBC) algorithm is one that constructs the set of clusters having the maximum Bayesian posterior probability, i.e. the probability that the given texts are classified into clusters. A minimal sketch of this posterior computation is given below.
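The posterior computation underlying Bayesian classification can be sketched directly from the formula above. The class priors and per-class likelihoods below are hypothetical numbers chosen for illustration.

```python
# Minimal sketch of the posterior probability used in Bayesian
# classification: P(H|X) = P(X|H) * P(H) / P(X). The class priors and
# per-class likelihoods below are hypothetical illustration values.

def posterior(likelihoods, priors):
    """Return P(H|X) for each hypothesis H, given P(X|H) and P(H)."""
    # P(X) = sum over hypotheses of P(X|H) * P(H)  (total probability)
    evidence = sum(likelihoods[h] * priors[h] for h in priors)
    return {h: likelihoods[h] * priors[h] / evidence for h in priors}

priors      = {"sport": 0.5, "travel": 0.3, "finance": 0.2}    # P(H)
likelihoods = {"sport": 0.02, "travel": 0.10, "finance": 0.01} # P(X|H)

print(posterior(likelihoods, priors))
# The tuple X is assigned to the class with the highest posterior.
```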
HBC has advantages such as the possibility of reconstructing the original clusters more accurately than non-probabilistic algorithms; moreover, when probabilistic text categorization is extended to a cluster-based one, HBC offers better performance than non-probabilistic algorithms [3][4]. Bayesian clustering assumes that web pages follow one of several behaviour/evolution types (clusters), each of which can correspond to a different dominant web page. HBC has been found useful for the automatic classification of web documents into pre-specified categories or taxonomies, increasing the precision of web search.

3.3 Density-Based Clustering

Density-based clustering algorithms locate clusters by constructing a density function that reflects the spatial distribution of the data points. The density-based notion of a cluster is a set of density-connected points that is maximal with respect to density-reachability; in other words, the density of points inside each cluster is considerably higher than outside it. This technique is useful when clusters form dense regions of objects in the data space that are separated by regions of low density (representing noise). DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that grows regions of sufficiently high density into clusters and discovers clusters of arbitrary shape in spatial databases with noise. It defines a cluster as a maximal set of density-connected points. Given a set of web objects D, DBSCAN searches for clusters by checking the ε-neighbourhood (defined by radius ε) of each point in D. If the ε-neighbourhood of a point p in D contains at least MinPts points, a new cluster with p as a core object is created. DBSCAN then iteratively collects directly density-reachable objects from these core objects, merging a few density-reachable clusters along the way. The process terminates when no new point can be added to any cluster [5].

3.4 Model-Based Method

Model-based clustering is a framework that combines cluster analysis with probabilistic techniques. The objects under consideration are characterized by a finite mixture of probability distributions, where each component distribution represents a cluster; each cluster has a data-generating model with its own parameters. The main task in this approach is that the classifier must learn the parameters of each cluster. For clustering, the objects are assigned to clusters using a hard assignment policy [6]. The expectation-maximization (EM) algorithm is usually used to learn the set of parameters for each cluster. The EM algorithm is an iterative procedure that finds maximum-likelihood estimates of the parameter vector by repeating the following steps until convergence:

The expectation (E) step: Given a set of parameter estimates, the E-step calculates the conditional expectation of the complete-data log likelihood given the observed data and the parameter estimates.

The maximization (M) step: Given a complete-data log likelihood, the M-step finds the parameter estimates that maximize the complete-data log likelihood from the E-step.

The complexity of the EM algorithm depends on the complexity of the E- and M-steps. A minimal sketch of the iteration is given below.
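The E- and M-steps can be illustrated with a minimal EM loop. The sketch below fits a 1-D mixture of two Gaussians in pure Python; the data and the Gaussian-mixture choice are our own illustrative assumptions, not the model used in any particular web-mining system.

```python
# Minimal sketch of the EM iteration described above, fitting a 1-D
# mixture of two Gaussians. Pure Python; data and model are illustrative.
from math import exp, pi, sqrt

def gauss(x, mu, var):
    """Density of a 1-D normal distribution with mean mu and variance var."""
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
mu, var, weight = [0.0, 4.0], [1.0, 1.0], [0.5, 0.5]  # initial estimates

for _ in range(30):                        # iterate until convergence
    # E-step: expected (soft) cluster membership for every point
    resp = []
    for x in data:
        p = [weight[k] * gauss(x, mu[k], var[k]) for k in (0, 1)]
        s = sum(p)
        resp.append([pk / s for pk in p])
    # M-step: parameter estimates that maximize the expected likelihood
    for k in (0, 1):
        nk = sum(r[k] for r in resp)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
        weight[k] = nk / len(data)

print(mu, var, weight)  # means converge near the two groups (~1.0 and ~5.07)
```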
In model-based schemes, the number of clusters is, in practice, estimated using probabilistic techniques. In web data mining, model-based approaches try to solve clustering problems by building models that describe the browsing behaviour of users on the web [5]. The modelling algorithm employed should be able to generate insight into how users use the web, as well as provide mechanisms for making predictions for a variety of applications such as web prefetching and the personalization of web content. Therefore, model-based schemes are usually favoured for clustering web users' sessions.

3.5 Constraint-Based Method

Constraint-based clustering is used for high-dimensional spaces, where the clustering process requires user preferences and constraints as inputs. The constraints usually include the expected number of clusters, the minimal or maximal cluster size, weights for different objects or dimensions, and other desirable characteristics of the resulting clusters [5][8]. Depending on the nature of the constraints, the clustering may adopt the following approaches:

Constraints on individual objects: The user can specify constraints on the objects to be clustered (e.g., selecting a subset of objects); after such preprocessing, the problem reduces to an instance of unconstrained clustering.

Constraints on the selection of clustering parameters: The user sets a desired range for each clustering parameter appropriate to the given post-clustering algorithm.

Constraints on distance or similarity functions: The user can specify different distance or similarity functions for specific attributes of the objects to be clustered, or different distance measures for specific pairs of objects.

Constraint-based clustering is used in web mining tasks where large numbers of web objects are in the data set and classification flexibility is desirable. Thus, a user can impose constraints on the clustering to be found, such as must-link and cannot-link constraints, as in the sketch below.
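Must-link and cannot-link constraints can be enforced with a simple feasibility check during cluster assignment, in the spirit of constraint-based methods such as COP-KMeans. The function and example constraints below are illustrative; a complete algorithm would call this check inside its assignment step.

```python
# Minimal sketch of must-link / cannot-link constraint checking for
# constraint-based clustering. The function and constraints are
# illustrative, not a full algorithm.

def violates(obj, cluster_id, assignment, must_link, cannot_link):
    """True if putting obj into cluster_id breaks any user constraint."""
    for a, b in must_link:                  # linked pairs must share a cluster
        other = b if a == obj else a if b == obj else None
        if other in assignment and assignment[other] != cluster_id:
            return True
    for a, b in cannot_link:                # these pairs may never share one
        other = b if a == obj else a if b == obj else None
        if other in assignment and assignment[other] == cluster_id:
            return True
    return False

assignment = {"doc1": 0, "doc2": 1}
print(violates("doc3", 0, assignment,
               must_link=[("doc3", "doc2")], cannot_link=[]))  # True
print(violates("doc3", 1, assignment,
               must_link=[], cannot_link=[("doc3", "doc2")]))  # True
```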
4. COMPARISON OF CLUSTERING TECHNIQUES

The criteria for comparing the methods are as follows:

Accuracy: The ability of the classifier to predict the class label correctly, i.e. how well a given predictor can guess the value of the predicted attribute for new data.

Speed: The computational cost of generating and using the classifier or predictor.

Robustness: The ability of the classifier or predictor to make correct predictions from noisy data.

Scalability: The ability to construct the classifier or predictor efficiently, given a large amount of data.

Interpretability: The extent to which the classifier or predictor can be understood.

A comparison of the methods discussed in section 3 on the basis of the above criteria is given below.

Suffix Tree Clustering
- Accuracy: Dependent on the number of clusters and the number of web objects to be classified; it decreases if the data is partitioned into a large number of clusters.
- Complexity: STC algorithms are linear-time for constant-sized labels, with a worst-case running time of O(n log n) in general.
- Robustness: Quite robust in "noisy" situations.
- Scalability: Scalable for large datasets.

Hierarchical Bayesian Clustering
- Accuracy: 85% to 93% accuracy has been observed in most cases.
- Complexity: Limited by a growth in time complexity that is at least quadratic in the number of elements; memory usage is proportional to the square of the initial number of clusters.
- Robustness: Robust in "noisy" situations.
- Scalability: Semi-scalable for large datasets.

Density-Based Clustering
- Accuracy: DBSCAN assumes clusters of similar density and may have problems separating nearby clusters.
- Complexity: O(n^2) in general; with DBSCAN it is O(n log n), where n is the number of database objects.
- Robustness: Less robust in "noisy" situations.
- Scalability: Scalable for large datasets.

Model-Based Method
- Accuracy: Depends on the variance between clusters being maximal and the variance inside clusters being minimal.
- Complexity: Complexity analysis is not available in the references.
- Robustness: Less robust in "noisy" situations.
- Scalability: Semi-scalable for large datasets.

Constraint-Based Method
- Accuracy: Improves if pre-processing using micro-clusters (groups of points that are close together) is adopted.
- Complexity: Complexity analysis is not available in the references.
- Robustness: ---
- Scalability: Semi-scalable for large datasets.

Table-4: Detailed comparison of Web Classifiers

5. CONCLUSION

In this paper we compared prominent clustering algorithms for suitability in web data mining and evaluated their performance with reference to the available literature and the analytic figures/metrics available. We found that Suffix Tree Clustering can be easily adopted for cases where no a priori knowledge of classes is available; furthermore, by combinatorial techniques conjoining multiple class parameters, the tree can be manipulated for specific domains. Hierarchical Bayesian Clustering and density-based clustering require reference classes to cluster effectively; their efficiency can be improved if the clusters are pre-processed in advance. Model-based clustering and constraint-based clustering require expert assistance with initial taxonomies for specific domains so that clustering leads to meaningful relationships between web documents. Such techniques can play a prominent role in domain-oriented search engines.

ACKNOWLEDGEMENT

We acknowledge the support and encouragement of Dr. Mrs. S. P. Sonavane, HOD, and Prof. A. J. Umbarkar, Information Technology Department, WCE, Sangli, for this paper.

REFERENCES

[1] O. Zamir and O. Etzioni, "A Dynamic Clustering Interface to Web Search Results," Computer Networks, vol. 31(11-16), pp. 1361-1374, 1999.
[2] M. Ilic, P. Spalevic and M. Veinovic, "Suffix Tree Clustering - Data Mining Algorithm," Twenty-Third International Electrotechnical and Computer Science Conference ERK'2014, Portorož, pp. 15-18, September 2014.
[3] K. A. Heller and Z. Ghahramani, "Bayesian Hierarchical Clustering," Proceedings of the 22nd International Conference on Machine Learning, pp. 297-304, 2005.
[4] R. E. Ruviaro Christ, E. Talavera and C. Maciel, "Gaussian Hierarchical Bayesian Clustering Algorithm," ISDA 2007, pp. 133-13, 2007.
[5] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann, 2nd Edition, 2006.
[6] X. Li, G. Yu and D. Wang, "Mmpclust: A Skew Prevention Algorithm for Model-based Document Clustering," Database Systems for Advanced Applications, Springer, pp. 536-547, 2005.
[7] A. Vakali and G. Pallis, "Web Data Management Practices: Emerging Techniques and Technologies," Idea Group Publishing, ISBN 1-599004-228-2.
[8] A. K. H. Tung, R. T. Ng, L. V. S. Lakshmanan and J. Han, "Constraint-based Clustering in Large Databases," ICDT 2001, pp. 405-419, 2001.