International Journal of Research In Science & Engineering
Volume: 1 Issue: 6 November 2015
e-ISSN: 2394-8299
p-ISSN: 2394-8280
REVIEW OF CLASSIFICATION/CLUSTERING TECHNIQUES USED IN
WEB DATA MINING
Nutan Borkar1, Shrikant Kulkarni2
1 M.Tech Student, Department of Information Technology, Walchand College of Engineering, Sangli, MS, India, nutan.borkar@walchandsangli.ac.in
2 Dr. Shrikant V. Kulkarni, Professor (PG), Department of Information Technology, Walchand College of Engineering, Sangli, MS, India, shrikant.kulkarni@walchandsangli.ac.in
ABSTRACT
Web content mining is the mining, extraction and integration of useful data, information and knowledge from Web page content. Because of the heterogeneity and the lack of structure that permeates much of the ever-expanding information sources on the WWW, published as hyper-texted documents, automated discovery, organization and search using indexing tools are difficult to perfect. A few search engines provide some comfort to users, but they do not generally provide structural information, nor do they categorize, filter or interpret documents. Classification and clustering algorithms are the backbone of web search. This paper reviews prominent clustering techniques used in web data mining in terms of requirements, algorithmic approaches and similarity measures, and evaluates their performance in terms of accuracy, implementation complexity, robustness and scalability.
Keywords: Web-Search Results, Web Data Mining, Web Classification and Data Clustering
1. INTRODUCTION
Classification is a data mining technique used to predict group membership for data instances. Classification consists of assigning a class label to a set of unclassified cases, which can be done with a priori knowledge of the labels, groups, categories or the set of possible classes (supervised classification). In unsupervised classification, the set of possible classes is not known a priori; after classification, a name must be assigned to each resulting class. Unsupervised classification is also known as clustering and is used to place data elements into related groups without advance knowledge of the group definitions.
1.1 How Does Classification Work?
The data classification process includes two steps:
• Building the classifier or model: this is the learning step (or learning phase), in which a classification algorithm builds the classifier from the training set, made up of term/keyword tuples and their associated class labels. Each tuple in the training set belongs to a predefined category or class; the tuples are also referred to as samples, objects or data points. This process is shown in Fig-1.
• Using the classifier for classification: here test data are used to estimate the accuracy of the classification rules; if the accuracy is considered acceptable, the rules are applied to new data tuples. This process is shown in Fig-2 and sketched in code below.
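As an illustration only, the following sketch walks through both steps using scikit-learn (assumed available); the documents, labels and category names are hypothetical.

```python
# Hedged sketch of the two-step process: build a classifier from labelled
# training tuples, then estimate its accuracy on held-out test data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

train_docs = ["football match score", "election results poll", "cricket world cup"]
train_labels = ["sports", "news", "sports"]          # hypothetical class labels
test_docs = ["poll results announced"]
test_labels = ["news"]

vectorizer = CountVectorizer()
classifier = MultinomialNB()
classifier.fit(vectorizer.fit_transform(train_docs), train_labels)   # step 1: learn
predicted = classifier.predict(vectorizer.transform(test_docs))      # step 2: classify
print(accuracy_score(test_labels, predicted))
```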
This paper focuses on the formalization of classification and clustering techniques suitable for web data mining, and on IR methods that can be suitably extended to present the results from web search engines. One peculiarity of web data mining using search engines is that accurate document clustering is not implicit, so post-clustering is needed in order to improve data classification performance. The post-classification/clustering is done on the results provided by the search engine (which are themselves the results of pre-clustering the entire corpus and mapping user queries onto the relevant web information gathered by the web crawlers).
Fig-1: Process of building a classifier
Fig-2: Using the classifier for classification
2. WEB DATA CLASSIFICATION
2.1 Specific Requirements in Web Classification
Some of the key requirements for web data classification/clustering of search engine results are:
• Coherent Clustering: The clustering algorithm should group similar documents together.
• Hierarchical Partitioning: The user needs to determine at a glance whether the contents of a cluster are of interest; the system therefore has to provide concise and accurate cluster labels.
• Speed: The clustering system should not introduce a substantial delay before displaying the results.
Four main classes of coefficients are needed in clustering: distance coefficients, association coefficients, probabilistic coefficients and correlation coefficients.
2.2 Similarity Measures
A similarity measure quantifies the relationships existing among data objects. In web document/page classification, the objects are hierarchical in nature and can be viewed as compositions of simpler constituents, including keywords, phrases, attributes, links, text and other types of objects such as images and videos. The hierarchy of composition is quite rich: the attributes and web data objects contained in search results can be organized into higher-order structures such as matrices, trees and lattices. In measuring similarity at textual granularity, common IR approaches can be applied to the text: words deemed irrelevant (e.g. stop words and punctuation) are eliminated, and words that share a common stem are replaced by the stem word, which forms the basis for similarity comparisons.
Distance Based Measures: Closeness or similarity can be measured as the distance between two web objects if the data sets are represented by numerical terms such as word (or keyword or phrase) counts/frequencies. Distance coefficients, such as the Euclidean distance, are used very extensively in cluster analysis owing to their simple geometric interpretation. The commonly used distance measures are shown in Table-1 below.
Table-1: Distance Based Measures
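As an illustration of the distance coefficients such a table typically lists, the sketch below computes Euclidean, Manhattan and cosine distances on simple term-frequency vectors; the vectors themselves are hypothetical.

```python
# Minimal sketch: distance coefficients on term-frequency vectors.
import math

def euclidean(x, y):
    """Euclidean distance between two term-frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Manhattan (city-block) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    """1 - cosine similarity, often preferred for sparse document vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm if norm else 1.0

doc_a = [3, 0, 1, 2]   # hypothetical keyword counts for document A
doc_b = [1, 1, 0, 2]   # hypothetical keyword counts for document B
print(euclidean(doc_a, doc_b), manhattan(doc_a, doc_b), cosine_distance(doc_a, doc_b))
```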
Linkage Criteria: The linkage criterion can also be considered a measure of relation for classification; it evaluates the distance between sets of observations as a function of the pairwise distances between observations. The linkage criteria between two sets of observations A and B can be evaluated by the formulas in Table-2.
Table-2: Linkage Criteria
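A minimal sketch of three common linkage criteria (single, complete and average linkage) between two clusters A and B follows, assuming d is any pairwise distance such as those in Table-1.

```python
# Sketch of common linkage criteria between two clusters A and B,
# where d(a, b) is a pairwise distance function.
def single_linkage(A, B, d):
    """Minimum pairwise distance between the two sets."""
    return min(d(a, b) for a in A for b in B)

def complete_linkage(A, B, d):
    """Maximum pairwise distance between the two sets."""
    return max(d(a, b) for a in A for b in B)

def average_linkage(A, B, d):
    """Mean of all pairwise distances between the two sets."""
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))
```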
External Quality Measures: The external quality measures (shown in Table-3) use an (external) manual classification of the documents and include the entropy (which measures how the manually tagged classes are distributed within each cluster), the purity (which measures how specialized a cluster is in a single class, obtained by dividing the size of its largest class by the cluster size), and the F-measure, which combines the precision and recall rates into an overall performance measure.
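Under the assumption that each cluster is represented by the list of manually assigned class labels of its documents, the sketch below illustrates how purity, entropy and the F-measure can be computed; the labels are hypothetical.

```python
# Sketch of the external quality measures described above.
from collections import Counter
from math import log2

def purity(labels):
    """Size of the largest class in the cluster divided by the cluster size."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

def entropy(labels):
    """How the manually tagged classes are distributed within the cluster."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

clusters = {0: ["sports", "sports", "news"], 1: ["news", "news", "news", "tech"]}
for cid, labels in clusters.items():
    print(cid, round(purity(labels), 2), round(entropy(labels), 2))
print(round(f_measure(0.75, 0.60), 2))
```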
Other Measures: Other linkage criteria include
• The sum of all intra-cluster variance.
• The decrease in variance for the cluster being merged (Ward's criterion).
• The probability that candidate clusters spawn from the same distribution function (V-linkage).
• The product of in-degree and out-degree on a k-nearest-neighbour graph (graph degree linkage).
• The increment of some cluster descriptor (i.e., a quantity defined for measuring the quality of a cluster) after merging two clusters.
Table-3: External Quality Measures
Association coefficients have also been very widely used for document clustering. The simplest association coefficient, c, is the number of terms common to a pair of documents having a and b terms, respectively.
Probabilistic coefficients can also be used in forming clusters in which the documents have a maximal probability of being jointly co-relevant to a query. Correlation coefficients, however, are rarely used for document clustering, although they have the potential to find applicability in associating sets of documents with user queries for search-engine result clustering.
2.3 Generic Approaches for Data Clustering
Different algorithms have been proposed for clustering web search results; most are extensions of the classical hierarchical and partitioning clustering approaches.
In agglomerative (bottom-up) approaches, given the term frequencies in the documents, the algorithm finds clusters by initially assigning each document to its own cluster and then repeatedly merging pairs of clusters until a certain stopping criterion is met. The end result can be graphically represented as a tree called a dendrogram. The dendrogram shows which clusters have been merged together and the distance between the merged clusters (the length of the branches is proportional to the distance between the merged clusters).
Partitioning algorithms, by contrast, find clusters by partitioning the set of documents into either a predetermined or an automatically derived number of clusters. The collection is initially partitioned into clusters whose quality is repeatedly optimized until a stable solution, based on a criterion function, is found.
Hierarchical clustering (HC) is a method of cluster analysis that seeks to build a hierarchy of clusters. Hierarchical clustering produces clusters of better quality, but its main drawback is its quadratic time complexity. For large document collections, the linear time complexity of partitioning techniques has made them more popular, especially in IR systems where clustering is employed for efficiency reasons.
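To illustrate the contrast between the two families, the sketch below runs average-linkage agglomerative clustering and flat k-means partitioning on the same toy document vectors, assuming SciPy and scikit-learn are available; the vectors are hypothetical.

```python
# Hedged sketch: agglomerative clustering (cut into two clusters via the
# dendrogram) versus flat partitioning with k-means.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])  # toy document vectors

# Agglomerative: repeatedly merge the closest pair of clusters, then cut the tree.
Z = linkage(X, method="average")
print(fcluster(Z, t=2, criterion="maxclust"))

# Partitioning: directly split the collection into a predetermined number of clusters.
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))
```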
3. APPROACHES SUITABLE FOR WEB DATA CLUSTERING
The difference between the techniques applied for generic data mining and those for web mining is due to the underlying structure of, and relationships within, the data. Generic data mining mainly involves large, static documents accessible through catalogues or indexed lists, and the clustering is efficient because it is mostly done under human supervision.
The web, on the other hand, contains a rich and dynamic collection of hyperlink information and web pages (small in size), mostly accessed through search engines, where pre-clustering is done by the search engine using page-ranking algorithms. Web data clustering as post-clustering (with or without a priori class references) is therefore needed for specific web mining requirements and involves the analysis of the Web server logs of a Web site, which contain the entire collection of requests made by a potential or current customer through their browser and the responses sent by the Web server. Such classification is used in applications such as web site usability, path to purchase, dynamic content marketing, user profiling through behaviour analysis, and product affinities.
The following subsections review algorithms proposed for clustering web search results, most of which extend the classical hierarchical and partitioning clustering approaches.
3.1 Suffix Tree Clustering
Suffix Tree Clustering (STC), built on the suffix tree document model, is a linear-time clustering algorithm (linear in the size of the document set) based on identifying phrases that are common to groups of documents. A phrase is an ordered sequence of one or more words [1]; a suffix tree stores all possible suffixes of a given string so that many important string operations can be run more efficiently. The STC algorithm does not treat a document as a collection of words but as a string of words, and it operates on the proximity information between words. STC uses the suffix tree structure to efficiently identify sets of documents that share common phrases and terms, and uses this information to create clusters and to concisely present their contents to the users. STC mainly includes four logical steps: first, document "cleaning"; second, constructing a generalized suffix tree; third, identifying base clusters; and finally, combining these base clusters into clusters [2].
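As a simplified illustration of the base-cluster step, the sketch below indexes shared phrases with a plain dictionary instead of a generalized suffix tree; the documents are hypothetical, and a real STC implementation would additionally score and merge the base clusters.

```python
# Simplified sketch of STC base clusters: documents sharing a common phrase
# are grouped together.
from collections import defaultdict

def phrases(words, max_len=3):
    """All contiguous word sequences up to max_len words (suffix-tree phrases)."""
    for i in range(len(words)):
        for j in range(i + 1, min(i + 1 + max_len, len(words) + 1)):
            yield tuple(words[i:j])

def base_clusters(docs, min_docs=2):
    """Map each shared phrase to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for p in phrases(text.lower().split()):
            index[p].add(doc_id)
    return {p: ids for p, ids in index.items() if len(ids) >= min_docs}

docs = {1: "web data mining review", 2: "web data clustering", 3: "suffix tree clustering"}
print(base_clusters(docs))
```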
For web data clustering, the specific pre-processing steps involve removal of HTML tags and word stemming using a stemming algorithm: a process of linguistic normalization in which the variant forms of a word are reduced to a common form. Stemming algorithms can be classified into three groups: truncating methods, statistical methods and mixed methods. Each of these groups has a typical way of finding the stems of word variants. Some stemming algorithms are presented in Fig-3.
Fig-3: Classification of stemming algorithms
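As an illustration of the truncating family shown in Fig-3, the toy stemmer below strips common English suffixes; it is not the Porter algorithm, only a sketch of reducing word variants to a common form.

```python
# Toy suffix-stripping stemmer (truncating family); for illustration only.
SUFFIXES = ("ingly", "edly", "ing", "ies", "es", "ed", "ly", "s")

def simple_stem(word):
    """Strip the longest matching suffix, keeping a stem of at least three characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([simple_stem(w) for w in ["clustering", "clustered", "clusters", "cluster"]])
# all four variants reduce to the common stem "cluster"
```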
3.2 Hierarchical Bayesian Clustering
Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical classifiers: they can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. The posterior probability P(H|X) in Bayesian classification is given by the ratio P(X|H)P(H) / P(X), where P(H) is the prior probability, X is a data tuple and H is some hypothesis.
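A minimal numeric sketch of this ratio follows; the probability values are hypothetical.

```python
# Posterior P(H|X) = P(X|H) * P(H) / P(X) with hypothetical probabilities.
def posterior(likelihood, prior, evidence):
    """P(H|X) from P(X|H), P(H) and P(X)."""
    return likelihood * prior / evidence

p_x_given_h = 0.8   # P(X|H): probability of observing tuple X under hypothesis H
p_h = 0.3           # P(H): prior probability of the hypothesis/class
p_x = 0.5           # P(X): overall probability of observing the tuple
print(posterior(p_x_given_h, p_h, p_x))   # 0.48
```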
A Hierarchical Bayesian Clustering (HBC) algorithm is one that constructs the set of clusters having the maximum Bayesian posterior probability, i.e. the probability that the given texts are classified into those clusters. HBC has the advantage of being able to reconstruct the original clusters more accurately than non-probabilistic algorithms, and when probabilistic text categorization is extended to a cluster-based one, HBC offers better performance than non-probabilistic algorithms [3][4].
Bayesian clustering assumes that web pages follow one of several behaviour/evolution types (clusters), each of which can correspond to a different dominant web page. HBC has been found useful in the automatic classification of Web documents into pre-specified categories or taxonomies for increasing the precision of Web search.
3.3 Density-based clustering
Density-based clustering algorithms locate clusters by constructing a density function that reflects the
spatial distribution of the data points. The density-based notion of a cluster is defined as a set of density-connected
points that is maximal with respect to density-reachability. In other words, the density of points inside each cluster is
considerably higher than outside of the cluster. This technique is useful when clusters fall in dense regions of objects
in the data space that are separated by regions of low density (representing noise).
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that grows regions with sufficiently high density into clusters and discovers clusters of arbitrary shape in spatial databases with noise. It defines a cluster as a maximal set of density-connected points.
Given a set of web objects D, DBSCAN searches for clusters by checking the ε-neighbourhood (defined by radius ε) of each point p ∈ D. If the ε-neighbourhood of p contains at least MinPts points, a new cluster with p as a core object is created. DBSCAN then iteratively collects directly density-reachable objects from these core objects, merging density-reachable clusters where appropriate. The process terminates when no new point can be added to any cluster [5].
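A hedged usage sketch with scikit-learn's DBSCAN (assumed available) follows; eps plays the role of the ε radius, min_samples plays the role of MinPts, and the feature vectors are hypothetical.

```python
# DBSCAN over a small set of numeric web-object features.
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D feature vectors (e.g. reduced TF-IDF scores) for web objects.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.1], [8.2, 7.9],
              [4.5, 0.0]])              # an isolated point, expected to be noise

labels = DBSCAN(eps=0.6, min_samples=2).fit_predict(X)
print(labels)   # points labelled -1 are treated as noise
```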
3.4 Model-Based Method
Model-based clustering is a framework that combines cluster analysis with probabilistic techniques. The objects under consideration are characterized by a finite mixture of probability distributions in which each component distribution represents a cluster; each cluster has a data-generating model with its own parameters. The main aspect of this approach is that the classifier must learn the parameters of each cluster. For clustering, the objects are assigned to clusters using a hard assignment policy [6].
The expectation-maximization (EM) algorithm is usually used to learn the set of parameters for each cluster. The EM algorithm is an iterative procedure that finds the maximum-likelihood estimates of the parameter vector by repeating the following steps until convergence:
• The expectation (E) step: given a set of parameter estimates, the E-step calculates the conditional expectation of the complete-data log-likelihood given the observed data and the parameter estimates.
• The maximization (M) step: given the complete-data log-likelihood from the E-step, the M-step finds the parameter estimates that maximize it.
The complexity of the EM algorithm depends on the complexity of the E- and M-steps. In model-based schemes, the number of clusters is estimated in practice using probabilistic techniques.
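As a sketch of EM-based model-based clustering, the example below fits a Gaussian mixture with scikit-learn (assumed available); the session features are hypothetical, and each mixture component stands for one cluster.

```python
# EM runs inside GaussianMixture.fit(); each component models one cluster.
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical user-session features (pages viewed, seconds on site).
X = np.array([[2, 30], [3, 35], [2, 28],
              [15, 300], [14, 280], [16, 310]], dtype=float)

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.predict(X))    # hard cluster assignment per session
print(gm.means_)        # learned per-cluster parameters
```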
In web data mining, the model-based approaches try to solve clustering problems by building models that
describe the browsing behaviour of users on the Web [5]. The modelling algorithm employed should be able to
generate insight into how the users use the web as well as provide mechanisms for making predictions for a variety
of applications like Web prefetching, the personalization of Web content, etc. Therefore, the model-based schemes
are usually favoured for clustering Web users’ sessions.
3.5 Constraint-based Method
Constraint-based clustering is used in cases of high-dimensional spaces where the clustering process requires user preferences and constraints as inputs. The constraints usually include the expected number of clusters, the
minimal or maximal cluster size, weights for different objects or dimensions, and other desirable characteristics of
the resulting clusters[5][8].
Depending on the nature of the constraints, the clustering may adopt the following approaches:
• Constraints on individual objects, where the user specifies constraints on the objects to be clustered; after pre-processing, this reduces the problem to an instance of unconstrained clustering.
• Constraints on the selection of clustering parameters, where the user sets a desired range for each clustering parameter specific to the given post-clustering algorithm.
• Constraints on distance or similarity functions, where the user can specify different distance or similarity functions for specific attributes of the objects to be clustered, or different distance measures for specific pairs of objects.
Constraint-based clustering is used in web mining tasks where the data set contains large numbers of web objects and classification flexibility is desirable; a user can impose constraints on the clustering to be found, such as must-link and cannot-link constraints.
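The sketch below only checks must-link and cannot-link constraints against a candidate clustering; a constrained algorithm such as COP-KMeans would enforce them during assignment. The object names and constraints are hypothetical.

```python
# Count how many user-supplied constraints a cluster assignment violates.
def violations(assignment, must_link, cannot_link):
    """assignment maps each object to its cluster id."""
    bad = 0
    for a, b in must_link:
        if assignment[a] != assignment[b]:      # must share a cluster
            bad += 1
    for a, b in cannot_link:
        if assignment[a] == assignment[b]:      # must not share a cluster
            bad += 1
    return bad

assignment = {"page1": 0, "page2": 0, "page3": 1}   # hypothetical clustering
must_link = [("page1", "page2")]
cannot_link = [("page1", "page3")]
print(violations(assignment, must_link, cannot_link))   # 0: both constraints satisfied
```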
4. COMPARISON OF CLUSTERING TECHNIQUES
The criteria for comparing the methods are as follows:
• Accuracy: the ability of the classifier to predict the class label correctly, and of a predictor to guess the value of the predicted attribute for new data.
• Speed: the computational cost of generating and using the classifier or predictor.
• Robustness: the ability of the classifier or predictor to make correct predictions from noisy data.
• Scalability: the ability to construct the classifier or predictor efficiently given a large amount of data.
• Interpretability: the extent to which the classifier or predictor can be understood.
A comparison of the typical methods discussed in Section 3, on the basis of the above criteria, is given in Table-4 below.

Table-4: Detailed comparison of Web Classifiers

Suffix Tree Clustering
• Accuracy: Depends on the number of clusters and the number of web objects to be classified; it decreases when the data are partitioned into a large number of clusters.
• Complexity: STC algorithms are linear-time for a constant-size alphabet, with a worst-case running time of O(n log n) in general.
• Robustness: Seen to be quite robust in "noisy" situations.
• Scalability: Scalable for large datasets.

Hierarchical Bayesian Clustering
• Accuracy: 85% to 93% accuracy has been observed in most cases.
• Complexity: Limited by a growth in time complexity that is at least quadratic in the number of elements; memory usage is proportional to the square of the initial number of clusters.
• Robustness: Robust in "noisy" situations.
• Scalability: Semi-scalable for large datasets.

Density-based Clustering
• Accuracy: DBSCAN assumes clusters of similar density and may have problems separating nearby clusters.
• Complexity: O(n²) in general; with DBSCAN it is O(n log n), where n is the number of database objects.
• Robustness: Less robust in "noisy" situations.
• Scalability: Scalable for large datasets.

Model-Based Method
• Accuracy: Depends on maximizing the variance between clusters while minimizing the variance within clusters.
• Complexity: Complexity analysis is not available in the references.
• Robustness: Less robust in "noisy" situations.
• Scalability: Semi-scalable for large datasets.

Constraint-Based Method
• Accuracy: Improves if pre-processing using micro-clusters (groups of points that are close together) is adopted.
• Complexity: Complexity analysis is not available in the references.
• Robustness: ---
• Scalability: Semi-scalable for large datasets.
5. CONCLUSION
In this paper we compared prominent clustering algorithms for suitability in web data mining and evaluated their performance with reference to the available literature and analytic figures/metrics. We found that Suffix Tree Clustering can easily be adopted in cases where no a priori knowledge of classes is available, and, by combinatorial techniques conjoining multiple class parameters, the tree can be manipulated for specific domains. Hierarchical Bayesian Clustering and density-based clustering require reference classes to cluster effectively; their efficiency can be improved if the clusters are pre-processed in advance. Model-based clustering and constraint-based clustering require expert assistance in defining initial taxonomies for specific domains so that the clustering leads to meaningful relationships between web documents. Such techniques can play a prominent role in domain-oriented search engines.
ACKNOWLEDGEMENT
We acknowledge the support and encouragement of Dr. Mrs. S. P. Sonavane, HOD, and Prof. A. J. Umbarkar, Information Technology Department, WCE, Sangli, for this paper.
REFERENCES
[1] O. Zamir and O. Etzioni, "A Dynamic Clustering Interface to Web Search Results," Computer Networks, vol. 31(11-16), pp. 1361-1374, 1999.
[2] M. Ilic, P. Spalevic and M. Veinovic, "Suffix Tree Clustering – Data Mining Algorithm," Twenty-Third International Electrotechnical and Computer Science Conference ERK'2014, Portorož, pp. 15-18, September 2014.
[3] K. A. Heller and Z. Ghahramani, "Bayesian Hierarchical Clustering," Proceedings of the 22nd International Conference on Machine Learning, pp. 297-304, 2005.
[4] R. E. Ruviaro Christ, E. Talavera and C. Maciel, "Gaussian Hierarchical Bayesian Clustering Algorithm," ISDA 2007, pp. 133-13, 2007.
[5] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann, 2006.
[6] X. Li, G. Yu and D. Wang, "Mmpclust: A Skew Prevention Algorithm for Model-Based Document Clustering," Database Systems for Advanced Applications, Springer, pp. 536-547, 2005.
[7] A. Vakali and G. Pallis, Web Data Management Practices: Emerging Techniques and Technologies, Idea Group Publishing, ISBN 1-599004-228-2.
[8] A. K. H. Tung, R. T. Ng, L. V. S. Lakshmanan and J. Han, "Constraint-Based Clustering in Large Databases," ICDT 2001, pp. 405-419, 2001.