Cluster Analysis for Effective Information Retrieval through Cohesive Group of Cluster Methods

Prof. S.N. Sawalkar1, Ms. Sheetal Yamde2
1Department of Computer Science and Engineering, Sipna COET
Amravati, Maharashtra, India
2Student, Computer Science and Engineering, Sipna COET Amravati, Maharashtra, India
Abstract: Locating interesting information is one of the
most important tasks in Information Retrieval (IR).
Different IR systems emphasize different query features
when determining relevance and therefore retrieve
different sets of documents. Clustering is an approach to
improving the effectiveness of IR. Documents are
clustered either before or after retrieval. The motivation of
this paper is to explain the need for clustering in retrieving
information efficiently by closely associating documents
that are relevant to the same query. An IR framework
is defined which consists of four steps: (1) IR system,
(2) similarity measure, (3) document clustering and (4) ranking
of clusters. Furthermore, we present the shortcomings of
clustering algorithms based on various facets of their
features and functionality. Finally, based on a review of the
different approaches, we conclude that although clustering
has been a topic for the scientific community for three decades,
there are still many open issues that call for more research.
Key Words: Information Retrieval, IR System, Document
Clustering, Similarity measure, Ranking
Nowadays the internet has become the largest data
repository, and it faces the problem of information overload.
This information explosion has led to a growing challenge
for information retrieval systems to efficiently and
effectively manage and retrieve information for the
average user. The purpose of information retrieval is to
store documents electronically and assist the user to
effectively navigate, trace and organize the available web
documents. The IR system accepts a query from the user
and responds with a set of documents. The system returns
both relevant and non-relevant material, and a document
organization approach is applied to assist the user in
finding the relevant information in the retrieved set.
Generally, a search engine presents the retrieved
document set as a ranked list of documents. The documents
in the list are ordered by the probability of being relevant
to the user’s request. The highest ranked document is
considered to be the most likely relevant document; the
next one is slightly less likely, and so on. Every search
engine works on the above organizational approach. The user
starts at the top of the list and follows it down,
examining the documents one at a time. A number of
alternative document retrieval approaches have been
developed over recent years. These approaches are
normally based on visualization and presentation of some
relationships among the documents and the user’s query.
One such approach, which can play an important role in
achieving this objective, is document clustering.
Document clustering has been studied in the field of
information retrieval for several decades. Willett gives an
excellent overview of existing algorithms and applications.
The increasing importance of document clustering and the
variety of its applications have led to the development of a
wide range of algorithms with different qualities. The goal
of clustering is to separate relevant documents from
non-relevant documents. To accomplish this we define a
measure of similarity between documents and design a
corresponding clustering algorithm. We can start with the
vector space model (VSM), which represents a document
as a vector over the terms that appear in the whole document
set. Each feature vector contains the weights of the terms
appearing in that document. The term weighting scheme is
usually based on the tf×idf method in IR. A collection of
documents can be represented by a term-document
matrix. The similarity between documents is measured using
one of several similarity measures that are based on
relations between feature vectors. After clustering, the
clusters are ranked by a ranking algorithm, which
orders the clusters according to their match with the
query. Section 2 of this paper summarizes related work in
this area. Section 3 analyses the problem of Information
Retrieval. Section 4 presents the proposed work and objectives. Section
5 presents the conclusion and discussion.
IR is the act of sorting, searching and retrieving
information that matches the user’s request. Until the 1950s,
IR was mostly a library science. Recently, clustering
has been used as an alternative organization of retrieved
documents, aiming to help users better understand the
retrieved documents and therefore be better able to focus
their search. Document clustering has
traditionally been investigated mainly as a means of improving
the performance of IR by pre-clustering the entire corpus.
However, clustering has also been investigated as a
post-retrieval document browsing technique. Numerous
document clustering algorithms appear in the literature,
including K-means, hierarchical agglomerative clustering,
Scatter/Gather and suffix tree clustering (STC). When
only textual information is used for clustering, STC has been
shown to outperform other algorithms, but the suffix-tree-based
method suffers from large memory requirements
and poor locality characteristics. SHOC uses a suffix array
for phrase extraction and organizes the snippets in a
hierarchy via an SVD (Singular Value Decomposition)
approach. Lingo uses SVD on a term-document matrix to
find meaningful long labels and generates a flat clustering
result. Zeng re-formalizes the clustering problem as a
salient phrase ranking problem. It uses phrases rather
than words and allows clusters to overlap. A
co-occurrence based hierarchical clustering method (CoHC) is used to
group search results into hierarchical and overlapping
clusters. CoHC is reported to outperform the other clustering algorithms.
The best-known clustering methodologies can be
divided into two groups according to the structure of the
groups created as a result of clustering: hierarchical
clustering and non-hierarchical clustering. There
are diverse algorithms associated with each methodology.
Within non-hierarchical clustering there is a single-pass
method whose results differ according to the order in
which the documents are input. Apart from this, the
hybrid algorithm does not require the number of output
clusters to be specified prior to clustering. The clusters can be
rearranged according to quality measurements, which is done
iteratively until the quality of the clusters reaches the
maximum level of satisfaction. The hybrid hierarchical
clustering algorithm (HHCA) outperforms the popularly
used hierarchical single-link clustering algorithm.
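The order dependence of the single-pass method mentioned above can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the function names, the Jaccard similarity on word sets, the 0.5 threshold and the choice of the first member as cluster representative are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient between two sets of terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def single_pass(docs, threshold=0.5):
    """Single-pass clustering: each document is seen once, in input order.

    Assumption: the first member of a cluster acts as its representative.
    """
    clusters = []
    for doc in docs:
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = jaccard(doc, cluster[0])
            if sim >= threshold and sim > best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([doc])   # no cluster is similar enough: open a new one
        else:
            best.append(doc)         # join the most similar existing cluster
    return clusters


d1 = {"cluster", "rank"}
d2 = {"rank", "query"}
d3 = {"cluster", "rank", "query"}

first_order = single_pass([d1, d2, d3])   # d1 and d2 are too dissimilar to share a cluster
second_order = single_pass([d3, d1, d2])  # d3 arrives first and attracts both others
```

With these three toy documents the first input order yields two clusters and the second yields one, which is the sensitivity to input order described in the text.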
The IR framework for representing the relevant
information consists mainly of four steps: the IR system,
similarity measures, clustering and ranking. Fig 1 gives an
overview of the IR framework. Initially the user gives a query to
the IR system to retrieve the relevant documents from the
corpus. The IR system produces a list of documents.
The document clustering algorithms attempt to group the
documents using similarity measures. The documents that
are relevant to a certain topic are allotted to a single
cluster. These clusters are then ranked, with the most
relevant cluster getting the highest rank, and are displayed
at the top of the list. This model can be loosely
categorized as based on rule-based inference and scattering
activation with a conventional database implementation.
Systems that cluster the search results of an
information retrieval system try to
improve the user experience and the user-interface competence of
the search system. A very general class of clustering is
concerned with building partitions (clusters) of datasets
on the basis of some performance index, also known as an
objective or cost function.
Today the amount of information available on the Web has
increased to the point that there is great demand for
effective systems that allow easy and flexible access to
information relevant to a specific user’s needs. The system
should be capable of managing imperfect information and
of adapting its behavior to the user context. Information
Retrieval aims at defining models and techniques that
overcome the limitations of current systems for
Information Access (mainly Information Retrieval and
Information Filtering systems). An Information Retrieval
system is illustrated in Figure 2.
4.1. Similarity Measure
Clustering exploits similarities between the documents to
be clustered. The similarity of two documents is computed
as a function of the distance between the corresponding
term vectors for these documents. Of the various
measures used to compute this distance, the cosine
measure has proved the most reliable and accurate [19].
The similarity between two or more documents is used to
rank the documents. Earlier work experimented with different
similarity measures, such as the cosine coefficient, inner
product, Dice coefficient, overlap coefficient and Jaccard
coefficient, and compared these similarity measures using a
genetic-algorithm approach to relevant information and
query optimization.
In order to cluster documents, one must first choose the
type and characteristics or attributes of the documents on
which the clustering algorithms will be based. The most
commonly used model is the Vector Space Model (VSM).
The goal of clustering is to separate the relevant
documents from the non-relevant documents. To
accomplish this, an appropriate similarity measure
must be chosen for calculating the similarity
between documents. Some widely used similarity
measures are the cosine coefficient, which gives the cosine
of the angle between the two feature vectors, the Jaccard
coefficient, the Euclidean distance, the Pearson correlation and the
Dice coefficient (all normalised).
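The vector space representation and the cosine coefficient described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `tfidf_vectors` and `cosine`, the whitespace tokenization, and the particular weighting (raw term frequency times log(N/df)) are choices made here, since the text only says that the weighting is usually based on tf×idf.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors): one tf-idf weight vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: raw term frequency times log(N / df).
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "document clustering improves retrieval",
    "clustering groups similar documents",
    "football match results",
]
vocab, vectors = tfidf_vectors(docs)
sim_related = cosine(vectors[0], vectors[1])    # the two documents share the term "clustering"
sim_unrelated = cosine(vectors[0], vectors[2])  # the two documents share no terms
```

The rows of `vectors` together form the term-document matrix in transposed form; documents that share no terms get a cosine of zero, while documents that share weighted terms get a positive value.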
Some clustering algorithms operate on a dissimilarity
matrix. Some of the most frequently used dissimilarity
measures for continuous data are the Minkowski $L_q$ distance
(for $q \geq 1$), the city-block (or Manhattan, or $L_1$)
distance and the Chebychev (or maximum, or $L_\infty$)
distance metric. In the case
of Chebychev distance the objects with the largest
dispersion will have the largest impact on the clustering. If
all objects are considered equally important, the data need
to be standardized first. For interval-scaled data, the Pearson
correlation coefficient is used; for ordinal data, the
Goodman-Kruskal gamma correlation coefficient is used.
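As a small worked illustration of the Minkowski family named above, the following sketch computes the $L_1$, $L_2$ and $L_\infty$ distances between two points. The function names are illustrative and not part of the original text.

```python
def minkowski(x, y, q):
    """Minkowski L_q distance between two equal-length vectors, for q >= 1."""
    if q < 1:
        raise ValueError("q must be at least 1")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def chebyshev(x, y):
    """Chebychev (maximum, L_infinity) distance: the largest coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = [0.0, 0.0], [3.0, 4.0]
manhattan = minkowski(x, y, 1)   # city-block, L_1
euclidean = minkowski(x, y, 2)   # L_2
maximum = chebyshev(x, y)        # L_infinity
```

For these two points the three distances are 7, 5 and 4 respectively, which shows how larger values of q give progressively more weight to the single largest coordinate difference.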
4.2. Document Clustering
Document Clustering is a technique employed for the
purpose of analyzing statistical data sets. Essentially, the
goal of clustering is to identify distinct groups within a
dataset, and then place the data within those groups,
according to their relationships with each other [6,22].
Clustering of multidimensional data is an important
procedure in many information retrieval applications. In
these applications, one or more clustering algorithms are
used to group similar items together to form clusters.
There exists a large number of data clustering algorithms,
which are classified into two main categories: hierarchical
algorithms and partitional algorithms. A hierarchical
clustering algorithm generates a cluster hierarchy, which
is called a dendrogram. A dendrogram is a tree that
records the process of clustering. Similar items are
connected by links whose level in the tree is determined
by the similarity between the two items. The hierarchical
algorithms can be further divided into the agglomerative
approach and the divisive approach. The major difference
between the two approaches is that the agglomerative approach
works in a bottom-up manner while the divisive approach
works in a top-down manner. A partitional clustering
algorithm obtains a single partition of the data instead of a
cluster hierarchy. One major advantage of this category of
clustering algorithms is their speed. The widely used
K-means method and its variants belong to this category.
The K-means algorithm selects K data points as cluster
centers and assigns each data point to the nearest one. The
reassigning process is continued until a convergence
criterion is met. The advantage of this algorithm is that it
can run with O(n) time complexity. For web page
clustering, some previous works have adapted the
K-means and agglomerative hierarchical clustering
algorithms. Further, a density-based algorithm has been adapted for
hierarchical clustering of web documents. The web
documents contain textual information as well as
hyperlinks between them. There have been efforts to use
one or both of them to make clusters. The density based
algorithm is introduced for using both content and linked
information which has the advantages of creating clusters
in various shapes and removing noisy data. In this
algorithm, at first a random data item is selected and its
neighborhood is investigated to determine whether it has
an acceptable number of data points. Most existing
hierarchical clustering algorithms are unable to undo
previous clustering operations, which prevents the
detection of structures that are not well separated. Partitional
clustering does not suffer from this problem but requires a
pre-specified number of output clusters. Hence the
advantages of hierarchical and partitional
clustering techniques are combined in a hybrid clustering
algorithm called the Hybrid Hierarchical Clustering
Algorithm (HHCA). The HHCA algorithm iteratively performs
a process of mixed cluster splitting and
merging until the quality of the clustering reaches its
maximum level of satisfaction. Most clustering
approaches can be divided into two categories:
the clustering-then-labeling approach and the
labeling-then-clustering approach.
a) Clustering-then-labeling approach: Generally, the
clustering-then-labeling approach first applies traditional
clustering algorithms to group snippets into topically coherent
clusters according to content similarity, and then
generates a label for each cluster. However, the cluster
labels are often unreadable, which makes it difficult for
users to identify relevant clusters. The Scatter/Gather
system is implemented based on a variant of the classic
K-means algorithm. The Scatter/Gather browsing paradigm
clusters documents into topically coherent groups and
presents descriptive textual summaries to the user. The
summaries consist of topical terms that characterize each
cluster and a number of typical titles that sample the
contents of the cluster. The user may select summaries,
forming a sub-collection for iterative examination. The
clustering and re-clustering is done so that different topics
are seen depending on the sub-collection clustered. The
schematic diagram of the Scatter/Gather clustering algorithm
is shown in Figure 3. Scatter/Gather may be applied to the
entire corpus, in which case static off-line computations
may be exploited to speed up dynamic online clustering. The
use of Scatter/Gather successfully conveys some of the
content and structure of the corpus. However,
Scatter/Gather is less effective than a standard similarity
search when the subjects are provided with a query.
It is possible to integrate Scatter/Gather with conventional
search technology by applying it after a search to organize
and navigate the retrieved documents, which then form
the target document collection. The topic-coherent
clusters can be used in several ways: to identify promising
subsets of documents, to be pursued with other tools or
re-clustered into more refined groups and to eliminate
groups of documents whose contents appear irrelevant.
b) Labelling-then-clustering approach: The
labelling-then-clustering approach first identifies sets of documents that
share phrases and extracts these phrases as candidate
cluster labels. Candidate cluster labels are ranked and
some of them are selected as the final cluster labels. Base
clusters are created according to these cluster labels.
Grouper adopts a phrase-analysis algorithm called Suffix
Tree Clustering (STC), in which snippets sharing the same
sequence of words are grouped together. STC
is a linear-time clustering algorithm that is
based on identifying the phrases that are common to
groups of documents. A phrase in our context is an
ordered sequence of one or more words. We define a base
cluster to be a set of documents that share a common
phrase. STC has three logical steps: (1) document cleaning
(2) identifying base clusters using a suffix tree, and (3)
combining these base clusters into clusters as shown in
Figure 4. A suffix tree is a data structure that admits
efficient string matching and querying. Suffix trees have
been studied and used extensively in fundamental string
problems such as searching large volumes of biological sequence
data, approximate string matching and text
feature extraction in spam email classification. In the suffix
tree document model, a document is considered as a string
consisting of words, not characters. In Zamir and Etzioni’s
STC algorithm, after the suffix tree is constructed, the
overlap between the different clusters is calculated and
clusters are merged if they have more than 50% overlap.
The merging method is fast but it neglects the similarity
between the non-overlapping parts. Another problem
with the merging algorithm is that it can lead to too many
clusters, hundreds or thousands, with only a small
number of documents in each, which frustrates users browsing
to locate the desired information. Further, a new cluster
merging algorithm for suffix tree clustering introduces the
well-known cosine similarity measure into the cluster
merging process.
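The two logical steps after cleaning (finding base clusters of documents that share a phrase, then merging base clusters that overlap by more than 50%) can be sketched as follows. This is a simplified illustration and not the STC implementation: a plain dictionary of word n-grams stands in for the suffix tree, so it is not linear-time, and the overlap test is assumed here to be mutual (more than half of each cluster). The function names and the `max_len` and `min_docs` parameters are hypothetical.

```python
from collections import defaultdict


def base_clusters(docs, max_len=3, min_docs=2):
    """Map each shared phrase (word n-gram) to the set of documents containing it.

    A plain dictionary of n-grams stands in for the suffix tree here.
    """
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase_docs[tuple(words[i:i + n])].add(doc_id)
    return {p: ids for p, ids in phrase_docs.items() if len(ids) >= min_docs}


def merge_clusters(doc_sets, threshold=0.5):
    """Merge base clusters whose overlap exceeds `threshold` of both clusters."""
    sets = [set(s) for s in doc_sets]
    parent = list(range(len(sets)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            common = len(sets[i] & sets[j])
            if common > threshold * len(sets[i]) and common > threshold * len(sets[j]):
                parent[find(i)] = find(j)

    merged = defaultdict(set)
    for i, s in enumerate(sets):
        merged[find(i)] |= s
    return sorted(sorted(group) for group in merged.values())


snippets = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
clusters = base_clusters(snippets)
final = merge_clusters([{0, 1, 2}, {1, 2, 3}, {7, 8}])
```

In the toy snippets the phrase "ate cheese" yields the base cluster of documents 0 and 1. In the merging example the first two sets share two of their three documents and are merged, while the third set stays separate, giving `[[0, 1, 2, 3], [7, 8]]`.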
Extracting word combinations as cluster labels is the first and critical
step. The label-extraction process is based on term
co-occurrence information, and this algorithm is called
Co-occurrence based Hierarchical Clustering (CoHC). Base
clusters are constructed according to cluster labels and
then aggregated into higher-level clusters. The term
co-occurrence based method is used to extract
cluster labels which are composed of interrupted as well
as uninterrupted sequences of words of arbitrary
length. In fact, all the relevant documents are usually
distributed over several clusters. After clustering, each
ranked list is composed of a set of clusters ordered by a
ranking algorithm.
4.3. Ranking of Clusters
A group of clusters is obtained after applying the
document clustering algorithm, each of which contains
more or less relevant documents. By ranking the clusters
we expect to determine reliable clusters and adjust the
relevance score of documents in each ranked list such that
the relevance scores become more reasonable. In order to
rank the clusters for a query, three measures are
applied: Normalized Match Ratio (NMR), Normalized
Order Ratio (NOR) and Log Odds Ratio (LOR).
NMR shows the number of terms related to the topic. It
calculates the number of terms with high relevance to a
given query and uses this to decide which cluster will be
used, after which it shows the clusters according to their rank.
The NMR measures can be defined as
Reference Term List
NOR shows the number of terms related to the topic
within a selected cluster. It shows the clusters according to the
level of relevance within the selected clusters and shows
the term list.
In this algorithm the similarity of two clusters is
decided not only by the overlap of their documents but also by the
similarity of the non-overlapping parts. Suffix-tree-based methods
suffer from large memory requirements and poor locality characteristics.
A new method for document clustering is defined to
group search results into hierarchical and overlapping
clusters [26]. In this method, extracting meaningful,
orderly, multiple
Although these methods have not performed to their best
so far, we believe they can still be improved. A
better clustering algorithm to identify more reliable
clusters and a more elaborate formula to rank the clusters
are expected to bring further improvement. Since our
approach is based on two hypotheses, we first verified
them by means of experiments. We also compared our
approach with other conventional approaches. The results
show that each of them achieves some improvement, and
that our approach compares favorably with them. We also
investigated the impact of cluster size. We found that our
approach is rather stable under variation in the size of
clusters. Although our method showed good performance
in our experiments, we believe it still can be improved
further. A better clustering algorithm for identifying more
reliable clusters and a more elaborate formula for reranking ranked lists should lead to further improvement.
These will be topics for our future work.
It is defined as
Reference Clusters
LOR is a statistical measure of how relevant a term is to a
cluster. It is the probability value that shows the similarity
among the terms and how much they are relevant to the
topic. Its value can be expressed as a probability unlike the
two prior estimated values.
LOR = (Odds of an event in one Group) / (Odds of it
Occurring in another Group)
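Read literally, a log odds ratio is the logarithm of the ratio of odds given above. As a hedged illustration, assume $p_1$ is the probability that a term occurs in the cluster under consideration and $p_2$ the probability that it occurs in another group; these symbols are introduced here and do not appear in the original.

```latex
\mathrm{LOR} = \log \frac{p_1 / (1 - p_1)}{p_2 / (1 - p_2)}
```

Under this reading the value is positive when the term is more likely in the first group, zero when the odds are equal and negative otherwise.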
By applying each of three estimated measures the cluster
that matches with the topic can be shown. These measures
can be used intuitively. The ranking algorithm is defined
as Ranking Algorithm = NMR
This technique analyses the contents of each cluster to
determine the top terms and associates them with
probability values. Suffix-tree-based methods suffer from large
memory requirements and poor locality characteristics.
Clustering on the data set is performed using the K-means
clustering method. A graphical representation of the
K-means algorithm and the output generated by the
algorithm are presented in tabular form.
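The K-means procedure described in Section 4.2 (select K centers, assign each data point to the nearest one, repeat until a convergence criterion is met) can be sketched as below. This is a minimal illustration and not the modified algorithm evaluated in this paper. Taking the first K points as initial centers and stopping when no assignment changes are assumptions made for the sake of a deterministic example.

```python
def kmeans(points, k, max_iter=100):
    """Plain K-means on a list of equal-length numeric tuples.

    Assumption: the first k points serve as the initial cluster centers.
    Returns (centers, labels).
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:  # convergence: no point changed cluster
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, labels


points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
```

Because the outcome depends on the initial centers, a different initial choice can produce a different partition, which is the inconsistency of partitional algorithms noted in the conclusion.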
The K-means algorithm is then modified and improved to give
more accurate and perceivable results, which are also
visualized graphically and displayed in tabular form. Finally, the
centroid K-means algorithm is compared with the newly
modified algorithm in a comparison analysis.
Hence, in the current scenario, the modified
algorithm is best suited for this performance and
We have mentioned that clustering is used to improve the
IR from the collection of documents. The need of
Information Retrieval mechanism can only be supported if
the document collection is organized into a meaningful
structure, which allows part or all the document collection
to be browsed at each stage of a search. This has prompted
researchers to re-examine the process of cluster based
information retrieval. In the process of IR, we can calculate
similarity coefficients between the query and the
documents and search which clusters of documents best
correspond to the query. This way of calculation is less
time consuming for searching documents with high
similarity than calculation of similarity coefficients
between the query and individual documents. Numerous
studies and anecdotal evidence hint that document
clustering can be a better way of organizing the retrieval
results. Hierarchical algorithms start with established
clusters, and then create new clusters based upon the
relationships of the data within the set. All the
relationships are analyzed in the hierarchical algorithms
which tend to be costly in terms of time and processing
power. Moreover agglomerative hierarchical clustering
does not do well because of the nature of documents, i.e.,
nearest neighbors of documents often belong to different
classes. This causes agglomerative hierarchical clustering
techniques to make mistakes that cannot be fixed by the
hierarchical scheme. Partitional algorithms have better
time complexity than hierarchical algorithms which allows
them to be used in analyzing datasets. The disadvantage of
partitional algorithm is that the initial choice of clusters is
arbitrary and does not necessarily comprise all the actual
groups that exist within a data set. Therefore, if a
particular group is missed in the initial clustering decision
the members of that group will be placed within the
clusters that are closest to them, according to the
predetermined parameters of the algorithms. Moreover,
these algorithms will yield inconsistent results. The
clusters determined this time by the algorithm probably
wouldn’t be the same as the clusters generated the next time.
After the algorithm is run on the dataset, the results are generated
and visualized. They are evaluated by execution time.
The accuracy and the time taken by the algorithm was
