International Journal of Latest Research in Science and Technology

advertisement
International Journal of Latest Research in Science and Technology
Vol.1,Issue 2 :Page No.141-144 ,July-August(2012)
http://www.mnkjournals.com/ijlrst.htm
ISSN (Online):2278-5299
IMPLEMENTATION OF MULTILEVEL
INDEXING IN SEARCH ENGINES
Sneh kalra
Assistant Professor @Modern Vidya Niketan University,Palwal.
snehchhabra@gmail.com
Abstract—This paper proposes creation of clusters at multilevel.i.e. creating the indexes at the multiple levels so as to reduce the search
time by directing the search to a specific path from higher level indexes to the lower level indexes. The multilevel indexes are created
after the hierarchical clusters of the documents have been formed on the basis of document similarity. At each level of documents
clusters hierarchy, a separate index is created and the search for a document proceeds from the higher level indexes to the lowest level
index which is the document level index and is stored as inverted file. Thus the paper implements the proposed hierarchical algorithm
at different levels which optimizes the search process by directing the search to a specific path from higher levels of clustering to the
lower levels i.e. from super clusters to mega clusters, then to clusters and finally to the individual documents so that the user gets the
best possible matching results in minimum possible time. The paper further presents the graphs showing clusters created at different
levels.
Keywords— Multilevel indexes, hierarchical clustering, document similarity, inverted files, posting list.
I. INTRODUCTION
Search Engines are used to find the data in response to the
user queries. The indexing phase of these search engines can
be viewed as a web content mining process. Starting from a
collection of unstructured documents, the indexer extracts a
large amount of information like the list of documents, which
contain a given term. It also keeps account of number of all
the occurrences of each term within every document. This
information is maintained in a global index, which is usually
represented using an inverted file (IF). The index consists of
an array of the posting lists where each posting list is
associated with a term and contains the term as well as the
identifiers of the documents containing the term. For
example, consider the posting list ((job; 5) 1, 4, 14, 20, 27)
indicating that the term job appears in five documents having
the document identifiers 1,4,14,20,27 respectively. The above
posting list can be written as ((job; 5) 1, 3, 10, 6, 7) where the
items of the list represent the difference between the
successive document identifiers. The following figure shows
the structure of index
Term
No of
documents
in which
term appears
Doc id of
documents in
which terms
appears
Doc ids
stored with
different
coding
scheme
Fig. 1 Structure of an Index in search engines
ISSN 2278-5299
Clustering is a widely adopted technique aimed at dividing a
collection of data into disjoint groups of homogenous
elements. A cluster is therefore a collection of objects, which
are similar between them and are dissimilar to the objects
belonging to other clusters. Clustering algorithms attempt to
group together the documents based on their similarities.
Thus documents relating to a certain topic will hopefully be
placed in a single cluster. So if the documents are clustered,
comparisons of the documents against the user’s query are
only needed with certain clusters and not with the whole
collection of documents.
II. RELATED WORK
In this paper, a review of previous work on index
organization is given. In this field of index organization and
maintenance, many algorithms and techniques have already
been proposed but they seem to be less efficient in efficiently
accessing the index.
In [2] the authors introduce a novel pipelining technique for
structuring the index–building system that substantially
reduces the index constructing time. They demonstrate that
for large collections, the pipelining technique can speed up
index construction by several hours. They propose and
compare different schemes for storing and managing inverted
files using an embedded database system. They show that an
intelligent scheme for packing inverted lists in the storage
structures of the database can provide performance and
storage efficiency comparable to tailored inverted file
implementations
Another proposed work was the threshold based clustering
algorithm [8] in which the number of clusters is unknown.
However, two documents are classified to the same cluster if
141
Kalra,,International Journal of Latest Research in Science and Technology
the similarity between them is below a specified threshold.
This threshold is defined by the user before the algorithm
starts. It is easy to see that if the threshold is small; all the
elements will get assigned to different clusters. If the
threshold is large, the elements may get assigned to just one
cluster. Thus the algorithm is sensitive to specification of
threshold.
Another proposed work was the K means clustering
algorithm, which initially chooses k documents as cluster
representatives and then assigns the remaining n-k documents
to one of these clusters on the basis of similarity between the
documents. New centroids for the k clusters are recomputed
and documents are reassigned according to their similarity
with the k new centroids. This process repeats until the
position of the centroids become stable. Computing new
centroids is expensive for large values of n and the number of
iterations required to converge may be large.
The given paper implements the multilevel index structure in
which the clustering of data is done at multiple levels so as to
direct the search to a specific path from high level index to
low level index.
III.
However the indexes at the higher level consist of only the
document terms with no extra information stored in them.
B. How to Create Clusters
Let D={D1, D2,……Dn) be the collection of N textual
documents being crawled to which consecutive integers
document identifiers 1…n are assigned. N1 is the number of
clusters which would be created. This algorithm will make
less or equal to the number of cluster i.e. less or equal to
N1.first it will find out the similarity between the documents
and create the similarity matrices. After that it will take 2
clusters and make one cluster on the basis of similarity.
Similarly all clusters are merged to make cluster. Now it will
cheek for the termination condition. If it s not a termination
condition then it will calculate again similarity matrices.
Again take 2 clusters and make one cluster until it reach
termination condition.
B. Example Illustrating the Formation of Hierarchy of The
WebDocuments
Super
cluster level
PROPOSED WORK
The indexing phase of the search engine collects information
from the web documents gathered in the web repository. This
global large sized index is stored in the form of inverted files.
The proposed work explains the concept of multilevel
indexes which are created at multiple levels of hierarchy of
documents clustering. At first level of document hierarchy,
the similar documents are clustered into clusters on the basis
of documents similarity. Then at the next level, the similar
clusters are clustered into mega clusters and at the last level
of hierarchy, the similar mega clusters are clubbed together to
form super clusters.
Mega
cluster
level
No. of docs in which Doc id of docs in
term appears
which term appears
Indexing
50
12,34,45,49…
Search engine
59
15,20,34,55…
Pipeline
15
3,6,9,12…
Fig. 3 Structure
of which
Higher Level
Indexes
This is a descriptive
index
is stored
in the form
of
inverted files. The index consists of an array of posting
lists as in case of global large size index. The index at the
mega cluster level consists of the terms that come after the
intersection of the terms in the clusters that appear in that
particular mega cluster. This index is of small size and
contains only the similar terms of the clusters within the
same mega cluster. The index does not contain any
additional information. The highest level index is formed
last at the super cluster level which consists of the terms
that are obtained after the intersection of the terms in the
similar mega cluster within that super cluster and the
search starts with this index and proceeds downwards to
the lowest level index.
IV.
IMPLEMENTATION WORK
Crawling in
search engines
Indexing in
search engines
Cluster
level
Parallel
crawling
in search
engines
Distributed
crawling
in search
engines
Global
indexing
in search
engines
Pipelined
indexing
in search
engines
Documents
A. Structure of Multilevel Indexes
The structure of the index at the lowest level is in the form of
posting lists as shown in figure:
Term
Search engine
Fig. 4. Example Illustrating Hierarchy of Clusters
The Fig 4 shows the hierarchy of the documents created.
The index formation star
means algorithm is applied for creating the clusters. Clusters
will be created at first level. For creating clusters at second
level same procedure is applied again and then finally
hierarchical clustering is done for indexing. Finding
Document
similarity
Parsing
Documents
Parsed data
Matrix
created
After applying algorithm
for hierarchical clustering
Clusters
created
Fig. 5 Work Flow of Implementation
For indexing the documents, firstly we have to parse the
documents. After that similarity matrix is created and then k
ISSN 2278-5299
142
Kalra,,International Journal of Latest Research in Science and Technology
V.
A. Snapshots of Implemented Work
1) Given input & parsed data
The following snapshot represents the parsed data. Which is
the initial step for indexing the data?
RESULTS
A. Graph Representing the Created Clusters at First Level
At first level, less number of clusters are created. As at first
level the similarity between two documents is less.
10
h, 9
c, 9
a
9
b
n u m b e r o f c l u s te r s
8
g, 7
7
g, 7
c, 6
6
a,c,7 7
c
b, 6
d
f, 5
a, 5
e
5
f
4
g
3
h
2
i
1
j
0
a
b
c
d
e
f
g
h
i
j
10 documents
Fig. 8 Clusters Created at First Level
B. Graph Representing the Created Clusters at Second Level
At second level, more number of clusters are created in
comparison to first level. As the similarity between two
documents is more
120
g, 98
Fig. 6 Given & Parsed Data
number of clusters
100
2) Clustered data
The data is now clustered according to the similarity of
words. The following fig shows the clusters created that are
created after matching the similarity of documents with each
other.
80
60
f, 89
d, 76 a, 76
g, 76
f, 69
e,f, 69
67 c, 65
e, 63
h,i, 62
61
d, 54 a,c,5255 a, 54
e, 54
h, 43
40
c, 87
h, 87
i, 43
d, 34
g, 34 e, 32
i, 98
a
e, 87
b
i, 76
a, 76
c, 76
f, 65
g, 65
c
d
e
47
h,i, 45
c, 33
f
g
h, 31
h
g, 21
i
20
j
0
a
b
c
d
e
f
10 documents
g
h
i
j
Fig. 9 Clusters Created at Second Level
VI.
CONCLUSION
The given paper implements a multilevel indexing structure
for a search engine index so as to speed up the search process
by directing the search to a specific path from the higher level
index to the lower level indexes. The implemented
architecture and the structure of the indexes are the answer to
one of the biggest problems for search optimization in search
engines. The implementation work aims at reducing the
search space and search time by maintaining the index at
various levels.
REFRENCES
1
Fig 7 Clustered Data
ISSN 2278-5299
Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garci
Molina. Building a Distributed Full–Text Index for the
Web. In World Wide Web, pages 396–406,2001.
143
Kalra,,International Journal of Latest Research in Science and Technology
2
Fabrizio Silvestri, Raffaele Perego and Salvatore Orlando. Assigning
Document Identifiers to Enhance Compressibility of Web Search
Engines Indexes. In the proceedings of SAC, 2004.
3
S.Brin
and
L.Page.The
Anatomy
of
a
Large-Scale
Hypertextual Web Search Engine. In the 7th International
WWW
Conference,
(WWW7),
pp.
107-117,
Brisbane,
Australia.
Van Rijsbergen C.J. Information Retrieval. Butterworth 1979.
[5] Oren Zamir and Oren Etzioni. Web Document Clustering: A
feasibility
demonstration.
In
the
proceedings
of
SIGIR, 1998.
Soumen Chakrabarti. Mining the Web – Discovering
Knowledge
from
Hypertext
Data.
Morgan
Kaufmann
Publications, 2003.
4
5
6
A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall,
1988.
7
P. Elias. Universal codeword sets and representation of the integers.
IEEE Trans. on Information Theory, 21(2):194–203, Feb 1975.
8
Sanjiv K. Bhatia. Adaptive K-Means
Association for Artificial Intelligence, 2004.
9
Krishna Kummarmuru, Rohit Lotlikar, Shourya Roy. Hierarchical
Monothetic Document Clustering Algorithm for Summarization and
Browsing Search Results. WWW 2004, New York.
10
Bhatia, S.K. and Deougan , J.S. 1998. Conceptual Clustering in
Information Retrieval. IEEE Transactions on Systems, Man and
Cybernetics.
11
Dan Bladford and Guy Blelloch. Index compression through
document reordering. In IEEE, editor, Proc. Of DCC’02. IEEE, 2002.
12
Parul Gupta, Dr. A.K. Sharma, Hierarchical Clustering based Indexing
in Search Engines, communicated to International Journal of
Information and Communication Technology.
13
A Framework for Hierarchical Clustering Based Indexing in Search
Engines Parul Gupta , A.K. Sharma
14
Kilgour, Frederick G.”The Evolution of the Book search engine “.
New York: Oxford University Press, 1998; Katz, Bill. Cuneiform to
Computer: A History of Reference Sources. Lanham, MD: Scarecrow
Press, 1998.
Weinberg, Bella Hass. "Indexes and Religion: Reflections on Research
in the History of Indexes". The Indexer, 21 (3): 111-118 (1999).
15
Clustering.
“Viewing morphology as an inference process”. In Proceedings of the
16th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval., pp. 191-202, 1993.
17
Korfhage, R., 1997. Information Storage and Retrieval.
Cooper. W.S., 1969. “Is inter indexer consistency a hobgoblin?”
American Documentation 20. 3: 268-278
18
C. C. Aggarwal, F. Al-Garawi, and P. S. Yu. “Intelligent crawling on
the World Wide Web with arbitrary predicates”. In WWW10, Hong
Kong, May 2001.
19
S. Chakrabarti. Mining the Web. Morgan Kaufmann, 2003.
21
Michael R. Anderberg. Cluster analysis for applications. Academic
Press, 1973.
22
G. Adami, P. Avesani, D. Sona, Clustering Documents in a Web
Directory, in Proceedings of the 5th International Workshop on Web
Information and Data Management (ACM WIDM 03), September
2003.
Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. Efficient
crawling through url ordering. In 7th World Wide Web Conference
(WWW7), Apr 1998
ISSN 2278-5299
Advances in Web Mining and Web Usage Analysis 2005 - revised
papers from 7 th workshop on Knowledge Discovery on the Web, Olfa
Nasraoui, Osmar Zaiane, Myra Spiliopoulou, Bamshad Mobasher,
Philip Yu, Brij Masand, Eds., Springer Lecture Notes in Artificial
Intelligence, LNAI 4198, 2006
25
Web Mining and Web Usage Analysis 2004 - revised papers from 6 th
workshop on Knowledge Discovery on the Web, Bamshad Mobasher,
Olfa Nasraoui, Bing Liu, Brij Masand, Eds., Springer Lecture Notes in
Artificial Intelligence, 2006
American
16
23
24
144
Download