International Journal of Engineering Trends and Technology (IJETT) – Volume 11 Number 8 - May 2014
A New Approach for Context Based Indexing in
IR System Using BST
Pooja Bhardwaj#1, Rajbala Tanwar*2, Renu Singla#3
M.Tech. Scholar#1, M.Tech. Scholar*2, Assistant Professor#3, Department of Computer Science and Engineering,
Shri Ram College of Engineering and Management, Palwal, Haryana, India
Abstract - Over the last two decades the size of the web
has grown rapidly, and the task of finding relevant
information has become challenging. The key issue in
achieving efficient information retrieval is the
development of a search mechanism that ensures delivery
of quality data in response to search requests. The
quality of the retrieved results is determined by two
factors: precision and recall. Higher precision means
that more of the delivered information is relevant to
the search query, whereas higher recall means that more
of the relevant information is delivered. Traditional
approaches to information retrieval employ keyword-based
techniques to look for relevant data. However,
keyword-based searching does not always retrieve quality
data, mainly because of the variety in the vocabulary
used to convey similar information. In this paper we
introduce a context-based retrieval model which uses a
Binary Search Tree (BST) to create a context-based
index. The index stores not only keywords but also the
context in which each keyword occurs in a given
document. Experimental results show that this IR system
improves precision and that the user gets quality
information for a search query.
Keywords - Information Retrieval, Indexing, Binary
Search Tree, Context Based Page Ranking.
I. INTRODUCTION
Information retrieval (IR) is concerned with
selecting, from a collection of documents, those that
are likely to be relevant to a user’s information need
expressed as a query. An information retrieval system
(IRS) carries out three basic functions: document
representation, information need representation, and
matching of these representations. Document
representation is usually called indexing. The main
objective of document indexing is to associate a
document with a descriptor, represented by a set of
features manually assigned or automatically derived
from its content. Representing the user’s information
need involves a one-step or multi-step query
ISSN: 2231-5381
formulation by means of initial terms supplied by the
user and/or additional information derived from
iterative query refinement such as relevance feedback
[1]. The main goal of document-query matching, also
called query evaluation, is to estimate the relevance
of a document to the given query. In this step most IR
models perform an approximate matching that uses the
frequency distribution of query terms over the
documents to compute a relevance score. This score is
used as the criterion for ranking the list of documents
returned to the user in response to the query.
Traditional IRS are based on the well-known “bag of
words” (BOW) representation, in which both documents
and queries are represented as bags of lexical
entities, namely keywords. A keyword may be a simple
word (as in “computer”) or a compound word (as in
“computer science”). Weights are associated with
document or query keywords [2], [3] to express their
importance in the material considered. The weighting
scheme is generally based on variations of the
well-known tf*idf formula [4]. A key characteristic of
such systems is that the degree of document-query
matching depends on the number of shared keywords. This
leads to a “lexically focused” relevance estimation,
which is less effective than a “semantically focused”
one [5]. Indeed, in such IRS, relevant documents are
not retrieved if they share no words with the query,
and irrelevant documents that have words in common with
the query are retrieved even if these words do not have
the same meaning in the document and in the query.
These problems stem mainly from the expressive richness
of natural language, in particular its inherent
synonymy and polysemy.
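As an illustration of the tf*idf weighting mentioned above, the following
minimal Java sketch assumes one common variant, weight = tf * ln(N / df),
where tf is the raw count of the term in the document, N is the number of
documents and df is the number of documents containing the term. Other
variants are discussed in [4]. The class and method names are hypothetical
and the three-document corpus is invented for the example.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative tf*idf weighting; the variant tf * ln(N / df) is an assumption. */
class TfIdfSketch {

    /** Raw number of occurrences of term in one tokenized document. */
    static int termFrequency(String term, List<String> doc) {
        int count = 0;
        for (String token : doc) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    /** Number of documents in the corpus that contain term at least once. */
    static int documentFrequency(String term, List<List<String>> corpus) {
        int count = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                count++;
            }
        }
        return count;
    }

    /** Weight of term in doc; 0 if the term occurs nowhere in the corpus. */
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        int df = documentFrequency(term, corpus);
        if (df == 0) {
            return 0.0;
        }
        double idf = Math.log((double) corpus.size() / df);
        return termFrequency(term, doc) * idf;
    }

    public static void main(String[] args) {
        // Invented toy corpus, already tokenized.
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("computer", "science", "computer"),
            Arrays.asList("mobile", "network", "science"),
            Arrays.asList("science", "program"));
        // "computer" occurs in 1 of 3 documents, so its weight is positive;
        // "science" occurs in every document, so its idf and weight are 0.
        System.out.println(tfIdf("computer", corpus.get(0), corpus));
        System.out.println(tfIdf("science", corpus.get(0), corpus));
    }
}
```

A term shared by every document contributes nothing to the score under this
variant, which is why purely lexical matching cannot separate documents that
use the same common words in different senses.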
II. PROPOSED WORK
This paper proposes a new technique for creating the
index of an information retrieval system. The proposed
index is built using a Binary Search Tree (BST). The
architecture is as follows:
http://www.ijettjournal.org
Page 401
Fig-1 Proposed architecture for Context Based Index using BST.
The components are as follows:
Corpus :- The set of documents for which the index
has to be created. These documents are normally
either crawled by a crawler or stored by the admin of
the information retrieval system.
Parser :- This component takes a document from the
Indexer and parses it into a list of words called
tokens. This token stream is returned to the Indexer.
Indexer :- This component retrieves documents one by
one from the corpus and creates a token stream for
each by using the Parser. It then stores these tokens
in the BST. It also obtains the context of the
document from the context generator and adds this
information to the BST.
Binary Search Tree (BST) :- This component finally
stores the tokens, together with the contextual
information of each token. Each node of the BST
stores the following information:
1. The keyword associated with the token after
stemming.
2. The list of contexts for which the index stores
documents. Every context also stores the list of
documents that belong to this context and contain
the keyword.
Thesaurus :- It is used to decide the context of a
document. It receives a document from the context
generator module and returns its context. This module
uses a dictionary to decide the context of the
document. The working of this component is outside
the scope of this work.
Search Interface :- This interface reads the query
from the user and returns a list of documents in
which the query word is present. It obtains the list
of documents from the Indexer module.
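The components described above can be sketched in Java as follows. This is a
minimal illustration under stated assumptions, not the authors'
implementation: the class and method names are hypothetical, the parser is
reduced to lower-casing and splitting on non-letters, stemming is omitted,
the context of each document is passed in directly instead of being obtained
from the thesaurus, and the tree is not balanced. The context name
"Entertainment" is invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** BST node: one keyword and, for each context, the documents that contain it. */
class IndexNode {
    final String keyword;
    final Map<String, Set<String>> docsByContext = new LinkedHashMap<>();
    IndexNode left;
    IndexNode right;

    IndexNode(String keyword) {
        this.keyword = keyword;
    }
}

/** Unbalanced binary search tree keyed on the keyword (hypothetical names). */
class ContextBstIndex {
    private IndexNode root;

    /** Parser and Indexer step: tokenize the text and store every token. */
    void indexDocument(String docId, String context, String text) {
        // Assumption: tokens are lower-cased runs of letters; stemming is omitted.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                insert(token).docsByContext
                    .computeIfAbsent(context, c -> new LinkedHashSet<>())
                    .add(docId);
            }
        }
    }

    /** Returns the node for keyword, creating it at its BST position if absent. */
    private IndexNode insert(String keyword) {
        if (root == null) {
            root = new IndexNode(keyword);
        }
        IndexNode node = root;
        int cmp;
        while ((cmp = keyword.compareTo(node.keyword)) != 0) {
            if (cmp < 0) {
                if (node.left == null) node.left = new IndexNode(keyword);
                node = node.left;
            } else {
                if (node.right == null) node.right = new IndexNode(keyword);
                node = node.right;
            }
        }
        return node;
    }

    /** Standard BST lookup; returns null when the keyword is not indexed. */
    private IndexNode find(String keyword) {
        IndexNode node = root;
        while (node != null) {
            int cmp = keyword.compareTo(node.keyword);
            if (cmp == 0) return node;
            node = cmp < 0 ? node.left : node.right;
        }
        return null;
    }

    /** Search without context: documents from every context. */
    List<String> search(String keyword) {
        IndexNode node = find(keyword.toLowerCase(Locale.ROOT));
        Set<String> result = new LinkedHashSet<>();
        if (node != null) {
            for (Set<String> docs : node.docsByContext.values()) {
                result.addAll(docs);
            }
        }
        return new ArrayList<>(result);
    }

    /** Context-based search: only documents of the requested context. */
    List<String> search(String keyword, String context) {
        IndexNode node = find(keyword.toLowerCase(Locale.ROOT));
        if (node == null || !node.docsByContext.containsKey(context)) {
            return Collections.emptyList();
        }
        return new ArrayList<>(node.docsByContext.get(context));
    }

    public static void main(String[] args) {
        ContextBstIndex index = new ContextBstIndex();
        index.indexDocument("d1", "Telecommunication", "Mobile network coverage");
        index.indexDocument("d2", "Entertainment", "Mobile games");
        System.out.println(index.search("mobile"));                      // [d1, d2]
        System.out.println(index.search("mobile", "Telecommunication")); // [d1]
    }
}
```

Keeping the per-context document lists inside the keyword node means that a
context-restricted query costs one tree lookup plus one map lookup, with no
filtering of the full posting list afterwards.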
III. IMPLEMENTATION
The architecture proposed in the previous section has
been implemented in the Java programming language
using the NetBeans IDE. A sample corpus of 11
documents in 4 contexts was used. Figure-2 shows the
retrieval of documents for the sample query “program”
when contextual information is not used. Five
documents were retrieved by the system.
Query : program
Number of Documents Retrieved = 5
Figure-2 Snapshot of IR system without context
Figure-3 Snapshot of Context Based IR system
Figure-3 shows a snapshot of the output for the
query “mobile”. Here contextual information is used.
The eleven documents have been clustered into four
contexts. To issue a query, the user specifies the
query and also the name of the context of interest.
The returned list contains the documents in the
selected context in which the query word occurs.
Query : mobile
Context : Telecommunication
No of Documents Retrieved : 2
IV. RESULTS AND ANALYSIS
Precision was compared for context-based searching and
normal searching. Precision is the fraction of the
retrieved documents that are relevant to the user's
information need.
Figure 4 Precision Based Comparison Between Context Based IR and Normal IR
(bar chart of Relevant Documents and Retrieved Documents for each system; vertical axis from 0 to 6).
Precision = |{Relevant Documents} ∩ {Retrieved Documents}| / |{Retrieved Documents}|
Without context:
Relevant Documents = 2
Retrieved Documents = 5
|{Relevant Documents} ∩ {Retrieved Documents}| = 2
Precision = 2/5 = 0.4
With context:
Relevant Documents = 2
Retrieved Documents = 2
|{Relevant Documents} ∩ {Retrieved Documents}| = 2
Precision = 2/2 = 1.0
In this experiment every document retrieved by the
context-based system was relevant to the user, compared
with 40% of the documents retrieved without context.
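The precision values above can be reproduced with a short Java sketch. The
helper name and the document identifiers are hypothetical; only the set
sizes (2 relevant, 5 retrieved without context, 2 retrieved with context)
come from the experiment. Returning 0 for an empty result list is an
assumption of this sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Precision = |relevant AND retrieved| / |retrieved| (hypothetical helper). */
class PrecisionSketch {

    static double precision(Set<String> relevant, Set<String> retrieved) {
        if (retrieved.isEmpty()) {
            return 0.0; // assumption: an empty result list is reported as 0
        }
        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant); // keep only retrieved documents that are relevant
        return (double) hits.size() / retrieved.size();
    }

    public static void main(String[] args) {
        // Document ids are invented; only the set sizes come from the text.
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d2"));
        Set<String> normal = new HashSet<>(Arrays.asList("d1", "d2", "d3", "d4", "d5"));
        Set<String> contextBased = new HashSet<>(Arrays.asList("d1", "d2"));
        System.out.println(precision(relevant, normal));       // 0.4
        System.out.println(precision(relevant, contextBased)); // 1.0
    }
}
```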
V. CONCLUSION AND FUTURE WORK
In this paper a new context-based approach to
indexing in IR systems using a BST has been proposed.
The system introduces the contextual information of
documents. The user has to specify the query and the
context of the query. Documents in the given context
are searched and returned to the user, so all
documents returned belong to the context in which the
user is interested for the query. The experiment
indicates that the precision of the proposed system is
higher than that of retrieval without context. Future
work is needed to apply the system to a corpus with a
large number of documents and a large number of
contexts.
REFERENCES
[1] Rocchio, J. J. (1971). Relevance feedback in
information retrieval. In G. Salton, editor, The SMART
Retrieval System: Experiments in Automatic Document
Processing. Prentice-Hall, Englewood Cliffs, NJ,
pp. 313–323.
[2] G. Bordogna and G. Pasi, “A fuzzy linguistic approach
generalizing Boolean information retrieval: a model and
its evaluation,” Journal of the American Society for
Information Science, 44(2), 70-82, 1993.
[3] D. A. Buell and D. H. Kraft, “A model for a weighted
retrieval system,” Journal of the American Society
for Information Science, 32(3), 211-216, 1981.
[4] Salton, G. and Buckley, C. (1988). “Term-weighting
approaches in automatic text retrieval”. Information
Processing and Management 24(5): 513–523.
[5] Mauldin, M., Carbonell, J. and Thomason, R. (1987).
Beyond the keyword barrier: knowledge-based
information retrieval. Information Services and Use
7(4-5): 103-117.
[6] G. Miller (1995). WordNet: A lexical database for
English. Communications of the ACM 38, pp. 39-41.
[7] Luisa Bentivogli, Pamela Forner, Bernardo Magnini
and Emanuele Pianta. “Revising WordNet Domains
Hierarchy: Semantics, Coverage, and Balancing”, in
COLING 2004 Workshop on “Multilingual Linguistic
Resources”, Geneva, Switzerland, August 28, 2004,
pp. 101-108.