International Journal of Engineering Trends and Technology (IJETT) – Volume 11 Number 8 - May 2014

A New Approach for Context Based Indexing in IR System Using BST

Pooja Bhardwaj#1, Rajbala Tanwar*2, Renu Singla#3
M.Tech. Scholar#1, M.Tech. Scholar*2, Assistant Professor#3, Department of Computer Science and Engineering, Shri Ram College of Engineering and Management, Palwal, Haryana, India

Abstract - Over the last two decades the size of the web has grown rapidly, and the task of finding relevant information has become challenging. The key issue in achieving efficient information retrieval is the development of a search mechanism that ensures delivery of quality data in response to search requests. The quality of the retrieved results is determined by two factors: precision and recall. Increased precision means that more of the delivered information is relevant to the search query, whereas increased recall means that more of the relevant information is delivered. Traditional approaches in information retrieval employ keyword-based techniques to look for relevant data. However, keyword-based searching does not always retrieve quality data, mainly because of the variety in the vocabulary used to convey similar information. In this paper we introduce a context-based retrieval model which uses a Binary Search Tree (BST) to create a context-based index. The index stores not only keywords but also the context in which each keyword occurs in a given document. Experimental results show that this IR system improves precision and that the user gets quality information for a search query.

Keywords - information retrieval, Indexing, Binary Search Tree, Context Based Page Ranking.

I. INTRODUCTION

Information retrieval (IR) is concerned with selecting, from a collection of documents, those that are likely to be relevant to a user’s information need expressed as a query.
Three basic functions are carried out in an information retrieval system (IRS): document representation, information need representation, and matching of these representations. Document representation is usually called indexing. The main objective of document indexing is to associate a document with a descriptor represented by a set of features manually assigned or automatically derived from its content. Representing the user’s information need involves a one-step or multi-step query formulation by means of prior terms expressed by the user and/or additional information obtained through iterative query improvements such as relevance feedback [1].

ISSN: 2231-5381

The main goal of document-query matching, also called query evaluation, is to estimate the relevance of a document to the given query. In this step most IR models perform an approximate matching process that uses the frequency distribution of query terms over the documents to compute a relevance score. This score is used as the criterion to rank the list of documents returned to the user in response to the query.

Traditional IRS are based on the well-known “bag of words” (BOW) representation, in which both documents and queries are represented as bags of lexical entities, namely keywords. A keyword may be a simple word (as in “computer”) or a compound word (as in “computer science”). Weights are associated with document or query keywords [2], [3] to express their importance in the considered material. The weighting scheme is generally based on variations of the well-known tf*idf formula [4]. A key characteristic of such systems is that the degree of document-query matching depends on the number of shared keywords. This leads to a “lexical focused” relevance estimation, which is less effective than a “semantic focused” one [5].
Indeed, in such IRS, relevant documents are not retrieved if they do not share words with the query, and irrelevant documents that have words in common with the query are retrieved even if these words do not have the same meaning in the document and the query. These problems stem mainly from the expressive richness of natural language, in particular its inherent synonymy and polysemy.

II. PROPOSED WORK

In this paper a new technique for creating the index of an information retrieval system is proposed. The proposed index is built using a Binary Search Tree (BST). The architecture is as follows:

http://www.ijettjournal.org Page 401

Fig-1 Proposed architecture for Context Based Index using BST.

The components are as follows:

Corpus :- The set of documents for which the index has to be created. These documents are normally either crawled by a crawler or stored by the administrator of the information retrieval system.

Parser :- This component takes a document from the Indexer and parses it to create a list of words called tokens. This token stream is returned to the Indexer.

Indexer :- This component retrieves documents one by one from the corpus and creates a token stream for each by using the Parser. It then stores these tokens in the BST. It also generates the context of the document through the context generator and adds this information to the BST.

Binary Search Tree (BST) :- This component finally stores the tokens, together with their contextual information. Each node of the BST stores the following information:
1. The keyword associated with the token after stemming.
2. The list of contexts for which the index has to store documents. Every context also stores the list of documents that belong to this context and contain the keyword.

Thesaurus :- It is used to decide the context of the document.
It receives a document from the context generator module and returns its context. This module uses a dictionary to decide the context of the document. The working of this component is outside the scope of this work.

Search Interface :- This interface reads the query from the user and returns a list of documents in which the query word is present. It obtains the list of documents from the Indexer module.

III. IMPLEMENTATION

The architecture proposed in the previous section has been implemented in the Java programming language using the NetBeans IDE. First, a sample corpus of 11 documents in 4 contexts was taken. Figure-2 shows the retrieval of documents for the sample query “program” when contextual information is not used. Five documents are retrieved by the system.

Query : program
Number of Documents Retrieved = 5

Figure-2 Snapshot of IR system without context

Figure-3 Snapshot of Context Based IR system

Figure-3 shows a snapshot of the output for the query “mobile”. Here contextual information has been used. The eleven documents have been clustered into four contexts. To fire a query, the user has to specify the query and also the name of the context in which he/she is interested. The returned list contains the documents in the selected context in which the query word exists.

Query : mobile
Context : Telecommunication
No of Documents Retrieved : 2

IV. RESULTS AND ANALYSIS

The results have been compared in terms of precision for both context-based searching and normal searching. Precision is the fraction of the retrieved documents that are relevant to the user's information need.
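As a rough illustration of how an index of the kind described in Section II supports this comparison, the following Java sketch stores, in each BST node, a keyword together with a map from context to the documents containing that keyword, and then computes precision for a keyword-only lookup and for a context-restricted lookup. It is not the authors' implementation: the class `ContextBstIndex`, its method names, the document identifiers and the context names are hypothetical, the tree is a plain unbalanced BST, keywords are assumed to be already stemmed, and the postings are made-up toy data in which a keyword-only query returns five documents of which two are relevant.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative sketch only; all names and data here are hypothetical.
class ContextBstIndex {
    // One BST node: a (stemmed) keyword plus, per context, the documents containing it.
    private static final class Node {
        final String keyword;
        final Map<String, Set<String>> docsByContext = new TreeMap<>();
        Node left, right;
        Node(String keyword) { this.keyword = keyword; }
    }

    private Node root;

    // Record that `keyword` occurs in document `docId`, which belongs to `context`.
    void add(String keyword, String context, String docId) {
        Node node = findOrInsert(keyword);
        node.docsByContext.computeIfAbsent(context, c -> new LinkedHashSet<>()).add(docId);
    }

    // Keyword-only lookup: union of the keyword's documents over all contexts.
    Set<String> search(String keyword) {
        Set<String> result = new LinkedHashSet<>();
        Node node = find(keyword);
        if (node != null) {
            for (Set<String> docs : node.docsByContext.values()) result.addAll(docs);
        }
        return result;
    }

    // Context-based lookup: only the keyword's documents in the requested context.
    Set<String> search(String keyword, String context) {
        Node node = find(keyword);
        if (node == null) return new LinkedHashSet<>();
        return new LinkedHashSet<>(node.docsByContext.getOrDefault(context, Set.of()));
    }

    // precision = |relevant ∩ retrieved| / |retrieved| (defined as 0 when nothing is retrieved)
    static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) return 0.0;
        Set<String> hits = new LinkedHashSet<>(retrieved);
        hits.retainAll(relevant);
        return (double) hits.size() / retrieved.size();
    }

    // Standard BST descent by string comparison; returns null if the keyword is absent.
    private Node find(String keyword) {
        Node cur = root;
        while (cur != null) {
            int cmp = keyword.compareTo(cur.keyword);
            if (cmp == 0) return cur;
            cur = cmp < 0 ? cur.left : cur.right;
        }
        return null;
    }

    // Same descent, but creates the node at the leaf position if it does not exist yet.
    private Node findOrInsert(String keyword) {
        if (root == null) { root = new Node(keyword); return root; }
        Node cur = root;
        while (true) {
            int cmp = keyword.compareTo(cur.keyword);
            if (cmp == 0) return cur;
            if (cmp < 0) {
                if (cur.left == null) { cur.left = new Node(keyword); return cur.left; }
                cur = cur.left;
            } else {
                if (cur.right == null) { cur.right = new Node(keyword); return cur.right; }
                cur = cur.right;
            }
        }
    }

    public static void main(String[] args) {
        ContextBstIndex index = new ContextBstIndex();
        // Hypothetical toy postings, not the paper's corpus.
        index.add("program", "Computer", "doc1");
        index.add("program", "Computer", "doc2");
        index.add("program", "Education", "doc3");
        index.add("program", "Entertainment", "doc4");
        index.add("program", "Sports", "doc5");
        Set<String> relevant = Set.of("doc1", "doc2"); // assumed relevance judgments
        System.out.println(precision(index.search("program"), relevant));             // 0.4
        System.out.println(precision(index.search("program", "Computer"), relevant)); // 1.0
    }
}
```

With this toy data the keyword-only lookup returns all five postings for the keyword, while the context-restricted lookup returns only the two documents stored under the chosen context. A balanced tree (for example `java.util.TreeMap`, which is a red-black tree) could replace the hand-written BST if worst-case lookup time matters; that substitution is a design option, not something the paper specifies.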
Figure 4 Precision Based Comparison Between Context Based IR and Normal IR (bar chart of relevant and retrieved documents for Context Based IR and Normal IR).

Precision = |{Relevant Documents} ∩ {Retrieved Documents}| / |{Retrieved Documents}|

Without context:
Relevant Documents = 2
Retrieved Documents = 5
|{Relevant Documents} ∩ {Retrieved Documents}| = 2
Precision = 2/5 = 0.4

With context:
Relevant Documents = 2
Retrieved Documents = 2
Precision = 2/2 = 1.0

In this experiment, all documents retrieved by context-based retrieval are relevant to the user, whereas only two of the five documents retrieved by normal retrieval are.

V. CONCLUSION AND FUTURE WORK

In this paper a new context-based approach for indexing in IR systems using a BST has been proposed. The system introduces contextual information about documents. The user has to specify the query and the context of the query. Documents in the given context are searched and returned to the user, so the documents returned belong to the context in which the user is interested for the query. The experiments show that the precision of the proposed system is better than that of the system without contextual information. Future work is needed to apply the system to a corpus with a large number of documents and a large number of contexts.

REFERENCES
[1] J. J. Rocchio, "Relevance feedback in information retrieval," in The SMART Retrieval System: Experiments in Automatic Document Processing, G. Salton, Ed. Englewood Cliffs, NJ: Prentice-Hall, 1971, pp. 313–323.
[2] G. Bordogna and G. Pasi, "A fuzzy linguistic approach generalizing Boolean information retrieval: a model and its evaluation," Journal of the American Society for Information Science, 44(2), pp. 70–82, 1993.
[3] D. A. Buell and D. H. Kraft, "A model for a weighted retrieval system," Journal of the American Society for Information Science, 32(3), pp. 211–216, 1981.
[4] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing and Management, 24(5), pp. 513–523, 1988.
[5] M. Mauldin, J. Carbonell, and R. Thomason, "Beyond the keyword barrier: knowledge-based information retrieval," Information Services and Use, 7(4-5), pp. 103–117, 1987.
[6] G. Miller, "WordNet: A lexical database for English," Communications of the ACM, 38, pp. 39–41, 1995.
[7] L. Bentivogli, P. Forner, B. Magnini, and E. Pianta, "Revising WordNet Domains Hierarchy: Semantics, Coverage, and Balancing," in COLING 2004 Workshop on "Multilingual Linguistic Resources", Geneva, Switzerland, August 28, 2004, pp. 101–108.