Research on Ontology Based Information Retrieval Techniques

Kausar Mukadam
Undergraduate Student, Department of Computer Engineering, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India.

Fuzail Misarwala
Undergraduate Student, Department of Computer Engineering, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India.

Sindhu Nair
Assistant Professor, Department of Computer Engineering, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India.

ABSTRACT
Information retrieval can be a daunting task owing to the colossal amount of information available on the web. Search engines have to be precise and efficient with the information they retrieve: efficient in terms of time, space and, most importantly, the relevance of the documents retrieved. Users searching with keywords want results that are accurate and match their intent. In this paper, we study and compare a few novel methodologies for information retrieval in terms of the relevance scores and precision ratings of their search results. The empirical data put forth in this paper are obtained directly from the calculations and results presented by the authors of the respective proposed information retrieval techniques. We compare the algorithms used in these proposals and their targeted domains.

KEYWORDS
Ontology, Information, Retrieval.

INTRODUCTION
The advent of the World Wide Web brought with it a large volume of data and information that is readily available for public use. Information retrieval offers a means of gathering this information while reducing information overload. Information retrieval (IR) can be defined as finding material or documents of an unstructured nature, usually in the form of text, that satisfy an information need from within large collections of data. As most of the data is unstructured, numerous information retrieval techniques have been developed to help deal with the huge amounts of unstructured knowledge accessible over networks.
The commonly used information retrieval techniques are keyword based: they use lists of keywords to describe the information content. The main drawback of this method is that it provides no information about the semantic relationships between these keywords, which makes such systems difficult for ordinary users. Users struggle to describe their needs and translate them into keyword-based requests, because information needs cannot always be expressed appropriately in system terms. One widely used approach to combat this issue is to incorporate ontologies into the information system; these represent the essential concepts of a subject area together with the semantic relationships among them. An ontology provides metadata elements and a familiar vocabulary for describing resources, and uses class hierarchy and class relations for metadata interpretation.

1. INFORMATION RETRIEVAL MODELS
To retrieve related documents, the documents are usually transformed into an appropriate representation. Each information retrieval strategy uses a specialized model for representing documents; such models guide research and provide a blueprint for implementing a retrieval system. The model predicts what will be relevant to the user given the user query. To make these predictions, the models are grounded in some branch of mathematics, in order to formalise the model, ensure consistency, and establish that it can be implemented within a real system [1]. The major models for retrieval of information are the Boolean model, the Statistical model (including the vector space and probabilistic retrieval models) and the Linguistic and Knowledge-based models.

1.1. Boolean Model:
The Boolean model is one of the first models of information retrieval and is a much criticised model. The model can be defined by thinking of a query term as an unambiguous definition of a set of documents.
For example, the query term ‘finance’ defines the set of all documents that are indexed with the term finance. In this model, the operators of George Boole’s mathematical logic (logical product AND, logical sum OR and logical difference NOT) can be combined with query terms and sets of documents to form new document sets.

1.2. Extended Boolean Model:
Several methods have been developed to overcome the disadvantages of the traditional model. The traditional method has no provision for ranking, does not support the assignment of weights to query or document terms, and its operators are too strict. The Smart Boolean approach and extended Boolean models (for example, the P-norm and Fuzzy Logic approaches) provide relevance ranking to users.

The P-norm method allows query and document terms to carry weights that are computed from term frequency statistics with proper normalization procedures. The normalized weights are used to rank documents in decreasing order of distance for an OR query, and increasing order of distance for an AND query. The operators also have an associated coefficient (P) that indicates the degree of strictness of the Boolean operator (from 1 for lowest strictness to infinity for highest strictness). This method uses a distance-based measure.

In Fuzzy Set theory, each element has a degree of membership in a set, in direct contrast to traditional binary membership. The index term weight for a given document reflects the degree to which the term describes the document content. The weight is an indication of the membership of the document in the associated fuzzy set.

1.3. Statistical Model:
These models use statistical information (term frequencies) to determine the relevance of documents with reference to the query, and produce a list of documents ranked by estimated relevance. Some common types are vector space and probabilistic models.

1.4. Vector Space Models:
The basic requirement of the vector space model is that information retrieval objects are modelled as elements of a vector space. Terms, documents, queries and concepts are all represented as vectors in the vector space. This implies that the system has linear properties, i.e. any two elements of the system can be added to create a new element, and an element can also be multiplied by a real number [2]. The index representations and the query are represented as vectors embedded in a multidimensional Euclidean space, in which each term is assigned a separate dimension. The similarity measure is usually the cosine of the angle that separates the two vectors d and q, where d represents the document's index representation and q represents the query.

1.5. Probabilistic Models:
Probabilistic models consider the information retrieval process as a probabilistic inference. Similarities are assessed as probabilities that a document is pertinent to the query. These models use various probabilistic theorems such as Bayes' theorem. The probabilistic retrieval model implements the Probability Ranking Principle, which specifies that an IR system should rank documents on the basis of their probability of relevance to the user query, using all the available information. A variety of evidence sources are used in this method, the most common one being the statistical distribution of terms in the relevant and non-relevant documents. Other probabilistic models include Bayesian Network Models, the 2-Poisson Model, the Probabilistic Indexing Model, etc.

1.6. Linguistic and Knowledge-Based Models:
Linguistic and knowledge-based approaches, which have been developed to address various problems in information retrieval, perform a semantic and syntactic analysis in order to retrieve documents more effectively.

2. ONTOLOGY:
The Oxford English Dictionary [3] defines ontology as “A set of concepts and categories in a subject area or domain that shows their properties and the relations between them.” An ontology is a specification of a conceptualization, consisting of a list of terms (names and definitions) and the relationships between them. The terms are used to represent important concepts, or classes of objects, of the domain. For example, in the university domain, faculty, students, lecture rooms, courses and departments can be some important concepts.

The ontology concept has found use in Artificial Intelligence, Computer Science, and Knowledge Engineering in a myriad of related applications, including natural language processing, E-commerce, information retrieval, and the Semantic Web [4]. In information retrieval, ontologies have been used to overcome the limitations of traditional keyword-based search, to provide a vocabulary for classifying content, and to improve search through class hierarchy based query expansion, multifaceted browsing and searching, etc.

Figure 1. Basic process of text mining information retrieval based on ontology [5].

3. SURVEY OF NOVEL METHODOLOGIES

3.1. Retrieval Model for Traditional Chinese Medicine:
Some shortcomings of traditional information retrieval methods are discussed in [6]. The biggest problems faced in information retrieval in the Traditional Chinese Medicine (TCM) field are low coverage and high redundancy. The motivation for working with TCM literature and databases is the lack of such research in TCM, despite extensive and relevant research carried out by scholars in other fields. The TCM domain ontology is constructed using a seven-step method. The paper then summarizes the implementation process of its ontology-based information retrieval technique, which is a two-step technique. The next part deals with concept similarity.
The paper states that, in an ontology, the correlation between pieces of information is a measure of the correlation between the underlying concepts. In the TCM domain, the ontology follows a clear hierarchical architecture. The relevance between concepts is measured on a scale of 0 to 1. If two concepts are unrelated, i.e. there is no relevance, the relevance score is 0. This quantifies the correlation between concepts, so that the effectiveness of the retrieval system can be measured. The paper defines three levels of correlation between concepts. By determining the degree of correlation, that is, the relevance between the retrieved information and the information resources, the most relevant information can be gathered. In the final step, sorting the search results, a sorting algorithm is proposed, since traditional algorithms for sorting search results may not be very effective for the TCM domain. The measure of concept similarity is used to sort the search results. The paper concludes with the inference that the newly proposed ontology based information retrieval system was efficient and effective in the Traditional Chinese Medicine domain.

Figure 2. Information Retrieval framework for TCM [6].

3.2. Semantic Indexing based Information Retrieval Model:
Some limitations, like the inability to describe relations between search terms, are dealt with in [7]. The proposed framework addresses three important issues in semantic search and information retrieval: scalability, usability, and retrieval performance. To improve scalability, a semantic indexing approach based on an entity retrieval model is suggested. Usability is improved through the adoption of a keyword-based interface. The use of domain-specific information extraction, rules, and inference is proposed to improve retrieval performance. The framework is based on three key processes: representation of semantic knowledge, semantic indexing, and querying.
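To make these three processes slightly more concrete, the following is a minimal Python sketch, not the implementation from [7]. It assumes knowledge is held as subject-predicate-object triples, that the index maps each attribute-value pair to the entities carrying it, and that a query is a Boolean combination of such pairs. The transport data, the function names and the index layout are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical transport-domain knowledge as (subject, predicate, object) triples.
TRIPLES = [
    ("bus_12", "type", "Bus"),
    ("bus_12", "servesStop", "Central Station"),
    ("tram_3", "type", "Tram"),
    ("tram_3", "servesStop", "Central Station"),
    ("bus_40", "type", "Bus"),
    ("bus_40", "servesStop", "Airport"),
]


def build_eav_index(triples):
    """Semantic indexing (assumed layout): (attribute, value) -> set of entities."""
    index = defaultdict(set)
    for entity, attribute, value in triples:
        index[(attribute, value)].add(entity)
    return index


def query_and(index, *pairs):
    """Entities that match every (attribute, value) pair (Boolean AND)."""
    sets = [index.get(pair, set()) for pair in pairs]
    return set.intersection(*sets) if sets else set()


def query_or(index, *pairs):
    """Entities that match at least one (attribute, value) pair (Boolean OR)."""
    return set().union(*(index.get(pair, set()) for pair in pairs))


index = build_eav_index(TRIPLES)
buses_at_central = query_and(
    index, ("type", "Bus"), ("servesStop", "Central Station")
)
print(sorted(buses_at_central))  # ['bus_12']
```

Keying the index on attribute-value pairs means each query term is a single dictionary lookup, and the Boolean operators reduce to set intersection and union over the looked-up entity sets.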
For the representation of semantic knowledge, an existing ontology is reused in an implementation of information retrieval for transport systems. The Web Ontology Language (OWL) is used. At the end of this first step, useful OWL files are obtained, which are then indexed for search.

The next step is semantic indexing. An indexing system is designed using the entity retrieval model, because the knowledge base is composed of entities defined for RDF, OWL and RDFS. The knowledge base is a weighted and labelled graph in which the edges are properties and the nodes are resources. The graph is a set of RDF triples, each consisting of three components: subject, predicate, and object. The subject identifies the resource that the triple describes, the predicate names the property of that resource that is being given a value, and the object supplies that value. The EAV (Entity Attribute Value) model is adopted for the indexing system. The indexing structure is then described, as it largely affects retrieval performance.

The next step is semantic querying, the process of querying the EAV graph after the semantic knowledge has been represented and indexed. Three types of query are supported: full text, structural, and semi-structural. SIRE is used to run the query, and results are obtained using a Boolean combination of attribute-value pairs based on logical operators. The proposed information retrieval method is then evaluated using a set of pre-set queries, showing a high rate of precision.

Figure 3. Framework for Semantic Indexing based Information Retrieval [7].

3.3. Semantic Extension Retrieval Model:
A new technique aimed at tackling some key problems posed by traditional keyword-based methods is proposed in [8]. The problems are as follows. Firstly, the keywords do not always convey the full meaning of the content, and the retrieved information may be irrelevant.
Secondly, a keyword may have different meanings in different contexts, which leads to difficulties in processing query features. Thirdly, owing to polysemy and synonymy in natural language, keyword-based retrieval can only cover information containing the same word, while other information with a similar meaning but different words is missed [8].

To overcome these issues, an information retrieval technique based on semantic extension is proposed, with a strategy that considers whether or not a result is suitable for the user’s query. The proposed model differs from the traditional model of expressing content features through keywords: it uses ontology annotation to summarize the semantic features of the information, and makes use of semantic extension for retrieval.

The model has two parts: ontology annotation, and text retrieval based on semantic extension. Firstly, ontology annotation indexes documents based on the domain ontology. This serves as the foundation for text retrieval. The query keyword is then extended and turned into a full-text search. The results obtained are then reordered.

Indexing is carried out by an index writer, which adds documents to the index and serves as the core component for constructing the index. The core component for retrieval is the index reader, which reads the index. The analyser pre-processes the documents through ontology annotation and sends the content to the index writer. It also matches the keywords from the query to the domain. The analyser has a subcomponent, the ontology encoder, which processes the elements of the domain ontology into a multi-tree that is in turn used for annotation and keyword matching. The results obtained after further processing are reordered.
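The annotate, extend, search and reorder pipeline described above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the mechanism implemented in [8]: the ontology, the documents, the function names and the depth-based scoring rule are all hypothetical, and annotation is reduced to simple substring matching.

```python
from collections import defaultdict

# Hypothetical domain ontology as a multi-way tree: concept -> narrower concepts.
ONTOLOGY = {
    "retrieval": ["semantic retrieval", "keyword retrieval"],
    "semantic retrieval": ["query expansion"],
    "keyword retrieval": [],
    "query expansion": [],
}

# Hypothetical document collection.
DOCS = {
    "d1": "keyword retrieval matches the exact words of a query",
    "d2": "query expansion adds related concepts to a query",
    "d3": "semantic retrieval uses an ontology",
    "d4": "cooking pasta at home",
}


def annotate(docs, ontology):
    """Ontology annotation (simplified): concept -> documents that mention it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for concept in ontology:
            if concept in text:
                index[concept].add(doc_id)
    return index


def extend(keyword, ontology):
    """Semantic extension: the keyword plus its narrower concepts, with tree depth."""
    extended = {}
    stack = [(keyword, 0)]
    while stack:
        concept, depth = stack.pop()
        if concept in ontology and concept not in extended:
            extended[concept] = depth
            stack.extend((child, depth + 1) for child in ontology[concept])
    return extended


def search(keyword, index, ontology):
    """Retrieve over the extended concepts, then reorder by an assumed score
    in which concepts closer to the original keyword contribute more."""
    scores = defaultdict(float)
    for concept, depth in extend(keyword, ontology).items():
        for doc_id in index.get(concept, ()):
            scores[doc_id] += 1.0 / (1 + depth)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


index = annotate(DOCS, ONTOLOGY)
print(search("semantic retrieval", index, ONTOLOGY))  # ['d3', 'd2']
```

In this toy collection a plain keyword match on "semantic retrieval" would return only d3; the extension step also reaches d2 through the narrower concept "query expansion", and the reordering step keeps the direct match ranked first.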
For the performance evaluation of the proposed technique, 1000 papers were collected and used for testing. Precision and recall were used for evaluation, and the experimental results show a fairly high rate of recall and precision compared with traditional keyword-based information retrieval.

Figure 4. Proposed framework for Semantic Extension Retrieval Model [8].

4. PERFORMANCE MEASURES:
An ideal performance measure for an information retrieval system would take into account the resources used by the system to perform a retrieval operation, the amount of effort and time spent by a user to obtain the needed information, and the ability of the system to retrieve useful items. Such a measure, however, is extremely hard to implement. The user would want the system to retrieve as many relevant items as possible and to reduce the number of non-relevant items in the response. The former criterion is represented by Recall, and the latter by Precision [9]. Recall (also known as sensitivity in binary classification) is the fraction of the documents relevant to the user query that are retrieved. Precision (also known as positive predictive value) is the fraction of the retrieved documents that are relevant to the information need of the user.

5. RESULT ANALYSIS:
Precision and recall are the most appropriate performance measures for the techniques surveyed here, since the primary goal of these information retrieval techniques is to return relevant pages as search results. All of the techniques studied above report a high success rate in their authors' own experiments. The average precision and recall rates show a significant rise when compared with traditional keyword-based methods, and serve as evidence that the proposed methodologies have overcome the difficulties they set out to address in their respective domains.

6. CONCLUSION AND FUTURE WORK:
Various novel ontology-based information retrieval techniques have been proposed by researchers, for use in one specific domain or in multiple domains, aimed at overcoming certain problems encountered with traditional keyword-based algorithms. These new techniques manage to overcome the stated issues and return high recall and precision rates, and hence should be used with increased frequency for information retrieval.

REFERENCES
[1] D. Hiemstra, "Information Retrieval Models," in A. Goker and J. Davies, Information Retrieval: Searching in the 21st Century, 2009.
[2] V. V. Raghavan, "A critical analysis of vector space model in information retrieval," Journal of American Society for Information Science, 1986.
[3] "Oxford English Dictionary," [Online]. Available: http://www.oxforddictionaries.com/definition/english/ontology.
[4] S. S. Yasodha, "An Ontology-Based Framework for Semantic Web Content Mining," International Conference on Computer Communication and Informatics (IEEE), 2014.
[5] M. Q. Song Yibing, "Research of literature information retrieval method based on ontology," IEEE, 2014.
[6] Y. Z. D. Z. H. L. H. R. Aziguli Wulamu, "The Research and Application of Ontology-Based Information Retrieval," IEEE 9th Conference on Industrial Electronics and Applications (ICIEA), 2014.
[7] M. A. Amir Zidi, "A Generalized Framework for Ontology-Based Information Retrieval," IEEE, 2013.
[8] H. L. Rui Zhang, "Design and Realization of Semantic Extension Information Retrieval Mechanism," Third International Conference on Information Science and Technology, 2013.
[9] V. V. Raghavan, "A Critical Investigation of Recall and Precision as Measures of Retrieval System Performance," ACM Transactions on Information Systems, 1989.