Research on Ontology Based Information Retrieval Techniques

Kausar Mukadam
Undergraduate Student, Department of Computer Engineering
Dwarkadas J. Sanghvi College of Engineering
Mumbai, India.
Fuzail Misarwala
Undergraduate Student, Department of Computer Engineering
Dwarkadas J. Sanghvi College of Engineering
Mumbai, India.
Sindhu Nair
Assistant Professor, Department of Computer Engineering
Dwarkadas J. Sanghvi College of Engineering
Mumbai, India.
ABSTRACT
Information retrieval can be a daunting task owing to
the colossal amount of information available on the
web. Search engines have to be precise and efficient
with the information they retrieve: efficient in terms
of time and space and, most importantly, accurate in
the relevance of the documents retrieved.
Users searching with keywords want results that are
accurate and match their intent. In this paper,
we study and compare a few novel methodologies for
information retrieval in terms of the relevance scores
and precision ratings of their search results. The
empirical data put forth in this paper are taken
directly from the calculations and results presented by
the authors of the respective proposed information
retrieval techniques. We compare the algorithms used in
these proposals and their targeted domains.
KEYWORDS
Ontology, Information, Retrieval.
INTRODUCTION
The advent of the World Wide Web brought with it
a large volume of data and information that is readily
available for public use. Information retrieval presents a
means to gather this information while reducing
information overload. Information retrieval (IR) can be
defined as finding material or documents of an
unstructured nature, usually in the form of text, that
satisfy an information need from within large
collections of data.
As most of the data is unstructured, numerous
information retrieval techniques have been developed to
help deal with the huge amounts of unstructured
knowledge accessible over networks. The information
retrieval techniques commonly used are keyword based:
they use lists of keywords to describe the
information content. The main drawback of this method
is that no information about the semantic relationships
between these keywords is provided, which makes
these systems difficult for ordinary users to use.
Describing their needs and then translating them into a
keyword based request is a problem, as information
needs cannot always be expressed appropriately in
system terms.
One widely used approach to combat this issue is to
incorporate ontologies into the information system.
Ontologies represent the essential concepts in a
subject area together with the semantic relationships
among them. An ontology provides metadata elements and
a shared vocabulary for annotating resources, and
uses its class hierarchy and class relations to
interpret that metadata.
1. INFORMATION RETRIEVAL MODELS
To retrieve related documents, the documents are
usually transformed into an appropriate
representation. Each information retrieval
strategy uses a specialized model for this
representation; such models guide research and provide a
blueprint for implementing a retrieval system. The model
predicts what will be relevant to the user given the user
query. To make these predictions, the models are
grounded in some branch of mathematics, in order to
formalise the model, ensure consistency, and establish
that it can be implemented within a real system [1].
The major models for retrieval of information are the
Boolean model, the Statistical model (including vector
space and probabilistic retrieval model) and the
Linguistic and Knowledge-based models.
1.1. Boolean Model:
The Boolean model is one of the first models of
information retrieval and is a much criticised model.
In this model, a query term is treated as an
unambiguous definition of a document set. For example,
the query term ‘finance’
defines the set of all documents that are indexed with
the term finance. The operators of George
Boole’s mathematical logic (logical product AND,
logical sum OR and logical difference NOT) can be
combined with query terms and sets of documents
to form new document sets.
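As a minimal sketch of this set-based view, the three operators can be written as Python set operations over a toy inverted index. The index contents and the helper name docs are assumptions made for illustration only.

```python
# Toy inverted index: each index term maps to the set of document IDs
# indexed with that term. Terms and IDs are made up for illustration.
index = {
    "finance": {1, 2, 4},
    "banking": {2, 3},
    "sports": {4, 5},
}


def docs(term):
    """Return the set of documents indexed with `term` (empty if unknown)."""
    return index.get(term, set())


# Logical product (AND) corresponds to set intersection.
finance_and_banking = docs("finance") & docs("banking")
# Logical sum (OR) corresponds to set union.
finance_or_banking = docs("finance") | docs("banking")
# Logical difference (NOT) corresponds to set difference.
finance_not_sports = docs("finance") - docs("sports")

print(sorted(finance_and_banking))  # [2]
print(sorted(finance_or_banking))   # [1, 2, 3, 4]
print(sorted(finance_not_sports))   # [1, 2]
```

Note that every document either belongs to the result set or does not, so the result carries no ranking.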
1.2. Extended Boolean Model:
Several methods have been developed to overcome the
disadvantages of the traditional Boolean model. The
traditional method has no provision for ranking, does
not support the assignment of weights to query or
document terms, and its operators are too strict. The
Smart Boolean approach and extended Boolean models
(for example, the P-norm and Fuzzy Logic approaches)
provide relevance ranking to users.
The P-norm method allows query and document terms
to carry weights that are computed from term
frequency statistics with proper normalization
procedures. The normalized weights are used to rank
documents in decreasing order of distance from the
point where all term weights are 0 for an OR query, and
in increasing order of distance from the point where
all term weights are 1 for an AND query. The operators
also have an associated coefficient (P) that indicates
the degree of strictness of the Boolean operator (from
1 for lowest strictness to infinity for highest
strictness). This method therefore uses a
distance-based measure.
In Fuzzy Set theory, each element has a degree of
membership in a set, in direct contrast to
traditional binary membership. The index term weight
for a given document reflects the degree to which the
term describes the document content. The weight is an
indication of the membership of the document in the
associated fuzzy set.
1.3. Statistical Model:
These models use statistical information (term
frequencies) to determine the relevance of documents
with reference to the query, and produce a list of
documents ranked by estimated relevance. Some
common types are vector space and probabilistic
models.
1.4. Vector Space Models:
The basic requirement of the vector space model is that
information retrieval objects are modelled as elements
of a vector space. Terms, documents, queries and
concepts are all represented as vectors in the vector
space. This implies that the system has linear
properties, i.e. any two elements of the system can be
added to create a new element, and an element can also
be multiplied by a real number [2]. The index
representations and the query are represented
as vectors embedded in a high-dimensional Euclidean
space, in which each term is assigned a separate
dimension. The similarity measure is usually the cosine
of the angle that separates the two vectors d and q,
where d represents the document's index representation
and q represents the query.
1.5. Probabilistic Models:
Probabilistic models consider the information retrieval
process as a probabilistic inference. Similarities are
assessed as probabilities that a document
is pertinent to the query. These models use
probabilistic theorems such as Bayes' theorem. The
probabilistic retrieval model implements the Probability
Ranking Principle, which specifies that an IR system
should rank the documents on the basis of their
probability of relevance to the user query, using all the
available information. A variety of evidence sources are
used in this method, the most common one being the
statistical distribution of terms in the relevant and
non-relevant documents. Other probabilistic models
include Bayesian Network Models, the 2-Poisson Model,
the Probabilistic Indexing Model, etc.
1.6. Linguistic and Knowledge-Based Models:
Linguistic and knowledge-based approaches, which
have been developed to address various problems in
information retrieval, perform a semantic and syntactic
analysis in order to retrieve documents more effectively.
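To make the statistical models of this section more concrete, the cosine measure used by the vector space model (Section 1.4) can be sketched as follows. This is only an illustration under simple assumptions: raw term frequencies serve as weights, and the sample documents and the names to_vector and cosine are made up.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a text as a term-frequency vector (one dimension per term)."""
    return Counter(text.lower().split())


def cosine(d, q):
    """Cosine of the angle between two sparse term vectors d and q."""
    dot = sum(d[t] * q[t] for t in d.keys() & q.keys())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_d * norm_q)


documents = {
    "d1": "ontology based information retrieval",
    "d2": "boolean retrieval model",
    "d3": "traditional chinese medicine",
}
query = to_vector("ontology retrieval")

# Rank documents by decreasing cosine similarity to the query.
ranking = sorted(
    documents,
    key=lambda name: cosine(to_vector(documents[name]), query),
    reverse=True,
)
print(ranking)  # ['d1', 'd2', 'd3']
```

Unlike the Boolean model, every document receives a score, so the result is a ranked list rather than a set.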
2. ONTOLOGY:
The Oxford English Dictionary [3] defines ontology as
“A set of concepts and categories in a subject area or
domain that shows their properties and the relations
between them.” An ontology is a specification of a
conceptualization, which consists of a list of terms
(names and definitions) and the relationships between
them. The terms are used to represent important
concepts, or classes of objects, of the domain. For
example, in the university domain, faculty, students,
lecture rooms, courses and departments can be some
important concepts. The ontology concept has found
use in Artificial Intelligence, Computer Science, and
Knowledge Engineering in a wide range of related
applications including natural language processing,
E-commerce, information retrieval, and the Semantic Web
[4]. In information retrieval, ontologies have been used
to overcome the limitations of traditional
keyword-based search, to provide a vocabulary for
classification of the content, and to improve search
through class hierarchy based query expansion,
multifaceted browsing and searching, etc.
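Class hierarchy based query expansion can be illustrated with a small sketch. The hierarchy below is a hypothetical fragment of the university domain, and the function name expand is an assumption; real systems would read the hierarchy from an ontology file.

```python
# Hypothetical class hierarchy for the university domain:
# each class maps to its direct subclasses.
SUBCLASSES = {
    "person": ["faculty", "student"],
    "faculty": ["professor", "lecturer"],
    "student": ["undergraduate", "postgraduate"],
}


def expand(term):
    """Return `term` followed by all of its descendants in the hierarchy."""
    expanded = [term]
    for child in SUBCLASSES.get(term, []):
        expanded.extend(expand(child))
    return expanded


print(expand("faculty"))  # ['faculty', 'professor', 'lecturer']
```

A keyword query for "faculty" would then also match documents that only mention "professor" or "lecturer".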
Figure 1. Basic process of text mining information retrieval based on ontology [5].
3. SURVEY OF NOVEL METHODOLOGIES
3.1. Retrieval Model for Traditional Chinese Medicine:
Some shortcomings of traditional information retrieval
methods are discussed in [6]. The biggest problems faced
in information retrieval in the Traditional Chinese
Medicine (TCM) field are low coverage and
high redundancy. The motivation for working with
TCM literature and databases is
the lack of research in TCM, despite extensive and
relevant research carried out by scholars in other fields.
The TCM domain ontology is constructed using a
seven-step method. The paper then summarizes the
implementation of its ontology based
information retrieval technique, which is a two-step
technique. The next part deals with concept similarity.
The authors state that, in an ontology, the correlation
between pieces of information is measured through the
correlation between concepts. In the TCM domain, the
ontology follows a clear hierarchical architecture. The
relevance between concepts is measured on a scale of 0
to 1. If two concepts are unrelated, i.e. there is no
relevance, the relevance score is 0. This quantifies
the correlation between concepts, and hence allows the
effectiveness of the retrieval system to be measured.
The paper defines three levels of correlation between
concepts. By determining the degree of
correlation, i.e. the relevance between the retrieved
information and the information resources, the most
relevant information can be gathered.
In the final step, a dedicated algorithm for sorting the
search results is proposed, since traditional sorting
algorithms may not be very effective
for the TCM domain. The measure of concept similarity
is used to sort the search results.
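The exact similarity formula of [6] is not reproduced here. As a hedged stand-in, the sketch below uses a Wu-Palmer-style measure on a hypothetical fragment of a TCM concept hierarchy: it yields a score between 0 and 1, returns 0 when two concepts share only the root, and is then used to sort results. All concept names, document IDs and function names are assumptions.

```python
# Hypothetical fragment of a TCM concept hierarchy: child -> parent.
# The hierarchy is assumed to have a single root ("tcm").
PARENT = {
    "herb": "tcm",
    "formula": "tcm",
    "ginseng": "herb",
    "licorice": "herb",
    "decoction": "formula",
}


def ancestors(concept):
    """Path from `concept` up to the root, starting with the concept itself."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(concept):
    """Number of edges between `concept` and the root (the root has depth 0)."""
    return len(ancestors(concept)) - 1


def similarity(a, b):
    """Wu-Palmer-style score in [0, 1]; 0 when only the root is shared."""
    if a == b:
        return 1.0
    above_b = set(ancestors(b))
    # Deepest common ancestor: the first node on a's path that is also on b's.
    common = next(c for c in ancestors(a) if c in above_b)
    return 2 * depth(common) / (depth(a) + depth(b))


def sort_results(query_concept, results):
    """Order (document, concept) pairs by similarity to the query concept."""
    return sorted(results, key=lambda r: similarity(query_concept, r[1]), reverse=True)


results = [("doc1", "decoction"), ("doc2", "licorice"), ("doc3", "ginseng")]
print(sort_results("ginseng", results))
```

For the query concept "ginseng", the document annotated with "ginseng" comes first, the sibling concept "licorice" second, and the unrelated "decoction" last.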
The evaluation reported in [6] led to the
inference that the newly proposed ontology based
information retrieval system was efficient and effective
in the Traditional Chinese Medicine domain.
Figure 2. Information Retrieval framework for TCM [6].
3.2. Semantic Indexing based Information Retrieval Model:
Some limitations of keyword based search, such as the
inability to describe relations between search terms,
are dealt with in [7]. The
proposed framework addresses three important issues
related to semantic search and information retrieval:
scalability, usability, and retrieval performance. To
improve scalability, a semantic
indexing approach based on an entity
retrieval model is suggested. Usability is improved
through the adoption of a keyword based interface. The
use of domain specific information extraction, rules, and
inference is proposed to improve retrieval performance.
The framework proposed is based on three key
processes: representation of semantic
knowledge, semantic indexing, and querying. For the
representation of semantic knowledge, an
existing ontology is reused in the implementation of
information retrieval in transport systems, and the
OWL Web Ontology Language is used. At the end of this
first step, OWL files are obtained that are then
indexed for the search.
The next step is semantic indexing. The indexing system
is designed around an entity retrieval model because
the knowledge base is composed of entities defined in
RDF, RDFS and OWL. The knowledge base is a weighted and
labelled graph in which the nodes are the resources and
the edges are properties. The graph is a set of RDF
triples, each consisting of three components: subject,
predicate, and object. The subject identifies the
resource described by the triple, the predicate names
the property of that resource that is being described,
and the object gives the value of that property. The EAV
(Entity Attribute Value) model is adopted for
the indexing system. The indexing structure is then
described, as it largely affects retrieval performance.
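As a rough illustration of how triples map onto an entity-attribute-value structure, the sketch below builds two dictionaries from a handful of made-up triples. It is plain Python, not the indexing structure of [7]; the identifiers, predicates and variable names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical RDF-like triples (subject, predicate, object) from a
# transport knowledge base; identifiers are made up for illustration.
triples = [
    ("bus:42", "type", "Bus"),
    ("bus:42", "servesStop", "stop:Central"),
    ("tram:7", "type", "Tram"),
    ("tram:7", "servesStop", "stop:Central"),
    ("stop:Central", "label", "Central Station"),
]

# Entity view: entity -> attribute -> set of values.
entities = defaultdict(lambda: defaultdict(set))
# Inverted view: (attribute, value) -> set of entities, used for lookup.
by_attribute_value = defaultdict(set)

for subject, predicate, obj in triples:
    entities[subject][predicate].add(obj)
    by_attribute_value[(predicate, obj)].add(subject)

print(sorted(by_attribute_value[("servesStop", "stop:Central")]))
# ['bus:42', 'tram:7']
```

The entity view answers "what is known about this entity?", while the inverted view answers "which entities have this attribute value?".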
The final step is semantic querying, the process of
querying the EAV graph after the semantic knowledge
has been represented and indexed. Three types of query
are supported: full text, structural, and
semi-structural. SIRE is used to
execute the search query, and results are obtained
using a Boolean combination of attribute-value pairs
based on logical operators.
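A Boolean combination of attribute-value pairs can be sketched as a small recursive evaluator. This does not use or imitate the SIRE engine's API; the index contents, the tuple-based query format and the function name evaluate are assumptions chosen for illustration.

```python
# Hypothetical attribute-value index: (attribute, value) -> entity IDs.
index = {
    ("type", "Bus"): {"bus:42", "bus:51"},
    ("type", "Tram"): {"tram:7"},
    ("servesStop", "stop:Central"): {"bus:42", "tram:7"},
}


def evaluate(query):
    """Evaluate a nested query of attribute-value pairs and logical operators.

    A query is either an (attribute, value) pair or a 3-tuple
    (operator, left, right) with operator "AND", "OR" or "NOT";
    "NOT" means: matches of `left` that are not matches of `right`.
    """
    if len(query) == 2:
        return set(index.get(query, set()))
    operator, left, right = query
    left_hits, right_hits = evaluate(left), evaluate(right)
    if operator == "AND":
        return left_hits & right_hits
    if operator == "OR":
        return left_hits | right_hits
    if operator == "NOT":
        return left_hits - right_hits
    raise ValueError(f"unknown operator: {operator}")


buses_at_central = evaluate(("AND", ("type", "Bus"), ("servesStop", "stop:Central")))
print(buses_at_central)  # {'bus:42'}
```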
The proposed information retrieval method is then
evaluated using a set of pre-set queries, showing a high
rate of precision.
Figure 3. Framework for Semantic Indexing based Information Retrieval [7].
3.3. Semantic Extension Retrieval Model:
A new technique aimed at tackling some key problems
posed by traditional keyword based methods is
proposed in [8]. Firstly,
keywords do not always convey the full meaning of the
content, so the retrieved information may be irrelevant.
Secondly, a keyword may have different meanings in
different contexts, which leads to difficulties in the
processing of query features. Thirdly, due to
polysemy and synonymy in natural language,
keyword-based retrieval can only cover information
containing the same word, while other information with
a similar meaning but different words is missed
[8]. To overcome these issues, an information retrieval
technique based on semantic extension is proposed.
The semantic retrieval is based on semantic
extension. The strategy considers whether or not the
result is suitable for the user’s query. The proposed
model differs from the traditional
model of expressing content features through
keywords: it uses ontology annotation to summarize the
semantic features of the information, and makes use of
semantic extension for retrieval. The model has two
parts: ontology annotation, and retrieval of
text based on semantic extension.
Firstly, ontology annotation indexes documents based
on the ontology of the domain. This serves as the
foundation for the text retrieval. The query keyword is
then extended and turned into a full-text search. The
results obtained are then reordered.
The indexing is executed by an index writer, which adds
documents to the index and serves as the core component
for the construction of the index. The core component
for retrieval is the index reader, which reads the index.
The analyser pre-treats the documents through ontology
annotation and sends the content to the index writer. It
also matches the keyword from the query to the domain.
The analyser has a subcomponent, the ontology
encoder, which processes the elements of the domain
ontology into a multi-way tree that is in turn used for
annotation and keyword matching. The results obtained
after further processing are reordered.
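The interplay of analyser, index writer and index reader can be sketched in a few lines. The classes below are plain-Python stand-ins named after the components in the text, not the implementation of [8] and not a real library API; the ontology, the lexicon and the scoring rule (exact concept matches outrank extended ones) are all assumptions.

```python
from collections import defaultdict

# Hypothetical domain ontology as a multi-way tree: concept -> children.
ONTOLOGY = {"retrieval": ["indexing", "ranking"], "indexing": ["inverted index"]}
# Hypothetical mapping from surface words to ontology concepts.
LEXICON = {"index": "indexing", "rank": "ranking", "search": "retrieval"}


def analyse(text):
    """Annotate a text with the ontology concepts its words refer to."""
    return {LEXICON[w] for w in text.lower().split() if w in LEXICON}


def extend(concept, depth=0):
    """Yield (concept, depth) for the concept and all of its descendants."""
    yield concept, depth
    for child in ONTOLOGY.get(concept, []):
        yield from extend(child, depth + 1)


class IndexWriter:
    def __init__(self):
        self.postings = defaultdict(set)  # concept -> document IDs

    def add_document(self, doc_id, text):
        for concept in analyse(text):
            self.postings[concept].add(doc_id)


class IndexReader:
    def __init__(self, postings):
        self.postings = postings

    def search(self, keyword):
        """Match the keyword to concepts, extend them, and reorder the hits."""
        scores = defaultdict(float)
        for concept in analyse(keyword):
            for extended, depth in extend(concept):
                for doc_id in self.postings.get(extended, ()):
                    # Assumed scoring: closer concepts score higher.
                    scores[doc_id] = max(scores[doc_id], 1 / (1 + depth))
        return sorted(scores, key=lambda d: (-scores[d], d))


writer = IndexWriter()
writer.add_document("p1", "how to rank documents")
writer.add_document("p2", "search engines")
writer.add_document("p3", "building an index")
reader = IndexReader(writer.postings)
print(reader.search("search"))  # ['p2', 'p1', 'p3']
```

A plain keyword match for "search" would return only p2; the extension through the ontology tree also brings in p1 and p3, ranked below the exact match.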
To evaluate the performance of the proposed
technique, 1000 papers were collected and used for
testing. Precision and recall were used as evaluation
measures, and the experimental results show a fairly
high rate of recall and precision compared to
traditional keyword based information retrieval.
Figure 4. Proposed framework for Semantic Extension Retrieval Model [8].
4. PERFORMANCE MEASURES:
An ideal performance measure for an information
retrieval system would take into account the resources
used by the system to perform a retrieval operation, the
amount of effort and time spent by a user to obtain the
needed information, and the ability of the system to
retrieve useful items. Such a measure is, however,
extremely hard to implement. The user would want the
system to retrieve the highest possible number of
appropriate items and to
reduce the number of non-relevant items in the
response. The former criterion is represented by Recall,
and the latter by Precision [9].
Recall (also known as sensitivity in binary
classification) can be defined as the fraction of the
documents relevant to the user query that are retrieved.
Precision (also known as positive predictive value) is
the fraction of the retrieved documents which
are relevant to the information requirement of the user.
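Both measures follow directly from the set of retrieved documents and the set of relevant documents. The short sketch below computes them for made-up document IDs: precision divides the number of relevant retrieved documents by the number of retrieved documents, while recall divides it by the number of relevant documents.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set and a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative data: 4 documents retrieved, 5 documents actually relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.5 0.4
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).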
5. RESULT ANALYSIS:
Precision and Recall are well suited as performance
measures for the novel techniques proposed by
researchers, as the primary goal of these information
retrieval techniques is to return relevant pages as
the search result. All of
the techniques studied above report a high
success rate in their authors' own experiments. The
reported average precision and recall rates show a
significant rise when compared to traditional keyword
based methods, and serve as evidence that the proposed
methodologies have overcome the difficulties that they
set out to address in their respective domains.
6. CONCLUSION AND FUTURE WORK:
Various novel ontology based information retrieval
techniques have been proposed by researchers, for
use in one or more specific domains, aimed
at overcoming certain problems encountered with the use
of traditional keyword based algorithms. These new
techniques manage to overcome the stated issues and
return high recall and precision rates, and hence should
be used with increased frequency for information
retrieval.
REFERENCES
[1] D. Hiemstra, "Information Retrieval Models," in A. Goker and
J. Davies (eds.), Information Retrieval: Searching in the 21st
Century, 2009.
[2] V. V. Raghavan, "A critical analysis of vector space model in
information retrieval," Journal of American Society for
Information Science, 1986.
[3] "Oxford English Dictionary," [Online]. Available:
http://www.oxforddictionaries.com/definition/english/ontology.
[4] S. S. Yasodha, "An Ontology-Based Framework for Semantic
Web Content Mining," International Conference on Computer
Communication and Informatics (IEEE), 2014.
[5] M. Q. Song Yibing, "Research of literature information
retrieval method based on ontology," IEEE, 2014.
[6] Y. Z. D. Z. H. L. H. R. Aziguli Wulamu, "The Research and
Application of Ontology-Based Information Retrieval," IEEE
9th Conference on Industrial Electronics and Applications
(ICIEA), 2014.
[7] M. A. Amir Zidi, "A Generalized Framework for Ontology-Based
Information Retrieval," IEEE, 2013.
[8] H. L. Rui Zhang, "Design and Realization of Semantic
Extension Information Retrieval Mechanism," Third
International Conference on Information Science and
Technology, 2013.
[9] V. V. Raghavan, "A Critical Investigation of Recall and
Precision as Measures of Retrieval System Performance,"
ACM Transactions on Information Systems, 1989.