ONTOLOGICAL SYSTEM FOR INFORMATION RETRIEVAL AND
INTEGRATION OF RESEARCH REPORTS IN SELECTED NIGERIAN
UNIVERSITIES.
M. A. Mabayoje (MSc Computer Science) 1, S. O. Olabiyisi (PhD Mathematics) 2
1 Department of Computer Science, Faculty of Communication and Information Sciences,
University of Ilorin, PMB1515, Ilorin, Kwara-Nigeria.
2 Department of Computer Science and Engineering, Ladoke Akintola University of
Technology, Ogbomosho, Oyo-Nigeria.
Tel.: +2348063185885, +2348032866333, Services100ng@yahoo.com
Tel: +2348036669863, Soolabiyisi@lautech.ed.ng.
Abstract
Successful research in any field of knowledge requires access to information sources such as
the literature and other relevant materials. In tertiary education, one basic information
source is previous research reports in the form of Long Essays and Dissertations. Findings
have revealed that the current way of accessing available dissertations and other research
reports is cumbersome. There is a need to provide for, and enhance, knowledge reusability.
This can be achieved through a system that integrates such dissertations, as research
information sources, across Nigerian tertiary schools.
The aim of this study is to present a realistic web-based framework for easy information
retrieval and for the integration of graduate and undergraduate research reports (theses and
long essays), so that such works can be accessed effectively in a number of Nigerian
universities. This is done within a given domain to support knowledge re-use, the emergence
of new ideas, and the implementation of existing works. An ontological method is used to
classify distinct entities, i.e. student research reports. This method has become popular in
knowledge engineering, cooperative information systems, intelligent information integration,
and knowledge management (Smith et al., 2001). The performance of the system is evaluated
using Recall and Precision. Visual Basic is adopted for the implementation of the system.
Key Words: Ontology, Information, Retrieval, Integration, Knowledge re-use, System.
Introduction
This research develops a realistic web-based framework for easy information retrieval from
previous academic works, and for the integration of such works across a number of Nigerian
universities. The study aims to solve some problems in the area of knowledge re-use, so that new
ideas can emerge and existing works can be implemented.
Currently, no system effectively represents the number and areas of research undertaken in the
individual universities: a system that would represent data on the research works and research
areas, showing the entities interpreted to exist in some area of interest and the relationships that
hold among them.
All the research reports that form our data in this study are represented in a table of
categories. Every type of entity is captured by some node within a hierarchical tree (Chisholm, 1996).
In this way, an ontological method is adopted to classify academic research reports as distinct entities.
The Visual Basic .NET programming language, with a supporting database on the Net, is adopted for
the implementation of this system. Precision and Recall are used as instruments to measure the
effectiveness of the system. We compared performance on a common set of queries and documents;
the results are presented in statistical tables.
The Problem:
One of the basic prerequisites for the award of an academic certificate in universities is that
candidates submit hard copies of their research reports. In many Nigerian universities,
particularly the twenty-seven universities taken as our case study, the submitted hard copies
(in forms such as Term Papers, Seminar Papers, Journal Articles, Dissertations, Theses, and Long
Essays) have become numerous and therefore cumbersome in terms of administrative record keeping
and data management. Research works and research areas are repeated by different researchers.
No system is responsible for research details in departments and faculties. There is no effective
representation of the number and areas of research undertaken in the individual universities,
and no system that represents data on the research works and research areas, showing the entities
interpreted to exist in some area of interest and the relationships that hold among them.
Information is power. Academic researchers rely on information from libraries and from previous
research in similar areas, such as sample research projects carried out by students, in addition to
references found on the internet. While other sources may be less tedious to use, obtaining reports
of research carried out by students in a department or faculty is usually discouraging. This is
because an ontology system to keep and manage such research reports, in the form of Long Essays
and Theses, is usually not in place.
The Study
This study aims to present a realistic system for ontology and information retrieval for
research works in selected Nigerian universities, together with a performance evaluation of the system.
The other aims are as follows:
1. To build an ontology system for data keeping and management relevant to research materials
in Nigerian universities.
2. To propose a mathematical model for Relevance Ranking for this research.
3. To proffer answers to questions relating to the accessibility and re-usability of knowledge
in an intelligent search engine.
Methodology
Material of Study:
Our major materials for this study are research reports from various departments in selected
Nigerian Universities.
Method of Study
In carrying out this study, we first identified the categories (e.g. Networking, Database,
Security) onto which the research topics and abstracts gathered from various departments could be
mapped. Adequate background knowledge pertaining to the research areas was also acquired and
represented in a way that facilitates the mapping. We then segmented the documents, employing
the background knowledge to map each document section to its corresponding category. We created
a distinct identity for each data item in a persistent data store. Finally, we enabled easy search
across the indexed documents by providing a user interface. This is achieved by linking each
research work title to its corresponding abstract/full text.
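The mapping step described above can be illustrated with a minimal Python sketch. It is not the system's implementation (which the paper states was written in Visual Basic .NET); the keyword sets in `BACKGROUND_KNOWLEDGE` and the function names are hypothetical, and the sketch assumes that a document is assigned to the category whose keywords occur most often in its title and abstract.

```python
import re

# Hypothetical background knowledge: category -> indicative keywords.
# These keyword sets are illustrative assumptions, not data from the study.
BACKGROUND_KNOWLEDGE = {
    "Networking": {"network", "router", "protocol", "bandwidth"},
    "Database": {"database", "sql", "query", "schema"},
    "Security": {"security", "encryption", "firewall", "authentication"},
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def map_to_category(title, abstract):
    """Return the category whose keywords occur most often, or None if none occur."""
    tokens = tokenize(title + " " + abstract)
    scores = {
        category: sum(1 for token in tokens if token in keywords)
        for category, keywords in BACKGROUND_KNOWLEDGE.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

For instance, a report titled "Design of a firewall" whose abstract mentions encryption and authentication on a campus network would be mapped to Security under these assumed keywords.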
Ontology
Computer and information science describes an ontology as an artifact that is designed for a
purpose, which is to enable the modeling of knowledge about some domain, real or imagined (Alani
and Brewster, 2005). It is a specification of a conceptualization that represents the objects,
concepts, and other entities which are interpreted to exist in some area of interest and the
relationships that hold among them (Gruber, 1995).
The ontological method is also used in the filtering, ranking and presentation of results,
covering quality issues such as contradictions and related information, i.e. a different possible
answer to the same query or an answer to a different but related query.
Ontology in computer science is specifically related to knowledge sharing and reuse: it is a
specification of a conceptualization which has some properties for knowledge sharing among
Artificial Intelligence (AI) software (Gruber, 1992).
In the traditional model used in the field of information retrieval, information is organized into
documents, and it is assumed that there are a large number of documents. Data contained in
documents is unstructured, without any associated schema. The process of information retrieval
consists of locating relevant documents on the basis of user input, such as keywords or example
documents (Silberschatz, 2001).
Information Retrieval
Information Retrieval can be described as the science of searching for information in
documents, or for the documents themselves. It also includes searching within databases, which
may be relational stand-alone databases or hypertextually networked databases like the World
Wide Web (Salton, 1989). Information Retrieval is also defined as the science of locating, from a
large document collection, those documents that fulfill a specified information need (Frakes and
Baeza-Yates, 1992).
It draws on other disciplines such as mathematics, library science, information science,
information architecture, cognitive psychology, linguistics, statistics and physics. Each of these
disciplines applies its own skills, theory, literature, and technologies.
In each of these areas, information overload is reduced through the use of an automated
Information Retrieval System (IRS). IRSs are used in places like universities and other tertiary
institutions. In libraries, an IRS makes access to books, journals, and other documents easy.
Programmers have created applications that perform information retrieval very well. Examples are
Web search engines like Google, Yahoo search and Live search (formerly MSN search) (Behrouz, 2003).
Information Retrieval techniques are employed in areas such as adversarial information
retrieval, automatic summarization, multi-document summarization, cross-lingual retrieval, document
classification, spam filtering, open source information retrieval, question answering, structured
document retrieval, and topic detection and tracking.
Ontology and Information Retrieval System
The research work is a Web-based package in which a number of components communicate
to achieve the required functionality. The main components of this system are an indexing
user interface, an indexing backend linked to an Ontology & IR System, and a search front end also
linked to an Ontology & IR System. The diagram below shows the components and their
interactions.
[Figure: Ontology and Information Retrieval System. Labelled components: Administrator, Web based
indexing front end, Indexing Backend, Ontology, Background knowledge, Ranking Module, system user,
Web based structured search interface. Labelled flows: Upload (new), Update, Save indexed doc.,
Query, Search Criteria, Input information, Contents, Outputs, expected results.]
The Indexing Backend is the component responsible for augmenting input research (project)
documents with metadata using background knowledge. It is an HTTP server that receives indexing
requests embedded in HTTP requests. All project documents are indexed sequentially in a remote
storage component (the database), where they are kept according to their ontology classes. The
Ranking Module then analyses each document in the various ontology categories to determine its
degree of relevance to the given search text. This analysis generates a ranking of the retrieved
project documents. The results are returned to the user in relational (tabular) form, in which the
abstract/full text can be downloaded through the document path field (address), and the total rank
of each document is also displayed.
Indexing User Interface:
This interface allows an authorized user to upload documents to a web server and then index
them through communication with the indexing backend.
Search Front end:
A Web search front end lets researchers easily retrieve the documents they require from
existing research works by keying in a search text. A query may be loose or specific, depending on
the number of keywords used. The more specific the query, the fewer documents are returned. After
a query is entered, it is converted to SQL and dispatched to the database in which the indexed
documents are kept. The relevant documents (output) are displayed as an HTML page containing a
list of indexed project documents that match the entered query.
The output includes the author's name, the supervisor's name, the project title, and a hyperlink
to the source document for downloading.
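The conversion of a search text into SQL can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard `sqlite3` module rather than the Visual Basic .NET stack named in the paper, and the `projects` table, its column names and the sample rows are hypothetical. Every keyword must match the title or abstract, so adding keywords narrows the result set.

```python
import sqlite3


def build_query(search_text):
    """Turn a search text into a parameterized SQL query that requires every keyword."""
    keywords = search_text.lower().split()
    if not keywords:
        raise ValueError("search text must contain at least one keyword")
    where = " AND ".join(
        "(lower(title) LIKE ? OR lower(abstract) LIKE ?)" for _ in keywords
    )
    sql = (
        "SELECT author, supervisor, title, doc_path FROM projects "
        "WHERE " + where + " ORDER BY title"
    )
    # Two placeholders per keyword: one for the title, one for the abstract.
    params = [pattern for kw in keywords for pattern in (f"%{kw}%", f"%{kw}%")]
    return sql, params


def search(conn, search_text):
    """Run the generated query and return the matching rows."""
    sql, params = build_query(search_text)
    return conn.execute(sql, params).fetchall()


def demo_connection():
    """Create an in-memory database with a few hypothetical indexed documents."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE projects (author TEXT, supervisor TEXT, title TEXT, "
        "abstract TEXT, category TEXT, doc_path TEXT)"
    )
    conn.executemany(
        "INSERT INTO projects VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("Author A", "Supervisor X", "Network security audit",
             "A study of firewall rules", "Security", "docs/a.doc"),
            ("Author B", "Supervisor Y", "Payroll database design",
             "Relational schema for payroll", "DBMS", "docs/b.doc"),
            ("Author C", "Supervisor X", "Database security model",
             "Access control for a database", "Security", "docs/c.doc"),
        ],
    )
    return conn
```

With the sample rows, `search(conn, "security")` returns two documents, while the more specific `search(conn, "database security")` returns one. Passing the keywords as bound parameters, rather than concatenating them into the SQL string, is a design choice that avoids SQL injection.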
Performance Evaluation
Category Match Measure (CMM) in Ontology
Category Match Measure (CMM) is used to evaluate the coverage of an ontology for the
given search text. Categories that contain all the search texts will obviously score higher than
others. Exact matches score higher than partial matches. For instance, using "security" as the
search text, a category of the ontology with exact matches for the search term will score higher in
this measure than categories which contain only partial matches.
Definition 1: Let C[o] be the set of categories in ontology o, and T the set of search texts.
$$E(o, T) = \sum_{c \in C[o]} \sum_{t \in T} I(c, t) \tag{1}$$
$$I(c, t) = \begin{cases} 1 & \text{if } label(c) = t \\ 0 & \text{if } label(c) \neq t \end{cases} \tag{2}$$
$$P(o, T) = \sum_{c \in C[o]} \sum_{t \in T} J(c, t) \tag{3}$$
$$J(c, t) = \begin{cases} 1 & \text{if } label(c) \text{ contains } t \\ 0 & \text{if } label(c) \text{ does not contain } t \end{cases} \tag{4}$$
where E(o, T) and P(o, T) are the number of classes of ontology o that have labels that match any of
the search texts T exactly or partially, respectively.
CMM(o, τ) = αE(o, T) + βP(o, T)    (5)
where CMM(o, τ) is the Category Match Measure for ontology o with respect to the search text
τ, and α and β are the exact matching and partial matching weight factors respectively.
Exact matching is favored over partial matching if α > β. In the example below, α = 0.3 and
β = 0.1, thus putting more emphasis on exact matching.
For example, taking "security" as the search text gives the following results based on the
Category Match Measure (CMM):
CMM(o, τ) = 0.3(2) + 0.1(1) = 0.7 (Networking)
CMM(o, τ) = 0.3(1) + 0.1(0) = 0.3 (Information System)
CMM(o, τ) = 0.3(6) + 0.1(1) = 1.9 (Security)
CMM(o, τ) = 0.3(1) + 0.1(0) = 0.3 (Web)
CMM(o, τ) = 0.3(2) + 0.1(2) = 0.8 (DBMS)
CMM(o, τ) = 0.3(2) + 0.1(0) = 0.3 (Programming)
CATEGORIES (CLASSES)    CMM
Networking              0.7
Information System      0.3
Security                1.9
Web                     0.3
DBMS                    0.8
Programming             0.3
Ranks based on CMM
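The measure in equations (1) to (5) can be sketched in Python as follows. The label list is hypothetical, and the sketch assumes that a label which equals the search term counts only as an exact match, not also as a partial one; this reading is suggested by the worked example, where the partial count is smaller than the exact count.

```python
def count_matches(labels, search_texts):
    """Return (E, P): the exact and partial label matches over all search texts.

    Assumption: a label equal to the term counts as exact only, so partial
    matches are labels that contain the term without being equal to it.
    """
    exact = partial = 0
    for label in labels:
        for term in search_texts:
            label_lc, term_lc = label.lower(), term.lower()
            if label_lc == term_lc:
                exact += 1
            elif term_lc in label_lc:
                partial += 1
    return exact, partial


def cmm(exact, partial, alpha=0.3, beta=0.1):
    """Category Match Measure: alpha * E + beta * P."""
    return alpha * exact + beta * partial


# Hypothetical class labels for one category of the ontology.
HYPOTHETICAL_LABELS = ["Security", "security", "Network security", "Firewall"]
```

With these labels and the search text "security", `count_matches` gives E = 2 and P = 1, so `cmm(2, 1)` is 0.3(2) + 0.1(1) = 0.7 up to floating-point rounding. The counts E = 6 and P = 1 reported for the Security category give 1.9 in the same way.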
[Figure: Ranks based on CMM. CMM (0 to 2) plotted for CATEGORIES (CLASSES) C1 to C6.
KEYS: C1 = Networking, C2 = Information System, C3 = Security, C4 = Web, C5 = DBMS,
C6 = Programming.]
Variation of Attached Weights
Different weights for α and β, namely (0.1 and 0.3), (0.2 and 0.4) and (0.3 and 0.5), are used in
the Category Match Measure in order to determine the optimal values. From the observations so far,
changing the values of α and β has little or no effect, as shown in the figure below. For the
purpose of this study, however, we adopt α = 0.3 and β = 0.1.
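One way to read "little or no effect" is that the ordering of the categories does not change when the weights change. The sketch below checks this for the exact and partial match counts of five of the categories in the "security" example, using the adopted weights and the three weight pairs listed above. Treating each pair as (α, β) in the order given is an assumption, as are the names used in the code.

```python
# (E, P) counts for the "security" example, for five of the categories.
COUNTS = {
    "Networking": (2, 1),
    "Information System": (1, 0),
    "Security": (6, 1),
    "Web": (1, 0),
    "DBMS": (2, 2),
}

# The adopted weights followed by the three pairs tried in the study,
# read here as (alpha, beta); that reading is an assumption.
WEIGHT_PAIRS = [(0.3, 0.1), (0.1, 0.3), (0.2, 0.4), (0.3, 0.5)]


def rank_categories(alpha, beta):
    """Return category names ordered by decreasing CMM (ties keep input order)."""
    scores = {name: alpha * e + beta * p for name, (e, p) in COUNTS.items()}
    return sorted(scores, key=lambda name: -scores[name])


def orderings():
    """Map each weight pair to the category ordering it produces."""
    return {pair: rank_categories(*pair) for pair in WEIGHT_PAIRS}
```

For these counts every weight pair yields the same ordering: Security, DBMS, Networking, then Information System and Web tied. Only the absolute CMM values move.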
[Figure: Variation of Attached Weights. CMM (0 to 3.5) plotted against weights (1 to 7) for
series a, b and c.]
Relevance Ranking Using Dynamically Acquired Background Knowledge
Although the set of all documents that satisfy a query expression may be very large, most
keyword queries on this system find a number of documents containing the keywords. The system
estimates the relevance of documents to a query and returns only highly ranked documents as answers.
The following gives a mathematical approach to relevance ranking using terms.
The first question to address is, given a particular term t, how relevant a particular
document d is to the term. One approach is to use the number of occurrences of the term in the
document as a measure of its relevance, on the assumption that relevant terms are likely to be
mentioned many times in a document. A raw count alone is not a good measure, however: first, the
number of occurrences depends on the length of the document, and second, a document containing
10 occurrences of a term may not be 10 times as relevant as a document containing one occurrence.
The relevance of a document d to a term t, r(d, t), is
$$r(d, t) = \log\left(1 + \frac{n(d, t)}{n(d)}\right)$$
where n(d) denotes the number of terms in the document d
and n(d, t) denotes the number of occurrences of the term t in the document d.
This metric takes the length of the document into account. The relevance grows with more
occurrences of a term in the document, although it is not directly proportional to the number of
occurrences.
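The metric can be computed directly, as in the short Python sketch below. The function name is illustrative, and the choice of a base-10 logarithm is an assumption: the formula does not state a base, but base 10 reproduces the r(d, t) values reported in the experimental results (for example, 5 occurrences in 176 terms gives 0.012166).

```python
import math


def relevance(term_count, doc_length):
    """r(d, t) = log10(1 + n(d, t) / n(d)); the base-10 logarithm is an assumption."""
    if doc_length <= 0:
        raise ValueError("doc_length must be positive")
    if term_count < 0:
        raise ValueError("term_count must not be negative")
    return math.log10(1 + term_count / doc_length)
```

The sub-linear growth is easy to check: for a 200-term document, `relevance(10, 200)` is less than ten times `relevance(1, 200)`, and the same count in a longer document scores lower.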
Ontology Ranking Metrics
Domain ontologies have become a useful technique and format for managing, maintaining and
retrieving knowledge. The ontology search facilities provided by libraries such as Ontolingua,
the DAML library and the OWL library are at best restricted to term search, making it difficult
for users to select relevant documents from ontologies (Alani, 2006).
Consequently, to achieve an effective level of knowledge reuse, we need a search engine that
overcomes the restrictions of the search facilities mentioned above in order to retrieve relevant
documents (research projects).
Swoogle and Ontosearch are examples of ontology search engines that have improved on
previous ones in selecting relevant ontologies for reuse. However, experts believe that more effort
is still needed on ontology search engines to make reuse even more effective (Alani and Brewster,
2005).
In this work, we present an ontology and information retrieval system that ranks ontologies
based on the analysis of their structure. It also facilitates the reuse of such knowledge
structures. Details of the metrics used in the ranking system are given in order to test the
ranking capability of this work.
For example, typing in a query such as "security" returns research documents from various
categories as follows:
TYPED-IN QUERY: "security"
The following documents, each identified by a corresponding letter, contain the search term
(security).
a    IS2(2006)Salu.doc
b    NW1(2005)Akindele.doc
c    NW2(2005)Gbenga.doc
d    SEC1(2007)Olatunji.doc
e    SEC2(2005)Akinbode.doc
f    WEB1(2007)Anufe.doc
g    DBMS3(2006)Abdulsalam.doc
h    WEB2(2005)Balogun.doc
i    SEC3(2005) Aroyehun.doc
j    SEC3(2007).doc
k    SEC3(2003).doc
l    SEC4(2007)Muhammed.doc
m    DBMS1(2005)Atolagbe.doc
n    NW4(2005)Abikoye.doc
o    SEC2(2006)Ogunwuyi.doc
p    DBMS1(2007)Akosile.doc
q    DBMS1(2006)Olajide.doc
r    Mabayoje .doc
Experimental Results
The Ontology and Information Retrieval System is tested and evaluated using the Precision
and Recall measures. A large number of research works were gathered. First, a collection of
queries is tested. The relevant research documents of the various ontologies can be selected
through an interface. The active ontological classifications used in this system are Networking,
Security, Database Management System, Information System, Programming and Artificial Intelligence.
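Precision and Recall follow their standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The Python sketch below illustrates the calculation; the function name and the document identifiers in the usage note are hypothetical and are not results from the study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    Either value is reported as 0.0 when its denominator is empty.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```

As an illustration, if a query returns four documents d1 to d4 and the relevant set is d1, d3 and d5, then two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3).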
A number of tests are performed in order to evaluate the effect of the ranking model features
on the degree of relevance of the retrieved documents. For example, keying in the query "security"
provided the following results according to degree of relevance:
User Query: "security"
Results are shown in the following diagrams:
This table is derived using the ranking method of Silberschatz et al. (2001).
Doc    n(d,t)    n(d)    r(d,t) = log(1 + n(d,t)/n(d))
A      5         176     0.012166
B      4         143     0.011981
C      4         160     0.010724
D      1         135     0.003205
E      4         160     0.010724
F      4         160     0.010724
G      3         141     0.009143
H      3         154     0.008379
I      4         185     0.00929
J      4         206     0.008352
K      2         96      0.008955
L      3         172     0.00751
M      3         142     0.00908
N      1         163     0.002656
O      1         148     0.002925
P      2         242     0.003574
Q      1         194     0.002233
R      7         6327    0.00048
That is, the relevance of a document d to a term t, r(d, t), is
$$r(d, t) = \log\left(1 + \frac{n(d, t)}{n(d)}\right)$$
where Doc = research project document,
n(d) = the number of terms in the document d, and
n(d, t) = the number of occurrences of the term t in the document d.
Final Returned Project Documents from the System:
Doc    SORTED RANKS
A      0.012166
B      0.011981
C      0.010724
E      0.010724
F      0.010724
I      0.00929
G      0.009143
M      0.00908
K      0.008955
H      0.008379
J      0.008352
L      0.00751
P      0.003574
D      0.003205
O      0.002925
N      0.002656
Q      0.002233
R      0.001211
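The sorted order of the returned documents can be reproduced from the n(d, t) and n(d) counts reported for the query "security". The Python sketch below is an illustration, not the system's code; it assumes a base-10 logarithm and relies on a stable sort, so that documents with equal ranks (C, E and F) stay in their original order.

```python
import math

# (n(d, t), n(d)) per document, as reported for the query "security".
COUNTS = {
    "A": (5, 176), "B": (4, 143), "C": (4, 160), "D": (1, 135),
    "E": (4, 160), "F": (4, 160), "G": (3, 141), "H": (3, 154),
    "I": (4, 185), "J": (4, 206), "K": (2, 96), "L": (3, 172),
    "M": (3, 142), "N": (1, 163), "O": (1, 148), "P": (2, 242),
    "Q": (1, 194), "R": (7, 6327),
}


def rank_documents(counts):
    """Return (doc, rank) pairs sorted by decreasing r(d, t).

    The base-10 logarithm is an assumption. Python's sort is stable, so
    documents with equal ranks keep their input order.
    """
    scored = [
        (doc, math.log10(1 + n_dt / n_d)) for doc, (n_dt, n_d) in counts.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Calling `rank_documents(COUNTS)` yields the documents in the order A, B, C, E, F, I, G, M, K, H, J, L, P, D, O, N, Q, R.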
[Figure: Graphical Representation of Final Returned Project Documents from the System. Rank (0 to
0.014) plotted for each doc/project in sorted order: A, B, C, E, F, I, G, M, K, H, J, L, P, D, O,
N, Q, R.]
Searching, Ranking and Sorting Of Documents
We aim for the system to provide quality search results efficiently. To achieve a
relatively quick response time, once a certain number of matching documents has been found, the
searcher automatically sorts the matched documents by rank and returns them to the user display
interface for reuse. This means that sub-optimal results may be returned. The procedure is
represented in the figure below:
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the database for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are not at the end of any doclist go back to step 4.
7. Sort the documents that have matched by rank and return them to the user display interface
for reuse.
Searching, Ranking and Sorting of Documents
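The seven steps can be sketched in Python as shown below. This is an illustration under stated assumptions, not the system's implementation: the in-memory `lexicon` (word to wordID), the `doclists` (wordID to a sorted list of document IDs) and the `rank` callback are hypothetical stand-ins for the database structures, and the optional `limit` models stopping early once a certain number of matches has been found.

```python
def search(query, lexicon, doclists, rank, limit=None):
    """Return the IDs of documents matching every query word, best rank first."""
    # Step 1: parse the query.
    words = query.lower().split()
    if not words:
        return []
    # Step 2: convert words into wordIDs; an unknown word cannot match anything.
    try:
        word_ids = [lexicon[word] for word in words]
    except KeyError:
        return []
    # Step 3: seek to the start of the doclist for every word.
    lists = [doclists.get(word_id, []) for word_id in word_ids]
    positions = [0] * len(lists)
    matches = []
    # Steps 4 and 6: scan until the end of any doclist (or the limit) is reached.
    while all(pos < len(lst) for pos, lst in zip(positions, lists)) and (
        limit is None or len(matches) < limit
    ):
        current = [lst[pos] for pos, lst in zip(positions, lists)]
        target = max(current)
        if all(doc_id == target for doc_id in current):
            # Step 5: the document matches all terms, so compute its rank.
            matches.append((target, rank(target, words)))
            positions = [pos + 1 for pos in positions]
        else:
            # Advance only the doclists that are still behind the largest ID.
            positions = [
                pos + 1 if lst[pos] < target else pos
                for pos, lst in zip(positions, lists)
            ]
    # Step 7: sort the matched documents by rank, highest first.
    matches.sort(key=lambda match: match[1], reverse=True)
    return [doc_id for doc_id, _ in matches]
```

With `lexicon = {"network": 0, "security": 1}` and `doclists = {0: [1, 3, 5], 1: [2, 3, 5, 8]}`, the query "network security" matches documents 3 and 5. If the rank callback scores document 5 higher, the result is `[5, 3]`. With `limit=1` the scan stops after the first match and returns `[3]`, which shows how early termination can return a sub-optimal result.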
Discussion
Submitted hard copies of research reports, a prerequisite for the award of academic
certificates, have over time become numerous in various departments in universities and other
tertiary schools (Computer Science Department, University of Ilorin; a case study). In many cases
the volumes have become bulky, and getting information from them for the purpose of knowledge
reuse has not been easy or effective.
This situation often causes repetition of research works and research areas by different
researchers, because there is no integrated management system. Information regarding the
contribution of student researchers to knowledge in various fields has not been adequate,
especially during accreditation. Information provided during accreditation is usually insufficient
or incoherent. This is also due to the lack of a system that effectively represents the number and
areas of research undertaken in the university: a system that represents data on the research
works and research areas in terms of the entities interpreted to exist in some area of interest and
the relationships that hold among them.
Ontological methods have become popular in the domain of knowledge reuse as well as
knowledge integration and management. Following this trend, our Ontological System for Information
Retrieval and Integration will, among other things, integrate academic research reports (long
essays and theses) so that subsequent researchers (graduates and undergraduates) can reuse the
relevant information in their own areas of research.
Data for such academic reports (projects) are uploaded from the back end of the system.
Users draw on the uploaded data at the front end by typing in the required information in text
format. The system is flexible, so that updates and upgrades can be made. The system retrieves
documents through the following front-end procedure: parse the query; convert words into wordIDs;
seek to the start of the doclist in the database for every word; scan through the doclists until
there is a document that matches all the search terms; compute the rank of that document for the
query; sort the documents that have matched by rank and return the results to the user display
interface for reuse.
Documents can be added to the database by a user with knowledge of the domain. This takes
two forms: uploading new documents and updating existing ones. A new document is uploaded through
the following back-end procedure: enter the document and the necessary information through the
administrative end; the document is then saved automatically into the database under the
appropriate ontology category. To update existing data, i.e. background knowledge, substitute the
background knowledge with current relevant information.
Recommendation
As knowledge reuse continues to gain popularity in knowledge engineering, especially in the
field of computer and information science, this research has developed a new approach to
information retrieval in academic settings, particularly in Nigerian universities. In pursuit of
the goal of this study, we have developed a system for integrating research reports. This system,
the Ontological System for Information Retrieval and Integration, is intended to serve knowledge
reuse. The system is expected to require continued upgrading in order to fit the developing
research and academic goals of each university.
It is recommended that prospective researchers in prospective universities be required to
submit a soft copy of detailed information regarding the area of research being undertaken, in
addition to the hard copy of such approved works. This would provide an easy and efficient means
of uploading and updating the data stored in the system.
References:
Alani, H. & Brewster, C. (2005). Ontology Ranking Based on the Analysis of Concept
Structures. In 3rd Int. Conf. Knowledge Capture (K-Cap), pages 51–58, Banff, Canada.
Alani, H. (2006). Ontology Construction from Online Ontologies. 15th International World Wide
Web Conference, Edinburgh.
Behrouz, A. F. (2003). Data Communications and Networking. New York: McGraw-Hill.
Chisholm, R. (1996). A Realistic Theory of Categories--An Essay on Ontology (1 ed.).
Cambridge University Press.
Frakes, W. B. & Baeza-Yates, R. (1992). Information Retrieval: Data Structures and Algorithms.
NY: Prentice-Hall.
Gruber, T. (1992). Ontolingua: A mechanism to support portable ontologies. Technical
Report KSL91-66, Stanford University, Knowledge Systems Laboratory.
Gruber, T. (1995). Toward Principles for the Design of Ontologies Used for Knowledge
Sharing. International Journal Human-Computer Studies. Vol. 43, Issues 5-6, November
1995, p. 907-928.
Salton, G. (1989). Automatic text processing: The transformation, analysis and retrieval of
information by computer. Reading, MA: Addison-Wesley.
Silberschatz, A. et al. (2001). Database System Concepts. New York: McGraw-Hill.
Smith, Barry & David M. Mark (2001). Geographic Categories: An Ontological Investigation.
International Journal of Geographic Information Science. Vol. 15 February 2000, P. 66-67.
Smith, et al. (Eds.) (2001). Usability evaluation and interface design: Cognitive engineering,
intelligent agents and virtual reality. Proceedings of HCI International 2001, Volume 1.