DYNAMIC INFORMATION EXTRACTION (DIE) ALGORITHM FOR KNOWLEDGE ACCESSIBILITY AND REUSABILITY IN TERTIARY SCHOOLS

1Mabayoje M. A. and 2Olabiyisi S. O.
1Department of Computer Science, Faculty of Communication and Information Sciences, University of Ilorin, PMB 1515, Ilorin, Kwara-Nigeria. Services100ng@yahoo.com
2Department of Computer Science and Engineering, Ladoke Akintola University of Technology, Ogbomosho, Oyo-Nigeria. tundeolabiyisi@hotmail.com
Abstract
Carrying out successful research in any field of knowledge requires access to information sources such as the literature and other relevant materials. Among the basic information sources for research in tertiary education are previous research reports in the form of long essays and dissertations. Currently, accessing available research materials such as dissertations is cumbersome. There is a need to provide for and enhance knowledge reusability in Nigerian tertiary schools. This would be achieved through a system by which such dissertations (as research information sources) can be integrated.
The purpose of this study is to present a Dynamic Information Extraction (DIE) Algorithm, built with the help of an ontology approach, that can satisfy the need identified above. It aims at providing steps towards the development of an application/system for easy and effective access to information regarding research reports. The algorithm will serve the purpose of integrating academic research reports so that subsequent researchers can easily locate and reuse relevant information in their individual areas of research.
Keywords: Information, Extraction, Integration, Algorithm, Ontology, Research.
Introduction
This paper develops a Dynamic Information Extraction (DIE) Algorithm for Knowledge Accessibility and Reusability in Tertiary Schools. The algorithm will serve the purpose of easy retrieval of information as well as integration of research works in Nigerian tertiary schools. It is aimed at providing solutions to some problems in the area of knowledge reuse, so that new ideas can emerge or existing works can be implemented.
Currently, there is no system responsible for effectively representing the number and areas of research so far embarked upon in tertiary schools; that is, a system that would represent data on research projects and research areas, showing the entities interpreted to exist in some area of interest and the relationships that hold among them.
All the research reports that form the data in this study are represented in a table of categories, in which every type of entity is captured by some node within a hierarchical tree [1]. By this, an ontological method is adopted to classify distinct entities, i.e. student research reports.
The Visual Basic.NET programming language, with a supporting database on the Net, can be adopted in the implementation of this algorithm. Precision and Recall serve as instruments to measure the effectiveness of the system.
The Derivation of the Algorithm
The need to check repetition of research works and to provide access to relevant materials in tertiary schools during academic research has become imperative. This realization has made it necessary to develop a Knowledge Management System (KMS). As a means to an end, this algorithm is derived from the need to evolve a system that can assist with knowledge accessibility and reusability issues in intelligent search engines. The algorithm as presented in this study forms an inroad to the development of a real system that can work in the area of information retrieval and the integration of existing research materials and relevant data.
Related Concepts
Ontology
The field of computer and information science describes an ontology as an artifact that is designed for a purpose, which is to enable the modeling of knowledge about some domain, real or imagined [2]. It is a specification of a conceptualization that represents the objects, concepts, and other entities which are interpreted to exist in some area of interest, and the relationships that hold among them [3].
The ontological method is also used in the filtering, ranking and presentation of results, covering quality issues such as contradictions and related information, i.e. a different possible answer to the same query, or an answer to a different but related query.
Ontology in computer science is specifically related to knowledge sharing and reuse: a specification of a conceptualization which has some properties for knowledge sharing among Artificial Intelligence (AI) software [4].
Information Retrieval
Information Retrieval is the scientific method of searching for information, either within documents or for the documents themselves. It also includes searching within databases, which could be either relational stand-alone databases or hypertextually-networked databases like the World Wide Web [5]. It is the science of locating, from a large document collection, those documents that fulfill a specified information need [6].
It also draws on other areas of science such as mathematics, library science, information science, information architecture, cognitive psychology, linguistics, statistics and physics. However, each of these areas contributes its own skills, theory, literature, and technologies.
In each of these areas, information overload is reduced through the use of an automated Information Retrieval System (IRS). IRSs are used in places like universities and other tertiary institutions; access to books, journals, and other documents in libraries is easily achieved by the use of an IRS. Programmers have succeeded in creating applications that function very well in information retrieval; examples are found within Web search engines like Google, Yahoo Search, and Live Search (formerly MSN Search) [7].
Areas where Information Retrieval techniques are employed include adversarial information retrieval, automatic summarization, multi-document summarization, cross-lingual retrieval, document classification, spam filtering, open-source information retrieval, question answering, structured document retrieval, and topic detection and tracking.
Measuring Retrieval Effectiveness
It is interesting to note that the technique for measuring retrieval effectiveness has been largely influenced by the particular retrieval strategy adopted and the form of its output. Two metrics can be used to measure how well an information-retrieval system is able to answer queries. The first is Precision, which measures what percentage of the retrieved documents are actually relevant to the query. The second is Recall, which measures what percentage of the documents relevant to the query were retrieved. Both Precision and Recall are expressed as percentages [8].
In addition to Precision and Recall, two further measures of assessment have been proposed: Fall-out and F-measure. All of these measures assume a ground-truth notion of relevancy, i.e. every document is known to be either relevant or non-relevant to a particular query.
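As an illustration, a minimal sketch of the two metrics in Python (the set-based representation of retrieved and relevant documents is an assumption made for clarity, not part of the paper):

def precision_recall(retrieved, relevant):
    """Precision: % of retrieved documents that are relevant.
    Recall: % of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)               # relevant documents retrieved
    precision = 100.0 * hits / len(retrieved) if retrieved else 0.0
    recall = 100.0 * hits / len(relevant) if relevant else 0.0
    return precision, recall

# e.g. precision_recall({"d1", "d2", "d3"}, {"d1", "d4"}) -> (33.33..., 50.0)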
Database and Database Management System
A database is a large store of data held in a computer and made easily accessible for needed purposes [9]. A single organized collection of structured data, stored with minimum duplication of data items in order to provide a consistent and controlled pool of data, is also referred to as a database [10].
Databases are classified according to their approach to organization. They include:
a) The Relational Model, which uses a type of table called a relation to represent both data and the relationships among those data. Each table has multiple columns, and each column has a unique name.
b) The Network Model, in which data are represented by collections of records, and relationships among data are represented by links, which can be viewed as pointers.
c) The Hierarchical Model, which is similar to the network model in the sense that data and relationships among data are represented by records and links [8].
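As a brief illustration of the relational model described in (a), a sketch using Python's built-in sqlite3 module (the `projects` table and its columns are hypothetical, chosen to match this paper's domain):

import sqlite3

# A relation with uniquely named columns, holding both data and relationships.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE projects (
                    id       INTEGER PRIMARY KEY,
                    title    TEXT,
                    category TEXT)""")
conn.execute("INSERT INTO projects (title, category) VALUES (?, ?)",
             ("Ontology-based retrieval", "Database"))
for row in conn.execute("SELECT title, category FROM projects"):
    print(row)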
[8] explains a database management system as consisting of a collection of interrelated data and a set of programs to access those data. It constructs, expands and maintains the database, and makes available the controlled interface between the user and the data in the database. Through its maintenance of indices, the database management system allocates storage to data so that any required data can be retrieved and separate items of data in the database can be cross-referenced [10, 11].
Algorithm
An algorithm is a set of rules or procedures that must be followed in solving a particular problem [12]. [13] described an algorithm as a procedure or formula for solving a problem.
METHOD
This framework presents the following dynamic algorithm, which can be implemented as a means of achieving knowledge accessibility and reusability in tertiary schools. Implementation of this algorithm will proffer answers to questions relating to accessibility and reusability of knowledge in intelligent search engines.
DYNAMIC INFORMATION EXTRACTION ALGORITHM:
// Record to represent each document in a directory
doc : RECORD
  index     : integer
  WordCount : integer
END

// Determine the total number of documents in the directory
TotalNumberOfDoc = 0
OPEN Directory
DO UNTIL End-of-Directory
  READ Document
  TotalNumberOfDoc = TotalNumberOfDoc + 1
END DO

// Define the document records to operate with
Doc[TotalNumberOfDoc] : doc

// Determine the number of occurrences of a SearchText in each project document
INPUT SearchText
index = 1
DO UNTIL index > TotalNumberOfDoc
  Doc[index].WordCount = 0
  OPEN Doc[index] (for reading)
  DO UNTIL EOF(Doc[index])
    READ Aword
    IF SearchText = Aword THEN
      Doc[index].WordCount = Doc[index].WordCount + 1
    ENDIF
  END DO
  CLOSE Doc[index]
  index = index + 1
END DO

// Rearrange the documents in descending order of WordCount (the degree of
// relevance of a project document to the term SearchText) and display them
// in that order.
FOR i = 1 TO TotalNumberOfDoc - 1 DO
  FOR j = 2 TO TotalNumberOfDoc - i + 1 DO
    IF Doc[j-1].WordCount < Doc[j].WordCount THEN
      SWAP(Doc[j-1], Doc[j])
    ENDIF
  END FOR
END FOR

index = 1
DO UNTIL index > TotalNumberOfDoc
  PRINT Doc[index]
  index = index + 1
END DO

// Subroutine for swap (parameters passed by reference)
SWAP(Doc1, Doc2)
BEGIN
  tempDoc : doc
  tempDoc = Doc1
  Doc1    = Doc2
  Doc2    = tempDoc
END
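As a rough illustration of how the counting and ranking steps above might be realized (the paper suggests Visual Basic.NET; Python is used here purely as a sketch, and the directory layout and word tokenization are assumptions):

import os
import re

def die_rank(directory, search_text):
    """Count occurrences of search_text in every document in the directory,
    then return the documents in descending order of WordCount."""
    ranked = []
    for name in os.listdir(directory):            # each project document
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8", errors="ignore") as fh:
            words = re.findall(r"\w+", fh.read().lower())
        ranked.append((name, words.count(search_text.lower())))
    ranked.sort(key=lambda pair: pair[1], reverse=True)   # descending relevance
    return ranked

# Usage (hypothetical directory of project documents):
# for doc, count in die_rank("projects/", "security"):
#     print(doc, count)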
Data on research reports in tertiary schools can be collected for use in evaluating the performance of this information retrieval system.
Augmenting the various document sections with metadata involves a number of steps (a sketch of the category-mapping step follows this list):
• Identifying the various categories onto which project topics with abstracts can be mapped with respect to the various departments (e.g. Computer Science as a department containing categories such as Networking, Database, Security, etc.).
• Acquiring and representing background knowledge in a way that can facilitate the mapping of project topics with abstracts into the identified categories.
• Segmenting the documents and employing background knowledge to map each document section to its corresponding category.
• Storing structured index information in a persistent data store.
• Providing a user interface to enable search across indexed documents.
• Linking each research work title to its corresponding abstract/full text.
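A minimal sketch of the category-mapping step under assumed background knowledge (the keyword sets below are hypothetical stand-ins, not the paper's actual ontology):

# Hypothetical background knowledge: category -> indicative keywords
CATEGORIES = {
    "Networking": {"network", "protocol", "router", "bandwidth"},
    "Database":   {"database", "sql", "schema", "query"},
    "Security":   {"security", "encryption", "firewall", "authentication"},
}

def map_to_category(topic_and_abstract):
    """Map a project topic plus abstract to the best-matching category
    by counting keyword overlaps."""
    words = set(topic_and_abstract.lower().split())
    scores = {cat: len(words & kws) for cat, kws in CATEGORIES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Uncategorized"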
Overview of the Proposed System
The research work is a Web-based package in which a number of components communicate with one another to achieve the required functionality. The main components of this system are: an indexing user interface, an indexing backend linked to an Ontology Database System, and a search front end also linked to the Ontology Database System. The figure below shows the various components and their interactions.
[Figure: System components and their interactions. The Administrator supplies input information contents through a Web-based indexing front end to the Indexing Backend, which saves indexed documents to, updates, and queries the Ontology Database holding the background knowledge. The system user submits search criteria through a Web-based structured search interface to the Ranking Module, which matches documents against the user query statement and displays the outputs (expected results) in decreasing order of relevance degree.]
The Indexing Backend is the component responsible for augmenting input research (project) documents with metadata using background knowledge. It is an HTTP server capable of receiving indexing requests embedded in HTTP requests. All project documents are indexed sequentially in a remote storage component (the database), where they are kept according to their ontology classes. The Ranking Module then analyses each document in the various ontology categories to determine its degree of relevance to the given search text. This analysis generates a ranking of the retrieved project documents, and the results are returned to the users in relation form, in which the abstract/full text can be downloaded through the document path field (address); their total ranks are displayed as well.
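A minimal sketch of such an HTTP indexing endpoint, using only Python's standard library (the X-Category header and the in-memory store are assumptions standing in for the Ontology Database):

from http.server import BaseHTTPRequestHandler, HTTPServer

INDEX = {}   # hypothetical store: category -> list of document texts

class IndexingBackend(BaseHTTPRequestHandler):
    """Receives indexing requests embedded in HTTP requests and stores
    each document under its ontology category."""
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        category = self.headers.get("X-Category", "Uncategorized")  # assumed header
        INDEX.setdefault(category, []).append(body)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"indexed")

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), IndexingBackend).serve_forever()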
Indexing User Interface:
This interface allows an authorized user to upload documents to a web server and then index each uploaded document through communication with the indexing backend.
Search Front End:
A Web search front end gives room for researchers to easily retrieve the documents they require from existing research works, for further research, by keying in a search text. The query can be loose or specific depending on the number of keywords used: the more specific the query, the fewer the documents returned. After a query is entered, it is converted to SQL and dispatched to the database in which the indexed documents are kept. The relevant documents (the output) are displayed in the form of an HTML page containing a list of indexed project documents that match the entered query.
The output includes the authors' names, the supervisor's name, the project titles, and a hyperlink to the source documents for downloading.
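A sketch of the query-to-SQL conversion step (the `projects` table and its columns are hypothetical; parameter binding keeps the generated SQL safe):

def query_to_sql(search_text):
    """Convert a keyword query into a parameterized SQL statement.
    More keywords make the query more specific, so fewer documents match."""
    keywords = search_text.split()
    if not keywords:
        return "SELECT author, supervisor, title, doc_path FROM projects", []
    where = " AND ".join("(title LIKE ? OR abstract LIKE ?)" for _ in keywords)
    params = []
    for kw in keywords:
        params += [f"%{kw}%", f"%{kw}%"]
    return f"SELECT author, supervisor, title, doc_path FROM projects WHERE {where}", params

# e.g. query_to_sql("network security") matches only documents containing both terms.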
Category Match Measure (CMM) in Ontology
The Category Match Measure (CMM) is used to evaluate the coverage of an ontology for the given search text. Categories that contain all the search texts will obviously score higher than others, and exact matches score higher than partial matches. For instance, using "security" as the search text, an ontology category whose label exactly matches the search term will score higher in this measure than categories that only partially match.
Definition 1: Let C[o] be the set of categories in ontology o, and let T be the set of search texts.
E(o, T) = \sum_{c \in C[o]} \sum_{t \in T} I(c, t)    (1)

I(c, t) = \begin{cases} 1 & \text{if } label(c) = t \\ 0 & \text{if } label(c) \neq t \end{cases}    (2)

P(o, T) = \sum_{c \in C[o]} \sum_{t \in T} J(c, t)    (3)

J(c, t) = \begin{cases} 1 & \text{if } label(c) \text{ contains } t \\ 0 & \text{otherwise} \end{cases}    (4)

where E(o, T) and P(o, T) are the numbers of classes of ontology o whose labels match any of the search texts in T exactly or partially, respectively.

CMM(o, T) = \alpha E(o, T) + \beta P(o, T)    (5)

where CMM(o, T) is the Category Match Measure for ontology o with respect to the set of search texts T, and \alpha and \beta are the exact-matching and partial-matching weight factors, respectively. Exact matching is favored over partial matching if \alpha > \beta.
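A small sketch of Equations (1)-(5) (the weight values, and the decision to count a partial match only when it is not also exact, are illustrative interpretations):

def cmm(category_labels, search_texts, alpha=0.6, beta=0.4):
    """Category Match Measure: CMM(o, T) = alpha*E(o, T) + beta*P(o, T),
    with alpha > beta so that exact matches outweigh partial ones."""
    exact = sum(1 for c in category_labels for t in search_texts
                if c == t)                      # E(o, T): label(c) = t
    partial = sum(1 for c in category_labels for t in search_texts
                  if t in c and c != t)         # P(o, T): label(c) contains t
    return alpha * exact + beta * partial

# e.g. cmm(["security", "network security"], ["security"]) -> 0.6 + 0.4 = 1.0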
RELEVANCE RANKING USING DYNAMICALLY ACQUIRED BACKGROUND KNOWLEDGE
Although the set of all documents that satisfy a query expression may be very large, most keyword queries on this system find a manageable number of documents containing the keywords. The system estimates the relevance of documents to a query and returns only highly ranked documents as answers.
The following gives a mathematical approach to relevance ranking using terms. The first question to address is: given a particular term t, how relevant is a particular document d to that term? The number of occurrences of the term in the document is used as a measure of its relevance, on the assumption that relevant terms are likely to be mentioned many times in a document. The raw count cannot be used directly, however: first, the number of occurrences depends on the length of the document, and second, a document containing 10 occurrences of a term may not be 10 times as relevant as a document containing one occurrence.
The relevance of a document d to a term t, i.e. r(d, t), is therefore

r(d, t) = \log\left(1 + \frac{n(d, t)}{n(d)}\right)

where n(d) is the number of terms in the document and n(d, t) denotes the number of occurrences of the term t in the document d. This metric takes the length of the document into account: the relevance grows with more occurrences of a term in the document, although it is not directly proportional to the number of occurrences.
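A direct sketch of this relevance metric (the list-of-words document representation is an assumption):

import math

def relevance(document_words, term):
    """r(d, t) = log(1 + n(d, t) / n(d)): grows with occurrences of `term`
    but is damped by document length and is not directly proportional."""
    n_d = len(document_words)              # n(d): number of terms in the document
    n_dt = document_words.count(term)      # n(d, t): occurrences of t in d
    return math.log(1 + n_dt / n_d) if n_d else 0.0

# e.g. relevance(["security", "in", "networks", "security"], "security")
#      -> log(1 + 2/4) ≈ 0.405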
DISCUSSION AND RECOMMENDATION
The goal of this study is to present an algorithm that can be easily implemented. It will be a channel for easy integration of, and access to, existing research works, and it will go a long way toward eliminating the repetition of research works in tertiary schools. It will also help to provide adequate answers to questions on research works during accreditation exercises, and it will enhance knowledge accessibility and reusability in tertiary schools.
Implementation of this algorithm will give room for a flexible system in which data on academic reports (projects) can be uploaded from the back end of the system. Users can make use of the uploaded data, in terms of the relevant information they need, at the front end; this is achieved by typing the required information in text format. The system will be flexible, so that updates and upgrades can be carried out. Documents can be retrieved by the system through the following front-end procedures: parse the query; convert words into wordIDs; seek to the start of the doclist in the database for every word; scan through the doclists until there is a document that matches all the search terms; compute the rank of that document for the query; sort the documents that have matched by rank and return the results to the user display interface for reuse. (A sketch of these steps follows.)
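A minimal sketch of those front-end steps over an assumed in-memory index (the `word_ids`, `doclists`, and `ranks` structures are hypothetical stand-ins for the database):

def retrieve(query, word_ids, doclists, ranks):
    """Parse the query, map words to wordIDs, scan the doclists for documents
    matching all search terms, and return them sorted by rank (descending)."""
    ids = [word_ids[w] for w in query.lower().split() if w in word_ids]
    if not ids:
        return []
    matching = set(doclists[ids[0]])
    for wid in ids[1:]:                   # scan the remaining doclists
        matching &= set(doclists[wid])    # keep documents matching every term
    return sorted(matching, key=lambda d: ranks.get(d, 0.0), reverse=True)

# e.g. retrieve("network security",
#               {"network": 0, "security": 1},
#               {0: ["docA", "docB"], 1: ["docB", "docC"]},
#               {"docB": 0.7}) -> ["docB"]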
It is thus recommended that prospective researchers in tertiary schools be made to submit a soft copy of detailed information regarding the area of research being embarked upon, in addition to the hard copy of such approved works. This would provide an easy and efficient means of uploading, as well as updating, the data stored in the system.
REFERENCES
[1] S. Doyle, Information Systems for You (Cheltenham: Stanley Thornes (Publishers) Ltd., 1996).
[2] H. Alani & C. Brewster, Ontology ranking based on the analysis of concept structures, Proc. 3rd Int. Conf. on Knowledge Capture (K-CAP), Banff, Canada, 2005, 51-58.
[3] T. Gruber, Toward principles for the design of ontologies used for knowledge sharing, International Journal of Human-Computer Studies, 43(5-6), November 1995, 907-928.
[4] T. Gruber, Ontolingua: A mechanism to support portable ontologies, Technical Report KSL 91-66, Stanford University, Knowledge Systems Laboratory, 1992.
[5] G. Salton, Automatic text processing: The transformation, analysis and retrieval of information by computer (Reading, MA: Addison-Wesley, 1989).
[6] W.B. Frakes & R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms (NY: Prentice-Hall, 1992).
[7] B.A. Forouzan, Data Communications and Networking (New York: McGraw-Hill, 2003).
[8] A. Silberschatz et al., Database System Concepts (New York: McGraw-Hill, 2001).
[9] A.S. Hornby, Oxford Advanced Learner's Dictionary of Current English, fifth edition (Oxford: Oxford University Press, 2000).
[10] C.S. French, Data Processing and Information Technology (London: Continuum, 2001).
[11] M.A. Mabayoje, Ontology and Information Extraction System: A Model for Integrating Research Works, M.Sc. Thesis, University of Ilorin, Ilorin, Nigeria, 2009.
[12] R.L. Shackelford, Introduction to Computing and Algorithms (USA: Addison Wesley Longman Inc., 1998).
[13] N. Wirth, Algorithms + Data Structures = Programs (New Delhi: Prentice-Hall of India, 2005).