DYNAMIC INFORMATION EXTRACTION (DIE) ALGORITHM FOR KNOWLEDGE ACCESSIBILITY AND REUSABILITY IN TERTIARY SCHOOLS

1 Mabayoje M. A. and 2 Olabiyisi S. O.

1 Department of Computer Science, Faculty of Communication and Information Sciences, University of Ilorin, PMB 1515, Ilorin, Kwara, Nigeria. Services100ng@yahoo.com
2 Department of Computer Science and Engineering, Ladoke Akintola University of Technology, Ogbomosho, Oyo, Nigeria. tundeolabiyisi@hotmail.com

Abstract
Carrying out successful research in any field of knowledge requires access to information sources such as literature and other relevant materials. One of the basic information sources for research in tertiary education is previous research reports, in the form of long essays and dissertations. Currently, accessing available research materials such as dissertations is cumbersome. There is a need to provide for, and enhance, knowledge reusability in Nigerian tertiary schools. This can be achieved through a system by which such dissertations (as research information sources) are integrated. The purpose of this study is to present a Dynamic Information Extraction (DIE) Algorithm, supported by an ontology approach, that satisfies the need identified above. The algorithm provides steps toward the development of an application/system for easy and effective access to information on research reports. It serves the purpose of integrating academic research reports so that subsequent researchers can easily locate and reuse relevant information in their individual areas of research.

Keywords: Information, Extraction, Integration, Algorithm, Ontology, Research.

Introduction
This paper develops a Dynamic Information Extraction (DIE) Algorithm for Knowledge Accessibility and Reusability in Tertiary Schools. It serves the purpose of easy retrieval of information, as well as integration of research works, in Nigerian tertiary schools. It aims to provide solutions to some problems in the area of knowledge reuse, allowing new ideas to emerge and existing works to be implemented. Currently, there is no system responsible for effective representation of the number and areas of research so far embarked upon in tertiary schools; that is, a system that would represent data on the research projects and research areas, showing the entities interpreted to exist in some area of interest and the relationships that hold among them. All the research reports that form the data in this study are represented in a table of categories, in which every type of entity is captured by some node within a hierarchical tree [1]. By this, an ontological method is adopted to classify distinct entities, i.e. student research reports. The Visual Basic .NET programming language, which supports databases on the Net, can be adopted in the implementation of this algorithm. Precision and Recall serve as instruments to measure the effectiveness of the system.

The Derivation of the Algorithm
The need to check repetition of research works, and to provide access to relevant materials in tertiary schools during academic research, has become imperative. This realization has made the development of a Knowledge Management System (KMS) necessary. As a means to an end, this algorithm is derived from the need to evolve a system that can assist with knowledge accessibility and reusability issues in an intelligent search engine.
This algorithm, as presented in this study, forms an in-road to the development of a real system that can work in the area of information retrieval and integration of existing research materials and relevant data.

Related Concepts

Ontology
The field of computer and information science uses ontology as a technical term denoting an artifact that is designed for a purpose, which is to enable the modeling of knowledge about some domain, real or imagined [2]. It is a specification of a conceptualization that represents the objects, concepts, and other entities interpreted to exist in some area of interest, and the relationships that hold among them [3]. The ontological method is also used in the filtering, ranking, and presentation of results, covering quality issues such as contradictions and related information, i.e. a different possible answer to the same query, or an answer to a different but related query. Ontology in computer science is specifically related to knowledge sharing and reuse: a specification of a conceptualization that has properties suitable for knowledge sharing among Artificial Intelligence (AI) software [4].

Information Retrieval
Information Retrieval is the scientific method of searching for information in documents, or searching for the documents themselves. It also includes searching within databases, which could be either standalone relational databases or hypertextually networked databases like the World Wide Web [5]. It is the science of locating, from a large document collection, those documents that fulfill a specified information need [6]. It also draws on other areas of science such as mathematics, library science, information science, information architecture, cognitive psychology, linguistics, statistics, and physics; each of these areas contributes its own skills, theory, literature, and technologies. In each of these areas, information overload is reduced through the use of an automated Information Retrieval System (IRS). IRSs are used in places like universities and other tertiary institutions: access to books, journals, and other documents is easily achieved in libraries by the use of an IRS. Programmers have succeeded in creating applications that function very well in information retrieval; examples are Web search engines like Google, Yahoo Search, and Live Search (formerly MSN Search) [7]. Areas where Information Retrieval techniques are employed include adversarial information retrieval, automatic summarization, multi-document summarization, cross-lingual retrieval, document classification, spam filtering, open-source information retrieval, question answering, structured document retrieval, and topic detection and tracking.

Measuring Retrieval Effectiveness
The technique of measuring retrieval effectiveness has been largely influenced by the particular retrieval strategy adopted and the form of its output. Two metrics can be used to measure how well an information retrieval system answers queries. The first is Precision, which measures what percentage of the retrieved documents are actually relevant to the query. The second is Recall, which measures what percentage of the documents relevant to the query were retrieved. Both Precision and Recall are expressed as percentages [8]. In addition to Precision and Recall, two further measures of assessment have been proposed: Fall-out and F-measure. All of these measures assume a ground-truth notion of relevancy: every document is known to be either relevant or non-relevant to a particular query.
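To make these definitions concrete, the following minimal Python sketch computes Precision, Recall, and the balanced F-measure for a single query, given the set of documents the system retrieved and the ground-truth set of relevant documents. The document identifiers in the example are hypothetical.

    def precision_recall_f(retrieved, relevant):
        """Compute Precision and Recall (as percentages) and the balanced
        F-measure for one query, given ground-truth relevance judgments."""
        retrieved, relevant = set(retrieved), set(relevant)
        hits = len(retrieved & relevant)  # relevant documents actually retrieved
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        # Balanced F-measure: harmonic mean of precision and recall.
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        return precision * 100, recall * 100, f_measure

    # Hypothetical example: the system returned 4 documents, 3 of which
    # are among the 5 documents judged relevant to the query.
    print(precision_recall_f(["d1", "d2", "d3", "d4"],
                             ["d1", "d2", "d3", "d7", "d9"]))
    # -> (75.0, 60.0, 0.666...)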
Database and Database Management System
A database is a large store of data held in a computer and made easily accessible for needed purposes [9]. A single organized collection of structured data, stored with a minimum of duplication of data items in order to provide a consistent and controlled pool of data, is referred to as a database [10]. Databases are classified according to their approach to organization. They include:
a) Relational model: uses a type of table called a relation to represent both data and relationships among those data. Each table has multiple columns, and each column has a unique name.
b) Network model: data are represented by collections of records, and relationships among data are represented by links, which can be viewed as pointers.
c) Hierarchical model: similar to the network model in the sense that data and relationships among data are represented by records and links [8].
[8] explains a database management system as consisting of a collection of interrelated data and a set of programs to access those data. It constructs, expands, and maintains the database, and it makes available a controlled interface between the user and the data in the database. Through its maintenance of indices, the database management system allocates storage to data so that any required data can be retrieved, and any separate items of data in the database can be cross-referenced [10, 11].

Algorithm
An algorithm is a set of rules or procedures that must be followed in solving a particular problem [12]. [13] describes an algorithm as a procedure or formula for solving a problem.

METHOD
This framework presents the following dynamic algorithm, which can be implemented as a means of knowledge accessibility and reusability in tertiary schools. Implementation of this algorithm will proffer answers to questions relating to accessibility and reusability of knowledge in an intelligent search engine.

DYNAMIC INFORMATION EXTRACTION ALGORITHM:

// Record to represent each document in a directory
doc : RECORD
    index : integer
    WordCount : integer
END

// Determine the total number of documents in the directory
TotalNumberOfDoc = 0
OPEN Directory
DO UNTIL End-of-Directory
    READ Document
    TotalNumberOfDoc = TotalNumberOfDoc + 1
END DO

// Define the array of document records to operate with
Doc[TotalNumberOfDoc] : doc

// Determine the number of occurrences of a SearchText in each project document
INPUT SearchText
index = 1
DO UNTIL index > TotalNumberOfDoc
    Doc[index].WordCount = 0
    OPEN Doc[index] (for reading)
    DO UNTIL EOF(Doc[index])
        READ Aword
        IF SearchText = Aword THEN
            Doc[index].WordCount = Doc[index].WordCount + 1
        END IF
    END DO
    CLOSE Doc[index]
    index = index + 1
END DO

// Rearrange the project documents in descending order of WordCount
// (degree of relevance to the SearchText) and display them in that order
FOR i = 1 TO TotalNumberOfDoc - 1 DO
    FOR j = 2 TO TotalNumberOfDoc - i + 1 DO
        IF Doc[j-1].WordCount < Doc[j].WordCount THEN
            SWAP(Doc[j-1], Doc[j])
        END IF
    END FOR
END FOR
index = 1
DO UNTIL index > TotalNumberOfDoc
    PRINT Doc[index]
    index = index + 1
END DO

// Subroutine for swap (pass by reference)
SWAP(Doc1, Doc2)
BEGIN
    tempDoc : doc
    tempDoc = Doc1
    Doc1 = Doc2
    Doc2 = tempDoc
END

Data on research reports in tertiary schools can be collected for use in evaluating the performance of this information retrieval system.
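As an illustration, the following minimal Python sketch mirrors the pseudocode above. It assumes the project documents are plain-text files held in a single directory and that words are separated by whitespace (both are assumptions made for illustration; the directory name "projects" is hypothetical). It counts the occurrences of the search text in each document and lists the documents in descending order of that count.

    import os

    def rank_documents(directory, search_text):
        """Count occurrences of search_text in every document in the
        directory, then return the documents sorted in descending order
        of that count (degree of relevance), as in the pseudocode."""
        word_counts = {}
        for name in os.listdir(directory):  # one entry per document
            path = os.path.join(directory, name)
            if not os.path.isfile(path):
                continue
            with open(path, encoding="utf-8", errors="ignore") as doc:
                # Whitespace tokenization is an assumption; the pseudocode
                # simply reads one word at a time.
                words = doc.read().split()
            word_counts[name] = sum(1 for w in words if w == search_text)
        # Python's built-in sort stands in for the explicit bubble sort
        # and SWAP subroutine of the pseudocode.
        return sorted(word_counts.items(), key=lambda item: item[1],
                      reverse=True)

    if __name__ == "__main__":
        # "projects" is a hypothetical directory of project reports.
        for doc_name, count in rank_documents("projects", "security"):
            print(doc_name, count)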
Augmenting the various document sections with metadata involves a number of steps:
• Identifying the categories onto which the various project topics (with abstracts) can be mapped, with respect to the various departments (e.g. Computer Science as a department containing categories such as Networking, Database, Security, etc.).
• Acquiring and representing background knowledge in a way that facilitates the mapping of the various project topics (with abstracts) into the identified categories.
• Segmenting the various documents and employing background knowledge to map each document section to its corresponding category.
• Storing structured index information in a persistent data store.
• Providing a user interface to enable search across indexed documents.
• Linking each research work title to its corresponding abstract/full text.

Overview of the Proposed System
The research work is a Web-based package in which a number of components communicate to achieve the required functionality. The main components of this system are: an indexing user interface, an indexing backend linked to an ontology database system, and a search front end also linked to the ontology database system. The figure below shows the various components and their interactions.

[Figure: system architecture. An administrator supplies input information through a Web-based indexing front end; the Indexing Backend saves indexed documents to, and updates and queries, the Ontology Database (background knowledge) system; a system user enters search criteria through a Web-based structured search interface; the Ranking Module matches documents against the user's query statement, and outputs are displayed in decreasing order of relevance degree.]

Indexing Backend: the component responsible for augmenting input research (project) documents with metadata using background knowledge. The Indexing Backend is an HTTP server capable of receiving requests embedded in HTTP requests. All project documents are indexed sequentially in a remote storage component (the database), where they are kept according to their ontology classes. The Ranking Module then analyses each document in the various ontology categories to determine its degree of relevance to the given search text. This analysis generates a ranking of the retrieved project documents; the results are returned to the user in relation form, in which the abstract/full text can be downloaded through the document path field (address), and the total ranks are displayed as well.

Indexing User Interface: this interface allows an authorized user to upload documents to a Web server, and then indexes each such document through communication with the indexing backend.

Search Front End: a Web search front end gives researchers room to easily retrieve the documents they require from existing research works, for further research, by keying in a search text. The query will be loose or specific depending on the number of keywords used: the more specific the query, the fewer documents returned. After a query is entered, it is converted to SQL and dispatched to the database in which the indexed documents are kept. The relevant documents (output) are displayed in the form of an HTML page containing a list of indexed project documents that match the entered query. The output includes the authors' names, the supervisor's name, and the project titles; a hyperlink to the source document is also displayed for downloading.
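A sketch of the query-to-SQL conversion step is given below, assuming a hypothetical "projects" table with author, supervisor, title, abstract, and path columns; the schema, table name, and use of SQLite are illustrative assumptions, not details taken from this study. Each additional keyword adds a condition, reflecting the point above that a more specific query returns fewer documents.

    import sqlite3

    def keyword_query_to_sql(keywords):
        """Build a parameterized SQL query requiring every keyword to
        appear in the project's title or abstract. The 'projects' table
        and its columns are hypothetical stand-ins for the document store."""
        if not keywords:
            raise ValueError("at least one keyword is required")
        clauses = ["(title LIKE ? OR abstract LIKE ?)"] * len(keywords)
        sql = ("SELECT author, supervisor, title, path FROM projects WHERE "
               + " AND ".join(clauses))
        params = []
        for kw in keywords:
            params += [f"%{kw}%", f"%{kw}%"]
        return sql, params

    # Hypothetical usage against a local SQLite store of indexed projects:
    # a two-keyword query is more specific, so fewer rows are returned.
    conn = sqlite3.connect("projects.db")
    sql, params = keyword_query_to_sql(["network", "security"])
    for author, supervisor, title, path in conn.execute(sql, params):
        print(author, supervisor, title, path)  # 'path' gives the download link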
Category Match Measure (CMM)
The Category Match Measure (CMM) is used to evaluate the coverage of an ontology for a given search text. Categories that contain all the search texts will obviously score higher than others, and exact matches score higher than partial matches. For instance, using "security" as the search text, an ontology category whose label matches the search term exactly will score higher in this measure than categories whose labels contain it only partially.

Definition 1: Let C[o] be the set of categories in ontology o, and let T be the set of search texts.

E(o, T) = Σ_{c ∈ C[o]} Σ_{t ∈ T} I(c, t)        (1)

I(c, t) = 1 if label(c) = t; 0 otherwise        (2)

P(o, T) = Σ_{c ∈ C[o]} Σ_{t ∈ T} J(c, t)        (3)

J(c, t) = 1 if label(c) contains t; 0 otherwise        (4)

where E(o, T) and P(o, T) are the numbers of classes of ontology o whose labels match any of the search texts T exactly or partially, respectively.

CMM(o, τ) = αE(o, T) + βP(o, T)        (5)

where CMM(o, τ) is the Category Match Measure for ontology o with respect to search text τ, and α and β are the exact-matching and partial-matching weight factors, respectively. Exact matching is favored over partial matching if α > β.

RELEVANCE RANKING USING DYNAMICALLY ACQUIRED BACKGROUND KNOWLEDGE
Although the set of all documents that satisfy a query expression may be very large, most keyword queries on this system find many documents containing the keywords. The system therefore estimates the relevance of documents to a query and returns only highly ranked documents as answers. The following gives a mathematical approach to relevance ranking using terms. The first question to address is: given a particular term t, how relevant is a particular document d to that term? The number of occurrences of the term in the document can be used as a measure of its relevance, on the assumption that relevant terms are likely to be mentioned many times in a document. This raw count is an imperfect measure, however: first, the number of occurrences depends on the length of the document, and second, a document containing 10 occurrences of a term may not be 10 times as relevant as a document containing one occurrence. The relevance of a document d to a term t, r(d, t), is therefore defined as

r(d, t) = log(1 + n(d, t) / n(d))

where n(d) is the number of terms in the document, and n(d, t) denotes the number of occurrences of the term t in the document d. This metric takes the length of the document into account: the relevance grows with more occurrences of a term in the document, although it is not directly proportional to the number of occurrences.
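Both measures can be illustrated with a short Python sketch. The representations are assumptions made for illustration: an ontology's categories are given simply as a list of label strings, a document as a list of terms, and the weight values chosen for α and β are arbitrary. In this sketch, exact matches are excluded from the partial count so the two weights apply to disjoint sets.

    import math

    def cmm(category_labels, search_texts, alpha=0.6, beta=0.4):
        """Category Match Measure, equations (1)-(5): count exact and
        partial matches of search texts against category labels, then
        combine them with weights alpha (exact) and beta (partial)."""
        exact = sum(1 for c in category_labels for t in search_texts
                    if c == t)                       # E(o, T), eqs. (1)-(2)
        partial = sum(1 for c in category_labels for t in search_texts
                      if t in c and c != t)          # P(o, T), eqs. (3)-(4)
        return alpha * exact + beta * partial        # CMM(o, tau), eq. (5)

    def relevance(document_terms, term):
        """Relevance r(d, t) = log(1 + n(d, t) / n(d)): grows with the
        number of occurrences but is normalized by document length."""
        n_d = len(document_terms)
        n_dt = document_terms.count(term)
        return math.log(1 + n_dt / n_d) if n_d else 0.0

    # Hypothetical example: with alpha > beta, the exact "security"
    # category outscores the partially matching "network security" one.
    print(cmm(["security", "database"], ["security"]))          # 0.6
    print(cmm(["network security", "database"], ["security"]))  # 0.4
    print(relevance(["access", "control", "security", "security"],
                    "security"))                                # log(1.5)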
DISCUSSION AND RECOMMENDATION
The goal of this study is to present an algorithm that can be easily implemented. This will be a channel for easy integration of, and access to, existing research works. It will go a long way toward eliminating repetition of research works in tertiary schools; it will help to provide adequate answers to questions on research works during accreditation; and it will enhance knowledge accessibility and reusability in tertiary schools.

Implementation of this algorithm will give room for a flexible system in which data on academic reports (projects) can be uploaded from the back end of the system, while users make use of the uploaded data, in terms of the relevant information they need, at the front end. This is achieved by typing the required information in text format. The system will be flexible enough that updates and upgrades can be carried out. Documents can be retrieved by the system through the following front-end procedure:
- Parse the query.
- Convert words into wordIDs.
- Seek to the start of the doclist in the database for every word.
- Scan through the doclists until there is a document that matches all the search terms.
- Compute the rank of that document for the query.
- Sort the documents that have matched by rank and return the results to the user display interface for reuse.

It is thus recommended that prospective researchers in tertiary schools be made to submit a soft copy of detailed information on the area of research being embarked upon, in addition to the hard copy of such approved works. This would provide an easy and efficient means of uploading, as well as updating, the data stored in the system.

REFERENCES
[1] D. Stephen, Information Systems for You (Cheltenham GL50 1YW: Stanley Thornes (Publishers) Ltd., 1996).
[2] H. Alani & C. Brewster, Ontology ranking based on the analysis of concept structures, Proc. 3rd Int. Conf. on Knowledge Capture (K-CAP), Banff, Canada, 2005, 51-58.
[3] T. Gruber, Toward principles for the design of ontologies used for knowledge sharing, International Journal of Human-Computer Studies, 43(5-6), November 1995, 907-928.
[4] T. Gruber, Ontolingua: A mechanism to support portable ontologies, Technical Report KSL 91-66, Stanford University, Knowledge Systems Laboratory, 1992.
[5] G. Salton, Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer (Reading, MA: Addison-Wesley, 1989).
[6] W.B. Frakes & R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms (NY: Prentice-Hall, 1992).
[7] B.A. Forouzan, Data Communications and Networking (New York: McGraw-Hill, 2003).
[8] A. Silberschatz et al., Database System Concepts (New York: McGraw-Hill, 2001).
[9] A.S. Hornby, Oxford Advanced Learner's Dictionary of Current English, 5th ed. (Oxford: Oxford University Press, 2000).
[10] C.S. French, Data Processing and Information Technology (London: Continuum, 2001).
[11] M.A. Mabayoje, Ontology and Information Extraction System: A Model for Integrating Research Works, M.Sc. thesis, University of Ilorin, Ilorin, Nigeria, 2009.
[12] R.L. Shackelford, Introduction to Computing and Algorithms (USA: Addison Wesley Longman Inc., 1998).
[13] N. Wirth, Algorithms + Data Structures = Programs (New Delhi: Prentice-Hall of India, 2005).