PhD Proposal Defense Ontological Web-Mining Goh Hui Ngo Faculty of Information Technology Multimedia University, Cyberjaya hngoh@mmu.edu.my Abstract Question answering (Q&A) application keep improving from traditional keyword-based search engine to a complex system incorporating natural language processing to deal with syntactical and semantical structure of the questions. Understanding is crucial in Q&A application. The proposed project intended to build an ontological question answering application. Automatic ontology construction will be explored to represent domain knowledge. In this study, inputs from user will be treated as query to search for answer in the ontology. Ontology reasoning mechanism will be study to deal with complex questions. Introduction The emerging of Word Wide Web (WWW) has tremendously contributes to all levels of society. People rely on it to search for needed information. However, with the increasing number of digital documents populating in cyberspace everyday, it is difficult to locate needed information within the time constraint. For this reason, question answering application has receiving attention lately. This project is proposed to construct ontology to assist in question answering. Proposed system will take user’s question as input to be processed to interpret type of question and get the concepts out of it to be route through the ontology to get the needed answer. Ontology reasoning will be triggered for complex questions. Background Search engine, always rated as a priority tool in locating needed information. It treats each question as a series of keywords by eliminating the stopwords. It is a process of direct matching the keywords formed against documents in WWW. Frequency of keywords occurrence in a document will determine its ranking position. There is no understanding (i.e. semantics) of the query given to the search engine. Documents ranked top do not assure the occurrences of needed answer. However, documents with the best answer might ranked low due to it have fewer keywords from the typed question. Situation will become worse if user typed in wrong keywords or misleading question from the original intention. Therefore, query processing based on keyword-matching mechanism does not satisfy the current demand and it decreases the accuracy of the search result. In contrast to keyword-based search engine as mentioned above which treat question as a set of keywords. Natural language processing (NLP) has been used to explore the question’s syntactic and semantic structure. Question is parsed using parser to identify its query type and meaning to be reformulated into sub-queries comprise of synonym keywords or phrases that Page 1 of 11 might appear in the documents which contains answer. Formulated multiple sub-queries are then passed to search engine to locate the most related documents to the original question. Documents that are likely to contain the answers of the queries will be returned. However, search engine will only return documents that might contain answer. Thus, user is often being forced to digest the whole returned documents to extract the needed information. Again, it is expensive and labour intensive. Question answering goes far beyond than search engine. Its aim is to return a concise answer based on the question typed rather than a list related documents from a set of corpus. Early age of question answering applications were mostly domain specific. They were built with a specific database or knowledge-base in advance. Very often, natural language question has to be transformed carefully and specifically to suit the database or knowledge-base query language to facilitate the answers extraction. Therefore, knowledge engineer has to be familiar with the database or knowledge-base structure and database or knowledge-base querying language in order to represent the question typed in a proper manner. Misrepresentation of the query typed will lead to retrieving incorrect answer. Obviously, it is a customized application within the application domain and query language to be used. Answers returned based on given query are constraint to the represented data or knowledge in database or knowledge-base. In this context, it holds the assumption that the database or knowledge-based should stored huge amount of represented data or knowledge. Yet it is expensive to build such a complicated application which requires high accuracy performance as data or facts represented in database or knowledge-based are domain specific. It is manual labour intensive. Nevertheless, the great explosion of digital documents in the WWW has encouraged people moving towards opened domain question answering. User can ask any questions and answers will be extracted from free texts or web pages in the WWW. The natural language question input by user needs to be analysed to identify its question type. It is important to justify the question type to extract the concise answer. Reformulated queries based on question type will be passed to search engine to retrieve potential documents that might contain answer. Candidate answers will then be extracted from the set of returned documents to be rank to determine the final answer. Yet, this approach still lack of reasoning mechanism to manipulate complicated question which require domain background knowledge and asking the same question to the application by different users can be inefficient and time consuming, due to the application has to repeat mentioned steps above each time there is a question. Moreover, it is just merely a question answering application equipped with advance manipulating capabilities. Moreover, with the evolving of semantic network, ontology begins to play the role in knowledge representation. It is an explicit representation of domain knowledge with the aim to eliminate terminologies, relations confusion, hence promote shareable understanding environment to facilitate data collaboration. The essential aspects of ontology are: 1. 2. 3. 4. It often used to describe a specific domain Terms and relations are clearly defined in that domain A mechanism will be used to organize the terms (IS-A / HAS-A …) Meaning of the term is used consistently throughout the ontology construction Page 2 of 11 Complete and precise domain ontology will produce a clear picture of domain context, thus, will ease the question answering process by manipulating the represented knowledge rather than browse through the Internet all the time for needed answers. There are three approaches in constructing ontology, namely manual, semi-automatic and automatic. Manual [4] construction of ontology requires a lot of time and resources from domain expert and knowledge engineers. Terminologies and relations have to be defined manually, thus keeping ontology up-to-date demands a great effort. It lacks of flexibility and scalability. In order to overcome the above matter, a lot of concentrations have been put on automatic [6,10] and semi-automatic [2,3,27] ontology construction. However fully and completely automatic construction of ontology is not an easy task because many domain-specific decisions must be made to adequately to specify the domain of interest. At present, the fully automated method only functions well for very lightweight ontology under very limited circumstances. To strike a balance between manually and fully automatic ontology constructions, semiautomatic construction of ontology always come in mind of ontology developers. Semiautomatic ontology construction begins with the manual selection of relevant concepts and relationships, where the text pre-processing output is used to guide the selection. The concepts and relationships formed will serve as a foundation to facilitate automatic terminology extraction for the remaining of the construction. Ontology without reasoning is merely a knowledge representation. Reasoning is needed to uncover the implicit knowledge from the explicit represented knowledge. Major considerations in the choice of representation are the expressivity, satisfiable and consistency of the encoding language and the power of knowledge reasoning: The expressivity is a measure of the features that can be use to formally, flexibly, explicitly and accurately describe the meaning of the knowledge. A model is satisfiable if none of the statements within contradict each other Consistency is a matter of encoding the conceptualisation of the knowledge in the same manner throughout the knowledge representation. The power of reasoning is the ability to take decisions under incomplete knowledge. One of the logic-based approaches is Description Logic (DL). It is about modelling world knowledge of a particular domain of application and reasoning about it. Originally, description logics were intended to be fragments of first-order logic whose satisfiability problem in decidable in polynomial time but currently, they are intended to be decidable logics with good deductive systems. DL provides good ontology reasoning for semantic web due to it uses natural language and well-defined semantic. It provides algorithms for automatically extracting implicit information from facts explicitly represented. The basic building blocks of description logic are atomic concepts (unary predicates), atomic roles (binary predicates) and individuals. Concepts describe the common properties of a collection of individuals and can be considered as unary predicates which are interpreted as sets of objects. Roles are interpreted as binary relations between objects. Page 3 of 11 The expressive power of description logic is restricted in that it uses a small set of constructors (such as intersection, union, role quantification, etc.) to build complex concepts and roles. Implicit knowledge about concepts and roles can be inferred automatically with the help of inference procedure. The main reasoning tasks of description logic are classification, satisfiability, subsumption and instance checking. Classification of concepts allows one to structure the terminology in the form of a subsumption hierarchy. This hierarchy provides useful information on the connection between different concepts, and it can be used to speed up other inference services. Classification of individuals determines whether a given individual always an instance of a certain concepts. It is useful information on the properties of an individual. Subsumption relation specifies the relative generality of two concepts. A concept X subsumes a concept Y if the definitions of X and Y logically imply that members of Y must also be members of X. Unlike IS-A links in semantic network, which are explicitly introduced by the user, subsumption relationships and instance relations are inferred from the definition of the concepts and the properties of the individual. Satisfiability means non-contradiction among concepts and a concept makes sense for us if there is some interpretation that satisfies the axioms of T (a model of T) such that the concept denotes a nonempty set in that interpretation. Instance checking is used to check whether an individual is an instance of a concept. It can be considered the central reasoning task for retrieving information on individuals in the knowledge base. In fact, instance checking is a basic tool for more complex reasoning problems. Justification of Study This project will focus its question answering process in e-story domain. E-story, an electronic story which is being publishes in digital format and each e-story is content specific to its title. User is able to ask any questions related to the flow and content of the e-story. Therefore, there is a need to find a way to optimise the searching method in e-story. E-story ontology will be constructed automatically to represent domain knowledge. Natural language question posted to the interface has to be analysed syntactic and semantically to identify its question types and meaning to be reformulated into formal representation of meaning as logic. Later, the formulated logic will infer with the ontology constructed to return answer in natural language. However, not all answers are able to be extracted directly from ontology. There will be cases where ontology reasoning is needed to manipulate the represented knowledge to reveal the implicit knowledge of the particular domain. Research Objectives 1. To investigate the auto-ontological construction process in Q&A approach compare to keyword techniques. 2. To study Description Logic to be used in ontology reasoning for question-answering. Page 4 of 11 Literature Review Characteristics Domain Query Processing Ontological Reasoning Answer Returned OntoNova(03) Closed domain (Chemistry domain) Yes F-logic Natural language explanation AQUA(03) Hybrid domain (academic life) MOSES(04) Closed domain AquaLog(05) Closed domain (knowledge base) Yes FOL Not Stated Yes DL Not Stated Yes Logic Not Stated Figure 1: Characteristics among ontological question answering applications Figure 1 above showed the ontological question answering applications. AquaLog – a portable question-answering system which takes queries expressed in natural language and ontology as input and returns answer drawn from one or more knowledge bases. Ontology is used for ontology-compliant query and ontology knowledge base. AQUA – Ontology is used for a. In the refinement of the initial query b. In the reasoning process (a generalization/specialization process using classes and subclasses from the ontology) c. In the similarity algorithm (based on the ontological structures and instances of ontology, a WordNet thesaurus and the Dice coefficient). It is used to find similarities between relations/concepts in the translated query and relations/concepts in the ontological structures) OntoNova - In this application, ontology is used to deal with the question of how to describe in a declarative and abstract way the domain information, its relevant vocabulary and how to constraint the use of the data, by understanding what can be drawn from it. A complex ontology with rules, multiple representations of objects, and call-out functionality to particular problem-solving methods has been modelled in F-Logic manually to allow for inferring answers posed to the system. Basic concepts of chemistry are represented as concepts in F-logic and are arranged in is-a hierarchy. Complex chemical relationships and axioms are represented by rules MOSES – Build on existing ontologies and interpret questions with the ontological language. Ontology provides a metalanguage for describing the meaning of lexical units of a language as well as for the specification of meaning encoded in TMRs (Text Meaning Representation). The ontology contains specifications of concepts corresponding to classes of things and events in the world. In format, Ontosem ontology that is using in this project is a collection of frames, or named collections of property-value pairs, organized into an hierarchy with multiple inheritance. Fact repository contains a list of remembered instances of ontological concepts. For example, ontology contains the concept CITY and the fact repository contains entries for London, Paris and Rome. Based on the works done shown above, most ontologies developed are mainly used to clarify and refine question asked by end user. They are developed manually using tool to represent the domain background in a structured manner. Page 5 of 11 Research Methodology 1) Literature survey - Literature review will be on techniques and approaches used for ontology construction and knowledge representation. Focus will be given to syntactical and statistical approaches for ontology construction, and logic-based approaches for knowledge representation and ontology reasoning. 2) Formulate conceptual framework - A conceptual framework of the project will be produce to indicate the flow of it. 3) Produce prototype - Prototype will be developed with the help of part-of-speech tagger for syntactical parsing in raw texts for pre-processing, terminologies and relations extraction. Hybrid approach (syntactical and statistical) will be used for ontology construction. Logic-based approach will be use for ontology reasoning. 4) Test and refine designed prototype - Developed ontology will be tested with various questions related domain. Time Frame System Overview – current status Below is the architecture of the proposed application. For the time being, our research is concentrated in a single document for the purpose of ontology construction. Upper Layer of System Architecture Document s Sentence Segmentation Question Pre-processed document Reasoning Ontology Construction Engine Term Extraction Relation Tagging Answer Figure 2: Upper Level Architecture Page 6 of 11 Pre-processing In this step, document has to be segmented into sentences to be used by part-of-speech (POS) tagger to tag the segmented sentences. The tagged document will serve as a foundation step for terminologies and relations extraction for ontology construction. Ontology Construction Syntactic approach will then be used to extract facts (terminologies and relations) from the tagged document. It is a supervised corpus-based technique by matching the predefined patterns against the tagged document to extract facts. The triplet model, Subject- Relation Object (SRO) will be the format for the extracted facts. However, extracting facts based on predefined pattern solely does not guarantee the completeness of the facts. Therefore, a set of rules will be used to enhance the extraction process by considering the relation between sentences (discourse). Discourse between sentences is the link that conceptually ties one sentence to another sentence. The discourse can be explicit, in which a transition or clue word helps to identify the connection. Discourse analysis on written text is still a wide open problem, while there has been substantial work on analyzing spoken discourse. Anaphora resolution is also one of the major problems in natural language processing, specifically in natural language understanding. It involved the recognition of Proper name and pronoun (he, she, it, they, …). JavaRAP, Java implementation of the classic Resolution of Anaphora Procedure (RAP) will be used to resolve anaphora resolution. Extracted facts will be used to form ontology automatically. Knowledge Base Knowledge base rules will be constructed to represent the domain knowledge. It will be arranged in the following framework – a general framework for story development. The reason of segmenting the rules is to ease the reasoning process. a. Introduction Majority of the stories will always start by introducing the background of the story. It can be the characters, relationship between characters (family, employer-employee, friends, etc) and background of the story (location, scene of the story) b. Conflict Identify the struggle, battle; difference of opinion between characters. What are the actions taken in the process of conflict? c. Resolution How do they solve the conflict? Among characters What are the consequences of the conflict? Among characters d. Ending Morale of the story What happen to each character especially the main character? Lastly, the constructed ontology will be use as a foundation for question answering system. Logical-based approach will be explored for ontology reasoning to answer questions. Logical-based approach was chosen because it allows the descriptions both in concrete and abstract format. It also provides inferential services which can fully automatically deduce new information from given information Page 7 of 11 Question Analysis and Answer Questions post by user will be analyzed to identify its meaning. All questions will be represented as same structure as the reasoning language. Begin Question posted by end user Question analysis Search needed answer in the ontology If answer found in ontology then Return answer else Identify related facts in the ontology against the question posted Retrieve a set of related facts from ontology Start reasoning process End Figure 3: Algorithm for answer extraction Conclusion Ontological-based assisted in question answering is has not been explore in deep. There are still a lot of research possibilities to be done. It is hope that my research will have contribution in this area. References/Bibliography 1. A. Kawtrakul, M. Suktarachan, A. Imsombut. Automatic Thai Ontology Construction and Maintenance System. Workshop on Papillon 2004, Grenoble, France, 2004. 2. A. Maedche, S. Staab. Semi-automatic engineering of ontologies from text. In Proceedings of the 12th Internal Conference on Software and Knwoledge Engineering. Chicago, USA, July 5-7, 2000.KSI 3. B. Fortuna, D. Mladenic, M. Grobelnik. Semi-automatic Construction of Topic Ontologies. EWMF/KDO 2005: 121-131 4. B. Omelayenko. Learning of Ontologies for the Web: the analysis of Existent Approaches. Proceedings of the International Workshop on Web Dynamics, 2001. 5. C. Kwok, O. Etzioni, D. S. Weld. Scaling Question Answering to the Web. ACM Transaction on Information Systems, Vol. 19, No. 3, July 2001, pages 242-262. Page 8 of 11 6. C. S. Lee, Y. F. Kao, Y. H. Kuo, M. H. Wang. Automated ontology construction for unstructured text documents. Data Knowl. Eng. 60(3): 547-566 (2007) 7. D. Radev, W.G. Fan, H. Qi, H. Wu, A. Grewal. Probabilistic Question Answering on the Web. WWW2002, May 7-11, 2002, Honolulu, Hawaii, USA. 8. D. Fensel. The Semantic Web and Its Language, 2000. 9. D. Koller, A. Levy, A. Pfeffer. P-CLASSIC: A tractable probabilistic description logic. Proc. of the 14th Nat. Conf. on Artificial Intelligence (AAAI-97), 390-397,1997. 10. E. Blomqvist. Fully Automatic Construction of Enterprise Ontologies Using Design Patterns: Initial Method and First Experiences. In Proc. of the 4th Intl Conf. on Ontologies, Databases and Applications of Semantics (ODBASE, 2005), Cyprus. 11. H. Kaji, Y. Morimoto, T. Aizono, N. Yamasaki. Corpus-dependent Association Thesauri for Information Retrieval. COLING 2000: 404-410 12. J. Angele, E. Moench, H. Oppermann, S. Staab, D. Wenke. Ontology-Based Query and Answering in Chemistry: OntoNova @ Project Halo. In: Proceedings of the second International Semantic Web Conference 2003 (ISWC 2003), 20-23 octobre 2003, Sanibel Island, Florida, USA. 13. J. Z. Pan, I. Horrocks. Web Ontology Reasoning with Datatype Groups. 14. M. Ciaramita, A. Gangemi, E. Ratsch, J. Saric, I. Rojas. Unsupervised Learning of Semantic Relations between Concepts of a Molecular Biology Ontology. Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI '05) 15. M. Vargas-Vera, E. Motta. AQUA: A Question Answering System for Heterogeneous Sources. TR-kmi-04-20, Knowledge Media Institute, The Open University. 16. M. Vargas-Vera, E. Motta. A Knowledge-based Approach to Ontologies Data Integration. TR-kmi-04-17, Knowledge Media Institute, The Open University. 17. M. Vargas-Vera, E. Motta. AQUA – Ontology-Based Question Answering System, Third International Mexican Conference on Artificial Intelligence (MICAI-2004), Lecture Notes in Computer Science 2972 Springer Verlag, (eds R. Monroy et al), April 26-30, 2004. 18. M. Vargas-Vera, E. Motta, J. Domingue. AQUA: AN Ontology-Driven Question Answering System, AAAI Spring Symposium, New Directions in Question Answering, Standford University, 2003. 19. N. Collier, C. Nobata J. Tsujii. Extracting the Names of Genes and Gene Products with a Hidden Markov Model. COLING 2000: 201-207 20. N. Henze , P. Dolog, W. Nejdl. Reasoning and Ontologies for personalized E-Learning in the Semantic Web. Educational technology & Society 2004, 7(4), 82-97. 21. O. Streiter, D. Zielinski, I. Ties, L. Voltmer. Term Extraction for Ladin: An Example Page 9 of 11 Approach. TALN 2003, Batz-sur-Mer, 11-14 jiun 2003. 22. P. Atzeni, R. Basili, D. H. Hansen, P. Missier, P. Paggio, M. T. Pazienza, F. M. Zanzotto. Ontology-Based Question Answering in a Federation of University Sites: The MOSES Case Study. NLDB 2004: 413-420 23. P. Velardi, P. Fabriani, M. Missikoff. Using Text Processing Techniques to Automatically Enrich a Domain Ontology. FOIS 2001: 270-284 24. R. Basili, D. H. Hansen, P. Paggio, M. T. Pazienza, F. M. Zanzotto. Ontological resources and question answering. Workshop on Pragmatics of Question Answering, held jointly with NAACL 2004 Boston, Massachusetts, May 2004 25. S. Beale, B. Lavoie, M. Mcshane, S. Nirenburg, T. Korelsky. Question Answering Using Ontological Semantics. In Proceedings of the 2nd Workshop on Text Meaning and Interpretation of ACL 2004, Barcelona, Spain, pp. 41-48. 26. S. Beale, B. Lavoie, M. McShane, S. Nirenburg, T. Korelsky. Question Answering Using Ontological Semantics. Proceedings of ACL-2004 Workshop on Text Meaning and Interpretation, 2004. Barcelona, Spain. 27. S. J. Kang, J. H. Lee. Semi-automatic practical ontology construction by using a thesaurus, computational dictionaries, and large corpora. Proceedings of the workshop on Human Language Technology and Knowledge Management, 2001.Volume 2001 28. T. Berners-Lee, M. Fischetti. Weaving the Web. Harper, San Francisco, 1999. 29. U. Straccia. Uncertainty and Description Logic Programs: A proposal for Expressing Rules and Uncertainty on Top of Ontologies. Technical Report 2004-TR-14. 30. U. Straccia. A fuzzy Descriptiopn Logic. Proceeding of AAAI-98, 15th Conference of American Association for Artificial Intelligence. 31. V. Lopez, M. Pasin, E.Motta. AquaLog: An Ontology-Portable Question Answering System for the Semantic Web. ESWC 2005: 546-562 32. V. Lopez, E.Motta. Ontology-Driven Question Answering in AquaLog, In Proceedings of the 9th International Conference on Applications of Natural Language to information System, 2004, Manchester, England. Page 10 of 11 ID 1 2007 Task Name Sep Oct Nov Dec 2008 Jan Feb Mar Apr May Proposal Defence 2 Text Preprocessing 3 Facts Extraction Algorithm 4 Facts Extraction and Ontology Construction 5 Ontology Reasoning 6 Prototype Testing 7 Prototype Refinement 8 Work Completion Page 11 of 11 Jun Jul Aug Sep Oct Nov Dec 2009 Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov D