PhD Proposal Defense

Ontological Web-Mining
Goh Hui Ngo
Faculty of Information Technology
Multimedia University, Cyberjaya
hngoh@mmu.edu.my
Abstract
Question answering (Q&A) applications keep improving, from traditional keyword-based
search engines to complex systems that incorporate natural language processing to deal with
the syntactic and semantic structure of questions. Understanding the question is crucial in a
Q&A application. The proposed project intends to build an ontological question answering
application. Automatic ontology construction will be explored to represent domain
knowledge. In this study, input from the user will be treated as a query to search for an
answer in the ontology. Ontology reasoning mechanisms will be studied to deal with complex
questions.
Introduction
The emergence of the World Wide Web (WWW) has contributed tremendously to all levels of
society. People rely on it to search for the information they need. However, with the number
of digital documents in cyberspace increasing every day, it is difficult to locate the needed
information within a limited time. For this reason, question answering applications have
been receiving attention lately.
This project proposes to construct an ontology to assist in question answering. The proposed
system will take the user’s question as input, interpret the type of question, and extract its
concepts, which are then routed through the ontology to obtain the needed answer. Ontology
reasoning will be triggered for complex questions.
Background
The search engine is usually regarded as the primary tool for locating needed information. It
treats each question as a series of keywords after eliminating the stopwords, and matches
those keywords directly against documents on the WWW. The frequency with which the keywords
occur in a document determines its ranking position. There is no understanding (i.e.
semantics) of the query given to the search engine. A top-ranked document is not guaranteed
to contain the needed answer, while a document with the best answer might be ranked low
because it contains fewer of the keywords from the typed question. The situation becomes
worse if the user types in the wrong keywords or a question that misrepresents the original
intention. Therefore, query processing based on a keyword-matching mechanism does not
satisfy current demand, and it decreases the accuracy of the search result.
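The ranking behaviour described above can be illustrated with a minimal Python sketch. It is not the implementation of any real search engine: the stopword list, the two sample documents and the function names are assumptions made purely for illustration.

```python
from collections import Counter
import re

# Hypothetical stopword list; a real engine would use a much larger one.
STOPWORDS = {"a", "an", "the", "is", "of", "in", "to", "who", "what", "did", "was"}

def keywords(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def rank(question, documents):
    """Order document names by how often the question's keywords occur in them."""
    query = set(keywords(question))
    scores = {}
    for name, text in documents.items():
        counts = Counter(keywords(text))
        scores[name] = sum(counts[k] for k in query)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical sample documents: only d2 actually answers the question.
documents = {
    "d1": "The wolf, the wolf, the big wolf lived in the forest.",
    "d2": "A hunter killed the wolf.",
}
ranking = rank("Who killed the wolf?", documents)
print(ranking)
```

In this toy case d1 is ranked above d2 only because it repeats the keyword "wolf" more often, even though d2 is the document that contains the answer.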
In contrast to keyword-based search engines, which treat a question as a set of keywords,
natural language processing (NLP) has been used to explore the question’s syntactic and
semantic structure. The question is parsed to identify its query type and meaning, and is
then reformulated into sub-queries comprising synonymous keywords or phrases that might
appear in documents containing the answer. The formulated sub-queries are then passed to a
search engine to locate the documents most related to the original question. Documents that
are likely to contain the answers to the queries are returned.
However, a search engine only returns documents that might contain the answer. Thus, the user
is often forced to read through the returned documents in full to extract the needed
information. Again, this is expensive and labour intensive.
Question answering goes far beyond a search engine. Its aim is to return a concise answer
to the question typed rather than a list of related documents from a corpus.
Early question answering applications were mostly domain specific. They were built on a
specific database or knowledge base prepared in advance. Very often, the natural language
question had to be transformed carefully and specifically to suit the query language of the
database or knowledge base in order to facilitate answer extraction. Therefore, the knowledge
engineer had to be familiar with the structure of the database or knowledge base and with
its query language in order to represent the typed question properly. Misrepresenting the
query leads to retrieving an incorrect answer. Such an application is clearly customized to
its application domain and to the query language used. The answers returned for a given
query are constrained to the data or knowledge represented in the database or knowledge
base. This approach therefore assumes that the database or knowledge base stores a huge
amount of represented data or knowledge. Yet it is expensive to build such a complicated
application, which requires high accuracy, because the data or facts represented in the
database or knowledge base are domain specific. Building it involves intensive manual
labour.
Nevertheless, the great explosion of digital documents on the WWW has encouraged a move
towards open-domain question answering. The user can ask any question, and answers are
extracted from free text or web pages on the WWW. The natural language question entered by
the user needs to be analysed to identify its question type. Identifying the question type
correctly is important for extracting a concise answer. Queries reformulated according to
the question type are passed to a search engine to retrieve documents that might contain the
answer. Candidate answers are then extracted from the returned documents and ranked to
determine the final answer. Yet this approach still lacks a reasoning mechanism for handling
complicated questions that require domain background knowledge. In addition, when different
users ask the same question, the process is inefficient and time consuming because the
application has to repeat the steps above for every question. In short, it is merely a
question answering application equipped with advanced manipulation capabilities.
With the evolution of semantic networks, ontology has begun to play a role in knowledge
representation. An ontology is an explicit representation of domain knowledge that aims to
eliminate confusion over terminology and relations, and hence to promote a shared
understanding that facilitates data collaboration. The essential aspects of an ontology are:
1. It is often used to describe a specific domain.
2. Terms and relations are clearly defined in that domain.
3. A mechanism is used to organize the terms (IS-A / HAS-A …).
4. The meaning of each term is used consistently throughout the ontology construction.
A complete and precise domain ontology produces a clear picture of the domain context and
thus eases the question answering process, which can manipulate the represented knowledge
rather than browse the Internet every time for the needed answers.
There are three approaches to constructing an ontology, namely manual, semi-automatic and
automatic.
Manual [4] construction of an ontology requires a lot of time and resources from domain
experts and knowledge engineers. Terminology and relations have to be defined manually, so
keeping the ontology up to date demands great effort. The approach lacks flexibility and
scalability.
To overcome these problems, much attention has been given to automatic [6,10] and
semi-automatic [2,3,27] ontology construction. However, fully automatic construction of an
ontology is not an easy task because many domain-specific decisions must be made to specify
the domain of interest adequately. At present, fully automated methods only function well
for very lightweight ontologies under very limited circumstances.
To strike a balance between manual and fully automatic ontology construction, ontology
developers often turn to semi-automatic construction. Semi-automatic ontology construction
begins with the manual selection of relevant concepts and relationships, where the text
pre-processing output is used to guide the selection. The concepts and relationships formed
serve as a foundation to facilitate automatic terminology extraction for the remainder of
the construction.
Ontology without reasoning is merely a knowledge representation. Reasoning is needed to
uncover the implicit knowledge in the explicitly represented knowledge. Major
considerations in the choice of representation are the expressivity, satisfiability and
consistency of the encoding language and the power of knowledge reasoning:
- Expressivity is a measure of the features that can be used to formally, flexibly,
  explicitly and accurately describe the meaning of the knowledge.
- A model is satisfiable if none of the statements within it contradict each other.
- Consistency is a matter of encoding the conceptualisation of the knowledge in the
  same manner throughout the knowledge representation.
- The power of reasoning is the ability to make decisions under incomplete knowledge.
One of the logic-based approaches is Description Logic (DL). It is concerned with modelling
knowledge of a particular application domain and reasoning about it. Originally,
description logics were intended to be fragments of first-order logic whose satisfiability
problem is decidable in polynomial time, but currently they are intended to be decidable
logics with good deductive systems.
DL is well suited to ontology reasoning for the Semantic Web because its constructs can be
read in a form close to natural language and it has well-defined semantics. It provides
algorithms for automatically extracting implicit information from explicitly represented
facts.
The basic building blocks of description logic are atomic concepts (unary predicates), atomic
roles (binary predicates) and individuals. Concepts describe the common properties of a
collection of individuals and can be considered as unary predicates which are interpreted as
sets of objects. Roles are interpreted as binary relations between objects.
The expressive power of description logic is restricted in that it uses a small set of
constructors (such as intersection, union, role quantification, etc.) to build complex concepts
and roles.
Implicit knowledge about concepts and roles can be inferred automatically with the help of
an inference procedure. The main reasoning tasks of description logic are classification,
satisfiability, subsumption and instance checking.
Classification of concepts allows one to structure the terminology in the form of a
subsumption hierarchy. This hierarchy provides useful information on the connections
between different concepts, and it can be used to speed up other inference services.
Classification of individuals determines whether a given individual is always an instance of
a certain concept. It provides useful information on the properties of an individual.
The subsumption relation specifies the relative generality of two concepts. A concept X
subsumes a concept Y if the definitions of X and Y logically imply that members of Y must
also be members of X. Unlike IS-A links in a semantic network, which are explicitly
introduced by the user, subsumption relationships and instance relations are inferred from
the definitions of the concepts and the properties of the individuals.
Satisfiability means non-contradiction among concepts. A concept makes sense if there is
some interpretation that satisfies the axioms of the terminology T (a model of T) such that
the concept denotes a nonempty set in that interpretation.
Instance checking is used to check whether an individual is an instance of a concept. It can be
considered the central reasoning task for retrieving information on individuals in the
knowledge base. In fact, instance checking is a basic tool for more complex reasoning
problems.
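The subsumption and instance checking tasks described above can be sketched in Python over a toy terminology. This is only a transitive closure over explicitly stated atomic concept inclusions, not a full DL reasoner, and the concept and individual names are hypothetical.

```python
# Hypothetical TBox: each atomic concept maps to its directly stated parents.
TBOX = {
    "Wolf": {"Animal"},
    "Girl": {"Person"},
    "Person": {"Character"},
    "Animal": {"Character"},
}
# Hypothetical ABox: each individual maps to its asserted concepts.
ABOX = {"red_riding_hood": {"Girl"}, "big_bad_wolf": {"Wolf"}}

def ancestors(concept):
    """Return every concept that subsumes `concept`, including itself."""
    seen, stack = {concept}, [concept]
    while stack:
        for parent in TBOX.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def subsumes(x, y):
    """True if X subsumes Y, i.e. every member of Y must be a member of X."""
    return x in ancestors(y)

def is_instance(individual, concept):
    """Instance checking: is the individual inferred to belong to the concept?"""
    return any(subsumes(concept, c) for c in ABOX.get(individual, ()))

def retrieve(concept):
    """Return all individuals inferred to be instances of the concept."""
    return sorted(i for i in ABOX if is_instance(i, concept))

print(subsumes("Character", "Girl"))
print(retrieve("Character"))
```

Here neither individual is asserted to be a Character; that membership is implicit knowledge recovered from the stated hierarchy.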
Justification of Study
This project will focus its question answering process on the e-story domain. An e-story is
an electronic story that is published in digital format, and the content of each e-story is
specific to its title. The user is able to ask any question related to the flow and content
of the e-story. Therefore, there is a need to find a way to optimise the search method for
e-stories.
An e-story ontology will be constructed automatically to represent domain knowledge. A
natural language question posted to the interface has to be analysed syntactically and
semantically to identify its question type and meaning, and is then reformulated into a
formal logical representation of its meaning. The formulated logic is then evaluated against
the constructed ontology to return an answer in natural language.
However, not every answer can be extracted directly from the ontology. There will be cases
where ontology reasoning is needed to manipulate the represented knowledge and reveal the
implicit knowledge of the particular domain.
Research Objectives
1. To investigate the automatic ontology construction process in a Q&A approach compared to
keyword techniques.
2. To study Description Logic for use in ontology reasoning for question answering.
Literature Review
Characteristics        OntoNova (03)        AQUA (03)         MOSES (04)      AquaLog (05)
Domain                 Closed domain        Hybrid domain     Closed domain   Closed domain
                       (Chemistry domain)   (academic life)                   (knowledge base)
Query Processing       Yes                  Yes               Yes             Yes
Ontological Reasoning  F-logic              FOL               DL              Logic
Answer Returned        Natural language     Not Stated        Not Stated      Not Stated
                       explanation
Figure 1: Characteristics among ontological question answering applications
Figure 1 above compares existing ontological question answering applications.
AquaLog - a portable question answering system which takes queries expressed in natural
language and an ontology as input and returns answers drawn from one or more knowledge
bases. The ontology is used for ontology-compliant queries and for the ontology knowledge
base.
AQUA - The ontology is used:
a. In the refinement of the initial query.
b. In the reasoning process (a generalization/specialization process using classes
and subclasses from the ontology).
c. In the similarity algorithm, which is based on the ontological structures and instances
of the ontology, a WordNet thesaurus and the Dice coefficient. It is used to find
similarities between relations/concepts in the translated query and
relations/concepts in the ontological structures.
OntoNova - In this application, the ontology is used to address the question of how to
describe the domain information and its relevant vocabulary in a declarative and abstract
way, and how to constrain the use of the data by understanding what can be drawn from it. A
complex ontology with rules, multiple representations of objects, and call-out functionality
to particular problem-solving methods has been modelled manually in F-Logic so that answers
to questions posed to the system can be inferred. Basic concepts of chemistry are
represented as concepts in F-Logic and are arranged in an is-a hierarchy. Complex chemical
relationships and axioms are represented by rules.
MOSES - Builds on existing ontologies and interprets questions with the ontological
language. The ontology provides a metalanguage for describing the meaning of the lexical
units of a language as well as for specifying the meaning encoded in TMRs (Text Meaning
Representations). The ontology contains specifications of concepts corresponding to classes
of things and events in the world. In format, the Ontosem ontology used in this project is a
collection of frames, or named collections of property-value pairs, organized into a
hierarchy with multiple inheritance. The fact repository contains a list of remembered
instances of ontological concepts. For example, the ontology contains the concept CITY and
the fact repository contains entries for London, Paris and Rome.
Based on the work reviewed above, most of the ontologies developed are mainly used to
clarify and refine the question asked by the end user. They are developed manually, using a
tool, to represent the domain background in a structured manner.
Research Methodology
1) Literature survey
- The literature review will cover techniques and approaches used for ontology construction
and knowledge representation. The focus will be on syntactic and statistical
approaches to ontology construction, and logic-based approaches to knowledge
representation and ontology reasoning.
2) Formulate conceptual framework
- A conceptual framework of the project will be produced to indicate its flow.
3) Produce prototype
- A prototype will be developed with the help of a part-of-speech tagger for syntactic
parsing of raw text during pre-processing and for terminology and relation extraction. A
hybrid approach (syntactic and statistical) will be used for ontology construction. A
logic-based approach will be used for ontology reasoning.
4) Test and refine designed prototype
- The developed ontology will be tested with various questions related to the domain.
Time Frame
System Overview – current status
Below is the architecture of the proposed application. For the time being, our research
concentrates on a single document for the purpose of ontology construction.
Upper Layer of System Architecture
[Diagram components: Documents; Sentence Segmentation; Pre-processed document; Ontology
Construction (Term Extraction, Relation Tagging); Question; Reasoning Engine; Answer]
Figure 2: Upper Level Architecture
Pre-processing
In this step, the document is segmented into sentences, which are then tagged by a
part-of-speech (POS) tagger. The tagged document serves as the foundation for terminology
and relation extraction in ontology construction.
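The pre-processing step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the tiny lexicon stands in for a real POS tagger (the proposal does not name one), the tag names follow the common Penn Treebank style, and the sample document is invented.

```python
import re

# Hypothetical toy lexicon; unknown words default to NN (noun).
LEXICON = {
    "the": "DT", "a": "DT", "wolf": "NN", "girl": "NN", "grandmother": "NN",
    "ate": "VBD", "visited": "VBD", "she": "PRP",
}

def segment(document):
    """Split a document into sentences at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

def tag(sentence):
    """Pair each word with a part-of-speech tag looked up in the toy lexicon."""
    words = re.findall(r"[A-Za-z]+", sentence)
    return [(w, LEXICON.get(w.lower(), "NN")) for w in words]

document = "The girl visited the grandmother. The wolf ate the grandmother."
sentences = segment(document)
tagged = [tag(s) for s in sentences]
print(tagged[1])
```

A real tagger would resolve ambiguous words from context; the lookup here only shows the shape of the tagged output that later extraction steps consume.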
Ontology Construction
A syntactic approach will then be used to extract facts (terminology and relations) from the
tagged document. It is a supervised corpus-based technique that matches predefined
patterns against the tagged document to extract facts. The extracted facts will take the
form of a triplet model, Subject-Relation-Object (SRO). However, extracting facts based
solely on predefined patterns does not guarantee the completeness of the facts. Therefore, a
set of rules will be used to enhance the extraction process by considering the relations
between sentences (discourse). Discourse is the link that conceptually ties one sentence to
another. The discourse can be explicit, in which case a transition or clue word helps to
identify the connection. Discourse analysis of written text is still a wide open problem,
while there has been substantial work on analysing spoken discourse.
Anaphora resolution is also one of the major problems in natural language processing,
specifically in natural language understanding. It involves the recognition of proper names
and pronouns (he, she, it, they, …). JavaRAP, a Java implementation of the classic
Resolution of Anaphora Procedure (RAP), will be used to resolve anaphora.
The extracted facts will be used to form the ontology automatically.
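Pattern-based extraction of SRO triples from a tagged sentence can be sketched as below. The single noun-verb-noun pattern, the tag names and the sample sentence are assumptions for illustration; the proposal does not list its predefined patterns, and discourse rules and anaphora resolution are left out.

```python
def extract_sro(tagged_sentence):
    """Match the pattern NN VBD NN (determiners ignored) and return SRO triples."""
    # Drop determiners so that "the wolf ate the grandmother" reduces to NN VBD NN.
    tokens = [(word.lower(), pos) for word, pos in tagged_sentence if pos != "DT"]
    triples = []
    for i in range(len(tokens) - 2):
        (subj, subj_tag), (rel, rel_tag), (obj, obj_tag) = tokens[i:i + 3]
        if subj_tag == "NN" and rel_tag == "VBD" and obj_tag == "NN":
            triples.append((subj, rel, obj))
    return triples

# Hypothetical tagged input, as a POS tagger might produce it.
tagged = [("The", "DT"), ("wolf", "NN"), ("ate", "VBD"), ("the", "DT"), ("grandmother", "NN")]
facts = extract_sro(tagged)
print(facts)
```

A sentence such as "She ate the cake" yields no triple under this pattern because the pronoun is not a noun, which is one reason pattern matching alone does not guarantee complete facts.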
Knowledge Base
Knowledge base rules will be constructed to represent the domain knowledge. They will be
arranged according to the following general framework for story development. The rules are
segmented in this way to ease the reasoning process.
a. Introduction
Most stories start by introducing the background of the story. This can include the
characters, the relationships between characters (family, employer-employee,
friends, etc.) and the setting of the story (location, scene).
b. Conflict
Identify the struggle, battle or difference of opinion between characters. What
actions are taken in the course of the conflict?
c. Resolution
How do the characters resolve the conflict?
What are the consequences of the conflict for the characters?
d. Ending
Moral of the story.
What happens to each character, especially the main character?
Lastly, the constructed ontology will be used as the foundation for the question answering
system. A logic-based approach will be explored for ontology reasoning to answer questions.
The logic-based approach was chosen because it allows descriptions in both concrete and
abstract form. It also provides inferential services that can deduce new information from
given information fully automatically.
Question Analysis and Answer
Questions posted by the user will be analysed to identify their meaning.
All questions will be represented in the same structure as the reasoning language.
Begin
    Question posted by end user
    Question analysis
    Search needed answer in the ontology
    If answer found in ontology then
        Return answer
    else
        Identify related facts in the ontology against the question posted
        Retrieve a set of related facts from ontology
        Start reasoning process
End
Figure 3: Algorithm for answer extraction
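The control flow of Figure 3 can be sketched in Python as follows. The fact store, the is_a fallback rule and the assumption that question analysis has already produced a (relation, object) pair are all hypothetical; the proposal leaves the actual reasoning process to be designed.

```python
# Hypothetical ontology stored as (subject, relation, object) facts.
FACTS = {
    ("wolf", "ate", "grandmother"),
    ("hunter", "killed", "wolf"),
    ("wolf", "is_a", "animal"),
}

def direct_lookup(relation, obj):
    """Return subjects of facts that match the relation and object exactly."""
    return sorted(s for s, r, o in FACTS if r == relation and o == obj)

def reason(relation, obj):
    """Fallback: accept facts whose object is stated to be an instance (is_a) of obj."""
    members = {s for s, r, o in FACTS if r == "is_a" and o == obj}
    return sorted(s for s, r, o in FACTS if r == relation and o in members)

def answer(relation, obj):
    """Search the ontology first; start reasoning only when no direct answer exists."""
    found = direct_lookup(relation, obj)
    return found if found else reason(relation, obj)

print(answer("ate", "grandmother"))  # answered by direct lookup
print(answer("killed", "animal"))    # answered through the is_a fallback
```

The second call finds no stored fact about killing an "animal", so the sketch retrieves the related is_a fact and reasons from it, mirroring the else branch of the algorithm.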
Conclusion
Ontology-based assistance in question answering has not yet been explored in depth. There
are still many research possibilities to pursue. It is hoped that this research will
contribute to this area.
References/Bibliography
1. A. Kawtrakul, M. Suktarachan, A. Imsombut. Automatic Thai Ontology Construction and
Maintenance System. Workshop on Papillon 2004, Grenoble, France, 2004.
2. A. Maedche, S. Staab. Semi-automatic engineering of ontologies from text. In
Proceedings of the 12th International Conference on Software and Knowledge
Engineering. Chicago, USA, July 5-7, 2000. KSI.
3. B. Fortuna, D. Mladenic, M. Grobelnik. Semi-automatic Construction of
Topic Ontologies. EWMF/KDO 2005: 121-131
4. B. Omelayenko. Learning of Ontologies for the Web: the analysis of
Existent Approaches. Proceedings of the International Workshop on
Web Dynamics, 2001.
5. C. Kwok, O. Etzioni, D. S. Weld. Scaling Question Answering to the Web. ACM
Transactions on Information Systems, Vol. 19, No. 3, July 2001, pages 242-262.
6. C. S. Lee, Y. F. Kao, Y. H. Kuo, M. H. Wang. Automated ontology construction for
unstructured text documents. Data Knowl. Eng. 60(3): 547-566 (2007)
7. D. Radev, W.G. Fan, H. Qi, H. Wu, A. Grewal. Probabilistic Question Answering on the
Web. WWW2002, May 7-11, 2002, Honolulu, Hawaii, USA.
8. D. Fensel. The Semantic Web and Its Language, 2000.
9. D. Koller, A. Levy, A. Pfeffer. P-CLASSIC: A tractable probabilistic description logic.
Proc. of the 14th Nat. Conf. on Artificial Intelligence (AAAI-97), 390-397,1997.
10. E. Blomqvist. Fully Automatic Construction of Enterprise Ontologies Using Design
Patterns: Initial Method and First Experiences. In Proc. of the 4th Intl Conf. on
Ontologies, Databases and Applications of Semantics (ODBASE, 2005), Cyprus.
11. H. Kaji, Y. Morimoto, T. Aizono, N. Yamasaki. Corpus-dependent Association Thesauri
for Information Retrieval. COLING 2000: 404-410
12. J. Angele, E. Moench, H. Oppermann, S. Staab, D. Wenke. Ontology-Based Query and
Answering in Chemistry: OntoNova @ Project Halo. In: Proceedings of the Second
International Semantic Web Conference 2003 (ISWC 2003), 20-23 October 2003, Sanibel
Island, Florida, USA.
13. J. Z. Pan, I. Horrocks. Web Ontology Reasoning with Datatype Groups.
14. M. Ciaramita, A. Gangemi, E. Ratsch, J. Saric, I. Rojas. Unsupervised Learning of
Semantic Relations between Concepts of a Molecular Biology Ontology. Proceedings of
the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI '05)
15. M. Vargas-Vera, E. Motta. AQUA: A Question Answering System for Heterogeneous
Sources. TR-kmi-04-20, Knowledge Media Institute, The Open University.
16. M. Vargas-Vera, E. Motta. A Knowledge-based Approach to Ontologies Data Integration.
TR-kmi-04-17, Knowledge Media Institute, The Open University.
17. M. Vargas-Vera, E. Motta. AQUA – Ontology-Based Question Answering System, Third
International Mexican Conference on Artificial Intelligence (MICAI-2004), Lecture
Notes in Computer Science 2972 Springer Verlag, (eds R. Monroy et al), April 26-30,
2004.
18. M. Vargas-Vera, E. Motta, J. Domingue. AQUA: An Ontology-Driven
Question Answering System, AAAI Spring Symposium, New Directions in
Question Answering, Stanford University, 2003.
19. N. Collier, C. Nobata, J. Tsujii. Extracting the Names of Genes and Gene Products with a
Hidden Markov Model. COLING 2000: 201-207
20. N. Henze, P. Dolog, W. Nejdl. Reasoning and Ontologies for Personalized E-Learning in
the Semantic Web. Educational Technology & Society 2004, 7(4), 82-97.
21. O. Streiter, D. Zielinski, I. Ties, L. Voltmer. Term Extraction for Ladin: An Example
Approach. TALN 2003, Batz-sur-Mer, 11-14 juin 2003.
22. P. Atzeni, R. Basili, D. H. Hansen, P. Missier, P. Paggio, M. T. Pazienza, F. M. Zanzotto.
Ontology-Based Question Answering in a Federation of University Sites: The MOSES
Case Study. NLDB 2004: 413-420
23. P. Velardi, P. Fabriani, M. Missikoff. Using Text Processing Techniques to
Automatically Enrich a Domain Ontology. FOIS 2001: 270-284
24. R. Basili, D. H. Hansen, P. Paggio, M. T. Pazienza, F. M. Zanzotto. Ontological resources
and question answering. Workshop on Pragmatics of Question Answering, held jointly
with NAACL 2004 Boston, Massachusetts, May 2004
25. S. Beale, B. Lavoie, M. Mcshane, S. Nirenburg, T. Korelsky. Question Answering Using
Ontological Semantics. In Proceedings of the 2nd Workshop on Text Meaning and
Interpretation of ACL 2004, Barcelona, Spain, pp. 41-48.
26. S. Beale, B. Lavoie, M. McShane, S. Nirenburg, T. Korelsky. Question Answering Using
Ontological Semantics. Proceedings of ACL-2004 Workshop on Text Meaning and
Interpretation, 2004. Barcelona, Spain.
27. S. J. Kang, J. H. Lee. Semi-automatic practical ontology construction by using a
thesaurus, computational dictionaries, and large corpora. Proceedings of the Workshop on
Human Language Technology and Knowledge Management, 2001. Volume 2001
28. T. Berners-Lee, M. Fischetti. Weaving the Web. Harper, San Francisco, 1999.
29. U. Straccia. Uncertainty and Description Logic Programs: A proposal for Expressing
Rules and Uncertainty on Top of Ontologies. Technical Report 2004-TR-14.
30. U. Straccia. A Fuzzy Description Logic. Proceedings of AAAI-98, 15th Conference of
American Association for Artificial Intelligence.
31. V. Lopez, M. Pasin, E. Motta. AquaLog: An Ontology-Portable Question Answering
System for the Semantic Web. ESWC 2005: 546-562
32. V. Lopez, E. Motta. Ontology-Driven Question Answering in AquaLog, In Proceedings of
the 9th International Conference on Applications of Natural Language to Information
Systems, 2004, Manchester, England.
[Time Frame Gantt chart: monthly columns from Sep 2007 through 2009]
ID  Task Name
1   Proposal Defence
2   Text Preprocessing
3   Facts Extraction Algorithm
4   Facts Extraction and Ontology Construction
5   Ontology Reasoning
6   Prototype Testing
7   Prototype Refinement
8   Work Completion