Ontology Generation -- surveys
  Yihong Ding
  CS652 Spring 2004

Three Papers
  Mariano Fernández-López. Overview of Methodologies for Building Ontologies. In IJCAI-99 Workshop on Ontologies and Problem-solving Methods, 1999.
  Borys Omelayenko. Learning of Ontologies for the Web: the Analysis of Existent Approaches. In International Workshop on Web Dynamics, held in conjunction with the 8th International Conference on Database Theory (ICDT'01), 2001.
  Ying Ding and Schubert Foo. Ontology research and development. Part 1: A review of ontology generation. In Journal of Information Science, 2002.

Mariano Fernández-López, 1999
  Proposes guidelines, based on IEEE Standard 1074-1995, for manual ontology development
  Examines the methodologies of five different projects
    Uschold and King, 1995
    Grüninger and Fox, 1995
    Bernaras et al., 1996
    METHONTOLOGY, 1996
    SENSUS, 1997

IEEE Standard 1074-1995
  The IEEE standard for developing software life cycle processes
  Software life cycle model processes (identify and select a life cycle)
  Project management processes (create the framework of the project)
  Software development-oriented processes
    Pre-development processes (study the environment)
    Development processes
      • Requirements process (develop the software requirements specification)
      • Design process (develop a software representation that meets the requirements)
      • Implementation process (transform the representation into a programming language)
    Post-development processes (install, operate, support, and maintain)
  Integral processes (ensure completion and quality)

Criteria for Analyzing Methodologies
  C1. Inheritance from Knowledge Engineering
  C2. Detail of the methodology
  C3. Recommendation for knowledge formalization
  C4. Strategy for building ontologies
    Application-dependent, semi-dependent, or independent
  C5. Strategy for identifying concepts
    Bottom-up, top-down, or middle-out
  C6. Recommended life cycle
  C7. Differences between the methodology and IEEE 1074-1995
  C8. Recommended techniques
  C9. Ontologies and systems built

Uschold and King
  Description: developing the Enterprise Ontology for enterprise modeling processes
  Building process (middle-out)
    Ontology capture
      • Identify key concepts and relationships
      • Produce precise, unambiguous text definitions
      • Identify other terms that refer to the identified concepts and relationships
    Coding
    Integrating existing ontologies

Uschold and King
  Analysis of Methodology
    C1. partial: identifies acquisition, coding, and evaluation stages, but has no feasibility study or prototyping
    C2. very little
    C4. application-independent
    C5. middle-out: starts from the most important concepts and moves to less important ones; the others are obtained by generalization and specialization
    C7. Processes missing: management, pre-development, post-development, and design
        Activities missing: environment study, feasibility study, training, and configuration management
    C8. technical details are unclear

Grüninger and Fox
  Description: developing the TOVE (TOronto Virtual Enterprise) project ontology within the domain of business process and activity modeling
  Building process (middle-out)
    Capture of motivating scenarios
      • Motivating scenarios: problems or examples that are not adequately addressed by existing ontologies
      • A motivating scenario provides possible solutions
      • The solutions provide an informal intended semantics for the objects and relations
    Formulation of informal competency questions
      • Based on the motivating scenarios
      • Serve as constraints rather than determining a particular design
      • Evaluate ontological commitment
    Specification of the terminology of the ontology within a formal language
      • Getting informal terminology: terms extracted from the questions
      • Specification of formal terminology: formalizing the terms
    Formulation of formal competency questions using the terminology of the ontology
    Specification of axioms and definitions for the terms in the ontology within the formal language
    Establishment of conditions for characterizing the completeness of the ontology

Grüninger and Fox
  Analysis of Methodology
    C1. small: a question-and-answer driven approach with little connection to knowledge-based system development
    C2. little
    C3. logic
    C4. application-semidependent (scenarios)
    C5. middle-out
    C7. Processes missing: management, pre-development, post-development, and design
        Activities missing: training and configuration management
    C8. technical details are unclear

Bernaras et al.
  Description: developed within the Esprit KACTUS project, which investigates the feasibility of knowledge reuse in complex technical systems and the role of ontologies in supporting it
  Building process (top-down)
    Specification of the application
    Preliminary design based on relevant top-level ontological categories
      • Involves searching ontologies developed for other applications, which are refined and extended for use in the new application
    Ontology refinement and structuring

Bernaras et al.
  Analysis of Methodology
    C1. big: follows the tradition of knowledge engineering
    C2. very little
    C4. application-dependent
    C5. top-down
    C7. Processes missing: management, pre-development, and post-development
        Activities missing: training, documentation, configuration management, verification, and validation
    C8. technical details are unclear

METHONTOLOGY
  Description
    Enables the construction of ontologies at the knowledge level
    Supported by the Ontology Design Environment (ODE)
    Including
      • Identification of the ontology development process
      • A life cycle based on evolving prototypes
      • Particular techniques for carrying out each activity
    Ontologies developed
      • CHEMICALS
      • Environmental pollutants ontologies
      • The Reference-Ontology
      • The restructured version of the (KA)2 ontology
  Building process (middle-out): refers to which activities are carried out
    Project management activities
      • Planning: identify tasks
      • Control: guarantee that planned tasks are completed when intended
      • Quality assurance: assure the quality of outputs
    Development-oriented activities
      • Specification, conceptualization, formalization, and implementation
    Support activities
      • Knowledge acquisition, evaluation, integration, documentation, and configuration management

METHONTOLOGY
  Analysis of Methodology
    C1. big: it has its roots in a methodology for developing knowledge-based systems
    C2. a lot
    C3. flexible
    C4. application-independent
    C5. middle-out: the most relevant concepts are identified first
    C6. evolving prototypes
    C7. Processes missing: software life cycle model and pre-development
        Activities missing: project initiation, installation, support, retirement, and training
    C8. technical details are unclear

SENSUS
  Description
    Developed for natural language processing
    Content obtained by extracting and merging information from various electronic sources of knowledge
      • PENMAN Upper Model, ONTOS, manually built semantic categories, WordNet, Spanish and Japanese lexical entries
    Including
      • More than 50,000 concepts organized in a hierarchy
      • Both high and medium levels of abstraction
      • Generally does not cover terms from specific domains
  Building process (bottom-up)
    Take a series of seed terms, linked to SENSUS by hand
    Specify paths from the seed terms to the root
    Add more relevant terms
    Prune any irrelevant terms

SENSUS
  Analysis of Methodology
    C1. none: based on adding terms to an existing ontology
    C2. medium: not very detailed
    C3. semantic networks
    C4. application-semidependent
    C5. bottom-up
    C7. Processes missing: management, pre-development, post-development, and design
        Activities missing: training, documentation, configuration management, verification, and validation
    C8. technical details are unclear

Summary
  None of the methodologies is fully mature compared with the IEEE standard
  The proposals are not unified
    SENSUS is completely different from the others
  This suggests adopting several widely accepted methodologies rather than one standardized methodology
    Interpretability between systems is allowed

Borys Omelayenko, 2001
  Learning-based ontology development
  Examines eleven different approaches
    Bisson et al., 2000
    Faure and Poibeau, 2000
    Agirre et al., 2000
    Junker et al., 1999
    Craven et al., 2000
    Bowers et al., 2000
    Taylor et al., 1997
    Webb, Wells, Zheng, 1999
    Soderland et al., 1995
    Maedche and Staab, 2000
    Suryanto and Compton, 2000

Semantic Querying over the Web

Ontological Components
  Natural language ontologies (horizontal)
    Contain lexical relations between language concepts
    Large in size and do not require frequent updates
    Used to expand user queries
    Capture concepts but do not provide detailed descriptions
  Domain ontologies (vertical)
    Capture knowledge of a particular domain
    Provide detailed descriptions of the domain
  Ontology instances (dot)
    Main piece of knowledge presented in the future Semantic Web
    Serve for Web pages
    Contain links to other instances

Ontology Learning Tasks
  Ontology acquisition
    Ontology creation
    Ontology schema extraction
    Extraction of ontology instances
  Ontology maintenance
    Ontology integration and navigation
    Ontology update
    Ontology enrichment

Machine Learning Techniques
  Ontology representation requires symbolic learning methods
    Skip neural networks, genetic algorithms, and the family of ‘lazy learners’
  Methods studied in this paper
    Propositional rule learning (zero-order logic)
    First-order logic rule learning
    Bayesian learning
    Clustering algorithms

ML vs. Manual
  Modeling primitives
    ML: simple and limited (usually simple rules)
    Manual: rich (frames, subclasses, rules with a rich set of operations, functions, etc.)
  Knowledge base structure
    ML: flat and homogeneous
    Manual: hierarchical, consisting of various components with subclass-of, part-of, and other relations
  Tasks
    ML: categorize objects into a limited and unstructured set of classes
    Manual: classify objects into a tree of structured classes
  Problem-solving methods
    ML: very primitive, based on simple search strategies
    Manual: complicated, inference over a knowledge base with rich structure
  Solution space
    ML: non-extensible, fixed set of class labels
    Manual: extensible set of primitive and compound solutions
  Readability of the knowledge bases to a human
    ML: not required
    Manual: required

Requirements for OL
  Aim: automatically construct ontologies with the properties of manually constructed ontologies
  Requirements
    Ability to interact with a human
    Readability of internal and external results of the learner
    Ability to use complex modeling primitives
    Ability to deal with complex solution spaces

Requirements for Ontological Components
  NLO (natural language ontologies)
    Hierarchical clustering of language concepts
    Limited set of relations
    Ability to link to specific domain ontologies
    ML focus: enrichment based on domain texts is popular
    Do not require frequent or automatic updates
  DO (domain ontologies)
    Use the whole set of modeling primitives
    Complex in structure
    ML focus: discovering statistically valid patterns for creation
    Require more updates
  OI (ontology instances)
    Mark-up of concepts of the underlying domain ontology in Web pages
    ML focus: IE and annotation
    Require frequent updates

Learning of NLO
  Bisson et al., 2000 (Mo’K tool)
    Human-assisted bottom-up clustering of conceptual hierarchies from corpora
    Human selects input examples and attributes, level of pruning, and distance evaluation functions
    Group ‘similar’ objects to create the classes
    Group ‘similar’ classes to form the hierarchy
    No human interaction during the clustering process
    Further study on integrating NLO enrichment with Web search for relevant texts

Learning of NLO
  Agirre et al., 2000
    Enrich WordNet by exploiting texts from the Web
    Construct lists (topic signatures) of topically related words (with weight/strength) for each concept in WordNet
    Each word sense has one associated list of related words
    Related Web pages are retrieved from the AltaVista search engine by specifying particular queries
    Each query refers to one particular sense but not the others
    Example: waiter AND (restaurant OR menu) AND NOT (station OR airport)

Learning of NLO
  Faure and Poibeau, 2000 (Asium)
    Create a domain-specific NLO by unsupervised domain-specific clustering of texts from corpora
    Generate the syntactic structure of texts with Sylex
    Cooperative learning of semantic knowledge from parsed texts
    Bottom-up, breadth-first clustering to form the hierarchy
    Experts validate and label concepts

Learning of DO
  Maedche and Staab, 2000
    Semi-automatic ontology learning from texts
    Input: a set of transactions
    Transaction: a set of items appearing together
    Association rule: a set of items that appear together sufficiently often
    ML: discover generalized association rules
    Final: present the rules to the knowledge engineer

Learning of DO
  Suryanto and Compton, 2000
    First attempt at using ML to discover hierarchical relations between textually described classes
    Discover class relations from classification rules
    Three basic relations: intersection, mutual exclusion, similarity
    A measure of degree is defined for each of the three basic relations

Learning of DO
  Taylor et al., 1997
    Ontology-based induction of high-level classification rules
    Ontologies are used not only to explain rules but also to guide the learning algorithm
    Algorithm generates queries for an external learner, ParkaDB
    The DO and input data are used to check the consistency of queries
    Consistent queries become classification rules
    Query generation continues until the set of rules covers the whole data set

Learning of DO
  Webb, Wells, Zheng, 1999
    ML plus knowledge acquisition from experts improves the accuracy of the developed domain ontology and reduces development time
    Three types of knowledge acquisition systems
      • Manual, based on experts
      • ML systems
      • Integrated systems
    ML method: C4.5 decision tree

Learning of OI
  Bowers et al., 2000
    Replace the attribute-value dictionary with a more expressive one that consists of simple data types, tuples, sets, and graphs
    Use a modified C4.5 learner

Learning of OI
  Soderland et al., 1995 (CRYSTAL)
    Formalize ontology instances from text and generate a concept hierarchy from the instances
    Domain model given as input
    Use a richer set of modeling primitives
    Generalize the semantic mark-up of the manually marked-up training corpora
    Formalize the instance level of the hierarchy
    Search-based generalization of concept nodes

Learning of OI
  Craven et al., 2000 (Web-KB)
    Systematic study of the extraction of OI from Web documents
    Takes an ontology of an academic web site and populates it with actual instances and relations from CS departments’ web sites
    Three learning tasks
      • Recognize class instances from hypertext documents, guided by the ontology
      • Recognize relation instances from chains of hyperlinks
      • Recognize class and relation instances from pieces of hypertext
    Two supervised learning methods
      • Naïve Bayes learner
      • Modified FOIL (first-order rule learner)
    Automatically create a mapping between the manually constructed domain ontology and the Web pages by generalizing from the training instances

Summary
  Main problem of OL: the structure learned is flat and homogeneous
  Learning of NLO
    General-purpose NLOs exist
    Mainly enrichment
    Most popular ML algorithm: clustering
  Learning of DO
    Human-guided learning
    Learning plays only a minor role in knowledge acquisition
    Most popular ML algorithm: propositional learning
  Learning of OI
    The structure of OI is too rich to be adequately captured by propositional rules
    Multiple different ML algorithms are applied

Ying Ding and Schubert Foo, 2002
  Methods used and problems encountered in many recent ontology generation approaches
  Examines seven main collections of approaches
    InfoSleuth (MCC)
    SKC (Stanford)
    Ontology Learning (AIFB)
    ECAI 2000
    Inductive logic programming (UT)
    Library Science and Ontology
    Others

InfoSleuth
  A research project at MCC (Microelectronics and Computer Technology Corporation)
    Develop and deploy new technologies for finding information available both in corporate networks and in external networks
  Description
    Locating, evaluating, retrieving, and merging information in a frequently updated environment
    Builds up an ontology-based agent architecture
    Has been successfully implemented in
      • Knowledge management
      • Business intelligence
      • Logistics
      • Crisis management
      • Genome mapping
      • Environment data exchange network

InfoSleuth: method
  Input resources
    Human expert feeds the system a small set of seedwords (high-level concepts)
    IR engine automatically feeds relevant documents (with or without POS tagging)
  System process
    Parse documents
    Extract phrases containing seedwords
    Generate concept terms
    Place them into the ontology
    Collect candidate seedwords for the next round of processing
  Relationship retrieving
    is-a, part-of, manufactured-by, owned-by, etc.
    assoc-with is used to define relations other than is-a
    Use linguistic properties to identify relations
    Human experts evaluate and adjust results
  Special features
    Expand the ontology with new concepts and alert the human expert to update it
    Discover attributes associated with certain concepts
    Index documents for future retrieval
    Allow users to decide between precision and completeness by browsing

InfoSleuth: problems
  Syntactic structure ambiguity (concept token identification)
    Example: image process software
  Different phrases refer to the same concept
  Word sense disambiguation
  Proper attachment of adjective modifiers may help avoid non-concepts
  Heterogeneous resources (inconsistent terminologies)
  An automatically constructed ontology can be too prolific and deficient at the same time (because of the seedwords)

SKC (Scalable Knowledge Composition)
  A research project at Stanford
    Resolve semantic heterogeneity in information systems
  Description
    Derive general methods for ontology integration
    Application-independent
    Develop an ontology algebra
    Convert Webster’s dictionary to a graph structure
    Funded by
      • AFOSR, DARPA, HPKB

SKC: method
  Concept graph
    Details of the technique are unknown
    Use a novel algebraic extraction technique to generate the graph structure and create thesaurus entries for all words, including some stopwords
  ArcRank algorithm to extract relations (idea taken from the PageRank algorithm)
    Basic hypothesis: structural relationships between terms are relevant to their meaning
  Pattern/relation extraction algorithm
    1. Compute a set of nodes that contain arcs comparable to the seed arc set
    2. Threshold them according to their ArcRank value
    3. Extend the seed arc set when the nodes contain further commonality
    4. If the node set has increased in size, repeat from the first step
    The algorithm is self-limiting via the threshold and distinguishes senses

SKC: problems
  Syllable and accent markers in head words
  Misspelled head words
  Mis-tagged fields
  Stemming and irregular verbs
  Common abbreviations in definitions
  Undefined words with common prefixes
  Multi-word head words
  Undefined hyphenated and compound words

Ontology Learning
  A project at AIFB (Institute of Applied Informatics and Formal Description Methods, University of Karlsruhe, Germany)
    Extract ontologies from domain data
  Description
    Learn both taxonomic and non-taxonomic relations for ontologies

OL: method
  Shallow text processing
    Implemented on top of SMES (a text processor for German)
    Uses weighted finite state transducers to process phrasal and sentential patterns
    Output: dependency relations
  Learning algorithm
    Input: dependency relations
    Select the set of documents
    Define association rules
    Determine the confidence of the rules
    Output: association rules exceeding the user-defined confidence

OL: problems
  Lightweight ontologies contain too much noisy data
  The word sense problem generates a lot of ambiguity
  Refinement of the lightweight ontologies is a tricky issue (needs future work)
  Relationship learning is not trivial

ECAI 2000
  Ontology Learning Workshop of ECAI 2000 (European Conference on Artificial Intelligence)
  Description
    Use NLP techniques
    Extract important (high-frequency) words or phrases to define concepts
    Use a general top-level ontology (WordNet, SENSUS) to assist disambiguation
  Problem: relation extraction

Inductive Logic Programming
  WOLFIE (WOrd Learning From Interpreted Examples) at the Machine Learning Group, University of Texas at Austin
  Description
    Learns a semantic lexicon from a corpus of sentences
    Learned lexicon
      • Consists of words paired with meanings
      • Allows synonymy and polysemy
    Ultimate goal: learn to parse novel sentences into their meaning representations
    Has the potential to be a workbench for ontological concept extraction and relation detection
  Problem: how to deploy these methods for ontology concept and rule learning so that the workbench works

Library Science and Ontology
  Digital Library + Semantic Web
  Digital libraries use various forms of vocabularies instead of formal ontologies
  Kwasnik (1999): convert a controlled vocabulary scheme into an ontology
    Higher levels of conception of descriptive vocabulary
    Deeper semantics for class/subclass and cross-class relationships
    Ability to express concepts and relationships in a description language
    Reusability and shareability of the ontological constructs
    Strong inference and reasoning functions
  Problems
    Different ways of modeling knowledge (shallow or deeper semantics)
    Different ways of representing knowledge (lexical-flavored or mathematical and logical-flavored)
    Merging the two fields, or creating a common standard for them, is still a long way off

Others
  Borgo, 1997
    Use lexical semantic graphs to create an ontology
    Based on WordNet
  Yamaguchi, 1999
    Construct domain ontologies
    Based on a machine-readable dictionary
  Kashyap, 1999
    Construct an ontology for IR
    Based on database schemas

Ontology Learning (Research Location Index) [34]
  Europe
    France (7)
    Germany (5)
    Spain (3)
    Others: Italy (2), Austria, Greece, Netherlands, Portugal, Switzerland, UK
    *European Union (2):
      • OntoWeb: University of Karlsruhe
      • On-To-Knowledge: many countries
  USA
    Stanford (2)
    Austin (2): UT, MCC
    Dallas (2): UT, Southern Methodist University
    Other: UC Berkeley, Mississippi State University, BYU, UW
  Others
    Australia, Canada, Israel, Japan, Taiwan (China)

Conclusion
  Top-level NLO: manual construction required, needs human experts
  Domain-level NLO: learnable, fed by
    Top-level NLOs
    Domain descriptions
  Domain ontology: learnable, fed by
    Domain description
    Training documents
  Instance ontology: learnable, fed by
    Domain ontology
    Specified instance Web pages

Conclusion
  Source data
    Semi-structured documents (more or less)
    Seedwords
    Existing generic ontologies (WordNet)
  Concept extraction
    IE, NLP, ML (mostly clustering and inductive learning), assistance from existing digital resources
    High precision, acceptable completeness
  Relationship extraction
    Complex and not well solved
  Ontology reuse is another important issue
  Mapping ontologies to different representations may be valuable (e.g., conceptual graph, conceptual hierarchy, description logic, ontology language)
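
Worked example: association rules for relationship extraction
  The surveys describe association-rule learning (Maedche and Staab; the AIFB method) only in outline: transactions are sets of items that appear together, and rules are kept when they exceed a user-defined confidence. The Python sketch below is a minimal illustration of that idea, not the authors' implementation. It handles plain pairwise rules only, not the generalized (taxonomy-aware) rules the papers mention, and the function name, the thresholds, and the sample transactions are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def association_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) for item pairs.

    Hypothetical helper: plain pairwise rules, no concept taxonomy.
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # sorted so each pair has one canonical order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of transactions containing both items
        if support < min_support:
            continue
        # Try both directions: a -> b and b -> a
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]  # estimate of P(y | x)
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    # Highest confidence first; ties broken alphabetically
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))


# Made-up transactions: concept terms that co-occur in the same sentence
transactions = [
    {"hotel", "room", "price"},
    {"hotel", "room"},
    {"hotel", "restaurant"},
    {"room", "price"},
]

for x, y, support, confidence in association_rules(transactions, 0.5, 0.7):
    print(f"{x} -> {y}  support={support:.2f}  confidence={confidence:.2f}")
```

  In this toy data only the rule price -> room passes both thresholds. As in the surveyed approach, the surviving rules would then be presented to a knowledge engineer for review rather than added to the ontology automatically.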