Learning-based - osm.cs.byu.edu

Ontology Generation
-- surveys
Yihong Ding
CS652 Spring 2004
Three Papers
Mariano Fernández-López. Overview of Methodologies
for Building Ontologies. In IJCAI-99 Workshop on
Ontologies and Problem-solving Methods, 1999.
Borys Omelayenko. Learning of Ontologies for the Web:
the Analysis of Existent Approaches. In International
Workshop on Web Dynamics held in conj. with the 8th
International Conference on Database Theory (ICDT'01),
2001.
Ying Ding and Schubert Foo. Ontology research and
development. Part 1: A review of ontology generation. In
Journal of Information Science, 2002.
Mariano Fernández-López,
1999
Propose lots of guidelines based on IEEE Standard
1074-1995 for manual ontology development
Examine the methodologies for five different
projects
Uschold and King, 1995
Grüninger and Fox, 1995
Bernaras et al., 1996
METHONTOLOGY, 1996
SENSUS, 1997
IEEE Standard 1074-1995
The standard for developing software life cycle processes
Software life cycle model processes (identify and select life cycle)
Project management processes (create framework of project)
Software development-oriented processes
Pre-development processes (study the environment)
Development processes
• Requirement process (develop software requirements specification)
• Design process (develop software representation that meets the requirements)
• Implementation process (transform representation to programming language)
Post-development processes (installation, operation, support, and maintenance)
Integral processes (ensure the completion and quality)
Criteria for Analyzing
Methodologies
C1. Inheritance from Knowledge Engineering
C2. Detail of the methodology
C3. Recommendation for knowledge formalization
C4. Strategy for building ontologies
Application dependent, semi-dependent, or independent
C5. Strategy for identifying concepts
Bottom-up, top-down, or middle-out
C6. Recommended life cycle
C7. Differences between the methodology and IEEE 1074-1995
C8. Recommended techniques
C9. Ontology and system built
Uschold and King
Description: developing the Enterprise Ontology
for enterprise modeling processes
Building process (middle-out)
Ontology capture
• Identify key concepts and relationships
• Produce precise unambiguous text definitions
• Identify other terms that refer to the identified concepts and
relationships
Coding
Integrating existing ontologies
Uschold and King
Analysis of Methodology
C1. partial: identifies an acquisition, coding and evaluation
stage, but without feasibility study and prototyping
C2. very little
C4. application-independent
C5. middle-out: from most important to less important, the
others from generalization and specialization
C7.
Processes missing: management, pre-development, post-development, and design
Activities missing: environment study, feasibility study, training
and configuration management
C8. technical details are unclear
Grüninger and Fox
Description: developing the TOVE (TOronto Virtual Enterprise) project
ontology within the domain of business processes and activities modeling
Building process (middle-out)
Capture of motivating scenarios
• Motivating scenarios: problems or examples which are not adequately addressed by
existing ontologies
• Motivating scenario provides possible solutions
• Solutions provide an informal intended semantics for the objects and relations
Formulation of informal competency questions
• Based on the motivating scenarios
• Serve as constraints rather than determining a particular design
• Evaluate ontological commitment
Specification of the terminology of the ontology within a formal language
• Getting informal terminology: terms extracted from the questions
• Specification of formal terminology: formalizing terms
Formulation of formal competency questions using the terminology of the ontology
Specification of axioms and definitions for the terms in the ontology within the
formal language
Establish conditions for characterizing the completeness of the ontology
Grüninger and Fox
Analysis of Methodology
C1. small: this is a question-answer-pair driven approach,
not very much involved in knowledge-based system
development
C2. little
C3. logic
C4. application-semidependent (scenarios)
C5. middle-out
C7.
Processes missing: management, pre-development, post-development, and design
Activities missing: training and configuration management
C8. technical details are unclear
Bernaras et al.
Description: developing the Esprit KACTUS
project to investigate the feasibility of knowledge
reuse in complex technical systems and the role of
ontologies to support it
Building process (top-down)
Specification of the application
Preliminary design based on relevant top-level
ontological categories
• It involves searching ontologies developed for other
applications, which are refined and extended for use in the new
application.
Ontology refinement and structuring
Bernaras et al.
Analysis of Methodology
C1. big: follow the tradition of knowledge engineering
C2. very little
C4. application-dependent
C5. top-down
C7.
Processes missing: management, pre-development, and post-development
Activities missing: training, documentation, configuration
management, verification, and validation
C8. technical details are unclear
METHONTOLOGY
Description
Enabling the construction of ontologies at the knowledge level
Supported by Ontology Design Environment (ODE)
Including
• Identification of the ontology development process
• A life cycle based on evolving prototypes
• Particular techniques for carrying out each activity
Ontologies developed
• CHEMICALS
• Environment pollutants ontologies
• The Reference-Ontology
• The restructured version of the (KA)² ontology
Building process (middle-out): refers to which activities are carried out
Project management activities
• Planning: identify tasks
• Control: guarantee that planned tasks are completed as intended
• Quality Assurance: assure the quality of outputs
Development-oriented activities
• Specification, conceptualization, formalization, and implementation
Support activities
• Knowledge acquisition, evaluation, integration, documentation, and configuration management
METHONTOLOGY
Analysis of Methodology
C1. big: it has its roots in a methodology for developing
knowledge-based systems
C2. a lot
C3. flexible
C4. application-independent
C5. middle-out: most relevant concepts are identified first
C6. evolving prototypes
C7.
Processes missing: software life cycle model, and pre-development
Activities missing: project initiation, installation, support,
retirement, and training
C8. technical details are unclear
SENSUS
Description
Developed for natural language processing
Content obtained by extracting and merging information from various
electronic sources of knowledge
• PENMAN Upper Model, ONTOS, manually built semantic categories,
WordNet, Spanish and Japanese lexical entries
Including
• More than 50,000 concepts organized in a hierarchy
• Both high and medium level of abstraction
• Generally does not cover terms from specific domains
Building process (bottom-up)
Take a series of seed terms, linked to SENSUS by hand
Specify paths from the seed terms to the root
Add more relevant terms
Prune any irrelevant terms
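The building process above can be sketched as a small pruning routine. This is only an illustration, assuming the ontology is stored as a child-to-parent map; the toy hierarchy, the seed terms, and the function names `path_to_root` and `build_domain_ontology` are hypothetical and not taken from SENSUS.

```python
def path_to_root(term, parent):
    """Return the chain of terms from `term` up to the root (inclusive)."""
    path = [term]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path


def build_domain_ontology(parent, seeds, extra_terms=()):
    """Keep seed terms, hand-added terms, and all their ancestors; prune the rest."""
    keep = set()
    for term in list(seeds) + list(extra_terms):
        keep.update(path_to_root(term, parent))
    return {child: par for child, par in parent.items() if child in keep}


# Hypothetical toy hierarchy (child -> parent); None marks the root.
parent = {
    "thing": None,
    "artifact": "thing",
    "vehicle": "artifact",
    "aircraft": "vehicle",
    "car": "vehicle",
    "organism": "thing",
    "animal": "organism",
    "dog": "animal",
}

domain = build_domain_ontology(parent, seeds=["aircraft"], extra_terms=["car"])
print(sorted(domain))
```

Everything outside the seed-to-root paths (here the `organism` branch) is pruned, which is why the result depends so heavily on the choice of seed terms.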
SENSUS
Analysis of Methodology
C1. none: based on adding terms into an existing ontology
C2. medium: not very detailed
C3. semantic networks
C4. application-semidependent
C5. bottom-up
C7.
Processes missing: management, pre-development, post-development, and design
Activities missing: training, documentation, configuration
management, verification, and validation
C8. technical details are unclear
Summary
None of the methodologies is fully mature
compared with the IEEE standard
The proposals are not unified
SENSUS is completely different from the others
This suggests adopting several widely accepted
methodologies rather than one standardized one
Interpretability between systems is allowed
Borys Omelayenko
2001
Learning-based ontology development
Examine eleven different approaches
Bisson et al., 2000
Faure and Poibeau, 2000
Agirre et al., 2000
Junker et al., 1999
Craven et al., 2000
Bowers et al., 2000
Taylor et al., 1997
Webb, Wells, and Zheng, 1999
Soderland et al., 1995
Maedche and Staab, 2000
Suryanto and Compton, 2000
Semantic Querying over the Web
Ontological Components
Natural language ontologies (horizontal)
Contain lexical relations between language concepts
Large in size and do not require frequent updates
Used to expand user queries
Capture concepts but do not provide detailed descriptions
Domain ontologies (vertical)
Capture knowledge of a particular domain
Provide detailed descriptions of the domain
Ontology instances (dot)
Main piece of knowledge presented in the future Semantic Web
Serve for Web pages
Contain links to other instances
Ontology Learning Tasks
Ontology acquisition
Ontology creation
Ontology schema extraction
Extraction of ontology instances
Ontology maintenance
Ontology integration and navigation
Ontology update
Ontology enrichment
Machine Learning Techniques
Ontology representation requires symbolic
learning methods
Skip neural networks, genetic algorithms, and the family
of ‘lazy learners’.
Methods studied in this paper
Propositional rule learning (zero-order logic)
First-order logic rules learning
Bayesian learning
Clustering algorithms
ML vs. Manual
Modeling primitives
ML: simple and limited (usually simple rules)
Man: rich (frames, subclasses, rules with rich set of operations, functions, etc.)
Knowledge base structure
ML: flat and homogeneous
Man: hierarchical, consisting of various components with subclass-of, part-of, and
other relations
Tasks
ML: categorize objects into a limited and unstructured set of classes
Man: classify objects into a tree of structured classes
Problem-solving methods
ML: very primitive, based on simple search strategies
Man: complicated, inference over a knowledge base with rich structure
Solution space
ML: non-extensible, fixed set of class labels
Man: extensible set of primitive and compound solutions
Readability of the knowledge bases to a human
ML: not required
Man: required
Requirements for OL
Aim: automatically construct ontologies with the
properties of manually constructed ontologies
Requirements
Ability to interact with a human
Readability of internal and external results of the
learner
Ability to use complex modeling primitives
Ability to deal with complex solution spaces
Requirements for
Ontological Components
NLO
Hierarchical clustering of language concepts
Limited set of relations
Ability to link to specific domain ontologies
ML focus: enrichment based on domain texts is popular
Do not require frequent or automatic updates
DO
Use the whole set of modeling primitives
Complex in structure
ML focus: discovering statistically valid patterns for creation
Require more updates
OI
Concepts mark-up of the underlying domain ontology in Web pages
ML focus: IE and annotation
Require frequent updates
Learning of NLO
Bisson et al., 2000 (Mo’K tool)
Human-assisted bottom-up clustering of
conceptual hierarchies from corpora
Human selects input examples and attributes, level of
pruning, and distance evaluation functions
Group ‘similar’ objects to create the classes
Group ‘similar’ classes to form the hierarchy
No human interaction during clustering process
Further study on integrating NLO enrichment with
the Web search of relevant texts
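The two grouping steps above can be illustrated with a plain bottom-up (agglomerative) clustering loop. This is a generic sketch, not the Mo'K implementation: the cosine measure, the summed-vector centroid, the stopping `threshold`, and the toy word/context counts are all assumptions chosen for illustration, standing in for the distance function and pruning level that the human selects.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(x * v.get(k, 0) for k, x in u.items())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def centroid(members, vectors):
    """Sum the attribute vectors of all members of a cluster."""
    total = {}
    for m in members:
        for k, x in vectors[m].items():
            total[k] = total.get(k, 0) + x
    return total


def cluster(vectors, threshold):
    """Merge the two most similar clusters until no pair reaches `threshold`."""
    clusters = [frozenset([w]) for w in vectors]
    merges = []  # the merge order gives the levels of the hierarchy

    def sim(pair):
        return cosine(centroid(pair[0], vectors), centroid(pair[1], vectors))

    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=sim)
        if sim(best) < threshold:
            break
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
        merges.append((set(best[0]), set(best[1])))
    return clusters, merges


# Hypothetical input: nouns described by counts of the verbs they occur with.
vectors = {
    "apple": {"eat": 3, "peel": 2},
    "pear": {"eat": 2, "peel": 1},
    "car": {"drive": 4, "park": 2},
    "truck": {"drive": 3, "park": 1},
}

clusters, merges = cluster(vectors, threshold=0.5)
print(clusters)
```

With these toy counts the fruit words and the vehicle words end up in separate classes, and the loop stops because the two classes share no attributes.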
Learning of NLO
Agirre et al., 2000
Enrich WordNet by exploiting texts from the Web
Construct lists (topic signatures) of topically related
words (with weight/strength) for each concept in
WordNet
Each word sense has one associated list of related
words
Related Web pages retrieved from the AltaVista search engine
by specifying particular queries
Query refers to a particular sense but not others
Example: waiter AND (restaurant OR menu) AND
NOT (station OR airport)
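A sense-specific query of this shape can be assembled mechanically from cue words for each sense. The sketch below is illustrative only: the `sense_query` helper, the sense labels, and the cue-word lists are hypothetical, and the way the original system chose its cue words is not shown here.

```python
def sense_query(word, senses, target):
    """Build a boolean query that targets one sense of `word` and excludes the others."""
    include = " OR ".join(senses[target])
    exclude = " OR ".join(
        cue for sense, cues in senses.items() if sense != target for cue in cues
    )
    query = f"{word} AND ({include})"
    if exclude:
        query += f" AND NOT ({exclude})"
    return query


# Hypothetical cue words for two senses of "waiter".
senses = {
    "restaurant_waiter": ["restaurant", "menu"],
    "person_who_waits": ["station", "airport"],
}

print(sense_query("waiter", senses, "restaurant_waiter"))
```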
Learning of NLO
Faure and Poibeau, 2000 (Asium)
Creating domain-specific NLO by unsupervised
domain-specific clustering of texts from corpora
Generate syntactical structure of texts by Sylex
Cooperative learning of semantic knowledge from
parsed texts
Bottom-up, breadth-first clustering to form the
hierarchy
Experts validate and label concepts
Learning of DO
Maedche and Staab, 2000
Semi-automatic ontology learning from texts
Input : a set of transactions
Transaction: contain a set of items appearing together
Association rule: sets of items that appear together
sufficiently often
ML: discover generalized association rules
Final: present the rules to the knowledge engineer
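To make the transaction and rule vocabulary concrete, here is a simplified sketch of rule discovery over a concept taxonomy. It is not the algorithm used in the paper: it only handles rules between two single items, uses brute-force counting, and the taxonomy, transactions, thresholds, and function names are hypothetical. "Generalized" here means that each transaction is extended with the ancestors of its items, so rules can hold between higher-level concepts.

```python
from itertools import permutations


def ancestors(item, parent):
    """All ancestors of `item` in a child -> parent taxonomy."""
    out = []
    while item in parent:
        item = parent[item]
        out.append(item)
    return out


def generalized_rules(transactions, parent, min_support, min_confidence):
    """Return (x, y, support, confidence) for every rule x -> y above both thresholds."""
    # Extend each transaction with the ancestors of its items.
    extended = [set(t) | {a for i in t for a in ancestors(i, parent)} for t in transactions]
    n = len(extended)
    items = set().union(*extended)
    rules = []
    for x, y in permutations(sorted(items), 2):
        # Skip rules between an item and its own ancestor: they hold trivially.
        if y in ancestors(x, parent) or x in ancestors(y, parent):
            continue
        both = sum(1 for t in extended if x in t and y in t)
        with_x = sum(1 for t in extended if x in t)
        support = both / n
        confidence = both / with_x
        if support >= min_support and confidence >= min_confidence:
            rules.append((x, y, support, confidence))
    return rules


# Hypothetical taxonomy and transactions (concepts that co-occur in a sentence).
parent = {"hotel": "accommodation", "hostel": "accommodation",
          "sauna": "wellness", "pool": "wellness"}
transactions = [{"hotel", "sauna"}, {"hotel", "pool"}, {"hostel", "pool"}, {"hotel"}]

rules = generalized_rules(transactions, parent, min_support=0.6, min_confidence=0.7)
for rule in rules:
    print(rule)
```

In this toy data no rule between leaf concepts is frequent enough, but the rule between the parent concepts `accommodation` and `wellness` is, which is the kind of candidate relation that would be shown to the knowledge engineer.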
Learning of DO
Suryanto and Compton, 2000
First attempt at using ML to discover hierarchical
relations between textually described classes
Discover class relations between classification rules
Three basic relations: intersection, mutual-exclusion,
similarity
A measure of degree is defined for each of the three
basic relations
Learning of DO
Taylor et al., 1997
Ontology-based induction of high-level
classification rules
Ontologies not only for explaining rules but also to
guide learning algorithm
Algorithm generates queries for an external learner
ParkaDB
DO and input data check consistency of queries
Consistent queries become classification rules
Query generation continues until the set of rules covers
the whole data set
Learning of DO
Webb, Wells, and Zheng, 1999
ML plus knowledge acquisition from experts
improves the accuracy of the developed domain
ontology and reduces development time
Three types of knowledge acquisition systems
• Manually based on experts
• ML systems
• Integrated system
ML method: C4.5 decision tree
Learning of OI
Bowers et al., 2000
Replacing the attribute-value dictionary
with a more expressive one that consists of
simple data types, tuples, sets and graphs
Using modified C4.5 learner
Learning of OI
Soderland et al., 1995 (CRYSTAL)
Formalize ontology instances from text and
generate a concept hierarchy from the instances
Given domain model as input
Use a richer set of modeling primitives
Generalize semantic mark-up of the manually marked-up training corpora
Formalize the instance level of hierarchy
Search-based generalization of concept nodes
Learning of OI
Craven et al., 2000 (Web-KB)
Systematic study of the extraction of OI from Web
documents
An ontology of an academic web site is populated with actual
instances and relations from CS departments’ web sites
Three learning tasks
• Recognize class instances from hypertext documents guided by the
ontology
• Recognize relation instances from the chains of hyperlinks
• Recognize class and relation instances from the pieces of hypertext
Two supervised learning methods
• Naïve Bayes learner
• Modified FOIL (first-order rule learner)
Automatically create mapping between the manually constructed
domain ontology and the Web pages by generalizing from the
training instances
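The first learning task, assigning a page to an ontology class, can be illustrated with a small multinomial Naïve Bayes text classifier. This is a generic textbook sketch with add-one smoothing, not the Web-KB code; the class labels, the toy training pages, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def train(pages):
    """pages: list of (class_label, list_of_words) training examples."""
    class_counts = Counter(label for label, _ in pages)
    word_counts = defaultdict(Counter)
    for label, words in pages:
        word_counts[label].update(words)
    vocab = {w for _, words in pages for w in words}
    return class_counts, word_counts, vocab


def classify(words, model):
    """Return the class with the highest log-posterior under add-one smoothing."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = log(count / total)  # log prior
        for w in words:
            if w in vocab:  # ignore words never seen in training
                score += log((word_counts[label][w] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical manually labeled training pages.
pages = [
    ("course", ["syllabus", "homework", "lecture"]),
    ("course", ["lecture", "exam", "homework"]),
    ("faculty", ["professor", "research", "publications"]),
    ("faculty", ["research", "professor", "teaching"]),
]

model = train(pages)
print(classify(["homework", "exam"], model))
print(classify(["professor", "research"], model))
```

A bag-of-words model like this only covers class instances; recognizing relation instances from chains of hyperlinks needs relational structure, which is where a first-order learner such as FOIL comes in.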
Summary
Main problem of OL: flat and homogeneous structure
learned
Learning of NLO
General-purpose NLO exists
Mainly enrichment
Most popular ML algorithm: clustering
Learning of DO
Human-guided learning
Learning plays only a minor role in knowledge acquisition
Most popular ML algorithm: propositional learning
Learning of OI
The structure of OI is too rich to be adequately captured by
propositional rules
Multiple different ML algorithms are applied
Ying Ding and Schubert Foo
2002
Methods used and problems encountered in many recent
ontology generation approaches
Examine seven main collections of approaches
InfoSleuth (MCC)
SKC (Stanford)
Ontology Learning (AIFB)
ECAI2000
Inductive logic programming (UT)
Library Science and Ontology
Others
InfoSleuth
A research project at MCC (Microelectronics and Computer
Technology Corporation)
Develop and deploy new technologies for finding information
available both in corporate networks and external networks
Description
Locating, evaluating, retrieving, and merging information in a frequently
updating environment
Build up an ontology-based agent architecture
Successfully implemented in
• Knowledge management
• Business intelligence
• Logistics
• Crisis management
• Genome mapping
• Environment data exchange network
InfoSleuth: method
Input resources
Human expert feeds system a small set of seedwords (high-level concept)
IR engine feeds relevant documents (with or without POS tags) automatically
System process
Parse documents
Extract phrases with seedwords
Generate concept terms
Place them into ontology
Collect candidate seedwords for next round of processing
Relationship retrieving
is-a, part-of, manufactured-by, owned-by, etc.
assoc-with is used to define relations except is-a
Use linguistic properties to identify relations
Human experts evaluate and adjust results
Special features
Expand ontology with new concepts and alert human expert to update
Discover attributes associated with certain concepts
Index documents for future retrieval
Allow users to decide between precision and completeness by browsing
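The seedword loop in the system process can be sketched as follows. This is a rough illustration under strong assumptions, not the InfoSleuth implementation: noun phrases are assumed to be already extracted by a parser, the last word of a phrase is treated as its head, and a phrase is placed under its head as an is-a child. The phrase list and the `expand` function are hypothetical.

```python
def expand(seedwords, phrases, rounds=2):
    """Grow a concept hierarchy from seed words over several rounds."""
    ontology = {}  # concept term -> parent concept
    seen = set(seedwords)
    frontier = set(seedwords)
    for _ in range(rounds):
        candidates = set()
        for phrase in phrases:
            words = phrase.split()
            head = words[-1]  # assumption: the head noun is the last word
            # A multi-word phrase headed by a current seed word becomes its sub-concept.
            if len(words) > 1 and head in frontier and phrase not in ontology:
                ontology[phrase] = head
                # Modifiers become candidate seed words for the next round.
                candidates.update(w for w in words[:-1] if w not in seen)
        if not candidates:
            break
        seen |= candidates
        frontier = candidates
    return ontology


# Hypothetical noun phrases extracted from the parsed documents.
phrases = ["flat panel display", "video display", "touch panel", "digital video"]

ontology = expand(["display"], phrases)
for term, parent in ontology.items():
    print(f"{term} is-a {parent}")
```

Because every modifier is promoted to a candidate seed word, a loop like this can grow quickly in unwanted directions while still missing concepts that never co-occur with a seed, which is why the human expert evaluates and adjusts the results after each round.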
InfoSleuth: problems
Syntactic structure ambiguity (concept token
identification)
image process software
Different phrases refer to the same concept
Word sense disambiguation
Proper attachment of adjective modifier may help avoid
non-concepts
Heterogeneous resources (inconsistent terminologies)
Automatically constructed ontology can be too prolific and
deficient at the same time (because of the seedwords)
SKC (Scalable Knowledge
Composition)
A research project at Stanford
Resolve semantic heterogeneity in information
systems
Description
Derive general methods for ontology integration
Application-independent
Develop an ontology algebra
Convert Webster’s dictionary to a graph structure
Funded by
• AFOSR, DARPA, HPKB
SKC: method
Concept graph technique detail is unknown
Use a novel algebraic extraction technique to generate the graph
structure and create thesaurus entries for all words including some
stopwords
Idea from PageRank algorithm
ArcRank algorithm to extract relations
Basic hypothesis: structural relationships between terms are relevant to
their meaning
Pattern/Relation extraction algorithm
Compute a set of nodes that contain arcs comparable to seed arc set
Threshold them according to ArcRank value
Extend seed arc set, when nodes contain further commonality
If the node set increased in size repeat from the first step
The algorithm is self-limiting via the threshold and distinguishes senses
SKC: problems
Syllable and accent markers in head words
Misspelled head words
Mis-tagged fields
Stemming and irregular verbs
Common abbreviations in definitions
Undefined words with common prefixes
Multi-word head words
Undefined hyphenated and compound words
Ontology Learning
A project in AIFB (Institute of Applied
Informatics and Formal Description
Methods, University of Karlsruhe,
Germany)
Extract ontology from domain data
Description
To learn both taxonomic and non-taxonomic
relations for ontologies
OL: method
Shallow text processing
Implemented on top of SMES (a text processor for German)
Use weighted finite state transducers to process phrasal and
sentential patterns
Output dependency relations
Learning algorithm
Input dependency relations
Select the set of documents
Define association rules
Determine confidence for the rules
Output association rules exceeding the user-defined confidence
OL: problems
Lightweight ontology contains too much
noisy data
Word sense problem generates a lot of
ambiguity
Refinement of the lightweight ontologies is
a tricky issue (needs future work)
Relationship learning is not trivial
ECAI 2000
Ontology Learning Workshop of ECAI 2000
(European Conference on Artificial Intelligence)
Description
Use NLP techniques
Extract important (high frequency) words or phrases to
define concepts
Use general top-level ontology (WordNet, SENSUS) to
assist disambiguation
Problem: relation extraction
Inductive Logic Programming
WOLFIE (WOrd Learning From Interpreted Examples) at
Machine Learning Group in University of Texas at Austin
Description
Learn semantic lexicon from a corpus of sentences
Learned lexicon
• Consist of words with meaning
• Allow synonymy and polysemy
Ultimate goal: learn to parse novel sentences into their meaning
representations
Have the potential to be a workbench for ontological concept
extraction and relation detection
Problem: how to deploy their methods for ontology
concept and rule learning to make the workbench work
Library Science and Ontology
Digital Library + Semantic Web
Digital libraries use various forms of vocabularies instead of formal
ontologies
Kwasnik (1999) converts a controlled vocabulary scheme into an
ontology
Higher levels of conception of descriptive vocabulary
Deeper semantics for class/subclass and cross-class relationships
Ability to express concepts and relationship in a description language
Reusability and shareability of the ontological constructs
Strong inference and reasoning functions
Problems
Different ways of modeling knowledge (shallow or deeper semantics)
Different ways of representing knowledge (lexical-flavored or
mathematical and logical-flavored)
Merging the two fields or creating a common standard for them will take a
long time
Others
Borgo 1997
Use lexical semantic graphs to create ontology
Based on WordNet
Yamaguchi 1999
Construct domain ontologies
Based on a machine-readable dictionary
Kashyap 1999
Construct ontology for IR
Based on database schema
Ontology Learning
(Research Location Index) [34]
Europe
France (7)
Germany (5)
Spain (3)
Others: Italy (2), Austria, Greece, Netherlands, Portugal, Switzerland, UK
*European Union (2):
• OntoWeb: University of Karlsruhe
• On-To-Knowledge: many countries
USA
Stanford (2)
Austin (2): UT, MCC
Dallas (2): UT, Southern Methodist University
Other: UC Berkeley, Mississippi State University, BYU, UW
Others
Australia, Canada, Israel, Japan, Taiwan (China)
Conclusion
Top-level NLO: manual construction required, need human
experts
Domain-level NLO: learnable, fed by
Top-level NLOs
Domain descriptions
Domain ontology: learnable, fed by
Domain description
Training documents
Instance ontology: learnable, fed by
Domain ontology
Specified instance Web pages
Conclusion
Source data
Semi-structured documents (more or less)
Seedwords
Existing generic ontologies (WordNet)
Concept extraction
IE, NLP, ML (mostly clustering and inductive learning), existing
digital resource assistance
High precision, not bad completeness
Relationship extraction
Complex and not well-solved
Ontology reuse is another important issue
To map ontologies to different representations may be
valuable (like conceptual graph, conceptual hierarchy,
description logic, ontology language)