S K C An Algebraic Approach to

advertisement
S
K
C
An Algebraic Approach to
Articulate Ontologies for
Information Integration
February 2001
Gio Wiederhold
Stanford University
7/26/2016
Gio Wiederhold SKC 1
What are Ontologies?
Ontologies list the terms and their relationships
that allow communication among partners in
enterprises (in machine-readable form)
Relationships determine meaning - parent, school, company
Databases use ontologies during design
in their E-R diagrams
(Implicitly)
and
represent the leaf nodes in their schemas
Knowledge-bases use ontologies (often implicitly)
add class definition (to hold instances), constraints,
and, sometimes, operations among the terms
7/26/2016
Gio Wiederhold SKC 2
Functions of Ontologies
.
• Enable Precision in Understanding
People = designers, implementors, users, maintainers
Systems = implementors = users = maintainers
• Share the Cost of Knowledge Acquistion &
Maintenance
reuse encoded knowledge, remain up-to-date as domains change
• Enable Information Interoperation
Define the terms that link domains
7/26/2016
*
Gio Wiederhold SKC 3
Ancestors of Ontologies
 Lexicons:
 Taxonomies:
.
collect terms used in informtion systems
categorize, abstract, classify terms
 Schemas of databases: attributes, ranges, constraints
 Data dictionaries: systems with multiple files, owners
 Object libraries: grouped attributes, inherit., methods
 Symbol tables: terms bound to implemented programs
 Domain object models: (XML DTD): interchange terms
 . . . More Knowledge formalized
7/26/2016
Gio Wiederhold SKC 4
Establishing Ontologies
.
Top-down:
–Commonly acceptable UPPER layers
Domain-specific
–Sharing tools
–Object based
Bottom-up
–Pragmatic, TASK-specific collections
–Database schemas and models
7/26/2016
Gio Wiederhold SKC 5
Heterogeneity among Domains
If interoperation involves distinct
domains mismatch ensues
• Autonomy conflicts with consistency,
– Local Needs have Priority,
– Outside uses are a Byproduct
Heterogeneity must be addressed
• Platform and Operating Systems 
• Representation and Access Conventions 
• Naming and Ontology
7/26/2016

Gio Wiederhold SKC 6
Semantic Mismatches
Information comes from many autonomous sources
• Differing viewpoints
(by source)
– differing terms for similar items
{ lorry, truck }
– same terms for dissimilar items
trunk(luggage, car)
– differing coverage
vehicles (DMV, AIA)
– differing granularity
trucks (shipper, manuf.)
– different scope
student museum fee, Stanford
• Hinders use of information from disjoint sources
– missed linkages
loss of information, opportunities
– irrelevant linkages
overload on user or application
program
• Poor precision when merged
Still ok for web browsing , poor for business & science
7/26/2016
Gio Wiederhold SKC 7
Need for precision
More precision is needed as data volume increases
--- a small error rate still leads to too many errors
False Positives have to be investigated
( attractive-looking supplier - makes toys
apparent drug-target with poor annotation )
Information Wall
lost opportunities,
suboptimal to some degree
False positives = poor precision
typically cost more than
false negatives = poor recall
data errors
False Negatives cause
information quantity
adapted from Warren Powell, Princeton Un.
7/26/2016
Gio Wiederhold SKC 8
Ontology Sharing
Three Alternatives
 Create a committee to define everybody’s terms
 Takes many years, until people are worn out
 Ignored when changes make deviation necessary
 Get all terms and put them into large model
[ Cyc, UMLS, Federated Schemas, . . . ]
 Can be rapid
 Ignores conflicts
 Hard to maintain (requires committee)
 Keep all Terms distinct, except where sharing
 Requires initial effort
 Empowers participants
7/26/2016
Gio Wiederhold SKC 9
Proposed Language Solutions
Specify and define terminology usage: ontology
• Domain-specific ontologies XML DTD assumption
– Small, focused, cooperating groups
– high quality, some examples - genomics, arthritis, Shakespeare plays
– allows sharable, formal tools
– ongoing, local maintenance affecting users - annual updates
– poor interoperation, users still face inter-domain mismatches
• Cannot achieve globally consistency
– wonderful for users and their programs
– too many interacting sources
– long time to achieve, 2 sources (UAL, BA), 3 (+ trucks), 4, … all ?
– costly maintenance, since all sources evolve
– no world-wide authority to dictate conformance
7/26/2016
Gio Wiederhold SKC 10
An unsolved problem
Common assumption in assembling and integrating
distributed information resources
• The language used by the resources is the same
• Sub languages used by the resources are subsets of
a globally consistent language
This assumption is provably false
Working towards the goal of globally consistency is
1. naïve -- the goal cannot be achieved
2. inefficient -- languages are efficient in local contexts
7/26/2016
Gio Wiederhold SKC 11
General Ontologies?
 Have all the Knowledge together
+ simple for customers of KBs
– hard for owners of KBs
 Large KB will cover multiple domains
 created by a committee -- slow
 maintained by a committee -- costly
 Differences in level of abstraction -- efficiency
 homeowner: nail
 carpenter: sinker, brad, boxnail, . . .
7/26/2016
Gio Wiederhold SKC 12
Structural Heterogeneity
7/26/2016
Gio Wiederhold SKC 13
Domains and Consistency
.
• a domain will contain many objects
• the object configuration is consistent
• within a domain all terms are consistent &
• relationships among objects are consistent
Domain Ontology
• context is implicit
No committee is needed
to forge compromises *
within a domain
 Compromises hide valuable details
7/26/2016
Gio Wiederhold SKC 14
SKC grounded definition
.
• Ontology:
a set of terms and their relationships
• Term:
a reference to real-world and abstract objects
• Relationship:
a named and typed set of links between objects
• Reference:
a label that names objects
• Abstract object:
a concept which refers to other objects
• Real-world object:
an entity instance with a physical manifestation
7/26/2016
Gio Wiederhold SKC 15
Domain-specific Expertise
.
Knowledge needed is huge
• Partition into natural
domains
• Determine domain
responsibility and
authority
• Empower domain owners
• Provide tools
Consider interaction
7/26/2016
Society of
specialists
Gio Wiederhold SKC 16
An Ontology Algebra
A knowledge-based algebra for ontologies
Intersection
Union
Difference
create a subset ontology
keep sharable entries
create a joint ontology
merge entries
create a distinct ontology
remove shared entries
The Articulation Ontology (AO) consists of
matching rules that link domain ontologies
7/26/2016
Gio Wiederhold SKC 17
Sample Operation: INTERSECTION
Result contains
shared terms
Source Domain 1:
Owned and maintained
by Store
7/26/2016
Terms useful
for purchasing
Source Domain 2:
Owned and maintained
by Factory
Gio Wiederhold SKC 18
INTERSECTION support
Articulation ontology
Terms useful
for purchasing
Matching
rules that use
terms from the
2 source domains
Store
Ontology
7/26/2016
Factory
Ontology
Gio Wiederhold SKC 19
Sample Intersections
Articulation
size = size
ontology
matching rules : color =table(colcode)
style = style
Anatomy
{. . . }
Shoe Factory
Shoe Store
• Shoes { . . . }
• Customers { . . . }
• Employees { . . . }
foot = foot
Employees
Nail (toe, foot)
...
7/26/2016
• Material inventory {...}
• Employees { . . . }
• Machinery { . . . }
• Processes { . . . }
• Shoes { . . . }
Department
Store
Hardware
Employees
Nail (fastener)
...
Gio Wiederhold SKC 20
Other Basic Operations
DIFFERENCE: material
fully under local control
UNION: merging
entire ontologies
Articulation
ontology
typically prior
intersections
7/26/2016
Gio Wiederhold SKC 21
Features of an algebra
Operations can be composed
Operations can be rearranged
Alternate arrangements can be evaluated
Optimization is enabled
The record of past operations can be
kept and reused
7/26/2016
Gio Wiederhold SKC 22
Sample Processing in HPKB
• What is the most recent year an – Problems resolved by SKC
OPEC member nation was on the
* Factbook – a secondary
UN security council (SC)?
source -- has out of date
– Related to DARPA HPKB
OPEC & UN SC lists
Challenge Problem
• Indonesia not listed
– SKC resolves 3 Sources
• Gabon (left OPEC 1994)
» CIA Factbook ‘96 (nation)
» OPEC (members, dates)
» UN (SC members, years)
– SKC obtains the
Correct Answer
» 1996 (Indonesia)
– Other groups obtained more,
but factually wrong answers
7/26/2016
* different country names
• Gambia => The Gambia
* historical country names
• Yugoslavia
» UN lists future security
council members
• Gabon 1999
needed ancillary data
Gio Wiederhold SKC 23
Interoperation via Articulation
At application definition time
– Match ontologies
– Establish articulation rules.
– Record the process
At execution time
– Query rewriting
– Optimization based on an Ontology Algebra.
For maintenance
–Regenerate rules using the stored
formulation
7/26/2016
Gio Wiederhold SKC 24
Semi-automatic approach
Provide library of automatic match heuristics
•
•
•
•
•
Lexical Methods -- spelling
Structural Methods -- relative graph position
Reasoning-based Methods
Nexus 
Hybrid Methods
– Iterative/Non-iterative Methods
GUI tool to
- display matches and
- verify generated matches using human expert
- expert can also supply matching rules
7/26/2016
Gio Wiederhold SKC 25
Articulation Generator
Being built by Prasenjit Mitra
Thesaurus
OntA
Phrase Relator
Context-based
Word Relator
Driver
Structural
Matcher
Ont1
Ont2
Semantic
Network
(Nexus)
Human Expert
7/26/2016
Gio Wiederhold SKC 26
Lexical Methods
• Preprocessing rules.
-Expert-generated seed rules.
e.g., (Match O1.President O2.PrimeMinister)
-Context-based preprocessing directives.
• Thesaurus - synonyms, relationships
• Distance of words as measure of
relatedness.
7/26/2016
Gio Wiederhold SKC 27
Vehicle
sales
ontology
Vehicle
registration
ontology
Tools to create articulations
Suggestions
for articulations
7/26/2016
Combine ontology graphs
with expert selection
based on spelling,
graph matching, and
a nexus derived from
a dictionary (O.E.D.)
Gio Wiederhold SKC 28
Tools to create articulations
Graph matcher
for
Articulationcreating
Expert
Transport
ontology
Vehicle
ontology
Suggestions
for articulations
7/26/2016
Gio Wiederhold SKC 29
continue from initial point
Also suggest similar terms
for further articulation:
• by spelling similarity,
• by graph position
• by term match repository
Expert response:
1. Okay
2. False
3. Irrelevant
to this articulation
All results are recorded
Okay’s are converted into articulation rules
7/26/2016
Gio Wiederhold SKC 30
Candidate Match Repository
Term linkages automatically extracted from 1912 Webster’s dictionary *
* free, other sources
have been processed.
.
Based on processing
headwords  definitions
using algebra primitives
Notice presence
of 2 domains:
chemistry, transport
7/26/2016
Gio Wiederhold SKC 31
Using the match repository
7/26/2016
Gio Wiederhold SKC 32
Navigating the match repository
7/26/2016
Gio Wiederhold SKC 33
Relative Arc Importance
• PageRank (Google) limitations
– node oriented
– high rank to words with little semantic value
» conjunctions, articles
The
» prepositions, pronouns
And
to
it
• Relative arc importance
– contribution of source rank to target rank
– vs ,t   ps / | as |
pt
es ,t
7/26/2016
Gio Wiederhold SKC 34
ArcRank
• For All source s and target t nodes in graph
sort outgoing vs ,t j , rank by sorted order res (vs ,t j )
sort incoming vs ,t , rank by sorted order ret (vsi ,t )
i
for each arc es ,t compute mean( res (vs ,t ), ret (vs ,t ))
• In ranking
– Equal values take same rank
– Ranks numbered consecutively
7/26/2016
Gio Wiederhold SKC 35
All Pairs Similarity
• Compute similarity value for all node pairs n,n'
product Pi of inbound arc importance vectors
product Po of outbound arc importance vectors
similarity = Pi  Po
• Similarity Matrix
– Initial state: nodes similar only to themselves
– Node substitution: terms replace similar ones
– Iterative convergence: bounded substitution
7/26/2016
Gio Wiederhold SKC 36
Contents
» Examples
• Verb (Educate)
• Adverb (Ever)
• Proper Noun (Scotland -- undefined term
7/26/2016
Gio Wiederhold SKC 37
Examples (Verb)
7/26/2016
Gio Wiederhold SKC 38
Examples (Adverb)
7/26/2016
Gio Wiederhold SKC 39
Examples (Proper Noun)
7/26/2016
Gio Wiederhold SKC 40
Country Graphs
7/26/2016
Gio Wiederhold SKC 41
To be matched to
7/26/2016
Gio Wiederhold SKC 42
Knowledge Composition
Composed knowledge for
applications using A,B,C,E
Articulation
knowledge
Legend:
U
U
for
(A B) U
(B C) U
(C E)
Articulation
knowledge
(C E)
U
U : union
U
: intersection
Knowledge
resource
E
U
Knowledge
resource
A
7/26/2016
U
(B
C)
Knowledge
resource
B
Knowledge
resource
C
(C
U
U
Articulation
knowledge
for (A B)
D)
Knowledge
resource
D
Gio Wiederhold SKC 43
Primitive Operations
Model and Instance
Unary
• Summarize -- structure up
• Glossarize - list terms
• Filter - reduce instances
• Extract - circumscription
Binary
• Match - data corrobaration
• Difference - distance
measure
• Intersect - schem
discovery
• Blend - schema extension
7/26/2016
Constructors
• create object
• create set
Connectors
• match object
• match set
Editors
• insert value
• edit value
• move value
• delete value
Converters
• object - value
• object indirection
• reference indirection
Gio Wiederhold SKC 44
Future: exploiting the result
Result has links
to source
Avoid n2 problem of interpreter
mapping as stated by Swartout
as an issue in HPKB year 1
Processing & query evaluation
is best performed within Source
Domains & by their engines
7/26/2016
Gio Wiederhold SKC 45
SKC Synopsis
• Research:
– Reliable query answers from heterogeneous,
imperfect data sources
• Sources:
– General: CIA World Factbook ‘96, UN-www, OPEC-www
Webster’s Dictionary, Thesaurus, Oxford English Dictionary
– Topical: OPEC, BattleSpace Sensors, Logistics Servers
• Client:
– DARPA High Performance Knowledge Base project
• Theory:
– Rule-based algebra
– Translation & Composition primitives
7/26/2016
Gio Wiederhold SKC 46
Domain Specialization
• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort *)
to be performed by
• Domain specialists
• Professional organizations
• Field teams of modest size
autonomously
maintainable
Empowerment
* based on experience with software
7/26/2016
Gio Wiederhold SKC 47
Innovation in SKC
•
•
•
•
No need to harmonize full ontologies
Focus on what is critical for interoperation
Rules specific for articulation
Tools for creation and maintenance
– Maintenance is distributed
» to n sources
» to m articulation agents
• Potentially many sets of articulation rules
is m < n2 , depends on semantic architecture density
a research question
7/26/2016
Gio Wiederhold SKC 48
Backup Viewgraphs
7/26/2016
Gio Wiederhold SKC 49
Summary
.
• Algebra enables Interoperation by
dealing explicitly with differences by knowledge
identifying maintenance domains
keeping sources autonomous
• Assumes domain has a common ontology
composing domain ontologies requires the algebra
to manage the linkages where articulation occurs
processes are best executed within the domains
• Knowledge about articulation is disjoint
allows integration specialists to work independently
supports multiple intersections and views
• Maintenance is structured and partitioned
7/26/2016
Gio Wiederhold SKC 50
Current Directions
• Experience with real world (imperfect) data
confirms validity of our approach
– Expert sources are better maintained than
general sources
– Rules applied to multiple sources provide
more reliable and accurate query results
– Component architecture enables scalable,
maintainable knowledge base development
• Developing proof of concept environment with
HPKB standard knowledge base connectivity
interface
7/26/2016
Gio Wiederhold SKC 51
Components of Ontologies
.
• Vocabularies
• words as handles for classes and concepts
• from database schemas. textbooks, catalog
indexes
• identified with their domain or context
• Relationships
• meaning is primarily defined through relationships
 Employee of a Company; Nails in a Shoe
• Annotations
• material to clarify the meaning of the terms
• contributed by users as well as original authors
• best with some examples
7/26/2016
Gio Wiederhold SKC 52
Status
September 1997
.
• Base HPKB funding from AFOSR
– New World Vistas
– some industrial co-funding
• Prior work supported through Commercenet
– support for common representation, an interlingua
• Acquiring ontologies that
–
–
–
–
are interesting to HPKB projects
not trivial, I.e., represent realistic activities
intersectable
Logistics: DoD CIM, CIA, Cyc, . . .
• Starting smart students
• Integrating into architecture managed by TFS
7/26/2016
Gio Wiederhold SKC 53
Information Flow for Training Initiative
sample
scenarios
scenario
refinement
trainer /
controller
aggregation/
analysis/
evaluation
ISI scenario language
Scenarios
Objectives
tasks
explosion
doctrine
TRADOC
mediator
knowledge
base
Requirements
exercise
design
Legend
sources
scenario
justification
Data
collection
Probepoint
settings
draft 1
aggregation
7/26/2016
Gio Wiederhold SKC 54
Interlingua(s)
Interlingua:
Query :
Object Exchange Model
Mediator Specification Language
OEM
MSL
{ OID, LABEL, TYPE, VALUE }
<document
{<author AUTHOR> <title TITLE>}:-
<biblioentry
{<author AUTHOR>}>@biblio
<inproceedings {<title TITLE>}>
@sybase
AND
AND
Equal(AUTHOR, “Jeff Ullman”)
Interlingua:
Query:
Knowledge Interchange Format
Knowledge Query and Manipulation Language
KIF
KQML
(PACKAGE :FROM ap001
:TO ap002
:CONTENT
(MSG
:TYPE query
:CONTENT-LANGUAGE KIF
:CONTENT (and (document (author@biblio ?a) (title@sybase ?t))
(eq “Jeff Ullman” ?a)))
7/26/2016
Gio Wiederhold SKC 55
Support for KB-Algebra
• Ontolingua [Gruber, Fikes @ Stanford KSL]:
Repository for Domain Terminologies
Used for mechanical design, bibliographies, catalogs
• LOOM [MacGregor@ USC ISI]:
Classification-based Expert System
Helps in structuring and processing ontologies
• PROTÉGÉ [Musen@ Stanford MIS]
Reuse
• Penguin [Barsalou, Keller@ Stanford MIS, CIFE]:
Object manipulation based on Relational Algebra
Used for genetics laboratory, building design
7/26/2016
Gio Wiederhold SKC 56
Download