Increasing the Precision of Semantic Interoperation Gio Wiederhold Reind van de Riet celebration

Reind van de Riet celebration
Increasing the Precision of Semantic Interoperation
Gio Wiederhold
Stanford University
August 2000
Thanks to Jan Jannink, Shrish Agarwal, Prasenjit Mitra, Stefan Decker.
Reind van de Riet, music and computing
Achievements
• Computer Science-based Formalization
• Effective Dissemination - students and writings
• Powerful Database Technology
• Language-sensitive Methods
• Ubiquitous Databases
• World-wide interconnectivity
• All the Information one might want somewhere . . . .
Information overload, Data starvation
• More databases
– public & corporate
• Faster communication
– digital
– packeting: TCP-IP, ATM
• World-wide connectivity
– internet
– world-wide web
• Disintermediation
– ubiquitous publishing
Data and Knowledge
Information is created at the confluence of data (the state) & knowledge (the ability to select and project the state into the future)
[Figure: Knowledge Loop and Data Loop. Labels: Storage, Education, Selection, Recording, Integration, Abstraction, Experience, State changes, Decision-making, Action]
Language issues
• Our languages are specialized for our needs
– efficiency of expression: minimal symbols
Examples:
• carpenter versus householder domain
• surgeon versus pathologist
• World-wide communication changes ranges
– geographical locality is less relevant
– conceptual locality is important
• Business requires precision, including that words map consistently to instances
knowledge controls use of data
Transform Data to Information
Application Layer: decision-makers at workstations
Mediation Layer: value-added services
Foundation Layer: data and simulation resources
Heterogeneity among Domains
If interoperation involves distinct domains, mismatch ensues
• Autonomy conflicts with consistency,
– Local Needs have Priority,
– Outside uses are a Byproduct
Heterogeneity must be addressed
• Platform and Operating Systems
• Representation and Access Conventions
• Naming and Ontology
Semantic Mismatches
Information comes from many autonomous sources
• Differing viewpoints (by source)
– differing terms for similar items: { lorry, truck }
– same terms for dissimilar items: trunk(luggage, car)
– differing coverage: vehicles (DMV, AIA)
– differing granularity: trucks (shipper, manuf.)
– different scope: student museum fee, Stanford
• Hinders use of information from disjoint sources
– missed linkages: loss of information, opportunities
– irrelevant linkages: overload on user or application program
• Poor precision when merged
ok for web browsing, poor for business
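The mismatches on this slide can be illustrated with a minimal Python sketch. The two vocabularies and the helper names (`naive_links`, `missed_links`, `irrelevant_links`) are hypothetical and not part of the talk. They only show how matching on spelling alone produces both missed and irrelevant linkages.

```python
# Hypothetical vocabularies from two autonomous sources.
# Each term maps to the concept the source means by it.
uk_shipper = {"lorry": "road freight vehicle", "trunk": "luggage chest"}
us_dealer = {"truck": "road freight vehicle", "trunk": "car storage compartment"}

def naive_links(a, b):
    """Link terms purely by identical spelling, ignoring meaning."""
    return sorted(set(a) & set(b))

def missed_links(a, b):
    """Pairs with the same meaning that spelling-based matching overlooks."""
    return sorted((ta, tb) for ta, ma in a.items() for tb, mb in b.items()
                  if ma == mb and ta != tb)

def irrelevant_links(a, b):
    """Identically spelled terms whose meanings differ between the sources."""
    return sorted(t for t in set(a) & set(b) if a[t] != b[t])

links = naive_links(uk_shipper, us_dealer)
missed = missed_links(uk_shipper, us_dealer)
irrelevant = irrelevant_links(uk_shipper, us_dealer)
```

In this toy case the only spelling-based link (`trunk`) is an irrelevant one, while the useful link between `lorry` and `truck` is missed.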
Need for precision
More precision is needed as data volume increases: a small error rate still leads to too many errors
False Positives have to be investigated
( attractive-looking supplier - makes toys
apparent drug-target with poor annotation )
False Negatives cause lost opportunities, suboptimal to some degree
False positives = poor precision typically cost more than false negatives = poor recall
[Figure: Information Wall. Labels: data errors, information quantity] adapted from Warren Powell, Princeton Un.
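The arithmetic behind "a small error rate still leads to too many errors" can be sketched as follows. The 1% error rate and the result volumes are assumed values chosen for illustration. The function names are not from the talk.

```python
def expected_false_positives(results_returned, error_rate):
    """Wrong results a user must investigate at a fixed error rate."""
    return results_returned * error_rate

def precision(true_pos, false_pos):
    """Share of returned results that are actually relevant."""
    return true_pos / (true_pos + false_pos)

def recall(true_pos, false_neg):
    """Share of relevant items that were actually returned."""
    return true_pos / (true_pos + false_neg)

# Assumed 1% error rate applied to growing result volumes.
for volume in (100, 10_000, 1_000_000):
    print(volume, expected_false_positives(volume, 0.01))
```

The error rate stays constant, but the number of false positives to investigate grows linearly with volume, which is why precision has to improve as volume increases.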
Means to achieve precision
• Reduce redundancy
– omit similar results from alternate sources
reports, workshop papers, journals, books
• Reduce false positives
– recognize contextual domains *
• the same word refers to different object types
nail (carpentry, anatomy), miter (carpentry, religion)
• Abstract findings to higher levels
– Linguistic processing based on customer model
medical case studies have similar formats
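Recognizing contextual domains can be sketched as a lookup keyed on both the word and its domain. The `SENSES` table, the sense descriptions, and the sample hits are hypothetical. Only the word/domain pairs (nail, miter) come from the slide.

```python
# Hypothetical mini-ontology: the same word names different object types
# in different domains.
SENSES = {
    ("nail", "carpentry"): "metal fastener",
    ("nail", "anatomy"): "keratin plate on a finger or toe",
    ("miter", "carpentry"): "angled joint",
    ("miter", "religion"): "bishop's headdress",
}

def resolve(term, domain):
    """Return the sense of a term within an explicit domain, or None if unknown."""
    return SENSES.get((term, domain))

def filter_hits(hits, term, wanted_domain):
    """Keep only hits whose source domain matches the context of the query."""
    return [h for h in hits if h["term"] == term and h["domain"] == wanted_domain]

hits = [
    {"term": "nail", "domain": "carpentry", "doc": "framing guide"},
    {"term": "nail", "domain": "anatomy", "doc": "dermatology note"},
]
relevant = filter_hits(hits, "nail", "carpentry")
```

Making the domain an explicit part of the key is what removes the false positive: the anatomy hit is never returned for a carpentry query.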
Proposed Language Solutions
Specify and define terminology usage: ontology
• Domain-specific ontologies (XML DTD assumption)
– Small, focused, cooperating groups
– high quality, some examples - genomics, arthritis, Shakespeare plays
– allows sharable, formal tools
– ongoing, local maintenance affecting users - annual updates
– poor interoperation, users still face inter-domain mismatches
• Cannot achieve global consistency
– wonderful for users and their programs
– too many interacting sources
– long time to achieve, 2 sources (UAL, BA), 3 (+ trucks), 4, … all ?
– costly maintenance, since all sources evolve
– no world-wide authority to dictate conformance
Domains and Consistency
• a domain will contain many objects
• the object configuration is consistent
• within a domain all terms are consistent &
• relationships among objects are consistent
• context is implicit
[Figure label: Domain Ontology]
No committee is needed to forge compromises * within a domain
* Compromises hide valuable details
Many ontologies Interoperation
Have devolved maintenance onto many domain-specific experts / authorities
• Many applications span multiple domains
• Need an algebra to compute composed ontologies that are limited to their articulation terms
SKC
• enable interpretation within the source contexts
Sample Operation: INTERSECTION
[Figure: Articulation between two source domains]
Source Domain 1: Owned and maintained by Store
Source Domain 2: Owned and maintained by Factory
Result contains shared terms, useful for purchasing
Tools to create articulations
[Figure: Vehicle sales ontology and Vehicle registration ontology, with Suggestions for articulations]
Combine ontology graphs with expert selection based on spelling, graph matching, and a nexus derived from a dictionary (O.E.D.)
An Ontology Algebra
A knowledge-based algebra for ontologies
Intersection: create a subset ontology, keep sharable entries
Union: create a joint ontology, merge entries
Difference: create a distinct ontology, remove shared entries
The Articulation Ontology (AO) consists of matching rules that link domain ontologies
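The three operations can be sketched over very simple structures. This is a minimal illustration under stated assumptions, not the SKC implementation: an ontology is reduced to a set of terms, the articulation is a set of one-to-one matching rules `(term in A, term in B)`, and the Store and Factory terms are made up.

```python
def intersection(a, b, rules):
    """Keep sharable entries: the rule pairs whose terms exist in both sources."""
    return {(x, y) for (x, y) in rules if x in a and y in b}

def union(a, b, rules):
    """Merge entries: matched terms become one joint entry, others stay single."""
    linked = intersection(a, b, rules)
    # Terms are tagged with their source so homonyms are not merged by accident.
    joint = {frozenset({("A", x), ("B", y)}) for x, y in linked}
    joint |= {frozenset({("A", t)}) for t in a - {x for x, _ in linked}}
    joint |= {frozenset({("B", t)}) for t in b - {y for _, y in linked}}
    return joint

def difference(a, b, rules):
    """Remove shared entries: what is left of `a` is under purely local control."""
    return a - {x for x, _ in intersection(a, b, rules)}

# Hypothetical source ontologies and matching rules.
store = {"price", "item", "shelf"}
factory = {"cost", "product", "assembly line"}
rules = {("price", "cost"), ("item", "product")}

articulation = intersection(store, factory, rules)
joint = union(store, factory, rules)
local_only = difference(store, factory, rules)
```

Only the rules decide what counts as shared. Neither source has to rename its terms, which is the point of keeping the articulation separate from the domain ontologies.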
INTERSECTION support
[Figure: Store Ontology and Factory Ontology linked through an Articulation ontology]
Articulation ontology: Terms useful for purchasing
Matching rules that use terms from the 2 source domains
Other Basic Operations
UNION: merging entire ontologies
DIFFERENCE: material fully under local control
Articulation ontology: typically prior intersections
Features of an algebra
Operations can be composed
Operations can be rearranged
Alternate arrangements can be evaluated
Optimization is enabled
The record of past operations can be kept and reused when sources change
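One way to picture a kept and reused record of operations is an expression tree that is re-evaluated when a source changes. The sketch below is an assumption-laden illustration: plain Python sets stand in for ontologies, the `plan` encoding and the `evaluate` function are hypothetical, and the source contents are made up.

```python
# A recorded plan is a nested tuple: ("source", name) or (operation, left, right).
def evaluate(plan, sources):
    """Recompute a recorded plan against the current contents of the sources."""
    kind = plan[0]
    if kind == "source":
        return set(sources[plan[1]])
    left, right = evaluate(plan[1], sources), evaluate(plan[2], sources)
    if kind == "intersection":
        return left & right
    if kind == "union":
        return left | right
    if kind == "difference":
        return left - right
    raise ValueError(f"unknown operation: {kind}")

# Composed operations, recorded once.
plan = ("union",
        ("intersection", ("source", "A"), ("source", "B")),
        ("intersection", ("source", "B"), ("source", "C")))

sources = {"A": {"truck", "car"}, "B": {"truck", "van"}, "C": {"van", "bus"}}
before = evaluate(plan, sources)

# The sources evolve; the same recorded plan is simply evaluated again.
sources["C"].add("truck")
sources["B"].add("bus")
after = evaluate(plan, sources)
```

Because the plan is data rather than hand-written glue code, it can also be rearranged (for example by swapping the operands of an intersection) and the alternatives compared before choosing one.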
Knowledge Composition
[Figure: composition graph over Knowledge resources A, B, C, D, E]
Composed knowledge for applications using A, B, C, E: (A ∩ B) ∪ (B ∩ C) ∪ (C ∩ E)
Articulation knowledge for (A ∩ B), (B ∩ C), (C ∩ E), and (C ∩ D)
Legend: ∪ : union, ∩ : intersection
Domain Specialization
• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort *)
to be performed by
• Domain specialists
• Professional organizations
• Field teams of modest size
autonomously maintainable
Empowerment
* based on experience with software
Summary
To sustain the growth of web usage
1. The value of the results has to keep increasing: precision, relevance not volume
2. Value is provided by mediating experts, encoded as models of diverse resources, diverse customers
Problems being addressed by clear models: redundancy, mismatches, quality, maintenance
Need research to develop tools for these tasks
Integration Science ?
• Databases: access, storage, algebras, scalability
• Systems Engineering: analysis, documentation, costing
• Artificial Intelligence: knowledge mgmt, linguistic models, uncertainty
[Figure: the three fields converge on Integration Science]
Background Slides
In Handout distributed at the VU, but not actually used
Sample Processing in HPKB
• What is the most recent year an OPEC member nation was on the UN security council?
– Related to DARPA HPKB Challenge Problem
– SKC resolves 3 Sources
• CIA Factbook ‘96 (nation)
• OPEC (members, dates)
• UN (SC members, years)
– SKC obtains the Correct Answer
• 1996 (Indonesia)
– Other groups obtained more, but factually wrong answers
Problems resolved by SKC
* Factbook has out of date OPEC & UN SC lists
– Indonesia not listed
– Gabon (left OPEC 1994)
* different country names
– Gambia => The Gambia
* historical country names
– Yugoslavia
• UN lists future security council members
– Gabon 1999
• intent of original question
– Temporal variants
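The kind of cross-source resolution described on this slide can be sketched with toy records. The data below is restricted to facts named on the slide (Indonesia 1996, Gabon leaving OPEC in 1994, Gabon listed for 1999, the Gambia alias). The record layout, the function names, and the `as_of` parameter are assumptions for illustration, not the SKC method.

```python
# Toy records limited to facts named on the slide; the layout is assumed.
ALIASES = {"Gambia": "The Gambia"}                     # different country names
opec_left_year = {"Indonesia": None, "Gabon": 1994}    # None = still a member
un_sc_years = {"Indonesia": [1996], "Gabon": [1999]}   # UN list includes future terms

def canonical(name):
    """Map a source-specific country name to one agreed spelling."""
    return ALIASES.get(name, name)

def was_opec_member(nation, year):
    """True if the nation is in the OPEC list and had not left by `year`."""
    nation = canonical(nation)
    if nation not in opec_left_year:
        return False
    left = opec_left_year[nation]
    return left is None or year < left

def naive_answer():
    """Join on names only, ignoring membership dates and future listings."""
    return max((year, nation) for nation, years in un_sc_years.items()
               if nation in opec_left_year for year in years)

def most_recent_opec_year_on_council(as_of):
    """Latest year up to `as_of` when a then-current OPEC member sat on the council."""
    hits = [(year, canonical(nation))
            for nation, years in un_sc_years.items()
            for year in years
            if year <= as_of and was_opec_member(nation, year)]
    return max(hits) if hits else None

wrong = naive_answer()
answer = most_recent_opec_year_on_council(as_of=1997)
```

The name-only join picks Gabon's 1999 listing, which is both a future term and after Gabon left OPEC. Applying the temporal conditions within each source's own context yields the 1996 (Indonesia) answer reported on the slide.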
Tools to create articulations
Graph matcher for Articulation-creating Expert
[Figure: Transport ontology and Vehicle ontology, with Suggestions for articulations]
continue from initial point
Also suggest similar terms for further articulation:
• by spelling similarity
• by graph position
• by term match repository
Expert response:
1. Okay
2. False
3. Irrelevant to this articulation
All results are recorded
Okay’s are converted into articulation rules
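The suggest-and-review cycle on this slide can be sketched as follows. This covers only the spelling-similarity suggestion, using `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure. The term lists, the threshold, and the scripted expert are hypothetical; in the talk the verdicts come from a human expert.

```python
from difflib import SequenceMatcher

def suggest(terms_a, terms_b, threshold=0.8):
    """Propose candidate term pairs by spelling similarity only."""
    pairs = []
    for a in terms_a:
        for b in terms_b:
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return sorted(pairs, key=lambda p: -p[2])

def review(candidates, expert):
    """Record every verdict; only 'okay' verdicts become articulation rules."""
    log, rules = [], []
    for a, b, _score in candidates:
        verdict = expert(a, b)          # 'okay', 'false', or 'irrelevant'
        log.append((a, b, verdict))
        if verdict == "okay":
            rules.append((a, b))
    return log, rules

def scripted_expert(a, b):
    """Stand-in for the human expert; these verdicts are made up for the demo."""
    verdicts = {("Truck", "Trucks"): "okay", ("Cargo", "Car"): "false"}
    return verdicts.get((a, b), "irrelevant")

transport = ["Truck", "Cargo", "Carrier"]
vehicle = ["Trucks", "Car", "Carriage"]
candidates = suggest(transport, vehicle, threshold=0.7)
log, rules = review(candidates, scripted_expert)
```

Keeping the full log, including the rejected suggestions, means the same pairs need not be shown to the expert again when the sources are re-processed.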
Candidate Match Repository
Term linkages automatically extracted from 1912 Webster’s dictionary *
* free, other sources have been processed.
Based on processing headwords → definitions using algebra primitives
Notice presence of 2 domains: chemistry, transport
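A rough idea of how term linkages might be derived from headwords and their definitions is sketched below. The entries are made up and are not quotations from any dictionary. The stopword list and the functions `extract_links` and `candidates` are assumptions, and the actual SKC extraction is not described on the slide beyond its use of algebra primitives.

```python
import re
from collections import defaultdict

STOPWORDS = frozenset({"a", "an", "the", "of", "for", "or", "and", "to", "in", "used"})

def extract_links(entries):
    """Link each headword to the content words of its definition."""
    links = defaultdict(set)
    for headword, definition in entries.items():
        for word in re.findall(r"[a-z]+", definition.lower()):
            if word not in STOPWORDS and word != headword:
                links[headword].add(word)
    return links

def candidates(links, term):
    """Terms linked to `term` in either direction."""
    forward = links.get(term, set())
    backward = {h for h, words in links.items() if term in words}
    return forward | backward

# Made-up entries, for illustration only.
toy = {
    "lorry": "a truck for carrying goods",
    "truck": "a vehicle for carrying goods",
    "van": "a covered vehicle",
}
links = extract_links(toy)
related = candidates(links, "truck")
```

Such a repository only proposes candidates. Whether `lorry` and `truck` should be articulated is still left to the expert's response.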
Using the match repository
Navigating the match repository
Primitive Operations
Model and Instance
Unary
• Summarize - structure up
• Glossarize - list terms
• Filter - reduce instances
• Extract - circumscription
Binary
• Match - data corroboration
• Difference - distance measure
• Intersect - schema discovery
• Blend - schema extension
Constructors
• create object
• create set
Connectors
• match object
• match set
Editors
• insert value
• edit value
• move value
• delete value
Converters
• object - value
• object indirection
• reference indirection
Future: exploiting the result
Result has links to source
Avoid n² problem of interpreter mapping as stated by Swartout as an issue in HPKB year 1
Processing & query evaluation is best performed within Source Domains & by their engines
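The n² growth can be made concrete with a small count. This is a sketch under an assumed reading of the slide: that mapping every source directly onto every other source needs one interpreter per ordered pair, whereas articulations are only needed for the pairs an application actually uses. The function names and the sample pairs are hypothetical.

```python
def pairwise_mappings(n):
    """Directed source-to-source interpreters needed if every pair is mapped."""
    return n * (n - 1)

def needed_articulations(used_pairs):
    """Articulations needed when only the pairs an application uses are linked."""
    return len({frozenset(pair) for pair in used_pairs})

all_pairs_for_five = pairwise_mappings(5)
# Pairs used by one application over resources A, B, C, E (order does not matter).
used = [("A", "B"), ("B", "C"), ("C", "E"), ("B", "A")]
articulations = needed_articulations(used)
```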