SKC: Scalable Knowledge Composition

September 2001
Gio Wiederhold, Shrish Agarwal, Stefan Decker,
Jan Janninck, Prasenjit Mitra, et al.
Stanford University, CSD
7/26/2016
SKC Synopsis
Data + Knowledge → Information
• Apply relevant Knowledge to relevant Data
[Figure: stages labeled Observations, Quality Filters, Selection, Aggregation of instances, Composition of source information, and Analyses, with "SKC focus" marked]
• to obtain: Information for decision-making
Many sources, disciplines, people
• Extraction of actionable information, so that future benefits can accrue, requires broad-based knowledge
• Areas make deep progress in isolation
• Benefits are possible when
– solid results are available
– the results are integrated or composed
• A broad base leads to heterogeneity and inconsistency of terminologies
Language differences inhibit integration
• An essential feature of science
– autonomy of fields
– differing granularity and scope of focus
– growth of fields requires new terms
• A feature of technological process
– standards require stability
– yesterday’s innovations are today’s infrastructure
– today’s innovations are tomorrow’s infrastructure
• Must be dealt with explicitly
– sharing, integration, and aggregation are essential
– large quantities of data require precision
Semantic Mismatches
Autonomous sources in all domains have
• Differing viewpoints (by source)
– differing terms for similar items: { lorry, truck }
– same terms for dissimilar items: trunk (luggage, car)
– differing coverage: vehicles (DMV, police, AIA)
– differing granularity: trucks (shipper, manuf.)
– different scope: student (museum fee, Stanford)
– different hierarchical structures: supplier vs. usage
• Hinders use of information from disjoint sources
– missed linkages: loss of information, opportunities
– irrelevant linkages: overload on user or application program
• Poor precision when merged
– OK for web browsing, poor for business & science
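The effect of these mismatches on a naive merge can be sketched in a few lines of Python. The two record sets and their meaning strings are hypothetical; only the lorry/truck and trunk examples come from the slide. Matching by raw term spelling produces one irrelevant linkage and misses the one real linkage.

```python
# Hypothetical records from two autonomous sources (term -> meaning).
# Only the lorry/truck and trunk examples are taken from the slide.
uk_shipper = {"lorry": "heavy goods vehicle", "trunk": "luggage"}
us_garage = {"truck": "heavy goods vehicle", "trunk": "car storage compartment"}


def naive_links(a, b):
    """Link entries whose terms are spelled identically (no semantics)."""
    return sorted(set(a) & set(b))


def semantic_links(a, b):
    """Link entries whose meanings agree, regardless of the term used."""
    return sorted(
        (term_a, term_b)
        for term_a, meaning_a in a.items()
        for term_b, meaning_b in b.items()
        if meaning_a == meaning_b
    )


print(naive_links(uk_shipper, us_garage))     # ['trunk'] (an irrelevant linkage)
print(semantic_links(uk_shipper, us_garage))  # [('lorry', 'truck')] (missed by the naive merge)
```

In practice the meanings are not available as comparable strings, which is why explicit matching rules between the sources are needed.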
Heterogeneity among Domains is natural
Interoperation creates mismatch
• Autonomy conflicts with consistency,
– Local Needs have Priority,
– Outside uses are a Byproduct
Heterogeneity must be addressed
• Platform and Operating Systems  
• Data Representation and Access Conventions 
• Metadata: Naming and Ontology

– needed to share data from distinct sources
Two Mismatch Solutions
1. A Single, Globally consistent Ontology ( Your Hope )
– wonderful for users and their programs
– too many interacting sources
– long time to achieve: 2 sources ( UAL, LH ), 3 (+ trucks), 4, … all ?
– costly maintenance, since all sources evolve
– no world-wide authority to dictate conformance
2. Domain-specific ontologies ( XML DTD assumption )
– small, focused, cooperating groups
– high quality, some examples: arthritis, Shakespeare plays
– allows sharable, formal tools
– ongoing, local maintenance affecting users: annual updates
– poor interoperation, users still face inter-domain mismatches
Our approach (SKC project)
1. Define Terminology in a domain precisely
– Schemas, XML DTDs → Ontologies
2. Develop methods to permit interoperation among differing domains (not integration)
– Articulation: support the limited interoperation needed to solve problems in an application domain
– Ontology Algebra: enable scalability to as many sources as are needed to support applications
3. Develop tools to support the methods
– Ontology matching
An Ontology Algebra
The glue that holds the bricks together
A knowledge-based algebra for ontologies
• Intersection: create a subset ontology, keep sharable entries
• Union: create a joint ontology, merge entries
• Difference: create a distinct ontology, remove shared entries
The Articulation Ontology (AO) consists of matching rules that link domain ontologies
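A minimal sketch of the three operations follows. It assumes an ontology is just a set of terms and that articulation rules are pairs (term in the first ontology, term in the second); both are simplifications for illustration, not the SKC representation. The function names and the store/factory term sets are likewise hypothetical.

```python
# Simplifying assumptions: an ontology is a set of terms, and articulation
# rules are (term_in_first, term_in_second) pairs supplied by a matcher.

def intersection(ont1, ont2, rules):
    """Subset ontology: keep only the sharable (matched) entries."""
    return {(a, b) for a, b in rules if a in ont1 and b in ont2}


def union(ont1, ont2, rules):
    """Joint ontology: all entries, with each matched pair merged into one."""
    matched1 = {a for a, _ in rules}
    matched2 = {b for _, b in rules}
    merged = {f"{a}={b}" for a, b in rules}
    return (ont1 - matched1) | (ont2 - matched2) | merged


def difference(ont1, rules):
    """Distinct ontology: the first ontology with shared entries removed."""
    return ont1 - {a for a, _ in rules}


store = {"shoes", "size", "customers"}
factory = {"shoes", "size", "machinery"}
rules = {("shoes", "shoes"), ("size", "size")}

print(sorted(intersection(store, factory, rules)))
print(sorted(union(store, factory, rules)))
print(sorted(difference(store, rules)))  # ['customers']
```

Note that all three operations are driven by the matching rules rather than by term spelling, so the result depends on the quality of the articulation.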
Sample Operation: INTERSECTION
[Figure: Source Domain 1 (owned and maintained by Store) and Source Domain 2 (owned and maintained by Factory), with their intersection as the result]
• Result contains shared terms
• Articulates the two domains
• Terms useful for purchasing
Sample Intersections
Articulation ontology matching rules:
– size = size
– color = table(colcode)
– style = style
– foot = foot
[Figure: source ontologies linked by the matching rules]
• Shoe Factory: Material inventory {...}, Employees {...}, Machinery {...}, Processes {...}, Shoes {...}
• Shoe Store: Shoes {...}, Customers {...}, Employees {...}
• Anatomy {...}
• Department Store: Employees, Nail (toe, foot), ...
• Hardware: Employees, Nail (fastener), ...
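The matching rules size = size, color = table(colcode), and style = style can be read as a translation from factory terms into store terms. The sketch below assumes that reading; the contents of the lookup table, the sample record, and the function name are hypothetical, since the slide names table(colcode) but does not show it.

```python
# Hypothetical lookup table (colcode -> color); the slide does not give its contents.
TABLE = {1: "black", 2: "brown"}

# One rule per shared term: how to derive the store-side value from a factory record.
RULES = {
    "size": lambda rec: rec["size"],             # size = size
    "color": lambda rec: TABLE[rec["colcode"]],  # color = table(colcode)
    "style": lambda rec: rec["style"],           # style = style
}


def to_store_terms(factory_record):
    """Express a factory record using only the articulated (shared) terms."""
    return {term: rule(factory_record) for term, rule in RULES.items()}


# Hypothetical factory record; "machine" has no rule and is not carried over.
shoe = {"size": 9, "colcode": 2, "style": "loafer", "machine": "M7"}
print(to_store_terms(shoe))  # {'size': 9, 'color': 'brown', 'style': 'loafer'}
```

Terms without a rule stay in their source domain, which matches the idea that the intersection holds only the entries needed by the application.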
Within a Domain Terms have clear Meanings
• a domain will contain many objects
• the object configuration is consistent
• within a domain all terms are consistent &
• relationships among objects are consistent
• context is implicit
No committee is needed to forge compromises * within a domain
[Figure: Domain Ontology]
* Compromises hide valuable details
SKC grounded definition
• Ontology: a set of terms and their relationships
• Term: a reference to real-world and abstract objects
• Relationship: a named and typed set of links between objects
• Reference: a label that names objects
• Abstract object: a concept which refers to other objects
• Real-world object: an entity instance with a physical manifestation (or its representation in a factual database)
Grounding enables implementation
• We use many abstract terms in our work
– Needed because we are dealing with many objects
– Human thinking is limited to short-term memory
• Someone must be able to translate them into code reliably
– Each abstract term must have a path to reality
– One must provide that path for students and coders
• Without a clear path that is not possible
– Not automatically at all: machines need specs
– Not reliably by human programmers: failures occur
• Without implementation there is no benefit
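The grounded definition and the requirement that each abstract term have a path to reality can be sketched as a small data structure. The class names, field names, and sample terms below are assumptions made for illustration, not the SKC implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Term:
    label: str              # the reference: a label that names objects
    abstract: bool = False  # abstract object (True) or real-world object (False)


@dataclass(frozen=True)
class Relationship:
    name: str         # relationships are named ...
    kind: str         # ... and typed
    links: frozenset  # set of (source label, target label) links


@dataclass
class Ontology:
    terms: set = field(default_factory=set)
    relationships: list = field(default_factory=list)

    def grounded(self, label):
        """True if the term is real-world or reaches one by following links."""
        by_label = {t.label: t for t in self.terms}
        seen, stack = set(), [label]
        while stack:
            cur = stack.pop()
            if cur in seen or cur not in by_label:
                continue
            seen.add(cur)
            if not by_label[cur].abstract:
                return True
            for rel in self.relationships:
                stack.extend(dst for src, dst in rel.links if src == cur)
        return False


# Hypothetical sample: "footwear" refers to a concrete shoe; "comfort" refers to nothing.
ont = Ontology(
    terms={Term("footwear", abstract=True), Term("shoe-123"), Term("comfort", abstract=True)},
    relationships=[Relationship("has_instance", "instance", frozenset({("footwear", "shoe-123")}))],
)
print(ont.grounded("footwear"), ont.grounded("comfort"))  # True False
```

A term that fails this check is one a coder cannot translate reliably, because nothing connects it to data.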
INTERSECTION support
[Figure: the Articulation ontology shown together with the Store Ontology and the Factory Ontology]
• Articulation ontology: terms useful for purchasing
• Matching rules that use terms from the 2 source domains
Other Basic Operations
• DIFFERENCE: material fully under local control
• UNION: merging entire ontologies
• Articulation ontology: typically prior intersections
Features of an algebra
• The record of past operations can be kept and reused
(experience: 3 months → 1 week for Webster's annual update, 2 weeks for OED (6 x size) [Jannink:01])
• Maintenance is enabled by using
1. remote, deep domain expertise
2. rapid recomposition for application domain
• Expect also that
– Operations can be composed
– Operations can be rearranged
– Alternate arrangements can be evaluated
– Optimization is enabled
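Keeping and reusing the record of past operations can be sketched as a log of operations that is replayed whenever a source changes. This is an illustration only: ontologies are reduced to plain sets of terms, the operations to ordinary set operations, and the log format and names are assumptions.

```python
# Simplification: ontologies are plain sets and the operations are set operations.
OPS = {
    "intersection": lambda a, b: a & b,
    "union": lambda a, b: a | b,
    "difference": lambda a, b: a - b,
}


def replay(log, sources):
    """Re-run a recorded sequence of operations against the current sources."""
    env = dict(sources)
    for result, op, left, right in log:
        env[result] = OPS[op](env[left], env[right])
    return env


# Recorded once; each entry is (result name, operation, left operand, right operand).
log = [
    ("shared", "intersection", "store", "factory"),
    ("store_only", "difference", "store", "shared"),
]

v1 = replay(log, {"store": {"shoes", "size", "customers"},
                  "factory": {"shoes", "size", "machinery"}})

# Both sources evolve; the same log is reused instead of being re-derived by hand.
v2 = replay(log, {"store": {"shoes", "size", "style", "customers"},
                  "factory": {"shoes", "size", "style", "machinery"}})

print(sorted(v1["shared"]), sorted(v2["shared"]))
```

Because later entries refer to earlier results by name, the log also shows how operations compose.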
Sample Processing in the DARPA HPKB challenge
Query: What is the most recent year an OPEC member nation was on the UN security council (SC)?
• SKC resolves 3 Sources
– CIA Factbook ‘96 (nation)
– OPEC (members, dates)
– UN (SC members, years)
• SKC obtains the Correct Answer
– 1996 (Indonesia)
• Other groups obtained more, but factually wrong answers; they relied on one global source, the CIA Factbook.
• Problems resolved by SKC
* Factbook, a secondary source, has out-of-date OPEC & UN SC lists
– Indonesia not listed
– Gabon (left OPEC 1994)
* different country names
– Gambia => The Gambia
* historical country names
– Yugoslavia
* UN lists future security council members
– Gabon 1999
– needed ancillary data
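The query can be sketched over toy data restricted to the facts stated on the slide. The data layout, the `as_of` cutoff parameter, and the function names are assumptions for illustration; the real sources are far larger and this is not the SKC query processor.

```python
# Toy data limited to facts stated on the slide.
OPEC = {"Indonesia": None, "Gabon": 1994}       # nation -> year it left OPEC (None = listed as member)
UN_SC = [("Indonesia", 1996), ("Gabon", 1999)]  # (nation, year on the Security Council)
ALIASES = {"Gambia": "The Gambia"}              # differing country names across sources


def canonical(name):
    """Map a source-specific country name to one agreed spelling."""
    return ALIASES.get(name, name)


def most_recent_year(opec, un_sc, as_of):
    """Latest year <= as_of in which a then-current OPEC member sat on the UN SC."""
    hits = []
    for nation, year in un_sc:
        nation = canonical(nation)
        if year > as_of or nation not in opec:
            continue  # skip future listings and non-members
        left = opec[nation]
        if left is None or year <= left:
            hits.append((year, nation))
    return max(hits) if hits else None


print(most_recent_year(OPEC, UN_SC, as_of=1996))  # (1996, 'Indonesia')
```

Gabon's 1999 seat is rejected twice over: it lies after the cutoff, and Gabon had already left OPEC, which is the kind of ancillary data a single global source does not supply.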
Interoperation via Articulation
Process phases:
• At application definition time
– Match relevant ontologies where needed
– Establish articulation rules among them
– Record the process
• At execution time
– Perform query rewriting to get to sources
– Optimize based on the ontology algebra
• For maintenance
– Regenerate rules using the stored formulation
Generation of the articulation rules
Provide library of automatic match heuristics
• Lexical Methods: spelling similarity (commonly used by others)
• Structural Methods: relative graph position
• Reasoning-based Methods
• Nexus: a graph we derive from the OED / Webster's
– links terms based on definitions, not lexical similarity
• Hybrid Methods
– Iteratively, with an expert in control
GUI tool to
– display matches and
– verify generated matches using the human expert
– expert can also supply matching rules
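The lexical heuristic and the expert review step can be sketched with Python's standard `difflib`. The 0.8 threshold, the function names, and the sample term lists are assumptions for illustration; this stands in for the lexical method only and is not the SKC matcher.

```python
from difflib import SequenceMatcher


def lexical_score(a, b):
    """Spelling similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def propose_matches(ont1, ont2, threshold=0.8):
    """Candidate rules: every cross-ontology pair scoring at or above the threshold."""
    return sorted(
        (a, b) for a in ont1 for b in ont2
        if lexical_score(a, b) >= threshold
    )


def review(candidates, approve, extra=()):
    """Keep the candidates the expert approves, plus rules the expert supplies."""
    return [c for c in candidates if approve(c)] + list(extra)


store = ["Shoes", "Customers", "Employees", "color"]
factory = ["shoes", "employees", "colcode", "Machinery"]

candidates = propose_matches(store, factory)
print(candidates)  # [('Employees', 'employees'), ('Shoes', 'shoes')]

# The expert rejects the lexically perfect but semantically wrong Employees match
# and supplies the color/colcode rule that spelling similarity did not find.
rules = review(
    candidates,
    approve=lambda c: c != ("Employees", "employees"),
    extra=[("color", "colcode")],
)
print(rules)  # [('Shoes', 'shoes'), ('color', 'colcode')]
```

The example shows why the expert stays in control: spelling similarity both over-matches (store and factory employees are different people) and under-matches (color and colcode).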
Articulation Generator
Being built by Prasenjit Mitra
[Figure: components labeled Thesaurus, OntA, Phrase Relator, Context-based Word Relator, Driver, Structural Matcher, Ont1, Ont2, Semantic Network (Nexus), and Human Expert]
Principle of Knowledge Composition
[Figure: knowledge resources A, B, C, D, and E; articulation knowledge for (A ∩ B), (B ∩ C), (C ∩ E), and (C ∩ D); composed knowledge for applications using A, B, C, E, built from the articulation knowledge (A ∩ B) ∪ (B ∩ C) ∪ (C ∩ E)]
Legend: ∪ : union, ∩ : intersection
Exploiting the result (future plans)
• Avoid n² problem of interpreter mapping [Swartout HPKB year 1]
• Result has links to source
• Processing & query evaluation is best performed within Source Domains & by their engines
Support Domain Specialization
• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort *)
to be performed
• Domain specialists (SMEs)
• Professional organizations
• Field teams
of modest size
autonomously
maintainable
Empowerment
* based on experience with software
Domain-specific Expertise
Knowledge needed is huge
• Partition into natural domains
• Determine domain responsibility and authority
• Empower domain owners
• Exploit domain-specific expertise
• Provide computer-science tools
Consider interaction
[Figure labels: Our Ontology; Society of specialists]
SKC Project Synopsis
• Research Objective:
– Precise information for applications from
heterogeneous, imperfect, scalably many data sources
• Sources for Ontologies used currently:
– General: CIA World Factbook ‘96, www.UN, www.OPEC
Webster’s Dictionary, Thesaurus, Oxford English Dictionary
– Topical: NATO, BattleSpace Sensors, Logistics Servers
• Theory:
– Domain autonomy and exploitation
– Rule-based algebra over ontologies
– Translation & Composition primitives
• Sponsors and collaboration
– AFOSR; DARPA DAML program; W3C; Stanford KSL and SMI;
Univ. of Karlsruhe, Germany; others.