Bill Gates is not a Parking Meter:
Philosophical Quality Control in Automated
Ontology-building
Catherine Legg & Sam Sarjant
University of Waikato
2 July, 2012
Agenda
1. Philosophical Categories: What are they? What are they good for?
2. Automated Ontology-Building
   • Concept Mapping
   • Adding New Assertions
3. Philosophical Quality Control: Semantic vs. Ontological Relatedness
What is wrong with this statement?
S1: The number 8 is a very red number.
This seems somehow worse than false. For a false statement can be negated to produce a truth (Fregean quibbles notwithstanding), but:
S2: The number 8 is not a very red number.
doesn't seem right either (the odd synaesthesia episode notwithstanding).
The problem seems to be that numbers are not the kind of thing that can have colours. If someone thinks so, then they don't understand what kinds of things numbers are.
The traditional philosophical terminology for what is wrong is that S1 commits a category mistake.
It is well known that the philosophical discipline of ontology was invented by Aristotle to deal with these kinds of problems (among others):
όντος (being) + λόγος (theory of)
He called it 'first philosophy' (i.e. the fundamental science). A key part of its role was to define categories.
Categories are intended to describe the most fundamentally different kinds of things that exist in reality. Suggested examples include:
• Physical objects
• Times
• Events
• Numbers
• Relationships
Traditionally ontologies were built into a
taxonomic structure, or ‘tree of knowledge’:
‘Tree of Porphyry’ (3rd century A.D.)
Category vs. Property Distinctions:
• There is a subtle but important distinction between
philosophical categories and mere properties.
• Although both divide entities into groups, and may be
represented by classes, categories provide a deeper, more
sortal division which enforces constraints.
• E.g. while we know the same thing cannot be a colour and
a number, the same cannot be said for green and square.
• However, at what ‘level’ of an ontology constraining
categorical divisions give way to non-constraining
property divisions is frequently unclear and contested.
• This has led to skepticism about philosophical categories.
• Plus, the disdain for “speculative metaphysics” shown by C20th logical positivism (and various successors) didn't help.
Agenda
1. Philosophical Categories: What are they? What are they good for?
2. Automated Ontology-Building
   • Concept Mapping
   • Adding New Assertions
3. Philosophical Quality Control: Semantic vs. Ontological Relatedness
Why Automated Ontology-Building?
• A wealth of new, free, user-supplied Web content ('Web 2.0') for raw data (blogs, tags, Wikipedia…)
• New automated methods of determining semantic relatedness (e.g. Milne & Witten, "An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links", 2008)
• Manual methods widely agreed to be too slow and labour-intensive (e.g. the Cyc project: 25 years, still unfinished)
What Has Been Done So Far?
• YAGO (WordNet backbone, enriched with Wikipedia's leaf categories. Taxonomic)
• DBpedia (Wikipedia infoboxes and other semi-structured data. Not taxonomic)
• EMLR (Wikipedia category network, enriched with further relations from, e.g., parsing category names such as BornIn1954. Taxonomic)
Our Choice
Use as backbone the Cyc taxonomy, because:
• Highly principled, as the result of so many years' labor
• Purpose-built inference engine to reason over the knowledge
• Now publicly available 'open source' (ResearchCyc)
• One researcher (Legg) had inside knowledge of the system, which is rare worldwide
Build onto it knowledge mined from Wikipedia, because:
• Access to a Wikipedia-based automated measure of semantic relatedness developed at the University of Waikato
• Wikipedia is astounding!
   • 2.4M (English) articles, referred to by 3M different terms
   • ~25 hyperlinks per article
   • 175K templates for semi-structured data entry (incl. 9K 'infoboxes')
   • full editing history for every article… etc., etc.
Cyc Ontology & Knowledge Base
OpenCyc contains:
~13,500 Predicates
~200,000 Concepts
~3,000,000 Assertions
Represented in:
• First Order Logic
• Higher Order Logic
• Context Logic (Micro-theories)
[Figure: the Cyc upper ontology, from Thing at the top, through intermediate collections such as Intangible Thing, Temporal Thing, Spatial Thing and Partially Tangible Thing, down through domain-level collections (e.g. Living Things, Artifacts, Events, Human Activities, Organizations, Social Relations & Culture), to Domain-Specific Knowledge (e.g. Bio-Warfare, Terrorism, Computer Security, Military Tactics, Command & Control, Health Care, …) and Domain-Specific Facts and Data, resting on Cyc Common-Sense Knowledge.]
Cyc Common-Sense Knowledge
1. Semantic Argument Constraints on Relations
Cyc contains many assertions of the form:
(arg1Isa birthDate Animal)
(arg2Isa capital City)
These represent that only animals have birthdays, and that
the capital of a country must be a city.
These features of Cyc are a form of categorical knowledge.
Although some of the categories invoked might seem
relatively specific and trivial compared to Aristotle’s,
logically the constraining process is the same.
Cyc enforces these constraints at knowledge entry, a notable difference from every other formal ontology of its size.
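To make the constraining process concrete, here is a minimal sketch in Python (a toy taxonomy and hypothetical names, not Cyc's actual machinery): an assertion is rejected when an argument's type is not a specialisation of the collection that the arg-isa constraint requires.

# Minimal sketch, not Cyc's implementation: reject an assertion whose
# argument violates an arg-isa constraint, given a toy taxonomy.

# Hypothetical taxonomy: each collection maps to its direct generalisations.
GENLS = {
    "City": ["GeographicalRegion"],
    "GeographicalRegion": ["Thing"],
    "Number": ["MathematicalObject"],
    "MathematicalObject": ["Thing"],
}

# Hypothetical arg-isa constraints: predicate -> required collection per argument slot.
ARG_ISA = {
    "capital": {2: "City"},       # (arg2Isa capital City)
    "birthDate": {1: "Animal"},   # (arg1Isa birthDate Animal)
}

def is_spec_of(collection, target):
    """True if `collection` is `target` or a specialisation of it."""
    if collection == target:
        return True
    return any(is_spec_of(parent, target) for parent in GENLS.get(collection, []))

def check_assertion(predicate, arg_types):
    """Reject the assertion if any argument's type violates the constraint."""
    for position, required in ARG_ISA.get(predicate, {}).items():
        if not is_spec_of(arg_types[position - 1], required):
            return False
    return True

# A capital typed as a City passes...
print(check_assertion("capital", ["Country", "City"]))    # True
# ...but a capital typed as a Number is rejected.
print(check_assertion("capital", ["Country", "Number"]))  # False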
Cyc Common-Sense Knowledge
2. Disjointness Assertions Between Collections
ResearchCyc currently contains 6,000 explicitly asserted disjointWith claims, e.g.:
(disjointWith Doorway WindowPortal)
(disjointWith HomogeneousStructure Ecosystem)
(disjointWith YardWork ShootingAProjectileWeapon) (!)
From these, countless further claims can be deduced. Again, Cyc enforces these constraints at knowledge entry.
WordNet knows about sibling relationships, so in some sense it knows that a cat cannot be a dog. But it cannot ramify this knowledge through its hierarchy in this way.
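A minimal sketch in Python of how a single explicit disjointWith assertion ramifies down a toy taxonomy (hypothetical names, not Cyc's implementation):

# Minimal sketch: two collections are disjoint if any of their
# generalisations are explicitly asserted disjoint.

GENLS = {
    "Doorway": ["Portal"],
    "WindowPortal": ["Portal"],
    "FrenchDoor": ["Doorway"],
    "Skylight": ["WindowPortal"],
}

EXPLICIT_DISJOINT = {("Doorway", "WindowPortal")}

def ancestors(c):
    """c together with all of its generalisations."""
    result = {c}
    for parent in GENLS.get(c, []):
        result |= ancestors(parent)
    return result

def disjoint(a, b):
    """True if some generalisation of a is explicitly disjoint with some generalisation of b."""
    for x in ancestors(a):
        for y in ancestors(b):
            if (x, y) in EXPLICIT_DISJOINT or (y, x) in EXPLICIT_DISJOINT:
                return True
    return False

# Never explicitly asserted, but follows from the one disjointWith claim:
print(disjoint("FrenchDoor", "Skylight"))  # True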
[Figure: Cyc reasoning over its common-sense knowledge. N.B. the key assertion shown supports a conclusion that was never explicitly asserted into Cyc (!)]
Wikipedia as an ontology
• articles → basic concepts
• infoboxes → facts about those concepts
• first sentences → concept definitions, often in standard format
• hyperlinks between articles → 'semantic relatedness' between concepts
• categories → organise articles into conceptual groupings
Though these groupings are far from a principled taxonomy enabling knowledge inheritance.
For example, consider the following category….
Agenda
1. Philosophical Categories: What are they? What are they good for?
2. Automated Ontology-Building
   • Concept Mapping
   • Adding New Assertions
3. Philosophical Quality Control: Semantic vs. Ontological Relatedness
Stages A & B: easy 1-1 matches using title strings (or synonyms)
Exact mappings via Cyc synonyms, e.g. #$CityOfSaarbrucken → Saarbrücken:
• Get synonyms asserted in Cyc, e.g. using #$nameString ("Saarbrucken")
• Use Wikipedia redirects to match "Saarbrucken" to the article Saarbrücken
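A minimal sketch in Python of this exact-matching stage (hypothetical data structures standing in for the Cyc knowledge base and the Wikipedia title and redirect tables; not the authors' code):

# Minimal sketch: exact 1-1 matching of a Cyc term to a Wikipedia article
# via its own name, Cyc-asserted synonyms, and Wikipedia redirects.

# Hypothetical inputs.
CYC_SYNONYMS = {
    "CityOfSaarbrucken": ["Saarbrucken"],        # from #$nameString assertions
}
WIKI_TITLES = {"Saarbrücken"}                    # article titles
WIKI_REDIRECTS = {"Saarbrucken": "Saarbrücken"}  # redirect -> target article

def exact_match(cyc_term):
    """Return the Wikipedia article exactly matching a Cyc term, if any."""
    # Candidate strings: the term name itself plus any Cyc-asserted synonyms.
    candidates = [cyc_term] + CYC_SYNONYMS.get(cyc_term, [])
    for name in candidates:
        if name in WIKI_TITLES:
            return name
        if name in WIKI_REDIRECTS:          # follow a Wikipedia redirect
            return WIKI_REDIRECTS[name]
    return None

print(exact_match("CityOfSaarbrucken"))  # Saarbrücken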
Stage C: Many Wikipedia articles → 1 Cyc term.
On the Wikipedia side, semantic disambiguation is required:
• Collect all candidates (e.g. Kiwi: Bird? Fruit? Nationality?)
• Compute commonness of each candidate (how often the string is anchor text to their Wikipedia pages)
• Collect context from Cyc (concepts nearby in the taxonomy)
• Compute similarity to context (using Wikipedia hyperlinks: Milne and Witten, 2008)
• Determine the best candidate
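A minimal sketch in Python of this candidate scoring (the weights, scores and relatedness function are hypothetical placeholders for the Milne & Witten measure, not the authors' implementation):

# Minimal sketch: score each candidate Wikipedia article by its commonness
# and its relatedness to the Cyc context, then pick the best.

# Hypothetical candidate data for the string "Kiwi":
# commonness = fraction of links with "Kiwi" as anchor text pointing at the article.
CANDIDATES = {
    "Kiwi (bird)": 0.55,
    "Kiwifruit": 0.30,
    "New Zealanders": 0.15,
}

def relatedness(article, context_articles):
    """Placeholder for the Milne & Witten (2008) link-based relatedness measure."""
    # Hypothetical pre-computed scores against a bird-related Cyc context.
    scores = {"Kiwi (bird)": 0.9, "Kiwifruit": 0.4, "New Zealanders": 0.2}
    return scores[article]

def best_candidate(candidates, context_articles, weight=0.5):
    """Pick the candidate with the best blend of commonness and context similarity."""
    def score(article):
        return (weight * candidates[article]
                + (1 - weight) * relatedness(article, context_articles))
    return max(candidates, key=score)

context = ["Bird", "FlightlessBird", "NewZealand"]   # concepts near the Cyc term
print(best_candidate(CANDIDATES, context))           # Kiwi (bird)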
Stage D: Many Cyc terms → 1 Wikipedia article.
Reverse disambiguation required.
In this stage many candidate mappings were eliminated by mapping
back from the Wikipedia article to the Cyc term, discarding
mappings which don’t ‘map back’.
For example, Cyc term #$DirectorOfOrganisation incorrectly maps
to Film director, but when we attempt to find a Cyc term from Film
director we get #$Director-Film.
This reduced the number of mappings by 43%, but increased
precision considerably.
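A minimal sketch in Python of the 'map back' filter (hypothetical forward and reverse mapping tables, not the authors' code):

# Minimal sketch: keep a Cyc -> Wikipedia mapping only if mapping the article
# back to Cyc returns the same term.

# Hypothetical forward and reverse mappings produced by the earlier stages.
CYC_TO_WIKI = {
    "DirectorOfOrganisation": "Film director",   # spurious forward mapping
    "Director-Film": "Film director",
}
WIKI_TO_CYC = {
    "Film director": "Director-Film",
}

def maps_back(cyc_term):
    """True if the Wikipedia article chosen for cyc_term maps back to cyc_term."""
    article = CYC_TO_WIKI.get(cyc_term)
    return article is not None and WIKI_TO_CYC.get(article) == cyc_term

print(maps_back("DirectorOfOrganisation"))  # False -> mapping discarded
print(maps_back("Director-Film"))           # True  -> mapping kept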
Agenda
1. Philosophical Categories: What are they? What are they good for?
2. Automated Ontology-Building
   • Concept Mapping
   • Adding New Assertions
3. Philosophical Quality Control: Semantic vs. Ontological Relatedness
Finding new concepts in Wikipedia and
adding them to Cyc
First we found mapped concepts where the Wikipedia
article had an equivalent category (about 20% of mapped
concepts). E.g. the article Israeli Settlement has an
equivalent category Israeli Settlements:
We then mined this category for new concepts
belonging under the mapped Cyc concept,
according to the Cyc taxonomy. For instance:
We called these ‘true children’.
We identified true children by:
1. Parsing the first sentences of Wikipedia articles (with a list of regular expressions):
• Havat Gilad (Hebrew: חַ וַת ִּגלְעָ ד, lit. Gilad Farm) is an Israeli settlement outpost in the West Bank.
• Netiv HaGdud (Hebrew: נְתִּ יב הַ גְדוד, lit. Path of the Battalion) is a moshav and Israeli settlement in the West Bank.
• Kfar Eldad (Hebrew: כפר אלדד) is an Israeli settlement and a Communal settlement in the Gush Etzion Regional Council, south of Jerusalem.
• The Yad La'achim operation (Hebrew: מבצע יד לאחים, "Giving hand to brothers") was an operation that the IDF performed during the disengagement plan. [FAILS FIRST SENTENCE TEST]
2. "Infobox pairing": if an article in the category shares an infobox template with 90% of the true children, we include it too. (Both heuristics are sketched below.)
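A minimal sketch in Python of the two heuristics (hypothetical data; one regular expression stands in for the authors' list of regular expressions):

import re

# Heuristic 1: first-sentence test - the definition must say the subject
# "is a/an ... Israeli settlement".
TARGET = "Israeli settlement"
PATTERN = re.compile(r"\bis an? (?:[\w'\- ]+? )??" + re.escape(TARGET.lower()) + r"\b")

def passes_first_sentence_test(first_sentence):
    return bool(PATTERN.search(first_sentence.lower()))

# Heuristic 2: infobox pairing - include an article if it uses an infobox
# template shared with at least 90% of the true children found so far.
def passes_infobox_pairing(article_infoboxes, true_child_infoboxes, threshold=0.9):
    for template in article_infoboxes:
        sharing = sum(template in boxes for boxes in true_child_infoboxes)
        if true_child_infoboxes and sharing / len(true_child_infoboxes) >= threshold:
            return True
    return False

print(passes_first_sentence_test(
    "Havat Gilad is an Israeli settlement outpost in the West Bank."))       # True
print(passes_first_sentence_test(
    "The Yad La'achim operation was an operation that the IDF performed."))  # False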
Final Results

Method                                     Cyc terms        Percent mapped
Total terms available                      163,000
Common sense terms                         83,900
Exact (1-1) mappings                       33,500           40%
Further mappings after
disambiguation (2 ways)                    8,800            10%
Total mapped                               42,300           50%

Method                                     Cyc assertions   Percent growth
New Cyc Concepts                           35,000           30%
New "doc. strings"                         17,000
Other new assertions                       228,000          10%
Evaluation (22 human volunteers, online form, compared with the DBpedia Ontology)

CASE:               1     2     3     4     5     6
DBpedia children    0.58  0.81  0.99  0.98  0.99  0.99
Our new children    0.57  0.88  0.99  0.90  0.90  1.00
2008 mappings       0.65  0.83  0.99  0.99  0.99  1.00
2009 mappings       0.68  0.91  1.00  1.00  1.00  1.00

CASES:
1: 100% of evaluators thought the assignment correct
2: >50% thought the assignment correct
3: At least 1 thought the assignment correct
4: 100% thought the assignment correct or close
5: >50% thought the assignment correct or close
6: At least 1 thought the assignment correct or close
Agenda
1. Philosophical Categories: What are they? What are they good for?
2. Automated Ontology-Building
   • Concept Mapping
   • Adding New Assertions
3. Philosophical Quality Control: Semantic vs. Ontological Relatedness
Quality control is provided via Cyc's common-sense knowledge: Cyc now knows enough to 'regurgitate' (reject) many assertions fed to it which are ontologically incorrect.

Examples of rejected assertions:

(#$isa #$CallumRoberts #$Research)
Why it happened: "Professor Callum Roberts is a marine conservation biologist, oceanographer, author and research scholar in the Environment Department of the University of York in England."
Here Cyc knows that the collection of biological living objects is disjoint with the collection of information objects.

(#$isa #$Insight-EMailClient #$EMailMessage)
Why it happened: "Insight WebClient is a groupware E-Mail client from Bynari embedded in the Arachne Web Browser for DOS."
Here Cyc knows that the collection of software programs is disjoint with the collection of letters.
Ontological vs Semantic Relatedness
• These examples usefully highlight a clear difference between quantitative measures of semantic relatedness and an ontological relatedness derivable from a principled category structure.
• Callum Roberts is a researcher, which is highly semantically related to research, and Insight is an email client, which is highly semantically related to email messages.
• Thematically these pairs are incredibly close, but ontologically they are very different kinds of thing.
• In this way, Cyc rejected 4,300 assertions, roughly 3% of the total presented to it.
• Manual analysis showed that of these, 96% were true negatives.
Future building work
• Now that we have a distinction between semantic and ontological relatedness, combining the two has powerful possibilities.
• In general in automated information science, overlapping independent heuristics are a boon to accuracy.
• We plan to automatically augment Cyc's disjointness network and its semantic argument constraints on relations:
   • systematically organized infobox relations are a natural ground from which to generalize argument constraints (see the sketch after this list);
   • the Wikipedia category network will be mined, with caution, for further disjointness knowledge.
• We also plan to evaluate much more fully both this automated ontology-building effort and other current leaders in the field.
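As an illustration of how the infobox-based generalisation might work (a proposal only; hypothetical toy data, not an implemented method), one could propose, for an infobox relation, the most specific Cyc collection that subsumes every argument value observed in a given slot. A minimal sketch in Python:

# Minimal sketch: propose an argument constraint for an infobox relation
# as the most specific common ancestor of the observed argument types.

# Hypothetical toy taxonomy: collection -> direct generalisations.
GENLS = {
    "City": ["GeographicalRegion"],
    "Town": ["GeographicalRegion"],
    "GeographicalRegion": ["SpatialThing"],
    "SpatialThing": ["Thing"],
}

def ancestors(c):
    """c together with all of its generalisations."""
    result = {c}
    for parent in GENLS.get(c, []):
        result |= ancestors(parent)
    return result

def proposed_constraint(observed_types):
    """Most specific collection subsuming every observed argument type."""
    common = set.intersection(*(ancestors(t) for t in observed_types))
    def depth(c):
        # Distance from the taxonomy root; deeper means more specific.
        return max((depth(p) for p in GENLS.get(c, [])), default=0) + 1
    return max(common, key=depth)

# Values seen in the second slot of a hypothetical 'headquarters' infobox relation:
print(proposed_constraint(["City", "Town"]))  # GeographicalRegion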
Philosophical lessons
• Our project suggests that the notion of philosophical categories leads to measurable improvements in real-world ontology-building.
• Just how extensive a system of categories should be will
require real-world testing. But now we have the tools and
free user-supplied data to do this.
• Where exactly the line should be drawn between
categories proper and mere properties remains open.
• However, modern statistical tools raise the possibility of a
quantitative treatment of ontological relatedness that is
more nuanced than Aristotle’s ten neat piles of
predicates, yet can still recognize that it is highly
problematic to say that the number 8 is red, and why.
clegg@waikato.ac.nz
sjs31@cs.waikato.ac.nz