Cyc-DougLenat_20051117

CYC: Lessons Learned in Large-Scale Ontological Engineering
Dr. Douglas B. Lenat
, 3721 Executive Center Drive, Suite 100, Austin, TX 78731
Email: Lenat@cyc.com
Phone: (512) 342-4001
Fax: (512) 342-4040
November 17, 2005
2 July 2005
What Led to Cyc?
1. Programs need general world knowledge, and
commonsense, to break the “brittleness bottleneck”
NL understanding, speech understanding,
robotics, learning, expert systems, search,…
2. We know enough to do this; it is more an
engineering task than a scientific research task.
3. The time was right (1984).
ELIZA (DOCTOR)
1965 Joe Weizenbaum, MIT
• Carl Rogers-like reflection
Patient: “I swear a lot.”
“How do you feel about the fact that you swear a lot?”
• Count on patient-to-doctor respect/awe/…
Patient: “My dog’s mother died recently.”
“blah blah blah mother blah blah blah”
“Tell me more about your mother.”
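The two tricks on this slide can be sketched in a few lines of Python. This is only an illustration of shallow keyword matching, not Weizenbaum's actual script; the rule set, the `REFLECTIONS` table, and the function names are assumptions made for the example.

```python
import re

# Pronoun swaps used when echoing the patient's words back
# (a tiny illustrative table, not the original DOCTOR script).
REFLECTIONS = {"i": "you", "my": "your", "me": "you", "am": "are"}


def reflect(text: str) -> str:
    """Swap first-person words for second-person ones, word by word."""
    words = re.findall(r"[\w']+", text.lower())
    return " ".join(REFLECTIONS.get(word, word) for word in words)


def respond(utterance: str) -> str:
    """Pick a canned reply by shallow keyword matching; no understanding involved."""
    lowered = utterance.lower()
    # Keyword rule: any mention of "mother" triggers the same prompt,
    # no matter whose mother it is (here, the dog's).
    if re.search(r"\bmother\b", lowered):
        return "Tell me more about your mother."
    # Reflection rule: "I <statement>" is echoed back as a question.
    match = re.match(r"\s*i\s+(.*?)[.!]?\s*$", lowered)
    if match:
        return f"How do you feel about the fact that you {reflect(match.group(1))}?"
    return "Please go on."
```

The "mother" rule fires on the dog example exactly as on the slide, which is the point: the program never notices that the mother in question is a dog's.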
40 years after ELIZA: ALICE
• JUDGE11[9:26:0]I WILL ASK YOU SOME SIMPLE
QUESTIONS AND GIVE SIMPLE COMMANDS
• PROGRAM[9:26:01]Do you think your plan will succeed?
• JUDGE11[9:26:16]WHAT IS THE COLOR OF A BLUE CAR
• PROGRAM[9:26:16]WHAT COLOR IS a blue car
• JUDGE11[9:26:35]WHERE IS SUE’S NOSE WHEN SUE IS IN
HER HOUSE
• PROGRAM[9:26:35]Where it belongs.
Try searching the World Wide Web.
Is Natural Language a Good Representation?
• + there’s a lot of it; everyone understands it
• + index and search it, rapidly, using keywords
• Boolean combinations of keywords
• Synonyms, hyponyms, hypernyms,… of keywords
• - there are a lot of different languages
• - meanings vary (era, place, age group…)
• - often the analyst’s query requires finding
and combining n pieces of data
• - can be inefficient
Is Edward an ancestor or descendant of Sue?
Carol and Sam begat Sara and Fred. Fred and Jane begat
Ethan, Elaine, and Edward. John and Sara begat Steven,
Mary, and Seth. Ann and Andy begat Sue and Bob. But
then Sara cleaved not to John and with Bob begat Joan.
[Family tree: parents -- children]
Carol -- Sam: Sara, Fred
Fred -- Jane: Ethan, Elaine, Edward
John -- Sara: Steven, Mary, Seth
Ann -- Andy: Sue, Bob
Bob -- Sara: Joan
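Once the paragraph is turned into explicit parent links, the question becomes a simple graph search. The sketch below is one way to do that in Python; the `PARENTS` table is transcribed from the "begat" sentences, and the function names are made up for the example.

```python
# Child -> set of parents, transcribed from the "begat" paragraph.
PARENTS = {
    "Sara": {"Carol", "Sam"},
    "Fred": {"Carol", "Sam"},
    "Ethan": {"Fred", "Jane"},
    "Elaine": {"Fred", "Jane"},
    "Edward": {"Fred", "Jane"},
    "Steven": {"John", "Sara"},
    "Mary": {"John", "Sara"},
    "Seth": {"John", "Sara"},
    "Sue": {"Ann", "Andy"},
    "Bob": {"Ann", "Andy"},
    "Joan": {"Sara", "Bob"},
}


def ancestors(person: str) -> set:
    """Collect every ancestor of `person` by walking parent links upward."""
    found = set()
    frontier = list(PARENTS.get(person, ()))
    while frontier:
        candidate = frontier.pop()
        if candidate not in found:
            found.add(candidate)
            frontier.extend(PARENTS.get(candidate, ()))
    return found


def relation(a: str, b: str) -> str:
    """Say whether `a` is an ancestor of `b`, a descendant of `b`, or neither."""
    if a in ancestors(b):
        return "ancestor"
    if b in ancestors(a):
        return "descendant"
    return "neither"
```

With the relations made explicit, `relation("Edward", "Sue")` is answered by lookup rather than by re-reading the prose: the two are connected only through Joan's parents, so neither is an ancestor of the other.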
Five friends get together to play 5 doubles
matches, with a different group of 4 players
each time. The sums of the ages of the
players for the different matches are 124,
128, 130, 136 and 142 years.
What is the age of the youngest player ?
v+w+x+y = 124
v+w+x+z = 128
v+w+y+z = 130
v+x+y+z = 136
w+x+y+z = 142
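The equations above can be solved mechanically. Each player sits out exactly one match, so every age appears in four of the five sums; adding all five sums therefore gives four times the total age. A minimal Python sketch (the function name is an assumption for the example):

```python
def solve_ages(match_sums):
    """Recover the individual ages from the sums of each group of n-1 players.

    Every player is missing from exactly one group, so the sum of all
    group sums equals (n - 1) times the total of all ages.
    """
    n = len(match_sums)
    total, remainder = divmod(sum(match_sums), n - 1)
    if remainder:
        raise ValueError("sums are inconsistent with whole-number ages")
    # The player missing from a group has age (total - that group's sum).
    return sorted(total - s for s in match_sums)


ages = solve_ages([124, 128, 130, 136, 142])
youngest = ages[0]
```

Here the five sums add to 660, so the total age is 165, and the youngest player is the one left out of the 142-year match.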
Natural Language Understanding
requires having lots of knowledge
1. The pen is in the box.
The box is in the pen.
2. The police watched the demonstrators…
…because they feared violence.
…because they advocated violence.
3. Every American has a mother.
Every American has a president.
Natural Language Understanding
requires having lots of knowledge
4. Mary and Sue are sisters.
Mary and Sue are mothers.
5. The White House announced today that...
6. John saw his brother skiing on TV. The fool…
...didn’t have a coat on!
…didn’t recognize him!
Logically and Arithmetically
Combining n Pieces of Info.
An example: an analyst’s query posed as
part of HPKB (1996) that Cyc answered.
Information from multiple sources
Knowledge about the domain in general
Commonsense knowledge about the real world
Logically and Arithmetically
Combining n Pieces of Info.
Information from multiple sources
Knowledge about the domain in general
Commonsense knowledge about the real world
The original dream of Arpanet, EDI, EDR, the Semantic Web,…
Ontology holds the key to doing this!
BUT there are so many ways to “cut
corners” and unwittingly fool oneself!
Query: “How different in age were Uday and Qusay Hussein?”
Non-ontology-based methods for DB integration are quadratic
DB4 (Sept. 9, 2003)
 SuspN         | YOB  |
---------------+------+
 Qusay Hussein |      |
 Uday Hussein  | 1964 |
DB8 (Dec. 31, 1996)
 Prenom | Surnom  | ann |
--------+---------+-----+
 Qusai  | Hussein |  30 |
 Odai   | Hussein |     |
[Figure: DB4 and DB8 shown alongside other sources: FBI Most Wanted, CATS, CDE, NARCL, USGS, OFAC]
Ontology-Based Methods of DB Integration Can Scale Linearly
(…and, by the way, enables DB population/enrichment)
DB4 (Sept. 9, 2003)
 SuspN         | YOB  |
---------------+------+
 Qusay Hussein | 1966 |
 Uday Hussein  | 1964 |
DB8 (Dec. 31, 1996)
 Prenom | Surnom  | ann |
--------+---------+-----+
 Qusai  | Hussein |  30 |
 Odai   | Hussein |  32 |
CONCEPTS
#$QusayHusseinAl-Takriti
#$UdaiHusseinAl-Takriti
RULES
(age ?PERSON (YearsDuration ?AGE))
(birthDate ?PERSON ?BIRTH-DATE)
[Figure labels: CYC, HAL, you!; sources: DB4, DB8, FBI Most Wanted, CATS, CDE, NARCL, USGS, OFAC]
A Solution that Scales Linearly
(…and, by the way, enables DB population/enrichment)
DB4 (Sept. 9, 2003)
 SuspN         | YOB  |
---------------+------+
 Qusay Hussein | 1966 |
 Uday Hussein  | 1964 |
DB8 (Dec. 31, 1996)
 Prenom | Surnom  | ann |
--------+---------+-----+
 Qusai  | Hussein |  30 |
 Odai   | Hussein |  32 |
[Figure: sources shown: DB4, DB8, FBI Most Wanted, CATS, CDE, NARCL, USGS, OFAC]
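The scaling argument and the age query can both be made concrete with a small Python sketch. It assumes that pairwise integration needs one translator per pair of sources, that hub-style integration needs one mapping per source, and that each database contributes only the cells it actually holds (Uday's year of birth in DB4, Qusai's age in DB8). The `CANONICAL` lookup and all function names are illustrative, not part of Cyc.

```python
from datetime import date


def pairwise_translators(n_sources: int) -> int:
    """Translators needed when every source is mapped directly to every other."""
    return n_sources * (n_sources - 1) // 2


def hub_mappings(n_sources: int) -> int:
    """Mappings needed when every source is mapped once to a shared ontology."""
    return n_sources


# Spelling variants mapped to one concept name each (illustrative lookup).
CANONICAL = {
    "Qusay Hussein": "QusayHusseinAl-Takriti",
    "Qusai Hussein": "QusayHusseinAl-Takriti",
    "Uday Hussein": "UdaiHusseinAl-Takriti",
    "Odai Hussein": "UdaiHusseinAl-Takriti",
}

DB4 = [{"SuspN": "Qusay Hussein", "YOB": None},
       {"SuspN": "Uday Hussein", "YOB": 1964}]
DB8_AS_OF = date(1996, 12, 31)
DB8 = [{"Prenom": "Qusai", "Surnom": "Hussein", "ann": 30},
       {"Prenom": "Odai", "Surnom": "Hussein", "ann": None}]


def birth_years(db4, db8, db8_as_of):
    """Merge both sources into one concept -> birth-year table."""
    years = {}
    for row in db4:
        if row["YOB"] is not None:
            years[CANONICAL[row["SuspN"]]] = row["YOB"]
    for row in db8:
        if row["ann"] is not None:
            name = f'{row["Prenom"]} {row["Surnom"]}'
            # An age recorded on Dec. 31 fixes the birth year exactly.
            years.setdefault(CANONICAL[name], db8_as_of.year - row["ann"])
    return years


YEARS = birth_years(DB4, DB8, DB8_AS_OF)
AGE_GAP = abs(YEARS["QusayHusseinAl-Takriti"] - YEARS["UdaiHusseinAl-Takriti"])
```

Neither database can answer the query alone; the answer appears only after both are expressed in the same vocabulary (here, birth years keyed by concept).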
“What major US cities are particularly
vulnerable to an anthrax attack?”
The answer is logically implied by data
dispersed through several sources:
• USGS GNIS DB
• AMVA KB
• RAND R
• UN FAO DB
• DTRA CATS DB
“What major US cities are particularly
vulnerable to an anthrax attack?”
“major US city” → ?C is a U.S. City with >1M population
(> (NumberOfInhabitantsFn ?C) 1000000)
“particularly vulnerable to an anthrax attack” →
– the current ambient temperature at ?C is above freezing, and
– ?C has more than 100 people for each hospital bed, and
– the number of anthrax host animals near ?C exceeds 100k
Don’t add #pullets and #chickens (pullets are already chickens)
 state | name              | type  | county     | state_fips |
-------+-------------------+-------+------------+------------+
 TX    | Dallas            | ppl   | Dallas     |         48 |
 MN    | Hennepin County   | civil | Hennepin   |         27 |
 CA    | Sacramento County | civil | Sacramento |          6 |
 AZ    | Phoenix           | ppl   | Maricopa   |          4 |

 primary_lat | primary_long | elevation | population | status             |
-------------+--------------+-----------+------------+--------------------+
    32.78333 |        -96.8 |       463 |    1022830 | BGN 1978 1959      |
    45.01667 |       -93.45 |         0 |    1032431 |                    |
    38.46667 |   -121.31667 |         0 |    1041219 |                    |
    33.44833 |   -112.07333 |      1072 |    1048949 | BGN 1931 1900 1897 |

USGS GNIS DB: The Geographic Names Information System (GNIS) DB maintained by the US Geological Survey (USGS).
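As a rough illustration of the “major US city” test applied to these four rows, here is a Python sketch. It assumes that the `type` value `ppl` marks a populated place (as opposed to `civil` for counties); the row dictionaries and the function name are made up for the example, and only the columns needed for the test are copied.

```python
# The four sample rows from the GNIS table, reduced to the columns used here.
GNIS_ROWS = [
    {"state": "TX", "name": "Dallas", "type": "ppl", "population": 1022830},
    {"state": "MN", "name": "Hennepin County", "type": "civil", "population": 1032431},
    {"state": "CA", "name": "Sacramento County", "type": "civil", "population": 1041219},
    {"state": "AZ", "name": "Phoenix", "type": "ppl", "population": 1048949},
]


def major_us_cities(rows, min_population=1_000_000):
    """Keep rows that describe a city (not a county) with > min_population people."""
    return [
        f'{row["name"]}, {row["state"]}'
        for row in rows
        # Assumption: "ppl" means populated place; "civil" rows are counties.
        if row["type"] == "ppl" and row["population"] > min_population
    ]
```

A population filter alone would also return the two counties, which is why the system has to know what each row is “about” before it can use the `population` column.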
So how do we explain to our system that:
• row 1 of that table is “about” the city of Dallas, TX
• the population field of that table contains the number
of inhabitants of the city that that row is “about”
• here is exactly how to access tuples of that database
• that access will be fast, accurate, recent, complete
• the population field of that table contains the number
of inhabitants of the city that that row is “about”
We provide the field encodings and decodings, some of which correspond to
explicit fields like population, two-letter state codes, etc:
(fieldDecoding Usgs-Gnis-LS ?x
(TheFieldCalled “population”)
(numberOfInhabitants
(TheReferentOfTheRow Usgs-Gnis)
?x))
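One way to picture what such a decoding does: each physical field of a row is turned into a logical assertion about the thing the row refers to. The Python sketch below is a loose analogy, not how Cyc evaluates `fieldDecoding`; the two lookup tables and the function name are assumptions, and only the term `CityOfDallasTexas` and the predicate `numberOfInhabitants` come from the slides.

```python
# Physical field name -> Cyc predicate it decodes to (only one field shown).
FIELD_DECODINGS = {"population": "numberOfInhabitants"}

# (state, name) -> KB term for the entity that the row is "about".
# A stand-in for (TheReferentOfTheRow Usgs-Gnis); illustrative only.
REFERENTS = {("TX", "Dallas"): "CityOfDallasTexas"}


def decode_row(row):
    """Turn one database row into CycL-style assertion strings."""
    referent = REFERENTS[(row["state"], row["name"])]
    return [
        f"({predicate} {referent} {row[field]})"
        for field, predicate in FIELD_DECODINGS.items()
        if field in row
    ]
```

Applied to the first GNIS row, this yields a single assertion stating the number of inhabitants of the city of Dallas.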
• row 1 of that table is “about” the city of Dallas, TX
We provide the field encodings and decodings, some of which correspond to
explicit fields like population, and some correspond to entities whose
existence is merely implied by the existence of that row in that table (in this
case, the first row implies the existence of -- and describes some specifics of -- the geographic entity that is the real-world city of Dallas, Texas, which is
represented in Cyc’s KB by the term #$CityOfDallasTexas)
There is a logical field name for that entity,
(TheReferentOfTheRow Usgs-Gnis) ,
even though it is only talked about by the explicit fields.
• how to access tuples of that database
We provide all the information needed for a JDBC connection script:
We assert, in the context (MappingMtFn Usgs-KS), all of these:
(passwordForSKS Usgs-KS "geografy")
(portNumberForSKS Usgs-KS 4032)
(serverOfSKS Usgs-KS "sksi.cyc.com")
(sqlProgramForSKS Usgs-KS PostgreSQL)
(structuredKnowledgeSourceName Usgs-KS "usgs")
(subProtocolForSKS Usgs-KS "postgresql")
(userNameForSKS Usgs-KS "sksi")
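To see that these assertions carry what a JDBC connection needs, the sketch below parses them and assembles a connection URL in the usual `jdbc:postgresql://host:port/database` form. Two assumptions are made: that the source name "usgs" is also the database name, and that the user-name assertion takes `Usgs-KS` as its first argument like the others. The parser is a toy regular expression, not Cyc's reader.

```python
import re

SKS_ASSERTIONS = """
(passwordForSKS Usgs-KS "geografy")
(portNumberForSKS Usgs-KS 4032)
(serverOfSKS Usgs-KS "sksi.cyc.com")
(sqlProgramForSKS Usgs-KS PostgreSQL)
(structuredKnowledgeSourceName Usgs-KS "usgs")
(subProtocolForSKS Usgs-KS "postgresql")
(userNameForSKS Usgs-KS "sksi")
"""

# Matches (predicate source value), where value may or may not be quoted.
ASSERTION = re.compile(r'\((\w+)\s+(\S+)\s+"?([^"\s)]+)"?\)')


def parse_assertions(text, source="Usgs-KS"):
    """Return {predicate: value} for every assertion about `source`."""
    return {
        predicate: value
        for predicate, subject, value in ASSERTION.findall(text)
        if subject == source
    }


def jdbc_url(facts):
    """Assemble a JDBC URL; assumes the source name doubles as the DB name."""
    return (
        f"jdbc:{facts['subProtocolForSKS']}://{facts['serverOfSKS']}"
        f":{facts['portNumberForSKS']}/{facts['structuredKnowledgeSourceName']}"
    )


FACTS = parse_assertions(SKS_ASSERTIONS)
URL = jdbc_url(FACTS)
```

The user name and password would be passed separately to the driver; they are kept out of the URL here.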
• that access will be fast, accurate, recent, complete
We provide meta-level assertions about the database, about each table of the
database, about the completeness etc. of various kinds of data in the DB, etc.
We assert, in the context (MappingMtFn Usgs-KS):
(schemaCompleteExtentKnownForValueTypeInArg
Usgs-Gnis-LS
USCity
numberOfInhabitants
1)
Cyc automatically gathers statistics like these, and uses them to order search:
(resultSetCardinality Usgs-Gnis-PS
  (TheSet (PhysicalFieldFn Usgs-Gnis-PS "state")) TheEmptySet
  60.0)
(resultSetCardinality Usgs-Gnis-PS
  (TheSet
    (PhysicalFieldFn Usgs-Gnis-PS "primary_long")
    (PhysicalFieldFn Usgs-Gnis-PS "primary_lat")
    (PhysicalFieldFn Usgs-Gnis-PS "name"))
  (TheSet
    (PhysicalFieldFn Usgs-Gnis-PS "county")
    (PhysicalFieldFn Usgs-Gnis-PS "state"))
  530.36)
Semantic Knowledge Source
Integration (SKSI) summary
• Some of the knowledge needed will generally be in the
Cyc KB already
• Some will reside in already-mapped sources: data
bases, web pages, simulators, etc.
• For each needed new source, explain the meaning of its
schema elements to Cyc
– Write Cyc assertions to convey the meaning of each field, each
polymorphism, each idiosyncratic entry code, plus meta-information:
when this was created/updated, level of granularity, its sources, its
degree of completeness, what it can do quickly, what it can do (slowly),
how to access it, etc.
What Led to Cyc?
1. Programs need general world knowledge, and
commonsense, to break the “brittleness bottleneck”
NL understanding, speech understanding,
robotics, learning, expert systems, search,…
2. We know enough to do this; it is more an
engineering task than a scientific research task.
3. The time was right (1984).
How “general knowledge” helps search
• Query:
“Someone smiling”
• Caption:
“A man helping
his daughter take her first step”
How “general knowledge” helps search
Query: “Show me
pictures of strong and
adventurous people”
Caption: “A man
climbing a rock face”
How “general knowledge” helps search
Text Document
Query: “Outdoor
explosions in terrorist
events in Lebanon between
1990 and 2001”
Document: “1993 pipe
bombing on the patio of
the Beirut Olive Garden”
How “general knowledge” + domain knowledge helps search
Query: “Threats to
low-flying US airliners in
Lebanon”
Text Document
Document: “Hezballah
buys ten SA-7’s.”
Find and clean (consistency-check)
information by inference (+KB)
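XYZCo
 ID # | birth date | hire date | salutation | first name | last name | emerg contact | signif other |
------+------------+-----------+------------+------------+-----------+---------------+--------------+
 8041 | 9/1/57     | 8/5/91    | Mr         | Pat        | Jones     | 8053          | 8053         |
 8053 | 3/3/49     | 2/9/48    | Ms         | Jan        | Smith     | 8053          | 8199         |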
If Pat and Jan are married, their date of marriage
should be the same; their address is likely to be the
same; their genders are likely to differ; and so on.
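Two of the checks that commonsense knowledge suggests for this table can be sketched in Python. The sketch assumes the dates are month/day/two-digit-year in the 1900s and that the last column of each row is the ID of the person's significant other; both are readings of the slide, and the record layout and function name are invented for the example.

```python
from datetime import date

# Employee records keyed by ID #, reduced to the fields the checks need.
EMPLOYEES = {
    8041: {"name": "Pat Jones", "birth": date(1957, 9, 1),
           "hire": date(1991, 8, 5), "signif_other": 8053},
    8053: {"name": "Jan Smith", "birth": date(1949, 3, 3),
           "hire": date(1948, 2, 9), "signif_other": 8199},
}


def find_problems(employees):
    """Flag records that violate simple commonsense constraints."""
    problems = []
    for emp_id, record in sorted(employees.items()):
        # Nobody is hired before they are born.
        if record["hire"] <= record["birth"]:
            problems.append((emp_id, "hired before birth"))
        # "Significant other" should be a symmetric relation.
        other = record["signif_other"]
        if other in employees and employees[other]["signif_other"] != emp_id:
            problems.append((emp_id, "significant-other link is not reciprocated"))
    return problems


PROBLEMS = find_problems(EMPLOYEES)
```

Neither rule mentions XYZCo or its schema; both follow from general knowledge about people, which is the slide's point about cleaning data by inference.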
What Led to Cyc?
1. Programs need general world knowledge, and
commonsense, to break the “brittleness bottleneck”
NL understanding, speech understanding,
robotics, learning, expert systems, search,…
2. We know enough to do this; it is more an
engineering task than a scientific research task.
3. The time was right (1984).
Cyc is…
Millions of facts, rules of thumb, etc. that capture
human common sense about our everyday world
– The typical bird has 1 beak, 1 heart, lots of feathers,…
– Hearts are internal organs; feathers are external protrusions
– Most vehicles are steered by an awake, sane, adult,… human
– Tangible objects can’t be in 2 (disjoint) places at once
– Badly injuring a child is much worse than killing a dog
– Causes temporally precede (i.e., start before) their effects
– A stabbing requires 2 cotemporal and proximate actors
– etc.
Cyc is…
Millions of facts, rules of thumb, etc. that capture
human common sense about our everyday world
- Each of these represented in formal logic
- Info. about a set of hundreds of thousands of terms
- Language-independent
[Figure: word terms (ArabicWordForWritingPen, EnglishWord-Pen, EnglishWord-Plume, FrenchWord-Plume, …) shown alongside concept terms (Penitentiary, WritingPen, Corral, BirdFeather, Authoring, …)]
Cyc is…
Millions of facts, rules of thumb, etc. that capture
human common sense about our everyday world
- Each of these represented in formal logic
- Info. about a set of hundreds of thousands of terms
• An inference engine that draws the same sorts of inferences
from those facts and rules that people would.
• Interfaces so the system can communicate with
people, data bases, spreadsheets, websites, etc.
[Figure: Cyc architecture. Labeled components: Knowledge Users; User Interface (with Natural Language Dialog); Other Applications; Cyc API; Cyc Reasoning Modules; Knowledge Entry Tools; Knowledge Authors; Cyc Ontology & Knowledge Base; Interface to External Data Sources; External Data Sources (Data Bases, Web Pages, Text Sources, Other KBs)]
Painful Evolution of our Representation
from Frames&Slots to Contextualized HOL
Upper Ontology:
EVENT  TEMPORAL-THING  PARTIALLY-TANGIBLE-THING
Core Theories:
(∀ a, b) a ∈ EVENT ∧ b ∈ EVENT ∧ causes(a, b) ⇒ precedes(a, b)
Domain-Specific Theories:
(∀ m, a) m ∈ MAMMAL ∧ a ∈ ANTHRAX ⇒ causes(exposed-to(m, a), infected-by(m, a))
Very specific information (some indirect, via SKSI):
(ist FtLaudHolyCrossERCase#403921
  (caused CutaneousAnthrax
    (SkinLesions Ahmed_al-Haznawit)))
First Order Predicate Calculus: unambiguous; enables mechanical reasoning
Every American has a president.   ∃y.∀x. Amer(x) ⇒ president(x,y)
Every American has a mother.      ∀x.∃y. Amer(x) ⇒ mother(x,y)
Higher Order Logic (nth-order predicate calculus): contexts,
predicates as variables, nested modals, reflection,…
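The president/mother pair differs only in the order of the two quantifiers, and that difference can be checked mechanically on a toy domain. In the Python sketch below, the three Americans and the two relations are invented data; only the two quantifier patterns come from the slide.

```python
def exists_forall(xs, ys, rel):
    """∃y.∀x. rel(x, y): one single y works for every x."""
    return any(all(rel(x, y) for x in xs) for y in ys)


def forall_exists(xs, ys, rel):
    """∀x.∃y. rel(x, y): every x has some y, possibly a different one each time."""
    return all(any(rel(x, y) for y in ys) for x in xs)


# Toy domain (hypothetical names).
AMERICANS = ["Al", "Bea", "Cy"]
PEOPLE = ["Ann", "Barb", "Cleo", "Pres"]
MOTHER_OF = {"Al": "Ann", "Bea": "Barb", "Cy": "Cleo"}


def mother(x, y):
    return MOTHER_OF[x] == y


def president(x, y):
    return y == "Pres"  # the same individual for every American
```

On this domain the “mother” relation satisfies only the ∀∃ reading, while the “president” relation satisfies the stronger ∃∀ reading; the English sentences look identical, but the logical forms do not.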
Cyc is not monolithic
The Knowledge Base is divided
into thousands of contexts by:
granularity, topic, culture,
geospatial place, time,...
Cyc is not committed to any one reasoning mechanism
The inference engine is a community of 720
“agents” that attack every problem and,
recursively, every subproblem (subgoal).
One of these 720 is a general theorem
prover; the others have special-purpose data
structures/algorithms to handle the most
important, most common cases, very fast.
Cyc is not monotonic
98% of its content is marked as
merely being usually true.
So reasoning in Cyc is default (gather up all the
pro/con arguments, and compare them).
Cyc is not committed to its own reasoning mechanisms
Think of reasoning modules 721, 722,
723… as being all manner of external
databases, simulators, translators…
Cyc Knowledge Base
Cyc contains:
15,000 Predicates
300,000 Concepts
3,200,000 Assertions
Represented in:
• First Order Logic
• Higher Order Logic
• Context Logic
• Micro-theories
[Figure: diagram of KB topic areas. Labels: Thing; Intangible Thing; Individual; Sets, Relations; Temporal Thing; Spatial Thing; Partially Tangible Thing; Space; Time; Logic, Math; Borders, Geometry; Paths; Spatial Paths; Physical Objects; Materials, Parts, Statics; Living Things; Life Forms; Plants; Animals; Ecology; Natural Geography; Political Geography; Weather; Earth & Solar System; Events, Scripts; Agents; Actors, Actions; Movement, State Change, Dynamics; Plans, Goals; Physical Agents; Human Beings; Human Anatomy & Physiology; Emotion, Perception, Belief; Human Behavior & Actions; Artifacts; Human Artifacts; Products, Devices; Conceptual Works; Vehicles, Buildings, Weapons; Mechanical & Electrical Devices; Software, Literature, Works of Art; Language; Social Relations, Culture; Organization; Organizational Actions; Organizational Plans; Agent Organizations; Social Behavior; Social Activities; Human Activities; Business & Commerce; Purchasing, Shopping; Types of Organizations; Politics, Warfare; Sports, Recreation, Entertainment; Transportation & Logistics; Human Organizations; Nations, Governments, Geo-Politics; Professions, Occupations; Travel, Communication; Law; Everyday Living; Business, Military Organizations]
General Knowledge about Various Domains
Specific data, facts, and observations
Cyc KB extended with domain
knowledge about terrorism
Cyc contains:
15,000 Predicates
300,000 Concepts
3,200,000 Assertions
Represented in:
• First Order Logic
• Higher Order Logic
• Context Logic
• Micro-theories
[Figure: the same diagram of KB topic areas, from Thing down to Everyday Living and Business, Military Organizations, with the two bottom layers relabeled as follows]
General Knowledge about Terrorism
Specific data, facts, and observations about terrorist groups and activities
Building Cyc qua Engineering Task
[Figure: three builds of a chart with the axis label “amount known”; the later builds add the label “CYC”]
Guiding Principle:
“We have to get it to work, not appear to work”
– Don’t defer hard problems (time/space/emotions…)
– No “NIH”! Harness every good idea that others have
– Take an engineering approach, not a scientific research one:
Instead of one TOE (elegant full solution), find a set of
partial solutions that together cover the most common cases
– Pursue applications that require large amounts of real-world
knowledge (they need Cyc and also will drive it)
Eschew the 5 pitfalls (ways to cut
ontological corners and end up with
something that only appears to work)
• Ignorance-based: Have a small theory size (#terms, #instances, #rules)
• Static KB (can be massively tuned, optimized, cached, etc. ahead of time)
• Simple assertions (e.g., SAT constraints; propositional calculus; Horn;…)
• One global context (no contradictions, limited domain, simplified world)
• Don’t do all the bookkeeping and forward inference required for justification
maintenance (or, equivalently, don’t ever have truth maintenance “turned on”)
As with pharmaceuticals, what is toxic in one dosage is beneficial in a lesser dosage.
E.g., contexts lead to locally-consistent locally-small theories (faster inference/KE)
E.g., often some (sub)problems can be represented/solved in a simpler repr.
Choosing what to add to Cyc
• Bottom-up: Look at a sentence, see what knowledge
the writer assumed the reader already had about the
world. Generalize that piece of knowledge.
• Top-down: Articulate the scope of a (sub)topic, and
articulate queries that should be answerable. Get
missing knowledge by introspecting or just asking Cyc.
The Cyc Knowledge Base
Cyc contains:
15,000 Predicates
300,000 Concepts
3,200,000 Assertions
Represented in:
• First Order Logic
• Higher Order Logic
• Context Logic
• Microtheories
[Figure: the same diagram of KB topic areas, from Thing down to Everyday Living and Business, Military Organizations, with the two bottom layers labeled as follows]
Real World Domain Knowledge
Specific cases, facts, details,…
Cyc KB “Whitman’s Sampler”
• Temporal Relations
• Senses of “x is a physical part of y”
• Senses of “x is physically in y”
• Events and their performers (role types)
• Organizations
• Propositional Attitudes
• Biology
• Materials
• Devices
• Weather
• Information-bearing objects
Temporal Relations
37 Relations Between Temporal Things
#$temporalBoundsIntersect
#$temporallyIntersects
#$temporalBoundsContain
#$temporalBoundsIdentical
#$startsAfterStartingOf
#$startsDuring
#$endsAfterEndingOf
#$overlapsStart
#$startingDate
#$startingPoint
#$temporallyContains
#$simultaneousWith
#$temporallyCooriginating
#$after
Temporal Relations
Some of these Relations are very General, such as:
#$temporallyIntersects
Such relations are particularly useful when they are
known not to hold between a pair of individuals:
(#$not (#$temporallyIntersects ?X ?Y))
That implies all of these:
(#$not (#$spouse PERSON-X PERSON-Y))
(#$not (#$consultant AGENT-X AGENT-Y))
(#$not (#$accountHolder ACCOUNT-X AGENT-Y))
(#$not (#$residesInRegion AGENT-X REGION-Y))
(#$not (#$officiator EVENT-X PERSON-Y))
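The pruning power of a negative `#$temporallyIntersects` fact can be shown with a few lines of Python. The sketch treats each temporal extent as a closed interval of years and rules out any pair whose intervals do not overlap; the three entities and their dates are hypothetical, and the function names are made up for the example.

```python
def temporally_intersects(a, b):
    """True if closed intervals a = (start, end) and b = (start, end) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


def possible_pairs(extents):
    """Pairs that could stand in relations such as spouse or consultant.

    Any pair whose temporal extents do not intersect is excluded up front,
    before any more expensive reasoning about the specific relation.
    """
    names = sorted(extents)
    return [
        (x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
        if temporally_intersects(extents[x], extents[y])
    ]


# Hypothetical lifespans, in years.
EXTENTS = {
    "PersonX": (1800, 1850),
    "PersonY": (1900, 1980),
    "PersonZ": (1830, 1890),
}
```

One cheap interval test eliminates a whole family of relations at once, which is why such general predicates are most useful in their negated form.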
Senses of ‘Part’
#$parts
#$intangibleParts
#$subInformation
#$subEvents
#$physicalDecompositions
#$physicalPortions
#$physicalParts
#$externalParts
#$internalParts
#$anatomicalParts
#$constituents
#$functionalPart
Senses of ‘In’
• Can the inner object leave by passing between
members of the outer group?
– Yes -- Try #$in-Among
Senses of ‘In’
• Does part of the inner object stick out of the container?
– None of it. -- Try #$in-ContCompletely
– Yes -- Try #$in-ContPartially
• If the container were turned around could the contained object fall out?
– Yes -- Try #$in-ContOpen
– No -- Try #$in-ContClosed
Senses of ‘In’
Is it attached to the inside
of the outer object?
– Yes -- Try
#$connectedToInside
Can it be removed, if
enough force is used,
without damaging either
object?
– Yes -- Try #$in-Snugly or
#$screwedIn
Does the inner object
stick into the outer
object?
Yes -- Try #$sticksInto
Event Types
#$PhysicalStateChangeEvent
#$TemperatureChangingProcess
#$BiologicalDevelopmentEvent
#$ShapeChangeEvent
#$MovementEvent
#$ChangingDeviceState
#$GivingSomething
#$DiscoveryEvent
#$Cracking
#$Carving
#$Buying
#$Thinking
#$Mixing
#$Singing
#$CuttingNails
#$PumpingFluid
11,000 more
A few event types pertaining to
Vehicular Transportation
#$TransportationEvent
#$ControllingATransportationDevice
#$TransportWithMotorizedLandVehicle
(#$SteeringFn #$RoadVehicle)
#$TransporterCrashEvent
#$VehicleAccident
#$CarAccident
#$Colliding
#$IncurringDamage
#$TippingOver
#$Navigating
#$EnteringAVehicle
Relations Between
an Event and its Participants
#$performedBy
#$causes-EventEvent
#$objectPlaced
#$objectOfStateChange
#$outputsCreated
#$inputsDestroyed
#$assistingAgent
#$beneficiary
#$fromLocation
#$toLocation
#$deviceUsed
#$driverActor
#$damages
#$vehicle
#$providerOfMotiveForce
#$transportees
Over 400 more.
These ActorSlots express each type of relation
between an Event and its actors and subevents
Here are some slot: value pairs for Attack874
isa: TerroristAttack.
performedBy: JihadGroup.
deviceUsed: Bomb8388.
eventOccursAt: CityOfLondonEngland.
victim: Person9399.
victim: Person52666.
assistingAgent: AlQaeda.
objectsDestroyed: Structure2990.
objectsDestroyed: Vehicle523452.
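Each slot: value pair above corresponds to one binary assertion with the event as its first argument. The Python sketch below performs that rewriting on the text of the frame; it is a string-level illustration only, and the function name is made up for the example.

```python
FRAME_TEXT = """\
isa: TerroristAttack.
performedBy: JihadGroup.
deviceUsed: Bomb8388.
eventOccursAt: CityOfLondonEngland.
victim: Person9399.
victim: Person52666.
assistingAgent: AlQaeda.
objectsDestroyed: Structure2990.
objectsDestroyed: Vehicle523452.
"""


def frame_to_assertions(unit, frame_text):
    """Rewrite 'slot: value.' lines as '(slot unit value)' assertion strings."""
    assertions = []
    for line in frame_text.splitlines():
        line = line.strip().rstrip(".")
        if not line:
            continue
        slot, value = (part.strip() for part in line.split(":", 1))
        assertions.append(f"({slot} {unit} {value})")
    return assertions


ASSERTIONS = frame_to_assertions("Attack874", FRAME_TEXT)
```

A slot that appears twice (such as `victim`) simply becomes two assertions, which is one reason the predicate form generalizes more easily than a single-valued frame slot.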
Organization “Slots”
• #$governingBody
• #$parentCompany
• #$subOrgs-Command
• #$subOrgs-Permanent
• #$subOrgs-Temporary
• #$physicalQuarters
• #$hasHQinCountry
• #$officeInCountry
• #$memberTypes
• #$organizationHead
• #$PolicyFn
• #$mainProductType
+ those predicates that make sense for each
generalization of Organization
(e.g., #$startingTime, #$alsoKnownAs).
Emotion
• Types of Emotions:
– #$Adulation
– #$Abhorrence
– #$Relaxed-Feeling
– #$Gratitude
– #$Anticipation-Feeling
– Over 120 of these
• Predicates For Defining and Attributing Emotions:
– #$contraryFeelings
– #$appropriateEmotion
– #$actionExpressesFeeling
– #$feelsTowardsObject
– #$feelsTowardsPersonType
Propositional Attitudes
Relations Between Agents and Propositions
• #$goals
• #$intends
• #$desires
• #$hopes
• #$expects
• #$beliefs
• #$opinions
• #$knows
• #$rememberedProp
• #$perceivesThat
• #$seesThat
• #$tastesThat
Materials
• Common Substances
• Attributes of Materials
• States Of Matter
• Solutions
• Electrical Conductivity
• Thermal Conductivity
• Structural Attributes
• Tangible Attributes
Materials
• Common Substances
• Attributes of Materials
• States Of Matter
– SolidStateOfMatter
– LiquidStateOfMatter
– GaseousStateOfMatter
• Solutions
• Electrical Conductivity
• Thermal Conductivity
• Structural Attributes
• Tangible Attributes
– SolidTangibleThing
– LiquidTangibleThing
– GaseousTangibleThing
Devices
• Over 4000 Specializations of #$PhysicalDevice
– #$ClothesWasher
– #$NuclearAircraftCarrier
• Vocabulary for Describing Device Functions
– #$primaryFunction-DeviceType
• Device Specific Predicates
– #$gunCaliber
– #$speedOf
• Device States (40+)
– #$DeviceOn
– #$CockedState
Vehicular Transport Devices
• Over 800 Specializations of #$RoadVehicle
– #$AcuraCar
– #$SportUtilityVehicle
– #$Humvee
• Over 100 Specializations of #$AutoPart
– #$AutomobileTire
– #$ShockAbsorber
– #$Windshield
• Five Facets of #$RoadVehicle
– #$RoadVehicleByChassisType
– #$RoadVehicleTypeByBodyStyle
– #$RoadVehicleTypeByModel
– #$RoadVehicleTypeByPowerSource
– #$RoadVehicleTypeByUse
• Specialized Predicates
– #$highwayFuelConsumption
– #$vehicleLoadClass
– #$trafficableForVehicle
– #$vehicle
Weather
• Weather Events
– #$TornadoAsEvent
– #$SnowProcess
• Weather Objects
– #$CloudInSky
– #$SnowMob
• Weather Attributes
– #$ClearWeather
– #$Visibility
– (#$LowAmountFn #$Raininess)
Information-Bearing Things
Books, web-page copies, radio
broadcasts, utterances, intell
cables, TV series,…
What is “Moby Dick” ?
InformationBearingThing (IBT)
AbstractInformationStructure (AIS):
“’Tis Moby Dick!”
PropositionalInformationThing (PIT):
(#$thereExists ?SEE
  (#$and
    (#$isa ?SEE #$Seeing)
    (#$objectPerceived ?SEE #$MobyDick)
    (#$perceiver ?SEE #$CaptainAhab)))
What is “Moby Dick” ?
[Figure: relations among InformationBearingThing (IBT), ConceptualWork (CW), AbstractInformationStructure (AIS), and PropositionalInformationThing (PIT). Edge labels: textOfIBT, instantiationOfCW, ContainsInfo-Propositional-CW, InfoStructureOfCW, PITOfIBTFn, #$infoStructureRepresents]
Bridging the Knowledge Gap
upper ontology:
Water is wet
intermediate ontology:
Vehicles slow down in bad weather
lower ontology (task-specific knowledge):
HUMMV’s lose 18% traction in 4-inch-deep mud
KR Lessons Learned
We started with a straightforward “Frames & Slots” representation
(in 1972), improving it over the years as -- but only as -- we
needed to.
FredAlbertson
  isA: Person
  ownsA: Dog
  worksFor: UT
  ...
KR Lessons Learned
We started with a straightforward “Frames & Slots” representation
But Frames&Slots are inadequate to naturally express
• disjunction (“Fred owns a dog or a parakeet.”)
• negation (“Fred does not own a dog.”)
• modals (“Fred believes Israel wants Egypt to expect…”)
• meta-assertions (“That rule is 50 years old but reliable.”)
• nested quantification (w)(x)(y)(z)…
“Every American has a president.” versus
“Every American has a mother.”
KR Lessons Learned
2. On the one hand, we must move from Frames&Slots to Logic.
But on the other hand: Theorem-proving is too slow!
Solution: Do it, and to recoup efficiency, separate:
The Epistemological Problem (what should the system know?)
The Heuristic Problem (how can it reason efficiently with & about what it knows?)
I.e., represent each assertion in (at least) 2 ways:
one standard logical (predicate calculus) form (EL), and
one (or more) efficient special-purpose representations (HL)
Lessons Learned
• Bridging the knowledge gap: do the “intermediate theories.”
• Rather than struggling to reason in NL sentences, use a more
formal representation language. Make this as simple as possible
(but, year by year, we had to make it ever more expressive.)
• Similarly, represent only – but all – useful distinctions. Sounds
trivial but leads to huge ontologies of objects, predicates, scripts..
• Distinguish the EL and HL. Rather than striving in vain for a
single fast inference engine, use a suite of 720 heuristic modules
that each handle some commonly-occurring problems very fast.
• Probabilities are great iff known; often relative likelihood known
• Most knowledge is default; reason by argumentation
• Rather than striving in vain for a monolithic consistent KB,
divide the KB up into many locally-consistent contexts
Contexts (Microtheories)
Global Consistency:
Can’t Live With It, Can’t Give It Up!
What’s the real source of the problem?
Each rule is rich: it is a simplified statement that obscures
a plethora of unstated assumptions and details.
As long as the rules are all in one coherent small context,
they are likely to make the same simplifying assumptions,
and hence are likely to work together consistently.
“If it’s raining, carry an umbrella”
• the performer is a human being,
• the performer is sane,
• the performer can carry an umbrella; thus:
– the performer is not a baby, not unconscious, not dead,
• the performer is going to go outdoors now/soon,
• their actions permit them a free hand (e.g., not wheelbarrowing)
• their actions wouldn’t be unduly hampered by it (e.g., marathon-running)
• the wind outside is not too fierce (e.g., hurricane strength)
• the time period of the action is after the invention of the umbrella
• the culture is one that uses umbrellas as a rain- (not just sun-)protection device,
• the performer has easy access to an umbrella; thus:
– not too destitute, not someone who lives where it practically never rains,
– not at the office/theater/… caught without an umbrella
• the performer is going to be unsheltered for some period of time
– the more waterproof their clothing, the gentler the rain, and the warmer the air, the longer that time period
• the performer will not be wet anyway (e.g., swimming)
• the rain is annoying -- but merely annoying. Thus:
– not ammonia rain on Venus, radioactive post-apocalyptic rain, biblical (Noah’s-ark-sized, or frogs/blood as rained on Pharaoh)
– the performer is not a hydrophobic person, gingerbread man, etc., and not a hydrophilic person, someone dying of thirst, etc.
Each assertion should be situated in a
context: in a region of context-space
• We identified 12 dimensions of mt-space
• We developed a vocabulary of predicates and terms to describe points and regions along each of those 12 dimensions; and
• We have been situating assertions more and more precisely, and we have been working out calculi for inferring contexts
– E.g., if P is true in C1, and P=>Q is true in C2, in what context C2 can Q be validly concluded?
The 12 dimensions:
• Anthropacity
• Time
• GeoLocation
• TypeOfPlace
• TypeOfTime
• Culture
• Sophistication/Security
• Topic
• Granularity
• Modality/Disposition/Epistemology
• Argument-Preference
• Justification
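One simple candidate answer to the context-inference question on this slide can be sketched as follows. The sketch assumes that a context is a box in context-space (a set of allowed values per dimension) and that Q may be concluded wherever both premises hold, i.e. in the per-dimension intersection of the two contexts. The `Context` class, the `intersect` function, and the sample values are illustrative assumptions, not Cyc's actual microtheory calculus.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class Context:
    """A region of context-space: for each dimension, the set of allowed values.

    Assumed convention: a dimension missing from `dims` is unconstrained.
    """
    dims: Dict[str, FrozenSet[str]]


def intersect(c1: Context, c2: Context) -> Optional[Context]:
    """Return the region where both contexts apply, or None if it is empty."""
    merged: Dict[str, FrozenSet[str]] = {}
    for name in set(c1.dims) | set(c2.dims):
        if name in c1.dims and name in c2.dims:
            values = c1.dims[name] & c2.dims[name]
            if not values:
                return None  # the contexts never overlap on this dimension
        elif name in c1.dims:
            values = c1.dims[name]
        else:
            values = c2.dims[name]
        merged[name] = values
    return Context(merged)


# P is asserted in c1, and P=>Q in c2 (sample values are made up).
c1 = Context({"Time": frozenset({"1985"}),
              "GeoLocation": frozenset({"US", "Canada"})})
c2 = Context({"GeoLocation": frozenset({"US"}),
              "Topic": frozenset({"Politics"})})
q_context = intersect(c1, c2)
disjoint = intersect(c1, Context({"Time": frozenset({"1990"})}))
```

Under these assumptions Q would be situated in the 1985, US, Politics region, and two contexts that disagree on any shared dimension yield no conclusion at all.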
Mathematical Factoring of
Context-space Dimensions
UnitedStatesIn1985Context:
Ronald Reagan is president.
There are at least 900,000 doctors.
PennsylvaniaIn1985Context:
Dick Thornburgh is governor.
LehighCountyInFebruary1985Context:
Dick Thornburgh is governor and Ronald Reagan is president and there are at least 900,000 doctors.
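The factoring on this slide can be illustrated with a minimal sketch in which a context is a (place, time span) pair and a narrower context inherits every assertion made in a context that contains it on both dimensions. The place hierarchy, the month encoding, and the names `within`, `visible_in`, and `facts_in` are assumptions made for illustration; they are not Cyc's representation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Toy "is located within" hierarchy (illustrative only).
PARENT_PLACE: Dict[str, str] = {
    "LehighCounty": "Pennsylvania",
    "Pennsylvania": "UnitedStates",
}


def within(place: str, region: str) -> bool:
    """True if `place` equals `region` or lies inside it."""
    while place != region:
        if place not in PARENT_PLACE:
            return False
        place = PARENT_PLACE[place]
    return True


@dataclass(frozen=True)
class Context:
    place: str
    months: Tuple[int, int]  # inclusive (first, last) month, encoded as YYYYMM


def visible_in(narrow: Context, broad: Context) -> bool:
    """`narrow` inherits from `broad` if it lies inside it on every dimension."""
    return (within(narrow.place, broad.place)
            and broad.months[0] <= narrow.months[0]
            and narrow.months[1] <= broad.months[1])


us_1985 = Context("UnitedStates", (198501, 198512))
pa_1985 = Context("Pennsylvania", (198501, 198512))
KB: List[Tuple[Context, str]] = [
    (us_1985, "Ronald Reagan is president."),
    (us_1985, "There are at least 900,000 doctors."),
    (pa_1985, "Dick Thornburgh is governor."),
]


def facts_in(ctx: Context) -> List[str]:
    """All assertions visible from `ctx`, including inherited ones."""
    return [fact for c, fact in KB if visible_in(ctx, c)]


lehigh_feb_1985 = Context("LehighCounty", (198502, 198502))
lehigh_facts = facts_in(lehigh_feb_1985)
```

Because place and time are checked independently, nothing needs to be asserted directly in the Lehigh County context for all three statements to be available there.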
Time Indices and Granularities
Doug is talking, at 10:30 to 11:30, on 11/17/05.
Therefore:
Doug is talking, at 10:50 to 11:05, on 11/17/05.
But not:
Doug is talking, at 10:55:11 to 10:55:13, on 11/17/05.
Time Indices and Granularities
Doug is talking, at 10:30 to 11:30, on 11/17/05
with temporal granularity calendar minute.
P = Doug is talking.
t = that one hour interval
[Figure: timeline of calendar minutes running toward the future, with the interval t marked]
So: talking during that 15-minute interval? Yes
Talking during that 2-second interval: Unknown
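The Yes/Unknown distinction above can be sketched as a three-valued check. The reading assumed here is a simplification: an assertion made at a given granularity transfers to a sub-interval only if that sub-interval is at least one granule long. The function name `holds_during` and its signature are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def holds_during(asserted_start: datetime, asserted_end: datetime,
                 granularity: timedelta,
                 query_start: datetime, query_end: datetime) -> Optional[bool]:
    """Return True, or None meaning "unknown" (this sketch never infers False)."""
    inside = asserted_start <= query_start and query_end <= asserted_end
    if not inside:
        return None  # the assertion says nothing about times outside its interval
    if query_end - query_start >= granularity:
        return True
    return None  # finer than the granularity: P might not hold at this scale


day = datetime(2005, 11, 17)
talk = (day.replace(hour=10, minute=30), day.replace(hour=11, minute=30))
minute = timedelta(minutes=1)

fifteen_minutes = holds_during(*talk, minute,
                               day.replace(hour=10, minute=50),
                               day.replace(hour=11, minute=5))
two_seconds = holds_during(*talk, minute,
                           day.replace(hour=10, minute=55, second=11),
                           day.replace(hour=10, minute=55, second=13))
```

Here `fifteen_minutes` is `True` and `two_seconds` is `None`, matching the slide's Yes and Unknown.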
Summary (1): Technology
• Cyc is a power source, not a single application.
Like oil, electricity, telephony, computers,…
Cyc can spawn and sustain a new industry.
• It can cost-effectively underlie almost all apps.
(Provide a common-sense layer to reduce brittleness
when faced with unexpected inputs/situations)
• To apply Cyc, we extend its ontology, its KB, and
possibly its suite of specialized reasoning modules
20 Motivating Applications (1984)
5 More Recent Application Ideas
Recent/Current Government Apps
• Dept. of Defense (mostly DARPA, ONR)
– CoABS, HPKB, CPoF, DAML, ACIP
– RKF (OE-ing by non-logicians via clarification dialogue)
– BUTLER: Knowledge-based machine learning
– ResearchCyc: Clean, document, speed up, interface, etc.
– ONR: Level 2 and 3 Information Fusion (sense-making)
• Other US Government Agencies (NSF, ARDA, NIST)
– NIST ATP: Jumpstarting a Nat’l. Knowledge Infrastructure
– AQUAINT, NIMD, Topsail, Eagle, KSP-ATD,…
– Building a comprehensive terrorism KB for the US
– Automated generation of plausible terrorism threat scenarios
– Modeling intelligence analysts (script learning/recognition)
– Semantic knowledge source integration
– Efficient Inference in Large Knowledge Bases
Recent/Current Commercial Apps
• using Cyc as the basis for a medical ontology
– aligning Cyc with Snomed/UMLS/Mesh/...
• multiple-thesaurus manager (align n 300k-term lists)
• spider the entire Web (indexing it in terms of Cyc concepts)
• identify inter-sentential references in NPR transcripts
• improved web (and website) search query/follow-ups
• vulnerability assessment (reason about a scanned network)
• semantic matching for a better customer experience
Summary (2): Cycorp
• 50 employees (almost all MTS’s)
• Revenue about $7M/year (some commercial licenses
and app.’s, but >50% US Government R&D contracts)
• Employee-owned (VC-free and debt-free)
• $75M development effort (750 PY’s over 21 years)
– Mostly spent on building up its ontology and KB
– To a lesser extent, its reasoning modules and interfaces
– Focus: automatically growing Cyc via learning
– Focus: enabling Cyc users to directly extend it
– Focus: making inference orders of magnitude faster
Summary (3): The Message:
What Needs to be Shared?
• bits/bytes/streams/network…
• alphabet, special characters,…
• words, morphological variants,…
• syntactic meta-level markups (HTML)
• semantic meta-level markups (SGML, XML)
• content (logical representation of doc/page/...)
• context (common sense, recent utterances, and n
dimensions of metadata: time, space, level of
granularity, the source’s purpose, etc.)
Summary (3): The Message:
What Needs to be Shared?
Tiny vocabulary (# distinctions) of standard relations:
rdf:type, subclass, label, domain, range, comment,…
Beyond which diversity is tolerated
Which means divergence is inevitable
“What do you mean we have no standard, we have lots of standards!”
Summary (3): The Message:
What Needs to be Shared?
DAML+OIL adds a few more distinctions:
inverses, unambiguous properties, unique properties, lists, restrictions, cardinalities, pairwise disjoint lists, datatypes, …
To do the logical/arithmetic combination across information sources, we need tens of thousands of relations, not tens
From the User’s POV
• The user has a question they want answered
“Which first-run movies star a teenager born in Texas and are showing today at a theater < 10 minutes’ drive from this building?”
• The data needed to answer it is available to them,
but not in one single, obvious, reliable place
• The answers follow logically (and/or
arithmetically) from m elements in n sources
• Don’t want to have to know, ahead of time, what
sources to go to, how to access them, how to
combine the intermediate results.
• Do want to be able to limit, ahead of time, the
uncertainty, recency, granularity, ideology…
(and/or see such meta-level info for each answer)
From the User’s POV
• The user has a question they want answered
• The data needed to answer it is available to them,
but not in one single, obvious, reliable place
• The answers follow logically (and/or
arithmetically) from m elements in n sources
• Don’t want to have to know, ahead of time, what
sources to go to, how to access them, how to
combine the intermediate results.
• Do want to be able to limit, ahead of time, the
uncertainty, recency, granularity, ideology…
(and/or see such meta-level info for each answer)
From the User’s POV
“Which first-run movies star a teenager born in Texas and are showing today at a theater < 10 minutes’ drive from this building?”
• The user has a question they want answered
• The data needed to answer it is available to them,
but not in one single, obvious, reliable place
• Do want the answer to be found automatically,
not a bunch of relevant pages for them to peruse.
• Don’t want to have to know, ahead of time, what
sources to go to, how to access them, how to
combine the intermediate results.
• Do want to be able to limit, ahead of time, the
uncertainty, recency, granularity, ideology…
(and/or see such meta-level info for each answer)
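The movie question is a case of an answer that follows logically and arithmetically from several elements in several sources. A minimal sketch of that combination is shown here with four toy in-memory sources. All records, names, and the `answer` function are invented for illustration, and the teenager test (age 13 to 19) is an assumed reading of the question.

```python
from datetime import date
from typing import List, Tuple

# Source 1: film listings (invented data).
MOVIES = [
    {"title": "Film A", "stars": ["Pat Doe"], "first_run": True},
    {"title": "Film B", "stars": ["Lee Roe"], "first_run": True},
    {"title": "Film C", "stars": ["Pat Doe"], "first_run": False},
]
# Source 2: biographical records (invented data).
PEOPLE = {
    "Pat Doe": {"born": date(1989, 3, 1), "birth_state": "Texas"},
    "Lee Roe": {"born": date(1970, 6, 9), "birth_state": "Texas"},
}
# Source 3: today's showings. Source 4: driving minutes from "this building".
SHOWING_TODAY = {"Film A": ["Cinema 1", "Cinema 2"],
                 "Film B": ["Cinema 1"],
                 "Film C": ["Cinema 1"]}
DRIVE_MINUTES = {"Cinema 1": 8, "Cinema 2": 25}


def age_on(born: date, today: date) -> int:
    """Arithmetic step: whole years elapsed between `born` and `today`."""
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)


def answer(today: date, max_drive: int = 10) -> List[Tuple[str, List[str]]]:
    """Join the four sources; return (title, nearby theaters) pairs."""
    results = []
    for movie in MOVIES:
        if not movie["first_run"]:
            continue
        texan_teen = any(
            PEOPLE[s]["birth_state"] == "Texas"
            and 13 <= age_on(PEOPLE[s]["born"], today) <= 19
            for s in movie["stars"] if s in PEOPLE
        )
        nearby = [t for t in SHOWING_TODAY.get(movie["title"], [])
                  if t in DRIVE_MINUTES and DRIVE_MINUTES[t] < max_drive]
        if texan_teen and nearby:
            results.append((movie["title"], nearby))
    return results


hits = answer(date(2005, 11, 17))
```

The user states only the question; which sources to consult and how to combine the intermediate results is left to the system, which is the point of the slide.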
End of “The Message”
End of “The Summary”
Delve into a typical domain – answering
intelligence analysts’ queries – where Cyc can
really help, because that domain thwarts all
five of the “ontological corner-cutting” solutions
(+ digressions for OpenCyc, ResearchCyc,…)
Eschew the 5 pitfalls (ways to cut
ontological corners and end up with
something that only appears to work)
• Ignorance-based: Have a small theory size (#terms, #instances, #rules)
• Static KB (can be massively tuned, optimized, cached, etc. ahead of time)
• Simple assertions (e.g., SAT constraints; propositional calculus; Horn;…)
• One global context (no contradictions, limited domain, simplified world)
• Don’t do all the bookkeeping and forward inference required for justification
maintenance (or, equivalently, don’t ever have truth maintenance “turned on”)
As with pharmaceuticals, what is toxic in one dosage is beneficial in a lesser dosage.
E.g., contexts lead to locally-consistent locally-small theories (faster inference/KE)
E.g., often some (sub)problems can be represented/solved in a simpler repr.
The Analyst’s Knowledge Base
[Figure: architecture of the Analyst’s Knowledge Base (AKB).
Users: a CT Analyst (“Were there any attacks on targets of symbolic value to Muslims since 1987 on a Christian holy day?”) and Domain Experts ("What sequences of events could lead to the destruction of Hoover Dam?").
Front-end components: Query Formulator, Explanation Generator, Scenario Generator; Cycorp Tools For: Ontology-Building, -Browsing, -Editing, & Fact/Rule Entry; Others’/GOTS Analysis and Collaboration Components.
Cyc: Reasoning Modules, General Knowledge, Terrorism Knowledge Base (AKB); Relational DB “projection” of the AKB.
Interface to Data Repositories: HUMINT Messages, SIGINT Message Content, INS Border Crossings Data, Geopolitical Data, HID, Global Terrain Data, Weather Observations, Travel Records, Credit Card Records, Satellite Intel, Military Intel.]
[Figure: building the Comprehensive AKB. Preexisting Structured Relevant Terrorism Knowledge sources shown: output of COTS Text Extraction Systems, MIPT, MATRIX, TTT, TKS2, TKS3, TKS6, TKS7, TKS8. Quantities shown: 80k, 1.92M.]
1. Fusion of available structured terrorism knowledge sources: A tiny fraction of the Comprehensive AKB.
2. Terrorism domain experts met to develop a schema for the missing knowledge.
3. They and others are working remotely, collaboratively, to flesh out the missing 95% of the AKB.
4. Cyc uses general and domain knowledge to convert the simple English phrases into formal logic.
5. The Comprehensive AKB: First useful state: will contain over 4M facts and rules of thumb, about half of which is pre-existing general knowledge already in Cyc.
Templatized Terrorism Analysis Queries
(1) List the [ORGANIZATIONS] at which [AGENT] was [STATUS] and when.
(1a) List the schools at which [Mohammed Atta] was [enrolled] and when.
(1b) List the companies at which [Mark Fulton] was [employed] and when.
(3) What percentage of [ATTACK-TYPE] are [ATTACK-TYPE]?
(3a) What percentage of [terrorist attacks] are [poisonings]?
(3b) What percentage of [bombings] are [suicide bombings]?
(4) Between what times was the [AGENT] a/an [ROLE-PREDICATE] in what types of
acts and where?
(4a) Between what times was the [Aum Supreme Truth] a [performer] in what types
of acts and where?
(4b) Between what times was the [Ulster Volunteer Force] an [assisting agent] in
what types of acts and where?
Templatized Terrorism Analysis Queries
(13) List all [AGENT-TYPE] in [LOCATION] that have used [DEVICE-TYPE] and
list the specific types of (devices) that each has used.
(13a) List all [revolt organizations] in [Northern Ireland] that have used [pipe bombs]
and list the specific types of pipe bombs that each has used.
(13b) List all [right wing terrorist groups] in [North America] that have used
[package bombs] and list the specific types of package bombs that each has used.
(22) List the [AGENT-TYPE] who have [RELATION] [TYPE] to [AGENT] and what
those supplies were.
(22a) List the [Terrorist groups] who have [given] [supplies] to [Hamas] and what
those supplies were.
(22b) List the [state sponsored terrorist agents] who have [provided] [support] to
[Osama Bin Laden] and what those supplies were.
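The relation between a template and its instances, e.g. (13) and (13a), can be sketched as simple slot filling. This is only an illustration of the surface form: the `instantiate` helper is hypothetical, and the step in which Cyc converts the filled-in English into formal logic is not modeled.

```python
import re
from typing import Dict

# A slot is an upper-case name in square brackets, e.g. [AGENT-TYPE].
SLOT = re.compile(r"\[([A-Z-]+)\]")


def instantiate(template: str, bindings: Dict[str, str]) -> str:
    """Replace each [SLOT] with its bracketed filler; unbound slots are kept."""
    def fill(match):
        name = match.group(1)
        if name in bindings:
            return "[" + bindings[name] + "]"
        return match.group(0)
    return SLOT.sub(fill, template)


template_13 = ("List all [AGENT-TYPE] in [LOCATION] that have used [DEVICE-TYPE] "
               "and list the specific types of (devices) that each has used.")
query = instantiate(template_13, {
    "AGENT-TYPE": "revolt organizations",
    "LOCATION": "Northern Ireland",
    "DEVICE-TYPE": "pipe bombs",
})
```

A template that repeats a slot name with different fillers, such as (3) with its two [ATTACK-TYPE] slots, would need positional rather than named bindings.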
CIA Intelligence Report
“Seeking Information: Ahmad Said”
July 26, 2004
Ahmad Said, an expert on remote-controlled bombs with a degree in
chemical engineering, was seen travelling to Lebanon early this
month. Said claimed to be a member of the Lebanese Hizballah from
the mid 1980s until late July 1999.
It is currently believed that Said assisted in the July 22nd car bombing
in Beirut that damaged police barracks and destroyed several retail
stores. Lebanese Hizballah's spokesman, Emad Mugniyeh, issued a
statement on July 26th to the Al Aman newspaper denying the group's
involvement in the attack.
Deeper Analytical Question Answering
What factors argue <for/against> the conclusion that
<ETA> <performed> <the March 2004 Madrid attacks>?
For:
- ETA often executes attacks near national elections
- ETA has performed multi-target coordinated attacks
- Over the past 30 years, ETA performed 75% of all terrorist attacks in Spain
- Over the past 30 years, 98% of all terrorist attacks in Spain were performed
by Spain-based groups, and ETA is a Spain-based group.
Against:
- ETA warns (a few minutes ahead of time) of attacks that would result in a
high number of civilian casualties, to prevent them. There was no such warning
prior to this attack.
- ETA generally takes responsibility for its attacks, and it did not do so this time.
- ETA has never been known to falsely deny responsibility for an attack, and it
did deny responsibility for this attack.
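The for/against layout above reflects reasoning by argumentation: instead of returning a single truth value, the system collects the arguments on each side. A minimal data-structure sketch follows. The class names and the crude `status` summary are assumptions made for illustration; a real system would compare and weigh arguments, not just check whether both sides are present.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Argument:
    supports: bool  # True = argues for the conclusion, False = against it
    reason: str


@dataclass
class Conclusion:
    statement: str
    arguments: List[Argument] = field(default_factory=list)

    def pros(self) -> List[str]:
        return [a.reason for a in self.arguments if a.supports]

    def cons(self) -> List[str]:
        return [a.reason for a in self.arguments if not a.supports]

    def status(self) -> str:
        """Crude summary of where the arguments stand."""
        if self.pros() and self.cons():
            return "contested"
        if self.pros():
            return "supported"
        return "opposed" if self.cons() else "unknown"


claim = Conclusion("ETA performed the March 2004 Madrid attacks")
claim.arguments.append(Argument(True, "ETA often executes attacks near national elections"))
claim.arguments.append(Argument(True, "ETA has performed multi-target coordinated attacks"))
claim.arguments.append(Argument(False, "ETA generally takes responsibility; it did not this time"))
```

Keeping the reasons alongside the conclusion is what lets the analyst see why the system leans one way or the other.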
Automatic Link Detection
Intelligent Fusion: Disparate Data
• USS Lake Champlain is scheduled to return to its
homeport (NavBase San Diego) 1300 4 September
• Hurricane Howard predicted to make landfall at
Tijuana, Mexico approx. 0100 5 September
• 0600 4 September: satellite imagery reveals
126 boats berthed Silver Gate Yacht Club.
• 0600 4 September: Silver Gate Yacht Club
harbormaster manifest only lists 124 craft.
• 1135 4 September: Coast Guard reports two cigarette
boats, traveling together at 54 knots, on a trajectory
consistent with a path from the Silver Gate Yacht
Club to the entrance of the San Diego Naval Base.
• Monitoring of cell phone activity of a suspected Red Dawn
terrorist cell member in Syria has identified four calls, each
of 30 seconds’ duration, placed to that suspect from Shelter
Island between 2300 September 3 and 1100 September 4.
Automatic Generation of Plausible
(Counter)Terrorism Scenarios
Generate chains of action and plausible reaction
• Start from seed, if given one
• End at target, if given one
• Meet in middle
• Each step should be both plausible and interesting
• Grow whole populations of such paths, not just one.
• Employ heuristics to evaluate each node’s “promise”: plausibility x interestingness
Often a step is just a response, by 1 or more agents, to the prior step
(or, if going right to left, it is an enabler/cause of the already-known successor step)
Each step can be a…
• Political event (e.g., an election)
• Diplomatic event (communiqué)
• Military event (buildup along border)
• Terrorist event (suicide bombing)
• Economic event (loan; arms sale)
• Infrastructure event (power outage)
• Act of Nature (illness; hurricane)
[Figure: example chain of steps ending at the target; the ordering is reconstructed from the slide layout]
• Al Qaida has high net worth (assets) and the will to do it
• Al Qaida does a sudden, atypical liquidizing of $1M of its assets
• Pakistan has such devices and is financially hurting
• buy it for $1M from Pakistan
• detonate a crude 100 kton nuclear bomb, 1 km away
• Destroy 3.24M tons of concrete
• Hoover dam is blown up
Auto. Scen.Gen.: Lessons Learned
• Forward generation is too explosive
• Backward generation is too sterile
• Instead, use a sort of “cardiac rhythm”
– Take a large step backward (ABDUCTION)
– Work forward a little from it (DEDUCTION)
– Repeat.
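The “cardiac rhythm” can be sketched on a toy causal graph: each beat takes one step backward from the current frontier (abduction), then works forward one step from each new hypothesis (deduction). The rule table and function names are invented for illustration, and the sketch omits the plausibility and interestingness scoring a real generator would apply.

```python
from typing import Dict, List, Set

# Toy causal rules, cause -> direct effects (all invented for illustration).
CAUSES: Dict[str, List[str]] = {
    "acquire device": ["detonate device", "funds drop"],
    "detonate device": ["dam destroyed", "seismic reading"],
    "liquidate assets": ["acquire device", "unusual transactions"],
}


def abduce(effect: str) -> List[str]:
    """Step backward: which events could have caused `effect`?"""
    return [cause for cause, effects in CAUSES.items() if effect in effects]


def deduce(event: str) -> List[str]:
    """Step forward: what follows directly from `event`?"""
    return CAUSES.get(event, [])


def cardiac_rhythm(target: str, beats: int) -> Set[str]:
    """Alternate abduction and deduction, collecting every event touched."""
    scenario: Set[str] = {target}
    frontier = [target]
    for _ in range(beats):
        hypotheses = [c for e in frontier for c in abduce(e) if c not in scenario]
        if not hypotheses:
            break
        for h in hypotheses:
            scenario.add(h)
            scenario.update(deduce(h))  # side effects that data could confirm
        frontier = hypotheses
    return scenario


events = cardiac_rhythm("dam destroyed", beats=3)
```

The forward steps are what make each backward hypothesis testable: side effects such as "unusual transactions" are the kind of observable consequence one could look for in available data.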
Targeted Fact Gathering: Web Search
(foundingDate AbuSayyaf ?X)
→ Search Strings:
• Abu Sayyaf was founded in ___
• Al Harakat Islamiya, established in ___
• ASG was established in ___
→ Local storage: Abu Sayyaf was founded in the early 1990s
→ Parse
→ Suggested Fact: (foundingDate AbuSayyaf (EarlyPartFn (DecadeFn 199)))
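The pipeline on this slide (open query, search strings, retrieved sentence, parsed fact) can be sketched as follows. The lexicon tables, function names, and the single regular expression are hypothetical stand-ins; the sketch handles only the "early <decade>s" pattern from the slide and performs no actual web search.

```python
import re
from typing import List, Optional

# Hypothetical lexicon: names for the term and phrasings for the predicate.
NAMES = {"AbuSayyaf": ["Abu Sayyaf", "Al Harakat Islamiya", "ASG"]}
PHRASES = {"foundingDate": ["was founded in", "was established in"]}


def search_strings(predicate: str, term: str) -> List[str]:
    """One search string per (name, phrasing) pair, with the answer left open."""
    return [name + " " + phrase
            for name in NAMES[term] for phrase in PHRASES[predicate]]


EARLY_DECADE = re.compile(r"the early (\d{3})0s")


def parse_founding(term: str, sentence: str) -> Optional[str]:
    """Turn '... in the early 1990s' into a CycL-style suggested fact."""
    match = EARLY_DECADE.search(sentence)
    if match is None:
        return None
    return "(foundingDate {} (EarlyPartFn (DecadeFn {})))".format(term, match.group(1))


queries = search_strings("foundingDate", "AbuSayyaf")
suggested = parse_founding("AbuSayyaf", "Abu Sayyaf was founded in the early 1990s")
```

The result is only a suggested fact: as the slide's wording indicates, it still has to be accepted before it counts as knowledge.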
Targeted Fact Gathering: Web Search
(maritalStatus YassirArafat ?X)
→ PersonTypeByMaritalStatus
→ All Possible Facts:
• (maritalStatus YassirArafat Single)
• (maritalStatus YassirArafat Married)
• (maritalStatus YassirArafat Divorced) …
• (maritalStatus YassirArafat Cohabitating-Unmarried)
→ Search Strings:
• Yasser Arafat’s fiance
• Yasser Arafat’s wife
• Yasser Arafat’s ex-wife
• Yasser Arafat divorced
→ Suggested Fact: (maritalStatus YassirArafat Married)
Harnessing Lots of Users
• Identify underpopulated common sense predicates
• Use semantic constraints + shallow parsing to identify possible fact completions
• Present multiple choice questions to novices to complete facts
150-400 commonsense GAFs/hour
useful distinguishing facts
Hat worn on: ☐ Head ☐ Neck ☐ Foot ☐ Leg
OpenCyc
Open Source release of: [most of] the Cyc
Ontology + Simple Relns. + Inference Engine
ResearchCyc
Almost All of Cyc (for free for R&D purposes)
The OpenCyc Release
• Runs on Windows, Linux
• OpenCyc Knowledge Base
– LGPL license
– 47,000 terms
– 306,000 facts
• Cyc Inference Engine
– Free license for binary runtime engine
• Application Programming Interface
– Java, SubL, Python
• Extensive documentation
– Ontological Engineer’s Handbook
– Online Cyc 101 course
Why Do We Release All This?
• Advance the starting line for AI
• Enable a large number of users to in effect
help us to grow the Cyc Knowledge Base
• Help Cyc become a critical component
– in the Semantic Web
– in more and more applications
– using OpenCyc hopefully leads to using
ResearchCyc for free, eventually licensed
OpenCyc is Upward-Compatible
with ResearchCyc
ResearchCyc contains
• OpenCyc
• Natural Language Processing subsystem
• Many more facts/rules per term
– The “extent” of non-structural predicates
60,000 OpenCyc Users/Contributors,
50 Active ResearchCyc User Groups:
[Figure: user groups arranged by category (Government, Government-related, University, Commercial, NPOs); the assignment of groups to categories is not recoverable from the extracted text. Groups shown:]
Language Computer Corporation; Air Force Rome Labs; 21st Century Technologies; Houston VA Medical Center; Xerox PARC; Stone’s Throw Technologies; SRI; ISI; Daxtron Labs; Austin Info Systems; Lockheed Martin ATLD; U of Illinois Urbana-Champaign; U of Maryland; MIT Media Lab; Northwestern U; ANSER, Inc.; NTT Communications Science Laboratories (Japan); Fraunhofer Institute; Sapio Systems (Denmark); Terra Incognita; Trimtab Consulting; Stanford NLP Dept.; TNO-DMV (Netherlands); U of Pennsylvania; Rensselaer AI and Reasoning Lab; Microfabrica, Inc.; LBJ School of Public Affairs; U of Toronto; Radboud U (Netherlands); Knowledge Media Institute, Open University; U of Stuttgart; U of Minnesota; Witan International; New Mexico Highlands Univ.; Harvard U; Linkoping U (Sweden); U of Hawaii; Institute for the Study Of Accelerating Change; Tokyo Inst. of Technology
Problem
5 Factors slowing IC inference
(F1) Constant stream of new assertions, new data to assimilate.
– “elaboration tolerance” vs. tuned, optimized, “compiled” representations.
(F2) Theory Size: Huge vocab. and #instances (people, specific reports,…)
(F3) Sophisticated assertions and constraints strain even FOPC
– More repr. language “features” (e.g., quantification) => slower inference
(F4) Assertions are often true in one context and false in another
– Contextualized data and queries => exponentially larger search space
(F5) Truth maintenance must be “on”, to assimilate new data properly, and
to provide the symbolic justifications behind its conclusions.
– Each new datum can trigger an avalanche of TMS reactions in the KB
– There can be multiple answers, each with multiple justifications
Slow Queries
• Queries that take a long time (okay, but faster is better)
– Generate scenarios resulting in destruction of NY Stock Exchange
Still running after 2 months
– Answer query Q modulo a small number of plausible “unknown” clauses
• Queries that take a long time and shouldn’t
– (capableOf ArnoldSchwarzenegger RunningForPresidentOfUS)
Takes 40 minutes to return False.
Why: Wasting time seeing if Arnold is an x where x can’t be President (e.g., Cow)
– (hasBeliefSystems AdolfHitler AntiSemitism)
In the context of World History 1944, takes 16 minutes to return True.
Why: Lots of ways this might not be true
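The diagnosis for the first slow query (time wasted checking whether Arnold is a Cow) suggests one possible remedy: a cheap disjointness filter applied before any expensive proof attempt. The sketch below illustrates that idea only; the tables and function names are invented, and nothing here is claimed to be how Cyc actually addressed the problem.

```python
from typing import Dict, List, Set

# Toy type assertions and disjointness facts (invented for illustration).
ISA: Dict[str, str] = {"ArnoldSchwarzenegger": "Person", "Bessie": "Cow"}
DISJOINT: Set[frozenset] = {frozenset({"Person", "Cow"})}


def could_be(individual: str, collection: str) -> bool:
    """Cheap test: False only if the known type is declared disjoint with `collection`."""
    known = ISA.get(individual)
    if known is None:
        return True  # nothing known, so nothing can be ruled out cheaply
    return frozenset({known, collection}) not in DISJOINT


def worth_trying(individual: str, candidate_types: List[str]) -> List[str]:
    """Keep only the candidate types that survive the cheap disjointness test."""
    return [t for t in candidate_types if could_be(individual, t)]


remaining = worth_trying("ArnoldSchwarzenegger", ["Cow", "Person"])
```

Only the surviving candidates would then be handed to the slower general-purpose inference.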
Effic. Reasoning Hypotheses
• Hypothesis 1: There is no silver bullet,
no one magic key waiting to be
discovered which will unlock efficient
pathfinding on huge knowledge-spaces.
– Rather, such inference will only be improved
incrementally, by bringing to bear a large
number of efficient partial solutions.
Effic. Reasoning Hypotheses
• Hypothesis 2: These special-case solutions
are not random, but factor into a handful of
different categories.
– A 2-day workshop meeting could productively
be held for each such category
– Important interstitial work to be done,
collaboratively, before and after the meetings.
6 categories (workshop topics)
• Reasoners that exploit limitations in the
expressivity of the repr. language they operate over
– Description Logic, 1st order, etc.
– What simplifications enable what speedups?
– At what risk?
• Domain-specific (incl. Context-specific) reasoners
• Statistical/Bayesian Reasoners
• “Unsound” (but presumably useful) reasoners
• Meta-reasoners (tacticians) and Meta² (strategists)
• Parallel Processing, HW Acceleration, “Other”
6 categories (workshop topics)
• Reasoners that exploit limitations in the
expressivity of the repr. language they operate over
– Description Logic, 1st order, etc.
– What simplifications enable what speedups?
– At what risk?
• Domain-specific (incl. Context-specific) reasoners
– What sorts of domain knowledge do they utilize?
– How do they use that to speed up inference?
– Contexts, dimensions of context-space, algorithms for
exploiting that structure of the KB to do faster reasoning
6 categories (workshop topics)
• Statistical/Bayesian Reasoners
– How can these cooperate with, help, and be helped by
non-statistical reasoners (acting as independent agents)?
– How can statistical and symbolic inference be more
tightly integrated in a single reasoner (cf. Koller) ?
• “Unsound” (but presumably useful) reasoners
– Abduction, induction, analogy, abstraction (ignoring
details which hopefully won’t matter), scen. generation
– How can these cooperate with, help, and be helped…?
– How can unsound and sound inference be more tightly
integrated in a single reasoning engine?
6 categories (workshop topics)
• Meta-reasoners (tacticians) and Meta² (strategists)
– Do/Improve object-level ↔ meta-level reasoning
– Types of meta-… (prior & tacit; trails; reflection;…)
• “Other”
– Parallel processing
– Hardware acceleration (special purpose chips etc.)
– New types of reasoning modules and strategies, that
don’t fit in any above group, that folks are working on.
– What specific gaps are there (useful, doable, efficient
reasoners no one has even started to research yet) ?
Background & Lit. Review
• Instantiation-based reasoning systems
• Lifted DPLL procedures (Davis Putnam Logemann Loveland)
• Completion/Boolean Ring based methods
• ContractNet
• TeamWork
• Scatter-gather algorithms
• Auto. theory decomposition by static analysis
• Explanation-based learning/partial evaluation mechanisms that learn generalized proof schemata
Effic. Reasoning Hypotheses
1. No silver bullet
2. 6 types of powerful partial solutions already exist
– Reasoners that exploit limitations in the expressivity of the
representation language they operate over
– Domain-specific (incl. Context-specific) reasoners
– Statistical/Bayesian Reasoners
– “Unsound” (but presumably useful) reasoners
– Meta-reasoners (tacticians) and Meta² (strategists)
– “Other”, HW accel., parallel processing
3. They can cooperate / synergize (neutral harness)
Effic. Reasoning Hypotheses
• Hypothesis 3: They can cooperate / synergize.
– Explicitly characterize, for each “agent” (reasoner):
• A trigger -- in effect specifying its area of competence
• A procedure for estimating its cost, its chance to succeed, etc.
– Cyc’s immense KB and EL/HL architecture make it an
efficient reasoning module “magnet” or “universal recipient”
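The per-agent characterization in Hypothesis 3 (a trigger plus a cost and success estimate) can be sketched as a small dispatch harness. Everything here is an illustrative assumption: the `Reasoner` record, the ranking by expected cost per unit chance of success, and the two toy agents. It is not Cyc's HL module interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Reasoner:
    name: str
    trigger: Callable[[str], bool]                  # area of competence
    estimate: Callable[[str], Tuple[float, float]]  # (cost, chance of success)
    solve: Callable[[str], Optional[str]]


def dispatch(problem: str, agents: List[Reasoner]) -> Optional[str]:
    """Collect bids from triggered agents, then try the best bids first."""
    bids = []
    for agent in agents:
        if agent.trigger(problem):
            cost, chance = agent.estimate(problem)
            if chance > 0:
                bids.append((cost / chance, agent))
    for _, agent in sorted(bids, key=lambda bid: bid[0]):
        result = agent.solve(problem)
        if result is not None:
            return agent.name + ": " + result
    return None


# Two toy agents: a fast special-purpose adder and a slow general fallback.
adder = Reasoner(
    name="adder",
    trigger=lambda p: p.replace(" ", "").replace("+", "").isdigit(),
    estimate=lambda p: (1.0, 0.99),
    solve=lambda p: str(sum(int(x) for x in p.split("+"))),
)
general = Reasoner(
    name="general",
    trigger=lambda p: True,
    estimate=lambda p: (100.0, 0.5),
    solve=lambda p: None,  # stands in for a search that finds nothing
)

fast_answer = dispatch("2 + 3", [general, adder])
no_answer = dispatch("why?", [general, adder])
```

Adding a new reasoner to such a harness means supplying only its trigger, estimator, and solver, which is the sense in which a shared framework can act as a "universal recipient".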
Effic. Reasoning Hypotheses/SOW
• Hypothesis 3: They can cooperate / synergize. More than that, we can and will harness them.
– Explicitly characterize, for each “agent” (reasoner):
• A trigger -- in effect specifying its area of competence
• A procedure for estimating its cost, its chance to succeed, etc.
– Cyc’s immense KB and EL/HL architecture make it an
efficient reasoning module “magnet” or “universal recipient”
• Hold 3 workshops, on the 6 topics, in 2006
– Participation by all the leading experts (~10 of them)
– Pre: readings. Post: actually harness them.
• Use Cyc [and ARDA-related assertions/queries in it] as a testbed for
– operationally “publishing” the results of each workshop
– experiments on comparative and collaborative power
Efficient Pathfinding in Very Large Data Spaces
GOALS
• Develop an ontology and a standard for specifying the
applicability, % success, estimated resource cost, etc., of bringing
various reasoning modules to bear on a problem
• Build an Integration Framework, a Harness, that enables several
of the world’s leading reasoning systems to cooperatively solve
problems [using the above ontology and standard to act as agents,
broadcast subproblems, etc.] Actually hook them up to this Harness and run them, on test problems from NIMD, AQUAINT, etc.
• Overcome the 5 problems that make IC reasoning hard:
(1) New assertions constantly (can’t just “compile” the KB)
(2) Each is true in some contexts (in 2003; believed by x)
(3) Many are complex (x believes that y believes that…)
(4) Huge vocabulary size and number of instances
(5) Justifications / sources matter (truth maint. must be “on”)
APPROACH
• Identify the most important ways in which automated reasoners
gain efficiency: limit domain, limit expressiveness, integrate
probabilistic and symbolic reasoning, meta-reasoning, and
unsound reasoning (e.g., analogy)
• Hold a workshop on each topic (16 invitees; 15 said “Yes”)
• After/between the workshops, get these system builders to
“publish” their reasoner to the growing Framework/harness so
each can bid for, work on, and broadcast subproblems
Workshop Highlights
4Q 05: Pre-start invitations and Steering Comm. planning
1Q 06: Project starts. 1st workshop: gaining efficiency by
limiting representation language expressivity
2Q 06: Interstitial work on ontology and standard; building the
initial Framework/harness; try out 2 “agents”; 2nd
workshop: gaining efficiency by limiting the domain,
the type of problem to be solved, etc.
3Q 06: 3rd workshop: Integrating Bayesian probability and
statistical reasoning with symbolic theorem-proving
4Q 06: 4th workshop: meta-reasoning (tactics & strategy);
5th workshop: unsound reasoning (e.g., analogy)
1Q 07: 6th workshop; Final Report; Hand-off to I.C./Ops
“Champions” for tech transfer/operationalization
Workshop “Steering Committee”:
R.V. Guha, Google; Chris Welty & Andrew Tomkins, IBM;
Andrei Voronkov, Manchester; + I.C./Ops. “Champions”
Workshop PI’s:
Doug Lenat, Cycorp
Michael Genesereth, Stanford
CYC: Lessons Learned in Large-Scale Ontological Engineering
The pursuit of Artificial Intelligence -- from robotics to natural language processing to automated
learning -- has been held back by the "brittleness bottleneck" caused by the need for common sense.
For 21 years, we've been priming the pump, building up a formalized corpus of such knowledge, Cyc.
Along the way, we've had to revise our preconceptions and theories, to expand our representation
language and arsenal of inference methods, to find approximate yet adequate engineering solutions to
problems that philosophers have grappled with for millennia such as ontologizing aspects of
substances versus individual objects, time, space, causality, belief, social interactions, and so on. The
process of ontological engineering had to grow and evolve throughout this enterprise, as well, such as
how Cyc represents and reasons with contradictions and context.
In this talk I will try to cover both the large scale picture of what we've built and why, and the detailed
picture of how it's built, and the lessons learned along the way in how and how not to do large-scale
OE. I will report on our recent efforts to make Cyc more accessible to the broader community through
OpenCyc and ResearchCyc, which raises issues of how multiple individuals and groups can share and
integrate their extensions (and settle their differences). Finally, I will discuss an exciting new effort
we have just had funded, to gather automated reasoning researchers together for a series of workshops
in 2006 on speeding up inference in large knowledge bases by orders of magnitude.