On WordNet, Text Mining, and
Knowledge Bases of the Future
Peter Clark
Knowledge Systems
Boeing Phantom Works
On Machine Understanding
• Creating models from data…
“China launched a meteorological satellite into orbit
Wednesday, the first of five weather guardians to be sent
into the skies before 2008.”
• Suggests:
• there was a rocket launch
• China owns the satellite
• the satellite is for monitoring weather
• the orbit is around the Earth
• etc.
None of these are explicitly stated in the text
On Machine Understanding
• Understanding = creating a situation-specific model
(SSM), coherent with data & background knowledge
– Data suggests background knowledge which may be
appropriate
– Background knowledge suggests ways of interpreting data
[Diagram: fragmentary, ambiguous inputs → coherent model (situation-specific)]
On Machine Understanding
[Diagram: fragmentary, ambiguous inputs → (assembly of pieces, assessment of coherence, inference) → coherent model (situation-specific), drawing on World Knowledge: core theories of the world plus a ton of common-sense/episodic/experiential knowledge (“the way the world is”). The inputs are only a tiny part of the target model, contain errors and ambiguity, and are not even a subset of it.]
On Machine Understanding
• Conjectures about the nature of the beast:
– “Small” number of core theories
• space, time, movement, …
• can encode directly
– Large amount of “mundane” facts
• a dictionary contains many of these facts
• also: episodic/script-like knowledge needed
On Machine Understanding
• How to acquire this background knowledge?
– Manual encoding (e.g., Cyc, WordNet)
– NLP on a dictionary (e.g., MindNet)
– Community-wide acquisition (e.g., OpenMind)
– Knowledge mining from text (e.g., Schubert)
– Knowledge acquisition technology:
• graphical (e.g., Shaken)
• entry using “controlled” (simple) English
What We’re Trying To Do…
[Diagram: an English-based description of a scene (partial, ambiguous) is interpreted, against a knowledge base, into a coherent representation of the scene (elaborated, disambiguated), supporting question-answering, search, etc.]
Illustration: Caption-Based Video Retrieval
[Diagram: videos are paired with manually authored captions, e.g. “A lever is rotated to the unarmed position” and “A man pulls and closes an airplane door”. Caption text is interpreted into a semantic representation (e.g. Pull, with agent Man and object Door), then elaborated by inference and scene-building against world knowledge (e.g. Door is-part-of Airplane). A query such as Touch(Person, Door) is then matched against the elaborated scenes by search.]
Demo
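A minimal sketch of the matching step, assuming scenes and queries reduce to (head, relation, filler) triples; the tiny SUBSUMES table is a hypothetical stand-in for real taxonomy lookups such as WordNet hypernyms (all names here are illustrative, not the actual Boeing system):

# Illustrative sketch of query-against-scene matching, up to subsumption.
SUBSUMES = {
    ("Person", "Man"): True,   # a Man is a (kind of) Person
    ("Touch", "Pull"): True,   # pulling entails touching
}

def subsumes(general, specific):
    return general == specific or SUBSUMES.get((general, specific), False)

# Elaborated scene for "A man pulls and closes an airplane door"
scene = [("Pull", "agent", "Man"),
         ("Pull", "object", "Door"),
         ("Door", "is-part-of", "Airplane")]

def matches(query, scene):
    """True if some scene triple satisfies the query, up to subsumption."""
    qh, qrel, qf = query
    return any(qrel == rel and subsumes(qh, h) and subsumes(qf, f)
               for h, rel, f in scene)

# Query "did a Person touch a Door?" matches the Pull event above.
print(matches(("Touch", "agent", "Person"), scene))   # True
print(matches(("Touch", "object", "Door"), scene))    # True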
Some Example Inferences
“Someone broke the side mirrors of a truck”
⇒ the truck is damaged
…if only the system knew that…
IF X breaks Y
THEN(result) Y is damaged
IF X is-part-of Y
AND X is damaged
THEN Y is damaged
A mirror is part of a truck
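A toy forward-chaining sketch of how those three pieces of knowledge yield the conclusion (the tuple encoding is illustrative, not the KM representation the talk actually uses):

# Toy forward chaining over ground facts; illustrative only.
facts = {("breaks", "someone", "mirror"),
         ("part-of", "mirror", "truck")}

def infer(facts):
    """Run the two rules to a fixed point."""
    changed = True
    while changed:
        changed = False
        for fact in list(facts):
            derived = set()
            if fact[0] == "breaks":       # IF X breaks Y THEN Y is damaged
                derived.add(("damaged", fact[2]))
            if fact[0] == "damaged":      # IF X part-of Y AND X damaged THEN Y damaged
                for g in list(facts):
                    if g[0] == "part-of" and g[1] == fact[1]:
                        derived.add(("damaged", g[2]))
            if not derived <= facts:
                facts |= derived
                changed = True
    return facts

print(("damaged", "truck") in infer(facts))   # True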
Some Example Inferences
“A man cut his hand on a piece of metal”
⇒ the man is hurt
…if only the system knew that…
IF organism is cut
THEN(result) organism is hurt
IF X is-part-of Y
AND X is cut
THEN Y is cut
A hand is part of a person
(also: Metal can cut)
Some Example Inferences
“A man carries a box across the floor”
⇒ a person is walking
…if only the system knew that…
IF X is carrying something
THEN X is walking
IF X is a man
THEN X is a person
Some Example Inferences
“The car engine”
⇒ The engine which is part of the car
“The car fuel”
⇒ The fuel which is consumed by the car
“The car driver”
⇒ The person who is driving the car
…if only the system knew that…
Cars have engines
Cars consume fuel
People drive cars
A driver is a person who is driving
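A sketch of how such compound-noun readings could fall out of a small table of background relations; the table entries and templating are illustrative only:

# Hypothetical background facts: (head noun, modifier noun) -> relation.
BACKGROUND = {
    ("engine", "car"): "part of",       # cars have engines
    ("fuel",   "car"): "consumed by",   # cars consume fuel
    ("driver", "car"): "driving",       # people drive cars; a driver is driving
}

def interpret_compound(modifier, head):
    """'car engine' -> 'the engine which is part of the car', etc."""
    rel = BACKGROUND.get((head, modifier))
    return f"the {head} which is {rel} the {modifier}" if rel else "(no reading found)"

print(interpret_compound("car", "engine"))   # the engine which is part of the car
print(interpret_compound("car", "fuel"))     # the fuel which is consumed by the car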
Some Example Inferences
“The man writes with a pen”
⇒ Pen = pen_n1 (writing implement)
“The pig is in the pen”
⇒ Pen = pen_n2 (container (typically) for confining animals)
…if only the system knew that…
people write
writing is done with writing implements
a pen (n1) is a writing implement
a pig is an animal
animals are sometimes confined
a pen (n2) confines animals
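These two sentences are a classic word-sense test; NLTK’s simplified Lesk algorithm, which scores gloss overlap rather than chaining background knowledge as above, can be tried on them directly (it may or may not pick the intended senses):

# Requires: pip install nltk, then nltk.download('wordnet')
from nltk.wsd import lesk

sent1 = "The man writes with a pen".split()
sent2 = "The pig is in the pen".split()

# Lesk picks the synset whose gloss overlaps most with the context words -
# a much weaker signal than the knowledge chains sketched above.
print(lesk(sent1, "pen", "n"))
print(lesk(sent2, "pen", "n"))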
Some Example Inferences
“The blue car”
⇒ The color of the car is blue
…if only the system knew that…
physical objects can have colors
blue is a color
a car is a physical object
WordNet…
• psycholinguistically motivated lexical reference system
• core:
– synsets (“concepts”)
– hypernym links
• later added additional relationships:
– part-of
– substance-of (“pavement is substance of road”)
– causes (“revive causes come-to”)
– entails (“sneeze entails exhale”)
– antonyms (“appearance”/“disappearance”)
– possible-values (“disposition” = {“willing”,“unwilling”})
• Currently at version 2.0
– What will version 10.0 (say) look like?
– What should it look like?
– Is it/could it migrate towards more of a knowledge base?
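All of these links can be browsed programmatically, e.g. through NLTK’s WordNet interface (which ships WordNet 3.0, so results may differ slightly from the 2.0 examples above):

# Requires: pip install nltk, then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

print(wn.synsets('pen'))                        # word -> synsets ("concepts")
car = wn.synset('car.n.01')
print(car.hypernyms())                          # hypernym links
print(car.part_meronyms())                      # part-of links (doors, wheels, ...)
print(wn.synset('sneeze.v.01').entailments())   # entails (the talk cites sneeze -> exhale)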
Why Use WordNet
• It’s a comprehensive ontology
– (approx. 120,000 concepts)
• links between concepts (synsets) and lexical items (words)
• Simple structure, easy to use
• Rich population of hypernym links
Problems with WordNet
• Too fine-grained word senses
– e.g., “cut” has 41 senses (counted in the snippet after this slide), including
• cut (separate)
• cut grain
• cut timber
• cut my hair
• cut (knife cuts)
– linguistically, not representationally, motivated
• e.g., “cut grain” is a sense just because a word happens to exist for it (“harvest”)
– representationally, many senses share a core meaning
• but the commonality is not captured
• Missing concepts/senses that lack an English word (or were simply forgotten)
– e.g., goal-oriented entity (person, corporation, country)
– difference between physical and legal ownership
• Single inheritance (mainly)
– very different from Cyc, which uses multiple inheritance heavily
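The granularity claim is easy to check against NLTK’s WordNet 3.0 (the slide’s count of 41 is from 2.0):

from nltk.corpus import wordnet as wn

senses = wn.synsets('cut', pos=wn.VERB)
print(len(senses))               # 41 in the slide's WordNet 2.0; 3.0 is comparable
for s in senses[:5]:
    print(s.name(), '-', s.definition())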
Problems with WordNet
• “isa” (hypernym) hierarchy is broken in many places
– sometimes means “part of”
• e.g., Connecticut -> America
– mixes instance-of and subclass-of (see the snippet after this slide)
• e.g., Paris -> Capital_City -> City
– many links seem strange/questionable
• e.g., “launch” is a type of “propel”?
• again, is psychologically not representationally motivated
– has major implications for reasoning
• Semantics of relationships can be fuzzy/asymmetric
– “motor vehicle” has-part “engine” means…?
– “car” has-part “running-board”
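On the instance-of/subclass-of point: WordNet releases after this talk did add a separate instance-of link, which can be inspected via NLTK (the synset name assumes WordNet 3.0’s numbering):

from nltk.corpus import wordnet as wn

paris = wn.synset('paris.n.01')          # the French capital in WordNet 3.0
print(paris.hypernyms())                 # [] - no ordinary subclass-of parents
print(paris.instance_hypernyms())        # e.g. [Synset('national_capital.n.01')]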
Problems with WordNet
• Many relationships missing
– Simple
• verbs/nominalizations (e.g., “plan” (v) vs. “plan” (n))
• adverbs/adjectives (“rapidly” vs. “rapid”)
– Conceptual; many, in particular:
• causes
• material
• instrument
• content
• beneficiary
• recipient
• result
• destination
• shape
• location
What we’d like instead….
• Want a knowledge resource that can provide rich
expectations about the world
– to help interpret ambiguous input
– to infer additional facts beyond those in the input
– to create a coherent model from fragmented input
• It would have
– a small set of “core” theories about the world
• containers, transportation, movement, space
• probably hand-built
– many “mundane” facts which instantiate those
theories in various ways
What we’d like instead….
Core theories, e.g., Transportation:
OBJECTS can be at PLACES
VEHICLES can TRANSPORT OBJECTS from PLACE to PLACE
TRANSPORT requires the OBJECT to be IN the VEHICLE
BEFORE an OBJECT is TRANSPORTED by a VEHICLE
from PLACE1 to PLACE2, the OBJECT and VEHICLE are
at PLACE1
etc.
Basic facts/instantiations:
Cars are vehicles
Cars can transport people
Cars travel along roads
Ships can transport people or goods
Ships can transport over water between ports
Rockets can transport satellites into space
….
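A sketch of one way the core theory plus such instantiations could be operationalized, here as plain Python rather than the KM representation actually used:

# Illustrative only - not the talk's KM encoding of the Transportation theory.
class World:
    def __init__(self, at):
        self.at = dict(at)               # OBJECTS can be at PLACES

    def transport(self, vehicle, obj, place1, place2):
        # BEFORE transport, the OBJECT and VEHICLE are at PLACE1;
        # TRANSPORT requires the OBJECT to be IN the VEHICLE, so it moves too.
        assert self.at[vehicle] == place1 and self.at[obj] == place1
        self.at[vehicle] = self.at[obj] = place2

# Instantiation: "Rockets can transport satellites into space"
w = World({"rocket": "launch pad", "satellite": "launch pad"})
w.transport("rocket", "satellite", "launch pad", "orbit")
print(w.at["satellite"])                 # orbit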
Some Questions
• To what extent can a rulebase be simplified into a set
of database-like tables?
• How much of the table-like knowledge can we learn
automatically?
• How can we reason with “messy knowledge” that
such a database inevitably contains?
• How can we represent different views/perspectives?
• Why not use Cyc?
• How can we address WordNet’s deficiencies
efficiently?
The “Common Sense” KB
• An attempt to rapidly accumulate some core
knowledge and “routine” facts, to support
– specific applications
– research in how to work with all this knowledge
• Features:
– knowledge (mainly) entered in simple English
– interactively interpreted to KM (logic) structures
– using WordNet’s ontology + UT’s “slot” library
Why Simple Language-based Entry?
• Seems to be easier and faster than formal encoding
– but more restricted
• More comprehensible & accessible
• Viable (if a dictionary is a good model of scope…)
• Ontologically less committal (can reinterpret)
• Forces us to face some key issues
– ambiguity, conflict, “messy” knowledge
• Step towards more extensive language processing
• Costs: more infrastructure needed, limited expressivity,
still need to understand some KR
Demo
Or…
• Can at least some of this basic world knowledge
be acquired automatically? e.g.,
– Girju
– Etzioni
– Schubert
Knowledge Mining
Schubert’s Conjecture:
There is a largely untapped source of general knowledge in
texts, lying at a level beneath the explicit assertional
content, and which can be harnessed.
“The camouflaged helicopter landed near the embassy.”
⇒ helicopters can land
⇒ helicopters can be camouflaged
Our attempt: “lightweight” LFs generated from Reuters
LF forms: (S subject verb object (prep noun) (prep noun) …)
(NN noun … noun)
(AN adj noun)
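A rough modern approximation of such lightweight LF extraction, using spaCy dependency parses and a couple of patterns (this greatly simplifies Schubert’s actual pipeline):

import spacy   # pip install spacy; python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def lightweight_lfs(sentence):
    """Yield rough (S subject verb object (prep noun) ...) tuples."""
    for tok in nlp(sentence):
        if tok.pos_ == "VERB":
            subj = [c.lemma_ for c in tok.children if c.dep_ == "nsubj"]
            obj = [c.lemma_ for c in tok.children if c.dep_ == "dobj"]
            pps = [(p.text, [g.lemma_ for g in p.children if g.dep_ == "pobj"])
                   for p in tok.children if p.dep_ == "prep"]
            yield ("S", subj, tok.lemma_, obj, pps)

for lf in lightweight_lfs("The camouflaged helicopter landed near the embassy."):
    print(lf)   # ('S', ['helicopter'], 'land', [], [('near', ['embassy'])])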
Knowledge Mining
Newswire Article
HUTCHINSON SEES HIGHER
PAYOUT. HONG KONG. Mar 2.
Li said Hong Kong’s property
market remains strong while its
economy is performing better than
forecast. Hong Kong Electric
reorganized and will spin off its
non-electricity related activities.
Hongkong Electric shareholders
will receive one share in the new
subsidiary for every owned share
in the sold company. Li said the
decision to spin off …
Implicit, tacit knowledge
Shareholders may receive shares.
Shares may be owned.
Companies may be sold.
Knowledge Mining – our attempt
Fragment of the raw data (Reuters)
(S "history" "fall" ("out of" "view"))
(S "rate" "fall on" (NIL "tuesday") ("to" "percent"))
(S "index" "rally" "point" ("with" "volume") ("at" "share"))
(S "you" "have" "decline" ("in" "inflation") ("in" "rate"))
(S "you" "have" "decline" ("in" "figure") ("in" "rate"))
(S "you" "have" "decline" ("in" "lack") ("in" "rate"))
(S "Boni" "be wary")
(S "recovery" "be" "led")
(S "evidence" "patchy")
(S "expansion" "be worth")
(S "we" "be content")
(S "investment" "boost" "sale" ("in" "zone"))
(S "Eaton" "say" (S "investment" "boost" "sale" ("in" "zone")))
(S "it" "grab" "portion" ("away from" "rival"))
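Turning such tuples into generalizations is then largely counting and templating; a minimal sketch over (subject, verb) pairs hand-copied from the fragment above, which also exposes the relevance problem noted on the next slide:

from collections import Counter

# (subject, verb) pairs pulled by hand from the fragment above.
pairs = [("history", "fall"), ("rate", "fall on"), ("index", "rally"),
         ("you", "have"), ("you", "have"), ("you", "have"),
         ("investment", "boost"), ("it", "grab")]

# Frequency is the crudest possible relevance filter: generic subjects
# like "you" and "it" dominate, illustrating the ambiguity problem.
for (subj, verb), n in Counter(pairs).most_common(3):
    print(f"({subj!r}, {verb!r}) seen {n}x")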
Knowledge Mining…what next?
• What could we do with all this data?
– Use it to bias the parser
– Extra source of knowledge for the KB
• source of input sentences for our system?
• possibilities (“this deduction looks coherent”)
• But:
– Ambiguity makes it hard to use
• word senses, relationships
– No notion of “relevance”
– Many types of knowledge not mined
• e.g., rule-like, script-like
Summary
• Machine understanding = building a coherent model
• Requires lots of world knowledge
– core theories + lots of “mundane” facts
• WordNet –
– a potentially useful resource, but with many problems
– slowly and manually becoming more KB-like
• there’s a lot of potential to jump ahead with text mining methods
– e.g., Schubert’s approach
– KnowItAll
• We would like to use the results for reasoning!!