On WordNet, Text Mining, and Knowledge Bases of the Future
Peter Clark
Knowledge Systems, Boeing Phantom Works

On Machine Understanding
• Creating models from data…
  "China launched a meteorological satellite into orbit Wednesday, the first of five weather guardians to be sent into the skies before 2008."
• Suggests:
  – there was a rocket launch
  – China owns the satellite
  – the satellite is for monitoring weather
  – the orbit is around the Earth
  – etc.
• None of these are explicitly stated in the text

On Machine Understanding
• Understanding = creating a situation-specific model (SSM), coherent with data & background knowledge
  – Data suggests background knowledge which may be appropriate
  – Background knowledge suggests ways of interpreting data
[Diagram: fragmentary, ambiguous inputs → coherent model (situation-specific)]

On Machine Understanding
[Diagram: World Knowledge — core theories of the world plus a ton of common-sense/episodic/experiential knowledge ("the way the world is") — drives the step from fragmentary, ambiguous inputs to a coherent (situation-specific) model, via assembly of pieces, assessment of coherence, and inference. The inputs are only a tiny part of the target model, contain errors and ambiguity, and are not even a subset of the target model.]

On Machine Understanding
[Same World Knowledge diagram]
• Conjectures about the nature of the beast:
  – "Small" number of core theories
    • space, time, movement, …
    • can encode directly
  – Large amount of "mundane" facts
    • a dictionary contains many of these facts
    • also: episodic/script-like knowledge needed

On Machine Understanding
[Same World Knowledge diagram]
• How to acquire this background knowledge?
  – Manual encoding (e.g., Cyc, WordNet)
  – NLP on a dictionary (e.g., MindNet)
  – Community-wide acquisition (e.g., OpenMind)
  – Knowledge mining from text (e.g., Schubert)
  – Knowledge acquisition technology:
    • graphical (e.g., Shaken)
    • entry using "controlled" (simple) English

What We're Trying To Do…
[Diagram: an English-based description of a scene (partial, ambiguous) is combined with a knowledge base to produce a coherent representation of the scene (elaborated, disambiguated), which feeds question-answering, search, etc. A toy sketch of this elaboration step follows.]
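The elaboration step above can be made concrete with a small forward-chaining loop. This is only an illustrative sketch, not Clark's actual system: the triple format, the matches/bind helpers, and all rule content are invented here to mirror the satellite example from the opening slide.

# Toy sketch (not the system described in the talk): elaborate fragmentary
# input triples into a richer situation-specific model by forward-chaining
# hand-written background rules to a fixpoint.
facts = {
    ("china", "launch", "satellite"),
    ("satellite", "type", "meteorological"),
}

# Background "world knowledge" as (condition-pattern, conclusion-templates).
# Variables start with "?".  All rule content here is illustrative.
rules = [
    (("?x", "launch", "satellite"),
     [("?x", "own", "satellite"),
      ("satellite", "in", "orbit"),
      ("orbit", "around", "earth")]),
    (("satellite", "type", "meteorological"),
     [("satellite", "purpose", "monitor-weather")]),
]

def matches(pattern, fact):
    return all(p == f or p.startswith("?") for p, f in zip(pattern, fact))

def bind(pattern, fact, template):
    env = {p: f for p, f in zip(pattern, fact) if p.startswith("?")}
    return tuple(env.get(t, t) for t in template)

changed = True
while changed:                      # naive forward chaining to a fixpoint
    changed = False
    for cond, conclusions in rules:
        for fact in list(facts):
            if matches(cond, fact):
                for c in conclusions:
                    new = bind(cond, fact, c)
                    if new not in facts:
                        facts.add(new)
                        changed = True

for f in sorted(facts):
    print(f)   # prints the inferred "suggested" facts alongside the stated ones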
Illustration: Caption-Based Video Retrieval
[Diagram: video clips with manually authored captions, e.g., "A man pulls and closes an airplane door", "A lever is rotated to the unarmed position". Caption text is interpreted (Pull: agent Man, object Door; Door is-part-of Airplane), then elaborated against world knowledge (inference, scene-building). A query such as Touch(Person, Door) is then matched against the elaborated scenes by search.]

Demo

Some Example Inferences
"Someone broke the side mirrors of a truck"
  → the truck is damaged
…if only the system knew that…
  IF X breaks Y THEN (result) Y is damaged
  IF X is-part-of Y AND X is damaged THEN Y is damaged
  A mirror is part of a truck

Some Example Inferences
"A man cut his hand on a piece of metal"
  → the man is hurt
…if only the system knew that…
  IF an organism is cut THEN (result) the organism is hurt
  IF X is-part-of Y AND X is cut THEN Y is cut
  A hand is part of a person
  (also: metal can cut)

Some Example Inferences
"A man carries a box across the floor"
  → a person is walking
…if only the system knew that…
  IF X is carrying something THEN X is walking
  IF X is a man THEN X is a person

Some Example Inferences
"The car engine" → the engine that is part of the car
"The car fuel" → the fuel that is consumed by the car
"The car driver" → the person who is driving the car
…if only the system knew that…
  Cars have engines
  Cars consume fuel
  People drive cars
  A driver is a person who is driving

Some Example Inferences
"The man writes with a pen" → pen = pen_n1 (writing implement)
"The pig is in the pen" → pen = pen_n2 (container, typically for confining animals)
…if only the system knew that…
  people write
  writing is done with writing implements
  a pen (n1) is a writing implement
  a pig is an animal
  animals are sometimes confined
  a pen (n2) confines animals

Some Example Inferences
"The blue car" → the color of the car is blue
…if only the system knew that…
  physical objects can have colors
  blue is a color
  a car is a physical object

WordNet…
• a psycholinguistically motivated lexical reference system
• core:
  – synsets ("concepts")
  – hypernym links
• later added additional relationships:
  – part-of
  – substance-of ("pavement is substance of road")
  – causes ("revive causes come-to")
  – entails ("sneeze entails exhale")
  – antonyms ("appearance"/"disappearance")
  – possible-values ("disposition" = {"willing", "unwilling"})
• Currently at version 2.0
  – What will version 10.0 (say) look like?
  – What should it look like?
  – Is it / could it migrate towards more of a knowledge base?
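The part-whole rules in the inference examples above line up directly with WordNet's meronym links. Below is a minimal sketch, assuming NLTK with the WordNet corpus downloaded; it is not the talk's system, and the specific synset names are assumptions worth checking with wn.synsets() first.

# Minimal sketch, assuming NLTK + the WordNet corpus (nltk.download('wordnet')).
# Not the talk's system: it only shows WordNet supplying the "X is part of Y"
# fact needed by the rule  IF X is-part-of Y AND X is damaged THEN Y is damaged.
from nltk.corpus import wordnet as wn

def is_part_of(part, whole):
    """True if `part` is listed among `whole`'s direct part meronyms."""
    return part in whole.part_meronyms()

# Assumed synset names -- verify with wn.synsets('mirror'), wn.synsets('car').
mirror = wn.synset('car_mirror.n.01')
car = wn.synset('car.n.01')

damaged = {mirror}                      # "someone broke the side mirrors"
if is_part_of(mirror, car):
    damaged.add(car)                    # propagate damage from part to whole

print([s.name() for s in damaged])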
Why Use WordNet?
• It's a comprehensive ontology (approx. 120,000 concepts)
• links between concepts (synsets) and lexical items (words)
• Simple structure, easy to use
• Rich population of hypernym links

Problems with WordNet
• Word senses are too fine-grained
  – e.g., "cut" has 41 senses, including:
    • cut (separate)
    • cut grain
    • cut timber
    • cut my hair
    • cut (a knife cuts)
  – linguistically, not representationally, motivated
    • e.g., "cut grain" is a sense just because a word happens to exist for it ("harvest")
  – representationally, many senses share a core meaning, but the commonality is not captured
• Missing concepts/senses: those without an English word (plus some simply forgotten)
  – e.g., goal-oriented entity (person, corporation, country)
  – the difference between physical and legal ownership
• (Mainly) single inheritance
  – very different from Cyc, which uses multiple inheritance a lot

Problems with WordNet
• The "isa" (hypernym) hierarchy is broken in many places
  – sometimes means "part of" (e.g., Connecticut → America)
  – mixes instance-of and subclass-of (e.g., Paris → Capital_City → City)
  – many links seem strange/questionable (e.g., "launch" is a type of "propel"?)
  – again, psychologically rather than representationally motivated
  – this has major implications for reasoning
• Semantics of relationships can be fuzzy/asymmetric
  – "motor vehicle" has-part "engine" means…?
  – "car" has-part "running-board"

Problems with WordNet
• Many relationships are missing
  – Simple:
    • verbs/nominalizations (e.g., "plan" (v) vs. "plan" (n))
    • adverbs/adjectives ("rapidly" vs. "rapid")
  – Conceptual; many, in particular: causes, material, instrument, content, beneficiary, recipient, result, destination, shape, location

What we'd like instead…
• A knowledge resource that can provide rich expectations about the world
  – to help interpret ambiguous input
  – to infer additional facts beyond those in the input
  – to create a coherent model from fragmented input
• It would have
  – a small set of "core" theories about the world
    • containers, transportation, movement, space
    • probably hand-built
  – many "mundane" facts which instantiate those theories in various ways

What we'd like instead…
Core theories, e.g., Transportation:
  OBJECTS can be at PLACES
  VEHICLES can TRANSPORT OBJECTS from PLACE to PLACE
  TRANSPORT requires the OBJECT to be IN the VEHICLE
  BEFORE an OBJECT is TRANSPORTED by a VEHICLE from PLACE1 to PLACE2, the OBJECT and VEHICLE are at PLACE1
  etc.
Basic facts/instantiations:
  Cars are vehicles
  Cars can transport people
  Cars travel along roads
  Ships can transport people or goods
  Ships can transport over water between ports
  Rockets can transport satellites into space
  …
(A toy sketch of this "core theory + table-like facts" split appears after the next two slides.)

Some Questions
• To what extent can a rulebase be simplified into a set of database-like tables?
• How much of the table-like knowledge can we learn automatically?
• How can we reason with the "messy" knowledge that such a database inevitably contains?
• How can we represent different views/perspectives?
• Why not use Cyc?
• How can we address WordNet's deficiencies efficiently?

The "Common Sense" KB
• An attempt to rapidly accumulate some core knowledge and "routine" facts, to support
  – specific applications
  – research into how to work with all this knowledge
• Features:
  – knowledge (mainly) entered in simple English
  – interactively interpreted into KM (logic) structures
  – using WordNet's ontology + UT's "slot" library
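Picking up the forward reference above: here is a toy rendering of the "core theory + table-like facts" idea from the Transportation slide, which also touches the first of the Some Questions. Everything here — the table layout, function names, and state representation — is invented for illustration; the talk's actual encoding is KM, not Python.

# Toy sketch (assumed structures, not the talk's KM encoding): a hand-built
# "core theory" of transportation over a database-like table of mundane facts.

# Mundane facts/instantiations, flattened into a table-like structure.
VEHICLES = {
    "car":    {"carries": {"person"},          "medium": "road"},
    "ship":   {"carries": {"person", "goods"}, "medium": "water"},
    "rocket": {"carries": {"satellite"},       "medium": "space"},
}

def can_transport(vehicle, obj):
    """Table lookup: VEHICLES can TRANSPORT the OBJECTS they carry."""
    return obj in VEHICLES.get(vehicle, {}).get("carries", set())

def transport(state, vehicle, obj, place1, place2):
    """Core-theory axioms: before TRANSPORT, the OBJECT and VEHICLE are at
    PLACE1 (and the object must be carryable); afterwards both are at PLACE2."""
    assert can_transport(vehicle, obj), f"{vehicle} cannot transport {obj}"
    assert state[obj] == place1 and state[vehicle] == place1
    state[obj] = state[vehicle] = place2

state = {"satellite": "launch-pad", "rocket": "launch-pad"}
transport(state, "rocket", "satellite", "launch-pad", "orbit")
print(state)   # {'satellite': 'orbit', 'rocket': 'orbit'}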
Why Simple Language-based Entry?
• Seems to be easier and faster than formal encoding
  – but more restricted
• More comprehensible & accessible
• Viable (if a dictionary is a good model of scope…)
• Ontologically less committal (can reinterpret)
• Forces us to face some key issues
  – ambiguity, conflict, "messy" knowledge
• A step towards more extensive language processing
• Costs: more infrastructure needed, limited expressivity, still need to understand some KR

Demo

Or…
• Can (at least some of) this basic world knowledge be acquired automatically? e.g.,
  – Girju
  – Etzioni
  – Schubert

Knowledge Mining
Schubert's Conjecture: there is a largely untapped source of general knowledge in texts, lying at a level beneath the explicit assertional content, which can be harnessed.
  "The camouflaged helicopter landed near the embassy."
  → helicopters can land
  → helicopters can be camouflaged
Our attempt: "lightweight" LFs generated from Reuters. LF forms:
  (S subject verb object (prep noun) (prep noun) …)
  (NN noun … noun)
  (AN adj noun)

Knowledge Mining
Newswire article:
  HUTCHINSON SEES HIGHER PAYOUT. HONG KONG, Mar 2. Li said Hong Kong's property market remains strong while its economy is performing better than forecast. Hong Kong Electric reorganized and will spin off its non-electricity related activities. Hongkong Electric shareholders will receive one share in the new subsidiary for every owned share in the sold company. Li said the decision to spin off …
Implicit, tacit knowledge:
  Shareholders may receive shares.
  Shares may be owned.
  Companies may be sold.

Knowledge Mining – our attempt
Fragment of the raw data (Reuters):
  (S "history" "fall" ("out of" "view"))
  (S "rate" "fall on" (NIL "tuesday") ("to" "percent"))
  (S "index" "rally" "point" ("with" "volume") ("at" "share"))
  (S "you" "have" "decline" ("in" "inflation") ("in" "rate"))
  (S "you" "have" "decline" ("in" "figure") ("in" "rate"))
  (S "you" "have" "decline" ("in" "lack") ("in" "rate"))
  (S "Boni" "be wary")
  (S "recovery" "be" "led")
  (S "evidence" "patchy")
  (S "expansion" "be worth")
  (S "we" "be content")
  (S "investment" "boost" "sale" ("in" "zone"))
  (S "Eaton" "say" (S "investment" "boost" "sale" ("in" "zone")))
  (S "it" "grab" "portion" ("away from" "rival"))

Knowledge Mining…what next?
• What could we do with all this data?
  – Use it to bias the parser
  – An extra source of knowledge for the KB
    • a source of input sentences for our system?
    • possibilities ("this deduction looks coherent")
• But:
  – Ambiguity makes it hard to use (word senses, relationships)
  – No notion of "relevance"
  – Many types of knowledge are not mined (e.g., rule-like, script-like)

Summary
• Machine understanding = building a coherent model
• Requires lots of world knowledge
  – core theories + lots of "mundane" facts
• WordNet
  – a potentially useful resource, but with many problems
  – slowly and manually becoming more KB-like
• There's a lot of potential to jump ahead with text-mining methods
  – e.g., Schubert's approach
  – KnowItAll
• We would like to use the results for reasoning!!
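A closing illustration of the Knowledge Mining slides: Schubert-style propositions ("Shareholders may receive shares") can be generated from the lightweight LF tuples by a very shallow rendering step. The sketch below is an assumption-laden toy (LFs encoded as Python tuples, naive article-based English), not the talk's pipeline; it also shows why the output needs filtering.

# Toy sketch (not the talk's pipeline): render lightweight-LF tuples from the
# "raw data" slide as Schubert-style "may"/"can be" propositions.
# LF shapes: (S subj verb [obj] (prep noun)...), (AN adj noun), (NN noun... noun)

def gloss(lf):
    kind = lf[0]
    if kind == "S" and len(lf) >= 3:
        subj, verb = lf[1], lf[2]
        # A direct object is a bare string; prepositional adjuncts are tuples.
        obj = lf[3] if len(lf) > 3 and isinstance(lf[3], str) else None
        return f"A {subj} may {verb}" + (f" a {obj}" if obj else "") + "."
    if kind == "AN":
        return f"A {lf[2]} can be {lf[1]}."
    if kind == "NN":
        return f"A {lf[-1]} can involve a {lf[1]}."
    return None

for lf in [("S", "shareholder", "receive", "share"),
           ("S", "history", "fall", ("out of", "view")),
           ("AN", "camouflaged", "helicopter")]:
    print(gloss(lf))
# -> A shareholder may receive a share.
# -> A history may fall.        (noise: mined output needs filtering)
# -> A helicopter can be camouflaged.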