Extracting Knowledge-Bases from MachineReadable Dictionaries: Have We Wasted Our Time? Nancy Ide and Jean Veronis Proc KB&KB’93 Workshop, 1993, pp257-266 http://www.cs.vassar.edu/faculty/ide/pubs.html As (mis-)interpreted by Peter Clark The Postulates of MRD Work • P1: MRDs contain information that is useful for NLP e.g.: The Postulates of MRD Work • P1: MRDs contain information that is useful for NLP • P2: This info is relatively easy to extract from MRDs e.g., extraction of hypernyms (generalizations): Dipper isa Ladle isa Spoon isa Utensil But… • Not much to show for it so far (1993) – handful of limited and imperfect taxonomies – few studies on the quality of knowledge in MRDs – few studies on extracting more complex info Complaints… • P1: useful info in MRDs: – C1a: 50%-70% of info in dictionaries is “garbled” – C1b: sense definitions concept usage (“real concepts”) – C1c: some types of knowledge simply not there • P2: Info can be easily extracted • Most successes have been for hypernyms only – C2a: MRD formats are a nightmare to deal with – C2b: A virtually open-ended set of ways of describing facts – C2c: Bootstrapping: Need a KB to build a KB from a MRD C1a: MRD information is “garbled” • Multiple people, multiple years effort • Space restrictions, syntactic restrictions • Particular problem 1: – Attachment of terms too high (21%-34%) • e.g., “pan” and “bottle” are “vessels”, but “cup” and “bowl” are simply “containers” • occurs fairly randomly – Categories less clear at top levels • “fork” and “spoon” is ok, but “implement” and “utensil” = ? • Sometimes no word there to refer to a concept – leads to circular definitions C1a: MRD information is “garbled” C1a: MRD information is “garbled” • Particular problem 2: – Categories less clear at top levels • “fork” and “spoon” is ok, but “implement” and “utensil” = ? • Leads to disjuncts e.g. “implement or utensil” • Sometimes no word there to refer to a concept – leads to circular definitions – leads to “covert categories”, e.g., INSTRUMENTAL-OBJECT (a hypernym for “tool”, “utensil”, “instrument”, and “implement”) C1a: MRD information is “garbled” • Particular problem 3: – And hypernyms are relatively consistent!! Other semantic relations are given in a less consistent way, e.g., smell, taste, etc. C1b: sense definitions concept usage (“real concepts”) • Ambiguity of word senses, e.g., – 87% of words in a sample fit > 1 word sense • Word senses don’t reflect actual use • Word sense distinctions differ between MRDs – level of detail – way lines are drawn between senses – no definitive set of distinctions C1c: some types of knowledge simply not there • no broad contextual or world knowledge, e.g., – no connection between “lawn” and “house”, or between “ash” and “tobacco” – “restaurant, eating house, eating place -- (a building where people go to eat)” [WordNet] • No mention that it’s a commercial business, e.g., for “the waitress collected the check.” C2a: MRD formats are a nightmare to deal with • Ambiguities / inconsistencies in typesetter format • Complex grammars for entries • Conventions are inconsistent, e.g. bracketing for – “Canopic jar, urn, or vase” vs. – “Junggar Pendi, Dzungaria, or Zungaria” • Need a lot of hand pre-processing – not much general value to this – is a vast task in itself – not many processed dictionaries available C2b: A virtually open-ended set of ways of describing facts But… There is “virtually an open-ended set of phrases…” C2c: Bootstrapping: Need a KB to build a KB Need knowledge to do NLP on MRDs! – e.g. “carry by means of a handle” vs. “carry by means of a wagon” • But undisambiguated hierarchy is unusable, e.g., – “saucepan” isa “pan” isa “leaf” need to build your KB before you even start on the MRD Synthesis • Underlying postulate of P1 and P2: – P0: Large KBs cannot be built by hand • Counterexamples: – Cyc – Dictionaries themselves! • And besides… – KBs are too hard to extract from MRDs – don’t contain all the knowledge needed • But: MRD contributions: – understanding the structure of dictionaries – convergance of NLP, lexicography, and electronic publishing interests Ways forward… • Combining Knowledge Sources: – One dictionary has 55%-70% of “problematic cases” [of incompleteness], but 5 dictionaries reduced this to 5% • Also should combine knowledge from corpora as a means of “filling out” KBs • Prediction: – KBs built by people, using corpora and text extraction technology tools, and combined together by hand (Schubert-style; Code4; Ikarus) Ways forward… • MRDs will become encoded more consistently • Better analysis needed of the types of knowledge needed for NLP – perhaps don’t need the kind of precision in a KB • Exploitation of associational information – Very useful for sense disambiguation (e.g., Harabagiu) Ways forward… • Lexicographers increasingly interested in using lexical databases for their work • Could create a NLP-like KB directly – Create explicit semantic links between word entries – Ensure consistency of content (e.g., using templates/frames ensures all the important information is provided) Ways forward… Ways forward… • Lexicographers increasingly interested in using lexical databases for their work • Could create a NLP-like KB directly – Create explicit semantic links between word entries – Ensure consistency of content (e.g., using templates/frames ensures all the important information is provided) – Ensure consistency of “metatext” (i.e., be consistent about how semantic relations are stated) – Ensure consistency of sense division • e.g., “cup” and “bowl” have two senses (literal and metonymic) but “glass” only has one (literal) could spot this inconsistency