Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?

Nancy Ide and Jean Veronis
Proc. KB&KB’93 Workshop, 1993, pp. 257-266
http://www.cs.vassar.edu/faculty/ide/pubs.html
As (mis-)interpreted by Peter Clark
The Postulates of MRD Work
• P1: MRDs contain information that is useful for NLP
• P2: This info is relatively easy to extract from MRDs
e.g., extraction of hypernyms (generalizations):
 Dipper isa Ladle isa Spoon isa Utensil
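As a concrete illustration (not from the paper), here is a minimal sketch of genus-term extraction: take the head noun of the definition’s first noun phrase as the hypernym, then chain the links upward. The toy definitions and the crude modifier stoplist are assumptions for illustration; real systems used full parsers and many patterns.

```python
# Minimal sketch of genus-term ("isa") extraction from definition text.
# The toy definitions and the modifier stoplist are illustrative assumptions;
# real MRD work used full parsing and many extraction patterns.
DEFS = {
    "dipper": "a ladle with a long handle, used for dipping liquids",
    "ladle": "a large spoon with a cup-shaped bowl, for serving liquids",
    "spoon": "a utensil with a small shallow bowl on a handle",
}

MODIFIERS = {"large", "small", "long", "shallow"}  # toy adjective list

def hypernym(word):
    """Naive genus term: first word after the article that isn't a known modifier."""
    tokens = DEFS[word].split()
    for tok in tokens[1:]:  # skip the leading article ("a"/"an"/"the")
        tok = tok.strip(",").lower()
        if tok not in MODIFIERS:
            return tok
    return None

# Follow the genus links upward to build a taxonomy chain.
word, chain = "dipper", ["dipper"]
while word in DEFS:
    nxt = hypernym(word)
    if nxt is None or nxt in chain:  # stop on dead ends and cycles
        break
    chain.append(nxt)
    word = nxt

print(" isa ".join(chain))  # dipper isa ladle isa spoon isa utensil
```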
But…
• Not much to show for it so far (1993)
– handful of limited and imperfect taxonomies
– few studies on the quality of knowledge in MRDs
– few studies on extracting more complex info
Complaints…
• P1: useful info in MRDs:
– C1a: 50%-70% of info in dictionaries is “garbled”
– C1b: sense definitions ≠ concept usage (“real concepts”)
– C1c: some types of knowledge simply not there
• P2: Info can be easily extracted
• Most successes have been for hypernyms only
– C2a: MRD formats are a nightmare to deal with
– C2b: A virtually open-ended set of ways of describing facts
– C2c: Bootstrapping: Need a KB to build a KB from a MRD
C1a: MRD information is “garbled”
• Multiple people, multiple years’ effort
• Space restrictions, syntactic restrictions
• Particular problem 1:
– Attachment of terms too high (21%-34%)
• e.g., “pan” and “bottle” are “vessels”, but “cup” and “bowl” are simply “containers”
• occurs fairly randomly
C1a: MRD information is “garbled”
• Particular problem 2:
– Categories less clear at top levels
• “fork” and “spoon” are ok, but “implement” and “utensil” = ?
• Leads to disjuncts, e.g., “implement or utensil”
• Sometimes no word there to refer to a concept
– leads to circular definitions
– leads to “covert categories”, e.g., INSTRUMENTAL-OBJECT (a hypernym for “tool”, “utensil”, “instrument”, and “implement”)
C1a: MRD information is “garbled”
• Particular problem 3:
– And hypernyms are relatively consistent!! Other semantic relations are given in a less consistent way, e.g., smell, taste, etc.
C1b: sense definitions ≠ concept usage (“real concepts”)
• Ambiguity of word senses, e.g.,
– 87% of words in a sample fit > 1 word sense
• Word senses don’t reflect actual use
• Word sense distinctions differ between MRDs
– level of detail
– way lines are drawn between senses
– no definitive set of distinctions
C1c: some types of knowledge simply not there
• no broad contextual or world knowledge, e.g.,
– no connection between “lawn” and “house”, or between “ash” and “tobacco”
– “restaurant, eating house, eating place -- (a building where people go to eat)” [WordNet]
• No mention that it’s a commercial business, e.g., for “the waitress collected the check.”
C2a: MRD formats are a nightmare to deal with
• Ambiguities / inconsistencies in typesetter format
• Complex grammars for entries
• Conventions are inconsistent, e.g. bracketing for
– “Canopic jar, urn, or vase” vs.
– “Junggar Pendi, Dzungaria, or Zungaria”
• Need a lot of hand pre-processing
– not much general value to this
– is a vast task in itself
– not many processed dictionaries available
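To make the format problem concrete, a toy illustration; the font codes ({hw}, {it}, {rm}) and the entries are invented for this sketch, not taken from any real typesetter tape. A field-splitter tuned to one entry’s conventions fails on a near-identical entry that italicizes a different field.

```python
import re

# Two renderings of the same entry in invented typesetter codes ({hw} = headword
# font, {it} = italic, {rm} = roman). The second follows a slightly different
# convention: the part of speech is italicized instead of the sense number.
entries = [
    "{hw}pan{rm} n. {it}1{rm} a broad shallow vessel used in cooking",
    "{hw}pan{rm} {it}n.{rm} 1 a broad shallow vessel used in cooking",
]

# A field-splitter tuned to the first convention...
FIELD = re.compile(
    r"\{hw\}(?P<headword>[^{]+)\{rm\}\s+"
    r"(?P<pos>\w+\.)\s+"
    r"\{it\}(?P<sense>\d+)\{rm\}\s+"
    r"(?P<gloss>.+)"
)

for entry in entries:
    m = FIELD.match(entry)
    # ...parses the first entry, but fails on the second
    print(m.groupdict() if m else f"parse failed: {entry!r}")
```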
C2b: A virtually open-ended set of ways of describing facts
• But… there is “virtually an open-ended set of phrases…”
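A minimal sketch of what pattern-based extraction over definition text looks like, and why the phrase set never closes; the trigger phrases, relation names, and example definitions here are illustrative assumptions, not the paper’s.

```python
# Toy pattern table mapping defining phrases to semantic relations.
PATTERNS = {
    "used for": "FUNCTION",
    "part of": "PART-OF",
    "a kind of": "ISA",
    "by means of": "INSTRUMENT",
}

def relations(definition):
    """Return every (relation, trigger-phrase) whose trigger appears verbatim."""
    return [(rel, phrase) for phrase, rel in PATTERNS.items() if phrase in definition]

print(relations("a tool used for digging"))             # [('FUNCTION', 'used for')]
print(relations("a tool with which one digs"))          # [] -- same fact, new phrasing
print(relations("a tool that serves to break up soil")) # [] -- and another
```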
C2c: Bootstrapping: Need a KB to build a KB
• Need knowledge to do NLP on MRDs!
– e.g. “carry by means of a handle” vs. “carry by means of a wagon”
• But undisambiguated hierarchy is unusable, e.g.,
– “saucepan” isa “pan” isa “leaf”
⇒ need to build your KB before you even start on the MRD
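A minimal sketch of the problem, assuming toy hypernym links keyed by word string rather than by word sense; the entries are invented (the “leaf” link would come from the betel-leaf sense of “pan”). Picking the right sense for each link is itself a disambiguation task needing world knowledge: hence the circularity.

```python
# Toy hypernym table keyed by word string rather than by word sense. The
# "pan" -> "leaf" link would come from the betel-leaf sense of "pan" (paan),
# not the cooking-vessel sense that "saucepan" actually refers to.
HYPERNYM = {
    "saucepan": "pan",  # from "a deep pan with a handle and a lid"
    "pan": "leaf",      # from "a betel leaf ..." -- the wrong sense for saucepans
}

def chain(word):
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return " isa ".join(path)

print(chain("saucepan"))  # saucepan isa pan isa leaf -- nonsense without sense tags
# Choosing the correct sense for each link requires exactly the kind of
# knowledge base we are trying to build: "need a KB to build a KB".
```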
Synthesis
• Underlying postulate of P1 and P2:
– P0: Large KBs cannot be built by hand
• Counterexamples:
– Cyc
– Dictionaries themselves!
• And besides…
– KBs are too hard to extract from MRDs
– MRDs don’t contain all the knowledge needed anyway
• But: MRD contributions:
– understanding the structure of dictionaries
– convergence of NLP, lexicography, and electronic publishing interests
Ways forward…
• Combining Knowledge Sources:
– One dictionary has 55%-70% of “problematic cases” [of incompleteness], but 5 dictionaries reduced this to 5% (see the sketch after this list)
• Also should combine knowledge from corpora as a means of “filling out” KBs
• Prediction:
– KBs built by people, using corpora and text-extraction tools, and combined together by hand (Schubert-style; Code4; Ikarus)
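A minimal sketch of one way to combine sources, assuming a simple majority vote over the hypernyms proposed by several dictionaries, with the union filling coverage gaps; the dictionary contents and the voting rule are illustrative, not the paper’s method.

```python
from collections import Counter

# Hypernyms proposed by three (invented) dictionaries for the same words.
dicts = [
    {"cup": "container", "pan": "vessel"},
    {"cup": "vessel", "pan": "vessel"},
    {"cup": "vessel"},  # third source doesn't cover "pan" at all
]

def combined_hypernym(word):
    """Majority vote over all sources that cover the word; None if none do."""
    votes = Counter(d[word] for d in dicts if word in d)
    return votes.most_common(1)[0][0] if votes else None

print(combined_hypernym("cup"))  # vessel (2 of 3 sources agree)
print(combined_hypernym("pan"))  # vessel (the union fills the coverage gap)
```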
Ways forward…
• MRDs will become encoded more consistently
• Better analysis needed of the types of knowledge needed for NLP
– perhaps NLP doesn’t need the kind of precision a KB provides
• Exploitation of associational information
– Very useful for sense disambiguation (e.g., Harabagiu)
Ways forward…
• Lexicographers increasingly interested in using lexical databases for their work
• Could create a NLP-like KB directly
– Create explicit semantic links between word entries
– Ensure consistency of content (e.g., using templates/frames ensures all the important information is provided)
– Ensure consistency of “metatext” (i.e., be consistent about how semantic relations are stated)
– Ensure consistency of sense division
• e.g., “cup” and “bowl” have two senses (literal and metonymic) but “glass” only has one (literal) ⇒ could spot this inconsistency
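A minimal sketch of how such a consistency check could work, assuming sense entries are already tagged with types like literal/metonymic (that tagging is itself an assumption): collect the sense types attested across a word class and flag members missing one.

```python
# Toy sense inventories, tagged with sense types. "glass" lacks the metonymic
# sense ("drank a glass") that its classmates have.
SENSES = {
    "cup": {"literal", "metonymic"},
    "bowl": {"literal", "metonymic"},
    "glass": {"literal"},
}

# Expect every member of the class to carry every sense type attested anywhere.
expected = set().union(*SENSES.values())

for word, senses in SENSES.items():
    missing = expected - senses
    if missing:
        print(f"{word}: possibly missing {sorted(missing)} sense(s)")
# -> glass: possibly missing ['metonymic'] sense(s)
```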