Word Sense Disambiguation

(Source Text: Ide and Veronis (1998), “Word Sense Disambiguation: The State of the Art”)
The Problem:
Word Sense Disambiguation (WSD) is the process of determining the meaning of a word:
 in a given usage context
 selecting a single sense unambiguously when the word has several applicable definitions.
Why do WSD?
WSD is a required step in several other Natural Language Processing tasks such as:
 Machine Translation: if a word maps to more than one other word in the target
language
 Information Retrieval: To ensure that the results of a query are relevant to the
intended sense of the query term(s)
 Content Analysis: e.g., determining the distribution of themes or concepts in a text
 Grammatical Analysis: determining the phrase structure of a sentence may depend on understanding word meaning. A common example is ambiguous attachment of sub-phrases (e.g., prepositional phrases).
 Speech Processing: to distinguish between homophones, words that sound the same but differ in meaning.
 Text Processing: e.g., to perform spelling correction
Notes:
Bar-Hillel argued in his 1960 survey that the lack of good WSD was a major barrier to machine translation. This view, together with the 1966 ALPAC report, led to a sharp reduction in machine translation research.
WSD 2-Step Process:
1. Determine the set of applicable senses of a word for a particular context
 e.g., using dictionaries, thesauri, translation dictionaries
2. Determine which sense is most appropriate
 based on the context of use and/or external knowledge sources
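A minimal sketch of this two-step process, assuming Python with NLTK and its WordNet data installed (nltk.download('wordnet')). WordNet stands in for the sense inventory of step 1; the most-frequent-sense choice in step 2 is only an illustrative placeholder, not a method described in these notes.

```python
# Two-step WSD sketch. Assumes NLTK with the WordNet corpus downloaded.
from nltk.corpus import wordnet as wn

def disambiguate(word, pos=None):
    # Step 1: determine the set of applicable senses (here, WordNet synsets)
    senses = wn.synsets(word, pos=pos)
    for s in senses:
        print(s.name(), "-", s.definition())
    # Step 2: choose the most appropriate sense. As a placeholder we take the
    # first synset, which WordNet orders by frequency (most-frequent-sense
    # baseline); a real system would use the context or external knowledge.
    return senses[0] if senses else None

best = disambiguate("bank", pos=wn.NOUN)
print("chosen:", best)
```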
Problem “Space”:
For words with the same spelling (homographs), either:
 the senses have differing parts of speech
 see part-of-speech tagging research
 or the senses share the same part of speech (the core WSD problem)
Stages in WSD research:
 Early Machine Translation
 Artificial Intelligence (AI) methods
o Symbolic Methods
o Connectionist Methods
 Knowledge-Based Methods
o Machine-readable dictionaries
o Thesauri
o Computational Lexicons
 Enumerative lexicons
 Generative lexicons
 Corpus-Based Methods
o General empirical methods
o Automatic Sense-tagging
o Overcoming data sparseness
1. Early Machine Translation
A) Approaches:
Much of the foundation of WSD was laid in this period, but, lacking large-scale resources, most ideas went untested.
 1949, Weaver: proposed a window of N words of text before and after the word being disambiguated (a small context-window sketch appears after this list)
o in his Memorandum, realised the relationship between domain specificity and reduced word sense ambiguity
 this led to work on “micro-glossaries”
 1950, Kaplan: experiments on the value of N, comparing N=2 with N=sentence length. Result: no significant difference
o similar experiments (same results):
 1956, Koutsoudas and Korfhage, on Russian
 1961, Masterman
 1961, Gougenheim and Michéa, on French
 1985, Choueka and Lusignan, on French
 1955, Reifler: “Semantic Coincidences”, the relationship between syntactic structure and word sense
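The context-window idea from Weaver and Kaplan can be pictured in a few lines of Python; a rough sketch, where the whitespace tokenisation and the window size N=2 are illustrative choices only.

```python
# Extract the +/-N word window around a target token (Weaver/Kaplan style).
def context_window(tokens, target_index, n=2):
    """Return the n tokens before and after the target position."""
    left = tokens[max(0, target_index - n):target_index]
    right = tokens[target_index + 1:target_index + 1 + n]
    return left + right

tokens = "the pitcher threw the ball to first base".split()
print(context_window(tokens, tokens.index("ball"), n=2))
# -> ['threw', 'the', 'to', 'first']
```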
B) Resources:
Knowledge Representation of words was realised:
 1957, Masterman, uses Roget’s Thesaurus to determine Latin-English translation
based on most frequently referred to thesaurus categories in a Latin sentence.
o This early statistical approach was continued by other researchers.
C) Studies of the problem:
Measurements of the degree of polysemy:
 1957, Harper:
o in physics texts, 30% of words were polysemous
o in scientific texts, 43% polysemous
o in a Russian-English dictionary, an average of 8.6 English words per Russian word, of which 5.6 are quasi-synonyms
o ¼ of words polysemous in a computerised dictionary
2. AI Methods
Criticisms: Almost all worked only at the level of a single sentence. All were toy systems, in that they often tackled highly ambiguous words with fine sense distinctions, and were tested on sentences unlikely to be found in real text. They relied on extensive hand-crafting and suffered from the “knowledge-acquisition bottleneck” (though many AI systems of the time suffered from this).
Symbolic Methods:
Semantic Networks: (cf. connectionist models: spreading activation models)
 1961, Masterman: Defined 100 primitive concepts by which to organise a
dictionary. This resulted in a semantic network where nodes = concepts, and arcs
= semantic relationships.
 1961-1969, Quillian, worked on semantic networks. The path between two nodes
(words), will usually only involve one sense of intermediary nodes (words).
Networks and Frames:
 1976, Hayes: case frames used with semantic networks (nodes = nouns, i.e. case frames; arcs = verbs). Able to handle homonyms, but not other kinds of polysemy.
 1987, Hirst: uses a network of frames with marker passing (i.e. Quillian’s approach). “Polaroid words”: inappropriate senses are eliminated by syntactic evidence and semantic relations until one sense remains. Fails on metaphorical uses of words, where no senses remain.
Case-based Approaches:
 1968-75, Wilks: “Preference semantics” rules for selection restrictions based on
semantic features (eg. Animate vs inanimate).
 1979, Boguraev: Preference semantics insufficient for polysemous verbs.
Attempts to enhance it with case frames. Mixes syntactic and semantic evidence
for sense assignment.
Ontological Reasoning:
 1988, Dahlgren: disambiguation handled by two mechanisms, each used about 50% of the time: fixed phrases and syntax, or reasoning (including common-sense reasoning). Reasoning involves finding ontological parents, a precursor to Resnik (1993).
Connectionist Methods:
Spreading Activation Methods:
 (1961, Quillian’s approach is a precursor to spreading activation models; however, Quillian’s approach was still symbolic, whereas neural networks are numeric.)
 1971, Meyer and Schvaneveldt: “Semantic Priming” – we understand subsequent words based on what we have already heard.
 1975, Collins and Loftus; 1983, Anderson: “Spreading Activation” models. Activation weakens as it spreads; multiple stimulations of a node reinforce its activation.
 1981, McClelland and Rumelhart: add the notion of inhibitory activation. For example, nodes activating one word sense may inhibit competing word senses (see the toy sketch after this list).
 1983, Cottrell and Small: use neural networks for work similar to Quillian’s (a node is a concept).
 1985, Waltz and Pollack: semantic “micro-features”, like animate vs. inanimate, are hand-coded in networks as context for disambiguation.
 1987, Bookman: automatic priming of microfeature context from the preceding text, analogous to short-term memory.
 1988, Kawamoto: distributed networks (i.e. a node is not a single concept). These require training, in contrast to “local” models, which are defined a priori.
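The following is a toy sketch of spreading activation with inhibitory links between competing senses, in the spirit of the models above. The two senses of "bank", the link weights, the decay factor and the inhibition strength are all invented for illustration; none of them come from the systems described in these notes.

```python
# Toy spreading-activation network with mutual inhibition between senses.
import collections

# excitatory links: an input word stimulates the senses it supports
excite = {
    "money": {"bank/finance": 1.0},
    "river": {"bank/shore": 1.0},
    "bank":  {"bank/finance": 0.5, "bank/shore": 0.5},
}
# inhibitory links: competing senses of the same word suppress each other
inhibit = {"bank/finance": ["bank/shore"], "bank/shore": ["bank/finance"]}

def most_active_sense(context_words, steps=3, decay=0.5, inhibition=0.3):
    act = collections.defaultdict(float)
    for w in context_words:                       # initial stimulation
        for sense, weight in excite.get(w, {}).items():
            act[sense] += weight
    for _ in range(steps):                        # decay plus inhibition
        new = collections.defaultdict(float, act)
        for sense, a in act.items():
            for rival in inhibit.get(sense, []):
                new[rival] -= inhibition * a
        act = {s: max(0.0, decay * a) for s, a in new.items()}
    return max(act, key=act.get)

print(most_active_sense(["money", "bank"]))   # -> bank/finance
print(most_active_sense(["river", "bank"]))   # -> bank/shore
```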
3. Knowledge-Based Methods:
This period of research arose with the availability of machine-readable resources. The following divisions into “dictionaries”, “thesauri” and “lexicons” are based on the method of data organisation. In a dictionary, the main entry is at the word level, and the entry refers to the various senses of the word. In a thesaurus, the main entry is a cluster of related words. In a lexicon, the main entry is the word sense, which may correspond to several words. This makes lexicons and thesauri structurally quite similar.
Criticisms:
Inferences based on machine-readable resources often suffer from three main problems. First, it is hard to obtain non-contentious sense definitions for words: in general, it is difficult for humans to agree on how the senses of a word should be divided. Second, thesauri and lexicons often organise concepts hierarchically, but the exact hierarchical organisation is also often debated. Finally, the path length between nodes of a lexicon or thesaurus is not, by itself, a reliable measure of semantic distance.
Dictionaries (accuracy approx 70%)
 1980, Amsler and 1982, Michiels: theses that used machine-readable dictionaries.
 1986, Lesk: tried to build a knowledge base from a dictionary. Each word sense corresponds to a “signature”: the bag of words used in the definition of that sense. This bag of words is compared with the context of the target word (a toy sketch of this overlap idea appears after this list). The approach was the precursor to later statistical work, but was too dependent on the particular dictionary’s definitions.
 1990, Wilks: built a measure of relatedness between two words by comparing co-occurrences between the definitions of those words. This metric is used to compare the target word with the words in its surrounding context window.
 1990, Veronis and Ide: Built a symbolic network out of words and word senses.
A node was a word or a word sense. Words are connected to their word senses
which are in turn connected to the words in their signatures.

 1989-1993, various researchers: tried to use the extra fields of LDOCE, such as subject codes and box codes (box codes represent semantic primitives).
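A toy sketch of Lesk's signature-overlap idea, using NLTK's WordNet glosses as a stand-in for the dictionary definitions he worked with (requires nltk.download('wordnet')); the stop-word list and the plain bag-of-words overlap score are simplifications, not a faithful reimplementation.

```python
# Simplified Lesk: pick the sense whose definition overlaps the context most.
from nltk.corpus import wordnet as wn

STOP = {"the", "a", "an", "of", "in", "on", "at", "to", "and", "or", "my"}

def signature(sense):
    """Bag of content words from a sense's definition and example sentences."""
    words = sense.definition().split()
    for example in sense.examples():
        words += example.split()
    return {w.lower().strip(".,") for w in words} - STOP

def simple_lesk(word, context_sentence):
    context = {w.lower().strip(".,") for w in context_sentence.split()} - STOP
    senses = wn.synsets(word)
    if not senses:
        return None
    return max(senses, key=lambda s: len(signature(s) & context))

sense = simple_lesk("bank", "I deposited my pay cheque at the bank")
print(sense, "-", sense.definition() if sense else "no senses found")
```

For comparison, NLTK also ships a ready-made variant of this idea as nltk.wsd.lesk.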
Thesauri
 1957, Masterman: uses Roget’s Thesaurus for machine translation
 1985, Patrick: uses Roget’s for verb sense disambiguation, examining the connectivity between the closest synonyms and the target word. These sense distinctions are narrow, as they are based on the “strongest” synonyms.
 1992, Yarowsky: using Roget’s categories as word classes (or senses), uses Grolier’s Encyclopedia to find signatures: a bag of words most likely to occur for each word class. His sense distinctions are quite broad (3-way).
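A toy sketch of the class-signature idea: words seen near the members of a word class form the class's signature, and a new context is assigned to the class whose signature fits it best. The two invented "crane" classes, the miniature training texts, the stop-word list and the add-one smoothing are all illustrative stand-ins for Yarowsky's Roget classes and Grolier's Encyclopedia data.

```python
# Class-signature WSD sketch: score a context against each class's word counts.
import math
from collections import Counter

STOP = {"the", "a", "an", "by", "on", "in", "with", "and", "of"}

training = {
    "ANIMAL":  ["crane ate fish by the water",
                "large bird with long legs and neck"],
    "MACHINE": ["crane lifted the steel beam",
                "heavy equipment on the construction site"],
}

def content(text):
    return [w for w in text.lower().split() if w not in STOP]

# signature = counts of content words observed with each class
signatures = {cls: Counter(w for line in lines for w in content(line))
              for cls, lines in training.items()}

def classify(context):
    best, best_score = None, float("-inf")
    for cls, sig in signatures.items():
        denom = sum(sig.values()) + len(sig)          # add-one smoothing
        score = sum(math.log((sig[w] + 1) / denom) for w in content(context))
        if score > best_score:
            best, best_score = cls, score
    return best

print(classify("the crane stood in the water"))      # -> ANIMAL
print(classify("the crane moved the beam on site"))  # -> MACHINE
```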
Lexicons
 1990, Miller et al: WordNet, a hand-crafted lexicon, enumerated
 1990, Lenat and Guha: CyC, a semi-hand-crafted lexicon (in principle),
enumerated
 1991, Briscoe: ACQUILEX, enumerated
 1994, Grishman et al.: COMLEX, enumerated.
 1995, Buitelaar: CORELEX, generative lexicon
 1993, Sussna: uses path lengths to measure the relatedness between two words. An overall relatedness score is computed between each candidate sense and the context words (or their senses); the competing sense with the highest relatedness score is chosen as the disambiguated sense.
 1995, Resnik: uses the information content of words (based on corpus frequencies) and the WordNet ontology to measure relatedness between two words.
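Both relatedness measures can be tried out through NLTK's WordNet interface, used here as a stand-in for the lexicons the original systems worked with; path_similarity is a simple path-length measure in the spirit of Sussna (without his edge weighting), and res_similarity is Resnik's information-content measure. Requires nltk.download('wordnet') and nltk.download('wordnet_ic').

```python
# Path-length vs. information-content relatedness over WordNet (via NLTK).
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")   # corpus counts for Resnik's measure

dog = wn.synset("dog.n.01")
cat = wn.synset("cat.n.01")
car = wn.synset("car.n.01")

# path-based relatedness: shorter path in the hierarchy -> higher score
print(dog.path_similarity(cat), dog.path_similarity(car))

# Resnik relatedness: information content of the most specific common ancestor
print(dog.res_similarity(cat, brown_ic), dog.res_similarity(car, brown_ic))
```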