Word Sense Disambiguation
(Source Text: Ide and Véronis (1998), "Word Sense Disambiguation: The State of the Art")

The Problem:
Word Sense Disambiguation (WSD) is the process of determining which sense of a word is intended in a given usage context, when the word has more than one applicable definition.

Why do WSD? WSD is a required step in several other Natural Language Processing tasks, such as:
- Machine Translation: when a word maps to more than one word in the target language.
- Information Retrieval: to ensure that the results of a query are relevant to the intended sense of the query term(s).
- Content Analysis: e.g., analysing the distribution of concepts (rather than surface words) across a text.
- Grammatical Analysis: determining the phrase structure of a sentence may depend on understanding the meaning; a common example is ambiguous attachment of sub-phrases (e.g., prepositional phrases).
- Speech Processing: to distinguish between homophones (words that sound the same).
- Text Processing: e.g., to perform spelling correction.

Notes: Bar-Hillel's 1960 report argued that the lack of good WSD was a major barrier to machine translation; this critique, followed by the 1966 ALPAC report, contributed to reduced research activity in machine translation.

WSD 2-Step Process:
1. Determine the set of applicable senses of a word for a particular context (e.g., from dictionaries, thesauri, translation dictionaries).
2. Determine which sense is most appropriate, based either on the context of the word or on external knowledge sources.

Problem "Space": for words with the same spelling (homographs):
- Senses have differing parts of speech (see part-of-speech tagging research).
- Senses have the same part of speech.

Stages in WSD research:
- Early Machine Translation
- Artificial Intelligence (AI) Methods
  o Symbolic Methods
  o Connectionist Methods
- Knowledge-Based Methods
  o Machine-readable dictionaries
  o Thesauri
  o Computational Lexicons (enumerative lexicons, generative lexicons)
- Corpus-Based Methods
  o General empirical methods
  o Automatic sense-tagging
  o Overcoming data sparseness

1. Early Machine Translation
A) Approaches: Much of the foundation of WSD was laid in this period, but without large resources the ideas went untested.
- 1949, Weaver: proposed using a window of N words of text before and after the usage of the word being disambiguated (a minimal sketch of such a window appears at the end of this section).
  o (In his Memorandum) realised the relationship between domain specificity and reduced word sense ambiguity, which led to work on "micro-glossaries".
- 1950, Kaplan: experiments on the value of N, comparing N=2 with N=sentence length. Result: no difference.
  o Similar experiments (same results): 1956, Koutsoudas and Korfhage, on Russian; 1961, Masterman; 1961, Gougenheim and Michéa, on French; 1985, Choueka and Lusignan, on French.
- 1955, Reifler: "semantic coincidences", the relationship between syntactic structure and word sense.
B) Resources: the value of knowledge representation for words was realised.
- 1957, Masterman: uses Roget's Thesaurus for Latin-English translation, based on the most frequently referred-to thesaurus categories in a Latin sentence.
  o This early statistical approach was continued by other researchers.
C) Studies of the problem: measurements of the degree of polysemy.
- 1957, Harper:
  o In physics texts, 30% of words polysemous.
  o In scientific texts, 43% of words polysemous.
  o In a Russian-English dictionary, an average ratio of 8.6 English words per Russian word, of which 5.6 are quasi-synonyms; about ¼ of words polysemous in a computerised dictionary.
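The window-based idea above lends itself to a tiny illustration. The following Python snippet is my own minimal sketch (not from the source); the function name, whitespace tokenisation, and the example sentence are assumptions made purely for the example.

# A minimal sketch (not from the source) of the +/- N word context window that
# Weaver (1949) proposed and Kaplan (1950) tested experimentally.

def context_window(tokens, target_index, n):
    """Return the n tokens before and after the target word."""
    left = tokens[max(0, target_index - n):target_index]
    right = tokens[target_index + 1:target_index + 1 + n]
    return left + right

if __name__ == "__main__":
    sentence = "the pitcher poured water into the glass".split()
    # Disambiguate "pitcher" (container vs. baseball player) from 2 words either
    # side, the window size Kaplan found about as informative as the whole sentence.
    print(context_window(sentence, sentence.index("pitcher"), 2))
    # -> ['the', 'poured', 'water']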
2. AI Methods
Criticisms:
- Mostly operated only at the level of the sentence.
- All were toy systems, often trying to tackle highly ambiguous words with fine sense distinctions, and often tested on sentences unlikely to be found in the real world.
- Often relied on much hand-crafting and suffered from the "knowledge-acquisition bottleneck" (though many AI systems of the time suffered from this).

Symbolic Methods:
Semantic Networks (cf. connectionist models: spreading activation models):
- 1961, Masterman: defined 100 primitive concepts by which to organise a dictionary. This resulted in a semantic network where nodes = concepts and arcs = semantic relationships.
- 1961-1969, Quillian: worked on semantic networks. The path between two nodes (words) will usually involve only one sense of the intermediary nodes (words).
Networks and Frames:
- 1976, Hayes: case frames used with semantic networks (nodes = nouns, i.e. case frames; arcs = verbs). Able to handle homonyms, but not other polysemy.
- 1987, Hirst: uses a network of frames with marker passing (i.e. Quillian's approach). "Polaroid words": inappropriate senses are eliminated by syntactic evidence and semantic relations until one sense remains. Suffers with metaphorical uses of words, which can leave no senses remaining.
Case-Based Approaches:
- 1968-75, Wilks: "preference semantics", rules for selection restrictions based on semantic features (e.g. animate vs inanimate).
- 1979, Boguraev: preference semantics is insufficient for polysemous verbs; attempts to enhance it with case frames, mixing syntactic and semantic evidence for sense assignment.
Ontological Reasoning:
- 1988, Dahlgren: disambiguation handled by two approaches (each used about 50% of the time): fixed phrases and syntax, OR reasoning (including common-sense reasoning). The reasoning involves finding ontological parents, a precursor to Resnik (1993).

Connectionist Methods:
Spreading Activation Methods (Quillian's 1961 approach is a precursor to spreading activation models, but it was still symbolic, whereas neural networks are numeric):
- 1971, Meyer and Schvaneveldt: "semantic priming", i.e. we understand subsequent words based on what we have already heard.
- 1975, Collins and Loftus; 1983, Anderson: "spreading activation" models. Activation weakens as it spreads; multiple stimulations of a node reinforce its activation.
- 1981, McClelland and Rumelhart: add the notion of inhibitory activation; for example, nodes activating one word sense may inhibit competing word senses (a toy sketch of such a network appears at the end of this section).
- 1983, Cottrell and Small: use neural networks for work similar to Quillian's (a node is a concept).
- 1985, Waltz and Pollack: semantic "microfeatures", such as animate vs inanimate, are hand-coded in networks as context for disambiguation.
- 1987, Bookman: automatic priming of microfeature context from the preceding text, analogous to short-term memory.
- 1988, Kawamoto: distributed networks (i.e. a node is not a single concept). These require training, in contrast to "local" models, which are defined a priori.
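As a rough illustration of the spreading-activation ideas summarised above, here is a minimal Python sketch of my own (not any of the cited systems): activation spreads along weighted links with decay, and the two senses of an ambiguous word inhibit each other. The node names, weights, decay constant, and number of iterations are all invented for the example.

# A toy spreading-activation network in the spirit of Collins and Loftus /
# McClelland and Rumelhart as summarised above. Positive edge weights are
# excitatory, negative weights are inhibitory.

DECAY = 0.8  # activation weakens as it spreads / over time

# "bank_money" and "bank_river" are two senses competing for the word "bank".
edges = {
    ("money", "bank_money"): 0.7,
    ("river", "bank_river"): 0.7,
    ("bank_money", "bank_river"): -0.5,  # competing senses inhibit each other
    ("bank_river", "bank_money"): -0.5,
}

def spread(activation, steps=5):
    """Iteratively propagate activation along weighted edges with decay."""
    for _ in range(steps):
        incoming = {node: 0.0 for node in activation}
        for (src, dst), w in edges.items():
            incoming[dst] += w * activation[src]
        # New activation: decayed old value plus incoming signal, clamped at zero.
        activation = {
            node: max(0.0, DECAY * activation[node] + incoming[node])
            for node in activation
        }
    return activation

if __name__ == "__main__":
    # Context mentions "money", so the money sense of "bank" should win.
    start = {"money": 1.0, "river": 0.0, "bank_money": 0.0, "bank_river": 0.0}
    print(spread(start))

With "money" activated in the context, the "bank_money" sense node ends up with the highest activation, while the inhibitory link keeps the competing sense suppressed.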
3. Knowledge-Based Methods
This period of research arose with the availability of machine-readable resources. The divisions below into "Dictionaries", "Thesauri" and "Lexicons" are based on the method of data organisation:
- In a dictionary, the main entry is at the word level; the entry refers to the various senses of the word.
- In a thesaurus, the main entry is a cluster of related words.
- In a lexicon, the main entry is the word sense, which may correspond to various words. This makes the lexicon and the thesaurus quite similar structurally.

Criticisms: inferences based on machine-readable resources often suffer from three main problems:
- It is hard to obtain non-contentious definitions for words; in general, it is difficult for humans to agree on how to divide a word into senses.
- Thesauri and lexicons often organise concepts hierarchically, but the exact hierarchical organisation is also often debated.
- The path length between nodes of a lexicon or thesaurus is not, by itself, a meaningful measure of semantic distance.

Dictionaries (accuracy approx. 70%):
- 1980, Amsler and 1982, Michiels theses: used machine-readable dictionaries.
- 1986, Lesk: tried to build a knowledge base from a dictionary. Each word sense has a "signature": the bag of words used in the definition of that sense. The signature is compared with the bag of words in the context of the target word, and the sense with the greatest overlap is chosen (a toy sketch of this overlap idea appears at the end of these notes). This approach was the precursor to later statistical work, but was too dependent on the particular dictionary's definitions.
- 1990, Wilks: built a measure of relatedness between two words by comparing co-occurrences between the definitions of those words. This metric is used to compare the target word with the words in its surrounding context window.
- 1990, Véronis and Ide: built a network of words and word senses. A node is a word or a word sense; words are connected to their word senses, which are in turn connected to the words in their signatures.
- 1989-1993, various researchers: tried to use the extra fields of LDOCE, such as subject codes and box codes (box codes represent semantic primitives).

Thesauri:
- 1957, Masterman: uses Roget's Thesaurus for machine translation.
- 1985, Patrick: uses Roget's for verb sense disambiguation, examining the connectivity between the closest synonyms and the target word. These sense distinctions are narrow, as they are based on the "strongest" synonyms.
- 1992, Yarowsky: using Roget's categories as word classes (or senses), uses Grolier's Encyclopedia to find signatures: the bag of words most likely to occur for each word class. His sense distinctions are quite broad (3-way).

Lexicons:
- 1990, Miller et al.: WordNet, a hand-crafted lexicon, enumerative.
- 1990, Lenat and Guha: Cyc, a semi-hand-crafted lexicon (in principle), enumerative.
- 1991, Briscoe: ACQUILEX, enumerative.
- 1994, Grishman et al.: COMLEX, enumerative.
- 1995, Buitelaar: CORELEX, a generative lexicon.
- 1993, Sussna: uses path lengths to measure relatedness between two words. An overall relatedness score is computed between a candidate sense and the context words (or their senses); among competing senses, the one with the highest relatedness score is chosen as the disambiguated sense.
- 1995, Resnik: uses the information content of words (based on corpus frequencies) together with the WordNet ontology to measure relatedness between two words.
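As a closing illustration of the signature-overlap idea attributed to Lesk (1986) above, here is a minimal Python sketch of my own. The toy dictionary, the sense labels, and the whitespace tokenisation are invented stand-ins for a real machine-readable dictionary; a fuller implementation would also strip stopwords and inflection.

# A toy version of the signature-overlap idea: each sense's signature is the
# bag of words in its dictionary definition, and the sense whose signature
# overlaps most with the target word's context wins.

toy_dictionary = {
    "bank": {
        "bank_1": "sloping land beside a body of water such as a river",
        "bank_2": "financial institution that accepts deposits and lends money",
    }
}

def lesk_style(word, context):
    """Pick the sense whose definition shares the most words with the context."""
    context_bag = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, definition in toy_dictionary[word].items():
        overlap = len(set(definition.lower().split()) & context_bag)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense, best_overlap

if __name__ == "__main__":
    print(lesk_style("bank", "he sat on the bank of the river watching the water"))
    # -> ('bank_1', 3): the definition shares "of", "river" and "water" with the context.

The example picks the "river bank" sense only because its definition happens to share words with the context, which is exactly the dependence on the dictionary's particular wording that the notes above criticise.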