csa5011_wsd

Corpora and Statistical Methods Albert Gatt Word Sense Disambiguation What are word senses?  Cognitive definition:  mental representation of meaning  used in psychological experiments  relies on introspection (notoriously deceptive)  Dictionary-based definition:  adopt sense definitions in a dictionary  most frequently used resource is WordNet WordNet  Taxonomic representation of words (“concepts”)  Each word belongs to a synset, which contains near- synonyms  Each synset has a gloss  Words with multiple senses (polysemy) belong to multiple synsets  Synsets organised by hyponymy (IS-A) relations  Also, other lexical relations, depending on category How many senses?  Example: interest  pay 3% interest on a loan  showed interest in something  purchased an interest in a company.  the national interest…  have X’s best interest at heart  have an interest in word senses  The economy is run by business interests Wordnet entry for interest (noun) 1. a sense of concern with and curiosity about someone or something … 2. 3. 4. 5. 6. 7. (Synonym: involvement) the power of attracting or holding one’s interest… (Synonym: interestingness) a reason for wanting something done ( Synonym: sake) a fixed charge for borrowing money… a diversion that occupies one’s time and thoughts … (Synonym: pastime) a right or legal share of something; a financial involvement with something (Synonym: stake) (usually plural) a social group whose members control some field of activity and who have common aims, (Synonym: interest group) Some issues  Are all these really distinct senses? Is WordNet too fine- grained?  Would native speakers distinguish all these as different?  Cf. The distinction between sense ambiguity and underspecification (vagueness):  one could argue that there are fewer senses, but these are underspecified out of context Translation equivalence  Many WSD applications rely on translation equivalence  Given: parallel corpus (e.g. English-German)  if word w in English has n translations in German, then each translation represents a sense  e.g. German translations of interest:  Zins: financial charge (WN sense 4)  Anteil: stake in a company (WN sense 6)  Interesse: all other senses Some terminology  WSD Task: given an ambiguous word, find the intended sense in context  Sense tagging: task of labelling words as belonging to one sense or another.  needs some a priori characterisation of senses of each relevant word  Discrimination:  distinguishes between occurrences of words based on senses  not necessarily explicit labelling Some more terminology  Two types of WSD task:  Lexical sample task: focuses on disambiguating a small set of target words, using an inventory of the senses of those words.  All-words task: focuses on entire texts and a lexicon, where every word in the text has to be disambiguated  Serious data sparseness problems! Approaches to WSD  All methods rely on training data. Basic idea:  Given word w in context c  learn how to predict sense s of w based on various features of w  Supervised learning: training data is labelled with correct senses  can do sense tagging  Unsupervised learning: training data is unlabelled  but many other knowledge sources used  cannot do sense tagging, since this requires a priori senses Supervised learning  Words in training data labelled with their senses  She pays 3% interest/INTEREST-MONEY on the loan.  He showed a lot of interest/INTEREST-CURIOSITY in the painting.  Similar to POS tagging  given a corpus tagged with senses  define features that indicate one sense over another  learn a model that predicts the correct sense given the features Features (e.g. plant)  Neighbouring words:  plant life  manufacturing plant  assembly plant  plant closure  plant species  Content words in a larger window  animal  equipment  employee  automatic Other features  Syntactically related words  e.g. object, subject….  Topic of the text  is it about SPORT? POLITICS?  Part-of-speech tag, surrounding part-of-speech tags Some principles proposed (Yarowsky 1995)  One sense per discourse:  typically, all occurrences of a word will have the same sense in the same stretch of discourse (e.g. same document)  One sense per collocation:  nearby words provide clues as to the sense, depending on the distance and syntactic relationship  e.g. plant life: all (?) occurrences of plant+life will indicate the botanic sense of plant Training data  SENSEVAL  Shared Task competition  datasets available for WSD, among other things  annotated corpora in many languages  (NB: SENSEVAL now merged with the broader SEMEVAL tasks)  Pseudo-words  create training corpus by artificially conflating words  e.g. all occurrences of man and hammer with man-hammer  easy way to create training data  Multi-lingual parallel corpora  translated texts aligned at the sentence level  translation indicates sense SemCor corpus  Corpus consisting of files from the Brown Corpus (1m words)  SemCor contains 352 files, 360k words total:  186 files contain sense tags for all content words  The remainder only have sense tags for verbs  Originally created using the WordNet 1.6 sense inventory; since updated to the more recent versions of WordNet SemCor Example (= Brown file a01) <s snum=1> [...] <wf cmd=ignore pos=DT>an</wf> <wf cmd=done pos=NN lemma=investigation wnsn=1 lexsn=1:09:00::>investigation</wf> <wf cmd=ignore pos=IN>of</wf> <wf cmd=done pos=NN lemma=atlanta wnsn=1 lexsn=1:15:00::>Atlanta</wf> <wf cmd=ignore pos=POS>'s</wf> <wf cmd=done pos=JJ lemma=recent wnsn=2 lexsn=5:00:00:past:00>recent</wf> <wf cmd=done pos=NN lemma=primary_election wnsn=1 lexsn=1:04:00::>primary_election</wf> [...] </s> SemCor example <wf cmd=done pos=NN lemma=investigation wnsn=1 lexsn=1:09:00::>investigation</wf> Senses in WordNet:     S: (n) probe, investigation (an inquiry into unfamiliar or questionable activities) "there was a congressional probe into the scandal" S: (n) investigation, investigating (the work of inquiring into something thoroughly and systematically) This occurrence involves the first sense Data representation  Example sentence: An electric guitar and bass player stand off to one side...  Target word: bass  Possible senses: fish, musical instrument...  Relevant features are represented as vectors, e.g.: wi2 , POSi2 , wi1, POSi1, wi1, POSi1, wi2 , POSi2  guitar, NN,and, CC, player, NN,stand, VB Part 1 Supervised methods I: Naive Bayes Naïve Bayes Classifier  Identify the features (F)  e.g. surrounding words  other cues apart from surrounding context  Combine evidence from all features  Decision rule: decide on sense s’ iff sk , sk  s': P(s' | F )  P(sk | F )  Example: drug. F = words in context  medication sense: price, prescription, pharmaceutical  illegal substance sense: alcohol, illicit, paraphernalia Naive Bayes Classifier  Problem: We usually don’t know the probability of a sense given the features: P(sk|F)  In our corpus, we have access to P(F|sk)  We can compute P(sk|F) from P(F|sk)  For this we use Bayes’ theorem Deriving Bayes’ rule from the multiplication rule  Recall that: P( A  B)  P( A) P( B | A) Deriving Bayes’ rule from the multiplication rule  Recall that: P( A  B)  P( A) P( B | A)  Given symmetry of intersection, multiplication rule can be written in two ways: P( A  B)  P( A) P( B | A) P( A  B)  P( B) P( A | B) Deriving Bayes’ rule from the multiplication rule  Recall that: P( A  B)  P( A) P( B | A)  Given symmetry of intersection, multiplication rule can be written in two ways: P( A  B)  P( A) P( B | A) P( A  B)  P( B) P( A | B)  Bayes’ rule involves the substitution of one equation into the other, to replace P(A and B) P( B) P( A | B) P( B | A)  P( A) Deriving P(A)  Often, it’s not clear where P(A) should come from  we start out from conditional probabilities!  Given that we have two sets of outcomes of interest, A and B, P(A) can be derived from the following observation: A  ( A  B)  ( A  B)  i.e. The events in A are made up of those which are only in A (but not in B) and those which are in both A and B. Finding P(A) -- I B A A B A B P( A  B)  P( B) P( A | B) P( A  B)  P( B) P( A | B) P(A) must either be in one or the other (or both), since A is composed of these two sets. Finding P(A) -- II Step 1: Applying the addition rule: P( A)  P( A  B)  P( A  B)  P( B) P( A | B)  P( B) P( A | B) Step 2: Substituting into Bayes’ equation to replace P(A): P( B | A)  P( B) P( A | B) P( B) P( A | B)  P( B) P( A | B) Using Bayes’ rule fro WSD  We usually don’t know P(sk|F) but we can compute from training data: P(sk) (the prior) and P(F|sk) P ( sk | f )  P ( f | sk ) P ( sk ) P( f )  P(f) can be eliminated because it is constant for all senses in the corpus sbest  arg max sk P ( sk | f )  arg max sk P ( f | sk ) P ( sk ) P( f )  arg max sk P ( f | sk ) P ( sk ) The independence assumption  It’s called “naïve” because: n P ( f | sk )   P ( f j | sk ) j 1  i.e. all features are assumed to be independent  Obviously, this is often not true.  e.g. finding illicit in the context of drug may not be independent of finding pusher.  cf. our discussion of collocations!  Also, topics often constrain word choice. Training the naive Bayes classifier  We need to compute:  P(s) for all senses s of w P ( sk )  Count ( sk , w) Count ( w)  P(fj|s) for all features fj P ( f j | sk )  Count ( f j , sk ) Count ( sk ) Part 2 Supervised methods II: Information-theoretic and collocation-based approaches Information-theoretic measures  Find the single, most informative feature to predict a sense.  E.g. using a parallel corpus:  prendre (FR) can translate as take or make  prendre une décision: make a decision  prendre une mesure: take a measure [to…]  Informative feature in this case: direct object  mesure indicates take  décision indicates make  Problem: need to identify the correct value of the feature that indicates a specific sense. Brown et al’s algorithm 1. Given: translations T of word w 2. Given: values X of a useful feature (e.g. mesure, décision as values of DO) 3. Step 1: random partition P of T 4. While improving, do:  create partition Q of X that maximises I(P;Q)  find a partition P of T that maximises I(P;Q) comment: relies on mutual info to find clusters of translations mapping to clusters of feature values Using dictionaries and thesauri  Lesk (1986): one of the first to exploit dictionary definitions   the definition corresponding to a sense can contain words which are good indicators for that sense Method: 1. 2. 3. 4. Given: ambiguous word w with senses s1…sn with glosses g1…gn. Given: the word w in context c compute overlap between c & each gloss select the maximally matching sense Expanding a dictionary  Problem with Lesk:  often dictionary definitions don’t contain sufficient information  not all words in dictionary definitions are good informants  Solution: use a thesaurus with subject/topic categories  e.g. Roget’s thesaurus  http://www.gutenberg.org/cache/epub/22/pg22.html.utf8 Using topic categories  Suppose every sense sk of word w has subject/topic tk  w can be disambiguated by identifying the words related to tk in the thesaurus  Problems:  general-purpose thesauri don’t list domain-specific topics  several potentially useful words can be left out  e.g. … Navratilova plays great tennis …  proper name here useful as indicator of topic SPORT Expanding a thesaurus: Yarowsky 1992 1. Given: context c and topic t 2. For all contexts and topics, compute p(c|t) using Naïve Bayes  by comparing words pertaining to t in the thesaurus with words in c  if p(c|t) > α, then assign topic t to context c 3. For all words in the vocabulary, update the list of contexts in which the word occurs.  Assign topic t to each word in c 4. Finally, compute p(w|t) for all w in the vocabulary  this gives the “strength of association” of w with t Yarowsky 1992: some results SENSE star space object celebrity shape sentence punishment set of words ROGET TOPICS ACCURACY UNIVERSE ENTERTAINER INSIGNIA 96% 95% 82.% LEGAL_ACTION 99% GRAMMAR 98% Bootstrapping  Yarowsky (1995) suggested the one sense per discourse/collocation constraints.  Yarowsky’s method: 1. 2.  select the strongest collocational feature in a specific context disambiguate based only on this feature (similar to the information-theoretic method discussed earlier) One sense per collocation 1. For each sense s of w, initialise F, the collocations found in the dictionary definition for s 2. One sense per collocation:   identify the set of contexts containing collocates of s for each sense s of w, update F to contain those collocates such that, for for all s’ ≠ s P( s | f )  P ( s '| f ) (where alpha is a threshold) One sense per discourse 3. For each document:    find the majority sense of w out of those found in previous step assign all occurrences of w the majority sense This is implemented as a post-processing step. Reduces error rate by ca. 27%. Part 3 Unsupervised disambiguation Preliminaries  Recall: unsupervised learning can do sense discrimination not tagging  akin to clustering occurrences with the same sense  e.g. Brown et al 1991: cluster translations of a word  this is akin to clustering senses Brown et al’s method  Preliminary categorisation: 1. Set P(w|s) randomly for all words w and senses s of w. 2. Compute, for each context c of w the probability P(c|s) that the context was generated by sense s.  Use (1) and (2) as a preliminary estimate. Re-estimate iteratively to find best fit to the corpus. Characteristics of unsupervised disambiguation  Can adapt easily to new domains, not covered by a dictionary or pre-labelled corpus  Very useful for information retrieval  If there are many senses (e.g. 20 senses for word w), the algorithm will split contexts into fine-grained sets  NB: can go awry with infrequent senses Part 4 Some issues with WSD The task definition  The WSD task traditionally assumes that a word has one and only one sense in a context.  Is this true?  Is it possible to see evidence of co-activation (one word displaying more than one sense)?  this would bring competition to the licensed trade  competition = “act of competing”  competition = “people/organisations who are competing” Systematic polysemy  Not all senses are so easy to distinguish. E.g. competition in the “agent competing” vs “act of competing” sense.  The polysemy here is systematic  Compare bank/bank where the senses are utterly distinct (and most linguists wouldn’t consider this a case of polysemy, but homonymy)  Can translation equivalence help here?  depends if polysemy is systematic in all languages Logical metonymy  Metonymy = usage of a word to stand for something else  e.g. the pen is mightier than the sword  pen = the press  Logical metonymy arises due to systematic polysemy  good cook vs. good book  enjoy the paper vs enjoy the cake  Should WSD distinguish these? How could they do this? Which words/usages count?  Many proper names are identical to common nouns (cf. Brown, Bush,…)  This presents a WSD algorithm with systematic ambiguity and reduces performance.  Also, names are good indicators of senses of neighbouring words.  But this requires a priori categorisation of names.  Brown’s green stance vs. the cook’s green curry Useful links  WordNet: http://wordnet.princeton.edu/  SemCor: http://www.cse.unt.edu/~rada/downloads.html#semcor  Senseval: http://www.senseval.org/

csa5011_wsd

Related documents

Products

Support

csa5011_wsd

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib