Word Sense Disambiguation
CS 4705

Overview
• A problem for semantic attachment approaches: what happens when a given lexeme has multiple 'meanings'?
  – Flies [V] vs. Flies [N]
  – He robbed the bank. vs. He sat on the bank.
• How do we determine the correct 'sense' of the word?
• Machine Learning
  – Supervised methods
  – Evaluation
  – Lightly supervised and unsupervised methods
    • Bootstrapping
    • Dictionary-based techniques
    • Selection restrictions
    • Clustering

Supervised WSD
• Approaches:
  – Tag a corpus with the correct senses of particular words (lexical sample) or of all words (all-words task)
    • E.g. the SENSEVAL corpora
  – Lexical sample:
    • Extract features which might predict word sense
      – POS? Word identity? Punctuation after? Previous word? Its POS?
    • Use a Machine Learning algorithm to produce a classifier which can predict the senses of one word or many
  – All-words:
    • Use a semantic concordance: each open-class word is labeled with a sense from a dictionary or thesaurus
      – E.g. SemCor (Brown Corpus), tagged with WordNet senses

What Features Are Useful?
• "Words are known by the company they keep"
  – How much 'company' do we need to look at?
  – What do we need to know about the 'friends'?
    • POS, lemmas/stems, syntactic categories, …
• Collocations: words that frequently appear with the target, identified from large corpora
  – federal government, honor code, baked potato
  – Position is key
• Bag-of-words: words that appear somewhere in a context window
  – I want to play a musical instrument so I chose the bass.
  – Ordering/proximity not critical
• Punctuation, capitalization, formatting

Rule Induction Learners and WSD
• Given feature vectors of values for the independent variables associated with each observation in the training set
• Top-down greedy search driven by information gain: how much will the entropy of the (remaining) data be reduced if we split on this feature?
• Produce the set of rules that performs best on the training data, e.g.
  – bank2 if w-1 == 'river' & pos == NP & src == 'Fishing News' …
  – …
• Results are easy to understand, but many passes are needed to reach each decision, and the method is susceptible to over-fitting

Naïve Bayes
• ŝ = argmax_{s∈S} p(s|V) = argmax_{s∈S} p(V|s) p(s) / p(V)
• where s is one of the senses in the set S possible for a word w, and V is the input vector of feature values for w
• Assume the features are independent, so the probability of V is the product of the probabilities of each feature given s:
  p(V|s) ≈ ∏_{j=1..n} p(v_j|s)
• p(V) is the same for every candidate sense, so
  ŝ = argmax_{s∈S} p(s) ∏_{j=1..n} p(v_j|s)
• How do we estimate p(s) and p(v_j|s)?
  – p(s_i) is the maximum likelihood estimate from a sense-tagged corpus (count(s_i, w_j) / count(w_j)): how likely is bank to mean 'financial institution' over all instances of bank?
  – p(v_j|s) is the maximum likelihood estimate of each feature given a candidate sense (count(v_j, s) / count(s)): how likely is the previous word to be 'river' when the sense of bank is 'financial institution'?
• Calculate ŝ = argmax_{s∈S} p(s) ∏_{j=1..n} p(v_j|s) for each possible sense and take the highest-scoring sense as the most likely choice
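A minimal sketch of the Naïve Bayes sense classifier described above, in Python. The feature representation and the add-one smoothing are assumptions for illustration; the slides use raw maximum likelihood estimates:

```python
from collections import Counter, defaultdict
import math

def train_nb(examples):
    """Train on (feature_dict, sense) pairs for one target word.
    Returns sense counts, per-sense feature counts, and the set of
    (feature, value) pairs seen anywhere (used for smoothing)."""
    priors = Counter(sense for _, sense in examples)
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, sense in examples:
        for fv in feats.items():            # fv = (feature_name, value)
            feat_counts[sense][fv] += 1
            vocab.add(fv)
    return priors, feat_counts, vocab

def classify_nb(feats, priors, feat_counts, vocab):
    """ŝ = argmax_s p(s) * prod_j p(v_j|s), computed in log space."""
    total = sum(priors.values())
    best, best_score = None, float("-inf")
    for sense, count in priors.items():
        score = math.log(count / total)     # log p(s)
        denom = sum(feat_counts[sense].values()) + len(vocab)
        for fv in feats.items():
            # add-one smoothing (an assumption; unsmoothed MLE would
            # zero out any sense missing a feature-value pair)
            score += math.log((feat_counts[sense][fv] + 1) / denom)
        if score > best_score:
            best, best_score = sense, score
    return best

# Toy usage with hypothetical features for 'bank':
examples = [({"w-1": "river", "pos": "NN"}, "bank/shore"),
            ({"w-1": "the", "pos": "NN"}, "bank/finance")]
priors, feat_counts, vocab = train_nb(examples)
print(classify_nb({"w-1": "river", "pos": "NN"}, priors, feat_counts, vocab))
```

Working in log space avoids underflow when the product runs over many features.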
Decision List Classifiers
• Transparent
• Like case statements, applying tests to the input in turn:
  – fish within window --> bass1
  – striped bass --> bass1
  – guitar within window --> bass2
  – bass player --> bass2
• Yarowsky's ('96) approach orders the tests by their individual accuracy on the entire training set, based on the log-likelihood ratio
  Abs(Log(P(Sense1 | f_i = v_j) / P(Sense2 | f_i = v_j)))

Bootstrapping to Get More Labeled Data
• Bootstrapping I
  – Start with a few labeled instances of the target item as seeds to train an initial classifier, C
  – Use high-confidence classifications of C on unlabeled data as additional training data
  – Iterate
• Bootstrapping II
  – Start with sentences containing words strongly associated with each sense (e.g. sea and music for bass), identified intuitively, from a corpus, or from dictionary entries, and label those automatically
  – One Sense per Discourse hypothesis

Evaluating WSD
• In vivo/end-to-end/task-based/extrinsic vs. in vitro/standalone/intrinsic: evaluation within some task (parsing? Q&A? an IVR system?) vs. application-independent evaluation
  – In vitro metrics: classification accuracy on a held-out test set, or precision/recall/F-measure if not all instances must be labeled
• Baselines:
  – Most frequent sense?
  – Lesk algorithms
• Ceiling: human annotator agreement

Dictionary Approaches
• Problem of scale for all ML approaches
  – Building a classifier for each word with multiple senses
• Machine-readable dictionaries with senses identified and examples
  – Simplified Lesk:
    • Retrieve all content words occurring in the context of the target (e.g. Sailors love to fish for bass.)
    • Compute the overlap with the sense definitions of the target's entry
      » bass1: a musical instrument…
      » bass2: a type of fish that lives in the sea…
    • Choose the sense with the most content-word overlap
  – Original Lesk:
    • Compare the dictionary entries of all content words in the context with the entries for each sense
• Limits:
  – Dictionary entries are short; performance is best with longer entries, so…
    • Expand with the entries of 'related' words that appear in the entry
    • If a tagged corpus is available, collect all the words appearing in the context of each sense of the target word (e.g. all words appearing in sentences with bass1) into a signature for bass1
      – Weight each word by its frequency of occurrence across all 'documents' (e.g. all senses of bass) to capture how discriminating it is for the target word's senses
      – Corpus Lesk performs best of all the Lesk approaches

Summary
• Many useful approaches have been developed for WSD
  – Supervised and unsupervised ML techniques
  – Novel uses of existing resources (WordNet, dictionaries)
• Future
  – More tagged training corpora becoming available
  – New learning techniques being tested, e.g. co-training
• Next class:
  – Ch 18:6-9
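For reference, a minimal sketch of the Simplified Lesk overlap computation from the dictionary-approaches slides above. The glosses and the stopword list are toy assumptions; a real system would draw glosses from a machine-readable dictionary such as WordNet:

```python
STOPWORDS = {"a", "an", "the", "to", "of", "for", "in", "that"}

def content_words(text):
    """Lowercase, strip punctuation, and drop (toy) stopwords."""
    return {w.strip(".,;!?") for w in text.lower().split()} - STOPWORDS

def simplified_lesk(context, senses):
    """senses maps sense label -> gloss text; return the sense whose
    gloss shares the most content words with the context."""
    ctx = content_words(context)
    return max(senses, key=lambda s: len(ctx & content_words(senses[s])))

senses = {  # toy glosses echoing the slides' bass1/bass2 entries
    "bass1": "a musical instrument with a low pitch",
    "bass2": "a type of fish that lives in the sea",
}
print(simplified_lesk("Sailors love to fish for bass.", senses))  # bass2
```

Here the gloss for bass2 shares the content word "fish" with the context, while the bass1 gloss shares nothing, so the fish sense wins.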