Word Sense Disambiguation
CS 4705
Overview
• A problem for semantic attachment approaches: what
happens when a given lexeme has multiple ‘meanings’?
Flies [V] vs. Flies [N]
He robbed the bank. He sat on the bank.
• How do we determine the correct ‘sense’ of the word?
• Machine Learning
  – Supervised methods
    • Evaluation
  – Lightly supervised and Unsupervised
    • Bootstrapping
    • Dictionary-based techniques
    • Selection restrictions
    • Clustering
Supervised WSD
• Approaches:
– Tag a corpus with correct senses of particular words
(lexical sample) or all words (all-words task)
• E.g. SENSEVAL corpora
– Lexical sample:
• Extract features which might predict word sense
– POS? Word identity? Punctuation after? Previous word?
Its POS?
• Use Machine Learning algorithm to produce a
classifier which can predict the senses of one word
or many
– All-words
• Use semantic concordance: each open class word
labeled with sense from dictionary or thesaurus
– E.g. SemCor (Brown Corpus), tagged with WordNet
senses
What Features Are Useful?
• “Words are known by the company they keep”
– How much ‘company’ do we need to look at?
– What do we need to know about the ‘friends’?
• POS, lemmas/stems/syntactic categories,…
• Collocations: words that frequently appear with the
target, identified from large corpora
federal government, honor code, baked potato
– Position is key
• Bag-of-words: words that appear somewhere in a
context window
I want to play a musical instrument so I chose the bass.
– Ordering/proximity not critical
• Punctuation, capitalization, formatting
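To make the two context representations concrete, here is a minimal Python sketch (not from the lecture) of collocational vs. bag-of-words feature extraction around a target word; the window sizes, tokenization, and feature names are illustrative assumptions.

```python
# Sketch: collocational (position-specific) vs. bag-of-words (position-free) features.
# Window sizes, tokenization, and feature names are assumptions for illustration.

def collocational_features(tokens, target_index, window=2):
    """Words at fixed offsets from the target -- position is key."""
    feats = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        i = target_index + offset
        feats[f"w{offset:+d}"] = tokens[i].lower() if 0 <= i < len(tokens) else "<PAD>"
    return feats

def bag_of_words_features(tokens, target_index, window=10):
    """Words anywhere in the context window -- ordering/proximity not critical."""
    lo, hi = max(0, target_index - window), min(len(tokens), target_index + window + 1)
    return {f"bow={tokens[i].lower()}": True for i in range(lo, hi) if i != target_index}

if __name__ == "__main__":
    sent = "I want to play a musical instrument so I chose the bass".split()
    target = sent.index("bass")
    print(collocational_features(sent, target))  # {'w-2': 'chose', 'w-1': 'the', 'w+1': '<PAD>', 'w+2': '<PAD>'}
    print(bag_of_words_features(sent, target))   # {'bow=want': True, 'bow=musical': True, ...}
```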
Rule Induction Learners and WSD
• Given a feature vector of values for the independent variables associated with each observation in the training set
• Top-down greedy search driven by information gain: how
will entropy of (remaining) data be reduced if we split on
this feature?
• Produce a set of rules that perform best on the training
data, e.g.
– bank2 if w-1==‘river’ & pos==NP & src==‘Fishing News’…
– …
• The resulting rules are easy to understand, but many passes over the data are needed for each decision, and the rules are susceptible to over-fitting
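As a concrete illustration of the split criterion, here is a small Python sketch (with toy, invented data) of entropy and information gain for a candidate feature.

```python
# Sketch: the information-gain criterion behind the top-down greedy rule search.
# The toy (features, sense) examples below are invented for illustration.
from collections import Counter
from math import log2

def entropy(labels):
    counts, total = Counter(labels), len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(examples, feature):
    """How much would splitting the remaining data on `feature` reduce sense entropy?"""
    labels = [sense for _, sense in examples]
    before = entropy(labels)
    after = 0.0
    for value in {feats[feature] for feats, _ in examples}:
        subset = [sense for feats, sense in examples if feats[feature] == value]
        after += (len(subset) / len(examples)) * entropy(subset)
    return before - after

if __name__ == "__main__":
    data = [({"w-1": "river"}, "bank2"), ({"w-1": "the"}, "bank1"),
            ({"w-1": "river"}, "bank2"), ({"w-1": "savings"}, "bank1")]
    print(information_gain(data, "w-1"))  # 1.0: splitting on the previous word separates the senses
```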
Naïve Bayes
• ŝ = argmax_{s∈S} p(s|V), or
  ŝ = argmax_{s∈S} p(V|s) p(s) / p(V)
• Where s is one of the set of senses S possible for a word w, and V is the input vector of feature values for w
• Assume features independent, so probability of V
is the product of probabilities of each feature,
given s, so
  p(V|s) ≈ ∏_{j=1}^{n} p(v_j|s)
• p(V) is the same for any candidate ŝ, so it can be dropped
• Then
  ŝ = argmax_{s∈S} p(s) ∏_{j=1}^{n} p(v_j|s)
• How do we estimate p(s) and p(vj|s)?
– p(si) is the max. likelihood estimate from a sense-tagged corpus (count(si,wj)/count(wj)) – how likely is bank to mean ‘financial institution’ over all instances of bank?
– p(vj|s) is the max. likelihood estimate of each feature given a candidate sense (count(vj,s)/count(s)) – how likely is the previous word to be ‘river’ when the sense of bank is ‘financial institution’?
• Calculate ŝ = argmax_{s∈S} p(s) ∏_{j=1}^{n} p(vj|s) for each possible sense and take the highest-scoring sense as the most likely choice
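The following Python sketch puts these pieces together under the definitions above; the add-one smoothing and the toy sense-tagged examples are assumptions added so the code runs, not details from the lecture.

```python
# Sketch: Naive Bayes WSD -- MLE estimates of p(s) and p(v_j|s) from sense-tagged
# data, then s_hat = argmax_s p(s) * prod_j p(v_j|s). Add-one smoothing is an
# added assumption to avoid zero probabilities; the toy data are invented.
import math
from collections import Counter, defaultdict

def train(tagged_examples):
    """tagged_examples: list of (feature_dict, sense) pairs."""
    sense_counts = Counter(sense for _, sense in tagged_examples)
    feat_counts = defaultdict(Counter)   # feat_counts[sense][(feature, value)]
    vocab = set()
    for feats, sense in tagged_examples:
        feat_counts[sense].update(feats.items())
        vocab.update(feats.items())
    return sense_counts, feat_counts, len(tagged_examples), len(vocab)

def classify(feats, sense_counts, feat_counts, n_examples, vocab_size, alpha=1.0):
    best_sense, best_score = None, float("-inf")
    for sense, count in sense_counts.items():
        score = math.log(count / n_examples)            # log p(s)
        for fv in feats.items():                        # + sum_j log p(v_j|s)
            score += math.log((feat_counts[sense][fv] + alpha) /
                              (count + alpha * vocab_size))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

if __name__ == "__main__":
    data = [({"w-1": "river"}, "bank2"), ({"w-1": "savings"}, "bank1"),
            ({"w-1": "the"}, "bank1"), ({"w-1": "river"}, "bank2")]
    model = train(data)
    print(classify({"w-1": "river"}, *model))  # -> bank2
```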
Decision List Classifiers
• Transparent
• Like case statements applying tests to input in turn
fish within window     --> bass1
striped bass           --> bass1
guitar within window   --> bass2
bass player            --> bass2
– Yarowsky ‘96’s approach orders tests by individual accuracy on the entire training set, based on the log-likelihood ratio
  Abs(log(P(Sense1 | fi = vj) / P(Sense2 | fi = vj)))
Bootstrapping to Get More Labeled Data
• Bootstrapping I
– Start with a few labeled instances of target item as
seeds to train initial classifier, C
– Use high confidence classifications of C on unlabeled
data as training data
– Iterate
• Bootstrapping II
– Start with sentences containing words strongly
associated with each sense (e.g. sea and music for
bass), either intuitively or from corpus or from
dictionary entries, and label those automatically
– One Sense per Discourse hypothesis: a word tends to be used in the same sense throughout a single discourse, so confident labels can be propagated to other occurrences in the same document
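A schematic Python sketch of the Bootstrapping I loop; `train_fn` and `predict_with_confidence` stand in for any supervised classifier (e.g. the Naive Bayes sketch above), and the confidence threshold and round limit are assumptions.

```python
# Sketch: bootstrapping from a few labeled seeds. `train_fn(labeled)` returns a model;
# `predict_with_confidence(model, feats)` returns (sense, confidence). Both are
# placeholders for whatever supervised learner is used; threshold/rounds are assumptions.
def bootstrap(seed_labeled, unlabeled, train_fn, predict_with_confidence,
              threshold=0.9, max_rounds=5):
    labeled = list(seed_labeled)          # (features, sense) seed pairs
    pool = list(unlabeled)                # unlabeled feature dicts
    for _ in range(max_rounds):
        model = train_fn(labeled)
        confident, remaining = [], []
        for feats in pool:
            sense, conf = predict_with_confidence(model, feats)
            (confident if conf >= threshold else remaining).append((feats, sense))
        if not confident:                 # nothing the classifier trusts -- stop iterating
            break
        labeled.extend(confident)         # high-confidence labels become new training data
        pool = [feats for feats, _ in remaining]
    return train_fn(labeled)
```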
Evaluating WSD
• In vivo / end-to-end / task-based / extrinsic vs. in vitro / standalone / intrinsic: evaluation within some task (parsing? Q/A? IVR system?) vs. application-independent evaluation
– In vitro metrics: classification accuracy on held-out test set or
precision/recall/f-measure if not all instances must be labeled
• Baseline:
– Most frequent sense?
– Lesk algorithms
• Ceiling: human annotator agreement
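As a minimal illustration of the in vitro setting, here is a Python sketch comparing a system's classification accuracy on a held-out test set against the most-frequent-sense baseline; the toy labels are invented.

```python
# Sketch: intrinsic evaluation -- accuracy on held-out data vs. the most-frequent-sense baseline.
# The toy gold/system labels are invented for illustration.
from collections import Counter

def accuracy(predicted, gold):
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def most_frequent_sense_baseline(training_senses, n_test):
    mfs = Counter(training_senses).most_common(1)[0][0]
    return [mfs] * n_test

if __name__ == "__main__":
    training_senses = ["bank1", "bank1", "bank1", "bank2"]
    gold   = ["bank1", "bank2", "bank1", "bank1"]
    system = ["bank1", "bank2", "bank2", "bank1"]
    print(accuracy(system, gold))                                                    # 0.75
    print(accuracy(most_frequent_sense_baseline(training_senses, len(gold)), gold))  # 0.75
```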
Dictionary Approaches
• Problem of scale for all ML approaches
– Building a classifier for each word with multiple senses
• Machine-Readable dictionaries with senses
identified and examples
– Simplified Lesk:
• Retrieve all content words occurring in context of
target (e.g. Sailors love to fish for bass.)
– Compute overlap with sense definitions of target entry
» bass1: a musical instrument…
» bass2: a type of fish that lives in the sea…
– Choose sense with most content-word overlap
– Original Lesk:
• Compare dictionary entries of all content-words in context
with entries for each sense
• Limits:
– Dictionary entries are short; performance best with longer entries,
so….
• Expand with entries of ‘related’ words that appear in the entry
• If a tagged corpus is available, collect all the words appearing in the context of each sense of the target word (e.g. all words appearing in sentences with bass1) and add them to the signature for bass1
– Weight each by frequency of occurrence in all ‘documents’ (e.g.
all senses of bass) to capture how discriminating a word is for
the target word’s senses
– Corpus Lesk performs best of all Lesk approaches
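A minimal Python sketch of Simplified Lesk as described above: pick the sense whose gloss shares the most content words with the context. The mini sense dictionary, stopword list, and tokenization are assumptions for illustration.

```python
# Sketch: Simplified Lesk -- choose the sense whose definition overlaps most with the context.
# The tiny sense "dictionary", stopword list, and tokenizer are assumptions for illustration.
STOPWORDS = {"a", "an", "the", "to", "of", "in", "for", "that", "with", "so", "i"}

def content_words(text):
    return {w.strip(".,!?").lower() for w in text.split()} - STOPWORDS

def simplified_lesk(context_sentence, sense_glosses):
    """sense_glosses: dict mapping sense label -> definition string."""
    context = content_words(context_sentence)
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

if __name__ == "__main__":
    senses = {"bass1": "a musical instrument with low-pitched strings",
              "bass2": "a type of fish that lives in the sea"}
    print(simplified_lesk("Sailors love to fish for bass.", senses))  # -> bass2
```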
Summary
• Many useful approaches developed to do WSD
– Supervised and unsupervised ML techniques
– Novel uses of existing resources (WN, dictionaries)
• Future
– More tagged training corpora becoming available
– New learning techniques being tested, e.g. co-training
• Next class:
– Ch 18:6-9