pptx

advertisement
Corpora and Statistical Methods
Lecture 6
Semantic similarity, vector space models and wordsense disambiguation
Part 2
Word sense disambiguation
What are word senses?
 Cognitive definition:
 mental representation of meaning
 used in psychological experiments
 relies on introspection (notoriously deceptive)
 Dictionary-based definition:
 adopt sense definitions in a dictionary
 most frequently used resource is WordNet
WordNet
 Taxonomic representation of words (“concepts”)
 Each word belongs to a synset, which contains near-
synonyms
 Each synset has a gloss
 Words with multiple senses (polysemy) belong to multiple
synsets
 Synsets organised by hyponymy (IS-A) relations
How many senses?
 Example: interest
 pay 3% interest on a loan
 showed interest in something
 purchased an interest in a company.
 the national interest…
 have X’s best interest at heart
 have an interest in word senses
 The economy is run by business interests
Wordnet entry for interest (noun)
1. a sense of concern with and curiosity about someone or something …
2.
3.
4.
5.
6.
7.
(Synonym: involvement)
the power of attracting or holding one’s interest… (Synonym:
interestingness)
a reason for wanting something done ( Synonym: sake)
a fixed charge for borrowing money…
a diversion that occupies one’s time and thoughts … (Synonym: pastime)
a right or legal share of something; a financial involvement with something
(Synonym: stake)
(usually plural) a social group whose members control some field of activity
and who have common aims, (Synonym: interest group)
Some issues
 Are all these really distinct senses? Is WordNet too fine-
grained?
 Would native speakers distinguish all these as different?
 Cf. The distinction between sense ambiguity and
underspecification (vagueness):
 one could argue that there are fewer senses, but these are
underspecified out of context
Translation equivalence
 Many WSD applications rely on translation equivalence
 Given: parallel corpus (e.g. English-German)
 if word w in English has n translations in German, then each
translation represents a sense
 e.g. German translations of interest:
 Zins: financial charge (WN sense 4)
 Anteil: stake in a company (WN sense 6)
 Interesse: all other senses
Some terminology
 WSD Task: given an ambiguous word, find the intended sense
in context
 Sense tagging: task of labelling words as belonging to one
sense or another.
 needs some a priori characterisation of senses of each relevant word
 Discrimination:
 distinguishes between occurrences of words based on senses
 not necessarily explicit labelling
Some more terminology
 Two types of WSD task:
 Lexical sample task: focuses on disambiguating a small set of
target words, using an inventory of the senses of those words.
 All-words task: focuses on entire texts and a lexicon, where
every word in the text has to be disambiguated
 Serious data sparseness problems!
Approaches to WSD
 All methods rely on training data. Basic idea:
 Given word w in context c
 learn how to predict sense s of w based on various features of w
 Supervised learning: training data is labelled with correct senses
 can do sense tagging
 Unsupervised learning: training data is unlabelled
 but many other knowledge sources used
 cannot do sense tagging, since this requires a priori senses
Supervised learning
 Words in training data labelled with their senses
 She pays 3% interest/INTEREST-MONEY on the loan.
 He showed a lot of interest/INTEREST-CURIOSITY in the painting.
 Similar to POS tagging
 given a corpus tagged with senses
 define features that indicate one sense over another
 learn a model that predicts the correct sense given the features
Features (e.g. plant)
 Neighbouring words:
 plant life
 manufacturing plant
 assembly plant
 plant closure
 plant species
 Content words in a larger window
 animal
 equipment
 employee
 automatic
Other features
 Syntactically related words
 e.g. object, subject….
 Topic of the text
 is it about SPORT? POLITICS?
 Part-of-speech tag, surrounding part-of-speech tags
Some principles proposed (Yarowsky 1995)
 One sense per discourse:
 typically, all occurrences of a word will have the same sense in the
same stretch of discourse (e.g. same document)
 One sense per collocation:
 nearby words provide clues as to the sense, depending on the distance
and syntactic relationship
 e.g. plant life: all (?) occurrences of plant+life will indicate the botanic
sense of plant
Training data
 SENSEVAL
 Shared Task competition
 datasets available for WSD, among other things
 annotated corpora in many languages
 Pseudo-words
 create training corpus by artificially conflating words
 e.g. all occurrences of man and hammer with man-hammer
 easy way to create training data
 Multi-lingual parallel corpora
 translated texts aligned at the sentence level
 translation indicates sense
Data representation
 Example sentence: An electric guitar and bass player stand off to
one side...
 Target word: bass
 Possible senses: fish, musical instrument...
 Relevant features are represented as vectors, e.g.:
wi2 , POSi2 , wi1 , POSi1 , wi 1 , POSi1, wi2 , POSi2 
guitar,
NN,and, CC, player, NN,stand, VB
Supervised methods
Naïve Bayes Classifier
 Identify the features (F)
 e.g. surrounding words
 other cues apart from surrounding context
 Combine evidence from all features
 Decision rule: decide on sense s’ iff
sk , sk  s': P(s' | F )  P(sk | F )
 Example: drug. F = words in context
 medication sense: price, prescription, pharmaceutical
 illegal substance sense: alcohol, illicit, paraphernalia
Using Bayes’ rule
 We usually don’t know P(sk|f) but we can compute from
training data: P(sk) (the prior) and P(f|sk)
P ( sk | f ) 
P ( f | sk ) P ( sk )
P( f )
 P(f) can be eliminated because it is constant for all senses in the
corpus
sbest
 arg max sk P( sk | f )
 arg max sk
P ( f | sk ) P ( sk )
P( f )
 arg max sk P( f | sk ) P( sk )
The independence assumption
 It’s called “naïve” because:
n
P ( f | sk )   P ( f j | sk )
j 1
 i.e. all features are assumed to be independent
 Obviously, this is often not true.
 e.g. finding illicit in the context of drug may not be independent of finding pusher.
 cf. our discussion of collocations!
 Also, topics often constrain word choice.
Training the naive Bayes classifier
 We need to compute:
 P(s) for all senses s of w
P ( sk ) 
Count ( sk , w)
Count ( w)
 P(f|s) for all features f
P ( f j | sk ) 
Count ( f j , sk )
Count ( sk )
Information-theoretic measures
 Find the single, most informative feature to predict a sense.
 E.g. using a parallel corpus:
 prendre (FR) can translate as take or make
 prendre une décision: make a decision
 prendre une mesure: take a measure [to…]
 Informative feature in this case: direct object
 mesure indicates take
 décision indicates make
 Problem: need to identify the correct value of the feature that
indicates a specific sense.
Brown et al’s algorithm
1.
Given: translations T of word w
2.
Given: values X of a useful feature (e.g. mesure, décision as values of DO)
3.
Step 1: random partition P of T
4.
While improving, do:

create partition Q of X that maximises I(P;Q)

find a partition P of T that maximises I(P;Q)
comment: relies on mutual info to find clusters of translations mapping to clusters
of feature values
Using dictionaries and thesauri

Lesk (1986): one of the first to exploit dictionary
definitions


the definition corresponding to a sense can contain words which
are good indicators for that sense
Method:
1.
2.
3.
4.
Given: ambiguous word w with senses s1…sn with glosses g1…gn.
Given: the word w in context c
compute overlap between c & each gloss
select the maximally matching sense
Expanding a dictionary
 Problem with Lesk:
 often dictionary definitions don’t contain sufficient information
 not all words in dictionary definitions are good informants
 Solution: use a thesaurus with subject/topic categories
 e.g. Roget’s thesaurus
 http://www.gutenberg.org/cache/epub/22/pg22.html.utf8
Using topic categories
 Suppose every sense sk of word w has subject/topic tk
 w can be disambiguated by identifying the words related to tk
in the thesaurus
 Problems:
 general-purpose thesauri don’t list domain-specific topics
 several potentially useful words can be left out
 e.g. … Navratilova plays great tennis …
 proper name here useful as indicator of topic SPORT
Expanding a thesaurus: Yarowsky 1992
1.
Given: context c and topic t
2.
For all contexts and topics, compute p(c|t) using Naïve Bayes

by comparing words pertaining to t in the thesaurus with words in c

if p(c|t) > α, then assign topic t to context c
3.
For all words in the vocabulary, update the list of contexts in which the
word occurs.

Assign topic t to each word in c
4.
Finally, compute p(w|t) for all w in the vocabulary

this gives the “strength of association” of w with t
Yarowsky 1992: some results
SENSE
star
space object
celebrity
shape
sentence
punishment
set of words
ROGET TOPICS
ACCURACY
UNIVERSE
ENTERTAINER
INSIGNIA
96%
95%
82.%
LEGAL_ACTION 99%
GRAMMAR
98%
Bootstrapping

Yarowsky (1995) suggested the one sense per
discourse/collocation constraints.

Yarowsky’s method:
1.
2.

select the strongest collocational feature in a specific context
disambiguate based only on this feature
(similar to the information-theoretic method discussed earlier)
One sense per collocation
1.
For each sense s of w, initialise F, the collocations found in s’s
dictionary definition.
2.
One sense per collocation:


identify the set of contexts containing collocates of s
for each sense s of w, update F to contain those collocates such that
for all s’ ≠ s
P( s | f )

P( s' | f )
(where alpha is a threshold)
One sense per discourse
3.
For each document:



find the majority sense of w out of those found in previous step
assign all occurrences of w the majority sense
This is implemented as a post-processing step. Reduces
error rate by ca. 27%.
Unsupervised disambiguation
Preliminaries
 Recall: unsupervised learning can do sense discrimination
not tagging
 akin to clustering occurrences with the same sense
 e.g. Brown et al 1991: cluster translations of a word
 this is akin to clustering senses
Brown et al’s method

Preliminary categorisation:
1.
Set P(w|s) randomly for all words w and senses s of w.
2.
Compute, for each context c of w the probability P(c|s)
that the context was generated by sense s.

Use (1) and (2) as a preliminary estimate. Re-estimate
iteratively to find best fit to the corpus.
Characteristics of unsupervised
disambiguation
 Can adapt easily to new domains, not covered by a dictionary
or pre-labelled corpus
 Very useful for information retrieval
 If there are many senses (e.g. 20 senses for word w), the
algorithm will split contexts into fine-grained sets
 NB: can go awry with infrequent senses
Some issues with WSD
The task definition
 The WSD task traditionally assumes that a word has one and
only one sense in a context.
 Is this true?
 Kilgarriff (1993) argues that co-activation (one word
displaying more than one sense) is frequent:
 this would bring competition to the licensed trade
 competition = “act of competing”; “people/organisations who are
competing”
Systematic polysemy
 Not all senses are so easy to distinguish. E.g. competition in the
“agent competing” vs “act of competing” sense.
 The polysemy here is systematic
 Compare bank/bank where the senses are utterly distinct (and most
linguists wouldn’t consider this a case of polysemy, but homonymy)
 Can translation equivalence help here?
 depends if polysemy is systematic in all languages
Logical metonymy
 Metonymy = usage of a word to stand for something else
 e.g. the pen is mightier than the sword
 pen = the press
 Logical metonymy arises due to systematic polysemy
 good cook vs. good book
 enjoy the paper vs enjoy the cake
 Should WSD distinguish these? How could they do this?
Which words/usages count?
 Many proper names are identical to common nouns (cf.
Brown, Bush,…)
 This presents a WSD algorithm with systematic ambiguity
and reduces performance.
 Also, names are good indicators of senses of neighbouring
words.
 But this requires a priori categorisation of names.
 Brown’s green stance vs. the cook’s green curry
Download