Word Sense Disambiguation
MAS.S60
Catherine Havasi
Rob Speer
Banks?
• The edge of a river
– “I fished on the bank of the Mississippi.”
• A financial institution
– “Bank of America failed to return my call.”
• The building that houses the financial
institution
– “The bank burned down last Thursday.”
• A “biological repository”
– “I gave blood at the blood bank”.
Word Sense Disambiguation
• Most NLP tasks need WSD
• “Played a lot of pool last
night… my bank shot is
improving!”
• Usually keying to WordNet
“I hit the ball with the bat.”
Types
• “All words”
– Guess the WN synset
• Lexical Subset
– A small number of pre-defined words
• Coarse Word Sense
– All words, but more intuitive senses
IAA (inter-annotator agreement) is 75-80% for the all-words task with WordNet senses
90% for simple binary tasks
What is a Coarse Word Sense?
• How many word senses does the word “bag”
have in WordNet?
– 9 noun senses, 5 verb senses
• Coarse WSD: 6 nouns, 2 verbs
• A Coarse WordNet: 6,000 words (Navigli and Litkowski 2006)
• These distinctions are hard even for humans
(Snyder and Palmer 2004)
– Fine Grained IAA: 72.5%
– Coarse Grained IAA: 86.4%
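As an aside, the fine-grained counts above can be reproduced by querying WordNet directly. A minimal sketch using NLTK (assumes the WordNet data has already been downloaded via nltk.download):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet') once

# Fine-grained WordNet senses of "bag"
noun_senses = wn.synsets('bag', pos=wn.NOUN)
verb_senses = wn.synsets('bag', pos=wn.VERB)
print(len(noun_senses), "noun senses;", len(verb_senses), "verb senses")

# The glosses a coarse-grained inventory would merge into ~6 noun senses
for s in noun_senses:
    print(s.name(), '-', s.definition())
```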
“Bag”: Noun
• 1. A coarse sense containing:
– bag (a flexible container with a single opening)
– bag, handbag, pocketbook, purse (a container used for carrying money
and small personal items or accessories)
– bag, bagful (the quantity that a bag will hold)
– bag, traveling bag, travelling bag, grip, suitcase (a portable rectangular
container for carrying clothes)
• 2. bag (the quantity of game taken in a particular period)
• 3. base, bag (a place that the runner must touch before scoring)
• 4. bag, old bag (an ugly or ill-tempered woman)
• 5. udder, bag (mammary gland of bovids (cows and sheep and goats))
• 6. cup of tea, bag, dish (an activity that you like or at which you are superior)
Frequent Ingredients
• Open Mind Word Expert
• WordNet
• eXtended WordNet (XWN)
• SemCor 3.0 (“brown1” and “brown2”)
• ConceptNet
No training set, no problem
• Julia Hockenmaier’s “Pseudoword” evaluation
• Pick two random words
– Say, “banana” and “door”
• Combine them together
– “BananaDoor”
• Replace all instances of either in your corpora
with your new pseudoword
• Evaluate
• A bit easier…
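A minimal sketch of the pseudoword setup just described, with a toy corpus standing in for real data: conflate the two words, keep the original word as a free gold label, and score any disambiguator against those labels.

```python
import re

def make_pseudoword_corpus(sentences, w1="banana", w2="door", pseudo="bananadoor"):
    """Replace every occurrence of w1 or w2 with the pseudoword,
    keeping the original word as a free gold label."""
    pattern = re.compile(r"\b(%s|%s)\b" % (w1, w2), re.IGNORECASE)
    examples = []
    for sent in sentences:
        for match in pattern.finditer(sent):
            gold = w1 if match.group().lower() == w1 else w2
            examples.append((pattern.sub(pseudo, sent), gold))
    return examples

toy_corpus = [
    "I ate a banana for breakfast.",
    "Please close the door quietly.",
    "The door was painted banana yellow.",
]
for masked, gold in make_pseudoword_corpus(toy_corpus):
    print(gold, "->", masked)
```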
The “Flip-flop” Method
• Stephen Brown and Jonathan Rose, 1991
• Find a single feature or set of features which disambiguates the words – think of a named-entity recognizer
An Example
Standard Techniques
• Naïve Bayes (notice a trend) – see the sketch below
– Bag of words
– Priors are based on word frequencies
• Unsupervised clustering techniques
– Expectation Maximization (EM)
– Yarowsky
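A from-scratch sketch of the Naïve Bayes recipe in the first bullet: bag-of-words features, priors from sense frequencies, and add-one smoothing (added here only to keep the toy example from zeroing out). The training data is invented.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_contexts):
    """labeled_contexts: list of (list_of_context_words, sense_label).
    Priors come from sense frequencies; likelihoods from word counts."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, sense in labeled_contexts:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab

def classify_nb(words, sense_counts, word_counts, vocab):
    total = sum(sense_counts.values())
    best, best_score = None, float("-inf")
    for sense, count in sense_counts.items():
        score = math.log(count / total)                 # prior P(sense)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in words:                                 # bag-of-words likelihood
            score += math.log((word_counts[sense][w] + 1) / denom)  # add-one smoothing
        if score > best_score:
            best, best_score = sense, score
    return best

# Toy training contexts for "bank"
train = [
    (["river", "fish", "water"], "shore"),
    (["money", "deposit", "loan"], "finance"),
    (["loan", "interest", "account"], "finance"),
]
model = train_nb(train)
print(classify_nb(["money", "account"], *model))  # -> 'finance'
```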
Yarowsky
(slides from Julia Hockenmaier)
Training Yarowsky
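The Yarowsky material above lives in figures, so here is a rough, simplified sketch of the bootstrapping idea: seed collocations label a few contexts, words that reliably co-occur with one label become new rules, and the process repeats. It omits the log-likelihood-ranked decision list and the one-sense-per-discourse constraint of the real algorithm, and the contexts are invented.

```python
from collections import Counter

def yarowsky_bootstrap(contexts, seeds, rounds=3, min_count=2):
    """contexts: a set of context words for each occurrence of one ambiguous word.
    seeds: dict mapping a collocate word -> sense label (the seed rules).
    Returns the (possibly partial) labeling and the grown rule set."""
    rules = dict(seeds)
    labels = {}
    for _ in range(rounds):
        # 1. Label every context that matches a current collocation rule
        for i, ctx in enumerate(contexts):
            for word, sense in rules.items():
                if word in ctx:
                    labels[i] = sense
                    break
        # 2. Harvest new "one sense per collocation" rules from the labeled data
        votes = Counter()
        for i, sense in labels.items():
            for word in contexts[i]:
                votes[(word, sense)] += 1
        for (word, sense), n in votes.items():
            if n >= min_count and word not in rules:
                rules[word] = sense
    return labels, rules

contexts = [
    {"river", "fishing", "muddy"},
    {"river", "muddy", "water"},
    {"money", "deposit", "teller"},
    {"money", "deposit"},
    {"muddy", "water"},    # labeled via the harvested "muddy" rule
    {"deposit", "loan"},   # labeled via the harvested "deposit" rule
]
seeds = {"river": "shore", "money": "finance"}
labels, rules = yarowsky_bootstrap(contexts, seeds)
print(labels)
print(rules)
```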
Using OMCS
• Created a blend using a large number of
resources
• Created an ad hoc category for a word and its surroundings in the sentence
• Find which word sense is most similar to that category
• Keep the system machinery as general as
possible.
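The blend itself is built elsewhere, but the “most similar sense to the ad hoc category” step can be sketched as cosine similarity over whatever vectors the blend produces. The vectors below are made-up placeholders, not real blend output.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pick_sense(context_vectors, sense_vectors):
    """Ad hoc category = average of the vectors for the words around the target;
    pick the sense whose vector is closest to that category."""
    category = np.mean(context_vectors, axis=0)
    return max(sense_vectors, key=lambda sense: cosine(category, sense_vectors[sense]))

# Placeholder vectors standing in for rows of the blended space
context = [np.array([0.9, 0.1, 0.0]),   # "money"
           np.array([0.8, 0.2, 0.1])]   # "deposit"
senses = {"bank/finance": np.array([1.0, 0.1, 0.0]),
          "bank/shore":   np.array([0.0, 0.2, 1.0])}
print(pick_sense(context, senses))      # -> 'bank/finance'
```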
Adding Associations
• ConceptNet was included in two forms:
– Concept vs. feature matrices
– Concept-to-concept associations
• Associations help to represent topic areas
• If the document mentions computer-related
words, expect more computer-related word
senses
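One way to read the last two bullets as code, with a toy dictionary standing in for ConceptNet’s concept-to-concept association scores: words already seen in the document raise the score of topically related senses.

```python
# Toy association scores standing in for ConceptNet concept-to-concept links
ASSOC = {
    ("keyboard", "mouse/computer-device"): 0.9,
    ("keyboard", "mouse/rodent"): 0.1,
    ("cheese", "mouse/computer-device"): 0.1,
    ("cheese", "mouse/rodent"): 0.8,
}

def topical_boost(document_words, candidate_senses):
    """Score each candidate sense by how strongly it associates
    with the other words seen in the document."""
    return {sense: sum(ASSOC.get((w, sense), 0.0) for w in document_words)
            for sense in candidate_senses}

# A document full of computer words pushes toward the computer sense of "mouse"
print(topical_boost({"keyboard"}, ["mouse/computer-device", "mouse/rodent"]))
```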
Constructing the Blend
Calculating the Right Sense
“I put my money
in the bank”
SemEval Task 7
• 14 different systems were submitted in 2007
• Baseline: most frequent sense
• Spoiler!: Our system would have placed 4th
• Top three systems:
– NUS-PT: parallel corpora with SVM (Chan et al., 2007)
– NUS-ML: Bayesian LDA with specialized features (Cai et al., 2007)
– LCC-WSD: multiple-methods approach with end-to-end system and corpora (Novischi et al., 2007)
Results
Parallel Corpora
• IMVHO the “right” way to do it.
• A word’s different senses often translate to different words in other languages
• Use parallel corpora to find those instances
– Like European Parliament or UN proceedings
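A toy sketch of the parallel-corpus idea: the aligned pairs and the translation-to-sense table below are invented, and a real system would derive them from word-aligned corpora such as Europarl or UN proceedings. The point is that the foreign translation of each occurrence acts as a free sense label.

```python
# Invented word-aligned English/French pairs for the target word "bank"
aligned = [
    ("I deposited money at the bank", "banque"),
    ("We walked along the bank of the river", "rive"),
    ("The bank approved the loan", "banque"),
]

# Translation -> English sense; in practice derived from a bilingual dictionary
SENSE_OF_TRANSLATION = {"banque": "bank/finance", "rive": "bank/shore"}

# Each aligned translation yields an automatically sense-tagged training example
training_data = [(sent, SENSE_OF_TRANSLATION[fr]) for sent, fr in aligned]
for sent, sense in training_data:
    print(sense, "<-", sent)
```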
English and Romanian
Gold standards are overrated
• Rada Mihalcea, 2007: “Using Wikipedia for
Automatic Word Sense Disambiguation”
Lab: making a simple supervised
WSD classifier
• Big thanks to some guy with a blog (Jim Plush)
• Training data: Wikipedia articles surrounding
“Apple” (the fruit) and “Apple Inc.”
• Test data: hand-classified tweets about apples
and Apple products
• Use familiar features + Naïve Bayes to get > 90% accuracy (starter sketch below)
• Optional: use it with tweetstream to show
only tweets about apples (the fruit)
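A starter sketch for the lab under stated assumptions: the file names are placeholders (not part of the original lab materials), the Wikipedia text is assumed to be scraped to one sentence per line, and scikit-learn’s CountVectorizer + MultinomialNB stand in for “familiar features + Naïve Bayes.”

```python
import csv
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def load_wikipedia_sentences(path):
    """One training sentence per line, pulled from the article text."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

fruit = load_wikipedia_sentences("apple_fruit_wiki.txt")    # placeholder path
company = load_wikipedia_sentences("apple_inc_wiki.txt")    # placeholder path
train_texts = fruit + company
train_labels = ["fruit"] * len(fruit) + ["company"] * len(company)

# Bag-of-words features + multinomial Naive Bayes, as on the slide
model = make_pipeline(CountVectorizer(lowercase=True, stop_words="english"),
                      MultinomialNB())
model.fit(train_texts, train_labels)

# Test on hand-classified tweets: a CSV with columns text,label (placeholder file)
with open("apple_tweets.csv", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
tweets = [r["text"] for r in rows]
gold = [r["label"] for r in rows]
print("accuracy:", model.score(tweets, gold))
```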
Slide Thanks
• James Pustejovsky, Gerard Bakx, Julia Hockenmaier
• Manning and Schütze