Discussion of assigned readings
Lecture 13

How does ‘smoothing’ help in Bayesian word sense disambiguation? How do you do this smoothing?
– Most words appear rarely (remember Heaps’ law)
– The more data you see, the more words you encounter that you had never seen before
– Now imagine you use the word before and the word after the target word as features
– When you test your classifier, you will see context words that you never saw during training
How does the naïve Bayes classifier work?
– Compute the probability of the observed context assuming each sense
   – Multiplication of the probabilities of each individual feature
   – A word not seen in training will have zero probability
– Choose the most probable sense
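A minimal sketch of this scoring step, with add-one style smoothing folded in so unseen context words do not zero out the product; the senses, counts, priors, and vocabulary size below are invented for illustration:

```python
import math
from collections import Counter

# Hypothetical training counts for the target word "bass":
# context-word counts observed with each sense (invented for illustration).
sense_context_counts = {
    "bass/fish":  Counter({"river": 7, "caught": 5, "boat": 3}),
    "bass/music": Counter({"play": 9, "guitar": 6, "band": 4}),
}
sense_priors = {"bass/fish": 0.4, "bass/music": 0.6}

def score_sense(sense, context_words, vocab_size, alpha=1.0):
    """log P(sense) + sum of log P(word | sense), with add-alpha smoothing
    so context words unseen in training still get a small probability."""
    counts = sense_context_counts[sense]
    total = sum(counts.values())
    score = math.log(sense_priors[sense])
    for w in context_words:
        score += math.log((counts[w] + alpha) / (total + alpha * vocab_size))
    return score

context = ["play", "river", "loud"]          # words around the test occurrence
vocab_size = 10_000                          # assumed lexicon size
best = max(sense_context_counts, key=lambda s: score_sense(s, context, vocab_size))
print(best)
```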
Lesson 2: zeros or not?
– Zipf’s Law:
   – A small number of events occur with high frequency
   – A large number of events occur with low frequency
   – You can quickly collect statistics on the high frequency events
   – You might have to wait an arbitrarily long time to get valid statistics on low frequency events
– Result:
   – Our estimates are sparse! No counts at all for the vast bulk of things we want to estimate!
   – Some of the zeros in the table are really zeros, but others are simply low frequency events you haven't seen yet. After all, ANYTHING CAN HAPPEN!
   – How to address this?
– Answer:
   – Estimate the likelihood of unseen N-grams!
Dealing with unknown words
– Training:
   – Assume a fixed vocabulary (e.g. all words that occur at least 5 times in the corpus)
   – Replace all other words by a token <UNK>
   – Estimate the model on this corpus
– Testing:
   – Replace all unknown words by <UNK>
   – Run the model
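A minimal sketch of this preprocessing step, assuming a whitespace-tokenized corpus; the count threshold follows the slide's idea, everything else is illustrative:

```python
from collections import Counter

def build_vocab(train_tokens, min_count=5):
    """Keep words seen at least min_count times; everything else maps to <UNK>."""
    counts = Counter(train_tokens)
    return {w for w, c in counts.items() if c >= min_count}

def apply_unk(tokens, vocab):
    return [w if w in vocab else "<UNK>" for w in tokens]

train = "the cat sat on the mat the cat slept".split()
vocab = build_vocab(train, min_count=2)          # lower threshold for the toy corpus
print(apply_unk("the dog sat".split(), vocab))   # ['the', '<UNK>', '<UNK>']
```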
Smoothing is like Robin Hood:
Steal from the rich and give to the poor (in
probability mass)
Laplace smoothing
– Also called add-one smoothing
– Just add one to all the counts!
– Very simple
– MLE estimate: P(wi) = ci / N
– Laplace estimate: PLaplace(wi) = (ci + 1) / (N + V)
   – (ci = count of wi, N = total number of tokens, V = vocabulary size)
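A small sketch comparing the two estimates on toy counts; the corpus and the assumed vocabulary size are invented for illustration:

```python
from collections import Counter

tokens = "the cat sat on the mat".split()
counts = Counter(tokens)
N = len(tokens)
V = 10_000  # assumed vocabulary size

def p_mle(word):
    return counts[word] / N

def p_laplace(word):
    # add-one smoothing: every word, seen or not, gets count + 1
    return (counts[word] + 1) / (N + V)

print(p_mle("cat"), p_laplace("cat"))      # seen word
print(p_mle("dog"), p_laplace("dog"))      # unseen word: 0.0 vs. a small non-zero value
```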
Why do pseudo-words give optimistic results for WSD?
– Banana-door
   – Find sentences containing either of the words
   – Replace the occurrence of each of the words with a new symbol
   – Here is a big annotated corpus!
   – The correct sense is the original word
– The problem is that different senses of the same word are often semantically related
   – Pseudo-words are homonymous rather than polysemous
– Let’s correct this

Which is better, lexical sample or decision tree classifier, in terms of system performance and results?
– Lexical sample task
   – A way of formulating the task of WSD
   – A small pre-selected set of target words
– Classifiers used to solve the task
   – Naïve Bayes
   – Decision trees
   – Support vector machines
What is relative entropy?
– KL divergence / relative entropy
– Selectional preference strength
   – eat x [FOOD]
   – be y [PRETTY-MUCH-ANYTHING]
   – Compare the distribution of expected semantic classes overall (FOOD, PEOPLE, LIQUIDS) with the distribution of expected semantic classes for a particular verb
   – The greater the difference between these distributions, the more information the verb is giving us about possible objects
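A minimal sketch of the KL-divergence computation behind selectional preference strength; the class distributions below are invented for illustration:

```python
import math

def kl_divergence(p, q):
    """D(P || Q) = sum over classes c of P(c) * log2(P(c) / Q(c)), for P(c) > 0."""
    return sum(p[c] * math.log2(p[c] / q[c]) for c in p if p[c] > 0)

# Invented prior over object classes, and distributions conditioned on a verb.
prior     = {"FOOD": 0.3, "PEOPLE": 0.4, "LIQUIDS": 0.3}
given_eat = {"FOOD": 0.8, "PEOPLE": 0.1, "LIQUIDS": 0.1}
given_be  = {"FOOD": 0.3, "PEOPLE": 0.4, "LIQUIDS": 0.3}   # "be" accepts pretty much anything

print(kl_divergence(given_eat, prior))   # large: "eat" is choosy about its objects
print(kl_divergence(given_be, prior))    # ~0: "be" tells us little
```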
Bootstrapping WSD algorithm
– How can the algorithm, after defining plant under life and manufacturing, go on to define animal and microscopic from life? Does it do that recursively or something?
– Basically, how do other words get labeled other than those we set out to label?
Bootstrapping
– What if you don’t have enough data to train a system…
– Bootstrap (a minimal sketch follows this list)
   – Pick a word that you as an analyst think will co-occur with your target word in a particular sense
   – Grep through your corpus for your target word and the hypothesized word
   – Assume that the target tag is the right one
– For bass
   – Assume play occurs with the music sense and fish occurs with the fish sense
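A minimal sketch of this seed-based collection step, assuming the corpus is given as a list of sentences; the seed words follow the slide, everything else is illustrative:

```python
def collect_seed_examples(sentences, target, seeds_by_sense):
    """Label sentences containing the target word by which seed word co-occurs with it.
    Returns {sense: [sentence, ...]}; this becomes the initial 'annotated corpus'."""
    labeled = {sense: [] for sense in seeds_by_sense}
    for sent in sentences:
        words = set(sent.lower().split())
        if target not in words:
            continue
        for sense, seed in seeds_by_sense.items():
            if seed in words:
                labeled[sense].append(sent)   # assume the seed identifies the right sense
    return labeled

corpus = [
    "He caught a huge bass while fishing for fish in the lake",
    "She will play the bass at tonight's concert",
]
print(collect_seed_examples(corpus, "bass", {"music": "play", "fish": "fish"}))
```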
Sentences extracted using “fish” and “play”
SimLin: why multiply by 2?
– common(A,B)
– description(A,B)
– The denominator describes both c1 and c2 (log P(c1) + log P(c2)), while the common information (the LCS) appears only once in the numerator; the factor of 2 balances this so that sim(c,c) = 1
Word similarity
– Synonymy is a binary relation
   – Two words are either synonymous or not
– We want a looser metric
   – Word similarity or word distance
   – Two words are more similar if they share more features of meaning
– Actually these are really relations between senses:
   – Instead of saying “bank is like fund”
   – We say: bank1 is similar to fund3; bank2 is similar to slope5
– We’ll compute them over both words and senses
Two classes of algorithms
– Thesaurus-based algorithms
   – Based on whether words are “nearby” in WordNet
– Distributional algorithms
   – By comparing words based on their context
   – “I like having X for dinner” – what are the possible values of X?
Thesaurus-based word similarity
– We could use anything in the thesaurus
   – Meronymy
   – Glosses
   – Example sentences
– In practice
   – By “thesaurus-based” we just mean using the is-a/subsumption/hypernym hierarchy
– Word similarity versus word relatedness
   – Similar words are near-synonyms
      – Car, bicycle: similar
   – Related words could be related any way
      – Car, gasoline: related, not similar
Path based similarity
– Two words are similar if they are nearby in the thesaurus hierarchy (i.e. there is a short path between them)
Refinements to path-based similarity
– pathlen(c1,c2) = number of edges in the shortest path between the sense nodes c1 and c2
– simpath(c1,c2) = -log pathlen(c1,c2)
– wordsim(w1,w2) = max over c1 ∈ senses(w1), c2 ∈ senses(w2) of sim(c1,c2)
Problem with basic path-based similarity
– Assumes each link represents a uniform distance
   – Nickel to money seems closer than nickel to standard
– Instead:
   – We want a metric which lets us represent the cost of each edge independently
Information content similarity metrics
– Let’s define P(c) as:
   – The probability that a randomly selected word in a corpus is an instance of concept c
   – Formally: there is a distinct random variable, ranging over words, associated with each concept in the hierarchy
   – P(root) = 1
   – The lower a node is in the hierarchy, the lower its probability
Information content similarity
– Train by counting in a corpus
   – 1 instance of “dime” could count toward the frequency of coin, currency, standard, etc.
– More formally:
   P(c) = ( Σ_{w ∈ words(c)} count(w) ) / N
Information content similarity
– WordNet hierarchy augmented with probabilities P(c)
Information content: definitions
– Information content:
   – IC(c) = -log P(c)
– Lowest common subsumer LCS(c1,c2)
   – I.e. the lowest node in the hierarchy that subsumes (is a hypernym of) both c1 and c2
Resnik method
– The similarity between two words is related to their common information
– The more two words have in common, the more similar they are
– Resnik: measure the common information as:
   – The info content of the lowest common subsumer of the two nodes
   – simresnik(c1,c2) = -log P(LCS(c1,c2))
SimLin(c1,c2) = 2 x log P(LCS(c1,c2)) / (log P(c1) + log P(c2))

SimLin(hill, coast) = 2 x log P(geological-formation) / (log P(hill) + log P(coast)) = .59
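A small sketch of both measures over a table of concept probabilities; the P(c) values and the LCS lookup are stand-ins chosen to roughly reproduce the .59 above, not computed from a real corpus:

```python
import math

# Assumed concept probabilities, decreasing as we go down the hierarchy.
p = {
    "geological-formation": 0.00176,
    "hill": 0.0000189,
    "coast": 0.0000216,
}
# Assumed lowest-common-subsumer lookup for this toy pair.
lcs = {("hill", "coast"): "geological-formation"}

def sim_resnik(c1, c2):
    return -math.log(p[lcs[(c1, c2)]])

def sim_lin(c1, c2):
    return 2 * math.log(p[lcs[(c1, c2)]]) / (math.log(p[c1]) + math.log(p[c2]))

print(sim_resnik("hill", "coast"))
print(sim_lin("hill", "coast"))   # roughly 0.59 with probabilities like these
```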
Extended Lesk
– Two concepts are similar if their glosses contain similar words
   – Drawing paper: paper that is specially prepared for use in drafting
   – Decal: the art of transferring designs from specially prepared paper to a wood or glass or metal surface
– For each n-word phrase that occurs in both glosses
   – Add a score of n²
   – “paper” and “specially prepared”: 1 + 4 = 5
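A minimal sketch of this overlap scoring, assuming simple whitespace tokenization and a greedy longest-phrase-first match; the real measure's phrase extraction is more involved:

```python
def lesk_overlap_score(gloss1, gloss2, max_n=3):
    """Sum n^2 over n-word phrases shared by the two glosses (greedy, simplified)."""
    w1, w2 = gloss1.lower().split(), gloss2.lower().split()
    phrases2 = set()
    for n in range(1, max_n + 1):
        for i in range(len(w2) - n + 1):
            phrases2.add(tuple(w2[i:i + n]))
    score = 0
    for n in range(max_n, 0, -1):                      # prefer longer phrases first
        for i in range(len(w1) - n + 1):
            phrase = tuple(w1[i:i + n])
            if phrase in phrases2:
                score += n * n
                w1[i:i + n] = ["<used>"] * n           # don't recount words inside this phrase
    return score

g1 = "paper that is specially prepared for use in drafting"
g2 = "the art of transferring designs from specially prepared paper to a wood or glass or metal surface"
print(lesk_overlap_score(g1, g2))   # "specially prepared" (4) + "paper" (1) = 5
```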
Summary: thesaurus-based similarity
Problems with thesaurus-based methods
– We don’t have a thesaurus for every language
– Even if we do, many words are missing
– They rely on hyponym info:
   – Strong for nouns, but lacking for adjectives and even verbs
– Alternative
   – Distributional methods for word similarity
Distributional methods for word similarity
– A bottle of tezgüino is on the table
– Everybody likes tezgüino
– Tezgüino makes you drunk
– We make tezgüino out of corn.
– Intuition:
   – Just from these contexts a human could guess the meaning of tezgüino
   – So we should look at the surrounding contexts and see what other words have similar contexts.
Context vector
– Consider a target word w
– Suppose we had one binary feature fi for each of the N words vi in the lexicon
   – fi means “word vi occurs in the neighborhood of w”
– w = (f1, f2, f3, …, fN)
– If w = tezguino, v1 = bottle, v2 = drunk, v3 = matrix:
   – w = (1, 1, 0, …)
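A minimal sketch of building such binary context vectors from a toy corpus; the window size, sentences, and lexicon are assumptions for illustration:

```python
def context_vector(target, sentences, lexicon, window=3):
    """Binary vector over the lexicon: 1 if the word appears within `window`
    tokens of any occurrence of `target`."""
    seen = set()
    for sent in sentences:
        words = sent.lower().split()
        for i, w in enumerate(words):
            if w == target:
                seen.update(words[max(0, i - window): i + window + 1])
    return [1 if v in seen else 0 for v in lexicon]

sentences = [
    "a bottle of tezguino is on the table",
    "tezguino makes you drunk",
]
lexicon = ["bottle", "drunk", "matrix"]
print(context_vector("tezguino", sentences, lexicon))   # [1, 1, 0]
```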
Intuition
– Define two words by these sparse feature vectors
– Apply a vector distance metric
– Say that two words are similar if their vectors are similar
Distributional similarity
– So we just need to specify 3 things
   1. How the co-occurrence terms are defined
   2. How terms are weighted (frequency? logs? mutual information?)
   3. What vector distance metric should we use? (cosine? Euclidean distance?)
Defining co-occurrence vectors
– He drinks X every morning
– Idea: parse the sentence, extract syntactic dependencies:
Co-occurrence vectors based on dependencies
Measures of association with context
– We have been using the frequency of some feature as its weight or value
– But we could use any function of this frequency
– Let’s consider one feature f = (r, w’) = (obj-of, attack)
   – P(f|w) = count(f, w) / count(w)
   – assocprob(w, f) = P(f|w)
Weighting: Mutual Information
– Pointwise mutual information: a measure of how often two events x and y occur, compared with what we would expect if they were independent:
   PMI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]
– PMI between a target word w and a feature f:
   assocPMI(w, f) = log2 [ P(w, f) / (P(w) P(f)) ]
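A small sketch of PMI weighting over hypothetical (word, feature) counts; the pair counts below are invented for illustration:

```python
import math

# Invented counts of (word, feature) pairs, e.g. feature = (obj-of, attack).
pair_counts = {("city", "obj-of:attack"): 20, ("city", "obj-of:visit"): 50,
               ("idea", "obj-of:attack"): 2,  ("idea", "obj-of:visit"): 1}
total = sum(pair_counts.values())

def p_word(w):
    return sum(c for (w2, _), c in pair_counts.items() if w2 == w) / total

def p_feat(f):
    return sum(c for (_, f2), c in pair_counts.items() if f2 == f) / total

def pmi(w, f):
    p_wf = pair_counts.get((w, f), 0) / total
    return math.log2(p_wf / (p_word(w) * p_feat(f))) if p_wf > 0 else float("-inf")

print(pmi("city", "obj-of:attack"))
print(pmi("idea", "obj-of:attack"))
```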
Mutual information intuition
– Objects of the verb drink
Lin is a variant on PMI
– Pointwise mutual information: how often two events x and y occur, compared with what we would expect if they were independent
– PMI between a target word w and a feature f
– The Lin measure breaks down the expected value for P(f) differently
Similarity measures
What is the baseline algorithm (Lin & Hovy paper)?
– Good question, they don’t say!
– Random selection
– First N words
TextTiling: Segmenting Text into Multiparagraph Subtopic Passages
Unsupervised Discourse Segmentation
– Hearst (1997): 21-paragraph science news article called “Stargazers”
– Goal: produce the following subtopic segments:
Intuition of cohesion-based segmentation
– Sentences or paragraphs in a subtopic are cohesive with each other
– But not with paragraphs in a neighboring subtopic
– Thus if we measured the cohesion between every pair of neighboring sentences
   – We might expect a ‘dip’ in cohesion at subtopic boundaries.
TextTiling (Hearst 1997)
1. Tokenization
   – Each space-delimited word
   – Converted to lower case
   – Throw out stop-list words
   – Stem the rest
   – Group into pseudo-sentences of length w = 20
2. Lexical Score Determination: cohesion score
   – Average similarity (cosine measure) between the blocks of pseudo-sentences on either side of each gap (see the sketch after this list)
3. Boundary Identification
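A minimal sketch of the cohesion-scoring step, assuming pseudo-sentences have already been built as bags of stems; the block size and toy data are assumptions, not Hearst's exact settings:

```python
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def gap_scores(pseudo_sentences, block=2):
    """Cohesion at each gap: cosine between the blocks just before and just after it.
    Boundaries are then placed where the score dips."""
    scores = []
    for g in range(block, len(pseudo_sentences) - block + 1):
        left = sum((Counter(ps) for ps in pseudo_sentences[g - block:g]), Counter())
        right = sum((Counter(ps) for ps in pseudo_sentences[g:g + block]), Counter())
        scores.append(cosine(left, right))
    return scores

# Toy pseudo-sentences (already lower-cased, stopped, stemmed).
pseudo = [["star", "galaxi"], ["star", "telescop"], ["telescop", "galaxi"],
          ["vote", "elect"], ["elect", "senat"], ["vote", "ballot"]]
print(gap_scores(pseudo))   # a dip at the gap where the topic shifts from astronomy to politics
```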
TextTiling algorithm
Cosine
– cos(v, w) = (v · w) / (|v| |w|)
Could you use stemming to compare synsets (distances between words)?
– E.g. musician → music
– Is there a way to deal with inconsistent granularities of relations?
Zipf’s law and Heaps’ law
– Both are related to the fact that there are a lot of words that one would see very rarely
– This means that when we build language models (estimate probabilities of words) we will have unreliable estimates for many of them
   – This is why we were talking about smoothing!
Heaps’ law: estimating the number of terms
– M = k T^b
   – M: vocabulary size (number of terms)
   – T: number of tokens
   – 30 < k < 100
   – b = 0.5
– Linear relation between vocabulary size and number of tokens in log-log space
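A tiny illustration of the formula; the particular k and b values are assumptions chosen inside the ranges quoted above:

```python
def heaps_vocab_size(num_tokens, k=44, b=0.5):
    """Heaps' law: M = k * T^b, with assumed k and b (30 < k < 100, b ~ 0.5)."""
    return k * num_tokens ** b

for T in (10_000, 1_000_000, 100_000_000):
    print(T, int(heaps_vocab_size(T)))   # vocabulary grows roughly with the square root of T
```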
Zipf’s law: modeling the distribution of terms
– The collection frequency cf_i of the i-th most common term is proportional to 1/i:
   cf_i ∝ 1/i
– If the most frequent term occurs cf_1 times, then the second most frequent term has half as many occurrences, the third most frequent term has a third as many, etc.
– Equivalently, cf_i = c · i^k with k = −1, so log cf_i = log c + k log i (a straight line in log-log space)
Bigram Model
– Approximate P(wn | w1 … wn-1) by P(wn | wn-1)
   – e.g. P(unicorn | the mythical) by P(unicorn | mythical)
– Markov assumption: the probability of a word depends only on the probability of a limited history
– Generalization: the probability of a word depends only on the probability of the n previous words
   – trigrams, 4-grams, …
   – the higher n is, the more data is needed to train
   – backoff models…
A Simple Example: bigram model
– P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(<end> | food)
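A minimal sketch of estimating bigram probabilities from counts and scoring a sentence this way; the toy corpus is invented for illustration and no smoothing is applied:

```python
from collections import Counter

corpus = [
    "<start> I want to eat Chinese food <end>",
    "<start> I want to eat lunch <end>",
]
tokens = [s.split() for s in corpus]
bigrams = Counter((s[i], s[i + 1]) for s in tokens for i in range(len(s) - 1))
unigrams = Counter(w for s in tokens for w in s)

def p_bigram(w, prev):
    """MLE estimate P(w | prev) = count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

def sentence_prob(sentence):
    words = ["<start>"] + sentence.split() + ["<end>"]
    prob = 1.0
    for prev, w in zip(words, words[1:]):
        prob *= p_bigram(w, prev)
    return prob

print(sentence_prob("I want to eat Chinese food"))   # 0.5: only "eat" has two continuations
```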
Generating WSJ
Maximum tf normalization: wasn’t useful for summarization, why discuss it then?
Accuracy, precision and recall
Accuracy
– Accuracy = (tp + tn) / (tp + tn + fp + fn)
– Problematic measure for IR evaluation
   – 99.9% of the documents will be nonrelevant
   – High performance is trivially achieved (e.g. by calling every document nonrelevant)
Precision
– Precision = tp / (tp + fp): the fraction of retrieved documents that are relevant
Recall
– Recall = tp / (tp + fn): the fraction of relevant documents that are retrieved
Precision/Recall trade-off
– Which is more important depends on the user’s needs
   – Typical web users
      – Want high precision in the first page of results
   – Paralegals and intelligence analysts
      – Need high recall
      – Willing to tolerate some irrelevant documents as a price
F-measure
– F = 2PR / (P + R), the harmonic mean of precision P and recall R (the balanced case of the weighted F-measure)
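A small sketch computing these measures from a hypothetical confusion matrix; the tp/fp/fn/tn counts are invented for illustration:

```python
def ir_metrics(tp, fp, fn, tn):
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Invented counts: accuracy looks great even though recall is poor,
# because almost all documents are non-relevant (tn dominates).
print(ir_metrics(tp=10, fp=5, fn=40, tn=9945))
```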

Explain what vector representation means? I tried explaining it to my mother and had difficulty as to what this “vector” implies. She kept saying she thought of vectors in terms of graphs.

How can NLP be used in spam detection?
– Using the entire string as a feature? NO!

Topic categorization: classify the document into semantic topics
– “The U.S. swept into the Davis Cup final on Saturday when twins Bob and Mike Bryan defeated Belarus's Max Mirnyi and Vladimir Voltchkov to give the Americans an unsurmountable 3-0 lead in the best-of-five semi-final tie.”
– “One of the strangest, most relentless hurricane seasons on record reached new bizarre heights yesterday as the plodding approach of Hurricane Jeanne prompted evacuation orders for hundreds of thousands of Floridians and high wind warnings that stretched 350 miles from the swamp towns …”
How do you find verbs that are strongly associated with a given subject?
– Same idea as selectional preference strength above (compare eat x [FOOD] with be y [PRETTY-MUCH-ANYTHING])
– Compare the distribution of expected semantic classes overall (FOOD, PEOPLE, LIQUIDS) with the distribution of expected semantic classes for a particular verb
– The greater the difference between these distributions, the more information the verb is giving us about possible objects