Log likelihood statistic and topic signature words (Lecture 10)

The logarithm grows very slowly (natural log shown)
log(1) = 0
log(2) = 0.6931472
log(3) = 1.098612
log(4) = 1.386294
log(5) = 1.609438
log(6) = 1.791759
log(7) = 1.94591
log(8) = 2.079442
log(9) = 2.197225
log(10) = 2.302585
log(x) < 0 if 0 < x < 1
log(ab) = log(a) + log(b)
log(a^n) = n log(a)
Log probability arithmetic

Probabilities P(w) are very small numbers
– Underflow
Using log probabilities solves the problem
Working with logs is also more efficient
– Summation instead of multiplication
– log(p1p2p3…pn) = log(p1) + log(p2) +…+ log(pn)
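
A minimal sketch of the underflow problem and the log fix. The word probabilities below (400 words, each with probability 0.001) are made up for illustration and do not come from the lecture.

```python
import math

# Hypothetical data: 400 words, each with probability 0.001.
word_probs = [0.001] * 400

# Direct multiplication: the true value is 1e-1200, far below the
# smallest positive double (about 5e-324), so the product underflows.
product = 1.0
for p in word_probs:
    product *= p

# Summing log probabilities stays in a comfortably representable range.
log_product = sum(math.log(p) for p in word_probs)

print(product)      # 0.0 (underflow)
print(log_product)  # roughly -2763.1
```

The sum of logs carries the same information as the product, so two such scores can still be compared even though the product itself cannot be stored.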
Plain text

Howe, the fourth Cabinet member to resign
because of disputes over policy toward the
European Community, shocked the House of
Commons on Nov. 13 by calling Mrs.
Thatcher a threat to Britain 's vital interests.
Part of speech tagging

Howe_NNP ,_, the_DT fourth_JJ Cabinet_NNP
member_NN to_TO resign_VB because_IN of_IN
disputes_NNS over_IN policy_NN toward_IN the_DT
European_NNP Community_NNP ,_, shocked_VBD
the_DT House_NNP of_IN Commons_NNPS on_IN
Nov._NNP 13_CD by_IN calling_VBG Mrs._NNP
Thatcher_NNP a_DT threat_NN to_TO Britain_NNP
's_POS vital_JJ interests_NNS ._.
Named entity recognition

Howe/PERSON ,/O the/O fourth/O
Cabinet/ORGANIZATION member/O to/O resign/O
because/O of/O disputes/O over/O policy/O
toward/O the/O European/LOCATION
Community/LOCATION ,/O shocked/O the/O
House/ORGANIZATION of/ORGANIZATION
Commons/ORGANIZATION on/O Nov./O 13/O by/O
calling/O Mrs./PERSON Thatcher/PERSON a/O
threat/O to/O Britain/LOCATION 's/O vital/O
interests/O ./O
Likelihood ratio (section 5.3.4)

One more way of deciding “is a given word
representative of what an article is about?”

T---cluster of articles (such as those we have
seen in homework 1)
NT---background collection

– For homework 1 we also used a background
collection to compute idf. This was all the articles
that we are not currently summarizing
Is a word w a topic word?

Two possibilities
– Either w is very indicative of the topic of the
cluster and appears more often in T than in NT
– Or, w occurs with the same frequency in both T
and NT
H1: P(w | T) = P(w | NT) = p
H2: P(w | T) = p1 and P(w | NT) = p2 and p1 > p2
What is the likelihood of a word
occurring n times in a document?

Binomial distribution
– Word w occurring == success
– Other word occurring == failure
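
As a rough sketch, the binomial likelihood of seeing word w exactly k times among n word positions, when each position is w with probability p, is b(k; n, p) = C(n, k) p^k (1 - p)^(n - k). The function name and the example numbers below are illustrative assumptions; the computation is done in log space, and 0 < p < 1 is assumed.

```python
import math

def binomial_log_likelihood(k, n, p):
    """Log of b(k; n, p) = C(n, k) * p**k * (1 - p)**(n - k).

    Assumes 0 < p < 1 and 0 <= k <= n.
    """
    # log C(n, k) via log-gamma, which avoids huge factorials.
    log_coeff = (math.lgamma(n + 1)
                 - math.lgamma(k + 1)
                 - math.lgamma(n - k + 1))
    return log_coeff + k * math.log(p) + (n - k) * math.log(1 - p)

# Hypothetical example: 3 occurrences of w in 10 word positions, p = 0.3.
print(math.exp(binomial_log_likelihood(3, 10, 0.3)))
```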
What is the likelihood of the data under
the two models?
-2 log(lambda), where lambda = L(H1) / L(H2), has
approximately a chi-square distribution
So we can look up the probability of getting a
given value under H1; if the probability is very low,
we can reject H1 and assume H2 holds
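
One possible way to put the pieces together, sketched under stated assumptions: c1 and c2 are the counts of w in T and NT, n1 and n2 are the total word counts of T and NT, and the probabilities are estimated as p = (c1 + c2) / (n1 + n2), p1 = c1 / n1, p2 = c2 / n2. The function names, the example counts, and the 0.001 significance cutoff are illustrative choices, not values fixed by the lecture. The binomial coefficients cancel in the ratio, so they are omitted.

```python
import math

def _log_l(k, n, p):
    """Log of p**k * (1 - p)**(n - k), treating 0 * log(0) as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1 - p)
    return total

def neg2_log_lambda(c1, n1, c2, n2):
    """-2 log(lambda), with lambda = L(H1) / L(H2)."""
    p = (c1 + c2) / (n1 + n2)   # H1: one shared probability
    p1 = c1 / n1                # H2: probability in T
    p2 = c2 / n2                # H2: probability in NT
    log_h1 = _log_l(c1, n1, p) + _log_l(c2, n2, p)
    log_h2 = _log_l(c1, n1, p1) + _log_l(c2, n2, p2)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, -2.0 * (log_h1 - log_h2))

def chi2_sf_1df(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def is_topic_word(c1, n1, c2, n2, alpha=0.001):
    """True if H1 is rejected at level alpha and w is more frequent in T."""
    stat = neg2_log_lambda(c1, n1, c2, n2)
    return chi2_sf_1df(stat) < alpha and c1 / n1 > c2 / n2

# Hypothetical counts: w occurs 30 times in a 10,000-word cluster and
# 20 times in a 1,000,000-word background collection.
print(neg2_log_lambda(30, 10_000, 20, 1_000_000))
print(is_topic_word(30, 10_000, 20, 1_000_000))
```

The statistic itself does not say which collection the word favors, so the sketch adds an explicit check that w is relatively more frequent in T before calling it a topic word.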