Log likelihood statistic and topic signature words
Lecture 10

Log(x) grows very slowly (natural logarithm values)
log(1) = 0
log(2) = 0.6931472
log(3) = 1.098612
log(4) = 1.386294
log(5) = 1.609438
log(6) = 1.791759
log(7) = 1.94591
log(8) = 2.079442
log(9) = 2.197225
log(10) = 2.302585

Log(x) < 0 if 0 < x < 1
$\log(ab) = \log(a) + \log(b)$
$\log(a^n) = n \log(a)$

Log probability arithmetic
Probabilities P(w) are very small numbers
  – Underflow
  – Using log probabilities solves the problem
Working with logs is also more efficient
  – Summation instead of multiplication
$\log(p_1 p_2 p_3 \cdots p_n) = \log(p_1) + \log(p_2) + \cdots + \log(p_n)$

Plain text
Howe, the fourth Cabinet member to resign because of disputes over policy toward the European Community, shocked the House of Commons on Nov. 13 by calling Mrs. Thatcher a threat to Britain's vital interests.

Part of speech tagging
Howe_NNP ,_, the_DT fourth_JJ Cabinet_NNP member_NN to_TO resign_VB because_IN of_IN disputes_NNS over_IN policy_NN toward_IN the_DT European_NNP Community_NNP ,_, shocked_VBD the_DT House_NNP of_IN Commons_NNPS on_IN Nov._NNP 13_CD by_IN calling_VBG Mrs._NNP Thatcher_NNP a_DT threat_NN to_TO Britain_NNP 's_POS vital_JJ interests_NNS ._.

Named entity recognition
Howe/PERSON ,/O the/O fourth/O Cabinet/ORGANIZATION member/O to/O resign/O because/O of/O disputes/O over/O policy/O toward/O the/O European/LOCATION Community/LOCATION ,/O shocked/O the/O House/ORGANIZATION of/ORGANIZATION Commons/ORGANIZATION on/O Nov./O 13/O by/O calling/O Mrs./PERSON Thatcher/PERSON a/O threat/O to/O Britain/LOCATION 's/O vital/O interests/O ./O

Likelihood ratio (section 5.3.4)
One more way of deciding “is a given word representative of what an article is about?”
T: cluster of articles (such as those we have seen in homework 1)
NT: background collection
  – For homework 1 we also used a background collection to compute idf. It consisted of all the articles that we were not currently summarizing.

Is a word w a topic word?
Two possibilities
  – Either w is very indicative of the topic of the cluster and appears more often in T than in NT
  – Or w occurs with the same frequency in both T and NT
$H_1: P(w \mid T) = P(w \mid NT) = p$
$H_2: P(w \mid T) = p_1$ and $P(w \mid NT) = p_2$ and $p_1 > p_2$

What is the likelihood of a word occurring n times in a document?
Binomial distribution
  – Word w occurring == success
  – Other word occurring == failure

What is the likelihood of the data under the two models?

$-2\log\lambda$ has (asymptotically) a chi-square distribution
Here $\lambda$ is the ratio of the likelihood of the data under H1 to its likelihood under H2.
So we can look up the probability of getting a given value if H1 were true; if that probability is very low, we reject H1 and assume H2 holds.
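The test described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation. The function names, the example counts, and the threshold are assumptions made for the sketch: p, p1 and p2 are estimated by relative frequency, and 10.83 is the chi-square critical value for one degree of freedom at the 0.001 level. The binomial coefficients are the same under both hypotheses, so they cancel in the ratio and are left out. Everything is computed as a sum of logs, which is what keeps the very small probabilities from underflowing.

```python
import math


def log_binomial_kernel(k, n, p):
    """Return log(p^k * (1 - p)^(n - k)), treating 0 * log(0) as 0.

    The binomial coefficient is omitted on purpose: it is identical
    under H1 and H2 and cancels in the likelihood ratio.
    """
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total


def neg2_log_lambda(k1, n1, k2, n2):
    """Compute -2 log(lambda) for a word.

    k1, n1: occurrences of the word in T and total words in T
    k2, n2: occurrences of the word in NT and total words in NT
    (assumed inputs; n1 and n2 must be positive)
    """
    p = (k1 + k2) / (n1 + n2)   # H1: one shared probability
    p1 = k1 / n1                # H2: probability in T
    p2 = k2 / n2                # H2: probability in NT
    log_l_h1 = log_binomial_kernel(k1, n1, p) + log_binomial_kernel(k2, n2, p)
    log_l_h2 = log_binomial_kernel(k1, n1, p1) + log_binomial_kernel(k2, n2, p2)
    # lambda = L(H1) / L(H2), so -2 log(lambda) = 2 * (log L(H2) - log L(H1))
    return 2.0 * (log_l_h2 - log_l_h1)


def is_topic_word(k1, n1, k2, n2, threshold=10.83):
    """Flag the word when it is more frequent in T and the statistic is large.

    threshold=10.83 is an assumed cutoff (chi-square, 1 degree of
    freedom, 0.001 significance level).
    """
    more_frequent_in_t = k1 / n1 > k2 / n2
    return more_frequent_in_t and neg2_log_lambda(k1, n1, k2, n2) > threshold


if __name__ == "__main__":
    # Multiplying many small probabilities underflows to 0.0,
    # while the equivalent sum of logs stays finite.
    product = 1.0
    for _ in range(1000):
        product *= 0.01
    log_sum = 1000 * math.log(0.01)
    print(product, log_sum)

    # Hypothetical counts: 30 of 1,000 words in T, 50 of 100,000 in NT.
    print(neg2_log_lambda(30, 1000, 50, 100000))
    print(is_topic_word(30, 1000, 50, 100000))
```

A word that occurs at the same rate in T and NT gets a statistic of zero, since the H1 and H2 estimates coincide. A different significance level only requires a different threshold, for example 3.84 for the 0.05 level.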