Measures of association: chi-square test, mutual information, binomial distribution and log likelihood ratio

Lecture 8

Experiments in Multidocument summarization (SNM’02)

- Summarization system based on a range of features
- Raises issues we have not discussed up to now
  - Non-extractive techniques
  - Ordering of information

Lead values feature

- Lead sentences of news articles can often make excellent brief summaries
  - But for multi-document summaries there are several first sentences, so it is difficult to choose!
- They are information dense
- Can we find very informative words based on this observation?
- Used a binomial test to decide if $P(w \text{ in lead}) > P(w \text{ anywhere})$

Sample lead words

[Slide content not preserved.]

Verb specificity

- Compare “arrest” with “do” or “be”
- Often given subjects are very strongly associated with a verb
  - Actors appear in movies
  - Singers release an album
- Compute associations between subject nouns and verbs
- Use the mutual association measure

$$I(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)} = \log \frac{P(x \mid y)}{P(x)}$$

Concept sets

- Word frequencies are not that reliable, even when stemming is used
- Synonyms, hypernyms and hyponyms from WordNet

Other features

- Location: a negative value that penalizes sentences that appear late in the document.
- Publication Date: additional value for the most recent documents, on the assumption that users will want the most up-to-date information.
- Target: indicates the presence of the central personage in the document cluster, if one exists.
- Length: a penalty for sentences that are below a minimum (15 words) or above a maximum (30 words). Short sentences often require some introduction or reference resolution, or else are a kind of interjection. Long sentences can cover multiple thoughts that are often found elsewhere in the document cluster in single sentences.
- Others: indicates the presence of any named entity, weighted by the frequency of that entity across all documents.
- Pronoun: a negative value on sentences that have pronouns at the beginning of the sentence.
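To make the mutual association measure from the verb specificity slide concrete, here is a minimal sketch that estimates $I(x, y) = \log \frac{P(x, y)}{P(x)P(y)}$ from raw (subject, verb) pair counts. The function name `association_scores` and the toy pairs are hypothetical and chosen only for illustration; they are not data from the summarization system, and the probabilities are plain relative frequencies with no smoothing.

```python
import math
from collections import Counter


def association_scores(pairs):
    """Return log(P(x, y) / (P(x) P(y))) for every observed (subject, verb) pair.

    Probabilities are maximum-likelihood estimates (relative frequencies)
    over the list of pairs; this is an illustrative assumption.
    """
    total = len(pairs)
    pair_counts = Counter(pairs)
    subject_counts = Counter(subject for subject, _ in pairs)
    verb_counts = Counter(verb for _, verb in pairs)

    scores = {}
    for (subject, verb), count in pair_counts.items():
        p_xy = count / total
        p_x = subject_counts[subject] / total
        p_y = verb_counts[verb] / total
        scores[(subject, verb)] = math.log(p_xy / (p_x * p_y))
    return scores


# Hypothetical toy data: each tuple is one observed (subject noun, verb) pair.
toy_pairs = [
    ("actor", "appear"), ("actor", "appear"), ("actor", "be"),
    ("singer", "release"), ("singer", "release"), ("singer", "be"),
    ("police", "arrest"), ("police", "be"),
]

scores = association_scores(toy_pairs)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

On this toy data the specific verbs ("arrest", "appear", "release") score higher with their subjects than the generic verb "be", which is the contrast the slide draws between “arrest” and “do” or “be”. With real corpora, rare pairs get inflated scores, so some frequency cutoff would likely be needed.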
Other issues

- Sentence ordering
  - How to present the selected information?
  - Even good choices might be hard to understand if they are presented in the wrong order
  - Imagine a newspaper article with all sentences randomly permuted
- Noun phrases
  - Depend on the context

Extractive summary

[Slide content not preserved.]

Partly modified summary

[Slide content not preserved.]

Measures of association

- For supervised learning, they can help us determine which features are predictive of the distinctions we want to make
  - Chi-square test from last lecture
- Words that are likely to appear in the first sentence rather than anywhere else
- Verbs that are strongly associated with a given subject
- A variety of measures are defined in the Chapter 5 reading

χ² statistic (CHI)

- χ² statistic (pronounced “kai square”)
  - A commonly used method of comparing proportions.
  - Measures the lack of independence between a term and a category

χ² statistic (CHI)

Is “jaguar” a good predictor for the “auto” class?

|              | Term = jaguar | Term ≠ jaguar |
|--------------|---------------|---------------|
| Class = auto | 2             | 500           |
| Class ≠ auto | 3             | 9500          |

We want to compare:

- the observed distribution above; and
- the null hypothesis: that jaguar and auto are independent

χ² statistic (CHI)

Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect?

- If independent: Pr(j,a) = Pr(j) × Pr(a)
- So there would be N × Pr(j,a), i.e. N × Pr(j) × Pr(a), co-occurrences of “jaguar” and “auto”
- Pr(j) = (2+3)/N; Pr(a) = (2+500)/N; N = 2+3+500+9500 = 10005
- N × (5/N) × (502/N) = 2510/N = 2510/10005 ≈ 0.25

χ² statistic (CHI)

Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect?

Observed counts $f_o$, with the expected count $f_e$ in parentheses:

|              | Term = jaguar | Term ≠ jaguar |
|--------------|---------------|---------------|
| Class = auto | 2 (0.25)      | 500           |
| Class ≠ auto | 3             | 9500          |

χ² statistic (CHI)

Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect?
Observed counts $f_o$, with expected counts $f_e$ in parentheses:

|              | Term = jaguar | Term ≠ jaguar |
|--------------|---------------|---------------|
| Class = auto | 2 (0.25)      | 500 (502)     |
| Class ≠ auto | 3 (4.75)      | 9500 (9498)   |

χ² statistic (CHI)

χ² is interested in $(f_o - f_e)^2 / f_e$ summed over all table entries:

$$\chi^2(j, a) = \sum \frac{(O - E)^2}{E} = \frac{(2 - 0.25)^2}{0.25} + \frac{(3 - 4.75)^2}{4.75} + \frac{(500 - 502)^2}{502} + \frac{(9500 - 9498)^2}{9498} \approx 12.9 \quad (p < .001)$$

The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the value for .999 confidence).

χ² statistic (CHI)

There is a simpler formula for χ² in the 2×2 case:

$$\chi^2 = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$$

- A = #(t,c)
- B = #(t,¬c)
- C = #(¬t,c)
- D = #(¬t,¬c)
- N = A + B + C + D

Finding translation equivalents

[Slide content not preserved.]

Binomial distribution

- k: number of “successes”
- n: number of trials
- x: probability of success

$$B(k, n, x) = \frac{n!}{k!\,(n - k)!}\, x^{k} (1 - x)^{n - k}$$

Log likelihood ratio test

[Slide content not preserved.]
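Returning to the jaguar/auto example from the χ² slides, the following sketch recomputes the statistic in two ways: by summing $(O - E)^2 / E$ over the four cells, and with the standard closed form for a 2×2 table, $N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))$, where A = #(t,c), B = #(t,¬c), C = #(¬t,c), D = #(¬t,¬c). The function names are hypothetical and this is only an illustrative check of the arithmetic, not code from the lecture.

```python
def expected_count(row_total, col_total, n):
    """Expected cell count under independence: N * Pr(row) * Pr(col)."""
    return row_total * col_total / n


def chi_square_from_expected(a, b, c, d):
    """Sum (O - E)^2 / E over the 2x2 table.

    a = #(t, c), b = #(t, not c), c = #(not t, c), d = #(not t, not c).
    """
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    term_totals = [a + b, c + d]    # term present / term absent
    class_totals = [a + c, b + d]   # in class / not in class
    total = 0.0
    for i in range(2):
        for j in range(2):
            e = expected_count(term_totals[i], class_totals[j], n)
            total += (observed[i][j] - e) ** 2 / e
    return total


def chi_square_shortcut(a, b, c, d):
    """Closed-form chi-square for a 2x2 table (same argument order as above)."""
    n = a + b + c + d
    return n * (a * d - c * b) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))


# jaguar/auto counts from the slides: A=2, B=3, C=500, D=9500
print(expected_count(5, 502, 10005))             # about 0.25, as on the slide
print(chi_square_from_expected(2, 3, 500, 9500))
print(chi_square_shortcut(2, 3, 500, 9500))
```

Both functions give about 12.85 with unrounded expected counts. The slide's 12.9 comes from plugging in the rounded expected values (0.25, 4.75, 502, 9498); either way the statistic exceeds the 10.83 threshold.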