Automated Text Summarization – IJCNLP-2005 Tutorial Chin-Yew Lin Natural Language Group Information Sciences Institute University of Southern California cyl@isi.edu http://www.isi.edu/~cyl How Much Information? (2003) • Print, film, magnetic, and optical storage media produced about 5 exabytes (1018) of new information in 2002. 92% of the new information was stored on magnetic media, mostly in hard disks. – If digitized with full formatting, the 17 million books in the Library of Congress contain about 136 terabytes of information; – 5 exabytes of information is equivalent in size to the information contained in 37,000 new libraries the size of the Library of Congress book collections. • The amount of new information stored on paper, film, magnetic, and optical media has about doubled during 1999 – 2002. • Information flows through electronic channels – telephone, radio, TV, and the Internet – contained almost 18 exabytes of new information in 2002, three and a half times more than is recorded in storage media (98% are telephone calls). * Source: How much information 2003 * http://www.sims.berkeley.edu/research/projects/how-much-info2003/execsum.htm#summary Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Table of contents 1. 2. 3. 4. 5. Motivation and Introduction. Approaches and paradigms. Summarization methods. Evaluating summaries. The future. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Which School is Better? Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Is this Shoe Good or Bad? Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 What’s in the News? Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 How to Get from A to B? Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Get a Job! *About.com Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Which Movie for Friday Night? Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Browse a Summarized Web * Stanford PowerBrowser Project, Orkut et al. WWW10, 2001 Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 The Big & Bigger Search Engines Search Engine Reported Size Page Depth Google 8.1 this billion 101K posted a “As those of you who follow blog know, I recently weather you to material changes to our index. MSN report alerting 5.0 billion 150K Since that post, we've seen discussion from webmasters Yahoo 4.2some billion * 500K who have noticed more of their documents in our index. As it Ask Jeeves 2.5 billion 101K turns out we have grown our index and just reached a significant * Nov, 2004 searchenginewatch.com milestone at Yahoo! Search — our index now provides BillionstoOfover Textual Indexed (6,we 2002 - 9, 2003) access 20 Documents billion items . While typically don't disclose size (since we've always said that size is only one dimension of the quality of a search engine), for those who are curious this update includes just over 19.2 billion web documents, 1.6 billion images, and over 50 million audio and video files. Note that as with all index updates we are still tuning things so youユll continue to see some fluctuation in ranking over the next few weeks.” 8/8/2005 Yahoo! Search Blog Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 The Bottom Line Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Solution: Summarization • Informed decision through summarization • What is summarization? – A summary is a concise restatement of the topic and main ideas of its source. • Concise – giving a lot of information clearly and in a few words. (Oxford American Dictionary) • Restatement – in your own words. • Topic – what is the source about? • Main ideas – important facts or arguments about the topic. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Kinds of Summaries • Based on two different perspectives – The author’s point of view • Paraphrase summary (indicative vs. informative) – In the fall of 1978, Fran Tate wrote some hot checks to finance the opening of a Mexican restaurant in Barrow, Alaska. All the banks she had applied to for financing had turned her down. But Fran was sure that Barrow lusted for a Mexican restaurant, so she went ahead with her plans. (“In Alaska: Where the Chili is Chilly”, by Gregory Jaynes) – The summary writer’s point of view • Analytical summary – For Gregory Jaynes, Fran Tate embodies the astonishing independence and toughness that is typical of Alaska at its best. Tate has opened a Mexican restaurant – an unlikely enterprise – in Barrow, Alaska. From “Reaching Across the Curriculum”, ©1997, Eva Thury and Carl Drott. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 How to Summarize? • Topic Identification – Find the most important information – the topic. • The Egyptian civilization was one of the most important cultures of the ancient world. This civilization flourished along the rich banks and delta of the Nile River for many centuries, from 3200 B.C. until the Roman conquest in 30 B.C. …* – Find main ideas that support the topic and show how they are related. • Among the many contributions made by the Egyptian culture is the hieroglyphic writing system they invented. This is one of the earliest writing systems known and it was used from about 3200 B.C. until 350 A.D. Hieroglyphics are a form of picture writing, in which each symbol stands for either a single object or a single sound. The symbols can be combined into long strings to form words and sentences. Other cultures, such as the Hittites, the Cretans, and the Mayans, also developed picture writing, but these systems are not related to the Egyptian system, nor to one another.* • Topic Interpretation (I) – Combine several main ideas into a single sentence. • Among the many contributions made by the Egyptian culture is the hieroglyphic writing system they invented. This is one of the earliest writing system known and it was used from about 3200 B.C. until 350 A.D. Hieroglyphics are a form of picture writing, in which each symbol stands for either a single object or a single sound. The symbols can be combined into long strings to form words and sentences. Other cultures, such as the Hittites, the Cretans, and the Mayans, also developed picture writing, but these systems are not related to the Egyptian system, nor to one another.* The Egyptians invented hieroglyphics, a form of picture writing, which is one of the earliest writing systems known.* (* Examples adapted from Summary Street ® http://lsa.colorado.edu/summarystreet) Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 How to Summarize? (cont.) • Topic Interpretation (II) – Substitute a general form for lists of items or events. • List of items:John bought some milk, bread, fruit, cheese, potato chips, butter, hamburger meat and buns. * John bought some groceries.* • List of events:A lot of children came and brought presents. They played games and blew bubbles at each other. A magician came and showed them some magic. Later Jennifer opened her presents and blew out the candles on her cake.* Jennifer had a birthday party.* – Remove trivial and redundant information • We have gotten word of an enormous deal in the publishing deal. Hillary Clinton, senator-elect has accepted an $8 million dollars advance to publish her book. That is a figure, surpassed only by the $8 1/2 million the pope received for his book. Senator-elect Hillary Rodham Clinton Friday night agreed to sell Simon & Schuster a memoir of her years as first lady, for the near-record advance of about $8 million. Senator-elect Hillary Rodham Clinton agreed to sell Simon & Schuster a memoir of her years as First Lady, for the near-record advance of about $8 million. (* Examples adapted from Summary Street ® http://lsa.colorado.edu/summarystreet) • Generation (Writing Summaries) Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Aspects of Summarization • Input – – – – – (Sparck Jones 97,Hovy and Lin 99) Single-document vs. multi-document...fuse together texts? Domain-specific vs. general...use domain-specific techniques? Genre...use genre-specific (newspaper, report…) techniques? Scale and form…input large or small? Structured or free-form? Monolingual vs. multilingual...need to cross language barrier? • Purpose – – – – Situation...embedded in larger system (MT, IR) or not? Generic vs. query-oriented...author’s view or user’s interest? Indicative vs. informative...categorization or understanding? Background vs. just-the-news…does user have prior knowledge? • Output – Extract vs. abstract...use text fragments or re-phrase content? – Domain-specific vs. general...use domain-specific format? – Style…make informative, indicative, aggregative, analytic... Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Table of contents 1. 2. 3. 4. 5. Motivation and Introduction. Approaches and paradigms. Summarization methods. Evaluating summaries. The future. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Two Psycholinguistic Studies • Coarse-grained summarization protocols from professional summarizers (Kintsch and van Dijk, 78): – Delete material that is trivial or redundant. – Use superordinate concepts and actions. – Select or invent topic sentence. • 552 finely-grained summarization strategies from professional summarizers (Endres-Niggemeyer, 98): – – – – Self control: make yourself feel comfortable. Processing: produce a unit as soon as you have enough data. Info organization: use “Discussion” section to check results. Content selection: the table of contents is relevant. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Computational Approach: Basics Top-Down: Bottom-Up: • I know what I want! — don’t • I’m dead curious: what’s confuse me with nonsense! in the text? • User wants anything • User wants only certain that’s important. types of info. • System needs generic • System needs particular importance metrics, used criteria of interest, used to to rate content. focus search. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Typical 3 Stages of Summarization 1. Topic Identification: find/extract the most important material 2. Topic Interpretation: compress it 3. Summary Generation: say it in your own words …as easy as that! Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Top-Down: Info. Extraction (IE) • IE task: Given a form and a text, find all the information relevant to each slot of the form and fill it in. • Summ-IE task: Given a query, select the best form, fill it in, and generate the contents. • Questions: 1. IE works only for very particular forms; can it scale up? 2. What about info that doesn’t fit into any form—is this a generic limitation of IE? Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Bottom-Up: Info. Retrieval (IR) • IR task: Given a query, find the relevant document(s) from a large set of documents. • Summ-IR task: Given a query, find the relevant passage(s) from a set of passages (i.e., from one or more documents). • Questions: 1. IR techniques work on large volumes of data; can they scale down accurately enough? 2. IR works on words; do abstracts require abstract representations? Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Paradigms: IE vs. IR IE: IR: • Approach: try to ‘understand’ text—transform content into ‘deeper’ notation; then manipulate that. • Need: rules for text analysis and manipulation, at all levels. • Strengths: higher quality; supports abstracting. • Weaknesses: speed; still needs to scale up to robust open-domain summarization. • Approach: operate at word level—use word frequency, collocation counts, etc. • Need: large amounts of text. • Strengths: robust; good for query-oriented summaries. • Weaknesses: lower quality; inability to manipulate information at abstract levels. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Deep and Shallow, Down and Up... Today: Increasingly, techniques hybridize: people use word-level counting techniques to fill IE forms’ slots, and try to use IE-like discourse and quasisemantic notions in the IR approach. Thus: You can use either deep or shallow paradigms for either top-down or bottom-up approaches! Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Toward the Final Answer... • Problem: What if neither IR-like nor IE-Word counting like methods work? – sometimes counting and forms are insufficient, – and then you need to do inference to understand. • Solution: Mrs. Coolidge: “What did the preacher preach about?” Coolidge: “Sin.” Mrs. Coolidge: “What did he say?” Coolidge: “He’s against it.” – semantic analysis of the text (NLP), – using adequate knowledge bases that support inference (AI). Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Inference The Optimal Solution... Combine strengths of both paradigms… ...use IE/NLP when you have suitable form(s), ...use IR when you don’t… …but how exactly to do it? Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Table of contents 1. 2. 3. 4. 5. Motivation and Introduction. Approaches and paradigms. Summarization methods. Evaluating summaries. The future. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Topic Identification Classic Methods Overview of Identification Methods • General method: score each sentence; combine scores; choose best sentence(s) • Scoring techniques: – Position in the text: lead method; optimal position policy; title/heading method – Cue phrases: in sentences – Term frequencies: throughout the text – Cohesion: semantic links among words; lexical chains; coreference; word co-occurrence – Discourse structure of the text – Information Extraction: parsing and analysis Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Position-Based Method • Claim: Important sentences occur at the beginning (and/or end) of texts. • Lead method: just take first sentence(s)! • Experiments: – In 85% of 200 individual paragraphs the topic sentences occurred in initial position and in 7% in final position (Baxendale 58). – Only 13% of the paragraphs of contemporary writers start with topic sentences (Donlan 80). – First 100-word baseline achieved 80% (B: 32.24% vs. H: 40.23% in average coverage) of human performance in DUC 2001 single-doc sum task. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Optimum Position Policy (OPP) Claim: Important sentences are located at positions that are genre-dependent; these positions can be determined automatically through training: – Corpus: 13000 newspaper articles (ZIFF corpus). – Step 1: For each article, enumerate sentence positions (both and ). – Step 2: For each sentence, determine yield (= overlap between sentences and the index terms for the article). – Step 3: Create partial ordering over the locations where sentences containing important words occur: Optimal Position Policy (OPP). (Lin and Hovy, 97) OPP is genre dependent – For example: lead sentence is a very good baseline in news genre. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 OPP (cont.) – OPP for ZIFF corpus: (T) > (P2,S1) > (P3,S1) > (P2,S2) > {(P4,S1),(P5,S1),(P3,S2)} >… (T=title; P=paragraph; S=sentence) – Results: testing corpus of 2900 articles: Recall=35%, Precision=38%. – Results: 10%-extracts cover 91% of the salient words. COVERAGE SCORE – OPP for Wall Street Journal: (T)>(P1,S1)>... 1 0.9 0.05 0.03 0.04 0.8 0.03 0.02 0.7 0.02 0.07 0.08 0.06 0.6 R3 0.5 R1 R2 0.4 0.3 0.2 R1 R2 R3 0.1 0 R1 R2 R3 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.05 0.06 0.07 0.07 0.08 0.08 0.08 0.1 0.11 0.12 0.13 0.14 0.15 0.16 R4 R5 R6 R7 R8 R9 R10 R4 R5 R6 R7 R8 R9 R10 R4 R5 R6 R7 R8 R9 R10 OPP POSITIONS Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 >=5 4 3 2 1 Position: Title-Based Method • Claim: Words in titles and headings are positively relevant to summarization. • Shown to be statistically valid at 99% level of significance (Edmundson, 68). • Empirically shown to be useful in summarization systems. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Examples: DUC 2003 Topic 30003 DOCNO NYT19981017.0177 HEADLINE ACCEDING TO SPAIN, BRITAIN ARRESTS FORMER CHILEAN LEADER AUGUSTO PINOCHET 1ST SNT BUENOS AIRES, Argentina _ Gen. Augusto Pinochet, who ruled Chile as a despot for 17 years, has been arrested in London after Spain asked that he be extradited for the presumed murders of hundreds of Chilean and Spanish citizens, the British authorities announced Saturday. NYT19981018.0098 PINOCHET'S ARREST PUTS BLAIR'S GOVERNMENT IN A TIGHT SPOT LONDON _ Gen. Augusto Pinochet, the former Chilean dictator arrested here at the request of a Spanish judge, remained sequestered under police guard Sunday, awaiting a potentially devastating court hearing to weigh his extradition on charges of genocide, terrorism and murder. NYT19981018.0160 IN SANTIAGO, NEWS OF PINOCHET'S ARREST STIRS ONLY SMALL PROTEST SANTIAGO, Chile _ Perhaps as a sign of how much Chileans wish to forget their years of polarized politics and repression under Gen. Augusto Pinochet, the immediate popular response here to the sensational news of his arrest in London has been remarkable for its moderation. NYT19981018.0123 A MAVERICK SPANISH JUDGE'S TWO-YEAR QUEST FOR PINOCHET'S ARREST PARIS _ A stubbornly independent judge who more than once has challenged the Spanish government, Baltasar Garzon was ridiculed when he first set out to pursue the most notorious right-wing generals in South America for human rights crimes committed in the 1970s and '80s, offenses for which they had already been granted amnesty at home. NYT19981018.0185 ACCUSED OFFICIALS HAVE FEWER PLACES TO HIDE Chin-Yew LIN • IJCNLP-05 • Jeju Island • WASHINGTON _ The arrest of Gen. Augusto Pinochet shows the growing significance of international human-rights law, suggesting that officials accused of atrocities have fewer places to hide these days, even if they are carrying diplomatic passports, legal Republic of Korea • scholars October say. 10, 2005 Examples: Yahoo! News Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Cue-Phrase-Based Method • Claim 1: Important sentences contain ‘bonus phrases’, such as significantly, In this paper we show, and In conclusion, while non-important sentences contain ‘stigma phrases’ such as hardly and impossible. • Claim 2: These phrases can be compiled semiautomatically (Edmumdson 69; Paice 80; Kupiec et al. 95; Teufel and Moens 97). • Method: Add to sentence score if it contains a bonus phrase, penalize if it contains a stigma phrase. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Examples: “In this paper, we …” — From WAS 2004 Workshop papers Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Term Significance: TF vs. TFIDF Term Rankings DUC 2003 Topic 30003 Top 30 Normalized TF & TFIDF Terms TF: Term Frequency IDF: Inverse document frequency TFIDF: TF*log(N/DF) N: # of docs in a training corpus DF: # of docs occurrences for term i Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 TF Terms N-TF Value TFIDF Terms the 1.000 Pinochet of 0.703 Spanish in 0.536 Chile a 0.504 Chilean and 0.489 Garzon to 0.482 Argentina Pinochet 0.380 diplomatic that 0.308 British for 0.250 immunity said 0.214 dictator was 0.203 arrest he 0.196 military Spanish 0.185 genocide 's 0.174 Augusto on 0.174 Gen. is 0.170 crimes Chile 0.167 extradition his 0.159 his have 0.152 London has 0.149 torture British 0.127 human The 0.127 Spain arrest 0.120 judge who 0.120 terrorism Chilean 0.105 rights Argentina 0.098 Castro by 0.098 him from 0.098 he as 0.094 former with 0.094 surgery N-TFIDF Value 1.000 0.377 0.254 0.245 0.238 0.176 0.163 0.154 0.141 0.133 0.123 0.117 0.114 0.105 0.105 0.094 0.094 0.091 0.088 0.082 0.082 0.081 0.080 0.080 0.079 0.076 0.076 0.074 0.072 0.070 Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Term Significance: Term Frequency-Based Method C 1 E Pinochet D DUC 2003 TOPIC 30003 TOP 31 TF TERMS TF vs. TFIDF th e 0.9 (Luhn 59) 0.8 0.7 0.6 TF TFIDF LUHN to a of an d in 0.5 he t 0.4 Pi no c Spanish th at 0.3 C as by from with Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 en t rn m ve government go ge w nti ith na as om Ar who fr iti w sh ho ar re st C hi le by an Br have has The Argentina by Ar fro ge m nt in a w ith go ve rn as m en t is Th e ha ve ha s hi le hi s C is 's 's British arrest is hi le hi s ha ve ha s Th Br e iti sh w h ar o re C hi st le an on 's that for t th at fo r sa id w as he no c to andto Pi an d a in of th e 0 of in a the Chilean his he saidwas on 0.1 Sp he an is h 0.2 Sp an on i sh sa id w as he fo r Chile Sentence Significance: Luhn’s Method SENTENCE Sentence Now , after Garzon succeeded in getting Scotland Yard to [arrest Chile 's former dictator , Gen. Augusto Pinochet , the Spanish ] magistrate has suddenly moved closer to creating an important legal precedent , one that could make the retirement of former dictators decidedly uncomfortable . *: SIGNIFICANT WORDS * 1 2 * * 3 4 5 6 * 7 ALL WORDS SCORE = 42/7 = 2.286 LONDON ( AP ) _ An influential lawmaker from the governing Labor Party on Saturday backed [ Spanish requests to question former Chilean dictator Gen. Augusto Pinochet , in London for back surgery , on allegations of genocide and terrorism ] . LONDON ( AP ) _ Former [ Chilean dictator Gen. Augusto Pinochet has been arrested by British police on a Spanish extradition warrant , despite protests from Chile that he is entitled to diplomatic immunity ] . (Luhn 59) Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Luhn Score 8^2/10=6.400 11^2/21=5.762 12^2/27=5.333 Web Search: Text Snippet Significance Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Topic Identification Cohesion-Based Cohesion-Based Methods • Claim: Important sentences/paragraphs are the highest connected entities in more or less elaborate semantic structures. • Cohesion: Semantic relations that make the text hang together and provide the texture of the text. Anaphora • Types of Cohesion Resolution Discourse – Grammatical • • • • Parsing Reference: anaphora, cataphora, and exophora. Substitution: nominal (one, ones; same), verbal (do), and clausal (so, not). Ellipsis: substitution by zero; something left unsaid but understood. Conjunction: additive, adversative, causal, and temporal. Lexical Chain – Lexical • Lexical Cohesion: reiteration (repetition, synonym, hypernym/hyponum, general word) and collocation. – Please see “Cohesion in English”, by Halliday & Hasan for more details. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Cohesion Examples From: “Alice's Adventures in Wonderland” by Lewis Carrol — Example taken from: “Cohesion in English” by Halliday & Hasan 1976 The Cat only grinned when saw Alice. ‘Come, it’s pleased so far,’ thought Alice, and she went on. ‘Would you tell me, please, which way I ought to go from here?’ That depends a good deal on where you want to get to,’ said the Cat. ‘I don’t much care where —’ said Alice. ‘Then it doesn’t matter which way you go,’ said the Cat. ‘— so long as I get somewhere,’ Alice added as an explanation. ‘Oh, you’re sure to do that,’ said the Cat, ‘if you only walk long enough.’ Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Cohesive-Based Semantic Network (Skorochod’ko 1971) “b” “a” • chain • Fi Ni (M M i ring semantic network. – the maximum number of connected M imax nodes after removal of node i. 7 • 6 • 5 4 2 1 8 ) N i – the degree (# of edges) of a node. M – the total number of nodes in a monolith piecewise 3 Sentences are linked by semantic relations. Sentence significance, Fi, is determined by: max 9 Word significance is determined by the same formula but nodes represent words instead of sentences. Summaries are generated by first selecting the most significant sentences then compressing by removing the least significant words. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Cohesion: Lexical Chains (1) • First applied lexical chains in • Lexical chains: sequence of summarization by Barzilay & semantically related words linked Elhadad (B&E 97). together by lexical cohesion. • Morris & Hirst (91) proposed the • Similar to Hirst & St-Onge (98) but differed in: tirst computational model of • Use a POS tagger and a shallow lexical chains. parser to identify noun • Lexical cohesion was defined compounds vs. use straight based on categories, index WordNet nouns. entries, and pointers in Rogets’ • Identify chains in each text thesaurus. (eg, synonyms, antonyms, hyper-/hyponums, and mero-/holonyms) • Not an automatic algorithm. • Cover over 90% of intuitive lexical relations. • Automated later by Hirst & StOnge (98) using WordNet. segment then connect only strong chains across segments. • Try to link words to all possible chains then choose the best chain at the end vs. just link to the best currently available chain. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Lexical Chains-Based Method (2) • Chain score is computed as follows (B&E 97): Score(C) = Length(C) - #DistinctOccurrences(C) • Strong chain criterion: Score(C) > Average(Scores)+2*SD(Scores) • Assumes that important sentences are those that are ‘traversed’ by strong chains. – For each chain, choose the first sentence that is traversed by the chain and that uses a representative set of concepts from that chain. Lexical Lead-based [Jing et al. 98] corpus 10% cutoff 20% cutoff Chains Baseline Recall Prec Recall Prec 67% 64% 61% 47% 83% 71% 63% 47% Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Lexical Chains Examples (DUC 2003 APW19981017.0477) Augusto Pinochet or Pinochet Eight years after his turbulent regime ended , former Chilean strongman Gen. Augusto Pinochet is being called to account by Spanish authorities for the deaths , detention and torture of political regimeopponents or government or group . Eight years after his turbulent regime warrant ended ,, British formerpolice Chilean strongman Gen. they Augusto Pinochet is Responding to a Spanish extradition announced Saturday have arrested being called to account by Spanish the deaths , detention and in torture of political Pinochet on allegations that he authorities murdered anfor unidentified number of Spaniards Chile between Sept opponents . 11 , 1973 , the year he seized power , and Dec 31 , 1983 . But Britain saidruthless Pinochet did notwas havewidely diplomatic immunity or amnesty Pinochet ,immunity whose regime criticized for.its human right record , was recovering Pinochet , whose ruthless regime was widely criticized for right record , was recovering from from surgery in asaid London clinicprotest when he heldauthorities Friday night .human that Chile it would towas British ,its arguing the 82-year-old senator-for-life has surgery in a London clinic when he was held Friday night . diplomatic immunity . In a statement issued in Porto , Portugal , where President Eduardo is attending theFernando Ibero American Scotland Yard refused to confirm Pinochet 's whereabouts , but hisFrei Santiago spokesman But Britain said Pinochet did not have diplomatic immunity . summitMartinez , the Chilean government said it is `` filing a formal protest with the British government for said he was in a London clinic when police came for him . A regular visitor to Britain , Pinochet underwent surgery Oct 9 for a herniated disc , a spinal disorder what it considers a violation of theindiplomatic immunity which President Sen. Pinochet '' In a statement issued Porto , Portugal , where Eduardoenjoys Frei is .attending the Ibero American whichsummit has given himChilean pain and hampered his walking months . , the government said in it recent is `` filing a formal protest with the British government for Spanish Foreign Minister Abel Matutes , also attending the Ibero American summit , said his what considers a decisions violation of the diplomatic Sen. the Pinochet enjoys . '' In a statement issued in Porto , Portugal , where President Eduardo Freiwhich is attending Ibero American government ``itrespects the taken by courts . ''immunity summit , the Chilean government said it is ` ` filing a formal protest with the British government for British law two recognizes two types of immunity : state immunity , which British what law recognizes types of of immunity : state immunity ,Sen. which covers head of. state andhead of state and it considers a violation the diplomatic immunity which Pinochet enjoys '' covers government on official visits , andimmunity diplomatic immunity for persons accredited as diplomats government members onmembers official visits , and diplomatic for persons accredited as diplomats . . Bunting of the human rights group Amnesty International , which has frequently criticized Richard Pinochet said the British government was ` ` under obligation tocovering take legalhas action '' againstcriticized himbefore . While,ofin power he also pushed through an International amnesty crimes committed 1978 , when Richard Bunting the human rights group Amnesty , which frequently most of his human rights abuses allegedly took place . Pinochet , said the British clear government ` underPinochet obligation to take It was not immediately which clinicwas was `treating , who is 83legal next action week . '' against him . Baltasar Garzon , one of two Spanish magistrates handling probes into human rights violations in Chile Castellon 'sstrongman probe into murder , torture and disappearances in Chile during Pinochet 's regime began or dictator and Argentina , filed a request to question Pinochet on Wednesday , a day after another judge , Manuel in 1996 . GarciaEight Castellon , filed a similar petition . regime ended , former Chilean strongman Gen. Augusto Pinochet is years after his turbulent Pinochet is implicated in probe through his involvement `` Operation Condor , torture '' in which Castellon 's probe murder 's , torture and disappearances in Chile during 'sand regime began being calledinto toGarzon account by Spanish authorities for the in deaths , Pinochet detention ofinpolitical military1996 regimes in Chile , Argentina and Uruguay coordinated anti-leftist campaigns .opponents . Pinochet is implicated in Garzon 's probe through his involvement in `` Operation Condor , '' in which It son will of be athe first time this ghastly has faced questions , '' heintold Sky television . Pinochet ,`` the customs clerk who ousted dictator elected President Salvador Allende a bloody 1973 military regimes in Chile , Argentina and Uruguay coordinated anti-leftist campaigns coup , remained commander in chief of the Chilean army until March , when he was sworn in as a senator-for-life a post him in a constitution drafted by his regime . a bloody 1973 Pinochet ,, the son established of a customsfor clerk who ousted elected President Salvador Allende in coup , remained commander in chief of the Chilean army until March , when he was sworn in as a senator-for-life , a post established for him in a constitution drafted by his regime . Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Cohesion: Coreference-Based Method • Build coreference chains (noun/event identity, part-whole relations) between – query and document – title and document – sentences within document • Important sentences are those traversed by a large number of chains: – a preference is imposed on chains (query > title > doc) • Evaluation: 67% F-score for relevance (SUMMAC, 98). (Baldwin and Morton, 98) Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Cohesion: Graph-Based Method (1) (in Multi-document summarization; Mani and Bloedorn 97) • Map texts into graphs: – The nodes of the graph are the words of the text. – Arcs represent same, adjacency, phrase, name, coreference, and alpha link (synonym and hypernym) relations. • Seed the text graphs with word TFIDF weights. • Given a query (or topic), a spreading activation algorithm is used to re-weight the initial graphs. • Sentences with high activation weights are assumed as important. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Cohesion: Graph-Based Method (2) (in Multi-document summarization; Mani and Bloedorn 97) • Extrinsic information retrieval task evaluation – 4 TREC topics were used as testing queries: • 204, 207, 210, and 215 – Corpus: Subset of TREC collection indexed by SMART • Retain the top 75 documents returned for each query. • A Query-biased summary was created by taking the top 5 weighting sentences for every returned document. – 4 human subjects were asked to rate the relevancy of the returned documents and their summaries. Metric Accuracy (Recall, Precision) Time (mins) Full Document 30.25%, 41.25% 24.65 Summary 25.75%, 48.75% 21.65 Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Cohesion: Graph-Based Method (3) (in Multi-document summarization; Mani and Bloedorn 97) Adjacent links taking into account collocations Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Cohesion: Hyperlink-Based Method (1) P1 P2 P3 P9 • Paragraph Relation Map (PRM): P8 P7 P4 P6 P5 – Apply automatic inter-document hypertext link generation method in IR to generate intra-document paragraph or sentence links. (Salton et al. 91; Salton & Allan 93; Allan 96) – Compute cosine similarity between every paragraph or sentence pairs of a document. – Paragraphs or sentences with similarity scores higher then an empirically determined cutoff threshold are linked. • Summaries are generated in three ways: – Bushy* path: take the topmost N bushy nodes on the map. – Depth-first path: start from the bushiest node then visit the next most similar node for the next step. – Segmented bushy path:segment text at least bushy nodes then apply the bushy path strategy for each segment. *Bushiness: # of links connecting a node to other nodes on PRM. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Cohesion: Hyperlink-Based Method (2) • Evaluation (Salton et al. 99) – 50 articles from the Funk and Wagnalls Encyclopedia (Funk & Wagnalls 79). – 100 extracts (2 per article) were created by 9 human subjects. On average, 46% overlap were found between two manual extracts. – Four summary scoring scenarios: • • • • Optimistic: take the higher overlap between two human extracts. Pessimistic: take the lower overlap between two human extracts. Intersection*: take the overlap of the intersection of two human extracts. Union*: take the overlap the union of the two human extracts. Algorthm Bushy Depeth-first Segmented Bushy Random Optimistic (%) Pessimistic (%) Intersection (%) Union (%) 45.60 30.74 47.33 55.16 43.98 27.76 42.33 52.48 45.48 26.37 38.17 52.95 39.16 22.07 38.47 44.24 *intersection: strict precision; *union: lenient precision Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Hyperlink-Based Method: Example (DUC 2003 APW19981017.0477) (S04) Chile said it would protest to British authorities , arguing that the 82-year-old senator-for-life has diplomatic immunity . (S27) Pinochet , the son of a customs clerk who ousted elected President Salvador Allende in No a bloody 1973 coup , remained (S11) In a statement issued in Porto Portugal ,was given . (S03) reason for, the dates commander-in-chief of the Chilean army until , Ibero when where President Eduardo is``attending (S08) NoitFrei hearing dateMarch hasthe been set he . was (S19) Bunting of the human rights group Amnesty (S06)Richard Prime sworn Minister Tony Blair 's office said was inhas as frequently a senator-for-life ,Chilean aPinochet post established for himithearing in a American summit , the government said is (S22) He has a pacemaker and International , which criticized , said a matter for the magistrates and by thehis police . '' . constitution drafted regime `` filing a formal protest with the British S23 Baltasar Garzon , one of two Spanish magistrates handling aid , but ingovernment good health probes . the British government under obligationistogenerally take legal (S23) Baltasar Garzon was , one``of two Spanish for what itinto considers arights violation of the diplomatic human violations in Chile and Argentina , filed a request to action '' into against himrights . probes magistrates handling human immunity Sen. . '’ judge , Manuel question Pinochet on which Wednesday , a day after another (S28) While power heArgentina also pushed through an amnesty violations ininChile and , filed a Pinochet request toenjoys (S14) Foreign Minister Abel Matutes , also Castellon , Spanish filed similar . , arguing covering crimes committed before 1978 ,petition when most of his S04 ChileGarcia said it would protest to aBritish authorities that the 82question Pinochet on Wednesday , a day after attending the Ibero American summit , said his S24 Castellon 's probe into murder , torture and disappearances in Chile human rights allegedly took place year-old senator-for-life has diplomatic immunity another judgeabuses , Manuel Garcia Castellon ,. filed. a government ``began respects the decisions taken during Pinochet 's regime 1996 . immunity S05 But Britain said did not have in diplomatic . by courts similar petition . Pinochet . ''is implicated S26 's :probe involvement in S15 British lawPinochet recognizes two types in of Garzon immunity state through immunityhis, which `` Operation Condor , '' in which militaryon regimes Chile, and , Argentina and covers heads of state and government members officialinvisits campaigns . . diplomaticUruguay immunitycoordinated for personsanti-leftist accredited as diplomats Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Topic Identification Discourse-Based Discourse-Based Method • Claim: The multi-sentence coherence structure of a text can be constructed, and the ‘centrality’, typically the nuclei of the textual units in this structure reflects their importance. • Tree-like representation of texts in the style of Rhetorical Structure Theory (RST; Mann and Thompson 88). • Use the discourse (RST) representation in order to determine the most important textual units. • Attempts: – Ono et al. (1994) for Japanese. – Marcu (1997, 2000) for English – (see http://www.isi.edu/~marcu/discourse for more information). Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 RST: A Very Short Overview (Mann and Thompson 88). • Rhetorical Relation (Effective Speaking or Writing) – A relation holds between two non-overlapping text spans called nucleus and satellite. – Nucleus expresses what is more essential to the writer’s purpose than satellite. – The nucleus is comprehensible independent of the satellite but not vice versa. – Constraints on RST relations and their overall effect enforce text coherence. Relation Name EVIDENCE The reader R might not believe the information that is conveyed by Constraints on N the nucleus N to a degree satifactory to the writer W. The reader believes the information that is conveyed by the satellite S Constraints on S or will find it credible. Constraints on N+S R's comprehending S increases R's belief of N. R's belief of N is increased. Locus of the Effect N. [(N) The truth is that the pressure to smoke in junior high is greater Example than it will be any other time of one's lift:][( S) we know that 3,000 teens start smoking each day.] Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Rhetorical Parsing (1) http://www.isi.edu/~marcu/discourse [Energetic and concrete action has been taken in Colombia during the past 60 days against the mafiosi of the evidence drug trade,]1 [but it has not been sufficiently effective,]2 background [because, unfortunately, it came too late.]3 [Ten years ago, the newspaper El Espectador,]4 [of which my brother Guillermo was editor,]5 [began warning of the rise of the drug mafias and of their leaders' aspirations]6 [to control Colombian politics, especially the Congress.]7 [Then,]8 [when it would have been easier to resist them,]9 [nothing was done]10 [and my brother was murdered by the drug mafias three years ago.]11 … (Marcu 97) Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Rhetorical Parsing (2) 2 Elaboratio n 2 Elaboratio n 2 Background Justification 1 2 8 Example 3 Elaboration 45 Contrast 3 4 10 Antithesi s 7 9 8 10 Summarization = selection of the most important units + determination of a partial ordering on them 5 Evidence Cause 5 8 Concessi on 6 2 > 8 > 3, 10 > 1, 4, 5, 7, 9 > 6 Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Discourse Method: Evaluation (Marcu WVLC-98: using a combination of heuristics for rhetorical parsing disambiguation) TREC Corpus (4-Fold Cross-validation; 40 Texts; Jing et al 98) Reduction Method Recall Precision F-Score Humans 83.20% 75.95% 79.41% 10% Program 63.75% 72.50% 67.84% Lead 82.91% 63.45% 71.89% Humans 82.83% 64.93% 72.80% 20% Program 61.79% 60.83% 61.31% Lead 70.91% 46.96% 56.50% Scientific American Corpus (5 Texts; Marcu WISTS-97) Level Method Recall Precision F-Score Humans 72.66% 69.63% 71.27% Program (training) 67.57% 73.53% 70.42% Clause Program (no training) 51.35% 63.33% 56.71% Lead 39.68% 39.68% 39.68% Humans 78.11% 79.37% 78.73% Program (training) 69.23% 64.29% 66.67% Sentence Program (no training) 57.69% 51.72% 54.54% Lead 54.22% 54.22% 54.22% Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Topic Identification Combining the Evidence Summary of Identification Methods • General method: score each sentence; combine scores; choose best sentence(s) • Scoring techniques: – Position in the text: lead method; optimal position policy; title/heading method – Cue phrases: in sentences – Term frequencies: throughout the text – Cohesion: semantic links among words; lexical chains; coreference; word co-occurrence – Discourse structure of the text Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Combining the Evidence • Problem: which extraction methods to use? • Answer: assume they are independent, and combine their evidence: merge individual sentence scores. • Studies: – (Edumondson 69): linear combination. – (Kupiec et al. 95; Aone et al. 97; Teufel & Moens 97): Naïve Bayes. – (Mani and Bloedorn 98): SCDF, C4.5, inductive learning. – (Lin 99): C4.5, linear combination. – (Marcu 98): rhetorical parsing tuning. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Linear Combination (1) • Edmundson 69: – Combination of cue phrase (C) , key method (K), title method (T), and position method (P). Score a1C a2 K a3T a4 P – A total of 17 experiments were performed to adjust weights, i.e. a1 , a2 , a3 , and a4 . – Co-selection ratio between the automatic and the the target extracts was used as the evaluation metric. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Linear Combination (2) (Edmundson 69) Random Key Combination Title Cue Position Cue-Key-Title-Position Cue-Title-Position 0.0 10.0 20.0 30.0 40.0 50.0 60.0 Coselection Percentage Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 70.0 80.0 Naïve Bayes (1) • Kupiec, Pedersen, and Chen 95 – Features • Sentence length cut-off (binary long sentence feature: only include sentences longer than 5 words) • Fixed-phrase (binary cue phrase feature: such as “this letter …”, total 26 fixed phrases) • Paragraph (binary position feature: the first 10 and the last 5 paragraphs in a document) • Thematic Word (binary key feature: sentences contain the most frequent content words or not) • Uppercase Word (binary orthographic feature: capitalize frequent words not occur at the sentence-initial position) – Corpus: 188 OCRed documents and their manual summaries sampled from 21 scientific/technical domain. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Naïve Bayes (2) (Kupiec et al. 95) • Summarization as a sentence ranking process based on the probability of a sentence would be included in a summary: P(F1, F2 , , Fk | s S)P(s S) P(s S | F1 , F2 , , Fk ) P(F1 , F2 , , Fk ) P(s S) j 1 P(Fj | s S) k k P(Fj ) – Assuming statistical independence of the features. – P(s S) is a constant (assuming uniform distribution for all s). – P(Fj | s S) and P(Fj ) can be estimated from the training set. j 1 Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Naïve Bayes (3) (Kupiec et al. 95) Reference Extract Matching Distribution Direct Sentence Matches Direct Joins Unmatchable Sentences Incomplete Single Sentences Incomplete Joins Total Manual Summary sents 451 19 50 21 27 568 79% 3% 9% 4% 5% Feature Individual Cum ulative Sents Correct Sents Correct Paragraph 163 (33%) 163 (33%) Fixed Phrases 145 (29%) 209 (42%) Length C ut-off 121 (24%) 217 (44%) Thematic Word 101 (20%) 209 (42%) Uppercase Word 100 (20%) 211 (42%) Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Performance of Individual Factors (Lin CIKM 99) Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Topic Interpretation Topic Interpretation • From extract to abstract: topic interpretation or concept fusion. • Experiment (Marcu, 99): xx xxx xxxx x xx xxxx xxx xx xxx xx xxxxx x xxx xx xxx xx x xxx xx xx xxx x xxx xx xxx x xx x xxxx xxxx xxxx xx xx xxxx xxx xxx xx xx xxxx x xxx xx x xx xx xxxxx x x xx xxx xxxxxx xxxxxx x x xxxxxxx xx x xxxxxx xxxx xx xx xxxxx xxx xx x xx xx xxxx xxx xxxx xx xxx xx xxx xxxx xx xxx x xxxx x xx xxxx xx xxx xxxx xx x xxx xxx xxxx x xxx x xxx xx xx xxxxx x x xx xxxxxxx xx x xxxxxx xxxx xx xx xxxxx xxx xx xxx xx xxxx x xxxxx xx xxxxx x – Got 10 newspaper texts, with human abstracts. – Asked 14 judges to extract corresponding clauses from texts, to cover the same content. – Compared word lengths of extracts to abstracts: extract_length 2.76 abstract_length !! Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Some Types of Interpretation • Concept Generalization: Sue ate apples, pears, and bananas Sue ate fruit • Meronymy Replacement (Part-Whole) : Both wheels, the pedals, saddle, chain… the bike • Script Identification: (Schank and Abelson, 77) He sat down, read the menu, ordered, ate, paid, and left He ate at the restaurant • Metonymy: A spokesperson for the US Government announced that… Washington announced that... Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 General Aspects of Interpretation • Interpretation occurs at the conceptual level... …words alone are polysemous (bat animal and sports instrument) and combine for meaning (alleged murderer murderer). • For interpretation, you need world knowledge... …the fusion inferences are not in the text! • Little work so far: (Lin, 95; Radev and McKeown, 98; Reimer and Hahn, 97; Hovy and Lin, 98). Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Form-Based Operations • Claim: Using IE systems, can aggregate forms by detecting interrelationships. 1. Detect relationships (contradictions, changes of perspective, additions, refinements, agreements, trends, etc.). 2. Modify, delete, aggregate forms using rules (Radev and McKeown, 98): Given two forms, if (the location of the incident is the same and the time of the first report is before the time of the second report and the report sources are different and at least one slot differs in value) then combine the forms using a contradiction operator. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Inferences in Terminological Logic • ‘Condensation’ operators (Reimer & Hahn 97). 1. Parse text, incrementally build a terminological rep. 2. Apply condensation operators to determine the salient concepts, relationships, and properties for each paragraph (employ frequency counting and other heuristics on concepts and relations, not on words). 3. Build a hierarchy of topic descriptions out of salient constructs. • Conclusion: No evaluation. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Concept Generalization: Wavefront • Claim: Can perform concept generalization, using WordNet (Lin, 95). • Find most appropriate summarizing concept: Calculator Computer PC 5 IBM 6 20 0 20 2 18 Mac 5 Dell Cash register Mainframe 1. Count word occurrences in text; score WN concepts 2. Propagate scores upward 3. R Max{scores} / scores 4. Move downward until no obvious child: R<Rt 5. Output that concept Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Wavefront Evaluation • 200 BusinessWeek articles about computers: – typical length 750 words (1 page). – human abstracts, typical length 150 words (1 par). – several parameters; many variations tried. • Rt = 0.67; StartDepth = 6; Length = 20%: Random Wavefront Precision Recall 20.30% 15.70% 33.80% 32.00% • Conclusion: need more elaborate taxonomy. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Topic Signatures (1) • Claim: Can approximate script identification at lexical level, using automatically acquired ‘word families’ (Lin and Hovy COLING 2000). • Idea: Create topic signatures: each concept is defined by frequency distribution of its related words (concepts): signature = {head (c1,f1) (c2,f2) ...} restaurant waiter + menu + food + eat... • (inverse of query expansion in IR.) Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Example Signatures RANK 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 aerospace contract air_force aircraft navy army space missile equipment mcdonnell northrop nasa pentagon defense receive boeing shuttle airbus douglas thiokol plane engine million aerospace corp. unit banking environment telecommunication bank epa at&t thrift waste network banking environmental fcc loan water cbs mr. ozone pbs deposit state bell board incinerator long-distance fslic agency telephone fed clean telecommunication institution landfill mci federal hazardous mr. Agirre et al. LREC 2004 fdic acid_rain doctrine volcker standard service henkel federal news banker lake turner khoo garbage station asset pollution nbc brunei city sprint citicorp law communication billion site broadcasting regulator air broadcast national_bank protection programming greenspan violation television financial management abc vatican reagan rate Download topic signatures for all WordNet 1.6 noun senses: http://ixa.si.ehu.es/Ixa/resources/sensecorpus Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Topic Signatures (2) • Experiment: created 30 signatures from 30,000 Wall Street Journal texts, 30 categories: – Used tf.idf to determine uniqueness in category. – Collected most frequent 300 words per term. • Evaluation: classified 2204 new texts: – Created document signature and matched against all topic signatures; selected best match. • Results: Precision 69.31%; Recall 75.66% – 90%+ for top 1/3 of categories; rest lower, because less clearly delineated (overlapping signatures). Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Evaluating Signature Quality • Test: perform text categorization task: 1. match new text’s ‘signature’ against topic signatures 2. measure how correctly texts are classified by signature • Document Signature (DSi): [ (ti1,wi1), (ti2,wi2), … ,(tin,win) ] • Similarity measure: – cosine similarity, cos TSk DSi / | TSk DSi| DATA RECALL PRECISION DATA RECALL PRECISION 7WD 0.847 0.752 8WD 0.803 0.719 7TR 0.844 0.739 8TR 0.802 0.710 7PH 0.843 0.748 8PH 0.797 0.716 Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Summary Generation NL Generation for Summaries • Level 1: no separate generation – Produce extracts, verbatim from input text. • Level 2: simple sentences – Assemble portions of extracted clauses together. • Level 3: full NLG 1. Sentence Planner: plan sentence content, sentence length, theme, order of constituents, words chosen... (Hovy and Wanner, 96) 2. Surface Realizer: linearize input grammatically (Elhadad, 92; Knight and Hatzivassiloglou, 95). Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Full Generation Example • Challenge: Pack content densely! • Example (Radev and McKeown, 98): – Traverse templates and assign values to ‘realization switches’ that control local choices such as tense and voice. – Map modified templates into a representation of Functional Descriptions (input representation to Columbia’s NL generation system FUF). – FUF maps Functional Descriptions into English. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Generation Example (Radev and McKeown, 98) NICOSIA, Cyprus (AP) – Two bombs exploded near government ministries in Baghdad, but there was no immediate word of any casualties, Iraqi dissidents reported Friday. There was no independent confirmation of the claims by the Iraqi National Congress. Iraq’s state-controlled media have not mentioned any bombings. Multiple sources and disagreement Explicit mentioning of “no information” Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Fusion at Syntactic Level • General Procedure: 1. 2. 3. 4. Identify sentences with overlapping/related content, Parse these sentences into syntax trees, Apply fusion operators to compress the syntax trees, Generate sentence(s) from the fused tree(s). X X A B C A B D AA BB C Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 D Syntax Fusion (1) (Barzilay, McKeown, and Elhadad, 99) • Parse tree: simple syntactic dependency notation DSYNT, using Collins parser. • Tree paraphrase rules derived through corpus analysis cover 85% of cases: – – – – sentence part reordering, demotion to relative clause, coercion to different syntactic class, change of grammatical feature: tense, number, passive, etc., – change of part of speech, – lexical paraphrase using synonym, etc. • Compact trees mapped into English using FUF. • Evaluate the fluency of the output. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Syntax Fusion (2) (Mani, Gates, and Bloedorn, 99) • Elimination of syntactic constituents. • Aggregation of constituents of two sentences on the basis of referential identity. • Smoothing: – Reduction of coordinated constituents. – Reduction of relative clauses. • Reference adjustment. Evaluation: – Informativeness. – Readability. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Text Summarization Information Extraction-Based Information Extraction Method • Idea: content selection using forms (templates) – Predefine a form, whose slots specify what is of interest. – Use a canonical IE system to extract from a (set of) document(s) the relevant information; fill the form. – Generate the content of the form as the summary. • Previous IE work: – FRUMP (DeJong, 79): ‘sketchy scripts’ of terrorism, natural disasters, political visits... – McFRUMP (Mauldin, 89): forms for conceptual IR. – SCISOR (Rau and Jacobs, 90): forms for business. – SUMMONS (McKeown and Radev, 95): forms for news. – GisTexter (Harabagiu et al., 03): Traditional IE + dynamic templates based on WordNet. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 IE-Based Summarization • Topic Identification – A template defines the topic of interest and important facts associated with the topic. This also defines the coverage of IEbased summarizer. – Problem: how to select a relevant template? (i.e. the frame initiation problem) • Topic Interpretation – Problem: how to fill template slots and merge multiple templates? • Generation – Problem: how to generate summaries from the filled and merged templates? Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 FRUMP (DeJong 1979) • • • • Domain: General news Input: UPI news stories Output: summaries (single doc) Template: 60 sketchy scripts that contain important events in a situation. – Meeting, earthquake, theft, war, … – Represented as conceptual dependency (CD) • Encode constraints on script variables, between script variables, and between scripts. • Ex: John went to New York. ((ACTOR (*John*) <=> (*PTRANS*) OBJECT (*JOHN*) TO (*NEW YORK*)) Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 FRUMP (cont.) • Vocabulary size: 2,300 root words plus morphological rules. • Interpretation Mechanism – PREDICTOR initiates candidate scripts. – SUBSTANTIATOR validates the script slots. • Evaluation – 6 days run from April 10 to 15 in 1980. – Lenient: recall 63%, precision 80% – Strict: recall 43%, precision 54% Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 FRUMP (cont.) • Three ways to activate sketchy scripts: – Explicit: a word or phrase that identifies an entire script can activate the script. • Ex: ARREST – Implicit: invocation of a script activates a related script. • Ex: CRIME => ARREST – Event-inducing: a central event of a script can activate the script. (key request) • Ex: police apprehended someone => ARREST Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 McFRUMP (Mauldin 1989) • Domain: FRUMP + Astronomy (5 scripts) • Input: UseNet news; Output: case frame (CD) QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. *Mauldin 91 Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 McFRUMP (cont.) • Vocabulary size: 3,369 nouns, 2,148 names, and 376 basic verbs + synonyms from Webster’s 7 (28,323 word senses) + near synonyms, and proper name rules. • Interpretation Mechanism – PREDICTOR initiates candidate scripts. – SUBSTANTIATOR validates the script slots. • Evaluation – Use as a component in FERRET, a conceptual information retrieval system. – Measured on a test corpus of 1,065 astronomy texts with recall 52% and precision 48%. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 SCISOR (Rau & Jacobs 1990) • • • • • Domain: Merger & Acquisition (M&A) Input: Dow Jones News + Queries Output: Answers Template: M&A (1?) Vocabulary size: 10,000 word roots + 75 affixes and a core concept hierarchy of 1,000 nodes. • Interpretation Mechanism – TRUMPET top down processor matches bottom up output with expectations. – TRUMP bottom up processor parses input into frame-like semantic representation. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 SCISOR (cont.) • Evaluation: NA Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 SUMMONS (McKeown & Radev 95) • Domain: Terrorist (MUC-4) • Input: Filled IE templates + Sources + Times • Output: Automatically generated English summary of a few sentences (abstract) • Template: Terrorist (1?, 25 fields) • Vocabulary size: Large? (based on FUF, Functional Unification Formalism, Elhadad 91) • Interpretation Mechanism – Summarization is achieved through a series of template merging using 7+1 major summary operators. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 SUMMONS (cont.) • Evaluation: NA • Goal: Generate paragraph summaries describing the change of a single event over time or a series of related events. – Assume all the input templates are about the same event. (a strong assumption!) – Focus on similarity and difference of sources, times, and values of slot. Input: Cluster of templates T1 ….. T2 Tm Conceptual combiner Combiner Domain ontology Planning operators Paragraph planner Linguistic realizer Sentence planner Lexicon Lexical chooser Sentence generator OUTPUT: Base summary Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 SURGE SUMMONS Operators (Topic Interpretation) *TxSi : slot i of template x; ∪: union; ⊂ : specialization Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 GisTexter (Harabagiu et al. 03) (An End-to-end IE-based Summarizer) • • • • Domain: General Input: General News Output: summaries (extract) with sentence reduction Template: Natural Disasters (1?) + ad hoc template from WordNet topical relations. (match 40 out of 59 DUC 2002 topics) • Vocabulary size: WordNet 2.0 + topical relations extracted from WordNet • Interpretation Mechanism – – – – – Apply traditional IE to fill templates Map filled template slots to text snippets Resolve coreference chains in the text snippets Generate summaries through a 5-step incremental process If pre-encorded template cannot be found, then ad-hoc templates are automatically generated. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Architecture of GisTexter *Harabagiu et al. 2003 Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 GisTexter (cont.) • No concepts of time, source, different perspective, addition, contradition, superset, refinement, or trend as in SUMMONS. Only the agreement operator is used based on slot-filler frequency. • Link template slots to text snippets through extraction patterns. – Natural Disasters: [Casuality-expression {to|from} $Number {from|because of} Disasterword] “the death toll to 40 from last weeks’ tornado” • The most frequent slot-filler across all templates is called the dominant event and is treated as the main topic of the input texts. (ex: Hurricane Andrew) Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 GisTexter Summarization Procedure 1. Select the most important template T0 Importance: the sum of all frequency counts of all slots of a template. 2. Select sentences covered by the text snippets mapped from T0. 3. Select sentences from other templates about the dominant event that are not selected in 2. 4. Select sentences from other templates that are not about the dominant event from the texts covered by 1 and 2. 5. Select sentences from the rest of the templates. (* always add antecedent sentences if anaphoric expressions exist) Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 GisTexter Ad Hoc Templates • Based on WordNet 2.0 (Miller et al. 95) gloss and synset relation (isa, has-part, …) hierarchy + automatic template slot creation method inspired by Autoslog-TS (Riloff 96). *Harabagiu et al. 2003 Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 GisTexter Evaluation • DUC02 multi-doc 3rd in precision and 1st in recall. • Limit coverage of existing templates (about 2/3). – Didn’t state how difficult it is to create a new template. – Didn’t compare systems using only existing templates v.s. using only ad hoc templates. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Recent Development & Applications Headline or Very Short Summary Generation + Compression • Headline or title generation – Witbrock & Mittal (SIGIR99): generative probabilistic model. – Dorr, Zajic & Schwartz (DUC 2003): hybrid statistical content generation and linguistic rules. – Zhou & Hovy (DUC 2003): position + headline word model + automatic post generation patching. • Sentence Compression – Knight & Marcu (2002 AI Journal): noisy channel model and decision tree model – Riezler, King, Crouch & Zaenen (NAACL 2003): stochastic lexical functional grammar (LFG). – Turner & Charniak (ACL2005): improved noisy channel model based on K&M’s work. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Summarization Models • Berger & Mittal (SIGIR 2000): – Bag of words gisting model – Expanded-lexicon gisting model – Readable gisting model • Kraaji, Spitters & van der Heijden (DUC 2001): – Mixture language model (content salience) + naïve Bayes model (surface salience) • Jing (CL 2002): – Cut and paste model based on HMM. • Barzilay & Lee (NAACL-HLT 2004): – Probabilistic content model based on HMM. • Barzilay & Lapata (ACL 2005): – Entity-based local coherence model. • Daume III & Marcu (ACL MSE 2005): – Bayesian query-focused model. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Applications • Berger & Mittal (SIGIR 2000): web summarization. • Zechner & Waibel (COLING 2000): dialog summarization. • Buyukkokten, Garcia-Molina & Paepcke (WWW10 2001): summarization for web browsing on handheld devices. • Muresan, Tzoukermann & Klavans (CoNLL 2001): email summarization. • Zhou,Ticrea & Hovy (EMNLP 2004): biography summarization. • Buist, Kraaij & Raaijmakers (CLIN 2005): meeting data summarization. • Zhou & Hovy (ACL 2005): online discussion summarization. • Bing, Hu & Chen (WWW14 2005): opinion summarization. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 ISI – DARPA Surprise Language Exercise 2003 (Leuski et al. 03) Table of contents 1. 2. 3. 4. 5. Motivation and Introduction. Approaches and paradigms. Summarization methods. Evaluating summaries. The future. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Progress in Speech Recognition Error Rate 100 RM ATIS WSJ SWB BN VoiceMail TIDigits 10 Reduced Error Rate 1 Common Automatic Metric 0.1 1985 1990 1995 2000 2005 Mukund Padmanabhan and Michael Picheny, HLT Group, IBM Research, ASR Workshop, Paris, Sep 2000. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Common Corpora MT and Summarization Evaluations • Machine Translation – Inputs • Reference translation • Candidate translation – Methods • Manually compare two translations in: – Adequacy – Fluency – Informativeness • Auto evaluation using: – BLEU/NIST scores • Auto Summarization – Inputs • Reference summary • Candidate summary – Methods • Manually compare two summaries in: – Content overlap – Linguistic qualities • Auto evaluation using: ROUGE Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Summarization Evaluation Document Understanding Conference (DUC) • Part of US DARPA TIDES Project DUC 01 - 04 (http://duc.nist.gov) – Tasks • Single-doc summarization (DUC 01 and 02: 30 topics) • Single-doc headline generation (DUC 03: 30 topics, 04: 50 topics) • Multi-doc summarization – Generic 10, 50, 100, 200 (2002) , and 400 (2001) words summaries – Short summaries of about 100 words in three different tasks in 2003 » focused by an event (30 TDT clusters) » focused by a viewpoint (30 TREC clusters) » in response to a question (30 TREC Novelty track clusters) – Short summaries of about 665 bytes in three different tasks in 2004 » focused by an event (50 TDT clusters) » focused by an event but documents were translated into English from Arabic (24 topics) » in response to a “who is X?” question (50 persons) – Participants • 15 systems in DUC 2001, 17 in DUC 2002, 21 in DUC 2003, and 25 in DUC 2004 • A new roadmap was discussed this year at DUC 2005 in Vancouver. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Snapshot of An Evaluation Session Overall Candidate Quality SEE – Summary Evaluation Environment (Lin 2001) http://www.isi.edu/~cyl/SEE Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Snapshot of An Evaluation Session Measuring Content Coverage Single Reference! Coarse Judgment! Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Inconsistency of Human Judgment • Single Document Task – Total of 5,921 judgments – Among them 1,076 (18%) contain different judgments for the same pair (from the same assessor) – 143 of them have three different coverage grades (2.4%) • Multi-Document Task – Total of 6,963 judgments – Among them 528 (7.6%) contain different judgments for the same pair (from the same assessor) – 27 of them have three different coverage grades Lin and Hovy (DUC Workshop 2002) Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 DUC 2003 Human vs. Human (1) Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Nenkova and Passonneau (HLT/NAACL 2004) DUC 2003 Human vs. Human (2) 1. Can we get consensus among humans? 2. If yes, how many humans do we need to get consensus? 3. Single reference or multiple references? Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 DUC 2003 Human vs. Human (3) Can we get stable estimation of human or system performance? How many samples do we need to achieve this? Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Summary of Research Issues • How to accommodate human inconsistency? • Can we obtain stable evaluation results despite using only a single reference summary per evaluation? • Will inclusion of multiple summaries make evaluation more or less stable? • How can multiple references be used in improving stability of evaluations? • How is stability of evaluation affected by sample size? Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Recent Results • Van Halteren and Teufel (2003) – Stable consensus factoid summary could be obtained if 40 to 50 reference summaries were considered. • 50 manual summaries of one text. • Nenkova and Passonneau (2003) – Stable consensus semantic content unit (SCU) summary could be obtained if at least 5 reference summaries were used. • 10 manual multi-doc summaries for three DUC 2003 topics. • Hori et al. (2003) – Using multiple references would improve evaluation stability if a metric taking into account consensus. • 50 utterances in Japanese TV broadcast news; each with 25 manual summaries. • Lin and Hovy (2003), Lin (2004) – ROUGE, an automatic summarization evaluation method used in DUC 2003. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Automatic Evaluation of Summarization Using ROUGE • ROUGE summarization evaluation package – Currently (v1.5.4) include the following automatic evaluation methods: (Lin, Text Summarization Branches Out workshop 2004) • ROUGE-N: N-gram based co-occurrence statistics • ROUGE-L: LCS-based statistics • ROUGE-W: Weighted LCS-based statistics that favors consecutive LCSes (see ROUGE note) • ROUGE-S: Skip-bigram-based co-occurrence statistics • ROUGE-SU: Skip-bigram plus unigram-based co-occurrence statistics – Free download for research purpose at: http://www.isi.edu/~cyl/ROUGE Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Summary of Results • Overall – Using multiple references achieved better correlation with human judgment than just using a single reference. – Using more samples achieved better correlation with human judgment (DUC 02 vs. other DUC data). – Stemming and removing stopwords improved correlation with human judgment. – Single-doc task had better correlation than multi-doc • Specific – ROUGE-S4, S9, and ROUGE-W1.2 were the best in 100 words singledoc task, but were statistically indistinguishable from most other ROUGE metrics. – ROUGE-1, ROUGE-L, ROUGE-SU4, ROUGE-SU9, and ROUGE-W1.2 worked very well in 10 words headline like task (Pearson’s ~ 97%). – ROUGE-1, 2, and ROUGE-SU* were the best in 100 words multi-doc task but were statistically equivalent to other ROUGE-S and SU metrics. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Effect of Sample Sizes • Examine the effect of sample size on: – Human assigned mean coverage (C) – ROUGE – Correlation between C and ROUGE • Experimental setup – DUC 2001 100 words single and multi doc data. – Applied stemming and stopword removal. – Ran bootstrap resampling at sample size from 1 to 142 for single-doc task and 26 for multi-doc task. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Pearson’s ρ ROUGE-SU4 vs. Mean Coverage (DUC 2001) Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Summary of Results (Sample Size) • Reliability of performance estimation in human assigned coverage score and ROUGE improved when the number of samples increased (tighter gray box). • Reliability of correlation analysis improved when the number of samples increased. – Critical numbers to reach significant results in Pearson’s • at least 86 documents in the single-doc task. • at least 18 topics in the multi-doc task. – Number of documents and topics in DUC were well above these critical numbers. • Using multiple reference achieved better correlation between ROUGE and mean coverage. • Sample size had a big effect on the reliability of estimation. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Conclusions and Future Work • ROUGE worked pretty well in evaluations of single document summarization tasks (100 words and 10 words) with Pearson’s above 85%. • ROUGE worked fine in multi-doc summarization evaluations but how to make its metrics consistently achieve Pearson’s above 85% in all test cases is still an open research topic. • If human inconsistency is unavoidable, we then need to quantify its effect using statistical significance analysis and minimize it using large number of samples. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Basic Elements – ROUGE+ Toward Automatic Evaluation of MT and Summarization in Semantic Space Snapshot of An Evaluation Session Measuring Content Coverage Single Reference! Coarse Judgment! Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 What Is the Right Span of Information Unit • Information Retrieval – Document and passage • Question and Answering – Factoid, paragraph, document, … • Summarization – Word, phrase, clause (EDU), sentence, paragraph, … Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 The Factoid Method • Van Halteren & Teufel (2003, 2004) • Factoids – Atomic semantic units represent sentence meaning (FOPL style). – “Atomic” means that a semantic unit is used as a whole across multiple summaries. – Each factoid may carry information varying from a single word to a clause. • Example: – The police have arrested a white Dutch man. • • • • • A suspect was arrested. The police did the arresting. The suspect is white. The suspect is Dutch. The suspect is male. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 The Pyramid Method • Nenkova and Passonneau (2003) • Pyramid – A weighted inventory of factoids or summarization content units (SCU) • • • • • A: “Unable to make payments on a $2.1 billion debt” B: “made payments on PAL's $2 billion debt impossible” C: “with a rising $2.1 billion debt” D: “PAL is buried under a $2.2 billion dollar debt it cannot repay” SCU – F1: PAL has 2.1 million debt (All) – F2: PAL can't make payments on debt (Most) Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Problems with Factoid and SCU • Each factoid may carry very different amount of information – How to assign fair information value to a factoid? – No predetermined size of factoids or SCUs “counting matches” and “scoring” would be problematic. • The inventory of factoids grows as more summaries are added to the reference pool – Old factoids tend to break apart to create new factoids • Interdependency of factoids are ignored • Totally manual creation so far and only been tested on very small data set – Factoid: 2 documents – SCU+Pyramid: 3 sets of multi-doc topics • How to automate? Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Basic Elements (BE) • Definition – A head, modifier and relation triple: BE::<HEAD|MOD|REL> – BE::HEAD is the head of a major syntactic constituent (noun, verb, adjective or adverbial phrases). – BE::MOD is a single dependent of BE::HEAD with a relation, BE::REL, between them. – BE::REL could a syntactic, semantic relation or NIL. • Example – “Two Libyans were indicted for the Lockerbie bombing in 1991.” <Libyans|two|CARDINAL> <indicted|Libyans|ACCUSED> <indicted|bombing|CRIME> <indicted|1991|TIME> Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Research Issues • How can BEs be created automatically? – Extract dependency triples from automatic parse trees. • BE-F: MINPAR triples* (Lin 95) • BE-L: Charniak parse trees + automatic semantic role tagging* • What score should each BE have? – Equal weight*, tfidf, information value, … • When do two BEs match? – Lexical*, lemma*, synonym, distributional similarity, … • How should an overall summary score be derived from the individual matched BEs' scores? – Consensus of references* Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Current Status • First version, BE 1.0, released to the research community on April 13, 2005. – Package include: • BE-F (Minipar) BE breakers • ROUGE-1.5.5 scorer – One of the three official automatic evaluation metrics for Multilingual Summarization Evaluation 2005 (MSE 2005). – It is used in DUC 2005. – Free download for research purpose at: http://www.isi.edu/~cyl/BE Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Evaluation • Measurement – Examine the Pearson’s correlation between human assigned mean coverage (C) and BE. – Compare results with ROUGE 1-4, S4, and SU4. • Experimental setup – Use DUC 2002 (10 systems) and 2003 (18 systems) 100 words multi doc data. – Compare single vs. multiple references. – Applied stemming and stopword removal. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Correlation Analysis (DUC 2002) DUC-2002 M100 BE-F vs. Human Scores Pearson's Correlation multi-ref single-ref Original Stemmed Original Stemmed H NA NA NA NA HM 0.914 0.915 0.924 0.953 HMR 0.909 0.907 0.934 0.953 HM1 0.914 0.915 0.924 0.953 HMR1 0.909 0.907 0.934 0.953 HMR2 0.909 0.907 0.934 0.953 DUC-2002 M100 BE-L vs. Human Scores Pearson's Correlation multi-ref single-ref Original Stemmed Original Stemmed H 0.890 0.880 0.874 0.871 HM 0.917 0.932 0.865 0.895 HMR 0.917 0.951 0.815 0.894 HM1 0.907 0.902 0.879 0.887 HMR1 0.921 0.932 0.867 0.881 HMR2 0.909 0.904 0.879 0.882 DUC-2002 ROUGE vs. Human Scores Pearson's Correlation multi-ref single-ref Original Stemmed Stopped Original Stemmed Stopped R1 0.751 0.755 0.837 0.698 0.707 0.835 R2 0.933 0.927 0.912 0.896 0.889 0.873 R3 0.962 0.959 0.914 0.931 0.922 0.855 R4 0.924 0.918 0.889 0.911 0.901 0.773 RL 0.719 0.717 0.837 0.667 0.667 0.820 RS4 0.895 0.906 0.881 0.857 0.867 0.860 RSU4 0.855 0.865 0.867 0.809 0.822 0.853 Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Correlation Analysis (DUC 2003) DUC-2003 M100 BE-F vs. Human Scores Pearson's Correlation multi-ref single-ref Original Stemmed Original Stemmed H NA NA NA NA HM 0.931 0.927 0.920 0.940 HMR 0.933 0.923 0.904 0.919 HM1 0.931 0.927 0.920 0.940 HMR1 0.933 0.923 0.904 0.919 HMR2 0.933 0.923 0.904 0.919 DUC-2003 M100 BE-L vs. Human Scores Pearson's Correlation multi-ref single-ref Original Stemmed Original Stemmed H 0.784 0.776 0.785 0.782 HM 0.959 0.949 0.917 0.918 HMR 0.882 0.864 0.753 0.718 HM1 0.859 0.847 0.853 0.849 HMR1 0.961 0.952 0.921 0.914 HMR2 0.860 0.848 0.855 0.847 DUC-2003 ROUGE vs. Human Scores Pearson's Correlation multi-ref single-ref Original Stemmed Stopped Original Stemmed Stopped R1 0.619 0.609 0.773 0.622 0.611 0.786 R2 0.875 0.883 0.921 0.803 0.796 0.895 R3 0.872 0.869 0.880 0.684 0.669 0.687 R4 0.736 0.733 0.647 0.488 0.488 0.501 RL 0.547 0.533 0.726 0.539 0.508 0.729 RS4 0.811 0.817 0.886 0.744 0.754 0.885 RSU4 0.747 0.748 0.845 0.723 0.726 0.864 Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Oracle Experiment • We would like to answer the following questions: 1. How well does the topic identification module of a summarization system work? 2. How much information do we lose by only matching on lexical level? 3. How much do humans agree with each others? • Experimental setup – – Use DUC 2003 multi-doc data with 4 human refs. Use BE ranking to simulate topic identification. We compare three BE ranking methods: • – – LR (log likelihood ratio), DF (document frequency with LR), and RawDF (raw document frequency) Use BE-F as counting unit. Compute recall and precision of different BE ranking methods vs. reference summary BE ranking based on raw DF. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Oracle Exp: BE-F Recall BE-F Recall Scores vs. Top N BEs 256 1.00000 R-LR R-DF R-RAW-DF R-REF-RAW-DF Ref-BE R-REF-RAW-DF 0.90000 0.80000 BE-F Recall 0.70000 128 Lexical Gap! 0.60000 0.50000 0.40000 64 0.30000 32 0.20000 0.10000 16 4 12 Better Ranking Algorithm! 8 0.00000 0 16 32 48 64 80 96 112 128 144 160 176 192 208 224 Top N BEs Selected Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 240 256 Oracle Exp: BE-F Precision BE-F Precision Scores vs. Top N BEs 1.00000 P-LR P-DF P-RAW-DF P-REF-RAW-DF REF-BE P-REF-RAW-DF 0.90000 0.80000 1 0.70000 2 Lexical Gap & Human Disagreement BE-F Recall 4 0.60000 8 0.50000 16 0.40000 32 64 128 0.30000 0.20000 256 Better Ranking Algorithm! 0.10000 0.00000 0 16 32 48 64 80 96 112 128 144 160 176 192 208 224 Top N BEs Selected Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 240 256 Conclusions • BE-F consistently achieves over 90% Peason’s correlation with human judgments in all testing categories. – BE-F with stemming and matching only on BE::HEAD and BE:MOD (HM & HM1) has the best correlation. • BE-L has over 90% correlation when both BE::HEAD and BE::MOD are considered in the matching. It also works better with multiple references. • BE-F and BE-L are more stable than ROUGE across corpora. (DUC’02 R2 Org vs. DUC’03 R3 Stop) • Need to go beyond lexical matching. • Need to develop better BE ranking algorithms. • Need to address the issue of human disagreement: – Better summary writers? – Better domain knowledge? – Better task definition … Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Future Directions • BE breaking – Use FrameNet II frame elements in BE relations. • BE matching – Paraphrases, synonyms, and distributional similarity. • BE ranking – Prioritize BEs in a given application context. – Assign weights according to BE’s information content. – Utilize inter-BE dependency. • Application – Develop summarization methods based on BE. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Table of contents 1. 2. 3. 4. 5. Motivation and Introduction. Approaches and paradigms. Summarization methods. Evaluating summaries. The future. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 What’s Next? • Robust information extraction on general domains based on: FrameNet, PropBank, WordNet and other knowledge sources. • Automatic frame acquisition and integration to increase coverage and development of better template/frame merging and linking methods. • Design of better summarization models (statistical or not). • Construction of large common summarization corpora and design better evaluation methods to facilitate more accurate fast develop and test cycles. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005 Good Bye! Please check: www.summaries.net for the latest version of this tutorial and reference information. Chin-Yew LIN • IJCNLP-05 • Jeju Island • Republic of Korea • October 10, 2005