Some interesting directions in
Automatic Summarization
Annie Louis
CIS 430
12/02/08
1
Today’s lecture
Multi-strategy summarization
 Is one method enough?
Performance Confidence Estimation
 Would be nice to have an indication of expected system
performance on an input
Evaluation without human models
 Can we come up with cheap and fast evaluation
measures?
Beyond generic summarization
 Query focused, updates, blogs, meetings, speech…
2
Relevant papers:
 Lacatusu et al. LCC's GISTexter at DUC 2006: Multi-Strategy Multi-Document Summarization. In Proceedings of the Document Understanding Workshop (DUC 2006).
 McKeown et al. Columbia Multi-Document Summarization: Approach and Evaluation. In Proceedings of the Document Understanding Conference (DUC 2001), 2001.
 Nenkova et al. Can You Summarize This? Identifying Correlates of Input Difficulty for Multi-Document Summarization. In Proceedings of ACL-08: HLT.
3
More about DUC 2002 data…
 /project/cis/nlp/tools/Summarization_Data/Inputs2002
 Newswire texts
 Has 3 categories of inputs!
4
DUC 2002 input categories
 Single event - 30 inputs
 Eg: d061- Hurricane Gilbert
 Same place, roughly same time, same actions
 Multiple distinct events – 15 inputs
 Eg: d064 – Opening of McDonald's in Russia, Canada, South Korea…
 Different places, different times, different agents
 Biographies – 15 inputs
 Eg: d065 – Dan Quayle, Bush’s nominee for vice president
 One person – one event, background info – events from the
past
Do you think a single method will do well for all?
5
Tf-idf summary - d061
Hurricane Gilbert Heads Toward Dominican Coast .
Tropical Storm Gilbert formed in the eastern Caribbean and strengthened into a
hurricane Saturday night.
Gilbert Reaches Jamaican Capital With 110 Mph Winds .
Hurricane warnings were posted for the Cayman Islands, Cuba and Haiti.
Hurricane Hits Jamaica With 115 mph Winds; Communications.
Gilbert reached Jamaica after skirting southern Puerto Rico, Haiti and the
Dominican Republic.
Gilbert was moving west-northwest at 15 mph and winds had decreased to 125
mph.
What Makes Gilbert So Strong?
With PM-Hurricane Gilbert, Bjt .
Hurricane Gilbert Heading for Jamaica With 100 MPH Winds .
Tropical Storm Gilbert
6
Tf-idf summary - d064
First McDonald's to Open in Communist Country .
Police Keep Crowds From Crashing First McDonald's .
McDonald's and Genex contribute $1 million each for the flagship
restaurant.
A Bolshoi Mac Attack in Moscow as First McDonald's Opens .
McDonald's Opens First Restaurant in China .
McDonald's hopes to open a restaurant in Beijing later.
The 500-seat McDonald's restaurant in a three-story building is
operated by McDonald's Restaurant Shenzhen Ltd., a wholly
owned subsidiary of McDonald's Hong Kong.
McDonald's Hong Kong is a 50-50 joint venture with McDonald's in
the United States.
McDonald's officials say it is not a question that
7
Tf-idf summary - d065
Tucker was fascinated by the idea, Quayle said.
But Dan Quayle's got experience, too.
Quayle's Triumph Quickly Tarnished .
Quayle's Biography Inflates State Job; Quayle Concedes Error .
Her statement was released by the Quayle campaign.
But he would go no further in describing what assignments he
would give Quayle.
``I will be a very close adviser to the president,'' Quayle said.
``You're never going to see Dan Quayle telling tales out of
It was everything Quayle had hoped for.
Quayle had said very little and he had said it very well.
There are windows into the workings of the
8
Multi-strategy summarization
 Multiple summarization modules within a single
system
 Better than a single method
 How to employ a multi-strategy system?
 Use all methods, produce multiple summaries, choose
best
 Use a router and summarize by only one specific method
9
Produce multiple summaries and
choose – LCC GISTexter
 Task - Query focused summarization
 Query is decomposed by 3 methods
 Sent to a QA system and a multi-document
summarizer
 6 different summaries
 Select the best summary
 Textual entailment + pyramid scoring
10
Route to a specific module –
Columbia’s multi-document
summarizer
 Features to classify an input as
 Single event
 Biography
 Loosely connected documents
 The result of classification is used to route the input to
one of 3 different summarizers
11
Features - Single Event
 To identify
 Time span between publication dates < 80 days
 More than 50% documents published on same day
 To summarize
 Exploit redundancy, cluster similar sentences into themes (see the sketch after this list)
 Rank themes based on size, similarity, ranking of
contained sentences by lexical chains
 Select phrases from each theme
 Generate sentences
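A minimal Python sketch of the theme-clustering step, not Columbia's actual implementation: Jaccard word overlap and the 0.3 threshold are illustrative stand-ins for the system's richer sentence similarity and lexical-chain ranking.

```python
def word_overlap(s1, s2):
    """Jaccard overlap between the word sets of two sentences."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / len(w1 | w2) if w1 | w2 else 0.0

def cluster_into_themes(sentences, threshold=0.3):
    """Greedy grouping: a sentence joins the first theme that contains
    a sufficiently similar sentence, else it starts a new theme."""
    themes = []
    for sent in sentences:
        for theme in themes:
            if any(word_overlap(sent, other) >= threshold for other in theme):
                theme.append(sent)
                break
        else:
            themes.append([sent])
    # bigger themes = content repeated across documents = more important
    return sorted(themes, key=len, reverse=True)
```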
12
Features - Biographies
 To identify
 Frequency of the most frequent capitalized word > X
(compensating for named entities)
 Frequency of personal pronouns > Y
 To summarize
 Is the target individual mentioned in the sentence?
 Is another individual found in the sentence?
 Position of the most prominent capitalized word in the sentence
13
Features – Weakly related
documents
 To identify
 Neither single event nor biographical
 To summarize
 Words likely to be used in first paragraphs, i.e. important words – learnt from corpus analysis
 Verb specificity
 Semantic themes – WordNet concepts
 Positional and length features
 More weight to recent articles
 Downweight sentences with pronouns
14
Characterizing/ Classifying inputs
 Important if you want to route to a specialized
summarizer
 Classification can be made along several lines
 Theme of input – Columbia’s summarizer
 Scientific/ News articles
 Long/ Short documents
 News articles about events/ Editorials
 Difficult/ Easy ??
15
Input difficulty and Performance
Confidence Estimation
 Some inputs are more difficult than others
– Most summarizers produce poor summaries for these
inputs
16
Input to summarizer
Some inputs are easier than others!
Average system scores obtained on different inputs for 100-word summaries
(data: DUC 2001; score range 0–4):
mean 0.55 | min 0.07 | max 1.65
17
Input difficulty &
Content coverage scores
 Content coverage score
 extent of coverage of important content
 Poor content selection → low score
 If most summaries for an input get a low score...
 Most systems could not identify important content
 “Difficult input”
18
Did system performance vary with DUC
2001 input categories?
Multi-document inputs were from 5 categories. A set of documents describing...

Cohesive / “on topic” inputs:
 Single event – the Exxon Valdez oil spill
 Subject – mad cow disease
 Biographical – Elizabeth Taylor

Non-cohesive / “multiple facets” inputs:
 Multiple distinct events – different occasions of police misconduct
 Opinion – views of the senate, public, congress, lawyers etc. on the decision by the senate to count illegal aliens in the 1990 census

Single task – generic summarization
19
Input type influenced scores obtained
 Biographical
 Single event
 Subject
are easier to summarize than
 Multiple distinct events
 Opinions
20
Cohesive inputs are easier to summarize
Scores for cohesive inputs are significantly* higher than those for non-cohesive inputs at 100, 200 and 400 words.
*One-sided t-tests, 95% significance level
Cohesive:
 Biographical
 Single event
 Subject
Non-cohesive:
 Multiple distinct events
 Opinions
21
Inputs can be easy or difficult =>
 Better summarizers ~ different methods to
summarize different inputs
 multi-strategy
 Enhancing user experience ~ system can flag
summaries that are likely to be poor in content
 low system confidence on difficult inputs
22
First step..
 What characterizes difficult inputs?
 Find useful features
 Can we identify difficult inputs with high accuracy?
 Classification task – difficult vs. easy
23
Features – Simple length-based
Smaller inputs ~ less loss of information
~ better summaries
Number of sentences
~ information to be captured in the summary
Vocabulary size
~ number of unique words
24
Features – Word distributions in input
% of words used only once
~ lexical repetition
less repetition of content ~ difficult inputs
Type-token ratio
~ lexical variation in the input
fewer types ~ easy inputs
Entropy of the input
H(X) = − Σ_{i=1}^{n} p(w_i) · log₂ p(w_i)
~ descriptive words ~ high probabilities ~ low entropy ~ easy
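A toy sketch of these three word-distribution features, assuming simple whitespace tokenization:

```python
import math
from collections import Counter

def word_distribution_features(input_text):
    """Toy versions of the three word-distribution features."""
    tokens = input_text.lower().split()
    counts = Counter(tokens)
    n_tokens, n_types = len(tokens), len(counts)
    probs = [c / n_tokens for c in counts.values()]
    return {
        "pct_words_used_once": sum(c == 1 for c in counts.values()) / n_types,
        "type_token_ratio": n_types / n_tokens,
        "entropy": -sum(p * math.log2(p) for p in probs),
    }
```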
25
Features – Document similarity and
relatedness
documents with overlapping content ~ easy input
Pair-wise cosine overlap (average, min, max)
~ similarity of the documents
cos(v₁, v₂) = (v₁ · v₂) / (‖v₁‖ ‖v₂‖)
v₁, v₂ – tf-idf weight vectors of the content words of the two documents
High cosine overlaps
 overlapping content
 easy to summarize
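A sketch of the pairwise overlap computation; raw term frequencies stand in here for the tf-idf weights of the slide:

```python
import math
from collections import Counter
from itertools import combinations

def cosine(v1, v2):
    """Cosine between two sparse vectors stored as dicts (term -> weight)."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm = (math.sqrt(sum(w * w for w in v1.values()))
            * math.sqrt(sum(w * w for w in v2.values())))
    return dot / norm if norm else 0.0

def pairwise_cosine_stats(documents):
    """Average, min and max cosine over all document pairs in an input."""
    vecs = [Counter(doc.lower().split()) for doc in documents]
    sims = [cosine(a, b) for a, b in combinations(vecs, 2)]
    return sum(sims) / len(sims), min(sims), max(sims)
```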
26
Features – Document similarity and
relatedness
tightly-bound by topic ~ easy input
KL Divergence
~ distance from a large collection of random documents
KL divergence = Σ_{w ∈ Inp} p_inp(w) · log₂ [ p_inp(w) / p_coll(w) ]
coll – all documents from all tasks of DUC 2001 to 2006
Difference between 2 language models
 input & random collection
Greater divergence
 input is unlike random documents, tightly bound input
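A sketch of the divergence computation; flooring the background probability at eps is an illustrative smoothing choice, not necessarily the one used in the paper:

```python
import math
from collections import Counter

def kl_divergence(input_words, collection_words, eps=1e-9):
    """KL(input || collection) between unigram models; the background
    model is floored at eps so unseen words do not zero the ratio."""
    inp, coll = Counter(input_words), Counter(collection_words)
    n_inp, n_coll = sum(inp.values()), sum(coll.values())
    kl = 0.0
    for w, c in inp.items():
        p = c / n_inp
        q = max(coll.get(w, 0) / n_coll, eps)
        kl += p * math.log2(p / q)
    return kl
```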
27
Features – Log likelihood ratio based
more topic terms, similar topic terms ~ topic-oriented, easy
input
Number of topic signature terms
Percentage of topic signatures in the vocabulary
~ control for length of the input
Pair-wise topic signature overlap (average, min, max)
~ similarity between the topic vectors of different documents
~ cosine overlap with reduced & specific vectors
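A sketch of these derived features, assuming the signature terms were already extracted with the log-likelihood ratio test (Lin and Hovy, 2000); Jaccard overlap stands in here for the slide's cosine over reduced vectors:

```python
from itertools import combinations

def topic_signature_features(doc_word_sets, signature_terms):
    """doc_word_sets: one set of words per document in the input.
    signature_terms: precomputed topic signature terms for the input."""
    sigs = set(signature_terms)
    vocab = set().union(*doc_word_sets)
    doc_sigs = [words & sigs for words in doc_word_sets]  # signatures per doc
    overlaps = [len(a & b) / len(a | b) if a | b else 0.0
                for a, b in combinations(doc_sigs, 2)]
    return {
        "num_topic_signatures": len(sigs),
        "pct_signatures_in_vocab": len(sigs & vocab) / len(vocab),
        "sig_overlap_avg": sum(overlaps) / len(overlaps),
        "sig_overlap_min": min(overlaps),
        "sig_overlap_max": max(overlaps),
    }
```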
28
What makes some inputs easy?
Easy inputs have
 smaller vocabulary
 smaller entropy
 greater divergence from a random collection
 higher % of topic signatures in the vocabulary
 higher average cosine and topic signature overlap
29
Input difficulty hypothesis for
systems
Indicator of an input’s difficulty
 Average system coverage score
 Difficult, if most systems select poor content
Defining difficulty of inputs
 2 classes
 Above / below the “mean average system score”
 > mean score – easy
 < mean score – difficult
 This split gives two roughly equal classes
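A minimal sketch of this labeling scheme:

```python
def label_difficulty(avg_scores):
    """avg_scores: {input_id: average system coverage score}.
    Inputs at or below the mean of these averages are 'difficult'."""
    mean = sum(avg_scores.values()) / len(avg_scores)
    return {inp: "easy" if score > mean else "difficult"
            for inp, score in avg_scores.items()}
```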
30
Classification Results
Baseline performance : 50%
Test set: DUC 2002 - 04
10 fold cross validation on 192 observations
Precision and recall of difficult inputs:
Accuracy: 69.27% | Precision: 0.696 | Recall: 0.674
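A scikit-learn sketch of this evaluation setup; logistic regression and the synthetic data are placeholders (the slides do not name the classifier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(192, 10))    # stand-in: one feature row per input
y = rng.integers(0, 2, size=192)  # stand-in: 0 = easy, 1 = difficult

clf = LogisticRegression(max_iter=1000)
acc = cross_val_score(clf, X, y, cv=10).mean()  # 10-fold accuracy
print(f"{acc:.3f}")  # ~0.5 on random data; 69.27% reported on real features
```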
31
Summary Evaluation without
Human Models
 Current Evaluation Measures - Recap
 Content Coverage
 Pyramid
 Responsiveness
 ROUGE
* My work with Ani
32
Need for cheap, fast measures
 All current evaluations require human effort
 Human summaries (content overlap, pyramid, rouge)
 Manual marking of summaries (responsiveness)
 Human summaries are biased
 several summaries for the same input are needed to
remove bias (Pyramid, ROUGE)
 Can we come up with cheaper evaluation techniques
that will produce the same rankings for systems as
human evaluations?
33
Compare with input – No human
models
 Estimate closeness of summary to input
 The closer a summary is to the input, the better its content should be
 How do we verify this?
 Design some features that can reflect how close a
summary is to the input
 Rank summaries based on the value of this feature
 Compare the obtained rankings to rankings given by
humans
 Similar rankings (high correlation) – you have succeeded
34
 What features should we use?
 We want to know how well a summary reflects the input's content.
 Guesses?
35
Features - Divergence between
input and summary
Smaller divergence ~ better summary
 KL divergence input – summary
 KL divergence summary – input
 Jensen Shannon Divergence
JSD(Inp ‖ Summ) = H(½·Inp + ½·Summ) − ½·H(Inp) − ½·H(Summ)
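A sketch computing JSD between the unigram distributions of input and summary:

```python
import math
from collections import Counter

def unigram(words):
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def jsd(input_words, summary_words):
    """H(0.5*Inp + 0.5*Summ) - 0.5*H(Inp) - 0.5*H(Summ)."""
    p, q = unigram(input_words), unigram(summary_words)
    mix = {w: 0.5 * p.get(w, 0.0) + 0.5 * q.get(w, 0.0)
           for w in set(p) | set(q)}
    return entropy(mix) - 0.5 * entropy(p) - 0.5 * entropy(q)
```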
36
Features – Use of topic words from
the input
More topic words ~ better summary
 % of summary composed of topic words
 % of input’s topic words carried over to the summary
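A sketch of both features, assuming the input's topic words are precomputed (e.g. its topic signatures):

```python
def topic_word_features(summary_words, input_topic_words):
    """Both arguments are lists of tokens."""
    topics = set(input_topic_words)
    return {
        "pct_summary_is_topic_words":
            sum(w in topics for w in summary_words) / len(summary_words),
        "pct_input_topics_in_summary":
            len(topics & set(summary_words)) / len(topics),
    }
```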
37
Features – Similarity between input
and summary
More similar to the input ~ better summary
 Cosine similarity
 input – summary words
 Cosine similarity
 input’s topic signatures – summary words
38
Features - Summary Probability
Higher likelihood of summary given input ~ better
summary
 Unigram summary probability
   P(summary) = p_Inp(w₁)^n₁ · p_Inp(w₂)^n₂ · … · p_Inp(w_r)^n_r
 Multinomial summary probability
   P(summary) = [ N! / (n₁! … n_r!) ] · p_Inp(w₁)^n₁ · p_Inp(w₂)^n₂ · … · p_Inp(w_r)^n_r
where
   r – summary vocabulary size
   n_i – count in the summary of word w_i
   n₁ + … + n_r = N, the summary size
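A sketch computing both quantities in log space to avoid underflow; the floor for words unseen in the input model is an illustrative choice:

```python
import math
from collections import Counter

def log_summary_probability(summary_words, input_probs,
                            multinomial=False, floor=1e-12):
    """log2 of the summary's probability under the input's unigram model
    input_probs (word -> p_Inp(word)); with multinomial=True the
    coefficient N!/(n1!...nr!) is included (computed via lgamma)."""
    counts = Counter(summary_words)
    logp = sum(n * math.log2(max(input_probs.get(w, 0.0), floor))
               for w, n in counts.items())
    if multinomial:
        N = sum(counts.values())
        logp += (math.lgamma(N + 1)
                 - sum(math.lgamma(n + 1) for n in counts.values())
                 ) / math.log(2)
    return logp
```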
39
Analysis of features
 The value of the feature will be the score for the
summary
 Average the feature values for a particular system over
all inputs
 Compare to average human score
 Spearman (rank) correlation
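A sketch of this comparison using scipy's Spearman correlation; the dict-based interface is illustrative:

```python
from scipy.stats import spearmanr

def rank_agreement(feature_by_system, human_by_system):
    """Both dicts map system_id -> score averaged over all inputs.
    Returns Spearman's rho between the two induced system rankings."""
    systems = sorted(feature_by_system)
    rho, _ = spearmanr([feature_by_system[s] for s in systems],
                       [human_by_system[s] for s in systems])
    return rho
```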
40
Results
TAC 2008 Query focused summarization
48 inputs, 57 systems
Spearman correlation of each feature with manual scores:

Feature                              | Pyramid  | Responsiveness
JSD                                  | -0.8803  | -0.7364
% of input's topic words in summary  | -0.8741  | -0.8249
KL div summary–input                 | -0.7629  | -0.6941
Cosine overlap                       |  0.7117  |  0.6469
% of summary that is topic words     |  0.7115  |  0.6015
KL div input–summary                 | -0.6875  | -0.5850
Unigram summary probability          | -0.1879  | -0.1006
Multinomial summary probability      |  0.2224  |  0.2353
41
Evaluation without human models
 Comparison with input – correlates well with human
judgements
 Cheap, fast, unbiased
 No human effort needed
42
Other summarization tasks of
interest
 Update summaries
 The user has read a set of documents A
 Produce a summary of updates from a set B of documents
published later in time
 Query focused
 A topic statement is given to focus content selection
43
Other summarization tasks of
interest
 Blog/ Opinion Summarization
 Mine opinions, good/ bad product reviews etc
 Meeting/ Speech Summarization
 How would you summarize a brainstorming session? :)
44
What you have learnt today..
 How simple features you already know can be put to
use for interesting applications
 Beyond a simple sentence extractor engine –
customizing for inputs/ user/ task-setting is
important
 There are a lot of interesting tasks in summarization
and language processing
45