Connecting the Dots Between News Articles
Dafna Shahaf and Carlos Guestrin

Information overload is everywhere

Well, we have Google…

Search Limitations
• Input
• Output
• Interaction: new query

Our Approach
• Input: phrase complex information needs
• Output: structured, annotated output
• Interaction: new query, plus richer forms of interaction

Connecting the Dots: News Domain
• Input: pick two articles (start, goal)
• Output: bridge the gap with a smooth chain of articles
• Example: Housing Bubble (3.19.2008) to Bailout
  – Keeping Borrowers Afloat
  – A Mortgage Crisis Begins to Spiral, ...
  – Investors Grow Wary of Bank's Reliance on Debt
  – Markets Can't Wait for Congress to Act
  – Bailout Plan Wins Approval

Game Plan
• What is a good chain?
• Formalize objective
• Score a chain
• Find a good chain

What is a Good Chain?
• What’s wrong with shortest-path?
• Build a graph
  – Node for every article
  – Edges based on similarity, in chronological order (DAG)
• Run BFS from s to t

Shortest-path (Lewinsky to Florida recount)
• A1: Talks Over Ex-Intern's Testimony On Clinton Appear to Bog Down
• A2: Judge Sides with the Government in Microsoft Antitrust Trial
• A3: Who will be the Next Microsoft? – trading at a market capitalization…
• A4: Palestinians Planning to Offer Bonds on Euro. Markets
• A5: Clinton Watches as Palestinians Vote to Rescind 1964 Provision
• A6: Contesting the Vote: The Overview; Gore asks Public For Patience;
Stream of consciousness?
  – Each transition is strong
  – No global theme

More-Coherent Chain (Lewinsky to Florida recount)
• B1: Talks Over Ex-Intern's Testimony On Clinton Appear to Bog Down
• B2: Clinton Admits Lewinsky Liaison to Jury
• B3: G.O.P. Vote Counter in House Predicts Impeachment of Clinton
• B4: Clinton Impeached; He Faces a Senate Trial
• B5: Clinton’s Acquittal; Senators Talk About Their Votes
• B6: Aides Say Clinton Is Angered As Gore Tries to Break Away
• B7: As Election Draws Near, the Race Turns Mean
• B8: Contesting the Vote: The Overview; Gore asks Public For Patience;
What makes it coherent?

Word Patterns For Shortest Path Chain
• Topic changes every transition (jittery)

Word Patterns For Coherent Chain
• Topic consistent over transitions
• Use this intuition to estimate coherence of chains
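
The shortest-path baseline from the slides (one node per article, similarity edges that respect chronological order, then BFS) can be sketched roughly as follows. This is only an illustration: the Jaccard keyword overlap, the `threshold` value, the function names and the toy keyword sets are all assumptions made here, not details from the talk.

```python
from collections import deque


def word_overlap(a, b):
    """Jaccard overlap of two keyword sets (an assumed, illustrative similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def shortest_chain(articles, start, goal, threshold=0.15):
    """Fewest-hops chain from `start` to `goal`, or None if unreachable.

    articles: list of (day, keyword_set); list indices are node ids.
    An edge i -> j exists only if j is published after i (so the graph
    is a DAG) and the two articles are similar enough.
    """
    edges = {i: [] for i in range(len(articles))}
    for i, (day_i, words_i) in enumerate(articles):
        for j, (day_j, words_j) in enumerate(articles):
            if day_i < day_j and word_overlap(words_i, words_j) >= threshold:
                edges[i].append(j)

    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            chain = []
            while node is not None:  # walk parents back to the start
                chain.append(node)
                node = parent[node]
            return chain[::-1]
        for nxt in edges[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


# Toy data: invented keyword sets, not real articles.
toy_articles = [
    (1, {"clinton", "lewinsky", "testimony"}),
    (2, {"clinton", "lewinsky", "jury"}),
    (3, {"clinton", "impeachment", "senate"}),
    (4, {"clinton", "gore", "election"}),
    (5, {"gore", "florida", "recount"}),
]
print(shortest_chain(toy_articles, 0, 4))
```

On this toy data the call returns `[0, 3, 4]`: the chain jumps from the testimony straight to the election and the recount, skipping the jury and impeachment articles. Each hop shares a keyword, but nothing rewards telling the story in between, which is the weakness the slides point out.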
What is a Good Chain?
• Every transition is strong
• Global theme
• No jitteriness (back-and-forth)
• Short (5-6 articles?)

Strong transitions between consecutive documents
[Figure: documents d1 to d4 linked to words w1: Lewinsky, w2: Clinton, w3: Oath, w4: Intern, w5: Microsoft]
• Transition strengths along d1, d2, d3, d4: 4, 3, 1
• Chain score: min(4,3,1) = 1
• Too coarse
  – Word importance in transition
  – Missing words

Influence(di, di+1 | w)
Intuitively, high iff
• di and di+1 very related
• w plays an important role in the relationship

Influence
• Most methods assume edges
  – Influence propagates through the edges
• No edges in our dataset

Computing Influence(di, dj | w)
[Figure: bipartite graph linking documents (Clinton Admits Lewinsky, Judge Sides with the Government, Contest the Vote, The Next Microsoft) to words (Clinton, Judge, Microsoft, Gore)]
• Run random walks
  – Random restarts from di
  – ε controls expected length
• Calculate stationary distribution of dj
  – Intuitively, high if documents are related
• How important is w?
  – Check how many walks went through w
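
A rough sketch of the random-walk computation, taking influence to be the stationary mass of dj with w in the graph minus the mass with w disabled, as the slides define it. Several details are assumptions made for illustration: edges are unweighted, "without w" is modelled by turning w into a sink that absorbs walks, the stationary distribution is approximated by power iteration, and the function names and toy corpus are invented.

```python
def stationary(doc_words, start, eps=0.25, sink_word=None, iters=200):
    """Approximate stationary mass of a random walk with restarts.

    doc_words: dict mapping document -> set of words (bipartite graph).
    start: document the walk restarts from with probability eps.
    sink_word: optional word that absorbs every walk reaching it
               (an assumed way to model the graph "without w").
    Returns a dict mapping ("doc", name) / ("word", name) -> mass.
    """
    # Invert the document -> words map to get word -> documents.
    word_docs = {}
    for doc, words in doc_words.items():
        for word in words:
            word_docs.setdefault(word, set()).add(doc)

    nodes = [("doc", d) for d in doc_words] + [("word", w) for w in word_docs]
    mass = dict.fromkeys(nodes, 0.0)
    mass[("doc", start)] = 1.0
    for _ in range(iters):
        nxt = dict.fromkeys(nodes, 0.0)
        nxt[("doc", start)] += eps  # restart step
        for (kind, name), p in mass.items():
            if kind == "word" and name == sink_word:
                continue  # walks that reach the sink stop here
            if kind == "doc":
                neighbours, other = doc_words[name], "word"
            else:
                neighbours, other = word_docs[name], "doc"
            for nb in neighbours:  # spread the remaining mass uniformly
                nxt[(other, nb)] += (1 - eps) * p / len(neighbours)
        mass = nxt
    return mass


def influence(doc_words, di, dj, w, eps=0.25):
    """Mass reaching dj from di with w present, minus mass with w as a sink."""
    with_w = stationary(doc_words, di, eps)[("doc", dj)]
    without_w = stationary(doc_words, di, eps, sink_word=w)[("doc", dj)]
    return with_w - without_w


# Toy corpus: invented word sets loosely named after the slide's figure.
toy_docs = {
    "clinton_admits_lewinsky": {"clinton", "lewinsky"},
    "contest_the_vote": {"clinton", "gore"},
    "judge_sides": {"judge", "microsoft"},
    "next_microsoft": {"microsoft"},
}
for word in ("clinton", "gore", "microsoft"):
    print(word, influence(toy_docs, "clinton_admits_lewinsky", "contest_the_vote", word))
```

In this toy corpus "clinton" is the only word connecting the two documents, so disabling it leaves dj unreachable and all of the influence is attributed to it. "microsoft" lies in a part of the graph the walk never visits, so its influence is zero.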
Influence(di, dj | w) = Stationary distribution(dj) with w − Stationary distribution(dj) without w
• If dj is no longer reachable without w: all influence is due to w

Influence: Reality Check
• di: OJ Simpson trial article
  – dj: DNA evidence in OJ trial
  – dj: Super Bowl 49ers

Coherence formulation
• No edges: influence is computed using random walks

Global Theme, No Jitter
• Jittery chain can score well!
• But need a lot of words…
• Good chains can often be represented by a small number of segments
• Choose 3 segments to be scored on
  – Figure labels: Score = 0; Good score

Coherence: New Objective
• Maximize over legal activations:
  – Limit total number of active words
  – Limit number of words per transition
  – Each word to be activated at most once

Scoring a Chain
• Problem is NP-Complete
• Softer notion of activation: [0,1]
• Natural formalization as a linear program (LP)

LP: Objective
• Pre-computed

LP: Smoothness
• A word is active if either
  – Active before
  – Just initialized
• Each word is initialized at most once

LP: Activation
• Limit #words
• Limit #words per transition

Example
• Scoring a chain
  – September 11th to Daniel Pearl
[Figure: Activation levels]

Finding a good chain
• Can’t brute-force
  – n^d possible chains: >> 10^20 after pruning
• Joint LP: optimize activation and chain
• New variables:
  – Is document di a part of the chain?
  – Does document dj come after di in the chain?
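
For reference, the chain-scoring LP that the joint LP builds on can be written out from the Objective, Smoothness and Activation slides. The following is a sketch under assumptions: the symbols (a for activation, init for initialization, m for the weakest-transition score, k_total and k_trans for the two limits) are chosen here for illustration, and the max-min objective is linearized with the auxiliary variable m, a standard trick that the slides do not spell out.

```latex
\begin{align*}
\max \quad & m \\
\text{s.t.} \quad
& m \le \sum_{w} a_{w,i}\,\mathrm{Influence}(d_i, d_{i+1} \mid w)
  && \forall i \quad \text{(every transition is strong)} \\
& a_{w,i} \le a_{w,i-1} + \mathit{init}_{w,i}, \qquad a_{w,0} = 0
  && \forall w, i \quad \text{(active before, or just initialized)} \\
& \sum_{i} \mathit{init}_{w,i} \le 1
  && \forall w \quad \text{(initialized at most once)} \\
& \sum_{w} \sum_{i} \mathit{init}_{w,i} \le k_{\mathrm{total}}
  && \text{(limit \#words)} \\
& \sum_{w} a_{w,i} \le k_{\mathrm{trans}}
  && \forall i \quad \text{(limit \#words per transition)} \\
& a_{w,i},\ \mathit{init}_{w,i} \in [0,1]
  && \forall w, i \quad \text{(soft activation)}
\end{align*}
```

Here i ranges over the transitions of a fixed chain and the Influence terms are pre-computed constants, so every constraint is linear in a, init and m.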
• New constraints (joint LP):
  – Chain structure
  – Length = K

Rounding
• Unlike previous LP, we need to round
  – Extract a chain
[Figure: fractional LP solution on a graph from s to t, with edge values 0.1, 0.3, 0.7, 0.6, 0.9]
• Approximation guarantees
  – Chain length K in expectation
  – Objective: O(sqrt(ln(n/e))) with probability 1 - e

Scaling Up
• LP has a polynomial number of variables, but D is large
  – Restricting number of documents
  – Sparsifying the graph
• Random walks

Game Plan
• What is a good chain?
• Formalize objective
• Score a chain
• Find a good chain
• How good is it?

Evaluation: Competitors
• Shortest path
• Google Timeline
  – Enter a query
  – Pick k equally-spaced articles
• Event threading (TDT) [Nallapati et al ‘04]
  – Generate cluster graph
  – Representative articles from clusters

Example Chain (1): Google News Timeline
Simpson trial to Simpson verdict
• Simpson Strategy: There Were Several Killers
• O.J. Simpson's book deal controversy
• CNN OJ Simpson Trial News: April Transcripts
• Tandoori murder case a rival for OJ Simpson case

Example Chain (2): Connect-the-Dots
Simpson trial to Simpson Verdict
• Issue of Racism Erupts in Simpson Trial
• Ex-Detective's Tapes Fan Racial Tensions in LA
• Many Black Officers Say Bias Is Rampant in LA Police Force
• With Tale of Racism and Error, Lawyers Seek Acquittal

Evaluation #1: Familiarity
• 18 users
  – 5 news stories
• Show two articles and ask: Do you know a coherent story linking these articles?
  – Before and after reading the chain

Effectiveness (improvement in familiarity)
[Figure: bar chart of the average fraction of gap closed (higher is better) for ConnectDots, Google, Shortest and TDT on five stories]
• Base familiarity: Elections 2.1, Pearl 3.1, Lewinsky 3.2, OJ 3.4, Enron 1.9
• We are better almost everywhere

Evaluation #2: Chain Quality
• Compare two chains for
  – Coherence
  – Relevance
  – Redundancy

Relevance, Coherence, Non-redundancy
[Figure: bar chart of the fraction of times preferred (higher is better) for ConnectDots, Google, Shortest and TDT on relevance, coherence and non-redundancy, across complex stories]

What’s left?
• Two documents to chain
• Interaction

Interaction Types
1. Refinement
2. User interests
[Figure: chain d1, d2, d3, d4 with a ??? placeholder between documents]

Interaction
• Algorithmic ideas from online learning
[Figure: two chains ending at the Simpson Verdict, one for the interest "Race" and one for "Blood, glove". Article titles: Simpson Defense Drops DNA Challenge; Defense cross-examines state DNA expert; With fiber evidence, prosecution …; Many black officers say bias is rampant in LA police force; A day the country stood still; In the joy of victory, defense team in discord; Racial split at the end]

Interaction User Study
• Refinement
  – 72% prefer chains refined our way
• User Interests
  – 63.3% able to identify intruders
  – 2 correct words out of 10

Conclusions
• Fight information overload
  – Provide a structured, easy way to navigate topics
• New task
• Explored desired properties
  – LP formalization
  – Efficient algorithm
• Evaluated over real news data
  – Demonstrate effectiveness
  – Interaction
• Complex information needs to structured, annotated output

… Since KDD
• New domains
  – Research papers
  – Email
• New questions
  – Fixed endpoints? No endpoints?
• New forms of output

Issue Maps
• Example claims:
  – machines can’t have emotions
  – we can imagine artifacts that have feelings [Smart ‘59]
  – concept of feeling only applies to living organisms [Ziff ‘59]
• Example link: is disputed by
• Challenge: Build automatically!
• News example (OJ Simpson Trial):
  – Simpson Defense Lawyers Unleash Sharp Assault on Police Inquiry at Murder Scene
  – With Tale of Racism and Error, Simpson Lawyers Seek Acquittal