Using MapReduce for Scalable Coreference Resolution

Tamer Elsayed, Doug Oard, Jimmy Lin, Asad Sayeed and Tan Xu
HLT COE and UMIACS Laboratory for Computational Linguistics and Information Processing
COE Quarterly Technical Exchange, June 10th 2008

COE ACE System
  English Pipeline: Within-Doc Coref., Pairs Filtering, Feature Generation, Clustering
  Arabic Pipeline: Within-Doc Coref., Feature Generation, Clustering
  Conversational Genre Features
  Context Features

Roadmap
1. Context Features
   Pairwise similarity
   Efficiency vs. effectiveness
   Generating features for ACE
2. Conversational-genre Features
   New generative model
   Joint Resolution
   Evaluation using ACE-Usenet

Context Features

Close friends and colleagues of Cheney -- including former Gen. Brent Scowcroft, who was national security adviser when Cheney was Gerald Ford's chief of staff and George H. W. Bush's defense secretary -- have been famously quoted they just don't recognize the Cheney they served along side and the Cheney of today who repeatedly made false assertions about the Iraq war and weapons of mass destruction.
Now, an article in Vanity Fair Magazine by Todd S. Purdum has published a number of strikingly similar assessments from Clinton's former confidants -- plus medically authoritative guesswork speculating about how health problems of the sort Clinton experienced can change a person.
But we avoid that trash talk to focus only on the real, striking changes in the public performances of Bill Clinton and Dick Cheney today. Compared to the way they were, back when they were greatly admired by those who knew them best, back in the day.
Once, Clinton and Cheney were considered consummate political performers. Now they utter gaffes and commit blunders.
And they leave the lasting impression that they just don't care about what you think about it. Once, they were smart and savvy strategic forces that always seemed to boost the political fortunes of their team (Clinton with sterling public performances; Cheney with rock-steady behind-the-scenes guidance). Now they have become liabilities to their causes, grand grist for late-night monologues, caricatures on "Saturday Night Live."

It barely seems credible now but there was a time when it seemed the Democratic nomination was Hillary Clinton's for the taking. The air of certainty in January was convincing when Clinton declared from a sofa at her Washington home: "I'm in and I'm in to win." Two Democratic senators and two former governors swiftly pulled out rather than get between Clinton and White House. Then along came Barack Obama and the aura of inevitability that was crucial to Clinton's strategy vanished.
"The campaign was meant to be shock and awe: big events in big states, sweep the board on Super Tuesday, overwhelm the less well-known competitors," said Chip Smith, who was deputy campaign manager for Al Gore in 2000. "Unfortunately, Obama uprooted that strategy. Inevitability isn't a viable strategy against a well-funded candidate with a powerful message."
It is unclear whether there was anything Clinton could have done to stop a gifted politician such as Obama, once his early win in Iowa and prodigious fundraising ability established that he really did have a chance of winning the Democratic nomination. Clinton also may have destroyed any chance of a comeback after being caught out in her fib about coming under sniper fire while in Bosnia in the 1990s. The lie crystallised voter unease with Clinton, and held back chances of a grand comeback in Pennsylvania. In April, a Washington Post/ABC News poll found that 61% of American voters considered her dishonest and untrustworthy.
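
Both excerpts mention "Clinton", but the surrounding words point to different people (Bill Clinton alongside Cheney, Hillary Clinton alongside Obama). A minimal sketch of how such context could be turned into a comparable vector, assuming a bag-of-words window around each mention and a cosine score. The window size, the tiny stopword list, and the names `context_vector` and `cosine` are illustrative assumptions, not the system's actual feature extractor:

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, just large enough for the snippets below.
STOPWORDS = {"a", "and", "as", "could", "have", "in", "is", "it", "of",
             "s", "such", "that", "the", "there", "to", "was"}

def context_vector(text, mention, window=12):
    """Count non-stopword tokens within `window` tokens of each mention."""
    tokens = re.findall(r"[a-z]+", text.lower())
    target = mention.lower()
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            nearby = tokens[max(0, i - window): i + window + 1]
            vec.update(t for t in nearby
                       if t != target and t not in STOPWORDS)
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(w * v[t] for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

# Snippets taken from the two excerpts above.
doc_a = ("the real, striking changes in the public performances of "
         "Bill Clinton and Dick Cheney today")
doc_b = ("Then along came Barack Obama and the aura of inevitability "
         "that was crucial to Clinton's strategy vanished")
doc_c = ("It is unclear whether there was anything Clinton could have "
         "done to stop a gifted politician such as Obama")

vec_a = context_vector(doc_a, "Clinton")
vec_b = context_vector(doc_b, "Clinton")
vec_c = context_vector(doc_c, "Clinton")

same_person = cosine(vec_b, vec_c)       # both contexts contain "obama"
different_person = cosine(vec_a, vec_b)  # no shared context words
```

In this toy setting the two mentions from the second excerpt score higher with each other than with the mention from the first excerpt, which is the kind of signal a context feature is meant to capture.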

Abstract Problem
  [Figure: a set of documents, each shown with a column of scores (0.20, 0.30, 0.54, 0.21, 0.00, 0.34, 0.34, 0.13, 0.74)]
  Goal: Scalable Pairwise Similarity
  ~10K docs: ~50 million doc pairs
  ~140K entities: ~10 billion entity pairs

Solutions
  Trivial
    $sim(d_i, d_j) = \sum_{t \in V} w_{t,d_i} \cdot w_{t,d_j}$
    Loads each vector $O(N)$ times
    Loads each term $t$ $O(df_t^2)$ times
  Better
    Each term contributes only if it appears in $d_i \cap d_j$
    $sim(d_i, d_j) = \sum_{t \in d_i \cap d_j} w_{t,d_i} \cdot w_{t,d_j}$
    $sim(d_i, d_j) = \sum_{t \in d_i \cap d_j} term\_contrib(t, d_i, d_j)$
    Loads each term (with posting list) once
    Each term contributes $O(df_t^2)$ partial results

Indexing (3-doc toy collection)
  [Figure: three toy documents containing the terms Clinton, Obama, Cheney, and Barack are indexed into posting lists with term frequencies]
  Standard IR Indexing

Pairwise Similarity
  (a) Generate pairs
  (b) Group pairs
  (c) Sum pairs
  [Figure: the postings for Clinton, Cheney, Barack, and Obama from the toy collection pass through the three steps]

Pairwise Similarity (abstract)
  (a) Generate pairs: multiply the term postings
  (b) Group pairs: grouping
  (c) Sum pairs: sum to obtain each similarity

MapReduce!
  (a) Map: each input goes to a map
  (b) Shuffle: shuffling groups values by keys
  (c) Reduce: each reduce writes an output

And indexing .. of course!
  (a) Map: tokenize each doc
  (b) Shuffle: shuffling groups values by keys
  (c) Reduce: combine into a posting list

Terms: Zipfian Distribution
  each term t contributes $O(df_t^2)$ partial results
  very few terms dominate the computations
    most frequent term (“said”): 3%
    most frequent 10 terms: 15%
    most frequent 100 terms: 57%
    most frequent 1000 terms: 95%
  ~0.1% of total terms (99.9% df-cut)
  [Figure: doc freq (df) vs. term rank]

Efficiency (disk space)
  Aquaint-2 Collection, about a million docs
  8 trillion intermediate pairs
  [Figure: Intermediate Pairs (billions) vs. Corpus Size (%)]
  Hadoop, 19 PCs, each: 2 single-core processors, 4GB memory, 100GB disk

Efficiency (disk space)
  Aquaint-2 Collection, about a million docs
  [Figure: Intermediate Pairs (billions) vs. Corpus Size (%); series: df-cut at 99%, df-cut at 99.9%, df-cut at 99.99%, df-cut at 99.999%, no df-cut; annotations: 8 trillion intermediate pairs, 0.5 trillion intermediate pairs]

Effectiveness
  Effect of df-cut on effectiveness
  Medline04 - 909k abstracts - Ad-hoc retrieval
  [Figure: Relative P5 (%) vs. df-cut (%), from 99.00 to 100.00]
  Drop 0.1% of terms
    “Near-Linear” Growth
    Fit on disk
    Cost 2% in Effectiveness
  For more details, check “Pairwise Document Similarity in Large Collections with MapReduce” at ACL 2008 (presented next week!)
  Hadoop, 19 PCs, each: 2 single-core processors, 4GB memory, 100GB disk

In ACE!
  ~10K docs
    each document is a vector
  ~140K entities
    each has multiple mentions
    each entity context is a vector
  Generated 8 feature matrices (6 English + 2 Arabic)
  English Pipeline: Within-Doc Coref., Pairs Filtering, Feature Generation, Clustering
  Arabic Pipeline: Within-Doc Coref., Feature Generation, Clustering

Roadmap
1. Context Features
   Pairwise similarity
   Efficiency vs. effectiveness
   Generating features for ACE
2. Conversational-genre Features
   New generative model
   Joint Resolution
   Evaluation using ACE-Usenet

Identity Resolution in Email
  Date: Wed Dec 20 08:57:00 EST 2000
  From: Kay Mann <kay.mann@enron.com>
  To: Mary Adams <mary.adams@enron.com>
  Subject: Re: tennis tomorrow!

  Did Sue want Scott to join? Looks like the game will be too late for him.

  [Callout: Who? i.e., label with email address]
  Identity Resolution

New Generative Model
  1. Choose “person” c to mention: p(c)
  2. Choose appropriate “context” X to mention c: p(X | c)
  3. Choose a “mention” l: p(l | X, c)
  [Figure labels: playing tennis, “sue”]

Context
  Social Context
  Topical Context
  Conversational Context
  Local Context

Single-Mention: 2-Step Solution
  (1) Identity Modeling: Prior Distribution
  (2) Mention Resolution: Evidence, Posterior Distribution

Improved Results
  Effectiveness Comparison on Enron Collection
  [Figure: bar chart of Heuristic vs. Generative on MRR and P@1, with annotated gains of +8.9% and +8.6%]
  For more details, check “Resolving Personal Names in Email using Context Expansion” at ACL 2008 (also presented next week!)

Limitation!
  [Figure: mention graph linking “sjhonson@enron.com”, “Susan Scott”, “Sue”, “Sue”, “Suebob”, “Susan Jones”, and “Susan” through social, topical, and conversational edges]
  Context-Free Resolution
  Joint Resolution!

Joint Resolution
  Mention Graph
  Spread Current Resolution
  Combine Context Info
  Update Resolution

Joint Resolution
  Mention Graph
  map, shuffle, reduce: MapReduce!
  Work in Progress!

Roadmap
Context Features
   Pairwise similarity
   Efficiency vs. effectiveness
   Generating features for ACE
Conversational-genre Features
   New generative model
   Joint Resolution
   Evaluation using ACE-Usenet

Email Message
  From: Machiavegli <machia@aol.com>
  To: Mark <mk@hotmail>
  Date: 29 Jan 2005 22:04:38 GMT
  Subject: The 1860 Presidential Election
  [Callout on To: receiver is email address]

  In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War.

Usenet Message
  From: Machiavegli <machia@aol.com>
  Newsgroup: soc.history.what-if
  Date: 29 Jan 2005 22:04:38 GMT
  Subject: The 1860 Presidential Election
  [Callout on Newsgroup: newsgroup!]

  In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won.
  This would have delayed secession and the Civil War.

ACE Usenet Document
  <DOCID> soc.history.what-if_20350205910 </DOCID>
  <POSTER> Machiavegli </POSTER>
  <POSTDATE> 29 Jan 2005 22:04:38 GMT </POSTDATE>
  <SUBJECT> The 1860 Presidential Election </SUBJECT>
  [Callout on POSTER: no email addresses in headers!]

  In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War.

Reconstruct from
  From: Machiavegli <machia@aol.com>
  Newsgroup: soc.history.what-if
  Date: 29 Jan 2005 22:04:38 GMT
  Subject: The 1860 Presidential Election
  [Callouts: automatically; Got the address back!]

  In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War.

Handling it as @
  From: Machiavegli <machia@aol.com>
  To: soc.history.what-if@usenet.com
  Date: 29 Jan 2005 22:04:38 GMT
  Subject: The 1860 Presidential Election
  [Callout on To: handle group as receiver]

  In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote.
  WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War.

Feature Value: same label
  Need for feature matrix (pairwise score)
  [Figure: mentions “Steph”, “Stephan”, “Stephan”, and “S. Smith”; the compared pair is labeled sjhonson@hotmail.com and sjhonson@hotmail.com]
  +1.0

Feature Value: different labels
  Need for feature matrix (pairwise score)
  [Figure: mentions “Steph”, “Stephan”, “Stephan”, and “S. Smith”; the compared pair is labeled sjhonson@hotmail.com and smith_s@aol.com]
  -1.0

Conclusion
  MapReduce can be applied to many HLT applications
    easy, cheap, and fast for distributed processing
    e.g., scalable pairwise similarity for coreference resolution
    calls for new ways of thinking
  Identity resolution in email
    new generative model yields improved accuracy
    scalable joint resolution needed
  Usenet-ACE is new test collection

Thank You!

MapReduce and Text Analysis
  Computing pairwise similarity in large collections
  Joint resolution of mentions in email collections
  Search engines (of course!)
  Building language models
  Clustering applications
  Machine translation
  …
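
The first item in this list, pairwise similarity, follows the three steps from the Pairwise Similarity slides: generate pairs per term, group them by document pair, and sum. A minimal single-process sketch in Python, assuming raw term frequencies as weights. The toy documents, the function names `build_index` and `pairwise_similarity`, and the `max_df` threshold (a simplified stand-in for the percentage-based df-cut) are illustrative assumptions, and no Hadoop API is used:

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_index(docs):
    """Indexing job: tokenize each doc (map), then combine the
    (doc, tf) postings of each term into one posting list (reduce)."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] += 1  # assumption: weight = raw tf
    return index

def pairwise_similarity(index, max_df=None):
    """Similarity job over the inverted index.

    (a) map:     multiply the weights of every doc pair in a posting list
    (b) shuffle: group the partial products by doc pair
    (c) reduce:  sum the partial products of each pair
    """
    grouped = defaultdict(list)
    for term, postings in index.items():
        # Hypothetical df-cut: skip terms that occur in too many docs.
        if max_df is not None and len(postings) > max_df:
            continue
        for (di, wi), (dj, wj) in combinations(sorted(postings.items()), 2):
            grouped[(di, dj)].append(wi * wj)
    return {pair: sum(products) for pair, products in grouped.items()}

# Assumed 3-doc toy collection using the terms from the Indexing slide.
docs = {
    "d1": "Clinton Obama Clinton",
    "d2": "Clinton Cheney",
    "d3": "Clinton Barack Obama",
}

index = build_index(docs)
sims = pairwise_similarity(index)
sims_cut = pairwise_similarity(index, max_df=2)
```

On this toy collection the term "clinton" appears in every document and so contributes to all three pairs; with `max_df=2` it is skipped and only the "obama" contribution to the pair (d1, d3) remains, which is the saving the df-cut exploits.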