Using MapReduce
for Scalable Coreference Resolution
Tamer Elsayed, Doug Oard, Jimmy Lin, Asad Sayeed and Tan Xu
HLT COE and
UMIACS Laboratory for Computational Linguistics and Information Processing
COE Quarterly Technical Exchange, June 10th 2008
COE ACE System
[Diagram: two processing pipelines]
English Pipeline: Within-Doc Coref. → Pairs Filtering → Feature Generation (Conversational Genre Features, Context Features) → Clustering
Arabic Pipeline: Within-Doc Coref. → Feature Generation → Clustering
Roadmap
1. Context Features
   - Pairwise similarity
   - Efficiency vs. effectiveness
   - Generating features for ACE
2. Conversational-genre Features
   - New generative model
   - Joint resolution
   - Evaluation using ACE-Usenet
Context Features
Close friends and colleagues of Cheney -- including
former Gen. Brent Scowcroft, who was national security
adviser when Cheney was Gerald Ford's chief of staff
and George H. W. Bush's defense secretary -- have
been famously quoted they just don't recognize the
Cheney they served along side and the Cheney of today
who repeatedly made false assertions about the Iraq
war and weapons of mass destruction.
Now, an article in Vanity Fair Magazine by Todd S.
Purdum has published a number of strikingly similar
assessments from Clinton's former confidants -- plus
medically authoritative guesswork speculating about
how health problems of the sort Clinton experienced can
change a person.
But we avoid that trash talk to focus only on the real,
striking changes in the public performances of Bill
Clinton and Dick Cheney today. Compared to the way
they were, back when they were greatly admired by
those who knew them best, back in the day.
Once, Clinton and Cheney were considered
consummate political performers. Now they utter gaffes
and commit blunders. And they leave the lasting
impression that they just don't care about what you think
about it.
Once, they were smart and savvy strategic forces that
always seemed to boost the political fortunes of their
team (Clinton with sterling public performances; Cheney
with rock-steady behind-the-scenes guidance). Now
they have become liabilities to their causes, grand grist
for late-night monologues, caricatures on "Saturday
Night Live."
It barely seems credible now but there was a time when
it seemed the Democratic nomination was Hillary
Clinton's for the taking. The air of certainty in January
was convincing when Clinton declared from a sofa at
her Washington home: "I'm in and I'm in to win."
Two Democratic senators and two former governors
swiftly pulled out rather than get between Clinton and
White House. Then along came Barack Obama
and the aura of inevitability that was crucial to Clinton's
strategy vanished.
"The Clinton campaign was meant to be shock and
awe: big events in big states, sweep the board on Super
Tuesday, overwhelm the less well-known competitors,"
said Chip Smith, who was deputy campaign manager
for Al Gore in 2000.
"Unfortunately, Obama uprooted that strategy.
Inevitability isn't a viable strategy against a well-funded
candidate with a powerful message."
It is unclear whether there was anything Clinton could
have done to stop a gifted politician such as Obama,
once his early win in Iowa and prodigious fundraising
ability established that he really did have a chance of
winning the Democratic nomination.
Clinton also may have destroyed any chance of a
comeback after being caught out in her fib about coming
under sniper fire while in Bosnia in the 1990s. The lie
crystallised voter unease with Clinton, and held back
chances of a grand comeback in Pennsylvania. In April,
a Washington Post/ABC News poll found that 61% of
American voters considered her dishonest and
untrustworthy.
Abstract Problem
[Figure: a collection of documents, each paired with the others by similarity scores such as 0.20, 0.30, 0.54, 0.21, 0.00, 0.34, 0.34, 0.13, 0.74]
Goal: Scalable Pairwise Similarity
~10K docs → ~50 million doc pairs
~140K entities → ~10 billion entity pairs
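The pair counts above follow from counting unordered pairs, n(n-1)/2. A minimal sketch of that arithmetic, assuming exactly 10,000 documents and 140,000 entities (the slide only gives approximate sizes, and the function name is illustrative):

```python
from math import comb


def unordered_pairs(n: int) -> int:
    """Number of distinct unordered pairs among n items: n * (n - 1) / 2."""
    return comb(n, 2)


# Assumed round figures; the slide says "~10K docs" and "~140K entities".
doc_pairs = unordered_pairs(10_000)
entity_pairs = unordered_pairs(140_000)

print(f"{doc_pairs:,} document pairs")
print(f"{entity_pairs:,} entity pairs")
```

The quadratic growth is what makes a naive all-pairs comparison impractical at this scale.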
Solutions

Trivial
$$\text{sim}(d_i, d_j) = \sum_{t \in V} w_{t,d_i} \, w_{t,d_j}$$
- Loads each vector $O(N)$ times
- Loads each term $t$ $O(df_t^2)$ times
Better
- Each term contributes only if it appears in $d_i \cap d_j$
$$\text{sim}(d_i, d_j) = \sum_{t \in d_i \cap d_j} w_{t,d_i} \, w_{t,d_j}$$
$$\text{sim}(d_i, d_j) = \sum_{t \in d_i \cap d_j} \text{term\_contrib}(t, d_i, d_j)$$
- Loads each term (with posting list) once
- Each term contributes $O(df_t^2)$
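The two formulations give the same score, because a term missing from either document contributes a zero product. A minimal sketch, assuming raw term frequencies as the weights and three hypothetical toy documents (the slides do not specify the weighting scheme):

```python
from collections import Counter


def sim_trivial(vec_i, vec_j, vocabulary):
    """Sum over the whole vocabulary V; most products are zero."""
    return sum(vec_i[t] * vec_j[t] for t in vocabulary)


def term_contrib(t, vec_i, vec_j):
    """Contribution of a single shared term to the inner product."""
    return vec_i[t] * vec_j[t]


def sim_better(vec_i, vec_j):
    """Sum only over terms that appear in both documents."""
    shared = vec_i.keys() & vec_j.keys()
    return sum(term_contrib(t, vec_i, vec_j) for t in shared)


# Hypothetical toy documents; weights are raw term counts (an assumption).
d1 = Counter(["clinton", "obama", "clinton"])
d2 = Counter(["clinton", "cheney"])
d3 = Counter(["clinton", "barack", "obama"])
vocabulary = set(d1) | set(d2) | set(d3)

for name, (a, b) in {"d1,d2": (d1, d2), "d1,d3": (d1, d3), "d2,d3": (d2, d3)}.items():
    print(name, sim_trivial(a, b, vocabulary), sim_better(a, b))
```

Restricting the sum to shared terms is what lets the computation be driven by posting lists rather than by full document vectors.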
Indexing (3-doc toy collection)
[Figure: three toy documents containing the terms Clinton, Obama, Cheney, and Barack are indexed into one posting list per term, with counts attached to the postings]
Standard IR Indexing
Pairwise Similarity
[Figure: pairwise similarity on the toy collection, starting from the posting lists for Clinton, Cheney, Barack, and Obama]
(a) Generate pairs
(b) Group pairs
(c) Sum pairs
Pairwise Similarity (abstract)
(a) Generate pairs: term postings → multiply
(b) Group pairs: grouping
(c) Sum pairs: sum → similarity
MapReduce!
(a) Map: input → map
(b) Shuffle: group values by keys
(c) Reduce: reduce → output
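The map, shuffle, and reduce stages line up with the generate, group, and sum steps of pairwise similarity. The sketch below simulates them in a single process rather than on Hadoop. The function names, the toy posting lists, and the use of raw counts as weights are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def run_mapreduce(records, mapper, reducer):
    """Simulate the three stages in one process (no cluster involved)."""
    groups = defaultdict(list)
    for record in records:                      # (a) map each input record
        for key, value in mapper(record):
            groups[key].append(value)           # (b) shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}  # (c) reduce


def pair_mapper(record):
    """For one term, emit a partial product for every pair of documents sharing it."""
    term, postings = record
    for (doc_a, w_a), (doc_b, w_b) in combinations(sorted(postings), 2):
        yield (doc_a, doc_b), w_a * w_b


def sum_reducer(pair, partial_products):
    """Add up all partial products for one document pair."""
    return sum(partial_products)


# Hypothetical posting lists: term -> [(doc_id, weight), ...]
POSTINGS = [
    ("clinton", [(1, 2), (2, 1), (3, 1)]),
    ("cheney", [(2, 1)]),
    ("barack", [(3, 1)]),
    ("obama", [(1, 1), (3, 1)]),
]

similarity = run_mapreduce(POSTINGS, pair_mapper, sum_reducer)
print(similarity)
```

Terms with a single posting, such as "cheney" here, emit nothing, so only document pairs that share at least one term ever reach the reducer.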
And indexing .. of course!
(a) Map: doc → tokenize
(b) Shuffle: group values by keys
(c) Reduce: combine → posting list
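Indexing fits the same pattern: the mapper tokenizes a document and emits one posting per distinct term, and the reducer combines the postings for a term into a list. This is a minimal single-process sketch, assuming whitespace tokenization, lowercasing, and hypothetical toy documents, none of which are specified in the slides.

```python
from collections import Counter, defaultdict


def tokenize_mapper(doc_id, text):
    """Emit (term, (doc_id, term_frequency)) for each distinct term in the document."""
    for term, tf in Counter(text.lower().split()).items():
        yield term, (doc_id, tf)


def combine_reducer(term, postings):
    """Combine all postings for one term into a list sorted by document id."""
    return sorted(postings)


def build_index(docs):
    groups = defaultdict(list)
    for doc_id, text in docs.items():           # map
        for term, posting in tokenize_mapper(doc_id, text):
            groups[term].append(posting)        # shuffle: group by term
    return {term: combine_reducer(term, p) for term, p in groups.items()}  # reduce


# Hypothetical three-document toy collection.
DOCS = {
    1: "Clinton Obama Clinton",
    2: "Clinton Cheney",
    3: "Clinton Barack Obama",
}

index = build_index(DOCS)
for term, postings in sorted(index.items()):
    print(term, postings)
```

The resulting posting lists have the shape a pairwise similarity job would take as input.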
Terms: Zipfian Distribution
each term $t$ contributes $O(df_t^2)$ partial results
very few terms dominate the computations
- most frequent term (“said”) → 3%
- most frequent 10 terms → 15%
- most frequent 100 terms → 57%
- most frequent 1000 terms → 95%
~0.1% of total terms (99.9% df-cut)
[Plot: doc freq (df) vs. term rank]
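Because a term with document frequency df generates df(df-1)/2 document pairs, dropping a handful of the most frequent terms removes a disproportionate share of the work. The sketch below illustrates a df-cut on synthetic Zipf-like data; the function names and the frequency distribution are assumptions and do not reproduce the percentages on the slide.

```python
from fractions import Fraction


def intermediate_pairs(df_by_term):
    """Total document pairs generated: df * (df - 1) / 2 for each term."""
    return sum(df * (df - 1) // 2 for df in df_by_term.values())


def apply_df_cut(df_by_term, cut_percent):
    """Keep the cut_percent% of terms with the lowest df; drop the most frequent rest.

    cut_percent is passed as a string (e.g. "99.9") so the arithmetic stays exact.
    """
    ranked = sorted(df_by_term.items(), key=lambda item: item[1])
    keep = int(Fraction(cut_percent) * len(ranked) / 100)
    return dict(ranked[:keep])


# Synthetic Zipf-like document frequencies for 1000 terms (illustrative only).
SYNTHETIC_DF = {f"term{rank}": max(1, 1000 // rank) for rank in range(1, 1001)}

before = intermediate_pairs(SYNTHETIC_DF)
kept = apply_df_cut(SYNTHETIC_DF, "99.9")
after = intermediate_pairs(kept)

print(f"terms kept: {len(kept)} of {len(SYNTHETIC_DF)}")
print(f"intermediate pairs: {before:,} -> {after:,}")
```

With 1000 synthetic terms, a 99.9% cut drops only the single most frequent term, yet that one term accounts for a large part of the pair count.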
Efficiency (disk space)
Aquaint-2 Collection, ~ million doc
[Plot: intermediate pairs (billions, 0 to 9,000) vs. corpus size (%, 0 to 100); annotation: 8 trillion intermediate pairs]
Hadoop, 19 PCs, each: 2 single-core processors, 4GB memory, 100GB disk
Efficiency (disk space)
Aquaint-2 Collection, ~ million doc
[Plot: intermediate pairs (billions, 0 to 9,000) vs. corpus size (%, 0 to 100); series: df-cut at 99%, df-cut at 99.9%, df-cut at 99.99%, df-cut at 99.999%, no df-cut; annotations: 8 trillion intermediate pairs, 0.5 trillion intermediate pairs]
Hadoop, 19 PCs, each: 2 single-core processors, 4GB memory, 100GB disk
Effectiveness
Effect of df-cut on effectiveness
Medline04 - 909k abstracts - Ad-hoc retrieval
[Plot: Relative P5 (%) (50 to 100) vs. df-cut (%) (99.00 to 100.00)]
Drop 0.1% of terms:
- “Near-Linear” growth
- Fit on disk
- Cost 2% in effectiveness
For more details, check “Pairwise Document Similarity in Large Collections with MapReduce” at ACL 2008 (presented next week!)
Hadoop, 19 PCs, each: 2 single-core processors, 4GB memory, 100GB disk
In ACE!

- ~10K docs: each document is a vector
- ~140K entities: each has multiple mentions; each entity context is a vector
- Generated 8 feature matrices (6 English + 2 Arabic)
[Diagram: two processing pipelines]
English Pipeline: Within-Doc Coref. → Pairs Filtering → Feature Generation → Clustering
Arabic Pipeline: Within-Doc Coref. → Feature Generation → Clustering
Roadmap
1. Context Features
   - Pairwise similarity
   - Efficiency vs. effectiveness
   - Generating features for ACE
2. Conversational-genre Features
   - New generative model
   - Joint resolution
   - Evaluation using ACE-Usenet
Identity Resolution in Email
Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann <kay.mann@enron.com>
To: Mary Adams <mary.adams@enron.com>
Subject: Re: tennis tomorrow!
Did Sue want Scott to join? Looks like the game
will be too late for him.
Who? i.e., label with email address
Identity Resolution
New Generative Model
1. Choose “person” c to mention: p(c)
2. Choose appropriate “context” X to mention c: p(X | c) (e.g., playing tennis)
3. Choose a “mention” l: p(l | X, c) (e.g., “sue”)
Context
Social Context
Topical Context
Conversational Context
Local Context
Single-Mention: 2-Step Solution
(1) Identity Modeling
Prior Distribution
(2) Mention Resolution
Evidence
Posterior Distribution
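One plausible reading of the two steps is Bayes' rule applied to the generative model: the prior p(c) comes from identity modeling, and the observed context and mention supply the evidence, giving a posterior proportional to p(c) p(X | c) p(l | X, c). The sketch below follows that reading. All addresses, probability values, and the function name are hypothetical and are not taken from the system described in the talk.

```python
def resolve_mention(mention, context, prior, p_context, p_mention):
    """Return the posterior p(c | mention, context) for every candidate c."""
    scores = {
        c: prior[c]
        * p_context[c].get(context, 0.0)
        * p_mention[c].get(context, {}).get(mention, 0.0)
        for c in prior
    }
    total = sum(scores.values())
    if total == 0.0:
        # No candidate can generate this mention in this context.
        return {c: 0.0 for c in scores}
    return {c: score / total for c, score in scores.items()}


# Hypothetical candidates and probabilities, for illustration only.
PRIOR = {"sue.a@example.com": 0.5, "sue.b@example.com": 0.3, "scott@example.com": 0.2}
P_CONTEXT = {
    "sue.a@example.com": {"tennis": 0.1},
    "sue.b@example.com": {"tennis": 0.6},
    "scott@example.com": {"tennis": 0.5},
}
P_MENTION = {
    "sue.a@example.com": {"tennis": {"sue": 0.8}},
    "sue.b@example.com": {"tennis": {"sue": 0.9}},
    "scott@example.com": {"tennis": {"scott": 1.0}},
}

posterior = resolve_mention("sue", "tennis", PRIOR, P_CONTEXT, P_MENTION)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

In this toy setup the candidate with the lower prior still wins, because the tennis context is far more likely for that person.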
Improved Results
Effectiveness Comparison on Enron Collection
[Bar chart: Heuristic vs. Generative on MRR and P@1 (y-axis 0 to 1); annotated gains: +8.9% and +8.6%]
For more details, check “Resolving Personal Names in Email using Context Expansion” at ACL 2008 (also presented next week!)
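MRR (mean reciprocal rank) and P@1 (precision at rank 1) can both be computed from a ranked candidate list per mention. The sketch below uses the standard definitions. The toy rankings are made up, and treating a missing correct answer as reciprocal rank 0 is an assumption rather than something stated in the talk.

```python
def reciprocal_rank(ranked_candidates, gold):
    """1 / rank of the correct candidate, or 0.0 if it is not in the list."""
    for rank, candidate in enumerate(ranked_candidates, start=1):
        if candidate == gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_candidates, gold) pairs."""
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / len(runs)


def precision_at_1(runs):
    """Fraction of mentions whose top-ranked candidate is the correct one."""
    hits = sum(1 for ranked, gold in runs if ranked and ranked[0] == gold)
    return hits / len(runs)


# Made-up rankings for three mentions: correct at rank 1, at rank 2, and missing.
RUNS = [
    (["a", "b", "c"], "a"),
    (["b", "a", "c"], "a"),
    (["b", "c"], "a"),
]

mrr = mean_reciprocal_rank(RUNS)
p_at_1 = precision_at_1(RUNS)
print(mrr, p_at_1)
```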
Limitation!
[Diagram: mentions “Sue” (twice), “Suebob”, “Susan”, “Susan Jones”, “Susan Scott”, and “sjhonson@enron.com” connected by social, topical, and conversational context links]
Context-Free Resolution
Joint Resolution!
Joint Resolution
Mention Graph
Spread Current Resolution
Combine Context Info
Update Resolution
Joint Resolution
Mention Graph
map, shuffle, reduce: MapReduce!
Work in Progress!
Roadmap

1. Context Features
   - Pairwise similarity
   - Efficiency vs. effectiveness
   - Generating features for ACE
2. Conversational-genre Features
   - New generative model
   - Joint resolution
   - Evaluation using ACE-Usenet
Email Message
From: Machiavegli <machia@aol.com>
To: Mark <mk@hotmail>
Date: 29 Jan 2005 22:04:38 GMT
Subject: The 1860 Presidential Election
[Callout on the To: line: receiver is email address]
In 1860 there was a four-way race between the Republican Party with Abraham
Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic
Party with John Breckenridge, and the Constitutional Union Party with John
Bell. Lincoln won a plurality with about 40% of the vote.
WI it was only a two-way race between Lincoln and Douglas? I believe Douglas
would have won.
This would have delayed secession and the Civil War.
Usenet Message
From: Machiavegli <machia@aol.com>
Newsgroup: soc.history.what-if
Date: 29 Jan 2005 22:04:38 GMT
Subject: The 1860 Presidential Election
[Callout on the Newsgroup: line: newsgroup!]
In 1860 there was a four-way race between the Republican Party with Abraham
Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic
Party with John Breckenridge, and the Constitutional Union Party with John
Bell. Lincoln won a plurality with about 40% of the vote.
WI it was only a two-way race between Lincoln and Douglas? I believe Douglas
would have won.
This would have delayed secession and the Civil War.
ACE Usenet Document
<DOCID> soc.history.what-if_20350205910 </DOCID>
<POSTER> Machiavegli </POSTER>
<POSTDATE> 29 Jan 2005 22:04:38 GMT </POSTDATE>
<SUBJECT> The 1860 Presidential Election </SUBJECT>
[Callout: no email addresses in headers!]
In 1860 there was a four-way race between the Republican Party with Abraham
Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic
Party with John Breckenridge, and the Constitutional Union Party with John
Bell. Lincoln won a plurality with about 40% of the vote.
WI it was only a two-way race between Lincoln and Douglas? I believe Douglas
would have won.
This would have delayed secession and the Civil War.
Reconstruct from
From: Machiavegli <machia@aol.com>
Newsgroup: soc.history.what-if
Date: 29 Jan 2005 22:04:38 GMT
Subject: The 1860 Presidential Election
automatically
Got the address back!
In 1860 there was a four-way race between the Republican Party with Abraham
Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic
Party with John Breckenridge, and the Constitutional Union Party with John
Bell. Lincoln won a plurality with about 40% of the vote.
WI it was only a two-way race between Lincoln and Douglas? I believe Douglas
would have won.
This would have delayed secession and the Civil War.
Handling it as @
From: Machiavegli <machia@aol.com>
To: soc.history.what-if@usenet.com
Date: 29 Jan 2005 22:04:38 GMT
Subject: The 1860 Presidential Election
[Callout on the To: line: handle group as receiver]
In 1860 there was a four-way race between the Republican Party with Abraham
Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic
Party with John Breckenridge, and the Constitutional Union Party with John
Bell. Lincoln won a plurality with about 40% of the vote.
WI it was only a two-way race between Lincoln and Douglas? I believe Douglas
would have won.
This would have delayed secession and the Civil War.
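Treating the newsgroup as a receiver amounts to rewriting one header line so that a Usenet post looks like an email message. A minimal sketch of that rewrite follows. The function name is hypothetical, and the `usenet.com` domain is simply the placeholder shown on the slide.

```python
def newsgroup_as_receiver(header_lines, domain="usenet.com"):
    """Rewrite a 'Newsgroup:' header as an email-style 'To:' header.

    All other header lines are passed through unchanged.
    """
    converted = []
    for line in header_lines:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "newsgroup":
            converted.append(f"To: {value.strip()}@{domain}")
        else:
            converted.append(line)
    return converted


usenet_header = [
    "From: Machiavegli <machia@aol.com>",
    "Newsgroup: soc.history.what-if",
    "Date: 29 Jan 2005 22:04:38 GMT",
    "Subject: The 1860 Presidential Election",
]

email_header = newsgroup_as_receiver(usenet_header)
print("\n".join(email_header))
```

After this rewrite, a resolver written for email headers can process the post without a separate Usenet code path.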
Feature Value: same label

Need for feature matrix (pairwise score)
sjhonson@hotmail.com
sjhonson@hotmail.com
“Steph”
“Stephan”
“Stephan”
“S. Smith”
+1.0
Feature Value: different labels

Need for feature matrix (pairwise score)
sjhonson@hotmail.com
smith_s@aol.com
“Steph”
“Stephan”
“Stephan”
“S. Smith”
-1.0
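The two feature values can be read as one rule: a pair of mentions scores +1.0 when both resolve to the same label (email address) and -1.0 when the labels differ. A minimal sketch of building such a pairwise feature matrix follows. The mention ids and their assignment to labels are hypothetical, and the handling of unresolved mentions is left out because the slides do not cover it.

```python
from itertools import combinations


def label_feature(label_a, label_b):
    """+1.0 if two mentions resolve to the same label, -1.0 otherwise."""
    return 1.0 if label_a == label_b else -1.0


def feature_matrix(resolved):
    """Pairwise scores for every unordered pair of resolved mentions."""
    return {
        (a, b): label_feature(resolved[a], resolved[b])
        for a, b in combinations(sorted(resolved), 2)
    }


# Hypothetical mention ids mapped to the labels shown on the slides.
RESOLVED = {
    "m1": "sjhonson@hotmail.com",
    "m2": "sjhonson@hotmail.com",
    "m3": "smith_s@aol.com",
}

matrix = feature_matrix(RESOLVED)
print(matrix)
```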
Conclusion

- MapReduce can be applied to many HLT applications
  - easy, cheap, and fast for distributed processing
  - e.g., scalable pairwise similarity for coreference resolution
  - calls for new ways of thinking
- Identity resolution in email
  - new generative model yields improved accuracy
  - scalable joint resolution needed
- Usenet-ACE is new test collection
Thank You!
MapReduce and Text Analysis







- Computing pairwise similarity in large collections
- Joint resolution of mentions in email collections
- Search engines (of course!)
- Building language models
- Clustering applications
- Machine translation
- …