Probabilistic Equational Reasoning

Arthur Kantor
akantor@uiuc.edu
Outline
• The Problem
• Two solutions
– Generative Model
– Undirected Graphical Model
The problem
• You have m objects, all with some common attributes
– Publications
• You also have n references to those objects.
– Citations of those publications
• The references are ambiguous
– Citations have different formats and may have spelling mistakes
– m may not be known
• How do you know if two references refer to the same
object?
– A common problem in citeseer.nj.nec.com
– natural language processing, database merging, …
The problem
• What object do these references refer to?
– “Powell”
– “she”
– “Mr. Powell”
– References can disambiguate each other
Two Solutions
• Based on probabilistic models
• Objects are unobserved
– number of objects m is not known
• Try to resolve all the references simultaneously
– “she” would not co-reference “Powell” in presence of
“Mr. Powell”
• Solution one:
– Based on relational probabilistic models (RPMs)
• Solution two:
– Based on undirected graphical models
RPM solution
[Pasula et al.]
• System built to identify the papers from various
citations
• Straightforward Bayes rule, intelligently applied
• 4 classes of information
– Author (unobserved)
– Paper (unobserved)
– AuthorAsCited
– Citation
• Probability distributions for each class are given
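Before the schema diagram, a minimal sketch of how these four classes and their foreign keys could be represented. The attribute names (name, title, pub_type, text, and so on) are illustrative assumptions, not the exact schema used by Pasula et al.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Author:                      # unobserved
    name: str

@dataclass
class Paper:                       # unobserved
    title: str
    pub_type: str                  # e.g. "journal", "conference"
    authors: List[Author] = field(default_factory=list)   # foreign keys to Author

@dataclass
class AuthorAsCited:
    author: Author                 # foreign key to the true Author
    name_as_written: str           # possibly abbreviated or misspelled

@dataclass
class Citation:                    # observed
    paper: Paper                   # foreign key to the true Paper
    cited_authors: List[AuthorAsCited]
    cited_title: str               # possibly corrupted copy of paper.title
    text: str                      # the raw citation string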
Thick lines specify the foreign keys
(can also be thought of as random vectors of pointers to objects)
Assume for now that the number of papers and authors is known.
Thin lines represent dependencies for every instance of the class.
Think Generatively:
1) Authors are born, names picked from prior distribution
2) Papers are written, names and publication types picked from a prior
3) Based on papers, Citations are composed (perhaps misspelled)
4) Based on mood, perhaps pubType, a format is chosen for citation
5) Finally, the text is written down
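A rough generative sketch of steps 1–5, assuming the classes sketched earlier plus some made-up priors (the name list, title vocabulary, typo rate, and the two citation formats are all assumptions made purely for illustration):

import random

NAME_PRIOR = ["A. Smith", "B. Jones", "C. Lee"]                 # assumed prior over names
TITLE_WORDS = ["probabilistic", "models", "of", "citations"]    # assumed title vocabulary
PUB_TYPES = ["journal", "conference"]

def corrupt(s, p_typo=0.1):
    # model misspelling: each character is replaced with small probability
    chars = list(s)
    for i in range(len(chars)):
        if random.random() < p_typo:
            chars[i] = random.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

def generate(num_authors=3, num_papers=2, num_citations=4):
    # 1) Authors are born, names picked from a prior distribution
    authors = [Author(random.choice(NAME_PRIOR)) for _ in range(num_authors)]
    # 2) Papers are written, titles and publication types picked from a prior
    papers = [Paper(" ".join(random.sample(TITLE_WORDS, 3)),
                    random.choice(PUB_TYPES),
                    random.sample(authors, 2))
              for _ in range(num_papers)]
    citations = []
    for _ in range(num_citations):
        paper = random.choice(papers)
        # 3) Based on papers, citations are composed (perhaps misspelled)
        cited_authors = [AuthorAsCited(a, corrupt(a.name)) for a in paper.authors]
        cited_title = corrupt(paper.title)
        # 4) A citation format is chosen (here: one of two fixed templates)
        author_str = ", ".join(c.name_as_written for c in cited_authors)
        if random.random() < 0.5:
            text = f"{author_str}. {cited_title}."          # 5) the text is written down
        else:
            text = f"{cited_title} / {author_str}"
        citations.append(Citation(paper, cited_authors, cited_title, text))
    return authors, papers, citations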
We now have P(text | all the stuff that happened)
So what could have happened?
P(what happened | text)
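Written out, this inversion is just Bayes' rule applied to the generative model:

P(\text{what happened} \mid \text{text}) \;=\; \frac{P(\text{text} \mid \text{what happened})\, P(\text{what happened})}{P(\text{text})} \;\propto\; P(\text{text} \mid \text{what happened})\, P(\text{what happened})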
Consider picking a paper from the library of n papers, writing down the citation,
putting it back on the shelf, and repeating the process once more.
You now have two citations, c1 and c2.
Consider two hypotheses:
H1: c1.paper = c2.paper
H2: c1.paper ≠ c2.paper
What’s more likely?
P(y_1, y_2) \;=\; p^2\left[\,\frac{n-1}{n}\Big(\sum_{x_1} P_X(x_1)\,P_Y(y_1 \mid x_1)\Big)\Big(\sum_{x_2} P_X(x_2)\,P_Y(y_2 \mid x_2)\Big) \;+\; \frac{1}{n}\sum_{x_1} P_X(x_1)\,P_Y(y_1 \mid x_1)\,P_Y(y_2 \mid x_1)\right]

(x ranges over the hidden true attributes of the drawn paper and y_1, y_2 are the observed citation texts; the first term is the contribution of H2, drawing two different papers, the second term is the contribution of H1, drawing the same paper twice, and p^2 is a factor common to both hypotheses)
What’s more likely?
H1: c1.paper = c2.paper
H2: c1.paper ≠ c2.paper
P(y_1, y_2) \;=\; p^2\left[\,\frac{n-1}{n}\Big(\sum_{x_1} P_X(x_1)\,P_Y(y_1 \mid x_1)\Big)\Big(\sum_{x_2} P_X(x_2)\,P_Y(y_2 \mid x_2)\Big) \;+\; \frac{1}{n}\sum_{x_1} P_X(x_1)\,P_Y(y_1 \mid x_1)\,P_Y(y_2 \mid x_1)\right]
• Depends on y1 and y2
– If it is probable that both y1 and y2 were
copied down correctly, and y1 and y2 differ
significantly, the cause for the difference was
probably that the papers were in fact different
• P(H2|text)>P(H1|text)
• H2 is what happened
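A toy numerical version of this comparison, in the spirit of the formula above. The library size, the space of true titles, and the corruption model P_Y are all made-up assumptions for illustration; the point is only that identical observed strings push the posterior odds toward H1, while clearly different strings push them toward H2.

n = 1000                                      # assumed number of papers in the library
titles = [f"title_{i}" for i in range(5000)]  # assumed space of true title strings

def P_X(x):
    # assumed uniform prior over true titles
    return 1.0 / len(titles)

def P_Y(y, x):
    # assumed corruption model: usually copied correctly, rarely any other string
    return 0.9 if y == x else 0.1 / (len(titles) - 1)

def marginal(y):
    # probability of observing y from one randomly drawn paper
    return sum(P_X(x) * P_Y(y, x) for x in titles)

def joint_same(y1, y2):
    # probability of observing y1 and y2 from the *same* paper
    return sum(P_X(x) * P_Y(y1, x) * P_Y(y2, x) for x in titles)

def odds_same_paper(y1, y2):
    # P(H1 | y1, y2) / P(H2 | y1, y2); prior 1/n for H1 and (n-1)/n for H2
    h1 = (1.0 / n) * joint_same(y1, y2)
    h2 = ((n - 1.0) / n) * marginal(y1) * marginal(y2)
    return h1 / h2

print(odds_same_paper("title_0", "title_0"))  # identical strings: odds > 1, favors H1
print(odds_same_paper("title_0", "title_1"))  # different strings: odds << 1, favors H2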
Concerns
• Unknown number of papers, authors
– Condition everything on the number of
papers, authors
• The probability space is ridiculously huge.
We cannot possibly sum over it.
• MCMC sampling on the number of objects
• name/title corruption is symmetric: instead of
corrupting the title, corrupt the cited title.
• Sum directly over small-range attributes, like
doctype.
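A minimal Metropolis sketch of the MCMC idea: the state is an assignment of citations to papers (so the number of distinct papers is implicit and can change), and a proposal moves one citation to an existing or to a brand-new paper. The function log_joint is an assumed stand-in for the log-probability of the full generative model and is not implemented here; the Hastings correction for the slightly asymmetric proposal is also omitted to keep the sketch short.

import math
import random

def mcmc_step(assignment, log_joint):
    # assignment: list mapping citation index -> paper id (ints)
    # log_joint:  assumed callable returning log P(assignment, observed citations)
    proposal = list(assignment)
    i = random.randrange(len(proposal))
    papers = sorted(set(proposal))
    new_paper = max(papers) + 1
    # move citation i to an existing paper or to a brand-new one,
    # which is how the number of papers can grow or shrink
    proposal[i] = random.choice(papers + [new_paper])
    log_alpha = log_joint(proposal) - log_joint(assignment)
    if random.random() < math.exp(min(0.0, log_alpha)):
        return proposal            # accept
    return assignment              # reject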
Performance
• Works pretty well
– Depends greatly on the quality of the
generative model
Outline
• The Problem
• Two solutions
– Generative Model
– Undirected Graphical Model
• This model gives more flexibility in specifying
features than the previous one
– No need to specify per-class dependencies
Undirected Graphical Model
• Objects are implicit – we deal only with
references
• Given:
– References x1 … xi … xn
– Yij are binary random variables
• Yij=1 iff xi co-references xj
– fl (xi,xj,yij) are feature or potential functions
• measure a particular facet of similarity between xi and xj
• have the property fl(xi, xj, 1) = −fl(xi, xj, 0)
• are non-zero only if xi and xj are related
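A small sketch of what such feature functions might look like; the specific string tests are illustrative assumptions, but both functions obey the sign-flip property fl(xi, xj, 1) = −fl(xi, xj, 0) and are zero when the two references are unrelated.

def f_exact_match(xi, xj, yij):
    # fires when the two mention strings are identical
    if xi != xj:
        return 0.0
    return 1.0 if yij == 1 else -1.0

def f_same_last_token(xi, xj, yij):
    # fires when the two mention strings share their last token ("Powell" / "Mr. Powell")
    if xi.split()[-1] != xj.split()[-1]:
        return 0.0
    return 1.0 if yij == 1 else -1.0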
Objective function
• Maximize

P(\mathbf{Y} = \mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z_x}\,\exp\!\left(\sum_{i,j,l} \lambda_l\, f_l(x_i, x_j, y_{ij}) \;+\; \sum_{i,j,k,l'} \lambda_{l'}\, f_{l'}(y_{ij}, y_{jk}, y_{ki})\right)
The first term is biggest if all the similar (xi, xj) are connected and all the opposite (xi, xj) are separated.
The second term becomes −∞ if xi, xj, xk are not all consistently linked (i.e. the yij's violate transitivity), to prevent inconsistencies.
Notational trick – the implementation simply doesn't allow non-clique configurations.
Objective function

P(\mathbf{Y} = \mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z_x}\,\exp\!\left(\sum_{i,j,l} \lambda_l\, f_l(x_i, x_j, y_{ij}) \;+\; \sum_{i,j,k,l'} \lambda_{l'}\, f_{l'}(y_{ij}, y_{jk}, y_{ki})\right)
• The λl are learned by maximum likelihood over the training data
• The function is concave, so we can use our
favorite learning algorithm (e.g. stochastic
gradient ascent)
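Because the model is log-linear, the gradient of the conditional log-likelihood with respect to each weight is the usual difference between empirical and expected feature counts (written here for the pairwise features; the λl′ term is analogous). The expectation requires summing over all consistent assignments, which is what makes inference the hard part:

\frac{\partial \log P(\mathbf{y} \mid \mathbf{x})}{\partial \lambda_l} \;=\; \sum_{i,j} f_l(x_i, x_j, y_{ij}) \;-\; \mathbb{E}_{\mathbf{y}' \sim P(\cdot \mid \mathbf{x})}\!\left[\sum_{i,j} f_l(x_i, x_j, y'_{ij})\right]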
Graph partitioning
• Maximizing P(Y = y | x) is equivalent to finding an optimal graph partitioning of a complete graph
• The nodes are the xi's
• The edge between (xi, xj) carries the log-potential applied to that pair of references:

\sum_l \lambda_l\, f_l(x_i, x_j, y_{ij})
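A small sketch of this reduction: given learned weights λl and the feature functions, build the weighted complete graph whose edge weights are Σl λl fl(xi, xj, 1). A positive weight says "prefer the same cluster", a negative weight says "prefer different clusters". The function and argument names are illustrative.

from itertools import combinations

def build_partition_graph(references, features, lambdas):
    # returns {(i, j): w_ij} with w_ij = sum_l lambda_l * f_l(x_i, x_j, y_ij = 1)
    # references: list of mention/citation strings x_i
    # features:   list of feature functions f_l(x_i, x_j, y_ij)
    # lambdas:    learned weights, one per feature function
    edges = {}
    for i, j in combinations(range(len(references)), 2):
        xi, xj = references[i], references[j]
        edges[(i, j)] = sum(lam * f(xi, xj, 1) for lam, f in zip(lambdas, features))
    return edges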
Correlation Clustering
Graph G = (V, E), with edges labeled + or −.
[Figure: example graph on references X1 … X5 with + and − edges]
Partition V into clusters s.t.
+ edges are within clusters and
− edges are across clusters.
No bound on # of clusters.
Agreements and Disagreements
[Figure: candidate clusterings of the example graph on X1 … X5]
Agreements: + edges inside clusters AND − edges outside clusters.
Disagreements (mistakes): + edges outside clusters AND − edges inside clusters.
Can either maximize agreements OR minimize disagreements.
Choosing a different partitioning is possible, but could be worse, since more disagreements are introduced.
Partitions must be cliques (introduces a bias towards small partitions?).
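A simple greedy sketch of correlation clustering on such a weighted graph (purely illustrative; this is not the approximation algorithm discussed on the Complexity slide): repeatedly merge the pair of clusters whose connecting edges have the largest positive total weight, and stop when no merge helps.

def greedy_correlation_clustering(num_nodes, edges):
    # edges: {(i, j): w_ij} with i < j; positive weight ~ '+' edge, negative ~ '-' edge
    clusters = [{i} for i in range(num_nodes)]

    def cross_weight(a, b):
        return sum(edges.get((min(i, j), max(i, j)), 0.0) for i in a for j in b)

    while True:
        best, best_pair = 0.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                w = cross_weight(clusters[a], clusters[b])
                if w > best:
                    best, best_pair = w, (a, b)
        if best_pair is None:
            return clusters                 # no merge increases within-cluster weight
        a, b = best_pair
        clusters[a] |= clusters[b]
        del clusters[b]

Feeding it the output of build_partition_graph above, greedy_correlation_clustering(len(references), edges) returns one candidate partition; the number of clusters it ends up with is the inferred number of objects, which connects to the observation on the next slide.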
A few observations
• Number of objects is determined
automatically
– Corresponds to the number of cliques
• Metrics are defined pairwise, but decision
to join a clique involves all references
• If we force two cliques, the problem is
equivalent to a single simulated annealing
pass
Cross-citation disambiguation
Complexity
• The probabilistic inference problem reduces to a clustering problem that cannot be constant-factor-approximated in polynomial time
– We can reduce a 3-SAT problem to it
• Correlation clustering algorithms exist that guarantee relative error less than (1 − ε) in polynomial time for complete graphs (such as ours)
• Yet we are solving a probabilistic inference problem with a faster algorithm. How?
– We have a simpler subclass of probability distributions: they are log-linear
– It probably boils down to an integer-programming problem, since not all assignments of y are allowed
Proper Noun co-reference
fl's used: [table of feature functions shown in the original slide]
Proper Nouns performance
• Tested on
– 30 news wire articles
– 117 stories from the broadcast news portion of DARPA's ACE set
– Hand-annotated nouns (non-proper nouns ignored)
– Identical feature functions on all three sets!
– 5-fold validation
– Only 60% accuracy if non-proper nouns are included