Probabilistic Equational Reasoning
Arthur Kantor
akantor@uiuc.edu

Outline
• The Problem
• Two solutions
  – Generative Model
  – Undirected Graphical Model

The problem
• You have m objects, all with some common attributes
  – Publications
• You also have n references to those objects
  – Citations of those publications
• The references are ambiguous
  – Citations have different formats and may contain spelling mistakes
  – m may not be known
• How do you know whether two references refer to the same object?
  – A common problem at citeseer.nj.nec.com
  – Also arises in natural language processing, database merging, …

The problem
• What object do these references refer to?
  – "Powell"
  – "she"
  – "Mr. Powell"
• References can disambiguate each other

Two Solutions
• Both are based on probabilistic models
• Objects are unobserved
  – The number of objects m is not known
• Try to resolve all the references simultaneously
  – "she" would not co-reference "Powell" in the presence of "Mr. Powell"
• Solution one:
  – Based on relational probabilistic models (RPMs)
• Solution two:
  – Based on undirected graphical models

RPM solution [Pasula et al.]
• A system built to identify papers from various citations
• Straightforward Bayes rule, intelligently applied
• Four classes of information:
  – Author (unobserved)
  – Paper (unobserved)
  – AuthorAsCited
  – Citation
• Probability distributions for each class are given

[Class diagram] Thick lines specify the foreign keys (they can also be thought of as random vectors of pointers to objects). Thin lines represent dependencies that hold for every instance of the class.

Assume for now that the number of papers and authors is known. Think generatively:
1) Authors are born; their names are picked from a prior distribution
2) Papers are written; their titles and publication types are picked from a prior
3) Based on the papers, citations are composed (perhaps misspelled)
4) Based on mood, and perhaps on pubType, a format is chosen for each citation
5) Finally, the citation text is written down

We now have P(text | all the stuff that happened).
So what could have happened? We want P(what happened | text).

Consider picking a paper from a library of n papers, writing down a citation, putting the paper back on the shelf, and repeating the process once more. You now have two citations, c1 and c2. Consider two hypotheses:

  H1: c1.paper = c2.paper
  H2: c1.paper ≠ c2.paper

Which is more likely? With a uniform prior P_X(x) = 1/n over the n papers and a citation model P_Y(y | x) for producing citation text y from paper x:

  P(H_1, y_1, y_2) = \sum_{x_1} P_X(x_1)^2 \, P_Y(y_1 \mid x_1) \, P_Y(y_2 \mid x_1)

  P(H_2, y_1, y_2) = \sum_{x_1} \sum_{x_2 \neq x_1} P_X(x_1) \, P_Y(y_1 \mid x_1) \, P_X(x_2) \, P_Y(y_2 \mid x_2)

• The comparison depends on y_1 and y_2
  – If it is probable that both y_1 and y_2 were copied down correctly, yet y_1 and y_2 differ significantly, then the cause of the difference was probably that the papers were in fact different
    • P(H2 | text) > P(H1 | text)
    • H2 is what happened
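As a concrete illustration of this hypothesis comparison, here is a minimal numerical sketch; it is not the model of Pasula et al. It assumes a uniform prior over a toy library of n paper titles and a simple per-character corruption model for P_Y(y | x); the example titles, `CORRECT_CHAR_PROB`, and all function names are illustrative.

```python
from itertools import product

# Toy library: the n "true" paper titles (illustrative only).
library = ["markov chain monte carlo",
           "relational probabilistic models",
           "graph partitioning"]
n = len(library)

CORRECT_CHAR_PROB = 0.95          # assumed chance a character is copied correctly
TYPO_PROB = 1 - CORRECT_CHAR_PROB

def citation_likelihood(cited, true):
    """P_Y(y | x): toy per-character corruption model."""
    length = max(len(cited), len(true))
    p = 1.0
    for i in range(length):
        a = cited[i] if i < len(cited) else None
        b = true[i] if i < len(true) else None
        p *= CORRECT_CHAR_PROB if a == b else TYPO_PROB
    return p

def hypothesis_probabilities(y1, y2):
    """P(H1, y1, y2) and P(H2, y1, y2): two uniform picks with replacement."""
    p_x = 1.0 / n
    h1 = sum(p_x * p_x * citation_likelihood(y1, x) * citation_likelihood(y2, x)
             for x in library)
    h2 = sum(p_x * citation_likelihood(y1, x1) * p_x * citation_likelihood(y2, x2)
             for x1, x2 in product(library, repeat=2) if x1 != x2)
    return h1, h2

# Two citations that differ by a single typo: H1 (same paper) should win.
h1, h2 = hypothesis_probabilities("markov chain monte carlo", "markov chain monte carlp")
print("same paper more likely" if h1 > h2 else "different papers more likely")

# Two citations of clearly different titles: H2 (different papers) should win.
h1, h2 = hypothesis_probabilities("markov chain monte carlo", "graph partitioning")
print("same paper more likely" if h1 > h2 else "different papers more likely")
```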
Concerns
• The number of papers and authors is unknown
  – Condition everything on the number of papers and authors
• The probability space is ridiculously huge; we cannot possibly sum over it
  – Use MCMC sampling over the number of objects
  – Name/title corruption is symmetric: instead of corrupting the true title, corrupt the cited title
  – Sum directly over small-range attributes, like doctype

Performance
• Works pretty well
  – Depends greatly on the quality of the generative model

Outline
• The Problem
• Two solutions
  – Generative Model
  – Undirected Graphical Model
• This model gives more flexibility in specifying features than the previous one
  – No need to specify per-class dependencies

Undirected Graphical Model
• Objects are implicit; we deal only with references
• Given:
  – References x_1 … x_i … x_n
  – Y_ij are binary random variables
    • Y_ij = 1 iff x_i co-references x_j
  – f_l(x_i, x_j, y_ij) are feature (potential) functions
    • Each measures a particular facet of similarity between x_i and x_j
    • They have the property f_l(x_i, x_j, 1) = −f_l(x_i, x_j, 0)
    • They are non-zero if x_i and x_j are related

Objective function
• Maximize

  P(Y = y \mid x) = \frac{1}{Z_x} \exp\Big( \sum_{i,j,l} \lambda_l \, f_l(x_i, x_j, y_{ij}) + \sum_{i,j,k,l'} \lambda_{l'} \, f_{l'}(y_{ij}, y_{jk}, y_{ki}) \Big)

• The first sum is biggest when all the similar pairs (x_i, x_j) are connected and all the dissimilar pairs (x_i, x_j) are separated
• The second sum becomes −∞ if y_ij, y_jk, y_ki are inconsistent (x_i linked to x_j and x_j to x_k, but x_i not linked to x_k), to prevent inconsistencies
  – This is a notational trick; the implementation simply doesn't allow non-clique configurations
• The λ_l are learned by maximum likelihood over the training data
• The log-likelihood is concave in the λ's, so we can use our favorite learning algorithm (e.g. stochastic gradient ascent)

Graph partitioning
• Maximizing P(Y = y | x) is equivalent to finding an optimal partitioning of a complete graph
• The nodes are the x_i's
• The edge between x_i and x_j is weighted by the log-potential applied to that pair of references:

  w_{ij} = \sum_l \lambda_l \, f_l(x_i, x_j, y_{ij})

Correlation Clustering
• Graph G = (V, E); edges are labeled + or −
• Partition V into clusters such that
  – + edges fall within clusters
  – − edges fall across clusters
• No bound on the number of clusters
[Figure: five references X1–X5 joined by edges labeled + or −]

Agreements and Disagreements
• Agreements: + edges inside clusters AND − edges outside clusters
• Disagreements (mistakes): + edges outside clusters AND − edges inside clusters
• We can either maximize agreements OR minimize disagreements
• Choosing a different partitioning is possible, but could be worse, since more disagreements are introduced
• Partitions must be cliques (does this introduce a bias towards small partitions?)

A few observations
• The number of objects is determined automatically
  – It corresponds to the number of cliques
• The metrics are defined pairwise, but the decision to join a clique involves all the references
• If we force two cliques, the problem is equivalent to a single simulated annealing pass

Cross-citation disambiguation
[Figure]

Complexity
• We have reduced a probabilistic inference problem to a clustering problem that cannot be approximated to within a constant factor in polynomial time (a 3-SAT problem can be reduced to it)
• Correlation clustering algorithms exist that guarantee a (1 − ε) approximation in polynomial time for complete graphs (such as ours)
• Yet we are solving probabilistic inference with a faster algorithm. How?
  – We have a simpler subclass of probability distributions: they are log-linear
  – It probably boils down to an integer-programming problem, since not all assignments of y are allowed
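To make the partitioning view concrete, here is a minimal sketch; it is not the inference algorithm used in the paper. It assumes the pairwise edge weights w_ij = Σ_l λ_l f_l(x_i, x_j, 1) have already been computed, scores a partition by summing w_ij within clusters and −w_ij across clusters (i.e. it maximizes agreements), and finds the best partition of a tiny reference set by exhaustive search. The example weights and names such as `edge_weight` are illustrative.

```python
from itertools import combinations

def partitions(items):
    """Generate all partitions of a list (fine for a handful of references)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # put `first` into each existing cluster in turn ...
        for i, cluster in enumerate(smaller):
            yield smaller[:i] + [cluster + [first]] + smaller[i + 1:]
        # ... or into a cluster of its own
        yield [[first]] + smaller

def partition_score(partition, weight):
    """Sum of w_ij for within-cluster pairs minus w_ij for across-cluster pairs.
    Maximizing this score is the correlation-clustering view of maximizing P(Y=y|x)."""
    cluster_of = {x: c for c, cluster in enumerate(partition) for x in cluster}
    score = 0.0
    for xi, xj in combinations(cluster_of, 2):
        w = weight(xi, xj)
        score += w if cluster_of[xi] == cluster_of[xj] else -w
    return score

# Illustrative edge weights w_ij: positive means the pair looks coreferent,
# negative means it does not.
def edge_weight(a, b):
    weights = {
        frozenset(("Powell", "Mr. Powell")): +2.0,
        frozenset(("Powell", "she")): -1.5,
        frozenset(("Mr. Powell", "she")): -2.5,
    }
    return weights[frozenset((a, b))]

refs = ["Powell", "Mr. Powell", "she"]
best = max(partitions(refs), key=lambda p: partition_score(p, edge_weight))
print(best)   # expected: "Powell" and "Mr. Powell" clustered together, "she" alone
```

Exhaustive search is only for illustration; the number of partitions grows as the Bell numbers, which is exactly why the approximate correlation-clustering algorithms mentioned above matter.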
Proper Noun co-reference
• The feature functions f_l used for proper-noun mentions: [table of pairwise features; an illustrative sketch appears at the end]

Proper Nouns performance
• Tested on:
  – newswire articles
  – 117 stories from the broadcast news portion of DARPA's ACE set
  – hand-annotated nouns (non-proper nouns ignored)
• Identical feature functions on all three sets!
• 5-fold cross-validation
• Only 60% accuracy if proper nouns are included
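As a concrete illustration of the pairwise feature functions f_l referred to above, here is a minimal sketch. The specific features (exact string match, substring containment, head-word match) and the hand-set λ values are illustrative assumptions, not the features actually used; the sketch mainly shows the stated property f_l(x_i, x_j, 1) = −f_l(x_i, x_j, 0) and how features combine into the edge weights used for graph partitioning.

```python
def signed(value, y):
    """Enforce f_l(xi, xj, 1) == -f_l(xi, xj, 0): the same evidence either
    rewards linking the pair (y=1) or penalizes linking it (y=0)."""
    return value if y == 1 else -value

# Candidate pairwise features for proper-noun mentions (illustrative only).
def f_exact_match(xi, xj, y):
    return signed(1.0 if xi.lower() == xj.lower() else 0.0, y)

def f_substring(xi, xj, y):
    a, b = xi.lower(), xj.lower()
    return signed(1.0 if (a in b or b in a) and a != b else 0.0, y)

def f_head_word_match(xi, xj, y):
    # crude "head word": the last token of each mention
    return signed(1.0 if xi.split()[-1].lower() == xj.split()[-1].lower() else 0.0, y)

FEATURES = [f_exact_match, f_substring, f_head_word_match]

def edge_weight(xi, xj, lambdas):
    """w_ij = sum_l lambda_l * f_l(xi, xj, 1): the edge weight used in the
    graph-partitioning view above."""
    return sum(lam * f(xi, xj, 1) for lam, f in zip(lambdas, FEATURES))

# Hand-set weights for illustration; in the model they are learned by maximum likelihood.
lambdas = [3.0, 1.5, 2.0]
print(edge_weight("Powell", "Mr. Powell", lambdas))   # positive: the pair looks coreferent
print(edge_weight("Powell", "Albright", lambdas))     # 0.0: no evidence to link the pair
```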