Probabilistic Equational Reasoning

Arthur Kantor
akantor@uiuc.edu
Outline
• The Problem
• Two solutions
– Generative Model
– Undirected Graphical Model
The problem
• You have m objects, all with some common attributes
– Publications
• You also have n references to those objects.
– Citations of those publications
• The references are ambiguous
– Citations have different formats and may have spelling mistakes
– m may not be known
• How do you know if two references refer to the same
object?
– A common problem in citeseer.nj.nec.com
– natural language processing, database merging, …
The problem
• What object do these references refer to?
– “Powell”
– “she”
– “Mr. Powell”
– References can disambiguate each other
Two Solutions
• Based on probabilistic models
• Objects are unobserved
– number of objects m is not known
• Try to resolve all the references simultaneously
– “she” would not co-reference “Powell” in presence of
“Mr. Powell”
• Solution one:
– Based on relational probabilistic models (RPMs)
• Solution two:
– Based on undirected graphical models
RPM solution
[Pasula et al.]
• System built to identify the papers from various
citations
• Straightforward Bayes rule, intelligently applied
• 4 classes of information
– Author (unobserved)
– Paper (unobserved)
– AuthorAsCited
– Citation
• Probability distributions for each class are given
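Before the schema diagram, a minimal sketch of how these four classes and their foreign keys could be represented. The attribute names (name, title, pub_type, text, and so on) are illustrative assumptions, not the exact schema used by Pasula et al.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Author:                      # unobserved
    name: str

@dataclass
class Paper:                       # unobserved
    title: str
    pub_type: str                  # e.g. "journal", "conference"
    authors: List[Author] = field(default_factory=list)   # foreign keys to Author

@dataclass
class AuthorAsCited:
    author: Author                 # foreign key to the true Author
    name_as_written: str           # possibly abbreviated or misspelled

@dataclass
class Citation:                    # observed
    paper: Paper                   # foreign key to the true Paper
    cited_authors: List[AuthorAsCited]
    cited_title: str               # possibly corrupted copy of paper.title
    text: str                      # the raw citation string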
Thick lines specify the foreign keys
(can also be thought of as random vectors of pointers to objects)
Assume for now that the number of papers and authors is known.
Thin lines represent dependencies for every instance of the class.
Think Generatively:
1) Authors are born, names picked from prior distribution
2) Papers are written, names and publication types picked from a prior
3) Based on papers, Citations are composed (perhaps misspelled)
4) Based on mood, perhaps pubType, a format is chosen for citation
5) Finally, the text is written down
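A rough generative sketch of steps 1–5, assuming the classes sketched earlier plus some made-up priors (the name list, title vocabulary, typo rate, and the two citation formats are all assumptions made purely for illustration):

import random

NAME_PRIOR = ["A. Smith", "B. Jones", "C. Lee"]                 # assumed prior over names
TITLE_WORDS = ["probabilistic", "models", "of", "citations"]    # assumed title vocabulary
PUB_TYPES = ["journal", "conference"]

def corrupt(s, p_typo=0.1):
    # model misspelling: each character is replaced with small probability
    chars = list(s)
    for i in range(len(chars)):
        if random.random() < p_typo:
            chars[i] = random.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

def generate(num_authors=3, num_papers=2, num_citations=4):
    # 1) Authors are born, names picked from a prior distribution
    authors = [Author(random.choice(NAME_PRIOR)) for _ in range(num_authors)]
    # 2) Papers are written, titles and publication types picked from a prior
    papers = [Paper(" ".join(random.sample(TITLE_WORDS, 3)),
                    random.choice(PUB_TYPES),
                    random.sample(authors, 2))
              for _ in range(num_papers)]
    citations = []
    for _ in range(num_citations):
        paper = random.choice(papers)
        # 3) Based on papers, citations are composed (perhaps misspelled)
        cited_authors = [AuthorAsCited(a, corrupt(a.name)) for a in paper.authors]
        cited_title = corrupt(paper.title)
        # 4) A citation format is chosen (here: one of two fixed templates)
        author_str = ", ".join(c.name_as_written for c in cited_authors)
        if random.random() < 0.5:
            text = f"{author_str}. {cited_title}."          # 5) the text is written down
        else:
            text = f"{cited_title} / {author_str}"
        citations.append(Citation(paper, cited_authors, cited_title, text))
    return authors, papers, citations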
We now have P(text | all the stuff that happened)
So what could have happened?
P(what happened | text)
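Written out, this inversion is just Bayes' rule applied to the generative model:

P(\text{what happened} \mid \text{text}) \;=\; \frac{P(\text{text} \mid \text{what happened})\, P(\text{what happened})}{P(\text{text})} \;\propto\; P(\text{text} \mid \text{what happened})\, P(\text{what happened})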
Consider picking a paper from the library of n papers, writing down the citation,
putting it back on the shelf, and repeating the process once more.
You now have two citations, c1 and c2.
Consider two hypotheses:
H1: c1.paper = c2.paper
H2: c1.paper ≠ c2.paper
What’s more likely?
P(y_1, y_2) \;=\; p^2\left[\,\frac{n-1}{n}\Big(\sum_{x_1} P_X(x_1)\,P_Y(y_1 \mid x_1)\Big)\Big(\sum_{x_2} P_X(x_2)\,P_Y(y_2 \mid x_2)\Big) \;+\; \frac{1}{n}\sum_{x_1} P_X(x_1)\,P_Y(y_1 \mid x_1)\,P_Y(y_2 \mid x_1)\right]

(x ranges over the hidden true attributes of the drawn paper and y_1, y_2 are the observed citation texts; the first term is the contribution of H2, drawing two different papers, the second term is the contribution of H1, drawing the same paper twice, and p^2 is a factor common to both hypotheses)
What’s more likely?
H1: c1.paper = c2.paper
H2: c1.paper ≠ c2.paper
P(y_1, y_2) \;=\; p^2\left[\,\frac{n-1}{n}\Big(\sum_{x_1} P_X(x_1)\,P_Y(y_1 \mid x_1)\Big)\Big(\sum_{x_2} P_X(x_2)\,P_Y(y_2 \mid x_2)\Big) \;+\; \frac{1}{n}\sum_{x_1} P_X(x_1)\,P_Y(y_1 \mid x_1)\,P_Y(y_2 \mid x_1)\right]
• Depends on y1 and y2
– If it is probable that both y1 and y2 were
copied down correctly, and y1 and y2 differ
significantly, the cause for the difference was
probably that the papers were in fact different
• P(H2|text)>P(H1|text)
• H2 is what happened
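A toy numerical version of this comparison, in the spirit of the formula above. The library size, the space of true titles, and the corruption model P_Y are all made-up assumptions for illustration; the point is only that identical observed strings push the posterior odds toward H1, while clearly different strings push them toward H2.

n = 1000                                      # assumed number of papers in the library
titles = [f"title_{i}" for i in range(5000)]  # assumed space of true title strings

def P_X(x):
    # assumed uniform prior over true titles
    return 1.0 / len(titles)

def P_Y(y, x):
    # assumed corruption model: usually copied correctly, rarely any other string
    return 0.9 if y == x else 0.1 / (len(titles) - 1)

def marginal(y):
    # probability of observing y from one randomly drawn paper
    return sum(P_X(x) * P_Y(y, x) for x in titles)

def joint_same(y1, y2):
    # probability of observing y1 and y2 from the *same* paper
    return sum(P_X(x) * P_Y(y1, x) * P_Y(y2, x) for x in titles)

def odds_same_paper(y1, y2):
    # P(H1 | y1, y2) / P(H2 | y1, y2); prior 1/n for H1 and (n-1)/n for H2
    h1 = (1.0 / n) * joint_same(y1, y2)
    h2 = ((n - 1.0) / n) * marginal(y1) * marginal(y2)
    return h1 / h2

print(odds_same_paper("title_0", "title_0"))  # identical strings: odds > 1, favors H1
print(odds_same_paper("title_0", "title_1"))  # different strings: odds << 1, favors H2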
Concerns
• Unknown number of papers, authors
– Condition everything on the number of
papers, authors
• The probability space is ridiculously huge.
We cannot possibly sum over it.
• MCMC sampling on the number of objects
• name/title corruption is symmetric: instead of
corrupting the title, corrupt the cited title.
• Sum directly over small-range attributes, like
doctype.
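A minimal Metropolis sketch of the MCMC idea: the state is an assignment of citations to papers (so the number of distinct papers is implicit and can change), and a proposal moves one citation to an existing or to a brand-new paper. The function log_joint is an assumed stand-in for the log-probability of the full generative model and is not implemented here; the Hastings correction for the slightly asymmetric proposal is also omitted to keep the sketch short.

import math
import random

def mcmc_step(assignment, log_joint):
    # assignment: list mapping citation index -> paper id (ints)
    # log_joint:  assumed callable returning log P(assignment, observed citations)
    proposal = list(assignment)
    i = random.randrange(len(proposal))
    papers = sorted(set(proposal))
    new_paper = max(papers) + 1
    # move citation i to an existing paper or to a brand-new one,
    # which is how the number of papers can grow or shrink
    proposal[i] = random.choice(papers + [new_paper])
    log_alpha = log_joint(proposal) - log_joint(assignment)
    if random.random() < math.exp(min(0.0, log_alpha)):
        return proposal            # accept
    return assignment              # reject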
Performance
• Works pretty well
– Depends greatly on the quality of the
generative model
Outline
• The Problem
• Two solutions
– Generative Model
– Undirected Graphical Model
• This model gives more flexibility in specifying
features than the previous one
– No need to specify per-class dependencies
Undirected Graphical Model
• Objects are implicit – we deal only with
references
• Given:
– References x1 … xi … xn
– Yij are binary random variables
• Yij=1 iff xi co-references xj
– fl (xi,xj,yij) are feature or potential functions
• measure a particular facet of similarity between xi and xj
• have the property fl(xi, xj, 1) = −fl(xi, xj, 0)
• are non-zero only if xi and xj are related
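A small sketch of what such feature functions might look like; the specific string tests are illustrative assumptions, but both functions obey the sign-flip property fl(xi, xj, 1) = −fl(xi, xj, 0) and are zero when the two references are unrelated.

def f_exact_match(xi, xj, yij):
    # fires when the two mention strings are identical
    if xi != xj:
        return 0.0
    return 1.0 if yij == 1 else -1.0

def f_same_last_token(xi, xj, yij):
    # fires when the two mention strings share their last token ("Powell" / "Mr. Powell")
    if xi.split()[-1] != xj.split()[-1]:
        return 0.0
    return 1.0 if yij == 1 else -1.0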
Objective function
• Maximize

P(\mathbf{Y} = \mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z_x}\,\exp\!\left(\sum_{i,j,l} \lambda_l\, f_l(x_i, x_j, y_{ij}) \;+\; \sum_{i,j,k,l'} \lambda_{l'}\, f_{l'}(y_{ij}, y_{jk}, y_{ki})\right)
The first term is biggest if all the similar (xi, xj) are connected and all the opposite (xi, xj) are separated.
The second term becomes −∞ if xi, xj, xk are not all consistently linked (i.e. the yij's violate transitivity), to prevent inconsistencies.
Notational trick – the implementation simply doesn't allow non-clique configurations.
Objective function

P(\mathbf{Y} = \mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z_x}\,\exp\!\left(\sum_{i,j,l} \lambda_l\, f_l(x_i, x_j, y_{ij}) \;+\; \sum_{i,j,k,l'} \lambda_{l'}\, f_{l'}(y_{ij}, y_{jk}, y_{ki})\right)
• The λl are learned by maximum likelihood over the training data
• The function is concave, so we can use our
favorite learning algorithm (e.g. stochastic
gradient ascent)
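Because the model is log-linear, the gradient of the conditional log-likelihood with respect to each weight is the usual difference between empirical and expected feature counts (written here for the pairwise features; the λl′ term is analogous). The expectation requires summing over all consistent assignments, which is what makes inference the hard part:

\frac{\partial \log P(\mathbf{y} \mid \mathbf{x})}{\partial \lambda_l} \;=\; \sum_{i,j} f_l(x_i, x_j, y_{ij}) \;-\; \mathbb{E}_{\mathbf{y}' \sim P(\cdot \mid \mathbf{x})}\!\left[\sum_{i,j} f_l(x_i, x_j, y'_{ij})\right]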
Graph partitioning
• Maximizing P(Y = y | x) is equivalent to finding an optimal graph partitioning of a complete graph
• The nodes are the xi's
• The edge between (xi, xj) carries the log-potential applied to that pair of references:

\sum_l \lambda_l\, f_l(x_i, x_j, y_{ij})
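A small sketch of this reduction: given learned weights λl and the feature functions, build the weighted complete graph whose edge weights are Σl λl fl(xi, xj, 1). A positive weight says "prefer the same cluster", a negative weight says "prefer different clusters". The function and argument names are illustrative.

from itertools import combinations

def build_partition_graph(references, features, lambdas):
    # returns {(i, j): w_ij} with w_ij = sum_l lambda_l * f_l(x_i, x_j, y_ij = 1)
    # references: list of mention/citation strings x_i
    # features:   list of feature functions f_l(x_i, x_j, y_ij)
    # lambdas:    learned weights, one per feature function
    edges = {}
    for i, j in combinations(range(len(references)), 2):
        xi, xj = references[i], references[j]
        edges[(i, j)] = sum(lam * f(xi, xj, 1) for lam, f in zip(lambdas, features))
    return edges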
Correlation Clustering
Graph G = (V, E), with edges labeled + or −.
[Figure: example graph on references X1 … X5 with + and − edges]
Partition V into clusters s.t.
+ edges are within clusters and
− edges are across clusters.
No bound on # of clusters.
Agreements and Disagreements
[Figure: candidate clusterings of the example graph on X1 … X5]
Agreements: + edges inside clusters AND − edges outside clusters.
Disagreements (mistakes): + edges outside clusters AND − edges inside clusters.
Can either maximize agreements OR minimize disagreements.
Choosing a different partitioning is possible, but could be worse, since more disagreements are introduced.
Partitions must be cliques (introduces a bias towards small partitions?).
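A simple greedy sketch of correlation clustering on such a weighted graph (purely illustrative; this is not the approximation algorithm discussed on the Complexity slide): repeatedly merge the pair of clusters whose connecting edges have the largest positive total weight, and stop when no merge helps.

def greedy_correlation_clustering(num_nodes, edges):
    # edges: {(i, j): w_ij} with i < j; positive weight ~ '+' edge, negative ~ '-' edge
    clusters = [{i} for i in range(num_nodes)]

    def cross_weight(a, b):
        return sum(edges.get((min(i, j), max(i, j)), 0.0) for i in a for j in b)

    while True:
        best, best_pair = 0.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                w = cross_weight(clusters[a], clusters[b])
                if w > best:
                    best, best_pair = w, (a, b)
        if best_pair is None:
            return clusters                 # no merge increases within-cluster weight
        a, b = best_pair
        clusters[a] |= clusters[b]
        del clusters[b]

Feeding it the output of build_partition_graph above, greedy_correlation_clustering(len(references), edges) returns one candidate partition; the number of clusters it ends up with is the inferred number of objects, which connects to the observation on the next slide.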
A few observations
• Number of objects is determined
automatically
– Corresponds to the number of cliques
• Metrics are defined pairwise, but decision
to join a clique involves all references
• If we force two cliques, the problem is
equivalent to a single simulated annealing
pass
Cross-citation disambiguation
Complexity
• The probabilistic inference problem reduces to a clustering problem that cannot be constant-factor-approximated in polynomial time
– We can reduce a 3-SAT problem to it
• Correlation clustering algorithms exist that guarantee relative error less than (1 − ε) in polynomial time for complete graphs (such as ours)
• Yet we are solving a probabilistic inference problem with a faster algorithm. How?
– We have a simpler subclass of probability distributions: they are log-linear
– It probably boils down to an integer-programming problem, since not all assignments of y are allowed
Proper Noun co-reference
fl's used: [table of feature functions shown in the original slide]
Proper Nouns performance
• Tested on
– 30 news wire articles
– 117 stories from the broadcast news portion of DARPA's ACE set
– Hand-annotated nouns (non-proper nouns ignored)
– Identical feature functions on all three sets!
– 5-fold validation
– Only 60% accuracy if non-proper nouns are included