Probabilistic Models of Relational Data
Daphne Koller, Stanford University
Joint work with: Ben Taskar, Pieter Abbeel, Lise Getoor, Eran Segal, Nir Friedman, Avi Pfeffer, Ming-Fai Wong

Why Relational?
- The real world is composed of objects that have properties and are related to each other.
- Natural language is all about objects and how they relate to each other:
  "George got an A in Geography 101"

Attribute-Based Worlds
"Smart students get A's in easy classes":
  Smart_Jane & easy_CS101 -> GetA_Jane_CS101
  Smart_Mike & easy_Geo101 -> GetA_Mike_Geo101
  Smart_Jane & easy_Geo101 -> GetA_Jane_Geo101
  Smart_Rick & easy_CS221 -> GetA_Rick_CS221
World = assignment of values to attributes / truth values to propositional symbols.

Object-Relational Worlds
  forall x,y: Smart(x) & Easy(y) & Take(x,y) -> Grade(A,x,y)
World = relational interpretation:
- Objects in the domain
- Properties of these objects
- Relations (links) between objects

Why Probabilities?
- (Almost) all universals are false: "smart students get A's in easy classes" has exceptions.
- True universals are rarely useful: "smart students get either A, B, C, D, or F."
- "The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful ... Therefore the true logic for this world is the calculus of probabilities ..." (James Clerk Maxwell)

Probable Worlds
Probabilistic semantics:
- A set of possible worlds
- Each world associated with a probability
Example: a world assigns course difficulty (easy/hard), student intelligence (smart/weak), and grade (A/B/C); the epistemic state is a distribution over all twelve combinations.

Representation: Design Axes

  World state   Categorical                 Probabilistic
  Attributes    Propositional logic, CSPs   Bayesian nets, Markov nets
  Sequences     Automata, Grammars          n-gram models, HMMs, Prob. CFGs
  Objects       First-order logic,          Probabilistic relational models
                Relational databases        (this talk)

Outline
- Bayesian Networks: representation & semantics; reasoning
- Probabilistic Relational Models
- Collective Classification
- Undirected Discriminative Models
- Collective Classification Revisited
- PRMs for NLP

Bayesian Networks
- Nodes = variables; edges = direct influence.
- Example network: Difficulty and Intelligence are parents of Grade; Grade is the parent of Letter; Intelligence is the parent of SAT.
- [CPD chart: P(Grade | Difficulty, Intelligence), one stacked bar over grades A, B, C for each parent combination (easy,low), (easy,high), (hard,low), (hard,high)]
- Graph structure encodes independence assumptions: Letter is conditionally independent of Intelligence given Grade.

BN Semantics
- Conditional independencies in the BN structure + local probability models = full joint distribution over the domain:
    P(D, I, G, L, S) = P(D) P(I) P(G | D, I) P(L | G) P(S | I)
- Compact & natural representation: if nodes have at most k parents, O(2^k n) parameters suffice instead of 2^n.
- Parameters are natural and easy to elicit.

Reasoning Using BNs
- "Probability theory is nothing but common sense reduced to calculation." (Pierre Simon Laplace)
- The full joint distribution specifies the answer to any query: P(variable | evidence about others), as in the sketch below.
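To make this concrete, here is a minimal Python sketch of the student network above, answering a query by brute-force enumeration of the joint. The CPD numbers are illustrative assumptions, not values from the talk.

```python
# Student BN: Difficulty, Intelligence -> Grade -> Letter; Intelligence -> SAT.
# All CPD numbers are illustrative assumptions.
from itertools import product

P_D = {'easy': 0.6, 'hard': 0.4}                      # P(Difficulty)
P_I = {'low': 0.7, 'high': 0.3}                       # P(Intelligence)
P_G = {('easy', 'low'):  {'A': 0.3,  'B': 0.4,  'C': 0.3},
       ('easy', 'high'): {'A': 0.9,  'B': 0.08, 'C': 0.02},
       ('hard', 'low'):  {'A': 0.05, 'B': 0.25, 'C': 0.7},
       ('hard', 'high'): {'A': 0.5,  'B': 0.3,  'C': 0.2}}  # P(Grade | D, I)
P_L = {'A': {'strong': 0.9, 'weak': 0.1},
       'B': {'strong': 0.6, 'weak': 0.4},
       'C': {'strong': 0.1, 'weak': 0.9}}             # P(Letter | Grade)
P_S = {'low':  {'good': 0.05, 'bad': 0.95},
       'high': {'good': 0.8,  'bad': 0.2}}            # P(SAT | Intelligence)

def joint(d, i, g, l, s):
    """The chain-rule factorization encoded by the BN structure."""
    return P_D[d] * P_I[i] * P_G[(d, i)][g] * P_L[g][l] * P_S[i][s]

def query(target, evidence):
    """P(target | evidence) by enumerating all worlds in the joint."""
    num = den = 0.0
    for d, i, g, l, s in product(P_D, P_I, 'ABC',
                                 ('strong', 'weak'), ('good', 'bad')):
        world = {'D': d, 'I': i, 'G': g, 'L': l, 'S': s}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(d, i, g, l, s)
        den += p
        if all(world[k] == v for k, v in target.items()):
            num += p
    return num / den

# Evidence about Intelligence changes beliefs about the Letter:
print(query({'L': 'strong'}, {'I': 'high'}))
```

Enumeration is exponential in the number of variables; the point of the next slides is that the graph structure lets exact inference do much better when separators are small.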
BN Inference
- BN inference is NP-hard in the worst case.
- Structured inference can exploit the graph: graph separation implies conditional independence.
- Do separate inference in the parts; combine the results over the interface between them (e.g., splitting a network over nodes A, B, C, D, E, F at a separator).
- Complexity: exponential in the largest separator.
- Structured BNs allow effective inference; exact inference in densely connected BNs is intractable.

Approximate BN Inference
Belief propagation is an iterative message-passing algorithm for approximate inference in BNs. In each iteration (until "convergence"), nodes pass "beliefs" as messages to neighboring nodes.
- Pros: linear time per iteration; works very well in practice, even for dense networks.
- Cons: limited theoretical guarantees; might not converge.

Outline
- Bayesian Networks
- Probabilistic Relational Models: language & semantics; web of influence
- Collective Classification
- Undirected Discriminative Models
- Collective Classification Revisited
- PRMs for NLP

Bayesian Networks: Problem
- Bayesian nets use a propositional representation; the real world has objects, related to each other.
- The university domain needs separate variables Intell_Jane, Intell_George, Diffic_CS101, Diffic_Geo101, Grade_Jane_CS101, Grade_George_CS101, Grade_George_Geo101, ...
- These "instances" are not independent: Jane's A and George's C in CS101 both carry information about Diffic_CS101.

Probabilistic Relational Models
Combine the advantages of relational logic & BNs:
- Natural domain modeling: objects, properties, relations
- Generalization over a variety of situations
- Compact, natural probability models
Integrate uncertainty with the relational model:
- Properties of domain entities can depend on properties of related entities
- Uncertainty over the relational structure of the domain

St. Nordaf University
[Example world: Prof. Smith and Prof. Jones (Teaching-ability) teach the courses Geo101 and CS101 (Difficulty); the students George and Jane (Intelligence) are registered in these courses, and each registration has a Grade and a Satisfaction.]

Relational Schema
Specifies the types of objects in the domain, the attributes of each type, and the types of relations between objects:
- Professor: Teaching-Ability
- Course: Difficulty
- Student: Intelligence
- Registration: Grade, Satisfaction
- Relations: Professor Teaches Course; Student Takes Course via a Registration that is In the Course

Probabilistic Relational Models [K. & Pfeffer; Poole; Ngo & Haddawy]
- Universals: probabilistic patterns hold for all objects in a class.
- Locality: represent direct probabilistic dependencies; links define the potential interactions.
- Example: Reg.Grade depends on Student.Intelligence and Course.Difficulty; Reg.Satisfaction depends on the professor's Teaching-Ability. [CPD chart: P(Grade | Difficulty, Intelligence), rows (easy,low), (easy,high), (hard,low), (hard,high) over grades A, B, C]
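As a concrete illustration, here is a minimal sketch (hypothetical code, not from the talk) of how such a class-level template is stamped out over a set of objects and links; this instantiation is exactly the PRM semantics defined next.

```python
# Unrolling a PRM template over a relational skeleton into a ground BN.
# Skeleton: who is registered in which course, and who teaches it.
registrations = [('Jane', 'Geo101'), ('George', 'Geo101'), ('George', 'CS101')]
instructor = {'Geo101': 'Prof. Smith', 'CS101': 'Prof. Jones'}

def ground_bn(registrations, instructor):
    """One ground variable per object attribute; the template dependencies
    (Grade <- Intelligence, Difficulty; Satisfaction <- Grade, Teaching-Ability)
    are copied once per registration."""
    edges = []
    for student, course in registrations:
        grade = f'Grade({student},{course})'
        satisf = f'Satisfaction({student},{course})'
        edges += [(f'Intelligence({student})', grade),
                  (f'Difficulty({course})', grade),
                  (grade, satisf),
                  (f'Teaching-Ability({instructor[course]})', satisf)]
    return edges

for parent, child in ground_bn(registrations, instructor):
    print(parent, '->', child)

# Every ground Grade variable shares one CPD (the "universal"), which is why
# evidence about Jane's grade in Geo101 can shift beliefs about the course's
# difficulty and, through it, about George.
```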
PRM Semantics
An instantiated PRM defines a ground BN:
- Variables: the attributes of all objects (the Teaching-ability of Prof. Smith and Prof. Jones, the Difficulty of Geo101 and CS101, the Intelligence of George and Jane, and a Grade and Satisfaction per registration)
- Dependencies: determined by the links & the PRM
All grounded copies of an attribute share the same local model, so evidence flows between instances.

The Web of Influence
[Example of inference in the ground network: observed grades in CS101 and Geo101 shift beliefs about each course being easy/hard and each student being of low/high intelligence, which in turn propagate to the other courses and students they are linked to.]

Outline
- Bayesian Networks
- Probabilistic Relational Models
- Collective Classification & Clustering: learning models from data; collective classification of webpages
- Undirected Discriminative Models
- Collective Classification Revisited
- PRMs for NLP

Learning PRMs [Friedman, Getoor, K., Pfeffer]
A learner takes a relational database (e.g., Course, Student, and Reg tables) plus expert knowledge and produces a PRM.

Learning PRMs
Parameter estimation:
- Probabilistic model with shared parameters: grades for all students share the same model.
- Can use standard techniques for maximum-likelihood or Bayesian parameter estimation, e.g. (see the sketch below):

    P^(Reg.Grade = A | Student.Intell = hi, Course.Diff = lo)
      = #(Reg.Grade = A, Student.Intell = hi, Course.Diff = lo)
        / #(Reg.Grade = *, Student.Intell = hi, Course.Diff = lo)

Structure learning:
- Define a scoring function over structures.
- Use combinatorial search to find a high-scoring structure.
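A minimal sketch of that closed-form maximum-likelihood estimate: shared CPD parameters estimated by counting over all registrations in the database. The toy records are illustrative, not data from the talk.

```python
# ML estimation of the shared CPD P(Reg.Grade | Student.Intell, Course.Diff).
from collections import Counter

# One (grade, intelligence, difficulty) triple per Registration row.
records = [('A', 'hi', 'lo'), ('A', 'hi', 'lo'), ('B', 'hi', 'lo'),
           ('C', 'lo', 'hi'), ('B', 'lo', 'lo'), ('A', 'hi', 'hi')]
counts = Counter(records)

def p_hat(grade, intell, diff):
    """P^(Grade = grade | Intell = intell, Diff = diff)
    = #(grade, intell, diff) / #(*, intell, diff)."""
    num = counts[(grade, intell, diff)]
    den = sum(c for (g, i, d), c in counts.items()
              if i == intell and d == diff)
    return num / den if den else 0.0

print(p_hat('A', 'hi', 'lo'))   # 2/3 on the toy records
```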
Web KB [Craven et al.]
Example domain: Tom Mitchell (Professor) is Advisor-of Sean Slattery (Student); both are Member(s) of the WebKB Project.

Web Classification Experiments
WebKB dataset:
- Four CS department websites
- Bag of words on each page
- Links between pages
- Anchor text for links
Experimental setup:
- Trained on three universities, tested on the fourth
- Repeated for all four combinations

Standard Classification
- Naive Bayes: Page.Category -> Word1 ... WordN ("professor", "department", "extract", "information", "computer", "science", "machine", "learning", ...)
- Categories: faculty, course, project, student, other
- [Bar chart: test error of the words-only model]

Exploiting Links
- Add the words of incoming links ("working with Tom Mitchell ...") as additional evidence: Page.Category -> Word1 ... WordN, LinkWord1 ... LinkWordN.
- [Bar chart: link words reduce error relative to words only]

Collective Classification [Getoor, Segal, Taskar, Koller]
- Model From-Page.Category and To-Page.Category jointly with each page's words and each link's existence (Exists).
- Classify all pages collectively, maximizing the joint label probability.
- Approximate inference: belief propagation.
- [Bar chart: collective classification beats both the words-only and link-words models]

Learning with Missing Data: EM [Dempster et al. 77]
- Alternate between completing the unobserved values (students' intelligence, low/high; courses' difficulty, easy/hard) and re-estimating P(Registration.Grade | Course.Difficulty, Student.Intelligence). [CPD chart as before, over grades A, B, C]

Discovering Hidden Types
Internet Movie Database, http://www.imdb.com

Discovering Hidden Types [Taskar, Segal, Koller]
- Each Actor, Director, and Movie has a hidden Type; a movie's attributes (Genres, Year, MPAA Rating, Rating, #Votes) and the types of its actors and director depend on the movie's type.

Discovering Hidden Types
Discovered clusters (excerpt):
- Movies: {Wizard of Oz, Cinderella, Sound of Music, The Love Bug, Pollyanna, The Parent Trap, Mary Poppins, Swiss Family Robinson} vs. {Terminator 2, Batman, Batman Forever, GoldenEye, Starship Troopers, Mission: Impossible, Hunt for Red October}
- Actors: {Sylvester Stallone, Bruce Willis, Harrison Ford, Steven Seagal, Kurt Russell, Kevin Costner, Jean-Claude Van Damme, Arnold Schwarzenegger} vs. {Anthony Hopkins, Robert De Niro, Tommy Lee Jones, Harvey Keitel, Morgan Freeman, Gary Oldman}
- Directors: {Alfred Hitchcock, Stanley Kubrick, David Lean, Milos Forman, Terry Gilliam, Francis Coppola} vs. {Steven Spielberg, Tim Burton, Tony Scott, James Cameron, John McTiernan, Joel Schumacher}

Outline
- Bayesian Networks
- Probabilistic Relational Models
- Collective Classification & Clustering
- Undirected Discriminative Models: Markov networks; relational Markov networks
- Collective Classification Revisited
- PRMs for NLP

Directed Models: Limitations
- The acyclicity constraint limits expressive power: we cannot express patterns such as "two objects linked to by a student are probably not both professors."
- Acyclicity forces modeling of all potential links: network size O(N^2), inference is quadratic.
- Generative training: trained to fit all of the data, not to maximize accuracy.

Solution: Undirected Models [Lafferty, McCallum, Pereira]
- Allow arbitrary patterns over sets of objects & links.
- Influence flows over existing links, exploiting link-graph sparsity: network size O(N).
- Allow discriminative training: maximize P(labels | observations).

Markov Networks
- Example: people Alice, Betty, Chris, Dave, Eve with compatibility potentials, e.g., phi(A, B, C) assigns a weight to each of the eight assignments FFF ... TTT:
    P(A, B, C, D, E) = (1/Z) phi(A, B, C) phi(C, D) phi(D, E) phi(E, A)
- Graph structure encodes independence assumptions: Chris is conditionally independent of Eve given Alice & Dave.

Relational Markov Networks [Taskar, Abbeel, Koller '02]
- Universals: probabilistic patterns hold for all groups of objects.
- Locality: represent local probabilistic dependencies; sets of links give us the possible interactions.
- Example template potential: phi(Reg.Grade, Reg2.Grade) over the grade pairs AA, AB, ..., CC for two students in the same study group of a course.

RMN Semantics
An instantiated RMN defines a ground Markov network:
- Variables: the attributes of all objects (the Difficulty of Geo101 and CS101; the Intelligence of George, Jane, and Jill; their Grades)
- Dependencies: determined by the links & the RMN, e.g., the grades of students in the same study group (the Geo or CS study group) are connected by the template potential.

Outline
- Bayesian Networks
- Probabilistic Relational Models
- Collective Classification & Clustering
- Undirected Discriminative Models
- Collective Classification Revisited: discriminative training of RMNs; webpage classification; link prediction
- PRMs for NLP

Learning RMNs
- Parameter estimation is not closed form, but the problem is convex: a unique global maximum.
- Maximize L = log P(Grades, Intelligence | Difficulty); for the template potential phi(Reg1.Grade, Reg2.Grade), the gradient with respect to the (A, A) parameter is

    dL/dlambda_AA = #(Grade = A, Grade = A) - E[#(Grade = A, Grade = A) | Diffic]

  i.e., the empirical count of study-group pairs with two A's, minus its expected count under the current model.
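A minimal sketch of that gradient, assuming a log-linear parameterization with one indicator feature per grade pair; the observed study groups are toy data, not from the talk. The derivative of the log-likelihood for each template weight is the empirical feature count minus its model expectation.

```python
# RMN log-likelihood gradient: empirical minus expected feature counts.
import math
from itertools import product

GRADES = 'ABC'
weights = {(g1, g2): 0.0 for g1, g2 in product(GRADES, GRADES)}

def model_expectations():
    """E_model[f] for each pair indicator, via the explicit partition function Z."""
    scores = {y: math.exp(weights[y]) for y in weights}
    Z = sum(scores.values())
    return {y: s / Z for y, s in scores.items()}

def gradient(data):
    """dL/dw_y = #(y observed) - (number of groups) * E_model[indicator of y]."""
    expected = model_expectations()
    grad = {y: -len(data) * expected[y] for y in weights}
    for y in data:
        grad[y] += 1.0
    return grad

observed = [('A', 'A'), ('A', 'A'), ('B', 'B'), ('A', 'B')]  # toy study groups
print(gradient(observed)[('A', 'A')])   # 2 - 4/9 with uniform initial weights
```

Because the objective is concave in the weights, following this gradient (in practice with conjugate gradient or L-BFGS) reaches the unique global maximum the slide mentions.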
Flat Models
- Logistic regression: P(Category | Words), with features Word1 ... WordN and LinkWord1 ... LinkWordN.
- [Bar chart: test error of Naive Bayes vs. logistic regression vs. SVM]

Exploiting Links
- Model From-Page.Category and To-Page.Category together with each page's words and link words, trained discriminatively.
- 42.1% relative reduction in error relative to the generative approach. [Bar chart: PRM vs. Logistic vs. RMN-link]

More Complex Structure
- Richer patterns: e.g., a faculty page (words W1 ... Wn) linking to the pages of several students (S), who in turn link to course pages (C) by section.

Collective Classification: Results
- 35.4% relative reduction in error relative to a strong flat approach. [Bar chart: Logistic vs. Links vs. Section vs. Link+Section]

Scalability
WebKB data set size: 1300 entities, 180K attributes, 5800 links.
Network size per school:
- Directed model: 200,000 variables, 360,000 edges
- Undirected model: 40,000 variables, 44,000 edges

                      Training      Classification
  Directed models     3 sec         180 sec
  Undirected models   20 minutes    15-20 sec

The difference in training time decreases substantially when some training data is unobserved, i.e., when we want to model hidden variables.

Predicting Relationships
- Example: Tom Mitchell (Professor) is Advisor-of Sean Slattery (Student); both are Member(s) of the WebKB Project.
- Even more interesting than the objects are the relationships between them; e.g., verbs are almost always relationships.

Flat Model
- Predict Rel (NONE, advisor, instructor, TA, member, project-of) for each candidate link from the two pages' words (Word1 ... WordN each), the link words (LinkWord1 ... LinkWordN), and the link Type.

Collective Classification: Links
- Add From-Page.Category and To-Page.Category, and classify page categories and relations jointly (Link Model).

Triad Model
- Template patterns over triples of objects: a Professor who is Advisor of a Student, where both are Member(s) of the same Group; a Professor who is Advisor of a Student, where the professor is Instructor and the student is TA of the same Course.

WebKB++
- Four new department websites: Berkeley, CMU, MIT, Stanford.
- Labeled page types (8): faculty, student, research scientist, staff, research group, research project, course, organization.
- Labeled hyperlinks and virtual links (6 types): advisor, instructor, TA, member, project-of, NONE.
- Data set size: 11K pages, 110K links, 2 million words.

Link Prediction: Results
- 72.9% relative reduction in error relative to a strong flat approach; error measured over links predicted to be present.
- The link-presence cutoff is at the precision/recall break-even point (30% for all models). [Bar chart: Flat vs. Labels vs. Triad]

Summary
- PRMs inherit the key advantages of probabilistic graphical models: coherent probabilistic semantics, and exploiting the structure of local interactions.
- Relational models are inherently more expressive.
- "Web of influence": use all available information to reach powerful conclusions, exploiting both relational information and the power of probabilistic reasoning.
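The collective models above all rely on loopy belief propagation for inference in the ground network. Below is a minimal sketch on a toy pairwise network of three linked pages; the graph, potentials, and two-category label set are illustrative assumptions, not the talk's actual model.

```python
# Loopy belief propagation for collective classification on a toy link graph.
import numpy as np

CATS = 2                                   # e.g., 0 = course, 1 = faculty
links = [(0, 1), (1, 2), (2, 0)]           # a small loopy link graph
node_pot = np.array([[0.8, 0.2],           # local evidence per page
                     [0.5, 0.5],           # (e.g., from word features)
                     [0.3, 0.7]])
edge_pot = np.array([[1.0, 2.0],           # linked pages here prefer
                     [2.0, 0.5]])          # differing categories

# messages[(i, j)] = current message from page i to page j
msgs = {(i, j): np.ones(CATS) for a, b in links for i, j in [(a, b), (b, a)]}

for _ in range(50):                        # iterate until (hopefully) converged
    new = {}
    for (i, j) in msgs:
        # Combine local evidence with all incoming messages except j's ...
        belief = node_pot[i].copy()
        for (k, i2) in msgs:
            if i2 == i and k != j:
                belief *= msgs[(k, i)]
        m = edge_pot.T @ belief            # ... and marginalize out page i
        new[(i, j)] = m / m.sum()          # normalize for numerical stability
    msgs = new

for page in range(3):
    b = node_pot[page].copy()
    for (k, i) in msgs:
        if i == page:
            b *= msgs[(k, page)]
    print(page, b / b.sum())               # approximate posterior per page
```

Each iteration is linear in the number of links, which is what makes inference feasible in the 40,000-variable ground networks reported above, even without convergence guarantees.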
Outline
- Bayesian Networks
- Probabilistic Relational Models
- Collective Classification & Clustering
- Undirected Discriminative Models
- Collective Classification Revisited
- PRMs for NLP, or "Why Should I Care?" (an outsider's perspective): word-sense disambiguation; relation extraction; natural language understanding (?)

Word Sense Disambiguation
"Her advisor gave her feedback about the draft."
- Candidate senses: advisor (academic), feedback (electrical, criticism), draft (financial, physical/wind, paper, figurative).
- Neighboring words alone may not provide enough information to disambiguate.
- We can gain insight by considering the compatibility between the senses of related words.

Collective Disambiguation
"Her advisor gave her feedback about the draft."
- Objects: the words in the text.
- Attributes: sense, gender, number, part of speech, ...
- Links: grammatical relations (subject-object, modifier, ...); close semantic relations (is-a, cause-of, ...); the same word in different sentences (one sense per discourse).
- Compatibility parameters: learned from tagged data, or based on prior knowledge (e.g., WordNet, FrameNet).
- Can we infer grammatical structure and disambiguate word senses simultaneously, rather than sequentially?
- Can we integrate inter-word relationships directly into our probabilistic model?

Relation Extraction
"ACME's board of directors began a search for a new CEO after the departure of current CEO, James Jackson, following allegations of creative accounting practices at ACME. [6/01] ... In an attempt to improve the company's image, ACME is considering former judge Mary Miller for the job. [7/01] ... As her first act in her new position, Miller announced that ACME will be doing a stock buyback. [9/01]"
[Extracted relation graph linking Jackson and Miller to the CEO role, and Miller to the announcement]

Understanding Language
"Professor Sarah met Jane. She explained the hole in her proof."
Most likely interpretation: Jane is a Student, Sarah is a Professor, and "she" is Professor Sarah. (Theorem: P = NP. Proof: N = 1.)

Resolving Ambiguity
"Professor Sarah met Jane. She explained the hole in her proof."
- Professors often meet with students, so Jane is probably a student (attribute values).
- Professors like to explain, so "she" is probably Prof. Sarah (link types, object identity).
- Probabilistic reasoning about objects, their attributes, and the relationships between them. [Goldman & Charniak; Pasula & Russell]

Acquiring Semantic Models
Statistical NLP reveals patterns, e.g., the verbs that relate to "teacher" (train, hire, pay, fire, serenade, be), with frequencies ranging from 24% down to 0.3%.
- Standard models learn patterns at the word level, but word patterns are only implicit surrogates for underlying semantic patterns.
- "Teacher" objects tend to participate in certain relationships; we can use this pattern for objects not explicitly labeled as teachers.

Competing Approaches -> Complementary Approaches
- Desiderata: semantic understanding; scaling up (via learning); handling noise & ambiguity.
- Logical approaches offer semantic understanding; statistical approaches offer learning and robustness to noise; PRMs combine both.

Statistics: from Words to Semantics
- Represent statistical patterns at the semantic level: what types of objects participate in what types of relationships.
- Learn statistical models of semantics from text.
- Reason using the models to obtain a global semantic understanding of the text.

[Closing image: Georgia O'Keeffe, "Ladder to the Moon"]