CMSC 828G: Introduction to Statistical Relational Learning (SRL) & Link Analysis (LA)
January 28, 2005

Today’s Outline
• Brief Introduction to SRL
• Student Introductions
• Course Mechanics
• Slightly Longer Introduction to SRL
• SRL focus problem
• Exercise: Create your own SRL focus problem
• Discussion of SRL focus problems
• Survey
• Resources

Statistical Relational Learning
• Traditional machine learning and data mining approaches assume:
  – A random sample of homogeneous objects from a single relation
• Real-world data sets are:
  – Multi-relational, heterogeneous and semi-structured
• SRL is a newly emerging research area at the intersection of research in graphical models, social network and link analysis, hypertext and web mining, graph mining, relational learning and inductive logic programming

SRL Approaches
• Combine logical/combinatorial structures with statistical/probabilistic models
• Families of Approaches
  – Entity-relation Models + Graphical Models (BNs/Markov Models)
  – First-Order Logic + Graphical Models
  – Functional Programming + Stochastic Execution

Sample Domains
• web data (web)
• bibliographic data (cite)
• epidemiological data (epi)
• communication data (comm)
• customer networks (cust)
• collaborative filtering problems (cf)
• trust networks (trust)
• biological data (bio)

Recent SRL Activities
• Dagstuhl Workshop on Probabilistic, Logical and Relational Learning - Towards a Synthesis (1/30/05-2/04/05) http://www.dagstuhl.de/05051/
• ICML 2004 workshop on Statistical Relational Learning and its Connections to Other Fields http://www.cs.umd.edu/projects/srl2004/
• IJCAI 2003 workshop on Statistical Relational Learning http://kdl.cs.umass.edu/srl2003/
• AAAI 2000 workshop on Statistical Relational Learning http://robotics.stanford.edu/srl
• Several related workshops:
  – KDD MRDM workshops
    • http://www-ai.ijs.si/SasoDzeroski/MRDM2004/
    • http://www-ai.ijs.si/SasoDzeroski/MRDM2003/
    • http://www-ai.ijs.si/SasoDzeroski/MRDM2002/
• Benjamin Taskar and I are
working on an edited SRL collection, and ideally we will have access to draft chapters from this collection.

Other SRL Related Courses
• Tom Dietterich’s course at OSU http://web.engr.oregonstate.edu/~tgd/classes/539/
• David Page, Mark Craven and Jude Shavlik’s course at UWisc http://www.biostat.wisc.edu/~page/838.html
• Pedro Domingos’s course at UWash
• Eric Mjolsness’s course at UCI on Probabilistic Knowledge Representation http://computableplant.ics.uci.edu/emj/classes/280_04/Syllabus%20ICS%20280%20v2.doc
• Stuart Russell’s course at Berkeley on Knowledge Representation and Reasoning http://www.cs.berkeley.edu/~russell/classes/cs289/f04/
• Joydeep Ghosh’s course at UT Austin on Advanced Topics in Data Mining http://www.lans.ece.utexas.edu/course/382v/05sp/
• Michael Littman’s course at Rutgers on Learned Representations in AI http://www.cs.rutgers.edu/~mlittman/courses/lightai03/
• David Jensen and Andrew McCallum’s course at UMass on Computational Social Network Analysis http://kdl.cs.umass.edu/courses/csna/

Goals of this Course
• ***NEW*** area
• Understand Foundations
  – Tutorials on Graphical Models, Logic, ILP, etc.
• Understand existing work
  – Wade through and make sense of the Alphabet Soup of approaches (PRMs, BLPs, SLPs, MLPs, RMNs, LBNs, etc.)
• Understand interesting theoretical issues
  – Collective classification, Open World assumptions, etc.
• Study interesting and practical applications of SRL
• Do a significant (publishable) project in this area.

Course Mechanics
• Course meets 10:00-12:45.
  – We will have a 15-minute break, typically 11:15-11:30
  – Class will consist of:
    • Tutorials
    • Exercises
    • Readings and Discussion
• Course URL
  – http://www.cs.umd.edu/class/spring2005/cmsc828g/
• Course Wiki
  – … stay tuned….

Course Expectations
• SRL Focus problem (15%)
  – Each student will develop an SRL focus problem (10%) due Feb.
11
    • Describe a domain
    • Describe useful inference and learning tasks
    • (Ideally) Collect data
  – Each student will ‘solve’ their SRL focus problem using at least two different SRL techniques (5%)
• Lead at least one class discussion (5%)
  – Each student will sign up to lead the discussion of one (or more, depending on class size) class discussion topic.
• Class Participation (15%)
  – Each week each student must turn in a short discussion of the readings by noon Thursday before class. The discussion leader should review the others’ responses and use them to structure the class discussion.
• Class Project (50%)
  – Each student is expected to do a research project for the course.
    • Feb. 18, Project Proposals Due
    • Mar. 18, Project Progress Report #1 Due
    • Apr. 22, Project Progress Report #2 Due
    • May 6, Project Presentations
    • May 12, Project Write-up Due
• Class Exercises (10%)
  – Throughout the course, there will be small class exercises
• Reviewer (5%)
  – Each student is expected to do 2 one-page reviews of submitted SRL Book Chapters (student reviewers will be acknowledged in the text)

Introductions
• Name
• Where you are originally from
• Research Interest/Advisor if you have one

SRL Intro Part II
An Example: Probabilistic Relational Models

Bayesian Networks: Problem
• Bayesian nets use propositional representation
• Real world has objects, related to each other
• These “instances” are not independent
[Figure: a Bayesian network with template nodes Intelligence, Difficulty and Grade, and ground nodes Intell_Jane, Intell_George, Diffic_CS101, Diffic_Geo101, Grade_Jane_CS101, Grade_George_CS101 and Grade_George_Geo101; the grade values A and C are shown.]

Probabilistic Relational Models
• Combine advantages of relational logic & BNs:
  – Natural domain modeling: objects, properties, relations
  – Generalization over a variety of situations
  – Compact, natural probability models
• Integrate uncertainty with relational model:
  – Properties of domain entities can depend on properties of related entities
  – Uncertainty over relational structure of domain

St.
Nordaf University (figure: an example world. Prof. Smith and Prof. Jones each have a Teaching-ability and Teaches links; the courses CS101 and Geo101 each have a Difficulty; the students George and Jane each have an Intelligence; each Registered link carries an In-course Grade and a Satisfac(tion) value.)

Relational Schema
• Specifies types of objects in domain, attributes of each type of object & types of relations between objects
[Figure: schema showing the classes Professor (Teaching-Ability), Student (Intelligence), Course (Difficulty) and Registration (Grade, Satisfaction), with the relations Teach, Take and In; the diagram labels Classes, Attributes and Relations.]

Representing the Distribution
• Very large probability space for a given context
  – All possible assignments of all attributes of all objects
• Infinitely many potential contexts
  – Each associated with a very different set of worlds
Need to represent infinite set of complex distributions

Probabilistic Relational Models
• Universals: Probabilistic patterns hold for all objects in class
• Locality: Represent direct probabilistic dependencies
  – Links define potential interactions
[Figure: dependency structure over Professor (Teaching-Ability), Student (Intelligence), Course (Difficulty) and Reg (Grade, Satisfaction), with a bar chart of grades A, B, C on a 0% to 100% scale for the cases easy,low; easy,high; hard,low; hard,high.]
[Koller & Pfeffer; Poole; Ngo & Haddawy]

PRM Semantics
Prof. Jones Teaching-ability Prof.
Smith Teaching-ability (figure: the ground network over Prof. Jones, Prof. Smith, George, Jane, Geo101 and CS101, with Intelligence, Difficulty, Grade and Satisfac nodes.)
• Instantiated PRM yields a BN
  – variables: attributes of all objects
  – dependencies: determined by links & PRM

The Web of Influence
[Figure: CS101 and Geo101 with the grades C and A, and bar charts over easy / hard and low / high (values shown: 0%, 50%, 100%).]

Reasoning with a PRM
• Generic approach:
  – Instantiate PRM to produce ground BN
  – Use standard BN inference
• In most cases, resulting BN is too densely connected to allow exact inference
• Use approximate inference: belief propagation
• Improvement: Use domain structure (objects & relations) to guide computation
  – Kikuchi approximation where clusters = objects

[Figure: learning and inference pipeline with the boxes Data Model, Objects (Course, Student, Reg), Database, Expert knowledge, Learner, Probabilistic Model, Prob. Inference and Data for New Situation. What are the objects in the new situation? How are they related to each other?]
[Friedman, Getoor, Koller & Pfeffer; ...]

PRM Summary
• PRMs inherit key advantages of probabilistic graphical models:
  – Coherent probabilistic semantics
  – Exploit structure of local interactions
• Relational models are inherently more expressive
• “Web of influence”: use multiple sources of information to reach conclusions
• Exploit both relational information and power of probabilistic reasoning

SRL & Link Mining General Issues

Linked Data
• Heterogeneous, multi-relational data represented as a graph or network
  – Nodes are objects
    • May have different kinds of objects
    • Objects have attributes
    • Objects may have labels or classes
  – Edges are links
    • May have different kinds of links
    • Links may have attributes
    • Links may be directed and are not required to be binary

Link Mining Tasks
• Link-based Object Classification
• Object Type Prediction
• Link Type Prediction
• Predicting Link Existence
• Link Cardinality Estimation
• Object Consolidation
• Group Detection
• Subgraph Discovery
• Metadata Mining

Link-based Object Classification
•
Predicting the category of an object based on its attributes and its links and attributes of linked objects
• web: Predict the category of a web page, based on words that occur on the page, links between pages, anchor text, html tags, etc.
• cite: Predict the topic of a paper, based on word occurrence, citations, co-citations
• epi: Predict disease type based on characteristics of the patients infected by the disease

Object Class Prediction
• Predicting the type of an object based on its attributes and its links and attributes of linked objects
• comm: Predict whether a communication contact is by email, phone call or mail.
• cite: Predict the venue type of a publication (conference, journal, workshop)

Link Type Classification
• Predicting type or purpose of link based on properties of the participating objects
• web: predict advertising link or navigational link; predict an advisor-advisee relationship
• epi: predicting whether contact is familial, co-worker or acquaintance

Predicting Link Existence
• Predicting whether a link exists between two objects
• web: predict whether there will be a link between two pages
• cite: predicting whether a paper will cite another paper
• epi: predicting who a patient’s contacts are

Link Cardinality Estimation I
• Predicting the number of links to an object
• web: predict the authoritativeness of a page based on the number of in-links; identifying hubs based on the number of out-links
• cite: predicting the impact of a paper based on the number of citations
• epi: predicting the number of people that will be infected based on the infectiousness of a disease.
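
The in-link and out-link counts mentioned for the web domain can be illustrated with a small sketch. This only counts observed links in a directed edge list; it is not a predictive model, and the function name `link_counts` and the toy page names are hypothetical.

```python
from collections import Counter


def link_counts(edges):
    """Count in-links and out-links per node for a directed edge list.

    `edges` is an iterable of (source, target) pairs, e.g. (linking page, linked page).
    """
    in_links = Counter()
    out_links = Counter()
    for source, target in edges:
        out_links[source] += 1
        in_links[target] += 1
    return in_links, out_links


# Hypothetical toy web graph, invented purely for illustration.
edges = [
    ("a", "c"), ("b", "c"), ("d", "c"),   # three pages link to "c"
    ("c", "e"), ("a", "e"),
    ("a", "b"), ("a", "d"),               # "a" links out to four pages in total
]

in_links, out_links = link_counts(edges)
most_linked_to = in_links.most_common(1)[0]   # page with the most in-links
most_linking = out_links.most_common(1)[0]    # page with the most out-links
```

Here the page with the most in-links would be the candidate for authoritativeness and the page with the most out-links the candidate hub; an actual estimation task would predict these counts for objects whose links are not fully observed.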

Link Cardinality Estimation II
• Predicting the number of objects reached along a path from an object
• Important for estimating the number of objects that will be returned by a query
• web: predicting number of pages retrieved by crawling a site
• cite: predicting the number of citations of a particular author in a specific journal

Entity Resolution
• Predicting when two objects are the same, based on their attributes and their links
• aka: record linkage, duplicate elimination, identity uncertainty
• web: predict when two sites are mirrors of each other.
• cite: predicting when two citations are referring to the same paper.
• epi: predicting when two disease strains are the same
• bio: learning when two names refer to the same protein

Group Detection
• Predicting when a set of entities belong to the same group based on clustering both object attribute values and link structure
• web: identifying communities
• cite: identifying research communities

Subgraph Identification
• Find characteristic subgraphs
• Focus of graph-based data mining (Cook & Holder, Inokuchi, Washio & Motoda, Kuramochi & Karypis, Yan & Han)
• bio: protein structure discovery
• comm: legitimate vs. illegitimate groups
• chem: chemical substructure discovery

Metadata Mining
• Schema mapping, schema discovery, schema reformulation
• cite: matching between two bibliographic sources
• web: discovering schema from unstructured or semi-structured data
• bio: mapping between two medical ontologies

Link Mining Tasks
• Link-based Object Classification
• Object Type Prediction
• Link Type Prediction
• Predicting Link Existence
• Link Cardinality Estimation
• Object Consolidation
• Group Detection
• Subgraph Discovery
• Metadata Mining

SRL General Issues Summary
• SRL Tasks
  – Link-based Object Classification
  – Object Type Prediction
  – Link Type Prediction
  – Predicting Link Existence
• SRL Challenges
  – Logical vs. Statistical dependencies
  – Feature construction
  – Instances vs.
Classes
  – Collective Classification
  – Collective Consolidation
  – Effective Use of Labeled & Unlabeled Data
  – Link Prediction
  – Closed vs. Open World
• SRL Tasks (continued)
  – Link Cardinality Estimation
  – Object Consolidation
  – Group Detection
  – Subgraph Discovery
  – Metadata Mining

SRL Focus Problem #1
Citation Analysis Domain
• The first focus problem domain is bibliographic citation analysis. A large number of SRL researchers have worked with this domain. Some advantages of this domain are:
  – the availability of data (thanks largely to Andrew McCallum, William Cohen, Steve Lawrence and others),
  – the ease of understanding the domain,
  – our obvious inherent interest in the domain as academics, and
  – the potential high payoff and high visibility of SRL approaches if they can solve this problem.
• Within this domain, some of the objects are:
  – papers, authors, affiliations, venues and so on.
• Some of the links or relationships are:
  – citations, authorship, co-authorship and so on.
• An interesting aspect of the problem is that one must deal with identity uncertainty: objects can be referenced in many ways, and an important task is entity resolution: figuring out the underlying object domains and the mappings between references and objects.

SRL Tasks in FP #1
• topic prediction: collective classification of the topics of papers.
• author attribution: predicting the author of a paper. An issue is whether we assume a closed or open world for the authors. Plagiarism detection.
• author-topic identification: discovering the topic areas for authors. This can be used, for example, to assign reviewers for papers.
• entity resolution: collective clustering of the references to objects to determine the set of authors, papers and venues.
• topic evolution: tracking change in topics over time.
• group detection: finding collaboration networks.
• citation counting/ranking: predicting number of citations or ranking based on predicted number of citations.
• hidden object invention: analogous to hidden variable introduction, the introduction of a hidden object, such as an advisor, that relates two author instances.
• predicate invention: from co-author information, affiliation information and perhaps information such as position and room location, invent an advisor predicate.

Data for FP #1
• Many people have constructed data sets by crawling bibliography servers such as CiteSeer, ACM, DBLP and, soon one would imagine, Google Scholar.
• Several years ago Steve Lawrence made available a large collection of the CiteSeer data; it is available by contacting him.
• Several versions of the Cora data set are available here: http://www.cs.umass.edu/~mccallum/codedata.html
• The recent 2003 KDD Cup challenge has data available from high energy physics: http://www.cs.cornell.edu/projects/kddcup/

Your Turn
• Come up with an SRL focus problem:
  – Define the schema, objects, links, etc.
  – Describe some SRL tasks in this domain
  – Think about where you could get the data

Survey

Next Time
• Graphical Models Review
• Led by Indrajit Bhattacharya
• Readings available for pickup and in library. (Due to their draft nature, they are not available on the web.)
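
As one possible starting point for the "Your Turn" exercise, the schema of the citation domain from Focus Problem #1 (objects such as papers and authors; links such as authorship, citation and co-authorship) might be written down as follows. This is a minimal sketch: the class names, fields and the example records are all hypothetical and are not taken from any of the data sets listed above.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass(frozen=True)
class Author:
    name: str
    affiliation: str = ""


@dataclass(frozen=True)
class Paper:
    title: str
    venue: str = ""
    topic: str = ""  # the attribute a topic-prediction task would try to fill in


@dataclass
class CitationGraph:
    authorship: set = field(default_factory=set)  # (Author, Paper) links
    citations: set = field(default_factory=set)   # (citing Paper, cited Paper) links

    def coauthors(self):
        """Derive co-authorship links: pairs of authors who share a paper."""
        authors_by_paper = {}
        for author, paper in self.authorship:
            authors_by_paper.setdefault(paper, set()).add(author)
        pairs = set()
        for authors in authors_by_paper.values():
            ordered = sorted(authors, key=lambda a: a.name)
            pairs.update(combinations(ordered, 2))
        return pairs


# Hypothetical records, invented purely for illustration.
alice, bob, carol = Author("Alice"), Author("Bob"), Author("Carol")
p1, p2 = Paper("Paper One"), Paper("Paper Two")

graph = CitationGraph()
graph.authorship |= {(alice, p1), (bob, p1), (carol, p2)}
graph.citations.add((p2, p1))  # Paper Two cites Paper One

coauthor_pairs = graph.coauthors()
```

Writing the schema down this way makes the inference tasks easier to state: topic prediction fills in `Paper.topic`, link prediction proposes new entries for `citations`, and entity resolution decides when two `Author` records denote the same person.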