Discovering Patterns in Text and Relational Data with Bayesian Latent-Variable Models

Andrew McCallum
Computer Science Department
University of Massachusetts Amherst

Joint work with Xuerui Wang, David Mimno, Andres Corrada, Natasha Mohanty, Gideon Mann, Hanna Wallach.

Goal
Building models that mine actionable knowledge from unstructured text.

Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1

A Portal for Job Openings
Keyword = Java
Location = U.S.
Job Openings: Category = High Tech

Data Mining the Extracted Job Information

IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k+ documents, several millennia old
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries

IE from Research Papers [McCallum et al ‘97]

Mining Research Papers [Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004] [Giles et al]

Address Book Management and Expert Finding
Workplace effectiveness ~ Ability to leverage network of acquaintances
But filling a Contacts DB by hand is tedious, and the result is incomplete.
[Diagram labels: Email Inbox, WWW, Automatically, Contacts DB]

System Overview
[Diagram labels: Email, WWW, CRF, Keyword Extraction, Person Name Extraction, Name Coreference, Homepage Retrieval, names, Contact Info and Person Name Extraction, Social Network Analysis]

An Example
To: “Andrew McCallum” mccallum@cs.umass.edu
Subject ...
(Search for new people)
First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
JobTitle: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governor’s Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413) 545-1323
Links: Fernando Pereira, Sam Roweis, …
Key Words: Information extraction, social network,…

Example keywords extracted
Person: Keywords
William Cohen: Logic programming, Text categorization, Data integration, Rule learning
Daphne Koller: Bayesian networks, Relational models, Probabilistic models, Hidden variables
Deborah McGuinness: Semantic web, Description logics, Knowledge representation, Ontologies
Tom Mitchell: Machine learning, Cognitive states, Learning apprentice, Artificial intelligence

Summary of Results
Contact info and name extraction performance (25 fields)
CRF
Token Acc: 94.50
Field Prec: 85.73
Field Recall: 76.33
Field F1: 80.76

1. Expert Finding:
When solving some task, find friends-of-friends with relevant expertise.
Avoid “stove-piping” in large organizations by automatically suggesting collaborators.
Given a task, automatically suggest the right team for the job. (Hiring aid!)
2. Social Network Analysis:
Understand the social structure of your organization.
Suggest structural changes for improved efficiency.

Social Network in an Email Dataset
From: kate@cs.umass.edu
Subject: NIPS and ....
Date: June 14, 2004 2:27:41 PM EDT
To: mccallum@cs.umass.edu
There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for:
NIPS registration receipt.
CALO registration receipt.
Thanks,
Kate

A Probabilistic Approach
• Define a probabilistic generative model for documents.
• Learn the parameters of this model by fitting them to the data and a prior.

Clustering words into topics with Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]
Generative Process (with example):
For each document:
  Sample a distribution over topics, θ
    (example: 70% Iraq war, 30% US election)
  For each word in doc:
    Sample a topic, z (example: Iraq war)
    Sample a word from the topic, w (example: “bombing”)

Example topics induced from a large collection of text [Tenenbaum et al]
(one topic per line)
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER

Structured Topic Models
Topic models that also include some additional structure, relations, modalities:
• Social network relations
• Behavior
• Time
• Correlations among topics
• Hierarchical dependencies
• Markov dependencies
Advantage of graphical models: Can integrate new (modalities of) evidence!

Outline
• Social Network Analysis
  – Roles (Author-Recipient-Topic Model)
  – Groups (Group-Topic Model)
  – Rexa, a research paper digital library
• Brief note: Probabilistic Databases

From LDA to Author-Recipient-Topic (ART)

Inference and Estimation
Gibbs Sampling:
- Easy to implement
- Reasonably fast

Enron Email Corpus
• 250k email messages
• 23k people
Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: debra.perlingiere@enron.com
To: steve.hooser@enron.com
Subject: Enron/TransAltaContract dated Jan 1, 2001
Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.
DP
Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas 77002
dperlin@enron.com
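
The Gibbs sampling step named on the Inference and Estimation slide can be illustrated with a minimal sketch for plain LDA. This is not the ART sampler (which would additionally sample a recipient for each word); it is the standard collapsed Gibbs update for LDA, under assumed symmetric hyperparameters. The function name `lda_collapsed_gibbs`, the toy documents, and the values of `alpha` and `beta` are all hypothetical choices for illustration.

```python
import random


def lda_collapsed_gibbs(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for plain LDA (illustrative sketch).

    docs: list of documents, each a list of integer word ids.
    alpha, beta: assumed symmetric Dirichlet hyperparameters.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]           # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n(k, w)
    topic_total = [0] * num_topics                         # n(k)
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Unnormalized conditional p(z = t | all other assignments).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the new assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return assignments, doc_topic, topic_word


# Toy corpus: word ids 0..4, two loosely separated vocabularies.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 2], [0, 1, 1, 0], [4, 3, 3, 4]]
z, doc_topic, topic_word = lda_collapsed_gibbs(docs, num_topics=2, vocab_size=5)
print(doc_topic)
```

Each sweep touches one token at a time and only needs three count tables, which is why the slide can describe the method as easy to implement.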
Topics, and prominent senders / receivers discovered by ART
(Topic names assigned by hand)
Beck = “Chief Operations Officer”
Dasovich = “Government Relations Executive”
Shapiro = “Vice President of Regulatory Affairs”
Steffes = “Vice President of Government Affairs”

Comparing Role Discovery
connection strength (A,B) =
Traditional SNA: distribution over recipients
ART: distribution over authored topics
Author-Topic: distribution over authored topics

Comparing Role Discovery
Tracy Geaconne vs. Dan McCarty
Traditional SNA: Similar roles
ART: Different roles
Author-Topic: Different roles
Geaconne = “Secretary”
McCarty = “Vice President”

Comparing Role Discovery
Lynn Blair vs. Kimberly Watson
Traditional SNA: Different roles
ART: Very similar
Author-Topic: Very different
Blair = “Gas pipeline logistics”
Watson = “Pipeline facilities planning”

McCallum Email Corpus 2004
• January - October 2004
• 23k email messages
• 825 people
From: kate@cs.umass.edu
Subject: NIPS and ....
Date: June 14, 2004 2:27:41 PM EDT
To: mccallum@cs.umass.edu
There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for:
NIPS registration receipt.
CALO registration receipt.
Thanks,
Kate

McCallum Email Blockstructure

Most prominent topics in discussions with ____?
Padhraic Smyth, Prof., UC Irvine, CS

Two most prominent topics in discussions with ____?
Topic 1
Words: love house time great hope dinner saturday left ll visit evening stay bring weekend road sunday kids flight
Prob: 0.030514 0.015402 0.013659 0.012351 0.011334 0.011043 0.00959 0.009154 0.009154 0.009009 0.008282 0.008137 0.008137 0.007847 0.007701 0.007411 0.00712 0.006829 0.006539 0.006539
Topic 2
Words: today tomorrow time ll meeting week talk meet morning monday back call free home won day hope leave office tuesday
Prob: 0.051152 0.045393 0.041289 0.039145 0.033877 0.025484 0.024626 0.023279 0.022789 0.020767 0.019358 0.016418 0.015621 0.013967 0.013783 0.01311 0.012987 0.012987 0.012742 0.012558

ART: Roles but not Groups
Enron TransWestern Division
Traditional SNA: Block structured
ART: Not
Author-Topic: Not

Outline
• Social Network Analysis
  – Roles (Author-Recipient-Topic Model)
  – Groups (Group-Topic Model)
  – Rexa, a research paper digital library
• Brief note: Probabilistic Databases

Groups and Topics
• Input:
  – Observed relations between people
  – Attributes on those relations (text, or categorical)
• Output:
  – Attributes clustered into “topics”
  – Groups of people, varying depending on topic

Adjacency Matrix Representing Relations
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration:
Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D)
Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F)
Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C)
Group assignments: A: G1, B: G2, C: G1, D: G2, E: G3, F: G3
[Figure: adjacency matrix over A to F, shown in roster order and reordered by group as A C B D E F]

Group Model: Partitioning Entities into Groups
Stochastic Blockstructures for Relations [Nowicki, Snijders 2001]
[Plate diagram labels: Dirichlet, Multinomial, g, Beta, Binomial, v; plates S, G^2, S^2]
S: number of entities
G: number of groups
Enhanced with arbitrary number of groups in [Kemp, Griffiths, Tenenbaum 2004]

Two Relations with Different Attributes
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration:
Acad(A, B) Acad(C, B) Acad(A, D) Acad(C, D)
Acad(B, E) Acad(D, E) Acad(B, F) Acad(D, F)
Acad(E, A) Acad(F, A) Acad(E, C) Acad(F, C)
Groups: A, C in G1; B, D in G2; E, F in G3
Social Admiration:
Soci(A, B) Soci(A, D) Soci(A, F) Soci(B, A) Soci(B, C) Soci(B, E)
Soci(C, B) Soci(C, D) Soci(C, F) Soci(D, A) Soci(D, C) Soci(D, E)
Soci(E, B) Soci(E, D) Soci(E, F) Soci(F, A) Soci(F, C) Soci(F, E)
Groups: A, C, E in G1; B, D, F in G2

The Group-Topic Model: Discovering Groups and Topics Simultaneously
[Plate diagram labels: Uniform, t, Dirichlet, Beta, Multinomial, g, w, Binomial, v; plates G^2, S, T, Nb, S^2, B]

Dataset #1: U.S. Senate
• 16 years of voting records in the US Senate (1989 – 2005)
• a Senator may respond Yea or Nay to a resolution
• 3423 resolutions with text attributes (index terms)
• 191 Senators in total across 16 years
S.543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991) Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No: 102-242.
Index terms: Banks and banking, Accounting, Administrative fees, Cost control, Credit, Deposit insurance, Depressed areas, and other 110 terms
Adams (D-WA), Nay
Akaka (D-HI), Yea
Bentsen (D-TX), Yea
Biden (D-DE), Yea
Bond (R-MO), Yea
Bradley (D-NJ), Nay
Conrad (D-ND), Nay
……

Topics Discovered (U.S. Senate)
Mixture of Unigrams
Education: education school aid children drug students elementary prevention
Energy: energy power water nuclear gas petrol research pollution
Military Misc.: government military foreign tax congress aid law policy
Economic: federal labor insurance aid tax business employee care
Group-Topic Model
Education + Domestic: education school federal aid government tax energy research
Foreign: foreign trade chemicals tariff congress drugs communicable diseases
Economic: labor insurance tax congress income minimum wage business
Social Security + Medicare: social security insurance medical care medicare disability assistance

Groups Discovered (US Senate)
Groups from topic Education + Domestic

Senators Who Change Coalition the most Dependent on Topic
e.g. Senator Shelby (D-AL) votes
  with the Republicans on Economic
  with the Democrats on Education + Domestic
  with a small group of maverick Republicans on Social Security + Medicare

Outline
• Social Network Analysis
  – Roles (Author-Recipient-Topic Model)
  – Groups (Group-Topic Model)
  – Rexa, a research paper digital library
• Brief note: Probabilistic Databases

Social Networks in Research Literature
• Better understand structure of our own research area.
• Structure helps us learn a new field.
• Aid collaboration
• Map how ideas travel through social networks of researchers.
• Aids for hiring and finding reviewers.
• Measure impact of papers or people.

Previous Systems
[Diagram labels: Research Paper, Cites]

More Entities and Relations
[Diagram labels: Expertise, Cites, Research Paper, Grant, Venue, Person, University, Groups]

Our Data
• Over 7 million research papers, mostly in computer science, spidered from the web, processed with information extraction, de-duplicated, and available at the Rexa.info portal.

Topical Transfer
Citation counts from one topic to another.
Map “producers and consumers”

Topical Bibliometric Impact Measures [Mann, Mimno, McCallum, 2006]
• Topical Citation Counts
• Topical Impact Factors
• Topical Longevity
• Topical Precedence
• Topical Diversity
• Topical Transfer

Topical Transfer
Transfer from Digital Libraries to other topics
Other topic / Cit’s / Paper Title
Web Pages / 31 / Trawling the Web for Emerging CyberCommunities, Kumar, Raghavan,... 1999.
Computer Vision / 14 / On being ‘Undigital’ with digital cameras: extending the dynamic...
Video / 12 / Lessons learned from the creation and deployment of a terabyte digital video libr..
Graphs / 12 / Trawling the Web for Emerging CyberCommunities
Web Pages / 11 / WebBase: a repository of Web pages

Topical Diversity
Papers that had the most influence across many other fields...

Topical Diversity
Entropy of the topic distribution among papers that cite this paper (this topic).
[Figure panels: High Diversity, Low Diversity]

Outline
• Social Network Analysis
  – Roles (Author-Recipient-Topic Model)
  – Groups (Group-Topic Model)
  – Rexa, a research paper digital library
• Brief note: Probabilistic Databases

Probabilistic Databases
Previous work from perspective of ML & NLP
1. Scalable, but limited dependencies
  – Widom “Trio”: each field contains distribution over values. No dependencies.
  – Sarawagi: each record represented by a mixture. Limited dependencies.
2. Arbitrary dependencies, but not scalable
  – Hellerstein “BayesStore”: Represent arbitrary Bayesian network, and pass to BayesNetToolbox. Completely unscalable.
  – Suciu “Mystiq”: Re-generate sampled possible worlds with MCMC. Also lacking scalability.

Our approach: FACTORIE [McCallum et al 2009], [Wick, McCallum, Miklau 2010]
• DB represents single possible world (as usual). Scalable.
• Arbitrary dependencies represented by factor graph outside the DB
• Represent uncertainty by MCMC
  – Local proposed changes efficiently scored
• Given query:
  – MCMC to sample possible worlds
  – Run SQL query on possible worlds, more efficiently thanks to “view maintenance”

Outline
• Social Network Analysis
  – Roles (Author-Recipient-Topic Model)
  – Groups (Group-Topic Model)
  – Rexa, a research paper digital library
• Brief note: Probabilistic Databases
  – Arbitrary dependencies with factor graphs
  – Scalability with MCMC and view maintenance
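
The query-answering loop on the probabilistic databases slide (one possible world in the DB, a factor graph outside it, MCMC proposals scored locally, and a query kept up to date incrementally) can be sketched in a few lines. This is a toy illustration under stated assumptions, not FACTORIE's API: the names `local_score` and `estimate_count_query`, the dictionary encoding of factors, and the example records are all hypothetical, and a simple counting query stands in for SQL.

```python
import math
import random
from collections import Counter


def local_score(world, var, value, unary, pairwise):
    """Sum of the log-factors that touch `var` if it took `value`.

    Only factors adjacent to the changed variable are evaluated, which is
    what makes a local proposal cheap to score.
    """
    score = unary[var][value]
    for (a, b), weight in pairwise.items():
        # Assumed pairwise factor: reward `weight` when both fields agree.
        if a == var and world[b] == value:
            score += weight
        elif b == var and world[a] == value:
            score += weight
    return score


def estimate_count_query(unary, pairwise, target,
                         steps=20000, burn_in=2000, seed=0):
    """Estimate the distribution of COUNT(field == target) over worlds."""
    rng = random.Random(seed)
    variables = sorted(unary)
    # The "database" holds a single possible world at any time.
    world = {v: next(iter(unary[v])) for v in variables}
    count = sum(1 for v in variables if world[v] == target)
    answers = Counter()

    for step in range(steps):
        var = rng.choice(variables)
        new = rng.choice(list(unary[var]))
        old = world[var]
        delta = (local_score(world, var, new, unary, pairwise)
                 - local_score(world, var, old, unary, pairwise))
        # Metropolis acceptance for a symmetric proposal.
        if delta >= 0 or rng.random() < math.exp(delta):
            world[var] = new
            # Incremental update of the query answer, in the spirit of
            # view maintenance: no re-scan of the whole world.
            count += (new == target) - (old == target)
        if step >= burn_in:
            answers[count] += 1

    total = sum(answers.values())
    return {answer: n / total for answer, n in answers.items()}


# Two records with an uncertain city field; log-scores are made up.
unary = {
    "rec1": {"Amherst": 2.0, "Boston": 0.0},
    "rec2": {"Amherst": 0.0, "Boston": 0.0},
}
pairwise = {("rec1", "rec2"): 1.5}
print(estimate_count_query(unary, pairwise, target="Amherst"))
```

The design point the sketch tries to show is that neither the scoring nor the query evaluation ever iterates over the full world inside the sampling loop; both depend only on the variable that changed.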