Discovering Patterns in Text and Relational Data with Bayesian Latent-Variable Models

Andrew McCallum
Computer Science Department
University of Massachusetts Amherst
Joint work with Xuerui Wang, David Mimno, Andres Corrada,
Natasha Mohanty, Gideon Mann, Hanna Wallach.
Goal
Building models that
mine actionable knowledge
from unstructured text.
Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1
A Portal for Job Openings
Job Openings: Category = High Tech, Keyword = Java, Location = U.S.
Data Mining the Extracted Job Information
IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k+ documents
several millennia old
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries
IE from Research Papers
[McCallum et al ‘97]
Mining Research Papers
[Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004]
[Giles et al]
Address Book Management and Expert Finding
Workplace effectiveness ~ Ability to leverage network of acquaintances
But filling the Contacts DB by hand is tedious, and the result is incomplete.
[Diagram: Email Inbox and WWW, automatically filling the Contacts DB]
System Overview
[System diagram components: Email; WWW; CRF; Person Name Extraction; Name Coreference; names; Homepage Retrieval; Contact Info and Person Name Extraction; Keyword Extraction; Social Network Analysis]
An Example
To: “Andrew McCallum” mccallum@cs.umass.edu
Subject ...
Search for new people
First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
JobTitle: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governor’s Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413) 545-1323
Links: Fernando Pereira, Sam Roweis, …
Key Words: Information extraction, social network,…
Example keywords extracted
Person: Keywords
William Cohen: Logic programming, Text categorization, Data integration, Rule learning
Daphne Koller: Bayesian networks, Relational models, Probabilistic models, Hidden variables
Deborah McGuinness: Semantic web, Description logics, Knowledge representation, Ontologies
Tom Mitchell: Machine learning, Cognitive states, Learning apprentice, Artificial intelligence
Summary of Results
Contact info and name extraction performance (25 fields)
       Token Acc   Field Prec   Field Recall   Field F1
CRF    94.50       85.73        76.33          80.76
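The Field F1 value is consistent with the harmonic mean of the reported field precision and recall. A minimal check, assuming the standard F1 definition; `f1_score` is a hypothetical helper name, and the two inputs are the table values:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Field precision and field recall from the results table, in percent.
field_f1 = f1_score(85.73, 76.33)
print(round(field_f1, 2))
```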
1. Expert Finding:
When solving some task, find friends-of-friends with relevant expertise.
Avoid “stove-piping” in large org’s by automatically suggesting collaborators.
Given a task, automatically suggest the right team for the job. (Hiring aid!)
2. Social Network Analysis:
Understand the social structure of your organization.
Suggest structural changes for improved efficiency.
Social Network in an Email Dataset
From: kate@cs.umass.edu
Subject: NIPS and ....
Date: June 14, 2004 2:27:41 PM EDT
To: mccallum@cs.umass.edu
There is pertinent stuff on the first yellow folder that is
completed either travel or other things, so please sign that
first folder anyway. Then, here is the reminder of the things
I'm still waiting for:
NIPS registration receipt.
CALO registration receipt.
Thanks,
Kate
A Probabilistic Approach
•  Define a probabilistic generative
model for documents.
•  Learn the parameters of this
model by fitting them to the data
and a prior.
Clustering words into topics with
Latent Dirichlet Allocation
[Blei, Ng, Jordan 2003]
Generative Process (with Example):
For each document:
    Sample a distribution over topics, θ (Example: 70% Iraq war, 30% US election)
    For each word in doc:
        Sample a topic, z (Example: Iraq war)
        Sample a word from the topic, w (Example: “bombing”)
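The generative process above can be sketched in a few lines. This is an illustration under stated assumptions, not the talk's implementation: the four-word vocabulary, the two topic-word distributions, and the symmetric Dirichlet parameter are hypothetical toy values, and the helper names are made up for this example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw a probability vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs)[0]


def generate_document(topic_word, vocab, alpha, n_words, rng):
    theta = sample_dirichlet(alpha, rng)             # per-document distribution over topics
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)           # topic for this word position
        w = sample_categorical(topic_word[z], rng)   # word index drawn from topic z
        topics.append(z)
        words.append(vocab[w])
    return theta, topics, words


# Hypothetical toy setup: topic 0 stands in for "Iraq war", topic 1 for "US election".
vocab = ["bombing", "troops", "ballot", "campaign"]
topic_word = [
    [0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
rng = random.Random(0)
theta, topics, words = generate_document(topic_word, vocab, [1.0, 1.0], 8, rng)
print(theta)
print(list(zip(topics, words)))
```

Learning runs this process in reverse: given only the words, infer the topic assignments and the distributions that plausibly produced them.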
Example topics
induced from a large collection of text
One topic per line:
JOB, WORK, JOBS, CAREER, EXPERIENCE, EMPLOYMENT, OPPORTUNITIES, WORKING, TRAINING, SKILLS, CAREERS, POSITIONS, FIND, POSITION, FIELD, OCCUPATIONS, REQUIRE, OPPORTUNITY, EARN, ABLE
SCIENCE, STUDY, SCIENTISTS, SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, CHEMISTRY, TECHNOLOGY, MANY, MATHEMATICS, BIOLOGY, FIELD, PHYSICS, LABORATORY, STUDIES, WORLD, SCIENTIST, STUDYING, SCIENCES
BALL, GAME, TEAM, FOOTBALL, BASEBALL, PLAYERS, PLAY, FIELD, PLAYER, BASKETBALL, COACH, PLAYED, PLAYING, HIT, TENNIS, TEAMS, GAMES, SPORTS, BAT, TERRY
FIELD, MAGNETIC, MAGNET, WIRE, NEEDLE, CURRENT, COIL, POLES, IRON, COMPASS, LINES, CORE, ELECTRIC, DIRECTION, FORCE, MAGNETS, BE, MAGNETISM, POLE, INDUCED
STORY, STORIES, TELL, CHARACTER, CHARACTERS, AUTHOR, READ, TOLD, SETTING, TALES, PLOT, TELLING, SHORT, FICTION, ACTION, TRUE, EVENTS, TELLS, TALE, NOVEL
MIND, WORLD, DREAM, DREAMS, THOUGHT, IMAGINATION, MOMENT, THOUGHTS, OWN, REAL, LIFE, IMAGINE, SENSE, CONSCIOUSNESS, STRANGE, FEELING, WHOLE, BEING, MIGHT, HOPE
DISEASE, BACTERIA, DISEASES, GERMS, FEVER, CAUSE, CAUSED, SPREAD, VIRUSES, INFECTION, VIRUS, MICROORGANISMS, PERSON, INFECTIOUS, COMMON, CAUSING, SMALLPOX, BODY, INFECTIONS, CERTAIN
WATER, FISH, SEA, SWIM, SWIMMING, POOL, LIKE, SHELL, SHARK, TANK, SHELLS, SHARKS, DIVING, DOLPHINS, SWAM, LONG, SEAL, DIVE, DOLPHIN, UNDERWATER
[Tenenbaum et al]
Structured Topic Models
Topic models that also include some
additional structure, relations, modalities:
•  Social network relations
•  Behavior
•  Time
•  Correlations among topics
•  Hierarchical dependencies
•  Markov dependencies
Advantage of graphical models: Can integrate new (modalities of) evidence!
Outline
•  Social Network Analysis
–  Roles (Author-Recipient-Topic Model)
–  Groups (Group-Topic Model)
–  Rexa, a research paper digital library
•  Brief note: Probabilistic Databases
From LDA to Author-Recipient-Topic (ART)
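The step from LDA to ART can be sketched as a change in what the topic distribution is conditioned on. This sketch assumes the usual description of ART: each word in a message picks one of the message's recipients uniformly, then draws its topic from a distribution specific to that (author, recipient) pair. The names, the two-topic setup, and all probabilities below are hypothetical.

```python
import random


def generate_message(author, recipients, theta, phi, vocab, n_words, rng):
    """theta[(author, recipient)] is a topic distribution; phi[topic] is a word distribution."""
    tokens = []
    for _ in range(n_words):
        x = rng.choice(recipients)                     # recipient picked uniformly for this word
        pair_topics = theta[(author, x)]               # topics depend on the author-recipient pair
        z = rng.choices(range(len(pair_topics)), weights=pair_topics)[0]
        w = rng.choices(range(len(vocab)), weights=phi[z])[0]
        tokens.append((x, z, vocab[w]))
    return tokens


# Hypothetical toy parameters (not learned from any corpus).
vocab = ["contract", "draft", "receipt", "travel"]
phi = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
theta = {("a1", "r1"): [0.9, 0.1], ("a1", "r2"): [0.2, 0.8]}
tokens = generate_message("a1", ["r1", "r2"], theta, phi, vocab, 6, random.Random(1))
print(tokens)
```

In plain LDA the topic distribution belongs to the document; here it belongs to a pair of people, which is what lets the model describe roles.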
Inference and Estimation
Gibbs Sampling:
-  Easy to implement
-  Reasonably fast
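To illustrate why Gibbs sampling is easy to implement, here is a minimal collapsed Gibbs sampler for plain LDA, not for ART itself. It is a sketch under assumptions: documents are lists of integer word ids, and the function name, hyperparameter defaults, and toy corpus are hypothetical.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=100, seed=0):
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                  # tokens per (document, topic)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]    # tokens per (topic, word)
    topic_total = [0] * n_topics                                # tokens per topic
    z = []
    for d, doc in enumerate(docs):                              # random initial assignments
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            z[d].append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                                     # remove this token's counts
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [                                     # conditional over topics
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k                                     # add counts back under new topic
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return z, doc_topic, topic_word


docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]               # hypothetical toy corpus
z, doc_topic, topic_word = lda_gibbs(docs, n_topics=2, vocab_size=4)
print(doc_topic)
```

Each update only touches a handful of counters, which is the main reason the sampler stays short and reasonably fast.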
Enron Email Corpus
•  250k email messages
•  23k people
Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: debra.perlingiere@enron.com
To: steve.hooser@enron.com
Subject: Enron/TransAltaContract dated Jan 1, 2001
Please see below. Katalin Kiss of TransAlta has requested an
electronic copy of our final draft? Are you OK with this? If
so, the only version I have is the original draft without
revisions.
DP
Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas 77002
dperlin@enron.com
Topics, and prominent senders / receivers discovered by ART
Topic names, by hand
Topics, and prominent senders / receivers discovered by ART
Beck = “Chief Operations Officer”
Dasovich = “Government Relations Executive”
Shapiro = “Vice President of Regulatory Affairs”
Steffes = “Vice President of Government Affairs”
Comparing Role Discovery
connection strength (A,B) =
Traditional SNA: distribution over recipients
ART: distribution over authored topics
Author-Topic: distribution over authored topics
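Each of the three approaches represents a person by a probability distribution, so comparing two people's roles reduces to comparing two distributions. The slide does not name the comparison measure; the sketch below assumes Jensen-Shannon divergence, one common choice, and uses hypothetical three-topic distributions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and 0 when p equals q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic distributions for three people.
person_a = [0.70, 0.20, 0.10]
person_b = [0.10, 0.20, 0.70]
person_c = [0.65, 0.25, 0.10]

print(js_divergence(person_a, person_b))   # larger: roles look different
print(js_divergence(person_a, person_c))   # smaller: roles look similar
```

A small divergence suggests similar roles; a large one suggests different roles.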
Comparing Role Discovery: Tracy Geaconne vs. Dan McCarty
Traditional SNA: Similar roles
ART: Different roles
Author-Topic: Different roles
Geaconne = “Secretary”
McCarty = “Vice President”
Comparing Role Discovery: Lynn Blair vs. Kimberly Watson
Traditional SNA: Different roles
ART: Very similar
Author-Topic: Very different
Blair = “Gas pipeline logistics”
Watson = “Pipeline facilities planning”
McCallum Email Corpus 2004
•  January - October 2004
•  23k email messages
•  825 people
From: kate@cs.umass.edu
Subject: NIPS and ....
Date: June 14, 2004 2:27:41 PM EDT
To: mccallum@cs.umass.edu
There is pertinent stuff on the first yellow folder that is
completed either travel or other things, so please sign that
first folder anyway. Then, here is the reminder of the things I'm
still waiting for:
NIPS registration receipt.
CALO registration receipt.
Thanks,
Kate
McCallum Email Blockstructure
Most prominent topics in discussions with ____?
Padhraic Smyth, Prof., UC Irvine, CS
Two most prominent topics in discussions with ____?
Topic 1
Words: love, house, time, great, hope, dinner, saturday, left, ll, visit, evening, stay, bring, weekend, road, sunday, kids, flight
Prob: 0.030514, 0.015402, 0.013659, 0.012351, 0.011334, 0.011043, 0.00959, 0.009154, 0.009154, 0.009009, 0.008282, 0.008137, 0.008137, 0.007847, 0.007701, 0.007411, 0.00712, 0.006829, 0.006539, 0.006539
Topic 2
Words: today, tomorrow, time, ll, meeting, week, talk, meet, morning, monday, back, call, free, home, won, day, hope, leave, office, tuesday
Prob: 0.051152, 0.045393, 0.041289, 0.039145, 0.033877, 0.025484, 0.024626, 0.023279, 0.022789, 0.020767, 0.019358, 0.016418, 0.015621, 0.013967, 0.013783, 0.01311, 0.012987, 0.012987, 0.012742, 0.012558
ART: Roles but not Groups
Enron TransWestern Division
Traditional SNA: Block structured
ART: Not
Author-Topic: Not
Groups and Topics
•  Input:
–  Observed relations between people
–  Attributes on those relations (text, or categorical)
•  Output:
–  Attributes clustered into “topics”
–  Groups of people, varying depending on topic
Adjacency Matrix Representing Relations
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration:
Acad(A, B) Acad(C, B)
Acad(A, D) Acad(C, D)
Acad(B, E) Acad(D, E)
Acad(B, F) Acad(D, F)
Acad(E, A) Acad(F, A)
Acad(E, C) Acad(F, C)
Group assignments:
A  B  C  D  E  F
G1 G2 G1 G2 G3 G3
[Adjacency matrix with rows and columns in the order A B C D E F]
Reordered by group:
A  C  B  D  E  F
G1 G1 G2 G2 G3 G3
[Adjacency matrix with rows and columns in the order A C B D E F]
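The adjacency matrix for the Academic Admiration relation can be rebuilt from the listed relations, and ordering rows and columns by group makes the blocks visible. A minimal sketch using the relations and group assignments from the slide; the variable names are hypothetical.

```python
# Acad(X, Y) pairs as listed on the slide.
relations = [
    ("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"),
    ("B", "E"), ("D", "E"), ("B", "F"), ("D", "F"),
    ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C"),
]
groups = {"A": "G1", "B": "G2", "C": "G1", "D": "G2", "E": "G3", "F": "G3"}

# Order students by group, then by name, so members of a group are adjacent.
order = sorted(groups, key=lambda s: (groups[s], s))
index = {s: i for i, s in enumerate(order)}

matrix = [[0] * len(order) for _ in order]
for src, dst in relations:
    matrix[index[src]][index[dst]] = 1

print("      " + " ".join(order))
for name, row in zip(order, matrix):
    print(name, groups[name], " " + " ".join(str(v) for v in row))
```

With this ordering every row in a group has the same pattern: G1 admires G2, G2 admires G3, and G3 admires G1.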
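Group Model: Partitioning Entities into Groups
Stochastic Blockstructures for Relations
[Nowicki, Snijders 2001]
S: number of entities
G: number of groups
[Plate diagram: a group assignment g for each of the S entities, drawn from a Multinomial with a Dirichlet prior; a relation v for each of the S² entity pairs, drawn from a Binomial whose parameter, one for each of the G² group pairs, has a Beta prior]
Enhanced with arbitrary number of groups in [Kemp, Griffiths, Tenenbaum 2004]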
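A generative sketch of a stochastic blockstructure model, read off the distribution names in the diagram (Dirichlet, Multinomial, Beta, Binomial). It assumes binary relations, so the Binomial has a single trial; the function name and hyperparameter values are hypothetical.

```python
import random


def sample_blockmodel(n_entities, n_groups, rng, alpha=1.0, beta_a=0.5, beta_b=0.5):
    # Group proportions from a symmetric Dirichlet (normalized Gamma draws).
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_groups)]
    total = sum(draws)
    group_probs = [d / total for d in draws]
    # One group assignment per entity (S draws).
    g = [rng.choices(range(n_groups), weights=group_probs)[0] for _ in range(n_entities)]
    # One relation probability per ordered pair of groups (G^2 values), Beta prior.
    gamma = [[rng.betavariate(beta_a, beta_b) for _ in range(n_groups)]
             for _ in range(n_groups)]
    # One binary relation per ordered pair of entities (S^2 values).
    v = [[1 if rng.random() < gamma[g[i]][g[j]] else 0 for j in range(n_entities)]
         for i in range(n_entities)]
    return g, gamma, v


g, gamma, v = sample_blockmodel(n_entities=6, n_groups=3, rng=random.Random(0))
print(g)
for row in v:
    print(row)
```

The key property is that whether entity i relates to entity j depends only on the groups of i and j, which is what produces blocks in the adjacency matrix.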
Two Relations with Different Attributes
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration:
Acad(A, B) Acad(C, B)
Acad(A, D) Acad(C, D)
Acad(B, E) Acad(D, E)
Acad(B, F) Acad(D, F)
Acad(E, A) Acad(F, A)
Acad(E, C) Acad(F, C)
Social Admiration:
Soci(A, B) Soci(A, D) Soci(A, F)
Soci(B, A) Soci(B, C) Soci(B, E)
Soci(C, B) Soci(C, D) Soci(C, F)
Soci(D, A) Soci(D, C) Soci(D, E)
Soci(E, B) Soci(E, D) Soci(E, F)
Soci(F, A) Soci(F, C) Soci(F, E)
Groups under Academic Admiration:
A  C  B  D  E  F
G1 G1 G2 G2 G3 G3
Groups under Social Admiration:
A  C  E  B  D  F
G1 G1 G1 G2 G2 G2
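The Group-Topic Model:
Discovering Groups and Topics Simultaneously
[Plate diagram: for each of B events, a topic t drawn from a Uniform distribution and N_b words w drawn from that topic's Multinomial (Dirichlet prior, T topics); for each of S entities and each of T topics, a group assignment g drawn from a Multinomial (Dirichlet prior); for each of the S² entity pairs, a relation v drawn from a Binomial whose parameter, one for each of the G² group pairs, has a Beta prior]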
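A generative sketch of the Group-Topic model as it can be read from the diagram labels: each event picks a topic, the event's words come from that topic, and the relations among entities depend on the groups the entities belong to under that topic. Everything here is a labeled assumption: the function names, hyperparameters, toy vocabulary, and the choice to draw fresh group-pair probabilities per event.

```python
import random


def dirichlet(alpha, size, rng):
    """Symmetric Dirichlet draw via normalized Gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_group_topic(n_entities, n_groups, n_topics, vocab, n_events, n_words, rng):
    phi = [dirichlet(0.5, len(vocab), rng) for _ in range(n_topics)]     # words per topic
    theta = [dirichlet(1.0, n_groups, rng) for _ in range(n_topics)]     # groups per topic
    # g[t][s]: group of entity s under topic t, so groupings can differ by topic.
    g = [[rng.choices(range(n_groups), weights=theta[t])[0] for _ in range(n_entities)]
         for t in range(n_topics)]
    events = []
    for _ in range(n_events):
        t = rng.randrange(n_topics)                                      # topic drawn uniformly
        words = [vocab[rng.choices(range(len(vocab)), weights=phi[t])[0]]
                 for _ in range(n_words)]
        # Relation probability for each ordered pair of groups (assumed per event).
        gamma = [[rng.betavariate(1.0, 1.0) for _ in range(n_groups)]
                 for _ in range(n_groups)]
        v = [[1 if rng.random() < gamma[g[t][i]][g[t][j]] else 0
              for j in range(n_entities)] for i in range(n_entities)]
        events.append({"topic": t, "words": words, "relations": v})
    return g, events


vocab = ["education", "school", "tax", "energy"]                         # hypothetical
g, events = sample_group_topic(4, 2, 2, vocab, n_events=3, n_words=5, rng=random.Random(0))
print(g)
print(events[0]["topic"], events[0]["words"])
```

The difference from the plain blockstructure model is the extra index on g: an entity may sit in one group for one topic and a different group for another.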
Dataset #1: U.S. Senate
•  16 years of voting records in the US Senate (1989 – 2005)
•  a Senator may respond Yea or Nay to a resolution
•  3423 resolutions with text attributes (index terms)
•  191 Senators in total across 16 years
S.543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance
funds, recapitalize the Bank Insurance Fund, improve supervision and regulation
of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991) Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No: 102-242.
Index terms: Banks and banking Accounting Administrative fees Cost control
Credit Deposit insurance Depressed areas and other 110 terms
Adams (D-WA), Nay Akaka (D-HI), Yea Bentsen (D-TX), Yea Biden (D-DE), Yea
Bond (R-MO), Yea Bradley (D-NJ), Nay Conrad (D-ND), Nay ……
Topics Discovered (U.S. Senate)
Mixture of Unigrams
Education: education, school, aid, children, drug, students, elementary, prevention
Energy: energy, power, water, nuclear, gas, petrol, research, pollution
Military Misc.: government, military, foreign, tax, congress, aid, law, policy
Economic: federal, labor, insurance, aid, tax, business, employee, care
Group-Topic Model
Education + Domestic: education, school, federal, aid, government, tax, energy, research
Foreign: foreign, trade, chemicals, tariff, congress, drugs, communicable, diseases
Economic: labor, insurance, tax, congress, income, minimum, wage, business
Social Security + Medicare: social, security, insurance, medical, care, medicare, disability, assistance
Groups Discovered (US Senate)
Groups from topic Education + Domestic
Senators Who Change Coalition the Most, Dependent on Topic
e.g. Senator Shelby (D-AL) votes
with the Republicans on Economic
with the Democrats on Education + Domestic
with a small group of maverick Republicans on Social Security + Medicare
Social Networks in Research Literature
•  Better understand structure of our own
research area.
•  Structure helps us learn a new field.
•  Aid collaboration
•  Map how ideas travel through social networks
of researchers.
•  Aids for hiring and finding reviewers.
•  Measure impact of papers or people.
Previous Systems
[Diagram labels: Research Paper; Cites]
More Entities and Relations
[Diagram labels: Research Paper; Person; University; Venue; Grant; Groups; Cites; Expertise]
Our Data
•  Over 7 million research papers,
mostly in computer science, spidered from the web,
processed with information extraction, de-duplicated,
and available at the Rexa.info portal.
Topical Transfer
Citation counts from one topic to another.
Map “producers and consumers”
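Counting citations from one topic to another can be sketched as a tally over citation edges. This is an illustration, not the Rexa implementation: the paper ids, topic labels, and citation pairs are hypothetical, and the direction convention (from the cited paper's topic to the citing paper's topic) is an assumption.

```python
from collections import Counter


def topical_transfer(paper_topic, citations):
    """Count citations per (topic of cited paper, topic of citing paper)."""
    counts = Counter()
    for citing, cited in citations:
        counts[(paper_topic[cited], paper_topic[citing])] += 1
    return counts


# Hypothetical toy data: one topic label per paper, citations as (citing, cited).
paper_topic = {
    "p1": "Digital Libraries",
    "p2": "Web Pages",
    "p3": "Web Pages",
    "p4": "Computer Vision",
}
citations = [("p2", "p1"), ("p3", "p1"), ("p4", "p1"), ("p3", "p2")]

transfer = topical_transfer(paper_topic, citations)
for (producer, consumer), n in sorted(transfer.items()):
    print(producer, "->", consumer, n)
```

Off-diagonal entries, where the two topics differ, are the ones that show which topics produce ideas that other topics consume.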
Topical Bibliometric Impact Measures
[Mann, Mimno, McCallum, 2006]
•  Topical Citation Counts
•  Topical Impact Factors
•  Topical Longevity
•  Topical Precedence
•  Topical Diversity
•  Topical Transfer
Topical Transfer
Transfer from Digital Libraries to other topics
Other topic        Cit’s   Paper Title
Web Pages          31      Trawling the Web for Emerging CyberCommunities, Kumar, Raghavan,... 1999.
Computer Vision    14      On being ‘Undigital’ with digital cameras: extending the dynamic...
Video              12      Lessons learned from the creation and deployment of a terabyte digital video libr..
Graphs             12      Trawling the Web for Emerging CyberCommunities
Web Pages          11      WebBase: a repository of Web pages
Topical Diversity
Papers that had the most influence across many other fields...
Topical Diversity
Entropy of the topic distribution among
papers that cite this paper (this topic).
[Figure panels: High Diversity; Low Diversity]
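The diversity measure defined above is the entropy of the topic distribution among citing papers. A minimal sketch, assuming each citing paper carries a single topic label and that entropy is measured in bits; the function name and sample labels are hypothetical.

```python
import math
from collections import Counter


def topical_diversity(citing_topics):
    """Entropy (in bits) of the empirical topic distribution of the citing papers."""
    counts = Counter(citing_topics)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical examples: one topic label per citing paper.
narrow = ["Web Pages", "Web Pages", "Web Pages", "Web Pages"]
broad = ["Web Pages", "Computer Vision", "Video", "Graphs"]

print(topical_diversity(narrow))
print(topical_diversity(broad))
```

A paper cited from a single topic scores zero, and the score grows as citations spread evenly over more topics.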
Probabilistic Databases
Previous work from perspective of ML & NLP
1.  Scalable, but limited dependencies
–  Widom “Trio”: each field contains distribution over values.
No dependencies.
–  Sarawagi: each record represented by a mixture.
Limited dependencies.
2.  Arbitrary dependencies, but not scalable
–  Hellerstein “BayesStore”: Represent arbitrary Bayesian
network, and pass to BayesNetToolbox.
Completely unscalable.
–  Suciu “Mystiq”: Re-generate sampled possible worlds
with MCMC.
Also lacking scalability.
Our approach: FACTORIE
[McCallum et al 2009], [Wick, McCallum, Miklau 2010]
•  DB represents single possible world
(as usual). Scalable.
•  Arbitrary dependencies represented by factor
graph outside the DB
•  Represent uncertainty by MCMC
–  Local proposed changes efficiently scored
•  Given query:
–  MCMC to sample possible worlds
–  Run SQL query on possible worlds,
more efficiently thanks to “view maintenance”
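The query procedure above can be sketched with a toy Metropolis-Hastings sampler. This is not FACTORIE's API: the single mutable `world` dict stands in for the database, `log_score` stands in for a factor graph with hypothetical weights, and `query` stands in for a SQL query. For simplicity the sketch rescores the whole world and reruns the query on every sample, which is the work that local scoring and view maintenance would avoid.

```python
import math
import random
from collections import Counter

LABELS = ["PER", "ORG"]


def log_score(world):
    """Unnormalized log-score of a world under hypothetical factors."""
    unary = {("m1", "PER"): 1.0, ("m2", "PER"): 0.2, ("m3", "ORG"): 0.8}
    s = sum(unary.get((m, label), 0.0) for m, label in world.items())
    if world["m1"] == world["m2"]:          # pairwise factor: m1 and m2 tend to agree
        s += 1.5
    return s


def query(world):
    """Stand-in for: SELECT COUNT(*) FROM mentions WHERE label = 'PER'."""
    return sum(1 for label in world.values() if label == "PER")


def sample_query_answers(world, n_samples, rng):
    answers = Counter()
    current = log_score(world)
    for _ in range(n_samples):
        mention = rng.choice(sorted(world))
        old_label = world[mention]
        # Local proposal: change the label of one mention in the single stored world.
        world[mention] = rng.choice([l for l in LABELS if l != old_label])
        proposed = log_score(world)
        if rng.random() < math.exp(min(0.0, proposed - current)):
            current = proposed               # accept the change
        else:
            world[mention] = old_label       # reject: restore the previous world
        answers[query(world)] += 1           # evaluate the query on this possible world
    return {answer: n / n_samples for answer, n in answers.items()}


world = {"m1": "PER", "m2": "ORG", "m3": "ORG"}
answer_probs = sample_query_answers(world, 5000, random.Random(0))
print(sorted(answer_probs.items()))
```

The result is an estimated distribution over query answers, obtained while storing only one possible world at a time.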
Outline
•  Social Network Analysis
–  Roles (Author-Recipient-Topic Model)
–  Groups (Group-Topic Model)
–  Rexa, a research paper digital library
•  Brief note: Probabilistic Databases
–  Arbitrary dependencies with factor graphs
–  Scalability with MCMC and view maintenance