Information Extraction,
Data Mining
and Joint Inference
Andrew McCallum
Computer Science Department
University of Massachusetts Amherst
Joint work with Charles Sutton, Aron Culotta, Xuerui Wang,
Ben Wellner, David Mimno, Gideon Mann.
Goal:
Mine actionable knowledge
from unstructured text.
Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.htm
OtherCompanyJobs: foodscience.com-Job1
A Portal for Job Openings
Category = High Tech
Keyword = Java
Location = U.S.
Job Openings:
Data Mining the Extracted Job Information
IE from Research Papers
[McCallum et al ‘99]
Mining Research Papers
[Rosen-Zvi, Griffiths, Steyvers,
Smyth, 2004]
IE from
Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k+ documents
several millennia old
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries
What is “Information Extraction”?
As a family of techniques:
Information Extraction =
segmentation + classification + association + clustering
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying…
Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
From Text to Actionable Knowledge

[Diagram: Document collection → Spider / Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support).]
Solution: Uncertainty Info

[Diagram: the same pipeline (Document collection → Spider / Filter → IE: Segment, Classify, Associate, Cluster → Database → Data Mining: discover patterns over entity types, links / relations, events → Actionable knowledge: Prediction, Outlier detection, Decision support), now with uncertainty information carried forward from IE and Emerging Patterns fed back from Data Mining to IE.]
Solution: Unified Model

[Diagram: Document collection → Spider / Filter → a single Probabilistic Model that performs both IE (Segment, Classify, Associate, Cluster) and Data Mining (discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support).]

Discriminatively-trained undirected graphical models:
Conditional Random Fields [Lafferty, McCallum, Pereira]
Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]

Complex Inference and Learning
Just what we researchers like to sink our teeth into!
Scientific Questions
• What model structures will capture salient dependencies?
• Will joint inference actually improve accuracy?
• How to do inference in these large graphical models?
• How to do parameter estimation efficiently in these models,
which are built from multiple large components?
• How to do structure discovery in these models?
Outline
• Examples of IE and Data Mining ✓
• Motivate Joint Inference ✓
• Brief introduction to Conditional Random Fields
• Joint inference: Examples
  – Joint Labeling of Cascaded Sequences (Loopy Belief Propagation)
  – Joint Co-reference Resolution (Graph Partitioning)
  – Joint Co-reference with Weighted 1st-order Logic (MCMC)
  – Joint Relation Extraction and Data Mining (Bootstrapping)
• Ultimate application area: Rexa, a Web portal for researchers
(Linear Chain) Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]
Undirected graphical model, trained to maximize
conditional probability of output (sequence) given input (sequence)
[Figure: a linear-chain CRF shown as a finite-state model and as a graphical model. Output sequence (FSM states): y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3} labeled OTHER, PERSON, OTHER, ORG, TITLE, …; input sequence (observations): x_{t-1}, …, x_{t+3} = "said Jones a Microsoft VP …".]

p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_{\mathbf{x}}} \prod_t \Phi(y_t, y_{t-1}, \mathbf{x}, t)

where

\Phi(y_t, y_{t-1}, \mathbf{x}, t) = \exp\Big( \sum_k \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}, t) \Big)
Wide-spread interest, positive experimental results in many applications:
• Noun phrase, Named entity [HLT’03], [CoNLL’03]
• Protein structure prediction [ICML’04]
• IE from Bioinformatics text [Bioinformatics ‘04], …
• Asian word segmentation [COLING’04], [ACL’04]
• IE from Research papers [HLT’04]
• Object classification in images [CVPR ‘04]
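A minimal sketch (not from the talk) of how the linear-chain CRF distribution above can be computed: the tag set, feature functions f_k, and weights λ_k are illustrative assumptions, and the partition function Z_x is computed with the forward algorithm.

```python
# Sketch: scoring a labeling under a linear-chain CRF, p(y|x) = (1/Z_x) prod_t Phi(y_t, y_{t-1}, x, t).
import numpy as np

TAGS = ["OTHER", "PERSON", "ORG", "TITLE"]

def features(y_t, y_prev, x, t):
    """f_k(y_t, y_{t-1}, x, t): a few toy binary feature functions."""
    word = x[t]
    return np.array([
        1.0 if word[0].isupper() and y_t == "PERSON" else 0.0,
        1.0 if word == "Microsoft" and y_t == "ORG" else 0.0,
        1.0 if y_prev == "ORG" and y_t == "TITLE" else 0.0,
        1.0 if word.islower() and y_t == "OTHER" else 0.0,
    ])

def log_phi(y_t, y_prev, x, t, lam):
    """log Phi = sum_k lambda_k f_k(y_t, y_{t-1}, x, t)."""
    return lam @ features(y_t, y_prev, x, t)

def log_Z(x, lam):
    """Forward algorithm: log of the partition function Z_x."""
    alpha = {y: log_phi(y, "START", x, 0, lam) for y in TAGS}
    for t in range(1, len(x)):
        alpha = {
            y: np.logaddexp.reduce(
                [alpha[yp] + log_phi(y, yp, x, t, lam) for yp in TAGS])
            for y in TAGS
        }
    return np.logaddexp.reduce(list(alpha.values()))

def log_p(y, x, lam):
    """log p(y | x) = sum_t log Phi(y_t, y_{t-1}, x, t) - log Z_x."""
    score = sum(log_phi(y[t], y[t - 1] if t else "START", x, t, lam)
                for t in range(len(x)))
    return score - log_Z(x, lam)

x = ["said", "Jones", "a", "Microsoft", "VP"]
y = ["OTHER", "PERSON", "OTHER", "ORG", "TITLE"]
lam = np.array([1.5, 2.0, 1.0, 0.5])   # illustrative weights
print(np.exp(log_p(y, x, lam)))        # conditional probability of this labeling
```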
Outline
• Examples of IE and Data Mining ✓
• Motivate Joint Inference ✓
• Brief introduction to Conditional Random Fields ✓
• Joint inference: Examples
  – Joint Labeling of Cascaded Sequences (Loopy Belief Propagation)
  – Joint Co-reference Resolution (Graph Partitioning)
  – Joint Co-reference with Weighted 1st-order Logic (MCMC)
  – Joint Relation Extraction and Data Mining (Bootstrapping)
• Ultimate application area: Rexa, a Web portal for researchers
Jointly labeling cascaded sequences
Factorial CRFs
[Sutton, Rohanimanesh, McCallum, ICML 2004]
Layers (top to bottom): Named-entity tag, Noun-phrase boundaries, Part-of-speech, English words
With separate cascaded models, errors cascade: each stage must be perfect to do well.
Joint prediction of part-of-speech and noun-phrase boundaries in newswire matches the cascaded models' accuracy with only 50% of the training data.
Inference: Loopy Belief Propagation
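A minimal sketch of the factorial-CRF factor structure described above: two label chains per sentence, tied by within-chain transition factors and a cotemporal cross-chain factor. The weights are assumed, and brute-force normalization stands in for the loopy belief propagation used in the actual work.

```python
# Sketch: joint (POS, NP-chunk) labeling with a two-layer factorial CRF structure.
import itertools, math

POS = ["DT", "NN", "VB"]
NP  = ["B", "I", "O"]

WEIGHTS = {                       # assumed log-potentials for a few factor values
    ("trans_pos", ("DT", "NN")): 1.0,
    ("trans_np",  ("B", "I")):   1.0,
    ("cotemp",    ("DT", "B")):  0.5,
    ("cotemp",    ("NN", "I")):  0.5,
}

def log_factor(kind, a, b):
    return WEIGHTS.get((kind, (a, b)), 0.0)

def joint_score(pos_seq, np_seq):
    """Unnormalized log-score of a joint (POS, NP-chunk) labeling."""
    s = 0.0
    for t in range(len(pos_seq)):
        s += log_factor("cotemp", pos_seq[t], np_seq[t])        # cross-chain factor
        if t > 0:
            s += log_factor("trans_pos", pos_seq[t - 1], pos_seq[t])
            s += log_factor("trans_np",  np_seq[t - 1],  np_seq[t])
    return s

# Brute-force normalization over a 3-token sentence; real factorial CRFs use
# loopy belief propagation because this enumeration grows exponentially.
configs = list(itertools.product(itertools.product(POS, repeat=3),
                                 itertools.product(NP,  repeat=3)))
Z = sum(math.exp(joint_score(p, n)) for p, n in configs)
best_p, best_n = max(configs, key=lambda c: joint_score(*c))
print(best_p, best_n, math.exp(joint_score(best_p, best_n)) / Z)
```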
Outline
• Examples of IE and Data Mining ✓
• Motivate Joint Inference ✓
• Brief introduction to Conditional Random Fields ✓
• Joint inference: Examples
  – Joint Labeling of Cascaded Sequences (Loopy Belief Propagation) ✓
  – Joint Co-reference Resolution (Graph Partitioning)
  – Joint Co-reference with Weighted 1st-order Logic (MCMC)
  – Joint Relation Extraction and Data Mining (Bootstrapping)
• Ultimate application area: Rexa, a Web portal for researchers
Joint co-reference among all pairs
Affinity Matrix CRF
“Entity resolution”
“Object correspondence”
[Figure: three mentions (". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . .") joined by pairwise Y/N coreference variables with affinity weights 45, 99, and 11.]

~25% reduction in error on co-reference of proper nouns in newswire.

Inference: correlational clustering / graph partitioning
[Bansal, Blum, Chawla 2002] [McCallum, Wellner, IJCAI WS 2003, NIPS 2004]
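A minimal sketch of coreference as partitioning a pairwise affinity graph. The mention strings echo the slide, but the affinity values and the greedy agglomeration below are illustrative assumptions, not the correlational-clustering algorithm or weights of the cited papers.

```python
# Sketch: partition mentions so that within-cluster affinity is high and
# across-cluster affinity is low (positive affinity = "likely coreferent").
import itertools

mentions = ["Mr Powell", "Powell", "she"]
affinity = {("Mr Powell", "Powell"): 45.0,    # assumed values for illustration
            ("Mr Powell", "she"):    11.0,
            ("Powell", "she"):      -30.0}

def aff(a, b):
    return affinity.get((a, b), affinity.get((b, a), 0.0))

def cluster_score(clusters):
    """Objective: affinities within clusters count positively, across clusters negatively."""
    s = 0.0
    for a, b in itertools.combinations(mentions, 2):
        same = any(a in c and b in c for c in clusters)
        s += aff(a, b) if same else -aff(a, b)
    return s

# Greedy agglomeration: start with singletons, repeatedly apply the best merge.
clusters = [{m} for m in mentions]
improved = True
while improved:
    improved = False
    best = None
    for i, j in itertools.combinations(range(len(clusters)), 2):
        merged = [c for k, c in enumerate(clusters) if k not in (i, j)]
        merged.append(clusters[i] | clusters[j])
        gain = cluster_score(merged) - cluster_score(clusters)
        if gain > 0 and (best is None or gain > best[0]):
            best = (gain, merged)
    if best:
        clusters, improved = best[1], True

print(clusters)   # e.g. [{'she'}, {'Mr Powell', 'Powell'}]
```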
Joint Co-reference for Multiple Entity Types [Culotta & McCallum 2005]

[Figure: pairwise Y/N coreference decisions within and across entity types.
People: "Stuart Russell", "Stuart Russell", "S. Russel".
Organizations: "University of California at Berkeley", "Berkeley", "Berkeley".]

Reduces error by 22%
Outline
• Examples of IE and Data Mining ✓
• Motivate Joint Inference ✓
• Brief introduction to Conditional Random Fields ✓
• Joint inference: Examples
  – Joint Labeling of Cascaded Sequences (Loopy Belief Propagation) ✓
  – Joint Co-reference Resolution (Graph Partitioning) ✓
  – Joint Co-reference with Weighted 1st-order Logic (MCMC)
  – Joint Relation Extraction and Data Mining (Bootstrapping)
• Ultimate application area: Rexa, a Web portal for researchers
Sometimes pairwise comparisons
are not enough.
• Entities have multiple attributes
(name, email, institution, location);
need to measure “compatibility” among them.
• Having 2 “given names” is common, but not 4.
• Need to measure size of the clusters of mentions.
• ∃ a pair of last-name strings that differ > 5?
We need measures on hypothesized “entities”
We need First-order logic
Pairwise Co-reference Features

Mentions: Howard Dean, Dean Martin, Howard Martin
SamePerson(Dean Martin, Howard Dean)?
SamePerson(Howard Dean, Howard Martin)?
SamePerson(Dean Martin, Howard Martin)?

Pairwise Features:
• StringMatch(x1,x2)
• EditDistance(x1,x2)
Cluster-wise (higher-order) Representations

Mentions: Howard Dean, Dean Martin, Howard Martin (N = 3)
SamePerson(Howard Dean, Howard Martin, Dean Martin)?

First-Order Features:
• ∀x1,x2 StringMatch(x1,x2)
• ∃x1,x2 ¬StringMatch(x1,x2)
• ∃x1,x2 EditDistance>.5(x1,x2)
• ThreeDistinctStrings(x1,x2,x3)
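A minimal sketch of computing cluster-wise (first-order) features over a hypothesized entity: instead of scoring pairs, a whole set of mentions is scored at once. The feature set and the similarity threshold (string similarity < 0.5 approximating "edit distance > .5") are illustrative stand-ins for the features on the slide.

```python
# Sketch: first-order features of a hypothesized cluster of mention strings.
import itertools
from difflib import SequenceMatcher

def edit_sim(a, b):
    return SequenceMatcher(None, a, b).ratio()

def cluster_features(mentions):
    pairs = list(itertools.combinations(mentions, 2))
    return {
        "forall_string_match": all(a == b for a, b in pairs),
        "exists_string_mismatch": any(a != b for a, b in pairs),
        "exists_low_similarity": any(edit_sim(a, b) < 0.5 for a, b in pairs),
        "three_distinct_strings": len(set(mentions)) >= 3,
        "cluster_size": len(mentions),
    }

print(cluster_features(["Howard Dean", "Dean Martin", "Howard Martin"]))
```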
Cluster-wise (higher-order) Representations

Combinatorial Explosion!
SamePerson(x1,x2) …
SamePerson(x1,x2,x3) …
SamePerson(x1,x2,x3,x4) …
SamePerson(x1,x2,x3,x4,x5) …
SamePerson(x1,x2,x3,x4,x5,x6) …
…
Mentions: Dean Martin, Howard Dean, Howard Martin, Dino, Howie, Martin, …

This space complexity is common in first-order probabilistic models.
Markov Logic (Weighted 1st-order Logic):
Using 1st-order logic as a template to construct a CRF
[Richardson & Domingos 2005]
Grounding the Markov network requires space O(n^r),
where n = number of constants and r = highest clause arity.
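For a sense of scale (illustrative numbers): a single clause of arity r = 3 over n = 1,000 constants already has 1,000^3 = 10^9 groundings.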
How can we perform
inference and learning in models that
cannot be grounded?
Inference in First-Order Models
SAT Solvers
• Weighted SAT solvers [Kautz et al 1997]
– Requires complete grounding of network
• LazySAT [Singla & Domingos 2006]
– Saves memory by only storing clauses that may
become unsatisfied
– Still requires exponential time to visit all ground
clauses at initialization.
Inference in First-Order Models
Sampling
• Gibbs Sampling
– Difficult to move between high probability configurations
by changing single variables
• Although, consider MC-SAT [Poon & Domingos ‘06]
• An alternative: Metropolis-Hastings sampling
[Culotta & McCallum 2006]
– Can be extended to partial configurations
• Only instantiate relevant variables
– Successfully used in BLOG models [Milch et al 2005]
– 2 parts: proposal distribution, acceptance distribution.
Proposal Distribution

[Figure: three example proposals over the mentions "Dean Martin", "Howie Martin", "Howard Martin", "Dino": each shows a current coreference configuration y and a proposed configuration y′ obtained by splitting a cluster, merging clusters, or moving a mention between clusters.]
Inference with Metropolis-Hastings
• y : configuration
• p(y′)/p(y) : likelihood ratio
  – ratio of P(Y|X); Z_X cancels
• q(y′|y) : proposal distribution
  – probability of proposing the move y → y′
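A minimal sketch of Metropolis-Hastings over coreference configurations. The scoring function, proposal, and mention strings are illustrative assumptions, not the Culotta & McCallum 2006 system, and the proposal is treated as symmetric so the q-ratio is dropped.

```python
# Sketch: MCMC over clusterings; accept a proposed configuration y' with
# probability min(1, p(y')/p(y)) (q-ratio omitted under a symmetry assumption).
import math, random

def log_score(clusters):
    """Unnormalized log p(y | x): toy reward for same-surname clusters (assumed)."""
    s = 0.0
    for c in clusters:
        surnames = {m.split()[-1] for m in c}
        s += 2.0 if len(surnames) == 1 else -1.0 * (len(surnames) - 1)
    return s

def propose(clusters):
    """Move one random mention to another (possibly new) cluster."""
    new = [set(c) for c in clusters]
    src = random.choice(new)
    m = random.choice(sorted(src))
    src.remove(m)
    others = [c for c in new if c is not src]
    dst = random.choice(others) if others and random.random() < 0.7 else set()
    dst.add(m)
    if not any(dst is c for c in new):
        new.append(dst)
    return [c for c in new if c]          # drop any emptied cluster

mentions = ["Dean Martin", "Dino Martin", "Howard Dean", "Howard Martin"]
y = [set(mentions)]                       # start with everything as one entity
for _ in range(5000):
    y_prime = propose(y)
    accept = math.exp(min(0.0, log_score(y_prime) - log_score(y)))
    if random.random() < accept:
        y = y_prime
print(y)
```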
Experiments
• Paper citation coreference
• Author coreference
• First-order features
  – All Titles Match
  – Exists Year Mismatch
  – Average String Edit Distance > X
  – Number of mentions
Results on Citation Data

Citeseer paper coreference results (pair F1):

              First-Order   Pairwise
  constraint      82.3        76.7
  reinforce       93.4        78.7
  face            88.9        83.2
  reason          81.0        84.9

Author coreference results (pair F1):

              First-Order   Pairwise
  miller_d        41.9        61.7
  li_w            43.2        36.2
  smith_b         65.4        25.4
Outline
• Examples of IE and Data Mining ✓
• Motivate Joint Inference ✓
• Brief introduction to Conditional Random Fields ✓
• Joint inference: Examples
  – Joint Labeling of Cascaded Sequences (Loopy Belief Propagation) ✓
  – Joint Co-reference Resolution (Graph Partitioning) ✓
  – Joint Co-reference with Weighted 1st-order Logic (MCMC) ✓
  – Joint Relation Extraction and Data Mining (Bootstrapping)
• Ultimate application area: Rexa, a Web portal for researchers
Data
• 270 Wikipedia articles
• 1000 paragraphs
• 4700 relations
• 52 relation types
– JobTitle, BirthDay, Friend, Sister, Husband,
Employer, Cousin, Competition, Education, …
• Targeted for density of relations
– Bush/Kennedy/Manning/Coppola families and friends
George W. Bush
…his father George H. W. Bush…
…his cousin John Prescott Ellis…
George H. W. Bush
…his sister Nancy Ellis Bush…
Nancy Ellis Bush
…her son John Prescott Ellis…
Cousin = Father’s Sister’s Son
[Figure: relational path supporting the Cousin relation. George H. W. Bush and Nancy Ellis Bush are siblings; George W. Bush is a son of George H. W. Bush; John Prescott Ellis is a son of Nancy Ellis Bush; so X = George W. Bush and Y = John Prescott Ellis are likely cousins.]
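A minimal sketch of evaluating the relational path above (Cousin = Father's Sister's Son) over a toy fact base; the facts and helper functions are illustrative assumptions.

```python
# Sketch: compose relations Father -> Sister -> Son to find likely cousins.
facts = {
    ("George W. Bush", "Father"): ["George H. W. Bush"],
    ("George H. W. Bush", "Sister"): ["Nancy Ellis Bush"],
    ("Nancy Ellis Bush", "Son"): ["John Prescott Ellis"],
}

def follow(entities, relation):
    """All objects reachable from `entities` via one hop of `relation`."""
    return [o for e in entities for o in facts.get((e, relation), [])]

def path(entity, relations):
    current = [entity]
    for r in relations:
        current = follow(current, r)
    return current

print(path("George W. Bush", ["Father", "Sister", "Son"]))
# ['John Prescott Ellis']  -- likely a cousin of George W. Bush
```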
John Kerry
…celebrated with Stuart Forbes…
Extracted records:
Name → Son: Rosemary Forbes → John Kerry; James Forbes → Stuart Forbes
Name → Sibling: Rosemary Forbes → James Forbes

[Figure: Rosemary Forbes and James Forbes are siblings; John Kerry is a son of Rosemary Forbes; Stuart Forbes is a son of James Forbes; so John Kerry and Stuart Forbes are likely cousins.]
Iterative DB Construction
"Joseph P. Kennedy, Sr … son John F. Kennedy with Rose Fitzgerald …"

[Figure: records extracted into the DB (Name, Son, Wife fields) for Joseph P. Kennedy, John F. Kennedy, Rose Fitzgerald, Ronald Reagan, and George W. Bush; one uncertain record carries confidence 0.3.]

Fill the DB with a "first-pass" CRF; then use relational features from the DB with a "second-pass" CRF.
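A minimal sketch of the two-pass, bootstrapped extraction loop described above (illustrative, not the authors' system): a first-pass extractor fills a relational DB, the DB supplies relational features, and a second-pass extractor re-labels the text using those features. The first_pass function is a stand-in for a trained CRF; the pattern and corpus are assumptions.

```python
# Sketch: fill DB with a "first-pass" extractor, then re-extract with DB-derived features.
from collections import defaultdict

def first_pass(sentence):
    """Stand-in for a first-pass CRF: crude (subject, relation, object) guesses."""
    if " son " in sentence:
        left, right = sentence.split(" son ", 1)
        return [(left.strip(" ."), "Son", right.strip(" ."))]
    return []

def build_db(extractions):
    db = defaultdict(set)
    for subj, rel, obj in extractions:
        db[(subj, rel)].add(obj)
    return db

def relational_features(subj, obj, db):
    """Features for a candidate pair, derived from relations already in the DB."""
    feats = {}
    for (a, rel), objs in db.items():
        if obj in objs:
            feats[f"object_is_{rel}_of_{a}"] = True
        if subj in objs:
            feats[f"subject_is_{rel}_of_{a}"] = True
    return feats

def second_pass(sentence, db):
    """Stand-in for a second-pass CRF: first-pass output plus DB-derived features."""
    return [(s, r, o, relational_features(s, o, db)) for s, r, o in first_pass(sentence)]

corpus = ["Joseph P. Kennedy, Sr. son John F. Kennedy"]
db = build_db(e for doc in corpus for e in first_pass(doc))
for doc in corpus:
    print(second_pass(doc, db))
```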
Results

                 F1      Prec    Recall
  ME            .5489   .6475   .4763
  CRF           .5995   .7019   .5232
  RCRF          .6100   .6799   .5531
  RCRF .9       .6008   .7177   .5166
  RCRF .5       .6136   .7095   .5406
  RCRF Truth    .6791   .7553   .6169
  RCRF Truth.5  .6363   .7343   .5614

ME = maximum entropy; CRF = conditional random field; RCRF = CRF + mined features
Examples of Discovered Relational Features
• Mother: Father → Wife
• Cousin: Mother → Husband → Nephew
• Friend: Education → Student
• Education: Father → Education
• Boss: Boss → Son
• MemberOf: Grandfather → MemberOf
• Competition: PoliticalParty → Member → Competition
Outline
• Examples of IE and Data Mining ✓
• Motivate Joint Inference ✓
• Brief introduction to Conditional Random Fields ✓
• Joint inference: Examples ✓
  – Joint Labeling of Cascaded Sequences (Loopy Belief Propagation)
  – Joint Co-reference Resolution (Graph Partitioning)
  – Joint Co-reference with Weighted 1st-order Logic (MCMC)
  – Joint Relation Extraction and Data Mining (Bootstrapping)
• Ultimate application area: Rexa, a Web portal for researchers
Mining our Research Literature
• Better understand structure of our own
research area.
• Structure helps us learn a new field.
• Aid collaboration
• Map how ideas travel through social networks
of researchers.
• Aids for hiring and finding reviewers!
Previous Systems
[Diagram: Research Paper entities connected by Cites relations.]

More Entities and Relations
[Diagram: Research Paper, Person, Grant, Venue, University, and Groups entities, with relations such as Cites and Expertise.]
Topical Transfer
[Mann, Mimno, McCallum, JCDL 2006]
Citation counts from one topic to another.
Map “producers and consumers”
Impact Diversity
Topic Diversity: Entropy of the distribution of citing topics
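A minimal sketch of topic diversity as the entropy of the citing-topic distribution, H = -∑_i p_i log p_i; the citing-topic counts below are made up for illustration.

```python
# Sketch: entropy of the distribution of citing topics (higher = more diverse impact).
import math

citing_topic_counts = {"machine learning": 40, "databases": 25, "NLP": 20, "vision": 15}
total = sum(citing_topic_counts.values())
H = -sum((c / total) * math.log(c / total) for c in citing_topic_counts.values())
print(round(H, 3))
```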
Summary
• Joint inference is needed to avoid cascading errors in information extraction and data mining.
• Challenge: making inference & learning scale to
massive graphical models.
– Markov-chain Monte Carlo
• Rexa: New research paper search engine,
mining the interactions in our community.