UMass and Learning for CALO
Andrew McCallum
Information Extraction & Synthesis Laboratory
Department of Computer Science
University of Massachusetts
Outline
• CC-Prediction
– Learning in the wild from user email usage
• DEX
– Learning in the wild from user correction...
as well as KB records filled by other CALO components
• Rexa
– Learning in the wild from user corrections to
coreference... propagating constraints in a Markov Logic-like system that scales to ~20 million objects
• Several new topic models
– Discover interesting useful structure without the need
for supervision... learning from newly arrived data on the
fly
CC Prediction
Using Various Exponential Family
Factor Graphs
Learning to keep an org. connected & avoid stove-piping.
First steps toward ad-hoc team creation.
Learning in the wild from user’s CC behavior,
and from other parts of the CALO ontology.
Graphical Models for Email
• Compute P(y|x) for CC prediction
[Factor-graph legend: ■ function, ○ random variable, plate N = N replications]
The graph describes the joint distribution of random
variables in terms of a product of local functions
Email Model: Nb words in the body, Ns words in the subject, Nr recipients
[Factor graph: recipient-of-email variable y connects to body words xb (plate Nb),
subject words xs (plate Ns), and other recipients xr (plate Nr-1).]
• Local functions facilitate system engineering
through modularity
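As a concrete (and heavily simplified) sketch of the factor-graph formulation above: three local functions score a candidate recipient y against body words, subject words, and the other recipients, and P(y|x) is a softmax over candidates. All names and weights below are illustrative, not the trained CALO model.

```python
import math
from collections import defaultdict

# Hypothetical weights for three local functions (factors); in the real
# system these would be learned from the user's CC behavior.
w_body = defaultdict(float, {("alice", "meeting"): 1.2, ("bob", "meeting"): 0.1})
w_subj = defaultdict(float, {("alice", "budget"): 0.8})
w_recip = defaultdict(float, {("alice", "carol"): 0.5, ("bob", "carol"): -0.3})

def score(y, body, subject, others):
    """Sum the log local functions touching candidate recipient y."""
    s = sum(w_body[(y, x)] for x in body)
    s += sum(w_subj[(y, x)] for x in subject)
    s += sum(w_recip[(y, r)] for r in others)
    return s

def p_cc(candidates, body, subject, others):
    """P(y|x): softmax over candidate recipients."""
    scores = {y: score(y, body, subject, others) for y in candidates}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

probs = p_cc(["alice", "bob"], ["meeting"], ["budget"], ["carol"])
```

Because each local function is a separate table, new factors (e.g. new relations) can be bolted on without touching the rest, which is the modularity point above.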
Document Models
• Models may include relational attributes
[Factor graph: author-of-document variable y connects to title words xt (plate Nt),
abstract words xb (plate Nb), body words, co-authors (plate Na-1),
and references (plate Nr).]
• We can optimize P(y|x) for classification
performance and P(x|y) for model interpretability
and parameter transfer (to other models)
CC Prediction and Relational Attributes
[Factor graph: target-recipient variable y connects to thread relations xtr (plate Ntr),
body words xb (plate Nb), subject words xs (plate Ns),
recipient-relation features xr', and other recipients xr (plate Nr-1).]
Thread Relations –
e.g. Was a given recipient ever included on this email thread?
Recipient Relationships –
e.g. Does one of the other recipients report to the target recipient?
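The thread and recipient relations above can be sketched as binary feature functions; `thread_history` and `reports_to` below are hypothetical stand-ins for lookups into the KB / CALO ontology.

```python
# Illustrative relational feature functions of the kind used as factors.
thread_history = {"thread-42": {"alice", "bob"}}  # who has appeared on each thread
reports_to = {"dave": "alice"}                    # toy org chart: dave -> alice

def f_on_thread(target, thread_id):
    """Thread relation: was the target recipient ever on this email thread?"""
    return 1.0 if target in thread_history.get(thread_id, set()) else 0.0

def f_reportee_present(target, other_recipients):
    """Recipient relation: does one of the other recipients report to the target?"""
    return 1.0 if any(reports_to.get(r) == target for r in other_recipients) else 0.0
```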
CC-Prediction Learning in the Wild
• As documents are added to Rexa, models of
expertise for authors grow
• As DEX obtains more contact information and
keywords, organizational relations emerge
• Model parameters can be adapted on-line
• Priors on parameters can be used to transfer
learned information between models
• New relations can be added on-line
• Modular model construction and intelligent model
optimization enable these goals
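One way to realize on-line adaptation with parameter-transfer priors (a sketch, not the CALO implementation): a stochastic gradient step on the conditional likelihood, regularized toward weights transferred from another model. All names and numbers are illustrative.

```python
# On-line MAP-style update: follow the gradient of log P(y|x), while a
# Gaussian prior (variance sigma2) pulls each weight toward w_prior,
# i.e. toward the parameters transferred from a related model.
def online_update(w, w_prior, grad, lr=0.1, sigma2=10.0):
    """One SGD step with a prior centered on transferred weights."""
    return {k: w[k] + lr * (grad.get(k, 0.0) - (w[k] - w_prior.get(k, 0.0)) / sigma2)
            for k in w}

w = {"f1": 0.0, "f2": 2.0}
w_prior = {"f1": 1.0, "f2": 2.0}   # transferred from a related model
w = online_update(w, w_prior, grad={"f1": 0.5})
```

A feature with no new evidence (`f2`) simply stays at its transferred value, which is how prior knowledge survives on-line updates.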
CC Prediction
Upcoming work on
Multi-Conditional Learning
A discriminatively-trained topic model,
discovering low-dimensional representations for
transfer learning and improved regularization & generalization.
Objective Functions for Parameter Estimation
Traditional:
• Joint training (e.g. naive Bayes, most topic models)
• Mixture model (e.g. LDA)
• Conditional training (e.g. MaxEnt classifiers, CRFs)
New, multi-conditional:
• Conditional mixtures (e.g. Jebara’s CEM, McCallum CRF string edit distance, ...)
• Multi-conditional (mostly conditional, generative regularization)
• Multi-conditional (for semi-supervised learning)
• Multi-conditional (for transfer learning: 2 tasks, shared hidden variables)
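The objective families named above can be written out explicitly; the following is a sketch, where α and β weight the two conditionals (α ≫ β gives the "mostly conditional, generative regularization" setting):

```latex
% Traditional joint training (naive Bayes, most topic models)
\mathcal{L}_{\text{joint}}(\theta) = \sum_i \log p_\theta(x_i, y_i)

% Traditional conditional training (MaxEnt classifiers, CRFs)
\mathcal{L}_{\text{cond}}(\theta) = \sum_i \log p_\theta(y_i \mid x_i)

% Multi-conditional: a weighted sum of log conditionals
\mathcal{L}_{\text{MC}}(\theta) = \sum_i \Big[ \alpha \log p_\theta(y_i \mid x_i)
                                + \beta \log p_\theta(x_i \mid y_i) \Big]
```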
“Multi-Conditional Learning” (Regularization)
[McCallum, Pal, Wang, 2006]
Predictive Random Fields
mixture of Gaussians on synthetic data
Data, classify by color
Generatively trained
[McCallum, Wang, Pal, 2005]
Multi-Conditional
Conditionally-trained [Jebara 1998]
Multi-Conditional Mixtures
vs. Harmonium
on document retrieval task
[McCallum, Wang, Pal, 2005]
[Plot legend:
• Multi-conditional, multi-way conditionally trained
• Conditionally-trained, to predict class labels
• Harmonium, joint, with class labels and words
• Harmonium, joint with words, no labels]
DEX
Beginning with a review of previous work,
then new work on record extraction,
with the ability to leverage new KBs in the wild,
and for transfer
System Overview
[System diagram components: WWW, Email, CRF, Keyword Extraction,
Person Name Extraction, Name Coreference, Homepage Retrieval (given names),
Contact Info and Person Name Extraction, Social Network Analysis.]
An Example
To: “Andrew McCallum” mccallum@cs.umass.edu
Subject ...
[Search for new people]
Extracted contact record:
First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
JobTitle: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governor’s Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413) 545-1323
Links: Fernando Pereira, Sam Roweis, …
Key Words: Information extraction, social network, …
Example keywords extracted

Person               Keywords
William Cohen        Logic programming, Text categorization, Data integration, Rule learning
Daphne Koller        Bayesian networks, Relational models, Probabilistic models, Hidden variables
Deborah McGuiness    Semantic web, Description logics, Knowledge representation, Ontologies
Tom Mitchell         Machine learning, Cognitive states, Learning apprentice, Artificial intelligence
Summary of Results
Contact info and name extraction performance (25 fields)

       Token Acc   Field Prec   Field Recall   Field F1
CRF    94.50       85.73        76.33          80.76
Expert Finding:
When solving some task, find friends-of-friends with relevant expertise.
Avoid “stove-piping” in large org’s by automatically suggesting collaborators.
Given a task, automatically suggest the right team for the job. (Hiring aid!)
Social Network Analysis:
Understand the social structure of your organization.
Suggest structural changes for improved efficiency.
Importance of accurate DEX fields in IRIS
• Information about
– people
– contact information
– email
– affiliation
– job title
– expertise
– ...
is key to answering many CALO questions...
both directly, and as supporting inputs to
higher-level questions.
Learning Field Compatibilities in DEX
Extracted Record:
Name: Jane Smith, John Doe
JobTitle: Professor, Administrative Assistant
Company: U of California
Department: Computer Science
Phone: 209-555-5555, 209-444-4444
City: Boston

Source text: “Professor Jane Smith, University of California, 209-555-5555 ...
Professor Smith chairs the Computer Science Department.
She hails from Boston, …her administrative assistant …”

Compatibility Graph:
[Field values as nodes (Jane Smith, John Doe, Professor, Administrative
Assistant, University of California, Computer Science, Boston, 209-555-5555,
209-444-4444), joined by weighted compatibility edges such as +.8, +.4,
-.4, -.5, -.6.]
Learning Field Compatibilities in DEX
Extracted Record:
Name: Jane Smith, John Doe
JobTitle: Professor, Administrative Assistant
Company: U of California
Department: Computer Science
Phone: 209-555-5555, 209-444-4444
City: Boston

Source text: “Professor Jane Smith, University of California, 209-555-5555 ...
Professor Smith chairs the Computer Science Department.
She hails from Boston, …her administrative assistant …”

[Partitioned result: one record for Jane Smith (Professor, University of
California, Computer Science, 209-555-5555, Boston) and one for John Doe
(Administrative Assistant, University of California, 209-444-4444).]
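A toy sketch of how field values might be partitioned into per-person records using learned pairwise compatibilities: greedily merge clusters while the total cross-cluster edge weight is positive. The weights echo the slide's example but are otherwise illustrative; this is not DEX's actual algorithm.

```python
# Pairwise compatibility weights between extracted field values
# (positive = likely same record, negative = likely different records).
weights = {
    ("Jane Smith", "Professor"): 0.8,
    ("Jane Smith", "209-555-5555"): 0.4,
    ("Professor", "Administrative Assistant"): -0.6,
    ("John Doe", "Administrative Assistant"): 0.4,
    ("Jane Smith", "John Doe"): -0.5,
}

def w(a, b):
    return weights.get((a, b), weights.get((b, a), 0.0))

def partition(nodes):
    """Greedy agglomerative merging while the best merge gain is positive."""
    clusters = [{n} for n in nodes]
    while True:
        best, pair = 0.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                gain = sum(w(a, b) for a in clusters[i] for b in clusters[j])
                if gain > best:
                    best, pair = gain, (i, j)
        if pair is None:
            return clusters
        i, j = pair
        clusters[i] |= clusters.pop(j)

parts = partition(["Jane Smith", "Professor", "209-555-5555",
                   "Administrative Assistant", "John Doe"])
```

On these weights the greedy merges recover the two-record split shown on the slide: Jane Smith's fields in one cluster, John Doe's in the other.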
Learning Field Compatibilities in DEX
• ~35% error reduction over transitive closure
• Qualitatively better than heuristic approach
• Mine Knowledge Bases from other parts of IRIS
for learning compatibility rules among fields
– “Professor” job title co-occurs with “University” company
– Area code / city compatibility
– “Senator” job title co-occurs with “Washington, D.C.” location
• In the wild
– As the user adds new fields & makes corrections, DEX learns
from this KB data
• Transfer learning
– between departments/industries
Rexa
A knowledge base of publications,
grants, people, their expertise, topics,
and inter-connections
Learning for information extraction and coreference.
Incrementally leveraging multiple sources of
information for improved coreference
Gathering information about people’s expertise and coauthor and citation relations
First a tour of Rexa, then slides about learning
Previous Systems
Previous Systems
[Diagram: Research Papers linked by Cites relations.]
More Entities and Relations
[Diagram adds Person, Grant, Venue, University, and Groups entities,
with Expertise and Cites relations.]
Learning in Rexa
Extraction, coreference
In the wild: Re-adjusting KB after corrections from a user
Also, learning research topics/expertise,
and their interconnections

(Linear Chain) Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]
Undirected graphical model, trained to maximize
conditional probability of output sequence given input sequence:

p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_{\mathbf{x}}} \prod_t \Phi(y_t, y_{t-1}, \mathbf{x}, t)

where \Phi(y_t, y_{t-1}, \mathbf{x}, t) = \exp\Big( \sum_k \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}, t) \Big)
Finite state model / Graphical model
[Linear-chain diagram: output sequence of FSM states y_{t-1} ... y_{t+3}
labeled OTHER, PERSON, OTHER, ORG, TITLE …, over the input sequence of
observations x_{t-1} ... x_{t+3} = “said Jones a Microsoft VP …”.]
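A minimal, illustrative implementation of the equations above: the forward algorithm computes the partition function Z_x, giving an exact p(y|x) for a toy feature set. Weights are hand-set, not trained, and the state/feature names are invented.

```python
import math

# Minimal linear-chain CRF sketch over the slide's example sentence.
STATES = ["OTHER", "PERSON", "ORG", "TITLE"]

def log_phi(y_t, y_prev, x, t, weights):
    """log Phi(y_t, y_{t-1}, x, t) = sum_k lambda_k f_k(...)."""
    feats = [("trans", y_prev, y_t), ("obs", y_t, x[t])]
    return sum(weights.get(f, 0.0) for f in feats)

def log_Z(x, weights):
    """Forward pass: log of the sum over all state sequences."""
    alpha = {y: log_phi(y, "START", x, 0, weights) for y in STATES}
    for t in range(1, len(x)):
        alpha = {y: math.log(sum(math.exp(alpha[yp] + log_phi(y, yp, x, t, weights))
                                 for yp in STATES))
                 for y in STATES}
    return math.log(sum(math.exp(v) for v in alpha.values()))

def log_p(y_seq, x, weights):
    """log p(y|x) = sum_t log Phi - log Z_x."""
    s = log_phi(y_seq[0], "START", x, 0, weights)
    s += sum(log_phi(y_seq[t], y_seq[t - 1], x, t, weights) for t in range(1, len(x)))
    return s - log_Z(x, weights)

x = ["said", "Jones", "a", "Microsoft", "VP"]
weights = {("obs", "PERSON", "Jones"): 2.0, ("obs", "ORG", "Microsoft"): 2.0}
p = math.exp(log_p(["OTHER", "PERSON", "OTHER", "ORG", "TITLE"], x, weights))
```

A quick sanity check of the forward pass: with all weights zero, every potential is 1, so Z equals the number of state sequences, 4^5.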
Wide-spread interest, positive experimental results in many applications.
Noun phrase, Named entity [HLT’03], [CoNLL’03]
Protein structure prediction [ICML’04]
IE from Bioinformatics text [Bioinformatics ‘04],…
Asian word segmentation [COLING’04], [ACL’04]
IE from Research papers [HLT’04]
Object classification in images [CVPR ‘04]
IE from Research Papers
[McCallum et al ‘99]
IE from Research Papers
                                   Field-level F1
Hidden Markov Models (HMMs)        75.6    [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs)     89.7    [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs)   93.9    [Peng, McCallum, 2004]

(40% error reduction; word-level accuracy is >99%)
Joint segmentation and co-reference
Extraction from and matching of
research paper citations.
[Model diagram: observed citation strings o, segmentations s, citation
attributes c (database field values), and co-reference decisions y, with
World Knowledge informing the field values. Example citation pair:
“Laurel, B. Interface Agents: Metaphors with Character, in The Art of
Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.”
“Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel,
The Art of Human-Computer Interface Design, 355-366, 1990.”]
35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.
Inference:
Variant of Iterated Conditional Modes
[Besag, 1986]
[Wellner, McCallum, Peng, Hay, UAI 2004]
see also [Marthi, Milch, Russell, 2003]
Rexa Learning in the Wild
from User Feedback
• Coreference will never be perfect.
• Rexa allows users to enter corrections to
coreference decisions
• Rexa then uses this feedback to
– re-consider other inter-related parts of the KB
– automatically make further error corrections
by propagating constraints
• (Our coreference system uses underlying
ideas very much like Markov Logic, and
scales to ~20 million mention objects.)
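A simplified sketch of propagating user corrections as constraints (not Rexa's actual Markov Logic-style system): must-link corrections are closed transitively with union-find, and cannot-link corrections then rule out conflicting merges elsewhere in the KB. All mention names are invented.

```python
class CorefKB:
    """Union-find over mentions: must-link closure + cannot-link checks."""

    def __init__(self, mentions):
        self.parent = {m: m for m in mentions}
        self.cannot = []               # pairs the user has separated

    def find(self, m):
        while self.parent[m] != m:
            self.parent[m] = self.parent[self.parent[m]]  # path compression
            m = self.parent[m]
        return m

    def must_link(self, a, b):
        self.parent[self.find(a)] = self.find(b)

    def coreferent(self, a, b):
        return self.find(a) == self.find(b)

    def cannot_link(self, a, b):
        self.cannot.append((a, b))

    def consistent_merge(self, a, b):
        """Would merging a's and b's clusters violate a propagated constraint?"""
        return not any((self.coreferent(x, a) and self.coreferent(y, b)) or
                       (self.coreferent(x, b) and self.coreferent(y, a))
                       for x, y in self.cannot)

kb = CorefKB(["m1", "m2", "m3"])
kb.must_link("m1", "m2")     # user: these two mentions are the same person
kb.cannot_link("m2", "m3")   # user correction: m3 is someone else
```

Because the cannot-link is checked against whole clusters rather than individual mentions, one user correction automatically blocks (and can undo) other erroneous merges, which is the propagation effect described above.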
Finding Topics in 1 million CS papers
200 topics & keywords automatically discovered.
Topical Transfer
Citation counts from one topic to another.
Map “producers and consumers”
Topical Diversity
Find the topics that are cited by many other topics
---measuring diversity of impact.
Entropy of the topic distribution among
papers that cite this paper (this topic).
[Figure: example papers ranked from low diversity to high diversity.]
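The diversity measure reduces to an entropy computation over the topics of citing papers; a small sketch with invented topic labels:

```python
import math
from collections import Counter

def topical_diversity(citing_paper_topics):
    """Entropy of the topic distribution among papers citing a given paper."""
    counts = Counter(citing_paper_topics)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

low = topical_diversity(["NLP"] * 10)   # all citers share one topic -> 0
high = topical_diversity(["NLP", "ML", "DB", "Vision", "Theory"])
```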
Some New Work on
Topic Models
Robustly capturing topic correlations
Pachinko Allocation Model
Capturing phrases in topic-specific ways
Topical N-Gram Model
Pachinko Machine
Pachinko Allocation Model
[Li, McCallum, 2005]
11
21
31
41
word1
word2
42
word3
22
32
Distributions over distributions over topics...
33
43
word4
Distributions over topics;
mixtures, representing topic correlations
44
word5
word6
45 Distributions over words
(like “LDA topics”)
word7
word8
Some interior nodes could contain one multinomial, used for all documents.
(i.e. a very peaked Dirichlet)
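An illustrative generative-process sketch of a small four-level PAM: each token samples a path root → super-topic → sub-topic → word, so correlations between sub-topics are carried by the per-document super-topic distributions. All sizes, names, and hyperparameters below are invented.

```python
import random

random.seed(0)

def dirichlet(alphas):
    """Sample from a Dirichlet via normalized Gamma draws (stdlib only)."""
    g = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

def pick(items, probs):
    return random.choices(items, weights=probs, k=1)[0]

SUPERS = ["s1", "s2"]
SUBS = ["t1", "t2", "t3"]
VOCAB = ["word1", "word2", "word3", "word4"]
phi = {t: dirichlet([0.5] * len(VOCAB)) for t in SUBS}  # sub-topic word dists

def sample_document(n_words):
    theta_root = dirichlet([1.0] * len(SUPERS))              # over super-topics
    theta_super = {s: dirichlet([1.0] * len(SUBS)) for s in SUPERS}
    doc = []
    for _ in range(n_words):
        s = pick(SUPERS, theta_root)
        t = pick(SUBS, theta_super[s])   # topic correlations enter here
        doc.append(pick(VOCAB, phi[t]))
    return doc

doc = sample_document(20)
```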
Topic Coherence Comparison
LDA 20 (“models, estimation, stopwords”): models, model, parameters,
distribution, bayesian, probability, estimation, data, gaussian, methods,
likelihood, em, mixture, show, approach, paper, density, framework,
approximation, markov

LDA 100 (“estimation, some junk”): estimation, likelihood, maximum, noisy,
estimates, mixture, scene, surface, normalization, generated, measurements,
surfaces, estimating, estimated, iterative, combined, figure, divisive,
sequence, ideal

PAM 100 (“estimation”): estimation, bayesian, parameters, data, methods,
estimate, maximum, probabilistic, distributions, noise, variable, variables,
noisy, inference, variance, entropy, models, framework, statistical, estimating
Example super-topic (mixture weights over sub-topics):
33  input hidden units function number
27  estimation bayesian parameters data m
24  distribution gaussian markov likelihood
11  exact kalman full conditional determini
 1  smoothing predictive regularizers inter
Topic Correlations in PAM
5000 research paper abstracts, from across all CS
Numbers on edges are supertopics’ Dirichlet parameters
Likelihood Comparison
Varying number of topics
Want to Model Trends over Time
• Is prevalence of topic growing or waning?
• Pattern appears only briefly
– Capture its statistics in focused way
– Don’t confuse it with patterns elsewhere in time
• How do roles, groups, influence shift over time?
Topics over Time (TOT)
[Wang, McCallum 2006]
[Graphical models compared. LDA: a Dirichlet prior generates each document’s
multinomial over topics; a topic index z is drawn per word, and each topic’s
multinomial over words (Dirichlet prior) generates word w; plates Nd, D, T.
TOT: additionally, each topic has a Beta distribution over time (uniform
prior), which generates the time stamp t alongside the words.]
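The key TOT difference from LDA can be sketched directly: each topic also carries a Beta distribution over normalized time stamps, so a token draw from a topic yields a (word, time) pair. The topics and parameters below are invented for illustration.

```python
import random

random.seed(0)

# Each topic: a Beta distribution over normalized time [0, 1] plus its words.
topics = {
    "war": {"beta": (8.0, 2.0), "words": ["army", "enemy", "victory"]},
    "tax": {"beta": (2.0, 8.0), "words": ["tariff", "revenue", "duty"]},
}

def generate_token(topic):
    """Draw (word, time stamp) jointly from one topic, as in TOT."""
    word = random.choice(topics[topic]["words"])
    a, b = topics[topic]["beta"]
    t = random.betavariate(a, b)   # time stamp in [0, 1]
    return word, t

word, t = generate_token("tax")
```

Because the Beta density can be sharply peaked, a topic that appears only briefly gets its statistics captured in a focused way rather than smeared across the whole timeline.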
State of the Union Address
208 addresses delivered between January 8, 1790 and January 29, 2002.
To increase the number of documents, we split the addresses into paragraphs
and treated them as ‘documents’. One-line paragraphs were excluded.
Stopword removal was applied.
• 17,156 ‘documents’
• 21,534 words
• 669,425 tokens
Our scheme of taxation, by means of which this needless surplus is taken
from the people and put into the public Treasury, consists of a tariff or
duty levied upon importations from abroad and internal-revenue taxes levied
upon the consumption of tobacco and spirituous and malt liquors. It must be
conceded that none of the things subjected to internal-revenue taxation
are, strictly speaking, necessaries. There appears to be no just complaint
of this taxation by the consumers of these articles, and there seems to be
nothing so well able to bear the burden without hardship to any portion of
the people.
(1910)
Comparing TOT against LDA
[Figure: side-by-side topics discovered by TOT and LDA.]
Topic Distributions Conditioned on Time
[Figure: NIPS vols. 1-14; topic mass (vertical height) plotted against time.]