Learning and Inference - Dan Roth - University of Illinois at Urbana-Champaign

Natural Language Processing
via
Learning and Global Inference
with
Constraints
Dan Roth
Department of Computer Science
University of Illinois at Urbana-Champaign
With thanks to:
Collaborators: Ming-Wei Chang, Vasin Punyakanok, Lev Ratinov, Mark Sammons, Scott Yih, Dav Zimak
Funding: ARDA, under the AQUAINT program; NSF: ITR IIS-0085836, ITR IIS-0428472, ITR IIS-0085980; Beckman Institute; a DOI grant under the Reflex program; DASH Optimization (Xpress-MP)
November 2007
Page 1
Nice to Meet You
Page 2
Learning and Inference
 Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome.
 (Learned) models/classifiers for different sub-problems.
 Incorporate classifiers’ information, along with constraints, in making coherent decisions – decisions that respect the local models as well as domain- and context-specific constraints.
 Global inference for the best assignment to all variables of interest.
Page 3
Inference
Page 4
Comprehension
A process that maintains and updates a collection of propositions about the state of affairs.
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
1. Christopher Robin was born in England.
2. Winnie the Pooh is a title of a book.
3. Christopher Robin’s dad was a magician.
4. Christopher Robin must be at least 65 now.
Page 5
What we Know: Stand Alone Ambiguity Resolution
 Spelling/word sense: Illinois’ bored of education  →  board
 Word sense: ...Nissan Car and truck plant is …  vs.  …divide life into plant and animal kingdom
 POS tagging: (This Art) (can N) (will MD) (rust V)  vs.  the competing tags V, N, N for can, will, rust
 Coreference: The dog bit the kid. He was taken to a veterinarian  vs.  a hospital
Learn a function f: X → Y that maps observations in a domain to one of several categories (or to an ordering <).
Page 6
Classification is Well Understood
 Theoretically: generalization bounds
  How many examples does one need to see in order to guarantee good behavior on previously unobserved examples?
 Algorithmically: good learning algorithms for linear representations.
  Can deal with very high dimensionality (10^6 features)
  Very efficient in terms of computation and # of examples. On-line.
 Key issues remaining:
  Learning protocols: how to minimize interaction (supervision); how to map domain/task information to supervision; semi-supervised learning; active learning; ranking; sequences; adaptation
  What are the features? No good theoretical understanding here.
  Programming systems that have multiple classifiers
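As a small illustration of the kind of on-line linear learner referred to here (a generic sparse perceptron sketch, not SNoW itself; the toy features below are invented):

```python
# A minimal on-line perceptron over sparse binary features (a sketch, not SNoW itself).
# Weights live in a dict, so very high-dimensional feature spaces stay cheap.
from collections import defaultdict

class SparsePerceptron:
    def __init__(self):
        self.w = defaultdict(float)   # feature name -> weight

    def score(self, features):
        return sum(self.w[f] for f in features)

    def predict(self, features):
        return 1 if self.score(features) >= 0 else -1

    def update(self, features, label):
        # Mistake-driven update: only the active features are touched.
        if self.predict(features) != label:
            for f in features:
                self.w[f] += label

# Toy usage with made-up lexical features.
clf = SparsePerceptron()
for feats, y in [({"w=bored", "prev=Illinois"}, -1), ({"w=board", "next=of"}, 1)]:
    clf.update(feats, y)
print(clf.predict({"w=board", "next=of"}))
```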
Page 7
Comprehension
A process that maintains and updates a collection of propositions about the state of affairs.
(Same passage and questions as the earlier Comprehension slide.)
This is an Inference Problem
Page 8
This Talk
 Global Inference over Local Models/Classifiers + Expressive Constraints
 Model
 Generality of the framework
 Training Paradigms
 Global vs. Local training
 Semi-Supervised Learning
 Examples
 Semantic Parsing
 Information Extraction
 Pipeline processes
Page 9
Sequential Constraints Structure
 Three models for sequential inference with classifiers [Punyakanok & Roth NIPS’01]
  HMM; HMM with Classifiers
   Sufficient for easy problems; allows for Dynamic Programming based inference
  Conditional Models (PMM)
   By far the most popular in applications
   Allows direct modeling of states as a function of input
   Classifiers may vary: SNoW (Winnow; Perceptron); MEMM: MaxEnt; SVM based
  Constraint Satisfaction Models (CSCL: more general constraints)
   The inference problem is modeled as weighted 2-SAT
   With sequential constraints: shown to have an efficient solution
 [Figure: a chain-structured model over states s1 ... s6 and observations o1 ... o6]
 Recent work – viewed as multi-class classification; emphasis on global training [Collins’02, CRFs, SVMs]; efficiency and performance issues
 What if the structure of the problem/constraints is not sequential?
Page 10
Pipeline
 Most problems are not single classification problems
 Raw Data → POS Tagging → Phrases → Semantic Entities → Relations (and similarly Parsing, WSD, Semantic Role Labeling downstream)
 Pipelining is a crude approximation; interactions occur across levels, and downstream decisions often interact with previous decisions.
  Leads to propagation of errors
  Occasionally, later-stage problems are easier, but upstream mistakes will not be corrected.
Looking for:
 Global inference over the outcomes of different (learned) predictors as a way to break away from this paradigm [between pipeline & fully global]
 Allows a flexible way to incorporate linguistic and structural constraints.
Page 11
Inference with General Constraint Structure
Dole ’s wife, Elizabeth , is a native of N.C.
Entity classifier scores (E1 = Dole, E2 = Elizabeth, E3 = N.C.):
             E1      E2      E3
  other      0.05    0.10    0.05
  per        0.85    0.60    0.50
  loc        0.10    0.30    0.45
Relation classifier scores (R12 = (E1, E2), R23 = (E2, E3)):
             R12     R23
  irrelevant 0.05    0.10
  spouse_of  0.45    0.05
  born_in    0.50    0.85
Improvement over no inference: 2-5%
Page 12
Problem Setting
 Random variables Y: y1, ..., y8 (over the observations X), with constraints among subsets of the variables, e.g., C(y1,y4), C(y2,y3,y6,y7,y8)
 Conditional distributions P (learned by models/classifiers)
 Constraints C – any Boolean function defined on partial assignments (possibly with weights W)
 Goal: Find the “best” assignment
  The assignment that achieves the highest global performance.
This is an Integer Programming Problem:
  Y* = argmax_Y P(Y) (+ W·C), subject to constraints C
Other, more general ways to incorporate soft constraints here [ACL’07]
Page 13
A General Inference Setting
 Essentially all complex models studied today can be viewed as optimizing a linear objective function: HMMs/CRFs [Roth’99; Collins’02; Lafferty et al. ’02]
 Linear objective functions can be derived from a probabilistic perspective:
  Markov Random Field [standard] → Metric Labeling optimization problem [Kleinberg & Tardos] → Linear Programming problems [Chekuri et al. ’01] → Inference as Constrained Optimization [Yih & Roth CoNLL’04; Punyakanok et al. COLING’04] ...
 The probabilistic perspective supports finding the most likely assignment
  Not necessarily what we want
 Our Integer Linear Programming (ILP) formulation
  Allows the incorporation of more general cost functions
  General (non-sequential) constraint structure
  Better exploitation (computationally) of hard constraints
  Can find the optimal solution if desired
Page 14
Formal Model
This is an Integer Linear Program with two components:
 A weight vector for the “local” models: a collection of classifiers, log-linear models (HMM, CRF), or a combination.
 A (soft) constraints component: a penalty for violating each constraint, measuring how far y is from a “legal” assignment.
 The objective is maximized subject to the constraints.
How to solve? Solving with ILP packages gives an exact solution; search techniques are also possible.
How to train? Should constraints be incorporated in the learning process?
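The formula itself did not survive in this copy of the slide. For reference, the objective the callouts annotate is, in the form used in the related CCM papers (a paraphrase, not a verbatim copy of the slide):

```latex
y^{*} \;=\; \arg\max_{y}\; \mathbf{w}^{\top}\phi(x,y)\;-\;\sum_{i}\rho_{i}\, d\big(y,\,\mathbf{1}_{C_{i}(x)}\big)
```

Here the first term is the score of the local models with weight vector w, ρ_i is the penalty for violating constraint C_i, and d(y, 1_{C_i}) measures how far y is from a “legal” assignment.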
Page 15
Example: Semantic Role Labeling
Who did what to whom, when, where, why, ...
I left my pearls to my daughter in my will .
[I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC .
  A0: Leaver
  A1: Things left
  A2: Benefactor
  AM-LOC: Location
Special case (structured output problem): here, all the data is available at one time; in general, classifiers might be learned from different sources, at different times, in different contexts. Implications on training paradigms.
Constraints illustrated on the example: overlapping arguments are not allowed; if A2 is present, A1 must also be present.
Page 16
Semantic Role Labeling (2/2)
 PropBank [Palmer et al. 05] provides a large human-annotated corpus of semantic verb-argument relations.
  It adds a layer of generic semantic labels to Penn Tree Bank II.
  (Almost) all the labels are on the constituents of the parse trees.
 Core arguments: A0-A5 and AA
  different semantics for each verb
  specified in the PropBank Frame files
 13 types of adjuncts labeled as AM-arg
  where arg specifies the adjunct type
Page 17
Algorithmic Approach
 Identify argument candidates (identify the vocabulary)
  Pruning [Xue & Palmer, EMNLP’04]
  Argument Identifier: binary classification (SNoW)
  [Figure: candidate bracketings of “I left my nice pearls to her”]
 Classify argument candidates
  Argument Classifier: multi-class classification (SNoW)
 Inference (over the old and new vocabulary)
  Use the estimated probability distribution given by the argument classifier
  Use structural and linguistic constraints
  Infer the optimal global output for “I left my nice pearls to her”
Page 18
I left my nice pearls to her
Inference
 The output of the argument classifier often violates some constraints, especially when the sentence is long.
 Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming. [Punyakanok et al. 04, Roth & Yih 04;05]
 Input:
  The probability estimation (by the argument classifier)
  Structural and linguistic constraints
 Allows incorporating expressive (non-sequential) constraints on the variables (the argument types).
Page 19
Integer Linear Programming Inference
 For each argument a_i
  Set up a Boolean variable a_{i,t} indicating whether a_i is classified as t
 Goal is to maximize
  Σ_i Σ_t score(a_i = t) · a_{i,t}
  subject to the (linear) constraints
 Any Boolean constraint can be encoded as linear constraint(s).
 If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of arguments that are correct and satisfies the constraints.
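A hedged sketch of this formulation with an off-the-shelf ILP solver (PuLP here); the candidates, labels, and scores are invented, and only the no-duplicate-A0 constraint from the next slide is shown:

```python
# Sketch of the argmax over Boolean indicator variables a[i, t], solved as an ILP.
# Requires: pip install pulp. Scores and labels are toy values, not real classifier output.
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum, PULP_CBC_CMD

candidates = ["c0", "c1", "c2"]               # argument candidates a_i
labels = ["A0", "A1", "A2", "null"]           # possible argument types t
score = {                                      # score(a_i = t), e.g. log P(a_i = t)
    "c0": {"A0": 0.8, "A1": 0.1, "A2": 0.0, "null": 0.1},
    "c1": {"A0": 0.6, "A1": 0.3, "A2": 0.0, "null": 0.1},
    "c2": {"A0": 0.1, "A1": 0.2, "A2": 0.5, "null": 0.2},
}

prob = LpProblem("srl_inference", LpMaximize)
a = {(i, t): LpVariable(f"a_{i}_{t}", cat=LpBinary) for i in candidates for t in labels}

# Objective: maximize the total score of the chosen labeling.
prob += lpSum(score[i][t] * a[i, t] for i in candidates for t in labels)

# Each candidate receives exactly one label.
for i in candidates:
    prob += lpSum(a[i, t] for t in labels) == 1

# Example constraint: no duplicate A0 argument.
prob += lpSum(a[i, "A0"] for i in candidates) <= 1

prob.solve(PULP_CBC_CMD(msg=False))
print({i: next(t for t in labels if a[i, t].value() == 1) for i in candidates})
```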
Page 20
Constraints
Any Boolean rule can be encoded as a linear constraint.
 No duplicate argument classes
  Σ_{a ∈ POTARG} x{a = A0} ≤ 1
 R-ARG: If there is an R-ARG phrase, there is an ARG phrase
  ∀ a2 ∈ POTARG:  Σ_{a ∈ POTARG} x{a = A0} ≥ x{a2 = R-A0}
 C-ARG: If there is a C-ARG phrase, there is an ARG before it
  ∀ a2 ∈ POTARG:  Σ_{a ∈ POTARG, a before a2} x{a = A0} ≥ x{a2 = C-A0}
 Many other possible constraints (universally quantified rules):
  Unique labels
  No overlapping or embedding
  Relations between number of arguments
  If verb is of type A, no argument of type B
Joint inference can be used also to combine different SRL systems.
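For concreteness (these are the standard encodings, stated here rather than taken from the slide), a Boolean rule over the indicator variables turns into a linear inequality as follows:

```latex
a \Rightarrow b \;\;\leadsto\;\; x_a \le x_b, \qquad
a \wedge b \Rightarrow c \;\;\leadsto\;\; x_a + x_b - 1 \le x_c, \qquad
a \vee b \;\;\leadsto\;\; x_a + x_b \ge 1 .
```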
Page 21
ILP as a Unified Algorithmic Scheme
 Consider a common model for sequential inference: HMM/CRF. Inference in this model is done via the Viterbi algorithm.
 [Figure: a Viterbi trellis from s to t over states {A, B, C} for positions y1 ... y5 with observations x1 ... x5]
 Viterbi is a special case of the Linear Programming based inference.
  Viterbi is a shortest path problem, which is an LP with a canonical matrix that is totally unimodular. Therefore, you get integrality constraints for free.
 One can now incorporate non-sequential/expressive/declarative constraints by modifying this canonical matrix.
  The extension reduces to a polynomial scheme under some conditions (e.g., when constraints are sequential, when the solution space does not change, etc.)
  Does not necessarily increase complexity, and is very efficient in practice [Roth & Yih, ICML’05]
 So far, we have shown the use of only (deterministic) constraints; the approach can also be used with statistical constraints.
 Learn a rather simple model; make decisions with a more expressive model.
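For reference, a generic version of the Viterbi special case mentioned above (toy emission and transition scores; in the LP view, this dynamic program solves the shortest-path linear program whose constraint matrix is totally unimodular):

```python
# Generic Viterbi decoding over a linear chain: maximize the sum of emission and
# transition scores over state sequences. Scores here are arbitrary toy numbers.
def viterbi(obs, states, emit, trans):
    # emit[s][o]: score of state s emitting observation o
    # trans[p][s]: score of moving from state p to state s
    best = {s: emit[s][obs[0]] for s in states}   # best score of a path ending in s
    back = []                                      # backpointers per position
    for o in obs[1:]:
        prev, best, ptr = best, {}, {}
        for s in states:
            p = max(states, key=lambda q: prev[q] + trans[q][s])
            best[s] = prev[p] + trans[p][s] + emit[s][o]
            ptr[s] = p
        back.append(ptr)
    # Recover the best path by following backpointers from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

states = ["A", "B", "C"]
emit = {s: {o: (1.0 if s == o else 0.1) for o in states} for s in states}
trans = {s: {q: (0.5 if s != q else 0.2) for q in states} for s in states}
print(viterbi(["A", "B", "B", "C"], states, emit, trans))
```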
Page 22
Extracting Relations via Semantic Analysis
[Screen shot from a CCG demo: http://L2R.cs.uiuc.edu/~cogcomp]
Semantic parsing reveals several relations in the sentence along with their arguments.
 This approach produces a very good semantic parser: F1 ~ 90%
 Easy and fast: ~7 sentences/sec (using Xpress-MP)
 Top ranked system in the CoNLL’05 shared task
 Key difference is the Inference
Page 23
This Talk
 Global Inference over Local Models/Classifiers + Expressive Constraints
 Model
 Generality of the framework
 Training Paradigms
 Global vs. Local training
 Semi-Supervised Learning
 Examples
 Semantic Parsing
 Information Extraction
 Pipeline processes
Page 24
Textual Entailment
 Given:
  Q: Who acquired Overture?
  A: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year.
 Determine (Textual Entailment): is it true that
  Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year
  → Yahoo acquired Overture
  → Overture is a search company
  → Google is a search company
  → Google owns Overture
  ...
Components involved: Semantic Role Labeling; Entity matching [Li et al., AAAI’04, NAACL’04]; Phrasal verb paraphrasing [Connor & Roth ’07]
Page 25
Training Paradigms that Support Global Inference
 Incorporating general constraints (algorithmic approach)
  Allow both statistical and expressive declarative constraints
  Allow non-sequential constraints (generally difficult)
 Coupling vs. Decoupling Training and Inference
  Incorporating global constraints is important, but should it be done only at evaluation time or also at training time?
  Issues related to: modularity, efficiency and performance, availability of training data
  (May not be relevant in some problems.)
Page 26
Phrase Identification Problem
 Use classifiers’ outcomes to identify phrases
 Final outcome determined by optimizing the classifiers’ outcomes and the constraints
[Figure: input o1 ... o10; Classifier 1 predicts open brackets “[”, Classifier 2 predicts close brackets “]”; the inference step combines them into consistent phrases]
Did this classifier make a mistake? How to train it?
Page 27
Training in the presence of Constraints
 General training paradigm:
  First term: learning from data
  Second term: guiding the model by constraints
 Can choose whether constraints are included in training or only in evaluation.
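A minimal sketch of how the two terms combine when scoring a candidate assignment (the function names and the dictionary-of-constraints interface are invented for illustration):

```python
# Combine the two terms from this slide: a learned data term minus constraint penalties.
# w_phi, constraints, and rho are placeholders for whatever model and constraints are used.
def constrained_score(x, y, w_phi, constraints, rho):
    data_term = w_phi(x, y)                                   # first term: learned from data
    penalty = sum(rho[name] * dist(x, y)                      # second term: constraint guidance
                  for name, dist in constraints.items())
    return data_term - penalty
```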
Page 28
L+I: Learning plus Inference
 L+I: train the components without constraints; at testing time, do inference with constraints.
 IBT: Inference-based Training – learn the components together.
(Cartoon: local models f1(x) ... f5(x) over inputs x1 ... x7, each with a view on a set of output variables y1 ... y5; each model can be more complex.)
Page 29
Perceptron-based Global Learning
 True global labeling Y:   -1  1  -1  -1  1
 Local predictions Y’:     -1  1   1  -1  1
 Apply constraints to obtain the final global prediction and update the local models based on it.
(Figure: local models f1(x) ... f5(x) over inputs x1 ... x7 producing the global output Y.)
Which one is better? When and Why?
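A schematic of this inference-based (global) perceptron update; all function names are placeholders, `constrained_inference` stands in for the ILP step, and the per-position update assumes the features decompose over the local decisions:

```python
# Inference-based training (IBT) sketch: predict a full assignment with constrained
# inference, then apply perceptron updates only where the global prediction is wrong.
from collections import defaultdict

def ibt_perceptron(examples, feature_fn, constrained_inference, epochs=5):
    # examples: list of (x, y_true) with y_true a tuple of local labels
    # feature_fn(x, i, label) -> dict of feature -> value for local decision i
    # constrained_inference(x, w) -> predicted tuple of local labels (e.g., via ILP)
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y_true in examples:
            y_pred = constrained_inference(x, w)
            for i, (yt, yp) in enumerate(zip(y_true, y_pred)):
                if yt != yp:
                    for f, v in feature_fn(x, i, yt).items():
                        w[f] += v
                    for f, v in feature_fn(x, i, yp).items():
                        w[f] -= v
    return w
```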
Page 30
Claims
 When the local classification problems are “easy”, L+I outperforms IBT.
  In many applications, the components are identifiable and easy to learn (e.g., argument, open-close, PER).
 Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it needs a larger number of training examples.
 When data is scarce and problems are not easy, constraints can be used, along with a “weak” model, to label unlabeled data and improve the model.
 Experimental results and theoretical intuition support these claims.
L+I: cheaper computationally; modular. IBT is better in the limit, and in other extreme cases. Combinations (L+I, and then IBT) are possible.
Page 31
Bound Prediction
L+I vs. IBT: the more identifiable the individual problems are, the better the overall performance is with L+I.
 Local:   ε ≤ ε_opt + ( ( d log m + log 1/δ ) / m )^(1/2)
 Global:  ε ≤ 0 + ( ( c·d log m + c²·d + log 1/δ ) / m )^(1/2)
  (c is an indication of the hardness of the problem)
[Figure: bounds on simulated data for ε_opt = 0, 0.1, 0.2]
Page 32
Relative Merits: SRL
[Figure: performance of L+I vs. IBT as a function of the difficulty of the learning problem (# features), from easy to hard]
 For the easier problems, L+I is better. When the problem is artificially made harder, the tradeoff is clearer.
 In some cases problems are hard due to lack of training data → semi-supervised learning.
Page 33
Information Extraction with Background Knowledge (Constraints)
Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 .
Prediction result of a trained HMM:
  [AUTHOR]      Lars Ole Andersen . Program analysis and
  [TITLE]       specialization for the
  [EDITOR]      C
  [BOOKTITLE]   Programming language
  [TECH-REPORT] . PhD thesis .
  [INSTITUTION] DIKU , University of Copenhagen , May
  [DATE]        1994 .
Violates lots of constraints!
Page 34
Examples of Constraints
 Each field must be a consecutive list of words, and can appear at most once in a citation.
 State transitions must occur on punctuation marks.
 The citation can only start with AUTHOR or EDITOR.
 The words pp., pages correspond to PAGE.
 Four digits starting with 20xx and 19xx are DATE.
 Quotations can appear only in TITLE.
 ...
Page 35
Information Extraction with Constraints
 Adding constraints, we get correct results!
  [AUTHOR]      Lars Ole Andersen .
  [TITLE]       Program analysis and specialization for the C Programming language .
  [TECH-REPORT] PhD thesis .
  [INSTITUTION] DIKU , University of Copenhagen ,
  [DATE]        May, 1994 .
 If incorporated into the training, better results mean better feedback!
Page 36
Semi-Supervised Learning with Constraints
 In traditional semi-supervised learning the model can drift away from the correct one.
 Constraints can be used
  At decision time, to bias the objective function towards favoring constraint satisfaction.
  At training time, to improve the labeling of unlabeled data (and thus improve the model).
(Diagram: the model plus constraints are applied to unlabeled data, which feeds back into training; constraints are also applied at decision time.)
Page 37
Constraint-Driven Learning (CODL) [Chang, Ratinov, Roth, ACL’07]
  θ = learn(Tr)                   // any supervised learning algorithm, parametrized by θ
  For N iterations do:
    T = ∅
    For each x in the unlabeled dataset:
      y = Inference(x, C, θ)      // any inference algorithm (with constraints)
      T = T ∪ {(x, y)}            // augmenting the training set (feedback)
Page 38
Constraint-Driven Learning (CODL) [Chang, Ratinov, Roth, ACL’07]
  θ = learn(Tr)
  For N iterations do:
    T = ∅
    For each x in the unlabeled dataset:
      {y1, ..., yK} = Top-K-Inference(x, C, θ)
      T = T ∪ {(x, yi)}, i = 1..K
    θ = γ·θ + (1-γ)·learn(T)      // learn from the new training data; weight the supervised and unsupervised models (Nigam 2000)
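A compact sketch of the CODL loop above; `learn`, `top_k_inference`, and `mix` are placeholders for the supervised learner, the constrained-inference procedure, and the γ-interpolation of parameters:

```python
# Constraint-driven learning (CODL) loop, following the pseudocode above.
# All callables are placeholders; gamma mixes the supervised and self-trained models.
def codl(labeled, unlabeled, constraints, learn, top_k_inference, mix,
         n_iterations=10, k=5, gamma=0.9):
    theta = learn(labeled)                              # supervised start
    for _ in range(n_iterations):
        self_labeled = []
        for x in unlabeled:
            # Constrained inference proposes the K best consistent labelings of x.
            for y in top_k_inference(x, constraints, theta, k):
                self_labeled.append((x, y))
        # Re-train on the self-labeled data and interpolate with the old parameters.
        theta = mix(gamma, theta, learn(self_labeled))
    return theta
```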
Page 39
Token-based accuracy (inference with constraints)
[Chart: token-based accuracy (60-100%) vs. labeled set size (5, 10, 15, 20, 25, 300), comparing supervised learning, Weighted EM, and CODL]
Page 40
Semi-Supervised Learning with Constraints
[Chart: objective-function value vs. # of available labeled examples; learning with constraints is compared against learning without constraints on 300 examples]
 Constraints are used to bootstrap a semi-supervised learner.
 A poor model + constraints are used to annotate unlabeled data, which in turn is used to keep training the model.
Page 41
Conclusions
 Discussed a general paradigm for learning and inference in the context of natural language understanding tasks.
 A general constrained-optimization approach for integrating learned models with additional (declarative or statistical) expressivity.
 A paradigm for making Machine Learning practical – allowing domain/task-specific constraints.
 How to train?
  Learn locally and make use globally (via global inference) [Punyakanok et al. IJCAI’05]
  Ability to make use of domain knowledge & constraints to drive supervision [Klementiev & Roth, ACL’06; Chang, Ratinov, Roth, ACL’07]
 LBJ (Learning Based Java): http://L2R.cs.uiuc.edu/~cogcomp
  A modeling language that supports programming along with building learned models, and allows incorporating constraints and performing inference with them.
Page 42
Problem Setting (backup slide; repeated from Page 13)
Page 43