Constraints Driven Structured Learning with Indirect Supervision

Dan Roth
Department of Computer Science
University of Illinois at Urbana-Champaign

With thanks to collaborators: Ming-Wei Chang, James Clarke, Dan Goldwasser, Lev Ratinov, Vivek Srikumar, and many others

Carnegie Mellon University, April 2010

Funding: NSF: ITR IIS-0085836, SoD-HCER-0613885; DHS; DARPA: Bootstrap Learning & Machine Reading programs; DASH Optimization (Xpress-MP)
Nice to Meet You
Comprehension

A process that maintains and updates a collection of propositions about the state of affairs.
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in
England. He is the same person that you read about in the book, Winnie the
Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When
Chris was three years old, his father wrote a poem about him. The poem was
printed in a magazine for others to read. Mr. Robin then wrote a book. He
made up a fairy tale land where Chris lived. His friends were animals. There
was a bear called Winnie the Pooh. There was also an owl and a young pig,
called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin
made them come to life with his words. The places in the story were all near
Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to
read about Christopher Robin and his animal friends. Most people don't know
he is a real person who is grown now. He has written two books of his own.
They tell what it is like to be famous.
1. Christopher Robin was born in England.
2. Winnie the Pooh is a title of a book.
3. Christopher Robin's dad was a magician.
4. Christopher Robin must be at least 65 now.
This is an Inference Problem
Coherency in Semantic Role Labeling
Predicate-arguments generated should be consistent across phenomena
The touchdown scored by Bettis cemented the victory of the Steelers.

[Figure: the same predicate-argument structures expressed across three phenomena]
- Verb ("scored") → Predicate: score; A0: Bettis (scorer); A1: the touchdown (points scored)
- Nominalization ("victory") → Predicate: win; A0: the Steelers (winner)
- Preposition ("of", Sense: 11(6)) → "the object of the preposition is the object of the underlying verb of the nominalization"

Linguistic constraints:
- A0: the Steelers ⇔ Sense(of): 11(6)
- A0: Bettis ⇔ Sense(by): 1(1)
Semantic Parsing
X: "What is the largest state that borders New York and Maryland?"
Y: largest( state( next_to( state(NY)) AND next_to( state(MD))))

Successful interpretation involves multiple decisions:
- What entities appear in the interpretation?
- Does "New York" refer to a state or a city?
- How to compose fragments together?
  state( next_to(...)) vs. next_to( state(...))
Learning and Inference
- Natural Language Decisions are Structured
  - Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome.
  - It is essential to make coherent decisions in a way that takes the interdependencies into account: joint, global inference.
- But: learning structured models requires annotating structures.
Constrained Conditional Models (aka ILP Inference)
y* = argmax_y  w·φ(x,y) − Σ_k ρ_k d_{C_k}(x,y)

- w: weight vector for the "local" models.
- φ(x,y): features, classifiers; log-linear models (HMM, CRF) or a combination.
- ρ_k: penalty for violating the constraint; the (soft) constraints component.
- d_{C_k}(x,y): how far y is from a "legal" assignment.

CCMs can be viewed as a general interface to easily combine domain knowledge with data-driven statistical models.

How to solve? This is an Integer Linear Program. Solving with ILP packages gives an exact solution; search techniques are also possible.
How to train? Training is learning the objective function. How can the structure be exploited to minimize supervision?
Example: Semantic Role Labeling
Who did what to whom, when, where, why,…
I left my pearls to my daughter in my will .
[I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC .
- A0: leaver
- A1: things left
- A2: benefactor
- AM-LOC: location

I left my pearls to my daughter in my will .
[Figure: overlapping candidate arguments over the sentence]

Constraints: no overlapping arguments; if A2 is present, A1 must also be present.
Semantic Role Labeling (2/2)
- PropBank [Palmer et al. 05] provides a large human-annotated corpus of semantic verb-argument relations.
  - It adds a layer of generic semantic labels to Penn Treebank II.
  - (Almost) all the labels are on the constituents of the parse trees.
- Core arguments: A0-A5 and AA
  - Different semantics for each verb, specified in the PropBank frame files.
- 13 types of adjuncts labeled AM-arg, where arg specifies the adjunct type.
Algorithmic Approach

- Identify argument candidates (identify the vocabulary of candidate arguments)
  - Pruning [Xue & Palmer, EMNLP'04]
  - Argument Identifier: binary classification (SNoW)
    [Figure: overlapping candidate brackets over "I left my nice pearls to her"]
- Classify argument candidates
  - Argument Classifier: multi-class classification (SNoW)
- Inference (over the old and new vocabulary)
  - Use the estimated probability distribution given by the argument classifier.
  - Use structural and linguistic constraints.
  - Infer the optimal global output.
    [Figure: the final, globally consistent bracketing of the sentence]
Semantic Role Labeling (SRL)

I left my pearls to my daughter in my will .

[Figure: for each candidate argument, the classifier's score distribution over argument types (A0, A1, A2, AM-LOC, ...)]

One inference problem for each verb predicate.
Integer Linear Programming Inference

- For each argument a_i, set up a Boolean variable a_{i,t} indicating whether a_i is classified as type t.
- Goal is to maximize  Σ_i Σ_t score(a_i = t) · a_{i,t}
- Subject to the (linear) constraints (sketched in code below).

If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of arguments that are correct and satisfies the constraints.

The Constrained Conditional Model is completely decomposed during training.
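To make the formulation concrete, here is a minimal sketch of this argmax as an ILP using the open-source PuLP package. This is not the system from the talk; the candidates, label set, and scores are invented for illustration.

```python
import pulp

candidates = ["c0", "c1", "c2"]          # hypothetical argument candidates
labels = ["A0", "A1", "R-A0", "null"]    # hypothetical label set
score = {(c, t): 0.0 for c in candidates for t in labels}
score.update({("c0", "A0"): 0.7, ("c0", "null"): 0.2,
              ("c1", "A1"): 0.6, ("c1", "A0"): 0.3,
              ("c2", "R-A0"): 0.5, ("c2", "null"): 0.4})

prob = pulp.LpProblem("srl_inference", pulp.LpMaximize)
# Boolean variable x[c, t]: candidate c is classified as type t
x = {(c, t): pulp.LpVariable(f"x_{c}_{t}", cat="Binary")
     for c in candidates for t in labels}

# objective: sum_{c,t} score(c = t) * x[c, t]
prob += pulp.lpSum(score[c, t] * x[c, t] for c in candidates for t in labels)

# each candidate receives exactly one label
for c in candidates:
    prob += pulp.lpSum(x[c, t] for t in labels) == 1
# no duplicate argument classes (null may repeat)
for t in ["A0", "A1"]:
    prob += pulp.lpSum(x[c, t] for c in candidates) <= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({c: t for c in candidates for t in labels if x[c, t].value() == 1})
```

The ILP variables are exactly the a_{i,t} indicators of the slide; any off-the-shelf solver can then return the exact argmax.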
Constraints

Any Boolean rule can be encoded as a (collection of) linear constraints.

- No duplicate argument classes:
  Σ_{a ∈ POTARG} x{a = A0} ≤ 1
- R-ARG: if there is an R-ARG phrase, there is an ARG phrase (see the code sketch below):
  ∀ a2 ∈ POTARG:  Σ_{a ∈ POTARG} x{a = A0} ≥ x{a2 = R-A0}
- C-ARG: if there is a C-ARG phrase, there is an ARG before it (universally quantified rules):
  ∀ a2 ∈ POTARG:  Σ_{a ∈ POTARG, a before a2} x{a = A0} ≥ x{a2 = C-A0}

Many other possible constraints:
- Unique labels.
- No overlapping or embedding.
- Relations between number of arguments; order constraints.
- If verb is of type A, no argument of type B.

LBJ allows a developer to encode constraints in FOL; these are compiled into linear inequalities automatically.
Joint inference can also be used to combine different (SRL) systems.
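Continuing the invented PuLP sketch from the ILP slide, the R-ARG rule ("if some candidate is labeled R-A0, some candidate must be labeled A0") translates directly into one linear inequality per candidate:

```python
# for every a2 in POTARG:  sum_{a in POTARG} x[a, "A0"]  >=  x[a2, "R-A0"]
for a2 in candidates:
    prob += pulp.lpSum(x[a, "A0"] for a in candidates) >= x[a2, "R-A0"]
```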
Learning and Inference
- Natural Language Decisions are Structured
  - Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome.
  - It is essential to make coherent decisions in a way that takes the interdependencies into account: joint, global inference.
- But: learning structured models requires annotating structures.
Information Extraction without Prior Knowledge

Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .

Prediction result of a trained HMM:
[AUTHOR]       Lars Ole Andersen . Program analysis and
[TITLE]        specialization for the
[EDITOR]       C
[BOOKTITLE]    Programming language
[TECH-REPORT]  . PhD thesis .
[INSTITUTION]  DIKU , University of Copenhagen , May
[DATE]         1994 .

Violates lots of natural constraints!
Examples of Constraints

- Each field must be a consecutive list of words and can appear at most once in a citation.
- State transitions must occur on punctuation marks.
- The citation can only start with AUTHOR or EDITOR.
- The words "pp.", "pages" correspond to PAGE.
- Four-digit numbers starting with 20 or 19 are DATE.
- Quotations can appear only in TITLE.
- ...

Easy to express pieces of "knowledge". Non-propositional; may use quantifiers.
Information Extraction with Constraints

Adding constraints, we get correct results, without changing the model!

[AUTHOR]       Lars Ole Andersen .
[TITLE]        Program analysis and specialization for the C Programming language .
[TECH-REPORT]  PhD thesis .
[INSTITUTION]  DIKU , University of Copenhagen ,
[DATE]         May, 1994 .
Guiding Semi-Supervised Learning with Constraints
- In traditional semi-supervised learning the model can drift away from the correct one.
- Constraints can be used to generate better training data:
  - At training time, to improve the labeling of unlabeled data (and thus improve the model).
  - At decision time, to bias the objective function towards favoring constraint satisfaction.

[Diagram: constraints feed into both the model (via unlabeled data) and decision-time inference]
Constraints Driven Learning (CoDL)
[Chang, Ratinov, Roth, ACL'07; ICML'08; Long'10]

(w0, ρ0) = learn(L)    // supervised learning algorithm parameterized by (w, ρ);
                       // learning can be justified as an optimization procedure
                       // for an objective function
For N iterations do:
    T = ∅
    For each x in the unlabeled dataset:
        h ← argmax_y  w·φ(x,y) − Σ_k ρ_k d_C(x,y)    // inference with constraints
        T = T ∪ {(x, h)}                             // augment the training set
    (w, ρ) = γ(w0, ρ0) + (1 − γ) learn(T)            // learn from the new training
                                                     // data; γ weighs the supervised
                                                     // and unsupervised models

Excellent experimental results show the advantages of using constraints, especially with small amounts of labeled data [Chang et al., others].
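Read as code, the loop might look like the following sketch; `learn` and `constrained_inference` are placeholders for the supervised learner and the CCM inference step, and the parameters are assumed to support vector arithmetic (e.g., numpy arrays).

```python
def codl(labeled, unlabeled, learn, constrained_inference, gamma=0.9, iters=10):
    """Sketch of the CoDL loop; `learn` returns model parameters (w, rho)."""
    w0, rho0 = learn(labeled)              # supervised starting point
    w, rho = w0, rho0
    for _ in range(iters):
        # label the unlabeled data by inference with constraints:
        #   h = argmax_y  w.phi(x, y) - sum_k rho_k * d_C(x, y)
        T = [(x, constrained_inference(x, w, rho)) for x in unlabeled]
        w_T, rho_T = learn(T)              # learn from the new training data
        # convex combination weighs the supervised and unsupervised models
        w = gamma * w0 + (1 - gamma) * w_T
        rho = gamma * rho0 + (1 - gamma) * rho_T
    return w, rho
```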
Constraints Driven Learning (CoDL)
[Chang et al. 07, 08; others]

Semi-supervised learning paradigm that makes use of constraints to bootstrap from a small number of examples.

[Graph: objective-function performance vs. # of available labeled examples, comparing learning with 10 constraints against learning without constraints (300 examples); a poor model + constraints closes much of the gap]

Constraints are used to:
- Bootstrap a semi-supervised learner.
- Correct the weak model's predictions on unlabeled data, which in turn are used to keep training the model.
Learning and Inference
- Natural Language Decisions are Structured
  - Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome.
  - It is essential to make coherent decisions in a way that takes the interdependencies into account: joint, global inference.
- But: learning structured models requires annotating structures.
- Interdependencies among decision variables should be exploited in learning.
  - Goal: use minimal, indirect supervision.
  - Amplify it using interdependencies among variables.
Two Ideas
- Idea 1: Simple, easy-to-supervise binary decisions often depend on the structure you care about. Learning to do well on the binary task can drive the structure learning.
- Idea 2: Global inference can be used to amplify the minimal supervision.
- Idea 2½: There are several settings where a binary label can be used to replace a structured label. Perhaps the most intriguing is where you use the world's response to the model's actions.
Outline
- Inference
- Semi-supervised Training Paradigms for structures
  - Constraints Driven Learning
- Indirect Supervision Training Paradigms for structure
  - Indirect supervision training with latent structure [NAACL'10]
    - Transliteration; Textual Entailment; Paraphrasing
  - Training structure predictors by inventing (easy-to-supervise) binary labels [ICML'10]
    - POS, information extraction tasks
  - Driving the supervision signal from the World's Response [CoNLL'10]
    - Semantic Parsing
Textual Entailment
Text: Former military specialist Carpenter took the helm at FictitiousCom Inc. after five years as press official at the United States embassy in the United Kingdom.
Hypothesis: Jim Carpenter worked for the US Government.  ⇒ Entailment

Entailment requires an intermediate representation:
[Figure: alignment graph between text nodes and hypothesis nodes (x1 ... x7)]
- Alignment-based features.
- Given the intermediate features, learn a decision: Entail / Does not Entail.

But only positive entailments are expected to have a meaningful intermediate representation.
Paraphrase Identification

Given an input x ∈ X, learn a model f : X → {−1, 1}.

Consider the following sentences:
- S1: Druce will face murder charges, Conte said.
- S2: Conte said Druce will be charged with murder.

Are S1 and S2 paraphrases of each other? There is a need for an intermediate representation to justify this decision: we need latent variables that explain why this is a positive example.

Given an input x ∈ X, learn a model f : X → H → {−1, 1}.
Algorithms: Two Conceptual Approaches
- Two-stage approach (typically used for TE and paraphrase identification):
  - Learn the hidden variables; fix them.
    - Needs supervision for the hidden layer (or heuristics).
  - For each example, extract features over x and (the fixed) h.
  - Learn a binary classifier.
- Proposed approach: joint learning
  - Drive the learning of h from the binary labels.
  - Find the best h(x).
  - An intermediate structure representation is good to the extent it supports better final prediction.
  - Algorithm?
Learning with Constrained Latent Representation (LCLR): Intuition

- If x is positive:
  - There must exist a good explanation (intermediate representation):
    ∃h: wᵀφ(x,h) ≥ 0,  or equivalently  max_h wᵀφ(x,h) ≥ 0
  - The chosen h selects a representation, yielding a new feature vector for the final decision.
- If x is negative:
  - No explanation is good enough to support the answer:
    ∀h: wᵀφ(x,h) ≤ 0,  or equivalently  max_h wᵀφ(x,h) ≤ 0
- Altogether, this can be combined into an objective function:
  min_w  λ/2 ‖w‖² + C Σ_i ℓ(1 − z_i max_{h ∈ C} Σ_s wᵀ h_s φ_s(x_i))
- Why does inference help? Inference finds the best h subject to the constraints C.
Optimization

- Non-convex, due to the maximization term inside the global minimization problem.
- In each iteration (see the sketch below):
  - Find the best feature representation h* for all positive examples (off-the-shelf ILP solver).
  - Having fixed the representation for the positive examples, update w by solving the convex optimization problem (not the standard SVM/LR: it needs inference).
- Asymmetry: only positive examples require a good intermediate representation that justifies the positive label.
- Consequently, the objective function decreases monotonically.
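A rough sketch of this alternating scheme, with the convex w-update simplified to subgradient steps on the hinge loss; `phi` (the feature function) and `best_h` (the ILP inference step) are assumed helpers, not part of the original.

```python
import numpy as np

def lclr_train(examples, phi, best_h, dim, lam=0.01, lr=0.1, epochs=20):
    """examples: list of (x, z) with binary label z in {+1, -1}."""
    w = np.zeros(dim)
    for _ in range(epochs):
        # Step 1: fix the best representation h* for the positive examples only
        h_pos = {i: best_h(x, w) for i, (x, z) in enumerate(examples) if z == +1}
        # Step 2: update w on the loss l(1 - z * max_h w.phi(x, h));
        # negatives keep maximizing over h, positives use the fixed h*
        for i, (x, z) in enumerate(examples):
            h = h_pos[i] if z == +1 else best_h(x, w)
            f = phi(x, h)
            if z * w.dot(f) < 1:          # hinge margin violated
                w -= lr * (lam * w - z * f)
            else:                          # only the regularizer contributes
                w -= lr * lam * w
    return w
```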
Iterative Objective Function Learning

[Loop: initial objective function → inference for h subject to C → generate features → prediction with inferred h → training w.r.t. the binary decision label → update weight vector → back to inference]

- Formalized as a structured SVM + constrained hidden structure.
- LCLR: Learning with Constrained Latent Representation.
Learning with Constrained Latent Representation (LCLR): Framework

- LCLR provides a general inference formulation that allows the use of expressive constraints.
  - It can be flexibly adapted for many tasks that require latent representations.
- Declarative model for paraphrasing: model the inputs as graphs, V(G1,2), E(G1,2).
  - Hidden variables: h_{v1,v2}, the possible vertex mappings, and h_{e1,e2}, the possible edge mappings.
  - Constraints (see the sketch below):
    - Each vertex in G1 can be mapped to a single vertex in G2 or to null.
    - Each edge in G1 can be mapped to a single edge in G2 or to null.
    - An edge mapping is active iff the corresponding node mappings are active.
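As a sketch of how these declarative constraints become linear inequalities (again with PuLP; the tiny graphs and variable names are invented):

```python
import pulp

V1, V2 = ["u0", "u1"], ["v0", "v1"]
E1, E2 = [("u0", "u1")], [("v0", "v1")]

prob = pulp.LpProblem("graph_alignment", pulp.LpMaximize)
hv = {(a, b): pulp.LpVariable(f"hv_{a}_{b}", cat="Binary") for a in V1 for b in V2}
he = {(i, j): pulp.LpVariable(f"he_{i}_{j}", cat="Binary")
      for i in range(len(E1)) for j in range(len(E2))}

# each vertex in G1 maps to at most one vertex in G2 (otherwise to null)
for a in V1:
    prob += pulp.lpSum(hv[a, b] for b in V2) <= 1
# each edge in G1 maps to at most one edge in G2 (otherwise to null)
for i in range(len(E1)):
    prob += pulp.lpSum(he[i, j] for j in range(len(E2))) <= 1
# edge mapping is active iff the corresponding node mappings are active
for i, (a1, a2) in enumerate(E1):
    for j, (b1, b2) in enumerate(E2):
        prob += he[i, j] <= hv[a1, b1]                     # "only if" direction
        prob += he[i, j] <= hv[a2, b2]
        prob += he[i, j] >= hv[a1, b1] + hv[a2, b2] - 1    # "if" direction
```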
Experimental Results

[Charts: results for Transliteration, Recognizing Textual Entailment, and Paraphrase Identification*]
Outline
- Inference
- Semi-supervised Training Paradigms for structures
  - Constraints Driven Learning
- Indirect Supervision Training Paradigms for structure
  - Indirect supervision training with latent structure
    - Transliteration; Textual Entailment; Paraphrasing
  - Training structure predictors by inventing (easy-to-supervise) binary labels
    - POS, information extraction tasks
  - Driving the supervision signal from the World's Response
    - Semantic Parsing
Structured Prediction

- Before, the structure was in the intermediate level:
  - We cared about the structured representation only to the extent it helped the final binary decision.
  - The binary decision variable was given as supervision.
- What if we care about the structure?
  - Information extraction; relation extraction; POS tagging; many others.
- Invent a companion binary decision problem!
  - Parse citations: Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 .
    - Companion: given a citation, does it have a legitimate parse?
  - POS tagging
    - Companion: given a word sequence, does it have a legitimate POS tagging sequence?
Predicting Phonetic Alignment (for Transliteration)

[Figure: character-level alignments between English named entities ("Italy", "Illinois") and their Hebrew transliterations; the companion task's output is Yes/No]

Target Task
- Input: an English named entity and its Hebrew transliteration.
- Output: phonetic alignment (character-sequence mapping).
- A structured-output prediction task (many constraints), hard to label.

Companion Task
- Input: an English named entity and a Hebrew named entity.
- Companion output: do they form a transliteration pair?
- A binary output problem, easy to label.
- Negative examples are FREE, given positive examples.

Why is it a companion task?
Companion Task Binary Label as Indirect Supervision

- The two tasks are related just like the binary and structured tasks discussed earlier:
  - All positive examples must have a good structure: positive transliteration pairs must have "good" phonetic alignments.
  - Negative examples cannot have a good structure: negative transliteration pairs cannot have "good" phonetic alignments.
- We are in the same setting as before:
  - Binary labeled examples are easier to obtain.
  - We can take advantage of this to help learn a structured model.
  - Here: combine binary learning and structured learning.
Learning Structure with Indirect Supervision

- In this case we care about the predicted structure.
- Use both structural learning and binary learning.

[Figure: the feasible structures of an example, with predicted vs. correct structures; negative examples cannot have a good structure, so they restrict the space of hyperplanes supporting the decisions for x]
Joint Learning with Indirect Supervision (J-LIS)

Joint learning: if available, make use of both supervision types (the structured target task and the Yes/No companion task).

min_w  ½ wᵀw + C1 Σ_{i∈S} L_S(x_i, y_i; w) + C2 Σ_{i∈B} L_B(x_i, z_i; w)

- L_S: loss on the target (structural) task; L_B: loss on the companion (binary) task, as before.
- Key: the same parameter vector w is used for both components.
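In code, the objective is just the weighted sum of the two losses under one shared w; `L_S` and `L_B` below are placeholder loss functions, not the paper's exact definitions.

```python
import numpy as np

def jlis_objective(w, S, B, L_S, L_B, C1=1.0, C2=1.0):
    """min_w 0.5*w'w + C1*sum_{i in S} L_S(x_i,y_i;w) + C2*sum_{i in B} L_B(x_i,z_i;w)."""
    reg = 0.5 * np.dot(w, w)
    structured = C1 * sum(L_S(x, y, w) for x, y in S)   # loss on target task
    binary = C2 * sum(L_B(x, z, w) for x, z in B)       # loss on companion task
    return reg + structured + binary
```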
Experimental Result

- Very little direct (structured) supervision.
- A large amount of (almost free) binary indirect supervision.
Relations to Other Frameworks

- If B = ∅ and ℓ is the (squared) hinge loss: Structural SVM.
- If S = ∅: LCLR.
  - Related to Structural Latent SVM (Yu & Joachims) and to Felzenszwalb et al.
- If S = ∅, conceptually related to Contrastive Estimation:
  - No "grouping" of good examples and bad neighbors.
  - Max vs. sum: we do not marginalize over the hidden structure space.
    - This allows for complex domain-specific constraints.
- Related to some semi-supervised approaches, but can use negative examples (Sheffer et al.).
Outline
- Inference
- Semi-supervised Training Paradigms for structures
  - Constraints Driven Learning
- Indirect Supervision Training Paradigms for structure
  - Indirect supervision training with latent structure
    - Transliteration; Textual Entailment; Paraphrasing
  - Training structure predictors by inventing (easy-to-supervise) binary labels
    - POS, information extraction tasks
  - Driving the supervision signal from the World's Response
    - Semantic Parsing
Connecting Language to the World
"Can I get a coffee with no sugar and just a bit of milk?"
  → Semantic Parser → MAKE(COFFEE, SUGAR=NO, MILK=LITTLE)
  → response: Great! / Arggg

Can we rely on this interaction to provide supervision?
Real World Feedback

Supervision = Expected Response

Traditional approach: learn from logical forms and gold alignments. EXPENSIVE!
Our approach: use only the responses.

x: NL query: "What is the largest state that borders NY?"
y: logical query: largest( state( next_to( const(NY))))
r: query response (from an interactive computer system): Pennsylvania

Binary supervision: check whether Predicted response == Expected response.
- Expected: Pennsylvania; Predicted: Pennsylvania → positive response
- Expected: Pennsylvania; Predicted: NYC → negative response

Semantic parsing is a structured prediction problem: identify mappings from text to a meaning representation.
Train a structured predictor with this binary supervision!
Learning Structures with a Binary Signal

Protocol I: direct learning with binary supervision. It uses predicted structures as examples for learning a binary decision:
- Inference is used to predict the query: (y, z) = argmax_{y,z} wᵀφ(x, y, z)
- Positive feedback: add a positive binary example.
- Negative feedback: add a negative binary example.
- The learned parameters form the objective function; iterate.

  − "What is the largest state that borders NY?"  largest( state( next_to( const(NJ))))
  + "What is the largest state that borders NY?"  largest( state( next_to( const(NY))))
  − "What is the smallest state?"  state( next_to( const(NY)))
Learning Structures with a Binary Signal

Protocol II: aggressive learning with binary supervision. Positive feedback is given IFF the structure is correct (both protocols are sketched in code below):
- (y, z) = argmax_{y,z} wᵀφ(x, y, z)
- Train a structured predictor from these structures:
  - Positive feedback: add a positive structured example.
  - Iterate until no new structures are found.

  + "What is the largest state that borders NY?"  largest( state( next_to( const(NY))))
    → Pennsylvania (interactive computer system): correct response!
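The two protocols can be rendered schematically as follows; `predict`, `execute` (the interactive system), and the two training routines are stand-ins for components the slides leave abstract.

```python
def protocol_one(queries, w, predict, execute, train_binary):
    """Protocol I: predicted structures become binary training examples."""
    binary_examples = []
    for x, expected in queries:
        y = predict(x, w)                      # (y,z) = argmax w.phi(x,y,z)
        z = +1 if execute(y) == expected else -1
        binary_examples.append((x, y, z))
    return train_binary(binary_examples, w)    # new parameters; iterate

def protocol_two(queries, w, predict, execute, train_structured):
    """Protocol II: keep a structure only if its response is correct."""
    structures = []
    for x, expected in queries:
        y = predict(x, w)
        if execute(y) == expected:             # positive feedback IFF correct
            structures.append((x, y))          # add a positive structured example
    return train_structured(structures, w)     # iterate until no new structures
```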
Empirical Evaluation

Key question: can we learn from this type of supervision?

Algorithm                                           # training structures   Test set accuracy
No learning: initial objective fn                   0                       22.2%
Binary signal: Protocol I                           0                       69.2%
Binary signal: Protocol II                          0                       73.2%
WM* 2007 (fully supervised; uses gold structures)   310                     75%

*[WM] Y.-W. Wong and R. Mooney. 2007. Learning synchronous grammars for semantic parsing with lambda calculus. ACL.
Summary

- Constrained Conditional Models: a computational framework for global inference and a vehicle for incorporating knowledge.
- Direct supervision for structured NLP tasks is expensive; indirect supervision is cheap and easy to obtain.
- We suggested learning protocols for indirect supervision:
  - Make use of simple, easy-to-get binary supervision.
  - Showed how to use it to learn structure.
  - Done in the context of Constrained Conditional Models: inference is an essential part of propagating the simple supervision.
- Learning structures from real-world feedback:
  - Obtain binary supervision from "real world" interaction.
  - Indirect supervision replaces direct supervision.
Features Versus Constraints

φ_i : X × Y → R;   C_i : X × Y → {0,1};   d : X × Y → R

- Mathematically, soft constraints are features.
- In principle, constraints and features can encode the same properties; in practice, they are very different.
- Features
  - Local, short-distance properties, to support tractable inference.
  - Propositional (grounded): e.g., true if "the" followed by a noun occurs in the sentence.
- Constraints
  - Global properties.
  - Quantified, first-order logic expressions: e.g., true iff "all y_i in the sequence y are assigned different values."
- If φ(x,y) = φ(x), constraints provide an easy way to introduce dependence on y.
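To illustrate the type signatures above with invented toy definitions: a feature is a local, propositional scorer, while a constraint is a global Boolean test over the whole assignment.

```python
def phi_the_noun(x, y):
    """phi_i : X x Y -> R. Local, propositional: counts occurrences of
    "the" followed by a token tagged NOUN."""
    return float(sum(1 for i in range(len(x) - 1)
                     if x[i] == "the" and y[i + 1] == "NOUN"))

def c_all_different(x, y):
    """C_i : X x Y -> {0,1}. Global, quantified: 1 iff all y_i in the
    sequence y are assigned different values."""
    return int(len(set(y)) == len(y))
```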
Constraints As a Way To Encode Prior Knowledge

- Consider encoding the knowledge that entities of type A and B cannot occur simultaneously in a sentence.
- The "feature" way:
  - Requires larger models.
  - Needs more training data.
- The constraints way: an effective way to inject knowledge.
  - Keeps the model simple; adds expressive constraints directly.
  - A small set of constraints.
  - Allows for decision-time incorporation of constraints.
- We can use constraints as a way to replace training data.
  - Allows one to learn simpler models.