Unified Expectation Maximization
Rajhans Samdani
Joint work with Ming-Wei Chang (Microsoft Research) and Dan Roth
University of Illinois at Urbana-Champaign
NAACL 2012, Montreal
Page 1
Weakly Supervised Learning in NLP

Labeled data is scarce and difficult to obtain

A lot of work on learning with a small amount of labeled data

The Expectation Maximization (EM) algorithm is the de facto
standard

More recently: significant work on injecting weak supervision
or domain knowledge via constraints into EM


Constraint-driven Learning (CoDL; Chang et al, 07)
Posterior regularization (PR; Ganchev et al, 10)
Page 2
Weakly Supervised Learning: EM and …?

Several variants of EM exist in the literature, e.g. hard EM

Variants of constrained EM: CoDL and PR

Which version to use: EM (PR) vs. hard EM (CoDL)?


Or is there something better out there?
OUR CONTRIBUTION: a unified framework for EM algorithms,
Unified EM (UEM)

Includes existing EM algorithms
Picks the most suitable EM algorithm in a simple, adaptive, and
principled way

Adapting to data, initialization, and constraints

Page 3
Outline

Background: Expectation Maximization (EM)

EM with constraints

Unified Expectation Maximization (UEM)

Optimization Algorithm for the E-step

Experiments
Page 4
Predicting Structures in NLP

Predict the output or dependent variable y from the space of
allowed outputs Y given input variable x using parameters or
weight vector w

E.g.




predict POS tags given a sentence,
predict word alignments given sentences in two different languages,
predict the entity-relation structure from a document
Prediction is expressed as
y* = argmax_{y ∈ Y} P(y | x; w)
Page 5
Learning Using EM: a Quick Primer


Given unlabeled data: x, estimate w; hidden: y
for t = 1 … T do
 E-step: estimate a posterior distribution q over y:
q_t(y) = argmin_q KL(q(y), P(y|x; w_t)) = P(y|x; w_t)    (Neal and Hinton, 99)
Here q is the posterior distribution and P(y|x; w_t) is the
conditional distribution of y given w
 M-step: estimate the parameters w w.r.t. q:
w_{t+1} = argmax_w E_q[log P(x, y; w)]
Page 6
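To make the two steps concrete, here is a minimal runnable sketch of this EM loop for a toy mixture of two biased coins (a hypothetical example, not from the talk): x is a set of observed flip sequences, the hidden y is which coin produced each sequence, and w holds the two coin biases, with a uniform prior over coins assumed.

import numpy as np

def em(flips, T=50):
    # flips: (n_seqs, n_flips) array of 0/1 outcomes; hidden y = which coin
    # generated each row; w = the two coin biases
    heads = flips.sum(axis=1)
    n = flips.shape[1]
    w = np.array([0.4, 0.6])                      # initial guess w_0
    for t in range(T):
        # E-step: q_t(y) = P(y | x; w_t), with a uniform prior over coins
        lik = np.stack([wc**heads * (1 - wc)**(n - heads) for wc in w], axis=1)
        q = lik / lik.sum(axis=1, keepdims=True)
        # M-step: w_{t+1} = argmax_w E_q[log P(x, y; w)]
        # = expected heads / expected flips under q, for each coin
        w = (q * heads[:, None]).sum(axis=0) / (q.sum(axis=0) * n)
    return w

rng = np.random.default_rng(0)
coin = rng.integers(0, 2, size=40)
flips = rng.binomial(1, np.where(coin == 0, 0.3, 0.8)[:, None], size=(40, 10))
print(em(flips))   # recovers biases near 0.3 and 0.8 (possibly label-swapped)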
Another Version of EM: Hard EM
Standard EM
 E-step:
q_t(y) = argmin_q KL(q(y), P(y|x; w_t))
 M-step:
argmax_w E_q[log P(x, y; w)]

Hard EM
 E-step:
y* = argmax_y P(y|x; w)
q(y) = δ_{y = y*}
 M-step:
argmax_w E_q[log P(x, y; w)]

Not clear which version to use!
Page 7
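In code, the hard E-step differs from the soft one in a single step: instead of normalizing the per-label likelihoods into a posterior, it puts all probability mass on the argmax. A sketch meant to pair with the toy coin example above:

import numpy as np

def hard_e_step(lik):
    # lik[i, c] ∝ P(x_i, y = c; w); return the point mass q(y) = δ_{y = y*}
    q = np.zeros_like(lik, dtype=float)
    q[np.arange(lik.shape[0]), lik.argmax(axis=1)] = 1.0
    return q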
Constrained EM

Domain knowledge-based constraints can help a lot by guiding
unsupervised learning
 Constraint-driven Learning (Chang et al, 07),
 Posterior Regularization (Ganchev et al, 10),
 Generalized Expectation Criterion (Mann & McCallum, 08),
 Learning from Measurements (Liang et al, 09)


Constraints are imposed on y (a structured object, {y1,y2…yn})
to specify/restrict the set of allowed structures Y
Page 8
Entity-Relation Prediction: Type Constraints
[Figure: “Dole ’s wife, Elizabeth , is a resident of N.C.” with entities
E1 (Dole), E2 (Elizabeth), E3 (N.C.) and relations R12, R23; e.g. E1 is
Per, E3 is Loc, and R23 is lives-in]

Predict entity types: Per, Loc, Org, etc.
Predict relation types: lives-in, org-based-in, works-for, etc.
Entity-relation type constraints
Page 9
Bilingual Word Alignment: Agreement Constraints

Align words from
sentences in EN with
sentences in FR

Agreement constraints:
alignment from EN-FR
should agree with the
alignment from FR-EN
(Ganchev et al, 10)
Picture: courtesy Lacoste-Julien et al
Page 10
Structured Prediction Constraints Representation

Assume a set of linear constraints:
Y = {y : Uy ≤ b}

A universal representation (Roth and Yih, 07)

Can be relaxed into expectation constraints on posterior
probabilities:
E_q[Uy] ≤ b

Focus on introducing constraints during the E-step
Page 11
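A small numeric illustration of this encoding (the sentence length, constraint, and marginals below are made-up): for a 5-token sentence with indicators y_i = 1 when token i is labeled an entity, the constraint “at most two entity tokens” is a single row of U, and the relaxed expectation version only needs the marginals q(y_i = 1).

import numpy as np

# Uy <= b with one row of ones encodes sum_i y_i <= 2
U = np.ones((1, 5))
b = np.array([2.0])

# hard version: check a concrete labeling y
y = np.array([1, 0, 0, 1, 0])
print(U @ y <= b)                  # [ True]: two entity tokens

# relaxed version: expectation constraint on marginals mu_i = q(y_i = 1)
mu = np.array([0.9, 0.6, 0.1, 0.05, 0.3])
print(U @ mu <= b)                 # [ True]: E_q[Uy] = 1.95 <= 2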
Two Versions of Constrained EM
Posterior Regularization
(Ganchev et al, 10)
 E-step:
q_t(y) = argmin_q KL(q(y), P(y|x; w_t))
s.t. E_q[Uy] ≤ b
 M-step:
argmax_w E_q[log P(x, y; w)]

Constraint-driven Learning
(Chang et al, 07)
 E-step:
y* = argmax_y P(y|x; w)
s.t. Uy ≤ b
 M-step:
argmax_w E_q[log P(x, y; w)]

Not clear which version to use!
Page 12
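For intuition, CoDL’s E-step is just a constrained argmax; a brute-force sketch over a small enumerated output space (a toy setup, not how structured inference is done at scale; the PR-style E-step is handled by the dual algorithm later in the talk):

import numpy as np

def codl_e_step(p, ys, U, b):
    # constrained MAP: y* = argmax_y P(y|x; w) subject to Uy <= b
    feasible = [i for i, y in enumerate(ys) if np.all(U @ y <= b)]
    return ys[max(feasible, key=lambda i: p[i])]

ys = [np.array(v) for v in ([0, 0], [0, 1], [1, 0], [1, 1])]
p = np.array([0.1, 0.2, 0.3, 0.4])          # P(y|x; w) for each candidate
U, b = np.ones((1, 2)), np.array([1.0])     # at most one active variable
print(codl_e_step(p, ys, U, b))             # [1 0]: best feasible y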
So how do we learn…?

EM (PR) vs hard EM (CODL)
 Unclear which version of EM to use (Spitkovsky et al, 10)

This is the initial point of our research

We present a family of EM algorithms which includes
these EM algorithms (and infinitely many new EM
algorithms): Unified Expectation Maximization (UEM)

UEM lets us pick the best EM algorithm in a principled
way
Page 13
Outline

Notation and Expectation Maximization (EM)

Unified Expectation Maximization


Motivation
Formulation and mathematical intuition

Optimization Algorithm for the E-step

Experiments
Page 14
Motivation: Unified Expectation Maximization (UEM)

EM (PR) and hard EM (CoDL) differ mostly in the entropy of
the posterior distribution

[Figure: a spectrum of posterior distributions ranging from EM (soft)
to hard EM (peaked)]

UEM tunes the entropy of the posterior distribution q and is
parameterized by a single parameter γ
Page 15
Unified EM (UEM)

EM (PR) minimizes the KL divergence KL(q, P(y|x; w)) where
KL(q, p) = ∑_y q(y) log q(y) − q(y) log p(y)

UEM changes the E-step of standard EM and minimizes a
modified KL divergence KL(q, P(y|x; w); γ) where
KL(q, p; γ) = ∑_y γ q(y) log q(y) − q(y) log p(y)
The γ coefficient changes the entropy of the posterior

Different γ values → different EM algorithms
Page 16
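Without constraints, minimizing KL(q, p; γ) over the simplex has the closed form q(y) ∝ p(y)^(1/γ), i.e. a temperature-scaled p (this follows by setting the objective’s gradient to zero; the γ ≤ 0 branch below reflects the hard-max limit and is our reading, not a formula from the talk):

import numpy as np

def uem_posterior(p, gamma):
    # q minimizing KL(q, p; gamma) over the simplex, with no constraints
    if gamma <= 0:                 # treated here as the hard-max limit
        q = np.zeros_like(p)
        q[p.argmax()] = 1.0
        return q
    q = p ** (1.0 / gamma)         # temperature-scaled p
    return q / q.sum()

p = np.array([0.5, 0.3, 0.2])
for g in [1.0, 0.5, 0.1, 0.0]:
    print(g, uem_posterior(p, g).round(3))   # sharpens as gamma drops to 0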
Effect of Changing γ
KL(q, p; γ) = ∑_y γ q(y) log q(y) − q(y) log p(y)

[Figure: the posterior q for an original distribution p at different γ:
q with γ = 1 equals p, q with γ = 0 is a point mass on the argmax, and
q with γ = −1 is likewise fully concentrated]
Page 17
Unifying Existing EM Algorithms
KL(q, p; γ) = ∑_y γ q(y) log q(y) − q(y) log p(y)

Changing γ values results in different existing EM algorithms:

γ = −1: CoDL (with constraints)
γ = 0: hard EM (no constraints)
γ = 1: EM (no constraints); PR (with constraints)

Deterministic Annealing (Smith and Eisner, 04; Hofmann, 99)
corresponds to varying γ
Page 18
Range of γ
KL(q, p; γ) = ∑_y γ q(y) log q(y) − q(y) log p(y)

We focus on tuning γ in the range [0, 1]:

γ = 0: hard EM (no constraints); LP approximation to CoDL
(new, with constraints)
0 < γ < 1: infinitely many new EM algorithms
γ = 1: EM (no constraints); PR (with constraints)
Page 19
Tuning γ in practice

γ essentially tunes the entropy of the posterior to better
adapt to data, initialization, constraints, etc.

We tune γ using a small amount of development data over the
range 0, 0.1, 0.2, 0.3, …, 1 (see the sketch below)

UEM for arbitrary γ in our range is very easy to implement:
existing EM/PR/hard EM/CoDL codes can be easily extended
to implement UEM
Page 20
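The tuning recipe above amounts to a grid search over γ; a minimal sketch, where train_uem and evaluate are hypothetical stand-ins for a UEM trainer and a dev-set metric:

def tune_gamma(train_uem, evaluate, gammas=None):
    # train_uem(gamma) -> model; evaluate(model) -> dev-set score
    if gammas is None:
        gammas = [i / 10 for i in range(11)]   # 0, 0.1, ..., 1.0
    scores = {g: evaluate(train_uem(g)) for g in gammas}
    return max(scores, key=scores.get)         # gamma with the best dev score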
Outline

Setting up the problem

Unified Expectation Maximization

Solving the constrained E-step



Lagrange dual-based algorithm
Unification of existing algorithms
Experiments
Page 21
The Constrained E-step
The constrained E-step solves

min_q KL(q, P(y|x; w); γ)        ← γ-parameterized KL divergence
s.t. E_q[Uy] ≤ b                 ← domain knowledge-based linear constraints
     q(y) ≥ 0, ∑_y q(y) = 1      ← standard probability simplex constraints

For γ ≥ 0 ⇒ convex
Page 22
Solving the Constrained E-step for q(y)
1. Introduce dual variables λ for each constraint
2. Sub-gradient ascent on the dual variables with ∇λ ∝ E_q[Uy] − b
3. Compute q for the given λ
Iterate steps 2 and 3 until convergence

For γ > 0, compute the closed form
q(y) ∝ (P(y|x; w) exp(−λ·Uy))^(1/γ)

With γ → 0, this becomes unconstrained MAP inference:
y* = argmax_y P(y|x; w) exp(−λ·Uy)
Page 23
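A minimal sketch of this E-step for γ > 0 over a small, explicitly enumerated output space (the constraint and numbers are illustrative assumptions; real structured models would compute q with factored inference rather than enumeration):

import numpy as np

def constrained_e_step(p, U, b, gamma, steps=500, lr=0.05):
    # p: P(y | x; w) for each enumerated output y, shape (n,)
    # U, b: expectation constraints E_q[Uy] <= b; outputs are represented
    # by indicator vectors here, so E_q[Uy] = U @ q
    lam = np.zeros(U.shape[0])                 # one dual variable per constraint
    for _ in range(steps):
        # compute q for the given duals: q(y) ∝ (p(y) exp(-lam·Uy))^(1/gamma)
        logq = (np.log(p) - lam @ U) / gamma
        q = np.exp(logq - logq.max())
        q /= q.sum()
        # projected sub-gradient ascent on the duals, grad ∝ E_q[Uy] - b,
        # keeping lam >= 0 for the inequality constraints
        lam = np.maximum(0.0, lam + lr * (U @ q - b))
    return q

p = np.array([0.6, 0.3, 0.1])                  # unconstrained posterior
U = np.array([[1.0, 0.0, 0.0]])                # constrain q(y_0) <= 0.5
b = np.array([0.5])
print(constrained_e_step(p, U, b, gamma=1.0))  # mass on y_0 driven down to ~0.5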
Some Properties of our E-step Optimization

We use a dual projected sub-gradient ascent algorithm
(Bertsekas, 99)

Includes inequality constraints

For special instances where two (or more) “easy” problems
are connected via constraints, reduces to dual decomposition
 For γ > 0: convex dual decomposition over individual models
(e.g. HMMs) connected via dual variables
 γ = 1: dual decomposition in posterior regularization
(Ganchev et al, 08)
 For γ = 0: Lagrange relaxation/dual decomposition for hard
ILP inference (Koo et al, 10; Rush et al, 11)
Page 24
Outline

Setting up the problem

Introduction to Unified Expectation Maximization

Lagrange dual-based optimization Algorithm for the E-step

Experiments



POS tagging
Entity-Relation Extraction
Word Alignment
Page 25
Experiments: exploring the role of γ

Test if tuning γ helps improve the performance over baselines

Study the relation between the quality of initialization and γ
(or the “hardness” of inference)

Compare against:
 Posterior Regularization (PR), corresponding to γ = 1.0
 Constraint-driven Learning (CoDL), corresponding to γ = −1
Page 26
Unsupervised POS Tagging

Model as first order HMM

Try varying qualities of initialization:



Uniform initialization: initialize with equal probability for all states
Supervised initialization: initialize with parameters trained on varying
amounts of labeled data
Test the “conventional wisdom” that hard EM does well with
good initialization and EM does better with a weak
initialization
Page 27
Unsupervised POS tagging: Different EM instantiations
[Figure: performance relative to EM (γ = 1) as γ varies from 0.0 to 1.0,
for a uniform-posterior initializer and for supervised initializers
trained on 5, 10, 20, 40, and 80 labeled examples; hard EM sits at
γ = 0 and EM at γ = 1]
Page 28
Experiments: Entity-Relation Extraction
[Figure: “Dole ’s wife, Elizabeth , is a resident of N.C.” with entities
E1, E2, E3 and relations R12, R23]

Extract entity types (e.g. Loc, Org, Per) and relation types
(e.g. Lives-in, Org-based-in, Killed) between pairs of entities

Add constraints:
 Type constraints between entities and relations
 Expected count constraints to regularize the counts of the ‘None’ relation

Semi-supervised learning with a small amount of labeled data
Page 29
Results on Relations
[Figure: macro-F1 scores on relations vs. the percentage of labeled data
(5%, 10%, 20%), comparing no semi-supervision, CoDL, PR, and UEM; y-axis
from 0.30 to 0.48]
Page 30
Experiments: Word Alignment





Word alignment from a language S to language T
We try EN-FR and EN-ES pairs
We use an HMM-based model with agreement constraints for
word alignment
PR with agreement constraints is known to give huge
improvements over HMM (Ganchev et al’08; Graca et al’08)
Use our efficient algorithm to decompose the E-step into
individual HMMs
Page 31
Word Alignment: EN-FR with 10k Unlabeled Data
[Figure: alignment error rates (0 to 25) for the EN-FR and FR-EN
directions with 10k unlabeled sentence pairs, comparing EM, PR, CoDL,
and UEM]
Page 32
Word Alignment: EN-FR
[Figure: EN-FR alignment error rates (0 to 25) with 10k, 50k, and 100k
unlabeled sentence pairs, comparing EM, PR, CoDL, and UEM]
Page 33
Word Alignment: FR-EN
[Figure: FR-EN alignment error rates (0 to 25) with 10k, 50k, and 100k
unlabeled sentence pairs, comparing EM, PR, CoDL, and UEM]
100k
Page 34
Word Alignment: EN-ES
[Figure: EN-ES alignment error rates (10 to 40) with 10k, 50k, and 100k
unlabeled sentence pairs, comparing EM, PR, CoDL, and UEM]
100k
Page 35
Word Alignment: ES-EN
[Figure: ES-EN alignment error rates (10 to 35) with 10k, 50k, and 100k
unlabeled sentence pairs, comparing EM, PR, CoDL, and UEM]
100k
Page 36
Experiments Summary

In different settings, different baselines work better




Entity-Relation extraction: CODL does better than PR
Word Alignment: PR does better than CODL
Unsupervised POS tagging: depends on the initialization
UEM allows us to choose the best algorithm in all of these
cases

Best version of EM: a new version with 0 < γ < 1
Page 37
Unified EM: Summary




UEM generalizes existing variations of EM/constrained EM

UEM provides new EM algorithms parameterized by a single
parameter γ

Efficient dual projected subgradient ascent technique to
incorporate constraints into UEM

The best γ corresponds to neither EM (PR) nor hard EM
(CoDL) and is found through the UEM framework

Tuning γ adaptively changes the entropy of the posterior

UEM is easy to implement: add a few lines of code to existing
EM codes
Questions?
Page 38