Unified Expectation Maximization
Rajhans Samdani, NAACL 2012, Montreal
Joint work with Ming-Wei Chang (Microsoft Research) and Dan Roth, University of Illinois at Urbana-Champaign
Page 1

Weakly Supervised Learning in NLP
Labeled data is scarce and difficult to obtain, so there is a lot of work on learning with a small amount of labeled data.
The Expectation Maximization (EM) algorithm is the de facto standard.
More recently: significant work on injecting weak supervision or domain knowledge via constraints into EM:
Constraint-driven Learning (CoDL; Chang et al., 07)
Posterior Regularization (PR; Ganchev et al., 10)
Page 2

Weakly Supervised Learning: EM and ...?
Several variants of EM exist in the literature: hard EM, and variants of constrained EM such as CoDL and PR.
Which version should we use: EM (PR) vs. hard EM (CoDL)? Or is there something better out there?
OUR CONTRIBUTION: a unified framework for EM algorithms, Unified EM (UEM), which
includes existing EM algorithms, and
picks the most suitable EM algorithm in a simple, adaptive, and principled way, adapting to the data, the initialization, and the constraints.
Page 3

Outline
Background: Expectation Maximization (EM)
EM with constraints
Unified Expectation Maximization (UEM)
Optimization algorithm for the E-step
Experiments
Page 4

Predicting Structures in NLP
Predict the output or dependent variable y from the space of allowed outputs Y, given input variable x, using parameters or weight vector w.
E.g. predict POS tags given a sentence, predict word alignments given sentences in two different languages, predict the entity-relation structure of a document.
Prediction is expressed as y* = argmax_{y ∈ Y} P(y | x; w)
Page 5

Learning Using EM: a Quick Primer
Given unlabeled data x, estimate w; y is hidden.
for t = 1 ... T do
E-step: estimate a posterior distribution q over y (Neal and Hinton, 99):
  q_t(y) = argmin_q KL(q(y), P(y | x; w_t))  ⇒  q_t(y) = P(y | x; w_t)
  (the posterior distribution equals the conditional distribution of y given w_t)
M-step: estimate the parameters w with respect to q:
  w_{t+1} = argmax_w E_q log P(x, y; w)
Page 6

Another Version of EM: Hard EM
Standard EM E-step: q_t(y) = argmin_q KL(q(y), P(y | x; w_t))
Hard EM E-step: q(y) = δ_{y = y*}, where y* = argmax_y P(y | x; w)
M-step (identical in both): argmax_w E_q log P(x, y; w)
Not clear which version to use!
Page 7
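The contrast between the two E-steps on pages 6-7 can be made concrete with a small sketch. This is my own illustration, not code from the talk: the NumPy usage, function names, and toy posterior numbers are assumptions. For a handful of candidate structures, standard EM keeps the full posterior, while hard EM commits to the single MAP structure.

```python
import numpy as np

def e_step_soft(log_p):
    """Standard EM E-step: q(y) is the full posterior P(y | x; w)."""
    q = np.exp(log_p - log_p.max())   # subtract the max for numerical stability
    return q / q.sum()

def e_step_hard(log_p):
    """Hard EM E-step: q(y) = delta at y* = argmax_y P(y | x; w)."""
    q = np.zeros_like(log_p)
    q[np.argmax(log_p)] = 1.0
    return q

# Toy (made-up) posterior over four candidate structures y.
log_p = np.log(np.array([0.5, 0.3, 0.15, 0.05]))
print(e_step_soft(log_p))   # [0.5  0.3  0.15 0.05] -- keeps all the probability mass
print(e_step_hard(log_p))   # [1. 0. 0. 0.]         -- commits to the MAP structure
# The M-step is identical in both cases: argmax_w E_q log P(x, y; w).
```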
Constrained EM
Domain knowledge-based constraints can help a lot by guiding unsupervised learning:
Constraint-driven Learning (Chang et al., 07), Posterior Regularization (Ganchev et al., 10), Generalized Expectation Criterion (Mann & McCallum, 08), Learning from Measurements (Liang et al., 09).
Constraints are imposed on y (a structured object {y1, y2, ..., yn}) to specify/restrict the set of allowed structures Y.
Page 8

Entity-Relation Prediction: Type Constraints
[Figure: the sentence "Dole 's wife, Elizabeth, is a resident of N.C." annotated with entities E1, E2, E3, relations R12 and R23, and example labels Per, Loc, and lives-in.]
Predict entity types: Per, Loc, Org, etc.
Predict relation types: lives-in, org-based-in, works-for, etc.
Entity-relation type constraints tie the two together.
Page 9

Bilingual Word Alignment: Agreement Constraints
Align words from sentences in EN with sentences in FR.
Agreement constraints: the alignment from EN-FR should agree with the alignment from FR-EN (Ganchev et al., 10).
[Figure: example word alignment; picture courtesy Lacoste-Julien et al., 10.]
Page 10

Structured Prediction: Constraints Representation
Assume a set of linear constraints: Y = {y : Uy ≤ b} — a universal representation (Roth and Yih, 07).
These can be relaxed into expectation constraints on posterior probabilities: E_q[Uy] ≤ b.
We focus on introducing constraints during the E-step.
Page 11

Two Versions of Constrained EM
Posterior Regularization (Ganchev et al., 10) E-step: argmin_q KL(q(y), P(y | x; w_t)) subject to E_q[Uy] ≤ b
Constraint-driven Learning (Chang et al., 07) E-step: y* = argmax_y P(y | x; w) subject to Uy ≤ b
M-step (identical in both): argmax_w E_q log P(x, y; w)
Again, it is not clear which version to use!
Page 12

So how do we learn...?
EM (PR) vs. hard EM (CODL): it is unclear which version of EM to use (Spitkovsky et al., 10). This is the starting point of our research.
We present a family of EM algorithms which includes these EM algorithms (and infinitely many new ones): Unified Expectation Maximization (UEM).
UEM lets us pick the best EM algorithm in a principled way.
Page 13

Outline
Notation and Expectation Maximization (EM)
Unified Expectation Maximization: motivation, formulation, and mathematical intuition
Optimization algorithm for the E-step
Experiments
Page 14

Motivation: Unified Expectation Maximization (UEM)
EM (PR) and hard EM (CODL) differ mostly in the entropy of the posterior distribution.
[Figure: posterior distributions under EM and hard EM.]
UEM tunes the entropy of the posterior distribution q and is parameterized by a single parameter γ.
Page 15

Unified EM (UEM)
EM (PR) minimizes the KL divergence KL(q, P(y | x; w)), where KL(q, p) = Σ_y q(y) log q(y) − q(y) log p(y).
UEM changes the E-step of standard EM and minimizes a modified KL divergence KL(q, P(y | x; w); γ), where
  KL(q, p; γ) = Σ_y γ q(y) log q(y) − q(y) log p(y),
which changes the entropy of the posterior. Different γ values give different EM algorithms.
Page 16

Effect of Changing γ
KL(q, p; γ) = Σ_y γ q(y) log q(y) − q(y) log p(y)
[Figure: the minimizing q for several values of γ alongside the original distribution p: γ = 1 reproduces p, larger γ flattens q toward uniform, and γ ≤ 0 concentrates all mass on the mode.]
Page 17

Unifying Existing EM Algorithms
KL(q, p; γ) = Σ_y γ q(y) log q(y) − q(y) log p(y)
Changing γ recovers different existing EM algorithms:
  γ = −∞: hard EM (no constraints), CODL (with constraints)
  intermediate γ: deterministic annealing (Smith and Eisner, 04; Hofmann, 99)
  γ = 1: EM (no constraints), PR (with constraints)
Page 18

Range of γ
KL(q, p; γ) = Σ_y γ q(y) log q(y) − q(y) log p(y)
We focus on tuning γ in the range [0, 1]:
  γ = 0: hard EM (no constraints), an LP approximation to CODL (with constraints; new)
  γ = 1: EM (no constraints), PR (with constraints)
  0 < γ < 1: infinitely many new EM algorithms
Page 19
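For γ > 0 and no constraints, minimizing KL(q, p; γ) over the simplex gives the tempered posterior q(y) ∝ p(y)^(1/γ), which is how γ controls the entropy on pages 16-17. The sketch below is my own derivation-turned-code, not from the talk; the function name and toy distribution are assumptions. It shows the interpolation: γ = 1 recovers p (EM), γ = 0 gives the one-hot hard-EM posterior, and large γ flattens q toward uniform.

```python
import numpy as np

def uem_e_step(p, gamma):
    """Unconstrained UEM E-step: minimize KL(q, p; gamma) over the simplex.
    For gamma > 0 the closed form is q(y) ∝ p(y)^(1/gamma); gamma = 1 recovers
    q = p (standard EM) and gamma = 0 is the hard-EM (MAP) limit."""
    if gamma == 0.0:                      # hard EM limit: all mass on the argmax
        q = np.zeros_like(p)
        q[np.argmax(p)] = 1.0
        return q
    q = p ** (1.0 / gamma)
    return q / q.sum()

p = np.array([0.5, 0.3, 0.15, 0.05])      # the "original distribution p" of page 17
for gamma in [0.0, 0.5, 1.0, 2.0, 100.0]:
    print(gamma, np.round(uem_e_step(p, gamma), 3))
# gamma = 0   -> one-hot at the mode (hard EM)
# gamma = 1   -> p itself (standard EM)
# large gamma -> close to uniform
```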
Tuning γ in Practice
γ essentially tunes the entropy of the posterior to better adapt to the data, initialization, constraints, etc.
We tune γ using a small amount of development data over the grid 0, 0.1, 0.2, ..., 1.
UEM for an arbitrary γ in this range is very easy to implement: existing EM/PR/hard EM/CODL code can be easily extended to implement UEM.
Page 20

Outline
Setting up the problem
Unified Expectation Maximization
Solving the constrained E-step: Lagrange dual-based algorithm, unification of existing algorithms
Experiments
Page 21

The Constrained E-step
The E-step solves
  min_q KL(q, P(y | x; w_t); γ)   subject to   E_q[Uy] ≤ b,   Σ_y q(y) = 1,   q(y) ≥ 0,
combining the γ-parameterized KL divergence, the domain knowledge-based linear constraints, and the standard probability simplex constraints.
For γ ≥ 0 the problem is convex.
Page 22

Solving the Constrained E-step for q(y)
1. Introduce dual variables λ ≥ 0, one for each constraint.
2. Compute q for the given λ. For γ > 0, q(y) ∝ P(y | x; w)^(1/γ) exp(−λᵀUy / γ). As γ → 0, this becomes unconstrained MAP inference with penalty-modified scores: argmax_y log P(y | x; w) − λᵀUy.
3. Sub-gradient ascent on the dual variables, with ∇_λ ∝ E_q[Uy] − b.
Iterate steps 2-3 until convergence.
Page 23

Some Properties of Our E-step Optimization
We use a dual projected sub-gradient ascent algorithm (Bertsekas, 99), which handles inequality constraints.
For special instances where two (or more) "easy" problems are connected via constraints, it reduces to dual decomposition:
  γ > 0: convex dual decomposition over individual models (e.g. HMMs) connected via dual variables; γ = 1 gives the dual decomposition used in posterior regularization (Ganchev et al., 08)
  γ = 0: Lagrangian relaxation / dual decomposition for hard ILP inference (Koo et al., 10; Rush et al., 11)
Page 24
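A minimal sketch of the dual projected sub-gradient procedure from pages 23-24, assuming γ > 0 and an output space small enough to enumerate explicitly so that E_q[Uy] can be computed exactly. This is my own illustration, not code from the talk; the toy constraint, step size, and iteration count are arbitrary choices for the example.

```python
import numpy as np

def constrained_e_step(log_p, Y, U, b, gamma=1.0, step=0.5, iters=200):
    """Dual projected sub-gradient ascent for the constrained UEM E-step (gamma > 0).
    Y is a (num_structures x d) matrix whose rows enumerate the candidate
    structures y; the constraints are E_q[U y] <= b.

    For a fixed dual vector lam >= 0, the primal minimizer of
    KL(q, p; gamma) + lam . (E_q[U y] - b) is q(y) ∝ p(y)^(1/gamma) * exp(-lam . U y / gamma).
    """
    lam = np.zeros(len(b))
    for _ in range(iters):
        scores = (log_p - Y @ U.T @ lam) / gamma        # (log p(y) - lam.Uy) / gamma
        q = np.exp(scores - scores.max())
        q /= q.sum()
        grad = q @ Y @ U.T - b                          # sub-gradient: E_q[Uy] - b
        lam = np.maximum(0.0, lam + step * grad)        # project onto lam >= 0
    return q, lam

# Tiny example: 3 binary decisions, candidates = all 8 assignments, and one
# constraint saying the expected number of "on" decisions is at most 1.
Y = np.array([[i >> 2 & 1, i >> 1 & 1, i & 1] for i in range(8)], dtype=float)
log_p = Y @ np.log([0.9, 0.8, 0.7]) + (1 - Y) @ np.log([0.1, 0.2, 0.3])
U, b = np.ones((1, 3)), np.array([1.0])
q, lam = constrained_e_step(log_p, Y, U, b, gamma=1.0)
print(np.round(q @ Y, 3))           # per-decision marginals under the constrained q
print(np.round((q @ Y).sum(), 3))   # expected total count, pushed down toward b = 1
```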
Outline
Setting up the problem
Introduction to Unified Expectation Maximization
Lagrange dual-based optimization algorithm for the E-step
Experiments: POS tagging, entity-relation extraction, word alignment
Page 25

Experiments: Exploring the Role of γ
Test whether tuning γ improves performance over the baselines.
Study the relation between the quality of initialization and γ (or the "hardness" of inference).
Compare against: Posterior Regularization (PR), which corresponds to γ = 1.0, and Constraint-driven Learning (CODL), which corresponds to γ = −∞.
Page 26

Unsupervised POS Tagging
Model as a first-order HMM.
Try initializations of varying quality: uniform initialization (equal probability for all states) and supervised initialization (parameters trained on varying amounts of labeled data).
Test the "conventional wisdom" that hard EM does well with good initialization while EM does better with a weak initialization.
Page 27

Unsupervised POS Tagging: Different EM Instantiations
[Figure: performance relative to EM (γ = 1) plotted against γ, with curves for the uniform-posterior initializer and for initializers trained on 5, 10, 20, 40, and 80 labeled examples; hard EM sits at γ = 0 and EM at γ = 1.]
Page 28

Experiments: Entity-Relation Extraction
Example: "Dole 's wife, Elizabeth, is a resident of N.C." with entities E1, E2, E3 and relations R12, R23.
Extract entity types (e.g. Loc, Org, Per) and relation types (e.g. lives-in, org-based-in, killed) between pairs of entities.
Add constraints: type constraints between entities and relations, and expected-count constraints to regularize the counts of the 'None' relation.
Semi-supervised learning with a small amount of labeled data.
Page 29

Results on Relations
[Figure: macro-F1 scores on relations vs. percentage of labeled data (5%, 10%, 20%) for no semi-supervision, CODL, PR, and UEM.]
Page 30

Experiments: Word Alignment
Word alignment from a language S to a language T; we try EN-FR and EN-ES pairs.
We use an HMM-based model with agreement constraints for word alignment.
PR with agreement constraints is known to give large improvements over the plain HMM (Ganchev et al., 08; Graca et al., 08).
We use our efficient algorithm to decompose the E-step into individual HMMs.
Page 31

Word Alignment: EN-FR with 10k Unlabeled Data
[Figure: alignment error rate for EM, PR, CODL, and UEM in the EN-FR and FR-EN directions.]
Page 32

Word Alignment: EN-FR
[Figure: alignment error rate for EM, PR, CODL, and UEM with 10k, 50k, and 100k unlabeled sentence pairs.]
Page 33

Word Alignment: FR-EN
[Figure: alignment error rate for EM, PR, CODL, and UEM with 10k, 50k, and 100k unlabeled sentence pairs.]
Page 34

Word Alignment: EN-ES
[Figure: alignment error rate for EM, PR, CODL, and UEM with 10k, 50k, and 100k unlabeled sentence pairs.]
Page 35

Word Alignment: ES-EN
[Figure: alignment error rate for EM, PR, CODL, and UEM with 10k, 50k, and 100k unlabeled sentence pairs.]
Page 36

Experiments Summary
Different baselines work better in different settings: CODL does better than PR on entity-relation extraction, PR does better than CODL on word alignment, and for unsupervised POS tagging it depends on the initialization.
UEM allows us to choose the best algorithm in all of these cases; the best version of EM is a new one with 0 < γ < 1.
Page 37

Unified EM: Summary
UEM generalizes existing variations of EM / constrained EM and provides new EM algorithms parameterized by a single parameter γ.
An efficient dual projected sub-gradient ascent technique incorporates constraints into UEM.
The best γ, found through the UEM framework, corresponds to neither EM (PR) nor hard EM (CODL); tuning γ adaptively changes the entropy of the posterior.
UEM is easy to implement: add a few lines of code to existing EM code.
Questions?
Page 38
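To illustrate the summary's point that UEM is only "a few lines of code" on top of an existing EM implementation, here is a sketch of a toy mixture-of-Bernoullis trained with UEM, with γ swept over the grid from page 20. This is my own example, not from the talk: the model choice, synthetic data, smoothing constants, and the development-set evaluation stub are all assumptions.

```python
import numpy as np
rng = np.random.default_rng(0)

def uem_mixture(X, K=2, gamma=1.0, iters=30):
    """Toy mixture of Bernoullis trained with UEM.  The only change from plain
    EM is the tempering line in the E-step (posterior raised to 1/gamma and
    renormalized): gamma = 1 is standard EM, gamma = 0 is hard EM."""
    n, d = X.shape
    pi = np.full(K, 1.0 / K)
    theta = rng.uniform(0.3, 0.7, size=(K, d))            # P(x_j = 1 | z = k)
    for _ in range(iters):
        # E-step: log P(z = k, x_i) for every example i and component k
        log_joint = np.log(pi) + X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
        if gamma > 0:
            log_q = log_joint / gamma                      # <- the UEM change
            q = np.exp(log_q - log_q.max(axis=1, keepdims=True))
            q /= q.sum(axis=1, keepdims=True)
        else:                                              # gamma = 0: hard EM
            q = np.eye(K)[log_joint.argmax(axis=1)]
        # M-step: standard closed-form updates, identical to plain EM
        # (small smoothing keeps hard EM from dividing by zero on empty components)
        pi = np.clip(q.mean(axis=0), 1e-6, None)
        pi /= pi.sum()
        theta = np.clip((q.T @ X + 1e-3) / (q.sum(axis=0)[:, None] + 2e-3), 1e-6, 1 - 1e-6)
    return pi, theta

# Sweep gamma over the grid 0, 0.1, ..., 1 as on page 20 (synthetic data; the
# development-set evaluation that would pick the best gamma is left as a stub).
true_z = rng.integers(0, 2, size=(200, 1))
X = rng.binomial(1, np.where(true_z == 1, 0.8, 0.2), size=(200, 10)).astype(float)
for gamma in np.round(np.arange(0.0, 1.01, 0.1), 1):
    pi, theta = uem_mixture(X, gamma=gamma)
    # ...evaluate (pi, theta) on held-out development data and keep the best gamma...
```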