Latent Variables
Naman Agarwal, Michael Nute
May 1, 2013

Contents
• Definition & example of latent variables
• EM algorithm refresher
• Structured SVM with latent variables
• Learning under semi-supervision or indirect supervision
  – CoDL
  – Posterior Regularization
  – Indirect Supervision

Latent Variables: General Definition & Examples
A latent variable in a machine learning algorithm is one which is assumed to exist (or to have a null value) but which is not observed and is instead inferred from other, observed variables.
• It generally corresponds to some meaningful element of the problem for which direct supervision is intractable.
• Latent variable methods often treat the variable as part of the input/feature space (e.g. PCA, factor analysis) or as part of the output space (e.g. EM).
  – This distinction is only illustrative, though, and can be blurred, as we will see with indirect supervision.

Latent Input Variables:  X → X* (unobserved) → Y
As part of the input space, the variable x ∈ X affects the output y ∈ Y only through the unobserved variable x* ∈ X*. This formulation is only helpful if the dimension of X* is smaller than the dimension of X, so latent variables here are essentially an exercise in dimension reduction.

Latent Output Variables:  X → Y* (unobserved), with Y observed
When we think of a latent variable as part of the output space, the method becomes an exercise in unsupervised or semi-supervised learning.

Example: Paraphrase Identification¹
Problem: Given sentences A and B, determine whether they are paraphrases of each other.
• Note that if they are paraphrases, then there will exist a mapping between the named entities and predicates of the two sentences.
• The mapping is not directly observed, but it is a latent variable in the decision problem of determining whether the sentences say the same thing.
  A: Druce will face murder charges, Conte said.   ←(latent mapping)→   B: Conte said Druce will be charged with murder.

Revised Problem: Given sentences A and B, determine the mapping of semantic elements between A and B.
• Now we are trying to learn specifically the mapping between them, so we can use the Boolean question from the previous problem as a latent variable.
• In practice, the Boolean question is easy to answer, so we can use it to guide the semi-supervised task of mapping semantic elements.
• This is called indirect supervision (more on that later).

¹ Example taken from a talk by D. Roth: Constraints Driven Structured Learning with Indirect Supervision. Language Technologies Institute Colloquium, Carnegie Mellon University, Pittsburgh, PA, April 2010.

The EM Algorithm: Refresher
In practice, many algorithms that use latent variables have a structure similar to the Expectation-Maximization algorithm (even though EM is not discriminative and some of the others are). So let's review.

The EM Algorithm (formally)
Setup:
  Observed data: X
  Unobserved data: Y
  Unknown parameters: θ
  Complete-data log-likelihood: L(θ; X, Y) = log P(X, Y | θ)

Algorithm:
  Initialize θ = θ^(0).
  E-step: take the expectation of L(θ; X, Y) over the unobserved data Y (i.e. over its possible "labels") given the current estimate of the parameters:
    Q(θ | θ^(t)) = E_{Y | X, θ^(t)} [ L(θ; X, Y) ]
  M-step: find the parameters that maximize the expected log-likelihood:
    θ^(t+1) = argmax_θ Q(θ | θ^(t))
  Repeat until convergence.
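To make the E-step / M-step structure concrete (and to anticipate the hard/soft distinction on the next slide), here is a small illustrative sketch of EM for a toy model: a one-dimensional mixture of two unit-variance Gaussians. The toy model, the function name and all other details are our own additions for illustration, not something from the papers discussed in these notes.

```python
# Minimal sketch of soft vs. hard EM on a toy model: a 1-D mixture of two
# unit-variance Gaussians. Illustrative only; not from the slides or papers.
import numpy as np

def em_1d_mixture(x, iters=50, hard=False, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.normal(size=2)            # unknown parameters theta: the two means
    pi = np.array([0.5, 0.5])          # mixing weights
    for _ in range(iters):
        # E-step: (log of the) posterior over the unobserved component label
        log_p = -0.5 * (x[:, None] - mu[None, :]) ** 2 + np.log(pi)[None, :]
        if hard:
            # Hard EM: keep only the MAP label instead of the full expectation
            r = np.eye(2)[log_p.argmax(axis=1)]
        else:
            # Soft EM: responsibilities = normalized posterior probabilities
            r = np.exp(log_p - log_p.max(axis=1, keepdims=True))
            r /= r.sum(axis=1, keepdims=True)
        # M-step: maximize the expected complete-data log-likelihood
        nk = np.maximum(r.sum(axis=0), 1e-12)
        mu = (r * x[:, None]).sum(axis=0) / nk
        pi = nk / len(x)
    return mu, pi

# Example usage on synthetic data drawn from two well-separated components:
# data = np.concatenate([np.random.normal(-2, 1, 300), np.random.normal(3, 1, 300)])
# print(em_1d_mixture(data, hard=True))
```

The `hard` flag switches between the two E-step variants contrasted next.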
The EM Algorithm: Hard EM vs. Soft EM
• The algorithm above is often called Soft EM because it computes the expectation of the log-likelihood function in the E-step.
• An important variation on this algorithm is called Hard EM: instead of computing the expectation, we simply choose the MAP value for Y and proceed with the likelihood function conditional on that MAP value.
• This is a simpler procedure, and one which many latent variable methods essentially resemble:
  Label Y → Train θ → (repeat until convergence)

Learning Structural SVMs with Latent Variables (Yu & Joachims): Model Formulation
General structured SVM formulation. Solve:

  min_w  (1/2)‖w‖² + C Σ_{i=1}^{n} [ max_{y ∈ Y} ( Δ(y_i, y) + wᵀΦ(x_i, y) ) − wᵀΦ(x_i, y_i) ]

where:
  (x_i, y_i) are the input and output structure for training example i,
  Φ(x_i, y_i) is the feature vector,
  Δ(y_i, y) is the loss function in the output space, and
  w is the weight vector.

Structured SVM formulation with a latent variable: let h ∈ H be an unobserved variable. Since the predicted y now depends on h, the predicted value of the latent variable, the loss between the actual y_i and the predicted y may now become a function of h as well:

  Δ(y_i, y)  ⇒  Δ(y_i, y, h)

So the new optimization problem becomes:

  min_w  (1/2)‖w‖² + C Σ_{i=1}^{n} max_{(y,h) ∈ Y×H} [ Δ(y_i, y, h) + wᵀΦ(x_i, y, h) ] − C Σ_{i=1}^{n} max_{h ∈ H} wᵀΦ(x_i, y_i, h)

The problem is now the difference of two convex functions, so we can solve it using the concave-convex procedure (CCCP).

Learning Structural SVMs with Latent Variables (Yu & Joachims): Optimization Methodology & Notes
The CCCP:
1. Compute h_i* = argmax_{h ∈ H} wᵀΦ(x_i, y_i, h) for each i.
2. Update w by solving the standard structured SVM formulation, treating each h_i* as though it were an observed value.
(Repeat until convergence.)
Note the similarity to the simple way we looked at Hard EM earlier: first we label the unlabeled values, then we re-train the model on the newly labeled values.

Notes:
• Technically the loss function would compare the true values (y_i, h_i) to the predicted (y, h), but since we do not observe h_i, we are restricted to loss functions that reduce to the form shown above.
• It is not strictly necessary that the loss function depend on h; in NLP it often does not.
• In the absence of latent variables, the optimization problem reduces to the general structured SVM formulation.
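The CCCP above maps naturally onto a short training loop. Below is a rough Python sketch under the assumption that the caller supplies the joint feature map Φ(x, y, h), the latent space H(x), and a routine that solves the standard (fully observed) structural SVM once each h_i is fixed; the function names and signatures are illustrative, not Yu & Joachims' actual implementation.

```python
import numpy as np

def latent_structural_svm(data, phi, latent_space, solve_structural_svm, w0,
                          iters=20):
    """data: list of (x_i, y_i) pairs; w0: initial weight vector."""
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        # Step 1 (the "hard E-step"): impute the best latent value for each
        # labeled example under the current weights.
        imputed = [(x, y, max(latent_space(x), key=lambda h: w @ phi(x, y, h)))
                   for x, y in data]
        # Step 2 (the "M-step"): treat the imputed h_i* as observed and
        # re-solve the standard structured SVM objective to update w.
        w_new = np.asarray(solve_structural_svm(imputed, phi, w_init=w),
                           dtype=float)
        if np.allclose(w_new, w):
            break
        w = w_new
    return w
```

The stopping check on the weights mirrors the CCCP's convergence test; in practice one would also monitor the objective value.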
Learning under Semi-Supervision
• A labeled dataset is hard to obtain; we generally have a small labeled dataset and a large unlabeled dataset.
• Naïve algorithm (a kind of EM):
  – Train on the labeled dataset [Initialization]
  – Make inferences on the unlabeled set [Expectation]
  – Include the inferred labels in the training data [Maximization]
  – Repeat
• Can we do better? Two ideas:
  – Constraints
  – Indirect supervision via binary decision problems

Constraint-Driven Learning (CoDL)
• Proposed by Chang et al. [2007].
• Uses constraints obtained from domain knowledge to streamline semi-supervision.
• The constraints can be quite general.
• Incorporates soft constraints.

Why are constraints useful?
Desired citation segmentation:
  [AUTHOR Lars Ole Anderson.] [TITLE Program analysis and specification for the C programming language.] [TECH-REPORT PhD thesis,] [INSTITUTION DIKU, University of Copenhagen,] [DATE May 1994.]
An HMM trained on 30 labeled examples produces:
  [AUTHOR Lars Ole Anderson. Program analysis and] [TITLE specification for the] [EDITOR C] [BOOKTITLE programming language.] [TECH-REPORT PhD thesis,] [INSTITUTION DIKU, University of Copenhagen, May] [DATE 1994.]
which leads to noisy predictions. A simple constraint, that a state transition can occur only at punctuation marks, produces the correct output.

CoDL Framework
Notation:
• L = (X_L, Y_L) is the labeled dataset.
• U = (X_U, Y_U) is the unlabeled dataset (Y_U is unobserved).
• Φ(x, y) is a feature vector.
• Structured learning task: learn w such that y_i = argmax_y wᵀΦ(x_i, y).
• C_1, …, C_K is the set of constraints, where each C_i : X × Y → {0, 1}.

CoDL Objective
• If the constraints are hard, inference is restricted to the constraint-satisfying assignments:

    argmax_{y ∈ 1_{C(x)}}  wᵀΦ(x_i, y)

• If the constraints are soft, they define a notion of violation through a distance function d such that

    d(y, x, C_i) = min_{y′ ∈ 1_{C_i(x)}} H(y, y′)

  i.e. the (Hamming) distance from y to the nearest assignment that satisfies C_i.
• The objective in this "soft" formulation is

    argmax_y  wᵀΦ(x, y) − Σ_{i=1}^{K} ρ_i d(y, x, C_i)

Learning Algorithm
The algorithm is divided into four steps:
1. Initialization: w_0 = learn(L).
2. Expectation: for all x ∈ U,
     {(x, y_1), …, (x, y_K)} = Top-K-Inference(x, w, C, ρ)
     T = T ∪ {(x, y_1), …, (x, y_K)}
   • Top-K-Inference generates the best K "valid" assignments to y using beam-search techniques.
   • This can be thought of as assigning a uniform posterior distribution over these K assignments and 0 everywhere else.
3. Maximization: w = γ w_0 + (1 − γ) · learn(T)
   • γ is a smoothing parameter that keeps the model from drifting too far from the supervised model.
4. Repeat.
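Before moving on, here is a rough Python sketch of the CoDL loop just described, assuming the caller supplies a supervised learner learn(dataset) and a constrained Top-K inference routine (e.g. beam search restricted to constraint-satisfying outputs). All names and signatures are illustrative; in particular, this sketch rebuilds the pseudo-labeled set from scratch each round rather than accumulating it.

```python
import numpy as np

def codl(labeled, unlabeled_x, learn, top_k_inference, constraints, rho,
         gamma=0.9, k=5, iters=10):
    """labeled: list of (x, y) pairs; unlabeled_x: list of x with no labels."""
    w0 = np.asarray(learn(labeled), dtype=float)   # 1. Initialization on labeled data
    w = w0.copy()
    for _ in range(iters):
        # 2. "Expectation": self-label the unlabeled data with the K best
        #    constraint-respecting structures under the current model.
        pseudo = []
        for x in unlabeled_x:
            for y in top_k_inference(x, w, constraints, rho, k=k):
                pseudo.append((x, y))
        # 3. "Maximization": retrain on the pseudo-labeled set, then smooth
        #    toward the purely supervised model so w cannot drift too far.
        w = gamma * w0 + (1 - gamma) * np.asarray(learn(pseudo), dtype=float)
    return w
```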
Posterior Regularization [Ganchev et al. '09]
• Related to the hard vs. soft EM distinction above: like soft EM it keeps a distribution over the latent variables, but that distribution is constrained.
• Imposes constraints, in expectation, on the posterior distribution of the latent variables.
• The objective function has two components:
  – the log-likelihood, l(θ) = log p_θ(X, Y); and
  – the deviation of the model's posterior over the latent variables from the set of posterior distributions that satisfy the constraints,

      min_q KL( q ‖ p_θ(Y | X) )   subject to   E_q[ C_i(Y) ] = 1 for each i,

    where the constraints are specified in terms of expectations under q.

The PR Algorithm
• Initialization: estimate the parameters θ from the labeled dataset.
• Expectation step: compute the closest constraint-satisfying distribution

    q^t = argmin_{q : E_q[C_i(Y)] = 1}  KL( q ‖ p_{θ^t}(Y | X) )

• Maximization step:

    θ^(t+1) = argmax_θ  E_{q^t}[ l(θ) ]

• Repeat.

Indirect Supervision: Motivation
Paraphrase identification:
  S1: Druce will face murder charges, Conte said.
  S2: Conte said Druce will be charged with murder.
• There exists some latent structure H between S1 and S2.
• H acts as a justification for the binary decision.
• It can be used as an intermediate step in learning the model.

Supervision through Binary Problems
• Now we ask the previous question in the reverse direction: given answers to the binary problem, can we improve our identification of the latent structure?
• Example structured prediction problem: field identification in advertisements (size, rent, etc.).
  – Companion binary problem: is the text a well-formed advertisement?
  – A labeled dataset for the binary problem is easy to obtain.

The Model [Chang et al. 2010]
Notation:
• L = (X_L, Y_L) is the labeled dataset.
• B = B+ ∪ B− = (X_B, Y_B) is the binary-labeled dataset (y_B ∈ {1, −1}).
• Φ(x, y) is a feature vector.
• Structured learning task: learn w by minimizing (1/2)‖w‖² + C_1 Σ_{i∈L} L_S(x_i, y_i, w).
Additionally we require:
  ∀ (x, −1) ∈ B−, ∀ y:  wᵀΦ(x, y) ≤ 0   (the weight vector scores all structures badly), and
  ∀ (x, +1) ∈ B+, ∃ y:  wᵀΦ(x, y) > 0   (the weight vector scores some structure well).

Loss Function
• The previous "constraints" can be captured by the following loss function:

    L_B(x_i, y_i, w) = ℓ( 1 − y_i max_h wᵀΦ(x_i, h) )

• We now wish to optimize an objective that combines the structured-prediction loss over the labeled dataset with the binary loss over B:

    F(L, w) + Σ_{i∈B} L_B(x_i, y_i, w)

Indirect Supervision Model Specification
Setup:
  Fully-labeled training data: S = {(x_i, h_i)}_{i=1}^{n}
  Binary-labeled training data: B = B+ ∪ B− = {(x_i, y_i)}_{i=n+1}^{n+m}, where y_i ∈ {−1, +1}

Two conditions are imposed on the weight vector:
  ∀ (x, −1) ∈ B−, ∀ h ∈ H(x):  wᵀΦ(x, h) ≤ 0   (i.e. there is no good predicted structure for the negative examples)
  ∀ (x, +1) ∈ B+, ∃ h ∈ H(x):  wᵀΦ(x, h) ≥ 0   (i.e. there is at least one good predicted structure for the positive examples)

So the optimization problem becomes:

    min_w  (1/2)‖w‖² + C_1 Σ_{i∈S} L_S(x_i, h_i, w) + C_2 Σ_{i∈B−} L_B(x_i, y_i, w) + C_2 Σ_{i∈B+} L_B(x_i, y_i, w)

where:
    L_S(x_i, h_i, w) = ℓ( max_h [ Δ(h, h_i) − wᵀ( Φ(x_i, h_i) − Φ(x_i, h) ) ] )
    L_B(x_i, y_i, w) = ℓ( 1 − y_i max_{h ∈ H(x_i)} wᵀΦ(x_i, h) / κ(x_i) )
ℓ is a common loss function, such as the hinge loss, and κ is a normalization constant.
The B+ term is non-convex and must be optimized by setting h = argmax_{h ∈ H} wᵀΦ(x_i, h), solving the resulting convex problem in w, and repeating (a CCCP-like procedure); a minimal sketch of this alternation appears at the end of these notes, after the references.

Latent Variables in NLP: Overview of Three Methods

Latent Structural SVM¹
• Two-second description: a structured SVM with latent variables and EM-like training.
• Latent variable: separate from, and independent of, the output variable.
• EM analogue: hard EM; the latent value is found by argmax_{h ∈ H} wᵀΦ(x_i, y_i, h).
• Key advantage: enables a structured SVM to be learned with a latent variable.

CoDL²
• Two-second description: train on labeled data, generate the K best structures for the unlabeled data and train on those, then average the two models.
• Latent variable: the output variable for the unlabeled training examples.
• EM analogue: soft EM with a uniform distribution over the top-K predicted outputs.
• Key advantage: efficient semi-supervised learning when constraints are difficult to guarantee for predictions but easy to evaluate.

Indirect Supervision³
• Two-second description: take a small number of labeled examples plus many examples for which we only know whether a label exists, and train a model on both at the same time.
• Latent variables: (1) the companion binary decision variable; (2) the output structure on positive, unlabeled examples.
• EM analogue: hard EM where a label is applied only to examples the binary classifier marks as positive.
• Key advantage: combines the information gained from indirect supervision (on lots of data) with direct supervision.

¹ Learning Structural SVMs with Latent Variables. Chun-Nam John Yu and T. Joachims. ICML 2009.
² Guiding Semi-Supervision with Constraint-Driven Learning. M. Chang, L. Ratinov and D. Roth. ACL 2007.
³ Structured Output Learning with Indirect Supervision. M. Chang, V. Srikumar, D. Goldwasser and D. Roth. ICML 2010.
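To close, here is the minimal sketch of the CCCP-like alternation for the indirect-supervision objective promised above. The helper names (phi, latent_space, solve_convex_subproblem) and the overall signature are illustrative assumptions rather than the authors' code; the convex inner solve over the labeled loss, the B− loss and the fixed-h B+ loss is delegated to a caller-supplied routine.

```python
import numpy as np

def indirect_supervision_train(labeled, b_pos, b_neg, phi, latent_space,
                               solve_convex_subproblem, w0, iters=20):
    """labeled: list of (x, h); b_pos / b_neg: inputs with binary label +1 / -1."""
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        # Fix the non-convex part: pick the current best structure for each
        # positive binary example (the "there exists a good h" condition).
        h_pos = [max(latent_space(x), key=lambda h: w @ phi(x, h)) for x in b_pos]
        # With h fixed on B+, the objective is convex in w; hand the labeled
        # loss, the B- loss and the fixed-h B+ loss to the inner solver.
        w_new = np.asarray(
            solve_convex_subproblem(labeled, list(zip(b_pos, h_pos)), b_neg,
                                    phi, w_init=w),
            dtype=float)
        if np.allclose(w_new, w):
            break
        w = w_new
    return w
```

With h fixed, the B+ examples reduce to ordinary binary hinge-loss terms, so the inner solver can reuse whatever structural-SVM or SGD machinery is already available.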