Latent Variables
Naman Agarwal
Michael Nute
May 1, 2013
Latent Variables
Contents
• Definition & Example of Latent Variables
• EM Algorithm Refresher
• Structured SVM with Latent Variables
• Learning under semi-supervision or indirect supervision
– CoDL
– Posterior Regularization
– Indirect Supervision
Latent Variables
General Definition & Examples
A Latent Variable in a machine learning
algorithm is one which is assumed to exist (or
have null value) but which is not observed
and is inferred from other observed
variables.
• Generally corresponds to some meaningful
element of the problem for which direct
supervision is intractable.
• Latent variable methods often imagine the
variable as part of the input/feature space (e.g.
PCA, factor analysis), or as part of the output
space (e.g. EM).
– This distinction is only illustrative though and can
be blurred, as we will see with indirect
supervision.
Latent Input Variables:
[Diagram: $\mathcal{X} \rightarrow \mathcal{X}^*$ (unobserved) $\rightarrow \mathcal{Y}$]
As part of the input space, the variable $x \in \mathcal{X}$
affects the output $y \in \mathcal{Y}$ only through the
unobserved variable $x^* \in \mathcal{X}^*$. This formulation
is only helpful if the dimension of $\mathcal{X}^*$ is smaller
than the dimension of $\mathcal{X}$, so latent variables
here are essentially an exercise in dimension
reduction.
Latent Output Variables:
[Diagram: $\mathcal{X}$, $\mathcal{Y}^*$ (unobserved), $\mathcal{Y}$ (observed)]
When we think of a latent variable as part of
the output space, the method becomes an
exercise in unsupervised or semi-supervised
learning.
Example
Paraphrase Identification
Problem: Given sentences A and B, determine whether they are paraphrases of each
other.
• Note that if they are paraphrases, then there will exist a mapping between named entities
and predicates in the sentence.
• The mapping is not directly observed, but is a latent variable in the decision problem of
determining whether the sentences say the same thing.
A: Druce will face murder charges, Conte said.
(latent mapping)
B: Conte said Druce will be charged with murder.
Revised Problem: Given sentences A and B, determine the mapping of semantic
elements between A and B.
• Now we are trying to learn specifically the mapping between them, so we can use the
Boolean question in the previous problem as a latent variable.
• In practice, the Boolean question is easy to answer, so we can use it to guide the semi-supervised task of mapping semantic elements.
• This is called indirect supervision (more on that later).
1. Example taken from a talk by D. Roth, "Constraints Driven Structured Learning with Indirect Supervision," Language Technologies Institute Colloquium, Carnegie Mellon University, Pittsburgh, PA, April 2010.
The EM Algorithm
Refresher
In practice, many algorithms that use latent variables have a structure similar to the Expectation-Maximization algorithm (even though EM is not discriminative and others are).
So let's review:
The EM Algorithm (formally)
Setup:
Observed Data: $X$
Unobserved Data: $Y$
Unknown Parameters: $\Theta$
Log-Likelihood Function: $L(\Theta \mid X) = \log \sum_{y \in Y} P(X, y \mid \Theta)$
Algorithm:
Initialize $\Theta = \Theta^{(0)}$
E-Step: Find the expected value of the complete-data log-likelihood $\log P(X, Y \mid \Theta)$ over the unobserved data $Y$ given the
current estimate of the parameters:
$Q(\Theta \mid \Theta^{(t)}) = \mathbf{E}_{Y \mid X, \Theta^{(t)}}\left[ \log P(X, Y \mid \Theta) \right]$
(The E-Step takes the expectation over possible "labels" of $Y$.)
M-Step: Find the parameters that maximize the expected log-likelihood function:
$\Theta^{(t+1)} = \arg\max_{\Theta} Q(\Theta \mid \Theta^{(t)})$
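As a concrete illustration of the E-Step and M-Step, here is a minimal sketch of soft EM on a toy problem that is not part of the original slides: each trial is a batch of coin tosses produced by one of two coins, and which coin was used is the unobserved $Y$. The function name, the equal mixing weights and the starting values are all assumptions made for the example.

```python
def soft_em_two_coins(trials, theta_a=0.6, theta_b=0.5, iterations=20):
    """Soft EM for a two-coin mixture with equal, fixed mixing weights.

    trials: list of (heads, tosses) pairs. The coin behind each trial is
    the unobserved variable; theta_a and theta_b are the head probabilities.
    """
    for _ in range(iterations):
        heads_a = tosses_a = heads_b = tosses_b = 0.0
        for heads, tosses in trials:
            tails = tosses - heads
            # E-step: posterior probability that coin A produced this trial.
            # The binomial coefficient cancels in the ratio, so it is omitted.
            like_a = theta_a ** heads * (1 - theta_a) ** tails
            like_b = theta_b ** heads * (1 - theta_b) ** tails
            resp_a = like_a / (like_a + like_b)
            # Accumulate expected counts under that posterior.
            heads_a += resp_a * heads
            tosses_a += resp_a * tosses
            heads_b += (1 - resp_a) * heads
            tosses_b += (1 - resp_a) * tosses
        # M-step: parameters that maximize the expected log-likelihood.
        theta_a = heads_a / tosses_a
        theta_b = heads_b / tosses_b
    return theta_a, theta_b


trials = [(5, 10), (9, 10), (8, 10), (4, 10), (7, 10)]
theta_a, theta_b = soft_em_two_coins(trials)
```

Every trial contributes fractionally to both coins, which is what "taking the expectation over labels" means in practice.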
The EM Algorithm
Hard EM vs. Soft EM
• The algorithm above is often called Soft EM because it computes the expectation of
the log-likelihood function in the E-Step.
• An important variation on this algorithm is called Hard EM:
– instead of computing the expectation, we simply choose the MAP value for $Y$ and proceed
with the likelihood function conditional on the MAP value.
• This is a simpler procedure which many latent variable methods will essentially
resemble:
Label $Y$ → Train $\Theta$ (repeat until convergence)
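The label-then-train loop can be sketched on a toy two-coin problem (an assumption for illustration, not taken from the slides): each trial is a batch of tosses from one of two coins, and the coin identity is the latent label. Instead of weighting each trial by a posterior, the sketch assigns it to the single most likely coin. The function name and starting values are hypothetical.

```python
def hard_em_two_coins(trials, theta_a=0.6, theta_b=0.5, iterations=20):
    """Hard EM: label each trial with its most likely coin, then re-estimate."""
    for _ in range(iterations):
        counts = {"a": [0, 0], "b": [0, 0]}  # [heads, tosses] per coin
        for heads, tosses in trials:
            tails = tosses - heads
            like_a = theta_a ** heads * (1 - theta_a) ** tails
            like_b = theta_b ** heads * (1 - theta_b) ** tails
            # "Label Y": pick the MAP coin (equal priors assumed).
            label = "a" if like_a >= like_b else "b"
            counts[label][0] += heads
            counts[label][1] += tosses
        # "Train Theta": maximum likelihood given the hard labels.
        # A coin that received no trials keeps its previous estimate.
        if counts["a"][1]:
            theta_a = counts["a"][0] / counts["a"][1]
        if counts["b"][1]:
            theta_b = counts["b"][0] / counts["b"][1]
    return theta_a, theta_b


trials = [(5, 10), (9, 10), (8, 10), (4, 10), (7, 10)]
theta_a, theta_b = hard_em_two_coins(trials)
```

On this data the labels stop changing after the first pass, so the estimates settle at the head frequencies of the two groups.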
Yu & Joachims: Learning Structured SVMs with Latent Variables
Model Formulation
General Structured SVM Formulation:
Solve:
$\min_w \frac{1}{2}\|w\|^2 + c \sum_{i=1}^{n} \left[ \max_{\hat{y} \in \mathcal{Y}} \left( \Delta(y_i, \hat{y}) + w'\Phi(x_i, \hat{y}) \right) - w'\Phi(x_i, y_i) \right]$
Where:
$(x_i, y_i)$ are input and structure for training example $i$.
$\Phi(x_i, y_i)$ is the feature vector
$\Delta(y_i, \hat{y})$ is the loss function in the output space
$w$ is the weight vector
Structured SVM Formulation with Latent Variable:
Let $h \in \mathcal{H}$ be an unobserved variable. Since the predicted $\hat{y}$ now depends on $\hat{h}$, the predicted value of the
latent variable $h$, the loss between the actual $y_i$ and the predicted $\hat{y}$ may now become a function of $\hat{h}$ as well:
$\Delta(y_i, \hat{y}) \Rightarrow \Delta(y_i, \hat{y}, \hat{h})$
So our new optimization problem becomes:
$\min_w \frac{1}{2}\|w\|^2 + c \sum_{i=1}^{n} \max_{(\hat{y}, \hat{h}) \in \mathcal{Y} \times \mathcal{H}} \left[ \Delta(y_i, \hat{y}, \hat{h}) + w'\Phi(x_i, \hat{y}, \hat{h}) \right] - c \sum_{i=1}^{n} \max_{h \in \mathcal{H}} w'\Phi(x_i, y_i, h)$
The problem is now the difference of two convex functions, so
we can solve it using a concave-convex procedure (CCCP).
Yu & Joachims: Learning Structured SVMs with Latent Variables
Optimization Methodology & Notes
The CCCP:
1. Compute
$h_i^* = \arg\max_{h \in \mathcal{H}} w'\Phi(x_i, y_i, h)$
for each $i$
2. Update $w$ by solving the standard
Structured SVM formulation, treating each
$h_i^*$ as though it were an observed value.
(repeat until convergence)
Note the similarity to the simple way we
looked at Hard-EM earlier: first we label
the unlabeled values, then we re-train
the model based on the newly labeled
values.
Notes:
• Technically the loss function would compare
the true values $(y_i, h_i)$ to the predicted
$(\hat{y}, \hat{h})$, but since we do not observe $h_i$, we
are restricted to using loss functions that
reduce to what is shown.
• It is not strictly necessary that the loss
function depend on $\hat{h}$. In NLP it often does
not.
• In the absence of latent variables, the
optimization problem reduces to the general
Structured SVM formulation.
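The two-step alternation can be sketched on a toy problem. Everything below is an assumption made for illustration: the input is a small "bag" of feature vectors, the latent $h$ picks one instance from the bag, $\Phi(x, y, h)$ is that instance signed by the label, the loss is 0/1, and step 2 is approximated with plain subgradient descent rather than the solver used by Yu & Joachims.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x, y, h):
    """Toy joint feature map: the h-th instance of bag x, signed by label y."""
    return [y * value for value in x[h]]

def candidates(x):
    return [(y, h) for y in (-1, 1) for h in range(len(x))]

def impute_latent(w, x, y):
    """Step 1: h* = argmax_h w'Phi(x, y, h) for the observed label y."""
    return max(range(len(x)), key=lambda h: dot(w, phi(x, y, h)))

def loss_augmented(w, x, y_true):
    """argmax over (y, h) of Delta(y_true, y) + w'Phi(x, y, h), with 0/1 loss."""
    return max(candidates(x),
               key=lambda yh: (yh[0] != y_true) + dot(w, phi(x, yh[0], yh[1])))

def train(data, dim, outer=5, inner=50, c=1.0, lr=0.1):
    w = [0.0] * dim
    for _ in range(outer):
        # Step 1: fill in the latent variable for every training example.
        latent = [impute_latent(w, x, y) for x, y in data]
        # Step 2: treat h* as observed and take subgradient steps on the
        # resulting convex structured SVM objective (an approximation).
        for _ in range(inner):
            grad = list(w)  # gradient of 0.5 * ||w||^2
            for (x, y), h_star in zip(data, latent):
                y_hat, h_hat = loss_augmented(w, x, y)
                f_hat, f_star = phi(x, y_hat, h_hat), phi(x, y, h_star)
                grad = [g + c * (a - b) for g, a, b in zip(grad, f_hat, f_star)]
            w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

def predict(w, x):
    y, _ = max(candidates(x), key=lambda yh: dot(w, phi(x, yh[0], yh[1])))
    return y


data = [([[2.0], [0.5]], 1), ([[-2.0], [-0.5]], -1)]
w = train(data, dim=1)
```

The fixed step size means the weights keep oscillating slightly instead of converging exactly, which is acceptable for a sketch but not for a real solver.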
Learning under semi-supervision
• Labeled dataset is hard to obtain
• We generally have a small labeled dataset and a large
unlabeled dataset
• Naïve Algorithm [A kind of EM]
– Train on labeled data set [Initialization]
– Make inference on the unlabeled set [Expectation]
– Include them in your training [Maximization]
– Repeat
• Can we do better?
• Indirect supervision
– Constraints
– Binary decision problems
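The naïve algorithm above (train, infer, retrain, repeat) can be sketched with any base learner. The version below assumes a one-dimensional nearest-centroid classifier purely for illustration; `learn`, `infer` and `self_train` are hypothetical names, not functions from any of the cited papers.

```python
def learn(examples):
    """Fit a nearest-centroid classifier: one mean per label."""
    sums, counts = {}, {}
    for x, y in examples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def infer(model, x):
    """Predict the label whose centroid is closest to x."""
    return min(model, key=lambda y: abs(x - model[y]))

def self_train(labeled, unlabeled, rounds=5):
    model = learn(labeled)                                   # Initialization
    for _ in range(rounds):
        guessed = [(x, infer(model, x)) for x in unlabeled]  # Expectation (hard)
        model = learn(labeled + guessed)                     # Maximization
    return model


labeled = [(0.0, "neg"), (10.0, "pos")]
unlabeled = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
model = self_train(labeled, unlabeled)
```

Nothing in this loop prevents early mistakes from being reinforced in later rounds, which is the weakness that constraints and indirect supervision aim to address.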
Constraint Driven Learning
• Proposed by Chang et al. [2007]
• Uses constraints obtained from domain knowledge
to streamline semi-supervision
• Constraints are pretty general
• Incorporates soft constraints
Why are constraints useful ?
• [AUTHOR Lars Ole Anderson . ] [TITLE Program Analysis and specification for
the C programming language . ] [TECH-REPORT PhD thesis , ] [INSTITUTION
DIKU , University of Copenhagen , ] [DATE May 1994 .]
• HMM trained on 30 data sets produces
• [AUTHOR Lars Ole Anderson . Program Analysis and ] [TITLE specification
for the ] [EDITOR C ] [BOOKTITLE programming language . ] [TECH-REPORT PhD thesis , ] [INSTITUTION DIKU , University of Copenhagen , May
] [DATE 1994 .]
• Leads to noisy predictions.
• Simple constraint that state transition occurs only on
punctuation marks produces the correct output
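The punctuation constraint is cheap to evaluate on any candidate labeling. A minimal sketch, assuming tokens and labels are parallel lists and that the set of punctuation marks below is adequate for the example (both are assumptions):

```python
PUNCTUATION = {".", ",", ";", ":"}

def transition_violations(tokens, labels):
    """Count label changes that do not directly follow a punctuation token."""
    return sum(
        1
        for i in range(1, len(tokens))
        if labels[i] != labels[i - 1] and tokens[i - 1] not in PUNCTUATION
    )


tokens = ["Lars", "Ole", "Anderson", ".", "Program", "Analysis", "and", "specification"]
correct = ["AUTHOR"] * 4 + ["TITLE"] * 4   # field changes right after "."
noisy = ["AUTHOR"] * 7 + ["TITLE"]         # field changes after "and"

print(transition_violations(tokens, correct))
print(transition_violations(tokens, noisy))
```

The correct segmentation has no violations, while the noisy one changes field in the middle of the title and is flagged.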
CoDL Framework
• Notations
– $L = (X_L, Y_L)$ is the labeled dataset
– $U = (X_U, Y_U)$ is the unlabeled dataset
– $\phi(X, Y)$ represents a feature vector
– Structured Learning Task: learn $w$ such that $Y_i = \arg\max_{y} w^T \cdot \phi(X_i, y)$
– $C_1, \ldots, C_k$ are the set of constraints where each
$C_i : X \times Y \rightarrow \{0, 1\}$
CoDL Objective
• If the constraints are hard:
$\arg\max_{y \in 1_{C(x)}} w^T \cdot \phi(x, y)$
where $1_{C(x)}$ is the set of assignments that satisfy the constraints
• If constraints are soft they define a notion of violation
by a distance function $d$ such that
$d(y, x, C_i) = \min_{y' \in 1_{C_i(x)}} H(y, y')$
where $H$ is the Hamming distance
• The objective in this "soft" formulation is given by
$\arg\max_{y} \; w^T \cdot \phi(x, y) - \sum_{i=1}^{K} p_i \, d(y, x, C_i)$
Learning Algorithm
• Divided into four steps
• Initialization
$w_0 = learn(L)$
• Expectation
– For all $x \in U$:
$\{(x, y^1), \ldots, (x, y^K)\} = \text{Top-K-Inference}(x, w, C_i, p_i)$
$T = T \cup \{(x, y^1), \ldots, (x, y^K)\}$
– Top-K-Inference generates the best K "valid"
assignments to $Y$ using beam-search techniques
– Can be thought of as assigning a uniform distribution over
the above K in the posterior and 0 everywhere else
Learning Algorithm (contd.)
• Maximization
$w = \gamma w_0 + (1 - \gamma) \cdot learn(T)$
– $\gamma$ is a smoothing parameter that does not let the model
drift too much from the supervised model
• Repeat
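The overall loop can be sketched with the learner and the inference routine passed in as functions. `learn` and `top_k_inference` are placeholders for whatever supervised trainer and constrained beam search are available; their signatures here are assumptions. The sketch also assumes that $T$ is rebuilt from scratch in every round and that weight vectors are plain lists.

```python
def codl(labeled, unlabeled, learn, top_k_inference, gamma=0.9, rounds=3, k=2):
    """Constraint-driven learning loop (sketch).

    learn(examples) -> weight vector (list of floats)
    top_k_inference(x, w, k) -> the k best constraint-respecting outputs for x
    """
    w0 = learn(labeled)                        # Initialization
    w = list(w0)
    for _ in range(rounds):
        pseudo_labeled = []                    # the set T
        for x in unlabeled:                    # Expectation
            for y in top_k_inference(x, w, k):
                pseudo_labeled.append((x, y))
        w_t = learn(pseudo_labeled)            # Maximization
        # Interpolate with the supervised model so w does not drift too far.
        w = [gamma * a + (1 - gamma) * b for a, b in zip(w0, w_t)]
    return w


# Stub learner and inference, only to show the call shape.
stub_learn = lambda examples: [float(len(examples))]
stub_top_k = lambda x, w, k: ["y"] * k
w = codl([("a", "y"), ("b", "y")], ["c", "d", "e"], stub_learn, stub_top_k)
```

With the stubs, the supervised weight is 2.0 and the pseudo-labeled weight is 6.0, so the interpolated result stays close to the supervised value.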
Posterior Regularization [Ganchev et al. '09]
• Hard vs Soft EM
• Imposes constraints in expectation over the posterior
distribution of the latent variables
• Two components of the objective function
– The log-likelihood: $l(\theta) = \log(p_\theta(X, Y))$
– The deviation between the predicted posterior and the one
satisfying constraints:
$\min_q KL(q \,\|\, p_\theta(Y \mid X))$ subject to $C_i(q) = 1$
where $p_\theta(Y \mid X)$ is the posterior distribution of the latent variables,
$q$ ranges over the set of all posterior distributions, and
each constraint $C_i$ is specified in terms of an expectation over $q$
The PR Algorithm
• Initialization
– Estimate parameters $\theta$ from the labeled data set
• Expectation Step
– Compute the closest satisfying distribution
$Q^t = \arg\min_q KL(q \,\|\, p_{\theta^t}(Y \mid X))$ subject to $C_i(q) = 1$
• Maximization Step
$\theta^{t+1} = \arg\max_\theta E_{Q^t}[l(\theta)]$
• Repeat
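For a small discrete output space the Expectation Step has a simple form when the constraint is a single expectation bound. The sketch below assumes the constraint is written as $E_q[f(Y)] \le b$, which is one common way to express posterior constraints but is an assumption about the slide's $C_i(q) = 1$ notation. Under that assumption the KL projection is $q(y) \propto p(y)\exp(-\lambda f(y))$ with $\lambda \ge 0$, and $\lambda$ can be found by bisection. It also assumes `p` is strictly positive and `f` is non-negative.

```python
import math

def kl_project(p, f, b, iterations=60):
    """Return the q minimizing KL(q || p) subject to E_q[f] <= b."""

    def tilt(lam):
        weights = [pi * math.exp(-lam * fi) for pi, fi in zip(p, f)]
        z = sum(weights)
        return [wi / z for wi in weights]

    def expected_f(q):
        return sum(qi * fi for qi, fi in zip(q, f))

    if expected_f(p) <= b:
        return list(p)          # the posterior already satisfies the constraint
    if b <= min(f):
        raise ValueError("bound is too tight for this sketch")

    # E_q[f] decreases as lambda grows, so bracket the root and bisect.
    lo, hi = 0.0, 1.0
    while expected_f(tilt(hi)) > b:
        hi *= 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if expected_f(tilt(mid)) > b:
            lo = mid
        else:
            hi = mid
    return tilt(hi)


posterior = [0.5, 0.3, 0.2]     # p(Y | X) over three candidate outputs
violations = [0.0, 1.0, 2.0]    # f(y): number of violations in each output
q = kl_project(posterior, violations, b=0.4)
```

The projected distribution moves probability mass toward outputs with fewer violations until the expected violation count meets the bound.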
Indirect Supervision - Motivation
• Paraphrase Identification
S1: Druce will face murder charges, Conte said.
S2: Conte said Druce will be charged with murder.
• There exists some latent structure $H$ between S1 and S2
• $H$ acts as a justification for the binary decision.
• Can be used as an intermediate step in learning the
model
Supervision through Binary Problems
• Now we ask the previous question in the reverse
direction: given answers to the binary problem, can we improve our
latent structure identification?
[Diagram: Structured Prediction Problem, paired with a Companion Binary Problem whose labeled dataset is easy to obtain]
• Example:
– Structured prediction problem: field identification in advertisements (size, rent etc.)
– Companion binary problem: whether the text is a well-formed advertisement
The Model [Chang et al 2010]
• Notations:
– $L = (X_L, Y_L)$ is the labeled dataset
– $B = (X_B, Y_B)$ is the binary ($Y_B \in \{1, -1\}$) labeled dataset, with $B = B^+ \cup B^-$
– $\phi(X, Y)$ represents a feature vector
– Structured Learning Task: learn $w$ from
$F(x, y, w) = \min_w \frac{\|w\|^2}{2} + C_1 \sum_{i \in S} L_S(x_i, y_i, w)$
• Additionally we require
$\forall (x, -1) \in B^-, \; \forall y, \quad w^T \cdot \phi(x, y) \le 0$
(The weight vector scores all structures badly)
$\forall (x, +1) \in B^+, \; \exists y, \quad w^T \cdot \phi(x, y) > 0$
(The weight vector scores some structure well)
Loss Function
• The previous "constraint" can be captured by the
following loss function
$L_B(x_i, y_i, w) = l\left(1 - y_i \max_{Y} \left( w^T \cdot \phi(x_i, Y) \right)\right)$
• Now we wish to optimize the following objective
$F(w, X, Y) + \sum_{B} L_B(x_i, y_i, w)$
where $F(w, X, Y)$ is the structured prediction objective
over the labeled dataset
Indirect Supervision
Model Specification
Setup:
Fully-labeled training data:
$S = \{(x_i, h_i)\}_{i=1}^{l}$
Binary-labeled training data:
$B = B^+ \cup B^- = \{(x_i, y_i)\}_{i=l+1}^{l+m}$
where $y_i \in \{-1, +1\}$
Two Conditions Imposed on the Weight Vector:
$\forall (x, -1) \in B^-, \; \forall h \in \mathcal{H}(x), \quad w'\Phi(x, h) \le 0$
(i.e. there is no good predicted structure for
the negative examples)
$\forall (x, +1) \in B^+, \; \exists h \in \mathcal{H}(x), \quad w'\Phi(x, h) \ge 0$
(i.e. there is at least one good predicted
structure for the positive examples)
So the optimization problem becomes:
$\min_w \frac{\|w\|^2}{2} + C_1 \sum_{i \in S} L_S(x_i, h_i, w) + C_2 \sum_{i \in B^-} L_B(x_i, y_i, w) + C_2 \sum_{i \in B^+} L_B(x_i, y_i, w)$
Where:
$L_S(x_i, h_i, w) = \ell\left( \max_{h} \left[ \Delta(h, h_i) - w'\left( \Phi(x_i, h_i) - \Phi(x_i, h) \right) \right] \right)$
$L_B(x_i, y_i, w) = \ell\left( 1 - y_i \max_{h \in \mathcal{H}(x)} \frac{w'\Phi(x_i, h)}{\kappa(x_i)} \right)$
$\ell$ is a common loss function such as the hinge loss
$\kappa$ is a normalization constant
Note on the $B^+$ term: this term is non-convex and
must be optimized by setting
$h = \arg\max_{h \in \mathcal{H}} w'\Phi(x_i, h)$ and
solving the first two terms,
then repeating (CCCP-like).
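The binary loss $L_B$ is easy to evaluate once the candidate structures for an input are enumerated. A minimal sketch, assuming $\ell$ is the hinge loss, each candidate structure is given directly as its feature vector $\Phi(x_i, h)$, and $\kappa$ defaults to 1 (all assumptions for illustration):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hinge(z):
    return max(0.0, z)

def binary_loss(w, structure_features, y, kappa=1.0):
    """L_B = hinge(1 - y * max_h w'Phi(x, h) / kappa).

    structure_features: one feature vector Phi(x, h) per candidate h in H(x)
    y: +1 if the input is a positive example, -1 otherwise
    """
    best = max(dot(w, features) for features in structure_features)
    return hinge(1.0 - y * best / kappa)


w = [1.0, -1.0]
candidates = [[2.0, 0.0], [0.0, 1.0]]   # scores 2.0 and -1.0 under w

print(binary_loss(w, candidates, y=+1))
print(binary_loss(w, candidates, y=-1))
```

A positive example incurs no loss because one structure scores above the margin, while a negative example with the same candidates is penalized because a well-scoring structure exists.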
Latent Variables in NLP
Overview of Three Methods
| Method | 2-Second Description | Latent Variable | EM Analogue | Key Advantage |
| --- | --- | --- | --- | --- |
| Structural SVM [1] | Structured SVM with latent variables & EM-like training | Separate and independent from the output variable | Hard EM, latent value found by $\arg\max_{h \in \mathcal{H}} w'\Phi(x_i, y_i, h)$ | Enables Structured SVM learned with latent variable |
| CoDL [2] | Train on labeled data, generate K best structures of unlabeled data and train on that. Average the two. | Output variable for unlabeled training examples | Soft-EM with uniform distribution on top-K predicted outputs | Efficient semi-supervised learning when constraints are difficult to guarantee for predictions but easy to evaluate |
| Indirect Supervision [3] | Get small number of labeled & many where we know if label exists or not. Train a model on both at the same time. | 1. Companion binary-decision variable; 2. Output structure on positive, unlabeled examples | Hard EM where label is applied only to examples where binary classifier is positive | Combines information gain from indirect supervision (on lots of data) with direct supervision |

[1] Learning Structural SVMs with Latent Variables, Chun-Nam John Yu and T. Joachims, ICML 2009.
[2] Guiding Semi-Supervision with Constraint-Driven Learning, M. Chang, L. Ratinov and D. Roth, ACL 2007.
[3] Structured Output Learning with Indirect Supervision, M. Chang, V. Srikumar, D. Goldwasser and D. Roth, ICML 2010.