Learning Structural SVMs with Latent Variables

Chun-Nam Yu
Cornell University
Joint work with Thorsten Joachims
Presented at the NIPS '08 SISO Workshop





• Widely used in statistics and machine learning
• Can represent unobserved quantities in experiments (e.g. intelligence)
• Dimensionality reduction: controls the number of degrees of freedom that generate the observed data
• Classic examples: factor analysis, mixture models, PCA
• This talk focuses on the use of latent variables in supervised prediction tasks



[Example noun phrases from Cardie & Wagstaff 99: "John Simon", "Chief Financial Officer", "his", "the 37-year-old", "president", "Prime Corp.", "the financial services company"]

• Input x – noun phrases with related features
• Label y – clusters of coreferent noun phrases; might need to reason with transitivity to determine coreference
• Latent variable h – links that connect two coreferent noun phrases (see the sketch below)
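
The talk only notes that clustering must respect transitivity; as a hypothetical illustration (not code from the talk), the clusters y can be recovered from the links h by computing connected components with union-find:

```python
# Hypothetical illustration: recover coreference clusters y from latent
# links h by computing connected components with union-find, which
# enforces transitivity.

def clusters_from_links(num_phrases, links):
    """links: iterable of (i, j) index pairs of coreferent noun phrases."""
    parent = list(range(num_phrases))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    for i, j in links:
        parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for i in range(num_phrases):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# e.g. links (0, 1) and (1, 2) put phrases 0, 1, 2 in one cluster:
# clusters_from_links(4, [(0, 1), (1, 2)])  ->  [[0, 1, 2], [3]]
```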

• Generative hidden Markov models in speech recognition and bioinformatics
• Hidden CRFs in object recognition [Wang et al. 06] – representing parts of objects
• PCFGs with latent annotations for parsing [Petrov & Klein 07] – mixture distributions to represent part-of-speech tags
• Semi-supervised structural SVMs [Zien & Brefeld 07]


Almost all of the above applications are based on probabilistic models. Can we introduce latent variables into structural SVMs?
Many interesting questions to be answered:
◦ Representation – what changes to the joint feature vector and loss function?
◦ Training – how to handle the non-convex objective?
◦ Inference – what changes to the inference procedures in training and testing?

In a conventional structural SVM we learn a linear prediction rule:

f_{\vec{w}}(x) = \operatorname{argmax}_{y \in \mathcal{Y}} \; \vec{w} \cdot \Phi(x, y)

Extend \Phi(x, y) to \Phi(x, y, h) to include a set of latent explanatory variables. The new argmax prediction rule is:

f_{\vec{w}}(x) = \hat{y}, \quad \text{where } (\hat{y}, \hat{h}) = \operatorname{argmax}_{(y, h) \in \mathcal{Y} \times \mathcal{H}} \; \vec{w} \cdot \Phi(x, y, h)

This requires joint inference over the output y and the latent variable h; see the sketch below.
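
As an illustration only, here is a brute-force sketch of this prediction rule, assuming Y and H are small enough to enumerate; `phi` is a placeholder for the joint feature map, and real applications would replace the enumeration with an efficient joint inference procedure:

```python
import itertools
import numpy as np

# Brute-force sketch of the extended prediction rule
#   f_w(x) = argmax_{(y,h) in Y x H} w . Phi(x, y, h)

def predict(w, x, Y, H, phi):
    """phi(x, y, h) is a placeholder returning the joint feature vector."""
    best_score, best_y = -np.inf, None
    for y, h in itertools.product(Y, H):
        score = float(np.dot(w, phi(x, y, h)))
        if score > best_score:
            best_score, best_y = score, y
    return best_y  # the latent variable h is maximized out, not returned
```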
Optimization Problem for Latent Variable Structural SVM:

\min_{\vec{w}, \vec{\xi}} \; \frac{1}{2} \|\vec{w}\|^2 + C \sum_{i=1}^{n} \xi_i

s.t. for 1 \le i \le n, for all output structures \hat{y} \in \mathcal{Y}:

\max_{h \in \mathcal{H}} \vec{w} \cdot \Phi(x_i, y_i, h) \; - \; \max_{\hat{h} \in \mathcal{H}} \vec{w} \cdot \Phi(x_i, \hat{y}, \hat{h}) \; \ge \; \Delta(y_i, \hat{y}) - \xi_i

• We assume the loss \Delta does not depend on the latent variables
• \xi_i bounds the loss incurred by the new prediction rule

The prediction loss on the training set is bounded by \xi_i: writing (\hat{y}_i, \hat{h}_i) for the prediction on x_i,

\Delta(y_i, \hat{y}_i) \; \le \; \Delta(y_i, \hat{y}_i) + \vec{w} \cdot \Phi(x_i, \hat{y}_i, \hat{h}_i) - \max_{h \in \mathcal{H}} \vec{w} \cdot \Phi(x_i, y_i, h)

\le \; \max_{(\hat{y}, \hat{h}) \in \mathcal{Y} \times \mathcal{H}} \left[ \Delta(y_i, \hat{y}) + \vec{w} \cdot \Phi(x_i, \hat{y}, \hat{h}) \right] - \max_{h \in \mathcal{H}} \vec{w} \cdot \Phi(x_i, y_i, h) \; = \; \xi_i

The first inequality holds because (\hat{y}_i, \hat{h}_i) maximizes \vec{w} \cdot \Phi(x_i, y, h) jointly over all (y, h), so the added difference is nonnegative.
• If the loss function depends on the latent variable h of y_i, i.e. \Delta((y_i, h), (\hat{y}, \hat{h})), this bounding trick cannot be applied
• This rules out semi-supervised learning, but the formulation is still applicable to many latent variable problems
• The loss-augmented inference now extends over \hat{h}

Constrained Concave-Convex Procedure [Yuille & Rangarajan 03]:
• Decompose the objective as a sum of convex and concave parts
• Upper-bound the concave part with a hyperplane
• Minimize the convex sum; iterate until convergence (formalized below)
Our objective can be rewritten as:

\min_{\vec{w}} \Bigg( \underbrace{\frac{1}{2} \|\vec{w}\|^2 + C \sum_{i=1}^{n} \max_{(\hat{y}, \hat{h}) \in \mathcal{Y} \times \mathcal{H}} \left[ \vec{w} \cdot \Phi(x_i, \hat{y}, \hat{h}) + \Delta(y_i, \hat{y}) \right]}_{\text{convex}} \; \underbrace{- \; C \sum_{i=1}^{n} \max_{h \in \mathcal{H}} \vec{w} \cdot \Phi(x_i, y_i, h)}_{\text{concave}} \Bigg)
Computing the upper-bound hyperplane is the same as completing the latent variables:

\operatorname{argmax}_{h \in \mathcal{H}} \; \vec{w} \cdot \Phi(x_i, y_i, h)

The convex minimization problem after filling in the latent variables can be solved with a cutting-plane algorithm; the sketch below shows the resulting training loop.
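
A minimal sketch of the overall training loop, assuming two hypothetical oracles (the names `complete_latent` and `solve_ssvm` are ours, not from the talk):

```python
# CCCP-style training loop for the latent variable structural SVM.
#   complete_latent(w, x, y) -> argmax_h w . Phi(x, y, h)
#   solve_ssvm(data, h_star) -> (w, objective): cutting-plane solver for
#     the standard structural SVM with the latent variables filled in.

def train_latent_ssvm(w0, data, complete_latent, solve_ssvm,
                      max_iters=50, tol=1e-4):
    w, prev_obj = w0, float("inf")
    for _ in range(max_iters):
        # Upper-bound the concave part: complete the latent variables
        # for every training example under the current w.
        h_star = [complete_latent(w, x, y) for x, y in data]
        # Minimize the resulting convex upper bound.
        w, obj = solve_ssvm(data, h_star)
        if prev_obj - obj < tol:  # stop when the objective stops decreasing
            break
        prev_obj = obj
    return w
```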

Solve the optimization problem:

\min_{\vec{w}, \vec{\xi}} \; \frac{1}{2} \|\vec{w}\|^2 + C \sum_{i=1}^{n} \xi_i

s.t. for 1 \le i \le n, for all output structures \hat{y} \in \mathcal{Y}:

\max_{h \in \mathcal{H}} \vec{w} \cdot \Phi(x_i, y_i, h) \; - \; \max_{\hat{h} \in \mathcal{H}} \vec{w} \cdot \Phi(x_i, \hat{y}, \hat{h}) \; \ge \; \Delta(y_i, \hat{y}) - \xi_i
• Restriction on the loss function \Delta
• Three related inference problems:
◦ Prediction: \operatorname{argmax}_{(y, h) \in \mathcal{Y} \times \mathcal{H}} \; \vec{w} \cdot \Phi(x, y, h)
◦ Loss-augmented inference: \operatorname{argmax}_{(\hat{y}, \hat{h}) \in \mathcal{Y} \times \mathcal{H}} \left[ \Delta(y_i, \hat{y}) + \vec{w} \cdot \Phi(x_i, \hat{y}, \hat{h}) \right]
◦ Latent variable completion: \operatorname{argmax}_{h \in \mathcal{H}} \; \vec{w} \cdot \Phi(x_i, y_i, h)

• DNA sequences from two closely related types of yeast, S. cerevisiae and S. kluyveri, all containing ARS (autonomously replicating sequences)
• An ARS in S. cerevisiae might or might not be functional in S. kluyveri
• Problem: find the motif responsible for the replication process in S. cerevisiae




• \Phi – parameters of a position weight matrix and a Markov background model of order 4
• \Delta – zero-one loss
• Latent variable h – position of the motif
• Joint inference and loss-augmented inference can all be done efficiently (y is binary, h is linear in the length of the DNA sequence); see the sketch below
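
For this model, latent variable completion is just a scan over motif start positions. A minimal sketch, under our simplifying assumption (not stated in the talk) that the PWM and order-4 background scores are folded into a single hypothetical (width x 4) log-odds matrix `pwm_logodds`:

```python
import numpy as np

# Latent variable completion for the motif model:
#   argmax_h w . Phi(x, y, h) reduces to scanning all motif start positions.

BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def best_motif_position(seq, pwm_logodds):
    width = len(pwm_logodds)
    best_h, best_score = 0, -np.inf
    for h in range(len(seq) - width + 1):  # h is linear in sequence length
        score = sum(pwm_logodds[j][BASE[seq[h + j]]] for j in range(width))
        if score > best_score:
            best_h, best_score = h, score
    return best_h, best_score
```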


• Data: 197 yeast DNA sequences from S. cerevisiae and S. kluyveri, with about 6000 intergenic sequences for background estimation
• 10-fold CV, 10 random restarts for each parameter setting

Algorithm                        Error Rate
-------------------------------  ----------
Gibbs Sampler (w = 11)           37.97 %
Gibbs Sampler (w = 17)           35.06 %
Latent Variable SSVM (w = 11)    11.09 %
Latent Variable SSVM (w = 17)    12.00 %

Good classification accuracy; we are currently working on ways to interpret the motif signals.




• We have proposed a formulation of structural SVMs with latent variables and an efficient algorithm for solving the learning problem
• Results on discriminative motif finding in yeast DNA indicate that the proposed algorithm is promising
• We are currently working on other applications, such as hierarchical clustering for noun phrase coreference
• Interesting research questions include the extension to slack rescaling