Chun-Nam Yu, Cornell University
Joint work with Thorsten Joachims
Presented at the NIPS '08 SISO Workshop

Latent variables are widely used in statistics and machine learning:
◦ They can represent unobserved quantities in experiments (e.g. intelligence)
◦ They support dimensionality reduction, controlling the number of degrees of freedom that generate the observed data
◦ Classic examples: factor analysis, mixture models, PCA
This talk focuses on the use of latent variables in supervised prediction tasks.

Example: noun phrase coreference [from Cardie & Wagstaff 99]
Noun phrases: "John Simon", "Chief Financial Officer", "His", "the 37-year-old", "president", "Prime Corp.", "The financial services company"
◦ Input x – noun phrases with related features
◦ Label y – clusters of coreferent noun phrases; we might need to reason with transitivity to determine coreference
◦ Latent variable h – links that connect two coreferent noun phrases

Latent variables have been used in several structured prediction models:
◦ Generative hidden Markov models in speech recognition and bioinformatics
◦ Hidden CRF in object recognition [Wang et al. 06] – representing parts of objects
◦ PCFG with latent annotations for parsing [Petrov & Klein 07] – mixture distributions to represent part-of-speech tags
◦ Semi-supervised structural SVM [Zien & Brefeld 07]
Almost all of the above applications are based on probabilistic models. Can we introduce latent variables into structural SVMs?

Many interesting questions to be answered:
◦ Representation – changes to the joint feature vector and loss functions?
◦ Training – a non-convex objective?
◦ Inference – changes to the inference procedures in training and testing?

In a conventional structural SVM we learn a linear prediction rule:
$f_{\mathbf{w}}(x) = \operatorname{argmax}_{y \in \mathcal{Y}} \mathbf{w} \cdot \Phi(x, y)$
We extend $\Phi(x, y)$ to $\Phi(x, y, h)$ to include a set of latent explanatory variables, giving a new argmax prediction rule:
$f_{\mathbf{w}}(x) = \bar{y}$, where $(\bar{y}, \bar{h}) = \operatorname{argmax}_{(y, h) \in \mathcal{Y} \times \mathcal{H}} \mathbf{w} \cdot \Phi(x, y, h)$
This requires joint inference over the output y and the latent variable h.

Optimization problem for the latent variable structural SVM:
$\min_{\mathbf{w}, \boldsymbol{\xi}} \; \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$
s.t. for $1 \le i \le n$ and all output structures $\hat{y} \in \mathcal{Y}$:
$\max_{h \in \mathcal{H}} \mathbf{w} \cdot \Phi(x_i, y_i, h) \;-\; \max_{\hat{h} \in \mathcal{H}} \mathbf{w} \cdot \Phi(x_i, \hat{y}, \hat{h}) \;\ge\; \Delta(y_i, \hat{y}) - \xi_i$
◦ We assume the loss $\Delta$ does not depend on the latent variables
◦ $\xi_i$ bounds the loss of the new prediction rule

The prediction loss on the training set is bounded as follows, where $(\bar{y}_i, \bar{h}_i)$ is the prediction on $x_i$:
$\Delta(y_i, \bar{y}_i) \;\le\; \Delta(y_i, \bar{y}_i) + \mathbf{w} \cdot \Phi(x_i, \bar{y}_i, \bar{h}_i) - \max_{h \in \mathcal{H}} \mathbf{w} \cdot \Phi(x_i, y_i, h)$
$\;\le\; \max_{(\hat{y}, \hat{h}) \in \mathcal{Y} \times \mathcal{H}} \big[\Delta(y_i, \hat{y}) + \mathbf{w} \cdot \Phi(x_i, \hat{y}, \hat{h})\big] - \max_{h \in \mathcal{H}} \mathbf{w} \cdot \Phi(x_i, y_i, h) \;=\; \xi_i$
The first inequality holds because $\mathbf{w} \cdot \Phi(x_i, \bar{y}_i, \bar{h}_i) = \max_{(y, h) \in \mathcal{Y} \times \mathcal{H}} \mathbf{w} \cdot \Phi(x_i, y, h) \ge \max_{h \in \mathcal{H}} \mathbf{w} \cdot \Phi(x_i, y_i, h)$.

◦ If the loss function depends on the latent variable of $y_i$, i.e. $\Delta((y_i, h), (\hat{y}, \hat{h}))$, this bounding trick cannot be applied
◦ This rules out semi-supervised learning, but the formulation is still applicable to many latent variable problems
◦ The loss-augmented inference now extends over $\hat{h}$ as well

Constrained Concave-Convex Procedure [Yuille & Rangarajan 03]:
◦ Decompose the objective as a sum of convex and concave parts
◦ Upper bound the concave part with a hyperplane
◦ Minimize the convex sum
◦ Iterate until convergence
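The following is a minimal sketch of this outer loop, assuming problem-specific callbacks; the names latent_ssvm_cccp, complete_latent, and solve_convex_ssvm are hypothetical placeholders, not the authors' implementation.

```python
import numpy as np

def latent_ssvm_cccp(xs, ys, dim, complete_latent, solve_convex_ssvm,
                     C=1.0, max_outer_iters=50, tol=1e-4):
    """Sketch of the CCCP-style outer loop for a latent variable structural SVM.

    complete_latent(w, x, y) -> h        : argmax_h  w . Phi(x, y, h)
    solve_convex_ssvm(xs, ys, hs, C, w0) : solves the standard (convex)
        structural SVM with the latent variables hs fixed, e.g. by a
        cutting plane algorithm, and returns (w, objective_value).
    Both callbacks are assumed, problem-specific components.
    """
    w = np.zeros(dim)
    prev_obj = np.inf
    for _ in range(max_outer_iters):
        # Upper bound the concave part: fill in the latent variables
        # with the current model (latent variable completion).
        hs = [complete_latent(w, x, y) for x, y in zip(xs, ys)]
        # Minimize the convex upper bound: a standard structural SVM
        # problem on the completed examples (x_i, y_i, h_i).
        w, obj = solve_convex_ssvm(xs, ys, hs, C, w)
        # Stop when the objective no longer decreases noticeably.
        if prev_obj - obj < tol:
            break
        prev_obj = obj
    return w
```

Each iteration needs only the latent variable completion and a standard structural SVM solver, which is why the three inference problems listed below are the key computational components.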
Our objective can be rewritten as the sum of a convex and a concave part:
$\min_{\mathbf{w}} \; \Big[ \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \max_{(\hat{y}, \hat{h}) \in \mathcal{Y} \times \mathcal{H}} \big[\mathbf{w} \cdot \Phi(x_i, \hat{y}, \hat{h}) + \Delta(y_i, \hat{y})\big] \Big] \;-\; \Big[ C \sum_{i=1}^{n} \max_{h \in \mathcal{H}} \mathbf{w} \cdot \Phi(x_i, y_i, h) \Big]$
The first bracketed term is convex in $\mathbf{w}$; the second term, together with its minus sign, is concave.

Computing the upper-bounding hyperplane of the concave part is the same as completing the latent variables:
$h_i^{*} = \operatorname{argmax}_{h \in \mathcal{H}} \mathbf{w} \cdot \Phi(x_i, y_i, h)$
The convex minimization problem after filling in the latent variables can then be solved with a cutting plane algorithm.

Solve the optimization problem:
$\min_{\mathbf{w}, \boldsymbol{\xi}} \; \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$
s.t. for $1 \le i \le n$ and all output structures $\hat{y} \in \mathcal{Y}$:
$\max_{h \in \mathcal{H}} \mathbf{w} \cdot \Phi(x_i, y_i, h) \;-\; \max_{\hat{h} \in \mathcal{H}} \mathbf{w} \cdot \Phi(x_i, \hat{y}, \hat{h}) \;\ge\; \Delta(y_i, \hat{y}) - \xi_i$
◦ Restriction on the loss function $\Delta$: it must not depend on the latent variables
Three related inference problems:
◦ Prediction: $\operatorname{argmax}_{(y, h) \in \mathcal{Y} \times \mathcal{H}} \mathbf{w} \cdot \Phi(x, y, h)$
◦ Loss-augmented inference: $\operatorname{argmax}_{(\hat{y}, \hat{h}) \in \mathcal{Y} \times \mathcal{H}} \big[\Delta(y_i, \hat{y}) + \mathbf{w} \cdot \Phi(x_i, \hat{y}, \hat{h})\big]$
◦ Latent variable completion: $\operatorname{argmax}_{h \in \mathcal{H}} \mathbf{w} \cdot \Phi(x_i, y_i, h)$

Application: discriminative motif finding in yeast DNA
◦ DNA sequences from two closely related species of yeast, S. cerevisiae and S. kluyveri, all containing an ARS (autonomously replicating sequence)
◦ An ARS in S. cerevisiae might or might not be functional in S. kluyveri
◦ Problem: find the motif responsible for the replication process in S. cerevisiae
◦ Joint feature vector $\Phi$ – parameters for a position weight matrix (the motif) and a Markov background model of order 4
◦ Loss $\Delta$ – zero-one loss
◦ Latent variable h – position of the motif
◦ Joint inference and loss-augmented inference can all be done efficiently (y is binary, and h ranges over a number of positions linear in the length of the DNA sequence); see the sketch of the completion step at the end

Experiments:
◦ Data – 197 yeast DNA sequences from S. cerevisiae and S. kluyveri; about 6000 intergenic sequences for background estimation
◦ 10-fold CV, 10 random restarts for each parameter setting

Algorithm                        Error Rate
Gibbs Sampler (w = 11)           37.97%
Gibbs Sampler (w = 17)           35.06%
Latent Variable SSVM (w = 11)    11.09%
Latent Variable SSVM (w = 17)    12.00%

Good classification accuracy; we are currently working on ways to interpret the motif signals.

◦ We have proposed a formulation of structural SVMs with latent variables and an efficient algorithm for solving the learning problem
◦ Results on discriminative motif finding in yeast DNA indicate that the proposed algorithm is promising
◦ We are currently working on other applications such as hierarchical clustering for noun phrase coreference
◦ Interesting research questions include the extension to slack rescaling
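The sketch below illustrates the latent variable completion step for the motif model referenced above. It uses a simplified per-position array of background log probabilities in place of the order-4 Markov model, and the function name, argument layout, and scoring details are assumptions for exposition rather than the authors' implementation.

```python
import numpy as np

def best_motif_position(seq, pwm, bg_logprob):
    """Latent variable completion for the motif model: pick the motif start
    position h that maximizes the score of placing the motif at h
    (sketch; names and scoring are illustrative assumptions).

    seq        : DNA string over the alphabet A/C/G/T
    pwm        : (w, 4) array of log scores from the position weight matrix
    bg_logprob : per-position background log probabilities (a simplification
                 of the order-4 Markov background model)
    """
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    w = pwm.shape[0]
    best_h, best_score = 0, -np.inf
    for h in range(len(seq) - w + 1):  # h is linear in the sequence length
        window = seq[h:h + w]
        motif_score = sum(pwm[j, idx[c]] for j, c in enumerate(window))
        # Score gain from explaining the window by the motif instead of the background.
        score = motif_score - float(np.sum(bg_logprob[h:h + w]))
        if score > best_score:
            best_h, best_score = h, score
    return best_h, best_score
```

Because h only ranges over the possible motif start positions, which grow linearly with sequence length, this completion step (and the corresponding loss-augmented inference for a binary y) remains efficient.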