Max-Margin Latent Variable Models
M. Pawan Kumar
Collaborators: Kevin Miller, Rafi Witten, Tim Tang, Danny Goodman, Haithem Turki, Dan Preston, Ben Packer, Daphne Koller, Dan Selsam, Andrej Karpathy

Computer Vision Data
Annotation detail and dataset size trade off (log scale):
• Segmentation information: ~2,000 images
• Bounding-box information: ~12,000 images
• Image-level labels ("Car", "Chair"): >14M images
• Noisy labels: >6B images
Detailed annotations are scarce while weak ones are abundant, so we must learn with missing information (latent variables).

Outline
• Two Types of Problems
• Latent SVM (Background)
• Self-Paced Learning
• Max-Margin Min-Entropy Models
• Discussion

Annotation Mismatch
Learn to classify an image: image x, latent variable h (e.g., the object's location), annotation a = "Deer".
Learn to classify a DNA sequence: sequence x, latent variables h (e.g., the motif position), annotation a ∈ {+1, −1}.
In both cases there is a mismatch between the desired and the available annotations, and the exact value of the latent variable is not "important".

Output Mismatch
Learn to segment an image: image x, output y. The available annotation in (x, a) is only an image-level tag ("Bird", "Cow"); the latent variable in (a, h) must complete the annotation to a full segmentation.
Learn to classify actions: labeled pairs (x, y) for the action "jumping" are supplemented with extra images x whose latent variables are h_a ∈ {+1, −1} (does the image actually show the action?) and h_b (e.g., the location of the person).
Here the mismatch is between the desired output and the available annotations, and the exact value of the latent variable is important.

Latent SVM
Andrews et al., 2001; Smola et al., 2005; Felzenszwalb et al., 2008; Yu and Joachims, 2009
Given an image x with annotation a = "Deer", joint features Φ(x, a, h) over the latent variable h, and parameters w, prediction is
(a(w), h(w)) = argmax_{a,h} w^T Φ(x, a, h)

Parameter Learning
The score of the best completion of the ground truth should beat the score of every other output:
max_h w^T Φ(x_i, a_i, h) > w^T Φ(x_i, a, h)
For the annotation-mismatch setting, with the margin scaled by the loss Δ:
min ||w||² + C Σ_i ξ_i
s.t. max_h w^T Φ(x_i, a_i, h) ≥ w^T Φ(x_i, a, h) + Δ(a_i, a) − ξ_i for all a, h

Optimization (CCCP-style)
• Update h_i* = argmax_h w^T Φ(x_i, a_i, h)
• Update w by solving a convex problem:
   min ||w||² + C Σ_i ξ_i
   s.t. w^T Φ(x_i, a_i, h_i*) − w^T Φ(x_i, a, h) ≥ Δ(a_i, a) − ξ_i for all a, h
• Repeat until convergence

Self-Paced Learning
Kumar, Packer and Koller, NIPS 2010
A curriculum analogy: teach 1+1 = 2, 1/3 + 1/6 = 1/2, and e^{iπ} + 1 = 0 all at once and the student concludes "Math is for losers!!" (FAILURE: a bad local minimum). Teach them in order of difficulty and the student concludes "Euler was a genius!!" (SUCCESS: a good local minimum).

Optimization (Self-Paced)
• Update h_i* = argmax_h w^T Φ(x_i, a_i, h)
• Update w and the selection variables v_i ∈ {0, 1} by solving the biconvex problem
   min ||w||² + C Σ_i ξ_i v_i − λ Σ_i v_i
   s.t. w^T Φ(x_i, a_i, h_i*) − w^T Φ(x_i, a, h) ≥ Δ(a_i, a) − ξ_i for all a, h
  For fixed w the optimal selection has a closed form, v_i = 1 iff C ξ_i < λ, so only "easy" (low-slack) samples influence w.
• Anneal λ ← λμ (admitting harder samples) and repeat until convergence
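Concretely, here is a minimal sketch of the self-paced loop, assuming small enumerable output and latent spaces. The joint feature map `feat`, the loss `delta`, and the inner subgradient loop are illustrative stand-ins (not from the talk; the real systems solve the convex w-step exactly); only the alternation and the selection rule v_i = 1 iff C ξ_i < λ come from the slides.

```python
import numpy as np

def best_latent(w, x, a, H, feat):
    """Latent completion: h* = argmax_h w . Phi(x, a, h)."""
    return max(H, key=lambda h: w @ feat(x, a, h))

def spl_train(data, A, H, feat, delta, dim,
              C=1.0, lam=0.5, mu=1.3, n_outer=10, n_inner=100, lr=0.01):
    """Self-paced learning for a latent SVM (sketch).

    data: list of (x, a_true) pairs; A, H: enumerable output / latent sets;
    feat(x, a, h) -> np.ndarray of length dim; delta(a_true, a) -> float.
    """
    w = np.zeros(dim)
    for _ in range(n_outer):
        # (1) Impute latent variables with the current model.
        h_star = [best_latent(w, x, a, H, feat) for x, a in data]
        # (2) Slack xi_i via loss-augmented inference, then select
        #     easy samples: v_i = 1 iff C * xi_i < lam.
        xi = []
        for (x, a), hs in zip(data, h_star):
            worst = max(w @ feat(x, aa, hh) + delta(a, aa)
                        for aa in A for hh in H)
            xi.append(max(0.0, worst - w @ feat(x, a, hs)))
        v = [C * s < lam for s in xi]
        # (3) Approximate the convex w-update by subgradient descent
        #     on the selected samples only.
        for _ in range(n_inner):
            g = w.copy()  # subgradient of the regularizer ||w||^2 / 2
            for (x, a), hs, keep in zip(data, h_star, v):
                if not keep:
                    continue
                aa, hh = max(((p, q) for p in A for q in H),
                             key=lambda t: w @ feat(x, t[0], t[1])
                                           + delta(a, t[0]))
                if w @ feat(x, aa, hh) + delta(a, aa) > w @ feat(x, a, hs):
                    g += C * (feat(x, aa, hh) - feat(x, a, hs))
            w -= lr * g
        lam *= mu  # anneal: admit harder samples in the next round
    return w
```

Setting mu = 1 and v_i = 1 everywhere recovers the plain CCCP-style loop above; the annealed selection is the only self-paced ingredient.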
Image Classification
Mammals dataset: 271 images, 6 classes, 90/10 train/test split, 5 folds.
Kumar, Packer and Koller, NIPS 2010
[Figure: objective value and test error, CCCP vs. SPL. HOG-based model (Dalal and Triggs, 2005).]

Image Classification
PASCAL VOC 2007 dataset: ~5,000 images, Car vs. Not-Car, 50/50 train/test split, 5 folds.
Witten, Miller, Kumar, Packer and Koller, In Preparation
[Figures: objective value and mean average precision for HOG + dense SIFT + dense color SIFT features.]
SPL+: different features choose different "easy" samples.

Motif Finding
UniProbe dataset: ~40,000 sequences, Binding vs. Not-Binding, 50/50 train/test split, 5 folds.
Kumar, Packer and Koller, NIPS 2010
[Figure: objective value and test error, CCCP vs. SPL. Motif + Markov background model (Yu and Joachims, 2009).]

Semantic Segmentation
VOC Segmentation 2009: train 1,274 / validation 225 / test 750 images. Stanford Background: train 572 / validation 53 / test 90 images.
Additional weakly annotated data: VOC Detection 2009 bounding boxes (1,564 training images) and ImageNet image-level labels (1,000 training images).
Kumar, Turki, Preston and Koller, ICCV 2011
[Figure: VOC overlap and SBD overlap, SUP vs. CCCP vs. SPL.]
SUP: supervised learning (segmentation data only). Region-based model (Gould, Fulton and Koller, 2009).

Action Classification
PASCAL VOC 2011: 3,000 training and 3,000 test instances with bounding boxes, plus 10,000 noisy training images.
Packer, Kumar, Tang and Koller, In Preparation
[Figure: mean average precision, SUP vs. CCCP vs. SPL. Poselet-based model (Maji, Bourdev and Malik, 2011).]

Self-Paced Multiple Kernel Learning
Kumar, Packer and Koller, In Preparation
The curriculum analogy again: integers (1+1 = 2), then rational numbers (1/3 + 1/6 = 1/2), then imaginary numbers (e^{iπ} + 1 = 0). Teaching every stage with a fixed model wastes the easy stages; instead, adapt the model complexity along with the curriculum.

Optimization (Self-Paced MKL)
• Update h_i* = argmax_h w^T Φ(x_i, a_i, h)
• Update w and the kernel coefficients c by solving a convex problem, with v_i ∈ {0, 1}:
   min ||w||² + C Σ_i ξ_i v_i − λ Σ_i v_i
   s.t. w^T Φ(x_i, a_i, h_i*) − w^T Φ(x_i, a, h) ≥ Δ(a_i, a) − ξ_i for all a, h
   where K_ij = Φ(x_i, a_i, h_i)^T Φ(x_j, a_j, h_j) and K = Σ_k c_k K_k
• Anneal λ ← λμ and repeat until convergence

Image Classification
Mammals dataset: 271 images, 6 classes, 90/10 train/test split, 5 folds.
Kumar, Packer and Koller, In Preparation
[Figure: objective value and test error, FIXED vs. SPMKL. HOG-based model (Dalal and Triggs, 2005).]

Motif Finding
UniProbe dataset: ~40,000 sequences, Binding vs. Not-Binding, 50/50 train/test split, 5 folds.
Kumar, Packer and Koller, In Preparation
[Figure: objective value and test error, FIXED vs. SPMKL. Motif + Markov background model (Yu and Joachims, 2009).]

MAP Inference
Pr(a, h | x) = exp(w^T Φ(x, a, h)) / Z(x)
MAP picks min_{a,h} −log Pr(a, h | x). But which value of the latent variable should we report? Toy example: Pr(a1, h | x) spreads its mass evenly over three latent locations (0.25 each, 0.75 in total), while Pr(a2, h | x) concentrates on two (0.24 and 0.01, 0.25 in total). Joint MAP must break the three-way tie for a1 arbitrarily, and it compares single (a, h) pairs (0.25 vs. 0.24) rather than whole distributions over h.

Min-Entropy Inference
min_a { −log Pr(a | x) + H_α(Pr(h | a, x)) }
where H_α is the Rényi entropy. Equivalently, writing Q(a; x, w) for the generalized (unnormalized) distribution {Pr(a, h | x) over all h}:
min_a H_α(Q(a; x, w))

Max-Margin Min-Entropy Models
Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012
min ||w||² + C Σ_i ξ_i
s.t. H_α(Q(a; x_i, w)) − H_α(Q(a_i; x_i, w)) ≥ Δ(a_i, a) − ξ_i, ξ_i ≥ 0, for all a
Like latent SVM, this minimizes an upper bound on the loss Δ(a_i, a_i(w)). In fact, when α = ∞ the constraints reduce to
max_h w^T Φ(x_i, a_i, h) − max_h w^T Φ(x_i, a, h) ≥ Δ(a_i, a) − ξ_i
and the model recovers latent SVM exactly.
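To make the inference rule concrete, here is a small sketch in plain NumPy. It computes the Rényi entropy of a generalized distribution, H_α(Q) = (1/(1−α)) log(Σ_h q_h^α / Σ_h q_h), and runs min-entropy inference on the toy numbers above for several α; the α = 1 and α = ∞ branches are the standard limits (Shannon-style and MAP, respectively), filled in here as assumptions consistent with that formula.

```python
import numpy as np

def renyi_entropy_generalized(q, alpha):
    """H_alpha of a *generalized* (unnormalized) distribution q:
    H_alpha(Q) = 1/(1-alpha) * log( sum(q**alpha) / sum(q) ).
    alpha = inf gives -log(max(q)) (the joint MAP score);
    alpha -> 1 gives -log Pr(a|x) + Shannon entropy of Pr(h|a,x)."""
    q = np.asarray(q, dtype=float)
    q = q[q > 0]                      # 0 log 0 = 0 convention
    if np.isinf(alpha):
        return -np.log(q.max())
    if alpha == 1.0:
        p = q / q.sum()
        return -np.log(q.sum()) - np.sum(p * np.log(p))
    return np.log(np.sum(q ** alpha) / q.sum()) / (1.0 - alpha)

# Toy joint distribution from the example above: Pr(a1, .) spreads
# 0.75 over three latent locations, Pr(a2, .) concentrates 0.25 on two.
Q = {"a1": [0.25, 0.25, 0.25], "a2": [0.24, 0.01]}

for alpha in [1.0, 2.0, np.inf]:
    pred = min(Q, key=lambda a: renyi_entropy_generalized(Q[a], alpha))
    print(alpha,
          {a: round(float(renyi_entropy_generalized(q, alpha)), 3)
           for a, q in Q.items()},
          "->", pred)
```

On these particular numbers every α shown prefers a1, whose marginal mass dominates; what α controls is how strongly the peakedness of the latent distribution is weighed against the marginal Pr(a | x), with α = ∞ recovering joint MAP inference.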
Image Classification
Mammals dataset: 271 images, 6 classes, 90/10 train/test split, 5 folds.
Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012
[Figures: image-classification results. HOG-based model (Dalal and Triggs, 2005).]

Motif Finding
UniProbe dataset: ~40,000 sequences, Binding vs. Not-Binding, 50/50 train/test split, 5 folds.
Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012
[Figure: motif-finding results. Motif + Markov background model (Yu and Joachims, 2009).]

Discussion

Very Large Datasets
• Initialize parameters using supervised data
• Impute latent variables (inference)
• Select easy samples (very efficient)
• Update parameters using an incremental SVM
• Refine efficiently with proximal regularization

Output Mismatch
When the exact value of the latent variable is important, model its uncertainty explicitly. Introduce a distribution Pr_θ(h | a, x) over latent completions of the annotation (for example, over the competing interpretations (a1, h) and (a2, h)) and minimize, jointly over w and θ,
Σ_h Pr_θ(h | a, x) Δ(a, h, a(w), h(w)) + A(θ)
a criterion based on C. R. Rao's relative quadratic entropy. Optimization alternates between the two blocks: minimize over w with the latent distribution fixed, then minimize over θ with the predictor fixed. A schematic sketch of this alternation follows.
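The sketch below runs the alternation on a toy discrete problem. The particular loss Δ, the negative-entropy stand-in for A(θ) (which yields a closed-form softmin θ-step), and the enumeration-based w-step are all illustrative assumptions; only the form of the objective Σ_h Pr_θ(h|a,x) Δ(a, h, a(w), h(w)) + A(θ) comes from the slides.

```python
import numpy as np

H = np.arange(3)              # latent completions of the annotation
A_SET = ["cow", "bird"]       # output space
a_true = "cow"

def delta(a, h, a_w, h_w):
    # Latent-dependent loss (illustrative): a wrong output costs 1;
    # the right output with a wrong latent completion costs 0.5.
    return 1.0 if a != a_w else 0.5 * float(h != h_w)

def theta_step(a_w, h_w, tau=0.1):
    # With A(theta) = tau * sum_h p_h log p_h (an assumed negative-
    # entropy regularizer), the minimizer is a softmin of the loss.
    c = np.array([delta(a_true, h, a_w, h_w) for h in H])
    p = np.exp(-c / tau)
    return p / p.sum()

def w_step(p):
    # Stand-in for the w-update: pick the prediction minimizing the
    # expected loss under the current Pr_theta(h | a, x).
    cands = [(a, h) for a in A_SET for h in H]
    return min(cands, key=lambda ah:
               sum(p[h] * delta(a_true, h, *ah) for h in H))

p = np.full(len(H), 1.0 / len(H))   # uniform initialization over h
for _ in range(5):                   # block coordinate descent
    a_w, h_w = w_step(p)
    p = theta_step(a_w, h_w)
print(a_w, h_w, np.round(p, 3))
```

In the actual models A(θ) is derived from Rao's relative quadratic entropy and the w-step trains a structured predictor; the toy version only illustrates how the two blocks pull the latent distribution and the prediction toward agreement.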