Curriculum Learning for Latent Structural SVM (under submission)
M. Pawan Kumar, Benjamin Packer, Daphne Koller

Aim
To learn accurate parameters for a latent structural SVM.
• Input x (e.g., an image)
• Output y ∈ Y, e.g., y = "Deer" with
  Y = {"Bison", "Deer", "Elephant", "Giraffe", "Llama", "Rhino"}
• Hidden variable h ∈ H (not annotated in the training data)
• Joint feature vector Φ(x, y, h) (e.g., HOG, BoW)
• Parameters w
Prediction: (y*, h*) = argmax_{y ∈ Y, h ∈ H} w^T Φ(x, y, h)

Motivation
Teach a student real numbers and imaginary numbers (e^{iπ} + 1 = 0) all at once and the verdict is "Math is for losers!!": FAILURE, a bad local minimum. Teach real numbers first and imaginary numbers afterwards and the verdict is "Euler was a genius!!": SUCCESS, a good local minimum.

• Start with "easy" examples, then consider "hard" ones (curriculum learning; Bengio et al., ICML 2009).
• Simultaneously estimate easiness and parameters.
• Easiness is a property of data sets, not single instances.

Easy vs. Hard
Having humans mark the easy examples is expensive, and what is easy for a human is not necessarily easy for the machine.

Outline
• Latent Structural SVM
• Concave-Convex Procedure
• Curriculum Learning
• Experiments

Latent Structural SVM
(Felzenszwalb et al., 2008; Yu and Joachims, 2009)
• Training samples x_i with ground-truth labels y_i
• Loss function Δ(y_i, y_i(w), h_i(w)), where
  (y_i(w), h_i(w)) = argmax_{y ∈ Y, h ∈ H} w^T Φ(x_i, y, h)

Learning objective:
  min_w ||w||² + C Σ_i Δ(y_i, y_i(w), h_i(w))
The objective is non-convex, so we minimize an upper bound:
  min_{w,ξ} ||w||² + C Σ_i ξ_i
  s.t. max_h w^T Φ(x_i, y_i, h) − w^T Φ(x_i, y, h) ≥ Δ(y_i, y, h) − ξ_i   for all y ∈ Y, h ∈ H
This is still non-convex, but it is a difference of convex functions, so the CCCP algorithm converges to a local minimum.

Outline
• Latent Structural SVM
• Concave-Convex Procedure
• Curriculum Learning
• Experiments

Concave-Convex Procedure
Start with an initial estimate w_0, then repeat:
• Update h_i* = argmax_{h ∈ H} w_t^T Φ(x_i, y_i, h)
• Update w_{t+1} by solving the convex problem
  min_{w,ξ} ||w||² + C Σ_i ξ_i
  s.t. w^T Φ(x_i, y_i, h_i*) − w^T Φ(x_i, y, h) ≥ Δ(y_i, y, h) − ξ_i

CCCP looks at all samples simultaneously, so "hard" samples cause confusion. Instead: start with "easy" samples, then consider "hard" ones.

Outline
• Latent Structural SVM
• Concave-Convex Procedure
• Curriculum Learning
• Experiments

Curriculum Learning
REMINDER: simultaneously estimate easiness and parameters; easiness is a property of data sets, not single instances.

The starting point is the CCCP w-update above:
  min_{w,ξ} ||w||² + C Σ_i ξ_i
  s.t. w^T Φ(x_i, y_i, h_i*) − w^T Φ(x_i, y, h) ≥ Δ(y_i, y, h) − ξ_i

Introduce indicators v_i ∈ {0,1} that select which samples to learn from:
  min ||w||² + C Σ_i v_i ξ_i
Subject to the same constraints, this has a trivial solution: set every v_i = 0.

So add a reward for selecting samples:
  min ||w||² + C Σ_i v_i ξ_i − Σ_i v_i / K
For fixed w, the optimum is v_i = 1 iff C ξ_i < 1/K: a large K admits only the easiest samples (smallest slack), a medium K admits more, and a small K admits them all.

Relaxing to v_i ∈ [0,1] makes the problem biconvex: convex in w for fixed v, and convex in v for fixed w.

Curriculum learning algorithm. Start with an initial estimate w_0, then repeat:
• Update h_i* = argmax_{h ∈ H} w_t^T Φ(x_i, y_i, h)
• Update w_{t+1} by solving the biconvex problem
  min_{w,v,ξ} ||w||² + C Σ_i v_i ξ_i − Σ_i v_i / K
  s.t. w^T Φ(x_i, y_i, h_i*) − w^T Φ(x_i, y, h) ≥ Δ(y_i, y, h) − ξ_i
• Decrease K ← K/μ (with μ > 1), so that harder samples enter in later iterations.
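Before turning to the experiments, the full procedure is compact enough to sketch in code. The following is a minimal toy sketch of the loop above, not the authors' implementation: the joint feature map `phi`, the problem sizes, the 0/1 loss, and the use of a few subgradient steps in place of the structural-SVM QP are all illustrative assumptions, and it performs a single v-then-w pass per outer iteration where the biconvex problem would properly be alternated to convergence.

```python
import numpy as np

NUM_Y, NUM_H, DIM = 3, 4, 5  # toy label, hidden, and input sizes (assumed)

def phi(x, y, h):
    """Toy joint feature map: copy x into the block indexed by (y, h)."""
    f = np.zeros(NUM_Y * NUM_H * DIM)
    start = (y * NUM_H + h) * DIM
    f[start:start + DIM] = x
    return f

def delta(y_true, y_pred, h_pred):
    """0/1 loss on the label, as in the detection and digit experiments."""
    return 0.0 if y_pred == y_true else 1.0

def impute_h(w, x, y):
    """h_i* = argmax_{h in H} w^T phi(x_i, y_i, h)."""
    return max(range(NUM_H), key=lambda h: w @ phi(x, y, h))

def loss_augmented(w, x, y_true):
    """argmax_{y,h} of w^T phi(x, y, h) + delta(y_true, y, h)."""
    return max(((y, h) for y in range(NUM_Y) for h in range(NUM_H)),
               key=lambda yh: w @ phi(x, *yh) + delta(y_true, *yh))

def slack(w, x, y, h_imp):
    """xi_i, the hinge slack for a sample given its imputed h_i*."""
    y_hat, h_hat = loss_augmented(w, x, y)
    return max(0.0, delta(y, y_hat, h_hat)
               + w @ phi(x, y_hat, h_hat) - w @ phi(x, y, h_imp))

def curriculum_train(data, C=1.0, K=2.0, mu=1.3,
                     outer_iters=10, inner_iters=100, lr=0.01):
    w = np.zeros(NUM_Y * NUM_H * DIM)
    for _ in range(outer_iters):
        # 1. Impute the hidden variables with the current parameters.
        hs = [impute_h(w, x, y) for x, y in data]
        # 2. Select the easy samples: minimising C*v_i*xi_i - v_i/K over
        #    v_i in {0, 1} gives v_i = 1 iff C * xi_i < 1/K.
        v = [1.0 if C * slack(w, x, y, h) < 1.0 / K else 0.0
             for (x, y), h in zip(data, hs)]
        # 3. Update w on the selected samples: subgradient steps on
        #    ||w||^2 + C sum_i v_i xi_i (standing in for the QP).
        for _ in range(inner_iters):
            grad = 2.0 * w
            for (x, y), h, vi in zip(data, hs, v):
                if vi == 0.0 or slack(w, x, y, h) <= 0.0:
                    continue
                y_hat, h_hat = loss_augmented(w, x, y)
                grad += C * (phi(x, y_hat, h_hat) - phi(x, y, h))
            w -= lr * grad
        # 4. Anneal: K <- K / mu, so 1/K grows and harder samples are
        #    admitted in later iterations.
        K /= mu
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = [(rng.normal(size=DIM), int(rng.integers(NUM_Y)))
            for _ in range(20)]
    print("||w|| =", np.linalg.norm(curriculum_train(data)))
```

A convenient sanity check on such a sketch: taking K very small from the start makes every v_i = 1, which reduces the loop to plain CCCP over all samples.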
Outline
• Latent Structural SVM
• Concave-Convex Procedure
• Curriculum Learning
• Experiments

Object Detection
• Input x: image
• Output y ∈ Y = {"Bison", "Deer", "Elephant", "Giraffe", "Llama", "Rhino"}
• Latent h: bounding box
• Loss: 0/1
• Feature Φ(x, y, h): HOG

Mammals dataset: 271 images, 6 classes, 90/10 train/test split, 5 folds.

[Qualitative results: bounding boxes imputed by CCCP vs. Curriculum on example images.]
[Bar charts: objective value and test error on folds 1–5, CCCP vs. Curriculum.]

Handwritten Digit Recognition
• Input x: image (MNIST dataset)
• Output y ∈ Y = {0, 1, …, 9}
• Latent h: rotation
• Loss: 0/1
• Feature Φ(x, y, h): PCA + projection

[Result plots: CCCP vs. Curriculum, with statistically significant differences marked.]

Motif Finding
• Input x: DNA sequence
• Output y ∈ Y = {0, 1}
• Latent h: motif location
• Loss: 0/1
• Feature Φ(x, y, h): as in Yu and Joachims, ICML 2009

UniProbe dataset: 40,000 sequences, 50/50 train/test split, 5 folds.
Evaluation: average Hamming distance of the inferred motifs.

[Bar charts: objective value and test error on folds 1–5, CCCP vs. Curriculum.]

Noun Phrase Coreference
• Input x: nouns
• Output y: clustering of the nouns
• Latent h: spanning forest over the nouns
• Feature Φ(x, y, h): as in Ng and Cardie, ACL 2002

MUC6 dataset: 60 documents, 50/50 train/test split, 1 predefined fold.

[Results: MITRE loss and pairwise loss, with statistically significant improvements and decrements marked.]

Summary
• Automatic curriculum learning
• Concave-biconvex procedure
• Generalization to other latent-variable models
  – Expectation-Maximization: the E-step remains the same; the M-step includes the indicator variables v_i (a sketch follows).
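To make that last bullet concrete, here is a hedged sketch of self-paced EM on a toy two-component 1-D Gaussian mixture. The mixture model, the selection rule (per-point negative log-likelihood thresholded against 1/K), the easiest-10% safeguard, and the annealing constants are illustrative assumptions rather than the paper's specification; the point is only that the E-step is untouched while the M-step accumulates its sufficient statistics over the selected points (v_i = 1) alone.

```python
import numpy as np

def gauss_pdf(x, m, s):
    """Density of N(m, s^2) at the points x."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def self_paced_em(x, K=1.0, anneal=1.3, iters=25):
    """Self-paced EM for a toy two-component 1-D Gaussian mixture."""
    m = np.quantile(x, [0.25, 0.75])          # crude initialisation (assumed)
    s = np.array([x.std(), x.std()]) + 1e-3
    p = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: unchanged from standard EM.
        joint = p * np.stack([gauss_pdf(x, mk, sk)
                              for mk, sk in zip(m, s)], axis=1)
        marginal = joint.sum(axis=1) + 1e-12
        resp = joint / marginal[:, None]
        # Select the "easy" points: those the current model already explains
        # well, mirroring the indicator variables v_i of the main algorithm.
        nll = -np.log(marginal)
        v = nll < 1.0 / K
        if v.sum() < max(1, len(x) // 10):    # safeguard: keep easiest 10%
            v = nll <= np.quantile(nll, 0.1)
        # M-step: the usual sufficient statistics, accumulated only over
        # the selected points (v_i = 1).
        w = resp * v[:, None]
        Nk = w.sum(axis=0) + 1e-12
        p = Nk / Nk.sum()
        m = (w * x[:, None]).sum(axis=0) / Nk
        s = np.sqrt((w * (x[:, None] - m) ** 2).sum(axis=0) / Nk) + 1e-3
        K /= anneal                            # admit harder points later
    return p, m, s

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(-2.0, 0.5, 100),
                        rng.normal(3.0, 1.0, 100)])
    print(self_paced_em(x))
```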