Curriculum Learning for Latent Structural SVM

Curriculum Learning for Latent Structural SVM
(under submission)
M. Pawan Kumar
Benjamin Packer
Daphne Koller
Aim
To learn accurate parameters for latent structural SVM
Input x
Output y ∈ Y
Hidden variable h ∈ H
“Deer”
Y = {“Bison”, “Deer”, “Elephant”, “Giraffe”, “Llama”, “Rhino”}
Aim
To learn accurate parameters for latent structural SVM
Feature (x,y,h)
(HOG, BoW)
Parameters w
(y*,h*) = maxyY,hH wT(x,y,h)
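As a concrete illustration, here is a minimal sketch of this joint prediction rule in Python. The names `phi` and `latent_space` are hypothetical placeholders; a real detector would replace the exhaustive loop with structured inference.

```python
import numpy as np

def predict(w, x, labels, latent_space, phi):
    """Joint prediction for a latent structural SVM:
    (y*, h*) = argmax over (y, h) of w . phi(x, y, h)."""
    best_score, best_pair = -np.inf, None
    for y in labels:                      # e.g. the six mammal classes
        for h in latent_space(x, y):      # e.g. candidate bounding boxes
            score = w @ phi(x, y, h)      # linear score w^T phi(x, y, h)
            if score > best_score:
                best_score, best_pair = score, (y, h)
    return best_pair
```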
Motivation
Real Numbers
“Math is for losers!!”
Imaginary Numbers
e^{iπ} + 1 = 0
FAILURE … BAD LOCAL MINIMUM
Motivation
Real Numbers
“Euler was a Genius!!”
Imaginary Numbers
e^{iπ} + 1 = 0
SUCCESS … GOOD LOCAL MINIMUM
Curriculum Learning: Bengio et al, ICML 2009
Motivation
Start with “easy” examples, then consider “hard” ones
Simultaneously estimate easiness and parameters
Easiness is a property of data sets, not single instances
Easy vs. Hard
Manually ranking samples by difficulty is expensive
Easy for human ≠ Easy for machine
Outline
• Latent Structural SVM
• Concave-Convex Procedure
• Curriculum Learning
• Experiments
Latent Structural SVM
Felzenszwalb et al, 2008, Yu and Joachims, 2009
Training samples x_i
Ground-truth label y_i
Loss function Δ(y_i, y_i(w), h_i(w))
Latent Structural SVM
(y_i(w), h_i(w)) = argmax_{y ∈ Y, h ∈ H} w^T Φ(x_i, y, h)
min_w ||w||² + C ∑_i Δ(y_i, y_i(w), h_i(w))
Non-convex Objective
Minimize an upper bound
Latent Structural SVM
(y_i(w), h_i(w)) = argmax_{y ∈ Y, h ∈ H} w^T Φ(x_i, y, h)
min_w ||w||² + C ∑_i ξ_i
s.t. max_{h_i ∈ H} w^T Φ(x_i, y_i, h_i) - w^T Φ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i,  ∀ y ∈ Y, h ∈ H
Still non-convex
Difference of convex functions
CCCP Algorithm - converges to a local minimum
Outline
• Latent Structural SVM
• Concave-Convex Procedure
• Curriculum Learning
• Experiments
Concave-Convex Procedure
Start with an initial estimate w_0
Update h_i* = argmax_{h ∈ H} w_t^T Φ(x_i, y_i, h)
Update w_{t+1} by solving a convex problem
min_w ||w||² + C ∑_i ξ_i
s.t. w^T Φ(x_i, y_i, h_i*) - w^T Φ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i,  ∀ y ∈ Y, h ∈ H
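A minimal sketch of this alternation, assuming a hypothetical supervised structural-SVM solver `solve_ssvm` (e.g. cutting planes) and the `phi`/`latent_space` placeholders from above:

```python
def cccp(samples, labels, latent_space, phi, solve_ssvm, w0, n_iters=20):
    """CCCP for a latent structural SVM (a sketch, not the authors' code)."""
    w = w0                                # assumed to be a NumPy vector
    for _ in range(n_iters):
        # Concave step: impute latent variables with the current parameters.
        h_star = [max(latent_space(x, y), key=lambda h: w @ phi(x, y, h))
                  for x, y in zip(samples, labels)]
        # Convex step: solve the resulting fully supervised structural SVM.
        w = solve_ssvm(samples, labels, h_star)
    return w
```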
Concave-Convex Procedure
Looks at all samples simultaneously
“Hard” samples will cause confusion
Start with “easy” samples, then consider “hard” ones
Outline
• Latent Structural SVM
• Concave-Convex Procedure
• Curriculum Learning
• Experiments
Curriculum Learning
REMINDER
Simultaneously estimate easiness and parameters
Easiness is a property of data sets, not single instances
Curriculum Learning
Start with an initial estimate w_0
Update h_i* = argmax_{h ∈ H} w_t^T Φ(x_i, y_i, h)
Update w_{t+1} by solving a convex problem
min_w ||w||² + C ∑_i ξ_i
s.t. w^T Φ(x_i, y_i, h_i*) - w^T Φ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i
Curriculum Learning
min_w ||w||² + C ∑_i ξ_i
s.t. w^T Φ(x_i, y_i, h_i*) - w^T Φ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i
Curriculum Learning
v_i ∈ {0,1}
min_{w,v} ||w||² + C ∑_i v_i ξ_i
s.t. w^T Φ(x_i, y_i, h_i*) - w^T Φ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i
Trivial solution: v_i = 0 for all i
Curriculum Learning
v_i ∈ {0,1}
min_{w,v} ||w||² + C ∑_i v_i ξ_i - ∑_i v_i/K
s.t. w^T Φ(x_i, y_i, h_i*) - w^T Φ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i
Large K: only the easiest samples selected
Medium K: more samples selected
Small K: all samples selected
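For a fixed w, the minimization over the binary v decouples per sample and has a closed form: include sample i exactly when C·ξ_i - 1/K < 0, i.e. when its slack is below the threshold 1/(CK). A small sketch (the helper name `select_easy` is mine):

```python
import numpy as np

def select_easy(xi, C, K):
    """Closed-form v-update for fixed w: v_i = 1 iff xi_i < 1 / (C * K)."""
    return (np.asarray(xi) < 1.0 / (C * K)).astype(int)

# Large K -> tight threshold, few easy samples; small K -> all samples.
print(select_easy([0.02, 0.4, 1.3], C=1.0, K=5.0))   # threshold 0.2 -> [1 0 0]
print(select_easy([0.02, 0.4, 1.3], C=1.0, K=0.5))   # threshold 2.0 -> [1 1 1]
```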
Curriculum Learning
v_i ∈ [0,1]
Biconvex problem
min_{w,v} ||w||² + C ∑_i v_i ξ_i - ∑_i v_i/K
s.t. w^T Φ(x_i, y_i, h_i*) - w^T Φ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i
Large K: only the easiest samples selected
Medium K: more samples selected
Small K: all samples selected
Curriculum Learning
Start with an initial estimate w0
T(x ,y ,h)
h
=
max
w
Update
i
hH t
i i
Update wt+1 by solving a convex problem
min_{w,v} ||w||² + C ∑_i v_i ξ_i - ∑_i v_i/K
s.t. w^T Φ(x_i, y_i, h_i*) - w^T Φ(x_i, y, h) ≥ Δ(y_i, y, h) - ξ_i
Decrease K ← K/μ, with annealing factor μ > 1
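Putting the steps together, a sketch of the whole loop under the same hypothetical helper names as above; `solve_weighted_ssvm` stands in for a convex solver that accepts per-sample weights v_i and returns the new w together with the slacks ξ_i:

```python
import numpy as np

def curriculum_latent_ssvm(samples, labels, latent_space, phi,
                           solve_weighted_ssvm, w0,
                           C=1.0, K=100.0, mu=1.3, n_iters=20):
    """Sketch of the full procedure; not the authors' implementation."""
    w = w0
    for _ in range(n_iters):
        # Impute latent variables with the current parameters.
        h_star = [max(latent_space(x, y), key=lambda h: w @ phi(x, y, h))
                  for x, y in zip(samples, labels)]
        # Alternate between w (convex solver) and v (closed form).
        v = np.ones(len(samples))
        for _ in range(5):
            w, xi = solve_weighted_ssvm(samples, labels, h_star, v, C)
            v = (np.asarray(xi) < 1.0 / (C * K)).astype(float)
        # Anneal K so that harder samples enter in later iterations.
        K /= mu
    return w
```

The value mu = 1.3 here is an illustrative annealing factor, not a figure taken from the slides.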
Outline
• Latent Structural SVM
• Concave-Convex Procedure
• Curriculum Learning
• Experiments
Object Detection
Input x - Image
Output y ∈ Y
Latent h - Box
Loss Δ - 0/1 loss
Y = {“Bison”, “Deer”, “Elephant”, “Giraffe”, “Llama”, “Rhino”}
Feature Φ(x,y,h) - HOG
Object Detection
Mammals Dataset
271 images, 6 classes
90/10 train/test split
5 folds
Object Detection
[Figure: example detections on test images, CCCP vs. Curriculum]
Object Detection
[Figure: objective value and test error per fold (Folds 1-5), CCCP vs. Curriculum]
Handwritten Digit Recognition
Input x - Image
Output y ∈ Y
Latent h - Rotation
Loss Δ - 0/1 loss
Y = {0, 1, …, 9}
MNIST Dataset
Feature Φ(x,y,h) - PCA + Projection
Handwritten Digit Recognition
[Figure: results for different values of C, CCCP vs. Curriculum; markers indicate a significant difference]
Motif Finding
Input x - DNA Sequence
Output y ∈ Y
Y = {0, 1}
Latent h - Motif Location
Loss Δ - 0/1 loss
Feature Φ(x,y,h) - Ng and Cardie, ACL 2002
Motif Finding
UniProbe Dataset
40,000 sequences
50/50 train/test split
5 folds
Motif Finding
Average Hamming Distance of Inferred Motifs
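A guess at how this metric could be computed, assuming paired, equal-length motif strings (the slide does not spell out the definition):

```python
def avg_hamming_distance(inferred, ground_truth):
    """Mean number of per-position mismatches between paired motifs."""
    dists = [sum(a != b for a, b in zip(m1, m2))
             for m1, m2 in zip(inferred, ground_truth)]
    return sum(dists) / len(dists)

# Example: distances 1 and 0 -> average 0.5
print(avg_hamming_distance(["ACGT", "TTGA"], ["ACGA", "TTGA"]))
```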
Motif Finding
[Figure: objective value per fold (Folds 1-5), CCCP vs. Curriculum]
Motif Finding
[Figure: test error per fold (Folds 1-5), CCCP vs. Curriculum]
Noun Phrase Coreference
Input x - Nouns
Output y - Clustering
Latent h - Spanning Forest over Nouns
Feature (x,y,h) - Yu and Joachims, ICML 2009
Noun Phrase Coreference
MUC6 Dataset
60 documents
50/50 train/test split
1 predefined fold
Noun Phrase Coreference
[Figure: MITRE loss and pairwise loss, CCCP vs. Curriculum; markers indicate significant improvement or decrement]
Summary
• Automatic curriculum learning
• Concave-biconvex procedure
• Generalization to other latent variable models
– Expectation-Maximization
– E-step remains the same
– M-step includes indicator variables v_i