Differentiable Sparse Coding
David Bradley and J. Andrew Bagnell
NIPS 2008
100,000 ft View
• Complex Systems
Cameras
Voxel
Features
Voxel
Classifier
Voxel
Grid
Initialize with “cheap” data
Ladar
Column
Cost
2-D
Y
Planner (Path to goal)
Joint Optimization
2
10,000 ft View
• Sparse coding = generative model
[Diagram: unlabeled data x → optimization → latent variable w → classifier, with the classifier's loss gradient flowing back into the optimization.]
• Semi-supervised learning
• KL-divergence regularization
• Implicit differentiation
Sparse Coding
Understand X
As a combination of factors
[Figure: X expressed as a combination of basis vectors, the columns of B.]
Sparse coding uses optimization
Reconstruction (loss function): $x \approx f(Bw)$
Projection (feed-forward): $w = f(x^{\top}B)$
We want to use the vector $w$ to classify $x$.
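To make the contrast concrete, here is a minimal numpy sketch (illustrative only; the random basis B, the step size, and taking f to be the identity are assumptions, not the talk's setup) comparing the single matrix-multiply projection with an iterative optimization of the reconstruction loss:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 128                       # input dimension, number of basis vectors
B = rng.standard_normal((d, k))      # toy basis with unit-norm columns (assumption)
B /= np.linalg.norm(B, axis=0)
x = rng.standard_normal(d)

# Projection (feed-forward): a single matrix multiply, w = f(x^T B);
# f is taken to be the identity purely for illustration.
w_proj = x @ B

# Optimization: iteratively minimize the reconstruction loss ||x - Bw||^2.
# (The sparsity-inducing prior is added on later slides.)
w_opt = np.zeros(k)
step = 0.1
for _ in range(200):
    grad = B.T @ (B @ w_opt - x)     # gradient of 0.5 * ||x - Bw||^2
    w_opt -= step * grad

print("projection error:  ", np.linalg.norm(x - B @ w_proj))
print("optimization error:", np.linalg.norm(x - B @ w_opt))
```

In this toy example the optimized code reconstructs x far more accurately than the raw projection, which is why sparse coding poses encoding as an optimization problem rather than a feed-forward map.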
Sparse vectors:
Only a few significant elements
Example: X = Handwritten Digits
[Figure: learned basis vectors, colored by their weights w.]
Optimization vs. Projection
[Figure: input digits, basis, and the codes produced by feed-forward projection.]
Optimization vs. Projection
[Figure: input digits, basis, and the codes produced by KL-regularized optimization; outputs are sparse for each example.]
Generative Model
[Graphical model: each example $x_i$ is generated from its latent variables $w_i$ through the likelihood, with a prior over $w_i$. Latent variables are independent; examples are independent.]
Sparse Approximation
MAP estimate of the latent variables, combining the likelihood and the prior.
Sparse Approximation
$$-\log P(x) \;=\; \mathrm{Loss}\big(x \,\|\, f(Bw)\big) \;+\; \lambda\, \mathrm{Prior}\big(w \,\|\, p\big)$$
The Loss term measures the distance between the reconstruction and the input; the Prior term measures the distance between the weight vector and the prior mean; $\lambda$ is the regularization constant.
Example: Squared Loss + L1
• Convex + sparse (widely studied in engineering); the objective is written out below
• Sparse coding solves for B as well (non-convex for now…)
• Shown to generate useful features on diverse problems
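Written out, that squared-loss + L1 objective takes the familiar lasso form (constant factors and scaling conventions vary across presentations):

$$\hat{w} \;=\; \arg\min_{w}\; \|x - Bw\|_2^2 \;+\; \lambda \|w\|_1$$

The L1 term drives many entries of $\hat{w}$ exactly to zero, which is what makes the code sparse.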
Tropp, Signal Processing, 2006
Donoho and Elad, Proceedings of the National Academy of Sciences, 2002
Raina, Battle, Lee, Packer, Ng, ICML, 2007
L1 Sparse Coding
Optimize B over all examples.
Shown to generate useful features on diverse problems.
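As a hedged illustration in code, scikit-learn's DictionaryLearning alternately optimizes the codes W and the basis B over all examples with an L1 penalty. The random data, dimensions, and parameter values below are stand-ins for illustration, not the talk's experimental setup:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 64))        # 200 unlabeled examples, 64 features each (stand-in data)

coder = DictionaryLearning(
    n_components=32,                      # number of basis vectors (columns of B)
    alpha=1.0,                            # L1 regularization constant (lambda)
    max_iter=20,
    transform_algorithm="lasso_lars",     # L1-regularized sparse approximation
    random_state=0,
)
W = coder.fit_transform(X)                # sparse codes, one row per example
B = coder.components_                     # learned basis, one row per basis vector

print(W.shape, B.shape)                   # (200, 32) and (32, 64)
print("avg. nonzeros per code:", np.count_nonzero(W, axis=1).mean())
```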
Differentiable Sparse Coding
[Architecture diagram: unlabeled data X feeds an optimization module (basis B) that produces codes W and a reconstruction f(BW), scored by a reconstruction loss; this is sparse coding (Raina, Battle, Lee, Packer, Ng, ICML, 2007). For labeled data (X, Y), the codes W also feed a learning module (θ) with its own loss function (Bradley & Bagnell, "Differentiable Sparse Coding", NIPS 2008).]
L1 Regularization is Not Differentiable
[Same architecture diagram: with L1 regularization, the learning module's loss gradient cannot be propagated back through the optimization module (B).]
Why is this unsatisfying?
Problem #1: Instability
• L1 MAP estimates are discontinuous, so the outputs are not differentiable.
• Instead, use KL-divergence regularization, which is proven to compete with L1 in online learning.
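For reference, one standard generalized KL divergence for nonnegative vectors is shown below; the exact normalization used in the paper may differ, so treat this form as an illustrative assumption:

$$\mathrm{KL}(w \,\|\, p) \;=\; \sum_j \Big( w_j \log \frac{w_j}{p_j} \;-\; w_j \;+\; p_j \Big), \qquad w_j \ge 0$$

Unlike the L1 penalty, this regularizer is smooth and strictly convex for $w_j > 0$, so the MAP estimate varies continuously (and differentiably) with the input.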
Problem #2: No Closed-Form Solution
$$\hat{w} \;=\; \arg\min_{w}\; \mathrm{Loss}\big(x \,\|\, f(Bw)\big) \;+\; \lambda\, \mathrm{Prior}\big(w \,\|\, p\big)$$
At the MAP estimate:
$$\nabla_w \mathrm{Loss}\big(x \,\|\, f(B\hat{w})\big) \;+\; \lambda\, \nabla_w \mathrm{Prior}\big(\hat{w} \,\|\, p\big) \;=\; 0$$
Solution: Implicit Differentiation
Differentiate both sides of the optimality condition with respect to an element $B_{ij}$ of the basis:
$$\frac{\partial}{\partial B_{ij}} \Big[ \nabla_w \mathrm{Loss}\big(x \,\|\, f(B\hat{w})\big) \;+\; \lambda\, \nabla_w \mathrm{Prior}\big(\hat{w} \,\|\, p\big) \Big] \;=\; 0$$
Since $\hat{w}$ is a function of $B$, apply the chain rule and solve for $\partial \hat{w} / \partial B_{ij}$.
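Spelling that step out: write $g(w, B) = \nabla_w \big[ \mathrm{Loss}(x \,\|\, f(Bw)) + \lambda\, \mathrm{Prior}(w \,\|\, p) \big]$ for the gradient of the MAP objective. The implicit function theorem then gives the derivative of the MAP estimate; this sketch assumes the Hessian is invertible at $\hat{w}$, which is exactly what the smooth KL prior is meant to guarantee:

$$\frac{\partial \hat{w}}{\partial B_{ij}} \;=\; -\,H^{-1} \frac{\partial g}{\partial B_{ij}} \bigg|_{w=\hat{w}}, \qquad H \;=\; \nabla_w^2 \Big[ \mathrm{Loss}\big(x \,\|\, f(Bw)\big) + \lambda\, \mathrm{Prior}\big(w \,\|\, p\big) \Big] \bigg|_{w=\hat{w}}$$

The gradient of the downstream classifier's loss with respect to $B$ then follows by the chain rule through $\partial \hat{w} / \partial B$.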
Example: Squared Loss, KL Prior
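A minimal worked version of this case, taking $f$ to be the identity and using the generalized KL divergence written out earlier (both choices are assumptions made to keep the sketch short):

$$\hat{w} \;=\; \arg\min_{w \ge 0}\; \tfrac{1}{2}\,\|x - Bw\|_2^2 \;+\; \lambda\, \mathrm{KL}(w \,\|\, p)$$
$$\nabla_w \;=\; B^{\top}(Bw - x) \;+\; \lambda \log(w/p), \qquad \nabla_w^2 \;=\; B^{\top}B \;+\; \lambda\, \mathrm{diag}(1/w)$$

with the log and division taken element-wise. The $\lambda\, \mathrm{diag}(1/\hat{w})$ term keeps the Hessian positive definite, so the implicit derivative above is well defined.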
Handwritten Digit Recognition
50,000 digit training set
10,000 digit validation set
10,000 digit test set
Handwritten Digit Recognition
Step #1: Unsupervised sparse coding on the training set (L2 loss and L1 prior).
Raina, Battle, Lee, Packer, Ng, ICML, 2007
Handwritten Digit Recognition
Step #2: Sparse approximation of the labeled digits, then train a maxent classifier (loss function) on the resulting codes.
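A minimal sketch of Steps #1 and #2 (illustrative assumptions throughout: random stand-in data instead of digit images, scikit-learn's L1 coder instead of the talk's KL-regularized coder, and plain multinomial logistic regression as the maxent classifier):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_unlabeled = rng.standard_normal((300, 64))   # stand-in for unlabeled digit images
X_train = rng.standard_normal((100, 64))       # stand-in for labeled digit images
y_train = rng.integers(0, 10, size=100)        # stand-in digit labels 0-9

# Step #1: unsupervised sparse coding learns the basis B from unlabeled data.
coder = DictionaryLearning(n_components=32, alpha=1.0, max_iter=20,
                           transform_algorithm="lasso_lars", random_state=0)
coder.fit(X_unlabeled)

# Step #2: sparse approximation of the labeled data, then a maxent
# (multinomial logistic regression) classifier on the resulting codes.
W_train = coder.transform(X_train)
clf = LogisticRegression(max_iter=1000)
clf.fit(W_train, y_train)
print("training accuracy:", clf.score(W_train, y_train))
```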
Handwritten Digit Recognition
Step #3: Supervised sparse coding; the maxent classifier's loss gradient is backpropagated through the sparse coder to refine the basis.
KL Maintains Sparsity
[Histogram (log scale): weight is concentrated in a few elements.]
KL adds Stability
[Figure: inputs X and the corresponding codes W under L1, KL, and backprop; the KL codes change smoothly as the input changes. Duda, Hart, Stork, Pattern Classification, 2001.]
Performance vs. Prior
[Bar chart: misclassification error (%, lower is better) for L1, KL, and KL + backprop at 1,000, 10,000, and 50,000 training examples.]
Classifier Comparison
[Bar chart: misclassification error (lower is better) for PCA, L1, KL, and KL + backprop features, each paired with maxent, 2-layer NN, SVM (linear), and SVM (RBF) classifiers.]
Comparison to Other Algorithms
[Bar chart: misclassification error (lower is better) at 50,000 training examples for L1, KL, KL + backprop, SVM, and 2-layer NN.]
Transfer to English Characters
24,000 character training set
12,000 character validation set
12,000 character test set
Transfer to English Characters
Step #1: Sparse approximation of the characters using the basis learned on digits, then train a maxent classifier (loss function) on the codes.
Raina, Battle, Lee, Packer, Ng, ICML, 2007
Transfer to English Characters
Step #2: Supervised sparse coding; the maxent classifier's loss gradient is backpropagated to adapt the basis.
Transfer to English Characters
[Plot: classification accuracy (higher is better) vs. training set size (500, 5,000, 20,000) for raw pixels, PCA, L1, KL, and KL + backprop features.]
Text Application
Step #1: Unsupervised sparse coding (KL loss + KL prior) on X = 5,000 movie reviews, each rated on a 10-point sentiment scale (1 = hated it, 10 = loved it).
Pang, Lee, Proceedings of the ACL, 2005
Text Application
Step #2: Linear regression (L2 loss) from the sparse approximations to the 10-point sentiment score, evaluated with 5-fold cross-validation.
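A matching sketch of Step #2 for the sentiment task (again with random stand-in codes and scores; Ridge regression here stands in for the talk's L2-loss linear regression):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
W = rng.random((500, 32))                          # stand-in sparse codes, one row per review
y = rng.integers(1, 11, size=500).astype(float)    # stand-in 1-10 sentiment scores

# 5-fold cross-validated linear regression from codes to sentiment,
# scored with predictive R^2 as in the talk's results plot.
scores = cross_val_score(Ridge(alpha=1.0), W, y, cv=5, scoring="r2")
print("predictive R^2 per fold:", np.round(scores, 3))
```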
Text Application
Step #3: Supervised sparse coding; the regression loss (L2, with 5-fold cross-validation) is backpropagated to adapt the basis for predicting the 10-point sentiment score.
Movie Review Sentiment
[Bar chart: predictive R² (higher is better). LDA and sLDA are state-of-the-art graphical models (Blei, McAuliffe, NIPS, 2007); KL uses an unsupervised basis, and KL + backprop uses a supervised basis.]
Future Work
[System diagram: RGB camera, NIR camera, and laser (ladar) sensors provide engineered features and sparse coding inputs to a voxel classifier; labeled training data and example paths (via MMP) supervise the system.]
Future Work: Convex Sparse Coding
• Sparse approximation is convex
• Sparse coding is not, because a fixed-size basis is a non-convex constraint
• Sparse coding ↔ sparse approximation on an infinitely large basis + a non-convex rank constraint
  – Relax to a convex L1 rank constraint
• Use boosting for sparse approximation directly on the infinitely large basis
Bengio, Le Roux, Vincent, Delalleau, Marcotte, NIPS, 2005
Zhao, Yu, Feature Selection for Data Mining, 2005
Rifkin, Lippert, Journal of Machine Learning Research, 2007
Questions?