Differentiable Sparse Coding
David Bradley and J. Andrew Bagnell
NIPS 2008

100,000 ft View
• Complex systems: cameras and ladar feed voxel features into a voxel classifier; the voxel grid is collapsed into a 2-D column cost map used by a planner to find a path to the goal.
• Initialize each module with "cheap" data, then jointly optimize the whole system.

10,000 ft View
[Diagram: unlabeled data → optimization over a latent variable w for each input x → classifier; the classifier's loss gradient flows back through the optimization.]
• Sparse coding as a generative model
• Semi-supervised learning
• KL-divergence regularization
• Implicit differentiation

Sparse Coding
Understand an input x ...

... as a combination of basis factors B.

Sparse Coding Uses Optimization
Optimization: find the weights w that minimize a reconstruction loss between x and f(Bw).
Projection (feed-forward): w = f(x^T B).
Either way, w is a vector we want to use to classify x.

Sparse Vectors
Only a few significant elements.

Example: X = Handwritten Digits
[Figure: a digit reconstructed from basis vectors, each colored by its weight in w.]

Optimization vs. Projection
[Figure: input digits, the basis, and the weight vectors produced by feed-forward projection.]

Optimization vs. Projection
[Figure: the same inputs encoded by KL-regularized optimization.]
Outputs are sparse for each example.

Generative Model
Examples are independent; latent variables are independent.
[Graphical model: a prior over each latent variable w_i and a likelihood of each example x_i given w_i and the basis.]

Sparse Approximation
MAP estimate of the latent variables under the likelihood and the prior.

Sparse Approximation
w_hat = argmin_w Loss(x || f(Bw)) + λ · Prior(w || p)
Loss: distance between the reconstruction and the input.
Prior: distance between the weight vector and the prior mean.
λ: regularization constant.

Example: Squared Loss + L1
w_hat = argmin_w ||x - Bw||^2 + λ ||w||_1
• Convex + sparse (widely studied in engineering)
• Sparse coding solves for B as well (non-convex, for now…)
• Shown to generate useful features on diverse problems
Tropp, Signal Processing, 2006
Donoho and Elad, Proceedings of the National Academy of Sciences, 2002
Raina, Battle, Lee, Packer, Ng, ICML, 2007

L1 Sparse Coding
Optimize B over all examples:
min_{B,W} Σ_i ||x_i - B w_i||^2 + λ ||w_i||_1
Shown to generate useful features on diverse problems.

Differentiable Sparse Coding
[Diagram: unlabeled data X → optimization module (B), with a reconstruction loss between f(BW) and X → codes W → learning module (θ), with its own loss function on labeled data (X, Y).]
Raina, Battle, Lee, Packer, Ng, ICML, 2007
Bradley & Bagnell, "Differentiable Sparse Coding", NIPS 2008

L1 Regularization is Not Differentiable
[Same diagram: the learning module's loss gradient must pass back through the optimization module (B), but the L1-regularized optimization is not differentiable.]

Why is this unsatisfying?

Problem #1: Instability
L1 MAP estimates are discontinuous, so the outputs are not differentiable.
Instead use KL-divergence regularization, proven to compete with L1 in online learning.

Problem #2: No Closed-Form Equation
w_hat = argmin_w Loss(x || f(Bw)) + Prior(w || p)
At the MAP estimate:
∇_w Loss(x || f(B w_hat)) + ∇_w Prior(w_hat || p) = 0

Solution: Implicit Differentiation
Differentiate both sides with respect to an element of B:
∂/∂B_ij [ ∇_w Loss(x || f(B w_hat)) + ∇_w Prior(w_hat || p) ] = 0
Since w_hat is a function of B, expand with the chain rule and solve for ∂w_hat/∂B_ij.

Example: Squared Loss, KL Prior
w_hat = argmin_{w > 0} ||x - Bw||^2 + λ KL(w || p)

Handwritten Digit Recognition
50,000-digit training set, 10,000-digit validation set, 10,000-digit test set.

Handwritten Digit Recognition
Step #1: Unsupervised sparse coding on the training set (L2 loss, L1 prior).
Raina, Battle, Lee, Packer, Ng, ICML, 2007

Handwritten Digit Recognition
Step #2: Train a maxent classifier (loss function) on the sparse approximations.

Handwritten Digit Recognition
Step #3: Supervised sparse coding: optimize the basis using the maxent classifier's loss.

KL Maintains Sparsity
[Figure, log scale: weight is concentrated in a few elements.]

KL Adds Stability
[Figure: weight vectors W for perturbed inputs X under L1, KL, and KL + backprop.]
Duda, Hart, Stork. Pattern Classification, 2001.
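The two pieces above, the KL-regularized sparse approximation and the implicit-differentiation gradient, fit in a short NumPy/SciPy sketch. This is a minimal illustration under assumptions rather than the authors' reference implementation: the generalized KL divergence KL(w || p) = Σ_j w_j log(w_j/p_j) - w_j + p_j, the L-BFGS-B inner solver, and the function names sparse_approx_kl and grad_wrt_basis are all choices made here for clarity.

```python
# Sketch only: squared reconstruction loss + KL prior, with the gradient of a
# downstream loss w.r.t. the basis B obtained by implicit differentiation of
# the optimality condition (assumed formulation, not the authors' code).
import numpy as np
from scipy.optimize import minimize


def sparse_approx_kl(x, B, lam=0.1, p=None):
    """MAP weights: argmin_{w>0} ||x - Bw||^2 + lam * KL(w || p),
    with KL(w || p) = sum_j w_j*log(w_j/p_j) - w_j + p_j (generalized KL)."""
    k = B.shape[1]
    p = np.full(k, 1.0 / k) if p is None else p

    def objective(w):
        r = x - B @ w
        value = r @ r + lam * np.sum(w * np.log(w / p) - w + p)
        grad = -2.0 * B.T @ r + lam * np.log(w / p)   # optimality: grad = 0
        return value, grad

    # The KL term keeps the optimum strictly positive; the bound is only
    # numerical safety for the solver.
    res = minimize(objective, p.copy(), jac=True, method="L-BFGS-B",
                   bounds=[(1e-8, None)] * k)
    return res.x


def grad_wrt_basis(x, B, w_hat, dL_dw, lam=0.1):
    """Implicit differentiation at the optimum.

    With g(w, B) = -2 B^T (x - Bw) + lam*log(w/p) = 0 at w_hat, the implicit
    function theorem gives dw_hat/dB = -H^{-1} dg/dB, where
    H = 2 B^T B + lam*diag(1/w_hat).  For a downstream loss L(w_hat) this
    collapses to dL/dB = 2 (r u^T - (B u) w_hat^T), with r = x - B w_hat
    and u the solution of H u = dL/dw_hat.
    """
    H = 2.0 * B.T @ B + lam * np.diag(1.0 / w_hat)
    u = np.linalg.solve(H, dL_dw)
    r = x - B @ w_hat
    return 2.0 * (np.outer(r, u) - np.outer(B @ u, w_hat))
```

In supervised sparse coding this per-example gradient would be accumulated over the labeled set and used to update B with the classifier's loss, which is what the "KL + backprop" results below refer to.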
Performance vs. Prior
[Bar chart: misclassification error (%, lower is better, 0-9%) for L1, KL, and KL + backprop at 1,000, 10,000, and 50,000 training examples.]

Classifier Comparison
[Bar chart: misclassification error (lower is better, 0-8%) for PCA, L1, KL, and KL + backprop features across maxent, 2-layer NN, linear SVM, and RBF SVM classifiers.]

Comparison to Other Algorithms
[Bar chart at 50,000 training examples: misclassification error (lower is better, 0-4%) for L1, KL, KL + backprop, SVM, and 2-layer NN.]

Transfer to English Characters
24,000-character training set, 12,000-character validation set, 12,000-character test set.

Transfer to English Characters
Step #1: Train a maxent classifier (loss function) on sparse approximations computed with the digits basis.
Raina, Battle, Lee, Packer, Ng, ICML, 2007

Transfer to English Characters
Step #2: Supervised sparse coding: optimize the basis using the maxent classifier's loss.

Transfer to English Characters
[Chart: classification accuracy (higher is better, 40-100%) vs. training set size (500, 5,000, 20,000) for raw pixels, PCA, L1, KL, and KL + backprop.]

Text Application
Step #1: Unsupervised sparse coding (KL loss + KL prior) on X = 5,000 movie reviews, rated on a 10-point sentiment scale (1 = hated it, 10 = loved it).
Pang, Lee, Proceedings of the ACL, 2005

Text Application
Step #2: Linear regression (L2 loss, 5-fold cross-validation) on the sparse approximations.

Text Application
Step #3: Supervised sparse coding: optimize the basis with the regression loss (L2 loss, 5-fold cross-validation).

Movie Review Sentiment
[Bar chart: predictive R^2 (higher is better, 0.00-0.60) for LDA and sLDA (a state-of-the-art graphical model) vs. KL (unsupervised basis) and KL + backprop (supervised basis).]
Blei, McAuliffe, NIPS, 2007

Future Work
[Diagram: RGB camera, NIR camera, and laser → engineered features and sparse coding → voxel classifier, trained with MMP from example paths and labeled training data.]

Future Work: Convex Sparse Coding
• Sparse approximation is convex.
• Sparse coding is not, because a fixed-size basis is a non-convex constraint.
• Sparse coding ↔ sparse approximation on an infinitely large basis plus a non-convex rank constraint; relax to a convex L1 rank constraint.
• Use boosting for sparse approximation directly on the infinitely large basis.
Bengio, Le Roux, Vincent, Delalleau, Marcotte, NIPS, 2005
Zhao, Yu. Feature Selection for Data Mining, 2005
Rifkin, Lippert. Journal of Machine Learning Research, 2007

Questions?
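For completeness, the two-stage pipeline shared by the digit, character, and sentiment experiments (learn a basis without labels, then train a simple supervised model on the sparse codes) can be sketched as follows. This is an illustration under assumptions, not the authors' code: the ISTA solver for the L1-regularized step, the regularized least-squares basis update, the hyperparameter values, and the use of scikit-learn's LogisticRegression to stand in for the maxent classifier are all choices made here.

```python
# Sketch only: the "Step #1 / Step #2" pipeline from the experiments above,
# using L2 loss + L1 prior for the unsupervised stage (assumed formulation).
import numpy as np


def l1_sparse_approx(X, B, lam=0.1, iters=200):
    """Each column of W solves  min_w 0.5*||x - Bw||^2 + lam*||w||_1  via ISTA."""
    step = 1.0 / (np.linalg.norm(B, 2) ** 2 + 1e-12)   # 1 / Lipschitz constant
    W = np.zeros((B.shape[1], X.shape[1]))
    for _ in range(iters):
        W = W - step * (B.T @ (B @ W - X))              # gradient step on the L2 term
        W = np.sign(W) * np.maximum(np.abs(W) - step * lam, 0.0)  # soft-threshold
    return W


def learn_basis(X, k=64, lam=0.1, outer_iters=20, seed=0):
    """Unsupervised sparse coding: alternate sparse approximation with a
    regularized least-squares update of B, keeping its columns unit-norm."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((X.shape[0], k))
    B /= np.linalg.norm(B, axis=0, keepdims=True)
    for _ in range(outer_iters):
        W = l1_sparse_approx(X, B, lam)
        B = X @ W.T @ np.linalg.pinv(W @ W.T + 1e-6 * np.eye(k))
        B /= np.linalg.norm(B, axis=0, keepdims=True) + 1e-12
    return B


# Usage sketch (shapes only; the experiments above used handwritten digits,
# English characters, and movie-review word counts):
#   X_unlab: (n_features, n_unlabeled), X_lab: (n_features, n_labeled), y: labels
#   from sklearn.linear_model import LogisticRegression
#   B = learn_basis(X_unlab)
#   codes = l1_sparse_approx(X_lab, B).T
#   clf = LogisticRegression(max_iter=1000).fit(codes, y)   # "maxent" classifier
```

The supervised "Step #3" then replaces the fixed basis with one updated through the classifier's loss, using the implicit-differentiation gradient sketched earlier in the deck.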