Some Useful Machine Learning Tools
M. Pawan Kumar
École Centrale Paris, École des Ponts ParisTech, INRIA Saclay, Île-de-France

Outline
• Part I: Supervised Learning
• Part II: Weakly Supervised Learning

Outline – Part I
• Introduction to Supervised Learning
• Probabilistic Methods
  – Logistic regression
  – Multiclass logistic regression
  – Regularized maximum likelihood
• Loss-based Methods
  – Support vector machine
  – Structured output support vector machine

Image Classification
Is this an urban or rural area? Input: x. Output: y ∈ {-1,+1}
Is this scan healthy or unhealthy? Input: x. Output: y ∈ {-1,+1}
Which city is this? Input: x. Output: y ∈ {1,2,…,C}
What type of tumor does this scan contain? Input: x. Output: y ∈ {1,2,…,C}

Object Detection
Where is the object in the image? Input: x. Output: y ∈ {Pixels}
Where is the rupture in the scan? Input: x. Output: y ∈ {Pixels}

Segmentation
What is the semantic class (sky, tree, car, road, grass, …) of each pixel? Input: x. Output: y ∈ {1,2,…,C}^|Pixels|
What is the muscle group of each pixel? Input: x. Output: y ∈ {1,2,…,C}^|Pixels|

A Simplified View of the Pipeline
Input x → Extract Features Φ(x) [http://deeplearning.net] → Compute Scores f(Φ(x),y) → Prediction y(f) = argmax_y f(Φ(x),y)
The function f is what we learn.

Learning Objective
Data distribution P(x,y); the distribution is unknown.
Measure of prediction quality:
f* = argmin_f E_{P(x,y)}[ Error(y(f), y) ]
The expectation is over the data distribution; Error compares the prediction y(f) with the ground truth y.

Training data {(x_i, y_i), i = 1,2,…,n} gives only finite samples, so the expectation is taken over the empirical distribution instead:
f* = argmin_f Σ_i Error(y_i(f), y_i)

Adding a regularizer:
f* = argmin_f Σ_i Error(y_i(f), y_i) + λ R(f)
where λ is the relative weight (a hyperparameter) and R(f) is the regularizer.

Logistic Regression
Input: x. Features: Φ(x). Output: y ∈ {-1,+1}
f(Φ(x),y) = y θᵀΦ(x)
Prediction: sign(θᵀΦ(x))
P(y|x) = ℓ(f(Φ(x),y)), where ℓ(z) = 1/(1+e^(-z)) is the logistic function.
Is the distribution normalized?

Logistic Regression – Training
Training data {(x_i, y_i), i = 1,2,…,n}
min_θ Σ_i −log P(y_i|x_i) + λ R(θ)
(negative log-likelihood plus regularizer)

With R(θ) = ||θ||²:
min_θ Σ_i −log P(y_i|x_i) + λ ||θ||²
This is a convex optimization problem. Proof left as an exercise (hint: prove that the Hessian H is positive semi-definite, i.e. aᵀHa ≥ 0 for all a).
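To make the objective concrete, here is a minimal numpy sketch (my own addition, not part of the slides) that evaluates the regularized negative log-likelihood and applies the sign prediction rule; the names `Phi`, `nll_objective` and `lam` are illustrative choices.

```python
import numpy as np

def nll_objective(theta, Phi, y, lam):
    """Regularized negative log-likelihood for binary logistic regression.

    Phi : (n, d) feature matrix, one row per example Phi(x_i)
    y   : (n,) labels in {-1, +1}
    lam : regularization weight lambda
    """
    margins = y * (Phi @ theta)                 # y_i * theta^T Phi(x_i)
    # -log P(y_i|x_i) = log(1 + exp(-margin)); logaddexp keeps it stable
    nll = np.logaddexp(0.0, -margins).sum()
    return nll + lam * np.dot(theta, theta)

def predict(theta, Phi):
    """Prediction rule sign(theta^T Phi(x))."""
    return np.where(Phi @ theta >= 0, 1, -1)

# Tiny usage example with random data (illustration only).
rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 5))
y = np.where(Phi[:, 0] > 0, 1, -1)
theta = np.zeros(5)
print(nll_objective(theta, Phi, y, lam=0.1))    # equals n * log(2) at theta = 0
print(predict(theta, Phi)[:5])
```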
Gradient Descent
Training data {(x_i, y_i), i = 1,2,…,n}
min_θ L(θ) = Σ_i −log P(y_i|x_i) + λ ||θ||²
Start with an initial estimate θ_0.
θ_{t+1} ← θ_t − μ (dL/dθ)|_{θ_t}
Repeat until the decrease in the objective is below a threshold.

[Figure: behaviour of gradient descent with a small step size μ versus a large step size μ.]

The step size μ is chosen as a small constant or by line search.

Newton's Method
Minimize g(z). Let z_t be the solution at iteration t and define g_t(Δz) = g(z_t + Δz).
Second-order Taylor series:
g_t(Δz) ≈ g(z_t) + g'(z_t) Δz + ½ g''(z_t) (Δz)²
Setting the derivative with respect to Δz to 0 gives
g'(z_t) + g''(z_t) Δz = 0.
Solving for Δz provides the learning rate.

Newton's Method for the training objective:
θ_{t+1} ← θ_t − μ (dL/dθ)|_{θ_t}, with μ⁻¹ = (d²L/dθ²)|_{θ_t}
Repeat until the decrease in the objective is below a threshold.

Logistic Regression – Multiple Classes via 1-vs-all
Input: x. Features: Φ(x). Output: y ∈ {1,2,…,C}
Train C 1-vs-all binary logistic regression classifiers.
Prediction: the class whose classifier assigns the maximum probability to +1.
This is a simple extension that is easy to code, but it loses the probabilistic interpretation.

Multiclass Logistic Regression
Input: x. Features: Φ(x). Output: y ∈ {1,2,…,C}
Joint feature vector of input and output: Ψ(x,y)
Ψ(x,1) = [Φ(x); 0; 0; …; 0]
Ψ(x,2) = [0; Φ(x); 0; …; 0]
…
Ψ(x,C) = [0; 0; 0; …; Φ(x)]

f(Ψ(x,y)) = θᵀΨ(x,y)
Prediction: argmax_y θᵀΨ(x,y)
P(y|x) = exp(f(Ψ(x,y))) / Z(x), with partition function Z(x) = Σ_y exp(f(Ψ(x,y)))

Training data {(x_i, y_i), i = 1,2,…,n}
min_θ Σ_i −log P(y_i|x_i) + λ ||θ||²
A convex optimization problem, solvable by gradient descent, Newton's method, and many others.

Regularized Maximum Likelihood
Input: x. Features: Φ(x). Output: y ∈ {1,2,…,C}^m
Joint feature vector of input and output Ψ(x,y), for example:
• [Ψ(x,y_1); Ψ(x,y_2); …; Ψ(x,y_m)]
• [Ψ(x,y_i), for all i; Ψ(x,y_i,y_j), for all i, j]
• [Ψ(x,y_i), for all i; Ψ(x,y_c), where c is a subset of variables]

f(Ψ(x,y)) = θᵀΨ(x,y)
Prediction: argmax_y θᵀΨ(x,y)
P(y|x) = exp(f(Ψ(x,y))) / Z(x), with partition function Z(x) = Σ_y exp(f(Ψ(x,y)))

Training data {(x_i, y_i), i = 1,2,…,n}
min_θ Σ_i −log P(y_i|x_i) + λ ||θ||²
The partition function is expensive to compute; use approximate inference (see Nikos Komodakis' tutorial).
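Going back to the multiclass logistic regression above, the following sketch (again my own, not from the slides) builds the block-stacked joint features Ψ(x,y), evaluates P(y|x) through the partition function, and takes plain gradient-descent steps with a constant step size; `make_psi`, `grad_step` and the toy data are illustrative assumptions.

```python
import numpy as np

def make_psi(phi, y, n_classes):
    """Joint feature Psi(x, y): Phi(x) placed in the y-th block, zeros elsewhere."""
    d = phi.shape[0]
    psi = np.zeros(n_classes * d)
    psi[y * d:(y + 1) * d] = phi
    return psi

def grad_step(theta, Phi, y, n_classes, lam, step):
    """One gradient-descent step on  sum_i -log P(y_i|x_i) + lam * ||theta||^2."""
    grad = 2.0 * lam * theta
    for phi_i, y_i in zip(Phi, y):
        scores = np.array([theta @ make_psi(phi_i, c, n_classes)
                           for c in range(n_classes)])
        scores -= scores.max()                      # numerical stability
        p = np.exp(scores) / np.exp(scores).sum()   # P(y|x) = exp(f) / Z(x)
        # d/dtheta of -log P(y_i|x_i) = E_{P(y|x_i)}[Psi(x_i,y)] - Psi(x_i,y_i)
        expected_psi = sum(p[c] * make_psi(phi_i, c, n_classes)
                           for c in range(n_classes))
        grad += expected_psi - make_psi(phi_i, y_i, n_classes)
    return theta - step * grad

# Illustration on random data.
rng = np.random.default_rng(0)
Phi, y, C = rng.normal(size=(30, 4)), rng.integers(0, 3, size=30), 3
theta = np.zeros(C * 4)
for _ in range(50):
    theta = grad_step(theta, Phi, y, C, lam=0.1, step=0.05)
```

In practice one would switch to a stochastic or second-order (Newton) update, exactly as discussed above; the explicit loop here only keeps the correspondence with the slides visible.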
Multiclass SVM
Input: x. Features: Φ(x). Output: y ∈ {1,2,…,C}
Joint feature vector of input and output: Ψ(x,y)
Ψ(x,1) = [Φ(x); 0; 0; …; 0]
Ψ(x,2) = [0; Φ(x); 0; …; 0]
…
Ψ(x,C) = [0; 0; 0; …; Φ(x)]

f(Ψ(x,y)) = wᵀΨ(x,y)
Predicted output: y(w) = argmax_y wᵀΨ(x,y)

Training data {(x_i, y_i), i = 1,2,…,n}
Loss function for the i-th sample: Δ(y_i, y_i(w))
Minimizing the regularized sum of the loss over the training data directly is highly non-convex in w, and the regularization plays no role (overfitting may occur).

Instead, upper bound the loss:
Δ(y_i, y_i(w)) = wᵀΨ(x_i, y_i(w)) + Δ(y_i, y_i(w)) − wᵀΨ(x_i, y_i(w))
              ≤ wᵀΨ(x_i, y_i(w)) + Δ(y_i, y_i(w)) − wᵀΨ(x_i, y_i)
              ≤ max_y { wᵀΨ(x_i, y) + Δ(y_i, y) } − wᵀΨ(x_i, y_i)
The upper bound is sensitive to the regularization of w, and it is convex.

Multiclass SVM – Convex Formulation
min_w ||w||² + C Σ_i ξ_i
s.t. wᵀΨ(x_i, y) + Δ(y_i, y) − wᵀΨ(x_i, y_i) ≤ ξ_i, for all y
A quadratic program with a polynomial number of constraints; specialized software packages are freely available:
http://www.cs.cornell.edu/People/tj/svm_light/svm_multiclass.html

Structured Output SVM
Input: x. Features: Φ(x). Output: y ∈ {1,2,…,C}^m
Joint feature vector of input and output: Ψ(x,y)
f(Ψ(x,y)) = wᵀΨ(x,y)
Prediction: argmax_y wᵀΨ(x,y)

Training data {(x_i, y_i), i = 1,2,…,n}
min_w ||w||² + C Σ_i ξ_i
s.t. wᵀΨ(x_i, y) + Δ(y_i, y) − wᵀΨ(x_i, y_i) ≤ ξ_i, for all y
A quadratic program with an exponential number of constraints; many polynomial-time algorithms exist.

Cutting Plane Algorithm
Define working sets W_i = {}
REPEAT
  Update w by solving the following problem:
    min_w ||w||² + C Σ_i ξ_i
    s.t. wᵀΨ(x_i, y) + Δ(y_i, y) − wᵀΨ(x_i, y_i) ≤ ξ_i, for all y ∈ W_i
  Compute the most violated constraint for all samples:
    ŷ_i = argmax_y wᵀΨ(x_i, y) + Δ(y_i, y)
  Update the working sets W_i by adding ŷ_i

Termination criterion: violation of ŷ_i < ξ_i + ε, for all i.
Number of iterations = max{O(n/ε), O(C/ε²)}.
At each iteration, the convex dual of the problem increases, and the convex dual can be upper bounded.
[Ioannis Tsochantaridis et al., JMLR 2005; http://svmlight.joachims.org/svm_struct.html]
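The key step of the cutting-plane algorithm is finding the most violated constraint. Here is a small sketch (my own, not from the slides) of that step for the multiclass case, using the block-stacked Ψ(x,y) from above and the 0-1 loss as Δ; the function names are illustrative.

```python
import numpy as np

def make_psi(phi, y, n_classes):
    """Stacked joint feature Psi(x, y): Phi(x) in the y-th block, zeros elsewhere."""
    d = phi.shape[0]
    psi = np.zeros(n_classes * d)
    psi[y * d:(y + 1) * d] = phi
    return psi

def most_violated(w, phi_i, y_i, n_classes):
    """y_hat_i = argmax_y  w^T Psi(x_i, y) + Delta(y_i, y),  with 0-1 loss."""
    scores = [w @ make_psi(phi_i, y, n_classes) + (0.0 if y == y_i else 1.0)
              for y in range(n_classes)]
    return int(np.argmax(scores))

def violation(w, phi_i, y_i, y_hat, xi_i, n_classes):
    """Amount by which y_hat violates
       w^T Psi(x_i, y) + Delta(y_i, y) - w^T Psi(x_i, y_i) <= xi_i."""
    delta = 0.0 if y_hat == y_i else 1.0
    lhs = (w @ make_psi(phi_i, y_hat, n_classes) + delta
           - w @ make_psi(phi_i, y_i, n_classes))
    return lhs - xi_i

# In the cutting-plane loop, y_hat is added to the working set W_i whenever
# violation(...) exceeds the tolerance epsilon, and the QP is then re-solved.
```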
Structured Output SVM – Number of Constraints
min_w ||w||² + C Σ_i ξ_i
s.t. wᵀΨ(x_i, y) + Δ(y_i, y) − wᵀΨ(x_i, y_i) ≤ ξ_i, for all y ∈ Y = {1,2,…,C}^m
Number of constraints = nC^m.

Rewriting the constraints with one labelling z_i per sample:
wᵀΨ(x_i, z_i) + Δ(y_i, z_i) − wᵀΨ(x_i, y_i) ≤ ξ_i, for all z_i ∈ Y
Summing over the samples gives an equivalent problem to the structured output SVM:
Σ_i ( wᵀΨ(x_i, z_i) + Δ(y_i, z_i) − wᵀΨ(x_i, y_i) ) ≤ Σ_i ξ_i, for all Z = {z_i, i=1,…,n} ∈ Y^n
Number of constraints = C^(mn).

1-Slack Structured Output SVM
min_w ||w||² + C ξ
s.t. Σ_i ( wᵀΨ(x_i, z_i) + Δ(y_i, z_i) − wᵀΨ(x_i, y_i) ) ≤ ξ, for all Z = {z_i, i=1,…,n} ∈ Y^n

Cutting Plane Algorithm (1-slack)
Define the working set W = {}
REPEAT
  Update w by solving the following problem:
    min_w ||w||² + C ξ
    s.t. Σ_i ( wᵀΨ(x_i, z_i) + Δ(y_i, z_i) − wᵀΨ(x_i, y_i) ) ≤ ξ, for all Z ∈ W
  Compute the most violated constraint for all samples:
    z_i = argmax_y wᵀΨ(x_i, y) + Δ(y_i, y)
  Update the working set W by adding {z_i, i=1,…,n}

Termination criterion: violation of {z_i} < ξ + ε.
Number of iterations = O(C/ε).
At each iteration, the convex dual of the problem increases, and the convex dual can be upper bounded.
[Thorsten Joachims et al., Machine Learning 2009; http://svmlight.joachims.org/svm_struct.html]

Outline – Part II
• Introduction to Weakly Supervised Learning
  – Two types of problems
• Probabilistic Methods
  – Expectation maximization
• Loss-based Methods
  – Latent support vector machine
  – Dissimilarity coefficient learning

Computer Vision Data
[Figure: dataset size (log scale) versus amount of annotation — roughly 2,000 images with segmentations, about 1M with bounding boxes, more than 14M with image-level labels ("Car", "Chair"), and more than 6B with noisy labels.]

Data
• Detailed annotation is expensive.
• Often, in medical imaging, annotation is impossible.
• The desired annotation keeps changing.
• Therefore: learn with missing information (latent variables).

Annotation Mismatch
Learn to classify an image: image x, latent variable h, annotation y = "Deer", desired output y.
Learn to classify a DNA sequence: sequence x, latent variables h, annotation y ∈ {+1,-1}, desired output y.
There is a mismatch between the desired and the available annotations, but the exact value of the latent variable is not "important".

Output Mismatch
Learn to detect an object in an image: image x, annotation y = "Deer", desired output (y,h).
Learn to segment an image: the available annotation (x, y) is an image with a tag such as "Bird" or "Cow"; the desired output (y, h) includes the segmentation.
There is a mismatch between the desired output and the available annotations, and the exact value of the latent variable is important.
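Before the learning algorithms, it may help to see one concrete (and entirely illustrative) choice of joint feature vector Ψ(x,y,h) for the image-classification example with a latent variable. The sketch below assumes h indexes a candidate region of the image whose descriptor (e.g., HOG or bag-of-words) is placed in the block of class y; none of the names here are prescribed by the slides.

```python
import numpy as np

def joint_feature(region_features, y, h, n_classes):
    """Joint feature Psi(x, y, h).

    region_features : (n_regions, d) array; row h holds the descriptor
                      of candidate region h of image x
    y               : class label in {0, ..., n_classes-1}
    h               : latent region index
    The descriptor of region h is placed in the y-th block, zeros elsewhere.
    """
    d = region_features.shape[1]
    psi = np.zeros(n_classes * d)
    psi[y * d:(y + 1) * d] = region_features[h]
    return psi

def score(theta, region_features, y, h, n_classes):
    """f(Psi(x,y,h)) = theta^T Psi(x,y,h)."""
    return theta @ joint_feature(region_features, y, h, n_classes)

def predict(theta, region_features, n_classes):
    """Prediction with latent variables: maximise the score jointly over (y, h)."""
    n_regions = region_features.shape[0]
    scored = [(score(theta, region_features, y, h, n_classes), y, h)
              for y in range(n_classes) for h in range(n_regions)]
    _, y_hat, h_hat = max(scored)
    return y_hat, h_hat
```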
Expectation Maximization
Input: x. Annotation: y. Latent variables: h.
Joint feature vector: Ψ(x,y,h)
f(Ψ(x,y,h)) = θᵀΨ(x,y,h)
P(y,h|x;θ) = exp(f(Ψ(x,y,h))) / Z(x;θ), with partition function Z(x;θ) = Σ_{y,h} exp(f(Ψ(x,y,h)))
Prediction: either argmax_y P(y|x;θ) = argmax_y Σ_h P(y,h|x;θ), or argmax_{y,h} P(y,h|x;θ).

Training data {(x_i, y_i), i = 1,2,…,n} (annotation mismatch)
min_θ Σ_i −log P(y_i|x_i;θ) + λ ||θ||²
For any θ', the negative log-likelihood decomposes as
−log P(y|x;θ) = E_{P(h|y,x;θ')}[ log P(h|y,x;θ) ] − E_{P(h|y,x;θ')}[ log P(y,h|x;θ) ]
The first term is maximized at θ = θ' (proof left as an exercise), so with θ' fixed at the current estimate, minimizing
−E_{P(h|y,x;θ')}[ log P(y,h|x;θ) ]
over θ does not increase −log P(y|x;θ).

Expectation Maximization – Algorithm
Start with an initial estimate θ_0.
E-step: compute P(h|y_i,x_i;θ_t) for all i.
M-step: obtain θ_{t+1} by solving
  min_θ Σ_i −E_{P(h|y_i,x_i;θ_t)}[ log P(y_i,h|x_i;θ) ] + λ ||θ||²
Repeat until convergence.

Latent SVM
Input x. Output y ∈ Y. Hidden variable h ∈ H.
Example: y = "Deer", with Y = {"Bison", "Deer", "Elephant", "Giraffe", "Llama", "Rhino"}.
Feature Ψ(x,y,h) (e.g., HOG, BoW). Parameters w.
Prediction: (y(w), h(w)) = argmax_{y∈Y, h∈H} wᵀΨ(x,y,h)

Training samples x_i with ground-truth labels y_i; loss function Δ(y_i, y_i(w)) (annotation mismatch).
Δ(y_i, y_i(w)) = wᵀΨ(x_i, y_i(w), h_i(w)) + Δ(y_i, y_i(w)) − wᵀΨ(x_i, y_i(w), h_i(w))   ("very" non-convex)
             ≤ wᵀΨ(x_i, y_i(w), h_i(w)) + Δ(y_i, y_i(w)) − max_h wᵀΨ(x_i, y_i, h)
             ≤ max_{y,h} { wᵀΨ(x_i, y, h) + Δ(y_i, y) } − max_h wᵀΨ(x_i, y_i, h) = ξ_i   (upper bound)

Latent SVM – Optimization Problem
min_w ||w||² + C Σ_i ξ_i
s.t. max_h wᵀΨ(x_i, y_i, h) − wᵀΨ(x_i, y, h) ≥ Δ(y_i, y) − ξ_i, for all y, h
So is this convex?
max_{y,h} { wᵀΨ(x_i, y, h) + Δ(y_i, y) } is convex, and max_h wᵀΨ(x_i, y_i, h) is convex, so the upper bound is a difference of convex functions.

Concave-Convex Procedure (CCCP)
[Figure: at each iteration, replace the concave part by a linear upper bound, minimize the resulting convex function, and repeat until convergence.]
Linear upper bound of the concave part at w_t: use the feature Ψ(x_i, y_i, h_i*), where
h_i* = argmax_h w_tᵀΨ(x_i, y_i, h), for all i.

CCCP for Latent SVM
Start with an initial estimate w_0.
Update h_i* = argmax_{h∈H} w_tᵀΨ(x_i, y_i, h), for all i.
Update w_{t+1} by solving the convex problem
  min_w ||w||² + C Σ_i ξ_i
  s.t. wᵀΨ(x_i, y_i, h_i*) − wᵀΨ(x_i, y, h) ≥ Δ(y_i, y) − ξ_i, for all y, h
http://webdocs.cs.ualberta.ca/~chunnam/
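Here is a compact sketch of the CCCP loop just described, assuming small label and latent sets that can be enumerated and a user-supplied `psi` and `delta`; the inner convex problem is approximated with plain subgradient steps as a stand-in for the QP or cutting-plane solvers discussed earlier, so this is an illustration rather than a reference implementation.

```python
import numpy as np
from itertools import product

def cccp_latent_svm(X, y_train, psi, labels, latents, delta, dim,
                    C=1.0, outer_iters=10, inner_iters=100, lr=1e-2):
    """CCCP sketch for the latent SVM objective described above.

    psi(x, y, h)   -> joint feature vector Psi(x,y,h) of length `dim`
    delta(y_i, y)  -> loss, e.g. 0-1 loss
    labels, latents-> small discrete sets Y and H that we can enumerate
    """
    w = np.zeros(dim)
    for _ in range(outer_iters):
        # Step 1: impute the latent variables, h_i* = argmax_h w^T Psi(x_i, y_i, h)
        h_star = [max(latents, key=lambda h: w @ psi(x, y, h))
                  for x, y in zip(X, y_train)]
        # Step 2: approximately solve the convex problem with h_i* fixed
        for _ in range(inner_iters):
            grad = 2.0 * w
            for x, y, hs in zip(X, y_train, h_star):
                # loss-augmented inference: most violated (y, h)
                y_hat, h_hat = max(product(labels, latents),
                                   key=lambda yh: w @ psi(x, yh[0], yh[1])
                                                  + delta(y, yh[0]))
                slack = (w @ psi(x, y_hat, h_hat) + delta(y, y_hat)
                         - w @ psi(x, y, hs))
                if slack > 0:
                    grad += C * (psi(x, y_hat, h_hat) - psi(x, y, hs))
            w -= lr * grad
    return w
```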
CCCP for Human Learning
Present everything at once — 1+1 = 2, 1/3 + 1/6 = 1/2, e^(iπ)+1 = 0 — and the student concludes "Math is for losers!!": FAILURE, a bad local minimum.

Self-Paced Learning
Start with 1+1 = 2 ("Euler was a Genius!!"), then move on to 1/3 + 1/6 = 1/2 and e^(iπ)+1 = 0: SUCCESS, a good local minimum.

Self-Paced Learning
• Start with "easy" examples, then consider "hard" ones.
• Simultaneously estimate easiness and parameters.
• Easiness is a property of data sets, not of single instances.

[Figure: Easy vs. Hard — examples that are expensive, easy for a human, or easy for the machine.]

Recall the convex problem solved at each CCCP iteration, after imputing h_i*:
  min_w ||w||² + C Σ_i ξ_i
  s.t. wᵀΨ(x_i, y_i, h_i*) − wᵀΨ(x_i, y, h) ≥ Δ(y_i, y) − ξ_i

Self-Paced Learning – Formulation
Introduce a selection variable v_i ∈ {0,1} per sample:
  min_{w,v} ||w||² + C Σ_i v_i ξ_i
  s.t. wᵀΨ(x_i, y_i, h_i*) − wᵀΨ(x_i, y, h) ≥ Δ(y_i, y, h) − ξ_i
On its own this has a trivial solution (set every v_i = 0), so add a term that rewards selecting samples:
  min_{w,v} ||w||² + C Σ_i v_i ξ_i − Σ_i v_i / K
[Figure: samples selected for large, medium, and small K.]
Minimizing over v_i ∈ {0,1} selects sample i exactly when C ξ_i < 1/K, so decreasing K gradually includes harder samples.
Relaxing to v_i ∈ [0,1] gives a biconvex problem, which can be solved by alternating convex search.

SPL for Latent SVM
Start with an initial estimate w_0.
Update h_i* = argmax_{h∈H} w_tᵀΨ(x_i, y_i, h), for all i.
Update w_{t+1} (and v) by solving the convex problem
  min_{w,v} ||w||² + C Σ_i v_i ξ_i − Σ_i v_i / K
  s.t. wᵀΨ(x_i, y_i, h_i*) − wᵀΨ(x_i, y, h) ≥ Δ(y_i, y) − ξ_i
Decrease K and repeat.
http://cvc.centrale-ponts.fr/personnel/pawan/

Outline – Part II
• Introduction to Weakly Supervised Learning
  – Two types of problems
• Probabilistic Methods
  – Expectation maximization
• Loss-based Methods
  – Latent support vector machine
  – Dissimilarity coefficient learning (if time permits)
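Going back to the self-paced learning formulation above, the v-update of the alternating search has a simple closed form: with w fixed, v_i = 1 exactly when C·ξ_i < 1/K. The sketch below is my own; `update_w`, `compute_slacks` and `anneal` are placeholders for the learner at hand (for example, one CCCP round as sketched earlier).

```python
import numpy as np

def spl_select(slacks, C, K):
    """Closed-form v-update for self-paced learning.

    Minimising  C * v_i * xi_i - v_i / K  over v_i in {0,1} selects example i
    (v_i = 1) exactly when C * xi_i < 1/K, i.e. when it is currently "easy".
    """
    return (C * np.asarray(slacks) < 1.0 / K).astype(int)

def self_paced_loop(update_w, compute_slacks, n, C=1.0, K=1.0,
                    anneal=1.3, outer_iters=10):
    """Alternate the two convex searches of self-paced learning.

    update_w(v)       -> parameters w fitted on the examples with v_i = 1
    compute_slacks(w) -> per-example slack values xi_i(w)
    """
    v = np.ones(n, dtype=int)      # one simple initialisation: start from all examples
    w = None
    for _ in range(outer_iters):
        w = update_w(v)                            # fit w on the selected examples
        v = spl_select(compute_slacks(w), C, K)    # re-estimate which examples are easy
        K /= anneal                                # decrease K so more examples qualify
    return w, v
```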