Introduction to Machine Learning
CSE474/574: Perceptrons
Varun Chandola <chandola@buffalo.edu>

Outline
1 Perceptrons
    Geometric Interpretation
    Perceptron Training
2 Perceptron Convergence
3 Perceptron Learning in Non-separable Case
4 Gradient Descent and Delta Rule
    Objective Function for Perceptron Learning
    Machine Learning as Optimization
    Convex Optimization
    Gradient Descent
    Issues with Gradient Descent
    Stochastic Gradient Descent

1 Perceptrons

Artificial Neurons
[Figure: a biological neuron. Src: http://brainjackimage.blogspot.com/ and Wikipedia]
- The human brain has about 10^11 neurons.
- Each neuron is connected to about 10^4 neighbors.

Perceptron [4, 2]
- Inputs x_1, ..., x_d with weights w_1, ..., w_d and a bias (threshold) Θ.
- The activation function outputs +1 if Σ_{j=1}^d w_j x_j ≥ Θ, and −1 otherwise.

Geometric Interpretation
[Figure: the hyperplane w^T x = Θ separates the +1 and −1 regions of the (x_1, x_2) plane; its unit normal is w/|w| and its offset from the origin is Θ/|w|.]

Eliminating Bias
- Add another attribute x_{d+1} = 1 with weight w_{d+1} = −Θ.
- The desired hyperplane then passes through the origin in the (d+1)-dimensional space.

Hypothesis Space
- Assumption: there exists a w ∈ ℝ^{d+1} that strictly classifies all examples correctly.
- Hypothesis space: the set of all hyperplanes in the (d+1)-dimensional space that pass through the origin.
- The target hypothesis is also called the decision surface or decision boundary.

Perceptron Training - Perceptron Learning Rule

    w ← (0, 0, ..., 0)_{d+1}
    for i = 1, 2, ... do
        if w^T x^(i) > 0 then
            c(x^(i)) = +1
        else
            c(x^(i)) = −1
        end if
        if c(x^(i)) ≠ c*(x^(i)) then
            w ← w + c*(x^(i)) x^(i)
        end if
    end for

- Every mistake tweaks the hyperplane: a rotation in the (d+1)-dimensional space that accommodates the offending point.
- Stopping criterion: all training examples are exhausted, or no further updates occur.
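As a concrete illustration, the learning rule above can be written in a few lines of NumPy. This is a minimal sketch only: the function name, the epoch cap, and the toy data in the usage comment are illustrative choices, not part of the original algorithm.

    import numpy as np

    def perceptron_train(X, y, max_epochs=100):
        """Perceptron learning rule. X: (n, d) examples, y: (n,) labels in {-1, +1}."""
        # Eliminate the bias: append x_{d+1} = 1 so the threshold is folded into w.
        X = np.hstack([X, np.ones((X.shape[0], 1))])
        w = np.zeros(X.shape[1])                  # w <- (0, 0, ..., 0)_{d+1}
        for _ in range(max_epochs):
            mistakes = 0
            for x_i, y_i in zip(X, y):
                c = 1 if w @ x_i > 0 else -1      # predicted label c(x)
                if c != y_i:                      # mistake: c(x) != c*(x)
                    w = w + y_i * x_i             # w <- w + c*(x) x
                    mistakes += 1
            if mistakes == 0:                     # stopping criterion: no further updates
                break
        return w

    # Illustrative usage on a small separable problem:
    # w = perceptron_train(np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]]),
    #                      np.array([-1, -1, -1, 1]))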
2 Perceptron Convergence

Does Perceptron Training Converge?
[Figure: labeled example points in the (x_1, x_2) plane.]

Convergence Assumptions
1 Linearly separable examples.
2 No errors.
3 |x| = 1.
4 A positive δ gap exists that "contains" the target concept (hyperplane): (∃δ)(∃v) such that (∀x) v^T x > c*(x) δ.

Perceptron Convergence Theorem
Theorem. For a set of unit-length and linearly separable examples, the perceptron learning algorithm will converge after a finite number of mistakes (at most 1/δ^2).
Proof discussed in Minsky's book [3].
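The bound can be checked numerically. The sketch below is illustrative only: the data, the unit-norm separator v, and the margin computation δ = min_i c*(x_i) v^T x_i are assumptions made for this example, not values from the lecture.

    import numpy as np

    # Toy separable data, rescaled so that |x| = 1 as the theorem assumes.
    X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    y = np.array([1, 1, -1, -1])

    # A unit-norm separator and the margin delta it achieves on this data.
    v = np.array([1.0, 1.0]) / np.sqrt(2.0)
    delta = np.min(y * (X @ v))
    print("margin:", delta, " mistake bound 1/delta^2:", 1.0 / delta**2)

    # Run the perceptron learning rule and count the mistakes it actually makes.
    w, mistakes, updated = np.zeros(2), 0, True
    while updated:
        updated = False
        for x_i, y_i in zip(X, y):
            if (1 if w @ x_i > 0 else -1) != y_i:
                w, mistakes, updated = w + y_i * x_i, mistakes + 1, True
    print("mistakes made:", mistakes)   # stays within the bound on this data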
Review
Hypothesis spaces seen so far, with their input and output spaces:
- Conjunctive and disjunctive concepts, and disjunctions of k attributes: x ∈ {0, 1}^d, y ∈ {0, 1}.
- Linear hyperplanes: x ∈ ℝ^d, y ∈ {−1, +1} or y ∈ ℝ.
In some of these settings the target concept satisfies c* ∈ H; in others c* ∉ H.

3 Perceptron Learning in Non-separable Case

Target concept c* ∉ H
[Figure: example points in the (x_1, x_2) plane that no hypothesis in H separates.]
- Expand H?
- Lower expectations: the principle of good enough.

Perceptron Learning in Non-separable Case
[Figure: positive (+1) and negative (−1) examples in the (x_1, x_2) plane that are not linearly separable.]

4 Gradient Descent and Delta Rule

Which hyperplane to choose?
- The one that gives the best performance on the training data.
- Pose this as an optimization problem: what is the objective function, and what is the optimization procedure?

Objective Function for Perceptron Learning
- Use an unthresholded perceptron (a linear unit).
[Figure: input layer x_0, x_1, x_2, x_3, x_4 connected through weights w to a single output unit.]
- Training examples ⟨x_i, y_i⟩, weights w.
- E(w) = (1/2) Σ_i (y_i − w^T x_i)^2

Machine Learning as Optimization Problem
- Learning is optimization; faster optimization methods give faster learning.
- Let w ∈ ℝ^d, S ⊂ ℝ^d, and f_0(w), f_1(w), ..., f_m(w) be real-valued functions. The standard optimization formulation is:

      minimize_w   f_0(w)
      subject to   f_i(w) ≤ 0,  i = 1, ..., m.

(Adapted from http://ttic.uchicago.edu/~gregory/courses/ml2012/lectures/tutorial_optimization.pdf. Also see http://www.stanford.edu/~boyd/cvxbook/ and http://scipy-lectures.github.io/advanced/mathematical_optimization/.)
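To make the connection concrete, the perceptron objective above is an unconstrained instance of this formulation (m = 0 and f_0(w) = E(w)), so any general-purpose solver can minimize it. A minimal sketch, using scipy.optimize.minimize as a stand-in solver and made-up data:

    import numpy as np
    from scipy.optimize import minimize

    # Illustrative training examples <x_i, y_i> for the linear unit.
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
    y = np.array([1.0, -1.0, 1.0, -1.0])

    def E(w):
        """E(w) = 1/2 * sum_i (y_i - w^T x_i)^2"""
        r = y - X @ w
        return 0.5 * np.sum(r ** 2)

    # Minimize f_0(w) = E(w) with no constraints (m = 0).
    result = minimize(E, x0=np.zeros(X.shape[1]))
    print("w* =", result.x, " E(w*) =", result.fun)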
Solving Optimization Problems
- Methods for general optimization problems: simulated annealing, genetic algorithms.
- Exploiting structure in the optimization problem: convexity, Lipschitz continuity, smoothness.

Convexity
- Convex sets and convex functions.
[Figure: a convex set in the (w_1, w_2) plane and the convex function y = x^2.]

Convex Optimization: Optimality Criterion

      minimize_w   f_0(w)
      subject to   f_i(w) ≤ 0,  i = 1, ..., m,

where all f_i(w) are convex functions.
- w_0 is feasible if w_0 ∈ Dom f_0 and all constraints are satisfied.
- A feasible w* is optimal if f_0(w*) ≤ f_0(w) for all w satisfying the constraints.

Gradient of a Function
- ∇E(w) = (∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_d)^T
- The gradient denotes the direction of steepest ascent.
[Figure: the error surface E(w) plotted over (w_0, w_1).]

Finding Extremes of a Single-Variable Function
- Set the derivative to 0.
- Use the second derivative to tell minima from maxima.

Finding Extremes of a Multiple-Variable Function: Gradient Descent
1 Start from any point in the variable space.
2 Move along the direction of the steepest descent (or ascent).
- By how much? A learning rate (η).
- What is the direction of steepest descent? The gradient of E at w.
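A minimal sketch of this procedure for the squared-error objective defined earlier. Differentiating E(w) = (1/2) Σ_i (y_i − w^T x_i)^2 gives ∇E(w) = −Σ_i (y_i − w^T x_i) x_i; the data, learning rate, and iteration count below are illustrative assumptions.

    import numpy as np

    # Illustrative training data for the linear unit.
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
    y = np.array([1.0, -1.0, 1.0, -1.0])

    def grad_E(w):
        """Gradient of E(w) = 1/2 sum_i (y_i - w^T x_i)^2."""
        return -X.T @ (y - X @ w)

    eta = 0.01                      # learning rate
    w = np.zeros(X.shape[1])        # start from any point; here the origin
    for _ in range(500):
        w = w - eta * grad_E(w)     # step along the direction of steepest descent
    print("w after gradient descent:", w)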
Training Rule for Gradient Descent
- w ← w − η ∇E(w)
- For each weight component: w_j ← w_j − η ∂E/∂w_j

Convergence Guaranteed?
- The error surface contains only one global minimum, so the algorithm will converge; the examples need not be linearly separable.
- η should be small enough. What is the impact of an η that is too large? Too small?

Issues with Gradient Descent
- Slow convergence.
- Can get stuck in local minima.

Stochastic Gradient Descent [1]
- Update the weights after every training example.
- For sufficiently small η, it closely approximates gradient descent.

Gradient Descent vs. Stochastic Gradient Descent
- Gradient Descent: weights are updated after summing the error over all examples; more computations per weight-update step; risk of getting stuck in local minima.
- Stochastic Gradient Descent: weights are updated after examining each example; significantly fewer computations; avoids local minima.

References
[1] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, Dec. 1989.
[2] W. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:127-147, 1943.
[3] M. L. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1969.
[4] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386-408, 1958.