Classification: Linear Models
Oliver Schulte, Machine Learning 726

Where linear classifiers fit among the models in this course (rows: child node type, columns: parent node type):

Child Node \ Parent Node | Discrete | Continuous
-------------------------|----------|-----------
Discrete | maximum likelihood; decision trees | logit distribution (logistic regression); classifiers: linear discriminant (perceptron), support vector machine
Continuous | conditional Gaussian (not discussed) | linear Gaussian (linear regression)

Linear Classification Models

General idea:
1. Learn a linear continuous function y of the continuous features x.
2. Classify as positive if y crosses a threshold, typically 0.
3. As in linear regression, we can use more complicated features defined by basis functions φ.

Example: Classifying Digits

Classify an input vector as "4" vs. "not 4". Represent the input image as a vector x of 28 × 28 = 784 numbers. The target is t = 1 for "positive" and t = -1 for "negative". Given a training set (x_1, t_1), ..., (x_N, t_N), the problem is to find a good linear function y : R^784 → R. Classify x as positive if y(x) > 0, negative otherwise.

Other Examples

- Will the person vote Conservative, given age, income, and previous votes?
- Is the patient at risk of diabetes, given body mass, age, and blood test measurements?
- Predict earthquake vs. nuclear explosion, given body wave magnitude and surface wave magnitude.

[Diagrams: feature nodes (Age, Income, Votes) pointing to the target node Conservative; feature nodes (surface wave magnitude, body wave magnitude) pointing to the target node disaster type.]

Linear Separation

x1 = surface wave magnitude, x2 = body wave magnitude; white = earthquake, black = nuclear explosion.

[Figure: Russell and Norvig, Figure 18.15. The two classes plotted in the (x1, x2) plane, separable by a line.]

Linear Discriminants

The simple linear model:

    y(x) = w \cdot x + w_0

We can drop the explicit bias w_0 by adding a fixed dummy input x_0 = 1. The decision surface is a line (in general, a hyperplane) orthogonal to w. In 2-D, just try a line between the classes!

Perceptron Learning

Defining an Error Function

General idea:
1. Encode the class label as a real number t, e.g., "positive" = 1 and "negative" = 0 or "negative" = -1.
2. Measure error by comparing the continuous linear output y with the class label code t.

The Error Function for Linear Discriminants

We could use squared error, as in linear regression, but this runs into various problems (see book), basically because 1 and -1 are not real target values. A different criterion was developed for learning perceptrons. Perceptrons are a precursor to neural nets; Rosenblatt built an analog implementation in the 1950s (see Bishop, Figure 4.8).

The Perceptron Criterion

An example is misclassified if

    (x_n \cdot w) t_n < 0

(Take a moment to verify this.) The perceptron error sums over the set M of misclassified inputs, the mistakes:

    E_P(w) = -\sum_{n \in M} (x_n \cdot w) t_n

Exercise: find the gradient of the error function with respect to a single input x_n.

Perceptron Learning Algorithm

Use stochastic gradient descent: gradient descent for one example at a time, cycling through the data. For a misclassified example x_n, the update equation is

    w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_P(w) = w^{(\tau)} + \eta x_n t_n

where we set η = 1 (without loss of generality in this case). Excel demo. (A minimal code sketch of this update loop appears below, after the discussion of nonseparability.)

Perceptron Demo

[Figure: four snapshots of perceptron learning on 2-D data; the decision boundary moves after each update on a misclassified example until the classes are separated.]

Perceptron Learning Analysis

Theorem. If the classes are linearly separable, the perceptron learning algorithm converges to a weight vector that separates them.

Convergence can be slow, and the result is sensitive to initialization.

Nonseparability

Linear discriminants can solve a problem only if the classes can be separated by a line (in general, a hyperplane). The canonical example of a non-separable problem is X-OR. On non-separable data the perceptron typically does not converge.
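To make the update loop concrete, here is a minimal NumPy sketch (not from the slides; `perceptron_train` and the toy data are illustrative). It cycles through the examples and applies w ← w + x_n t_n to every mistake, using (x_n · w) t_n ≤ 0 as the mistake test so that the all-zero initial weight vector also triggers updates.

```python
import numpy as np

def perceptron_train(X, t, max_epochs=100):
    """Stochastic-gradient perceptron learning with eta = 1.

    X: (N, D) inputs, with a dummy bias feature x0 = 1 already appended.
    t: (N,) labels coded as -1 or +1.
    Stops when an epoch produces no mistakes; on non-separable data it
    gives up after max_epochs and returns the current weights.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_n, t_n in zip(X, t):
            # Mistake test (x_n . w) t_n <= 0; update w <- w + x_n t_n.
            if (x_n @ w) * t_n <= 0:
                w += t_n * x_n
                mistakes += 1
        if mistakes == 0:  # all training examples classified correctly
            return w
    return w

# Toy usage: two separable point clouds in 2-D.
X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
X = np.hstack([X, np.ones((len(X), 1))])  # append the dummy bias input
t = np.array([1.0, 1.0, -1.0, -1.0])
w = perceptron_train(X, t)
print(w, np.sign(X @ w))  # learned weights and predicted labels
```

The epoch cap is the practical answer to the nonseparability issue above: without it, the loop would cycle forever on data such as X-OR.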
Nonseparability: Real World Example

x1 = surface wave magnitude, x2 = body wave magnitude; white = earthquake, black = nuclear explosion.

[Figure: Russell and Norvig, Figure 18.15(b). The same seismic data, but no line separates the two classes perfectly.]

Responses to Nonseparability

When the classes cannot be separated by a linear discriminant, the options are:
- Use a non-linear activation function and find an approximate solution: logistic regression.
- Separate the classes not completely but "well": Fisher discriminant (not covered).
- Add hidden features: neural network, support vector machine.

Logistic Regression

From Values to Probabilities

Key idea: instead of predicting a class label, predict the probability of a class label, e.g.,

    p^+ = P(class is positive | features)
    p^- = P(class is negative | features)

A probability is naturally a continuous quantity. How do we turn a real number y into a probability p^+?

The Logistic Sigmoid Function

Definition:

    \sigma(y) = \frac{1}{1 + \exp(-y)}

It squeezes the real line into [0, 1], and it is differentiable (a nice exercise):

    \frac{d\sigma}{dy} = \sigma(1 - \sigma)

[Figure: the sigmoid curve rising from 0 to 1.]

Soft Threshold Interpretation

If y > 0, σ(y) goes to 1 very quickly; if y < 0, σ(y) goes to 0 very quickly.

[Figure: Russell and Norvig, Figure 18.17: a hard threshold compared with the soft sigmoid threshold.]

Probabilistic Interpretation

The sigmoid can be interpreted in terms of the class odds p^+ / (1 - p^+). Exercise: show the following implication for the class odds:

    p^+ = \frac{1}{1 + \exp(-y)} \implies \frac{p^+}{1 - p^+} = \exp(y)

Therefore

    y = \ln\left(\frac{p^+}{1 - p^+}\right)

the log class odds.

Logistic Regression

In logistic regression, the log class odds are a linear function of the input features:

    \ln\left(\frac{p^+}{1 - p^+}\right) = x \cdot w

Recall that we got the same kind of expression for the naive Bayes classifier. Learning logistic regression is conceptually similar to linear regression.

Logistic Regression: Maximum Likelihood

Notation: p_n^+ is the probability that the n-th input example is positive; it depends on the weight vector w. A positive example has t_n = 1, a negative example has t_n = 0. The likelihood assigned to N independent training examples is

    p(t | w) = \prod_{n=1}^{N} (p_n^+)^{t_n} (1 - p_n^+)^{1 - t_n}

and the cross-entropy error is its negative log:

    E(w) = -\ln p(t | w) = -\sum_{n=1}^{N} \{ t_n \ln(p_n^+) + (1 - t_n) \ln(1 - p_n^+) \}

Minimizing E is equivalent to minimizing the KL divergence between the predicted class probabilities and the observed class frequencies.

Gradient Search

Exercise (on assignment): using the cross-entropy error above, show that

    \nabla E(w) = \sum_{n=1}^{N} (p_n^+ - t_n) x_n

Hint: recall that dσ/dy = σ(1 - σ). There is no closed-form minimum, since p_n^+ is a non-linear function of the input features. We can use gradient descent; a better approach is Iterative Reweighted Least Squares (IRLS). See assignment. (A minimal gradient descent sketch appears at the end of this section.)

Example: Logistic Regression Model Learned on Non-Separable Data

[Figure: Russell and Norvig, Figure 18.17. The logistic model fitted to the non-separable seismic data; the predicted class probability, plotted as a function of (x1, x2), rises smoothly from 0 to 1 across the boundary region.]

Logistic Regression With Basis Functions

As noted earlier, the same model can be applied to features defined by basis functions φ, which yields decision boundaries that are non-linear in the original input space.

[Figure: Bishop, Figure 4.12.]

Multi-Class Example

Logistic regression can be extended to multiple classes. Here is a picture of what the decision boundaries can look like.

[Figure: decision regions of a multi-class linear model in 2-D.]
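To make the gradient formula from the Gradient Search slide concrete, here is a minimal batch gradient descent sketch, assuming NumPy and 0/1 label coding; `logistic_train` and the toy data are illustrative names, and this is plain gradient descent rather than the IRLS method recommended above.

```python
import numpy as np

def sigmoid(y):
    """Logistic sigmoid: squeezes the real line into [0, 1]."""
    return 1.0 / (1.0 + np.exp(-y))

def logistic_train(X, t, eta=0.1, n_iters=1000):
    """Batch gradient descent on the cross-entropy error.

    X: (N, D) inputs with a dummy bias feature appended.
    t: (N,) labels coded as 0 or 1.
    Uses the gradient sum_n (p_n^+ - t_n) x_n derived above.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ w)     # p_n^+ for every training example
        grad = X.T @ (p - t)   # sum_n (p_n^+ - t_n) x_n
        w -= eta * grad
    return w

# Toy usage: same point clouds as before, with 0/1 labels.
X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
X = np.hstack([X, np.ones((len(X), 1))])
t = np.array([1.0, 1.0, 0.0, 0.0])
w = logistic_train(X, t)
print(sigmoid(X @ w))  # predicted class probabilities
```

Unlike the perceptron loop, this behaves sensibly on non-separable data too: the cross-entropy error is convex in w, so gradient descent with a small step size makes steady progress even when the classes overlap.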
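The slides show only a picture for the multi-class case. The standard extension, a fact about the general technique rather than something spelled out in the slides, replaces the sigmoid with the softmax function and learns one weight column per class. A sketch under that assumption (names illustrative):

```python
import numpy as np

def softmax(A):
    """Row-wise softmax; the row max is subtracted for numerical stability."""
    A = A - A.max(axis=1, keepdims=True)
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def softmax_train(X, T, eta=0.1, n_iters=1000):
    """Multi-class logistic regression by batch gradient descent.

    X: (N, D) inputs with a bias feature; T: (N, K) one-hot targets.
    The cross-entropy gradient generalizes the binary case: X^T (P - T).
    """
    W = np.zeros((X.shape[1], T.shape[1]))
    for _ in range(n_iters):
        P = softmax(X @ W)           # (N, K) class probabilities
        W -= eta * (X.T @ (P - T))   # one gradient column per class
    return W
```

Each row of softmax(X @ W) sums to 1, and the boundary between any two classes is the set where their linear scores are equal, which produces the piecewise-linear decision regions in the figure above.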