Classification: Linear Models
Oliver Schulte, Machine Learning 726

Where linear classifiers fit among the models in this course (rows: child node type, columns: parent node type):

Child Node \ Parent Node | Discrete | Continuous
-------------------------|----------|-----------
Discrete | maximum likelihood; decision trees | logit distribution (logistic regression); classifiers: linear discriminant (perceptron), support vector machine
Continuous | conditional Gaussian (not discussed) | linear Gaussian (linear regression)

Linear Classification Models

General idea:
1. Learn a linear continuous function y of the continuous features x.
2. Classify as positive if y crosses a threshold, typically 0.
3. As in linear regression, we can use more complicated features defined by basis functions φ.

Example: Classifying Digits

Classify an input vector as "4" vs. "not 4". Represent the input image as a vector x of 28 × 28 = 784 numbers. The target is t = 1 for "positive" and t = -1 for "negative". Given a training set (x_1, t_1), ..., (x_N, t_N), the problem is to find a good linear function y : R^784 → R. Classify x as positive if y(x) > 0, negative otherwise.

Other Examples

- Will the person vote Conservative, given age, income, and previous votes?
- Is the patient at risk of diabetes, given body mass, age, and blood test measurements?
- Predict earthquake vs. nuclear explosion, given body wave magnitude and surface wave magnitude.

[Diagrams: feature nodes (Age, Income, Votes) pointing to the target node Conservative; feature nodes (surface wave magnitude, body wave magnitude) pointing to the target node disaster type.]

Linear Separation

x1 = surface wave magnitude, x2 = body wave magnitude; white = earthquake, black = nuclear explosion.

[Figure: Russell and Norvig, Figure 18.15. The two classes plotted in the (x1, x2) plane, separable by a line.]

Linear Discriminants

The simple linear model:

    y(x) = w \cdot x + w_0

We can drop the explicit bias w_0 by adding a fixed dummy input x_0 = 1. The decision surface is a line (in general, a hyperplane) orthogonal to w. In 2-D, just try a line between the classes!

Perceptron Learning

Defining an Error Function

General idea:
1. Encode the class label as a real number t, e.g., "positive" = 1 and "negative" = 0 or "negative" = -1.
2. Measure error by comparing the continuous linear output y with the class label code t.

The Error Function for Linear Discriminants

We could use squared error, as in linear regression, but this runs into various problems (see book), basically because 1 and -1 are not real target values. A different criterion was developed for learning perceptrons. Perceptrons are a precursor to neural nets; Rosenblatt built an analog implementation in the 1950s (see Bishop, Figure 4.8).

The Perceptron Criterion

An example is misclassified if

    (x_n \cdot w) t_n < 0

(Take a moment to verify this.) The perceptron error sums over the set M of misclassified inputs, the mistakes:

    E_P(w) = -\sum_{n \in M} (x_n \cdot w) t_n

Exercise: find the gradient of the error function with respect to a single input x_n.

Perceptron Learning Algorithm

Use stochastic gradient descent: gradient descent for one example at a time, cycling through the data. For a misclassified example x_n, the update equation is

    w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_P(w) = w^{(\tau)} + \eta x_n t_n

where we set η = 1 (without loss of generality in this case). Excel demo. (A minimal code sketch of this update loop appears below, after the discussion of nonseparability.)

Perceptron Demo

[Figure: four snapshots of perceptron learning on 2-D data; the decision boundary moves after each update on a misclassified example until the classes are separated.]

Perceptron Learning Analysis

Theorem. If the classes are linearly separable, the perceptron learning algorithm converges to a weight vector that separates them.

Convergence can be slow, and the result is sensitive to initialization.

Nonseparability

Linear discriminants can solve a problem only if the classes can be separated by a line (in general, a hyperplane). The canonical example of a non-separable problem is X-OR. On non-separable data the perceptron typically does not converge.
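To make the update loop concrete, here is a minimal NumPy sketch (not from the slides; `perceptron_train` and the toy data are illustrative). It cycles through the examples and applies w ← w + x_n t_n to every mistake, using (x_n · w) t_n ≤ 0 as the mistake test so that the all-zero initial weight vector also triggers updates.

```python
import numpy as np

def perceptron_train(X, t, max_epochs=100):
    """Stochastic-gradient perceptron learning with eta = 1.

    X: (N, D) inputs, with a dummy bias feature x0 = 1 already appended.
    t: (N,) labels coded as -1 or +1.
    Stops when an epoch produces no mistakes; on non-separable data it
    gives up after max_epochs and returns the current weights.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_n, t_n in zip(X, t):
            # Mistake test (x_n . w) t_n <= 0; update w <- w + x_n t_n.
            if (x_n @ w) * t_n <= 0:
                w += t_n * x_n
                mistakes += 1
        if mistakes == 0:  # all training examples classified correctly
            return w
    return w

# Toy usage: two separable point clouds in 2-D.
X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
X = np.hstack([X, np.ones((len(X), 1))])  # append the dummy bias input
t = np.array([1.0, 1.0, -1.0, -1.0])
w = perceptron_train(X, t)
print(w, np.sign(X @ w))  # learned weights and predicted labels
```

The epoch cap is the practical answer to the nonseparability issue above: without it, the loop would cycle forever on data such as X-OR.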
Nonseparability: Real World Example

x1 = surface wave magnitude, x2 = body wave magnitude; white = earthquake, black = nuclear explosion.

[Figure: Russell and Norvig, Figure 18.15(b). The same seismic data, but no line separates the two classes perfectly.]

Responses to Nonseparability

When the classes cannot be separated by a linear discriminant, the options are:
- Use a non-linear activation function and find an approximate solution: logistic regression.
- Separate the classes not completely but "well": Fisher discriminant (not covered).
- Add hidden features: neural network, support vector machine.

Logistic Regression

From Values to Probabilities

Key idea: instead of predicting a class label, predict the probability of a class label, e.g.,

    p^+ = P(class is positive | features)
    p^- = P(class is negative | features)

A probability is naturally a continuous quantity. How do we turn a real number y into a probability p^+?

The Logistic Sigmoid Function

Definition:

    \sigma(y) = \frac{1}{1 + \exp(-y)}

It squeezes the real line into [0, 1], and it is differentiable (a nice exercise):

    \frac{d\sigma}{dy} = \sigma(1 - \sigma)

[Figure: the sigmoid curve rising from 0 to 1.]

Soft Threshold Interpretation

If y > 0, σ(y) goes to 1 very quickly; if y < 0, σ(y) goes to 0 very quickly.

[Figure: Russell and Norvig, Figure 18.17: a hard threshold compared with the soft sigmoid threshold.]

Probabilistic Interpretation

The sigmoid can be interpreted in terms of the class odds p^+ / (1 - p^+). Exercise: show the following implication for the class odds:

    p^+ = \frac{1}{1 + \exp(-y)} \implies \frac{p^+}{1 - p^+} = \exp(y)

Therefore

    y = \ln\left(\frac{p^+}{1 - p^+}\right)

the log class odds.

Logistic Regression

In logistic regression, the log class odds are a linear function of the input features:

    \ln\left(\frac{p^+}{1 - p^+}\right) = x \cdot w

Recall that we got the same kind of expression for the naive Bayes classifier. Learning logistic regression is conceptually similar to linear regression.

Logistic Regression: Maximum Likelihood

Notation: p_n^+ is the probability that the n-th input example is positive; it depends on the weight vector w. A positive example has t_n = 1, a negative example has t_n = 0. The likelihood assigned to N independent training examples is

    p(t | w) = \prod_{n=1}^{N} (p_n^+)^{t_n} (1 - p_n^+)^{1 - t_n}

and the cross-entropy error is its negative log:

    E(w) = -\ln p(t | w) = -\sum_{n=1}^{N} \{ t_n \ln(p_n^+) + (1 - t_n) \ln(1 - p_n^+) \}

Minimizing E is equivalent to minimizing the KL divergence between the predicted class probabilities and the observed class frequencies.

Gradient Search

Exercise (on assignment): using the cross-entropy error above, show that

    \nabla E(w) = \sum_{n=1}^{N} (p_n^+ - t_n) x_n

Hint: recall that dσ/dy = σ(1 - σ). There is no closed-form minimum, since p_n^+ is a non-linear function of the input features. We can use gradient descent; a better approach is Iterative Reweighted Least Squares (IRLS). See assignment. (A minimal gradient descent sketch appears at the end of this section.)

Example: Logistic Regression Model Learned on Non-Separable Data

[Figure: Russell and Norvig, Figure 18.17. The logistic model fitted to the non-separable seismic data; the predicted class probability, plotted as a function of (x1, x2), rises smoothly from 0 to 1 across the boundary region.]

Logistic Regression With Basis Functions

As noted earlier, the same model can be applied to features defined by basis functions φ, which yields decision boundaries that are non-linear in the original input space.

[Figure: Bishop, Figure 4.12.]

Multi-Class Example

Logistic regression can be extended to multiple classes. Here is a picture of what the decision boundaries can look like.

[Figure: decision regions of a multi-class linear model in 2-D.]
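To make the gradient formula from the Gradient Search slide concrete, here is a minimal batch gradient descent sketch, assuming NumPy and 0/1 label coding; `logistic_train` and the toy data are illustrative names, and this is plain gradient descent rather than the IRLS method recommended above.

```python
import numpy as np

def sigmoid(y):
    """Logistic sigmoid: squeezes the real line into [0, 1]."""
    return 1.0 / (1.0 + np.exp(-y))

def logistic_train(X, t, eta=0.1, n_iters=1000):
    """Batch gradient descent on the cross-entropy error.

    X: (N, D) inputs with a dummy bias feature appended.
    t: (N,) labels coded as 0 or 1.
    Uses the gradient sum_n (p_n^+ - t_n) x_n derived above.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ w)     # p_n^+ for every training example
        grad = X.T @ (p - t)   # sum_n (p_n^+ - t_n) x_n
        w -= eta * grad
    return w

# Toy usage: same point clouds as before, with 0/1 labels.
X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
X = np.hstack([X, np.ones((len(X), 1))])
t = np.array([1.0, 1.0, 0.0, 0.0])
w = logistic_train(X, t)
print(sigmoid(X @ w))  # predicted class probabilities
```

Unlike the perceptron loop, this behaves sensibly on non-separable data too: the cross-entropy error is convex in w, so gradient descent with a small step size makes steady progress even when the classes overlap.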
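The slides show only a picture for the multi-class case. The standard extension, a fact about the general technique rather than something spelled out in the slides, replaces the sigmoid with the softmax function and learns one weight column per class. A sketch under that assumption (names illustrative):

```python
import numpy as np

def softmax(A):
    """Row-wise softmax; the row max is subtracted for numerical stability."""
    A = A - A.max(axis=1, keepdims=True)
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def softmax_train(X, T, eta=0.1, n_iters=1000):
    """Multi-class logistic regression by batch gradient descent.

    X: (N, D) inputs with a bias feature; T: (N, K) one-hot targets.
    The cross-entropy gradient generalizes the binary case: X^T (P - T).
    """
    W = np.zeros((X.shape[1], T.shape[1]))
    for _ in range(n_iters):
        P = softmax(X @ W)           # (N, K) class probabilities
        W -= eta * (X.T @ (P - T))   # one gradient column per class
    return W
```

Each row of softmax(X @ W) sums to 1, and the boundary between any two classes is the set where their linear scores are equal, which produces the piecewise-linear decision regions in the figure above.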