Linear Classifiers.

Classification: Linear Models
Oliver Schulte
Machine Learning 726
Parent node / child node type:
 Discrete parent, discrete child: maximum likelihood, decision trees.
 Continuous parent, discrete child: logit distribution (logistic regression); classifiers: linear discriminant (perceptron), support vector machine.
 Discrete parent, continuous child: conditional Gaussian (not discussed).
 Continuous parent, continuous child: linear Gaussian (linear regression).
Linear Classification Models
 General Idea:
1. Learn a linear continuous function y of the continuous features x.
2. Classify as positive if y crosses a threshold, typically 0.
3. As in linear regression, can use more complicated features defined by basis functions ϕ.
Example: Classifying Digits
 Classify the input vector as “4” vs. “not 4”.
 Represent the input image as a vector x with 28×28 = 784 numbers.
 Target t = 1 for “positive”, t = -1 for “negative”.
 Given a training set (x1, t1, ..., xN, tN), the problem is to find a good linear function y(x), i.e. y: R^784 → R.
 Classify x as positive if y(x) > 0, negative otherwise.
Other Examples
 Will the person vote conservative, given age, income, previous
votes?
 Is the patient at risk of diabetes given body mass, age, blood test
measurements?
 Predict Earthquake vs. nuclear explosion given body wave
magnitude and surface wave magnitude.
[Diagrams: Age, Income, Votes → Conservative; surface wave magnitude, body wave magnitude → disaster type.]
Linear Separation
[Figure: scatter plot of x1 = surface wave magnitude vs. x2 = body wave magnitude; white = earthquake, black = nuclear explosion. Russell and Norvig, Figure 18.15.]
Linear Discriminants
 Simple linear model:
y(x) = w·x + w0
 Can drop the explicit w0 by assuming a fixed dummy bias input (x0 = 1).
 The decision surface is a line (a hyperplane in general), orthogonal to w.
 In 2-D, just try a line between the classes!
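A minimal sketch of this decision rule in Python (illustrative only: the weight vector, bias, and input below are made up, not learned from data):

```python
import numpy as np

def linear_discriminant(x, w, w0=0.0):
    """Linear score y(x) = w·x + w0; classify as positive if y(x) > 0."""
    y = np.dot(w, x) + w0
    return +1 if y > 0 else -1

# Hypothetical 2-D example (e.g., features like the two wave magnitudes).
w = np.array([1.0, -1.0])    # made-up weight vector
w0 = -0.5                    # made-up bias
x = np.array([6.0, 4.5])
print(linear_discriminant(x, w, w0))   # y(x) = 6.0 - 4.5 - 0.5 = 1.0 > 0, so +1
```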
Perceptron Learning
Defining an Error Function
 General idea:
1. Encode the class label using a real number t, e.g., “positive” = 1, “negative” = 0 or “negative” = -1.
2. Measure error by comparing the continuous linear output y and the class label code t.
The Error Function for Linear Discriminants
 Could use squared error, as in linear regression.
 Various problems (see book), basically because 1 and -1 are not real target values.
 A different criterion was developed for learning perceptrons.
 Perceptrons are a precursor to neural nets.
 Analog implementation by Rosenblatt in the 1950s; see Figure 4.8.
The Perceptron Criterion
 An example is misclassified if (x_n · w) t_n < 0.
 (Take a moment to verify this.)
 Perceptron error:
E_P(w) = − Σ_{n ∈ M} (x_n · w) t_n
where M is the set of misclassified inputs, the mistakes.
 Exercise: find the gradient of the error function with respect to w for a single input x_n.
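A worked check of this exercise, consistent with the update rule on the next slide: a single misclassified example n contributes −(x_n · w) t_n to E_P, so

```latex
\nabla_{\mathbf{w}}\left[-(\mathbf{x}_n \cdot \mathbf{w})\, t_n\right] = -\, t_n \,\mathbf{x}_n .
```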
Perceptron Learning Algorithm
 Use stochastic gradient descent: gradient descent for one example at a time, cycling through the examples.
 Update equation:
w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η x_n t_n
where we set η = 1 (without loss of generality in this case).
 Excel Demo.
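A minimal Python sketch of the perceptron update (not the Excel demo; the toy data below is made up and linearly separable):

```python
import numpy as np

def perceptron_train(X, t, max_epochs=100):
    """Perceptron learning. X is (N, D) with a dummy bias feature included;
    t is (N,) with labels in {+1, -1}. Returns the learned weight vector."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_n, t_n in zip(X, t):          # cycle through examples
            if t_n * np.dot(w, x_n) <= 0:   # misclassified (perceptron criterion)
                w = w + x_n * t_n           # stochastic gradient step with eta = 1
                mistakes += 1
        if mistakes == 0:                   # separating weight vector found
            break
    return w

# Toy data: last column is the dummy bias input.
X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
t = np.array([1, 1, -1, -1])
w = perceptron_train(X, t)
print(w, np.sign(X @ w))   # predictions match t once the classes are separated
```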
Perceptron Demo
[Figure: perceptron demo — snapshots of the decision boundary on 2-D data during learning.]
Perceptron Learning Analysis
 Theorem: If the classes are linearly separable, the perceptron learning algorithm converges to a weight vector that separates them.
 Convergence can be slow.
 Sensitive to initialization.
Nonseparability
 Linear discriminants can solve problems only if the classes
can be separated by a line (hyperplane).
 The canonical example of a non-separable problem is XOR.
 Perceptron typically does not converge.
Nonseparability: Real-World Example
[Figure: scatter plot of x1 = surface wave magnitude vs. x2 = body wave magnitude; white = earthquake, black = nuclear explosion; the classes are not linearly separable. Russell and Norvig, Figure 18.15(b).]
Responses to Nonseparability
If the classes cannot be separated by a linear discriminant:
 Use a non-linear activation function that finds an approximate solution → logistic regression.
 Separate the classes not completely, but “well” → Fisher discriminant (not covered).
 Add hidden features → neural network, support vector machine.
Logistic Regression
From Values to Probabilities
 Key idea: instead of predicting a class label, predict the
probability of a class label.
 E.g., p+ = P(class is positive|features)
p- = P(class is negative|features)
 Naturally a continuous quantity.
 How to turn a real number y into a probability p+?
The Logistic Sigmoid Function
 Definition:
σ(y) = 1 / (1 + exp(−y))
 Squeezes the real line into [0,1].
 Differentiable (nice exercise):
dσ/dy = σ(1 − σ)
[Figure: plot of σ(y) for y from −5 to 5, rising from 0 to 1.]
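A small numerical sketch of the sigmoid, with a finite-difference check of the derivative identity above:

```python
import numpy as np

def sigmoid(y):
    """Logistic sigmoid: maps the real line into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-y))

y = 0.7
s = sigmoid(y)
analytic = s * (1.0 - s)                                     # dσ/dy = σ(1 − σ)
eps = 1e-6
numeric = (sigmoid(y + eps) - sigmoid(y - eps)) / (2 * eps)  # finite differences
print(s, analytic, numeric)   # the two derivative estimates agree closely
```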
Soft threshold interpretation
 If y > 0, σ(y) goes to 1 very quickly.
 If y < 0, σ(y) goes to 0 very quickly.
[Figure: hard threshold vs. soft (logistic) threshold; Russell and Norvig, Figure 18.17.]
Probabilistic Interpretation
 The sigmoid can be interpreted in terms of the class odds p+/(1 − p+).
 Exercise: Show the following implication for the class odds:
p+ = 1 / (1 + exp(−y))  ⟹  p+/(1 − p+) = exp(y)
 Therefore y = ln( p+/(1 − p+) ), the log class odds.
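One way to verify the implication (a short algebraic check):

```latex
p_+ = \frac{1}{1 + e^{-y}}
\;\Rightarrow\;
1 - p_+ = \frac{e^{-y}}{1 + e^{-y}}
\;\Rightarrow\;
\frac{p_+}{1 - p_+} = \frac{1}{e^{-y}} = e^{y}
\;\Rightarrow\;
y = \ln\frac{p_+}{1 - p_+}.
```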
Logistic Regression
 In logistic regression, the log-class odds are a linear function of the input features:
ln( p+/(1 − p+) ) = x·w
 Recall that we got the same kind of expression for the naive
Bayes classifier.
 Learning logistic regression is conceptually similar to linear
regression.
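Equivalently, inverting the log odds recovers the sigmoid from the earlier slide, which is how the class probability is computed from the weights:

```latex
\ln\frac{p_+}{1 - p_+} = \mathbf{x} \cdot \mathbf{w}
\;\Longleftrightarrow\;
p_+ = \sigma(\mathbf{x} \cdot \mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{x} \cdot \mathbf{w})}.
```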
Logistic Regression: Maximum Likelihood
 Notation: p_n^+ = the probability that the n-th input example is positive, which depends on the weight vector w.
 A positive example has t_n = 1, a negative example has t_n = 0.
 Then the likelihood assigned to N independent training examples is
p(t | w) = Π_{n=1}^{N} (p_n^+)^{t_n} (1 − p_n^+)^{1 − t_n}
 The cross-entropy error:
E(w) = −ln p(t | w) = − Σ_{n=1}^{N} { t_n ln(p_n^+) + (1 − t_n) ln(1 − p_n^+) }
 Equivalent to minimizing the KL divergence between the predicted class
probabilities and the observed class frequencies.
Gradient Search
 Exercise (on assignment): Using the cross-entropy error
E(w) = −ln p(t | w) = − Σ_{n=1}^{N} { t_n ln(p_n^+) + (1 − t_n) ln(1 − p_n^+) }
show that
∇E(w) = Σ_{n=1}^{N} (p_n^+ − t_n) x_n
Hint: recall that dσ/dy = σ(1 − σ).
 No closed-form minimum, since p_n^+ is a non-linear function of the input features.
 Can use gradient descent.
 Better approach: use Iterative Reweighted Least Squares (IRLS). See assignment.
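A minimal batch gradient-descent sketch using the gradient above (plain gradient descent, not IRLS; the data, step size, and iteration count are made up for illustration):

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def logistic_regression_gd(X, t, eta=0.1, n_iters=1000):
    """Batch gradient descent on the cross-entropy error.
    X is (N, D) with a dummy bias feature included; t is (N,) with labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ w)      # p_n^+ for every training example
        grad = X.T @ (p - t)    # ∇E(w) = Σ_n (p_n^+ − t_n) x_n
        w -= eta * grad
    return w

# Toy data: last column is the dummy bias input.
X = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 1.0], [-1.0, -1.5, 1.0], [-2.0, -1.0, 1.0]])
t = np.array([1.0, 1.0, 0.0, 0.0])
w = logistic_regression_gd(X, t)
print(sigmoid(X @ w))   # predicted probabilities, close to the 0/1 labels here
```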
Example: Logistic Regression Model Learned on Non-Separable Data
[Figure: fitted class probability plotted as a surface over x1 and x2 for the earthquake/explosion data; Russell and Norvig, Figure 18.17.]
Logistic Regression with Basis Functions
[Figure: Bishop, Figure 4.12.]
Multi-Class Example
 Logistic regression can be extended to multiple classes.
 Here’s a picture of what decision boundaries can look like.
[Figure: multi-class decision boundaries in 2-D; axes from −6 to 6.]
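The deck only shows the resulting decision regions; one standard way to realize the multi-class extension is softmax over one linear score per class (a sketch with made-up weights):

```python
import numpy as np

def softmax(scores):
    """Turn a vector of linear class scores into class probabilities."""
    z = scores - np.max(scores)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict_class(x, W):
    """W holds one weight row per class (bias folded in as a dummy feature)."""
    probs = softmax(W @ x)
    return int(np.argmax(probs)), probs

# Made-up 3-class example with two features plus a dummy bias input.
W = np.array([[ 1.0,  0.0, 0.0],
              [-1.0,  1.0, 0.0],
              [ 0.0, -1.0, 0.5]])
print(predict_class(np.array([2.0, -1.0, 1.0]), W))
```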