Pattern recognition systems – Lab 9
Linear Classifiers
1. Objectives
In this lab session we will study linear discriminant functions such as the perceptron, and we will apply the gradient descent procedure to compute the weights of a linear classifier for a two-class classification problem.
2. Theoretical Background
The goal of classification is to group items that have similar feature values into
classes or groups. A linear classifier achieves this goal by making a classification
decision based on the value of a linear combination of the features.
Definition of a classifier
Let S be a set of samples {S1, S2,…, Sn} that belong to m different classes
{c1, c2, … , cm}. Each sample has associated d features x = {x1, x2,…, xd}.
A classifier is a mapping from the feature space to the class labels {c1, c2,…,cm}.
Thus a classifier partitions the feature space into m decision regions. A line or a curve separating the classes is called a decision boundary. If there are more than two features, the decision boundary becomes a surface (e.g. a hyperplane).
We will refer to the two-category case, in which the samples in S belong to two
classes {c1, c2}.
Figure 2.1: Example of linear classifier on a two-class classification problem. Each sample is
characterized by two features.
Figure 2.1 depicts an example of a linear decision boundary for the two category case
in which each sample in S is characterized by two features.
2.1. General form of a linear classifier
A linear classifier computes a linear combination of the features, that is, of the components of the vector x. It can be written as:
g(x) = w^T x + w0 = ∑_{i=1}^{d} wi xi + w0
Where
• w is the weight vector
• w0 is the bias or the threshold weight
A two-category linear classifier (or binary classifier) implements the following
decision rule:
if g(x) > 0, decide that sample x belongs to class c1
if g(x) < 0, decide that sample x belongs to class c2
or, equivalently,
if w^T x > −w0, decide that sample x belongs to class c1
if w^T x < −w0, decide that sample x belongs to class c2
If g(x) = 0, x can ordinarily be assigned to either class.
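For illustration, a minimal Python sketch of this decision rule (the function name, weights and sample values below are our own, chosen only for the example):

def linear_classify(x, w, w0):
    """Two-category linear decision rule based on g(x) = w^T x + w0."""
    g = sum(wi * xi for wi, xi in zip(w, x)) + w0
    if g > 0:
        return "c1"
    if g < 0:
        return "c2"
    return None  # g(x) = 0: the sample may be assigned to either class

# Example: w = [1.0, -2.0], w0 = 0.5, sample x = [3.0, 1.0] gives g = 1.5 > 0
print(linear_classify([3.0, 1.0], [1.0, -2.0], 0.5))  # prints "c1"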
2.2. Perceptron Learning
Consider the problem of learning a linear classifier for a binary classification problem. Assume we have a dataset X = {x^(1), x^(2), … , x^(n)} containing examples from the two classes. Each sample x^(i) in the dataset is characterized by a feature vector {x1, x2, … , xd}.
A schematic view of the perceptron is given in the next figure.
[Figure: schematic view of the perceptron. The inputs x1, x2, …, xd, together with a constant input 1, are multiplied by the weights w1, w2, …, wd and w0, summed into the weighted sum f = w1x1 + w2x2 + … + wdxd + w0, and passed through a threshold function T(f) whose output is the class decision {c1, c2}.]
For convenience, we will absorb the intercept w0 by augmenting the feature vector x with an additional constant dimension:
w^T x + w0 = [w0 w^T] [1 x^T]^T = a^T y
where a = [w0, w1, …, wd]^T is the augmented weight vector and y = [1, x1, …, xd]^T is the augmented feature vector.
Our objective is to find a weight vector a such that:
g(x) = a^T y > 0, if x ∈ c1
g(x) = a^T y < 0, if x ∈ c2
To simplify the derivation, we will “normalize” the training set by replacing all examples from class c2 by their negative: y ← −y, ∀y ∈ c2.
This allows us to ignore the class labels and look for a weight vector a such that a^T y > 0 for every y.
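A minimal Python sketch of this augmentation and sign “normalization” step (the helper name build_normalized_set is ours), producing the vectors y on which the perceptron operates:

def build_normalized_set(class1_points, class2_points):
    """Augment each sample with a leading 1; negate the class-2 samples,
    so that a correct weight vector a satisfies a^T y > 0 for every y."""
    ys = []
    for x in class1_points:
        ys.append([1] + list(x))                # y = [1, x1, ..., xd]
    for x in class2_points:
        ys.append([-v for v in [1] + list(x)])  # y = -[1, x1, ..., xd]
    return ys

# e.g. build_normalized_set([(1, 6), (7, 2)], [(2, 1)]) returns
# [[1, 1, 6], [1, 7, 2], [-1, -2, -1]]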
To find the weight vector satisfying the above equation we define an objective
function J(a). The perceptron criterion function is:
Jp(a) = ∑_{y ∈ YM} (−a^T y)
Where:
• YM is the set of samples misclassified by a
• Note that Jp(a) is non-negative, since a^T y < 0 for every misclassified sample, so each term −a^T y is positive.
To find the minimum of this function, we use gradient descent. The gradient is defined by:
∇a Jp(a) = ∑_{y ∈ YM} (−y)
The gradient descent update rule is:
a(k + 1) = a(k) + η ∑_{y ∈ YM(k)} y
This is known as the perceptron batch update rule. Notice that the weight vector may also be updated in an “on-line” fashion, after the presentation of each individual example:
a(k + 1) = a(k) + η y^(i), where y^(i) is an example that has been misclassified by a(k)
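A minimal Python sketch of this single-sample (“on-line”) update (the function name is ours; we treat a^T y <= 0 as a misclassification, while the text above uses < 0):

def perceptron_online_step(a, y, eta):
    """One on-line perceptron step: if the normalized sample y is
    misclassified, move the weight vector a towards it."""
    if sum(ai * yi for ai, yi in zip(a, y)) <= 0:     # misclassified sample
        return [ai + eta * yi for ai, yi in zip(a, y)]
    return a                                          # correctly classified

# A misclassified sample pulls the weights towards itself:
print(perceptron_online_step([0.1, 0.1, 0.1], [-1, -2, -1], 0.1))  # [0.0, -0.1, 0.0]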
Algorithm: Batch Perceptron
Begin
• Initialize a, η, stopping criterion θ, k = 0
• “Normalize” the training set by replacing all examples from class c2 by their negative: y ← −y, ∀y ∈ c2
• Do
o k = k + 1
o a = a + η ∑_{y ∈ YM(k)} y
• Until ‖η ∑_{y ∈ YM(k)} y‖ < θ
• Return a
End
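A compact Python sketch of the batch algorithm above (the function name, the max_iter safety bound and the default parameter values are our own choices; the input ys is assumed to be already augmented and sign-normalized):

def batch_perceptron(ys, a0, eta=1.0, theta=1e-5, max_iter=10000):
    """Batch perceptron: add eta times the sum of the misclassified samples
    to a, and stop when the norm of that total update drops below theta."""
    a = list(a0)
    for _ in range(max_iter):
        update = [0.0] * len(a)
        for y in ys:                                   # collect misclassified samples
            if sum(ai * yi for ai, yi in zip(a, y)) <= 0:
                update = [u + yi for u, yi in zip(update, y)]
        update = [eta * u for u in update]             # eta * sum of misclassified y's
        a = [ai + ui for ai, ui in zip(a, update)]
        if sum(u * u for u in update) ** 0.5 < theta:  # ||eta * sum y|| < theta
            return a
    return a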
2.2.1. Numerical example
Compute the perceptron on the following dataset:
X1 = [(1,6), (7,2), (8,9), (9,9)]
X2 = [(2,1), (2,2), (2,4), (7,1)]
Perceptron learning algorithm:
• Assume η=0.1
• Assume a(0)=[0.1, 0.1, 0.1]
• Use an online update rule
• Normalize the dataset
1 1 6


2
1 7
1 8 9


1 9 9

Y=
 −1 −2 −1 


 −1 −2 −2 
 −1 −2 −4 


 −1 −7 −1 
• Iterate through all the examples and update a(k) on the ones that are
misclassified
1. Y(1) ⇒ [1 1 6]·[0.1 0.1 0.1]^T > 0 ⇒ no update
2. Y(2) ⇒ [1 7 2]·[0.1 0.1 0.1]^T > 0 ⇒ no update
3. …
4. …
5. Y(5) ⇒ [-1 -2 -1]·[0.1 0.1 0.1]^T < 0 ⇒ update: a(1) = [0.1 0.1 0.1] + η·[-1 -2 -1] = [0 -0.1 0]
6. Y(6) ⇒ [-1 -2 -2]·[0 -0.1 0]^T > 0 ⇒ no update
7. …
8. Y(1) ⇒ [1 1 6]·[0 -0.1 0]^T < 0 ⇒ update: a(2) = [0 -0.1 0] + η·[1 1 6] = [0.1 0 0.6]
9. Y(2) ⇒ [1 7 2]·[0.1 0 0.6]^T > 0 ⇒ no update
10. …
• Stop when ‖η ∑_{y ∈ YM(k)} y‖ < θ
• In this example, the perceptron rule converges after 175 iterations to a = [−3.5 0.3 0.7].
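For reference, a short Python sketch of the on-line rule applied to this dataset (the stopping test used here, a full pass with no updates, is our own choice, so the exact iteration count and final weights may differ slightly from the values quoted above):

# Augmented, sign-normalized samples from the example above.
Y = [[1, 1, 6], [1, 7, 2], [1, 8, 9], [1, 9, 9],
     [-1, -2, -1], [-1, -2, -2], [-1, -2, -4], [-1, -7, -1]]
a = [0.1, 0.1, 0.1]                                    # a(0)
eta = 0.1
for epoch in range(1000):                              # safety bound on the passes
    updated = False
    for y in Y:
        if sum(ai * yi for ai, yi in zip(a, y)) <= 0:  # misclassified
            a = [ai + eta * yi for ai, yi in zip(a, y)]
            updated = True
    if not updated:                                    # a full pass with no update
        break
print(a)  # after convergence, a^T y > 0 for every y in Y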
3. Two-class, two-feature perceptron classifier
In our lab session we will find a perceptron linear classifier that discriminates
between two sets of points. The points in class 1 are colored in red and the points in
class 2 are colored in blue.
The figure below describes the dataset.
Figure 3.1: Input points for the perceptron classifier.
Each point is described by its color (which denotes the class label) and by its two coordinates, x1 and x2.
The weight vector will have the form a = [w0, w1, w2].
The augmented feature vector is y = [1 x1 x2].
The algorithm becomes:
Algorithm: Batch Perceptron for two classes, two features
Begin
• Initialize
o a = [0.1 0.1 0.1]
o η = 1
o stop threshold θ = 0.00001
o k = 0
o stop_vector b = [0 0 0]
• “Normalize” the training set by replacing all examples from class c2 by their negative: y ← −y, ∀y ∈ c2, hence y = [−1 −x1 −x2]
• Do
o b = [0 0 0]
o k = k + 1
o a = a + η ∑_{y ∈ YM(k)} y
o b = b + η ∑_{y ∈ YM(k)} y
• Until ‖b‖ < θ
• Return a
End
The result of the classification is:
Figure 3.2: Result for the perceptron training
4. Exercises
Given a set of points belonging to two classes (red and blue):
Apply the batch perceptron algorithm to find the linear classifier that divides them
into two groups.
Consider η=1, initialize a(0) = [0.1, 0.1, 0.1] and θ=0.0001.
The palette of the input images is modified such that position 1 holds the color red and position 2 holds the color blue.
5. References
[1] Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2001.