Hyperplanes as classifiers

Linear hyperplanes as classifiers
Usman Roshan
Hyperplane separators
[Figure: a separating hyperplane with normal vector w, a point x, its projection x_p onto the plane, and its distance r to the plane.]
Nearest mean as hyperplane separator
[Figure: class means m1 and m2; the nearest-mean decision boundary is the hyperplane through the midpoint m1 + (m2-m1)/2, perpendicular to the line joining m1 and m2.]
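The figure itself is lost in this transcript, so here is a minimal numpy sketch (my addition, not from the slides) of the nearest-mean rule written as a hyperplane: the normal is w = m2 - m1 and the plane passes through the midpoint m1 + (m2-m1)/2.

```python
import numpy as np

def nearest_mean_hyperplane(X1, X2):
    """Nearest-mean classifier as a hyperplane w^T x + w0 = 0.

    The normal points from class 1's mean m1 toward class 2's mean m2,
    and the plane passes through the midpoint m1 + (m2 - m1)/2.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    w = m2 - m1
    midpoint = m1 + (m2 - m1) / 2.0
    w0 = -w.dot(midpoint)
    return w, w0

# Toy usage: a positive score means "closer to m2".
X1 = np.array([[0.0, 0.0], [1.0, 0.0]])
X2 = np.array([[4.0, 4.0], [5.0, 4.0]])
w, w0 = nearest_mean_hyperplane(X1, X2)
print(np.sign(w.dot([0.5, 0.5]) + w0), np.sign(w.dot([4.0, 3.0]) + w0))
```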
Separating hyperplanes
Perceptron
Gradient descent
Perceptron training
Perceptron training by gradient descent
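The perceptron-training slides carry no equations in this transcript; the sketch below is my reconstruction of the standard procedure, not the slides' own code: take a gradient step w <- w + eta*y_i*x_i on each misclassified point.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, epochs=100):
    """Perceptron training: for each misclassified point (y_i*(w^T x_i + w0) <= 0)
    take the step w <- w + eta*y_i*x_i, w0 <- w0 + eta*y_i.
    Labels y are assumed to be +1/-1 and the data linearly separable."""
    w = np.zeros(X.shape[1])
    w0 = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w.dot(xi) + w0) <= 0:   # misclassified (or on the plane)
                w += eta * yi * xi
                w0 += eta * yi
                errors += 1
        if errors == 0:                      # converged: every point correct
            break
    return w, w0

# Toy separable data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, w0 = train_perceptron(X, y)
print(np.sign(X.dot(w) + w0))   # should match y
```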
Obtaining probability from hyperplane distances
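This slide gives only the title. A common choice (my illustration, not necessarily the lecture's exact method) is to pass the signed distance through a sigmoid, as logistic regression and Platt scaling do.

```python
import numpy as np

def prob_from_distance(w, w0, x, a=1.0, b=0.0):
    """Map the signed distance of x to the plane through a sigmoid to get
    an estimate of P(y = +1 | x). The scale a and offset b would normally be
    fit on held-out data (Platt scaling); a=1, b=0 is just a placeholder."""
    d = (w.dot(x) + w0) / np.linalg.norm(w)   # signed distance to the plane
    return 1.0 / (1.0 + np.exp(-(a * d + b)))

w, w0 = np.array([1.0, 1.0]), -1.0
print(prob_from_distance(w, w0, np.array([2.0, 2.0])))    # > 0.5, positive side
print(prob_from_distance(w, w0, np.array([-1.0, -1.0])))  # < 0.5, negative side
```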
Multilayer perceptrons
• Many perceptrons with hidden
layer
• Can solve XOR and model
non-linear functions
• Leads to non-convex
optimization problem solved by
back propagation
Back propagation
• Illustration of back propagation
– http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html
• Many local minima
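As a companion to the two slides above, here is a small numpy sketch (mine, not the linked illustration) of a one-hidden-layer network trained by back propagation to model XOR; because of the local minima mentioned above, the result depends on the random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR: not linearly separable, so a single perceptron fails but an MLP works.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of 4 sigmoid units, one sigmoid output unit.
W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)
eta = 2.0

for epoch in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)          # hidden activations
    yhat = sigmoid(h @ W2 + b2)       # network output

    # Backward pass (squared-error loss, sigmoid derivatives)
    delta2 = (yhat - t) * yhat * (1 - yhat)     # output-layer error
    delta1 = (delta2 @ W2.T) * h * (1 - h)      # hidden-layer error

    # Gradient-descent updates
    W2 -= eta * h.T @ delta2;  b2 -= eta * delta2.sum(axis=0)
    W1 -= eta * X.T @ delta1;  b1 -= eta * delta1.sum(axis=0)

# Close to [0, 1, 1, 0] for most initializations; a bad seed can get stuck.
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))
```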
Training issues for multilayer perceptrons
• Convergence rate
– Momentum
• Adaptive learning
• Overtraining
– Early stopping
Separating hyperplanes
• For two sets of points there are many hyperplane separators
• Which one should we choose for classification?
• In other words, which one is most likely to produce the least error?
[Figure: two classes of points with several candidate separating hyperplanes.]
Separating hyperplanes
• Best hyperplane is the one that maximizes the minimum distance of all training points to the plane (Learning with kernels, Scholkopf and Smola, 2002)
• Its expected error is at most the fraction of misclassified points plus a complexity term (Learning with kernels, Scholkopf and Smola, 2002)
Margin of a plane
• We define the margin as the minimum distance to training points (distance to closest point)
• The optimally separating plane is the one with the maximum margin
Optimally separating hyperplane
[Figure: the maximum-margin separating hyperplane with normal vector w.]
Optimally separating hyperplane
• How do we find the optimally separating hyperplane?
• Recall distance of a point to the plane defined earlier
Hyperplane separators
[Figure repeated from earlier: point x, its projection x_p, distance r, and normal vector w.]
Distance of a point to the separating plane
• And so the distance r to the plane is given by

r = (w^T x + w_0) / ||w||

or

r = y (w^T x + w_0) / ||w||

where y is -1 if the point is on the left side of the plane and +1 otherwise.
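A tiny numpy check of the formula above (my addition, not from the slides):

```python
import numpy as np

def signed_distance(w, w0, x, y=None):
    """Distance of x to the plane w^T x + w0 = 0.
    Without a label the result is signed; with y in {-1, +1} it is positive
    exactly when x lies on its own side of the plane."""
    r = (w.dot(x) + w0) / np.linalg.norm(w)
    return r if y is None else y * r

w, w0 = np.array([3.0, 4.0]), -5.0                          # ||w|| = 5
print(signed_distance(w, w0, np.array([3.0, 4.0])))         # (9 + 16 - 5)/5 = 4.0
print(signed_distance(w, w0, np.array([0.0, 0.0]), y=-1))   # -(-5)/5 = 1.0
```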
Support vector machine: optimally separating hyperplane
The distance of point x (with label y) to the hyperplane is given by

y (w^T x + w_0) / ||w||

We want this to be at least some value r:

y (w^T x + w_0) / ||w|| ≥ r

By scaling w we can obtain infinitely many solutions. Therefore we require that

r ||w|| = 1

So we minimize ||w|| to maximize the distance, which gives us the SVM optimization problem.
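Spelling out the scaling step (my wording of the standard argument, not an extra slide):

```latex
% With the canonical scaling r\,\|w\| = 1, i.e. r = 1/\|w\|, the constraint
% \frac{y_i(w^T x_i + w_0)}{\|w\|} \ge r  becomes  y_i(w^T x_i + w_0) \ge 1,
% and the margin to be maximized is r = 1/\|w\|.  Maximizing 1/\|w\| is the
% same as minimizing \tfrac{1}{2}\|w\|^2, which gives the problem on the
% next slide.
\max_{w,\,w_0} \; \frac{1}{\|w\|}
\quad\Longleftrightarrow\quad
\min_{w,\,w_0} \; \tfrac{1}{2}\|w\|^{2}
\quad \text{s.t.} \quad y_i\,(w^{T} x_i + w_0) \ge 1 \ \ \forall i
```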
Support vector machine: optimally separating hyperplane
SVM optimization criterion:

min_w (1/2) ||w||^2  subject to  y_i (w^T x_i + w_0) ≥ 1, for all i

We can solve this with Lagrange multipliers. That tells us that

w = Σ_i α_i y_i x_i

The x_i for which α_i is non-zero are called support vectors.
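Assuming the multipliers α_i have already been obtained from some QP solver (not shown here), a short numpy sketch of recovering w and the support vectors from them:

```python
import numpy as np

def recover_w_and_svs(alpha, y, X, tol=1e-8):
    """Given dual multipliers alpha (from some QP solver, not computed here),
    labels y in {-1, +1} and data rows X, build w = sum_i alpha_i y_i x_i and
    report which points are support vectors (alpha_i > 0)."""
    w = (alpha * y) @ X                    # sum_i alpha_i y_i x_i
    support = np.where(alpha > tol)[0]     # indices with non-zero alpha_i
    return w, support

# Hypothetical values, just to exercise the formula.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25])
w, support = recover_w_and_svs(alpha, y, X)
print(w, support)   # w = [0.5, 0.5]; support vectors are points 0 and 2
```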
Support vector machine: optimally separating hyperplane
[Figure: the maximum-margin hyperplane with normal w; the closest points of each class lie at distance 1/||w|| from the plane, so the total margin width is 2/||w||.]
Inseparable case
• What if there is no separating hyperplane? For example the XOR function.
• One solution: consider all hyperplanes and select the one with the minimal number of misclassified points
• Unfortunately this is NP-complete (see paper by Ben-David, Eiron, and Long on the course website)
• Even NP-complete to polynomially approximate (Learning with kernels, Scholkopf and Smola, and paper on website)
Inseparable case
• But if we measure error as the sum of the distances of misclassified points to the plane, then we can solve for a support vector machine in polynomial time
• Roughly speaking, the margin error bound theorem applies (Theorem 7.3, Scholkopf and Smola)
• Note that the total distance error can be considerably larger than the number of misclassified points
Optimally separating hyperplane with errors
[Figure: a separating hyperplane with normal w; some points lie on the wrong side of the plane or inside the margin.]
Support vector machine: optimally separating hyperplane
In practice we allow for error terms in case there is no separating hyperplane:

min_{w, w_0, ξ_i} (1/2) ||w||^2 + C Σ_i ξ_i  subject to  y_i (w^T x_i + w_0) ≥ 1 - ξ_i (and ξ_i ≥ 0), for all i
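The slide states the constrained (slack-variable) form; the sketch below is my illustration of how the error terms enter, not the QP solver an SVM package would use: it minimizes the equivalent unconstrained hinge-loss objective (1/2)||w||^2 + C Σ_i max(0, 1 - y_i(w^T x_i + w_0)) by subgradient descent.

```python
import numpy as np

def soft_margin_svm(X, y, C=1.0, eta=0.01, epochs=2000):
    """Subgradient descent on (1/2)||w||^2 + C*sum_i max(0, 1 - y_i(w^T x_i + w0)),
    where the hinge term equals the optimal slack xi_i of the constrained form."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + w0)
        viol = margins < 1                        # points with non-zero slack
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_w0 = -C * y[viol].sum()
        w -= eta * grad_w
        w0 -= eta * grad_w0
    return w, w0

# Toy data with one overlapping point; C trades margin width against slack.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0], [0.5, 0.5]])
y = np.array([1, 1, -1, -1, -1])
w, w0 = soft_margin_svm(X, y, C=1.0)
print(np.sign(X @ w + w0))   # the overlapping point may stay misclassified
```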
SVM software
• Plenty of SVM software out there. Two popular packages:
– SVM-light
– LIBSVM
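Neither package's command line appears in the slides; as a hedged example, scikit-learn (not mentioned above, but its SVC class is built on LIBSVM) exposes the same model:

```python
import numpy as np
from sklearn.svm import SVC   # scikit-learn's SVC wraps LIBSVM

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1.0)   # C is the error-term weight from the previous slide
clf.fit(X, y)
print(clf.support_)                 # indices of the support vectors
print(clf.predict([[1.0, 1.0], [-1.0, -1.0]]))
```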
Kernels
• What if no separating hyperplane exists?
• Consider the XOR function.
• In a higher dimensional space we can find a separating hyperplane
• Example with SVM-light
Kernels
• The solution to the SVM is obtained by applying the KKT conditions (a generalization of Lagrange multipliers). The problem to solve becomes

L_d = Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j

subject to Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C
Kernels
• The previous problem can in turn be solved again with the KKT conditions.
• The dot product can be replaced by a kernel matrix K(i,j) = x_i^T x_j or a positive definite matrix K.

L_d = Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)

subject to Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C
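A small numpy sketch (my addition) of the kernelized dual objective: it builds an RBF kernel matrix and evaluates L_d for a given α, without performing the QP optimization itself.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2), a positive definite kernel."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def dual_objective(alpha, y, K):
    """L_d = sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j K(x_i, x_j)."""
    return alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y)

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # XOR layout
y = np.array([-1.0, 1.0, 1.0, -1.0])
K = rbf_kernel(X, gamma=1.0)
alpha = np.full(4, 0.5)   # arbitrary illustrative values, not the optimum
print(dual_objective(alpha, y, K), (alpha * y).sum())  # objective, and sum_i alpha_i y_i = 0 check
```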
Kernels
• With the kernel approach we can avoid explicit calculation of features in high dimensions
• How do we find the best kernel?
• Multiple Kernel Learning (MKL) solves this by learning K as a linear combination of base kernels.
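MKL itself optimizes the combination weights; the sketch below (my addition) only shows the representation it assumes, K = Σ_k μ_k K_k with μ_k ≥ 0, with the weights fixed by hand rather than learned.

```python
import numpy as np

def combined_kernel(kernels, mu):
    """K = sum_k mu_k * K_k : a linear combination of base kernel matrices.
    With mu_k >= 0 and each K_k positive semidefinite, K is positive semidefinite
    and can be plugged into the kernelized dual from the previous slides."""
    mu = np.asarray(mu, dtype=float)
    assert np.all(mu >= 0), "MKL requires non-negative combination weights"
    return sum(m * K for m, K in zip(mu, kernels))

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
K_linear = X @ X.T
K_rbf = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
K = combined_kernel([K_linear, K_rbf], mu=[0.3, 0.7])   # weights chosen by hand here
print(K.round(2))
```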