9.913 Pattern Recognition for Vision
Classification
Bernd Heisele
Fall 2004
Overview
Introduction
Linear Discriminant Analysis
Support Vector Machines
Literature & Homework
Introduction
Classification
• Linear vs. non-linear separation
• Two-class vs. multi-class problems
Two approaches:
• Density estimation, then classification with the Bayes decision rule: Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA)
• Without density estimation: Support Vector Machines (SVM)
LDA—Bayes Rule

$p(x, \omega) = p(x \mid \omega)\,P(\omega) = P(\omega \mid x)\,p(x)$

Bayes rule:

$P(\omega \mid x) = \dfrac{p(x \mid \omega)\,P(\omega)}{p(x)}$

(posterior = likelihood × prior / evidence)

$x$: random vector
$\omega$: class
LDA—Bayes Decision Rule

Decide $\omega_1$ if $\dfrac{P(\omega_1 \mid x)}{P(\omega_2 \mid x)} > 1$; otherwise decide $\omega_2$.

Likelihood ratio:

$\dfrac{p(x \mid \omega_1)\,P(\omega_1)}{p(x \mid \omega_2)\,P(\omega_2)} > 1$

Log likelihood ratio:

$\ln\!\left(\dfrac{p(x \mid \omega_1)\,P(\omega_1)}{p(x \mid \omega_2)\,P(\omega_2)}\right) > 0
\quad\Longleftrightarrow\quad
\ln\!\left(\dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)}\right) + \ln\!\left(\dfrac{P(\omega_1)}{P(\omega_2)}\right) > 0$
LDA—Two Classes, Identical Covariance

Gaussian class-conditional density:

$p(x \mid \omega_i) = \dfrac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}}\; e^{-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i)}$

Assume identical covariance matrices $\Sigma_1 = \Sigma_2 = \Sigma$:

$\ln\dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)} + \ln\dfrac{P(\omega_1)}{P(\omega_2)}$
$= \tfrac{1}{2}(x-\mu_2)^T\Sigma^{-1}(x-\mu_2) - \tfrac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1) + \ln\dfrac{P(\omega_1)}{P(\omega_2)}$
$= x^T\underbrace{\Sigma^{-1}(\mu_1-\mu_2)}_{w} + \underbrace{\tfrac{1}{2}(\mu_1+\mu_2)^T\Sigma^{-1}(\mu_2-\mu_1) + \ln\dfrac{P(\omega_1)}{P(\omega_2)}}_{b}$
$= x^T w + b$

Linear decision function: decide $\omega_1$ if $x^T w + b > 0$.
LDA—Two Classes, Identical Covariance

[Figure: the data $X$ is rotated by $U^T$ and scaled by $D^{-1/2}$ into the whitened space $X'$; there LDA reduces to a nearest class-mean (nearest neighbor to $\mu_1'$, $\mu_2'$) classifier with decision boundary $f(x') = 0$.]

$f(x) = x^T \Sigma^{-1}(\mu_1-\mu_2) + \tfrac{1}{2}(\mu_1+\mu_2)^T \Sigma^{-1}(\mu_2-\mu_1) + \ln\big(P(\omega_1)/P(\omega_2)\big)$
$\phantom{f(x)} = x'^T(\mu_1'-\mu_2') + \tfrac{1}{2}(\mu_1'+\mu_2')^T(\mu_2'-\mu_1') + \ln\big(P(\omega_1)/P(\omega_2)\big)$

with
$\Sigma = U D U^T, \quad \Sigma^{-1} = U D^{-1} U^T = G^T G, \quad G = D^{-1/2} U^T,$
$x' = Gx, \quad x'^T = x^T G^T, \quad \mu_1' - \mu_2' = G(\mu_1 - \mu_2).$
LDA—Computation

Density estimation (class means and pooled covariance):

$\hat{\mu}_i = \dfrac{1}{N_i}\sum_{n=1}^{N_i} x_{i,n}$

$\hat{\Sigma} = \dfrac{1}{N_1+N_2}\sum_{i=1}^{2}\sum_{n=1}^{N_i} (x_{i,n}-\hat{\mu}_i)(x_{i,n}-\hat{\mu}_i)^T$

Decision function:

$f(x) = \operatorname{sign}(x^T w + b)$
$w = \hat{\Sigma}^{-1}(\hat{\mu}_1 - \hat{\mu}_2)$
$b = \tfrac{1}{2}(\hat{\mu}_1+\hat{\mu}_2)^T \hat{\Sigma}^{-1}(\hat{\mu}_2-\hat{\mu}_1) + \ln\dfrac{P(\omega_1)}{P(\omega_2)}$

Approximate the priors $P(\omega_i)$ by the class frequencies $N_i/(N_1+N_2)$.
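Not part of the original slides: a minimal NumPy sketch of this computation, assuming a non-singular pooled covariance; the function names (lda_train, lda_predict) are my own.

import numpy as np

def lda_train(X1, X2):
    """Two-class LDA with a shared (pooled) covariance estimate.
    X1, X2: arrays of shape (N1, d) and (N2, d) with the training samples of class 1 and class 2."""
    N1, N2 = len(X1), len(X2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled covariance: average outer product of the centered samples.
    Xc = np.vstack([X1 - mu1, X2 - mu2])
    Sigma = Xc.T @ Xc / (N1 + N2)
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    # Priors approximated by class frequencies, so ln(P1/P2) ~ ln(N1/N2).
    b = 0.5 * (mu1 + mu2) @ Sigma_inv @ (mu2 - mu1) + np.log(N1 / N2)
    return w, b

def lda_predict(X, w, b):
    # Decide class 1 where f(x) = x^T w + b > 0, class 2 otherwise.
    return np.sign(X @ w + b)

Calling w, b = lda_train(X1, X2) and then lda_predict(Xtest, w, b) returns +1 for class 1 and -1 for class 2.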
QDA—Two Classes, Different Covariance Matrices

Quadratic Discriminant Analysis: decide $\omega_1$ if $f(x) > 0$, with

$f(x) = \ln\big(p(x \mid \omega_1)\big) + \ln\big(P(\omega_1)\big) - \ln\big(p(x \mid \omega_2)\big) - \ln\big(P(\omega_2)\big)$

$\ln\big(p(x \mid \omega_i)\big) = -\tfrac{1}{2}\ln|\Sigma_i| - \tfrac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i) - \tfrac{d}{2}\ln(2\pi)$

$f(x) = x^T A x + w^T x + w_0$ (quadratic in $x$), where

$A = -\tfrac{1}{2}\big(\Sigma_1^{-1} - \Sigma_2^{-1}\big)$ (a matrix)
$w = \Sigma_1^{-1}\mu_1 - \Sigma_2^{-1}\mu_2$ (a vector)
$w_0 = \tfrac{1}{2}\mu_2^T\Sigma_2^{-1}\mu_2 - \tfrac{1}{2}\mu_1^T\Sigma_1^{-1}\mu_1 + \tfrac{1}{2}\ln\dfrac{|\Sigma_2|}{|\Sigma_1|} + \ln\dfrac{P(\omega_1)}{P(\omega_2)}$ (a scalar, the remaining constant terms)
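A small NumPy sketch (mine, not from the lecture) that builds the quadratic discriminant coefficients exactly as on the slide; equal priors are only a default assumption.

import numpy as np

def qda_params(mu1, Sigma1, mu2, Sigma2, p1=0.5, p2=0.5):
    """Coefficients of the quadratic discriminant f(x) = x^T A x + w^T x + w0."""
    S1inv, S2inv = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)
    A = -0.5 * (S1inv - S2inv)
    w = S1inv @ mu1 - S2inv @ mu2
    w0 = (0.5 * (mu2 @ S2inv @ mu2 - mu1 @ S1inv @ mu1)
          + 0.5 * (np.linalg.slogdet(Sigma2)[1] - np.linalg.slogdet(Sigma1)[1])
          + np.log(p1 / p2))
    return A, w, w0

def qda_decide(x, A, w, w0):
    # +1 means decide class 1, -1 means decide class 2.
    return np.sign(x @ A @ x + w @ x + w0)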
LDA—Multiclass, Identical Covariance

Find the linear decision boundaries for $k$ classes.

For two classes we have: $f(x) = x^T w + b$, decide $\omega_1$ if $f(x) > 0$.

In the multi-class case we have $k-1$ decision functions:
$f_{1,2}(x) = x^T w_{1,2} + b_{1,2}$
$f_{1,3}(x) = x^T w_{1,3} + b_{1,3}$
$\dots$
$f_{1,k}(x) = x^T w_{1,k} + b_{1,k}$

⇒ we have to determine $(k-1)(p+1)$ parameters, where $p$ is the dimension of $x$.
LDA—Multiclass, Identical Covariance

[Figure: three class means $\mu_1, \mu_2, \mu_3$ in the input space $X$ and their images $\mu_1', \mu_2', \mu_3'$ in the whitened space $X'$, with projection directions $w_1, w_2$ and projected coordinates $y_1(\mu_1), y_2(\mu_1)$.]

Find the $n$-dimensional subspace that gives the best linear discrimination between the $k$ classes:

$y = (w_1 \mid w_2 \mid \dots \mid w_n)^T x$

Also known as the Fisher Linear Discriminant.
LDA—Multiclass, Identical Covariance: Computation

• Compute the $d \times k$ matrix of class means $M = (\mu_1 \mid \mu_2 \mid \dots \mid \mu_k)$ and the covariance matrix $\Sigma$.
• Compute the whitened means $\mu'$: $M' = G M$, where $\Sigma^{-1} = G^T G$.
• Compute the covariance matrix $B'$ of the $\mu'$.
• Compute the eigenvectors $v_i'$ of $B'$, ranked by eigenvalue.
• Calculate $y$ by projecting $x$ into $X'$ and then onto the eigenvectors: $y_i = v_i'^T G x \;\Rightarrow\; w_i = G^T v_i'$
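A NumPy sketch of the steps listed above (my own code, not from the lecture), assuming integer class labels and a non-degenerate pooled covariance; the name fisher_lda_subspace is illustrative.

import numpy as np

def fisher_lda_subspace(X, y, n_components):
    """Multiclass LDA directions via whitening. X: (N, d) data, y: (N,) class labels."""
    classes = np.unique(y)
    mus = np.array([X[y == c].mean(axis=0) for c in classes])        # k x d class means
    Xc = np.vstack([X[y == c] - X[y == c].mean(axis=0) for c in classes])
    Sigma = Xc.T @ Xc / len(X)                                       # pooled covariance
    # Whitening transform G = D^{-1/2} U^T with Sigma = U D U^T.
    D, U = np.linalg.eigh(Sigma)
    G = np.diag(1.0 / np.sqrt(D)) @ U.T
    mus_w = mus @ G.T                                                # whitened class means
    B = np.cov(mus_w, rowvar=False, bias=True)                       # covariance of the whitened means
    evals, evecs = np.linalg.eigh(B)
    V = evecs[:, np.argsort(evals)[::-1][:n_components]]             # top eigenvectors v'
    W = G.T @ V                                                      # w_i = G^T v'_i
    return X @ W                                                     # projected coordinates y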
LDA—Fisher's Approach

Find $w$ such that the ratio of between-class to within-class variance is maximized when the data is projected onto $w$:

$y = w^T x$

$\max_w \dfrac{w^T B w}{w^T \Sigma w}$, with $B = M M^T$ the covariance of the class means $\mu$.

This can be written as:

$\max_w\; w^T B w \quad \text{subject to} \quad w^T \Sigma w = 1,$

a generalized eigenvalue problem; the solutions are the ranked eigenvectors of $\Sigma^{-1} B$.

...same as in the previous derivation.
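Equivalently, a short sketch of mine (not from the slides): the ranked directions can be obtained directly from the generalized eigenvalue problem with SciPy, assuming $\Sigma$ is positive definite.

import numpy as np
from scipy.linalg import eigh

def fisher_directions(B, Sigma, n_components):
    """Solve B v = lambda * Sigma v; B: between-class covariance, Sigma: within-class covariance."""
    evals, evecs = eigh(B, Sigma)                       # generalized symmetric eigenproblem
    order = np.argsort(evals)[::-1][:n_components]      # largest eigenvalues first
    return evecs[:, order]                              # columns are the ranked discriminant directions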
The Coffee Problem: LDA vs. PCA

[Image removed due to copyright considerations. See: R. Gutierrez-Osuna,
http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf]
LDA/QDA—Summary

Advantages:
• LDA is the Bayes classifier for multivariate Gaussian distributions with common covariance.
• LDA creates linear boundaries, which are simple to compute.
• LDA can be used for representing multi-class data in low dimensions.
• QDA is the Bayes classifier for multivariate Gaussian distributions.
• QDA creates quadratic boundaries.

Problems:
• LDA is based on a single prototype per class (the class center), which is often insufficient in practice.
Variants of LDA

Nonparametric LDA (Fukunaga): removes the unimodal assumption by computing the scatter matrix from local information. More than k-1 features can be extracted.

Orthonormal LDA (Okada & Tomita): computes projections that maximize separability and are pairwise orthonormal.

Generalized LDA (Lowe): incorporates a cost function, similar to Bayes risk minimization.

...and many more (see "The Elements of Statistical Learning", Hastie, Tibshirani, Friedman).
SVM—Linear, Separable (LS)

Find the separating function $f(x) = \operatorname{sign}(x^T w + b)$ which maximizes the margin $M = 2d$ on the training data.

⇒ Maximum margin classifier.
SVM—Primal (LS)

The training data consists of $N$ pairs $\{(x_1, y_1), \dots, (x_N, y_N)\}$, $y_i \in \{-1, 1\}$.

The problem of maximizing the margin $2d$ can be formulated as:

$\max_{w', b'}\; 2d \quad \text{subject to} \quad y_i(x_i^T w' + b') \geq d, \;\; \|w'\| = 1$

or alternatively, with $w = w'/d$ and $b = b'/d$:

$\min_{w, b}\; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(x_i^T w + b) \geq 1, \quad \text{where } d = \dfrac{1}{\|w\|}$

This is a convex optimization problem with a quadratic objective function and linear constraints.
SVM—Dual (LS)

Multiply the constraint equations by positive Lagrange multipliers and subtract them from the objective function:

$L_P = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \big[ y_i(x_i^T w + b) - 1 \big]$

Minimize $L_P$ with respect to $w$ and $b$, and maximize with respect to $\alpha_i$, subject to $\alpha_i \geq 0$.

Setting the derivatives $\partial L_P/\partial w$ and $\partial L_P/\partial b$ to zero and maximizing with respect to $\alpha_i$:

$w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0.$

Substituting into $L_P$ we get the so-called Wolfe dual:

$\max_\alpha\; L_D = \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2}\sum_{k=1}^{N}\sum_{i=1}^{N} \alpha_i \alpha_k y_i y_k\, x_i^T x_k$
$\text{subject to} \quad \alpha_i \geq 0, \quad \sum_{i=1}^{N} \alpha_i y_i = 0$

Solve for the $\alpha_i$, then compute $w = \sum_i \alpha_i y_i x_i$ and obtain $b$ from $\alpha_i \big[ y_i(x_i^T w + b) - 1 \big] = 0$.
SVM—Primal vs. Dual (LS)

Primal:
$\min_{w,b}\; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(x_i^T w + b) \geq 1$

Dual:
$\max_{\alpha}\; \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2}\sum_{k=1}^{N}\sum_{i=1}^{N} \alpha_i \alpha_k y_i y_k\, x_i^T x_k \quad \text{subject to} \quad \alpha_i \geq 0, \;\; \sum_{i=1}^{N} \alpha_i y_i = 0$

The primal has a dense inequality constraint for every point in the training set. The dual has a single dense equality constraint and a set of box constraints, which makes it easier to solve than the primal.
SVM—Optimality Conditions (LS)

Optimality conditions for linearly separable data:

$\sum_{i=1}^{N} \alpha_i y_i = 0, \qquad \alpha_i \geq 0 \;\;\forall i,$
$y_i(x_i^T w + b) - 1 \geq 0 \;\;\forall i$
$w = \sum_{i=1}^{N} \alpha_i y_i x_i \;\Rightarrow\; f(x) = \sum_{i=1}^{N} \alpha_i y_i\, x_i^T x + b$
$\alpha_i \big( y_i(x_i^T w + b) - 1 \big) = 0 \;\;\forall i \;\Rightarrow\; \alpha_i = 0$ for points which are not on the boundary of the margin.

Points with $\alpha_i > 0$ are the support vectors.
SVM—Linear, Non-Separable (LNS)

[Figure: margin of width $2d = M$ bounded by $f(x) = 1$ and $f(x) = -1$ around the decision boundary $f(x) = 0$; a sample $x_i$ inside the margin has slack $\xi_i > 0$.]

$\min_{w,b}\; \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad y_i(x_i^T w + b) \geq 1 - \xi_i \;\;\forall i, \quad \xi_i \geq 0 \;\;\forall i$

The $\xi_i$ are called slack variables; the constant $C$ penalizes errors.
SVM—Dual (LNS)

Same procedure as in the separable case:

$\max_\alpha\; L_D = \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2}\sum_{k=1}^{N}\sum_{i=1}^{N} \alpha_i \alpha_k y_i y_k\, x_i^T x_k$
$\text{subject to} \quad 0 \leq \alpha_i \leq C, \quad \sum_{i=1}^{N} \alpha_i y_i = 0$

Solve for the $\alpha_i$, then compute $w = \sum_i \alpha_i y_i x_i$ and obtain $b$ from $y_i(x_i^T w + b) - 1 = 0$ for any sample $x_i$ with $0 < \alpha_i < C$.
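Not from the lecture: a sketch of solving this dual with a generic quadratic programming solver. I assume the cvxopt package purely for illustration (any QP solver works), and the names and tolerances are my own.

import numpy as np
from cvxopt import matrix, solvers   # assumed dependency, used only as an example QP solver

def svm_dual_train(X, y, C=1.0):
    """Soft-margin linear SVM via the Wolfe dual. X: (N, d) samples, y: (N,) labels in {-1, +1}."""
    N = len(y)
    K = X @ X.T                                        # Gram matrix of inner products
    P = matrix(np.outer(y, y) * K)                     # Q_ik = y_i y_k x_i^T x_k
    q = matrix(-np.ones(N))                            # minimizing 1/2 a'Pa - 1'a maximizes L_D
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))     # box constraints 0 <= a_i <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A = matrix(y.reshape(1, -1).astype(float))         # equality constraint sum_i a_i y_i = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    w = (alpha * y) @ X                                # w = sum_i a_i y_i x_i
    # b from the unbounded support vectors (0 < a_i < C): y_i (x_i^T w + b) = 1.
    sv = (alpha > 1e-6) & (alpha < C - 1e-6)
    b0 = np.mean(y[sv] - X[sv] @ w)
    return w, b0, alpha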
SVM—Optimality Conditions (LNS)

[Figure: margin $2d = M$ with boundaries $f(x) = 1$, $f(x) = 0$, $f(x) = -1$ and a slack sample $x_i$.]

$f(x) = \sum_{i=1}^{N} \alpha_i y_i\, x_i^T x + b$

$\alpha_i = 0 \;\Rightarrow\; \xi_i = 0,\;\; y_i f(x_i) > 1$
$0 < \alpha_i < C \;\Rightarrow\; \xi_i = 0,\;\; y_i f(x_i) = 1$ (unbounded support vectors)
$\alpha_i = C \;\Rightarrow\; \xi_i \geq 0,\;\; y_i f(x_i) \leq 1$ (bounded support vectors)
SVM—Non-Linear (NL)

Non-linear mapping into a feature space: $x' = \Phi(x)$

[Figure: data that is not linearly separable in the input space $(x_1, x_2)$ becomes linearly separable in the feature space $(x_1', x_2')$.]

Project the data into the feature space, then apply the SVM procedure there.
SVM—Kernel Trick

$\max_\alpha\; \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2}\sum_{k=1}^{N}\sum_{i=1}^{N} \alpha_i \alpha_k y_i y_k\, x_i'^T x_k' \quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \;\; \sum_{i=1}^{N} \alpha_i y_i = 0$

Only the inner products of the samples appear in the objective function. If we can write $K(x_i, x_k) = x_i'^T x_k'$, we can avoid any computations in the feature space.

Using $w = \sum_i \alpha_i y_i x_i'$, the solution $f(x') = x'^T w + b$ can be written as:

$f(x) = \sum_{i=1}^{N} \alpha_i y_i\, \Phi(x)^T \Phi(x_i) + b = \sum_{i=1}^{N} \alpha_i y_i\, K(x, x_i) + b$
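As a small illustration (mine, not from the slides), the kernelized decision function can be evaluated without ever forming $\Phi(x)$; the Gaussian kernel and its gamma parameter are only an example choice.

import numpy as np

def gaussian_kernel(u, v, gamma=1.0):
    # K(u, v) = exp(-gamma * ||u - v||^2); gamma is an illustrative parameter.
    return np.exp(-gamma * np.sum((u - v) ** 2, axis=-1))

def kernel_decision(x, alpha, y, X_train, b, kernel=gaussian_kernel):
    """f(x) = sum_i alpha_i y_i K(x, x_i) + b, evaluated purely in the input space."""
    return np.sum(alpha * y * kernel(x, X_train)) + b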
SVM—Kernels

When is a kernel $K(u, v)$ an inner product in a Hilbert space?

$K(u, v) = \sum_n \lambda_n \phi_n(u)\phi_n(v)$ with positive coefficients $\lambda_n$,

if for any $g(u) \in L_2$:

$\int K(u, v)\, g(u)\, g(v)\, du\, dv \geq 0$ (Mercer's condition)

Some examples of commonly used kernels:

Linear kernel: $u^T v$
Polynomial kernel: $(1 + u^T v)^d$
Gaussian kernel (RBF): $\exp(-\|u - v\|^2)$, shift invariant
MLP: $\tanh(u^T v - \theta)$
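A finite-sample sanity check of Mercer's condition (my own sketch, not from the lecture): on any finite point set, the Gram matrix of a valid kernel must be numerically positive semidefinite.

import numpy as np

def gram_matrix(X, kernel):
    N = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

X = np.random.randn(50, 3)
K_poly = gram_matrix(X, lambda u, v: (1 + u @ v) ** 2)
# Smallest eigenvalue should be non-negative up to numerical error.
print(np.linalg.eigvalsh(K_poly).min() >= -1e-8)   # expected: True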
SVM—Polynomial Kernel

Polynomial second-degree kernel, $u, v \in \mathbb{R}^2$:

$K(u, v) = (1 + u^T v)^2$
$= 1 + (u_1 v_1)^2 + (u_2 v_2)^2 + 2 u_1 v_1 u_2 v_2 + 2 u_1 v_1 + 2 u_2 v_2$
$= (1,\, u_1^2,\, u_2^2,\, \sqrt{2}\,u_1 u_2,\, \sqrt{2}\,u_1,\, \sqrt{2}\,u_2)\,(1,\, v_1^2,\, v_2^2,\, \sqrt{2}\,v_1 v_2,\, \sqrt{2}\,v_1,\, \sqrt{2}\,v_2)^T$

$\Phi(x) = (1,\, x_1^2,\, x_2^2,\, \sqrt{2}\,x_1 x_2,\, \sqrt{2}\,x_1,\, \sqrt{2}\,x_2)^T$

A shift-invariant kernel $K(u, v) = K(u - v)$ defined on $L_2([0, T]^d)$ can be written via the Fourier series of $K$:

$f(t) = \dfrac{1}{T}\sum_{k=-\infty}^{\infty} \lambda_k\, e^{j 2\pi k t / T}, \qquad f(t - t_0) = \dfrac{1}{T}\sum_{k=-\infty}^{\infty} \lambda_k\, e^{j 2\pi k t / T}\, e^{-j 2\pi k t_0 / T}$

$K(u - v) = \sum_{k} \lambda_k\, e^{j 2\pi \kappa_k u}\, e^{-j 2\pi \kappa_k v}, \quad \kappa_k \in \mathbb{Z}^d$
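A quick numeric check of the expansion above (mine, not part of the slides): the explicit feature map $\Phi$ reproduces the second-degree polynomial kernel.

import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1, x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2])

u, v = np.random.randn(2), np.random.randn(2)
# (1 + u^T v)^2 should equal Phi(u)^T Phi(v) up to floating-point error.
print(np.isclose((1 + u @ v) ** 2, phi(u) @ phi(v)))   # expected: True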
SVM—Uniqueness

Are the $\alpha_i$ in the solution unique? No.

[Figure: four training points $x_1, \dots, x_4$ on the margin, two with label $+1$ and two with label $-1$, separated by $w = (1, 0)$.]

$f(x) = \sum_{i=1}^{N} \alpha_i y_i\, x_i^T x + b$

Two solutions:
$\alpha = (0.25,\, 0.25,\, 0.25,\, 0.25)$
$\alpha = (0.5,\, 0.5,\, 0,\, 0)$

Both satisfy the constraints $\alpha_i \geq 0$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$.

More important: is the solution $f(x)$ unique?
Yes, the solution is unique and global if the objective function is strictly convex.
SVM—Multiclass

Bottom-up 1 vs. 1:
[Figure: a binary tournament of pairwise classifiers, e.g. A vs. B and C vs. D, whose winners meet in a final comparison that outputs A, B, C, or D.]
Training: $k(k-1)/2$ classifiers
Classification: $k-1$ evaluations

1 vs. All:
[Figure: one classifier per class: A vs. {B, C, D}, B vs. {A, C, D}, C vs. {A, B, D}, D vs. {A, B, C}.]
Training: $k$ classifiers
Classification: $k$ evaluations
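A schematic one-vs-all wrapper (my own sketch, not from the lecture); train_binary_svm and decision stand for any binary SVM trainer and decision function, for example the dual solver sketched earlier.

import numpy as np

def one_vs_all_train(X, y, train_binary_svm):
    """Train one binary classifier per class, with that class labeled +1 and all others -1."""
    return {c: train_binary_svm(X, np.where(y == c, 1, -1)) for c in np.unique(y)}

def one_vs_all_predict(x, classifiers, decision):
    # Pick the class whose classifier returns the largest decision value f_c(x).
    return max(classifiers, key=lambda c: decision(x, classifiers[c]))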
SVM—Choosing the Kernel

How to choose the kernel?

Linear SVMs are simple to compute and fast at runtime, but often not sufficient for complex tasks.

SVMs with Gaussian kernels have shown excellent performance in many applications (after some tuning of sigma), but are slow at runtime.

Polynomial kernels of 2nd degree are commonly used in computer vision applications: a good trade-off between classification performance and computational complexity.
SVM—Example

Face detection with linear and 2nd-degree polynomial SVMs and LDA.
SVM—Choosing C

How to choose the C-value?

$\min_{w,b}\; \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{N} \xi_i$

The C-value penalizes points within the margin. A large C-value can lead to poor generalization performance (over-fitting).

From my own experience in object detection tasks: find a kernel and C-value which give you zero errors on the training set.
SVM—Computation during Classification

In computer vision applications, fast classification is usually more important than fast training.

Two ways of computing the decision function $f(x)$:

(a) $w^T \Phi(x) + b$
(b) $\sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b$

Which one is faster?

• For a linear kernel: (a).
• For a polynomial 2nd-degree kernel:
  multiplications for (a): $G_{F,\mathrm{poly2}} = (n + 2)\,n$, where $n$ is the dimension of $x$;
  multiplications for (b): $G_{K,\mathrm{poly2}} = (n + 2)\,s$, where $s$ is the number of support vectors.
• For a Gaussian kernel: only (b), since the dimension of $\Phi(x)$ is infinite.
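The two forms give identical values, so the choice is purely about cost. A small check of this (mine, not from the slides), using the explicit second-degree polynomial feature map from earlier and made-up support vectors and coefficients.

import numpy as np

def phi2(x):
    # Explicit feature map for (1 + u^T v)^2 with 2-dimensional inputs.
    x1, x2 = x
    return np.array([1, x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2])

rng = np.random.default_rng(0)
X_sv = rng.normal(size=(5, 2))        # pretend support vectors
alpha_y = rng.normal(size=5)          # pretend alpha_i * y_i values
b = 0.1
x = rng.normal(size=2)

w = alpha_y @ np.array([phi2(xi) for xi in X_sv])       # (a): precomputed w in feature space
f_a = w @ phi2(x) + b
f_b = np.sum(alpha_y * (1 + X_sv @ x) ** 2) + b         # (b): s kernel evaluations
print(np.isclose(f_a, f_b))   # expected: True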
Learning Theory—Problem Formulation

From a given set of training examples $\{x_i, y_i\}$, learn the mapping $x \to y$. The learning machine is defined by a set of possible mappings $x \to f(x, \alpha)$, where $\alpha$ are the adjustable parameters of $f$.

The goal is to minimize the expected risk $R$:

$R(\alpha) = \int V(f(x, \alpha), y)\, dP(x, y)$

where $V$ is the loss function and $P$ is the probability distribution function.

We can't compute $R(\alpha)$ since we don't know $P(x, y)$.
Learning Theory—Empirical Risk Minimization

To solve the problem, minimize the "empirical risk" $R_{emp}$ over the training set:

$R_{emp}(\alpha) = \dfrac{1}{N}\sum_{i=1}^{N} V(f(x_i, \alpha), y_i)$

where $V$ is the loss function. Common loss functions:

$V(f(x), y) = (y - f(x))^2$ (least squares)
$V(f(x), y) = (1 - y f(x))_+$ (hinge loss), where $(x)_+ \equiv \max(x, 0)$

[Figure: both losses plotted against $y f(x)$; the hinge loss is zero for $y f(x) \geq 1$.]
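For reference, the two loss functions written as code (a trivial sketch of mine, not from the slides), vectorized over predictions f and labels y.

import numpy as np

def squared_loss(f, y):
    return (y - f) ** 2

def hinge_loss(f, y):
    return np.maximum(0.0, 1.0 - y * f)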
Learning Theory & SVM

Bound on the expected risk: for a loss function with $0 \leq V(f(x), y) \leq 1$, with probability $1 - \eta$, $0 \leq \eta \leq 1$, the following bound holds:

$R(\alpha) \leq R_{emp}(\alpha) + \sqrt{\dfrac{h\big(\ln(2N/h) + 1\big) - \ln(\eta/4)}{N}}$

$R_{emp}(\alpha)$: empirical risk
$N$: number of training examples
$h$: Vapnik-Chervonenkis (VC) dimension
The bound is independent of the probability distribution $P(x, y)$.

Keep all parameters in the bound fixed except one:
confidence $(1-\eta)$ ↑ ⇒ bound ↑, $N$ ↑ ⇒ bound ↓, $h$ ↑ ⇒ bound ↑
Learning Theory—VC Dimension

The VC dimension is a property of the set of functions $\{f(\alpha)\}$.

If for a set of $N$ points, labeled in all $2^N$ possible ways, one can find an $f \in \{f(\alpha)\}$ which separates the points correctly, one says that the set of points is shattered by $\{f(\alpha)\}$.

The VC dimension is the maximum number of points that can be shattered by $\{f(\alpha)\}$.

The VC dimension of the linear functions $f: w^T x + b = 0$ in 2 dimensions is 3: three points in general position can be shattered, four cannot (in general, $d+1$ for hyperplanes in $\mathbb{R}^d$).
Learning Theory—SVM

[Figure: training data enclosed in a sphere of diameter $D$, separated with margin $M$.]

The expected risk $E(R)$ for the optimal hyperplanes satisfies:

$E(R) \leq \dfrac{E(D^2 / M^2)}{N}$

where the expectation is over all training sets of size $N$.

"Algorithms that maximize the margin have better generalization performance."
Bounds

Most bounds on the expected risk are very loose. Compute instead:

Cross-validation error: the error on a cross-validation set which is different from the training set.

Leave-one-out error: leave one training example out of the training set, train the classifier, and test on the example which was left out. Do this for all examples. For SVMs, the leave-one-out error is upper bounded by the number of support vectors.
Regularization Theory

Given $N$ examples $(x_i, y_i)$, $x \in \mathbb{R}^n$, $y \in \{0, 1\}$, solve:

$\min_{f \in H}\; \dfrac{1}{N}\sum_{i=1}^{N} V(f(x_i), y_i) + \gamma\,\|f\|_K^2$

where $\|f\|_K^2$ is the norm in a Reproducing Kernel Hilbert Space (RKHS) $H$ with reproducing kernel $K$, and $\gamma$ is the regularization parameter. The term $\gamma\,\|f\|_K^2$ can be interpreted as a smoothness constraint.

Under rather general conditions the solution can be written as a smooth function:

$f(x) = \sum_{i=1}^{N} c_i K(x, x_i)$
Regularization—Reproducing Kernel Hilbert Space (RKHS)

Reproducing property in the RKHS $H$:

$f(x) = \langle K(x, y), f(y) \rangle_H$

Take positive numbers $\lambda_n$ and an orthonormal set of functions $\phi_n(x)$, $\int \phi_n(x)\phi_m(x)\,dx = 0$ for $n \neq m$, and $1$ otherwise:

$K(x, y) \equiv \sum_n \lambda_n \phi_n(x)\phi_n(y)$, where the $\lambda_n$ are nonnegative eigenvalues of $K$.

For $f(x) = \sum_n a_n \phi_n(x)$ with $a_n = \int f(x)\phi_n(x)\,dx$, and similarly $g(x) = \sum_n b_n \phi_n(x)$:

$\langle f, g \rangle_H \equiv \sum_n \dfrac{a_n b_n}{\lambda_n}, \qquad \|f\|_H^2 = \langle f, f \rangle_H = \sum_n \dfrac{a_n^2}{\lambda_n}$
Regularization—Simple Example of an RKHS

The kernel is a one-dimensional Gaussian with $\sigma = 1$:

$K(x, y) = \exp(-(x - y)^2), \quad x, y \in [0, 1]$

Write $K(x, y)$ as a Fourier expansion using the shift theorem (period $T = 1$):

$K(x, y) = \sum_n \lambda_n \exp(j 2\pi n x)\exp(-j 2\pi n y)$

where the $\lambda_n$ are the Fourier coefficients of $\exp(-x^2)$:

$\lambda_n = A \exp(-n^2 / 2)$

The $\lambda_n$ decrease with higher frequencies (increasing $n$); this is a property of most kernels. The regularization term

$\|f\|_H^2 = \sum_n a_n^2 / \lambda_n$, where the $a_n$ are the Fourier coefficients of $f(x)$,

penalizes high frequencies more than low frequencies ⇒ smoothness!
Regularization—SVM

For the hinge loss $V(f(x), y) = (1 - y f(x))_+$ it can be shown that the regularization problem is equivalent to the SVM problem:

$\min_{f \in H}\; \dfrac{1}{N}\sum_{i=1}^{N} (1 - y_i f(x_i))_+ + \lambda\,\|f\|_K^2$

Introducing slack variables $\xi_i = (1 - y_i f(x_i))_+$ we can rewrite:

$\min_{f \in H}\; \dfrac{1}{N}\sum_{i=1}^{N} \xi_i + \lambda\,\|f\|_K^2, \quad \text{subject to} \quad y_i f(x_i) \geq 1 - \xi_i \;\text{and}\; \xi_i \geq 0 \;\;\forall i$

It can be shown that this is equivalent to the SVM problem (up to $b$):

SVM: $\min_{w,b}\; \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad y_i(x_i^T w + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0 \;\;\forall i$

with $C = 1/(2\lambda N)$.
SVM—Summary

• SVMs are maximum margin classifiers.
• Only training points close to the boundary (the support vectors) appear in the SVM solution.
• The SVM problem is convex; the solution is global and unique.
• SVMs can handle non-separable data.
• Non-linear separation in the input space is possible by projecting the data into a feature space.
• All calculations can be done in the input space (kernel trick).
• SVMs are known to perform well in high-dimensional problems with few examples.
• Depending on the kernel, SVMs can be slow during classification.
• SVMs are binary classifiers; they are not efficient for problems with a large number of classes.
Literature

T. Hastie, R. Tibshirani, J. Friedman: The Elements of Statistical Learning, Springer, 2001. (LDA, QDA, extensions to LDA, SVM & regularization.)

C. Burges: A Tutorial on Support Vector Machines for Pattern Recognition, 1999. (Learning theory, SVM.)

R. Rifkin: Everything Old Is New Again: A Fresh Look at Historical Approaches in Machine Learning, 2002. (SVM training, SVM multiclass.)

T. Evgeniou, M. Pontil, T. Poggio: Regularization Networks and Support Vector Machines, 1999. (SVM & regularization.)

V. Vapnik: The Nature of Statistical Learning Theory, 1995. (Statistical learning theory, SVM.)
Homework

Classification problem on the NIST handwritten digits data, involving PCA, LDA, and SVMs.

The PCA code will be posted today.