
CpE646-5v4

CpE 646 Pattern Recognition and Classification
Prof. Hong Man
Department of Electrical and Computer Engineering
Stevens Institute of Technology
Bayesian Decision Theory
Chapter 3 (Sections 3.7, 3.8)
Outline:
• Problems of dimensionality
• Component analysis and discriminants
Problems of Dimensionality
• In practical multi-category cases, it is common to see problems involving hundreds of features, especially when features are binary valued.
• We assume every single feature is useful for some discrimination, but we do not know (and usually doubt) whether they all provide independent information.
• Two concerns:
– Classification accuracy depends upon the dimensionality and the amount of training data.
– Computational complexity of designing a classifier.
Accuracy, Dimension and Training Sample Size
• If features are statistically independent, classification performance can be excellent in theory.
• Example: two classes, multivariate normal with the same covariance, $p(x|\omega_j) \sim N(\mu_j, \Sigma)$, $j = 1, 2$. With equal priors,
$$P(\mathrm{error}) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\, du, \qquad \text{where } r^2 = (\mu_1 - \mu_2)^t \Sigma^{-1} (\mu_1 - \mu_2),$$
therefore $\lim_{r \to \infty} P(\mathrm{error}) = 0$.
$r^2$ is called the squared Mahalanobis distance from $\mu_1$ to $\mu_2$.
The Mahalanobis Distance
The squared Mahalanobis distance from x to $\mu_i$ is
$$r^2 = (x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i)$$
The loci of points of constant density are hyperellipsoids of constant squared Mahalanobis distance.
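As an illustration (not part of the original slides), here is a minimal NumPy/SciPy sketch that computes the squared Mahalanobis distance between two class means and the corresponding error probability for the equal-covariance, equal-priors Gaussian case above. The means and covariance are made-up example values.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class example: means and shared covariance (made-up values)
mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])

# Squared Mahalanobis distance r^2 = (mu1 - mu2)^t Sigma^{-1} (mu1 - mu2)
diff = mu1 - mu2
r2 = diff @ np.linalg.solve(Sigma, diff)
r = np.sqrt(r2)

# Assuming equal priors: P(error) = (1/sqrt(2*pi)) * integral_{r/2}^inf exp(-u^2/2) du
p_error = norm.sf(r / 2.0)   # Gaussian upper-tail probability

print(f"r^2 = {r2:.3f}, P(error) = {p_error:.4f}")
```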
Accuracy, Dimension and Training Sample Size
• If features are independent, then
$$\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2)$$
$$r^2 = \sum_{i=1}^{d} \left( \frac{\mu_{i1} - \mu_{i2}}{\sigma_i} \right)^2$$
• This shows how each feature contributes to reducing the probability of error (by increasing r); the sketch after this list illustrates the effect.
– The most useful features are the ones for which the difference between the means is large relative to the standard deviation.
– Also, no feature is useless.
– To further reduce the error rate, we can introduce new independent features.
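A small sketch (not from the slides) showing how, under the independence, equal-covariance and equal-priors assumptions above, each feature adds a term $((\mu_{i1}-\mu_{i2})/\sigma_i)^2$ to $r^2$, so adding an informative independent feature lowers the theoretical error. The per-feature statistics are invented for illustration.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical per-feature means for the two classes and shared standard deviations
mu_class1 = np.array([0.0, 1.0, 0.5])
mu_class2 = np.array([1.0, 1.2, 1.5])
sigma     = np.array([1.0, 2.0, 0.8])

def error_rate(m1, m2, s):
    """P(error) for independent Gaussian features, equal covariance and equal priors."""
    r2 = np.sum(((m1 - m2) / s) ** 2)      # r^2 = sum_i ((mu_i1 - mu_i2) / sigma_i)^2
    return norm.sf(np.sqrt(r2) / 2.0)

# Error with the first two features vs. all three: adding an independent,
# discriminative feature increases r^2 and therefore decreases P(error).
print(error_rate(mu_class1[:2], mu_class2[:2], sigma[:2]))
print(error_rate(mu_class1, mu_class2, sigma))
```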
Accuracy, Dimension and Training Sample Size
• In general, if the probabilistic structure of the problem is known, introducing new features will improve classification performance.
– The existing feature set may be inadequate.
– The Bayes risk cannot be increased by adding a new feature; at worst, the Bayes classifier will ignore the new feature.
• However, adding new features will increase the computational complexity of both the feature extractor and the classifier.
Accuracy, Dimension and Training Sample Size
• It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance.
– Frequently this is because we have the wrong model (Gaussian) or the wrong assumption (independent features).
– Also, the number of training samples may be inadequate, so the distributions are not estimated accurately.
Computational Complexity
• Our design methodology is affected by the computational difficulty.
• Notation of computational complexity:
– Order of a function: f(x) is "of the order of h(x)", written as f(x) = O(h(x)) and read as "big oh of h(x)", if there exist constants c and x0 such that |f(x)| ≤ c|h(x)| for all x > x0.
• This is an upper bound: f(x) grows no worse than h(x) for sufficiently large x.
• Example: for f(x) = a0 + a1x + a2x², we have h(x) = x², and f(x) = O(x²).
• The big-oh order of a function is not unique. In our example: f(x) = O(x²), or O(x³), or O(x⁴), or O(x² ln x).
Computational Complexity
– Big theta notation: f(x) = Θ(h(x)), read as "big theta of h(x)", if there exist constants x0, c1 and c2 such that for all x > x0, c1h(x) ≤ f(x) ≤ c2h(x).
• In the previous example, f(x) = Θ(x²), but f(x) = Θ(x³) does not hold.
• Computational complexity is measured in terms of the number of basic mathematical operations, such as additions, multiplications, and divisions, or in terms of computing time and memory requirements.
Complexity of ML Estimation
• Gaussian priors in d dimensions; a classifier with n training samples for each of c classes.
• For each category, we have to compute the discriminant function
$$g(x) = -\frac{1}{2}(x-\hat{\mu})^t \hat{\Sigma}^{-1}(x-\hat{\mu}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\Sigma}| + \ln P(\omega)$$
The associated costs are:
– $\hat{\mu}$: $d \times n$ additions and 1 division, i.e. $O(dn)$
– $\hat{\Sigma}$: $d(d+1)/2$ components, each requiring $n$ multiplications and additions, i.e. $O(d^2 n)$
– computing $|\hat{\Sigma}|$ is $O(d^2)$ and $\hat{\Sigma}^{-1}$ is $O(d^3)$
– estimation of $P(\omega)$: $O(n)$
– the $-\frac{d}{2}\ln 2\pi$ term is $O(1)$
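A minimal NumPy sketch (not from the slides) of the parameter estimation behind g(x): the sample mean (O(dn)), the sample covariance (O(d²n)), and the terms of the discriminant that do not depend on x. The training data, sizes, and prior are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200                        # dimension and training-set size (placeholders)
X = rng.normal(size=(n, d))          # n training samples for one class
prior = 0.5                          # assumed prior P(omega) for this class

# ML estimates: mu_hat costs O(dn), Sigma_hat costs O(d^2 n)
mu_hat = X.mean(axis=0)
diffs = X - mu_hat
Sigma_hat = diffs.T @ diffs / n

# Precompute the pieces of g(x) that do not depend on x
Sigma_inv = np.linalg.inv(Sigma_hat)         # O(d^3)
log_det = np.linalg.slogdet(Sigma_hat)[1]    # ln |Sigma_hat|

def g(x):
    """g(x) = -1/2 (x-mu)^t Sigma^{-1} (x-mu) - d/2 ln 2pi - 1/2 ln|Sigma| + ln P(omega)."""
    z = x - mu_hat
    return (-0.5 * z @ Sigma_inv @ z
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * log_det
            + np.log(prior))

print(g(np.zeros(d)))
```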
Complexity of ML Estimation
• The total complexity for one discriminant function is dominated by the O(d²n) terms.
• For a total of c classes, the overall computational complexity of learning in this Bayes classifier is O(cd²n) ≈ O(d²n) (c is a constant much smaller than d and n).
• Complexity increases as d and n increase.
Computational Complexity
• The computational complexity of classification is usually less than that of parameter estimation (i.e. learning).
• Given a test point x, for each of c categories:
– Compute $(x - \hat{\mu})$: O(d).
– Compute the product $(x - \hat{\mu})^t \hat{\Sigma}^{-1} (x - \hat{\mu})$: O(d²).
– Compute the decision $\max_i g_i(x)$: O(c).
– So overall, for small c, classification is an O(d²) operation, as the sketch below illustrates.
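A sketch (not from the slides) of the classification step: with each class's $\hat{\mu}_i$, $\hat{\Sigma}_i^{-1}$, log-determinant, and log-prior precomputed during learning, evaluating each $g_i(x)$ is O(d²) and the decision is the arg-max over the c classes. The `params` structure here is a hypothetical convention, not an API from the slides.

```python
import numpy as np

def classify(x, params):
    """params: list of (mu_hat, Sigma_inv, log_det, log_prior), one tuple per class.
    Returns the index of the class with the largest discriminant g_i(x)."""
    scores = []
    for mu_hat, Sigma_inv, log_det, log_prior in params:
        z = x - mu_hat                      # O(d)
        quad = z @ Sigma_inv @ z            # O(d^2)
        # the -d/2 ln 2pi term is identical for all classes and can be dropped
        scores.append(-0.5 * quad - 0.5 * log_det + log_prior)
    return int(np.argmax(scores))           # O(c)
```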
Overfitting
• Frequently the number of training samples is inadequate. To address this problem:
– Reduce the dimensionality
• Redesign the feature extractor
• Select a subset of the existing features
– Assume all classes share the same covariance matrix
– Find a better way to estimate the covariance matrix
• With some prior estimate, Bayesian learning can refine this estimate
– Assume the covariance matrix is diagonal, i.e. the off-diagonal elements are all zero and the features are independent (see the sketch after this list)
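A minimal sketch (not part of the slides) contrasting the full ML covariance estimate with the simplified diagonal estimate mentioned in the last item: with fewer samples than dimensions the full estimate is rank-deficient (not invertible), while the diagonal one needs only d parameters. The data and sizes are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 10, 8                          # more dimensions than samples (made-up sizes)
X = rng.normal(size=(n, d))

mu_hat = X.mean(axis=0)
diffs = X - mu_hat

# Full covariance: d(d+1)/2 free parameters; singular when n <= d
Sigma_full = diffs.T @ diffs / n
print("rank of full estimate:", np.linalg.matrix_rank(Sigma_full))      # < d here

# Diagonal ("independent features") covariance: only d parameters, invertible
Sigma_diag = np.diag(diffs.var(axis=0))
print("rank of diagonal estimate:", np.linalg.matrix_rank(Sigma_diag))  # = d
```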
Overfitting
• Assuming feature independence will generally produce suboptimal classification performance. But in some cases it may outperform the ML estimate over the full parameter space.
– This is because of insufficient data.
• To obtain accurate estimates of the model parameters, we need many more data samples than parameters. If we don't have them, we may have to reduce the parameter space.
– We can start with a complex model, and then smooth and simplify it. Such a solution may have poor estimation error on the training data, but may generalize well on test data.
Component Analysis and Discriminants
• One of the major problems in statistical machine learning and pattern recognition is the curse of dimensionality, i.e. the large value of the feature dimension d.
• Combine features in order to reduce the dimension of the feature space – dimension reduction.
• Linear combinations are simple to compute and tractable.
• Project high-dimensional data onto a lower-dimensional space.
Component Analysis and Discriminants
• Two classical approaches for finding an "optimal" linear transformation:
– PCA (Principal Component Analysis): "Projection that best represents the data in a least-squares sense"
– MDA (Multiple Discriminant Analysis): "Projection that best separates the data in a least-squares sense"
Principal Component Analysis
• Given n data samples {x1, x2, …, xn} in d dimensions.
• Define a linear transform
$$\tilde{x}_k = m + a_k e$$
where $\tilde{x}_k$ is an approximation of the feature vector $x_k$, $e$ is a unit vector in a particular direction with $\|e\| = 1$, $a_k$ is the 1-D projection of $x_k$ onto the basis vector $e$, and $m$ is the sample mean
$$m = \frac{1}{n}\sum_{k=1}^{n} x_k$$
• We try to minimize the squared-error criterion function
$$J(a_1, a_2, \ldots, a_n, e) = \sum_{k=1}^{n} \left\| (m + a_k e) - x_k \right\|^2$$
Principal Component Analysis
– To solve for $a_k$:
$$J(a_1, a_2, \ldots, a_n, e) = \sum_{k=1}^{n} \|(m + a_k e) - x_k\|^2 = \sum_{k=1}^{n} \|a_k e - (x_k - m)\|^2$$
$$= \sum_{k=1}^{n} a_k^2 \|e\|^2 - 2\sum_{k=1}^{n} a_k e^t (x_k - m) + \sum_{k=1}^{n} \|x_k - m\|^2$$
Let $\partial J / \partial a_k = 0$; we have $a_k = e^t (x_k - m)$.
– We also define the scatter matrix
$$S = \sum_{k=1}^{n} (x_k - m)(x_k - m)^t$$
Compare this with the covariance matrix from ML estimation (CPE646-4, p.19).
Principal Component Analysis
– To solve for $e$, given $a_k = e^t(x_k - m)$ and $\|e\| = 1$, we have
$$J(e) = \sum_{k=1}^{n} a_k^2 - 2\sum_{k=1}^{n} a_k^2 + \sum_{k=1}^{n} \|x_k - m\|^2$$
$$= -\sum_{k=1}^{n} [e^t (x_k - m)]^2 + \sum_{k=1}^{n} \|x_k - m\|^2$$
$$= -\sum_{k=1}^{n} e^t (x_k - m)(x_k - m)^t e + \sum_{k=1}^{n} \|x_k - m\|^2$$
$$= -e^t S e + \sum_{k=1}^{n} \|x_k - m\|^2$$
Principal Component Analysis
– Minimizing J(e) is the same as maximizing $e^t S e$. We can use Lagrange multipliers to maximize $e^t S e$ subject to the constraint $\|e\| = 1$:
$$u = e^t S e - \lambda (e^t e - 1)$$
$$\frac{\partial u}{\partial e} = 2Se - 2\lambda e = 0$$
$$Se = \lambda e \quad \text{or} \quad e^t S e = \lambda e^t e = \lambda$$
– To maximize $e^t S e$, we should select $e$ as the eigenvector corresponding to the largest eigenvalue of the scatter matrix S.
Principal Component Analysis
• In general, if we extend this 1-dimensional projection to a d'-dimensional projection (d' ≤ d),
$$\tilde{x}_k = m + \sum_{i=1}^{d'} a_{ik} e_i$$
the solution is to let the $e_i$ be the eigenvectors of the d' largest eigenvalues of the scatter matrix.
• Because the scatter matrix is real and symmetric, the eigenvectors are orthogonal.
• The projections of $x_k$ onto each of these eigenvectors (i.e. bases) are called principal components; a small sketch follows.
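A minimal NumPy sketch (not from the slides) of the procedure just described: form the scatter matrix, take the eigenvectors of the d' largest eigenvalues, and project the centered samples onto them. The data and the choice d' = 2 are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))        # n = 100 samples in d = 5 dimensions (placeholder data)
d_prime = 2                          # target dimension d' <= d

m = X.mean(axis=0)                   # sample mean
centered = X - m
S = centered.T @ centered            # scatter matrix S = sum_k (x_k - m)(x_k - m)^t

# S is real and symmetric, so use eigh; eigenvalues are returned in ascending order
eigvals, eigvecs = np.linalg.eigh(S)
E = eigvecs[:, ::-1][:, :d_prime]    # eigenvectors of the d' largest eigenvalues

# Principal components: a_ik = e_i^t (x_k - m)
A = centered @ E                     # shape (n, d'), the reduced representation

# Reconstruction x_k ~ m + sum_i a_ik e_i
X_approx = m + A @ E.T
print("mean squared reconstruction error:", np.mean((X - X_approx) ** 2))
```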
Problem of PCA
• PCA reduces the dimension for each class individually. The resulting components are a good representation of the data in each class, but they may not be good for discrimination purposes.
• For example, suppose two classes have 2D Gaussian-like densities, represented by two ellipses. They are well separable. But if we project the data onto the first principal component (i.e. from 2D to 1D), they become inseparable (with very low Chernoff information).
• The best projection is along the short axis, as the sketch below illustrates.
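A small illustration (not from the slides) of this failure mode, using synthetic data: two elongated, well-separated 2D classes whose first principal component lies along the long shared axis, so projecting onto it mixes the classes, while projecting onto the short axis keeps them apart.

```python
import numpy as np

rng = np.random.default_rng(3)
# Two elongated Gaussian classes, separated along the second (short-variance) axis
cov = np.array([[9.0, 0.0],
                [0.0, 0.1]])
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X2 = rng.multivariate_normal([0.0, 2.0], cov, size=200)
X = np.vstack([X1, X2])

# First principal component of the pooled data (largest-eigenvalue eigenvector)
centered = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(centered.T @ centered)
e1 = eigvecs[:, -1]                  # points roughly along the long axis [1, 0]

def separation(direction):
    """Gap between projected class means, in units of pooled projected std."""
    p1, p2 = X1 @ direction, X2 @ direction
    return abs(p1.mean() - p2.mean()) / np.sqrt(0.5 * (p1.var() + p2.var()))

print("separation along 1st PC    :", separation(e1))                      # near 0: classes mix
print("separation along short axis:", separation(np.array([0.0, 1.0])))    # large: separable
```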
Problem of PCA
• In fact, this is a typical issue in the study of generative models vs. discriminative models. Generative models aim at representing the data faithfully, while discriminative models target telling objects apart.
Fisher Linear Discriminant
• Consider a two-class problem.
• Given n data samples {x1, x2, …, xn} in d dimensions: n1 samples in subset (class) D1 labeled ω1, and n2 samples in subset (class) D2 labeled ω2.
• A linear combination projects a d-D feature vector to a 1-D scalar $y = w^t x$ with $\|w\| = 1$.
• The resulting n samples y1, …, yn are divided into the subsets Y1 and Y2.
• We adjust the direction of w to maximize class separability.
Fisher Linear Discriminant
$$m_i = \frac{1}{n_i}\sum_{x \in D_i} x$$
$$\tilde{m}_i = \frac{1}{n_i}\sum_{y \in Y_i} y = \frac{1}{n_i}\sum_{x \in D_i} w^t x = w^t m_i$$
$$\tilde{m}_1 - \tilde{m}_2 = w^t (m_1 - m_2)$$
• We define the scatter for the projected samples labeled $\omega_i$:
$$\tilde{s}_i^2 = \sum_{y \in Y_i} (y - \tilde{m}_i)^2$$
Fisher Linear Discriminant
• The Fisher linear discriminant employs the linear function $w^t x$ for which the criterion function
$$J(w) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
is maximum.
$|\tilde{m}_1 - \tilde{m}_2|$ is the distance between the projected means (between-class scatter);
$\tilde{s}_1^2 + \tilde{s}_2^2$ is called the total within-class scatter.
Fisher Linear Discriminant
• We define the scatter matrices
$$S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^t$$
and
$$\tilde{s}_i^2 = \sum_{x \in D_i} (w^t x - w^t m_i)^2 = \sum_{x \in D_i} w^t (x - m_i)(x - m_i)^t w = w^t S_i w$$
$$\tilde{s}_1^2 + \tilde{s}_2^2 = w^t S_W w$$
$$S_W = S_1 + S_2 \quad \text{(within-class scatter matrix)}$$
Fisher Linear Discriminant
• The between-class scatter:
$$(\tilde{m}_1 - \tilde{m}_2)^2 = (w^t m_1 - w^t m_2)^2 = w^t (m_1 - m_2)(m_1 - m_2)^t w = w^t S_B w$$
where
$$S_B = (m_1 - m_2)(m_1 - m_2)^t \quad \text{(between-class scatter matrix)}$$
• The criterion function becomes
$$J(w) = \frac{w^t S_B w}{w^t S_W w}$$
(the generalized Rayleigh quotient)
Fisher Linear Discriminant
• The vector w that maximizes J(·) must satisfy
$$S_B w = \lambda S_W w \quad \text{for some constant } \lambda$$
If $S_W$ is not singular, this becomes
$$S_W^{-1} S_B w = \lambda w$$
• This is an eigenvalue problem, but we do not need to solve for λ and w explicitly: since $S_B w$ is always in the direction of $(m_1 - m_2)$, the solution is
$$w = S_W^{-1}(m_1 - m_2)$$
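A minimal NumPy sketch (not part of the slides) of the two-class Fisher linear discriminant just derived: compute the class means, the within-class scatter $S_W$, and $w = S_W^{-1}(m_1 - m_2)$. The two-class data are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)
# Placeholder two-class data in d = 3 dimensions
X1 = rng.normal(loc=[0, 0, 0], scale=1.0, size=(60, 3))
X2 = rng.normal(loc=[2, 1, 0], scale=1.0, size=(80, 3))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

def scatter(X, m):
    """S_i = sum_{x in D_i} (x - m_i)(x - m_i)^t"""
    d = X - m
    return d.T @ d

S_W = scatter(X1, m1) + scatter(X2, m2)      # within-class scatter matrix

# Fisher direction: w = S_W^{-1} (m_1 - m_2)
w = np.linalg.solve(S_W, m1 - m2)

# Project the samples onto w; a threshold on y = w^t x then separates the classes in 1-D
y1, y2 = X1 @ w, X2 @ w
print("projected class means:", y1.mean(), y2.mean())
```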
Fisher Linear Discriminant
• The remaining step is to separate the two classes in one dimension.
• The complexity of finding w is dominated by calculating $S_W^{-1}$, which is an O(d²n) calculation.
Fisher Linear Discriminant
• If the class-conditional densities $p(x|\omega_i)$ are multivariate normal with equal covariance matrices Σ, the optimal decision boundary satisfies (CPE646-3, p.22)
$$w^t x + w_0 = 0$$
where $w_0$ is a constant involving w and the prior probabilities, and
$$w = \Sigma^{-1}(\mu_1 - \mu_2)$$
This w is equivalent to $w = S_W^{-1}(m_1 - m_2)$.
Multiple Discriminant Analysis
• For the c-class problem, the generalization of the Fisher linear discriminant projects the d-dimensional feature vectors onto a (c-1)-dimensional space, assuming d ≥ c.
• The generalization of the within-class scatter matrix:
$$S_W = \sum_{i=1}^{c} S_i, \quad \text{where} \quad S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^t \quad \text{and} \quad m_i = \frac{1}{n_i}\sum_{x \in D_i} x$$
Multiple Discriminant Analysis
• The generalization of the between-class scatter matrix:
define the total mean vector
$$m = \frac{1}{n}\sum_{x} x = \frac{1}{n}\sum_{i=1}^{c} n_i m_i$$
define the total scatter matrix
$$S_T = \sum_{x} (x - m)(x - m)^t = \sum_{i=1}^{c} \sum_{x \in D_i} (x - m_i + m_i - m)(x - m_i + m_i - m)^t = S_W + \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^t$$
define a general between-class scatter matrix
$$S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^t$$
and $S_T = S_W + S_B$.
Multiple Discriminant Analysis
• The projection from d-D to (c-1)-D can be expressed as
$$y_i = w_i^t x, \quad i = 1, \ldots, c-1, \quad \text{or} \quad y = W^t x,$$
where the $w_i$ are the columns of the $d \times (c-1)$ matrix W.
• The criterion function becomes
$$J(W) = \frac{|W^t S_B W|}{|W^t S_W W|}$$
• The solution that maximizes this function satisfies
$$S_B w_i = \lambda_i S_W w_i$$
Multiple Discriminant Analysis
• To solve for the $w_i$, we can:
– Compute the $\lambda_i$ from $|S_B - \lambda_i S_W| = 0$
– Then compute the $w_i$ from $(S_B - \lambda_i S_W) w_i = 0$
• The solution for W is not unique; all solutions have the same J(W). A minimal sketch follows.
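A sketch (not from the slides) of the multi-class case: build $S_W$ and $S_B$ as defined above, then take the eigenvectors of $S_W^{-1} S_B$ with the c-1 largest eigenvalues as the columns of W. The data are random placeholders, and $S_W$ is assumed nonsingular.

```python
import numpy as np

rng = np.random.default_rng(5)
c, d = 3, 4
# Placeholder data: one (n_i, d) array per class, with shifted means
classes = [rng.normal(loc=mu, size=(50, d))
           for mu in ([0, 0, 0, 0], [2, 0, 1, 0], [0, 2, 0, 1])]

m = np.vstack(classes).mean(axis=0)                  # total mean vector

S_W = np.zeros((d, d))
S_B = np.zeros((d, d))
for Xi in classes:
    mi = Xi.mean(axis=0)
    di = Xi - mi
    S_W += di.T @ di                                 # within-class scatter
    S_B += len(Xi) * np.outer(mi - m, mi - m)        # between-class scatter

# Solve S_B w_i = lambda_i S_W w_i  (assumes S_W is nonsingular)
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:c - 1]].real                   # d x (c-1) projection matrix

Y = np.vstack(classes) @ W                           # project all samples to (c-1)-D
print(W.shape, Y.shape)
```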