Ch 4. Linear Models for Classification (1/2)
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Summarized and revised by Hee-Woong Lim
Contents

 – 4.1. Discriminant Functions
 – 4.2. Probabilistic Generative Models
(C) 2006, SNU Biointelligence Lab, http://bi.snu.ac.kr/
Classification Models

Linear classification model
 – The decision boundary is a (D-1)-dimensional hyperplane in the D-dimensional input space.
 – 1-of-K coding scheme for K > 2 classes, e.g. t = (0, 1, 0, 0, 0)^T.

Discriminant function
 – Directly assigns each input vector x to a specific class.
 – e.g. Fisher's linear discriminant

Approaches using the conditional probability p(C_k|x)
 – Separation of the inference and decision stages
 – Two approaches:
   • Discriminative: direct modeling of the posterior probability
   • Generative: model the likelihood and the prior probability to compute the posterior probability; capable of generating samples
Discriminant Functions-Two Classes

Classification by hyperplanes

  y(x) = w^T x + w_0

  if y(x) ≥ 0, then x ∈ C1; otherwise, x ∈ C2.

or, with an augmented input vector,

  y(x) = w̃^T x̃,  where w̃ = (w_0, w) and x̃ = (1, x).
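A minimal numpy sketch of this decision rule; the weight vector and bias below are arbitrary illustrative values, not taken from the text:

```python
import numpy as np

def classify_two_class(x, w, w0):
    """Assign x to C1 if y(x) = w^T x + w0 >= 0, otherwise to C2."""
    y = w @ x + w0
    return "C1" if y >= 0 else "C2"

# Illustrative 2-D example: decision boundary x1 + x2 - 1 = 0.
w = np.array([1.0, 1.0])
w0 = -1.0
print(classify_two_class(np.array([2.0, 0.5]), w, w0))  # C1
print(classify_two_class(np.array([0.1, 0.2]), w, w0))  # C2
```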
Discriminant Functions-Multiple Classes

One-versus-the-rest classifier
 – K-1 binary classifiers for a K-class discriminant
 – Ambiguous in regions where more than one classifier says 'yes' (or none does).

One-versus-one classifier
 – K(K-1)/2 binary discriminant functions
 – Majority voting → still ambiguous in regions with equal votes.

[Figure: ambiguous regions for the one-versus-the-rest and one-versus-one constructions]
Discriminant Functions-Multiple Classes
(Cont’d)

K-class discriminant comprising K linear functions
 – Assigns x to the class having the maximum output.

  y_k(x) = w_k^T x + w_k0,  k = 1, ..., K
  x ∈ C_k if y_k(x) > y_j(x) for all j ≠ k

The decision regions of such a discriminant are always singly connected and convex:
  For x_A, x_B ∈ C_k, let x̂ = λ x_A + (1-λ) x_B with 0 ≤ λ ≤ 1.
  Then y_k(x̂) = λ y_k(x_A) + (1-λ) y_k(x_B).
  Since y_k(x_A) > y_j(x_A) and y_k(x_B) > y_j(x_B) for all j ≠ k,
  it follows that y_k(x̂) > y_j(x̂) for all j ≠ k, so x̂ ∈ C_k.
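A small numpy sketch of the K-class rule above; the weight matrix and biases are arbitrary illustrative values:

```python
import numpy as np

def k_class_discriminant(x, W, w0):
    """y_k(x) = w_k^T x + w_k0 for all k; assign x to the class with the largest output."""
    y = W @ x + w0            # shape (K,)
    return int(np.argmax(y))

# Illustrative example: K = 3 classes in a 2-D input space.
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])    # row k holds w_k^T
w0 = np.array([0.0, 0.0, 0.5])  # biases w_k0
print(k_class_discriminant(np.array([2.0, 0.3]), W, w0))  # class 0
```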
Approaches for Learning Parameters
for Linear Discriminant Functions
 – Least square method
 – Fisher's linear discriminant
   • Relation to least squares
   • Multiple classes
 – Perceptron algorithm
Least Square Method
Minimization of the sum-of-squares error (SSE), using the 1-of-K binary coding scheme for the target vector t.

  y(x) = W̃^T x̃,  where W̃ = (w̃_1 w̃_2 ... w̃_K) and w̃_k = (w_k0, w_k^T)^T.

For a training data set {x_n, t_n}, n = 1, ..., N, the sum-of-squares error function is

  E_D(W̃) = (1/2) Tr{ (X̃W̃ - T)^T (X̃W̃ - T) },

  where X̃ = (x̃_1 x̃_2 ... x̃_N)^T and T = (t_1 t_2 ... t_N)^T.

Minimizing the SSE gives

  W̃ = (X̃^T X̃)^{-1} X̃^T T = X̃† T,   where X̃† is the pseudo-inverse of X̃.
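A numpy sketch of this closed-form solution, assuming X is the N x D input matrix and T the N x K matrix of 1-of-K targets; np.linalg.pinv supplies the pseudo-inverse:

```python
import numpy as np

def fit_least_squares(X, T):
    """Return W_tilde minimizing the sum-of-squares error with augmented inputs."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the bias column
    W_tilde = np.linalg.pinv(X_tilde) @ T                # (X^T X)^{-1} X^T T
    return W_tilde

def predict(X, W_tilde):
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    Y = X_tilde @ W_tilde                                # N x K outputs
    return np.argmax(Y, axis=1)                          # class with the largest output

# Tiny illustrative run: 4 points, 2 classes, 1-of-K targets.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
T = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
W_tilde = fit_least_squares(X, T)
print(predict(X, W_tilde))   # [0 0 1 1]
```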
Least Square Method (Cont’d)
-Limit and Disadvantage

The least-squares solution yields y(x) whose elements sum to 1, but it does not constrain the outputs to lie in the range [0, 1].
 – Vulnerable to outliers, because the SSE function penalizes predictions that are 'too correct', i.e. points lying far from the decision boundary on the correct side.
 – Least squares corresponds to ML under a Gaussian conditional distribution, but binary target vectors are clearly not Gaussian (unimodal assumption vs. multimodal targets).
Least Square Method (Cont’d)
-Limit and Disadvantage

Lack of robustness comes from the fact that:
 – The least square method corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution.
 – Binary target vectors are far from this assumption.

[Figure: decision boundaries obtained by the least-squares solution vs. logistic regression]
Fisher’s Linear Discriminant

Linear classification viewed as dimensionality reduction from the D-dimensional input space to one dimension.
 – In the case of two classes:

  y = w^T x
  if y ≥ -w_0, then x ∈ C1; otherwise, x ∈ C2.

 – Find w such that the projected data are well separated by class.
Fisher’s Linear Discriminant (Cont’d)

Maximizing the projected mean separation?
 – The distance between the class means projected onto w:

  m'_2 - m'_1 = w^T (m_2 - m_1),

  where m_k = (1/N_k) Σ_{n∈C_k} x_n is the mean of class C_k and m'_k = w^T m_k is its projection.

 – Not appropriate when the class covariances are strongly non-diagonal: the projected classes can still overlap considerably.
Fisher’s Linear Discriminant (Cont’d)

Also take into account the within-class variance of the projected data.
 – Find w that maximizes the Fisher criterion J(w):

  J(w) = (m'_2 - m'_1)^2 / (s_1^2 + s_2^2),  where s_k^2 = Σ_{n∈C_k} (y_n - m'_k)^2

  ⇒ J(w) = (w^T S_B w) / (w^T S_W w)

  S_B = (m_2 - m_1)(m_2 - m_1)^T                                    (between-class covariance matrix)
  S_W = Σ_{n∈C1} (x_n - m_1)(x_n - m_1)^T + Σ_{n∈C2} (x_n - m_2)(x_n - m_2)^T   (within-class covariance matrix)

J(w) is maximized when

  (w^T S_B w) S_W w = (w^T S_W w) S_B w,

and since S_B w is always in the direction of (m_2 - m_1), this gives Fisher's linear discriminant:

  w ∝ S_W^{-1} (m_2 - m_1).

 – If the within-class covariance is isotropic, w is proportional to the difference of the class means, as in the previous case.
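A numpy sketch of the two-class Fisher direction, assuming X1 and X2 hold the samples of C1 and C2 as rows; the data below are illustrative:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Return w proportional to S_W^{-1} (m2 - m1) for two classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    w = np.linalg.solve(S_W, m2 - m1)                         # S_W^{-1} (m2 - m1)
    return w / np.linalg.norm(w)

# Illustrative data: two elongated clusters.
rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], [2.0, 0.3], size=(100, 2))
X2 = rng.normal([3, 1], [2.0, 0.3], size=(100, 2))
print(fisher_direction(X1, X2))
```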

Fisher’s Linear Discriminant
-Relation to Least Squares
Fisher criterion as a special case of least squares
 – Set the target values to N/N_1 for class C1 and -N/N_2 for class C2. Then

  E = (1/2) Σ_{n=1}^N (w^T x_n + w_0 - t_n)^2

  dE/dw_0 = 0  ⇒  Σ_{n=1}^N (w^T x_n + w_0 - t_n) = 0          (1)
  dE/dw  = 0  ⇒  Σ_{n=1}^N (w^T x_n + w_0 - t_n) x_n = 0       (2)

 – Solving (1):  w_0 = -w^T m,  where m = (1/N) Σ_{n=1}^N x_n = (1/N)(N_1 m_1 + N_2 m_2).
 – Solving (2) with the w_0 above:

  (S_W + (N_1 N_2 / N) S_B) w = N (m_1 - m_2),

  and since S_B w is always in the direction of (m_2 - m_1),

  w ∝ S_W^{-1} (m_2 - m_1).
Fisher’s Discriminant for Multiple Classes


K > 2 classes: dimensionality reduction from D to D'
 – D' > 1 linear features y_k = w_k^T x, k = 1, ..., D'.

Generalization of S_W and S_B:

  S_W = Σ_{k=1}^K S_k,  where S_k = Σ_{n∈C_k} (x_n - m_k)(x_n - m_k)^T and m_k = (1/N_k) Σ_{n∈C_k} x_n.

  S_B = Σ_{k=1}^K N_k (m_k - m)(m_k - m)^T

S_B follows from the decomposition of the total covariance matrix (Duda and Hart, 1973):

  S_T = Σ_{n=1}^N (x_n - m)(x_n - m)^T,  where m = (1/N) Σ_{n=1}^N x_n = (1/N) Σ_{k=1}^K N_k m_k,

  S_T = S_W + S_B.
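A numpy sketch of these scatter matrices, plus one common way to obtain the D'-dimensional projection from the leading eigenvectors of S_W^{-1} S_B (that eigenvector step is not spelled out on this slide and is included here as an assumption); X, labels, K and d_prime are illustrative names:

```python
import numpy as np

def fisher_scatter_matrices(X, labels, K):
    """Compute the within-class (S_W) and between-class (S_B) scatter matrices."""
    D = X.shape[1]
    m = X.mean(axis=0)                     # overall mean
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for k in range(K):
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        S_W += (Xk - mk).T @ (Xk - mk)     # sum of per-class scatters S_k
        d = (mk - m).reshape(-1, 1)
        S_B += Xk.shape[0] * (d @ d.T)     # N_k (m_k - m)(m_k - m)^T
    return S_W, S_B

def fisher_projection(X, labels, K, d_prime):
    """Rows of the returned matrix span the D'-dimensional Fisher subspace."""
    S_W, S_B = fisher_scatter_matrices(X, labels, K)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(-eigvals.real)
    return eigvecs.real[:, order[:d_prime]].T

# Illustrative usage with random 3-class data in 4 dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 1.0, size=(50, 4)) for c in ([0]*4, [3]*4, [0, 3, 0, 3])])
labels = np.repeat([0, 1, 2], 50)
print(fisher_projection(X, labels, K=3, d_prime=2).shape)   # (2, 4)
```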
Fisher’s Discriminant for Multiple Classes
(Cont’d)

Covariance matrices in the projected y-space:

  s_W = Σ_{k=1}^K Σ_{n∈C_k} (y_n - μ_k)(y_n - μ_k)^T  and  s_B = Σ_{k=1}^K N_k (μ_k - μ)(μ_k - μ)^T,

  where μ_k = (1/N_k) Σ_{n∈C_k} y_n and μ = (1/N) Σ_{k=1}^K N_k μ_k.

Fukunaga's criterion:

  J(W) = Tr{ s_W^{-1} s_B } = Tr{ (W S_W W^T)^{-1} (W S_B W^T) }

Another criterion (Duda et al., 'Pattern Classification', Ch. 3.8.3):
 – Determinant: the product of the eigenvalues, i.e. the variances in the principal directions.

  J(W) = |s_B| / |s_W| = |W S_B W^T| / |W S_W W^T|
Perceptron Algorithm

Classification of x by a perceptron:

  y(x) = f( w^T φ(x) ),  where f(a) = +1 if a ≥ 0, and -1 if a < 0.

Error functions
 – The total number of misclassified patterns:
   piecewise constant and discontinuous in w; its gradient is zero almost everywhere.
 – Perceptron criterion:

  E_P(w) = - Σ_{n∈M} w^T φ_n t_n,

  where M is the set of misclassified patterns, φ_n = φ(x_n), and t_n ∈ {+1, -1} is the target output.
Perceptron Algorithm (cont’d)

Stochastic gradient descent algorithm:

  w^(τ+1) = w^(τ) - η ∇E_P(w) = w^(τ) + η φ_n t_n

The error contributed by a misclassified pattern is reduced after each such update:

  -w^(τ+1)T φ_n t_n = -w^(τ)T φ_n t_n - (φ_n t_n)^T φ_n t_n < -w^(τ)T φ_n t_n

 – This does not imply that the overall error is reduced.

Perceptron convergence theorem:
 – If an exact solution exists (i.e. the training data are linearly separable), the perceptron learning algorithm is guaranteed to find it in a finite number of steps.

However, there remain issues: learning speed, linearly non-separable data, and the extension to multiple classes. A minimal training-loop sketch follows below.
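A minimal numpy sketch of the perceptron training loop under the identity feature map φ(x) = x with an appended bias feature; eta and max_epochs are illustrative choices, not from the text:

```python
import numpy as np

def train_perceptron(phi, t, eta=1.0, max_epochs=100):
    """Cycle through the data, updating w only on misclassified patterns."""
    w = np.zeros(phi.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for phi_n, t_n in zip(phi, t):
            if t_n * (w @ phi_n) <= 0:       # misclassified (or on the boundary)
                w += eta * phi_n * t_n       # w <- w + eta * phi_n * t_n
                mistakes += 1
        if mistakes == 0:                    # converged: every pattern correctly classified
            break
    return w

# Linearly separable toy data (bias handled by a constant feature).
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
phi = np.hstack([np.ones((4, 1)), X])
t = np.array([-1, -1, 1, 1])
w = train_perceptron(phi, t)
print(np.sign(phi @ w))   # [-1. -1.  1.  1.]
```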
Perceptron Algorithm (cont’d)
[Figure: panels (a)-(d) illustrating successive perceptron weight updates until convergence]
Probabilistic Generative Models

Computation of posterior probabilities using class-conditional densities and class priors:

  p(x|C_k) and p(C_k)  →  p(C_k|x)

Two classes:

  p(C1|x) = p(x|C1) p(C1) / [ p(x|C1) p(C1) + p(x|C2) p(C2) ] = σ(a) = 1 / (1 + exp(-a)),

  where a = ln [ p(x|C1) p(C1) / ( p(x|C2) p(C2) ) ].

Generalization to K > 2 classes:

  p(C_k|x) = p(x|C_k) p(C_k) / Σ_j p(x|C_j) p(C_j) = exp(a_k) / Σ_j exp(a_j),

  where a_k = ln p(x|C_k) p(C_k).

The normalized exponential is also known as the softmax function, i.e. a smoothed version of the 'max' function.
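A small numpy sketch of these two formulas; log_joint is assumed to already hold a_k = ln p(x|C_k) p(C_k) for each class, so only the sigmoid/softmax step is shown:

```python
import numpy as np

def posterior_two_class(a):
    """p(C1|x) = sigma(a), with a = ln [p(x|C1)p(C1) / (p(x|C2)p(C2))]."""
    return 1.0 / (1.0 + np.exp(-a))

def posterior_k_class(log_joint):
    """Softmax over a_k = ln p(x|Ck)p(Ck); shifting by the max avoids overflow."""
    a = np.asarray(log_joint, dtype=float)
    a -= a.max()
    e = np.exp(a)
    return e / e.sum()

print(posterior_two_class(0.0))                 # 0.5: both classes equally likely
print(posterior_k_class([-1.0, -2.3, -0.7]))    # nonnegative values summing to 1
```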
Probabilistic Generative Models
-Continuous Inputs
Posterior probabilities when the class-conditional densities p(x|C_k) are Gaussian.
 – When the classes share the same covariance matrix Σ:

  p(x|C_k) = 1 / ( (2π)^(D/2) |Σ|^(1/2) ) exp{ -(1/2) (x - μ_k)^T Σ^{-1} (x - μ_k) }.

Two classes:

  p(C1|x) = σ( w^T x + w_0 ),

  w = Σ^{-1} (μ_1 - μ_2)
  w_0 = -(1/2) μ_1^T Σ^{-1} μ_1 + (1/2) μ_2^T Σ^{-1} μ_2 + ln [ p(C1) / p(C2) ]

 – The quadratic terms in x from the exponents cancel.
 – The resulting decision boundary is linear in input space.
 – The priors only shift the decision boundary, i.e. they move it parallel to itself.
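A numpy sketch of these closed-form expressions, assuming the class means, shared covariance, and priors are given; the parameter values below are illustrative:

```python
import numpy as np

def gaussian_posterior_params(mu1, mu2, Sigma, pi1, pi2):
    """Return (w, w0) such that p(C1|x) = sigma(w^T x + w0) for shared-covariance Gaussians."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(pi1 / pi2))
    return w, w0

# Illustrative parameters in 2-D.
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
w, w0 = gaussian_posterior_params(mu1, mu2, Sigma, pi1=0.5, pi2=0.5)
x = np.array([0.5, 0.5])
print(1.0 / (1.0 + np.exp(-(w @ x + w0))))   # p(C1|x)
```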
Probabilistic Generative Models
-Continuous Inputs (cont’d)
Generalization to K classes:

  a_k(x) = w_k^T x + w_k0,
  w_k = Σ^{-1} μ_k  and  w_k0 = -(1/2) μ_k^T Σ^{-1} μ_k + ln p(C_k)

 – When the classes share the same covariance matrix, the decision boundaries are again linear.
 – If each class-conditional density has its own covariance matrix, we obtain quadratic functions of x, giving rise to a quadratic discriminant.
Probabilistic Generative Models
-Maximum Likelihood Solution

Determining the parameters of p(x|C_k) and p(C_k) by maximum likelihood from a training data set.

Two classes:
 – Data set: {x_n, t_n}, n = 1, ..., N.  Priors: p(C1) = π and p(C2) = 1 - π.
 – t_n = 1 or 0 (denoting C1 and C2, respectively).

  p(x_n, C1) = p(C1) p(x_n|C1) = π N(x_n | μ_1, Σ)
  p(x_n, C2) = p(C2) p(x_n|C2) = (1 - π) N(x_n | μ_2, Σ)

 – The likelihood function:

  p(t | π, μ_1, μ_2, Σ) = Π_{n=1}^N [ π N(x_n | μ_1, Σ) ]^(t_n) [ (1 - π) N(x_n | μ_2, Σ) ]^(1 - t_n),

  where t = (t_1, ..., t_N)^T.
Probabilistic Generative Models
-Maximum Likelihood Solution (cont’d)
Two classes (cont'd)
 – Maximization of the likelihood with respect to π:
   • Terms of the log likelihood that depend on π:

  Σ_{n=1}^N { t_n ln π + (1 - t_n) ln(1 - π) }

   • Setting the derivative with respect to π equal to zero:

  π = (1/N) Σ_{n=1}^N t_n = N_1 / N = N_1 / (N_1 + N_2)

 – Maximization with respect to μ_1:

  Σ_{n=1}^N t_n ln N(x_n | μ_1, Σ) = -(1/2) Σ_{n=1}^N t_n (x_n - μ_1)^T Σ^{-1} (x_n - μ_1) + const.

  μ_1 = (1/N_1) Σ_{n=1}^N t_n x_n,   and analogously  μ_2 = (1/N_2) Σ_{n=1}^N (1 - t_n) x_n.
Probabilistic Generative Models
-Maximum Likelihood Solution (cont’d)
Two classes (cont'd)
 – Maximization of the likelihood with respect to the shared covariance matrix Σ:

  -(1/2) Σ_{n=1}^N t_n ln|Σ| - (1/2) Σ_{n=1}^N t_n (x_n - μ_1)^T Σ^{-1} (x_n - μ_1)
  -(1/2) Σ_{n=1}^N (1 - t_n) ln|Σ| - (1/2) Σ_{n=1}^N (1 - t_n) (x_n - μ_2)^T Σ^{-1} (x_n - μ_2)
  = -(N/2) ln|Σ| - (N/2) Tr{ Σ^{-1} S }

  Σ = S = (N_1/N) S_1 + (N_2/N) S_2,   where S_k = (1/N_k) Σ_{n∈C_k} (x_n - μ_k)(x_n - μ_k)^T.

 – S is a weighted average of the covariance matrices associated with each class.
 – But it is not robust to outliers.
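A numpy sketch of these maximum-likelihood estimates, assuming X stacks all samples as rows and t is the 0/1 label vector (t_n = 1 for C1); the variable names and data are illustrative:

```python
import numpy as np

def fit_shared_cov_gaussians(X, t):
    """ML estimates of pi, mu1, mu2 and the shared covariance Sigma = S."""
    N = len(t)
    N1, N2 = t.sum(), N - t.sum()
    pi = N1 / N
    mu1 = (t[:, None] * X).sum(axis=0) / N1             # mean of the C1 samples
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2       # mean of the C2 samples
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1   # per-class covariances
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2               # weighted average
    return pi, mu1, mu2, Sigma

# Illustrative data: 60 samples from C1, 40 from C2.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 1.0, size=(60, 2)),
               rng.normal([2, 1], 1.0, size=(40, 2))])
t = np.concatenate([np.ones(60), np.zeros(40)]).astype(int)
print(fit_shared_cov_gaussians(X, t)[0])   # pi = 0.6
```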
Probabilistic Generative Models
-Discrete Features

Discrete feature values x_i ∈ {0, 1}
 – A general distribution would correspond to a table of size 2^D: with D inputs, the table grows exponentially with the number of features.
 – Naïve Bayes assumption: the features are treated as independent, conditioned on the class C_k:

  p(x|C_k) = Π_{i=1}^D μ_ki^(x_i) (1 - μ_ki)^(1 - x_i)

  a_k(x) = ln p(x|C_k) p(C_k) = Σ_{i=1}^D { x_i ln μ_ki + (1 - x_i) ln(1 - μ_ki) } + ln p(C_k)

 – Again linear with respect to the features, as in the continuous-input case.
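A small numpy sketch of a_k(x) for binary features; mu[k, i] plays the role of μ_ki and priors[k] of p(C_k), with illustrative values:

```python
import numpy as np

def naive_bayes_log_joint(x, mu, priors):
    """a_k(x) = sum_i { x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki) } + ln p(Ck), for every class k."""
    x = np.asarray(x, dtype=float)
    return (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + np.log(priors)

# Illustrative parameters: K = 2 classes, D = 3 binary features.
mu = np.array([[0.9, 0.2, 0.5],
               [0.1, 0.8, 0.5]])
priors = np.array([0.6, 0.4])
a = naive_bayes_log_joint([1, 0, 1], mu, priors)
print(np.argmax(a))   # most probable class: 0
```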
Bayes Decision Boundaries: 2D
-Pattern Classification, Duda et al. pp.42
Bayes Decision Boundaries: 3D
-Pattern Classification, Duda et al. pp.43
Probabilistic Generative Models
-Exponential Family
For both Gaussian-distributed and discrete inputs:
 – The posterior class probabilities are given by generalized linear models with logistic sigmoid or softmax activation functions.

Generalization to class-conditional densities from the exponential family:
 – Consider the subclass for which u(x) = x.

  Exponential family:  p(x|λ_k) = h(x) g(λ_k) exp{ λ_k^T u(x) }

  With a scaling parameter s:  p(x|λ_k, s) = (1/s) h(x/s) g(λ_k) exp{ (1/s) λ_k^T x }.

 – Two classes:  a(x) = (λ_1 - λ_2)^T x + ln g(λ_1) - ln g(λ_2) + ln p(C1) - ln p(C2),  where p(C1|x) = σ(a(x)).
 – K classes:  a_k(x) = λ_k^T x + ln g(λ_k) + ln p(C_k),  where p(C_k|x) = exp(a_k) / Σ_j exp(a_j).
 – Again linear with respect to x.