METU Informatics Institute
Min 720 Pattern Classification with Bio-Medical Applications
PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule
Statistical Approach to P.R.

Feature vector: $X = [X_1, X_2, \ldots, X_d]$
Dimension of the feature space: $d$
Set of different states of nature (categories): $\{\omega_1, \omega_2, \ldots, \omega_c\}$, where $c$ is the number of categories
Set of possible actions (decisions): $\{\alpha_1, \alpha_2, \ldots, \alpha_a\}$
Here, a decision might include a 'reject option'.

The goal is to find decision regions $R_i$ that partition the feature space: $R_i \cap R_j = \emptyset$ for $i \ne j$ and $\bigcup_i R_i = \mathbb{R}^d$.

A discriminant function $g_i(X)$, $1 \le i \le c$, is assigned to each category, with $g_i(X) > g_j(X)$ in region $R_i$. Decision rule: decide $\alpha_k$ if $g_k(X) > g_j(X)$ for all $j \ne k$.

[Figure: feature space partitioned into regions $R_1, R_2, R_3$ by the discriminant functions $g_1, g_2, g_3$.]

A Pattern Classifier

[Figure: block diagram of a pattern classifier; the input $X$ feeds the discriminant functions $g_1(X), g_2(X), \ldots, g_c(X)$, and a maximum selector outputs the winning category $k$.]

So our aim now will be to define these functions $g_1, g_2, \ldots, g_c$ to minimize or optimize a criterion.
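As a concrete illustration of this max-selector structure, here is a minimal Python sketch (not from the slides); the two 1-d discriminants are hypothetical placeholders.

```python
# A minimal sketch of the max-selector classifier: each category i supplies a
# discriminant g_i(X), and the classifier outputs the index k of the largest one.
from typing import Callable, Sequence

def classify(x: float, discriminants: Sequence[Callable[[float], float]]) -> int:
    """Return the index k for which g_k(x) is largest."""
    scores = [g(x) for g in discriminants]
    return max(range(len(scores)), key=lambda i: scores[i])

# Two hypothetical 1-d discriminants favouring samples near 1.0 and 4.0:
g = [lambda x: -(x - 1.0) ** 2,
     lambda x: -(x - 4.0) ** 2]
print(classify(2.0, g))  # -> 0
print(classify(3.2, g))  # -> 1
```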
Parametric Approach to Classification

• 'Bayes Decision Theory' is used for minimum-error / minimum-risk pattern classifier design.
• Here, it is assumed that if a sample $X$ is drawn from a class $\omega_i$, it is a random variable represented by a multivariate probability density function, the 'class-conditional density function' $P(X \mid \omega_i)$.
• We also know the a priori probability $P(\omega_i)$, $1 \le i \le c$ (where $c$ is the number of classes).
• Then, we can talk about a decision rule that minimizes the probability of error.
• Suppose we have the observation $X$.
• This observation changes the a priori assumption into the a posteriori probability $P(\omega_i \mid X)$, which can be found by the Bayes Rule:

$$P(\omega_i \mid X) = \frac{P(\omega_i, X)}{P(X)} = \frac{P(X \mid \omega_i)\,P(\omega_i)}{P(X)}$$

• $P(X)$ can be found by the Total Probability Rule. When the $\omega_i$'s are disjoint,

$$P(X) = \sum_{i=1}^{c} P(\omega_i, X) = \sum_{i=1}^{c} P(X \mid \omega_i)\,P(\omega_i)$$

• Decision Rule: Choose the category with the highest a posteriori probability, calculated as above, using Bayes Rule.
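A minimal sketch of this rule in Python (the likelihood and prior values are hypothetical placeholders, not from the slides):

```python
# Bayes decision rule: compute posteriors from class-conditional likelihoods and
# priors, then choose the category with the highest a posteriori probability.
import numpy as np

def bayes_decision(likelihoods, priors):
    """likelihoods[i] = P(X | w_i), priors[i] = P(w_i); returns (chosen index, posteriors)."""
    joint = likelihoods * priors           # P(X | w_i) P(w_i)
    p_x = joint.sum()                      # total probability P(X)
    posteriors = joint / p_x               # P(w_i | X)
    return int(np.argmax(posteriors)), posteriors

# Hypothetical numbers: three classes, likelihoods evaluated at some observation X
k, post = bayes_decision(np.array([0.05, 0.20, 0.10]), np.array([0.5, 0.3, 0.2]))
print(k, post)   # class index 1 has the largest posterior here
```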
Then,

$$g_i(X) = P(\omega_i \mid X)$$

Decision boundary: $g_1 = g_2$, between region $R_1$ (where $g_1 > g_2$) and region $R_2$ (where $g_2 > g_1$); or, in general, decision boundaries are where $g_i(X) = g_j(X)$, between regions $R_i$ and $R_j$.

• Single feature: the decision boundary is a point
• 2 features: a curve
• 3 features: a surface
• More than 3: a hypersurface

Equivalent discriminant functions:

$$g_i(X) = P(X \mid \omega_i)\,P(\omega_i)
\qquad\text{or}\qquad
g_i(X) = \frac{P(X \mid \omega_i)\,P(\omega_i)}{P(X)}$$

• Sometimes it is easier to work with logarithms:

$$g_i(X) = \log[P(X \mid \omega_i)\,P(\omega_i)] = \log P(X \mid \omega_i) + \log P(\omega_i)$$

• Since the logarithmic function is monotonically increasing, the log form gives the same result.
2-Category Case ($c_1$, $c_2$):

Assign to $c_1$ (decision $\alpha_1$) if $P(\omega_1 \mid X) > P(\omega_2 \mid X)$; to $c_2$ (decision $\alpha_2$) if $P(\omega_2 \mid X) > P(\omega_1 \mid X)$.

But this is the same as: $c_1$ if

$$\frac{P(X \mid \omega_1)\,P(\omega_1)}{P(X)} > \frac{P(X \mid \omega_2)\,P(\omega_2)}{P(X)}$$

By throwing away the $P(X)$'s, we end up with: $c_1$ if

$$P(X \mid \omega_1)\,P(\omega_1) > P(X \mid \omega_2)\,P(\omega_2)$$

which is the same as the likelihood-ratio test:

$$\frac{P(X \mid \omega_1)}{P(X \mid \omega_2)} > \frac{P(\omega_2)}{P(\omega_1)} = k$$
Example: a single-feature, 2-category problem with Gaussian densities: diagnosis of diabetes using sugar count $X$.

$c_1$: state of being healthy, $P(c_1) = 0.7$
$c_2$: state of being sick (diabetes), $P(c_2) = 0.3$

$$P(X \mid c_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1}\,e^{-(X - m_1)^2 / 2\sigma_1^2}
\qquad
P(X \mid c_2) = \frac{1}{\sqrt{2\pi}\,\sigma_2}\,e^{-(X - m_2)^2 / 2\sigma_2^2}$$

[Figure: the two class-conditional densities along the $X$ axis, centered at $m_1$ and $m_2$, with the decision boundary $d$ between them.]

The decision rule: decide $c_1$ if

$$P(X \mid c_1)\,P(c_1) > P(X \mid c_2)\,P(c_2), \qquad\text{or}\qquad 0.7\,P(X \mid c_1) > 0.3\,P(X \mid c_2)$$

Assume now: $m_1 = 10$, $m_2 = 20$, $\sigma_1 = \sigma_2 = 2$, and we measured $X = 17$. Assign the unknown sample $X$ to the correct category.

Find the likelihood ratio for $X = 17$:

$$\frac{P(X \mid c_1)}{P(X \mid c_2)} = \frac{e^{-(X-10)^2/8}}{e^{-(X-20)^2/8}} = e^{-5} \approx 0.007$$

Compare with:

$$\frac{P(c_2)}{P(c_1)} = \frac{0.3}{0.7} \approx 0.43 > 0.007$$

So assign $X = 17$ to $c_2$.
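A quick numerical check of this worked example (using only the numbers given above):

```python
# Verify the diabetes example: m1 = 10, m2 = 20, sigma = 2, P(c1) = 0.7, P(c2) = 0.3, X = 17.
import math

def gaussian(x, m, s):
    return math.exp(-(x - m) ** 2 / (2 * s ** 2)) / (math.sqrt(2 * math.pi) * s)

x, s = 17.0, 2.0
ratio = gaussian(x, 10, s) / gaussian(x, 20, s)   # likelihood ratio P(X|c1)/P(X|c2)
threshold = 0.3 / 0.7                             # P(c2)/P(c1)
print(round(ratio, 4), round(threshold, 2))       # ~0.0067 vs ~0.43
print("c1" if ratio > threshold else "c2")        # -> c2, as on the slide
```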
Example: A discrete problem

Consider a 2-feature, 3-category case where

$$P(X_1, X_2 \mid c_i) = \begin{cases} \dfrac{1}{(b_i - a_i)^2} & \text{for } a_i \le X_1 \le b_i \text{ and } a_i \le X_2 \le b_i \\[4pt] 0 & \text{otherwise} \end{cases}$$

and

$$P(c_1) = 0.4,\quad P(c_2) = 0.4,\quad P(c_3) = 0.2$$
$$a_1 = -1,\ b_1 = 1; \qquad a_2 = 0.5,\ b_2 = 3.5; \qquad a_3 = 3,\ b_3 = 4$$

Find the decision boundaries and regions.

Solution: inside each square, compare $P(X \mid c_i)\,P(c_i)$:

$$P(X \mid c_1)\,P(c_1) = \tfrac{1}{4}(0.4) = 0.1, \qquad
P(X \mid c_2)\,P(c_2) = \tfrac{1}{9}(0.4) \approx 0.044, \qquad
P(X \mid c_3)\,P(c_3) = 1 \cdot 0.2 = 0.2$$

[Figure: the three squares in the $(X_1, X_2)$ plane; $R_1$ is the square $[-1,1]^2$ and $R_3$ is the square $[3,4]^2$, since class 1 wins where its square overlaps that of $c_2$ ($0.1 > 0.044$) and class 3 wins where its square overlaps that of $c_2$ ($0.2 > 0.044$); $R_2$ is the remainder of the square $[0.5, 3.5]^2$.]
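A short sketch that evaluates $P(X \mid c_i)P(c_i)$ for the three uniform densities at a few test points (the numbers are the slide's, with $a_1 = -1$ assumed as reconstructed above):

```python
# Evaluate g_i(X) = P(X|c_i) P(c_i) for the three uniform densities and report the winner.
import numpy as np

priors = [0.4, 0.4, 0.2]
bounds = [(-1.0, 1.0), (0.5, 3.5), (3.0, 4.0)]   # (a_i, b_i)

def g(x1, x2, i):
    a, b = bounds[i]
    inside = a <= x1 <= b and a <= x2 <= b
    return priors[i] / (b - a) ** 2 if inside else 0.0

for point in [(0.7, 0.7), (2.0, 2.0), (3.2, 3.2)]:
    scores = [g(*point, i) for i in range(3)]
    print(point, "-> class", int(np.argmax(scores)) + 1, scores)
# (0.7, 0.7) -> class 1 (0.1 beats 0.044)
# (2.0, 2.0) -> class 2
# (3.2, 3.2) -> class 3 (0.2 beats 0.044)
```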
Remember now that for the 2-class case: decide $c_1$ if

$$P(X \mid c_1)\,P(c_1) > P(X \mid c_2)\,P(c_2)
\qquad\text{or}\qquad
\underbrace{\frac{P(X \mid c_1)}{P(X \mid c_2)}}_{\text{likelihood ratio}} > \frac{P(c_2)}{P(c_1)} = k$$
Error Probabilities and a Simple Proof of Minimum Error

Consider again a 2-class, 1-d problem.

[Figure: the curves $P(X \mid c_1)P(c_1)$ and $P(X \mid c_2)P(c_2)$ over $X$; $d$ is their intersection point, with $R_1$ on one side and $R_2$ on the other.]

Let's show that if the decision boundary is $d$ (the intersection point) rather than any arbitrary point $d'$, then $P(E)$ (the probability of error) is minimum.

$$\begin{aligned}
P(E) &= P(X \in R_2, c_1) + P(X \in R_1, c_2) \\
&= P(X \in R_2 \mid c_1)\,P(c_1) + P(X \in R_1 \mid c_2)\,P(c_2) \\
&= \Big[\int_{R_2} P(X \mid c_1)\,dX\Big] P(c_1) + \Big[\int_{R_1} P(X \mid c_2)\,dX\Big] P(c_2) \\
&= \int_{R_2} P(X \mid c_1)\,P(c_1)\,dX + \int_{R_1} P(X \mid c_2)\,P(c_2)\,dX
\end{aligned}$$

It can very easily be seen that $P(E)$ is minimum if $d' = d$: moving the boundary away from the intersection point only adds area under the larger of the two curves to the error integral.
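A numerical sanity check of this claim, reusing the earlier diabetes numbers as an assumed illustration: sweep the 1-d threshold $d'$ and confirm that $P(E)$ is smallest where the two weighted densities intersect.

```python
# Sweep the decision threshold and locate the minimum-error point numerically.
import numpy as np

m1, m2, s, p1, p2 = 10.0, 20.0, 2.0, 0.7, 0.3
x = np.linspace(0.0, 30.0, 30001)
dx = x[1] - x[0]
f1 = p1 * np.exp(-(x - m1) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)
f2 = p2 * np.exp(-(x - m2) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)

def p_error(t):
    # R1 = {X < t}, R2 = {X >= t}: integrate the "wrong" curve over each region
    return f1[x >= t].sum() * dx + f2[x < t].sum() * dx

thresholds = np.linspace(12.0, 18.0, 601)
best = thresholds[int(np.argmin([p_error(t) for t in thresholds]))]

mask = (x > 12.0) & (x < 18.0)
crossing = x[mask][int(np.argmin(np.abs(f1 - f2)[mask]))]
print(round(best, 2), round(crossing, 2))   # both land near the same point (~15.3)
```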
Minimum Risk Classification

The risk associated with an incorrect decision might be more important than the probability of error, so our decision criterion might be modified to minimize the average risk in making an incorrect decision.

We define a conditional risk (expected loss) for decision $\alpha_i$ when $X$ occurs as:

$$R^i(X) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid X)$$

where $\lambda(\alpha_i \mid \omega_j)$ is defined as the conditional loss associated with decision $\alpha_i$ when the true class is $\omega_j$. It is assumed that $\lambda$ is known.

The decision rule: decide on $c_i$ if $R^i(X) < R^j(X)$ for all $1 \le j \le c$, $j \ne i$.

The discriminant function here can be defined as:

$$g_i(X) = -R^i(X)$$

• We can show that the minimum-error decision is a special case of the above rule where $\lambda(\alpha_i \mid \omega_i) = 0$ and $\lambda(\alpha_i \mid \omega_j) = 1$ for $j \ne i$; then,

$$R^i(X) = \sum_{j \ne i} P(\omega_j \mid X) = 1 - P(\omega_i \mid X)$$

so the rule is $\alpha_i$ if $1 - P(\omega_i \mid X) < 1 - P(\omega_j \mid X)$, i.e. $P(\omega_i \mid X) > P(\omega_j \mid X)$.
For the 2-category case, the minimum-risk classifier becomes:

$$R^1(X) = \lambda_{11}\,P(\omega_1 \mid X) + \lambda_{12}\,P(\omega_2 \mid X)$$
$$R^2(X) = \lambda_{21}\,P(\omega_1 \mid X) + \lambda_{22}\,P(\omega_2 \mid X)$$

Decide $\omega_1$ if $R^1(X) < R^2(X)$, that is, if

$$\lambda_{11}\,P(\omega_1 \mid X) + \lambda_{12}\,P(\omega_2 \mid X) < \lambda_{21}\,P(\omega_1 \mid X) + \lambda_{22}\,P(\omega_2 \mid X)$$
$$\Leftrightarrow\quad (\lambda_{21} - \lambda_{11})\,P(\omega_1 \mid X) > (\lambda_{12} - \lambda_{22})\,P(\omega_2 \mid X)$$
$$\Leftrightarrow\quad (\lambda_{21} - \lambda_{11})\,P(X \mid \omega_1)\,P(\omega_1) > (\lambda_{12} - \lambda_{22})\,P(X \mid \omega_2)\,P(\omega_2)$$

So decide $\omega_1$ if

$$\frac{P(X \mid \omega_1)}{P(X \mid \omega_2)} > \frac{(\lambda_{12} - \lambda_{22})}{(\lambda_{21} - \lambda_{11})} \cdot \frac{P(\omega_2)}{P(\omega_1)}$$

Otherwise, decide $\omega_2$. This is the same as the likelihood rule if $\lambda_{22} = \lambda_{11} = 0$ and $\lambda_{12} = \lambda_{21} = 1$.
Discriminant Functions So Far

For minimum error:
• $g_i(X) = P(\omega_i \mid X)$
• $g_i(X) = P(X \mid \omega_i)\,P(\omega_i)$
• $g_i(X) = \log P(X \mid \omega_i) + \log P(\omega_i)$

For minimum risk:
• $g_i(X) = -R^i(X)$, where $R^i(X) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid X)$

Bayes (Maximum Likelihood) Decision:
• Most general optimal solution
• Provides an upper limit (you cannot do better with another rule)
• Useful in comparing with other classifiers
Special Cases of Discriminant Functions

Multivariate Gaussian (Normal) Density $N(M, \Sigma)$:

The general density form:

$$P(X) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\,e^{-\frac{1}{2}(X - M)^T \Sigma^{-1} (X - M)}$$

Here $X$ is the feature vector of size $d$.
$M$: $d$-element mean vector, $E(X) = M = [\mu_1, \mu_2, \ldots, \mu_d]^T$
$\Sigma$: $d \times d$ covariance matrix, with

$$\sigma_{ij} = E[(X_i - \mu_i)(X_j - \mu_j)], \qquad \sigma_{ii} = E[(X_i - \mu_i)^2] = \sigma_i^2 \ \text{(variance of feature } X_i\text{)}$$

• $\Sigma$ is symmetric
• $\sigma_{ij} = 0$ when $X_i$ and $X_j$ are statistically independent
• $|\Sigma|$: determinant of $\Sigma$
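A small sketch that evaluates this density exactly as written above (the mean vector, covariance, and test point are made-up values):

```python
# Evaluate the multivariate normal density N(M, Sigma) at a point x.
import numpy as np

def mvn_density(x, mean, cov):
    d = mean.shape[0]
    diff = x - mean
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

mean = np.array([1.0, 2.0])
cov = np.array([[2.0, 0.3],
                [0.3, 1.0]])
print(mvn_density(np.array([1.5, 1.5]), mean, cov))
```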
General shape: the equal-density curves are hyper-ellipsoids, on which

$$(X - M)^T \Sigma^{-1} (X - M)$$

(the squared Mahalanobis distance) is constant.

2-d problem: $X = [X_1, X_2]^T$,

$$M = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad
\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{bmatrix}$$

[Figure: equal-density ellipses in the $(X_1, X_2)$ plane centered at $M$.]

If $\sigma_{12} = \sigma_{21} = 0$ (statistically independent features), then the major axes of the ellipsoids are parallel to the coordinate axes $X_1, X_2$; if in addition $\sigma_1^2 = \sigma_2^2$, the equal-density curves become circular.
In general, then, the equal-density curves are hyper-ellipsoids. Now

$$g_i(X) = \log_e P(X \mid \omega_i) + \log_e P(\omega_i)$$

is used for $N(M_i, \Sigma_i)$ because of its ease of manipulation:

$$g_i(X) = -\tfrac{1}{2}(X - M_i)^T \Sigma_i^{-1} (X - M_i) - \tfrac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$$

(the constant $-\tfrac{d}{2}\log 2\pi$ is the same for all classes and is dropped). $g_i(X)$ is a quadratic function of $X$, as will be shown. Expanding,

$$g_i(X) = -\tfrac{1}{2} X^T \Sigma_i^{-1} X + M_i^T \Sigma_i^{-1} X - \tfrac{1}{2} M_i^T \Sigma_i^{-1} M_i - \tfrac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$$

Define

$$W_i = -\tfrac{1}{2}\Sigma_i^{-1}, \qquad V_i = M_i^T \Sigma_i^{-1}, \qquad
W_{i0} = -\tfrac{1}{2} M_i^T \Sigma_i^{-1} M_i - \tfrac{1}{2}\log|\Sigma_i| + \log P(\omega_i) \ \text{(a scalar)}$$

Then,

$$g_i(X) = X^T W_i X + V_i X + W_{i0}$$

On the decision boundary, $g_i(X) = g_j(X)$:

$$X^T W_i X + V_i X + W_{i0} - \big(X^T W_j X + V_j X + W_{j0}\big) = 0$$
$$X^T (W_i - W_j) X + (V_i - V_j) X + (W_{i0} - W_{j0}) = 0$$
$$X^T W X + V X + W_0 = 0$$

The decision boundary function is hyperquadratic in general.
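A sketch of these quantities in Python, building $W_i$, $V_i$, $W_{i0}$ from Gaussian parameters and comparing two discriminants (the means, covariances, priors, and test point are assumed values):

```python
# Quadratic discriminant g_i(X) = X^T W_i X + V_i X + W_i0 for a Gaussian class.
import numpy as np

def quadratic_discriminant(mean, cov, prior):
    inv = np.linalg.inv(cov)
    W = -0.5 * inv
    V = mean @ inv
    W0 = -0.5 * mean @ inv @ mean - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior)
    return lambda x: x @ W @ x + V @ x + W0

g1 = quadratic_discriminant(np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5)
g2 = quadratic_discriminant(np.array([3.0, 3.0]), np.array([[2.0, 0.5], [0.5, 1.0]]), 0.5)
x = np.array([1.0, 2.0])
print(1 if g1(x) > g2(x) else 2)   # pick the class with the larger g_i(X)
```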
Example in 2-d. Let

$$W = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix}, \qquad
V = [v_1 \ \ v_2], \qquad
X = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

Then the above boundary becomes

$$[x_1 \ \ x_2]\begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
+ [v_1 \ \ v_2]\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + W_0 = 0$$

$$w_{11} x_1^2 + 2 w_{12} x_1 x_2 + w_{22} x_2^2 + v_1 x_1 + v_2 x_2 + W_0 = 0$$

which is the general form of a hyperquadratic boundary in 2-d.
The special cases of the Gaussian:

Assume $\Sigma_i = \sigma^2 I$, where $I$ is the unit matrix. Then $|\Sigma_i| = \sigma^{2d}$,

$$\Sigma_i = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix}, \qquad
\Sigma_i^{-1} = \frac{1}{\sigma^2}\,I$$

$$g_i(X) = -\frac{1}{2\sigma^2}(X - M_i)^T (X - M_i) - \frac{1}{2}\log \sigma^{2d} + \log P(\omega_i)$$

The term $\frac{1}{2}\log \sigma^{2d}$ is not a function of $X$ (and is the same for every class), so it can be removed:

$$g_i(X) = -\frac{1}{2\sigma^2}\,\|X - M_i\|^2 + \log P(\omega_i)$$

Now assume $P(\omega_i) = P(\omega_j)$; then

$$g_i(X) = -\frac{1}{2\sigma^2}\,\|X - M_i\|^2, \qquad d^2(X, M_i) = \|X - M_i\|^2$$

the squared Euclidean distance between $X$ and $M_i$. Then the decision boundary is linear!

Decision Rule: Assign the unknown sample to the closest mean's category.

[Figure: two means $M_1$ and $M_2$ with distances $d_1$ and $d_2$ to an unknown sample; the boundary $d$ is the perpendicular bisector of the segment $M_1 M_2$. If $P(\omega_i) \ne P(\omega_j)$, the bisector moves toward the less probable category.]
Minimum Distance Classifier

• Classify an unknown sample $X$ to the category with the closest mean!
• Optimum when the densities are Gaussian with equal variance and equal a priori probability.

[Figure: four means $M_1, M_2, M_3, M_4$ with their regions $R_1, R_2, R_3, R_4$.]

The boundary is piecewise linear in the case of more than 2 categories.
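A minimal sketch of the minimum distance classifier (the four means and the test point are hypothetical):

```python
# Assign X to the class whose mean is nearest in Euclidean distance.
import numpy as np

def min_distance_classify(x, means):
    d2 = [np.sum((x - m) ** 2) for m in means]   # squared Euclidean distances
    return int(np.argmin(d2))

means = [np.array([0.0, 0.0]), np.array([4.0, 0.0]),
         np.array([0.0, 4.0]), np.array([4.0, 4.0])]
print(min_distance_classify(np.array([3.1, 0.4]), means))   # -> 1 (closest to M2)
```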
• Another special case: it can be shown that when $\Sigma_i = \Sigma$ (the covariance matrices are the same for all classes), the samples fall in clusters of equal size and shape, and

$$g_i(X) = -\tfrac{1}{2}(X - M_i)^T \Sigma^{-1} (X - M_i) + \log P(\omega_i)$$

(the common term $-\tfrac{1}{2}\log|\Sigma|$ is dropped since it is the same for every class), where $(X - M_i)^T \Sigma^{-1} (X - M_i)$ is called the (squared) Mahalanobis distance of $X$ to $M_i$.

Then, if $P(\omega_i) = P(\omega_j)$, the decision rule is: decide $\omega_i$ if the Mahalanobis distance of the unknown sample to $M_i$ is smaller than the Mahalanobis distance of the unknown sample to $M_j$.

If $P(\omega_i) \ne P(\omega_j)$, the boundary moves toward the less probable one.
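A minimal sketch of this equal-covariance case, classifying by the smallest squared Mahalanobis distance (means, covariance, and test point are made-up values):

```python
# Classify by the smallest (squared) Mahalanobis distance to each class mean.
import numpy as np

def mahalanobis_classify(x, means, cov):
    inv = np.linalg.inv(cov)
    d2 = [(x - m) @ inv @ (x - m) for m in means]
    return int(np.argmin(d2)), d2

means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])
k, d2 = mahalanobis_classify(np.array([1.5, 1.0]), means, cov)
print(k, [round(float(v), 2) for v in d2])
```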
Binary Random Variables

• Discrete features: features can take only discrete values. Integrals are replaced by summations.
• Binary features: $X_i \in \{0, 1\}$, with

$$p_i = P(X_i = 1 \mid \omega_1), \qquad q_i = P(X_i = 1 \mid \omega_2)$$

so that $P(X_i = 0 \mid \omega_1) = 1 - p_i$, and similarly for $\omega_2$.
• Assume the binary features are statistically independent, where $X = [X_1, X_2, \ldots, X_d]^T$ and each $X_i$ is binary.
Example: bit-matrix for machine-printed characters.

[Figure: a $10 \times 10$ bit-matrix of the letters A and B; each cell (pixel) $X_i$ is 1 or 0.]

Here, each pixel may be taken as a feature $X_i$, so for the above problem we have $d = 10 \times 10 = 100$, and $p_i$ is the probability that $X_i = 1$ for letter A, B, ...

$$P(x_i \mid \omega_1) = p_i^{x_i}(1 - p_i)^{1 - x_i}, \qquad \text{defined for } x_i \in \{0, 1\},\ \text{undefined elsewhere}$$

$$P(X \mid \omega_1) = \prod_{i=1}^{d} P(x_i \mid \omega_1) = \prod_{i=1}^{d} p_i^{x_i}(1 - p_i)^{1 - x_i}$$

$$g_k(X) = \log P(X \mid \omega_k) + \log P(\omega_k) = \sum_{i=1}^{d} x_i \log p_i + \sum_{i=1}^{d} (1 - x_i)\log(1 - p_i) + \log P(\omega_k)$$

• if statistical independence of the features is assumed.
• Consider the 2-category problem; assume:

$$p_i = P(x_i = 1 \mid \omega_1), \qquad q_i = P(x_i = 1 \mid \omega_2)$$

Then the decision boundary is:

$$\sum_i x_i \log p_i + \sum_i (1 - x_i)\log(1 - p_i) - \sum_i x_i \log q_i - \sum_i (1 - x_i)\log(1 - q_i) + \log P(\omega_1) - \log P(\omega_2) = 0$$

So if

$$\sum_i x_i \log\frac{p_i}{q_i} + \sum_i (1 - x_i)\log\frac{1 - p_i}{1 - q_i} + \log\frac{P(\omega_1)}{P(\omega_2)} > 0$$

decide category 1, else category 2.

$$g(X) = \sum_i W_i x_i + W_0$$

is linear in $X$: the decision boundary is a weighted sum of the inputs, where

$$W_i = \ln\frac{p_i(1 - q_i)}{q_i(1 - p_i)} \qquad\text{and}\qquad W_0 = \sum_i \ln\frac{1 - p_i}{1 - q_i} + \ln\frac{P(\omega_1)}{P(\omega_2)}$$
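A minimal sketch of this binary-feature linear discriminant; the $p_i$, $q_i$, and priors below are made-up values for a 4-pixel toy problem.

```python
# g(X) = sum_i W_i x_i + W_0, with W_i = ln[p_i(1-q_i) / (q_i(1-p_i))]
# and W_0 = sum_i ln[(1-p_i)/(1-q_i)] + ln[P(w1)/P(w2)].
import numpy as np

p = np.array([0.9, 0.8, 0.2, 0.1])   # P(x_i = 1 | w1)
q = np.array([0.2, 0.3, 0.7, 0.8])   # P(x_i = 1 | w2)
P1, P2 = 0.5, 0.5

W = np.log(p * (1 - q) / (q * (1 - p)))
W0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)

x = np.array([1, 1, 0, 0])           # an observed binary feature vector
g = W @ x + W0
print("category 1" if g > 0 else "category 2", round(float(g), 3))
```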