Review, Unrestricted notes: Discriminant idea & details (review quickly)

Normal Density:
f(y) = (2π)^{-1/2} (σ^2)^{-1/2} exp( -0.5 (y-μ)^2 / σ^2 )
Multivariate vector Y = (y_1, y_2)' (for n = 2 elements), Y ~ N(μ, Σ). For example,

    μ = [ 0 ]      Σ = [  1   .8 ]
        [ 0 ],         [ .8    1 ]
Multivariate normal density:
φ(Y) = (2π)^{-n/2} |Σ|^{-1/2} exp( -0.5 (Y-μ)'Σ^{-1}(Y-μ) )
The larger the scalar φ(Y), the more likely we are to observe vector Y.
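As a quick illustration of the density formula, here is a minimal PROC IML sketch (not from the notes; the module name mvnpdf, the variable names, and the evaluation points are my own) that evaluates φ(Y) for the bivariate example above and shows that a point at the mean gets a larger density than a point far from it:

PROC IML;
* bivariate example from above: mean (0,0)` and correlation .8;
pi = 4*atan(1);
mu = {0, 0};
Sigma = {1 .8, .8 1};
n = nrow(mu);
start mvnpdf(Y) global(pi, mu, Sigma, n);
   d = Y - mu;                              * deviation from the mean;
   return( (2*pi)##(-n/2) * det(Sigma)##(-1/2) * exp(-0.5 * d`*inv(Sigma)*d) );
finish;
phi1 = mvnpdf({0, 0});                      * at the mean;
phi2 = mvnpdf({2, -2});                     * far from the mean;
print phi1 phi2;                            * phi1 > phi2, so Y = (0,0)` is the more likely observation;
quit;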
Fisher linear discriminant function:
k: number of multivariate normal (sub)populations.
n: number of features (measurements, observations) for each individual = number of elements in the vector Y of observations.
Y: the n×1 vector of observations on an individual to be classified.
D^2: the squared Mahalanobis distance.
F: the Fisher Linear Discriminant function.
p_i: the proportion (prior probability) of our big population that is from subpopulation i.
φ_i: the multivariate normal density for subpopulation i, mean μ_i, variance Σ_i.
Goal: We want Pr{ pop i | Y} where Y is a multivariate vector of measurements on an
individual and we have i=1,2,…,k populations from which to choose. We will assume
that the Y’s from each population have a multivariate normal density φ_i.
(1) Bayes’ Rule { relates Pr{A|B} to Pr{B|A} }
Pr{ pop i | Y } = Pr{ pop i and Y } / Pr{Y} = [ p_i Pr{ Y | pop i } ] / [ ∑_{j=1}^{k} p_j Pr{ Y | pop j } ]
= [ p_i φ_i ] / [ ∑_{j=1}^{k} p_j φ_j ]  *
(2) Simplest case: all populations equally likely (common prior p) and all have the same covariance matrix Σ.
For population j we have
p_j φ_j = p (2π)^{-n/2} |Σ|^{-1/2} exp( -0.5 D_j^2 ) = p (2π)^{-n/2} |Σ|^{-1/2} exp( -0.5 (Y-μ_j)'Σ^{-1}(Y-μ_j) )
        = [ p (2π)^{-n/2} |Σ|^{-1/2} exp( -0.5 Y'Σ^{-1}Y ) ] exp( F_j )
where
(a) F_j = -0.5 μ_j'Σ^{-1}μ_j + μ_j'Σ^{-1}Y = a_j + b_j'Y = the “Fisher Linear
Discriminant Function.” The larger it is, the more likely it is that population j
would produce an observation like Y.
(b) [ p (2)-n/2||-1/2 exp( -0.5 Y’-1Y)] is the same for all populations.

(c) D_j^2 = squared Mahalanobis distance = (Y-μ_j)'Σ^{-1}(Y-μ_j)
    = Y'Σ^{-1}Y - 2 μ_j'Σ^{-1}Y + μ_j'Σ^{-1}μ_j          (since Y'Σ^{-1}μ_j = μ_j'Σ^{-1}Y)
    = Y'Σ^{-1}Y - 2[ -0.5 μ_j'Σ^{-1}μ_j + μ_j'Σ^{-1}Y ] = Y'Σ^{-1}Y - 2 F_j .
The larger this is, the smaller is F_j and hence the less likely it is that
population j would produce an observation like Y. It is negatively related
to the discriminant function and hence acts like a distance.
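A quick numerical check of the identity in (c), as an IML sketch (variable names are mine; the Σ, μ_1, and Y values are taken from the worked example below):

PROC IML;
S  = {7.5 7.5 6.25, 7.5 25 12.5, 6.25 12.5 31.25};   * Sigma from the example below;
SI = inv(S);
m1 = {2,-1,1};                                       * mu_1 from the example below;
Y  = {1,2,3};                                        * observation from the example below;
F1    = -0.5*m1`*SI*m1 + m1`*SI*Y;                   * Fisher linear discriminant for pop 1;
D1sq  = (Y-m1)`*SI*(Y-m1);                           * squared Mahalanobis distance;
check = Y`*SI*Y - 2*F1;                              * identity in (c): check should equal D1sq;
print F1 D1sq check;
quit;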
(3) Note that anything (like item (b)) in common to the numerator and denominator of
Pr{ pop i | Y } = [ p_i φ_i ] / [ ∑_{j=1}^{k} p_j φ_j ] will cancel out, so in this simplest case (equal priors
p and equal covariance matrices), the factor [ p (2π)^{-n/2} |Σ|^{-1/2} exp( -0.5 Y'Σ^{-1}Y ) ], which is
constant across the populations, is eliminated, leaving the simpler expression
Pr{ pop i | Y } = exp( F_i ) / [ ∑_{j=1}^{k} exp( F_j ) ]. Compute exp(F_i) for i = 1, …, k and divide each by
the sum to get (posterior) probabilities for each population for the case of equal priors
and covariance matrices.
(4) The Fisher Linear Discriminant function is seen to be everything in ln( p_i φ_i ) that
changes with i. When p_i and Σ_i change with i, only the (2π)^{-n/2} term is constant, so we have:
F_i = [ -0.5 μ_i'Σ_i^{-1}μ_i + ln(p_i) - 0.5 ln|Σ_i| ] - 0.5 Y'Σ_i^{-1}Y + μ_i'Σ_i^{-1}Y
Note that ln(p_i) is omitted if p is constant, and -0.5 ln|Σ_i| - 0.5 Y'Σ_i^{-1}Y is omitted if there
is a common variance-covariance matrix. When the Y'Σ_i^{-1}Y term appears, for obvious reasons, F
is called Fisher’s Quadratic Discriminant Function. Only the intercept
[ -0.5 μ_i'Σ_i^{-1}μ_i + ln(p_i) - 0.5 ln|Σ_i| ] is affected by unequal p_i.
Example: equal priors and covariance matrices

         [ 7.5    7.5    6.25  ]              [  0.2    -0.05    -0.02  ]
    Σ =  [ 7.5   25     12.5   ],    Σ^{-1} = [ -0.05    0.0625  -0.015 ]
         [ 6.25  12.5   31.25  ]              [ -0.02   -0.015    0.042 ]

    μ_1 = ( 2, -1, 1 )',   μ_2 = ( -2, 0, 1 )',   μ_3 = ( 1, -1, 1 )'.

Classify the observation Y = ( 1, 2, 3 )'.

** Class notes example **;
ods html close; ods listing;
/* PROC IML syntax notes:
   { }  curly brackets  - matrix literal; commas delineate rows
   ( )  round brackets  - function arguments
   [ ]  square brackets - row and column operations, e.g. sum A[+, ],
        select elements A[1,2], or submatrices A[1,2:5] etc.
   //   join one on top of the other (vertical concatenation)
   ||   join side by side (horizontal concatenation)
   `    backquote (upper left on typical keyboard) -> transpose
*/
PROC IML;
S = {7.5 7.5 6.25, 7.5 25 12.5, 6.25 12.5 31.25};  * common covariance matrix Sigma;
IN = inv(S);                                       * Sigma inverse;
m1 = {2,-1,1}; m2 = {-2,0,1}; m3 = {1,-1,1};       * the three population mean vectors;
print S IN m1 m2 m3;
D1 = -0.5*m1`*IN*m1 || (m1`*IN);                   * row [a_1, b_1`] for population 1;
D2 = -0.5*m2`*IN*m2 || (m2`*IN);
D3 = -0.5*m3`*IN*m3 || (m3`*IN);
D = D1//D2//D3;                                    * one row [a_j, b_j`] per population;
Y = {1,2,3};                                       * observation to classify;
discriminant = D*({1}//Y);                         * F_j = a_j + b_j`*Y;
print D Y discriminant;
expdisc = exp(discriminant);
prob = expdisc/expdisc[+, ];                       * divide each exp(F_j) by the sum of the exp(F_k);
print discriminant expdisc prob;
quit;

             S                               IN
  7.5     7.5     6.25          0.2     -0.05    -0.02
  7.5    25      12.5          -0.05     0.0625  -0.015
  6.25   12.5    31.25         -0.02    -0.015    0.042

  m1    m2    m3
   2    -2     1
  -1     0    -1
   1     1     1

               D                      Y    discriminant
 -0.52725   0.43  -0.1775   0.017     1      -0.40125
 -0.461    -0.42   0.085    0.082     2      -0.465
 -0.19725   0.23  -0.1275   0.037     3      -0.11125

 discriminant    expdisc       prob
   -0.40125     0.6694827   0.3053746
   -0.465       0.6281351   0.2865145
   -0.11125     0.894715    0.408111
Population 3 has the highest probability density at Y and hence the highest posterior
probability of having produced Y. The pdf of Y in population j is seen above to be C
times exp(F_j), where C is the same for all populations. Thus if we compute the quantity
exp(F_i) / ∑_{j} exp(F_j) for i = 1, 2, 3, these will be 3 probabilities that add to 1 and are in the
proper ratio. If we think in Bayesian terms of equal prior probabilities that an observed
vector comes from population j, then we have computed the posterior probability of being
from group j given the observed Y.
For example, e^{-0.11125} / ( e^{-0.40125} + e^{-0.465} + e^{-0.11125} ) = 0.408111.
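The same arithmetic as a short IML check, reusing the three discriminant values printed above (the main program already produces prob; this just redoes the final step):

PROC IML;
F = {-0.40125, -0.465, -0.11125};   * the three discriminant values F_i printed above;
print (exp(F)/sum(exp(F)));         * 0.3053746  0.2865145  0.408111;
quit;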
* In [ pi i ]/ [j=1,k pj j ] and elsewhere we are ignoring the fact that (Y) = 0 for any particular Y ( is the
pdf of a continuous random variable). The arguments can be made rigorous in the usual limit way –
establish a small interval (square, cube, hypercube) around point Y that shrinks toward Y.