11.7 Fisher's Method for Discriminating Among Several Populations
1. Separation:
Suppose there are $g$ populations, with samples

$$X_1, X_2, \ldots, X_{n_1}: \text{population } 1,$$
$$X_{n_1+1}, X_{n_1+2}, \ldots, X_{n_1+n_2}: \text{population } 2,$$
$$\vdots$$
$$X_{n_1+\cdots+n_{g-1}+1}, \ldots, X_{n_T}: \text{population } g,$$

where $n_1 + n_2 + \cdots + n_g = n_T$. Let $\bar{X}_j$ be the sample mean for population $j$, $j = 1, \ldots, g$, and

$$\bar{X} = \frac{\sum_{i=1}^{n_T} X_i}{n_T}.$$
The sample between-groups matrix is

$$B = \sum_{j=1}^{g} n_j (\bar{X}_j - \bar{X})(\bar{X}_j - \bar{X})^t.$$
Thus,

$$a^t B a = \sum_{j=1}^{g} n_j \, a^t (\bar{X}_j - \bar{X})(\bar{X}_j - \bar{X})^t a = \sum_{j=1}^{g} n_j (a^t \bar{X}_j - a^t \bar{X})(\bar{X}_j^t a - \bar{X}^t a) = \sum_{j=1}^{g} n_j (\bar{Y}_j - \bar{Y})^2,$$

where $Y_i = a^t X_i$, $i = 1, \ldots, n_T$, and $\bar{Y}_j$ is the mean of the transformed observations from the $j$'th population, $j = 1, \ldots, g$; for example,

$$\bar{Y}_1 = \frac{\sum_{i=1}^{n_1} Y_i}{n_1} \quad \text{and} \quad \bar{Y} = \frac{\sum_{i=1}^{n_T} Y_i}{n_T}.$$
The sample within-groups matrix $W$ is

$$W = \sum_{i=1}^{n_1} (X_i - \bar{X}_1)(X_i - \bar{X}_1)^t + \sum_{i=n_1+1}^{n_1+n_2} (X_i - \bar{X}_2)(X_i - \bar{X}_2)^t + \cdots + \sum_{i=n_1+\cdots+n_{g-1}+1}^{n_T} (X_i - \bar{X}_g)(X_i - \bar{X}_g)^t.$$
Thus,

$$a^t W a = \sum_{i=1}^{n_1} a^t (X_i - \bar{X}_1)(X_i - \bar{X}_1)^t a + \cdots + \sum_{i=n_1+\cdots+n_{g-1}+1}^{n_T} a^t (X_i - \bar{X}_g)(X_i - \bar{X}_g)^t a = \sum_{i=1}^{n_1} (Y_i - \bar{Y}_1)^2 + \sum_{i=n_1+1}^{n_1+n_2} (Y_i - \bar{Y}_2)^2 + \cdots + \sum_{i=n_1+\cdots+n_{g-1}+1}^{n_T} (Y_i - \bar{Y}_g)^2.$$
Note:

$$\frac{a^t W a}{n_T - g} = \frac{\sum_{i=1}^{n_1} (Y_i - \bar{Y}_1)^2 + \sum_{i=n_1+1}^{n_1+n_2} (Y_i - \bar{Y}_2)^2 + \cdots + \sum_{i=n_1+\cdots+n_{g-1}+1}^{n_T} (Y_i - \bar{Y}_g)^2}{n_T - g}$$

is the pooled variance estimate based on $Y_1, Y_2, \ldots, Y_{n_T}$.
Similarly,

$$S_{pooled} = \frac{W}{n_T - g} = \frac{\sum_{i=1}^{n_1} (X_i - \bar{X}_1)(X_i - \bar{X}_1)^t + \cdots + \sum_{i=n_1+\cdots+n_{g-1}+1}^{n_T} (X_i - \bar{X}_g)(X_i - \bar{X}_g)^t}{n_T - g}$$

is the pooled covariance estimate based on $X_1, X_2, \ldots, X_{n_T}$.
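Since the sample covariance matrix $S_j$ of group $j$ divides by $n_j - 1$, each within-group sum of squares equals $(n_j - 1) S_j$, so $W = (n_1 - 1) S_1 + \cdots + (n_g - 1) S_g$; this identity is exactly what the commands sum1=49*var(ir[1:50,]), etc., near the end of this section exploit. A minimal Splus/R sketch of the identity with two hypothetical 3-observation groups:

>x1mat=matrix(c(-2,0,-1,5,3,1),ncol=2)  # hypothetical group 1: 3 observations, 2 variables
>x2mat=matrix(c(0,2,1,6,4,2),ncol=2)    # hypothetical group 2
>w=2*var(x1mat)+2*var(x2mat)            # W = (n1-1)*S1 + (n2-1)*S2, here nj = 3
>spool=w/(3+3-2)                        # S_pooled = W/(nT - g) with nT = 6, g = 2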
We now introduce Fisher's linear discriminant method for several populations.

Fisher's discriminant method for several populations is as follows:

Find the vector $\hat{a}_1$ maximizing the separation function

$$S(a) = \frac{a^t B a}{a^t W a} = \frac{\sum_{j=1}^{g} n_j (\bar{Y}_j - \bar{Y})^2}{\sum_{i=1}^{n_1} (Y_i - \bar{Y}_1)^2 + \sum_{i=n_1+1}^{n_1+n_2} (Y_i - \bar{Y}_2)^2 + \cdots + \sum_{i=n_1+\cdots+n_{g-1}+1}^{n_T} (Y_i - \bar{Y}_g)^2},$$

subject to $\hat{a}_1^t S_{pooled} \hat{a}_1 = 1$. The linear combination $\hat{a}_1^t X$ is called the sample first discriminant. (A numerical illustration of $S(a)$ is given after the note below.)
Find the vector $\hat{a}_2$ maximizing the separation function $S(a)$ subject to $\hat{a}_2^t S_{pooled} \hat{a}_2 = 1$ and $\hat{a}_2^t S_{pooled} \hat{a}_1 = 0$.

$\vdots$

Find the vector $\hat{a}_s$ maximizing the separation function $S(a)$ subject to $\hat{a}_s^t S_{pooled} \hat{a}_s = 1$ and $\hat{a}_s^t S_{pooled} \hat{a}_l = 0$, $l < s$.
Note: $\hat{a}_j^t S_{pooled} \hat{a}_j$ is the estimate of $Var(\hat{a}_j^t X)$, $j = 1, \ldots, s$, and $\hat{a}_j^t S_{pooled} \hat{a}_l$, $j \neq l$, is the estimate of $Cov(\hat{a}_j^t X, \hat{a}_l^t X)$. The condition $\hat{a}_j^t S_{pooled} \hat{a}_l = 0$ is similar to the orthogonality condition imposed in principal component analysis.
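Given $B$ and $W$, the separation achieved by any candidate vector $a$ is easy to evaluate numerically. A minimal Splus/R sketch (the function name sep.fun is hypothetical; the matrices are taken from the worked example later in this section):

>sep.fun=function(a,b,w){c(t(a)%*%b%*%a)/c(t(a)%*%w%*%a)}  # S(a) = a'Ba / a'Wa
>sep.fun(c(1,0),matrix(c(6,3,3,62),2,2),matrix(c(6,-2,-2,24),2,2))  # S(a) = 6/6 = 1 for a = (1,0)

The maximum of $S(a)$ over all $a$ is the largest eigenvalue of $W^{-1}B$ (here 2.867), attained at $\hat{a}_1$.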
Intuitively, $S(a)$ measures the difference among the transformed means, reflected by $\sum_{j=1}^{g} n_j (\bar{Y}_j - \bar{Y})^2$, relative to the random variation of the transformed data, reflected by $\sum_{i=1}^{n_1} (Y_i - \bar{Y}_1)^2 + \sum_{i=n_1+1}^{n_1+n_2} (Y_i - \bar{Y}_2)^2 + \cdots + \sum_{i=n_1+\cdots+n_{g-1}+1}^{n_T} (Y_i - \bar{Y}_g)^2$. As long as the transformed observations $Y_1, \ldots, Y_{n_1}$ (population 1), $Y_{n_1+1}, \ldots, Y_{n_1+n_2}$ (population 2), $\ldots$, $Y_{n_1+\cdots+n_{g-1}+1}, \ldots, Y_{n_T}$ (population $g$) are well separated, $\sum_{j=1}^{g} n_j (\bar{Y}_j - \bar{Y})^2$ should be large even after the random variation of the transformed data is taken into account.
Important result: Let $e_1, e_2, \ldots, e_s$ be the orthonormal eigenvectors of $W^{-1/2} B W^{-1/2}$ corresponding to the eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_s > 0$. Then $\hat{a}_j = S_{pooled}^{-1/2} e_j$, $j = 1, \ldots, s$, where $S_{pooled}^{-1/2} S_{pooled}^{-1/2} = S_{pooled}^{-1}$.
The following important result provides another way to obtain the discriminants.

Important result: Let $e_1, e_2, \ldots, e_s$ be the eigenvectors of $W^{-1} B$ corresponding to the eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_s > 0$. Then $\hat{a}_j$, $j = 1, \ldots, s$, are the scaled eigenvectors satisfying $\hat{a}_j^t S_{pooled} \hat{a}_j = 1$; that is,

$$\hat{a}_j = \frac{e_j}{\sqrt{e_j^t S_{pooled} e_j}}.$$
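The two results can be checked against each other numerically. A minimal Splus/R sketch, using the matrices $B$ and $W$ from the worked example later in this section; the symmetric matrix $S_{pooled}^{-1/2} B S_{pooled}^{-1/2}$ is proportional to $W^{-1/2} B W^{-1/2}$, so both have the same orthonormal eigenvectors:

>b=matrix(c(6,3,3,62),2,2)                   # B from the worked example below
>w=matrix(c(6,-2,-2,24),2,2)                 # W from the worked example below
>spool=w/(9-3)                               # S_pooled with nT = 9, g = 3
>es=eigen(spool)                             # spectral decomposition of symmetric S_pooled
>sroot=es$vectors%*%diag(1/sqrt(es$values))%*%t(es$vectors)  # S_pooled^{-1/2}
>e1=eigen(sroot%*%b%*%sroot)$vectors[,1]     # first orthonormal eigenvector
>a1.route1=sroot%*%e1                        # first result: a1hat = S_pooled^{-1/2} e1
>f1=eigen(solve(w)%*%b)$vectors[,1]          # first eigenvector of W^{-1}B
>a1.route2=f1/sqrt(c(t(f1)%*%spool%*%f1))    # second result: scale so a1' S_pooled a1 = 1
>cbind(a1.route1,a1.route2)                  # both give (0.386, 0.495), up to sign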
2. Classification:
Fisher's classification method for several populations is as follows: for an observation $X_0$, Fisher's classification procedure based on the first $r \leq s$ sample discriminants is to allocate $X_0$ to population $l$ if

$$\sum_{j=1}^{r} (\hat{Y}_j - \bar{Y}_{lj})^2 = \sum_{j=1}^{r} \left[ \hat{a}_j^t (X_0 - \bar{X}_l) \right]^2 \leq \sum_{j=1}^{r} \left[ \hat{a}_j^t (X_0 - \bar{X}_i) \right]^2 = \sum_{j=1}^{r} (\hat{Y}_j - \bar{Y}_{ij})^2, \quad i \neq l,$$

where $\hat{Y}_j = \hat{a}_j^t X_0$ and $\bar{Y}_{ij} = \hat{a}_j^t \bar{X}_i$, $j = 1, \ldots, r$; $i = 1, \ldots, g$.
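In code, this is a nearest-mean rule in the transformed space. A minimal Splus/R sketch (the function name fisher.classify and its argument layout are hypothetical): amat is a $p \times r$ matrix whose columns are $\hat{a}_1, \ldots, \hat{a}_r$, and xbars is a $p \times g$ matrix whose columns are $\bar{X}_1, \ldots, \bar{X}_g$.

>fisher.classify=function(x0,amat,xbars){
+ yhat=as.vector(t(amat)%*%x0)              # (Yhat_1, ..., Yhat_r)
+ ybars=t(amat)%*%xbars                     # column i = transformed mean of population i
+ d=apply((ybars-yhat)^2,2,sum)             # squared distances to each transformed mean
+ order(d)[1]}                              # allocate x0 to the closest population

Applied to the worked example below (with the discriminants and means computed there), this returns population 2 for $x_0 = (1, 3)^t$.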
Intuition of Fisher's method:

[Diagram: in $R^p$, the sample means $\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_g$ and a new observation $X_0$ are mapped by the discriminants $l_j(X) = \hat{a}_j^t X$, $j = 1, \ldots, r$, to the points $(\bar{Y}_{11}, \ldots, \bar{Y}_{1r})$, $(\bar{Y}_{21}, \ldots, \bar{Y}_{2r})$, $\ldots$, $(\bar{Y}_{g1}, \ldots, \bar{Y}_{gr})$ and $(\hat{Y}_1, \ldots, \hat{Y}_r)$ in $R^r$.]

$\sum_{j=1}^{r} (\hat{Y}_j - \bar{Y}_{1j})^2$: the "total" squared distance between the transformed $X_0$, $(\hat{Y}_1, \ldots, \hat{Y}_r)$, and the transformed mean of population 1, $(\bar{Y}_{11}, \ldots, \bar{Y}_{1r})$.
$\sum_{j=1}^{r} (\hat{Y}_j - \bar{Y}_{2j})^2$: the "total" squared distance between the transformed $X_0$, $(\hat{Y}_1, \ldots, \hat{Y}_r)$, and the transformed mean of population 2, $(\bar{Y}_{21}, \ldots, \bar{Y}_{2r})$.

$\vdots$
$\sum_{j=1}^{r} (\hat{Y}_j - \bar{Y}_{gj})^2$: the "total" squared distance between the transformed $X_0$, $(\hat{Y}_1, \ldots, \hat{Y}_r)$, and the transformed mean of population $g$, $(\bar{Y}_{g1}, \ldots, \bar{Y}_{gr})$.
The inequalities $\sum_{j=1}^{r} (\hat{Y}_j - \bar{Y}_{lj})^2 \leq \sum_{j=1}^{r} (\hat{Y}_j - \bar{Y}_{ij})^2$, $i \neq l$, imply that the total distance between the transformed $X_0$ and the transformed mean of population $l$ is smaller than the distance between the transformed $X_0$ and the transformed mean of any other population. In some sense, $X_0$ is "closer" to population $l$ than to the other populations. Therefore, $X_0$ is allocated to population $l$.
Example:

$$X_1 = \begin{bmatrix} x_1^t \\ x_2^t \\ x_3^t \end{bmatrix} = \begin{bmatrix} -2 & 5 \\ 0 & 3 \\ -1 & 1 \end{bmatrix}, \; n_1 = 3; \quad X_2 = \begin{bmatrix} x_4^t \\ x_5^t \\ x_6^t \end{bmatrix} = \begin{bmatrix} 0 & 6 \\ 2 & 4 \\ 1 & 2 \end{bmatrix}, \; n_2 = 3; \quad X_3 = \begin{bmatrix} x_7^t \\ x_8^t \\ x_9^t \end{bmatrix} = \begin{bmatrix} 1 & -2 \\ 0 & 0 \\ -1 & -4 \end{bmatrix}, \; n_3 = 3.$$
Then

$$\bar{x}_1 = \begin{bmatrix} -1 \\ 3 \end{bmatrix}, \quad \bar{x}_2 = \begin{bmatrix} 1 \\ 4 \end{bmatrix}, \quad \bar{x}_3 = \begin{bmatrix} 0 \\ -2 \end{bmatrix}, \quad \bar{x} = \begin{bmatrix} 0 \\ 5/3 \end{bmatrix},$$

$$B = \sum_{j=1}^{3} 3 (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^t = \begin{bmatrix} 6 & 3 \\ 3 & 62 \end{bmatrix},$$

and

$$W = \sum_{j=1}^{3} (x_j - \bar{x}_1)(x_j - \bar{x}_1)^t + \sum_{j=4}^{6} (x_j - \bar{x}_2)(x_j - \bar{x}_2)^t + \sum_{j=7}^{9} (x_j - \bar{x}_3)(x_j - \bar{x}_3)^t = \begin{bmatrix} 6 & -2 \\ -2 & 24 \end{bmatrix}.$$
Further,

$$S_{pooled} = \frac{W}{n_1 + n_2 + n_3 - 3} = \begin{bmatrix} 1 & -1/3 \\ -1/3 & 4 \end{bmatrix}, \qquad W^{-1} B = \begin{bmatrix} 1.07143 & 1.4 \\ 0.21429 & 2.7 \end{bmatrix}.$$
The eigenvalues and eigenvectors of $W^{-1} B$ are

$$\lambda_1 = 2.867, \; e_1 = \begin{bmatrix} 0.7183 \\ 0.9213 \end{bmatrix}; \qquad \lambda_2 = 0.9043, \; e_2 = \begin{bmatrix} 0.9929 \\ -0.11842 \end{bmatrix}.$$

Thus,

$$\hat{a}_1 = \frac{e_1}{\sqrt{e_1^t S_{pooled} e_1}} = \frac{1}{\sqrt{3.47}} \begin{bmatrix} 0.7183 \\ 0.9213 \end{bmatrix} = \begin{bmatrix} 0.386 \\ 0.495 \end{bmatrix}; \qquad \hat{a}_2 = \frac{e_2}{\sqrt{e_2^t S_{pooled} e_2}} = \frac{1}{\sqrt{1.12}} \begin{bmatrix} 0.9929 \\ -0.11842 \end{bmatrix} = \begin{bmatrix} 0.938 \\ -0.112 \end{bmatrix}.$$

Therefore,

$$\hat{y}_1 = \hat{a}_1^t x = 0.386 x_1 + 0.495 x_2; \qquad \hat{y}_2 = \hat{a}_2^t x = 0.938 x_1 - 0.112 x_2.$$
1
To classify a new observation x0    , we need to compute
3
yˆ1  aˆ1t x0  1.87, yˆ 2  aˆ 2t x0  0.60 ,
y11  aˆ1t x1  1.10, y12  aˆ2t x1  1.27 ,
7
y21  aˆ1t x2  2.37, y22  aˆ2t x2  0.49 ,
y31  aˆ1t x3  0.99, y32  aˆ 2t x3  0.22 .
Since

$$\sum_{j=1}^{2} (\hat{y}_j - \bar{y}_{1j})^2 = (1.87 - 1.10)^2 + (0.60 + 1.27)^2 = 4.09,$$
$$\sum_{j=1}^{2} (\hat{y}_j - \bar{y}_{2j})^2 = (1.87 - 2.37)^2 + (0.60 - 0.49)^2 = 0.26,$$
$$\sum_{j=1}^{2} (\hat{y}_j - \bar{y}_{3j})^2 = (1.87 + 0.99)^2 + (0.60 - 0.22)^2 = 8.32,$$

and the second distance is the smallest, the observation $x_0 = \begin{bmatrix} 1 \\ 3 \end{bmatrix}$ is allocated to population 2.
Useful Splus Commands:

># ir: 150 x 4 iris data matrix, 50 observations per population
>xmean1=apply(ir[1:50,],2,mean)    # sample mean of population 1
>xmean2=apply(ir[51:100,],2,mean)  # sample mean of population 2
>xmean3=apply(ir[101:150,],2,mean) # sample mean of population 3
>xmean=apply(ir,2,mean)            # overall sample mean
>b1<-50*(xmean1-xmean)%*%t(xmean1-xmean)  # n1*(xbar1 - xbar)(xbar1 - xbar)'
>b2<-50*(xmean2-xmean)%*%t(xmean2-xmean)  # n2*(xbar2 - xbar)(xbar2 - xbar)'
>b3<-50*(xmean3-xmean)%*%t(xmean3-xmean)  # n3*(xbar3 - xbar)(xbar3 - xbar)'
>b<-b1+b2+b3                       # between-groups matrix B
>sum1=49*var(ir[1:50,])            # within-group sum of squares for population 1
>sum2=49*var(ir[51:100,])          # within-group sum of squares for population 2
>sum3=49*var(ir[101:150,])         # within-group sum of squares for population 3
>w<-sum1+sum2+sum3                 # within-groups matrix W
>invw<-solve(w)                    # W^{-1}
>spool=w/(50+50+50-3)              # S_pooled = W/(nT - g)
>evectors=eigen(invw%*%b)$vectors  # eigenvectors e1, e2, e3, e4 of W^{-1}B
>a1hat=evectors[,1]/sqrt(t(evectors[,1])%*%spool%*%evectors[,1])  # a1hat = e1/sqrt(e1' S_pooled e1)
>a2hat=evectors[,2]/sqrt(t(evectors[,2])%*%spool%*%evectors[,2])  # a2hat = e2/sqrt(e2' S_pooled e2)
>a1x=ir%*%a1hat                    # first discriminant scores a1hat' Xi, i = 1, ..., nT
>a2x=ir%*%a2hat                    # second discriminant scores a2hat' Xi, i = 1, ..., nT
>plot(a1x,a2x)                     # separation based on the first two sample discriminants
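The separation is easier to judge if each point in the plot is marked by its population; a minimal sketch (the choice of plotting characters is arbitrary):

>grp=rep(1:3,c(50,50,50))  # population labels for the 150 observations
>plot(a1x,a2x,pch=grp)     # mark each point by its population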
5
 3
Objective: Allocate x0    and compute the error rate
1
 
1
Useful Splus Commands (Classification):

>x0=c(5,3,1,1)
>yhat1=a1hat%*%x0       # yhat1 = a1hat' x0
>yhat2=a2hat%*%x0       # yhat2 = a2hat' x0
>y1bar1=a1hat%*%xmean1  # ybar11
>y1bar2=a2hat%*%xmean1  # ybar12
>y2bar1=a1hat%*%xmean2  # ybar21
>y2bar2=a2hat%*%xmean2  # ybar22
>y3bar1=a1hat%*%xmean3  # ybar31
>y3bar2=a2hat%*%xmean3  # ybar32
>dis1=(yhat1-y1bar1)^2+(yhat2-y1bar2)^2  # squared distance to the transformed mean of population 1
>dis2=(yhat1-y2bar1)^2+(yhat2-y2bar2)^2  # squared distance to the transformed mean of population 2
>dis3=(yhat1-y3bar1)^2+(yhat2-y3bar2)^2  # squared distance to the transformed mean of population 3
>c(dis1,dis2,dis3)  # x0 is allocated to the population with the smallest distance
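The commands above print the three distances but do not compute the error rate asked for in the objective. A minimal sketch of both steps, under the assumption that the apparent error rate is wanted (the fraction of the 150 training observations misclassified by the same rule based on the first two discriminants); object names are arbitrary:

>order(c(dis1,dis2,dis3))[1]             # population to which x0 is allocated
>y=cbind(ir%*%a1hat,ir%*%a2hat)          # discriminant scores for all 150 observations
>d1=(y[,1]-c(a1hat%*%xmean1))^2+(y[,2]-c(a2hat%*%xmean1))^2  # distances to population 1
>d2=(y[,1]-c(a1hat%*%xmean2))^2+(y[,2]-c(a2hat%*%xmean2))^2  # distances to population 2
>d3=(y[,1]-c(a1hat%*%xmean3))^2+(y[,2]-c(a2hat%*%xmean3))^2  # distances to population 3
>pred=apply(cbind(d1,d2,d3),1,order)[1,] # allocated population for each observation
>grp=rep(1:3,c(50,50,50))                # true population labels
>sum(pred!=grp)/150                      # apparent error rate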