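11.7 Fisher's Method for Discriminating Among Several Populations

1. Separation:

Suppose there are $g$ populations, with observations

$X_1, X_2, \ldots, X_{n_1}$: population 1

$X_{n_1+1}, X_{n_1+2}, \ldots, X_{n_1+n_2}$: population 2

$\vdots$

$X_{n_1+\cdots+n_{g-1}+1}, \ldots, X_{n_T}$: population $g$,

where $n_1 + n_2 + \cdots + n_g = n_T$.

Let $\bar{X}_j$ be the sample mean for population $j$, $j = 1, \ldots, g$, and let

$$\bar{X} = \frac{\sum_{i=1}^{n_T} X_i}{n_T}.$$

The sample between-group matrix is

$$B = \sum_{j=1}^{g} n_j (\bar{X}_j - \bar{X})(\bar{X}_j - \bar{X})^t.$$

Thus,

$$a^t B a = \sum_{j=1}^{g} n_j\, a^t (\bar{X}_j - \bar{X})(\bar{X}_j - \bar{X})^t a = \sum_{j=1}^{g} n_j \left(a^t \bar{X}_j - a^t \bar{X}\right)\left(\bar{X}_j^t a - \bar{X}^t a\right) = \sum_{j=1}^{g} n_j (\bar{Y}_j - \bar{Y})^2,$$

where $Y_i = a^t X_i$, $i = 1, \ldots, n_T$, $\bar{Y}_j$ is the mean of the transformed observations for the $j$'th population, $j = 1, \ldots, g$ (for example, $\bar{Y}_1 = \sum_{i=1}^{n_1} Y_i / n_1$), and $\bar{Y} = \sum_{i=1}^{n_T} Y_i / n_T$.

The sample within-group matrix $W$ is

$$W = \sum_{i=1}^{n_1} (X_i - \bar{X}_1)(X_i - \bar{X}_1)^t + \sum_{i=n_1+1}^{n_1+n_2} (X_i - \bar{X}_2)(X_i - \bar{X}_2)^t + \cdots + \sum_{i=n_1+\cdots+n_{g-1}+1}^{n_T} (X_i - \bar{X}_g)(X_i - \bar{X}_g)^t.$$

Thus,

$$a^t W a = \sum_{i=1}^{n_1} a^t (X_i - \bar{X}_1)(X_i - \bar{X}_1)^t a + \cdots + \sum_{i=n_1+\cdots+n_{g-1}+1}^{n_T} a^t (X_i - \bar{X}_g)(X_i - \bar{X}_g)^t a$$

$$= \sum_{i=1}^{n_1} (Y_i - \bar{Y}_1)^2 + \sum_{i=n_1+1}^{n_1+n_2} (Y_i - \bar{Y}_2)^2 + \cdots + \sum_{i=n_1+\cdots+n_{g-1}+1}^{n_T} (Y_i - \bar{Y}_g)^2.$$

Note:

$$\frac{a^t W a}{n_T - g} = \frac{\sum_{i=1}^{n_1} (Y_i - \bar{Y}_1)^2 + \sum_{i=n_1+1}^{n_1+n_2} (Y_i - \bar{Y}_2)^2 + \cdots + \sum_{i=n_1+\cdots+n_{g-1}+1}^{n_T} (Y_i - \bar{Y}_g)^2}{n_T - g}$$

is the pooled estimate of the variance based on $Y_1, Y_2, \ldots, Y_{n_T}$. Similarly,

$$S_{pooled} = \frac{W}{n_1 + n_2 + \cdots + n_g - g} = \frac{\sum_{i=1}^{n_1} (X_i - \bar{X}_1)(X_i - \bar{X}_1)^t + \cdots + \sum_{i=n_1+\cdots+n_{g-1}+1}^{n_T} (X_i - \bar{X}_g)(X_i - \bar{X}_g)^t}{n_T - g}$$

is the pooled estimate of the covariance matrix based on $X_1, X_2, \ldots, X_{n_T}$.

We now introduce Fisher's linear discriminant method for several populations.

Fisher's discriminant method for several populations is as follows:

Find the vector $\hat{a}_1$ maximizing the separation function

$$S(a) = \frac{a^t B a}{a^t W a} = \frac{\sum_{j=1}^{g} n_j (\bar{Y}_j - \bar{Y})^2}{\sum_{i=1}^{n_1} (Y_i - \bar{Y}_1)^2 + \sum_{i=n_1+1}^{n_1+n_2} (Y_i - \bar{Y}_2)^2 + \cdots + \sum_{i=n_1+\cdots+n_{g-1}+1}^{n_T} (Y_i - \bar{Y}_g)^2},$$

subject to $\hat{a}_1^t S_{pooled}\, \hat{a}_1 = 1$. The linear combination $\hat{a}_1^t X$ is called the sample first discriminant.

Find the vector $\hat{a}_2$ maximizing the separation function $S(a)$ subject to $\hat{a}_2^t S_{pooled}\, \hat{a}_2 = 1$ and $\hat{a}_2^t S_{pooled}\, \hat{a}_1 = 0$.

$\vdots$

Find the vector $\hat{a}_s$ maximizing the separation function $S(a)$ subject to $\hat{a}_s^t S_{pooled}\, \hat{a}_s = 1$ and $\hat{a}_s^t S_{pooled}\, \hat{a}_l = 0$, $l < s$.

Note: $\hat{a}_j^t S_{pooled}\, \hat{a}_j$ is the estimate of $\mathrm{Var}(\hat{a}_j^t X)$, $j = 1, \ldots, s$, and $\hat{a}_j^t S_{pooled}\, \hat{a}_l$, $j \neq l$, is the estimate of $\mathrm{Cov}(\hat{a}_j^t X, \hat{a}_l^t X)$, $j \neq l$.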
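The condition $\hat{a}_j^t S_{pooled}\, \hat{a}_l = 0$ is similar to the condition given in principal component analysis.

Intuitively, $S(a)$ measures the difference among the transformed means, reflected by

$$\sum_{j=1}^{g} n_j (\bar{Y}_j - \bar{Y})^2,$$

relative to the random variation of the transformed data, reflected by

$$\sum_{i=1}^{n_1} (Y_i - \bar{Y}_1)^2 + \sum_{i=n_1+1}^{n_1+n_2} (Y_i - \bar{Y}_2)^2 + \cdots + \sum_{i=n_1+\cdots+n_{g-1}+1}^{n_T} (Y_i - \bar{Y}_g)^2.$$

When the transformed observations $Y_1, Y_2, \ldots, Y_{n_1}$ (population 1), $Y_{n_1+1}, Y_{n_1+2}, \ldots, Y_{n_1+n_2}$ (population 2), $\ldots$, $Y_{n_1+\cdots+n_{g-1}+1}, \ldots, Y_{n_T}$ (population $g$) are well separated, $\sum_{j=1}^{g} n_j (\bar{Y}_j - \bar{Y})^2$ should be large even after the random variation of the transformed data is taken into account.

Important result:

Let $e_1, e_2, \ldots, e_s$ be the orthonormal eigenvectors of $W^{-1/2} B W^{-1/2}$ corresponding to the eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_s > 0$. Then,

$$\hat{a}_j = S_{pooled}^{-1/2}\, e_j, \quad j = 1, \ldots, s,$$

where $S_{pooled}^{-1/2} S_{pooled}^{-1/2} = S_{pooled}^{-1}$.

The following important result provides another way to obtain the discriminants.

Important result:

Let $e_1, e_2, \ldots, e_s$ be the eigenvectors of $W^{-1} B$ corresponding to the eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_s > 0$. Then $\hat{a}_j$, $j = 1, \ldots, s$, are the scaled eigenvectors satisfying $\hat{a}_j^t S_{pooled}\, \hat{a}_j = 1$. That is,

$$\hat{a}_j = \frac{e_j}{\sqrt{e_j^t S_{pooled}\, e_j}}.$$

2. Classification:

Fisher's classification method for several populations is as follows:

For an observation $X_0$, Fisher's classification procedure based on the first $r \leq s$ sample discriminants is to allocate $X_0$ to population $l$ if

$$\sum_{j=1}^{r} (\hat{Y}_j - \bar{Y}_{lj})^2 = \sum_{j=1}^{r} \left[\hat{a}_j^t (X_0 - \bar{X}_l)\right]^2 \leq \sum_{j=1}^{r} \left[\hat{a}_j^t (X_0 - \bar{X}_i)\right]^2 = \sum_{j=1}^{r} (\hat{Y}_j - \bar{Y}_{ij})^2, \quad i \neq l,$$

where $\hat{Y}_j = \hat{a}_j^t X_0$ and $\bar{Y}_{ij} = \hat{a}_j^t \bar{X}_i$, $j = 1, \ldots, r$; $i = 1, \ldots, g$.

Intuition of Fisher's method:

The discriminants $l_j(X) = \hat{a}_j^t X$, $j = 1, \ldots, r$, map points in $R^p$ to points in $R^r$:

population 1: $\bar{X}_1 \mapsto (\bar{Y}_{11}, \ldots, \bar{Y}_{1r})$

population 2: $\bar{X}_2 \mapsto (\bar{Y}_{21}, \ldots, \bar{Y}_{2r})$

$\vdots$

population $g$: $\bar{X}_g \mapsto (\bar{Y}_{g1}, \ldots, \bar{Y}_{gr})$

new observation: $X_0 \mapsto (\hat{Y}_1, \ldots, \hat{Y}_r)$

$\sum_{j=1}^{r} (\hat{Y}_j - \bar{Y}_{1j})^2$: the "total" squared distance between the transformed $X_0$, $(\hat{Y}_1, \ldots, \hat{Y}_r)$, and the transformed mean of population 1, $(\bar{Y}_{11}, \ldots, \bar{Y}_{1r})$.

$\sum_{j=1}^{r} (\hat{Y}_j - \bar{Y}_{2j})^2$: the "total" squared distance between the transformed $X_0$, $(\hat{Y}_1, \ldots, \hat{Y}_r)$, and the transformed mean of population 2, $(\bar{Y}_{21}, \ldots, \bar{Y}_{2r})$.

$\vdots$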
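$\sum_{j=1}^{r} (\hat{Y}_j - \bar{Y}_{gj})^2$: the "total" squared distance between the transformed $X_0$, $(\hat{Y}_1, \ldots, \hat{Y}_r)$, and the transformed mean of population $g$, $(\bar{Y}_{g1}, \ldots, \bar{Y}_{gr})$.

The condition

$$\sum_{j=1}^{r} (\hat{Y}_j - \bar{Y}_{lj})^2 \leq \sum_{j=1}^{r} (\hat{Y}_j - \bar{Y}_{ij})^2, \quad i \neq l,$$

implies that the total distance between the transformed $X_0$ and the transformed mean of population $l$ is no larger than the distance between the transformed $X_0$ and the transformed mean of any other population. In this sense, $X_0$ is "closer" to population $l$ than to the other populations. Therefore, $X_0$ is allocated to population $l$.

Example:

$$X_1 = \begin{bmatrix} x_1^t \\ x_2^t \\ x_3^t \end{bmatrix} = \begin{bmatrix} -2 & 5 \\ 0 & 3 \\ -1 & 1 \end{bmatrix},\ n_1 = 3; \quad X_2 = \begin{bmatrix} x_4^t \\ x_5^t \\ x_6^t \end{bmatrix} = \begin{bmatrix} 0 & 6 \\ 2 & 4 \\ 1 & 2 \end{bmatrix},\ n_2 = 3; \quad X_3 = \begin{bmatrix} x_7^t \\ x_8^t \\ x_9^t \end{bmatrix} = \begin{bmatrix} 1 & -2 \\ 0 & 0 \\ -1 & -4 \end{bmatrix},\ n_3 = 3.$$

Then,

$$\bar{x}_1 = \begin{bmatrix} -1 \\ 3 \end{bmatrix}, \quad \bar{x}_2 = \begin{bmatrix} 1 \\ 4 \end{bmatrix}, \quad \bar{x}_3 = \begin{bmatrix} 0 \\ -2 \end{bmatrix}, \quad \bar{x} = \begin{bmatrix} 0 \\ 5/3 \end{bmatrix},$$

$$B = \sum_{i=1}^{3} 3\, (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^t = \begin{bmatrix} 6 & 3 \\ 3 & 62 \end{bmatrix},$$

and

$$W = \sum_{j=1}^{3} (x_j - \bar{x}_1)(x_j - \bar{x}_1)^t + \sum_{j=4}^{6} (x_j - \bar{x}_2)(x_j - \bar{x}_2)^t + \sum_{j=7}^{9} (x_j - \bar{x}_3)(x_j - \bar{x}_3)^t = \begin{bmatrix} 6 & -2 \\ -2 & 24 \end{bmatrix}.$$

Further,

$$S_{pooled} = \frac{W}{n_1 + n_2 + n_3 - 3} = \begin{bmatrix} 1 & -1/3 \\ -1/3 & 4 \end{bmatrix}, \quad W^{-1} B = \begin{bmatrix} 1.07143 & 1.4 \\ 0.21429 & 2.7 \end{bmatrix}.$$

The eigenvectors of $W^{-1} B$ are

$$e_1 = \begin{bmatrix} 0.7183 \\ 0.9213 \end{bmatrix} \ (\lambda_1 = 2.867); \quad e_2 = \begin{bmatrix} 0.9929 \\ -0.11842 \end{bmatrix} \ (\lambda_2 = 0.9043).$$

Thus,

$$\hat{a}_1 = \frac{e_1}{\sqrt{e_1^t S_{pooled}\, e_1}} = \frac{e_1}{\sqrt{3.47}} = \begin{bmatrix} 0.386 \\ 0.495 \end{bmatrix}; \quad \hat{a}_2 = \frac{e_2}{\sqrt{e_2^t S_{pooled}\, e_2}} = \frac{e_2}{\sqrt{1.12}} = \begin{bmatrix} 0.938 \\ -0.112 \end{bmatrix}.$$

Therefore,

$$\hat{y}_1 = \hat{a}_1^t x = 0.386\, x_1 + 0.495\, x_2; \quad \hat{y}_2 = \hat{a}_2^t x = 0.938\, x_1 - 0.112\, x_2.$$

To classify a new observation $x_0 = \begin{bmatrix} 1 \\ 3 \end{bmatrix}$, we need to compute

$\hat{y}_1 = \hat{a}_1^t x_0 = 1.87$, $\hat{y}_2 = \hat{a}_2^t x_0 = 0.60$,

$\bar{y}_{11} = \hat{a}_1^t \bar{x}_1 = 1.10$, $\bar{y}_{12} = \hat{a}_2^t \bar{x}_1 = -1.27$,

$\bar{y}_{21} = \hat{a}_1^t \bar{x}_2 = 2.37$, $\bar{y}_{22} = \hat{a}_2^t \bar{x}_2 = 0.49$,

$\bar{y}_{31} = \hat{a}_1^t \bar{x}_3 = -0.99$, $\bar{y}_{32} = \hat{a}_2^t \bar{x}_3 = 0.22$.

Since

$$\sum_{j=1}^{2} (\hat{y}_j - \bar{y}_{1j})^2 = (1.87 - 1.10)^2 + (0.60 + 1.27)^2 = 4.09,$$

$$\sum_{j=1}^{2} (\hat{y}_j - \bar{y}_{2j})^2 = (1.87 - 2.37)^2 + (0.60 - 0.49)^2 = 0.26,$$

$$\sum_{j=1}^{2} (\hat{y}_j - \bar{y}_{3j})^2 = (1.87 + 0.99)^2 + (0.60 - 0.22)^2 = 8.32,$$

the observation $x_0$ is allocated to population 2.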
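The arithmetic of the example can be checked numerically. The following is a minimal sketch in R, used here as an assumed stand-in for S-Plus; the object names (`groups`, `means`, `A`, `dist`) are illustrative and not part of the notes. The data matrices are entered with the signs that reproduce the $B$ and $W$ reported in the example.

```r
# Data from the worked example: three populations, two variables each.
X1 <- matrix(c(-2,  5,  0, 3, -1,  1), ncol = 2, byrow = TRUE)
X2 <- matrix(c( 0,  6,  2, 4,  1,  2), ncol = 2, byrow = TRUE)
X3 <- matrix(c( 1, -2,  0, 0, -1, -4), ncol = 2, byrow = TRUE)
groups <- list(X1, X2, X3)

n     <- sapply(groups, nrow)                # group sizes n_j
means <- lapply(groups, colMeans)            # group mean vectors
xbar  <- colMeans(do.call(rbind, groups))    # overall mean vector

# Between-group matrix B and within-group matrix W.
B <- Reduce(`+`, Map(function(m, nj) nj * tcrossprod(m - xbar), means, n))
W <- Reduce(`+`, lapply(groups, function(X) (nrow(X) - 1) * var(X)))
Spooled <- W / (sum(n) - length(groups))

# Eigenvectors of W^{-1} B, rescaled so that a' Spooled a = 1.
ev <- eigen(solve(W) %*% B)
A  <- apply(ev$vectors, 2,
            function(e) e / sqrt(drop(t(e) %*% Spooled %*% e)))

# Squared distances from the transformed x0 to each transformed group mean.
x0   <- c(1, 3)
y0   <- drop(t(A) %*% x0)
dist <- sapply(means, function(m) sum((y0 - drop(t(A) %*% m))^2))
which.min(dist)                              # index of the closest population
```

The eigenvectors returned by `eigen()` may differ from $e_1, e_2$ in sign and scale, but the rescaling step and the squared distances are unaffected by that. The distances agree with 4.09, 0.26, and 8.32 up to the rounding used in the hand calculation, and `which.min(dist)` selects population 2.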
Useful Splus Commands:

Here ir is the 150 x 4 data matrix whose rows 1-50, 51-100, and 101-150 hold populations 1, 2, and 3.

>xmean1=apply(ir[1:50,],2,mean)      # sample mean \bar{X}_1
>xmean2=apply(ir[51:100,],2,mean)    # sample mean \bar{X}_2
>xmean3=apply(ir[101:150,],2,mean)   # sample mean \bar{X}_3
>xmean=apply(ir,2,mean)              # overall mean \bar{X}
>b1<-50*(xmean1-xmean)%*%t(xmean1-xmean)   # n_1 (\bar{X}_1 - \bar{X})(\bar{X}_1 - \bar{X})^t
>b2<-50*(xmean2-xmean)%*%t(xmean2-xmean)   # n_2 (\bar{X}_2 - \bar{X})(\bar{X}_2 - \bar{X})^t
>b3<-50*(xmean3-xmean)%*%t(xmean3-xmean)   # n_3 (\bar{X}_3 - \bar{X})(\bar{X}_3 - \bar{X})^t
>b<-b1+b2+b3                         # B = \sum_{j=1}^{g} n_j (\bar{X}_j - \bar{X})(\bar{X}_j - \bar{X})^t
>sum1=49*var(ir[1:50,])              # \sum_{i=1}^{n_1} (X_i - \bar{X}_1)(X_i - \bar{X}_1)^t
>sum2=49*var(ir[51:100,])            # \sum_{i=n_1+1}^{n_1+n_2} (X_i - \bar{X}_2)(X_i - \bar{X}_2)^t
>sum3=49*var(ir[101:150,])           # \sum_{i=n_1+n_2+1}^{n_T} (X_i - \bar{X}_3)(X_i - \bar{X}_3)^t
>w<-sum1+sum2+sum3                   # W
>invw<-solve(w)                      # W^{-1}
>spool=w/(50+50+50-3)                # S_pooled = W/(n_1 + n_2 + n_3 - 3)
>evectors=eigen(invw%*%b)$vectors    # e_1, e_2, e_3, e_4
>a1hat=evectors[,1]/sqrt(t(evectors[,1])%*%spool%*%evectors[,1])   # \hat{a}_1 = e_1 / sqrt(e_1^t S_pooled e_1)
>a2hat=evectors[,2]/sqrt(t(evectors[,2])%*%spool%*%evectors[,2])   # \hat{a}_2 = e_2 / sqrt(e_2^t S_pooled e_2)
>a1x=ir%*%a1hat                      # \hat{a}_1^t X_i, i = 1, ..., n_T
>a2x=ir%*%a2hat                      # \hat{a}_2^t X_i, i = 1, ..., n_T
>plot(a1x,a2x)                       # separation based on the first two sample discriminants

Objective: Allocate $x_0 = (5, 3, 1, 1)^t$ and compute the error rate.

Useful Splus Commands (Classification):

>x0=c(5,3,1,1)
>yhat1=a1hat%*%x0                    # \hat{y}_1
>yhat2=a2hat%*%x0                    # \hat{y}_2
>y1bar1=a1hat%*%xmean1               # \bar{y}_{11}
>y1bar2=a2hat%*%xmean1               # \bar{y}_{12}
>y2bar1=a1hat%*%xmean2               # \bar{y}_{21}
>y2bar2=a2hat%*%xmean2               # \bar{y}_{22}
>y3bar1=a1hat%*%xmean3               # \bar{y}_{31}
>y3bar2=a2hat%*%xmean3               # \bar{y}_{32}
>dis1=(yhat1-y1bar1)^2+(yhat2-y1bar2)^2    # \sum_{j=1}^{2} (\hat{y}_j - \bar{y}_{1j})^2
>dis2=(yhat1-y2bar1)^2+(yhat2-y2bar2)^2    # \sum_{j=1}^{2} (\hat{y}_j - \bar{y}_{2j})^2
>dis3=(yhat1-y3bar1)^2+(yhat2-y3bar2)^2    # \sum_{j=1}^{2} (\hat{y}_j - \bar{y}_{3j})^2
>c(dis1,dis2,dis3)
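The commands above allocate a single $x_0$ but stop short of the error rate named in the objective. One way to obtain it is to allocate every observation with the same rule and count the misallocations. The sketch below does this in R (an assumed stand-in for S-Plus), treating R's built-in `iris` data as the matrix `ir`; that substitution, the label vector `grp`, and the names `A`, `Ybar`, `pred`, and `err` are assumptions for illustration. The result is the apparent (resubstitution) error rate, which tends to be optimistic because the same data build and test the rule.

```r
# Assumption: R's built-in iris data plays the role of the S-Plus matrix `ir`.
ir  <- as.matrix(iris[, 1:4])
grp <- rep(1:3, each = 50)                   # true population of each row

xbar  <- colMeans(ir)
means <- lapply(1:3, function(k) colMeans(ir[grp == k, ]))
B <- Reduce(`+`, lapply(means, function(m) 50 * tcrossprod(m - xbar)))
W <- Reduce(`+`, lapply(1:3, function(k) 49 * var(ir[grp == k, ])))
Spooled <- W / (150 - 3)

# First two eigenvectors of W^{-1} B; Re() guards against eigen() returning
# a complex-typed matrix because of the two numerically zero eigenvalues.
E <- Re(eigen(solve(W) %*% B)$vectors[, 1:2])
A <- apply(E, 2, function(e) e / sqrt(drop(t(e) %*% Spooled %*% e)))

Y    <- ir %*% A                                          # 150 x 2 scores
Ybar <- t(sapply(means, function(m) drop(t(A) %*% m)))    # 3 x 2 transformed means

# Squared distance of every observation to every transformed mean,
# then allocation to the nearest one.
d2   <- sapply(1:3, function(k) rowSums(sweep(Y, 2, Ybar[k, ])^2))
pred <- apply(d2, 1, which.min)
err  <- mean(pred != grp)                    # apparent error rate

# Allocation of the single new observation x0 with the same rule.
x0 <- c(5, 3, 1, 1)
y0 <- drop(t(A) %*% x0)
which.min(rowSums(sweep(Ybar, 2, y0)^2))
```

A cross-validated (leave-one-out) version would refit `A` and `Ybar` with each observation held out in turn; it is omitted here to keep the sketch short.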