Normal density:  f(y) = (2π)^(-1/2) |σ²|^(-1/2) exp(-0.5 (y-μ)²/σ²)

Multivariate vector Y = (y1, y2)' (for n = 2 elements):  Y ~ N(μ, Σ), for example

    μ = (0, 0)'      Σ =  1.0   0.8
                          0.8   1.0

Multivariate normal density:  φ(Y) = (2π)^(-n/2) |Σ|^(-1/2) exp(-0.5 (Y-μ)'Σ⁻¹(Y-μ))
The larger the scalar φ(Y), the more likely we are to observe vector Y.

Fisher linear discriminant function. Notation:
  k    number of multivariate normal (sub)populations.
  n    number of features (measurements, observations) for each individual = number of elements in the vector Y of observations.
  Y    the n x 1 vector of observations on an individual to be classified.
  D²   the squared Mahalanobis distance.
  F    the Fisher linear discriminant function.
  p_i  the proportion (prior probability) of our big population that is from subpopulation i.
  φ_i  the multivariate normal density for subpopulation i, with mean μ_i and variance Σ_i.

Goal: We want Pr{pop i | Y}, where Y is a multivariate vector of measurements on an individual and we have i = 1, 2, …, k populations from which to choose. We assume that the Y's from each population have a multivariate normal density φ_i.

(1) Bayes' Rule { relates Pr{A|B} to Pr{B|A} }
    Pr{pop i | Y} = Pr{pop i and Y} / Pr{Y}
                  = [p_i Pr{Y | pop i}] / [Σ_{j=1}^{k} p_j Pr{Y | pop j}]
                  = [p_i φ_i] / [Σ_{j=1}^{k} p_j φ_j]   *

(2) Simplest case: all populations are equally likely (common prior p) and share the same covariance matrix Σ. For population j we have
    p_j φ_j = p (2π)^(-n/2) |Σ|^(-1/2) exp(-0.5 D_j²)
            = p (2π)^(-n/2) |Σ|^(-1/2) exp(-0.5 (Y-μ_j)'Σ⁻¹(Y-μ_j))
            = [ p (2π)^(-n/2) |Σ|^(-1/2) exp(-0.5 Y'Σ⁻¹Y) ] exp(F_j)
    where
    (a) F_j = -0.5 μ_j'Σ⁻¹μ_j + μ_j'Σ⁻¹Y = a_j + b_j'Y = the "Fisher linear discriminant function." The larger it is, the more likely it is that population j would produce an observation like Y.
    (b) [ p (2π)^(-n/2) |Σ|^(-1/2) exp(-0.5 Y'Σ⁻¹Y) ] is the same for all populations.
    (c) D_j² = squared Mahalanobis distance = (Y-μ_j)'Σ⁻¹(Y-μ_j) = Y'Σ⁻¹Y - 2[-0.5 μ_j'Σ⁻¹μ_j + μ_j'Σ⁻¹Y] = Y'Σ⁻¹Y - 2F_j. The larger D_j² is, the smaller F_j is and hence the less likely it is that population j would produce an observation like Y; it is negatively related to the discriminant function and so acts like a distance.

(3) Anything (like item (b)) common to the numerator and denominator of Pr{pop i | Y} = [p_i φ_i] / [Σ_{j=1}^{k} p_j φ_j] cancels out. In this simplest case (equal priors p and equal covariance matrices) the factor [ p (2π)^(-n/2) |Σ|^(-1/2) exp(-0.5 Y'Σ⁻¹Y) ], which is constant across the populations, is eliminated, leaving the simpler expression
    Pr{pop i | Y} = exp(F_i) / [Σ_{j=1}^{k} exp(F_j)].
Compute exp(F_i) for i = 1, …, k and divide each by their sum to get the (posterior) probabilities for each population when the priors and covariance matrices are equal.

(4) The Fisher linear discriminant function is seen to be everything in ln(p_i φ_i) that changes with i. When p_i and Σ_i change across populations, only the (2π)^(-n/2) term is constant, so we have
    F_i = [ -0.5 μ_i'Σ_i⁻¹μ_i + ln(p_i) - 0.5 ln|Σ_i| ] - 0.5 Y'Σ_i⁻¹Y + μ_i'Σ_i⁻¹Y
Note that ln(p_i) is omitted if p is constant, and -0.5 ln|Σ_i| and -0.5 Y'Σ_i⁻¹Y are omitted if there is a common variance-covariance matrix. When the Y'Σ_i⁻¹Y term appears, F is, for obvious reasons, called Fisher's quadratic discriminant function. Only the intercept [ -0.5 μ_i'Σ_i⁻¹μ_i + ln(p_i) - 0.5 ln|Σ_i| ] is affected by unequal p_i. A small code sketch of this quadratic case follows.
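The PROC IML fragment below is not part of the original handout; it is a minimal sketch of the quadratic discriminant score in (4) for two populations with unequal priors and unequal covariance matrices. All parameter values (p1, p2, m1, m2, S1, S2, and the observation Y) are made-up illustrative numbers; only the formula comes from (4).

* Sketch only: quadratic discriminant with unequal priors and unequal covariances;
* (illustrative parameter values, not the class example data);
PROC IML;
p1 = 0.7;  p2 = 0.3;                       * assumed unequal prior probabilities;
m1 = {1, 2};          m2 = {-1, 0};        * assumed mean vectors;
S1 = {2 0.5, 0.5 1};  S2 = {1 0, 0 3};     * assumed unequal covariance matrices;
Y  = {0.5, 1};                             * observation to classify;
start Fquad(Y, m, S, p);
  IN = inv(S);
  * intercept pieces + quadratic term + linear term, as in (4);
  return( -0.5*m`*IN*m + log(p) - 0.5*log(det(S)) - 0.5*Y`*IN*Y + m`*IN*Y );
finish;
F1 = Fquad(Y, m1, S1, p1);
F2 = Fquad(Y, m2, S2, p2);
post = exp(F1//F2) / sum(exp(F1//F2));     * posterior probabilities as in (1);
print F1 F2 post;
QUIT;

With a common covariance matrix and equal priors, the log(p), log(det(S)), and Y`*IN*Y terms drop out and the score reduces to the linear F_j used in the class example below.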
Example: equal priors and covariance matrices.

    Σ =   7.5    7.5    6.25        Σ⁻¹ =   0.2    -0.05   -0.02
          7.5   25     12.5                -0.05    0.0625 -0.015
          6.25  12.5   31.25               -0.02   -0.015   0.042

    μ1 = (2, -1, 1)'     μ2 = (-2, 0, 1)'     μ3 = (1, -1, 1)'

Classify the observation Y = (1, 2, 3)'.

PROC IML syntax reminders:
  { }  curly brackets: matrix literals; commas delineate rows.
  ( )  round brackets: function arguments.
  [ ]  square brackets: row and column operations, e.g. sum A[+, ], select elements A[1,2], or submatrices A[1, 2:5], etc.
  //   join one matrix on top of the other;  ||  join side by side.
  `    backquote (upper left on a typical keyboard): transpose.

** Class notes example **;
ods html close; ods listing;
PROC IML;
S = {7.5 7.5 6.25, 7.5 25 12.5, 6.25 12.5 31.25};
IN = inv(S);
m1 = {2,-1,1};  m2 = {-2,0,1};  m3 = {1,-1,1};
print S IN m1 m2 m3;
D1 = -0.5*m1`*IN*m1 || (m1`*IN);
D2 = -0.5*m2`*IN*m2 || (m2`*IN);
D3 = -0.5*m3`*IN*m3 || (m3`*IN);
D = D1//D2//D3;
Y = {1,2,3};
discriminant = D*({1}//Y);
print D Y discriminant;
expdisc = exp(discriminant);
prob = expdisc/expdisc[+, ];   * divide each exp(Fj) by the sum of the exp(Fk);
print discriminant expdisc prob;

Output:

    S                              IN
     7.5    7.5    6.25             0.2    -0.05   -0.02
     7.5   25     12.5             -0.05    0.0625 -0.015
     6.25  12.5   31.25            -0.02   -0.015   0.042

    m1     m2     m3
     2     -2      1
    -1      0     -1
     1      1      1

    D                                        Y   discriminant
    -0.52725   0.43   -0.1775   0.017        1   -0.40125
    -0.461    -0.42    0.085    0.082        2   -0.465
    -0.19725   0.23   -0.1275   0.037        3   -0.11125

    discriminant   expdisc     prob
    -0.40125       0.6694827   0.3053746
    -0.465         0.6281351   0.2865145
    -0.11125       0.894715    0.408111

Population 3 has the highest probability density at Y and hence the highest posterior probability of having produced Y. The pdf of Y in population j is seen above to be C times exp(F_j), where C is the same for all populations. Thus if we compute the quantity exp(F_i) / Σ_j exp(F_j) for i = 1, 2, 3, these will be three probabilities that add to 1 and are in the proper ratio. If we think in Bayesian terms of equal prior probabilities that an observed vector comes from population j, then we have computed the posterior probability of being from group j given the observed Y. For example, e^(-0.11125) / (e^(-0.40125) + e^(-0.465) + e^(-0.11125)) = 0.408111.

* In [p_i φ_i] / [Σ_{j=1}^{k} p_j φ_j] and elsewhere we are ignoring the fact that Pr{Y} = 0 for any particular Y (φ is the pdf of a continuous random variable). The arguments can be made rigorous in the usual limiting way: establish a small interval (square, cube, hypercube) around the point Y and let it shrink toward Y.
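As a numerical check on the cancellation argument in (2) and (3) (this check is not in the original notes): computing the three multivariate normal densities φ_j(Y) directly and normalizing them must reproduce the posterior probabilities 0.3053746, 0.2865145, 0.408111 obtained above, because the factor common to all three populations cancels in the ratio. A minimal IML sketch:

PROC IML;
S  = {7.5 7.5 6.25, 7.5 25 12.5, 6.25 12.5 31.25};
IN = inv(S);
m  = {2 -1 1, -2 0 1, 1 -1 1};               * the three means stacked as rows;
Y  = {1,2,3};
n  = nrow(Y);
phi = j(3, 1, 0);
do jpop = 1 to 3;
  d = Y - t(m[jpop, ]);                       * deviation of Y from mean jpop;
  phi[jpop] = (2*constant('pi'))##(-n/2) * det(S)##(-0.5) * exp(-0.5*d`*IN*d);
end;
post = phi/phi[+, ];                          * normalize the densities;
print phi post;                               * post should equal prob above;
QUIT;

Here the priors are equal, so they drop out of the normalization exactly as argued in (3).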