METU Informatics Institute
Min 720 Pattern Classification with Bio-Medical Applications

PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

Statistical Approach to P.R.
A sample is represented by a feature vector $X = [X_1, X_2, \ldots, X_d]^T$.
• Dimension of the feature space: $d$
• Set of different states of nature: $\{\omega_1, \omega_2, \ldots, \omega_c\}$; number of categories: $c$
• The task is to find decision regions $R_i$ such that $R_i \cap R_j = \emptyset$ for $i \neq j$ and $\bigcup_i R_i = R^d$ (figure: the feature space partitioned into regions $R_1, R_2, R_3$ by discriminant functions $g_1, g_2, g_3$).
• Set of possible actions (decisions): $\{\alpha_1, \alpha_2, \ldots, \alpha_a\}$. Here, a decision might include a 'reject option'.

A Discriminant Function $g_i(X)$, $1 \le i \le c$, satisfies $g_i(X) > g_j(X)$ in region $R_i$; the decision rule is: decide $\omega_k$ if $g_k(X) > g_j(X)$ for all $j \neq k$.

A Pattern Classifier
The input $X$ is fed to the discriminant functions $g_1(X), g_2(X), \ldots, g_c(X)$, and a maximum selector outputs the winning category $k$.
So our aim now will be to define these functions $g_1, g_2, \ldots, g_c$ to minimize or optimize a criterion.

Parametric Approach to Classification
• 'Bayes Decision Theory' is used for minimum-error / minimum-risk pattern classifier design.
• Here, it is assumed that a sample $X$ drawn from class $\omega_i$ is a random vector described by a multivariate probability density function, the 'class-conditional density function' $P(X \mid \omega_i)$.
• We also know the a priori probabilities $P(\omega_i)$, $1 \le i \le c$ ($c$ is the number of classes).
• Then we can talk about a decision rule that minimizes the probability of error.
• Suppose we have the observation $X$. This observation changes the a priori assumption into the a posteriori probability $P(\omega_i \mid X)$, which can be found by the Bayes Rule:
$$P(\omega_i \mid X) = \frac{P(\omega_i, X)}{P(X)} = \frac{P(X \mid \omega_i)\, P(\omega_i)}{P(X)}$$
• $P(X)$ can be found by the Total Probability Rule: when the $\omega_i$'s are disjoint,
$$P(X) = \sum_{i=1}^{c} P(\omega_i, X) = \sum_{i=1}^{c} P(X \mid \omega_i)\, P(\omega_i)$$
• Decision Rule: choose the category with the highest a posteriori probability, calculated as above using the Bayes Rule. Then
$$g_i(X) = P(\omega_i \mid X)$$
Decision boundary: $g_1 = g_2$ between $R_1$ and $R_2$; in general, decision boundaries are where $g_i(X) = g_j(X)$, between regions $R_i$ and $R_j$.
• Single feature: the decision boundary is a point; 2 features: a curve; 3 features: a surface; more than 3: a hypersurface.

Equivalent discriminant functions:
$$g_i(X) = \frac{P(X \mid \omega_i)\, P(\omega_i)}{P(X)} \qquad\text{or}\qquad g_i(X) = P(X \mid \omega_i)\, P(\omega_i)$$
• Sometimes it is easier to work with logarithms:
$$g_i(X) = \log\big[P(X \mid \omega_i)\, P(\omega_i)\big] = \log P(X \mid \omega_i) + \log P(\omega_i)$$
• Since the logarithm is a monotonically increasing function, the log form gives the same result.

2 Category Case: $c_1, c_2$
Assign to $c_1$ ($\omega_1$) if $P(\omega_1 \mid X) > P(\omega_2 \mid X)$; to $c_2$ ($\omega_2$) if $P(\omega_2 \mid X) > P(\omega_1 \mid X)$.
But this is the same as: $c_1$ if
$$\frac{P(X \mid \omega_1)\, P(\omega_1)}{P(X)} > \frac{P(X \mid \omega_2)\, P(\omega_2)}{P(X)}$$
By throwing away the $P(X)$'s, we end up with
$$P(X \mid \omega_1)\, P(\omega_1) > P(X \mid \omega_2)\, P(\omega_2),$$
which is the same as the likelihood-ratio rule:
$$\frac{P(X \mid \omega_1)}{P(X \mid \omega_2)} > \frac{P(\omega_2)}{P(\omega_1)} = k$$

Example: a single-feature, 2-category problem with Gaussian densities: diagnosis of diabetes using the sugar count $X$.
$c_1$: state of being healthy, $P(c_1) = 0.7$; $c_2$: state of being sick (diabetes), $P(c_2) = 0.3$.
$$P(X \mid c_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-(X - m_1)^2 / 2\sigma_1^2}, \qquad P(X \mid c_2) = \frac{1}{\sqrt{2\pi}\,\sigma_2}\, e^{-(X - m_2)^2 / 2\sigma_2^2}$$
(Figure: the two weighted densities centred at $m_1$ and $m_2$, with the decision boundary $d$ between them.)
The decision rule: $c_1$ if $P(X \mid c_1)\, P(c_1) > P(X \mid c_2)\, P(c_2)$, or $0.7\, P(X \mid c_1) > 0.3\, P(X \mid c_2)$.
Assume now: $m_1 = 10$, $m_2 = 20$, $\sigma_1 = \sigma_2 = 2$, and we measured $X = 17$. Assign the unknown sample $X$ to the correct category.
Find the likelihood ratio for $X = 17$:
$$\frac{P(X \mid c_1)}{P(X \mid c_2)} = \frac{e^{-(X-10)^2/8}}{e^{-(X-20)^2/8}} = e^{-49/8 + 9/8} = e^{-5} \approx 0.0067$$
Compare with
$$\frac{P(c_2)}{P(c_1)} = \frac{0.3}{0.7} \approx 0.43 > 0.0067.$$
So assign $X = 17$ to $c_2$.
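The two-category rule above amounts to comparing the likelihood ratio with the prior ratio $P(c_2)/P(c_1)$. Below is a minimal Python sketch of that test, reproducing the diabetes example ($m_1 = 10$, $m_2 = 20$, $\sigma = 2$, $X = 17$); the function names are illustrative choices, not part of the original notes.

```python
import math

def gaussian_pdf(x, mean, std):
    """Univariate Gaussian class-conditional density P(x | c)."""
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

def classify_two_class(x, m1, s1, p1, m2, s2, p2):
    """Bayes minimum-error rule: decide c1 if the likelihood ratio exceeds P(c2)/P(c1)."""
    likelihood_ratio = gaussian_pdf(x, m1, s1) / gaussian_pdf(x, m2, s2)
    threshold = p2 / p1
    decision = "c1" if likelihood_ratio > threshold else "c2"
    return decision, likelihood_ratio, threshold

# Worked example from the notes: m1=10, m2=20, sigma=2, P(c1)=0.7, P(c2)=0.3, X=17
decision, ratio, k = classify_two_class(17, m1=10, s1=2, p1=0.7, m2=20, s2=2, p2=0.3)
print(decision, round(ratio, 4), round(k, 2))   # -> c2 0.0067 0.43
```

Since the common normalizing factors cancel, only the exponents matter, which is why the ratio equals $e^{-5}$ exactly as computed by hand above.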
Example: A Discrete Problem
Consider a 2-feature, 3-category case where
$$P(X_1, X_2 \mid c_i) = \begin{cases} \dfrac{1}{(b_i - a_i)^2} & a_i \le X_1 \le b_i \ \text{and}\ a_i \le X_2 \le b_i \\[4pt] 0 & \text{otherwise} \end{cases}$$
with $P(c_1) = 0.4$, $P(c_2) = 0.4$, $P(c_3) = 0.2$ and
$$a_1 = -1,\ b_1 = 1; \qquad a_2 = 0.5,\ b_2 = 3.5; \qquad a_3 = 3,\ b_3 = 4.$$
Find the decision boundaries and regions.
Solution: inside each class's square, $P(X \mid c_i)\, P(c_i)$ takes the constant value
$$P(X \mid c_1) P(c_1) = \tfrac{1}{4}(0.4) = 0.1, \qquad P(X \mid c_2) P(c_2) = \tfrac{1}{9}(0.4) \approx 0.044, \qquad P(X \mid c_3) P(c_3) = 1 \cdot (0.2) = 0.2.$$
So where the squares overlap, $c_1$ wins over $c_2$ and $c_3$ wins over $c_2$; the decision boundaries are the edges of the squares, giving the regions $R_1$, $R_2$, $R_3$.
Remember now that for the 2-class case: decide $c_1$ if $P(X \mid c_1)\, P(c_1) > P(X \mid c_2)\, P(c_2)$, or, in likelihood-ratio form, if
$$\frac{P(X \mid c_1)}{P(X \mid c_2)} > \frac{P(c_2)}{P(c_1)} = k.$$

Error Probabilities and a Simple Proof of Minimum Error
Consider again a 2-class, 1-d problem. Plot $P(X \mid c_1) P(c_1)$ and $P(X \mid c_2) P(c_2)$ against $X$, and let $d$ be their intersection point, with $R_1$ to its left and $R_2$ to its right. Let's show that if the decision boundary is $d$ (the intersection point) rather than any arbitrary point $d'$, then $P(E)$ (the probability of error) is minimum.
$$P(E) = P(X \in R_2, c_1) + P(X \in R_1, c_2) = P(X \in R_2 \mid c_1) P(c_1) + P(X \in R_1 \mid c_2) P(c_2)$$
$$= \Big[\int_{R_2} P(X \mid c_1)\, dX\Big] P(c_1) + \Big[\int_{R_1} P(X \mid c_2)\, dX\Big] P(c_2) = \int_{R_2} P(X \mid c_1) P(c_1)\, dX + \int_{R_1} P(X \mid c_2) P(c_2)\, dX$$
It can very easily be seen that $P(E)$ is minimum if the boundary is at $d$ rather than $d'$: with the boundary at $d$, each region integrates the smaller of the two curves.

Minimum Risk Classification
The risk associated with an incorrect decision might be more important than the probability of error. So our decision criterion might be modified to minimize the average risk in making an incorrect decision.
We define a conditional risk (expected loss) for decision $\alpha_i$ when $X$ occurs as
$$R_i(X) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid X)$$
where $\lambda(\alpha_i \mid \omega_j)$ is defined as the conditional loss associated with decision $\alpha_i$ when the true class is $\omega_j$. It is assumed that $\lambda(\alpha_i \mid \omega_j)$ is known.
The decision rule: decide on $c_i$ if $R_i(X) < R_j(X)$ for all $j \neq i$, $1 \le j \le c$.
The discriminant function here can be defined as
$$g_i(X) = -R_i(X).$$
• We can show that the minimum-error decision is a special case of the above rule where
$$\lambda(\alpha_i \mid \omega_i) = 0, \qquad \lambda(\alpha_i \mid \omega_j) = 1 \ \text{for}\ i \neq j.$$
Then
$$R_i(X) = \sum_{j \neq i} P(\omega_j \mid X) = 1 - P(\omega_i \mid X),$$
so the rule is: decide $\omega_i$ if $1 - P(\omega_i \mid X) < 1 - P(\omega_j \mid X)$, that is, if $P(\omega_i \mid X) > P(\omega_j \mid X)$.
For the 2-category case, the minimum-risk classifier becomes:
$$R_1(X) = \lambda_{11} P(\omega_1 \mid X) + \lambda_{12} P(\omega_2 \mid X), \qquad R_2(X) = \lambda_{21} P(\omega_1 \mid X) + \lambda_{22} P(\omega_2 \mid X)$$
Decide $\omega_1$ if $R_1(X) < R_2(X)$:
$$\lambda_{11} P(\omega_1 \mid X) + \lambda_{12} P(\omega_2 \mid X) < \lambda_{21} P(\omega_1 \mid X) + \lambda_{22} P(\omega_2 \mid X)$$
$$\omega_1 \ \text{if}\ (\lambda_{21} - \lambda_{11})\, P(\omega_1 \mid X) > (\lambda_{12} - \lambda_{22})\, P(\omega_2 \mid X)$$
$$(\lambda_{21} - \lambda_{11})\, P(X \mid \omega_1)\, P(\omega_1) > (\lambda_{12} - \lambda_{22})\, P(X \mid \omega_2)\, P(\omega_2)$$
$$\omega_1 \ \text{if}\ \frac{P(X \mid \omega_1)}{P(X \mid \omega_2)} > \frac{(\lambda_{12} - \lambda_{22})}{(\lambda_{21} - \lambda_{11})} \cdot \frac{P(\omega_2)}{P(\omega_1)}; \qquad \text{otherwise } \omega_2.$$
This is the same as the likelihood rule if $\lambda_{12} = \lambda_{21} = 1$ and $\lambda_{11} = \lambda_{22} = 0$.

Discriminant Functions So Far
For minimum error:
$$g_i(X) = P(\omega_i \mid X) \quad\text{or}\quad P(X \mid \omega_i)\, P(\omega_i) \quad\text{or}\quad \log P(X \mid \omega_i) + \log P(\omega_i)$$
For minimum risk:
$$g_i(X) = -R_i(X), \qquad\text{where}\qquad R_i(X) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid X)$$
Bayes (Maximum Likelihood) Decision:
• Most general optimal solution
• Provides an upper limit (you cannot do better with another rule)
• Useful in comparing with other classifiers
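The following short Python sketch evaluates the minimum-risk rule $R_i(X) = \sum_j \lambda(\alpha_i \mid \omega_j) P(\omega_j \mid X)$ described above. The posterior values and the asymmetric loss matrix in the demo are hypothetical numbers chosen only for illustration; the zero-one loss case shows how the rule collapses to the minimum-error decision.

```python
def conditional_risks(posteriors, loss):
    """R_i(X) = sum_j lambda(alpha_i | omega_j) * P(omega_j | X) for each decision alpha_i."""
    return [sum(loss[i][j] * posteriors[j] for j in range(len(posteriors)))
            for i in range(len(loss))]

def min_risk_decision(posteriors, loss):
    """Decide the action alpha_i with the smallest conditional risk."""
    risks = conditional_risks(posteriors, loss)
    return min(range(len(risks)), key=lambda i: risks[i]), risks

# Zero-one loss reproduces the minimum-error (maximum a posteriori) rule.
zero_one = [[0, 1], [1, 0]]
# Hypothetical asymmetric loss: deciding omega_1 when omega_2 is true costs 10.
asymmetric = [[0, 10], [1, 0]]

posteriors = [0.8, 0.2]   # example posteriors P(omega_1|X), P(omega_2|X)
print(min_risk_decision(posteriors, zero_one))    # picks action 0 (omega_1)
print(min_risk_decision(posteriors, asymmetric))  # picks action 1 (omega_2) despite the lower posterior
```

The second call illustrates the point of the section: a large enough loss on one kind of mistake moves the decision away from the class with the higher posterior.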
Special Cases of Discriminant Functions: Multivariate Gaussian (Normal) Density
The general density form $P(X) = N(M, \Sigma)$:
$$P(X) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(X - M)^T \Sigma^{-1} (X - M)}$$
Here $X$ is the feature vector of size $d$.
• $M$: $d$-element mean vector, $E(X) = M = [\mu_1, \mu_2, \ldots, \mu_d]^T$
• $\Sigma$: $d \times d$ covariance matrix with $\sigma_{ij} = E[(X_i - \mu_i)(X_j - \mu_j)]$ and $\sigma_{ii} = E[(X_i - \mu_i)^2] = \sigma_i^2$ (variance of feature $i$)
• $\Sigma$ is symmetric; $\sigma_{ij} = 0$ when $X_i$ and $X_j$ are statistically independent
• $|\Sigma|$: determinant of $\Sigma$
General shape: the equal-density contours, where the Mahalanobis distance $(X - M)^T \Sigma^{-1} (X - M)$ is constant, are hyper-ellipsoids.
2-d problem: $X = [X_1, X_2]^T$, $M = [\mu_1, \mu_2]^T$,
$$\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{bmatrix}$$
If $\sigma_{12} = \sigma_{21} = 0$ (statistically independent features), the major axes of the ellipsoids are parallel to the coordinate axes; if in addition $\sigma_1^2 = \sigma_2^2$, the equal-density curves are circular. In general, the equal-density curves are hyper-ellipsoids.
Now $g_i(X) = \log_e P(X \mid \omega_i) + \log_e P(\omega_i)$ is used for $N(M_i, \Sigma_i)$ because of its ease of manipulation:
$$g_i(X) = -\tfrac{1}{2}(X - M_i)^T \Sigma_i^{-1} (X - M_i) - \tfrac{1}{2} \log |\Sigma_i| + \log P(\omega_i)$$
$g_i(X)$ is a quadratic function of $X$, as will be shown:
$$g_i(X) = -\tfrac{1}{2} X^T \Sigma_i^{-1} X + \tfrac{1}{2} X^T \Sigma_i^{-1} M_i + \tfrac{1}{2} M_i^T \Sigma_i^{-1} X - \tfrac{1}{2} M_i^T \Sigma_i^{-1} M_i - \tfrac{1}{2} \log |\Sigma_i| + \log P(\omega_i)$$
Define
$$W_i = -\tfrac{1}{2}\Sigma_i^{-1}, \qquad V_i = M_i^T \Sigma_i^{-1}, \qquad W_{i0} = -\tfrac{1}{2} M_i^T \Sigma_i^{-1} M_i - \tfrac{1}{2} \log |\Sigma_i| + \log P(\omega_i) \ \text{(a scalar)}.$$
Then
$$g_i(X) = X^T W_i X + V_i X + W_{i0}.$$
On the decision boundary, $g_i(X) = g_j(X)$:
$$X^T W_i X + V_i X + W_{i0} - X^T W_j X - V_j X - W_{j0} = 0$$
$$X^T (W_i - W_j) X + (V_i - V_j) X + (W_{i0} - W_{j0}) = 0 \quad\Longrightarrow\quad X^T W X + V X + W_0 = 0$$
The decision boundary function is hyperquadratic in general.
Example in 2-d:
$$W = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix}, \qquad V = [v_1 \ \ v_2], \qquad X = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$
Then the above boundary becomes
$$[x_1 \ \ x_2] \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + [v_1 \ \ v_2] \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + W_0 = 0$$
$$w_{11} x_1^2 + 2 w_{12} x_1 x_2 + w_{22} x_2^2 + v_1 x_1 + v_2 x_2 + W_0 = 0,$$
the general form of a hyperquadratic boundary in 2-d.

The Special Cases of the Gaussian
Assume $\Sigma_i = \sigma^2 I$, where $I$ is the unit matrix. Then $\Sigma_i$ is diagonal with $\sigma^2$ on the diagonal, $|\Sigma_i| = \sigma^{2d}$ and $\Sigma_i^{-1} = \frac{1}{\sigma^2} I$, so
$$g_i(X) = -\frac{1}{2\sigma^2}(X - M_i)^T (X - M_i) - \tfrac{1}{2} \log \sigma^{2d} + \log P(\omega_i).$$
The term $-\tfrac{1}{2}\log\sigma^{2d}$ is not a function of $X$ (or of $i$), so it can be removed:
$$g_i(X) = -\frac{1}{2\sigma^2}\, d^2(X, M_i) + \log P(\omega_i),$$
where $d(X, M_i) = \lVert X - M_i \rVert$ is the Euclidean distance between $X$ and $M_i$.
Now assume in addition $P(\omega_i) = P(\omega_j)$; then
$$g_i(X) = -\frac{1}{2\sigma^2}\, d^2(X, M_i)$$
and the decision boundary is linear!
Decision Rule: assign the unknown sample to the closest mean's category. The boundary between two categories is the perpendicular bisector of the line joining $M_1$ and $M_2$; if $P(\omega_i) \neq P(\omega_j)$, the bisector moves toward the less probable category.

Minimum Distance Classifier
• Classify an unknown sample $X$ to the category with the closest mean!
• Optimum when the densities are Gaussian with equal variance and equal a priori probability.
• Piecewise linear boundary in case of more than 2 categories (regions $R_1, \ldots, R_4$ around the means $M_1, \ldots, M_4$).

Another special case: $\Sigma_i = \Sigma$ (covariance matrices are the same)
• Samples fall in clusters of equal size and shape. It can be shown that
$$g_i(X) = -\tfrac{1}{2}(X - M_i)^T \Sigma^{-1} (X - M_i) + \log P(\omega_i),$$
where $(X - M_i)^T \Sigma^{-1} (X - M_i)$ is called the Mahalanobis distance between $X$ and $M_i$.
Then, if $P(\omega_i) = P(\omega_j)$, the decision rule is: decide $\omega_i$ if the Mahalanobis distance of the unknown sample to $M_i$ is smaller than the Mahalanobis distance of the unknown sample to $M_j$.
If $P(\omega_i) \neq P(\omega_j)$, the boundary moves toward the less probable one.

Binary Random Variables
• Discrete features: features can take only discrete values. Integrals are replaced by summations.
• Binary features: $X_i$ is 0 or 1, with
$$p_i = P(X_i = 1 \mid \omega_1), \qquad q_i = P(X_i = 1 \mid \omega_2), \qquad P(X_i = 0 \mid \omega_1) = 1 - p_i.$$
• Assume the binary features are statistically independent, where each $X_i$ is binary and $X = [X_1, X_2, \ldots, X_d]^T$.

Example: bit-matrix for machine-printed characters ('A', 'B', ...). Here each pixel may be taken as a feature $X_i$; for a $10 \times 10$ bit-matrix, $d = 10 \times 10 = 100$, and $p_i$ is the probability that $X_i = 1$ for the letter A, B, ...
$$P(x_i \mid \omega) = (p_i)^{x_i} (1 - p_i)^{1 - x_i}, \quad\text{defined for } x_i \in \{0, 1\},\ \text{undefined elsewhere}$$
If statistical independence of the features is assumed:
$$P(X \mid \omega) = \prod_{i=1}^{d} P(x_i \mid \omega) = \prod_{i=1}^{d} (p_i)^{x_i} (1 - p_i)^{1 - x_i}$$
$$g_k(X) = \log P(X \mid \omega_k) + \log P(\omega_k) = \sum_{i=1}^{d} \big[x_i \log p_i + (1 - x_i) \log(1 - p_i)\big] + \log P(\omega_k)$$
• Consider the 2-category problem; assume $p_i = P(x_i = 1 \mid \omega_1)$ and $q_i = P(x_i = 1 \mid \omega_2)$. Then the decision boundary is
$$\sum_i \big[x_i \log p_i + (1 - x_i) \log(1 - p_i)\big] - \sum_i \big[x_i \log q_i + (1 - x_i) \log(1 - q_i)\big] + \log P(\omega_1) - \log P(\omega_2) = 0.$$
So if
$$g(X) = \sum_i x_i \log \frac{p_i}{q_i} + \sum_i (1 - x_i) \log \frac{1 - p_i}{1 - q_i} + \log \frac{P(\omega_1)}{P(\omega_2)} > 0,$$
decide category 1, else category 2. The decision boundary $g(X) = \sum_i W_i x_i + W_0$ is linear in $X$, a weighted sum of the inputs, where
$$W_i = \ln \frac{p_i (1 - q_i)}{q_i (1 - p_i)}, \qquad W_0 = \sum_i \ln \frac{1 - p_i}{1 - q_i} + \ln \frac{P(\omega_1)}{P(\omega_2)}.$$
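A minimal Python sketch of the linear discriminant for independent binary features derived above, using the weights $W_i = \ln \frac{p_i(1-q_i)}{q_i(1-p_i)}$ and the bias $W_0$. The 4-pixel probabilities and the equal priors below are invented for the demo and are not taken from the notes.

```python
import math

def binary_discriminant_weights(p, q, prior1, prior2):
    """Weights of g(X) = sum_i W_i x_i + W_0 for independent binary features.

    p[i] = P(x_i = 1 | omega_1), q[i] = P(x_i = 1 | omega_2).
    """
    W = [math.log(pi * (1 - qi) / (qi * (1 - pi))) for pi, qi in zip(p, q)]
    W0 = (sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q))
          + math.log(prior1 / prior2))
    return W, W0

def decide(x, W, W0):
    """Decide omega_1 (return 1) if g(X) > 0, else omega_2 (return 2)."""
    g = sum(wi * xi for wi, xi in zip(W, x)) + W0
    return 1 if g > 0 else 2

# Illustrative 4-pixel problem (hypothetical values):
p = [0.9, 0.8, 0.2, 0.1]     # pixel "on" probabilities under omega_1
q = [0.2, 0.3, 0.7, 0.8]     # pixel "on" probabilities under omega_2
W, W0 = binary_discriminant_weights(p, q, prior1=0.5, prior2=0.5)
print(decide([1, 1, 0, 0], W, W0))   # pattern typical of omega_1 -> 1
print(decide([0, 0, 1, 1], W, W0))   # pattern typical of omega_2 -> 2
```

Note that, exactly as in the derivation, the classifier is just a weighted sum of the input bits compared against zero, so a bit-matrix character can be classified with one pass over its pixels.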