Chapter 8 Discriminant Analysis

8.1 Introduction

Classification is an important issue in multivariate analysis and data mining.

Classification constructs a model from a training set and the values (class labels) of a classifying attribute, then uses the model to classify new data, i.e., to predict unknown or missing class labels.

Classification: A Two-Step Process

- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
  - The set of tuples used for model construction is the training set.
  - The model is represented as classification rules, decision trees, or mathematical formulae.
- Prediction: classifying future or unknown objects
  - Estimate the accuracy of the model:
    - The known label of each test sample is compared with the classified result from the model.
    - The accuracy rate is the percentage of test set samples that are correctly classified by the model.
    - The test set must be independent of the training set; otherwise the accuracy estimate is over-optimistic and over-fitting goes undetected.
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Classification Process: Model Construction

Training Data -> Classification Algorithms -> Classifier (Model)

Training data:

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (model):

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process: Use the Model in Prediction

Testing data:

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4). Tenured?

Supervised vs. Unsupervised Learning

- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
  - New data are classified based on the training set.
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown.
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

Discrimination: Introduction

Discrimination is a technique concerned with allocating new observations to previously defined groups. There are $k$ samples from $k$ distinct populations:

$$
G_1:\ \begin{pmatrix} x^{(1)}_{11} & \cdots & x^{(1)}_{1p} \\ \vdots & & \vdots \\ x^{(1)}_{n_1 1} & \cdots & x^{(1)}_{n_1 p} \end{pmatrix},
\quad \ldots, \quad
G_k:\ \begin{pmatrix} x^{(k)}_{11} & \cdots & x^{(k)}_{1p} \\ \vdots & & \vdots \\ x^{(k)}_{n_k 1} & \cdots & x^{(k)}_{n_k p} \end{pmatrix}
$$

One wants to find the so-called discriminant function and the related rule for classifying new observations.

Example 11.3: Bivariate case (figure)

Discriminant function and rule

Discriminant function: $w(x) = l'x$

Rule: $x \in G_1$ if $w(x) \ge a$; $x \in G_2$ if $w(x) < a$.

Example 11.1: Riding mowers

Consider two groups in a city: riding-mower owners and those without riding mowers. In order to identify the best sales prospects for an intensive sales campaign, a riding-mower manufacturer is interested in classifying families as prospective owners or nonowners on the basis of income and lot size.

Owners:

x1: Income ($1000s)   x2: Lot size (1000 ft^2)
 60.0                 18.4
 85.5                 16.8
 64.8                 21.6
 61.5                 20.8
 87.0                 23.6
110.1                 19.2
108.0                 17.6
 82.8                 22.4
 69.0                 20.0
 93.0                 20.8
 51.0                 22.0
 81.0                 20.0

Nonowners:

x1: Income ($1000s)   x2: Lot size (1000 ft^2)
 75.0                 19.6
 52.8                 20.8
 64.8                 17.2
 43.2                 20.4
 84.0                 17.6
 49.2                 17.6
 59.4                 16.0
 66.0                 18.4
 47.4                 16.4
 33.0                 18.8
 51.0                 14.0
 63.0                 14.8

Classification results:

            Classified as
True        G1    G2
G1          10     2
G2           2    10

8.2 Discriminant by Distance

Assume $k = 2$ for simplicity:

$$G_1: N_p(\mu^{(1)}, \Sigma_1), \qquad G_2: N_p(\mu^{(2)}, \Sigma_2).$$

Discriminant function: $w(x) = d^2(x, G_1) - d^2(x, G_2)$

Rule: $x \in G_1$ if $w(x) \le 0$; $x \in G_2$ if $w(x) > 0$. That is, $x$ is allocated to the nearer population.

Here $d^2$ is the Mahalanobis distance

$$d^2(x, G_j) = (x - \mu^{(j)})' \Sigma_j^{-1} (x - \mu^{(j)}), \qquad j = 1, 2.$$
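The distance rule can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes the sign convention $w(x) = d^2(x, G_1) - d^2(x, G_2)$ with $x$ assigned to $G_1$ when $w(x) \le 0$, and the helper names (`solve`, `mahalanobis_sq`, `classify_by_distance`) and the parameter values are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - s) / m[i][i]
    return z


def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance (x - mean)' cov^{-1} (x - mean)."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    z = solve(cov, diff)  # z = cov^{-1} diff, without forming the inverse
    return sum(d * zi for d, zi in zip(diff, z))


def classify_by_distance(x, mean1, cov1, mean2, cov2):
    """Return 1 if x is at least as close to G1 as to G2 (w(x) <= 0), else 2."""
    w = mahalanobis_sq(x, mean1, cov1) - mahalanobis_sq(x, mean2, cov2)
    return 1 if w <= 0 else 2


# Hypothetical population parameters (not taken from the examples in the text).
mu1, sigma1 = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
mu2, sigma2 = [4.0, 0.0], [[4.0, 0.0], [0.0, 4.0]]

label_near_g1 = classify_by_distance([1.0, 0.0], mu1, sigma1, mu2, sigma2)
label_midpoint = classify_by_distance([2.0, 0.0], mu1, sigma1, mu2, sigma2)
```

With these made-up parameters the Euclidean midpoint (2, 0) goes to $G_2$, because $G_2$ has the larger spread and the point is therefore closer to it in Mahalanobis distance.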
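When $\Sigma_1 = \Sigma_2 = \Sigma$, the distance-based discriminant function $w(x) = d^2(x, G_1) - d^2(x, G_2)$ becomes

$$
w(x) = (x - \mu^{(1)})' \Sigma^{-1} (x - \mu^{(1)}) - (x - \mu^{(2)})' \Sigma^{-1} (x - \mu^{(2)})
     = -2 \left( x - \frac{\mu^{(1)} + \mu^{(2)}}{2} \right)' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}).
$$

Let

$$\bar{\mu} = \tfrac{1}{2} (\mu^{(1)} + \mu^{(2)}), \qquad c = \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}).$$

Dropping the factor $-2$ (which reverses the inequality in the rule), the discriminant function can be taken as

$$w(x) = (x - \bar{\mu})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}) = c'(x - \bar{\mu}),$$

with $x \in G_1$ if $w(x) \ge 0$ and $x \in G_2$ if $w(x) < 0$.

When $\mu^{(1)}, \mu^{(2)}, \Sigma$ are unknown, they are estimated by

$$\bar{x}^{(j)} = \frac{1}{n_j} \sum_{i=1}^{n_j} x_i^{(j)}, \qquad \hat{\Sigma} = \frac{1}{n_1 + n_2 - 2} (A_1 + A_2),$$

where

$$A_j = \sum_{i=1}^{n_j} (x_i^{(j)} - \bar{x}^{(j)}) (x_i^{(j)} - \bar{x}^{(j)})'.$$

Example: Univariate case with equal variance

$G_1: N(\mu_1, \sigma_1^2)$, $G_2: N(\mu_2, \sigma_2^2)$ with $\sigma_1^2 = \sigma_2^2$. Assuming $\mu_1 < \mu_2$:

Rule: $x \in G_1$ if $x \le a$; $x \in G_2$ if $x > a$, where

$$a = \frac{\mu_1 + \mu_2}{2}.$$

Example: Univariate case with unequal variances

$G_1: N(\mu_1, \sigma_1^2)$, $G_2: N(\mu_2, \sigma_2^2)$. The cut point becomes

$$a^* = \frac{\mu_1 \sigma_2 + \mu_2 \sigma_1}{\sigma_1 + \sigma_2},$$

the point at which the standardized distances to the two means are equal: $(a^* - \mu_1)/\sigma_1 = (\mu_2 - a^*)/\sigma_2$.

8.3 Fisher's Discriminant Function

Idea: projection, ANOVA

Training samples:

$$
G_1: N_p(\mu^{(1)}, \Sigma), \quad x_1^{(1)}, \ldots, x_{n_1}^{(1)}; \qquad \ldots; \qquad
G_k: N_p(\mu^{(k)}, \Sigma), \quad x_1^{(k)}, \ldots, x_{n_k}^{(k)}.
$$

Projecting the data onto a direction $l \in R^p$, the F-statistic is

$$F_l = \frac{l'Bl / (k - 1)}{l'El / (n - k)}, \qquad n = n_1 + \cdots + n_k,$$

where

$$
B = \sum_{a=1}^{k} n_a (\bar{x}^{(a)} - \bar{x}) (\bar{x}^{(a)} - \bar{x})', \qquad
E = \sum_{a=1}^{k} \sum_{j=1}^{n_a} (x_j^{(a)} - \bar{x}^{(a)}) (x_j^{(a)} - \bar{x}^{(a)})'.
$$

We want to find $l^* \in R^p$ such that

$$F_{l^*} = \max_{l \in R^p} F_l.$$

The solution $l^*$ is the eigenvector associated with the largest eigenvalue of $|B - \lambda E| = 0$.

Discriminant function: $u(x) = l'x$, where $l = l^*$.

(B) Two Populations

We have

$$B = n_1 (\bar{x}^{(1)} - \bar{x}) (\bar{x}^{(1)} - \bar{x})' + n_2 (\bar{x}^{(2)} - \bar{x}) (\bar{x}^{(2)} - \bar{x})' \quad \text{and} \quad E = A_1 + A_2.$$

Note that

$$\bar{x} = \frac{n_1 \bar{x}^{(1)} + n_2 \bar{x}^{(2)}}{n_1 + n_2},$$

so

$$B = \frac{n_1 n_2}{n_1 + n_2} (\bar{x}^{(1)} - \bar{x}^{(2)}) (\bar{x}^{(1)} - \bar{x}^{(2)})'.$$

There is only one non-zero eigenvalue of $|B - \lambda E| = 0$, since $\mathrm{rank}(B) = 1$. The associated eigenvector is $c = E^{-1} (\bar{x}^{(1)} - \bar{x}^{(2)})$.

Discriminant function: $u(x) = x' E^{-1} (\bar{x}^{(1)} - \bar{x}^{(2)}) = c'x$

Rule: $x \in G_1$ if $u(x) \ge \bar{u}$; $x \in G_2$ if $u(x) < \bar{u}$, where, when $\Sigma_1 = \Sigma_2$,

$$\bar{u} = \tfrac{1}{2} c' (\bar{x}^{(1)} + \bar{x}^{(2)}).$$

When $\Sigma_1 \ne \Sigma_2$, $\bar{u}$ is replaced by

$$\bar{u}^* = \frac{\hat{\sigma}_2 \, c'\bar{x}^{(1)} + \hat{\sigma}_1 \, c'\bar{x}^{(2)}}{\hat{\sigma}_1 + \hat{\sigma}_2},$$

where

$$
\hat{\sigma}_1^2 = \frac{1}{n_1 - 1} c' A_1 c = \frac{1}{n_1 - 1} (\bar{x}^{(1)} - \bar{x}^{(2)})' (A_1 + A_2)^{-1} A_1 (A_1 + A_2)^{-1} (\bar{x}^{(1)} - \bar{x}^{(2)}),
$$

$$
\hat{\sigma}_2^2 = \frac{1}{n_2 - 1} c' A_2 c = \frac{1}{n_2 - 1} (\bar{x}^{(1)} - \bar{x}^{(2)})' (A_1 + A_2)^{-1} A_2 (A_1 + A_2)^{-1} (\bar{x}^{(1)} - \bar{x}^{(2)}).
$$

Example: Insect Classification

Table 2.1 Data of two species of insects

No.   x1     x2     n.g.   c.g.   y
 1    6.36   5.24   1      1      2.4713
 2    5.92   5.12   1      2      2.3335
 3    5.92   5.36   1      1      2.3663
 4    6.44   5.64   1      1      2.5481
 5    6.40   5.16   1      1      2.4714
 6    6.56   5.56   1      1      2.5702
 7    6.64   5.36   1      1      2.5650
 8    6.68   4.96   1      1      2.5213
 9    6.72   5.48   1      1      2.6034
10    6.76   5.60   1      1      2.6309
11    6.72   5.08   1      1      2.5488

No.   x1     x2     n.g.   c.g.   y
 1    6.00   4.88   2      2      2.3227
 2    5.60   4.64   2      2      2.1796
 3    5.65   4.96   2      2      2.2343
 4    5.76   4.80   2      2      2.2456
 5    5.96   5.08   2      2      2.3391
 6    5.72   5.04   2      2      2.2674
 7    5.64   4.96   2      2      2.2343
 8    5.44   4.88   2      2      2.1682
 9    5.04   4.44   2      2      1.9977
10    4.56   4.04   2      2      1.8106
11    5.48   4.20   2      2      2.0863
12    5.76   4.80   2      2      2.2456

Note: the data x1 and x2 are characteristics of the insects (Hoel, 1947); n.g. means natural group (species), c.g. the classified group, and y the value of the discriminant function.

The sample means are

$$
\bar{x}^{(1)} = \begin{pmatrix} 6.4654 \\ 5.3236 \end{pmatrix}, \quad
\bar{x}^{(2)} = \begin{pmatrix} 5.5500 \\ 4.7267 \end{pmatrix}, \quad
\bar{x} = \begin{pmatrix} 5.9878 \\ 5.0122 \end{pmatrix},
$$

and

$$
E = \begin{pmatrix} 2.6765 & 1.2942 \\ 1.2942 & 1.7545 \end{pmatrix}, \quad
B = \begin{pmatrix} 4.8097 & 3.1364 \\ 3.1364 & 2.0453 \end{pmatrix}.
$$

The eigenvalue of $|B - \lambda E| = 0$ is 1.9187 and the associated eigenvector is

$$E^{-1} (\bar{x}^{(1)} - \bar{x}^{(2)}) = \begin{pmatrix} 0.2759 \\ 0.1367 \end{pmatrix}.$$

The discriminant function is

$$u(x_1, x_2) = 0.2759 x_1 + 0.1367 x_2,$$

and its value for each observation is given in column y of the table. The cutting point is $\bar{u} = 2.3447$.

Classification results:

            Classified as
True        G1    G2
G1          10     1
G2           0    12

If we use $\bar{u}^* = 2.3831$, with $\hat{\sigma}_1 = 0.0939$ and $\hat{\sigma}_2 = 0.1497$, we obtain the same classification.

8.4 Bayes' Discriminant Analysis

A. Idea

There are $k$ populations $G_1, \ldots, G_k$ in $R^p$. A partition $R_1, \ldots, R_k$ of $R^p$ is determined based on a training sample.

Rule: $x \in G_i$ if $x$ falls into $R_i$.

Loss: $c(j \mid i)$ is the cost incurred when $x$ is from $G_i$ but falls into $R_j$.

The probability of this misclassification is

$$P(j \mid i) = \int_{R_j} p_i(x) \, dx,$$

where $p_i(x)$ is the density of $x$ in $G_i$.

The expected cost of misclassification is

$$ECM(R_1, \ldots, R_k) = \sum_{i=1}^{k} q_i \sum_{j=1}^{k} c(j \mid i) P(j \mid i),$$

where $q_1, \ldots, q_k$ are the prior probabilities. We want to minimize $ECM(R_1, \ldots, R_k)$ with respect to $R_1, \ldots, R_k$.

B. Method

Theorem 6.4.1 Let

$$h_t(x) = \sum_{i=1,\, i \ne t}^{k} q_i \, p_i(x) \, c(t \mid i).$$

Then the optimal $R_t$'s are

$$R_t = \{ x : h_t(x) < h_j(x), \ j \ne t \}, \qquad t = 1, \ldots, k.$$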
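The rule in Theorem 6.4.1 amounts to computing $h_t(x) = \sum_{i \ne t} q_i p_i(x) c(t \mid i)$ for every $t$ and allocating $x$ to the group with the smallest value. The sketch below is only an illustration under stated assumptions: the populations are three made-up univariate normals, the function names (`normal_pdf`, `bayes_classify`) are hypothetical, groups are indexed from 0, and ties are broken in favour of the lowest index.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of the univariate normal N(mean, sd^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def bayes_classify(x, priors, densities, cost):
    """Return the 0-based index t that minimises h_t(x).

    priors[i]    : prior probability q_i
    densities[i] : callable returning p_i(x)
    cost[t][i]   : c(t|i), the cost of allocating to G_t an x that comes from G_i
    """
    k = len(priors)
    weighted = [priors[i] * densities[i](x) for i in range(k)]
    h = [sum(weighted[i] * cost[t][i] for i in range(k) if i != t) for t in range(k)]
    return min(range(k), key=lambda t: h[t])


# Hypothetical setup: N(0, 1), N(3, 1), N(6, 1) with equal priors.
densities = [lambda x, m=m: normal_pdf(x, m, 1.0) for m in (0.0, 3.0, 6.0)]
priors = [1.0 / 3.0] * 3

# 0-1 cost: every misclassification costs 1.
zero_one = [[0.0, 1.0, 1.0],
            [1.0, 0.0, 1.0],
            [1.0, 1.0, 0.0]]

# Same, except that sending a member of the second group to the first costs 10.
asymmetric = [[0.0, 10.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]]

label_equal_cost = bayes_classify(1.0, priors, densities, zero_one)
label_asymmetric = bayes_classify(1.0, priors, densities, asymmetric)
```

Under the 0-1 cost the point $x = 1$ goes to the first population, whose mean is nearest. Raising $c(1 \mid 2)$ to 10 moves the same point into the second population's region, which shows how the costs reshape the partition.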
Corollary 1 Take $c(j \mid i) = 1 - \delta_{ij}$, i.e., 1 if $i \ne j$ and 0 if $i = j$. Then

$$R_t = \{ x : q_t p_t(x) > q_j p_j(x), \ j \ne t \}, \qquad t = 1, \ldots, k.$$

Proof:

$$h_t(x) = \sum_{i \ne t} q_i p_i(x) = \sum_{i=1}^{k} q_i p_i(x) - q_t p_t(x) = c(x) - q_t p_t(x),$$

where $c(x) = \sum_{i=1}^{k} q_i p_i(x)$ does not depend on $t$, so minimizing $h_t(x)$ is the same as maximizing $q_t p_t(x)$.

Corollary 2 In the case of $k = 2$,

$$h_1(x) = q_2 \, p_2(x) \, c(1 \mid 2), \qquad h_2(x) = q_1 \, p_1(x) \, c(2 \mid 1),$$

and we have

$$R_1 = \{ x : q_2 \, p_2(x) \, c(1 \mid 2) \le q_1 \, p_1(x) \, c(2 \mid 1) \},$$

$$R_2 = \{ x : q_2 \, p_2(x) \, c(1 \mid 2) > q_1 \, p_1(x) \, c(2 \mid 1) \}.$$

Discriminant function: $u(x) = p_1(x) / p_2(x)$

Rule: $x \in G_1$ if $u(x) \ge d$; $x \in G_2$ if $u(x) < d$, where

$$d = \frac{q_2 \, c(1 \mid 2)}{q_1 \, c(2 \mid 1)}.$$

Corollary 3 In the case of $k = 2$ and

$$x \sim \begin{cases} N_p(\mu^{(1)}, \Sigma) & \text{if } x \in G_1, \\ N_p(\mu^{(2)}, \Sigma) & \text{if } x \in G_2, \end{cases}$$

we have

$$u(x) = \frac{p_1(x)}{p_2(x)} = \exp\{ w(x) \},$$

where

$$w(x) = \left( x - \frac{\mu^{(1)} + \mu^{(2)}}{2} \right)' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}).$$

Rule: $x \in G_1$ if $w(x) \ge \ln d$; $x \in G_2$ if $w(x) < \ln d$.

C. Example 11.3: Detection of hemophilia A carriers

To construct a procedure for detecting potential hemophilia A carriers, blood samples were assayed for two groups of women, and measurements on two variables were recorded. The first group of 30 women was selected from a population of women who did not carry the hemophilia gene. This group was called the normal group. The second group of 45 women was selected from known hemophilia A carriers. This group was called the obligatory carriers.

Variables:
- log10(AHF activity)
- log10(AHF-like antigen)

Populations:
- women who did not carry the hemophilia gene (n1 = 30)
- women who are known hemophilia A carriers (n2 = 45)

Data set (values listed by variable)

Normal group, log10(AHF activity):
-0.0056 -0.1698 -0.3469 -0.0894 -0.1679 -0.0836 -0.1979 -0.0762 -0.1913 -0.1092
-0.5268 -0.0842 -0.0225  0.0084 -0.1827  0.1237 -0.4702 -0.1519  0.0006 -0.2015
-0.1932  0.1507 -0.1259 -0.1551 -0.1952  0.0291 -0.228  -0.0997 -0.1972 -0.0867

Normal group, log10(AHF-like antigen):
-0.1657 -0.1585 -0.1879  0.0064  0.0713  0.0106 -0.0005  0.0392 -0.2123 -0.119
-0.4773  0.0248 -0.058   0.0782 -0.1138  0.214  -0.3099 -0.0686 -0.1153 -0.0498
-0.2293  0.0933 -0.0669 -0.1232 -0.1007  0.0442 -0.171  -0.0733 -0.0607 -0.056

Obligatory carriers, log10(AHF activity):
-0.3478 -0.4719 -0.2447 -0.3351 -0.1878 -0.3618 -0.4986 -0.5015 -0.1326 -0.6911
-0.3608 -0.4535 -0.3479 -0.3539 -0.361  -0.3226 -0.4319 -0.2734 -0.5573 -0.3755
-0.495  -0.5107 -0.1652 -0.4232 -0.2375 -0.2205 -0.2154 -0.3447 -0.254  -0.3778
-0.4046 -0.0639 -0.0149 -0.0312 -0.174  -0.1416 -0.1508 -0.0964 -0.2642 -0.0234
-0.3352 -0.1744 -0.4055 -0.2444 -0.4784

Obligatory carriers, log10(AHF-like antigen):
 0.1151 -0.2008 -0.086  -0.2984  0.0097 -0.339   0.1237 -0.1682 -0.1721  0.0722
-0.1079 -0.0399  0.167  -0.0687 -0.002   0.0548 -0.1865 -0.0153 -0.2483  0.2132
-0.0407 -0.0998  0.2876  0.0046 -0.0219  0.0097 -0.0573 -0.2682 -0.1162  0.1569
-0.1368  0.1539  0.14   -0.0776  0.1642  0.1137  0.0531  0.0867  0.0804  0.0875
 0.251   0.1892 -0.2418  0.1614  0.0282

SAS output
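For two normal populations with a common covariance matrix, the rule of Corollary 3 compares the linear score $w(x)$ with $\ln d$, where $d = q_2 c(1 \mid 2) / (q_1 c(2 \mid 1))$. The following is a minimal sketch of that comparison in plain Python. The parameter values are made up for illustration and are not estimates from the hemophilia data, and the helper names (`solve`, `linear_score`, `bayes_normal_classify`) are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - s) / m[i][i]
    return z


def linear_score(x, mu1, mu2, sigma):
    """w(x) = (x - (mu1 + mu2)/2)' sigma^{-1} (mu1 - mu2)."""
    c = solve(sigma, [a - b for a, b in zip(mu1, mu2)])  # c = sigma^{-1}(mu1 - mu2)
    return sum(ci * (xi - 0.5 * (a + b)) for ci, xi, a, b in zip(c, x, mu1, mu2))


def bayes_normal_classify(x, mu1, mu2, sigma, q1=0.5, q2=0.5, c12=1.0, c21=1.0):
    """Return 1 if w(x) >= ln d, else 2; c12 stands for c(1|2), c21 for c(2|1)."""
    d = (q2 * c12) / (q1 * c21)
    return 1 if linear_score(x, mu1, mu2, sigma) >= math.log(d) else 2


# Hypothetical parameters, chosen only to make the arithmetic easy to follow.
mu1, mu2 = [1.0, 0.0], [-1.0, 0.0]
sigma = [[1.0, 0.5], [0.5, 1.0]]
x_new = [0.2, 0.0]

label_equal_priors = bayes_normal_classify(x_new, mu1, mu2, sigma)
label_rare_g1 = bayes_normal_classify(x_new, mu1, mu2, sigma, q1=0.2, q2=0.8)
```

With equal priors and costs the threshold is $\ln d = 0$ and the rule reduces to the distance rule for equal covariances. Lowering the prior of $G_1$ to 0.2 raises the threshold to $\ln 4$, so a borderline point such as `x_new` moves to $G_2$.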