Example: Discriminating between loan applicants

• Data on eight financial variables for 68 farmers who have borrowed money from a bank.
• Based on repayment history, each farmer was classified as a "good" or a "bad" customer. The training samples consist of 34 good customers and 34 bad customers.
• Construct a classification rule to classify new applicants as potential "good" or "bad" customers.
• We first ask SAS/R to test for equality of the covariance matrices, which indicates whether we should use a linear classification rule (if the covariance matrices are homogeneous) or a quadratic one (if they are not).

Classification with Several Populations

• We now consider the case of g > 2 populations.
• While extending the classification rules to this case is straightforward, accurately assessing the error rates of the sample classification functions is not trivial.
• As in the g = 2 case, we first develop the theoretically optimal rules and then see how they need to be modified in applications.
• We let f_i(x), i = 1, ..., g, denote the density associated with population π_i.
• Further, p_i is the prior probability of π_i, and c(k|i), k = 1, ..., g, is the cost of allocating an item to π_k when it belongs in π_i. For k = i, c(i|i) = 0.
• R_k is the set of x's classified as π_k, and

      P(k|i) = ∫_{R_k} f_i(x) dx,      P(i|i) = 1 − ∑_{k=1, k≠i}^{g} P(k|i).

• The conditional expected cost of misclassifying x from π_1 into any of the other g − 1 groups is

      ECM(1) = P(2|1)c(2|1) + P(3|1)c(3|1) + ··· + P(g|1)c(g|1) = ∑_{k=2}^{g} P(k|1)c(k|1).

• ECM(1) is incurred with prior probability p_1.
• Obtaining ECM(2), ECM(3), ..., ECM(g) in a similar manner, the overall expected cost of misclassification is

      ECM = p_1 ECM(1) + p_2 ECM(2) + ··· + p_g ECM(g) = ∑_{i=1}^{g} p_i ∑_{k=1, k≠i}^{g} P(k|i)c(k|i).

• The optimal classification rule consists in choosing R_1, R_2, ..., R_g to minimize ECM.
• It can be shown that the regions that minimize ECM are obtained by allocating x to the population π_k for which

      ∑_{i=1, i≠k}^{g} p_i f_i(x) c(k|i)

  is smallest. When all costs of misclassification are equal, the minimum ECM rule allocates x to the π_k for which ∑_{i≠k} p_i f_i(x) is smallest.
• Since ∑_{i≠k} p_i f_i(x) is smallest when the omitted term p_k f_k(x) is largest, for equal costs the optimal rule allocates x_0 to π_k when

      p_k f_k(x_0) > p_i f_i(x_0)   for all i ≠ k.

• Note that this rule is identical to the Bayes rule that maximizes the posterior probability

      P(π_k | x) = (prior × likelihood) / ∑ [prior × likelihood] = p_k f_k(x) / ∑_{i=1}^{g} p_i f_i(x),   k = 1, 2, ..., g.

• In practice, we need to know the g densities, the misclassification costs, and the prior probabilities.

Example with g = 3 Populations

                        Classify as
  True population   π_1             π_2             π_3             Priors        Density at x_0
  π_1               c(1|1) = 0      c(2|1) = 10     c(3|1) = 50     p_1 = 0.05    f_1(x_0) = 0.01
  π_2               c(1|2) = 500    c(2|2) = 0      c(3|2) = 200    p_2 = 0.60    f_2(x_0) = 0.85
  π_3               c(1|3) = 100    c(2|3) = 50     c(3|3) = 0      p_3 = 0.35    f_3(x_0) = 2

• We first compute the three values of ∑_{i≠k} p_i f_i(x_0) c(k|i):

      k = 1:  p_2 f_2(x_0) c(1|2) + p_3 f_3(x_0) c(1|3) = 0.60 × 0.85 × 500 + 0.35 × 2 × 100 = 325
      k = 2:  0.05 × 0.01 × 10 + 0.35 × 2 × 50 = 35.005
      k = 3:  0.05 × 0.01 × 50 + 0.60 × 0.85 × 200 = 102.025

• Since the smallest value is achieved for k = 2, we allocate x_0 to π_2 (see the R sketch below).
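As a quick check, the minimum-ECM computation above can be reproduced in a few lines of R. This is a minimal sketch: the priors, densities at x_0, and cost matrix are taken from the table, while the variable names are purely illustrative.

    # Minimum-ECM allocation for the g = 3 example above (object names are illustrative)
    p <- c(0.05, 0.60, 0.35)                 # prior probabilities p_i
    f <- c(0.01, 0.85, 2.00)                 # densities f_i(x0) at the new observation
    # cost[i, k] = c(k|i): cost of allocating to pi_k an item that belongs to pi_i
    cost <- rbind(c(  0,  10,  50),
                  c(500,   0, 200),
                  c(100,  50,   0))

    # For each candidate population k, sum p_i * f_i(x0) * c(k|i) over i != k
    ecm <- sapply(1:3, function(k) sum(p[-k] * f[-k] * cost[-k, k]))
    ecm              # 325.000  35.005  102.025
    which.min(ecm)   # 2: allocate x0 to population pi_2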
Example with g = 3 Populations (cont'd)

• If all costs of misclassification were equal, we would just look at the products p_i f_i(x_0):

      p_1 f_1(x_0) = 0.05 × 0.01 = 0.0005
      p_2 f_2(x_0) = 0.60 × 0.85 = 0.51
      p_3 f_3(x_0) = 0.35 × 2 = 0.70

• In this case, we would allocate x_0 to π_3.
• Finally, if we were using the Bayes criterion, we first compute

      ∑_{i=1}^{g} p_i f_i(x_0) = 0.05 × 0.01 + 0.60 × 0.85 + 0.35 × 2 = 1.2105

  and then compute the three posterior probabilities as

      P(π_1 | x_0) = p_1 f_1(x_0) / 1.2105 = 0.0005 / 1.2105 = 0.0004
      P(π_2 | x_0) = p_2 f_2(x_0) / 1.2105 = 0.51 / 1.2105 = 0.421
      P(π_3 | x_0) = p_3 f_3(x_0) / 1.2105 = 0.70 / 1.2105 = 0.578

• x_0 is allocated to π_3, the population with the highest posterior probability.

Classification with Normal Populations

• If f_i(x) is the N_p(µ_i, Σ_i) density and the costs c(k|i) are all equal, the minimum ECM (or minimum TPM) rule is: allocate x to π_k if

      ln p_k f_k(x) = ln p_k − (p/2) ln(2π) − (1/2) ln |Σ_k| − (1/2)(x − µ_k)′ Σ_k^{-1} (x − µ_k) = max_i ln p_i f_i(x).

• The constant (p/2) ln(2π) is the same for all i. We ignore it and define the quadratic discriminant score for the ith population:

      d_i^Q(x) = −(1/2) ln |Σ_i| − (1/2)(x − µ_i)′ Σ_i^{-1} (x − µ_i) + ln p_i.

• We allocate x to π_k if d_k^Q(x) is the largest.

Estimated Minimum TPM Rule

• In practice, we need to plug in the estimates x̄_i, S_i for µ_i, Σ_i. The estimated minimum TPM rule then allocates x to π_k if

      d̂_k^Q(x) = −(1/2) ln |S_k| − (1/2)(x − x̄_k)′ S_k^{-1} (x − x̄_k) + ln p_k

  is largest among the g sample quadratic scores.
• If the population covariance matrices are equal, Σ_1 = ··· = Σ_g = Σ, the score simplifies to a linear score because the first two terms in

      d_i^Q(x) = −(1/2) ln |Σ| − (1/2) x′ Σ^{-1} x + µ_i′ Σ^{-1} x − (1/2) µ_i′ Σ^{-1} µ_i + ln p_i

  are the same for all i.
• In the case of equal population covariance matrices and normal populations, we therefore define the linear sample discriminant score as

      d̂_i(x) = x̄_i′ S_pool^{-1} x − (1/2) x̄_i′ S_pool^{-1} x̄_i + ln p_i,   i = 1, ..., g,

  and allocate x to π_k if d̂_k(x) is the largest among the g linear discriminant scores.
• An alternative rule in the equal covariance case is derived from d_i^Q(x) by ignoring the constant term in |Σ|. If we do so and plug in sample estimates for µ_i and Σ, then

      d̃_i(x) = −(1/2)(x − x̄_i)′ S_pool^{-1} (x − x̄_i) + ln p_i = −(1/2) D_i²(x) + ln p_i,

  so we allocate x to the "closest" population, after adjusting the squared distance by the prior term (equivalently, allocate x to the π_k that minimizes D_k²(x) − 2 ln p_k).

Example 11.11

• Data on the undergraduate GPA and GMAT scores of n = 85 applicants to a graduate business program are given in Table 11.6.
• These applicants had been grouped into three populations: π_1 = Admit, π_2 = Not Admit, π_3 = Borderline.
• A new applicant with GPA = 3.21 and GMAT = 497 has applied for admission. Should she be admitted, not admitted, or placed in the borderline ("maybe") category?
• A plot of the training sample is shown below.
• Using the inverse of the pooled covariance matrix and the sample means from the output, we can compute the sample squared distances from x_0 to each of the three groups:

      D_1²(x_0) = (x_0 − x̄_1)′ S_pool^{-1} (x_0 − x̄_1)
                = [ −0.19  −64.23 ] [ 28.61   0.016  ] [  −0.19 ]
                                    [ 0.016   0.0003 ] [ −64.23 ]   = 2.58

      D_2²(x_0) = (x_0 − x̄_2)′ S_pool^{-1} (x_0 − x̄_2) = 17.10

      D_3²(x_0) = (x_0 − x̄_3)′ S_pool^{-1} (x_0 − x̄_3) = 2.47

• Since D_3²(x_0) is the smallest, we classify the applicant into the borderline group (a short R sketch of this calculation follows).
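The distance calculation above can be sketched in R as follows. This is a minimal sketch, not the posted course code: it assumes gmat.dat has columns GPA, GMAT and a group label admit (1 = Admit, 2 = Not Admit, 3 = Borderline); those column names are an assumption and should be adjusted to the posted file.

    # Pooled covariance and squared distances from x0 to each group mean
    gmat <- read.table("gmat.dat", header = TRUE)   # assumed layout, see note above
    X    <- as.matrix(gmat[, c("GPA", "GMAT")])
    grp  <- factor(gmat$admit)

    # Pooled covariance matrix: S_pool = sum_i (n_i - 1) S_i / (n - g)
    ni    <- table(grp)
    Si    <- lapply(levels(grp), function(g) cov(X[grp == g, , drop = FALSE]))
    Spool <- Reduce(`+`, Map(`*`, Si, as.numeric(ni) - 1)) / (nrow(X) - length(ni))

    # Squared distances D_i^2(x0) from the new applicant to each group mean
    x0   <- c(GPA = 3.21, GMAT = 497)
    xbar <- lapply(levels(grp), function(g) colMeans(X[grp == g, , drop = FALSE]))
    D2   <- sapply(xbar, function(m) drop(t(x0 - m) %*% solve(Spool) %*% (x0 - m)))
    names(D2) <- levels(grp)
    D2   # the smallest squared distance should place the applicant in the borderline group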
Example 11.11 (cont'd)

• SAS code is posted as gmat.sas.
• R code is posted as gmat.R.
• The data are posted as gmat.dat.

Example 11.11 (Linear Discriminants)

[Figure: the training sample plotted on the first two linear discriminants (LD1 on the horizontal axis, LD2 on the vertical axis), with each point labelled by its group: 1 = Admit, 2 = Not Admit, 3 = Borderline.]

Example 11.11 (QDA vs. LDA)

[Figure: classification of the training sample under quadratic vs. linear discriminant analysis, plotted on gmat$GPA (horizontal axis) versus gmat$GMAT (vertical axis), with points labelled by group.]
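A minimal R sketch of the LDA and QDA fits behind these plots is given below. The posted gmat.R is the authoritative version; this sketch assumes the same gmat.dat layout as before (columns GPA, GMAT, admit) and uses equal priors, matching the distance-based rule applied earlier in the example.

    library(MASS)
    gmat <- read.table("gmat.dat", header = TRUE)   # assumed layout, see note above
    gmat$admit <- factor(gmat$admit)

    # Equal priors; lda() pools the covariances, qda() estimates them separately
    ld <- lda(admit ~ GPA + GMAT, data = gmat, prior = rep(1/3, 3))
    qd <- qda(admit ~ GPA + GMAT, data = gmat, prior = rep(1/3, 3))

    # Training sample on the two linear discriminants (cf. the LD1/LD2 plot)
    plot(ld)

    # Classify the new applicant under both rules
    newx <- data.frame(GPA = 3.21, GMAT = 497)
    predict(ld, newx)$class
    predict(qd, newx)$class

    # Apparent (resubstitution) error rates of the two rules
    mean(predict(ld)$class != gmat$admit)
    mean(predict(qd)$class != gmat$admit)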