UIC Epi/Bio Division, SPH / Applied Multivariate Statistical Analysis / Sclove / Notes on JW Ch. 11

____________________________________________________________________________________________________
University of Illinois at Chicago
School of Public Health
Division of Epidemiology & Biostatistics
____________________________________________________________________________________________________
BSTT 580      Applied Multivariate Statistical Analysis
Instructor    Prof. Stanley L. Sclove
Textbook      Johnson and Wichern, 4/e (JW)
____________________________________________________________________________________________________

Notes on JW Ch. 11: Classification and Discrimination
_____________________________________________________________

CONTENTS
11.1.  Introduction                                                   515
11.2.  Separation and Classification for Two Populations              630
11.3.  Classification with Two Multivariate Normal Populations        639
11.4.  Evaluating Classification Functions                            649
11.5.  Fisher's Discriminant Function -- Separation of Populations    661
11.6.  Classification with Several Populations                        665
11.7.  Fisher's Method for Discriminating among Several Populations   683
11.8.  Final Comments                                                 697
       Introducing Qualitative Variables (Logistic Regression)
       Classification and Regression Trees
       Neural Nets

11.1. Introduction

Examples.

(i) Tax returns are to be assigned for audit or passed over, according to the value of a variate of which high values indicate audit potential; the IRS called the corresponding score GAP, for "Greatest Audit Potential."
(ii) Mortgage loan applicants are to be granted a loan or not, according to the value of a variate formed from items on the loan application (age, income, price of house, education, job seniority).

(iii) Charge account applicants were granted or denied charge accounts on the basis of the expert judgment of the store's credit officers. Their decisions are now to be combined with the data on the applications to develop a variate, high values of which correspond to granting charge privileges. Note that in this example the two groups are defined by expert judgment, rather than by hard data. A variable used in developing the statistical credit-scoring index may or may not have been used by the experts.

The data consist of a group-label variable Y, which is (0,1) in the case of two groups, and a vector X of variables X1, X2, ..., Xp. We classify an individual having values x into the group indicated by Y = 1 if P(Y=1 | X=x) is sufficiently large. The vector X can contain both metric (numerical) and nonmetric (categorical) variables. When it contains only metric variables, and when the within-group distributions are multivariate normal, there is a special method for classification, "Discriminant Analysis."

What are Discriminant Analysis and Logistic Regression? Discriminant Analysis is best viewed as a special case of classification, so we begin with a discussion of Classification in general. We start with an example of Classification in terms of categorized variables.

In the Land of Sameness, 80% of the people are Creditworthy and 20% are Defaulters. Below are the joint distributions of monthly income and monthly financial obligations in the two groups, from past data.

TABLE.
Joint Distribution of Monthly Income and Monthly Financial Obligations for Creditworthy Individuals and Defaulters

Creditworthy
                          INCOME ($K)
                       1     2     3     4 |
              _____________________________|_____
            1 |      .04   .08   .12   .16 |  .40
OBLIGATIONS 2 |      .03   .06   .09   .12 |  .30
($K)        3 |      .02   .04   .06   .08 |  .20
            4 |      .01   .02   .03   .04 |  .10
              _____________________________|_____
                     .10   .20   .30   .40 | 1.00

Defaulters
                          INCOME ($K)
                       1     2     3     4 |
              _____________________________|_____
            1 |      .04   .03   .02   .01 |  .10
OBLIGATIONS 2 |      .02   .04   .06   .08 |  .20
($K)        3 |      .03   .06   .12   .09 |  .30
            4 |      .01   .07   .20   .12 |  .40
              _____________________________|_____
                     .10   .20   .40   .30 | 1.00

Consider the rule which grants credit to those who have an Income of $3K or more and Obligations of $3K or less (i.e., Obligations of less than $4K).

Q. What proportion of the Creditworthy will be granted credit by this rule?

A. Let I = Income and O = Obligations. Let C = Creditworthy and D = Defaulter.
P(I >= 3, O < 4 | C) = .12 + .16 + .09 + .12 + .06 + .08 = .63

Q. What proportion of the Defaulters will be denied credit by this rule?

A. P(I >= 3, O < 4 | D) = .02 + .01 + .06 + .08 + .12 + .09 = .38
P(Denied | D) = 1 - .38 = .62

Q. What is the error rate in classifying the Creditworthy?

A. Error rate for Creditworthy = P(Denied | C) = 1 - .63 = .37

Q. What is the error rate in classifying Defaulters?

A. Error rate for Defaulters = P(Granted | D) = .38

Q. What is the overall error rate of this rule?

A. Overall error rate = (Error rate for Creditworthy) x P(C) + (Error rate for Defaulters) x P(D)
   = .37 x .80 + .38 x .20 = .296 + .076 = .372

Q. What is the posterior probability that someone with Income = 4 and Obligations = 1 is from the Creditworthy population?

A.
P(C | I=4, O=1) = P(C and I=4, O=1) / P(I=4, O=1)

P(C and I=4, O=1) = P(I=4, O=1 | C) x P(C) = .16 x .80 = .128

P(I=4, O=1) = P(I=4, O=1 | C) P(C) + P(I=4, O=1 | D) P(D)
            = .16 x .80 + .01 x .20 = .128 + .002 = .130

P(C | I=4, O=1) = .128/.130 = 128/130, or about .985

11.2. Separation and Classification for Two Populations  630

The two populations are denoted by π1 and π2; the probability density functions, by f1(x) and f2(x). Of primary importance is the "likelihood ratio" ("density ratio"), f1(x)/f2(x). The best candidates for π1 are those x for which it is large, that is, f1(x)/f2(x) > c. In terms of prior probabilities p1 and p2 and misclassification costs c(1|2) and c(2|1), the value of the cut-off point c is

    c = [c(1|2) p2] / [c(2|1) p1].

11.3. Classification with Two Multivariate Normal Populations  639

When the distributions are multinormal with a common covariance matrix, the natural log of the density ratio is linear in x. This linear function is called the linear discriminant function (LDF). When the distributions are multinormal with different covariance matrices, the natural log of the density ratio is quadratic in the elements of x; the discriminant function is quadratic.

Example. A Two-Group Discriminant Analysis: Purchasers versus Nonpurchasers (Hair et al.)

Here is output from a discriminant analysis. Note that "Buy?" is the binary dependent variable, indicating the group. Note also that the variable Style was not included, since its t statistic was not significant.
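As a quick check of the Land-of-Sameness arithmetic worked out above (and of the Bayes reasoning behind Section 11.2), the error rates and the posterior probability can be reproduced with a short script. This is only a sketch: the joint tables and priors are hard-coded from the text, and the variable names are mine.

```python
# Joint pmfs of (Obligations, Income), rows O = 1..4, columns I = 1..4,
# copied from the Land-of-Sameness tables above.
C = [[.04, .08, .12, .16],
     [.03, .06, .09, .12],
     [.02, .04, .06, .08],
     [.01, .02, .03, .04]]          # Creditworthy
D = [[.04, .03, .02, .01],
     [.02, .04, .06, .08],
     [.03, .06, .12, .09],
     [.01, .07, .20, .12]]          # Defaulters
pC, pD = .80, .20                   # prior probabilities

# Rule: grant credit iff Income >= 3 and Obligations <= 3
grant = [(o, i) for o in range(4) for i in range(4) if i + 1 >= 3 and o + 1 <= 3]
p_grant_C = sum(C[o][i] for o, i in grant)           # P(Granted | C) = .63
p_grant_D = sum(D[o][i] for o, i in grant)           # P(Granted | D) = .38
overall_err = (1 - p_grant_C) * pC + p_grant_D * pD  # .37(.80) + .38(.20) = .372

# Posterior P(C | I=4, O=1) by Bayes' theorem: .128/.130
num = C[0][3] * pC
post = num / (num + D[0][3] * pD)

print(round(p_grant_C, 2), round(p_grant_D, 2),
      round(overall_err, 3), round(post, 3))         # 0.63 0.38 0.372 0.985
```

The same pattern works for any rule defined as a set of (O, I) cells: only the `grant` set changes.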
MTB > info
COLUMN   C1     C2        C3       C4     C5
NAME     Case   Durablty  Perform  Style  Buy?
COUNT    10     10        10       10     10

MTB > DISCriminant analysis for labels in C5, data in C2-C3

TABLE. Classification Functions

Group:       0        1
---------------------------
Constant  -6.170  -25.619
Durablty   1.823    5.309
Perform    1.479    2.466

Q. What are the values of the two classification functions for an individual giving the food mixer a Durability rating of 4 and a Performance rating of 6?

A. For j = 0, 1, let C(j|d,p) denote the value of the classification function for Group j, given Durability = d and Performance = p.

For Group 0: C(0|4,6) = -6.170 + 1.823(4) + 1.479(6) = +9.996
For Group 1: C(1|4,6) = -25.619 + 5.309(4) + 2.466(6) = +6.413

Q. Classify this individual.

A. Since C(0|4,6) > C(1|4,6), we classify this individual as 0: that is, we predict that this individual will not buy.

Q. What is this person's posterior probability of membership in the "Buy" group?

A. For j = 0, 1, let p(j|d,p) denote the posterior probability of membership in Group j, given that Durability = d and Performance = p. Math shows that the classification functions are equal to the log posterior probabilities, except for a constant which does not depend on the group. That is, C(j|d,p) = ln[p(j|d,p)] + k. This gives p(j|d,p) = k' exp[C(j|d,p)]:

p(0|4,6) = k' exp(+9.996),   p(1|4,6) = k' exp(+6.413).

The sum of the two is 1. Hence k' = 1/[exp(+9.996) + exp(+6.413)], and

p(1|4,6) = exp(+6.413)/[exp(+6.413) + exp(+9.996)]
         = 1/[1 + exp(9.996 - 6.413)]
         = 1/[1 + exp(+3.583)] = .0270.

Also, p(0|4,6) = .9730.

11.4. Evaluating Classification Functions  649

The classification rule must be evaluated on data other than those used to train the procedure. There are several ways to do this.
One is to train the procedure on part (say, half) of the data and use the other part for evaluation. The confusion matrix is a good way to display the numbers of correct and incorrect classifications.

One sort of classification is the use of a test for the presence of a disease. Let D denote the event that the disease is actually present and A the event that it is actually absent. Let d denote the event that the test says the disease is present and a the event that the test says it is not. This gives a two-by-two table.

                      PREDICTED MEMBERSHIP
                        d      a      Total
ACTUAL      1: D       n1C    n1M      n1
MEMBERSHIP  2: A       n2M    n2C      n2
            Total      nd     na       n

In the notation for the cell frequencies, the subscript C means correct classification and M means misclassification. In connection with the table, there are eight conditional probabilities which are of interest:

- The error rate for the diseased (D's) (rate of false negatives) is the estimate of P(a|D), or n1M/n1.
- The sensitivity of the test is the estimate of P(d|D), or n1C/n1.
- The error rate for the non-diseased (A's) (rate of false positives) is the estimate of P(d|A), or n2M/n2.
- The specificity of the test is the estimate of P(a|A), or n2C/n2.
- The false alarm rate is the estimate of P(A|d), or n2M/nd.
- The predictive value of a positive test is the estimate of P(D|d), or n1C/nd.
- The rate of diseased among negatives is the estimate of P(D|a), or n1M/na.
- The predictive value of a negative test is the estimate of P(A|a), or n2C/na.

Unless a procedure like the leave-one-out procedure or cross-validation has been used, the error rates are over-optimistically low.
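The split-sample idea and the confusion-matrix rates above can be sketched as follows. This is a minimal illustration, not from the text: the LDF of Section 11.3 is estimated on half of some synthetic two-group normal data (the means, covariance matrix, and sample sizes are invented for the example), and sensitivity and specificity are then estimated on the held-out half.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical values): two groups, common covariance matrix
n = 200
cov = [[1.0, 0.3], [0.3, 1.0]]
X1 = rng.multivariate_normal([0.0, 0.0], cov, n)   # group 1 ("diseased", D)
X2 = rng.multivariate_normal([2.0, 1.0], cov, n)   # group 2 ("non-diseased", A)

# Train on the first half of each group; hold out the second half
X1_tr, X1_te = X1[:n // 2], X1[n // 2:]
X2_tr, X2_te = X2[:n // 2], X2[n // 2:]

# Sample LDF: a = S_pooled^(-1)(xbar1 - xbar2); with equal priors and costs,
# classify to group 1 when a'x > a'(xbar1 + xbar2)/2
xbar1, xbar2 = X1_tr.mean(axis=0), X2_tr.mean(axis=0)
S_pool = (np.cov(X1_tr.T) + np.cov(X2_tr.T)) / 2   # equal training sizes
a = np.linalg.solve(S_pool, xbar1 - xbar2)
m = a @ (xbar1 + xbar2) / 2

# Confusion counts on the held-out half
n1C = int(np.sum(X1_te @ a > m))    # group-1 cases called group 1
n2C = int(np.sum(X2_te @ a <= m))   # group-2 cases called group 2
sensitivity = n1C / len(X1_te)      # estimate of P(d | D)
specificity = n2C / len(X2_te)      # estimate of P(a | A)
print(sensitivity, specificity)
```

Because these rates come from observations the LDF never saw, they do not suffer the over-optimism just described.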
When such a procedure has been used, the rates based on n1 and n2 are unbiased even if the ni were not sampled randomly from the population; but the correctness of the rates based on nd and na depends upon random sampling, and in the absence of random sampling they must be adjusted to the overall rates P(D) and P(A).

11.5. Fisher's Discriminant Function -- Separation of Populations  661

R. A. Fisher arrived at what is now called the LDF by another method. Consider any variate a'X. There is a value of the two-sample t statistic corresponding to that variate. Write it in terms of a, and maximize over a. The result is the same as the LDF.

11.6. Classification with Several Populations  665

A reasonable optimality criterion is the expected cost of misclassification. It is minimized by classifying x into that population for which the conditional expected cost of misclassification, given x, is a minimum. With equal costs of misclassification, this reduces to classification by maximum posterior probability. Ignoring terms which are constant across all populations, the log posterior probability is linear in x if the distributions are multinormal with a common covariance matrix, and quadratic if the covariance matrices differ.

11.7. Fisher's Method for Discriminating among Several Populations  683

R. A. Fisher's method is: Consider any variate a'X. There is a value of the F statistic corresponding to that variate. Write it in terms of a, and maximize over a. The maximizing a's are eigenvectors of W^(-1)B, where B is the between-groups sum-of-products matrix and W is the within-groups sum-of-products matrix. We need at most min{p, g-1} such variates, where g is the number of populations.

11.8.
Final Comments  697

Including Qualitative Variables
Classification Trees
Neural Networks

For an expansion of the material in Section 11.8, please link to "Addendum to JW Section 11.8."

________________________________________________________________
Copyright © 2000 Stanley Louis Sclove
Created: 5 October 1999
Updated: 29 Oct 2000
________________________________________________________________