UIC Epi/Bio Division, SPH / Applied Multivariate Statistical Analysis / Sclove / Notes on JW Ch. 11
University of Illinois at Chicago
School of Public Health
Division of Epidemiology & Biostatistics
____________________________________________________________________________________________________
BSTT 580: Applied Multivariate Statistical Analysis
Instructor: Prof. Stanley L. Sclove
Textbook: Johnson and Wichern, 4/e (JW)
____________________________________________________________________________________________________
Notes on JW Ch. 11: Classification and Discrimination
_____________________________________________________________
CONTENTS
11.1. Introduction 515
11.2. Separation and Classification for Two Populations 630
11.3. Classification with Two Multivariate Normal Populations 639
11.4. Evaluating Classification Functions 649
11.5. Fisher's Discriminant Function -- Separation of Populations 661
11.6. Classification with Several Populations 665
11.7. Fisher's Method for Discriminating among Several Populations 683
11.8. Final Comments 697
Introducing Qualitative Variables (Logistic Regression)
Classification and Regression Trees
Neural Nets
11.1. Introduction
Examples.
(i) Tax returns are to be assigned for audit or passed over, according to the
value of a variate of which high values indicate audit potential; the IRS called
the corresponding score GAP, for "Greatest Audit Potential."
(ii) Mortgage loan applicants are to be granted a loan according to the
value of a variate formed from items on the loan application (age, income,
price of house, education, job seniority).
(iii) Charge account applicants were granted or denied charge accounts on
the basis of the expert judgment of the store's credit officers. Their decisions
are now to be combined with the data on the applications to develop a variate,
high values of which correspond to granting charge privileges. Note that in
this example the two groups are defined by expert judgment, rather than
hard data. A variable used in developing the statistical credit-scoring index
may or may not have been used by the experts.

The data consist of a group-label variable Y, which is (0,1) in the case of two
groups, and a vector X of variables X1, X2, . . ., Xp. We classify an individual
having values x into the group indicated by Y=1 if P(Y=1|X=x) is sufficiently
large. The vector X can contain both metric (numerical) and nonmetric
(categorical) variables. When it contains only metric variables, and when the
within-group distributions are multivariate normal, there is a special method
for classification, "Discriminant Analysis".
What are Discriminant Analysis and Logistic Regression?
Discriminant Analysis is best viewed as a special case of classification,
so we begin with a discussion of Classification in general. We start with an
example of Classification in terms of categorized variables.
In the Land of Sameness, 80% of the people are Creditworthy and 20% are
Defaulters. Below are the joint distributions of monthly income and monthly
financial obligations in the two groups, from past data.
TABLE. Joint Distribution of Monthly Income and Monthly Financial
Obligations for Creditworthy Individuals and Defaulters
Creditworthy
                          INCOME ($K)
                       1     2     3     4   | Total
                    ---------------------------------
OBLIGATIONS    1   |  .04   .08   .12   .16  |  .40
   ($K)        2   |  .03   .06   .09   .12  |  .30
               3   |  .02   .04   .06   .08  |  .20
               4   |  .01   .02   .03   .04  |  .10
                    ---------------------------------
             Total |  .10   .20   .30   .40  | 1.00

Defaulters
                          INCOME ($K)
                       1     2     3     4   | Total
                    ---------------------------------
OBLIGATIONS    1   |  .04   .03   .02   .01  |  .10
   ($K)        2   |  .02   .04   .06   .08  |  .20
               3   |  .03   .06   .12   .09  |  .30
               4   |  .01   .07   .20   .12  |  .40
                    ---------------------------------
             Total |  .10   .20   .40   .30  | 1.00
Consider the rule which grants credit to those who have an income of $3K or
more and Obligations of $3K or less (i.e., Obligations of less than $4K).
Q. What proportion of the Creditworthy will be granted credit by this rule?
A. Let I = Income and O = Obligations. Let C = Creditworthy and D =
Defaulter. P(I>=3,O < 4|C) = .12+.16+.09+.12+.06+.08 = .63
Q. What proportion of the Defaulters will be denied credit by this rule?
A. P(I >= 3, O < 4 | D) = .02+.01+.06+.08+.12+.09 = .38, so
P(Denied | D) = 1 - .38 = .62
Q. What is the error rate in classifying the Creditworthy?
A. Error rate for Creditworthy = P(Denied | C) = 1 - .63 = .37
Q. What is the error rate in classifying Defaulters?
A. Error rate for Defaulters = P(Granted| D ) =.38
Q. What is the overall error rate of this rule?
A. Overall error rate = (Err.Rate for Creditworthy)xP(C)+ (Err.Rate for
Defaulters)xP(D) = .37x.80 + .38x.20 = .296 + .076 = .372
Q. What is the posterior probability that someone with Income = 4 and
Obligations = 1 is from the Creditworthy population?
A. P(C|I=4,O=1) = P(C and I=4,O=1)/P(I=4,O=1)
P(C and I=4,O=1) = P(I=4,O=1|C)xP(C) = .16 x .80= .128
P(I=4,O=1) = P(I=4,O=1|C)P(C)+P(I=4,O=1|D)P(D) =.16 x .80 + .01 x .20 =
.128 + .002 = .130
P(C | I=4,O=1) = .128/.130, or about .985
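These worked answers can be verified with a short script; this is a sketch in which the joint tables are transcribed from the example above.

```python
# Verifying the worked answers; the joint tables are transcribed from above.
import numpy as np

# Rows: Obligations 1-4 ($K); columns: Income 1-4 ($K).
creditworthy = np.array([[.04, .08, .12, .16],
                         [.03, .06, .09, .12],
                         [.02, .04, .06, .08],
                         [.01, .02, .03, .04]])
defaulters = np.array([[.04, .03, .02, .01],
                       [.02, .04, .06, .08],
                       [.03, .06, .12, .09],
                       [.01, .07, .20, .12]])
p_C, p_D = 0.80, 0.20   # prior proportions of the two groups

# Rule: grant credit iff Income >= 3 and Obligations <= 3
# (rows 0-2 are Obligations 1-3; columns 2-3 are Income 3-4).
grant_C = creditworthy[:3, 2:].sum()   # P(granted | C)
grant_D = defaulters[:3, 2:].sum()     # P(granted | D)
overall_error = (1 - grant_C) * p_C + grant_D * p_D

# Posterior P(C | Income = 4, Obligations = 1) by Bayes' rule
num = creditworthy[0, 3] * p_C
den = num + defaulters[0, 3] * p_D
posterior_C = num / den
```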
11.2. Separation and Classification for Two Populations 630
The two populations are denoted by π1 and π2; the probability density
functions, by f1(x) and f2(x). Of primary importance is the "likelihood
ratio" ("density ratio"), f1(x)/f2(x). The best candidates for π1 are those x for
which it is large, that is, f1(x)/f2(x) > c. In terms of prior probabilities p1
and p2 and misclassification costs c(1|2) and c(2|1), the value of the cut-off
point c is

    c = [c(1|2)/c(2|1)] x (p2/p1).
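As an illustration only, here is a sketch of this rule for two hypothetical univariate normal densities; the densities, priors, and costs are made up for the example.

```python
# Sketch of the minimum-expected-cost rule for two populations (illustrative).
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Univariate normal density, standing in for f1 and f2."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

def classify(x, f1, f2, p1, p2, c12, c21):
    """Assign x to population 1 iff f1(x)/f2(x) > c,
    with cut-off c = [c(1|2)/c(2|1)] * (p2/p1)."""
    c = (c12 / c21) * (p2 / p1)
    return 1 if f1(x) / f2(x) > c else 2

# Hypothetical example: N(0,1) vs N(3,1), priors .8/.2, equal costs.
f1 = lambda x: normal_pdf(x, 0.0, 1.0)
f2 = lambda x: normal_pdf(x, 3.0, 1.0)
group = classify(1.0, f1, f2, 0.8, 0.2, 1.0, 1.0)   # x = 1 falls on population 1's side
```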
11.3. Classification with Two Multivariate Normal Populations 639

When the distributions are multinormal with a common covariance matrix, the
natural log of the density ratio is linear in x. This linear function is called the
linear discriminant function (LDF).

When the distributions are multinormal with different covariance matrices, the
natural log of the density ratio is quadratic in the elements of x. The
discriminant function is quadratic.
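A minimal sketch of the equal-covariance case, estimating the LDF direction from two simulated samples via the pooled covariance matrix (all data here are simulated; the variable names are illustrative):

```python
# Sketch: sample LDF for two groups with a pooled covariance estimate.
# All data are simulated; means and covariance are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 50, 50
cov = [[1.0, 0.3], [0.3, 1.0]]
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=n1)
X2 = rng.multivariate_normal([2.0, 1.0], cov, size=n2)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_pooled = ((n1 - 1) * np.cov(X1.T) + (n2 - 1) * np.cov(X2.T)) / (n1 + n2 - 2)

# LDF direction: a = S_pooled^{-1} (m1 - m2); classify to group 1 when a'x
# exceeds the midpoint a'(m1 + m2)/2 (equal priors and equal costs).
a = np.linalg.solve(S_pooled, m1 - m2)
midpoint = a @ (m1 + m2) / 2

def classify_ldf(x):
    return 1 if a @ x > midpoint else 2
```

Each sample mean is classified into its own group, since a'(m1 - m2) is positive whenever S_pooled is positive definite.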
Example. A Two-Group Discriminant Analysis: Purchasers versus
Nonpurchasers
(Hair et al.) Here is output from a discriminant analysis. Note that "Buy?" is
the binary dependent variable, indicating the group. Note also that the
variable Style wasn't included, since its t statistic was not significant (N.S.).
MTB > info

COLUMN   NAME       COUNT
C1       Case       10
C2       Durablty   10
C3       Perform    10
C4       Style      10
C5       Buy?       10

MTB > DISCriminant analysis for labels in C5, data in C2-C3

TABLE. Classification Functions

            Group:       0         1
         ----------------------------
Constant             -6.170   -25.619
Durablty              1.823     5.309
Perform               1.479     2.466
Q. What are the values of the two classification functions for an individual
giving the food mixer a Durability rating of 4 and a Performance rating of 6?
A. For j = 0,1, let C(j|d,p) denote the value of the classification function for
Group j, given Durability = d and Performance= p.
For Group 0: C(0|4,6) = -6.170 + 1.823(4)+ 1.479(6) = +9.996
For Group 1: C(1|4,6) = -25.619 + 5.309(4)+ 2.466(6) = +6.413
Q. Classify this individual.
A. Since C(0|4,6) > C(1|4,6), we classify this individual as 0: that is, we
predict that this individual will not buy.
Q. What is this person's posterior probability of membership in the 'Buy'
group?
A. For j = 0,1, let p(j|d,p) denote the posterior probability of membership in
group j, given that Durability = d and Performance = p. Math shows that the
classification functions are equal to the log posterior probabilities, except for a
constant which doesn't depend on the group. That is,
C(j|d,p) = ln[p(j|d,p)] + k.
This gives
p(j|d,p) = k'exp[C(j|d,p)].
p(0|4,6) = k'exp(+9.996), p(1|4,6) = k'exp(+6.413)
The sum of the two is 1. Hence k' = 1/[exp(+9.996) + exp(+6.413)].
p(1|4,6) = exp(+6.413)/[exp(+6.413) + exp(+9.996)] = 1/[1 + exp(9.996 - 6.413)]
= 1/[1 + exp(+3.583)] = .027. Also, p(0|4,6) = .973.
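The posterior computation above is a softmax over the two classification-function values; a minimal check in Python, using the two scores computed for Durability = 4 and Performance = 6:

```python
# Posteriors from classification-function scores: p(j) is proportional to exp[C(j)].
from math import exp

C0, C1 = 9.996, 6.413   # classification-function values from the example
denom = exp(C0) + exp(C1)
p0, p1 = exp(C0) / denom, exp(C1) / denom
# p1 is small, so the individual is classified into group 0 (nonpurchaser).
```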
11.4. Evaluating Classification Functions 649
The classification procedure must be evaluated on data other than those used
to train it. There are several ways to do this. One is to train the procedure on
part (say, half) of the data and use the other part for evaluation.
The confusion matrix is a good way to display the numbers of correct and
incorrect classifications.
One sort of classification is the use of a test for presence of a disease. Let D
denote the event that the disease is actually present and A that it is actually
absent. Let d denote the event that the test says the disease is present and a
the event that the test says it is not. This gives a two-by-two table.
                      PREDICTED MEMBERSHIP
                         d        a      Total
                     --------------------------
ACTUAL       1: D       n1C      n1M      n1
MEMBERSHIP   2: A       n2M      n2C      n2
                     --------------------------
             Total      nd       na       n
In the notation for the cell frequencies, the subscript C means correct
classification and M means misclassification. In connection with the table,
there are eight conditional probabilities which are of interest:

  The error rate for diseased (D's) (rate of false negatives) is the estimate of P(a|D), or n1M/n1.
  The sensitivity of the test is the estimate of P(d|D), or n1C/n1.
  The error rate for non-diseased (A's) (rate of false positives) is the estimate of P(d|A), or n2M/n2.
  The specificity of the test is the estimate of P(a|A), or n2C/n2.
  The false alarm rate is the estimate of P(A|d), or n2M/nd.
  The predictive value of a positive test is the estimate of P(D|d), or n1C/nd.
  The rate of diseased among negatives is the estimate of P(D|a), or n1M/na.
  The predictive value of a negative test is the estimate of P(A|a), or n2C/na.
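With hypothetical cell counts (the numbers here are made up for illustration), the eight rates can be computed directly from the table:

```python
# Computing the eight rates from hypothetical cell counts.
n1C, n1M = 45, 5    # diseased: correctly flagged, missed
n2M, n2C = 10, 40   # non-diseased: false positives, correctly cleared
n1, n2 = n1C + n1M, n2M + n2C
nd, na = n1C + n2M, n1M + n2C

sensitivity  = n1C / n1   # P(d | D)
false_neg    = n1M / n1   # P(a | D), error rate for D's
specificity  = n2C / n2   # P(a | A)
false_pos    = n2M / n2   # P(d | A), error rate for A's
ppv          = n1C / nd   # P(D | d), predictive value of a positive test
false_alarm  = n2M / nd   # P(A | d)
npv          = n2C / na   # P(A | a), predictive value of a negative test
diseased_neg = n1M / na   # P(D | a)
```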
Unless a procedure such as leave-one-out or cross-validation has been used,
the error rates are over-optimistically low. When such a procedure has been
used, the rates based on n1 and n2 are unbiased even if the n's were not
sampled randomly from the population; but the correctness of the rates based
on nd and na depends on random sampling, and in its absence they must be
adjusted to the overall rates P(D) and P(A).
11.5. Fisher's Discriminant Function -- Separation of Populations 661
R. A. Fisher arrived at what is now called the LDF by another method.
Consider any variate a'X. There is a value of the two-sample t statistic
corresponding to that variate. Write it in terms of a, and maximize over a.
The result is the same as the LDF.
11.6. Classification with Several Populations 665
A reasonable optimality criterion is the expected cost of misclassification. It
is minimized by classifying x into the population for which the conditional
expected cost of misclassification, given x, is a minimum. With equal costs of
misclassification, this reduces to classification by maximum posterior
probability. Ignoring terms which are constant across all populations, the log
posterior probability is linear in x if the distributions are multinormal with a
common covariance matrix, and quadratic if the covariance matrices differ.
11.7. Fisher's Method for Discriminating Among Several Populations 683

R. A. Fisher's method is: Consider any variate a'X. There is a value of the F
statistic corresponding to that variate. Write it in terms of a, and maximize
over a. The maximizing a's are eigenvectors of W⁻¹B, where B is a between-
groups sum-of-products matrix and W is the within-groups sum-of-products
matrix. We need at most min{p, g-1} variates, where g is the number of
populations.
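A sketch of this computation for g = 3 simulated groups in p = 2 dimensions; the data are illustrative, and B and W are the between- and within-groups sum-of-products matrices just described.

```python
# Sketch: Fisher's discriminant directions as eigenvectors of W^{-1}B,
# for g = 3 simulated groups in p = 2 dimensions (data are illustrative).
import numpy as np

rng = np.random.default_rng(1)
means = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
groups = [rng.normal(loc=m, scale=1.0, size=(30, 2)) for m in means]

grand_mean = np.vstack(groups).mean(axis=0)
W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
B = sum(len(g) * np.outer(g.mean(axis=0) - grand_mean,
                          g.mean(axis=0) - grand_mean) for g in groups)

# Eigenvectors of W^{-1}B, ordered by decreasing eigenvalue;
# at most min(p, g - 1) = 2 discriminants are needed here.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(W) @ B)
order = np.argsort(eigvals.real)[::-1]
discriminants = eigvecs.real[:, order[:min(2, 3 - 1)]]
```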
11.8. Final Comments 697
Including Qualitative Variables
Classification Trees
Neural Networks
For an expansion of the material in Section 11.8, see the Addendum to
JW Section 11.8.
_______________________________________________________________
Copyright © 2000 Stanley Louis Sclove
Created: 5 October 1999 Updated: 29 Oct 2000
________________________________________________________________________________________________________________