Example: Discriminating between loan applicants

• Data on eight financial variables for 68 farmers who have borrowed money from a bank.
• Based on repayment history, each farmer was classified as a "good" or a "bad" customer. The training samples consist of 34 good customers and 34 bad customers.
• Construct a classification rule to classify new applicants as potential "good" or "bad" customers.
• We first ask SAS/R to test for equality of the covariance matrices, which indicates whether we should use a linear (if the covariance matrices are homogeneous) or quadratic (if they are not) classification rule; see the R sketch below.
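
A minimal R sketch of this workflow, assuming a hypothetical data frame loan whose first column, status, holds the good/bad label, with the eight financial variables in the remaining columns; boxM() is from the heplots package:

    library(heplots)  # boxM(): Box's M test for equal covariance matrices
    library(MASS)     # lda() and qda()

    boxM(loan[, -1], loan$status)          # small p-value -> covariances differ
    fit <- lda(status ~ ., data = loan)    # linear rule (homogeneous covariances)
    # fit <- qda(status ~ ., data = loan)  # quadratic rule (heterogeneous case)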
Classification with Several Populations
• We now consider the case of g > 2 populations.
• While extension of the classification rules to this case is
straightforward, accurately assessing the error rates of the
sample classification functions is not trivial.
• As in the g = 2 case, we first develop the theoretically optimal rules and then see how they need to be modified in
applications.
Classification with Several Populations
• We let fi(x), i = 1, ..., g, denote the density associated with population πi.
• Further, pi is the prior probability of πi, and c(k|i), k = 1, ..., g, is the cost of allocating an item to πk when it belongs in πi. For k = i, c(i|i) = 0.
• Rk is the set of x's classified as πk, and

  P(k|i) = ∫_{Rk} fi(x) dx,    P(i|i) = 1 − Σ_{k=1, k≠i}^g P(k|i).
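
For intuition, here is a small R sketch that evaluates P(k|i) numerically for three univariate normal populations with made-up parameters, taking Rk to be the region where pk fk(x) is largest:

    mu <- c(-2, 0, 2); s <- c(1, 1, 1); p <- c(0.3, 0.4, 0.3)  # hypothetical
    # Rk = set of x where population k maximizes p_k * f_k(x)
    region <- function(x) which.max(p * dnorm(x, mu, s))
    # P(k|i): probability that an item from population i falls in Rk
    Pki <- function(k, i)
      integrate(function(x) dnorm(x, mu[i], s[i]) * (sapply(x, region) == k),
                -10, 10)$value  # the densities are negligible outside (-10, 10)
    Pki(2, 1)  # chance an item from population 1 is allocated to population 2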
Classification with Several Populations
• The conditional expected cost of misclassifying x from π1
into any of the other g − 1 groups is
  ECM(1) = P(2|1)c(2|1) + P(3|1)c(3|1) + · · · + P(g|1)c(g|1)
         = Σ_{k=2}^g P(k|1)c(k|1).

• ECM(1) is incurred with prior probability p1.
Classification with Several Populations
• If we obtain ECM(2), ECM(3), ..., ECM(g) in a similar manner, then the overall ECM is

  ECM = p1 ECM(1) + p2 ECM(2) + · · · + pg ECM(g)
      = Σ_{i=1}^g pi [ Σ_{k=1, k≠i}^g P(k|i)c(k|i) ].
• The optimal classification rule consists in choosing R1, R2, ..., Rg that minimize the ECM.
Classification with Several Populations
• It can be shown that the classification regions that minimize the ECM are defined by allocating x to the population πk for which

  Σ_{i=1, i≠k}^g pi fi(x) c(k|i)

is smallest. When all the costs of misclassification are equal, the minimum ECM rule allocates x to the πk for which Σ_{i=1, i≠k}^g pi fi(x) is smallest.
• Since Σ_{i=1, i≠k}^g pi fi(x) is smallest when the omitted term pk fk(x) is largest, for equal costs the optimal rule allocates x0 to πk when

  pk fk(x0) > pi fi(x0)    for all i ≠ k.
Classification with Several Populations
• Note that this rule is identical to the Bayes rule that maximizes the posterior probability

  P(πk|x) = pk fk(x) / Σ_{i=1}^g pi fi(x)
          = (prior × likelihood) / Σ [prior × likelihood],    k = 1, 2, ..., g.

• In practice, we need to know the g densities, costs, and prior probabilities.
Example with g = 3 Populations
                   Classify as π1    Classify as π2    Classify as π3    Prior        Density at x0
True pop. π1       c(1|1) = 0        c(2|1) = 10       c(3|1) = 50       p1 = 0.05    f1(x0) = 0.01
True pop. π2       c(1|2) = 500      c(2|2) = 0        c(3|2) = 200      p2 = 0.60    f2(x0) = 0.85
True pop. π3       c(1|3) = 100      c(2|3) = 50       c(3|3) = 0        p3 = 0.35    f3(x0) = 2
Example with g = 3 Populations
• We first compute the three values of Σ_{i=1, i≠k}^g pi fi(x0) c(k|i):

  k = 1:  p2 f2(x0)c(1|2) + p3 f3(x0)c(1|3) = 0.60 × 0.85 × 500 + 0.35 × 2 × 100 = 325
  k = 2:  p1 f1(x0)c(2|1) + p3 f3(x0)c(2|3) = 0.05 × 0.01 × 10 + 0.35 × 2 × 50 = 35.005
  k = 3:  p1 f1(x0)c(3|1) + p2 f2(x0)c(3|2) = 0.05 × 0.01 × 50 + 0.60 × 0.85 × 200 = 102.025

• Since the smallest value is achieved for k = 2, we allocate x0 to π2.
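
These sums are easy to verify in R; a short sketch, with the cost matrix rows indexed by the true population i and columns by the assigned population k:

    p <- c(0.05, 0.60, 0.35)        # priors
    f <- c(0.01, 0.85, 2.00)        # densities at x0
    cost <- rbind(c(  0, 10,  50),  # c(k|1)
                  c(500,  0, 200),  # c(k|2)
                  c(100, 50,   0))  # c(k|3)
    # sum over i != k of p_i f_i(x0) c(k|i); the zero diagonal drops i = k
    ecm <- sapply(1:3, function(k) sum(p * f * cost[, k]))
    ecm             # 325.000  35.005 102.025
    which.min(ecm)  # 2: allocate x0 to population 2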
Example with g = 3 Populations
• If all costs of misclassification were equal, we would just look at the products pi fi(x0):

  p1 f1(x0) = 0.05 × 0.01 = 0.0005
  p2 f2(x0) = 0.60 × 0.85 = 0.51
  p3 f3(x0) = 0.35 × 2 = 0.70.

• In this case, we would allocate x0 to π3.
Example with g = 3 Populations
• Finally, if we use the Bayes criterion, we first compute

  Σ_{i=1}^g pi fi(x0) = 0.05 × 0.01 + 0.60 × 0.85 + 0.35 × 2 = 1.2105

and then compute the three posterior probabilities as

  P(π1|x0) = p1 f1(x0) / 1.2105 = 0.0005 / 1.2105 = 0.0004
  P(π2|x0) = p2 f2(x0) / 1.2105 = 0.51 / 1.2105 = 0.421
  P(π3|x0) = p3 f3(x0) / 1.2105 = 0.70 / 1.2105 = 0.578.
• x0 is allocated to π3, the population with highest posterior
probability.
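
Continuing the R sketch above, the posterior probabilities take one line:

    post <- p * f / sum(p * f)  # Bayes posterior probabilities
    round(post, 4)              # 0.0004 0.4213 0.5783
    which.max(post)             # 3: allocate x0 to population 3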
Classification with Normal Populations
• If fi(x) = Np(µi, Σi) and the costs c(k|i) are equal, the minimum ECM (or TPM) rule is: allocate x to πk if

  ln pk fk(x) = ln pk − (p/2) ln(2π) − (1/2) ln|Σk| − (1/2)(x − µk)′Σk⁻¹(x − µk)
              = max_i ln pi fi(x).

• The constant (p/2) ln(2π) is the same for all i. We ignore it and define the quadratic discriminant score for the ith population:

  d_i^Q(x) = −(1/2) ln|Σi| − (1/2)(x − µi)′Σi⁻¹(x − µi) + ln pi.

• We allocate x to πk if d_k^Q(x) is the largest; see the sketch below.
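
A direct R translation of the quadratic score (a sketch: mu is a mean vector, Sigma a covariance matrix, prior a scalar prior probability):

    dQ <- function(x, mu, Sigma, prior) {
      # mahalanobis() returns (x - mu)' Sigma^{-1} (x - mu)
      -0.5 * log(det(Sigma)) - 0.5 * mahalanobis(x, mu, Sigma) + log(prior)
    }
    # compute dQ for each population and allocate x to the largest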
Estimated Minimum TPM Rule
• In practice, we need to plug in the estimates x̄i, Si for µi, Σi. In this case, the estimated minimum TPM rule consists in allocating x to πk if

  d̂_k^Q(x) = −(1/2) ln|Sk| − (1/2)(x − x̄k)′Sk⁻¹(x − x̄k) + ln pk

is largest among all g sample quadratic scores.
• If the population covariance matrices are equal, Σ1 = · · · = Σg = Σ, the score simplifies into a linear score because the first two terms in

  d_i^Q(x) = −(1/2) ln|Σ| − (1/2) x′Σ⁻¹x + µi′Σ⁻¹x − (1/2) µi′Σ⁻¹µi + ln pi

are the same for all i.
Estimated Minimum TPM Rule
• In the case of equal population covariance matrices and normal populations, we define the linear sample discriminant score as

  d̂i(x) = x̄i′S_pool⁻¹ x − (1/2) x̄i′S_pool⁻¹ x̄i + ln pi,    i = 1, ..., g,

and allocate x to πk if d̂k(x) is the largest among the g linear discriminant scores; a sketch in R follows.
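
The corresponding R sketch for the linear score (xbar is a group mean vector, Spool the pooled covariance matrix):

    dL <- function(x, xbar, Spool, prior) {
      w <- solve(Spool, xbar)  # Spool^{-1} xbar
      sum(w * x) - 0.5 * sum(w * xbar) + log(prior)
    }
    # compute dL for each group and allocate x to the largest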
Estimated Minimum TPM Rule
• An alternative rule in the equal-covariance case is derived from d_i^Q(x) by ignoring the constant term in |Σ|. If we do so and plug in sample estimates for µi, Σ, then

  d̃i(x) = −(1/2)(x − x̄i)′S_pool⁻¹(x − x̄i) + ln pi = −(1/2) Di²(x) + ln pi,

so we allocate x to the "closest" population, with the squared distance Di²(x) offset by the prior through ln pi.
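
In R, this penalized-distance form is one line on top of mahalanobis():

    dTilde <- function(x, xbar, Spool, prior)
      -0.5 * mahalanobis(x, xbar, Spool) + log(prior)  # -(1/2) D_i^2 + ln p_i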
Example 11.11
• Data on the undergraduate GPA and GMAT of n = 85 applicants to a graduate business program are given in Table 11.6.
• These applicants had been grouped into three populations: π1 = Admit, π2 = Not Admit, π3 = Borderline.
• A new applicant, with GPA = 3.21 and GMAT = 497, has applied for admission. Should she be admitted, not admitted, or placed in the maybe category?
• A plot of the training sample is shown below.
Example 11.11 (cont’d)
• Using the inverse of the pooled covariance matrix and the sample means from the output, we can compute the sample squared distances from x0 to each of the three groups:

  D1²(x0) = (x0 − x̄1)′S_pool⁻¹(x0 − x̄1)
          = [−0.19  −64.23] [28.61  0.016; 0.016  0.0003] [−0.19  −64.23]′
          = 2.58
  D2²(x0) = (x0 − x̄2)′S_pool⁻¹(x0 − x̄2) = 17.10
  D3²(x0) = (x0 − x̄3)′S_pool⁻¹(x0 − x̄3) = 2.47.

• Since D3²(x0) is smallest, we classify the applicant in the borderline group.
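
In R, mahalanobis() computes these squared distances directly. A sketch, assuming the group mean vectors xbar1, xbar2, xbar3 and the pooled covariance matrix Spool have already been computed from the training sample:

    x0 <- c(3.21, 497)  # new applicant: (GPA, GMAT)
    D2 <- c(mahalanobis(x0, xbar1, Spool),
            mahalanobis(x0, xbar2, Spool),
            mahalanobis(x0, xbar3, Spool))  # approx. 2.58, 17.10, 2.47
    which.min(D2)                           # 3: borderline group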
Example 11.11 (cont’d)
• SAS code is posted as gmat.sas
• R code is posted as gmat.R
• The data are posted as gmat.dat
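
The posted gmat.R is authoritative; a minimal sketch of the analysis (the grouping column name, group, is assumed):

    library(MASS)
    gmat <- read.table("gmat.dat", header = TRUE)  # header row assumed
    fit <- lda(group ~ GPA + GMAT, data = gmat, prior = c(1, 1, 1) / 3)
    predict(fit, newdata = data.frame(GPA = 3.21, GMAT = 497))$class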
Example 11.11 (Linear Discriminants)

[Figure: the training sample plotted on the first two linear discriminants, LD1 (horizontal) versus LD2 (vertical), with each point labeled by its group: 1 = Admit, 2 = Not Admit, 3 = Borderline.]
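A plot along these lines can be produced from the fitted lda object in the sketch above:

    ld <- predict(fit)$x  # scores on the two linear discriminants
    plot(ld[, 1], ld[, 2], type = "n", xlab = "LD1", ylab = "LD2")
    text(ld[, 1], ld[, 2], labels = gmat$group)  # label points by group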
Example 11.11 (QDA vs. LDA)

[Figure: two scatter plots of gmat$GMAT (roughly 300–700) against gmat$GPA (roughly 2.5–3.5), with points labeled by group (1, 2, 3), comparing the classifications produced by the quadratic and linear rules.]
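
The two panels can be reproduced by predicting over a grid of (GPA, GMAT) values; a sketch for the quadratic rule (swap in the lda fit for the linear panel):

    qfit <- qda(group ~ GPA + GMAT, data = gmat)
    grid <- expand.grid(GPA  = seq(2.0, 4.0, length.out = 200),
                        GMAT = seq(300, 700, length.out = 200))
    cls <- predict(qfit, grid)$class  # predicted class at every grid point
    plot(grid$GPA, grid$GMAT, col = as.integer(cls), pch = ".",
         xlab = "GPA", ylab = "GMAT")
    points(gmat$GPA, gmat$GMAT, pch = as.character(gmat$group))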