Discrimination and Classification - Introduction

• Main objectives behind discrimination and classification are:
1. Separate distinct sets of objects: discrimination.
2. Classify new objects into well defined populations:
classification.
• An example of discrimination: Given measurements on the
concentrations of five elements in bullet lead, find combinations of those concentrations that best describe bullets made
by Cascade, Federal, Winchester and Remington.
• An example of classification: Using information on prisoners
eligible for parole (good behavior, history of drug use, job
skills, etc.), can we successfully allocate each prisoner eligible
for parole into one of two groups: those who will commit another
crime and those who will not?
Discrimination and Classification
• In practice, the two objectives often overlap:
– A function of the p variables that serves as a discriminant
is also used for classifying a new object into one of the
populations.
– An allocation or classification rule can often serve as a
discriminant.
• The setup is the usual one: p variables are measured on n
sample units or subjects. We wish to find a function of the
variables that will optimize the discrimination between units
belonging to different populations (minimize classification
errors).
Two Populations
• We wish to separate two populations and also allocate new
units to one of the two populations.
• Let π1, π2 denote the two populations and let X = [X1, X2, ... Xp]
denote the p−dimensional measurement vector for a unit.
• We assume that the densities f1(x) and f2(x) describe variability in the two populations.
• For example, individuals who are good credit risks may constitute population 1 (π1) and those who are bad risks may
constitute population 2 (π2). Variables that may be used
to discriminate between populations or to classify individuals
into one of the two populations might include income, age,
number of credit cards, family size, education, and so on.
Two Populations
• To develop a classification rule, we need a training or
learning sample of individuals from each population.
• The training samples can be examined for differences: what
are the characteristics of individuals that best distinguish
between the two populations?
• If R1 and R2 denote the regions defined by the set of
characteristics of objects in each of the training samples,
then when a new object (whose group membership we do
not know) falls into Rj we classify it into πj .
• We can make mistakes: some units may be misclassified.
Two Populations
• When is classification of a new unit necessary?
– When we have incomplete knowledge of future
performance, e.g., student success in graduate school.
– When obtaining perfect information requires destroying
the object, e.g., products meeting specifications.
– When obtaining information is impossible or very expensive. E.g., authorship of the unsigned Federalist papers
(Madison or Hamilton?) or perfectly diagnosing a medical
problem which may only be done by conducting invasive
and expensive surgery.
• The bigger the overlap between the set of characteristics that
distinguish units from π1 and π2, the larger the probability of
a classification error.
Example: Owners of Riding Mowers
• A manufacturer of riding lawn mowers wishes to target an
ad campaign to families who are most likely to purchase one.
• Are income and size of lot enough to discriminate between owners and
non-owners of riding lawn mowers?
• A sample of size n1 = 12 current owners and n2 = 12
current non-owners was obtained and X1 = income and
X2 = size of lot were measured for each.
• Owners tend to have higher income and larger lots, but
income appears to be a better discriminator.
• Note that the two populations, separated by the line, are not
perfectly discriminated, so some classification errors are likely
to occur.
Owners and Non-owners of Riding Mowers
(Figure: scatter plot of income and lot size for the owner and non-owner samples.)
Priors and Misclassification Costs
• In addition to the observed characteristics of units, other
features of classification rules are:
1. Prior probability: If one population is more prevalent than
the other, chances are higher that a new unit came from
the larger population. Stronger evidence would be needed
to allocate the unit to the population with the smaller prior
probability.
2. Costs of misclassification: It may be more costly to
misclassify a seriously ill subject as healthy than to
misclassify a healthy subject as being ill. It may be more
costly for a lender to misclassify a bad credit risk as a
good credit risk and award a loan than to misclassify a
good credit risk as a bad credit risk and deny a loan.
Classification Regions
• Let f1(x) and f2(x) be the probability density functions
associated with the random vector X for populations π1
and π2, respectively.
• Let Ω denote the sample space (collection of all possible
values for x). R1 is the set of values of x for which we
classify objects into π1 and R2 = Ω − R1 is the set of values
of x for which we classify objects into π2.
• Since every object belongs to one of the two populations,
we have that Ω = R1 ∪ R2.
Classification Regions
(Figure: the sample space Ω partitioned into the classification regions R1 and R2.)
Classification Regions
• Ignore for now the prior probabilities of each population and
the potentially different misclassification costs.
• The probability of misclassifying an object into π2 when it
belongs in π1 is
P(2|1) = P(X ∈ R2 | π1) = ∫_{R2 = Ω−R1} f1(x) dx,
and probability of misclassifying an object into π1 when it
belongs in π2 is
P(1|2) = P(X ∈ R1 | π2) = ∫_{R1} f2(x) dx.
• In the next figure, the solid shaded region is P (2|1) and the
striped region is P (1|2).
Classification Regions
(Figure: the densities f1(x) and f2(x) with the misclassification probabilities P(2|1), solid shading, and P(1|2), striped shading.)
Probability of Misclassification
• Let p1, p2 denote the prior probabilities of π1, π2, respectively.
• Then probabilities of the four possible outcomes are:
P (correctly classified as π1) = P (X ∈ R1|π1)P (π1) = P (1|1)p1
P (incorrectly classified as π1) = P (X ∈ R1|π2)P (π2) = P (1|2)p2
P (correctly classified as π2) = P (X ∈ R2|π2)P (π2) = P (2|2)p2
P (incorrectly classified as π2) = P (X ∈ R2|π1)P (π1) = P (2|1)p1
• Further, let c(1|2) and c(2|1) be the costs of misclassifying
a π2 object into π1 and a π1 object into π2, respectively.
Expected Cost of Misclassification
• Classification rules are often evaluated in terms of the
expected cost of misclassification or ECM :
ECM = c(2|1)P (2|1)p1 + c(1|2)P (1|2)p2.
• There is no cost when units are correctly classified.
• We seek rules that minimize the ECM . This leads to an
optimal classification rule: classify an object into π1 if
[f1(x) c(2|1) p1] / [f2(x) c(1|2) p2] > 1.
Classification Rule
• Equivalently, the regions R1, R2 that minimize the ECM are
defined by the values of x for which
R1 : f1(x)/f2(x) > (c(1|2)/c(2|1)) (p2/p1)
R2 : f1(x)/f2(x) < (c(1|2)/c(2|1)) (p2/p1)
• Implementation of the minimum ECM rule for a new
unit requires evaluation of f1 and f2 at the new vector of
observations x0, but it does not require knowing the two
costs or the two prior probabilities, just their ratios.
• When the prior probabilities or the misclassification costs are
equal, the classification rule above simplifies correspondingly.
Special cases of minimum ECM regions
p1 = p2 :                  R1 : f1(x)/f2(x) > c(1|2)/c(2|1),     R2 : f1(x)/f2(x) < c(1|2)/c(2|1)
c(1|2) = c(2|1) :          R1 : f1(x)/f2(x) > p2/p1,             R2 : f1(x)/f2(x) < p2/p1
p1 = p2, c(1|2) = c(2|1) : R1 : f1(x)/f2(x) > 1,                 R2 : f1(x)/f2(x) < 1.

If x is on the boundary between R1 and R2, then toss a coin, or
randomly classify in some way.
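For illustration, here is a minimal Python sketch of the minimum-ECM rule in its density-ratio form. The univariate normal densities, priors, and costs below are made-up values chosen only to show the mechanics; nothing here comes from the slides.

    # Minimum-ECM rule: allocate to population 1 when
    # f1(x)/f2(x) > (c(1|2)/c(2|1)) * (p2/p1).
    from scipy.stats import norm

    def classify_min_ecm(x, f1, f2, p1, p2, c12, c21):
        threshold = (c12 / c21) * (p2 / p1)
        return 1 if f1(x) / f2(x) > threshold else 2

    # Hypothetical populations: pi_1 ~ N(0, 1), pi_2 ~ N(2, 1).
    f1 = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
    f2 = lambda x: norm.pdf(x, loc=2.0, scale=1.0)

    # Unequal priors and costs (illustrative values).
    print(classify_min_ecm(0.8, f1, f2, p1=0.7, p2=0.3, c12=1.0, c21=5.0))

With equal priors and equal costs the threshold reduces to 1, matching the third special case above.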
Other Criteria for Choosing a Classification Rule
• The total probability of misclassification or T P M ignores the
cost of misclassification and is defined as the probability of
either misclassifying a π1 observation or misclassifying a π2
observation:
TPM = p1 ∫_{R2} f1(x) dx + p2 ∫_{R1} f2(x) dx,
and is equivalent to ECM when costs are equal. An optimal
rule in this sense would minimize T P M .
• The optimal TPM regions are given by the second special case
(equal costs) on the preceding slide.
Other Criteria for Choosing a Classification Rule
• We might consider classifying a new unit with observation
x0 into the population with the highest posterior probability
P (πi|x0). By Bayes rule
P(π1|x0) = [P(we observe x0 | π1) p1] / [P(observe x0 | π1) p1 + P(observe x0 | π2) p2]
         = p1 f1(x0) / [p1 f1(x0) + p2 f2(x0)].
• Clearly, P (π2|x0) = 1 − P (π1|x0).
• Using the posterior probability criterion, we classify a unit
with measurements x0 into π1 when P (π1|x0) > P (π2|x0).
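A short sketch of the posterior-probability criterion, using the same kind of made-up univariate normal densities as in the earlier sketch (illustrative only):

    # P(pi_1 | x0) = p1 f1(x0) / (p1 f1(x0) + p2 f2(x0)); classify into pi_1
    # when this posterior exceeds 1/2.
    from scipy.stats import norm

    def posterior_pi1(x0, f1, f2, p1, p2):
        num = p1 * f1(x0)
        return num / (num + p2 * f2(x0))

    f1 = lambda x: norm.pdf(x, loc=0.0, scale=1.0)   # hypothetical pi_1 density
    f2 = lambda x: norm.pdf(x, loc=2.0, scale=1.0)   # hypothetical pi_2 density

    post1 = posterior_pi1(0.8, f1, f2, p1=0.5, p2=0.5)
    print(post1, "-> pi_1" if post1 > 0.5 else "-> pi_2")

With equal costs, this criterion gives the same allocations as the minimum-TPM rule.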
Two multivariate normal populations with
Σ1 = Σ2
• We now assume that f1(x) = Np(µ1, Σ) and f2(x) = Np(µ2, Σ).
• Note that
f1(x)/f2(x) = exp[ −(1/2)(x − µ1)′Σ⁻¹(x − µ1) + (1/2)(x − µ2)′Σ⁻¹(x − µ2) ].
• Then, R1 is given by the set of x values for which:
f1(x)/f2(x) = exp[ −(1/2)(x − µ1)′Σ⁻¹(x − µ1) + (1/2)(x − µ2)′Σ⁻¹(x − µ2) ]
            > (c(1|2)/c(2|1)) (p2/p1),
and R2 is the complementary set (< instead of >).
Multivariate Normal Populations
• Given the definitions of R1, R2, an allocation that minimizes
the ECM is the following: allocate x0 to π1 if
(µ1 − µ2)′Σ⁻¹x0 − (1/2)(µ1 − µ2)′Σ⁻¹(µ1 + µ2) > ln[ (c(1|2)/c(2|1)) (p2/p1) ]
and allocate x0 to π2 otherwise.
• Since µ1, µ2, Σ are typically unknown, in practice we use x̄1, x̄2
as estimators of the population means, and Spool as estimator of the common covariance matrix Σ.
• The classification rule is a linear function of x0 and is known
as Fisher’s linear classification rule.
Multivariate Normal Populations
• In the special case in which
(c(1|2)/c(2|1)) (p2/p1) = 1,
we have that ln(1) = 0 and the rule simplifies. Let
m̂ = (1/2)(x̄1 − x̄2)′Spool⁻¹(x̄1 + x̄2) = (1/2)(ȳ1 + ȳ2),
with
ȳ1 = (x̄1 − x̄2)′Spool⁻¹x̄1 = â′x̄1
ȳ2 = (x̄1 − x̄2)′Spool⁻¹x̄2 = â′x̄2.
Multivariate Normal Populations
• For a new observation x0, we allocate the object to π1 if
ŷ0 > m̂
and to π2 otherwise, where
ŷ0 = (x̄1 − x̄2)′Spool⁻¹x0 = â′x0.
• In this simple case, we simply compare ŷ0 to the midpoint of
ȳ1 and ȳ2.
• Note that the classification rule computed using estimated
parameters does not guarantee minimum ECM, but if sample
sizes are large enough and the two populations are reasonably
normal, the actual ECM will tend to be close to the minimum.
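The following Python sketch walks through the estimated linear rule with equal priors and costs. The two training samples are simulated here purely to have numbers to work with; only the recipe (â, m̂, ŷ0) follows the slides.

    import numpy as np

    rng = np.random.default_rng(0)
    # Simulated training samples (stand-ins for real data).
    X1 = rng.multivariate_normal([1.0, 1.0], np.eye(2), size=30)   # from pi_1
    X2 = rng.multivariate_normal([-1.0, 0.0], np.eye(2), size=25)  # from pi_2

    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    S_pool = ((n1 - 1) * np.cov(X1, rowvar=False) +
              (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)

    a_hat = np.linalg.solve(S_pool, xbar1 - xbar2)  # a_hat = S_pool^{-1}(xbar1 - xbar2)
    m_hat = 0.5 * a_hat @ (xbar1 + xbar2)           # midpoint of ybar1 and ybar2

    x0 = np.array([0.3, 0.8])                       # a new observation
    y0_hat = a_hat @ x0
    print("allocate to pi_1" if y0_hat > m_hat else "allocate to pi_2")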
Example: Hemophilia A Carriers
• Objective was to detect individuals who are hemophilia A
carriers.
• Blood samples from women known to be carriers and women
known to be non-carriers were assayed and measurements
were taken on two variables: X1 = log10(AHF activity) and
X2 = log10(AHF-like antigen).
• Samples of size n1 = 30 normal women and n2 = 22 carrier
women were obtained.
• Pairs of measurements for each woman are plotted on the
figure in the next page, together with the 50% and 95%
probability contours estimated under the bivariate normal assumption.
Example: Hemophilia A Carriers
(Figure: scatter plot of log10(AHF activity) versus log10(AHF-like antigen) for the two groups, with 50% and 95% bivariate normal probability contours.)
Example: Hemophilia A Carriers
• Sample statistics were the following:
x̄1 = [−0.0065, −0.0390]′,   x̄2 = [−0.2483, 0.0262]′,
Spool⁻¹ = [ 131.158  −90.423
            −90.423  108.147 ].
• Assuming (for now) equal prior probabilities and costs, the
discriminant function is
ŷ = â′x = (x̄1 − x̄2)′Spool⁻¹x = 37.61 x1 − 28.92 x2.
Example: Hemophilia A Carriers
• Further,
ȳ1 = â′x̄1 = 37.61 × (−0.0065) − 28.92 × (−0.0390) = 0.88
ȳ2 = â′x̄2 = 37.61 × (−0.2483) − 28.92 × 0.0262 = −10.10.
• Therefore, m̂ = (ȳ1 + ȳ2)/2 = −4.61.
• Suppose that a new patient has measurements x1 = −0.210
and x2 = −0.044. Is she normal or is she a carrier? Using
Fisher’s linear discriminant function, we would allocate the
woman to the normal group if
ŷ0 = â′x0 ≥ m̂.
• In this case,
ŷ0 = 37.61 × (−0.210) − 28.92 × (−0.044) = −6.62 < −4.61.
Therefore, she would be classified as an obligatory carrier.
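The calculations above are easy to reproduce; the sketch below uses exactly the sample statistics reported on the previous slides (equal priors and costs).

    import numpy as np

    xbar1 = np.array([-0.0065, -0.0390])            # normal (non-carrier) group
    xbar2 = np.array([-0.2483,  0.0262])            # carrier group
    S_pool_inv = np.array([[131.158, -90.423],
                           [-90.423, 108.147]])

    a_hat = (xbar1 - xbar2) @ S_pool_inv            # approx [37.61, -28.92]
    ybar1, ybar2 = a_hat @ xbar1, a_hat @ xbar2     # approx 0.88 and -10.10
    m_hat = 0.5 * (ybar1 + ybar2)                   # approx -4.61

    x0 = np.array([-0.210, -0.044])                 # the new patient
    y0_hat = a_hat @ x0                             # approx -6.62
    print("normal" if y0_hat >= m_hat else "carrier")   # -> carrier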
Example: Hemophilia A Carriers
• If our patient is a maternal first cousin of a hemophiliac, genetics tells
us that her prior probability of being a carrier is p2 = 0.25. Then p1 = 0.75.
• Assume equal cost of misclassification. The discriminant is
ŵ = (x̄1 − x̄2)′Spool⁻¹x0 − (1/2)(x̄1 − x̄2)′Spool⁻¹(x̄1 + x̄2),
which for this patient is equal to −2.01.
• We classify the patient as an obligatory carrier because
"
ŵ = −2.01 < ln
#
p2
= −1.10,
p1
even though she was more likely to be normal.
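Continuing the same example in code, the only change with priors p1 = 0.75, p2 = 0.25 and equal costs is the threshold ln(p2/p1):

    import numpy as np

    xbar1 = np.array([-0.0065, -0.0390])
    xbar2 = np.array([-0.2483,  0.0262])
    S_pool_inv = np.array([[131.158, -90.423],
                           [-90.423, 108.147]])
    x0 = np.array([-0.210, -0.044])

    a_hat = (xbar1 - xbar2) @ S_pool_inv
    w_hat = a_hat @ x0 - 0.5 * a_hat @ (xbar1 + xbar2)   # approx -2.01
    threshold = np.log(0.25 / 0.75)                      # ln(p2/p1), approx -1.10
    print("normal" if w_hat > threshold else "carrier")  # -> carrier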
Fisher, Classification and Hotelling’s T²
• For the two population case, R. A. Fisher arrived at the linear
discriminant function in a different way.
• Given p−dimensional observations x from populations π1 and
π2, he proposed finding the linear combination of the elements of x,
ŷ1i = â′x1i and ŷ2j = â′x2j for i = 1, ..., n1; j = 1, ..., n2, that maximizes
separation = (ȳ1 − ȳ2)² / s²y,
with s²y the pooled estimate of the variance of the ŷ values.
(Classify x into π1 if ŷ = â′x > â′(x̄1 + x̄2)/2.)
Fisher, Classification and Hotelling’s T²
• The separation is maximized for â′ = (x̄1 − x̄2)′Spool⁻¹ and the
maximum of the ratio is
D² = (x̄1 − x̄2)′Spool⁻¹(x̄1 − x̄2).
• Therefore, for two populations and a linear rule, the maximum relative separation that can be obtained is equal to the
squared distance between the multivariate means.
• A test of H0 : µ1 = µ2 is also a test for the hypothesis of no
separation. If Hotelling’s T² fails to reject H0, then the data
will not provide a useful classification rule.
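For the hemophilia data, D² can be computed directly from the sample statistics given earlier; its value is not reported on the slides, so the number below is simply the arithmetic implied by those statistics.

    import numpy as np

    xbar1 = np.array([-0.0065, -0.0390])
    xbar2 = np.array([-0.2483,  0.0262])
    S_pool_inv = np.array([[131.158, -90.423],
                           [-90.423, 108.147]])

    d = xbar1 - xbar2
    D2 = d @ S_pool_inv @ d    # squared sample Mahalanobis distance, roughly 11
    print(D2)

Large values of D² correspond to large values of Hotelling's T² (the two statistics are proportional), i.e., strong evidence of separation between the two mean vectors.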
The normal case when Σ1 ≠ Σ2
• When Σ1 ≠ Σ2, we can no longer use a simple linear
classification rule.
• Under normality, terms involving |Σi|1/2 do not cancel and
the exponential terms do not combine easily.
• Using the definition of R1 given earlier and expressing the
likelihood ratio in the log scale, we now have:
R1 : −(1/2) x′(Σ1⁻¹ − Σ2⁻¹)x + (µ1′Σ1⁻¹ − µ2′Σ2⁻¹)x − k > ln[ (c(1|2)/c(2|1)) (p2/p1) ],
with R2 the complement, where
k = (1/2) ln( |Σ1| / |Σ2| ) + (1/2)(µ1′Σ1⁻¹µ1 − µ2′Σ2⁻¹µ2).
The normal case when Σ1 ≠ Σ2
• The classification regions are now quadratic functions of x.
• The classification rule for a new observation x0 is now the
following: allocate x0 to π1 if
−(1/2) x0′(Σ1⁻¹ − Σ2⁻¹)x0 + (µ1′Σ1⁻¹ − µ2′Σ2⁻¹)x0 − k > ln[ (c(1|2)/c(2|1)) (p2/p1) ]
and allocate x0 to π2 otherwise.
• In practice, we estimate the classification rule by substituting
the unobservable population parameters (µ1, µ2, Σ1, Σ2) by
sample estimates (x̄1, x̄2, S1, S2).
Features of Quadratic Rules
• Disjoint classification regions are possible.
Features of Quadratic Rules
• Quadratic rules are not robust to departures from normality.
• Sometimes, the quadratic discriminant rule is written as:
classify an observation into π1 if
ln[ f1(x) p1 / (f2(x) p2) ] > ln[ c(1|2) / c(2|1) ].
• In the normal case with equal misclassification costs, the
estimated discriminant function is given by
[ −(1/2) ln|S1| − (1/2)(x − x̄1)′S1⁻¹(x − x̄1) + ln(p1) ]
− [ −(1/2) ln|S2| − (1/2)(x − x̄2)′S2⁻¹(x − x̄2) + ln(p2) ].
More on Quadratic Discriminant Functions
• The term quadratic discriminant score, d_k^Q(x), is sometimes
used to denote the terms in the expression just presented.
That is,
d̂_1^Q(x) = −(1/2) ln|S1| − (1/2)(x − x̄1)′S1⁻¹(x − x̄1) + ln(p1)
d̂_2^Q(x) = −(1/2) ln|S2| − (1/2)(x − x̄2)′S2⁻¹(x − x̄2) + ln(p2).
• With equal misclassification costs, we allocate an observation
to the population with higher d̂_k^Q(x).
• We see later that the quadratic discriminant scores can be
used to allocate objects to more than two populations.
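A small Python sketch of the estimated quadratic scores; the training samples are simulated, so the means, covariances, and priors are illustrative assumptions, not values from the slides.

    import numpy as np

    def quad_score(x, xbar, S, prior):
        # d_k^Q(x) = -0.5 ln|S_k| - 0.5 (x - xbar_k)' S_k^{-1} (x - xbar_k) + ln(p_k)
        dev = x - xbar
        sign, logdet = np.linalg.slogdet(S)
        return -0.5 * logdet - 0.5 * dev @ np.linalg.solve(S, dev) + np.log(prior)

    rng = np.random.default_rng(1)
    X1 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=40)
    X2 = rng.multivariate_normal([2.0, 1.0], [[2.0, -0.5], [-0.5, 1.5]], size=40)

    x0 = np.array([1.0, 0.5])
    scores = [quad_score(x0, X.mean(axis=0), np.cov(X, rowvar=False), p)
              for X, p in [(X1, 0.5), (X2, 0.5)]]
    print("allocate to pi_%d" % (int(np.argmax(scores)) + 1))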
Selecting a Discriminant Function
• If populations are normal, how do we decide whether to use
a linear or a quadratic discriminant function?
– Use Bartlett’s test to test for equality of covariance
matrices (in SAS, PROC DISCRIM does this; R does not).
– Test is sensitive to departures from normality.
• If data are not normally distributed:
– Neither the linear nor quadratic discriminant functions are
optimal, but they could still perform well if the populations are well separated.
– Quadratic rules tend to work better on non-normal samples even when covariance matrices are homogeneous.
• Transform variables to better approximate samples from normal distributions, or use ranks.
• Evaluate proposed rules on new observations with known
group membership (or use crossvalidation).
Estimating Misclassification Probabilities
• To evaluate the performance of a classification rule, we wish
to get estimates of P (1|2) and of P (2|1).
• When the parent population densities f1(x), f2(x) are not
known, we limit ourselves to evaluating the performance of
the sample classification rule.
• We discuss several approaches to estimate the Actual Error Rate
(AER):
– The apparent error rate method
– Data splitting, or set-aside method
– Crossvalidation
– Bootstrap approach.
Apparent Error Rates
• Use the classification rule estimated from the training sample (either the linear or the quadratic rule) to classify each
member of the training sample.
• Compute the proportion of sample units that are classified into the incorrect population.
• These proportions (P̂ (1|2) and P̂ (2|1)) are called apparent
error rates. They provide estimates of misclassification probabilities.
• As estimates of misclassification probabilities, they can be significantly biased toward zero, since we are re-using the same
data for estimating the rule and assessing its performance.
• The bias decreases as the size of the training sample increases.
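A sketch of the apparent error rate computation with a linear rule fitted to simulated training samples (the data and the helper function are illustrative; the logic is simply "fit the rule, then re-classify the same training data"):

    import numpy as np

    def fisher_rule(A, B):
        # Estimated linear rule: classify into pi_1 when a'x > m.
        xb1, xb2 = A.mean(axis=0), B.mean(axis=0)
        Sp = ((len(A) - 1) * np.cov(A, rowvar=False) +
              (len(B) - 1) * np.cov(B, rowvar=False)) / (len(A) + len(B) - 2)
        a = np.linalg.solve(Sp, xb1 - xb2)
        m = 0.5 * a @ (xb1 + xb2)
        return lambda x: 1 if a @ x > m else 2

    rng = np.random.default_rng(2)
    X1 = rng.multivariate_normal([1.0, 1.0], np.eye(2), size=50)   # training pi_1
    X2 = rng.multivariate_normal([-1.0, 0.0], np.eye(2), size=50)  # training pi_2

    rule = fisher_rule(X1, X2)
    aper_2_given_1 = np.mean([rule(x) == 2 for x in X1])   # apparent P(2|1)
    aper_1_given_2 = np.mean([rule(x) == 1 for x in X2])   # apparent P(1|2)
    print(aper_2_given_1, aper_1_given_2)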
Data splitting
• If the training sample is large enough, we can randomly split
the sample into a training and a validation set. About 25%
to 35% of the sample should be set aside for validation.
• Estimate the classification rule from the training sub-sample.
• We then use the rule to classify the cases in the validation
sub-sample and compute the proportion of them that are
misclassified.
• Estimates of P (1|2) and of P (2|1) are less biased but tend
to have larger variances because a smaller sample of units is
used to estimate the misclassification probabilities.
• Further, the rule evaluated is not the one that will be used in
practice (and which will be estimated using all observations).
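The set-aside idea in code, again with simulated samples and the same illustrative linear-rule helper as in the previous sketch:

    import numpy as np

    def fisher_rule(A, B):
        xb1, xb2 = A.mean(axis=0), B.mean(axis=0)
        Sp = ((len(A) - 1) * np.cov(A, rowvar=False) +
              (len(B) - 1) * np.cov(B, rowvar=False)) / (len(A) + len(B) - 2)
        a = np.linalg.solve(Sp, xb1 - xb2)
        m = 0.5 * a @ (xb1 + xb2)
        return lambda x: 1 if a @ x > m else 2

    rng = np.random.default_rng(3)
    X1 = rng.multivariate_normal([1.0, 1.0], np.eye(2), size=60)
    X2 = rng.multivariate_normal([-1.0, 0.0], np.eye(2), size=60)

    def split(X, train_frac=0.7):
        idx = rng.permutation(len(X))
        cut = int(train_frac * len(X))
        return X[idx[:cut]], X[idx[cut:]]

    X1_tr, X1_val = split(X1)     # roughly 30% of each sample set aside
    X2_tr, X2_val = split(X2)

    rule = fisher_rule(X1_tr, X2_tr)                 # estimated from training part only
    print(np.mean([rule(x) == 2 for x in X1_val]),   # estimate of P(2|1)
          np.mean([rule(x) == 1 for x in X2_val]))   # estimate of P(1|2)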
Crossvalidation or Hold-out Methods
• Similar to the set-aside method, except that instead of putting
aside an entire validation sample we now randomly split the
training sample into g groups.
• We sequentially set aside one group at a time and estimate
the rule from the observations in the other g − 1 groups.
• We then classify the cases in the group that was set aside
and estimate the proportion of misclassified cases.
• Repeat the steps g times, each time setting a different group
aside.
• The estimated probabilities of misclassification are computed
as the average over the g estimates.
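A sketch of g-fold crossvalidation for the two error rates, with g = 5, simulated data, and the same illustrative linear-rule helper:

    import numpy as np

    def fisher_rule(A, B):
        xb1, xb2 = A.mean(axis=0), B.mean(axis=0)
        Sp = ((len(A) - 1) * np.cov(A, rowvar=False) +
              (len(B) - 1) * np.cov(B, rowvar=False)) / (len(A) + len(B) - 2)
        a = np.linalg.solve(Sp, xb1 - xb2)
        m = 0.5 * a @ (xb1 + xb2)
        return lambda x: 1 if a @ x > m else 2

    rng = np.random.default_rng(4)
    X1 = rng.multivariate_normal([1.0, 1.0], np.eye(2), size=60)
    X2 = rng.multivariate_normal([-1.0, 0.0], np.eye(2), size=60)

    g = 5
    folds1 = np.array_split(rng.permutation(len(X1)), g)
    folds2 = np.array_split(rng.permutation(len(X2)), g)

    err21, err12 = [], []
    for k in range(g):
        keep1 = np.setdiff1d(np.arange(len(X1)), folds1[k])
        keep2 = np.setdiff1d(np.arange(len(X2)), folds2[k])
        rule = fisher_rule(X1[keep1], X2[keep2])      # fit without fold k
        err21.append(np.mean([rule(x) == 2 for x in X1[folds1[k]]]))
        err12.append(np.mean([rule(x) == 1 for x in X2[folds2[k]]]))

    print(np.mean(err21), np.mean(err12))   # CV estimates of P(2|1) and P(1|2)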
Crossvalidation
• Notice that a different rule is used to estimate misclassification probabilities for each of the g subsets of data.
• Estimates of misclassification probabilities are less biased
and have smaller variance than the estimates obtained using
the single set-aside method.
• Bias will increase when more variables are used.
• When g = n1 + n2 (i.e., we set aside one case at a time) the
method is called a "holdout" or "leave-one-out" method. It
is sometimes (incorrectly) called a jackknife method.
Bootstrap Estimation
• First, estimate the apparent error rates from the complete
sample: P̂ (1|2) and P̂ (2|1).
• Next, create B new samples of size n1 and n2 by randomly
sampling observations with replacement from the original
training samples.
• From each of the bootstrap samples, estimate a classification
rule and compute the apparent error rates. Label them as
P̂i(1|2) and P̂i(2|1), for i = 1, ..., B.
Bootstrap Estimation
• Use the B bootstrap estimates P̂i(1|2) and P̂i(2|1) to
estimate the bias of the apparent error rates as
bias-hat(1|2) = B⁻¹ Σ_{i=1}^{B} P̂i(1|2) − P̂(1|2),
bias-hat(2|1) = B⁻¹ Σ_{i=1}^{B} P̂i(2|1) − P̂(2|1).
Bootstrap Estimation
• A bootstrap-corrected estimate of P(1|2) is given by
P̂c(1|2) = P̂(1|2) − bias-hat(1|2)
        = P̂(1|2) − ( B⁻¹ Σ_{i=1}^{B} P̂i(1|2) − P̂(1|2) )
        = 2 P̂(1|2) − B⁻¹ Σ_{i=1}^{B} P̂i(1|2).
• Similarly,
P̂c(2|1) = P̂(2|1) − bias-hat(2|1)
        = P̂(2|1) − ( B⁻¹ Σ_{i=1}^{B} P̂i(2|1) − P̂(2|1) )
        = 2 P̂(2|1) − B⁻¹ Σ_{i=1}^{B} P̂i(2|1).
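A sketch of the bootstrap bias correction with simulated data; the classifier is the same illustrative linear rule as in the earlier sketches, and B = 200 is an arbitrary choice.

    import numpy as np

    def fisher_rule(A, B):
        xb1, xb2 = A.mean(axis=0), B.mean(axis=0)
        Sp = ((len(A) - 1) * np.cov(A, rowvar=False) +
              (len(B) - 1) * np.cov(B, rowvar=False)) / (len(A) + len(B) - 2)
        a = np.linalg.solve(Sp, xb1 - xb2)
        m = 0.5 * a @ (xb1 + xb2)
        return lambda x: 1 if a @ x > m else 2

    def apparent_rates(A, B):
        rule = fisher_rule(A, B)
        return (np.mean([rule(x) == 2 for x in A]),   # apparent P(2|1)
                np.mean([rule(x) == 1 for x in B]))   # apparent P(1|2)

    rng = np.random.default_rng(5)
    X1 = rng.multivariate_normal([1.0, 1.0], np.eye(2), size=40)
    X2 = rng.multivariate_normal([-1.0, 0.0], np.eye(2), size=40)

    p21, p12 = apparent_rates(X1, X2)

    n_boot = 200
    boot21, boot12 = [], []
    for _ in range(n_boot):
        A = X1[rng.integers(0, len(X1), size=len(X1))]    # resample pi_1 sample
        Bs = X2[rng.integers(0, len(X2), size=len(X2))]   # resample pi_2 sample
        b21, b12 = apparent_rates(A, Bs)
        boot21.append(b21)
        boot12.append(b12)

    # Bias-corrected estimates: 2 * apparent - mean of bootstrap apparent rates.
    print(2 * p21 - np.mean(boot21), 2 * p12 - np.mean(boot12))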
Bootstrap Confidence Intervals
• We can also compute bootstrap confidence intervals for the
misclassification probabilities.
• An approximate 100(1 − α)% confidence interval for P (1|2)
is given by
P̂c(1|2) ± zα/2 √[ Σ_{i=1}^{B} ( P̂i(1|2) − B⁻¹ Σ_{k=1}^{B} P̂k(1|2) )² / (B − 1) ],
where zα/2 is the (1 − α/2) quantile of the standard
normal distribution.
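A short sketch of this normal-approximation interval; the bootstrap replicates below are made-up numbers standing in for the P̂i(1|2) values (in practice B should be much larger).

    import numpy as np
    from scipy.stats import norm

    def bootstrap_ci(corrected, boot_reps, alpha=0.05):
        boot_reps = np.asarray(boot_reps)
        # Standard deviation of the bootstrap replicates, with divisor (B - 1).
        se = np.sqrt(np.sum((boot_reps - boot_reps.mean()) ** 2) / (len(boot_reps) - 1))
        z = norm.ppf(1 - alpha / 2)
        return corrected - z * se, corrected + z * se

    boot12 = [0.10, 0.15, 0.12, 0.08, 0.11, 0.14, 0.09, 0.13]   # hypothetical P_i(1|2) values
    p12_corrected = 0.11                                        # hypothetical corrected estimate
    print(bootstrap_ci(p12_corrected, boot12))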
Percentile Bootstrap Confidence Intervals
• A more direct approach to estimating the confidence limits is
to take the lower limit as the value of P̂i(1|2) exceeded by a
proportion 1 − α/2 of the B bootstrap estimates of P(1|2), and the
upper limit as the value exceeded by a proportion α/2 of them.
• This direct approach to confidence limits works better when
B is large.
• Reference: Efron, B. and Tibshirani, R. J. (1993), An
Introduction to the Bootstrap, Chapman & Hall, New York
(Chapter 17).
Variable Selection
• We must often decide how many variables (out of p) to
include in the classification rule.
• In general, the apparent error rate will not increase as more
variables are used for classification.
• However, the probability of misclassifying a new case can
increase when the number of variables used to construct the
classification rule increases.
Variable Selection
• We can use PROC STEPDISC in SAS to help select a 'good' set
of variables using one of the following:
– backward elimination: start with all p variables and
sequentially eliminate variables that do not improve the
probability of correct classification;
– forward selection: sequentially add variables to the
classification rule;
– stepwise selection: a combination of forward and backward steps.
• The STEPDISC procedure uses the p-value for a partial F-test to determine the "significance" of adding or deleting a
variable.
Variable Selection
• The partial F-test for including a new variable is computed by
– regressing each variable not in the model on all variables currently in the model,
– computing the residuals,
– computing the F-test for no difference in population means
in a one-way ANOVA of the residuals.
– The variable with the most significant partial F-test is
added to the classification model (see the sketch below).
• Re-estimate misclassification probabilities as each variable is
added to or deleted from the model.
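A rough sketch of one forward-selection step based on the partial F-test described above. This only mimics the idea (it is not PROC STEPDISC itself); the data are simulated, variable 0 is assumed to be already in the model, and scipy's f_oneway supplies the one-way ANOVA F-test.

    import numpy as np
    from scipy.stats import f_oneway

    rng = np.random.default_rng(6)
    n1, n2 = 50, 50
    X1 = rng.multivariate_normal([1.0, 0.5, 0.0], np.eye(3), size=n1)
    X2 = rng.multivariate_normal([0.0, 0.0, 0.0], np.eye(3), size=n2)
    X = np.vstack([X1, X2])
    groups = np.array([1] * n1 + [2] * n2)

    in_model = [0]          # variable already selected (assumption for this sketch)
    candidates = [1, 2]     # variables not yet in the model

    def partial_f_pvalue(j):
        # Regress variable j on the variables currently in the model ...
        Z = np.column_stack([np.ones(len(X))] + [X[:, k] for k in in_model])
        beta, *_ = np.linalg.lstsq(Z, X[:, j], rcond=None)
        resid = X[:, j] - Z @ beta
        # ... then F-test the residuals across the two groups (one-way ANOVA).
        return f_oneway(resid[groups == 1], resid[groups == 2]).pvalue

    pvals = {j: partial_f_pvalue(j) for j in candidates}
    best = min(pvals, key=pvals.get)
    print(pvals, "-> add variable", best)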