Medical Statistics & Epidemiology (153133)
Discriminant Analysis (week 51)
October 2008 (K.P.)
CONTENTS
1 Introduction
2 Classifying if population parameters are known
3 Principal Components
4 Training Samples
5 Generalization to three or more populations
6 Assignment
1. Introduction
We consider the problem of classifying an individual into one of a number of categories.
We start with the problem of classifying an individual into one of two categories. Later
on, in section 5, we explain how classification into one of three or more categories can
be handled. Theory about how to classify into one of two categories is exemplified by
means of the data in the following table:
firstjoint secondjoint aedeagus group
191 131 53 1
185 134 50 1
200 137 52 1
173 127 50 1
171 118 49 1
160 118 47 1
188 134 54 1
186 129 51 1
174 131 52 1
163 115 47 1
190 143 52 1
174 131 50 1
201 130 51 1
190 133 53 1
182 130 51 1
184 131 51 1
177 127 49 1
178 126 53 1
210 140 54 1
182 121 51 1
186 136 56 1
186 107 49 2
211 122 49 2
201 114 47 2
242 131 54 2
184 108 43 2
211 118 51 2
217 122 49 2
223 127 51 2
208 125 50 2
199 124 46 2
211 129 49 2
218 126 49 2
203 122 49 2
192 116 49 2
195 123 47 2
211 122 48 2
187 123 47 2
192 109 46 2
223 124 53 2
188 114 48 2
216 120 50 2
185 114 46 2
178 119 47 2
187 111 49 2
187 112 49 2
201 130 54 2
187 120 47 2
210 119 50 2
196 114 51 2
195 110 49 2
187 124 49 2
The data are measurements made on two species of Chaetocnema. The two species are Chaetocnema Concinna (group 1) and Chaetocnema Heikertlingeri (group 2). The variable $x_1$ = firstjoint is the width of the first joint of the first tarsus in microns. The variable $x_2$ = secondjoint is the width of the second joint. Finally, $x_3$ = aedeagus is the maximal width of the aedeagus in the fore-part in microns. It is hard to distinguish the two species. Our problem is how to classify a (new) individual into one of the species mentioned using the three measurements, if we already assume that the individual belongs to either Chaetocnema Concinna or Chaetocnema Heikertlingeri.
In practice, classification problems involve more variables than in our example. Just for ease of presentation we restrict ourselves to a data set with only three variables. In the classical classification problem the following assumptions are made. If $X = (X_1, X_2, \ldots, X_p)^T$ represents the (column) vector of measurements of an individual, then $X$ is distributed according to a multivariate normal distribution with a vector of expectations $\mu$ and a variance-covariance matrix $\Sigma$. First of all: what do we mean by a multivariate normal distribution? In textbooks you can find different (but equivalent) definitions. In this document we use the following definition of a multivariate normal distribution: a stochastic vector $X = (X_1, X_2, \ldots, X_p)^T$ has a multivariate normal distribution if each linear function $a^T X = a_1 X_1 + a_2 X_2 + \ldots + a_p X_p$ has a (one-dimensional) normal distribution, as long as the vector $a = (a_1, a_2, \ldots, a_p)^T$ differs from the null vector. Each element $\mu_i$ of the vector $\mu$ is the expectation of the corresponding element: $\mu_i = E(X_i)$. If $\sigma_{ij}$ denotes the $(i,j)$-element of the matrix $\Sigma$, then it is the covariance of $X_i$ and $X_j$: $\sigma_{ij} = \mathrm{cov}(X_i, X_j)$. Note that the diagonal elements $\sigma_{ii}$ are variances: $\sigma_{ii} = \mathrm{cov}(X_i, X_i) = \mathrm{var}(X_i)$.
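For later reference (this formula is standard, though not stated above): if $\Sigma$ is nonsingular, the corresponding multivariate normal density is

$f(x) = (2\pi)^{-p/2} \, |\Sigma|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right), \qquad x \in \mathbb{R}^p.$

This density reappears in section 2, where it is written as $f_1(x)$ and $f_2(x)$ for the two populations.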
In general one classifies an individual into one of two populations, which are called $\pi_1$ and $\pi_2$. In our example $\pi_1$ refers to Chaetocnema Concinna and $\pi_2$ refers to Chaetocnema Heikertlingeri. It is assumed that the vector $X$ of the individual that has to be classified has the vector of expectations $\mu_1$ if the individual belongs to population $\pi_1$. If the individual belongs to $\pi_2$ then $X$ has vector of expectations $\mu_2$. Furthermore it is assumed that the two populations share a common variance-covariance matrix $\Sigma$ with respect to the multivariate normal distribution of $X$.
2. Classifying if population parameters are known
We first explain two rules you can use for classification: one using Fisher’s discriminant
function and one using posterior probabilities. We start with Fisher’s discriminant
function. In this section we assume that $\mu_1$, $\mu_2$ and $\Sigma$ are known.
Imagine that $p = 1$ holds. Then the situation is rather simple. You have one variable $X$ which has a normal distribution. If the individual belongs to $\pi_1$ then $X$ has expectation $\tilde\mu_1$. In case of $\pi_2$ the expectation of $X$ is $\tilde\mu_2$. Irrespective of the population the variance is $\sigma^2$. The classification rule is simple: classify the individual into population $\pi_1$ if the distance of $X$ to $\tilde\mu_1$ is smaller than the distance of $X$ to $\tilde\mu_2$. In fact this classification problem depends on one parameter $\Delta = |\tilde\mu_1 - \tilde\mu_2| / \sigma$. The misclassification probabilities only depend on the parameter $\Delta$. Let us express the probability that the individual is classified into population $\pi_2$ whereas the individual belongs to $\pi_1$ in reality. Assuming that $\tilde\mu_2 > \tilde\mu_1$ this probability is equal to

$P\left(X > \tfrac{1}{2}(\tilde\mu_1 + \tilde\mu_2)\right) = P\left(Z > \frac{\tfrac{1}{2}(\tilde\mu_1 + \tilde\mu_2) - \tilde\mu_1}{\sigma}\right) = P\left(Z > \frac{\tfrac{1}{2}(\tilde\mu_2 - \tilde\mu_1)}{\sigma}\right) = P(Z > \tfrac{1}{2}\Delta) = P(Z \le -\tfrac{1}{2}\Delta) = \Phi(-\tfrac{1}{2}\Delta)$

with $Z = (X - \tilde\mu_1)/\sigma$ having the standard normal distribution and $\Phi(t) = P(Z \le t)$ being the cumulative distribution function of the standard normal distribution. Let us now express the other misclassification probability: the probability that the individual is classified into population $\pi_1$ whereas the individual belongs to $\pi_2$ in reality. Still assuming $\tilde\mu_2 > \tilde\mu_1$ this probability is

$P\left(X \le \tfrac{1}{2}(\tilde\mu_1 + \tilde\mu_2)\right) = P\left(Z \le \frac{\tfrac{1}{2}(\tilde\mu_1 + \tilde\mu_2) - \tilde\mu_2}{\sigma}\right) = P\left(Z \le \frac{\tfrac{1}{2}(\tilde\mu_1 - \tilde\mu_2)}{\sigma}\right) = P(Z \le -\tfrac{1}{2}\Delta) = \Phi(-\tfrac{1}{2}\Delta)$

with now $Z = (X - \tilde\mu_2)/\sigma$ having the standard normal distribution. So we conclude that both misclassification probabilities are equal to $\Phi(-\tfrac{1}{2}\Delta)$. Verify that these probabilities don't change if we assume $\tilde\mu_2 < \tilde\mu_1$. Note that the misclassification probabilities decrease if $\Delta$ increases.
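As a small numerical illustration, the following Python sketch computes both misclassification probabilities as $\Phi(-\tfrac{1}{2}\Delta)$; the means and standard deviation used here are invented for illustration and are not taken from the data.

    # Misclassification probability in the univariate case (p = 1).
    from scipy.stats import norm

    mu1_tilde, mu2_tilde, sigma = 50.0, 53.0, 2.0   # hypothetical parameters
    delta = abs(mu1_tilde - mu2_tilde) / sigma      # Delta = |mu1 - mu2| / sigma

    # both misclassification probabilities equal Phi(-Delta/2)
    p_misclassification = norm.cdf(-delta / 2)
    print(f"Delta = {delta:.2f}, misclassification probability = {p_misclassification:.4f}")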
Let us return to $p > 1$. So we have to use a stochastic vector $X$ for purposes of classification. Fisher's discriminant function is based on the best reduction to the case $p = 1$ by means of a linear function $a^T X$. This linear function $a^T X$ has a normal distribution because we assume that $X$ has a multivariate normal distribution. Let $\tilde\mu_1$ denote the expectation of $a^T X$ if the individual belongs to $\pi_1$; $\tilde\mu_2$ is the expectation in case of $\pi_2$. The linear function $a^T X$ has a variance $\sigma^2$ which does not depend on the population. Note that the three parameters $\tilde\mu_1$, $\tilde\mu_2$ and $\sigma^2$ all depend on the vectors $\mu_1$, $\mu_2$, $a$ and the matrix $\Sigma$, and that $\mu_1$, $\mu_2$ and $\Sigma$ are (assumed) known. For Fisher's discriminant function we choose the vector $a$ that maximizes $\Delta = |\tilde\mu_1 - \tilde\mu_2| / \sigma$. Without proof we present the solution. The best choice for $a$ is given by $a = \Sigma^{-1}(\mu_2 - \mu_1)$ and hence Fisher's discriminant function is $(\mu_2 - \mu_1)^T \Sigma^{-1} X$. Sometimes Fisher's discriminant function is rescaled to $(\mu_2 - \mu_1)^T \Sigma^{-1} \left(X - \tfrac{1}{2}(\mu_1 + \mu_2)\right)$. The parameter $\Delta$ is called the Mahalanobis distance. It can be shown that the Mahalanobis distance is given by:

$\Delta^2 = (\mu_2 - \mu_1)^T \Sigma^{-1} (\mu_2 - \mu_1)$
Let us now turn to classification using posterior probabilities. Let $p_1$ denote the probability that the individual belongs to population $\pi_1$ and let $p_2$ denote the probability that the individual belongs to $\pi_2$. If the individual belongs to $\pi_1$ the vector $X$ has a multivariate normal distribution with vector of expectations $\mu_1$ and variance-covariance matrix $\Sigma$; let $f_1(x)$ denote the corresponding (multivariate) density function. Let $f_2(x)$ represent the (multivariate) density function in case of $\pi_2$. Consider an individual which we have to classify and assume that the outcome of $X$ is $x$. The conditional probability that the individual belongs to $\pi_1$ given the event $X = x$ is (using Bayes' formula):

$P(\pi_1 \mid X = x) = \frac{p_1 f_1(x)}{p_1 f_1(x) + p_2 f_2(x)}.$

Furthermore, the conditional probability that the individual belongs to $\pi_2$ given the event $X = x$ is:

$P(\pi_2 \mid X = x) = \frac{p_2 f_2(x)}{p_1 f_1(x) + p_2 f_2(x)}.$

Conditional probabilities are often called posterior probabilities in this context. We classify the individual into the population with the highest posterior probability. In many cases there is no information to estimate the prior probabilities $p_i$; then equal prior probabilities are often assumed.
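A Python sketch of the posterior-probability rule, using the multivariate normal density from scipy; the parameters and priors are again illustrative assumptions.

    import numpy as np
    from scipy.stats import multivariate_normal

    mu1 = np.array([183.0, 129.0, 51.0])    # hypothetical parameters, as before
    mu2 = np.array([201.0, 119.0, 49.0])
    Sigma = np.array([[150.0, 60.0, 15.0],
                      [ 60.0, 50.0, 10.0],
                      [ 15.0, 10.0,  5.0]])
    p1, p2 = 0.5, 0.5                       # equal prior probabilities

    x = np.array([190.0, 125.0, 52.0])      # outcome of X for the individual to classify
    f1 = multivariate_normal(mean=mu1, cov=Sigma).pdf(x)
    f2 = multivariate_normal(mean=mu2, cov=Sigma).pdf(x)

    post1 = p1 * f1 / (p1 * f1 + p2 * f2)   # P(pi_1 | X = x), Bayes' formula
    post2 = p2 * f2 / (p1 * f1 + p2 * f2)   # P(pi_2 | X = x)
    print("classify into population", 1 if post1 >= post2 else 2)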
3. Principal Components
We noticed that in the discriminant analysis procedure of SPSS a data reduction technique is applied first: principal components. So we proceed by explaining principal components. In this section we still assume that the variance-covariance matrix $\Sigma$ is known.

Consider a stochastic vector $X$ with large dimension $p$. In practice the vector may contain many characteristics that are highly correlated. Using principal components the large vector $X$ is replaced by a few linear functions $a_1^T X, a_2^T X, \ldots, a_r^T X$ (the principal components) that are independent. The vectors $a_i$ are chosen to have norm 1 and to maximize the variances $\mathrm{var}(a_i^T X)$.

For a variance-covariance matrix $\Sigma$ it can be shown that all eigenvalues are real numbers that are non-negative. Furthermore we here assume that the eigenvalues of $\Sigma$ are strictly positive and that the eigenvalues are all different. Let $\lambda_1, \lambda_2, \ldots, \lambda_p$ represent the eigenvalues of $\Sigma$. Assume $\lambda_1 > \lambda_2 > \ldots > \lambda_p$ and let the vectors $v_1, v_2, \ldots, v_p$ represent the corresponding eigenvectors, taken to be orthonormal (the vectors are orthogonal and their norms are all equal to 1).

The principal components are as follows. The first principal component is $v_1^T X$ and its variance is $\lambda_1$, the second principal component (which is independent of $v_1^T X$) is $v_2^T X$ and its variance is $\lambda_2$, etc. It can be proven that the sum of variances $\sum_{i=1}^{p} \mathrm{var}(X_i)$ is equal to $\lambda_1 + \lambda_2 + \ldots + \lambda_p$, which is equal to the sum $\sum_{i=1}^{p} \mathrm{var}(v_i^T X)$, the sum of the variances of the principal components. In practice often the first few eigenvalues are very much larger than the rest of the eigenvalues. For instance, $\lambda_1 + \lambda_2$ may (approximately) represent 95% of the total variance $\sum_{i=1}^{p} \mathrm{var}(X_i)$. It means that 95% of the variation in the data is contained in just two variables $v_1^T X$ and $v_2^T X$. With only a slight loss of information the vector $X$ may then be reduced to just two variables, $v_1^T X$ and $v_2^T X$. Classification may proceed using just these two variables.
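A Python sketch of the principal components of a known covariance matrix (the matrix below is invented); note that np.linalg.eigh returns eigenvalues in ascending order, so they are reordered to obtain $\lambda_1 \ge \lambda_2 \ge \ldots$

    import numpy as np

    Sigma = np.array([[150.0, 60.0, 15.0],  # hypothetical variance-covariance matrix
                      [ 60.0, 50.0, 10.0],
                      [ 15.0, 10.0,  5.0]])

    eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending eigenvalues, orthonormal eigenvectors
    order = np.argsort(eigvals)[::-1]          # reorder so that lambda_1 >= lambda_2 >= ...
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    share = eigvals / eigvals.sum()            # fraction of the total variance per component
    print("eigenvalues:", eigvals)
    print("cumulative % of variance:", 100 * np.cumsum(share))

    x = np.array([190.0, 125.0, 52.0])         # hypothetical measurement vector
    scores = eigvecs.T @ x                     # the i-th principal component score is v_i^T x
    print("principal component scores:", scores)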
4. Training samples
Up till now we assumed that $\mu_1$, $\mu_2$ and $\Sigma$ are known. How do we estimate (the elements of) $\mu_1$, $\mu_2$ and $\Sigma$? For our example shown in the introduction we have data of two 'training samples'. The elements of $\mu_1$ and $\mu_2$ are estimated in a straightforward way. The first element of $\mu_1$ is the expectation of the variable 'firstjoint' of Chaetocnema Concinna and is estimated by means of the corresponding sample mean $M_1$, etc.
Let us consider estimation of the diagonal elements of the matrix $\Sigma$ for our example, the variances of the variables $X_1$, $X_2$ and $X_3$. For ease of presentation we restrict ourselves to estimation of the variance of $X_1$. If $X_{11}, X_{12}, \ldots, X_{1m}$ represent the measurements of the first sample and $\tilde X_{11}, \tilde X_{12}, \ldots, \tilde X_{1n}$ represent the measurements of the second sample, then

$S_1^2 = \sum_i (X_{1i} - M_1)^2 / (m-1)$ and $S_2^2 = \sum_i (\tilde X_{1i} - \tilde M_1)^2 / (n-1)$

are the two sample variances. These estimators are combined by means of the formula

$S^2 = \frac{m-1}{m+n-2} S_1^2 + \frac{n-1}{m+n-2} S_2^2$

for estimating the variance of $X_1$, the (1,1)-element of $\Sigma$.

For showing how the non-diagonal elements of $\Sigma$ are estimated, let us restrict ourselves to a formula for estimating the (1,2)-element of $\Sigma$, the covariance of $X_1$ and $X_2$. Estimation of this covariance is based on the two sample covariances

$S_{12}^{(1)} = \sum_i (X_{1i} - M_1)(X_{2i} - M_2) / (m-1)$ and $S_{12}^{(2)} = \sum_i (\tilde X_{1i} - \tilde M_1)(\tilde X_{2i} - \tilde M_2) / (n-1)$,

which are combined as follows:

$S_{12} = \frac{m-1}{m+n-2} S_{12}^{(1)} + \frac{n-1}{m+n-2} S_{12}^{(2)}$.

Replacing parameters by estimates we get the actual classification rules based on Fisher's discriminant function, etc.
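A Python sketch of these estimators; sample1 and sample2 are placeholder arrays (only a few rows of the table are shown) standing in for the full group-1 and group-2 training samples.

    import numpy as np

    # placeholder training samples: rows are (firstjoint, secondjoint, aedeagus)
    sample1 = np.array([[191, 131, 53], [185, 134, 50], [200, 137, 52]], dtype=float)
    sample2 = np.array([[186, 107, 49], [211, 122, 49], [201, 114, 47]], dtype=float)

    m, n = len(sample1), len(sample2)
    M1, M2 = sample1.mean(axis=0), sample2.mean(axis=0)    # estimates of mu_1 and mu_2

    S1 = np.cov(sample1, rowvar=False)                     # sample covariance matrix, divisor m-1
    S2 = np.cov(sample2, rowvar=False)                     # sample covariance matrix, divisor n-1
    S = ((m - 1) * S1 + (n - 1) * S2) / (m + n - 2)        # pooled estimate of Sigma

    a_hat = np.linalg.solve(S, M2 - M1)                    # estimated Fisher coefficients

    def classify(x):
        """Classify into group 2 if the estimated (rescaled) Fisher score is positive."""
        return 2 if a_hat @ (x - (M1 + M2) / 2) > 0 else 1

    print(classify(np.array([190.0, 125.0, 52.0])))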
Applying the procedures of SPSS to our example we obtained output; a part of the output is as follows.
Eigenvalues

Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1          3.774(a)     100.0           100.0          .889

a. First 1 canonical discriminant functions were used in the analysis.

Canonical Discriminant Function Coefficients

              Function 1
Firstjoint        -1.00
Secondjoint        .143
Aedeagus           .259
(constant)      -11.123

Unstandardized coefficients
Apparently the data can be reduced to one principal component for purposes of classification. The second table indicates the linear function of the data on which classification will be based.
5. Generalization to three or more populations
The two methods of classification of section 2 can be generalized towards classification into three or more populations. The method based on posterior (conditional) probabilities can be generalized rather easily. If $\pi_3$ is a third population with density $f_3$ then the conditional probability that the individual belongs to $\pi_1$ given the event $X = x$ of section 2 changes into

$P(\pi_1 \mid X = x) = \frac{p_1 f_1(x)}{p_1 f_1(x) + p_2 f_2(x) + p_3 f_3(x)}$

with analogous formulas for $P(\pi_2 \mid X = x)$ and $P(\pi_3 \mid X = x)$. The individual is classified into the population with the highest (estimated) conditional probability.

Generalization of the method of Fisher's discriminant function to three or more populations can be done by defining a Mahalanobis distance for each population. Consider e.g. three populations, $\pi_1$, $\pi_2$ and $\pi_3$, with vectors of expectations $\mu_1$, $\mu_2$ and $\mu_3$ and common variance-covariance matrix $\Sigma$. We define a Mahalanobis distance $\Delta_1$ of $X$ to population $\pi_1$ by means of:

$\Delta_1^2 = (X - \mu_1)^T \Sigma^{-1} (X - \mu_1)$

and distances to $\pi_2$ and $\pi_3$ by means of, respectively, $\Delta_2^2 = (X - \mu_2)^T \Sigma^{-1} (X - \mu_2)$ and $\Delta_3^2 = (X - \mu_3)^T \Sigma^{-1} (X - \mu_3)$. The individual is classified into the population with the smallest (estimated) Mahalanobis distance.
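A Python sketch of classification by the smallest Mahalanobis distance for three populations; the expectation vectors and covariance matrix are illustrative assumptions.

    import numpy as np

    mus = [np.array([183.0, 129.0, 51.0]),  # hypothetical mu_1
           np.array([201.0, 119.0, 49.0]),  # hypothetical mu_2
           np.array([193.0, 124.0, 50.0])]  # hypothetical mu_3
    Sigma = np.array([[150.0, 60.0, 15.0],  # hypothetical common covariance matrix
                      [ 60.0, 50.0, 10.0],
                      [ 15.0, 10.0,  5.0]])
    Sigma_inv = np.linalg.inv(Sigma)

    def classify(x):
        """Return the population (1, 2 or 3) with the smallest squared Mahalanobis distance."""
        d2 = [(x - mu) @ Sigma_inv @ (x - mu) for mu in mus]
        return int(np.argmin(d2)) + 1

    print(classify(np.array([190.0, 125.0, 52.0])))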
6. Assignment
You have to study classification using the famous iris data; these data are contained in the SPSS file irises.sav. There are three species of iris (iris setosa, iris versicolor and iris virginica) and four measurements of flower parts (sepal length, sepal width, petal length and petal width, in centimeters).
Part A
First study classification of an iris if you consider only two species, two samples. If the
number of characters of your last name is a multiple of 3 ( 3,6,9,12,... ) then consider
group 1 and group 2. If the number of characters of your last name is a multiple of 3 plus
1 ( 4,7,10,13,... ) then consider group 1 and group 3. If the number of characters of your
last name is a multiple of 3 plus 2 ( 5,8,11,14,... ) then consider group 2 and group 3. Use
SELECT CASES (see file About SPSS) for this first part of the assignment.
Use the procedures of SPSS for classification (see below for how to invoke them), using
all four variables. Check that only one discriminant function (principal component)
suffices for classification. Count the misclassifications. Make histograms of the first
discriminant function for each group separately (use SPLIT FILE, see file About SPSS).
Estimate the misclassification probabilities and compare these with the counts.
Part B
Secondly, study classification of an iris if you consider three species, i.e. all three groups.
Find out how many principal components are used for classification now and find the
corresponding percentage of the total variance. Observe how many irises are
misclassified in the training samples. For each misclassification in the training samples
specify the true population and the predicted population. Use the discriminant scores
(function 1 and 2) to locate the misclassifications in the territorial map.
Now study classification if only the first discriminant function (principal component) is
used for classification. Make histograms of the first discriminant function for each group
separately. Describe how irises should be classified and estimate the corresponding
misclassification probabilities.
Part C
In Part B you invented a classification rule based on the first discriminant function. Show that this classification rule gets worse if you base it on one of the four original measurements (sepal length, sepal width, petal length and petal width) instead of the first discriminant function. During discussions with the teacher only the results of the best original measurement need to be presented and compared with the results based on the first discriminant function.
How to use SPSS.
Before you start: consult the text of the file About SPSS.doc. After you have started SPSS and opened the file irises.sav (open an existing data source), choose ANALYZE, CLASSIFY and DISCRIMINANT… one after another. Choose group as dependent and don't forget to define the range (1 to 3). Enter the 'independents'.
Before you actually start the SPSS discriminant analysis you have to do additional steps
if you need additional information:
For casewise results and the territorial map click on the button CLASSIFY and choose
CASEWISE RESULTS and TERRITORIAL MAP.
For saving the values of the discriminant functions click on the button SAVE and
choose DISCRIMINANT SCORES.
For SELECT CASES and SPLIT FILE we refer to the file About SPSS.
You don’t have to write a report for this assignment. Just take your computer output
and your own notes with you for an (oral) discussion with the teacher about your results.
For making an appointment for the assignment: send an e-mail
(k.poortema@ewi.utwente.nl) or ring (053)4893379.