Medical Statistics & Epidemiology (153133)
Discriminant Analysis (week 51)
October 2008 (K.P.)

CONTENTS
1 Introduction
2 Classifying if population parameters are known
3 Principal Components
4 Training Samples
5 Generalization to three or more populations
6 Assignment

1. Introduction

We consider the problem of classifying an individual into one of a number of categories. We start with the problem of classifying an individual into one of two categories. Later on, in section 5, we explain how classification into one of three or more categories can be handled. The theory of classifying into one of two categories is illustrated by means of the data contained in the following table.

firstjoint  secondjoint  aedeagus  group
   191          131         53       1
   185          134         50       1
   200          137         52       1
   173          127         50       1
   171          118         49       1
   160          118         47       1
   188          134         54       1
   186          129         51       1
   174          131         52       1
   163          115         47       1
   190          143         52       1
   174          131         50       1
   201          130         51       1
   190          133         53       1
   182          130         51       1
   184          131         51       1
   177          127         49       1
   178          126         53       1
   210          140         54       1
   182          121         51       1
   186          136         56       1
   186          107         49       2
   211          122         49       2
   201          114         47       2
   242          131         54       2
   184          108         43       2
   211          118         51       2
   217          122         49       2
   223          127         51       2
   208          125         50       2
   199          124         46       2
   211          129         49       2
   218          126         49       2
   203          122         49       2
   192          116         49       2
   195          123         47       2
   211          122         48       2
   187          123         47       2
   192          109         46       2
   223          124         53       2
   188          114         48       2
   216          120         50       2
   185          114         46       2
   178          119         47       2
   187          111         49       2
   187          112         49       2
   201          130         54       2
   187          120         47       2
   210          119         50       2
   196          114         51       2
   195          110         49       2
   187          124         49       2

The data are measurements made on two species of Chaetocnema. The two species are Chaetocnema Concinna (group 1) and Chaetocnema Heikertlingeri (group 2). The variable x1 (firstjoint) is the width of the first joint of the first tarsus in microns, the variable x2 (secondjoint) is the corresponding width of the second joint, and x3 (aedeagus) is the maximal width of the aedeagus in the fore-part, also in microns. The two species are hard to distinguish. Our problem is how to classify a (new) individual into one of the species mentioned, using the three measurements, when we already assume that the individual belongs to either Chaetocnema Concinna or Chaetocnema Heikertlingeri. In practical classification problems there are usually more variables than in our example; purely for ease of presentation we restrict ourselves to a data set with just three variables.

In the classical classification problem the following assumptions are made. If X = (X1, X2, ..., Xp)ᵀ represents the (column) vector of measurements of an individual, then X is distributed according to a multivariate normal distribution with a vector of expectations μ and a variance-covariance matrix Σ.

First of all: what do we mean by a multivariate normal distribution? In textbooks you can find different (but equivalent) definitions. In this document we use the following definition of a multivariate normal distribution: a stochastic vector X = (X1, X2, ..., Xp)ᵀ has a multivariate normal distribution if each linear function aᵀX = a1 X1 + a2 X2 + ... + ap Xp has a (one-dimensional) normal distribution, as long as the vector a = (a1, a2, ..., ap)ᵀ differs from the null vector. Each element μi of the vector μ is the expectation of the corresponding element of X: μi = E(Xi). If σij denotes the (i, j)-element of the matrix Σ, then σij is the covariance of Xi and Xj: σij = cov(Xi, Xj). Note that the diagonal elements σii are variances: σii = cov(Xi, Xi) = var(Xi).
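To make this definition concrete, here is a small numerical sketch in Python/NumPy. The vector μ, the matrix Σ and the coefficient vector a below are invented for illustration only (they are not estimated from the beetle data). The sketch draws a large sample from a trivariate normal distribution and checks empirically that the linear function aᵀX has mean aᵀμ and variance aᵀΣa; its normality could be checked further with a histogram.

    import numpy as np

    # Hypothetical parameters, chosen only for illustration
    mu = np.array([180.0, 125.0, 50.0])        # vector of expectations
    Sigma = np.array([[180.0, 60.0, 15.0],     # variance-covariance matrix
                      [ 60.0, 70.0, 10.0],
                      [ 15.0, 10.0,  6.0]])
    a = np.array([1.0, -2.0, 0.5])             # coefficients of the linear function a'X

    rng = np.random.default_rng(1)
    X = rng.multivariate_normal(mu, Sigma, size=100000)  # each row is one draw of X
    Y = X @ a                                            # the linear function a'X

    print("theoretical mean and variance:", a @ mu, a @ Sigma @ a)
    print("empirical mean and variance  :", Y.mean(), Y.var(ddof=1))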
In general one classifies an individual into one of two populations, which are called π1 and π2. In our example π1 refers to Chaetocnema Concinna and π2 refers to Chaetocnema Heikertlingeri. It is assumed that the vector X of the individual that has to be classified has vector of expectations μ1 if the individual belongs to population π1. If the individual belongs to π2, then X has vector of expectations μ2. Furthermore it is assumed that the two populations share a common variance-covariance matrix Σ with respect to the multivariate normal distribution of X.

2. Classifying if population parameters are known

We first explain two rules you can use for classification: one using Fisher's discriminant function and one using posterior probabilities. We start with Fisher's discriminant function. In this section we assume that μ1, μ2 and Σ are known.

Imagine that p = 1 holds. Then the situation is rather simple. You have one variable X which has a normal distribution. If the individual belongs to π1, then X has expectation ν1. In case of π2 the expectation of X is ν2. Irrespective of the population the variance is σ². The classification rule is simple: classify the individual into population π1 if the distance of X to ν1 is smaller than the distance of X to ν2. In fact this classification problem depends on only one parameter,

    Δ = |ν1 − ν2| / σ .

The misclassification probabilities only depend on the parameter Δ. Let us express the probability that the individual is classified into population π2 whereas in reality the individual belongs to π1. Assuming that ν2 > ν1, this probability is equal to

    P( X ≥ ½(ν1 + ν2) ) = P( Z ≥ ( ½(ν1 + ν2) − ν1 ) / σ )
                        = P( Z ≥ (ν2 − ν1) / (2σ) )
                        = P( Z ≥ Δ/2 ) = 1 − Φ(Δ/2) = Φ(−Δ/2),

with Z = (X − ν1)/σ having the standard normal distribution and Φ(t) = P(Z ≤ t) being the cumulative distribution function of the standard normal distribution.

Let us now express the other misclassification probability: the probability that the individual is classified into population π1 whereas in reality the individual belongs to π2. Still assuming ν2 > ν1, this probability is

    P( X ≤ ½(ν1 + ν2) ) = P( Z ≤ ( ½(ν1 + ν2) − ν2 ) / σ )
                        = P( Z ≤ −(ν2 − ν1) / (2σ) )
                        = P( Z ≤ −Δ/2 ) = Φ(−Δ/2),

with now Z = (X − ν2)/σ having the standard normal distribution. So we conclude that both misclassification probabilities are equal to Φ(−Δ/2). Verify that these probabilities do not change if we assume ν2 < ν1. Note that the misclassification probabilities decrease as Δ increases.

Let us return to p > 1. So we have to use a stochastic vector X for purposes of classification. Fisher's discriminant function is based on the best reduction to the case p = 1 by means of a linear function aᵀX. This linear function aᵀX has a normal distribution because we assume that X has a multivariate normal distribution. Let ν1 denote the expectation of aᵀX if the individual belongs to π1, and let ν2 be the expectation in case of π2. The linear function aᵀX has a variance σ² which does not depend on the population. Note that the three parameters ν1, ν2 and σ² all depend on the vectors μ1, μ2, a and the matrix Σ, and that μ1, μ2 and Σ are (assumed) known. For Fisher's discriminant function we choose the vector a that maximizes

    Δ = |ν1 − ν2| / σ .

Without proof we present the solution. The best choice for a is given by a = Σ⁻¹(μ2 − μ1), and hence Fisher's discriminant function is (μ2 − μ1)ᵀ Σ⁻¹ X. Sometimes Fisher's discriminant function is rescaled to (μ2 − μ1)ᵀ Σ⁻¹ X − ½(ν1 + ν2). For this optimal choice of a the parameter Δ is called the Mahalanobis distance. It can be shown that the Mahalanobis distance is given by

    Δ² = (μ2 − μ1)ᵀ Σ⁻¹ (μ2 − μ1).
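The quantities introduced above are easy to compute numerically. The following sketch (Python with NumPy and SciPy; the parameters μ1, μ2 and Σ are invented for illustration and are not the parameters of the beetle example) computes the coefficient vector a = Σ⁻¹(μ2 − μ1), the Mahalanobis distance Δ, the common misclassification probability Φ(−Δ/2) and the rescaled discriminant score of one new observation x.

    import numpy as np
    from scipy.stats import norm

    # Hypothetical population parameters (for illustration only)
    mu1 = np.array([185.0, 128.0, 51.0])
    mu2 = np.array([200.0, 119.0, 49.0])
    Sigma = np.array([[170.0, 50.0, 15.0],
                      [ 50.0, 60.0, 10.0],
                      [ 15.0, 10.0,  5.0]])

    a = np.linalg.solve(Sigma, mu2 - mu1)       # a = Sigma^{-1} (mu2 - mu1)
    Delta = np.sqrt((mu2 - mu1) @ a)            # Mahalanobis distance
    p_mis = norm.cdf(-Delta / 2)                # misclassification probability Phi(-Delta/2)

    x = np.array([190.0, 125.0, 50.0])          # a new individual to be classified
    score = a @ x - 0.5 * (a @ mu1 + a @ mu2)   # rescaled Fisher discriminant function
    group = 1 if score < 0 else 2               # a'x below the midpoint of nu1 and nu2 -> pi1
    print(Delta, p_mis, score, group)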
Let us now turn to classification using posterior probabilities. Let p1 denote the (prior) probability that the individual belongs to population π1 and let p2 denote the probability that the individual belongs to π2. If the individual belongs to π1, the vector X has a multivariate normal distribution with vector of expectations μ1 and variance-covariance matrix Σ; let f1(x) denote the corresponding (multivariate) density function. Let f2(x) represent the (multivariate) density function in case of π2. Consider an individual which we have to classify and assume that the outcome of X is x. The conditional probability that the individual belongs to π1 given the event X = x is (using Bayes' formula)

    P(π1 | X = x) = p1 f1(x) / ( p1 f1(x) + p2 f2(x) ).

Furthermore, the conditional probability that the individual belongs to π2 given the event X = x is

    P(π2 | X = x) = p2 f2(x) / ( p1 f1(x) + p2 f2(x) ).

In this context the conditional probabilities are often called posterior probabilities. We classify the individual into the population with the highest posterior probability. In many cases there is no information with which to estimate the prior probabilities pi; then equal prior probabilities are often assumed.

3. Principal Components

We noticed that the discriminant analysis procedure of SPSS first applies a data reduction technique: principal components. So we proceed by explaining principal components. In this section we still assume that the variance-covariance matrix Σ is known.

Consider a stochastic vector X with a large dimension p. In practice the vector may contain many characteristics that are highly correlated. Using principal components the large vector X is replaced by a few linear functions a1ᵀX, a2ᵀX, ..., arᵀX (the principal components) that are independent. The vectors ai are chosen to have norm 1 and to maximize the variances var(aiᵀX).

For a variance-covariance matrix Σ it can be shown that all eigenvalues are non-negative real numbers. Here we assume, furthermore, that the eigenvalues of Σ are strictly positive and all different. Let λ1, λ2, ..., λp represent the eigenvalues of Σ. Assume λ1 > λ2 > ... > λp and let the vectors v1, v2, ..., vp represent the corresponding eigenvectors, taken to be orthonormal (the vectors are orthogonal and their norms are all equal to 1). The principal components are as follows. The first principal component is v1ᵀX and its variance is λ1; the second principal component (which is independent of v1ᵀX) is v2ᵀX and its variance is λ2, and so on. It can be proven that the sum of variances

    var(X1) + var(X2) + ... + var(Xp)

is equal to λ1 + λ2 + ... + λp, which in turn is equal to

    var(v1ᵀX) + var(v2ᵀX) + ... + var(vpᵀX),

the sum of the variances of the principal components. In practice the first few eigenvalues are often very much larger than the remaining ones. For instance, λ1 + λ2 may (approximately) represent 95% of the total variance λ1 + λ2 + ... + λp. It then means that 95% of the variation of the data is contained in just two variables, v1ᵀX and v2ᵀX. With only a slight loss of information the vector X may then be reduced to just these two variables, and classification may proceed using them alone.
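In practice the eigenvalue decomposition is done numerically. The short sketch below (Python/NumPy; the matrix Σ is again an invented example, not taken from any data set) computes the eigenvalues and orthonormal eigenvectors of a covariance matrix, the cumulative proportion of the total variance accounted for by the first principal components, and the principal component scores of one observation x.

    import numpy as np

    # Hypothetical variance-covariance matrix (illustration only)
    Sigma = np.array([[170.0, 50.0, 15.0],
                      [ 50.0, 60.0, 10.0],
                      [ 15.0, 10.0,  5.0]])

    # eigh handles symmetric matrices; it returns the eigenvalues in increasing
    # order, with the orthonormal eigenvectors in the columns of V
    lam, V = np.linalg.eigh(Sigma)
    lam, V = lam[::-1], V[:, ::-1]              # reorder so that lambda_1 >= lambda_2 >= ...

    explained = np.cumsum(lam) / lam.sum()      # cumulative proportion of total variance
    x = np.array([190.0, 125.0, 50.0])          # one observation
    scores = V.T @ x                            # principal component scores v_i' x
    print(lam, explained, scores)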
4. Training samples

Up till now we assumed that μ1, μ2 and Σ are known. How do we estimate (the elements of) μ1, μ2 and Σ? For the example shown in the introduction we have the data of two 'training samples'. The elements of μ1 and μ2 are estimated in a straightforward way: the first element of μ1 is the expectation of the variable 'firstjoint' for Chaetocnema Concinna and is estimated by means of the corresponding sample mean M1, and so on.

Let us consider estimation of the diagonal elements of the matrix Σ for our example, the variances of the variables X1, X2 and X3. For ease of presentation we restrict ourselves to estimation of the variance of X1. If X11, X12, ..., X1m represent the measurements of the first sample and X̃11, X̃12, ..., X̃1n represent the measurements of the second sample, then

    S1² = Σi (X1i − M1)² / (m − 1)   and   S2² = Σi (X̃1i − M̃1)² / (n − 1)

are the two sample variances (with M1 and M̃1 the two sample means). These estimators are combined by means of the formula

    S² = (m − 1)/(m + n − 2) · S1² + (n − 1)/(m + n − 2) · S2²

for estimating the variance of X1, the (1,1)-element of Σ. To show how the non-diagonal elements of Σ are estimated, we restrict ourselves to a formula for estimating the (1,2)-element of Σ, the covariance of X1 and X2. Estimation of this covariance is based on the two sample covariances

    S12(1) = Σi (X1i − M1)(X2i − M2) / (m − 1)   and   S12(2) = Σi (X̃1i − M̃1)(X̃2i − M̃2) / (n − 1),

which are combined as follows:

    S12 = (m − 1)/(m + n − 2) · S12(1) + (n − 1)/(m + n − 2) · S12(2).

Replacing parameters by their estimates we get the actual classification rules based on Fisher's discriminant function, and so on.

Applying the discriminant procedure of SPSS to our example we obtained output, part of which is as follows.

Eigenvalues
Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
   1        3.774(a)        100.0           100.0              .889
a. First 1 canonical discriminant functions were used in the analysis.

Canonical Discriminant Function Coefficients
               Function 1
firstjoint       -1.00
secondjoint        .143
aedeagus           .259
(constant)      -11.123
Unstandardized coefficients

Apparently the data can be reduced to one principal component for purposes of classification. The second table indicates the linear function of the data on which classification will be based.

5. Generalization to three or more populations

The two methods of classification of section 2 can be generalized to classification into three or more populations. The method based on posterior (conditional) probabilities can be generalized rather easily. If π3 is a third population with density f3, then the conditional probability of section 2 that the individual belongs to π1 given the event X = x changes into

    P(π1 | X = x) = p1 f1(x) / ( p1 f1(x) + p2 f2(x) + p3 f3(x) ),

with analogous formulas for P(π2 | X = x) and P(π3 | X = x). The individual is classified into the population with the highest (estimated) conditional probability.

Generalization of the method of Fisher's discriminant function to three or more populations can be done by defining a Mahalanobis distance for each population. Consider, for example, three populations π1, π2 and π3, with vectors of expectations μ1, μ2 and μ3 and common variance-covariance matrix Σ. We define a Mahalanobis distance Δ1 of X to population π1 by means of

    Δ1² = (X − μ1)ᵀ Σ⁻¹ (X − μ1)

and distances to π2 and π3 by means of, respectively,

    Δ2² = (X − μ2)ᵀ Σ⁻¹ (X − μ2)   and   Δ3² = (X − μ3)ᵀ Σ⁻¹ (X − μ3).

The individual is classified into the population with the smallest (estimated) Mahalanobis distance.
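As a small numerical illustration of this rule, the sketch below (Python/NumPy; the three mean vectors and the common matrix Σ are invented toy values) computes the three squared Mahalanobis distances for one observation x and classifies it into the population with the smallest distance.

    import numpy as np

    # Hypothetical parameters for three populations with a common covariance matrix
    mus = [np.array([0.0, 0.0]),
           np.array([3.0, 1.0]),
           np.array([1.0, 4.0])]
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    Sigma_inv = np.linalg.inv(Sigma)

    x = np.array([2.0, 2.0])                    # observation to be classified

    # squared Mahalanobis distance to each population
    d2 = [(x - mu) @ Sigma_inv @ (x - mu) for mu in mus]
    group = int(np.argmin(d2)) + 1              # populations are numbered 1, 2, 3
    print(d2, group)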
6. Assignment

You have to study classification using the famous iris data; these data are contained in the SPSS file irises.sav. There are three species of iris (iris setosa, iris versicolor and iris virginica) and four measurements of flower parts (sepal length, sepal width, petal length and petal width, in centimeters).

Part A
First study classification of an iris if you consider only two species, i.e. two samples.
If the number of characters of your last name is a multiple of 3 (3, 6, 9, 12, ...), then consider group 1 and group 2.
If the number of characters of your last name is a multiple of 3 plus 1 (4, 7, 10, 13, ...), then consider group 1 and group 3.
If the number of characters of your last name is a multiple of 3 plus 2 (5, 8, 11, 14, ...), then consider group 2 and group 3.
Use SELECT CASES (see the file About SPSS) for this first part of the assignment. Use the SPSS procedures for classification (see below for how to invoke them), using all four variables. Check that only one discriminant function (principal component) suffices for classification. Count the misclassifications. Make histograms of the first discriminant function for each group separately (use SPLIT FILE, see the file About SPSS). Estimate the misclassification probabilities and compare these with the counts.

Part B
Secondly, study classification of an iris if you consider all three species, i.e. all three groups. Find out how many principal components are used for classification now and find the corresponding percentage of the total variance. Observe how many irises are misclassified in the training samples. For each misclassification in the training samples specify the true population and the predicted population. Use the discriminant scores (function 1 and 2) to locate the misclassifications in the territorial map. Now study classification if only the first discriminant function (principal component) is used for classification. Make histograms of the first discriminant function for each group separately. Describe how irises should be classified and estimate the corresponding misclassification probabilities.

Part C
In Part B you devised a classification rule based on the first discriminant function. Show that this classification rule gets worse if you base it on one of the four original measurements (sepal length, sepal width, petal length or petal width) instead of on the first discriminant function. During the discussion with the teacher only the results for the best original measurement need to be presented and compared with the results based on the first discriminant function.

How to use SPSS
Before you start: consult the text of the file About SPSS.doc. After you have started SPSS and opened the file irises.sav (open an existing data source), choose ANALYZE, CLASSIFY and DISCRIMINANT... one after another. Choose group as the grouping variable and don't forget to define the range (1 to 3). Enter the 'independents'. Before you actually start the SPSS discriminant analysis you have to take some additional steps if you need additional information.
For casewise results and the territorial map, click on the button CLASSIFY and choose CASEWISE RESULTS and TERRITORIAL MAP.
For saving the values of the discriminant functions, click on the button SAVE and choose DISCRIMINANT SCORES.
For SELECT CASES and SPLIT FILE we refer to the file About SPSS.

You don't have to write a report for this assignment. Just take your computer output and your own notes with you to an (oral) discussion with the teacher about your results. To make an appointment for the assignment: send an e-mail (k.poortema@ewi.utwente.nl) or ring (053) 4893379.
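For those who would like to check their SPSS results independently, a comparable linear discriminant analysis of the iris data can be run in Python with scikit-learn, as in the sketch below. This is an optional cross-check only, not part of the assignment, and the iris data shipped with scikit-learn may be ordered or coded slightly differently from irises.sav.

    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    iris = load_iris()                          # sepal/petal lengths and widths, 3 species
    X, y = iris.data, iris.target

    lda = LinearDiscriminantAnalysis().fit(X, y)
    scores = lda.transform(X)                   # discriminant scores (two functions)
    pred = lda.predict(X)                       # classification of the training samples

    print("proportion of variance per function  :", lda.explained_variance_ratio_)
    print("misclassified in the training samples:", int((pred != y).sum()))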