Projection Pursuit Method for Small Sample Size with a Large Number of Variables

Eun-kyung Lee (1) and Dianne Cook (2)

(1) Department of Statistics, Ewha Womans University, 11-1 Daehyun-dong, Seodaemun-gu, Seoul, 120-750, Korea
(2) Department of Statistics, Iowa State University, Ames, IA 50011, USA

Summary

In high-dimensional data, one often seeks a few interesting low-dimensional projections which reveal important aspects of the data. Projection pursuit for exploratory supervised classification is used to find separable class structure. Even though the projection pursuit method can bypass the curse of dimensionality, when the number of observations is small relative to the number of variables, the class structure of the optimal projection can be seriously biased. In this situation, most classical multivariate analysis methods have problems. We discuss how sample size and dimensionality are related, and we propose a new projection pursuit index that penalizes the projection coefficients and overcomes the problem of small sample size.

Keywords: The curse of dimensionality; Gene expression data analysis; Multivariate data; Penalized discriminant analysis; Projection pursuit

1 Introduction

This paper is about the exploratory data analysis of small samples with a large number of variables, especially in supervised classification. If the classifier obtained for a given training set is inadequate, it is natural to consider adding new variables, particularly ones that will help separate the difficult-to-classify cases. If the new variables provide any additional information, the performance of the classifier should improve. Unfortunately, beyond a certain point, additional variables make the classifier worse. The problem arises when the sample size is small or the variables are highly correlated. When the training set is small relative to the number of variables, the statistical parameters estimated from it are inaccurate and unstable, and a quite different classifier may be obtained when a different training set is used.

A small sample with a very large number of variables is the typical situation in gene expression data analysis. In this paper, we focus on leukemia data with two types of leukemia, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). This data set consists of 25 cases of AML and 47 cases of ALL (38 cases of B-cell ALL and 9 cases of T-cell ALL). After preprocessing, we have 3571 human genes (Golub et al., 1999).

In Section 2, we discuss how sample size and dimensionality are related and how they affect supervised classification. Section 3 introduces a new projection pursuit method for small sample sizes with large numbers of variables, describes its properties, and applies the new projection index to the leukemia data. In Section 4, we explain how this new projection pursuit index can be applied to gene selection and compare it to other gene selection methods.

2 Problems of high dimensionality

To see how the number of variables affects classification methods that use separating hyperplanes, we investigate linear discriminant analysis with the leukemia data (n = 72 and p = 3571). For gene selection, we use the ratio of between-group to within-group sums of squares,

    BW(j) = \frac{\sum_{i=1}^{n} \sum_{k=1}^{g} I(y_i = k)\,(\bar{x}_{k,j} - \bar{x}_{\cdot,j})^2}{\sum_{i=1}^{n} \sum_{k=1}^{g} I(y_i = k)\,(x_{i,j} - \bar{x}_{k,j})^2}    (1)

where \bar{x}_{\cdot,j} = (1/n) \sum_{i=1}^{n} x_{i,j} and \bar{x}_{k,j} = \sum_{i=1}^{n} I(y_i = k)\,x_{i,j} \,/\, \sum_{i=1}^{n} I(y_i = k). First, we sample a 2/3 training set (n_train = 48) and calculate the BW value for each gene using this training set.
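As a concrete illustration, here is a minimal R sketch of equation (1). It assumes X is an n × p expression matrix and y a vector of class labels; the function name bw.ratio is ours, not from the paper or the classPP package.

```r
# Sketch of the BW ratio in equation (1): for each gene j, the ratio of
# between-group to within-group sums of squares, computed for all genes.
bw.ratio <- function(X, y) {
  overall.mean <- colMeans(X)                  # xbar_{.,j} for each gene j
  between <- within <- numeric(ncol(X))
  for (k in unique(y)) {
    Xk <- X[y == k, , drop = FALSE]
    class.mean <- colMeans(Xk)                 # xbar_{k,j}
    between <- between + nrow(Xk) * (class.mean - overall.mean)^2
    within  <- within  + colSums(sweep(Xk, 2, class.mean)^2)
  }
  between / within                             # BW(j), j = 1, ..., p
}
```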
We then select the p genes with the largest BW values. Using the training set with these p variables, we build a classifier with linear discriminant analysis (LDA) and compute the training error and the test error. We repeat this 200 times. The medians and upper quartiles of the training and test errors (counts of misclassified cases) are summarized in Table 1. For all values of p, the training errors are almost 0; that is, the training sets are perfectly separated regardless of p. But the test errors increase as p increases; as p approaches n, the test error gets worse.

Table 1. Training and test errors for various numbers of variables (Q2: median, Q3: upper quartile).

                 True class                     Permuted class
         Training error   Test error     Training error   Test error
    p      Q2     Q3       Q2     Q3       Q2     Q3       Q2     Q3
    40      0      0        4      5        0      0       15     16
    30      0      0        2      3        1      1       14     16
    20      0      0        1      2        3      4       14     15.25
    10      0      0        1      2        7      9       14     15

It gets more interesting when we scramble the class ids by permutation, so that any class separation is spurious. Table 1 also shows the results of the same procedure using the permuted classes. We might suspect that a classifier will not accurately separate these spurious classes. Surprise! When p = 40, the training error is 0. This result can be explained by the capacity of a separating hyperplane. When the training set has p = 40 and n = 48, the probability that n sample points in p-dimensional space are linearly separable is close to 1 (Ripley, 1996). Therefore a separating hyperplane exists purely by chance. When p gets smaller, the training error with the permuted classes gets larger. The test errors are consistently bad, independent of p.

These results can be seen visually using the LDA projection pursuit index (Lee, 2003). Figure 1 shows the 2-dimensional optimal projections obtained with the LDA PP index on the training set, for both the true classes (top row) and the permuted classes (bottom row). The digits 1, 2, and 3 represent AML, B-cell ALL, and T-cell ALL in the training set, and the symbols ∘, △, and + represent AML, B-cell ALL, and T-cell ALL in the test set. After finding the 2-dimensional optimal projection with the training set, we project both the training and test sets onto this optimal projection. For the true classes, the training set with p = 40 is more separable and has smaller within-class variance than with p = 10, but the test set shows quite different group means and much larger within-class variance; notice that the test set is not separable on this projection. When p is smaller, the training set has larger within-class variance and the test set has structure more similar to the training set's. For the permuted classes, when p = 40 we can find separated class structure in the training set, and when p = 30 the training set is still separated.

Figure 1. LDA 2-dimensional optimal projections of the training and test sets for various numbers of variables, with the true classes (a-d: p = 40, 30, 20, 10) and the permuted classes (e-h: p = 40, 30, 20, 10). Training set: 1 = AML, 2 = B-cell ALL, 3 = T-cell ALL; test set: ∘ = AML, △ = B-cell ALL, + = T-cell ALL.
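The resampling experiment behind Table 1 can be sketched as below, reusing bw.ratio() from above and MASS::lda as a stand-in for the authors' LDA code; error counts (not rates) are returned, matching the table. For the permuted-class columns, pass sample(y) in place of y.

```r
library(MASS)

# Draw a 2/3 training set, rank genes by BW on the training set only, fit
# LDA on the top p.keep genes, and count train/test misclassifications.
simulate.errors <- function(X, y, p.keep, n.rep = 200) {
  errs <- replicate(n.rep, {
    train <- sample(nrow(X), round(2 * nrow(X) / 3))
    keep  <- order(bw.ratio(X[train, ], y[train]),
                   decreasing = TRUE)[1:p.keep]
    fit <- lda(X[train, keep], grouping = y[train])
    c(train = sum(predict(fit, X[train, keep])$class != y[train]),
      test  = sum(predict(fit, X[-train, keep])$class != y[-train]))
  })
  apply(errs, 1, quantile, probs = c(0.5, 0.75))  # Q2 (median) and Q3
}
```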
As p gets smaller, the class structure in the training set weakens. For all p, the test sets do not reveal any class structure. These results agree with the LDA errors in Table 1. From them we can conclude that when p is large, the LDA classifier is biased too strongly toward the training set; therefore we need to choose the number of variables carefully.

Many classical multivariate analysis methods need to calculate the inverse of a covariance matrix. If n < p + 1 or the variables are highly correlated, the estimate Σ̂ will be close to singular, which results in numerical instability when calculating the inverse. It is then necessary to estimate the variance-covariance matrix differently. If there is prior information about this covariance, we can use a Bayesian or pseudo-Bayesian estimate Σ̃ = (1 − λ)Σ̂ + λΩ, where Ω is a matrix pre-determined from prior information or assumptions. If Ω is diagonal, it will help avoid numerical problems. Under the extreme assumption that all variables are independent, we can use Ω = diag(Σ̂). Even though this assumption is incorrect, the resulting heuristic estimate can perform better than the MLE.

We investigate LDA from this point of view. LDA finds a projection a by maximizing a^T B a / a^T W a, where B is the between-class covariance matrix and W is the within-class covariance matrix. When the sample size is small and the number of variables is large, LDA is usually too flexible, and a^T W a can be very small for some a. This causes the data piling problem (Marron et al., 2002). Figure 2 shows the optimal 1-dimensional and 2-dimensional projected data using the LDA index for the leukemia data with n = 72 and p = 3571. In the 1-dimensional projection, all data points in each class are projected onto almost a single point, and all classes have very small within-class variances. In the 2-dimensional projection, each class lies on a line, and the two variables of the projected data have a perfect linear relationship.

Figure 2. Leukemia data: 1D and 2D optimal projections using the LDA index (p = 3571). (a) Histogram of the 1D optimal projected data. (b) Plot of the 2D optimal projected data. 1 = AML, 2 = B-cell ALL, 3 = T-cell ALL.

To escape this data piling problem, a penalty on the projection coefficients is considered. Penalized discriminant analysis (PDA; Hastie et al., 1995) is a generalization of LDA that incorporates prior information as a roughness penalty. PDA finds a projection a by maximizing a^T B a / a^T (W + λΩ) a. In PDA, the pre-determined matrix Ω keeps the within-group structure of the projected data from degenerating too much. That is, the optimal projection from PDA has larger within-class variance than the optimal projection from the LDA PP index, with the amount depending on the size of λ and the choice of Ω. We extend this idea to a new projection pursuit method.

3 PDA projection pursuit index

3.1 Index definition

We propose a new projection pursuit index which extends the LDA PP index (Lee, 2003). The main purposes are to (1) prevent the problems caused by a small number of observations and a large number of variables, and (2) find projections that show class separations in a reasonable manner. We use Σ̃(λ) = (1 − λ)Σ̂ + λ diag(Σ̂) as our estimate of the variance-covariance matrix. As λ increases, Σ̃ tends to diag(Σ̂). When the data are standardized, it reduces to Σ̃(λ) = (1 − λ)Σ̂ + λI.
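A small R sketch of this shrunken estimate follows; with standardized data diag(S) is the identity, so the estimate reduces to (1 − λ)Σ̂ + λI as stated above.

```r
# Shrink the sample covariance toward its diagonal (independence) target:
# Sigma.tilde(lambda) = (1 - lambda) * S + lambda * diag(S).
shrink.cov <- function(X, lambda) {
  S <- cov(X)
  (1 - lambda) * S + lambda * diag(diag(S))
}
```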
Let X_{ij} be the p-dimensional vector of the jth observation in the ith class, i = 1, ..., g, j = 1, ..., n_i, where g is the number of classes, n_i is the number of observations in class i, and n = \sum_{i=1}^{g} n_i. Let \bar{X}_{i.} = (1/n_i) \sum_{j=1}^{n_i} X_{ij} be the ith class mean and \bar{X}_{..} = (1/n) \sum_{i=1}^{g} \sum_{j=1}^{n_i} X_{ij} be the total mean. For convenience, we assume that the X_{ij} are standardized. Let

    B = \sum_{i=1}^{g} n_i (\bar{X}_{i.} - \bar{X}_{..})(\bar{X}_{i.} - \bar{X}_{..})^T : between-class sums of squares,

    W = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i.})(X_{ij} - \bar{X}_{i.})^T : within-class sums of squares.

Here B + W = nΣ̂, and Σ̂ is the correlation matrix. Then the PDA projection pursuit index is

    I_{PDA}(A, λ) = 1 - \frac{\left| A^T \{(1-λ)W + nλI_p\} A \right|}{\left| A^T \{(1-λ)(B + W) + nλI_p\} A \right|}    (2)

where A is an orthonormal projection onto k-dimensional space and λ ∈ [0, 1) is a predetermined parameter. Let B_λ = (1 − λ)B and W_λ = (1 − λ)W + nλI_p. Then the PDA index has the same form as the LDA index, and when λ = 0 the PDA index coincides with the LDA index.

Proposition 1. Let Σ_λ = (1 − λ)(B + W) + nλI_p = B_λ + W_λ. Then

    1 - \prod_{i=1}^{k} λ_i \;\le\; I_{PDA}(A, λ) \;\le\; 1 - \prod_{i=p-k+1}^{p} λ_i    (3)

where λ_1 ≥ λ_2 ≥ ... ≥ λ_p ≥ 0 are the eigenvalues of Σ_λ^{-1/2} W_λ Σ_λ^{-1/2}; e_1, e_2, ..., e_p are the corresponding eigenvectors of Σ_λ^{-1/2} W_λ Σ_λ^{-1/2}; and f_1, f_2, ..., f_p are the eigenvectors of Σ_λ^{-1/2} B_λ Σ_λ^{-1/2}. In (3), the right equality holds when A = Σ_λ^{-1/2}[e_p e_{p-1} ... e_{p-k+1}] = Σ_λ^{-1/2}[f_1 f_2 ... f_k], and the left equality holds when A = Σ_λ^{-1/2}[e_k e_{k-1} ... e_1] = Σ_λ^{-1/2}[f_{p-k+1} f_{p-k+2} ... f_p].

The proof of this proposition is the same as that of Proposition 1 in Lee et al. (2004).

To explain the difference between the PDA and LDA indices, we use principal components. Let

    nΣ̂ = B + W = QDQ^T = \sum_{i=1}^{p} d_i q_i q_i^T,    (4)

where Q = [q_1, q_2, ..., q_p] is the eigenvector matrix of nΣ̂, D = diag(d_1, d_2, ..., d_p) holds the eigenvalues of nΣ̂, d_i = 0 for all i = r + 1, ..., p, and rank(nΣ̂) = r. Then tr(nΣ̂) = tr(Σ_λ) = np, and

    Σ_λ = \sum_{i=1}^{p} \{(1-λ)d_i + nλ\} q_i q_i^T.    (5)

Therefore nΣ̂ and Σ_λ have the same principal component directions and the same total variance. The difference between these two variance matrices is the proportion of total variance attributed to each principal component. The LDA index uses the original principal components of nΣ̂. The PDA index keeps the principal component directions of nΣ̂ and the total variance, but changes the proportion of total variance explained by each direction. When the proportion due to the kth principal component is larger than 1/p, the PDA index uses a shrunken proportion of the total variance for this direction; otherwise, it uses an increased proportion. For a non-significant principal component (d_i = 0), the PDA index puts λ/p as the proportion of total variance on that component. Therefore, if p is large and we want to keep the same amount of shrinkage, we need to use a larger λ.
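To make equation (2) concrete, here is a minimal R sketch: class.ss computes B and W as defined above, and pda.index evaluates I_PDA(A, λ) for an orthonormal p × k projection A. The function names are ours; the classPP package's actual interface may differ.

```r
# Between-class (B) and within-class (W) sums of squares for an n x p
# standardized data matrix X with class labels y.
class.ss <- function(X, y) {
  total.mean <- colMeans(X)
  B <- W <- matrix(0, ncol(X), ncol(X))
  for (k in unique(y)) {
    Xk <- X[y == k, , drop = FALSE]
    class.mean <- colMeans(Xk)
    B <- B + nrow(Xk) * tcrossprod(class.mean - total.mean)
    W <- W + crossprod(sweep(Xk, 2, class.mean))
  }
  list(B = B, W = W)
}

# The PDA index of equation (2); lambda = 0 recovers the LDA index.
pda.index <- function(A, B, W, lambda, n) {
  A  <- as.matrix(A)
  Ip <- diag(nrow(A))
  num <- det(t(A) %*% ((1 - lambda) * W + n * lambda * Ip) %*% A)
  den <- det(t(A) %*% ((1 - lambda) * (B + W) + n * lambda * Ip) %*% A)
  1 - num / den
}
```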
3.2 Examples

A toy example from Marron et al. (2002) is used to demonstrate the difference between the PDA index and the LDA index. There are two classes in a 39-dimensional space, each with 20 data points, generated from the standard Gaussian distribution except that the mean of the first variable is shifted to 2.2 and -2.2 for the two classes, respectively. Therefore, the separation of the two classes depends only on the first variable. Figure 3 (a)-(e) shows the first variable and the histograms of the 1-dimensional optimal projected data using the PDA PP index with various λ values. As mentioned before, when λ = 0 (Figure 3(b)) the PDA PP index is the same as the LDA index, and the projected data have very small within-group variance. As λ increases, the within-group variances of the projected data get larger and the projected data show more reasonable class structure.

To see the difference between the LDA index and the PDA index in detail, we compare the optimal projection coefficients of the LDA index and of the PDA index with λ = 0.9. Figure 3(b-1) shows the absolute values of the optimal projection coefficients using the LDA index. All the coefficients are small, and from them it is hard to decide which variables are more important for separating the two classes. With the PDA index, on the other hand, the coefficient of the first variable is very large and the others are very small (Figure 3(e-1)). From this result, we can conclude that the LDA index focuses only on projections with small within-class variance relative to the total variance, leading to a projection that is too strongly biased toward the training set, and therefore is not useful when the sample size is small and the number of variables is large. The PDA index, in contrast, leads to quite a reasonable projection, and its coefficients can be used as a guideline for selecting important variables.

We apply the PDA index to the leukemia data with three classes: AML, B-cell ALL, and T-cell ALL. Figure 4 shows the results using the PDA index with various λ values. After finding the 2-dimensional optimal projection using the PDA index on the training set, we project both the training and test sets onto this optimal projection. When λ = 0, the training set and the test set have different class structures on this projection: the training set has very small within-class variances, while the test set has large within-class variances. When λ = 0.1, the training set and the test set have similar class structures. But as λ increases further, the within-class variance of the training set increases too much, and beyond a certain point the PDA index can be biased in the direction opposite to the LDA index; that is, the within-class variance of the training set becomes larger than the within-class variance of the test set. Therefore we need to select a λ value that keeps the within-class variance of the training set at a reasonable level.

For selecting λ, we suggest using S(λ) = tr(W)/n, where W is the within-class sum of squares of the optimal projected data obtained with the PDA index at that λ. If we use standardized data, the optimal value of S(λ) is 1: if S(λ) is smaller than 1, increase λ; if S(λ) is larger than 1, decrease λ. Figure 5 shows the plot of S(λ) against λ. For the leukemia data with 40 genes, the best λ value is around 0.2. In the plot of the 2-dimensional projected data using the PDA index with λ = 0.2, we can see that the training set and the test set have similar within-group variances.
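The S(λ) diagnostic is straightforward to compute; a sketch, assuming proj.data holds the optimal projected training data as an n × k matrix and y its class labels:

```r
# S(lambda) = tr(W)/n for the projected data; on standardized data aim for
# values near 1 (increase lambda if S < 1, decrease it if S > 1).
S.lambda <- function(proj.data, y) {
  tr.W <- 0
  for (k in unique(y)) {
    Pk <- proj.data[y == k, , drop = FALSE]
    tr.W <- tr.W + sum(sweep(Pk, 2, colMeans(Pk))^2)  # class k's within SS
  }
  tr.W / nrow(proj.data)
}
```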
For the PDA PP method, the training 9 (b) LDA : lambda=0 10 6 8 15 10 (a) The first variable 222222 22 222 22 2 5 4 22 2 222 222 22 2 22 222 2 1111 111 11111 111 1 0 0 1111111 −6 −4 −2 0 X1 2 6 −6 −4 −2 0 2 1D projected data (d) PDA : lambda=0.5 4 6 (e) PDA : lambda=0.9 4 2 22222222 22 22 22 2 −4 −2 0 2 1D projected data 4 6 0 0 0 2 111 1111 11 111111 1111 111111111 1 2 11 1111 111111111 2 4 2 222222 22222 22 2 −6 −6 −4 −2 0 2 1D projected data 4 6 −6 −4 −2 0 2 1D projected data 4 6 0.8 0.6 0.4 PDA PP coefficient 0.6 0.4 0.0 0.2 0.0 0.2 0.8 1.0 (e−1) PDA : lambda=0.9 1.0 (b−1) LDA : lambda=0 PDA PP coefficient 2 2 222222222 22 2 4 6 6 8 6 10 8 8 12 10 (c) PDA : lambda=0.1 4 1 3 5 7 9 12 15 18 21 24 27 30 33 36 39 variables 1 3 5 7 9 12 15 18 21 24 27 30 33 36 39 variables Figure 3. Toy example : p = 39 , n = 40. (a) - (e) The histograms of 1D optimal projeted data using the PDA index with various , (b-1) (e-1) the projetion pursuit oeÆient values for orresponding variables. 10 p=40:lambda=0 p=40:lambda=0.1 2 2222222 22 22 2 22222 2 2 22 2 22 2 222 22 222 222 111 1 111111 3 p=40:lambda=0.5 2222 2 2 222 2222 2 2 2 22 22 333 3 111 1 1 111 11 1 1 1 2 2 1 1 1 11 11 1 1 11 11 11 3 3 Training set Test set 1 : AML 3 2 3 3 33 p=40:lambda=0.9 3 3 3 3 3 2 : B-ell ALL : AML 4 : B-ell ALL 2 1 2 22 22 222222222 2222 2 222 2 3 3 3 11 11 11 11111111 3 : T-ell ALL + : T-ell ALL Figure 4. Leukemia data(p=40) - 2D projetions using P DA index with various p=40:lambda=0.2 Lambda selection (p=40) 4 3 33 33 2 222222 2 22 2 2222 2 22 2 22 2 1 1 0 tr(W)/n 2 3 3 1 1 11 1 11 11 1 11 0.0 0.2 0.4 0.6 lambda 0.8 Training set Test set 1 : AML 2 : B-ell ALL : AML 4 : B-ell ALL 3 : T-ell ALL + : T-ell ALL Figure 5. seletion for the Leukemia data(p=40) : Plot of S() vs. and the optimal 2-dimensional projeted data using the PDA index with =0.2 p=3571 : lambda=0 3 3 3 3 3 p=3571 : lambda=0.1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 test error = 8 Training set Test set 22222 2 222 22 222 222 222 22 2222 2 22 2 1 1 11 11 1 111 11 2 2 2 2 2 2 2 2 2 2 2 2 p=3571 : lambda=0.5 p=3571 : lambda=0.9 3 333 3 33333 3 11111 1 1 11 111111 3 3333 test error = 3 1 : AML 2 2 2 222222 22 22 22 2222222 11 11 111 111 11 test error = 0 2 : B-ell ALL : AML 4 : B-ell ALL test error = 0 3 : T-ell ALL + : T-ell ALL Figure 6. Leukemia data(p=3571) - 2-dimensional projetions using P DA index with various 11 errors are the same as linear disriminant lassiation with the original data (all 0), but the test errors are muh lower. For various numbers of variables( p), the test errors of the PDA PP method are smaller than the errors of the LDA PP method. Also the PDA PP method shows very onsistent test errors for all p. From this result, we an onlude that the PDA index helps nd less biased and more reasonable projetions. Table 2. Training and test error for the LDA PP method and the PDA PP method with = 0:2. LDA PP PDA PP with = 0:1 Training error Test error Training error Test error p Q2 40 0 30 0 20 0 10 0 Q3 0 0 0 0 Q2 4 2 1 1 Q3 5 3 2 2 Q2 0 0 0 0 Q3 0 0 0 0 Q2 1 1 1 1 Q3 2 2 2 2 Figure 6 shows the 2-dimensional optimal projetions using the PDA index with Leukemia data with all genes. For all values, the between-lass variane struture are similar, the means of three lasses form a triangle shape, but the within-lass struture are quite dierent. 
When λ = 0, the within-class sum of squares of the projected data is a singular matrix, which suggests that a 1-dimensional projection would suffice for linear discriminant analysis; this optimal projection will not be useful for showing separations in new samples. When λ = 0.1, the within-class variance is very small but nonsingular, and the 2-dimensional projection carries more information than the 1-dimensional projection. As λ increases, the within-class sum of squares of the projected data gets larger, and the training and test sets have more similar within-class variances. Using S(λ), we suggest a larger λ value here, around 0.7 (Figure 7).

Figure 7. λ selection for the leukemia data (p = 3571): plot of S(λ) vs. λ, and the 2-dimensional projected data using the PDA index with λ = 0.7.

4 Application: Gene selection

In the previous sections, we used the BW values to select genes that are useful for separating classes. The BW values are calculated gene by gene, with no consideration of the correlations between genes. But most genes are highly correlated, and some genes work together to separate classes even though they have small BW values. In this sense, the projection coefficients from the PDA index can provide a better gene selection method. As we saw in the toy example, the coefficients from the PDA index tend to be more reasonable than the coefficients from the LDA index. These coefficients indicate how important the corresponding variables are for separating classes. We compare them to the BW values in terms of class separation.

To see how the projection coefficients from the PDA index and the BW values are related, we start from the leukemia data (p = 3571, n = 72) with two classes, AML and ALL. Figure 8 plots the BW values against the projection coefficients from the PDA projection pursuit method with λ = 0.9. For most genes, the BW values are less than 0.5 and the projection coefficients lie between -0.04 and 0.04. Most genes with large BW values have projection coefficients larger than 0.04 or smaller than -0.04. On the other hand, the BW values of the genes with large projection coefficients are spread out very widely; some are less than 0.5.

Figure 8. BW values vs. PDA (λ = 0.9) projection coefficients: leukemia data with two classes, AML and ALL (p = 3571, n = 72).

Table 3. Comparison between BW values and projection coefficients for the top 5 genes by each criterion.

    From large projection coef.                From large BW
    genes               BW     PP coef.        genes        BW     PP coef.
    U34877_at           0.16   -0.06           M84526_at    3.01   -0.06
    M27891_at           2.66   -0.06           M27891_at    2.66   -0.06
    M84526_at           3.01   -0.06           U46499_at    2.54   -0.05
    X95735_at           1.80   -0.06           M23197_at    2.29   -0.04
    HG1612-HT1612_at    1.03    0.06           X95735_at    1.80   -0.06

We select the 5 genes with the largest BW values (M84526_at, M27891_at, U46499_at, M23197_at, and X95735_at) and the 5 genes with the largest projection coefficients (U34877_at, M27891_at, M84526_at, X95735_at, and HG1612-HT1612_at), and compare them. Table 3 shows the BW values and projection coefficients for each gene. All genes selected by large projection coefficients also have large BW values except one, gene U34877_at, and all genes selected by large BW values have large projection coefficients.
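The two rankings behind Table 3 can be sketched as below, assuming a holds the optimal 1-D projection coefficients and bw the output of bw.ratio() from Section 2; top.genes is our name for this helper.

```r
# Top genes by absolute projection coefficient and by BW value.
top.genes <- function(a, bw, gene.names, n.top = 5) {
  list(by.coef = gene.names[order(abs(a), decreasing = TRUE)[1:n.top]],
       by.bw   = gene.names[order(bw,     decreasing = TRUE)[1:n.top]])
}
```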
Figure 9 shows the scatter plot matrices of the 5 genes selected by large BW values and the 5 genes selected by large projection coefficients. The 5 genes from large BW values show quite separable group means, but no pair of these genes is clearly separable: at least one or two cases are misclassified if we use a single separating hyperplane. The 5 genes from large projection coefficients show more mixed structure, yet in the plot of X95735_at against HG1612-HT1612_at we can find a separating hyperplane for which the two classes are clearly separable.

Figure 9. Scatter plot matrices: (a) 5 genes with large BW values; (b) 5 genes with large projection coefficients from the PDA index (λ = 0.9). Leukemia data with two classes, AML (∘) and ALL (+) (p = 3571, n = 72).

Whenever we optimize the PDA index, we can get different projection coefficients, especially when we use a very large number of variables. This comes from the curse of dimensionality: because most of a high-dimensional space is empty, especially when we have a small number of observations, we can separate classes in many different ways. Therefore, with the projection pursuit method we can explore various projections that show separated class structure. To show how this works, we choose another projection from maximizing the PDA index (λ = 0.9) and select the 10 genes with the largest projection coefficients. Table 4 summarizes these 10 genes. All of them have small BW values (less than 1), and some have very small BW values, even less than 0.1.

Table 4. BW values and projection coefficients of the 10 genes with the largest projection coefficients from another optimization of the PDA index (λ = 0.9).

    genes               BW       PP coef.
    M82809_at           0.1241   -0.0673
    X51521_at           0.5920    0.0563
    S68616_at           0.0932   -0.0555
    L02426_at           0.0005   -0.0554
    AF006087_at         0.0187    0.0543
    U30255_at           0.3871   -0.0541
    U41654_at           0.1560    0.0525
    M84371_rna1_s_at    0.7359    0.0522
    L10373_at           0.2444    0.0519
    U10868_at           0.4401   -0.0518

Figure 10 shows histograms of the data projected onto the LDA optimal projections with genes selected by large BW values and by large projection coefficients. In Figure 10(a-1), the 5 genes with large BW values separate AML and ALL clearly except for 5 cases (2 in ALL and 3 in AML). As we increase the number of genes up to 10, the two groups become more separable, but with 10 genes from large BW values we still have a misclassified case. Figure 10(b-1) is the histogram of the data projected onto the LDA optimal projection with the 5 genes selected by large projection coefficients; many cases are misclassified. However, as we increase the number of genes up to 10, the separation of the two classes improves very quickly, and with the 10 genes from large projection coefficients the two classes are clearly separable.

As we can see in Figure 10, beyond a certain point adding more genes does not help much to separate classes. As a guideline for deciding the number of selected genes, we recommend using the LDA index value. Figure 11 shows plots of the optimal LDA index value versus the number of genes. In Figure 11(a), we use the same projection coefficients as in Table 3 and optimize the 1-dimensional LDA index over the selected genes. After p = 5, the LDA index value does not increase much; therefore, the 5 selected genes in Table 3 are enough to separate AML and ALL. In Figure 11(b), genes are selected by large BW values; after one or two genes are selected, the LDA index value increases very slowly. Figure 11(c) is the LDA index plot of the genes selected by large projection coefficients, as used for Table 4. As we expected from their very low BW values, the LDA index values are very low when p is small, but as p increases the LDA index value rises rapidly; after p = 12 it stays steady.
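The guideline of Figure 11 traces the optimal 1-D LDA index as genes are added in ranked order. For one dimension the LDA optimum has a closed form (the leading Fisher direction), so a sketch needs only class.ss from Section 3; lda.index.path is our name, not the paper's.

```r
# Optimal 1-D LDA index: 1 - a'Wa / a'(B+W)a at the leading eigenvector of
# solve(W) %*% B, i.e. Fisher's discriminant direction.
lda.index.1d <- function(B, W) {
  a <- Re(eigen(solve(W) %*% B)$vectors[, 1])
  drop(1 - (a %*% W %*% a) / (a %*% (B + W) %*% a))
}

# Index value after adding genes one at a time in the order of ranked.genes;
# look for the p where the curve flattens (e.g. p = 5 in Figure 11(a)).
lda.index.path <- function(X, y, ranked.genes, max.p = 20) {
  sapply(seq_len(max.p), function(p) {
    ss <- class.ss(X[, ranked.genes[1:p], drop = FALSE], y)
    lda.index.1d(ss$B, ss$W)
  })
}
# e.g. plot(lda.index.path(X, y, order(bw, decreasing = TRUE)), type = "b")
```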
Figure 10. 1D optimal projections using the LDA index with 5, 7, 9, and 10 genes selected from large BW values (a-1 to a-4) and from large projection coefficients (b-1 to b-4): leukemia data with two classes (1 = AML, 2 = ALL).

Figure 11. Plots of the optimal 1D LDA index value vs. the number of selected genes: (a) genes from Table 3, (b) genes from large BW values, (c) genes from Table 4.

5 Discussion

We have looked at the problems of high-dimensional spaces, especially when the number of observations is small, and have proposed a new projection pursuit index that adds a penalty term for high dimensionality or multicollinearity. The PDA index works well to separate classes in a reasonable manner when the data have multicollinearity or very high dimensionality relative to the sample size.

To use the PDA index, we need to choose λ. In the original PDA, cross-validation can be used to select λ. But the main purpose of projection pursuit is exploratory data analysis, so cross-validation is not a good approach for selecting λ for our PDA index; this is the main reason we keep λ in [0, 1). One guideline is to use a larger λ for larger p.

The PDA index can also be used to select important variables that help separate classes. In gene expression data analysis, this is useful for selecting important genes that behave differently in each class, and it can be extended to clustering genes.

To optimize the PDA index, we used the modified simulated annealing method (Lee, 2003). We have used the R language for this research, and the PDA index is included in the classPP package (available at CRAN).
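As a rough illustration of this optimization step (not the classPP implementation), a toy hill-climbing stand-in for the 1-dimensional case might look as follows, reusing pda.index from Section 3; unlike simulated annealing, it accepts only uphill moves.

```r
# Toy greedy search for a 1-D projection maximizing the PDA index; the real
# optimizer (modified simulated annealing, Lee 2003) also accepts some
# downhill moves to escape local optima.
optimize.pda.1d <- function(B, W, lambda, n, n.iter = 5000, step = 0.1) {
  a <- rnorm(nrow(W)); a <- a / sqrt(sum(a^2))   # random unit-length start
  best <- pda.index(a, B, W, lambda, n)
  for (i in seq_len(n.iter)) {
    cand <- a + step * rnorm(length(a))          # perturb the direction ...
    cand <- cand / sqrt(sum(cand^2))             # ... and renormalize
    val  <- pda.index(cand, B, W, lambda, n)
    if (val > best) { a <- cand; best <- val }   # keep improvements only
  }
  list(proj = a, index = best)
}
```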
References

[1] Duda, R. O., and Hart, P. E. (1973). Pattern Classification and Scene Analysis, John Wiley and Sons.
[2] Dudoit, S., Fridlyand, J., and Speed, T. P. (2002). "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data," Journal of the American Statistical Association, 97, 77-87.
[3] Friedman, J. (1987). "Exploratory Projection Pursuit," Journal of the American Statistical Association, 82, 249-266.
[4] Friedman, J. (1989). "Regularized Discriminant Analysis," Journal of the American Statistical Association, 84, 165-175.
[5] Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S. (1999). "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring," Science, 286, 531-537.
[6] Hastie, T., Buja, A., and Tibshirani, R. (1994). "Flexible Discriminant Analysis by Optimal Scoring," Journal of the American Statistical Association, 89, 1255-1270.
[7] Hastie, T., Buja, A., and Tibshirani, R. (1995). "Penalized Discriminant Analysis," Annals of Statistics, 23, 73-102.
[8] Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning, Springer.
[9] Johnson, R. A., and Wichern, D. W. (1998). Applied Multivariate Statistical Analysis, 4th edition, Prentice-Hall, New Jersey.
[10] Lee, E. (2003). "Projection Pursuit for Exploratory Supervised Classification," Ph.D. thesis, Department of Statistics, Iowa State University.
[11] Lee, E., Cook, D., Klinke, S., and Lumley, T. (2004). "Projection Pursuit for Exploratory Supervised Classification," ISU Preprint.
[12] Marron, J. S., and Todd, M. (2002). "Distance Weighted Discrimination," Optimization Online Digest, July 2002.
[13] Ripley, B. D. (1996). Pattern Recognition and Neural Networks, Cambridge University Press.