Projection Pursuit Method for the Small Sample Size with the Large Number of Variables

Eun-Kyung Lee¹ and Dianne Cook²

¹ Department of Statistics, Ewha Womans University, 11-1 Daehyun-dong, Seodaemun-gu, Seoul, 120-750, Korea
² Department of Statistics, Iowa State University, Ames, IA 50011, USA
Summary

In high-dimensional data, one often seeks a few interesting low-dimensional projections which reveal important aspects of the data. Projection pursuit for exploratory supervised classification searches for separable class structure. Even though the projection pursuit method can bypass the curse of dimensionality, when the number of observations is small relative to the number of variables, the class structure of the optimal projection can be seriously biased. In this situation, most classical multivariate analysis methods have problems. We discuss how sample size and dimensionality are related, and we propose a new projection pursuit index that penalizes the projection coefficients and overcomes the problem of small sample size.

Keywords: The curse of dimensionality; Gene expression data analysis; Multivariate data; Penalized discriminant analysis; Projection pursuit
1 Introduction
This paper is about the exploratory data analysis of small samples with a large number of variables, especially in supervised classification. If the classifier obtained for a given training set is inadequate, it is natural to consider adding new variables, particularly ones that will help separate the difficult-to-classify cases. If the new variables provide any additional information, the performance of the classifier should improve. Unfortunately, beyond a certain point, additional variables make the classifier worse. The problem arises when the sample size is small or the variables are highly correlated. When the training set is small relative to the number of variables, the statistical parameters estimated from it are inaccurate and unstable: a quite different classifier may be obtained when a different training set is used.

A small sample with a very large number of variables is the typical situation in gene expression data analysis. In this paper, we focus on leukemia data covering two types of leukemia, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). This data set consists of 25 cases of AML and 47 cases of ALL (38 cases of B-cell ALL and 9 cases of T-cell ALL). After preprocessing, we have 3571 human genes (Golub et al., 1999).

In Section 2, we discuss how sample size and dimensionality are related and how they affect supervised classification. Section 3 introduces a new projection pursuit method for when the sample size is small and the number of variables is large, describes its properties, and applies the new projection index to the leukemia data. In Section 4, we explain how this new projection pursuit index can be applied to gene selection and compare it to other gene selection methods.
2 Problems of high dimensionality
To see how the number of variables affects classification methods that use separating hyperplanes, we investigate linear discriminant analysis with the leukemia data (n = 72 and p = 3571). For gene selection, we use the ratio of between-group to within-group sums of squares,
\[
\mathrm{BW}(j) = \frac{\sum_{i=1}^{n}\sum_{k=1}^{g} I(y_i = k)\,(\bar{x}_{k,j} - \bar{x}_{\cdot,j})^2}{\sum_{i=1}^{n}\sum_{k=1}^{g} I(y_i = k)\,(x_{i,j} - \bar{x}_{k,j})^2} \tag{1}
\]

where $\bar{x}_{\cdot,j} = (1/n)\sum_{i=1}^{n} x_{i,j}$ and $\bar{x}_{k,j} = \left(\sum_{i=1}^{n} I(y_i = k)\,x_{i,j}\right) / \left(\sum_{i=1}^{n} I(y_i = k)\right)$. First, we sample a 2/3 training set ($n_{\mathrm{train}} = 48$) and calculate the BW values for each gene using this training set. Then we select the p variables that have the largest BW values. Using this training set with p variables, we
build the classifier using linear discriminant analysis (LDA) and compute the training error and the test error. We repeat this 200 times. The medians and upper quartiles of the training and test errors are summarized in Table 1. For various p, the training errors are almost 0; that is, the training sets are perfectly separated regardless of p. But the test errors increase as p increases. As p approaches n, the test error gets worse.
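As a concrete illustration of this experiment, here is a minimal R sketch. The object names are hypothetical (`x` is the n x p expression matrix, `y` the class labels); the 2/3 split, top-40 gene selection, and 200 repetitions follow the text.

    library(MASS)

    ## BW(j): ratio of between- to within-group sums of squares, per gene
    bw.ratio <- function(x, y) {
      overall <- colMeans(x)
      between <- 0
      within  <- 0
      for (k in unique(y)) {
        xk <- x[y == k, , drop = FALSE]
        ck <- colMeans(xk)
        between <- between + nrow(xk) * (ck - overall)^2
        within  <- within + colSums(sweep(xk, 2, ck)^2)
      }
      between / within
    }

    test.errors <- replicate(200, {
      train <- sample(nrow(x), round(2/3 * nrow(x)))
      top   <- order(bw.ratio(x[train, ], y[train]), decreasing = TRUE)[1:40]
      fit   <- lda(x[train, top], grouping = y[train])
      test  <- setdiff(seq_len(nrow(x)), train)
      sum(predict(fit, x[test, top])$class != y[test])
    })
    quantile(test.errors, c(0.50, 0.75))   # Q2 and Q3, as in Table 1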
Table 1. Training and test errors for various numbers of variables (Q2: median, Q3: upper quartile).

                  True Class                      Permuted Class
         Training error   Test error      Training error   Test error
    p     Q2     Q3       Q2     Q3        Q2     Q3       Q2     Q3
    40     0      0        4      5         0      0       15     16
    30     0      0        2      3         1      1       14     16
    20     0      0        1      2         3      4       14     15.25
    10     0      0        1      2         7      9       14     15
It gets more interesting when we scramble the class labels using permutation, so that any class separations are spurious. Table 1 also shows the results of the same procedure outlined above using the permuted classes. We might expect that a classifier will not accurately separate these spurious classes. Surprise! When p = 40, the training error is 0. This result can be explained by the capacity of a separating hyperplane: when the training set has p = 40 and n = 48, the probability that n sample points in p-dimensional space are linearly separable is close to 1 (Ripley, 1996). Therefore a separating hyperplane exists purely by chance. When p gets smaller, the training error with the permuted classes gets larger. The test errors are consistently bad, independent of p.
These results can be seen visually using the LDA projection pursuit index (Lee, 2003). Figure 1 shows the 2-dimensional optimal projections using the LDA PP index on the training set for both the true classes (top row) and the permuted classes (bottom row). The digits 1, 2, and 3 represent AML, B-cell ALL, and T-cell ALL in the training set, and the symbols ○, △, and + represent AML, B-cell ALL, and T-cell ALL in the test set. After finding the 2-dimensional optimal projection with the training set, we project both the training and test sets onto this optimal projection. For the true classes, the training set with p = 40 is more separable and has smaller within-class variance than with p = 10, but the test set shows quite different group means and much larger within-class variance; notice that the test set is not separable on this projection. When p is smaller, the training set has larger within-class variance and the test set has structure more similar to the training set.

For the permuted classes, when p = 40, we can find separated class structure in the training set. When p = 30, the training set is still separated. As p gets smaller, the class structure of the training set weakens. For all p, the test sets do not reveal any class structure. These results support the LDA errors in Table 1. From these results, we conclude that when p is large, the LDA classifier is badly biased; therefore we need to choose the number of variables carefully.
[Figure 1 here. Panels (a)-(d): true classes, p = 40, 30, 20, 10; panels (e)-(h): permuted classes, p = 40, 30, 20, 10. Plot symbols: 1/2/3 = AML/B-cell ALL/T-cell ALL (training set); ○/△/+ = AML/B-cell ALL/T-cell ALL (test set).]

Figure 1. LDA training and test sets for various numbers of variables with the true classes (a-d) and the permuted classes (e-h).
Many classical multivariate analysis methods need to calculate the inverse of a covariance matrix. If $n \le p + 1$ or the variables are highly correlated, $\hat{\Sigma}$ will be close to singular, which results in numerical instability when calculating the inverse. It is then necessary to estimate the variance-covariance matrix differently. If there is prior information about this covariance, we can use a Bayesian or pseudo-Bayesian estimate $\tilde{\Sigma} = (1-\lambda)\hat{\Sigma} + \lambda\Omega$, where $\Omega$ is a pre-determined matrix built from prior information or assumptions. If $\Omega$ is diagonal, it will help avoid numerical problems. For the extreme assumption that all variables are independent, we can use $\Omega = \mathrm{diag}(\hat{\Sigma})$. Even though the assumption is incorrect, the resulting heuristic estimates can perform better than the MLE.
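A one-function R sketch of this shrunken estimate, under the extreme independence assumption $\Omega = \mathrm{diag}(\hat{\Sigma})$ (`x` is a hypothetical data matrix):

    ## Sigma.tilde = (1 - lambda) * Sigma.hat + lambda * diag(Sigma.hat)
    shrunk.cov <- function(x, lambda) {
      s <- cov(x)
      (1 - lambda) * s + lambda * diag(diag(s))
    }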
We investigate LDA from this point of view. LDA finds a projection $a$ by maximizing $(a^T \Sigma_B a)/(a^T \Sigma_W a)$, where $\Sigma_B$ is the between-class covariance matrix and $\Sigma_W$ is the within-class covariance matrix. When the sample size is small and the number of variables is large, LDA is usually too flexible, and sometimes $a^T \Sigma_W a$ can be small for some $a$. This causes the data piling problem (Marron and Todd, 2002).
[Figure 2 here. Panel (a): histogram of the 1D optimal projected data; panel (b): plot of the 2D optimal projected data. Labels: 1 = AML, 2 = B-cell ALL, 3 = T-cell ALL.]

Figure 2. Leukemia data: 1D and 2D optimal projections using the LDA index (p = 3571). (a) Histogram of the 1D optimal projected data. (b) Plot of the 2D optimal projected data.
Figure 2 shows the optimal 1-dimensional and 2-dimensional projected data using the LDA index for the leukemia data with n = 72 and p = 3571. In the 1-dimensional projection, all data points in each class are projected onto almost a single point and all classes have very small within-class variances. In the 2-dimensional projection, each class lies on a line, and the two variables of the projected data have a perfect linear relationship. To escape this data piling problem, we consider a penalty on the projection coefficients. Penalized discriminant analysis (PDA; Hastie et al., 1995) is a generalization of LDA that incorporates prior information as a roughness penalty. PDA finds a projection $a$ by maximizing $(a^T \Sigma_B a)/(a^T (\Sigma_W + \lambda\Omega) a)$. In PDA, the pre-determined matrix $\Omega$ keeps the within-group structure of the projected data from degenerating too much. That is, the optimal projection from PDA has larger within-class variance than the optimal projection from the LDA PP index, with the amount depending on the size of $\lambda$ and the choice of $\Omega$. We extend this idea to a new projection pursuit method.
3 PDA projection pursuit index

3.1 Index definition
We propose a new projection pursuit index which is an extension of the LDA PP index (Lee, 2003). The main purposes are to (1) prevent the problems caused by a small number of observations with a large number of variables and (2) find projections that contain class separations in a reasonable manner. We use $\tilde{\Sigma}(\lambda) = (1-\lambda)\hat{\Sigma} + \lambda\,\mathrm{diag}(\hat{\Sigma})$ as our estimate of the variance-covariance matrix. As $\lambda$ increases, $\tilde{\Sigma}$ tends to $\mathrm{diag}(\hat{\Sigma})$. When the data are standardized, it reduces to $\tilde{\Sigma}(\lambda) = (1-\lambda)\hat{\Sigma} + \lambda I_p$.
Let $X_{ij}$ be the p-dimensional vector of the jth observation in the ith class, $i = 1, \ldots, g$, $j = 1, \ldots, n_i$, where $g$ is the number of classes, $n_i$ is the number of observations in class $i$, and $n = \sum_{i=1}^{g} n_i$. Let $\bar{X}_{i\cdot} = \frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij}$ be the ith class mean and $\bar{X}_{\cdot\cdot} = \frac{1}{n}\sum_{i=1}^{g}\sum_{j=1}^{n_i} X_{ij}$ be the total mean. For convenience, we assume that the $X_{ij}$ are standardized. Let

\[
B = \sum_{i=1}^{g} n_i (\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})(\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^T \quad \text{(between-class sums of squares)},
\]
\[
W = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i\cdot})(X_{ij} - \bar{X}_{i\cdot})^T \quad \text{(within-class sums of squares)}.
\]
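The matrices B and W can be computed directly from these definitions; a short R sketch, reusing the hypothetical `x`, `y` from the earlier sketches:

    ## B and W as defined above, for data matrix x and class labels y
    between.within <- function(x, y) {
      p <- ncol(x)
      total <- colMeans(x)
      B <- matrix(0, p, p)
      W <- matrix(0, p, p)
      for (k in unique(y)) {
        xk <- x[y == k, , drop = FALSE]
        ck <- colMeans(xk)
        B  <- B + nrow(xk) * tcrossprod(ck - total)
        W  <- W + crossprod(sweep(xk, 2, ck))
      }
      list(B = B, W = W)
    }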
Here, $B + W = n\hat{\Sigma}$, where $\hat{\Sigma}$ is the correlation matrix. Then the PDA projection pursuit index is

\[
I_{PDA}(A, \lambda) = 1 - \frac{\left| A^T \{(1-\lambda)W + n\lambda I_p\} A \right|}{\left| A^T \{(1-\lambda)(B+W) + n\lambda I_p\} A \right|} \tag{2}
\]

where $A$ is an orthonormal projection onto k-dimensional space and $\lambda \in [0, 1)$ is a predetermined parameter. Let $B^* = (1-\lambda)B$ and $W^* = (1-\lambda)W + n\lambda I_p$. Then the PDA index has the same form as the LDA index, and when $\lambda = 0$ the PDA index is the same as the LDA index.
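Index (2) translates directly into R; a minimal sketch, assuming `A` is a p x k orthonormal matrix and B, W come from `between.within` above:

    ## PDA projection pursuit index, equation (2)
    pda.index <- function(A, B, W, lambda, n) {
      p  <- nrow(A)
      Wl <- (1 - lambda) * W + n * lambda * diag(p)
      Sl <- (1 - lambda) * (B + W) + n * lambda * diag(p)
      1 - det(t(A) %*% Wl %*% A) / det(t(A) %*% Sl %*% A)
    }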
Proposition 1. Let $\Sigma^* = (1-\lambda)(B+W) + n\lambda I_p = B^* + W^*$. Then

\[
1 - \prod_{i=1}^{k} \lambda_i \;\le\; I_{PDA}(A, \lambda) \;\le\; 1 - \prod_{i=p-k+1}^{p} \lambda_i \tag{3}
\]

where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ are the eigenvalues of $\Sigma^{*-1/2} W^* \Sigma^{*-1/2}$; $e_1, e_2, \ldots, e_p$ are the corresponding eigenvectors of $\Sigma^{*-1/2} W^* \Sigma^{*-1/2}$; and $f_1, f_2, \ldots, f_p$ are the eigenvectors of $\Sigma^{*-1/2} B^* \Sigma^{*-1/2}$. In (3), the right equality holds when $A = \Sigma^{*-1/2}[e_p\; e_{p-1}\; \cdots\; e_{p-k+1}] = \Sigma^{*-1/2}[f_1\; f_2\; \cdots\; f_k]$, and the left equality holds when $A = \Sigma^{*-1/2}[e_k\; e_{k-1}\; \cdots\; e_1] = \Sigma^{*-1/2}[f_{p-k+1}\; f_{p-k+2}\; \cdots\; f_p]$.
The proof of this proposition is the same as that of Proposition 1 in Lee et al. (2004).
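As a computational illustration of Proposition 1, the maximizing projection can be built from the eigendecompositions it names. A sketch follows; the final QR step re-orthonormalizes the columns, which leaves the index value unchanged because the determinant ratio in (2) is invariant under right-multiplication of A by a nonsingular k x k matrix.

    ## A = Sigma*^{-1/2} [e_p ... e_{p-k+1}], attaining the upper bound in (3)
    optimal.A <- function(B, W, lambda, n, k) {
      p  <- nrow(B)
      Wl <- (1 - lambda) * W + n * lambda * diag(p)
      Sl <- (1 - lambda) * (B + W) + n * lambda * diag(p)
      es <- eigen(Sl, symmetric = TRUE)
      S.half.inv <- es$vectors %*% diag(1 / sqrt(es$values)) %*% t(es$vectors)
      ev <- eigen(S.half.inv %*% Wl %*% S.half.inv, symmetric = TRUE)
      A  <- S.half.inv %*% ev$vectors[, p:(p - k + 1), drop = FALSE]
      qr.Q(qr(A))   # re-orthonormalize; index value is unchanged
    }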
To explain the difference between the PDA and LDA indices, we use the principal components. Let

\[
\Sigma = B + W = QDQ^T = \sum_{i=1}^{p} d_i q_i q_i^T, \tag{4}
\]

where $Q = [q_1, q_2, \ldots, q_p]$ is the eigenvector matrix of $\Sigma$, $D = \mathrm{diag}(d_1, d_2, \ldots, d_p)$ contains the eigenvalues of $\Sigma$, $d_i = 0$ for all $i = r+1, \ldots, p$, and $\mathrm{rank}(\Sigma) = r$. Then $\mathrm{tr}(\Sigma) = \mathrm{tr}(\Sigma^*) = np$, and

\[
\Sigma^* = \sum_{i=1}^{p} \{(1-\lambda)d_i + n\lambda\}\, q_i q_i^T. \tag{5}
\]

Therefore $\Sigma$ and $\Sigma^*$ have the same principal component directions and the same total variance. The difference between these two variance matrices is the proportion of total variance due to each principal component. For the LDA index, we use the original principal components of $\Sigma$. The PDA index keeps the directions of $\Sigma$'s principal components and the total variance, but changes the proportion of total variance explained by each direction: the proportion for the kth component changes from $d_k/np$ to $\{(1-\lambda)d_k + n\lambda\}/np = (1-\lambda)(d_k/np) + \lambda/p$. When the proportion due to the kth principal component is larger than $1/p$, the PDA index shrinks the proportion of total variance due to this direction; otherwise, the PDA index increases it. For a non-significant principal component ($d_k = 0$), the PDA index puts $\lambda/p$ as the proportion of total variance on that component. Therefore, if p is large and we want to keep the same amount of shrinkage, we need to use a larger $\lambda$.
3.2 Examples
A toy example from Marron and Todd (2002) is used to demonstrate the difference between the PDA index and the LDA index. There are two classes in a 39-dimensional space, each with 20 data points. The data are generated from the standard Gaussian distribution, except that the mean of the first variable is shifted to 2.2 and -2.2 for the two classes, respectively. Therefore, the separation of the two classes depends only on the first variable. Figure 3 (a)-(e) shows the histograms of the 1-dimensional optimal projected data using the PDA PP index with various $\lambda$ values. As mentioned before, when $\lambda = 0$ (Figure 3-b), the PDA PP index is the same as the LDA index and the projected data have very small within-group variance. As $\lambda$ increases, the within-group variances of the projected data get larger and the projected data show more reasonable class structure.
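The toy data are easy to regenerate; a sketch in R (the seed is ours, added only for reproducibility):

    ## Two classes of 20 points in 39 dimensions, separated only in X1
    set.seed(1)
    p <- 39; m <- 20
    x <- matrix(rnorm(2 * m * p), ncol = p)
    x[, 1] <- x[, 1] + rep(c(2.2, -2.2), each = m)
    y <- rep(1:2, each = m)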
To see the difference between the LDA index and the PDA index in detail, we compared the optimal projection coefficients of the LDA index and of the PDA index with $\lambda = 0.9$. Figure 3 (b-1) shows the absolute values of the optimal projection coefficients using the LDA index. All the coefficients have small values, and from these coefficients it is hard to decide which variables are more important than the others for separating the two classes. On the other hand, with the PDA index the coefficient of the first variable is very large and the others are very small (Figure 3 (e-1)). From this result, we conclude that the LDA index focuses only on finding a projection with small within-class variance relative to the total variance; it leads to a projection that is badly biased and therefore cannot be useful when the sample size is small and the number of variables is large. On the other hand, the PDA index leads to a quite reasonable projection, and its coefficients can be used as a guideline for selecting important variables.
We apply the PDA index to the leukemia data with three classes: AML, B-cell ALL, and T-cell ALL. Figure 4 shows the results using the PDA index with various $\lambda$ values. After finding the 2-dimensional optimal projection using the PDA index on the training set, we project both the training and test sets onto this optimal projection.

When $\lambda = 0$, the training set and the test set have different class structure on this projection: the training set has very small within-class variances, while the test set has large within-class variance. When $\lambda$ = 0.1, the training set and the test set have similar class structures. But as $\lambda$ increases further, the within-class variance of the training set increases too much, and beyond a certain point the PDA index can be biased in the opposite direction from the LDA index; that is, the within-class variance of the training set becomes larger than the within-class variance of the test set. Therefore we need to select a $\lambda$ value that keeps the within-class variance of the training set at a reasonable level.
For selecting the $\lambda$ value, we suggest using $S(\lambda) = \mathrm{tr}(W)/n$, where $W$ is the within-class sum of squares of the optimal projected data using the PDA index with $\lambda$. If we use standardized data, the optimal value of $S(\lambda)$ is 1. If $S(\lambda)$ is smaller than 1, we suggest increasing $\lambda$; if $S(\lambda)$ is larger than 1, we suggest decreasing $\lambda$. Figure 5 shows the plot of $S(\lambda)$ versus $\lambda$. For the leukemia data with 40 genes, the best $\lambda$ value is around 0.2. In the plot of the 2-dimensional projected data using the PDA index with $\lambda = 0.2$, we can see that the training set and test set have similar within-group variances.
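This selection rule amounts to a one-dimensional grid search; a sketch, where `optim.proj(x, y, k, lambda)` is a hypothetical stand-in for the PDA projection pursuit optimizer (in our work, the simulated annealing optimizer of Lee, 2003) and `between.within` is the helper sketched earlier:

    ## Pick the lambda whose S(lambda) = tr(W)/n is closest to 1
    select.lambda <- function(x, y, k = 2, grid = seq(0, 0.9, by = 0.1)) {
      s <- sapply(grid, function(lambda) {
        A <- optim.proj(x, y, k, lambda)    # p x k optimal projection
        W <- between.within(x %*% A, y)$W   # within-class SS, projected data
        sum(diag(W)) / nrow(x)              # S(lambda)
      })
      grid[which.min(abs(s - 1))]
    }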
We examine the training and test errors in classification using the LDA and PDA projection pursuit methods. After finding the optimal 2-dimensional projection using the LDA index and the PDA index with $\lambda = 0.1$, we build a classifier using rule (11-65) in Johnson and Wichern (2002) and compute the training and test errors. This is repeated 200 times. The medians and upper quartiles of the errors are summarized in Table 2. The results for the LDA PP method are the same as in Table 1.
[Figure 3 here. Panels: (a) the first variable; (b) LDA, lambda = 0; (c) PDA, lambda = 0.1; (d) PDA, lambda = 0.5; (e) PDA, lambda = 0.9; (b-1) LDA coefficients, lambda = 0; (e-1) PDA coefficients, lambda = 0.9.]

Figure 3. Toy example: p = 39, n = 40. (a)-(e) Histograms of the 1D optimal projected data using the PDA index with various $\lambda$; (b-1), (e-1) the projection pursuit coefficient values for the corresponding variables.
[Figure 4 here. Panels: p = 40 with lambda = 0, 0.1, 0.5, 0.9. Plot symbols: 1/2/3 = AML/B-cell ALL/T-cell ALL (training set); ○/△/+ = AML/B-cell ALL/T-cell ALL (test set).]

Figure 4. Leukemia data (p = 40): 2D projections using the PDA index with various $\lambda$.
[Figure 5 here. Left panel: lambda selection (p = 40), plot of tr(W)/n vs. lambda; right panel: 2D projected data, p = 40, lambda = 0.2. Plot symbols as in Figure 4.]

Figure 5. $\lambda$ selection for the leukemia data (p = 40): plot of $S(\lambda)$ vs. $\lambda$, and the optimal 2-dimensional projected data using the PDA index with $\lambda = 0.2$.
[Figure 6 here. Panels: p = 3571 with lambda = 0, 0.1, 0.5, 0.9, with reported test errors of 8, 3, 0, and 0 respectively. Plot symbols as in Figure 4.]

Figure 6. Leukemia data (p = 3571): 2-dimensional projections using the PDA index with various $\lambda$.
For the PDA PP method, the training errors are the same as for linear discriminant classification with the original data (all 0), but the test errors are much lower. For the various numbers of variables p, the test errors of the PDA PP method are smaller than those of the LDA PP method. The PDA PP method also shows very consistent test errors for all p. From this result, we conclude that the PDA index helps find less biased and more reasonable projections.
Table 2. Training and test errors for the LDA PP method and the PDA PP method with $\lambda = 0.1$ (Q2: median, Q3: upper quartile).

                  LDA PP                          PDA PP with λ = 0.1
         Training error   Test error      Training error   Test error
    p     Q2     Q3       Q2     Q3        Q2     Q3       Q2     Q3
    40     0      0        4      5         0      0        1      2
    30     0      0        2      3         0      0        1      2
    20     0      0        1      2         0      0        1      2
    10     0      0        1      2         0      0        1      2
Figure 6 shows the 2-dimensional optimal projections using the PDA index on the leukemia data with all genes. For all $\lambda$ values, the between-class variance structures are similar (the means of the three classes form a triangle), but the within-class structures are quite different. When $\lambda = 0$, the within-class sums of squares of the projected data form a singular matrix, which would suggest that a 1-dimensional projection is enough for linear discriminant analysis; this optimal projection will not be useful for showing separations in new samples. When $\lambda = 0.1$, the within-class variance is very small, but it is nonsingular and the 2-dimensional projection carries more information than the 1-dimensional projection. As $\lambda$ increases, the within-class sums of squares of the projected data get larger and the training and test sets have more similar within-class variances. Using $S(\lambda)$, we suggest a larger $\lambda$ value here, around 0.7 (Figure 7).
4 Application: Gene selection
In the previous sections, we used the BW values to select genes that are useful for separating classes. The BW values are calculated for each gene separately, with no consideration of the correlations between genes. But most genes are highly correlated, and some genes work together to separate classes even though they have small BW values. In this sense, the projection coefficients from the PDA index can provide a better gene selection method. As we saw in the toy example, the coefficients from the PDA index tend to be more reasonable than the coefficients from the LDA index. These coefficients can be used to explain how important the corresponding variables are for separating classes. We compare these coefficients to the BW values in terms of separating classes.
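In code, this gene selection is simply a ranking of the absolute coefficients of the optimal projection; a sketch, again using the hypothetical `optim.proj` optimizer and the leukemia objects `x`, `y`:

    ## Rank genes by the absolute PDA coefficients of a 1D projection
    a <- optim.proj(x, y, k = 1, lambda = 0.9)   # p x 1 coefficient vector
    top.genes <- order(abs(a), decreasing = TRUE)
    colnames(x)[top.genes[1:5]]                  # candidate marker genes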
[Figure 7 here. Left panel: lambda selection (p = 3571), plot of tr(W)/n vs. lambda; right panel: 2D projected data, p = 3571, lambda = 0.7. Plot symbols as in Figure 4.]

Figure 7. $\lambda$ selection for the leukemia data (p = 3571): plot of $S(\lambda)$ vs. $\lambda$, and the 2-dimensional projected data using the PDA index with $\lambda = 0.7$.
To see how the projection coefficients from the PDA index and the BW values are related, we start from the leukemia data (p = 3571, n = 72) with two classes, AML and ALL. Figure 8 shows the plot of the BW values against the projection coefficients from the PDA projection method with $\lambda = 0.9$. For most genes, the BW values are less than 0.5 and the projection coefficients lie between -0.04 and 0.04. Most genes with large BW values have projection coefficients larger than 0.04 or smaller than -0.04. On the other hand, the BW values of the genes with high projection coefficients are spread out very widely; some of them are less than 0.5.
Table 3. Comparison between BW values and projection coefficients.

    From large projection coeff.                From large BW
    genes              BW     PP coeff.     genes       BW     PP coeff.
    U34877_at          0.16   -0.06         M84526_at   3.01   -0.06
    M27891_at          2.66   -0.06         M27891_at   2.66   -0.06
    M84526_at          3.01   -0.06         U46499_at   2.54   -0.05
    X95735_at          1.80   -0.06         M23197_at   2.29   -0.04
    HG1612-HT1612_at   1.03    0.06         X95735_at   1.80   -0.06
[Figure 8 here. Scatterplot of PP coefficient (y-axis, -0.06 to 0.06) vs. BW (x-axis, 0.0 to 3.0).]

Figure 8. BW values vs. PDA ($\lambda = 0.9$) projection coefficients: leukemia data with two classes, AML and ALL (p = 3571, n = 72).

We select the 5 genes with the largest BW values (M84526_at, M27891_at, U46499_at, M23197_at, X95735_at) and the 5 genes with the largest projection coefficients (U34877_at, M27891_at, M84526_at, X95735_at, HG1612-HT1612_at), and compare them.
Table 3 shows the BW values and projection coefficients for each gene. All genes selected by large projection coefficients also have large BW values except one, gene U34877_at; all genes selected by large BW values also have large projection coefficients. Figure 9 shows the scatter plot matrices of the 5 genes from large BW values and of the 5 genes from large projection coefficients. The 5 genes from large BW values show quite separable group means, but no pair of these genes is clearly separable; at least one or two cases are misclassified if we use one separating hyperplane. On the other hand, the 5 genes from large projection coefficients show more mixed structure, but in the plot of X95735_at against HG1612-HT1612_at we can find a separating hyperplane by which the two classes are clearly separable.
Whenever we optimize the PDA index, we can get different projection coefficients, especially when we use a very large number of variables. This comes from the curse of dimensionality: because most of a high-dimensional space is empty, especially when we have a small number of observations, we can separate the classes in many different ways. Therefore, with the projection pursuit method, we can explore various projections that reveal separated class structure. To show how this works, we choose another projection by maximizing the PDA index ($\lambda = 0.9$) and select the 10 genes with the largest projection coefficients.

[Figure 9 here. Scatter plot matrices: (a) M84526_at, M27891_at, U46499_at, M23197_at, X95735_at; (b) U34877_at, M27891_at, M84526_at, X95735_at, HG1612-HT1612_at.]

Figure 9. Scatter plot matrices: 5 genes from large BW values and 5 genes from large projection coefficients with the PDA index ($\lambda = 0.9$): leukemia data with two classes, AML (○) and ALL (+) (p = 3571, n = 72).
Table 4 summarizes the 10 genes with the largest projection coefficients. All 10 of these genes have small BW values (less than 1), and some have very small BW values, even less than 0.1.
Table 4. Comparison between BW values and projection coefficients.

    genes              BW       PP coeff.
    M82809_at          0.1241   -0.0673
    X51521_at          0.5920    0.0563
    S68616_at          0.0932   -0.0555
    L02426_at          0.0005   -0.0554
    AF006087_at        0.0187    0.0543
    U30255_at          0.3871   -0.0541
    U41654_at          0.1560    0.0525
    M84371_rna1_s_at   0.7359    0.0522
    L10373_at          0.2444    0.0519
    U10868_at          0.4401   -0.0518
Figure 10 shows the histograms of the data projected onto the LDA optimal projection using the genes selected by large BW values and by large projection coefficients (Table 4). In Figure 10 (a-1), the 5 genes from large BW values separate AML and ALL clearly except for 5 cases (2 cases in ALL and 3 cases in AML). As we increase the number of genes up to 10, the two groups become more separable, but with 10 genes from large BW values we still have a misclassified case.

Figure 10 (b-1) is the histogram of the data projected onto the LDA optimal projection with the 5 genes selected by large projection coefficients. Many cases are misclassified. However, as we increase the number of genes up to 10, the separation of the two classes improves very quickly: with 10 genes from large projection coefficients, the two classes are clearly separable.
As we can see in Figure 10, beyond a certain point adding more genes does not help much to separate the classes. As a guideline for deciding the number of selected genes, we recommend using the LDA index value. Figure 11 shows plots of the optimal LDA index value versus the number of genes. In Figure 11 (a), we use the same projection coefficients as in Table 3 and optimize the 1-dimensional LDA index over the selected genes. After p = 5, the LDA index value does not increase much; therefore the 5 selected genes in Table 3 are enough to separate AML and ALL. In Figure 11 (b), genes are selected by large BW values; after one or two genes are selected, the LDA index value increases very slowly. Figure 11 (c) is the LDA index plot for the genes selected by large projection coefficients, as used for Table 4. As expected from their very low BW values, the LDA index values are very low when p is small; but as p increases, the LDA index value increases rapidly, and after p = 12 it stays steady.
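A sketch of this guideline, assuming `ranked` holds gene indices sorted by absolute projection coefficient and reusing the helpers sketched earlier ($\lambda = 0$ gives the LDA index):

    ## Optimal 1D LDA index as genes are added; stop where the curve levels off
    index.path <- sapply(1:20, function(m) {
      xs <- x[, ranked[1:m], drop = FALSE]
      a  <- optim.proj(xs, y, k = 1, lambda = 0)
      bw <- between.within(xs, y)
      pda.index(a, bw$B, bw$W, lambda = 0, n = nrow(xs))
    })
    plot(1:20, index.path, type = "b",
         xlab = "number of genes", ylab = "1D LDA index")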
[Figure 10 here. Panels (a-1)-(a-4): 5, 7, 9, and 10 genes from large BW values; panels (b-1)-(b-4): 5, 7, 9, and 10 genes from large projection coefficients. Labels: 1 = AML, 2 = ALL.]

Figure 10. 1D optimal projections using the LDA index with various numbers of genes selected by BW values and by projection coefficients: leukemia data with two classes.
5 Discussion
We have looked at the problems of high-dimensional spaces, especially with a small number of observations, and have proposed a new projection pursuit index that adds a penalty term for high dimensionality or multicollinearity.

The PDA index works well to separate classes in a reasonable manner when the data have multicollinearity or very high dimensionality relative to the sample size. To use the PDA index, we need to choose $\lambda$. In the original PDA, cross-validation can be used to select $\lambda$; but the main purpose of projection pursuit is exploratory data analysis, so cross-validation is not a good approach for selecting $\lambda$ for our PDA index. This is the main reason we keep $\lambda$ in [0, 1). One guideline is to use a larger $\lambda$ for larger p.

The PDA index can also be used to select important variables that are helpful for separating classes. In gene expression data analysis, this application is useful for selecting important genes that behave differently in each class. It can be extended to cluster genes.
[Figure 11 here. Three panels of 1D LDA index value vs. number of genes: (a) genes from Table 3, leveling off after p = 5; (b) genes from large BW values; (c) genes from Table 4, leveling off after p = 12.]

Figure 11. Plots of the 1D optimal LDA index value vs. the number of selected genes: (a) genes from Table 3, (b) genes from large BW values, (c) genes from Table 4.
To optimize the PDA index, we used the modified simulated annealing method (Lee, 2003). We used the R language for this research, and the PDA index is included in the classPP package (available at CRAN).
References

[1] Duda, R. O., and Hart, P. E. (1973). Pattern Classification and Scene Analysis, John Wiley and Sons.
[2] Dudoit, S., Fridlyand, J., and Speed, T. P. (2002). "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data," Journal of the American Statistical Association, 97, 77-87.
[3] Friedman, J. (1987). "Exploratory Projection Pursuit," Journal of the American Statistical Association, 82, 249-266.
[4] Friedman, J. (1989). "Regularized Discriminant Analysis," Journal of the American Statistical Association, 84, 165-175.
[5] Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S. (1999). "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring," Science, 286, 531-537.
[6] Hastie, T., Buja, A., and Tibshirani, R. (1994). "Flexible Discriminant Analysis by Optimal Scoring," Journal of the American Statistical Association, 89, 1255-1270.
[7] Hastie, T., Buja, A., and Tibshirani, R. (1995). "Penalized Discriminant Analysis," Annals of Statistics, 23, 73-102.
[8] Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning, Springer.
[9] Johnson, R. A., and Wichern, D. W. (1998). Applied Multivariate Statistical Analysis, 4th edition, Prentice-Hall, New Jersey.
[10] Lee, E. (2003). "Projection Pursuit for Exploratory Supervised Classification," Ph.D. thesis, Department of Statistics, Iowa State University.
[11] Lee, E., Cook, D., Klinke, S., and Lumley, T. (2004). "Projection Pursuit for Exploratory Supervised Classification," ISU Preprint.
[12] Marron, J. S., and Todd, M. (2002). "Distance Weighted Discrimination," Optimization Online Digest, July 2002.
[13] Ripley, B. D. (1996). Pattern Recognition and Neural Networks, Cambridge University Press.