IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 5, NO. 3, JULY-SEPTEMBER 2008

Investigating the Efficacy of Nonlinear Dimensionality Reduction Schemes in Classifying Gene and Protein Expression Studies
George Lee, Carlos Rodriguez, and Anant Madabhushi
Abstract—The recent explosion in procurement and availability of high-dimensional gene and protein expression profile data sets for
cancer diagnostics has necessitated the development of sophisticated machine learning tools with which to analyze them. While some
investigators are focused on identifying informative genes and proteins that play a role in specific diseases, other researchers have
attempted instead to classify patients based on their expression profiles in order to prognosticate disease status. A major limitation in the ability to
accurately classify these high-dimensional data sets stems from the “curse of dimensionality,” occurring in situations where the number
of genes or peptides significantly exceeds the total number of patient samples. Previous attempts at dealing with this issue have mostly
centered on the use of a dimensionality reduction (DR) scheme, Principal Component Analysis (PCA), to obtain a low-dimensional
projection of the high-dimensional data. However, linear PCA and other linear DR methods, which rely on euclidean distances to
estimate object similarity, do not account for the inherent underlying nonlinear structure associated with most biomedical data. While
some researchers have begun to explore nonlinear DR methods for computer vision problems such as face detection and recognition,
to the best of our knowledge, few such attempts have been made for classification and visualization of high-dimensional biomedical
data. The motivation behind this work is to identify the appropriate DR methods for analysis of high-dimensional gene and protein
expression studies. Toward this end, we empirically and rigorously compare three nonlinear (Isomap, Locally Linear Embedding, and
Laplacian Eigenmaps) and three linear DR schemes (PCA, Linear Discriminant Analysis, and Multidimensional Scaling) with the intent
of determining a reduced subspace representation in which the individual object classes are more easily discriminable. Owing to the
inherent nonlinear structure of gene and protein expression studies, our claim is that the nonlinear DR methods provide a more truthful
low-dimensional representation of the data compared to the linear DR schemes. Evaluation of the DR schemes was done by
1) assessing the discriminability of two supervised classifiers (Support Vector Machine and C4.5 Decision Trees) in the different low-dimensional data embeddings and 2) using five cluster validity measures to evaluate the size, distance, and tightness of object aggregates in
the low-dimensional space. For each of the seven evaluation measures considered, statistically significant improvement in the quality
of the embeddings across 10 cancer data sets via the use of three nonlinear DR schemes over three linear DR techniques was
observed. Similar trends were observed when linear and nonlinear DR was applied to the high-dimensional data following feature
pruning to isolate the most informative features. Qualitative evaluation of the low-dimensional data embedding obtained via the six DR
methods further suggests that the nonlinear schemes are better able to identify potential novel classes (e.g., cancer subtypes) within
the data.
Index Terms—Dimensionality reduction, bioinformatics, data clustering, data visualization, machine learning, manifold learning,
nonlinear dimensionality reduction, gene expression, proteomics, prostate cancer, lung cancer, ovarian cancer, Principal Component
Analysis (PCA), Linear Discriminant Analysis, Multidimensional Scaling, Isomap, Locally Linear Embedding (LLE), Laplacian
Eigenmaps, classification, support vector machine, decision trees.
1 INTRODUCTION

Gene expression profiling and protein expression profiling have emerged as promising new methods for disease prognostication [1], [2], [3]. Attempts at analyzing several-thousand-dimensional gene and protein profiles
- G. Lee and A. Madabhushi are with the Department of Biomedical Engineering, Rutgers, The State University of New Jersey, 599 Taylor Road, Piscataway, NJ 08854. E-mail: geolee@eden.rutgers.edu, anantm@rci.rutgers.edu.
- C. Rodriguez is with Evri, Seattle, WA. E-mail: carlos@evri.com.
Manuscript received 14 Aug. 2007; revised 5 Dec. 2007; accepted 20 Mar.
2008; published online 10 Apr. 2008.
For information on obtaining reprints of this article, please send e-mail to:
tcbb@computer.org, and reference IEEECS Log Number
TCBBSI-2007-08-0100.
Digital Object Identifier no. 10.1109/TCBB.2008.36.
have been primarily motivated by two factors: 1) identification of individual informative genes and proteins responsible for disease characterization [4], [5], [6], [7] and 2) classification of patients into different disease cohorts [8], [9], [10],
classify patients into different disease cohorts [8], [9], [10],
[11], [12], [13], [14]. Several researchers involved in the latter
area have attempted to use different classification methods
to stratify patients based on their gene and protein
expression profiles into different categories [8], [9], [10],
[15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26],
[27]. While the availability of studies continues to grow,
most protein and gene expression databases contain no
more than a few thousand patient samples. Thus, the task of
stratifying these patients based on the gene/protein profile
is subject to the “curse of dimensionality” problem [28],
[29], owing to the relatively small number of patient
samples compared to the size of the feature space.
Classification of the new unseen test samples is thus poor
due to the sparseness of data in the high-dimensional
feature space. Additionally, many of the features within the
expression profile may be noninformative or redundant,
providing little additional class discriminatory information
[11], [12] while increasing computing time and classifier
complexity. In order to bridge the gap between the number
of patient samples and gene/peptide features and overcome the curse of dimensionality problem, researchers have
proposed 1) feature selection and 2) dimensionality reduction (DR) to reduce the size of the feature space.
Feature selection refers to the identification of the most
informative features and has been commonly utilized to
precede classification in gene and protein expression
studies [11], [14], [30]. However, since a typical gene or
protein microarray records expressions from thousands of
genes or proteins, the cost of finding an optimal informative
subset from several million possible combinations becomes
a near intractable problem. Further, genes or peptides that
were pruned during the feature selection process may be
significant in stratifying intraclass subtypes.
DR refers to a class of methods that transforms the high-dimensional data into a reduced subspace to represent data
in far fewer dimensions. In Principal Component Analysis
(PCA), a linear DR method, the reduced dimensional data is
arranged along the principal eigenvectors, which represent
the direction along which the greatest variability of the data
occurs [31]. Note that unlike with feature selection, the
samples in the transformed embedding subspace no longer
represent specific gene and protein expressions from the
original high-dimensional space but rather encapsulate data
similarities in low-dimensional space. Even though the
objects in the transformed embedding space are divorced
from their original biological meaning, the organization and
arrangement of the patient samples in low-dimensional
embedding space lends itself to data visualization and
classification. Thus, if two patient samples from a specific
disease cohort are mapped adjacent to each other in an
embedding space derived from their respective high-dimensional expression profiles, then it suggests that the
two patients have a similar disease condition. By exploiting
the entire high-dimensional space, DR methods, unlike
feature selection, offer the opportunity to stratify the data
into subclasses (e.g., novel cancer subtypes).
The most popular method for DR for bioinformatics-related applications has been PCA [3], [32], [33], [34], [35],
[36], [37], [38]. Originally developed by Hotelling [39],
PCA finds orthogonal eigenvectors along which the greatest amount of variability in the data lies. The underlying
intuition behind PCA is that the data is linear and that the
embedded eigenvectors represent low-dimensional projections of linear relationships between data points in high-dimensional space. Linear Discriminant Analysis (LDA)
[31], also known as Fisher Discriminant Analysis, is
another linear DR scheme, which incorporates data label
information to find data projections that separate the data
into distinct clusters. Multidimensional Scaling (MDS) [40]
reduces data dimensionality by preserving the least
squares Euclidean distance in the low-dimensional space.
Classifier performance with linear DR schemes for
biomedical data has been a mixed bag. Dawson et al.
[34] found that there were biologically significant elements
of the gene expression profile that were not seen with
linear MDS. Ye et al. [29] found that LDA gave poor
results in distinguishing disease classes on a cohort of nine
gene expression studies. Truntzer et al. [35] also found
limited use of LDA and PCA for classifying gene and
protein expression profiles of a diffuse large B-cell
lymphoma data set since the classes appeared to be
linearly inseparable. The aforementioned results appear to
suggest that biomedical data has a nonlinear underlying
structure [34], [35] and that DR methods that do not
impose linear constraints in computing the data projection
might be more appropriate compared to PCA, MDS, and
LDA for classification and visualization of data classes in
gene and protein expression profiles.
Recently, nonlinear DR methods such as Spectral Clustering [41], Isometric mapping (Isomap) [42], Locally Linear
Embedding (LLE) [43], and Laplacian Eigenmaps (LEM) [44]
have been developed to reduce data dimensionality without
assuming a euclidean relationship between data samples in
the high-dimensional space. Shi and Malik’s Spectral
Clustering algorithm (also known as Graph Embedding
[41]) builds upon graph theory to partition the graph into
clusters and separate them accordingly. Madabhushi et al.
[45] demonstrated the use of graph embedding to detect the
presence of new tissue classes on high-dimensional prostate
MRI studies. The utility of this scheme has also recently been
demonstrated in distinguishing between cancerous and
benign magnetic resonance spectra (MRS) in the prostate
[46] and in discriminating between different cancer grades
on digitized tissue histopathology [47], [48]. Tenenbaum
et al. [42] presented the Isomap algorithm for nonlinear DR
and described the term “manifold” for machine learning as a
nonlinear surface embedded in high-dimensional space
along which dissimilarities between data points are best
represented. The Isomap algorithm estimates geodesic
distances, defined as the distance between two points along
the manifold, and preserves the nonlinear geodesic distances
(as opposed to euclidean distances used in linear methods)
while projecting the data onto a low-dimensional space. LLE
proposed by Roweis and Saul [43] uses local weights to
preserve local geometry in order to find the global nonlinear
manifold structure of the data. The geodesic distance
between data points is approximated by assuming that the
data is locally linear. Recently, Belkin and Niyogi presented
the LEM algorithm [44], which like Spectral Clustering,
Isomap, and LLE, makes local connections but uses the
Laplacian to simplify determination of the locality preserving weights used to obtain the low-dimensional data
embeddings. Graph Embedding, LLE, Isomaps, and LEM,
all aim to nonlinearly project the high-dimensional data in
such a way that two objects x_a and x_b that lie adjacent to each other on the manifold are adjacent to each other in the low-dimensional embedding space, and likewise, two objects
that are distant from each other on the manifold are far apart
in the low-dimensional embedding space. As previously
demonstrated by Tenenbaum et al. [42], Fig. 1 reveals the
limitations of using a linear DR for highly nonlinear data.
Fig. 1. (a) Nonlinear manifold structure of the Swiss roll data set [42]. Labels from two classes (shown with red circles and green squares) are
provided to show the distribution of data along the manifold. (b) The low-dimensional embedding obtained via linear MDS on the Swiss roll reveals a
high degree of overlap between samples from the two classes due to the use of euclidean distance as a dissimilarity metric. (c) The embedding obtained via LEM, on the other hand, is able to almost perfectly distinguish the two classes by projecting the data in terms of geodesic distance determined
along the manifold.
Fig. 1 shows the embedding of the Swiss roll data set shown
in Fig. 1a obtained by a linear DR method (MDS) in Fig. 1b
and a nonlinear DR scheme (LEM) in Fig. 1c. MDS, which
preserves euclidean distances, is unable to capture the
nonlinear manifold structure of the Swiss roll, but LEM is
capable of learning the shape of the manifold and representing points in the low-dimensional embedding by estimating
geodesic distances. Thus, while MDS (Fig. 1b) shows overlap
between the two classes that lie along the Swiss roll, LEM
(Fig. 1c) provides an unraveled Swiss roll that separates the
data classes in 2D embedding space.
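To make the contrast in Fig. 1 concrete, the short sketch below embeds a synthetic Swiss roll with a linear and a nonlinear method. It is only an illustration using the scikit-learn library, which the paper does not reference; SpectralEmbedding plays the role of LEM, and the sample size, noise level, and neighborhood size are arbitrary choices.

import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import MDS, SpectralEmbedding

# Swiss roll data set; t parameterizes position along the manifold.
X, t = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)
labels = (t > np.median(t)).astype(int)  # two synthetic classes along the roll

# Linear DR: MDS on euclidean distances mixes the two classes.
mds_embedding = MDS(n_components=2, random_state=0).fit_transform(X)

# Nonlinear DR: Laplacian-eigenmaps-style spectral embedding on a kNN graph
# "unravels" the roll so the classes separate along the first coordinate.
lem_embedding = SpectralEmbedding(n_components=2, n_neighbors=10).fit_transform(X)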
While PCA remains the most popular DR method for
bioinformatics-related applications [32], [34], [35], [36], [37],
[38], nonlinear DR methods have begun to gain popularity
[3], [30], [45], [49]. Liu et al. [30] found high classification
accuracy in the use of kernel PCA (nonlinear variant of
PCA) for gene expression data sets while Weng et al. [49]
recommended the use of Isomap for medical data analysis.
Shi and Chen [3] found that LLE outperformed PCA in
classifying three gene expression cancer studies. Dawson
et al. [34] compared Isomap, PCA, and linear MDS for
oligonucleotide data sets, and Nilsson et al. [50] compared
Isomap with MDS in terms of their ability to reveal
structures in microarray data. In these and other related
studies [3], [34], [49], [50], the nonlinear methods were
found to outperform linear DR schemes. While several
researchers have performed comparative studies of classifier methods [51], [52], [53] to determine the optimal scheme
for various applications, to the best of our knowledge, no
comprehensive comparative study of different nonlinear
and linear DR schemes in terms of their ability to
discriminate between samples has been attempted thus far.
The primary motivation of this work is to systematically
and quantitatively compare and evaluate the performance
of six DR methods: three linear methods (PCA, LDA [31],
and linear MDS [40]) and three nonlinear DR methods
(Isomap [42], LLE [43], and LEM [44]) in terms of their
ability to faithfully represent the underlying structure of
biomedical data. A total of 10 different binary class gene
and protein expression studies corresponding to prostate,
lung, breast, gliomal, and ovarian cancers, as well as
leukemia and lymphoma are considered in this comparative
study. Low-dimensional data embeddings of the cancer
studies obtained from each of the six DR methods are
evaluated in two ways. First, the low-dimensional data
embeddings for each data set are compared in terms of
classifier accuracy evaluated via a support vector machine
(SVM) and a decision tree (C4.5) classifier. The intuition
behind the use of classifiers is that if the embedding
produced by a particular DR method accurately captures
the structure of the data manifold, then x_a, x_b, belonging to different classes in the high-dimensional data set D, will have low-dimensional embedding coordinates G(x_a), G(x_b) far apart from each other. Thus, if the underlying
structure of the data has been faithfully reconstructed by the
DR method, then the task of discriminating between objects
from different classes becomes trivial (i.e., a linear classifier
would suffice). Note that a more complex classifier with a
nonlinear separating hyperplane could potentially distinguish objects from different classes in an embedding space
that does not represent a faithful reconstruction of the
original multidimensional manifold. However, the emphasis in this work is not in identifying the optimal classification scheme but rather in identifying the DR method that
can provide the optimal low-dimensional representations so
that the task of discriminating different object classes
becomes trivial. The role of classifiers in this work only
serves to quantitatively evaluate the quality of the data
embeddings. In addition to the use of two classifiers, we
also consider five different cluster measures to evaluate the
low-dimensional data representation. The intuition behind
the use of the cluster validity measures is that in the optimal
low-dimensional data representation, objects x_a ∈ D with associated class label Y(x_a) = +1 and all objects x_b ∈ D with Y(x_b) = -1 will form two distinct, tight, and well-separated
clusters.
The organization of the rest of this paper is given as
follows: In Section 2, we provide an overview of the six
DR methods compared in this paper. In Section 3, the
experimental setup for quantitatively comparing the linear
and nonlinear DR schemes is described. Quantitative and qualitative results and accompanying discussion are presented in Section 4. Finally, concluding remarks are
presented in Section 5.
TABLE 1
List of Frequently Appearing Symbols and Notations in This Paper
2 OVERVIEW OF DIMENSIONALITY REDUCTION METHODS
2.1 Terminology
A total of 10 binary gene and protein expression data sets and one multiclass data set D_j, j ∈ {1, 2, ..., 11}, were considered in this study. Each D_j = {x_1, x_2, ..., x_n} is represented by an n × M dimensional matrix of n samples x_i, i ∈ {1, 2, ..., n}. Each x_i ∈ D_j has an associated class label Y(x_i) ∈ {+1, -1} and an M-dimensional feature vector F(x_i) = [f_u(x_i) | u ∈ {1, 2, ..., M}], where f_u(x_i) represents the gene or protein expression value associated with x_i. Following application of a DR method Φ, where Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}, the individual data points x_i ∈ D are represented by an m-dimensional embedding vector G_Φ(x_i) = [g_v(x_i) | v ∈ {1, 2, ..., m}], where g_v(x_i) represents the embedding coordinate along the vth principal eigenvector of x_i in the m-dimensional embedding space. Table 1 lists notation and symbols used frequently in this paper.
2.2 Linear Dimensionality Reduction Methods
2.2.1 Principal Component Analysis (PCA)
PCA is widely used to visualize high-dimensional data and
discern relationships by finding orthogonal axes that
contain the greatest amount of variance in the data [39].
These orthogonal eigenvectors corresponding to the largest
eigenvalues are called “principal components” and are
obtained in the following manner. Each data point x_i ∈ D is first centered by subtracting the mean of all the features for each observation x_i from its original feature value f_u(x_i), as shown in

\hat{f}_u(x_i) = f_u(x_i) - \frac{1}{n} \sum_{i=1}^{n} f_u(x_i),    (1)

for u ∈ {1, 2, ..., M}. From the centered feature values \hat{f}_u(x_i) for each x_i ∈ D, a new n × M matrix Y is constructed. The matrix Y is then decomposed into its corresponding singular values as shown in

Y = U \Sigma V^T,    (2)

where, via singular value decomposition, a diagonal matrix Σ containing the eigenvalues of the principal components, an n × n left singular matrix U, and an M × n matrix V are obtained. The eigenvalues in Σ represent the amount of variance along each eigenvector g_v^{PCA}, v ∈ {1, 2, ..., m}, in matrix V^T and are used to rank the corresponding eigenvectors in order of greatest variance. Thus, the first m eigenvectors are retained, as they contain the most variance in the data, while the remaining eigenvectors are discarded, so each data sample x_i ∈ D is now described by an m-dimensional embedding vector G_{PCA}(x_i).
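A minimal numpy sketch of the PCA procedure just described (mean centering as in (1), singular value decomposition as in (2), and retention of the top m components) is given below for illustration; it is not the authors' implementation, and the data matrix and m are placeholders.

import numpy as np

def pca_embedding(D, m):
    """Project the rows of an n x M data matrix D onto its first m principal components."""
    # Center each feature by subtracting its mean over all samples, as in (1).
    Y = D - D.mean(axis=0)
    # Singular value decomposition, as in (2); rows of Vt are the principal eigenvectors,
    # already ordered by decreasing singular value (i.e., decreasing variance).
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)
    return Y @ Vt[:m].T  # n x m matrix of embedding vectors G_PCA(x_i)

# Example: 50 samples with 2,000 features reduced to a 3D embedding.
rng = np.random.default_rng(0)
print(pca_embedding(rng.normal(size=(50, 2000)), m=3).shape)  # (50, 3)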
2.2.2 Linear Discriminant Analysis (LDA)
LDA [31] takes into account class labels to find eigenvectors that can discriminate between the two classes {+1, -1} in a data set. The intraclass scatter matrix S_W and interclass scatter matrix S_B [31] are computed from the sample means for the +1 and -1 data clusters, given by \mu_+ = \frac{1}{n_+} \sum_a F(x_a) and \mu_- = \frac{1}{n_-} \sum_b F(x_b), respectively, where for x_a ∈ D, Y(x_a) = +1 and for x_b ∈ D, Y(x_b) = -1. Note that both \mu_+ and \mu_- are M-dimensional vectors. From the sample means,

S_W = \sum_{x_a \in D,\, Y(x_a) = +1} (F(x_a) - \mu_+)(F(x_a) - \mu_+)^T + \sum_{x_b \in D,\, Y(x_b) = -1} (F(x_b) - \mu_-)(F(x_b) - \mu_-)^T    (3)

and

S_B = (\mu_+ - \mu_-)(\mu_+ - \mu_-)^T    (4)

are calculated. S_W and S_B are then used to create the eigenvectors by singular value decomposition:

U \Sigma V^T = S_W^{-1} S_B.    (5)

As with PCA, each data point x_i ∈ D is now represented by an m-dimensional vector G_{LDA}(x_i) corresponding to the m largest eigenvalues in Σ. While LDA has often been used as both a DR method and a classifier, it is limited in handling sparse data and data sets for which the Gaussian distribution assumption is not valid [29].
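The two-class LDA projection of (3)-(5) can be sketched as follows. This is an illustrative reconstruction rather than the authors' code; a pseudo-inverse is used because S_W is singular when the number of features greatly exceeds the number of samples, and the direct M x M scatter matrices are only practical for moderate feature counts (e.g., after feature pruning).

import numpy as np

def lda_embedding(D, y, m=1):
    """Two-class LDA embedding: rows of D are samples, y holds labels in {+1, -1}."""
    mu_pos, mu_neg = D[y == +1].mean(axis=0), D[y == -1].mean(axis=0)
    # Intraclass scatter S_W, as in (3).
    Xp, Xn = D[y == +1] - mu_pos, D[y == -1] - mu_neg
    S_W = Xp.T @ Xp + Xn.T @ Xn
    # Interclass scatter S_B, as in (4).
    diff = (mu_pos - mu_neg)[:, None]
    S_B = diff @ diff.T
    # Eigenvectors of pinv(S_W) S_B, as in (5); pinv guards against a singular S_W.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(-eigvals.real)[:m]
    return D @ eigvecs[:, order].real  # n x m embedding vectors G_LDA(x_i)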
2.2.3 Classical Multidimensional Scaling (MDS)
MDS [40] is implemented as a linear method that preserves the euclidean geometry between each pair of M-dimensional points x_a, x_b ∈ D, which is arranged into a symmetric n × n distance matrix Δ, as shown in

\Delta(x_a, x_b) = \| F(x_a) - F(x_b) \|_2,    (6)

where \| \cdot \|_2 represents the euclidean norm. MDS finds optimal positions for the data points x_a, x_b in m-dimensional space through minimization of the least squares error in the input pairwise euclidean distances between x_a and x_b [40]. Note that classical MDS differs from nonlinear variants of MDS such as nonmetric MDS [40], which do not preserve input euclidean distances.
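The classical MDS computation referenced above is commonly implemented by double-centering the squared euclidean distance matrix of (6) and embedding with its top eigenvectors. The numpy/scipy sketch below follows that standard formulation for illustration; it is assumed, not stated in the paper, that this Torgerson-style variant is the one used.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def classical_mds(D, m=2):
    """Classical MDS of an n x M data matrix D into m dimensions."""
    delta = squareform(pdist(D, metric="euclidean"))   # n x n distance matrix, as in (6)
    n = delta.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n                # centering matrix
    B = -0.5 * J @ (delta ** 2) @ J                    # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)               # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:m]                # keep the m largest
    return eigvecs[:, idx] * np.sqrt(np.clip(eigvals[idx], 0, None))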
2.3 Nonlinear Dimensionality Reduction Methods
2.3.1 Isometric Mapping (Isomap (ISO))
The Isomap algorithm [42] modifies classical MDS to handle nonlinearities in the data through the use of a neighborhood mapping. By creating linear connections from each point x_i ∈ D to its κ closest neighbors in euclidean space, a manifold representation of the data is constructed, κ being a user-defined parameter. Nonlinear connections between points outside of the κ neighborhood are approximated by calculating the shortest distance between two points x_a, x_b ∈ D along the paths in the neighborhood map, where a, b ∈ {1, 2, ..., n}. Thus, new geodesic distances (distances measured along the surface of the manifold) are calculated and arranged in an n × n pairwise distance matrix Δ, where Δ(x_a, x_b) contains the nonlinear geodesic distance between x_a, x_b ∈ D. The matrix Δ is then given as an input to the classical MDS algorithm, from which each data point x_i ∈ D, i ∈ {1, 2, ..., n}, is represented by its m-dimensional embedding vector G_{ISO}(x_i).
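The Isomap pipeline described above (κ-nearest-neighbor graph, graph shortest paths as geodesic distances, then classical MDS) is sketched below for illustration. The neighborhood size and the assumption of a connected neighborhood graph are placeholders; the authors' implementation details are not specified here.

import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap_embedding(D, kappa=7, m=2):
    """Isomap: geodesic distances over a kappa-NN graph followed by classical MDS."""
    # Symmetrized kappa-nearest-neighbor graph with euclidean edge weights.
    knn = kneighbors_graph(D, n_neighbors=kappa, mode="distance")
    knn = knn.maximum(knn.T)
    # Geodesic distances = shortest paths through the neighborhood graph
    # (assumes the graph is connected, otherwise infinite distances appear).
    geo = shortest_path(knn, directed=False)
    # Classical MDS on the geodesic distance matrix.
    n = geo.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (geo ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:m]
    return eigvecs[:, idx] * np.sqrt(np.clip(eigvals[idx], 0, None))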
2.3.2 Locally Linear Embedding (LLE)
LLE [43], like the Isomap algorithm [42], utilizes a neighborhood map connecting each data sample x_i to its κ nearest neighbors in euclidean space. However, instead of calculating manifold distances, LLE describes each x_i in terms of its κ closest neighbors x_a. Thus, for each x_i, an M × κ matrix Z containing the centered features \hat{f}_u(x_a) = f_u(x_a) - f_u(x_i) of its neighbors is obtained. To describe the local geometry for each x_i, linear coefficients accounting for the location of x_i relative to each x_a are optimized by solving for the κ-dimensional weight vector w(x_i, x_a) via the linear system

Z^T Z w = I^T,    (7)

where I is a column vector of ones of length κ. From each of the n weight vectors w, the n × n matrix W stores the linear coefficients of each x_i and x_a in W(x_i, x_a), with W(x_i, x_b) = 0 whenever x_b is not among the κ nearest neighbors of x_i ∈ D. A cost matrix Λ is then computed from the weight matrix W as

\Lambda = (I - W)^T (I - W),    (8)

where I here denotes the n × n identity matrix. Singular value decomposition is used to obtain the m-dimensional embedding vector G_{LLE}(x_i) for each x_i ∈ D from the cost matrix Λ.
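In practice the reconstruction-weight and cost-matrix computations of (7) and (8) are available in standard libraries; the sketch below uses scikit-learn's LocallyLinearEmbedding as a stand-in, not the authors' implementation. The neighborhood size and the random data matrix are placeholders.

import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

# Placeholder expression matrix: 60 samples x 5,000 features.
rng = np.random.default_rng(0)
D = rng.normal(size=(60, 5000))

# kappa nearest neighbors define the local patches assumed to be linear.
lle = LocallyLinearEmbedding(n_neighbors=7, n_components=3)
G_lle = lle.fit_transform(D)   # 60 x 3 embedding vectors G_LLE(x_i)
print(G_lle.shape)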
2.3.3 Laplacian Eigenmaps (LEM)
The LEM [44] algorithm, similar to LLE and Isomap, establishes a locally linear mapping by connecting each point x_i ∈ D to its κ nearest neighbors. Weights are assigned between each pair of points to form an n × n symmetric weight matrix W̃, where W̃(x_i, x_a) = 1 when x_a is a nearest neighbor of x_i ∈ D, and W̃(x_i, x_b) = 0 when x_b is not a nearest neighbor of x_i ∈ D. From the weight matrix W̃ and a diagonal matrix of column sums D(i, i) = \sum_j W̃(i, j), for all i ∈ {1, 2, ..., n}, a symmetric positive semidefinite matrix L, called the Laplacian, is calculated as

L = D - W̃.    (9)

Singular value decomposition (2) is then used to obtain the m-dimensional embedding vector G_{LEM}(x_i) for each x_i ∈ D from the Laplacian L.
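A small numpy sketch of the Laplacian eigenmaps construction in (9) follows: binary κ-nearest-neighbor weights, a diagonal degree matrix, the graph Laplacian, and the bottom nontrivial eigenvectors as embedding coordinates. It is a conventional eigendecomposition formulation offered only for illustration; κ and the data are placeholders.

import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenmaps(X, kappa=7, m=2):
    """Embed the rows of X using the graph Laplacian L = D - W of a kappa-NN graph."""
    # Binary, symmetrized adjacency: W(i, a) = 1 if a is among the kappa neighbors of i.
    W = kneighbors_graph(X, n_neighbors=kappa, mode="connectivity").toarray()
    W = np.maximum(W, W.T)
    Deg = np.diag(W.sum(axis=1))          # diagonal matrix of row/column sums
    L = Deg - W                           # graph Laplacian, as in (9)
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    # Skip the trivial constant eigenvector (eigenvalue ~ 0) and keep the next m.
    return eigvecs[:, 1:m + 1]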
3 EXPERIMENTAL DESIGN
The organization of this section is given as follows: In
Section 3.1, we provide a description of the data sets, followed by a brief outline of the methodology in Section 3.2. In Sections 3.3
and 3.4, we briefly describe the different qualitative and
quantitative evaluation measures we use for comparing the
performance of the DR methods.
3.1 Description of Data Sets
A total of 10 publicly available binary class [1], [7], [8], [16],
[17], [19], [24], [25], [26], [54] and one multiclass data set [32]
corresponding to high-dimensional gene and protein
expression studies were acquired for the purposes of this study. (The data sets were downloaded from the Biomedical Kent-Ridge Repositories at http://sdmc.lit.org.sg/GEDatasets/Datasets and http://sdmc.i2r.a-star.edu.sg/rp and the Gene Expression Omnibus (GEO) Repository at http://www.ncbi.nlm.nih.gov/geo/.) The two-class data sets correspond to gene and
protein expression profiles of normal and cancerous
samples for breast [7], colon [16], lung [26], ovarian [19],
and prostate [17] cancers, leukemia [1], [32], lymphoma [8],
[25], and glioma studies [24]. The multiclass data set
comprises five subtypes of leukemia. The sizes of the data
sets range from 30 to 253 patient samples, with the number
of corresponding features ranging from 791 to 54,675 genes
or peptides. Table 2 provides a description of all the data
sets that we considered including a description of the data
classes and the originating study for these data sets. Note
that for each study, the number of patient samples is
significantly smaller than the dimensionality of the feature
space. No preprocessing or normalization of any kind was
performed on the original feature space prior to DR. An
experiment was however performed to compare DR
performance with and without feature pruning on the
original high-dimensional studies.
3.2 Brief Outline of Methodology
Our methodology for investigating the embedded data
representation given by DR comprises four main steps, described briefly below and illustrated in the
flowchart in Fig. 2.
Step 1: DR. To evaluate and compare the low-dimensional data embeddings, we reduced the dimensionality of
the M-dimensional samples x_i ∈ D_j, j ∈ {1, 2, ..., 11}, via the six DR methods Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}. The resulting m-dimensional embedding vectors G_Φ(x_i) now represent the low-dimensional signatures for each x_i ∈ D_j and for each method Φ. Additionally, we obtain m-dimensional embedding vectors for feature-pruned samples x_i ∈ D_j, j ∈ {1, 2, ..., 11}, containing M̂ < M dimensional samples x_i, for each method Φ.

TABLE 2
Description of Gene Expression and Proteomic Spectra Data Sets Considered in This Study

Fig. 2. Flowchart showing the overall organization and process flow of our experimental design.
Step 2: Qualitative evaluation for novel class detection. In order to evaluate the presence of possible subclusters within the data, the dominant embedding coordinates g_1(x_i), g_2(x_i), g_3(x_i), for each method Φ and x_i ∈ D, were plotted against each other. The graphical plots reveal the m-dimensional embedding representations of the high-dimensional data via each of the six DR methods. On the eigenplots obtained for each DR scheme, potential subclasses are visually identified.
Step 3: Quantitative evaluation of DR performance via classifier accuracy. To evaluate the quality of the DR embeddings, two classifiers are trained using the low-dimensional embedding vectors G_Φ(x_i), for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM} and x_i ∈ D. For each D, the samples x_i ∈ D are divided into a training set S_j^Tr and a testing set S_j^t. Samples x_a ∈ S_j^Tr are used to train an SVM and a decision tree (C4.5) classifier (C^SVM, C^C4.5) in the embedding space defined by the embedding coordinates G_Φ(x_a), Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}, to distinguish between the different classes. Once the classifiers have been trained, they are applied to predict class labels C^SVM(G_Φ(x_b)), C^C4.5(G_Φ(x_b)) ∈ {+1, -1} for all x_b ∈ S_j^t for method Φ. The classifier predictions C^SVM(G_Φ(x_b)), C^C4.5(G_Φ(x_b)) are compared against the true object label Y(x_b) for x_b ∈ D to estimate the classifier accuracy, recorded for each DR scheme. The same procedure is repeated using the feature-pruned samples x_i ∈ D with M̂ < M dimensionality, following DR.
Step 4: Quantitative evaluation of DR performance via cluster validity measures. To compare the size, tightness, and separation of class clusters from the different DR methods, we first normalize the embedding space obtained via each of the six DR methods Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}. In this normalized embedding space, we calculate centroids
G̃_{Φ,+} and G̃_{Φ,-} corresponding to the +1 and -1 classes. From the centroids G̃_{Φ,+} and G̃_{Φ,-}, we measure the separation between clusters as well as the tightness of each cluster by measuring the distances of each x_i ∈ D to the corresponding class centroid. The same procedure is repeated following feature pruning.
3.3 Detailed Description of Experimental Design
3.3.1 Feature Pruning
A feature pruning step is employed to identify a set of informative features F̂(x_i) = [f_û(x_i) | û ∈ {1, 2, ..., M̂}], where M̂ < M, for each x_i ∈ D. The aim of feature pruning is to determine whether the trends in performance of the six DR methods considered in this study are similar when considering all features F(x_i) and when considering only the most informative features F̂(x_i). The feature pruning method based on t-statistics and described in [1] and [15] was considered. For all x_i ∈ D and for a specific gene or protein expression feature u ∈ {1, 2, ..., M}, the means μ_u^+, μ_u^- and variances (σ_u^+)^2, (σ_u^-)^2 of the expression levels for the +1 and -1 classes were computed. Hence,

\mu_u^+ = \frac{1}{n_+} \sum_{x_a \in D_j,\, Y(x_a) = +1} f_u(x_a),    (10)

(\sigma_u^-)^2 = \frac{1}{n_-} \sum_{x_b \in D_j,\, Y(x_b) = -1} \left( f_u(x_b) - \mu_u^- \right)^2,    (11)

with μ_u^- and (σ_u^+)^2 defined analogously. The values of μ_u^+, μ_u^-, (σ_u^+)^2, (σ_u^-)^2 were then used to calculate the information content of each gene or protein expression feature as

T(f_u) = \frac{\mu_u^+ - \mu_u^-}{\sqrt{ \frac{(\sigma_u^+)^2}{n_+} + \frac{(\sigma_u^-)^2}{n_-} }}.    (12)

The different features are then ranked in descending order based on their information content T(f_u). The top 10 percentile of the most informative features f_û, û ∈ {1, 2, ..., M̂}, where M̂ < M, are used to compute a second set of embeddings for each D_j, j ∈ {1, 2, ..., 11}, and Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}.
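The pruning step above reduces to computing a two-sample t-statistic per feature, as in (12), and keeping the top-ranked fraction. The numpy sketch below illustrates this; ranking by absolute value (so that features informative for either class survive) and the small constant added to the denominator are assumptions of this sketch, and the data matrix and labels are placeholders.

import numpy as np

def prune_features(D, y, keep_fraction=0.10):
    """Rank features by the t-statistic of (12) and keep the top fraction."""
    pos, neg = D[y == +1], D[y == -1]
    mean_pos, mean_neg = pos.mean(axis=0), neg.mean(axis=0)
    var_pos, var_neg = pos.var(axis=0), neg.var(axis=0)
    # Information content T(f_u) for every feature u; 1e-12 avoids division by zero.
    t_stat = (mean_pos - mean_neg) / np.sqrt(var_pos / len(pos) + var_neg / len(neg) + 1e-12)
    n_keep = max(1, int(keep_fraction * D.shape[1]))
    keep = np.argsort(-np.abs(t_stat))[:n_keep]   # indices of the most informative features
    return D[:, keep], keep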
3.3.2 Qualitative Evaluation to Identify Novel Subclasses
The linear and nonlinear DR methods were evaluated in terms of their ability to identify new subclasses within the data. The three dominant embedding coordinates g_1(x_i), g_2(x_i), and g_3(x_i) are plotted against each other, for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM} and for all x_i ∈ D. The 3D space of embedding coordinates G_Φ(x_i), for all x_i ∈ D, was visually inspected for 1) distinct clusters within the dominant +1, -1 classes and 2) distinct clusters that appear to be far removed from the cluster centers G̃_{Φ,+} and G̃_{Φ,-}. Since ground truth for newly identified subclasses within the binary class data sets was unavailable, we also compared the six DR schemes on a multiclass Acute Lymphoblastic Leukemia data set [32], which comprises five known subclasses.
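The qualitative inspection described above amounts to a 3D scatter plot of the first three embedding coordinates, with one marker per class. A matplotlib sketch is given below for illustration; G (an n x m embedding array) and y (the label vector) are placeholders.

import matplotlib.pyplot as plt

def plot_embedding_3d(G, y, title="Embedding plot"):
    """Scatter the three dominant embedding coordinates, one marker per class."""
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    for label, marker in [(+1, "o"), (-1, "s")]:
        pts = G[y == label]
        ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], marker=marker, label=f"class {label:+d}")
    ax.set_xlabel("g1")
    ax.set_ylabel("g2")
    ax.set_zlabel("g3")
    ax.set_title(title)
    ax.legend()
    plt.show()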
3.3.3 Quantitative Evaluation to Measure Class Discriminability
In this section, we describe in greater detail the different performance measures used for evaluating the efficacy of the DR methods.
1. DR comparison via classifier accuracy. The accuracy of two classifiers (linear SVMs and C4.5 Decision Trees) was used to quantitatively evaluate G_Φ(x_i), x_i ∈ D, on the 11 data sets D_j, j ∈ {1, 2, ..., 11}, using the class labels provided. Both classifiers considered, SVMs and C4.5 Decision Trees, require the use of a training set S_j^Tr to construct a prediction model for new data and a testing set S_j^t. Each classifier was first trained using the labeled instances in S_j^Tr, where for each x_a ∈ S_j^Tr, Y(x_a) ∈ {+1, -1}. The classifier training is done separately for each DR method Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}. To train the classifiers, we randomly set aside 1/3 of the samples for training in S_j^Tr, and the remaining 2/3 of the samples in S_j^t were used for testing. The threefold cross-validation method was then used to determine the optimal classifier parameters. The classifier outputs C^SVM(G_Φ(x_b)), C^C4.5(G_Φ(x_b)) ∈ {+1, -1}, where x_b ∈ S_j^t, were compared against the true object label Y(x_b) ∈ {+1, -1}. Subsequently, accuracy is defined as the ratio of the number of objects x_b ∈ S_j^t correctly labeled by the classifier to the total number of tested objects in each S_j^t. An illustrative sketch of this evaluation loop is given below, after which we provide a brief description of the two classifiers considered in this study.
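The evaluation protocol above (a 1/3 train, 2/3 test split of the embedding coordinates, an SVM and a decision tree, and accuracy as the fraction of correctly labeled test samples) is sketched with scikit-learn as follows. The tree learner is a CART-based stand-in for C4.5, the stratified split is an assumption, and the embedding and labels are placeholders.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def evaluate_embedding(G, y, seed=0):
    """Return (SVM accuracy, decision-tree accuracy) for one embedding G with labels y."""
    # 1/3 of the samples are used for training, the remaining 2/3 for testing.
    G_tr, G_te, y_tr, y_te = train_test_split(
        G, y, train_size=1 / 3, stratify=y, random_state=seed)
    svm = SVC(kernel="linear").fit(G_tr, y_tr)
    tree = DecisionTreeClassifier(random_state=seed).fit(G_tr, y_tr)  # CART stand-in for C4.5
    acc_svm = np.mean(svm.predict(G_te) == y_te)
    acc_tree = np.mean(tree.predict(G_te) == y_te)
    return acc_svm, acc_tree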
1.1. Support Vector Machines (SVM). SVMs were first introduced by Vapnik [55] and are based on the structural risk minimization (SRM) principle from statistical learning theory. The SVM attempts to minimize a bound on the generalization error (the error made on test data). SVM-based techniques focus on the "borderline" training examples (or support vectors) that are most difficult to classify. The SVM projects the input training data G_Φ(x_a), for x_a ∈ S_j^Tr, onto a higher dimensional space using the linear kernel defined as

K(G_\Phi(x_a), G_\Phi(x_b)) = G_\Phi(x_a)^T G_\Phi(x_b) + b,    (13)

where b is the bias estimated on the training set S_j^Tr ⊂ D. The general form of the SVM is given by

C^{SVM} = \sum_{\gamma=1}^{n_s} \alpha_\gamma Y(x_\gamma) K(G_\Phi(x_\gamma), G_\Phi(x_b)),    (14)

where the x_γ, γ ∈ {1, 2, ..., n_s}, denote the n_s support vectors, and the model parameters α_γ are obtained by maximizing the objective function

L(\alpha) = \sum_{\gamma=1}^{n_s} \alpha_\gamma - \frac{1}{2} \sum_{\gamma,\rho=1}^{n_s} \alpha_\gamma \alpha_\rho Y(x_\gamma) Y(x_\rho) K(G_\Phi(x_\gamma), G_\Phi(x_\rho)),    (15)

subject to the constraints \sum_{\gamma=1}^{n_s} \alpha_\gamma Y(x_\gamma) = 0 and 0 ≤ α_γ ≤ ω, where γ, ρ ∈ {1, 2, ..., n_s}, and where the parameter ω controls the trade-off between the empirical risk (training errors) and model complexity.
Additionally, a one-against-all SVM scheme was implemented for the multiclass case [14]. For this scheme, a
binary classifier is built for each class to separate one class
from all the other classes. Again, 1/3 of the samples from each class are randomly selected for the training set S_j^Tr, and predictions are made on the remaining 2/3 of the samples in S_j^t. Each of the five binary classifiers makes a prediction as to whether each x_a ∈ D_j belongs to its target class. In the ideal case, only the binary classifier trained to identify Y(x_a) as the target class should output a value of 1 and the other four classifiers should output 0. If so, x_a is said to have been correctly classified. If not, x_a is randomly assigned one of the five class labels. If the randomly assigned class label is not its true class label, x_a is said to have been misclassified. Otherwise, it is determined to have been correctly classified.
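The one-against-all protocol just described can be sketched as below; the random assignment when the per-class votes are ambiguous follows the text, while the linear-kernel SVMs, class list, and data are placeholders rather than the authors' exact setup.

import numpy as np
from sklearn.svm import SVC

def one_vs_all_predict(G_tr, y_tr, G_te, classes, seed=0):
    """Train one binary SVM per class and resolve ambiguous votes randomly."""
    rng = np.random.default_rng(seed)
    # votes[i, c] = 1 if the classifier for classes[c] claims test sample i.
    votes = np.column_stack([
        SVC(kernel="linear").fit(G_tr, (y_tr == c).astype(int)).predict(G_te)
        for c in classes])
    predictions = []
    for row in votes:
        claimed = np.flatnonzero(row)
        if len(claimed) == 1:                    # exactly one classifier fired: accept it
            predictions.append(classes[claimed[0]])
        else:                                    # ambiguous: assign a random class label
            predictions.append(classes[rng.integers(len(classes))])
    return np.array(predictions)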
1.2. C4.5 Decision Trees (C4.5). A special type of classifier
is the decision tree, which is trained using an iterative
selection of individual features f_u(x_a) that are the most salient at each node in the tree [56]. One of the most commonly used algorithms for generating decision trees is the C4.5 rules algorithm proposed by Quinlan [56]. The rules generated by this approach are in conjunctive form, such as "if A and B then C," where A and B are the rule antecedents and C is the rule consequence. Every path from the root to a leaf is converted to an initial rule by regarding all the conditions appearing in the path as the conjunctive rule antecedents, while regarding the class label Y(x_a), x_a ∈ D, held by the leaf as the rule consequence. Tree pruning is then done by using a greedy elimination rule that removes antecedents that are not sufficiently discriminatory. The rule set is then further refined by way of the minimum description length (MDL) principle [57] to remove those rules that do not contribute to the accuracy of the tree. Hence, for each Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}, we obtain a separate decision tree classifier C^C4.5(G_Φ(x_i)) to classify every x_i ∈ D as {+1, -1}. The C4.5 Decision Tree is extended to the multiclass case by simply adding more output labels. Classifier evaluation is also similarly performed on D_j, j ∈ {1, 2, ..., 11}, following feature pruning.
2. DR comparison via cluster validity measures. The low-dimensional embeddings G_Φ(x_i), obtained for each Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}, are also compared in terms of the five cluster validity measures. Prior to this, however, the embedding coordinates G_Φ(x_i) for x_i ∈ D need to be normalized within a unit hypercube H in order to facilitate a quantitative comparison across the six DR schemes. The eigenvector coordinate g_v(x_i), for each x_i ∈ D and v ∈ {1, 2, ..., m}, is thus scaled between [0, 1] along each of the m dimensions via the following formulation:

\tilde{g}_v(x_i) = \frac{ g_v(x_i) - \min_i g_v(x_i) }{ \max_i g_v(x_i) - \min_i g_v(x_i) },    (16)

where \tilde{g}_v(x_i) is the normalized embedding coordinate of x_i along the vth dimension, v ∈ {1, 2, ..., m}. For all x_a ∈ D_j such that Y(x_a) = +1, the cluster center of the +1 class, G̃_{Φ,+}, is obtained by averaging the embedding coordinate locations \tilde{g}_v along each dimension v ∈ {1, 2, ..., m} and for each Φ. Formally, where n_+ is the number of objects in the +1 class,

\tilde{g}_v^{\Phi,+} = \frac{1}{n_+} \sum_{x_a \in D_j,\, Y(x_a) = +1} \tilde{g}_v(x_a).    (17)

Thus, the normalized cluster center for the +1 class is obtained as G̃_{Φ,+} = [\tilde{g}_v^{\Phi,+} | v ∈ {1, 2, ..., m}]. Similarly, we obtain the cluster center G̃_{Φ,-} for the -1 class. Having obtained G̃_{Φ,+} and G̃_{Φ,-}, we define the five cluster validity measures as follows:
2.1. Intercentroid Distance (ICD). C^{ICD} is defined as the euclidean distance between the centroids G̃_{Φ,+} and G̃_{Φ,-} [58]. C^{ICD} is calculated for each D_j, j ∈ {1, 2, ..., 10}, and for all Φ.
2.2. Cluster Tightness (CT). To evaluate the tightness and distinctness of object clusters in the embedding space, we define and evaluate four CT measures: C^{CT,Φ,+}, C^{CT,Φ,-}, C_σ^{CT,Φ,+}, and C_σ^{CT,Φ,-}. C^{CT,Φ,+} is defined as the mean euclidean distance of all objects x_a ∈ D_j, Y(x_a) = +1, from G̃_{Φ,+}. Formally, this is expressed as

C^{CT,\Phi,+} = \frac{1}{n_+} \sum_{x_a \in D,\, Y(x_a) = +1} \left\| \tilde{G}_{\Phi,+} - \tilde{G}_\Phi(x_a) \right\|.    (18)

We also similarly compute C_σ^{CT,Φ,+} as the standard deviation of the euclidean distances of all such x_a ∈ D from their corresponding cluster centroid G̃_{Φ,+} [59]. Similarly, C^{CT,Φ,-} and C_σ^{CT,Φ,-} are defined for the -1 class. The calculation of the above cluster measures and the normalization of the embedding coordinate system are repeated for all D_j, j ∈ {1, 2, ..., 10}, following feature pruning.
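A short numpy sketch of the normalization in (16) together with the ICD and CT measures follows. It is an illustrative reading of the definitions above, not the authors' code; G and y are placeholders.

import numpy as np

def cluster_validity(G, y):
    """Compute ICD and the four cluster tightness measures for an embedding G with labels y."""
    # Scale every embedding dimension to [0, 1], as in (16).
    G = (G - G.min(axis=0)) / (G.max(axis=0) - G.min(axis=0) + 1e-12)
    cent_pos = G[y == +1].mean(axis=0)                       # centroid of the +1 class
    cent_neg = G[y == -1].mean(axis=0)                       # centroid of the -1 class
    d_pos = np.linalg.norm(G[y == +1] - cent_pos, axis=1)    # distances to the +1 centroid
    d_neg = np.linalg.norm(G[y == -1] - cent_neg, axis=1)    # distances to the -1 centroid
    return {"ICD": np.linalg.norm(cent_pos - cent_neg),      # intercentroid distance
            "CT+": d_pos.mean(), "CT-": d_neg.mean(),        # mean tightness, as in (18)
            "CTsigma+": d_pos.std(), "CTsigma-": d_neg.std()}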
Following the computation of the seven quantitative performance measures (two classifier and five cluster measures), a paired Student t-test comparison is performed between the values of C^{SVM}, C^{C4.5}, C^{ICD}, C^{CT,+}, C^{CT,-}, C_σ^{CT,+}, and C_σ^{CT,-} for each of the following nine pairs of linear and nonlinear methods (PCA/ISO, LDA/ISO, MDS/ISO, PCA/LLE, LDA/LLE, MDS/LLE, PCA/LEM, LDA/LEM, and MDS/LEM) across all data sets D_j, j ∈ {1, 2, ..., 10}, under the null hypothesis that there is no difference in the seven performance measures between each of the nine pairs of linear/nonlinear DR methods. Thus, if p ≤ 0.05 for a pair of linear/nonlinear methods for a particular performance measure, the difference is assumed to be statistically significant. A similar t-test comparison is also performed using the embedding data obtained following feature pruning, with the aim of showing similar trends across the six different DR methods applied to both the unpruned and the feature-pruned data sets.
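The statistical comparison reduces to a paired t-test over the per-data-set values of a given measure for one linear and one nonlinear method, as sketched below with scipy. The accuracy arrays are hypothetical numbers used only to show the call.

import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-data-set accuracies (one value per data set) for one linear/nonlinear pair.
acc_pca = np.array([0.71, 0.65, 0.80, 0.62, 0.75, 0.69, 0.73, 0.66, 0.78, 0.70])
acc_iso = np.array([0.84, 0.79, 0.90, 0.77, 0.88, 0.81, 0.85, 0.80, 0.91, 0.83])

# Paired Student t-test under the null hypothesis of no difference between the methods.
t_stat, p_value = ttest_rel(acc_iso, acc_pca)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant: {p_value <= 0.05}")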
4 RESULTS AND DISCUSSION

4.1 Qualitative Results
4.1.1 Class Separability in Embedding Space
Fig. 3 shows the 2D embedding plots of six different linear
and nonlinear DR methods for one proteomic spectra
(ovarian cancer [19]) and two gene expression (colon [16]
and lung cancer [54]) data sets. Each of the plots in Figs. 3a,
3b, 3c, 3d, 3e, 3f, 3g, 3h, 3i, 3j, 3k, and 3l was generated by plotting the first dominant eigenvector g̃_1(x_i) versus the second dominant eigenvector g̃_2(x_i), for all x_i ∈ D_j and for a given DR method Φ. The two object classes (+1 and -1) are denoted with different symbols. Figs. 3a and 3d
Fig. 3. Embedding plots obtained by plotting the dominant eigenvectors g̃_1(x_i) and g̃_2(x_i) against each other for ((a), (d), (g), (j)) ovarian cancer [19], ((b), (e), (h), (k)) colon cancer [16], and ((c), (f), (i), (l)) lung cancer [54] data sets for six different linear and nonlinear DR methods. Embedding plots for Φ = PCA (a), MDS (d), ISO (g), and LLE (j) for the ovarian cancer data set [19] are shown in the left column, while the middle column shows embedding plots for colon cancer [16] obtained via Φ = LDA (b), MDS (e), LLE (h), and LEM (k). Embedding plots for the lung cancer data set [54] for Φ = PCA (c), LDA (f), ISO (i), and LEM (l) are shown in the rightmost column.
correspond to the embeddings generated by two of the
linear DR methods (PCA, MDS) while Figs. 3g and 3j show
the corresponding plots obtained from two of the nonlinear
DR methods (ISO, LLE) on the ovarian cancer study. Note
that in the embedding obtained with both Isomap and LLE
(Figs. 3g and 3j), the two classes are clearly distinguishable
while the corresponding embeddings obtained with PCA
and MDS (Figs. 3a and 3d) reveal a significant degree of
overlap between the +1 and -1 classes. A similar trend is
seen with PCA and LDA (Figs. 3b and 3e) and LLE and
LEM (Figs. 3h and 3k) on the colon cancer data set [16].
Note that in spite of the presence of a couple of apparent
outliers in the embeddings obtained by LLE and LEM, the
nonlinear DR methods appear to perform much better
compared to PCA and MDS (Figs. 3b and 3e). The difference
is even more stark in the embeddings obtained with PCA
(Fig. 3c) and LDA (Fig. 3f) compared to Isomap (Fig. 3i) and
LEM (Fig. 3l) on the lung cancer [54] data set in the
rightmost column.
Fig. 4. Embedding graphs obtained by plotting the three most dominant embedding vectors g̃_1(x_i), g̃_2(x_i), and g̃_3(x_i) for x_i ∈ D_j, for Φ = LDA (a), LLE (b), and LEM (c), respectively, on the lung cancer Michigan data set [26] in the top row. In the middle row, the embedding results obtained on the prostate cancer study [17] for Φ = MDS (d), LLE (e), and LEM (f), respectively, are shown. Finally, the embedding plots obtained via (g) PCA, (h) ISO, and (i) LLE for the multiclass acute lymphoblastic leukemia data set [32] are shown in the bottom row. Note that the ellipses in (b), (c), (e), and (f) have been manually placed to highlight what appear to be potentially new classes.
4.1.2 Novel Class Detection in Embedding Space
Fig. 4 illustrates qualitatively the differences between the
linear and nonlinear DR methods in capturing the true
underlying low-dimensional structure of the data and
highlights the differences between the two types of methods
in terms of their ability to identify subclasses in the data.
Figs. 4a, 4b, and 4c show the embedding plots obtained via
LDA, LLE, and LEM, respectively, for the lung cancer
Michigan data set [26]. For LDA (Fig. 4a), no meaningful
clustering of samples was observable, while for both LLE
and LEM, two distinct clusters of normal classes (denoted
via superimposed ellipses) were identifiable. In Fig. 4,
subclusters (denoted in superimposed ellipses) in the
prostate cancer data set [17] for both LLE and LEM
(Figs. 4e and 4f) were discernable but were occult in MDS
(Fig. 4d). Note that the ellipses in Figs. 4b, 4c, 4e, and 4f are
manually placed on the plots to highlight what appear to be
possible new classes. Since 10 of the studies considered in
this work were labeled as binary class data sets, we were
unable to evaluate the validity of the newly detected
subclasses. Note however that the two nonlinear methods
for both the lung cancer [54] and leukemia data sets [1], [32]
identify near identical subclusters, lending further credibility to the fact that the subclusters identified are genuine
subclasses. To further test the ability of nonlinear DR
schemes for novel class detection, a multiclass data set
comprising five known subtypes of acute lymphoblastic
leukemia [32] was considered. As shown in Fig. 4g, PCA is
unable to unravel the classes as discriminatingly as Isomap
(Fig. 4h) or LLE (Fig. 4i). The five subclasses shown in
Figs. 4g, 4h, and 4i are represented with different symbols.
4.2 Quantitative Results

4.2.1 Classifier Accuracy
For each of the 10 binary class and one multiclass data sets, classifier accuracy (C^{C4.5}, C^{SVM}) was assessed on the
embeddings obtained via the six DR schemes for both
unpruned (Figs. 5a and 5b) and feature-pruned data sets
(Fig. 5c). It can be seen from Fig. 5 that on average,
nonlinear DR methods (ISO, LLE, and LEM) perform better
than their linear counterparts (PCA, LDA, and MDS) for
both classifiers. Tables 3 and 4 list the accuracy results for
Fig. 5. Average (a) C^{SVM,Φ} and (b) C^{C4.5,Φ} for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}, over the 10 binary class data sets, before feature pruning. Additionally, average (c) C^{SVM,Φ} is given for all Φ, over the 10 binary class data sets, following feature pruning.

TABLE 3
Accuracy of C^{SVM,Φ} for Each of the 10 Binary Class Data Sets with and without DR for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM} and without Feature Pruning

TABLE 4
Accuracy of C^{C4.5,Φ} for Each of the 10 Binary Class Data Sets with and without DR for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM} and without Feature Pruning
the SVM and C4.5 classifiers over the 10 binary class data sets with and without DR. Classifier accuracy with an SVM for the multiclass data set (Acute Lymphoblastic Leukemia [32]) is provided in Table 5, again with and without DR. The results
in Tables 3, 4, and 5 clearly suggest that for both the binary
class and multiclass cases, 1) classification of high-dimensional data should be preceded by DR and 2) nonlinear DR
schemes result in higher classifier accuracy compared to
linear DR schemes.
TABLE 5
Accuracy of C^{SVM,Φ} in Distinguishing Subtypes of the Acute Lymphoblastic Leukemia Data Set with and without DR for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM} and without Feature Pruning
4.2.2 Cluster Metrics
Fig. 6 and Tables 6, 7, and 8 show the results for the cluster validity measures for all Φ. Figs. 6a, 6b, and 6c correspond to the average C^{ICD,Φ}, C^{CT,Φ,+}, and C^{CT,Φ,-}, respectively, across the 10 binary class data sets after feature pruning. Tables 6, 7, and 8 show the average C^{ICD,Φ}, C^{CT,Φ,-}, and C^{CT,Φ,+} values for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM} over the 10 binary class studies considered in this work without
Fig. 6. Average (a) C^{ICD,Φ}, (b) C^{CT,Φ,+}, and (c) C^{CT,Φ,-} values over the 10 binary class data sets for each Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM} after feature pruning.

TABLE 6
Values of C^{ICD,Φ} for Each of the 10 Binary Class Data Sets Following DR for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM} without Feature Pruning

TABLE 7
Values of C^{CT,Φ,-} for Each of the 10 Binary Class Data Sets Following DR for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM} without Feature Pruning
feature pruning. From Fig. 6a and Table 6, we observe that
the intercentroid distance between the dominant clusters is
on average greater for the nonlinear DR methods compared
to the linear methods, in turn suggesting greater separation
between the two classes. Similarly, from Figs. 6b and 6c, we observe that the average C^{CT,Φ,+} and C^{CT,Φ,-} values over all 10 binary class data sets are smaller for the nonlinear DR methods compared to the linear methods, suggesting that the object classes form more compact, tighter clusters in the embedding spaces generated via the nonlinear DR schemes.
In Table 9, p-values for the paired Student t-tests are obtained by comparing C^{SVM,Φ}, C^{C4.5,Φ}, C^{ICD,Φ}, C^{CT,Φ,+}, and C^{CT,Φ,-} across the 10 binary class data sets for each pair of linear and nonlinear DR methods (PCA/ISO, LDA/ISO, MDS/ISO, PCA/LLE, LDA/LLE, MDS/LLE, PCA/LEM, LDA/LEM, and MDS/LEM). Statistically significant differences were observed for all the performance measures considered for each pair of linear/nonlinear DR methods across all 10 binary class data sets. Similar trends were observed for embeddings obtained from the linear and nonlinear DR schemes following feature pruning (Table 10).
Additionally, we investigated the performance of each of the DR methods across the 10 binary class studies as a function of the number of dimensions of the embedding space G_Φ from a classification perspective. Figs. 7a and 7b show the average classification accuracy (C^{SVM,Φ}
TABLE 8
Values of C^{CT,Φ,+} for Each of the 10 Binary Class Data Sets Following DR for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM} without Feature Pruning

TABLE 9
p-Values Obtained by a Paired Student t-Test of C^{SVM,Φ}, C^{C4.5,Φ}, C^{ICD,Φ}, C^{CT,Φ,+}, and C^{CT,Φ,-} across 10 Data Sets, Comparing Nine Pairs of Linear/Nonlinear DR Methods without Feature Pruning

TABLE 10
p-Values Obtained by a Paired Student t-Test of C^{SVM,Φ}, C^{C4.5,Φ}, C^{ICD,Φ}, C^{CT,Φ,+}, and C^{CT,Φ,-} across 10 Data Sets, Comparing Nine Pairs of Linear/Nonlinear DR Methods Following Feature Pruning
and C^{C4.5,Φ}, respectively) for each DR method, with the number of dimensions varied from 2 to 10 (v ∈ {2, 3, ..., 10}). Similarly, Figs. 8a, 8b, and 8c show the cluster validity measures (C^{ICD,Φ}, C^{CT,Φ,+}, C^{CT,Φ,-}, C_σ^{CT,Φ,+}, and C_σ^{CT,Φ,-}) for each DR method, with the number of dimensions again varied from 2 to 10 (v ∈ {2, 3, ..., 10}). For both the classifier
and cluster validity measures, one can see similar trends
across dimensions showing nonlinear DR methods outperforming linear methods (Figs. 7 and 8), thereby
comprehensively demonstrating that the nonlinear DR
schemes outperform the linear DR methods independent
of the number of embedding dimensions considered.
5 CONCLUDING REMARKS
The primary objective of this paper was to identify appropriate DR methods to precede analysis and classification of
high-dimensional gene and protein expression studies. This
is especially important in applications where the goal is to
identify two or more specific classes within the data sets. In
this paper, we quantitatively compared the performance of
six different DR methods, three linear (PCA, LDA, and MDS)
and three nonlinear (Isomap, LLE, and LEM) from the
perspective of 1) distinguishing between cancer and noncancer studies and 2) identifying new object classes (cancer
subtypes) from 10 binary high-dimensional gene and protein
expression data sets for prostate, lung, breast, and ovarian
cancers, as well as for leukemia, lymphomas, and gliomas.
Additionally, a multiclass data set comprising five distinct
subtypes of lymphoblastic leukemia was also considered.
The efficacy of the low-dimensional representations of the
high-dimensional data obtained by the different DR methods
was evaluated via two classifier schemes (SVM and C4.5) and
five different cluster validity measures. The intuition behind
the use of these evaluation measures was that if the
Fig. 7. Average classification accuracy (a) C^{SVM,Φ} and (b) C^{C4.5,Φ} over the 10 binary class data sets for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM} for v ∈ {2, 3, ..., 10}.

Fig. 8. Average (a) C^{ICD,Φ}, (b) C^{CT,Φ,+}, and (c) C^{CT,Φ,-} values over all 10 binary class data sets for each Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM} for v ∈ {2, 3, ..., 10}.

Fig. 9. The degree to which nonlinearity in the data can be accurately approximated depends on the size κ of the local neighborhood within which linearity is assumed. As κ increases, the locally linear assumption is no longer valid. (a) and (b) show C^{ICD,LLE} and C^{ICD,LEM}, respectively, plotted against increasing values of κ for the lung cancer Michigan data set. As κ increases, C^{ICD,LLE} and C^{ICD,LEM} both decrease, suggesting that the nonlinear DR schemes are effectively becoming linear.
low-dimensional embedding was indeed a faithful representation of the high-dimensional feature space, the two
different data classes would be separable into distinct tightly
packed clusters. Embeddings were generated by the six
different DR methods from the original high-dimensional
data before and after feature pruning. Feature pruning was
applied to identify only the top 10 percentile of the most
informative features in each data set in order to reduce any
possible nonlinearity in the data on account of redundant or
uncorrelated features. The three different linear and three
nonlinear methods were also compared pairwise via a paired
student t-test in terms of the seven performance measures
and across all 10 data sets. In addition, six different DR
methods were also qualitatively compared in terms of the
ability of their respective embeddings to reveal the presence
of new subclasses within the data. Our primary conclusions
from this work are given as follows:
1. The nonlinear methods significantly outperformed the linear methods over all the data sets in terms of all seven performance measures, suggesting an underlying nonlinear manifold structure for high-dimensional biomedical studies.
2. The differences between the linear and nonlinear methods were found to be statistically significant even after pruning the data sets by feature selection, and were independent of the number of dimensions of the embedding space considered; a brief illustrative sketch of such a paired significance test follows this list.
3. The nonlinear methods also appeared better able to identify potential subclasses within the data than the linear methods, which for the most part were unable to even discriminate between the two most dominant classes in each data set.
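As a rough illustration of the paired significance testing referenced in item 2, the sketch below (not taken from the original study; the accuracy arrays are placeholder values for two DR schemes over 10 matched data sets) applies a paired Student's t-test using SciPy:

# Hedged sketch: paired Student's t-test comparing a linear and a nonlinear
# DR scheme across the same data sets. The accuracy values below are
# placeholders, not results reported in this paper.
import numpy as np
from scipy import stats

# One classification accuracy per data set (10 data sets, two DR schemes).
acc_pca = np.array([0.71, 0.68, 0.74, 0.65, 0.70, 0.69, 0.72, 0.66, 0.73, 0.67])
acc_lle = np.array([0.79, 0.75, 0.80, 0.73, 0.77, 0.76, 0.81, 0.74, 0.78, 0.75])

# Paired test: each pair of accuracies comes from the same data set.
t_stat, p_value = stats.ttest_rel(acc_lle, acc_pca)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")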
In making our conclusions, we also acknowledge the
following limitations of this study: 1) Our results are based
on a relatively small database comprising 10 binary class
and one multiclass gene and protein expression data sets,
2) not all linear and nonlinear DR methods were considered
in this study, and 3) the performance of the nonlinear methods is dependent on the size of the local neighborhood parameter within which data linearity is assumed. As the value of this parameter increases, the locally linear assumption is no longer valid and the nonlinear DR methods begin to resemble linear methods. The dependency of the nonlinear methods on the neighborhood size is reflected in the plots in Fig. 9: for both LLE and LEM, the corresponding cluster measures C_ICD,LLE and C_ICD,LEM begin to decrease with increasing neighborhood size, suggesting the degeneracy of the nonlinear schemes.
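To illustrate this dependency, the following sketch (a minimal illustration, not the authors' implementation; scikit-learn's LocallyLinearEmbedding and SpectralEmbedding are assumed as stand-ins for LLE and Laplacian Eigenmaps, and the data matrix is synthetic) varies the neighborhood size and reports a silhouette score as a crude proxy for class separation in the resulting embedding:

# Hedged sketch: effect of the neighborhood-size parameter on nonlinear DR.
# scikit-learn's LocallyLinearEmbedding/SpectralEmbedding stand in for the
# paper's LLE/LEM; X and y are synthetic placeholders for an expression study.
from sklearn.datasets import make_classification
from sklearn.manifold import LocallyLinearEmbedding, SpectralEmbedding
from sklearn.metrics import silhouette_score

X, y = make_classification(n_samples=100, n_features=500, n_informative=20,
                           random_state=0)

for k in (5, 10, 20, 40):  # increasing local neighborhood sizes
    for name, model in (("LLE", LocallyLinearEmbedding(n_neighbors=k, n_components=2)),
                        ("LEM", SpectralEmbedding(n_neighbors=k, n_components=2))):
        emb = model.fit_transform(X)
        # Silhouette score is only a rough proxy for the paper's cluster
        # validity measures (e.g., C_ICD); larger means better-separated classes.
        print(name, k, round(silhouette_score(emb, y), 3))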
Table 11 lists the best and worst DR schemes based on
each of the seven performance criteria considered in this
study. As can be surmised from Table 11, the nonlinear DR
scheme LLE was the best DR method overall and the linear
scheme LDA performed the worst.
Our results appear to suggest that if the objective is to
distinguish multiple classes or identify subclusters within
high-dimensional biomedical data, nonlinear DR methods
such as LLE, Isomap, and LEM may be a better choice
compared to linear DR methods such as PCA. Preliminary
results in an application involving prostate magnetic
resonance spectroscopy [46] appear to confirm the conclusions presented in this work.
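For readers who wish to experiment with this recommendation, the sketch below (an illustrative example only, not the pipeline used in this paper; it assumes scikit-learn implementations of PCA, Isomap, LLE, and Laplacian Eigenmaps and a synthetic stand-in for an expression data set) compares cross-validated SVM accuracy in the low-dimensional embeddings produced by one linear and three nonlinear DR schemes:

# Hedged sketch: comparing embeddings from one linear and three nonlinear DR
# schemes by the cross-validated accuracy of an SVM trained in each embedding.
# Synthetic data stands in for a gene/protein expression study.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding, SpectralEmbedding
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=1000, n_informative=30,
                           random_state=0)

methods = {
    "PCA": PCA(n_components=5),
    "ISO": Isomap(n_neighbors=10, n_components=5),
    "LLE": LocallyLinearEmbedding(n_neighbors=10, n_components=5),
    "LEM": SpectralEmbedding(n_neighbors=10, n_components=5),  # Laplacian Eigenmaps
}

for name, dr in methods.items():
    emb = dr.fit_transform(X)  # note: DR is fit on all samples, so this only
                               # illustrates relative embedding quality, not an
                               # unbiased estimate of generalization accuracy
    acc = cross_val_score(SVC(kernel="linear"), emb, y, cv=5).mean()
    print(f"{name}: mean 5-fold SVM accuracy = {acc:.3f}")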
TABLE 11
Summary of the Best and Worst DR Methods in Terms of Each of the Seven Performance Measures

ACKNOWLEDGMENTS
This work was supported by grants from the Coulter Foundation, the Cancer Institute of New Jersey, the New Jersey Commission on Cancer Research (NJCCR), the Society for Imaging Informatics in Medicine (SIIM), the Aresty Foundation, and the National Cancer Institute (R21CA127186-01 and R03CA128081-01). The authors would like to thank Dr. Jianbo Shi and Dr. James Monaco for their advice and comments.

REFERENCES
[1] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander, “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring,” Science, vol. 286, no. 5439, pp. 531-537, 1999.
[2] Y. Peng, “A Novel Ensemble Machine Learning for Robust Microarray Data Classification,” Computers in Biology and Medicine, vol. 36, no. 6, pp. 553-573, 2006.
[3] C. Shi and L. Chen, “Feature Dimension Reduction for Microarray Data Analysis Using Locally Linear Embedding,” Proc. Third Asia Pacific Bioinformatics Conf. (APBC ’05), pp. 211-217, 2005.
[4] S.D. Der, A. Zhou, B.R. Williams, and R.H. Silverman, “Identification of Genes Differentially Regulated by Interferon Alpha, Beta, or Gamma Using Oligonucleotide Arrays,” Proc. Nat’l Academy of Sciences USA, vol. 95, no. 26, pp. 15623-15628, Dec. 1998.
[5] R. Maglietta, A. D’Addabbo, A. Piepoli, F. Perri, S. Liuni, G. Pesole, and N. Ancona, “Selection of Relevant Genes in Cancer Diagnosis Based on Their Prediction Accuracy,” Artificial Intelligence in Medicine, vol. 40, no. 1, pp. 29-44, May 2007.
[6] T.M. Huang and V. Kecman, “Gene Extraction for Cancer Diagnosis by Support Vector Machines—An Improvement,” Artificial Intelligence in Medicine, vol. 35, nos. 1-2, pp. 185-194, 2005.
[7] G. Turashvili, J. Bouchal, K. Baumforth, W. Wei, M. Dziechciarkova, J. Ehrmann, J. Klein, E. Fridman, J. Skarda, J. Srovnal, M. Hajduch, P. Murray, and Z. Kolar, “Novel Markers for Differentiation of Lobular and Ductal Invasive Breast Carcinomas by Laser Microdissection and Microarray Analysis,” BMC Cancer, vol. 7, no. 55, 2007.
[8] A.A. Alizadeh, M.B. Eisen, R.E. Davis, C. Ma, I.S. Lossos, A. Rosenwald, J.C. Boldrick, H. Sabet, T. Tran, X. Yu, J.I. Powell, L. Yang, G.E. Marti, T. Moore, J. Hudson, L. Lu, D.B. Lewis, R. Tibshirani, G. Sherlock, W.C. Chan, T.C. Greiner, D.D. Weisenburger, J.O. Armitage, R. Warnke, R. Levy, W. Wilson, M.R. Grever, J.C. Byrd, D. Botstein, P.O. Brown, and L.M. Staudt, “Distinct Types of Diffuse Large B-Cell Lymphoma Identified by Gene Expression Profiling,” Nature, vol. 403, no. 6769, pp. 503-511, 2000.
[9] A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini, “Tissue Classification with Gene Expression Profiles,” J. Computational Biology, vol. 7, nos. 3-4, pp. 559-583, 2000.
[10] M.P. Brown, W.N. Grundy, D. Lin, N. Cristianini, C.W. Sugnet, T.S. Furey, M. Ares, and D. Haussler, “Knowledge-Based Analysis of Microarray Gene Expression Data by Using Support Vector Machines,” Proc. Nat’l Academy of Sciences USA, vol. 97, no. 1, pp. 262-267, Jan. 2000.
[11] A.C. Tan and D. Gilbert, “Ensemble Machine Learning on Gene Expression Data for Cancer Classification,” Applied Bioinformatics, vol. 2, no. 3 supplement, pp. S75-S83, 2003.
[12] L. Song, J. Bedo, K.M. Borgwardt, A. Gretton, and A. Smola, “Gene Selection via the BAHSIC Family of Algorithms,” Bioinformatics, vol. 23, pp. 490-498, 2007.
[13] L. Li, W. Jiang, X. Li, K.L. Moser, Z. Guo, L. Du, Q. Wang, E.J. Topol, Q. Wang, and S. Rao, “A Robust Hybrid between Genetic Algorithm and Support Vector Machine for Extracting an Optimal Feature Gene Subset,” Genomics, vol. 85, pp. 16-23, 2005.
[14] T. Li, C. Zhang, and M. Ogihara, “A Comparative Study of Feature Selection and Multiclass Classification Methods for Tissue Classification Based on Gene Expression,” Bioinformatics, vol. 20, no. 15, pp. 2429-2437, Oct. 2004.
[15] H. Liu, J. Li, and L. Wong, “A Comparative Study on Feature
Selection and Classification Methods Using Gene Expression
Profiles and Proteomic Patterns,” Genome Informatics, vol. 13,
pp. 51-60, 2002.
[16] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack,
and A.J. Levine, “Broad Patterns of Gene Expression Revealed by
Clustering of Tumor and Normal Colon Tissues Probed by
Oligonucleotide Arrays,” Proc. Nat’l Academy of Sciences, vol. 96,
no. 12, pp. 6745-6750, 1999.
[17] D. Singh, P.G. Febbo, K. Ross, D.G. Jackson, J. Manola, C. Ladd, P.
Tamayo, A.A. Renshaw, A.V. D’Amico, J.P. Richie, E.S. Lander, M.
Loda, P.W. Kantoff, T.R. Golub, and W.R. Sellers, “Gene
Expression Correlates of Clinical Prostate Cancer Behavior,”
Cancer Cell, vol. 1, no. 2, pp. 203-209, 2002.
[18] M. Park, J.W. Lee, J.B. Lee, and S.H. Song, “Several Biplot Methods
Applied to Gene Expression Data,” J. Statistical Planning and
Inference, vol. 138, pp. 500-515, 2007.
[19] E.F. Petricoin, A.M. Ardekani, B.A. Hitt, P.J. Levine, V.A. Fusaro,
S.M. Steinberg, G.B. Mills, C. Simone, D.A. Fishman, E.C. Kohn,
and L.A. Liotta, “Use of Proteomic Patterns in Serum to Identify
Ovarian Cancer,” The Lancet, vol. 359, no. 9306, pp. 572-577, 2002.
[20] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E.
Dmitrovsky, E.S. Lander, and T.R. Golub, “Interpreting Patterns of
Gene Expression with Self-Organizing Maps: Methods and
Application to Hematopoietic Differentiation,” Proc. Nat’l Academy
of Sciences USA, vol. 96, no. 6, pp. 2907-2912, Mar. 1999.
[21] S. Yang, J. Shin, K.H. Park, H.-C. Jeung, S.Y. Rha, S.H. Noh, W.I.
Yang, and H.C. Chung, “Molecular Basis of the Differences
between Normal and Tumor Tissues of Gastric Cancer,” Biochimica et Biophysica Acta, 2007.
[22] L.J. van ’t Veer, H. Dai, M.J. van de Vijver, Y.D. He, A.A.M. Hart,
M. Mao, H.L. Peterse, K. van der Kooy, M.J. Marton, A.T.
Witteveen, G.J. Schreiber, R.M. Kerkhoven, C. Roberts, R.B.P.S.
Linsley, and S.H. Friend, “Gene Expression Profiling Predicts
Clinical Outcome of Breast Cancer,” Nature, vol. 415, pp. 530-536,
2002.
[23] S.L. Pomeroy, P. Tamayo, M. Gaasenbeek, L.M. Sturla, M. Angelo,
M.E. McLaughlin, J.Y.H. Kim, L.C. Goumnerova, P.M. Black, C.
Lau, J.C. Allen, D. Zagzag, J.M. Olson, T. Curran, C. Wetmore, J.A.
Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, G.
Stolovitzky, D.N. Louis, J.P. Mesirov, E.S. Lander, and T.R. Golub,
“Prediction of Central Nervous System Embryonal Tumour
Outcome Based on Gene Expression,” Nature, vol. 415, pp. 436-442, 2002.
[24] W.A. Freije, F.E. Castro-Vargas, Z. Fang, S. Horvath, T. Cloughesy,
L.M. Liau, P.S. Mischel, and S.F. Nelson, “Gene Expression
Profiling of Gliomas Strongly Predicts Survival,” Cancer Research,
vol. 64, no. 18, pp. 6503-6510, 2004.
[25] M.A. Shipp, K.N. Ross, P. Tamayo, A.P. Weng, J.L. Kutok, R.C.T.
Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G.S. Pinkus, T.S.
Ray, M.A. Kovall, K.W. Last, A. Norton, T.A. Lister, J. Mesirov,
D.S. Neuberg, E.S. Lander, J.C. Aster, and T.R. Golub, “Diffuse
Large B-Cell Lymphoma Outcome Prediction by Gene-Expression
Profiling and Supervised Machine Learning,” Nature Medicine,
vol. 8, pp. 68-74, 2002.
[26] D.G. Beer, S.L.R. Kardia, C.-C. Huang, T.J. Giordano, A.M. Levin,
D.E. Misek, L. Lin, G. Chen, T.G. Gharib, D.G. Thomas, M.L.
Lizyness, R. Kuick, S. Hayasaka, J.M.G. Taylor, M.D. Iannettoni,
M.B. Orringer, and S. Hanash, “Gene-Expression Profiles Predict
Survival of Patients with Lung Adenocarcinoma,” Nature Medicine, vol. 8, pp. 816-823, 2002.
[27] D.A. Wigle, I. Jurisica, N. Radulovich, M. Pintilie, J. Rossant, N.
Liu, C. Lu, J. Woodgett, I. Seiden, M. Johnston, S. Keshavjee, G.
Darling, T. Winton, B.-J. Breitkreutz, P. Jorgenson, M. Tyers, F.A.
Shepherd, and M.S. Tsao, “Molecular Profiling of Non-Small Cell
Lung Cancer and Correlation with Disease-Free Survival,” Cancer
Research, vol. 62, pp. 3005-3008, 2002.
[28] R.E. Bellman, Adaptive Control Processes. Princeton Univ. Press,
1961.
[29] J. Ye, T. Li, T. Xiong, and R. Janardan, “Using Uncorrelated
Discriminant Analysis for Tissue Classification with Gene
Expression Data,” IEEE/ACM Trans. Computational Biology and
Bioinformatics, vol. 1, no. 4, pp. 181-190, Jan.-Mar. 2004.
[30] Z. Liu, D. Chen, and H. Bensmail, “Gene Expression Data
Classification with Kernel Principal Component Analysis,”
J. Biomedicine and Biotechnology, vol. 2, pp. 155-159, 2005.
[31] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, second
ed. Wiley, 2000.
[32] E.-J. Yeoh, M.E. Ross, S.A. Shurtleff, W.K. Williams, D. Patel, R.
Mahfouz, F.G. Behm, S.C. Raimondi, M.V. Relling, A. Patel, C.
Cheng, D. Campana, D. Wilkins, X. Zhou, J. Li, H. Liu, C.-H. Pui,
W.E. Evans, C. Naeve, L. Wong, and J.R. Downing, “Classification,
Subtype Discovery, and Prediction of Outcome in Pediatric Acute
Lymphoblastic Leukemia by Gene Expression Profiling,” Cancer
Cell, vol. 1, no. 2, pp. 133-143, 2002.
[33] J.J. Dai, L. Lieu, and D. Rocke, “Dimension Reduction for
Classification with Gene Expression Microarray Data,” Statistical
Applications in Genetics and Molecular Biology, vol. 5, no. 1, pp. 1-15,
2006.
[34] K. Dawson, R.L. Rodriguez, and W. Malyj, “Sample Phenotype
Clusters in High-Density Oligonucleotide Microarray Data Sets
Are Revealed Using Isomap, a Nonlinear Algorithm,” BMC
Bioinformatics, vol. 6, p. 195, 2005.
[35] C. Truntzer, C. Mercier, J. Estève, C. Gautier, and P. Roy,
“Importance of Data Structure in Comparing Two Dimension
Reduction Methods for Classification of Microarray Gene Expression Data,” BMC Bioinformatics, vol. 8, no. 90, 2007.
[36] A. Andersson, T. Olofsson, D. Lindgren, B. Nilsson, C. Ritz, P.
Eden, C. Lassen, J. Rade, M. Fontes, H. Morse, J. Heldrup, M.
Behrendtz, F.M.M. Hoglund, B. Johansson, and T. Fioretos,
“Molecular Signatures in Childhood Acute Leukemia and Their
Correlations to Expression Patterns in Normal Hematopoietic
Subpopulations,” Proc. Nat’l Academy of Sciences, vol. 102, no. 52,
pp. 19069-19074, 2005.
[37] Y. Zhu, R. Wu, N. Sangha, C. Yoo, K.R. Cho, K.A. Shedden, H.
Katabuchi, and D.M. Lubman, “Classifications of Ovarian Cancer
Tissues by Proteomic Patterns,” Proteomics, vol. 6, pp. 5846-5856,
2006.
[38] M.A. Mendez, C. Hodar, C. Vulpe, and M. Gonzalez, “Discriminant Analysis to Evaluate Clustering of Gene Expression Data,”
Federation of European Biochemical Soc., vol. 522, pp. 24-28, 2002.
[39] H. Hotelling, “Analysis of a Complex of Statistical Variables into
Principal Components,” J. Educational Psychology, vol. 24, pp. 417-441, 1933.
[40] J. Venna and S. Kaski, “Local Multidimensional Scaling,” Neural
Networks, vol. 19, pp. 889-899, 2006.
[41] J. Shi and J. Malik, “Normalized Cuts and Image Segmentation,”
IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8,
pp. 888-905, Aug. 2000.
[42] J. Tenenbaum, V. de Silva, and J.C. Langford, “A Global
Geometric Framework for Nonlinear Dimensionality Reduction,”
Science, vol. 290, no. 5500, pp. 2319-2322, 2000.
[43] S. Roweis and L. Saul, “Nonlinear Dimensionality Reduction by
Locally Linear Embedding,” Science, vol. 290, no. 5500, pp. 2323-2326, 2000.
[44] M. Belkin and P. Niyogi, “Laplacian Eigenmaps for Dimensionality Reduction and Data Representation,” Neural Computation,
vol. 15, no. 6, pp. 1373-1396, 2003.
[45] A. Madabhushi, J. Shi, M. Rosen, J.E. Tomaszeweski, and M.D.
Feldman, “Graph Embedding to Improve Supervised Classification and Novel Class Detection: Application to Prostate Cancer,”
Proc. Eighth Int’l Conf. Medical Image Computing and Computer-Assisted Intervention (MICCAI ’05), pp. 729-737, 2005.
[46] P. Tiwari, A. Madabhushi, and M. Rosen, “A Hierarchical
Unsupervised Spectral Clustering Scheme for Detection of
Prostate Cancer from Magnetic Resonance Spectroscopy (MRS),”
Proc. 10th Int’l Conf. Medical Image Computing and Computer-Assisted Intervention (MICCAI ’07), vol. 2, pp. 278-286, 2007.
[47] S. Doyle, M. Hwang, K. Shah, A. Madabhushi, M. Feldman, and J.
Tomaszeweski, “Automated Grading of Prostate Cancer Using
Architectural and Textural Image Features,” Proc. Fourth IEEE Int’l
Symp. Biomedical Imaging (ISBI ’07), pp. 1284-1287, 2007.
[48] S. Doyle, M. Hwang, S. Naik, M. Feldman, J. Tomaszeweski, and
A. Madabhushi, “Using Manifold Learning for Content-Based
Image Retrieval of Prostate Histopathology,” Proc. 10th Int’l Conf.
Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2007.
[49] S. Weng, C. Zhang, Z. Lin, and X. Zhang, “Mining the Structural
Knowledge of High-Dimensional Medical Data Using Isomap,”
Medical and Biological Eng. and Computing, vol. 43, pp. 410-412,
2005.
[50] J. Nilsson, T. Fioretos, M. Höglund, and M. Fontes, “Approximate
Geodesic Distances Reveal Biologically Relevant Structures in
Microarray Data,” Bioinformatics, vol. 20, no. 6, pp. 874-880, 2004.
[51] T. Dietterich, “Ensemble Methods in Machine Learning,” Proc.
First Int’l Workshop Multiple Classifier Systems (MCS), 2000.
[52] A. Madabhushi, J. Shi, M.D. Feldman, M. Rosen, and J.
Tomaszewski, “Comparing Ensembles of Learners: Detecting
Prostate Cancer from High Resolution MRI,” Proc. Second Int’l
Workshop Computer Vision Approaches to Medical Image Analysis
(CVAMIA ’06), pp. 25-36, 2006.
[53] T.-S. Lim, W.-Y. Loh, and Y.-S. Shih, “A Comparison of Prediction
Accuracy, Complexity, and Training Time of Thirty-Three Old
and New Classification Algorithms,” Machine Learning, vol. 40,
pp. 203-228, 2000.
[54] G.J. Gordon, R.V. Jensen, L.-L. Hsiao, S.R. Gullans, J.E. Blumenstock, S. Ramaswamy, W.G. Richards, D.J. Sugarbaker, and R.
Bueno, “Translation of Microarray Data into Clinically Relevant
Cancer Diagnostic Tests Using Gene Expression Ratios in Lung
Cancer and Mesothelioma,” Cancer Research, vol. 62, pp. 4963-4967,
2002.
[55] C. Cortes and V. Vapnik, “Support-Vector Networks,” Machine
Learning, vol. 20, 1995.
[56] J.R. Quinlan, “Bagging, Boosting, and C4.5,” Proc. 13th Nat’l Conf.
Artificial Intelligence and Eighth Innovative Applications of Artificial
Intelligence Conf. (AAAI/IAAI ’96), vol. 1, pp. 725-730, 1996.
[57] J.R. Quinlan and R.L. Rivest, “Inferring Decision Trees Using the
Minimum Description Length Principle,” Information and Computation,
vol. 80, no. 3, pp. 227-248, 1989.
[58] J. Handl, J. Knowles, and D.B. Kell, “Computational Cluster
Validation in Post-Genomic Data Analysis,” Bioinformatics, vol. 21,
no. 15, pp. 3201-3212, 2005.
[59] F. Kovacs, C. Legancy, and A. Babos, “Cluster Validity Measurement Techniques,” Proc. Sixth Int’l Symp. Hungarian Researchers on
Computational Intelligence (CINTI), 2005.
George Lee is currently a senior undergraduate
at Rutgers University and will receive the BS
degree in biomedical engineering in May 2008.
He has recently been awarded an Aresty
Research Grant (2007) for his work on machine
learning methods to distinguish between cancer
and benign studies from high-dimensional gene
and protein expression.
Carlos Rodriguez received the degree from the
University of Puerto Rico in 2007. He visited
Rutgers University as a RISE scholar in the
summer of 2006, where he worked with
Dr. Anant Madabhushi. He is currently a software engineer at Evri, a Seattle-based company
working on the semantic web.
Anant Madabhushi received the BS degree in
biomedical engineering from Mumbai University,
Mumbai, India, in 1998, the MS degree in
biomedical engineering from the University of
Texas, Austin, in 2000, and the PhD degree in
bioengineering from the University of Pennsylvania in 2004. He is currently an assistant
professor in the Department of Biomedical
Engineering, Rutgers University. He is also a
member of the Cancer Institute of New Jersey
and an adjunct assistant professor of radiology at the Robert Wood
Johnson Medical Center, New Brunswick, New Jersey. He has nearly
50 publications and book chapters in leading international journals and
peer-reviewed conferences. His research interests are in the area of
medical image analysis, computer-aided diagnosis, machine learning,
and computer vision and in applying these techniques for early detection
and diagnosis of prostate and breast cancers from high-resolution MRI,
MR spectroscopy, protein and gene expression studies, and digitized
tissue histopathology. He is also the recipient of a number of awards for
both research as well as teaching, including the Busch Biomedical
Award (2006), the Technology Commercialization Award (2006), the
Coulter Early Career Award (2006), the Excellence in Teaching Award
(2007), and the Cancer Institute of New Jersey New Investigator Award
(2007). In addition, his research work has also received grant funding
from the National Cancer Institute, New Jersey Commission on Cancer
Research, and the US Department of Defense.