Investigating the Efficacy of Nonlinear Dimensionality Reduction Schemes in Classifying Gene and Protein Expression Studies

George Lee, Carlos Rodriguez, and Anant Madabhushi

Abstract—The recent explosion in procurement and availability of high-dimensional gene and protein expression profile data sets for cancer diagnostics has necessitated the development of sophisticated machine learning tools with which to analyze them. While some investigators are focused on identifying informative genes and proteins that play a role in specific diseases, other researchers have attempted instead to classify patients based on their expression profiles in order to prognosticate disease status. A major limitation in the ability to accurately classify these high-dimensional data sets stems from the "curse of dimensionality," occurring in situations where the number of genes or peptides significantly exceeds the total number of patient samples. Previous attempts at dealing with this issue have mostly centered on the use of a dimensionality reduction (DR) scheme, Principal Component Analysis (PCA), to obtain a low-dimensional projection of the high-dimensional data. However, linear PCA and other linear DR methods, which rely on euclidean distances to estimate object similarity, do not account for the inherent underlying nonlinear structure associated with most biomedical data. While some researchers have begun to explore nonlinear DR methods for computer vision problems such as face detection and recognition, to the best of our knowledge, few such attempts have been made for classification and visualization of high-dimensional biomedical data. The motivation behind this work is to identify the appropriate DR methods for the analysis of high-dimensional gene and protein expression studies. Toward this end, we empirically and rigorously compare three nonlinear (Isomap, Locally Linear Embedding, and Laplacian Eigenmaps) and three linear DR schemes (PCA, Linear Discriminant Analysis, and Multidimensional Scaling) with the intent of determining a reduced subspace representation in which the individual object classes are more easily discriminable. Owing to the inherent nonlinear structure of gene and protein expression studies, our claim is that the nonlinear DR methods provide a more truthful low-dimensional representation of the data compared to the linear DR schemes. Evaluation of the DR schemes was done by 1) assessing the discriminability of two supervised classifiers (Support Vector Machine and C4.5 Decision Trees) in the different low-dimensional data embeddings and 2) using five cluster validity measures to evaluate the size, distance, and tightness of object aggregates in the low-dimensional space. For each of the seven evaluation measures considered, statistically significant improvement in the quality of the embeddings across 10 cancer data sets via the use of the three nonlinear DR schemes over the three linear DR techniques was observed. Similar trends were observed when linear and nonlinear DR was applied to the high-dimensional data following feature pruning to isolate the most informative features. Qualitative evaluation of the low-dimensional data embeddings obtained via the six DR methods further suggests that the nonlinear schemes are better able to identify potential novel classes (e.g., cancer subtypes) within the data.
Index Terms—Dimensionality reduction, bioinformatics, data clustering, data visualization, machine learning, manifold learning, nonlinear dimensionality reduction, gene expression, proteomics, prostate cancer, lung cancer, ovarian cancer, Principal Component Analysis (PCA), Linear Discriminant Analysis, Multidimensional Scaling, Isomap, Locally Linear Embedding (LLE), Laplacian Eigenmaps, classification, support vector machine, decision trees.

. G. Lee and A. Madabhushi are with the Department of Biomedical Engineering, Rutgers, The State University of New Jersey, 599 Taylor Road, Piscataway, NJ 08854. E-mail: geolee@eden.rutgers.edu, anantm@rci.rutgers.edu.
. C. Rodriguez is with Evri, Seattle, WA. E-mail: carlos@evri.com.

Manuscript received 14 Aug. 2007; revised 5 Dec. 2007; accepted 20 Mar. 2008; published online 10 Apr. 2008. For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBBSI-2007-08-0100. Digital Object Identifier no. 10.1109/TCBB.2008.36.

1 INTRODUCTION

Gene expression profiling and protein expression profiling have emerged as promising new methods for disease prognostication [1], [2], [3]. Attempts at analyzing several-thousand-dimensional gene and protein profiles have been primarily motivated by two factors: 1) identification of individual informative genes and proteins responsible for disease characterization [4], [5], [6], [7] and 2) classification of patients into different disease cohorts [8], [9], [10], [11], [12], [13], [14]. Several researchers involved in the latter area have attempted to use different classification methods to stratify patients based on their gene and protein expression profiles into different categories [8], [9], [10], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27].

While the availability of studies continues to grow, most protein and gene expression databases contain no more than a few thousand patient samples. Thus, the task of stratifying these patients based on the gene/protein profile is subject to the "curse of dimensionality" problem [28], [29], owing to the relatively small number of patient samples compared to the size of the feature space. Classification of new, unseen test samples is thus poor due to the sparseness of data in the high-dimensional feature space. Additionally, many of the features within the expression profile may be noninformative or redundant, providing little additional class discriminatory information [11], [12] while increasing computing time and classifier complexity.

In order to bridge the gap between the number of patient samples and the number of gene/peptide features and overcome the curse of dimensionality problem, researchers have proposed 1) feature selection and 2) dimensionality reduction (DR) to reduce the size of the feature space. Feature selection refers to the identification of the most informative features and has been commonly utilized to precede classification in gene and protein expression studies [11], [14], [30].
However, since a typical gene or protein microarray records expressions from thousands of genes or proteins, the cost of finding an optimal informative subset from several million possible combinations becomes a near intractable problem. Further, genes or peptides that were pruned during the feature selection process may be significant in stratifying intraclass subtypes.

DR refers to a class of methods that transform the high-dimensional data into a reduced subspace in order to represent the data in far fewer dimensions. In Principal Component Analysis (PCA), a linear DR method, the reduced-dimensional data is arranged along the principal eigenvectors, which represent the directions along which the greatest variability of the data occurs [31]. Note that, unlike with feature selection, the samples in the transformed embedding subspace no longer represent specific gene and protein expressions from the original high-dimensional space but rather encapsulate data similarities in the low-dimensional space. Even though the objects in the transformed embedding space are divorced from their original biological meaning, the organization and arrangement of the patient samples in the low-dimensional embedding space lends itself to data visualization and classification. Thus, if two patient samples from a specific disease cohort are mapped adjacent to each other in an embedding space derived from their respective high-dimensional expression profiles, then this suggests that the two patients have a similar disease condition. By exploiting the entire high-dimensional space, DR methods, unlike feature selection, offer the opportunity to stratify the data into subclasses (e.g., novel cancer subtypes).

The most popular DR method for bioinformatics-related applications has been PCA [3], [32], [33], [34], [35], [36], [37], [38]. Originally developed by Hotelling [39], PCA finds orthogonal eigenvectors along which the greatest amount of variability in the data lies. The underlying intuition behind PCA is that the data is linear and that the embedded eigenvectors represent low-dimensional projections of linear relationships between data points in high-dimensional space. Linear Discriminant Analysis (LDA) [31], also known as Fisher Discriminant Analysis, is another linear DR scheme, which incorporates data label information to find data projections that separate the data into distinct clusters. Multidimensional Scaling (MDS) [40] reduces data dimensionality by preserving, in the least squares sense, the pairwise euclidean distances in the low-dimensional space.

Classifier performance with linear DR schemes for biomedical data has been a mixed bag. Dawson et al. [34] found that there were biologically significant elements of the gene expression profile that were not seen with linear MDS. Ye et al. [29] found that LDA gave poor results in distinguishing disease classes on a cohort of nine gene expression studies. Truntzer et al. [35] also found limited use of LDA and PCA for classifying gene and protein expression profiles of a diffuse large B-cell lymphoma data set since the classes appeared to be linearly inseparable. The aforementioned results appear to suggest that biomedical data has a nonlinear underlying structure [34], [35] and that DR methods that do not impose linear constraints in computing the data projection might be more appropriate compared to PCA, MDS, and LDA for classification and visualization of data classes in gene and protein expression profiles.
Recently, nonlinear DR methods such as Spectral Clustering [41], Isometric Mapping (Isomap) [42], Locally Linear Embedding (LLE) [43], and Laplacian Eigenmaps (LEM) [44] have been developed to reduce data dimensionality without assuming a euclidean relationship between data samples in the high-dimensional space. Shi and Malik's Spectral Clustering algorithm (also known as Graph Embedding [41]) builds upon graph theory to partition the graph into clusters and separate them accordingly. Madabhushi et al. [45] demonstrated the use of graph embedding to detect the presence of new tissue classes on high-dimensional prostate MRI studies. The utility of this scheme has also recently been demonstrated in distinguishing between cancerous and benign magnetic resonance spectra (MRS) in the prostate [46] and in discriminating between different cancer grades on digitized tissue histopathology [47], [48]. Tenenbaum et al. [42] presented the Isomap algorithm for nonlinear DR and described the term "manifold" for machine learning as a nonlinear surface embedded in high-dimensional space along which dissimilarities between data points are best represented. The Isomap algorithm estimates geodesic distances, defined as distances between two points along the manifold, and preserves these nonlinear geodesic distances (as opposed to the euclidean distances used in linear methods) while projecting the data onto a low-dimensional space. LLE, proposed by Roweis and Saul [43], uses local weights to preserve local geometry in order to find the global nonlinear manifold structure of the data. The geodesic distance between data points is approximated by assuming that the data is locally linear. Recently, Belkin and Niyogi presented the LEM algorithm [44], which, like Spectral Clustering, Isomap, and LLE, makes local connections but uses the Laplacian to simplify determination of the locality preserving weights used to obtain the low-dimensional data embeddings.

Graph Embedding, LLE, Isomap, and LEM all aim to nonlinearly project the high-dimensional data in such a way that two objects x_a and x_b that lie adjacent to each other on the manifold are adjacent to each other in the low-dimensional embedding space, and likewise, two objects that are distant from each other on the manifold are far apart in the low-dimensional embedding space. As previously demonstrated by Tenenbaum et al. [42], Fig. 1 reveals the limitations of using a linear DR scheme for highly nonlinear data.

Fig. 1. (a) Nonlinear manifold structure of the Swiss roll data set [42]. Labels from two classes (shown with red circles and green squares) are provided to show the distribution of data along the manifold. (b) The low-dimensional embedding obtained via linear MDS on the Swiss roll reveals a high degree of overlap between samples from the two classes due to the use of euclidean distance as a dissimilarity metric. (c) The embedding obtained via LEM, on the other hand, is able to almost perfectly distinguish the two classes by projecting the data in terms of geodesic distance determined along the manifold.

Fig. 1 shows the embedding of the Swiss roll data set shown in Fig. 1a obtained by a linear DR method (MDS) in Fig. 1b and a nonlinear DR scheme (LEM) in Fig. 1c. MDS, which preserves euclidean distances, is unable to capture the nonlinear manifold structure of the Swiss roll, but LEM is capable of learning the shape of the manifold and representing points in the low-dimensional embedding by estimating geodesic distances. Thus, while MDS (Fig. 1b) shows overlap between the two classes that lie along the Swiss roll, LEM (Fig. 1c) provides an unraveled Swiss roll that separates the data classes in the 2D embedding space. A code illustration of this contrast is sketched below.
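To make the contrast in Fig. 1 concrete, the following minimal Python sketch (our illustration, not code from the paper; it assumes numpy and scikit-learn are available) embeds a sampled Swiss roll with classical MDS and with a Laplacian-Eigenmaps-style spectral embedding:

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import SpectralEmbedding

# Sample points along the Swiss roll manifold (cf. Fig. 1a).
X, t = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)

# Classical MDS: double-center the squared euclidean distance matrix and
# keep the two leading eigenvectors (cf. Section 2.2.3); this reproduces
# the overlapping embedding of Fig. 1b.
D2 = np.square(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1))
n = D2.shape[0]
J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
B = -0.5 * J @ D2 @ J
vals, vecs = np.linalg.eigh(B)                # eigenvalues in ascending order
Y_mds = vecs[:, -2:] * np.sqrt(np.maximum(vals[-2:], 0.0))

# Laplacian-Eigenmaps-style embedding via a neighborhood graph (cf. Fig. 1c).
Y_lem = SpectralEmbedding(n_components=2, n_neighbors=10).fit_transform(X)
```

Plotting Y_mds against Y_lem reproduces the qualitative behavior of Figs. 1b and 1c: the euclidean embedding collapses distant sheets of the roll onto each other, while the neighborhood-graph embedding unrolls the manifold.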
While PCA remains the most popular DR method for bioinformatics-related applications [32], [34], [35], [36], [37], [38], nonlinear DR methods have begun to gain popularity [3], [30], [45], [49]. Liu et al. [30] found high classification accuracy in the use of kernel PCA (a nonlinear variant of PCA) for gene expression data sets, while Weng et al. [49] recommended the use of Isomap for medical data analysis. Shi and Chen [3] found that LLE outperformed PCA in classifying three gene expression cancer studies. Dawson et al. [34] compared Isomap, PCA, and linear MDS for oligonucleotide data sets, and Nilsson et al. [50] compared Isomap with MDS in terms of their ability to reveal structures in microarray data. In these and other related studies [3], [34], [49], [50], the nonlinear methods were found to outperform linear DR schemes. While several researchers have performed comparative studies of classifier methods [51], [52], [53] to determine the optimal scheme for various applications, to the best of our knowledge, no comprehensive comparative study of different nonlinear and linear DR schemes in terms of their ability to discriminate between samples has been attempted thus far.

The primary motivation of this work is to systematically and quantitatively compare and evaluate the performance of six DR methods, three linear (PCA, LDA [31], and linear MDS [40]) and three nonlinear (Isomap [42], LLE [43], and LEM [44]), in terms of their ability to faithfully represent the underlying structure of biomedical data. A total of 10 different binary class gene and protein expression studies corresponding to prostate, lung, breast, gliomal, and ovarian cancers, as well as leukemia and lymphoma, are considered in this comparative study. Low-dimensional data embeddings of the cancer studies obtained from each of the six DR methods are evaluated in two ways. First, the low-dimensional data embeddings for each data set are compared in terms of classifier accuracy evaluated via a support vector machine (SVM) and a decision tree (C4.5) classifier. The intuition behind the use of classifiers is that if the embedding produced by a particular DR method accurately captures the structure of the data manifold, then objects x_a, x_b belonging to different classes in the high-dimensional data set D will have low-dimensional embedding coordinates G(x_a), G(x_b) far apart from each other. Thus, if the underlying structure of the data has been faithfully reconstructed by the DR method, then the task of discriminating between objects from different classes becomes trivial (i.e., a linear classifier would suffice). Note that a more complex classifier with a nonlinear separating hyperplane could potentially distinguish objects from different classes even in an embedding space that does not represent a faithful reconstruction of the original multidimensional manifold.
However, the emphasis in this work is not on identifying the optimal classification scheme but rather on identifying the DR method that can provide the optimal low-dimensional representation, so that the task of discriminating different object classes becomes trivial. The role of the classifiers in this work is only to quantitatively evaluate the quality of the data embeddings. In addition to the use of two classifiers, we also consider five different cluster measures to evaluate the low-dimensional data representation. The intuition behind the use of the cluster validity measures is that, in the optimal low-dimensional data representation, all objects x_a ∈ D with associated class label Y(x_a) = +1 and all objects x_b ∈ D with Y(x_b) = -1 will form two distinct, tight, and well-separated clusters.

The organization of the rest of this paper is as follows: In Section 2, we provide an overview of the six DR methods compared in this paper. In Section 3, the experimental setup for quantitatively comparing the linear and nonlinear DR schemes is described. Quantitative and qualitative results and accompanying discussion are presented in Section 4. Finally, concluding remarks are presented in Section 5.

TABLE 1. List of Frequently Appearing Symbols and Notations in This Paper

2 OVERVIEW OF DIMENSIONALITY REDUCTION METHODS

2.1 Terminology

A total of 10 binary class gene and protein expression data sets and one multiclass data set D_j, j ∈ {1, 2, ..., 11}, were considered in this study. Each D_j = {x_1, x_2, ..., x_n} is represented by an n × M dimensional matrix of n samples x_i, i ∈ {1, 2, ..., n}. Each x_i ∈ D_j has an associated class label Y(x_i) ∈ {+1, -1} and an M-dimensional feature vector F(x_i) = [f_u(x_i) | u ∈ {1, 2, ..., M}], where f_u(x_i) represents the gene or protein expression value associated with x_i. Following application of a DR method Φ, where Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}, the individual data points x_i ∈ D are represented by an m-dimensional embedding vector G^Φ(x_i) = [g_v^Φ(x_i) | v ∈ {1, 2, ..., m}], where g_v^Φ(x_i) represents the embedding coordinate of x_i along the vth principal eigenvector in the m-dimensional embedding space. Table 1 lists notation and symbols used frequently in this paper.

2.2 Linear Dimensionality Reduction Methods

2.2.1 Principal Component Analysis (PCA)

PCA is widely used to visualize high-dimensional data and discern relationships by finding orthogonal axes that contain the greatest amount of variance in the data [39]. The orthogonal eigenvectors corresponding to the largest eigenvalues are called "principal components" and are obtained in the following manner. Each data point x_i ∈ D is first centered by subtracting the mean of each feature over all observations from its original feature value f_u(x_i), as shown in

  \bar{f}_u(x_i) = f_u(x_i) - \frac{1}{n} \sum_{i=1}^{n} f_u(x_i),   (1)

for u ∈ {1, 2, ..., M}. From the centered feature values \bar{f}_u(x_i) for each x_i ∈ D, a new n × M matrix Y is constructed. The matrix Y is then decomposed into its corresponding singular values as shown in

  Y = U \Lambda V^T,   (2)

where, via singular value decomposition, an n × n diagonal matrix Λ containing the eigenvalues of the principal components, an n × n left singular matrix U, and an M × n matrix V are obtained. The eigenvalues in Λ represent the amount of variance for each eigenvector g_v^{PCA}, v ∈ {1, 2, ..., m}, in matrix V^T and are used to rank the corresponding eigenvectors in order of greatest variance. The first m eigenvectors, which contain the most variance in the data, are thus retained while the remaining eigenvectors are discarded, so that each data sample x_i ∈ D is now described by an m-dimensional embedding vector G^{PCA}(x_i).
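For concreteness, a minimal numpy rendering of (1) and (2) is sketched below; the function and variable names (pca_embed, expr, m) are ours for illustration and do not come from the paper:

```python
import numpy as np

def pca_embed(expr: np.ndarray, m: int) -> np.ndarray:
    """Project an n x M expression matrix onto its top m principal
    components via singular value decomposition, following (1) and (2)."""
    Y = expr - expr.mean(axis=0)                      # feature centering, as in (1)
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)  # Y = U L V^T, as in (2)
    # Rows of Vt are eigenvectors of the data covariance, already ordered
    # by decreasing singular value (i.e., by decreasing variance).
    return Y @ Vt[:m].T                               # n x m embedding G^PCA(x_i)
```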
2.2.2 Linear Discriminant Analysis (LDA)

LDA [31] takes into account class labels to find eigenvectors that can discriminate between the two classes {+1, -1} in a data set. The intraclass scatter matrix S_W and interclass scatter matrix S_B [31] are computed from the sample means of the data clusters +1 and -1, given by μ_+ = (1/n_+) Σ_a F(x_a) and μ_- = (1/n_-) Σ_b F(x_b), respectively, where for x_a ∈ D, Y(x_a) = +1 and for x_b ∈ D, Y(x_b) = -1. Note that both μ_+ and μ_- are M-dimensional vectors. From the sample means,

  S_W = \sum_{x_a \in D, Y(x_a)=+1} (F(x_a) - \mu_+)(F(x_a) - \mu_+)^T + \sum_{x_b \in D, Y(x_b)=-1} (F(x_b) - \mu_-)(F(x_b) - \mu_-)^T   (3)

and

  S_B = (\mu_+ - \mu_-)(\mu_+ - \mu_-)^T   (4)

are calculated. S_W and S_B are then used to create the eigenvectors by singular value decomposition:

  U \Lambda V^T = S_W^{-1} S_B.   (5)

As with PCA, each data point x_i ∈ D is now represented by an m-dimensional vector G^{LDA}(x_i) corresponding to the m largest eigenvalues in Λ. While LDA has often been used as both a DR method and a classifier, it is limited in handling sparse data and data sets for which the Gaussian distribution assumption is not valid [29].

2.2.3 Classical Multidimensional Scaling (MDS)

MDS [40] is implemented as a linear method that preserves the euclidean geometry between each pair of M-dimensional points x_a, x_b ∈ D. The pairwise distances are arranged into a symmetric n × n distance matrix Δ, as shown in

  \Delta(x_a, x_b) = \| F(x_a) - F(x_b) \|_2,   (6)

where ||·||_2 represents the euclidean norm. MDS finds optimal positions for the data points x_a, x_b in m-dimensional space through minimization of the least squares error in the input pairwise euclidean distances between x_a and x_b [40]. Note that classical MDS differs from nonlinear variants of MDS, such as nonmetric MDS [40], which do not preserve input euclidean distances.

2.3 Nonlinear Dimensionality Reduction Methods

2.3.1 Isometric Mapping (Isomap (ISO))

The Isomap algorithm [42] modifies classical MDS to handle nonlinearities in the data through the use of a neighborhood mapping. By creating linear connections from each point x_i ∈ D to its κ closest neighbors in euclidean space, a manifold representation of the data is constructed, κ being a user-defined parameter. Nonlinear connections between points outside of the κ neighborhood are approximated by calculating the shortest distance between two points x_a, x_b ∈ D along the paths in the neighborhood map, where a, b ∈ {1, 2, ..., n}. Thus, new geodesic distances (distances measured along the surface of the manifold) are calculated and arranged in an n × n pairwise distance matrix Δ, where Δ(x_a, x_b) contains the nonlinear geodesic distance between x_a, x_b ∈ D. The matrix Δ is then given as an input to the classical MDS algorithm, from which each data point x_i ∈ D, i ∈ {1, 2, ..., n}, is represented by its m-dimensional embedding vector G^{ISO}(x_i). A minimal code sketch of this pipeline is given below.
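The pipeline just described, a κ-nearest-neighbor graph, shortest graph paths as geodesic distance estimates, and classical MDS on the resulting distance matrix, can be sketched as follows. This is a simplified illustration assuming a connected neighborhood graph; kneighbors_graph and shortest_path are standard scikit-learn/SciPy utilities, and the function name is ours:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def isomap_embed(X: np.ndarray, kappa: int, m: int) -> np.ndarray:
    """Isomap sketch: classical MDS applied to geodesic distances."""
    # kappa-nearest-neighbor graph with euclidean edge weights.
    knn = kneighbors_graph(X, n_neighbors=kappa, mode='distance')
    # Geodesic distances estimated as shortest paths along the graph;
    # assumes the graph is connected (no infinite entries).
    D = shortest_path(knn, directed=False)
    # Classical MDS step: double-center the squared distance matrix and
    # keep the m leading eigenvectors (cf. Section 2.2.3).
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:m]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))
```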
2.3.2 Locally Linear Embedding (LLE)

LLE [43], like the Isomap algorithm [42], utilizes a neighborhood map connecting each data sample x_i to its κ nearest neighbors in euclidean space. However, instead of calculating manifold distances, LLE describes each x_i in terms of its κ closest neighbors x_a. Thus, for each x_i, a κ × M matrix Z containing the centered features \hat{f}_u(x_a) = f_u(x_a) - f_u(x_i) is obtained. To describe the local geometry for each x_i, linear coefficients accounting for the location of x_i relative to each x_a are optimized by solving for the κ-dimensional weight vector w(x_i, x_a) via the linear system

  Z^T Z w = I,   (7)

where I here denotes a column vector of ones of length κ. From each of the n weight vectors w, the n × n matrix W stores the linear coefficients of each x_i and x_a in W(x_i, x_a), with W(x_i, x_b) = 0 when x_b is not among the κ nearest neighbors of x_i ∈ D. A cost matrix Ψ is then computed from the weight matrix W as

  \Psi = (I - W)^T (I - W),   (8)

where I is the n × n identity matrix. Singular value decomposition is then used to obtain the m-dimensional embedding vector G^{LLE}(x_i) for each x_i ∈ D from the cost matrix Ψ.

2.3.3 Laplacian Eigenmaps (LEM)

The LEM algorithm [44], similar to LLE and Isomap, establishes a locally linear mapping by connecting each point x_i ∈ D to its κ nearest neighbors. Weights are assigned between each pair of points to form an n × n symmetric weight matrix W̃, where W̃(x_i, x_a) = 1 when x_a is a κ nearest neighbor of x_i ∈ D, and W̃(x_i, x_b) = 0 when x_b is not a κ nearest neighbor of x_i ∈ D. From the weight matrix W̃ and a diagonal matrix of column sums D(i, i) = Σ_j W̃(i, j), for all i ∈ {1, 2, ..., n}, a symmetric positive semidefinite matrix L, called the Laplacian, is calculated as

  L = D - \tilde{W}.   (9)

Singular value decomposition (2) is then used to obtain the m-dimensional embedding vector G^{LEM}(x_i) for each x_i ∈ D from the Laplacian L.
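A compact numpy rendering of the LEM construction in (9) is given below. The use of a symmetrized binary κ-nearest-neighbor adjacency matrix and of the smallest nontrivial eigenvectors of L is one common reading of the algorithm; the paper does not spell out these solver details, so treat this as a sketch rather than the authors' implementation:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def lem_embed(X: np.ndarray, kappa: int, m: int) -> np.ndarray:
    """Laplacian Eigenmaps sketch following (9): L = D - W."""
    W = kneighbors_graph(X, n_neighbors=kappa,
                         mode='connectivity').toarray()
    W = np.maximum(W, W.T)          # symmetrize: weight 1 between neighbors
    D = np.diag(W.sum(axis=1))      # diagonal matrix of column sums
    L = D - W                       # graph Laplacian, as in (9)
    vals, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    # Discard the trivial constant eigenvector (eigenvalue ~ 0, assuming a
    # connected graph) and keep the next m eigenvectors as G^LEM(x_i).
    return vecs[:, 1:m + 1]
```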
3 EXPERIMENTAL DESIGN

The organization of this section is as follows: In Section 3.1, we provide a description of the data sets, followed by a brief outline of the methodology in Section 3.2. In Section 3.3, we describe in detail the qualitative and quantitative evaluation measures used for comparing the performance of the DR methods.

3.1 Description of Data Sets

A total of 10 publicly available binary class data sets [1], [7], [8], [16], [17], [19], [24], [25], [26], [54] and one multiclass data set [32] corresponding to high-dimensional gene and protein expression studies(1) were acquired for the purposes of this study. The two-class data sets correspond to gene and protein expression profiles of normal and cancerous samples for breast [7], colon [16], lung [26], ovarian [19], and prostate [17] cancers, as well as leukemia [1], [32], lymphoma [8], [25], and glioma [24] studies. The multiclass data set comprises five subtypes of leukemia. The sizes of the data sets range from 30 to 253 patient samples, with the number of corresponding features ranging from 791 to 54,675 genes or peptides. Table 2 provides a description of all the data sets that we considered, including a description of the data classes and the originating study for each data set. Note that for each study, the number of patient samples is significantly smaller than the dimensionality of the feature space. No preprocessing or normalization of any kind was performed on the original feature space prior to DR. An experiment was, however, performed to compare DR performance with and without feature pruning on the original high-dimensional studies.

(1) These data sets were downloaded from the Biomedical Kent-Ridge Repositories at http://sdmc.lit.org.sg/GEDatasets/Datasets and http://sdmc.i2r.a-star.edu.sg/rp and the Gene Expression Omnibus (GEO) Repository at http://www.ncbi.nlm.nih.gov/geo/.

TABLE 2. Description of Gene Expression and Proteomic Spectra Data Sets Considered in This Study

3.2 Brief Outline of Methodology

Our methodology for investigating the embedded data representation given by DR comprises four main steps, described briefly below and illustrated in the flowchart in Fig. 2. A code sketch tying Steps 1 and 3 together follows this outline.

Fig. 2. Flowchart showing the overall organization and process flow of our experimental design.

Step 1: DR. To evaluate and compare the low-dimensional data embeddings, we reduced the dimensionality of the M-dimensional samples x_i ∈ D_j, j ∈ {1, 2, ..., 11}, via the six DR methods Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}. The resulting m-dimensional embedding vectors G^Φ(x_i) now represent the low-dimensional signatures for each x_i ∈ D_j and for each method Φ. Additionally, for each method Φ, we obtain m-dimensional embedding vectors for the feature-pruned samples x_i ∈ D_j, j ∈ {1, 2, ..., 11}, of dimensionality M̂ < M.

Step 2: Qualitative evaluation for novel class detection. In order to evaluate the presence of possible subclusters within the data, the dominant embedding coordinates g_1(x_i), g_2(x_i), and g_3(x_i), for each method Φ and x_i ∈ D, were plotted against each other. The graphical plots reveal the m-dimensional embedding representations of the high-dimensional data via each of the six DR methods. On the eigenplots obtained for each DR scheme, potential subclasses are visually identified.

Step 3: Quantitative evaluation of DR performance via classifier accuracy. To evaluate the quality of the DR embeddings, two classifiers are trained using the low-dimensional embedding vectors G^Φ(x_i), for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM} and x_i ∈ D. For each D, the samples x_i ∈ D are divided into a training set S_j^Tr and a testing set S_j^t. Samples x_a ∈ S_j^Tr are used to train an SVM and a decision tree (C4.5) classifier (C^SVM, C^C4.5) in the embedding space defined by the embedding coordinates G^Φ(x_a) to distinguish between the different classes. Once the classifiers have been trained, they are applied to predict class labels C^SVM(G^Φ(x_b)), C^C4.5(G^Φ(x_b)) ∈ {+1, -1} for all x_b ∈ S_j^t, for each method Φ. The classifier predictions are compared against the true object labels Y(x_b) for x_b ∈ D to estimate the classifier accuracy, recorded for each DR scheme. The same procedure is repeated using the feature-pruned samples x_i ∈ D with M̂ < M dimensionality, following DR.

Step 4: Quantitative evaluation of DR performance via cluster validity measures. To compare the size, tightness, and separation of class clusters from the different DR methods, we first normalize the embedding space obtained via each of the six DR methods Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}. In this normalized embedding space, we calculate centroids G̃^{Φ,+} and G̃^{Φ,-} corresponding to the +1 and -1 classes. From the centroids G̃^{Φ,+} and G̃^{Φ,-}, we measure the separation between clusters, as well as the tightness of each cluster, by measuring the distances of each x_i ∈ D to the corresponding class centroid. The same procedure is repeated following feature pruning.
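Steps 1 and 3 can be strung together in a few lines of scikit-learn. The sketch below, with Isomap and a linear SVM as illustrative choices, mirrors the 1/3-train, 2/3-test protocol detailed in Section 3.3.3; note that the embedding itself is computed on all samples, since the DR step is unsupervised:

```python
import numpy as np
from sklearn.manifold import Isomap
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def evaluate_embedding(expr: np.ndarray, y: np.ndarray,
                       m: int = 3, kappa: int = 7, seed: int = 0) -> float:
    """Step 1 + Step 3 sketch: embed, then score a classifier on the
    embedding coordinates (1/3 training, 2/3 testing)."""
    G = Isomap(n_neighbors=kappa, n_components=m).fit_transform(expr)
    G_tr, G_te, y_tr, y_te = train_test_split(
        G, y, train_size=1/3, stratify=y, random_state=seed)
    clf = SVC(kernel='linear').fit(G_tr, y_tr)
    return clf.score(G_te, y_te)    # classification accuracy on S_j^t
```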
3.3 Detailed Description of Experimental Design

3.3.1 Feature Pruning

A feature pruning step is employed to identify a set of informative features F̂(x_i) = [f_û(x_i) | û ∈ {1, 2, ..., M̂}], where M̂ < M, for each x_i ∈ D. The aim of feature pruning is to determine whether the trends in performance of the six DR methods considered in this study are similar when considering all features F(x_i) and when considering only the most informative features F̂(x_i). The feature pruning method based on t-statistics and described in [1] and [15] was considered. For all x_i ∈ D and for a specific gene or protein expression feature u ∈ {1, 2, ..., M}, the means \bar{f}_u^+, \bar{f}_u^- and variances (σ_u^+)^2, (σ_u^-)^2 of the expression levels for the +1 and -1 classes were computed. Hence,

  \bar{f}_u^+ = \frac{1}{n_+} \sum_{x_a \in D_j, Y(x_a)=+1} f_u(x_a),   (10)

  (\sigma_u^-)^2 = \frac{1}{n_-} \sum_{x_b \in D_j, Y(x_b)=-1} \left( f_u(x_b) - \bar{f}_u^- \right)^2,   (11)

with \bar{f}_u^- and (σ_u^+)^2 defined analogously. The values of \bar{f}_u^+, \bar{f}_u^-, (σ_u^+)^2, and (σ_u^-)^2 were then used to calculate the information content of each gene or protein expression feature as

  T(f_u) = \frac{\bar{f}_u^+ - \bar{f}_u^-}{\sqrt{ (\sigma_u^+)^2 / n_+ + (\sigma_u^-)^2 / n_- }}.   (12)

The different features are then ranked in descending order based on their information content T(f_u). The top 10 percentile of the most informative features f_û, û ∈ {1, 2, ..., M̂}, where M̂ < M, are used to compute a second set of embeddings for each D_j, j ∈ {1, 2, ..., 11}, and each Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}.
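The score in (12) is the familiar two-sample t-statistic. A direct numpy transcription of the pruning step, ranking by the absolute value of T(f_u) (a common reading of "descending information content") and keeping the top decile, might read:

```python
import numpy as np

def prune_features(expr: np.ndarray, y: np.ndarray,
                   quantile: float = 0.10) -> np.ndarray:
    """Rank features by the t-statistic in (12); keep the top 10 percent."""
    pos, neg = expr[y == +1], expr[y == -1]
    mu_p, mu_n = pos.mean(axis=0), neg.mean(axis=0)    # means, as in (10)
    var_p = pos.var(axis=0, ddof=1)                    # variances, as in (11)
    var_n = neg.var(axis=0, ddof=1)
    t = np.abs(mu_p - mu_n) / np.sqrt(var_p / len(pos) + var_n / len(neg))
    n_keep = int(np.ceil(quantile * expr.shape[1]))
    keep = np.argsort(t)[::-1][:n_keep]    # indices of most informative f_u
    return expr[:, keep]                   # feature-pruned samples, M^ < M
```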
3.3.2 Qualitative Evaluation to Identify Novel Subclasses

The linear and nonlinear DR methods were evaluated in terms of their ability to identify new subclasses within the data. The three dominant embedding coordinates g_1(x_i), g_2(x_i), and g_3(x_i) are plotted against each other, for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM} and all x_i ∈ D. The 3D space of embedding coordinates G^Φ(x_i), for all x_i ∈ D, was visually inspected for 1) distinct clusters within the dominant +1, -1 classes and 2) distinct clusters that appear to be far removed from the cluster centers G̃^{Φ,+} and G̃^{Φ,-}. Since the ground truth for newly identified subclasses within the binary class data sets was unavailable, we also compared the six DR schemes on a multiclass Acute Lymphoblastic Leukemia data set [32], which comprises five known subclasses.

3.3.3 Quantitative Evaluation to Measure Class Discriminability

In this section, we describe in greater detail the different performance measures used for evaluating the efficacy of the DR methods.

1. DR comparison via classifier accuracy. The accuracy of two classifiers (linear SVMs and C4.5 Decision Trees) was used to quantitatively evaluate G(x_i), x_i ∈ D, on the 11 data sets D_j, j ∈ {1, 2, ..., 11}, using the class labels provided. Both classifiers considered, SVMs and C4.5 Decision Trees, require the use of a training set S_j^Tr to construct a prediction model for new data and a testing set S_j^t. Each classifier was first trained using the labeled instances in S_j^Tr, where for each x_a ∈ S_j^Tr, Y(x_a) ∈ {+1, -1}. The classifier training is done separately for each DR method Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}. To train the classifiers, we randomly set aside 1/3 of the samples for the training set S_j^Tr, and the remaining 2/3 of the samples in S_j^t were used for testing. Threefold cross-validation was then used to determine the optimal classifier parameters. The classifier outputs C^SVM(G^Φ(x_b)), C^C4.5(G^Φ(x_b)) ∈ {+1, -1}, where x_b ∈ S_j^t, were compared against the true object label Y(x_b) ∈ {+1, -1}. Accuracy is then defined as the ratio of the number of objects x_b ∈ S_j^t correctly labeled by the classifier to the total number of tested objects in S_j^t. Below, we provide a brief description of the two classifiers considered in this study.

1.1. Support Vector Machines (SVM). SVMs were first introduced by Vapnik [55] and are based on the structural risk minimization (SRM) principle from statistical learning theory. The SVM attempts to minimize a bound on the generalization error (the error made on test data). SVM-based techniques focus on "borderline" training examples (or support vectors) that are the most difficult to classify. The SVM projects the input training data G^Φ(x_a), for x_a ∈ S_j^Tr, via a kernel Π; here, the linear kernel is defined as

  \Pi(G^{\Phi}(x_a), G^{\Phi}(x_b)) = G^{\Phi}(x_a)^T G^{\Phi}(x_b) + b,   (13)

where b is the bias estimated on the training set S_j^Tr ⊂ D. The general form of the SVM classifier is given by

  C^{SVM} = \sum_{\gamma=1}^{n_s} \alpha_\gamma Y(x_\gamma) \Pi(G^{\Phi}(x_\gamma), G^{\Phi}(x_b)),   (14)

where x_γ, γ ∈ {1, 2, ..., n_s}, denote the n_s support vectors, and the model parameters α_γ are obtained by maximizing the objective function

  L(\alpha) = \sum_{\gamma=1}^{n_s} \alpha_\gamma - \frac{1}{2} \sum_{\gamma,\delta=1}^{n_s} \alpha_\gamma \alpha_\delta Y(x_\gamma) Y(x_\delta) \Pi(G^{\Phi}(x_\gamma), G^{\Phi}(x_\delta)),   (15)

subject to the constraints \sum_{\gamma=1}^{n_s} \alpha_\gamma Y(x_\gamma) = 0 and 0 ≤ α_γ ≤ ω, γ ∈ {1, 2, ..., n_s}, where the parameter ω controls the trade-off between the empirical risk (training errors) and model complexity.

Additionally, a one-against-all SVM scheme was implemented for the multiclass case [14]. For this scheme, a binary classifier is built for each class to separate one class from all the other classes. Again, 1/3 of the samples from each class are randomly selected for the training set S_j^Tr, and predictions are made on the remaining 2/3 of the samples in S_j^t. Each of the five binary classifiers makes a prediction as to whether each x_a ∈ D_j belongs to its target class. In the ideal case, only the binary classifier trained to identify Y(x_a) as the target class outputs a value of 1 and the other four classifiers output 0. If so, x_a is said to have been correctly classified. If not, x_a is randomly assigned one of the five class labels. If the randomly assigned class label is not its true class label, x_a is said to have been misclassified; otherwise, it is determined to have been correctly classified. A sketch of this scheme appears below.
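The one-against-all scheme maps directly onto scikit-learn's OneVsRestClassifier; note that the library breaks ties by the largest decision value rather than by the random assignment rule the authors describe, so this is an approximation of their protocol rather than a reimplementation:

```python
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def one_vs_all_accuracy(G, y, seed=0):
    """One-against-all linear SVM on embedding coordinates G^Phi(x_i)."""
    G_tr, G_te, y_tr, y_te = train_test_split(
        G, y, train_size=1/3, stratify=y, random_state=seed)
    # One binary linear SVM per class, each separating its target class
    # from all the other classes.
    clf = OneVsRestClassifier(SVC(kernel='linear')).fit(G_tr, y_tr)
    return clf.score(G_te, y_te)
```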
1.2. C4.5 Decision Trees (C4.5). A decision tree is a special type of classifier, trained via an iterative selection of the individual features f_u(x_a) that are the most salient at each node in the tree [56]. One of the most commonly used algorithms for generating decision trees is the C4.5 rule set proposed by Quinlan [56]. The rules generated by this approach are in conjunctive form, such as "if A and B then C," where A and B are the rule antecedents and C is the rule consequence. Every path from the root to a leaf is converted to an initial rule by regarding all the conditions appearing in the path as the conjunctive rule antecedents and regarding the class label Y(x_a), x_a ∈ D, held by the leaf as the rule consequence. Tree pruning is then done using a greedy elimination rule that removes antecedents that are not sufficiently discriminatory. The rule set is then further refined by way of the minimum description length (MDL) principle [57] to remove those rules that do not contribute to the accuracy of the tree. Hence, for each Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}, we obtain a separate decision tree classifier C^C4.5(G^Φ(x_i)) to classify every x_i ∈ D as {+1, -1}. The C4.5 Decision Tree is extended to the multiclass case by simply adding more output labels. Classifier evaluation is similarly performed on D_j, j ∈ {1, 2, ..., 11}, following feature pruning.

2. DR comparison via cluster validity measures. The low-dimensional embeddings G^Φ(x_i), obtained for each Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}, are also compared in terms of five cluster validity measures. Prior to this, however, the embedding coordinates G^Φ(x_i) for x_i ∈ D need to be normalized within a unit hypercube H in order to facilitate a quantitative comparison across the six DR schemes. The embedding coordinate g_v(x_i), for each x_i ∈ D and v ∈ {1, 2, ..., m}, is thus scaled to [0, 1] along each of the m dimensions via the following formulation:

  \tilde{g}_v(x_i) = \frac{ g_v(x_i) - \min_i g_v(x_i) }{ \max_i g_v(x_i) - \min_i g_v(x_i) },   (16)

where g̃_v(x_i) is the normalized embedding coordinate of x_i along the vth dimension, v ∈ {1, 2, ..., m}. For all x_a ∈ D_j such that Y(x_a) = +1, the cluster center of the +1 class, G̃^{Φ,+}, is obtained by averaging the embedding coordinate locations g̃_v along each dimension v ∈ {1, 2, ..., m} and for each Φ. Formally, where n_+ is the number of objects in the +1 class,

  \tilde{g}_v^{\Phi,+} = \frac{1}{n_+} \sum_{x_a \in D_j, Y(x_a)=+1} \tilde{g}_v(x_a).   (17)

Thus, the normalized cluster center for the +1 class is obtained as G̃^{Φ,+} = [g̃_v^{Φ,+} | v ∈ {1, 2, ..., m}]. Similarly, we obtain the cluster center G̃^{Φ,-} for the -1 class. Having obtained G̃^{Φ,+} and G̃^{Φ,-}, we define the five cluster validity measures as follows:

2.1. Intercentroid Distance (ICD). C^{ICD,Φ} is defined as the euclidean distance between the centroids G̃^{Φ,+} and G̃^{Φ,-} [58]. C^{ICD,Φ} is calculated for each D_j, j ∈ {1, 2, ..., 10}, and for all Φ.

2.2. Cluster Tightness (CT). To evaluate the tightness and distinctness of object clusters in the embedding space, we define and evaluate four CT measures: C^{CT,Φ,+}, C^{CT,Φ,-}, σ^{CT,Φ,+}, and σ^{CT,Φ,-}. C^{CT,Φ,+} is defined as the mean euclidean distance of all objects x_a ∈ D_j, Y(x_a) = +1, from G̃^{Φ,+}. Formally, this is expressed as

  C^{CT,\Phi,+} = \frac{1}{n_+} \sum_{x_a \in D, Y(x_a)=+1} \| \tilde{G}^{\Phi,+} - \tilde{G}^{\Phi}(x_a) \|_2.   (18)

We also similarly compute σ^{CT,Φ,+} as the standard deviation of the euclidean distances of all x_a ∈ D from their corresponding cluster centroid G̃^{Φ,+} [59]. C^{CT,Φ,-} and σ^{CT,Φ,-} are defined analogously for the -1 class. The calculation of the above cluster measures and the normalization of the embedding coordinate system are repeated for all D_j, j ∈ {1, 2, ..., 10}, following feature pruning. A numpy transcription of these measures is sketched below.
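Under the normalization in (16), the five cluster validity measures reduce to a few lines of numpy. In the sketch below, the dictionary keys are our labels for C^{ICD}, C^{CT,+}, C^{CT,-}, σ^{CT,+}, and σ^{CT,-}, not notation from the paper:

```python
import numpy as np

def cluster_validity(G: np.ndarray, y: np.ndarray) -> dict:
    """ICD and CT measures of (16)-(18) on an m-dimensional embedding G."""
    # Scale each embedding dimension to [0, 1], as in (16).
    G = (G - G.min(axis=0)) / (G.max(axis=0) - G.min(axis=0))
    c_pos = G[y == +1].mean(axis=0)    # +1 class centroid, as in (17)
    c_neg = G[y == -1].mean(axis=0)    # -1 class centroid
    d_pos = np.linalg.norm(G[y == +1] - c_pos, axis=1)
    d_neg = np.linalg.norm(G[y == -1] - c_neg, axis=1)
    return {
        'ICD': float(np.linalg.norm(c_pos - c_neg)),  # intercentroid distance
        'CT_mean_pos': d_pos.mean(),                  # tightness, as in (18)
        'CT_mean_neg': d_neg.mean(),
        'CT_std_pos': d_pos.std(),                    # spread about centroid
        'CT_std_neg': d_neg.std(),
    }
```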
Following the computation of the seven quantitative performance measures (two classifier and five cluster measures), a paired Student t-test comparison is performed between the values of C^{SVM}, C^{C4.5}, C^{ICD}, C^{CT,+}, C^{CT,-}, σ^{CT,+}, and σ^{CT,-} for each of the following nine pairs of linear and nonlinear methods (PCA/ISO, LDA/ISO, MDS/ISO, PCA/LLE, LDA/LLE, MDS/LLE, PCA/LEM, LDA/LEM, and MDS/LEM) across all data sets D_j, j ∈ {1, 2, ..., 10}, under the null hypothesis that there is no difference in the seven performance measures between each of the nine pairs of linear/nonlinear DR methods. Thus, if p ≤ 0.05 for a pair of linear/nonlinear methods for a particular performance measure, the difference is assumed to be statistically significant. A similar t-test comparison is also performed using the embedding data obtained following feature pruning, with the aim of showing similar trends across the six different DR methods applied to both the unpruned and the feature-pruned data sets.
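This significance test corresponds to scipy's paired t-test. In the sketch below, the two accuracy arrays are hypothetical placeholders standing in for the per-data-set values of any one performance measure for one linear/nonlinear pair:

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-data-set accuracies for one linear/nonlinear pair
# (one value per study D_j); real values come from the evaluation above.
acc_pca = np.array([0.71, 0.65, 0.80, 0.74, 0.69, 0.77, 0.72, 0.66, 0.75, 0.70])
acc_iso = np.array([0.82, 0.78, 0.88, 0.85, 0.80, 0.86, 0.84, 0.79, 0.87, 0.81])

t_stat, p_value = ttest_rel(acc_iso, acc_pca)   # paired Student t-test
significant = p_value <= 0.05                   # reject the null hypothesis
print(f'p = {p_value:.4f}, significant: {significant}')
```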
4 RESULTS AND DISCUSSION

4.1 Qualitative Results

4.1.1 Class Separability in Embedding Space

Fig. 3 shows the 2D embedding plots of the six different linear and nonlinear DR methods for one proteomic spectra (ovarian cancer [19]) and two gene expression (colon [16] and lung cancer [54]) data sets. Each of the plots in Figs. 3a-3l was generated by plotting the first dominant eigenvector g̃_1(x_i) against the second dominant eigenvector g̃_2(x_i), for all x_i ∈ D_j and for a given DR method Φ. The two object classes (+1 and -1) are denoted with different symbols.

Fig. 3. Embedding plots obtained by plotting the dominant eigenvectors g̃_1(x_i) and g̃_2(x_i) against each other for ((a), (d), (g), (j)) the ovarian cancer [19], ((b), (e), (h), (k)) colon cancer [16], and ((c), (f), (i), (l)) lung cancer [54] data sets for six different linear and nonlinear DR methods. Embedding plots for Φ = (a) PCA, (d) MDS, (g) ISO, and (j) LLE for the ovarian cancer data set [19] are shown in the left column, while in the middle column embedding plots for colon cancer [16] obtained via Φ = (b) LDA, (e) MDS, (h) LLE, and (k) LEM are shown. Embedding plots for the lung cancer data set [54] for Φ = (c) PCA, (f) LDA, (i) ISO, and (l) LEM are shown in the rightmost column.

Figs. 3a and 3d correspond to the embeddings generated by two of the linear DR methods (PCA, MDS), while Figs. 3g and 3j show the corresponding plots obtained from two of the nonlinear DR methods (ISO, LLE) on the ovarian cancer study. Note that in the embeddings obtained with both Isomap and LLE (Figs. 3g and 3j), the two classes are clearly distinguishable, while the corresponding embeddings obtained with PCA and MDS (Figs. 3a and 3d) reveal a significant degree of overlap between the +1 and -1 classes. A similar trend is seen with LDA and MDS (Figs. 3b and 3e) compared to LLE and LEM (Figs. 3h and 3k) on the colon cancer data set [16]. Note that in spite of the presence of a couple of apparent outliers in the embeddings obtained by LLE and LEM, the nonlinear DR methods appear to perform much better compared to the linear methods (Figs. 3b and 3e). The difference is even more stark in the embeddings obtained with PCA (Fig. 3c) and LDA (Fig. 3f) compared to Isomap (Fig. 3i) and LEM (Fig. 3l) on the lung cancer data set [54] in the rightmost column.

4.1.2 Novel Class Detection in Embedding Space

Fig. 4. Embedding graphs obtained by plotting the three most dominant embedding vectors g̃_1(x_i), g̃_2(x_i), and g̃_3(x_i) for x_i ∈ D_j, for Φ = (a) LDA, (b) LLE, and (c) LEM, respectively, on the lung cancer Michigan data set [26] in the top row. In the middle row, the embedding results obtained on the prostate cancer study [17] for Φ = (d) MDS, (e) LLE, and (f) LEM, respectively, are shown. Finally, the embedding plots obtained via (g) PCA, (h) ISO, and (i) LLE for the multiclass acute lymphoblastic leukemia data set [32] are shown in the bottom row. Note that the ellipses in (b), (c), (e), and (f) have been manually placed to highlight what appear to be potentially new classes.

Fig. 4 qualitatively illustrates the differences between the linear and nonlinear DR methods in capturing the true underlying low-dimensional structure of the data and highlights the differences between the two types of methods in terms of their ability to identify subclasses in the data. Figs. 4a, 4b, and 4c show the embedding plots obtained via LDA, LLE, and LEM, respectively, for the lung cancer Michigan data set [26]. For LDA (Fig. 4a), no meaningful clustering of samples was observable, while for both LLE and LEM, two distinct clusters within the normal class (denoted via superimposed ellipses) were identifiable. Subclusters (denoted via superimposed ellipses) in the prostate cancer data set [17] were likewise discernible for both LLE and LEM (Figs. 4e and 4f) but were occult in the MDS embedding (Fig. 4d). Note that the ellipses in Figs. 4b, 4c, 4e, and 4f are manually placed on the plots to highlight what appear to be possible new classes. Since 10 of the studies considered in this work were labeled as binary class data sets, we were unable to evaluate the validity of the newly detected subclasses. Note, however, that the two nonlinear methods for both the lung cancer [54] and leukemia data sets [1], [32] identify near identical subclusters, lending further credibility to the premise that the subclusters identified are genuine subclasses. To further test the ability of the nonlinear DR schemes for novel class detection, a multiclass data set comprising five known subtypes of acute lymphoblastic leukemia [32] was considered. As shown in Fig. 4g, PCA is unable to unravel the classes as discriminatingly as Isomap (Fig. 4h) or LLE (Fig. 4i). The five subclasses shown in Figs. 4g, 4h, and 4i are represented with different symbols.

4.2 Quantitative Results

4.2.1 Classifier Accuracy

For each of the 10 binary class and one multiclass data sets, classifier accuracy (C^C4.5, C^SVM) was assessed on the embeddings obtained via the six DR schemes for both the unpruned (Figs. 5a and 5b) and feature-pruned data sets (Fig. 5c). It can be seen from Fig. 5 that, on average, the nonlinear DR methods (ISO, LLE, and LEM) perform better than their linear counterparts (PCA, LDA, and MDS) for both classifiers. Tables 3 and 4 list the accuracy results for the SVM and C4.5 classifiers over the 10 binary class data sets, with and without DR.
Fig. 5. Average (a) C^{SVM,Φ} and (b) C^{C4.5,Φ} for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM} over the 10 binary class data sets, before feature pruning. Additionally, the average (c) C^{SVM,Φ} is given for all Φ over the 10 binary class data sets, following feature pruning.

TABLE 3. Accuracy of C^{SVM,Φ} for Each of the 10 Binary Class Data Sets with and without DR for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}, without Feature Pruning

TABLE 4. Accuracy of C^{C4.5,Φ} for Each of the 10 Binary Class Data Sets with and without DR for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}, without Feature Pruning

TABLE 5. Accuracy of C^{SVM,Φ} in Distinguishing Subtypes of the Acute Lymphoblastic Leukemia Data Set with and without DR for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}, without Feature Pruning

Classifier accuracy with an SVM for the multiclass data set (Acute Lymphoblastic Leukemia [32]) is provided in Table 5, again with and without DR. The results in Tables 3, 4, and 5 clearly suggest that, for both the binary class and multiclass cases, 1) classification of high-dimensional data should be preceded by DR and 2) nonlinear DR schemes result in higher classifier accuracy compared to linear DR schemes.

4.2.2 Cluster Metrics

Fig. 6 and Tables 6, 7, and 8 show the results for the cluster validity measures for all Φ. Figs. 6a, 6b, and 6c correspond to the average C^{ICD,Φ}, C^{CT,Φ,+}, and C^{CT,Φ,-} values, respectively, across the 10 binary class data sets after feature pruning. Tables 6, 7, and 8 show the C^{ICD,Φ}, C^{CT,Φ,-}, and C^{CT,Φ,+} values for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM} over the 10 binary class studies considered in this work without feature pruning.

Fig. 6. Average (a) C^{ICD,Φ}, (b) C^{CT,Φ,+}, and (c) C^{CT,Φ,-} values over the 10 binary class data sets for each Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM} after feature pruning.

TABLE 6. Values of C^{ICD,Φ} for Each of the 10 Binary Class Data Sets Following DR for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}, without Feature Pruning

TABLE 7. Values of C^{CT,Φ,-} for Each of the 10 Binary Class Data Sets Following DR for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}, without Feature Pruning

From Fig. 6a and Table 6, we observe that the intercentroid distance between the dominant clusters is on average greater for the nonlinear DR methods compared to the linear methods, in turn suggesting greater separation between the two classes. Similarly, from Figs. 6b and 6c, we observe that the average C^{CT,Φ,+} and C^{CT,Φ,-} values over all 10 binary class data sets are smaller for the nonlinear DR methods compared to the linear methods, suggesting that the object classes form more compact, tighter clusters in the embedding spaces generated via the nonlinear DR schemes.

In Table 9, p-values for the paired Student t-tests are obtained by comparing C^{SVM,Φ}, C^{C4.5,Φ}, C^{ICD,Φ}, C^{CT,Φ,+}, and C^{CT,Φ,-} across the 10 binary class data sets for each pair of linear and nonlinear DR methods (PCA/ISO, LDA/ISO, MDS/ISO, PCA/LLE, LDA/LLE, MDS/LLE, PCA/LEM, LDA/LEM, and MDS/LEM). Statistically significant differences were observed for all the performance measures considered, for each pair of linear/nonlinear DR methods, across all 10 binary class data sets. Similar trends were observed for embeddings obtained from the linear and nonlinear DR schemes following feature pruning (Table 10).
TABLE 8. Values of C^{CT,Φ,+} for Each of the 10 Binary Class Data Sets Following DR for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM}, without Feature Pruning

TABLE 9. p-Values Obtained by a Paired Student t-Test of C^{SVM,Φ}, C^{C4.5,Φ}, C^{ICD,Φ}, C^{CT,Φ,+}, and C^{CT,Φ,-} across 10 Data Sets, Comparing Nine Pairs of Linear/Nonlinear DR Methods, without Feature Pruning

TABLE 10. p-Values Obtained by a Paired Student t-Test of C^{SVM,Φ}, C^{C4.5,Φ}, C^{ICD,Φ}, C^{CT,Φ,+}, and C^{CT,Φ,-} across 10 Data Sets, Comparing Nine Pairs of Linear/Nonlinear DR Methods, Following Feature Pruning

Additionally, we investigated the performance of each of the DR methods across the 10 binary class studies as a function of the number of dimensions of the embedding space G^Φ, from a classification perspective. Figs. 7a and 7b show the average classification accuracy (C^{SVM,Φ} and C^{C4.5,Φ}, respectively) for each DR method, with the number of dimensions varied from 2 to 10 (v ∈ {2, 3, ..., 10}). Similarly, Figs. 8a, 8b, and 8c show the cluster validity measures (C^{ICD,Φ}, C^{CT,Φ,+}, and C^{CT,Φ,-}, respectively) for each DR method over the same range of dimensions. For both the classifier and the cluster validity measures, one sees similar trends across dimensions, with the nonlinear DR methods outperforming the linear methods (Figs. 7 and 8), thereby comprehensively demonstrating that the nonlinear DR schemes outperform the linear DR methods independent of the number of embedding dimensions considered.

Fig. 7. Average classification accuracy for (a) C^{SVM,Φ} and (b) C^{C4.5,Φ} over the 10 binary class data sets for Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM} and v ∈ {2, 3, ..., 10}.

Fig. 8. Average (a) C^{ICD,Φ}, (b) C^{CT,Φ,+}, and (c) C^{CT,Φ,-} values over all 10 binary class data sets for each Φ ∈ {PCA, LDA, MDS, ISO, LLE, LEM} and v ∈ {2, 3, ..., 10}.

5 CONCLUDING REMARKS

The primary objective of this paper was to identify appropriate DR methods to precede the analysis and classification of high-dimensional gene and protein expression studies. This is especially important in applications where the goal is to identify two or more specific classes within the data sets. In this paper, we quantitatively compared the performance of six different DR methods, three linear (PCA, LDA, and MDS) and three nonlinear (Isomap, LLE, and LEM), from the perspective of 1) distinguishing between cancer and noncancer studies and 2) identifying new object classes (cancer subtypes), on 10 binary high-dimensional gene and protein expression data sets for prostate, lung, breast, and ovarian cancers, as well as for leukemia, lymphomas, and gliomas. Additionally, a multiclass data set comprising five distinct subtypes of lymphoblastic leukemia was also considered. The efficacy of the low-dimensional representations of the high-dimensional data obtained by the different DR methods was evaluated via two classifier schemes (SVM and C4.5) and five different cluster validity measures.
Fig. 9. The degree to which nonlinearity in the data can be accurately approximated is dependent on the size of the local neighborhood κ within which linearity is assumed. As κ increases, the locally linear assumption is no longer valid. (a) and (b) show C^{ICD,LLE} and C^{ICD,LEM}, respectively, plotted against increasing values of κ for the lung cancer Michigan data set. As κ increases, C^{ICD,LLE} and C^{ICD,LEM} both decrease, suggesting that the nonlinear DR schemes are effectively becoming linear.

The intuition behind the use of these evaluation measures was that, if the low-dimensional embedding was indeed a faithful representation of the high-dimensional feature space, the two different data classes would be separable into distinct, tightly packed clusters. Embeddings were generated by the six different DR methods from the original high-dimensional data before and after feature pruning. Feature pruning was applied to identify only the top 10 percentile of the most informative features in each data set in order to reduce any possible nonlinearity in the data on account of redundant or uncorrelated features. The three linear and three nonlinear methods were also compared pairwise via a paired Student t-test in terms of the seven performance measures across all 10 data sets. In addition, the six different DR methods were also qualitatively compared in terms of the ability of their respective embeddings to reveal the presence of new subclasses within the data. Our primary conclusions from this work are as follows:

1. The nonlinear methods significantly outperformed the linear methods over all the data sets in terms of all seven performance measures, suggesting in turn the nonlinear underlying manifold structure of high-dimensional biomedical studies.
2. The differences between the linear and nonlinear methods were found to be statistically significant even after pruning the data sets by feature selection and were independent of the number of dimensions of the embedding space that were considered.
3. The nonlinear methods also appeared to be able to identify potential subclasses within the data better compared to the linear methods. The linear methods for the most part were unable to even discriminate between the two most dominant classes in each data set.

In making our conclusions, we also acknowledge the following limitations of this study: 1) our results are based on a relatively small database comprising 10 binary class and one multiclass gene and protein expression data sets, 2) not all linear and nonlinear DR methods were considered in this study, and 3) the performance of the nonlinear methods is dependent on the size of the local neighborhood parameter κ within which data linearity is assumed.
Our primary conclusions from this work are as follows:

1. The nonlinear methods significantly outperformed the linear methods over all the data sets in terms of all seven performance measures, suggesting in turn an underlying nonlinear manifold structure in high-dimensional biomedical studies.
2. The differences between the linear and nonlinear methods were statistically significant even after pruning the data sets by feature selection and were independent of the number of dimensions of the embedding space considered.
3. The nonlinear methods also appeared better able to identify potential subclasses within the data compared to the linear methods, which for the most part were unable to discriminate even between the two most dominant classes in each data set.

In drawing our conclusions, we also acknowledge the following limitations of this study: 1) our results are based on a relatively small database comprising 10 binary class and one multiclass gene and protein expression data sets, 2) not all linear and nonlinear DR methods were considered, and 3) the performance of the nonlinear methods depends on the size of the local neighborhood parameter $\kappa$ within which data linearity is assumed. As the value of $\kappa$ increases, the locally linear assumption is no longer valid and the nonlinear DR methods begin to resemble linear methods. The dependency of the nonlinear methods on the value of $\kappa$ is reflected in the plots in Fig. 9: for both $\phi = LLE$ and $\phi = LEM$, the corresponding cluster measure $C_{ICD,\phi}$ begins to decrease with increasing values of $\kappa$, indicating the degeneracy of the nonlinear schemes.

Table 11 lists the best and worst DR schemes in terms of each of the seven performance criteria considered in this study. As can be surmised from Table 11, the nonlinear DR scheme LLE was the best DR method overall, while the linear scheme LDA performed the worst. Our results suggest that if the objective is to distinguish multiple classes or identify subclusters within high-dimensional biomedical data, nonlinear DR methods such as LLE, Isomap, and LEM may be a better choice than linear DR methods such as PCA. Preliminary results in an application involving prostate magnetic resonance spectroscopy [46] appear to confirm the conclusions presented in this work.

TABLE 11. Summary of the best and worst DR methods in terms of each of the seven performance measures.
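The neighborhood-size degeneracy summarized in Fig. 9 and in limitation 3 can be probed with a short experiment. The sketch below (a rough illustration under stated assumptions, not the study's code) re-embeds the data with LLE for increasing neighborhood sizes $\kappa$ and tracks a simple proxy for the paper's inter-cluster distance measure, namely the Euclidean distance between the two embedded class centroids; the specific $\kappa$ values are arbitrary.

```python
# A minimal sketch probing the neighborhood-size dependence of LLE (not the
# authors' code). y is assumed to be a NumPy array of 0/1 class labels.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

def centroid_separation(G, y):
    """Proxy for an inter-cluster distance measure: the Euclidean distance
    between the embedded centroids of the two classes."""
    return np.linalg.norm(G[y == 0].mean(axis=0) - G[y == 1].mean(axis=0))

def separation_vs_kappa(X, y, kappas=(5, 10, 20, 40, 80)):
    """Embed X with LLE for growing neighborhood sizes and track how the
    class separation in the 2D embedding changes."""
    results = {}
    for kappa in kappas:
        lle = LocallyLinearEmbedding(n_neighbors=kappa, n_components=2)
        results[kappa] = centroid_separation(lle.fit_transform(X), y)
    return results
```

A downward trend in the returned values as $\kappa$ grows would echo the behavior of $C_{ICD,LLE}$ in Fig. 9a.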
ACKNOWLEDGMENTS

This work was supported by grants from the Coulter Foundation, the Cancer Institute of New Jersey, the New Jersey Commission on Cancer Research (NJCCR), the Society for Imaging Informatics in Medicine (SIIM), the Aresty Foundation, and the National Cancer Institute (R21CA127186-01 and R03CA128081-01). The authors would like to thank Dr. Jianbo Shi and Dr. James Monaco for their advice and comments.

REFERENCES

[1] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander, "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring," Science, vol. 286, no. 5439, pp. 531-537, 1999.
[2] Y. Peng, "A Novel Ensemble Machine Learning for Robust Microarray Data Classification," Computers in Biology and Medicine, vol. 36, no. 6, pp. 553-573, 2006.
[3] C. Shi and L. Chen, "Feature Dimension Reduction for Microarray Data Analysis Using Locally Linear Embedding," Proc. Third Asia Pacific Bioinformatics Conf. (APBC '05), pp. 211-217, 2005.
[4] S.D. Der, A. Zhou, B.R. Williams, and R.H. Silverman, "Identification of Genes Differentially Regulated by Interferon Alpha, Beta, or Gamma Using Oligonucleotide Arrays," Proc. Nat'l Academy of Sciences USA, vol. 95, no. 26, pp. 15623-15628, Dec. 1998.
[5] R. Maglietta, A. D'Addabbo, A. Piepoli, F. Perri, S. Liuni, G. Pesole, and N. Ancona, "Selection of Relevant Genes in Cancer Diagnosis Based on Their Prediction Accuracy," Artificial Intelligence in Medicine, vol. 40, no. 1, pp. 29-44, May 2007.
[6] T.M. Huang and V. Kecman, "Gene Extraction for Cancer Diagnosis by Support Vector Machines - An Improvement," Artificial Intelligence in Medicine, vol. 35, nos. 1-2, pp. 185-194, 2005.
[7] G. Turashvili, J. Bouchal, K. Baumforth, W. Wei, M. Dziechciarkova, J. Ehrmann, J. Klein, E. Fridman, J. Skarda, J. Srovnal, M. Hajduch, P. Murray, and Z. Kolar, "Novel Markers for Differentiation of Lobular and Ductal Invasive Breast Carcinomas by Laser Microdissection and Microarray Analysis," BMC Cancer, vol. 7, no. 55, 2007.
[8] A.A. Alizadeh, M.B. Eisen, R.E. Davis, C. Ma, I.S. Lossos, A. Rosenwald, J.C. Boldrick, H. Sabet, T. Tran, X. Yu, J.I. Powell, L. Yang, G.E. Marti, T. Moore, J. Hudson, L. Lu, D.B. Lewis, R. Tibshirani, G. Sherlock, W.C. Chan, T.C. Greiner, D.D. Weisenburger, J.O. Armitage, R. Warnke, R. Levy, W. Wilson, M.R. Grever, J.C. Byrd, D. Botstein, P.O. Brown, and L.M. Staudt, "Distinct Types of Diffuse Large B-Cell Lymphoma Identified by Gene Expression Profiling," Nature, vol. 403, no. 6769, pp. 503-511, 2000.
[9] A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini, "Tissue Classification with Gene Expression Profiles," J. Computational Biology, vol. 7, nos. 3-4, pp. 559-583, 2000.
[10] M.P. Brown, W.N. Grundy, D. Lin, N. Cristianini, C.W. Sugnet, T.S. Furey, M. Ares, and D. Haussler, "Knowledge-Based Analysis of Microarray Gene Expression Data by Using Support Vector Machines," Proc. Nat'l Academy of Sciences USA, vol. 97, no. 1, pp. 262-267, Jan. 2000.
[11] A.C. Tan and D. Gilbert, "Ensemble Machine Learning on Gene Expression Data for Cancer Classification," Applied Bioinformatics, vol. 2, no. 3 (supplement), pp. S75-S83, 2003.
[12] L. Song, J. Bedo, K.M. Borgwardt, A. Gretton, and A. Smola, "Gene Selection via the BAHSIC Family of Algorithms," Bioinformatics, vol. 23, pp. 490-498, 2007.
[13] L. Li, W. Jiang, X. Li, K.L. Moser, Z. Guo, L. Du, Q. Wang, E.J. Topol, Q. Wang, and S. Rao, "A Robust Hybrid between Genetic Algorithm and Support Vector Machine for Extracting an Optimal Feature Gene Subset," Genomics, vol. 85, pp. 16-23, 2005.
[14] T. Li, C. Zhang, and M. Ogihara, "A Comparative Study of Feature Selection and Multiclass Classification Methods for Tissue Classification Based on Gene Expression," Bioinformatics, vol. 20, no. 15, pp. 2429-2437, Oct. 2004.
[15] H. Liu, J. Li, and L. Wong, "A Comparative Study on Feature Selection and Classification Methods Using Gene Expression Profiles and Proteomic Patterns," Genome Informatics, vol. 13, pp. 51-60, 2002.
[16] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, and A.J. Levine, "Broad Patterns of Gene Expression Revealed by Clustering of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays," Proc. Nat'l Academy of Sciences USA, vol. 96, no. 12, pp. 6745-6750, 1999.
[17] D. Singh, P.G. Febbo, K. Ross, D.G. Jackson, J. Manola, C. Ladd, P. Tamayo, A.A. Renshaw, A.V. D'Amico, J.P. Richie, E.S. Lander, M. Loda, P.W. Kantoff, T.R. Golub, and W.R. Sellers, "Gene Expression Correlates of Clinical Prostate Cancer Behavior," Cancer Cell, vol. 1, no. 2, pp. 203-209, 2002.
[18] M. Park, J.W. Lee, J.B. Lee, and S.H. Song, "Several Biplot Methods Applied to Gene Expression Data," J. Statistical Planning and Inference, vol. 138, pp. 500-515, 2007.
[19] E.F. Petricoin, A.M. Ardekani, B.A. Hitt, P.J. Levine, V.A. Fusaro, S.M. Steinberg, G.B. Mills, C. Simone, D.A. Fishman, E.C. Kohn, and L.A. Liotta, "Use of Proteomic Patterns in Serum to Identify Ovarian Cancer," The Lancet, vol. 359, no. 9306, pp. 572-577, 2002.
[20] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E.S. Lander, and T.R. Golub, "Interpreting Patterns of Gene Expression with Self-Organizing Maps: Methods and Application to Hematopoietic Differentiation," Proc. Nat'l Academy of Sciences USA, vol. 96, no. 6, pp. 2907-2912, Mar. 1999.
[21] S. Yang, J. Shin, K.H. Park, H.-C. Jeung, S.Y. Rha, S.H. Noh, W.I. Yang, and H.C. Chung, "Molecular Basis of the Differences between Normal and Tumor Tissues of Gastric Cancer," Biochimica et Biophysica Acta, 2007.
[22] L.J. van 't Veer, H. Dai, M.J. van de Vijver, Y.D. He, A.A.M. Hart, M. Mao, H.L. Peterse, K. van der Kooy, M.J. Marton, A.T. Witteveen, G.J. Schreiber, R.M. Kerkhoven, C. Roberts, P.S. Linsley, R. Bernards, and S.H. Friend, "Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer," Nature, vol. 415, pp. 530-536, 2002.
[23] S.L. Pomeroy, P. Tamayo, M. Gaasenbeek, L.M. Sturla, M. Angelo, M.E. McLaughlin, J.Y.H. Kim, L.C. Goumnerova, P.M. Black, C. Lau, J.C. Allen, D. Zagzag, J.M. Olson, T. Curran, C. Wetmore, J.A. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, G. Stolovitzky, D.N. Louis, J.P. Mesirov, E.S. Lander, and T.R. Golub, "Prediction of Central Nervous System Embryonal Tumour Outcome Based on Gene Expression," Nature, vol. 415, pp. 436-442, 2002.
[24] W.A. Freije, F.E. Castro-Vargas, Z. Fang, S. Horvath, T. Cloughesy, L.M. Liau, P.S. Mischel, and S.F. Nelson, "Gene Expression Profiling of Gliomas Strongly Predicts Survival," Cancer Research, vol. 64, no. 18, pp. 6503-6510, 2004.
[25] M.A. Shipp, K.N. Ross, P. Tamayo, A.P. Weng, J.L. Kutok, R.C.T. Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G.S. Pinkus, T.S. Ray, M.A. Koval, K.W. Last, A. Norton, T.A. Lister, J. Mesirov, D.S. Neuberg, E.S. Lander, J.C. Aster, and T.R. Golub, "Diffuse Large B-Cell Lymphoma Outcome Prediction by Gene-Expression Profiling and Supervised Machine Learning," Nature Medicine, vol. 8, pp. 68-74, 2002.
[26] D.G. Beer, S.L.R. Kardia, C.-C. Huang, T.J. Giordano, A.M. Levin, D.E. Misek, L. Lin, G. Chen, T.G. Gharib, D.G. Thomas, M.L. Lizyness, R. Kuick, S. Hayasaka, J.M.G. Taylor, M.D. Iannettoni, M.B. Orringer, and S. Hanash, "Gene-Expression Profiles Predict Survival of Patients with Lung Adenocarcinoma," Nature Medicine, vol. 8, pp. 816-823, 2002.
[27] D.A. Wigle, I. Jurisica, N. Radulovich, M. Pintilie, J. Rossant, N. Liu, C. Lu, J. Woodgett, I. Seiden, M. Johnston, S. Keshavjee, G. Darling, T. Winton, B.-J. Breitkreutz, P. Jorgenson, M. Tyers, F.A. Shepherd, and M.S. Tsao, "Molecular Profiling of Non-Small Cell Lung Cancer and Correlation with Disease-Free Survival," Cancer Research, vol. 62, pp. 3005-3008, 2002.
[28] R.E. Bellman, Adaptive Control Processes. Princeton Univ. Press, 1961.
[29] J. Ye, T. Li, T. Xiong, and R. Janardan, "Using Uncorrelated Discriminant Analysis for Tissue Classification with Gene Expression Data," IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 1, no. 4, pp. 181-190, 2004.
[30] Z. Liu, D. Chen, and H. Bensmail, "Gene Expression Data Classification with Kernel Principal Component Analysis," J. Biomedicine and Biotechnology, vol. 2, pp. 155-159, 2005.
[31] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, second ed. Wiley, 2000.
[32] E.-J. Yeoh, M.E. Ross, S.A. Shurtleff, W.K. Williams, D. Patel, R. Mahfouz, F.G. Behm, S.C. Raimondi, M.V. Relling, A. Patel, C. Cheng, D. Campana, D. Wilkins, X. Zhou, J. Li, H. Liu, C.-H. Pui, W.E. Evans, C. Naeve, L. Wong, and J.R. Downing, "Classification, Subtype Discovery, and Prediction of Outcome in Pediatric Acute Lymphoblastic Leukemia by Gene Expression Profiling," Cancer Cell, vol. 1, no. 2, pp. 133-143, 2002.
[33] J.J. Dai, L. Lieu, and D. Rocke, "Dimension Reduction for Classification with Gene Expression Microarray Data," Statistical Applications in Genetics and Molecular Biology, vol. 5, no. 1, pp. 1-15, 2006.
[34] K. Dawson, R.L. Rodriguez, and W. Malyj, "Sample Phenotype Clusters in High-Density Oligonucleotide Microarray Data Sets Are Revealed Using Isomap, a Nonlinear Algorithm," BMC Bioinformatics, vol. 6, p. 195, 2005.
[35] C. Truntzer, C. Mercier, J. Estève, C. Gautier, and P. Roy, "Importance of Data Structure in Comparing Two Dimension Reduction Methods for Classification of Microarray Gene Expression Data," BMC Bioinformatics, vol. 8, no. 90, 2007.
[36] A. Andersson, T. Olofsson, D. Lindgren, B. Nilsson, C. Ritz, P. Edén, C. Lassen, J. Råde, M. Fontes, H. Morse, J. Heldrup, M. Behrendtz, F. Mitelman, M. Höglund, B. Johansson, and T. Fioretos, "Molecular Signatures in Childhood Acute Leukemia and Their Correlations to Expression Patterns in Normal Hematopoietic Subpopulations," Proc. Nat'l Academy of Sciences USA, vol. 102, no. 52, pp. 19069-19074, 2005.
[37] Y. Zhu, R. Wu, N. Sangha, C. Yoo, K.R. Cho, K.A. Shedden, H. Katabuchi, and D.M. Lubman, "Classifications of Ovarian Cancer Tissues by Proteomic Patterns," Proteomics, vol. 6, pp. 5846-5856, 2006.
[38] M.A. Mendez, C. Hodar, C. Vulpe, and M. Gonzalez, "Discriminant Analysis to Evaluate Clustering of Gene Expression Data," FEBS Letters, vol. 522, pp. 24-28, 2002.
[39] H. Hotelling, "Analysis of a Complex of Statistical Variables into Principal Components," J. Educational Psychology, vol. 24, pp. 417-441, 1933.
[40] J. Venna and S. Kaski, "Local Multidimensional Scaling," Neural Networks, vol. 19, pp. 889-899, 2006.
[41] J. Shi and J. Malik, "Normalized Cuts and Image Segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, Aug. 2000.
[42] J. Tenenbaum, V. de Silva, and J.C. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, vol. 290, no. 5500, pp. 2319-2322, 2000.
[43] S. Roweis and L. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding," Science, vol. 290, no. 5500, pp. 2323-2326, 2000.
[44] M. Belkin and P. Niyogi, "Laplacian Eigenmaps for Dimensionality Reduction and Data Representation," Neural Computation, vol. 15, no. 6, pp. 1373-1396, 2003.
[45] A. Madabhushi, J. Shi, M. Rosen, J.E. Tomaszewski, and M.D. Feldman, "Graph Embedding to Improve Supervised Classification and Novel Class Detection: Application to Prostate Cancer," Proc. Eighth Int'l Conf. Medical Image Computing and Computer-Assisted Intervention (MICCAI '05), pp. 729-737, 2005.
[46] P. Tiwari, A. Madabhushi, and M. Rosen, "A Hierarchical Unsupervised Spectral Clustering Scheme for Detection of Prostate Cancer from Magnetic Resonance Spectroscopy (MRS)," Proc. 10th Int'l Conf. Medical Image Computing and Computer-Assisted Intervention (MICCAI '07), vol. 2, pp. 278-286, 2007.
[47] S. Doyle, M. Hwang, K. Shah, A. Madabhushi, M. Feldman, and J. Tomaszewski, "Automated Grading of Prostate Cancer Using Architectural and Textural Image Features," Proc. Fourth IEEE Int'l Symp. Biomedical Imaging (ISBI '07), pp. 1284-1287, 2007.
[48] S. Doyle, M. Hwang, S. Naik, M. Feldman, J. Tomaszewski, and A. Madabhushi, "Using Manifold Learning for Content-Based Image Retrieval of Prostate Histopathology," Proc. 10th Int'l Conf. Medical Image Computing and Computer-Assisted Intervention (MICCAI '07), 2007.
[49] S. Weng, C. Zhang, Z. Lin, and X. Zhang, "Mining the Structural Knowledge of High-Dimensional Medical Data Using Isomap," Medical and Biological Eng. and Computing, vol. 43, pp. 410-412, 2005.
[50] J. Nilsson, T. Fioretos, M. Höglund, and M. Fontes, "Approximate Geodesic Distances Reveal Biologically Relevant Structures in Microarray Data," Bioinformatics, vol. 20, no. 6, pp. 874-880, 2004.
[51] T. Dietterich, "Ensemble Methods in Machine Learning," Proc. First Int'l Workshop Multiple Classifier Systems (MCS '00), 2000.
[52] A. Madabhushi, J. Shi, M.D. Feldman, M. Rosen, and J. Tomaszewski, "Comparing Ensembles of Learners: Detecting Prostate Cancer from High Resolution MRI," Proc. Second Int'l Workshop Computer Vision Approaches to Medical Image Analysis (CVAMIA '06), pp. 25-36, 2006.
[53] T.-S. Lim, W.-Y. Loh, and Y.-S. Shih, "A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classification Algorithms," Machine Learning, vol. 40, pp. 203-228, 2000.
[54] G.J. Gordon, R.V. Jensen, L.-L. Hsiao, S.R. Gullans, J.E. Blumenstock, S. Ramaswamy, W.G. Richards, D.J. Sugarbaker, and R. Bueno, "Translation of Microarray Data into Clinically Relevant Cancer Diagnostic Tests Using Gene Expression Ratios in Lung Cancer and Mesothelioma," Cancer Research, vol. 62, pp. 4963-4967, 2002.
[55] C. Cortes and V. Vapnik, "Support-Vector Networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[56] J.R. Quinlan, "Bagging, Boosting, and C4.5," Proc. 13th Nat'l Conf. Artificial Intelligence and Eighth Innovative Applications of Artificial Intelligence Conf. (AAAI/IAAI '96), vol. 1, pp. 725-730, 1996.
[57] J.R. Quinlan and R.L. Rivest, "Inferring Decision Trees Using the Minimum Description Length Principle," Information and Computation, vol. 80, no. 3, pp. 227-248, 1989.
[58] J. Handl, J. Knowles, and D.B. Kell, "Computational Cluster Validation in Post-Genomic Data Analysis," Bioinformatics, vol. 21, no. 15, pp. 3201-3212, 2005.
[59] F. Kovács, C. Legány, and A. Babos, "Cluster Validity Measurement Techniques," Proc. Sixth Int'l Symp. Hungarian Researchers on Computational Intelligence (CINTI '05), 2005.

George Lee is currently a senior undergraduate at Rutgers University and will receive the BS degree in biomedical engineering in May 2008. He was recently awarded an Aresty Research Grant (2007) for his work on machine learning methods to distinguish between cancer and benign studies from high-dimensional gene and protein expression data.

Carlos Rodriguez received his degree from the University of Puerto Rico in 2007. He visited Rutgers University as a RISE scholar in the summer of 2006, where he worked with Dr. Anant Madabhushi. He is currently a software engineer at Evri, a Seattle-based company working on the semantic web.

Anant Madabhushi received the BS degree in biomedical engineering from Mumbai University, Mumbai, India, in 1998, the MS degree in biomedical engineering from the University of Texas, Austin, in 2000, and the PhD degree in bioengineering from the University of Pennsylvania in 2004. He is currently an assistant professor in the Department of Biomedical Engineering, Rutgers University. He is also a member of the Cancer Institute of New Jersey and an adjunct assistant professor of radiology at the Robert Wood Johnson Medical Center, New Brunswick, New Jersey. He has nearly 50 publications and book chapters in leading international journals and peer-reviewed conferences. His research interests are in medical image analysis, computer-aided diagnosis, machine learning, and computer vision, and in applying these techniques to the early detection and diagnosis of prostate and breast cancers from high-resolution MRI, MR spectroscopy, protein and gene expression studies, and digitized tissue histopathology. He is the recipient of a number of awards for both research and teaching, including the Busch Biomedical Award (2006), the Technology Commercialization Award (2006), the Coulter Early Career Award (2006), the Excellence in Teaching Award (2007), and the Cancer Institute of New Jersey New Investigator Award (2007).
In addition, his research work has received grant funding from the National Cancer Institute, the New Jersey Commission on Cancer Research, and the US Department of Defense.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.