Classification Performance of Support Vector Machines on Genomic Data utilizing Feature Space Selection Techniques

by Jason P. Sharma

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, January 2002.

© Massachusetts Institute of Technology 2002. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, January 18, 2002
Certified by: Bruce Tidor, Associate Professor, EECS and BEH, Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students

Classification Performance of Support Vector Machines on Genomic Data utilizing Feature Space Selection Techniques

by Jason P. Sharma

Submitted to the Department of Electrical Engineering and Computer Science on January 18, 2002, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science.

Abstract

Oligonucleotide array technology has recently enabled biologists to study the cell from a systems level by providing expression levels of thousands of genes simultaneously. Various computational techniques, such as Support Vector Machines (SVMs), have been applied to these multivariate data sets to study diseases that operate at the transcriptional level. One such disease is cancer. While SVMs have been able to provide decision functions that successfully classify tissue samples as cancerous or normal based on the data provided by the array technology, it is known that by reducing the size of the feature space a more generalizable decision function can be obtained. We present several feature space selection methods, and show that the classification performance of SVMs can be dramatically improved when using the appropriate feature space selection method. This work proposes that such decision functions can then be used as diagnostic tools for cancer. We also propose several genes that appear to be critical to the differentiation of cancerous and normal tissues, based on the computational methodology presented.

Thesis Supervisor: Bruce Tidor
Title: Associate Professor, EECS and BEH

Acknowledgments

I would like to first thank my thesis advisor, Prof. Bruce Tidor, for providing me the opportunity to get involved in the field of bioinformatics with this thesis. In the course of working on this project, I learned much more than bioinformatics. The list is too extensive to describe here, but it starts with a true appreciation and understanding for the research process.

Many thanks go to all the members of the Tidor lab, who quickly became more than just colleagues (and lunch companions). Relating to my project, I would like to thank Phillip Kim for his help and advice throughout my time in the lab. I would also like to thank Bambang Adiwijaya for providing me access to his pm-mm pattern technique.

I am also indebted to my friend and roommate, Ashwinder Ahluwalia. His encouragement and support were critical to the completion of this project.

Final thanks go to Steve Gunn for providing a freely available MATLAB implementation of the soft margin SVM used in this project.

Contents

1 Introduction
  1.1 Motivation
    1.1.1 Cancer Mortality
    1.1.2 Usage and Availability of Genechip Data
  1.2 Related Array Analysis
    1.2.1 Unsupervised Learning Techniques
    1.2.2 Supervised Learning Techniques
  1.3 Approach to Identifying Critical Genes

2 Background
  2.1 Basic Cancer Biology
  2.2 Oligonucleotide Arrays
  2.3 Data Set Preparation
  2.4 Data Preprocessing Techniques

3 Visualization
  3.1 Hierarchical Clustering
  3.2 Multidimensional Scaling (MDS)
  3.3 Locally Linear Embedding (LLE)
  3.4 Information Content of the Data

4 Feature Space Reduction
  4.1 Variable Selection Techniques
    4.1.1 Mean-difference Statistical Method
    4.1.2 Golub Selection Method
    4.1.3 SNP Detection Method
  4.2 Dimension Reduction Techniques
    4.2.1 Principal Components Analysis (PCA)
    4.2.2 Non-negative Matrix Factorization (NMF)
    4.2.3 Functional Classification

5 Supervised Learning with Support Vector Machines
  5.1 Support Vector Machines
    5.1.1 Motivation for Use
    5.1.2 General Mechanism of SVMs
    5.1.3 Kernel Functions
    5.1.4 Soft Margin SVMs
  5.2 Previous Work Using SVMs and Genechips
  5.3 Support Vector Classification (SVC) Results
    5.3.1 SVC of Functional Data Sets
    5.3.2 SVC of NMF & PCA Data Sets
    5.3.3 SVC of Mean Difference Data Sets
    5.3.4 SVC of Golub Data Sets
  5.4 Important Genes Found

6 Conclusions

List of Figures

2-1 Histogram of unnormalized gene expression for one array
2-2 Histogram of log normalized gene expression for same array
3-1 Experiments 1, 2, 8, and 10 are normal breast tissues. Hierarchical clustering performed using euclidean distance and most similar pair replacement policy
3-2 Well clustered data with large separation using log normalized breast tissue data set
3-3 Poorly separated data, but the two sets of samples are reasonably clustered
3-4 LLE trained to recognize S-curve [20]
3-5 Branching of data in LLE projection
3-6 MDS projections maintain relative distance information
3-7 Possible misclassified sample
4-1 Principal components of an ellipse are the primary and secondary axes
4-2 Functional Class Histogram
5-1 A maximal margin hyperplane that correctly classifies all points on either half-space [19]

List of Tables

2.1 Tissue Data Sets
3.1 Linear Separability of the Tissue Data
4.1 Genes Selected by Mean Difference Technique
5.1 Jack-knifing Error Percentages
5.2 15 genes with most "votes"

Chapter 1

Introduction

1.1 Motivation

Oligonucleotide array technology has recently been developed, which allows the measurement of expression levels of thousands of genes simultaneously. This technology provides biologists with an opportunity to view the cell at a systems level instead of one subsystem at a time in terms of gene expression. Clearly, this technology is suited for study of diseases that have responses at the transcriptional level. One such disease is cancer. Many studies have been done using oligonucleotide arrays to gather information about how cancer expresses itself in the genome. In the process of using these arrays, massive amounts of data have been produced. This data has, however, proven difficult for individuals to just "look" at and extract significant new insights from, due to its noisy and high dimensional nature. Computational techniques are much better suited, and hence are required to make proper use of the data obtained using oligonucleotide arrays.

1.1.1 Cancer Mortality

Cancer is the second leading cause of death. Nearly 553,000 people are expected to die of cancer in the year 2001 [21]. Efforts on various fronts utilizing array technology are currently underway to increase our understanding of cancer's mechanism.

One effort is focused on identifying all genes that are related to the manifestation of cancer. While a few critical genes have been identified and researched (such as p53, Ras, or c-Myc), many studies indicate that there are many more genes that are related to the varied expression of cancer. Cancer has been shown to be a highly differentiated disease, occurring in different organs, varying in the rate of metastasis, as well as varying in the histology of damaged tissue. It has been hypothesized that different sets of "misbehaving" genes may be causing these variances, but these genes have not yet been identified for many of the different forms of cancer. Identifying these various genes could aid us in understanding the mechanism of cancer, and potentially lead to new treatments. Monitoring the expression of several thousands of genes using oligonucleotide arrays has increased the probability of finding cancer causing genes.

Another effort has focused on utilizing oligonucleotide arrays as diagnosis tools for cancer. Given our current methods of treatment, it has been found that the most effective way of treating cancer is to diagnose its presence as early as possible and as specifically as possible. Specific treatments have been shown to perform significantly better on some patients compared to others, and hence the better we are able to identify the specific type of cancer a patient has, the better we are able to treat that patient.
A diagnosis tool that can accurately and quickly determine the existence and type of cancer would clearly be beneficial. Again, because cancer has been shown to be a genetic disease, many efforts have been made to provide a cancer diagnosis tool that measures the expression of genes in cells. Since such technologies have been recently developed, most of the focus has been on finding a set of genes which accurately determine the existence of cancer.

The goal of this thesis is to aid in both of the efforts mentioned above, namely to discover "new" genes that are causes of various types of cancers, and to use these genes to provide an accurate cancer diagnosis tool.

1.1.2 Usage and Availability of Genechip Data

The usage of oligonucleotide arrays is not limited to the study of cancer. Oligonucleotide arrays have been used to study cell cycle progression in yeast, specific cellular pathways such as glycolysis, etc. In each case, however, the amount of data is large, highly dimensional, and typically noisy. Computational techniques that can extract significant information are extremely important in realizing the potential of the array technology. Therefore, an additional goal of this thesis was to contribute a new process of array analysis, using a combination of computational learning techniques to extract significant information present in the oligonucleotide array data.

1.2 Related Array Analysis

Because cancer is primarily a genetic disease (see Ch 2.1), there have been numerous studies of various types of cancer using oligonucleotide arrays to identify cancer related genes. The typical study focuses on one organ-specific cancer, such as prostate cancer. First, tissue samples from cancerous prostates (those with tumors) and healthy prostates are obtained. The mRNA expression levels of the cells in each of the samples are then measured, using one oligonucleotide array per sample. The resulting data set can be compactly described as an expression matrix V, where each column represents the expression vector of an experiment, and each row represents the expression vector of a gene. Each element V_ij is the relative mRNA expression level of the i-th gene in the j-th sample.

    V = [ V_11  V_12  ...  V_1m
          V_21  V_22  ...  V_2m
           ...   ...        ...
          V_n1  V_n2  ...  V_nm ]

where m is the number of experiments and n is the number of genes.

The study then introduces a new computational technique to analyze the data gathered. The ability of the technique to extract information is first validated, by analyzing a known data set and/or by comparing the results obtained through other traditional techniques. Typically, the new technique is then used to make new assertions, which may be cause for future study.

All of the techniques can be classified in two categories: unsupervised and supervised learning techniques. A description of the techniques applied to array data follows. Unsupervised learning techniques focus on finding structure in the data, while supervised learning techniques focus on making classifications. The main difference between unsupervised and supervised learning techniques is that supervised learning techniques use user-supplied information about what class an object belongs to, while unsupervised techniques do not.

1.2.1 Unsupervised Learning Techniques

Unsupervised learning techniques focus on discovering structure in the data. The word "discover" is used because unlike supervised learning techniques, unsupervised techniques have no prior knowledge (nor any notion) about what class a sample belongs to.
Most of the techniques find structure in the data based on similarity measures.

Clustering

One major class of unsupervised learning techniques is known as clustering. Clustering is equivalent to grouping similar objects based on some similarity measure. A similarity function f(y, z) is typically called repeatedly in order to achieve the final grouping. In the context of arrays, clustering has been used to group similar experiments and to group similar genes. Researchers have been able to observe the expression of cellular processes across different experiments by observing the expression of sets of genes obtained from clustering, and furthermore associate functions with genes that were previously unannotated. By clustering experiments, researchers have, for example, been able to identify new subtypes of cancers. Three algorithms that have been utilized to do clustering on genomic data are hierarchical clustering, k-means clustering, and Self Organizing Maps (SOMs).

Hierarchical clustering was first applied to oligonucleotide array data by Eisen et al. [6]. The general algorithm uses a similarity metric to determine the highest correlated pair of objects among the set of objects to be clustered, C. This pair is then clustered together, removed from the set C, and the average of the pair is added to C. The next iteration proceeds similarly, with the highest correlated pair in C being removed and being replaced by a single object. This process continues until C has only one element. The clustering of pairs of objects can be likened to leaves of a tree being joined at a branch, and hence a dendrogram can be used to represent the correlations between each object. Eisen et al. used several similarity measures and tried different remove/replace policies for the algorithm. They were able to demonstrate the ability of this technique by successfully grouping genes of known similar function for yeast. Specifically, by clustering on 2,467 genes using a set of 12 time-course experiments that measured genomic activity throughout the yeast cell cycle, several clusters of genes corresponding to a common cellular function were found. One cluster was comprised of genes encoding ribosomal proteins. Other clusters found contained genes involved in cholesterol biosynthesis, signaling and angiogenesis, and tissue remodeling and wound healing.

K-means clustering is a technique that requires the user to input the number of expected clusters and, often, the coordinates of the centroids in n dimensional space (where n corresponds to the number of variables used to represent each object). Once this is done, all points are assigned to their closest centroid (based on some similarity metric, not necessarily euclidean distance), and the centroid is recomputed by averaging the positions of all the points assigned to it. This process is repeated until each centroid does not move after recomputing its position. This technique effectively groups objects together in the desired number of bins, as sketched below.
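To make the k-means procedure concrete, a minimal sketch in Python/NumPy follows. It is an illustration only, not the code used in the studies cited in this chapter, and the matrix sizes and the choice of k = 2 are hypothetical.

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        """Group the rows of X into k clusters by repeatedly assigning points
        to the nearest centroid and re-averaging each centroid."""
        rng = np.random.default_rng(seed)
        # initialize centroids by picking k distinct data points
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # assign every point to its closest centroid (euclidean distance)
            d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # recompute each centroid as the mean of the points assigned to it
            new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                      else centroids[j] for j in range(k)])
            if np.allclose(new_centroids, centroids):   # centroids stopped moving
                break
            centroids = new_centroids
        return labels, centroids

    # hypothetical usage: 25 samples, 6000 genes, grouped into 2 bins
    labels, centers = kmeans(np.random.rand(25, 6000), k=2)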
Self Organizing Maps (SOMs) are another clustering technique quite similar to k-means [23]. This technique requires one input: a grid geometry specifying the number of clusters expected in the data. For example, a 3x2 grid means that a total of six clusters are expected in the data. This grid is then used to calculate the positions of six centroids in n dimensional space. The SOM algorithm then randomly selects a data point and moves the closest centroid towards that point. As the distance between each successive (randomly selected) data point and the closest centroid decreases, the distance the centroid moves decreases. After all points have been used to move the centroids, the algorithm is complete. This technique was used to cluster genes and identify groups of genes that behaved similarly for yeast cell cycle array data. SOMs were able to find clusters of yeast genes that were active in the respective G1, S, G2, and M phases of the cell cycle when analyzing 828 genes over 16 experiments using a 6x5 grid. While the obtained clusters were desirable, the selection of the grid was somewhat arbitrary and the algorithm was run until results matching previous knowledge were obtained.

As noted above, clustering can be performed by either clustering on genes or clustering on experiments. When both clusterings are performed and the expression matrix is rearranged accordingly, this is known as two-way clustering. Such an analysis was done on cancerous and non-cancerous colon tissue samples [1]. The study conducted by Alon et al. was the first to cluster on genes (based on expression across experiments) and cluster on experiments (based on expression across genes). Given a data set of 40 cancerous and 22 normal colon tissue samples on arrays with 6500 features, the clustering algorithm used was effective at grouping the cancer and normal tissues respectively. Alon et al. also attempted to use a method similar to that described in Ch 4.1.1 to reduce the data set and observe the resulting classification performance of the clustering algorithm.

Dimension Reduction

Dimension reduction is a type of technique that reduces the representation of data by finding a few components that are repeatedly expressed in the data. Linear dimension reduction techniques use linear combinations of these components to reconstruct the original data. Besides finding a space-saving minimal representation, such techniques are useful because by finding the critical components, they essentially group together dimensions that behave consistently throughout the data. Given V as the original data, dimension reduction techniques perform the following operation: V ≈ W * H. The critical components are the columns of the matrix W. Given an expression matrix V, each column of W can be referred to as a "basis array". Each basis array contains groupings of genes that behave consistently across each of the arrays. Just as before, since the genes are all commonly expressed in the various arrays, it is possible that they are functionally related. If a basis array is relatively sparse, it can then be interpreted to represent a single cellular process. Since an experiment is a linear combination of such basis arrays, it can then be observed which cellular processes are active in an experiment.

There are several types of linear dimension reduction techniques, all using the notion of finding a set of basis vectors that can be used to reconstruct the data. The techniques differ by placing different constraints to find different sets of basis vectors. For more details, please see Ch 4.2. The most common technique used is Principal Components Analysis (PCA). PCA decomposes a matrix into the critical components as described above (known as eigenvectors), but provides an ordering of these components. The first vector (principal component) is in the direction of the most variability in the data, and each successive component accounts for as much of the remaining variability in the data as possible.
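As an illustration of the V ≈ W * H idea, the sketch below computes a PCA-style decomposition of an expression matrix with NumPy's singular value decomposition. The matrix shape and the number of retained components are assumptions made for the example, not values taken from the studies discussed here.

    import numpy as np

    def pca_decompose(V, k):
        """Approximate a genes-by-experiments matrix V as W @ H, where the k
        columns of W ("basis arrays") are the top principal directions and H
        holds the coefficients that mix them into each experiment."""
        V_centered = V - V.mean(axis=1, keepdims=True)   # center each gene
        U, s, Vt = np.linalg.svd(V_centered, full_matrices=False)
        W = U[:, :k]                      # basis arrays, ordered by variance explained
        H = np.diag(s[:k]) @ Vt[:k, :]    # per-experiment combination coefficients
        explained = s[:k] ** 2 / (s ** 2).sum()
        return W, H, explained

    # hypothetical 6000-gene by 14-experiment matrix, reduced to 2 components
    V = np.random.rand(6000, 14)
    W, H, frac = pca_decompose(V, k=2)
    print(W.shape, H.shape, frac)         # (6000, 2) (2, 14) variance fractions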
Alter et al. applied PCA to the expression matrix, and were able to find a set of principal components where each component represented a group of genes that were active during different stages of the yeast cell cycle [2]. Fourteen time-course experiments were conducted over the length of the yeast cell cycle using arrays with 6,108 features. Further study showed that two principal components were able to effectively represent most of the cell cycle expression oscillations.

Another method known as Non-negative Matrix Factorization (NMF) has been used to decompose oligonucleotide array data into similarly interesting components that represent cellular processes (Kim & Tidor, to be published). NMF places the constraint that each basis vector has only positive values. PCA, in contrast, allows for negative numbers in the eigenvectors. NMF leads to a decomposition that is well suited for oligonucleotide array analysis. The interaction between the various basis vectors is clear: it is always additive. With PCA, some genes may be down-regulated in a certain eigenvector while they may be up-regulated in another. Combining two eigenvectors can cancel out the expression of certain cellular processes, while combining two basis vectors from NMF always results in the expression of two cellular processes. In this sense, NMF seems to do a slightly better job selecting basis vectors that represent cellular processes.

Other methods of dimensionality reduction exist as well, such as Multidimensional Scaling (MDS) and Locally Linear Embedding (LLE, a type of non-linear dimensionality reduction) [10, 20]. Both of these techniques are more commonly used as visualization techniques. See Ch 3.2 and 3.3 for more details.

1.2.2 Supervised Learning Techniques

Supervised learning techniques take advantage of user-provided information about what class a set of objects belongs to in order to learn which features are critical to the differentiation between classes, and often to come up with a decision function that distinguishes between classes. The user-provided information is known as the training set. A "good" (generalizable) decision function is then able to accurately classify samples not in the training set.

One technique that focuses on doing feature selection is the class predictor methodology developed by Golub et al. [9], which finds arbitrarily sized sets of genes that could be used for classification. The technique focuses on finding genes that have an idealized expression pattern across all arrays, specifically genes that have high expression among the cancerous arrays and low expression among the non-cancerous arrays. These genes are then assigned a predictive power, and collectively are used to decide whether an unknown sample is either cancerous or non-cancerous (refer to Ch 4.1.2 for more details of the algorithm). Golub et al. were able to use the above technique to distinguish between two types of leukemias, discover a new subtype of leukemia, and determine the class of new leukemia cases.
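The class predictor methodology is described in detail in Ch 4.1.2; the sketch below is only meant to convey the flavor of per-gene scoring. It uses a signal-to-noise style statistic, (mean_cancer - mean_normal) / (sd_cancer + sd_normal), which is the form commonly associated with that work; the exact formula, the voting step (omitted here), and the data shapes should all be treated as assumptions.

    import numpy as np

    def class_predictor_scores(X_cancer, X_normal):
        """Score every gene by how cleanly it separates the two classes:
        a large |score| means high expression in one class and low in the other.
        X_cancer, X_normal: samples x genes matrices for the two classes."""
        mu_c, mu_n = X_cancer.mean(axis=0), X_normal.mean(axis=0)
        sd_c, sd_n = X_cancer.std(axis=0), X_normal.std(axis=0)
        return (mu_c - mu_n) / (sd_c + sd_n + 1e-12)   # signal-to-noise style score

    # hypothetical data: 27 cancer and 4 normal samples over 6000 genes
    scores = class_predictor_scores(np.random.rand(27, 6000), np.random.rand(4, 6000))
    top_genes = np.argsort(np.abs(scores))[::-1][:50]   # 50 most informative genes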
One technique that seeks to solve the binary classification problem (classification into two groups) is the Support Vector Machine (SVM). The binary classification problem can be described as finding a decision surface (or separating hyperplane) in n dimensions that allows all samples of the same class to lie in the same half-space. The Golub technique is an example of a linear supervised learning method, because the decision surface described by the decision function is "flat". In contrast, non-linear techniques create decision functions with higher-order terms that correspond to hyperplanes which have contours. SVMs have the ability to find such non-linear decision functions.

Several researchers have utilized SVMs to obtain good classification accuracy using oligonucleotide arrays to assign samples as cancerous or non-cancerous [4, 8, 17]. This computational technique classifies data into two groups by finding the "maximal margin" hyperplane. In a high dimensional space with a few points, there are many hyperplanes that can be used to distinguish two classes. SVMs choose the "optimal" hyperplane by selecting the one that provides the largest separation between the two groups. In the case where a linear hyperplane can not be found to separate the two groups, SVMs can be used to find a non-linear hyperplane by projecting the data into a higher dimensional space, finding a linear hyperplane in this space, and then projecting the linear hyperplane back into the original space, which causes it to be non-linear. See Ch 5 for more details.

1.3 Approach to Identifying Critical Genes

In order to extract information from the data produced by using oligonucleotide arrays, computational techniques which are able to handle large, noisy, and high dimensional data are necessary. Since our goal is two-tiered:

* to discover new genes critical to cancer
* to create a diagnosis tool based on measurements of expression of a small set of genes

our proposed solution is two-tiered as well:

* use various feature selection techniques to discover these cancer causing genes
* use a supervised learning technique to create a decision function that can serve as a diagnosis tool, as well as to validate and further highlight cancer-related genes

One of several feature selection methods will be used to reduce the dimensionality of a cancer data set. The dimensionally reduced data set will then be used as a training set for a support vector machine, which will build a decision function that will be able to classify tissue samples as "cancerous" or "normal". The success of the feature selection method can be tested by observing the jack-knifing classification performance of the resulting decision function. Also, the decision function can be further analyzed to determine which variables are assigned the heaviest "weight" in the decision function, and hence are the most critical differentiators between cancerous and non-cancerous samples.

[Figure: the analysis pipeline: Pre-Processing, Visualization, Feature Space Reduction, Support Vector Classification]

The feature selection method will not only find critical cancer-causing genes, but will reduce the dimensionality of the data set. This is critical because it has been shown that when the number of features is much greater than the number of samples, the generalization performance of the resulting decision function suffers. The feature selection method will be used to improve the generalizability of the decision function, while the generalization performance of the SVM can also be used to rank the effectiveness of each feature selection method. Besides providing a potential diagnosis tool, analysis of a highly generalizable decision function found by SVMs may also highlight critical genes, which may warrant future study.

SVMs were the chosen supervised learning method due to their ability to deterministically select the optimal decision function, based on maximizing the margin between the two classes. The maximal margin criterion has been proven to have better (or equivalent) generalization performance when compared to other supervised learning methods, such as neural nets, while the decision function's form (linear, quadratic, etc.) can be easily and arbitrarily selected when using SVMs. This increases our ability to effectively analyze and use the resulting decision function as a second method of feature selection.
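The jack-knifing protocol mentioned above amounts to a leave-one-out loop around feature selection and classification. In the sketch below, reduce_features, train_svm, and classify are hypothetical placeholder functions standing in for whichever feature selection method and SVM implementation are being evaluated; only the structure of the loop is the point.

    import numpy as np

    def jackknife_error(X, y, reduce_features, train_svm, classify):
        """Leave-one-out estimate of classification error.
        X: samples x genes, y: +1 (cancer) / -1 (normal).
        reduce_features, train_svm, classify are caller-supplied functions."""
        errors = 0
        n = len(y)
        for i in range(n):
            train = np.arange(n) != i                 # hold out sample i
            X_red, selector = reduce_features(X[train], y[train])
            model = train_svm(X_red, y[train])        # fit the decision function
            prediction = classify(model, selector(X[i]))
            errors += int(prediction != y[i])
        return errors / n                             # fraction misclassified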
Chapter 2

Background

The first section of this chapter will describe the basic mechanisms of cancer, as well as demonstrate the applicability of oligonucleotide arrays to the study of cancer. The second section discusses how the array technology works, its capabilities, and its limitations. Our specific datasets will be described, and then the various techniques commonly used to prepare array data will also be presented.

2.1 Basic Cancer Biology

Cancer is Primarily a Genetic Disease

Cancer is caused by the occurrence of a mutation that causes the controls or regulatory elements of a cell to misbehave such that cells grow and divide in an unregulated fashion, without regard to the body's need for more cells of that type. The macroscopic expression of cancer is the tumor. There are two general categories of tumors: benign and malignant. Benign tumors are those that are localized and of small size. Malignant tumors, in contrast, typically spread to other tissues. The spreading of tumors is called metastasis.

There are two classes of genes that have been identified to cause cancer: oncogenes and tumor-suppressor genes. An oncogene is defined as any gene that encodes a protein capable of transforming cells in culture or inducing cancer in animals. A tumor-suppressor gene is defined as any gene that typically prevents cancer, but when mutated is unable to do so because the encoded protein has lost its functionality. Development of cancer has been shown to require several mutations, and hence older individuals are more likely to have cancer because of the increased time it typically takes to accumulate multiple mutations.

Oncogenes

There are three typical ways in which oncogenes arise and operate. Such mutations result in a gain of function [15]:

* Point mutations in an oncogene that result in a constitutively acting (constantly "on" or produced) protein product
* Localized gene amplification of a DNA segment that includes an oncogene, leading to over-expression of the encoded protein
* Chromosomal translocation that brings a growth-regulatory gene under the control of a different promoter and that causes inappropriate expression of the gene

Tumor-Suppressor Genes

There are five broad classes of proteins that are encoded by tumor-suppressor genes. In these cases, mutations of such genes and the resulting proteins cause a loss of function [15]:

* Intracellular proteins that regulate or inhibit progression through a specific stage of the cell cycle
* Receptors for secreted hormones that function to inhibit cell proliferation
* Checkpoint-control proteins that arrest the cell cycle if DNA is damaged or chromosomes are abnormal
* Proteins that promote apoptosis (programmed cell death)
* Enzymes that participate in DNA repair

2.2 Oligonucleotide Arrays

The oligonucleotide array is a technology for measuring relative mRNA expression levels of several thousand genes simultaneously [14].
The largest benefit of this is that all the gene measurements can be done under the same experimental conditions, and therefore can be used to look at how the whole genome as a system responds to different conditions. The 180 oligonucleotide arrays that compose the data set which I am analyzing were commercially produced by Affymetrix, Inc.

Technology

GeneChips, Affymetrix's proprietary name for oligonucleotide arrays, can be used to determine relative levels of mRNA concentrations in a sample by hybridizing complete cellular mRNA populations to the oligonucleotide array. The general strategy of oligonucleotide arrays is that for each gene whose expression is to be measured (quantified by its mRNA concentration in the cell), there are small segments of nucleotides anchored to a piece of glass using photolithography techniques (borrowed from the semiconductor industry, hence the name GeneChips). These small segments of nucleotides are supposed to be complementary to parts of the gene's coding region, and are known as oligonucleotide probes. In a certain area on the piece of glass (specifically, a small square known as a feature), there exist hundreds of thousands of copies of the exact same oligonucleotide probe. So, when a cell's mRNA is washed over the array using a special procedure with the correct conditions, the segments of mRNA that are complementary to those on the array will hybridize, or bind, to one of the thousands of probes. Since the probes on the array are designed to be complementary to the mRNA sequence of a specific gene, the overall amount of target mRNA (from the cell) left on the array gives an indication of the mRNA cellular concentration of such a gene.

In order to actually detect the amount of target mRNA hybridized to a feature on the array, the target mRNA is prepared with fluorescent material. Since all the probes for a specific gene are localized to a feature, the amount of fluorescence emitted from the feature can be measured and then interpreted as the level of hybridized target mRNA.

Genechips

Affymetrix has developed their oligonucleotide arrays with several safeguards to improve accuracy. Because there may be other mRNA segments that have complementary nucleotide sequences to a portion of a probe, Affymetrix has included a "mismatch" probe which has a single nucleotide inserted in the middle of the original nucleotide sequence. The target mRNA (for the desired gene) should not bind to this mismatch probe. Hence, the amount of mRNA binding to the mismatch probe is supposed to represent some of the noise that also binds to portions of the match probe. By subtracting the amount of mRNA hybridized to the mismatch probe from the amount of mRNA hybridized to the match probe, a more accurate description of the amount of the target gene's mRNA should be obtained. Since the oligonucleotide probes are only approximately 25 nucleotides in length and the target gene's mRNA is typically 1000 nucleotides in length, Affymetrix selects multiple probes that hybridize the best to the target mRNA while hybridizing poorly to other mRNAs in the cell. The nucleotide sequences for the probes are not released by Affymetrix. Overall, to measure the relative mRNA concentration of a single gene, approximately 20 probe pairs (20 match probes and 20 mismatch probes) exist on the chip.

A measure provided by Affymetrix that uses the 20 probe pairs to compute an overall expression value for the corresponding gene is known as the "average difference". Most techniques utilize the average difference value instead of analyzing the individual intensity values observed at each of the 40 probes. The measure is calculated as follows:

    average difference = (1/n) Σ_{i=1}^{n} (pm_i - mm_i)

where n is the number of probe pairs in the probe set, pm is the perfect match probe set expression vector, and mm is the mismatch probe set expression vector.
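Read literally, the formula corresponds to the following small sketch; the probe-level intensity arrays are hypothetical, since in practice Affymetrix's own software reports the average difference value.

    import numpy as np

    def average_difference(pm, mm):
        """Average difference for one probe set: the mean of the per-probe-pair
        differences between perfect match (pm) and mismatch (mm) intensities."""
        pm, mm = np.asarray(pm, float), np.asarray(mm, float)
        return np.mean(pm - mm)        # can be negative if mm exceeds pm on average

    # hypothetical probe set with 20 probe pairs
    pm = np.random.rand(20) * 1000
    mm = np.random.rand(20) * 800
    print(average_difference(pm, mm))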
Capabilities

A major benefit of oligonucleotide arrays is that they provide a snapshot view of nearly an entire system by maintaining the same set of experimental conditions for each measurement of each gene. This lends itself to analysis of how different components interact with each other. Another important benefit is that the use of the arrays produces a large amount of data quickly. Previously, experiments to measure relative mRNA expression levels would go a few at a time. Using the arrays, thousands of measurements are quickly produced. From an experimental point of view, another benefit of using oligonucleotide arrays is that the method requires fewer tedious steps such as preparing clones, PCR products, or cDNAs.

Limitations

There are several problems with the usage of oligonucleotide arrays. Some issues related to the oligonucleotide technology are:

* the manner in which expression levels are measured (via fluorescence) lacks precision, because genes with longer mRNA will have more fluorescent dyes attached than genes with shorter mRNA, resulting in higher fluorescence for genes with longer mRNA
* absolute expression levels cannot be interpreted from the data
* there are no standards for gathering samples for gene chips, and hence the conditions in which the experiments are done are not controlled
* probes used may not accurately represent a gene, or may pick up too much noise (hybridization to mRNA of genes other than the target gene), resulting in negative average difference values

Another issue related to using oligonucleotide arrays is that some important or critical genes may not be on the standard genechips made by Affymetrix.
A more general issue is that while mRNA concentrations are somewhat correlated to the associated protein concentrations, it is not understood to what degree they are correlated. Also, some proteins are regulated at the transcription level, while some are regulated at the translation level. Ideally, there would be arrays that measured the concentrations of all the proteins in the organism to be studied. Analyzing relative amounts of mRNA (at the transcription level) is basically one step removed from such desired analysis.

2.3 Data Set Preparation

Two sets of oligonucleotide array data have been provided for analysis by John B. Welsh from the Novartis Foundation. Both sets use Affymetrix genechips, but each set has used a different version of the cancer genechip, which contains genes that are (in the judgment of Affymetrix) relevant to cancer and its cellular mechanisms.

The first data set uses the HuGeneFL genechip from Affymetrix, which contains probes with a length of 25 nucleotides. These chips contain over 6000 probe sets, with 20 (perfect match, mismatch) probe pairs per probe set. Each probe set corresponds to a gene. The data set is composed of 49 such arrays, with 27 arrays used to measure the expression patterns of cancerous ovarian tissue, and 4 arrays used to measure the expression patterns of normal ovarian tissue. The rest of the arrays were used to test the expression patterns of cancer cell lines, as well as to perform duplicate experiments. The method of preparation of these experiments was published by Welsh et al. [25]. (While the HuGeneFL genechips were used for initial analysis, all the analysis included in this thesis is performed on the second data set.)

The second data set uses the U95av1 and U95av2 genechips from Affymetrix. The two versions each contain probe sets for over 12,000 genes. Each probe set contains 16 probe pairs of probes of length 25. The two chips differ only by 25 probe sets, and hence by using only the common sets, the experiments can be treated as being from one type of genechip. This data set is composed of 180 genechip experiments from 10 different cancer tissues. There are approximately 35 normal tissue samples, and 135 cancerous tissue samples. The various tissues include breast, colon, gastric, kidney, liver, lung, ovary, pancreas, and prostate.

Table 2.1. Tissue Data Sets

Tissue Type   # Cancerous Samples   # Normal Samples
breast                 21                   4
colon                  21                   4
gastric                11                   2
kidney                 11                   3
liver                  10                   3
lung                   28                   4
ovary                  14                   2
pancreas                6                   4
prostate               25                   9

It is important to note that the number of cancerous samples is, in almost all cases, much larger than the number of normal samples. Ideally, these numbers would be equal to each other. Also, the tissue data sets themselves are not particularly large. They might not be able to provide a generalizable diagnostic tool because of the limited size of the training set. One last issue is that the "normal" tissue samples are not from individuals who are without cancer. Tissue samples are "normal" if they do not exhibit any metastasis or tumors. Often the normal samples are taken from individuals with a different type of cancer, or from healthy parts of the tissue that contains tumors.

2.4 Data Preprocessing Techniques

Various preprocessing techniques can affect the ability of an analysis technique significantly. Using genechips, there are three common ways of preprocessing the data that have been used in the analysis of the above data sets. Additionally, there are some preprocessing techniques which were used to remove poor quality data.

Removal of "Negative" Genes

The first major concern regarding the quality of the data sets above is that there are large numbers of genes with negative average difference values. Clearly, negative expression of a gene has no meaning. Negative average difference values mean that among all probe pairs in a probe set, there is more binding to the mismatch probe than there is to the match probe on average. Ideally, the probes would have been selected such that this would not happen. The exact biological explanation for such behavior is not known, and hence it is unclear whether to interpret the target gene as present or not, since the same nonspecific mRNAs that hybridized to the mismatch probe could be hybridized to the respective perfect match probe as well, causing the perfect match probe to have a high fluorescent intensity. One approach is to remove any genes which have any negative average difference values from consideration. Hence, when comparing arrays using computational techniques, any gene which has had a negative average difference value in any array will be removed, and hence that dimension or variable will not be used by the technique to differentiate between arrays.

Log Normalization

The log function has the interesting property that given a distribution, it has the ability to make it appear more gaussian. Gaussian distributions have statistical properties such that the statistical significance of a result can be established.
As can be seen from Figures 2-1 and 2-2, log normalizing the data has the effect of compressing the data close to the intervals, while stretching out the data in the middle of the original distribution. One issue is that the log of a negative number is not defined, which is another reason to remove genes with negative average difference values.

[Figure 2-1. Histogram of unnormalized gene expression for one array.]

[Figure 2-2. Histogram of log normalized gene expression for the same array.]

Global Scaling

When making preparations for a genechip experiment, many variables can affect the amount of mRNA which hybridizes to the array. If differences in the preparation exist, they should have a uniform effect on the data such that one experiment's average difference values for all genes would be, for example, consistently higher than those for another experiment's. When doing analysis on the arrays, it is desirable to eliminate these effects so the analysis can be focused on "real" variations in the cell's behavior, not differences in the preparation of the experiment. One common way to do this is to scale each array such that the means of the average difference values are equal to each other, i.e. to some arbitrary number c. This is illustrated below for an array with expression vector x with n average difference values:

    x_i ← x_i · c / ((1/n) Σ_{j=1}^{n} x_j),  for i = 1, ..., n

This effectively removes the effects of variation in the preparation of the experiments.

Mean-Variance Normalization

Another technique used to transform the data and remove the effects of the preparation of the experiment is to mean-variance normalize the data. In order to do this, the mean μ and standard deviation σ for each array x are calculated. Each average difference value x_i is transformed:

    x_i ← (x_i - μ) / σ

The resulting number represents the number of standard deviations the value is from the mean (commonly known as a z-score). An important result of this type of normalization is that it also removes the effects of scaling.
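The preprocessing steps of this section can be summarized in a few NumPy operations. The sketch below assumes the convention used here (genes as rows, arrays as columns); the target mean c and the example matrix are arbitrary choices for illustration.

    import numpy as np

    def remove_negative_genes(V):
        """Drop any gene (row) with a non-positive average difference in any array."""
        return V[(V > 0).all(axis=1)]

    def log_normalize(V):
        """Log-transform positive expression values to make their distribution more gaussian."""
        return np.log(V)

    def global_scale(V, c=100.0):
        """Scale each array (column) so that its mean average difference equals c."""
        return V * (c / V.mean(axis=0, keepdims=True))

    def mean_variance_normalize(V):
        """Convert each array (column) to z-scores: subtract its mean, divide by its std."""
        return (V - V.mean(axis=0, keepdims=True)) / V.std(axis=0, keepdims=True)

    # hypothetical expression matrix: 6000 genes x 25 arrays, with some negative values
    V = np.random.rand(6000, 25) * 2000 - 100
    V = remove_negative_genes(V)
    V_log, V_scaled, V_z = log_normalize(V), global_scale(V), mean_variance_normalize(V)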
Chapter 3

Visualization

To better understand the information content in the oligonucleotide array data sets, the first goal was to attempt to visualize the data. Since many computational techniques focus on grouping the data into subgroups, it would be useful to see 1) how well traditional grouping techniques such as hierarchical/k-means clustering perform and 2) how the data looks projected into dimensions that can be visualized by humans easily (specifically, three dimensions). By visualizing the data, we can understand how much noise (relative to the information) is in the data, possibly find inconsistencies in the data, and predict which types of techniques might be useful in extracting information from the data.

Visualization techniques also enable us to understand how various transformations on the data affect the information content. Specifically, in order to understand how each preprocessing technique (described in Ch 2.4) affects the data, we can apply the preprocessing technique to the data, and observe the effect by comparing the original data visualization versus the modified data visualization.

3.1 Hierarchical Clustering

Technique Description

Hierarchical clustering is an unsupervised learning technique that creates a tree which shows the similarity between objects. Sub-trees of the original tree can be seen as a group of related objects, and hence the tree structure can indicate a grouping/clustering of the objects. Nodes of the tree represent subsets of the input set of objects S. Specifically, the set S is the root of the tree, the leaves are the individual elements of S, and the internal nodes represent the union of their children. Hence, a path down a well constructed tree should visit increasingly tightly-related elements [7]. The general algorithm is included below:

* calculate the similarity matrix A using the set of objects to be clustered
* while the number of columns/rows of A is > 1:
  1. find the largest element A_ij
  2. replace objects i and j by their average in the set of objects to be clustered
  3. recalculate the similarity matrix A with the new set of objects

Variations on the hierarchical clustering algorithm typically stem from which similarity coefficient is used, as well as how the objects are joined. Euclidean distance or the standard correlation coefficient are commonly used as the similarity measure. Objects or subsets of objects can be joined by averaging all their vectors, or by averaging the two vectors from each subset which have the smallest distance between them.

Expected Results on Data Set

Ideally, using the hierarchical clustering technique we would obtain two general clusters. One cluster would contain all the cancer samples, and another cluster would contain all the normal samples. Since the hypothesis is that cancer samples should have roughly similar expression patterns across all the genes, all these samples should be grouped together. Similarly, normal samples should be grouped together. Although using the hierarchical clustering algorithm the sets of cancer and normal samples will be joined at a node of the tree, ideally the root of the tree would have two children, representing the cancer and normal sample clusters. This would show that the difference between cancer and normal samples is on average larger than the variation inside each subset.

[Figure 3-1. Hierarchical clustering of the breast tissue samples, performed using euclidean distance and the most similar pair replacement policy. Experiments 1, 2, 8, and 10 are normal breast tissues.]

Results

Four different variations of hierarchical clustering were performed on 4 expression matrices per tissue. The 4 variations of clustering result from utilizing 2 different similarity metrics (euclidean distance and the Pearson coefficient) with 2 remove/replace policies (removing the most similar pair and replacing it with the average, and removing the most similar pair and replacing it with the sample that was most similar to the remaining samples). For each tissue, the original expression matrix (after removing negative genes, see Ch 2.4) was clustered on, as well as three other pre-processed expression matrices. Log normalization, mean-variance normalization, and global scaling were all applied and the resulting expression matrices were clustered.

For most of the tissues, across all the different variations of clustering and the differently processed data sets, the desired results were obtained. The normal samples were consistently grouped together, forming a branch of the resulting dendrogram. However, the set of cancer samples and set of normal samples were never separate branches that joined at the root. This would have indicated that for each sample in each set, the most dissimilar object within the set was still more similar than a sample from the other set. Apparently, the amount of separation between the sets of cancer and normal samples is not large enough to give this result.
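For concreteness, a minimal version of the agglomerative procedure described in this section is sketched below, using euclidean distance and the replace-with-average policy. It is an illustration rather than the exact code used to produce Figure 3-1.

    import numpy as np

    def hierarchical_cluster(X):
        """Repeatedly merge the closest pair of objects (rows of X), replacing the
        pair by its average, and record the merge order as a simple dendrogram."""
        items = [([i], X[i].astype(float)) for i in range(len(X))]   # (member list, vector)
        merges = []
        while len(items) > 1:
            # find the most similar (closest) pair under euclidean distance
            best = None
            for a in range(len(items)):
                for b in range(a + 1, len(items)):
                    d = np.linalg.norm(items[a][1] - items[b][1])
                    if best is None or d < best[0]:
                        best = (d, a, b)
            d, a, b = best
            members = items[a][0] + items[b][0]
            average = (items[a][1] + items[b][1]) / 2.0              # average replacement policy
            merges.append((items[a][0], items[b][0], d))
            items = [items[k] for k in range(len(items)) if k not in (a, b)] + [(members, average)]
        return merges   # list of (cluster, cluster, distance) in merge order

    # hypothetical: cluster 25 breast arrays described by 6000 log-normalized genes
    merges = hierarchical_cluster(np.random.rand(25, 6000))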
3.2 Multidimensional Scaling (MDS)

Technique Description

Each multidimensional object can be represented by a point in euclidean space. Given several objects, it is often desirable to see how the objects are positioned in this space relative to each other. However, visualization beyond 3 dimensions is typically nonintuitive. Multidimensional scaling (MDS) is a technique that can project multidimensional data into an arbitrary lower dimension. The technique is often used for visualizing the relative positions of multidimensional objects in 2 or 3 dimensions.

The MDS method is based on dissimilarities between objects. The goal of MDS is to create a picture in the desired n dimensional space that accurately reflects the dissimilarities between all the objects being considered. Objects that are similar are shown to be in close proximity, while objects that are dissimilar are shown to be distant. When dissimilarities are based on quantitative measures, such as euclidean distance or the correlation coefficient, the MDS technique used is known as metric MDS. The general algorithm for metric MDS [10] is included below:

* use euclidean distance to create a dissimilarity matrix D
* scale the above matrix: A = -0.5 * D², where the squaring is element-wise
* center A: Â = A - (row means of A) - (column means of A) + (mean of all elements of A)
* obtain the first n eigenvectors and the respective eigenvalues of Â
* scale the eigenvectors such that their lengths are equal to the square root of their respective eigenvalues
* if the eigenvectors are columns, then the points/objects are the rows, and plot

Typically, euclidean distance is used as the dissimilarity measure. MDS selects the n dimensions that best represent the differences between the data. These dimensions are the eigenvectors of the scaled and centered dissimilarity matrix with the n largest eigenvalues.
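A direct transcription of the metric MDS steps above might look like the following sketch; the input matrix and the three-dimensional output are only for illustration.

    import numpy as np

    def metric_mds(X, n_dims=3):
        """Classical (metric) MDS: double-center the squared euclidean distance
        matrix and embed the objects using its top eigenvectors."""
        D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # squared distances
        A = -0.5 * D2
        # double centering: subtract row and column means, add back the grand mean
        B = A - A.mean(axis=1, keepdims=True) - A.mean(axis=0, keepdims=True) + A.mean()
        vals, vecs = np.linalg.eigh(B)
        order = np.argsort(vals)[::-1][:n_dims]                   # largest eigenvalues first
        # scale each eigenvector by the square root of its eigenvalue
        return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

    # hypothetical: project 25 samples x 6000 genes down to 3 coordinates for plotting
    coords = metric_mds(np.random.rand(25, 6000), n_dims=3)       # shape (25, 3)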
Expected Results on Data Set

By looking at the 3-dimensional plot produced by MDS, the most important result to observe is that cancer samples group together and are separable in some form from the normal samples. Given that the two sets do separate from each other in 3 dimensions, the next step would be to see what type of shape is required to separate the data. It would be helpful to note whether the cancer and normal samples were linearly separable in 3 dimensions, or required some type of non-linear (e.g. quadratic) function for separation. The form of this discriminating function would then be useful when choosing the type of kernel to be used in the support vector machine (Ch 5.1).

Results

MDS was used to project all nine tissue expression matrices into 3 dimensions. For each tissue, the original data (without any negative genes, see Ch 2.4) was used to create one plot. To explore the utility of the other pre-processing techniques described in Ch 2.4, three additional plots were made. Log normalization, mean-variance normalization, and global scaling were all applied to the expression matrices (without negative genes), and then the resulting expression matrices were projected into 3 dimensions.

[Figure 3-2. MDS into 3 dimensions of 25 breast samples (cancerous and non-cancerous): well clustered data with large separation using the log normalized breast tissue data set.]

The first observation is that in most of the 3-D plots, the cancer samples and normal samples group well together. In some cases, however, the separation of the two sets can not be clearly seen. The linear separability of such tissues will be determined when SVMs are used to create decision functions with a linear kernel (see Ch 5.3). Please refer to Ch 3.4 for a summary of which tissues are linearly separable, and the apparent margin/separation between the cancer and normal samples. Another general observation is that log transforming the data seems to improve the separation of the data sets the most among the pre-processing techniques, although mean-variance normalization also often increases the margin of separation.

[Figure 3-3. MDS into 3 dimensions of 25 colon samples (cancerous and non-cancerous): poorly separated data, but the two sets of samples are reasonably clustered.]

3.3 Locally Linear Embedding (LLE)

Technique Description

Like MDS and hierarchical clustering, locally linear embedding (LLE) is an unsupervised learning algorithm, i.e. it does not require any training to find structure in the data [20]. Unlike these two techniques, LLE is able to handle complex non-linear relationships among the data in high dimensions and map the data into lower dimensions that preserve the neighborhood relationships among the data. For example, assume you have data that appears to lie on some type of non-linear manifold in a high dimension, like an S-curve manifold. Most linear techniques that project the data to lower dimensions would not recognize that the simplest representation of the S-curve is just a flat manifold (the S-curve stretched out). LLE, however, is able to recognize this, and when projecting the higher dimensional data to a lower dimension, selects the flat plane representation. This is useful because LLE is able to discover the underlying structure of the manifold. A linear technique such as MDS would instead show data points that are close in euclidean distance but distant in terms of location on the manifold as close in the lower dimension projection.

Specifically relating to the cancer and normal oligonucleotide samples, if the cancer samples were arranged in some high dimensional space on a non-linear manifold, LLE would be able to project the data into a lower dimensional space while still preserving the locality relationships from the higher dimensions. This could provide a more accurate description of how the cancer and normal samples cluster together.

[Figure 3-4. LLE trained to recognize an S-curve [20].]

The basic premise of LLE is that a small portion of a non-linear manifold can be assumed to be locally linear. Points in these locally linear patches can be reconstructed by a linear combination of their neighbors, which are assumed to be in the same locally linear patch. The coefficients used in the linear combination can then be used to reconstruct the points in the lower dimensions.
The algorithm is included below:

* for each point X_i, select its K nearest neighbors
* minimize the error function E(W) = Σ_i |X_i - Σ_j W_ij X_j|², where the weights W_ij represent the contribution of the j-th data point to the i-th reconstruction, with the following two constraints:
  1. W_ij is zero for all X_j that are not neighbors of X_i
  2. Σ_j W_ij = 1
* map each X_i from dimension D to Y_i of a much smaller dimension d, using the W_ij's obtained from above, by minimizing the embedding cost function Φ(Y) = Σ_i |Y_i - Σ_j W_ij Y_j|²

The single input to the LLE algorithm is K, the number of neighbors used to reconstruct a point. It is important to note that in order for this method to work, the manifold needs to be highly sampled so that a point's neighbors can be used to reconstruct the point through linear combination.

Expected Results on Data Set

If the normal and cancer samples were arranged on some type of high dimensional non-linear manifold, it would be expected that the visualization produced by LLE would show a more accurate clustering of the samples into the separate sets. However, since there are relatively few data points, the visualization might not accurately represent the data.

Results

LLE was performed in a similar fashion as MDS on the nine tissues, with four plots produced for each tissue corresponding to the original expression matrix, log normalized matrix, mean-variance normalized matrix, and globally scaled matrix (where all matrices have all negative genes removed). LLE in general produces 3-d projections quite similar to those produced by MDS. However, one notable difference is the tendency of LLE to pull samples together. This effect is also more noticeable when the number of samples is large. This is clearly an artifact of the neighbor-based nature of the algorithm.

The LLE plots reinforce the statements regarding the abilities of the pre-processing techniques to increase the separation between the cancer and normal samples. While log normalizing the data produces good separation, mean-variance normalization seems to work slightly better. Nonetheless, both log normalization and mean-variance normalization seem to be valuable pre-processing techniques according to both projection techniques.

[Figure 3-5. LLE (with 5 nearest neighbors) into 3-D of 32 lung samples: branching of data in the LLE projection.]

[Figure 3-6. MDS into 3 dimensions of 32 lung samples: MDS projections maintain relative distance information.]
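To make the LLE recipe above concrete, here is a compact sketch of the standard algorithm: solve for reconstruction weights from each point's local Gram matrix, then take the bottom eigenvectors of (I - W)^T (I - W). The regularization constant and input sizes are assumptions; an off-the-shelf implementation such as scikit-learn's LocallyLinearEmbedding could be used instead.

    import numpy as np

    def lle(X, k=5, n_dims=3, reg=1e-3):
        """Locally Linear Embedding: express each point as a weighted combination
        of its k nearest neighbors, then find low-dimensional coordinates that
        preserve those same weights."""
        n = len(X)
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        np.fill_diagonal(d2, np.inf)
        neighbors = np.argsort(d2, axis=1)[:, :k]
        W = np.zeros((n, n))
        for i in range(n):
            Z = X[neighbors[i]] - X[i]                      # neighbors shifted to the origin
            C = Z @ Z.T                                     # local Gram matrix (k x k)
            C += reg * np.trace(C) * np.eye(k)              # regularize (k may exceed intrinsic dim)
            w = np.linalg.solve(C, np.ones(k))
            W[i, neighbors[i]] = w / w.sum()                # weights sum to one
        # embedding: bottom eigenvectors of M = (I - W)^T (I - W), skipping the constant one
        M = (np.eye(n) - W).T @ (np.eye(n) - W)
        vals, vecs = np.linalg.eigh(M)
        return vecs[:, 1:n_dims + 1]

    # hypothetical: 32 lung samples x 6000 genes embedded in 3 dimensions with 5 neighbors
    Y = lle(np.random.rand(32, 6000), k=5, n_dims=3)        # shape (32, 3)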
This may be an indication that noise present in the data causes the smaller separation of the clusters. These data sets, however, may simply require an additional dimension to better describe the data and possibly increase the margin between the two clusters. The visualizations of the pancreas data set seem to indicate that there may be a misclassification of a single "normal" sample, which consistently falls among a subset of the cancer samples. Also bothersome is the complete dispersion of the cancer samples: there is no tight cluster of cancer samples.

Overall, most of the data sets (especially the lung data set) are well suited for analysis. The cancer and normal samples each cluster well, and there is a reasonably specific dividing line between the two clusters.

Figure 3-7. Possible misclassified sample (MDS into 3 dimensions of 10 pancreas samples).

Now that it is established that most of the data sets contain genomic measurements that separate the cancer and normal samples, feature space reduction techniques are required to identify those genes that cause this separation.

Chapter 4

Feature Space Reduction

As we found from performing the visualization analysis in Chapter 3, the cancerous and normal samples for nearly all tissues are separable by some simple function. In most cases a linear function can be used to discriminate between the two sets, since a straight line can be drawn between the two sets in the 3-dimensional plots produced. Since it is established that the data from the oligonucleotide samples provide information that allows the cancerous and normal samples to be separated, all that remains is to specifically select the set of genes that consistently causes the largest separation between the two sets of data.

The thousands of measurements on each oligonucleotide array constitute an input space X. In some instances, changing the representation of the data can improve the ability of a supervised learning technique to find a more generalizable solution. Changing the representation of the data can be formally expressed as mapping the input space X to a feature space F:

x = (x_1, ..., x_n) \mapsto \phi(x) = (\phi_1(x), ..., \phi_N(x))

The mapping function \phi(x) can be used to perform any manipulation on the data, but frequently it is desirable to find the smallest set of features that still conveys the essential information contained in the original attributes. A \phi(x) that creates a feature space smaller than the input space by finding the critical set of features performs feature space reduction.

Given that the oligonucleotide samples each have approximately 10,000 attributes, feature space reduction is necessary to identify the genes that cause the differences between cancerous and normal samples. Variable selection techniques reduce the feature space by identifying critical attributes independent of their behavior with other attributes in the input space. Dimension reduction techniques, however, typically look for sets of attributes that behave similarly across the data and group these "redundant" attributes, thereby reducing the number of features used to represent the data.

4.1 Variable Selection Techniques

Variable selection techniques typically focus on identifying subsets of attributes that best represent the essential information contained in the data.
Since selecting an optimal subset given a large input space is a computationally hard problem, many existing techniques are heuristic-based. The techniques that were used for variable selection on the data sets are described below.

4.1.1 Mean-difference Statistical Method

The most direct way to find the genes that are the most different between cancerous and normal samples is to average the expression of the cancerous samples and the normal samples separately, and then subtract the resulting mean expression vectors to obtain the mean difference vector d:

d = \frac{1}{j} \sum_{i \in \text{cancer}} X_i - \frac{1}{k} \sum_{i \in \text{normal}} X_i

where j is the number of cancer samples and k is the number of normal samples. Those genes with the largest values in d should be those most closely related to the occurrence of cancer. Using statistics, a confidence interval can be assigned to the absolute difference between the means for each gene, and hence can be used to determine whether noise is the source of the difference.

Technique Description

Assume there is a set X with n samples drawn from a normal distribution with mean \mu_x and variance \sigma^2, and another set Y with m samples drawn from a normal distribution with mean \mu_y and the same variance \sigma^2. Since we are only provided with the samples in set X and set Y, the mean and the variance of the distributions are unknown. In order to estimate \mu_x - \mu_y, we can use \bar{X} - \bar{Y}. However, since this is an approximation based on the samples taken from the distributions, it would be wise to compute confidence intervals for the mean difference. The 100(1 - \alpha)% confidence interval for \mu_x - \mu_y is

(\bar{X} - \bar{Y}) \pm t_{m+n-2}(\alpha/2) \, s_{\bar{X}-\bar{Y}}

where t_{m+n-2} is the t distribution with m + n - 2 degrees of freedom, and s_{\bar{X}-\bar{Y}} is the estimated standard deviation of \bar{X} - \bar{Y}:

s_{\bar{X}-\bar{Y}} = s_p \sqrt{\frac{1}{n} + \frac{1}{m}}

Here s_p^2 is the pooled sample variance, which weights the variance from the set with the larger number of samples:

s_p^2 = \frac{(n-1)s_X^2 + (m-1)s_Y^2}{n + m - 2}

Two assumptions must be made in order to correctly use this technique [18, p. 392]. The first assumption is that each set of samples is drawn from a normal (gaussian) distribution. This assumption is usually reasonable, especially when the number of samples within each set is large. Unfortunately, the data sets that we have are typically small when looking at the number of normal (non-cancerous) samples.

The second assumption is that the variances of the sets X and Y are the same. If this were true, it would indicate that the variability among the cancerous samples is the same as that among the normal samples. This might not be a safe assumption to make, since the source and magnitude of the variance in the cancerous samples may or may not apply to the normal samples. Specifically, among the cancerous samples there may be different types of cancers that cause different sets of genes to be highly expressed. This variance would not be displayed in the set of normal samples.

Application to Tissue Data

The above technique was used to select three different subsets of features. Specifically, a gene was chosen to be in a subset if the probability that the difference between its values for the cancer experiments and the normal experiments could be seen by chance is at most p. Simply stated, the smaller the p value for a gene, the more likely that the difference between the means is statistically significant and not due to luck (or noise). Three values for p were used to select genes: 0.01, 0.001, and 0.0001.
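As an illustration, the following minimal sketch shows how this p-value-based selection might be carried out. The use of scipy's two-sample t test with a pooled variance estimate (consistent with the confidence-interval derivation above), the matrix layout, and the variable names are assumptions for clarity rather than the exact code used in this work.

```python
import numpy as np
from scipy.stats import ttest_ind

def mean_difference_select(expr, is_cancer, p_threshold=0.01):
    """Return indices of genes whose cancer/normal mean difference is
    significant at the given p-value threshold.

    expr      : (n_genes, n_samples) expression matrix (assumed layout)
    is_cancer : boolean array of length n_samples
    """
    cancer = expr[:, is_cancer]
    normal = expr[:, ~is_cancer]
    # Two-sample t test per gene, assuming equal variances (pooled estimate).
    t_stat, p_values = ttest_ind(cancer, normal, axis=1, equal_var=True)
    return np.where(p_values <= p_threshold)[0]

# Example usage with a random placeholder matrix of 6000 genes and 20 samples.
rng = np.random.default_rng(0)
expr = rng.normal(size=(6000, 20))
is_cancer = np.array([True] * 12 + [False] * 8)
for p in (0.01, 0.001, 0.0001):
    print(p, len(mean_difference_select(expr, is_cancer, p)))
```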
A p value of 0.01 meant that the probability that such a mean difference would occur if the two sets were the same was 1/100. After removing all negative genes from the original (untouched) data sets, each tissue set had the following number of genes:

* breast: 6203 genes
* colon: 5778 genes
* gastric: 5477 genes
* kidney: 6655 genes
* liver: 5952 genes
* lung: 5593 genes
* ovary: 6645 genes
* pancreas: 6913 genes
* prostate: 6885 genes

Table 4.1 shows how many genes with statistically significant mean differences were found for all tissues, both before and after pre-processing the data. Clearly the number of features to consider is reduced by this technique. The generalization performance of the decision functions obtained from training on the reduced data sets will be observed in Chapter 5, and will validate the usefulness of this technique. From Table 4.1, it is evident that the two tissues that seemed not to be linearly separable in the visualization analysis, the liver and pancreas, also have the fewest genes with significant differences between the means for the cancer and normal arrays. This shows the relationship between having separation between the cancer and normal samples and having specific sets of genes that actually cause this separation. This possibly indicates that only a subset of genes is needed to differentiate between cancer and normal tissues.

4.1.2 Golub Selection Method

The Golub et al. technique (as briefly mentioned in Ch 1.2.2) is a supervised learning method which selects genes that have a specific expression pattern across all the samples. For instance, possible cancer-causing genes could be highly expressed in the cancerous samples while having low expression in the normal samples. Genes that have such an expression pattern across the samples could then be used to determine whether a new case is cancerous or normal. This method allows the user to decide what expression pattern cancer-causing genes should have. Golub et al. selected two different expression patterns, which are described below.

Technique Description

Assume there is a set X of n cancerous samples and a set Y of m normal samples. Each gene therefore has an expression vector with n + m elements:

gene_i = (exp_1, exp_2, ..., exp_n, exp_{n+1}, ..., exp_{n+m})

ideal_1 = c \cdot (1, 1, ..., 1, 0, 0, ..., 0), with n ones followed by m zeros

ideal_2 = c \cdot (0, 0, ..., 0, 1, 1, ..., 1), with n zeros followed by m ones

The similarity between each gene's expression vector gene_i and an idealized expression vector (e.g. ideal_1) is measured using any of several similarity metrics (e.g. Pearson correlation, euclidean distance). Golub et al. introduced a new metric, based on the Fisher criterion, that quantifies the signal to noise ratio of the gene:

\frac{\mu_x - \mu_y}{\sigma_x + \sigma_y}

After calculating the similarity between all gene expression vectors and ideal_1 using one of the above similarity metrics, an arbitrary number n of the genes that score the highest can be selected to represent half of the predictive set. The other half of the set consists of the n genes that score the highest on correlation with ideal_2.

Application to Tissue Data

The above method was used twice to select subsets of genes of size 50, 100, 150, 200, 250, and 500. The two metrics that were used to select the genes were the Pearson coefficient and, separately, the Golub similarity metric. The n gene subsets that were selected when using the Pearson coefficient were composed of the n/2 genes most highly correlated with ideal_1 and the n/2 genes most highly correlated with ideal_2.
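A minimal sketch of these two scoring functions is given below; the function names and the assumed (genes × samples) matrix layout are illustrative only, and the top-scoring genes would then be kept as described in the surrounding text (the Golub-metric variant is described immediately after the sketch).

```python
import numpy as np

def golub_s2n(expr, is_cancer):
    """Golub signal-to-noise score per gene: (mu_x - mu_y) / (sigma_x + sigma_y).
    expr is assumed to be a (n_genes, n_samples) array."""
    x, y = expr[:, is_cancer], expr[:, ~is_cancer]
    return (x.mean(axis=1) - y.mean(axis=1)) / (x.std(axis=1) + y.std(axis=1))

def pearson_with_ideal(expr, ideal):
    """Pearson correlation of every gene's expression vector with an ideal pattern."""
    e = expr - expr.mean(axis=1, keepdims=True)
    i = ideal - ideal.mean()
    return (e @ i) / (np.linalg.norm(e, axis=1) * np.linalg.norm(i))

# Example: 12 cancer samples followed by 8 normal samples.
is_cancer = np.array([True] * 12 + [False] * 8)
ideal_1 = is_cancer.astype(float)          # high in cancer, low in normal
expr = np.random.default_rng(1).normal(size=(500, 20))

top_s2n = np.argsort(golub_s2n(expr, is_cancer))[::-1][:25]       # 25 highest-scoring genes
top_corr = np.argsort(pearson_with_ideal(expr, ideal_1))[::-1][:25]
```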
The n gene subsets that were selected when using the Golub metric were the n genes that had the highest Golub metric score. Thi variable selection technique was run on all nine tissues, with the four expression matrices per tissue (as described earlier). The ability of these subsets to improve classification performance is addressed in Ch. 5. 50 4.1.3 SNP Detection Method SNPs are Single Nucleotide Polymorphisms, or variations of one nucleotide between DNA sequences of individuals. While SNPs are typically used as genetic markers, they could also cause mutations which cause cancer. As mentioned in Chapter 2, genechips measure levels of expression of genes by having sets of probes, which are oligonucleotides of length 20. Each probe in a probe set is supposed to be a different segment of the coding region of the same gene. Assume we have two oligonucleotide samples measuring gene expression from two separate individuals. Assume we observe for the probe set of genej that all probes have similar expression levels except for one probe between the two samples. Since each probe is supposed to be just a separate segment of the same gene, the expression level for each probe should be similar. Two possible explanations for this behavior are 1. Other genes that are expressed differently between the samples have a segment of DNA that is exactly the same as that of gcnej (and complementary to the probe in question) 2. A SNP has occurred in one sample, causing it to either increase or decrease the binding affinity of the segments of mRNA to the probe in question Although explanation #1 is reasonable, the designers of the genechip at Affynetrix supposedly selected probes that are specific to the desired gene. If this is the case, then it would be reasonable to assume explanation #2, and further investigate the presence of SNPs by sequencing the oligonucleotide segment of the gene represented by the probe. Technique Description In order to compare probe expression values (fluorescent intensities) to find SNJ's, some normalization is required to ensure that the differences are not due to variations in the preparation of each sample. Since these differences would likely yield a uniform 51 difference among probe intensities for all genes, a concentration factor c could be used to remove this source of variance. For each probe set of a single sample, the original expression data across all probes could be approximated by a probe set pattern vector j multiplied by the concentration factor c C * (pi, P2, ... ,Pn) where pi is an expression level of the ith probe. The concentration factor c would apply to the pattern vectors for each gene on the array, while the probe set pattern vectors would apply for all samples. The factor c and pattern vectors could be obtained by minimizing the error function S= (di, d2, ... , dn) - C * (PI, P2, Pn) while constraining one c for each sample, and one j5 for each probe set across all samples. Using such a technique would provide us with probe pattern vectors ' for a set of samples. In order to use the above technique to find cancer related SNPs, we must first note that the SNPs that can be found among the cancerous samples is likely a combination of SNPs that the normals have as well as cancer-specific SNPs: {cancer SNPs} = {normal SNPs} U {cancer - specific SNPs} Assuming that {normal SNPs} C {cancer SNPs}, we can find {cancer-specific SNPs} by subtracting the {normal SNPs} from the {cancer SNPs}. 
One method that can be used to find the {normal SNPs} is to compare the p-'s to each normal sample's expression across a probe set. The 1-'s will be calculated by using the above methodology on the set of normal samples. If any sample's probe set expression vector varies significantly from the 'in one probe, it would be reasonable to assume that a potential SNP exists for that probe. A simple technique similar to jack-knifing can be used to find potential SNP 52 containing probes. For each pattern vector ' and each probe expression vector d(both with n elements, the number of probes in each probe set), the Pearson correlation K is calculated. The correlation Kj of ]5 and d without probe i is then calculated and compared with the original K. If the percentage difference of the correlations is above a threshold, threshold < "- then that probe varies significantly between the two vectors and hence can be a potential SNP. The threshold test is done for each probe i, 1 < i < n. After doing such an analysis, a list of (gene,probe) pairs represents the potential {normal SNPs}. Since the overall goal is to have a set of cancer-specific SNPs that we have high confidence in, it is acceptable to place the above threshold to be low so that the set of normal SNPs will be large. In the end of the analysis, we will be "subtracting" the {normal SNPs} from {cancer SNPs}. Hence we would like to remove any SNPs that aren't specific to cancer. Determining the {cancer SNPs} is done by a different technique. The probe pattern vectors j obtained from the normal samples will be compared with each cancer sample's probe set vectors using the jack-knife correlation method described above. The p-'s obtained from the normal samples are used so that we would avoid not identifying SNPs that are common among all cancer samples but not existent in normal samples. As described above, the final step in order to find {cancer-specific SNPs} is to remove from {cancer SNPs} all elements of {normal SNPs}. Application to Tissue Data The SNP detection method was run for all nine tissues on all four variants of the data set. The pattern detection technique was successfully run on each data set, and using the jack-knifing method of detecting single proble variations, a list of potential SNPs were created for the cancer samples and the norial samples. However, for each data set. {cancer SNPs} C {normal SNPs}. In other words, all SNPs found were actual polyilorphisms that existed in the normal samples, and hence weren't critical to the development of cancer. Because this was the case, no reduced data sets were created from this method. 53 A possible explanation for this behavior can be related to the manner in which the normal samples were obtained. As described in Ch. 2, "normal" samples were actually "normal" appearing biopsies of tissues that often contained tumors as well. It is possible that the "normal" samples actually contained cancerous cells. While on average, the overall expression could be different between the cancer and normal samples for the same probe set due to the different concentration of cells afflicted by cancer, the probe patterns could still be the same. 4.2 Dimension Reduction Techniques While variable selection techniques attempt to reduce the feature space by using a specific selection criterion that tests each variable independent of the others, dimension reduction techniques reduce the feature space by grouping variables that behave consistently across the data. 
The number of dimensions is therefore reduced to the number of groups that encompass all the variables. Most dimension reduction techniques applied to oligonucleotide array data have been linear reduction techniques. These techniques typically operate in a matrix factorization framework, where the matrix to be factorized into the reduced dimensions is the expression matrix from Ch 1.2. Assuming the expression matrix has n rows (genes) and m columns (different experiments), most dimension reduction techniques factorize the expression matrix as V ≈ W * H. The matrix W has n rows and r columns, while the matrix H has r rows and m columns. The value r corresponds to the desired reduced dimension of the data, and is typically chosen such that (n + m)r < nm. Such dimension reduction techniques are linear because a linear combination of the columns of W approximates a single original experiment.

The various dimension reduction techniques differ in the constraints that are placed on the elements of the W and H matrices, leading to different factorizations. One common constraint is that the r columns form a basis set that spans an r dimensional space. This requires that each column of W be orthogonal to all other columns of W. Two dimension reduction techniques that were used to reduce the feature space are described below.

Figure 4-1. The principal components of an ellipse are its primary and secondary axes.

4.2.1 Principal Components Analysis (PCA)

The most common technique used for dimension reduction is Principal Components Analysis (PCA). Besides the orthogonality constraint, PCA places the additional constraint that the ith column of the W matrix be the dimension that accounts for the ith order of variability. For example, if the original data were 2-dimensional and the data points formed a shape similar to an ellipse, the first column of the W matrix would be a vector pointing in the direction of the primary axis, and the second column would be a vector pointing in the direction of the secondary axis. Each data point can then be described as a linear combination of the vectors that represent the primary and secondary axes. In the general case, each data point of n dimensions can be approximated by a linear combination of r vectors that are each n dimensional.

Each of the r columns of W is known as an eigenvector. They represent "eigenexperiments" because they show common patterns of expression of genes found in the actual experiments. Each original experiment is encoded by a corresponding column of the encoding matrix H, which indicates the specific linear combination of the r eigenexperiment vectors from W that approximates the experiment. PCA effectively groups genes with consistent behavior in the r eigenexperiment vectors of the W matrix. These groups of genes can be likened to cellular processes, since the genes express themselves consistently across many experiments, as cellular processes would be expected to do (the interactions between genes in a pathway being relatively consistent). It has been shown that certain eigenexperiments do group together most of the genes for certain cellular processes [2]. The columns of the H matrix can then be used to understand which eigenexperiments, and hence cellular processes, contribute the most to the expression of a certain experiment.

Technique Description

PCA is typically performed by doing eigenanalysis on the matrix A (note the change of notation from above, where the matrix to be decomposed was V).
If A is not square, PCA is implemented via Singular Value Decomposition (SVD), a technique used to find the eigenvectors of a non-square matrix [22, p. 326]. SVD results in the decomposition A = U \Sigma V^T, where the columns of U correspond to the principal components. The \Sigma matrix is diagonal, containing the singular value associated with each principal component. The columns of U and the columns of V are each orthonormal. The matrix U is similar to the W matrix described in the general framework, and (\Sigma V^T) is similar to the H matrix in the general framework. The nth principal component is the nth column of U. SVD is performed as follows:

* compute A^T A and its eigenvectors v_1, v_2, ..., v_n, since A^T A is square
* normalize the eigenvectors so they are unit eigenvectors
* calculate A v_i = \sigma_i u_i for each unit eigenvector, where u_i is a unit vector and \sigma_i is the magnitude of the vector A v_i
* u_i is the ith column of U, \sigma_i is the (i, i) element of \Sigma, and v_i is the ith column of V

Application to Tissue Data

PCA will be used to dimensionally reduce the four data sets per tissue. SVD will be performed on each data set, yielding U, \Sigma, and V. As described above, the encoding matrix H is equivalent to \Sigma V^T, and each experiment (column i of V in the general framework) corresponds to column i of H. By using the encoding matrix columns to represent an experiment, we have in a sense changed the feature space, where each dimension is now in the direction of the corresponding principal component. By selecting an arbitrary number of rows of the encoding matrix H, we can project the data into the desired reduced space. Using the first n rows of the H matrix corresponds to reconstructing an experiment using only the first n principal components. Various projections of the data (utilizing a varying number of principal components) can be used to represent the data, and the classification performance of each projection is examined in Ch. 5.

4.2.2 Non-negative Matrix Factorization (NMF)

Non-negative matrix factorization (NMF) is another technique that has recently been used to decompose oligonucleotide array data into components which represent cellular processes (Kim and Tidor, to be published). NMF performs the same factorization of the expression matrix into the W and H matrices, and requires that each column of W (known as a basis vector when using NMF) is orthogonal to all other columns of W. NMF differs from PCA by placing a different constraint on the elements of the W and H matrices. Specifically, NMF requires that all elements of both the W and H matrices be positive. This leads to basis vectors that contain only positive expression values for each gene. Because PCA has no such constraint, genes can frequently have negative expression values in the eigenexperiment vectors. Another result of the positivity constraint is that all original experiments are approximated by additive combinations of the basis vectors. This means that the expression of a single experiment is composed of the addition of several basis vectors, each of which represents a cellular process. It is more intuitive to think about an experiment's expression pattern as the sum of several cellular processes being active, instead of the addition and subtraction of cellular processes (which PCA produces).

Technique Description

The NMF technique was developed by Lee and Seung [12].
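Before turning to the NMF procedure itself, the following minimal sketch illustrates the SVD-based projection step just described for PCA; numpy's SVD routine, the placeholder data, and the choice to keep the first r components are assumptions for illustration.

```python
import numpy as np

def pca_project(V, r):
    """Project experiments (columns of V) onto the first r principal components.

    V : (n_genes, m_experiments) expression matrix (assumed layout)
    Returns the first r rows of the encoding matrix H = Sigma * V^T.
    """
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    H = np.diag(s) @ Vt          # encoding matrix; column i encodes experiment i
    return H[:r, :]              # r-dimensional representation of each experiment

# Example: project 20 hypothetical experiments of 6000 genes into 5 dimensions.
V = np.random.default_rng(4).lognormal(size=(6000, 20))
H_r = pca_project(V, 5)
print(H_r.shape)                 # (5, 20)
```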
The NMF technique is nondeterministic, and is an iterative procedure that uses update rules to converge to a maximum of an objective function related to the error between the original expression matrix V and the product of the W and H matrices. The update rules are designed such that the non-negativity constraints are met. One requirement is that the matrix V to be decomposed has no negative elements to begin with. The update rules for the W and H matrices are as follows:

W_{ia} \leftarrow W_{ia} \sum_{\mu} \frac{V_{i\mu}}{(WH)_{i\mu}} H_{a\mu}

W_{ia} \leftarrow \frac{W_{ia}}{\sum_{j} W_{ja}}

H_{a\mu} \leftarrow H_{a\mu} \sum_{i} W_{ia} \frac{V_{i\mu}}{(WH)_{i\mu}}

An objective function, which quantifies the error between the original V and WH, suggested by Lee and Seung [13] is

F = \sum_{i=1}^{n} \sum_{\mu=1}^{m} \left[ V_{i\mu} \log(WH)_{i\mu} - (WH)_{i\mu} \right]

Application to Tissue Data

The encoding matrix H obtained by performing NMF on each tissue's four expression matrices will be used to project the data just as was described for PCA. By observing the classification performance produced by projecting the data into various basis vector spaces, we can identify the most critical basis vectors for further study.

4.2.3 Functional Classification

Another strategy that can be used to reduce the number of dimensions is to group genes based on their functional classification, or usage in cellular processes. If genes are functionally classified by some analytical means (e.g. by the careful analysis of a pathway), we can use this information instead of using techniques such as PCA and NMF to attempt to find and correctly group all genes related to a particular cellular process. The yeast genome, which has approximately 6000 genes, has been functionally annotated. Each gene has been assigned to one or more of thirteen different functional groups by the MIPS consortium [16]:

1. Metabolism
2. Energy
3. Growth: cell growth, cell division, and DNA synthesis
4. Transcription
5. Protein Synthesis
6. Protein Destination
7. Transport Facilitation
8. Intracellular Transport
9. Cellular Biogenesis
10. Signal Transduction
11. Cell Rescue, Defense, Death & Aging
12. Ionic Homeostasis
13. Cellular Organization

Since these functional classifications are for the yeast genome and not the human genome, in order to make the classifications relevant for my data I used sequence homology to assign functional classifications to the genes included on the Affymetrix oligonucleotide arrays. Approximately 4100 human genes on the genechips were homologous to yeast genes, and hence these 4100 genes were grouped into the above 13 groups. Experiments were then defined as having expression in 13 dimensions, where the expression for each dimension was the average expression of all the genes in the corresponding functional group.

Figure 4-2. Functional Class Histogram

The breakdown of the 4100 genes among the functional classification groups is shown in the figure above. Note that each gene can belong to multiple classes. Those genes that were specified as being in multiple classes were factored into the expression for each class by being treated as a separate gene.

Application to Tissue Data

Each tissue's four expression matrices are reduced to 13 dimensions instead of the approximately 6000 dimensions. By using these dimensionally reduced matrices, the critical dimensions highlighted by the decision function obtained from the SVMs will essentially highlight the critical cellular processes that are most differently expressed between cancer and normal samples. This is further explored in Ch. 5.
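A minimal sketch of this 13-dimensional reduction step is shown below; the variable names, matrix layout, and random class assignments are assumptions made for illustration, with genes assigned to multiple classes contributing to each of their classes as described above.

```python
import numpy as np

def reduce_to_functional_classes(expr, gene_classes, n_classes=13):
    """Average each sample's expression over the genes in each functional class.

    expr         : (n_genes, n_samples) expression matrix (assumed layout)
    gene_classes : list of sets; gene_classes[g] holds the MIPS class indices (0-12) of gene g
    returns      : (n_classes, n_samples) reduced matrix
    """
    n_genes, n_samples = expr.shape
    reduced = np.zeros((n_classes, n_samples))
    counts = np.zeros(n_classes)
    for g in range(n_genes):
        for c in gene_classes[g]:
            # A gene in several classes is treated as a separate gene in each class.
            reduced[c] += expr[g]
            counts[c] += 1
    return reduced / counts[:, None]

# Example with placeholder data: 4100 genes, 20 samples, random class assignments.
rng = np.random.default_rng(2)
expr = rng.lognormal(size=(4100, 20))
gene_classes = [set(rng.choice(13, size=rng.integers(1, 3), replace=False)) for _ in range(4100)]
reduced = reduce_to_functional_classes(expr, gene_classes)
print(reduced.shape)   # (13, 20)
```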
60 Table 4.1. Genes Selected by Mean Difference Technique (note that "orig" means the data with no pre-processing, "log" means log normalization, "mv" means mean-variance normalization, and "gs" means global scaling). Tissue Type p = 0.01 p = 0.001 breast, orig breast, log breast, my breast, gs colon, orig colon, log colon, my colon, gs gastric, orig gastric, log gastric, my gastric, gs kidney, orig kidney, log kidney, my kidney, gs liver, orig liver, log liver, my liver, gs lung, orig lung, log lung, my lung, gs ovary, orig ovary, log ovary, my ovary, gs pancreas, orig pancreas, log pancreas, my pancreas, gs prostate, orig prostate, log prostate, mv prostate, gs 732 739 2077 515 739 780 693 809 1351 1014 1090 1173 1023 893 795 1035 51 86 42 46 621 719 616 756 763 682 655 539 133 145 121 1737 1750 349 278 1269 198 260 323 271 308 552 276 530 573 370 227 272 383 14 14 14 14 270 297 292 393 392 259 214 247 13 16 18 14 1024 1064 199 117 567 100 99 138 111 120 166 62 185 205 116 57 91 125 14 0 14 0 171 102 170 220 231 112 100 134 11 11 11 11 601 642 1518 895 524 1746 1066 647 151 61 p 0.0001 62 Chapter 5 Supervised Learning with Support Vector Machines Because the goals of this project were to discover genes that may cause cancer, and to create a decision function that can be used as a diagnosis tool for cancer, the next step of the thesis involved the use of Support Vector Machines (SVMs). SVMs aid in achieving both of these goals. More commonly, SVMs are used to create decision functions that provide good generalization performance. However, since most learning techniques tend to perform worse in high dimensional spaces, in Chapter 4 we introduced ways to reduce the number of dimensions that were to be considered by the SVM, and hence attempt to improve the generalization performance of the decision functions obtained from the SVMs. Additionally, we hoped to use these decision functions to point out the critical features in the feature space. First, our motivation for the usage of SVMs and background on the mechanism of SVMs will be provided. Then, the generalization performance of the (lecision functions obtained from the SVMs when using a specific variable selection technique from Chapter 4 will be presented. Any important genes found will also be listed and described. 63 5.1 Support Vector Machines 5.1.1 Motivation for Use Many different supervised learning techniques have been used to create decision functions. Neural networks were previously the most common type of learning technique. However, SVMs have been shown in several studies to have improved generalization performance over various learning techniques, including neural networks. Equally important was the ability of the SVM to create decision functions that were nonlinear. Many learning techniques create linear decision functions, but it is very possible that a more complex (nonlinear) function of the expression levels of genes is required to accurately describe the situations in which cancer is present. While other nonlinear supervised learning techniques exist, SVMs provides a framework in which the structure of the decision function can be developed by the user through the kernel function (see next section). This framework allows for flexibility, so that both linear and various nonlinear functions can be used as the form of the decision function. The generalization performance can then be used to indicate the structure of the relationships in the data. 
Another attractive feature of SVMs is that the criterion for the selection of the decision function is intuitive. The selection of the decision function is optimized on two things: the ability to correctly classify the training samples, and the maximization of the margin between the two groups that are being separated by the decision function. The latter criterion is an intuitive notion, which also leads to a deterministic solution given a set of data. This criterion enables us to better understand the decision function so that we can analyze it and select critical features.

5.1.2 General Mechanism of SVMs

The objective of the SVM is to find a decision function that solves the binary classification problem (classify an object into one of two possible groups) given a set of objects. One way to describe the decision function f(x) is: if f(x_i) > 0, then the object x_i is assigned to the positive class; otherwise, x_i is assigned to the negative class.

Figure 5-1. A maximal margin hyperplane that correctly classifies all points on either half-space [19]

Assume the decision function has the following form:

f(x) = (w \cdot x) + b

The decision function is a linear combination of the various elements of x, where w_i refers to the weight of the ith element of x. For example, if the number of elements n in x is two, then the decision function is a line. For n larger than three, the decision function is known as a hyperplane. Treating each object as a point in n dimensional space, the margin of a point x_i is

\gamma_i = y_i ((w \cdot x_i) + b)

where y_i is -1 or 1, denoting which group x_i belongs to. The distance of the point x_i from the separating hyperplane is related to |\gamma_i|,¹ while the sign of \gamma_i indicates whether the hyperplane correctly classifies the point x_i or not. If the sign is positive, x_i is correctly classified.

As mentioned above, the first thing the SVM tries to do is find a w that correctly classifies all points, given that the set of points is linearly separable. This reduces to having positive margins for all points. Secondly, if we maximize the sum of the margins for all points, we are maximizing the distance each set of points is from the separating hyperplane. Hence the maximal margin classifier type of SVM is an optimization problem, where the goal is to maximize \gamma given the constraint that each \gamma_i is positive (all l points are correctly classified), where the set of points is linearly separable [5].

¹ \gamma_i / ||w|| equals the euclidean distance a point x_i is from the hyperplane (w, b). Note this is the geometric margin.

Before we continue, we should note that the margin \gamma as defined above is actually known as the functional margin. Note that the functional margin for any point can be arbitrarily scaled by scaling w and b:

(w, b) \rightarrow (\lambda w, \lambda b)

\gamma_i = y_i ((\lambda w \cdot x_i) + \lambda b) = \lambda \, y_i ((w \cdot x_i) + b)

By scaling \lambda we can make the functional margin arbitrarily large. In order to remove this degree of freedom, we can require that the functional margin be normalized by ||w||, yielding the geometric margin \gamma_i / ||w||. Maximizing the geometric margin implies minimizing ||w||. Hence the maximal margin classifier can now be solved for by minimizing ||w|| while meeting the constraints of correctly classifying each of the l objects. At this point, we can utilize Lagrangian optimization techniques to solve for the maximal margin classifier (w, b), given the objective function (w \cdot w) to minimize subject to the constraints y_i((w \cdot x_i) + b) \geq 1 for all i = 1, ..., l.
The primal Lagrangian for this optimization problem is

L(w, b, \alpha) = \frac{1}{2}(w \cdot w) - \sum_{i=1}^{l} \alpha_i \left[ y_i((w \cdot x_i) + b) - 1 \right]

where the \alpha_i \geq 0 are the Lagrange multipliers. The Lagrange theorem tells us that \partial L(w, b, \alpha)/\partial w = 0 and \partial L(w, b, \alpha)/\partial b = 0. Using these relations,

\frac{\partial L(w, b, \alpha)}{\partial w} = w - \sum_{i=1}^{l} \alpha_i y_i x_i = 0 \quad \Rightarrow \quad w = \sum_{i=1}^{l} \alpha_i y_i x_i

\frac{\partial L(w, b, \alpha)}{\partial b} = \sum_{i=1}^{l} \alpha_i y_i = 0

The first equation above implies an important relationship between w and each point x_i. It states that w is a linear combination of the points x_i. The larger the \alpha_i, the more critical x_i is in the placement of the separating hyperplane. To obtain the dual Lagrangian, we substitute the two new relations above into the primal Lagrangian:

L(w, b, \alpha) = \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - b \sum_{i=1}^{l} \alpha_i y_i + \sum_{i=1}^{l} \alpha_i = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) = W(\alpha)

This results in the final statement of the optimization problem for the maximal margin classifier. Given a linearly separable set of points, we maximize the dual Lagrangian W(\alpha) subject to the constraints \alpha_i \geq 0 for all i = 1, ..., l and \sum_{i=1}^{l} \alpha_i y_i = 0. The latter constraint demonstrates how the maximal margin SVM selects points to construct the separating hyperplane: points on opposite half-spaces of the hyperplane are used to "support" the hyperplane between them, hence the name "Support Vector Machines".

In addition, the SVM framework is such that the optimization problem is always in a convex domain. Because of this, we can apply the Karush-Kuhn-Tucker theorem to draw more conclusions about the structure of the solution:

\alpha_i \left[ y_i((w \cdot x_i) + b) - 1 \right] = 0

This new relation implies that either \alpha_i = 0, or y_i((w \cdot x_i) + b) = 1. This means that only the points closest to the separating hyperplane have non-zero Lagrange multipliers, and hence only those points are used to construct the optimal separating hyperplane (w, b).

5.1.3 Kernel Functions

At this point, the SVM framework has shown how to find a hyperplane that is a linear combination of the original samples in the training set. The notion of the kernel function is what allows SVMs to find nonlinear functions that accurately classify samples. The kernel function K(x, z) is related to the \phi function described in Chapter 4. The \phi function is used to map an input space X to a new feature space F. In the context of Chapter 4, we described \phi functions that reduced the input space to a more manageable size, which typically results in better generalization performance. However, \phi functions often manipulate the data in other ways as well (e.g. log transforming a subset of the features). A common motivation for a \phi function is that if the data are not linearly separable in the input space, \phi projects the data into a new space where the data are linearly separable.

This is the strategy of SVMs. Given a training set, the samples are implicitly mapped into a feature space where the samples are then linearly separable. The optimal separating hyperplane is then determined by solving the optimization problem described above. Since the solution is \alpha, a vector with a Lagrange multiplier (or weight) for each sample, these weights indicate which samples are the support vectors. An important idea to note is that the only way the samples enter the optimization problem is through the term (x_i \cdot x_j); only the dot products between the samples' vectors are used in the construction of the problem. SVMs take advantage of this by utilizing kernel functions:

K(x, z) = (\phi(x) \cdot \phi(z))
If a mapping of the samples is required, the kernel function takes the place of the original dot product in the objective function of the optimization, yielding

W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j)

The kernel function allows for the "implicit" mapping of the samples to a new feature space using the \phi function. The term "implicit" is used because it is actually not required to know the \phi function in order to find the optimal hyperplane in a different feature space; all that is required is the kernel function. Oddly, many kernel functions are designed without consideration of the associated \phi.

There are requirements regarding what functions can be used as kernel functions. The first requirement is that the function be symmetric. The second requirement is that the Cauchy-Schwarz inequality be satisfied:

K(x, z)^2 = (\phi(x) \cdot \phi(z))^2 \leq ||\phi(x)||^2 \, ||\phi(z)||^2 = (\phi(x) \cdot \phi(x))(\phi(z) \cdot \phi(z)) = K(x, x) K(z, z)

These requirements are basic properties of the dot product. An additional requirement is stated by Mercer's theorem. Since there is only a finite set of values that can arise from applying the kernel function to a training set, a matrix K of dimensions l by l can be constructed where each element (i, j) contains the corresponding K(x_i, x_j). Mercer's theorem states that the kernel matrix K must be positive semi-definite.

5.1.4 Soft Margin SVMs

The solution the maximal margin SVM obtains can be dramatically influenced by the location of just a few samples. As mentioned above, only a few of the samples, those near the division between the two classes, are support vectors. If the samples in the training set are subject to noise, the optimal separating hyperplane may not be the solution that best suits the data, or, in the worst case, the training set may not be linearly separable in the kernel induced feature space.

In order to deal with noise that may be present in data sets, soft margin SVMs have been developed. "Soft margin" implies that instead of requiring all samples to be correctly classified, the margin \gamma can be allowed to be negative for some points. A margin slack variable quantifies the amount by which a sample fails to reach a target margin \gamma:

\xi_i = \max(0, \, \gamma - y_i((w \cdot x_i) + b))

If \xi_i > \gamma, then x_i is misclassified. Otherwise, the sample is correctly classified, with a margin of at least \gamma if \xi_i is 0. Using the notion of the slack variable, which allows for misclassified samples, we construct a new optimization problem:

minimize   (w \cdot w) + C \sum_{i=1}^{l} \xi_i
subject to   y_i((w \cdot x_i) + b) \geq 1 - \xi_i,   i = 1, ..., l

The extent to which a sample can have a negative \gamma, and the number of such samples, is determined by the parameter C. A C > 0 has the effect of allowing misclassifications to occur in exchange for increasing the overall margin between the sets. Typically the selection of C requires arbitrarily trying various values to obtain the best solution for a specific data set, although it is suggested to use a larger C if the data set is believed to be noisy. This increases the probability that the decision function obtained will be generalizable.

5.2 Previous Work Using SVMs and Genechips

Many researchers have recognized the suitability of SVMs for oligonucleotide array analysis, due to the generalization performance they obtain in high dimensional and noisy spaces (e.g. for text recognition). One of the first applications of SVMs to bioinformatics was the study conducted by Brown et al. The study focused on the classification of yeast genes into functional classes (as defined by MIPS, [16]).
The training set was composed of 2,467 genes measured on 79 DNA microarrays. Each microarray measured the yeast genome's expression at time points during various processes (diauxic shift, mitotic cell division cycle, sporulation) as well as the genome's reaction to temperature and reducing shocks. These genes were chosen to be specifically related to 6 different functional classes. The functional classes were chosen out of the approximately 120 total MIPS classes because the genes involved were known to cluster well. Four different variants of SVMs were used (four different kernel functions), as well as four other learning methods. The training performance of each of these methods was compared for all 6 functional classes. The testing of the generalization performance of the techniques was done by dividing the 2,467 gene expression vectors randomly into thirds. Two-thirds of the genes were used as a training set, and classified the genes of the remaining third. This process was performed twice more, allowing for each third to be classified. The study showed that SVMs outperform the other methods consistently for each of the functional classes. The study also went on to show that many of the consistently misclassifed genes were incorrectly classified by MIPS. Other consistent misclassified genes include those that are regulated at the translation and protein levels, not at the transcription level. The study then went on to use the decision functions to make predictions about a few unannotated genes, citing further proof that the classification made by the SVM may be correct. Another study that sought to build on the work of Brown et al. Pavlidis & Grundy [17]. was done by Specifically, the authors investigated how well SVMs were able to classify genes into the many other functional classes that were not considered by Brown et al. Another goal of the paper was to improve the SVMs ability to functionally classify genes by using phylogenetic profile data. Phylogenetic profiles indicate the evolutionary history of the gene; specifically, they are vectors that indicate whether a homolog of the gene exists in other genomes. Genes with similar 71 profiles are assumed to be functionally linked, based on the hypothesis that the common presence of the genes is required for a specific function. Phylogenetic profiles for much of the yeast genome was produced using 23 different organism genomes. The BLAST algorithm was used to determine whether homologs of each of the yeast genome genes existed in the other genomes, and the E-value (significance) obtained from BLAST was used to create the 23 element phylogenetic profile vector. These phylogenetic profile vectors were used separately to train the SVM to classify genes, and were shown to provide novel information that was complementary to what the expression vectors (from the Brown experiments) could provide. Phylogenetic profile vectors were actually shown to be slightly superior in classifying genes into more functional classes than the expression vectors. When combining the 79 element expression vector with the 23 element phylogenetic profile vector, the SVM typically improved its performance, although in some cases the inclusion of the expression vector data actually caused incorrect classifications. While the above experiments focused on classifying genes, other researchers attempted to use SVMs to classify samples represented by individual arrays. 
The most apt usage of the classification technique was to classify tissue samples as cancerous or normal, thereby providing a diagnosis tool for cancer. Ben-Dor et al. performed such a study on two separate data sets composed of tissue samples from tumor and normal biopsies [3]. One issue noted was the relatively small number of samples compared to the number of dimensions (genes) used to represent each sample (in direct contrast to the studies above, where the number of samples was large relative to the number of dimensions). Because this has been shown to affect generalization performance, subsets of genes were selected using a combinatorial error rate score for each gene. Four different learning techniques (two of which were SVMs with different kernels) were used to classify the samples from two data sets: 62 colon cancer related samples, and 32 ovarian cancer related samples. The jack knifing technique, also known as leave one out cross validation, was used in order to observe the generalization performance of the techniques. Jack knifing consists of using n - 1 samples for training, and then the remaining sample is classified. This process is repeated n times so that every 72 sample gets classified. The results indicate that SVMs perform consistently and reasonably well compared to the other techniques, while the other techniques perform inconsistently between the two data sets. Furey et al. performed a similar study of utilizing SVMs to classify tissue samples as cancerous or normal using DNA microarrays [8]. The data set consisted of 31 ovar- ian tissue samples with 97,802 cDNAs. Because the classification was so drastically underdetermined, only a linear kernel was utilized (and hence there was no projection into a higher dimensional feature space). One variable that was manipulated was a diagonalization factor (related to the kernel matrix), which arbitrarily placed a constraint on how close the separating hyperplane was to the cancerous set of samples. Additionally, instead of using all 97,802 measurements, the Golub method (see Ch 4.1.2) was used as a feature selection method to reduce the number of dimensions, and observe if there was any improvement in generalization performance. The study concluded that by reducing the number of dimensions/features fed into the SVMs, the generalization performance can increase significantly. The study also showed that the generalization performance obtained by SVMs is similar to that of other learning methods (e.g. the perceptron algorithm, or the Golub method). A final point made is that the performance of SVMs will be significantly better compared to other algorithms when the number of samples in the data sets increases. 5.3 Support Vector Classification (SVC) Results The soft margin support vector machine described in Ch. 5.1.4 was used to find decision functions for each of the data sets produced by the feature space selection techniques discussed in Ch. 4. The generalization of the resulting decision functions was measured by using the jack-knifing technique [11], also known as leave-one-outcross-validation (LOOCV). This technique reduces to removing one sample from the set of data, training with the remaining samples, and then classifying the removed sample. These steps are repeated such that each sample is removed, and the number of misclassifications made after all iterations is known as the jack-knifing error. 
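As an illustration of this evaluation loop, a minimal sketch using scikit-learn's soft margin SVM is given below; the linear kernel, the value of C, and the placeholder data are assumptions for illustration, and the sketch does not reproduce the MATLAB implementation actually used in this work.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut

def jackknife_error(X, y, C=10.0, kernel="linear"):
    """Leave-one-out cross-validation error count for a soft margin SVM.

    X : (n_samples, n_features) reduced expression matrix (assumed layout)
    y : labels, +1 for cancerous and -1 for normal samples
    """
    errors = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = SVC(C=C, kernel=kernel)
        clf.fit(X[train_idx], y[train_idx])
        errors += int(clf.predict(X[test_idx])[0] != y[test_idx][0])
    return errors

# Example with placeholder data: 20 samples in a 13-dimensional feature space.
rng = np.random.default_rng(3)
X = rng.normal(size=(20, 13))
y = np.array([1] * 12 + [-1] * 8)
err = jackknife_error(X, y)
print(err, 100.0 * err / len(y))   # error count and error percentage
```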
The 73 jack-knifing error is commonly divided by the total number of samples in the full data set to obtain an error percentage. It is important to note that each time a sample is removed, a potentially different decision function is obtained. The jack-knifing therefore is more an indication of the ability of the supervised learning method and the given feature space than it is of a particular decision function. However, the jackknifing error percentage does indicate the generalizability of the "average" decision function obtained by the learning method that is trained on the full data set in the given feature space. In the context of using SVMs, jack-knifing requires the soft margin optimization problem to be solved n times given a data set with n samples to obtain n decision functions. Each of the tissue data sets obtained from the various feature space selection techniques were trained on by the soft margin SVM several times, utilizing different kernel functions while also manipulating C. This was done in order to see if any kernel functions specifically fit the data better (as would be demonstrated by an improved jack-knifing error percentage) demonstrating the general complexity of the data. C, as described in the soft margin optimization problem, enable the user to force the SVM to focus more on finding a decision function that creates the largest margin between the two classes of data, even at the expense of a misclassification (in the training set). A C of 0 reduces the soft margin optimization problem to the original maximal margin optimization problem, where no misclassifications are allowed. A large C allows for misclassifications while attempting to find the largest margin between the overall sets of data. The selection of a particular value for C is often more art than science, as is the selection of the kernel function. Optimal selections require that the user have prior knowledge about the structure of the data, which is often not the case. Although 3-dimensional projections of the data were produced in Ch. 3, these projections might not provide an accurate view of the data in the full feature space. Nine different kernel functions were used to train on the data sets, while C was varied from 0, 5, 10, 50, 100. The different kernel functions used were: 74 1. linear 2. polynomial of degree 1 (not equivalent to line due to implicit bias of 1) 3. polynomal of degree 2 4. polynomial of degree 3 5. spline 6. gaussian radial basis function with r = 0.5 7. gaussian radial basis function with a = 1.0 8. gaussian radial basis function with r = 1.5 9. gaussian radial basis function with r = 2.0 where (T is the global basis function width of the data set [24]. The linear kernel function corresponds to the dot product, while the polynomial kernel function has the form: K(x, z) = ((x - z) + 1)'. The implicit bias of 1 has the effect of weighting the higher-order terms higher than the lower-order terms. 5.3.1 SVC of Functional Data Sets From the results of the support vector classification of the data sets produced by reducing the genes to 13 dimensions corresponding to 13 functional classes, several observations can be made. First, the classification performance obtained in the 13 dimensions is actually quite good. Obtaining low jack-knifing error in dimensions larger than the number of samples is trivial. However, when the number of dimensions is significantly fewer than the number of samples, the dimensions used must be expressive of the data. 
Nearly all tissue sets have jack-knifing error percentages below 15 for at least one of the four versions of the data. However, there is significant variation in the error percentage that results from the different pre-processing techniques. See Table 5.1. First, as was first indicated by the visualizations perform in Ch. 3, mean-variance normalization significantly decreases the jack-knife error percentage compared to the other techniques. Global scaling, on the other hand, has a dramatically bad effect 75 on the data. It is possible that globally scaling the data removes the differences in expression between the functional groups, causing all samples to appear to be expressing at the same level. Another important observation is that varying the soft margin variable C among the values greater than 0 does not have a significant effect on generalization. The difference of performance between the same kernel using a C of 5 and a C of 100 was generally negligible. Possibly, a more appropriate range for varying C would have been 0.5 < C < 5. However, it is clear that by using a C of 0, a data set can be determined to be separable or not (since a C of 0 corresponds to making the soft margin optimization problem the "hard" optimization problem allowing no misclassifications). Many of the tissue data sets are not separable by any type of decision function specified by the kernel functions used. It was also observed that the as the complexity of the kernel function increased, the jack-knifing error increased as well. While a more complex kernel function can be used to create a decision function that correctly classifies all points, typically the function will be used to "overfit" the data, producing a function that is very specific to the training data. Overfitting often results in poor generalization peformance. This can be seen in the error percentage for the complex kernel functions. Overall, the linear kernel function had the best generalization performance. This also may indicate that the difference between the cancer and normal samples can be well characterized by linear relationships in a set of critical genes. If there were higher-order relationships among the functional classes that cause cancer, they were not detected by the SVM. Until larger and less noisy data sets are available, most analysis will likely point to simple linear relationships. 5.3.2 SVC of NMF & PCA Data Sets Support vector classification for both NMF and PCA were done in a similar fashion. PCA and NMF were used to project each of the tissue data sets into specific dimensions, and SVC was performed on the projected data. Each tissue was projected to d = 1, 3, 5,... , n where n was equal to 1 or 1 - 1, depending on whether the number 76 of samples in the set 1 was odd. A curious result was that the globally scaled projections consistently produced the lowest jack-knifing errors. This is completely opposite of the observation made from the classification of the functional data sets. Results were similar, however, regarding the selection of C. There was absolutely no difference in the performance of the obtained decision functions varying C from 5 to 100. A C of 0 consistently caused the SVM to fail to obtain a decision function as well. The most consistent and interesting observation made by looking at the jackknifing error percentages associated with the varying dimensions was that in all cases, the addition of dimensions past a certain point increased the error. 
It appears that only a few basis vectors/eigenvectors are required for differentiation between cancer and normal samples, since the addition of further dimensions must be adding "noise" to the data, resulting in poorer classification performance. This result is important since it lends credibility to the selection of the basis vectors by NMF, and the selection of eigenvectors by PCA. The fact that only a few dimensions are required to obtain good classification performance also means that there are differentiating cellular processes between the cancer and the normal samples. The jack-knifing performance varied slightly between NMF and PCA. The NMF basis vectors provided consistent 0% jack-knifing errors for several dimensions, ranging from 3 to 11 (depending on the tissue). The PCA eigenvectors did not appear to be quite so robust, but still provided perfect jack-knifing classification in a smaller range of dimensions. The jack-knifing performance of PCA decreased more quickly with the addition of more dimensions compared to NMF. This might be a result of the additive/subtractive nature of PCA, compared with NMF's additive only nature. 5.3.3 SVC of Mean Difference Data Sets Just as the support vector classification of the other data sets has implied, small, critical sets of genes cali be very effective in describing the differences between cancer and normal samples. The mean difference method was used to select 3 sets of genes to describe each data set, and support vector classification was then performed on 77 the dimensionally reduced data sets. The jack-knifing errors were 0 for nearly all reduced data sets (see Table 5.1). One might say that since the number of genes in each of these subsets is still relatively large compared to the number of samples, this is to be expected, since separating a few points in a high dimensional space is trivial. However, the expected result is that the SVM find a decision function that separates all samples in the training set. It is not necessarily expected that the decision functions selected by the SVM are generalizable enough to correctly classify an additional sample. The fact that each time a point was removed and the SVM was trained on the remaining samples, the decision function was able to correctly classify the remaining point means that the dimensions selected are truly representative of the differences between cancer and normal samples. Another important observation was that as the size of the selected subsets of genes decreased, the jack-knifing error actually decreased. The genes in the smaller subsets were identified to have more significant differences between the means of the samples. The fact that a small subset of genes that are supposed to be more significant can provide better generalization than a larger set of genes implies that the technique selects genes which truly are significant in identifying cancer samples from normal samples. Additionally, using a linear kernel with C = 0, most of the data sets were not linearly separable. Performing SVC on these data sets also showed overfitting. Once again, the linear kernel performed the best. When using a more complex kernel function, such as the polynomial kernel of degree 2, the jack-knifing error increased significantly. 5.3.4 SVC of Golub Data Sets As described in Ch 4.1.3, the Golub method of selecting significant features was used to create subsets of genes of size 50, 100, 150, 200, 250, and 500 genes. 
5.3.4 SVC of Golub Data Sets

As described in Ch. 4.1.3, the Golub method of selecting significant features was used to create subsets of 50, 100, 150, 200, 250, and 500 genes. The metric introduced by Golub, which quantifies the signal-to-noise ratio of each gene, was used to create these subsets, and in addition the Pearson coefficient was used to find genes that matched the ideal expression vectors. The top-scoring genes for each of these metrics were used to create 12 subsets of genes.

In general, the jack-knifing errors obtained using these reduced data sets are particularly low. They are similar to those obtained by the SVC of the mean difference data sets above, although not quite as good for a few data sets. As with the mean difference method, the generalization of the decision functions also improves as the size of the subsets of genes decreases.

The Golub and Pearson metrics result in very similar jack-knife errors, although the Golub (signal-to-noise) metric slightly outperforms the Pearson coefficient. The performance of the metrics only differs on the "difficult to separate" data sets identified by the visualizations performed, such as the kidney and the pancreas. It seems that finding genes that have significant signal content (relative to the noise in the data set) is critical to separating these data sets, hence the better performance of the Golub metric.

5.4 Important Genes Found

Given the excellent classification performance yielded by the variable selection techniques, the genes found by these techniques seem to be closely related to the differentiation of cancerous and normal tissues. A common list of genes, representing those that repeatedly came up in several of the subsets, was composed.

Another method of selecting important genes is to analyze the decision function obtained from support vector classification. Given a specific decision function, the features with the largest coefficients associated with them are the most significant in the proper classification of cancer and normal samples. Since the linear kernel produced the lowest jack-knifing errors across all tissue data sets, the decision function had the form of a linear combination of the expression values of the various genes. The genes with the largest coefficients were noted, and a list of important genes was hence created using decision functions from all the variable selection techniques over all the differently pre-processed data sets. See Table 5.2 below for a few example genes.
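The sketch below illustrates this second route to a gene list: train a linear soft-margin SVM on a reduced data set and rank genes by the magnitude of their coefficient w_i in the decision function f(x) = sum_i w_i x_i + b. It is a hedged Python/scikit-learn illustration rather than the analysis actually performed here; X, y, gene_ids, and C are placeholders.

    import numpy as np
    from sklearn.svm import SVC

    def top_genes_by_weight(X, y, gene_ids, n_top=20, C=5.0):
        """Train a linear soft-margin SVM and rank genes by |w_i|, the size of
        each gene's coefficient in the linear decision function."""
        clf = SVC(kernel="linear", C=C).fit(X, y)
        w = clf.coef_.ravel()                        # one weight per gene
        order = np.argsort(-np.abs(w))[:n_top]
        return [(gene_ids[i], float(w[i])) for i in order]

    # Genes that rank highly across several selection methods and pre-processed
    # data sets would then be collected into a common list (cf. Table 5.2).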
Table 5.1. Jack-knifing percentage (%) errors using a linear kernel function. For each tissue/pre-processing combination, the columns are the functional data set (Func) and the mean difference data sets at p = 10^-2, 10^-3, and 10^-4.

breast, orig 8.0000 4.0000 0 0
breast, log breast, mv 12.0000 0 0 NaN 0 7.6923 0 0
breast, gs colon, orig 21.4286 23.0769 0 0 0 0 0 0
colon, log colon, mv colon, gs gastric, orig 15.6250 6.2500 50.0000 5.8824 0 0 10.0000 0 0 0 0 0 0 0 0 0
gastric, log 30.7692 7.6923 7.6923 0
gastric, mv 0 0 0 0
gastric, gs kidney, orig NaN 0 0 0 0 0 0 0
kidney, log kidney, mv kidney, gs 6.2500 30.7692 30.0000 0 NaN 10.0000 0 NaN 0 0 NaN 0
liver, orig 0 0 0 0
liver, log 5.8824 0 0 0
liver, mv 0 0 0 0
liver, gs 30.7692 7.6923 0 0 18.7500 0 NaN 0 NaN 0 NaN 0 lung, orig lung, log lung, mv 0 0 0 0
lung, gs 50.0000 NaN NaN NaN
ovary, orig ovary, log 0 0 0 0 0 0 0 0
ovary, mv ovary, gs 2.9412 18.7500 NaN 0 NaN 0 NaN 0
pancreas, orig pancreas, log pancreas, mv 0 0 0 0 0 0 0 0 0 0 0 0
pancreas, gs 50.0000 0 0 0
prostate, orig 0 0 0 0
prostate, log 0 0 0 0
prostate, mv prostate, gs 0 8.8235 0 0 0 0 0 0

Table 5.2. Genes selected via "vote" and SVM decision function analysis

Gene          Function
AF053641      brain cellular apoptosis susceptibility protein (CSE1) mRNA
X17206        mRNA for LLRep3
HG511-HT511   Ras Inhibitor Inf
M60974        growth arrest and DNA-damage-inducible protein (gadd45) mRNA
U33284        protein tyrosine kinase PYK2 mRNA
M13981        inhibin A-subunit mRNA
V00568        mRNA encoding the c-myc oncogene
S78085        PDCD2 = programmed cell death-2/Rp8 homolog
Z33642        H. sapiens V7 mRNA for leukocyte surface protein
AL031778      dJ34B21.3 (PUTATIVE novel protein)
Y09615        H. sapiens mRNA for mitochondrial transcription termination factor
U10991        G2 protein mRNA
U70451        myeloid differentiation primary response protein MyD88 mRNA
D26158        Homo sapiens mRNA for PLE21 protein
L07493        Homo sapiens replication protein A 14 kDa subunit (RPA) mRNA
U16307        glioma pathogenesis-related protein (GliPR) mRNA
AL050034      Homo sapiens mRNA

Chapter 6

Conclusions

The goal of this thesis was to further the understanding of the mechanisms of cancer. The methodology used to achieve this goal involved comparing and contrasting the genomic expression of cancerous and normal tissues. Oligonucleotide arrays were used to gather large amounts of genomic data, and were able to provide a "snapshot" view of the genomic activity of the cells in both cancerous and normal tissues from nine different human organs. A computational methodology was then utilized to discover critical information in these tissue data sets.

The first step of the methodology involved pre-processing the data to ensure that the information pertinent to the differentiation of cancer and normal samples was clearly presented, while minimizing the effect of external factors such as data preparation. Four techniques were used: removal of "negative" genes, log normalization, mean-variance normalization, and global scaling; a brief sketch of these steps is given below.
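For concreteness, the following is one plausible rendering of these pre-processing steps in Python. The exact conventions (how "negative" average-difference values are handled, the log base, and the global scaling target) are those defined in Ch. 2 and may differ from the simplifications assumed here; in particular, flooring negative values is only a stand-in for the removal step.

    import numpy as np

    def preprocess(X, method="mv", floor=1.0, target=100.0):
        """Illustrative pre-processing of a raw expression matrix X (samples x genes).
        method: "orig", "log", "mv" (mean-variance), or "gs" (global scaling)."""
        X = np.maximum(np.asarray(X, dtype=float), floor)  # stand-in for handling "negative" genes
        if method == "log":                                # log normalization
            return np.log10(X)
        if method == "mv":                                 # mean-variance normalization, per gene
            sd = X.std(axis=0)
            return (X - X.mean(axis=0)) / np.where(sd > 0, sd, 1.0)
        if method == "gs":                                 # global scaling of each array to a common mean
            return X * (target / X.mean(axis=1, keepdims=True))
        return X                                           # "orig"

Each tissue data set would then exist in the four versions ("orig", "log", "mv", and "gs") compared in Table 5.1.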
In order to confirm the presence of relevant information in the pre-processed data sets, three visualization techniques were used: hierarchical clustering, Multidimensional Scaling (MDS), and Locally Linear Embedding (LLE). Each of these techniques confirmed that 7 out of the 9 tissue data sets contained cancer and normal samples that separated into two clusters, with varying degrees of separation between the clusters.

The data sets were then reduced using two classes of feature space selection techniques: variable selection and dimension reduction. Three variable selection techniques, including the mean difference and Golub selection methods discussed above, were applied to reduce the number of features representing each sample. Two dimension reduction techniques, Principal Components Analysis (PCA) and Non-negative Matrix Factorization (NMF), were used to select arbitrarily sized basis sets to represent the data, thereby transforming the input space into a smaller feature space. The data sets were also reduced to 13 functional classes by using the MIPS yeast database; human genes were assigned to the functional class of their yeast sequence homologues.

The transformed/reduced data sets were then passed to a supervised learning method, the Support Vector Machine (SVM). SVMs with various kernel functions and various soft margin parameters were trained on the transformed data sets, and the jack-knifing technique was used to assign a value to the classification performance of the obtained decision functions. Many configurations (data set, kernel function, C value) obtained excellent classification performance, implying that many of the feature space selection techniques were able to identify critical subsets of genes (from an original set of 6000) capable of separating cancerous tissue from normal tissue. The expression measurements of such genes could be used as a diagnostic tool, in the form of the decision function obtained from support vector classification.

Effectiveness of Methodology

The classification performance of decision functions obtained through the use of SVMs was certainly improved by introducing the various feature space selection techniques. In the process of developing these decision functions, the support vector machines were able to validate the feature selection techniques used to generate the data sets. The low jack-knifing errors associated with the mean difference technique and the Golub selection method indicate that simple techniques which separate noise from signal can discover critical sets of genes. The decision functions obtained from the support vector classification were also useful as a second "filter" of the original set of genes, highlighting critical genes with large coefficients.

Analysis of the support vector classification results also established the utility of the simple linear kernel. For nearly all data sets, the linear kernel achieved the best classification performance. This is likely to remain the case until larger data sets are obtained that can effectively demonstrate any higher-order structure in the data; the poor generalization of the more complex kernels is likely due to overfitting of the data. In addition, the support vector classification results demonstrated the importance of pre-processing oligonucleotide array data.

Bibliography

[1] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. U.S.A., 96:6745-6750, 1999.

[2] O. Alter, P. O. Brown, and D. Botstein. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. U.S.A., 97:10101-10106, 2000.

[3] A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini. Tissue classification with gene expression profiles. pages 1-12, 2000.

[4] M. P. S. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. Ares Jr., and D. Haussler. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. U.S.A., 97:262-267, 2000.

[5] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK, 2000.

[6] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A., 95:14863-14868, 1998.

[7] D. Fasulo. An analysis of recent work on clustering algorithms. pages 1-23, 1999.

[8] T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. UCSC Technical Report, UCSC-CRL-00-04:1-17, 2000.
[9] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science (Washington, D.C.), 286:531-537, 1999.

[10] J. C. Gower. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53:325, 1966.

[11] L. J. Heyer, S. Kruglyak, and S. Yooseph. Exploring expression data: Identification and analysis of coexpressed genes. Genome Res., 9:1106, 1999.

[12] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature (London), 401:788-791, 1999.

[13] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. Adv. Neural Info. Proc. Syst., 13:556-562, 2001.

[14] R. J. Lipshutz, S. P. A. Fodor, T. R. Gingeras, and D. J. Lockhart. High density synthetic oligonucleotide arrays. Nature Genet., 21:20-24, 1999.

[15] H. Lodish, A. Berk, S. L. Zipursky, P. Matsudaira, D. Baltimore, and J. E. Darnell. Molecular Cell Biology. W. H. Freeman and Company, New York, 2000.

[16] H. W. Mewes, D. Frishman, C. Gruber, B. Geier, D. Haase, A. Kaps, K. Lemcke, G. Mannhaupt, F. Pfeiffer, C. Schuller, S. Stocker, and B. Weil. MIPS: A database for genomes and protein sequences. Nucleic Acids Res., 28:37-40, 2000.

[17] P. Pavlidis and W. N. Grundy. Combining microarray expression data and phylogenetic profiles to learn gene functional categories using support vector machines. Columbia University Computer Science Technical Report, CUCS-011-00:1-11, 2000.

[18] J. A. Rice. Mathematical Statistics and Data Analysis. Duxbury Press, Belmont, CA, 1995.

[19] S. Roweis. JPL VC slides. http://www.cs.toronto.edu/ roweis/notes.html, page 5, 1996.

[20] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, 2000.

[21] American Cancer Society. Cancer facts and figures 2001. ACS website, www.cancer.org, 1:2-5, 2001.

[22] G. Strang. Introduction to Linear Algebra. Wellesley-Cambridge Press, Wellesley, MA, 1993.

[23] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lander, and T. R. Golub. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. U.S.A., 96:2907-2912, 1999.

[24] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.

[25] P. P. Zarrinkar, J. K. Mainquist, M. Zamora, D. Stern, J. B. Welsh, L. M. Sapinoso, G. M. Hampton, and D. J. Lockhart. Arrays of arrays for high-throughput gene expression profiling. Genome Research, 11:1256-1261, 2001.