Microarray Gene Expression Data Analysis with SVM

Suvrajit Maji
Department of Computational Biology
University of Pittsburgh
Pittsburgh, PA 15213
smaji@andrew.cmu.edu

1 Introduction and Motivation

Microarrays are capable of determining the expression levels of thousands of genes simultaneously. One important application of gene expression data is the classification of samples into categories. In combination with classification methods, this tool can support clinical management decisions for individual patients, e.g. in oncology. Standard statistical methodologies for classification or prediction do not work well when the number of variables p (genes) far exceeds the number of samples n, which is the case for gene microarray expression data. Several machine learning algorithms have been applied to the analysis of microarray data. Gene expression profiles generated by microarray experiments can be quite complex, since these experiments can involve hypotheses spanning entire genomes. Modification of existing statistical methodologies, or development of new ones, is therefore needed for the analysis of microarray data. It is only natural, then, to apply tried and tested machine learning algorithms to analyze this data in a timely, automated and cost-effective way.

2 Related Work

Recently, with the development of large-scale high-throughput gene expression technology, several studies have reported the application of microarray gene expression data analysis to the molecular classification of cancer ([1], [2], [6]). Work predicting clinical outcomes in breast cancer ([10], [11]) and lymphoma ([9]) from gene expression data has also proven successful. [7] utilized a nearest-neighbor classifier for the classification of acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) in children. [5] performed a systematic comparison of several discrimination methods for the classification of tumors based on microarray experiments.

3 Problem definition

In a microarray classification problem, we are given a training dataset of n sample pairs $\{x_i, y_i\}$, $i = 1, \dots, n$, where $x_i \in \mathbb{R}^d$ is the i-th training sample and $y_i$ is the corresponding class label, which is either +1 or -1. An important goal of the analysis of such data is to determine an explicit or implicit function that maps the points of the feature vector from the input space to an output space. This mapping has to be derived from a finite amount of data, assuming that a proper sampling of the space has been performed. If the predicted quantity is categorical, and if we know the value that corresponds to each element of the training set, then the question becomes how to identify the mapping that connects the feature vector to the corresponding categorical value. This problem is known as the classification problem.
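To make this setup concrete, the following minimal Python/NumPy sketch shows one such mapping on toy data. The data, the shapes, and the nearest-centroid rule are illustrative assumptions only, not the classifier studied in this paper (that is the LS-SVM introduced in the next section):

```python
import numpy as np

# Toy stand-in for a microarray dataset: n samples, d genes (in real data d >> n).
rng = np.random.default_rng(0)
n, d = 8, 100
y = np.array([+1, +1, +1, +1, -1, -1, -1, -1])  # class labels y_i in {+1, -1}
X = rng.normal(size=(n, d))                     # x_i in R^d: expression profile of sample i
X[y == +1] += 0.5                               # shift the positive class so the toy problem is learnable

# One possible explicit mapping f: R^d -> {+1, -1}: a nearest-centroid rule.
mu_pos = X[y == +1].mean(axis=0)                # centroid of the positive class
mu_neg = X[y == -1].mean(axis=0)                # centroid of the negative class

def predict(x):
    """Assign x to the class whose centroid is closer in Euclidean distance."""
    return +1 if np.linalg.norm(x - mu_pos) <= np.linalg.norm(x - mu_neg) else -1

train_preds = np.array([predict(x) for x in X])
print("training error:", np.mean(train_preds != y))
```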
4 Proposed method

The goal of our proposed project is to use supervised learning to classify and predict diseases based on the gene expressions collected from microarrays. Known sets of data will be used to train the machine learning protocols to categorize patients according to their prognosis. The outcome of this study will provide information regarding the efficiency of machine learning techniques, in particular an SVM method.

The basic tool used is a modified version of the SVM called the Least Squares SVM (LS-SVM) classifier, which employs a set of mapping functions to map the input data into a reproducing kernel Hilbert space, where the mapping function is implicitly defined by the kernel function $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$. The efficiency of classification depends on the type of kernel function used, so here we analyze the performance of various kernel functions for the classification task.

5 Description of the Methods

SVMs provide a machine learning algorithm for classification. Gene expression vectors can be thought of as points in an n-dimensional space. The SVM is trained to discriminate between the data points that exhibit a given expression pattern (positive points in the feature space) and the data points that do not (negative points in the feature space). Specifically, the SVM chooses the hyperplane that provides the maximum margin between the plane surface and the positive and negative points. The separating hyperplane is optimal in the sense that it maximizes the distance from the closest data points, which are the support vectors.

The generalized version of the LS-SVM is as follows. Given a training data set $\{x_i, d_i\}$, $i = 1, \dots, N$, where $x_i \in \mathbb{R}^p$ represents a p-dimensional input vector and $d_i = y_i + z_i$ is the scalar measured output, which represents $y_i$ disturbed by some noise $z_i$, a function $y$ is defined as follows:

$$y(x) = w^{T} \Phi(x) + b,$$

where $\Phi(\cdot): \mathbb{R}^p \rightarrow \mathbb{R}^h$ is mostly a non-linear function. The optimization problem is then represented as follows:

$$\min_{w, b, e} \; J(w, e) = \frac{1}{2} w^{T} w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2$$

with constraints:

$$d_i = w^{T} \Phi(x_i) + b + e_i, \quad i = 1, \dots, N.$$

The first term gives a smooth solution; the second term minimizes the training error. From this a Lagrangian is formed:

$$L(w, b, e; \alpha) = J(w, e) - \sum_{i=1}^{N} \alpha_i \left( w^{T} \Phi(x_i) + b + e_i - d_i \right),$$

where the $\alpha_i$ parameters are Lagrange multipliers. The conditions for optimality are:

$$\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{N} \alpha_i \Phi(x_i), \qquad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^{N} \alpha_i = 0,$$

$$\frac{\partial L}{\partial e_i} = 0 \Rightarrow \alpha_i = \gamma e_i, \qquad \frac{\partial L}{\partial \alpha_i} = 0 \Rightarrow w^{T} \Phi(x_i) + b + e_i - d_i = 0,$$

which gives the following overall solution:

$$\begin{bmatrix} 0 & \mathbf{1}^{T} \\ \mathbf{1} & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ d \end{bmatrix},$$

where $d = [d_1, d_2, \dots, d_N]^{T}$, $\alpha = [\alpha_1, \alpha_2, \dots, \alpha_N]^{T}$, and $\Omega_{ij} = \Phi(x_i)^{T} \Phi(x_j) = K(x_i, x_j)$. Here $K(x_i, x_j)$ is the kernel function and $\Omega$ is the kernel matrix. For a given input $x$, the response of an LS-SVM is a weighted sum of N kernel functions, where the center parameters of these kernel functions are determined by the training input vectors $x_i$:

$$y(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b.$$

5.1 Types of kernel functions that can be used

1. Linear kernel: $K(x_i, x_j) = x_i^{T} x_j$.
2. Multilayer perceptron kernel: $K(x_i, x_j) = \tanh(s \, x_i^{T} x_j + t)$, where $s$ is the scale parameter and $t$ is the bias.
3. Polynomial kernel: $K(x_i, x_j) = (x_i^{T} x_j + t)^{d}$, where $t$ is the intercept and $d$ is the degree of the polynomial.
4. Radial basis function kernel: $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / \sigma^2)$, where $\sigma^2$ is the variance of the Gaussian kernel.

5.2 Some new combinations of kernels have also been tested here

5. $K(x_i, x_j) = K_1(x_i, x_j) \cdot K_2(x_i, x_j)$
6. $K(x_i, x_j) = K_1(x_i, x_j) + K_2(x_i, x_j)$

where $K_1$ and $K_2$ are any of the above kernel functions or any new function.
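To illustrate how the pieces above fit together, the following Python/NumPy sketch implements the kernels of Sections 5.1 and 5.2 and solves the LS-SVM linear system. The parameter defaults (s, t, the degree, sigma2, gamma) and the function names are illustrative assumptions, not the exact implementation used for the experiments below:

```python
import numpy as np

# Kernels from Section 5.1; parameter defaults are illustrative assumptions.
def linear_kernel(A, B):
    return A @ B.T

def mlp_kernel(A, B, s=0.01, t=0.0):              # s: scale, t: bias
    return np.tanh(s * (A @ B.T) + t)

def poly_kernel(A, B, t=1.0, degree=3):           # t: intercept, degree of polynomial
    return (A @ B.T + t) ** degree

def rbf_kernel(A, B, sigma2=1.0):                 # sigma2: variance of the Gaussian
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / sigma2)

# Combined kernels from Section 5.2: product and sum of two base kernels.
def product_kernel(A, B, k1=rbf_kernel, k2=poly_kernel):
    return k1(A, B) * k2(A, B)

def sum_kernel(A, B, k1=rbf_kernel, k2=poly_kernel):
    return k1(A, B) + k2(A, B)

def lssvm_fit(X, d, kernel, gamma):
    """Solve [[0, 1^T], [1, Omega + I/gamma]] [b; alpha] = [0; d] for b and alpha."""
    N = X.shape[0]
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0                                # top row: [0, 1^T]
    A[1:, 0] = 1.0                                # first column: [0; 1]
    A[1:, 1:] = kernel(X, X) + np.eye(N) / gamma  # Omega + I/gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], d)))
    return sol[0], sol[1:]                        # bias b, multipliers alpha

def lssvm_predict(Xnew, Xtrain, b, alpha, kernel):
    """Response y(x) = sum_i alpha_i K(x, x_i) + b, thresholded at 0 for class labels."""
    return np.sign(kernel(Xnew, Xtrain) @ alpha + b)
```

For example, b, alpha = lssvm_fit(X_train, d_train, rbf_kernel, gamma=10.0) followed by lssvm_predict(X_test, X_train, b, alpha, rbf_kernel) yields predicted class labels; the combined kernels are passed in the same way.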
6 Experiments

Description of testbed: In this study the classifiers are tested on a few high-dimensional datasets, namely the hepatocellular carcinoma data (Iizuka et al., 2003) and the high-grade glioma data (Nutt et al., 2003).

Hepatocellular carcinoma data: The training set consists of 33 hepatocellular carcinoma tissues, of which 12 suffer from early intrahepatic recurrence and 21 do not. The test set consists of 27 hepatocellular carcinoma tissues, of which 8 suffer from early intrahepatic recurrence and 19 do not. The number of gene expression levels is 7129.

Glioma data: 50 high-grade glioma samples were carefully selected, 28 glioblastomas and 22 anaplastic oligodendrogliomas; a total of 21 classic tumors was selected, and the remaining 29 samples were considered non-classic tumors. The training set consists of 21 gliomas with classic histology, of which 14 are glioblastomas and 7 are anaplastic oligodendrogliomas. The test set consists of 29 gliomas with non-classic histology, of which 14 are glioblastomas and 15 are anaplastic oligodendrogliomas. The number of gene expression levels is 12625.

The experiment mainly looks into the classification performance of the classifier and tries to answer the following questions: What needs to be done in the preprocessing steps? What about dimensionality reduction? How are the microarray data classified by the classifier? What are the model parameters of the classifier kernel functions, and how do they affect the efficiency of the method? What type of kernel function should be used for the classification?
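Since the tables below report a leave-one-out cross-validation (LOOCV) cost, a brief sketch of how such a cost can be computed is given here. It reuses the illustrative lssvm_fit and lssvm_predict helpers from the sketch in Section 5 and assumes the misclassification rate as the cost; the cost function used to produce the reported numbers may differ:

```python
import numpy as np

def loocv_cost(X, d, kernel, gamma):
    """LOOCV: hold out each sample in turn, train on the rest, count misclassifications."""
    N = X.shape[0]
    errors = 0
    for i in range(N):
        mask = np.arange(N) != i                 # leave sample i out
        b, alpha = lssvm_fit(X[mask], d[mask], kernel, gamma)
        pred = lssvm_predict(X[i:i + 1], X[mask], b, alpha, kernel)
        errors += int(pred[0] != np.sign(d[i]))
    return errors / N                            # fraction of held-out misclassifications
```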
7 Observations

For the glioma data set: bias b = 0.3333; test error = 0.5172; train error = 0. For the hepatocellular carcinoma data: bias b = -0.2727; test error = 0.2963; train error = 0.

7.1 Tables

Table 1: Results for the single kernel functions (LOOCV cost is reported once per kernel).

Kernel    gamma   Bias      Train error   Test error   LOOCV cost
RBF       0.01    -0.2727   0             0.2963       0.6667
RBF       10      -0.2727   0             0.2963
Linear    0.01    -0.2727   0             0.4074       1.5788
Linear    10      -0.2727   0             0.4074
MLP       0.01    -2.8369   0             0.4074       1.5233
MLP       10      -2.8369   0.1515        0.2963
Poly      0.01    -0.3340   0             0.3636       0.8222
Poly      10      -0.3340   0             0.2963

Table 2: Results for the combined kernel functions (LOOCV cost is reported once per kernel).

Kernel         gamma   Bias      Train error   Test error   LOOCV cost
RBF            0.01    -0.2727   0.3636        0.2963       0.6667
RBF            10      -0.2727   0             0.2963
Linear & MLP   0.01    -0.2727   0             0.3333       1.5778
Linear & MLP   10      -0.2727   0             0.4704
MLP & RBF      0.01    -2.9015   0.0606        0.4074       1.5333
MLP & RBF      10      -2.8369   0.1515        0.4074
Poly & RBF     0.01    -0.3540   0             0.2963       0.8222
Poly & RBF     10      -0.3340   0             0.2963

In both cases the RBF and polynomial kernels did better.

7.2 Graphs and plots

Figure 1: Classification of the glioma gene expression data using the RBF kernel.

Figure 2: Classification of the glioma gene expression data using the polynomial kernel. A similar classification result is produced by this kernel function, possibly because the data, once mapped into the higher-dimensional space, have a similar distribution for the different classifiers.

Figure 3: Cost of leave-one-out cross-validation using the single kernel functions. Here the linear and multilayer perceptron kernels performed better than the RBF kernel, and the polynomial kernel was consistent for different gamma.

Figure 4: Cost of leave-one-out cross-validation using the combinations of kernel functions. Here the combination of the linear and multilayer perceptron kernels performed better than the rest, and the RBF & polynomial combination was consistent for all gamma.

Figure 5: Cost of cross-validation using the single kernel functions. Here the multilayer perceptron kernel performed better than the rest.

Figure 6: Cost of cross-validation using combinations of kernel functions. Here the combination of the RBF & polynomial kernels performed better than the rest, and also better than the RBF kernel alone.

8 Conclusions

The LS-SVM in general removes the gene expression data examples which are irrelevant, but in doing so some information is lost; that is, the weighted solution is not very sparse. The modified LS-SVM, however, produces a sparse solution, so dimensionality reduction during preprocessing is obtained while maintaining sparseness. Various kernel functions were tested and their performance measured. In most cases the RBF kernel performed better than the other kernel functions in terms of accuracy, but in terms of computational efficiency the other kernels did better. In terms of performance cost, the combined kernel functions also performed well in some cases compared to the single kernels. In future work, performance could be enhanced by tuning the parameter values to improve the classification.

References

[1] Alon et al., 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, 96(12):6745-6750.
[2] Bittner et al., 2000. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature, 406:536-540.
[3] Brown et al., 1999. Support vector machine classification of microarray gene expression data. Technical report UCSC-CRL-99-99.
[4] Cho and Won, 2003. Machine learning in DNA microarray analysis for cancer classification. APBC 2003:189-198.
[5] Dudoit et al., 2002. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457):77-87.
[6] Furey et al., 2000. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10):906-914.
[7] Golub et al., 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531-537.
[8] Nutt et al., 2003. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Research, 63(7):1602-1607.
[9] Shipp et al., 2002. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine, 8(1):68-74.
[10] van 't Veer et al., 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415(6871):530-536.
[11] West et al., 2001. Predicting the clinical status of human breast cancer by using gene expression profiles. Proc. Natl. Acad. Sci. USA, 98:11462-11467.