Microarray Gene Expression Data Analysis with SVM

Suvrajit Maji
Department of Computational Biology
University of Pittsburgh
Pittsburgh, PA 15213
smaji@andrew.cmu.edu
1 Introduction and Motivation
Microarrays are capable of determining the expression levels of thousands of genes
simultaneously. One important application of gene expression data is classification of
samples into categories. In combination with classification methods, this tool can be
useful to support clinical management decisions for individual patients, e.g. in oncology.
Standard statistical methodologies for classification or prediction do not work well when the
number of variables p (genes) far exceeds the number of samples n, which is the case
in microarray gene expression data.
Several machine learning algorithms have been applied to the analysis of microarray
data. Gene expression profiles generated by microarray experiments can be quite
complex, since these experiments can involve hypotheses spanning entire genomes. So,
modification of existing statistical methodologies or development of new methodologies
is needed for the analysis of microarray data. It is only natural, then, to apply tried and
tested machine learning algorithms to analyze these data in a timely, automated, and
cost-effective way.
2 Related Work
Recently, with the development of large-scale high-throughput gene expression
technology, several studies have been reported on the application of microarray gene
expression data analysis for the molecular classification of cancer ([1], [2], [6]). Also,
work that predicts clinical outcomes in breast cancer ([10], [11]) and lymphoma ([9])
from gene expression data has proven successful. [7] used a nearest-neighbor classifier
for the classification of acute myeloid leukemia (AML) and acute lymphoblastic
leukemia (ALL) in children. [5] performed a systematic comparison of
several discrimination methods for classification of tumors based on microarray
experiments.
3 Problem definition
In a microarray classification problem, we are given a training dataset of $n$ sample
pairs $\{x_i, y_i\}$, $i = 1, \ldots, n$, where $x_i \in \mathbb{R}^d$ is the $i$-th
training sample and $y_i$ is the corresponding class label, which is either +1 or -1.
An important goal of the analysis of such data is to determine an explicit or implicit
function that maps the points of the feature vector from the input space to an output
space. This mapping has to be derived from a finite number of data points, assuming
that a proper sampling of the space has been performed. If the predicted quantity is
categorical and we know the value that corresponds to each element of the training set,
then the question becomes how to identify the mapping that connects the feature vector
and the corresponding categorical value. This problem is known as the classification
problem.
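As a concrete illustration of this setup (a sketch with synthetic values; the sample and gene counts are borrowed from the hepatocellular carcinoma dataset described in Section 6):

```python
import numpy as np

# Synthetic stand-in for a microarray training set: n = 33 samples,
# d = 7129 genes (the shape of the hepatocellular carcinoma data below).
rng = np.random.default_rng(0)
X = rng.standard_normal((33, 7129))        # x_i in R^d, one row per sample
y = np.where(rng.random(33) < 0.5, 1, -1)  # class labels y_i in {+1, -1}

# The classification problem: learn f: R^d -> {+1, -1} from the pairs (x_i, y_i).
print(X.shape, y[:5])
```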
4 Proposed method
The goal of our proposed project is to use supervised learning to classify and predict
diseases based on the gene expression levels collected from microarrays. Known sets of data
will be used to train the machine learning protocols to categorize patients according to
their prognosis. The outcome of this study will provide information regarding the
efficiency of machine learning techniques, in particular an SVM method.
The basic tool used is a modified version of the SVM called the Least Squares SVM
(LS-SVM) classifier, which employs a set of mapping functions to map the input data
into a reproducing kernel Hilbert space, where the mapping function $\Phi$ is implicitly
defined by the kernel function $k(x_i, x_j) = \Phi(x_i)^\top \Phi(x_j)$.
The efficiency of classification depends on the type of kernel function that is used,
so here we will analyze the performance of various kernel functions used for
classification.
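As a small sanity check of this identity (our illustration, not part of the original report): for two-dimensional inputs and the explicit quadratic feature map $\Phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$, the feature-space inner product equals the polynomial kernel $(x_i^\top x_j)^2$, so the kernel can be evaluated without ever forming $\Phi$:

```python
import numpy as np

def phi(x):
    # Explicit quadratic feature map for 2-d inputs
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

lhs = phi(x) @ phi(z)        # inner product in feature space
rhs = (x @ z) ** 2           # polynomial kernel in input space
assert np.isclose(lhs, rhs)  # both equal 16.0 here
```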
5 Description of the Methods
SVMs provide a machine learning algorithm for classification. Gene expression
vectors can be thought of as points in an n-dimensional space. The SVM is then
trained to discriminate between the data points that show a given pattern (positive
points in the feature space) and the data points that do not show that pattern (negative
points in the feature space). Specifically, the SVM chooses the hyperplane that provides
the maximum margin between the plane surface and the positive and negative points. The
separating hyperplane is optimal in the sense that it maximizes the distance from the
closest data points, which are the support vectors.
The generalized version of the LS-SVM is as follows. Given a training data set
$\{x_i, d_i\}$, $i = 1, \ldots, N$, where $x_i \in \mathbb{R}^p$ represents a
$p$-dimensional input vector and $d_i = y_i + z_i$ is the scalar measured output, which
represents $y_i$ disturbed by some noise $z_i$, a function $y$ is defined as follows:

$$y(x) = w^\top \Phi(x) + b$$

The map $\Phi(\cdot): \mathbb{R}^p \rightarrow \mathbb{R}^h$ is mostly a non-linear
function. The optimization problem is then represented as follows:

$$\min_{w,b,e} \; J(w, e) = \frac{1}{2} w^\top w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2$$

with constraints:

$$d_i = w^\top \Phi(x_i) + b + e_i, \qquad i = 1, \ldots, N$$

The first term gives a smooth solution; the second term minimizes the training error.
From this a Lagrangian is formed:

$$L(w, b, e; \alpha) = J(w, e) - \sum_{i=1}^{N} \alpha_i \left( w^\top \Phi(x_i) + b + e_i - d_i \right)$$

where the $\alpha_i$ parameters are Lagrange multipliers. The conditions for optimality are:

$$\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{N} \alpha_i \Phi(x_i), \qquad
\frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^{N} \alpha_i = 0,$$
$$\frac{\partial L}{\partial e_i} = 0 \Rightarrow \alpha_i = \gamma e_i, \qquad
\frac{\partial L}{\partial \alpha_i} = 0 \Rightarrow w^\top \Phi(x_i) + b + e_i - d_i = 0,$$

which gives the following overall solution:

$$\begin{bmatrix} 0 & \mathbf{1}^\top \\ \mathbf{1} & \Omega + \gamma^{-1} I \end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ d \end{bmatrix}$$

where $d = [d_1, d_2, \ldots, d_N]^\top$, $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_N]^\top$,
$\mathbf{1} = [1, \ldots, 1]^\top$, and

$$\Omega_{ij} = \Phi(x_i)^\top \Phi(x_j) = K(x_i, x_j).$$

Here $K(x_i, x_j)$ is the kernel function and $\Omega$ is the kernel matrix. For a given
input $x$, the response of an LS-SVM is a weighted sum of $N$ kernel functions, where the
center parameters of these kernel functions are determined by the training input vectors $x_i$:

$$y(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b$$
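A minimal sketch of this training procedure and the resulting predictor in Python/NumPy (an illustration of the linear system above, not the implementation used in this study; the function and parameter names are ours):

```python
import numpy as np

def lssvm_train(X, d, kernel, gamma):
    """Solve the LS-SVM linear system for the bias b and multipliers alpha.

    X: (N, p) training inputs; d: (N,) targets in {+1, -1};
    kernel: k(X1, X2) -> Gram matrix; gamma: regularization parameter.
    """
    N = X.shape[0]
    Omega = kernel(X, X)                      # kernel matrix
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0                            # top row:  [0, 1^T]
    A[1:, 0] = 1.0                            # left col: [1, ...]
    A[1:, 1:] = Omega + np.eye(N) / gamma     # Omega + gamma^{-1} I
    rhs = np.concatenate(([0.0], d))          # right-hand side [0; d]
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                    # b, alpha

def lssvm_predict(X_test, X_train, b, alpha, kernel):
    """y(x) = sum_i alpha_i K(x, x_i) + b, thresholded to {+1, -1}."""
    return np.sign(kernel(X_test, X_train) @ alpha + b)
```

Solving the (N+1) x (N+1) system is inexpensive here because the number of samples N is small, even though the gene dimension p runs into the thousands.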
5.1 Types of kernel functions that can be used
1. Linear kernel: $K(x_i, x_j) = x_i^\top x_j$.
2. Multilayer perceptron kernel: $K(x_i, x_j) = \tanh(s \, x_i^\top x_j + t)$, where $s$ is the scale parameter and $t$ is the bias.
3. Polynomial kernel: $K(x_i, x_j) = (x_i^\top x_j + t)^d$, where $t$ is the intercept and $d$ is the degree of the polynomial.
4. Radial basis function kernel: $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / \sigma^2)$, where $\sigma^2$ is the variance of the Gaussian kernel.
5.2 Some new combinations of kernels have also been tested here
5. $K(x_i, x_j) = K_1(x_i, x_j) \cdot K_2(x_i, x_j)$
6. $K(x_i, x_j) = K_1(x_i, x_j) + K_2(x_i, x_j)$
where $K_1$ and $K_2$ are any of the above kernel functions or any new kernel function.
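The four standard kernels and the two combinations can be sketched as Gram-matrix functions as follows (the parameter defaults are arbitrary illustrative choices):

```python
import numpy as np

def linear(X1, X2):
    # Linear kernel: <x_i, x_j>
    return X1 @ X2.T

def mlp(X1, X2, s=1.0, t=1.0):
    # Multilayer perceptron kernel: tanh(s <x_i, x_j> + t)
    return np.tanh(s * (X1 @ X2.T) + t)

def poly(X1, X2, t=1.0, deg=3):
    # Polynomial kernel: (<x_i, x_j> + t)^d
    return (X1 @ X2.T + t) ** deg

def rbf(X1, X2, sigma2=1.0):
    # Gaussian kernel: exp(-||x_i - x_j||^2 / sigma^2)
    sq = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T)
    return np.exp(-sq / sigma2)

def product_kernel(k1, k2):
    # Combination 5: elementwise product of two Gram matrices
    return lambda X1, X2: k1(X1, X2) * k2(X1, X2)

def sum_kernel(k1, k2):
    # Combination 6: sum of two Gram matrices
    return lambda X1, X2: k1(X1, X2) + k2(X1, X2)
```

Both combinations remain valid kernels: sums and elementwise products of positive semidefinite kernel matrices are again positive semidefinite.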
6 Experiments
Description of testbed: In this study the classifiers are tested on a few high-dimensional
datasets, such as the hepatocellular carcinoma data (Iizuka et al., 2003) and the
high-grade glioma data (Nutt et al., 2003).
Hepatocellular carcinoma data: The training set consists of 33 hepatocellular carcinoma
tissues, of which 12 show early intrahepatic recurrence and 21 do not. The test set
consists of 27 hepatocellular carcinoma tissues, of which 8 show early intrahepatic
recurrence and 19 do not. The number of gene expression levels is 7129.
Glioma data: 50 high-grade glioma samples were carefully selected: 28 glioblastomas and
22 anaplastic oligodendrogliomas. Of these, 21 were classic tumors and the remaining 29
samples were considered non-classic tumors. The training set consists of the 21 gliomas
with classic histology, of which 14 are glioblastomas and 7 are anaplastic
oligodendrogliomas. The test set consists of the 29 gliomas with non-classic histology,
of which 14 are glioblastomas and 15 are anaplastic oligodendrogliomas. The number of
gene expression levels is 12625.
The experiment will mainly look into the classification performance of the classifier
and try to answer the following questions: What do we need to do in the preprocessing
steps? What about dimensionality reduction? How are the microarray data classified by
the classifier? What are the model parameters of the classifier's kernel functions, and
how do they affect the efficiency of the method? What type of kernel function should be
used for the classification?
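One plausible answer to the preprocessing and dimensionality-reduction questions is sketched below (the specific choices, such as keeping the 500 most variable genes, are illustrative assumptions rather than the exact pipeline used here):

```python
import numpy as np

def preprocess(X_train, X_test, n_genes=500):
    """Standardize genes and keep the n_genes with highest training variance.

    n_genes = 500 is an arbitrary illustrative choice; test data are
    transformed with training-set statistics only, to avoid leakage.
    """
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0) + 1e-8                # avoid division by zero
    keep = np.argsort(sd)[::-1][:n_genes]          # most variable genes
    Xtr = (X_train[:, keep] - mu[keep]) / sd[keep]
    Xte = (X_test[:, keep] - mu[keep]) / sd[keep]
    return Xtr, Xte
```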
7 Observations
For the glioma data set: bias b = 0.3333; test error = 0.5172; train error = 0.
For the hepatocellular carcinoma data: bias b = -0.2727; test error = 0.2963; train error = 0.
7.1 Tables
Table 1: Standard kernel functions (hepatocellular carcinoma data; one LOOCV cost per kernel).

Kernel     γ       Bias       Train error   Test error   LOOCV cost
RBF        0.01    -0.2727    0             0.2963       0.6667
RBF        10      -0.2727    0             0.2963
Linear     0.01    -0.2727    0             0.4074       1.5788
Linear     10      -0.2727    0             0.4074
MLP        0.01    -2.8369    0             0.4074       1.5233
MLP        10      -2.8369    0.1515        0.2963
Poly       0.01    -0.3340    0             0.3636       0.8222
Poly       10      -0.3340    0             0.2963
Table 2: Combined kernel functions (hepatocellular carcinoma data; one LOOCV cost per kernel).

Kernel         γ       Bias       Train error   Test error   LOOCV cost
RBF            0.01    -0.2727    0.3636        0.2963       0.6667
RBF            10      -0.2727    0             0.2963
Linear & MLP   0.01    -0.2727    0             0.3333       1.5778
Linear & MLP   10      -0.2727    0             0.4704
MLP & RBF      0.01    -2.9015    0.0606        0.4074       1.5333
MLP & RBF      10      -2.8369    0.1515        0.4074
Poly & RBF     0.01    -0.3540    0             0.2963       0.8222
Poly & RBF     10      -0.3340    0             0.2963
In both cases the RBF and polynomial kernels did better.
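The LOOCV cost column could in principle be produced by a loop of the following form (a sketch reusing the lssvm_train and lssvm_predict functions sketched in Section 5; the 0/1 misclassification cost used here is an assumption, and the cost values above 1 in the tables suggest the study used a different cost function):

```python
import numpy as np

def loocv_cost(X, d, kernel, gamma):
    """Leave-one-out cross-validation with a 0/1 misclassification cost."""
    N = len(d)
    errors = 0
    for i in range(N):
        mask = np.arange(N) != i                  # hold out sample i
        b, alpha = lssvm_train(X[mask], d[mask], kernel, gamma)
        pred = lssvm_predict(X[i:i+1], X[mask], b, alpha, kernel)
        errors += int(pred[0] != d[i])
    return errors / N
```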
7.2 Graphs and plots
Figure 1: Classification of the glioma gene expression data using the RBF kernel.
Figure 2: Classification of the glioma gene expression data using the polynomial kernel.
This kernel produces a similar classification result, possibly because the data, once
mapped into the higher-dimensional space, have a similar distribution for the different
classifiers.
Figure 3: Cost of leave-one-out cross-validation on the data using the standard kernel
functions. Here the linear and multilayer perceptron kernels performed better than the
RBF kernel, and the polynomial kernel was consistent across different values of gamma.
Figure 4: Cost of leave-one-out cross-validation on the data using the combinations of
kernel functions. Here the combination of the linear and multilayer perceptron kernels
performed better than the rest, and the RBF & polynomial kernel was consistent for all
gamma.
Figure 5: Cost of cross-validation on the data using the general kernel functions. Here
the multilayer perceptron kernel performed better than the rest.
Figure 6: Cost of cross-validation on the data using the combinations of kernel
functions. Here the combination of the RBF & polynomial kernels performed better than
the rest, and also better than RBF alone.
8 Conclusions
The LS-SVM in general removes the gene expression data examples that are irrelevant, but
in doing so some information is lost; that is, the weighted solution is not very sparse.
The modified LS-SVM, however, produces a sparse solution, so dimension reduction during
preprocessing is obtained while maintaining sparseness. Various kernel functions were
tested and their performances were measured. In most cases the RBF kernel function
performed better than other kernel functions in terms of accuracy, but in terms of
computational efficiency other kernels did better. In terms of performance cost, the
combined kernel functions also performed well in some cases in comparison with the
standard kernels. In future work the performance can be enhanced by tuning the parameter
values, thereby improving the classification.
References
[1]. Alon et al., 1999. Broad patterns of gene expression revealed by clustering
analysis of tumor and normal colon tissues probed by oligonucleotide arrays.
Proc Natl Acad Sci USA, 96(12):6745-6750.
[2]. Bittner et al., 2000. Molecular classification of cutaneous malignant melanoma
by gene expression profiling. Nature, 406:536-540.
[3]. Brown et al., 1999. Support Vector Machine Classification of Microarray Gene
Expression Data. UCSC-CRL-99-99.
[4]. Cho and Won, 2003. Machine Learning in DNA Microarray Analysis for
Cancer Classification. APBC 2003:189-198.
[5]. Dudoit et al., 2002. Comparison of discrimination methods for the classification
of tumors using gene expression data. Journal of the American Statistical
Association, 97(457):77-87.
[6]. Furey et al., 2000. Support vector machine classification and validation of
cancer tissue samples using microarray expression data. Bioinformatics, 16(10):906-914.
[7]. Golub et al., 1999. Molecular classification of cancer: class discovery and class
prediction by gene expression monitoring. Science, 286(5439):531-537.
[8]. Nutt et al., 2003. Gene expression-based classification of malignant gliomas
correlates better with survival than histological classification. Cancer Res,
63(7):1602-1607.
[9]. Shipp et al., 2002. Diffuse large B-cell lymphoma outcome prediction by
gene-expression profiling and supervised machine learning. Nat Med, 8(1):68-74.
[10]. Veer et al., 2002. Gene expression profiling predicts clinical outcome of
breast cancer. Nature, 415(6871):530-536.
[11]. West et al., 2001. Predicting the clinical status of human breast cancer by
using gene expression profiles. PNAS, 98:11462-11467.