Classification Performance of Support Vector
Machines on Genomic Data utilizing Feature
Space Selection Techniques
by
Jason P. Sharma
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer
Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
January 2002
© Massachusetts Institute of Technology 2002. All rights reserved.
Author: Department of Electrical Engineering and Computer Science, Jan 18, 2002

Certified by: Bruce Tidor, Associate Professor, EECS and BEH, Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students
Classification Performance of Support Vector Machines on
Genomic Data utilizing Feature Space Selection Techniques
by
Jason P. Sharma
Submitted to the Department of Electrical Engineering and Computer Science
on Jan 18, 2002, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
Oligonucleotide array technology has recently enabled biologists to study the cell from
a systems level by providing expression levels of thousands of genes simultaneously.
Various computational techniques, such as Support Vector Machines (SVMs), have
been applied to these multivariate data sets to study diseases that operate at the transcriptional level. One such disease is cancer. While SVMs have been able to provide
decision functions that successfully classify tissue samples as cancerous or normal
based on the data provided by the array technology, it is known that by reducing the
size of the feature space a more generalizable decision function can be obtained. We
present several feature space selection methods, and show that the classification performance of SVMs can be dramatically improved when using the appropriate feature
space selection method. This work proposes that such decision functions can then
be used as diagnostic tools for cancer. We also propose several genes that appear
to be critical to the differentiation of cancerous and normal tissues, based on the
computational methodology presented.
Thesis Supervisor: Bruce Tidor
Title: Associate Professor, EECS and BEH
Acknowledgments
I would like to first thank my thesis advisor, Prof. Bruce Tidor, for providing me the
opportunity to get involved in the field of bioinformatics with this thesis. In the course
of working on this project, I learned much more than bioinformatics. The list is too
extensive to describe here, but it starts with a true appreciation and understanding
for the research process.
Many thanks go to all the members of the Tidor lab, who quickly became more
than just colleagues (and lunch companions). Relating to my project, I would like to
thank Phillip Kim for his help and advice throughout my time in the lab. I would
also like to thank Bambang Adiwijaya for providing me access to his pm-mm pattern
technique.
I am also indebted to my friend and roommate, Ashwinder Ahluwalia. His encouragement and support were critical to the completion of this project.
Final thanks go to Steve Gunn for providing a freely available MATLAB implementation of the soft margin SVM used in this project.
Contents

1 Introduction
   1.1 Motivation
       1.1.1 Cancer Mortality
       1.1.2 Usage and Availability of Genechip Data
   1.2 Related Array Analysis
       1.2.1 Unsupervised Learning Techniques
       1.2.2 Supervised Learning Techniques
   1.3 Approach to Identifying Critical Genes

2 Background
   2.1 Basic Cancer Biology
   2.2 Oligonucleotide Arrays
   2.3 Data Set Preparation
   2.4 Data Preprocessing Techniques

3 Visualization
   3.1 Hierarchical Clustering
   3.2 Multidimensional Scaling (MDS)
   3.3 Locally Linear Embedding (LLE)
   3.4 Information Content of the Data

4 Feature Space Reduction
   4.1 Variable Selection Techniques
       4.1.1 Mean-difference Statistical Method
       4.1.2 Golub Selection Method
       4.1.3 SNP Detection Method
   4.2 Dimension Reduction Techniques
       4.2.1 Principal Components Analysis (PCA)
       4.2.2 Non-negative Matrix Factorization (NMF)
       4.2.3 Functional Classification

5 Supervised Learning with Support Vector Machines
   5.1 Support Vector Machines
       5.1.1 Motivation for Use
       5.1.2 General Mechanism of SVMs
       5.1.3 Kernel Functions
       5.1.4 Soft Margin SVMs
   5.2 Previous Work Using SVMs and Genechips
   5.3 Support Vector Classification (SVC) Results
       5.3.1 SVC of Functional Data Sets
       5.3.2 SVC of NMF & PCA Data Sets
       5.3.3 SVC of Mean Difference Data Sets
       5.3.4 SVC of Golub Data Sets
   5.4 Important Genes Found

6 Conclusions
List of Figures

2-1 Histogram of unnormalized gene expression for one array
2-2 Histogram of log normalized gene expression for same array
3-1 Experiments 1, 2, 8, and 10 are normal breast tissues. Hierarchical clustering performed using euclidean distance and most similar pair replacement policy
3-2 Well clustered data with large separation using log normalized breast tissue data set
3-3 Poorly separated data, but the two sets of samples are reasonably clustered
3-4 LLE trained to recognize S-curve [20]
3-5 Branching of data in LLE projection
3-6 MDS projections maintain relative distance information
3-7 Possible misclassified sample
4-1 Principal components of an ellipse are the primary and secondary axis
4-2 Functional Class Histogram
5-1 A maximal margin hyperplane that correctly classifies all points on either half-space [19]
List of Tables

2.1 Tissue Data Sets
3.1 Linear Separability of the Tissue Data
4.1 Genes Selected by Mean Difference Technique
5.1 Jack-knifing Error Percentages
5.2 15 genes with most "votes"
Chapter 1
Introduction
1.1
Motivation
Oligonucleotide array technology has recently been developed, which allows the measurement of expression levels of thousands of genes simultaneously. This technology
provides biologists with an opportunity to view the cell at a systems level instead
of one subsystem at a time in terms of gene expression. Clearly, this technology is
suited for study of diseases that have responses at the transcriptional level.
One such disease is cancer. Many studies have been done using oligonucleotide
arrays to gather information about how cancer expresses itself in the genome. In the
process of using these arrays, massive amounts of data have been produced.
This data has, however, proven difficult for individuals to just "look" at and
extract significant new insights due to its noisy and high dimensional nature. Computational techniques are much better suited, and hence are required to make proper
use of the data obtained using oligonucleotide arrays.
1.1.1
Cancer Mortality
Cancer is the second leading cause of death. Nearly 553,000 people are expected to die of cancer in the year 2001 [21]. Efforts on various fronts utilizing array technology
are currently underway to increase our understanding of cancer's mechanism.
One effort is focused on identifying all genes that are related to the manifestation
of cancer. While a few critical genes have been identified and researched (such as
p53, Ras, or c-Myc), many studies indicate that there are many more genes that are
related to the varied expression of cancer. Cancer has been shown to be a highly
differentiated disease, occurring in different organs, varying in the rate of metastasis,
as well as varying in the histology of damaged tissue.
It has been hypothesized that different sets of "misbehaving" genes may be causing these variances, but these genes have not yet been identified for many of the different forms of cancer. Identifying these various genes could aid us in understanding the mechanism of cancer, and potentially lead to new treatments. Monitoring the expression of several thousands of genes using oligonucleotide arrays has increased the probability of finding cancer-causing genes.
Another effort has focused on utilizing oligonucleotide arrays as diagnosis tools
for cancer. Given our current methods of treatment, it has been found that the most
effective way of treating cancer is to diagnose its presence as early as possible and as
specifically as possible. Specific treatments have been shown to perform significantly
better on some patients compared to others, and hence the better we are able to
identify the specific type of cancer a patient has, the better we are able to treat that
patient. A diagnosis tool that can accurately and quickly determine the existence
and type of cancer would clearly be beneficial. Again, because cancer has been shown
to be a genetic disease, many efforts have been made to provide a cancer diagnosis
tool that measures the expression of genes in cells. Since such technologies have
been recently developed, most of the focus has been on finding a set of genes which
accurately determine the existence of cancer.
The goal of this thesis is to aid in both of the efforts mentioned above, namely to discover "new" genes that are causes of various types of cancers, and, using these genes, to provide an accurate cancer diagnosis tool.
1.1.2
Usage and Availability of Genechip Data
The usage of oligonucleotide arrays is not limited to the study of cancer. Oligonucleotide arrays have been used to study the cell cycle progression in yeast, specific
cellular pathways such as glycolysis, etc. In each case, however, the amount of data
is large, highly dimensional, and typically noisy. Computational techniques that can
extract significant information are extremely important in realizing the potential of
the array technology.
Therefore, an additional goal of this thesis was to contribute a new process of
array analysis, using a combination of computational learning techniques to extract
significant information present in the oligonucleotide array data.
1.2
Related Array Analysis
Because cancer is primarily a genetic disease (see Ch 2.1), there have been numerous studies of various types of cancer using oligonucleotide arrays to identify cancer-related genes. The typical study focuses on one organ-specific cancer, such as
prostate cancer. First, tissue samples from cancerous prostates (those with tumors)
and healthy prostates are obtained. The mRNA expression levels of the cells in each
of the samples is then measured, using an oligonucleotide array per sample.
The
resulting data set can be compactly described as an expression matrix V, where each column represents the expression vector of an experiment and each row represents the expression vector of a gene. Each element V_{ij} is the relative mRNA expression level of the i-th gene in the j-th sample:

\[
V = \begin{pmatrix}
V_{11} & V_{12} & \cdots & V_{1m} \\
V_{21} & V_{22} & \cdots & V_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
V_{n1} & V_{n2} & \cdots & V_{nm}
\end{pmatrix}
\]

where m is the number of experiments and n is the number of genes.
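In code, this convention corresponds to a genes-by-samples array; the sketch below builds a small hypothetical matrix purely to fix the indexing (V[i, j] is the expression of gene i in sample j), with arbitrary sizes:

```python
import numpy as np

n_genes, m_samples = 6000, 25            # hypothetical sizes
V = np.random.rand(n_genes, m_samples)   # rows = genes, columns = experiments

gene_vector = V[0, :]     # expression of one gene across all experiments
sample_vector = V[:, 0]   # expression of all genes in one experiment
```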
The study then introduces a new computational technique to analyze the data
gathered. The ability of the technique to extract information is first validated by analyzing a known data set and/or by comparing the results obtained through other traditional techniques. Typically, the new technique is then used to make new assertions, which may be cause for future study. All of the techniques can be classified into two categories: unsupervised and supervised learning techniques. A description of
the techniques applied to array data follows.
Unsupervised learning techniques focus on finding structure in the data, while
supervised learning techniques focus on making classifications. The main difference
between unsupervised and supervised learning techniques is that supervised learning
techniques use user-supplied information about what class an object belongs to, while
unsupervised techniques do not.
1.2.1
Unsupervised Learning Techniques
Unsupervised learning techniques focus on discovering structure in the data. The
word "discover" is used because unlike supervised learning techniques, unsupervised
techniques have no prior knowledge (nor any notion) about what class a sample
belongs to. Most of the techniques find structure in the data based on similarity
measures.
Clustering
One major class of unsupervised learning techniques is known as clustering. Clustering is equivalent to grouping similar objects based on some similarity measure.
A similarity function
f(Y, z)
is typically called repeatedly in order to achieve the
final grouping. In the context of arrays, clustering has been used to group similar
experiments and to group similar genes. Researchers have been able to observe the
expression of cellular processes across different experiments by observing the expression of sets of genes obtained from clustering, and furthermore associate functions
for genes that were previously unannotated. By clustering experiments, researchers
have, for example, been able to identify new subtypes of cancers. Three algorithms
that have been utilized to do clustering on genomic data are hierarchical clustering,
k-means clustering, and Self Organizing Maps (SOMs).
Hierarchical clustering was first applied to oligonucleotide array data by Eisen
et al. [6].
The general algorithm uses a similarity metric to determine the highest
correlated pair of objects among the set of objects to be clustered C. This pair is then
clustered together, removed from the set C, and the average of the pair is added to
C. The next iteration proceeds similarly, with the highest correlated pair in C being
removed and being replaced by a single object. This process continues until C has only
one element. The clustering of pairs of objects can be likened to leaves of a tree being
joined at a branch, and hence a dendrogram can be used to represent the correlations
between each object. Eisen et al. used several similarity measures and tried different
remove/replace policies for the algorithm. They were able to demonstrate the ability
of this technique by successfully grouping genes of known similar function for yeast.
Specifically, by clustering on 2,467 genes using a set of 12 time-course experiments that
measured genomic activity throughout the yeast cell cycle, several clusters of genes
corresponding to a common cellular function were found. One cluster was comprised
of genes encoding ribosomal proteins. Other clusters found contained genes involved
in cholesterol biosynthesis, signaling and angiogenesis, and tissue remodeling and
wound healing.
K-means clustering is a technique that requires the user to input the number of
expected clusters, and often, the coordinates of the centroids in n dimensional space
(where n corresponds to the number of variables used to represent each object).
Once this is done, all points are assigned to their closest centroid (based on some
similarity metric, not necessarily euclidean distance), and the centroid is recomputed
by averaging the positions of all the points assigned to it. This process is repeated
until each centroid does not move after recomputing its position.
This technique
effectively groups objects together in the desired number of bins.
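To make the procedure concrete, the following is a minimal NumPy sketch of the k-means loop just described; it is illustrative only (not the implementation used in any of the cited studies), and the choice of euclidean distance and random initialization are assumptions of the example:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: X is (n_samples, n_features); k is the expected number of clusters."""
    rng = np.random.default_rng(seed)
    # initialize centroids by picking k distinct samples at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its closest centroid (euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids
```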
Self Organizing Maps (SOMs) is another clustering technique that is quite similar
to k-means [23]. This technique requires a single input: the geometry in which the expected clusters are arranged. For example, a 3x2 grid means that
a total of six clusters are expected in the data. This grid is then used to calculate
the positions of six centroids in n dimensional space.
The SOM algorithm then
randomly selects a data point and moves the closest centroid towards that point. As
the distance between each successive (randomly selected) data point and the closest
centroid decreases, the distance the centroid moves decreases. After all points have
been used to move the centroids, the algorithm is complete. This technique was used
to cluster genes and identify groups of genes that behaved similarly for yeast cell
cycle array data. SOMs were able to find clusters of yeast genes that were active in
the respective G1, S, G2, and M phases of the cell cycle when analyzing 828 genes
over 16 experiments using a 6x5 grid. While the obtained clusters were desirable, the
selection of the grid was somewhat arbitrary and the algorithm was run until results
matching previous knowledge were obtained.
As noted above, clustering can be performed by either clustering on genes or
clustering on experiments. When both clusterings are performed and the expression
matrix is rearranged accordingly, this is known as two-way clustering. Such an analysis was done on cancerous and non-cancerous colon tissue samples [1]. The study
conducted by Alon et al. was the first to cluster on genes (based on expression across
experiments) and cluster on experiments (based on expression across genes). Given
a data set of 40 cancerous and 22 normal colon tissue samples on arrays with 6500
features, the clustering algorithm used was effective at grouping the cancer and normal tissues respectively. Alon et al. also attempted to use a method similar to that
described in Ch 4.1.1 to reduce the data set and observe the resulting classification
performance of the clustering algorithm.
Dimension Reduction
Dimension reduction is a type of technique that reduces the representation of data by
finding a few components that are repeatedly expressed in the data. Linear dimension
reduction techniques use linear combinations of these components to reconstruct the
original data. Besides finding a space saving minimal representation, such techniques
are useful because by finding the critical components, they essentially group together
dimensions that behave consistently throughout the data.
Given V as the original data, dimension reduction techniques perform the following operation: V ≈ W·H. The critical components are the columns of the matrix W. Given an expression matrix V, each column of W can be referred to as a "basis array". Each basis array contains groupings of genes that behave consistently across each of the arrays. Just as before, since the genes are all commonly expressed in the various arrays, it is possible that they are functionally related. If a basis array is relatively sparse, it can then be interpreted to represent a single cellular process. Since an experiment is a linear combination of such basis arrays, it can then be observed which cellular processes are active in an experiment.
There are several types of linear dimension reduction techniques, all using the
notion of finding a set of basis vectors that can be used to reconstruct the data. The
techniques differ by placing different constraints to find different sets of basis vectors.
For more details, please see Ch 4.2.
The most common technique used is Principal Components Analysis (PCA). PCA
decomposes a matrix into the critical components as described above (known as eigenvectors), but provides an ordering of these components. The first vector (principal
component) is in the direction of the most variability in the data, and each successive
component accounts for as much of the remaining variability in the data as possible.
Alter et al. applied PCA to the expression matrix, and were able to find a set
of principal components where each component represented a group of genes that
were active during different stages of the yeast cell cycle [2]. Fourteen time-course
experiments were conducted over the length of the yeast cell cycle using arrays with
6,108 features. Further study showed that two principal components were able to effectively represent most of the cell cycle expression oscillations.
Another method known as Non-negative Matrix Factorization (NMF) has been
used to decompose oligonucleotide array data into similarly interesting components that
represent cellular processes (Kim & Tidor, to be published). NMF places the constraint that each basis vector has only positive values. PCA, in contrast, allows for
negative numbers in the eigenvectors. NMF leads to a decomposition that is well
suited for oligonucleotide array analysis. The interaction between the various basis
vectors is clear: it is always additive. With PCA, some genes may be down-regulated
in a certain eigenvector while they may be up-regulated in another. Combining two
eigenvectors can cancel out the expression of certain cellular processes, while combining two basis vectors from NMF always results in the expression of two cellular
processes. In this sense, NMF seems to do a slightly better job selecting basis vectors
that represent cellular processes.
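As an illustration of the V ≈ W·H factorization discussed above, the following sketch applies scikit-learn's PCA and NMF to a hypothetical genes-by-experiments matrix; the matrix, its size, and the choice of 4 components are assumptions of the example, and this is not the decomposition code used in the cited studies:

```python
import numpy as np
from sklearn.decomposition import PCA, NMF

# hypothetical non-negative expression matrix: rows = genes, columns = experiments
V = np.abs(np.random.rand(500, 14))

# PCA: ordered, possibly signed components (eigenvectors)
pca = PCA(n_components=4)
scores = pca.fit_transform(V.T)   # experiments expressed in the component space
W_pca = pca.components_.T         # genes-by-components loading matrix

# NMF: V ~ W * H with all entries of W and H constrained to be non-negative
nmf = NMF(n_components=4, init='nndsvda', max_iter=500)
W = nmf.fit_transform(V)          # genes-by-components "basis arrays"
H = nmf.components_               # components-by-experiments coefficients
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))   # relative reconstruction error
```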
Other methods of dimensionality reduction exist as well, such as Multidimensional
Scaling (MDS) and Locally Linear Embedding (LLE, a type of non-linear dimensionality reduction) [10, 20]. Both of these techniques are more commonly used as
visualization techniques. See Ch 3.1 and 3.2 for more details.
1.2.2
Supervised Learning Techniques
Supervised learning techniques take advantage of user provided information about
what class a set of objects belong to in order to learn which features are critical to
the differentiation between classes, and often to come up with a decision function
that distinguishes between classes. The user provided information is known as the
training set. A "good" (generalizable) decision function is then able to accurately
classify samples not in the training set.
One technique that focuses on feature selection is the class predictor methodology developed by Golub et al. [9], which finds arbitrarily sized sets of genes that could
be used for classification. The technique focuses on finding genes that have an idealized expression pattern across all arrays, specifically genes that have high expression
among the cancerous arrays and low expression among the non-cancerous arrays.
These genes are then assigned a predictive power, and collectively are used to decide
whether an unknown sample is either cancerous or non-cancerous (refer to Ch 4.1.2 for
more details of the algorithm). Golub et al. were able to use the above technique to
distinguish between two types of leukemias, discover a new subtype of leukemia, and
to determine the class of new leukemia cases.
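The gene-ranking step of this class predictor approach is commonly described by a signal-to-noise style score; the sketch below ranks genes by (mean difference)/(sum of standard deviations) between the two classes, which is an illustrative approximation of the Golub selection criterion rather than a reproduction of the published code, and the variable names and cutoff of 50 genes are assumptions:

```python
import numpy as np

def golub_score(V, labels):
    """Signal-to-noise style score per gene.
    V: (n_genes, n_samples) expression matrix; labels: boolean array, True = cancerous."""
    pos, neg = V[:, labels], V[:, ~labels]
    mu_diff = pos.mean(axis=1) - neg.mean(axis=1)
    sigma_sum = pos.std(axis=1) + neg.std(axis=1)
    return mu_diff / (sigma_sum + 1e-12)   # small constant guards against zero variance

# rank genes by absolute score and keep, e.g., the 50 highest-scoring genes:
# scores = golub_score(V, labels)
# top_genes = np.argsort(np.abs(scores))[::-1][:50]
```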
One technique that seeks to solve the binary classification problem (classification
into two groups) is the Support Vector Machine (SVM). The binary classification
problem can be described as finding a decision surface (or separating hyperplane)
in n dimensions that accurately allows all samples of the same class to be in the
same half-space. The Golub technique is an example of a linear supervised learning
method, because the decision surface described by the decision function is "flat". In
contrast, non-linear techniques create decision functions with higher-order terms that
correspond to hyperplanes which have contours. SVMs have the ability to find such
non-linear decision functions.
Several researchers have utilized SVMs to obtain good classification accuracy using
oligonucleotide arrays to assign samples as cancerous or non-cancerous [4, 8, 17]. This
computational technique classifies data into two groups by finding the "maximal
margin" hyperplane. In a high dimensional space with a few points, there are many
hyperplanes that can be used to distinguish two classes. SVMs choose the "optimal"
hyperplane by selecting the one that provides the largest separation between the two
groups.
In the case where a linear hyperplane can not be found to separate the
two groups, SVMs can be used to find a non-linear hyperplane by projecting the
data into a higher dimensional space, finding a linear hyperplane in this space, and
then projecting the linear hyperplane into the original space which causes it to be
non-linear. See Ch 5 for more details.
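For orientation, here is a minimal example of training a soft margin SVM on expression data with scikit-learn; the random data, kernel choice, and regularization constant C are illustrative assumptions, and this is not the MATLAB implementation (Gunn's toolbox) actually used in this work:

```python
import numpy as np
from sklearn.svm import SVC

# hypothetical data: rows = tissue samples, columns = genes; +1 = cancerous, -1 = normal
X = np.random.randn(25, 2000)
y = np.array([1] * 21 + [-1] * 4)

# a linear kernel finds a maximal margin separating hyperplane in the original space;
# kernel='poly' or kernel='rbf' would instead yield a non-linear decision surface
clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)
print(clf.predict(X[:3]))   # predicted classes for the first three samples
```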
1.3
Approach to Identifying Critical Genes
In order to extract information from the data produced by using oligonucleotide arrays, computational techniques which are able to handle large, noisy, and high dimensional data are necessary. Since our goal is two-tiered:

* to discover new genes critical to cancer
* to create a diagnosis tool based on measurements of expression of a small set of genes

our proposed solution is two-tiered as well:

* use various feature selection techniques to discover these cancer-causing genes
* use a supervised learning technique to create a decision function that can serve as a diagnosis tool, as well as to validate and further highlight cancer-related genes
One of several feature selection methods will be used to reduce the dimensionality
of a cancer data set. The dimensionally reduced data set will then be used as a
training set for a support vector machine, which will build a decision function that
will be able to classify tissue samples as "cancerous" or "normal". The success of
the feature selection method can be tested by observing the jack-knifing classification
performance of the resulting decision function. Also, the decision function can be
further analyzed to determine which variables are assigned the heaviest "weight" in
the decision function, and hence are the most critical differentiators between cancerous
and non-cancerous samples.
[Figure: analysis pipeline, Pre-Processing → Visualization → Feature Space Reduction → Support Vector Classification]
The feature selection method will not only find critical cancer-causing genes, but
will reduce the dimensionality of the data set. This is critical because it has been
shown that when the number of features is much greater than the number of samples, the generalization performance of the resulting decision function suffers.
The
feature selection method will be used to improve the generalizability of the decision
function, while the generalization performance of the SVM can also be used to rank
the effectiveness of each feature selection method.
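The jack-knifing (leave-one-out) error estimate used here to compare feature selection methods can be sketched as follows; the function and variable names are illustrative and not taken from the thesis code:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut

def jackknife_error(X, y, kernel='linear', C=1.0):
    """Leave-one-out error: train on all samples but one, test on the held-out sample."""
    errors = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = SVC(kernel=kernel, C=C)
        clf.fit(X[train_idx], y[train_idx])
        errors += int(clf.predict(X[test_idx])[0] != y[test_idx][0])
    return errors / len(y)   # fraction of held-out samples misclassified
```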
Besides providing a potential diagnosis tool, analysis of a highly generalizable
decision function found by SVMs may also highlight critical genes, which may warrant future study. SVMs were the chosen supervised learning method due to their
ability to deterministically select the optimal decision function, based on maximizing
the margin between the two classes. The maximal margin criterion has been proven
to have better (or equivalent) generalization performance when compared to other
supervised learning methods, such as neural nets, while the decision function's form
(linear, quadratic, etc.) can be easily and arbitrarily selected when using SVMs. This
increases our ability to effectively analyze and use the resulting decision function as
a second method of feature selection.
Chapter 2
Background
The first section of this chapter will describe the basic mechanisms of cancer, as
well as demonstrate the applicability of oligonucleotide arrays to the study of cancer.
The second section discusses how the array technology works, its capabilities, and its
limitations. Our specific datasets will be described, and then the various techniques
commonly used to prepare array data will also be presented.
2.1
Basic Cancer Biology
Cancer is Primarily a Genetic Disease
Cancer is caused by the occurrence of a mutation that causes the controls or regulatory
elements of a cell to misbehave such that cells grow and divide in an unregulated
fashion, without regard to the body's need for more cells of that type.
The macroscopic expression of cancer is the tumor. There are two general categories of tumors: benign, and malignant. Benign tumors are those that are localized
and of small size. Malignant tumors, in contrast, typically spread to other tissues.
The spreading of tumors is called metastasis.
There are two classes of genes that have been identified to cause cancer: oncogenes and tumor-suppressor genes. An oncogene is defined as any gene that encodes a protein capable of transforming cells in culture or inducing cancer in animals. A
tumor-suppressor gene is defined as any gene that typically prevents cancer, but when
mutated is unable to do so because the encoded protein has lost its functionality.
Development of cancer has been shown to require several mutations, and hence
older individuals are more likely to have cancer because of the increased time it
typically takes to accumulate multiple mutations.
Oncogenes
There are three typical ways in which oncogenes arise and operate. Such mutations
result in a gain of function [15]:
" Point mutations in an oncogene that result in a constitutively acting (constantly
"on" or produced) protein product
" Localized gene amplification of a DNA segment that includes an oncogene,
leading to over-expression of the encoded protein
" Chromosomal translocation that brings a growth-regulatory gene under the control of a different promotor and that causes inappropriate expression of the gene
Tumor-Suppressor Genes
There are five broad classes of proteins that are encoded by tumor-suppressor genes.
In these cases, mutations of such genes and resulting proteins cause a loss of function
[15]:
* Intracellular proteins that regulate or inhibit progression through a specific
stage of the cell cycle
* Receptors for secreted hormones that function to inhibit cell proliferation
* Checkpoint-control proteins that arrest the cell cycle if DNA is damaged or
chromosomes are abnormal
* Proteins that promote apoptosis (programmed cell death)
" Enzymes that participate in DNA repair
2.2
Oligonucleotide Arrays
The oligonucleotide array is a technology for measuring relative mRNA expression
levels of several thousand genes simultaneously [14].
The largest benefit of this is
that all the gene measurements can be done under the same experimental conditions,
and therefore can be used to look at how the whole genome as a system responds to
different conditions.
The 180 oligonucleotide arrays that compose the data set which I am analyzing
were commercially produced by Affymetrix, Inc.
Technology
GeneChips, Affymetrix's proprietary name for oligonucleotide arrays, can be used to
determine relative levels of mRNA concentrations in a sample by hybridizing complete
cellular mRNA populations to the oligonucleotide array.
The general strategy of oligonucleotide arrays is that for each gene whose expression is to be measured (quantified by its mRNA concentration in the cell), there are
small segments of nucleotides anchored to a piece of glass using photolithography
techniques (borrowed from the semiconductor industry, hence the name GeneChips).
These small segments of nucleotides are supposed to be complementary to parts of
the gene's coding region, and are known as oligonucleotide probes. In a certain area
on the piece of glass (specifically, a small square known as a feature), there exist hundreds of thousands of the exact same oligonucleotide probe. So, when a cell's mRNA
is washed over the array using a special procedure with the correct conditions, the
segments of mRNA that are complementary to those on the array will actually hybridize or bind to one of the thousands of probes. Since the probes on the array are
designed to be complementary to the mRNA sequence of a specific gene, the overall
amount of target mRNA (from the cell) left on the array gives an indication of the
mRNA cellular concentration of such a gene.
In order to actually detect the amount of target mRNA hybridized to a feature on
the array, the target mRNA is prepared with fluorescent material. Since all the probes
for a specific gene are localized to a feature, the amount of fluorescence emitted from
the feature can be measured and then interpreted as the level of hybridized target
mRNA.
Genechips
Affymetrix has developed their oligonucleotide arrays with several safeguards to improve accuracy. Because there may be other mRNA segments that have complementary nucleotide sequences to a portion of a probe, Affymetrix has included a
"mismatch" probe which has a single nucleotide inserted in the middle of the original
nucleotide sequence. The target mRNA (for the desired gene) should not bind to
this mismatch probe. Hence, the amount of binding mRNA to the mismatch probe
is supposed to represent some of the noise that also binds to portions of the match
probe. By subtracting the amount of mRNA hybridized to the mismatch probe from
the amount of mRNA hybridized to the match probe, a more accurate description of
the amount of the target gene's mRNA should be obtained.
Since the oligonucleotide probes are only approximately 25 nucleotides in length
and the target gene's mRNA is typically 1000 nucleotides in length, Affymetrix selects
multiple probes that hybridize the best to the target mRNA while hybridizing poorly
to other mRNA's in the cell. The nucleotide sequences for the probes are not released
by Affymetrix. Overall, to measure the relative mRNA concentration of a single gene,
approximately 20 probe pairs (20 match probes and 20 mismatch probes) exist on
the chip.
A measure provided by Affymetrix uses the 20 probe pairs to compute an overall expression value for the corresponding gene, known as the "average difference". Most techniques utilize the average difference value instead of analyzing the individual intensity values observed at each of the 40 probes. The measure is calculated as follows:

\[ \text{average difference} = \frac{1}{n} \sum_{i=1}^{n} \left( pm_i - mm_i \right) \]

where n is the number of probe pairs in the probe set, pm is the perfect match probe set expression vector, and mm is the mismatch probe set expression vector.
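A direct translation of this formula, assuming the perfect match and mismatch intensities for one probe set are available as NumPy vectors (the variable and function names are illustrative, not Affymetrix software):

```python
import numpy as np

def average_difference(pm, mm):
    """Average difference for one probe set: mean of (perfect match - mismatch) over probe pairs."""
    pm, mm = np.asarray(pm, dtype=float), np.asarray(mm, dtype=float)
    return np.mean(pm - mm)

# e.g. intensities for the ~20 probe pairs of one gene:
# avg_diff = average_difference(pm_intensities, mm_intensities)
```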
Capabilities
A major benefit of oligonucleotide arrays is that they provide a snapshot view of nearly
an entire system by maintaining the same set of experimental conditions for each
measurement of each gene. This lends itself to analysis of how different components
interact among each other. Another important benefit is that the use of the arrays
produces a large amount of data quickly. Previously, experiments to measure relative mRNA expression levels would proceed a few genes at a time. Using the arrays, thousands of measurements are quickly produced. From an experimental point of view, another benefit of using oligonucleotide arrays is that the method requires fewer tedious steps such as preparing clones, PCR products, or cDNAs.
Limitations
There are several problems with the usage of oligonucleotide arrays. Some issues related to the oligonucleotide technology are:

* the manner in which expression levels are measured (via fluorescence) lacks precision, because genes with longer mRNA will have more fluorescent dyes attached than genes with shorter mRNA, resulting in higher fluorescence for genes with longer mRNA
* absolute expression levels cannot be interpreted from the data
* there are no standards for gathering samples for genechips, and hence the conditions in which the experiments are done are not controlled
* probes used may not accurately represent a gene, or may pick up too much noise (hybridization to mRNA of genes other than the target gene), resulting in negative average difference values
Another issue related to using oligonucleotide arrays is that some important or
critical genes may not be on the standard genechips, made by Affymetrix. A more
general issue is that while mRNA concentrations are somewhat correlated to the associated protein concentrations, it is not understood to what degree they are correlated.
Also, some proteins are regulated at the transcription level, while some are regulated
at the translation level. Ideally, there would be arrays that measured protein concentrations of all the proteins in the organism to be studied. Analyzing relative amounts
of mRNA (at the transcription level) is basically one step removed from such desired
analysis.
2.3
Data Set Preparation
Two sets of oligonucleotide array data have been provided for analysis by John B.
Welsh from the Novartis Foundation. Both sets use Affymetrix genechips, but each
set has used a different version of the cancer genechip, which contains genes that are
(in the judgment of Affymetrix) relevant to cancer and its cellular mechanisms.
The first data set uses the HuGeneFL genechip from Affymetrix, which contains
probes with length of 25 nucleotides. These chips contain over 6000 probe sets, with 20
(perfect match, mismatch) probe pairs per probe set. Each probe set corresponds to
a gene. The data set is composed of 49 such arrays, with 27 arrays used to measure
the expression patterns of cancerous ovarian tissue, and 4 arrays used to measure
the expression patterns of normal ovarian tissue. The rest of the arrays were used
to test the expression patterns of cancer cell lines, as well as to perform duplicate
experiments. The method of preparation of these experiments was published by Welsh
et al. [25].¹
The second data set uses the U95Av1 and U95Av2 genechips from Affymetrix. The
two versions each contain probe sets for over 12,000 genes. Each probe set contains
16 probe pairs of probes of length 25. The two chips differ only by 25 probe sets,
and hence by using only the common sets, the experiments can be treated as being
from one type of genechip. This data set is composed of 180 genechip experiments from 10 different cancer tissues. There are approximately 35 normal tissue samples and 135 cancerous tissue samples. The various tissues include breast, colon, gastric, kidney, liver, lung, ovary, pancreas, and prostate (see Table 2.1).

¹While the HuGeneFL genechips were used for initial analysis, all the analysis included in this thesis is performed on the second data set.

Table 2.1. Tissue Data Sets

Tissue Type    # Cancerous Samples    # Normal Samples
breast         21                     4
colon          21                     4
gastric        11                     2
kidney         11                     3
liver          10                     3
lung           28                     4
ovary          14                     2
pancreas       6                      4
prostate       25                     9
It is important to note that the number of cancerous samples is, in almost all cases, much larger than the number of normal samples. Ideally, these numbers would
be equal to each other. Also, the tissue data sets themselves are not particularly
large. They might not be able to provide a generalizable diagnostic tool because of
the limited size of the training set. One last issue is that the "normal" tissue samples
are not from individuals who are without cancer. Tissue samples are "normal" if they
do not exhibit any metastasis or tumors. Often the normal samples are taken from
individuals with a different type of cancer, or just healthy parts of the tissue that
contains tumors.
2.4
Data Preprocessing Techniques
Various preprocessing techniques can affect the ability of an analysis technique significantly. Using genechips, there are three common ways of preprocessing the data
that have been used in the analysis of the above data sets. Additionally, there are
some preprocessing techniques which were used to remove poor quality data.
Removal of "Negative" Genes
The first major concern regarding the quality of the data sets above is that there
are large numbers of genes with negative average difference values. Clearly, negative
expression of a gene has no meaning. Negative average difference values mean that
among all probe pairs in a probe set, there is more binding to the mismatch probe
than there is to the match probe on average. Ideally, the probes would have been
selected such that this would not happen. The exact biological explanation for such
behavior is not known, and hence it is unclear whether to interpret the target gene
as present or not since the same nonspecific mRNAs that hybridized to the mismatch
probe could be hybridized to the respective perfect match probe as well, causing the
perfect match probe to have a high fluorescent intensity.
One approach is to remove from consideration any genes which have any negative average difference values. Hence, when comparing arrays using computational techniques, any gene which has had a negative average difference value in any array will be removed, and that dimension or variable will not be used by the technique to differentiate between arrays.
Log Normalization
The log function has the interesting property that given a distribution, it has the ability to make it appear more gaussian. Gaussian distributions have statistical properties
such that the statistical significance of a result can be established.
As can be seen from Figures 2-1 and 2-2 below, log normalizing the data has the effect of compressing the data at the extremes, while stretching out the data in the middle of the original distribution. One issue is that the log of a negative number is
not defined, which is another reason to remove genes with negative average difference
values.
Figure 2-1. Histogram of unnormalized gene expression for one array.

Figure 2-2. Histogram of log normalized gene expression for the same array.
Global Scaling
When making preparations for a genechip experiment, many variables can affect the
amount of mRNA which hybridizes to the array. If differences in the preparation
exist, they should have a uniform effect on the data such that one experiment's
average difference values for all genes would be, for example, consistently higher than
those for another experiment's. When doing analysis on the arrays, it is desirable to
eliminate these effects so the analysis can be focused on "real" variations in the cell's
behavior, not differences in the preparation of the experiment.
One common way to do this is to scale each array such that the means of the average difference values are all equal to some arbitrary number c. This is illustrated below for an array with expression vector \( \vec{x} \) containing n average difference values:

\[ \vec{x}\,' = c \cdot \frac{\vec{x}}{\frac{1}{n} \sum_{i=1}^{n} x_i} \]

This effectively removes the effects of variation in the preparation of the experiments.
Mean-Variance Normalization
Another technique used to transform the data and remove the effects of the preparation of the experiment is to mean-variance normalize the data. In order to do this, the mean \( \mu_x \) and standard deviation \( \sigma_x \) for each array \( \vec{x} \) are calculated. Each average difference value \( x_i \) is transformed:

\[ z_i = \frac{x_i - \mu_x}{\sigma_x} \]

The resulting number represents the number of standard deviations the value is from the mean (commonly known as a z-score). An important result of this type of normalization is that it also removes the effects of scaling.
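The three transformations of this section can be summarized in a short sketch; this is illustrative code operating on a single array's vector of average difference values after negative genes have been removed, and the target mean c is an arbitrary constant, not a value prescribed by the thesis:

```python
import numpy as np

def log_normalize(x):
    """Natural-log transform; assumes negative average difference values were already removed."""
    return np.log(x)

def global_scale(x, c=100.0):
    """Scale the array so that its mean average difference equals the arbitrary constant c."""
    return c * x / x.mean()

def mean_variance_normalize(x):
    """z-score: number of standard deviations each value lies from the array mean."""
    return (x - x.mean()) / x.std()
```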
Chapter 3
Visualization
To better understand the information content in the oligonucleotide array data sets,
the first goal was to attempt to visualize the data. Since many computational techniques focus on grouping the data into subgroups, it would be useful to see 1) how well
the traditional grouping techniques such as hierarchical/k-means clustering perform, and
2) how the data looks projected into dimensions that can be visualized by humans
easily (specifically, three dimensions). By visualizing the data, we can understand
how much noise (relative to the information) is in the data, possibly find inconsistencies in the data, and predict which types of techniques might be useful in extracting
information from the data.
Visualization techniques also enable us to understand how various transformations
on the data affect the information content. Specifically, in order to understand how
each preprocessing technique (described in Ch 2.4) affects the data, we can apply the
preprocessing technique to the data, and observe the effect by comparing the original
data visualization versus the modified data visualization.
3.1
Hierarchical Clustering
Technique Description
Hierarchical clustering is an unsupervised learning technique that creates a tree which
shows the similarity between objects.
Sub-trees of the original tree can be seen
as a group of related objects, and hence the tree structure can indicate a grouping/clustering of the objects.
Nodes of the tree represent subsets of the input set of objects S. Specifically, the
set S is the root of the tree, the leaves are the individual elements of S, and the internal
nodes represent the union of the children. Hence, a path down a well constructed
tree should visit increasingly tightly-related elements [7]. The general algorithm is
included below:
" calculate similarity matrix A using set of objects to be clustered
" while number of columns/rows of A is > 1
1. find the largest element Aj
2. replace objects i and j by their average in the set of objects to be clustered
3. recalculate similarity matrix A with new set of objects
Variations on the hierarchical clustering algorithm typically stem from which similarity coefficient is used, as well as how the objects are joined. Euclidean distance,
or the standard correlation coefficient are commonly used as the similarity measure.
Objects or subsets of objects can be joined by averaging all their vectors, or averaging
two vectors from each subset which have the smallest distance between them.
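An equivalent agglomerative clustering can be obtained with SciPy; the sketch below clusters the experiments of a hypothetical expression matrix using euclidean distance and average linkage (one of the joining policies described above), and is meant only as an illustration of the procedure:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# hypothetical data: rows = experiments, columns = genes
X = np.random.rand(25, 2000)

# average linkage with euclidean distance; 'single' linkage or a
# correlation-based metric would give the other variations discussed above
Z = linkage(X, method='average', metric='euclidean')
tree = dendrogram(Z, no_plot=True)   # tree structure without plotting
print(tree['leaves'])                # order in which experiments appear as leaves
```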
Expected Results on Data Set
Ideally, using the hierarchical clustering technique we would obtain two general clusters. One cluster would contain all the cancer samples, and another cluster would
contain all the normal samples. Since the hypothesis is that cancer samples should have roughly similar expression patterns across all the genes, all these samples should be grouped together. Similarly, normal samples should be grouped together.

Figure 3-1. Experiments 1, 2, 8, and 10 are normal breast tissues. Hierarchical clustering performed using euclidean distance and the most similar pair replacement policy.

Although
using the hierarchical clustering algorithm the sets of cancer and normal samples will
be joined at a node of the tree, ideally the root of the tree would have two children,
representing the cancer and normal sample clusters. This would show that the difference between cancer and normal samples is on average larger than the variation
inside each subset.
Results
Four different variations of hierarchical clustering were performed on 4 expression
matrices per tissue. The 4 variations of clustering resulted from utilizing 2 different similarity metrics (euclidean distance, and the Pearson coefficient) with 2 remove/replace
policies (removing the most similar pair and replacing with the average, removing
the most similar pair and replacing with the sample that was most similar to the
remaining samples).
For each tissue, the original expression matrix (after removing negative genes, see
Ch 2.4) was clustered on as well as three other pre-processed expression matrices.
Log normalization, mean-variance normalization, and global scaling were all applied
and the resulting expression matrices were clustered.
For most of the tissues, across all the different variations of clustering and the
differently processed data sets, the desired results were obtained. The normal samples
were consistently grouped together, forming a branch of the resulting dendrogram.
However, the set of cancer samples and set of normal samples were never separate
branches that joined at the root. This would have indicated that for each sample
in each set, the most dissimilar object within the set was still more similar than a
sample from the other set. Apparently, the amount of separation between the sets of
cancer and normal samples is not large enough to give this result.
3.2
Multidimensional Scaling (MDS)
Technique Description
Each multidimensional object can be represented by a point in euclidean space. Given
several objects, it is often desirable to see how the objects are positioned in this space
relative to each other. However, visualization beyond 3 dimensions is typically nonintuitive. Multidimensional scaling (MDS) is a technique that can project multidimensional data into an arbitrary lower dimension. The technique is often used for
visualizing the relative positions of multidimensional objects in 2 or 3 dimensions.
The MDS method is based on dissimilarities between objects. The goal of MDS
is to create a picture in the desired n dimensional space that accurately reflects the
dissimilarities between all the objects being considered. Objects that are similar are
shown to be in close proximity, while objects that are dissimilar are shown to be
distant. When dissimilarities are based on quantitative measures, such as euclidean
distance or the correlation coefficient, the MDS technique used is known as metric
MDS. The general algorithm for metric MDS [10] is included below:
* use euclidean distance to create a dissimilarity matrix D
* scale the matrix: A = -0.5 * D^2 (element-wise square of D)
* double-center A: B = A - (row means of A) - (column means of A) + (mean of all elements of A)
* obtain the first n eigenvectors and the respective eigenvalues of B
* scale the eigenvectors such that their lengths are equal to the square root of their respective eigenvalues
* if the eigenvectors are columns, then the points/objects are the rows; plot them
Typically, euclidean distance is used as the dissimilarity measure. MDS selects n
dimensions that best represent the differences between the data. These dimensions
are the eigenvectors of the scaled and centered dissimilarity matrix with the n largest
eigenvalues.
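The metric MDS steps above translate directly into a short NumPy function; this is an illustrative sketch of classical (Torgerson) scaling under the assumptions just listed, not code taken from the thesis:

```python
import numpy as np

def classical_mds(X, n_dims=3):
    """Project the rows of X into n_dims dimensions using classical (metric) MDS."""
    # pairwise euclidean distance (dissimilarity) matrix
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    A = -0.5 * D ** 2
    # double-centering: subtract row and column means, add back the grand mean
    B = A - A.mean(axis=1, keepdims=True) - A.mean(axis=0, keepdims=True) + A.mean()
    # eigenvectors of the centered matrix, largest eigenvalues first
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_dims]
    # scale each eigenvector by the square root of its eigenvalue
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

# e.g. coords = classical_mds(expression_matrix.T)  # rows = samples -> 3-D coordinates
```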
Expected Results on Data Set
By looking at the 3-dimensional plot produced by MDS, the most important result
to observe is that cancer samples group together and are separable in some form
from the normal samples. Given that the two sets do separate from each other in 3
dimensions, the next step would be to see what type of shape is required to separate
the data. It would be helpful to note whether the cancer and normal samples were
linearly separable in 3 dimensions, or required some type of non-linear (e.g. quadratic)
function for separation. The form of this discriminating function would then be useful
when choosing the type of kernel to be used in the support vector machine (Ch 5.1).
Results
MDS was used to project all nine tissue expression matrices into 3 dimensions. For
each tissue, the original data (without any negative genes, see Ch 2.4) was used
to create one plot.
To explore the utility of the other pre-processing techniques
described in Ch 2.4, three additional plots were made.
Log normalization, mean-variance normalization, and global scaling were all applied to the expression matrices (without negative genes), and then the resulting expression matrices were projected into 3 dimensions.

Figure 3-2. MDS into 3 dimensions of 25 breast samples: well clustered data with large separation using the log normalized breast tissue data set.
The first observation is that in most of the 3-D plots, the cancer samples and
normal samples group well together. In some cases, however, the separation of the
two sets can not be clearly seen.
The linear separability of such tissues will be
determined when SVMs are used to create decision functions with a linear kernel (see
Ch. 5.3). Please refer to Ch 3.4 for a summary of which tissues are linearly separable,
and the apparent margin/separation between the cancer and normal samples.
Another general observation is that log transforming the data seems to separate the data sets best among the pre-processing techniques, although mean-variance normalization also often increases the margin of separation.

Figure 3-3. MDS into 3 dimensions of 25 colon samples: poorly separated data, but the two sets of samples are reasonably clustered.

3.3
Locally Linear Embedding (LLE)
Technique Description
Like MDS and hierarchical clustering, locally linear embedding (LLE) is an unsupervised learning algorithm, i.e. it does not require any training to find structure in the data [20]. Unlike these two techniques, LLE is able to handle complex non-linear relationships among the data in high dimensions and map the data into lower dimensions
that preserve the neighborhood relationships among the data.
For example, assume you have data that appears to be on some type of nonlinear
manifold in a high dimension like an S curve manifold. Most linear techniques that
project the data to lower dimensions would not recognize that the simplest representation of the S curve is just a flat manifold (the S curve stretched out). LLE however
is able to recognize this, and when projecting the higher dimensional data to a lower
dimension, selects the flat plane representation.
This is useful because LLE is able to discover the underlying structure of the
manifold.
A linear technique such as MDS would instead show data points that
are close in euclidean distance but distant in terms of location on the manifold, as
close in the lower dimension projection. Specifically relating to the cancer and normal
oligonucleotide samples, if the cancer samples were arranged in some high dimensional
space on a non-linear manifold, LLE would be able to project the data into a lower
dimensional space while still preserving the locality relationships from the higher
dimensions. This could provide a more accurate description of how the cancer and normal samples cluster together.

Figure 3-4. LLE trained to recognize an S-curve [20].

The basic premise of LLE is that on a non-linear manifold, a small portion of the manifold can be assumed to be locally linear. Points in these locally linear patches
can be reconstructed by a linear combination of their neighbors, which are assumed to
be in the same locally linear patch. The coefficients used in the linear combination
can then be used to reconstruct the points in the lower dimensions. The algorithm is included below:

* for each point X_i, select its K nearest neighbors
* minimize the reconstruction error function \( E(W) = \sum_i \left| X_i - \sum_j W_{ij} X_j \right|^2 \), where the weight W_{ij} represents the contribution of the j-th data point to the reconstruction of the i-th point, with the following two constraints:
  1. W_{ij} is zero for all X_j that are not neighbors of X_i
  2. \( \sum_j W_{ij} = 1 \)
* map each X_i from dimension D to Y_i of a much smaller dimension d using the W_{ij} obtained from above by minimizing the embedding cost function \( \Phi(Y) = \sum_i \left| Y_i - \sum_j W_{ij} Y_j \right|^2 \)
The single input to the LLE algorithm is K, the number of neighbors used to reconstruct a point. It is important to note that in order for this method to work,
the manifold needs to be highly sampled so that a point's neighbors can be used to
reconstruct the point through linear combination.
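In practice the same kind of projection can be obtained with scikit-learn's implementation of LLE; the sketch below uses K = 5 neighbors and a 3-dimensional embedding, matching the plots in this chapter, on a hypothetical samples-by-genes matrix:

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

# hypothetical data: rows = tissue samples, columns = genes
X = np.random.rand(32, 2000)

lle = LocallyLinearEmbedding(n_neighbors=5, n_components=3)
Y = lle.fit_transform(X)   # 3-D coordinates preserving local neighborhood structure
print(Y.shape)             # (32, 3)
```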
Expected Results on Data Set
If the normal and cancer samples were arranged on some type of high dimensional nonlinear manifold, it would be expected that the visualization produced by LLE would
show a more accurate clustering of the samples to the separate sets. However, since
there are relatively few data points, the visualization might not accurately represent
the data.
Results
LLE was performed in a similar fashion as MDS on the nine tissues, with four plots
produced for each tissue corresponding to the original expression matrix, log normalized matrix, mean-variance normalized matrix, and globally scaled matrix (where all
matrices have all negative genes removed).
LLE in general produces 3-d projections quite similar to those produced by MDS.
However, one notable difference is the tendency of LLE to pull samples together.
This effect is also more noticeable when the number of samples is large. This is clearly
an artifact of the neighbor-based nature of the algorithm.
The LLE plots reinforce the statements regarding the abilities of the pre-processing
techniques to increase the separation between the cancer and normal samples. While
log normalizing the data produces good separation, mean-variance normalization
seems to work slightly better. Nonetheless, log normalization and mean-variance
normalization both seem to be valuable pre-processing techniques according to both
projection techniques.
Figure 3-5. LLE (with 5 nearest neighbors) into 3-D of 32 lung samples (x: cancerous sample, o: non-cancerous sample). Branching of data in LLE projection.

Figure 3-6. MDS into 3 dimensions of 32 lung samples (x: cancerous sample, o: non-cancerous sample). MDS projections maintain relative distance information.
3.4 Information Content of the Data
After observing each of the plots produced by both LLE and MDS, we are able to
draw some conclusions about the separability of each tissue data set. In Table 3.1,
the overall impression regarding the separability of the data is summarized:
Table 3.1. Linear Separability of the Tissue Data

Tissue Type    using MDS 3-d plot    using LLE 3-d plot
breast         possibly              possibly
colon          yes                   possibly
gastric        yes                   yes
kidney         possibly              possibly
liver          no                    no
lung           yes                   yes
ovary          possibly              yes
pancreas       NO                    NO
prostate       possibly              yes
Several of the tissue data sets are clearly linearly separable in 3 dimensions, while
a few are not as clearly separable due to the small separation between the two sets of
samples, even though the cancer samples and normal samples each cluster well. This
may be an indication of the amount of noise present in the data that causes the
smaller separation of the clusters.
These data sets, however, may just require an
additional dimension to better describe the data and possibly increase the margin of
the two clusters.
The visualizations of the pancreas data set seem to indicate that there may be
a possible misclassification of a single "normal" sample, which is consistently among
a subset of the cancer samples. Also bothersome is the complete dispersion of the
cancer samples. There is no tight cluster of cancer samples.
Overall, most of the data sets (especially the lung data set) are well suited for
analysis. The cancer samples and normal samples each cluster well, and there is a
reasonably clear dividing line between the two clusters. Now that it is established
that most of the data sets contain genomic measurements that separate the cancer
and normal samples, feature space reduction techniques are required to identify those
genes that cause this separation.

Figure 3-7. MDS into 3 dimensions of 10 pancreas samples (x: cancerous sample, o: non-cancerous sample). Possible misclassified sample.
Chapter 4
Feature Space Reduction
As we found from performing the visualization analysis in Chapter 3, the cancerous
and normal samples for nearly all tissues are separable by some simple function. In
most cases, a linear function can be used to discriminate between the two sets, since
a straight dividing line can be drawn between the two sets in the 3-dimensional plots produced. Since it is established that the data from the oligonucleotide samples provides
information that allows the cancerous and normal samples to be separated, all that
remains is specifically selecting the set of genes that causes the largest separation
consistently between the two sets of data.
The thousands of measurements on each oligonucleotide array constitutes an input
space X.
In some instances, changing the representation of the data can improve
the ability of a supervised learning technique to find a more generalizable solution.
Changing the representation of the data can be formally expressed as mapping the
input space X to a feature space F.
    x = (x_1, \ldots, x_n) \mapsto \phi(x) = (\phi_1(x), \ldots, \phi_N(x))

The mapping function φ(x) can be used to perform any manipulation on the data,
but frequently it is desirable to find the smallest set of features that still conveys
the essential information contained in the original attributes. A φ(x) that creates
a feature space smaller than the input space by finding the critical set of features
performs feature space reduction.
Given that the oligonucleotide samples each have approximately 10,000 attributes,
feature space reduction is necessary to identify the genes that cause the differences
between cancerous and normal samples. Variable selection techniques reduce the fea-
ture space by identifying critical attributes independent of their behavior with other
attributes in the input space.
Dimension reduction techniques, however, typically
look for sets of attributes that behave similarly across the data and group these "redundant" attributes, thereby reducing the number of features used to represent the
data.
4.1 Variable Selection Techniques
Variable selection techniques typically focus on identifying subsets of attributes that
best represent the essential information contained in the data. Since selecting an optimal subset given a large input space is a computationally hard problem, many existing
techniques are heuristic-based.
Techniques that were used for variable selection on
the two data sets are described below.
4.1.1 Mean-difference Statistical Method
The most direct way to find the genes that are the most different between cancerous
and normal samples is to average the expression of the cancerous samples and the
normal samples separately, and then subtract the resulting mean expression vectors
to obtain the mean difference vector d
    \vec{d} = \frac{1}{j} \sum_{i \in \text{cancer}} \vec{X}_i - \frac{1}{k} \sum_{i \in \text{normal}} \vec{X}_i

where j is the number of cancer samples, and k is the number of normal samples.
Those genes with the largest value in d should be those most closely related to the
occurrence of cancer. Using statistics, a confidence interval can be assigned to the
absolute difference between the means for each gene, and hence one can determine whether
noise is the source of the difference.
Technique Description
Assume there is a set X with n samples drawn from a normal distribution with
mean μ_x and variance σ², and another set Y with m samples drawn from a normal
distribution with mean μ_y and the same variance σ². Since we are only provided with
the samples in set X and set Y, the mean and the variance of the distributions are
unknown.

In order to estimate μ_x - μ_y, we can use \bar{X} - \bar{Y}. However, since this is an
approximation based on the samples taken from the distributions, it would be wise
to compute confidence intervals for the mean difference. The 100(1 - α)% confidence
interval for μ_x - μ_y is

    (\bar{X} - \bar{Y}) \pm t_{m+n-2}(\alpha/2) \cdot s_{\bar{X}-\bar{Y}}

where t_{m+n-2} is the t distribution for m + n - 2 degrees of freedom, and s_{\bar{X}-\bar{Y}} is the
estimated standard deviation of \bar{X} - \bar{Y}:

    s_{\bar{X}-\bar{Y}} = s_p \sqrt{\frac{1}{m} + \frac{1}{n}}

s_p^2 is defined as the pooled sample variance, and weights the variance from the set
with the larger number of samples:

    s_p^2 = \frac{(n-1) s_X^2 + (m-1) s_Y^2}{m + n - 2}
Two assumptions must be made in order to correctly use this technique [18, p.
392]. The first assumption is that each set of samples is from a normal (gaussian)
distribution. This assumption is usually reasonable, especially when the number of
samples within each set is large. Unfortunately all data sets that we have are typically
small when looking at the number of normal (non-cancerous) samples.
The second assumption is that the variances of the sets X and Y are the
same. If this were true, it would indicate that the variability among the cancerous
samples is the same as among the normal samples. This might not be a safe
assumption to make, since the source and magnitude of the variance in the cancerous
samples may or may not apply to the normal samples. Specifically, among the cancerous samples, there may be different types of cancers that can cause different sets
of genes to be highly expressed. This variance however would not be displayed in the
set of normal samples.
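The following is a minimal Python sketch of this selection criterion, using a pooled-variance two-sample t test (scipy's ttest_ind) as the source of the p values. The matrix shapes and the 0.01 cutoff are illustrative assumptions, not the actual experimental settings.

# Illustrative sketch of mean-difference gene selection with a pooled-variance t test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
cancer = rng.normal(5.0, 1.0, (12, 6000))   # 12 cancer arrays x 6000 genes
normal = rng.normal(5.0, 1.0, (8, 6000))    # 8 normal arrays x 6000 genes

d = cancer.mean(axis=0) - normal.mean(axis=0)                    # mean difference vector
t, p = stats.ttest_ind(cancer, normal, axis=0, equal_var=True)   # per-gene p values

selected = np.where(p < 0.01)[0]            # genes whose mean difference is significant
print(len(selected), "genes selected at p < 0.01")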
Application to Tissue Data
The above technique was used to select three different subsets of features. Specifically,
a gene was chosen to be in this subset if the probability that the difference between
its values for the cancer experiments and the normal experiments could be seen by
chance is less than p. Simply stated, the smaller the p value for a gene, the more likely that
the difference between the means was statistically significant and not due to luck (or
noise).
Three values for p were used to select genes: 0.01, 0.001, and 0.0001. A p value of
0.01 meant that the probability that such a mean difference would occur if the two
sets were the same was 1/100. After removing all negative genes from the original
(untouched) data sets, each tissue set had the following number of genes:

* breast: 6203 genes
* colon: 5778 genes
* gastric: 5477 genes
* kidney: 6655 genes
* liver: 5952 genes
* lung: 5593 genes
* ovary: 6645 genes
* pancreas: 6913 genes
* prostate: 6885 genes
Table 4.1 shows how many genes with statistically significant mean differences
were found for all tissues, and after pre-processing the data as well.
Clearly the
number of features to consider is reduced by this technique. The generalization
performance of the decision functions obtained from training on the reduced data sets will be
observed in Chapter 5, and will validate the usefulness of this technique.
From Table 4.1, it is evident that the two tissues that seemed to not be linearly
separable from the visualization analysis, the liver and pancreas, also have
the fewest genes with significant differences between the means for the
cancer and normal arrays. This shows the relationship of having separation between
the cancer and normal samples with having specific sets of genes that actually cause
this separation. This possibly indicates that only a subset of genes are needed to
differentiate between cancer and normal tissues.
4.1.2 Golub Selection Method
The Golub et al. technique (as briefly mentioned in Ch 1.2.2) is a supervised learning method which selects genes that have a specific expression pattern across all the
samples. For instance, possible cancer causing genes could likely be highly expressed
in the cancerous samples, while having low expression in the normal samples. Genes
that have such an expression pattern across the samples could then be used to determine whether a new case is cancerous or normal. This method allows the user
to decide what expression pattern cancer-causing genes should have. Golub et al.
selected two different expression patterns, which are described below.
Technique Description
Assume there is a set X of n cancerous samples, and a set Y of m normal samples.
Each gene therefore has an expression vector with n + m elements:

    gene_i = (exp_1, exp_2, \ldots, exp_n, exp_{n+1}, \ldots, exp_{n+m})

    ideal_1 = c \cdot (1, 1, \ldots, 1, 0, 0, \ldots, 0)   (n ones followed by m zeros)

    ideal_2 = c \cdot (0, 0, \ldots, 0, 1, 1, \ldots, 1)   (n zeros followed by m ones)

The similarity between each gene's expression vector gene_i and an idealized expression
vector (e.g. ideal_1) is measured by using any of several similarity metrics
(e.g. Pearson correlation, euclidean distance). Golub et al. introduced a new metric,
based on the Fisher criterion, that quantified the signal to noise ratio of the gene:

    \frac{\mu_x - \mu_y}{\sigma_x + \sigma_y}

After calculating the similarity between all gene expression vectors and ideal_1 using
one of the above similarity metrics, an arbitrary number n of the genes that score the
highest can be selected to represent half of the predictive set. The other half of the
set consists of the n genes that score the highest on correlation with ideal_2.
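A small numpy sketch of the signal-to-noise score and the selection of the two half-sets is shown below. The data and the choice of 50 genes per idealized pattern are illustrative assumptions, not the settings used in the experiments.

# Illustrative sketch of the Golub signal-to-noise selection.
import numpy as np

rng = np.random.default_rng(0)
cancer = rng.normal(5.0, 1.0, (12, 6000))   # n cancer samples x genes
normal = rng.normal(5.0, 1.0, (8, 6000))    # m normal samples x genes

# (mu_x - mu_y) / (sigma_x + sigma_y), computed per gene
score = (cancer.mean(axis=0) - normal.mean(axis=0)) / (cancer.std(axis=0) + normal.std(axis=0))

top_ideal1 = np.argsort(score)[-50:]        # genes high in cancer, low in normal
top_ideal2 = np.argsort(score)[:50]         # genes with the opposite pattern
predictive_set = np.concatenate([top_ideal1, top_ideal2])
print(predictive_set.shape)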
Application to Tissue Data
The above method was used twice to select subsets of genes of size 50, 100, 150, 200,
250, and 500. The two metrics that were used to select the genes were the Pearson
coefficient and, separately, the Golub similarity metric.
The n gene subsets that were selected when using the Pearson coefficient were
composed of the n/2 genes most highly correlated with ideal1 , and the n/2 genes
most highly correlated with ideal2 . The n gene subsets that were selected when using
the Golub metric were the n genes that had the highest Golub metric score.
This variable selection technique was run on all nine tissues, with the four expression matrices per tissue (as described earlier). The ability of these subsets to improve
classification performance is addressed in Ch. 5.
4.1.3 SNP Detection Method
SNPs are Single Nucleotide Polymorphisms, or variations of one nucleotide between
DNA sequences of individuals. While SNPs are typically used as genetic markers,
they could also cause mutations which cause cancer.
As mentioned in Chapter 2,
genechips measure levels of expression of genes by having sets of probes, which are
oligonucleotides of length 20. Each probe in a probe set is supposed to be a different
segment of the coding region of the same gene.
Assume we have two oligonucleotide samples measuring gene expression from two
separate individuals. Assume we observe for the probe set of gene_j that all probes
have similar expression levels except for one probe between the two samples. Since
each probe is supposed to be just a separate segment of the same gene, the expression
level for each probe should be similar. Two possible explanations for this behavior
are
1. Other genes that are expressed differently between the samples have a segment
of DNA that is exactly the same as that of gene_j (and complementary to the
probe in question)
2. A SNP has occurred in one sample, causing it to either increase or decrease the
binding affinity of the segments of mRNA to the probe in question
Although explanation #1 is reasonable, the designers of the genechip at Affymetrix
supposedly selected probes that are specific to the desired gene. If this is the case,
then it would be reasonable to assume explanation #2, and further investigate the
presence of SNPs by sequencing the oligonucleotide segment of the gene represented
by the probe.
Technique Description
In order to compare probe expression values (fluorescent intensities) to find SNPs,
some normalization is required to ensure that the differences are not due to variations
in the preparation of each sample. Since these differences would likely yield a uniform
difference among probe intensities for all genes, a concentration factor c could be used
to remove this source of variance. For each probe set of a single sample, the original
expression data across all probes could be approximated by a probe set pattern vector
p multiplied by the concentration factor c:

    c \cdot (p_1, p_2, \ldots, p_n)

where p_i is the expression level of the ith probe. The concentration factor c would apply
to the pattern vectors for each gene on the array, while the probe set pattern vectors
would apply for all samples. The factor c and pattern vectors could be obtained by
minimizing the error function

    \epsilon = \left\| (d_1, d_2, \ldots, d_n) - c \cdot (p_1, p_2, \ldots, p_n) \right\|

while constraining one c for each sample, and one p for each probe set across all
samples. Using such a technique would provide us with probe pattern vectors p for a
set of samples.
In order to use the above technique to find cancer related SNPs, we must first note
that the SNPs that can be found among the cancerous samples is likely a combination
of SNPs that the normals have as well as cancer-specific SNPs:
    {cancer SNPs} = {normal SNPs} ∪ {cancer-specific SNPs}
Assuming that {normal SNPs} ⊆ {cancer SNPs}, we can find {cancer-specific SNPs}
by subtracting the {normal SNPs} from the {cancer SNPs}.
One method that can be used to find the {normal SNPs} is to compare the p's
to each normal sample's expression across a probe set. The p's will be calculated by
using the above methodology on the set of normal samples. If any sample's probe set
expression vector varies significantly from the corresponding p in one probe, it would
be reasonable to assume that a potential SNP exists for that probe.
A simple technique similar to jack-knifing can be used to find potential SNP
containing probes. For each pattern vector p and each probe expression vector d (both
with n elements, the number of probes in each probe set), the Pearson correlation K
is calculated. The correlation K_i of p and d without probe i is then calculated and
compared with the original K. If the percentage difference of the correlations is above
a threshold,

    threshold < \frac{|K - K_i|}{K},

then that probe varies significantly between the two vectors and hence can be a
potential SNP.
The threshold test is done for each probe i, 1 ≤ i ≤ n. After doing such an
analysis, a list of (gene,probe) pairs represents the potential {normal SNPs}. Since
the overall goal is to have a set of cancer-specific SNPs that we have high confidence
in, it is acceptable to place the above threshold to be low so that the set of normal
SNPs will be large. At the end of the analysis, we will be "subtracting" the {normal
SNPs} from {cancer SNPs}. Hence we would like to remove any SNPs that aren't
specific to cancer.
Determining the {cancer SNPs} is done by a different technique.
The probe
pattern vectors p obtained from the normal samples will be compared with each
cancer sample's probe set vectors using the jack-knife correlation method described
above. The p's obtained from the normal samples are used so that we do not miss
SNPs that are common among all cancer samples but absent in normal samples.
As described above, the final step in order to find {cancer-specific
SNPs} is to remove from {cancer SNPs} all elements of {normal SNPs}.
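The following is a hedged Python sketch of the jack-knife correlation test just described; the probe values and the 5% threshold are illustrative assumptions.

# Illustrative sketch of flagging probes whose removal changes the correlation markedly.
import numpy as np

def potential_snp_probes(p, d, threshold=0.05):
    """Return indices of probes whose removal changes the Pearson correlation by more than threshold."""
    k_full = np.corrcoef(p, d)[0, 1]
    flagged = []
    for i in range(len(p)):
        mask = np.arange(len(p)) != i
        k_i = np.corrcoef(p[mask], d[mask])[0, 1]
        if abs(k_full - k_i) / abs(k_full) > threshold:
            flagged.append(i)
    return flagged

# Example: a probe set whose fifth probe deviates from the pattern vector
p = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
d = np.array([1.1, 2.2, 2.9, 4.1, 0.2, 6.2])
print(potential_snp_probes(p, d))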
Application to Tissue Data
The SNP detection method was run for all nine tissues on all four variants of the
data set. The pattern detection technique was successfully run on each data set, and
using the jack-knifing method of detecting single probe variations, a list of potential
SNPs was created for the cancer samples and the normal samples.
However, for each data set, {cancer SNPs} ⊆ {normal SNPs}. In other words,
all SNPs found were actual polymorphisms that existed in the normal samples, and
hence weren't critical to the development of cancer. Because this was the case, no
reduced data sets were created from this method.
A possible explanation for this behavior can be related to the manner in which
the normal samples were obtained. As described in Ch. 2, "normal" samples were
actually "normal" appearing biopsies of tissues that often contained tumors as well.
It is possible that the "normal" samples actually contained cancerous cells. While
on average, the overall expression could be different between the cancer and normal
samples for the same probe set due to the different concentration of cells afflicted by
cancer, the probe patterns could still be the same.
4.2 Dimension Reduction Techniques
While variable selection techniques attempt to reduce the feature space by using a
specific selection criterion that tests each variable independent of the others, dimension reduction techniques reduce the feature space by grouping variables that behave
consistently across the data. The number of dimensions is therefore reduced to the
number of groups that encompass all the variables.
Most dimension reduction techniques applied to oligonucleotide array data have
been linear reduction techniques. These techniques typically perform in a matrix factorization framework, where the matrix to be factorized into the reduced dimensions
is the expression matrix from Ch 1.2. Assuming the expression matrix has n rows
(genes) and m columns (different experiments), most dimension reduction techniques
factorize the expression matrix: V ≈ W*H. The matrix W has n rows and r columns,
while the matrix H has r rows and m columns. The value r corresponds to the desired
reduced dimension of the data, and is typically chosen so that (n + m)r < nm.
Such dimension reduction techniques are linear because a linear combination of the
columns of W approximate a single original experiment.
The various dimension reduction techniques differ in the constraints that are
placed on the elements in the W and H matrices, leading to different factorizations. One common constraint is that the r columns form a basis set that spans an r
dimensional space. This requires that each column of W be orthogonal to all other
columns of W. Two dimension reduction techniques that were used to reduce the
feature space are described below.

Figure 4-1. Principal components of an ellipse are the primary and secondary axes.
4.2.1 Principal Components Analysis (PCA)
The most common technique used for dimension reduction is Principal Components
Analysis (PCA). Besides the orthogonality constraint, PCA places the additional
constraint that the ith column of the W matrix be the dimension that accounts for
the ith order of variability. For example, if the original data was 2-dimensional and
the data points formed a shape similar to an ellipse, the first column of the W matrix
would be a vector pointing in the direction of the primary axis, and the second column
would be a vector pointing in the direction of the secondary axis.
Each data point can then be described as a linear combination of the vectors that
represent the primary and secondary axis. For the general case, each data point of n
dimensions can be approximated by a linear combination of r vectors that are each n
dimensional. Each of the r columns of W are known as eigenvectors. They represent
"eigenexperiments" because they show common patterns of expression of genes found
in the actual experiments. Each original experiment is encoded by a corresponding
column of the encoding matrix H, which indicates the specific linear combination of
r eigenexperiment vectors from W that approximates the experiment.

PCA effectively groups genes with consistent behavior in the r eigenexperiment
vectors of the W matrix. These groups of genes can be likened to cellular processes
since the genes express themselves consistently across many experiments, as cellular
processes would be expected to do (since the interactions between genes in a pathway
are relatively consistent). It has been shown that certain eigenexperiments do group
together most genes for certain cellular processes [2]. The columns of the H matrix can
then be used to understand which eigenexperiments and cellular processes contribute
the most to the expression of a certain experiment.
Technique Description
PCA is typically performed by doing eigenanalysis on the matrix A (note the change
of notation from above, where the matrix to be decomposed is V).
If A is not
square, PCA is implemented via Singular Value Decomposition (SVD), a technique
used to find the eigenvectors of a non-square matrix [22, p. 326]. SVD results in
the decomposition A = U \Sigma V^T, where the columns of U correspond to the principal
components. The \Sigma matrix is diagonal, containing the singular values associated
with each principal component. The columns of U and the columns of V^T are each
orthonormal.

The matrix U is similar to the W matrix as described in the general framework,
and (\Sigma V^T) is similar to the H matrix in the general framework. The nth principal
component is the nth column of U. SVD is performed as follows:

* compute A^T A and its eigenvectors v_1, v_2, \ldots, v_n, since A^T A is square

* normalize the eigenvectors so they are unit eigenvectors

* calculate A v_i = \alpha_i u_i for all unit eigenvectors, where u_i is a unit vector and
  \alpha_i is the magnitude of the vector product A v_i

* u_i is the ith column of U, \alpha_i is the (i, i) element of \Sigma, and v_i is the ith
  column of V
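A minimal numpy sketch of this SVD-based procedure is shown below. The matrix dimensions are illustrative, and numpy's svd routine stands in for whatever implementation was actually used.

# Illustrative sketch of PCA via SVD and the encoding matrix H = diag(s) * Vt.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6000, 20))          # genes x experiments expression matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
H = np.diag(s) @ Vt                      # encoding matrix (one column per experiment)

r = 5
reduced = H[:r, :]                       # experiments represented by the first r components
print(U.shape, H.shape, reduced.shape)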
Application to Tissue Data
PCA will be used to dimensionally reduce the four data sets per tissue. SVD will be
performed on the data set, yielding u, s, and v. As described above, the encoding
matrix H is equivalent to s * v', and each experiment (column i of V) corresponds to
column i of H. By using the encoding matrix columns to represent an experiment,
we in a sense have changed the feature space, where each dimension is now in the
direction of the corresponding principal component.
By selecting an arbitrary number of rows of the encoding matrix H, we can arbitrarily project the data into the desired reduced space. Using the first n rows of
the H matrix corresponds to reconstructing an experiment by only using the first n
principal components.
Various projections of the data (utilizing a varying number of principal components) can be used to represent the data, and the classification performance of each
projection is examined in Ch. 5.
4.2.2 Non-negative Matrix Factorization (NMF)
Non-negative matrix factorization (NMF) is another technique that has been recently
used to decompose oligonucleotide array data into components which represent cellular processes (Kim and Tidor, to be published).
NMF performs the same factorization of the expression matrix into the W and H
matrices, and requires that each column of W (known as a basisvector when using
NMF) is orthogonal to all other columns of W. NMF differs from PCA by placing
a different constraint on the elements of the W and H matrices. Specifically, NMF
requires that all elements of both the W and H matrices be positive. This leads
to basisvectors that only contain positive expression values for each gene. Because
PCA has no such constraint, frequently genes can have negative expression values
in the eigenexperiment vectors. Another result of the positive constraint is that all
original experiments are approximated by additive combinations of the basisvectors.
This means that the expression of a single experiment is composed of the addition of
several basisvectors, each which represents a cellular process. It is more intuitive to
think about an experiment's expression pattern as the sum of several cellular processes
being active, instead of the addition and subtraction of cellular processes (which PCA
produces).
Technique Description
The NMF technique was developed by Lee and Seung [12]. The technique is nondeterministic, and is an iterative procedure that uses update rules to converge to a
maximum of an objective function that is related to the error between the original
expression matrix V and the W and H matrices. The update rules are designed such
that the non-negativity constraints are met. One requirement is that the matrix V
to be decomposed has no negative elements to begin with.
The update rules for the W and H matrices are as follows:
    W_{ia} \leftarrow W_{ia} \sum_{p} \frac{V_{ip}}{(WH)_{ip}} H_{ap}

    W_{ia} \leftarrow \frac{W_{ia}}{\sum_{j} W_{ja}}

    H_{ap} \leftarrow H_{ap} \sum_{i} W_{ia} \frac{V_{ip}}{(WH)_{ip}}

An objective function, which quantifies the error between the original V and WH,
suggested by Lee and Seung [13] is

    F = \sum_{i=1}^{n} \sum_{p=1}^{m} \left[ V_{ip} \log (WH)_{ip} - (WH)_{ip} \right]
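The following is a minimal numpy sketch of these multiplicative updates; the matrix sizes, iteration count, and random initialization are illustrative assumptions rather than the settings used in this work.

# Illustrative sketch of the Lee & Seung multiplicative update rules given above.
import numpy as np

def nmf(V, r, n_iter=500, eps=1e-9):
    """Factor a non-negative matrix V (n x m) into W (n x r) and H (r x m)."""
    n, m = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, r))
    H = rng.random((r, m))
    for _ in range(n_iter):
        WH = W @ H + eps
        # update W, then renormalize its columns to sum to one
        W *= (V / WH) @ H.T
        W /= W.sum(axis=0, keepdims=True)
        # update H using the renormalized W
        WH = W @ H + eps
        H *= W.T @ (V / WH)
    return W, H

# Example: decompose a random non-negative "expression matrix" into 5 components
V = np.abs(np.random.rand(100, 20))
W, H = nmf(V, r=5)
print(W.shape, H.shape)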
Application to Tissue Data
The encoding matrix H obtained by performing NMF on all tissue's four expression
matrices will be used to project the data just as it was described for PCA. By observing the classification performance produced by projecting the data into various
basis vector spaces, we can identify the most critical basis vectors for further study.
4.2.3 Functional Classification
Another strategy that can be used to reduce the number of dimensions is to group
genes based on their functional classification, or usage in cellular processes. If genes
are functionally classified by some analytical means (e.g. by the careful analysis of a
pathway), we can use this information instead of using techniques such as PCA and
NMF to attempt to find and correctly group all genes related to a particular cellular
process.
The yeast genome, which has approximately 6000 genes, has been functionally
annotated. Each gene has been assigned to be involved with one or more of thirteen
different functional groups by the MIPS consortium [16]:
1. Metabolism
2. Energy
3. Growth: cell growth, cell division, and DNA synthesis
4. Transcription
5. Protein Synthesis
6. Protein Destination
7. Transport Facilitation
8. Intracellular Transport
9. Cellular Biogenesis
10. Signal Transduction
11. Cell Rescue, Defense, Death & Aging
12. Ionic Homeostasis
13. Cellular Organization
Since these functional classifications are for the yeast genome and not the human
genome, in order to make the classifications relevant for my data I used sequence
homology to assign functional classifications to the genes included on the
Affymetrix oligonucleotide arrays. Approximately 4100 human genes on the genechips
were homologous to yeast genes, and hence these 4100 genes were grouped into the
above 13 groups. Experiments were then defined as having expression in 13 dimensions,
where the expression for each dimension was the average expression of all the
genes in the corresponding functional group.
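The following is an illustrative Python sketch of this reduction; the expression matrix and the class membership lists are randomly generated stand-ins for the homology-based assignments described above.

# Illustrative sketch of collapsing an expression matrix to 13 functional-class averages.
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_samples, n_classes = 4100, 20, 13
expression = rng.random((n_genes, n_samples))

# class_members[c] holds the indices of the genes assigned to functional class c
class_members = [rng.choice(n_genes, size=rng.integers(100, 800), replace=False)
                 for _ in range(n_classes)]

reduced = np.vstack([expression[idx, :].mean(axis=0) for idx in class_members])
print(reduced.shape)   # (13, n_samples): one averaged expression value per class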
Figure 4-2. Functional Class Histogram (number of genes assigned to each of the 13 functional classes).
The breakdown of the 4100 genes that belonged to the functional classification
groups is shown in the figure above. Note that each gene can belong to multiple
classes. Those genes that were specified as being in multiple classes were factored
into the expression for each class by being treated as a separate gene.
Application to Tissue Data
Each tissue's four expression matrices are reduced to have 13 dimensions instead of
the approximate 6000 dimensions. By using these dimensionally reduced matrices,
the critical dimensions highlighted by the decision function obtained from SVMs
will essentially highlight the critical cellular processes that are most differently
expressed between cancer and normal samples. This is further explored in Ch. 5.
Table 4.1. Genes Selected by Mean Difference Technique (note that "orig" means the data with no
pre-processing, "log" means log normalization, "mv" means mean-variance normalization, and "gs"
means global scaling).

Tissue Type       p = 0.01    p = 0.001    p = 0.0001
breast, orig      732         349          199
breast, log       739         278          117
breast, mv        2077        1269         567
breast, gs        515         198          100
colon, orig       739         260          99
colon, log        780         323          138
colon, mv         693         271          111
colon, gs         809         308          120
gastric, orig     1351        552          166
gastric, log      1014        276          62
gastric, mv       1090        530          185
gastric, gs       1173        573          205
kidney, orig      1023        370          116
kidney, log       893         227          57
kidney, mv        795         272          91
kidney, gs        1035        383          125
liver, orig       51          14           14
liver, log        86          14           0
liver, mv         42          14           14
liver, gs         46          14           0
lung, orig        621         270          171
lung, log         719         297          102
lung, mv          616         292          170
lung, gs          756         393          220
ovary, orig       763         392          231
ovary, log        682         259          112
ovary, mv         655         214          100
ovary, gs         539         247          134
pancreas, orig    133         13           11
pancreas, log     145         16           11
pancreas, mv      121         18           11
pancreas, gs      1737        14           11
prostate, orig    1750        1024         601
prostate, log                 1064         642
prostate, mv                               1518
prostate, gs                               895
Chapter 5
Supervised Learning with Support
Vector Machines
Because the goals of this project were to discover genes that may cause cancer, and
to create a decision function that can be used as a diagnosis tool for cancer, the next
step of the thesis involved the use of Support Vector Machines (SVMs).
SVMs aid in achieving both of these goals. More commonly, SVMs are used to
create decision functions that provide good generalization performance.
However,
since most learning techniques tend to perform worse in high dimensional spaces, in
Chapter 4 we introduced ways to reduce the number of dimensions that were to be
considered by the SVM, and hence attempt to improve the generalization performance
of the decision functions obtained from the SVMs. Additionally, we hoped to use these
decision functions to point out the critical features in the feature space.
First, our motivation for the usage of SVMs and background on the mechanism
of SVMs will be provided.
Then, the generalization performance of the decision
functions obtained from the SVMs when using a specific variable selection technique
from Chapter 4 will be presented. Any important genes found will also be listed and
described.
5.1 Support Vector Machines
5.1.1 Motivation for Use
Many different supervised learning techniques have been used to create decision functions. Neural networks were previously the most common type of learning technique.
However, SVMs have been shown in several studies to have improved generalization
performance over various learning techniques, including neural networks. Equally important was the ability of the SVM to create decision functions that were nonlinear.
Many learning techniques create linear decision functions, but it is very possible that
a more complex (nonlinear) function of the expression levels of genes is required to
accurately describe the situations in which cancer is present. While other nonlinear supervised learning techniques exist, SVMs provides a framework in which the
structure of the decision function can be developed by the user through the kernel
function (see next section). This framework allows for flexibility, so that both linear
and various nonlinear functions can be used as the form of the decision function.
The generalization performance can then be used to indicate the structure of the
relationships in the data.
Another attractive feature of SVMs is that the criterion for the selection of the
decision function is intuitive. The selection of the decision function is optimized on
two things: the ability to correctly classify training samples, and the maximization of
the margin between the two groups that are being separated by the decision function.
The latter criterion is an intuitive notion, which also leads to a deterministic solution
given a set of data. This criterion enables us to better understand the decision
function so that we can analyze it and select critical features.
5.1.2 General Mechanism of SVMs
The objective of the SVM is to find a decision function that solves the binary classification problem (classify an object to one of two possible groups) given a set of
objects. One way to describe the decision function f(x) is: if f(x_i) > 0, then the
object x_i is assigned to the positive class; otherwise, x_i is assigned to the negative
class.

Figure 5-1. A maximal margin hyperplane that correctly classifies all points on either half-space [19]
Assume the decision function has the following form: f(x) = (w \cdot x) + b. The
decision function is a linear combination of the various elements of x, where w_i refers
to the weight of the ith element of x. For example, if the number of elements n in
x is two, then the decision function is a line. For n larger than three, the decision
function is known as a hyperplane. Treating each object as a point in n dimensional
space, the margin of a point x_i is

    \gamma_i = y_i ((w \cdot x_i) + b)

where y_i is -1 or 1, denoting which group x_i belongs to. The distance the point x_i
is from the separating hyperplane is approximated by |γ_i|^1, while the sign of γ_i
indicates whether the hyperplane correctly classifies the point x_i or not. If the sign
is positive, x_i is correctly classified.

As mentioned above, the first thing the SVM tries to do is find a w that correctly
classifies all points, given the set of points is a linearly separable set. This reduces
to having positive margins for all points. Secondly, if we maximize the sum of the
margins for all points, we are maximizing the distance each set of points is from the
separating hyperplane.

1. γ_i equals the euclidean distance a point x_i is from the hyperplane when the
hyperplane is normalized to (w/\|w\|, b/\|w\|). Note this is the geometric margin.
Hence the maximal margin classifier type of SVM is an optimization problem,
where the goal is to maximize \sum_{i=1}^{l} \gamma_i given the constraint that each γ_i is positive
(all l points are correctly classified), where the set of points is linearly separable [5].

Before we continue, we should note that the margin γ as defined above is actually
known as the functional margin. Note that the functional margin for any point can
be arbitrarily scaled by scaling w and b:

    (w, b) \rightarrow (\lambda w, \lambda b)

    \gamma_i = y_i ((\lambda w \cdot x_i) + \lambda b) = \lambda \, y_i ((w \cdot x_i) + b)

By making λ very large, we can make the functional margin arbitrarily large. In order
to remove this degree of freedom, we can require that the functional margin be scaled
by 1/\|w\|, yielding the geometric margin \gamma_i / \|w\|.

Maximizing the geometric margin implies minimizing \|w\|. Hence the maximal
margin classifier can now be solved for by minimizing \|w\| while meeting the
constraints of correctly classifying each of the l objects.
At this point, we can utilize Lagrangian optimization techniques to solve for the
maximal margin classifier (w, b) given the objective function (w \cdot w) to minimize
subject to the constraints y_i((w \cdot x_i) + b) \geq 1 for all i = 1, \ldots, l. The primal
Lagrangian given this optimization problem is

    L(w, b, \alpha) = \frac{1}{2} (w \cdot w) - \sum_{i=1}^{l} \alpha_i \left[ y_i((w \cdot x_i) + b) - 1 \right]

where the α_i ≥ 0 are the Lagrange multipliers. The Lagrange Theorem tells us that
\partial L(w, b, \alpha)/\partial w = 0 and \partial L(w, b, \alpha)/\partial b = 0. Using these relations,

    \frac{\partial L(w, b, \alpha)}{\partial w} = w - \sum_{i=1}^{l} y_i \alpha_i x_i = 0

    \frac{\partial L(w, b, \alpha)}{\partial b} = \sum_{i=1}^{l} y_i \alpha_i = 0

The first equation above implies an important relationship between w and each point
x_i. It states that w is a linear combination of the points x_i. The larger the α_i, the
more critical x_i is in the placement of the separating hyperplane.
To obtain the dual Lagrangian, we use the two new relations above and apply
them in the primal Lagrangian:

    L(w, b, \alpha) = \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
                    - \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
                    - b \sum_{i=1}^{l} \alpha_i y_i + \sum_{i=1}^{l} \alpha_i

                  = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)

                  = W(\alpha)

This results in the final statement of the optimization problem for the maximal margin
classifier. Given a linearly separable set of points, we maximize the dual Lagrangian
W(α) given the constraints that α_i ≥ 0 for all i = 1, \ldots, l, and \sum_{i=1}^{l} y_i \alpha_i = 0.
The latter constraint demonstrates how the maximal margin SVM selects points to
construct the separating hyperplane: points on opposite half spaces of the hyperplane
are used to "support" the hyperplane between them, hence the name "Support Vector
Machines".
In addition, the SVM framework is such that the optimization problem is always
in a convex domain. Because of this, we can apply the Karush-Kuhn-Tucker theorem
to draw more conclusions about the structure of the solution:

    \alpha_i \left[ y_i((w \cdot x_i) + b) - 1 \right] = 0

This new relation implies that either α_i = 0, or y_i((w \cdot x_i) + b) = 1. This means that
only points closest to the separating hyperplane have non-zero Lagrangian multipliers,
and hence only those points are used to construct the optimal separating hyperplane
(w, b).
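The following is an illustrative Python sketch of this mechanism using scikit-learn's SVC, which is not the implementation used in this work; a very large C approximates the hard margin, and the fitted model exposes the support vectors and their non-zero multipliers.

# Illustrative sketch: w is a linear combination of a few training points (support vectors).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two linearly separable clouds standing in for cancer (+1) and normal (-1) samples
X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard margin
clf.fit(X, y)

print("support vectors:", clf.support_vectors_.shape[0])
print("alpha_i * y_i  :", clf.dual_coef_)        # non-zero multipliers only
print("w:", clf.coef_, " b:", clf.intercept_)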
5.1.3 Kernel Functions
At this point, the SVM framework has shown how to find a hyperplane that is a linear
combination of the original samples in the training set. The notion of the kernel
function is what allows SVMs to find nonlinear functions that accurately classify
samples.
The kernel function K(x, z) is related to the φ function described in Chapter 4.
The φ function is used to map an input space X to a new feature space F. In the
context of Chapter 4, we described φ functions that reduced the size of the input
space to a more manageable size, which typically results in better generalization
performance. However, φ functions often manipulate the data (e.g. log transforming
a subset of the features). A common motivation for a φ function is if the data
is not linearly separable in the input space, φ projects the data into a new space
where the data is linearly separable. This is the strategy of SVMs. Given a training
set, the samples are implicitly mapped into a feature space where the samples are
then linearly separable. The optimal separating hyperplane is then determined by
solving the optimization problem described above. Since the solution is W', a vector
of Lagrangian multipliers (or weights) for each sample, these weights indicate which
samples are the support vectors.
An important idea to note is that the only way the samples enter the optimization
problem is through the term (x_i \cdot x_j). Only the dot products of the samples'
vectors are used in the construction of the problem. SVMs take advantage of this
by utilizing kernel functions:

    K(x, z) = (\phi(x) \cdot \phi(z))
If a mapping of the samples is required, the kernel function takes the place of the
original dot product in the objective function of the optimization, yielding
    W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
The kernel function allows for the "implicit" mapping of the samples to a new feature
space using the φ function. The term "implicit" is used because it is actually not
required to know the φ function in order to find the optimal hyperplane in a different
feature space. All that is required is the kernel function. Oddly, many kernel functions
are designed without the consideration of the associated φ.
There are requirements regarding what functions can be used as kernel functions.
The first requirement is that the function be symmetric. The second requirement is
that the Cauchy-Schwarz inequality be satisfied:

    K(x, z)^2 = (\phi(x) \cdot \phi(z))^2 \leq \|\phi(x)\|^2 \|\phi(z)\|^2
              = (\phi(x) \cdot \phi(x))(\phi(z) \cdot \phi(z)) = K(x, x) K(z, z)
These requirements are the basic properties of the dot product. An additional requirement is stated by Mercer's theorem. Since there are only a finite set of values that
can arise using the kernel function on a training set, a matrix K of dimensions l by l
can be constructed where each element (i, j) contains the corresponding K(x_i, x_j).
Mercer's theorem states that the kernel matrix K must be positive semi-definite.
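The following is an illustrative Python sketch of Mercer's condition: a kernel matrix is built for a degree-2 polynomial kernel (with the implicit bias of 1 used later in this chapter) on random data, and its eigenvalues are checked for non-negativity. The data and kernel choice are assumptions.

# Illustrative sketch: build an l-by-l kernel matrix and verify positive semi-definiteness.
import numpy as np

def poly_kernel(x, z, degree=2):
    # polynomial kernel with an implicit bias of 1
    return (np.dot(x, z) + 1) ** degree

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))                 # 10 samples, 4 features
l = X.shape[0]
K = np.array([[poly_kernel(X[i], X[j]) for j in range(l)] for i in range(l)])

eigvals = np.linalg.eigvalsh(K)              # K is symmetric, so eigvalsh applies
print("smallest eigenvalue:", eigvals.min())  # non-negative (up to rounding) for a valid kernel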
5.1.4 Soft Margin SVMs
The solution the maximal margin SVM can obtain can be dramatically influenced by
the location of just a few samples. As mentioned above, only a few of the samples
which are near the division between the two classes are support vectors. If the samples
in the training set are subject to noise, the optimal separating hyperplane may not
be the solution that best suits the data, or at the worst case, make the training set
not linearly separable in the kernel induced feature space.
In order to deal with noise that may be present in data sets, the soft margin SVMs
have been developed. "Soft margin" implies that instead of requiring all samples to
be correctly classified, the margin γ can be allowed to be negative for some points.
A margin slack variable ξ_i is a way to quantify the amount a sample fails to have a
target margin γ:

    \xi_i = \max(0, \gamma - y_i((w \cdot x_i) + b))

If ξ_i > γ, then x_i is misclassified. Otherwise, the sample is correctly classified with
at least a margin of γ if ξ_i is 0.
Using the notion of the slack variable which allows for misclassified samples, we
construct a new optimization problem
    minimize    (w \cdot w) + C \sum_{i=1}^{l} \xi_i
    subject to  y_i((w \cdot x_i) + b) \geq 1 - \xi_i,   i = 1, \ldots, l
The extent to which a sample can have a negative γ and the number of such samples is
determined by the parameter C. A C > 0 has the effect of allowing misclassifications
to occur in exchange for increasing the overall margin between the sets. Typically
the selection of C requires arbitrarily trying various values to obtain the best solution
for a specific data set, although it is suggested to use a larger C if the data set is
believed to be noisy. This increases the probability that the decision function obtained
will be generalizable.
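The following is an illustrative Python sketch of the soft margin trade-off, again using scikit-learn's SVC as a stand-in for the implementation used in this work; the data and the values of C are assumptions, and the number of margin violations (non-zero slacks) is reported for each C.

# Illustrative sketch: fit the same noisy data with several values of C.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1.0, 1.5, (25, 2)), rng.normal(-1.0, 1.5, (25, 2))])
y = np.array([1] * 25 + [-1] * 25)

for C in [0.1, 1, 10, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margins = y * clf.decision_function(X)
    violations = int(np.sum(margins < 1))     # samples with non-zero slack
    print(f"C={C:<5} support vectors={clf.n_support_.sum():<3} margin violations={violations}")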
5.2 Previous Work Using SVMs and Genechips
Many researchers have recognized the suitability of SVMs for oligonucleotide array
analysis, due to the generalization performance obtained in high dimensional and
noisy spaces (e.g.
for text recognition).
One of the first applications of SVMs to
bioinformatics was the study conducted by Brown et al.
The study focused on the classification of yeast genes to functional classes (as
defined by MIPS, [16]). The training set was composed of 2,467 genes measured on 79
DNA microarrays. Each microarray measured the yeast genome's expression at time
points during various processes (diauxic shift, mitotic cell division cycle, sporulation)
as well as the genome's reaction to temperature and reducing shocks. These genes
were chosen to be specifically related to 6 different functional classes. The functional
classes were chosen out of the approximately 120 total MIPS classes because the
genes involved were known to cluster well. Four different variants of SVMs were used
(four different kernel functions), as well as four other learning methods. The training
performance of each of these methods was compared for all 6 functional classes. The
testing of the generalization performance of the techniques was done by dividing the
2,467 gene expression vectors randomly into thirds. Two-thirds of the genes were used
as a training set, and classified the genes of the remaining third. This process was
performed twice more, allowing for each third to be classified. The study showed that
SVMs outperform the other methods consistently for each of the functional classes.
The study also went on to show that many of the consistently misclassified genes were
incorrectly classified by MIPS. Other consistent misclassified genes include those that
are regulated at the translation and protein levels, not at the transcription level. The
study then went on to use the decision functions to make predictions about a few
unannotated genes, citing further proof that the classification made by the SVM may
be correct.
Another study that sought to build on the work of Brown et al. was done by
Pavlidis & Grundy [17].
Specifically, the authors investigated how well SVMs were
able to classify genes into the many other functional classes that were not considered by Brown et al.
Another goal of the paper was to improve the SVMs ability
to functionally classify genes by using phylogenetic profile data. Phylogenetic profiles indicate the evolutionary history of the gene; specifically, they are vectors that
indicate whether a homolog of the gene exists in other genomes. Genes with similar
profiles are assumed to be functionally linked, based on the hypothesis that the common presence of the genes is required for a specific function. Phylogenetic profiles
for much of the yeast genome were produced using 23 different organism genomes.
The BLAST algorithm was used to determine whether homologs of each of the yeast
genome genes existed in the other genomes, and the E-value (significance) obtained
from BLAST was used to create the 23 element phylogenetic profile vector. These
phylogenetic profile vectors were used separately to train the SVM to classify genes,
and were shown to provide novel information that was complementary to what the
expression vectors (from the Brown experiments) could provide. Phylogenetic profile
vectors were actually shown to be slightly superior in classifying genes into more functional classes than the expression vectors. When combining the 79 element expression
vector with the 23 element phylogenetic profile vector, the SVM typically improved
its performance, although in some cases the inclusion of the expression vector data
actually caused incorrect classifications.
While the above experiments focused on classifying genes, other researchers attempted to use SVMs to classify samples represented by individual arrays. The most
apt usage of the classification technique was to classify tissue samples as cancerous or
normal, thereby providing a diagnosis tool for cancer. Ben-Dor et al. performed such
a study on two separate data sets composed of tissue samples from tumor and normal
biopsies
[3].
One issue noted was the relatively small number of samples compared to
the number of dimensions (genes) used to represent each sample (in direct contrast to
the studies above, where the number of samples was large relative to the number of
dimensions). Because this has been shown to affect generalization performance, subsets of genes were selected using a combinatorial error rate score for each gene. Four
different learning techniques (two of which were SVMs with different kernels) were
used to classify the samples from two data sets: 62 colon cancer related samples, and
32 ovarian cancer related samples. The jack knifing technique, also known as leave
one out cross validation, was used in order to observe the generalization performance
of the techniques. Jack knifing consists of using n - 1 samples for training, and then
the remaining sample is classified. This process is repeated n times so that every
sample gets classified. The results indicate that SVMs perform consistently and reasonably well compared to the other techniques, while the other techniques perform
inconsistently between the two data sets.
Furey et al. performed a similar study of utilizing SVMs to classify tissue samples
as cancerous or normal using DNA microarrays
[8].
The data set consisted of 31 ovar-
ian tissue samples with 97,802 cDNAs. Because the classification was so drastically
underdetermined, only a linear kernel was utilized (and hence there was no projection
into a higher dimensional feature space).
One variable that was manipulated was a
diagonalization factor (related to the kernel matrix), which arbitrarily placed a constraint on how close the separating hyperplane was to the cancerous set of samples.
Additionally, instead of using all 97,802 measurements, the Golub method (see Ch
4.1.2) was used as a feature selection method to reduce the number of dimensions,
and observe if there was any improvement in generalization performance. The study
concluded that by reducing the number of dimensions/features fed into the SVMs,
the generalization performance can increase significantly. The study also showed that
the generalization performance obtained by SVMs is similar to that of other learning methods (e.g. the perceptron algorithm, or the Golub method).
A final point
made is that the performance of SVMs will be significantly better compared to other
algorithms when the number of samples in the data sets increases.
5.3 Support Vector Classification (SVC) Results
The soft margin support vector machine described in Ch.
5.1.4 was used to find
decision functions for each of the data sets produced by the feature space selection
techniques discussed in Ch. 4. The generalization of the resulting decision functions
was measured by using the jack-knifing technique [11], also known as leave-one-out cross-validation (LOOCV). This technique reduces to removing one sample from the
set of data, training with the remaining samples, and then classifying the removed
sample. These steps are repeated such that each sample is removed, and the number
of misclassifications made after all iterations is known as the jack-knifing error. The
jack-knifing error is commonly divided by the total number of samples in the full data
set to obtain an error percentage. It is important to note that each time a sample
is removed, a potentially different decision function is obtained. The jack-knifing
therefore is more an indication of the ability of the supervised learning method and
the given feature space than it is of a particular decision function. However, the jackknifing error percentage does indicate the generalizability of the "average" decision
function obtained by the learning method that is trained on the full data set in the
given feature space. In the context of using SVMs, jack-knifing requires the soft
margin optimization problem to be solved n times given a data set with n samples to
obtain n decision functions.
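The following is an illustrative Python sketch of the jack-knifing procedure, using scikit-learn's LeaveOneOut and SVC as stand-ins for the actual pipeline; the data, kernel, and value of C are assumptions.

# Illustrative sketch of leave-one-out cross-validation (jack-knifing) with an SVM.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 200))            # 30 samples in a reduced feature space
y = np.array([1] * 15 + [-1] * 15)        # cancer (+1) vs normal (-1) labels

errors = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = SVC(kernel="linear", C=10).fit(X[train_idx], y[train_idx])
    errors += int(clf.predict(X[test_idx])[0] != y[test_idx][0])

print(f"jack-knifing error: {errors}/{len(y)} = {100.0 * errors / len(y):.1f}%")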
Each of the tissue data sets obtained from the various feature space selection
techniques was used to train the soft margin SVM several times, utilizing different
kernel functions while also manipulating C. This was done in order to see if any
kernel function specifically fit the data better (as would be demonstrated by an
improved jack-knifing error percentage), thereby indicating the general complexity of the
data. C, as described in the soft margin optimization problem, enables the user to
force the SVM to focus more on finding a decision function that creates the largest
margin between the two classes of data, even at the expense of a misclassification
(in the training set). A C of 0 reduces the soft margin optimization problem to
the original maximal margin optimization problem, where no misclassifications are
allowed. A large C allows for misclassifications while attempting to find the largest
margin between the overall sets of data. The selection of a particular value for C
is often more art than science, as is the selection of the kernel function. Optimal
selections require that the user have prior knowledge about the structure of the data,
which is often not the case. Although 3-dimensional projections of the data were
produced in Ch. 3, these projections might not provide an accurate view of the data
in the full feature space.
Nine different kernel functions were used to train on the data sets, while C was
varied over the values 0, 5, 10, 50, and 100. The different kernel functions used were:
1. linear
2. polynomial of degree 1 (not equivalent to the linear kernel due to the implicit bias of 1)
3. polynomial of degree 2
4. polynomial of degree 3
5. spline
6. gaussian radial basis function with σ = 0.5
7. gaussian radial basis function with σ = 1.0
8. gaussian radial basis function with σ = 1.5
9. gaussian radial basis function with σ = 2.0

where σ is the global basis function width of the data set [24]. The linear kernel
function corresponds to the dot product, while the polynomial kernel function has
the form K(x, z) = ((x \cdot z) + 1)^d. The implicit bias of 1 has the effect of weighting
the higher-order terms higher than the lower-order terms.
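The following is an illustrative Python sketch of the linear, polynomial, and Gaussian radial basis kernels listed above (the spline kernel is omitted); the vectors and parameter values are assumptions.

# Illustrative definitions of the kernel functions used in the classification runs.
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def polynomial_kernel(x, z, degree):
    # implicit bias of 1
    return (np.dot(x, z) + 1) ** degree

def gaussian_rbf_kernel(x, z, sigma):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

# Example: evaluate each kernel on two random expression-like vectors
x, z = np.random.rand(100), np.random.rand(100)
print(linear_kernel(x, z), polynomial_kernel(x, z, 2), gaussian_rbf_kernel(x, z, 1.0))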
5.3.1 SVC of Functional Data Sets
From the results of the support vector classification of the data sets produced by
reducing the genes to 13 dimensions corresponding to 13 functional classes, several
observations can be made. First, the classification performance obtained in the 13 dimensions is actually quite good. Obtaining low jack-knifing error in dimensions larger
than the number of samples is trivial. However, when the number of dimensions is
significantly fewer than the number of samples, the dimensions used must be expressive of the data. Nearly all tissue sets have jack-knifing error percentages below 15 percent
for at least one of the four versions of the data. However, there is significant variation
in the error percentage that results from the different pre-processing techniques. See
Table 5.1.
First, as was indicated by the visualizations performed in Ch. 3, mean-variance
normalization significantly decreases the jack-knife error percentage compared to the
other techniques.
Global scaling, on the other hand, has a dramatically bad effect
on the data. It is possible that globally scaling the data removes the differences
in expression between the functional groups, causing all samples to appear to be
expressing at the same level.
Another important observation is that varying the soft margin variable C among
the values greater than 0 does not have a significant effect on generalization. The
difference of performance between the same kernel using a C of 5 and a C of 100
was generally negligible. Possibly, a more appropriate range for varying C would
have been 0.5 < C < 5. However, it is clear that by using a C of 0, a data set
can be determined to be separable or not (since a C of 0 corresponds to making
the soft margin optimization problem the "hard" optimization problem allowing no
misclassifications).
Many of the tissue data sets are not separable by any type of
decision function specified by the kernel functions used.
It was also observed that as the complexity of the kernel function increased, the
jack-knifing error increased as well. While a more complex kernel function can be used
to create a decision function that correctly classifies all points, typically the function
will be used to "overfit" the data, producing a function that is very specific to the
training data. Overfitting often results in poor generalization performance. This can
be seen in the error percentage for the complex kernel functions. Overall, the linear
kernel function had the best generalization performance. This also may indicate that
the difference between the cancer and normal samples can be well characterized by
linear relationships in a set of critical genes. If there were higher-order relationships
among the functional classes that cause cancer, they were not detected by the SVM.
Until larger and less noisy data sets are available, most analysis will likely point to
simple linear relationships.
5.3.2 SVC of NMF & PCA Data Sets
Support vector classification was performed in a similar fashion for both the NMF and PCA data sets. PCA and NMF were used to project each of the tissue data sets into specific dimensions, and SVC was performed on the projected data. Each tissue was projected to d = 1, 3, 5, ..., n, where n was equal to l or l - 1, depending on whether the number of samples in the set, l, was odd.
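A hedged sketch of this procedure is given below; it uses the scikit-learn PCA and NMF implementations purely for illustration (they are not the implementations used in this work), with X, y, and jackknife_error as in the earlier sketch:

from sklearn.decomposition import PCA, NMF

l = X.shape[0]                      # number of samples in the tissue set
for d in range(1, l + 1, 2):        # d = 1, 3, 5, ...
    Z = PCA(n_components=d).fit_transform(X)
    # For NMF the data must be non-negative, e.g.
    # Z = NMF(n_components=d, max_iter=1000).fit_transform(X)
    print(d, jackknife_error(Z, y, kernel="linear", C=5.0))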
A curious result was that the globally scaled projections consistently produced the lowest jack-knifing errors. This is the opposite of the observation made from the classification of the functional data sets. Results were similar, however, regarding the selection of C: there was no difference in the performance of the obtained decision functions when varying C from 5 to 100, and a C of 0 consistently caused the SVM to fail to obtain a decision function.
The most consistent and interesting observation from the jack-knifing error percentages associated with the varying dimensions was that, in all cases, the addition of dimensions past a certain point increased the error. It appears that only a few basis vectors/eigenvectors are required to differentiate between cancer and normal samples; the additional dimensions appear to add "noise" to the data, resulting in poorer classification performance. This result is important because it lends credibility to the selection of basis vectors by NMF and the selection of eigenvectors by PCA. The fact that only a few dimensions are required to obtain good classification performance also suggests that a small number of cellular processes differentiate the cancer and normal samples.
The jack-knifing performance varied slightly between NMF and PCA. The NMF
basis vectors provided consistent 0% jack-knifing errors for several dimensions, ranging from 3 to 11 (depending on the tissue). The PCA eigenvectors did not appear to
be quite so robust, but still provided perfect jack-knifing classification in a smaller
range of dimensions. The jack-knifing performance of PCA decreased more quickly
with the addition of more dimensions compared to NMF. This might be a result of
the additive/subtractive nature of PCA, compared with NMF's additive-only nature.
5.3.3 SVC of Mean Difference Data Sets
Just as the support vector classification of the other data sets has implied, small, critical sets of genes can be very effective in describing the differences between cancer and normal samples. The mean difference method was used to select three sets of genes to describe each data set, and support vector classification was then performed on the dimensionally reduced data sets.
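As a rough sketch of this kind of selection (the precise statistic is defined in Ch. 4; a two-sample t-test on the class means is assumed here purely for illustration), genes can be kept when their cancer/normal mean difference is significant at a chosen p-value cutoff:

import numpy as np
from scipy.stats import ttest_ind

def mean_difference_subset(X, y, p_cutoff=1e-3):
    # Keep genes whose cancer/normal means differ with p below the cutoff;
    # X is a (samples x genes) matrix and y holds +1/-1 class labels.
    X, y = np.asarray(X), np.asarray(y)
    _, pvals = ttest_ind(X[y == 1], X[y == -1], axis=0)
    keep = np.where(pvals < p_cutoff)[0]
    return X[:, keep], keep

Applying the function with cutoffs of 1e-2, 1e-3, and 1e-4 yields three nested gene subsets per data set, which can then be evaluated with jackknife_error as before.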
The jack-knifing errors were 0 for nearly all reduced data sets (see Table 5.1). One might argue that this is to be expected, since the number of genes in each of these subsets is still relatively large compared to the number of samples, and separating a few points in a high-dimensional space is trivial. However, what is expected is only that the SVM find a decision function that separates all samples in the training set; it is not necessarily expected that the decision functions selected by the SVM generalize well enough to correctly classify an additional sample. The fact that, each time a point was removed and the SVM was trained on the remaining samples, the resulting decision function correctly classified the held-out point means that the dimensions selected are truly representative of the differences between cancer and normal samples.
Another important observation was that as the size of the selected subsets of genes decreased, the jack-knifing error actually decreased. The genes in the smaller subsets were those identified as having more significant differences between the means of the cancer and normal samples. The fact that a small subset of supposedly more significant genes can provide better generalization than a larger set implies that the technique selects genes which truly are significant in distinguishing cancer samples from normal samples. Additionally, using a linear kernel with C = 0, most of the data sets were not linearly separable.
Performing SVC on these data sets also showed overfitting. Once again, the linear
kernel performed the best. When using a more complex kernel function, such as the
polynomial kernel of degree 2, the jack-knifing error increased significantly.
5.3.4 SVC of Golub Data Sets
As described in Ch. 4.1.3, the Golub method of selecting significant features was used to create subsets of 50, 100, 150, 200, 250, and 500 genes. The metric introduced by Golub quantifying the signal-to-noise ratio was used to create these subsets, and the Pearson coefficient was additionally used to find genes that matched the ideal expression vectors. The top-scoring genes for each of these two metrics were used to create 12 subsets of genes.
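For illustration, the two rankings can be sketched as follows. The signal-to-noise statistic follows [9]; the "ideal expression vector" is assumed here to be a simple class-membership indicator, and ranking by absolute score is an assumption rather than a detail taken from Ch. 4.1.3:

import numpy as np

def golub_scores(X, y):
    # Golub signal-to-noise statistic per gene: (mu1 - mu2) / (sigma1 + sigma2).
    # X: (samples x genes) NumPy array; y: +1/-1 class labels.
    c, n = X[y == 1], X[y == -1]
    return (c.mean(axis=0) - n.mean(axis=0)) / (c.std(axis=0) + n.std(axis=0) + 1e-12)

def pearson_scores(X, y):
    # Pearson correlation of each gene with an idealized expression pattern
    # that is 1 for cancer samples and 0 for normal samples.
    ideal = (y == 1).astype(float)
    Xc, ic = X - X.mean(axis=0), ideal - ideal.mean()
    return (Xc * ic[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(ic) + 1e-12)

def top_k_genes(X, scores, k=50):
    keep = np.argsort(-np.abs(scores))[:k]
    return X[:, keep], keep

Taking the top 50, 100, 150, 200, 250, and 500 genes under each of the two scores produces the 12 subsets evaluated below.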
In general, the jack-knifing errors obtained using these reduced data sets are particularly low. They are similar to those obtained by the SVC of the mean difference data sets above, although not quite as good for a few data sets. As with the mean difference method, the generalization of the decision functions improves as the size of the gene subsets decreases. The Golub and Pearson metrics result in very similar jack-knife errors, although the Golub (signal-to-noise) metric slightly outperforms the Pearson coefficient. The performance of the two metrics differs only on the "difficult to separate" data sets identified by the visualizations performed in Ch. 3, such as the kidney and the pancreas. It seems that finding genes that have significant signal content (relative to the noise in the data set) is critical to separating these data sets, hence the better performance of the Golub metric.
5.4 Important Genes Found
Given the excellent classification performance yielded by using the variable selection
techniques, the genes found by these techniques seem to be closely related to the
differentiation of cancerous and normal tissues. A common list was composed of the genes that repeatedly appeared in several of the subsets.
Another method of selecting important genes is to analyze the decision function
obtained from support vector classification. Given a specific decision function, those
features with the largest coefficients associated with them are the most significant
in the proper classification of cancer and normal samples.
Since the linear kernel produced the lowest jack-knifing errors across all tissue data sets, the decision function had the form of a linear combination of the expression values of various genes. The genes with the highest coefficients were noted, and a list of important genes was thus created using decision functions from all the variable selection techniques over all the differently pre-processed data sets. See Table 5.2 for a few example genes.
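A sketch of this coefficient-based ranking is shown below. For a linear kernel the decision function reduces to f(x) = w · x + b, so the genes with the largest |w_j| contribute most to the classification; X_sub and gene_ids are placeholder names for a reduced data set and its gene identifiers, and scikit-learn is again used only for illustration:

import numpy as np
from sklearn.svm import SVC

clf = SVC(kernel="linear", C=5.0).fit(X_sub, y)
w = clf.coef_.ravel()                  # one weight per gene in the subset
for j in np.argsort(-np.abs(w))[:10]:  # the ten most influential genes
    print(gene_ids[j], w[j])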
Table 5.1. Jack-knifing percentage (%) errors using a linear kernel function

Tissue Type       Func      M-D p=10^-2   M-D p=10^-3   M-D p=10^-4
breast, orig      8.0000    4.0000        0             0
breast, log       12.0000   0             0             0
breast, mv        0         NaN           7.6923        0
breast, gs        21.4286   0             0             0
colon, orig       23.0769   0             0             0
colon, log        15.6250   0             0             0
colon, mv         6.2500    0             0             0
colon, gs         50.0000   10.0000       0             0
gastric, orig     5.8824    0             0             0
gastric, log      30.7692   7.6923        7.6923        0
gastric, mv       0         0             0             0
gastric, gs       NaN       0             0             0
kidney, orig      0         0             0             0
kidney, log       6.2500    0             0             0
kidney, mv        30.7692   NaN           NaN           NaN
kidney, gs        30.0000   10.0000       0             0
liver, orig       0         0             0             0
liver, log        5.8824    0             0             0
liver, mv         0         0             0             0
liver, gs         30.7692   7.6923        0             0
lung, orig        18.7500   NaN           NaN           NaN
lung, log         0         0             0             0
lung, mv          0         0             0             0
lung, gs          50.0000   NaN           NaN           NaN
ovary, orig       0         0             0             0
ovary, log        0         0             0             0
ovary, mv         2.9412    NaN           NaN           NaN
ovary, gs         18.7500   0             0             0
pancreas, orig    0         0             0             0
pancreas, log     0         0             0             0
pancreas, mv      0         0             0             0
pancreas, gs      50.0000   0             0             0
prostate, orig    0         0             0             0
prostate, log     0         0             0             0
prostate, mv      0         0             0             0
prostate, gs      8.8235    0             0             0
Table 5.2. Genes selected via "vote" and SVM decision function analysis

Gene          Function
AF053641      brain cellular apoptosis susceptibility protein (CSE1) mRNA
X17206        mRNA for LLRep3
HG511-HT511   Ras Inhibitor Inf
M60974        growth arrest and DNA-damage-inducible protein (gadd45) mRNA
U33284        protein tyrosine kinase PYK2 mRNA
M13981        inhibin A-subunit mRNA
V00568        mRNA encoding the c-myc oncogene
S78085        PDCD2 = programmed cell death-2/Rp8 homolog
Z33642        H. sapiens V7 mRNA for leukocyte surface protein
AL031778      dJ34B21.3 (PUTATIVE novel protein)
Y09615        H. sapiens mRNA for mitochondrial transcription termination factor
U10991        G2 protein mRNA
U70451        myeloid differentiation primary response protein MyD88 mRNA
D26158        Homo sapiens mRNA for PLE21 protein
L07493        Homo sapiens replication protein A 14kDa subunit (RPA) mRNA
U16307        glioma pathogenesis-related protein (GliPR) mRNA
AL050034      Homo sapiens mRNA
Chapter 6
Conclusions
The goal of this thesis was to further the understanding of the mechanisms of cancer.
The methodology to achieve this goal involved comparing and contrasting the genomic
expression of cancerous and normal tissues.
Oligonucleotide arrays were used to
gather large amounts of genomic data, and were able to provide a "snapshot" view of
the genomic activity of the cells in both cancerous and normal tissues of nine different
human organs. A computational methodology was then utilized to discover critical
information in these tissue data sets.
The first step of the methodology involved pre-processing the data to ensure that
the information pertinent to the differentiation of cancer and normal samples was
clearly presented, while minimizing the effect of external factors such as data preparation. Four techniques were used: removal of "negative" genes, log normalization,
mean-variance normalization, and global scaling.
In order to confirm the presence of relevant information in the pre-processed data
sets, three visualization techniques were used: hierarchical clustering, Multidimensional Scaling (MDS), and Locally Linear Embedding (LLE). Each of these techniques confirmed that 7 out of the 9 tissue data sets contained cancer and normal
samples that separated into two clusters, with varying degrees of separation between
the clusters.
The data sets were then reduced by using two classes of feature space selection techniques: variable selection and dimension reduction. Three dimension reduction techniques were applied to reduce the number of features representing each sample. Principal Components Analysis (PCA) and Non-negative Matrix Factorization (NMF) were used to select arbitrarily sized basis sets to represent the data, thereby transforming the input space into a smaller feature space. The data sets were also reduced to 13 functional classes by using the MIPS yeast database, with human genes assigned to the functional class of their yeast sequence homologues.
The transformed and reduced data sets were then passed to a supervised learning method, the Support Vector Machine (SVM). SVMs with various kernel functions and soft margin parameters were trained on the transformed data sets, and the jack-knifing technique was used to quantify the classification performance of the obtained decision functions. Many configurations (data set, type of kernel function, C value) obtained excellent classification performance, indicating that many of the feature space selection techniques were able to identify critical subsets of genes (from an original set of 6000) capable of separating cancerous tissue from normal tissue. The expression measurements of such genes could be used as a diagnostic tool, in the form of the decision function obtained from support vector classification.
Effectiveness of Methodology
The classification performance of decision functions obtained through the use of SVMs was certainly improved by introducing the various feature space selection techniques. In the process of developing these decision functions, the support vector machines were also able to validate the feature selection techniques used to generate the data sets. The low jack-knifing errors associated with the mean difference technique and the Golub selection indicate that simple techniques which separate noise from signal discover critical sets of genes. The decision functions obtained from the support vector classification were also useful as a second "filter" of the original set of genes, highlighting critical genes with large coefficients.
Analysis of the support vector classification results also established the utility of the simple linear kernel. For nearly all data sets, the linear kernel achieved the best classification performance. This is likely to remain the case until larger data sets are obtained which can effectively demonstrate any higher-order structure in the data. The poor generalization of the more complex kernels is likely due to overfitting of the data. In addition, the support vector classification results demonstrated the importance of pre-processing oligonucleotide array data.
Bibliography
[1] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J.
Levine. Broad patterns of gene expression by clustering analysis of tumor and
normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci.
U.S.A., 96:6745-6750, 1999.
[2] O. Alter, P. O. Brown, and D. Botstein. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci.
U.S.A., 97:10101-10106, 2000.
[3] A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini.
Tissue classification with gene expression profiles. pages 1-12, 2000.
[4] M. P. S. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S.
Furey, M. Ares Jr., and D. Haussler. Knowledge-based analysis of microarray
gene expression data by using support vector machines. Proc. Natl. Acad. Sci.,
97:262-267, 2000.
[5] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines.
Cambridge University Press, Cambridge, UK, 2000.
[6] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A., 95:14863-14868, 1998.
[7] D. Fasulo. An analysis of recent work on clustering algorithms. pages 1-23, 1999.
[8] T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and
D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. UCSC Technical report, UCSC-CRL-00-04:1-17, 2000.
[9] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov,
H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S.
Lander. Molecular classification of cancer: Class discovery and class prediction
by gene expression monitoring. Science (Washington, D.C.), 286:531-537, 1999.
[10] J. C. Gower. Some distance properties of latent root and vector methods used
in multivariate analysis. Biometrika, 53:325, 1966.
[11] L. J. Heyer, S. Kruglyak, and S. Yooseph. Exploring expression data: Identification and analysis of coexpressed genes. Genome Res., 9:1106, 1999.
[12] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix
factorization. Nature (London), 401:788-791, 1999.
[13] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization.
Adv. Neural Info. Proc. Syst., 13:556-562, 2001.
[14] R. J. Lipshutz, S. P. A. Fodor, T. R. Gingeras, and D. J. Lockhart. High density
synthetic oligonucleotide arrays. Nature Genet., 21:20-24, 1999.
[15] H. Lodish, A. Berk, S. L. Zipursky, P. Matsudaira, D. Baltimore, and J. E.
Darnell. Molecular Cell Biology. W.H. Freeman and Company, New York,
2000.
[16] H. W. Mewes, D. Frishman, C. Gruber, B. Geier, D. Haase, A. Kaps, K. Lemcke, G. Mannhaupt, F. Pfeiffer, C. Schuller, S. Stocker, and B. Weil. MIPS: A
database for genomes and protein sequences. Nucleic Acids Res., 28:37-40, 2000.
[17] P. Pavlidis and W. N. Grundy. Combining microarray expression data and phylogenetic profiles to learn gene functional categories using support vector machines.
Columbia University Computer Science Technical Report, CUCS-011-00:1-11,
2000.
[18] J. A. Rice. Mathematical Statistics and Data Analysis. Duxbury Press, Belmont,
CA, 1995.
[19] S. Roweis. Jpl vc slides. http://www.cs.toronto.edu/~roweis/notes.html, page 5,
1996.
[20] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear
embedding. Science, 290:2323-2326, 2000.
[21] American Cancer Society. Cancer facts and figures 2001. ACS website, www.cancer.org, 1:2-5, 2001.
[22] G. Strang. Introduction to Linear Algebra. Wellesley-Cambridge Press, Wellesley,
MA, 1993.
[23] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lander, and T. R. Golub. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. U.S.A., 96:2907-2912, 1999.
[24] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New
York, 1995.
[25] P. P. Zarrinkar, J. K. Mainquist, M. Zamora, D. Stern, J. B. Welsh, L. M.
Sapinoso, G. M. Hampton, and D. J. Lockhart. Arrays of arrays for high-throughput gene expression profiling. Genome Research, 11:1256-1261, 2001.