Final Report - Department of Computer Science

Microarrays: A Comparison of Classification and
Feature Selection Algorithms for Interpretation
(Lynn H. Lee, Hiram Shaish, Eric A. Smith, Min C. Zhang, Yuhang Wang)¹

¹ Work division among the team members was as follows. Lynn Lee studied and described the classification methods, and performed all the experiments that use KNN as the classification method. Hiram Shaish studied and described the background of microarrays, and compiled and analyzed the experimental results. Eric Smith programmed, tested, and described the data parser. Min Zhang studied and described the feature selection methods, and performed all the experiments that use SVM as the classification method. Each team member contributed to the writing and editing process.
Abstract
Microarrays have the potential to provide a wealth of information as diagnostic
tools in a clinical setting. Through the comparison of different expression patterns to a
standard expression profile, ‘abnormal’ patterns may be diagnosed well before symptoms
arise in individuals. Numerous algorithms have been experimented with for feature
selection and classification of data sets in an attempt to minimize the feature space and to
maximize classification accuracy. In this paper we discuss the background of
microarrays. Additionally we compare the results of several algorithms used to classify a
given data set. We demonstrate that for our selected algorithms the potential
classification accuracy is realized with a very small feature space.
Background
Each cell in our body contains the exact same copy of DNA, which serves as the
code for cell structure and function. In spite of this we know that different cells in our
body have different phenotypes and perform different functions. These range from the
oxygen-transporting denucleated red blood cells to the information-carrying elongated
neurons. This point is precisely why unraveling the code in its entirety has fallen short of
its goal. Clearly there are regulatory mechanisms by which certain genes are up-regulated
in specific cell types while down-regulated in others. This information is not easily
obtained by simply knowing the code.
Although DNA is the blueprint, it does not take an active role in the translation of
proteins. DNA is transcribed into mRNA in the nucleus, and mRNA is subsequently
transported out of the nucleus to the protein factories, the ribosomes. The ribosomes,
which serve as the decoders of the mRNA, build long chains of amino acids (a.k.a.
proteins) based on sequential three-nucleotide mRNA codons. There is a direct correlation
between the amount of mRNA transcribed from DNA in the nucleus and the amount of
protein that is translated. In other words, the amount of mRNA for a given protein is a
good measure of the activation level of the corresponding gene. The microarray is a chip
upon which each gene of an organism is represented by a minute cDNA strand. A sample
cell is lysed and its mRNA is allowed to bind to the matching cDNA strand. Since the
mRNA is fluorescently labeled beforehand, the level of fluorescence per cDNA strand
after application of the cell extract is proportional to the amount of corresponding mRNA
in the cell. This technique allows scientists to build an expression profile for different
tissue samples.
There are three broad categories of applications for microarrays. The first, gene
expression profiling, is the comparison of different tissue samples through their
expression patterns. The use of microarrays in this fashion is especially prevalent in the
field of oncology. The aim of cancer prognosis and treatment is to be able to classify
tumors and to selectively attack them. Gene expression profiling provides a powerful tool
to accomplish this goal. The second category of applications is DNA sequencing. In this
approach sample DNA, not mRNA, is allowed to hybridize to a microarray. Given that
the DNA on the chip or in the sample tissue represents an individual with a genetic
disorder of unknown origin, it then becomes possible to locate the defective genes. The
third category, genotyping, involves the application of genomic DNA to a microarray of
known genetic markers. These markers have been associated with different phenotypical
conditions and present risk factors [6].
The main obstacle of using microarrays, besides the cost, is the enormous amount
of information that must be analyzed. Given that the human genome consists of ~30,000
genes and that the main use of microarrays is to compare several tissue samples, a
considerable amount of analysis must be done. This is where the need for computational
methods becomes apparent. To be able to compare tissue samples, the 30,000-gene
profile must be reduced as much as possible. Currently there are several techniques for
minimizing the size of the profile without losing its unique character.
Most of the algorithms for analyzing microarray data are based on two classical
problems in computer science. The first is that of feature selection. This is the process
whereby certain features (genes in our case) in a training data set are kept and others
discarded as a basis for classification, which we discuss shortly. Most of the commonly used feature selection methods, except for t-statistics and one-dimensional SVM,
“quantify the effectiveness of a feature by evaluating the strength of class predictability
when the prediction is made by splitting the full range of expression of a given gene into
two regions, the high region and the low region” [7]. The second problem is the
subsequent classification. This is a process that employs a machine-learning type
approach. An algorithm is given a training set upon which it ‘learns’ to classify a given
profile into one or more classes (e.g. cancerous/non-cancerous, anemic/non-anemic, in
the G1 phase/in the G2 phase, etc.). This algorithm is then applied to the desired data set.
An important recent improvement upon classical techniques is the generalization from
classification into two classes to classification into any number of classes. This
improvement is often used in the analysis of microarray data.
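To make the quoted scoring idea concrete, the following is a minimal sketch in Python (not from the original study; the expression values, class labels, and threshold are invented for illustration). It splits one gene's expression range into a low and a high region and measures how well region membership predicts the class:

    # Split one gene's expression at a threshold into "low"/"high" regions,
    # predict the majority class within each region, and count how many
    # samples that rule classifies correctly.
    from collections import Counter

    expr = [0.2, 0.4, 0.3, 1.5, 1.8, 1.2]            # one gene, six samples
    labels = ["never", "never", "never", "smoker", "smoker", "smoker"]
    threshold = 1.0

    regions = ["high" if e > threshold else "low" for e in expr]
    majority = {r: Counter(l for l, rr in zip(labels, regions) if rr == r)
                   .most_common(1)[0][0]
                for r in set(regions)}
    correct = sum(majority[r] == l for r, l in zip(regions, labels))
    print(f"class predictability: {correct}/{len(labels)}")  # -> 6/6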
Methods and Materials
Microarray Data
We obtained expression data of 22,216 genes taken from human epithelial cells of
75 individuals [9]. These individuals were classified through questionnaires into 3
groups: smokers (34 individuals), those who have never smoked before (23 individuals)
and those who have smoked in the past but have since quit (18 individuals) [10]. The
selection of this data set was important in our studies since the phenotype was easily
verifiable.
Software
We selected WEKA[8] as our algorithmic software. WEKA, an open source code
implementation of many bioinformatics and data processing algorithms, includes
implementations of our selected algorithms. Importantly, WEKA requires input in the
form of ARFF files. This format is different from the format generated by many
microarray applications. In particular, the gene expression omnibus database [9], as
previously mentioned, was our source for microarray data and provides the data in SOFT
format. Thus our first task was to code a parser that would convert the SOFT format into
the ARFF format.
Algorithms
We chose to work with four algorithms. Our first classification algorithm combines
the support vector machine (SVM), which constructs a hyperplane in feature space such
that two classes of data are separated by the largest margin [7], with error-correcting
output codes (ECOC). The SVM was chosen since it has traditionally been very strong in
binary classification; accordingly, the ECOC method breaks the original multiclass
problem down into a series of binary partitions. In general, given some number of binary
classifiers, the ECOC method functions by deriving an output “word” for any input. This
word is then compared to the codeword for each class (computed in an earlier operation),
and the class whose codeword has the fewest disagreements with the computed word for
the input is selected.
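To make the decoding step concrete, the following is a minimal sketch in Python (not the authors' WEKA configuration; the codewords and the stub binary classifiers are invented for illustration):

    # ECOC decoding: each class is assigned a binary codeword; an input is
    # labeled by running every binary classifier and choosing the class whose
    # codeword is nearest in Hamming distance to the resulting output word.
    def hamming(a, b):
        # number of positions where two equal-length bit tuples disagree
        return sum(x != y for x, y in zip(a, b))

    def ecoc_classify(x, classifiers, codewords):
        # classifiers: list of functions mapping an input to 0 or 1
        # codewords: dict of class label -> tuple of bits, one per classifier
        word = tuple(clf(x) for clf in classifiers)   # output word for x
        return min(codewords, key=lambda c: hamming(codewords[c], word))

    # Toy usage with three classes and three stub binary classifiers:
    codewords = {"smoker": (1, 1, 0), "never": (0, 0, 1), "former": (0, 1, 1)}
    classifiers = [lambda x: int(x > 2), lambda x: int(x > 1), lambda x: int(x < 2)]
    print(ecoc_classify(3.0, classifiers, codewords))  # -> smoker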
Our second classification algorithm, k-nearest neighbor (KNN), was chosen for its
relative straightforwardness and its accuracy when the data is preprocessed by feature
selection. As the name suggests, KNN is given an integer parameter k; for a given input,
KNN finds the k points in the training data nearest to the input and classifies the input
on the basis of the classifications of those k points. The error rate for KNN is low,
“asymptotically at most two times the Bayesian error” [7].
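Likewise, a minimal KNN sketch (again with invented toy data, using Euclidean distance and a simple majority vote):

    # k-nearest neighbor: classify an input by majority vote among the k
    # training points closest to it in Euclidean distance.
    import math
    from collections import Counter

    def knn_classify(x, train, k=3):
        # train: list of (feature_vector, label) pairs
        nearest = sorted(train, key=lambda p: math.dist(p[0], x))[:k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]             # majority label

    train = [([0.1], "never"), ([0.2], "never"),
             ([0.9], "smoker"), ([1.1], "smoker")]
    print(knn_classify([1.0], train, k=3))            # -> smoker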
For feature selection we chose Information Gain and Chi-square Statistics.
Information Gain is widely used in machine learning algorithms and may achieve good
performance with one of our classification methods, ECOC. “Information Gain²
measures the number of bits of information obtained for class prediction by knowing the
value of a feature” [5]. Alternatively, the Chi-square Statistics³ method is widely used in
statistics. Chi-square measures the degree of dependence between the feature and the class.
In this method, the numbers of rows and columns in the gene data matrix specify the
degrees of freedom. When the Chi-square statistic is larger than the critical value
determined by the degrees of freedom, the feature and the class are considered dependent.
Such features are selected. Wang et al. [5] show that Chi-square Statistics is
approximately as effective as Information Gain; we therefore chose it as the alternative
method against which to compare the Information Gain results.

² The information gain of a feature f is defined as IG(f) = -\sum_{c \in C} P(c) \log_2 P(c) + \sum_{v \in V} P(f = v) \sum_{c \in C} P(c \mid f = v) \log_2 P(c \mid f = v), where C denotes the set of classes and V the set of possible values for feature f [5].
³ The Chi-square of a feature f is defined as \chi^2(f) = \sum_{v \in V} \sum_i (A_i(f = v) - E_i(f = v))^2 / E_i(f = v), where A_i(f = v) is the number of instances in class c_i with f = v and E_i(f = v) is the expected value of A_i(f = v), calculated as P(f = v) P(c_i) N [5].
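For concreteness, here is a short sketch of the two scoring rules defined in footnotes 2 and 3 (a direct transcription of the formulas into Python, not WEKA's implementation; the gene values and class labels are invented, with the gene discretized into 'high' and 'low' regions):

    # Score one discretized gene by Information Gain and by Chi-square.
    import math
    from collections import Counter

    values = ["high", "high", "low", "low", "high", "low"]   # f = v per sample
    labels = ["smoker", "smoker", "never", "never", "smoker", "never"]
    N = len(values)

    def entropy(ps):
        return -sum(p * math.log2(p) for p in ps if p > 0)

    # Information Gain: IG(f) = H(C) - H(C | f)
    h_c = entropy([n / N for n in Counter(labels).values()])
    h_c_given_f = 0.0
    for v in set(values):
        idx = [i for i in range(N) if values[i] == v]
        cond = [n / len(idx) for n in Counter(labels[i] for i in idx).values()]
        h_c_given_f += (len(idx) / N) * entropy(cond)
    print("IG =", h_c - h_c_given_f)                         # -> 1.0 (bits)

    # Chi-square: sum over (v, c) of (A - E)^2 / E, with E = P(f=v) P(c) N
    chi2 = 0.0
    for v in set(values):
        for c in set(labels):
            A = sum(1 for i in range(N) if values[i] == v and labels[i] == c)
            E = (values.count(v) / N) * (labels.count(c) / N) * N
            chi2 += (A - E) ** 2 / E
    print("chi2 =", chi2)                                    # -> 6.0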
Experiments
We crossed the 2 feature selection algorithms with the 2 classification algorithms
to see how the 4 combinations would compare. We ran each combination up to 13 times,
each time with a differently sized feature space. In total we ran 46 experiments.
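A sketch of what this grid looks like in code (using scikit-learn as a stand-in for WEKA, which is not how the original runs were performed; mutual_info_classif plays the role of Information Gain, a one-vs-rest linear SVM stands in for ECOC, and the random matrix merely substitutes for the real expression data, so the printed accuracies are meaningless):

    # Cross 2 feature selectors with 2 classifiers over several feature-space
    # sizes, scoring each combination by cross-validated accuracy.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.random((75, 600))           # stand-in: 75 samples x 600 genes
    y = rng.integers(0, 3, size=75)     # stand-in: smoker / never / former

    selectors = {"Info Gain": mutual_info_classif, "Chi Square": chi2}
    classifiers = {"KNN": KNeighborsClassifier(n_neighbors=3),
                   "SVM": LinearSVC()}  # one-vs-rest stand-in for ECOC
    for s_name, score in selectors.items():
        for c_name, clf in classifiers.items():
            for k in [1, 2, 5, 10, 50, 100, 500]:
                pipe = make_pipeline(SelectKBest(score, k=k), clf)
                acc = cross_val_score(pipe, X, y, cv=5).mean()
                print(f"{s_name} + {c_name}, {k} features: {acc:.2f}")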
Data and Results
SOFT-to-ARFF Parser
The creators of both the SOFT and ARFF formats have followed the fundamental
design principle of storing data as plain, human-readable text. This principle exists for a
number of reasons, and the reader is referred to [11] for more details.
In addition, we required that the parser’s implementation be completed in the bare
minimum amount of time, and be as small and bug-free as possible. We did not
particularly care about its efficiency in running time or memory usage. Thus it seemed
immediately apparent that Perl was the perfect choice of implementation language.
It soon became obvious that our intuitions for choosing Perl were well-founded.
The coding was completed very rapidly. The program consists of about 100 lines of code
and has generated output that is exactly to spec in every one of our tests to date. It takes
about 10 seconds to convert one data set; but since our project only involves looking at a
few data sets, this was not a problem.
A few other particulars of the parser are worth mentioning. We have attempted to
write it according to standards that would be applied to any piece of commercial-quality
UNIX software. Thus, the program works in a pipeline, reading from standard input and
writing to standard output. More than half of its text is comments. The strict Perl
pragma is used, so that the program can be easily incorporated into a larger system. A
help message is available when the
program is run with the usual -h flag. All these things significantly increase the
program’s usability and maintainability.
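For illustration, here is a minimal sketch of the same stdin-to-stdout conversion idea (in Python rather than the Perl of the actual parser, whose address is given under Footnotes below; the SOFT handling is deliberately simplified and the all-numeric attribute typing is an assumption):

    # Simplified SOFT-to-ARFF filter: read a SOFT data table on standard
    # input, write an ARFF file on standard output. Real SOFT files carry
    # far more metadata; only the tab-separated sample table is handled.
    import sys

    header, rows = None, []
    for line in sys.stdin:
        line = line.rstrip("\n")
        if not line or line.startswith(("^", "!", "#")):
            continue                      # skip SOFT metadata lines
        cols = line.split("\t")
        if header is None:
            header = cols                 # first table line names the columns
        else:
            rows.append(cols)

    print("@RELATION expression")
    for name in header:
        print(f"@ATTRIBUTE {name} NUMERIC")   # assumes all-numeric columns
    print("@DATA")
    for cols in rows:
        print(",".join(cols))

Invoked as a pipeline, e.g. python soft2arff.py < dataset.soft > dataset.arff (hypothetical file names).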
Feature Selection

Algorithm (feature space size)   Selected features
Info Gain (1)                    6559
Info Gain (2)                    6559, 4150
Info Gain (5)                    6559, 4150, 9659, 2832, 2437
Info Gain (10)                   6559, 4150, 9659, 2832, 2437, 9126, 10461, 5622, 5327, 1468
Chi Square (1)                   6559
Chi Square (2)                   6559, 4150
Chi Square (5)                   6559, 4150, 9659, 2832, 9126
Chi Square (10)                  6559, 4150, 9659, 2832, 9126, 2437, 5622, 10461, 17463, 1468

Table 1: Identity of the feature profile as selected by Chi Square and Info Gain when run with different
feature space sizes.
Surprisingly, as Table 1 shows, both feature selection algorithms selected almost
the same features for profiles of similar sizes (only sizes 1 through 10 are shown).
Although the latter five features vary in the order of their significance and two of these
are actually different features, the first four features are the same for both algorithms
(WEKA lists the features in their order of significance to classification). Since the two
feature selection algorithms operate under very different methodologies, this suggests
that these particular features correlate highly with the classification of the data. This
provides a logical methodology to be followed when selecting features for classification.
The significance of a selected feature profile for classification by a given algorithm can
be validated by comparing it to the feature profile produced by a very different algorithm.
If the two algorithms select similar features, then those features likely correlate strongly
with the classification of the data. From this we surmise that feature 6559 is a primary gene that is affected by
smoking or, a gene responsible for a primary phenotypic difference between a “smoker”
and a “non-smoker”.
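This overlap check is trivial to mechanize; a short sketch using the ten-feature profiles from Table 1:

    # Validate a feature profile by its overlap with the profile chosen by a
    # methodologically different selector (lists taken from Table 1).
    info_gain = [6559, 4150, 9659, 2832, 2437, 9126, 10461, 5622, 5327, 1468]
    chi_square = [6559, 4150, 9659, 2832, 9126, 2437, 5622, 10461, 17463, 1468]

    shared = set(info_gain) & set(chi_square)
    print(f"{len(shared)} of {len(info_gain)} features agree:", sorted(shared))
    # -> 9 of 10 features agree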
[Figure 1: line plot of classification accuracy (% correctly classified) versus number of features (0-600) for the Chi Square-ECOC, Info Gain-ECOC, Chi Square-KNN, and Info Gain-KNN combinations.]
Figure 1: Classification accuracy for the four algorithm combinations. The data includes classification
with up to 500 features.
[Figure 2: line plot of classification accuracy (% correctly classified) versus number of features (0-60) for the same four combinations.]
Figure 2: Classification accuracy for the four algorithm combinations. The data includes classification
with up to 50 features.
Since the differences in the feature selection results were minor and did not
encompass the five most significant features, we may use the smaller profiles as a method
to compare the two classification algorithms. Figure 2 clearly shows that KNN outperforms ECOC given fewer features. In fact, while ECOC achieves only 45%
classification accuracy with one feature, KNN produces 70% accuracy. As the feature
number grows, so do the differences between the features selected by Info Gain and Chi
Square, thus resulting in four different curves. The highest accuracy of 75% is produced
by ECOC in combination with both feature selection algorithms.
We mention two important observations. First, even though classification was run
on a feature set containing all of the genes (data not shown), the accuracy never exceeded
75%. This can be attributed to an inherent characteristic of the data. Since the data was
partitioned into 3 classes based on simple, yet clearly oversimplified, criteria (in reality,
factors such as how many cigarettes each smoker smokes per day, and how long ago
former smokers quit, matter as well), it stands to reason that the ‘boundary’
cases may be harder to classify. Second, a relatively high degree of accuracy is produced
with a limited feature space size (70% with one gene under KNN).
Conclusion
In this paper we have examined the results of classification and feature selection
using different algorithms on a sample data set. We have shown that different feature
selection algorithms may be used to strengthen the relevance of selected features to the
subsequent classification. This is paramount in a clinical setting where scientists and
doctors strive to uncover potential targets for therapy. Additionally we have demonstrated
that relatively high classification accuracy can be achieved with a small number of
features. The maximum accuracy is limited not by the number of features, but by inherent
characteristics of the data. Therefore, preprocessing and clustering the data, as done in
other studies, may be necessary in order to enhance the accuracy of prediction.
Future work should include further experimentation with other algorithms as well
as design of novel ones, customized for this specific problem [5]. Importantly, our
selected algorithms were not originally created to classify microarray data. It would also
be interesting to study what attributes make an expression data set a good candidate for
classification, i.e. one that would give a high degree of accuracy upon classification.
We conclude with a broader perspective. While there are currently many parallel
efforts, such as ours, to test and develop algorithms for microarray data analysis, there are
still numerous obstacles. First, there is a shortage of personnel who have the background
and understanding necessary to bridge the interdisciplinary gap between biology and
computer science. Second, computer scientists will need to dispel the prevailing
‘mistrust’ of more traditional scientists before their solutions are taken more seriously.
Third, large validated data sets, which are fundamental to the testing of new algorithms,
are largely unavailable to many computer science research groups. In spite of the
obstacles, the falling costs of microarray analysis and the enormous clinical potential will
no doubt continue to stimulate research into the algorithmic interpretation of the data.
Footnotes:
1- The parser may be found at: http://www.cs.dartmouth.edu/~eas/soft2arff.txt
Works Cited:
1 - Golub, T. R., D. K. Slonim, et al. (1999). Classification of Cancer: Class Discovery
and Class Prediction by Gene Expression Monitoring. Science 286(5439): 531-537.
2 - http://www.cs.toronto.edu/~radford/res-bayes-ex.html
3 - http://www.cse.ucsc.edu/research/compbio/genex/genexTR2html/node12.html
4 - http://www.statsoft.com/textbook/stclatre.html
5 - Yuhang Wang, Fillia Makedon, James Ford, and Justin Pearlman. HykGene: A Hybrid
Approach for Selecting Marker Genes for Phenotype Classification using Microarray
Gene Expression Data. Bioinformatics, in press.
6 - Aitman, Timothy J. DNA microarrays in medical practice. BMJ 2001;323:611-615.
7 - T. Li, C. Zhang, and M. Ogihara. A comparative study of feature selection and
multiclass classification methods for tissue classification based on gene expression.
Bioinformatics, vol. 20, pp. 2429-2437, 2004.
8 - http://www.cs.waikato.ac.nz/%7Eml/weka/
9 - http://www.ncbi.nlm.nih.gov/geo/
10 - Avrum Spira, Jennifer Beane, Vishal Shah, Gang Liu, Frank Schembri, Xuemei Yang,
John Palma, and Jerome S. Brody. Effects of cigarette smoke on the human airway
epithelial cell transcriptome. Proc Natl Acad Sci USA 2004;101(27):10143-10148.
11 - Raymond, Eric. The Art of Unix Programming. “Chapter 5. Textuality”. Available
online: http://www.faqs.org/docs/artu/textualitychapter.html
Additional Works Referenced:
E. Frank, M. Hall, L. Trigg, G. Holmes, and I. H. Witten, Data mining in bioinformatics
using Weka, Bioinformatics, vol. 20, pp. 2479-2481, 2004.