Microarrays: A Comparison of Classification and Feature Selection Algorithms for Interpretation

Lynn H. Lee, Hiram Shaish, Eric A. Smith, Min C. Zhang, Yuhang Wang*

* Work division among the team members was as follows. Lynn Lee studied and described the classification methods, and performed all the experiments that use KNN as the classification method. Hiram Shaish studied and described the background of microarrays, and compiled and analyzed the experimental results. Eric Smith programmed, tested, and described the data parser. Min Zhang studied and described the feature selection methods, and performed all the experiments that use SVM as the classification method. Each team member contributed to the writing and editing process.

Abstract

Microarrays have the potential to provide a wealth of information as diagnostic tools in a clinical setting. Through the comparison of different expression patterns to a standard expression profile, 'abnormal' patterns may be diagnosed well before symptoms arise in individuals. Numerous algorithms have been tried for feature selection and classification of data sets in an attempt to minimize the feature space and to maximize classification accuracy. In this paper we discuss the background of microarrays and compare the results of several algorithms used to classify a given data set. We demonstrate that, for our selected algorithms, the potential classification accuracy is realized with a very small feature space.

Background

Each cell in our body contains the exact same copy of DNA, which serves as the code for cell structure and function. In spite of this, we know that different cells in our body have different phenotypes and perform different functions, ranging from the oxygen-transporting denucleated red blood cells to the information-carrying elongated neurons. This is precisely why unraveling the code in its entirety has fallen short of its goal: clearly there are regulatory mechanisms by which certain genes are up-regulated in specific cell types while down-regulated in others, and this information is not easily obtained simply by knowing the code.

Although DNA is the blueprint, it does not take an active role in the translation of proteins. DNA is transcribed into mRNA in the nucleus; mRNA is subsequently transported out of the nucleus to the protein factories, the ribosomes. The ribosomes, which serve as the decoders of the mRNA, build long chains of amino acids (a.k.a. proteins) based on sequential 3-nucleotide codons in the mRNA. There is a direct correlation between the amount of mRNA transcribed from DNA in the nucleus and the amount of protein that is translated. In other words, the amount of mRNA for a given protein is a good measure of the activation level of the corresponding gene.

The microarray is a chip upon which each gene of an organism is represented by a minute cDNA strand. A sample cell is lysed and its mRNA is allowed to bind to the matching cDNA strands. Since the mRNA is fluorescently labeled beforehand, the level of fluorescence per cDNA strand after application of the cell extract is proportional to the amount of the corresponding mRNA in the cell. This technique allows scientists to build an expression profile for different tissue samples.

There are three broad categories of applications for microarrays. The first, gene expression profiling, is the comparison of different tissue samples through their expression patterns. The use of microarrays in this fashion is especially prevalent in the field of oncology: the aim of cancer prognosis and treatment is to be able to classify tumors and to selectively attack them, and gene expression profiling provides a powerful tool to accomplish this goal. The second category of applications is DNA sequencing. In this approach sample DNA, not mRNA, is allowed to hybridize to a microarray. If the DNA on the chip or in the sample tissue represents an individual with a genetic disorder of unknown origin, it then becomes possible to locate the defective genes. The third category, genotyping, involves the application of genomic DNA to a microarray of known genetic markers. These markers have been associated with different phenotypic conditions and present risk factors [6].

The main obstacle to using microarrays, besides the cost, is the enormous amount of information that must be analyzed. Given that the human genome consists of ~30,000 genes and that the main use of microarrays is to compare several tissue samples, a considerable amount of analysis must be done. This is where the need for computational methods becomes apparent. To be able to compare tissue samples, the 30,000-gene profile must be reduced as much as possible. Currently there are several techniques for minimizing the size of the profile without losing what makes it unique.

Most of the algorithms for analyzing microarray data are based on two classical problems in computer science. The first is feature selection: the process whereby certain features (genes in our case) in a training data set are kept and others discarded as a basis for classification, which we discuss shortly. Most of the commonly used feature selection methods, except for t-statistics and one-dimensional SVM, "quantify the effectiveness of a feature by evaluating the strength of class predictability when the prediction is made by splitting the full range of expression of a given gene into two regions, the high region and the low region" [7]. The second problem is the subsequent classification. This is a process that employs a machine-learning approach: an algorithm is given a training set upon which it 'learns' to classify a given profile into one or more classes (e.g. cancerous/non-cancerous, anemic/non-anemic, in the G1 phase/in the G2 phase, etc.). The trained algorithm is then applied to the desired data set. An important recent improvement upon classical techniques is the generalization from classification into two classes to classification into any number of classes. This improvement is often used in the analysis of microarray data.

Methods and Materials

Microarray Data

We obtained expression data for 22,216 genes taken from human epithelial cells of 75 individuals [9]. These individuals were classified through questionnaires into 3 groups: smokers (34 individuals), those who have never smoked (23 individuals), and those who have smoked in the past but have since quit (18 individuals) [10]. The selection of this data set was important in our studies since the phenotype was easily verifiable.

Software

We selected WEKA [8] as our algorithmic software. WEKA, an open-source implementation of many bioinformatics and data-processing algorithms, includes implementations of our selected algorithms. Importantly, WEKA requires input in the form of ARFF files. This format is different from the format generated by many microarray applications.
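For readers unfamiliar with ARFF, a minimal file has roughly the following shape (this is an illustrative sketch; the relation, attribute, and class names are hypothetical and are not drawn from our data set):

    @relation smoking_expression
    @attribute gene_6559 numeric
    @attribute gene_4150 numeric
    @attribute class {smoker, never_smoker, former_smoker}
    @data
    7.82, 5.13, smoker
    6.40, 4.97, never_smoker

Each gene becomes one numeric attribute, and each tissue sample becomes one row in the @data section, with its class label in the final column.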
In particular, the Gene Expression Omnibus database [9], as previously mentioned, was our source for microarray data, and it provides the data in SOFT format. Thus our first task was to code a parser that would convert the SOFT format into the ARFF format.

Algorithms

We chose to work with 4 algorithms. Our first classification algorithm combines the support vector machine (SVM) with error-correcting output codes (ECOC). An SVM attempts to create a hyperplane in space such that two classes of data are separated by the largest margin [7]. This method was chosen since it has traditionally been very strong in binary classification; accordingly, the ECOC method breaks the original multiclass problem down into a series of binary partitions. In general, given some number of binary classifiers, the ECOC method derives an output "word" for any input. This word is then compared to the codeword for each class (computed in an earlier operation), and the class whose codeword has the fewest disagreements with the computed word for the input is selected.
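As a concrete illustration of the decoding step, consider the following minimal sketch in Python (the codewords and classifier outputs below are hypothetical; in our experiments WEKA performs this bookkeeping internally):

    # Minimal sketch of ECOC decoding. Each class is assigned a codeword
    # over the binary partitions; an input is assigned to the class whose
    # codeword has the fewest disagreements (smallest Hamming distance)
    # with the word produced by the binary classifiers.

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def ecoc_decode(output_word, codewords):
        # codewords: dict mapping class label -> tuple of 0/1 bits
        return min(codewords, key=lambda c: hamming(codewords[c], output_word))

    # Hypothetical 5-bit codewords for our three classes:
    codewords = {
        "smoker":        (1, 1, 0, 1, 0),
        "never_smoker":  (0, 1, 1, 0, 1),
        "former_smoker": (1, 0, 1, 0, 0),
    }

    # Suppose the five binary classifiers produce this word for a sample:
    print(ecoc_decode((1, 1, 1, 1, 0), codewords))  # -> smoker (distance 1)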
Our second classification algorithm, k-nearest neighbor (KNN), was chosen for its relative straightforwardness and its accuracy when the data is preprocessed by feature selection. As the name suggests, KNN is given an integer parameter k; presented with some input, it finds the k points in the training data nearest to that input and classifies the input on the basis of the classes of those k points. The error rate for KNN is low, "asymptotically at most two times the Bayesian error" [7].
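The following minimal sketch conveys the idea, using Euclidean distance and majority vote (the expression values and labels are hypothetical; in our experiments WEKA's KNN implementation does this work):

    # Minimal sketch of k-nearest-neighbor classification of expression
    # profiles: find the k nearest training points, then take a vote.
    from collections import Counter
    import math

    def knn_classify(query, training, k):
        # training: list of (expression_vector, class_label) pairs
        nearest = sorted(training, key=lambda t: math.dist(query, t[0]))[:k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    # Toy one-gene expression profiles (all values hypothetical):
    training = [([7.8], "smoker"), ([7.5], "smoker"),
                ([4.2], "never_smoker"), ([4.9], "never_smoker"),
                ([6.1], "former_smoker")]
    print(knn_classify([7.1], training, k=3))  # -> smoker (2 of 3 votes)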
For feature selection we chose Information Gain and Chi-square Statistics. Information Gain is widely used in machine learning and may achieve good performance with one of our classification methods, ECOC. "Information Gain measures the number of bits of information obtained for class prediction by knowing the value of a feature" [5]. Formally, the information gain of a feature f is defined as

    IG(f) = -\sum_{c_i \in C} P(c_i) \log P(c_i) + \sum_{v \in V} P(f = v) \sum_{c_i \in C} P(c_i \mid f = v) \log P(c_i \mid f = v),

where C denotes the set of classes and V is the set of possible values for feature f [5].

Alternatively, the Chi-square Statistics method is widely used in statistics. Chi-square measures the degree of dependence between a feature and the class. The Chi-square statistic of a feature f is defined as

    \chi^2(f) = \sum_{v \in V} \sum_{c_i \in C} \frac{(A_i(f = v) - E_i(f = v))^2}{E_i(f = v)},

where A_i(f = v) is the number of instances in class c_i with f = v, and E_i(f = v) is the expected value of A_i(f = v), calculated as P(f = v) P(c_i) N [5]. In this method, the numbers of rows and columns in the gene data matrix specify the degrees of freedom. When the Chi-square statistic is larger than the critical value determined by the degrees of freedom, the feature and the class are considered dependent; such features are selected. Wang et al. [5] show that Chi-square Statistics is approximately as effective as Information Gain, so we chose it as the alternative against which to compare the Information Gain results.
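To make the chi-square score concrete, here is a minimal sketch for a single discretized gene, following the A_i(f = v) and E_i(f = v) notation above (the counts are hypothetical, not taken from our data set):

    # Chi-square score of one gene whose expression has been discretized
    # into "low"/"high", over two classes and N = 20 samples.
    def chi_square(table):
        # table[v][c]: number of instances with feature value v in class c
        values = list(table)
        classes = sorted({c for v in values for c in table[v]})
        N = sum(table[v].get(c, 0) for v in values for c in classes)
        score = 0.0
        for v in values:
            n_v = sum(table[v].get(c, 0) for c in classes)
            for c in classes:
                A = table[v].get(c, 0)                       # A_i(f = v)
                n_c = sum(table[u].get(c, 0) for u in values)
                E = (n_v / N) * n_c                          # P(f = v) P(c_i) N
                if E > 0:
                    score += (A - E) ** 2 / E
        return score

    counts = {"low":  {"smoker": 2, "never_smoker": 8},
              "high": {"smoker": 8, "never_smoker": 2}}
    print(round(chi_square(counts), 2))  # -> 7.2, compared to the critical value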
Experiments

We crossed the 2 feature selection algorithms with the 2 classification algorithms to see how the 4 combinations would compare. We ran each combination up to 13 times, each time with a differently sized feature space. In total we ran 46 experiments.

Data and Results

SOFT-to-ARFF Parser

The creators of both the SOFT and ARFF formats have followed the fundamental design principle of textual storage: keep data in formats that humans can read. This principle exists for a number of reasons, and the reader is referred to [11] for more details. In addition, we required that the parser's implementation be completed in the bare minimum amount of time, and be as small and bug-free as possible. We did not particularly care about its efficiency in running time or memory usage. Thus Perl seemed the obvious choice of implementation language.

Our intuition in choosing Perl proved well-founded. The coding was completed very rapidly. The program consists of about 100 lines of code and has generated output that is exactly to spec in every one of our tests to date. It takes about 10 seconds to convert one data set; since our project involves only a few data sets, this was not a problem.

A few other particulars of the parser are worth mentioning. We have attempted to create it according to the standards that would be applied to any piece of commercial-quality UNIX software. Thus, the program works in a pipeline, reading from standard input and writing to standard output. More than half of its text is comments. Perl's strict mode is used, so that the program can be easily incorporated into a larger system. A help message is available when the program is run with the usual -h flag. All these things significantly increase the program's usability and maintainability.
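Our Perl parser is available online (see the footnote at the end of this paper). To convey the flavor of the conversion, here is a greatly simplified sketch, written in Python for brevity rather than in the Perl of our actual parser; the SOFT field names and table layout assumed here are only an approximation of the real format:

    #!/usr/bin/env python3
    # Toy SOFT-to-ARFF filter. Like our Perl parser it works in a
    # pipeline, reading from standard input and writing to standard
    # output. Real SOFT files carry far more metadata than assumed here.
    import sys

    def main():
        genes, sample_ids, samples, in_table = [], [], {}, False
        for line in sys.stdin:
            line = line.rstrip("\n")
            if line.startswith("!dataset_table_begin"):
                # Assume the next line is a header: ID_REF, IDENTIFIER,
                # then one column per sample.
                header = next(sys.stdin).rstrip("\n").split("\t")
                sample_ids = header[2:]
                samples = {s: [] for s in sample_ids}
                in_table = True
            elif line.startswith("!dataset_table_end"):
                in_table = False
            elif in_table:
                fields = line.split("\t")
                genes.append(fields[0])
                for s, value in zip(sample_ids, fields[2:]):
                    samples[s].append(value)
        # Emit ARFF: one attribute per gene, one data row per sample.
        print("@relation expression")
        for g in genes:
            print("@attribute gene_%s numeric" % g)
        print("@attribute class {smoker, never_smoker, former_smoker}")
        print("@data")
        for s in sample_ids:
            print(",".join(samples[s]) + ",?")  # class labels added separately

    if __name__ == "__main__":
        main()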
Feature Selection

    Algorithm (feature space size)   Selected features
    Info Gain (1)                    6559
    Info Gain (2)                    6559, 4150
    Info Gain (5)                    6559, 4150, 9659, 2832, 2437
    Info Gain (10)                   6559, 4150, 9659, 2832, 2437, 9126, 10461, 5622, 5327, 1468
    Chi Square (1)                   6559
    Chi Square (2)                   6559, 4150
    Chi Square (5)                   6559, 4150, 9659, 2832, 9126
    Chi Square (10)                  6559, 4150, 9659, 2832, 9126, 2437, 5622, 10461, 17463, 1468

Table 1: Identity of the feature profile as selected by Chi Square and Info Gain when run with different feature space sizes.

Surprisingly, as Table 1 shows, both feature selection algorithms selected almost the same features for profiles of similar sizes (only sizes 1 through 10 are shown). Although the remaining features vary in the order of their significance, and two of them are actually different features, the first four features are the same for both algorithms (WEKA lists the features in their order of significance to classification). Since the two feature selection algorithms operate under very different methodologies, this suggests that these particular features correlate highly with the classification of the data.

This suggests a logical methodology to follow when selecting features for classification: the significance of a feature profile selected by a given algorithm can be validated by comparing it to the feature profile produced by a very different algorithm. If the two algorithms select similar features, then those features have a high degree of correlation with the data. From this we surmise that feature 6559 is a primary gene affected by smoking or, alternatively, a gene responsible for a primary phenotypic difference between a "smoker" and a "non-smoker".

[Figure 1: Classification accuracy (% correctly classified) versus feature space size, up to 500 features, for the four algorithm combinations: Chi Square-ECOC, Info Gain-ECOC, Chi Square-KNN, and Info Gain-KNN.]

[Figure 2: Classification accuracy (% correctly classified) versus feature space size, up to 50 features, for the same four algorithm combinations.]

Since the differences in the feature selection results were minor and did not encompass the five most significant features, we may use the smaller profiles as a basis for comparing the two classification algorithms. Figure 2 clearly shows that KNN outperforms ECOC when given fewer features: while ECOC achieves only 45% classification accuracy with one feature, KNN produces 70% accuracy. As the number of features grows, so do the differences between the features selected by Info Gain and Chi Square, resulting in four distinct curves. The highest accuracy, 75%, is produced by ECOC in combination with either feature selection algorithm.

We mention two important observations. First, even when classification was run on a feature set containing all of the genes (data not shown), the accuracy never exceeded 75%. This can be attributed to an inherent characteristic of the data: since the data was partitioned into 3 classes based on simple, yet clearly oversimplified, criteria (in reality, questions such as how many cigarettes per day each smoker smokes, and how long ago the former smokers quit, matter as well), it stands to reason that the 'boundary' cases may be harder to classify. Second, a relatively high degree of accuracy is produced with a very limited feature space (70% with one gene under KNN).

Conclusion

In this paper we have examined the results of classification and feature selection using different algorithms on a sample data set. We have shown that different feature selection algorithms may be used to strengthen the relevance of selected features to the subsequent classification. This is paramount in a clinical setting, where scientists and doctors strive to uncover potential targets for therapy. Additionally, we have demonstrated that relatively high classification accuracy can be achieved with a small number of features. The maximum accuracy is limited not by the number of features but by inherent characteristics of the data; therefore, preprocessing and clustering the data, as done in other studies, may be necessary in order to enhance the accuracy of prediction.

Future work should include further experimentation with other algorithms, as well as the design of novel ones customized for this specific problem [5]. Importantly, our selected algorithms were not originally created to classify microarray data. It would also be interesting to study what attributes make an expression data set a good candidate for classification, i.e., one that would give a high degree of accuracy upon classification.

We conclude with a broader perspective. While there are currently many parallel efforts, such as ours, to test and develop algorithms for microarray data analysis, numerous obstacles remain. First, there is a shortage of personnel who have the background and understanding necessary to bridge the interdisciplinary gap between biology and computer science. Second, computer scientists will need to dispel the prevailing 'mistrust' among more traditional scientists before their solutions are taken more seriously. Third, large validated data sets, which are fundamental to the testing of new algorithms, are largely unavailable to many computer science research groups. In spite of these obstacles, the falling costs of microarray analysis and the enormous clinical potential will no doubt continue to stimulate research into the algorithmic interpretation of the data.

Footnotes:

1 - The parser may be found at: http://www.cs.dartmouth.edu/~eas/soft2arff.txt

Works Cited:

1 - Golub, T. R., Slonim, D. K., et al. (1999). Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 286(5439): 531-537.
2 - http://www.cs.toronto.edu/~radford/res-bayes-ex.html
3 - http://www.cse.ucsc.edu/research/compbio/genex/genexTR2html/node12.html
4 - http://www.statsoft.com/textbook/stclatre.html
5 - Wang, Y., Makedon, F., Ford, J., and Pearlman, J. HykGene: A Hybrid Approach for Selecting Marker Genes for Phenotype Classification using Microarray Gene Expression Data. Bioinformatics, in press.
6 - Aitman, T. J. (2001). DNA microarrays in medical practice. BMJ 323: 611-615.
7 - Li, T., Zhang, C., and Ogihara, M. (2004). A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 20: 2429-2437.
8 - http://www.cs.waikato.ac.nz/%7Eml/weka/
9 - http://www.ncbi.nlm.nih.gov/geo/
10 - Spira, A., Beane, J., Shah, V., Liu, G., Schembri, F., Yang, X., Palma, J., and Brody, J. S. Effects of cigarette smoke on the human airway epithelial cell transcriptome.
11 - Raymond, E. The Art of Unix Programming, "Chapter 5. Textuality". Available online: http://www.faqs.org/docs/artu/textualitychapter.html

Additional Works Referenced:

Frank, E., Hall, M., Trigg, L., Holmes, G., and Witten, I. H. (2004). Data mining in bioinformatics using Weka. Bioinformatics 20: 2479-2481.