Evaluation of Supervised Learning Algorithms on Gene Expression Data
Machine Learning Prediction
CSCI 6505 – Machine Learning
Adan Cosgaya, acosgaya@dal.ca
Winter 2006, Dalhousie University

Outline
Introduction
Definition of the Problem
Related Work
Algorithms
Description of the Data
Methodology of Experiments
Results
Relevance of Results
Conclusions & Future Work

Introduction
ML has gained attention in the biomedical field: there is a need to turn biomedical data into meaningful information.
Microarray technology is used to generate gene expression data.
Gene expression data involves a huge number of numeric attributes (gene expression measurements).
This kind of data is also characterized by a small number of instances.
This work investigates the classification problem on such data.

Definition of the Problem
Classifying gene expression data: the number of features (n) is much greater than the number of sample instances (m), i.e., n >> m.
Typical data: n > 5000 and m < 100.
There is a high risk of overfitting the data due to the abundance of attributes and the shortage of available samples.
The datasets produced by microarray experiments are highly dimensional and often noisy because of the process involved in the experiments.

Related Work
Using gene expression data for the task of classification has recently gained attention in the biomedical community.
Golub et al. describe an approach to cancer classification based on gene expression, applied to human acute leukemia (ALL vs. AML).
Rosenwald et al. developed a model to predict patient survival after chemotherapy (Alive vs. Dead).
Furey et al. present a method to analyze microarray expression data using SVMs.
Guyon et al. experiment with reducing the dimensionality of gene expression data.

Algorithms
K-Nearest Neighbor (KNN)
Naive Bayes (NB): assumes that the effect of a feature value on a given class is independent of the values of the other features.
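As an aside, the NB independence assumption can be sketched in plain Python (not from the original slides; the helper names and the toy ALL/AML values below are invented for illustration). Each feature contributes its own Gaussian log-likelihood term, and the terms are simply summed, which is exactly the "features are independent given the class" assumption:

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior and a per-feature mean/variance for each class.

    Each feature is modeled independently -- the Naive Bayes assumption."""
    grouped = defaultdict(list)
    for row, label in zip(X, y):
        grouped[label].append(row)
    model = {}
    for label, rows in grouped.items():
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        variances = [sum((v - m) ** 2 for v in c) / len(c) + 1e-9  # smoothed
                     for c, m in zip(cols, means)]
        model[label] = (len(rows) / len(y), means, variances)
    return model

def predict_nb(model, x):
    """Pick the class maximizing log-prior + sum of per-feature log-likelihoods."""
    def score(label):
        prior, means, variances = model[label]
        s = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            s += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return s
    return max(model, key=score)
```

On real microarray data the sum inside `score` would run over thousands of gene-expression features rather than two toy values.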
Decision Trees (DT): one of the simplest and most widely used algorithms for data classification. Internal nodes represent tests on one or more attributes, and leaf nodes indicate decision outcomes.
Support Vector Machines (SVM): work well on high-dimensional data.

Description of the Data
Leukemia dataset: a collection of 72 expression measurements. The samples are divided into two variants of leukemia: 25 samples of acute myeloid leukemia (AML) and 47 samples of acute lymphoblastic leukemia (ALL).
Diffuse large-B-cell lymphoma (DLBCL) dataset: biopsy samples examined for gene expression with DNA microarrays. Each sample corresponds to the prediction of survival after chemotherapy for diffuse large-B-cell lymphoma (Alive, Dead).

Dataset                              Leukemia   DLBCL
# Instances                              72       240
# Classes                                 2         2
# Features                             7129      7399
# Features after feature selection     1026        68

Methodology of Experiments
All features vs. feature selection:
- Feature selection removes irrelevant features (though removed features may still have biological meaning); GainRatio is used as the selection criterion.
Selecting a supervised learning method: KNN, NB, DT, SVM, applied to the full feature set and to the selected gene subset.
Testing methodology:
- Evaluation over an independent test set (train/test split), with ratios 66/34, 80/20, and 90/10.
- 10-fold cross-validation.
- Compare both methods and check whether they are in logical agreement.

Methodology of Experiments (cont.)
Measuring performance:
- Accuracy: the number of correct classifications divided by the total number of test cases.
- Precision (p) and recall (r).
- F-Measure: it is hard to compare two classifiers using two separate measures, so the F-Measure combines precision and recall into one. It is the harmonic mean of precision and recall; for F to be large, both p and r must be large.
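The performance measures above can be sketched in a few lines of Python (illustrative helper names, not from the slides; the ALL/AML labels are just an example):

```python
def accuracy(y_true, y_pred):
    """Fraction of correct classifications over all test cases."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class treated as 'positive'."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

def f_measure(y_true, y_pred, positive):
    """Harmonic mean of precision and recall: F = 2pr / (p + r)."""
    p, r = precision_recall(y_true, y_pred, positive)
    return 2 * p * r / (p + r) if p + r else 0.0
```

Because F is a harmonic mean, a classifier with high precision but poor recall (or vice versa) still scores low, which is why a single F value is a fairer basis for comparison than either measure alone.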
F-Measure = 2pr / (p + r)

Results: Without Feature Selection
[Figure: bar charts of accuracy (train/test split vs. cross-validation) for KNN, NB, DT, and SVM on the DLBCL and Leukemia datasets, without feature selection. On DLBCL, Naive Bayes and SVM perform better; on Leukemia, KNN and SVM perform better.]
Cross-validation results are lower; cross-validation uses nearly all the data for both training and testing, giving a more realistic estimate.

Results (cont.): With Feature Selection
[Figure: the same bar charts with feature selection. On Leukemia, KNN and SVM perform better; on DLBCL, NB and SVM perform better.]
There is an increase in overall accuracy, most noticeable on DLBCL.

Results (cont.)
Summary of classification accuracies (%) with cross-validation:

          Leukemia dataset            DLBCL dataset
          All features  Feat. sel.    All features  Feat. sel.
KNN             87.500      98.611          62.917      62.250
NB              98.611      97.222          59.167      70.833
DT              86.111      98.611          56.250      57.917
SVM             84.722      98.611          64.167      71.250

[Figure: F-Measures for each algorithm on both datasets, with and without feature selection.]
Relevance of Results
Performance depends on the characteristics of the problem, the quality of the measurements in the data, and the capability of the classifier to find regularities in the data.
Feature selection helps minimize the use of redundant and/or noisy features.
SVMs gave the best results: they perform well on high-dimensional data and also benefit from feature selection.
Decision Trees had the worst overall performance; however, they still work at a competitive level.

Relevance of Results (cont.)
Surprisingly, KNN performs relatively well despite its simplicity, a characteristic that allows it to scale well to large feature spaces.
On the Leukemia dataset, very high accuracies were achieved for all the algorithms; perfect accuracy was achieved in many cases.
The DLBCL dataset shows lower accuracies, although feature selection improved them.
Overall, the accuracy results are consistent with the F-measure results, giving us confidence in the relevance of the results obtained.

Conclusions & Future Work
Supervised learning algorithms can be applied to the classification of gene expression data from DNA microarrays with high accuracy.
SVMs, by their very nature, deal well with high-dimensional gene expression data.
We have verified that some subsets of features (genes) are more relevant than others and better separate the classes.
The choice of one algorithm over another should be evaluated on a case-by-case basis.

Conclusions & Future Work (cont.)
Feature selection proved beneficial to the overall performance of the algorithms. This idea can be extended to other feature selection methods or data transformations such as PCA.
Future work: analysis of the effect of noisy gene expression data on the reliability of the classifier.
While the scope of our experimental results is confined to a couple of datasets, the analysis can serve as a baseline for future use of supervised learning algorithms on gene expression data.

References
T. R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene-expression monitoring. Science, Vol. 286, 531–537, 1999.
A. Rosenwald, G. Wright, W. C. Chan, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. New England Journal of Medicine, Vol. 346, 1937–1947, 2002.
T. S. Furey, N. Cristianini, et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, Vol. 16, 906–914, 2000.
I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. BIOWulf Technical Report, 2000.
E. Alpaydin. Introduction to Machine Learning. The MIT Press, 2004.
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition. Morgan Kaufmann Publishers, 2005.
Wikipedia: www.wikipedia.org
A. Brazma, H. Parkinson, T. Schlitt, and M. Shojatalab. A quick introduction to elements of biology: cells, molecules, genes, functional genomics, microarrays. European Bioinformatics Institute.

Thank You!