Rule-Based Microarray Classification System Using Multi-Agent Approach

GREGOR ŠTIGLIC, PETER KOKOL
Laboratory for System Design, FERI
University of Maribor
Smetanova 17, 2000 Maribor
SLOVENIA

Abstract: In recent years a great deal of scientific work has been done in the field of microarray analysis. The technology enables us to analyze multiple tissues and scan their gene expression levels. Once a significant number of such gene scans has been collected, we can apply different supervised classification methods. Our goal is to find significant classification genes using simple rules that agents can use when searching through the gene expression database search space. In this way we obtain a small subset of significant features (genes) that can help us identify the clinical state of the patient. We propose a method that is based on ideas from multi-agent systems and produces simple rules that can easily be evaluated by experts in the field of biomedicine.

Key-Words: rule-based systems, microarray classification, bioinformatics, multi-agent systems

1 Introduction

Microarray analysis has become an efficient tool for detecting early signs of disease using only the expression levels of scanned genes. The technique of producing DNA microarrays is improving continuously, and the result is better and more accurate gene expression databases. The problem in analyzing such databases is their high dimensionality: they contain a large number of features (genes) and only a few instances (samples).

As a possible solution to the problem of classification in gene expression data, we propose a simple rule-based classification method that uses only two features at a time. To narrow the large search space we employ a system of agents searching for the best classifier. Multi-agent systems naturally lend themselves to building systems that adapt and learn through experience [1].
In our case agents can exchange information about the position of promising classifiers in the two-dimensional feature space. Using this technique we try to optimize the current best solution and search for new promising points in the search space. The final product of our method is a set of the most successful rules, which can easily be presented to experts for evaluation.

One of the first papers dealing with the classification of microarray data was that of Golub et al. [2]. In this paper the authors try to identify the type of leukemia (acute myeloid leukemia, AML, or acute lymphoblastic leukemia, ALL). The authors proposed their own classification method, called the weighted voting (WV) algorithm. The AML/ALL dataset was later often used for testing different analysis methods, and it still serves as a benchmark dataset in the field of gene expression classification. Because of the good linear separability between the classes, many results with no misclassifications have been reported on this dataset [3, 4].

The next widely used dataset is the Colon Tumor dataset by Alon et al., first published in [5]. In the original paper the authors used clustering methods, which misclassified 8 out of 62 samples; the same dataset was later mostly used for estimating the accuracy of classification methods. Among all the methods we investigated, we found only one that managed to classify all samples correctly using leave-one-out cross-validation (LOOCV). Fujarewicz and Wiench [6] achieved this result using a combination of recursive feature replacement (RFR) and an SVM classifier.

In the next section we describe our proposed multi-agent system, including details on agent behavior, followed by a presentation of the datasets. The results of gene expression classification on the AML/ALL and Colon Tumor datasets are presented after that.
In the final section we discuss our method and possible further improvements of the multi-agent system for predictive gene discovery.

2 Multi-Agent Based Gene Discovery

Multi-agent systems offer a new paradigm for organizing applications in a way that enables problems to be solved by distributed and intelligent entities. We can see agents as intelligent computational entities that are able to collaborate and solve problems in groups. In our approach we employ agents to solve the problem of gene expression feature selection and supervised classification.

Each agent in the system has to follow some basic rules when acting in the large search space. In our case every possible pair of genes defines a point in the search space. With n genes, this means that we would have to evaluate n² points in the search space for each agent, because each agent can have its own settings. To reduce the time complexity of the search, we have to define basic rules that help our agents solve the problem faster.

Each cycle of an agent's activity consists of two parts. In the first part the agent explores the search space using the Monte Carlo method: it samples a pre-defined number of points in the search space, builds a classifier from the selected genes and stores the location of the best classifying pair of genes. In the second part the agent searches around the best solution by following the vertical and horizontal lines through the best classifying position. In this way the agent keeps one of the genes selected by the best classifier so far and tries to improve the current best classification result. At the end of each cycle the agents exchange information about their best positions and start a new cycle.

The performance of a single agent is measured by its ability to find the combination of genes with the highest classification accuracy. Each agent represents a pair of features (genes) in the final ensemble of classifying agents.
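As an illustration only, the two-part cycle described above can be sketched in Python. The names agent_cycle, run_agents and toy_score are hypothetical, the score function stands in for the accuracy of a classifier built from a gene pair, and the way agents reuse the exchanged position (every agent restarts from the shared best pair) is an assumption, since the text does not specify it.

```python
import random
from typing import Callable, Optional, Tuple

Pair = Tuple[int, int]  # indices of the two genes that define a search point


def agent_cycle(n_genes: int,
                score: Callable[[int, int], float],
                n_samples: int,
                rng: random.Random,
                start: Optional[Pair] = None) -> Tuple[Pair, float]:
    """One cycle: Monte Carlo exploration followed by a line search."""
    best = start
    best_score = score(*start) if start is not None else float("-inf")

    # Part 1: sample random gene pairs and remember the best one.
    for _ in range(n_samples):
        pair = (rng.randrange(n_genes), rng.randrange(n_genes))
        s = score(*pair)
        if s > best_score:
            best, best_score = pair, s
    if best is None:
        raise ValueError("need a start position or at least one sample")

    # Part 2: follow the horizontal and vertical lines through the best
    # point, i.e. keep one gene of the pair fixed and vary the other.
    g1, g2 = best
    for g in range(n_genes):
        for pair in ((g, g2), (g1, g)):
            s = score(*pair)
            if s > best_score:
                best, best_score = pair, s
    return best, best_score


def run_agents(n_genes: int,
               score: Callable[[int, int], float],
               n_agents: int = 3,
               n_cycles: int = 2,
               n_samples: int = 20,
               seed: int = 0) -> Tuple[Pair, float]:
    """Run several agents; after each cycle they share the best position.

    Assumption: every agent starts its next cycle from the shared best pair.
    """
    rng = random.Random(seed)
    shared: Optional[Pair] = None
    for _ in range(n_cycles):
        results = [agent_cycle(n_genes, score, n_samples, rng, start=shared)
                   for _ in range(n_agents)]
        shared = max(results, key=lambda r: r[1])[0]
    return shared, score(*shared)


def toy_score(a: int, b: int) -> int:
    """Toy stand-in for classification accuracy, maximal at the pair (7, 3)."""
    return -abs(a - 7) - abs(b - 3)
```

With the toy score, starting the line search at the pair (7, 10) moves along the line that keeps gene 7 fixed and reaches (7, 3). In a real run the score would be the training accuracy of the rule-based classifier built for that pair of genes.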
To estimate the accuracy of classification for each agent we use classifiers in the form of simple IF-THEN rules such as:

    IF {!}expression(gene1) < value THEN outcome

Each agent generates four such rules, since we work in a two-dimensional search space. The final rule set generated by an agent therefore contains four rules of the following form:

    IF {!}expression(gene1) < value AND {!}expression(gene2) < value THEN outcome

As for the testing method, most papers use a hold-out procedure, in which one part of the samples is used to train a predictive model and the remaining instances are used as a test set to estimate classifier accuracy. Another possibility is n-fold cross-validation, which is typically implemented by running the same learning system n times, each time on a different training set of size (n−1)/n times the size of the original data set. Many authors in microarray analysis use the leave-one-out cross-validation method, in which one sample is withheld from the training set, the remaining samples are used to build a classifier that predicts the class of the withheld sample, and the cumulative error is calculated. LOOCV has often been criticized because of its higher error variance in comparison to 5-fold or 10-fold cross-validation [7]. We decided to use 10-fold cross-validation because we compare our results with other machine learning algorithms, using selected methods from the WEKA tool [8] as benchmarks.

As mentioned earlier, we also want to use our system for feature selection. Our aim is to select the smallest possible set of genes while still achieving the highest possible classification accuracy. We therefore combine the best agents in an ensemble. For each fold of the n-fold cross-validation procedure, an ensemble of the three agents that classify best on the training set is built and used to classify the samples of the test set.
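A minimal Python sketch of such a four-rule classifier and of the three-agent ensemble is given below. It rests on assumptions that are not stated in the text: the outcome of each rule is taken to be the majority class of the training samples that fall into the corresponding quadrant, the certainty of a rule is taken to be the fraction of those samples that belong to the majority class, and the ensemble decides by simple majority vote. The names PairRuleClassifier and ensemble_predict are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple

Sample = Sequence[float]      # normalized expression levels, indexed by gene
Quadrant = Tuple[bool, bool]  # (gene1 is LOW, gene2 is LOW)


class PairRuleClassifier:
    """Four IF-THEN rules over two genes, one rule per LOW/HIGH quadrant."""

    def __init__(self, gene1: int, gene2: int, thr1: float, thr2: float):
        self.gene1, self.gene2 = gene1, gene2
        self.thr1, self.thr2 = thr1, thr2
        # quadrant -> (outcome, certainty)
        self.rules: Dict[Quadrant, Tuple[str, float]] = {}
        self.default: Optional[str] = None

    def _quadrant(self, x: Sample) -> Quadrant:
        # LOW means the expression level is below the agent's threshold.
        return (x[self.gene1] < self.thr1, x[self.gene2] < self.thr2)

    def fit(self, X: List[Sample], y: List[str]) -> "PairRuleClassifier":
        counts: Dict[Quadrant, Counter] = {}
        for x, label in zip(X, y):
            counts.setdefault(self._quadrant(x), Counter())[label] += 1
        # Fallback outcome for quadrants without any training sample.
        self.default = Counter(y).most_common(1)[0][0]
        for quadrant, c in counts.items():
            label, n = c.most_common(1)[0]
            # Assumed certainty: share of the majority class in the quadrant.
            self.rules[quadrant] = (label, n / sum(c.values()))
        return self

    def predict(self, x: Sample) -> Optional[str]:
        return self.rules.get(self._quadrant(x), (self.default, 0.0))[0]


def ensemble_predict(agents: Sequence[PairRuleClassifier],
                     x: Sample) -> Optional[str]:
    """Majority vote over the agents (three agents in the paper's ensemble)."""
    votes = Counter(agent.predict(x) for agent in agents)
    return votes.most_common(1)[0][0]


# Tiny made-up example with two genes; the values are not taken from any dataset.
X_train = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.3], [0.9, 0.9], [0.3, 0.8]]
y_train = ["AML", "AML", "ALL", "ALL", "ALL"]
clf = PairRuleClassifier(gene1=0, gene2=1, thr1=0.5, thr2=0.5).fit(X_train, y_train)
```

In this toy example the quadrant in which both genes are LOW contains only AML samples, so the corresponding rule predicts AML, while the other three quadrants predict ALL.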
3 Experimental Results

To perform our experiment we selected two publicly available datasets from the field of microarray analysis. We begin this section with a description of the datasets, continue with a description of how the experiment was carried out and conclude with the results obtained.

3.1 Database descriptions

Leukemia dataset. The original data come from the research on acute leukemia by Golub et al. [2]. The dataset consists of 38 bone marrow samples, of which 27 belong to acute lymphoblastic leukemia (ALL) and 11 to acute myeloid leukemia (AML). Each sample consists of probes for 6817 human genes. Golub et al. used this dataset for training. A further 34 samples were used as testing data, consisting of 20 ALL and 14 AML samples. Because we used cross-validation, we were able to run our tests on all 72 samples together. We should also mention that the test set and the training set differ in their origin: they were collected at different institutions by different researchers.

Colon Tumor dataset. Colon cancer is second only to lung cancer as a cause of cancer-related mortality [9]. It is a genetic disease, propagated by the acquisition of somatic alterations that influence gene expression. Using DNA microarray technology we are able to measure the expression levels of thousands of genes simultaneously. The most exciting result of microarray technology research so far has been the demonstration that patterns of gene expression can distinguish between tumors of different anatomical origin.

3.2 Experiment realization and results

Our aim in the experiment was to compare our method with two other machine learning approaches, decision trees and neural networks. For the decision tree method we chose C4.5 pruned trees (the J48 classifier in the WEKA tool). All default settings were used, including a pruning confidence factor of 0.25, a minimum of two instances per leaf, and one fold for pruning and two folds for growing a tree.
The other method used was neural networks, where we trained a network with two hidden layers and a learning rate of 0.3, using 100 epochs of back-propagation training.

The agent-based approach was initialized with 500 Monte Carlo sampling points and a total search time of 60 seconds. In this way we limited the running time of our method to 10 minutes when using 10-fold cross-validation (60 seconds for each of the 10 folds); by comparison, training the neural network on all folds of the Colon Cancer database took over 60 minutes. The improvement of the overall accuracy level can be seen in Figure 1, where we can also observe the behavior of the accuracy level after the initial Monte Carlo sampling phase (the area on the right side of the dotted line).

[Figure 1: line plot of the accuracy of the best agent (Best) and the average agent (Avg); vertical axis from 0.6 to 0.95, horizontal axis from 1 to 151.]

Figure 1. Accuracy of the best and average agent on Colon Cancer database (training)

For easier rule creation we normalized all expression level data to the [0, 1] interval. Agents therefore selected the threshold value that separates low from high expression levels from the interval [0, 1].

We compared an agent-based system using only the votes of the best classifier with an agent-based system that uses the votes of the three agents performing best on the training dataset to classify the samples in the test dataset. All the results are presented in Table 1.

                       10-fold cross-validation accuracy
    Method             Leukemia      Colon Cancer
    MAS (best)         80.00%        74.29%
    MAS (ensemble)     86.25%        77.14%
    J48 Trees          79.17%        82.26%
    Neural Networks    87.50%        74.19%

Table 1. Comparison of classification results

4 Conclusions and Future Work

We presented a novel concept for presenting results when searching for simple classifiers in the gene expression data classification problem. Our approach demonstrates its efficiency and effectiveness in dealing with high-dimensional data for classification while still producing very basic and easy-to-understand rules. The obtained results confirm that there is no need to reduce the initial set of genes using statistical gene ranking, as is usually the case in microarray analysis. Our system is also very useful as a feature selection method and can effectively reduce the number of genes needed to discriminate between tissue classes. In the future we plan to improve our rule-based system by incorporating more complex rules, while trying to keep the rules easy to read and understand.

The results in Table 1 suggest that microarray classification accuracy still tends to be very unstable, even though we used 10-fold cross-validation. The two datasets are also very different, although they contain gene expressions acquired with the same method. The Colon Cancer dataset is known as a very challenging database for classification, as there is no linear correlation between the outcome and the expression of specific genes, while in the ALL/AML dataset there are many genes whose expression is linearly correlated with the outcome class. This is also the reason why most of the results are far better on the AML/ALL database than on the Colon Cancer database.

    Rule                                                     Certainty
    if (gene4196 = LOW) and (gene3320 = LOW) then AML        0.927
    if (gene4196 = HIGH) and (gene3320 = LOW) then ALL       0.857
    if (gene4196 = LOW) and (gene3320 = HIGH) then ALL       1.000
    if (gene4196 = HIGH) and (gene3320 = HIGH) then ALL      1.000

    if (gene3320 = LOW) and (gene1176 = LOW) then AML        0.905
    if (gene3320 = HIGH) and (gene1176 = LOW) then ALL       1.000
    if (gene3320 = LOW) and (gene1176 = HIGH) then ALL       0.833
    if (gene3320 = HIGH) and (gene1176 = HIGH) then ALL      1.000

    if (gene1176 = LOW) and (gene3320 = LOW) then AML        0.905
    if (gene1176 = HIGH) and (gene3320 = LOW) then ALL       0.833
    if (gene1176 = LOW) and (gene3320 = HIGH) then ALL       1.000
    if (gene1176 = HIGH) and (gene3320 = HIGH) then ALL      1.000

Table 2. Rules generated from the best three agents

References:
[1] P.J. Modi and W.M. Shen, “Collaborative Multiagent Learning for Classification Tasks,” Proceedings of the Fifth International Conference on Autonomous Agents, 2001, pp. 37-38.
[2] T.R. Golub et al., “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring,” Science, Vol. 286(15):531-537, Oct. 1999.
[3] T.H. Bo and I. Jonassen, “New Feature Subset Selection Procedures for Classification of Expression Profiles,” Genome Biology, Vol. 3(4):17.1-27.11, March 2002.
[4] B. Krishnapuram, L. Carin and A.J. Hartemink, “Joint classifier and feature optimization for cancer diagnosis using gene expression data,” Proceedings of 7th International Conference on Computational Molecular Biology, 2003, pp. 167-175.
[5] U. Alon et al., “Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays,” Proc. Natl. Acad. Sci., Vol. 96, pp. 6745-6750.
[6] K. Fujarewicz and M. Wiench, “Selecting differentially expressed genes for colon tumor classification,” Int. J. Appl. Math. Comput. Sci., Vol. 13, No. 3, pp. 327-335.
[7] T. Hastie, R. Tibshirani and J. Friedman, “The Elements of Statistical Learning,” Springer, 2001.
[8] I.H. Witten and E. Frank, “Data Mining: Practical machine learning tools with Java implementations,” Morgan Kaufmann, San Francisco, 2000.
[9] G.A. Chung-Faye, D.J. Kerr, L.S. Young and P.F. Searle, “Gene therapy strategies for colon cancer,” Mol. Med. Today, Vol. 6, No. 2, pp. 82-87.