Rule-Based Microarray Classification System Using Multi-Agent Approach
GREGOR ŠTIGLIC, PETER KOKOL
Laboratory for System Design, FERI
University of Maribor
Smetanova 17, 2000 Maribor
SLOVENIA
Abstract: In recent years a great deal of scientific work has been done in the field of microarray analysis. The
technology enables us to analyze multiple tissues and scan their gene expression levels. When a significant
number of such gene scans has been collected, we can apply different supervised classification methods. Our goal is
to find significant classification genes using simple rules that agents can apply when searching through the gene
expression database search space. In this way we obtain a small subset of significant features (genes) that can help
us identify the clinical state of the patient. We propose a method that is based on ideas from multi-agent
systems and produces simple rules that can easily be evaluated by experts in the field of biomedicine.
Key-Words: rule-based systems, microarray classification, bioinformatics, multi-agent systems
1 Introduction
Microarray analysis has become an efficient tool for
detecting early signs of disease using only the
expression levels of scanned genes. The technique of
producing DNA microarrays is improving continuously,
resulting in better and more accurate gene expression
databases. The problem in the analysis of such databases
is their high dimensionality: there is a large number of
features (genes) and only a few instances (samples).
As a possible solution to the problem of classification in
gene expression data, we propose a simple rule-based
classification method that uses only two features at a
time. To narrow the large search space we employ a system
of agents searching for the best classifier. Multi-agent
systems naturally lead to systems that adapt and learn
through experience [1]. In our case agents can exchange
information about the position of promising
classification possibilities in the two-dimensional
feature space. Using this technique we try to optimize
the current best solution and search for new promising
points in the search space. The final product of our
method is a set of the most successful rules, which can
easily be presented to experts for evaluation.
One of the first papers dealing with the classification
of microarray data was that of Golub et al. [2]. In this
paper the authors try to identify the type of leukemia
(acute myeloid leukemia, AML, or acute lymphoblastic
leukemia, ALL). The authors proposed their own
classification method, called the weighted voting (WV)
algorithm. The AML/ALL dataset was later often used
for testing different analysis methods and it still serves
as a benchmark dataset in the field of gene expression
classification. Because of the good linear separability
between classes, many results with no misclassifications
have been reported on this dataset [3, 4].
The next widely used dataset is the Colon Tumor dataset
by Alon et al., first published in [5]. The clustering
method used in the original paper misclassified 8 out of
62 samples, but the same dataset was later mostly used
for estimating the accuracy of classification methods. Of
all the methods we investigated, we found only one that
managed to classify all samples correctly using
leave-one-out cross-validation (LOOCV). Fujarewicz and
Wiench [6] achieved this result using a combination of
recursive feature replacement (RFR) and the SVM
classification method.
In the next section we describe our proposed multi-agent
system, including details on agent behavior. This is
followed by a presentation of the problem datasets and a
section with the results of gene expression
classification on the AML/ALL and Colon Tumor datasets.
In the final section we discuss our method and possible
further improvements of the multi-agent system for
predictive gene discovery.
2 Multi-Agent Based Gene Discovery
Multi-agent systems offer a new paradigm for organizing
applications in a way that enables problems to be solved
using distributed and intelligent entities. We can see
agents as intelligent computational entities that are able
to collaborate and solve problems in groups. In our
approach we employ agents to solve the problem of gene
expression feature selection and supervised
classification.
Each agent in the system has to follow some basic rules
of acting in the large search space. In our case every
possible pair of genes defines a point in the search
space. This means that we would have to evaluate n²
points in the search space for each agent (where n is the
number of genes), because each agent can have its own
settings. To reduce the time complexity of the search, we
define basic rules that help our agents solve the problem
faster. Each cycle of an agent's activity consists of two
parts. In the first part the agent explores the search
space using the Monte Carlo method: it samples a
pre-defined number of points in the search space, builds
a classifier from the selected genes and stores the
location of the best classifying pair of genes. In the
second part it searches around the best solution by
following the vertical and horizontal lines through the
best classifying position. In this way the agent keeps
one of the genes selected by the best classifier so far
and tries to improve the current best classification
result. At the end of each cycle agents exchange
information about their best positions and start a new
cycle.
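The search cycle described above can be sketched as follows. This is a minimal illustration and not the original implementation: the names agent_cycle and run_agents, the score callback (standing in for the classification accuracy of a gene pair) and the rule that every agent starts the next cycle from the globally best pair are all assumptions, since the text does not specify how the exchanged positions are used.

```python
import random

def agent_cycle(n_genes, score, n_samples, rng, start=None):
    """One cycle of a single agent; returns the best gene pair it found."""
    # Part 1: Monte Carlo exploration of randomly chosen gene pairs.
    candidates = [tuple(rng.sample(range(n_genes), 2)) for _ in range(n_samples)]
    if start is not None:
        candidates.append(start)  # position shared by the other agents (assumption)
    best = max(candidates, key=lambda pair: score(*pair))
    best_score = score(*best)
    # Part 2: keep one gene of the best pair fixed and vary the other,
    # i.e. follow the horizontal and vertical lines through that point.
    g1, g2 = best
    for g in range(n_genes):
        for pair in ((g1, g), (g, g2)):
            if pair[0] == pair[1]:
                continue  # a pair needs two different genes
            s = score(*pair)
            if s > best_score:
                best, best_score = pair, s
    return best

def run_agents(n_agents, n_genes, score, n_samples, n_cycles, seed=0):
    """Run several agents; after each cycle they share the best position."""
    rng = random.Random(seed)
    shared = None
    for _ in range(n_cycles):
        results = [agent_cycle(n_genes, score, n_samples, rng, shared)
                   for _ in range(n_agents)]
        shared = max(results, key=lambda pair: score(*pair))
    return shared
```

The line search costs about 2n evaluations per cycle instead of the n² needed for an exhaustive scan of all pairs, which is the point of the two-part cycle.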
The performance of a single agent is measured by its
ability to find the combination of genes with the
highest classification accuracy. Each agent represents a
pair of features (genes) in the final ensemble of
classifying agents. To estimate the classification
accuracy of each agent we use classifiers in the form of
simple IF-THEN rules such as:
IF {!}expression(gene1) < value THEN outcome
where the optional ! negates the condition. Because we
work in a two-dimensional search space, each agent
generates four such rules, one for each combination of
the two conditions, in the following form:
IF {!}expression(gene1) < value AND
{!}expression(gene2) < value THEN outcome
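As an illustration of the four-rule form, the sketch below fits one rule per quadrant of the two-gene plane. It rests on assumptions that the text does not state: the outcome of each rule is taken to be the majority class of the training samples falling into that quadrant, and the certainty is the fraction of those samples that belong to the majority class. The function names fit_rules and predict are hypothetical.

```python
from collections import Counter

def fit_rules(x1, x2, labels, t1, t2):
    """Map each quadrant (x1 < t1, x2 < t2) to (outcome, certainty).

    A key component of True means "expression < value"; False is the
    negated condition, written with ! in the rule form above.
    """
    cells = {}
    for a, b, y in zip(x1, x2, labels):
        cells.setdefault((a < t1, b < t2), Counter())[y] += 1
    rules = {}
    for key, counts in cells.items():
        outcome, hits = counts.most_common(1)[0]  # majority class (assumption)
        rules[key] = (outcome, hits / sum(counts.values()))
    return rules

def predict(rules, a, b, t1, t2, default):
    """Classify one sample; fall back to default for an empty quadrant."""
    return rules.get((a < t1, b < t2), (default, 0.0))[0]
```

A quadrant that contains no training samples yields no rule, so predict needs a default class for it; how the original system handles that case is not described.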
As for the testing method, most papers use a hold-out
procedure, in which part of the samples is used to train
a predictive model and the remaining instances serve as
a test set to estimate classifier accuracy. Another
possibility is n-fold cross-validation, which is
typically implemented by running the same learning
system n times, each time on a different training set of
size (n−1)/n times the size of the original data set.
Many authors in microarray analysis use the
leave-one-out cross-validation method, in which one
sample is withheld, the remaining samples are used to
build a classifier that predicts the class of the
withheld sample, and the cumulative error is calculated
over all samples. LOOCV has often been criticized because
of its higher error variance in comparison to 5- or
10-fold cross-validation [7]. In our research we decided
to use 10-fold cross-validation, because we compared our
results with some other machine learning algorithms,
using selected methods from the WEKA tool [8] as
benchmarks.
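The relation between the fold count and the training set size can be made concrete with a short sketch. The helper k_fold_splits is hypothetical and deliberately simple: it neither shuffles nor stratifies the samples, which a real evaluation (WEKA included) normally does.

```python
def k_fold_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each of the n_folds folds."""
    indices = list(range(n_samples))
    # Deal the samples into folds round-robin so that sizes differ by at most 1.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for test in folds:
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

# 10-fold cross-validation on 72 samples: each training set holds
# roughly (n-1)/n = 9/10 of the data.
sizes = [(len(train), len(test)) for train, test in k_fold_splits(72, 10)]

# Leave-one-out is the special case n_folds == n_samples.
loocv = list(k_fold_splits(72, 72))
```

With 72 samples and 10 folds the test folds cannot all be equal in size, so the training sets differ by one sample between folds.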
As mentioned earlier, we also want to use our system for
feature selection. Our aim is to select the smallest
possible set of genes and still achieve the highest
possible classification accuracy. We therefore combine
the best agents in an ensemble: for each fold of the
n-fold cross-validation procedure, an ensemble of the
three agents that classify best on the training data is
built and used to classify the test samples.
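A minimal sketch of such an ensemble is given below, assuming a plain majority vote among the three agents with the highest training accuracy. The text only says that votes from the three best agents are used, so the unweighted vote and the name ensemble_predict are assumptions.

```python
from collections import Counter

def ensemble_predict(agents, sample, top_k=3):
    """Classify a sample by majority vote of the top_k best agents.

    agents: list of (training_accuracy, predict_fn) tuples, where
    predict_fn maps a sample to a class label.
    """
    # Sort on accuracy only, so the functions themselves are never compared.
    best = sorted(agents, key=lambda agent: agent[0], reverse=True)[:top_k]
    votes = Counter(predict_fn(sample) for _, predict_fn in best)
    return votes.most_common(1)[0][0]
```

With three voters and two classes a tie cannot occur, which is one practical reason to use an odd ensemble size.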
3 Experimental Results
For our experiment we selected two publicly available
datasets from the microarray analysis field. We begin
this section with a description of the datasets, continue
with a description of the experiment and conclude with
the obtained results.
3.1 Database descriptions
Leukemia dataset. The original data comes from the
research on acute leukemia by Golub et al. [2]. The
training set used by Golub et al. consists of 38 bone
marrow samples, of which 27 belong to acute
lymphoblastic leukemia (ALL) and 11 to acute myeloid
leukemia (AML). Each sample consists of probes for 6817
human genes. An additional 34 samples, 20 ALL and 14
AML, were used as testing data. Because we used
cross-validation, we were able to run tests on all 72
samples together. Note that the test and training
samples differ in their origin (they were collected at
different institutions by different researchers).
Colon Tumor dataset. Colon cancer is second only to
lung cancer as a cause of cancer-related mortality [9]. It
is a genetic disease, propagated by the acquisition of
somatic alterations that influence gene expression. Using
DNA microarray technology we are able to measure the
expression level of thousands of genes simultaneously.
The most exciting result of microarray technology
research in the past has been the demonstration that
patterns of gene expression can distinguish between
tumors of different anatomical origin.
3.2 Experiment realization and results
Our aim in the experiment realization was to compare
our method to two similar machine learning algorithms
based on decision trees and neural networks.
For the decision tree method we chose C4.5 pruned trees
(the J48 classifier in the WEKA tool). All default
settings were used, including a pruning confidence factor
of 0.25, a minimum of two instances per leaf, and one
fold for pruning and two folds for growing a tree.
The other method was a neural network with two hidden
layers, trained with a learning rate of 0.3 for 100
epochs of backpropagation.
The agent-based approach was initialized with 500 Monte
Carlo sampling points and a total search time of 60
seconds. This limited the running time of our method to
10 minutes for 10-fold cross-validation (training the
neural network on all folds of the Colon Cancer database
took over 60 minutes). The improvement of the overall
accuracy level can be seen in Figure 1, which also shows
the behavior of the accuracy level after the initial
Monte Carlo sampling phase (the area to the right of the
dotted line).
To simplify rule creation we normalized all expression
level data to the [0, 1] interval. Agents therefore
selected the threshold value separating low from high
expression levels from the interval [0, 1].
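One common way to map expression levels to [0, 1] is min-max scaling, sketched below. The text does not say which normalization was used or whether it was applied per gene or over the whole dataset, so both the method and the per-gene use implied by min_max_normalize are assumptions.

```python
def min_max_normalize(values):
    """Scale the expression levels of one gene linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant gene carries no information; map it to 0 (assumption).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

After this scaling a single threshold drawn from [0, 1] is meaningful for every gene, regardless of its original expression range.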
We compared an agent-based system using only votes from
the best classifier with an agent-based system using
votes from the three agents performing best on the
training dataset to classify samples in the test dataset.
All results are presented in Table 1.
                  10-fold cross-validation accuracy
Method            Leukemia      Colon Cancer
MAS (best)        80.00%        74.29%
MAS (ensemble)    86.25%        77.14%
J48 Trees         79.17%        82.26%
Neural Networks   87.50%        74.19%
Table 1. Comparison of classification results
The results in Table 1 suggest that microarray
classification accuracy still tends to be very unstable,
even though we used 10-fold cross-validation. The two
observed datasets are also very different, although they
contain gene expressions acquired with the same method.
The Colon Cancer dataset is known as a very challenging
database for classification, as there is no linear
correlation between the outcome and the expression of
specific genes, while in the ALL/AML dataset many samples
are linearly correlated with the outcome class. This is
also the reason why most of the results are far better on
the AML/ALL database than on the Colon Cancer database.
[Figure 1 plot: accuracy of the best (Best) and average
(Avg) agent; vertical axis from 0.6 to 0.95, horizontal
axis from 1 to 151]
Figure 1. Accuracy of the best and average agent on
Colon Cancer database (training)
4 Conclusions and Future Work
We presented a novel concept for presenting results when
searching for simple classifiers in gene expression data
classification problems. Our approach demonstrates its
efficiency and effectiveness in dealing with
high-dimensional data for classification while still
producing very basic and easy-to-understand rules. The
obtained results confirm that there is no need to reduce
the initial set of genes using statistical gene ranking,
as is usually the case in microarray analysis. Our system
is also useful as a feature selection method and can
effectively reduce the number of genes needed for
discrimination between tissue classes. In the future we
plan to improve our rule-based system by incorporating
more complex rules, while trying to keep the rules in a
form that is easy to read and understand.
Rule                                                   Certainty
if (gene4196 = LOW) and (gene3320 = LOW) then AML      0.927
if (gene4196 = HIGH) and (gene3320 = LOW) then ALL     0.857
if (gene4196 = LOW) and (gene3320 = HIGH) then ALL     1.000
if (gene4196 = HIGH) and (gene3320 = HIGH) then ALL    1.000
if (gene3320 = LOW) and (gene1176 = LOW) then AML      0.905
if (gene3320 = HIGH) and (gene1176 = LOW) then ALL     1.000
if (gene3320 = LOW) and (gene1176 = HIGH) then ALL     0.833
if (gene3320 = HIGH) and (gene1176 = HIGH) then ALL    1.000
if (gene1176 = LOW) and (gene3320 = LOW) then AML      0.905
if (gene1176 = HIGH) and (gene3320 = LOW) then ALL     0.833
if (gene1176 = LOW) and (gene3320 = HIGH) then ALL     1.000
if (gene1176 = HIGH) and (gene3320 = HIGH) then ALL    1.000
Table 2. Rules generated from the best three agents
References:
[1] P.J. Modi and W.M. Shen, “Collaborative
Multiagent Learning for Classification Tasks,”
Proceedings of the Fifth International Conference on
Autonomous Agents, 2001, pp. 37-38.
[2] T.R. Golub et al., “Molecular Classification of
Cancer: Class Discovery and Class Prediction by
Gene Expression Monitoring,” Science, Vol.
286(15):531-537, Oct. 1999.
[3] T.H. Bo and I. Jonassen, “New Feature Subset
Selection Procedures for Classification of Expression
Profiles,” Genome Biology, Vol. 3(4):17.1-27.11,
March 2002.
[4] B. Krishnapuram, L. Carin and A.J. Hartemink,
“Joint Classifier and Feature Optimization for Cancer
Diagnosis Using Gene Expression Data,” Proceedings
of the 7th International Conference on Computational
Molecular Biology, 2003, pp. 167-175.
[5] U. Alon et al., “Broad Patterns of Gene Expression
Revealed by Clustering Analysis of Tumor and Normal
Colon Tissues Probed by Oligonucleotide Arrays,”
Proc. Natl. Acad. Sci., Vol. 96, 1999, pp. 6745-6750.
[6] K. Fujarewicz and M. Wiench, “Selecting
Differentially Expressed Genes for Colon Tumor
Classification,” Int. J. Appl. Math. Comput. Sci., Vol.
13, No. 3, 2003, pp. 327-335.
[7] T. Hastie, R. Tibshirani and J. Friedman, “The
Elements of Statistical Learning,” Springer, 2001.
[8] I.H. Witten and E. Frank, “Data Mining: Practical
Machine Learning Tools with Java Implementations,”
Morgan Kaufmann, San Francisco, 2000.
[9] G.A. Chung-Faye, D.J. Kerr, L.S. Young and P.F.
Searle, “Gene Therapy Strategies for Colon Cancer,”
Mol. Med. Today, Vol. 6, No. 2, 2000, pp. 82-87.