Fast Decision Tree Learning Algorithms for Microarray Data

To appear Proc. The 2003 International Conference on Machine Learning and Applications (ICMLA'03)
Los Angeles, California, June 23-24, 2003.
Fast Decision Tree Learning Techniques
for Microarray Data Collections
Xiaoyong Li and Christoph F. Eick
Department of Computer Science
University of Houston, TX 77204-3010
e-mail: ceick@cs.uh.edu
Abstract
DNA microarrays allow monitoring of
expression levels for thousands of genes
simultaneously. The ability to successfully
analyze the huge amounts of genomic data is of
increasing importance for research in biology
and medicine. The focus of this paper is the
discussion of techniques and algorithms of a
decision tree learning tool that has been devised
taking into consideration the special features of
microarray data sets: continuous-valued
attributes and small size of examples with a
large number of genes. The paper introduces
novel approaches to speed up leave-one-out
cross validation through the reuse of results of
previous computations, attribute pruning, and
through approximate computation techniques.
Our approach employs special histogram-based
data structures for continuous attributes for
speed up and for the purpose of pruning. We
present experimental results concerning three
microarray data sets that suggest that these
optimizations lead to speedups between 150%
and 400%. We also present arguments that our
attribute pruning techniques not only lead to
better speed but also enhance the testing
accuracy.
Key words and phrases: decision trees, concept
learning for microarray data sets, leave-one-out
cross validation, heuristics for split point
selection, decision tree reuse.
1. Introduction
The advent of DNA microarray technology provides
biologists with the ability of monitoring expression
levels for thousands of genes simultaneously.
Applications of microarrays range from the study of
gene expression in yeast under different
environmental stress conditions to the comparison of
gene expression profiles of tumors from cancer
patients [1]. In addition to the enormous scientific
potential of DNA microarrays to help in
understanding gene regulation and interactions,
microarrays have very important applications in
pharmaceutical and clinical research. By comparing
gene expression in normal and abnormal cells,
microarrays may be used to identify which genes are
involved in causing particular diseases. Currently,
most approaches to the computational analysis of
gene expression data focus more on the attempt to
learn about genes and tumor classes in an
unsupervised way. Many research projects employ
cluster analysis for both tumor samples and genes,
and mostly use hierarchical clustering methods [2,3]
and partitioning methods, such as self-organizing
maps [4] to identify groups of similar genes and
groups of similar samples.
This paper, however, centers on the application of
supervised learning techniques to microarray data
collections. In particular, we will discuss the features
of a decision tree learning tool for microarray data
sets. We assume that each data set includes gene
expression data of m-RNA samples. Normally, in
these data sets the number of genes is large
(usually between 1000 and 10,000). Each gene is
characterized by numerical values that measure the
degree to which the gene is turned on for the
particular sample. The number of examples in the
training set, on the other hand, is typically below
one hundred.
Associated with each sample is its type or class that
we are trying to predict. Moreover, in this paper we
will restrict our discussions to binary classification
problems.
Section 2 introduces decision tree learning
techniques for microarray data collections. Section 3
discusses how to speed up leave-one-out cross
validation. Section 4 presents experimental results
that evaluate our techniques for three microarray
data sets, and Section 5 summarizes our findings.
2. Decision Tree Learning Techniques
for Microarray Data Collections
2.1 Decision Tree Algorithms Reviewed
The traditional decision tree learning algorithm (for
more discussions on decision trees see [5]) builds a
decision tree breadth-first by recursively dividing the
examples until each partition is pure by definition or
meets other termination conditions (to be discussed
later). If a node satisfies a termination condition, the
node is marked with a class label that is the majority
class of the samples associated with this node. In the
case of microarray data sets, the splitting criterion
for assigning examples to nodes is of the form “A <
v” (where A is an attribute and v is a real number).
In the algorithm description in Fig. 1 below, we
assume that
1. D is the whole microarray training data set;
2. T is the decision tree to be built;
3. N is a node of the decision tree that holds
the indexes of its samples;
4. R is the root node of the decision tree;
5. Q is a queue that contains nodes of the same
type as N;
6. Si is a split point, which is a structure containing
a gene index i, a real number v and an
information gain value. A split point provides
a split criterion to partition the tree
node N into two nodes N1 and N2 based on
whether gene i’s value of each example in
the node is or is not greater than the value v;
7. Gi denotes the i-th gene.
The result of applying the decision tree learning
algorithm is a tree whose intermediate nodes
associate split points with attributes, and whose leaf
nodes represent decisions (classes in our case). Test
conditions for a node are selected maximizing the
information gain relying on the following
framework: We assume we have 2 classes,
sometimes called ‘+’ and ‘-’ in the following, in our
classification problem. A test S subdivides the
examples D= (p1,p2) into 2 subsets D1 =(p11,p12)
and D2 =(p21,p22). The quality of a test S is
measured using Gain(D,S):
Let
\[ H(D = (p_1, \ldots, p_m)) = \sum_{i=1}^{m} p_i \log_2 (1/p_i) \]
(called the entropy function) and
\[ \mathrm{Gain}(D, S) = H(D) - \sum_{i=1}^{2} \frac{|D_i|}{|D|} \, H(D_i) \]
In the above, |D| denotes the number of elements in
set D, and D=(p1, p2) with p1 + p2 = 1 indicates
that, of the |D| examples, p1*|D| examples belong to
the first class and p2*|D| examples belong to the
second class.
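As an illustration of the two definitions above, the following minimal Python sketch computes the entropy and the gain from absolute class counts rather than relative frequencies. The function names `entropy` and `gain` and the count-tuple representation are choices made here for illustration, not part of the tool described in this paper.

```python
from math import log2

def entropy(counts):
    """Entropy H of a class distribution given as absolute class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumption: an empty partition contributes no entropy
    # A class with zero count contributes nothing to the sum.
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

def gain(parent, children):
    """Gain(D, S) = H(D) - sum_i (|D_i| / |D|) * H(D_i)."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

# A test that separates 2 '+' and 2 '-' examples perfectly.
perfect = gain((2, 2), [(2, 0), (0, 2)])
# A test that leaves both subsets as mixed as the parent.
useless = gain((2, 2), [(1, 1), (1, 1)])
print(perfect, useless)
```

A perfectly separating test recovers the full entropy of the parent as gain, while a test whose subsets have the same class mix as the parent has zero gain.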
Procedure buildTree(D):
  Initialize root node R of tree T using data set D;
  Initialize queue Q to contain root node R;
  While Q is not empty do {
    De-queue the first node N in Q;
    If N does not satisfy the termination condition {
      For each gene Gi (i = 1, 2, ...) {
        Evaluate splits on gene Gi based on information gain;
        Record the best split point Si for Gi and its information gain
      }
      Determine split point Smax with the highest information gain;
      Use Smax to divide node N into N1 and N2 and attach
        nodes N1 and N2 to node N in the decision tree T;
      En-queue N1 and N2 to Q;
    }
  }
Figure 1: Decision Tree Learning Algorithm
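The procedure of Fig. 1 can be sketched in Python roughly as follows. This is only a minimal sketch under several assumptions: nodes are plain dictionaries, the termination condition is purity of the node (or the absence of any usable split), candidate thresholds are midpoints between distinct values, and the names `build_tree`, `best_split` and `predict` are hypothetical. It does not include the optimizations discussed later.

```python
from collections import deque
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return sum((labels.count(c) / n) * log2(n / labels.count(c)) for c in set(labels))

def best_split(data, labels, idx):
    """Best (gain, gene, threshold) over all genes for the samples in idx, or None."""
    base, best = entropy([labels[i] for i in idx]), None
    for g in range(len(data[0])):
        vals = sorted({data[i][g] for i in idx})
        for lo, hi in zip(vals, vals[1:]):
            v = (lo + hi) / 2
            left = [labels[i] for i in idx if data[i][g] < v]
            right = [labels[i] for i in idx if data[i][g] >= v]
            gain = (base - len(left) / len(idx) * entropy(left)
                         - len(right) / len(idx) * entropy(right))
            if best is None or gain > best[0]:
                best = (gain, g, v)
    return best

def build_tree(data, labels):
    """Breadth-first construction using a queue of nodes, as in Fig. 1."""
    root = {"idx": list(range(len(data)))}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        node_labels = [labels[i] for i in node["idx"]]
        node["label"] = max(set(node_labels), key=node_labels.count)  # majority class
        # Assumed termination condition: the node is pure or cannot be split.
        split = None if len(set(node_labels)) == 1 else best_split(data, labels, node["idx"])
        if split is None:
            continue
        _, g, v = split
        node["gene"], node["threshold"] = g, v
        node["left"] = {"idx": [i for i in node["idx"] if data[i][g] < v]}
        node["right"] = {"idx": [i for i in node["idx"] if data[i][g] >= v]}
        queue.extend([node["left"], node["right"]])
    return root

def predict(node, sample):
    """Follow the tests 'A < v' from the root down to a leaf."""
    while "gene" in node:
        node = node["left"] if sample[node["gene"]] < node["threshold"] else node["right"]
    return node["label"]

# Four samples with two genes each (rows are samples).
data = [[1, 1], [2, 3], [3, 2], [4, 4]]
labels = ['+', '-', '+', '-']
tree = build_tree(data, labels)
```

In this toy data set the second gene separates the two classes with a single test, so the resulting tree has one inner node and two leaves.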
2.2 Attribute Histograms
Our research introduced a number of new data
structures for the purpose of speeding up the
decision tree learning algorithms. One of these data
structures is called attribute histogram that captures
the class distribution of a sorted continuous attribute.
Let us assume we have 7 examples and their
attribute values for an attribute A are 1.01, 1.07,
1.44, 2.20, 3.86, 4.3, and 5.71 and their class
distribution is (-, +, +, +, -, -, +); that is, the first
example belongs to class 2, the second example to
class 1, and so on. If we group all the adjacent samples with
the same class, we obtain the histogram for this
attribute, which is (1-, 3+, 2-, 1+), for short (1,3,2,1),
as depicted in Fig. 2; if the class distribution for the
sorted attribute A had been (+,+,-,-,-,-,+), A’s
histogram would be (2,4,1). Efficient algorithms to
compute attribute histograms have been discussed in
[6].
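A simple way to obtain such a histogram is to sort the examples by attribute value and to collapse runs of equal class labels. The sketch below does this with the standard library; the function name `attribute_histogram` is hypothetical, and this is not the efficient algorithm of [6].

```python
from itertools import groupby

def attribute_histogram(values, labels):
    """Return the histogram as a list of (class, block size) pairs."""
    # Sort example indexes by attribute value (stable for equal values).
    order = sorted(range(len(values)), key=lambda i: values[i])
    sorted_labels = [labels[i] for i in order]
    # Each run of adjacent examples with the same class forms one block.
    return [(label, len(list(run))) for label, run in groupby(sorted_labels)]

values = [1.01, 1.07, 1.44, 2.20, 3.86, 4.3, 5.71]
labels = ['-', '+', '+', '+', '-', '-', '+']
hist = attribute_histogram(values, labels)
print(hist)
```

For the seven examples of the text this yields the blocks (1-, 3+, 2-, 1+).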
2.3 Searching for the Best Split Point
As mentioned earlier the traditional decision tree
algorithm has a preference for tests that reduce
entropy. To find the best test for a node, we have to
search through all the possible split points for each
attribute. In order to compute the best split point for
a numeric attribute, normally the (sorted) list of its
values is scanned from the beginning, and for each
split point that is placed half way between every two
adjacent attribute values, the entropy is computed.
The entropy for each split point can actually be
efficiently computed as shown in Figure 2 because
of the existence of our attribute histogram data
structure. Based on its histogram (1-, 3+, 2-, 1+), we
only consider three possible splits: (1- | 3+, 2-, 1+),
(1-, 3+ | 2-, 1+) and (1-, 3+, 2- | 1+). The vertical bar
represents the split point. Thus we reduce the number
of candidate split points from 6 to 3. Fayyad and Irani
proved in [7] that splitting between adjacent samples
that belong to the same class leads to sub-optimal
information gain; in general, their paper advocates a
multi-splitting algorithm for continuous attributes,
whereas our approach relies on binary splits.
Figure 2: Example of an Attribute Histogram
A situation that we have not discussed until
now, involves histograms that contain identical
attribute values that belong to different classes. To
cope with this situation when considering a split
point, we need to check the two neighboring
examples’ attribute values on both sides of the split
point. If they are the same, we have to discard this
split point even if its information gain is high.
After we have determined the best split point for all
the attributes (genes in our case), the attribute with
the highest information gain is selected and used to split
the current node.
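The search described in this section can be sketched as follows for one attribute. The sketch assumes the examples are already sorted by attribute value and the classes are written as '+' and '-'. It evaluates only boundaries between adjacent examples of different classes, and discards a boundary whose two neighboring values are identical. The name `best_split` and the returned (threshold, gain) pair are illustrative choices.

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a two-class distribution given as counts."""
    total = pos + neg
    return sum((c / total) * log2(total / c) for c in (pos, neg) if c > 0)

def best_split(values, labels):
    """values sorted ascending, labels '+' or '-'. Returns (threshold, gain) or None."""
    n = len(values)
    total_pos = labels.count('+')
    total_neg = n - total_pos
    base = entropy(total_pos, total_neg)
    best = None
    left_pos = left_neg = 0
    for i in range(n - 1):
        if labels[i] == '+':
            left_pos += 1
        else:
            left_neg += 1
        if labels[i] == labels[i + 1]:
            continue  # inside a histogram block: not a candidate
        if values[i] == values[i + 1]:
            continue  # identical neighboring values: discard this split point
        left, right = i + 1, n - (i + 1)
        g = (base - left / n * entropy(left_pos, left_neg)
                  - right / n * entropy(total_pos - left_pos, total_neg - left_neg))
        if best is None or g > best[1]:
            best = ((values[i] + values[i + 1]) / 2, g)  # split half way between values
    return best

values = [1.01, 1.07, 1.44, 2.20, 3.86, 4.3, 5.71]
labels = ['-', '+', '+', '+', '-', '-', '+']
threshold, info_gain = best_split(values, labels)
```

For the example attribute, only the three block boundaries are evaluated, and the split (1- | 3+, 2-, 1+) has the highest information gain of the three.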
3. Optimizations for Leave-one-out
Cross-validation
In k-fold cross-validation, we divide the data into k
disjoint subsets of (approximately) equal size, then
train the classifier k times, each time leaving out one
of the subsets from training, but using only the
omitted subset as the test set to compute the error
rate. If k equals the sample size, this is called
"leave-one-out" cross-validation. For large data sets,
leave-one-out is computationally very demanding since
it has to construct more decision trees than the usual
types of cross validation (k=10 is a popular choice in
the literature). But for data sets with few examples,
such as microarray data sets, leave-one-out cross
validation is popular and practical since it
gives the least biased evaluation. Also,
when doing leave-one-out cross validation the
computations for different subsets tend to be very
similar. Therefore, it seems attractive to speed up
leave-one-out cross validation through the reuse of
results of previous computations, which is the main
topic of the next subsection.
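For reference, plain leave-one-out cross-validation without any reuse can be sketched as follows. The callables `train` and `predict` are placeholders for an arbitrary learner and are assumptions of this sketch; the majority-class learner in the example only serves to make the sketch runnable.

```python
def leave_one_out_errors(samples, labels, train, predict):
    """Count misclassifications when each sample is held out once."""
    errors = 0
    for i in range(len(samples)):
        # The training set is the whole data set minus sample i.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) != labels[i]:
            errors += 1
    return errors

def majority_train(xs, ys):
    """Toy learner: always predicts the majority class of the training set."""
    return max(set(ys), key=ys.count)

def majority_predict(model, x):
    return model

samples = [[0.1], [0.2], [0.3], [0.9]]
labels = ['+', '+', '+', '-']
errors = leave_one_out_errors(samples, labels, majority_train, majority_predict)
print(errors, errors / len(samples))
```

Each of the n runs trains on n-1 examples, which is why consecutive runs share most of their computation.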
3.1 Reuse of Sub-trees from Previous Runs
It is important to note that the whole data set and the
training sets in leave-one-out only differ in one
example. Therefore, in the likely event that the same
root test is selected for the two data sets, we already
know that at least one of the 2 sub-trees below the
root node generated by the first run (for the whole
data set) can be reused when constructing other
decision trees. Similar opportunities for reuse exist
at other levels of decision trees. Taking advantage of
this property, we compare the node to be split with
the stored nodes from previous runs, and
reuse sub-trees if a match occurs.
In order to get a speed up through sub-tree
reuse, it is critical that matching nodes from
previous runs can be found quickly. To facilitate the
comparison of two nodes, we use bit strings to
represent the sample list of each node. For example,
if we have 10 samples in total, and 5 are associated
with the current node, we use the bit string
“0101001101” as the signature of this node, and use
XOR string comparisons and signature hashing to
quickly determine if a reusable sub-tree exists.
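The signature idea can be illustrated with a small sketch. It assumes 0-based sample positions, uses a Python dictionary as the hash table, and stores a placeholder string instead of a real sub-tree; the names `signature`, `same_node` and `subtree_cache` are hypothetical and do not describe the actual data structures of our tool.

```python
def signature(sample_indexes, n_samples):
    """Bit string with a '1' at every position whose sample belongs to the node."""
    members = set(sample_indexes)
    return "".join("1" if i in members else "0" for i in range(n_samples))

def same_node(sig_a, sig_b):
    """Two nodes match if the XOR of their bit patterns is zero."""
    return len(sig_a) == len(sig_b) and int(sig_a, 2) ^ int(sig_b, 2) == 0

# Hash table from signature to the sub-tree built for that sample list.
subtree_cache = {}

sig = signature([1, 3, 6, 7, 9], 10)
subtree_cache[sig] = "sub-tree built in an earlier run"  # placeholder value

# A later run reaches a node with the same samples and can reuse the result.
later = signature([9, 7, 6, 3, 1], 10)
reused = subtree_cache.get(later)
print(sig, reused is not None)
```

Hashing narrows the search to a single candidate, and the XOR comparison confirms that the two sample lists are identical.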
3.2 Using Histograms for Attribute Pruning
Assume that two histograms A (2+, 2-) and B (1+, 1-,
1+, 1-) are given. In this case, our job is to find the
best split point among all possible splits of both
histograms. Obviously, B can never give a better
split than A because (2+ | 2-) has entropy 0. This
implies that performing information gain
computations for attribute B is a waste of time. That
prompts us to think of some way to distinguish
between “good” and “bad” histograms, and to
exclude attributes with bad histograms from
consideration for speed up.
Mathematically, it might be quite complicated
to come up with a formula that predicts the best
attribute to be used for a particular node of the
decision tree. However, we use an
approximate method that may not always be correct
but is expected to be correct most of the time. The
idea is to use an index, which we call the “hist index”.
The hist index of histogram S is defined as:
\[ \mathrm{Hist}(S) = \sum_{j=1}^{m} P_j^2 \]
where Pj is the relative frequency of block j in S
and m is the number of blocks in S.
For example, if we have a histogram (1, 3, 4, 2),
its hist index (computed here on the block sizes)
would be: 1^2 + 3^2 + 4^2 + 2^2 = 30. A
histogram with a high hist index is more likely to
contain the best split point than a histogram with a low
hist index. Intuitively, we know that the fewer blocks
the histogram has, the better chance it has to contain
a good split point; mathematically, a^2 > a1^2 + a2^2
holds if we have a = a1 + a2 with a1, a2 > 0.
Our decision tree learning algorithm uses the
hist index to prune attributes as follows. Prior to
determining the best split point of an attribute, its
hist index is computed and we compare it with the
average hist index of all the previous histograms in
the same round; only if its hist index value is larger
than the previous average the best split point for this
attribute will be determined, otherwise, the attribute
is excluded from consideration for test conditions of
the particular node.
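The pruning rule can be sketched as follows. The sketch computes the hist index on block sizes, as in the example (1, 3, 4, 2). It further assumes that the first attribute of a round is always evaluated, since no previous average exists for it, and that pruned attributes still contribute to the running average. The function names are hypothetical.

```python
def hist_index(block_sizes):
    """Sum of squared block sizes of a histogram."""
    return sum(b * b for b in block_sizes)

def attributes_to_evaluate(histograms):
    """Indexes of the attributes whose best split point will be computed."""
    kept = []
    running_sum = 0
    for i, hist in enumerate(histograms):
        h = hist_index(hist)
        # Assumption: the first attribute has no previous average and is kept.
        if i == 0 or h > running_sum / i:
            kept.append(i)
        running_sum += h  # assumption: pruned attributes also enter the average
    return kept

# Histograms of four attributes for the same node (4 examples each).
histograms = [(2, 2), (1, 1, 1, 1), (1, 3), (1, 2, 1)]
kept = attributes_to_evaluate(histograms)
print([hist_index(h) for h in histograms], kept)
```

Here the attribute with histogram (1, 1, 1, 1) is skipped because its hist index is below the average seen so far, so no information gain computation is spent on it.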
3.3 Approximating Entropy Computations
This sub-section addresses the following question:
Do we really have to compute the log values that
require a lot of floating point computation to find the
smallest entropy values?
Let us assume we have a histogram (2-, 3+, 7-,
5+, 2-) and we need to determine its split point that
minimizes entropy. Let us consider the difference
between two splits, 1st: (2-, 3+ | 7-, 5+, 2-) and
2nd: (2-, 3+, 7- | 5+, 2-). Apparently, the 2nd is better
than the 1st. Since we are dealing with only binary
classification, we can assign a numeric value of +1
to one class and a value of –1 to the other class, and
we can use the sum of absolute differences in class
memberships in the two resulting partitions to
approximate entropy computations; the larger this
result is, the lower the entropy is. In this case, for the
first split the sum is |-2 + 3| + |-7 + 5 – 2| = 5, and for
the second the sum is |-2 + 3 – 7| + |5 – 2| = 9. We
call this method the absolute difference heuristic. We
performed some experiments [8] to determine how
often the same split point is picked by the
information gain heuristic and the absolute
difference heuristic. Our results indicate that in most
cases (approx. between 91 and 100% depending on
data set characteristics) the same split point is picked
by both methods.
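The absolute difference heuristic needs only integer additions. The following sketch represents a histogram as a list of signed block sizes (positive for '+', negative for '-'); this representation and the function names are assumptions made for illustration.

```python
def abs_diff_score(hist, split_after):
    """|sum of left blocks| + |sum of right blocks| for a split after `split_after` blocks."""
    left = sum(hist[:split_after])
    right = sum(hist[split_after:])
    return abs(left) + abs(right)

def best_split_by_abs_diff(hist):
    """Number of blocks on the left side of the split with the largest score."""
    return max(range(1, len(hist)), key=lambda k: abs_diff_score(hist, k))

hist = [-2, 3, -7, 5, -2]  # the histogram (2-, 3+, 7-, 5+, 2-)
first = abs_diff_score(hist, 2)   # (2-, 3+ | 7-, 5+, 2-)
second = abs_diff_score(hist, 3)  # (2-, 3+, 7- | 5+, 2-)
best = best_split_by_abs_diff(hist)
print(first, second, best)
```

The two scores reproduce the sums 5 and 9 computed above, and the split after the third block is selected.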
4. Evaluation
In this section we present the results of experiments
that evaluate our methods for 3 different microarray
data sets.
4.1 Data Sets and Experimental Design
The first data set is a leukemia data collection that
consists of 62 bone marrow and 10 peripheral blood
samples from acute leukemia patients (obtained from
Golub et al. [8]). The 72 samples fall into two
types of acute leukemia: acute myeloid leukemia
(AML) and acute lymphoblastic leukemia (ALL).
These samples come from both adults and children.
The RNA samples were hybridized to Affymetrix
high-density oligonucleotide microarrays that
contain probes for p = 7,130 human genes.
The second data set, a colon tissue data set,
contains expression levels (Red intensity/Green
intensity) of the 2000 genes with the highest minimal
intensity across 62 colon tissues. These gene
expressions in 40 tumor and 22 normal colon tissue
samples were analyzed with an Affymetrix
oligonucleotide array containing over 6,500 human
genes (Alon et al. [2]).
The third data set comes from a study of gene
expression in breast cancer patients (van ‘t Veer et al.
[3]). The data set contains data from 98 primary
breast cancer patients: 34 from patients who
developed distant metastases within 5 years, 44 from
patients who continued to be disease-free after a
period of at least 5 years, 18 from patients with
BRCA1 germline mutations, and 2 from BRCA2
carriers. All patients were lymph node negative, and
under 55 years of age at diagnosis.
In the experiments, we did not use all genes, but
rather selected a subset P with p elements of the
genes. Decision trees were then learnt that operate
on the selected subset of genes. As proposed in [9],
we remove genes from the data sets based on the
ratio of their between-groups to within-groups sum
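of squares. For a particular gene j, the ratio is
defined as:
\[ \frac{BSS(j)}{WSS(j)} = \frac{\sum_i \sum_k I(y_i = k)\,(\bar{x}_{kj} - \bar{x}_{\cdot j})^2}{\sum_i \sum_k I(y_i = k)\,(x_{ij} - \bar{x}_{kj})^2}, \]
where x_{ij} is the expression level of gene j in
sample i, y_i is the class of sample i, I(.) is the
indicator function, \bar{x}_{.j} denotes the average
expression level of gene j across all samples and
\bar{x}_{kj} denotes the average level of gene j
across samples belonging to class k.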
To give an explicit example here, assume we
have four samples and two genes for each sample:
the first gene’s expression level values for the four
samples are (1, 2, 3, 4) and the second’s are (1, 3, 2,
4); the sample class memberships are (+, -, +, -)
(listed in the order of samples no.1, no.2, no.3 and
no.4). For gene 1, we have BSS/WSS = 0.25, and
for gene 2, BSS/WSS = 4. If we have to remove one
gene, gene 1 will be removed according to our rule
since it has a lower BSS/WSS value. The removal of
gene 1 is reasonable because we can tell the class
membership of the samples by looking at their gene
2 expression level values: if one sample’s gene 2
expression level is greater than 2.5, the sample
should belong to the negative class, otherwise the
sample belongs to the positive class. If we evaluate
gene 1 instead, we will not be able to perform the
classification in one single step like we have just
done with gene 2.
After we calculate the BSS/WSS ratios for all
genes in a data set, only the p genes with the largest
ratios will remain in the datasets that will be used in
the experiments. Experiments were conducted with
different p values.
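The gene selection step can be sketched as follows. The sketch evaluates the ratio literally as a sum over all samples, stores the data with one list of expression levels per gene, and returns infinity when the within-groups sum is zero; the function names and the data layout are assumptions made for illustration.

```python
def bss_wss_ratio(x, y):
    """Between-groups to within-groups sum of squares for one gene.

    x: expression levels of the gene, one per sample; y: class of each sample.
    """
    overall = sum(x) / len(x)
    class_mean = {}
    for k in set(y):
        members = [v for v, c in zip(x, y) if c == k]
        class_mean[k] = sum(members) / len(members)
    bss = sum((class_mean[c] - overall) ** 2 for c in y)
    wss = sum((v - class_mean[c]) ** 2 for v, c in zip(x, y))
    # Assumption: a gene with no within-class variation ranks highest.
    return bss / wss if wss > 0 else float("inf")

def top_p_genes(genes, y, p):
    """Indexes of the p genes with the largest BSS/WSS ratio."""
    ratios = [bss_wss_ratio(g, y) for g in genes]
    return sorted(range(len(genes)), key=lambda j: ratios[j], reverse=True)[:p]

# The four-sample example of the text: one list per gene.
genes = [[1, 2, 3, 4], [1, 3, 2, 4]]
classes = ['+', '-', '+', '-']
selected = top_p_genes(genes, classes, 1)
print(selected)
```

With p = 1 only the second gene is retained, in line with the removal of gene 1 discussed above.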
In the experiments, we compared the popular
C5.0/See5.0 decision tree tool (which was run with
its default parameter settings) with two versions of
our tool. The first version, called the microarray
decision tree tool, does not use any optimizations
but employs pre-pruning: it stops growing the tree
when at least 90% of the examples belong to the
majority class. The second version of our tool,
called the optimized decision tree tool, uses the same
pre-pruning and employs all the techniques that were
discussed in Section 3.
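The pre-pruning rule amounts to a simple termination test on the class labels of a node. A minimal sketch, with the hypothetical function name `should_stop` and the 90% threshold passed as a parameter:

```python
def should_stop(labels, threshold=0.9):
    """True if the majority class covers at least `threshold` of the node's examples."""
    majority = max(labels.count(c) for c in set(labels))
    return majority / len(labels) >= threshold

print(should_stop(['+'] * 9 + ['-']))      # 9 of 10 examples in the majority class
print(should_stop(['+'] * 8 + ['-'] * 2))  # 8 of 10 examples in the majority class
```

A node for which the test succeeds becomes a leaf labeled with its majority class.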
4.2 Experimental Results
The first experiment evaluated the accuracy of the
three decision tree learning tools. Tables 1-3 below
display each algorithm’s error rate using the three
different data sets and also using three different p
values for gene selection.
The first column of the three tables gives
the p values that were used. The other columns give
the total number of misclassifications and the error
rate (in parentheses). Error rates were computed
using leave-one-out cross validation.
Table 1: The Leukemia data set test result (72 samples)
P       C5.0 Decision Tree   Microarray Decision Tree   Optimized Decision Tree
1024    5(6.9%)              5(6.9%)                    4(5.6%)
900     4(4.6%)              8(11.1%)                   5(6.9%)
750     13(18.1%)            11(15.3%)                  3(4.2%)
Table 2: Colon Tissue data set test result (62 Samples)
P       C5.0 Decision Tree   Microarray Decision Tree   Optimized Decision Tree
1600    12(19.4%)            15(24.2%)                  16(25.8%)
1200    12(19.4%)            15(24.2%)                  16(25.8%)
800     12(19.4%)            14(22.6%)                  16(25.8%)
Table 3: Breast Cancer data set test result (78 Samples)
P       C5.0 Decision Tree   Microarray Decision Tree   Optimized Decision Tree
5000    38(48.7%)            29(33.3%)                  35(44.9%)
1600    39(50.0%)            32(41.0%)                  30(38.5%)
1200    39(50.0%)            31(39.7%)                  29(33.3%)
If we study the error rates for the three methods
listed in the three tables carefully, we notice
that on average the error rates for the optimized
decision tree tool are lower than those of the
non-optimized one, which looks quite surprising since
the optimized decision tree tool uses a lot of
approximate computations and pruning.
However, further analysis revealed that the use
of attribute pruning (using the hist index we
introduced in Section 3.2) provides an explanation
for the better average accuracy of the optimized
decision tree tool. Why would attribute pruning lead
to a more accurate prediction in some cases? The
reason is that the entropy function does not take the
class distribution on sorted attributes into
consideration. For example, if we have two attribute
histograms (3+, 3-, 6+) and (3+, 1-, 2+, 1-, 2+, 1-,
2+), for the first histogram the best split point is (3+ |
3-, 6+) but for the second histogram there is one
similar split point (3+ | 1-, 2+, 1-, 2+, 1-, 2+) which
is equivalent to (3+ | 3-, 6+) with respect to the
information gain heuristic. Therefore, both split
points have the same chance to be selected. But, just
by intuition, we would say that the second split point
is much worse than the first split point because of
its large number of blocks, requiring more tests to
separate the two classes properly than the first one.
The traditional information gain heuristic
ignores such distributional aspects entirely, which
causes the loss of accuracy in some circumstances.
However, hist index based pruning, as proposed in
3.2, improved on this situation by removing
attributes that have a low hist index (like the second
attribute in the above example) beforehand.
Intuitively, continuous attributes with long
histograms “representing flip-flopping class
memberships” are not very attractive to be chosen in
test conditions, because more nodes/tests are
necessary in a decision tree to predict classes
correctly based on this attribute. In summary, some
of those “bad” attributes were removed by attribute
pruning, which explains the higher average accuracy in
the experiments.
In another experiment we compared the CPU
time for leave-one-out cross validation for the three
decision tree learning tools: C5.0 Decision Tree,
normal (Microarray Decision Tree) and optimized
(Optimized Decision Tree). All these experiments
were performed on an 850 MHz Intel Pentium
processor with 128 MB main memory. The CPU time
that is displayed (in seconds) in Table 4 includes the
time of the tree building and evaluation process (note:
these experiments are identical to those previously
listed in Tables 1 to 3). Our experimental results
suggest that the decision tree tool designed for
microarray data sets normally runs slightly faster
than the C5.0 tool, while the speedup of the
optimized microarray decision tree tool is quite
significant and ranges from 150% to 400%.
Table 4: CPU time comparison of three different decision
tree tools (CPU time in seconds)
Data Set                P Value   C5.0    Normal   Optimized
Leukemia Data set       1024      6.7     3.5      1.2
Leukemia Data set       900       5.6     3.1      1.1
Leukemia Data set       750       6.0     4.1      1.1
Colon Tissue Data set   1600      12.0    8.0      2.2
Colon Tissue Data set   1200      9.0     6.0      1.7
Colon Tissue Data set   800       5.9     3.8      1.1
Breast Cancer Data set  5000      74.5    75.3     15.9
Breast Cancer Data set  2000      30.4    30.2     6.4
Breast Cancer Data set  1500      22.4    20.4     4.8
5. Summary and Conclusion
We introduced decision tree learning algorithms for
microarray data sets, and optimizations to speed
up leave-one-out cross validation. Toward this
goal, several strategies were employed: the
introduction of the hist index to help prune attributes;
approximate computations that measure entropy; and
the reuse of sub-trees from previous runs. We claim
that the first two ideas are new, whereas the third idea
was also explored in Blockeel’s paper [10], which
centered on the reuse of split points. The
performance of the microarray decision tree tool was
compared with that of the commercially available
decision tree tool C5.0/See5.0 using 3 microarray
data sets. The experiments suggest that our tool runs
between 150% and 400% faster than C5.0.
We also compared the trees that were generated
in the experiments for the same data sets. We
observed that the trees generated by the same tool
are very similar. Trees generated by different tools
also had a significant degree of similarity. Basically,
all the trees that were generated for the three data
sets are of small size with normally less than 10
nodes. We also noticed that smaller trees seem to be
correlated with lower error rates.
Also worth mentioning is that our experimental
results revealed that the use of the hist index resulted
in a better accuracy in some cases. These results also
suggest that for continuous attributes the traditional
entropy-based information gain heuristic does not
work very well, because it fails to reflect
the class distribution characteristics of the samples
with respect to continuous attributes. Therefore,
better evaluation heuristics are needed for
continuous attributes. This problem is the subject of
our current research; in particular, we are currently
investigating multi-modal heuristics that use both
hist index and entropy. Another problem that is
investigated in our current research is the
generalization of the techniques described in this
paper to classification problems that involve more
than two classes.
References
[1] A. Brazma, J. Vilo. Gene expression data
analysis, FEBS Letters, 480:17-24, 2000.
[2] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S.
Ybarra, D. Mack, and A. J. Levine. Broad patterns
of gene expression revealed by clustering analysis
of tumor and normal colon tissues probed by
oligonucleotide arrays, Proc. Natl. Acad. Sci. USA
(Cell Biology), Vol. 96, pp. 6745-6750, June 1999.
[3] Laura J. van ‘t Veer, Hongyue Dai, Marc J. van
de Vijver, Yudong D. He, Augustinus A.M. Hart,
Mao Mao, Hans L. Peterse, Karin van der Kooy,
Matthew J. Marton, Anke T. Witteveen, George J.
Schreiber, Ron M. Kerkhoven, Chris Roberts,
Peter S. Linsley, René Bernards and Stephen H.
Friend. Gene expression profiling predicts clinical
outcome of breast cancer, Nature, 415, pp. 530–
536, 2002.
[4] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S.
Kitareewan, E. Dmitrovsky, E. Lander, and T.
Golub. Interpreting patterns of gene expression
with self-organizing maps. PNAS, 96:2907-2912,
1999.
[5] J.R. Quinlan. C4.5: Programs for machine
learning. Morgan Kaufmann, San Mateo, 1993.
[6] Xiaoyong Li. Concept learning techniques for
microarray data collections, Master’s Thesis,
University of Houston, December 2002.
[7] U. Fayyad, and K. Irani. Multi-interval
discretization of continuous-valued attributes for
classification learning, Proc. Int. Joint Conf. On
Artificial Intelligence (IJCAI-93), pp. 1022-1029,
1993.
[8] T. R. Golub, D. K. Slonim, P. Tamayo, C.
Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller,
M.L. Loh, J. R. Downing, M. A. Caligiuri, C. D.
Bloomfield, and E. S. Lander. Molecular
classification of cancer: class discovery and class
prediction by gene expression monitoring, Science,
286:531-537, 1999.
[9] S. Dudoit, J. Fridlyand, and T. P. Speed.
Comparison of discrimination methods for the
classification of tumors using gene expression
data, Journal of the American Statistical
Association, Vol. 97, No. 457, pp. 77-87, 2002.
[10] H. Blockeel, J. Struyf. Efficient algorithms for
decision tree cross-validation, Machine Learning:
Proceedings of the Eighteenth International
Conference, 11-18, 2001.