To appear in Proc. of the 2003 International Conference on Machine Learning and Applications (ICMLA'03), Los Angeles, California, June 23-24, 2003.

Fast Decision Tree Learning Techniques for Microarray Data Collections

Xiaoyong Li and Christoph F. Eick
Department of Computer Science, University of Houston, TX 77204-3010
e-mail: ceick@cs.uh.edu

Abstract

DNA microarrays allow monitoring of expression levels for thousands of genes simultaneously. The ability to successfully analyze the resulting huge amounts of genomic data is of increasing importance for research in biology and medicine. The focus of this paper is the discussion of techniques and algorithms of a decision tree learning tool that has been devised with the special features of microarray data sets in mind: continuous-valued attributes and a small number of examples, each described by a large number of genes. The paper introduces novel approaches to speed up leave-one-out cross validation through the reuse of results of previous computations, through attribute pruning, and through approximate computation techniques. Our approach employs special histogram-based data structures for continuous attributes, both for speedup and for the purpose of pruning. We present experimental results for three microarray data sets which suggest that these optimizations lead to speedups between 150% and 400%. We also present arguments that our attribute pruning techniques not only improve speed but also enhance testing accuracy.

Key words and phrases: decision trees, concept learning for microarray data sets, leave-one-out cross validation, heuristics for split point selection, decision tree reuse.

1. Introduction

The advent of DNA microarray technology provides biologists with the ability to monitor expression levels for thousands of genes simultaneously. Applications of microarrays range from the study of gene expression in yeast under different environmental stress conditions to the comparison of gene expression profiles of tumors from cancer patients [1]. In addition to the enormous scientific potential of DNA microarrays to help in understanding gene regulation and interactions, microarrays have very important applications in pharmaceutical and clinical research. By comparing gene expression in normal and abnormal cells, microarrays may be used to identify which genes are involved in causing particular diseases. Currently, most approaches to the computational analysis of gene expression data focus on learning about genes and tumor classes in an unsupervised way. Many research projects employ cluster analysis for both tumor samples and genes, mostly using hierarchical clustering methods [2,3] and partitioning methods, such as self-organizing maps [4], to identify groups of similar genes and groups of similar samples.

This paper, however, centers on the application of supervised learning techniques to microarray data collections. In particular, we will discuss the features of a decision tree learning tool for microarray data sets. We assume that each data set contains gene expression data for m-RNA samples. In these data sets the number of genes is typically large (usually between 1,000 and 10,000). Each gene is characterized by numerical values that measure the degree to which the gene is turned on for the particular sample. The number of examples in the training set, on the other hand, is typically below one hundred. Associated with each sample is its type or class, which we are trying to predict.
Moreover, in this paper we will restrict our discussion to binary classification problems. Section 2 introduces decision tree learning techniques for microarray data collections. Section 3 discusses how to speed up leave-one-out cross validation. Section 4 presents experimental results that evaluate our techniques for three microarray data sets, and Section 5 summarizes our findings.

2. Decision Tree Learning Techniques for Microarray Data Collections

2.1 Decision Tree Algorithms Reviewed

The traditional decision tree learning algorithm (for more discussion of decision trees see [5]) builds a decision tree breadth-first by recursively dividing the examples until each partition is pure or meets other termination conditions (to be discussed later). If a node satisfies a termination condition, the node is marked with a class label, namely the majority class of the samples associated with this node. In the case of microarray data sets, the splitting criterion for assigning examples to nodes is of the form "A < v" (where A is an attribute and v is a real number). In the algorithm description in Fig. 1 below, we assume that:
1. D is the whole microarray training data set;
2. T is the decision tree to be built;
3. N is a node of the decision tree, which holds the indexes of samples;
4. R is the root node of the decision tree;
5. Q is a queue which contains nodes of the same type as N;
6. Si is a split point, a structure containing a gene index i, a real number v, and an information gain value. A split point provides a split criterion that partitions the tree node N into two nodes N1 and N2 based on whether gene i's value for each example in the node is or is not greater than the value v;
7. Gi denotes the i-th gene.

The result of applying the decision tree learning algorithm is a tree whose intermediate nodes associate split points with attributes, and whose leaf nodes represent decisions (classes in our case). Test conditions for a node are selected to maximize the information gain, relying on the following framework: we assume we have 2 classes, sometimes called '+' and '-' in the following, in our classification problem. A test S subdivides the examples D = (p1, p2) into 2 subsets D1 = (p11, p12) and D2 = (p21, p22). The quality of a test S is measured using Gain(D,S):

H(D = (p_1, ..., p_m)) = \sum_{i=1}^{m} p_i \log_2(1/p_i)    (the entropy function)

Gain(D, S) = H(D) - \sum_{i=1}^{2} (|D_i| / |D|) \cdot H(D_i)

In the above, |D| denotes the number of elements in set D, and D = (p1, p2) with p1 + p2 = 1 indicates that of the |D| examples, p1*|D| examples belong to the first class and p2*|D| examples belong to the second class.

Procedure buildTree(D):
1. Initialize root node R of tree T using data set D;
2. Initialize queue Q to contain root node R;
3. While Q is not empty do {
4.   De-queue the first node N in Q;
5.   If N does not satisfy the termination condition {
6.     For each gene Gi (i = 1, 2, ...)
7.       {Evaluate splits on gene Gi based on information gain;
8.        Record the best split point Si for Gi and its information gain}
9.     Determine the split point Smax with the highest information gain;
10.    Use Smax to divide node N into N1 and N2 and attach nodes N1 and N2 to node N in the decision tree T;
11.    En-queue N1 and N2 to Q;
12.  }
13. }
Figure 1: Decision Tree Learning Algorithm
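To make the framework concrete, the following minimal Python sketch (our own illustration, not code from the tool; the function names are ours) computes the entropy of a class-count vector and the information gain of a binary split:

import math

def entropy(counts):
    """Entropy H(D) of a class-count vector, e.g. (9, 3)."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c)
               for c in counts if c > 0)

def gain(parent_counts, left_counts, right_counts):
    """Information gain of a test S that splits D into D1 and D2."""
    n = sum(parent_counts)
    return (entropy(parent_counts)
            - sum(left_counts) / n * entropy(left_counts)
            - sum(right_counts) / n * entropy(right_counts))

# Example: D holds 9 '+' and 3 '-' examples; a test "A < v" sends
# 3 '+' examples to D1 and the remaining (6 '+', 3 '-') to D2.
print(gain((9, 3), (3, 0), (6, 3)))   # about 0.123 bits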
2.2 Attribute Histograms

Our research introduced a number of new data structures for the purpose of speeding up the decision tree learning algorithm. One of these data structures, called an attribute histogram, captures the class distribution of a sorted continuous attribute. Let us assume we have 7 examples whose values for an attribute A are 1.01, 1.07, 1.44, 2.20, 3.86, 4.3, and 5.71, and whose class distribution is (-, +, +, +, -, -, +); that is, the first example belongs to class '-', the second example to class '+', and so on. If we group all adjacent samples with the same class, we obtain the histogram for this attribute, which is (1-, 3+, 2-, 1+), for short (1,3,2,1), as depicted in Fig. 2; if the class distribution for the sorted attribute A had been (+,+,-,-,-,-,+), A's histogram would be (2+, 4-, 1+), for short (2,4,1). Efficient algorithms to compute attribute histograms are discussed in [6].

2.3 Searching for the Best Split Point

As mentioned earlier, the traditional decision tree algorithm has a preference for tests that reduce entropy. To find the best test for a node, we have to search through all the possible split points for each attribute. To compute the best split point for a numeric attribute, the (sorted) list of its values is normally scanned from the beginning, and the entropy is computed for each candidate split point, which is placed half way between two adjacent attribute values. Thanks to our attribute histogram data structure, the best split point can be found much more efficiently, as illustrated in Figure 2: based on the histogram (1-, 3+, 2-, 1+), we only need to consider three possible splits, (1- | 3+, 2-, 1+), (1-, 3+ | 2-, 1+), and (1-, 3+, 2- | 1+), where the vertical bar represents the split point. Thus the number of candidate split points is reduced from 6 to 3. (Fayyad and Irani proved in [7] that splitting between adjacent samples that belong to the same class leads to sub-optimal information gain; in general, their paper advocates a multi-split algorithm for continuous attributes, whereas our approach relies on binary splits.)

Figure 2: Example of an Attribute Histogram

A situation that we have not discussed so far involves histograms that contain identical attribute values belonging to different classes. To cope with this situation when considering a split point, we need to check the attribute values of the two neighboring examples on either side of the split point. If they are the same, we have to discard this split point even if its information gain is high. After the best split point has been determined for all the attributes (genes in our case), the attribute with the highest information gain is selected and used to split the current node.
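The following Python sketch (again our own illustration; for simplicity it ignores the identical-attribute-value caveat above and does not track the actual split values v) builds an attribute histogram by run-length encoding the sorted class labels and enumerates only the class-boundary splits:

from itertools import groupby

def attribute_histogram(values, labels):
    """Sort the examples by attribute value and run-length encode
    the class labels; returns a list of (class, block_size) pairs."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    sorted_labels = [labels[i] for i in order]
    return [(cls, len(list(run))) for cls, run in groupby(sorted_labels)]

def candidate_splits(histogram):
    """Yield (left_counts, right_counts) class-count pairs for the
    splits at class boundaries, the only ones worth evaluating [7]."""
    def counts(blocks):
        pos = sum(n for c, n in blocks if c == '+')
        neg = sum(n for c, n in blocks if c == '-')
        return (pos, neg)
    for k in range(1, len(histogram)):
        yield counts(histogram[:k]), counts(histogram[k:])

values = [1.01, 1.07, 1.44, 2.20, 3.86, 4.3, 5.71]
labels = ['-', '+', '+', '+', '-', '-', '+']
hist = attribute_histogram(values, labels)
print(hist)                          # [('-', 1), ('+', 3), ('-', 2), ('+', 1)]
print(list(candidate_splits(hist)))  # the 3 candidate splits of Section 2.3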
3. Optimizations for Leave-one-out Cross Validation

In k-fold cross validation, we divide the data into k disjoint subsets of (approximately) equal size, then train the classifier k times, each time leaving out one of the subsets from training and using only the omitted subset as the test set to compute the error rate. If k equals the sample size, this is called "leave-one-out" cross validation. For large data sets, leave-one-out is computationally very demanding, since it has to construct many more decision trees than the usual forms of cross validation (k = 10 is a popular choice in the literature). But for data sets with few examples, such as microarray data sets, leave-one-out cross validation is popular and practical, since it gives the least biased evaluation of a model. Also, in leave-one-out cross validation the computations for the different subsets tend to be very similar. Therefore, it seems attractive to speed up leave-one-out cross validation through the reuse of results of previous computations, which is the main topic of the next subsection.

3.1 Reuse of Sub-trees from Previous Runs

It is important to note that the whole data set and each training set in leave-one-out differ in only one example. Therefore, in the likely event that the same root test is selected for the two data sets, we already know that at least one of the 2 sub-trees below the root node generated by the first run (for the whole data set) can be reused when constructing the other decision trees. Similar opportunities for reuse exist at other levels of the decision trees. Taking advantage of this property, we compare the node to be split with the stored nodes from previous runs, and reuse sub-trees if a match occurs. In order to obtain a speedup through sub-tree reuse, it is critical that matching nodes from previous runs can be found quickly. To facilitate the comparison of two nodes, we use bit strings to represent the sample list of each node. For example, if we have 10 samples in total, and 5 are associated with the current node, we use a bit string such as "0101001101" as the signature of this node, and use XOR string comparisons and signature hashing to quickly determine whether a reusable sub-tree exists.

3.2 Using Histograms for Attribute Pruning

Assume that two histograms A = (2+, 2-) and B = (1+, 1-, 1+, 1-) are given. In this case, our job is to find the best split point among all possible splits of both histograms. Obviously, B can never yield a better split than A, because (2+ | 2-) has entropy 0. This implies that performing information gain computations for attribute B is a waste of time, which prompts us to look for some way to distinguish between "good" and "bad" histograms, and to exclude attributes with bad histograms from consideration for the sake of speed. Mathematically, it would be quite complicated to come up with a formula that predicts the best attribute for a particular node of the decision tree. However, we can use an approximate method that is not always correct, but is correct most of the time. The idea is to use an index, which we call the "hist index". The hist index of a histogram S with m blocks is defined as:

Hist(S) = \sum_{j=1}^{m} P_j^2

where P_j is the relative frequency of block j in S. For example, if we have a histogram (1, 3, 4, 2), its hist index is proportional to 1^2 + 3^2 + 4^2 + 2^2 = 30 (since all attributes of a node share the same total number of samples, the squared block counts can be used in place of the relative frequencies without changing the ranking). A histogram with a high hist index is more likely to contain the best split point than a histogram with a low hist index. Intuitively, the fewer blocks a histogram has, the better the chance that it contains a good split point; mathematically, a^2 > a_1^2 + a_2^2 holds whenever a = a_1 + a_2 for positive a_1 and a_2. Our decision tree learning algorithm uses the hist index to prune attributes as follows: prior to determining the best split point of an attribute, its hist index is computed and compared with the average hist index of all previously processed histograms in the same round; only if its hist index is larger than this running average is the best split point for the attribute determined. Otherwise, the attribute is excluded from consideration for the test condition of the particular node.
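The sketch below (our own, with hypothetical function names) illustrates both the integer bit-mask version of the node signatures of Section 3.1 and the hist index with its running-average pruning rule; how the very first attribute of a round is treated is not specified above, so the sketch simply always keeps it:

def node_signature(sample_ids):
    """Bit-string signature of a node (Section 3.1), packed into an
    integer: bit i is set iff sample i belongs to the node. Two nodes
    hold identical sample sets iff sig_a ^ sig_b == 0, so signatures
    can be hashed to look up reusable sub-trees from previous runs."""
    sig = 0
    for i in sample_ids:
        sig |= 1 << i
    return sig

def hist_index(histogram):
    """Hist(S): sum of squared relative block frequencies."""
    n = sum(size for _, size in histogram)
    return sum((size / n) ** 2 for _, size in histogram)

def prune_attributes(histograms):
    """Keep an attribute only if its hist index exceeds the running
    average of the hist indexes seen so far in the same round."""
    kept, total = [], 0.0
    for attr, hist in enumerate(histograms):
        h = hist_index(hist)
        if attr == 0 or h > total / attr:   # assumption: always keep the first
            kept.append(attr)               # worth a full split-point search
        total += h
    return kept

a = [('+', 2), ('-', 2)]                      # hist index 0.5
b = [('+', 1), ('-', 1), ('+', 1), ('-', 1)]  # hist index 0.25
print(prune_attributes([a, b]))               # [0]: attribute B is pruned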
3.3 Approximating Entropy Computations

This sub-section addresses the following question: do we really have to compute the log values, which require a lot of floating point computation, to find the smallest entropy values? Let us assume we have a histogram (2-, 3+, 7-, 5+, 2-) and need to determine the split point that minimizes entropy. Consider the difference between the two splits, 1st: (2-, 3+ | 7-, 5+, 2-) and 2nd: (2-, 3+, 7- | 5+, 2-). Apparently, the 2nd is better than the 1st. Since we are dealing only with binary classification, we can assign a numeric value of +1 to one class and a value of -1 to the other class, and use the sum of the absolute differences in class memberships in the two resulting partitions to approximate the entropy computation; the larger this sum, the lower the entropy. In this case, for the first split the sum is |-2 + 3| + |-7 + 5 - 2| = 5, and for the second split the sum is |-2 + 3 - 7| + |5 - 2| = 9. We call this method the absolute difference heuristic. We performed experiments [6] to determine how often the same split point is picked by the information gain heuristic and the absolute difference heuristic. Our results indicate that in most cases (approx. between 91% and 100%, depending on data set characteristics) the same split point is picked by both methods.
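As an illustration (our own sketch, not the tool's code), the heuristic can be implemented with pure integer arithmetic over an attribute histogram:

def best_split_absdiff(histogram):
    """Pick the class-boundary split of a (class, size) histogram
    that maximizes |sum(left)| + |sum(right)|, where '+' blocks
    count +size and '-' blocks count -size; a larger score
    approximates a lower entropy, with no log computations."""
    signed = [size if cls == '+' else -size for cls, size in histogram]
    total = sum(signed)
    best_k, best_score, left = None, -1, 0
    for k in range(1, len(signed)):          # split after block k
        left += signed[k - 1]
        score = abs(left) + abs(total - left)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

# The example from the text: histogram (2-, 3+, 7-, 5+, 2-).
hist = [('-', 2), ('+', 3), ('-', 7), ('+', 5), ('-', 2)]
print(best_split_absdiff(hist))   # (3, 9): split (2-, 3+, 7- | 5+, 2-)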
4. Evaluation

In this section we present the results of experiments that evaluate our methods on 3 different microarray data sets.

4.1 Data Sets and Experimental Design

The first data set is a leukemia data collection that consists of 62 bone marrow and 10 peripheral blood samples from acute leukemia patients (obtained from Golub et al. [8]). The 72 samples fall into two types of acute leukemia: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). The samples come from both adults and children. The RNA samples were hybridized to Affymetrix high-density oligonucleotide microarrays containing probes for 7,130 human genes.

The second data set, a colon tissue data set, contains the expression levels (red intensity/green intensity) of the 2,000 genes with the highest minimal intensity across 62 colon tissues. The gene expression in 40 tumor and 22 normal colon tissue samples was analyzed with an Affymetrix oligonucleotide array containing over 6,500 human genes (Alon et al. [2]).

The third data set comes from a study of gene expression in breast cancer patients (van 't Veer et al. [3]). The data set contains data from 98 primary breast cancer patients: 34 from patients who developed distant metastases within 5 years, 44 from patients who remained disease-free for a period of at least 5 years, 18 from patients with BRCA1 germline mutations, and 2 from BRCA2 carriers. All patients were lymph node negative and under 55 years of age at diagnosis.

In the experiments, we did not use all genes, but rather selected a subset P with p elements of the genes. Decision trees were then learnt that operate on the selected subset of genes. As proposed in [9], we remove genes from the data sets based on the ratio of their between-groups to within-groups sum of squares. For a particular gene j, the ratio is defined as:

BSS(j)/WSS(j) = \frac{\sum_i \sum_k I(y_i = k) (\bar{x}_{kj} - \bar{x}_{\cdot j})^2}{\sum_i \sum_k I(y_i = k) (x_{ij} - \bar{x}_{kj})^2}

where \bar{x}_{\cdot j} denotes the average expression level of gene j across all samples and \bar{x}_{kj} denotes the average expression level of gene j across the samples belonging to class k.

To give an explicit example, assume we have four samples and two genes per sample: the first gene's expression level values for the four samples are (1, 2, 3, 4) and the second's are (1, 3, 2, 4); the sample class memberships are (+, -, +, -) (listed in the order of samples no. 1, no. 2, no. 3 and no. 4). For gene 1, we obtain BSS/WSS = 1/4 = 0.25, and for gene 2, BSS/WSS = 4. If we have to remove one gene, gene 1 will be removed according to our rule, since it has the lower BSS/WSS value. The removal of gene 1 is reasonable, because we can tell the class membership of the samples by looking at their gene 2 expression levels: if a sample's gene 2 expression level is greater than 2.5, the sample belongs to the negative class; otherwise it belongs to the positive class. If we use gene 1 instead, we cannot perform the classification in a single step as we have just done with gene 2. After the BSS/WSS ratios for all genes in a data set have been calculated, only the p genes with the largest ratios remain in the data sets used in the experiments. Experiments were conducted with different p values.

In the experiments, we compared the popular C5.0/See5.0 decision tree tool (run with its default parameter settings) with two versions of our tool. The first version, called the microarray decision tree tool, does not use any of the optimizations but employs pre-pruning: it stops growing the tree when at least 90% of the examples belong to the majority class. The second version of our tool, called the optimized decision tree tool, uses the same pre-pruning and employs all the techniques that were discussed in Section 3.

4.2 Experimental Results

The first experiment evaluated the accuracy of the three decision tree learning tools. Tables 1-3 below display each algorithm's total number of misclassifications and error rate (in parentheses) on the three data sets, for three different p values of the gene selection step; the first column of each table gives the p value used. Error rates were computed using leave-one-out cross validation.

Table 1: Leukemia data set test results (72 samples)

P    | C5.0 Decision Tree | Microarray Decision Tree | Optimized Decision Tree
1024 | 5 (6.9%)           | 5 (6.9%)                 | 4 (5.6%)
900  | 4 (4.6%)           | 8 (11.1%)                | 5 (6.9%)
750  | 13 (18.1%)         | 11 (15.3%)               | 3 (4.2%)

Table 2: Colon Tissue data set test results (62 samples)

P    | C5.0 Decision Tree | Microarray Decision Tree | Optimized Decision Tree
1600 | 12 (19.4%)         | 15 (24.2%)               | 16 (25.8%)
1200 | 12 (19.4%)         | 15 (24.2%)               | 16 (25.8%)
800  | 12 (19.4%)         | 14 (22.6%)               | 16 (25.8%)

Table 3: Breast Cancer data set test results (78 samples)

P    | C5.0 Decision Tree | Microarray Decision Tree | Optimized Decision Tree
5000 | 38 (48.7%)         | 29 (33.3%)               | 35 (44.9%)
1600 | 39 (50.0%)         | 32 (41.0%)               | 30 (38.5%)
1200 | 39 (50.0%)         | 31 (39.7%)               | 29 (33.3%)

If we study the error rates of the three methods listed in the three tables carefully, we notice that, on average, the error rates of the optimized decision tree tool are lower than those of the non-optimized version, which is quite surprising, since the optimized decision tree tool uses many approximate computations and prunes attributes. Further analysis revealed that the use of attribute pruning (based on the hist index introduced in Section 3.2) provides an explanation for the better average accuracy of the optimized decision tree tool. Why would attribute pruning lead to a more accurate prediction in some cases? The reason is that the entropy function does not take the class distribution on sorted attributes into consideration. For example, consider the two attribute histograms (3+, 3-, 6+) and (3+, 1-, 2+, 1-, 2+, 1-, 2+). For the first histogram the best split point is (3+ | 3-, 6+), but the second histogram has a similar split point (3+ | 1-, 2+, 1-, 2+, 1-, 2+), which is equivalent to (3+ | 3-, 6+) with respect to the information gain heuristic. Therefore, both split points have the same chance of being selected. Intuitively, however, the second split point is much worse than the first, because the large number of blocks in its histogram means that more tests are required to separate the two classes properly. The traditional information gain heuristic ignores such distributional aspects entirely, which causes a loss of accuracy in some circumstances. Hist index based pruning, as proposed in Section 3.2, improves on this situation by removing attributes that have a low hist index (like the second attribute in the above example) beforehand, as the sketch below illustrates.
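The following self-contained Python check (our own illustration, with hypothetical helper names) verifies both halves of this argument numerically:

import math

def entropy(counts):
    n = sum(counts)
    return sum(c / n * math.log2(n / c) for c in counts if c > 0)

def split_gain(hist, k):
    """Information gain of splitting a (class, size) histogram after block k."""
    def counts(blocks):
        return [sum(s for c, s in blocks if c == '+'),
                sum(s for c, s in blocks if c == '-')]
    d, d1, d2 = counts(hist), counts(hist[:k]), counts(hist[k:])
    n = sum(d)
    return entropy(d) - sum(d1)/n*entropy(d1) - sum(d2)/n*entropy(d2)

def hist_index(hist):
    n = sum(s for _, s in hist)
    return sum((s / n) ** 2 for _, s in hist)

a = [('+', 3), ('-', 3), ('+', 6)]
b = [('+', 3), ('-', 1), ('+', 2), ('-', 1), ('+', 2), ('-', 1), ('+', 2)]
print(split_gain(a, 1), split_gain(b, 1))  # identical: ~0.123 bits each
print(hist_index(a), hist_index(b))        # 0.375 vs ~0.167

Both attributes offer a best split with exactly the same information gain, yet the hist index of the flip-flopping histogram is less than half that of the first, so the running-average pruning rule tends to remove it.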
Intuitively, continuous attributes with long histograms, representing "flip-flopping" class memberships, are not attractive choices for test conditions, because more nodes/tests are necessary in a decision tree to predict classes correctly based on such an attribute. In summary, some of these "bad" attributes were removed by attribute pruning, which explains the higher average accuracy in the experiments.

In another experiment, we compared the CPU time of leave-one-out cross validation for the three decision tree learning tools: C5.0, the normal tool (Microarray Decision Tree), and the optimized tool (Optimized Decision Tree). All experiments were performed on an 850 MHz Intel Pentium processor with 128 MB of main memory. The CPU times displayed (in seconds) in Table 4 include both the tree building and the evaluation process (note: these experiments are identical to those reported in Tables 1 to 3). Our experimental results suggest that the decision tree tool designed for microarray data sets normally runs slightly faster than the C5.0 tool, while the speedup of the optimized microarray decision tree tool is quite significant and ranges from 150% to 400%.

Table 4: CPU time comparison of the three decision tree tools

Data Set                | P    | CPU Time (Seconds)
                        |      | C5.0 | Normal | Optimized
Leukemia data set       | 1024 | 6.7  | 3.5    | 1.2
                        | 900  | 5.6  | 3.1    | 1.1
                        | 750  | 6.0  | 4.1    | 1.1
Colon Tissue data set   | 1600 | 12.0 | 8.0    | 2.2
                        | 1200 | 9.0  | 6.0    | 1.7
                        | 800  | 5.9  | 3.8    | 1.1
Breast Cancer data set  | 5000 | 74.5 | 75.3   | 15.9
                        | 2000 | 30.4 | 30.2   | 6.4
                        | 1500 | 22.4 | 20.4   | 4.8

5. Summary and Conclusion

We introduced decision tree learning algorithms for microarray data sets, together with optimizations that speed up leave-one-out cross validation. Toward this goal, several strategies were employed: the introduction of the hist index to prune attributes, approximate computations of entropy, and the reuse of sub-trees from previous runs. We claim that the first two ideas are new, whereas the third idea was also explored in Blockeel's paper [10], which centered on the reuse of split points. The performance of the microarray decision tree tool was compared with that of the commercially available decision tree tool C5.0/See5.0 using 3 microarray data sets. The experiments suggest that our tool runs between 150% and 400% faster than C5.0. We also compared the trees that were generated in the experiments for the same data sets. We observed that the trees generated by the same tool are very similar; trees generated by different tools also showed a significant degree of similarity.
Basically, all the trees generated for the three data sets are of small size, normally with fewer than 10 nodes. We also noticed that smaller trees seem to be correlated with lower error rates. Also worth mentioning is that our experimental results revealed that the use of the hist index resulted in better accuracy in some cases. These results also suggest that for continuous attributes the traditional entropy-based information gain heuristic does not work very well, because it fails to reflect the class distribution characteristics of the samples with respect to continuous attributes. Therefore, better evaluation heuristics are needed for continuous attributes. This problem is the subject of our current research; in particular, we are currently investigating multi-modal heuristics that use both the hist index and entropy. Another problem investigated in our current research is the generalization of the techniques described in this paper to classification problems that involve more than two classes.

References

[1] A. Brazma and J. Vilo. Gene expression data analysis, FEBS Letters, 480:17-24, 2000.
[2] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, and A.J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Cell Biology, Vol. 96, pp. 6745-6750, June 1999.
[3] L.J. van 't Veer, H. Dai, M.J. van de Vijver, Y.D. He, A.A.M. Hart, M. Mao, H.L. Peterse, K. van der Kooy, M.J. Marton, A.T. Witteveen, G.J. Schreiber, R.M. Kerkhoven, C. Roberts, P.S. Linsley, R. Bernards, and S.H. Friend. Gene expression profiling predicts clinical outcome of breast cancer, Nature, 415, pp. 530-536, 2002.
[4] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. Lander, and T. Golub. Interpreting patterns of gene expression with self-organizing maps, PNAS, 96:2907-2912, 1999.
[5] J.R. Quinlan. C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, 1993.
[6] X. Li. Concept learning techniques for microarray data collections, Master's Thesis, University of Houston, December 2002.
[7] U. Fayyad and K. Irani. Multi-interval discretization of continuous-valued attributes for classification learning, Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI-93), pp. 1022-1029, 1993.
[8] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, 286:531-537, 1999.
[9] S. Dudoit, J. Fridlyand, and T.P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association, Vol. 97, No. 457, pp. 77-87, 2002.
[10] H. Blockeel and J. Struyf. Efficient algorithms for decision tree cross-validation, Machine Learning: Proceedings of the Eighteenth International Conference, pp. 11-18, 2001.