Computing and Information Systems, 7 (2000), p. 91-97 © University of Paisley 2000

Decision Trees as a Data Mining Tool

Bruno Crémilleux

The production of decision trees is usually regarded as an automatic method to discover knowledge from data: trees stem directly from the data without any other intervention. However, we cannot expect acceptable results if we naively apply machine learning to arbitrary data. By reviewing the whole process, and the other work which implicitly has to be done to generate a decision tree, this paper shows that the method has to be placed within the knowledge discovery in databases process and that, in fact, the user has to intervene both during the core of the method (building and pruning) and during the associated tasks.

1. INTRODUCTION

Data mining and Knowledge Discovery in Databases (KDD) are fields of increasing interest combining databases, artificial intelligence, machine learning and statistics. Briefly, the purpose of KDD is to extract from large amounts of data non-trivial "nuggets" of information in an easily understandable form. Such discovered knowledge may be, for instance, regularities or exceptions.

The decision tree is a method which comes from the machine learning community and explores data. Such a method is able to give a summary of the data (which is easier to analyze than the raw data) or can be used to build a tool (for example a classifier) to help a user in many different decision making tasks. Broadly speaking, a decision tree is built from a set of training data having attribute values and a class name. The result of the process is represented as a tree whose nodes specify attributes and whose branches specify attribute values. Leaves of the tree correspond to sets of examples with the same class or to elements for which no more attributes are available.

Construction of decision trees is described, among others, by Breiman et al. (1984) [1], who present an important and well-known monograph on classification trees. A number of standard techniques have been developed, such as the basic algorithms ID3 [20] and CART [1]. A survey of the different decision tree classifiers and of the various existing issues is presented by Safavian and Landgrebe [25].

Usually, the production of decision trees is regarded as an automatic process: trees are straightforwardly generated from data and the user is relegated to a minor role. Nevertheless, we think that this method intrinsically requires the user, not only during the step of data preparation, but also during the whole process. In fact, the use of decision trees can be embedded in the KDD process within its main steps (selection, preprocessing, data mining, interpretation / evaluation). The aim of this paper is to show the role of the user and to connect the use of decision trees with the data mining framework.

This paper is organized as follows. Section 2 outlines the core of the decision tree method (i.e. building and pruning). The literature usually presents these points from a technical side without describing the part played by the user: we will see that he has a role to play. Section 3 deals with the associated tasks which are, in fact, absolutely necessary. These tasks, where the user clearly has to intervene, are often not emphasized when we speak of decision trees. We will see that they have a great relevance and that they act upon the final result.

2. BUILDING AND PRUNING

2.1 Building decision trees: choice of an attribute selection criterion

In the induction of decision trees, various attribute selection criteria are used to estimate the quality of attributes in order to select the best one to split on. We know at a theoretical level that criteria derived from an impurity measure have suitable properties to generate decision trees and perform comparably (see [10], [1] and [6]). We call such criteria C.M. criteria (concave-maximum criteria) because an impurity measure, among other characteristics, is defined by a concave function. The most commonly used criteria, namely the Shannon entropy (in the family of ID3 algorithms) and the Gini criterion (in CART algorithms, see [1] for details), are C.M. criteria.
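To make the notion of an impurity measure concrete, the following minimal Python sketch (given purely as an illustration, not as part of the method discussed here) computes the two C.M. measures just mentioned, the Shannon entropy and the Gini index, from the class counts of a node.

```python
# Illustrative sketch: the two usual C.M. impurity measures, written as
# functions of the class counts of a node.
from math import log2

def shannon_entropy(counts):
    """Shannon entropy of a node given its class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Both measures are concave in the class probabilities and vanish on pure
# nodes, e.g. the root of Figure 1 versus its pure leaf (0, 200, 0):
print(shannon_entropy([2500, 200, 2500]), gini([2500, 200, 2500]))
print(shannon_entropy([0, 200, 0]), gini([0, 200, 0]))
```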
Nevertheless, other paradigms exist to build decision trees. For example, Fayyad and Irani [10] claim that grouping values of attributes and building binary trees yields better trees. For that purpose, they propose the ORT measure. ORT favours attributes that simply separate the different classes, without taking into account the number of examples in the nodes, so that ORT produces trees with small pure (or nearly pure) leaves at their top more often than C.M. criteria do. To better understand the differences between the C.M. and ORT criteria, let us consider the data set given in the appendix and the trees induced from these data, depicted in Figure 1: the tree built with a C.M. criterion is represented at the top and the tree built with the ORT criterion at the bottom. ORT rapidly comes out with the pure leaf Y2 = y21, while the C.M. criterion splits it and arrives later at the split leaves.

[Figure 1 shows the two trees built from the appendix data; each node is labelled with its numbers of examples of classes d1, d2 and d3. In the C.M. tree, the root (2500,200,2500) is split on Y1 into (2350,150,150) for Y1 = y11 and (150,50,2350) for Y1 = y12; each of these nodes is then split on Y2, giving the leaves (0,150,0) and (2350,0,150), and (0,50,0) and (150,0,2350). In the ORT tree, the root is split on Y2 into the pure leaf (0,200,0) for Y2 = y21 and (2500,0,2500) for Y2 = y22; the latter node is then split on Y1 into the leaves (2350,0,150) and (150,0,2350).]

Figure 1: An example of C.M. and ORT trees.

We give here just a simple example, but others, both in artificial and real world domains, are detailed in [6]: they show that the ORT criterion produces trees with small leaves at their top more often than C.M. criteria. We also see in [6] that overspecified leaves obtained with C.M. criteria tend to be small and at the bottom of the tree (thus easy to prune), while leaves at the bottom of ORT trees can be large. In uncertain domains (we will come back to this point in the next paragraph), such leaves produced by ORT may be irrelevant, and it is difficult to prune them without destroying the tree.
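The first split of each tree in Figure 1 can be recomputed from the appendix data. The sketch below is only an illustration: the entropy gain is a standard C.M. criterion, while the "ORT-like" score is our reading of the orthogonality idea behind ORT [10] (1 minus the cosine of the angle between the class vectors of the two children), not necessarily the exact measure used by its authors.

```python
# Illustrative comparison of the two candidate first splits of Figure 1.
from math import log2, sqrt

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(parent, children):
    """Entropy gain of a split (a C.M. criterion)."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

def ort_like(left, right):
    """1 - cos(angle) between the class vectors of the two children."""
    dot = sum(l * r for l, r in zip(left, right))
    norm = sqrt(sum(l * l for l in left)) * sqrt(sum(r * r for r in right))
    return 1.0 - dot / norm

root = [2500, 200, 2500]                          # classes d1, d2, d3
split_y1 = ([2350, 150, 150], [150, 50, 2350])    # Y1 = y11 / y12
split_y2 = ([0, 200, 0], [2500, 0, 2500])         # Y2 = y21 / y22

print(gain(root, split_y1), gain(root, split_y2))  # ~0.65 vs ~0.24: Y1 wins
print(ort_like(*split_y1), ort_like(*split_y2))    # ~0.87 vs 1.0 : Y2 wins
```

The entropy gain prefers Y1, which separates the bulk of the examples, whereas the orthogonality score is maximal for Y2, which immediately isolates the small pure leaf Y2 = y21, as in the ORT tree of Figure 1.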
Let us note that other selection criteria, such as the gain ratio criterion, address other specific issues. The ratio criterion proposed by Quinlan [20], derived from the entropy criterion, is designed to avoid favouring attributes with many values. Actually, in some situations, selecting an attribute essentially because it has many values might jeopardize the semantic acceptance of the induced trees ([27] and [18]). The J-measure [15] is the product of two terms that are considered by Goodman and Smyth as the two basic criteria for evaluating a rule: one term is derived from the entropy function and the other measures the simplicity of a rule. Quinlan and Rivest [21] used the minimum description length principle to construct a decision tree minimizing the misclassification rate when one looks for general rules together with their exceptional cases. This principle has been taken up by Wallace and Patrick [26], who suggest some improvements and show that they generally obtain better empirical results than those found by Quinlan. Buntine [3] presents a tree learning algorithm stemming from Bayesian statistics whose main objective is to provide outstanding predicted class probabilities at the nodes.

We can also address the question of deciding which sub-nodes have to be built. For a split, the GID3* algorithm [12] groups into a single branch the values of an attribute which are estimated to be meaningless compared with its other values. For the building of binary trees, another criterion is twoing [1]. Twoing groups the classes into two superclasses so that, considered as a two-class problem, the greatest decrease in node impurity is achieved. Some properties of twoing are described in Breiman [2]. Concerning binary decision trees, let us note that in some situations users do not agree to group values, since it yields meaningless trees; thus non-binary trees must not be definitively discarded.

So, we have seen that there are many attribute selection criteria and that, even if some of them can be gathered into families, a choice has to be made. In our opinion, the choice of a paradigm depends on whether the data sets used embed uncertainty or not, whether the phenomenon under study admits deterministic causes, and what level of intelligibility is required. In the next paragraph, we move to the pruning stage.

2.2 Pruning decision trees: what about the classification and the quality?

We know that in many areas, such as medicine, data are uncertain: there are always some examples which escape the rules. Translated into the context of decision trees, this means that some examples seem similar but in fact differ in their classes. In these situations, it is well known (see [1], [4]) that decision tree algorithms tend to divide nodes having few examples and that the resulting trees tend to be very large and overspecified. Some branches, especially towards the bottom, are present due to sample variability and are statistically meaningless (one can also say that they are due to noise in the sample). Such branches must either not be built or be pruned. If we do not want to build them, we have to set out rules to stop the building of the tree; but we know it is better to generate the entire tree and then to prune it (see for example [1] and [14]). Pruning methods (see [1], [19], [20]) try to cut such branches in order to avoid this drawback.

The principal methods for pruning decision trees are examined in [9] and [19]. Most of these pruning methods are based on minimizing a classification error rate when each element of a node is classified into the most frequent class of this node. The latter is estimated with a test file or using statistical methods such as cross-validation or the bootstrap. These pruning methods are designed for situations where the built tree will be used as a classifier, and they systematically discard a sub-tree which does not improve the classification error rate. Let us consider the sub-tree depicted in Figure 2. D is the class and it is here bivalued; in each node, the first (resp. second) value indicates the number of examples having the first (resp. second) value of D. This sub-tree does not lessen the error rate, which is 10% both at its root and in its leaves; nevertheless the sub-tree is of interest, since it points out a specific population with a constant value of D, while in the remaining population it is impossible to predict a value for D.

[Figure 2 shows a sub-tree whose root (90,10) has two leaves, (79,0) and (11,10).]

Figure 2: A tree which could be interesting although it does not decrease the number of errors.
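The point made about Figure 2 can be checked in a few lines. The sketch below (an illustration using the figure's numbers, not code from the method itself) classifies every example of a node into its majority class: the error rate is the same 10% before and after the split, so a pruner driven only by the error rate would discard this sub-tree, even though the leaf (79,0) isolates a subpopulation where D is constant.

```python
# Minimal check of Figure 2: error-rate-based pruning sees no gain here.
def error_rate(counts):
    """Errors made when every example is classified into the majority class."""
    return (sum(counts) - max(counts)) / sum(counts)

root = (90, 10)                       # class counts at the root
leaves = [(79, 0), (11, 10)]          # class counts at the two leaves

n = sum(root)
err_root = error_rate(root)                                            # 10/100
err_leaves = sum(sum(leaf) * error_rate(leaf) for leaf in leaves) / n  # (0+10)/100

print(err_root, err_leaves)           # both 0.1: the sub-tree would be pruned
```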
In [5], we have proposed a pruning method (called C.M. pruning, because a C.M. criterion is used to build the entire tree) suitable in uncertain domains. C.M. pruning builds a new attribute binding the root of a tree to its leaves, the values of this attribute corresponding to the branches leading to a leaf. It permits the computation of the global quality of a tree, and the best sub-tree to prune is the one that yields the pruned tree of highest quality. This pruning method is not tied to the use of the pruned tree as a classifier.

This work has been taken up in [13]. In uncertain domains, a deep tree is less relevant than a small one: the deeper a tree, the less understandable and reliable it is. So a new quality index (called DI, for Depth-Impurity) has been defined in [13]. The latter manages a trade-off between the depth and the impurity of each node of a tree. From this index, a new pruning method (denoted DI pruning) has been derived. With regard to C.M. pruning, DI pruning introduces a damping function to take into account the depth of the leaves. Moreover, by giving the quality of each node (and not only of a sub-tree), DI pruning is able to distinguish some subpopulations of interest in large populations or, on the contrary, to highlight sets of examples with high uncertainty (in the context of the studied problem). In this case, the user has to come back to the data to try to improve their collection and preparation. Getting the quality of each node is a key point in uncertain domains.
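To fix ideas, here is a small sketch of a depth/impurity trade-off of the kind described above. It is only an illustration of the general principle (purity damped by depth through a hypothetical parameter gamma) and not the DI index actually defined in [13].

```python
# Hypothetical illustration of a depth/impurity trade-off (NOT the DI index
# of [13]). A leaf contributes its purity damped by gamma**depth; the tree
# quality is the average of these contributions weighted by leaf sizes.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def purity(counts, n_classes):
    """1 for a pure node, 0 for a uniformly mixed one."""
    return 1.0 - entropy(counts) / log2(n_classes)

def tree_quality(leaves, n_classes, gamma=0.9):
    """leaves: list of (class_counts, depth); gamma in (0, 1] damps deep leaves."""
    n = sum(sum(counts) for counts, _ in leaves)
    return sum(sum(counts) / n * purity(counts, n_classes) * gamma ** depth
               for counts, depth in leaves)

# Leaves of the C.M. tree of Figure 1, all at depth 2:
cm_leaves = [([0, 150, 0], 2), ([2350, 0, 150], 2),
             ([0, 50, 0], 2), ([150, 0, 2350], 2)]
print(tree_quality(cm_leaves, n_classes=3, gamma=0.9))
print(tree_quality(cm_leaves, n_classes=3, gamma=0.6))  # stronger damping
```

Lowering gamma penalizes deep leaves more strongly, which is the role played by the damping function of DI pruning.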
So, concerning the pruning stage, the user is confronted with several questions:
- is there uncertainty in the data?
- am I interested in obtaining a quality value for each node?
and he has to know which use of the tree is pursued:
- a tree can be an efficient description oriented by an a priori classification of its elements; pruning the tree then discards overspecific information to obtain a more legible description;
- a tree can be built to highlight reliable subpopulations; here only some leaves of the pruned tree will be considered for further investigation;
- the tree can be transformed into a classifier for any new element of a large population.
The choice of a pruning strategy is tied to the answers to these questions.

3. ASSOCIATED TASKS

We indicate in this section when and how the user, by means of various associated tasks, intervenes in the process of developing decision trees. Schematically, these tasks concern the gathering of the data for the design of the training set, the encoding of the attributes, the specific analysis of examples, the analysis of the resulting tree, and so on. Generally, these tasks are not emphasized in the literature; they are usually considered as secondary, but we will see that they have a great relevance and that they act upon the final result. Of course, these tasks intersect with the building and pruning work that we have previously described.

In practice, apart from the building and pruning steps, there is another step: the data preparation. We add a fourth step, which aims to study the classification of new examples with a - potentially pruned - tree. The user strongly intervenes during the first step, but also has a supervising role during all steps and, more particularly, a critic's role after the second and third steps (see Figure 3). We do not detail here the fourth step, which is marginal from the point of view of the user's role.

[Figure 3 shows the steps of the decision tree software - data manipulation, building, pruning and classification - linking the data set, the entire tree, the pruned tree and the results of the classification; the user prepares the data and checks and intervenes at each of the subsequent steps.]

Figure 3: Process to generate decision trees and relations with the user.

3.1 Data preparation

The aim of this step is to supply, from the database gathering the examples in their raw form, a training set as well adapted as possible to the development of decision trees. This step is the one where the user intervenes most directly. His tasks are numerous: deleting examples considered as aberrant (outliers) and/or containing too many missing values; deleting attributes evaluated as irrelevant to the given task; re-encoding the attribute values (one knows that if the attributes have very different numbers of values, those having more values tend to be chosen first ([27] and [18]); we have already referred to this point with the gain ratio criterion); re-encoding several attributes (for example, the fusion of attributes); segmenting continuous attributes; analyzing missing data; and so on.

Let us get back to some of these tasks. At first [16], the decision tree algorithms did not accept quantitative attributes: these had to be discretized. This initial segmentation can be done by asking experts to set thresholds or by using a strategy relying on an impurity function [11]. The segmentation can also be done while building the trees, as is the case with the software C4.5 [22]; a continuous attribute can then be segmented several times in the same tree. It seems relevant to us that the user may actively intervene in this process by indicating, for example, an a priori discretization of the attributes for which it is meaningful and by letting the system manage the others. One shall remark that, if one knows in a reasonable way how to split a continuous attribute into two intervals, the question is more delicate for a three-valued (or more) discretization.
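As an illustration of the impurity-based strategy mentioned above (a sketch with a made-up toy attribute, in the spirit of [11] but not taken from it), a binary cut point for a continuous attribute can be chosen by trying every boundary between sorted values and keeping the threshold with the lowest weighted entropy.

```python
# Illustrative sketch: choose a binary cut point for a continuous attribute
# by minimizing the weighted entropy of the two resulting subsets.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_entropy) of the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no boundary between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical toy data: an "age" attribute and a binary class.
ages = [22, 25, 31, 38, 42, 47, 55, 63]
cls  = ["a", "a", "a", "b", "b", "b", "b", "b"]
print(best_threshold(ages, cls))          # cut between 31 and 38 -> (34.5, 0.0)
```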
The user also generally has to decide on the deletion, re-encoding or fusion of attributes. He has a priori ideas allowing a first pass at this task. But we shall see that the tree construction, by making the underlying phenomenon explicit, suggests to the user new re-encodings and/or fusions of attributes, often leading to a more general description level.

The current decision tree construction algorithms most often deal with missing values by means of specific and internal treatments [7]. On the contrary, by a preliminary analysis of the database, relying on the search for associations between data and leading to uncertain rules that determine missing values, Ragel ([7], [24]) offers a strategy where the user can intervene: such a method leaves a place for the user and his knowledge, allowing him to delete, add or modify some rules. As we can see, this step depends in fact a lot on the user's work.

3.2 Building step

The aim of this step is to induce a tree from the training set arising from the previous step. Some system parameters have to be specified. For example, it is useless to keep on building a tree from a node having too few examples, this amount being relative to the initial number of examples in the base. An important parameter to set is thus the minimum number of examples required to split a node. Facing a particularly huge tree, the user will ask for the construction of a new tree by setting this parameter to a higher value, which amounts to pruning the tree by a pragmatic process. We have seen (paragraph 2.1) that in uncertain induction the user will most probably choose a C.M. criterion in order to be able to prune. But if he knows that the studied phenomenon admits deterministic causes in situations with few examples, he can choose the ORT criterion to get a more concise description of these situations.
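The effect of such a minimum-examples parameter can be sketched as follows. This is a hypothetical toy builder, not the system used here: the recursive construction stops as soon as a node holds fewer examples than the user-set minimum, so raising this value directly produces a smaller tree.

```python
# Hypothetical toy builder showing pre-pruning with a minimum-examples
# parameter. Examples are (attribute dict, class) pairs; attributes are categorical.
from collections import Counter
from math import log2

def entropy(examples):
    n = len(examples)
    return -sum(c / n * log2(c / n)
                for c in Counter(cls for _, cls in examples).values())

def build(examples, attributes, min_examples=2):
    classes = Counter(cls for _, cls in examples)
    # Stop when the node is too small, pure, or has no attribute left.
    if len(examples) < min_examples or len(classes) == 1 or not attributes:
        return dict(classes)                      # leaf: class counts
    def gain(att):                                # entropy gain (a C.M. criterion)
        rest = 0.0
        for value in {ex[att] for ex, _ in examples}:
            subset = [(ex, c) for ex, c in examples if ex[att] == value]
            rest += len(subset) / len(examples) * entropy(subset)
        return entropy(examples) - rest
    best = max(attributes, key=gain)
    return {(best, value): build([(ex, c) for ex, c in examples if ex[best] == value],
                                 [a for a in attributes if a != best], min_examples)
            for value in {ex[best] for ex, _ in examples}}

data = ([({"Y1": "y11", "Y2": "y22", "Y3": "t"}, "d1")] * 5 +
        [({"Y1": "y12", "Y2": "y21", "Y3": "t"}, "d2")] * 2 +
        [({"Y1": "y12", "Y2": "y21", "Y3": "u"}, "d3")] * 1 +
        [({"Y1": "y12", "Y2": "y22", "Y3": "t"}, "d3")] * 4)

print(build(data, ["Y1", "Y2", "Y3"], min_examples=2))  # 3-example mixed node split on Y3
print(build(data, ["Y1", "Y2", "Y3"], min_examples=4))  # the same node is kept as a leaf
```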
The presentation of the attributes and of their respective criterion scores at each node may allow the user to select attributes that do not have the best score but that provide a promising way towards a relevant leaf. Finally, the building and pruning steps can be viewed as part of the study of the attributes: experts of the domain usually appreciate being able to restructure the set of initial attributes and to see at once the effect of such a modification on the tree (in general, after a preliminary decision tree, they define new attributes which summarize some of the initial ones). We have noticed [5] that when such attributes are used, the shape of the curve of the quality index as a function of the number of pruned sub-trees changes and tends to show three parts: in the first one, the variation of the quality index is small; in the second part, the quality decreases regularly; and in the third part, the quality rapidly becomes very low. This shows that the information embedded in the data set lies mainly in the top of the tree, while the bottom can be pruned.

The critique of the tree thus obtained is the most important contribution of the user in this step. He checks whether the tree is understandable with regard to his domain knowledge and whether its general structure conforms to his expectations. Facing a surprising result, he wonders whether it is due to a bias in the training step or whether it reflects a phenomenon, sometimes suspected, but not yet explicitly expressed. Most often, seeing the tree gives the user new ideas about the attributes, and he will choose to build the tree again after reworking the training set and/or changing a parameter of the induction system to confirm or refute a conjecture.

3.3 Pruning step

Apart from the questions at the end of paragraph 2.2 about the type of the data and the aim pursued in producing a tree, more questions arise for the user if he uses a technique such as DI pruning. In fact, in this situation, the user has more information to react upon. First, he knows the quality index of the entire tree, which allows him to evaluate the global complexity of the problem. If this index is low, this means that the problem is delicate or inadequately described, that the training set is not representative, or even that the decision tree method is not adapted to this specific problem. If the user has several trees, the quality index allows him to compare them and possibly to suggest new experiments. Moreover, the quality index of each node contrasts the populations where the class is easy to determine with the sets of examples where it is impossible to predict it. Such areas can suggest new experiments on smaller populations, or can even raise the question of the existence of additional attributes (which will have to be collected) to help determine the class for the examples where it is not yet possible.

From experiments [13], we have noticed that the degree of pruning is closely bound to the uncertainty embedded in the data. In practice, this means that the damping process has to be adjusted according to the data in order to obtain, in all situations, a relevant number of pruned trees. For that purpose, we introduce a parameter to control the damping process. By varying this parameter, one follows the evolution of the quality index during the pruning (for example, the user distinguishes the parts of the tree that are due to randomness from those which are reliable). Such work separates the most relevant attributes from those that it may be necessary to redefine.

3.4 Conclusion

Throughout this section, we have seen that the user's interventions are numerous and that the realization of the associated tasks is closely linked to him. These tasks are fundamental since they directly affect the results, and the study of the results brings new experiments: the user restarts many times the work done during a step by changing the parameters, or comes back to previous steps (the arrows in Figure 3 show all the relations between the different steps). At each step, the user may accept, override, or modify the generated rules, but more often he suggests alternative features and experiments. Finally, the rule set is redefined through subsequent data collection, rule induction and expert consideration.

We think it is necessary for the user to take part in the system so that a real development cycle takes place. The latter seems fundamental to us in order to obtain useful and satisfying trees. The user does not usually know beforehand which tree is relevant to his problem, and it is because he finds it gratifying to take part in this search that he takes an interest in the induction work. Let us note that several authors try to define software architectures explicitly integrating the user. In the area of induction graphs (a generalization of decision trees), the SIPINA software allows the user to fix the choice of an attribute, to gather temporarily some values of an attribute, to stop the construction from some nodes, and so on. Dabija et al. [8] offer a learning system architecture (called KAISER, for Knowledge Acquisition Inductive System driven by Explanatory Reasoning) for an interactive knowledge acquisition system based on decision trees and driven by explanatory reasoning; moreover, the experts can incrementally add knowledge corresponding to the domain theory. KAISER confronts the built trees with the domain theory, so that some incoherences may be detected (for instance, the value of the attribute "eye" for a cat has to be "oval"). Kervahut and Potvin [17] have designed an assistant to collaborate with the user. This assistant, which takes the form of a graphic interface, helps the user test the methods and their parameters in order to get the most relevant combination for the problem at hand.

4. CONCLUSION

Producing decision trees is often presented as "automatic", with a marginal participation of the user: we have stressed the fact that the user has a fundamental critic and supervisor role and that he intervenes in a major way. This leads to a real development cycle between the user and the system. This cycle is only possible because the construction of a tree is nearly instantaneous.

The participation of the user in the data preparation, the choice of the parameters and the critique of the results is in fact at the heart of the more general process of Knowledge Discovery in Databases.
As usual in KDD, we claim that the understanding and the declarativity of the mechanisms of the methods is a key point for achieving in practice a fruitful process of information extraction. Finally, we think that, in order to really reach a data exploration reasoning that associates the user in a profitable way, it is important to give him a framework gathering all the tasks intervening in the process, so that he may freely explore the data, react, and innovate with new experiments.

References

[1] Breiman L., Friedman J. H., Olshen R. A., & Stone C. J. Classification and regression trees. Wadsworth Statistics/Probability Series, Belmont, 1984.
[2] Breiman L. Some properties of splitting criteria (technical note). Machine Learning 21, 41-47, 1996.
[3] Buntine W. Learning classification trees. Statistics and Computing 2, 63-73, 1992.
[4] Catlett J. Overpruning large decision trees. In proceedings of the Twelfth International Joint Conference on Artificial Intelligence IJCAI 91, pp. 764-769, Sydney, Australia, 1991.
[5] Crémilleux B., & Robert C. A Pruning Method for Decision Trees in Uncertain Domains: Applications in Medicine. In proceedings of the workshop Intelligent Data Analysis in Medicine and Pharmacology, ECAI 96, pp. 15-20, Budapest, Hungary, 1996.
[6] Crémilleux B., Robert C., & Gaio M. Uncertain domains and decision trees: ORT versus C.M. criteria. In proceedings of the 7th Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems, pp. 540-546, Paris, France, 1998.
[7] Crémilleux B., Ragel A., & Bosson J. L. An Interactive and Understandable Method to Treat Missing Values: Application to a Medical Data Set. In proceedings of the 5th International Conference on Information Systems Analysis and Synthesis (ISAS / SCI 99), pp. 137-144, M. Torres, B. Sanchez & E. Wills (Eds.), Orlando, FL, 1999.
[8] Dabija V. G., Tsujino K., & Nishida S. Theory formation in the decision trees domain. Journal of Japanese Society for Artificial Intelligence, 7 (3), 136-147, 1992.
[9] Esposito F., Malerba D., & Semeraro G. Decision tree pruning as search in the state space. In proceedings of the European Conference on Machine Learning ECML 93, pp. 165-184, P. B. Brazdil (Ed.), Lecture notes in artificial intelligence, N° 667, Springer-Verlag, Vienna, Austria, 1993.
[10] Fayyad U. M., & Irani K. B. The attribute selection problem in decision tree generation. In proceedings of the Tenth National Conference on Artificial Intelligence, pp. 104-110, Cambridge, MA: AAAI Press/MIT Press, 1992.
[11] Fayyad U. M., & Irani K. B. Multi-interval discretization of continuous-valued attributes for classification learning. In proceedings of the Thirteenth International Joint Conference on Artificial Intelligence IJCAI 93, pp. 1022-1027, Chambéry, France, 1993.
[12] Fayyad U. M. Branching on attribute values in decision tree generation. In proceedings of the Twelfth National Conference on Artificial Intelligence, pp. 601-606, AAAI Press/MIT Press, 1994.
[13] Fournier D., & Crémilleux B. Using impurity and depth for decision trees pruning. In proceedings of the 2nd International ICSC Symposium on Engineering of Intelligent Systems (EIS 2000), Paisley, UK, 2000.
[14] Gelfand S. B., Ravishankar C. S., & Delp E. J. An iterative growing and pruning algorithm for classification tree design. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(2), 163-174, 1991.
[15] Goodman R. M. F., & Smyth P. Information-theoretic rule induction. In proceedings of the Eighth European Conference on Artificial Intelligence ECAI 88, pp. 357-362, München, Germany, 1988.
[16] Hunt E. B., Marin J., & Stone P. J. Experiments in induction. New York: Academic Press, 1966.
[17] Kervahut T., & Potvin J. Y. An interactive-graphic environment for automatic generation of decision trees. Decision Support Systems 18, 117-134, 1996.
[18] Kononenko I. On biases in estimating multi-valued attributes. In proceedings of the Fourteenth International Joint Conference on Artificial Intelligence IJCAI 95, pp. 1034-1040, Montréal, Canada, 1995.
[19] Mingers J. An empirical comparison of pruning methods for decision-tree induction. Machine Learning 4, 227-243, 1989.
[20] Quinlan J. R. Induction of decision trees. Machine Learning 1, 81-106, 1986.
[21] Quinlan J. R., & Rivest R. L. Inferring decision trees using the minimum description length principle. Information and Computation 80(3), 227-248, 1989.
[22] Quinlan J. R. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[23] Quinlan J. R. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research 4, 77-90, 1996.
[24] Ragel A., & Crémilleux B. Treatment of Missing Values for Association Rules. In proceedings of the Second Pacific Asia Conference on KDD, PAKDD 98, pp. 258-270, X. Wu, R. Kotagiri & K. B. Korb (Eds.), Lecture notes in artificial intelligence, N° 1394, Springer-Verlag, Melbourne, Australia, 1998.
[25] Safavian S. R., & Landgrebe D. A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics 21(3), 660-674, 1991.
[26] Wallace C. S., & Patrick J. D. Coding decision trees. Machine Learning 11, 7-22, 1993.
[27] White A. P., & Liu W. Z. Bias in information-based measures in decision tree induction. Machine Learning 15, 321-329, 1994.

APPENDIX

Data file used to build the trees of Figure 1 (D denotes the class; Y1 and Y2 are the attributes). The 5200 examples are summarized below by ranges of example numbers.

Examples       D    Y1    Y2
1 - 2350       d1   y11   y22
2351 - 2500    d1   y12   y22
2501 - 2650    d2   y11   y21
2651 - 2700    d2   y12   y21
2701 - 2850    d3   y11   y22
2851 - 5200    d3   y12   y22

B. Crémilleux is Maître de Conférences at the Université de Caen, France.