DBMS, Data Mining Solutions Supplement

When a businessperson needs to make a decision based on several factors, a decision tree can help identify which factors to consider and how each factor has historically been associated with different outcomes of the decision. For example, in our credit risk case study (see the sidebar Predicting Credit Risk), we have data for each applicant's debt, income, and marital status. A decision tree creates a model as either a graphical tree or a set of text rules that can predict (classify) each applicant as a good or bad credit risk.

A decision tree is a model that is both predictive and descriptive. It is called a decision tree because the resulting model is presented in the form of a tree structure. (See Figure 1.) The visual presentation makes the decision tree model very easy to understand and assimilate. As a result, the decision tree has become a very popular data mining technique. Decision trees are most commonly used for classification (predicting what group a case belongs to), but can also be used for regression (predicting a specific value). The decision tree method encompasses a number of specific algorithms, including Classification and Regression Trees (CART), Chi-squared Automatic Interaction Detection (CHAID), C4.5, and C5.0 (from work by J. Ross Quinlan of RuleQuest Research Pty Ltd, in St. Ives, Australia, www.rulequest.com).

Decision trees graphically display the relationships found in data. Most products also translate the tree to text rules such as If Income = High and Years on job > 5 Then Credit risk = Good. In fact, decision tree algorithms are very similar to rule induction algorithms, which produce rule sets without a decision tree.

The primary output of a decision tree algorithm is the tree itself. The training process that creates the decision tree is usually called induction. Induction requires a small number of passes (generally far fewer than 100) through the training dataset. This makes the algorithm somewhat less efficient than Naïve-Bayes algorithms (see Naïve-Bayes and Nearest Neighbor), which require only one pass, but significantly more efficient than neural nets, which typically require a large number of passes, sometimes numbering in the thousands. To be more precise, the number of passes required to build a decision tree is no more than the number of levels in the tree. There is no predetermined limit to the number of levels, although the complexity of the tree, as measured by its depth and breadth, generally increases as the number of independent variables increases.

Understanding Decision Trees

Before we illustrate how tree induction works, let's take a look at the end product, the decision tree, to understand its structure and to see how we can use it to predict and understand. For our example we will use the data from a credit risk classification problem (see the sidebar Predicting Credit Risk). Keep in mind that the size of the dataset was limited purely for expository purposes and the number of instances we used is far too small to be realistic.

Each box in the tree in Figure 1 represents a node. The top node is called the root node. A decision tree grows from the root node, so you can think of the tree as growing upside down, splitting the data at each level to form new nodes. The resulting tree comprises many nodes connected by branches.
Nodes that are at the end of branches are called leaf nodes and play a special role when the tree is used for prediction. In Figure 1 each node contains information about the number of instances at that node, and about the distribution of dependent variable values (Credit Risk). The instances at the root node are all of the instances in the training set. This node contains 5 instances, of which 40 percent are Good risks and 60 percent are Poor risks. Below the root node (parent) is the first split that, in this case, splits the data into two new nodes (children) based on whether Income is High or Low. The rightmost node (Low Income) resulting from this split contains two instances, both of which are associated with Poor credit risk. Because all instances have the same value of the dependent variable (Credit Risk), this node is termed pure and will not be split further. The leftmost node in the first split contains three instances, 66.7 percent of which are Good. The leftmost node is then further split based on the value of Married (Yes or No), resulting in two more nodes which are each also pure. The order of the splits, Income first and then Married, is determined by an induction algorithm, which is discussed further in the "Tree Induction" section, below.

A tree that has only pure leaf nodes is called a pure tree, a condition that is not only unnecessary but is usually undesirable. Most trees are impure, that is, their leaf nodes contain cases with more than one outcome.

Once grown, a tree can be used for predicting a new case by starting at the root (top) of the tree and following a path down the branches until a leaf node is encountered. The path is determined by imposing the split rules on the values of the independent variables in the new instance. Consider the first row in the training set, for Joe. Because Joe has High income, follow the branch to the left. Because Joe is married, follow the tree down the branch to the right. At this point we have arrived at a leaf node, and the predicted value is the predominant value of the leaf node, or Good in this case. Checking the predictions for each of the other instances in the training set in the same way will reveal that this particular tree is 100 percent accurate on the training set.

A tree that is pure will always be 100 percent accurate on the training dataset, but that does not mean that it will be 100 percent accurate -- or even close to 100 percent -- on an independent test set. Most algorithms will use additional parameters during induction that determine whether or not to split a node, reducing the likelihood of a pure tree.

It is possible that all outcomes are equally present in a leaf node and that, therefore, such a leaf node has no predominant value. A prediction for a case arriving at such a node is completely dependent on the implementation and, if the implementation permits it, on settings that the user has elected. Some decision tree systems permit the prediction "unknown," others will default to the most likely value (i.e., the predominant value in the root node), and yet others will prune such nodes, effectively backtracking up the tree to use the predominant value in a predecessor node.

Navigating a tree to produce predicted values can become cumbersome as trees increase in size and complexity.
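For a tree as small as the one in Figure 1, though, the walk is easy to express in a few lines of code. The following Python sketch represents the tree as nested dictionaries and classifies Joe by walking from the root to a leaf. The field names and the classify function are our own illustration, not output from any product discussed in this article.

```python
# A minimal sketch of the Figure 1 credit-risk tree as nested dictionaries.
# Internal nodes name the variable they test; leaves carry the predominant
# Credit Risk value of the training instances that reach them.
FIGURE_1_TREE = {
    "split_on": "Income",
    "branches": {
        "High": {
            "split_on": "Married",
            "branches": {
                "Yes": {"predict": "Good"},   # pure leaf: 2 Good instances
                "No":  {"predict": "Poor"},   # pure leaf: 1 Poor instance (John)
            },
        },
        "Low": {"predict": "Poor"},           # pure leaf: 2 Poor instances
    },
}

def classify(tree, instance):
    """Walk from the root to a leaf, following the branch that matches the
    instance's value for each split variable, and return the leaf's value."""
    node = tree
    while "predict" not in node:
        node = node["branches"][instance[node["split_on"]]]
    return node["predict"]

# Joe from the training set has High income and is married; the walk goes to
# the High branch, then the Yes branch, and ends at the Good leaf.
joe = {"Income": "High", "Married": "Yes"}   # Debt is omitted: the tree never tests it
print(classify(FIGURE_1_TREE, joe))          # -> Good
```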
It is possible to derive a set of rules for a tree -- one rule for each leaf node -- simply by following the path between the root and that leaf node. The rules for the leaf nodes in Figure 1, taken left to right, are as follows:

IF Income = High AND Married = No THEN Risk = Poor
IF Income = High AND Married = Yes THEN Risk = Good
IF Income = Low THEN Risk = Poor

It is possible to reduce this set of rules to just two rules, one for Poor and one for Good, through judicious use of the OR connector. In fact, we can make the following general statements about rules and trees:

- There are exactly as many rules (using only AND) as there are leaf nodes.
- By using OR to combine certain rules, the total number of rules can be reduced so that there is exactly one rule for each possible value of the dependent variable.

Even when not used for prediction, the rules provide interesting descriptive information about the data. There are often additional interesting and potentially useful observations about the data that can be made after a tree has been induced. In the case of our sample data, the tree in Figure 1 reveals:

- Debt appears to have no role in determining Risk.
- People with Low Income are always a Poor Risk.
- Income is the most significant factor in determining risk.

We should also make some observations about our observations. Note first of all that the last observation above is only true to the extent that the induction algorithm tried to prioritize its splits by choosing the most significant split first. Second, note that these are observations about a sample. Clearly, one needs to be extra careful when generalizing to the larger population from a sample. Finally, we note that data mining, because it frequently analyzes information about people, can quickly lead into important ethical, legal, moral, and political issues when such rules are applied to the larger population.

Before we go into the specifics of the induction process, we need to explain a few other characteristics of trees. The tree in Figure 1 is called a binary tree because each split has two branches. Although there are some algorithms and products that will only create binary trees, there is no general restriction to trees of this type. Algorithms that can only produce binary trees can nevertheless model exactly the same relationships as algorithms that are not restricted to binary splits. The only difference is that the non-binary tree will be somewhat more compact because it will have fewer levels.

Decision trees impose certain restrictions on the data that is analyzed. First, decision trees permit only a single dependent variable, such as Credit Risk. If you wish to predict more than one dependent variable, each variable requires a separate model. Also, most decision tree algorithms require that continuous data be binned (grouped or converted to categorical data). There are a few algorithms that do not have this requirement and are capable of positioning a split anywhere within a continuum of values. All decision trees support classification, and some also support regression. In particular, the CART algorithm supports both and handles continuous variables directly.
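Before moving on to induction, note that leaf-node rules like the ones listed above translate directly into executable logic. The sketch below expresses the three Figure 1 rules as a small Python function; the function name and the collapsed OR form shown in the comments are our own illustration.

```python
def predict_risk(income, married):
    """The three leaf-node rules from Figure 1, expressed as code.
    Debt never appears because the tree found no role for it."""
    if income == "High" and married == "No":
        return "Poor"
    if income == "High" and married == "Yes":
        return "Good"
    if income == "Low":
        return "Poor"
    raise ValueError("unexpected attribute values")

# Collapsed with OR, the same logic reduces to one rule per outcome:
#   Risk = Good IF Income = High AND Married = Yes
#   Risk = Poor IF Income = Low OR (Income = High AND Married = No)
print(predict_risk("High", "Yes"))   # -> Good, as for Joe
```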
Tree Induction

Let's explore how the tree induction algorithm works. Most decision tree algorithms go through two phases: a tree growing (splitting) phase followed by a pruning phase.

The tree growing phase is an iterative process that involves splitting the data into progressively smaller subsets. Each iteration considers the data in only one node. The first iteration considers the root node, which contains all the data. Subsequent iterations work on derivative nodes that contain subsets of the data.

The algorithm begins by analyzing the data to find the independent variable (such as income, marital status, or debt) that, when used as a splitting rule, will result in nodes that are most different from each other with respect to the dependent variable (Credit Risk, in our example). There are several alternative ways to measure this difference. Some implementations have only one measure built in; others let the user choose which measure to use. We won't go into the various measures here, except to list some of the names that they go under: entropy, mutual information, gain ratio, gini, and chi-squared. Regardless of the measure used, all methods require a cross-tabulation between the dependent variable and each of the independent variables. Table 1 presents the cross-tabulation for the root node data in Figure 1.

Credit Risk   High Debt   Low Debt   High Income   Low Income   Married   Not Married
Good              1           1           2             0           2           0
Poor              1           2           1             2           2           1

Table 1. Cross-tabulation of the independent vs. dependent columns for the root node.

For this example, our own simplistic difference measure is to pick the split that has the largest number of instances on the diagonal of its cross-tabulation. With this measure, the first split is the split on Income, which has a total of 4 instances on the diagonal (2 for High Income/Good Risk plus 2 for Low Income/Poor Risk). Both of the other splits have only 3 on the diagonal. Our measure is less ad hoc than it seems: the split it chooses will have the fewest instances that deviate from the predominant dependent variable value in each node formed by the split.

One important characteristic of the tree splitting algorithm is that it is greedy. Greedy algorithms make decisions locally rather than globally. When deciding on a split at a particular node, a greedy algorithm does not look forward in the tree to see if another decision would produce a better overall result. While our example is too simple to illustrate this, in the general case it could very well be that there is an early split (a split nearer the root) that is not the best split relative to any local measure, but that, if used, would result in a tree with better overall accuracy.

Once a node is split, the same process is performed on the new nodes, each of which contains a subset of the data in the parent node. The variables are analyzed and the best split is chosen. This process is repeated until only nodes where no splits should be made remain.
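The diagonal measure is easy to state in code. The sketch below works directly from the Table 1 counts and scores each candidate split by summing, over its values, the count of the predominant risk class. The data structures and function name are our own illustration of the simplistic measure used in this article, not the criterion built into any particular product.

```python
# Cross-tabulations from Table 1: for each candidate split variable,
# counts of (split value, Credit Risk) pairs over the root-node instances.
CROSS_TABS = {
    "Debt":    {("High", "Good"): 1, ("Low", "Good"): 1,
                ("High", "Poor"): 1, ("Low", "Poor"): 2},
    "Income":  {("High", "Good"): 2, ("Low", "Good"): 0,
                ("High", "Poor"): 1, ("Low", "Poor"): 2},
    "Married": {("Yes", "Good"): 2, ("No", "Good"): 0,
                ("Yes", "Poor"): 2, ("No", "Poor"): 1},
}

def diagonal_count(cross_tab):
    """Sum, over each value of the split variable, the count of the
    predominant risk class among instances with that value -- the
    'largest number of instances on the diagonal' measure."""
    split_values = {value for value, _ in cross_tab}
    return sum(
        max(count for (value, _), count in cross_tab.items() if value == split_value)
        for split_value in split_values
    )

scores = {variable: diagonal_count(tab) for variable, tab in CROSS_TABS.items()}
print(scores)                        # {'Debt': 3, 'Income': 4, 'Married': 3}
print(max(scores, key=scores.get))   # Income is chosen for the first split
```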
When to Stop

At what point do you stop growing the tree? When do you know that no more splits should be made? We have seen that pure nodes are not split any further. While this provides a natural condition for stopping tree growth, there are reasons to stop splitting before the nodes are pure. For this reason, tree-building algorithms usually have several other stopping rules. These rules are usually based on several factors, including maximum tree depth, the minimum number of elements in a node considered for splitting, or its near equivalent, the minimum number of elements that must be in a new node. In most implementations the user can alter the parameters associated with these rules.

Why not build a tree to maximum depth? Would such a tree be pure in all nodes? The answer to the second question is "maybe." Such a tree would be pure only if there were no conflicting records in the training set. Two records are conflicting if they have the same values for all independent columns (Income, Marital Status, Debt, and so on), but different values in the dependent column (Credit Risk). Because the values of the independent columns are the same, conflicting records must always be in the same leaf node. There is no way to create a split to differentiate them unless a new variable is introduced. Because the conflicting records have different values for the dependent column, this node cannot be pure. If there are no conflicting records, then it is possible to build a pure tree.

Now let's consider the first question. Why not build a tree to maximum depth, so that all leaf nodes are either pure or contain conflicting records? Some algorithms, in fact, begin by building trees to their maximum depth. While such a tree can precisely predict all the instances in the training set (except conflicting records), the problem is that, more than likely, it has overfit the data. You can think of such a tree as functioning to find the record in the training set that most closely matches a new record, and then predicting the dependent variable based on the value found in that closest-matching record. Such a tree is too specific and will not find whatever general principles are at work.

Pruning Trees

After a data mining product grows a tree, a business analyst must explore the model. Exploring the tree model, even one that is grown with stopping rules, may reveal nodes or subtrees that are undesirable because of overfitting, or that contain rules the domain expert feels are inappropriate. Pruning is a common technique used to make a tree more general. Pruning removes splits and the subtrees created by them. In some implementations, pruning is controlled by user-configurable parameters that cause splits to be pruned because, for example, the computed difference between the resulting nodes falls below a threshold and is insignificant. With such algorithms, users will want to experiment to see which pruning parameters result in a tree that predicts best on a test dataset. Algorithms that build trees to maximum depth will automatically invoke pruning. In some products, users also have the ability to prune the tree interactively.

We can see the effects of pruning even in our simple example. What if marital status in fact has nothing to do with whether or not someone is a good risk? The split on marital status results from John's record (John has High income, but Married is No and he has Poor as the value for Risk). Maybe John turned out to be a poor credit risk because he is a gambler. But because we have no data on whether or not loan applicants were gamblers, it is not possible to build a model that takes this into account. The only data the model inducer has that differentiates John from the other instances in the node is marital status, so that is what it uses. If this second split is pruned (see Figure 2), then the model will predict Good for all people with high income, regardless of marital status. The node used to make this prediction, while not pure, has Good as the predominant value and, therefore, Good will be the predicted value for instances that end on this node.
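To see what the pruned model of Figure 2 does in practice, here is a minimal sketch in the same nested-dictionary style as the earlier navigation example. The distribution field and the comments are our own illustration.

```python
# The pruned tree of Figure 2: the Married split is gone, and the High
# Income node is left impure (2 Good, 1 Poor), so it predicts its
# predominant value, Good.
PRUNED_TREE = {
    "High": {"predict": "Good", "distribution": {"Good": 2, "Poor": 1}},
    "Low":  {"predict": "Poor", "distribution": {"Good": 0, "Poor": 2}},
}

def classify_pruned(instance):
    """Only one split remains, so prediction is a single lookup on Income
    followed by taking that node's predominant value."""
    return PRUNED_TREE[instance["Income"]]["predict"]

# John (High income, not married) was a Poor risk in the training data, so
# the pruned tree now misclassifies him -- the price of a more general model
# that no longer keys on marital status.
john = {"Income": "High", "Married": "No"}
print(classify_pruned(john))   # -> Good
```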
Given that some pruning is usually a good idea, how does the pruning algorithm determine where to prune the tree? There are several algorithms, but one that we find very appealing is to use a control sample with known and verified relationships between the independent and dependent variables. By comparing the performance of each node (as measured by its accuracy) to that of its subtree, it becomes obvious which splits need to be pruned to attain the highest overall accuracy on the control sample. This technique can be found in StarTree, the decision tree component of Darwin from Thinking Machines Corp. (See Figure 3.) After generating a tree to maximum depth, Darwin will generate a large number of subtrees in the pruning phase and then automatically compute the accuracy of each on the test dataset. An analyst can then choose a subtree that contains the fewest nodes yet has a very high accuracy. Darwin's approach is an effective way to eliminate overfitting with a decision tree.

Testing a Tree

Prior to integrating any decision tree into your business as a predictor, you must test and validate the model using an independent dataset. Once accuracy has been measured on an independent dataset and is determined to be acceptable, the tree (or its rules) is ready to be used as a predictor. Be sure to retest the tree periodically to ensure that it maintains the desired accuracy.

All Algorithms Are Not Alike

While all decision tree algorithms have basic elements in common, they have definite differences as well. Given the same training data, all algorithms will not necessarily produce the same tree or rule set. The distinguishing features between algorithms include:

- Target variables: Most tree algorithms require that the target (dependent) variable be categorical. Such algorithms require that continuous variables be binned for use with regression. The most notable exception is CART, which handles continuous target variables directly.
- Splits: Many algorithms support only binary splits, that is, each parent node can be split into at most two child nodes. Others generate more than two splits and produce a branch for each value of a categorical variable.
- Rule generation: Algorithms such as C4.5 and C5.0 include methods to generalize the rules associated with a tree, which removes redundancies. Others simply accumulate all the tests between the root node and the leaf node to produce the rules.
- Split measures: Different algorithms support different, and sometimes multiple, measures for selecting which variable to use to split at a particular node. Common split measures include the gain criterion, gain ratio criterion, gini criterion, chi-squared, and entropy.

Beyond the basic algorithmic differences, users will find that different implementations of the same algorithm provide additional useful features, which are too numerous to mention here.
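To give a feel for two of the split measures named in the last bullet, the sketch below computes gini impurity and entropy for the root node of Figure 1 (two Good and three Poor instances). The functions are our own minimal illustration using the standard formulas; individual products may normalize or combine these measures differently when scoring candidate splits.

```python
from math import log2

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy in bits: minus the sum of p * log2(p) over the classes present."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

root = [2, 3]                    # Good and Poor counts at the Figure 1 root node
print(round(gini(root), 3))      # 0.48  -- a pure node would score 0.0
print(round(entropy(root), 3))   # 0.971 -- a pure node would score 0.0
```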
Understanding the Output

One of the inherent benefits of a decision tree model is its ability to be understood by a broad user community. Presentation of a decision tree model in a graphical format, along with the ability to explore it interactively, has become a standard feature supported by many decision tree vendors. Figure 4 is an example of a decision tree visualizer from Angoss's KnowledgeSeeker. Decision tree output is often presented as a set of rules, which are more concise and, particularly when the tree is large, often easier to understand. Figure 5 shows the rules generated by the C5.0 algorithm in Integral Solutions Ltd.'s (ISL) Clementine. Knowledge Discovery Workbench (KDW) from NCR also incorporates Clementine technology.

Decision trees have obvious value as both predictive and descriptive models. We have seen that prediction can be done on a case-by-case basis by navigating the tree. More often, prediction is accomplished by processing multiple new cases through the tree or rule set automatically and generating an output file with the predicted value or class appended to the record for each case. Many implementations offer the option of exporting the rules to be used externally or embedded in other applications.

The distinctive output from a decision tree algorithm makes it easy to recognize its descriptive or exploratory value. In an exploratory mode the user is interested in outputs that facilitate insight into the relationships between independent and dependent variables. In recognition of the descriptive and exploratory value of decision trees, some OLAP tools such as BusinessObjects have integrated decision tree modules to facilitate data investigation. Such tools have no predictive component at all. A common use of decision trees in this descriptive-only mode is the identification of market segments.

Future Trends

Decision trees have become very popular classification tools. Many users find decision trees easy to use and understand. As a result, users more easily trust decision tree models than they do "black box" models, such as those produced by neural networks (see Neural Networks).

Research to improve decision tree algorithms continues, and products are rapidly evolving. Within the last year, C5.0 was released by RuleQuest Research (for which J. Ross Quinlan, its author, claims improved speed and quality of rule generation over its predecessor, C4.5). C5.0 implements "boosting," a technique that combines multiple decision trees into a single classifier. Silicon Graphics, in the 2.0 release of MineSet, has an Option Tree algorithm, in which multiple trees (or subtrees) coexist as "options." Each option makes a prediction, and then the options vote for a consensus prediction. The C5.0 algorithm is also used in the latest release of Clementine. Boosting and option trees are techniques to get around suboptimization problems resulting from the "greedy" aspect of the decision tree algorithm. Other recent work includes improved handling of continuous variables and oblique trees, which are trees with multivariate splits (see On Growing Better Decision Trees from Data, a Ph.D. thesis by Sreerama K. Murthy, on the Web at www.cs.jhu.edu/~murthy/thesis/home.html).

In addition to keeping pace with new algorithms, vendors are expanding their interfaces to include tree visualizers and other drill-down ties to the data that facilitate interactive exploration. See Decision Trees Products for more information.

Figure 1. A decision tree for the credit risk dataset (excerpted from simple output generated using KnowledgeSeeker from Angoss Software International Ltd.).

Figure 2. The original decision tree, pruned to remove the Married? split and subtree.
Figure 3. Darwin, from Thinking Machines Corp., uses pruning to generate a number of subtrees that are then compared on the test dataset.

Figure 4. The decision tree visualizer in KnowledgeSeeker from Angoss Software includes the ability to color-code dependent variable values and a tree map (upper right-hand corner) to aid in navigating large trees.

Figure 5. Rules generated by the C5.0 algorithm in Integral Solutions Ltd.'s Clementine. The first number inside the parentheses is a count of the number of instances. The second number measures the purity of the node.

Estelle Brand (estelle@xore.com) and Rob Gerritsen (rob@xore.com) are founders of Exclusive Ore Inc., based in Blue Bell, Pennsylvania, a consulting and training company specializing in data mining. During the last two years they have used more than a dozen data mining products. Their database management systems experience dates back to the dark ages. For more information about Exclusive Ore and data mining, see www.xore.com.