DBMS, Data Mining Solutions Supplement
When a businessperson needs to make a decision based on several factors, a decision tree can help
identify which factors to consider and how each factor has historically been associated with different
outcomes of the decision. For example, in our credit risk case study (See the sidebar Predicting Credit
Risk), we have data for each applicant's debt, income, and marital status. A decision tree creates a
model as either a graphical tree or a set of text rules that can predict (classify) each applicant as a good
or bad credit risk.
A decision tree is a model that is both predictive and descriptive. It is called a decision tree because the
resulting model is presented in the form of a tree structure. (See Figure 1.) The visual presentation
makes the decision tree model very easy to understand and assimilate. As a result, the decision tree has
become a very popular data mining technique. Decision trees are most commonly used for classification
(predicting what group a case belongs to), but can also be used for regression (predicting a specific
value).
The decision tree method encompasses a number of specific algorithms, including Classification and
Regression Trees (CART), Chi-squared Automatic Interaction Detection (CHAID), C4.5 and C5.0 (from
work by J. Ross Quinlan of Rulequest Research Pty Ltd, in St. Ives, Australia, www.rulequest.com).
Decision trees graphically display the relationships found in data. Most products also translate the tree
into text rules such as If Income = High and Years on job > 5 Then Credit risk = Good. In fact, decision
tree algorithms are very similar to rule induction algorithms, which produce rule sets without a decision
tree.
The primary output of a decision tree algorithm is the tree itself. The training process that creates the
decision tree is usually called induction. Induction requires a small number of passes (generally far
fewer than 100) through the training dataset. This makes the algorithm somewhat less efficient than
Naïve-Bayes algorithms (see Naïve-Bayes and Nearest Neighbor), which require only one pass, but
significantly more efficient than neural nets, which typically require a large number of passes,
sometimes numbering in the thousands. To be more precise, the number of passes required to build a
decision tree is no more than the number of levels in the tree. There is no predetermined limit to the
number of levels, although the complexity of the tree as measured by the depth and breadth of the tree
generally increases as the number of independent variables increases.
Understanding Decision Trees
Before we illustrate how tree induction works, let's take a look at the end product, the decision tree, to
understand its structure and to see how we can use it to predict and understand. For our example we will
use the data from a credit risk classification problem (see the sidebar Predicting Credit Risk). Keep in
mind that the size of the dataset was limited purely for expository purposes and the number of instances
we used is far too small to be realistic.
Each box in the tree in Figure 1 represents a node. The top node is called the root node. A decision tree
grows from the root node, so you can think of the tree as growing upside down, splitting the data at each
level to form new nodes. The resulting tree comprises many nodes connected by branches. Nodes that
are at the end of branches are called leaf nodes and play a special role when the tree is used for
prediction.
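
To make this structure concrete, here is a minimal Python sketch of a node; the field and method names are our own illustrative choices and are not taken from any product discussed in this article.

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class TreeNode:
    # Distribution of the dependent variable among the training instances
    # that reach this node, e.g. {"Good": 2, "Poor": 1}.
    class_counts: Dict[str, int]
    # Independent variable tested at this node, or None for a leaf node.
    split_variable: Optional[str] = None
    # Child nodes, keyed by branch value, e.g. {"High": <node>, "Low": <node>}.
    children: Dict[str, "TreeNode"] = field(default_factory=dict)

    def is_leaf(self) -> bool:
        return not self.children

    def predominant_value(self) -> str:
        # The most common dependent-variable value among instances at this node.
        return max(self.class_counts, key=self.class_counts.get)

We will reuse this sketch in the other code fragments below.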
In Figure 1 each node contains information about the number of instances at that node, and about the
distribution of dependent variable values (Credit Risk). The instances at the root node are all of the
instances in the training set. This node contains 5 instances, of which 40 percent are Good risks and 60
percent are Poor risks. Below the root node (parent) is the first split that, in this case, splits the data into
two new nodes (children) based on whether Income is High or Low.
The rightmost node (Low Income) resulting from this split contains two instances, both of which are
associated with Poor credit risk. Because all instances have the same value of the dependent variable
(Credit Risk), this node is termed pure and will not be split further. The leftmost node in the first split
contains three instances, 66.7 percent of which are Good. The leftmost node is then further split based
on the value of Married (Yes or No), resulting in two more nodes which are each also pure. The order of
the splits, Income first and then Married, is determined by an induction algorithm, which is discussed
further in the "Tree Induction" section, below.
A tree that has only pure leaf nodes is called a pure tree, a condition that is not only unnecessary but is
usually undesirable. Most trees are impure, that is, their leaf nodes contain cases with more than one
outcome.
Once grown, a tree can be used for predicting a new case by starting at the root (top) of the tree and
following a path down the branches until a leaf node is encountered. The path is determined by
imposing the split rules on the values of the independent variables in the new instance. Consider the first
row in the training set for Joe. Because Joe has High income, follow the branch to the left. Because Joe
is married, follow the tree down the branch to the right. At this point we have arrived at a leaf node, and
the predicted value is the predominant value of the leaf node, or Good in this case. Checking the
predictions for each of the other instances in the training set in the same way will reveal that this
particular tree is 100 percent accurate on the training set.
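
Assuming the TreeNode sketch shown earlier, the traversal just described takes only a few lines of Python. This minimal version assumes every branch value seen at prediction time also appeared during training; real products handle the missing-branch case, as discussed below.

def predict(node, instance):
    # `instance` is a dict of independent-variable values,
    # e.g. {"Income": "High", "Married": "Yes", "Debt": "High"}.
    while not node.is_leaf():
        branch_value = instance[node.split_variable]
        node = node.children[branch_value]
    # The predicted value is the predominant value of the leaf node.
    return node.predominant_value()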
A tree that is pure will always be 100 percent accurate on the training dataset, but that does not mean
that it will be 100 percent accurate, or even close to 100 percent, on an independent test set. Most
algorithms will use additional parameters during induction that determine whether or not to split a node,
reducing the likelihood of a pure tree.
It is possible that all outcomes are equally present in a leaf node and that, therefore, such a leaf node has
no predominant value. A prediction for a case arriving at such a node is completely dependent on the
implementation, and if the implementation permits it, on settings that the user has elected. Some
decision tree systems permit the prediction "unknown," others will default to the most likely value (i.e.,
the predominant value in the root node), and yet others will prune such nodes, effectively backtracking
up the tree to use the predominant value in a predecessor node.
Navigating a tree to produce predicted values can become cumbersome as trees increase in size and
complexity. It is possible to derive a set of rules for a tree, one rule for each leaf node, simply by
following the path between the root and that leaf node. The rules for the leaf nodes in Figure 1, taken left
to right, are as follows:
IF Income = High AND Married = No THEN Risk = Poor
IF Income = High AND Married = Yes THEN Risk = Good
IF Income = Low THEN Risk = Poor
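
Deriving these rules mechanically is straightforward. Here is a sketch, again assuming the TreeNode structure from earlier, that accumulates the tests along each root-to-leaf path and produces one AND-rule per leaf.

def extract_rules(node, conditions=None):
    # Returns one (conditions, predicted_value) pair per leaf node.
    conditions = conditions or []
    if node.is_leaf():
        return [(conditions, node.predominant_value())]
    rules = []
    for value, child in node.children.items():
        test = f"{node.split_variable} = {value}"
        rules.extend(extract_rules(child, conditions + [test]))
    return rules

# Printed as text, each rule reads like:
#   IF Income = High AND Married = No THEN Risk = Poor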
It is possible to reduce this set of rules to just two rules, one for Poor and one for Good through
judicious use of the OR connector. In fact, we can make the following general statements about rules
and trees:
- There are exactly as many rules (using only AND) as there are leaf nodes.
- By using OR to combine certain rules, the total number of rules can be reduced so that there is exactly one rule for each possible value of the dependent variable.
Even when not used for prediction, the rules provide interesting descriptive information about the data.
There are often additional interesting and potentially useful observations about the data that can be made
after a tree has been induced. In the case of our sample data, the tree in Figure 1 reveals:
- Debt appears to have no role in determining Risk.
- People with Low Income are always a Poor Risk.
- Income is the most significant factor in determining risk.
We should also make some observations about our observations. Note first of all that the last
observation above is only true to the extent that the induction algorithm tried to prioritize its splits by
choosing the most significant split first. Second, note that these are observations about a sample.
Clearly, one needs to be extra careful when generalizing to the larger population from a sample. Finally,
we note that data mining, because it frequently analyzes information about people, can quickly lead into
important ethical, legal, moral, and political issues when such rules are applied to the larger population.
Before we go into the specifics of the induction process, we need to explain a few other characteristics
of trees. The tree in Figure 1 is called a binary tree because each split has two branches. Although there
are some algorithms and products that will only create binary trees, there is no general restriction to trees
of this type. Algorithms that can only produce binary trees can nevertheless model exactly the same
relationships as algorithms that are not restricted to binary splits. The only difference is that the
nonbinary tree will be somewhat more compact because it will have fewer levels.
Decision trees impose certain restrictions on the data that is analyzed. First, decision trees permit only a
single dependent variable such as Credit Risk. If you wish to predict more than one dependent variable,
each variable requires a separate model. Also, most decision tree algorithms require that continuous data
be binned (grouped or converted to categorical data). There are a few algorithms that do not have this
requirement, and are capable of positioning a split anywhere within a continuum of values. All decision
trees support classification and some also support regression. In particular, the CART algorithm
supports both and handles continuous variables directly.
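
For algorithms that do require binning, the conversion can be as simple as dividing the range of a continuous column into a fixed number of equal-width intervals. The sketch below is one such approach; the sample values and the three bin labels are purely illustrative.

def equal_width_bins(values, labels=("Low", "Medium", "High")):
    # Assign each numeric value to one of len(labels) equal-width bins.
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels) or 1  # guard against a zero-width range
    return [labels[min(int((v - lo) / width), len(labels) - 1)] for v in values]

# equal_width_bins([1200, 4800, 9500, 300, 7700])
# -> ['Low', 'Medium', 'High', 'Low', 'High']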
Tree Induction
Let's explore how the tree induction algorithm works. Most decision tree algorithms go through two
phases: a tree growing (splitting) phase followed by a pruning phase.
The tree growing phase is an iterative process which involves splitting the data into progressively
smaller subsets. Each iteration considers the data in only one node. The first iteration considers the root
node that contains all the data. Subsequent iterations work on derivative nodes that will contain subsets
of the data.
The algorithm begins by analyzing the data to find the independent variable (such as income, marital
status, or debt) that when used as a splitting rule will result in nodes that are most different from each
other with respect to the dependent variable (Credit Risk, in our example). There are several alternative
ways to measure this difference. Some implementations have only one measure built in, others let the
user choose which measure to use. We won't go into the various measures here, except to list some of
the names that they go under: entropy, mutual info, gain ratio, gini, and chi-squared.
Regardless of the measurement used, all methods require a cross-tabulation between the dependent
variable and each of the independent variables. Table 1 presents the cross-tabulation for the root node
data in Figure 1.
Predicted Risk   High Debt   Low Debt   High Income   Low Income   Married   Not Married
Good                 1           1            2             0          2            0
Poor                 1           2            1             2          2            1

Table 1. Cross-tabulation of the independent vs. dependent columns for the root node.
For this example, our own simplistic difference measure is to pick the split that has the largest number
of instances on the diagonal of its cross-tabulation. With this measure, the first split is the split on
Income, which has a total of 4 instances on the diagonal (2 for High Income/Good Risk plus 2 for Low
Income/Poor Risk). Both of the other splits have only 3 on the diagonal. Our measure is less ad hoc than
it seems: the split it chooses is the one with the fewest instances that deviate from the predominant
dependent variable value in each node formed by the split.
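
The following sketch computes this measure directly: for each candidate variable it groups the instances by that variable's values and sums the count of the predominant Risk value in each group. The five rows are our own reconstruction; they match the counts in Table 1, but the Debt values assigned to individual applicants are assumptions made only to reproduce those totals.

from collections import Counter

training_set = [
    {"Income": "High", "Married": "Yes", "Debt": "High", "Risk": "Good"},  # Joe
    {"Income": "High", "Married": "Yes", "Debt": "Low",  "Risk": "Good"},
    {"Income": "High", "Married": "No",  "Debt": "Low",  "Risk": "Poor"},  # John
    {"Income": "Low",  "Married": "Yes", "Debt": "High", "Risk": "Poor"},
    {"Income": "Low",  "Married": "Yes", "Debt": "Low",  "Risk": "Poor"},
]

def split_score(rows, variable, target="Risk"):
    # Sum, over each child node the split would create, of the count of
    # the predominant target value in that child.
    children = {}
    for row in rows:
        children.setdefault(row[variable], []).append(row[target])
    return sum(Counter(vals).most_common(1)[0][1] for vals in children.values())

scores = {var: split_score(training_set, var) for var in ("Debt", "Income", "Married")}
# -> {'Debt': 3, 'Income': 4, 'Married': 3}; the split on Income wins, as in the text.
best_split = max(scores, key=scores.get)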
One important characteristic of the tree splitting algorithm is that it is greedy. Greedy algorithms make
decisions locally rather than globally. When deciding on a split at a particular node, a greedy algorithm
does not look forward in the tree to see if another decision would produce a better overall result. While
our example is too simple to illustrate this, in the general case it could very well be that there is an early
split (a split nearer the root) that is not the best split relative to any local measure, but if used would
result in a tree with better overall accuracy.
Once a node is split, the same process is performed on the new nodes, each of which contains a subset of
the data in the parent node. The variables are analyzed and the best split is chosen. This process is
repeated until only nodes where no splits should be made remain.
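
Putting the earlier pieces together, the growing phase can be sketched as a short recursive function. This version reuses the TreeNode and split_score sketches above and stops only at pure nodes or when it runs out of variables; stopping rules and pruning, discussed next, are omitted.

from collections import Counter

def grow_tree(rows, variables, target="Risk"):
    counts = Counter(row[target] for row in rows)
    node = TreeNode(class_counts=dict(counts))
    if len(counts) == 1 or not variables:   # pure node, or nothing left to split on
        return node
    best = max(variables, key=lambda v: split_score(rows, v, target))
    node.split_variable = best
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        remaining = [v for v in variables if v != best]
        node.children[value] = grow_tree(subset, remaining, target)
    return node

# With the reconstructed rows above, grow_tree(training_set, ["Debt", "Income", "Married"])
# splits on Income first and then on Married, as in Figure 1.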
When to Stop
At what point do you stop growing the tree? When do you know that no more splits should be made?
We have seen before that pure nodes are not split any further. While this provides a natural condition for
stopping tree growth, there are reasons to stop splitting before the nodes are pure. For this reason,
tree-building algorithms usually have several other stopping rules. These rules are usually based on several
factors, including maximum tree depth, the minimum number of elements in a node considered for splitting,
or its near equivalent, the minimum number of elements that must be in a new node. In most
implementations the user can alter the parameters associated with these rules.
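
A sketch of how such stopping rules might be expressed follows; the parameter names and default values are illustrative only and are not those of any particular product.

def should_stop(rows, depth, target="Risk",
                max_depth=5, min_split_size=10, min_child_size=5,
                proposed_children=None):
    # Return True if this node should become a leaf rather than being split.
    if depth >= max_depth:
        return True
    if len(rows) < min_split_size:
        return True
    if len({row[target] for row in rows}) == 1:   # pure node
        return True
    if proposed_children and any(len(child) < min_child_size for child in proposed_children):
        return True
    return False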
Why not build a tree to maximum depth? Would such a tree be pure in all nodes? The answer to the
second question is "maybe." Such a tree would be pure only if there were no conflicting records in the
training set. Two records are conflicting if they have the same values for all independent columns
(Income, Marital Status, Debt, etc.), but different values in the dependent column (Credit Risk). Because
the values of the independent columns are the same, conflicting records must always be in the same leaf
node. There is no way to create a split to differentiate them unless a new variable is introduced. Because
the conflicting records have different values for the dependent column, this node cannot be pure. If there
are no conflicting records, then it is possible to build a pure tree.
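
Checking a training set for conflicting records is a simple grouping exercise; the helper below is hypothetical and not part of any product's interface.

from collections import defaultdict

def conflicting_groups(rows, independents=("Income", "Married", "Debt"), target="Risk"):
    # Group rows by their independent values; report groups with more than one target value.
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[v] for v in independents)].add(row[target])
    return [key for key, targets in groups.items() if len(targets) > 1]

# An empty result means a pure tree is at least possible for this training set.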
Now let's consider the first question. Why not build a tree to maximum depth, so that all leaf nodes are
either pure, or contain conflicting records? Some algorithms, in fact, begin by building trees to their
maximum depth. While such a tree can precisely predict all the instances in the training set (except
conflicting records), the problem with such a tree is that, more than likely, it has overfit the data. You
can think of such a tree as simply finding the record in the training set that most closely matches a new
record and then predicting the dependent variable from the value found in that matching record. Such a
tree is too specific and will not find whatever general principles are at work.
Pruning Trees
After a data mining product grows a tree, a business analyst must explore the model. Exploring the tree
model, even one that is grown with stopping rules, may reveal nodes or subtrees that are undesirable
because of overfitting, or may contain rules that the domain expert feels are inappropriate. Pruning is a
common technique used to make a tree more general. Pruning removes splits and the subtrees created by
them. In some implementations, pruning is controlled by user-configurable parameters that cause splits
to be pruned because, for example, the computed difference between the resulting nodes falls below a
threshold and is insignificant. With such algorithms, users will want to experiment to see which pruning
rule parameters result in a tree that predicts best on a test dataset. Algorithms that build trees to
maximum depth will automatically invoke pruning. In some products users also have the ability to prune
the tree interactively.
We can see the effects of pruning even in our simple example. What if marital status in fact has nothing
to do with whether or not someone is a good risk? The split on marital status results from John's record
(John has High income, but Married is No and he has Poor as the value for Risk). Maybe John turned
out to be a poor credit risk because he is a gambler. But because we have no data on whether or not loan
applicants were gamblers, it is not possible to build a model taking this into account. The only data that
the model inducer has that differentiates John from the other instances in the node is marital status, so
that is what it uses. If this second split is pruned (See Figure 2), then the model will predict Good for all
people with high income regardless of marital status. The node used to make this prediction, while not
pure, has Good as the predominant value and, therefore, Good will be the predicted value for instances
that end on this node.
Given that some pruning is usually a good idea, how does the pruning algorithm determine where to
prune the tree? There are several algorithms, but one that we find very appealing is to use a control
sample with known and verified relationships between the independent and dependent variables. By
comparing the performance of each node (as measured by its accuracy) to its subtree, it will be obvious
which splits need to be pruned to attain the highest overall accuracy on the control sample.
This technique can be found in StarTree, the decision tree component of Darwin from Thinking
Machines Corp. (See Figure 3.) After generating a tree to maximum depth, Darwin will generate a large
number of subtrees in the pruning phase and then automatically compute the accuracy of each on the test
dataset. An analyst can then choose a subtree that contains the fewest number of nodes, yet has a very
high accuracy. Darwin's approach is an effective way to eliminate overfitting with a decision tree.
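
The general idea of comparing a node's accuracy with that of its subtree on a held-out sample can be sketched as follows. This is a simplified, reduced-error style pruning pass that reuses the TreeNode and predict sketches from earlier; it is not the specific algorithm implemented in Darwin or any other product, and it assumes every branch value in the control sample was seen during training.

def accuracy(node, rows, target="Risk"):
    return sum(predict(node, row) == row[target] for row in rows) / len(rows)

def prune(node, control_rows, target="Risk"):
    # Work bottom-up: collapse a subtree to a leaf whenever the leaf predicts
    # the control sample at least as accurately as the subtree does.
    if node.is_leaf() or not control_rows:
        return node
    for value, child in node.children.items():
        subset = [r for r in control_rows if r[node.split_variable] == value]
        prune(child, subset, target)
    as_leaf = TreeNode(class_counts=node.class_counts)
    if accuracy(as_leaf, control_rows, target) >= accuracy(node, control_rows, target):
        node.children = {}         # the split did not help on the control sample
        node.split_variable = None
    return node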
Testing a Tree
Prior to integrating any decision tree into your business as a predictor, you must test and validate the
model using an independent dataset. Once accuracy has been measured on an independent dataset and is
determined to be acceptable, the tree (or its rules) is ready to be used as a predictor. Be sure to retest the
tree periodically to ensure that it maintains the desired accuracy.
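
In code, that final check can be as simple as measuring accuracy on rows that were never used for induction or pruning; this reuses the predict sketch from earlier, and the acceptance threshold is an arbitrary placeholder.

def test_model(root, test_rows, target="Risk", required_accuracy=0.75):
    correct = sum(predict(root, row) == row[target] for row in test_rows)
    observed = correct / len(test_rows)
    return observed, observed >= required_accuracy  # rerun periodically with fresh data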
All Algorithms Are Not Alike
While all decision tree algorithms have basic elements in common, they have definite differences as
well. Given the same training data all algorithms will not necessarily produce the same tree or rule set.
The distinguishing features between algorithms include:
- Target Variables: Most tree algorithms require that the target (dependent) variable be categorical. Such algorithms require that continuous target variables be binned for use with regression. The most notable exception to this is CART, which handles continuous target variables directly.
- Splits: Many algorithms support only binary splits, that is, each parent node can be split into at most two child nodes. Others generate more than two splits and produce a branch for each value of a categorical variable.
- Rule Generation: Algorithms such as C4.5 and C5 include methods to generalize the rules associated with a tree, removing redundancies. Others simply accumulate all the tests between the root node and the leaf node to produce the rules.
- Split Measures: Different algorithms support different, and sometimes multiple, measures for selecting which variable to use to split at a particular node. Common split measures include the gain criterion, gain ratio criterion, gini criterion, chi-squared, and entropy (two of these are sketched after this list).
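
As promised above, here is a sketch of two of the node-impurity measures these criteria are built from; the formulas are the standard ones and are not tied to any particular product. A splitting criterion typically compares the parent node's impurity with the weighted impurity of the children a candidate split would create.

from math import log2

def gini(class_counts):
    # Gini impurity: 1 - sum(p_i^2) over the class proportions in a node.
    total = sum(class_counts.values())
    return 1.0 - sum((n / total) ** 2 for n in class_counts.values())

def entropy(class_counts):
    # Entropy: -sum(p_i * log2(p_i)); zero for a pure node.
    total = sum(class_counts.values())
    return -sum((n / total) * log2(n / total) for n in class_counts.values() if n)

# For the High Income node in Figure 1 (two Good, one Poor):
# round(gini({"Good": 2, "Poor": 1}), 3) -> 0.444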
Beyond the basic algorithmic differences, users will find that different implementations of the same
algorithm provide additional useful features which are too numerous to mention here.
Understanding the Output
One of the inherent benefits of a decision tree model is its ability to be understood by a broad user
community. Presentation of a decision tree model in a graphical format along with the ability to
interactively explore it have become standard features supported by many decision tree vendors. Figure
4 is an example of a decision tree visualizer from Angoss's KnowledgeSeeker.
Decision tree output is often presented as a set of rules which are more concise and, particularly when
the tree is large, are often easier to understand. Figure 5 shows the rules generated by the C5 algorithm
in Integral Solutions Ltd.'s (ISL) Clementine. Knowledge Discovery Workbench (KDW) from NCR also
incorporates Clementine technology.
Decision trees have obvious value as both predictive and descriptive models. We have seen that
prediction can be done on a case-by-case basis by navigating the tree. More often, prediction is
accomplished by processing multiple new cases through the tree or rule set automatically and generating
an output file with the predicted value or class appended to the record for each case. Many
implementations offer the option of exporting the rules to be used externally or embedded in other
applications.
The distinctive output from a decision tree algorithm makes it easy to recognize its descriptive or
exploratory value. In an exploratory mode the user is interested in outputs that facilitate insight about
relationships between independent and dependent variables. In recognition of the descriptive and
exploratory value of decision trees, some OLAP tools such as BusinessObjects have integrated decision
tree modules to facilitate data investigation. Such tools have no predictive component at all. A common
use of decision trees in this descriptive-only mode is the identification of market segments.
Future Trends
Decision trees have become very popular classification tools. Many users find decision trees easy to use
and understand. As a result, users more easily trust decision tree models than they do "black box"
models, such as those produced by neural networks (See Neural Networks).
Research to improve decision tree algorithms continues, and products are rapidly evolving. Within the
last year, C5.0 was released by RuleQuest Research (for which J. Ross Quinlan, its author, claims
improved speed and quality of rule generation over its predecessor, C4.5). C5.0 implements "boosting," a
technique that combines multiple decision trees into a single classifier. Silicon Graphics, in the 2.0
release of MineSet, has an Option Tree algorithm, in which multiple trees (or subtrees) coexist as
"options." Each option makes a prediction, and then the options vote for a consensus prediction. The
C5.0 algorithm is also used in the latest release of Clementine. Boosting and option trees are techniques
to get around suboptimization problems resulting from the "greedy" aspect of the decision tree
algorithm. Other recent work includes improved handling of continuous variables and oblique trees
which are trees with multivariate splits (see On Growing Better Decision Trees from Data, a Ph.D.
thesis by Sreerama K. Murthy, on the Web at www.cs.jhu.edu/~murthy/thesis/home.html).
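
The voting idea behind option trees (and, in simplified form, behind boosting) can be sketched in a few lines, reusing the predict function from earlier; real boosting additionally reweights training instances and weights each tree's vote, which this sketch ignores.

from collections import Counter

def ensemble_predict(trees, instance):
    # Each tree votes; the consensus (most common) prediction wins.
    votes = Counter(predict(tree, instance) for tree in trees)
    return votes.most_common(1)[0][0]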
In addition to keeping pace with new algorithms, vendors are expanding their interfaces to include tree
visualizers and other drill-down ties to the data that facilitate interactive exploration.
See Decision Trees Products for more information.
Figure 1. A decision tree for the credit risk dataset (excerpted from simple output generated using
KnowledgeSeeker from Angoss Software International Ltd.).
Figure 2. The original decision tree, pruned to remove the Married? split and subtree.
Figure 3. Darwin, from Thinking Machines Corp., uses pruning to generate a number of subtrees that
are then compared on the test dataset.
Figure 4. The decision tree visualizer in KnowledgeSeeker from Angoss Software includes the ability to
color-code dependent variable values and a tree map (upper right-hand corner) to aid in navigating large trees.
Figure 5. Rules generated by the C5.0 algorithm in Integral Solutions Ltd.'s Clementine. The first
number inside the parentheses is a count of the number of instances. The second number measures the
purity of the node.
Estelle Brand (estelle@xore.com) and Rob Gerritsen (rob@xore.com) are founders of Exclusive Ore
Inc., based in Blue Bell, Pennsylvania, which is a consulting and training company specializing in data
mining. During the last two years they have used more than a dozen data mining products. Their
database management systems experience dates back to the dark ages. For more information about
Exclusive Ore and data mining, see www.xore.com.