Data Mining and Knowledge Discovery in Business Databases

Decision Trees and their Pruning
CART (Classification and Regression Trees):
Key Parts of Tree-Structured Data Analysis
• Tree growing
  • Splitting rules to generate the tree
  • Stopping criteria: how far to grow?
  • Missing values: handled using surrogate splits (substitutes)
• Tree pruning
  • Trimming off parts of the tree that don’t work
  • Ordering the nodes of a large tree by their contribution to tree accuracy
    … which nodes come off first?
• Optimal tree selection
  • Deciding on the best tree after growing and pruning
  • Balancing simplicity against accuracy
Maximal Tree Example
Stopping criteria for growing the tree
• All instances in the node belong to the same class
• The maximum tree depth has been reached
• The size of the data in the node is below a threshold (e.g. 5% of the original dataset)
• The best splitting criterion is below a threshold
• …
How to Address Overfitting
• Pre-Pruning (Early Stopping Rule)
  • Stop the algorithm before it becomes a fully grown tree
  • Typical stopping conditions for a node:
    • Stop if all instances belong to the same class
    • Stop if all the attribute values are the same
  • More restrictive conditions:
    • Stop if the number of instances is less than some user-specified threshold
    • Stop if the class distribution of instances is independent of the available features (e.g., using the χ² test)
    • Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)
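Several of these pre-pruning rules map directly onto hyperparameters of scikit-learn’s DecisionTreeClassifier, a CART-style learner; the dataset and the particular threshold values below are only illustrative, and the χ²-independence test has no direct equivalent there.

    from sklearn.datasets import load_breast_cancer
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    pre_pruned = DecisionTreeClassifier(
        max_depth=5,                 # hard limit on tree depth
        min_samples_split=20,        # do not split nodes holding too few instances
        min_impurity_decrease=0.01,  # require the best split to reduce impurity (Gini) by at least this much
        random_state=0,
    ).fit(X, y)

    print("nodes:", pre_pruned.tree_.node_count, "leaves:", pre_pruned.get_n_leaves())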
How to Address Overfitting…
• Post-pruning
  • Grow the decision tree to its entirety
  • Trim the nodes of the decision tree in a bottom-up fashion
  • If the generalization error improves after trimming, replace the sub-tree by a leaf node
    • The class label of the leaf node is determined from the majority class of instances in the sub-tree
  • MDL can also be used for post-pruning
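A self-contained sketch of this idea in Python, using a held-out validation set as the estimate of generalization error (CART’s cost-complexity pruning and MDL-based pruning differ in how that estimate is obtained); the Node structure and function names are illustrative.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Node:
        majority: int                  # majority class of the training instances in this sub-tree
        feature: Optional[int] = None  # split feature index; None means the node is a leaf
        threshold: float = 0.0
        left: Optional["Node"] = None
        right: Optional["Node"] = None

    def predict(node, x):
        while node.feature is not None:
            node = node.left if x[node.feature] <= node.threshold else node.right
        return node.majority

    def prune(node, X_val, y_val):
        """Bottom-up pruning; X_val, y_val are the validation instances that reach this node."""
        if node.feature is None or not y_val:
            return node
        goes_left = [x[node.feature] <= node.threshold for x in X_val]
        node.left = prune(node.left,
                          [x for x, g in zip(X_val, goes_left) if g],
                          [t for t, g in zip(y_val, goes_left) if g])
        node.right = prune(node.right,
                           [x for x, g in zip(X_val, goes_left) if not g],
                           [t for t, g in zip(y_val, goes_left) if not g])
        subtree_errors = sum(predict(node, x) != t for x, t in zip(X_val, y_val))
        leaf_errors = sum(node.majority != t for t in y_val)
        # Replace the sub-tree by a leaf if the estimated error does not get worse
        return Node(majority=node.majority) if leaf_errors <= subtree_errors else node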
CART Pruning Method: Grow Full Tree, Then Prune
• You will never know when to stop… so don’t!
• Instead… grow trees that are obviously too big
  • The largest tree grown is called the “maximal” tree
  • The maximal tree could have hundreds or thousands of nodes
  • Usually we instruct CART to grow a tree that is only moderately too big
  • Rule of thumb: grow trees about twice the size of the truly best tree
• This becomes the first stage in finding the best tree
• Next we will have to get rid of the parts of the overgrown tree that don’t work (are not supported by test data)
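For example, with scikit-learn an over-sized tree is what you get by default when no growth limits are set; the dataset and split below are only an example.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # An intentionally over-grown ("maximal") tree: no depth or size limits
    maximal = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print("leaves:", maximal.get_n_leaves())
    print("train accuracy:", maximal.score(X_train, y_train))  # typically 1.0: the data is memorised
    print("test accuracy:", maximal.score(X_test, y_test))     # noticeably lower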
Tree Pruning
• Take a very large tree (the “maximal” tree)
  • The tree may be radically over-fit
    • It tracks all the idiosyncrasies of THIS data set
    • It tracks patterns that may not be found in other data sets
  • At the bottom of the tree, splits are based on very few cases
  • Analogous to a regression with a very large number of variables
• PRUNE away branches from this large tree
  • But which branch do we cut first?
• CART determines a pruning sequence:
  • the exact order in which each node should be removed
  • the pruning sequence is determined for EVERY node
  • the sequence is determined all the way back to the root node
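scikit-learn exposes such a pruning sequence through cost_complexity_pruning_path: each value of the complexity parameter ccp_alpha indexes one sub-tree of the nested sequence, from the maximal tree down to the root alone. A sketch, repeating the example setup so it stands alone:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    maximal = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    # Nested pruning sequence: increasing alphas, each removing the current weakest link(s)
    path = maximal.cost_complexity_pruning_path(X_train, y_train)
    print("sub-trees in the sequence:", len(path.ccp_alphas))
    print("first few alphas:", path.ccp_alphas[:5])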
Pruning: Which nodes come off next?
Order of Pruning: Weakest Link Goes First
• Prune away the “weakest link”: the nodes that add least to the overall accuracy of the tree
  • A node’s contribution to the overall tree is a function of both the increase in accuracy and the size of the node
  • The accuracy gain is weighted by the node’s share of the sample
  • Small nodes tend to get removed before large ones!
• If several nodes have the same contribution, they are all pruned away simultaneously
  • Hence more than two terminal nodes could be cut off in one pruning step
• The sequence is determined all the way back to the root node
  • We need to allow for the possibility that the entire tree is bad
  • If the target variable is unpredictable, we will want to prune back to the root… the “no model” solution
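In the standard cost-complexity formulation, this contribution is measured for each internal node t by

\[
  g(t) \;=\; \frac{R(t) - R(T_t)}{\lvert \tilde{T}_t \rvert - 1}
\]

where R(t) is the (training-sample) cost of node t if it were collapsed to a leaf, R(T_t) is the cost of the sub-tree T_t rooted at t, and |T̃_t| is the number of terminal nodes of that sub-tree. The node(s) with the smallest g(t) form the weakest link and are pruned first; ties are pruned together, which is why more than two terminal nodes can come off in a single pruning step.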
Pruning Sequence Example
(Figure: successively pruned trees with 24, 21, 20, and 18 terminal nodes)
Now we test every tree in the pruning sequence
• Take a test data set, drop it down the largest tree in the sequence, and measure its predictive accuracy
  • how many cases are classified correctly and how many incorrectly
  • measure accuracy overall and by class
• Do the same for the 2nd largest tree, the 3rd largest tree, etc.
• The performance of every tree in the sequence is measured
  • Results are reported in table and graph formats
• Note that this critical stage is impossible to complete without test data
  • The CART procedure requires test data to guide tree evaluation
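A sketch of this evaluation loop with scikit-learn, where each sub-tree in the sequence is obtained by refitting with the corresponding ccp_alpha; the dataset is only an example.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    path = DecisionTreeClassifier(random_state=0).fit(X_train, y_train) \
               .cost_complexity_pruning_path(X_train, y_train)

    for alpha in path.ccp_alphas:
        tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
        # Training accuracy keeps rising with tree size; test accuracy peaks at an intermediate size
        print(tree.get_n_leaves(), "leaves",
              "| train acc %.3f" % tree.score(X_train, y_train),
              "| test acc %.3f" % tree.score(X_test, y_test))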
Training Data vs. Test Data Error Rates
• Compare error rates measured on
  • the learning data
  • a large test set
• Learn R(T) always decreases as the tree grows (Q: Why?)
• Test R(T) first declines, then increases (Q: Why?)
• Overfitting is the result of too much reliance on the learn R(T)
  • It can lead to disasters when the tree is applied to new data
No. Terminal Nodes    R(T)    Rts(T)
        71            .00      .42
        63            .00      .40
        58            .03      .39
        40            .10      .32
        34            .12      .32
        19            .20      .31
      **10            .29      .30
         9            .32      .34
         7            .41      .47
         6            .46      .54
         5            .53      .61
         2            .75      .82
         1            .86      .91
(** marks the tree with the lowest test error Rts(T))
Why look at training data error rates (or cost) at all?
• First, it provides a rough guide of how you are doing
  • The truth will typically be WORSE than the training-data measure
  • If the tree performs poorly even on the training data, we may not want to pursue it further
• The training-data error rate is more accurate for smaller trees
  • So it is a reasonable guide for smaller trees
  • but a poor guide for larger trees
• At the optimal tree, training and test error rates should be similar
  • if not, something is wrong
  • it is useful to compare not just the overall error rate but also within-node performance between training and test data
CART: Optimal Tree
• Within a single CART run, which tree is best?
• The process of pruning the maximal tree can yield many sub-trees
• A test data set or cross-validation measures the error rate of each tree
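One common way to do this in practice is to cross-validate over the complexity parameter that indexes the pruning sequence; a sketch with scikit-learn’s GridSearchCV (example dataset, 5-fold cross-validation).

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Candidate alphas: one per sub-tree in the pruning sequence of the maximal tree
    alphas = DecisionTreeClassifier(random_state=0).fit(X, y) \
                 .cost_complexity_pruning_path(X, y).ccp_alphas

    search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                          param_grid={"ccp_alpha": alphas},
                          cv=5, scoring="accuracy").fit(X, y)

    print("chosen alpha:", search.best_params_["ccp_alpha"])
    print("leaves in the selected tree:", search.best_estimator_.get_n_leaves())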
The Best Pruned Subtree: An Estimation Problem
(Figure: estimated error rate R̂(Tk) plotted against the number of terminal nodes |T̃k|)
• Current wisdom: select the tree with the smallest error rate
• The only drawback: the minimum may not be precisely estimated
  • The typical error rate as a function of tree size has a flat region
  • The minimum could be anywhere in this region
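For illustration, applying this to the training/test error table shown earlier: the minimum test error Rts(T) is reached at 10 terminal nodes, but several neighbouring tree sizes are nearly as good, and that near-minimum set is the flat region; the 0.02 tolerance below is an arbitrary choice for the sketch.

    nodes    = [71, 63, 58, 40, 34, 19, 10, 9, 7, 6, 5, 2, 1]
    test_err = [.42, .40, .39, .32, .32, .31, .30, .34, .47, .54, .61, .82, .91]

    best = min(range(len(nodes)), key=lambda i: test_err[i])
    print("lowest test error:", test_err[best], "at", nodes[best], "terminal nodes")

    # Tree sizes whose test error lies within 0.02 of the minimum: the "flat region"
    near_optimal = [n for n, e in zip(nodes, test_err) if e <= test_err[best] + 0.02]
    print("near-optimal tree sizes:", near_optimal)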
In what sense is the optimal tree “best”?
• The optimal tree has the lowest, or nearly the lowest, cost as determined by a test procedure
• The tree should exhibit very similar accuracy when applied to new data
• BUT the best tree is NOT necessarily the one that happens to be most accurate on a single test database
  • trees somewhat larger or smaller than the “optimal” one may be preferred
• There is room for user judgment
  • judgment not about split variables or values
  • judgment as to how much of the tree to keep
  • determined by the story the tree is telling
  • willingness to sacrifice a small amount of accuracy for simplicity
Decision Tree Summary
• Decision Trees
  • splits: binary, multi-way
  • split criteria: entropy, Gini, …
  • missing value treatment
  • pruning
  • rule extraction from trees
• Both C4.5 and CART are robust tools
• No method is always superior: experiment!
Witten & Eibe