Artificial Intelligence: Representation and Problem Solving
15-381, April 12, 2007
Decision Trees 2
Michael S. Lewicki, Carnegie Mellon

20 questions
• Consider this game of 20 questions on the web: 20Q.net Inc.

Pick your poison
• How do you decide whether a mushroom is edible?
• What is the best identification strategy?
• Let's try decision trees.
(Image: a "Death Cap" mushroom.)

Some mushroom data (from the UCI machine learning repository)

  #  EDIBLE?    CAP-SHAPE  CAP-SURFACE  CAP-COLOR  ODOR   STALK-SHAPE  POPULATION  HABITAT
  1  edible     flat       fibrous      red        none   tapering     several     woods
  2  poisonous  convex     smooth       red        foul   tapering     several     paths
  3  edible     flat       fibrous      brown      none   tapering     abundant    grasses
  4  edible     convex     scaly        gray       none   tapering     several     woods
  5  poisonous  convex     smooth       red        foul   tapering     several     woods
  6  edible     convex     fibrous      gray       none   tapering     several     woods
  7  poisonous  flat       scaly        brown      fishy  tapering     several     leaves
  8  poisonous  flat       scaly        brown      spicy  tapering     several     leaves
  9  poisonous  convex     fibrous      yellow     foul   enlarging    several     paths
 10  poisonous  convex     fibrous      yellow     foul   enlarging    several     woods
 11  poisonous  flat       smooth       brown      spicy  tapering     several     woods
 12  edible     convex     smooth       yellow     anise  tapering     several     woods
 13  poisonous  knobbed    scaly        red        foul   tapering     several     leaves
 14  poisonous  flat       smooth       brown      foul   tapering     several     leaves
 15  poisonous  flat       fibrous      gray       foul   enlarging    several     woods
 16  edible     sunken     fibrous      brown      none   enlarging    solitary    urban
 17  poisonous  flat       smooth       brown      foul   tapering     several     woods
 18  poisonous  convex     smooth       white      foul   tapering     scattered   urban
 19  poisonous  flat       scaly        yellow     foul   enlarging    solitary    paths
 20  edible     convex     fibrous      gray       none   tapering     several     woods
  …  …          …          …            …          …      …            …           …

An easy problem: two attributes provide most of the information
• Root node: Poisonous: 44, Edible: 46.
• Split 1: Is ODOR almond, anise, or none?
  - no:  Poisonous: 43, Edible: 0  → POISONOUS
  - yes: Poisonous: 1,  Edible: 46 → split again
• Split 2: Is SPORE-PRINT-COLOR green?
  - no:  Poisonous: 0, Edible: 46 → EDIBLE
  - yes: Poisonous: 1, Edible: 0  → POISONOUS
• 100% classification accuracy on 100 examples.

Same problem with no odor or spore-print-color
• Without ODOR and SPORE-PRINT-COLOR, the tree needs many more splits (on GILL-COLOR, GILL-SPACING, CAP-COLOR, STALK-SURFACE-ABOVE-RING, GILL-SIZE, …) to separate the classes.
• Still 100% classification accuracy on 100 examples. Pretty good, right?
• What if we go off hunting with this decision tree?
• Performance on another set of 100 mushrooms: 80%. Why?
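The first tree above (two splits, ODOR then SPORE-PRINT-COLOR) is simple enough to write out directly. The Python sketch below mirrors those two tests; the function name and the two example calls are illustrative, not part of the lecture or of the UCI data.

def classify_mushroom(odor, spore_print_color):
    """Classify a mushroom with the two splits from the tree above."""
    if odor not in ("almond", "anise", "none"):
        return "poisonous"      # this branch held 43 poisonous, 0 edible examples
    if spore_print_color == "green":
        return "poisonous"      # the single remaining poisonous example
    return "edible"             # 46 edible, 0 poisonous examples

# Hypothetical records, not rows from the UCI data:
print(classify_mushroom(odor="none", spore_print_color="brown"))   # -> edible
print(classify_mushroom(odor="foul", spore_print_color="white"))   # -> poisonous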
Not enough examples?
(Figure: % correct on another set of the same size, plotted against the number of training examples (0 to 2000), with one curve for the training set and one for the testing set.)
• Question: why is performance on the testing set always lower than on the training set?
• Side example with both discrete and continuous attributes: predicting MPG ('GOOD' or 'BAD') from the attributes Cylinders, Horsepower, Acceleration, Maker (discrete), and Displacement.

The Overfitting Problem: Example
(The following examples are from Prof. Hebert.)
• Suppose that, in an ideal world, class B is everything such that X2 >= 0.5 and class A is everything with X2 < 0.5.
• Note that attribute X1 is irrelevant.
• Generating a decision tree for this seems trivial, right?
(Figure: class A and class B points separated by the line X2 = 0.5.)

The Overfitting Problem: Example
• However, we collect training examples from the perfect world through some imperfect observation device.
• As a result, the training data is corrupted by noise, and the observed pattern is more complex than it appears.
(Figure: the same data, now with noisy labels.)

The Overfitting Problem: Example
• Because of the noise, the resulting decision tree is far more complicated than it should be.
• This is because the learning algorithm tries to classify all of the training set perfectly.
• This is a fundamental problem in learning: overfitting.
(Figure: the overly complex tree and its partition of the data. Annotation: the tree classifies this outlier as 'A', but that prediction won't generalize to new examples.)
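The effect described on these slides can be reproduced in a few lines. This is a sketch assuming numpy and scikit-learn are available; the sample sizes, the 10% label-noise rate, and the random seed are arbitrary choices, not values from the lecture.

# Toy problem: the true class depends only on whether x2 >= 0.5, x1 is irrelevant,
# and some training labels are flipped by the "imperfect observation device".
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_data(n, noise=0.1):
    X = rng.random((n, 2))                    # columns: x1 (irrelevant), x2
    y = (X[:, 1] >= 0.5).astype(int)          # ideal labels: class B iff x2 >= 0.5
    flip = rng.random(n) < noise              # observation noise
    y[flip] = 1 - y[flip]
    return X, y

X_train, y_train = make_data(40)
X_test, y_test = make_data(1000, noise=0.0)   # clean data to measure generalization

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)   # unrestricted tree
print("training accuracy:", tree.score(X_train, y_train))   # 1.0: it also fits the noise
print("testing accuracy: ", tree.score(X_test, y_test))     # lower: the spurious splits do not generalize
print("number of nodes:  ", tree.tree_.node_count)          # far more than the 3 a single split needs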
The Overfitting Problem: Example
(Figure annotation: the problem started here. X1 is irrelevant to the underlying structure.)

The Overfitting Problem: Example
• Is there a way to identify that splitting this node is not helpful?
• Idea: detect when splitting would result in a tree that is too "complex".

Addressing overfitting
• Grow the tree based on the training data. This yields an unpruned tree.
• Then prune nodes from the tree that are unhelpful.
• How do we know when this is the case?
  - Use additional data not used in training, i.e. test data.
  - Use a statistical significance test to see whether extra nodes are different from noise.
  - Penalize the complexity of the tree (a code sketch of this option appears after the next two figures).

Training data and the unpruned decision tree
(Figures: the training data, and the unpruned decision tree learned from it.)

Training data with the partitions induced by the decision tree
(Notice the tiny regions at the top necessary to correctly classify the 'A' outliers!)
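One way to act on the "penalize the complexity of the tree" option is cost-complexity pruning, sketched below with scikit-learn's ccp_alpha parameter (available in scikit-learn 0.22 and later). This is a stand-in illustration and not the pruning method developed in the following slides; the alpha grid, sample sizes, and seed are arbitrary.

# Controlling tree complexity with a penalty: larger alpha = stronger penalty = smaller tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X_train = rng.random((40, 2))
y_train = (X_train[:, 1] >= 0.5).astype(int)
y_train ^= (rng.random(40) < 0.1).astype(int)       # flip ~10% of labels (noise)
X_test = rng.random((400, 2))
y_test = (X_test[:, 1] >= 0.5).astype(int)           # noise-free data for evaluation

for alpha in (0.0, 0.01, 0.05):                      # larger alpha penalizes complexity more
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    print(f"alpha={alpha:4.2f}  nodes={tree.tree_.node_count:3d}  "
          f"train={tree.score(X_train, y_train):.2f}  test={tree.score(X_test, y_test):.2f}")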
Unpruned decision tree from training data
(Figures: the training data, the test data, and the unpruned decision tree learned from the training data.)
• Performance (% correctly classified): Training: 100%, Test: 77.5%.

Pruned decision tree from training data
(Figures: the training data, the test data, and the pruned decision tree.)
• Performance (% correctly classified): Training: 95%, Test: 80%.

Pruned decision tree from training data (pruned further)
(Figures: the training data, the test data, and the further-pruned decision tree.)
• Performance (% correctly classified): Training: 80%, Test: 97.5%.

Classification Using Test Data
(Figure: % of data correctly classified vs. size of the decision tree, with one curve for performance on the training set and one for performance on the test set; the tree with the best performance on the test set is marked.)

General principle
(Figure: % correct classification vs. complexity of the model (e.g. size of tree), showing the classification rate on the training data, the classification performance on the test data, and the region where the model overfits the training data.)
• As its complexity increases, the model is able to better classify the training data.
• Performance on the test data initially increases, but then falls as the model overfits, i.e. becomes specialized for classifying the noise in the training data.
• In the overfitting region, the tree fits the training data (including the noise!) and starts doing poorly on the test data.
• The complexity of a decision tree is its number of free parameters, i.e. the number of nodes.
• General principle: as the complexity of the classifier increases (e.g. the depth of the decision tree), performance on the training data increases, while performance on the test data decreases once the classifier overfits the training data.

Strategies for avoiding overfitting: Pruning
• Avoiding overfitting is equivalent to achieving good generalization.
• All strategies need some way to control the complexity of the model.
• Pruning: construct a standard decision tree, but keep a test data set on which the model is not trained (also called validation data, or a hold-out data set).
  - Construct the entire tree as before.
  - Starting at the leaves, recursively eliminate splits: evaluate the performance of the tree on the test data, and prune a split if the classification performance increases when the split is removed.
  - (Figure: prune a node if classification performance on the test set is greater for the pruned subtree (2) than for the original subtree (1).)
• A sketch of this procedure follows below.
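Here is a minimal sketch of the prune-if-performance-does-not-drop rule described above, written against a toy dictionary-based tree. The node format, the hand-built tree, and the four held-out points are illustrative assumptions, not the lecture's implementation; ties are resolved in favor of pruning, which prefers the simpler tree.

def predict(node, x):
    if "label" in node:                                   # leaf node
        return node["label"]
    child = node["left"] if x[node["feature"]] < node["thresh"] else node["right"]
    return predict(child, x)

def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)

def prune(node, root, test_data):
    if "label" in node:
        return
    prune(node["left"], root, test_data)                  # work bottom-up from the leaves
    prune(node["right"], root, test_data)
    before = accuracy(root, test_data)
    saved = dict(node)                                    # remember the split
    node.clear()
    node["label"] = saved["majority"]                     # collapse to a majority-class leaf
    if accuracy(root, test_data) < before:                # undo if pruning hurt performance
        node.clear()
        node.update(saved)

# Toy tree for the 2D problem: the x2 split is real, the x1 split only fits noise.
tree = {"feature": 1, "thresh": 0.5, "majority": "B",
        "left":  {"feature": 0, "thresh": 0.3, "majority": "A",    # spurious split
                  "left": {"label": "B"}, "right": {"label": "A"}},
        "right": {"label": "B"}}

test_data = [((0.1, 0.2), "A"), ((0.6, 0.3), "A"), ((0.2, 0.8), "B"), ((0.7, 0.9), "B")]
prune(tree, tree, test_data)
print(tree)    # the spurious x1 split is collapsed into a single 'A' leaf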
Strategies for avoiding overfitting: Statistical significance tests
• For each split, ask whether there is a significant increase in the information gain.
• The problem: the information gain always increases when we split, but the change in entropy may not be significant. If we are splitting noise, the data are essentially random.
• Reasoning, on a small example:
  - The root node contains N_A = 2 examples of class A and N_B = 7 examples of class B; the left child contains N_AL = 1 example of class A and N_BL = 4 examples of class B.
  - The proportion of the data that goes to the left node is p_L = (N_AL + N_BL) / (N_A + N_B) = 5/9.
  - Suppose now that the data were randomly distributed; then it would make no sense to split.
  - Under that assumption, the expected number of class-A examples in the left node would be N'_AL = N_A × p_L = 10/9, and the expected number of class-B examples would be N'_BL = N_B × p_L = 35/9.
• Question: is the difference between what we observe (N_AL, N_BL) and what we would expect from a random distribution (N'_AL, N'_BL) statistically significant? If not, don't split!

Detecting statistically significant splits
• A measure of statistical significance:
  K = (N_AL − N'_AL)² / N'_AL + (N_BL − N'_BL)² / N'_BL + (N_AR − N'_AR)² / N'_AR + (N_BR − N'_BR)² / N'_BR
• K measures how much the split deviates from what we would expect from random data.
• If K is small, the information gain from the split is not significant.
• Here, K = (1 − 10/9)² / (10/9) + (4 − 35/9)² / (35/9) + ··· = 0.0321.

"χ² criterion": general case
• In general, a node with N data points is split into children j that receive proportions p_j of the points (here p_L and p_R, with N_L and N_R points going left and right). Then
  K = Σ over all classes i and all children j of (N_ij − N'_ij)² / N'_ij
  where N_ij is the number of points from class i in child j, and N'_ij = N_i × p_j is the number expected under random assignment.
• Small chi-square values imply low statistical significance.
• Nodes with K smaller than a threshold are pruned.
• The threshold regulates the complexity of the model:
  - Low thresholds allow larger trees and more overfitting.
  - High thresholds keep trees small but may sacrifice performance.

Illustration on our toy problem
(Figure: the unpruned tree for the toy problem, with K = 10.58 for one split and K = 0.0321 and K = 0.83 for the others. The gains obtained by the low-K splits are not significant.)
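The statistic is easy to check numerically. The sketch below reproduces the worked example (N_A = 2, N_B = 7, N_AL = 1, N_BL = 4, hence N_AR = 1 and N_BR = 3); the scipy call and the choice of (classes − 1) × (children − 1) degrees of freedom for the 95% threshold are assumptions for illustration.

# K statistic for one candidate split, plus a chi-square threshold (assumed df).
from scipy.stats import chi2

def split_K(counts):
    """counts[i][j] = number of points of class i sent to child j."""
    class_totals = [sum(row) for row in counts]                  # N_i
    child_totals = [sum(col) for col in zip(*counts)]            # points per child
    N = sum(class_totals)
    K = 0.0
    for i, row in enumerate(counts):
        for j, N_ij in enumerate(row):
            expected = class_totals[i] * child_totals[j] / N     # N'_ij = N_i * p_j
            K += (N_ij - expected) ** 2 / expected
    return K

# Class A: 1 left, 1 right; class B: 4 left, 3 right (so N_A = 2, N_B = 7).
counts = [[1, 1],
          [4, 3]]
K = split_K(counts)
threshold = chi2.ppf(0.95, df=(2 - 1) * (2 - 1))   # about 3.84 at 95% confidence
print(f"K = {K:.4f}, threshold = {threshold:.2f}, prune = {K < threshold}")
# K = 0.0321: far below the threshold, so this split is not significant.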
Illustration on our toy problem (continued)
• With an appropriate threshold on K, we end up with the decision tree we would expect, i.e. one that does not overfit the data: a single split.
• Note: the approach is presented with continuous attributes in this example, but it works just as well with discrete attributes.

χ² pruning
• The test on K is a version of a standard statistical test, the χ² ("chi-square") test.
• The threshold t is retrieved from statistical tables. For example, K > t means that, with 95% confidence, the information gain due to the split is significant.
• If K < t, then with high confidence the information gain will be 0 over very large training samples.
• This reduces overfitting and eliminates irrelevant attributes.

A real example: Fisher's iris data
• Three classes of irises, four attributes, 50 examples from each class.

  Class       Sepal Length (SL)  Sepal Width (SW)  Petal Length (PL)  Petal Width (PW)
  Setosa      5.1                3.5               1.4                0.2
  Setosa      4.9                3                 1.4                0.2
  Setosa      5.4                3.9               1.7                0.4
  Versicolor  5.2                2.7               3.9                1.4
  Versicolor  5                  2                 3.5                1
  Versicolor  6                  2.2               4                  1
  Virginica   6.4                2.8               5.6                2.1
  Virginica   7.2                3                 5.8                1.6

Full (unpruned) decision tree
(Figure: the full decision tree learned from the iris data.)

The scatter plot of the data with decision boundaries
(Figure: Petal Width (PW) vs. Petal Length (PL), with the decision boundaries induced by the full tree.)

Tree statistics
(Statistics for the full decision tree.)

Pruning one level
(Figure: the tree after pruning one level.)

Pruning two levels
(Figure: the tree after pruning two levels.)

The tree with pruned decision boundaries
(Figure: Petal Width (PW) vs. Petal Length (PL), with the decision boundaries of the pruned tree.)

Recap: What you should understand
• Learning is fitting models (estimating their parameters) from data.
• The complexity of a model (related to, but not identical to, its number of parameters) determines how well it can fit the data.
• If there are insufficient data relative to the complexity, the model will exhibit poor generalization, i.e. it will overfit the data.
• To avoid this, learning algorithms divide the examples into training and testing data.
• The goal of learning is to achieve good predictions/classifications for novel data, i.e. good generalization.
• Decision trees:
  - a simple hierarchical approach to classification
  - the goal is the best classification with the minimal number of decisions
  - work with binary, categorical, or continuous data
  - information gain is a useful splitting strategy (there are many others)
  - the tree is built recursively
  - it can be pruned to reduce the problem of overfitting
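As a concrete coda to the iris example above, here is a small sketch assuming scikit-learn, which ships Fisher's iris data. A depth limit stands in for the level-by-level pruning shown in the slides, so the node counts and accuracies will not match the lecture's figures; the depth, split, and seed are arbitrary choices.

# Full vs. complexity-limited decision tree on Fisher's iris data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)             # 150 examples, 3 classes, 4 attributes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
small = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr)

for name, tree in [("full", full), ("depth-2", small)]:
    print(f"{name:8s} nodes={tree.tree_.node_count:2d} "
          f"train={tree.score(X_tr, y_tr):.2f} test={tree.score(X_te, y_te):.2f}")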