CS B551: DECISION TREES

AGENDA
Decision trees
Complexity
Learning curves
Combatting overfitting
Boosting

RECAP
Still in the supervised setting with logical (Boolean) attributes.
Find a representation of CONCEPT in the form:
  CONCEPT(x) ⇔ S(A, B, …)
where S(A, B, …) is a sentence built with the observable attributes, e.g.:
  CONCEPT(x) ⇔ A(x) ∧ (B(x) ∨ C(x))

PREDICATE AS A DECISION TREE
The predicate CONCEPT(x) ⇔ A(x) ∧ (B(x) ∨ C(x)) can be represented by the following decision tree:
[Decision tree figure: test A?; if A is False return False; if A is True test B?; if B is True return True; if B is False test C? and return its value.]
Example: a mushroom is poisonous iff it is yellow and small, or yellow, big and spotted.
• x is a mushroom
• CONCEPT = POISONOUS
• A = YELLOW
• B = BIG
• C = SPOTTED
• D = FUNNEL-CAP
• E = BULKY

TRAINING SET
Ex.#  A      B      C      D      E      CONCEPT
1     False  False  True   False  True   False
2     False  True   False  False  False  False
3     False  True   True   True   True   False
4     False  False  True   False  False  False
5     False  False  False  True   True   False
6     True   False  True   False  False  True
7     True   False  False  True   False  True
8     True   False  True   False  True   True
9     True   True   True   False  True   True
10    True   True   True   True   True   True
11    True   True   False  False  False  False
12    True   True   False  False  True   False
13    True   False  True   True   True   True

POSSIBLE DECISION TREE
[Decision tree figure consistent with the training set above: the root tests D; the D = True branch tests E and then A; the D = False branch tests C, then B, then E, then A.]
This tree corresponds to:
  CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ A)))))
compared with the much simpler
  CONCEPT ⇔ A ∧ (B ∨ C)
KIS bias: build the smallest decision tree.
Building the smallest tree is a computationally intractable problem, so use a greedy algorithm.

TOP-DOWN INDUCTION OF A DT
DTL(D, Predicates)
1. If all examples in D are positive then return True
2. If all examples in D are negative then return False
3. If Predicates is empty then return the majority rule
4. A ← an error-minimizing predicate in Predicates
5. Return the tree whose:
   - root is A,
   - left branch is DTL(D+A, Predicates − A),
   - right branch is DTL(D−A, Predicates − A)
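To make the greedy procedure concrete, here is a minimal Python sketch of DTL for Boolean attributes, with each example stored as an (attribute-dict, label) pair. The names dtl, majority, and split_error, and the use of plain error counts as the splitting criterion in step 4, are illustrative choices; information gain, introduced below, is the more common criterion.

from collections import Counter

def majority(examples):
    """Majority label among (attributes, label) examples."""
    counts = Counter(label for _, label in examples)
    return counts.most_common(1)[0][0]

def split_error(examples, a):
    """Errors made by predicting the majority label on each branch of attribute a."""
    pos = [(x, y) for x, y in examples if x[a]]
    neg = [(x, y) for x, y in examples if not x[a]]
    err = 0
    for branch in (pos, neg):
        if branch:
            m = majority(branch)
            err += sum(1 for _, y in branch if y != m)
    return err

def dtl(examples, predicates):
    """Greedy top-down decision-tree learning for Boolean attributes."""
    labels = {y for _, y in examples}
    if labels == {True}:
        return True
    if labels == {False}:
        return False
    if not predicates:
        return majority(examples)
    a = min(predicates, key=lambda p: split_error(examples, p))   # step 4
    pos = [(x, y) for x, y in examples if x[a]]
    neg = [(x, y) for x, y in examples if not x[a]]
    rest = [p for p in predicates if p != a]
    # An empty branch falls back to the majority rule of the parent node.
    left = dtl(pos, rest) if pos else majority(examples)
    right = dtl(neg, rest) if neg else majority(examples)
    return (a, left, right)   # tree node: (attribute, true-branch, false-branch)

Calling dtl(training_set, ['A', 'B', 'C', 'D', 'E']) on the mushroom examples above returns a nested (attribute, true-branch, false-branch) tree of this form.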
LEARNABLE CONCEPTS
Some simple concepts cannot be represented compactly as decision trees:
• Parity(x) = X1 xor X2 xor … xor Xn
• Majority(x) = 1 if most of the Xi's are 1, 0 otherwise
These require trees of exponential size in the # of attributes, and an exponential # of examples to learn exactly.
The ease of learning depends on shrewdly (or luckily) chosen attributes that correlate with CONCEPT.

PERFORMANCE ISSUES
Assessing performance: split the data into a training set and a test set, and plot a learning curve (% correct on the test set vs. size of the training set).
[Figure: typical learning curve, rising toward 100% as the training set grows.]
Some concepts are unrealizable within a machine's capacity.
Overfitting: the risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set.
Tree pruning: terminate the recursion when the # of errors / the information gain is small.
The resulting decision tree + majority rule may not classify all examples in the training set correctly.
Other practical issues: incorrect examples, missing data, multi-valued and continuous attributes.

USING INFORMATION THEORY
Rather than minimizing the probability of error, minimize the expected number of questions needed to decide whether an object x satisfies CONCEPT.
Use the information-theoretic quantity known as information gain.
Split on the variable with the highest information gain.

ENTROPY / INFORMATION GAIN
Entropy encodes the quantity of uncertainty in a random variable:
  H(X) = −Σ_{x∈Val(X)} P(x) log P(x)
Properties:
• H(X) = 0 if X is known, i.e. P(x) = 1 for some value x
• H(X) > 0 if X is not known with certainty
• H(X) is maximal if P(X) is the uniform distribution
Information gain measures the reduction in uncertainty in X given knowledge of Y:
  I(X, Y) = E_Y[H(X) − H(X|Y)] = Σ_y P(y) Σ_x [P(x|y) log P(x|y) − P(x) log P(x)]
Properties:
• Always nonnegative
• Equals 0 if X and Y are independent
• If Y is a choice, maximizing IG is equivalent to minimizing E_Y[H(X|Y)]

MAXIMIZING IG / MINIMIZING CONDITIONAL ENTROPY IN DECISION TREES
  E_Y[H(X|Y)] = −Σ_y P(y) Σ_x P(x|y) log P(x|y)
Let n be the # of examples, and n+, n− the # of examples on the True/False branches of the split Y.
Let p+, p− be the accuracy on the True/False branches of Y, so that
  P(correct) = (p+ n+ + p− n−)/n,  P(correct | Y) = p+,  P(correct | ¬Y) = p−
Then
  E_Y[H(X|Y)] = −(n+/n) [p+ log p+ + (1−p+) log(1−p+)] − (n−/n) [p− log p− + (1−p−) log(1−p−)]
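These quantities are straightforward to compute from counts. The sketch below (illustrative names entropy, info_gain, best_split; log base 2 chosen for concreteness) estimates H and I from the examples and picks the attribute with the highest information gain, which could replace the error-minimizing choice in step 4 of DTL.

import math
from collections import Counter

def entropy(labels):
    """H(X) = -sum_x P(x) log2 P(x) over the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, a):
    """I(label, a) = H(label) - E_a[H(label | a)] for a Boolean attribute a."""
    labels = [y for _, y in examples]
    n = len(labels)
    cond = 0.0
    for value in (True, False):
        branch = [y for x, y in examples if x[a] == value]
        if branch:
            cond += (len(branch) / n) * entropy(branch)
    return entropy(labels) - cond

def best_split(examples, predicates):
    """Split on the attribute with the highest information gain."""
    return max(predicates, key=lambda a: info_gain(examples, a))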
CONTINUOUS ATTRIBUTES
Continuous attributes can be converted into logical ones via thresholds, e.g. X becomes X < a.
When considering splitting on X, pick the threshold a that minimizes the # of errors / the entropy.
[Figure: example values plotted along a number line, illustrating candidate thresholds between consecutive values.]

MULTI-VALUED ATTRIBUTES
Simple change: consider splits on all values A can take on.
Caveat: the more values A can take on, the more important it may appear to be, even if it is irrelevant.
More values ⇒ the dataset is split into smaller example sets when picking attributes.
Smaller example sets ⇒ more likely to fit spurious noise.

STATISTICAL METHODS FOR ADDRESSING OVERFITTING / NOISE
There may be few training examples that match the path leading to a deep node in the decision tree.
The smaller the sample, the more susceptible the learner is to choosing irrelevant/incorrect attributes.
Idea: make a statistical estimate of predictive power (which increases with larger samples) and prune branches with low predictive power.
This leads to chi-squared pruning.

TOP-DOWN DT PRUNING
Consider an inner node X that by itself (majority rule) predicts p examples correctly and n examples incorrectly.
At its k leaf nodes, the numbers of correct/incorrect examples are p1/n1, …, pk/nk.
Chi-squared statistical significance test:
• Null hypothesis: example labels are randomly chosen with distribution p/(p+n) (X is irrelevant).
• Alternative hypothesis: examples are not randomly chosen (X is relevant).
Prune X if the test on X is not statistically significant.

CHI-SQUARED TEST
Let
  Z = Σ_i [ (pi − pi′)² / pi′ + (ni − ni′)² / ni′ ]
where pi′ = p (pi + ni)/(p + n) and ni′ = n (pi + ni)/(p + n) are the expected numbers of correct/incorrect examples at leaf node i if the null hypothesis holds.
Z is a statistic approximately drawn from the chi-squared distribution with k − 1 degrees of freedom.
Look up the p-value of Z from a table, and prune if the p-value > α for some α (usually ≈ 0.05).
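A sketch of this pruning decision, assuming SciPy is available for the chi-squared tail probability; the leaves are given as (pi, ni) pairs and the name chi_squared_prune is illustrative. Without SciPy, Z can instead be compared against a critical value from a chi-squared table.

from scipy.stats import chi2

def chi_squared_prune(leaves, alpha=0.05):
    """Decide whether to prune the node whose k leaves hold the
    (correct, incorrect) counts in `leaves`, using the test above.
    Returns True if the split is NOT statistically significant (prune it)."""
    p = sum(pi for pi, ni in leaves)          # total correct under the majority rule
    n = sum(ni for pi, ni in leaves)          # total incorrect
    z = 0.0
    for pi, ni in leaves:
        total_i = pi + ni
        pi_exp = p * total_i / (p + n)        # expected correct at leaf i under H0
        ni_exp = n * total_i / (p + n)        # expected incorrect at leaf i under H0
        if pi_exp > 0:
            z += (pi - pi_exp) ** 2 / pi_exp
        if ni_exp > 0:
            z += (ni - ni_exp) ** 2 / ni_exp
    df = len(leaves) - 1                      # k - 1 degrees of freedom
    p_value = chi2.sf(z, df)                  # P(chi-squared with df dof >= z)
    return p_value > alpha                    # not significant => prune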
ENSEMBLE LEARNING (BOOSTING) IDEA
It may be difficult to search for a single hypothesis that explains the data.
Instead, construct multiple hypotheses (an ensemble) and combine their predictions.
"Can a set of weak learners construct a single strong learner?" – Michael Kearns, 1988

MOTIVATION
Suppose we have 5 classifiers, each with 60% accuracy.
On a new example, run them all and pick the prediction by majority voting.
If the errors were independent, the majority vote would be correct about 68% of the time, and the accuracy keeps growing as more such classifiers are combined.
(In reality the errors will not be independent, but we hope they will be mostly uncorrelated.)

BOOSTING
Main idea:
If learner 1 fails to learn an example correctly, this example is more important for learner 2.
If learners 1 and 2 fail to learn an example correctly, this example is more important for learner 3.
…
This is implemented with a weighted training set: the weights encode importance.
Weighted training set: the same 13 mushroom examples as above, now with a weight wi attached to example i.
The algorithm:
Start with uniform weights wi = 1/N.
Use learner 1 to generate hypothesis h1.
Adjust the weights to give higher importance to the examples h1 misclassifies.
Use learner 2 to generate hypothesis h2.
…
Weight the hypotheses according to their performance, and return the weighted majority.
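The slides defer the precise reweighting rule to R&N; the sketch below uses the common AdaBoost formulation with α = ½ ln((1 − ε)/ε), where ε is the weighted error. The name adaboost_round and the (attributes, label) example format are illustrative assumptions.

import math

def adaboost_round(examples, weights, hypothesis):
    """One boosting round: compute the weighted error of `hypothesis`,
    its vote weight, and the reweighted training set (AdaBoost-style).
    examples:   list of (attributes, label) pairs
    weights:    list of example weights summing to 1
    hypothesis: function mapping attributes -> predicted label
    Assumes 0 < weighted error < 1."""
    # Weighted error of the current hypothesis.
    error = sum(w for (x, y), w in zip(examples, weights) if hypothesis(x) != y)
    # Vote weight: accurate hypotheses get a larger say in the weighted majority.
    alpha = 0.5 * math.log((1 - error) / error)
    # Increase weights of misclassified examples, decrease the others, renormalize.
    new_weights = [
        w * math.exp(alpha if hypothesis(x) != y else -alpha)
        for (x, y), w in zip(examples, weights)
    ]
    total = sum(new_weights)
    return alpha, [w / total for w in new_weights]

Starting from uniform weights 1/13 and the stump CONCEPT = C, this update gives the .125/.056 weights shown in the mushroom example below; a second round with CONCEPT = A gives the .25/.07/.03 weights, up to rounding.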
MUSHROOM EXAMPLE
Use "decision stumps" (single-attribute decision trees) as the weak hypotheses, starting from uniform weights wi = 1/13 on the 13 training examples above.

Round 1: pick C first; learner 1 learns CONCEPT = C.
Update the weights (precise formula given in R&N): the misclassified examples 1, 3, 4, and 7 are upweighted to .125 each; the other nine examples drop to .056.

Round 2: next try A; learner 2 learns CONCEPT = A.
Update the weights: the newly misclassified examples 11 and 12 rise to .25 each; examples 1, 3, 4, and 7 drop to .07; the remaining examples drop to .03.

Round 3: next try E; learner 3 learns CONCEPT = E.
Update the weights again and continue in the same way with D and B.
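One simple way to realize each "learner" here is to evaluate every single-attribute stump by its weighted error and keep the best; the helper names below are illustrative, and the stumps predict CONCEPT equal to the attribute's value, as in this example.

def weighted_error(examples, weights, attribute):
    """Weighted error of the stump that predicts CONCEPT = attribute."""
    return sum(w for (x, y), w in zip(examples, weights) if x[attribute] != y)

def best_stump(examples, weights, attributes):
    """Pick the single-attribute stump with the smallest weighted error."""
    return min(attributes, key=lambda a: weighted_error(examples, weights, a))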
MUSHROOM EXAMPLE
Final classifier: the stumps are learned in the order C, A, E, D, B.
Weights on the hypotheses are determined by their overall (weighted) error.
Weighted-majority weights: A = 2.1, B = 0.9, C = 0.8, D = 1.4, E = 0.09.
The resulting ensemble has 100% accuracy on the training set.

BOOSTING STRATEGIES
The weighting strategy above is the popular AdaBoost algorithm (see R&N p. 667).
Many other strategies exist.
Typically, as the number of hypotheses increases, accuracy increases as well.
Does this conflict with Occam's razor?

ANNOUNCEMENTS
Next class: neural networks & function learning (R&N 18.6-7)