Today’s Topics
• Ensembles
• Decision Forests (actually, Random Forests)
• Bagging and Boosting
• Decision Stumps

Ensembles (Bagging, Boosting, and all that)
• Old View – Learn one good model (Naïve Bayes, k-NN, neural net, d-tree, SVM, etc)
• New View – Learn a good set of models
• Probably the best example of interplay between ‘theory & practice’ in machine learning

Ensembles of Neural Networks (or any supervised learner)
[Diagram: the INPUT is fed to several networks; a Combiner merges their predictions into the OUTPUT]
• Ensembles often produce accuracy gains of 5-10 percentage points!
• Can combine “classifiers” of various types
  – Eg, decision trees, rule sets, neural networks, etc

Three Explanations of Why Ensembles Help
1. Statistical (sample effects)
2. Computational (limited cycles for search)
3. Representational (wrong hypothesis space)
[Figure: the concept space considered, showing the true concept, the learned models, and their search paths]
From: Dietterich, T. G. (2002). Ensemble Learning. In The Handbook of Brain Theory and Neural Networks, Second edition (M. A. Arbib, Ed.), Cambridge, MA: The MIT Press, pp. 405-408.

Combining Multiple Models
Three ideas for combining predictions
1. Simple (unweighted) votes
   • Standard choice
2. Weighted votes
   • Eg, weight by tuning-set accuracy
3. Learn a combining function
   • Prone to overfitting?
   • ‘Stacked generalization’ (Wolpert)

Random Forests (Breiman, Machine Learning 2001; related to Ho, 1995)
A variant of something called BAGGING (‘multi-sets’)
Algorithm
  Let N = # of examples, F = # of features, i = some number << F
  Repeat k times:
    (1) Draw with replacement N examples, put in train set
    (2) Build d-tree, but in each recursive call
        – Choose (w/o replacement) i features
        – Choose the best of these i as the root of this (sub)tree
    (3) Do NOT prune
In HW2, we’ll give you 101 ‘bootstrapped’ samples of the Thoracic Surgery Dataset

Using Random Forests
After training we have K decision trees. How to use them on TEST examples?
Some variant of: if at least L of these K trees say ‘true’, then output ‘true’.
How to choose L? Use a tune set to decide.

More on Random Forests
• Increasing i
  – Increases correlation among individual trees (BAD)
  – Also increases accuracy of individual trees (GOOD)
• Can also use the tuning set to choose a good value for i
• Overall, random forests
  – Are very fast (eg, 50K examples, 10 features, 10 trees/min on a 1 GHz CPU back in 2004)
  – Deal well with a large # of features
  – Reduce overfitting substantially; NO NEED TO PRUNE!
  – Work very well in practice
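Below is a minimal sketch of the random-forest recipe and the tune-set choice of the vote threshold L described above. It assumes the data are NumPy arrays with binary 0/1 labels and uses scikit-learn’s DecisionTreeClassifier (with max_features=i and no pruning) as the base tree learner; that library choice, the default k=101 (echoing the 101 bootstrapped HW2 samples), and the helper names train_random_forest, predict_forest, and choose_L are illustrative assumptions, not part of the lecture.

```python
# Sketch of the random-forest algorithm on the slides above; k, i, L follow
# the slides' notation. Assumes X, y are NumPy arrays with 0/1 labels.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(X, y, k=101, i=3, rng=None):
    """Build k unpruned trees, each on a bootstrap sample of the N training
    examples, considering only i randomly chosen features at each split."""
    if rng is None:
        rng = np.random.default_rng(0)
    N = len(X)
    forest = []
    for _ in range(k):
        idx = rng.integers(0, N, size=N)               # (1) draw N examples with replacement
        tree = DecisionTreeClassifier(max_features=i)  # (2) i random features per split; (3) no pruning
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest

def predict_forest(forest, X, L):
    """Output 'true' (1) exactly when at least L of the K trees say 'true'."""
    votes = np.sum([tree.predict(X) for tree in forest], axis=0)
    return (votes >= L).astype(int)

def choose_L(forest, X_tune, y_tune):
    """Pick the vote threshold L that maximizes tuning-set accuracy."""
    best_L, best_acc = 1, -1.0
    for L in range(1, len(forest) + 1):
        acc = np.mean(predict_forest(forest, X_tune, L) == y_tune)
        if acc > best_acc:
            best_L, best_acc = L, acc
    return best_L
```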
A Relevant Early Paper on ENSEMBLES
Hansen & Salamon, PAMI 12, 1990
– If (a) the combined predictors have errors that are independent of one another,
– and (b) the probability that any given model correctly predicts any given test-set example is > 50%,
then lim_{N→∞} (test-set error rate of N predictors) = 0

Some More Relevant Early Papers
• Schapire, Machine Learning 5, 1990 (‘Boosting’)
  – If you have an algorithm that gets > 50% accuracy on any distribution of examples, you can create an algorithm that gets > (100% − ε) accuracy, for any ε > 0
  – Need an infinite (or at least very large) source of examples; later extensions (eg, AdaBoost) address this weakness
• Also see Wolpert, ‘Stacked Generalization,’ Neural Networks, 1992

Some Methods for Producing ‘Uncorrelated’ Members of an Ensemble
• K times, randomly choose (with replacement) N examples from a training set of size N, and give each training set to a std ML algo
  – ‘Bagging’ by Breiman (Machine Learning, 1996)
  – Want unstable algorithms (so learned models vary)
• Reweight examples each cycle (if wrong, increase weight; else decrease weight)
  – ‘AdaBoosting’ by Freund & Schapire (1995, 1996)

Empirical Studies (from Freund & Schapire; reprinted in Dietterich’s AI Magazine paper)
[Scatter plots, one point per data set, comparing: error rate of C4.5 (ID3’s successor) vs error rate of bagging, and error rate of bagged (boosted) C4.5 vs error rate of AdaBoost]
• Boosting and Bagging helped almost always!
• On average, Boosting slightly better?

Some More Methods for Producing “Uncorrelated” Members of an Ensemble
• Directly optimize accuracy + diversity
  – Opitz & Shavlik (1995; used genetic algos)
  – Melville & Mooney (2004-5)
• Different number of hidden units in a neural network, different k in k-NN, tie-breaking scheme, example ordering, diff ML algos, etc
  – Various people
  – See 2005-2008 papers of Rich Caruana’s group for large-scale empirical studies of ensembles

Boosting/Bagging/etc Wrapup
• An easy-to-use and usually highly effective technique – always consider it (Bagging, at least) when applying ML to practical problems
• Does reduce ‘comprehensibility’ of models – but see work by Craven & Shavlik on ‘rule extraction’
• Increases runtime, but cycles are usually much cheaper than examples (and easily parallelized)
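Before turning to decision stumps, here is a compact sketch of the “reweight examples each cycle” idea behind AdaBoost mentioned above, using depth-1 trees (decision stumps, the next topic) as the weak learner. It assumes binary labels coded as -1/+1 in NumPy arrays; the use of scikit-learn stumps and the function names adaboost and predict_boosted are illustrative choices, not taken from the lecture or from Freund & Schapire’s original presentation.

```python
# Sketch of AdaBoost-style example reweighting with decision stumps.
# Assumes X, y are NumPy arrays and labels are coded as -1/+1.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, rounds=50):
    N = len(X)
    w = np.full(N, 1.0 / N)                  # start with uniform example weights
    stumps, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y])           # weighted training error
        if err >= 0.5:                       # weak learner no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)       # wrong examples get heavier, correct ones lighter
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict_boosted(stumps, alphas, X):
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)                   # weighted vote of the stumps
```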
Decision “Stumps” (formerly part of HW; try on your own!)
• Holte (ML journal) compared:
  – Decision trees with only one decision (decision stumps) vs
  – Trees produced by C4.5 (with its pruning algorithm used)
• Decision ‘stumps’ do remarkably well on UC Irvine data sets
  – Archive too easy? Some datasets seem to be
• Decision stumps are a ‘quick and dirty’ control for comparing to new algorithms
  – But ID3/C4.5 easy to use and probably a better control

C4.5 Compared to 1R (‘Decision Stumps’)
See Holte paper in Machine Learning for key (eg, HD = heart disease)

Test-set Accuracy
Dataset   C4.5     1R
BC        72.0%    68.7%
CH        99.2%    68.7%
GL        63.2%    67.6%
G2        74.3%    53.8%
HD        73.6%    72.9%
HE        81.2%    76.3%
HO        83.6%    81.0%
HY        99.1%    97.2%
IR        93.8%    93.5%
LA        77.2%    71.5%
LY        77.5%    70.7%
MU        100.0%   98.4%
SE        97.7%    95.0%
SO        97.5%    81.0%
VO        95.6%    95.2%
V1        89.4%    86.8%
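As a concrete illustration of the ‘quick and dirty control’ idea, the sketch below fits a one-decision stump and an unrestricted decision tree on the same data and compares held-out accuracy. The dataset and split are arbitrary choices for illustration, and a depth-1 scikit-learn tree only approximates Holte’s 1R (which buckets the values of a single feature); it is not the algorithm behind the table above.

```python
# Quick-and-dirty baseline: a one-decision stump vs an unrestricted tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)      # one decision only
full_tree = DecisionTreeClassifier().fit(X_train, y_train)             # unrestricted depth

print("stump test accuracy:     %.3f" % stump.score(X_test, y_test))
print("full-tree test accuracy: %.3f" % full_tree.score(X_test, y_test))
```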