Example: Weather Forecast (Two heads are better than one)
[Figure: the daily predictions of five individual forecasters, each wrong on some days (marked X), compared against reality; combining the forecasts by majority vote tracks reality better than any single forecaster. Picture source: Carla Gomez]

Majority Vote Model
• Majority vote
  – Choose the class predicted by more than ½ of the classifiers
  – If no class is predicted by more than ½ of the classifiers, return an error
  – Why and when does this work?

Example
• Majority vote
• Suppose we have 5 completely independent classifiers…
  – If accuracy is 70% for each classifier, the majority vote errs only when 3 or more of the 5 classifiers err:
    (.3)^5 + 5(.3)^4(.7) + 10(.3)^3(.7)^2 ≈ 0.163
  – so the majority vote classifier is in error 16.3% of the time
  – With 101 such classifiers, the majority vote classifier is in error only about 0.1% of the time
  – What happens if each classifier's accuracy drops below 50%? Try 30%.

Majority Vote Model
• Let p be the probability that an individual classifier makes an error
• Assume that classifier errors are independent
• The probability that exactly k of the n classifiers make an error is
  $\binom{n}{k} p^k (1-p)^{n-k}$
• Therefore the probability that the majority vote classifier is in error is
  $\sum_{k > n/2} \binom{n}{k} p^k (1-p)^{n-k}$
  (a quick numerical check of these figures appears after the Adaptive Boosting illustration below)

Value of Ensembles
• “No Free Lunch” Theorem
  – No single algorithm wins all the time!
• When combining multiple independent decisions, each of which is at least more accurate than random guessing, random errors cancel each other out and correct decisions are reinforced
• Human ensembles are demonstrably better
  – How many jelly beans are in the jar?
  – Who Wants to be a Millionaire: “Ask the audience”

What is Ensemble Learning?
• Ensemble: a collection of base learners
  – Each learns the target function
  – Combine their outputs for a final prediction
  – Often called “meta-learning”
• How can you get different learners?
• How can you combine learners?

Ensemble Method 1: Bagging
• Create ensembles by “bootstrap aggregation”, i.e., by repeatedly re-sampling the training data at random
• Bootstrap: draw n items from X with replacement

  Given a training set X of m instances
  For i = 1..T
      Draw a sample of size n < m from X uniformly with replacement
      Learn classifier Ci from sample i
  Final classifier is an unweighted vote of C1 .. CT

Will Bagging Improve Accuracy?
• It depends on the stability of the base classifiers
  – If small changes in the sample cause only small changes in the base-level classifier, then the ensemble will not be much better than the base classifiers
  – If small changes in the sample cause large changes in the classifier, and its error is < ½, then we will see a big improvement

Ensemble Method 2: Boosting
• Key idea: instead of re-sampling (as in bagging), re-weight the examples

  Let m be the number of learners to generate
  Initialize all training instances to have the same weight
  for i = 1..m
      generate learner hi
      increase the weights of the training instances that hi misclassifies
  Final classifier is a weighted vote of all m learners
  (where the weights are set based on training-set accuracy)

Adaptive Boosting
• Each rectangle corresponds to an example, with weight proportional to its height
[Figure: three boosting rounds h1, h2, h3; after each round the weights of the misclassified (✗) examples grow and the weights of the correctly classified (✓) examples shrink]
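The numbers in the majority-vote example above follow directly from the binomial formula on the Majority Vote Model slide. Here is a minimal Python sketch (not part of the original slides; the function name majority_vote_error is ours) that evaluates that sum:

  from math import comb

  def majority_vote_error(n, p):
      """Probability that a majority vote over n independent classifiers errs,
      when each classifier errs independently with probability p."""
      return sum(comb(n, k) * p**k * (1 - p)**(n - k)
                 for k in range(n // 2 + 1, n + 1))

  # 5 classifiers, each 70% accurate (error p = 0.3)
  print(majority_vote_error(5, 0.3))   # ~0.163, the 16.3% from the slide

  # 5 classifiers, each only 30% accurate (p = 0.7): voting now makes things worse
  print(majority_vote_error(5, 0.7))   # ~0.837

Applying the same function to 101 such classifiers shows the ensemble error becoming negligible, which is the point of the 101-classifier bullet; the p = 0.7 case answers the “below 50% accuracy” question: when the base classifiers are worse than random, the vote amplifies their errors instead of cancelling them.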
AdaBoost
  AdaBoost((x1, y1), …, (xm, ym), T, H)
    – T is the number of iterations
    – m is the number of instances
    – H is the base classifier algorithm

  D1(i) = 1/m for i = 1..m                  (start from uniform weights)
  for t = 1..T
      ht = H(x, y, Dt)                      (train a classifier on the weighted data)
      εt = Σ_{i : ht(xi) ≠ yi} Dt(i)        (weighted training error)
      if εt ≥ ½ then T := t − 1; break
      αt = ½ ln((1 − εt) / εt)
      for i = 1..m
          if ht(xi) ≠ yi then Dt+1(i) := Dt(i) · e^{αt} / Zt    (misclassified: weight increases)
          else                Dt+1(i) := Dt(i) · e^{−αt} / Zt   (correct: weight decreases)
      end
      (Zt normalizes Dt+1 so that it sums to 1)
  end
  return (h1, …, ht, α1, …, αt, t)

Example
• Let m = 20, so D1 = (.05, …, .05)
• Imagine that h1 is correct on 15 instances and incorrect on 5; then ε1 = 0.25 and α1 = ½ ln(0.75/0.25) ≈ 0.5493
• We re-weight as follows:
  – Correct instances: D2(i) = (.05 · e^{−0.5493}) / Z1 = 0.0289 / 0.8665 ≈ 0.0333
  – Incorrect instances: D2(i) = (.05 · e^{0.5493}) / Z1 = 0.0866 / 0.8665 ≈ 0.0999
• Note that after normalization Σi D2(i) = 1
  (a short sketch reproducing these numbers appears after the Methods for Combining Classifiers slide below)

Summary of Boosting and Bagging
• Bagging and boosting are called “homogeneous ensembles”
• Both use a single learning algorithm but manipulate the training data to learn multiple models
  – Data1 ≠ Data2 ≠ … ≠ DataT
  – Learner1 = Learner2 = … = LearnerT
• Methods for changing the training data:
  – Bagging: re-sample the training data
  – Boosting: re-weight the training data

Strong and Weak Learners
• A “strong learner” produces a classifier that can be made arbitrarily accurate.
• A “weak learner” produces a classifier that is merely more accurate than random guessing.
• Goal: create a single strong learner from a set of weak learners.

What is Ensemble Learning? (roadmap, revisited)
• How can you get different learners?
• How can you combine learners?

Where do Learners come from?
• Bagging
• Boosting
• Partitioning the data (you must have a large amount of it)
• Using different
  – feature subsets
  – algorithms
  – parameters of the same algorithm

Ensemble Method 3: Random Forests
• For i = 1 to T:
  – Take a bootstrap sample (bag) of the training data
  – Grow a random decision tree Ti
    • At each node, choose the split feature from a random subset of n features (n < total number of features)
    • Grow a full tree (do not prune)
• Classify new objects by taking a majority vote of the T random trees

What is Ensemble Learning? (roadmap, revisited)
• How can you combine learners?

Methods for Combining Classifiers
• Unweighted vote
• Weighted vote (the weights are typically a function of each classifier’s accuracy)
• Stacking – learning how to combine the classifiers

Introduction to Machine Learning and Data Mining, Carla Brodley
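To make the AdaBoost arithmetic concrete, here is a minimal Python sketch (not from the slides; the function name adaboost_round is ours) of a single re-weighting round. It reproduces the numbers in the worked example above, and the αt it computes is exactly the weight used in the final weighted vote:

  import math

  def adaboost_round(weights, correct):
      """One AdaBoost re-weighting step.
      weights -- current distribution D_t over the m training instances
      correct -- booleans, True where h_t classifies instance i correctly
      Returns alpha_t and the normalized distribution D_{t+1}."""
      eps = sum(w for w, c in zip(weights, correct) if not c)   # weighted error ε_t
      alpha = 0.5 * math.log((1 - eps) / eps)                   # vote weight α_t for h_t
      new_w = [w * math.exp(-alpha if c else alpha)             # shrink correct, grow incorrect
               for w, c in zip(weights, correct)]
      z = sum(new_w)                                            # normalizer Z_t
      return alpha, [w / z for w in new_w]

  # Worked example: m = 20, uniform D_1 = 0.05, h_1 right on 15 instances and wrong on 5
  D1 = [0.05] * 20
  alpha1, D2 = adaboost_round(D1, [True] * 15 + [False] * 5)
  print(alpha1)          # ~0.5493
  print(D2[0], D2[-1])   # ~0.0333 and ~0.1 (the slide's 0.0333 / 0.0999)
  print(sum(D2))         # 1.0

The final AdaBoost prediction is then sign(Σt αt · ht(x)), i.e. the “weighted vote” option from the combining slide, with more accurate rounds getting larger votes.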
The Netflix Prize
[Screenshot of netflixprize.com, 2009:]
  “The Netflix Prize sought to substantially improve the accuracy of predictions about how much someone is going to enjoy a movie based on their movie preferences. On September 21, 2009 we awarded the $1M Grand Prize to team ‘BellKor’s Pragmatic Chaos’. Read about their algorithm, check out team scores on the Leaderboard, and join the discussions on the Forum. We applaud all the contributors to this quest, which improves our ability to connect people to the movies they love.”

• The contest began in October 2006
• Supervised learning task
  – Training data is a set of users and the ratings (1, 2, 3, 4, or 5 stars) those users have given to movies
  – Construct a classifier that, given a user and an unrated movie, correctly classifies that movie as either 1, 2, 3, 4, or 5 stars
• $1 million prize for a 10% improvement over Netflix’s current movie recommender

“Our final solution (RMSE = 0.8712) consists of blending 107 individual results.”

Ensemble methods are the best performers…

[Netflix Prize Leaderboard, final standings:]
Grand Prize - RMSE = 0.8567 - Winning Team: BellKor’s Pragmatic Chaos

Rank | Team Name                           | Best Test Score | % Improvement | Best Submit Time
1    | BellKor’s Pragmatic Chaos           | 0.8567          | 10.06         | 2009-07-26 18:18:28
2    | The Ensemble                        | 0.8567          | 10.06         | 2009-07-26 18:38:22
3    | Grand Prize Team                    | 0.8582          | 9.90          | 2009-07-10 21:24:40
4    | Opera Solutions and Vandelay United | 0.8588          | 9.84          | 2009-07-10 01:12:31
5    | Vandelay Industries !               | 0.8591          | 9.81          | 2009-07-10 00:32:20
6    | PragmaticTheory                     | 0.8594          | 9.77          | 2009-06-24 12:06:56
7    | BellKor in BigChaos                 | 0.8601          | 9.70          | 2009-05-13 08:14:09
8    | Dace_                               | 0.8612          | 9.59          | 2009-07-24 17:18:43
9    | Feeds2                              | 0.8622          | 9.48          | 2009-07-12 13:11:51
10   | BigChaos                            | 0.8623          | 9.47          | 2009-04-07 12:33:59
11   | Opera Solutions                     | 0.8623          | 9.47          | 2009-07-24 00:34:07
12   | BellKor                             | 0.8624          | 9.46          | 2009-07-26 17:19:11

Progress Prize 2008 - RMSE = 0.8627 - Winning Team: BellKor in BigChaos
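The winning Netflix entries were blends of many individual models. As a toy illustration of the blending idea only (not the winning teams’ actual method; the data and the model descriptions below are made up), this sketch learns least-squares blending weights for three base predictors on a small hold-out set:

  import numpy as np

  # Hold-out true ratings and the predictions of three hypothetical base models.
  y_holdout = np.array([4.0, 3.0, 5.0, 2.0, 4.0])
  preds = np.column_stack([
      [3.8, 3.2, 4.6, 2.5, 3.9],   # e.g. a neighborhood-based recommender
      [4.2, 2.7, 4.9, 2.2, 4.3],   # e.g. a matrix-factorization model
      [3.5, 3.1, 4.4, 2.8, 3.7],   # e.g. a simple baseline predictor
  ])

  # Blending weights that minimize squared error of the weighted combination.
  weights, *_ = np.linalg.lstsq(preds, y_holdout, rcond=None)
  blended = preds @ weights
  rmse = float(np.sqrt(np.mean((blended - y_holdout) ** 2)))
  print(weights, rmse)   # the blend fits the hold-out ratings at least as well as any single model

On the set it is fit to, such a blend can do no worse than the best individual model, since using any one model alone is itself one of the admissible weightings; the practical gains come when the base models make complementary errors, which is the same independence argument as in the majority-vote slides.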