Today’s Topics
• Ensembles
• Decision Forests (actually, Random Forests)
• Bagging and Boosting
• Decision Stumps
Ensembles
(Bagging, Boosting, and all that)
Old View
– Learn one good model (naïve Bayes, k-NN, neural net, d-tree, SVM, etc.)
New View
– Learn a good set of models
Probably the best example of the interplay between
‘theory & practice’ in machine learning
Ensembles of Neural Networks
(or any supervised learner)
[Diagram: INPUT feeds several Networks in parallel; their outputs go to a Combiner, which produces the OUTPUT]
• Ensembles often produce accuracy gains of
5-10 percentage points!
• Can combine “classifiers” of various types
– E.g., decision trees, rule sets, neural networks, etc. (a minimal voting sketch follows)
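A minimal sketch of the Combiner as a simple unweighted vote over heterogeneous classifiers (binary 0/1 labels assumed; the model choices, variable names, and use of scikit-learn are illustrative, not part of the original slides):

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def majority_vote(models, X):
    # Each row of `preds` holds one model's 0/1 predictions for the examples in X.
    preds = np.array([m.predict(X) for m in models])
    # Output 'true' (1) whenever at least half of the models say 'true'.
    return (preds.mean(axis=0) >= 0.5).astype(int)

# Usage (train_X, train_y, test_X are placeholders):
# models = [DecisionTreeClassifier().fit(train_X, train_y),
#           GaussianNB().fit(train_X, train_y),
#           KNeighborsClassifier().fit(train_X, train_y)]
# ensemble_preds = majority_vote(models, test_X)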
Three Explanations of Why
Ensembles Help
1. Statistical (sample effects)
2. Computational (limited cycles for search)
3. Representational (wrong hypothesis space)
[Figure: for each case, the region of concept space considered is drawn, with the true concept, the learned models, and the search path marked]
From: Dietterich, T. G. (2002). Ensemble Learning. In The Handbook of Brain Theory and Neural
Networks, Second edition (M. A. Arbib, Ed.), Cambridge, MA: The MIT Press, pp. 405-408
Combining
Multiple Models
Three ideas for combining predictions
1. Simple (unweighted) votes
• Standard choice
2. Weighted votes
• E.g., weight by tuning-set accuracy
3. Learn a combining function
• ‘Stacked generalization’ (Wolpert) – see the sketch below
• Prone to overfitting?
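A minimal sketch of idea 3, ‘stacked generalization’: base-model predictions on held-out tuning data become the input features of a second-level learner. The variable names and the logistic-regression combiner below are illustrative assumptions, not part of the original slides.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_stacker(base_models, tune_X, tune_y):
    # Column j holds base model j's predictions on the tuning set.
    meta_features = np.column_stack([m.predict(tune_X) for m in base_models])
    # The combining function is itself learned (here, a logistic regression).
    return LogisticRegression().fit(meta_features, tune_y)

def stacked_predict(base_models, stacker, X):
    meta_features = np.column_stack([m.predict(X) for m in base_models])
    return stacker.predict(meta_features)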
Random Forests
(Breiman, Machine Learning 2001; related to Ho, 1995)
A variant of something called BAGGING (‘multi-sets’)
Algorithm
Let N = # of examples
F = # of features
i = some number << F
Repeat k times
(1) Draw with replacement N examples, put in train set
(2) Build d-tree, but in each recursive call
– Choose (w/o replacement) i features
– Choose best of these i as the root of this (sub)tree
(3) Do NOT prune
In HW2, we’ll give you 101 ‘bootstrapped’ samples of the
Thoracic Surgery Dataset
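A minimal sketch of the training loop above, assuming X and y are numpy arrays and using scikit-learn's DecisionTreeClassifier for step (2); the function and parameter names are placeholders:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(X, y, k=101, i=3, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    forest = []
    for _ in range(k):
        idx = rng.integers(0, N, size=N)               # (1) draw N examples WITH replacement
        tree = DecisionTreeClassifier(max_features=i)  # (2) consider only i random features per split
        tree.fit(X[idx], y[idx])                       # (3) grown fully with no pruning (the default)
        forest.append(tree)
    return forest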
Using Random Forests
After training we have K decision trees
How to use on TEST examples?
Some variant of
If at least L of these K trees say ‘true’ then output ‘true’
How to choose L?
Use a tuning set to decide (see the sketch below)
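A minimal sketch of this test-time rule and of picking L with a tuning set (‘true’/‘false’ encoded as 1/0; variable names are placeholders):

import numpy as np

def forest_vote(forest, X, L):
    # Count, for each test example, how many of the K trees say 'true' (1).
    votes = np.sum([tree.predict(X) for tree in forest], axis=0)
    return votes >= L                      # output 'true' iff at least L trees agree

def choose_L(forest, tune_X, tune_y):
    K = len(forest)
    accuracies = {L: np.mean(forest_vote(forest, tune_X, L) == tune_y)
                  for L in range(1, K + 1)}
    return max(accuracies, key=accuracies.get)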
More on Random Forests
• Increasing i
– Increases correlation among individual trees (BAD)
– Also increases accuracy of individual trees (GOOD)
• Can also use a tuning set to choose a good value for i (see the sketch after this list)
• Overall, random forests
– Are very fast (e.g., 50K examples, 10 features, 10 trees/min on a 1 GHz CPU back in 2004)
– Deal well with a large # of features
– Reduce overfitting substantially; NO NEED TO PRUNE!
– Work very well in practice
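A minimal sketch of tuning i, here via scikit-learn's RandomForestClassifier, whose max_features parameter plays the role of i; the data variables and candidate values are placeholders:

from sklearn.ensemble import RandomForestClassifier

def choose_i(train_X, train_y, tune_X, tune_y, candidate_is):
    scores = {}
    for i in candidate_is:
        rf = RandomForestClassifier(n_estimators=101, max_features=i)
        rf.fit(train_X, train_y)
        scores[i] = rf.score(tune_X, tune_y)   # tuning-set accuracy for this i
    return max(scores, key=scores.get)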
A Relevant Early
Paper on ENSEMBLES
Hansen & Salamon, PAMI:12, 1990
– If (a) the combined predictors have errors that are
independent of one another
– And (b) the probability that any given model correctly predicts any
given test-set example is > 50%, then

lim (test-set error rate of an ensemble of N predictors) → 0 as N → ∞
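A small numerical check of this limit under assumptions (a) and (b): with independent errors and per-model accuracy p > 0.5, the majority vote errs only when more than half of the N models err, and that probability shrinks as N grows (the choice p = 0.6 below is illustrative):

from scipy.stats import binom

def majority_vote_error(N, p=0.6):
    # P(more than N/2 of the N independent models are wrong); each is wrong with prob 1 - p.
    return binom.sf(N / 2, N, 1 - p)

for N in (1, 11, 101, 1001):
    print(N, majority_vote_error(N))   # error rate heads toward 0 as N grows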
Some More
Relevant Early Papers
• Schapire, Machine Learning:5, 1990 (‘Boosting’)
– If you have an algorithm that gets > 50% accuracy on any
distribution of examples, you can create an algorithm
that gets > (100% − ε), for any ε > 0
– Need an infinite (or at least very large) source
of examples
– Later extensions (e.g., AdaBoost)
address this weakness
• Also see Wolpert, ‘Stacked Generalization,’
Neural Networks, 1992
Some Methods for Producing
‘Uncorrelated’ Members
of an Ensemble
• K times, randomly choose (with replacement)
N examples from a training set of size N
• Give each training set to a standard ML algorithm
– ‘Bagging’ by Breiman (Machine Learning, 1996)
– Want unstable algorithms (so learned models vary)
• Reweight examples each cycle (if wrong,
increase weight; else decrease weight)
– ‘AdaBoosting’ by Freund & Schapire (1995, 1996) – see the reweighting sketch below
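A minimal sketch of that reweighting step in the style of AdaBoost (function and variable names are placeholders; `predictions` holds the current round's model outputs and `labels` the true labels, with `weights` already normalized):

import numpy as np

def adaboost_reweight(weights, predictions, labels):
    wrong = (predictions != labels)
    err = np.sum(weights[wrong])                 # weighted error of this round's model
    alpha = 0.5 * np.log((1.0 - err) / err)      # this round's model gets vote weight alpha
    # Scale weights up on mistakes, down on correct examples, then renormalize.
    weights = weights * np.exp(np.where(wrong, alpha, -alpha))
    return weights / weights.sum(), alpha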
Empirical Studies
(from Freund & Schapire; reprinted in Dietterich’s AI Magazine paper)
[Scatter plots: error rate of C4.5 (an ID3 successor) vs. error rate of bagged C4.5, and vs. error rate of AdaBoost-ed C4.5; each point is one data set]
Boosting and Bagging helped almost always!
On average, Boosting slightly better?
Some More Methods for
Producing “Uncorrelated”
Members of an Ensemble
• Directly optimize accuracy + diversity
– Opitz & Shavlik (1995; used genetic algorithms)
– Melville & Mooney (2004-5)
• Vary the number of hidden units in a neural
network, the k in k-NN, the tie-breaking
scheme, the example ordering, the ML algorithm itself, etc.
– Various people
– See the 2005-2008 papers of Rich Caruana’s group
for large-scale empirical studies of ensembles
Boosting/Bagging/etc
Wrapup
• An easy-to-use and usually highly
effective technique
– Always consider it (Bagging, at least) when
applying ML to practical problems
• Does reduce the ‘comprehensibility’ of models
– Though see the ‘rule extraction’ work of Craven & Shavlik
• Increases runtime, but cycles are usually
much cheaper than examples
(and ensembles are easily parallelized)
Decision “Stumps”
(formerly part of HW; try on your own!)
• Holte (Machine Learning journal) compared:
– Decision trees with only one decision (‘decision stumps’)
vs
– Trees produced by C4.5 (with its pruning algorithm used)
• Decision ‘stumps’ do remarkably well on
UC Irvine data sets
– Archive too easy? Some datasets seem to be
• Decision stumps are a ‘quick and dirty’ control for
comparing to new algorithms (see the sketch below)
– But ID3/C4.5 is easy to use and probably a better control
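A minimal sketch of using a decision stump as that quick-and-dirty control, via a depth-1 tree in scikit-learn (the data variables are placeholders):

from sklearn.tree import DecisionTreeClassifier

stump = DecisionTreeClassifier(max_depth=1)    # one decision = a 'stump'
# stump.fit(train_X, train_y)
# baseline_accuracy = stump.score(test_X, test_y)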
C4.5 Compared to 1R (‘Decision Stumps’)
See the Holte paper in Machine Learning for the dataset key (e.g., HD = heart disease)

Test-set Accuracy
Dataset    C4.5     1R
BC         72.0%    68.7%
CH         99.2%    68.7%
GL         63.2%    67.6%
G2         74.3%    53.8%
HD         73.6%    72.9%
HE         81.2%    76.3%
HO         83.6%    81.0%
HY         99.1%    97.2%
IR         93.8%    93.5%
LA         77.2%    71.5%
LY         77.5%    70.7%
MU        100.0%    98.4%
SE         97.7%    95.0%
SO         97.5%    81.0%
VO         95.6%    95.2%
V1         89.4%    86.8%