It cannot be emphasized enough that no claim whatsoever is
being made in this paper that all algorithms are equivalent in
practice, in the real world. In particular, no claim is being made
that one should not use cross-validation in the real world.
-- Wolpert, 1994
Cross-Validation and Bootstrap
for Accuracy Estimation and Model Selection
Ron Kohavi
Stanford University
IJCAI-95
(ronnyk@CS.Stanford.EDU)
Motivation & Summary of Results
You have ten induction algorithms. Which one is the best for a
given dataset?
Answer: run them all and pick the one with the highest estimated
accuracy.
1. Which accuracy estimation?
Resubstitution? Holdout? Cross-validation? Bootstrap?
2. How sure would you be of your choice?
3. If you had spare CPU cycles, would you always choose leave-one-out?
For accuracy estimation: stratified 10 to 20-fold CV.
For model selection: multiple runs of 3 to 5-fold CV.
Talk Outline
☞ ➀ Accuracy estimation: the problem.
➁ Accuracy estimation methods:
   - Holdout & random subsampling.
   - Cross-validation.
   - Bootstrap.
➂ Experiments, recent experiments.
➃ Summary.
Basic Definitions
1. Let D be a dataset of n labelled instances.
2. Assume D is an i.i.d. sample from some underlying distribution on the set of labelled instances.
3. Let I be an induction algorithm that produces a classifier
from data D.
The Problem
Estimate the accuracy of the classifier induced from the dataset.
The accuracy of a classifier is the probability of correctly classifying a randomly selected instance from the distribution.
All resampling methods (holdout, cross-validation, bootstrap)
will use a uniform distribution on the given dataset as a distribution from which they sample.
[Figure: the real world draws the dataset from distribution F; resampling methods draw samples 1, 2, ..., k from the empirical distribution F' over the dataset.]
Holdout
The holdout method partitions the data into two mutually exclusive subsets
called a training set and a test set. The training set is given to
the inducer, and the induced classifier is tested on the test set.
[Figure: the dataset is split into a training set (2/3) and a test set (1/3); the inducer builds a classifier from the training set, which is then evaluated on the test set.]
Pessimistic estimator because only a portion of the data is given
to the inducer for training.
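
Below is a minimal sketch of the holdout estimator, assuming the dataset is a
list of (features, label) pairs and a hypothetical induce function that returns
a classifier with a predict method (both interfaces are illustrative, not from
the original slides):

    import random

    def holdout_accuracy(data, induce, train_frac=2/3, seed=0):
        """Single holdout split: train on train_frac of the data,
        evaluate on the rest. `induce` is a hypothetical inducer."""
        rng = random.Random(seed)
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_frac)
        train, test = shuffled[:cut], shuffled[cut:]
        clf = induce(train)
        correct = sum(clf.predict(x) == y for x, y in test)
        return correct / len(test)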
Confidence Bounds
The classification of each test instance can be viewed as a
Bernoulli trial: correct or incorrect prediction.
Let S be the number of correct classifications on the test set of
size h; then S is distributed binomially and acc_h = S/h is approximately
normal with mean acc and variance acc(1 - acc)/h.

\[
\Pr\left\{ -z \;<\; \frac{\mathrm{acc}_h - \mathrm{acc}}{\sqrt{\mathrm{acc}(1-\mathrm{acc})/h}} \;<\; z \right\} \qquad (1)
\]
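
As a worked example of equation (1), a small sketch of the plug-in
normal-approximation bound (z = 1.96 for a 95% interval is an illustrative
choice; inverting the inequality exactly yields the Wilson interval instead):

    import math

    def accuracy_confidence_interval(s, h, z=1.96):
        """Approximate confidence interval for accuracy, given s correct
        classifications out of h test instances (plug-in variance)."""
        acc = s / h
        half_width = z * math.sqrt(acc * (1 - acc) / h)
        return acc - half_width, acc + half_width

    # e.g. 85 correct out of 100 gives roughly (0.78, 0.92)
    print(accuracy_confidence_interval(85, 100))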
Random Subsampling
Repeating the holdout many times is called random subsampling.
Common mistake:
One cannot compute the standard deviation of the samples' mean accuracy in random subsampling because the
test instances are not independent.
The t-tests you commonly see with significance levels are
usually wrong.
Example: random concept (50% for each class). One induction algorithm predicts 0, the other 1. Given a sample of 100
instances, chances are it won't be 50/50, but something like
53/47. Leaving 1/3 out (a test set of 33) gives s.d. = sqrt(0.5 · 0.5/33) ≈ 8.7%.
Audience comments?
Pedro Domingos et al., please keep the comments to the end.
Accuracy as defined here is the prediction accuracy on instances
sampled from the parent population, not from the small dataset
we have at hand.
A t-test with random subsampling tests the hypothesis that one
algorithm is better than another, assuming the distribution is
uniform and each instance in the dataset has probability 1/n.
This is not an interesting hypothesis, except that we know how
to compute it.
Cross Validation
In k-fold cross-validation, the dataset D is randomly split into k
mutually exclusive subsets (the folds) D_1, D_2, ..., D_k of approximately equal size.
The inducer is trained and tested k times; each time t ∈ {1, 2, ..., k},
it is trained on D \ D_t and tested on D_t.
[Figure: with k = 3, the training set is split into folds 1, 2, 3; each fold is held out once as the test set while the inducer trains on the other two, and the three Eval results are averaged.]
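
A minimal k-fold cross-validation sketch, under the same hypothetical
induce/predict interface as the holdout sketch above:

    import random

    def cross_validation_accuracy(data, induce, k=10, seed=0):
        """k-fold CV: every instance is tested exactly once."""
        rng = random.Random(seed)
        shuffled = data[:]
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        correct = 0
        for t in range(k):
            train = [x for i, fold in enumerate(folds) if i != t for x in fold]
            clf = induce(train)
            correct += sum(clf.predict(x) == y for x, y in folds[t])
        return correct / len(data)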
Standard Deviation
Two ways to look at cross-validation:
1. A method that works for stable inducers, i.e., inducers that
do not change their predictions much when given similar
datasets (see paper).
2. A method of computing the expected accuracy for a dataset
of size n - n/k, where k is the number of folds.
Since each instance is used only once as a test instance, the
standard deviation can be estimated as sqrt(acc_cv · (1 - acc_cv)/n),
where n is the number of instances in the dataset.
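
The standard-deviation estimate in code form (a one-line sketch; acc_cv and n
as defined on this slide):

    import math

    def cv_std_dev(acc_cv, n):
        """Estimated standard deviation of the CV accuracy estimate."""
        return math.sqrt(acc_cv * (1 - acc_cv) / n)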
Failure of cross-validation/leave-one-out
Leave-one-out is n-fold cross-validation.
Fisher's iris dataset contains 50 instances of each class (three
classes), leading one to expect that a majority inducer should
have accuracy about 33%.
When an instance is deleted from the dataset, its label is a
minority in the training set; thus the majority inducer predicts
one of the other two classes and always errs in classifying the
test instance.
The leave-one-out estimated accuracy for a majority inducer on the iris dataset is therefore 0%.
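
A compact sketch reproducing this failure; no real iris data is needed, only
three equally sized classes (the 0/1/2 labels are illustrative):

    from collections import Counter

    # 50 instances of each of three classes, as in Fisher's iris data.
    labels = [0] * 50 + [1] * 50 + [2] * 50

    correct = 0
    for i, true_label in enumerate(labels):
        train = labels[:i] + labels[i + 1:]          # leave instance i out
        majority = Counter(train).most_common(1)[0][0]
        correct += (majority == true_label)          # 49 < 50: never matches

    print(correct / len(labels))                     # 0.0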
.632 Bootstrap
A bootstrap sample is created by sampling n instances uniformly
from the data (with replacement).
Given a number b, the number of bootstrap samples, let ε_{0i} be
the accuracy estimate for bootstrap sample i on the instances
not included in the sample. The .632 bootstrap estimate is
defined as

\[
\mathrm{acc}_{\mathrm{boot}} = \frac{1}{b} \sum_{i=1}^{b} \left( 0.632 \cdot \varepsilon_{0i} + 0.368 \cdot \mathrm{acc}_s \right)
\]

where acc_s is the resubstitution accuracy estimate on the full
dataset.
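
A sketch of the .632 bootstrap under the same hypothetical induce/predict
interface; it assumes each bootstrap sample leaves at least one instance out
(virtually certain for realistic n):

    import random

    def bootstrap_632_accuracy(data, induce, b=50, seed=0):
        """.632 bootstrap: average 0.632*eps0_i + 0.368*acc_s over b samples."""
        rng = random.Random(seed)
        n = len(data)
        clf_full = induce(data)
        acc_s = sum(clf_full.predict(x) == y for x, y in data) / n  # resubstitution
        total = 0.0
        for _ in range(b):
            idx = [rng.randrange(n) for _ in range(n)]         # with replacement
            sample = [data[i] for i in idx]
            out = [data[i] for i in set(range(n)) - set(idx)]  # unseen instances
            clf = induce(sample)
            eps0 = sum(clf.predict(x) == y for x, y in out) / len(out)
            total += 0.632 * eps0 + 0.368 * acc_s
        return total / b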
Bootstrap failure
The .632 bootstrap fails to give the expected result when the
classifier can perfectly fit the data (e.g., an unpruned decision
tree or a one-nearest-neighbor classifier) and the dataset is completely random, say with two classes.
The apparent accuracy is 100%, and the ε0 accuracy is about
50%. Plugging these into the bootstrap formula gives
0.632 · 50% + 0.368 · 100% ≈ 68.4%, far from the real accuracy
of 50%.
Experimental design: Q & A
1. Which induction algorithm to use? C4.5 and Naive Bayes.
Reason: FAST inducers.
2. Which datasets to use? Seven datasets were chosen.
Reasons:
(a) Wide variety of domains.
(b) At least 500 instances.
(c) Learning curve that did not flatten too early.
Learning Curves
[Figure: learning curves (% accuracy vs. training-set size) for C4.5 and Naive-Bayes on Breast cancer, Chess endgames, Hypothyroid, Mushroom, Soybean (large), and Vehicle. Arrows indicate the sampling points.]
The bias (CV)
The bias of a method that estimates a parameter θ is defined as
the expected estimated value minus the true value of θ. An unbiased
estimation method is a method that has zero bias.
[Figure: two panels of estimated % accuracy vs. number of folds (2, 5, 10, 20, -5, -2, -1): Soybean, Vehicle, and Rand (left); Mushroom, Hypo, and Chess (right).]
C4.5: The bias of cross-validation with a varying number of folds. A negative
number of folds, -k, stands for leave-k-out. Error bars are 95% confidence
intervals for the mean. The gray regions indicate 95% confidence intervals
for the true accuracies.
Bootstrap Bias
Bootstrap is right on the mark for three out of six, but biased
for soybean and extremely biased for vehicle and rand.
[Figure: two panels of estimated % accuracy vs. number of bootstrap samples (1 to 100): Soybean, Vehicle, and Rand (left); Mushroom, Hypo, and Chess (right).]
C4.5: The bias of bootstrap.
The Variance of CV
The variance is high at the ends: two-fold and leave-{1,2}-out.
[Figure: standard deviation vs. number of folds for C4.5 on each dataset; regular CV (left), stratified CV (right).]
Stratification slightly reduces both bias and variance.
The Variance of Bootstrap
[Figure: standard deviation vs. number of bootstrap samples (1 to 100) for C4.5 (left) and Naive-Bayes (right) on each dataset.]
Multiple Times
To stabilize the cross-validation, we can execute CV multiple
times, shuffling the instances each time.
[Figure: standard deviation vs. number of folds when CV is run five times, for C4.5 (left) and Naive Bayes (right).]
The graphs show the effect on C4.5 and NB when CV was run
five times. The higher variance at low numbers of folds is due
to the smaller training-set size.
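
A sketch of the multiple-run procedure, reusing the cross_validation_accuracy
sketch from the cross-validation slide; varying the seed plays the role of
reshuffling the instances:

    import statistics

    def repeated_cv(data, induce, k=5, times=5):
        """Run k-fold CV `times` times with different shuffles;
        return the mean accuracy and its standard deviation."""
        accs = [cross_validation_accuracy(data, induce, k=k, seed=run)
                for run in range(times)]
        return statistics.mean(accs), statistics.stdev(accs)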
Summary
1. Reviewed accuracy estimation: holdout, cross-validation,
stratified CV, bootstrap. All methods fail in some cases.
2. Bootstrap has low variance, but is extremely biased.
3. With cross-validation, you can determine the bias/variance
tradeoff that is appropriate, and run multiple times to reduce
variance.
4. Standard deviations can be computed for cross-validation,
but are incorrect for random subsampling (holdout).
Recommendation
For accuracy estimation: stratified 10 to 20-fold CV.
For model selection: multiple runs of 3 to 5 folds (dynamic).
What is the "true" accuracy?
To estimate the true accuracy at the sampling points, we sampled 500 times and estimated the accuracy using the unseen
instances (holdout).
Dataset         no. of attr.   sample/total size   no. of categories   C4.5           Naive-Bayes
Breast cancer        10             50/699                 2           91.37 ± 0.10   94.22 ± 0.10
Chess                36            900/3196                2           98.19 ± 0.03   86.80 ± 0.07
Hypothyroid          25            400/3163                2           98.52 ± 0.03   97.63 ± 0.02
Mushroom             22            800/8124                2           99.36 ± 0.02   94.54 ± 0.03
Soybean large        35            100/683                19           70.49 ± 0.22   79.76 ± 0.14
Vehicle              18            100/846                 4           60.11 ± 0.16   46.80 ± 0.16
Rand                 20            100/3000                2           49.96 ± 0.04   49.90 ± 0.04
How Many Times Should We Repeat?
[Figure: standard deviation vs. number of repeated runs (1 to 100) of two-fold CV, for C4.5 (left) and Naive Bayes (right).]
Multiple runs with two-fold CV. The more, the better.