"It can not be emphasized enough that no claim whatsoever is being made in this paper that all algorithms are equivalent in practice, in the real world. In particular, no claim is being made that one should not use cross-validation in the real world."
    -- Wolpert, 1994

Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection
Ron Kohavi, Stanford University (ronnyk@CS.Stanford.EDU)
IJCAI-95


Motivation & Summary of Results

You have ten induction algorithms. Which one is the best for a given dataset?
Answer: run them all and pick the one with the highest estimated accuracy.

1. Which accuracy estimation method? Resubstitution? Holdout? Cross-validation? Bootstrap?
2. How sure would you be of your choice?
3. If you had spare CPU cycles, would you always choose leave-one-out?

For accuracy estimation: stratified 10 to 20-fold CV.
For model selection: multiple runs of 3-5 fold CV.


Talk Outline

➀ Accuracy estimation: the problem.
➁ Accuracy estimation methods:
   - Holdout & random subsampling.
   - Cross-validation.
   - Bootstrap.
➂ Experiments, recent experiments.
➃ Summary.


Basic Definitions

1. Let D be a dataset of n labelled instances.
2. Assume D is an i.i.d. sample from some underlying distribution on the set of labelled instances.
3. Let I be an induction algorithm that produces a classifier from data D.


The Problem

Estimate the accuracy of the classifier induced from the dataset. The accuracy of a classifier is the probability of correctly classifying a randomly selected instance from the distribution.

All resampling methods (holdout, cross-validation, bootstrap) use a uniform distribution on the given dataset as the distribution from which they sample.

[Figure: the real-world distribution F generates the dataset; the resampling methods draw Sample 1, Sample 2, ..., Sample k from the empirical distribution F' on that dataset.]


Holdout

Holdout partitions the data into two mutually exclusive subsets called a training set and a test set (commonly 2/3 and 1/3). The training set is given to the inducer, and the induced classifier is tested on the test set.

[Figure: dataset split into a 2/3 training set and a 1/3 test set; inducer -> classifier -> evaluation on the test set.]

Pessimistic estimator, because only a portion of the data is given to the inducer for training.


Confidence Bounds

The classification of each test instance can be viewed as a Bernoulli trial: a correct or incorrect prediction. Let S be the number of correct classifications on a test set of size h; then S is distributed binomially, and S/h is approximately normal with mean acc and variance acc(1 - acc)/h.

    Pr{ -z < (acc_h - acc) / sqrt(acc (1 - acc) / h) < z }        (1)


Random Subsampling

Repeating the holdout many times is called random subsampling.

Common mistake: one cannot compute the standard deviation of the samples' mean accuracy in random subsampling, because the test instances are not independent. The t-tests you commonly see with significance levels are usually wrong.

Example: a random concept (50% for each class). One induction algorithm always predicts 0, the other always predicts 1. Given a sample of 100 instances, chances are it won't be split 50/50, but something like 53/47. Leaving 1/3 out gives s.d. = 8.7%.


Audience comments?

(Pedro Domingos et al., please keep the comments to the end.)

Accuracy as defined here is the prediction accuracy on instances sampled from the parent population, not from the small dataset we have at hand.

t-tests with random subsampling test the hypothesis that one algorithm is better than another, assuming the distribution is uniform and each instance in the dataset has probability 1/n. This is not an interesting hypothesis, except that we know how to compute it.
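As an illustration of the holdout estimate and the normal-approximation bound of equation (1), here is a minimal Python sketch. The train_and_predict helper is an assumed stand-in for any inducer, and the 2/3-1/3 split is only the common default mentioned above.

    import numpy as np

    def holdout_accuracy(X, y, train_and_predict, test_fraction=1/3, z=1.96, seed=0):
        """Holdout accuracy estimate with the normal-approximation bound of equation (1).

        train_and_predict(X_train, y_train, X_test) is an assumed helper that
        trains some inducer and returns predicted labels for X_test.
        """
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))
        h = int(round(len(y) * test_fraction))         # test-set size h
        test_idx, train_idx = idx[:h], idx[h:]
        preds = train_and_predict(X[train_idx], y[train_idx], X[test_idx])
        acc = float(np.mean(preds == y[test_idx]))     # S / h
        half_width = z * np.sqrt(acc * (1.0 - acc) / h)
        return acc, (acc - half_width, acc + half_width)

With z = 1.96 the interval covers roughly 95% under the normal approximation. As the Random Subsampling slide warns, repeating this on the same dataset does not produce independent estimates, so averaging the intervals across repetitions is not a valid confidence statement.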
Cross-Validation

In k-fold cross-validation, the dataset D is randomly split into k mutually exclusive subsets (the folds) D_1, D_2, ..., D_k of approximately equal size. The inducer is trained and tested k times; each time t in {1, 2, ..., k} it is trained on D \ D_t and tested on D_t.

[Figure: the training set is split into three folds; each fold is held out in turn, the inducer is trained on the remaining folds, the resulting classifiers c_1, c_2, c_3 are evaluated on their held-out folds, and the evaluations are averaged.]


Standard Deviation

Two ways to look at cross-validation:

1. A method that works for stable inducers, i.e., inducers that do not change their predictions much when given similar datasets (see paper).
2. A method of computing the expected accuracy for a dataset of size n - n/k, where k is the number of folds.

Since each instance is used only once as a test instance, the standard deviation can be estimated as sqrt(acc_cv (1 - acc_cv) / n), where n is the number of instances in the dataset.


Failure of cross-validation / leave-one-out

Leave-one-out is n-fold cross-validation.

Fisher's iris dataset contains 50 instances of each class (three classes), leading one to expect that a majority inducer should have an accuracy of about 33%. When an instance is deleted from the dataset, its label is a minority in the training set; thus the majority inducer predicts one of the other two classes and always errs in classifying the test instance. The leave-one-out estimated accuracy for a majority inducer on the iris dataset is therefore 0%.


.632 Bootstrap

A bootstrap sample is created by sampling n instances uniformly from the data (with replacement). Given b, the number of bootstrap samples, let eps0_i be the accuracy estimate for bootstrap sample i on the instances not included in the sample. The .632 bootstrap estimate is defined as

    acc_boot = (1/b) * sum_{i=1..b} ( 0.632 * eps0_i + 0.368 * acc_s )

where acc_s is the resubstitution accuracy estimate on the full dataset.


Bootstrap failure

The .632 bootstrap fails to give the expected result when the classifier can perfectly fit the data (e.g., an unpruned decision tree or a one-nearest-neighbor classifier) and the dataset is completely random, say with two classes. The apparent (resubstitution) accuracy is 100%, and the eps0 accuracy is about 50%. Plugging these into the bootstrap formula, one gets an estimated accuracy of about 68.4%, far from the real accuracy of 50%.


Experimental design: Q & A

1. Which induction algorithms to use? C4.5 and Naive-Bayes. Reason: FAST inducers.
2. Which datasets to use? Seven datasets were chosen. Reasons:
   (a) Wide variety of domains.
   (b) At least 500 instances.
   (c) A learning curve that did not flatten too early.


Learning Curves

[Figure: learning curves (% accuracy vs. training-set size) for C4.5 and Naive-Bayes on Breast cancer, Chess endgames, Hypothyroid, Mushroom, Soybean (large), and Vehicle. Arrows indicate the sampling points.]
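Before turning to the experimental results, here is a minimal Python sketch of the .632 bootstrap estimate defined above. The number of samples b, the seed, and the train_and_predict helper (the same assumed stand-in as in the earlier sketch) are illustrative choices only.

    import numpy as np

    def bootstrap_632(X, y, train_and_predict, b=50, seed=0):
        """.632 bootstrap: acc_boot = (1/b) * sum_i (0.632*eps0_i + 0.368*acc_s)."""
        rng = np.random.default_rng(seed)
        n = len(y)
        # Resubstitution accuracy acc_s: train and test on the full dataset.
        acc_s = float(np.mean(train_and_predict(X, y, X) == y))
        estimates = []
        for _ in range(b):
            sample = rng.integers(0, n, size=n)          # n draws with replacement
            out = np.setdiff1d(np.arange(n), sample)     # instances left out of the sample
            if out.size == 0:                            # vanishingly rare for realistic n
                continue
            eps0 = float(np.mean(train_and_predict(X[sample], y[sample], X[out]) == y[out]))
            estimates.append(0.632 * eps0 + 0.368 * acc_s)
        return float(np.mean(estimates))

The failure case above falls out directly: a perfect-fit classifier on a random two-class dataset gives acc_s = 1.0 and eps0 of about 0.5, so every term is roughly 0.632 * 0.5 + 0.368 * 1.0 = 0.684.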
The Bias (CV)

The bias of a method that estimates a parameter theta is defined as the expected estimated value minus the value of theta. An unbiased estimation method is a method that has zero bias.

[Figure: C4.5: the bias of cross-validation with a varying number of folds (% accuracy vs. folds) on Mushroom, Hypothyroid, Chess, Soybean, Vehicle, and Rand. A negative number of folds k stands for leave-k-out. Error bars are 95% confidence intervals for the mean; the gray regions indicate 95% confidence intervals for the true accuracies.]


Bootstrap Bias

Bootstrap is right on the mark for three out of six datasets, but biased for Soybean and extremely biased for Vehicle and Rand.

[Figure: C4.5: the bias of bootstrap (% accuracy vs. number of bootstrap samples, 1 to 100) on the same datasets.]


The Variance of CV

The variance is high at the ends: two-fold and leave-{1,2}-out.

[Figure: standard deviation of cross-validation for C4.5 vs. number of folds on Vehicle, Soybean, Rand, Breast, Hypothyroid, Chess, and Mushroom; regular (left), stratified (right).]

Stratification slightly reduces both bias and variance.


The Variance of Bootstrap

[Figure: standard deviation of bootstrap for C4.5 (left) and Naive-Bayes (right) vs. number of bootstrap samples, on the same datasets.]


Multiple Times

To stabilize the cross-validation, we can execute CV multiple times, shuffling the instances each time.

[Figure: standard deviation of five-times-repeated CV (5 x CV) for C4.5 (left) and Naive-Bayes (right) vs. number of folds.]

The graphs show the effect on C4.5 and Naive-Bayes when CV was run five times. The variance that remains at low folds is due to the smaller training-set size.


Summary

1. Reviewed accuracy estimation methods: holdout, cross-validation, stratified CV, bootstrap. All methods fail in some cases.
2. Bootstrap has low variance, but is extremely biased.
3. With cross-validation, you can determine the bias/variance tradeoff that is appropriate, and run it multiple times to reduce the variance.
4. Standard deviations can be computed for cross-validation, but are incorrect for random subsampling (holdout).

Recommendation:
For accuracy estimation: stratified 10 to 20-fold CV.
For model selection: multiple runs of 3-5 fold CV (dynamic).


What is the "true" accuracy?

To estimate the true accuracy at the sampling points, we sampled 500 times and estimated the accuracy using the unseen instances (holdout).

Dataset       | no. of attr. | sample size / total size | no. of categories | C4.5       | Naive-Bayes
Breast cancer | 10           | 50/699                   | 2                 | 91.37±0.10 | 94.22±0.10
Chess         | 36           | 900/3196                 | 2                 | 98.19±0.03 | 86.80±0.07
Hypothyroid   | 25           | 400/3163                 | 2                 | 98.52±0.03 | 97.63±0.02
Mushroom      | 22           | 800/8124                 | 2                 | 99.36±0.02 | 94.54±0.03
Soybean large | 35           | 100/683                  | 19                | 70.49±0.22 | 79.76±0.14
Vehicle       | 18           | 100/846                  | 4                 | 60.11±0.16 | 46.80±0.16
Rand          | 20           | 100/3000                 | 2                 | 49.96±0.04 | 49.90±0.04
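To make the recommendation concrete, here is a minimal Python sketch of stratified k-fold cross-validation repeated several times with reshuffling. The values of k and the number of repetitions, and the train_and_predict helper, are the same illustrative assumptions as in the earlier sketches.

    import numpy as np

    def repeated_stratified_cv(X, y, train_and_predict, k=10, times=5, seed=0):
        """Stratified k-fold CV, repeated `times` times with reshuffling.

        Returns the mean accuracy over the runs and the spread across runs.
        """
        rng = np.random.default_rng(seed)
        run_accs = []
        for _ in range(times):
            # Stratify: deal the shuffled instances of each class round-robin
            # into k folds so class proportions stay roughly equal per fold.
            folds = [[] for _ in range(k)]
            for label in np.unique(y):
                for j, idx in enumerate(rng.permutation(np.flatnonzero(y == label))):
                    folds[j % k].append(idx)
            correct = 0
            for t in range(k):
                test_idx = np.array(folds[t])
                train_idx = np.concatenate([folds[j] for j in range(k) if j != t])
                preds = train_and_predict(X[train_idx], y[train_idx], X[test_idx])
                correct += int(np.sum(preds == y[test_idx]))
            run_accs.append(correct / len(y))            # each instance tested exactly once per run
        return float(np.mean(run_accs)), float(np.std(run_accs))

The per-run standard-deviation estimate from the Standard Deviation slide, sqrt(acc_cv (1 - acc_cv) / n), can be reported alongside the spread across the repeated runs.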
How Many Times Should We Repeat?

[Figure: standard deviation of multiple runs of two-fold CV for C4.5 (left) and Naive-Bayes (right) vs. number of runs (1 to 100), on Vehicle, Soybean, Rand, Breast, Chess, Hypothyroid, and Mushroom.]

Multiple runs with two-fold CV: the more, the better.