From: AAAI-96 Proceedings. Copyright © 1996, AAAI (www.aaai.org). All rights reserved.

Bagging, Boosting, and C4.5

J. R. Quinlan
University of Sydney
Sydney, Australia 2006
quinlan@cs.su.oz.au

Abstract

Breiman's bagging and Freund and Schapire's boosting are recent methods for improving the predictive power of classifier learning systems. Both form a set of classifiers that are combined by voting, bagging by generating replicated bootstrap samples of the data, and boosting by adjusting the weights of training instances. This paper reports results of applying both techniques to a system that learns decision trees and testing on a representative collection of datasets. While both approaches substantially improve predictive accuracy, boosting shows the greater benefit. On the other hand, boosting also produces severe degradation on some datasets. A small change to the way that boosting combines the votes of learned classifiers reduces this downside and also leads to slightly better results on most of the datasets considered.

Introduction

Designers of empirical machine learning systems are concerned with such issues as the computational cost of the learning method and the accuracy and intelligibility of the theories that it constructs. Much of the research in learning has tended to focus on improved predictive accuracy, so that the performance of new systems is often reported from this perspective. It is easy to understand why this is so - accuracy is a primary concern in all applications of learning and is easily measured (as opposed to intelligibility, which is more subjective), while the rapid increase in computers' performance/cost ratio has de-emphasized computational issues in most applications. (For extremely large datasets, however, learning time can remain the dominant issue (Catlett 1991; Chan and Stolfo 1995).)

In the active subarea of learning decision tree classifiers, examples of methods that improve accuracy are:

- Construction of multi-attribute tests using logical combinations (Ragavan and Rendell 1993), arithmetic combinations (Utgoff and Brodley 1990; Heath, Kasif, and Salzberg 1993), and counting operations (Murphy and Pazzani 1991; Zheng 1995).
- Use of error-correcting codes when there are more than two classes (Dietterich and Bakiri 1995).
- Decision trees that incorporate classifiers of other kinds (Brodley 1993; Ting 1994).
- Automatic methods for setting learning system parameters (Kohavi and John 1995).

On typical datasets, all have been shown to lead to more accurate classifiers at the cost of additional computation that ranges from modest to substantial.

There has recently been renewed interest in increasing accuracy by generating and aggregating multiple classifiers. Although the idea of growing multiple trees is not new (see, for instance, (Quinlan 1987; Buntine 1991)), the justification for such methods is often empirical. In contrast, two new approaches for producing and using several classifiers are applicable to a wide variety of learning systems and are based on theoretical analyses of the behavior of the composite classifier.

The data for classifier learning systems consists of attribute-value vectors or instances. Both bootstrap aggregating or bagging (Breiman 1996) and boosting (Freund and Schapire 1996a) manipulate the training data in order to generate different classifiers. Bagging produces replicate training sets by sampling with replacement from the training instances.
Boosting uses all instances at each repetition, but maintains a weight for each instance in the training set that reflects its importance; adjusting the weights causes the learner to focus on different instances and so leads to different classifiers. In either case, the multiple classifiers are then combined by voting to form a composite classifier. In bagging, each component classifier has the same vote, while boosting assigns different voting strengths to component classifiers on the basis of their accuracy.

This paper examines the application of bagging and boosting to C4.5 (Quinlan 1993), a system that learns decision tree classifiers. After a brief summary of both methods, comparative results on a substantial number of datasets are reported. Although boosting generally increases accuracy, it leads to a deterioration on some datasets; further experiments probe the reason for this. A small change to boosting in which the voting strengths of component classifiers are allowed to vary from instance to instance shows still further improvement. The final section summarizes the (sometimes tentative) conclusions reached in this work and outlines directions for further research.

Bagging and Boosting

We assume a given set of N instances, each belonging to one of K classes, and a learning system that constructs a classifier from a training set of instances. Bagging and boosting both construct multiple classifiers from the instances; the number T of repetitions or trials will be treated as fixed, although Freund and Schapire (1996a) note that this parameter could be determined automatically by cross-validation. The classifier learned on trial t will be denoted as Ct while C* is the composite (bagged or boosted) classifier. For any instance x, Ct(x) and C*(x) are the classes predicted by Ct and C* respectively.

Bagging

For each trial t = 1, 2, ..., T, a training set of size N is sampled (with replacement) from the original instances. This training set is the same size as the original data, but some instances may not appear in it while others appear more than once. The learning system generates a classifier Ct from the sample and the final classifier C* is formed by aggregating the T classifiers from these trials. To classify an instance x, a vote for class k is recorded by every classifier for which Ct(x) = k, and C*(x) is then the class with the most votes (ties being resolved arbitrarily).

Using CART (Breiman, Friedman, Olshen, and Stone 1984) as the learning system, Breiman (1996) reports results of bagging on seven moderate-sized datasets. With the number of replicates T set at 50, the average error of the bagged classifier C* ranges from 0.57 to 0.94 of the corresponding error when a single classifier is learned. Breiman introduces the concept of an order-correct classifier-learning system as one that, over many training sets, tends to predict the correct class of a test instance more frequently than any other class. An order-correct learner may not produce optimal classifiers, but Breiman shows that aggregating classifiers produced by an order-correct learner results in an optimal classifier. Breiman notes: "The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy."
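As a concrete illustration of the bagging procedure just described, the following Python fragment is a minimal sketch rather than the code used in these experiments: scikit-learn's CART-style DecisionTreeClassifier stands in for C4.5, class labels are assumed to be encoded as integers 0..K-1, and the names bagged_fit and bagged_predict are introduced here purely for exposition.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier   # stand-in for C4.5

    def bagged_fit(X, y, T=10, seed=0):
        """Learn T trees, each from a bootstrap sample of the N training instances."""
        rng = np.random.default_rng(seed)
        N = len(y)
        trees = []
        for _ in range(T):
            idx = rng.integers(0, N, size=N)   # sample N instances with replacement
            trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return trees

    def bagged_predict(trees, X):
        """Each tree casts one vote; C*(x) is the class with the most votes."""
        votes = np.array([t.predict(X) for t in trees])             # shape (T, len(X))
        return np.array([np.bincount(col).argmax() for col in votes.T])

With T set to 10, as in the experiments reported below, learning the composite classifier costs roughly ten times as much as learning a single tree.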
Boosting

The version of boosting investigated in this paper is AdaBoost.M1 (Freund and Schapire 1996a). Instead of drawing a succession of independent bootstrap samples from the original instances, boosting maintains a weight for each instance - the higher the weight, the more the instance influences the classifier learned. At each trial, the vector of weights is adjusted to reflect the performance of the corresponding classifier, with the result that the weight of misclassified instances is increased. The final classifier also aggregates the learned classifiers by voting, but each classifier's vote is a function of its accuracy.

Let w_x^t denote the weight of instance x at trial t where, for every x, w_x^1 = 1/N. At each trial t = 1, 2, ..., T, a classifier Ct is constructed from the given instances under the distribution wt (i.e., as if the weight w_x^t of instance x reflects its probability of occurrence). The error et of this classifier is also measured with respect to the weights, and consists of the sum of the weights of the instances that it misclassifies. If et is greater than 0.5, the trials are terminated and T is altered to t-1. Conversely, if Ct correctly classifies all instances so that et is zero, the trials terminate and T becomes t. Otherwise, the weight vector w^{t+1} for the next trial is generated by multiplying the weights of instances that Ct classifies correctly by the factor βt = et/(1-et) and then renormalizing so that Σx w_x^{t+1} equals 1. The boosted classifier C* is obtained by summing the votes of the classifiers C1, C2, ..., CT, where the vote for classifier Ct is worth log(1/βt) units.

Provided that et is always less than 0.5, Freund and Schapire prove that the error rate of C* on the given examples under the original (uniform) distribution w1 approaches zero exponentially quickly as T increases. A succession of "weak" classifiers {Ct} can thus be boosted to a "strong" classifier C* that is at least as accurate as, and usually much more accurate than, the best weak classifier on the training data. Of course, this gives no guarantee of C*'s generalization performance on unseen instances; Freund and Schapire suggest the use of mechanisms such as Vapnik's (1983) structural risk minimization to maximize accuracy on new data.
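The weight-update and voting rules above translate almost directly into code. The sketch below uses the same assumptions as the bagging fragment (scikit-learn trees as a stand-in for C4.5, integer class labels); the function names and the shallow max_depth, used here only to keep the component classifiers from fitting the training data perfectly on the first trial, are choices made for this illustration and are not part of AdaBoost.M1 itself.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier   # stand-in for C4.5

    def adaboost_m1_fit(X, y, T=10):
        """AdaBoost.M1 as described above: returns the classifiers Ct and their votes log(1/beta_t)."""
        N = len(y)
        w = np.full(N, 1.0 / N)                  # w_x^1 = 1/N for every instance x
        classifiers, votes = [], []
        for t in range(T):
            C_t = DecisionTreeClassifier(max_depth=3).fit(X, y, sample_weight=w)
            miss = C_t.predict(X) != y
            e_t = w[miss].sum()                  # weighted error under the distribution w^t
            if e_t >= 0.5:                       # e_t > 0.5 ends the trials; at exactly 0.5 the vote would be zero
                break
            if e_t == 0.0:                       # Ct is perfect on the training data: keep it and stop
                classifiers.append(C_t)
                votes.append(np.log(N))          # beta_t = 0 would give an unbounded vote; a large finite stand-in
                break
            beta_t = e_t / (1.0 - e_t)
            w[~miss] *= beta_t                   # shrink the weights of correctly classified instances
            w /= w.sum()                         # renormalize so the weights sum to 1
            classifiers.append(C_t)
            votes.append(np.log(1.0 / beta_t))   # Ct's vote is worth log(1/beta_t) units
        return classifiers, votes

    def adaboost_m1_predict(classifiers, votes, X, K):
        """C*(x) is the class receiving the largest total vote."""
        totals = np.zeros((len(X), K))
        for C_t, v in zip(classifiers, votes):
            totals[np.arange(len(X)), C_t.predict(X)] += v
        return totals.argmax(axis=1)

Note that, unlike the resampling variant discussed at the end of the paper, this sketch gives every training instance to the learner on every trial and lets the weights do the work through sample_weight.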
Requirements for Boosting and Bagging

These two methods for utilizing multiple classifiers make different assumptions about the learning system. As above, bagging requires that the learning system should not be "stable", so that small changes to the training set lead to different classifiers. Breiman also notes that "poor predictors can be transformed into worse ones" by bagging. Boosting, on the other hand, does not preclude the use of learning systems that produce poor predictors, provided that their error on the given distribution can be kept below 50%. However, boosting implicitly requires the same instability as bagging; if Ct is the same as Ct-1, the weight adjustment scheme has the property that et = 0.5. Although Freund and Schapire's specification of AdaBoost.M1 does not force termination when et = 0.5, βt = 1 in this case, so that w^{t+1} = wt and all classifiers from Ct on have votes with zero weight in the final classification. Similarly, an overfitting learner that produces classifiers in total agreement with the training data would cause boosting to terminate at the first trial.

Experiments

C4.5 was modified to produce new versions incorporating bagging and boosting as above. (C4.5's facility to deal with fractional instances, required when some attributes have missing values, is easily adapted to handle the instance weights w_x^t used by boosting.) These versions, referred to below as bagged C4.5 and boosted C4.5, have been evaluated on a representative collection of datasets from the UCI Machine Learning Repository. The 27 datasets, summarized in the Appendix, show considerable diversity in size, number of classes, and number and type of attributes.

The parameter T governing the number of classifiers generated was set at 10 for these experiments. Breiman (1996) notes that most of the improvement from bagging is evident within ten replications, and it is interesting to see the performance improvement that can be bought by a single order of magnitude increase in computation. All C4.5 parameters had their default values, and pruned rather than unpruned trees were used to reduce the chance that boosting would terminate prematurely with et equal to zero.

Ten complete 10-fold cross-validations were carried out with each dataset. (In a 10-fold stratified cross-validation, the training instances are partitioned into 10 equal-sized blocks with similar class distributions; each block in turn is then used as test data for the classifier generated from the remaining nine blocks.) The results of these trials appear in Table 1.

Table 1: Comparison of C4.5 and its bagged and boosted versions.

Dataset         C4.5     |    Bagged C4.5         |    Boosted C4.5        | Boosted vs Bagged
                err (%)  |  err (%)   w-l   ratio |  err (%)   w-l   ratio |   w-l    ratio
anneal           7.67    |   6.25    10-0   .814  |   4.73    10-0   .617  |  10-0    .758
audiology       22.12    |  19.29     9-0   .872  |  15.71    10-0   .710  |  10-0    .814
auto            17.66    |  19.66     2-8  1.113  |  15.22     9-1   .862  |   9-1    .774
breast-w         5.28    |   4.23     9-0   .802  |   4.09     9-0   .775  |   7-2    .966
chess            8.55    |   8.33     6-2   .975  |   4.59    10-0   .537  |  10-0    .551
colic           14.92    |  15.19     0-6  1.018  |  18.83     0-10 1.262  |   0-10  1.240
credit-a        14.70    |  14.13     8-2   .962  |  15.64     1-9  1.064  |   0-10  1.107
credit-g        28.44    |  25.81    10-0   .908  |  29.14     2-8  1.025  |   0-10  1.129
diabetes        25.39    |  23.63     9-1   .931  |  28.18     0-10 1.110  |   0-10  1.192
glass           32.48    |  27.01    10-0   .832  |  23.55    10-0   .725  |   9-1    .872
heart-c         22.94    |  21.52     7-2   .938  |  21.39     8-0   .932  |   5-4    .994
heart-h         21.53    |  20.31     8-1   .943  |  21.05     5-4   .978  |   3-6   1.037
hepatitis       20.39    |  18.52     9-0   .908  |  17.68    10-0   .867  |   6-1    .955
hypo              .48    |    .45     7-2   .928  |    .36     9-1   .746  |   9-1    .804
iris             4.80    |   5.13     2-6  1.069  |   6.53     0-10 1.361  |   0-8   1.273
labor           19.12    |  14.39    10-0   .752  |  13.86     9-1   .725  |   5-3    .963
letter          11.99    |   7.51    10-0   .626  |   4.66    10-0   .389  |  10-0    .621
lymphography    21.69    |  20.41     8-2   .941  |  17.43    10-0   .804  |  10-0    .854
phoneme         19.44    |  18.73    10-0   .964  |  16.36    10-0   .842  |  10-0    .873
segment          3.21    |   2.74     9-1   .853  |   1.87    10-0   .583  |  10-0    .684
sick             1.34    |   1.22     7-1   .907  |   1.05    10-0   .781  |   9-1    .861
sonar           25.62    |  23.80     7-1   .929  |  19.62    10-0   .766  |  10-0    .824
soybean          7.73    |   7.58     6-3   .981  |   7.16     8-2   .926  |   8-1    .944
splice           5.91    |   5.58     9-1   .943  |   5.43     9-0   .919  |   6-4    .974
vehicle         27.09    |  25.54    10-0   .943  |  22.72    10-0   .839  |  10-0    .889
vote             5.06    |   4.37     9-0   .864  |   5.29     3-6  1.046  |   1-9   1.211
waveform        27.33    |  19.77    10-0   .723  |  18.53    10-0   .678  |   8-2    .938
average         15.66    |  14.11           .905  |  13.36           .847  |          .930

For each dataset, the first column shows C4.5's mean error rate over the ten cross-validations. The second section contains similar results for bagging, i.e., the class of a test instance is determined by voting multiple C4.5 trees, each obtained from a bootstrap sample as above.
The next figures are the number of complete cross-validations in which bagging gives better or worse results respectively than C4.5, ties being omitted. This section also shows the ratio of the error rate using bagging to the error rate using C4.5 - a value less than 1 represents an improvement due to bagging. Similar results for boosting are compared to C4.5 in the third section and to bagging in the fourth.

[Figure 1: Comparison of bagging and boosting on two datasets (chess and colic), showing error rate as a function of the number of trials T, up to T=50.]

It is clear that, over these 27 datasets, both bagging and boosting lead to markedly more accurate classifiers. Bagging reduces C4.5's classification error by approximately 10% on average and is superior to C4.5 on 24 of the 27 datasets. Boosting reduces error by 15%, but improves performance on 21 datasets and degrades performance on six. Using a two-tailed sign test, both bagging and boosting are superior to C4.5 at a significance level better than 1%. When bagging and boosting are compared head to head, boosting leads to a greater reduction in error and is superior to bagging on 20 of the 27 datasets (significant at the 2% level).

The effect of boosting is more erratic, however, and leads to a 36% increase in error on the iris dataset and 26% on colic. Bagging is less risky: its worst performance is on the auto dataset, where the error rate of the bagged classifier is 11% higher than that of C4.5. The difference is highlighted in Figure 1, which compares bagging and boosting on two datasets, chess and colic, as a function of the number of trials T. For T=1, boosting is identical to C4.5 and both are almost always better than bagging - they use all the given instances while bagging employs a sample of them with some omissions and some repetitions. As T increases, the performance of bagging usually improves, but boosting can lead to a rapid degradation (as in the colic dataset).

Why Does Boosting Sometimes Fail?

A further experiment was carried out in order to better understand why boosting sometimes leads to a deterioration in generalization performance. Freund and Schapire (1996a) put this down to overfitting - a large number of trials T allows the composite classifier C* to become very complex.

As discussed earlier, the objective of boosting is to construct a classifier C* that performs well on the training data even when its constituent classifiers Ct are weak. A simple alteration attempts to avoid overfitting by keeping T as small as possible without impacting this objective. AdaBoost.M1 stops when the error of any Ct drops to zero, but does not address the possibility that C* might correctly classify all the training data even though no Ct does. Further trials in this situation would seem to offer no gain - they will increase the complexity of C* but cannot improve its performance on the training data.
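Continuing the earlier AdaBoost.M1 sketch, the alteration amounts to one extra test per trial; the fragment below is illustrative only and assumes the number of classes K is available inside the fitting loop.

    # Inside the trial loop of adaboost_m1_fit, after Ct and its vote have been
    # appended, stop as soon as the composite classifier C* is already perfect
    # on the training data:
    if np.all(adaboost_m1_predict(classifiers, votes, X, K) == y):
        break   # further trials add complexity to C* without improving its fit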
The experiments of the previous section were repeated with T=10 as before, but adding this further condition for stopping before all trials are complete. In many cases, C4.5 requires only three boosted trials to produce a classifier C* that performs perfectly on the training data; the average number of trials over all datasets is now 4.9. Despite using fewer trials, and thus being less prone to overfitting, C4.5's generalization performance is worse. The overfitting avoidance strategy results in lower cross-validation accuracy on 17 of the datasets, higher on six, and unchanged on four, a degradation significant at better than the 5% level. Average error over the 27 datasets is 13% higher than that reported for boosting in Table 1.

These results suggest that the undeniable benefits of boosting are not attributable just to producing a composite classifier C* that performs well on the training data. They also call into question the hypothesis that overfitting is sufficient to explain boosting's failure on some datasets, since much of the benefit realized by boosting seems to be caused by overfitting.

Changing the Voting Weights

Freund and Schapire (1996a) explicitly consider the use by AdaBoost.M1 of confidence estimates provided by some learning systems. When instance x is classified by Ct, let Ht(x) be a number between 0 and 1 that represents some informal measure of the reliability of the prediction Ct(x). Freund and Schapire suggest using this estimate to give a more flexible measure of classifier error.

An alternative use of the confidence estimate Ht is in combining the predictions of the classifiers {Ct} to give the final prediction C*(x) of the class of instance x. Instead of using the fixed weight log(1/βt) for the vote of classifier Ct, it seems plausible to allow the voting weight of Ct to vary in response to the confidence with which x is classified. C4.5 can be "tweaked" to yield such a confidence estimate. If a single leaf is used by Ct to classify an instance x as belonging to class k = Ct(x), let S denote the set of training instances that are mapped to the leaf, and Sk the subset of them that belong to class k. The confidence of the prediction that instance x belongs to class k can then be estimated by the Laplace ratio

    Ht(x) = (N × Σ_{i∈Sk} w_i^t + 1) / (N × Σ_{i∈S} w_i^t + 2).

(When x has unknown values for some attributes, C4.5 can use several leaves in making a prediction. A similar confidence estimate can be constructed for such situations.) Note that the confidence measure Ht(x) is still determined relative to the boosted distribution wt, not to the original uniform distribution of the instances.

The above experiments were repeated with a modified form of boosting, the only change being the use of Ht(x) rather than log(1/βt) as the voting weight of Ct when classifying instance x. Results show improvement on 25 of the 27 datasets, the same error rate on one dataset, and a higher error rate on only one of the 27 datasets (chess). Average error rate is approximately 3% less than that obtained with the original voting weights. This modification is necessarily ad hoc, since the confidence estimate Ht has only an intuitive meaning. However, it will be interesting to experiment with other voting schemes, and to see whether any of them can be used to give error bounds similar to those proved for the original boosting method.
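To make the modified voting concrete, the sketch below extends the earlier AdaBoost.M1 fragments. It relies on the leaf indices that scikit-learn trees expose through apply(), and the helper names leaf_class_weights and confidence_weighted_predict are introduced here for illustration; it is an approximation of the scheme described above, not the C4.5 modification used in the experiments. The per-leaf class weights must be recorded with the distribution wt that was in force when Ct was learned.

    import numpy as np

    def leaf_class_weights(C_t, X, y, w, K):
        """For each leaf of Ct, sum the boosted weights w^t of the training
        instances of each class that reach that leaf (the sets S and Sk)."""
        table = {}
        for leaf, cls, wt in zip(C_t.apply(X), y, w):
            table.setdefault(leaf, np.zeros(K))[cls] += wt
        return table

    def confidence_weighted_predict(classifiers, leaf_tables, X, K, N):
        """Vote of Ct for its predicted class k is the Laplace ratio
        Ht(x) = (N * sum_{i in Sk} w_i + 1) / (N * sum_{i in S} w_i + 2)."""
        totals = np.zeros((len(X), K))
        for C_t, table in zip(classifiers, leaf_tables):
            preds = C_t.predict(X)
            leaves = C_t.apply(X)
            for j in range(len(X)):
                per_class = table[leaves[j]]
                k = preds[j]
                totals[j, k] += (N * per_class[k] + 1.0) / (N * per_class.sum() + 2.0)
        return totals.argmax(axis=1)

Here leaf_tables would be built during fitting, one call to leaf_class_weights per trial, using the weight vector of that trial.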
Conclusion

Trials over a diverse collection of datasets have confirmed that boosted and bagged versions of C4.5 produce noticeably more accurate classifiers than the standard version. Boosting and bagging both have a sound theoretical base and also have the advantage that the extra computation they require is known in advance - if T classifiers are generated, then both require T times the computational effort of C4.5. In these experiments, a 10-fold increase in computation buys an average reduction of between 10% and 19% of the classification error. In many applications, improvements of this magnitude would be well worth the computational cost. In some cases the improvement is dramatic - for the largest dataset (letter) with 20,000 instances, modified boosting reduces C4.5's classification error from 12% to 4.5%.

Boosting seems to be more effective than bagging when applied to C4.5, although the performance of the bagged C4.5 is less variable than that of its boosted counterpart. If the voting weights used to aggregate component classifiers into a boosted classifier are altered to reflect the confidence with which individual instances are classified, better results are obtained on almost all the datasets investigated. This adjustment is decidedly ad hoc, however, and undermines the theoretical foundations of boosting to some extent.

A better understanding of why boosting sometimes fails is a clear desideratum at this point. Freund and Schapire put this down to overfitting, although the degradation can occur at very low values of T, as shown in Figure 1. In some cases in which boosting increases error, I have noticed that the class distributions across the weight vectors wt become very skewed. With the iris dataset, for example, the initial weights of the three classes are equal, but the weight vector w5 of the fifth trial has them as setosa=2%, versicolor=75%, and virginica=23%. Such skewed weights seem likely to lead to an undesirable bias towards or against predicting some classes, with a concomitant increase in error on unseen instances. This is especially damaging when, as in this case, the classifier derived from the skewed distribution has a high voting weight. It may be possible to modify the boosting approach and its associated proofs so that weights are adjusted separately within each class without changing overall class weights.

Since this paper was written, Freund and Schapire (1996b) have also applied AdaBoost.M1 and bagging to C4.5 on 27 datasets, 18 of which are used in this paper. Their results confirm that the error rates of boosted and bagged classifiers are significantly lower than those of single classifiers. However, they find bagging much more competitive with boosting, being superior on 11 datasets, equal on four, and inferior on 12. Two important differences between their experiments and those reported here might account for this discrepancy. First, Freund and Schapire use a much higher number of boosting and bagging trials, T=100, than the T=10 of this paper. Second, they did not modify C4.5 to use weighted instances, instead resampling the training data in a manner analogous to bagging, but using w_x^t as the probability of selecting instance x at each draw on trial t. This resampling negates a major advantage enjoyed by boosting over bagging, viz. that all training instances are used to produce each constituent classifier.

Acknowledgements

Thanks to Manfred Warmuth and Rob Schapire for a stimulating tutorial on Winnow and boosting. This research has been supported by a grant from the Australian Research Council.
Appendix: Description of Datasets

Name            Cases    Classes
anneal            898       6
audiology         226       6
auto              205       6
breast-w          699       2
chess             551       2
colic             368       2
credit-a          690       2
credit-g        1,000       2
diabetes          768       2
glass             214       6
heart-c           303       2
heart-h           294       2
hepatitis         155       2
hypo            3,772       5
iris              150       3
labor              57       2
letter         20,000      26
lymphography      148       4
phoneme         5,438      47
segment         2,310       7
sick            3,772       2
sonar             208       2
soybean           683      19
splice          3,190       3
vehicle           846       4
vote              435       2
waveform          300       3

References

Breiman, L. 1996. Bagging predictors. Machine Learning, forthcoming.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. 1984. Classification and Regression Trees. Belmont, CA: Wadsworth.

Brodley, C. E. 1993. Addressing the selective superiority problem: automatic algorithm/model class selection. In Proceedings 10th International Conference on Machine Learning, 17-24. San Francisco: Morgan Kaufmann.

Buntine, W. L. 1991. Learning classification trees. In Hand, D. J. (ed), Artificial Intelligence Frontiers in Statistics, 182-201. London: Chapman & Hall.

Catlett, J. 1991. Megainduction: a test flight. In Proceedings 8th International Workshop on Machine Learning, 596-599. San Francisco: Morgan Kaufmann.

Chan, P. K., and Stolfo, S. J. 1995. A comparative evaluation of voting and meta-learning on partitioned data. In Proceedings 12th International Conference on Machine Learning, 90-98. San Francisco: Morgan Kaufmann.

Dietterich, T. G., and Bakiri, G. 1995. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research 2: 263-286.

Freund, Y., and Schapire, R. E. 1996a. A decision-theoretic generalization of on-line learning and an application to boosting. Unpublished manuscript, available from the authors' home pages ("http://www.research.att.com/orgs/ssr/people/{yoav,schapire}"). An extended abstract appears in Computational Learning Theory: Second European Conference, EuroCOLT '95, 23-37, Springer-Verlag, 1995.

Freund, Y., and Schapire, R. E. 1996b. Experiments with a new boosting algorithm. Unpublished manuscript.

Heath, D., Kasif, S., and Salzberg, S. 1993. Learning oblique decision trees. In Proceedings 13th International Joint Conference on Artificial Intelligence, 1002-1007. San Francisco: Morgan Kaufmann.

Kohavi, R., and John, G. H. 1995. Automatic parameter selection by minimizing estimated error. In Proceedings 12th International Conference on Machine Learning, 304-311. San Francisco: Morgan Kaufmann.

Murphy, P. M., and Pazzani, M. J. 1991. ID2-of-3: constructive induction of M-of-N concepts for discriminators in decision trees. In Proceedings 8th International Workshop on Machine Learning, 183-187. San Francisco: Morgan Kaufmann.

Quinlan, J. R. 1987. Inductive knowledge acquisition: a case study. In Quinlan, J. R. (ed), Applications of Expert Systems. Wokingham, UK: Addison Wesley.

Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. San Mateo: Morgan Kaufmann.

Ragavan, H., and Rendell, L. 1993. Lookahead feature construction for learning hard concepts. In Proceedings 10th International Conference on Machine Learning, 252-259. San Francisco: Morgan Kaufmann.

Ting, K. M. 1994. The problem of small disjuncts: its remedy in decision trees. In Proceedings 10th Canadian Conference on Artificial Intelligence, 91-97.

Utgoff, P. E., and Brodley, C. E. 1990. An incremental method for finding multivariate splits for decision trees. In Proceedings 7th International Conference on Machine Learning, 58-65. San Francisco: Morgan Kaufmann.
Vapnik, V. 1983. Estimation of Dependences Based on Empirical Data. New York: Springer-Verlag.

Zheng, Z. 1995. Constructing nominal X-of-N attributes. In Proceedings 14th International Joint Conference on Artificial Intelligence, 1064-1070. San Francisco: Morgan Kaufmann.