Chaitanya Sai Gaddam, Lingqiang Kong, Arun Ravindran
02/06/2006

Boosting: Preliminary Discussion

Proposal

We propose to discuss boosting in next week's class. Boosting is a general technique for improving the prediction accuracy of any classifier. It rests on the striking idea that weak learnability is equivalent to strong learnability: a crude classifier that does only slightly better than chance can be converted into one that models a data set with arbitrarily high accuracy. In particular, we would like to discuss the use of boosting to improve learning, its merits relative to a mixture of experts, and the relation between boosting and bagging. We also intend to take up the standard boosting algorithm (AdaBoost; Freund and Schapire), describe its implementation, and define the bounds on the training, testing, and generalization errors. Further, we would like to compare AdaBoost with support vector machines, game theory, and logistic regression. Plausible extensions to AdaBoost could also be discussed. (This document is now up on the class wiki, along with links to the papers.)

Core Readings:

Richard Duda, Peter Hart and David Stork (2001). Pattern Classification. Second Edition. New York: Wiley. pp. 475-480.
Introduces bagging and boosting and gives the implementation specifics of AdaBoost without actually getting into the theory.

Yoav Freund and Robert Schapire (1999). A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5): 771-780.
Explains the implementation of AdaBoost, discusses the bound on the generalization error of the final hypothesis, and compares the margins on the training samples to the margins in support vector machines.

Robert Schapire (2003). The boosting approach to machine learning: An overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick and B. Yu (Editors), Nonlinear Estimation and Classification. Springer.
An overview paper. Given the above two papers, its discussion of the implementation of AdaBoost is redundant, but the comparison to game theory and logistic regression is interesting.

Saharon Rosset (2005). Robust boosting and its relation to bagging. (Unpublished.)
This paper explains the theory behind boosting and its relation to bagging. You could ignore the section on Huberized boosting. Rosset works with Hastie and Friedman, who developed LogitBoost (another popular boosting algorithm that is explicitly related to logistic regression).

Supplementary Readings:

Michael Collins, Robert Schapire and Yoram Singer (2002). Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1/2/3).
This paper explains how Bregman distances can be used to explain the relationship between AdaBoost and logistic regression.

William W. Cohen and Yoram Singer (1999). A simple, fast, and effective rule learner. In Proceedings of the Sixteenth National Conference on Artificial Intelligence.

Yoav Freund and Llew Mason (1999). The alternating decision tree learning algorithm. In Machine Learning: Proceedings of the Sixteenth International Conference.
* These two papers discuss how boosting can be related to a rule-learning algorithm, such as one used in building decision trees.

CN 710: Boosting (Class Discussion Notes)

What is the need for Boosting?

Consider the following introduction from one of the early papers on boosting:

A gambler, frustrated by persistent horse-racing losses and envious of his friends' winnings, decides to allow a group of his fellow gamblers to make bets on his behalf.
He decides he will wager a fixed sum of money in every race, but that he will apportion his money among his friends based on how well they are doing. Certainly, if he knew psychically ahead of time which of his friends would win the most, he would naturally have that friend handle all his wagers. Lacking such clairvoyance, however, he attempts to allocate each race's wager in such a way that his total winnings for the season will be reasonably close to what he would have won had he bet everything with the luckiest of his friends… Returning to the horse racing story, suppose now that the gambler grows weary of choosing among the experts and instead wishes to create a computer program that will accurately predict the winner of a horse race based on the usual information (number of races recently won by each horse, betting odds for each horse, etc.).

Boosting promises to improve performance by simply pooling together better-than-average classifiers. How can this be? Consider a finite circle-and-square data set, with the circle and the square centered at a non-origin point in the Cartesian coordinate frame. Let our component classifiers be single-layer perceptrons with zero bias. The space of resulting decision boundaries (hypotheses) is then the set of lines of all possible orientations passing through the origin. Can boosting help these impaired classifiers combine to achieve arbitrarily high accuracy? (Given the data set and the construction of the classifiers, each classifier can achieve greater than 50% accuracy on the data set.)

Other Discussion Points

- Bagging and boosting to improve classifier performance
- Comparison with mixture of experts/classifiers
- AdaBoost: algorithm, bounds, training error, testing error, generalization error
- Comparison of boosting with support vector machines, game theory, logistic regression
- Versions of AdaBoost: RealBoost, SoftBoost, multi-class, etc.
- Theoretical proof, in terms of the loss function

Issues:

- How to choose the type of component classifiers?
- How does boosting deal with overfitting?
- How does the initial selection of data points for the first classifier affect performance?
- How does it handle noise?
- What does it mean to say "loss function"? Why an exponential loss function in AdaBoost?
- What is the relation to "forward stagewise additive modeling"?
- What does data-set dependency in empirical comparisons of boosting, bagging, and other classifiers mean, when there is no formal proof?

A General Summary of AdaBoost

Applications of boosting:

- Optical character recognition (OCR) (post offices, banks), object recognition in images
- Web-page classification (search engines), email filtering, document retrieval
- Bioinformatics (analysis of gene-array data, protein classification, tumor classification with gene-expression data, etc.)
- Speech recognition, automatic .mp3 sorting
- Etc.

Why? Characteristics of boosting cases:

- The spam-filter example: gather a large collection of examples of spam and non-spam
- It is easy to find "rules of thumb" that are often correct ("buy now" ---> spam!)
- It is hard to find a single universal rule that is highly accurate
- The same holds in the horse-racing case!

The general procedure (a concrete sketch follows this list):

- Devise an algorithm for deriving rough rules of thumb
- Apply the procedure to a subset of the data
- Concentrate on the "hardest" examples, those most often misclassified by the previous rules of thumb
- Obtain a rule of thumb; combine it with the previous rules by taking a weighted sum score
- Repeat

This converts rough rules of thumb into a highly accurate prediction rule.
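To make the recipe concrete, here is a minimal Python sketch of the standard (discrete) AdaBoost loop, with decision stumps playing the role of the rough rules of thumb. The function names and the exhaustive stump learner are our own illustrative choices, not code from any of the readings.

```python
import numpy as np

def train_stump(X, y, w):
    """Exhaustively fit a one-feature threshold rule ("rule of thumb")
    that minimizes the weighted error under weights w. Labels y are in {-1, +1}."""
    n, d = X.shape
    best_stump, best_err = None, np.inf
    for j in range(d):
        for thresh in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] <= thresh, 1, -1)
                err = np.sum(w[pred != y])        # weighted misclassification rate
                if err < best_err:
                    best_err, best_stump = err, (j, thresh, sign)
    return best_stump, best_err

def stump_predict(stump, X):
    j, thresh, sign = stump
    return sign * np.where(X[:, j] <= thresh, 1, -1)

def adaboost(X, y, T=50):
    """Discrete AdaBoost: reweight the hardest examples each round and
    combine the weak rules by a weighted vote."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # start with uniform example weights
    stumps, alphas = [], []
    for _ in range(T):
        stump, err = train_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)   # guard against zero/one error
        alpha = 0.5 * np.log((1 - err) / err)  # vote weight of this rule of thumb
        pred = stump_predict(stump, X)
        w *= np.exp(-alpha * y * pred)         # up-weight misclassified examples
        w /= w.sum()                           # renormalize to a distribution
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted-vote prediction of the combined classifier."""
    scores = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(scores)

# Example usage (labels must be -1/+1):
#   stumps, alphas = adaboost(X_train, y_train, T=50)
#   y_hat = adaboost_predict(stumps, alphas, X_test)
```

The reweighting step is exactly the "concentrate on the hardest examples" idea: examples the current rule misclassifies have their weights multiplied up, so the next rule of thumb is fit mostly to them.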
Regularization: how to determine the number of boosting iterations T?
If T is too large ---> poor generalization, i.e. overfitting.
Answer: T is usually selected by validation.

*The problem of overfitting
In practice, AdaBoost often tends not to overfit (Breiman 96; Cortes and Drucker 97; etc.). This observation led to the margin theory (Schapire, Freund, Bartlett and Lee 98), which is based on loose generalization bounds.

**Caveat! The margin for boosting is not the same as the margin for SVMs.
AdaBoost was invented before the margin theory, so the question remained: does AdaBoost maximize the margin?

Empirical results on the convergence of AdaBoost:
- AdaBoost seemed to maximize the margin in the limit (Grove and Schuurmans 98, and others).
- AdaBoost generates a margin that is at least ½ρ, where ρ is the maximum margin (Schapire, Freund, Bartlett, and Lee 98).
This seems very much like "yes"…

Hundreds of papers using AdaBoost were published between 1997 and 2004, even though these fundamental convergence properties were not understood. Even after 7 years, the problem was still open.

A recent answer:
Theorem (Rudin, Daubechies, Schapire 04). AdaBoost may converge to a margin that is significantly below the maximum.

So, does AdaBoost choose λ_final so that the margin

    µ(f) := min_i (M λ_final)_i / ||λ_final||_1

is maximized (here M_ij = y_i h_j(x_i) records whether weak classifier h_j is correct on example i, and λ is the vector of weak-classifier weights)? That is, does AdaBoost maximize the margin? No!
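For reference, the two notions of margin contrasted in the caveat above can be written side by side. This is a standard formulation from the boosting literature rather than a passage from the assigned readings; the symbols follow the notation used above (λ the weak-classifier weights, M_ij = y_i h_j(x_i), w a zero-bias linear classifier as in the perceptron example).

```latex
% Margins of a training example (x_i, y_i), using the notation from above.
% Standard definitions, stated as a sketch rather than quoted from the readings.

% Boosting margin of the combined classifier f(x) = \sum_j \lambda_j h_j(x):
% the weights are normalized by their l_1 norm.
\[
\mu_{\mathrm{boost}}(f) \;=\; \min_i \frac{(M\lambda)_i}{\|\lambda\|_1}
                        \;=\; \min_i \frac{y_i \sum_j \lambda_j h_j(x_i)}{\sum_j |\lambda_j|}
\]

% SVM (geometric) margin of a zero-bias linear classifier w:
% normalization is by the l_2 norm instead.
\[
\mu_{\mathrm{SVM}}(w) \;=\; \min_i \frac{y_i\, w^{\top} x_i}{\|w\|_2}
\]

% Schapire, Freund, Bartlett and Lee (98): AdaBoost attains a margin of at
% least \rho/2, where \rho is the maximum achievable boosting margin.
% Rudin, Daubechies and Schapire (04): AdaBoost need not converge to the
% maximum margin, so the answer to "does AdaBoost maximize the margin?" is no.
```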