Chaitanya Sai Gaddam
Lingqiang Kong
Arun Ravindran
02/06/2006
Boosting: Preliminary Discussion Proposal
We propose to discuss boosting in next week's class. Boosting is a general technique for improving the prediction accuracy of any classifier. It rests on the striking idea that a crude classifier, one only slightly better than chance, is in principle equivalent to a classifier that models a data set with high accuracy. In particular, we would like to discuss the use of boosting in improving learning, its merits relative to a mixture of experts, and the relation between boosting and bagging.
We also intend to take up the standard boosting algorithm (AdaBoost; Freund and Schapire), describe its implementation, and define the bounds on training, testing and generalization errors. Further, we would like to compare AdaBoost with support vector machines, game theory and logistic regression. Plausible extensions to AdaBoost could also be discussed.
(This document is now up on the class wiki, along with links to the papers)
Core Readings:
Richard Duda, Peter Hart and David Stork (2001). Pattern Classification. Second Edition. New York: Wiley. pp. 475-480.
- Introduces bagging and boosting and gives the implementation specifics of AdaBoost without actually getting into the theory.
Yoav Freund and Robert Schapire (1999). A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5): 771-780.
- Explains the implementation of AdaBoost, discusses the bound on the generalization error of the final hypothesis, and compares the margins on the training samples with the margins in support vector machines.
Robert Schapire (2003). The boosting approach to machine learning: An overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick and B. Yu (Eds.), Nonlinear Estimation and Classification. Springer.
- An overview paper. Given the above two papers, the discussion of the implementation of AdaBoost is redundant, but the comparison to game theory and logistic regression is interesting.
Saharon Rosset (2005). Robust boosting and its relation to bagging. (Unpublished.)
- This paper explains the theory behind boosting and its relation to bagging. You could ignore the section on Huberized boosting. Rosset works with Hastie and Friedman, who developed LogitBoost (another popular boosting algorithm that is explicitly related to logistic regression).
Supplementary Readings:
Michael Collins, Robert Schapire and Yoram Singer (2002). Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1-3).
- This paper explains how Bregman distances can be used to explain the relationship between AdaBoost and logistic regression.
William W. Cohen and Yoram Singer. A simple, fast, and effective rule learner. In Proceedings
of the Sixteenth National Conference on Artificial Intelligence, 1999.
Yoav Freund and Llew Mason. The alternating decision tree learning algorithm. In
Machine Learning: Proceedings of the Sixteenth International Conference, 1999.
- These two papers discuss how boosting can be related to a rule-learning algorithm, such as the ones used in building decision trees.
CN 710: Boosting (Class Discussion Notes)
What is the need for Boosting?
Consider the following introduction from one of the early papers on boosting: A gambler, frustrated by persistent horseracing losses and envious of his friends' winnings, decides to allow a group of his fellow gamblers to make bets on his behalf. He decides he will wager a fixed sum of money in every race, but that he will apportion his money among his friends based on how well they are doing. Certainly, if he knew psychically ahead of time which of his friends would win the most, he would naturally have that friend handle all his wagers. Lacking such clairvoyance, however, he attempts to allocate each race's wager in such a way that his total winnings for the season will be reasonably close to what he would have won had he bet everything with the luckiest of his friends… Returning to the horse racing story, suppose now that the gambler grows weary of choosing among the experts and instead wishes to create a computer program that will accurately predict the winner of a horse race based on the usual information (number of races recently won by each horse, betting odds for each horse, etc.).
Boosting promises to improve performance by simply pooling together better-than-average classifiers. How can this be?
Consider a finite circle-and-square data set, with the circle and square centered at a non-origin point in the Cartesian coordinate frame. Let our component classifiers be single-layer perceptrons with zero bias. The space of resulting decision boundaries (hypotheses) will be lines of all possible orientations passing through the origin. Can boosting then help these impaired classifiers combine to achieve arbitrarily high accuracy? (Given the data set and the construction of the classifiers, each of the classifiers can achieve greater than 50% accuracy on the data set.)
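A minimal numpy sketch of this thought experiment (the data construction, the pool of candidate directions, and all names are our own illustration, not taken from the readings): weak hypotheses are restricted to zero-bias linear separators sign(w·x), and a standard AdaBoost reweighting loop picks among them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "circle and square" data: two noisy clusters centered away from the origin,
# labeled +1 and -1. (Illustrative construction, not taken from the readings.)
n = 200
center = np.array([3.0, 3.0])
X = np.vstack([center + np.array([1.5, 0.0]) + 0.7 * rng.standard_normal((n, 2)),
               center - np.array([1.5, 0.0]) + 0.7 * rng.standard_normal((n, 2))])
y = np.hstack([np.ones(n), -np.ones(n)])

# Hypothesis space: zero-bias perceptrons h(x) = sign(w . x) for unit vectors w,
# i.e. every decision boundary is a line through the origin.
angles = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)
H = np.sign(X @ W.T)          # predictions of every candidate on every point
H[H == 0] = 1

def adaboost(H, y, T=50):
    """Standard AdaBoost over a fixed pool of weak hypotheses (columns of H)."""
    m = len(y)
    D = np.full(m, 1.0 / m)                       # example weights
    alphas, chosen = [], []
    for _ in range(T):
        errs = D @ (H != y[:, None])              # weighted error of each hypothesis
        j = int(np.argmin(errs))
        eps = float(np.clip(errs[j], 1e-12, 1 - 1e-12))
        if eps >= 0.5:                            # nothing beats chance: stop
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        D *= np.exp(-alpha * y * H[:, j])         # up-weight misclassified examples
        D /= D.sum()
        alphas.append(alpha)
        chosen.append(j)
    return np.array(alphas), np.array(chosen, dtype=int)

alphas, chosen = adaboost(H, y)
vote = np.sign(H[:, chosen] @ alphas)             # weighted vote of chosen hypotheses
print("training accuracy of boosted vote:", float(np.mean(vote == y)))
print("best single zero-bias classifier:", float(np.max(np.mean(H == y[:, None], axis=0))))
```

The printout compares the boosted vote with the best single zero-bias classifier, so the question above can at least be probed empirically.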
Other Discussion Points
Bagging and boosting to improve classifier performance
Comparison with mixture of experts/classifiers
AdaBoost: algorithm, bounds, training error, testing error, generalization error
Comparison of boosting with support vector machines, game theory, logistic regression
Versions of AdaBoost: real boost, soft boost, multi-class, etc.
Theoretical proof, in terms of loss function
Issues:
How to choose the type of component classifiers?
How does boosting deal with overfitting?
How does the initial selection of data points for the first classifier affect performance?
How does it handle noise?
What does it mean to say “loss function”?
Why an exponential loss function in AdaBoost?
What is the relation to "forward stagewise additive modeling"?
What does the dataset dependence seen in empirical comparisons of boosting, bagging and other classifiers mean, given that there is no formal proof?
A General Summary of AdaBoost
Applications of Boosting:
- Optical character recognition (OCR) (post offices, banks), object recognition in images
- Webpage classification (search engines), email filtering, document retrieval
- Bioinformatics (analysis of gene array data, protein classification, tumor classification with gene expression data, etc.)
- Speech recognition, automatic .mp3 sorting
- Etc.
Why? Characteristics of problems suited to boosting, illustrated by a spam filter:
- it is easy to gather a large collection of examples of spam and non-spam
- it is easy to find "rules of thumb" that are often correct ("buy now" ---> spam!)
- it is hard to find a single universal rule that is highly accurate
- the same holds in the horse racing case!
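As a toy illustration of such a rule of thumb (the keyword and messages are invented for the example), a single keyword test is easy to write and often right, but nowhere near a complete filter on its own:

```python
def rule_of_thumb(message: str) -> int:
    """Predict +1 (spam) if the message contains 'buy now', else -1 (non-spam)."""
    return 1 if "buy now" in message.lower() else -1

# Tiny invented sample, just to show the rule is right more often than not.
messages = [("Buy now and save 90%!", 1), ("Lunch tomorrow?", -1),
            ("BUY NOW, limited offer", 1), ("Limited offer inside", 1)]
hits = sum(rule_of_thumb(m) == label for m, label in messages)
print(f"rule of thumb correct on {hits}/{len(messages)} examples")   # 3/4 here
```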
Devise an algorithm for deriving rough rules of thumb, then:
- apply the procedure to a subset of the examples
- concentrate on the "hardest" examples -- those most often misclassified by previous rules of thumb
- obtain a rule of thumb and combine it with the previous rules by taking a weighted sum (score)
- repeat
This converts rough rules of thumb into a highly accurate prediction rule; the corresponding update equations are written out below.
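In symbols, with D_t the distribution over training examples at round t and ε_t the weighted error of the round-t rule of thumb h_t, the standard AdaBoost updates (Freund and Schapire) are:

```latex
% Round-t weak-classifier weight, example-weight update, and final combined rule.
\alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}, \qquad
D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t\, y_i\, h_t(x_i)}}{Z_t}, \qquad
H(x) = \operatorname{sign}\Big(\sum_{t=1}^{T} \alpha_t\, h_t(x)\Big)
```

Here Z_t normalizes D_{t+1} back to a distribution; misclassified examples (those with y_i h_t(x_i) = -1) gain weight, which is exactly the "concentrate on the hardest examples" step above.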
Regularization:
How do we determine the number of boosting iterations T?
If T is too large ---> poor generalization, i.e. overfitting.
Answer: T is usually selected by validation (see the sketch below).
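A common way to do this in practice, sketched here with scikit-learn's AdaBoostClassifier and its staged predictions (the synthetic data and split sizes are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Placeholder data; in practice use the actual training set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit with a generous number of rounds, then keep the T with the best
# validation accuracy (staged_predict yields predictions after each round).
model = AdaBoostClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
val_acc = [np.mean(pred == y_val) for pred in model.staged_predict(X_val)]
best_T = int(np.argmax(val_acc)) + 1
print(f"selected T = {best_T} (validation accuracy {val_acc[best_T - 1]:.3f})")
```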
*The problem of overfitting
In practice, AdaBoost often tends not to overfit (Breiman 96, Cortes and Drucker 97, etc.). This observation led to the margin theory (Schapire, Freund, Bartlett and Lee 98), which explains the resistance to overfitting, although it is based on loose generalization bounds.
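The flavor of those bounds (Schapire, Freund, Bartlett and Lee 98), stated loosely and only up to constants and logarithmic factors, for a voted classifier f over a base class of VC-dimension d trained on m examples:

```latex
% Test error is controlled by the fraction of training points with margin below
% any threshold theta, plus a complexity term that does not grow with T.
\Pr_{\mathcal{D}}\!\big[\, y f(x) \le 0 \,\big]
\;\le\;
\Pr_{S}\!\big[\, y f(x) \le \theta \,\big]
\;+\; \tilde{O}\!\left( \sqrt{\frac{d}{m\,\theta^{2}}} \right)
\qquad \text{for all } \theta > 0 .
```

These bounds are loose in practice, which is the sense in which the margin theory explains rather than tightly predicts AdaBoost's behavior.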
**Caveat!
The margin for boosting is not the same as the margin for SVMs--- AdaBoost was invented before the margin theory.
The question remained: does AdaBoost maximize the margin?
Empirical results on convergence of AdaBoost:
AdaBoost seemed to maximize the margin in the limit (Grove and Schuurmans 98, and others).
AdaBoost generates a margin that is at least ½ρ, where ρ is the maximum margin. (Schapire,
Freund, Bartlett, and Lee 98)
---- Seems very much like “yes”…
Hundreds of papers using AdaBoost were published between 1997 and 2004, even though its fundamental convergence properties were not understood! Even after 7 years, this problem was still open!
A recent answer…
Theorem (Rudin, Daubechies and Schapire 04): AdaBoost may converge to a margin that is significantly below the maximum.
Does AdaBoost choose λ_final so that the margin µ(f) is maximized? That is, does AdaBoost maximize the margin? No!
Here the margin of the final combined classifier is

    µ(f) := min_i (Mλ_final)_i / ||λ_final||_1,

where λ_final is the final vector of weak-classifier weights and M is the matrix with M_ij = y_i h_j(x_i), recording whether weak classifier j is correct on example i.