Additive Logistic Regression: a Statistical View of Boosting

J. Friedman, T. Hastie, & R. Tibshirani
The Annals of Statistics, 2000
Outline
- Introduction
- A brief history of boosting
- Additive Models
- AdaBoost – an additive logistic regression model
- Simulation studies
Discrete AdaBoost
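A minimal sketch of the Discrete AdaBoost procedure, assuming scikit-learn decision stumps as the weak learner; illustrative code, not the authors' implementation.

```python
# Discrete AdaBoost sketch: labels y coded in {-1, +1}, stumps as weak learners.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def discrete_adaboost(X, y, n_rounds=100):
    n = len(y)
    w = np.full(n, 1.0 / n)                  # observation weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)     # weighted tree growing
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = np.log((1 - err) / err)      # classifier weight c_m
        w *= np.exp(alpha * (pred != y))     # up-weight misclassified points
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def predict(learners, alphas, X):
    # weighted vote: sign of the sum of c_m * f_m(x)
    agg = sum(a * l.predict(X) for l, a in zip(learners, alphas))
    return np.sign(agg)
```
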
Performance of Discrete AdaBoost
Re-sampling in AdaBoost
- Connection with bagging
  - Bagging is a variance-reduction technique.
  - Is boosting also a variance-reduction technique?
- Boosting performs comparably well when:
  - a weighted tree-growing algorithm is used rather than weighted resampling,
  - i.e. when the randomization component is removed.
- Stumps have low variance but high bias.
- Boosting is capable of both bias and variance reduction.
Real AdaBoost
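A minimal Real AdaBoost sketch in the same spirit, assuming a scikit-learn classification stump whose weighted class-probability estimates play the role of p_m(x); illustrative only.

```python
# Real AdaBoost sketch: labels y coded in {-1, +1}; each round contributes
# f_m(x) = 0.5 * log(p_m(x) / (1 - p_m(x))).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def real_adaboost(X, y, n_rounds=100, eps=1e-10):
    n = len(y)
    w = np.full(n, 1.0 / n)
    learners = []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        p = np.clip(stump.predict_proba(X)[:, 1], eps, 1 - eps)  # P_w(y = 1 | x)
        f = 0.5 * np.log(p / (1 - p))        # half log-odds contribution f_m(x)
        w *= np.exp(-y * f)                  # w_i <- w_i * exp(-y_i f_m(x_i))
        w /= w.sum()
        learners.append(stump)
    return learners

# Classify a new point by the sign of F(x) = sum over rounds of
# 0.5 * log(p_m(x) / (1 - p_m(x))), computed from each stump's predict_proba.
```
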
Statistical Interpretation of AdaBoost

- Fitting an additive model by minimizing squared-error loss in a forward stagewise manner (restated in the block below):
  - At the m-th stage, fix f_1(x), ..., f_{m-1}(x).
  - Minimize squared error to obtain the next term f_m(x).
- AdaBoost fits an additive model using a criterion similar to, but not the same as, the binomial log-likelihood.
  - Both are better loss functions for classification than squared error.
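In the paper's notation (y ∈ {-1, 1}, F_{m-1}(x) = f_1(x) + ... + f_{m-1}(x)), the two criteria referred to above can be written out as follows:

```latex
% Forward stagewise fitting under squared-error loss: at the m-th stage,
% with F_{m-1} held fixed, the next term is chosen by
\[
f_m = \arg\min_{f}\; E\!\left[\bigl(y - F_{m-1}(x) - f(x)\bigr)^{2}\right],
\qquad
F_m(x) = F_{m-1}(x) + f_m(x).
\]
% AdaBoost instead works with the exponential criterion
\[
J(F) = E\!\left[e^{-yF(x)}\right],
\]
% which agrees with the binomial log-likelihood up to second order.
```
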
A brief history of boosting
- The first simple boosting procedure was developed in the PAC-learning framework ("The Strength of Weak Learnability", Schapire, 1990).
  - An initial classifier h1 is learned on the first N training points.
  - h2 is learned on a new sample of N points, half of which are misclassified by h1.
  - h3 is learned on N points on which h1 and h2 disagree.
  - The boosted classifier is hB = Majority Vote(h1, h2, h3).
Additive Models
- Additive regression models
- Extended additive models
- Classification problems
Additive Regression Models
- Modeling the mean: E(y | x) = F(x).
- The additive model: F(x) = f_1(x_1) + f_2(x_2) + ... + f_p(x_p).
  - There is a separate function f_j(x_j) for each of the p input variables x_j.
- Backfitting algorithm (a code sketch follows below)
  - A modular "Gauss-Seidel" algorithm for fitting additive models.
  - Backfitting update: f_j(x_j) <- E[ y - Σ_{k≠j} f_k(x_k) | x_j ].
  - Backfitting cycles are repeated until convergence.
  - Backfitting converges to the minimizer of E[y - F(x)]^2 under fairly general conditions.
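A minimal backfitting sketch: the Gauss-Seidel structure follows the slide, while the choice of a cubic polynomial fit as the 1-D smoother is an arbitrary assumption made for brevity.

```python
# Backfitting sketch: fits F(x) = mean(y) + sum_j f_j(x_j) by cycling over
# coordinates, each time smoothing the partial residuals against one feature.
import numpy as np

def smooth_1d(x, r, degree=3):
    """Fit a simple 1-D smoother (cubic polynomial) to residuals r against x."""
    coefs = np.polyfit(x, r, degree)
    return np.polyval(coefs, x)

def backfit(X, y, n_cycles=20):
    n, p = X.shape
    alpha = y.mean()                      # intercept
    f = np.zeros((n, p))                  # fitted values of each f_j at the data
    for _ in range(n_cycles):
        for j in range(p):
            # partial residual: remove every fitted component except f_j
            r = y - alpha - f.sum(axis=1) + f[:, j]
            f[:, j] = smooth_1d(X[:, j], r)
            f[:, j] -= f[:, j].mean()     # center each component for identifiability
    return alpha, f

# usage: alpha, f = backfit(X, y); fitted_values = alpha + f.sum(axis=1)
```
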
Extended Additive Models (1)
- Additive models whose elements f_m(x) are functions of potentially all of the input features x.
  - If we set f_m(x) = β_m b(x; γ_m), the model becomes a linear expansion in basis functions, F(x) = Σ_m β_m b(x; γ_m).
- Generalized backfitting algorithm
  - Updates: {β_m, γ_m} <- arg min_{β,γ} E[ y - Σ_{k≠m} β_k b(x; γ_k) - β b(x; γ) ]^2, one term at a time.
- Greedy forward stepwise approach: terms are added one at a time, and earlier terms are not readjusted.
Extended Additive Models (2)
- An algorithm for fitting a single weak learner to data can be reused at every stage.
- In the forward stepwise procedure, that algorithm is applied repeatedly, each time to a modified version of the problem (see the block below).
  - This can be viewed as a procedure for "boosting" a weak learner to form a powerful committee F_M(x) = Σ_m β_m b(x; γ_m).
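In symbols (with f_m(x) = β_m b(x; γ_m) as above), the greedy forward stepwise step that these two slides describe is:

```latex
% Fit a single weak learner to the current residuals and add it to the
% expansion, without readjusting the earlier terms:
\[
(\beta_m, \gamma_m) \;=\; \arg\min_{\beta,\gamma}\;
E\!\left[\bigl(y - F_{m-1}(x) - \beta\, b(x;\gamma)\bigr)^{2}\right],
\qquad
F_m(x) \;=\; F_{m-1}(x) + \beta_m\, b(x;\gamma_m).
\]
```
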
Classification problems
- Additive logistic regression: log [ P(y = 1 | x) / P(y = -1 | x) ] = Σ_m f_m(x) = F(x).
- Inverting: P(y = 1 | x) = e^{F(x)} / (1 + e^{F(x)}).
- These models are usually fit by maximizing the binomial log-likelihood.
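For concreteness, with y* = (y + 1)/2 ∈ {0, 1}, the binomial log-likelihood being maximized can be written as follows (a standard identity, not quoted from the slide):

```latex
\[
\ell(F) \;=\; E\!\left[\, y^{*}\log p(x) \;+\; (1-y^{*})\log\bigl(1-p(x)\bigr)\right],
\qquad
p(x) \;=\; \frac{e^{F(x)}}{1+e^{F(x)}}.
\]
```
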
AdaBoost – an Additive Logistic Regression Model

- AdaBoost can be interpreted as a stage-wise estimation procedure for fitting an additive logistic regression model.
  - AdaBoost optimizes an exponential criterion which, to second order, is equivalent to the binomial log-likelihood criterion.
  - The paper therefore also proposes a more standard likelihood-based boosting procedure (LogitBoost).
An Exponential Criterion (1)
- Minimizing the criterion J(F) = E[ e^{-yF(x)} ].
- The function F(x) that minimizes J(F) is the symmetric logistic transform of P(y = 1 | x):
  F*(x) = (1/2) log [ P(y = 1 | x) / P(y = -1 | x) ].
  - This can be proved by setting the derivative to zero (worked out below).
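Writing out the derivative calculation mentioned above:

```latex
\[
\frac{\partial}{\partial F(x)}\, E\!\left[e^{-yF(x)} \,\middle|\, x\right]
= -\,P(y{=}1\mid x)\, e^{-F(x)} \;+\; P(y{=}{-}1\mid x)\, e^{F(x)} \;=\; 0
\]
\[
\Longrightarrow\quad
F^{*}(x) \;=\; \tfrac{1}{2}\,\log\frac{P(y{=}1\mid x)}{P(y{=}{-}1\mid x)}.
\]
```
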
An Exponential Criterion (2)
- Equivalently, P(y = 1 | x) = e^{F(x)} / (e^{F(x)} + e^{-F(x)}), i.e. the usual logistic model evaluated at 2F(x).
- The Discrete AdaBoost algorithm (population version) builds an additive logistic regression model via Newton-like updates for minimizing J(F) = E[ e^{-yF(x)} ].
Derivation (Discrete AdaBoost)

- Given a current estimate F(x), seek an improved estimate F(x) + c f(x), with f(x) ∈ {-1, 1}.
- Expanding the criterion to second order about f(x) = 0 gives
  (a)  J(F + c f) ≈ E[ e^{-yF(x)} (1 - y c f(x) + c^2/2) ],
  where w(x, y) = e^{-yF(x)} plays the role of an observation weight.
- For c > 0, minimizing (a) over f is equivalent to maximizing E_w[ y f(x) ], the weighted expectation under w.
Continued…

- The solution is f(x) = 1 if P_w(y = 1 | x) > P_w(y = -1 | x), and f(x) = -1 otherwise.
- Note that maximizing E_w[ y f(x) ] is equivalent to minimizing E_w[ (y - f(x))^2 ], since y^2 = f(x)^2 = 1.
  - Minimizing a quadratic approximation to the criterion therefore leads to a weighted least-squares choice of f(x).
- Minimizing J(F + c f) over c to determine c gives c = (1/2) log [ (1 - err) / err ],
  where err = E_w[ 1_(y ≠ f(x)) ] is the weighted misclassification error.
Update for F(x)
- Since F(x) is updated to F(x) + c f(x), the weights update as w(x, y) <- w(x, y) · e^{-c y f(x)}.
- The function and weight updates are of an identical form to those used in Discrete AdaBoost.
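Putting the last three slides together, the population-level Discrete AdaBoost step can be summarized as:

```latex
\[
J(F+cf) \;\approx\; E\!\left[e^{-yF(x)}\bigl(1 - y\,c\,f(x) + c^{2}/2\bigr)\right],
\qquad y,\, f(x) \in \{-1,1\},
\]
\[
\hat f(x) = \arg\max_{f}\, E_w\!\left[y f(x) \,\middle|\, x\right],
\qquad
\hat c = \tfrac{1}{2}\log\frac{1-\mathrm{err}}{\mathrm{err}},
\qquad
\mathrm{err} = E_w\!\left[\mathbf{1}_{\{y \neq \hat f(x)\}}\right],
\]
\[
F(x) \leftarrow F(x) + \hat c\,\hat f(x),
\qquad
w(x,y) \leftarrow w(x,y)\, e^{-\hat c\, y\,\hat f(x)}.
\]
```
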
Corollary
- After each update to the weights, the weighted misclassification error of the most recent weak learner is 50%.
  - The weights are updated so as to make the new weighted problem maximally difficult for the next weak learner.
Derivation (Real AdaBoost)

- The Real AdaBoost algorithm fits an additive logistic regression model by stage-wise and approximate optimization of J(F) = E[ e^{-yF(x)} ].
- Consider J(F + f) = E[ e^{-yF(x)} e^{-y f(x)} ].
- Dividing through by E[ e^{-yF(x)} ] and setting the derivative with respect to f(x) to zero (written out below) gives
  f(x) = (1/2) log [ P_w(y = 1 | x) / P_w(y = -1 | x) ],  with weights w(x, y) = e^{-yF(x)}.
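The stationarity condition behind the last bullet, written out (E_w and P_w denote expectation and probability weighted by w(x, y) = e^{-yF(x)}):

```latex
\[
\frac{\partial}{\partial f(x)}\,
E_w\!\left[e^{-y f(x)} \,\middle|\, x\right]
= -\,P_w(y{=}1\mid x)\, e^{-f(x)} \;+\; P_w(y{=}{-}1\mid x)\, e^{f(x)} \;=\; 0
\]
\[
\Longrightarrow\quad
f(x) \;=\; \tfrac{1}{2}\,\log\frac{P_w(y{=}1\mid x)}{P_w(y{=}{-}1\mid x)}.
\]
```
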
Corollary
- At the optimal F(x), the weighted conditional mean of y is zero: E_w( y | x ) = 0.
Why E e^{-yF(x)}?

- The population minimizer of E[ e^{-yF(x)} ] and the population maximizer of the expected binomial log-likelihood E[ -log(1 + e^{-2yF(x)}) ] coincide: both are F*(x) = (1/2) log [ P(y = 1 | x) / P(y = -1 | x) ].
Losses as Approximations to Misclassification Error
Direct optimization of the binomial log-likelihood

- The LogitBoost algorithm fits additive logistic regression models by stage-wise optimization of the Bernoulli log-likelihood.
Derivation of LogitBoost (1)

- Newton update: F(x) <- F(x) + (1/2) · E[ y* - p(x) | x ] / E[ p(x)(1 - p(x)) | x ],
  where y* = (y + 1)/2 ∈ {0, 1} and p(x) = e^{F(x)} / (e^{F(x)} + e^{-F(x)}).
Derivation of LogitBoost (2)
- Equivalently, the Newton update f(x) solves a weighted least-squares approximation to the log-likelihood:
  minimize E_w[ (z - f(x))^2 | x ] with working response z = (y* - p(x)) / (p(x)(1 - p(x))) and weights w(x) = p(x)(1 - p(x)),
  followed by the update F(x) <- F(x) + (1/2) f(x) (see the sketch below).
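A minimal two-class LogitBoost sketch following the update above, assuming scikit-learn regression stumps for the weighted least-squares fit; illustrative code, not the authors' implementation.

```python
# LogitBoost sketch: y is coded in {0, 1}; regression stumps as weak learners.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logitboost(X, y01, n_rounds=100, eps=1e-10):
    n = len(y01)
    F = np.zeros(n)            # additive model values at the training points
    p = np.full(n, 0.5)        # current probability estimates P(y = 1 | x)
    learners = []
    for _ in range(n_rounds):
        w = np.clip(p * (1 - p), eps, None)      # Newton weights
        z = (y01 - p) / w                        # working response
        stump = DecisionTreeRegressor(max_depth=1)
        stump.fit(X, z, sample_weight=w)         # weighted least-squares fit
        f = stump.predict(X)
        F += 0.5 * f                             # Newton step
        p = 1.0 / (1.0 + np.exp(-2.0 * F))       # p(x) = e^F / (e^F + e^-F)
        learners.append(stump)
    return learners

# Prediction: classify by the sign of F(x) = 0.5 * sum of stump predictions.
```
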
Optimizing E e^{-yF(x)} by Newton stepping

- The paper also proposes the "Gentle AdaBoost" procedure, which instead takes adaptive Newton steps, much like the LogitBoost algorithm just described.
Derivation (Gentle AdaBoost)

- The Gentle AdaBoost algorithm uses Newton steps for minimizing E[ e^{-yF(x)} ].
- Newton update: F(x) <- F(x) + E_w[ y | x ], with weights w(x, y) = e^{-yF(x)} (see the sketch below).
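A minimal Gentle AdaBoost sketch along the same lines; the weighted regression-stump fit of y approximates E_w[ y | x ] at each round (illustrative only).

```python
# Gentle AdaBoost sketch: y coded in {-1, +1}; each round fits a regression
# stump to y by weighted least squares and adds it to F.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gentle_adaboost(X, y, n_rounds=100):
    n = len(y)
    w = np.full(n, 1.0 / n)            # weights w_i proportional to e^{-y_i F(x_i)}
    learners = []
    for _ in range(n_rounds):
        stump = DecisionTreeRegressor(max_depth=1)
        stump.fit(X, y, sample_weight=w)   # weighted least-squares fit of y ~ x
        f = stump.predict(X)
        w *= np.exp(-y * f)                # update and renormalize the weights
        w /= w.sum()
        learners.append(stump)
    return learners

# Classify a new point by the sign of the sum of the stump predictions.
```
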
Comparison with Real AdaBoost

- Update in Gentle AdaBoost: f_m(x) = P_w(y = 1 | x) - P_w(y = -1 | x).
- Update in Real AdaBoost: f_m(x) = (1/2) log [ P_w(y = 1 | x) / P_w(y = -1 | x) ].
  - Log-ratios can be numerically unstable, leading to very large updates in pure regions.
- Empirical evidence suggests that this more conservative algorithm has similar performance to both the Real AdaBoost and LogitBoost algorithms.
Simulation Studies
- Four boosting methods are compared:
  - DAB: Discrete AdaBoost
  - RAB: Real AdaBoost
  - LB: LogitBoost
  - GAB: Gentle AdaBoost
Data Generation
- All of the simulated examples involve fairly complex decision boundaries.
- Ten input features, randomly drawn from a ten-dimensional standard normal distribution.
- Approximately 1000 training observations in each of the classes C1, C2, C3, which are defined by thresholding Σ_j x_j^2 (nested spherical decision boundaries).
- 10000 observations in the test set.
- Results are averaged over 10 such independently drawn training/test set combinations.
Additive Decision Boundary (1)
Additive Decision Boundary (2)
Additive Decision Boundary (3)
Boosting Trees with 8-terminal nodes
Analysis
- The optimal decision boundary for the above examples is also additive in the original features, with f_j(x_j) ∝ x_j^2.
- For RAB, GAB, and LB the error rate using the bigger trees is in fact 33% higher than that for stumps at 800 iterations, even though the former is four times more complex.
- For non-additive decision boundaries, boosting stumps would be less advantageous than using larger trees.
Non-additive Decision Boundaries

- Higher-order basis functions provide the possibility to more accurately estimate decision boundaries with high-order interactions.
- Data generation
  - 2 classes.
  - 5000 training observations drawn from a 10-dimensional normal distribution.
  - Class labels were randomly assigned to each observation with log-odds given by a function involving high-order interactions among the features.
Non-additive Decision Boundaries (2)
Non-additive Decision Boundaries (3)
Analysis
- Boosting stumps can sometimes be superior to using larger trees, when the decision boundary can be closely approximated by a function that is additive in the original predictor features.
Some Experiments with Real-World Data

- Datasets from the UC Irvine machine learning archive, plus a popular simulated dataset.
- The real-data examples fail to demonstrate performance differences between the various boosting methods.
Additive Logistic Trees
- ANOVA decomposition: F(x) = Σ_j f_j(x_j) + Σ_{j,k} f_{jk}(x_j, x_k) + Σ_{j,k,l} f_{jkl}(x_j, x_k, x_l) + ...
- Allowing the base classifier to produce higher-order interactions can reduce the accuracy of the final boosted model.
  - Higher-order interactions are produced by deeper trees.
- Maximum depth becomes a "meta-parameter" of the procedure, to be estimated by some model selection technique such as cross-validation.
Additive Logistic Trees (2)
- Trees are grown until a maximum number M of terminal nodes is induced (see the sketch below).
  - "Additive logistic trees" (ALT): the combination of truncated best-first trees with boosting.
- Another advantage of low-order approximations is model visualization.
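One way to realize a truncated best-first tree with at most M terminal nodes is scikit-learn's max_leaf_nodes option, which grows trees best-first; this is an assumption about tooling, not something the slides prescribe.

```python
from sklearn.tree import DecisionTreeRegressor

def make_alt_weak_learner(M=4):
    """Regression tree grown best-first and truncated at M terminal nodes.

    Plugging this in place of the depth-1 stumps in the LogitBoost or
    Gentle AdaBoost sketches above makes M the meta-parameter that
    controls the interaction order of the boosted model.
    """
    return DecisionTreeRegressor(max_leaf_nodes=M)
```
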
Weight Trimming
- Training observations with weight w_i less than a threshold (a small quantile of the current weight distribution) are not used to train the weak learner at that iteration (see the sketch below).
  - Observations deleted at a particular iteration may therefore re-enter at later iterations.
- LogitBoost sometimes gains an advantage from weight trimming:
  - its weights w_i = p(x_i)(1 - p(x_i)) measure nearness to the currently estimated decision boundary, whereas
  - for the other three procedures the weight is monotone in -y_i F(x_i), so the subsample passed to the base learner can be highly unbalanced.
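A minimal weight-trimming sketch under the assumption that the threshold is chosen so that at most a fraction beta of the total weight is discarded (illustrative only):

```python
# Keep only the observations whose weights make up the top (1 - beta) fraction
# of the total weight mass, e.g. beta = 0.01 discards at most 1% of the weight.
import numpy as np

def trim_weights(w, beta=0.01):
    order = np.argsort(w)                      # ascending by weight
    cum = np.cumsum(w[order]) / w.sum()        # cumulative weight fraction
    keep = np.ones_like(w, dtype=bool)
    keep[order[cum < beta]] = False            # drop the lowest-weight observations
    return keep                                # boolean mask of retained points

# usage inside a boosting loop:
#   mask = trim_weights(w)
#   weak_learner.fit(X[mask], y[mask], sample_weight=w[mask])
```
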
The test error for the letter recognition problem
Further Generalizations of Boosting

- The Newton step can be replaced by a gradient step, slowing down the fitting procedure.
  - This reduces susceptibility to overfitting.
- Any smooth loss function can be used.
Concluding Remarks
- Bagging and randomized trees
  - "Variance"-reducing techniques.
- Boosting
  - Appears to be mainly a "bias"-reducing procedure.
- Boosting seems resistant to overfitting:
  - As the LogitBoost iterations proceed, the overall impact of the changes introduced by f_m(x) decreases.
  - The stage-wise nature of the boosting algorithms does not allow the full collection of parameters to be jointly fit, and so the fit has far lower variance than the full parameterization might suggest.
  - Classifiers are hurt less by overfitting than other function estimators.