Boosting

Boosting and other
Expert Fusion Strategies
References
• http://www.boosting.org
• Chapter 9.5 of Duda, Hart & Stork
• Leo Breiman: Boosting, Bagging, Arcing
• Presentation adapted from: Rishi Sinha, Robin Dhamankar
Types of Multiple Experts
• Single expert on the full observation space
• Single expert for sub-regions of the observation space (Trees)
• Multiple experts on the full observation space
• Multiple experts on sub-regions of the observation space
Types of Multiple Experts Training
• Use the full observation space for each expert
• Use different observation features for each expert
• Use different observations for each expert
• Combine the above
Online Experts Selection
• N strategies (experts)
• At time t:
  – Learner A chooses a distribution over the N experts
  – Let $p_t(i)$ be the probability of the i-th expert, with $\sum_i p_t(i) = 1$
  – For a loss vector $\ell_t$, the loss at time t is $\sum_i p_t(i)\,\ell_t(i)$
• Assume bounded loss: $\ell_t(i) \in [0,1]$
Experts Algorithm: Greedy
• For each expert define its cumulative loss:
  $L_t^i = \sum_{j=1}^{t} \ell_j(i)$
• Greedy: At time t choose the expert with minimum loss, namely $\arg\min_i L_t^i$ (a sketch follows below)
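A minimal sketch of the greedy rule, assuming the loss vectors observed so far are passed in as a list; the function name is illustrative, not from the slides:

```python
import numpy as np

def greedy_choice(loss_history):
    """Pick the expert with the smallest cumulative loss so far.

    loss_history: list of loss vectors l_1, ..., l_{t-1}, each an array of shape (N,).
    Returns the index of the chosen expert (ties broken by lowest index).
    """
    if not loss_history:
        return 0                                  # no information yet: any expert will do
    cumulative = np.sum(loss_history, axis=0)     # L_{t-1}^i for every expert i
    return int(np.argmin(cumulative))
```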
Greedy Analysis
• Theorem: Let $L_T^G$ be the loss of Greedy at time T; then
  $L_T^G \le N\,\big(\min_i L_T^i + 1\big)$
• Proof in notes.
• Weakness: relies on a single expert for every observation
Better Multiple Experts Algorithms
• Would like to bound $L_T^A - \min_i L_T^i$
• Better bound: the Hedge algorithm, which utilizes all experts for each observation
Multiple Experts Algorithm: Hedge
• Maintain a weight vector at time t: $w_t$
• Probabilities: $p_t(k) = w_t(k) / \sum_j w_t(j)$
• Initialization: $w_1(i) = 1/N$
• Updates: $w_{t+1}(k) = w_t(k)\, U_b(\ell_t(k))$,
  where $b \in [0,1]$ and $b^r \le U_b(r) \le 1-(1-b)r$
  (a sketch follows below)
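A minimal sketch of Hedge using the valid choice $U_b(r) = b^r$, assuming losses in [0,1] arrive one round at a time; the class and method names are illustrative:

```python
import numpy as np

class Hedge:
    """Hedge with the update w_{t+1}(k) = w_t(k) * b**l_t(k), losses in [0, 1]."""

    def __init__(self, n_experts, b=0.5):
        assert 0.0 < b < 1.0
        self.b = b
        self.w = np.full(n_experts, 1.0 / n_experts)   # w_1(i) = 1/N

    def probabilities(self):
        return self.w / self.w.sum()                   # p_t(k) = w_t(k) / sum_j w_t(j)

    def update(self, losses):
        """losses: array of l_t(k) in [0, 1], one entry per expert."""
        self.w *= self.b ** np.asarray(losses)

# The expected loss at round t is probabilities() @ losses.
```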
Hedge Analysis
• Lemma: For any sequence of losses,
  $\ln\Big(\sum_{j=1}^{N} w_{T+1}(j)\Big) \le -(1-b)\, L_H$
• Proof (Mansour's scribe notes)
• Corollary:
  $L_H \le \dfrac{-\ln\big(\sum_{j=1}^{N} w_{T+1}(j)\big)}{1-b}$
Hedge: Properties
• Bounding the weights:
  $w_{T+1}(i) = w_1(i) \prod_{t=1}^{T} U_b(\ell_t(i)) \;\ge\; w_1(i)\, b^{\sum_{t=1}^{T} \ell_t(i)} \;=\; w_1(i)\, b^{L_T^i}$
• Similarly for a subset of experts.
Hedge: Performance
• Let k be the expert with minimal loss. Then
  $\sum_{j=1}^{N} w_{T+1}(j) \;\ge\; w_{T+1}(k) \;\ge\; w_1(k)\, b^{L_T^k}$
• Therefore
  $L_H \le \dfrac{-\ln\big(\tfrac{1}{N}\, b^{L_T^k}\big)}{1-b} = \dfrac{\ln N + L_T^k \ln(1/b)}{1-b}$
Hedge: Optimizing b
• For b = 1/2 we have
  $L_H \le 2\ln N + 2\, L_T^k \ln 2$
• Better selection of b:
  $L_H \le \min_i L_T^i + \sqrt{2\, L \ln N} + \ln N$,
  where L is an upper bound on the loss of the best expert.
Occam Razor
• Finding the shortest consistent hypothesis.
• Definition: (a,b)-Occam algorithm
  – a > 0 and b < 1
  – Input: a sample S of size m
  – Output: hypothesis h
  – for every (x,b) in S: h(x) = b
  – $size(h) \le size(c_t)^{a}\, m^{b}$
• Efficiency (runs in polynomial time).
Occam Razor Theorem
• A: an (a,b)-Occam algorithm for C using H
• D: a distribution over inputs X
• $c_t \in C$: the target function
• Sample size:
  $m \ge \dfrac{2}{\epsilon}\ln\dfrac{1}{\delta} + \left(\dfrac{2\, n^{a}}{\epsilon}\right)^{1/(1-b)}$
• Then with probability 1-δ, A(S) = h has error(h) < ε
Occam Razor Theorem
• Use the bound for a finite hypothesis class.
• Effective hypothesis class size: $2^{size(h)}$
• $size(h) \le n^{a} m^{b}$
• Sample size:
  $m \ge \dfrac{1}{\epsilon}\ln\left(\dfrac{2^{\,n^{a} m^{b}}}{\delta}\right) = \dfrac{1}{\epsilon}\left(n^{a} m^{b}\ln 2 + \ln\dfrac{1}{\delta}\right)$

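To see how the closed-form sample bound of the previous slide follows from this implicit inequality in m (a sketch of the algebra the slides leave out, up to constants): it suffices to make each term at most m/2,

$$\frac{1}{\epsilon}\ln\frac{1}{\delta} \le \frac{m}{2} \qquad\text{and}\qquad \frac{n^{a} m^{b}\ln 2}{\epsilon} \le \frac{m}{2} \;\Longleftrightarrow\; m^{1-b} \ge \frac{2\, n^{a}\ln 2}{\epsilon},$$
$$\text{so}\quad m \;\ge\; \frac{2}{\epsilon}\ln\frac{1}{\delta} \;+\; \Big(\frac{2\, n^{a}\ln 2}{\epsilon}\Big)^{1/(1-b)} \quad\text{suffices.}$$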
Weak and Strong Learning
PAC Learning model
(Strong Learning)
• There exists a distribution D over domain X
• Examples: <x, c(x)>
– use c for target function (rather than ct)
• Goal:
– With high probability (1-d)
– find h in H such that
– error(h,c ) < e
– e arbitrarily small, thus STRONG LEARNING
Weak Learning Model
• Goal: error(h,c) < 1/2 - γ  (only slightly better than chance)
• The parameter γ is a small constant
• Intuitively: a much easier task
• Question:
  – Assume C is weakly learnable;
  – is C then PAC (strong) learnable?
Majority Algorithm
• Hypothesis: $h_M(x) = \mathrm{MAJ}[\, h_1(x), \ldots, h_T(x)\,]$
• $size(h_M) \le T \cdot size(h_t)$
• Using the Occam Razor
Majority: outline
• Sample m examples
• Start with a distribution of 1/m per example
• Modify the distribution and get $h_t$
• The hypothesis is the majority vote
• Terminate when the sample is perfectly classified
Majority: Algorithm
• Use the Hedge algorithm.
• The "experts" are associated with the sample points.
• The loss is 1 for a correct classification:
  – $\ell_t(i) = 1 - |h_t(x_i) - c(x_i)|$
• Setting b = 1 - γ
• $h_M(x) = \mathrm{MAJORITY}(h_1(x), \ldots, h_T(x))$
• Q: How do we set T?
Majority: Analysis
• Consider the set of errors $S = \{\, i \mid h_M(x_i) \ne c(x_i) \,\}$
• For every i in S: $L_T^i / T \le 1/2$ (Proof!)
• From the Hedge properties:
  $L_M \le \dfrac{-\ln\big(\sum_{i \in S} D(x_i)\big) + (\gamma + \gamma^2)\, T/2}{\gamma}$
MAJORITY: Correctness
• Error probability: $\epsilon \le \sum_{i \in S} D(x_i)$
• Number of rounds: $\epsilon \le e^{-T\gamma^2/2}$
• Terminate when the error is less than 1/m:
  $T \ge \dfrac{2 \ln m}{\gamma^2}$
  (a sketch of the whole scheme follows below)
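A minimal sketch of this majority scheme, reusing the Hedge class sketched earlier as the distribution over sample points; the weak-learner interface and T = ceil(2 ln(m) / gamma^2) follow the slides, while the function names are illustrative:

```python
import numpy as np

def boost_by_majority(X, y, weak_fit, gamma):
    """y in {0, 1}; weak_fit(X, y, p) must return a hypothesis h with
    h(X) in {0, 1} and weighted error at most 1/2 - gamma under p."""
    m = len(X)
    T = int(np.ceil(2 * np.log(m) / gamma ** 2))   # number of rounds from this slide
    hedge = Hedge(n_experts=m, b=1 - gamma)        # one "expert" per sample point
    hyps = []
    for _ in range(T):
        p = hedge.probabilities()
        h = weak_fit(X, y, p)
        correct = (h(X) == y).astype(float)        # loss l_t(i) = 1 iff point i is correct
        hedge.update(correct)                      # correctly handled points get down-weighted
        hyps.append(h)

    def h_majority(Xq):
        votes = np.mean([h(Xq) for h in hyps], axis=0)
        return (votes > 0.5).astype(int)           # majority vote of h_1, ..., h_T

    return h_majority
```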
Bagging
• Generate a random sample from training set by selecting
elements with replacement.
• Repeat this sampling procedure, getting a sequence of k
“independent” training sets
• A corresponding sequence of classifiers C1,C2,…,Ck is
constructed for each of these training sets, by using the same
classification algorithm
• To classify an unknown sample X, let each classifier predict.
• The Bagged Classifier C* then combines the predictions of the
individual classifiers to generate the final outcome. (sometimes
combination is simple voting)
Taken from Lecture slides for Data Mining Concepts and Techniques by Jiawei Han and M Kamber
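A minimal bagging sketch, assuming scikit-learn style base classifiers with fit/predict; the base-learner choice and function names are illustrative, not part of the slides:

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=25, base=DecisionTreeClassifier(), seed=None):
    """Train n_estimators classifiers, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)              # sample n elements with replacement
        models.append(clone(base).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine the individual predictions by simple majority vote.
    Assumes non-negative integer class labels."""
    votes = np.stack([m.predict(X) for m in models])  # shape (k, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```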
Boosting
• Also an ensemble method: the final prediction is a combination of the predictions of several predictors.
• What is different?
  – It is iterative.
  – Boosting: successive classifiers depend upon their predecessors; in the previous methods, the individual classifiers were "independent".
  – Training examples may have unequal weights.
  – Look at the errors from the previous classifier to decide how to focus the next iteration over the data.
  – Set weights to focus more on "hard" examples (the ones on which we committed mistakes in the previous iterations).
Boosting
• W(x) is the distribution of weights over N training
observations ∑ W(xi)=1
• Initially assign uniform weights W0(x) = 1/N for all x,
step k=0
• At each iteration k :
– Find the best weak classifier Ck(x) using weights Wk(x)
– Compute its error rate εk and, based on a loss function, set αk, the weight of classifier Ck in the final hypothesis
– For each xi, update the weights based on εk to get Wk+1(xi)
• CFINAL(x) =sign [ ∑ αi Ci (x) ]
Boosting (Algorithm)
Boosting As Additive Model
• The final prediction in boosting f(x) can be expressed
as an additive expansion of individual classifiers
  $f(x) = \sum_{m=1}^{M} \beta_m\, b(x; \gamma_m)$
• The process is iterative and can be expressed as
follows.
  $f_m(x) = f_{m-1}(x) + \beta_m\, b(x; \gamma_m)$
• Typically we would try to minimize a loss function on
the training examples
  $\min_{\{\beta_m, \gamma_m\}_1^M} \; \sum_{i=1}^{N} L\Big(y_i,\; \sum_{m=1}^{M} \beta_m\, b(x_i; \gamma_m)\Big)$
Boosting As Additive Model
• Simple case: Squared-error loss
  $L(y, f(x)) = \tfrac{1}{2}\,\big(y - f(x)\big)^2$
• Forward stage-wise modeling amounts to just fitting the residuals from the previous iteration:
  $L\big(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i;\gamma)\big) = \tfrac{1}{2}\big(y_i - f_{m-1}(x_i) - \beta\, b(x_i;\gamma)\big)^2 = \tfrac{1}{2}\big(r_{im} - \beta\, b(x_i;\gamma)\big)^2$,
  where $r_{im} = y_i - f_{m-1}(x_i)$ is the residual (a sketch follows below).
• Squared-error loss not robust for classification
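A tiny sketch of forward stage-wise fitting under squared error, where each new base learner is fit to the current residuals; the use of depth-1 regression trees from scikit-learn is an assumption of this sketch, not something the slides specify:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise(X, y, n_steps=100):
    """Each step fits a small regression tree to the residuals r = y - f_{m-1}(x)."""
    f = np.zeros(len(y))
    learners = []
    for _ in range(n_steps):
        r = y - f                                     # residuals from the previous iteration
        b = DecisionTreeRegressor(max_depth=1).fit(X, r)
        f += b.predict(X)                             # the beta_m step is absorbed into the tree fit
        learners.append(b)
    return learners

def stagewise_predict(learners, X):
    return sum(b.predict(X) for b in learners)
```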
Boosting As Additive Model
• AdaBoost for Classification:
L(y, f (x)) = exp(-y ∙ f (x)) - the exponential loss
function
  $\arg\min_{f} \sum_{i=1}^{N} L\big(y_i, f(x_i)\big)$
  $= \arg\min_{\beta, G_m} \sum_{i=1}^{N} \exp\big(-y_i\,[\, f_{m-1}(x_i) + \beta\, G_m(x_i)\,]\big)$
  $= \arg\min_{\beta, G_m} \sum_{i=1}^{N} \exp\big(-y_i\, f_{m-1}(x_i)\big)\cdot \exp\big(-y_i\, \beta\, G_m(x_i)\big)$
Boosting As Additive Model
First assume that β is constant, and minimize w.r.t. G:
  $\arg\min_{\beta, G_m} \sum_{i=1}^{N} \exp\big(-y_i\, f_{m-1}(x_i)\big)\, \exp\big(-y_i\,\beta\, G_m(x_i)\big)$
  $= \arg\min_{\beta, G_m} \sum_{i=1}^{N} w_i^{(m)} \exp\big(-y_i\,\beta\, G_m(x_i)\big), \quad \text{where } w_i^{(m)} = \exp\big(-y_i\, f_{m-1}(x_i)\big)$
  $= \arg\min_{G} \; e^{-\beta} \sum_{y_i = G(x_i)} w_i^{(m)} \;+\; e^{\beta} \sum_{y_i \ne G(x_i)} w_i^{(m)}$
  $= \arg\min_{G} \; (e^{\beta} - e^{-\beta}) \sum_{i=1}^{N} w_i^{(m)}\, I\big(y_i \ne G(x_i)\big) \;+\; e^{-\beta} \sum_{i=1}^{N} w_i^{(m)}$
Boosting As Additive Model
  $\arg\min_{G} \; (e^{\beta} - e^{-\beta}) \sum_{i=1}^{N} w_i^{(m)}\, I\big(y_i \ne G(x_i)\big) \;+\; e^{-\beta} \sum_{i=1}^{N} w_i^{(m)}$
  Dividing by $\sum_{i=1}^{N} w_i^{(m)}$:
  $\arg\min_{G} \; (e^{\beta} - e^{-\beta})\cdot \mathrm{err}_m + e^{-\beta} \;\equiv\; H(\beta)$
errm : It is the training error on the weighted samples
The last equation tells us that in each iteration we must
find a classifier that minimizes the training error on the
weighted samples.
Boosting As Additive Model
Now that we have found G, we minimize w.r.t. β:
  $H(\beta) = \mathrm{err}_m\,(e^{\beta} - e^{-\beta}) + e^{-\beta}$
  $\dfrac{\partial H}{\partial \beta} = \mathrm{err}_m\,(e^{\beta} + e^{-\beta}) - e^{-\beta} = 0$
  $\Rightarrow\; \mathrm{err}_m\,(e^{2\beta} + 1) - 1 = 0$
  $\Rightarrow\; e^{2\beta} = \dfrac{1 - \mathrm{err}_m}{\mathrm{err}_m}$
  $\Rightarrow\; \beta = \dfrac{1}{2}\log\Big(\dfrac{1 - \mathrm{err}_m}{\mathrm{err}_m}\Big)$
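For intuition, plugging in example error rates (the numbers are chosen here purely for illustration):

$$\mathrm{err}_m = 0.2 \;\Rightarrow\; \beta = \tfrac{1}{2}\ln\tfrac{0.8}{0.2} = \tfrac{1}{2}\ln 4 \approx 0.69, \qquad \mathrm{err}_m = 0.5 \;\Rightarrow\; \beta = \tfrac{1}{2}\ln 1 = 0,$$

so a weak classifier that is no better than chance receives zero weight, and the weight grows as the weighted error drops below 1/2.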
Boosting (Recall)
• W(x) is the distribution of weights over the N training
observations ∑ W(xi)=1
• Initially assign uniform weights W0(x) = 1/N for all x,
step k=0
• At each iteration k :
– Find the best weak classifier Ck(x) using weights Wk(x)
– Compute its error rate εk and, based on a loss function, set αk, the weight of classifier Ck in the final hypothesis
– For each xi, update the weights based on εk to get Wk+1(xi)
• CFINAL(x) =sign [ ∑ αi Ci (x) ]
AdaBoost
• W(x) is the distribution of weights over the N training
points ∑ W(xi)=1
• Initially assign uniform weights W0(x) = 1/N for all x.
• At each iteration k :
– Find best weak classifier Ck(x) using weights Wk(x)
– Compute εk the error rate as
εk= [ ∑ W(xi ) ∙ I(yi ≠ Ck(xi )) ] / [ ∑ W(xi )]
– Set αk, the classifier Ck's weight in the final hypothesis:
  αk = log((1 – εk)/εk)
– For each xi , Wk+1(xi ) = Wk(xi ) ∙ exp[αk ∙ I(yi ≠ Ck(xi ))]
• CFINAL(x) =sign [ ∑ αi Ci (x) ]
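A runnable sketch of exactly this update rule, with a depth-1 decision tree (stump) standing in as the weak learner; the stump choice and the scikit-learn interface are assumptions of this sketch, not part of the slides:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """y must be in {-1, +1}. Returns (classifiers, alphas)."""
    n = len(X)
    w = np.full(n, 1.0 / n)                          # W_0(x) = 1/N
    clfs, alphas = [], []
    for _ in range(T):
        c = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (c.predict(X) != y).astype(float)     # I(y_i != C_k(x_i))
        eps = np.dot(w, miss) / w.sum()              # weighted error rate
        eps = np.clip(eps, 1e-10, 1 - 1e-10)         # guard against eps = 0 or 1
        alpha = np.log((1 - eps) / eps)              # the slide's choice of alpha_k
        w = w * np.exp(alpha * miss)                 # up-weight only the misclassified points
        clfs.append(c)
        alphas.append(alpha)
    return clfs, alphas

def adaboost_predict(clfs, alphas, X):
    score = sum(a * c.predict(X) for c, a in zip(clfs, alphas))
    return np.sign(score)                            # C_FINAL(x) = sign[ sum alpha_i C_i(x) ]
```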
AdaBoost (Example)
Original training set: equal weights for all training samples
Taken from “A Tutorial on Boosting” by Yoav Freund and Rob Schapire
AdaBoost (Example)
ROUND 1
AdaBoost (Example)
ROUND 2
AdaBoost (Example)
ROUND 3
AdaBoost (Example)
AdaBoost (Characteristics)
• Why exponential loss function?
– Computational
• Simple modular re-weighting
• Derivative easy so determining optimal parameters
is relatively easy
– Statistical
• In a two label case it determines one half the log
odds of P(Y=1|x) => We can use the sign as the
classification rule
• Accuracy depends upon the number of iterations (how sensitive, we will see soon).
Boosting performance
Decision stumps are very simple rules of thumb that test a condition on a single attribute (a sketch follows below).
Decision stumps formed the individual classifiers whose predictions were combined to generate the final prediction.
The misclassification rate of the boosting algorithm was plotted against the number of iterations performed.
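For concreteness, a hand-rolled decision stump of the kind described above: an exhaustive search over (attribute, threshold, sign) under example weights. This is an illustrative sketch, not the implementation used in the cited experiments:

```python
import numpy as np

def fit_stump(X, y, w):
    """Pick the (feature, threshold, sign) minimizing weighted error.
    X: (n, d) array, y in {-1, +1}, w: non-negative example weights."""
    best = (None, None, 1, np.inf)                   # (feature, threshold, sign, error)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = np.dot(w, pred != y)           # weighted misclassification
                if err < best[3]:
                    best = (j, thr, sign, err)
    j, thr, sign, _ = best
    return lambda Xq: sign * np.where(Xq[:, j] <= thr, 1, -1)
```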
Boosting performance
[Plot: misclassification rate vs. number of boosting iterations, showing a steep decrease in error.]
Boosting performance
• Pondering how many iterations would be sufficient…
• Observations:
  – The first few (about 50) iterations increase the accuracy substantially, as seen in the steep decrease in the misclassification rate.
  – As iterations increase, training error decreases; does generalization error decrease as well?
Can Boosting do well if?
• Limited training data?
  – Probably not.
• Many missing values?
• Noise in the data?
• Individual classifiers not very accurate?
  – It could, if the individual classifiers have considerable mutual disagreement.
Adaboost
• “Probably one of the three most
influential ideas in machine learning in
the last decade, along with Kernel
methods and Variational
approximations.”
• Original idea came from Valiant
• Motivation: We want to improve the
performance of a weak learning
algorithm
• Algorithm:
Adaboost
Boosting Trees Outline
• Basics of boosting trees.
• A numerical optimization problem
• Control the model complexity, generalization
– Size of trees
– Number of Iterations
– Regularization
• Interpret the final model
– Single variable
– Correlation of variables
Boosting Trees : Basics
• Formally, a tree is
  $T(x, \Theta) = \sum_{j=1}^{J} \gamma_j\, I(x \in R_j)$
• The parameters $\Theta = \{R_j, \gamma_j\}_1^J$ are found by minimizing the empirical risk:
  $\hat{\Theta} = \arg\min_{\Theta} \sum_{j=1}^{J} \sum_{x_i \in R_j} L(y_i, \gamma_j)$
• Finding:
  – $\gamma_j$ given $R_j$: typically the mean of the $y_i$ in $R_j$
  – $R_j$: tough, but approximate solutions exist.
Basics Continued …
• Approximate criterion for optimizing $\Theta$:
  $\tilde{\Theta} = \arg\min_{\Theta} \sum_{i=1}^{N} \tilde{L}\big(y_i, T(x_i, \Theta)\big)$
• The boosted tree model is a sum of such trees, induced in a forward stage-wise manner.
• In the case of binary classification and exponential loss, this reduces to AdaBoost.
Numerical Optimization
• The loss function is $L(f) = \sum_{i=1}^{N} L\big(y_i, f(x_i)\big)$
• So the problem boils down to finding
  $\hat{f} = \arg\min_{f} L(f)$
• Numerical optimization procedures solve this by building the solution up as a sum of component steps.
Numerical Optimization Methods
• Steepest Descent
  $g_{im} = \left[\dfrac{\partial L\big(y_i, f(x_i)\big)}{\partial f(x_i)}\right]_{f(x_i) = f_{m-1}(x_i)}, \qquad g_m = (g_{1m}, g_{2m}, \ldots, g_{Nm})^{T}$
  $\rho_m = \arg\min_{\rho} L(f_{m-1} - \rho\, g_m)$
  $f_m = f_{m-1} - \rho_m\, g_m$
• The loss on the training data converges to 0:
  $f_M = \{f_M(x_1), f_M(x_2), \ldots, f_M(x_N)\} \to \{y_1, y_2, \ldots, y_N\}$
Generalization
• Gradient Boosting
– We want the algorithm to generalize.
– Gradient on the other hand is defined only on the
training data points.
– So fit the tree T to the negative gradient values by
least squares.
  $\tilde{\Theta}_m = \arg\min_{\Theta} \sum_{i=1}^{N} \big(-g_{im} - T(x_i; \Theta)\big)^2$
• MART – Multiple additive regression trees
Algorithm
1. Initialize $f_0(x)$ to a single terminal node tree.
2. For m = 1 to M:
   a) Compute the pseudo-residuals $r_{im}$ based on the loss function.
   b) Fit a regression tree to the $r_{im}$, giving terminal regions $R_{jm}$, $j = 1, 2, \ldots, J_m$.
   c) Find the optimal coefficient within each region $R_{jm}$:
      $\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\big(y_i,\, f_{m-1}(x_i) + \gamma\big)$
   d) $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm}\, I(x \in R_{jm})$
   End For
3. Output: $\hat{f} = f_M$
   (a sketch follows below)
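A compact sketch of this loop for squared-error loss, where the pseudo-residuals are simply $y_i - f_{m-1}(x_i)$ and the per-region constants are the residual means; the shrinkage factor nu anticipates a later slide. The use of scikit-learn regression trees and all names are assumptions of this sketch:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=200, J=6, nu=0.1):
    """Gradient tree boosting for squared-error loss, with shrinkage nu."""
    f0 = float(np.mean(y))                           # step 1: a single terminal node tree
    f = np.full(len(y), f0)
    trees = []
    for _ in range(M):
        r = y - f                                    # pseudo-residuals for squared loss
        t = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, r)
        # For squared loss the leaf values already equal the optimal per-region
        # constants gamma_jm (the mean residual in each region).
        f += nu * t.predict(X)                       # shrunken forward stage-wise update
        trees.append(t)
    return f0, trees, nu

def gradient_boost_predict(f0, trees, nu, X):
    return f0 + nu * sum(t.predict(X) for t in trees)
```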
Tuning the Parameters
• The parameters that can be tuned are
– The size of constituent trees J.
– The number of boosting iterations M.
– Shrinkage
– Penalized Regression
Right-sized trees
• The optimal tree size for one step might not be optimal for the whole algorithm.
  – Using a very large tree (such as one grown by C4.5) as the weak learner to fit the residuals assumes each tree is the last one in the expansion; this usually degrades performance and increases computation.
• Solution: restrict the value of J to be the same for all trees.
Right sized trees.
• The higher-order interaction effects captured by large trees tend to be estimated inaccurately.
• J is the factor that controls the order of interactions, so we would like to keep J low.
• In practice, values of $4 \le J \le 8$ have been found to work best.
Controlling M (Regularization)
• After each iteration the training risk $L(f_M)$ decreases.
• As $M \to \infty$, $L(f_M) \to 0$.
• But this risks over-fitting the training data.
• To avoid this, monitor the prediction risk on a validation sample.
• Other methods in the next chapter.
Shrinkage
• Scale the contribution of each tree by a factor $0 < \nu < 1$ to control the learning rate:
  $f_m(x) = f_{m-1}(x) + \nu \cdot \sum_{j=1}^{J} \gamma_{jm}\, I(x \in R_{jm})$
• $\nu$ and M together control the prediction risk on the training data.
• They are not independent of each other.
Experts: Motivation
• Given a set of experts
– No prior information
– No consistent behavior
– Goal: predict as well as the best expert
• Model
– online model
– Input: historical results.
Expert: Goal
• Match the loss of the best expert.
• Loss:
  – $L_A$: the loss of algorithm A
  – $L_i$: the loss of expert i
• Can we hope to do better?
Example: Guessing letters
• Setting:
  – An alphabet Σ of k letters
• Loss:
  – 1 for an incorrect guess
  – 0 for a correct guess
• Experts:
  – Each expert always guesses a certain fixed letter.
• Game: guess the most popular letter online.
Example 2: Rock-Paper-Scissors
• Two player game.
• Each player chooses: Rock, Paper, or Scissors.
• Loss matrix (row = our choice, column = opponent's choice):

              Rock   Paper   Scissors
    Rock      1/2    1       0
    Paper     0      1/2     1
    Scissors  1      0       1/2

• Goal: Play as best as we can given the opponent.
Example 3: Placing a point
• Action: choosing a point d.
• Loss (given the true location y): ||d - y||.
• Experts: one for each point.
• Important: the loss is convex:
  $\|(\lambda d_1 + (1-\lambda) d_2) - y\| \le \lambda \|d_1 - y\| + (1-\lambda)\|d_2 - y\|$
• Goal: Find a "center".
Adaboost
• Line 1: Given input space X and training examples
x1,…xm and label space Y = {-1,1}
• Line 2: Initialize a distribution D to 1/m where m is
the number of instances in the input space.
• Line 3: for( int t=0;t<T;t++)
• Line 4: Train weak learning algorithm using Dt
• Line 5: Get a weak hypothesis ht which maps the
input space to the label space. The error of this
hypothesis is εt
• Line 6: αt = (1/2)ln((1- εt)/ εt)
• Line 7: Dt+1(instancei) = (1/Zt) · Dt(instancei) · e^(-αt) if the hypothesis correctly matched the instance to its label, and (1/Zt) · Dt(instancei) · e^(αt) otherwise (Zt is a normalization factor)
Adaboost
• Final hypothesis: H(x) = sign(sum(αt ht (x)))
• Main ideas:
– Adaboost forces the weak learner to focus on incorrectly
classified instances
– Training error decreases exponentially
– Does boosting overfit?
• Baum showed Generalization error = O(sqrt(Td/m))
• Schapire showed error = O(sqrt(d/mθ))
• Does Generalization error depend on T or not? The jury is still
out.
– No overfit mechanism
AdaBoost: Dynamic Boosting
• Better bounds on the error
• No need to "know" γ
• Each round uses a different β_t
  – as a function of the error
AdaBoost: Input
• Sample of size m: < xi,c(xi) >
• A distribution D over examples
– We will use D(xi)=1/m
• Weak learning algorithm
• A constant T (number of iterations)
AdaBoost: Algorithm
• Initialization: $w_1(i) = D(x_i)$
• For t = 1 to T do:
  – $p_t(i) = w_t(i) / \sum_j w_t(j)$
  – Call the Weak Learner with $p_t$
  – Receive $h_t$
  – Compute the error $\epsilon_t$ of $h_t$ on $p_t$
  – Set $\beta_t = \epsilon_t / (1 - \epsilon_t)$
  – $w_{t+1}(i) = w_t(i)\, \beta_t^{\,e}$, where $e = 1 - |h_t(x_i) - c(x_i)|$
• Output:
  $h_A(x) = I\left[\sum_{t=1}^{T} \big(\log \tfrac{1}{\beta_t}\big)\, h_t(x) \;\ge\; \tfrac{1}{2} \sum_{t=1}^{T} \log \tfrac{1}{\beta_t}\right]$
AdaBoost: Analysis
• Theorem:
  – Given $\epsilon_1, \ldots, \epsilon_T$,
  – the error $\epsilon$ of $h_A$ is bounded by
    $\epsilon \le 2^{T} \prod_{t=1}^{T} \sqrt{\epsilon_t (1 - \epsilon_t)}$
AdaBoost: Proof
• Let $\ell_t(i) = 1 - |h_t(x_i) - c(x_i)|$
• By definition: $p_t \cdot \ell_t = 1 - \epsilon_t$
• Upper bound the sum of weights
  – from the Hedge analysis.
• An error occurs only if
  $\sum_{t=1}^{T} \big(\log \tfrac{1}{\beta_t}\big)\, |h_t(x) - c(x)| \;\ge\; \tfrac{1}{2} \sum_{t=1}^{T} \log \tfrac{1}{\beta_t}$
AdaBoost Analysis (cont.)
• Bounding the weight of a point
• Bounding the sum of weights
• Final bound as a function of $\beta_t$
• Optimizing $\beta_t$:
  – $\beta_t = \epsilon_t / (1 - \epsilon_t)$
AdaBoost: Fixed bias
• Assume $\epsilon_t = 1/2 - \gamma$
• We bound:
  $\epsilon \le (1 - 4\gamma^2)^{T/2} \le e^{-2\gamma^2 T}$
Learning OR with few
attributes
• Target function: OR of k literals
• Goal: learn in time:
– polynomial in k and log n
  – ε and δ constant
• ELIM makes “slow” progress
– disqualifies one literal per round
– May remain with O(n) literals
Set Cover - Definition
• Input: $S_1, \ldots, S_t$ with $S_i \subseteq U$
• Output: $S_{i_1}, \ldots, S_{i_k}$ such that $\bigcup_j S_{i_j} = U$
• Question: Are there k sets that cover U?
• NP-complete
Set Cover Greedy algorithm
• j = 0; $U_j = U$; $C = \emptyset$
• While $U_j \ne \emptyset$:
  – Let $S_i$ be $\arg\max_i |S_i \cap U_j|$
  – Add $S_i$ to C
  – Let $U_{j+1} = U_j - S_i$
  – j = j + 1
(a sketch follows below)
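A direct transcription of the greedy loop, assuming the sets are given as Python sets; names are illustrative:

```python
def greedy_set_cover(universe, sets):
    """universe: a set U; sets: list of sets S_1, ..., S_t whose union is U.
    Returns the indices of the chosen sets (the cover C)."""
    uncovered = set(universe)          # U_j
    cover = []                         # C
    while uncovered:
        # arg max_i |S_i ∩ U_j|
        i = max(range(len(sets)), key=lambda k: len(sets[k] & uncovered))
        cover.append(i)
        uncovered -= sets[i]           # U_{j+1} = U_j - S_i
    return cover
```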
Set Cover: Greedy Analysis
• At termination, C is a cover.
• Assume there is a cover C' of size k.
• C' is a cover for every $U_j$.
• Some S in C' covers at least $|U_j|/k$ elements of $U_j$.
• Analysis of $U_j$: $|U_{j+1}| \le |U_j| - |U_j|/k$
• Solving the recursion:
• Number of sets $j < k \ln |U|$
Building an Occam algorithm
• Given a sample S of size m:
  – Run ELIM on S
  – Let LIT be the resulting set of literals
  – There exist k literals in LIT that classify all of S correctly
• Negative examples:
  – any subset of LIT classifies them correctly
Building an Occam algorithm
• Positive examples:
  – Search for a small subset of LIT
  – which classifies S+ correctly
  – For a literal z build $T_z = \{x \mid z \text{ satisfies } x\}$
  – There are k sets that cover S+
  – Find k ln m sets that cover S+ (greedy set cover, sketched below)
• Output h = the OR of the k ln m literals
• size(h) < k ln m log 2n
• Sample size m = O(k log n log(k log n))
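Putting the pieces together, a sketch of the whole pipeline on boolean vectors: ELIM keeps the literals consistent with the negatives, and the greedy cover from the earlier sketch picks a small subset of literals covering the positives. The representations (0/1 arrays, (index, polarity) literal pairs) and all names are illustrative assumptions:

```python
def literal_value(x, lit):
    i, positive = lit                  # a literal is a variable index and a polarity
    return x[i] if positive else 1 - x[i]

def elim(samples):
    """Keep only the literals consistent with every negative example (ELIM)."""
    n = len(samples[0][0])
    lits = [(i, p) for i in range(n) for p in (True, False)]
    for x, b in samples:
        if b == 0:                     # a negative example kills any literal it satisfies
            lits = [lit for lit in lits if literal_value(x, lit) == 0]
    return lits

def occam_or(samples):
    """Greedy-cover Occam algorithm for learning an OR of few literals."""
    lits = elim(samples)
    positives = [x for x, b in samples if b == 1]
    # T_z = the set of positive examples satisfied by literal z
    sets = [{j for j, x in enumerate(positives) if literal_value(x, lit)} for lit in lits]
    chosen = greedy_set_cover(set(range(len(positives))), sets)   # from the sketch above
    hyp = [lits[i] for i in chosen]

    def predict(x):
        return int(any(literal_value(x, lit) for lit in hyp))     # the OR of the chosen literals

    return predict, hyp
```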
Application : Data mining
• Challenges in real world data mining problems
– Data has large number of observations and large number of
variables on each observation.
– Inputs are a mixture of various different kinds of variables
– Missing values, outliers and variables with skewed
distribution.
– Results to be obtained fast and they should be interpretable.
• So off-the-shelf techniques are difficult to come up with.
• Boosted decision trees (AdaBoost or MART) come close to an off-the-shelf technique for data mining.
Boosting Trees
Presented by Rishi Sinha
Occam Razor
Occam algorithm and compression
• Setting: A holds the labeled sample S = {(xi, bi)}; B holds only the points x1, …, xm. A wants to transmit the labels to B (compression).
• Option 1:
– A sends B the values b1 , … , bm
– m bits of information
• Option 2:
– A sends B the hypothesis h
– Occam: large enough m has size(h) < m
• Option 3 (MDL):
– A sends B a hypothesis h and “corrections”
– complexity: size(h) + size(errors)
Independent Component Analysis (ICA)
• This is the first ICA paper.
  – My sources for this explanation of ICA are "Independent Component Analysis: A Tutorial" by Aapo Hyvarinen and the ICA chapter of "Variational Methods for Bayesian Independent Component Analysis" by Rizwan A. Choudrey.