Experts and Boosting Algorithms
Experts: Motivation
• Given a set of experts:
– No prior information
– No consistent behavior
– Goal: predict as well as the best expert
• Model:
– Online model
– Input: the historical results of the experts.
Experts: Model
• N strategies (experts)
• At time t:
– Learner A chooses a distribution over the N experts.
– Let p_t(i) be the probability of the i-th expert.
– Clearly Σ_i p_t(i) = 1.
– A receives a loss vector l_t.
– Loss at time t: Σ_i p_t(i) l_t(i)
• Assume bounded losses: l_t(i) in [0,1]
Expert: Goal
• Match the loss of best expert.
• Loss:
– L_A : the (cumulative) loss of the algorithm A
– L_i : the (cumulative) loss of expert i
• Can we hope to do better?
Example: Guessing letters
• Setting:
– Alphabet Σ of k letters
• Loss:
– 1 for an incorrect guess
– 0 for a correct guess
• Experts:
– Each expert guesses a certain letter always.
• Game: guess the most popular letter online.
Example 2: Rock-Paper-Scissors
• Two player game.
• Each player chooses: Rock, Paper, or Scissors.
• Loss Matrix:
             Rock   Paper   Scissors
  Rock       1/2    1       0
  Paper      0      1/2     1
  Scissors   1      0       1/2
• Goal: Play as well as we can against the opponent.
Example 3: Placing a point
• Action: choosing a point d.
• Loss (given the true location y): ||d − y||.
• Experts: one for each point.
• Important: the loss is convex:
   ||(λ d_1 + (1−λ) d_2) − y|| ≤ λ ||d_1 − y|| + (1−λ) ||d_2 − y||
• Goal: Find a “center”
Experts Algorithm: Greedy
• For each expert define its cumulative loss:
   L_i^t = Σ_{j=1}^{t} l_j(i)
• Greedy: At time t choose the expert with minimum cumulative loss, namely arg min_i L_i^t.
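A minimal Python sketch of the Greedy strategy above (the name greedy_expert and the loss-matrix interface are illustrative assumptions, not from the slides):

import numpy as np

def greedy_expert(loss_matrix):
    """Follow the expert with the smallest cumulative loss so far.

    loss_matrix: array of shape (T, N) with loss_matrix[t, i] = l_t(i) in [0, 1].
    Returns the total loss incurred by the greedy strategy.
    """
    T, N = loss_matrix.shape
    cumulative = np.zeros(N)                # L_i^t for each expert i
    total_loss = 0.0
    for t in range(T):
        choice = int(np.argmin(cumulative)) # expert with minimum loss so far
        total_loss += loss_matrix[t, choice]
        cumulative += loss_matrix[t]        # update cumulative losses
    return total_loss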
Greedy Analysis
• Theorem: Let L_G^T be the loss of Greedy at time T; then
   L_G^T ≤ N ( min_i L_i^T + 1 )
• Proof!
Better Expert Algorithms
• Would like to bound
   L_A^T − min_i L_i^T
Expert Algorithm: Hedge(b)
• Maintains a weight vector w_t
• Probabilities: p_t(k) = w_t(k) / Σ_j w_t(j)
• Initialization: w_1(i) = 1/N
• Updates:
– w_{t+1}(k) = w_t(k) U_b(l_t(k))
– where b ∈ [0,1] and
– b^r ≤ U_b(r) ≤ 1 − (1−b) r
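A short Python sketch of Hedge(b) as described above; the name hedge and the particular choice U_b(r) = b^r are illustrative (any update satisfying b^r ≤ U_b(r) ≤ 1 − (1−b)r works):

import numpy as np

def hedge(loss_matrix, b=0.5):
    """Run Hedge(b) on a (T, N) matrix of losses in [0, 1].

    Uses U_b(r) = b**r, which satisfies b**r <= U_b(r) <= 1 - (1 - b) * r.
    Returns the learner's loss, sum_t p_t . l_t.
    """
    T, N = loss_matrix.shape
    w = np.full(N, 1.0 / N)              # w_1(i) = 1/N
    learner_loss = 0.0
    for t in range(T):
        p = w / w.sum()                  # p_t(k) = w_t(k) / sum_j w_t(j)
        learner_loss += float(p @ loss_matrix[t])
        w = w * b ** loss_matrix[t]      # w_{t+1}(k) = w_t(k) * U_b(l_t(k))
    return learner_loss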
Hedge Analysis
• Lemma: For any sequence of losses,
   ln( Σ_{j=1}^{N} w_{T+1}(j) ) ≤ −(1−b) L_H
• Proof!
• Corollary:
   L_H ≤ −ln( Σ_{j=1}^{N} w_{T+1}(j) ) / (1−b)
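A sketch of the standard argument behind the lemma, using Σ_j w_1(j) = 1, the upper bound U_b(r) ≤ 1 − (1−b)r, and ln(1−x) ≤ −x:

\[
\sum_j w_{t+1}(j) = \sum_j w_t(j)\,U_b(l_t(j)) \le \Big(\sum_j w_t(j)\Big)\big(1-(1-b)\,p_t\cdot l_t\big),
\]
\[
\ln\Big(\sum_{j=1}^{N} w_{T+1}(j)\Big) \le \sum_{t=1}^{T}\ln\big(1-(1-b)\,p_t\cdot l_t\big) \le -(1-b)\sum_{t=1}^{T} p_t\cdot l_t = -(1-b)\,L_H .
\]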
Hedge: Properties
• Bounding the weights:
   w_{T+1}(i) = w_1(i) ∏_{t=1}^{T} U_b(l_t(i)) ≥ w_1(i) b^{Σ_t l_t(i)} = w_1(i) b^{L_i^T}
• Similarly for a subset of experts.
Hedge: Performance
• Let k be the expert with minimal loss. Then
   Σ_{j=1}^{N} w_{T+1}(j) ≥ w_{T+1}(k) ≥ w_1(k) b^{L_k^T}
• Therefore
   L_H ≤ −ln( (1/N) b^{L_k^T} ) / (1−b) = ( ln N + L_k^T ln(1/b) ) / (1−b)
Hedge: Optimizing b
• For b = 1/2 we have
   L_H ≤ 2 ln N + 2 L_k^T ln 2
• Better selection of b:
   L_H ≤ min_i L_i + √( 2 L ln N ) + ln N
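The b = 1/2 case follows by substituting into the performance bound on the previous slide:

\[
L_H \;\le\; \frac{\ln N + L_k^T \ln(1/b)}{1-b}
\;=\; \frac{\ln N + L_k^T \ln 2}{1/2}
\;=\; 2\ln N + 2 L_k^T \ln 2 .
\]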
Occam Razor
Occam Razor
• Finding the shortest consistent hypothesis.
• Definition: (a,b)-Occam algorithm
– a > 0 and b < 1
– Input: a sample S of size m
– Output: a hypothesis h
– for every (x,b) in S: h(x) = b
– size(h) < size^a(c_t) · m^b
• Efficiency.
Occam algorithm and compression
[Diagram: A holds the labeled sample S = {(x_i, b_i)}; B holds only x_1, …, x_m; A must communicate the labels to B (compression).]
• Option 1:
– A sends B the values b1 , … , bm
– m bits of information
• Option 2:
– A sends B the hypothesis h
– Occam: for a large enough m, size(h) < m
• Option 3 (MDL):
– A sends B a hypothesis h and “corrections”
– complexity: size(h) + size(errors)
Occam Razor Theorem
• A: an (a,b)-Occam algorithm for C using H
• D: a distribution over the inputs X
• c_t in C: the target function
• Sample size:
   m ≥ (2/ε) ln(1/δ) + ( 2 n^a / ε )^{1/(1−b)}
• With probability 1−δ, A(S) = h has error(h) < ε
Occam Razor Theorem
• Use the bound for a finite hypothesis class.
• Effective hypothesis class size: 2^{size(h)}
• size(h) < n^a m^b
• Sample size:
   m ≥ (1/ε) ln( 2^{n^a m^b} / δ ) = ( n^a m^b ln 2 ) / ε + (1/ε) ln(1/δ)
Weak and Strong Learning
PAC Learning model
• There exists a distribution D over domain X
• Examples: <x, c(x)>
– use c for target function (rather than ct)
• Goal:
– With high probability (1−δ)
– find h in H such that
– error(h, c) < ε
– ε arbitrarily small.
Weak Learning Model
• Goal: error(h, c) < 1/2 − γ
• The parameter γ is small:
– a constant, or
– 1/poly
• Intuitively: a much easier task
• Question:
– Assume C is weakly learnable;
– is C then PAC (strongly) learnable?
Majority Algorithm
• Hypothesis: hM(x)= MAJ[ h1(x), ... , hT(x) ]
• size(hM) < T size(ht)
• Using Occam Razor
Majority: outline
• Sample m examples.
• Start with a distribution of 1/m per example.
• Modify the distribution and get h_t.
• The hypothesis is the majority vote.
• Terminate upon perfect classification
– of the sample.
Majority: Algorithm
• Use the Hedge algorithm.
• The “experts” are associated with the sample points.
• The loss is 1 for a correct classification:
– l_t(i) = 1 − | h_t(x_i) − c(x_i) |
• Setting b = 1 − γ
• h_M(x) = MAJORITY( h_1(x), …, h_T(x) )
• Q: How do we set T?
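A rough Python sketch of this boosting-by-majority scheme, with the sample points playing the role of Hedge's experts; the names and the weak_learner interface (takes a distribution, returns a {0,1}-valued hypothesis with error at most 1/2 − γ under it) are assumptions for illustration:

import numpy as np

def boost_by_majority(X, y, weak_learner, gamma, T):
    """Boosting via Hedge over the m sample points (labels y in {0, 1}).

    weak_learner(X, y, p) is assumed to return a hypothesis h mapping an
    example to {0, 1} with error at most 1/2 - gamma under the distribution p.
    """
    m = len(X)
    b = 1.0 - gamma                      # Hedge parameter b = 1 - gamma
    w = np.full(m, 1.0 / m)              # initial distribution: 1/m per example
    hypotheses = []
    for t in range(T):                   # e.g. T >= 2 ln(m) / gamma**2
        p = w / w.sum()
        h = weak_learner(X, y, p)
        hypotheses.append(h)
        # loss of point i is 1 when h classifies it correctly
        correct = np.array([1.0 - abs(h(X[i]) - y[i]) for i in range(m)])
        w = w * b ** correct             # correctly classified points lose weight
    def h_majority(x):
        votes = sum(h(x) for h in hypotheses)
        return 1 if 2 * votes > len(hypotheses) else 0
    return h_majority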
Majority: Analysis
• Consider the set of errors S:
– S = { i | h_M(x_i) ≠ c(x_i) }
• For every i in S:
– L_i / T ≤ 1/2 (Proof!)
• From the Hedge properties:
   L_M ≤ ( −ln( Σ_{i∈S} D(x_i) ) + (γ + γ²) T / 2 ) / γ
MAJORITY: Correctness
• Error probability:
   ε = Σ_{i∈S} D(x_i)
• Number of rounds:
   ε ≤ e^{−γ² T / 2}
• Terminate when the error is less than 1/m:
   T ≥ (2 ln m) / γ²
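A sketch of how the bound on ε follows: the weak-learning guarantee gives p_t · l_t = 1 − ε_t ≥ 1/2 + γ, hence L_M ≥ (1/2 + γ)T, and combining this with the Hedge-based upper bound on L_M from the previous slide yields

\[
\Big(\tfrac12+\gamma\Big)T \;\le\; L_M \;\le\; \frac{-\ln\varepsilon+(\gamma+\gamma^2)\,T/2}{\gamma}
\;\;\Longrightarrow\;\;
\frac{\gamma^2 T}{2}\;\le\;-\ln\varepsilon
\;\;\Longrightarrow\;\;
\varepsilon\;\le\;e^{-\gamma^2 T/2}.
\]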
AdaBoost: Dynamic Boosting
• Better bounds on the error
• No need to “know” γ
• Each round uses a different b_t
– as a function of the error
AdaBoost: Input
• Sample of size m: < xi,c(xi) >
• A distribution D over examples
– We will use D(xi)=1/m
• Weak learning algorithm
• A constant T (number of iterations)
AdaBoost: Algorithm
• Initialization: w1(i) = D(xi)
• For t = 1 to T DO
– p_t(i) = w_t(i) / Σ_j w_t(j)
– Call the Weak Learner with p_t
– Receive h_t
– Compute the error ε_t of h_t on p_t
– Set b_t = ε_t / (1 − ε_t)
– w_{t+1}(i) = w_t(i) (b_t)^e, where e = 1 − |h_t(x_i) − c(x_i)|
• Output (a sketch in code follows):
   h_A(x) = I[ Σ_{t=1}^{T} (log 1/b_t) h_t(x) ≥ (1/2) Σ_{t=1}^{T} log(1/b_t) ]
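A compact Python sketch of these steps; the weak_learner interface (takes a distribution p_t, returns a {0,1}-valued hypothesis h_t) is an assumption for illustration:

import math
import numpy as np

def adaboost(X, y, weak_learner, T):
    """AdaBoost as on the slide, with labels y in {0, 1} and D(x_i) = 1/m.

    weak_learner(X, y, p) is assumed to return a {0, 1}-valued hypothesis;
    the sketch assumes 0 < eps_t < 1/2 in every round.
    """
    m = len(X)
    w = np.full(m, 1.0 / m)                   # w_1(i) = D(x_i) = 1/m
    hypotheses, betas = [], []
    for t in range(T):
        p = w / w.sum()                       # p_t(i) = w_t(i) / sum_j w_t(j)
        h = weak_learner(X, y, p)             # receive h_t
        miss = np.array([abs(h(X[i]) - y[i]) for i in range(m)])
        eps = float(p @ miss)                 # error eps_t of h_t on p_t
        beta = eps / (1.0 - eps)              # b_t = eps_t / (1 - eps_t)
        w = w * beta ** (1.0 - miss)          # w_{t+1}(i) = w_t(i) b_t^{1-|h_t(x_i)-c(x_i)|}
        hypotheses.append(h)
        betas.append(beta)

    def h_A(x):
        # weighted majority vote with weights log(1 / b_t)
        weights = [math.log(1.0 / b) for b in betas]
        score = sum(a * h(x) for a, h in zip(weights, hypotheses))
        return 1 if score >= 0.5 * sum(weights) else 0
    return h_A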
AdaBoost: Analysis
• Theorem:
– Given ε_1, ..., ε_T,
– the error ε of h_A is bounded by
   ε ≤ 2^T ∏_{t=1}^{T} √( ε_t (1 − ε_t) )
AdaBoost: Proof
• Let l_t(i) = 1 − |h_t(x_i) − c(x_i)|
• By definition: p_t · l_t = 1 − ε_t
• Upper bound the sum of weights
– from the Hedge analysis.
• An error occurs only if
   Σ_{t=1}^{T} (log 1/b_t) |h_t(x) − c(x)| ≥ (1/2) Σ_{t=1}^{T} log(1/b_t)
AdaBoost Analysis (cont.)
• Bound the weight of a point
• Bound the sum of weights
• Obtain the final bound as a function of b_t
• Optimize b_t:
– b_t = ε_t / (1 − ε_t)
AdaBoost: Fixed bias
• Assume ε_t = 1/2 − γ
• We bound:
   ε ≤ (1 − 4γ²)^{T/2} ≤ e^{−2γ²T}
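A sketch of how the fixed-bias bound follows from the general theorem, using 1 − x ≤ e^{−x}:

\[
\varepsilon \;\le\; 2^T\prod_{t=1}^{T}\sqrt{\varepsilon_t(1-\varepsilon_t)}
\;=\; 2^T\Big(\tfrac14-\gamma^2\Big)^{T/2}
\;=\; \big(1-4\gamma^2\big)^{T/2}
\;\le\; e^{-2\gamma^2 T}.
\]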
Learning OR with few attributes
• Target function: OR of k literals
• Goal: learn in time:
– polynomial in k and log n
– ε and δ constant
• ELIM makes “slow” progress
– disqualifies one literal per round
– May remain with O(n) literals
Set Cover - Definition
• Input: S_1, …, S_t with each S_i ⊆ U
• Output: S_{i_1}, …, S_{i_k} with ∪_j S_{i_j} = U
• Question: Are there k sets that cover U?
• NP-complete
Set Cover Greedy algorithm
• j = 0; U_j = U; C = ∅
• While U_j ≠ ∅:
– Let S_i be arg max_i | S_i ∩ U_j |
– Add S_i to C
– Let U_{j+1} = U_j − S_i
– j = j + 1
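A small Python sketch of the greedy procedure above (the name greedy_set_cover is illustrative):

def greedy_set_cover(universe, sets):
    """Greedy set cover: repeatedly pick the set covering the most
    still-uncovered elements.

    universe: the set U; sets: a list of sets S_1, ..., S_t whose union is U.
    Returns the indices of the chosen sets.
    """
    uncovered = set(universe)                 # U_j
    cover = []                                # C
    while uncovered:
        # arg max_i |S_i ∩ U_j|
        i = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        cover.append(i)
        uncovered -= sets[i]                  # U_{j+1} = U_j - S_i
    return cover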
Set Cover: Greedy Analysis
• At termination, C is a cover.
• Assume there is a cover C′ of size k.
• C′ is a cover of every U_j.
• Some S in C′ covers at least |U_j| / k elements of U_j.
• Analysis of U_j: |U_{j+1}| ≤ |U_j| − |U_j| / k
• Solving the recursion (sketch below).
• Number of sets: j < k ln |U|
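A sketch of solving the recursion, using 1 − x ≤ e^{−x}:

\[
|U_j| \;\le\; |U|\Big(1-\tfrac{1}{k}\Big)^{j} \;\le\; |U|\,e^{-j/k} \;<\; 1
\quad\text{once } j > k\ln|U| ,
\]

so the greedy algorithm terminates after roughly k ln|U| iterations at most.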
Building an Occam algorithm
• Given a sample S of size m
– Run ELIM on S
– Let LIT be the set of literals
– There exist k literals in LIT that classify all of S correctly
• Negative examples:
– any subset of LIT classifies them correctly
Building an Occam algorithm
• Positive examples:
– Search for a small subset of LIT
– which classifies S+ correctly
– For a literal z build T_z = { x | z satisfies x }
– There are k sets T_z that cover S+
– Find k ln m sets that cover S+
• Output h = the OR of the k ln m literals (see the sketch below)
• size(h) < k ln m · log(2n)
• Sample size: m = O( k log n · log(k log n) )
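A Python sketch of this construction: eliminate literals using the negative examples, then greedily cover the positive examples; the helper name and the encoding of literals as (index, sign) pairs over 0/1 vectors are illustrative assumptions:

def learn_or_of_literals(examples):
    """Occam-style learner for an OR of k literals over n boolean attributes.

    examples: list of (x, label) with x a 0/1 tuple and label in {0, 1};
    a literal is (i, True) for x_i or (i, False) for not-x_i.
    Assumes the sample is consistent with some OR of k literals.
    """
    n = len(examples[0][0])
    # ELIM: drop every literal satisfied by some negative example
    literals = {(i, s) for i in range(n) for s in (True, False)}
    for x, label in examples:
        if label == 0:
            literals -= {(i, s) for (i, s) in literals if (x[i] == 1) == s}
    literals = sorted(literals)

    # T_z = set of positive examples satisfied by literal z
    positives = [x for x, label in examples if label == 1]
    covers = [{j for j, x in enumerate(positives) if (x[i] == 1) == s}
              for (i, s) in literals]

    # greedily cover the positives (at most about k ln m literals)
    uncovered = set(range(len(positives)))
    chosen = []
    while uncovered:
        best = max(range(len(covers)), key=lambda i: len(covers[i] & uncovered))
        chosen.append(best)
        uncovered -= covers[best]
    return [literals[i] for i in chosen]      # h = OR of the chosen literals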