Experts and Boosting Algorithms

Experts: Motivation
• Given a set of experts
  – No prior information
  – No consistent behavior
  – Goal: predict as well as the best expert
• Model
  – Online model
  – Input: historical results

Experts: Model
• N strategies (experts)
• At time t:
  – Learner A chooses a distribution over the N experts
  – Let p_t(i) be the probability of the i-th expert; clearly Σ_i p_t(i) = 1
  – A receives a loss vector l_t
  – Loss at time t: Σ_i p_t(i) l_t(i)
• Assume bounded loss, l_t(i) ∈ [0,1]

Experts: Goal
• Match the loss of the best expert.
• Loss:
  – L_A: cumulative loss of the learner
  – L_i: cumulative loss of expert i
• Can we hope to do better?

Example: Guessing letters
• Setting:
  – Alphabet Σ of k letters
• Loss:
  – 1 for an incorrect guess
  – 0 for a correct guess
• Experts:
  – Each expert always guesses the same fixed letter.
• Game: guess the most popular letter online.

Example 2: Rock-Paper-Scissors
• Two-player game.
• Each player chooses: Rock, Paper, or Scissors.
• Loss matrix (rows: our move, columns: opponent's move):

             Rock   Paper   Scissors
  Rock       1/2    1       0
  Paper      0      1/2     1
  Scissors   1      0       1/2

• Goal: play as well as possible given the opponent.

Example 3: Placing a point
• Action: choosing a point d.
• Loss (given the true location y): ||d − y||.
• Experts: one for each point.
• Important: the loss is convex:
  ||(λ d_1 + (1 − λ) d_2) − y|| ≤ λ ||d_1 − y|| + (1 − λ) ||d_2 − y||
• Goal: find a “center”.

Experts Algorithm: Greedy
• For each expert define its cumulative loss:
  L_i^t = Σ_{j=1..t} l_j(i)
• Greedy: at time t choose the expert with minimum loss so far, namely arg min_i L_i^t.

Greedy Analysis
• Theorem: Let L_G^T be the loss of Greedy at time T; then
  L_G^T ≤ N (min_i L_i^T + 1)
• Proof!

Better Expert Algorithms
• Would like to bound
  L_A^T − min_i L_i^T

Expert Algorithm: Hedge(β)
• Maintains a weight vector w_t
• Probabilities: p_t(k) = w_t(k) / Σ_j w_t(j)
• Initialization: w_1(i) = 1/N
• Updates:
  – w_{t+1}(k) = w_t(k) · U_β(l_t(k))
  – where β ∈ [0,1] and
  – β^r ≤ U_β(r) ≤ 1 − (1 − β) r
• (A code sketch of Hedge(β) appears at the end of this section.)

Hedge Analysis
• Lemma: for any sequence of losses,
  ln( Σ_{j=1..N} w_{T+1}(j) ) ≤ −(1 − β) L_H
• Proof!
• Corollary:
  L_H ≤ −ln( Σ_{j=1..N} w_{T+1}(j) ) / (1 − β)

Hedge: Properties
• Bounding the weights:
  w_{T+1}(i) = w_1(i) Π_{t=1..T} U_β(l_t(i)) ≥ w_1(i) β^{Σ_{t=1..T} l_t(i)} = w_1(i) β^{L_i^T}
• Similarly for any subset of experts.

Hedge: Performance
• Let k be an expert with minimal loss:
  Σ_{j=1..N} w_{T+1}(j) ≥ w_{T+1}(k) ≥ w_1(k) β^{L_k^T}
• Therefore
  L_H ≤ ( −ln(1/N) − L_k^T ln β ) / (1 − β) = ( ln N + L_k^T ln(1/β) ) / (1 − β)

Hedge: Optimizing β
• For β = 1/2 we have
  L_H ≤ 2 ln N + 2 L_k^T ln 2
• A better selection of β (as a function of an upper bound L̃ on the best expert's loss) gives
  L_H ≤ min_i L_i + √(2 L̃ ln N) + ln N
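To make the Hedge(β) slides concrete, here is a minimal Python sketch (not part of the original notes). It uses the update U_β(r) = β^r, which satisfies the slide's condition β^r ≤ U_β(r) ≤ 1 − (1 − β)r with equality on the lower bound; the class name and the letter-guessing demo are illustrative choices of my own.

```python
import numpy as np

class Hedge:
    """Minimal sketch of Hedge(beta): one weight per expert, multiplied by
    beta**loss after each round (i.e. the update U_beta(r) = beta**r)."""

    def __init__(self, n_experts, beta=0.5):
        assert 0.0 < beta < 1.0
        self.beta = beta
        self.w = np.full(n_experts, 1.0 / n_experts)   # w_1(i) = 1/N

    def distribution(self):
        # p_t(i) = w_t(i) / sum_j w_t(j)
        return self.w / self.w.sum()

    def update(self, losses):
        # losses l_t(i) are assumed to lie in [0, 1]
        self.w = self.w * self.beta ** np.asarray(losses, dtype=float)

# Toy run of the letter-guessing example: expert i always guesses letter i.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_letters, rounds = 4, 1000
    hedge = Hedge(n_experts=n_letters, beta=0.5)
    learner_loss, expert_loss = 0.0, np.zeros(n_letters)
    for _ in range(rounds):
        true_letter = rng.choice(n_letters, p=[0.4, 0.3, 0.2, 0.1])
        loss_vec = np.ones(n_letters)
        loss_vec[true_letter] = 0.0                    # 0 for a correct guess, 1 otherwise
        learner_loss += hedge.distribution() @ loss_vec  # expected loss this round
        expert_loss += loss_vec
        hedge.update(loss_vec)
    # Compare L_H with min_i L_i; for beta = 1/2 the bound is L_H <= 2 ln N + 2 L_k ln 2.
    print(f"L_H = {learner_loss:.1f}, best expert loss = {expert_loss.min():.0f}")
```

The printed cumulative loss should respect the β = 1/2 bound from the "Hedge: Optimizing β" slide above.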
Occam Razor

Occam Razor
• Finding the shortest consistent hypothesis.
• Definition: (α,β)-Occam algorithm
  – α > 0 and β < 1
  – Input: a sample S of size m
  – Output: a hypothesis h
  – For every (x,b) ∈ S: h(x) = b
  – size(h) ≤ size(c_t)^α · m^β
• Efficiency: the algorithm runs in polynomial time.

Occam algorithm and compression
• Setting: A and B both know the points x_1, …, x_m; A holds the sample S = {(x_i, b_i)} and wants to communicate the labels to B (compression).
• Option 1:
  – A sends B the values b_1, …, b_m
  – m bits of information
• Option 2:
  – A sends B the hypothesis h
  – Occam: for large enough m, size(h) < m
• Option 3 (MDL):
  – A sends B a hypothesis h and “corrections”
  – Complexity: size(h) + size(errors)

Occam Razor Theorem
• A: an (α,β)-Occam algorithm for C using H
• D: a distribution over the inputs X
• c_t ∈ C: the target function, with n = size(c_t)
• Sample size:
  m ≥ (2/ε) ln(1/δ) + ( 2 n^α / ε )^{1/(1−β)}
• With probability 1−δ, A(S) = h has error(h) < ε.

Occam Razor Theorem (proof sketch)
• Use the bound for a finite hypothesis class.
• The effective hypothesis class has size at most 2^{size(h)}, with size(h) ≤ n^α m^β.
• Sample size:
  m ≥ (1/ε) ln( 2^{n^α m^β} / δ ) = ( n^α m^β ln 2 ) / ε + (1/ε) ln(1/δ)
• Solving for m gives the bound above.

Weak and Strong Learning

PAC Learning model
• There exists a distribution D over the domain X.
• Examples: <x, c(x)>
  – We use c for the target function (rather than c_t).
• Goal:
  – With high probability (1 − δ),
  – find h ∈ H such that
  – error(h, c) < ε,
  – with ε arbitrarily small.

Weak Learning Model
• Goal: error(h, c) < 1/2 − γ
• The parameter γ is small:
  – a constant, or
  – 1/poly
• Intuitively: a much easier task.
• Question:
  – Assume C is weakly learnable.
  – Is C PAC (strongly) learnable?

Majority Algorithm
• Hypothesis: h_M(x) = MAJ[ h_1(x), …, h_T(x) ]
• size(h_M) ≤ T · size(h_t)
• Use Occam Razor.

Majority: outline
• Sample m examples.
• Start with a distribution of 1/m per example.
• Modify the distribution and get h_t.
• The hypothesis is the majority vote.
• Terminate when the classification is perfect
  – on the sample.

Majority: Algorithm
• Use the Hedge algorithm.
• The “experts” are associated with the sample points.
• The loss of a point is 1 when it is classified correctly (so weight shifts toward misclassified points):
  – l_t(i) = 1 − |h_t(x_i) − c(x_i)|
• Set β = 1 − γ.
• h_M(x) = MAJORITY( h_1(x), …, h_T(x) )
• Q: How do we set T?

Majority: Analysis
• Consider the set of errors S:
  – S = { i | h_M(x_i) ≠ c(x_i) }
• For every i ∈ S:
  – L_i / T ≤ 1/2 (Proof!)
• From the Hedge properties:
  L_M ≤ ( −ln( Σ_{i∈S} D(x_i) ) + (γ + γ²) T/2 ) / γ

MAJORITY: Correctness
• Error probability:
  ε = Σ_{i∈S} D(x_i)
• Since the weak learner guarantees p_t · l_t ≥ 1/2 + γ in every round, L_M ≥ (1/2 + γ) T.
• Number of rounds: combining with the bound above,
  ε ≤ e^{−γ² T / 2}
• Terminate when the error is less than 1/m:
  T = (2 ln m) / γ²

AdaBoost: Dynamic Boosting
• Better bounds on the error.
• No need to “know” γ.
• Each round uses a different β
  – as a function of the error.

AdaBoost: Input
• A sample of size m: < x_i, c(x_i) >
• A distribution D over the examples
  – We will use D(x_i) = 1/m.
• A weak learning algorithm
• A constant T (number of iterations)

AdaBoost: Algorithm
• Initialization: w_1(i) = D(x_i)
• For t = 1 to T do:
  – p_t(i) = w_t(i) / Σ_j w_t(j)
  – Call the Weak Learner with p_t
  – Receive h_t
  – Compute the error ε_t of h_t on p_t
  – Set β_t = ε_t / (1 − ε_t)
  – w_{t+1}(i) = w_t(i) (β_t)^e, where e = 1 − |h_t(x_i) − c(x_i)|
• Output (I[·] is the indicator function):
  h_A(x) = I[ Σ_{t=1..T} (log 1/β_t) h_t(x) ≥ (1/2) Σ_{t=1..T} log 1/β_t ]
• (A code sketch appears at the end of this section.)

AdaBoost: Analysis
• Theorem:
  – Given ε_1, …, ε_T,
  – the error ε of h_A is bounded by
    ε ≤ 2^T Π_{t=1..T} √( ε_t (1 − ε_t) )

AdaBoost: Proof
• Let l_t(i) = 1 − |h_t(x_i) − c(x_i)|.
• By definition: p_t · l_t = 1 − ε_t.
• Upper bound the sum of weights
  – from the Hedge analysis.
• An error occurs only if
  Σ_{t=1..T} (log 1/β_t) |h_t(x) − c(x)| ≥ (1/2) Σ_{t=1..T} log 1/β_t

AdaBoost Analysis (cont.)
• Bound the weight of a point.
• Bound the sum of weights.
• Derive the final bound as a function of β_t.
• Optimize β_t:
  – β_t = ε_t / (1 − ε_t)

AdaBoost: Fixed bias
• Assume ε_t = 1/2 − γ.
• We bound:
  ε ≤ (1 − 4γ²)^{T/2} ≤ e^{−2γ² T}
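Below is a minimal Python sketch of the AdaBoost loop from the slides above (not from the original notes): labels and hypotheses are {0,1}-valued, β_t = ε_t / (1 − ε_t), and correctly classified points are multiplied by β_t. The decision-stump weak learner is an illustrative stand-in of my own; any weak learner returning a {0,1}-valued hypothesis with error below 1/2 would do.

```python
import numpy as np

def stump_learner(X, y, p):
    """Illustrative weak learner (not from the notes): the best single-feature
    threshold stump under the weights p, returning a {0,1}-valued hypothesis."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1.0, -1.0):
                pred = (sign * X[:, j] >= sign * thr).astype(int)
                err = p @ (pred != y).astype(float)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    _, j, thr, sign = best
    return lambda Z: (sign * Z[:, j] >= sign * thr).astype(int)

def adaboost(X, y, weak_learner, T):
    """Sketch of the AdaBoost loop from the slides (labels and hypotheses in {0,1})."""
    m = len(y)
    w = np.full(m, 1.0 / m)                     # w_1(i) = D(x_i) = 1/m
    hyps, log_inv_beta = [], []
    for _ in range(T):
        p = w / w.sum()                         # p_t(i) = w_t(i) / sum_j w_t(j)
        h = weak_learner(X, y, p)               # receive h_t
        pred = h(X)
        eps = p @ (pred != y).astype(float)     # weighted error eps_t of h_t on p_t
        eps = min(max(eps, 1e-12), 1 - 1e-12)   # guard against eps_t in {0, 1}
        beta = eps / (1.0 - eps)                # beta_t; weak learning assumes eps_t < 1/2
        w = w * beta ** (pred == y)             # w_{t+1}(i) = w_t(i) * beta_t^e, e = 1 if correct
        hyps.append(h)
        log_inv_beta.append(np.log(1.0 / beta))

    def h_A(Z):
        # I[ sum_t log(1/beta_t) h_t(x) >= 1/2 sum_t log(1/beta_t) ]
        votes = sum(a * h(Z) for a, h in zip(log_inv_beta, hyps))
        return (votes >= 0.5 * sum(log_inv_beta)).astype(int)

    return h_A
```

Calling adaboost(X, y, stump_learner, T) should drive the training error down as in the analysis above, ε ≤ 2^T Π_t √(ε_t (1 − ε_t)).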
Learning OR with few attributes
• Target function: an OR of k literals.
• Goal: learn in time
  – polynomial in k and log n,
  – with ε and δ constant.
• ELIM makes “slow” progress:
  – it disqualifies one literal per round,
  – and may remain with O(n) literals.

Set Cover: Definition
• Input: S_1, …, S_t with S_i ⊆ U
• Output: S_{i_1}, …, S_{i_k} such that ∪_j S_{i_j} = U
• Question: are there k sets that cover U?
• NP-complete

Set Cover: Greedy algorithm
• j = 0; U_j = U; C = ∅
• While U_j ≠ ∅:
  – Let S_i be arg max_i |S_i ∩ U_j|
  – Add S_i to C
  – Let U_{j+1} = U_j − S_i
  – j = j + 1

Set Cover: Greedy Analysis
• At termination, C is a cover.
• Assume there is a cover C′ of size k.
• C′ is a cover of every U_j.
• Hence some S ∈ C′ covers at least |U_j|/k elements of U_j.
• Analysis of |U_j|: |U_{j+1}| ≤ |U_j| − |U_j|/k = |U_j| (1 − 1/k)
• Solving the recursion: |U_j| ≤ |U| (1 − 1/k)^j < |U| e^{−j/k}, so the
  number of sets is j ≤ k ln |U|.

Building an Occam algorithm
• Given a sample S of size m:
  – Run ELIM on S.
  – Let LIT be the set of surviving literals.
  – There exist k literals in LIT that classify all of S correctly.
• Negative examples:
  – Any subset of LIT classifies them correctly.

Building an Occam algorithm
• Positive examples:
  – Search for a small subset of LIT
  – which classifies S+ correctly.
  – For a literal z build T_z = { x | z satisfies x }.
  – There are k sets T_z that cover S+.
  – Greedily find k ln m sets that cover S+ (see the sketch below).
• Output h = the OR of the k ln m chosen literals.
• size(h) ≤ k ln m · log 2n
• Sample size: m = O( k log n · log(k log n) )
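Here is a Python sketch of the second stage above (not from the original notes), assuming ELIM has already produced the candidate literal set LIT. Examples are represented as boolean vectors, a literal as an (index, negated) pair, and the function names are illustrative.

```python
def satisfies(x, lit):
    """True iff the literal lit = (index, negated) is satisfied by example x."""
    idx, negated = lit
    return (not x[idx]) if negated else bool(x[idx])

def greedy_literal_cover(positives, literals):
    """Greedy set cover over the positive examples, with T_z = {x in S+ | z satisfies x}.
    Repeatedly picks the literal covering the most still-uncovered positives;
    if some k literals cover S+, at most about k ln m literals are chosen."""
    uncovered = set(range(len(positives)))
    chosen = []
    while uncovered:
        # arg max_z |T_z ∩ U_j|
        best = max(literals,
                   key=lambda lit: sum(satisfies(positives[i], lit) for i in uncovered))
        covered = {i for i in uncovered if satisfies(positives[i], best)}
        if not covered:
            raise ValueError("no literal covers the remaining positive examples")
        chosen.append(best)
        uncovered -= covered          # U_{j+1} = U_j - T_z
    return chosen

def or_hypothesis(chosen):
    """h(x) = OR of the chosen literals."""
    return lambda x: any(satisfies(x, lit) for lit in chosen)
```

On a sample consistent with an OR of k literals, the greedy stage returns at most roughly k ln m literals, giving a hypothesis of size O(k ln m · log 2n) as required by the Occam bound above.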