Online Algorithms
Lecturer: Yishay Mansour
Elad Walach
Alex Roitenberg
Introduction
Up until now, our algorithms received the entire input up front and worked with it
Now suppose the input arrives a little at a time, and we need an instant response to each piece
Oranges example
Suppose we are to build a robot that removes bad oranges from a kibbutz packaging line
After each classification, a kibbutz worker looks at the orange and tells our robot whether its classification was correct
And so on, indefinitely
Our model:
Input: an unlabeled orange x
Output: a classification (good or bad) b
The algorithm then gets the correct classification C_t(x)
Introduction
At every step t, the algorithm predicts the classification based on some hypothesis H_t
The algorithm then receives the correct classification C_t(x)
A mistake is an incorrect prediction: H_t(x) ≠ C_t(x)
The goal is to build an algorithm with a bounded number of mistakes
The number of mistakes should be independent of the input size
Linear Separators
Linear separator
The goal: find w_0 and w defining a hyperplane w · x = w_0
All positive examples will be on one side of the hyperplane and all the negative ones on the other
I.e. w · x > w_0 for positive x only
We will now look at several algorithms for finding such a separator
Perceptron
The idea: correct? Do nothing
Wrong? Move the separator towards the mistake
We'll scale all x's so that ||x|| = 1, since this doesn't affect which side of the plane they are on
The perceptron algorithm
1. Initialize w_1 = 0, t = 1
2. Given x_t, predict positive IFF w_t · x_t > 0
3. On a mistake:
   1. Mistake on positive: w_{t+1} ← w_t + x_t
   2. Mistake on negative: w_{t+1} ← w_t − x_t
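A minimal Python sketch of this loop (not part of the original slides; the function name and the (x, label) sample format with labels in {+1, −1} and unit-norm x are illustrative assumptions):

```python
# Online perceptron: predict sign(w . x), update only on mistakes.
import numpy as np

def perceptron(samples):
    """samples: list of (x, label) pairs with label in {+1, -1} and ||x|| = 1."""
    w = np.zeros(len(samples[0][0]))      # w_1 = 0
    mistakes = 0
    for x, label in samples:
        prediction = 1 if np.dot(w, x) > 0 else -1
        if prediction != label:           # mistake
            w = w + label * np.asarray(x, dtype=float)   # +x on positive, -x on negative
            mistakes += 1
    return w, mistakes
```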
The perceptron algorithm
Suppose a positive sample x
If we misclassified x, then after the update we'll get w_{t+1} · x = (w_t + x) · x = w_t · x + 1 (since ||x||² = 1)
x was positive, but since we made a mistake w_t · x was negative, so the correction was made in the right direction
Mistake Bound Theorem
Let S = <x_i> be consistent with w*: w* · x > 0 ⟺ l(x) = 1
M = |{i : l(x_i) ≠ b_i}| is the number of mistakes
Then M ≤ 1/γ², where γ = min_{x_i ∈ S} (w* · x_i)/(||w*|| ||x_i||) is the margin of w*:
the minimal distance of the samples in S from the hyperplane (after normalizing both w* and the samples)
Mistake Bound Proof
WLOG, the algorithm makes a mistake on every step (otherwise nothing happens)
Claim 1: w_{t+1} · w* ≥ w_t · w* + γ
Proof:
Mistake on a positive x: w_{t+1} · w* = (w_t + x) · w* = w_t · w* + x · w* ≥ w_t · w* + γ, by definition of γ
Mistake on a negative x: w_{t+1} · w* = (w_t − x) · w* = w_t · w* − x · w* ≥ w_t · w* + γ, by definition of γ
Proof Cont.
Claim 2: ||w_{t+1}||² ≤ ||w_t||² + 1
Mistake on a positive x: ||w_{t+1}||² = ||w_t + x||² = ||w_t||² + 2 w_t · x + 1 ≤ ||w_t||² + 1
since w_t · x ≤ 0 when the algorithm made a mistake on a positive example
Mistake on a negative x: ||w_{t+1}||² = ||w_t − x||² = ||w_t||² − 2 w_t · x + 1 ≤ ||w_t||² + 1
since w_t · x > 0 when the algorithm made a mistake on a negative example
Proof Cont.
From Claim 1: w_{M+1} · w* ≥ Mγ
From Claim 2: ||w_{M+1}||² ≤ ||w_1||² + M = M, so ||w_{M+1}|| ≤ √M
Also: w_t · w* ≤ ||w_t|| (since ||w*|| = 1)
Combining: Mγ ≤ w_{M+1} · w* ≤ ||w_{M+1}|| ≤ √M ⟹ M ≤ 1/γ²
The world is not perfect
What if there is no perfect separator?
The world is not perfect
Claim 1 (reminder): w_{t+1} · w* ≥ w_t · w* + γ
Previously we made γ progress on each mistake
Now we might be making negative progress
TD_γ = the total distance we would have to move the points to make them separable by a margin γ
So: w_{M+1} · w* ≥ Mγ − TD_γ
With Claim 2: Mγ − TD_γ ≤ w_{M+1} · w* ≤ ||w_{M+1}|| ≤ √M ⟹ M ≤ 1/γ² + (2/γ) TD_γ
The world is not perfect
The total hinge loss of w* is (1/γ) TD_γ
Alt. definition: the hinge loss of w* on x is max(0, 1 − y), where y = l(x)(w* · x)/γ
Hinge loss illustration: [figure]
Perceptron for maximizing margins
The idea: update w_t whenever the correct classification margin is less than γ/2
No. of steps polynomial in 1/γ
Generalization: update margin γ/2 → (1 − ε)γ
No. of steps polynomial in 1/(εγ)
Perceptron Algorithm (maximizing margin)
Assuming ∀x_i ∈ S, ||x_i|| = 1
Init: w_1 ← l(x_1) x_1
Predict:
(w_t · x)/||w_t|| ≥ γ/2 → predict positive
(w_t · x)/||w_t|| ≤ −γ/2 → predict negative
(w_t · x)/||w_t|| ∈ (−γ/2, γ/2) → margin mistake
On a mistake (prediction or margin), update: w_{t+1} ← w_t + l(x) x
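A sketch of this variant in the same style (assumes γ is known in advance and passed in; the names are illustrative, not from the slides):

```python
# Margin perceptron: update on real mistakes and on margin mistakes alike.
import numpy as np

def margin_perceptron(samples, gamma):
    """samples: list of (x, label) with label in {+1, -1} and ||x|| = 1."""
    x0, l0 = samples[0]
    w = l0 * np.asarray(x0, dtype=float)              # w_1 = l(x_1) x_1
    updates = 0
    for x, label in samples[1:]:
        score = np.dot(w, x) / np.linalg.norm(w)      # normalized margin of x
        if label * score < gamma / 2:                 # mistake or margin mistake
            w = w + label * np.asarray(x, dtype=float)
            updates += 1
    return w, updates
```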
Mistake Bound Theorem
Let S = <x_i> be consistent with: w* · x > 0 ⟺ l(x) = 1
M = No. of mistakes + No. of margin mistakes
Then M ≤ 12/γ², where γ = min_{x_i ∈ S} (w* · x_i)/(||w*|| ||x_i||) is the margin of w*
The proof is similar to the perceptron proof
Claim 1 remains the same: w_{t+1} · w* ≥ w_t · w* + γ
We only have to bound ||w_{t+1}||
Mistake bound proof
WLOG, the algorithm makes a mistake (or a margin mistake) on every step
Claim 2: ||w_{t+1}|| ≤ ||w_t|| + γ/2 + 1/(2||w_t||)
Proof: ||w_{t+1}|| = ||w_t + l(x) x|| = ||w_t|| √(1 + 2 l(x)(w_t · x)/||w_t||² + 1/||w_t||²)
And since √(1 + α) ≤ 1 + α/2:
||w_{t+1}|| ≤ ||w_t|| (1 + l(x)(w_t · x)/||w_t||² + 1/(2||w_t||²))
Proof Cont.
Since the algorithm made a mistake (or a margin mistake) on step t: l(x)(w_t · x)/||w_t|| ≤ γ/2
And thus: ||w_{t+1}|| ≤ ||w_t|| + 1/(2||w_t||) + γ/2
Proof Cont.
So: ||w_{t+1}|| ≤ ||w_t|| + 1/(2||w_t||) + γ/2
If ||w_t|| ≥ 2/γ, then ||w_{t+1}|| ≤ ||w_t|| + 3γ/4
⟹ ||w_{M+1}|| ≤ 1 + 2/γ + (3γ/4) M
From Claim 1 as before: Mγ ≤ w_{M+1} · w* ≤ ||w_{M+1}||
Solving, we get: M ≤ 12/γ²
The mistake bound model
Con Algorithm
At step t:
C_t ⊆ C is the set of concepts consistent with x_1, x_2, .., x_{t−1}
Randomly choose a concept c ∈ C_t
Predict b_t = c(x_t)
CON Algorithm
Theorem: For any concept class C, CON makes at most |C| − 1 mistakes
Proof: at first C_1 = C
After each mistake |C_t| decreases by at least 1
|C_t| ≥ 1, since the target concept is in C_t at any t
Therefore the number of mistakes is bounded by |C| − 1
The bounds of CON
This bound is too high!
There are 2^(2^n) different functions on {0,1}^n
We can do better!
HAL – halving algorithm
At step t:
C_t ⊆ C is the set of concepts consistent with x_1, x_2, .., x_{t−1}
Conduct a vote amongst all c ∈ C_t
Predict b_t in accordance with the majority
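One round of HAL can be sketched as follows, representing the (finite) concept class simply as a list of boolean functions (an illustrative assumption, not the slides' notation):

```python
# One online round of the halving algorithm: majority vote, then filtering.
def halving_step(C_t, x, true_label):
    votes = sum(1 if c(x) else -1 for c in C_t)
    prediction = votes >= 0                            # predict with the majority
    C_next = [c for c in C_t if c(x) == true_label]    # keep only consistent concepts
    return prediction, C_next
```

On every mistake the majority was wrong, so C_next contains at most half the concepts of C_t; this is exactly the source of the log_2|C| bound proved on the next slide.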
HAL – halving algorithm
Theorem: For any concept class C, HAL makes at most log_2|C| mistakes
Proof: C_1 = C. After each mistake |C_{t+1}| ≤ (1/2)|C_t|, since a majority of the concepts were wrong
Therefore the number of mistakes is bounded by log_2|C|
Mistake Bound model and PAC
The mistake bound model generates strong online algorithms
In the past we have seen PAC
The restrictions of the mistake bound model are much harsher than those of PAC
If we know that A learns C in the mistake bound model, can A learn C in the PAC model?
Mistake Bound model and PAC
A – a mistake bound algorithm
Our goal: to construct Apac, a PAC algorithm
Assume that after A gets x_i it constructs hypothesis h_i
Definition: a mistake bound algorithm A is conservative iff for every sample x_i, if c_t(x_i) = h_{i−1}(x_i) then in the i-th step the algorithm keeps h_i = h_{i−1}
I.e. it changes its hypothesis only when a mistake is made
Conservative equivalent of Mistake Bound Algorithm
Let A be an algorithm whose number of mistakes is bounded by M
A_k is A's hypothesis after it has seen {x_1, x_2, .., x_k}
Define A':
Initially h_0 = A_0
At x_i update:
Guess h_{i−1}(x_i)
If c_t(x_i) = h_{i−1}(x_i), h_i = h_{i−1}
Else feed x_i to A and set h_i to A's new hypothesis
If we run A on S = {x_i : c_t(x_i) ≠ h_{i−1}(x_i)}, A makes a mistake on every example in S, so |S| ≤ M ⇒
A' makes at most as many mistakes as A
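A sketch of A' as a wrapper around A (the A.hypothesis() / A.update(x, label) interface is an assumption made for illustration, not taken from the slides):

```python
# Conservative wrapper A': forward an example to A only when the current
# hypothesis misclassifies it, so the hypothesis changes only on mistakes.
def run_conservative(A, samples):
    h = A.hypothesis()                     # h_0 = A_0
    for x, label in samples:
        if h(x) != label:                  # mistake: let A see this example
            A.update(x, label)
            h = A.hypothesis()             # h_i = A's new hypothesis
        # otherwise h_i = h_{i-1}: do nothing
    return h
```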
Building Apac
k_i = (i/ε) ln(M/δ), 0 ≤ i ≤ M − 1
Apac algorithm:
Run A' over a sample of size (M/ε) ln(M/δ), divided into M equal blocks
Build hypothesis h_{k_i} for each block
Run the hypothesis on the next block
If there are no mistakes, output h_{k_i}
[Diagram: M blocks of size (1/ε) ln(M/δ) each; h_{k_0}, h_{k_1}, … are tested on successive blocks until one is consistent]
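A sketch of this block-testing loop, using the same assumed A'-interface as above and the slide's block size (1/ε)·ln(M/δ); the stream and names are illustrative:

```python
# A_PAC: test the current hypothesis on a fresh block; output it if it makes
# no mistakes, otherwise train A' on that block and move to the next one.
import math

def a_pac(A_prime, sample_stream, M, eps, delta):
    block_size = math.ceil((1 / eps) * math.log(M / delta))
    h = A_prime.hypothesis()
    for _ in range(M):
        block = [next(sample_stream) for _ in range(block_size)]
        if all(h(x) == label for x, label in block):
            return h                                   # consistent on a whole block
        for x, label in block:                         # A' is conservative: it updates only on its mistakes
            A_prime.update(x, label)
        h = A_prime.hypothesis()
    return h                                           # after M mistakes A' must be perfect
```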
Building Apac
If A' makes at most M mistakes then Apac is guaranteed to finish: if A' reaches M mistakes, Apac outputs a perfect classifier
What happens otherwise?
Theorem: Apac learns C in the PAC model
Proof: Pr(h_{k_i} succeeds on a whole block while being ε-bad) ≤ (1 − ε)^((1/ε) ln(M/δ)) ≤ δ/M
Pr(Apac outputs an ε-bad h) ≤ Pr(∃ 0 ≤ i ≤ M − 1 s.t. h_{k_i} is ε-bad and consistent with its test block) ≤ Σ_{i=0}^{M−1} Pr(h_{k_i} is ε-bad and consistent) ≤ M · (δ/M) = δ
Disjunction of Conjunctions
Disjunction of Conjunctions
We have proven that every algorithm in the mistake bound model can be converted to PAC
Let's look at some algorithms in the mistake bound model
Disjunction Learning
Our goal: learn the class of disjunctions, e.g. x_1 ∨ x_2 ∨ x_6 ∨ x_8
Let L be the literal set {x_1, ¬x_1, x_2, ¬x_2, …, x_n, ¬x_n}, and h = ⋁ {x : x ∈ L}
Given a sample y do (see the sketch after this slide):
1. If our hypothesis makes a mistake (h_t(y) ≠ c_t(y)) then:
   L ← L \ S, where S = {all x_i for which y_i is positive, and all ¬x_i for which y_i is negative}
2. Else do nothing
3. Return to step 1 with the next sample (updating our hypothesis)
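A sketch of the elimination loop above, encoding the literal x_i as (i, True) and ¬x_i as (i, False) (this encoding is an illustration choice, not from the slides):

```python
# Disjunction learning by literal elimination: start from all 2n literals and
# drop every literal that is satisfied by a sample we misclassified.
def learn_disjunction(n, samples):
    L = {(i, s) for i in range(n) for s in (True, False)}   # all 2n literals

    def h(y):                      # hypothesis: OR of the remaining literals
        return any(bool(y[i]) == s for (i, s) in L)

    for y, label in samples:       # y is a 0/1 vector, label is the true value
        if h(y) != bool(label):    # mistake (only happens on negative samples)
            L -= {(i, s) for (i, s) in L if bool(y[i]) == s}
    return L
```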
Example
If we have only 2 variables:
L is {x_1, ¬x_1, x_2, ¬x_2}
h_t = x_1 ∨ ¬x_1 ∨ x_2 ∨ ¬x_2
Assume the first sample is y = (1,0):
h_t(y) = 1
If c_t(y) = 0:
we update L = {¬x_1, x_2}
h_{t+1} = ¬x_1 ∨ x_2
Mistake Bound Analysis
The number of mistakes is bounded by n + 1
n is the number of variables
Proof:
Let R be the set of literals in c_t
Let L_t be the literal set of the hypothesis after the first t samples y_1, …, y_t
Mistake Bound Analysis
We prove by induction that R ⊆ L_t
For t = 0 it is obvious that R ⊆ L_0
Assume that after t − 1 samples R ⊆ L_{t−1}
If c_t(y_t) = 1 then, since R ⊆ L_{t−1}, also h_t(y_t) = 1 = c_t(y_t) and we don't update
If c_t(y_t) = 0 then no literal of R is satisfied by y_t, so S and R don't intersect
Either way R ⊆ L_t
Thus we can only make mistakes when c_t(y) = 0
Mistake analysis proof
At the first mistake we eliminate n literals
At any further mistake we eliminate at least 1 literal
L_0 has 2n literals
So we can have at most n + 1 mistakes
k-DNF
Definition: k-DNF functions are functions that can be represented as a disjunction of conjunctions, each containing at most k literals
E.g. 3-DNF: (x_1 ∧ x_2 ∧ x_6) ∨ (x_1 ∧ x_3 ∧ x_5)
The number of conjunctions of i literals is C(n, i) · 2^i:
we choose i variables (C(n, i) ways), and for each of them we choose a sign (2^i)
k-DNF classification
We can learn this class by changing the previous algorithm to deal with terms (conjunctions) instead of variables
Reducing the space X = {0,1}^n to Y = {0,1}^(O(n^k)): a k-DNF over X becomes a plain disjunction over Y (sketched below)
2 usable algorithms:
ELIM for PAC
The previous algorithm (in the mistake bound model), which makes O(n^k) mistakes
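The reduction can be sketched as an explicit feature map; the function below (an illustrative sketch) evaluates every conjunction of at most k literals on x, producing the vector y on which the disjunction learner above can be run:

```python
# Map x in {0,1}^n to y in {0,1}^(O(n^k)): one coordinate per conjunction of
# at most k literals; a k-DNF over x is a monotone disjunction over y.
from itertools import combinations, product

def expand_to_terms(x, k):
    n = len(x)
    y = []
    for size in range(1, k + 1):
        for idxs in combinations(range(n), size):
            for signs in product((True, False), repeat=size):
                # the term is satisfied iff every chosen literal agrees with x
                y.append(1 if all(bool(x[i]) == s for i, s in zip(idxs, signs)) else 0)
    return y
```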
Winnow
Monotone disjunction: a disjunction containing only positive literals, e.g. x_1 ∨ x_3 ∨ x_5
Purpose: to learn the class of monotone disjunctions in the mistake bound model
We look at Winnow, which is similar to the perceptron
One main difference: it uses multiplicative steps rather than additive ones
Winnow
Same classification scheme as the perceptron, with a threshold θ (the proof below takes θ = n):
h(x): x · w ≥ θ ⇒ positive classification
h(x): x · w < θ ⇒ negative classification
Initialize w_0 = (1, 1, …, 1)
Update scheme:
On a positive misclassification (h(x) = 1, c_t(x) = 0): ∀x_i = 1: w_i ← w_i / 2
On a negative misclassification (h(x) = 0, c_t(x) = 1): ∀x_i = 1: w_i ← 2 w_i
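A sketch of Winnow as specified above, with threshold θ = n (illustrative names; x is a 0/1 vector and labels are 0/1):

```python
# Winnow: multiplicative updates, threshold n, all weights start at 1.
def winnow(n, samples):
    w = [1.0] * n
    mistakes = 0
    for x, label in samples:
        prediction = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= n else 0
        if prediction != label:
            mistakes += 1
            factor = 0.5 if prediction == 1 else 2.0    # demote on a false positive, promote on a false negative
            w = [wi * factor if xi == 1 else wi for wi, xi in zip(w, x)]
    return w, mistakes
```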
Mistake bound analysis
Similar to the perceptron: if the margin is bigger than γ then we can prove the error rate is Θ(1/γ²)
Winnow Proof: Definitions
Let S = {x_{i1}, x_{i2}, .., x_{ir}} be the set of relevant variables in the target concept
I.e. c_t = x_{i1} ∨ x_{i2} ∨ .. ∨ x_{ir}
We define W_r = {w_{i1}, w_{i2}, .., w_{ir}}, the weights of the relevant variables
Let w(t) be the weight w at time t
Let TW(t) be the total weight at time t of all the weights, of both relevant and irrelevant variables
Winnow Proof: Positive Mistakes
Let's look at the mistakes on positive examples
Any mistake on a positive example doubles (at least) 1 of the relevant weights: ∃w ∈ W_r s.t. w(t + 1) = 2w(t)
If ∃w_i s.t. w_i ≥ n, then every example with x_i = 1 gives x · w ≥ n and therefore a positive classification
So ∀w_i: w_i can be doubled at most 1 + log n times
Thus we can bound the number of positive mistakes: M+ ≤ r(1 + log n)
Winnow Proof: Positive Mistakes
For a mistake on a positive example:
h(x) = w_1(t) x_1 + … + w_n(t) x_n < n
TW(t + 1) = TW(t) + (w_1(t) x_1 + … + w_n(t) x_n)
(1) TW(t + 1) < TW(t) + n
Winnow Proof: Negative Mistakes
On negative examples none of the relevant weights change
Thus ∀w ∈ W_r, w(t + 1) ≥ w(t)
For a negative mistake to occur: w_1(t) x_1 + … + w_n(t) x_n ≥ n
TW(t + 1) = TW(t) − (w_1(t) x_1 + … + w_n(t) x_n)/2
⇒ (2) TW(t + 1) ≤ TW(t) − n/2
Winnow Proof: Cont.
Combining the equations (1), (2): (3) 0 < TW(t) ≤ TW(0) + n M+ − (n/2) M−
At the beginning all weights are 1, so (4) TW(0) = n
(3), (4) ⇒ M− < 2 + 2 M+ ≤ 2 + 2r(log n + 1)
⇒ M− + M+ ≤ 2 + 3r(log n + 1)
What should we know? I
Linear Separators
Perceptron algorithm: M ≤ 1/γ²
Margin Perceptron: M ≤ 12/γ²
The mistake bound model
CON algorithm: M ≤ |C| − 1, but C may be very large!
HAL, the halving algorithm: M ≤ log_2|C|
What should you know? II
The relation between PAC and the mistake bound model
The basic algorithm for learning disjunctions of conjunctions
Learning k-DNF functions
Winnow algorithm: M ≤ 2 + 3r(log n + 1)
Questions?