Online Algorithms
Lecturer: Yishay Mansour
Elad Walach, Alex Roitenberg

Introduction

Up until now, our algorithms received their entire input up front and then worked on it. Now suppose the input arrives a little at a time and we must respond instantly.

Oranges example

Suppose we are to build a robot that removes bad oranges from a kibbutz packaging line. After each classification, the kibbutz worker looks at the orange and tells our robot whether its classification was correct, and this repeats indefinitely.
Our model:
- Input: an unlabeled orange x.
- Output: a classification b (good or bad).
- The algorithm then receives the correct classification c_t(x).

Introduction (cont.)

At every step t, the algorithm predicts the classification based on some hypothesis h_t, and then receives the correct classification c_t(x). A mistake is an incorrect prediction: h_t(x) ≠ c_t(x). The goal is to build an algorithm with a bounded number of mistakes, where the bound is independent of the length of the input sequence.

Linear Separators

The goal: find w_0 and w defining a hyperplane w·x = w_0 such that all positive examples are on one side of the hyperplane and all negative examples are on the other, i.e. w·x > w_0 exactly for the positive x. We will now look at several algorithms for finding such a separator.

Perceptron

The idea: if the prediction is correct, do nothing; if it is wrong, move the separator towards the mistake. We scale all x so that ||x|| = 1; this does not affect which side of the hyperplane they are on.

The perceptron algorithm
1. Initialize w_1 = 0, t = 1.
2. Given x_t, predict positive iff w_t·x_t > 0.
3. On a mistake:
   - mistake on a positive example: w_{t+1} ← w_t + x_t
   - mistake on a negative example: w_{t+1} ← w_t − x_t

Why the update helps: suppose x is a positive sample that we misclassified. After the update,
  w_{t+1}·x = (w_t + x)·x = w_t·x + ||x||² = w_t·x + 1.
x is positive, but since we made a mistake w_t·x was non-positive, so the correction moves the prediction in the right direction.

Mistake Bound Theorem

Let S = ⟨x_i⟩ be a sequence consistent with a separator w*, i.e. w*·x > 0 ⟺ l(x) = 1, and let M = |{i : l(x_i) ≠ b_i}| be the number of mistakes the perceptron makes on S. Then
  M ≤ 1/γ²,
where γ = min_{x_i∈S} |w*·x_i| / (||w*||·||x_i||) is the margin: the minimal distance of the samples in S from the separating hyperplane, after normalizing both w* and the samples. From here on we assume ||w*|| = 1 and ||x_i|| = 1.

Mistake Bound Proof

WLOG the algorithm makes a mistake on every step (on steps with no mistake nothing changes).

Claim 1: w_{t+1}·w* ≥ w_t·w* + γ.
Proof: on a positive example, w_{t+1}·w* = (w_t + x)·w* = w_t·w* + x·w* ≥ w_t·w* + γ by the definition of γ. On a negative example, w_{t+1}·w* = (w_t − x)·w* = w_t·w* − x·w* ≥ w_t·w* + γ, again by the definition of γ (for a negative example, x·w* ≤ −γ).

Claim 2: ||w_{t+1}||² ≤ ||w_t||² + 1.
Proof: on a positive example, ||w_{t+1}||² = ||w_t + x||² = ||w_t||² + 2w_t·x + ||x||² = ||w_t||² + 2w_t·x + 1 ≤ ||w_t||² + 1, since w_t·x ≤ 0 when the algorithm makes a mistake on a positive example. On a negative example, ||w_{t+1}||² = ||w_t − x||² = ||w_t||² − 2w_t·x + 1 ≤ ||w_t||² + 1, since w_t·x > 0 when the algorithm makes a mistake on a negative example.

Combining the claims: from Claim 1, w_{M+1}·w* ≥ Mγ. From Claim 2, ||w_{M+1}||² ≤ M, so ||w_{M+1}|| ≤ √M. Also w_{M+1}·w* ≤ ||w_{M+1}||·||w*|| = ||w_{M+1}||. Therefore
  Mγ ≤ w_{M+1}·w* ≤ ||w_{M+1}|| ≤ √M,
which gives M ≤ 1/γ².

The world is not perfect

What if there is no perfect separator?
Claim 1 (reminder): w_{t+1}·w* ≥ w_t·w* + γ. Previously we made γ progress on each mistake; now some mistakes may make negative progress. Let TD_γ be the total distance we would have to move the points to make them separable with margin γ. Then
  w_{M+1}·w* ≥ Mγ − TD_γ,
and with Claim 2 as before,
  Mγ − TD_γ ≤ w_{M+1}·w* ≤ ||w_{M+1}|| ≤ √M,
which gives M ≤ 1/γ² + (2/γ)·TD_γ.

An alternative view of TD_γ: TD_γ/γ is the total hinge loss of w*, where the hinge loss on an example x with label y is max(0, 1 − y(w*·x)/γ).
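As a concrete illustration (a minimal sketch, not from the lecture), the online perceptron loop can be written as follows; the function name and the stream of (x, label) pairs with labels in {+1, −1} are assumptions made for the example.

```python
import numpy as np

def perceptron_online(stream):
    """Online perceptron. `stream` yields (x, label) pairs with label in {+1, -1}.
    Yields the prediction for each example; w changes only on mistakes."""
    w = None
    for x, label in stream:
        x = np.asarray(x, dtype=float)
        x = x / np.linalg.norm(x)          # scale so that ||x|| = 1 (assumes x != 0)
        if w is None:
            w = np.zeros_like(x)           # w_1 = 0
        pred = 1 if w @ x > 0 else -1      # predict positive iff w . x > 0
        if pred != label:                  # mistake: move the separator toward x
            w += label * x                 # +x on a positive example, -x on a negative one
        yield pred
```

On a sequence that is linearly separable with margin γ (after normalization), this loop makes at most 1/γ² mistakes, as the proof above shows.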
Perceptron for maximizing margins

The idea: update w_t not only on prediction mistakes, but whenever a correctly classified example has margin less than γ/2. The number of update steps is polynomial in 1/γ (we show a bound of 12/γ² below). Generalization: updating whenever the margin is less than (1 − ε)γ keeps the number of update steps polynomial in 1/(εγ).

Perceptron Algorithm (maximizing margin)

Assume ∀x_i ∈ S: ||x_i|| = 1, and let l(x) ∈ {+1, −1} be the label.
Init: w_1 ← l(x_1)·x_1.
Predict:
- (w_t·x)/||w_t|| ≥ γ/2 → predict positive
- (w_t·x)/||w_t|| ≤ −γ/2 → predict negative
- (w_t·x)/||w_t|| ∈ (−γ/2, γ/2) → margin mistake
On a mistake (prediction or margin), update: w_{t+1} ← w_t + l(x)·x.

Mistake Bound Theorem

Let S = ⟨x_i⟩ be consistent with w*: w*·x > 0 ⟺ l(x) = 1, and let M be the number of mistakes plus the number of margin mistakes. Then M ≤ 12/γ², where γ = min_{x_i∈S} |w*·x_i|/||x_i|| is the margin, as before.
The proof is similar to the perceptron proof. Claim 1 remains the same: w_{t+1}·w* ≥ w_t·w* + γ. We only have to bound ||w_{t+1}|| differently.

Mistake bound proof

WLOG the algorithm makes a mistake (or margin mistake) on every step.

Claim 2: ||w_{t+1}|| ≤ ||w_t|| + γ/2 + 1/(2||w_t||).
Proof:
  ||w_{t+1}|| = ||w_t + l(x)·x|| = sqrt(||w_t||² + 2l(x)(w_t·x) + 1)
             = ||w_t|| · sqrt(1 + 2l(x)(w_t·x)/||w_t||² + 1/||w_t||²).
Since sqrt(1 + α) ≤ 1 + α/2,
  ||w_{t+1}|| ≤ ||w_t|| · (1 + l(x)(w_t·x)/||w_t||² + 1/(2||w_t||²))
             = ||w_t|| + l(x)(w_t·x)/||w_t|| + 1/(2||w_t||).
Since the algorithm updated at step t, l(x)(w_t·x)/||w_t|| ≤ γ/2, and thus
  ||w_{t+1}|| ≤ ||w_t|| + γ/2 + 1/(2||w_t||).

Proof cont.: if ||w_t|| ≥ 2/γ then 1/(2||w_t||) ≤ γ/4, so ||w_{t+1}|| ≤ ||w_t|| + 3γ/4. Since in any case ||w_{t+1}|| ≤ ||w_t|| + 1, it follows that
  ||w_{M+1}|| ≤ 2/γ + 1 + (3γ/4)·M.
From Claim 1, as before, Mγ ≤ w_{M+1}·w* ≤ ||w_{M+1}||. Combining,
  Mγ ≤ 2/γ + 1 + (3γ/4)·M ⟹ (γ/4)·M ≤ 2/γ + 1 ⟹ M ≤ 8/γ² + 4/γ ≤ 12/γ²
(using γ ≤ 1).

The mistake bound model: the CON algorithm

Let C_t ⊆ C be the set of concepts consistent with x_1, ..., x_{t−1}. At step t, choose a concept c ∈ C_t at random (any consistent concept will do) and predict b_t = c(x_t).

Theorem: for any finite concept class C, CON makes at most |C| − 1 mistakes.
Proof: initially C_1 = C. After each mistake |C_t| decreases by at least 1, since the chosen concept was wrong and is removed. Moreover |C_t| ≥ 1 at every t, since the target concept is always consistent. Therefore the number of mistakes is bounded by |C| − 1.

The bounds of CON

This bound is too high! There are 2^(2^n) different Boolean functions on {0,1}^n. We can do better.

HAL – the halving algorithm

Let C_t ⊆ C be the set of concepts consistent with x_1, ..., x_{t−1}. At step t, take a vote amongst all c ∈ C_t and predict b_t according to the majority.

Theorem: for any finite concept class C, HAL makes at most log₂|C| mistakes.
Proof: C_1 = C. After each mistake |C_{t+1}| ≤ ½|C_t|, since a majority of the remaining concepts were wrong and all of them are removed. Therefore the number of mistakes is bounded by log₂|C|.
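For a small, explicitly enumerable class, the halving algorithm is only a few lines of code. The sketch below is not from the slides; it assumes the concepts are given as a list of Python callables c(x) → 0/1 and that the stream yields (x, true_label) pairs.

```python
def halving_online(concepts, stream):
    """HAL: predict by majority vote over the version space of consistent concepts."""
    version_space = list(concepts)                          # C_1 = C
    for x, label in stream:
        votes = sum(c(x) for c in version_space)
        pred = 1 if 2 * votes > len(version_space) else 0   # majority prediction
        yield pred
        # keep only the concepts consistent with the revealed label
        version_space = [c for c in version_space if c(x) == label]
```

Whenever the prediction is wrong, the majority of version_space was wrong and gets filtered out, so the version space at least halves on every mistake — exactly the log₂|C| bound proved above.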
Mistake Bound model and PAC

The mistake bound model generates strong online algorithms. Earlier we saw the PAC model; the requirements of the mistake bound model are much harsher than those of PAC. If we know that A learns C in the mistake bound model, can A be converted into an algorithm that learns C in the PAC model?

Let A be a mistake bound algorithm. Our goal is to construct A_PAC, a PAC algorithm. Assume that after A gets x_i it constructs a hypothesis h_i.

Definition: a mistake bound algorithm A is conservative iff for every sample x_i, if c_t(x_i) = h_{i−1}(x_i) then at step i the algorithm keeps h_i = h_{i−1}. In other words, it changes its hypothesis only when it makes a mistake.

Conservative equivalent of a mistake bound algorithm

Let A be an algorithm whose number of mistakes is bounded by M, and let A_k denote A's hypothesis after it has seen x_1, ..., x_k. Define A' as follows. Initially h_0 = A_0. On sample x_i:
- Predict h_{i−1}(x_i).
- If c_t(x_i) = h_{i−1}(x_i), keep h_i = h_{i−1}.
- Otherwise feed x_i to A and let h_i be A's new hypothesis.
A' is conservative by construction. Consider S = {x_i : c_t(x_i) ≠ h_{i−1}(x_i)}, the examples on which A' errs; these are exactly the examples fed to A, and A's current hypothesis misclassifies each of them, so running A on S makes |S| mistakes. Hence |S| ≤ M, i.e. A' makes at most as many mistakes as A's bound.

Building A_PAC

Let k_i = (i/ε)·ln(M/δ). The A_PAC algorithm:
- Draw a sample of size (M/ε)·ln(M/δ) and divide it into M equal blocks.
- Run A' over the sample block by block; let h_{k_i} be its hypothesis after block i.
- Test each hypothesis h_{k_i} on the next block.
- If h_{k_i} makes no mistakes on that block, output it (0 ≤ i ≤ M − 1).
(Diagram in the original slides: the sample split into M blocks, with h_{k_0}, h_{k_1}, ... tested on successive blocks until one is consistent.)

If A' makes at most M mistakes, then A_PAC is guaranteed to finish: A' cannot keep making mistakes block after block beyond its bound, so some h_{k_i} passes its test block; and if A' has already converged to a perfect classifier, A_PAC outputs a perfect classifier. What happens otherwise, when a hypothesis merely happens to be consistent with its test block?

Theorem: A_PAC learns C in the PAC model.
Proof: each test block has (1/ε)·ln(M/δ) examples, so for any fixed i,
  Pr[h_{k_i} is ε-bad yet makes no mistake on its test block] ≤ (1 − ε)^{(1/ε)·ln(M/δ)} ≤ e^{−ln(M/δ)} = δ/M.
Therefore
  Pr[A_PAC outputs an ε-bad h] ≤ Pr[∃ 0 ≤ i ≤ M−1 s.t. h_{k_i} is ε-bad and passes its block]
                              ≤ Σ_{i=0}^{M−1} Pr[h_{k_i} is ε-bad and passes its block]
                              ≤ M·(δ/M) = δ.

Disjunction of Conjunctions

We have proven that every algorithm in the mistake bound model can be converted to a PAC algorithm. Let us now look at some algorithms in the mistake bound model.

Disjunction Learning

Our goal: learn the class of disjunctions, e.g. x_1 ∨ x_2 ∨ x_6 ∨ x_8.
Let L be the set of all 2n literals {x_1, ¬x_1, x_2, ¬x_2, ..., x_n, ¬x_n}, and let the hypothesis be h = ⋁{x : x ∈ L}.
Given a sample y:
1. If our hypothesis makes a mistake (h_t(y) ≠ c_t(y)), then L ← L \ S, where S is the set of literals satisfied by y: every x_i for which y_i = 1 and every ¬x_i for which y_i = 0.
2. Otherwise do nothing.
3. Return to step 1 with the updated hypothesis.

Example

With only 2 variables, L = {x_1, ¬x_1, x_2, ¬x_2} and h_t = x_1 ∨ ¬x_1 ∨ x_2 ∨ ¬x_2. Assume the first sample is y = (1,0); then h_t(y) = 1. If c_t(y) = 0 we update: the literals satisfied by y are x_1 and ¬x_2, so L becomes {¬x_1, x_2} and h_{t+1} = ¬x_1 ∨ x_2.

Mistake Bound Analysis

The number of mistakes is bounded by n + 1, where n is the number of variables.
Proof: let R be the set of literals in the target disjunction c_t, and let L_t be the literal set after t samples.
We prove by induction that R ⊆ L_t. For t = 0 this is obvious, since L_0 contains all literals. Assume R ⊆ L_{t−1}. If c_t(y_t) = 1 then, since R ⊆ L_{t−1} and some literal of R is satisfied by y_t, we have h_t(y_t) = 1 = c_t(y_t) and we do not update. If c_t(y_t) = 0 then no literal of R is satisfied by y_t, so S and R do not intersect and the update keeps R ⊆ L_t. Either way R ⊆ L_t. In particular h_t(y) = 1 whenever c_t(y) = 1, so we can only make mistakes on samples with c_t(y) = 0.
On the first mistake we eliminate exactly n literals (for every variable, exactly one of x_i, ¬x_i is satisfied by y). On any further mistake we eliminate at least one literal (a mistake means h_t(y) = 1, so some literal of L_t is satisfied by y and is removed). L_0 has 2n literals, so we can make at most n + 1 mistakes.

k-DNF

Definition: k-DNF functions are functions that can be represented as a disjunction of conjunctions (terms), each containing at most k literals, e.g. the 3-DNF (x_1 ∧ x_2 ∧ x_6) ∨ (x_1 ∧ x_3 ∧ x_5). The number of conjunctions of exactly i literals is C(n, i)·2^i: we choose i variables (C(n, i) ways) and a sign for each (2^i ways).

k-DNF classification

We can learn this class by running the previous algorithm over terms instead of variables: reduce the space X = {0,1}^n to Y = {0,1}^{O(n^k)}, with one coordinate per possible conjunction of at most k literals; a k-DNF over X becomes a disjunction over Y. This gives two usable algorithms: ELIM in the PAC model, and the previous algorithm in the mistake bound model, which makes O(n^k) mistakes.

Winnow

A monotone disjunction is a disjunction containing only positive literals, e.g. x_1 ∨ x_3 ∨ x_5. Our purpose is to learn the class of monotone disjunctions in the mistake bound model. Winnow is similar to the perceptron, with one main difference: it uses multiplicative rather than additive updates.

Winnow uses the same classification scheme as the perceptron, with a threshold θ (we take θ = n):
  h(x) = 1 (positive) iff x·w ≥ θ, and h(x) = 0 (negative) iff x·w < θ.
Initialize w_0 = (1, 1, ..., 1).
Update scheme:
- On a false positive (h(x) = 1, c_t(x) = 0): for every i with x_i = 1, w_i ← w_i/2.
- On a false negative (h(x) = 0, c_t(x) = 1): for every i with x_i = 1, w_i ← 2·w_i.

Mistake bound analysis: as with the perceptron, if the margin is bigger than γ one can prove a Θ(1/γ²) mistake bound; below we prove a bound for monotone disjunctions directly.
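Because the update rule is so short, here is a minimal sketch of Winnow with threshold θ = n (not from the slides); the function name and the stream of (x, label) pairs with x ∈ {0,1}^n and labels in {0,1} are assumptions of the example.

```python
import numpy as np

def winnow_online(n, stream):
    """Winnow for monotone disjunctions over {0,1}^n, threshold theta = n.
    `stream` yields (x, label) pairs with label in {0, 1}; yields each prediction."""
    w = np.ones(n)                          # w_0 = (1, ..., 1)
    for x, label in stream:
        x = np.asarray(x)
        pred = 1 if w @ x >= n else 0       # positive iff w . x >= theta
        if pred == 1 and label == 0:        # false positive: halve the active weights
            w[x == 1] /= 2.0
        elif pred == 0 and label == 1:      # false negative: double the active weights
            w[x == 1] *= 2.0
        yield pred
```

For a target disjunction over r relevant variables, the proof below bounds the total number of mistakes by 2 + 3r(log n + 1).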
Winnow proof: definitions

Let x_{i_1}, ..., x_{i_r} be the relevant variables of the target concept, i.e. c_t = x_{i_1} ∨ x_{i_2} ∨ ... ∨ x_{i_r}, and let W_r = {w_{i_1}, ..., w_{i_r}} be the weights of the relevant variables. Let w(t) denote the value of a weight w at time t, and let TW(t) be the total weight at time t, summed over all variables, relevant and irrelevant. Write M+ for the number of mistakes on positive examples and M− for the number of mistakes on negative examples.

Winnow proof: mistakes on positive examples

Any mistake on a positive example doubles at least one of the relevant weights: the example is positive, so some relevant variable is 1, and every weight whose variable is 1 gets doubled; i.e. ∃w ∈ W_r with w(t+1) = 2w(t). (As shown below, relevant weights never decrease, so these doublings are never undone.)
If some weight reaches w_i ≥ n, then on any example with x_i = 1 we have x·w ≥ n, so the classification is always positive and w_i is never doubled again. Hence each weight can be doubled at most 1 + log n times. Since every mistake on a positive example doubles at least one of the r relevant weights, we can bound
  M+ ≤ r(1 + log n).
For a mistake on a positive example we have h(x) = 0, i.e. w_1(t)x_1 + ... + w_n(t)x_n < n, and the update adds exactly this amount to the total weight:
  TW(t+1) = TW(t) + (w_1(t)x_1 + ... + w_n(t)x_n), hence
  (1) TW(t+1) < TW(t) + n.

Winnow proof: mistakes on negative examples

On a negative example all relevant variables are 0 (the target is a monotone disjunction of the relevant variables), so none of the relevant weights change; in particular, ∀w ∈ W_r: w(t+1) ≥ w(t) at every step.
For a mistake on a negative example we have h(x) = 1, i.e. w_1(t)x_1 + ... + w_n(t)x_n ≥ n, and the update removes half of this amount from the total weight:
  TW(t+1) = TW(t) − (w_1(t)x_1 + ... + w_n(t)x_n)/2, hence
  (2) TW(t+1) ≤ TW(t) − n/2.

Winnow proof: combining

Combining (1) and (2) over the whole run, and noting that the total weight is always positive:
  (3) 0 < TW(t) ≤ TW(0) + n·M+ − (n/2)·M−.
At the beginning all weights are 1, so (4) TW(0) = n. From (3) and (4):
  M− < 2 + 2·M+ ≤ 2 + 2r(log n + 1), and therefore
  M− + M+ ≤ 2 + 3r(log n + 1).

What should we know? I

Linear separators:
- Perceptron algorithm: M ≤ 1/γ².
- Margin perceptron: M ≤ 12/γ².
The mistake bound model:
- CON algorithm: M ≤ |C| − 1, but C may be very large!
- HAL, the halving algorithm: M ≤ log₂|C|.

What should you know? II

- The relation between PAC and the mistake bound model.
- The basic algorithm for learning disjunctions.
- Learning k-DNF functions.
- The Winnow algorithm: M ≤ 2 + 3r(log n + 1).

Questions?