On-Line Algorithms in Machine Learning
By: WALEED ABDULWAHAB YAHYA AL-GOBI, MUHAMMAD BURHAN HAFEZ, KIM HYEONGCHEOL, HE RUIDAN, SHANG XINDI

Overview
1. Introduction: online learning vs. offline learning
2. Predicting from Expert Advice
   - Weighted Majority Algorithm: Simple Version
   - Weighted Majority Algorithm: Randomized Version
3. Mistake Bound Model
   - Learning a Concept Class C
   - Learning Monotone Disjunctions: Simple Algorithm, Winnow Algorithm
   - Learning Decision Lists
4. Conclusion
5. Q & A

Part 1: Intro to Machine Learning (Offline Learning, Online Learning)
Presented by WALEED ABDULWAHAB YAHYA AL-GOBI

Machine Learning | Definition
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E" [Mitchell, 1997]
A more concrete example:
- Task T: predicting traffic patterns at a busy intersection.
- Experience E: historical (past) traffic patterns.
- Performance measure P: accuracy of predicting future traffic patterns.
- Learned model (i.e. target function): y = h(x).

Machine Learning | Offline Learning vs Online Learning
Offline learning:
- Learning phase: the learning algorithm is trained on a pre-defined set of training examples to produce a hypothesis h(x).
- Testing phase: the hypothesis is then used to draw conclusions about new data.
- Example: MRI brain image classification, where image features and training labels are fed to the learning algorithm to produce the learned model h(x).

Machine Learning | Offline Learning vs Online Learning
Online learning:
- In contrast to offline learning, which fits the predictor h(x) to the entire training set at once, online learning is commonly used when it is computationally infeasible to train on the whole dataset in one go.
- Online learning is a method of ML in which data becomes available in sequential order and is used to update the predictor h(x) at each step.

Machine Learning | Offline Learning vs Online Learning
Examples of online learning:
- Stock price prediction: the data is generated as a function of time, so online learning can dynamically adapt to new patterns in the incoming data.
- Spam filtering: the data is generated in response to the output of the learning algorithm (the spam detector), so online learning can dynamically adapt to new patterns and minimize our losses.

Machine Learning | Offline Learning vs Online Learning
Online learning flow (example: stock price prediction): at each time step the learner receives the current data features, makes a prediction, receives the true stock price, and updates the hypothesis h(x); the same loop repeats for every new example.

Machine Learning | Offline Learning vs Online Learning
Offline learning vs online learning:
- Two-phase learning vs multi-phase learning.
- Entire dataset given at once vs one example given at a time.
- Offline: learn the dataset to construct the target function h(x), then predict on incoming new data. Online: predict, receive the correct answer, and update the target function h(x) at each step of learning.
- Offline: the learning phase is separated from the testing phase. Online: the learning phase is combined with the testing phase.
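To make the loop above concrete, here is a minimal sketch (our own illustration, not from the slides) in which the hypothesis h(x) is simply a running mean of the prices seen so far:

    # Toy "stock price" data; the predictor h is a running mean of past prices.
    prices = [10.0, 10.5, 11.0, 10.8, 11.2]

    # Offline learning: the whole dataset is available up front, h is fit once.
    offline_h = sum(prices) / len(prices)

    # Online learning: examples arrive one at a time; predict, be told the
    # correct answer, then update h before the next example arrives.
    total, count = 0.0, 0
    for price in prices:
        prediction = total / count if count else 0.0   # predict with the current h
        error = abs(prediction - price)                # told the correct answer
        total, count = total + price, count + 1        # update h with the new example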
Part 2: Predicting from Expert Advice (Basic Flow, An Example)
Presented by WALEED ABDULWAHAB YAHYA AL-GOBI

Predicting from Expert Advice | Basic Flow
At each round the learner receives the predictions of the experts, combines the expert advice to make its own prediction, and is then told the correct answer (the truth). Assumption: predictions are in {0, 1}.

Predicting from Expert Advice | An Example
Task: predicting whether it will rain today.
Input: the advice of n experts, each in {1 (yes), 0 (no)}.
Output: 1 or 0.
Goal: make the least number of mistakes.

Date          Expert 1   Expert 2   Expert 3   Truth
21 Jan 2013   1          0          1          1
22 Jan 2013   0          1          0          1
23 Jan 2013   1          0          1          1
24 Jan 2013   0          1          1          1
25 Jan 2013   1          0          1          1

Part 3: The Weighted Majority Algorithm (Simple Version, Randomized Version)
Presented by WALEED ABDULWAHAB YAHYA AL-GOBI

The Weighted Majority Algorithm
1. Initialize the weights w1, ..., wn of all experts to 1.
2. Given the experts' predictions {x1, ..., xn}, output the weighted majority vote: the answer whose supporting experts have the larger total weight.
3. Receive the correct answer and penalize each mistaken expert by multiplying its weight by 1/2. Go to 2.

The Weighted Majority Algorithm
Running the algorithm on the rain example:

Date          Advice (x1 x2 x3)   Weights (w1 w2 w3)   Σwi(xi=0)   Σwi(xi=1)   Prediction   Correct answer
21 Jan 2013   1 0 1               1.00 1.00 1.00       1.00        2.00        1            1
22 Jan 2013   0 1 0               1.00 0.50 1.00       2.00        0.50        0            1
23 Jan 2013   1 0 1               0.50 0.50 0.50       0.50        1.00        1            1
24 Jan 2013   0 1 1               0.50 0.25 0.50       0.50        0.75        1            1
25 Jan 2013   1 0 1               0.25 0.25 0.50       0.25        0.75        1            1

The Weighted Majority Algorithm
Mistake bound: M ≤ 2.41(m + lg n).
Proof:
- Let M be the number of mistakes made by the Weighted Majority algorithm, W the total weight of all experts (initially W = n), and m the number of mistakes made by the best expert.
- On every mistaken prediction, at least W/2 of the weight was on the wrong answer, and in step 3 that weight is halved, so W drops by at least a factor of 1/4. Hence after M mistakes, W ≤ n(3/4)^M.
- The best expert has been halved only m times, so W ≥ (1/2)^m.
- Therefore (1/2)^m ≤ n(3/4)^M, which gives M ≤ 2.41(m + lg n).

Part 4: Randomized Weighted Majority Algorithm (RWMA)
Presented by MUHAMMAD BURHAN HAFEZ

The Randomized Weighted Majority Algorithm (RWMA)
M_WMA ≤ 2.41(m + lg n). Suppose n = 10, m = 20, and we run 100 prediction trials: the bound allows roughly 56 mistakes. Can we do better?

The Randomized Weighted Majority Algorithm (RWMA)
Two modifications:
1. View the weights as probabilities: follow expert i with probability wi/W (the slide illustrates this with expert weights normalized into probabilities such as 0.25, 0.5, 0.25).
2. Replace "multiply by 1/2" with "multiply by β".

The Randomized Weighted Majority Algorithm (RWMA)
The algorithm:
1. Initialize the weights w1, ..., wn of all experts to 1.
2. Given a set of predictions {x1, ..., xn} by the experts, output xi with probability wi/W.
3. Receive the correct answer l and penalize each mistaken expert by multiplying its weight by β. Go to 2.

The Randomized Weighted Majority Algorithm (RWMA)
RWMA in action (β = 1/2):

Experts          E1    E2    E3    E4    E5    E6    Prediction   Correct answer
Weights          1     1     1     1     1     1
Advice (day 1)   1     1     0     0     0     0     0            1
Weights          1     1     1/2   1/2   1/2   1/2
Advice (day 2)   0     1     1     1     1     0     1            0
Weights          1     1/2   1/4   1/4   1/4   1/2

The Randomized Weighted Majority Algorithm (RWMA)
Mistake bound: M ≤ (m ln(1/β) + ln n) / (1 - β).
- Define Fi to be the fraction of the total weight on the wrong answer at the i-th trial, and say we have seen t examples.
- Let M be our expected number of mistakes so far, so M = Σ_{i=1..t} Fi.
- On the i-th trial the fraction Fi of the weight is multiplied by β, so W becomes W(1 - (1 - β)Fi); after t trials, W = n · Π_{i=1..t} (1 - (1 - β)Fi).
- The best expert made m mistakes, so W ≥ β^m, giving β^m ≤ n · Π_{i=1..t} (1 - (1 - β)Fi).
- Taking logarithms: -m ln(1/β) ≤ ln n + Σ_{i=1..t} ln(1 - (1 - β)Fi).
- Using ln(1 - x) ≤ -x: Σ_{i=1..t} ln(1 - (1 - β)Fi) ≤ -(1 - β) Σ_{i=1..t} Fi.
- Rearranging: (1 - β) Σ_{i=1..t} Fi ≤ m ln(1/β) + ln n, i.e. M ≤ (m ln(1/β) + ln n) / (1 - β).

The Randomized Weighted Majority Algorithm (RWMA)
The relation between β and M:

β     M
1/4   1.85m + 1.3 ln(n)
1/2   1.39m + 2 ln(n)
3/4   1.15m + 4 ln(n)

When β = 1/2: the simple algorithm gives M ≤ 2.41(m + lg n), while RWMA gives M ≤ 1.39m + 2 ln(n).
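Both versions are short enough to state directly in code. The following Python sketch is only illustrative (the function names, the (advice, truth) input format, and the predict-1-on-a-tie rule are our own choices, not from the slides); it replays the rain example from the table above:

    import random

    def weighted_majority(advice_rounds, truths):
        """Simple WMA: predict with the weighted majority vote, then halve the
        weight of every expert that was wrong on that round."""
        w = [1.0] * len(advice_rounds[0])
        mistakes = 0
        for advice, truth in zip(advice_rounds, truths):
            weight_for_1 = sum(wi for wi, a in zip(w, advice) if a == 1)
            weight_for_0 = sum(w) - weight_for_1
            prediction = 1 if weight_for_1 >= weight_for_0 else 0
            mistakes += (prediction != truth)
            w = [wi / 2 if a != truth else wi for wi, a in zip(w, advice)]
        return mistakes

    def randomized_weighted_majority(advice_rounds, truths, beta=0.5):
        """RWMA: follow expert i with probability w_i / W, then multiply the
        weight of every mistaken expert by beta."""
        w = [1.0] * len(advice_rounds[0])
        mistakes = 0
        for advice, truth in zip(advice_rounds, truths):
            prediction = random.choices(advice, weights=w, k=1)[0]
            mistakes += (prediction != truth)
            w = [wi * beta if a != truth else wi for wi, a in zip(w, advice)]
        return mistakes

    # The rain example from the slides: three experts, the truth is always 1.
    advice_rounds = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [0, 1, 1], [1, 0, 1]]
    truths = [1, 1, 1, 1, 1]
    print(weighted_majority(advice_rounds, truths))             # 1 mistake (22 Jan)
    print(randomized_weighted_majority(advice_rounds, truths))  # randomized, varies per run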
The Randomized Weighted Majority Algorithm (RWMA)
Other advantages of RWMA:
1. Consider the case where only 51% of the experts were mistaken.
   - In WMA, we deterministically follow this majority and predict accordingly, resulting in a wrong prediction.
   - In RWMA, there is still roughly a 50/50 chance that we predict correctly.
2. Consider the case where the predictions are strategies (things that cannot easily be combined).
   - In WMA, since all strategies are generally different, we cannot sum the weights of experts who proposed the same strategy.
   - RWMA can be applied directly, because it does not depend on summing the weights of experts who gave the same strategy to make a decision, but only on the individual weights of the experts.

Part 5: Learning a Concept Class in the Mistake Bound Model (A Concept Class, the Mistake Bound Model, the definition of learning a class in the Mistake Bound Model)
Presented by KIM HYEONGCHEOL

Quick Review
What we covered so far: the input is a Yes/No answer from each of the "experts", and the algorithm then makes its own prediction. With weather experts, for example, the question to the experts is "Will it rain tomorrow?" and each expert predicts Yes/No; the algorithm is asked the same question and also predicts Yes/No, and experts are penalized according to their correctness. We saw a simple algorithm and a better randomized algorithm.

Learn a Concept Class
On-line learning of a concept class C in the Mistake Bound Model. Questions:
- What is a concept class C?
- What is the Mistake Bound Model?
- What do we mean by learning a concept class in the Mistake Bound Model?

A Concept Class C
Definition: a concept class C is a set of Boolean functions over a domain X; each Boolean function in the set is called a concept.
E.g. the concept class of disjunctions over the domain X ∈ {0,1}^n: all functions of the class can be described as disjunctions over the variables {X1, ..., Xn}. (A disjunction is an OR, e.g. a ∨ b; a conjunction is an AND, e.g. a ∧ b.)
Concepts in the class of disjunctions include:
- X1 ∨ X2
- X3 ∨ X2 ∨ X6
- X5 ∨ X1 ∨ X7 ∨ X8 ∨ Xn

Mistake Bound Model
On-line learning; in each iteration:
- The algorithm receives an unlabeled example.
- The algorithm predicts the label of the example.
- The algorithm is then given the true label.
- The algorithm is penalized depending on correctness.
Mistake bound: the number of mistakes made by the algorithm is bounded by M (ideally, we hope M is as small as possible).

Learning a Concept Class in the Mistake Bound Model
Assumptions and conditions:
- The target concept may be any (unknown) concept from the concept class, and it is fixed during the process.
- The true labels attached to the examples are generated by the target concept c ∈ C: for each example x, the true label is c(x).
- The true label is given to the algorithm so that it can update its hypothesis.
- The goal is to make as few mistakes as possible.

Learning a Concept Class in the Mistake Bound Model
Assumptions and conditions (cont'd): for any concept c ∈ C, the algorithm makes at most poly(n, size(c)) mistakes, where
- poly is a polynomial,
- n is the description length of the examples (e.g. for X = {X1, X2, ..., X10}, n = 10),
- size(c) is the description length of the concept c ∈ C (e.g. size(X1 ∨ X2 ∨ X6) = 3).

Learning a Concept Class in the Mistake Bound Model
If the algorithm satisfies these assumptions and conditions, we say that it learns the class C in the mistake bound learning model. In particular, if the number of mistakes made is only poly(size(c)) · polylog(n), the algorithm is robust to the presence of many additional irrelevant variables: it is attribute efficient.

Examples of Learning
Some examples of learning classes in the Mistake Bound Model:
- Monotone disjunctions: a simple algorithm and the Winnow algorithm
- Decision lists
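As a small illustration (our own, not from the slides), a concept from the class of disjunctions above can be represented by the set S of its relevant variable indices:

    # A monotone disjunction c(x) = OR of x_i for i in S, represented by S itself.
    def make_disjunction(S):
        return lambda x: int(any(x[i] for i in S))

    c = make_disjunction({0, 1})       # the concept X1 v X2 (0-indexed)
    print(c([0, 1, 0, 0, 0, 0]))       # 1: a positive example
    print(c([0, 0, 1, 0, 0, 1]))       # 0: a negative example
    # Here n = 6 is the description length of the examples and size(c) = |S| = 2.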
Part 6: Learning Monotone Disjunctions (Simple Algorithm, Winnow Algorithm)
Presented by KIM HYEONGCHEOL

Learning Monotone Disjunctions | Problem Definition
Monotone disjunctions are Boolean functions of the form "the OR of the variables Xi with i ∈ S" for some subset S ⊆ {1, ..., n}, e.g. X2 ∨ X3.
Input: a sequence of examples X ∈ {0,1}^n and a sequence of true labels in {0,1} generated by an unknown monotone disjunction.
Output: a sequence of predicted labels in {0,1}.
Objective: make the fewest false predictions.

Simple Algorithm
Algorithm workflow:
- Start with the hypothesis h = X1 ∨ X2 ∨ ... ∨ Xn.
- The hypothesis is given examples X ∈ {0,1}^n and predicts h(X).
- If a mistake is made on a negative example X, remove from h all variables that equal 1 in X.
Mistake bound: at most n mistakes!

Simple Algorithm | An Example
With target concept c(x) = X2 ∨ X3 and n = 6, the slide traces the hypothesis h on a sequence of examples; red marks a mistake on a negative example, green a correct prediction.

Simple Algorithm | An Example (continued; figure)

Part 6 (continued): Learning Monotone Disjunctions: Winnow Algorithm
Presented by HE RUIDAN

Learning the Class of Disjunctions | Winnow Algorithm
The simple algorithm learns the class of disjunctions with at most n mistakes. The Winnow algorithm makes fewer mistakes.

Winnow Algorithm | Basic Concept
- Each input vector is x = (x1, x2, ..., xn) with xi ∈ {0, 1}.
- Assume the target function is the disjunction of r relevant variables, i.e. c(x) = x_t1 ∨ x_t2 ∨ ... ∨ x_tr.
- The Winnow algorithm maintains a linear separator.

Winnow Algorithm | Work Flow
Initialize: weights w1 = w2 = ... = wn = 1.
Iterate: receive an example vector x = (x1, x2, ..., xn).
- Predict: output 1 if Σi wi·xi ≥ n, output 0 otherwise.
- Get the true label.
- Update only if a mistake was made:
  - Predicted negative on a positive example: for each xi = 1, set wi = 2·wi.
  - Predicted positive on a negative example: for each xi = 1, set wi = wi/2.

Winnow Algorithm | Mistake Bound
Theorem: the Winnow algorithm learns the class of disjunctions in the Mistake Bound Model, making at most 2 + 3r(1 + lg n) mistakes when the target concept is a disjunction of r variables.
Attribute efficient: the number of mistakes is only poly(r) · polylog(n). Winnow is particularly good when the number of relevant variables r is much smaller than the total number of variables n.
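Both learners described above fit in a few lines of Python. This sketch is only illustrative (the function names and the (x, label) stream format are our own choices); it follows the stated work flows, with the Winnow threshold Σi wi·xi ≥ n:

    def simple_disjunction_learner(examples, n):
        """Start with h = X1 v ... v Xn; on a mistake on a negative example,
        delete every variable that is 1 in that example (at most n mistakes)."""
        h = set(range(n))                       # indices still present in the hypothesis
        mistakes = 0
        for x, label in examples:               # examples is a sequence of (x, c(x)) pairs
            prediction = int(any(x[i] for i in h))
            if prediction != label:
                mistakes += 1
                if label == 0:                  # mistake on a negative example
                    h = {i for i in h if x[i] == 0}
        return h, mistakes

    def winnow_learner(examples, n):
        """Winnow: predict 1 iff sum_i w_i*x_i >= n; on a mistake, double the
        weights of the on-bits (false negative) or halve them (false positive)."""
        w = [1.0] * n
        mistakes = 0
        for x, label in examples:
            prediction = int(sum(wi * xi for wi, xi in zip(w, x)) >= n)
            if prediction != label:
                mistakes += 1
                factor = 2.0 if label == 1 else 0.5
                w = [wi * factor if xi == 1 else wi for wi, xi in zip(w, x)]
        return w, mistakes

    # Target concept c(x) = X2 v X3 (0-indexed {1, 2}), n = 6, as in the example slide.
    examples = [([1, 0, 0, 1, 0, 0], 0), ([0, 1, 0, 0, 0, 0], 1),
                ([1, 0, 0, 0, 1, 1], 0), ([0, 0, 1, 0, 0, 0], 1)]
    print(simple_disjunction_learner(examples, 6))
    print(winnow_learner(examples, 6))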
Winnow Algorithm | Proof of Mistake Bound
Let u be the number of mistakes made on positive examples (output 0 while the true label is 1) and v the number of mistakes made on negative examples (output 1 while the true label is 0). We show (1) u ≤ r(1 + lg n) and (2) v < 2(u + 1); together these bound the total number of mistakes.

Part 1: u ≤ r(1 + lg n)
- Any mistake made on a positive example doubles at least one of the weights of the target variables: if h(X) is negative while c(X) is positive, then at least one target variable equals 1 in X; the algorithm doubles the weights of exactly the variables that equal 1 in X, so at least one target variable's weight is doubled.
- The weights of target variables are never halved: weights are halved only when h(X) is positive while c(X) is negative, and on a negative example no target variable equals 1, so no target variable's weight is halved.
- Each target variable's weight can be doubled at most 1 + lg n times: the weight can only grow, and once it reaches n the hypothesis predicts positive whenever that variable equals 1; since weights are doubled only when the algorithm predicts negative on a positive example, a weight that has reached n is never doubled again. Starting from 1, a weight can therefore be doubled at most 1 + lg n times.
- Since there are r variables in the target function, u ≤ r(1 + lg n).

Part 2: v < 2(u + 1)
- Initially the total weight is W = n.
- Each mistake on a positive example increases W by less than n (the on-bits' weights sum to less than the threshold n, and only they are doubled).
- Each mistake on a negative example decreases W by at least n/2 (the on-bits' weights sum to at least n, and they are halved).
- Therefore 0 ≤ W < n + u·n - v·(n/2), which gives v < 2(u + 1).

Combining the two parts, the total number of mistakes is u + v < 3u + 2 ≤ 2 + 3r(1 + lg n).

Part 7: Learning Decision Lists in the Mistake Bound Model
Presented by SHANG XINDI

Decision List
A decision list has the form: if X1 then B1, ..., else if Xr then Br, else B_{r+1}.
A general (leveled) form of a decision list is:
- level 1: {X1 → B1, ..., X_{n1} → B_{n1}}
- ...
- level r: else {X_{n_{r-1}+1} → B_{n_{r-1}+1}, ..., X_{nr} → B_{nr}}
- level r+1: else {True → B_{nr+1}, ...}
where each Xi is a Boolean variable and Bi ∈ {0, 1}.

Decision List | Example (figure)

Decision List vs Disjunction (figure)

Learning Decision List (figure)

Learning Decision List | Algorithm
Hypothesis h: a leveled decision list.
Initialize: a 1-level decision list containing all 4n + 2 possible "if/then" rules.
Iterate: receive an example X = (X1, X2, ..., Xn).
- Predict: find the first level that contains a rule satisfied by X and use that rule for the prediction (if there are several choices, choose one arbitrarily).
- Receive the true label.
- Update if a mistake was made: move all rules that predicted wrongly down to the next level.
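A compact Python sketch of this learner follows (our own illustration; we read "move all rules that predict wrong" as demoting the wrong, satisfied rules from the level that was consulted, and we encode a rule as a (condition, output) pair):

    def init_rules(n):
        """All 4n + 2 one-level rules: 'if x_i = v then b' plus 'if True then b'."""
        rules = [((i, v), b) for i in range(n) for v in (0, 1) for b in (0, 1)]
        return rules + [(None, b) for b in (0, 1)]

    def satisfied(condition, x):
        return condition is None or x[condition[0]] == condition[1]

    def learn_decision_list(examples, n):
        levels = [init_rules(n)]          # start with a single level holding every rule
        mistakes = 0
        for x, label in examples:
            # find the first level with a rule satisfied by x; use it to predict
            level_index, rule = next((li, r) for li, level in enumerate(levels)
                                     for r in level if satisfied(r[0], x))
            if rule[1] != label:
                mistakes += 1
                # demote the rules in that level that are satisfied by x and predict wrongly
                wrong = [r for r in levels[level_index]
                         if satisfied(r[0], x) and r[1] != label]
                levels[level_index] = [r for r in levels[level_index] if r not in wrong]
                if level_index + 1 == len(levels):
                    levels.append([])
                levels[level_index + 1].extend(wrong)
        return levels, mistakes

    # The n = 2 example from the slides that follow: labels given by the column c.
    examples = [([0, 0], 1), ([0, 1], 0), ([1, 0], 0), ([1, 1], 0)]
    print(learn_decision_list(examples, 2))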
Learning Decision List | Example
For n = 2 the initial 1-level list contains the 4n + 2 = 10 rules (one rule for each value of each variable and each output, so some entries appear identical here):
{X1 → 0, X1 → 1, X1 → 0, X1 → 1, X2 → 0, X2 → 1, X2 → 0, X2 → 1, True → 0, True → 1}

Learning Decision List | Example
After updating on mistakes, the list becomes:
{X1 → 0, X1 → 1, X1 → 0, X2 → 0, X2 → 0, X2 → 1, True → 0} else {X1 → 1, X2 → 1, True → 1}

Learning Decision List | Example
The target concept c and the learned hypothesis h on the four examples:

X1   X2   c   h
0    0    1   1
0    1    0   0
1    0    0   0
1    1    0   0

Learning Decision List | Mistake Bound (figure)

Summary
1. Introduction: online learning vs. offline learning
2. Predicting from Expert Advice
   - Weighted Majority Algorithm: Simple Version
   - Weighted Majority Algorithm: Randomized Version
3. Mistake Bound Model
   - Learning a Concept Class C
   - Learning Monotone Disjunctions: Simple Algorithm, Winnow Algorithm
   - Learning Decision Lists
4. Demo of online learning: Learning to Swing-Up and Balance

Q & A