Online Learning
Rong Jin

Batch Learning
• Given a collection of training examples D
• Learn a classification model from D
• What if training examples are received one at a time?

Online Learning
For t = 1, 2, …, T
• Receive an instance $x_t$
• Predict its class label $\hat{y}_t$
• Receive the true class label $y_t$
• Incur a loss
• Update the classification model
A sequence of classifiers is generated

Objective
• Minimize the total loss over the T rounds
• Loss function
  – Zero-One loss: $\mathbb{1}[\hat{y}_t \neq y_t]$
  – Hinge loss: $\max(0,\, 1 - y_t\, w_t^\top x_t)$, for a classifier with real-valued score $w_t^\top x_t$

Loss Functions
(Figure: hinge loss and zero-one loss curves.)

Linear Classifiers
• Restrict our discussion to linear classifiers with weight vector $w_t$
• Prediction: $\hat{y}_t = \mathrm{sign}(w_t^\top x_t)$
• Confidence: $|w_t^\top x_t|$

Separable Set

Inseparable Sets

Why Online Learning?
• Fast
• Memory efficient: processes one example at a time
• Simple to implement

Why Online Learning?
• Formal guarantees: regret/mistake bounds
• No statistical assumptions
• Adaptive

Concept Drifting
• An online learning algorithm can track the changing classifiers as long as the number of changes is small

Why Online Learning?
• Online-to-batch conversions
  – How to compute one classifier from the sequence of classifiers generated by online learning

Online to Batch Conversion

Why Online Learning?
• Not as good as well-designed batch algorithms

Update Rules
• Online algorithms are based on an update rule which defines $w_{t+1}$ from $w_t$ (and possibly other information)
• Linear classifiers: find $w_{t+1}$ from $w_t$ based on the input $(x_t, y_t)$
Some update rules:
– Perceptron (Rosenblatt)
– ALMA (Gentile)
– ROMMA (Li & Long)
– NORMA (Kivinen et al.)
– MIRA (Crammer & Singer)
– EG (Littlestone and Warmuth)
– Bregman Based (Warmuth)

Perceptron
Initialize $w_1 = 0$
For t = 1, 2, …, T
• Receive an instance $x_t$
• Predict its class label $\hat{y}_t = \mathrm{sign}(w_t^\top x_t)$
• Receive the true class label $y_t$
• If $y_t\, w_t^\top x_t \le 0$ then $w_{t+1} = w_t + y_t x_t$ (otherwise $w_{t+1} = w_t$)

Geometrical Interpretation

Mistake Bound: Separable Case
• Assume the data set D is linearly separable with margin $\gamma$, i.e., there exists $u$ with $\|u\|_2 = 1$ such that $y_t\, u^\top x_t \ge \gamma$ for all $(x_t, y_t) \in D$
• Assume $\|x_t\|_2 \le R$ for all $t$
• Then the maximum number of mistakes made by the Perceptron algorithm is bounded by $R^2 / \gamma^2$

Mistake Bound: Inseparable Case
• Let $u$ be the best linear classifier
• We measure our progress relative to $u$
• Consider the case where we make a mistake on $(x_t, y_t)$
• Result 1
• Result 2

Perceptron with Projection
Initialize $w_1 = 0$
For t = 1, 2, …, T
• Receive an instance $x_t$
• Predict its class label $\hat{y}_t = \mathrm{sign}(w_t^\top x_t)$
• Receive the true class label $y_t$
• If $y_t\, w_t^\top x_t \le 0$ then $w_{t+1} = w_t + y_t x_t$
• If the norm of the updated classifier is too large then project it back onto the allowed ball

Remarks
• The mistake bound is measured for a sequence of classifiers
• The bound does not depend on the dimension of the feature vector
• The bound holds for all sequences (no i.i.d. assumption)
• It is not tight for most real-world data, but it cannot be improved in general
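
The Perceptron update and the separable-case bound above can be tried on a small stream. What follows is only a minimal sketch in plain Python, not the lecture's own code: the names `dot`, `perceptron`, and `stream`, and the toy data, are assumptions made for illustration. The toy points are separable by $u = (1, 0)$ with margin $\gamma = 1$ and satisfy $\|x_t\|_2^2 \le R^2 = 5$, so the number of mistakes should not exceed $R^2 / \gamma^2 = 5$.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def perceptron(examples, dim):
    """Process examples one at a time; return final weights and mistake count."""
    w = [0.0] * dim          # initialize w_1 = 0
    mistakes = 0
    for x, y in examples:    # y is +1 or -1
        # Conservative rule: update only when the margin y * (w . x) is not positive.
        if y * dot(w, x) <= 0:
            w = [wi + y * xi for wi, xi in zip(w, x)]
            mistakes += 1
    return w, mistakes


# Hypothetical toy stream: the label is the sign of the first coordinate.
base = [((2.0, 1.0), 1), ((-1.0, 2.0), -1), ((1.0, -2.0), 1), ((-2.0, -1.0), -1)]
stream = base * 5            # the same four points presented five times

w, mistakes = perceptron(stream, dim=2)

R2 = max(dot(x, x) for x, _ in stream)   # R^2 = 5 for this data
gamma = 1.0                              # margin achieved by u = (1, 0)
bound = R2 / gamma ** 2

print("weights:", w, "mistakes:", mistakes, "bound:", bound)
```

Note that the bound limits the total number of mistakes over the whole stream, regardless of how many times the points are repeated or in which order they arrive.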
Perceptron
Conservative: updates the classifier only when it misclassifies
Initialize $w_1 = 0$
For t = 1, 2, …, T
• Receive an instance $x_t$
• Predict its class label $\hat{y}_t = \mathrm{sign}(w_t^\top x_t)$
• Receive the true class label $y_t$
• If $y_t\, w_t^\top x_t \le 0$ then $w_{t+1} = w_t + y_t x_t$

Aggressive Perceptron
Initialize $w_1 = 0$
For t = 1, 2, …, T
• Receive an instance $x_t$
• Predict its class label $\hat{y}_t = \mathrm{sign}(w_t^\top x_t)$
• Receive the true class label $y_t$
• If $y_t\, w_t^\top x_t \le \gamma$ (a margin threshold $\gamma > 0$) then $w_{t+1} = w_t + y_t x_t$

Regret Bound

Learning a Classifier
• The evaluation (mistake bound or regret bound) concerns a sequence of classifiers
• But, at the end of the day, which classifier should be used? The last one? One chosen by cross validation?

Learning with Expert Advice
• Learning to combine the predictions from multiple experts
• An ensemble of d experts: $f_1, \dots, f_d$
• Combination weights: $w = (w_1, \dots, w_d)$
• Combined classifier: $\mathrm{sign}\left(\sum_{i=1}^{d} w_i f_i(x)\right)$

Hedge
Simple case
• There exists one expert who can perfectly classify all the training examples
• What is your learning strategy?
Difficult case
• What if we don't have such a perfect expert?

Hedge Algorithm
(Figure: example expert predictions +1, -1, +1, +1.)

Hedge Algorithm
Initialize the weights $w_1, \dots, w_d$
For t = 1, 2, …, T
• Receive a training example $(x_t, y_t)$
• Prediction: $\hat{y}_t = \mathrm{sign}\left(\sum_{i=1}^{d} w_i f_i(x_t)\right)$
• If $\hat{y}_t \neq y_t$ then
  For i = 1, 2, …, d
  – If $f_i(x_t) \neq y_t$ then reduce the weight $w_i$

Mistake Bound
• Measure the progress
• Lower bound
• Upper bound
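
The expert-weighting loop above can be sketched in a few lines. This is an illustrative sketch only, written under stated assumptions rather than taken from the slides: the weights start at 1, a mistaken expert's weight is multiplied by a factor `beta = 0.5`, and ties in the weighted vote are broken toward +1. The names `hedge`, `expert_predictions`, and `labels` are hypothetical. In the toy run, expert 0 is always right and the other two are always wrong, which corresponds to the simple case with one perfect expert.

```python
def hedge(expert_predictions, labels, beta=0.5):
    """Weighted vote over d experts with multiplicative down-weighting.

    expert_predictions[t][i] is expert i's prediction (+1 or -1) in round t.
    beta is an assumed penalty factor in (0, 1); the slides do not state it.
    """
    d = len(expert_predictions[0])
    w = [1.0] * d                      # assumed initialization: uniform weights
    mistakes = 0
    for preds, y in zip(expert_predictions, labels):
        score = sum(wi * p for wi, p in zip(w, preds))
        y_hat = 1 if score >= 0 else -1            # ties go to +1 (assumption)
        if y_hat != y:
            mistakes += 1
            # Shrink the weight of every expert that was wrong in this round.
            w = [wi * beta if p != y else wi for wi, p in zip(w, preds)]
    return w, mistakes


# Hypothetical toy data: expert 0 is perfect, experts 1 and 2 are always wrong.
labels = [1, -1, 1, -1, 1, -1]
expert_predictions = [(y, -y, -y) for y in labels]

weights, mistakes = hedge(expert_predictions, labels)
print("weights:", weights, "mistakes:", mistakes)
```

After the first two rounds the two wrong experts no longer outvote the perfect one, so the combined classifier stops making mistakes even though most experts are unreliable.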