Online Learning Algorithms
1
Outline
• Online learning framework
• Design principles of online learning algorithms (additive updates)
  – Perceptron, Passive-Aggressive and Confidence-Weighted classification
  – Classification: binary, multi-class and structured prediction
  – Hypothesis averaging and regularization
• Multiplicative updates
  – Weighted Majority, Winnow, and connections to Gradient Descent (GD) and Exponentiated Gradient (EG)
2
Formal setting – Classification
• Instances $x \in \mathcal{X}$
  – e.g., images, sentences
• Labels $y \in \mathcal{Y}$
  – e.g., parse trees, names
• Prediction rule
  – Linear prediction rule $\hat{y} = \mathrm{sign}(w \cdot x)$
• Loss
  – e.g., number of mistakes
3
Predictions
• Continuous predictions:
  – Label: $\hat{y} = \mathrm{sign}(f(x))$
  – Confidence: $|f(x)|$
• Linear classifiers: $f(x) = w \cdot x$
  – Prediction: $\hat{y} = \mathrm{sign}(w \cdot x)$
  – Confidence: $|w \cdot x|$
4
Loss Functions
• Natural loss:
  – Zero-one loss: $\ell(y, \hat{y}) = \mathbb{1}[y \ne \hat{y}]$
• Real-valued-prediction losses (computed in the sketch below):
  – Hinge loss: $\ell(w; (x, y)) = \max\{0,\ 1 - y(w \cdot x)\}$
  – Exponential loss (Boosting): $\ell(w; (x, y)) = e^{-y(w \cdot x)}$
5
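A minimal sketch of the two classification losses above in Python (NumPy), for a linear score $w \cdot x$ and labels in $\{-1, +1\}$; the function names are mine, not from the slides.

```python
import numpy as np

def zero_one_loss(w, x, y):
    """1 if the sign of the linear score disagrees with the label y in {-1, +1}, else 0."""
    return 0.0 if y * np.dot(w, x) > 0 else 1.0

def hinge_loss(w, x, y):
    """max(0, 1 - y * (w . x)): zero only when the example is classified with margin >= 1."""
    return max(0.0, 1.0 - y * np.dot(w, x))
```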
Loss Functions
[Figure: hinge loss and zero-one loss plotted as functions of the margin $y(w \cdot x)$; the hinge loss upper-bounds the zero-one loss and reaches 0 at margin 1.]
6
Online Framework
• Initialize classifier $w_1$
• Algorithm works in rounds $t = 1, 2, \dots$
• On round $t$ the online algorithm:
  – Receives an input instance $x_t$
  – Outputs a prediction $\hat{y}_t = \mathrm{sign}(w_t \cdot x_t)$
  – Receives a feedback label $y_t$
  – Computes loss $\ell(w_t; (x_t, y_t))$
  – Updates the prediction rule $w_t \to w_{t+1}$
• Goal:
  – Suffer small cumulative loss $\sum_t \ell(w_t; (x_t, y_t))$ (see the protocol sketch below)
7
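A minimal sketch of this protocol, assuming binary labels in $\{-1, +1\}$ and a pluggable update rule; `online_learn` and `update` are names I introduce here, not from the slides.

```python
import numpy as np

def online_learn(examples, update, dim):
    """Run the online protocol: predict, receive the label, suffer loss, update.

    `examples` is an iterable of (x, y) pairs with y in {-1, +1};
    `update` maps (w, x, y) to the next weight vector (e.g., a Perceptron or PA step).
    """
    w = np.zeros(dim)                          # initialize the classifier
    cumulative_loss = 0.0
    for x, y in examples:                      # rounds t = 1, 2, ...
        y_hat = np.sign(w.dot(x))              # output a prediction
        cumulative_loss += float(y_hat != y)   # zero-one loss on this round
        w = update(w, x, y)                    # update the prediction rule
    return w, cumulative_loss
```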
Margin
• Margin of an example $(x, y)$ with respect to the classifier $w$: $y(w \cdot x)$
• Note: the margin is positive iff the example is classified correctly
• The set $\{(x_t, y_t)\}_{t=1}^{T}$ is separable iff there exists $u$ such that $y_t(u \cdot x_t) > 0$ for all $t$
8
Geometrical Interpretation
[Figure: points relative to the separating hyperplane, labeled by their margin: Margin << 0, Margin < 0, Margin > 0, Margin >> 0.]
9
Hinge Loss
$\ell(w; (x, y)) = \max\{0,\ 1 - y(w \cdot x)\}$
10
Why Online Learning?
• Fast
• Memory efficient - process one example at a time
• Simple to implement
• Formal guarantees – Mistake bounds
• Online to Batch conversions
• No statistical assumptions
• Adaptive
11
Update Rules
• Online algorithms are based on an update rule which defines $w_{t+1}$ from $w_t$ (and possibly other information)
• Linear classifiers: find $w_{t+1}$ from $w_t$ based on the input $(x_t, y_t)$
• Some update rules:
  – Perceptron (Rosenblatt)
  – ALMA (Gentile)
  – ROMMA (Li & Long)
  – NORMA (Kivinen et al.)
  – MIRA (Crammer & Singer)
  – EG (Littlestone and Warmuth)
  – Bregman-based (Warmuth)
  – CW (Dredze et al.)
12
Design Principles of Algorithms
• If the learner suffers non-zero loss at any round, then we want to balance two goals:
  – (1) Corrective: change the weights enough so that we don't make this error again
  – (2) Conservative: don't change the weights too much
• How do we define "too much"?
13
Design Principles of Algorithms
• If we use Euclidean distance to measure the change between the old and new weights:
  – Enforce (1) and minimize (2)
  – e.g., Perceptron for squared loss (Widrow-Hoff, or Least Mean Squares; see the sketch below)
• Passive-Aggressive algorithms do exactly the same
  – except (1) is much stronger: we want a correct classification with a margin of at least 1
• Confidence-Weighted classifiers
  – maintain a distribution over weight vectors
  – (1) is the same as Passive-Aggressive, with a probabilistic notion of margin
  – the change is measured by the KL divergence between the two distributions
14
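A minimal sketch of the Widrow-Hoff (LMS) update mentioned above, assuming a learning rate `eta` chosen by the user; it is one gradient step on the squared loss of the current example.

```python
import numpy as np

def widrow_hoff_update(w, x, y, eta=0.1):
    """LMS / Widrow-Hoff step: gradient descent on the squared loss (w.x - y)^2 / 2."""
    return w - eta * (w.dot(x) - y) * x
```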
Design Principles of Algorithms
• If we assume all weights are positive
  – we can use the (unnormalized) KL divergence to measure the change
  – this gives the multiplicative update, or EG algorithm (Kivinen and Warmuth); a sketch follows
15
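A minimal sketch of an EG-style multiplicative update for positive, normalized weights, following the general Kivinen-Warmuth recipe rather than any pseudocode from the slides; the learning rate `eta` and the use of squared loss are my assumptions.

```python
import numpy as np

def eg_update(w, x, y, eta=0.1):
    """Exponentiated Gradient step: multiply each weight by exp(-eta * gradient), then renormalize.

    Assumes w lies on the probability simplex (positive weights summing to 1)
    and uses the squared loss (w.x - y)^2 / 2, whose gradient is (w.x - y) * x.
    """
    grad = (w.dot(x) - y) * x
    w_new = w * np.exp(-eta * grad)
    return w_new / w_new.sum()
```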
The Perceptron Algorithm
• If no mistake ($y_t(w_t \cdot x_t) > 0$)
  – Do nothing
• If mistake
  – Update $w_{t+1} = w_t + y_t x_t$ (see the sketch below)
• Margin after update: $y_t(w_{t+1} \cdot x_t) = y_t(w_t \cdot x_t) + \|x_t\|^2$
16
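A minimal sketch of this rule, compatible with the `online_learn` loop above; the function name is mine.

```python
import numpy as np

def perceptron_update(w, x, y):
    """Perceptron rule: leave w unchanged on a correct prediction, add y*x on a mistake."""
    if y * w.dot(x) <= 0:       # mistake (or zero margin)
        return w + y * x
    return w                    # no mistake: do nothing
```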
Passive-Aggressive Algorithms
17
Passive-Aggressive: Motivation
• Perceptron: no guarantee on the margin after the update
• PA: enforce a minimal non-zero margin after the update
• In particular:
  – If the margin is large enough (at least 1), then do nothing
  – If the margin is less than 1, update such that the margin after the update is enforced to be 1
18
Aggressive Update Step
• Set $w_{t+1}$ to be the solution of the following optimization problem:
  – $w_{t+1} = \arg\min_w \tfrac{1}{2}\|w - w_t\|^2$  (2)
  – s.t. $\ell(w; (x_t, y_t)) = 0$, i.e. $y_t(w \cdot x_t) \ge 1$  (1)
• Closed-form update (see the sketch below):
  – $w_{t+1} = w_t + \tau_t y_t x_t$, where $\tau_t = \dfrac{\ell(w_t; (x_t, y_t))}{\|x_t\|^2}$
19
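A minimal sketch of this closed-form update, again compatible with the loop above; the guard against a zero input vector is my addition.

```python
import numpy as np

def pa_update(w, x, y):
    """Passive-Aggressive step: smallest change to w that achieves margin >= 1 on (x, y)."""
    loss = max(0.0, 1.0 - y * w.dot(x))   # hinge loss of the current weights
    if loss == 0.0 or not x.any():
        return w                          # passive: margin already at least 1 (or x is zero)
    tau = loss / x.dot(x)                 # step size from the closed-form solution
    return w + tau * y * x                # aggressive: margin after the update is exactly 1
```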
Passive-Aggressive Update
20
Unrealizable Case
21
Confidence-Weighted Classification
22
Confidence-Weighted Classification: Motivation
• Many positive reviews with the word "best" → w_best grows
• Later, a negative review
  – “boring book – best if you want to sleep in seconds”
• A linear update will reduce both w_best and w_boring
• But "best" appeared more often than "boring"
• How can we adjust different weights at different rates?
23
Update Rules
• The weight vector is a linear combination of the examples: $w_t = \sum_{i<t} \alpha_i y_i x_i$
• Two rate schedules (among others):
  – Perceptron algorithm, conservative: $\alpha_i = 1$ on mistake rounds, $0$ otherwise
  – Passive-Aggressive: $\alpha_i = \tau_i = \dfrac{\ell(w_i; (x_i, y_i))}{\|x_i\|^2}$
24
Distributions in Version Space
[Figure: a Gaussian distribution over weight vectors in version space, showing the mean weight vector and an example.]
25
Margin as a Random Variable
• With $w \sim \mathcal{N}(\mu, \Sigma)$, the signed margin $M = y(w \cdot x)$ is a Gaussian-distributed variable
• Thus: $M \sim \mathcal{N}\!\left(y(\mu \cdot x),\ x^{\top} \Sigma\, x\right)$
26
PA-like Update
• PA: $w_{t+1} = \arg\min_w \tfrac{1}{2}\|w - w_t\|^2$ s.t. $y_t(w \cdot x_t) \ge 1$
• New update: $(\mu_{t+1}, \Sigma_{t+1}) = \arg\min_{\mu, \Sigma} D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu, \Sigma)\,\|\,\mathcal{N}(\mu_t, \Sigma_t)\right)$ s.t. $\Pr_{w \sim \mathcal{N}(\mu, \Sigma)}\left[y_t(w \cdot x_t) \ge 0\right] \ge \eta$
27
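For reference, the chance constraint above can be rewritten in deterministic form using the Gaussian margin from the previous slide; this is the standard reduction (not stated explicitly on the slide), with $\Phi$ the standard normal CDF:

\[
\Pr_{w \sim \mathcal{N}(\mu, \Sigma)}\bigl[y_t (w \cdot x_t) \ge 0\bigr] \ge \eta
\;\Longleftrightarrow\;
y_t (\mu \cdot x_t) \ge \phi \sqrt{x_t^{\top} \Sigma\, x_t},
\qquad \phi = \Phi^{-1}(\eta).
\]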
Weight Vector (Version) Space
[Figure: version space with the constraint half-space highlighted: place most of the probability mass in this region.]
28
Passive Step
Nothing to do: most weight vectors already classify the example correctly.
29
Aggressive Step
• Project the current Gaussian distribution onto the half-space
• The mean is moved past the mistake line (large margin)
• The covariance is shrunk in the direction of the new example
30
Extensions:
Multi-class and Structured Prediction
31
Multiclass Representation I
• k prototypes: $w^1, w^2, \dots, w^k$
• New instance $x$
• Compute a score $w^r \cdot x$ for each class $r$, e.g.:

  Class r | 1     | 2    | 3    | 4
  Score   | -1.08 | 1.66 | 0.37 | -2.09

• Prediction: the class achieving the highest score, $\hat{y} = \arg\max_r\, w^r \cdot x$ (see the sketch below)
32
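A minimal sketch of prediction under Representation I, assuming one weight row per class; names are mine.

```python
import numpy as np

def multiclass_predict(W, x):
    """Representation I: one prototype (row of W) per class; predict the highest-scoring class."""
    scores = W.dot(x)              # k scores, one per class
    return int(np.argmax(scores))  # with the example scores above, class 2 would be predicted
```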
Multiclass Representation II
• Map all inputs and labels into a joint vector space via $F(x, y)$
  – Example: x = "Estimated volume was a light 2.4 million ounces .", y = the tag sequence B I O B I I I I O
  – $F(x, y)$ = (0 1 1 0 … )
• Score labels by projecting the weight vector onto the corresponding feature vector: $w \cdot F(x, y)$
33
Multiclass Representation II
• Predict the label with the highest score (inference): $\hat{y} = \arg\max_y\, w \cdot F(x, y)$
• Naïve search is expensive if the set of possible labels is large
  – Example: "Estimated volume was a light 2.4 million ounces ." with tags B I O B I I I I O
  – No. of labelings = $3^{\text{No. of words}}$
• Efficient Viterbi decoding for sequences! (see the sketch below)
34
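A minimal sketch of Viterbi decoding for this kind of BIO tagging, assuming the global score $w \cdot F(x, y)$ decomposes into per-position emission scores and tag-to-tag transition scores (a first-order factorization; the exact feature map used in the slides may differ).

```python
import numpy as np

def viterbi(emission, transition):
    """Find the highest-scoring tag sequence under a first-order factorization.

    emission:   (n_words, n_tags) array, score of tag k at position i
    transition: (n_tags, n_tags) array, score of moving from tag j to tag k
    Returns the argmax tag sequence in O(n_words * n_tags^2) time instead of n_tags^n_words.
    """
    n, k = emission.shape
    score = np.full((n, k), -np.inf)
    back = np.zeros((n, k), dtype=int)
    score[0] = emission[0]
    for i in range(1, n):
        for tag in range(k):
            cand = score[i - 1] + transition[:, tag] + emission[i, tag]
            back[i, tag] = int(np.argmax(cand))
            score[i, tag] = cand[back[i, tag]]
    path = [int(np.argmax(score[-1]))]      # best final tag
    for i in range(n - 1, 0, -1):           # follow back-pointers to the start
        path.append(back[i, path[-1]])
    return path[::-1]
```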
Two Representations
• Weight-vector per class (Representation I)
  – Intuitive
  – Improved algorithms
• Single weight-vector (Representation II)
  – Generalizes Representation I: e.g., $F(x, 4) = (0,\ 0,\ 0,\ x,\ 0)$ places $x$ in the block for class 4
  – Allows complex interactions between input and output
35
Margin for Multi Class
• Binary: $y(w \cdot x)$
• Multi-class: $w \cdot F(x, y) - \max_{y' \ne y} w \cdot F(x, y')$
36
Margin for Multi Class
• But different mistakes cost differently (as captured by the loss function), so use it!
• Margin scaled by the loss function: require $w \cdot F(x, y) - w \cdot F(x, y') \ge \ell(y, y')$ for all $y' \ne y$
37
Perceptron Multiclass Online Algorithm
• Initialize $w_1 = 0$
• For $t = 1, 2, \dots$
  – Receive an input instance $x_t$
  – Output a prediction $\hat{y}_t = \arg\max_y\, w_t \cdot F(x_t, y)$
  – Receive a feedback label $y_t$
  – Compute loss $\ell(y_t, \hat{y}_t)$
  – Update the prediction rule: if $\hat{y}_t \ne y_t$, set $w_{t+1} = w_t + F(x_t, y_t) - F(x_t, \hat{y}_t)$ (see the sketch below)
38
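A minimal sketch of one round of this algorithm in Representation II, assuming a joint feature map `feature_fn` and an explicitly enumerable label set (for sequences, the argmax would be replaced by Viterbi as above); names are mine.

```python
import numpy as np

def multiclass_perceptron_step(w, x, y, feature_fn, labels):
    """One round of the multiclass Perceptron with a joint feature map F(x, y)."""
    scores = {label: w.dot(feature_fn(x, label)) for label in labels}
    y_hat = max(scores, key=scores.get)                    # predict the highest-scoring label
    if y_hat != y:                                         # on a mistake, move toward the true
        w = w + feature_fn(x, y) - feature_fn(x, y_hat)    # label's features, away from the prediction's
    return w, y_hat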
PA Multiclass Online Algorithm
• Initialize $w_1 = 0$
• For $t = 1, 2, \dots$
  – Receive an input instance $x_t$
  – Output a prediction $\hat{y}_t = \arg\max_y\, w_t \cdot F(x_t, y)$
  – Receive a feedback label $y_t$
  – Compute loss $\ell_t = \max\{0,\ \ell(y_t, \hat{y}_t) - (w_t \cdot F(x_t, y_t) - w_t \cdot F(x_t, \hat{y}_t))\}$
  – Update the prediction rule: $w_{t+1} = w_t + \tau_t \left(F(x_t, y_t) - F(x_t, \hat{y}_t)\right)$, with $\tau_t = \dfrac{\ell_t}{\|F(x_t, y_t) - F(x_t, \hat{y}_t)\|^2}$ (see the sketch below)
39
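The corresponding update step, following the binary PA closed form above with the margin scaled by the cost; a sketch under the same assumptions as the Perceptron one, with `cost(y, y_hat)` a user-supplied loss function.

```python
import numpy as np

def multiclass_pa_step(w, x, y, y_hat, feature_fn, cost):
    """PA step after predicting y_hat: smallest change that beats y_hat by a margin of cost(y, y_hat)."""
    delta = feature_fn(x, y) - feature_fn(x, y_hat)   # direction separating truth from prediction
    loss = max(0.0, cost(y, y_hat) - w.dot(delta))    # cost-sensitive hinge loss
    if loss == 0.0 or not delta.any():
        return w                                      # passive step
    tau = loss / delta.dot(delta)                     # closed-form step size
    return w + tau * delta
```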
Regularization
• Key idea:
  – If an online algorithm works well on a sequence of i.i.d. examples, then an ensemble of the online hypotheses should generalize well
• Popular choices (averaging sketched below):
  – the averaged hypothesis
  – the majority vote
  – use a validation set to make a choice
40
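A minimal sketch of the averaged hypothesis, reusing the hypothetical `update` functions from the earlier sketches (e.g., `perceptron_update` or `pa_update`); the function name is mine.

```python
import numpy as np

def online_to_batch_average(examples, update, dim):
    """Online-to-batch conversion by averaging: return the mean of all intermediate weight vectors."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    count = 0
    for x, y in examples:
        w = update(w, x, y)
        w_sum += w
        count += 1
    return w_sum / max(1, count)
```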