SVM by Sequential Minimal Optimization (SMO)
Algorithm by John Platt
Lecture by David Page
CS 760: Machine Learning
Quick Review
• Inner product (specifically here the dot product) is defined as ⟨x, z⟩ = Σi xi zi
• Changing x to w and z to x, we may also write ⟨w, x⟩, or w · x, or wTx, or w x
• In our usage, x is the feature vector and w is the weight (or coefficient) vector
• A line (or more generally a hyperplane) is written
as wTx = b, or wTx - b = 0
• A linear separator is written as sign(wTx - b)
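A two-line Python sketch of such a separator (the function name and the handling of ties are illustrative, not from the lecture):

import numpy as np

def linear_separator(w, b, x):
    # predict with the hyperplane wTx = b: returns +1 or -1
    return 1 if np.dot(w, x) - b > 0 else -1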
Quick Review (Continued)
• Recall in SVMs we want to maximize the margin subject to correctly separating the data… that is, we want to:
minimize ½ ||w||² subject to yi(wTxi - b) ≥ 1 for all i
• Here the y's are the class values (+1 for positive and -1 for negative), so yi(wTxi - b) ≥ 1 says xi is correctly labeled with room to spare
• Recall ||w||² is just wTw
Recall Full Formulation
• As last lecture showed us, we can
– Solve the dual more efficiently (fewer unknowns)
– Add parameter C to allow some misclassifications
– Replace xiTxj by a more general kernel term
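For reference, the dual problem these bullets describe takes the standard soft-margin kernel form (sketched here in LaTeX; notation follows the slides):

\[
\max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j y_i y_j \alpha_i \alpha_j K(x_i, x_j)
\quad\text{subject to}\quad 0 \le \alpha_i \le C \ \text{for all } i, \qquad \sum_i \alpha_i y_i = 0
\]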
Intuitive Introduction to SMO
• The perceptron learning algorithm is essentially doing the same thing – finding a linear separator by adjusting weights on misclassified examples
• Unlike the perceptron, SMO has to maintain the sum over examples of example weight times example label (it must keep Σi αi yi fixed)
• Therefore, when SMO adjusts the weight of one example, it must also adjust the weight of another
An Old Slide:
Perceptron as Classifier
• Output for example x is sign(wTx)
• Candidate Hypotheses: real-valued weight vectors w
• Training: Update w for each misclassified example x (target class t, predicted o) by:
– wi ← wi + η (t - o) xi
– Here η is the learning rate parameter
• Let E be the error o - t. The above can be rewritten as
w ← w - ηEx
• So the final weight vector is a sum of weighted contributions from each example. The predictive model can be rewritten as a weighted combination of examples rather than features, where the weight on an example is the sum (over iterations) of terms -ηE.
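A minimal Python sketch of this "weights on examples" view of the perceptron; the names and the 0/±1 handling are illustrative assumptions:

import numpy as np

def perceptron_dual(X, y, eta=0.1, epochs=10):
    # X: (n, d) array of examples; y: length-n array of +1/-1 labels
    # Keep one coefficient a[j] per example, so that implicitly w = sum_j a[j] * X[j]
    n = len(X)
    a = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            o = np.sign(np.dot(a, X @ X[i]))   # prediction uses only dot products xjT xi
            if o != y[i]:
                E = o - y[i]                    # error term
                a[i] -= eta * E                 # same update as w <- w - eta*E*x
    return a

Because the data enter only through the dot products xjTxi, a kernel can later be substituted for the dot product, which is exactly the move made for SVMs.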
Corresponding Use in Prediction
• To use the perceptron, the prediction is wTx - b. We may treat -b as one more coefficient in w (and omit it here), and we may take the sign of this value
• In the alternative view of the last slide, the prediction for an example xi can be written as Σj aj (xjT xi) - b or, revising the weights appropriately, Σj yj aj (xjT xi) - b
• Prediction in SVM is: Σj yj αj K(xj, xi) - b
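The same prediction written as a Python sketch (the argument names and the linear-kernel default are assumptions of mine):

import numpy as np

def svm_predict(alpha, y, X, b, x_new, kernel=np.dot):
    # SVM output u = sum_j y_j * alpha_j * K(x_j, x_new) - b
    u = sum(a * yj * kernel(xj, x_new) for a, yj, xj in zip(alpha, y, X)) - b
    return np.sign(u)   # drop the sign() to get the raw output u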
From Perceptron Rule to SMO Rule
• Recall that the SVM optimization problem has the added requirement that: Σi αi yi = 0
• Therefore if we increase one α by an amount η, in either direction, then we have to change another α by an equal amount in the opposite direction (relative to class value). We can accomplish this by changing:
w ← w - ηEx   also viewed as:
α ← α - yηE   to:
α1 ← α1 - y1 η (E1 - E2)
or equivalently: α1 ← α1 + y1 η (E2 - E1)
We then also adjust α2 so that y1α1 + y2α2 stays the same
First of Two Other Changes
• Instead of η = 1/20 or some similar constant, we set
η = 1 / (x1 · x1 + x2 · x2 - 2 x1 · x2)
or more generally
η = 1 / (K(x1, x1) + K(x2, x2) - 2 K(x1, x2))
• For a given difference in error between two
examples, the step size is larger when the two
examples are more similar. When x1 and x2
are similar, K(x1,x2) is larger (for typical
kernels, including dot product). Hence the
denominator is smaller and step size is larger.
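As a tiny Python sketch (the function name is illustrative; a real implementation must also handle a zero denominator, which occurs when x1 = x2):

def step_size(x1, x2, kernel):
    # eta = 1 / (K(x1,x1) + K(x2,x2) - 2*K(x1,x2)): larger when x1 and x2 are similar
    return 1.0 / (kernel(x1, x1) + kernel(x2, x2) - 2.0 * kernel(x1, x2))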
Second of Two Other Changes
• Recall our formulation had an additional constraint:
0 ≤ αi ≤ C for all i
• Therefore, we “clip” any change we make to any α to respect this constraint
• This yields the following SMO algorithm. (So
far we have motivated it intuitively. Afterward
we will show it really minimizes our objective.)
Updating Two ’s: One SMO Step
• Given examples e1 and e2, set
where:
• Clip this value in the natural way:
if y1 = y2 then:
•
otherwise:
• Set
where s = y1y2
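One SMO step as a Python sketch; the kernel-matrix argument, the in-place update of alpha, and the omitted bookkeeping (error cache, threshold) are my simplifications:

import numpy as np

def smo_step(i1, i2, alpha, y, E, K, C):
    # Update the pair (alpha[i1], alpha[i2]); E[i] = u_i - y_i, K is the kernel matrix
    denom = K[i1, i1] + K[i2, i2] - 2.0 * K[i1, i2]
    if denom <= 0:
        return False                                # unusual case: skip this pair
    a2_new = alpha[i2] + y[i2] * (E[i1] - E[i2]) / denom

    # Clip so that 0 <= alpha <= C while y1*alpha1 + y2*alpha2 is preserved
    if y[i1] == y[i2]:
        L = max(0.0, alpha[i1] + alpha[i2] - C)
        H = min(C, alpha[i1] + alpha[i2])
    else:
        L = max(0.0, alpha[i2] - alpha[i1])
        H = min(C, C + alpha[i2] - alpha[i1])
    a2_new = min(max(a2_new, L), H)

    s = y[i1] * y[i2]
    a1_new = alpha[i1] + s * (alpha[i2] - a2_new)   # keep y1*alpha1 + y2*alpha2 fixed
    alpha[i1], alpha[i2] = a1_new, a2_new
    return True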
Karush-Kuhn-Tucker (KKT) Conditions
• It is necessary and sufficient for a solution to our objective that all α's satisfy the following (ui is the SVM output for example xi):
αi = 0 ⇒ yi ui ≥ 1
0 < αi < C ⇒ yi ui = 1
αi = C ⇒ yi ui ≤ 1
• An α is 0 iff that example is correctly labeled with room to spare
• An α is C iff that example is incorrectly labeled or in the margin
• An α is properly between 0 and C (is “unbound”) iff that example is “barely” correctly labeled (is a support vector)
• We just check KKT to within some small epsilon, typically 10⁻³
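One way to code the check in Python (a sketch; using the same epsilon as the tolerance on both α and the margin is a simplification of mine):

def violates_kkt(alpha_i, y_i, u_i, C, eps=1e-3):
    # u_i is the current SVM output for example i; r > 0 means room to spare
    r = y_i * u_i - 1.0
    if alpha_i < eps and r < -eps:                  # alpha ~ 0 but margin violated
        return True
    if alpha_i > C - eps and r > eps:               # alpha ~ C but room to spare
        return True
    if eps < alpha_i < C - eps and abs(r) > eps:    # unbound but not exactly on margin
        return True
    return False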
SMO Algorithm
• Input: C, kernel, kernel parameters, epsilon
• Initialize b and all α's to 0
• Repeat until KKT satisfied (to within epsilon):
– Find an example e1 that violates KKT (prefer unbound
examples here, choose randomly among those)
– Choose a second example e2. Prefer one to maximize step
size (in practice, faster to just maximize |E1 – E2|). If that
fails to result in change, randomly choose unbound
example. If that fails, randomly choose example. If that
fails, re-choose e1.
– Update α1 and α2 in one step
– Compute new threshold b
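A much-simplified Python sketch of this loop, reusing the violates_kkt and smo_step sketches above; the choice of e2, the stopping rule, and the way b is refreshed are simplifications rather than Platt's full heuristics:

import random
import numpy as np

def smo_train(X, y, C, kernel, eps=1e-3, max_iters=1000):
    # X: (n, d) array of examples; y: length-n array of +1/-1 labels
    n = len(X)
    alpha = np.zeros(n)
    b = 0.0
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # kernel matrix

    for _ in range(max_iters):
        u = (alpha * y) @ K - b                     # current outputs u_i
        E = u - y                                   # errors E_i = u_i - y_i
        violators = [i for i in range(n) if violates_kkt(alpha[i], y[i], u[i], C, eps)]
        if not violators:
            break                                   # KKT satisfied (to within epsilon)
        i1 = random.choice(violators)
        # second-choice heuristic: maximize |E1 - E2|
        i2 = max((j for j in range(n) if j != i1), key=lambda j: abs(E[i1] - E[j]))
        smo_step(i1, i2, alpha, y, E, K, C)
        # refresh b from an unbound example, for which y_i * u_i = 1 must hold
        unbound = [i for i in range(n) if eps < alpha[i] < C - eps]
        if unbound:
            b = (alpha * y) @ K[unbound[0]] - y[unbound[0]]
    return alpha, b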
At End of Each Step, Need to Update b
• Compute b1 and b2 as follows:
b1 = E1 + y1 (α1new - α1) K(x1, x1) + y2 (α2new - α2) K(x1, x2) + b
b2 = E2 + y1 (α1new - α1) K(x1, x2) + y2 (α2new - α2) K(x2, x2) + b
• Choose b = (b1 + b2)/2
• When α1 and α2 are not at bound, it is in fact
guaranteed that b = b1 = b2
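The same threshold update as a Python helper (argument names are illustrative):

def new_threshold(b, E1, E2, y1, y2, a1_old, a1_new, a2_old, a2_new, K11, K12, K22):
    # b1 and b2 as on this slide; average them (they coincide when both alphas are unbound)
    b1 = E1 + y1 * (a1_new - a1_old) * K11 + y2 * (a2_new - a2_old) * K12 + b
    b2 = E2 + y1 * (a1_new - a1_old) * K12 + y2 * (a2_new - a2_old) * K22 + b
    return 0.5 * (b1 + b2)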
Derivation of SMO Update Rule
• Recall the objective function we want to optimize (to be minimized):
Ψ = ½ Σi Σj yi yj K(xi, xj) αi αj - Σi αi
• Separating out the terms for α1 with everybody else, α2 with everybody else, and all the other α's, we can write the objective as:
Ψ = ½ K11 α1² + ½ K22 α2² + s K12 α1 α2 + y1 α1 v1 + y2 α2 v2 - α1 - α2 + Ψconstant
where:
Kij = K(xi, xj), s = y1y2, and vi = Σj≥3 yj αj* Kij = ui + b* - y1 α1* K1i - y2 α2* K2i
Starred values are from end of previous round
Expressing Objective in Terms of One
Lagrange Multiplier Only
• Because of the constraint Σi αi yi = 0, each update has to obey:
α1 + s α2 = w
where w is a constant and s is y1y2
• This lets us express the objective function in terms of α2 alone:
Ψ = ½ K11 (w - s α2)² + ½ K22 α2² + s K12 (w - s α2) α2 + y1 (w - s α2) v1 + y2 α2 v2 - (w - s α2) - α2 + Ψconstant
Find Minimum of this Objective
• Find the extremum by taking the first derivative with respect to α2:
dΨ/dα2 = -s K11 (w - s α2) + K22 α2 - K12 α2 + s K12 (w - s α2) - y2 v1 + y2 v2 + s - 1 = 0
• If the second derivative is positive (the usual case), then the minimum is where (by simple algebra, remembering ss = 1):
α2 (K11 + K22 - 2 K12) = s (K11 - K12) w + y2 (v1 - v2) + 1 - s
Finishing Up
• Recall that w = α1* + s α2* and vi = ui + b* - y1 α1* K1i - y2 α2* K2i
• So we can rewrite
α2 (K11 + K22 - 2 K12) = s (K11 - K12) w + y2 (v1 - v2) + 1 - s
as:
α2 (K11 + K22 - 2 K12) = s (K11 - K12)(s α2* + α1*) + y2 (u1 - u2 + y1 α1* (K12 - K11) + y2 α2* (K22 - K21)) + y2y2 - y1y2
• We can rewrite this as:
α2 (K11 + K22 - 2 K12) = α2* (K11 + K22 - 2 K12) + y2 (u1 - u2 + y2 - y1)
Finishing Up
• Based on the values of w and v, we rewrote
α2 (K11 + K22 - 2 K12)
as:
α2* (K11 + K22 - 2 K12) + y2 ((u1 - y1) - (u2 - y2))
• From this, we get our update rule:
α2new = α2* + y2 (E1 - E2) / (K11 + K22 - 2 K12)
where E1 = u1 - y1, E2 = u2 - y2