Foundations: fortunate choices
• Unusual choice of separation strategy:
> Maximize “street” between groups
• Attack maximization problem:
> Lagrange multipliers + hairy mathematics
• New problem is a quadratic minimization:
> Susceptible to fancy numerical methods
• Result depends on dot products only
> Enables use of kernel methods.
Key idea: find widest separating “street”
Classifier form is given and constrained
• Classify unknown u as plus if:
  $f(\mathbf{u}) = \mathbf{w} \cdot \mathbf{u} + b > 0$
• Then, constrain, for all plus sample vectors:
  $f(\mathbf{x}_+) = \mathbf{w} \cdot \mathbf{x}_+ + b \ge 1$
• And, for all minus sample vectors:
  $f(\mathbf{x}_-) = \mathbf{w} \cdot \mathbf{x}_- + b \le -1$
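Below is a minimal sketch of this decision rule in Python; the w and b are made-up stand-ins for values that training would produce.

```python
import numpy as np

def classify(u, w, b):
    # Plus if f(u) = w . u + b > 0, minus otherwise.
    return 1 if np.dot(w, u) + b > 0 else -1

w, b = np.array([1.0, 1.0]), -3.0            # hypothetical trained values
print(classify(np.array([4.0, 4.0]), w, b))  # 1  (f = 5 > 0)
print(classify(np.array([1.0, 1.0]), w, b))  # -1 (f = -1 < 0)
```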
Distance between street’s gutters
• For points x₁ and x₂ on the plus and minus gutters, the constraints require:
  $\mathbf{w} \cdot \mathbf{x}_1 + b = +1$
  $\mathbf{w} \cdot \mathbf{x}_2 + b = -1$
• So, subtracting:
  $\mathbf{w} \cdot (\mathbf{x}_1 - \mathbf{x}_2) = 2$
• Dividing by the length of w produces the distance between the lines:
  $\frac{\mathbf{w}}{\|\mathbf{w}\|} \cdot (\mathbf{x}_1 - \mathbf{x}_2) = \frac{2}{\|\mathbf{w}\|}$
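A quick numeric check of that result, reusing the made-up w and b from the sketch above: pick one point on each gutter, project their difference onto the unit normal, and compare with 2/‖w‖.

```python
import numpy as np

w, b = np.array([1.0, 1.0]), -3.0

x1 = np.array([4.0, 0.0])  # on the plus gutter:  w . x1 + b = +1
x2 = np.array([2.0, 0.0])  # on the minus gutter: w . x2 + b = -1

width = np.dot(w / np.linalg.norm(w), x1 - x2)
print(width, 2 / np.linalg.norm(w))  # both print 1.4142... = sqrt(2)
```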
From maximizing to minimizing…
• So, to maximize the width of the street, you need to “wiggle” w until the length of w is minimum, while still honoring the constraints on gutter values:
  $\frac{2}{\|\mathbf{w}\|} = \text{separation}$
• One possible approach to finding the minimum is to use the method devised by Lagrange. Working through the method reveals how the sample data figures into the classification formula.
From maximizing to minimizing…
• A step toward solving the problem using Lagrange’s method is to note that maximizing street width is ensured if you minimize the following, while still honoring the constraints on gutter values:
  $\frac{1}{2}\|\mathbf{w}\|^2$
• Translating the previous formula into this one, with the ½ and the squaring, is a mathematical convenience: squaring is monotonic, so the same w that minimizes ‖w‖ minimizes ½‖w‖², and the ½ and the square make the upcoming derivatives clean.
…while honoring constraints
• Remember, the minimization is constrained.
• You can write the constraints compactly as:
  $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$
  where $y_i$ is +1 for plusses and −1 for minuses. For a plus sample this reads $\mathbf{w} \cdot \mathbf{x}_i + b \ge 1$; for a minus sample, the −1 flips it to $\mathbf{w} \cdot \mathbf{x}_i + b \le -1$.
Dependence on dot products
• Using Lagrange’s method, and working through some mathematics, you get to the following problem. When solved for the alphas, you then have what you need for the classification formula.

Maximize $\sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$

Subject to $\sum_{i=1}^{l} \alpha_i y_i = 0$ and $\alpha_i \ge 0$

Then check the sign of $f(\mathbf{u}) = \mathbf{w} \cdot \mathbf{u} + b = \left( \sum_{i=1}^{l} \alpha_i y_i \, \mathbf{x}_i \cdot \mathbf{u} \right) + b$
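As a sketch of the “fancy numerical methods” the outline promised, here the dual problem is handed to scipy’s general-purpose SLSQP solver on a made-up four-point dataset; a real SVM library would use a dedicated quadratic-programming routine.

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data: two plus samples, two minus samples.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# G[i, j] = y_i y_j x_i . x_j -- only dot products of samples appear.
G = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(a):
    # Negated because scipy minimizes and the slide's problem maximizes.
    return -(a.sum() - 0.5 * a @ G @ a)

res = minimize(neg_dual, np.zeros(len(y)), method="SLSQP",
               bounds=[(0.0, None)] * len(y),                        # a_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum a_i y_i = 0

a = res.x
w = (a * y) @ X                 # w = sum_i a_i y_i x_i
sv = a > 1e-6                   # support vectors carry nonzero alpha
b = np.mean(y[sv] - X[sv] @ w)  # from the gutter condition y_i (w . x_i + b) = 1

u = np.array([2.0, 1.5])
print(np.sign(w @ u + b))       # classify an unknown: prints 1.0
```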
Key to importance
• Learning depends only on dot products of sample pairs.
• Classification depends only on dot products of the unknown with the samples.
• Exclusive reliance on dot products enables the approach to handle problems in which samples cannot be separated by a straight line.
Example

Another example
Not separable? Try another space!
• Using some mapping, Φ, move the samples into a new space.
(Figure: the problem starts here, in 2D; dot products are computed here, in 3D, after mapping through Φ.)
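A minimal illustration in Python, using one assumed lift Φ(x) = (x₁, x₂, ‖x‖²) since the slides don’t pin down a particular map: points no line can separate in 2D become plane-separable in 3D.

```python
import numpy as np

# Class +1 on a circle of radius 0.5, class -1 on a circle of radius 2:
# not linearly separable in 2D.
theta = np.linspace(0, 2 * np.pi, 12, endpoint=False)
plus = 0.5 * np.column_stack([np.cos(theta), np.sin(theta)])
minus = 2.0 * np.column_stack([np.cos(theta), np.sin(theta)])

def phi(x):
    # Assumed 2D -> 3D lift: append the squared radius as a third coordinate.
    return np.column_stack([x[:, 0], x[:, 1], (x ** 2).sum(axis=1)])

# In 3D, the plane z = 1 now separates the two classes perfectly.
print((phi(plus)[:, 2] < 1).all(), (phi(minus)[:, 2] > 1).all())  # True True
```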
What you need
• To get x₁ into the high-dimensional space, you use
  $\Phi(\mathbf{x}_1)$
• To optimize, you need
  $\Phi(\mathbf{x}_1) \cdot \Phi(\mathbf{x}_2)$
• To use the classifier, you need
  $\Phi(\mathbf{x}_1) \cdot \Phi(\mathbf{u})$
• So, all you need is a way to compute dot products in the high-dimensional space as a function of vectors in the original space!
What you don’t need
• Suppose dot products are supplied by
  $\Phi(\mathbf{x}_1) \cdot \Phi(\mathbf{x}_2) = K(\mathbf{x}_1, \mathbf{x}_2)$
• Then, all you need is
  $K(\mathbf{x}_1, \mathbf{x}_2)$
• Evidently, you don’t need to know what Φ is; having K is enough!
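A quick check of that claim for the degree-2 polynomial kernel; the explicit Φ below is a standard construction, included only to show the two computations agree.

```python
import numpy as np

def K(x1, x2):
    # Degree-2 polynomial kernel, computed directly in the original 2D space.
    return (np.dot(x1, x2) + 1) ** 2

def phi(x):
    # Matching explicit map into 6D: expanding (x . z + 1)^2 yields exactly
    # these constant, linear, squared, and cross terms.
    return np.array([1.0,
                     np.sqrt(2) * x[0], np.sqrt(2) * x[1],
                     x[0] ** 2, x[1] ** 2,
                     np.sqrt(2) * x[0] * x[1]])

x1, x2 = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(K(x1, x2), np.dot(phi(x1), phi(x2)))  # 25.0 both ways
```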
Standard choices
• No change:
  $\Phi(\mathbf{x}_1) \cdot \Phi(\mathbf{x}_2) = K(\mathbf{x}_1, \mathbf{x}_2) = \mathbf{x}_1 \cdot \mathbf{x}_2$
• Polynomial:
  $K(\mathbf{x}_1, \mathbf{x}_2) = (\mathbf{x}_1 \cdot \mathbf{x}_2 + 1)^n$
• Radial basis function:
  $K(\mathbf{x}_1, \mathbf{x}_2) = e^{-\|\mathbf{x}_1 - \mathbf{x}_2\|^2 / 2\sigma^2}$
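The three choices as Python functions, vectorized over rows of sample matrices; n and σ are free parameters you would tune.

```python
import numpy as np

def linear_kernel(X1, X2):
    # "No change": plain dot products, row i of X1 against row j of X2.
    return X1 @ X2.T

def polynomial_kernel(X1, X2, n=2):
    return (X1 @ X2.T + 1) ** n

def rbf_kernel(X1, X2, sigma=1.0):
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a . b.
    sq = (X1**2).sum(1)[:, None] + (X2**2).sum(1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / (2 * sigma**2))

X = np.array([[0.0, 0.0], [1.0, 1.0]])
print(rbf_kernel(X, X))  # ones on the diagonal, e^{-1} ~ 0.37 off it
```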
Polynomial kernel

Radial-basis kernel

Another radial-basis example
Aside: about the hairy mathematics
• Step 1: apply the method of Lagrange multipliers. To minimize
  $\frac{1}{2}\|\mathbf{w}\|^2$
  subject to the constraints
  $y_i(\mathbf{x}_i \cdot \mathbf{w} + b) \ge 1,$
  find places where
  $L = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{l} \alpha_i \left( y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \right)$
  has zero derivatives.
Aside: about the hairy mathematics
• Step 2: remember how to differentiate with respect to vectors:
  $\frac{\partial \|\mathbf{w}\|^2}{\partial \mathbf{w}} = 2\mathbf{w}$ and $\frac{\partial \, \mathbf{x} \cdot \mathbf{w}}{\partial \mathbf{w}} = \mathbf{x}$
• Step 3: find the derivatives of the Lagrangian L and set them to zero:
  $\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i = 0$
  $\frac{\partial L}{\partial b} = \sum_{i=1}^{l} \alpha_i y_i = 0$
Aside: about the hairy mathematics
• Step 4: do the algebra, then ask a numerical analyst to write a program to find the values of alpha that produce an extreme value for:
  $L = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$
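The “algebra” is mostly substitution: plug the Step 3 result $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ into the Lagrangian, and note that the $b$ term vanishes because $\sum_i \alpha_i y_i = 0$. Spelled out for completeness:

$$
\begin{aligned}
L &= \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{l} \alpha_i \left( y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \right) \\
  &= \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j
     - \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j
     - b \sum_{i=1}^{l} \alpha_i y_i + \sum_{i=1}^{l} \alpha_i \\
  &= \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j
\end{aligned}
$$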
But, note that
• Quadratic minimization depends only on dot products of sample vectors.
• Recognition depends only on dot products of the unknown vector with sample vectors.
• Reliance on dot products alone is the key to the remaining magic.
MIT OpenCourseWare
http://ocw.mit.edu
6.034 Artificial Intelligence
Fall 2010
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.