Support Vector Machines
Yihui Saw
Massachusetts Institute of Technology
yihui@mit.edu
April 30, 2013
Overview
1. Background/Motivation
2. The SVM Model
3. Support Vector Machines with Errors
4. Non-linear problems
5. Kernels
6. Applications
Background/Motivation
Given two classes and a new data point, we want to be able to classify the new point as belonging to one of the two classes.
Background/Motivation
Goal: Find a linear separator.
Background/Motivation
But there are many possible linear separators.
The SVM Model
Choose the separator that maximizes the gap (margin) between the support vectors of the two classes.
Formal Definition
Input: a set of samples $S$, where each sample $x_i \in \mathbb{R}^d$ is a vector of $d$ variables and $y_i \in \{+1, -1\}$ is its class.

Goal: find $\Theta, \Theta_0$ such that $y_i(\Theta \cdot x_i + \Theta_0) \geq 1$ for all $i$, while maximizing the gap $\frac{1}{|\Theta|}$.
The quadratic program
Primal:
$$\min \; \frac{1}{2} |\Theta|^2 \quad \text{subject to} \quad y_i(\Theta \cdot x_i + \Theta_0) \geq 1, \quad i = 1, \ldots, n$$
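As a concrete illustration, here is a minimal sketch that solves this primal QP on a toy dataset with cvxpy (the data and the names theta/theta0 are our own illustrative choices, not from the slides):

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data in R^2: two points per class.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

d = X.shape[1]
theta = cp.Variable(d)   # Θ
theta0 = cp.Variable()   # Θ_0

# min (1/2)|Θ|^2  subject to  y_i (Θ·x_i + Θ_0) >= 1
constraints = [cp.multiply(y, X @ theta + theta0) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(theta)), constraints)
problem.solve()

print("Θ  =", theta.value)
print("Θ0 =", float(theta0.value))
print("gap 1/|Θ| =", 1.0 / np.linalg.norm(theta.value))
```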
The dual problem
Dual:
$$\max \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{subject to} \quad \alpha_i \geq 0, \; i = 1, 2, \ldots, n$$

The solution satisfies:
(support vector) $\alpha_i > 0$: $y_i \left( \sum_{j=1}^{n} \alpha_j y_j x_j \right) \cdot x_i = 1$
(non-support vector) $\alpha_i = 0$: $y_i \left( \sum_{j=1}^{n} \alpha_j y_j x_j \right) \cdot x_i \geq 1$
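A companion sketch, under the same assumptions as before, that solves this dual with cvxpy and recovers $\Theta = \sum_i \alpha_i y_i x_i$; the quadratic term is rewritten as $\frac{1}{2} |\sum_i \alpha_i y_i x_i|^2$ to keep it in a form the solver accepts:

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

a = cp.Variable(n)  # α
# (1/2) ΣΣ α_i α_j y_i y_j (x_i · x_j) equals (1/2) |Σ α_i y_i x_i|^2
objective = cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(X.T @ cp.multiply(a, y)))
cp.Problem(objective, [a >= 0]).solve()

alpha = a.value
theta = X.T @ (alpha * y)            # Θ = Σ α_i y_i x_i
support = np.where(alpha > 1e-6)[0]  # α_i > 0 (up to solver tolerance)
print("support vectors:", support, " Θ =", theta)
```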
Support Vector Machines with Errors
Sometimes our examples contain errors.
Solution: introduce "slack" variables $\xi_i \geq 0$ into our optimization problem.
Support Vector Machines with Errors
(primal)
$$\min \; \frac{\lambda}{2} |\Theta|^2 + \frac{1}{n} \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(\Theta \cdot x_i + \Theta_0) \geq 1 - \xi_i, \;\; \xi_i \geq 0, \; i = 1, \ldots, n$$
λ is the regularization parameter. It balances how much we favor increasing the margin over satisfying the classification constraints.
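Extending the earlier primal sketch with slack variables (toy data and the value of λ are illustrative assumptions):

```python
import cvxpy as cp
import numpy as np

# Same kind of toy data plus one mislabeled point, so slack is needed.
X = np.array([[2.0, 2.0], [3.0, 3.0], [1.5, 1.5],
              [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
n, d = X.shape
lam = 0.1  # regularization parameter λ (assumed value; tune in practice)

theta = cp.Variable(d)
theta0 = cp.Variable()
xi = cp.Variable(n)  # slack variables ξ_i

objective = cp.Minimize(0.5 * lam * cp.sum_squares(theta) + cp.sum(xi) / n)
constraints = [cp.multiply(y, X @ theta + theta0) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print("Θ =", theta.value)
print("slacks ξ =", np.round(xi.value, 3))  # nonzero ξ_i marks margin violations
```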
Support Vector Machines with Errors
The effect of slack when examples are still linearly separable
Support Vector Machines with Errors
The effect of slack when examples are no longer linearly separable
Non-linear problems
Problems that are not linearly separable
Non-linear problems
The idea is to gain linear separability by mapping the data to a higher-dimensional space.
Non-linear problems
Recall the dual of the problem:
$$\max \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$

In the non-linear case, we replace $x_i \cdot x_j$ with $\phi(x_i) \cdot \phi(x_j)$.

So we don't need to know $\phi$ explicitly. We instead compute a kernel function $K$, where
$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$$
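A sketch of this kernelized dual with a radial basis kernel, on data that is not linearly separable (the data, the kernel bandwidth, and the eigen-factorization of K used to keep the objective solver-friendly are all our own choices):

```python
import cvxpy as cp
import numpy as np

# Toy data that is not linearly separable: inner vs. outer points.
X = np.array([[0.1, 0.0], [-0.1, 0.1], [0.0, -0.1],
              [2.0, 0.0], [-2.0, 0.5], [0.0, 2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)

# RBF kernel matrix K_ij = exp(-|x_i - x_j|^2 / 2)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2)

# Factor K = L Lᵀ so the quadratic term is a plain sum of squares.
w, V = np.linalg.eigh(K)
L = V * np.sqrt(np.clip(w, 0, None))

a = cp.Variable(n)
obj = cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(L.T @ cp.multiply(a, y)))
cp.Problem(obj, [a >= 0]).solve()

# Decision value for a new point x: Σ α_i y_i K(x_i, x)
x_new = np.array([0.05, 0.05])
k = np.exp(-((X - x_new) ** 2).sum(-1) / 2)
print("score:", float((a.value * y) @ k))  # > 0 → class +1
```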
Kernels
With Kernels, we can implicitly work with very high (or even infinite) dimensional feature vectors.
Example: the radial basis kernel
$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) = e^{-|x_i - x_j|^2 / 2}$$
corresponds to an infinite-dimensional feature map $\phi$.
Kernel function
A kernel function is valid if and only if there exists some map $\phi(x)$ such that
$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$$
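One consequence of this definition (Mercer's condition) is that a valid kernel must produce a symmetric positive semidefinite Gram matrix on any set of points. A quick numerical check for the radial basis kernel, on sample points of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # arbitrary sample points

# Gram matrix of the radial basis kernel K(x_i, x_j) = exp(-|x_i - x_j|^2 / 2)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2)

# A valid kernel's Gram matrix is symmetric positive semidefinite.
eigvals = np.linalg.eigvalsh(K)
print("min eigenvalue:", eigvals.min())  # ≥ 0 up to numerical error
```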
Kernels
Rules
1. $K(x_i, x_j) = 1$ is a kernel function.
2. Let $f : \mathbb{R}^d \to \mathbb{R}$ be any real-valued function of $x$. Then, if $K(x_i, x_j)$ is a kernel function, so is $\tilde{K}(x_i, x_j) = f(x_i) \, K(x_i, x_j) \, f(x_j)$.
3. If $K_1(x_i, x_j)$ and $K_2(x_i, x_j)$ are kernels, then so is their sum.
4. If $K_1(x_i, x_j)$ and $K_2(x_i, x_j)$ are kernels, then so is their product.
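A short sketch composing kernels with rules 2-4 and checking that each resulting Gram matrix stays positive semidefinite (the base kernels, the function f, and the points are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))

def gram(kernel, X):
    return np.array([[kernel(a, b) for b in X] for a in X])

k_lin = lambda a, b: a @ b                              # linear kernel
k_rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2)  # radial basis kernel
f = lambda x: 1.0 + np.sum(x ** 2)                      # any real-valued f

# Rules 2-4: f(x_i) K f(x_j), K1 + K2, and K1 * K2 are all kernels.
candidates = {
    "f*K*f": lambda a, b: f(a) * k_rbf(a, b) * f(b),
    "sum":   lambda a, b: k_lin(a, b) + k_rbf(a, b),
    "prod":  lambda a, b: k_lin(a, b) * k_rbf(a, b),
}
for name, k in candidates.items():
    w = np.linalg.eigvalsh(gram(k, X))
    print(name, "min eigenvalue:", round(w.min(), 10))  # ≥ 0 up to roundoff
```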
Applications
Text (and hypertext) categorization
Image classification
Bioinformatics (protein classification, cancer classification), etc.