
Support Vector Machines

Yihui Saw

Massachusetts Institute of Technology yihui@mit.edu

April 30, 2013


Overview

1. Introduction
2. Formal Definition
3. Support Vector Machines with Errors
4. Non-linear problems
5. Kernels
6. Applications


Background/Motivation

Given two classes and a new data point, we want to classify the new point as belonging to one of the two classes.



Goal: Find a linear separator.



But there are many possible linear separators.



The SVM Model

Choose the separator that maximizes the gap between the two classes; the points closest to it, which define this gap, are the support vectors.


Formal Definition

Input: a set of samples S, where each sample x_i ∈ R^d is a vector of d variables and y_i ∈ {+1, −1} is its class.

Goal: find Θ, Θ_0 such that y_i (Θ · x_i + Θ_0) ≥ 1 for all i, while maximizing the gap 1/|Θ|.
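As a small numeric illustration (a sketch assuming NumPy, which the slides don't mention; the data and candidate Θ are made up), we can check that a hand-picked (Θ, Θ_0) satisfies the constraints and compute the gap 1/|Θ| it achieves.

```python
# Check the constraints y_i(Θ·x_i + Θ_0) ≥ 1 for a hand-picked candidate (Θ, Θ_0)
# and compute the gap 1/|Θ| that the SVM tries to maximize.
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 1.5],      # class -1
              [4.0, 4.0], [5.0, 4.5]])     # class +1
y = np.array([-1, -1, +1, +1])

theta = np.array([1.0, 1.0])               # a hand-picked candidate Θ
theta_0 = -5.25                            # and Θ_0

print(y * (X @ theta + theta_0))           # all ≥ 1, so (Θ, Θ_0) is feasible
print(1.0 / np.linalg.norm(theta))         # the gap 1/|Θ| we want to maximize
```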

Yihui Saw (MIT)

SVM

April 30, 2013 8 / 22

The quadratic program

Primal: min (1/2) |Θ|²  subject to  y_i (Θ · x_i + Θ_0) ≥ 1,  i = 1, ..., n
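A minimal sketch of solving this program, assuming scikit-learn is available (the slides don't prescribe a solver): a linear SVC with a very large C approximates the hard-margin problem, and its coef_ and intercept_ play the roles of Θ and Θ_0.

```python
# Approximate the hard-margin primal min (1/2)|Θ|² s.t. y_i(Θ·x_i + Θ_0) ≥ 1
# by fitting a linear SVC with a very large C (almost no slack allowed).
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two classes in R².
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

theta = clf.coef_[0]          # Θ
theta_0 = clf.intercept_[0]   # Θ_0
print(theta, theta_0)
print(y * (X @ theta + theta_0))   # every margin is ≥ 1 (≈ 1 for support vectors)
```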


The dual problem

Dual: max Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j (x_i · x_j)  subject to  α_i ≥ 0,  i = 1, 2, ..., n

The solution satisfies:

(support vector)     α_i > 0 :  y_i (Σ_{j=1}^n α_j y_j x_j) · x_i = 1

(non-support vector) α_i = 0 :  y_i (Σ_{j=1}^n α_j y_j x_j) · x_i ≥ 1
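A hedged sketch of reading off the dual solution, again assuming scikit-learn: its dual_coef_ attribute holds the products α_i y_i for the support vectors, from which Θ = Σ α_i y_i x_i can be reconstructed and the two conditions above checked.

```python
# Recover Θ = Σ α_i y_i x_i from the dual coefficients of a fitted linear SVC.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, +1, +1, +1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)     # ≈ hard margin, as above

alpha_y = clf.dual_coef_[0]       # α_i y_i, one entry per support vector
sv = clf.support_vectors_         # the x_i with α_i > 0
theta = alpha_y @ sv              # Θ = Σ α_i y_i x_i

# Support vectors lie exactly on the margin (= 1); other points lie beyond it (≥ 1).
print(np.allclose(theta, clf.coef_[0]))          # same Θ as the primal view
print(y * (X @ theta + clf.intercept_[0]))       # all ≥ 1, ≈ 1 for support vectors
```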


Support Vector Machines with Errors

Sometimes our examples contain errors.

Solution: introduce "slack" variables ξ_i ≥ 0 into the optimization problem.



(primal) min (λ/2) |Θ|² + (1/n) Σ_{i=1}^n ξ_i  subject to  y_i (Θ · x_i + Θ_0) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., n

λ is the regularization parameter. It balances how much we favor increasing the margin over satisfying the classification constraints.
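A soft-margin sketch, assuming scikit-learn: its SVC exposes a parameter C rather than λ; roughly, heavy regularization (large λ, favor a wide gap) corresponds to a small C and vice versa, so varying C shows the trade-off described above.

```python
# Vary the regularization of a soft-margin linear SVM on overlapping classes
# and watch the gap 1/|Θ| and the number of support vectors change.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (50, 2)),
               rng.normal([2.5, 2.5], 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)        # overlapping classes -> some ξ_i > 0

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    gap = 1.0 / np.linalg.norm(clf.coef_)  # the gap 1/|Θ|
    print(C, gap, clf.support_vectors_.shape[0])
# Small C (heavy regularization) typically gives a wide gap and many support vectors;
# large C shrinks the gap in order to reduce the slack terms.
```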



The effect of slack when examples are still linearly separable



The effect of slack when examples are no longer linearly separable


Non-linear problems

Problems that are not linearly separable



The idea is to gain linear separability by mapping the data to a higher-dimensional space.
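A toy sketch of this idea, assuming NumPy and scikit-learn (the data and the map φ(x) = (x, x²) are made up for illustration): 1-D data that is not linearly separable becomes separable after the explicit mapping.

```python
# Map 1-D points through φ(x) = (x, x²); in the mapped space a linear SVM separates them.
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([+1, +1, -1, -1, -1, +1, +1])   # outer points vs inner points

phi = np.column_stack([x, x ** 2])           # explicit map into R²
clf = SVC(kernel="linear", C=1e6).fit(phi, y)
print(clf.score(phi, y))                     # 1.0: separable in the mapped space
```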



Recall the dual of the problem: max Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j (x_i · x_j)

In the non-linear case, we replace x_i · x_j with φ(x_i) · φ(x_j).

So we don't need to know φ explicitly; we only need to compute the kernel function K, where K(x_i, x_j) = φ(x_i) · φ(x_j).
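A sketch of the kernel trick in practice, assuming scikit-learn: the learner only ever sees the n × n matrix of values K(x_i, x_j), never φ itself, so we can hand it a precomputed Gram matrix directly.

```python
# Train an SVM from kernel values alone, without ever constructing φ(x).
import numpy as np
from sklearn.svm import SVC

def rbf(a, b, gamma=1.0):
    # K(x_i, x_j) = exp(-gamma * |x_i - x_j|²)
    return np.exp(-gamma * np.sum((a - b) ** 2))

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([-1, -1, +1, +1])          # XOR-like labels, not linearly separable

K = np.array([[rbf(a, b) for b in X] for a in X])     # n x n Gram matrix
clf = SVC(kernel="precomputed", C=10.0).fit(K, y)
print(clf.score(K, y))                  # 1.0: the implicit φ separates this toy set
```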


Kernels

With kernels, we can implicitly work with very high-dimensional (or even infinite-dimensional) feature vectors.

Example: the radial basis kernel

K(x_i, x_j) = φ(x_i) · φ(x_j) = e^{−|x_i − x_j|² / 2}

is infinite dimensional.

Kernel function

A kernel function is valid if and only if there exists some map φ(x) such that K(x_i, x_j) = φ(x_i) · φ(x_j).
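One consequence we can check numerically (a sketch assuming NumPy): since a valid kernel is an inner product φ(x_i) · φ(x_j), the Gram matrix it produces on any sample must be positive semidefinite, and this holds for the radial basis kernel above.

```python
# Build the radial basis Gram matrix K[i, j] = e^{-|x_i - x_j|²/2} on random data
# and verify that all of its eigenvalues are non-negative (up to rounding error).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                                 # 20 random points in R³

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

eigvals = np.linalg.eigvalsh(K)                              # symmetric matrix
print(eigvals.min() >= -1e-10)                               # True: positive semidefinite
```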



Rules

1. K(x_i, x_j) = 1 is a kernel function.
2. Let f : R^d → R be any real-valued function of x. Then, if K(x_i, x_j) is a kernel function, so is K̃(x_i, x_j) = f(x_i) K(x_i, x_j) f(x_j).
3. If K_1(x_i, x_j) and K_2(x_i, x_j) are kernels, then so is their sum.
4. If K_1(x_i, x_j) and K_2(x_i, x_j) are kernels, then so is their product.



Applications

Text (and hypertext) categorization

Image classification

Bioinformatics (protein classification, cancer classification), etc.


The End
