Support Vector Machines


CIS525: Neural Computation

Lecture 6 (Feb 25, 2004)

Support Vector Machines (SVM)

SVMs are used mostly for classification (they can also be modified for regression and even for unsupervised learning applications).

They achieve accuracy comparable to multilayer perceptrons.

Assume the training data D = {(x_i, y_i)}, i = 1, ..., N, with y_i ∈ {−1, +1}, is separable by a hyperplane.

[Figure: a two-dimensional dataset of + and − points. Here are three possible separations (three different hyperplanes, each of which classifies the training data correctly).]

Question: What is the best linear classifier of the type f(x) = w^T x + b = (w_1 x_1 + w_2 x_2 + w_3 x_3 + ... + w_n x_n + b)?

While there can be an infinite number of hyperplanes that achieve 100% accuracy on the training data, the question is which hyperplane is optimal with respect to accuracy on the test data.

[Figure: the same dataset of + and − points, showing the gap between the two classes.]

Common sense solution: we want to increase the gap (margin) between positive and negative cases as much as possible.

The best linear classifier is the hyperplane in the middle of the gap.

Given f(x), the classification is obtained as

y = sign(f(x)) = +1 if f(x) ≥ 0, and −1 if f(x) < 0.

Note: Different w and b can result in identical classification. For example, we can multiply by any positive scalar c > 0, since

sign(w^T x + b) = sign(c (w^T x + b)).

Therefore there are many equivalent solutions.

Definitions of SVM and Margin

Find f(x) = (w^T x + b) with maximal margin, such that for the points closest to the separating hyperplane, |w^T x_i + b| = 1 (these points are also called the support vectors), and for all other points, |w^T x_i + b| > 1.

Illustration:

[Figure: + and − points with the separating hyperplane; the support vectors lie on the margin boundaries, and the margin depends on these closest points.]

Question: How can we calculate the length of the margin as a function of w? The following diagram shows a point x and its projection x_p onto the separating hyperplane; r is defined as the distance between the data point x and the hyperplane.

[Figure: point x, its projection x_p onto the hyperplane, the distance r, and the normal vector w; the hyperplane itself is the set where f(x) = 0.]

Note that w is a vector perpendicular to the hyperplane, so we have:

f(x) = f(x_p + r · w/||w||) = w^T x_p + b + r · w^T w / ||w|| = r · ||w||   (since w^T x_p + b = 0)

Therefore: r = f(x) / ||w||.

[Figure: support vectors on both margin boundaries; ρ denotes the width of the margin between them.]

Now, solve for the margin length ρ. Taking a support vector x_{+1} with f(x_{+1}) = +1 and a support vector x_{−1} with f(x_{−1}) = −1:

ρ = f(x_{+1})/||w|| − f(x_{−1})/||w|| = 2/||w||

Conclusion: Maximizing the margin is equivalent to minimizing ||w|| (since we can ignore the constant 2 above).
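As a quick numerical illustration (a minimal NumPy sketch; the hyperplane coefficients and the test point below are made up for the example), both the distance r = f(x)/||w|| and the margin ρ = 2/||w|| can be computed directly from w:

```python
import numpy as np

# Hypothetical hyperplane f(x) = w^T x + b (coefficients chosen arbitrarily)
w = np.array([3.0, 4.0])   # ||w|| = 5
b = -2.0

def f(x):
    """Decision function f(x) = w^T x + b."""
    return w @ x + b

x = np.array([2.0, 1.0])

# Signed distance of x to the hyperplane: r = f(x) / ||w||
r = f(x) / np.linalg.norm(w)

# Margin for a canonical hyperplane (support vectors have |f(x)| = 1): rho = 2 / ||w||
rho = 2.0 / np.linalg.norm(w)

print(f"f(x) = {f(x):.2f}, distance r = {r:.2f}, margin rho = {rho:.2f}")
```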

Theoretical Justification:

The following inequality can be derived:

h ≤ R²/ρ² + 1

where R is some constant, ρ is the margin, and h represents the VC dimension, which measures how powerful the learning algorithm is.

It is preferable to use the simplest possible algorithm that learns sufficiently accurately from the given data.

Therefore, we want to minimize h .

Conclusion: In order to minimize h (to minimize L(f)) we need to maximize the margin ρ.

Support Vector Machines: Learning Problem

Assuming a linearly separable dataset, the task of learning the coefficients w and b of the support vector machine f(x) = w^T x + b reduces to solving the following constrained optimization problem:

find w and b that minimize: (1/2) ||w||²

subject to the constraints: y_i (w^T x_i + b) ≥ 1, ∀i
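A minimal sketch of this primal problem on a tiny made-up 2-D dataset, handed to SciPy's general-purpose SLSQP solver (this only illustrates the formulation; it is not how SVM packages actually train):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy dataset (invented for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],   # positive class
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # negative class
y = np.array([1, 1, 1, -1, -1, -1])

def objective(params):
    w = params[:2]
    return 0.5 * w @ w                      # minimize (1/2) ||w||^2

def constraints_fun(params):
    w, b = params[:2], params[2]
    return y * (X @ w + b) - 1.0            # y_i (w^T x_i + b) - 1 >= 0

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": constraints_fun}])
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))
```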

This optimization problem can be solved by using the Lagrangian function defined as:

L(w, b, α) = (1/2) w^T w − Σ_{i=1}^{N} α_i [ y_i (w^T x_i + b) − 1 ],   such that α_i ≥ 0, ∀i,

where α_1, α_2, ..., α_N are Lagrange multipliers and α = [α_1, α_2, ..., α_N]^T.

The solution of the original constrained optimization problem is determined by the saddle point of L(w, b, α), which has to be minimized with respect to w and b and maximized with respect to α.

Comments about Lagrange multipliers:

If y_i (w^T x_i + b) > 1, the value of α_i that maximizes L(w, b, α) is α_i = 0.

If y_i (w^T x_i + b) < 1, the value of α_i that maximizes L(w, b, α) is α_i = +∞.

However, since w and b are trying to minimize L(w, b, α), they will be changed in such a way as to make y_i (w^T x_i + b) at least equal to +1.

From this brief discussion, the so-called Kuhn–Tucker conditions follow:

α_i [ y_i (w^T x_i + b) − 1 ] = 0, ∀i

Notation: Data points x_i with α_i > 0 are called the support vectors.

Optimality conditions:

The necessary conditions for the saddle point of L(w, b, α) are

∂L/∂w_j = 0, ∀j,   and   ∂L/∂b = 0,

or, stated a different way, ∇_w L = 0 and ∂L/∂b = 0.

Solving the necessary conditions results in

w = Σ_{i=1}^{N} α_i y_i x_i   (***)

and Σ_{i=1}^{N} α_i y_i = 0.

By substituting w = Σ_{i=1}^{N} α_i y_i x_i into the Lagrangian function and by using Σ_{i=1}^{N} α_i y_i = 0 as a new constraint, the dual optimization problem can be constructed as:

Find α that maximizes

Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j

subject to

Σ_{i=1}^{N} α_i y_i = 0,   α_i ≥ 0, ∀i

This is a convex quadratic programming problem, so it has a global optimum. There are a number of optimization routines capable of solving it. The optimization can be solved in O(N³) time (cubic in the size of the training data) and in time linear in the number of attributes. (Compare this to neural networks, which are trained in O(N) time.)
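As a small sketch (not a production QP solver), the dual can also be handed to a general-purpose optimizer such as SciPy's SLSQP by minimizing the negated objective; the toy dataset below is invented for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy dataset (invented for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

# Matrix H_ij = y_i y_j x_i^T x_j appearing in the dual objective
H = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(alpha):
    # Negate because scipy minimizes; the dual maximizes sum(alpha) - 0.5 alpha^T H alpha
    return 0.5 * alpha @ H @ alpha - alpha.sum()

constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * N                               # alpha_i >= 0

res = minimize(neg_dual, x0=np.zeros(N), method="SLSQP",
               bounds=bounds, constraints=constraints)
alpha = res.x
w = (alpha * y) @ X                                      # w = sum_i alpha_i y_i x_i
print("alpha =", np.round(alpha, 3))
print("w =", w)
```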

Support Vector Machine: Final Predictor

Given the values α_1, α_2, ..., α_N obtained by solving the dual problem, the final SVM predictor can be expressed from (***) as

f(x) = w^T x + b = Σ_{i=1}^{N} α_i y_i x_i^T x + b

where

b = (1/|I_support|) Σ_{i ∈ I_support} ( y_i − Σ_{j ∈ I_support} α_j y_j x_j^T x_i )

and I_support is the set of support vectors.

Important comments:

To obtain the prediction, all data points from the training data are consulted.

However, since α_i > 0 only for the support vectors, only the support vectors are actually used in making a prediction (see the sketch below).

Note that x_j^T x is a scalar.
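Continuing the SLSQP sketch above (the names alpha, X, y carry over from that illustration, so they are assumptions of this sketch rather than part of the lecture), b can be averaged over the support vectors and predictions made using only the points with α_i > 0:

```python
# Identify support vectors: alpha_i > 0 (small tolerance for numerical noise)
sv = alpha > 1e-6

# b averaged over support vectors: b = mean_{i in SV} ( y_i - sum_{j in SV} alpha_j y_j x_j^T x_i )
b = np.mean(y[sv] - (alpha[sv] * y[sv]) @ (X[sv] @ X[sv].T))

def predict(x_new):
    """Classify x_new using only the support vectors."""
    score = (alpha[sv] * y[sv]) @ (X[sv] @ x_new) + b
    return np.sign(score)

print("b =", b, "prediction for [2.5, 2.5]:", predict(np.array([2.5, 2.5])))
```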

Support Vector Machine: Linearly Nonseparable Case

So far, we have discussed the construction of support vector machines on linearly separable training data. This is a very strong assumption that is unrealistic in most real life applications.

Question: What should we do if the training data set is not linearly separable?


Solution: Introduce slack variables ξ_i, i = 1, 2, ..., N, to relax the constraint y_i (w^T x_i + b) ≥ 1 to

y_i (w^T x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0.

Ideally, one would prefer all slack variables to be zero, which would correspond to the linearly separable case. Therefore, the optimization problem for construction of an SVM on linearly nonseparable data is defined as:

find w and b that minimize: (1/2) ||w||² + C Σ_i ξ_i

subject to: y_i (w^T x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0, ∀i,

where C > 0 is an appropriately selected parameter. The additional term C Σ_i ξ_i forces all slack variables to be as close to zero as possible.

Dual problem: As in the linearly separable case, this optimization problem can be converted to its dual problem: find α that maximizes

Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j

subject to

Σ_{i=1}^{N} α_i y_i = 0,   0 ≤ α_i ≤ C, ∀i

NOTE: The consequence of introducing the parameter C is that it constrains the range of acceptable values of the Lagrange multipliers α_i to the interval [0, C]. The most appropriate choice of C depends on the specific data set available.
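One way to see the effect of C in practice (a small sketch assuming scikit-learn is installed; the noisy two-class data is generated only for this illustration) is to train a linear-kernel SVC for several values of C and watch how the number of support vectors changes:

```python
import numpy as np
from sklearn.svm import SVC

# Noisy, not perfectly separable toy data (generated only for illustration)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.hstack([np.ones(50), -np.ones(50)])

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Smaller C allows more slack, so more points typically end up as support vectors
    print(f"C = {C:>6}: {clf.n_support_.sum()} support vectors")
```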

Support Vector Machine: Nonlinear Case

Problem: Support vector machines represented by a linear function f(x) (i.e. a separating hyperplane) have very limited representational power. As such, they would not be very useful in practical classification problems.

[Figure: a dataset in the (X_1, X_2) attribute space whose + and − points cannot be separated by any hyperplane.]

Good News: With a slight modification, SVM can solve highly nonlinear classification problems!!

Justification: Cover’s Theorem. Suppose that the data set D is nonlinearly separable in the original attribute space. The attribute space can be transformed into a new attribute space where D is linearly separable!

Caveat: Cover’s Theorem only proves the existence of a transformed attribute space that could solve the nonlinear problem. It does not provide a guideline for constructing the attribute transformation!

Example 1. XOR problem

[Figure: the XOR problem in the (X_1, X_2) plane; the + and − classes occupy opposite quadrants and cannot be separated by a single hyperplane.]

By constructing a new attribute X_1' = X_1 · X_2, the XOR problem becomes linearly separable using the new attribute X_1'.
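A tiny NumPy sketch of this construction (the four canonical ±1 XOR points and the labeling convention, +1 when x_1 and x_2 share a sign, are chosen here for illustration):

```python
import numpy as np

# XOR-style problem: label +1 when x1 and x2 have the same sign, -1 otherwise
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
y = np.array([1, -1, -1, 1])

# New attribute X1' = X1 * X2
x1_prime = X[:, 0] * X[:, 1]

# In the new one-dimensional attribute space the classes are separated by the threshold 0
print("X1' =", x1_prime)                              # [ 1 -1 -1  1]
print("separable:", np.all(np.sign(x1_prime) == y))   # True
```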


Example 2. Taylor expansion.

The value of a multidimensional function F(x) at a point x can be approximated as

F(x) ≈ F(x_0) + ∇F(x_0)^T (x − x_0) + (1/2)(x − x_0)^T ∇²F(x_0)(x − x_0) + O(||x − x_0||³)

Therefore, F(x) can be considered as a linear combination of complex attributes derived from the original ones,

F(x) ≈ F(x_0) + Σ_{i=1}^{m} a_i x_i + Σ_{i,j=1}^{m} a_{ij} x_i x_j + Σ_{i,j,k=1}^{m} a_{ijk} x_i x_j x_k + ...

Example 3. Second-order monomials derived from the original two-dimensional attribute space:

(x_1, x_2) → (z_1, z_2, z_3) = (x_1², √2 x_1 x_2, x_2²)
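A quick check of this mapping (a NumPy sketch; the two test points are arbitrary): the scalar product of the mapped vectors equals the squared scalar product in the original space, which is exactly the kernel idea introduced later.

```python
import numpy as np

def phi(x):
    """Second-order monomial map (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = phi(x) @ phi(z)          # scalar product in the transformed space
rhs = (x @ z) ** 2             # (x^T z)^2 computed in the original space

print(lhs, rhs)                # both equal 1.0
```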

Example 4. Fifth order monomials derived from the original 256-dimensional attribute space

There are C(256 + 5 − 1, 5) ≈ 10^10 such monomials, which is an extremely high-dimensional attribute space!!

SVM and curse-of-dimensionality:

If the original attribute space is transformed into a very high dimensional space, the likelihood of being able to solve the nonlinear classification increases.

However, one is likely to quickly encounter the curse-of-dimensionality problem.

The strength of SVM lies in the theoretical justification that margin maximization is an effective mechanism for alleviating the curse-of-dimensionality problem (i.e. the SVM is the simplest classifier that solves the given classification problem).

Therefore, SVMs are able to successfully solve classification problems with extremely high attribute dimensionality!!

SVM solution for classification:

Denote by Φ : R^M → F a mapping from the original M-dimensional attribute space to the high-dimensional attribute space F.

By solving the following dual problem: find α that maximizes

Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j Φ(x_i)^T Φ(x_j)

subject to

Σ_{i=1}^{N} α_i y_i = 0,   0 ≤ α_i ≤ C, ∀i,

the resulting SVM is of the form

f(x) = w^T Φ(x) + b = Σ_{i=1}^{N} α_i y_i Φ(x_i)^T Φ(x) + b

Practical Problem: Although SVMs are successful in dealing with high-dimensional attribute spaces, the fact that SVM training scales linearly with the number of attributes, together with limited memory, can largely limit the choice of the mapping Φ.

Solution: Kernel Trick

It allows computing scalar products (e.g. Φ(x_i)^T Φ(x)) in the original attribute space. It follows from Mercer’s Theorem: there is a class of mappings Φ with the following property:

Φ(x)^T Φ(y) = K(x, y),

where K is the corresponding kernel function.

Examples of kernel functions:

• Gaussian kernel: K(x, y) = exp(−||x − y||² / A), where A is a constant

• Polynomial kernel: K(x, y) = (x^T y + 1)^B, where B is a constant
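The two kernels above can be written directly (a minimal sketch; A and B are the constants from the definitions, passed here as parameters):

```python
import numpy as np

def gaussian_kernel(x, y, A=1.0):
    """Gaussian (RBF) kernel: K(x, y) = exp(-||x - y||^2 / A)."""
    return np.exp(-np.sum((x - y) ** 2) / A)

def polynomial_kernel(x, y, B=3):
    """Polynomial kernel: K(x, y) = (x^T y + 1)^B."""
    return (x @ y + 1.0) ** B

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])
print(gaussian_kernel(x, y, A=2.0), polynomial_kernel(x, y, B=2))
```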

By introducing the kernel trick, the dual problem becomes: find α that maximizes

Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)

subject to

Σ_{i=1}^{N} α_i y_i = 0,   0 ≤ α_i ≤ C, ∀i

The resulting SVM is:

f(x) = w^T Φ(x) + b = Σ_{i=1}^{N} α_i y_i K(x_i, x) + b
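A minimal sketch of this kernelized predictor (the arrays X, y, alpha, the offset b, and the kernel K are assumed to come from a solved dual problem, for instance the sketches above; only the form of the sum is illustrated):

```python
import numpy as np

def svm_predict(x_new, X, y, alpha, b, K):
    """Kernel SVM prediction: f(x) = sum_i alpha_i y_i K(x_i, x) + b."""
    k_values = np.array([K(x_i, x_new) for x_i in X])
    return np.sign((alpha * y) @ k_values + b)

# Example call, reusing the Gaussian kernel defined earlier (inputs are placeholders):
# label = svm_predict(x_new, X, y, alpha, b, K=gaussian_kernel)
```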

Some practical issues with SVM

Modeling choices: When using one of the available SVM software packages or toolboxes, a user should choose (1) a kernel function (e.g. the Gaussian kernel) and its parameter(s) (e.g. the constant A), and (2) the constant C related to the slack variables. Several choices should be examined using a validation set in order to find the best SVM.
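A small sketch of such a model-selection loop with scikit-learn (the parameter grid and the synthetic data are invented for illustration; cross-validation is used here as a common variant of the validation-set idea, and scikit-learn's RBF kernel parameter gamma plays the role of 1/A above):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Synthetic two-class data, generated only for illustration
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) + [1, 1], rng.randn(100, 2) - [1, 1]])
y = np.hstack([np.ones(100), -np.ones(100)])

# Examine several kernel parameters and C values
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```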

SVM training does not scale well with the size of the training data (i.e. it scales as O(N³)).

There are several solutions that offer a speed-up of the original SVM algorithm:
o chunking: start with a subset of D, build an SVM, apply it to all data, add “problematic” data points to the training subset, remove “nice” points, and repeat.
o decomposition: similar to chunking, but the size of the subset is kept constant.
o sequential minimal optimization (SMO): an extreme version of chunking; only 2 data points are used in each iteration.

SVM-based solutions exist for problems outside binary classification:

multi-class classification problems

SVM for regression

Kernel PCA

Kernel Fisher discriminant

Clustering
