Support Vector Machines


1. Introduction to SVMs

2. Linear SVMs

3. Non-linear SVMs


Introduction

SVMs were developed by Vapnik in 1995 and have become popular because of their attractive features and promising performance.

Conventional neural networks are based on empirical risk minimization, where the network weights are determined by minimizing the mean squared error between the actual outputs and the desired outputs.

SVMs are based on the structural risk minimization principle, where the parameters are optimized by minimizing an upper bound on the generalization error.

SVMs have been shown to possess better generalization capability than conventional neural networks.


Introduction (Cont.)

Given N labeled empirical data points

$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N) \in X \times \{+1, -1\},$   (1)

where $X$ is the set of input data in $\mathbb{R}^D$ and the $y_i$ are the class labels.

[Figure: the two classes ($y_i = +1$ and $y_i = -1$) in domain $X$, plotted in the $(x_1, x_2)$ plane, with class means $\mathbf{c}_1$ and $\mathbf{c}_2$.]

Introduction (Cont.)

We construct a simple classifier by computing the means of the two classes:

$\mathbf{c}_1 = \frac{1}{N_1} \sum_{i:\, y_i = +1} \mathbf{x}_i \quad \text{and} \quad \mathbf{c}_2 = \frac{1}{N_2} \sum_{i:\, y_i = -1} \mathbf{x}_i,$   (2)

where $N_1$ and $N_2$ are the numbers of data points in the classes with positive and negative labels, respectively.

We assign a new point $\mathbf{x}$ to the class whose mean is closer to it.

To achieve this, we compute the midpoint $\mathbf{c} = (\mathbf{c}_1 + \mathbf{c}_2)/2$.


Introduction (Cont.)

Then, we determine the class of $\mathbf{x}$ by checking whether the vector connecting $\mathbf{x}$ and $\mathbf{c}$ encloses an angle smaller than $\pi/2$ with the vector $\mathbf{w} = \mathbf{c}_1 - \mathbf{c}_2$:

$y = \mathrm{sgn}\,\langle (\mathbf{x} - \mathbf{c}), \mathbf{w} \rangle$
$\;\; = \mathrm{sgn}\,\langle (\mathbf{x} - (\mathbf{c}_1 + \mathbf{c}_2)/2), (\mathbf{c}_1 - \mathbf{c}_2) \rangle$
$\;\; = \mathrm{sgn}\,( \langle \mathbf{x}, \mathbf{c}_1 \rangle - \langle \mathbf{x}, \mathbf{c}_2 \rangle + b ),$

where $b = \frac{1}{2}\left( \|\mathbf{c}_2\|^2 - \|\mathbf{c}_1\|^2 \right)$.

[Figure: a test point $\mathbf{x}$, the class means $\mathbf{c}_1$ and $\mathbf{c}_2$, their midpoint $\mathbf{c}$, and the vector $\mathbf{w} = \mathbf{c}_1 - \mathbf{c}_2$ in domain $X$, plotted in the $(x_1, x_2)$ plane.]


Introduction (Cont.)

In the special case where $b = 0$, we have

$y = \mathrm{sgn}\left( \frac{1}{N_1} \sum_{i:\, y_i = +1} (\mathbf{x} \cdot \mathbf{x}_i) - \frac{1}{N_2} \sum_{i:\, y_i = -1} (\mathbf{x} \cdot \mathbf{x}_i) \right) = \mathrm{sgn}\left( \sum_{i:\, y_i = +1} \frac{1}{N_1} (\mathbf{x} \cdot \mathbf{x}_i) - \sum_{i:\, y_i = -1} \frac{1}{N_2} (\mathbf{x} \cdot \mathbf{x}_i) \right).$   (3)

This means that we use ALL data points $\mathbf{x}_i$, each weighted equally by $1/N_1$ or $1/N_2$, to define the decision plane.

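As an illustration (not part of the original notes), the mean-based classifier of Eqs. (2)-(3) can be sketched in a few lines of NumPy; the data below are made up for demonstration.

```python
import numpy as np

def mean_classifier(X_train, y_train, x):
    """Assign x to the class whose mean is closer (Eqs. (2)-(3))."""
    c1 = X_train[y_train == +1].mean(axis=0)       # mean of the positive class
    c2 = X_train[y_train == -1].mean(axis=0)       # mean of the negative class
    w = c1 - c2
    b = 0.5 * (np.dot(c2, c2) - np.dot(c1, c1))    # b = (||c2||^2 - ||c1||^2) / 2
    return np.sign(np.dot(x, w) + b)               # sgn(<x, c1> - <x, c2> + b)

# Toy data, chosen only for this illustration
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([+1, +1, -1, -1])
print(mean_classifier(X, y, np.array([2.5, 2.0])))   # -> 1.0
print(mean_classifier(X, y, np.array([0.2, 0.5])))   # -> -1.0
```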

Introduction (Cont.)

[Figure: the decision plane passes through the midpoint $\mathbf{c}$ of the two class means and is perpendicular to $\mathbf{w} = \mathbf{c}_1 - \mathbf{c}_2$; points with $y_i = +1$ and points with $y_i = -1$ lie on opposite sides, in domain $X$ with axes $x_1$ and $x_2$.]


Introduction (Cont.)

However, we might want to reduce the influence of patterns that are far away from the decision boundary, because such patterns usually carry little information about where the boundary should lie.

We may also select only a few important data points (called support vectors) and weight them differently.

Then, we have a support vector machine.


Introduction (Cont.)

[Figure: a linearly separable data set ($y_i = +1$ and $y_i = -1$) in domain $X$, shown with the decision plane, the margin, and the support vectors lying on the margin, in the $(x_1, x_2)$ plane.]

We aim to find a decision plane that maximizes the margin.


Linear SVMs

Assume that all training data satisfy the constraints

$\mathbf{x}_i \cdot \mathbf{w} + b \geq +1 \;\; \text{for } y_i = +1,$
$\mathbf{x}_i \cdot \mathbf{w} + b \leq -1 \;\; \text{for } y_i = -1,$   (4)

which means

$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \geq 0 \quad \forall i.$   (5)

Training data points for which the above equality holds lie on hyperplanes parallel to the decision plane.


Linear SVMs (Cont.)

Margin $d$: let $\mathbf{x}_1$ and $\mathbf{x}_2$ be points lying on the two margin hyperplanes,

$\mathbf{x}_1: \;\; \mathbf{w} \cdot \mathbf{x}_1 + b = +1$
$\mathbf{x}_2: \;\; \mathbf{w} \cdot \mathbf{x}_2 + b = -1$
$\Rightarrow \;\; \mathbf{w} \cdot (\mathbf{x}_1 - \mathbf{x}_2) = 2.$

Projecting $\mathbf{x}_1 - \mathbf{x}_2$ onto the unit normal of the decision plane gives the margin:

$d = \frac{\mathbf{w}}{\|\mathbf{w}\|} \cdot (\mathbf{x}_1 - \mathbf{x}_2) = \frac{2}{\|\mathbf{w}\|}.$

[Figure: the decision plane $\mathbf{w} \cdot \mathbf{x} + b = 0$ and the margin hyperplanes $\mathbf{w} \cdot \mathbf{x} + b = +1$ and $\mathbf{w} \cdot \mathbf{x} + b = -1$, separated by the margin $d$, in the $(x_1, x_2)$ plane.]

Therefore, maximizing the margin is equivalent to minimizing $\|\mathbf{w}\|^2$.

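As a quick numerical check of this derivation (illustrative values only, not taken from the notes), the distance between the two margin hyperplanes indeed equals $2/\|\mathbf{w}\|$:

```python
import numpy as np

# Hypothetical weight vector and bias, chosen only for illustration
w, b = np.array([2.0, 2.0]), -1.0

# Two points lying on the margin hyperplanes w.x + b = +1 and w.x + b = -1
x1 = np.array([1.0, 0.0])   # w.x1 + b = +1
x2 = np.array([0.0, 0.0])   # w.x2 + b = -1

# Margin = projection of (x1 - x2) onto the unit normal w/||w||
d = (w / np.linalg.norm(w)) @ (x1 - x2)
print(d, 2 / np.linalg.norm(w))   # both ~0.7071
```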

Linear SVMs (Lagrangian)

We minimize $\|\mathbf{w}\|^2$ subject to the constraint

$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \geq 0 \quad \forall i.$   (6)

This can be achieved by introducing Lagrange multipliers $\{\alpha_i \geq 0\}_{i=1}^{N}$ and a Lagrangian

$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left( y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \right).$   (7)

The Lagrangian has to be minimized with respect to $\mathbf{w}$ and $b$ and maximized with respect to $\alpha_i \geq 0$.


Linear SVMs (Lagrangian)

Setting $\frac{\partial}{\partial b} L(\mathbf{w}, b, \boldsymbol{\alpha}) = 0$ and $\frac{\partial}{\partial \mathbf{w}} L(\mathbf{w}, b, \boldsymbol{\alpha}) = 0$, we obtain

$\sum_{i=1}^{N} \alpha_i y_i = 0 \quad \text{and} \quad \mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i.$   (8)

Patterns for which $\alpha_k > 0$ are called support vectors. These vectors $\mathbf{x}_k$ lie on the margin and satisfy

$y_k(\mathbf{x}_k \cdot \mathbf{w} + b) - 1 = 0, \quad k \in S,$

where $S$ contains the indexes of the support vectors.

Patterns for which $\alpha_k = 0$ are considered irrelevant to the classification.


Linear SVMs (Wolfe Dual)

Substituting (8) into (7), we obtain the Wolfe dual:

Maximize: $L(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$
subject to: $\alpha_i \geq 0, \; i = 1, \ldots, N$, and $\sum_{i=1}^{N} \alpha_i y_i = 0.$   (9)

The decision hyperplane is thus

$f(\mathbf{x}) = \mathrm{sgn}\left( \mathbf{w} \cdot \mathbf{x} + b \right) = \mathrm{sgn}\left( \sum_{i=1}^{N} y_i \alpha_i (\mathbf{x}_i \cdot \mathbf{x}) + b \right),$

where $b = 1 - \mathbf{w} \cdot \mathbf{x}_k$, $y_k = 1$, and $\mathbf{x}_k$ is a support vector.

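The Wolfe dual in Eq. (9) is a small quadratic program and can be solved numerically. The following is a minimal sketch (not code from these notes) that maximizes the dual with SciPy's SLSQP solver and then recovers w and b via Eq. (8); it uses the 3-point problem introduced on the next slides as data.

```python
import numpy as np
from scipy.optimize import minimize

# The 3-point problem from the following slides
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0])
N = len(y)
G = (y[:, None] * y[None, :]) * (X @ X.T)   # G_ij = y_i y_j (x_i . x_j)

def neg_dual(a):
    # Negative of L(alpha) in Eq. (9); SLSQP minimizes, so we negate
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, np.zeros(N), method="SLSQP",
               bounds=[(0.0, None)] * N,                              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0

alpha = res.x
w = (alpha * y) @ X                       # Eq. (8)
k = int(np.argmax(alpha * (y > 0)))       # a support vector with y_k = +1
b = 1.0 - w @ X[k]
print(alpha.round(2), w.round(2), round(b, 2))   # approx. [4. 2. 2.], [2. 2.], -1.0
```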

Linear SVMs (Example)

Analytical example (3-point problem):

$\mathbf{x}_1 = [0.0 \;\; 0.0]^T, \quad y_1 = -1$
$\mathbf{x}_2 = [1.0 \;\; 0.0]^T, \quad y_2 = +1$
$\mathbf{x}_3 = [0.0 \;\; 1.0]^T, \quad y_3 = +1$

Objective function:

Maximize: $L(\boldsymbol{\alpha}) = \sum_{i=1}^{3} \alpha_i - \frac{1}{2} \sum_{i=1}^{3} \sum_{j=1}^{3} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$
subject to: $\alpha_i \geq 0, \; i = 1, \ldots, 3$, and $\sum_{i=1}^{3} \alpha_i y_i = 0.$


Linear SVMs (Example)

We introduce another Lagrange multiplier $\lambda$ to obtain the Lagrangian

$F(\boldsymbol{\alpha}, \lambda) = L(\boldsymbol{\alpha}) - \lambda \sum_{i=1}^{3} \alpha_i y_i = \alpha_1 + \alpha_2 + \alpha_3 - \frac{1}{2}\alpha_2^2 - \frac{1}{2}\alpha_3^2 - \lambda(-\alpha_1 + \alpha_2 + \alpha_3).$

Differentiating $F(\boldsymbol{\alpha}, \lambda)$ with respect to $\lambda$ and each $\alpha_i$ and setting the results to zero, we obtain

$\alpha_1 = 4, \quad \alpha_2 = 2, \quad \alpha_3 = 2, \quad \lambda = -1.$

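The stationary point above can be checked symbolically, for example with SymPy (an illustrative verification, not part of the original notes):

```python
import sympy as sp

a1, a2, a3, lam = sp.symbols('alpha1 alpha2 alpha3 lambda', real=True)

# Lagrangian F(alpha, lambda) of the 3-point problem
F = a1 + a2 + a3 - sp.Rational(1, 2)*a2**2 - sp.Rational(1, 2)*a3**2 \
    - lam*(-a1 + a2 + a3)

# Set all partial derivatives to zero and solve the resulting linear system
solution = sp.solve([sp.diff(F, v) for v in (a1, a2, a3, lam)],
                    (a1, a2, a3, lam), dict=True)
print(solution)   # [{alpha1: 4, alpha2: 2, alpha3: 2, lambda: -1}]
```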

Linear SVMs (Example)

Substituting the Lagrange multipliers into Eq. (8):

$\mathbf{w} = \sum_{i=1}^{3} \alpha_i y_i \mathbf{x}_i = [2 \;\; 2]^T, \qquad b = 1 - \mathbf{w}^T \mathbf{x}_2 = -1.$

The decision boundary is

$\mathbf{w}^T \mathbf{x} + b = 0 \;\Rightarrow\; x_1 + x_2 = 0.5, \quad \text{where } \mathbf{x} = [x_1 \;\; x_2]^T.$

[Figure: "Linear SVM, C=100, #SV=3, acc=100.00%, normW=2.83" — the three training points and the decision boundary $x_1 + x_2 = 0.5$ in the $(x_1, x_2)$ plane.]

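The same solution can be reproduced with an off-the-shelf solver. A minimal sketch assuming scikit-learn is available (a large C approximates the hard-margin SVM used in this example):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([-1, 1, 1])

clf = SVC(kernel='linear', C=100).fit(X, y)
print(clf.coef_)        # approx. [[2. 2.]]  -> w
print(clf.intercept_)   # approx. [-1.]      -> b
print(clf.dual_coef_)   # y_i * alpha_i of the support vectors, approx. [[-4. 2. 2.]]
print(clf.support_)     # indices of the support vectors (all three points)
```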

Linear SVMs (Example)

4-point linearly separable problem:

[Figure: two solutions of a 4-point problem in the $(x_1, x_2)$ plane. Left: "Linear SVM, C=100, #SV=3, accuracy=100.00%" (3 SVs). Right: "Linear SVM, C=100, #SV=4, accuracy=100.00%" (4 SVs).]


Linear SVMs (Non-linearly separable)

Non-linearly separable: patterns that cannot be separated by a linear decision boundary without incurring classification error.

[Figure: a 20-point data set in the $(x_1, x_2)$ plane; the annotation "Data that causes classification error in linear SVMs" marks the points that no linear decision boundary can classify correctly.]


Linear SVMs (Non-linearly separable)

We introduce a set of slack variables $\{\xi_1, \xi_2, \ldots, \xi_N\}$ with $\xi_i \geq 0$:

$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) \geq 1 - \xi_i \quad \forall i.$

The slack variables allow some data to violate the constraint defined for the linearly separable case (Eq. 6):

$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) \geq 1.$

Therefore, for some $\xi_k > 0$ we have $y_k(\mathbf{x}_k \cdot \mathbf{w} + b) < 1$, e.g., $y_k(\mathbf{x}_k \cdot \mathbf{w} + b) = 0.5$, which requires $\xi_k \geq 0.5$.

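At the solution, each slack variable equals the hinge loss of its data point, $\xi_i = \max(0,\, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b))$. A small sketch with made-up values:

```python
import numpy as np

def slacks(X, y, w, b):
    """Slack variables xi_i = max(0, 1 - y_i (w.x_i + b))."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

# Hypothetical w, b and three points with y_i = +1 (for illustration only)
w, b = np.array([1.0, 1.0]), -1.0
X = np.array([[2.0, 1.0],    # y(w.x+b) = 2.0  -> xi = 0   (outside the margin, no violation)
              [1.0, 0.5],    # y(w.x+b) = 0.5  -> xi = 0.5 (inside the margin)
              [0.2, 0.2]])   # y(w.x+b) = -0.6 -> xi = 1.6 (misclassified)
y = np.array([1.0, 1.0, 1.0])
print(slacks(X, y, w, b))    # [0.  0.5 1.6]
```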

Linear SVMs (Non-linearly separable)

E.g., $\xi_{10} = \xi_{19} = 0.667$ because $\mathbf{x}_{10}$ and $\mathbf{x}_{19}$ are inside the margins, i.e., they violate the constraint (Eq. 6).

[Figure: "Linear SVM, C=1000.0, #SV=7, acc=95.00%, normW=0.94" — the 20-point data set with the decision boundary and margins; points $\mathbf{x}_{10}$ and $\mathbf{x}_{19}$ lie inside the margins.]


Linear SVMs (Non-linearly separable)

For non-separable cases:

Minimize: $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i$
subject to: $y_i(\mathbf{x}_i \cdot \mathbf{w} + b) \geq 1 - \xi_i,$

where $C$ is a user-defined penalty parameter that penalizes violations of the margins.

The Lagrangian becomes

$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left( y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 + \xi_i \right) - \sum_{i=1}^{N} \beta_i \xi_i,$

where the $\beta_i \geq 0$ are the Lagrange multipliers enforcing $\xi_i \geq 0$.


Linear SVMs (Non-linearly separable)

Wolfe dual optimization:

Maximize: $L(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$
subject to: $0 \leq \alpha_i \leq C, \; i = 1, \ldots, N$, and $\sum_{i=1}^{N} \alpha_i y_i = 0.$

The output weight vector and bias term are

$\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i, \qquad b = 1 - \mathbf{w} \cdot \mathbf{x}_k,$

where $y_k = 1$ and $\mathbf{x}_k$ is a support vector on the margin (i.e., $0 < \alpha_k < C$).

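As an illustration of these formulas (not from the notes), w can be recovered from any fitted soft-margin linear SVM; scikit-learn's SVC, for example, stores $y_i \alpha_i$ of the support vectors in dual_coef_. The data below are synthetic.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian clouds (synthetic data for illustration)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel='linear', C=10.0).fit(X, y)

# w = sum_i alpha_i y_i x_i over the support vectors
w = clf.dual_coef_[0] @ X[clf.support_]
b = clf.intercept_[0]   # bias computed by the solver from the on-margin SVs
print(np.allclose(w, clf.coef_[0]))   # True: matches the solver's own w
```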

2. Linear SVMs (Types of SVs)

Three types of support vectors:

1. On the margin: $0 < \alpha_i < C$, $\xi_i = 0$, and $y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1$.

2. Inside the margin: $\alpha_i = C$, $0 < \xi_i < 2$, and $y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1 - \xi_i$, e.g., $\xi_{10} = 0.667$.

3. Outside the margin: $\alpha_i = C$, $\xi_i > 2$, and $y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1 - \xi_i$, e.g., $\xi_{20} = 2.667$.

[Figure: "Linear SVM, C=10.0, #SV=7, acc=95.00%, normW=0.94" — the 20-point data set with the decision boundary and margins; the support vectors of the three types (on the margin, inside the margin, and outside the margin) are marked, together with their $\alpha_i$ and $\xi_i$ values.]

2. Linear SVMs (Types of SVs)

[Figure: "Linear SVM, C=10.0, #SV=7, acc=95.00%, normW=0.94" — the same data set annotated with the values of $\mathbf{w}^T \mathbf{x} + b$: the decision plane at $\mathbf{w}^T \mathbf{x} + b = 0$, the margins at $\mathbf{w}^T \mathbf{x} + b = \pm 1$, and the points inside the margin at $\mathbf{w}^T \mathbf{x} + b = \pm 0.33$. Points that are not support vectors have $\alpha_i = 0$, $\xi_i = 0$, and $y_i(\mathbf{w}^T \mathbf{x}_i + b) > 1$; on-margin SVs have $\xi_i = 0$ and $y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1$; point 20 has $\alpha_{20} = C$, $\xi_{20} = 2.67$, and $y_{20}(\mathbf{w}^T \mathbf{x}_{20} + b) = 1 - 2.67 = -1.67$.]

2. Linear SVMs (Types of SVs)

Swapping Class 1 and Class 2:

[Figure: "Linear SVM, C=10.0, #SV=7, acc=95.00%, normW=0.94" — the same problem with the two class labels swapped. The decision boundary is unchanged, while the signs of the annotated values of $\mathbf{w}^T \mathbf{x} + b$ flip; point 20 still has $\alpha_{20} = C$, $\xi_{20} = 2.67$, and $y_{20}(\mathbf{w}^T \mathbf{x}_{20} + b) = 1 - 2.67 = -1.67$.]

2. Linear SVMs (Types of SVs)

Effect of varying C:

[Figure: two solutions for the same data set. Left: "Linear SVM, C=0.1, #SV=10, acc=95.00%, normW=0.57" with $\sum_i \xi_i = 5.2$ (small C: wider margin, more support vectors, larger total slack). Right: "Linear SVM, C=100.0, #SV=7, acc=95.00%, normW=0.94" with $\sum_i \xi_i = 4.0$ (large C: narrower margin, fewer support vectors, smaller total slack).]

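The three types of support vectors can be identified programmatically from a fitted soft-margin SVM. A sketch (assuming scikit-learn; the data are synthetic) that buckets each support vector by its multiplier $\alpha_i$ and its margin value $y_i(\mathbf{w} \cdot \mathbf{x}_i + b)$:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1.2, (25, 2)), rng.normal(2, 1.2, (25, 2))])
y = np.array([-1] * 25 + [1] * 25)

C = 10.0
clf = SVC(kernel='linear', C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_[0])                             # alpha_i of each support vector
margin = y[clf.support_] * clf.decision_function(X[clf.support_])

eps = 1e-6
on_margin = alpha < C - eps                                   # 0 < alpha_i < C, xi_i = 0
inside = (alpha >= C - eps) & (margin > -1)                   # alpha_i = C, 0 < xi_i < 2
outside = (alpha >= C - eps) & (margin <= -1)                 # alpha_i = C, xi_i >= 2
print(on_margin.sum(), inside.sum(), outside.sum())
```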

3. Non-linear SVMs

In case the training data X are not linearly separable, we may use a kernel function to map the data from the input space to a feature space in which the data become linearly separable.

[Figure: the kernel $K(\mathbf{x}, \mathbf{x}_i)$ maps the data from the input space (domain $X$, axes $x_1, x_2$), where the decision boundary is curved, to a feature space, where the decision boundary is linear.]

3. Non-linear SVMs (Cont.)

The decision function becomes

$f(\mathbf{x}) = \mathrm{sgn}\left( \sum_{i=1}^{N} y_i \alpha_i K(\mathbf{x}, \mathbf{x}_i) + b \right).$


3. Non-linear SVMs (Cont.)

The decision function becomes

$f(\mathbf{x}) = \mathrm{sgn}\left( \sum_{i=1}^{N} y_i \alpha_i K(\mathbf{x}, \mathbf{x}_i) + b \right).$

For RBF kernels:

$K(\mathbf{x}, \mathbf{x}_i) = \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}_i\|^2}{2\sigma^2} \right).$

For polynomial kernels:

$K(\mathbf{x}, \mathbf{x}_i) = \left( 1 + \mathbf{x} \cdot \mathbf{x}_i \right)^p, \quad p > 0.$

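Both kernels are straightforward to implement directly; a short NumPy sketch (with sigma and p chosen arbitrarily for illustration):

```python
import numpy as np

def rbf_kernel(x, xi, sigma=1.0):
    """RBF kernel: exp(-||x - xi||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def poly_kernel(x, xi, p=2):
    """Polynomial kernel: (1 + x . xi)^p."""
    return (1.0 + np.dot(x, xi)) ** p

x, xi = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(rbf_kernel(x, xi))    # exp(-1) ~ 0.368
print(poly_kernel(x, xi))   # (1 + 0)^2 = 1.0
```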

3. Non-linear SVMs (Cont.)

The optimization problem becomes:

Maximize: $W(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$
subject to: $\alpha_i \geq 0, \; i = 1, \ldots, N$, and $\sum_{i=1}^{N} \alpha_i y_i = 0.$

The decision function becomes

$f(\mathbf{x}) = \mathrm{sgn}\left( \sum_{i=1}^{N} y_i \alpha_i K(\mathbf{x}, \mathbf{x}_i) + b \right).$

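Putting the pieces together, the kernel decision function can be evaluated directly from the support vectors, their multipliers, and b. A minimal sketch; the support vectors, labels, multipliers, and bias below are hypothetical, not the result of an actual optimization:

```python
import numpy as np

def rbf_kernel(x, xi, sigma=2.0):
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def decision(x, sv, sv_y, sv_alpha, b, kernel=rbf_kernel):
    """f(x) = sgn( sum_i y_i alpha_i K(x, x_i) + b ) over the support vectors."""
    s = sum(a * y * kernel(x, xi) for xi, y, a in zip(sv, sv_y, sv_alpha))
    return np.sign(s + b)

# Hypothetical support vectors, labels, multipliers, and bias
sv = np.array([[0.0, 0.0], [2.0, 2.0]])
sv_y = np.array([-1.0, 1.0])
sv_alpha = np.array([1.0, 1.0])
b = 0.0
print(decision(np.array([1.8, 1.9]), sv, sv_y, sv_alpha, b))   # -> 1.0
print(decision(np.array([0.1, 0.2]), sv, sv_y, sv_alpha, b))   # -> -1.0
```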

3. Non-linear SVMs (Cont.)

The effect of varying C on RBF-SVMs:

[Figure: two solutions on the same data set. Left: "RBF SVM, 2*sigma^2=8.0, C=10.0, #SV=9, acc=90.00%" with $\sum_i \xi_i = 3.09$. Right: "RBF SVM, 2*sigma^2=8.0, C=1000.0, #SV=7, acc=100.00%" with $\sum_i \xi_i = 0.0$.]

3. Non-linear SVMs (Cont.)

The effect of varying C on Polynomial-SVMs:

[Figure: two solutions on the same data set. Left: "Polynomial SVM, degree=2, C=10.0, #SV=7, acc=90.00%" with $\sum_i \xi_i = 2.99$. Right: "Polynomial SVM, degree=2, C=1000.0, #SV=8, acc=90.00%" with $\sum_i \xi_i = 2.97$.]

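Sweeps like the ones above are easy to reproduce. A sketch (assuming scikit-learn, with synthetic data standing in for the 20-point set) that reports the number of support vectors, the training accuracy, and the total slack for several values of C:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1.5, (10, 2)), rng.normal(2.5, 1.5, (10, 2))])
y = np.array([-1] * 10 + [1] * 10)

for C in (0.1, 10.0, 1000.0):
    clf = SVC(kernel='rbf', gamma=1.0 / 8.0, C=C).fit(X, y)    # gamma = 1/(2*sigma^2)
    slack = np.maximum(0.0, 1.0 - y * clf.decision_function(X)).sum()
    print(f"C={C}: #SV={len(clf.support_)}, "
          f"acc={clf.score(X, y):.2%}, sum(xi)={slack:.2f}")
```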
