Support Vector Machines


1. Introduction to SVMs

2. Linear SVMs

3. Non-linear SVMs


Introduction

SVMs were developed by Vapnik in 1995 and have become popular because of their attractive features and promising performance.

Conventional neural networks are based on empirical risk minimization, where the network weights are determined by minimizing the mean squared error between the actual outputs and the desired outputs.

SVMs are based on the structural risk minimization principle, where the parameters are optimized by minimizing an upper bound on the generalization error.

SVMs have been shown to possess better generalization capability than conventional neural networks.


Introduction (Cont.)

Given N labeled empirical data points

$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N) \in X \times \{+1, -1\},$   (1)

where $X$ is the set of input data in $\mathbb{R}^D$ and the $y_i$ are the class labels.

[Figure: the two classes ($y_i = +1$ and $y_i = -1$) in domain $X$, plotted in the $(x_1, x_2)$ plane, with class means $\mathbf{c}_1$ and $\mathbf{c}_2$.]

Introduction (Cont.)

We construct a simple classifier by computing the means of the two classes:

$\mathbf{c}_1 = \frac{1}{N_1} \sum_{i:\, y_i = +1} \mathbf{x}_i \quad \text{and} \quad \mathbf{c}_2 = \frac{1}{N_2} \sum_{i:\, y_i = -1} \mathbf{x}_i,$   (2)

where $N_1$ and $N_2$ are the numbers of data points in the classes with positive and negative labels, respectively.

We assign a new point $\mathbf{x}$ to the class whose mean is closer to it.

To achieve this, we compute the midpoint $\mathbf{c} = (\mathbf{c}_1 + \mathbf{c}_2)/2$.


Introduction (Cont.)

Then, we determine the class of $\mathbf{x}$ by checking whether the vector connecting $\mathbf{x}$ and $\mathbf{c}$ encloses an angle smaller than $\pi/2$ with the vector $\mathbf{w} = \mathbf{c}_1 - \mathbf{c}_2$:

$y = \mathrm{sgn}\,\langle (\mathbf{x} - \mathbf{c}), \mathbf{w} \rangle$
$\;\; = \mathrm{sgn}\,\langle (\mathbf{x} - (\mathbf{c}_1 + \mathbf{c}_2)/2), (\mathbf{c}_1 - \mathbf{c}_2) \rangle$
$\;\; = \mathrm{sgn}\,( \langle \mathbf{x}, \mathbf{c}_1 \rangle - \langle \mathbf{x}, \mathbf{c}_2 \rangle + b ),$

where $b = \frac{1}{2}\left( \|\mathbf{c}_2\|^2 - \|\mathbf{c}_1\|^2 \right)$.

[Figure: a test point $\mathbf{x}$, the class means $\mathbf{c}_1$ and $\mathbf{c}_2$, their midpoint $\mathbf{c}$, and the vector $\mathbf{w} = \mathbf{c}_1 - \mathbf{c}_2$ in domain $X$, plotted in the $(x_1, x_2)$ plane.]


Introduction (Cont.)

In the special case where $b = 0$, we have

$y = \mathrm{sgn}\left( \frac{1}{N_1} \sum_{i:\, y_i = +1} (\mathbf{x} \cdot \mathbf{x}_i) - \frac{1}{N_2} \sum_{i:\, y_i = -1} (\mathbf{x} \cdot \mathbf{x}_i) \right) = \mathrm{sgn}\left( \sum_{i:\, y_i = +1} \frac{1}{N_1} (\mathbf{x} \cdot \mathbf{x}_i) - \sum_{i:\, y_i = -1} \frac{1}{N_2} (\mathbf{x} \cdot \mathbf{x}_i) \right).$   (3)

This means that we use ALL data points $\mathbf{x}_i$, each weighted equally by $1/N_1$ or $1/N_2$, to define the decision plane.

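As an illustration (not part of the original notes), the mean-based classifier of Eqs. (2)-(3) can be sketched in a few lines of NumPy; the data below are made up for demonstration.

```python
import numpy as np

def mean_classifier(X_train, y_train, x):
    """Assign x to the class whose mean is closer (Eqs. (2)-(3))."""
    c1 = X_train[y_train == +1].mean(axis=0)       # mean of the positive class
    c2 = X_train[y_train == -1].mean(axis=0)       # mean of the negative class
    w = c1 - c2
    b = 0.5 * (np.dot(c2, c2) - np.dot(c1, c1))    # b = (||c2||^2 - ||c1||^2) / 2
    return np.sign(np.dot(x, w) + b)               # sgn(<x, c1> - <x, c2> + b)

# Toy data, chosen only for this illustration
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([+1, +1, -1, -1])
print(mean_classifier(X, y, np.array([2.5, 2.0])))   # -> 1.0
print(mean_classifier(X, y, np.array([0.2, 0.5])))   # -> -1.0
```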

Introduction (Cont.)

[Figure: the decision plane passes through the midpoint $\mathbf{c}$ of the two class means and is perpendicular to $\mathbf{w} = \mathbf{c}_1 - \mathbf{c}_2$; points with $y_i = +1$ and points with $y_i = -1$ lie on opposite sides, in domain $X$ with axes $x_1$ and $x_2$.]


Introduction (Cont.)

However, we might want to reduce the influence of patterns that are far away from the decision boundary, because such patterns usually carry little information about where the boundary should lie.

We may also select only a few important data points (called support vectors) and weight them differently.

Then, we have a support vector machine.


Introduction (Cont.)

[Figure: a linearly separable data set ($y_i = +1$ and $y_i = -1$) in domain $X$, shown with the decision plane, the margin, and the support vectors lying on the margin, in the $(x_1, x_2)$ plane.]

We aim to find a decision plane that maximizes the margin.


Linear SVMs

Assume that all training data satisfy the constraints

$\mathbf{x}_i \cdot \mathbf{w} + b \geq +1 \;\; \text{for } y_i = +1,$
$\mathbf{x}_i \cdot \mathbf{w} + b \leq -1 \;\; \text{for } y_i = -1,$   (4)

which means

$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \geq 0 \quad \forall i.$   (5)

Training data points for which the above equality holds lie on hyperplanes parallel to the decision plane.


Linear SVMs (Cont.)

Margin $d$: let $\mathbf{x}_1$ and $\mathbf{x}_2$ be points lying on the two margin hyperplanes,

$\mathbf{x}_1: \;\; \mathbf{w} \cdot \mathbf{x}_1 + b = +1$
$\mathbf{x}_2: \;\; \mathbf{w} \cdot \mathbf{x}_2 + b = -1$
$\Rightarrow \;\; \mathbf{w} \cdot (\mathbf{x}_1 - \mathbf{x}_2) = 2.$

Projecting $\mathbf{x}_1 - \mathbf{x}_2$ onto the unit normal of the decision plane gives the margin:

$d = \frac{\mathbf{w}}{\|\mathbf{w}\|} \cdot (\mathbf{x}_1 - \mathbf{x}_2) = \frac{2}{\|\mathbf{w}\|}.$

[Figure: the decision plane $\mathbf{w} \cdot \mathbf{x} + b = 0$ and the margin hyperplanes $\mathbf{w} \cdot \mathbf{x} + b = +1$ and $\mathbf{w} \cdot \mathbf{x} + b = -1$, separated by the margin $d$, in the $(x_1, x_2)$ plane.]

Therefore, maximizing the margin is equivalent to minimizing $\|\mathbf{w}\|^2$.

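As a quick numerical check of this derivation (illustrative values only, not taken from the notes), the distance between the two margin hyperplanes indeed equals $2/\|\mathbf{w}\|$:

```python
import numpy as np

# Hypothetical weight vector and bias, chosen only for illustration
w, b = np.array([2.0, 2.0]), -1.0

# Two points lying on the margin hyperplanes w.x + b = +1 and w.x + b = -1
x1 = np.array([1.0, 0.0])   # w.x1 + b = +1
x2 = np.array([0.0, 0.0])   # w.x2 + b = -1

# Margin = projection of (x1 - x2) onto the unit normal w/||w||
d = (w / np.linalg.norm(w)) @ (x1 - x2)
print(d, 2 / np.linalg.norm(w))   # both ~0.7071
```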

Linear SVMs (Lagrangian)

We minimize $\|\mathbf{w}\|^2$ subject to the constraint

$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \geq 0 \quad \forall i.$   (6)

This can be achieved by introducing Lagrange multipliers $\{\alpha_i \geq 0\}_{i=1}^{N}$ and a Lagrangian

$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left( y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \right).$   (7)

The Lagrangian has to be minimized with respect to $\mathbf{w}$ and $b$ and maximized with respect to $\alpha_i \geq 0$.


Linear SVMs (Lagrangian)

Setting $\frac{\partial}{\partial b} L(\mathbf{w}, b, \boldsymbol{\alpha}) = 0$ and $\frac{\partial}{\partial \mathbf{w}} L(\mathbf{w}, b, \boldsymbol{\alpha}) = 0$, we obtain

$\sum_{i=1}^{N} \alpha_i y_i = 0 \quad \text{and} \quad \mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i.$   (8)

Patterns for which $\alpha_k > 0$ are called support vectors. These vectors $\mathbf{x}_k$ lie on the margin and satisfy

$y_k(\mathbf{x}_k \cdot \mathbf{w} + b) - 1 = 0, \quad k \in S,$

where $S$ contains the indexes of the support vectors.

Patterns for which $\alpha_k = 0$ are considered irrelevant to the classification.


Linear SVMs (Wolfe Dual)

Substituting (8) into (7), we obtain the Wolfe dual:

Maximize: $L(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$
subject to: $\alpha_i \geq 0, \; i = 1, \ldots, N$, and $\sum_{i=1}^{N} \alpha_i y_i = 0.$   (9)

The decision hyperplane is thus

$f(\mathbf{x}) = \mathrm{sgn}\left( \mathbf{w} \cdot \mathbf{x} + b \right) = \mathrm{sgn}\left( \sum_{i=1}^{N} y_i \alpha_i (\mathbf{x}_i \cdot \mathbf{x}) + b \right),$

where $b = 1 - \mathbf{w} \cdot \mathbf{x}_k$, $y_k = 1$, and $\mathbf{x}_k$ is a support vector.

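The Wolfe dual in Eq. (9) is a small quadratic program and can be solved numerically. The following is a minimal sketch (not code from these notes) that maximizes the dual with SciPy's SLSQP solver and then recovers w and b via Eq. (8); it uses the 3-point problem introduced on the next slides as data.

```python
import numpy as np
from scipy.optimize import minimize

# The 3-point problem from the following slides
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0])
N = len(y)
G = (y[:, None] * y[None, :]) * (X @ X.T)   # G_ij = y_i y_j (x_i . x_j)

def neg_dual(a):
    # Negative of L(alpha) in Eq. (9); SLSQP minimizes, so we negate
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, np.zeros(N), method="SLSQP",
               bounds=[(0.0, None)] * N,                              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0

alpha = res.x
w = (alpha * y) @ X                       # Eq. (8)
k = int(np.argmax(alpha * (y > 0)))       # a support vector with y_k = +1
b = 1.0 - w @ X[k]
print(alpha.round(2), w.round(2), round(b, 2))   # approx. [4. 2. 2.], [2. 2.], -1.0
```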

Linear SVMs (Example)

Analytical example (3-point problem):

$\mathbf{x}_1 = [0.0 \;\; 0.0]^T, \quad y_1 = -1$
$\mathbf{x}_2 = [1.0 \;\; 0.0]^T, \quad y_2 = +1$
$\mathbf{x}_3 = [0.0 \;\; 1.0]^T, \quad y_3 = +1$

Objective function:

Maximize: $L(\boldsymbol{\alpha}) = \sum_{i=1}^{3} \alpha_i - \frac{1}{2} \sum_{i=1}^{3} \sum_{j=1}^{3} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$
subject to: $\alpha_i \geq 0, \; i = 1, \ldots, 3$, and $\sum_{i=1}^{3} \alpha_i y_i = 0.$


Linear SVMs (Example)

We introduce another Lagrange multiplier $\lambda$ to obtain the Lagrangian

$F(\boldsymbol{\alpha}, \lambda) = L(\boldsymbol{\alpha}) - \lambda \sum_{i=1}^{3} \alpha_i y_i = \alpha_1 + \alpha_2 + \alpha_3 - \frac{1}{2}\alpha_2^2 - \frac{1}{2}\alpha_3^2 - \lambda(-\alpha_1 + \alpha_2 + \alpha_3).$

Differentiating $F(\boldsymbol{\alpha}, \lambda)$ with respect to $\lambda$ and each $\alpha_i$ and setting the results to zero, we obtain

$\alpha_1 = 4, \quad \alpha_2 = 2, \quad \alpha_3 = 2, \quad \lambda = -1.$

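The stationary point above can be checked symbolically, for example with SymPy (an illustrative verification, not part of the original notes):

```python
import sympy as sp

a1, a2, a3, lam = sp.symbols('alpha1 alpha2 alpha3 lambda', real=True)

# Lagrangian F(alpha, lambda) of the 3-point problem
F = a1 + a2 + a3 - sp.Rational(1, 2)*a2**2 - sp.Rational(1, 2)*a3**2 \
    - lam*(-a1 + a2 + a3)

# Set all partial derivatives to zero and solve the resulting linear system
solution = sp.solve([sp.diff(F, v) for v in (a1, a2, a3, lam)],
                    (a1, a2, a3, lam), dict=True)
print(solution)   # [{alpha1: 4, alpha2: 2, alpha3: 2, lambda: -1}]
```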

Linear SVMs (Example)

Substituting the Lagrange multipliers into Eq. (8):

$\mathbf{w} = \sum_{i=1}^{3} \alpha_i y_i \mathbf{x}_i = [2 \;\; 2]^T, \qquad b = 1 - \mathbf{w}^T \mathbf{x}_2 = -1.$

The decision boundary is

$\mathbf{w}^T \mathbf{x} + b = 0 \;\Rightarrow\; x_1 + x_2 = 0.5, \quad \text{where } \mathbf{x} = [x_1 \;\; x_2]^T.$

[Figure: "Linear SVM, C=100, #SV=3, acc=100.00%, normW=2.83" — the three training points and the decision boundary $x_1 + x_2 = 0.5$ in the $(x_1, x_2)$ plane.]

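The same solution can be reproduced with an off-the-shelf solver. A minimal sketch assuming scikit-learn is available (a large C approximates the hard-margin SVM used in this example):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([-1, 1, 1])

clf = SVC(kernel='linear', C=100).fit(X, y)
print(clf.coef_)        # approx. [[2. 2.]]  -> w
print(clf.intercept_)   # approx. [-1.]      -> b
print(clf.dual_coef_)   # y_i * alpha_i of the support vectors, approx. [[-4. 2. 2.]]
print(clf.support_)     # indices of the support vectors (all three points)
```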

Linear SVMs (Example)

4-point linearly separable problem:

[Figure: two solutions of a 4-point problem in the $(x_1, x_2)$ plane. Left: "Linear SVM, C=100, #SV=3, accuracy=100.00%" (3 SVs). Right: "Linear SVM, C=100, #SV=4, accuracy=100.00%" (4 SVs).]


Linear SVMs (Non-linearly separable)

Non-linearly separable: patterns that cannot be separated by a linear decision boundary without incurring classification error.

[Figure: a 20-point data set in the $(x_1, x_2)$ plane; the annotation "Data that causes classification error in linear SVMs" marks the points that no linear decision boundary can classify correctly.]


Linear SVMs (Non-linearly separable)

We introduce a set of slack variables $\{\xi_1, \xi_2, \ldots, \xi_N\}$ with $\xi_i \geq 0$:

$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) \geq 1 - \xi_i \quad \forall i.$

The slack variables allow some data to violate the constraint defined for the linearly separable case (Eq. 6):

$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) \geq 1.$

Therefore, for some $\xi_k > 0$ we have $y_k(\mathbf{x}_k \cdot \mathbf{w} + b) < 1$, e.g., $y_k(\mathbf{x}_k \cdot \mathbf{w} + b) = 0.5$, which requires $\xi_k \geq 0.5$.

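At the solution, each slack variable equals the hinge loss of its data point, $\xi_i = \max(0,\, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b))$. A small sketch with made-up values:

```python
import numpy as np

def slacks(X, y, w, b):
    """Slack variables xi_i = max(0, 1 - y_i (w.x_i + b))."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

# Hypothetical w, b and three points with y_i = +1 (for illustration only)
w, b = np.array([1.0, 1.0]), -1.0
X = np.array([[2.0, 1.0],    # y(w.x+b) = 2.0  -> xi = 0   (outside the margin, no violation)
              [1.0, 0.5],    # y(w.x+b) = 0.5  -> xi = 0.5 (inside the margin)
              [0.2, 0.2]])   # y(w.x+b) = -0.6 -> xi = 1.6 (misclassified)
y = np.array([1.0, 1.0, 1.0])
print(slacks(X, y, w, b))    # [0.  0.5 1.6]
```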

Linear SVMs (Non-linearly separable)

E.g., $\xi_{10} = \xi_{19} = 0.667$ because $\mathbf{x}_{10}$ and $\mathbf{x}_{19}$ are inside the margins, i.e., they violate the constraint (Eq. 6).

[Figure: "Linear SVM, C=1000.0, #SV=7, acc=95.00%, normW=0.94" — the 20-point data set with the decision boundary and margins; points $\mathbf{x}_{10}$ and $\mathbf{x}_{19}$ lie inside the margins.]


Linear SVMs (Non-linearly separable)

For non-separable cases:

Minimize: $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i$
subject to: $y_i(\mathbf{x}_i \cdot \mathbf{w} + b) \geq 1 - \xi_i,$

where $C$ is a user-defined penalty parameter that penalizes violations of the margins.

The Lagrangian becomes

$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left( y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 + \xi_i \right) - \sum_{i=1}^{N} \beta_i \xi_i,$

where the $\beta_i \geq 0$ are the Lagrange multipliers enforcing $\xi_i \geq 0$.


Linear SVMs (Non-linearly separable)

Wolfe dual optimization:

Maximize: $L(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$
subject to: $0 \leq \alpha_i \leq C, \; i = 1, \ldots, N$, and $\sum_{i=1}^{N} \alpha_i y_i = 0.$

The output weight vector and bias term are

$\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i, \qquad b = 1 - \mathbf{w} \cdot \mathbf{x}_k,$

where $y_k = 1$ and $\mathbf{x}_k$ is a support vector on the margin (i.e., $0 < \alpha_k < C$).

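As an illustration of these formulas (not from the notes), w can be recovered from any fitted soft-margin linear SVM; scikit-learn's SVC, for example, stores $y_i \alpha_i$ of the support vectors in dual_coef_. The data below are synthetic.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian clouds (synthetic data for illustration)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel='linear', C=10.0).fit(X, y)

# w = sum_i alpha_i y_i x_i over the support vectors
w = clf.dual_coef_[0] @ X[clf.support_]
b = clf.intercept_[0]   # bias computed by the solver from the on-margin SVs
print(np.allclose(w, clf.coef_[0]))   # True: matches the solver's own w
```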

2. Linear SVMs (Types of SVs)

Three types of support vectors:

1. On the margin: $0 < \alpha_i < C$, $\xi_i = 0$, and $y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1$.

2. Inside the margin: $\alpha_i = C$, $0 < \xi_i < 2$, and $y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1 - \xi_i$, e.g., $\xi_{10} = 0.667$.

3. Outside the margin: $\alpha_i = C$, $\xi_i > 2$, and $y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1 - \xi_i$, e.g., $\xi_{20} = 2.667$.

[Figure: "Linear SVM, C=10.0, #SV=7, acc=95.00%, normW=0.94" — the 20-point data set with the decision boundary and margins; the support vectors of the three types (on the margin, inside the margin, and outside the margin) are marked, together with their $\alpha_i$ and $\xi_i$ values.]

2. Linear SVMs (Types of SVs)

[Figure: "Linear SVM, C=10.0, #SV=7, acc=95.00%, normW=0.94" — the same data set annotated with the values of $\mathbf{w}^T \mathbf{x} + b$: the decision plane at $\mathbf{w}^T \mathbf{x} + b = 0$, the margins at $\mathbf{w}^T \mathbf{x} + b = \pm 1$, and the points inside the margin at $\mathbf{w}^T \mathbf{x} + b = \pm 0.33$. Points that are not support vectors have $\alpha_i = 0$, $\xi_i = 0$, and $y_i(\mathbf{w}^T \mathbf{x}_i + b) > 1$; on-margin SVs have $\xi_i = 0$ and $y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1$; point 20 has $\alpha_{20} = C$, $\xi_{20} = 2.67$, and $y_{20}(\mathbf{w}^T \mathbf{x}_{20} + b) = 1 - 2.67 = -1.67$.]

2. Linear SVMs (Types of SVs)

Swapping Class 1 and Class 2:

[Figure: "Linear SVM, C=10.0, #SV=7, acc=95.00%, normW=0.94" — the same problem with the two class labels swapped. The decision boundary is unchanged, while the signs of the annotated values of $\mathbf{w}^T \mathbf{x} + b$ flip; point 20 still has $\alpha_{20} = C$, $\xi_{20} = 2.67$, and $y_{20}(\mathbf{w}^T \mathbf{x}_{20} + b) = 1 - 2.67 = -1.67$.]

2. Linear SVMs (Types of SVs)

Effect of varying C:

[Figure: two solutions for the same data set. Left: "Linear SVM, C=0.1, #SV=10, acc=95.00%, normW=0.57" with $\sum_i \xi_i = 5.2$ (small C: wider margin, more support vectors, larger total slack). Right: "Linear SVM, C=100.0, #SV=7, acc=95.00%, normW=0.94" with $\sum_i \xi_i = 4.0$ (large C: narrower margin, fewer support vectors, smaller total slack).]

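The three types of support vectors can be identified programmatically from a fitted soft-margin SVM. A sketch (assuming scikit-learn; the data are synthetic) that buckets each support vector by its multiplier $\alpha_i$ and its margin value $y_i(\mathbf{w} \cdot \mathbf{x}_i + b)$:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1.2, (25, 2)), rng.normal(2, 1.2, (25, 2))])
y = np.array([-1] * 25 + [1] * 25)

C = 10.0
clf = SVC(kernel='linear', C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_[0])                             # alpha_i of each support vector
margin = y[clf.support_] * clf.decision_function(X[clf.support_])

eps = 1e-6
on_margin = alpha < C - eps                                   # 0 < alpha_i < C, xi_i = 0
inside = (alpha >= C - eps) & (margin > -1)                   # alpha_i = C, 0 < xi_i < 2
outside = (alpha >= C - eps) & (margin <= -1)                 # alpha_i = C, xi_i >= 2
print(on_margin.sum(), inside.sum(), outside.sum())
```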

3. Non-linear SVMs

In case the training data X are not linearly separable, we may use a kernel function to map the data from the input space to a feature space in which the data become linearly separable.

[Figure: the kernel $K(\mathbf{x}, \mathbf{x}_i)$ maps the data from the input space (domain $X$, axes $x_1, x_2$), where the decision boundary is curved, to a feature space, where the decision boundary is linear.]

3. Non-linear SVMs (Cont.)

The decision function becomes

$f(\mathbf{x}) = \mathrm{sgn}\left( \sum_{i=1}^{N} y_i \alpha_i K(\mathbf{x}, \mathbf{x}_i) + b \right).$


3. Non-linear SVMs (Cont.)

The decision function becomes

$f(\mathbf{x}) = \mathrm{sgn}\left( \sum_{i=1}^{N} y_i \alpha_i K(\mathbf{x}, \mathbf{x}_i) + b \right).$

For RBF kernels:

$K(\mathbf{x}, \mathbf{x}_i) = \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}_i\|^2}{2\sigma^2} \right).$

For polynomial kernels:

$K(\mathbf{x}, \mathbf{x}_i) = \left( 1 + \mathbf{x} \cdot \mathbf{x}_i \right)^p, \quad p > 0.$

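Both kernels are straightforward to implement directly; a short NumPy sketch (with sigma and p chosen arbitrarily for illustration):

```python
import numpy as np

def rbf_kernel(x, xi, sigma=1.0):
    """RBF kernel: exp(-||x - xi||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def poly_kernel(x, xi, p=2):
    """Polynomial kernel: (1 + x . xi)^p."""
    return (1.0 + np.dot(x, xi)) ** p

x, xi = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(rbf_kernel(x, xi))    # exp(-1) ~ 0.368
print(poly_kernel(x, xi))   # (1 + 0)^2 = 1.0
```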

3. Non-linear SVMs (Cont.)

The optimization problem becomes:

Maximize: $W(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$
subject to: $\alpha_i \geq 0, \; i = 1, \ldots, N$, and $\sum_{i=1}^{N} \alpha_i y_i = 0.$

The decision function becomes

$f(\mathbf{x}) = \mathrm{sgn}\left( \sum_{i=1}^{N} y_i \alpha_i K(\mathbf{x}, \mathbf{x}_i) + b \right).$

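Putting the pieces together, the kernel decision function can be evaluated directly from the support vectors, their multipliers, and b. A minimal sketch; the support vectors, labels, multipliers, and bias below are hypothetical, not the result of an actual optimization:

```python
import numpy as np

def rbf_kernel(x, xi, sigma=2.0):
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def decision(x, sv, sv_y, sv_alpha, b, kernel=rbf_kernel):
    """f(x) = sgn( sum_i y_i alpha_i K(x, x_i) + b ) over the support vectors."""
    s = sum(a * y * kernel(x, xi) for xi, y, a in zip(sv, sv_y, sv_alpha))
    return np.sign(s + b)

# Hypothetical support vectors, labels, multipliers, and bias
sv = np.array([[0.0, 0.0], [2.0, 2.0]])
sv_y = np.array([-1.0, 1.0])
sv_alpha = np.array([1.0, 1.0])
b = 0.0
print(decision(np.array([1.8, 1.9]), sv, sv_y, sv_alpha, b))   # -> 1.0
print(decision(np.array([0.1, 0.2]), sv, sv_y, sv_alpha, b))   # -> -1.0
```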

3. Non-linear SVMs (Cont.)

The effect of varying C on RBF-SVMs:

[Figure: two solutions on the same data set. Left: "RBF SVM, 2*sigma^2=8.0, C=10.0, #SV=9, acc=90.00%" with $\sum_i \xi_i = 3.09$. Right: "RBF SVM, 2*sigma^2=8.0, C=1000.0, #SV=7, acc=100.00%" with $\sum_i \xi_i = 0.0$.]

3. Non-linear SVMs (Cont.)

The effect of varying C on Polynomial-SVMs:

[Figure: two solutions on the same data set. Left: "Polynomial SVM, degree=2, C=10.0, #SV=7, acc=90.00%" with $\sum_i \xi_i = 2.99$. Right: "Polynomial SVM, degree=2, C=1000.0, #SV=8, acc=90.00%" with $\sum_i \xi_i = 2.97$.]

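Sweeps like the ones above are easy to reproduce. A sketch (assuming scikit-learn, with synthetic data standing in for the 20-point set) that reports the number of support vectors, the training accuracy, and the total slack for several values of C:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1.5, (10, 2)), rng.normal(2.5, 1.5, (10, 2))])
y = np.array([-1] * 10 + [1] * 10)

for C in (0.1, 10.0, 1000.0):
    clf = SVC(kernel='rbf', gamma=1.0 / 8.0, C=C).fit(X, y)    # gamma = 1/(2*sigma^2)
    slack = np.maximum(0.0, 1.0 - y * clf.decision_function(X)).sum()
    print(f"C={C}: #SV={len(clf.support_)}, "
          f"acc={clf.score(X, y):.2%}, sum(xi)={slack:.2f}")
```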
