How to use the D2K Support Vector Machine (SVM) MATLAB toolbox
This MATLAB toolbox consists of several m-files designed to perform two
tasks:
1. Multiclass classification
2. Regression
1 How to do an SVM multiclass classification
SVM multiclass classification is done in three steps:
1. Loading and normalizing data (function normsv)
2. Training of SVM (function multiclass)
3. Obtaining the results (function classify)
1.1 Loading and normalizing data
After loading the data into MATLAB, they must be normalized to the form
required by the chosen kernel function. Function normsv can be used for this
purpose (Table 1-1).
Usage:
    [X A B] = normsv(X,kernel,isotropic)
Parameters:
    X            data to be normalized
    kernel       kernel type ('linear','poly','rbf')
    isotropic    isotropic (1), or anisotropic (0, default) scaling
Returned values:
    X            normalized data
    A, B         These matrices can be used for normalizing other data
                 in the same way as X:
                     n = size(X1,2);
                     for i=1:n
                         X1(:,i) = A(i)*X1(:,i) + B(i);
                     end
Table 1-1 Parameters and returned values of function normsv
X is a matrix. Every row represents one training input (Table 1-2):
    12.4    23.4    ...    26.1        X1
    34.5    15.2    ...    23.5        X2
    ...     ...     ...    ...         ...
    16.3    13.1    ...    9.1         Xn
Table 1-2 Format of the matrix X (input vectors)
Variable kernel represents the type of kernel function; the data will be
normalized specifically for this kernel. Three types of kernel functions are
available in this version: linear ('linear'), polynomial ('poly') and Gaussian
RBF ('rbf'). A description of these kernel functions can be found in section 3.
Parameter isotropic determines the type of scaling that will be used. For the
three kernels mentioned, data are normalized to the range [-1, 1]. If isotropic
scaling is used, every column is normalized with the same normalizing
coefficients. Anisotropic scaling, on the other hand, uses different
coefficients for every column, so that after normalization each component of
the input vector has the same importance.
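The difference between the two scaling modes can be sketched as follows. This is only an illustration that assumes a plain min-max mapping to [-1, 1]; it is not the code of normsv, whose internals are not shown in this document. The variable names (A_aniso, B_iso and so on) are made up for the example; only the form X(:,i) = A(i)*X(:,i) + B(i) is taken from Table 1-1.

```matlab
% Hypothetical sketch of min-max scaling to [-1, 1]; NOT the toolbox's normsv.
X = [12.4 23.4 26.1;
     34.5 15.2 23.5;
     16.3 13.1  9.1];
n = size(X, 2);

% Anisotropic scaling: one (A, B) pair per column.
lo = min(X, [], 1);
hi = max(X, [], 1);
A_aniso = 2 ./ (hi - lo);
B_aniso = -(hi + lo) ./ (hi - lo);

% Isotropic scaling: the same (A, B) pair for every column.
gl = min(X(:));
gh = max(X(:));
A_iso = repmat(2 / (gh - gl), 1, n);
B_iso = repmat(-(gh + gl) / (gh - gl), 1, n);

% Apply the coefficients column by column, as in Table 1-1.
X_aniso = zeros(size(X));
X_iso   = zeros(size(X));
for i = 1:n
    X_aniso(:, i) = A_aniso(i) * X(:, i) + B_aniso(i);
    X_iso(:, i)   = A_iso(i)   * X(:, i) + B_iso(i);
end
```

With the anisotropic variant every column spans the full range [-1, 1]; with the isotropic variant only the matrix as a whole does, so the relative spread of the columns is preserved.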
1.2 Training of SVM for multiclass classification
Function multiclass performs training of SVM for multiclass classification
(Table 1-3):
Usage:
    [b,alpha,Ynew,cl] = multiclass(X,Y,kernel,p1,C,filename)
Parameters:
    X            Training inputs (normalized) - matrix
    Y            Training targets - vector
    kernel       type of kernel function, default 'linear', see 3.
    p1           parameter of the kernel function, default 1
    C            non-separable case, default Inf
    filename     name of the log file, default 'multiclasslog.txt'.
                 Messages produced during the training are written here.
Returned values:
    alpha        Matrix of Lagrange Multipliers
    b            vector of bias terms
    Ynew         Modified training outputs
    cl           Number of classes
Table 1-3 Parameters and returned values of function multiclass
Training inputs should be in the form described in section 1.1 and
normalized. The vector of training targets consists of one column. Values Y1 to
Yn are integers from the range [1, cl], where cl is the number of classes. Yi
determines the class to which vector Xi belongs.
Parameter C is a real number from the range (0, Inf]. If the classes cannot be
separated using the specific kernel function, or if a certain error has to be
accepted in order to improve generalization, the constraints have to be
relaxed. Setting C < Inf relaxes the constraints, and a certain number of
wrong classifications will be accepted.
The status of the training process can be checked. A short message about
each important operation is written to the file specified by parameter filename.
During the training process the file may only be viewed; editing it would
cause a sharing violation and crash the training.
Function multiclass returns several values. Alpha is a matrix of Lagrange
multipliers and has the form described in Table 1-4.
                  size m (number of classes)
    size n        alpha_X1_cl1    alpha_X1_cl2    ...    alpha_X1_clm
    (number of    alpha_X2_cl1    alpha_X2_cl2    ...    alpha_X2_clm
    input         ...             ...             ...    ...
    vectors)      alpha_Xn_cl1    alpha_Xn_cl2    ...    alpha_Xn_clm
Table 1-4 Format of the matrix alpha (Lagrange multipliers)
Each column of matrix alpha is a set of Lagrange multipliers and can be
used to separate the points of class cli from all other points in the training
set. If alpha_Xi_cl > 0, vector Xi is called a support vector for this
particular classification. (Vectors X whose alphas are equal to zero in all
columns could be removed and the results would remain the same.) Please see
[Burges 99].
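The remark about removable vectors can be illustrated with a short sketch. The alpha matrix below is invented for the example (n = 4 inputs, m = 3 classes), and the small tolerance tol is an assumption added for numerical robustness; the text above only states the condition alpha > 0.

```matlab
% Hypothetical alpha matrix in the layout of Table 1-4 (n x m).
alpha = [0.7 0.0 0.0;
         0.0 0.0 0.0;
         0.0 1.2 0.0;
         0.3 0.0 0.5];
X = [0.1  0.2;
     0.3 -0.4;
    -0.5  0.6;
     0.7 -0.8];

tol  = 1e-8;              % assumed tolerance, not part of the toolbox
isSV = alpha > tol;       % isSV(i,c) is true if Xi is a support vector for class c

% Rows with zero alphas in all columns contribute nothing and can be dropped.
keep          = any(isSV, 2);
X_reduced     = X(keep, :);
alpha_reduced = alpha(keep, :);

% Indices of the support vectors of the first class.
svOfClass1 = find(isSV(:, 1));
```

If the reduced matrices are passed on to classify, the rows of Ynew would presumably have to be filtered with the same mask, since all three matrices are indexed by training input.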
Returned value b represents the bias terms. It is a column vector of length
m. Each element bi of b belongs to the corresponding set alpha_X_cli. In
this version, the parameters bi are non-zero only if the linear kernel is used.
With the polynomial and Gaussian RBF kernels, the separation planes in the
transformed space always contain the origin ([Burges 99]), so the bias
term is always equal to zero.
Ynew is a transformed vector Y. The transformation is shown on an example
(4 classes, Table 1-5).
    Y        Ynew
             cl. 1    cl. 2    cl. 3    cl. 4
    4        -1       -1       -1        1
    2        -1        1       -1       -1
    1         1       -1       -1       -1
    ...      ...      ...      ...      ...
    3        -1       -1        1       -1
Table 1-5 Format of the matrix Ynew (modified output)
Multiclass classification is done by decomposing the problem into n two-class
classifications, where n is the number of classes. Function multiclass
calculates Lagrange multipliers and bias terms for separating each class from
all other classes. For each two-class classification, function svcm is called.
This function is based on the algorithm described in [Burges 99].
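The transformation of Y into Ynew used for this decomposition can be reproduced in a few lines. The sketch below rebuilds the four rows shown in Table 1-5; it only illustrates the encoding and is not taken from the source of multiclass. Deriving cl as max(Y) is an assumption made for the example.

```matlab
% One-against-all encoding of the targets, as shown in Table 1-5.
Y  = [4; 2; 1; 3];        % class index of each training input
cl = max(Y);              % number of classes (assumed to be the largest label)

Ynew = -ones(numel(Y), cl);
for i = 1:numel(Y)
    Ynew(i, Y(i)) = 1;    % +1 in the column of the own class, -1 elsewhere
end
```

Column c of Ynew is then the target vector of the two-class problem "class c against all other classes".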
If only a two-class classification is needed, function svcm can be called
directly (Table 1-6).
Usage:
    [b,alpha,h] = svcm(X,Y,kernel,C,p1,filename,h)
Parameters and returned values:
    X, kernel, C,    The meaning of these parameters is the same as for
    p1, alpha, b     function multiclass
    filename         default is 'classlog.txt'
    Y                The vector of targets has a slightly different form
                     than the one used for multiclass. It is a one-column
                     vector of length n (number of training inputs). Yi is 1
                     when Xi belongs to class 1, or -1 when Xi belongs to
                     class 2 (or, better, does not belong to class 1).
    h                If possible, this raw Hessian (see section 3.1) should
                     be used. If not, a new one is calculated. Passing h
                     avoids calculating it repeatedly. It is usually used for
                     calculations on the same training set with different
                     parameters - but not with different kernel parameters!
Table 1-6 Parameters and returned values of function svcm
1.3 Obtaining the results
After training of the SVM, the results can be obtained using function classify
(Table 1-7).
Usage:
    [res,res1,Rawres] = classify(Xtest,Xtrain,Ynew,alpha,b,kernel,p1)
Parameters:
    Xtest     Testing inputs - matrix, normalized the same way as Xtrain
              (see 1.1)
    Xtrain    Training inputs - matrix
    Ynew      Training outputs - in modified form, output from multiclass
              = matrix(!)
    alpha     Matrix of Lagrange Multipliers for each discrimination plane;
              every column is one set, output from multiclass
    b         vector of bias terms, output from multiclass
    kernel    type of kernel function (must be the same as for training!)
    p1        parameter of the kernel (see 3), must be the same as for
              training
Returned values:
    Rawres    matrix of distances of points Xtest from the discrimination
              lines for each class (in transformed space)
    res       transformed Rawres, points of Xtest are classified with 1 or 0
              for each class - matrix
    res1      Another format of res, the same as Y for function multiclass
Table 1-7 Parameters and returned values of function classify
This function calculates the distance of each point Xtesti from the
discrimination plane in the transformed space. Please note that this is not
the perpendicular distance in the space of Xtest but the distance in the
transformed multidimensional space. Depending on the kernel function, the
dimension of the transformed space is usually much greater than the dimension
of the input vectors. According to the theory of SVM, the discrimination
surface in the transformed space is a plane. The matrix of distances Rawres has
the form described in Table 1-8.
                  size ntest (number of input vectors in Xtest)
    size m        distance_Xtest1_cl1    distance_Xtest2_cl1    ...    distance_Xtestn_cl1
    (number of    distance_Xtest1_cl2    distance_Xtest2_cl2    ...    distance_Xtestn_cl2
    classes)      ...                    ...                    ...    ...
                  distance_Xtest1_clm    distance_Xtest2_clm    ...    distance_Xtestn_clm
Table 1-8 Format of the matrix Rawres
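How such a matrix can be computed is sketched below with the standard SVM decision function from [Burges 99], f(x) = sum_i alpha_i * y_i * K(x_i, x) + b. That classify evaluates exactly this expression is an assumption; the toy data, the linear kernel and all values are made up for the illustration.

```matlab
% Hypothetical two-class toy problem with a linear kernel.
Xtrain = [ 1  1;  2  2; -1 -1; -2 -2];
Ynew   = [ 1 -1;  1 -1; -1  1; -1  1];          % n x m, as in Table 1-5
alpha  = [0.25 0.25; 0 0; 0.25 0.25; 0 0];      % n x m, as in Table 1-4
b      = [0; 0];                                % one bias term per class
Xtest  = [1.5 1.0; -0.5 -2.0];

K     = Xtrain * Xtest';       % n x ntest matrix of kernel values (linear kernel)
m     = size(alpha, 2);
ntest = size(Xtest, 1);

% One row per class, one column per test vector (layout of Table 1-8).
Rawres = zeros(m, ntest);
for c = 1:m
    Rawres(c, :) = (alpha(:, c) .* Ynew(:, c))' * K + b(c);
end
```

For this data the first test point gets a positive value for class 1 and a negative value for class 2, and the second test point the reverse.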
The distances can be either positive or negative. The sign reflects the
position of the point relative to the discrimination plane (class cli or not
class cli). After this, the following procedure is applied:
1. If all distances for the point Xtesti are negative: Xtesti belongs to the
class with the smallest distance (in absolute value).
2. If there are two or more positive distances: Xtesti belongs to the class
with the largest distance (in absolute value).
Figure 2-1 illustrates the situation. If Xtesti is located in one of the
problematic areas, the conflict must be resolved. If more than three classes
are used, the problematic areas overlap in a more complicated way. However,
conflicts are resolved in the same way.
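Both rules, together with the unambiguous case of exactly one positive distance, amount to choosing the class with the largest signed distance. A minimal sketch of this, assuming Rawres has the layout of Table 1-8 and using invented values:

```matlab
% Hypothetical Rawres: 3 classes (rows), 3 test vectors (columns).
Rawres = [ 0.8 -0.3  0.4;
          -0.5 -0.1  0.9;
          -1.2 -0.7 -0.2];

% Column 1: exactly one positive distance        -> class 1
% Column 2: all negative, smallest |d| is 0.1    -> class 2 (rule 1)
% Column 3: two positive, largest is 0.9         -> class 2 (rule 2)
[~, idx] = max(Rawres, [], 1);
res1 = idx';               % one class index per test vector, same form as Y
```

This only mirrors the procedure described in the text; the actual implementation inside classify may be organized differently.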
2 How to use SVM for regression
SVM regression, like SVM multiclass classification, is done in three
steps:
1. Loading and normalizing data (function normsv)
2. Training of SVM (function svrm)
3. Obtaining the results (function regress)
The process of SVM regression is very similar to SVM classification. However,
there are some important differences.
2.1 Loading and normalizing data
This step is exactly the same as for SVM multiclass classification. Please see
section 1.1.
Figure 2-1 SVM three-class classification, the conflicts solving (the figure marks the problematic areas)
2.2 Training of SVM for regression
Training of SVM for regression is performed by function svrm (Table 2-1). This
routine performs a similar function to svcm, but there are some differences.
Svrm is based on the algorithm described in [Smola 98].
The matrix of inputs X should be in the form described in section 1.1. Y is a
vector of real numbers, where Yi = f(Xi) (Xi is an input vector). Information
about the available kernel functions and their parameters can be found in
section 3.1.
Usage:
    [b, beta,H] = svrm(X,Y,kernel,p1,C,loss,e,filename,h)
Parameters:
    X           Training inputs (normalized) - matrix
    Y           Training targets - vector of real numbers(!)
    kernel      type of kernel function, available: 'linear', 'poly', 'rbf',
                default 'linear', see 3.
    p1          parameter of the kernel function, default 1
    C           penalty term, default Inf
    loss        type of loss function, 'ei' ε-insensitive, 'quad' quadratic,
                default 'ei'
    e           insensitivity, default 0.0
    filename    name of the log file, default 'regresslog.txt'. Messages
                produced during the learning are written here.
    h           If possible, this raw Hessian matrix (see section 3.1) should
                be used. If not, a new one is calculated. Passing h avoids
                calculating it repeatedly. It is usually used for calculations
                on the same training set with different parameters - but not
                with different kernel parameters!
Returned values:
    b           bias term - scalar
    beta        vector of differences of Lagrange Multipliers
    H           Not normalized and not adjusted Hessian matrix. If h was
                among the input parameters, H == h. If not, H was calculated
                during the run of svrm.
Table 2-1 Parameters and returned values of function svrm
Parameters e, C and loss are strongly interconnected. Parameter loss
defines the loss function, which determines how the SVM's error is
penalized. Two types of loss functions are implemented (Figure 2-2).
Figure 2-2 Implemented loss functions (two panels, penalty plotted against error). ε-insensitive loss function: errors in the area between -e and e are ignored, and the angle of the penalty line is λ=f(C). Quadratic loss function: the shape of the curve is influenced by C.
If the ε-insensitive loss function is used, errors between -e and e are ignored.
If C = Inf is set, the regression curve will follow the training data inside the
margin determined by e (Figure 2-3).
C is a number from the range (0, Inf]. If C < Inf is set, the constraints are
relaxed and the regression curve need not remain in the margin determined by e.
In some cases (for example, in the case of defective data) this improves
generalization. If a kernel with finite VC dimension is used, it may be
necessary to relax the constraints, because it might not be possible to
calculate a regression curve that follows the training data within the margin
2e. Parameter C determines the angle λ of the loss function (Figure 2-2). For
C = Inf, λ = 90˚ (no error outside of +e, -e is tolerated) and for C = 0,
λ = 0˚ (every error is accepted).
Figure 2-3 SVM regression, 'ei' loss function, e = 0.2, C = Inf (plot of the training data and the SVM output, with the margin 2e marked)
The quadratic loss function penalizes every error. It is recommended to use
this loss function. With it, memory requirements are four times lower than
with the ε-insensitive loss function. Parameter e is not used for this
function and can be set to an arbitrary value (0.0 is preferred).
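The two loss functions can be compared numerically with a small sketch. It uses the usual textbook definitions in the spirit of [Smola 98]; scaling both penalties by C, and the particular values of e, C and err, are assumptions for the illustration and need not match the constants used inside svrm.

```matlab
% Hypothetical values; the penalties follow the usual textbook definitions.
e   = 0.2;                               % insensitivity (half-width of the tube)
C   = 10;                                % finite penalty factor (assumed scaling)
err = [-0.5 -0.2 -0.1 0 0.1 0.2 0.5];    % differences between SVM output and target

% epsilon-insensitive loss: errors with |err| <= e cost nothing.
penalty_ei = C * max(0, abs(err) - e);

% Quadratic loss: every non-zero error is penalized; e is not used.
penalty_quad = C * err .^ 2;
```

Errors of magnitude 0.1 or 0.2 are ignored by the ε-insensitive loss but still carry a small penalty under the quadratic loss.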
Figure 2-4 SVM regression, 'quad' loss function, C = 10 (plot of the training data and the SVM output)
Figure 2-5 SVM regression, 'quad' loss function, C = 0.5 (plot of the training data and the SVM output)
Figures 2-4 and 2-5 illustrate the influence of parameter C. It is
recommended to use C from the range (0, 100]. In this range the SVM is most
sensitive to changes in the value of C. Larger values of C influence the SVM in
much the same way as C = Inf does.
The status of the training process can be checked. A short message about
each performed operation is written to the file specified by parameter
filename. During the training process the file may only be viewed; editing it
would cause a sharing violation and crash the training.
Returned values b (scalar) and beta (column vector) represent the output of the
SVM according to [Smola 98].
2.3 Obtaining results using regress
Results of SVM regression can be obtained using function regress (Table
2-2).
Usage:
    [Ytest,k] = regress(Xtrain,Xtest,b,beta,kernel,p1,k)
Parameters:
    Xtrain    Training inputs - matrix (the one that was used for training)
    Xtest     Testing inputs - matrix (normalized the same way as Xtrain)
    kernel    kernel function (must be the same as for training), see 3
    p1        parameter of the kernel, default 1, but must be the same as
              for training
    beta      Difference of Lagrange Multipliers, output from svrm
    b         bias term, output from svrm
    k         matrix of dot products of Xtrain and Xtest. This matrix should
              be used if it is available from a previous calculation. If it
              is not, a new one is calculated during the run of this
              function and returned on the output. For more details, see
              also section 3.2.
Returned values:
    Ytest     testing output, one-column vector
    k         If k was available as an input parameter, the same matrix is
              returned on the output. Otherwise, k is the newly calculated
              matrix of dot products of Xtrain and Xtest.
Table 2-2 Parameters and returned values of function regress
The meaning of the parameters was discussed in the previous sections. Ytest is
a real-valued one-column vector of results. Parameter k is described in section
3.2.
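For orientation, the usual support vector regression output from [Smola 98] is f(x) = sum_i beta_i * K(x_i, x) + b. The sketch below evaluates this expression with a linear kernel on invented numbers; that regress computes Ytest in exactly this way is an assumption, and beta and b here are not the result of a real training run.

```matlab
% Hypothetical one-dimensional example with a linear kernel.
Xtrain = [-1; 0; 1];
Xtest  = [-0.5; 0.5];
beta   = [-0.5; 0; 0.5];     % made-up differences of Lagrange multipliers
b      = 0.1;                % made-up bias term

% Matrix of dot products, ntr x nte (layout of Table 3-6).
k = Xtrain * Xtest';

% One output value per test vector.
Ytest = k' * beta + b;
```

Because k depends only on Xtrain, Xtest and the kernel, it can be kept and reused when only beta and b change, which is the purpose of the optional input parameter k.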
3 Important SVM subfunctions
3.1 Function product
This function calculates different types of dot products of vectors, depending
on the kernel function (Table 3-1).
Usage:
    h = product(X,kernel,p1)
Parameters:
    X         matrix of inputs
    kernel    type of kernel function
              'linear' = usual dot product
              'poly' = p1 is degree of polynomial
              'rbf' = p1 is width of rbfs (sigma)
Returned value:
    h         matrix of dot products of input vectors
Table 3-1 Parameters and returned values of function product
This function is fundamental to the SVM. It contains the different kernel
functions (three in this version) used for calculating the dot product of
vectors in the transformed space. The kernel function used determines the
properties and performance of the SVM.
Matrix X has to be normalized for the kernel function used (see 1.1). Every
row of X is considered as one input vector (Table 3-2).
     0.4    -0.4    ...     0.1        X1
    -0.5     0.2    ...     0.5        X2
     ...     ...    ...     ...        ...
     0.3     0.1    ...    -0.1        Xn
Table 3-2 Format of the matrix X
Matrix h has the form described in Table 3-3. Please note that matrix h is
symmetric.
                size n (number of vectors X)
    size n      1          X1⊗X2      ...    X1⊗Xn
                X2⊗X1      1          ...    X2⊗Xn
                ...        ...        ...    ...
                Xn⊗X1      Xn⊗X2      ...    1
Table 3-3 Format of the matrix h
Operation ⊗ in Table 3-3 represents the dot product of two vectors in the
transformed space. Operation ⊗ is determined by the kernel function used.
Three different kernel functions are implemented in this version (Table 3-4):
    Parameter    Kernel type     Description
    'linear'     linear          X1 ⊗ X2 = X1 · X2, parameter p1 is not used
    'poly'       polynomial      X1 ⊗ X2 = (X1 · X2 + 1)^p1, p1 is the degree of
                                 the polynomial
    'rbf'        Gaussian RBF    X1 ⊗ X2 = exp(-((X1 - X2) · (X1 - X2)) / (2*p1^2)),
                                 p1 is the width of the RBFs
Table 3-4 Available kernel types
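The three kernel formulas can be evaluated for a whole matrix X at once. The following is a minimal sketch of the formulas only, on made-up data; it is not the source of function product, which may apply additional scaling to the result.

```matlab
% Hypothetical normalized inputs, one vector per row (as in Table 3-2).
X  = [ 0.4 -0.4;
      -0.5  0.2;
       0.3  0.1];
p1 = 2;                      % degree for 'poly', width for 'rbf'

G = X * X';                  % ordinary dot products of all pairs of rows

h_linear = G;                            % 'linear'
h_poly   = (G + 1) .^ p1;                % 'poly'

% Squared distances |Xi - Xj|^2 = Xi.Xi + Xj.Xj - 2*Xi.Xj
sq    = diag(G);
D2    = bsxfun(@plus, sq, sq') - 2 * G;
h_rbf = exp(-D2 / (2 * p1^2));           % 'rbf'
```

All three matrices are n x n and symmetric. For the RBF kernel the diagonal is exactly 1, because every vector has zero distance to itself.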
3.2 Function product_res
Similar to function product, this function also calculates different types of
dot products of vectors. It is used for calculating the results (Table 3-5).
Usage:
    k = product_res(Xtrain,Xtest,kernel,p1)
Parameters:
    Xtrain    Training inputs (normalized)
    Xtest     Testing inputs (normalized the same way as Xtrain)
    kernel    type of kernel function (see 3)
    p1        parameter of the kernel (see 3)
Returned value:
    k         matrix of dot products
Table 3-5 Parameters and returned values of function product_res
The format of parameters Xtrain and Xtest was described in the previous
sections. The returned value k is a matrix whose format is described in
Table 3-6.
                  size nte (number of vectors Xtest)
    size ntr      Xtrain1⊗Xtest1      Xtrain1⊗Xtest2      ...    Xtrain1⊗Xtestnte
    (number of    Xtrain2⊗Xtest1      Xtrain2⊗Xtest2      ...    Xtrain2⊗Xtestnte
    vectors       ...                 ...                 ...    ...
    Xtrain)       Xtrainntr⊗Xtest1    Xtrainntr⊗Xtest2    ...    Xtrainntr⊗Xtestnte
Table 3-6 Format of the matrix k
The meaning of the symbol ⊗ is explained in section 3.
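The layout of k can be made concrete with a small sketch that fills the matrix element by element for the Gaussian RBF kernel. It applies the RBF formula from section 3.1 to made-up data and is not the code of product_res.

```matlab
% Hypothetical normalized data, one vector per row.
Xtrain = [ 0.4 -0.4;
          -0.5  0.2;
           0.3  0.1];
Xtest  = [ 0.1  0.0;
          -0.2  0.5];
p1 = 0.5;                            % width of the RBF

ntr = size(Xtrain, 1);
nte = size(Xtest, 1);
k   = zeros(ntr, nte);               % rows: training vectors, columns: test vectors
for i = 1:ntr
    for j = 1:nte
        d = Xtrain(i, :) - Xtest(j, :);
        k(i, j) = exp(-(d * d') / (2 * p1^2));
    end
end
```

Unlike the matrix h of section 3.1, k is in general neither square nor symmetric, since its rows and columns refer to different sets of vectors.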
4 References
Burges, Ch., 1999, Tutorial on Support Vector Machines for Pattern
Recognition. This paper can be downloaded at
http://svm.research.bell-labs.com/SVMdoc.html
Smola, A., Schölkopf, B., 1998, A Tutorial on Support Vector Regression,
NeuroCOLT2 Technical Report Series, October 1998