How to use the D2K Support Vector Machine (SVM) MATLAB toolbox

This MATLAB toolbox consists of several m-files designed to perform two tasks:

1. Multiclass classification
2. Regression

1 How to do an SVM multiclass classification

SVM multiclass classification is done in three steps:

1. Loading and normalizing the data (function normsv)
2. Training the SVM (function multiclass)
3. Obtaining the results (function classify)

1.1 Loading and normalizing data

After loading the data into MATLAB, it is necessary to normalize them into the form expected by the chosen kernel function. Function normsv can be used for this purpose (Table 1-1).

Usage:
  [X, A, B] = normsv(X, kernel, isotropic)

Parameters:
  X          data to be normalized
  kernel     kernel type ('linear', 'poly', 'rbf')
  isotropic  isotropic (1) or anisotropic (0, default) scaling

Returned values:
  X          normalized data
  A, B       normalization coefficients; these can be used to normalize
             other data in the same way as X:

               n = size(X1,2);
               for i = 1:n
                 X1(:,i) = A(i)*X1(:,i) + B(i);
               end

Table 1-1 Parameters and returned values of function normsv

X is a matrix; every row represents one training input (Table 1-2):

  X1    12.4  23.4  ...  26.1
  X2    34.5  15.2  ...  23.5
  ...    ...   ...  ...   ...
  Xn    16.3  13.1  ...   9.1

Table 1-2 Format of the matrix X (input vectors)

The variable kernel specifies the type of kernel function; the data will be normalized specifically for this kernel. Three types of kernel functions are available in this version: linear ('linear'), polynomial ('poly') and Gaussian RBF ('rbf'). A description of these kernel functions can be found in section 3.

The parameter isotropic determines the type of scaling. For the three kernels mentioned above, the data are normalized to the range [-1, 1]. With isotropic scaling, every column is normalized with the same coefficients. Anisotropic scaling, on the other hand, uses different coefficients for every column, which means that after normalization each component of the input vector carries the same importance.

1.2 Training of SVM for multiclass classification

Function multiclass performs the training of an SVM for multiclass classification (Table 1-3):

Usage:
  [b, alpha, Ynew, cl] = multiclass(X, Y, kernel, p1, C, filename)

Parameters:
  X         training inputs (normalized) - matrix
  Y         training targets - vector
  kernel    type of kernel function, default 'linear', see section 3
  p1        parameter of the kernel function, default 1
  C         non-separable case, default Inf
  filename  name of the log file, default 'multiclasslog.txt'; messages
            are written here during the training

Returned values:
  alpha     matrix of Lagrange multipliers
  b         vector of bias terms
  Ynew      modified training outputs
  cl        number of classes

Table 1-3 Parameters and returned values of function multiclass

The training inputs should be normalized and in the form described in section 1.1. The vector of training targets consists of one column. The values Y1 to Yn are integers from the range [1, cl], where cl is the number of classes; Yi determines the class to which the vector Xi belongs.

The parameter C is a real number from the range (0, Inf]. If it is not possible to separate the classes using the chosen kernel function, or if a certain error should be accepted in order to improve generalization, the constraints have to be relaxed. Setting C < Inf relaxes the constraints, and a certain number of misclassifications will be accepted.

It is possible to check the status of the training process: a short message about each important operation is written to the file specified by the parameter filename. During the training process the file may only be viewed - editing it would cause a sharing violation and crash the training.
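To make the first two steps concrete, here is a minimal sketch of normalizing the data and training the classifier. Only the function signatures above are taken from the toolbox; the raw matrices X and Y and the kernel settings are assumptions chosen for illustration:

  % X: n-by-d matrix of raw inputs (one input per row)
  % Y: n-by-1 vector of class labels from 1 to cl
  [X, A, B] = normsv(X, 'poly', 0);       % anisotropic scaling for 'poly'
  [b, alpha, Ynew, cl] = multiclass(X, Y, 'poly', 3, 100, 'multiclasslog.txt');
  % keep A and B so that test data can later be normalized the same way
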
Function multiclass returns several values. Alpha is a matrix of Lagrange multipliers and has the form described in Table 1-4.

           cl1           cl2           ...  clm
  X1   alpha_X1_cl1  alpha_X1_cl2  ...  alpha_X1_clm
  X2   alpha_X2_cl1  alpha_X2_cl2  ...  alpha_X2_clm
  ...      ...           ...       ...      ...
  Xn   alpha_Xn_cl1  alpha_Xn_cl2  ...  alpha_Xn_clm

(n rows = number of input vectors, m columns = number of classes)

Table 1-4 Format of the matrix alpha (Lagrange multipliers)

Each column of the matrix alpha is a set of Lagrange multipliers that can be used to separate the points of class cli from all other points in the training set. If alpha_Xi_cli > 0, the vector Xi is called a support vector for this particular classification. (Vectors X whose alphas are equal to zero in all columns could be removed and the results would remain the same.) Please see [Burges 99].

The returned value b represents the bias terms. It is a column vector of length m; each element bi belongs to the corresponding set alpha_X_cli. In this version the bias terms bi are nonzero only when the linear kernel is used. With the polynomial and Gaussian RBF kernels, the separating planes in the transformed space always contain the origin [Burges 99], so in that case the bias term is always equal to zero.

Ynew is a transformed vector Y. The transformation is shown on an example with 4 classes (Table 1-5):

   Y        Ynew
        cl.1  cl.2  cl.3  cl.4
   4     -1    -1    -1     1
   2     -1     1    -1    -1
   1      1    -1    -1    -1
  ...    ...   ...   ...   ...
   3     -1    -1     1    -1

Table 1-5 Format of the matrix Ynew (modified output)

Multiclass classification is done by decomposing the problem into m two-class classifications, where m is the number of classes. Multiclass calculates Lagrange multipliers and bias terms for separating each class from all other classes. For each two-class classification, the function svcm is called. This function is based on the algorithm described in [Burges 99]. If only a two-class classification is needed, svcm can be called directly (Table 1-6).

Usage:
  [b, alpha, h] = svcm(X, Y, kernel, C, p1, filename, h)

Parameters and returned values:
  X, kernel, C,   the meaning of these parameters is the same as for
  p1, alpha, b    function multiclass
  filename        default is 'classlog.txt'
  Y               the vector of targets has a slightly different form
                  than the one used for multiclass: it is a one-column
                  vector of length n (number of training inputs), where
                  Yi is 1 when Xi belongs to class 1, and -1 when Xi
                  belongs to class 2 (or rather, does not belong to
                  class 1)
  h               if available, this raw Hessian (see section 3.1)
                  should be passed in; otherwise a new one is calculated.
                  Passing h avoids repeated calculation of the Hessian
                  and is usually used for calculations on the same
                  training set with different parameters - but not with
                  different kernel parameters!

Table 1-6 Parameters and returned values of function svcm
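A minimal sketch of such a direct two-class call, reusing the raw Hessian for a second run that changes only C (the label vector Y2 and the parameter values are assumptions for illustration):

  % Y2: n-by-1 vector with entries +1 (class 1) or -1 (class 2)
  [b, alpha, h] = svcm(X, Y2, 'rbf', Inf, 0.5, 'classlog.txt');
  % the Hessian h can be reused when only C changes (same kernel and p1):
  [b2, alpha2] = svcm(X, Y2, 'rbf', 10, 0.5, 'classlog.txt', h);
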
1.3 Obtaining the results

After the SVM has been trained, the results can be obtained using the function classify (Table 1-7).

Usage:
  [res, res1, Rawres] = classify(Xtest, Xtrain, Ynew, alpha, b, kernel, p1)

Parameters:
  Xtest   testing inputs - matrix, normalized the same way as Xtrain
          (see 1.1)
  Xtrain  training inputs - matrix
  Ynew    training outputs in the modified form, output from multiclass
          (a matrix!)
  alpha   matrix of Lagrange multipliers for each discrimination plane,
          every column is one set; output from multiclass
  b       vector of bias terms, output from multiclass
  kernel  type of kernel function (must be the same as for training!)
  p1      parameter of the kernel (see section 3), must be the same as
          for training

Returned values:
  Rawres  matrix of distances of the points Xtest from the discrimination
          planes for each class (in the transformed space)
  res     transformed Rawres - a matrix in which the points of Xtest are
          marked with 1 or 0 for each class
  res1    another format of res, the same as Y for function multiclass

Table 1-7 Parameters and returned values of function classify

This function calculates the distance of each point Xtesti from the discrimination plane in the transformed space. Please note that this is not a perpendicular distance in the space of Xtest but the distance in the transformed multidimensional space. Depending on the kernel function, the dimension of the transformed space is usually much greater than the dimension of the input vectors. According to the theory of SVM, the discrimination surface in the transformed space is a plane. The matrix of distances Rawres has the form described in Table 1-8.

  distance_Xtest1_cl1  distance_Xtest1_cl2  ...  distance_Xtest1_clm
  distance_Xtest2_cl1  distance_Xtest2_cl2  ...  distance_Xtest2_clm
         ...                  ...           ...         ...
  distance_Xtestn_cl1  distance_Xtestn_cl2  ...  distance_Xtestn_clm

(ntest rows = number of input vectors in Xtest, m columns = number of classes)

Table 1-8 Format of the matrix Rawres

The distances can be both positive and negative; the sign reflects the position of the point with respect to the discrimination plane (class cli or not class cli). In the usual case exactly one distance is positive and Xtesti belongs to that class. Otherwise, the following procedure is applied:

1. If all distances for the point Xtesti are negative, Xtesti belongs to the class with the smallest distance (in absolute value).
2. If there are two or more positive distances, Xtesti belongs to the class with the largest distance (in absolute value).

Figure 2-1 illustrates the situation. If Xtesti is located in one of the problematic areas, the conflict must be resolved as described above. If more than three classes are used, the problematic areas overlap in a more complicated way; however, the conflicts are resolved in the same way.

Figure 2-1 SVM three-class classification, resolving the conflicts (the figure shows the problematic areas between the three discrimination lines)
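Continuing the sketch from section 1.2, the test data are first normalized with the coefficients A and B returned by normsv and then classified. The matrix Xtest and the kernel settings are assumptions for illustration:

  % normalize the test inputs with the training coefficients (see 1.1)
  for i = 1:size(Xtest,2)
    Xtest(:,i) = A(i)*Xtest(:,i) + B(i);
  end
  % kernel and p1 must match the training call
  [res, res1, Rawres] = classify(Xtest, X, Ynew, alpha, b, 'poly', 3);
  % res1(i) is the predicted class label (an integer from 1 to cl) of Xtest(i,:)
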
2 How to use SVM for regression

SVM regression, like SVM multiclass classification, is done in three steps:

1. Loading and normalizing the data (function normsv)
2. Training the SVM (function svrm)
3. Obtaining the results (function regress)

The process of SVM regression is very similar to SVM classification; however, there are some important differences.

2.1 Loading and normalizing data

This step is exactly the same as for SVM multiclass classification; please see section 1.1.

2.2 Training of SVM for regression

Training of an SVM for regression is performed by the function svrm (Table 2-1). This routine performs a similar function to svcm, but there are some differences. Svrm is based on the algorithm described in [Smola 98]. The matrix of inputs X should be in the form described in section 1.1. Y is a vector of real numbers, where Yi = f(Xi) (Xi is an input vector). Information about the available kernel functions and their parameters can be found in section 3.1.

Usage:
  [b, beta, H] = svrm(X, Y, kernel, p1, C, loss, e, filename, h)

Parameters:
  X         training inputs (normalized) - matrix
  Y         training targets - vector of real numbers(!)
  kernel    type of kernel function, available: 'linear', 'poly', 'rbf',
            default 'linear', see section 3
  p1        parameter of the kernel function, default 1
  C         penalty term, default Inf
  loss      type of loss function: 'ei' (ε-insensitive) or 'quad'
            (quadratic), default 'ei'
  e         insensitivity, default 0.0
  filename  name of the log file, default 'regresslog.txt'; messages are
            written here during the learning
  h         if available, this raw Hessian matrix (see section 3.1)
            should be passed in; otherwise a new one is calculated.
            Passing h avoids repeated calculation of the Hessian and is
            usually used for calculations on the same training set with
            different parameters - but not with different kernel
            parameters!

Returned values:
  b         bias term - scalar
  beta      vector of differences of Lagrange multipliers
  H         the raw (not normalized and not adjusted) Hessian matrix. If
            h was given among the input parameters, H == h; otherwise H
            was calculated during the run of svrm.

Table 2-1 Parameters and returned values of function svrm

The parameters e, C and loss are strongly interconnected. The parameter loss defines the loss function, which determines how the SVM's error will be penalized. Two types of loss functions are implemented (Figure 2-2): the ε-insensitive loss function, which ignores errors between -e and e and whose slope λ outside this interval is a function of C, and the quadratic loss function, whose curve shape is influenced by C.

Figure 2-2 Implemented loss functions (ε-insensitive on the left, quadratic on the right; both plot penalty against error)

If the ε-insensitive loss function is used, errors between -e and e are ignored. If C = Inf is set, the regression curve will follow the training data inside the margin determined by e (Figure 2-3). C is a number from the range (0, Inf]. If C < Inf is set, the constraints are relaxed and the regression curve need not remain inside the margin determined by e. In some cases (for example, defective data) this improves generality. If a kernel whose VC dimension is not infinite is used, it may be necessary to relax the constraints, because it might not be possible to calculate a regression curve that follows the training data within the margin 2e. The parameter C determines the angle λ of the loss function (Figure 2-2): for C = Inf, λ = 90° (no error outside [-e, e] is tolerated), and for C = 0, λ = 0° (every error is accepted).

Figure 2-3 SVM regression, 'ei' loss function, e = 0.2, C = Inf (the figure shows the training data and the SVM curve inside the margin 2e)

The quadratic loss function penalizes every error. It is recommended to use this loss function: with it, the memory requirements are four times smaller than with the ε-insensitive loss function. The parameter e is not used with this function and can be set to an arbitrary value (0.0 is preferred).

Figure 2-4 SVM regression, 'quad' loss function, C = 10 (training data and the SVM curve)

Figure 2-5 SVM regression, 'quad' loss function, C = 0.5 (training data and the SVM curve)

Figures 2-4 and 2-5 illustrate the influence of the parameter C. It is recommended to use C from the range (0, 100]: in this range the SVM is most sensitive to changes in the value of C, while larger values of C influence the SVM in much the same way as C = Inf does.

It is possible to check the status of the training process: a short message about each performed operation is written to the file specified by the parameter filename. During the training process the file may only be viewed - editing it would cause a sharing violation and crash the training.

The returned values b (scalar) and beta (column vector) represent the output of the SVM according to [Smola 98].
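A minimal end-to-end sketch of SVM regression on a toy one-dimensional problem; the sine data and the parameter values are assumptions for illustration (the inputs are constructed directly inside [-1, 1], so the normsv step is skipped here), and the results are read out with the function regress, which is described in section 2.3:

  % toy problem: 50 samples of a sine curve, inputs already inside [-1, 1]
  X = linspace(-1, 1, 50)';
  Y = sin(2*pi*X);
  % quadratic loss, so e is unused and set to 0.0
  [b, beta, H] = svrm(X, Y, 'rbf', 0.2, 10, 'quad', 0.0, 'regresslog.txt');
  % evaluate the regression on new inputs (see section 2.3)
  Xtest = linspace(-1, 1, 200)';
  [Ytest, k] = regress(X, Xtest, b, beta, 'rbf', 0.2);
  % k could be reused for another model (b2, beta2) trained on the same X
  % with the same kernel and p1:
  % [Ytest2, k] = regress(X, Xtest, b2, beta2, 'rbf', 0.2, k);
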
2.3 Obtaining results using regress

The results of SVM regression can be obtained using the function regress (Table 2-2).

Usage:
  [Ytest, k] = regress(Xtrain, Xtest, b, beta, kernel, p1, k)

Parameters:
  Xtrain  training inputs - the matrix that was used for training
  Xtest   testing inputs - matrix (normalized the same way as Xtrain)
  b       bias term, output from svrm
  beta    difference of Lagrange multipliers, output from svrm
  kernel  kernel function (must be the same as for training), see
          section 3
  p1      parameter of the kernel, default 1, but must be the same as
          for training
  k       matrix of dot products of Xtrain and Xtest. This matrix should
          be passed in if it is available from a previous calculation;
          if not, a new one is calculated during the run of this
          function and returned on the output. For more details see
          section 3.2.

Returned values:
  Ytest   testing output, a one-column vector
  k       if k was available as an input parameter, the same matrix is
          returned on the output; otherwise k is the newly calculated
          matrix of dot products of Xtrain and Xtest

Table 2-2 Parameters and returned values of function regress

The meaning of the parameters was discussed in the previous sections. Ytest is a real-valued one-column vector of results. The parameter k is described in section 3.2.

3 Important SVM subfunctions

3.1 Function product

This function calculates different types of dot products of vectors, depending on the kernel function (Table 3-1).

Usage:
  h = product(X, kernel, p1)

Parameters:
  X       matrix of inputs
  kernel  type of kernel function:
          'linear' - usual dot product
          'poly'   - p1 is the degree of the polynomial
          'rbf'    - p1 is the width of the RBFs (sigma)

Returned value:
  h       matrix of dot products of the input vectors

Table 3-1 Parameters and returned values of function product

This function is fundamental for the SVM. It contains the different kernel functions (three in this version) used for calculating the dot products of vectors in the transformed space. The kernel function used determines the properties and performance of the SVM. The matrix X has to be normalized for the kernel function used (see 1.1). Every row of X is considered one input vector (Table 3-2):

  X1     0.4  -0.4  ...   0.1
  X2    -0.5   0.2  ...   0.5
  ...    ...   ...  ...   ...
  Xn     0.3   0.1  ...  -0.1

Table 3-2 Format of the matrix X

The matrix h has the form described in Table 3-3. Please note that the matrix h is symmetric.

          X1     X2    ...   Xn
  X1       1    X1∘X2  ...  X1∘Xn
  X2    X2∘X1     1    ...  X2∘Xn
  ...    ...     ...   ...   ...
  Xn    Xn∘X1   Xn∘X2  ...    1

(n rows and n columns, n = number of vectors X)

Table 3-3 Format of the matrix h

The operation ∘ in Table 3-3 represents the dot product of two vectors in the transformed space; the operation is determined by the kernel function used. There are three different kernel functions implemented in this version (Table 3-4):

  Parameter  Kernel type   Description
  'linear'   linear        X1∘X2 = X1·X2; the parameter p1 is not used
  'poly'     polynomial    X1∘X2 = (X1·X2 + 1)^p1; p1 is the degree of
                           the polynomial
  'rbf'      Gaussian RBF  X1∘X2 = exp(-|X1 - X2|^2 / (2*p1^2)); p1 is
                           the width of the RBFs

Table 3-4 Available kernel types
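A small sketch relating the output of product to the formulas in Table 3-4; the data values are an assumption for illustration:

  % two normalized input vectors as the rows of X
  X = [0.4 -0.4; -0.5 0.2];
  h = product(X, 'rbf', 1.0);
  % for the 'rbf' kernel, h(1,2) should equal
  % exp(-norm(X(1,:) - X(2,:))^2 / (2*1.0^2))
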
3.2 Function product_res

Similar to the function product, this function also calculates different types of dot products of vectors. It is used for the calculation of the results (Table 3-5).

Usage:
  k = product_res(Xtrain, Xtest, kernel, p1)

Parameters:
  Xtrain  training inputs (normalized)
  Xtest   testing inputs (normalized the same way as Xtrain)
  kernel  type of kernel function (see section 3)
  p1      parameter of the kernel (see section 3)

Returned value:
  k       matrix of dot products

Table 3-5 Parameters and returned values of function product_res

The format of the parameters Xtrain and Xtest was described in the previous sections. The returned value k is a matrix and is described in Table 3-6.

  Xtrain1∘Xtest1    Xtrain1∘Xtest2    ...  Xtrain1∘Xtestnte
  Xtrain2∘Xtest1    Xtrain2∘Xtest2    ...  Xtrain2∘Xtestnte
       ...               ...          ...        ...
  Xtrainntr∘Xtest1  Xtrainntr∘Xtest2  ...  Xtrainntr∘Xtestnte

(ntr rows = number of vectors in Xtrain, nte columns = number of vectors in Xtest)

Table 3-6 Format of the matrix k

The meaning of the symbol ∘ is explained in section 3.1.

4 References

[Burges 99]  Burges, C. J. C., 1999. A Tutorial on Support Vector Machines
             for Pattern Recognition. Available at
             http://svm.research.bell-labs.com/SVMdoc.html
[Smola 98]   Smola, A., Schölkopf, B., 1998. A Tutorial on Support Vector
             Regression. NeuroCOLT2 Technical Report Series, October 1998.