D2C-SVM Matlab (v 1.0) Manual
By D. Lai, Copyright 2006.
Department of Electrical and Electronic Engineering, University of Melbourne, Parkville Campus, Melbourne, Australia.
Email: d.lai@ee.unimelb.edu.au

Figure 1: Easy-to-use GUI interface.

Introduction

Welcome to the Matlab GUI interface written for easy use of the D2C-SVM software for training Support Vector Machines (SVMs).

What is SVM? SVM is a function estimation technique based on Statistical Learning Theory, introduced by V. Vapnik in the early 1990s. The standard SVM is a binary classifier which has found widespread use in pattern recognition problems such as image and audio recognition, handwriting recognition, medicine, science, finance and so on. For an introduction to SVM theory, one can try the following references:

a) Cristianini, N. and Shawe-Taylor, J., An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, New York, 2000.
b) Vapnik, V. N., The Nature of Statistical Learning Theory, Springer, New York, 2000.

D2C-SVM is yet another SVM training package; it implements a heuristic training algorithm for improving the training efficiency of the SVM. This short manual is intended to help the user install and run the Matlab GUI interface to the D2C-SVM training package, and to introduce the user to the SVM as an easy-to-use pattern recognition tool.

Installation and System Requirements

D2CMatlab v 1.0 comes in a single zip file named D2CMatlab.zip, which can be downloaded from http://www.ee.unimelb.edu.au/people/dlai/ This first version is intended to be used with the D2C-SVM classifier package, which is included in the zip file. The user needs Matlab v 6.01 or above installed in order to run the GUI file. Generally, recent computers running a Pentium IV with more than 256 MB of RAM will have no problems running this software.
Earlier PCs should not suffer either; the training of the SVM simply tends to become slower as your data size increases.

Input Files

The program (v1.0) currently accepts either a training or a test file in sparse format. A training example in the file has the following format:

<label> 1:<attribute 1> 2:<attribute 2> 3:<attribute 3> ...

e.g.
+1 1:0.3421 3:2.3424 5:-1.2342
-1 4:23.31

If your data is in matrix format, it can easily be read in using Matlab's load function; if the data is in an Excel file, it can be read using xlsread. The data file must then be converted to sparse format in order to be properly used by the program. An Excel-to-Sparse file converter is provided in the D2CMatlab tool directory and can be used for fast and easy conversion of data files not in sparse format.

SVM Parameters

Figure 2: SVM Parameter Selection Panel

Experienced SVM users will find that the GUI interface allows easy input and changing of SVM parameters, so that training can be done painlessly. This is a big advantage compared to the DOS command line interface found in many other SVM packages. The user can change C, the algorithm tolerance level, kernel parameters and other options. For new SVM users, the following is a simple rundown of the options available.

a) C: The SVM penalty parameter controls the tradeoff between overfitting and underfitting the classifier on the training set. If C is set large, the number of training errors will be reduced, but the result may be a classifier that performs poorly on test data (overfitted). If C is too small, the number of training errors will increase and the classifier performs poorly on the training set, but does NOT necessarily perform well on the test data either. One could end up with a really bad classifier which does not classify anything well. Generally, set C=10 and check the classification accuracies, then adjust the parameter to obtain a better model.
C Cycle: Selecting this option allows easy training of the SVM model using increasing values of C. The C Step determines the multiplicative factor by which C is increased. Press 'Run' again to automatically train different SVMs with increasing C values.

b) Tol: This is the tolerance level to which the classifier is optimized, intended to allow finite termination of the algorithm. The default is 0.001, which seems to work well for practical purposes. Smaller tolerances tend to increase training time while giving better final accuracies, and vice versa.

c) Kernel Parameters: The three standard popular kernels are included, namely the linear, polynomial and RBF (Gaussian) kernels. Future versions of this software will include other kernels, e.g. the sigmoid kernel and so on.

d) Model: Describes the training algorithm to be used. Currently the D2C Adaptive Heuristic is the default, which seems to work well for most data sets. The standard maximal violator algorithm is the one used in other SVM packages, the Naïve algorithm is generally slower, and the ASVM models are semiparametric versions of the SVM classifier.

Status Window

Figure 3: Status Window

The status window provides information on the simulation status. The window is color coded: GREEN means the simulation is ready for user input (e.g. before starting, or after a simulation has finished), while RED means the simulation is running and the user cannot input or change any further variables.

a) Run: Executes the simulation with the selected choice of SVM parameters.
b) Reset: Resets all variables to default values, clears buffers and clears the graph axes.
c) Plot Graph: Plots the various graphs depending on the analysis selections made.
d) Save Graph: Opens the graph in a Matlab figure to allow saving, adjusting of axes and exporting to other formats.
e) Exit: Closes the simulation and returns to the Matlab command line.
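For readers who want to pre-process data or sanity-check results outside the GUI, the sparse input format described under Input Files and the three kernels listed above can be sketched in a few lines. The snippet below is an illustrative Python sketch, not part of the D2C-SVM package: the function names and the default kernel parameters (degree, coef0, gamma) are assumptions chosen for demonstration, not the package's internal settings.

```python
import math

def read_sparse(lines):
    """Parse sparse-format lines ("<label> idx:value ...") into
    (label, {index: value}) pairs, as described under Input Files."""
    examples = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        label = int(parts[0])                 # +1 or -1 class label
        features = {}
        for token in parts[1:]:
            idx, val = token.split(":")
            features[int(idx)] = float(val)   # 1-based attribute index
        examples.append((label, features))
    return examples

def dot(x, y):
    """Dot product of two sparse feature dicts; missing indices count as 0."""
    return sum(v * y.get(i, 0.0) for i, v in x.items())

def linear_kernel(x, y):
    return dot(x, y)

def polynomial_kernel(x, y, degree=3, coef0=1.0):
    # Illustrative defaults; choose degree/coef0 to suit your data.
    return (dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    sq = sum((x.get(i, 0.0) - y.get(i, 0.0)) ** 2 for i in set(x) | set(y))
    return math.exp(-gamma * sq)

data = read_sparse(["+1 1:0.3421 3:2.3424 5:-1.2342", "-1 4:23.31"])
```

Note how the sparse representation lets the kernels skip the zero attributes entirely, which is exactly why the program requires input files in this format.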
Analysis Tools

Figure 4: Analysis Tools Panel

The current version of this software supports two types of analysis, namely accuracy analysis and graphical analysis. For accuracy analysis, the standard cross validation and leave-one-out accuracies have been implemented, while the graphical analysis tool set currently contains five different methods.

a) Cross Validation: Partitions the main data set into n subsets called folds. The user selects the number of folds to perform the experiment on. The average accuracy over the n folds is then reported along with the sensitivity and specificity ratios.

b) Leave One Out: This is the extreme case of cross validation, where the SVM is trained n times using n-1 data points for training and 1 data point for testing. In short, the average accuracy is obtained by leaving each point out of the main data set in turn and testing the classifier trained on the remaining data.

Sensitivity: A measure of how good the classifier is at classifying positive examples.
Specificity: A measure of how good the classifier is at classifying negative examples.

c) ROC Plot: Plots the receiver operating characteristic (ROC) curve of the classifier. A larger area under the ROC curve generally means a better-performing classifier.

d) Best Feature Selection: Automatically selects the set of best features from the training data for a particular SVM model. The feature numbers are written to an output file, "BestFeature.txt". Selection of the best features is based on a hill-climbing method using accuracies that can be determined by n-fold cross validation or leave-one-out errors.

e) 2D SVM Surface: Plots the SVM decision surface in input space for 2 features. This is a useful graphical option for viewing the performance of the classifier if you only want 2 features and the resulting separating boundary. It is not meant to be used to decide the best SVM parameters or classification accuracy for data sets with more than 2 features.
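The accuracy measures above can be made concrete with a short sketch. The Python snippet below is illustrative only: the round-robin fold assignment and the function names are assumptions for demonstration, not D2C-SVM internals. Note that leave-one-out is simply the fold split with k equal to the number of data points.

```python
def kfold_split(n, k):
    """Partition indices 0..n-1 into k folds (round robin).
    Each fold serves once as the test set; the rest is the training set.
    With k == n this degenerates to leave-one-out."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN), accuracy on the +1 class;
    specificity = TN / (TN + FP), accuracy on the -1 class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec
```

For example, predictions [1, -1, -1, 1] against true labels [1, 1, -1, -1] catch one of two positives and one of two negatives, giving sensitivity 0.5 and specificity 0.5.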
f) 3D SVM Surface: Adds a third feature to option e) and allows similar visualization. Unfortunately, dimensions 4 and higher are considerably harder to view graphically and hence are not included here.

g) SVM Posterior Probability: Plots a graph of the SVM outputs for the test data after calibration to posterior probabilities. A test example may have an SVM output of, say, 5.34, which by itself is only useful for assigning it to the +1 class. The conversion to a posterior probability denotes how sure we are that the example belongs to the +1 class. Values range from 0 to 1.

Figure 5: ROC Graph Analysis

Disclaimer: The author is not liable for any damages whatsoever that may occur from usage of this software. This software is copyrighted but may be distributed freely for use. Any modifications should be made only with written consent from the author. Please cite the following when using this software in your research work.

BibTeX citation:

@misc{D2CSVM,
  author = {Lai, D.},
  title  = {{D2C-SVM}: A heuristic algorithm for training {S}upport {V}ector {M}achines},
  year   = {2005},
  note   = {Software available at {\tt http://www.ee.unimelb.edu.au/people/dlai/}}
}

Send all email queries, bug reports, criticisms and opinions to daniel.thlai@hotmail.com
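As a closing illustration of the SVM Posterior Probability option described under Analysis Tools: a common way to calibrate raw SVM outputs to probabilities is Platt's sigmoid, although the manual does not state which calibration D2C-SVM uses internally. The Python sketch below shows the idea; the coefficients a and b are illustrative placeholders that would normally be fitted to held-out data.

```python
import math

def platt_posterior(svm_output, a=-1.0, b=0.0):
    """Map a raw SVM output f to a posterior probability of the +1 class
    via a Platt-style sigmoid: P(+1 | f) = 1 / (1 + exp(a*f + b)).
    The coefficients a and b here are illustrative only."""
    return 1.0 / (1.0 + math.exp(a * svm_output + b))
```

With these placeholder coefficients, an output of 0 maps to probability 0.5, and a large positive output such as 5.34 maps to a probability close to 1, matching the intuition in option g) above.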