Offline and Online SVM Performance Analysis

by Kathy F Chen

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, February 2007.

© Massachusetts Institute of Technology 2007. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, December 8, 2006
Certified by: Una-May O'Reilly, Principal Researcher, Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Theses

Abstract

To understand and evaluate the performance of a machine learning algorithm, the Support Vector Machine, this thesis compares the strengths and weaknesses of the offline and online SVM. The work includes performance comparisons of SVMLight and LaSVM, with results for training time, number of support vectors, kernel evaluations, and test accuracy. Experiments are run on multiple datasets to cover a wide range of input data and training problems. Overall, the online LaSVM trained in less time than SVMLight and returned comparable test accuracies. A general breakdown of the two algorithms and their computational effort is included for detailed analysis.

Thesis Supervisor: Una-May O'Reilly
Title: Principal Researcher

Acknowledgments

I would like to thank my thesis advisor, Una-May O'Reilly, for guiding me through the research process and for providing constant support throughout the unfolding of the mysterious SVM algorithms. I feel truly lucky to have an advisor like her, and I have learned tremendously from our weekly meetings and discussions.

I would also like to thank my friends and family for always being there for me when I felt discouraged. They have given me the strength to continue and the will to finish this important task.

This project was made possible by funding support from DARPA IPTO and the Architecture for Cognitive Information Processing (ACIP) program. Special thanks also go to Janice McMahon of USC/ISI and Chris Archer of Northrop Grumman.
Contents

1 Introduction
2 Background
  2.1 SVM Overview
    2.1.1 Kernel Functions
  2.2 SVMLight
  2.3 Online SVM
    2.3.1 Sequential Minimal Optimization
  2.4 LaSVM
    2.4.1 Overview
    2.4.2 Selection Strategy
    2.4.3 Process and Reprocess
    2.4.4 Termination Condition
3 SVM Experiments
  3.1 Experimental Setup
  3.2 MNIST Dataset
  3.3 Income Prediction
  3.4 Webpage Classification
  3.5 Face Detection
  3.6 CEARCH Entity Classification
4 Performance and Tradeoff
  4.1 Cross-sectional Dataset Comparison
  4.2 Procedure Count Analysis
  4.3 Sensitivity to Tolerance Parameter ε
5 Summary
  5.1 Future Work

List of Figures

2-1 SVM example
2-2 SVM with outliers
2-3 SVMLight high-level code flow
2-4 LaSVM pseudo code
3-1 Simulation map
3-2 MNIST training time
3-3 MNIST number of support vectors
3-4 MNIST test error
3-5 MNIST training time vs number of support vectors
3-6 MNIST training time vs cache size
3-7 Income training time comparison
3-8 Training time vs sample size
3-9 Income number of support vectors
3-10 Number of support vectors vs sample size
3-11 Income test accuracies
3-12 Test accuracy vs sample size
3-13 Income kernel evaluations
3-14 Kernel evaluations vs sample size
3-15 Webpage training time comparison
3-16 Training time vs sample size
3-17 Webpage number of support vectors
3-18 Number of support vectors vs sample size
3-19 Webpage test accuracies
3-20 Test accuracy vs sample size
3-21 Webpage kernel evaluations
3-22 Kernel evaluations vs sample size
3-23 Face detection training time comparison
3-24 Training time vs sample size
3-25 Face detection number of support vectors
3-26 Number of support vectors vs sample size
3-27 Face detection test accuracies
3-28 Test accuracy vs sample size
3-29 Face detection kernel evaluations
3-30 Kernel evaluations vs sample size
3-31 Test accuracy vs gamma
3-32 Test accuracy vs c
3-33 Test accuracy vs weight
3-34 CEARCH training time comparison
3-35 Training time vs sample size
3-36 CEARCH number of support vectors
3-37 Number of support vectors vs sample size
3-38 CEARCH test accuracies
3-39 Test accuracy vs sample size
4-1 Top 20 processes of SVMLight
4-2 Top 20 processes of LaSVM
4-3 Income: accuracy vs sample size (one-epoch LaSVM, varying tolerance)
4-4 Income: accuracy vs sample size (two-epoch LaSVM, varying tolerance)
4-5 Income: accuracy vs sample size (SVMLight, varying tolerance)

List of Tables

3.1 Dataset summary
3.2 MNIST dataset details
3.3 MNIST summary
3.4 Income dataset details
3.5 Income summary
3.6 Webpage dataset details
3.7 Webpage summary
3.8 Face detection dataset details
3.9 Face detection summary
3.10 CEARCH dataset details
3.11 CEARCH entity summary
4.1 SVMLight performance summary
4.2 LaSVM performance summary
4.3 Top-performing algorithms for each dataset
Chapter 1 Introduction

Support Vector Machines [9] are being applied extensively to classification problems in many applications, ranging from handwriting recognition [1] and text categorization [5] to face detection [6], and produce promising results. Developed by Vapnik, the SVM aims to maximize the classification margin of a hyperplane boundary determined by a subset of training points called support vectors.

In his paper [4], Joachims described the approach of SVMLight, a software implementation of the SVM algorithm in C. SVMLight is among the most widely used SVM classification tools and has been shown to solve large-scale datasets efficiently by decomposing the quadratic programming (QP) problem into smaller subproblems. Similar to SVMLight's approach, Platt proposed the Sequential Minimal Optimization (SMO) method in [11], which breaks the problem up into the smallest possible subproblem of two data points; each such subproblem can then be solved analytically without a QP solver. An online approach that incorporates the essence of both SMO and the incremental approach [7] is the LaSVM method, presented in [10]. The LaSVM method updates the SVM model by inserting and removing support vectors over the entire training process.

This thesis explores the performance of both offline and online SVM, using the algorithms provided in SVMLight and LaSVM. Chapter 2 describes their background details; Chapter 3 contains the results of multiple experiments on a wide variety of datasets; Chapter 4 covers a cross-sectional dataset study of both algorithms and their procedure details.

Chapter 2 Background

The first section gives an overview of the SVM algorithm and discusses the use of kernel functions. The second section describes the implementation details of SVMLight by Thorsten Joachims and explains the list of parameters that can be optimized based on the scope of a problem. The third section gives an overview of the online SVM and discusses some of the key differences from the offline approach.

2.1 SVM Overview

Let the training data be

$(x_1, y_1), \ldots, (x_l, y_l), \quad x_i \in \mathbb{R}^n, \; y_i \in \{+1, -1\}.$   (2.1)

The Support Vector Machine learns to classify the data points by finding a set of Lagrange multipliers $\alpha_1, \alpha_2, \ldots, \alpha_l$ for the n-dimensional training points in a set of size l such that there is a maximum margin between the two classes marked by y = +1 and y = -1. The training points with non-zero α values are called the support vectors; they determine the optimal hyperplane [2][4][9]. A simple example in two dimensions can be seen in Figure 2-1: two classes with features x1 and x2 are classified by a hyperplane (solid line) whose margin, the largest possible, is determined by the support vectors on the dashed lines.

[Figure 2-1: SVM example]

The decision function based on the trained model is

$f(x) = \operatorname{sign}(w \cdot x + b),$   (2.2)

where $w \cdot x$ is a linear combination over the support vectors. Training data that are not support vectors have zero α values and have no effect on the decision function.
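To make Eq. (2.2) concrete, the sketch below evaluates the decision function for a trained linear model, summing only over the support vectors, since every other training point has α = 0 and contributes nothing. The dense-array layout and the names are illustrative assumptions, not the data structures actually used by SVMLight or LaSVM.

```c
/* Illustrative sketch (names and dense layout are assumptions):
   evaluates f(x) = sign( sum_i alpha_i * y_i * (x_i . x) + b ). */
static double dot(const double *a, const double *b, int dim)
{
    double s = 0.0;
    for (int k = 0; k < dim; k++)
        s += a[k] * b[k];
    return s;
}

int svm_predict(const double *x, int dim,
                const double *sv,     /* n_sv x dim support vector features  */
                const double *alpha,  /* their Lagrange multipliers (all >0) */
                const int *y,         /* their labels, +1 or -1              */
                int n_sv, double b)
{
    double f = b;
    for (int i = 0; i < n_sv; i++)
        f += alpha[i] * y[i] * dot(&sv[i * dim], x, dim);
    return (f >= 0.0) ? +1 : -1;
}
```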
To find the hyperplane that satisfies the maximum-margin requirement, the problem is transformed into a quadratic programming problem. First, one class (the black points) is denoted +1 and the other class (the white points) is denoted -1. Given a vector w perpendicular to the hyperplane, the class boundaries can be represented by the following equations after normalization.

Class +1 boundary (dashed line): $w \cdot x^{+} = +1$   (2.3)

Class -1 boundary (dashed line): $w \cdot x^{-} = -1$   (2.4)

The margin M is the distance from the +1 boundary to the -1 boundary:

$M = \frac{w}{\lVert w \rVert} \cdot (x^{+} - x^{-})$   (2.5)

$\phantom{M} = \frac{2}{\lVert w \rVert}$   (2.6)

Given training data with multidimensional features, the SVM algorithm maximizes the margin between the two classes by solving for the global minimum of a convex quadratic programming problem. Quadratic programming is a mathematical optimization problem that minimizes a nonlinear quadratic cost function subject to linear equality or inequality constraints. The resulting QP:

$\min_{w,b} \; \frac{1}{2}\lVert w \rVert^2$   (2.7)

s.t. $y_i \, ((w \cdot x_i) + b) \ge 1, \quad i = 1, \ldots, l$   (2.8)
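As a quick numerical check of Eqs. (2.5)-(2.6), with made-up numbers rather than data from the experiments, take a two-dimensional problem whose optimal hyperplane is w = (1, 1), b = -3, with support vectors $x^{+} = (2, 2)$ and $x^{-} = (1, 1)$ on the two boundaries:

```latex
\[
\begin{aligned}
w \cdot x^{+} + b &= (1,1)\cdot(2,2) - 3 = +1,\\
w \cdot x^{-} + b &= (1,1)\cdot(1,1) - 3 = -1,\\
M &= \frac{w}{\lVert w \rVert}\cdot\bigl(x^{+} - x^{-}\bigr)
   = \frac{(1,1)}{\sqrt{2}}\cdot(1,1)
   = \frac{2}{\sqrt{2}}
   = \frac{2}{\lVert w \rVert}
   = \sqrt{2}.
\end{aligned}
\]
```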
For training data that are not linearly separable, we can relax the constraints by allowing data points to lie on the opposite side of the hyperplane with a predetermined penalty C. The distance of a wrongly classified point from its correct boundary is represented by a slack variable ξ, and the objective function of the QP transforms to

$\min_{w,b} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{k=1}^{R} \xi_k,$   (2.9)

where R is the total number of misclassified data points and $\xi_k$ is the kth such point's distance from the decision boundary. In Figure 2-2, three outliers lie on the wrong side of the hyperplane. The value of C determines the trade-off between a generalized model and perfect classification. For problems with outliers or noise, we often want to use this "soft margin" approach to avoid over-fitting.

[Figure 2-2: SVM with outliers]

2.1.1 Kernel Functions

To classify data that are not linearly separable, the SVM first applies a kernel function that maps the feature data into another finite- or infinite-dimensional space. Data that are not separable in the lower dimension may become separable by a hyperplane in the new, higher dimension. Below is a list of commonly used kernel functions. In practice, the choice of kernel and kernel parameters is often optimized using cross-validation.

* Linear kernel: $K(x, x') = x^{T} x'$   (2.10)

* Polynomial kernel: $K(x, x') = (1 + x^{T} x')^{p}$   (2.11)

* Radial basis (Gaussian) kernel: $K(x, x') = \exp\left(-\frac{\lVert x - x' \rVert^2}{2\sigma^2}\right)$   (2.12)

* Sigmoid kernel: $K(x, x') = \tanh(\kappa (x \cdot x') + \vartheta)$   (2.13)
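For illustration, the sketch below evaluates the Gaussian kernel of Eq. (2.12) in the gamma parametrization quoted by the experiments in Chapter 3, i.e. $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$ with $\gamma = 1/(2\sigma^2)$. The function name and the dense representation are assumptions; SVMLight and LaSVM operate on sparse vectors.

```c
#include <math.h>

/* Gaussian (RBF) kernel of Eq. (2.12) in gamma form (illustrative sketch,
   dense vectors assumed): K(x, x') = exp(-gamma * ||x - x'||^2). */
double rbf_kernel(const double *x, const double *xp, int dim, double gamma)
{
    double sq = 0.0;
    for (int k = 0; k < dim; k++) {
        double d = x[k] - xp[k];
        sq += d * d;              /* accumulate squared distance */
    }
    return exp(-gamma * sq);
}
```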
2.2 SVMLight

Training a large-scale SVM classifier often strains memory and training-time resources. In his paper [4], Joachims described SVMLight, which incorporates several techniques to make training large-scale classifiers more efficient and practical. The SVMLight algorithm includes the following techniques:

* a decomposition strategy (Osuna's active set)
* selection of the appropriate variables for the working set (Zoutendijk's method)
* shrinking of the QP problem
* software caching of the Hessian (second derivative) computation

The algorithm first decomposes the original QP problem into a smaller subproblem by selecting an appropriate working set based on Zoutendijk's method. The method tries to find a direction d that is the steepest-descent direction with q non-zero elements. As discussed in [4], the working set selection is itself another QP problem, which can be solved by restricting q to be an even number and selecting the q/2 largest elements from each of the positive and negative weights. After the working set is found, only the weights of the working set are updated while the rest remain fixed in the current iteration. This smaller QP subproblem can then be solved with a nonlinear interior-point solver [8].

Besides the decomposition and working-set selection mentioned above, SVMLight also incorporates the idea of shrinking. The main purpose of shrinking is to identify the bounded support vectors using heuristics and remove them from the training set in order to reduce the size of the QP problem. Since the most computation-intensive operation in training the SVM is the Hessian computation, the use of a software cache avoids repetitive matrix computation and reduces the training time.

There are many parameters associated with the SVM in general and with the SVMLight implementation in particular.

* C: the trade-off between training error and margin. For a large C, the trained model may achieve near-perfect separation of the two classes; the disadvantage may be a small margin between the class boundaries and poor generalization performance. In the QP problem, C is also the upper bound on the Lagrange multipliers $\alpha_1, \alpha_2, \ldots, \alpha_l$ of the support vectors, via the inequality constraints

$0 \le \alpha_i \le C.$   (2.14)

* q: the maximum size of the QP subproblem. The parameter q determines the size of the working set, which contains the variables that are updated in the current iteration.

* m: the size of the cache for kernel evaluations, in MB. The parameter m specifies the size of the cache used to store precomputed kernel computations.

* ε: the allowable tolerance for the QP termination condition. In SVMLight, this value is the amount of error allowed while training the model to satisfy the inequality constraint of Eq. (2.8):

$y_i \, ((w \cdot x_i) + b) - 1 \ge -\epsilon.$   (2.15)

Figure 2-3 sketches the high-level code flow of SVMLight's training routine:

svm_learn_classification()                  // train the SVM
  optimize_to_convergence()
    while (not convergent)
      clear working set
      // select working set (a random set once in 100 times)
      if (counter++ % 101)
        select_next_qp_subproblem_grad()
      else
        select_next_qp_subproblem_rand()
      cache_multiple_kernel_rows()
      optimize_svm()                        // optimize on the working set
        compute_matrices_for_optimization() // makes calls to kernel
        optimize_qp()                       // Hildreth/d'Espo method
        reassign variables accordingly
      update_linear_component()
        if (linear case)
          clear_vector_n()
          for (each element in the working set)
            add_vector_ns(); sprod_ns()
      calculate_svm_model()
      check_optimality()
      if (seemingly convergent)
        // make all variables active again that had been removed by
        // shrinking; recompute their gradients from scratch
        reactivate_inactive_examples()
      if (counter++ % 10)
        shrink_problem()
      if (number of SVs > max elements in cache)
        kernel_cache_shrink()
      free variables for the next iteration

Figure 2-3: SVMLight high-level code flow
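The sketch below illustrates the idea behind the software cache of kernel rows: before recomputing a training point's kernel products against all others, look the row up in a fixed-size cache. The direct-mapped replacement policy and all names here are simplifying assumptions; SVMLight's actual cache management is more elaborate.

```c
/* Sketch of a kernel-row cache (illustrative; names and the direct-mapped
   policy are assumptions).  Row i holds K(x_i, x_j) for every j, so a
   cache hit saves n kernel evaluations. */
typedef struct {
    int     *owner;  /* training index occupying each slot, -1 if empty */
    double **row;    /* cached kernel rows, each of length n            */
    int      slots;  /* number of rows the memory budget allows         */
    int      n;      /* number of training points                       */
} KCache;

extern double kernel(int i, int j);  /* assumed kernel routine */

const double *kcache_row(KCache *c, int i)
{
    int s = i % c->slots;            /* direct-mapped slot choice     */
    if (c->owner[s] != i) {          /* miss: recompute the whole row */
        for (int j = 0; j < c->n; j++)
            c->row[s][j] = kernel(i, j);
        c->owner[s] = i;
    }
    return c->row[s];                /* hit, or the freshly filled row */
}
```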
2.3 Online SVM

In this thesis, an online learning algorithm is defined to be an algorithm that allows multiple incremental updates of a model as training examples are processed. There are other definitions of online algorithms, such as ones that label incoming training data while learning, or ones that use the accuracies of trained models to update their decision boundaries. The following sections focus on the incremental-model definition and explain the basic setup of the online SVM described in the LaSVM paper [10], including the criteria for selecting and removing support vectors and the termination condition.

2.3.1 Sequential Minimal Optimization

In order to solve for the global optimum of the SVM QP problem, LaSVM utilizes the Sequential Minimal Optimization (SMO) method proposed in [11]. Rather than solving one large QP problem in the training step, SMO solves sequentially on only two data points at a time without increasing the value of the dual objective function.

The SMO algorithm consists of three main components. The first is an analytic method to solve the subproblem of two Lagrange multipliers. The second is a heuristic for picking which two multipliers to optimize. The third is a method to compute the bias b, which in SVMLight is calculated from the linear components as in Figure 2-3.

The analytic method that solves for the two Lagrange multipliers can be described as a hill-climbing algorithm in which the multiplier values are updated iteratively to satisfy the Karush-Kuhn-Tucker (KKT) conditions. To improve the value of the objective function, SMO first selects one Lagrange multiplier that violates the KKT conditions. The other multiplier is then picked to give the maximal improvement in the value of the objective function. The bias b is recomputed after every iteration of the analytic method.
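A minimal sketch of the two-multiplier direction search follows, in the box-constrained form that the LaSVM paper uses; the variable names, the box arrays A and B, and the externally supplied kernel K are assumptions for exposition.

```c
/* Sketch of the SMO direction search on a pair (i, j), assuming the pair
   was chosen so that alpha_i < B_i, alpha_j > A_j, and g_i > g_j
   (a tau-violating pair in LaSVM's terms).  Names are illustrative. */
extern double K(int i, int j);  /* assumed kernel routine */

void smo_step(int i, int j, double *alpha, double *g,
              const double *A, const double *B, int n)
{
    /* curvature of the dual along the feasible direction e_i - e_j */
    double curv = K(i, i) + K(j, j) - 2.0 * K(i, j);
    double lambda = (g[i] - g[j]) / (curv > 1e-12 ? curv : 1e-12);

    /* clip the step so both multipliers stay inside their boxes */
    if (lambda > B[i] - alpha[i]) lambda = B[i] - alpha[i];
    if (lambda > alpha[j] - A[j]) lambda = alpha[j] - A[j];

    alpha[i] += lambda;
    alpha[j] -= lambda;

    /* every gradient g_k moves by -lambda * (K_ik - K_jk) */
    for (int k = 0; k < n; k++)
        g[k] -= lambda * (K(i, k) - K(j, k));
}
```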
2.4 LaSVM

2.4.1 Overview

LaSVM combines the techniques of Sequential Minimal Optimization and support vector removal. The resulting algorithm has been shown to use less memory and train faster than traditional SVM solvers. Especially when handling noisy data, where an increasing number of support vectors implies more computation cost, LaSVM aims to remove unnecessary support vectors to avoid over-fitting and to lower the kernel computation requirements.

LaSVM implements SMO using a practical method suggested in [15]. The method defines a small positive tolerance τ, and the search direction of SMO is determined by a τ-violating pair. Points i and j form a τ-violating pair if

$\alpha_i < B_i, \quad \alpha_j > A_j, \quad g_i - g_j > \tau.$

At initialization, LaSVM picks the first five samples from each class and makes them support vectors through the Process procedure. Next, in every epoch, LaSVM selects data points until all points have been selected or until the termination condition is met. The order in which data points are selected depends on a predetermined selection strategy: random, gradient, or active selection. When a particular data point is chosen, LaSVM calls the Process procedure, which inserts the data point as a support vector subject to the constraints. For each insertion there is a removal procedure, Reprocess, in which the algorithm removes non-support-vector data points from the model.

2.4.2 Selection Strategy

The online LaSVM has three selection strategies for picking the next data point for Process: random, gradient, and active selection. A sketch of the latter two appears after the descriptions below.

Random Selection. Random selection picks a random sample from the data points that have not yet been processed.

Gradient Selection. Gradient selection takes a user-specified number of candidates as input; the goal is to pick the best gradient among the candidates. The program sequentially picks random samples from the unprocessed data points and selects the data point with the maximal gradient, where the gradient is that of the dual objective function. Having the maximal gradient means giving the largest improvement toward satisfying the objective function. The gradient values are predicted by computing the kernel expansion on the selected data points. The goal of gradient selection is therefore to pick the most misclassified sample, in the hope of reducing the error of the objective function as much as possible.

Active Selection. Compared to the aggressive correction of gradient selection, active selection takes a more conservative approach by picking points that are misclassified but lie close to the decision boundary. This can be especially useful for noisy datasets, where the most misclassified points are usually exceptions and should be ignored. Active selection picks the next candidate point by first computing the margin of all candidate points from the current boundary, and then selecting the one closest to the margin for the Process step.
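A compact sketch of the gradient and active strategies over m randomly drawn candidates; the helper functions and names below are assumptions, not LaSVM's actual interfaces.

```c
#include <math.h>

/* Illustrative selection sketch.  Assumed helpers (not LaSVM's code):
   random_unprocessed() draws a fresh unprocessed index,
   predict(i) returns the current model output on x_i,
   label(i) returns its class, +1 or -1. */
extern int    random_unprocessed(void);
extern double predict(int i);
extern int    label(int i);

int select_candidate(int m, int active)
{
    int best = -1;
    double best_score = -HUGE_VAL;
    for (int t = 0; t < m; t++) {
        int i = random_unprocessed();
        double score = active
            ? -fabs(predict(i))        /* active: closest to the boundary */
            : -label(i) * predict(i);  /* gradient: most misclassified    */
        if (score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;
}
```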
2.4.3 Process and Reprocess

Process. The goal of the Process operation is to recompute the alpha and gradient values after inserting the selected data point as a support vector. The operation can be described in three main steps. The first step designates a data point to be processed as a potential support vector; the selected sample is used as a new training point for the model. The second step selects a corresponding support vector that forms a τ-violating pair with the newly added sample such that the pair has the maximal gradient. The third step updates the weight of every support vector.

Reprocess. The Reprocess operation removes from the support vector set any elements whose alphas are equal to zero. The operation first determines a τ-violating pair with the maximal gradient; next it performs a direction search and recomputes the alpha weights. At the end of the computation, if any previous support vector now has an alpha weight of zero, it is removed from the set. The bias term and the new gradients are also recomputed at the end.

2.4.4 Termination Condition

Each epoch of online training terminates when all samples have been selected for the processing step. The current implementation also provides three types of termination criteria, based on the number of iterations, the number of support vectors, or the training time.

Figure 2-4 shows the pseudocode of LaSVM. The main loop starts by selecting five samples of each class as support vectors, then iterates through the data points in each epoch. The algorithm selects a fresh sample data point according to its selection strategy and adds it to the support vector set. Next, the Reprocess operation attempts to remove a non-support-vector point from the set and updates the bias term and gradient values. When the termination condition is met at the end of the loop, the Reprocess operation is called repeatedly until there are no τ-violating pairs left.

train_online()
  select 5 examples of each class as support vectors
  for (every epoch)
    for (every point in training set)
      select()
        case RANDOM:   pick a random candidate
        case GRADIENT: pick the best gradient among candidates
        case MARGIN:   pick the candidate closest to the margin
      lasvm_process()
        check if the point is already in the expansion as an SV
        compute its gradient and insert it into the expansion
        perform a direction search
      lasvm_reprocess()
        find the maximal-gradient tau-violating pair
        perform a direction search
        remove non-support vectors
        compute new bias and gradient values
  if (termination condition met)
    repeat lasvm_reprocess() until there are no tau-violating pairs

Figure 2-4: LaSVM pseudo code

Chapter 3 SVM Experiments

This chapter compares the performance of the offline SVM (SVMLight) and the online SVM (LaSVM) on five different datasets: MNIST handwritten digit classification, UCI income prediction, webpage classification, face detection, and CEARCH entity classification. The experiments were run on a 3.20GHz Intel machine with 2GB of available memory. The first four datasets are drawn from published papers and cover a wide range of input data formats and training problems. An additional entity classification dataset was compiled from a simulation tool developed by Northrop Grumman.

3.1 Experimental Setup

Dataset   Train Size  Test Size  Features  Gamma        Tradeoff c  Cache (MB)  Working Set (q)
MNIST     60000       10000      780       0.005        10          256         10
Income    32562       16282      123       0.005        1           80          2
Webpage   49749       38994      300       0.005        5           80          2
Face      6977        24045      363       0.01         10          80          20
CEARCH    18782       100        10        0.0001/0.01  10/7        100         10

Table 3.1: Dataset summary

MNIST Setup

The MNIST experiments compare the performance of SVMLight and of LaSVM with one epoch and with two epochs. The associated parameters are set equal for a fair comparison. Besides algorithm-specific parameters, such as a working set size of 10 for SVMLight and the random selection scheme for LaSVM, both methods were tested with an RBF kernel with gamma=0.005 and trade-off constant c=10, with a 256MB cache.

Income Prediction Setup

Similar to the MNIST digit classification experiments, the training parameters were set equal for both the SVMLight and LaSVM methods. All training runs use the RBF kernel with gamma=0.005, trade-off constant c=1, and an 80MB cache.

Webpage Classification Setup

In the webpage classification experiments, the training parameters are: RBF kernel with gamma=0.005, trade-off constant c=5, and a cache size of 80MB. The working set for SVMLight is q=2, and LaSVM uses the random selection scheme for both one and two epochs.

Face Detection Setup

All experiments on the face detection dataset use the same training parameters: an RBF kernel with gamma=0.01, trade-off constant c=10, and an 80MB cache. The working set q is set to 20 for SVMLight.

CEARCH Setup

The Distributed Sensing Test Bed [12] developed by Northrop Grumman is a simulation tool for modeling two-dimensional entity movements observed by platforms carrying sensors. Sensor reports are generated whenever an entity moves within the specified sensor range with the appropriate sensor dimension. At the same timestamps, truth reports log the true positions and velocities of all entities with no sensor noise. The simulation tool takes two XML input files, which specify the simulation parameters, including the entity types, platform speed, sensor dimensions, sensor ranges, and the region map. The entity, platform, and map are all implemented as C++ classes.

[Figure 3-1: Simulation map]

Entity - An entity can have one or more signature dimensions and a specified behavior. The signature values are Gaussian distributed with a mean and a standard deviation (a small sampling sketch appears at the end of this section). The behavior can be random or follow specified trajectories along waypoints.

Platform - Each platform can have multiple sensors, which detect any entities within sensor range. A platform moves across the region with a random behavior or a commandable route, provided by the C2FusionAlgorithm class.

Map - The map specifies the world in which the platforms and entities operate. There are different map layers, including the infrastructure of roads and parking space, the building structure of houses and stores, and natural features such as ponds.

The simulation used the exact scenario file provided by Northrop Grumman (scenario-cearch.xml), with an increased running time to get a larger dataset (cearch_1a_big). There are a total of 630 entities (50 types) and 20 platforms (10 types). The config file (dstb-cearch_1.xml) kept the entity information the same and increased each platform's sensor dimensions from 1 to 10. The goal is to have 10 simultaneous sensor reports from a single entity; all 10 reports are used as additional features for training the SVM models. The added sensors on a platform have the same sensor range and noise as the existing sensor on that platform. The data size associated with the signature, however, remained the same for the same signature dimension across all 10 platforms.

Two simulations were run with the same random seed, 5147. The cearch_1a simulation lasts for 10 time units, and the cearch_1a_big simulation lasts for 60.12 time units with a resolution of 0.02. Of the 50 entity types, only the first two have the behavior of going from houses to stores; the remaining entity types have random behaviors. Entities of type 1 are actuated with a consistent move behavior, and entities of type 2 have inconsistent route behaviors. The platforms are commandable and have zig-zagging behaviors surveying the simulation area.
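For illustration only, a Gaussian-distributed signature reading with a given mean and standard deviation could be drawn as below, using the Box-Muller transform; this is an assumption for exposition, not the DSTB's actual sampling code.

```c
#include <stdlib.h>
#include <math.h>

/* Illustrative sketch (not the DSTB's implementation): draw one
   Gaussian signature reading via the Box-Muller transform. */
double signature_sample(double mean, double stddev)
{
    const double two_pi = 6.283185307179586;
    /* uniforms in (0,1); the +1/+2 offsets avoid log(0) */
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    double z  = sqrt(-2.0 * log(u1)) * cos(two_pi * u2);
    return mean + stddev * z;
}
```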
3.2 MNIST Dataset

The MNIST handwritten digit (0-9) dataset consists of 60,000 training samples and 10,000 testing samples. The features are 780 gray-level pixel values ranging from 0 to 1. In order to classify each digit, we train the SVM with the digit's class as +1 and the remaining nine classes as -1 (a minimal sketch of this relabeling follows the table). The positive and negative sample counts for each digit are summarized in Table 3.2.

Label  train(+1)  train(-1)  test(+1)  test(-1)
0      5923       54077      980       9020
1      6742       53258      1135      8865
2      5958       54042      1032      8968
3      6131       53869      1010      8990
4      5842       54158      982       9018
5      5421       54579      892       9108
6      5918       54082      958       9042
7      6265       53735      1028      8972
8      5851       54149      974       9026
9      5949       54051      1009      8991

Table 3.2: MNIST dataset details
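A minimal sketch of the one-vs-rest relabeling described above (illustrative, not the preprocessing scripts actually used):

```c
/* Illustrative one-vs-rest relabeling: the target digit becomes
   class +1, the other nine digits class -1. */
void relabel_one_vs_rest(const int *digit, int *y, int n, int target)
{
    for (int i = 0; i < n; i++)
        y[i] = (digit[i] == target) ? +1 : -1;
}
```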
In Figure 3-2, across all 10 digits, LaSVM with one epoch takes the least training time; LaSVM with two epochs also trains faster than SVMLight. In Figure 3-3, all three methods produce comparable numbers of support vectors, with one-epoch LaSVM returning the fewest overall. Among the digits, labels 0 and 1 were the easiest to train, requiring the fewest support vectors, and labels 8 and 9 were the hardest; a similar relationship is reflected in the training times. In Figure 3-4, all three methods have comparable test errors across the 10 digits. Surprisingly, training LaSVM with two epochs does not return a significantly higher accuracy for the extra training time.

[Figure 3-2: MNIST training time]
[Figure 3-3: MNIST number of support vectors]
[Figure 3-4: MNIST test error]

In Figure 3-5, LaSVM with one epoch has the smallest ratio of training time to number of support vectors, and SVMLight has the highest.

[Figure 3-5: MNIST training time vs number of support vectors]

In Figure 3-6, reducing the available cache size significantly affects the training time of SVMLight because of the demands of the quadratic programming step. In SVMLight, the increased training time is spent recalculating the kernels that no longer fit in the smaller cache, whereas LaSVM tolerates caches as small as 16MB.

[Figure 3-6: MNIST training time vs cache size]

Algorithm   Training time  Number of SVs  Accuracy
SVMLight    -              -              X
LaSVM(x1)   X              X              X
LaSVM(x2)   -              -              X

Table 3.3: MNIST summary (X marks the top-performing algorithms)

3.3 Income Prediction

The goal of the income dataset is to predict whether a household has an income greater than $50,000, based on 123 binary features. The feature dimensions are derived from 14 attributes; 8 of the 14 are categorical binary values and the rest are continuous attributes discretized into intervals. The training data were split into increasing sample sizes to analyze the effect of sample size on training performance. The details of the training and testing datasets are shown in Table 3.4.

Examples  train(+1)  train(-1)  test(+1)  test(-1)
1605      395        1210       3846      12436
2265      572        1693       3846      12436
3185      773        2412       3846      12436
4781      1188       3593       3846      12436
6414      1569       4845       3846      12436
11221     2692       8528       3846      12436
16101     3918       12182      3846      12436
22697     5506       17190      3846      12436
32562     7841       24721      3846      12436

Table 3.4: Income dataset details

The following figures show the results for training time, number of support vectors, and test accuracy. Across the different training sample sizes, LaSVM with one epoch trains faster than SVMLight, except on the largest dataset, a9a, and LaSVM with two epochs trains the slowest of the three (Figure 3-8). All three return similar numbers of support vectors (Figure 3-10). For the extra training time, LaSVM with two epochs did not return any improvement in test accuracy and performs the worst. Overall, the three algorithms perform about equally on training time, number of support vectors, and accuracy. On the largest dataset, a9a, SVMLight has the fastest training time and a test accuracy similar to that of two-epoch LaSVM.

Figure 3-7 (Income training time comparison, in seconds):

Examples     SVMLight  LaSVM(x1)  LaSVM(x2)
1605 (a1a)   0.47      0.26       0.52
2265 (a2a)   0.94      0.55       1.08
3185 (a3a)   1.75      1.10       2.09
4781 (a4a)   3.91      2.41       4.90
6414 (a5a)   7.07      4.65       9.34
11221 (a6a)  24.54     16.51      35.45
16101 (a7a)  57.38     41.72      85.43
22697 (a8a)  110.3     103.95     196.52
32562 (a9a)  227.96    248.59     442.24

[Figure 3-8: Training time vs sample size]

Figure 3-9 (Income number of support vectors; bsv = bounded support vectors):

Examples     SVMLight           LaSVM(x1)  LaSVM(x2)
1605 (a1a)   786 (bsv:757)      767        784
2265 (a2a)   1105 (bsv:1082)    1085       1102
3185 (a3a)   1450 (bsv:1412)    1419       1443
4781 (a4a)   2090 (bsv:2051)    2044       2081
6414 (a5a)   2710 (bsv:2661)    2662       2700
11221 (a6a)  4478 (bsv:4424)    4408       4467
16101 (a7a)  6322 (bsv:6252)    6226       6304
22697 (a8a)  8731 (bsv:8668)    8604       8707
32562 (a9a)  12162 (bsv:12082)  12040      12142

[Figure 3-10: Number of support vectors vs sample size]

Figure 3-11 (Income test accuracies):

Examples     SVMLight  LaSVM(x1)  LaSVM(x2)
1605 (a1a)   82.39%    83.04%     82.42%
2265 (a2a)   83.11%    83.54%     83.09%
3185 (a3a)   83.28%    83.58%     83.27%
4781 (a4a)   83.48%    83.59%     83.48%
6414 (a5a)   83.74%    83.74%     83.72%
11221 (a6a)  83.79%    84.14%     83.75%
16101 (a7a)  84.02%    84.24%     84.02%
22697 (a8a)  84.18%    84.35%     84.15%
32562 (a9a)  84.35%    84.26%     84.34%

[Figure 3-12: Test accuracy vs sample size]
Figure 3-14 also shows the direct relationship between kernel evaluations and training time: since the bulk of SVM training consists of kernel evaluations, the more kernel evaluations an algorithm requires, the longer its training tends to take.

Figure 3-13 (Income kernel evaluations):

Examples     SVMLight   LaSVM(x1)  LaSVM(x2)
1605 (a1a)   1009656    653448     1316311
2265 (a2a)   1993157    1270049    2632005
3185 (a3a)   3762116    2583480    4945027
4781 (a4a)   8231945    5091158    10796052
6414 (a5a)   14474588   9170704    19057417
11221 (a6a)  50842630   25503523   56208622
16101 (a7a)  109866178  56062076   120879722
22697 (a8a)  221425039  136418730  263881292
32562 (a9a)  453929799  310893163  563633998

[Figure 3-14: Kernel evaluations vs sample size]

Algorithm   Training time  Number of SVs  Accuracy
SVMLight    =              =              =
LaSVM(x1)   =              =              =
LaSVM(x2)   =              =              =

Table 3.5: Income summary (= indicates comparable performance)

3.4 Webpage Classification

The webpage dataset, prepared by John Platt, is an experiment in text categorization. The goal is to train the SVM on the keywords extracted from a webpage and classify whether the webpage belongs to a certain category. There are 300 keyword features in a training set of 49,749 webpages. As in the income prediction problem, the training samples are split into multiple training sizes, from w1a to w8a.

Examples     train(+1)  train(-1)  test(+1)  test(-1)
2477 (w1a)   72         2405       1094      37900
3470 (w2a)   107        3363       1094      37900
4912 (w3a)   143        4769       1094      37900
7366 (w4a)   216        7150       1094      37900
9888 (w5a)   281        9607       1094      37900
17188 (w6a)  525        16663      1094      37900
24692 (w7a)  740        23952      1094      37900
49749 (w8a)  1479       48270      1094      37900

Table 3.6: Webpage dataset details

The training times of the three approaches are shown in Figure 3-16: LaSVM with one epoch takes the least time, while two epochs take close to twice as much time, if not more. Interestingly, even though one-epoch LaSVM trains fastest, it also has the highest number of support vectors (Figure 3-18). Overall, SVMLight is more accurate and has fewer support vectors.

Figure 3-15 (Webpage training time comparison, in seconds):

Examples     SVMLight  LaSVM(x1)  LaSVM(x2)
2477 (w1a)   0.29      0.16       0.41
3470 (w2a)   0.53      0.29       0.74
4912 (w3a)   0.85      0.57       1.36
7366 (w4a)   1.79      1.12       2.79
9888 (w5a)   3.05      2.31       3.91
17188 (w6a)  9.31      5.87       14.32
24692 (w7a)  20.13     12.94      33.54
49749 (w8a)  80.12     72.87      188.90

[Figure 3-16: Training time vs sample size]
Figure 3-17 (Webpage number of support vectors; bsv = bounded support vectors):

Examples     SVMLight         LaSVM(x1)  LaSVM(x2)
2477 (w1a)   224 (bsv:102)    282        232
3470 (w2a)   300 (bsv:160)    367        279
4912 (w3a)   360 (bsv:220)    410        345
7366 (w4a)   506 (bsv:349)    598        503
9888 (w5a)   628 (bsv:460)    612        616
17188 (w6a)  1068 (bsv:836)   1438       1046
24692 (w7a)  1433 (bsv:1164)  2056       1458
49749 (w8a)  2510 (bsv:2147)  3830       2836

[Figure 3-18: Number of support vectors vs sample size]

Figure 3-19 (Webpage test accuracies):

Examples     SVMLight  LaSVM(x1)  LaSVM(x2)
2477 (w1a)   97.44%    97.45%     97.45%
3470 (w2a)   97.46%    97.54%     97.46%
4912 (w3a)   97.58%    97.70%     97.53%
7366 (w4a)   97.87%    97.88%     97.73%
9888 (w5a)   97.97%    97.90%     97.79%
17188 (w6a)  98.34%    98.35%     98.09%
24692 (w7a)  98.48%    98.33%     98.16%
49749 (w8a)  98.79%    98.52%     98.41%

[Figure 3-20: Test accuracy vs sample size]

Figure 3-21 (Webpage kernel evaluations):

Examples     SVMLight   LaSVM(x1)  LaSVM(x2)
2477 (w1a)   642643     456728     1225663
3470 (w2a)   1183361    870890     2186374
4912 (w3a)   2002387    1653750    4100866
7366 (w4a)   4238842    3083272    7990208
9888 (w5a)   7095305    6238597    10574545
17188 (w6a)  21019709   14728017   37232710
24692 (w7a)  45002780   31948290   82493882
49749 (w8a)  172741988  104253875  314426160

[Figure 3-22: Kernel evaluations vs sample size]

Algorithm   Training time  Number of SVs  Accuracy
SVMLight    -              X              X
LaSVM(x1)   X              -              X
LaSVM(x2)   -              X              -

Table 3.7: Webpage summary (X marks the top-performing algorithms)

3.5 Face Detection

The goal of this experiment is to classify whether or not an image contains a face. The features are the 361 pixels of the 19x19 images, with values ranging from 0 to 1. The SVM models were trained on each dataset and evaluated on the same testing samples (Table 3.8).

Examples      train(+1)  train(-1)  test(+1)  test(-1)
435 (face0)   151        284        472       23573
871 (face1)   303        568        472       23573
1744 (face2)  607        1137       472       23573
3488 (face3)  1214       2274       472       23573
6977 (face4)  2429       4548       472       23573

Table 3.8: Face detection dataset details

LaSVM with one epoch consistently has the least training time, and SVMLight lies in between (Figure 3-24). As the following figures show, one-epoch LaSVM also has the smallest number of support vectors on average, yet performs comparably to SVMLight. In this example, the two-epoch version again gains no improvement in test accuracy. An interesting observation can be made by comparing Figures 3-24 and 3-30: even though SVMLight's kernel evaluation counts are similar to those of two-epoch LaSVM, its training time is significantly lower. This may be due to the efficiency of its large working set of 20.

Figure 3-23 (Face detection training time comparison, in seconds):

Examples      SVMLight  LaSVM(x1)  LaSVM(x2)
435 (face0)   0.14      0.07       0.14
871 (face1)   0.41      0.23       0.54
1744 (face2)  1.01      0.69       1.65
3488 (face3)  2.95      2.51       6.03
6977 (face4)  8.14      7.88       19.36

[Figure 3-24: Training time vs sample size]
Figure 3-25 (Face detection number of support vectors; bsv = bounded support vectors):

Examples      SVMLight       LaSVM(x1)  LaSVM(x2)
435 (face0)   99 (bsv:3)     97         99
871 (face1)   156 (bsv:7)    143        155
1744 (face2)  208 (bsv:9)    189        206
3488 (face3)  313 (bsv:30)   288        307
6977 (face4)  449 (bsv:47)   409        44[?]

[Figure 3-26: Number of support vectors vs sample size]

Figure 3-27 (Face detection test accuracies):

Examples      SVMLight  LaSVM(x1)  LaSVM(x2)
435 (face0)   95.91%    95.90%     95.91%
871 (face1)   96.82%    96.80%     96.80%
1744 (face2)  97.74%    97.74%     97.74%
3488 (face3)  98.08%    98.07%     98.08%
6977 (face4)  98.29%    98.34%     98.29%

[Figure 3-28: Test accuracy vs sample size]

Figure 3-29 (Face detection kernel evaluations):

Examples      SVMLight  LaSVM(x1)  LaSVM(x2)
435 (face0)   86142     32953      62166
871 (face1)   242288    94665      208801
1744 (face2)  589349    246205     559261
3488 (face3)  1701520   732452     1665282
6977 (face4)  4644224   2035821    4892865

[Figure 3-30: Kernel evaluations vs sample size]

Algorithm   Training time  Number of SVs  Accuracy
SVMLight    -              -              X
LaSVM(x1)   X              X              X
LaSVM(x2)   -              -              X

Table 3.9: Face detection summary (X marks the top-performing algorithms)

3.6 CEARCH Entity Classification

To determine good training parameters for both algorithms, a series of experiments was first run on a smaller dataset of size 4060, with a testing dataset of 16 samples, 8 from each class. After the parameter tuning process, the algorithms were evaluated across multiple datasets of increasing size, from 2000 to 18782 examples (Table 3.10).

Examples  train(+1)  train(-1)  test(+1)  test(-1)
2000      761        1239       50        50
4000      761        3239       50        50
6000      761        5239       50        50
8000      761        7239       50        50
10000     761        9239       50        50
15000     761        14239      50        50
18782     761        18021      50        50

Table 3.10: CEARCH dataset details

The initial training parameters use an RBF kernel with a trade-off constant of 10 and a 100MB cache. Due to the small number of positive samples in the dataset, the weight of the positive class was initialized to the ratio between the negative and positive samples, 66.7. The working set size q was set to 2.

Figure 3-31 shows the test accuracies as gamma is varied. For SVMLight, the smaller the gamma value, the higher the test accuracy; hence a low gamma of 0.0001 is chosen for subsequent training. Surprisingly, LaSVM shows a slightly different pattern, with its maximum test accuracy at gamma=0.01. Overall, the average LaSVM test accuracy is lower than SVMLight's by more than 10%.

[Figure 3-31: Test accuracy vs gamma]

After tuning the kernel parameter gamma, we experimented with the trade-off constant c (Figure 3-32). SVMLight returns a high accuracy of 93.75% for c values in the range 2 to 6, while LaSVM reaches its maximal accuracy at c=7. Again, LaSVM's testing performance is much lower overall.

[Figure 3-32: Test accuracy vs c]

Figure 3-33 confirms the choice of weight for the positive class, as both SVMLight and LaSVM return reasonably high accuracies at w=66.7 compared to other weight values.

[Figure 3-33: Test accuracy vs weight]
After determining values for gamma, the trade-off constant c, and the weight of the positive class, the final experiments focus on performance across different training dataset sizes. While SVMLight improves its accuracy given more training data, LaSVM's performance degrades, the opposite of its behavior on all the previous datasets. We do not know the exact reason; one speculation is that the parameters were tuned to fit the smaller dataset and are not tailored to the larger sizes, though more experiments are needed before a clear explanation can be given.

Figure 3-34 (CEARCH training time comparison, in seconds):

Examples  SVMLight  LaSVM(x1)
2000      1.34      0.57
4000      4.86      1.82
6000      9.82      3.54
8000      17.35     4.98
10000     183.67    6.94
15000     249.23    9.86
18782     256.32    13.172

[Figure 3-35: Training time vs sample size]

Figure 3-36 (CEARCH number of support vectors):

Examples  SVMLight  LaSVM(x1)
2000      961       955
4000      2175      1232
6000      3274      1482
8000      4235      1633
10000     5192      1767
15000     7265      1957
18782     8837      2066

[Figure 3-37: Number of support vectors vs sample size]

Figure 3-38 (CEARCH test accuracies):

Examples  SVMLight  LaSVM(x1)
2000      75%       76%
4000      78%       68%
6000      79%       69%
8000      74%       60%
10000     76%       59%
15000     85%       55%
18782     87%       54%

[Figure 3-39: Test accuracy vs sample size]

Algorithm   Training time  Number of SVs  Accuracy
SVMLight    -              -              X
LaSVM(x1)   X              X              -
LaSVM(x2)   -              -              -

Table 3.11: CEARCH entity summary (X marks the top-performing algorithms)

Chapter 4 Performance and Tradeoff

4.1 Cross-sectional Dataset Comparison

Dataset   Training Time  Kernel Counts  SV     BSV    Accuracy (%)
MNIST     660.69         1.6E+08        2086   804    99.55
Income    227.96         4.54E+08       12162  12082  84.35
Webpage   80.12          1.73E+08       2510   2147   98.79
Face      8.14           4.64E+06       449    47     98.29
CEARCH    198.35         2.1E+08        9147   9097   85.00

Table 4.1: SVMLight performance summary

Dataset   Training Time  Kernel Counts  SV     Accuracy (%)
MNIST     171.51         6.94E+07       1811   99.57
Income    248.59         3.11E+08       12040  84.26
Webpage   72.87          1.04E+08       3830   98.52
Face      7.88           2.04E+06       409    98.34
CEARCH    12.83          2.54E+07       2056   53.10

Table 4.2: LaSVM performance summary

Dataset   Training time          Number of SVs          Accuracy
MNIST     LaSVM(x1)              LaSVM(x1)              All
Income    SVMLight or LaSVM(x1)  All                    All
Webpage   LaSVM(x1)              SVMLight or LaSVM(x2)  SVMLight or LaSVM(x1)
Face      LaSVM(x1)              LaSVM(x1)              All
CEARCH    LaSVM(x1)              LaSVM(x1)              SVMLight

Table 4.3: Top-performing algorithms for each dataset

One interesting observation is the relatively fast training time on both the Webpage and Face datasets; this may well be due to their large trade-off constants c.

4.2 Procedure Count Analysis

To compare the two SVM algorithms closely on the exact procedures run during training, we used a program called proccount to record the total number of calls and instructions. The SVM models were trained on the income prediction dataset using the same training parameters (RBF kernel with gamma=0.005, c=1, cache=80MB, and q=2).

[Figure 4-1: Top 20 processes of SVMLight. Kernel evaluations account for 76.92% of the instructions; prominent non-kernel procedures include calculate_svm_model, compute_matrices_for_optimization, reactivate_inactive_examples, shrink_problem, and select_next_qp_subproblem_rand.]

[Figure 4-2: Top 20 processes of LaSVM. Kernel evaluations account for 80.37% of the instructions, concentrated in the process and reprocess operations.]

Among the top 20 procedures of both SVMLight and LaSVM, 70 to 80 percent of the instructions are kernel evaluations. Overall, LaSVM also issues far fewer calls and instructions than SVMLight, reflecting the fact that LaSVM solves for two points at a time rather than solving the entire QP problem as SVMLight does.
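The kernel-evaluation counts reported in these tables could also be gathered at the source level with a counting wrapper, sketched below; the names are assumptions, and this is not how the binary-level proccount tool operates.

```c
/* Source-level alternative to binary instrumentation (illustrative;
   not how proccount works): wrap the kernel routine with a counter
   to tally kernel evaluations during training. */
static unsigned long kernel_evals = 0;

extern double kernel_raw(int i, int j);  /* assumed underlying kernel */

double kernel(int i, int j)
{
    kernel_evals++;              /* one evaluation tallied per call */
    return kernel_raw(i, j);
}
```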
4.3 Sensitivity to Tolerance Parameter ε

The performance in the experiments on the five datasets is affected by several parameters of the algorithms. To gauge the robustness of each algorithm to varying parameters, this section explores sensitivity to the tolerance parameter ε (see Section 2.2 for the role of the termination condition ε in both the offline and online SVM algorithms). The experiment below varies the termination condition for both LaSVM and SVMLight on the Income dataset and describes how the accuracies change; the original experiments used tol=0.001.

Figure 4-3 shows the effect of varying the tolerance (tol = 0.1 and 0.01) on one-epoch LaSVM: its accuracies have a large variance. The two-epoch LaSVM in Figure 4-4 stays consistent with the run at the default tolerance (tol = 0.001). Figure 4-5 shows the effect of varying the tolerance (tol = 0.1 and 0.01) on SVMLight; the results stay consistent with SVMLight at the default tolerance (tol = 0.001).

[Figure 4-3: Income: accuracy vs sample size, one-epoch LaSVM at tol = 0.001, 0.1, and 0.01]
[Figure 4-4: Income: accuracy vs sample size, two-epoch LaSVM at tol = 0.001, 0.1, and 0.01]
[Figure 4-5: Income: accuracy vs sample size, SVMLight at tol = 0.001, 0.1, and 0.01]

The tolerance experiments show that SVMLight may be more robust to changes in the tolerance setting, whereas one-epoch LaSVM performs less consistently. Compared to one-epoch LaSVM, the two-epoch trials generate more stable test accuracies, similar to SVMLight's. One aspect worth mentioning is that one-epoch LaSVM still returns the highest accuracies for all three tolerance values (tol = 0.1, 0.01, and 0.001).
Chapter 5 Summary

This thesis compares the strengths and weaknesses of the offline and online SVM. The work includes performance comparisons of SVMLight and LaSVM, with results for training time, number of support vectors, kernel evaluations, and test accuracy. Overall, the online LaSVM trained in less time than SVMLight and returned comparable test accuracies. The offline SVMLight, however, is more robust to a varying tolerance parameter than one-epoch LaSVM.

5.1 Future Work

The current experiments include only the random selection scheme for the LaSVM algorithm. The alternative schemes, gradient and active selection, may offer different training-time and accuracy trade-offs across the datasets. The current online LaSVM algorithm also does not offer an approach to unsupervised training, which could be an added strength combined with its fast training speed. A potential extension could apply the algorithm to real-time feedback adjustment, allowing the model to change in response to new incoming data.

Bibliography

[1] C. Bahlmann, B. Haasdonk, and H. Burkhardt. On-line Handwriting Recognition with Support Vector Machines: A Kernel Approach. In Proc. of the 8th IWFHR, pages 49-54, 2002.
[2] C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 1998.
[3] I. Guyon and A. Elisseeff. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 2003.
[4] T. Joachims. Making Large-Scale SVM Learning Practical. In Advances in Kernel Methods, B. Schölkopf, C. Burges, and A. Smola (eds.), MIT Press, Cambridge, 1998.
[5] T. Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In European Conference on Machine Learning (ECML), 1998.
[6] E. Osuna, R. Freund, and F. Girosi. Support Vector Machines: Training and Applications. A.I. Memo 1602, MIT A.I. Lab, 1997.
[7] G. Cauwenberghs and T. Poggio. Incremental and Decremental Support Vector Machine Learning. In Advances in Neural Information Processing Systems, 2001.
[8] R. Vanderbei. LOQO: An Interior Point Code for Quadratic Programming. Technical Report SOR 94-15, Princeton University, 1994.
[9] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[10] A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast Kernel Classifiers with Online and Active Learning. Journal of Machine Learning Research, 2005.
[11] J. Platt. Fast Training of Support Vector Machines Using Sequential Minimal Optimization. In Advances in Kernel Methods, B. Schölkopf, C. Burges, and A. Smola (eds.), MIT Press, Cambridge, 1998.
[12] Distributed Sensing Test Bed (DSTB) User's/Developer's Guide.
[13] C. Cortes and V. Vapnik. Support Vector Networks. Machine Learning, 20:273-297, 1995.
[14] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature Selection for SVMs.
[15] C.-C. Chang and C.-J. Lin. LIBSVM: A Library for Support Vector Machines. Technical report, Computer Science and Information Engineering, National Taiwan University, 2001-2004.