An Experimental Multi-Objective Study of the SVM Model Selection Problem

Giuseppe Narzisi
Courant Institute of Mathematical Sciences
New York, NY 10012, USA
narzisi@nyu.edu

Abstract. Support Vector Machines (SVMs) are a powerful method for both regression and classification. However, any SVM formulation requires the user to set two or more parameters which govern the training process, and these parameters can have a strong effect on the resulting performance of the engine. Moreover, the design of learning systems is inherently a multi-objective optimization problem: it requires finding a suitable trade-off between at least two conflicting objectives, model complexity and accuracy. In this work the SVM model selection problem is cast as a multi-objective optimization problem, where the error and the number of support vectors of the model define the two objectives. An experimental analysis is presented on a well-known test bed of datasets using two different kernels: RBF and sigmoid.

Key words: Support Vector Machine, Multi-Objective Optimization, NSGA-II, SVM Model Selection.

1 Introduction

Support Vector Machines have proven to be very effective methods for classification and regression [12]. However, in order to obtain good generalization errors the user needs to choose appropriate values for the parameters of the model. The kernel parameters, together with the regularization parameter C, are called the hyperparameters of the SVM, and the problem of tuning them in order, for example, to improve the generalization error of the model is called the SVM model selection problem.

Usually the standard method to determine the hyperparameters is grid search. In the simple grid-search approach the hyperparameters are varied with a fixed step-size through a wide range of values and the performance of every combination is measured (a sketch of this baseline is given below, after Figure 1). Because of its computational complexity, grid search is only suitable for the adjustment of very few parameters. Further, the choice of the discretization of the search space may be crucial. Figure 1 shows the typical parameter surfaces for the error and the number of support vectors as a function of the hyperparameters C and γ for the diabetes dataset.

Recently, gradient-based approaches have been explored for choosing the hyperparameters [2, 6, 8]. However, they have some drawbacks and limitations. First of all, the score function used to evaluate the quality of a set of hyperparameters must be differentiable, which excludes important measures such as the number of support vectors. Also, because the objective function is strongly multimodal, the performance of the gradient-based heuristic depends on the initialization, which means that the algorithm can easily get stuck in a sub-optimal local minimum.

[Figure 1 appears here.]
Fig. 1. Parameter surface of the error (a) and the number of SVs (b) as a function of the two hyperparameters C and γ for the diabetes dataset using 5-fold cross-validation.
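Surfaces like those in Figure 1 can themselves be produced by exhaustive grid evaluation. The following is a minimal sketch of the grid-search baseline, assuming scikit-learn (whose SVC class wraps LIBSVM) and numpy arrays X, y; the library, ranges and step size are illustrative choices, not the exact setup of this paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def grid_search(X, y, log2C_range=(-5, 15), log2g_range=(-10, 4), step=1.0):
    """Naive grid search: evaluate every (C, gamma) pair with 5-fold CV."""
    best = (None, None, 0.0)  # (C, gamma, accuracy)
    for log2C in np.arange(*log2C_range, step):
        for log2g in np.arange(*log2g_range, step):
            C, gamma = 2.0 ** log2C, 2.0 ** log2g
            acc = cross_val_score(SVC(C=C, gamma=gamma, kernel="rbf"),
                                  X, y, cv=5).mean()
            if acc > best[2]:
                best = (C, gamma, acc)
    return best
```

Even with only two hyperparameters this costs one full cross-validation per grid cell, which is why the method does not scale to more parameters or finer discretizations.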
The main idea which is missing in these kinds of approaches is that the SVM model selection problem is inherently a multi-objective optimization problem. Designing supervised learning systems for classification requires finding a suitable trade-off between several objectives. Typically we want to reduce the complexity of the model and at the same time obtain a model with a high accuracy level (or low error rate). Sometimes the model with the best generalization error may not be the best choice if the price we have to pay is working with a very complex model, both in terms of time and space. Usually this problem is tackled by aggregating the objectives into a scalar function (a linear weighting of the objectives) and applying standard methods to the resulting single-objective optimization problem; the weighted-sum form is written out at the end of this section. However, it has been shown that this approach is not a good solution, because it requires that the aggregate function correctly matches the problem, and this is not an easy task. The better solution is to apply the multi-objective approach directly, in order to find the Pareto-optimal set of solutions for the problem. Among the many possible approaches to solving a multi-objective optimization problem, the last decade has seen Multi-Objective Evolutionary Algorithms (MOEAs) emerge as the leading method in this area. Successful applications have already been obtained in the machine learning area, for example in feature selection for SVMs [9, 10].

Experiments similar to the ones presented in this paper have been proposed in [7], where the split modified radius margin bounds and the training error were used in conjunction with the number of SVs. The experiments presented in this work differ from that approach in several ways: 1) the impact of different kernels is analyzed; 2) the simple, straightforward 2-objective formulation is considered (number of SVs and CV error) before any additional sophistication; 3) the standard NSGA-II algorithm is used instead of the NSES algorithm proposed in [7]; 4) the error is evaluated using the 5-fold cross-validation method.

There are many reasons for using a multi-objective evolutionary approach for SVM model selection:

– the ability to obtain in one run not just a single model but several models which are optimal (in the Pareto sense) with respect to the selected objectives or criteria;
– the “best” SVM model can be selected later from the Pareto front according to some higher-level information or preferences;
– multiple hyperparameters can be tuned at the same time, overcoming the limitation of the naive grid-search method;
– the objectives/criteria do not need to be differentiable (as required by gradient-based methods);
– efficient exploration of the multimodal search space associated with the parameters.

The goal of this research work is to show the effectiveness of this approach for SVM model selection using a very simple 2-objective formulation which takes into account the complexity and the accuracy of the model.

The paper is organized as follows. We first introduce SVMs and SVM model selection from the perspective of multi-objective optimization. Then we give the background on multi-objective optimization and introduce the class of multi-objective evolutionary algorithms. Section 5 reports the results obtained on a test bed of four datasets widely used in the literature. Finally, conclusions are presented and possible future lines of investigation are given.
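To make precise the aggregation approach criticized above: it amounts to fixing a weight vector and solving the single-objective problem

\[
\min_{x} \; \sum_{k=1}^{M} w_k f_k(x), \qquad w_k \ge 0,\; \sum_{k=1}^{M} w_k = 1.
\]

The solution returned depends entirely on the choice of the weights w_k, and it is a standard result that for non-convex Pareto fronts some Pareto-optimal points cannot be obtained for any choice of weights (see [4]); both issues disappear when the Pareto set is approximated directly.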
2 Multi-objective view of SVM

The first evidence of the multi-objective nature of SVMs is directly related to their standard formulation in the non-separable (soft-margin) case, the so-called C-SVM formulation:

\[
\min \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i
\quad \text{subject to} \quad y_i\,[w \cdot x_i + b] \ge 1 - \xi_i,\;\; \xi_i \ge 0,\;\; i \in [1,m]
\tag{1}
\]

where C is the regularization parameter which determines the trade-off between the margin and the sum of the slack variables Σᵢ ξᵢ. The constant C is usually determined using some heuristic approach. However, the more natural formulation of the problem is the following:

\[
\min \; \frac{1}{2}\|w\|^2, \qquad
\min \; \sum_{i=1}^{m}\xi_i
\quad \text{subject to} \quad y_i\,[w \cdot x_i + b] \ge 1 - \xi_i,\;\; \xi_i \ge 0,\;\; i \in [1,m]
\tag{2}
\]

where the objective in (1) is split into two different conflicting objectives, overcoming the problem of determining the parameter C. Even though this formulation is more natural than (1), little work on it is present in the literature. It would be interesting to analyze this problem using the theoretical approach presented by Mihalis Yannakakis in [13], where he discusses the conditions under which an approximate trade-off curve can be constructed efficiently (in polynomial time).

The multi-objective nature of SVM training is also present at the level of model selection. The typical criterion of evaluation for a classifier is the accuracy of the model in classifying newly generated points, and this metric is often used alone in order to select/generate good classifiers. However, there are many other important factors that must be taken into account when selecting an SVM model. A possible (not exhaustive) list is the following:

– number of input features;
– bound on the generalization error (e.g., radius margin bound);
– number of support vectors.

In this paper we consider the last one, the number of SVs, as an additional selection criterion.

3 Multi-Objective Optimization

When an optimization problem involves more than one objective function, the task of finding one (or more) optimum solution(s) is known as the Multi-Objective Optimization Problem (MOOP) [4]. An optimum solution with respect to one objective may not be optimum with respect to another. As a consequence, one cannot choose a solution which is optimal with respect to only one objective. In problems characterized by more than one conflicting objective there is no single optimum solution; instead there exists a set of solutions which are all optimal, called the Pareto-optimal set, whose image in objective space is the Pareto front. A general multi-objective optimization problem is defined as follows (minimization case):

\[
\begin{aligned}
\min \quad & F(x) = [f_1(x), f_2(x), \ldots, f_M(x)]\\
\text{subject to} \quad & E(x) = [e_1(x), e_2(x), \ldots, e_L(x)] \ge 0\\
& x_i^{(L)} \le x_i \le x_i^{(U)}, \qquad i = 1, \ldots, N,
\end{aligned}
\tag{3}
\]

where x = (x₁, x₂, …, x_N) is the vector of the N decision variables, M is the number of objectives fᵢ, L is the number of constraints eⱼ, and xᵢ⁽ᴸ⁾ and xᵢ⁽ᵁ⁾ are respectively the lower and upper bounds for each decision variable xᵢ.

Two different solutions are compared using the concept of dominance, which induces a strict partial order in the objective space F. A solution a is said to dominate a solution b if it is better or equal in all objectives and strictly better in at least one objective. For the minimization case:

\[
F(a) \prec F(b) \iff
\begin{cases}
f_i(a) \le f_i(b) & \forall i \in \{1, \ldots, M\}\\
\exists\, j \in \{1, \ldots, M\} : f_j(a) < f_j(b)
\end{cases}
\tag{4}
\]
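As an illustration of definition (4), the dominance test and the extraction of the nondominated set can be written in a few lines. This is a generic sketch (not code from the paper), with objective vectors represented as tuples of floats:

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated(points):
    """Return the points not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Example: (error %, number of SVs) pairs for four hypothetical SVM models
models = [(22.0, 300), (25.0, 260), (22.0, 280), (30.0, 290)]
print(nondominated(models))  # [(25.0, 260), (22.0, 280)]
```

Here (22.0, 300) is dominated by (22.0, 280) (equal error, fewer SVs), while the two surviving points are incomparable: each is better than the other in one objective.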
In the specific case of SVM model selection, the hyperparameters are the decision variables of the problem, the range of exploration for each parameter gives the bounds for each decision variable, and the model selection criteria are the objectives (no constraints are used in this formulation).

4 Method

4.1 Model selection metrics

As discussed in Section 2, there are many criteria that can be used for SVM model selection. In this section we introduce the two objectives that have been used for the simulations.

Accuracy. The most direct way to evaluate the quality of an SVM model is to consider its classification performance (accuracy). In the simplest case the data is split into a training and a validation set: the first set is used to generate the SVM model, the second to evaluate the performance of the classifier. In this work we use the more general approach called L-fold cross-validation (CV) error. The data is partitioned into L disjoint sets D₁, D₂, …, D_L and the SVM is trained L times, each time on all the data except the set Dᵢ, which is used as validation data. The accuracy (or error) is computed as the mean over the L different experiments. For reasons of computational complexity we use the 5-fold CV error for each dataset.

Number of support vectors. We know that in the hard-margin case the number of SVs is an upper bound on the expected number of errors made by the leave-one-out procedure. Moreover, the space and time complexity of the SVM classifier scales with the number of SVs. It follows that it is important to have an SVM model with a small number of support vectors (SVs). Analogously to the 5-fold CV error, the number of SVs is computed as the mean over the 5 folds of the CV procedure.
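A sketch of how these two objectives can be computed together, assuming scikit-learn's SVC (a LIBSVM wrapper, which exposes the per-class support-vector counts via n_support_) and numpy arrays X, y; the paper itself calls LIBSVM directly:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def svm_objectives(X, y, C, gamma, folds=5):
    """Return (mean 5-fold CV error %, mean number of SVs) for an RBF SVM."""
    errors, n_svs = [], []
    for train, valid in StratifiedKFold(n_splits=folds).split(X, y):
        clf = SVC(C=C, gamma=gamma, kernel="rbf").fit(X[train], y[train])
        errors.append(100.0 * (1.0 - clf.score(X[valid], y[valid])))
        n_svs.append(clf.n_support_.sum())  # total SVs over both classes
    return np.mean(errors), np.mean(n_svs)
```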
4.2 Multi-Objective Evolutionary Algorithms

Evolutionary algorithms (EAs) are search methods that take their inspiration from natural selection and survival of the fittest in the biological world. EAs differ from more traditional optimization techniques in that they search from a “population” of solutions, not from a single point. Each iteration of an EA involves a competitive selection that weeds out poor solutions. The solutions with high “fitness” are “recombined” with other solutions by swapping parts of one solution with another; solutions are also “mutated” by making a small change to a single element of the solution. Recombination and mutation are used to generate new solutions that are biased towards regions of the space where good solutions have already been seen.

Multi-Objective Evolutionary Algorithms (MOEAs) are a special class of EAs whose goal is to solve problems involving many conflicting objectives [4]. Over the last decade a steady stream of MOEAs has been proposed and studied [4, 3], and they have been successfully applied to several real-world problems (protein folding, circuit design, safety-related systems, etc.), even though no strong proof of convergence is available. Among the growing class of MOEAs, in this work we employ the well-known NSGA-II [5] (Nondominated Sorting Genetic Algorithm II). NSGA-II is based on a fast nondominated sorting approach which sorts a population of solutions into different nondomination levels; it then uses elitism and a crowded-comparison operator for diversity preservation.

[Figure 2 (diagram) appears here: hyperparameters flow from NSGA-II to LIBSVM, which returns the error and mean number of SVs on 5-fold cross-validation; the output Pareto fronts (trade-off curves) feed the decision-making phase and the test on new data.]
Fig. 2. NSGA-II and LIBSVM pipeline.

Table 1. Benchmark datasets.

Name        Size   Features  Repository
diabetes     768      8      UCI
australian   690     14      Statlog
german     1,000     24      Statlog
splice     1,000     60      Delve

5 Results

5.1 Experiments

In this research work we deal with the standard application of SVMs to binary classification. We use a common benchmark of four datasets (Table 1 shows their characteristics). We consider two different kernels and their parameters:

– RBF (radial basis function): K(u, v) = exp(−γ ‖u − v‖²)
– Sigmoid: K(u, v) = tanh(γ uᵀv + coef₀)

It follows that the hyperparameters considered are respectively (C, γ) for the RBF kernel and (C, γ, coef₀) for the sigmoid kernel. The parameter ranges are: log₂ C ∈ [−5, 15], log₂ γ ∈ [−10, 4], coef₀ ∈ [0, 1]. According to the values suggested in [5], the NSGA-II parameters are set as follows: p_c = 0.9, p_m = 0.1, ν_c = 10, ν_m = 20. No effort has been spent in this work on tuning these parameters, which would clearly improve the efficiency of the algorithm. A population size of 60 individuals is used and each simulation is carried out for a total of 250 generations. Each plot shows the Pareto fronts (trade-off curves) of all the points (SVM models) sampled by the algorithm after the first 50 generations. As described later, 50 iterations are enough to converge towards the final approximated Pareto front. SVMs are constructed using the LIBSVM¹ library [1], version 2.84. Figure 2 shows the interaction between NSGA-II and the LIBSVM library.

¹ http://www.csie.ntu.edu.tw/~cjlin/libsvm

[Figure 3 appears here.]
Fig. 3. Diabetes dataset: Pareto front of the sampled points using RBF (a) and sigmoid (b) kernels; mean evolution of the population for the error and the number of SVs during the optimization of NSGA-II using the RBF (c) and sigmoid (d) kernels.
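Concretely, the evaluation loop of Figure 2 can be reproduced with an off-the-shelf NSGA-II implementation. The following sketch assumes the pymoo library (not used in the paper, which couples NSGA-II directly with LIBSVM), reuses the hypothetical svm_objectives helper sketched in Section 4.1, and takes a stand-in dataset, population size and generation count matching the setup above:

```python
from pymoo.core.problem import ElementwiseProblem
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.optimize import minimize
from sklearn.datasets import load_breast_cancer  # stand-in dataset for illustration

class SVMModelSelection(ElementwiseProblem):
    """Decision variables: (log2 C, log2 gamma); objectives: (CV error %, mean #SVs)."""
    def __init__(self, X, y):
        super().__init__(n_var=2, n_obj=2, xl=[-5.0, -10.0], xu=[15.0, 4.0])
        self.X, self.y = X, y

    def _evaluate(self, x, out, *args, **kwargs):
        err, svs = svm_objectives(self.X, self.y, C=2.0**x[0], gamma=2.0**x[1])
        out["F"] = [err, svs]

X, y = load_breast_cancer(return_X_y=True)
res = minimize(SVMModelSelection(X, y), NSGA2(pop_size=60), ("n_gen", 250), seed=1)
print(res.F)  # approximated Pareto front: rows of (error %, mean #SVs)
```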
5.2 Discussion

Figures 3, 4, 5 and 6 show the results obtained using the experimental protocol defined above. Inspecting the results, we observe first of all that approximate Pareto fronts are effectively obtained for each of the datasets, showing that the two chosen objectives do conflict. This is also evident from the analysis of the evolution curves: an improvement in one objective is nearly always accompanied by a worsening in the other, but the interaction during the evolution produces a global minimization of both objectives.

The choice of the kernel clearly affects the final outcome of the optimization algorithm. In particular, the RBF kernel shows better performance than the sigmoid kernel. Inspecting the Pareto fronts obtained, we note that the RBF kernel yields a better distribution of solutions along the two objectives. This is an important factor in multi-objective optimization: we want Pareto fronts with a wide range of values, so that the selection of a final point in the second step (decision making) is facilitated.

[Figure 4 appears here.]
Fig. 4. Australian dataset: Pareto front of the sampled points using RBF (a) and sigmoid (b) kernels; mean evolution of the population for the error and the number of SVs during the optimization of NSGA-II using the RBF (c) and sigmoid (d) kernels.

For each dataset we also plot the mean evolution curves for the error and the number of support vectors over the population of SVM models at each iteration. Inspecting the plots, we observe that the algorithm generally converges very quickly to a set of good SVM models (first 50 iterations); it then uses the rest of the time to explore the space of solutions locally for an additional finer refinement.

If we compare the accuracy of the SVM models obtained using this method with other approaches in the literature, we find comparable results. For example, the best error obtained for the diabetes dataset with this approach is 21.7, while the errors obtained by Keerthi in [8], Chapelle in [2] and Staelin in [11] are respectively 24.33, 23.19 and 20.3. Similarly, for the splice dataset we obtain an error of 12.4, while the errors obtained by Keerthi in [8] and Staelin in [11] are respectively 10.16 and 11.7.

[Figure 5 appears here.]
Fig. 5. Splice dataset: Pareto front of the sampled points using RBF (a) and sigmoid (b) kernels; mean evolution of the population for the error and the number of SVs during the optimization of NSGA-II using the RBF (c) and sigmoid (d) kernels.

An important advantage of this approach is that, together with models that are good in terms of accuracy, the algorithm also generates many other models with different numbers of support vectors, which are relevant whenever the complexity of the final model is an important factor in the final model selection. For example, in the case of the splice dataset, we might be happy to lose some degree of accuracy, and select a solution with an error of 14% instead of 12%, in favor of a model that has a much lower complexity: 370 SVs instead of 570 (see Figure 5).
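Once a preference of this kind is fixed, the decision-making step can be made explicit. A generic sketch (a hypothetical helper, not part of the paper): among the Pareto-optimal models, select the one with the fewest SVs whose error stays within a given tolerance of the best error on the front:

```python
def select_model(front, max_error_loss=2.0):
    """front: list of (error %, num SVs) Pareto-optimal points.
    Return the point with fewest SVs among those whose error is within
    max_error_loss percentage points of the best error on the front."""
    best_error = min(err for err, _ in front)
    candidates = [p for p in front if p[0] <= best_error + max_error_loss]
    return min(candidates, key=lambda p: p[1])

# Toy front shaped like the splice example discussed above
front = [(12.0, 570), (13.0, 480), (14.0, 370), (16.0, 300)]
print(select_model(front))  # (14.0, 370): trades 2% error for ~200 fewer SVs
```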
[Figure 6 appears here.]
Fig. 6. German dataset: Pareto front of the sampled points using RBF (a) and sigmoid (b) kernels; mean evolution of the population for the error and the number of SVs during the optimization of NSGA-II using the RBF (c) and sigmoid (d) kernels.

6 Conclusions and possible future investigations

The SVM model selection problem clearly presents the characteristics of a multi-objective optimization problem. The results of this experimental work have shown that it is possible to effectively obtain approximated Pareto fronts of SVM models based on a simple 2-objective formulation where the accuracy and the complexity of the model are compared for Pareto dominance. This approach makes it possible to visualize the characteristic trade-off curve for a specific dataset, from which the user can select a specific model according to their own preferences and computational needs. The proposed method also obtains results comparable to other approaches in the literature, but with the advantage that a set of Pareto-optimal solutions (not a single one) is generated as output.

Of course a deeper investigation is required, and many different lines of investigation can be considered:

– extending the formulation from 2 objectives to k objectives (k > 2), including other important criteria of model selection (such as the number of input features);
– studying the performance of the proposed approach in the regression case;
– adapting the approach to the multi-class case, where it is harder to choose appropriate values for the base binary models of a decomposition scheme.

References

1. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
2. Olivier Chapelle, Vladimir Vapnik, Olivier Bousquet, and Sayan Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3):131–159, 2002.
3. Carlos A. Coello Coello and G. B. Lamont. Applications of Multi-Objective Evolutionary Algorithms. World Scientific, 2004.
4. Kalyanmoy Deb. Multi-Objective Optimization Using Evolutionary Algorithms. John Wiley & Sons, Inc., New York, NY, USA, 2001.
5. Kalyanmoy Deb, Samir Agrawal, Amrit Pratap, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197, 2002.
6. Tobias Glasmachers and Christian Igel. Gradient-based adaptation of general Gaussian kernels. Neural Computation, 17(10):2099–2105, 2005.
7. Christian Igel. Multi-objective model selection for support vector machines. Evolutionary Multi-Criterion Optimization, pages 534–546, 2005.
8. S. S. Keerthi. Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms. IEEE Transactions on Neural Networks, 13:1225–1229, 2002.
9. S. Pang and N. Kasabov. Inductive vs. transductive inference, global vs. local models: SVM, TSVM, and SVMT for gene expression classification problems. International Joint Conference on Neural Networks (IJCNN), 2:1197–1202, 2004.
10. S. Y. M. Shi, P. N. Suganthan, and K. Deb. Multi-class protein fold recognition using multi-objective evolutionary algorithms. IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pages 61–66, 2004.
11. Carl Staelin. Parameter selection for support vector machines. HP Labs Technical Reports, 2002.
12. Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995.
13. Mihalis Yannakakis. Approximation of multiobjective optimization problems. In Algorithms and Data Structures: 7th International Workshop (WADS), 2001.