An Experimental Multi-Objective Study of the SVM Model Selection Problem
Giuseppe Narzisi
Courant Institute of Mathematical Sciences
New York, NY 10012, USA
narzisi@nyu.edu
Abstract. Support Vector Machines (SVMs) are a powerful method for both regression and classification. However, any SVM formulation requires the user to set two or more parameters which govern the training process, and such parameters can have a strong effect on the resulting performance of the engine. Moreover, the design of learning systems is inherently a multi-objective optimization problem: it requires finding a suitable trade-off between at least two conflicting objectives, model complexity and accuracy. In this work the SVM model selection problem is cast as a multi-objective optimization problem, where the error and the number of support vectors of the model define the two objectives. Experimental analysis is presented on a well-known test bed of datasets using two different kernels: RBF and sigmoid.
Key words: Support Vector Machine, Multi-Objective Optimization, NSGA-II, SVM Model Selection.
1 Introduction
Support Vector Machines have been proven to be very effective methods for classification and regression [12]. However, in order to obtain good generalization errors the user needs to choose appropriate values for the parameters of the model. The kernel parameters, together with the regularization parameter C, are called hyperparameters of the SVM, and the problem of tuning them in order, for example, to improve the generalization error of the model is called the SVM model selection problem. The standard method to determine the hyperparameters is grid search. In the simple grid-search approach the hyperparameters are varied with a fixed step-size through a wide range of values and the performance of every combination is measured. Because of its computational complexity, grid search is only suitable for the adjustment of very few parameters. Further, the choice of the discretization of the search space may be crucial. Figure 1 shows the typical parameter surface for the error and the number of support vectors as a function of the hyperparameters C and γ for the diabetes dataset. Recently, gradient-based approaches have been explored for choosing the hyperparameters [2, 6, 8]. However, they have some drawbacks and limitations. First of all, the score function used to evaluate the quality of a set of hyperparameters must be differentiable, which excludes important measures such as the number of support vectors. Also, because the objective function is strongly multimodal, the performance of gradient-based heuristics depends on the initialization, which means that the algorithm can easily get stuck in a sub-optimal local minimum.
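To make the baseline concrete, the naive grid search described above amounts to the following sketch. This is an illustration rather than code from the paper: it assumes scikit-learn (whose SVC class wraps LIBSVM) and placeholder data X, y, and reuses the log₂-spaced parameter ranges adopted later in Section 5.

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def naive_grid_search(X, y):
    # Exhaustive search over a fixed log2-spaced grid of (C, gamma).
    best = (None, None, 0.0)            # (C, gamma, CV accuracy)
    for log2C in range(-5, 16):         # log2 C in [-5, 15]
        for log2g in range(-10, 5):     # log2 gamma in [-10, 4]
            clf = SVC(C=2.0 ** log2C, gamma=2.0 ** log2g, kernel='rbf')
            acc = cross_val_score(clf, X, y, cv=5).mean()  # 5-fold CV accuracy
            if acc > best[2]:
                best = (2.0 ** log2C, 2.0 ** log2g, acc)
    return best
```

Even this coarse grid trains 21 × 15 = 315 cross-validated models for a single kernel, and the fixed step size determines which regions of the multimodal surface are ever sampled.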
Fig. 1. Parameter surface of the error (a) and the number of SVs (b) as a function of
the two hyperparameters C and γ for the diabetes dataset using 5-fold cross-validation.
The main idea missing from these approaches is that the SVM model selection problem is inherently a multi-objective optimization problem. Designing supervised learning systems for classification requires finding a suitable trade-off between several objectives. Typically we want to reduce the complexity of the model and at the same time obtain a model with high accuracy (or low error rate). A model with the best generalization error may not be the best choice if the price to pay is working with a very complex model, both in terms of time and space. Usually this problem is tackled by aggregating the objectives into a scalar function (a linear weighting of the objectives) and applying standard methods to the resulting single-objective optimization problem. However, it has been shown that this approach is not a good solution, because it requires that the aggregate function correctly matches the problem, and this is not an easy task.
The best solution is to apply the multi-objective approach directly, in order to find the Pareto-optimal set of solutions for the problem. Among the many possible approaches to solving a multi-objective optimization problem, the last decade has seen Multi-Objective Evolutionary Algorithms (MOEAs) emerge as the leading method in this area. Successful applications have already been obtained in the machine learning area in the case of feature selection for SVMs [9, 10].
Experiments similar to the ones presented in this paper have been proposed in [7], where the split modified radius margin bounds and the training error were used in conjunction with the number of SVs. The experiments presented in this work differ from that approach in many ways: 1) the impact of different kernels is analyzed; 2) a simple, straightforward 2-objective formulation is considered (num. of SVs and CV error) before any additional sophistication; 3) the standard NSGA-II algorithm is used instead of the NSES algorithm proposed in [7]; 4) the error is evaluated using the 5-fold cross-validation method.
There are many reasons for using a multi-objective evolutionary approach for SVM model selection:
– the ability to obtain in one run not just a single model, but several models which are optimal (in the Pareto sense) with respect to the selected objectives or criteria;
– the “best” SVM model can be selected later from the Pareto front according to some higher-level information or preferences;
– multiple hyperparameters can be tuned at the same time, overcoming the limitation of the naive grid-search method;
– the objectives/criteria do not need to be differentiable (as required by gradient-based methods);
– efficient exploration of the multimodal search space associated with the parameters.
The goal of this research work is to show the effectiveness of this approach
for SVM model selection using a very simple 2-objective formulation which takes
into account the complexity and the accuracy of the model.
The paper is organized as follows. We first introduce SVMs and SVM model selection from the perspective of multi-objective optimization. We then give background on multi-objective optimization and introduce the class of multi-objective evolutionary algorithms. Section 5 reports the results obtained on a test bed of four datasets widely used in the literature. Finally, conclusions are presented and possible future lines of investigation are given.
2 Multi-objective view of SVM
The first evidence of the multi-objective nature of SVMs is directly related to
their standard formulation in the non-separable case, the so-called C-SVM formulation:

    min (1/2)‖w‖² + C ∑_{i=1}^{m} ξ_i
    subject to y_i [w · x_i + b] ≥ 1 − ξ_i,  ξ_i ≥ 0,  i ∈ [1, m]        (1)

where C is the regularization parameter which determines the trade-off between the margin and the sum of the slack variables ∑_{i=1}^{m} ξ_i. The constant C is usually determined using some heuristic approach. However, the more natural formulation of the problem is the following:

    min (1/2)‖w‖²
    min ∑_{i=1}^{m} ξ_i
    subject to y_i [w · x_i + b] ≥ 1 − ξ_i,  ξ_i ≥ 0,  i ∈ [1, m]        (2)
where the objective in (1) is split into two conflicting objectives, overcoming the problem of determining the parameter C. Even though this formulation is more natural than (1), little effort on this problem is present in the literature. It would be interesting to analyze this problem using the theoretical approach presented by Mihalis Yannakakis in [13], where he discusses the conditions under which an approximate trade-off curve can be constructed efficiently (in polynomial time).
The multi-objective nature of SVM training is also present at the level of model selection. The typical criterion of evaluation for a classifier is the accuracy of the model in classifying newly generated points, and this metric is often used alone in order to select/generate good classifiers. However, there are many other important factors that must be taken into account when selecting an SVM model. A possible (non-exhaustive) list is the following:
– number of input features;
– bound on the generalization error (e.g., radius margin bound);
– number of support vectors.
In this paper we consider the last one, the number of SVs, as an additional selection criterion.
3 Multi-Objective Optimization
When an optimization problem involves more than a single-valued objective function, the task of finding one (or more) optimum solution(s) is known as a Multi-Objective Optimization Problem (MOOP) [4]. An optimum solution with respect to one objective may not be optimum with respect to another objective. As a consequence, one cannot choose a solution which is optimal with respect to only one objective. In problems characterized by more than one conflicting objective there is no single optimum solution; instead, there exists a set of solutions which are all optimal, called the Pareto-optimal front.
A general multi-objective optimization problem is defined as follows (minimization case):

    min F(x) = [f_1(x), f_2(x), ..., f_M(x)]
    subject to E(x) = [e_1(x), e_2(x), ..., e_L(x)] ≥ 0,
    x_i^(L) ≤ x_i ≤ x_i^(U),  i = 1, ..., N,                              (3)

where x = (x_1, x_2, ..., x_N) is the vector of the N decision variables, M is the number of objectives f_i, L is the number of constraints e_j, and x_i^(L) and x_i^(U) are respectively the lower and upper bounds for each decision variable x_i. Two different solutions are compared using the concept of dominance, which induces a strict partial order in the objective space F. A solution a is said to dominate a solution b if it is better or equal in all objectives and better in at least one objective. For the minimization case we have:

    F(a) ≺ F(b)  iff  f_i(a) ≤ f_i(b) ∀i ∈ {1, ..., M}  and  ∃j ∈ {1, ..., M} : f_j(a) < f_j(b)        (4)
In the specific case of SVM model selection, the hyperparameters are the decision variables of the problem, the range of exploration for each parameter gives the bounds for each decision variable, and the model selection criteria are the objectives (no constraints are used in this formulation).
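For concreteness, the dominance relation (4) translates directly into a few lines of code; the following is an illustrative sketch, not code from the paper:

```python
def dominates(fa, fb):
    """Pareto dominance for minimization (Eq. 4): fa dominates fb iff it is
    no worse in every objective and strictly better in at least one."""
    return (all(a <= b for a, b in zip(fa, fb))
            and any(a < b for a, b in zip(fa, fb)))

# E.g., with objective vectors (CV error %, num. of SVs):
# dominates((22.0, 300), (24.0, 310)) -> True
# dominates((22.0, 350), (24.0, 300)) -> False (incomparable: both can sit on the front)
```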
4 Method
4.1 Model selection metrics
As discussed in Section 2 there are many criteria that can be used for SVM
model selection. In this section we introduce the two objectives that have been
used for the simulations.
Accuracy. The most direct way to evaluate the quality of an SVM model is to consider its classification performance (accuracy). In the simplest case the data is split into a training and a validation set. The first set is used to generate the SVM model, the second set is used to evaluate the performance of the classifier. In this work we use the more general approach of L-fold cross-validation (CV) error. The data is partitioned into L disjoint sets D_1, D_2, ..., D_L and the SVM is trained L times, each time on all data except the set D_i, which is then used as validation data. The accuracy (or error) is computed as the mean over the L experiments. For reasons of computational complexity we use 5-fold CV error for each dataset.
Number of support vectors. We know that in the hard-margin case the number of SVs is an upper bound on the expected number of errors made by the leave-one-out procedure. Moreover, the space and time complexity of the SVM classifier scales with the number of SVs. It follows that it is important to have an SVM model with a small number of support vectors (SVs). Similarly to the 5-fold CV error, the number of SVs is computed as the mean over the 5 folds of the CV procedure.
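Putting the two metrics together, each candidate hyperparameter setting maps to a pair of objective values. Below is a sketch of this evaluation assuming scikit-learn's LIBSVM wrapper; the function name svm_objectives is ours, introduced here for illustration:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import KFold

def svm_objectives(X, y, C, gamma, kernel='rbf', coef0=0.0, folds=5):
    """Return (mean CV error %, mean num. of SVs) over the CV folds."""
    errors, n_svs = [], []
    for tr, va in KFold(n_splits=folds, shuffle=True, random_state=0).split(X):
        clf = SVC(C=C, gamma=gamma, kernel=kernel, coef0=coef0).fit(X[tr], y[tr])
        errors.append(100.0 * (1.0 - clf.score(X[va], y[va])))  # fold error in %
        n_svs.append(len(clf.support_))  # support vectors of this fold's model
    return float(np.mean(errors)), float(np.mean(n_svs))
```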
4.2 Multi-Objective Evolutionary Algorithms
Evolutionary algorithms (EAs) are search methods that take their inspiration
from natural selection and survival of the fittest in the biological world. EAs
differ from more traditional optimization techniques in that they involve a search
from a “population” of solutions, not from a single point. Each iteration of an
EA involves a competitive selection that weeds out poor solutions. The solutions
with high “fitness” are “recombined” with other solutions by swapping parts of
a solution with another. Solutions are also “mutated” by making a small change
to a single element of the solution. Recombination and mutation are used to
generate new solutions that are biased towards regions of the space for which
good solutions have already been seen.
Multi-Objective Evolutionary Algorithms (MOEAs) are a special class of
EAs with the goal of solving problems involving many conflicting objectives [4].
Fig. 2. NSGA-II and LIBSVM pipeline: NSGA-II (a multi-objective evolutionary algorithm) evolves a population of hyperparameter settings; the LIBSVM library evaluates each candidate by its error and mean number of SVs on 5-fold cross-validation; the output Pareto fronts (trade-off curve) feed the decision-making phase and the test on new data.
Over the last decade, a steady stream of MOEAs has continued to be proposed and studied [4, 3]. MOEAs have been successfully applied to several real-world problems (protein folding, circuit design, safety-related systems, etc.), even if no strong proof of convergence is available. Among the growing class of MOEAs, in this work we employ the well-known NSGA-II [5] (Nondominated Sorting Genetic Algorithm II). NSGA-II is based on a fast nondominated sorting approach to sort a population of solutions into different nondomination levels. It then uses elitism and a crowded-comparison operator for diversity preservation.
Table 1. Benchmark datasets.

Name        Size    Features  Repository
diabetes    768     8         UCI
australian  690     14        Statlog
german      1,000   24        Statlog
splice      1,000   60        Delve

5 Results
5.1 Experiments
In this research work we deal with the standard application of SVMs to binary classification. We used a common benchmark of four datasets (Table 1 shows their characteristics). We consider two different kernels and their parameters:
– RBF (radial basis function): K(u, v) = exp(−γ‖u − v‖²)
– Sigmoid: K(u, v) = tanh(γ uᵀv + coef₀)
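In code, and assuming NumPy, these two kernel functions are a direct transcription of the formulas above (a sketch for illustration only):

```python
import numpy as np

def rbf_kernel(u, v, gamma):
    # K(u, v) = exp(-gamma * ||u - v||^2)
    return np.exp(-gamma * np.sum((np.asarray(u) - np.asarray(v)) ** 2))

def sigmoid_kernel(u, v, gamma, coef0):
    # K(u, v) = tanh(gamma * u^T v + coef0)
    return np.tanh(gamma * np.dot(u, v) + coef0)
```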
It follows that the hyperparameters considered will be respectively (C, γ) for the RBF kernel and (C, γ, coef₀) for the sigmoid kernel. The parameter ranges are: log₂ C ∈ [−5, 15], log₂ γ ∈ [−10, 4], coef₀ ∈ [0, 1]. According to the values suggested in [5], the NSGA-II parameters are set as follows: p_c = 0.9, p_m = 0.1, η_c = 10, η_m = 20. No effort has been spent in this work to tune these parameters, which would clearly improve the efficiency of the algorithm.
A population size of 60 individuals is used and each simulation is carried out for a total of 250 generations. Each plot shows the Pareto fronts (trade-off curves) of all the points (SVM models) sampled by the algorithm after the first 50 generations. As described later, 50 iterations are enough to converge to the final approximated Pareto front.
SVMs are constructed using the LIBSVM¹ library [1], version 2.84. Figure 2 shows the interaction between NSGA-II and the LIBSVM library.
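The paper couples its own NSGA-II runs with LIBSVM; as a hedged illustration of the same pipeline, the sketch below encodes (log₂ C, log₂ γ) as decision variables and reuses the svm_objectives helper from Section 4.1. The use of the pymoo library is our substitution for illustration; it is not the implementation used in the paper.

```python
import numpy as np
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.core.problem import ElementwiseProblem
from pymoo.optimize import minimize

class SVMModelSelection(ElementwiseProblem):
    """Decision variables: (log2 C, log2 gamma) within the ranges above.
    Objectives (both minimized): 5-fold CV error % and mean num. of SVs."""
    def __init__(self, X, y):
        super().__init__(n_var=2, n_obj=2,
                         xl=np.array([-5.0, -10.0]), xu=np.array([15.0, 4.0]))
        self.X, self.y = X, y

    def _evaluate(self, x, out, *args, **kwargs):
        err, svs = svm_objectives(self.X, self.y, C=2.0 ** x[0], gamma=2.0 ** x[1])
        out["F"] = [err, svs]

# Usage sketch: population of 60 evolved for 250 generations, as in the paper.
# res = minimize(SVMModelSelection(X, y), NSGA2(pop_size=60), ('n_gen', 250), seed=1)
# res.F then holds the sampled approximation of the Pareto front.
```

For the sigmoid kernel, a third decision variable for coef₀ ∈ [0, 1] would be added in the same way.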
Fig. 3. Diabetes dataset: Pareto front of the sampled points using RBF (a) and sigmoid
(b) kernel; mean evolution of the population for the error and the number of SVs during
the optimization of NSGA-II using RBF kernel (c) and sigmoid (d) kernel.
¹ http://www.csie.ntu.edu.tw/~cjlin/libsvm
5.2 Discussion
Figures 3, 4, 5 and 6 show the results obtained using the experimental protocol defined above. Inspecting the results we observe, first of all, that approximate Pareto fronts are effectively obtained for each of the datasets, showing that the two chosen objectives exhibit conflicting behavior. This is also evident from the analysis of the evolution curves: an improvement in one objective is nearly always accompanied by a worsening in the other, but the interaction during the evolution produces a global minimization of both objectives.
The choice of the kernel clearly affects the final outcome of the optimization algorithm. In particular, the RBF kernel shows better performance than the sigmoid kernel. Inspecting the Pareto fronts obtained, we note that the RBF kernel yields a better distribution of solutions along the two objectives. This is an important factor in multi-objective optimization: we want Pareto fronts with a wide range of values, so that the selection of a final point in the second step (decision making) is facilitated.
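Once a front is available, the decision-making step reduces to applying a preference. A minimal sketch, assuming front is a list of (error_pct, n_svs, params) tuples collected from the optimizer (names are illustrative), that picks the most accurate model within a budget on the number of SVs:

```python
def select_model(front, max_svs):
    """Most accurate Pareto-front model whose complexity fits the SV budget."""
    feasible = [m for m in front if m[1] <= max_svs]
    return min(feasible, key=lambda m: m[0]) if feasible else None

# E.g., on the splice front one might trade 12% error at ~570 SVs for
# 14% error at ~370 SVs by setting max_svs = 400 (cf. the discussion below).
```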
Fig. 4. Australian dataset: Pareto front of the sampled points using RBF (a) and
sigmoid (b) kernel; mean evolution of the population for the error and the number of
SVs during the optimization of NSGA-II using RBF kernel (c) and sigmoid (d) kernel.
For each dataset we also plot the mean evolution curves for the error and the number of support vectors over the population of SVM models at each iteration. Inspecting the plots, we observe that the algorithm generally converges very quickly to a set of good SVM models (first 50 iterations). It then uses the rest of the time to locally explore the solution space for finer refinement.
If we compare the accuracy of the SVM models obtained using this method with other approaches in the literature, we find comparable results. For example, the best error obtained for the diabetes dataset with this approach is 21.7, while the errors obtained by Keerthi in [8], Chapelle in [2] and Staelin in [11] are respectively 24.33, 23.19 and 20.3. Similarly, for the splice dataset we obtain an error of 12.4, while the errors obtained by Keerthi in [8] and Staelin in [11] are respectively 10.16 and 11.7.
Fig. 5. Splice dataset: Pareto front of the sampled points using RBF (a) and sigmoid
(b) kernel; mean evolution of the population for the error and the number of SVs during
the optimization of NSGA-II using RBF kernel (c) and sigmoid (d) kernel.
An important advantage of this approach is that, together with models that are good in terms of accuracy, the algorithm also generates many other models with a different number of support vectors, which are relevant when the complexity of the final model is an important factor for the final model selection. For example, in the case of the splice dataset, we might be happy to lose some degree of accuracy, and select a solution with an error of 14% instead of 12%, in favor of a model that has much lower complexity: 370 SVs instead of 570 (see figure 5).
Fig. 6. German dataset: Pareto front of the sampled points using RBF (a) and sigmoid
(b) kernel; mean evolution of the population for the error and the number of SVs during
the optimization of NSGA-II using RBF kernel (c) and sigmoid (d) kernel.
6 Conclusions and possible future investigations
The SVM model selection problem clearly presents the characteristics of a multi-objective optimization problem. The results of this experimental work have shown that it is possible to effectively obtain approximated Pareto fronts of SVM models based on a simple 2-objective formulation, where the accuracy and the complexity of the model are compared for Pareto dominance.
This approach makes it possible to visualize the characteristic trade-off curve for a specific dataset, from which the user can select a specific model according to his or her own preferences and computational needs.
The proposed method also obtains results comparable to other approaches in the literature, but with the advantage that a set of Pareto-optimal solutions (not a single one) is generated as output.
Of course, a deeper investigation is required, and many different lines of investigation can be considered:
– extending the formulation from 2 objectives to k objectives (k > 2), including other important model selection criteria (such as the number of input features);
– studying the performance of the proposed approach in the regression case;
– adapting the approach to the multi-class case, where it is harder to choose appropriate values for the base binary models of a decomposition scheme.
References
1. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
2. Olivier Chapelle, Vladimir Vapnik, Olivier Bousquet, and Sayan Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3):131–159, 2002.
3. Carlos A. Coello Coello and Gary B. Lamont. Applications of Multi-Objective Evolutionary Algorithms. World Scientific, 2004.
4. Kalyanmoy Deb. Multi-Objective Optimization Using Evolutionary Algorithms. John Wiley & Sons, Inc., New York, NY, USA, 2001.
5. Kalyanmoy Deb, Samir Agrawal, Amrit Pratap, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197, 2002.
6. Tobias Glasmachers and Christian Igel. Gradient-based adaptation of general Gaussian kernels. Neural Computation, 17(10):2099–2105, 2005.
7. Christian Igel. Multi-objective model selection for support vector machines. Evolutionary Multi-Criterion Optimization, pages 534–546, 2005.
8. S. S. Keerthi. Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms. IEEE Transactions on Neural Networks, 13:1225–1229, 2002.
9. S. Pang and N. Kasabov. Inductive vs. transductive inference, global vs. local models: SVM, TSVM, and SVMT for gene expression classification problems. International Joint Conference on Neural Networks (IJCNN), 2:1197–1202, 2004.
10. S. Y. M. Shi, P. N. Suganthan, and K. Deb. Multi-class protein fold recognition using multi-objective evolutionary algorithms. IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pages 61–66, 2004.
11. Carl Staelin. Parameter selection for support vector machines. HP Labs Technical Reports, 2002.
12. Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995.
13. Mihalis Yannakakis. Approximation of multiobjective optimization problems. Algorithms and Data Structures: 7th International Workshop (WADS), 2001.