Reducing the Training Time of Support Vector Machines by Barycentric Correction Procedure
ARIEL LUCIEN GARCIA and NEIL HERNANDEZ GRESS
Department of Computer Science
ITESM, Campus Estado de México
Carretera Lago de Guadalupe Km. 3.5, Atizapán de Zaragoza, Estado de México
Abstract: - The chunking algorithm developed by Vapnik has been shown to be an effective method for reducing the training time of Support Vector Machines compared with classical QP optimization. However, a remaining problem with chunking is its optimization rate on real problems. The objective of the method is to iteratively hunt for the Support Vectors while maintaining a reduced training set, thus simplifying the QP optimization. This task is generally demanding in computational time, since the procedure is initialized with randomly selected patterns. By using BCP, we can select the initial vectors to be trained and consequently reduce substantially the training time of Support Vector Machines. Experimental results on well-known benchmarks are discussed to show the capability of the proposed method.
Key-Words: - SVM, Chunking, BCP, Kernel functions, Training speed, Classification.
1 Introduction
Support Vector Machines (SVMs) are a well-known technique for training separating functions in pattern recognition tasks and for function estimation in regression problems [8]. They have shown their generalization capabilities in several problems. The mathematical problem and its solution were settled by Vapnik and Chervonenkis [1]. In classification tasks, the main idea can be stated as follows: given a training data set characterized by $l$ patterns $x_i \in \mathbb{R}^n$, $i = 1,\ldots,l$, belonging to two possible classes $y_i \in \{-1, 1\}$, there exists a solution represented by the following optimization problem:
Minimize:

$$W(\alpha) = -\sum_{i=1}^{l} \alpha_i + \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j\, k(x_i, x_j) \qquad (1)$$

Subject to:

$$\sum_{i=1}^{l} \alpha_i y_i = 0 \qquad (2)$$

$$0 \leq \alpha_i \leq C \qquad (3)$$

The solution to the problem formulated in eqs. (1)-(3) is a vector $\alpha^* \geq 0$ for which the $\alpha_i^*$ strictly greater than zero correspond to the Support Vectors (SVs). While solving the optimization problem, constraints (2) and (3) are satisfied and thus (1) is minimized. The decision function is

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{l} y_i \alpha_i\, k(x, x_i) + b\right) \qquad (4)$$

The above formulation stands for a standard Quadratic Programming (QP) problem. It is possible to define the problem in matrix notation:

Minimize:

$$W(\alpha) = -\alpha^{T}\mathbf{1} + \frac{1}{2}\,\alpha^{T} Q\, \alpha \qquad (5)$$

Subject to:

$$\alpha^{T} y = 0 \qquad (6)$$

$$0 \leq \alpha_i \leq C \qquad (7)$$

On the other hand, the Barycentric Correction Procedure (BCP) is a very fast method (more than 70,000 times faster than the perceptron) for training linearly separable problems [3]. The result is a hyperplane separating the patterns of the data set into two classes.
This paper deals with a new idea combining BCP and chunking in order to reduce the training time of SVMs. Other ideas concerning time consumption and memory requirements are discussed in [6, 7, 8]. Several other ideas are also developed in [5]. Support Vectors are patterns of the original data set that define, and thus lie near, the separating hyperplane and its margin. Here, the goal is to exploit the advantages of BCP to detect patterns near the margin, even if these patterns still have to be fine-tuned.
This paper is structured as follows: in section 2, we discuss the standard SVM formulation and its optimization problems. Section 3 describes the Barycentric Correction Procedure used in the proposed method. In section 4, the original idea and its algorithm are discussed; it combines chunking and BCP in order to improve the training speed of SVMs. Finally, experimental results for four benchmarks, i) Adult, ii) Pima Indians Diabetes, iii) Tic-Tac-Toe, and iv) Iris, are presented.
2 Support Vector Problem
Potential disadvantages of SVMs are: a) the method may require a prohibitive computational time for real problems, and b) the quadratic form in (5) involves an l x l Hessian matrix, producing memory limitations for large-scale problems. Equations (1)-(4) show that, given a training data set, one can build the Hessian matrix Q and then solve the problem with a classical QP solver. Some of the most used algorithms found in the literature are LOQO [9], MinQ, MINOS [10], and QP. This approach is very time consuming, so it is not viable for real applications. One approach found in the literature to make the problem tractable is to decompose the QP problem into a series of smaller QP sub-problems. One example is the chunking algorithm [1].
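To make the memory issue concrete, the sketch below builds the full l x l Hessian for an RBF kernel; it is purely illustrative (the function names and the NumPy implementation are ours, not the paper's code).

```python
# Illustrative sketch: building the dense l x l Hessian Q_ij = y_i y_j k(x_i, x_j)
# used by a classical QP solver. For l patterns this needs O(l^2) memory,
# which is what motivates decomposition methods such as chunking.
import numpy as np

def rbf_kernel(X, Z, gamma=0.5):
    """k(x, z) = exp(-gamma * ||x - z||^2), the kernel used in most experiments here."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def build_hessian(X, y, gamma=0.5):
    """Full dense Hessian of the dual problem (5); e.g. 768 patterns -> 768*768 floats."""
    K = rbf_kernel(X, X, gamma)
    return (y[:, None] * y[None, :]) * K
```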
2.1 Chunking
The chunking algorithm breaks down the problem (eqs. 1-3) into smaller QP sub-problems (or chunks) that are optimized iteratively. The algorithm is based on the observation that the solution of the QP optimization is the same if the vectors with α = 0 are removed. The idea is to build each sub-problem from the vectors whose α* is non-zero, together with the M worst examples violating the KKT conditions, and then solve the sub-problem. Each QP sub-problem is initialized with the results of the previous one.
The general chunking procedure is described as follows:

1. Decompose the original training data base into:
   a) a training data set (TRN), and
   b) a testing data set (TST).
2. Optimize the reduced TRN by classical QP optimization.
3. Obtain the Support Vectors from the previous optimization.
4. Test the Support Vectors on TST by using the modified generalization condition (see eq. 4):

$$y_i \left( \sum_{j=1}^{l} y_j \alpha_j^{*}\, k(x_j, x_i) + b \right) \geq 1$$

5. Recombine the Support Vectors and the testing errors into a new training set (TRNnew).
6. Repeat from step 2 until no errors are found in step 4.
The first chunk (TRN) is composed of q training patterns randomly selected from the original data set (q is smaller than l). The starting point of the algorithm is therefore blind: the initial solution can be near or far from the real one. By using the chunking procedure, the training time of SVMs is reduced, as shown in [1, 2], but it is observed that even chunking remains time consuming for large-scale problems. This is why we propose a more intelligent initialization, described in section 4.
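The sketch below illustrates this loop in Python. It assumes a generic qp_solve(K, y, C) routine standing in for any classical dual QP solver (the paper mentions LOQO, MINOS and Schittkowski's QP); the helper names, tolerance and random seed are illustrative, not the authors' implementation.

```python
# Rough sketch of the chunking loop described above (assumed implementation).
import numpy as np

def chunking(X, y, kernel, C, q, qp_solve, max_iter=100, tol=1e-8):
    """qp_solve(K, y, C) -> (alpha, b) is any standard dual QP solver (placeholder)."""
    n = len(y)
    rng = np.random.default_rng(0)
    work = list(rng.choice(n, size=q, replace=False))         # step 1: random first chunk (TRN)
    alpha, b = np.zeros(n), 0.0
    for _ in range(max_iter):
        idx = np.array(work)
        a_w, b = qp_solve(kernel(X[idx], X[idx]), y[idx], C)   # step 2: optimize the chunk
        alpha[:] = 0.0
        alpha[idx] = a_w
        sv = idx[a_w > tol]                                    # step 3: keep the Support Vectors
        f = kernel(X, X[sv]) @ (alpha[sv] * y[sv]) + b         # step 4: test with eq. (4)
        violators = [i for i in np.where(y * f < 1)[0] if i not in set(sv)]
        if not violators:                                      # step 6: stop when no errors remain
            return alpha, b
        work = list(sv) + violators                            # step 5: recombine SVs and errors
    return alpha, b
```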
3 Barycentric Correction Procedure
BCP is an algorithm based on geometrical characteristics for training a threshold unit [3]. It addresses the slow convergence of algorithms like the perceptron and is more efficient for training linearly separable problems. It was proven that the algorithm converges rapidly towards a solution; it has been shown to be more than 70,000 times faster than the classical perceptron. Moreover, BCP is a parameter-free algorithm, so it does not suffer from the critical problem of tuning initial parameters. Convergence proofs are given in [4].
The main idea of BCP is to perform a controlled search of the weights. Each time the weight vector is modified, both correctly classified and misclassified patterns are taken into account. The weight vector is defined as a vector connecting two barycenters, each one belonging to a different class. In the linearly separable case, BCP converges by modifying these barycenters to achieve a better direction of the hyperplane until a solution is reached. The algorithm defines a hyperplane $w \cdot x = \theta$ dividing the input space into the two classes. Thus, we can define:
$I_1 = \{1,\ldots,N_1\}$ and $I_0 = \{1,\ldots,N_0\}$, where $N_1$ represents the number of patterns of target $1$ and $N_0$ the number of patterns of target $-1$. The training set is given by $C = C_1 \cup C_0$. Also, let $b_1$ and $b_0$ be the barycenters of the points of $C_1$ and $C_0$ respectively, weighted by the positive coefficients $\alpha = (\alpha_1,\ldots,\alpha_{N_1})$ and $\mu = (\mu_1,\ldots,\mu_{N_0})$, referred to as weighting coefficients [3]:

$$b_1 = \frac{\sum_{i \in I_1} \alpha_i\, p_i}{\sum_{i \in I_1} \alpha_i}, \qquad b_0 = \frac{\sum_{i \in I_0} \mu_i\, m_i}{\sum_{i \in I_0} \mu_i} \qquad (8)$$

So the weight vector $w$ is defined as the difference between the barycenters,

$$w = b_1 - b_0 \qquad (9)$$

and the separating function is

$$H : w \cdot x = \theta \qquad (10)$$

At every step the weight vector $w$ is modified, so the barycenters are also modified. The barycenter that should be increased is the one corresponding to the misclassified patterns; increasing the value of a particular barycenter makes the hyperplane move in that direction. For computing the bias term $\theta$, let us define $\varphi : \mathbb{R}^n \rightarrow \mathbb{R}$ as follows:

$$\varphi(p) = w \cdot p \qquad (11)$$

Now, introduce the sets

$$\Phi_1 = \{\varphi(p_i) \mid p_i \in C_1\}, \qquad \Phi_0 = \{\varphi(p_j) \mid p_j \in C_0\}, \qquad \Phi = \Phi_1 \cup \Phi_0 \qquad (12)$$

and the bias term can be calculated as

$$\theta = \frac{\min \Phi_1 + \max \Phi_0}{2} \qquad (13)$$

The weighting modifications are calculated assuming the existence of $J_1 \subseteq I_1$, defined as the index subset of misclassified examples of target $1$, and $J_0 \subseteq I_0$, referring to the misclassified examples of target $-1$. The modifications are computed as follows:

$$\alpha(\mathrm{new})_i = \alpha(\mathrm{old})_i + \varepsilon, \quad \forall i \in J_1 \qquad (14)$$

$$\mu(\mathrm{new})_j = \mu(\mathrm{old})_j + \delta, \quad \forall j \in J_0 \qquad (15)$$

where $\varepsilon$ and $\delta$ are defined as follows:

$$\varepsilon = \max\!\left(\varepsilon_{\min},\, \min\!\left(\varepsilon_{\max},\, \frac{\sum_{i \in I_0} \mu_i}{N_1}\right)\right) \qquad (16)$$

$$\delta = \max\!\left(\delta_{\min},\, \min\!\left(\delta_{\max},\, \frac{\sum_{i \in I_1} \alpha_i}{N_0}\right)\right) \qquad (17)$$

According to the different results obtained, we determined that $\varepsilon_{\min}$ and $\delta_{\min}$ can be set to 1 and $\varepsilon_{\max}$ and $\delta_{\max}$ to 30.

The general algorithm works as follows:

Initialize the weighting coefficients α and µ randomly.
1. Calculate b1, b0 and w according to (8) and (9).
2. Calculate φ, differentiating Φ1 from Φ0, using (11).
3. Compute the bias term by (13).
4. Calculate the modifications of the weighting coefficients according to (14) and (15).
5. Check whether the optimal hyperplane has been found; if so, stop, otherwise go to step 1.
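A minimal sketch of one possible implementation of this procedure is given below. The exact form of the ε/δ update and of the bias rule used in the code is our reading of eqs. (13), (16) and (17) and should be treated as an assumption rather than the authors' exact formulas; all names are illustrative.

```python
# Minimal BCP-style training sketch (assumed implementation of eqs. 8-17).
import numpy as np

def bcp_train(X, y, eps_min=1.0, eps_max=30.0, max_iter=1000, seed=0):
    """Train a hyperplane w.x = theta with a BCP-style barycenter update."""
    rng = np.random.default_rng(seed)
    I1, I0 = np.where(y == 1)[0], np.where(y == -1)[0]
    a = rng.random(len(I1)) + 0.1          # alpha: weights of the target 1 patterns
    m = rng.random(len(I0)) + 0.1          # mu: weights of the target -1 patterns
    for _ in range(max_iter):
        b1 = (a[:, None] * X[I1]).sum(axis=0) / a.sum()       # eq. (8)
        b0 = (m[:, None] * X[I0]).sum(axis=0) / m.sum()
        w = b1 - b0                                           # eq. (9)
        phi = X @ w                                           # eq. (11)
        theta = (phi[I1].min() + phi[I0].max()) / 2.0         # eq. (13), as read above
        J1 = np.where(phi[I1] <= theta)[0]                    # misclassified target 1 patterns
        J0 = np.where(phi[I0] > theta)[0]                     # misclassified target -1 patterns
        if len(J1) == 0 and len(J0) == 0:
            break                                             # optimal hyperplane found
        eps = np.clip(m.sum() / len(I1), eps_min, eps_max)    # eq. (16), assumed form
        delta = np.clip(a.sum() / len(I0), eps_min, eps_max)  # eq. (17), assumed form
        a[J1] += eps                                          # eq. (14)
        m[J0] += delta                                        # eq. (15)
    return w, theta
```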
4 Initializing Chunking Strategies by Barycentric Correction Procedure
As we have already mentioned, the goal is to reduce the training time of Support Vector Machines. The original idea presented here takes advantage of the main characteristic of BCP (its training speed) in order to pre-select the Support Vectors before optimizing by QP.
The general algorithm works as follows:

1. Train the entire data set by BCP. A hyperplane for linearly separable and non-separable problems is obtained.
2. Get the q patterns closest to the hyperplane. These patterns form TRN.
3. Optimize TRN with a QP solver. Get the Support Vectors from TRN and keep them in A.
4. Test the entire data set. Testing errors are kept in Ω.
5. While the number of errors is different from zero:
   5.1 Train Ω by BCP.
   5.2 Get the q patterns closest to the hyperplane. These patterns form Ω_closest.
   5.3 Optimize Ω_closest with the QP solver. Get the Support Vectors from Ω_closest.
   5.4 Add Ω_closest to TRN(new), optimize TRN(new) with the QP solver, and set the solution in A.
   5.5 Test the entire data set. Testing errors are kept in Ω; go to 5.
6. Finish
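The sketch below summarizes this procedure, with one simplification: the intermediate optimization of Ω_closest (step 5.3) is folded into the re-optimization of TRN(new). The routines bcp_train and qp_solve are assumed to be available (for example, the BCP sketch of section 3 and any standard dual QP solver); all names and tolerances are illustrative.

```python
# Hedged sketch of the proposed BCP-initialized chunking loop.
import numpy as np

def bcp_chunking(X, y, kernel, C, q, bcp_train, qp_solve, max_iter=50, tol=1e-8):
    """BCP picks the first chunk; a chunking-style loop then refines it."""
    n = len(y)
    w, theta = bcp_train(X, y)                             # step 1: train the whole set with BCP
    dist = np.abs(X @ w - theta) / np.linalg.norm(w)       # Euclidean distance to the hyperplane
    trn = list(np.argsort(dist)[:q])                       # step 2: the q closest patterns form TRN
    alpha, b = np.zeros(n), 0.0
    for _ in range(max_iter):
        idx = np.array(trn)
        a_w, b = qp_solve(kernel(X[idx], X[idx]), y[idx], C)   # steps 3 / 5.4: optimize TRN
        alpha[:] = 0.0
        alpha[idx] = a_w
        sv = idx[a_w > tol]                                    # Support Vectors kept in A
        f = kernel(X, X[sv]) @ (alpha[sv] * y[sv]) + b         # step 4: test the entire data set
        errors = [i for i in np.where(y * f < 1)[0] if i not in set(sv)]
        if not errors:                                         # step 6: finish
            return alpha, b
        err = np.array(errors)
        if len(np.unique(y[err])) == 2:                        # step 5.1: train the errors with BCP
            w_e, th_e = bcp_train(X[err], y[err])
            d_e = np.abs(X[err] @ w_e - th_e) / np.linalg.norm(w_e)
            closest = list(err[np.argsort(d_e)[:q]])           # step 5.2: q closest error patterns
        else:
            closest = errors[:q]                               # degenerate case: keep some errors
        trn = list(sv) + closest                               # step 5.4: enlarge TRN and repeat
    return alpha, b
```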
The proposed strategy consists in combining the chunking and BCP algorithms to tackle the training speed of Support Vector Machines. Given a training data set, the QP optimization will provide the same result (the same Support Vectors) whether the entire data set or a reduced data set containing only the Support Vectors is used. To perform the task, we obtain an approximate solution using the Barycentric Correction Procedure; in this way, a hyperplane that classifies the data set well is found. Then, by Euclidean distance, we get the points closest to the hyperplane and obtain a TRN. Due to the geometrical characteristics of the clusters, we can show that if TRN is large enough, the Support Vectors are included in TRN. What we cannot yet control is that some patterns not belonging to the Support Vectors may also be included. Nevertheless, we now have a small training data set and thus the training time is reduced (cf. Fig. 1).
In order to have a better approximation, we still use the chunking strategy. First, we set the value q, referred to as the chunk size. Then, we train the entire data set with BCP, select the q points closest to the hyperplane found, and form TRN as follows:

$$TRN = TRN_1 \cup TRN_0 \qquad (18)$$

$$TRN_1 = \{x_i \mid i = 1,\ldots,q/2;\ i \in I_1\} \qquad (19)$$

$$TRN_0 = \{x_j \mid j = 1,\ldots,q/2;\ j \in I_0\} \qquad (20)$$

TRN is optimized in order to find the Support Vectors A, where $A \subset \mathbb{R}^n$ is the collection of the α such that $\alpha_i > 0$, $i = 1,\ldots,$ number of Support Vectors. Then the entire data set is tested by (4) using A. The errors are kept in order to form a subset $\Omega_i$, $i = 1,\ldots,M$, where M is the number of testing errors. Following the chunking strategy, the subset Ω must be added to the solution found before,

$$TRN(\mathrm{new}) = TRN(\mathrm{old}) \cup \Omega \qquad (21)$$

and this TRN(new) subset is optimized until the optimal solution is reached.

It has been observed that when the number of testing errors is large, the optimization also becomes a hard problem. To avoid this difficulty, we train the testing errors by BCP and get Ω_closest (the q patterns closest to the solution obtained by training Ω); then we optimize Ω_closest and add its solution to TRN(new) as follows:

$$\Omega_{\mathrm{closest}} = \text{patterns of } \Omega \text{ closest to its solution}, \qquad A_{\Omega,\mathrm{closest}} = \{\alpha \mid \alpha \neq 0\}$$

$$TRN(\mathrm{new}) = TRN(\mathrm{old}) \cup x_{A_{\Omega,\mathrm{closest}}} \qquad (22)$$

By adding only the relevant patterns to the TRN(new) subset, the number of added patterns is controlled.

Fig. 1
As mentioned before, kernel evaluations make it difficult to improve the training speed of SVMs. To alleviate this, we use a caching strategy [5]: at every step, some rows of the Hessian are calculated and stored; during the optimization process, if a row is needed, it is obtained from the cache instead of being calculated again. When a row has not been used for several steps, it is removed from the cache.
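A small sketch of such a row cache is shown below; the class name, the eviction policy (least recently used) and the cache size are our assumptions of one reasonable way to implement the strategy, not the authors' code.

```python
# Assumed implementation of the Hessian row-caching strategy described above.
from collections import OrderedDict

class HessianRowCache:
    def __init__(self, X, y, kernel, max_rows=256):
        self.X, self.y, self.kernel = X, y, kernel
        self.max_rows = max_rows
        self.rows = OrderedDict()            # row index -> cached Hessian row

    def row(self, i):
        """Return row i of Q (Q_ij = y_i y_j k(x_i, x_j)), computing it only once."""
        if i in self.rows:
            self.rows.move_to_end(i)         # mark the row as recently used
            return self.rows[i]
        r = self.y[i] * self.y * self.kernel(self.X[i:i+1], self.X)[0]
        self.rows[i] = r
        if len(self.rows) > self.max_rows:   # evict the least recently used row
            self.rows.popitem(last=False)
        return r
```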
5 Experimental Results
The proposed algorithm is compared with the traditional chunking algorithm. The QP solver used is the QP routine developed by K. Schittkowski. After different tests, the chunk size q was set to 10% of the data set and the parameter C to 1000. Quantities such as the number of examples, the Support Vectors at the first iteration, the total number of Support Vectors, the training time, the training error, and the number of iterations are reported. The benchmarks used to test the algorithm are: Pima Indians Diabetes, Adult, Tic-Tac-Toe, and Iris [11].
5.1 Pima Indians Diabetes benchmark
The Pima Indians Diabetes benchmark is used to decide whether a patient shows signs of diabetes according to World Health Organization criteria. It contains 768 instances with 8 attributes.
                          Algorithm with BCP    Traditional Chunking
Examples                  768                   768
Kernel                    RBF                   RBF
Gamma                     0.5                   0.5
C                         1000                  1000
Chunk size                10%                   10%
SV's at first iteration   76                    33
Total SV's                499                   768
Training time             4.76 sec              13.089 sec
Error                     0.0%                  0%
Iterations                7                     7

Table 1
Table 1 shows that the proposed algorithm is considerably faster than traditional chunking. As can be observed, at the first iteration of the proposed algorithm, with 768 patterns and q set to 10%, 76 patterns become Support Vectors, which means that essentially all the patterns in the first chunk were Support Vectors. On the other hand, the chunking algorithm found only 33 Support Vectors at its first iteration. Moreover, the number of Support Vectors obtained by the proposed algorithm is smaller than the number found by chunking. This may be explained by the fact that the proposed algorithm takes into account only relevant patterns, whereas traditional chunking uses all the patterns in the optimization.
5.2 Adult benchmark
This benchmark, compiled by J. Platt [6] from the UCI repository, predicts whether a person makes over $50,000 in a year. The data set is composed of eight categorical and six continuous attributes; the continuous attributes were discretized by Platt.
                          Algorithm with BCP    Traditional Chunking
Examples                  1478                  1478
Kernel                    RBF                   RBF
Gamma                     1                     1
C                         1000                  1000
Chunk size                10%                   10%
SV's at first iteration   146                   115
Total SV's                860                   649
Training time             23.063 sec            2078.46 sec
Error                     0.94%                 19.28%
Iterations                9                     27

Table 2
In this case, the training time of the proposed algorithm is far smaller than that of the traditional chunking algorithm. At the first step, optimizing with 147 patterns, 146 were identified as Support Vectors by the proposed algorithm, while traditional chunking identified only 115 Support Vectors (cf. Table 2). The difference in training time observed for this benchmark between the proposed and the chunking algorithms is noticeable. This is because, for chunking, the Support Vectors found at the first iteration cannot give a small testing error, so when all the testing errors are added to the TRN set it grows considerably and causes a slow optimization.
5.3 Tic-Tac-Toe benchmark
This benchmark is the complete set of possible board configurations at the end of tic-tac-toe games. It contains 958 instances representing the legal tic-tac-toe endgame boards and has 9 attributes, each corresponding to one tic-tac-toe square. The obtained results are presented in Table 3.
                          Algorithm with BCP    Traditional Chunking
Examples                  958                   958
Kernel                    RBF                   RBF
Gamma                     3                     3
C                         1000                  1000
Chunk size                10%                   10%
SV's at first iteration   55                    22
Total SV's                126                   649
Training time             0.844 sec             3.39 sec
Error                     0%                    0%
Iterations                9                     11

Table 3
In this case, the difference in training time is not as large, but the number of Support Vectors produced by the proposed algorithm is smaller than that of the chunking algorithm. It is important to note that the number of Support Vectors found at the first iteration by the proposed algorithm is smaller than q but larger than the number obtained by chunking.
5.4 Iris benchmark
This benchmark is perhaps the best-known database in pattern recognition research. It contains 4 attributes and 2 classes, where each class refers to a type of iris plant. There are 150 instances in total, 50 belonging to one class and the rest to the other.
Results are presented in Table 4.
                          Algorithm with BCP    Traditional Chunking
Examples                  150                   150
Kernel                    Polynomial            Polynomial
Degree                    2                     2
C                         1000                  1000
Chunk size                10%                   10%
SV's at first iteration   5                     1
Total SV's                13                    8
Training time             0.042 sec             0.085 sec
Error                     0%                    1.33%
Iterations                7                     9

Table 4
The kernel and its degree or gamma parameter were determined after several tests with different values; we kept the ones that worked best.
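Purely as an illustration, a simple hold-out search of the kind alluded to here could look as follows; the candidate values, the split and the helper names are assumptions, not the paper's exact protocol.

```python
# Illustrative hold-out search over gamma values (assumed protocol).
import numpy as np

def pick_gamma(X, y, train_and_error, gammas=(0.1, 0.5, 1.0, 3.0), seed=0):
    """train_and_error(X_trn, y_trn, X_val, y_val, gamma) -> validation error rate."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(0.8 * len(y))                   # 80/20 hold-out split (assumption)
    trn, val = idx[:cut], idx[cut:]
    errors = {g: train_and_error(X[trn], y[trn], X[val], y[val], g) for g in gammas}
    return min(errors, key=errors.get)        # keep the value that works best
```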
6 Conclusions
We have presented an algorithm that can be used to train Support Vector Machines while alleviating the training time issue associated with them. We have also demonstrated the performance of the method on different benchmarks, whose results show that the proposed algorithm trains faster than traditional chunking. Finally, the proposed algorithm can be extended to decomposition algorithms such as Osuna's [7], where it could serve for the task of selecting a good working set.
References:
[1] V. Vapnik and A. Chervonenkis, Theory of Pattern Recognition, Akademie-Verlag, 1979.
[2] B. E. Boser, I. Guyon and V. Vapnik, A training algorithm for optimal margin classifiers, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, 1992, pp. 144-152.
[3] H. Poulard and D. Estève, Barycentric Correction Procedure: A fast method of learning threshold unit, World Congress on Neural Networks, Vol. 1, 1995, pp. 710-713.
[4] H. Poulard and D. Estève, A convergence theorem for Barycentric Correction Procedure, Technical Report 95180 of LAAS-CNRS, 1995, 16 p.
[5] B. Schölkopf, C. J. C. Burges and A. J. Smola, Advances in Kernel Methods: Support Vector Learning, The MIT Press, 1998.
[6] J. C. Platt, Sequential Minimal Optimization: A fast algorithm for training Support Vector Machines, Technical Report MSR-TR-98-14, 1998.
[7] E. Osuna, R. Freund and F. Girosi, An improved training algorithm for Support Vector Machines, Neural Networks for Signal Processing, Vol. 7, 1997, pp. 276-285.
[8] E. Osuna, R. Freund and F. Girosi, Training Support Vector Machines: An application to face detection, Proceedings of Computer Vision and Pattern Recognition, 1997, pp. 130-136.
[9] R. J. Vanderbei, LOQO: An interior point code for quadratic programming, Technical Report SOR 94-15, 1994.
[10] B. Murtagh and M. Saunders, MINOS 5.4 User's Guide, Systems Optimization Laboratory, 1995.
[11] C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], Department of Information and Computer Science, 1998.