Reducing the training speed of Support Vector Machines by Barycentric Correction Procedure

ARIEL LUCIEN GARCIA and NEIL HERNANDEZ GRESS
Department of Computer Science
ITESM, Campus Estado de México
Carretera Lago de Guadalupe Km. 3.5, Atizapán de Zaragoza, Estado de México

Abstract: - The Chunking algorithm developed by Vapnik has proven to be an effective way of reducing the training time of Support Vector Machines with respect to classical QP optimization. However, the optimization rate of Chunking on real problems remains a concern. The method iteratively hunts for the Support Vectors while maintaining a reduced training set, thereby simplifying the QP optimization. This task is usually computationally demanding because the procedure is initialized with randomly selected patterns. By using BCP we can select the initial vectors to be trained and, consequently, substantially reduce the time needed to optimize Support Vector Machines. Experimental results on well-known benchmarks are discussed to show the capability of the proposed method.

Key-Words: - SVM, Chunking, BCP, Kernel functions, Training speed, Classification.

1 Introduction

Support Vector Machines (SVMs) are a well-known technique for training separating functions in pattern recognition tasks and for function estimation in regression problems [8]. They have shown good generalization capabilities in several problems. The mathematical problem and its solution were settled by Vapnik and Chervonenkis [1]. In classification tasks the main idea can be stated as follows: given a training data set of l patterns x_i, i = 1, ..., l, each belonging to one of two classes y_i \in \{-1, 1\}, the solution is given by the following optimization problem.

Minimize:

    W(\alpha) = -\sum_{i=1}^{l} \alpha_i + \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j k(x_i, x_j)    (1)

Subject to:

    \sum_{i=1}^{l} \alpha_i y_i = 0    (2)

    0 \le \alpha_i \le C    (3)

The resulting decision function is:

    f(x) = \operatorname{sgn}\left( \sum_{i=1}^{l} y_i \alpha_i k(x, x_i) + b \right)    (4)

The above formulation is a standard Quadratic Programming (QP) problem. It is also possible to state the problem in matrix notation.

Minimize:

    W(\alpha) = -\mathbf{1}^{T} \alpha + \frac{1}{2} \alpha^{T} Q \alpha    (5)

Subject to:

    \alpha^{T} y = 0    (6)

    0 \le \alpha_i \le C    (7)

The solution of the problem formulated in eqs. (1)-(3) is a vector \alpha^* \ge 0 in which the components \alpha_i^* strictly greater than zero correspond to the Support Vectors (SVs). Solving the optimization problem means minimizing (1) while satisfying constraints (2) and (3).

On the other hand, the Barycentric Correction Procedure (BCP) is a very fast method (more than 70,000 times faster than the perceptron) for training linearly separable problems [3]. The result is a hyperplane separating the patterns of the data set into two classes.

This paper deals with a new idea that combines BCP and chunking in order to reduce the training time of SVMs. Other ideas concerning time consumption and memory requirements are discussed in [6, 7, 8]; several more are developed in [5]. Support Vectors are patterns of the original data set that define, and therefore lie near, the separating hyperplane and its margin. Here, the goal is to profit from the advantages of BCP to detect patterns near the margin, even if these patterns still have to be fine-tuned.

This paper is structured as follows. In section 2 we discuss the standard SVM formulation and its optimization problems. Section 3 describes the Barycentric Correction Procedure used in the proposed method. Section 4 presents the original idea and its algorithm, which combines chunking and BCP in order to improve the training speed of SVMs. Finally, experimental results for four benchmarks, i) Adult, ii) Pima Indians Diabetes, iii) Tic-Tac-Toe and iv) Iris, are presented.
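For concreteness, the full QP problem of eqs. (5)-(7) maps directly onto a general-purpose solver. The following is a minimal sketch assuming Python with the cvxopt package and an RBF kernel; the helper names are illustrative and not part of the paper.

    # Sketch of the dual QP of eqs. (5)-(7); cvxopt and the helper names
    # below are assumptions for illustration only.
    import numpy as np
    from cvxopt import matrix, solvers

    def rbf_gram(X, gamma=0.5):
        # K[i, j] = exp(-gamma * ||x_i - x_j||^2)
        sq = np.sum(X ** 2, axis=1)
        return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

    def solve_dual(X, y, C=1000.0, gamma=0.5):
        l = len(y)
        Q = np.outer(y, y) * rbf_gram(X, gamma)              # Q_ij = y_i y_j k(x_i, x_j)
        P, q = matrix(Q), matrix(-np.ones(l))                # minimize 1/2 a'Qa - 1'a
        G = matrix(np.vstack([-np.eye(l), np.eye(l)]))       # box constraints 0 <= a_i <= C
        h = matrix(np.hstack([np.zeros(l), C * np.ones(l)]))
        A, b = matrix(y.reshape(1, -1).astype(float)), matrix(0.0)   # y'a = 0
        alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
        return alpha                                         # alpha_i > 0 mark the Support Vectors

The cost of building and solving this full l x l problem is precisely what motivates the decomposition methods discussed next.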
2 Support Vector Problem

Potential disadvantages of SVMs are: a) the method may require a prohibitive amount of computational time for real problems, and b) the quadratic form in (5) involves an l x l Hessian matrix, which produces memory limitations for large-scale problems. Equations (1)-(4) show that, given a training data set, one can build the Hessian matrix Q and then solve the problem with a classical QP method. Some of the most widely used solvers found in the literature are LOQO [9], MinQ, MINOS [10], and QP. This approach is very time consuming, so it is not viable for real applications. One approach found in the literature to make the problem tractable is to decompose the QP problem into a series of smaller QP sub-problems. One example is the chunking algorithm [1].

2.1 Chunking

The chunking algorithm breaks the problem of eqs. (1)-(3) down into smaller QP sub-problems (or chunks) that are optimized iteratively. The algorithm is based on the observation that the solution of the QP optimization is unchanged if the vectors with \alpha = 0 are removed. The idea is to build each sub-problem from the vectors whose \alpha^* are non-zero, together with the M worst examples violating the KKT conditions, and then solve that sub-problem. Each QP sub-problem is initialized with the result of the previous one. The general chunking procedure is:

1. Decompose the original training data base into:
   a) a training data set (TRN);
   b) a testing data set (TST).
2. Optimize the reduced TRN by classical QP optimization.
3. Obtain the Support Vectors from the previous optimization.
4. Test the Support Vectors on TST by using the generalization condition derived from eq. (4):

    y_i \left( \sum_{j=1}^{l} y_j \alpha_j^* k(x_j, x_i) + b \right) \ge 1

5. Recombine the Support Vectors and the testing errors into a new training set (TRNnew).
6. Repeat from step 2 until no errors are found in step 4.

The first chunk (TRN) is composed of q training patterns selected randomly from the original data set (q smaller than l). The starting point of the algorithm is therefore blind: the initial solution may be near to or far from the real one. With the chunking procedure, training times for SVMs are reduced, as reported in [1, 2], but it has been observed that even chunking is time consuming for large-scale problems. This is why we propose a more intelligent optimization method, described in section 4.
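The loop above can be summarized in the following sketch (Python-style; solve_qp and predict stand for any QP solver and decision-function evaluation such as eqs. (5)-(7) and (4), and are assumptions rather than the paper's code).

    # Sketch of the chunking procedure of section 2.1.
    import numpy as np

    def chunking(X, y, q, solve_qp, predict, max_iter=100):
        l = len(y)
        work = np.random.choice(l, size=q, replace=False)      # step 1: random initial TRN
        for _ in range(max_iter):
            alpha = solve_qp(X[work], y[work])                  # step 2: optimize the chunk
            sv = work[alpha > 1e-8]                             # step 3: Support Vectors
            a_sv = alpha[alpha > 1e-8]
            rest = np.setdiff1d(np.arange(l), work)             # TST: patterns outside the chunk
            margins = y[rest] * predict(X[rest], X[sv], y[sv], a_sv)
            errors = rest[margins < 1.0]                        # step 4: violators of y_i f(x_i) >= 1
            if errors.size == 0:                                # step 6: stop when no errors remain
                return sv, a_sv
            work = np.union1d(sv, errors)                       # step 5: SVs + testing errors
        return sv, a_sv

Because the initial chunk is drawn at random, the number of iterations, and hence the training time, depends strongly on how close that first working set happens to be to the true Support Vectors.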
3 Barycentric Correction Procedure

BCP is an algorithm based on geometrical characteristics for training a threshold unit [3]. It addresses the slow convergence of algorithms like the perceptron and is more efficient for training linearly separable problems. It has been proven that the algorithm rapidly converges towards a solution, and it has been shown to be more than 70,000 times faster than the classical perceptron. Moreover, BCP is a parameter-free algorithm, so it does not suffer from the critical problem of adjusting initial settings. Convergence proofs are given in [4].

The main idea of BCP is to perform a controlled search of the weights. Each time the weight vector is modified, correctly and incorrectly classified patterns are taken into account. The weight vector is defined as a vector connecting two barycenters, each belonging to a different class. In the linearly separable case, BCP converges by modifying these barycenters so as to improve the direction of the hyperplane until a solution is reached. The algorithm defines a hyperplane w·x = \theta dividing the input space into one region per class.

Thus, we define the index sets I_1 = \{1, ..., N_1\} and I_0 = \{1, ..., N_0\}, where N_1 is the number of patterns of target 1 and N_0 the number of patterns of target -1. The training set is given by C = C_1 \cup C_0. Also, let b_1 and b_0 be the barycenters of the points of C_1 and C_0, weighted by the positive coefficients (\alpha_1, ..., \alpha_{N_1}) and (\mu_1, ..., \mu_{N_0}), referred to as weighting coefficients [3]:

    b_1 = \frac{\sum_{i \in I_1} \alpha_i p_i}{\sum_{i \in I_1} \alpha_i},  \qquad  b_0 = \frac{\sum_{i \in I_0} \mu_i m_i}{\sum_{i \in I_0} \mu_i}    (8)

where p_i are the patterns of C_1 and m_i those of C_0. The weight vector w is defined as the difference between the barycenters,

    w = b_1 - b_0    (9)

and the separating function is

    H: w \cdot x = \theta    (10)

At every step the weight vector w is modified by modifying the barycenters. The barycenter to be corrected is the one corresponding to the misclassified patterns: increasing the coefficients of particular patterns moves the hyperplane in their direction.

For computing the bias term \theta, let us define \varphi: R^n \to R as

    \varphi(p) = w \cdot p    (11)

and introduce the sets

    \Phi_1 = \{ \varphi(p_i) : p_i \in C_1 \},  \qquad  \Phi_0 = \{ \varphi(p_j) : p_j \in C_0 \}    (12)

The bias term can then be calculated as

    \theta = \frac{\min \Phi_1 + \max \Phi_0}{2}    (13)

The weighting modifications are calculated assuming the existence of J_1 \subset I_1, defined as the index subset of misclassified examples of target 1, and J_0 \subset I_0, referring to the misclassified examples of target -1. The modifications are computed as follows:

    \alpha_i(new) = \alpha_i(old) + \delta_i,  \quad  i \in J_1    (14)

    \mu_j(new) = \mu_j(old) + \nu_j,  \quad  j \in J_0    (15)

where \delta and \nu are positive increments drawn at random from intervals scaled by the class sizes:

    \delta_i \in \frac{N_0}{N_1} [\delta_{min}, \delta_{max}]    (16)

    \nu_j \in \frac{N_1}{N_0} [\nu_{min}, \nu_{max}]    (17)

According to the different results obtained, we determined that \delta_{min} and \nu_{min} can be set to 1 and \delta_{max} and \nu_{max} to 30. The general algorithm works as follows:

0. Initialize the weighting coefficients \alpha and \mu randomly.
1. Calculate b_1, b_0 and w according to (8) and (9).
2. Calculate \varphi, differentiating \Phi_1 from \Phi_0, using (11) and (12).
3. Compute the bias term \theta by (13).
4. Calculate the modifications of the weighting coefficients according to (14) and (15).
5. If the optimal hyperplane has been found, stop; otherwise go to step 1.
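A minimal sketch of BCP in Python, following the reconstruction of eqs. (8)-(17) above, is given below. The classification rule (target 1 when w·p >= \theta), the stopping test and the exact form of the random increments are our assumptions about details the description leaves implicit.

    # Sketch of BCP as described by eqs. (8)-(17); the classification rule,
    # stopping test and increment scaling are assumptions.
    import numpy as np

    def bcp(P1, P0, d_min=1.0, d_max=30.0, max_iter=1000):
        # P1: patterns of target +1, P0: patterns of target -1
        N1, N0 = len(P1), len(P0)
        a = np.random.rand(N1) + 0.1                   # weighting coefficients alpha
        m = np.random.rand(N0) + 0.1                   # weighting coefficients mu
        for _ in range(max_iter):
            b1 = (a @ P1) / a.sum()                    # eq. (8): barycenters
            b0 = (m @ P0) / m.sum()
            w = b1 - b0                                # eq. (9): weight vector
            phi1, phi0 = P1 @ w, P0 @ w                # eqs. (11)-(12): projections
            theta = (phi1.min() + phi0.max()) / 2.0    # eq. (13): bias term
            J1 = phi1 < theta                          # misclassified target +1 patterns
            J0 = phi0 >= theta                         # misclassified target -1 patterns
            if not J1.any() and not J0.any():          # separating hyperplane found
                return w, theta
            # eqs. (14)-(17): increase the coefficients of misclassified patterns
            a[J1] += (N0 / N1) * np.random.uniform(d_min, d_max, J1.sum())
            m[J0] += (N1 / N0) * np.random.uniform(d_min, d_max, J0.sum())
        return w, theta                                # best hyperplane found so far

For a non-separable set the loop simply stops after max_iter passes, which is the situation exploited in the next section: even an imperfect BCP hyperplane is enough to indicate which patterns lie close to the class boundary.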
4 Initializing Chunking Strategies by Barycentric Correction Procedure

As we have already mentioned, the goal is to reduce the training time of Support Vector Machines. The original idea presented here profits from the main characteristic of BCP (its training speed) in order to pre-select the Support Vectors before optimizing by QP. The general algorithm works as follows:

1. Train the entire data set by BCP. A hyperplane is obtained for both linearly separable and non-separable problems.
2. Get the q patterns closest to the hyperplane. These patterns form TRN.
3. Optimize TRN with a QP solver. Get the Support Vectors from TRN and keep them in A.
4. Test the entire data set. The testing errors are kept in E.
5. While the number of errors is different from zero:
   5.1 Train E by BCP.
   5.2 Get the q patterns of E closest to the hyperplane. These patterns form E_closest.
   5.3 Optimize E_closest with the QP solver. Get the Support Vectors from E_closest.
   5.4 Add these Support Vectors to TRN(new), optimize TRN(new) with the QP solver and set the solution in A.
   5.5 Test the entire data set, keep the testing errors in E and go back to step 5.
6. Finish.

The proposed strategy consists in combining the chunking and BCP algorithms to tackle the training speed of Support Vector Machines. Starting from a training data set, the QP optimization provides the same result (the same Support Vectors) whether the entire data set is used or only a reduced data set containing the Support Vectors. To perform the task, we obtain approximate solutions using the Barycentric Correction Procedure; in this way, a hyperplane that classifies the data set well is found. Then, using the Euclidean distance, we take the points closest to this hyperplane and obtain TRN. Due to the geometrical characteristics of the clusters, it can be shown that if TRN is large enough, the Support Vectors are included in it. What we cannot yet control is that TRN may also contain some patterns that are not Support Vectors. We now have a small training data set, and thus the training time is reduced (cf. Fig. 1).

Fig. 1

In order to obtain a better approximation, we still use the chunking strategy. First, we set the value q, referred to as the chunk size. Then we train the entire data set with BCP and select the q points closest to the hyperplane found, forming TRN as follows:

    TRN = TRN_1 \cup TRN_0    (18)

    TRN_1 = \{ p_i : i = 1, ..., q/2;  i \in I_1 \}    (19)

    TRN_0 = \{ p_j : j = 1, ..., q/2;  j \in I_0 \}    (20)

TRN is optimized in order to find the Support Vectors A, where A \subset R^n is the collection of patterns whose \alpha_i > 0, i = 1, ..., (number of Support Vectors). Then the entire data set is tested by (4) using A. The errors are kept in order to form a subset E = \{e_i\}, i = 1, ..., M, where M is the number of testing errors. Following the chunking strategy, the subset E must be added to the solution found before:

    TRN(new) = TRN(old) \cup E    (21)

This TRN(new) subset is optimized until the optimal solution is reached. It has been observed that when the number of testing errors is large, this optimization also becomes a hard problem. To avoid the difficulty, we train the testing errors E by BCP, take E_closest (the q patterns of E closest to the hyperplane obtained by training E), optimize E_closest and add its solution to TRN(new) as follows:

    A_closest = \{ x \in E_closest : \alpha_x > 0 \},  \qquad  TRN(new) = TRN(old) \cup A_closest    (22)

By adding only relevant patterns to the TRN(new) subset, the number of added patterns is controlled.

As mentioned before, kernel evaluations make it difficult to improve the training speed of SVMs. To avoid recomputation, we use a caching strategy [5]. At every step some rows of the Hessian H are calculated and stored; during the optimization process, whenever one of these rows is needed it is taken from the cache instead of being calculated again. When a row has not been used for several steps, it is removed from the cache.
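Putting the pieces of this section together, the proposed procedure can be sketched as follows. This is a Python-style sketch only: bcp, solve_qp and predict are the assumed helpers from the previous sketches, the tolerance 1e-8 is arbitrary, the per-class split of eqs. (19)-(20) is omitted for brevity, and the handling of error chunks containing a single class is our own simplification.

    # Sketch of the proposed BCP-initialized chunking of section 4.
    import numpy as np

    def closest_to_hyperplane(X, w, theta, q):
        # Euclidean distance of each pattern to the hyperplane w.x = theta
        d = np.abs(X @ w - theta) / np.linalg.norm(w)
        return np.argsort(d)[:q]

    def bcp_chunking(X, y, q, solve_qp, predict, max_iter=50):
        w, theta = bcp(X[y == 1], X[y == -1])               # step 1: train all data by BCP
        trn = closest_to_hyperplane(X, w, theta, q)         # step 2, eqs. (18)-(20): initial TRN
        for _ in range(max_iter):
            alpha = solve_qp(X[trn], y[trn])                # steps 3 / 5.4: optimize TRN
            sv, a_sv = trn[alpha > 1e-8], alpha[alpha > 1e-8]
            f = predict(X, X[sv], y[sv], a_sv)              # steps 4 / 5.5: test the entire set
            errors = np.setdiff1d(np.where(y * f < 1.0)[0], trn)
            if errors.size == 0:
                return sv
            e1, e0 = errors[y[errors] == 1], errors[y[errors] == -1]
            if e1.size and e0.size:                         # steps 5.1-5.3: BCP on the errors,
                w_e, th_e = bcp(X[e1], X[e0])               # keep the q closest, then their SVs
                ec = errors[closest_to_hyperplane(X[errors], w_e, th_e, q)]
                a_ec = solve_qp(X[ec], y[ec])
                errors = ec[a_ec > 1e-8]                    # A_closest of eq. (22)
            trn = np.union1d(sv, errors)                    # eqs. (21)-(22): TRN(new)
        return sv

Compared with plain chunking, the only changes are the BCP-based choice of the initial working set and the BCP filtering of the testing errors before they are added back, which is what keeps TRN(new) small.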
5 Experimental Results

The proposed algorithm is compared with the traditional chunking algorithm. The QP solver used is the QP routine developed by K. Schittkowski. After different tests, the chunk size q was set to 10% of the data set and the C parameter to 1000. The quantities reported are the number of examples, the Support Vectors found at the first iteration, the total number of Support Vectors, the training time, the training error and the number of iterations. The benchmarks used to test the algorithm are Pima Indians Diabetes, Adult, Tic-Tac-Toe and Iris [11].

5.1 Pima Indians Diabetes benchmark

The Pima Indians Diabetes benchmark is used to predict whether a patient shows signs of diabetes according to World Health Organization criteria. It has 768 instances with 8 attributes.

                              Algorithm with BCP   Traditional Chunking
    Examples                  768                  768
    Kernel                    RBF                  RBF
    Gamma                     0.5                  0.5
    C                         1000                 1000
    Chunk size                10%                  10%
    SVs at first iteration    76                   33
    Total SVs                 499                  768
    Training time             4.76 sec             13.089 sec
    Error                     0.0%                 0%
    Iterations                7                    7
    Table 1

Table 1 shows that the proposed algorithm is considerably faster than traditional chunking. At the first iteration of the proposed algorithm, with 768 patterns and q equal to 10%, 76 patterns became Support Vectors, which means that all the patterns in the chunk were Support Vectors. The chunking algorithm, on the other hand, found only 33 Support Vectors at its first iteration. In addition, the total number of Support Vectors obtained by the proposed algorithm is smaller than the number found by chunking. This may be explained by the fact that the proposed algorithm takes into account only relevant patterns, whereas traditional chunking uses all the patterns in the optimization.

5.2 Adult benchmark

This benchmark, compiled by J. Platt [6] from the UCI repository, predicts whether a person makes over $50,000 in a year. The data set is composed of eight categorical and six continuous attributes; the continuous attributes were discretized by Platt.

                              Algorithm with BCP   Traditional Chunking
    Examples                  1478                 1478
    Kernel                    RBF                  RBF
    Gamma                     1                    1
    C                         1000                 1000
    Chunk size                10%                  10%
    SVs at first iteration    146                  115
    Total SVs                 860                  649
    Training time             23.063 sec           2078.46 sec
    Error                     0.94%                19.28%
    Iterations                9                    27
    Table 2

In this case, the training speed of the proposed algorithm is again much higher than that of the traditional chunking algorithm. At the first step, optimizing with 147 patterns, 146 were identified as Support Vectors by the proposed algorithm, while traditional chunking identified only 115 (cf. Table 2). The difference in training time between the proposed and the chunking algorithms is remarkable for this benchmark. It arises because the Support Vectors found by chunking at the first iteration cannot give a small testing error, so when all the testing errors are added to the TRN set it grows considerably and the optimization becomes slow.

5.3 Tic-Tac-Toe benchmark

This benchmark is the complete set of possible board configurations at the end of tic-tac-toe games. It has 958 instances, representing the legal tic-tac-toe endgame boards, and 9 attributes, each corresponding to one tic-tac-toe square. The results obtained are presented in Table 3.

                              Algorithm with BCP   Traditional Chunking
    Examples                  958                  958
    Kernel                    RBF                  RBF
    Gamma                     3                    3
    C                         1000                 1000
    Chunk size                10%                  10%
    SVs at first iteration    55                   22
    Total SVs                 126                  649
    Training time             0.844 sec            3.39 sec
    Error                     0%                   0%
    Iterations                9                    11
    Table 3

In this case the difference in training time is not as large, but the number of Support Vectors found by the proposed algorithm is smaller than that found by the chunking algorithm. It is important to note that the number of Support Vectors found at the first iteration by the proposed algorithm is smaller than q but larger than the number obtained by chunking.

5.4 Iris benchmark

This is perhaps the best known database in pattern recognition research. It contains 4 attributes and 2 classes, where each class refers to a type of iris plant. There are 150 instances in total, 50 belonging to one class and the rest to the other. The results are presented in Table 4.

                              Algorithm with BCP   Traditional Chunking
    Examples                  150                  150
    Kernel                    Polynomial           Polynomial
    Degree                    2                    2
    C                         1000                 1000
    Chunk size                10%                  10%
    SVs at first iteration    5                    1
    Total SVs                 13                   8
    Training time             0.042 sec            0.085 sec
    Error                     0%                   1.33%
    Iterations                7                    9
    Table 4

The kernel and its degree or gamma parameters were determined after several tests with different values; we kept the values that worked best.

6 Conclusions

We have presented an algorithm that can be used to train Support Vector Machines while avoiding the training speed problems related to them. We have also demonstrated the performance of the method on different benchmarks, whose results show that the proposed algorithm has a better training speed than traditional chunking. Finally, the proposed algorithm can be extended to algorithms like Osuna's [7], where the task of selecting a good working set is well suited to the method.
References:
[1] V. Vapnik and A. Chervonenkis, Theory of Pattern Recognition, Akademie-Verlag, 1979.
[2] B. E. Boser, I. Guyon and V. Vapnik, A training algorithm for optimal margin classifiers, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, 1992, pp. 144-152.
[3] H. Poulard and D. Estève, Barycentric Correction Procedure: A fast method of learning threshold units, World Congress on Neural Networks, Vol. 1, 1995, pp. 710-713.
[4] H. Poulard and D. Estève, A convergence theorem for Barycentric Correction Procedure, Technical Report 95180 of LAAS-CNRS, 1995, 16 p.
[5] B. Schölkopf, C. J. C. Burges and A. J. Smola, Advances in Kernel Methods: Support Vector Learning, The MIT Press, 1998.
[6] J. C. Platt, Sequential Minimal Optimization: A fast algorithm for training Support Vector Machines, Technical Report MSR-TR-98-14, 1998.
[7] E. Osuna, R. Freund and F. Girosi, An improved training algorithm for Support Vector Machines, Neural Networks for Signal Processing, Vol. 7, 1997, pp. 276-285.
[8] E. Osuna, R. Freund and F. Girosi, Training Support Vector Machines: An application to face detection, Proceedings of Computer Vision and Pattern Recognition, 1997, pp. 130-136.
[9] R. J. Vanderbei, LOQO: An interior point code for quadratic programming, Technical Report SOR 94-15, 1994.
[10] B. A. Murtagh and M. A. Saunders, MINOS 5.4 User's Guide, Systems Optimization Laboratory, 1995.
[11] C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], Department of Information and Computer Science, University of California, Irvine, 1998.