Mixed-Integer Programming for Kernel-Based Classifiers

OLUTAYO OLADUNNI and THEODORE B. TRAFALIS
School of Industrial Engineering
University of Oklahoma
202 West Boyd, Room 124
Norman, OK 73019
USA

Abstract: - The main objective of this paper is to present a kernel-based classification model based on mixed-integer programming (MIP). Most classification methods are based on models with continuous variables. The proposed MIP formulation is developed to discriminate between two classes of points by minimizing both the number of misclassified points and the number of data points representing the separating hyperplane. A generalization for kernel-based classification is provided. The simplicity of the formulation makes it easy for domain experts to interpret. Preliminary computational results are reported using the single phase fluid flow data [14]; they show that our model improves on the support vector machine (SVM) solution.

Key Words: mixed-integer programming, model, classification, hyperplane, feature variable, misclassification, kernel

1 Introduction

Discrimination of sets or objects is a very practical problem in various fields such as finance, marketing, science, medical diagnosis, education, and engineering [1]. Discrimination requires the construction of classification models that can assign a data point or instance to a well-defined class. For example, in [1], vertical two-phase flow patterns were discriminated using the pipe size and the superficial gas and liquid velocities as attributes.

There is a variety of methods for solving classification problems emanating from different disciplines [3]. These include support vector machines (SVMs), neural networks (NNs), fuzzy logic, principal component analysis (PCA), linear programming, and Fisher's discriminant function [4]. SVMs, developed by Vapnik [2, 3, 5], are based on statistical learning theory and have been successfully applied to a wide range of problems. They provide an alternative to the existing methodology of modeling and simulation of decision-making processes for classification and regression problems. SVMs for classification perform a mapping of the input variables into a high-dimensional (possibly infinite-dimensional) feature space. Classification is done in this feature space by means of a hyperplane, which is the image of a nonlinear decision surface in the input space.

Discrete support vector machines (DSVM), developed by Orsenigo and Vercellis [6, 7], have recently been established as a family of models for binary and multi-category classification, presenting an alternative to the traditional SVM [2, 3, 5]. The DSVM models are also based on the structural risk minimization (SRM) principle [2, 3] and therefore have very good generalization properties. A distinguishing feature of DSVM is the evaluation of the empirical error: DSVM uses a discrete function that essentially counts the number of misclassified points, whereas SVM uses a continuous function to compute the empirical error [2].

Uney and Turkay [8] developed a mixed-integer programming (MIP) multi-category classification model that assigns objects to several categories by identifying hyper-boxes defining the boundaries of the given classes. This approach differs from SVMs, NNs, and linear discriminant methods, which use nonlinear surfaces or hyperplanes to define the class boundaries.
In their approach, the relationships among discrete decisions in the model are represented using propositional logic [9] and then converted to equivalent integer constraints using Boolean algebra (see [8]).

The objective of this paper is to develop a mixed-integer programming kernel-based binary classification model that has good generalization properties and provides a sparse representation of the optimal classifier. In Section 2, a review of separation in feature space is presented. In Section 3, two formulations of a mixed-integer programming kernel-based classification (MIPKC) model are described. In Section 4, an application problem for classification of fluid flow in a pipe is discussed and computational results are provided. Section 5 concludes the study.

2 Feature Space Separation

Consider the two-class classification problem. Specifically, $l$ data points $\{x_i, y_i\}_{i=1}^{l}$ are given, where $x_i \in \mathbb{R}^d$ are the input training vectors and $y_i \in \{-1, +1\}$ are the corresponding labels. The problem is to find a pair $(w, b)$ defining a separating hyperplane such that the following separation constraints hold:

$$w^T x_i + b \geq +1, \quad y_i = +1$$
$$w^T x_i + b \leq -1, \quad y_i = -1 \qquad (1)$$

where $w$ is the normal vector perpendicular to the separating hyperplane and $b$ is the bias, i.e., the position of the hyperplane. Eq. (1) is a linear classification problem analyzed in the input space. To obtain a nonlinear classification problem it is essential to carry out a nonlinear mapping from the input space to a feature space [3, 5, 12] using a mapping function $\Phi: \mathbb{R}^d \rightarrow F$.

Expressing the normal vector $w$ as a linear combination of the input vectors (i.e., the $x_j$ span $\mathbb{R}^d$),

$$w = \sum_{j=1}^{l} \alpha_j y_j x_j \quad \longrightarrow \quad w = \sum_{j=1}^{l} \alpha_j y_j \Phi(x_j) \qquad (2)$$

where $\alpha_j$ is a weight in feature space, i.e., a feature variable that accounts for the level of importance of that point in $F$, and $l$ is the number of input vectors. Substituting Eq. (2) into Eq. (1) gives

$$\sum_{j=1}^{l} \alpha_j y_j \, \Phi(x_j) \cdot \Phi(x_i) + b \geq +1, \quad y_i = +1$$
$$\sum_{j=1}^{l} \alpha_j y_j \, \Phi(x_j) \cdot \Phi(x_i) + b \leq -1, \quad y_i = -1 \qquad (3)$$

In Eq. (3) the only way the input data appear is in the form of dot products $\Phi(x_j) \cdot \Phi(x_i)$. Hence, the training model depends on the data only through dot products in $F$ [5]. Consider an appropriate kernel function such that [3]

$$k(x_j, x_i) = \Phi(x_j) \cdot \Phi(x_i) \qquad (4)$$

Then $k$ can replace the dot products in constraints (3), and the mapping function need not be computed explicitly. The kernel-based separation constraints can now be expressed as

$$\sum_{j=1}^{l} \alpha_j y_j \, k(x_j, x_i) + b \geq +1, \quad y_i = +1$$
$$\sum_{j=1}^{l} \alpha_j y_j \, k(x_j, x_i) + b \leq -1, \quad y_i = -1 \qquad (5)$$

and the separating hyperplane is defined by $(\alpha_j, b)$.

3 Mixed-Integer Programming Kernel-Based Classification

In this section a mixed-integer programming kernel-based classification (MIPKC) formulation for the separation of sets is proposed. There are two formulations, denoted MIPKCps and MIPKCp respectively. The first is analogous to the structural risk minimization principle in the context of SVMs [2, 3]: its objective is to minimize the misclassification error (the number of incorrectly classified labels) and simultaneously minimize the number of data points that represent $w$, i.e., the sparseness of the normal vector. The latter seeks to minimize only the misclassification error on the training set. To formulate the mixed-integer programming kernel-based classification model, we combine constraints (5) into one set of inequalities:

$$y_i \left( \sum_{j=1}^{l} \alpha_j y_j \, k(x_j, x_i) + b \right) \geq 1, \quad \forall i \qquad (6)$$

where $k(x_j, x_i)$ is a kernel function. The kernel function used in this study is the polynomial kernel [3]

$$k(x_j, x_i) = (x_j^T x_i + 1)^p \qquad (7)$$
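To make the kernel computation concrete, the short Python/NumPy sketch below (our own illustration, not part of the original study; all function and variable names are assumptions) builds the polynomial kernel matrix of Eq. (7) and evaluates the left-hand side of the kernel-based separation constraint (6) for a given trial choice of the feature variables and bias.

import numpy as np

def poly_kernel_matrix(A, B, p=2):
    # Polynomial kernel of Eq. (7): k(x_j, x_i) = (x_j^T x_i + 1)^p
    return (A @ B.T + 1.0) ** p

# Illustrative data: l points in R^d with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 6))             # rows are the input vectors x_i
y = np.where(rng.random(10) > 0.5, 1, -1)

K = poly_kernel_matrix(X, X, p=2)        # l x l matrix with entries k(x_j, x_i)

alpha = np.abs(rng.normal(size=10))      # trial feature variables alpha_j >= 0
b = 0.0
margins = y * (K.T @ (alpha * y) + b)    # y_i (sum_j alpha_j y_j k(x_j, x_i) + b)
violations = margins < 1                 # True where constraint (6) is violated

In the formulations that follow, such constraint violations are exactly what the binary variables $p_i$ record.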
Define the binary variables

$$p_i = \begin{cases} 1 & \text{if } y_i \left( \sum_{j=1}^{l} \alpha_j y_j \, k(x_j, x_i) + b \right) < 1 \\ 0 & \text{otherwise} \end{cases}, \quad i = 1, \ldots, l \qquad (8)$$

where the binary vector $p$, with components $p_i$, accounts for the violation of an instance/data point, and

$$s_i = \begin{cases} 1 & \text{if } \alpha_i \neq 0 \\ 0 & \text{if } \alpha_i = 0 \end{cases}, \quad i = 1, \ldots, l \qquad (9)$$

where the binary vector $s$, with components $s_i$, accounts for the nonzero components of $\alpha$ (the feature variables). Consequently, the binary vectors $p$ and $s$ account for the number of misclassified points and the number of feature variables of a training set, respectively. The objective function, which minimizes the number of misclassified points and the number of feature variables, can now be formulated as

$$\min_{p, s} \; \lambda_1 \sum_{i=1}^{l} p_i + \lambda_2 \sum_{i=1}^{l} s_i \qquad (10)$$

Problem (10) can be considered a two-objective optimization problem, where $(\lambda_1, \lambda_2)$ are parameters regulating the trade-off between minimizing the misclassification error and minimizing the number of feature variables of the model. Using objective (10) and constraint (6), the mixed-integer programming kernel-based classifier MIPKCps is as follows:

$$\min_{\alpha, b, p, s} \; \beta \sum_{i=1}^{l} p_i + \sum_{i=1}^{l} s_i$$
$$\text{s.t.} \quad y_i \left( \sum_{j=1}^{l} \alpha_j y_j \, k(x_j, x_i) + b \right) \geq 1 - H p_i,$$
$$\alpha_i \leq C s_i, \quad \alpha_i \geq 0,$$
$$p_i \in \{0, 1\}, \; s_i \in \{0, 1\}, \quad i, j = 1, \ldots, l \qquad (11)$$

Here $\beta$ is a penalty cost associated with the misclassification of training points; it accounts for the level of inseparability. It is introduced by dividing objective function (10) by $\lambda_2$ and setting $\beta = \lambda_1 / \lambda_2$. The selection of the constant $\beta$ is highly significant to the solution of the model, as it weights the deviations from the separation constraints, and it is to be chosen by the modeler. Problem (11) in matrix notation is given below, where $\alpha = [\alpha_1, \ldots, \alpha_l]^T$:

$$\min_{\alpha, b, p, s} \; \beta e^T p + e^T s$$
$$\text{s.t.} \quad Y \left( k(A, A') Y \alpha + e b \right) \geq e - H p,$$
$$\alpha \leq C s, \quad \alpha \geq 0,$$
$$p \in \{0, 1\}^m, \; s \in \{0, 1\}^m \qquad (12)$$

Here $x_i, x_j \in \mathbb{R}^d$, $i, j = 1, \ldots, l$, are the input training vectors and $A \in \mathbb{R}^{m \times d}$ (with $m = l$) is the matrix whose rows are the $x_i$; the kernel $k(A, A')$ maps $\mathbb{R}^{m \times d} \times \mathbb{R}^{d \times m}$ into $\mathbb{R}^{m \times m}$; $y_i \in \{-1, +1\}$ are the corresponding labels, $Y \in \mathbb{R}^{m \times m}$ is a diagonal matrix whose diagonal entries are the labels $y_{ii} \in \{-1, +1\}$, and $e$ is the vector of ones. $H$ and $C$ are sufficiently large constant values defined by the user.

The first constraint of problem (11) represents the separation constraints in feature space, which are analogous to the constraints of the traditional SVM model. When a point is misclassified, the binary variable $p_i$ is set to 1; the vector $p$ thus measures the amount of misclassification in feature space. The second constraint of problem (11) sets the binary variable $s_i$ to 1 whenever $\alpha_i \neq 0$, and also provides an upper bound for $\alpha_i$ so that the feature variables do not become too large, as interpreting such high values could be difficult or even meaningless; $s_i = 0$ implies that $\alpha_i = 0$.

A variant of the MIPKC model which seeks to minimize only the violation of the constraints is also considered. In formulation (11) the second term of the objective function is removed and the right-hand side of the second constraint is replaced by $C$, in order to prevent large coefficients $\alpha_i$. Hence a slightly different formulation, denoted MIPKCp, is obtained:

$$\min_{\alpha, b, p} \; \sum_{i=1}^{l} p_i$$
$$\text{s.t.} \quad y_i \left( \sum_{j=1}^{l} \alpha_j y_j \, k(x_j, x_i) + b \right) \geq 1 - H p_i,$$
$$0 \leq \alpha_i \leq C,$$
$$p_i \in \{0, 1\}, \quad i, j = 1, \ldots, l \qquad (13)$$

In matrix notation, where $\alpha = [\alpha_1, \ldots, \alpha_l]^T$,

$$\min_{\alpha, b, p} \; e^T p$$
$$\text{s.t.} \quad Y \left( k(A, A') Y \alpha + e b \right) \geq e - H p,$$
$$0 \leq \alpha \leq C e, \quad p \in \{0, 1\}^m \qquad (14)$$

The decision function is given by

$$D(x) = \mathrm{sign}\left( \sum_{j=1}^{N_f} \alpha_j y_j \, k(x_j, x) + b \right) \qquad (15)$$

where $N_f$ is the number of nonzero components of the feature vector $\alpha$.
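The sketch below shows one way problem (11) could be stated with an off-the-shelf MIP modelling layer. The paper itself solved the models with CPLEX OPL [13], so the use of PuLP here, together with all function and variable names, is purely an illustrative assumption. It encodes the separation constraints relaxed by the big constant H, links each alpha_i to the binary s_i through C, and evaluates the decision function of Eq. (15) from the optimal solution.

import numpy as np
import pulp

def mipkcps(X, y, beta=1.0, H=10.0, C=10.0, p=2):
    # Sketch of the MIPKCps model (11) with the polynomial kernel (7).
    l = len(y)
    K = (X @ X.T + 1.0) ** p                      # K[j, i] = k(x_j, x_i)

    prob = pulp.LpProblem("MIPKCps", pulp.LpMinimize)
    a  = [pulp.LpVariable(f"alpha_{i}", lowBound=0) for i in range(l)]
    b  = pulp.LpVariable("b")                     # bias, free variable
    pv = [pulp.LpVariable(f"p_{i}", cat=pulp.LpBinary) for i in range(l)]
    sv = [pulp.LpVariable(f"s_{i}", cat=pulp.LpBinary) for i in range(l)]

    # Objective: beta * (misclassification count) + (number of feature variables)
    prob += beta * pulp.lpSum(pv) + pulp.lpSum(sv)

    for i in range(l):
        # Separation constraint, relaxed by H when p_i = 1
        lhs = pulp.lpSum(a[j] * (int(y[j]) * float(K[j, i])) for j in range(l)) + b
        prob += int(y[i]) * lhs >= 1 - H * pv[i]
        # alpha_i can be nonzero only if s_i = 1, and is bounded above by C
        prob += a[i] <= C * sv[i]

    prob.solve()
    return np.array([v.value() for v in a]), b.value()

def decide(x, X, y, alpha, b, p=2):
    # Decision function (15); zero components of alpha contribute nothing,
    # so summing over all j equals summing over the N_f nonzero components.
    k = (X @ x + 1.0) ** p
    return np.sign(np.sum(alpha * y * k) + b)

Usage would be along the lines of alpha, b = mipkcps(X, y) followed by decide(x_new, X, y, alpha, b). Because the model is a mixed-integer program, the branch-and-bound solve time grows quickly with the number of training points, which motivates the computational remarks in Section 5.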
Both models (11) and (13) are mixed-integer programming problems, which are more difficult to solve than their continuous counterparts. However, with a powerful IP or MIP solver one can obtain an optimal solution for reasonably sized problems. The CPLEX OPL [13] software was employed to solve both the MIPKCps and the MIPKCp model.

4 Application Problem

In this section, computational results that utilize and compare the mixed-integer programming kernel-based classifiers defined by problems (11) and (13) are presented. Experiments were performed on the single phase fluid flow data used in the work of Trafalis and Oladunni [14]. The data set is described as follows.

Single Phase Fluid Flow Data Set: The single phase fluid flow data use flow rate, density, viscosity, borehole diameter, drill collar diameter, and mud type to delineate the flow pattern (laminar, +1; turbulent, -1). There are 92 data points and 6 attributes, with 46 instances of the laminar flow pattern and 46 instances of the turbulent flow pattern.

Both formulations were solved using the MIP package in the CPLEX OPL [13] software. The single phase flow data were scaled using sigmoidal normalization [14]. The formulations were trained on 60% of the data set and tested on the remaining 40%, all randomly drawn from the data set to obtain four training and test samples. Given β = 1 and H = C = 10 for MIPKCps, C = 1 for MIPKCp, and the polynomial kernel (7) of order p = 2, the results of the MIPKC models (11) and (13) are reported. The most accurate model has the highest classification accuracy rate (lowest error rate) and the lowest number of data points representing the classifier (fewest feature variables):

$$\text{Accuracy} = \frac{\text{total number of correctly classified points}}{\text{total number of observations}}, \qquad \text{Error} = 1 - \text{Accuracy} \qquad (16)$$

For 100% correct classification, accuracy = 1.

Since it is not known beforehand which β is best for a specific problem, a study of the influence of β on the solution of the single phase flow data was performed. Using β values equal to 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, and 100, the quality of the solution was investigated in terms of the trade-off between sparseness and quality of generalization.

Table 1: Performance of the MIPKCp, MIPKCps, and SVM formulations.

Table 2: Average error and accuracy of the MIPKC models in comparison with the conventional SVM formulation.

Table 3: Influence of β on the MIPKCps error.

Fig. 1: Plot of β versus error rate (training and test error for β between 0.001 and 100).
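Before discussing the results, the small Python sketch below illustrates the evaluation protocol described above: a random 60%/40% train/test split and the accuracy and error measures of Eq. (16). A common form of sigmoidal normalization is included for completeness; the exact scaling used in [14] is not reproduced here, and all names are our own assumptions.

import numpy as np

def sigmoidal_normalize(X):
    # One common sigmoidal normalization: squash standardized values into (0, 1).
    # (Assumed form; the exact scaling of [14] may differ.)
    z = (X - X.mean(axis=0)) / X.std(axis=0)
    return 1.0 / (1.0 + np.exp(-z))

def split_60_40(X, y, seed=0):
    # Randomly draw 60% of the points for training and keep 40% for testing.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(0.6 * len(y))
    return X[idx[:cut]], y[idx[:cut]], X[idx[cut:]], y[idx[cut:]]

def accuracy_and_error(y_true, y_pred):
    # Eq. (16): accuracy = correctly classified / total observations; error = 1 - accuracy.
    acc = float(np.mean(y_true == y_pred))
    return acc, 1.0 - acc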
Table 1 contains the results for the MIPKC models (MIPKCps and MIPKCp) and the SVM on the single phase fluid flow pattern classification data. The flow patterns are nonlinearly separable. The error rates and numbers of feature variables of MIPKCp are noticeably higher than or equal to those of the MIPKCps and SVM models for all test samples. By minimizing the number of feature variables we reduce the number of data points representing the separating classifier and simultaneously increase the generalization ability, i.e., the ability to classify correctly data unseen during the training phase. This distinctive reduction of feature variables is a primary improvement over the other models in the comparison. The MIPKCps model also provides an improvement over the current SVM solution, which is at a 95% accuracy level after 10 runs and 97% after 4 runs (see Table 2 and [14]).

Table 3 reflects the influence of β on the MIPKCps model. As β increases, both the training and test error rates appear to decrease. This happens because by increasing the β value we exert a higher penalty on violations of the constraints, as is evident from Figure 1. At β = 0.001, less penalty is assigned to reducing the training error and more penalty is assigned to driving the number of feature variables to 0. At β = 1, equal penalty is assigned to reducing the training error and the number of components of the feature vector. At β = 100, more penalty is assigned to reducing the training error and less penalty to reducing the number of feature variables (see Table 3). Changing the value of β therefore affects the solution of the problem. Hence, in order to avoid a null solution it is a good idea to start with β = 1 and then, if necessary, assign a higher penalty; with low β values we increase the risk of a high misclassification rate. These results indicate that the mixed-integer programming formulation of kernel-based classifiers further improves on the results of their continuous counterparts.

5 Conclusion

In this study, a mixed-integer programming formulation for kernel-based classification is presented as an alternative to SVMs. The separation constraints in feature space are defined, and binary variables are introduced and incorporated into the classification model. The binary variables are used to measure the number of misclassified training points and the sparsity of the weight vector representation (the nonzero components of the feature vector). For reasonably sized data sets the two formulations present a viable alternative to the conventional SVM, which uses continuous variables. However, for large-scale problems the computing time becomes a concern, making it advisable to train in batches of 100 or even 200 points, depending on the IP/MIP solver used. Also, since formulation (11) aims to reduce the number of features used for classification, training could be done on smaller sample sizes under the assumption that the training sample contains enough of the critical information for the model.

The proposed formulations were applied to the single phase fluid flow data and compared with the SVM solution [14] for the same problem. In all test cases the MIPKCps model performed slightly better than or equal to the models in the comparison, with a 1-3% improvement in the accuracy rate. All three models are consistent across the test samples; the preferred model is MIPKCps, which reports the lowest error rate and reduces the number of data points representing the weight vector (see Tables 1 and 2). Extensions to the multi-category classification problem will be investigated in the future.

References:
[1] Trafalis, T.B.; Oladunni, O.; Papavassiliou, D.V. Two-Phase Flow Regime Identification with a Multi-Classification SVM Model. School of Industrial Engineering, College of Engineering, University of Oklahoma. To appear in Industrial & Engineering Chemistry Research.
[2] Vapnik, V. Statistical Learning Theory. John Wiley & Sons, Inc., 1998.
[3] Cristianini, N.; Shawe-Taylor, J. Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, UK, 2000.
[4] Haykin, S. Neural Networks: A Comprehensive Foundation, 2nd edition. Prentice Hall, Upper Saddle River, NJ, 1999.
[5] Burges, C.J.C. A tutorial on support vector machines for pattern classification. Data Mining and Knowledge Discovery, 2(2), 1998, 121-167.
[6] Orsenigo, C.; Vercellis, C. One-against-all multicategory classification via discrete support vector machines. In: Data Mining IV, N. Ebecken et al., eds. WIT Press, Ashurst Lodge, 2004, 255-264.
[7] Orsenigo, C.; Vercellis, C. Multivariate classification trees based on minimum features discrete support vector machines. IMA Journal of Management Mathematics, 14, 2003, 221-234.
[8] Uney, F.; Turkay, M. A Mixed-Integer Programming Approach to Multi-Class Data Classification Problem. Technical Report, Department of Industrial Engineering, College of Engineering, Koc University, Rumelifeneri Yolu, 34450, Sariyer, Istanbul, Turkey.
[9] Cavalier, T.M.; Pardalos, P.M.; Soyster, A.L. Modeling and integer programming techniques applied to propositional calculus. Computers & Operations Research, 17(6), 1990, 561-570.
[10] Chang, C.-C.; Lin, C.-J. LIBSVM: A Library for Support Vector Machines, 2001. http://www.csie.ntu.edu.tw/~cjlin/libsvm
[11] Hsu, C.-W.; Chang, C.-C.; Lin, C.-J. A Practical Guide to Support Vector Classification. Technical Report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan, 2003.
[12] Schölkopf, B. Statistical Learning and Kernel Methods. Technical Report MSR-TR-2000-23, Microsoft Research Ltd., Microsoft Corporation, 2000.
[13] ILOG OPL STUDIO 3.7, Language Manual. ILOG, S.A., Gentilly, France and ILOG, Inc., Mountain View, California, USA, 2003. http://www.ilog.com/products/oplstudio/
[14] Trafalis, T.B.; Oladunni, O. Single phase fluid flow classification via neural networks and support vector machines. In: Intelligent Engineering Systems Through Artificial Neural Networks (C.H. Dagli, A.L. Buczak, D.L. Enke, M.J. Embrechts, and O. Ersoy, eds.), ASME Press, Vol. 14, 2004, 427-432.