Mixed-Integer Programming for Kernel-Based Classifiers
OLUTAYO OLADUNNI and THEODORE B. TRAFALIS
School of Industrial Engineering
University of Oklahoma
202 West Boyd, Room 124, Norman, OK 73019
USA
Abstract: - The main objective of this paper is to present a kernel-based classification model based on mixed-integer programming (MIP). Most classification methods are based on models with continuous variables. The proposed MIP formulation is developed to discriminate between two classes of points by minimizing the number of misclassified points and the number of data points representing the separating hyperplane. A generalization to kernel-based classification is provided. The simplicity of the formulation makes it easy for domain experts to interpret. Preliminary computational results are reported using the single phase fluid flow data [14], which show that our model improves on the support vector machine (SVM) solution.
Key Words: mixed-integer programming, model, classification, hyperplane, feature variable,
misclassification, kernel
1 Introduction
Discrimination of sets or objects is a very practical problem in various fields such as finance, marketing, science, medical diagnosis, education and engineering [1]. Discrimination requires the construction of classification models that can assign a data point or instance to a well-defined class. For example, in [1], vertical two-phase flow patterns were discriminated using the pipe size and the superficial gas and liquid velocities as attributes. There are a variety of methods for solving classification problems emanating from different disciplines [3]. These include support vector machines (SVMs), neural networks (NNs), fuzzy logic, principal component analysis (PCA), linear programming, and Fisher's discriminant function [4].
SVMs developed by Vapnik [2, 3, 5] are
based on statistical learning theory and have
been successfully applied to a wide range of
problems. They provide an alternative to the
existing methodology of modeling and
simulation of decision making processes for
classification and regression problems.
SVMs for classification problems perform a mapping of the input variables into a high-dimensional (possibly infinite-dimensional) feature space. Classification is done in this feature space by means of a hyperplane, which is the image of a nonlinear decision surface in the input space.
Discrete support vector machines (DSVM), developed by Orsenigo and Vercellis [6, 7], have recently been established as a family of models for binary and multi-category classification, presenting an alternative to the traditional SVM [2, 3, 5]. The DSVM models are also based on the structural risk minimization (SRM) principle [2, 3], and therefore have very good generalization properties. A distinguishing feature of DSVM is the evaluation of the empirical error: DSVM uses a discrete function to compute the empirical error, which essentially counts the number of misclassified points, whereas SVM uses a continuous function to compute the empirical error [2].
Uney and Turkay [8] developed a mixed-integer programming (MIP) multi-category classification model for the classification of objects into several categories using the concept of identifying hyper-boxes that define the boundaries of the given classes. This approach differs from SVMs, NNs and linear discriminant methods, which use nonlinear surfaces or hyperplanes to define the boundaries of the given classes. In their approach, the relationships among discrete decisions in the model are represented using propositional logic [9] and then converted to their equivalent integer constraints using Boolean algebra (see [8]).
The objective of this paper is to develop a mixed-integer programming kernel-based binary classification model that has good generalization properties and provides a sparse representation of the optimal classifier.
In Section 2, a review of separation in feature space is presented. In Section 3, two formulations of a mixed-integer programming kernel-based classification (MIPKC) model are described. In Section 4 an application problem for the classification of fluid flow in pipes is discussed and computational results are provided. Section 5 concludes the study.
2 Feature Space Separation
Consider the two-class classification problem. Specifically, $l$ data points $\{x_i, y_i\}_{i=1}^{l}$ are given, where $x_i \in \mathbb{R}^d$ are the input training vectors and $y_i \in \{-1, +1\}$ are the corresponding labels. The problem is to find a pair $(w, b)$ defining a separating hyperplane such that the following separation constraints hold:

$w^T x_i + b \geq +1$, for $y_i = +1$
$w^T x_i + b \leq -1$, for $y_i = -1$    (1)

where $w$ is the normal vector perpendicular to the separating hyperplane and $b$ is the bias, i.e., the position of the hyperplane. Eq. (1) is a linear classification problem analyzed in input space. To obtain a nonlinear classification problem it is essential to carry out a nonlinear mapping from input space to feature space [3, 5, 12] using a mapping function $\phi: \mathbb{R}^d \to F$, $x \mapsto \phi(x)$.

Expressing the normal vector $w$ as a linear combination of the input vectors (i.e., the $x_j$ span $\mathbb{R}^d$):

$w = \sum_{j=1}^{l} \alpha_j y_j x_j \;\Rightarrow\; w = \sum_{j=1}^{l} \alpha_j y_j \phi(x_j)$    (2)

where $\alpha_j$ is a weight in feature space, i.e., a feature variable that accounts for the level of importance of that point in $F$, and $l$ is the number of input vectors.

Substituting Eq. (2) into Eq. (1) gives

$\sum_{j=1}^{l} \alpha_j y_j \phi(x_j) \cdot \phi(x_i) + b \geq +1$, for $y_i = +1$
$\sum_{j=1}^{l} \alpha_j y_j \phi(x_j) \cdot \phi(x_i) + b \leq -1$, for $y_i = -1$    (3)

In Eq. (3) the only way the input data appear is in the form of dot products $\phi(x_j) \cdot \phi(x_i)$. Hence, the training model depends on the data only through dot products in $F$ [5]. Consider an appropriate kernel function such that [3]:

$k(x_j, x_i) = \phi(x_j) \cdot \phi(x_i)$    (4)

Then $k$ can replace the dot products in constraints (3), and the mapping function $\phi$ never needs to be computed explicitly. The kernel-based separation constraints can now be expressed as:

$\sum_{j=1}^{l} \alpha_j y_j k(x_j, x_i) + b \geq +1$, for $y_i = +1$
$\sum_{j=1}^{l} \alpha_j y_j k(x_j, x_i) + b \leq -1$, for $y_i = -1$    (5)

and the separating hyperplane is defined by $(\alpha_j, b)$.
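As a small illustration of constraints (5), their left-hand sides can be computed for all training points at once from a kernel matrix. The sketch below is a minimal example, not code from the paper: it assumes NumPy, hypothetical arrays X (training inputs) and y (labels), and the polynomial kernel that the paper adopts later in Eq. (7).

```python
import numpy as np

def poly_kernel(X1, X2, p=2):
    """Polynomial kernel of Eq. (7): k(x_j, x_i) = (x_j^T x_i + 1)^p."""
    return (X1 @ X2.T + 1.0) ** p

def separation_margins(alpha, b, X, y, p=2):
    """Left-hand side of constraints (5): sum_j alpha_j y_j k(x_j, x_i) + b, for every i."""
    K = poly_kernel(X, X, p)       # K[j, i] = k(x_j, x_i)
    return (alpha * y) @ K + b     # one margin value per training point i
```

For a feasible pair (alpha, b), constraints (5) require the returned margin to be at least +1 when y_i = +1 and at most -1 when y_i = -1, i.e., y_i times the margin must be at least 1 for every training point.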
3 Mixed-Integer Programming Kernel-Based Classification
In this section a mixed-integer programming kernel-based classifier (MIPKC) formulation for the separation of sets is proposed. There are two formulations, denoted MIPKCps and MIPKCp respectively. The first is analogous to the structural risk minimization principle in the context of SVMs [2, 3]: its objective is to minimize the misclassification error (the number of incorrectly classified labels) and simultaneously minimize the number of data points that represent $w$, i.e., to maximize the sparseness of the normal vector. The latter seeks to minimize only the misclassification error on the training set. To formulate the mixed-integer programming kernel-based classification model, we combine constraints (5) into one set of inequalities:
$y_i \left( \sum_{j=1}^{l} \alpha_j y_j k(x_j, x_i) + b \right) \geq 1, \quad \forall i$    (6)

where $k(x_j, x_i)$ is a kernel function. The kernel function used in this study is the polynomial kernel given as [3]:

$k(x_j, x_i) = (x_j^T x_i + 1)^p$    (7)
Defining binary variables

$p_i = \begin{cases} 1 & \text{if } y_i \left( \sum_{j=1}^{l} \alpha_j y_j k(x_j, x_i) + b \right) < 1 \\ 0 & \text{otherwise} \end{cases}, \quad \forall i$    (8)

where the binary variable $p_i$, $i = 1, \dots, l$, accounts for the violation of an instance (data point), and

$s_i = \begin{cases} 1 & \text{if } \alpha_i \neq 0 \\ 0 & \text{if } \alpha_i = 0 \end{cases}, \quad \forall i$    (9)

where $s_i$, $i = 1, \dots, l$, is the binary variable that accounts for the nonzero components of $\alpha$ (the feature variables). Consequently, the binary vectors $p$ and $s$ account for the number of misclassified points and the number of feature variables of a training set, respectively.

The objective function, which minimizes the number of feature variables and the number of misclassified points, can now be formulated as:

$\min_{p, s} \; \beta_1 \sum_{i=1}^{l} p_i + \beta_2 \sum_{i=1}^{l} s_i$    (10)
Problem (10) can be considered a two-objective optimization problem, where $(\beta_1, \beta_2)$ are parameters regulating the tradeoff between minimizing the misclassification error and minimizing the number of feature variables of the model.

Using both problem (10) and constraint (6), the mixed-integer kernel-based classifier (MIPKCps) is as follows:
$\min_{\alpha, b, p, s} \; \beta \sum_{i=1}^{l} p_i + \sum_{i=1}^{l} s_i$
s.t. $y_i \left( \sum_{j=1}^{l} \alpha_j y_j k(x_j, x_i) + b \right) \geq 1 - H p_i$,
$\alpha_i \leq C s_i$,
$\alpha_i \geq 0$,
$p_i \in \{0,1\}$, $s_i \in \{0,1\}$, $\quad i, j = 1, \dots, l$    (11)

$\beta$ is a penalty cost associated with the misclassification of training points; it accounts for the level of inseparability. It is introduced by dividing objective function (10) by $\beta_2$ and setting $\beta = \beta_1 / \beta_2$. The choice of the constant $\beta$ is highly significant for the solution of the model, as it weights the deviations from the separation constraints, and it is to be chosen by the modeler.
Problem (11) in matrix notation is given below, where $\alpha = [\alpha_1, \dots, \alpha_l]^T$:

$\min_{\alpha, b, p, s} \; \beta e^T p + e^T s$
s.t. $Y \left( k(A, A') Y \alpha + b e \right) \geq e - H p$,
$\alpha \leq C s$,
$\alpha \geq 0$,
$p \in \{0,1\}^m$, $s \in \{0,1\}^m$    (12)

Here $x_i, x_j$, $i, j = 1, \dots, l$, are the input training vectors and $A \in \mathbb{R}^{m \times d}$ is the matrix whose rows are the $x_i$; the kernel $k(A, A')$ maps $\mathbb{R}^{m \times d} \times \mathbb{R}^{d \times m}$ into $\mathbb{R}^{m \times m}$; $y_i \in \{-1, +1\}$ are the corresponding labels and $Y \in \mathbb{R}^{m \times m}$ is a diagonal matrix whose diagonal entries are the labels $y_{ii} \in \{-1, +1\}$; $e$ is the vector of ones. $H$ and $C$ are sufficiently large constants defined by the user.
The first constraint of problem (11) represents the separation constraints in the feature space, analogous to the constraints of the traditional SVM model. When a point is misclassified, the binary variable $p_i$ is set to 1, so the vector $p$ measures the amount of misclassification of points in feature space. The second constraint of problem (11) sets the binary variable $s_i$ to 1 whenever $\alpha_i > 0$, and also provides an upper bound on $\alpha_i$ so that the feature variables do not become too large, since the interpretation of very large values could be difficult or even meaningless; $s_i = 0$ implies that $\alpha_i = 0$.
A variant of the MIPKC model which seeks only to minimize the violation of the constraints is also considered. In formulation (11) the second term of the objective function is removed and the right-hand side of the second constraint is replaced by $C$, in order to prevent large coefficients $\alpha_i$. Hence a slightly different formulation is obtained, given below with index MIPKCp:
$\min_{\alpha, b, p} \; \sum_{i=1}^{l} p_i$
s.t. $y_i \left( \sum_{j=1}^{l} \alpha_j y_j k(x_j, x_i) + b \right) \geq 1 - H p_i$,
$0 \leq \alpha_i \leq C$,
$p_i \in \{0,1\}$, $\quad i, j = 1, \dots, l$    (13)

In matrix notation, where $\alpha = [\alpha_1, \dots, \alpha_l]^T$:

$\min_{\alpha, b, p} \; e^T p$
s.t. $Y \left( k(A, A') Y \alpha + b e \right) \geq e - H p$,
$0 \leq \alpha \leq C$,
$p \in \{0,1\}^m$    (14)
The decision function is given by:

$D(x) = \mathrm{sign} \left( \sum_{j=1}^{N_f} \alpha_j y_j k(x_j, x) + b \right)$    (15)

where $N_f$ is the number of nonzero components of the feature vector $\alpha$. Both models (11) and (13) are mixed-integer programming problems, which are more difficult to solve than their continuous counterparts. However, with a powerful IP or MIP solver one can obtain an optimal solution for reasonably sized problems. The CPLEX OPL [13] software was employed to solve both the MIPKCps and MIPKCp models.
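As a companion to the solver sketch above, the decision function (15) can be evaluated directly once alpha and b are available. The helper below is an illustrative sketch; the tolerance used to decide which alpha_j are treated as nonzero is an assumption, not a value from the paper.

```python
import numpy as np

def decision(x, X_train, y_train, alpha, b, p=2, tol=1e-8):
    """Eq. (15): D(x) = sign(sum over nonzero alpha_j of alpha_j y_j k(x_j, x) + b)."""
    nz = np.flatnonzero(np.abs(alpha) > tol)          # indices of the Nf retained points
    k = (X_train[nz] @ np.asarray(x) + 1.0) ** p      # polynomial kernel values k(x_j, x)
    return float(np.sign(np.sum(alpha[nz] * y_train[nz] * k) + b))
```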
4 Application Problem
In this section, computational results that utilize and compare the mixed-integer programming kernel-based classifiers defined by problems (11) and (13) are presented. Experiments were performed on the single phase fluid flow data used in the work of Trafalis and Oladunni [14]. The data set is described as follows:

Single Phase Fluid Flow Data Set: The single phase fluid flow data set uses flow rate, density, viscosity, borehole diameter, drill collar diameter and mud type to delineate the flow pattern (laminar, +1; turbulent, -1). There are 92 data points and 6 attributes, with 46 instances of the laminar flow pattern and 46 instances of the turbulent flow pattern.
Both formulations were solved using the MIP package in the CPLEX OPL [13] software. The single phase flow data were scaled using sigmoidal normalization [14]. The formulations were trained on 60% of the data set and tested on the remaining 40%, with the points drawn randomly from the data set to obtain four training and test samples. Given β = 1 and H = C = 10 for MIPKCps, C = 1 for MIPKCp, and the polynomial kernel (7) of order p = 2, the results of the MIPKC models (11) and (13) are reported. The most accurate model will have the highest classification accuracy (lowest error rate) and the lowest number of data points representing the classifier, i.e., the fewest feature variables.
$\text{Accuracy} = \dfrac{\text{total number of correctly classified points}}{\text{total number of observations}}, \qquad \text{Error} = 1 - \text{Accuracy}$    (16)

For 100% correct classification, accuracy = 1.
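In code, the measures of Eq. (16) reduce to a comparison of predicted and true labels. The snippet below is a small sketch that assumes hypothetical test arrays X_test and y_test and the alpha, b produced by a solver such as the earlier sketch; summing over all alpha_j is equivalent to summing over the nonzero ones in Eq. (15), since zero coefficients contribute nothing.

```python
import numpy as np

def accuracy_error(X_test, y_test, X_train, y_train, alpha, b, p=2):
    """Eq. (16): accuracy = correctly classified / total observations; error = 1 - accuracy."""
    K = (X_test @ X_train.T + 1.0) ** p          # polynomial kernel values k(x_j, x) per test point
    preds = np.sign(K @ (alpha * y_train) + b)   # decision function (15) for each test point
    acc = float(np.mean(preds == y_test))
    return acc, 1.0 - acc
```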
Since it is not known beforehand which β is best for a specific problem, we performed a study of the influence of β on the solution for the single phase flow data. Using β values of 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50 and 100, the quality of the solution was investigated in terms of the tradeoff between sparseness and generalization.
Table 1: Performance of the MIPKCp, MIPKCps, and SVM formulations

Table 2: Average error and accuracy of the MIPKC models in comparison with the conventional SVM formulation

Table 3: Influence of β on MIPKCps

Fig. 1: Plot of β versus error rate (training and test error plotted against β over the range 0.001 to 100).

Table 1 contains the results for the MIPKC models, MIPKCps and MIPKCp, and the SVM on the single phase fluid flow pattern classification data. The flow patterns are nonlinearly separable. The error rates and numbers of feature variables of the MIPKCp are noticeably higher than or equal to those of the MIPKCps and SVM models for all test samples. By minimizing the number of feature variables we are able to reduce the number of data points representing the separating classifier and simultaneously increase the generalization ability, i.e., the ability to correctly classify data unseen during the training phase. This distinctive reduction of feature variables is a primary improvement over the other models in comparison. The MIPKCps model also improves on the current SVM solution, which is at a 95% accuracy level after 10 runs and 97% after 4 runs (see Table 2 and [14]).

Table 3 reflects the influence of β on the MIPKCps model. As β increases, both the training and test error rates appear to decrease. This happens because increasing the β value exerts a higher penalty on violations of the constraints, as is evident from Figure 1. At β = 0.001, less penalty is assigned to reducing the training error and more penalty is assigned to reducing the number of feature variables toward 0. At β = 1, equal penalty is assigned to reducing the training error and the components of the feature vector. At β = 100, more penalty is assigned to reducing the training error and less to reducing the number of feature variables (see Table 3). As shown above, changing the value of β does affect the solution of the problem. Hence, in order to avoid a null solution it is advisable to start with β = 1 and then, if necessary, apply a higher penalty. With low β values we increase the risk of a high misclassification rate. The results above indicate that the mixed-integer programming formulation of kernel-based classifiers further improves on the results of their continuous counterparts.
5 Conclusion
In this study, a mixed-integer programming formulation for kernel-based classification is presented as an alternative to SVMs. The separation constraints in feature space are defined, and binary variables are introduced and incorporated into the classification model. The binary variables are used to measure the number of misclassified training points and the sparsity of the weight vector representation (the nonzero components of the feature vector). For reasonably sized data sets the two formulations present a viable alternative to the conventional SVM, which uses continuous variables. However, for large scale problems the computing time becomes a concern, making it advisable to train in batches of 100 or even 200 points, depending on the IP/MIP solver used. Also, since formulation (11) aims to reduce the number of features used for classification, training could be done on smaller sample sizes under the assumption that the training sample contains enough of the information critical to the model. The proposed formulations were applied to the single phase fluid flow data and compared to the SVM solution [14] for the same problem. In all test cases the MIPKCps model performed slightly better than or equal to the models in comparison, with a 1-3% improvement in the accuracy rate. All three models were shown to be consistent with respect to the test samples; however, the preferred model is the MIPKCps, which reports the lowest error rate and reduces the number of data points representing the weight vector (see Tables 1 and 2).

Extensions to the multi-class classification problem will be investigated in the future.
References:
[1] Trafalis, T.B.; Oladunni, O.; Papavassiliou, D.V. Two-Phase Flow Regime Identification with a Multi-Classification SVM Model. School of Industrial Engineering, College of Engineering, University of Oklahoma. To appear in Industrial & Engineering Chemistry Research.
[2] Vapnik, V. Statistical Learning Theory. John Wiley & Sons, Inc., 1998.
[3] Cristianini, N.; Shawe-Taylor, J. Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, UK, 2000.
[4] Haykin, S. Neural Networks: A Comprehensive Foundation, 2nd edition. Prentice Hall, Upper Saddle River, NJ, 1999.
[5] Burges, C.J.C. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2): 1998, 121-167.
[6] Orsenigo, C.; Vercellis, C. One-against-all multicategory classification via discrete support vector machines. In: Data Mining IV, N. Ebecken et al. (eds.), WIT Press, Ashurst Lodge, 2004, 255-264.
[7] Orsenigo, C.; Vercellis, C. Multivariate classification trees based on minimum features discrete support vector machines. IMA Journal of Management Mathematics, 14, 2003, 221-234.
[8] Uney, F.; Turkay, M. A Mixed-Integer Programming Approach to Multi-Class Data Classification Problem. Technical Report, Department of Industrial Engineering, College of Engineering, Koc University, Rumelifeneri Yolu, 34450, Sariyer, Istanbul, Turkey.
[9] Cavalier, T.M.; Pardalos, P.M.; Soyster, A.L. Modeling and integer programming techniques applied to propositional calculus. Computers & Operations Research, 17(6): 1990, 561-570.
[10] Chang, C.-C.; Lin, C.-J. LIBSVM: A Library for Support Vector Machines, 2001. http://www.csie.ntu.edu.tw/~cjlin/libsvm
[11] Hsu, C.-W.; Chang, C.-C.; Lin, C.-J. A Practical Guide to Support Vector Classification. Technical Report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan, 2003.
[12] Schölkopf, B. Statistical Learning and Kernel Methods. Technical Report MSR-TR-2000-23, Microsoft Research Ltd., Microsoft Corporation, 2000.
[13] ILOG OPL Studio 3.7. Language Manual. ILOG S.A., Gentilly, France and ILOG, Inc., Mountain View, California, USA, 2003. http://www.ilog.com/products/oplstudio/
[14] Trafalis, T.B.; Oladunni, O. Single Phase Fluid Flow Classification via Neural Networks & Support Vector Machine. Intelligent Engineering Systems Through Artificial Neural Networks (C.H. Dagli, A.L. Buczak, D.L. Enke, M.J. Embrechts, and O. Ersoy, eds.), ASME Press, Vol. 14, 2004, 427-432.