Nonparallel plane proximal classifier

Santanu Ghorai, Anirban Mukherjee, Pranab K. Dutta
Electrical Engineering Department, Indian Institute of Technology, Kharagpur 721302, West Bengal, India
Article history: Received 22 February 2008; received in revised form 15 August 2008; accepted 3 October 2008; available online 18 October 2008.

Abstract
We observe that the two costly optimization problems of the twin support vector machine (TWSVM) classifier can be avoided by introducing a technique used in the proximal support vector machine (PSVM) classifier. With this approach we formulate a much simpler nonparallel plane proximal classifier (NPPC), which speeds up training by removing a significant computational burden relative to TWSVM. The formulation of NPPC for binary data classification is based on two identical mean square error (MSE) optimization problems, which lead to solving two small systems of linear equations in input space. Thus it eliminates the need for specialized software to solve quadratic programming problems (QPPs). The formulation is also extended to nonlinear kernel classifiers. Our computations show that a MATLAB implementation of NPPC can be trained on a data set of 3 million points with 10 attributes in less than 3 s. Computational results on synthetic as well as several benchmark data sets indicate the advantages of the proposed classifier in both computational time and test accuracy. The experimental results also indicate that, in many cases, classifiers obtained by the MSE approach are sufficient compared with classifiers obtained by the standard SVM approach.
© 2008 Elsevier B.V. All rights reserved.
Keywords:
Nonparallel plane
Pattern classification
Proximal classifier
Support vector machines
1. Introduction
The support vector machine (SVM) algorithm is an excellent tool for binary data classification [1–4]. This learning strategy, introduced by Vapnik and co-workers [1], is a principled and very powerful method in machine learning. Within a few years of its introduction, SVM has outperformed most other systems in a wide variety of applications, spanning a wide spectrum of research areas such as pattern recognition [5], text categorization [6], biomedicine [7,8], brain–computer interfaces [9,10], and financial applications [11,12].
The theory of SVM, proposed by Vapnik, is based on the structural risk minimization (SRM) principle [1–3].
In its simplest form, SVM for a linearly separable two-class problem finds an optimal hyperplane that maximizes the separation between the two classes. The hyperplane is obtained by solving a quadratic optimization problem. For nonlinearly separable cases the input feature vectors are first mapped into a high-dimensional feature space by using a nonlinear kernel function [4,13,14], and a linear classifier is then implemented in that feature space to classify the data. One of the main drawbacks of the standard SVM is that it requires a large training time for huge databases, as it has to optimize a computationally expensive cost function. The performance of a trained SVM classifier also depends on an optimal parameter set, which is usually found by cross-validation on a tuning set [15]. The large training time of SVM also prevents one from locating the optimal parameter set on a very fine grid of parameters over a large span. To remove this drawback, various versions of SVM with comparable classification ability have been reported by many researchers. The introduction of proximal types of SVM [16–18] eradicates this shortcoming of the standard SVM classifier. These classifiers avoid the costly optimization problem of SVM and as a result they are very fast. Such formulations of SVM can be interpreted as regularized least squares and considered in the much more general context of regularized networks [19,20].
All the above classifiers discriminate a pattern by determining in which half space it lies. Mangasarian and Wild [21] first proposed a classification method based on the proximity of patterns to one of two nonparallel planes. They named it the generalized eigenvalue proximal support vector machine (GEPSVM) classifier. Instead of finding a single hyperplane, GEPSVM finds two nonparallel hyperplanes such that each plane is clustered around the data of one particular class. For this, GEPSVM solves two related generalized eigenvalue problems. Although this approach is called an SVM, it is closer in spirit to discriminating patterns by the Fisher information criterion [13,15], because by changing the two-class margin representation from "parallel" to "nonparallel" hyperplanes it switches from a binary to a potentially multi-class approach. The linear kernel GEPSVM is very fast, as it solves two generalized eigenvalue problems of the order of the input space dimension, but its performance is only comparable with the standard SVM and in many cases it gives low classification rates. Recently, Jayadeva et al. [22] proposed the twin support vector machine (TWSVM) classifier. TWSVM also generates two nonparallel planes, similar to GEPSVM, but by a different technique: it solves two smaller-sized quadratic programming problems (QPPs) instead of the single large one of the standard SVM [2–4]. Although TWSVM and GEPSVM both classify data by two nonparallel planes, the former is closer to a typical SVM problem, which does not eliminate the basic assumption of selecting a minimum number of support vectors [23]. Although TWSVM achieves good classification accuracy, solving two optimization problems is not desirable in many cases, predominantly for large data sets, because of the higher learning time. This fact motivates us to formulate the proposed classifier such that it has classification ability as good as TWSVM [22] while being as computationally efficient as PSVM [18] or linear GEPSVM [21].
In this paper, we propose a binary data classifier, named the nonparallel plane proximal classifier (NPPC). NPPC also classifies binary data by its proximity to one of two nonparallel planes. The formulation of NPPC is totally different from that of GEPSVM [21], but the objective functions of NPPC are similar to those of TWSVM [22], with a different loss function and equality constraints instead of inequality constraints. We call this formulation a nonparallel plane proximal classifier (NPPC) rather than an SVM classifier because there is no SRM by margin maximization between the two classes as in the standard SVM. Thus it can be interpreted as a classifier obtained by regularized mean square error (MSE) optimization. Finally, and most importantly, the computational results on several data sets show that the performance of such classifiers obtained by MSE optimization is comparable to or even better than that of SVM classifiers, eliminating the need for a computationally costly SVM classifier in many cases.
The rest of this paper is organized as follows. A brief introduction to all the SVM classifiers is given in Section 2. In Section 3 we formulate NPPC for the linear kernel, with two visual examples in two dimensions, and in Section 4 we extend the formulation to nonlinear kernels and demonstrate its performance visually with one example. In Section 5 the performance of the proposed NPPC is compared with that of the other SVM classifiers for linear and nonlinear kernels. Finally, Section 6 concludes the paper.
A brief word on the notation used in this paper [21] is as follows. All vectors are column vectors unless transposed by a superscript T. The inner product of two vectors x and y in the n-dimensional real space R^n is denoted by x^T y, and the two-norm of x is denoted by ||x||. The vector e represents a column vector of ones of appropriate dimension, whereas I stands for an identity matrix of appropriate dimension. For a matrix A ∈ R^{m×n} containing feature vectors, the ith row A_i is a row vector in R^n. Matrices A and B contain the feature vectors of classes +1 and −1, respectively. For A ∈ R^{m1×n} and C ∈ R^{n×m}, a kernel K(A,C) maps R^{m1×n} × R^{n×m} into R^{m1×m}. Only the symmetry of the kernel is assumed [21], without any use of Mercer's positive definiteness condition [2,3,4,13,14]. The ijth element of the Gaussian kernel [2] used for testing nonlinear classification is given by (K(A,C))_{ij} = e^{−μ||A_i^T − C_{·j}||²}, for i = 1,…,m1 and j = 1,…,m, where μ is a positive constant, e is the base of the natural logarithm, C_{·j} denotes the jth column of C, and A and C are as described above.
2. Brief introduction to SVM classifiers
2.1. The linear SVM
SVM is a state-of-the-art machine learning algorithm based on the guaranteed risk bounds of statistical learning theory [1,2], known as the SRM principle. Among the several tutorials on SVM in the literature we refer to [4].
Given m training pairs (x_1, y_1), …, (x_m, y_m), where x_i ∈ R^n is an input vector labeled by y_i ∈ {+1, −1} for i = 1,…,m, the linear SVM classifier searches for an optimal separating hyperplane

$$ \omega^T x + b = 0, \tag{1} $$

where b ∈ R is the bias term and ω ∈ R^n is the normal vector to the hyperplane. Expression (1) is obtained by solving the following convex QPP:

$$ \min_{\omega,\,b,\,\xi}\;\; \tfrac{1}{2}\|\omega\|^2 + c\sum_{i=1}^{m}\xi_i \qquad \text{s.t.}\;\; y_i(\omega^T x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0,\;\; i = 1,\dots,m, \tag{2} $$

where ξ_i ∈ R is the soft margin error of the ith training sample. The parameter c (>0) is the regularization parameter that balances the importance between maximization of the margin width and minimization of the training error.
The solution of the quadratic optimization problem (2) is obtained by finding the saddle point of its Lagrange function, and the decision function has the form

$$ f(x) = \operatorname{sgn}(\omega^T x + b) = \operatorname{sgn}\Big(\sum_{i=1}^{m} \alpha_i y_i x_i^T x + b\Big), \tag{3} $$

where α_i (0 < α_i < c) are the Lagrange multipliers. The complexity of solving (2) is O(m³). Thus, as the number of training patterns increases, the time needed to train the SVM classifier also increases.
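For readers who wish to reproduce this step, problem (2) can be handed to any QP solver through its dual. The following sketch is only an illustration of (2)–(3) using MATLAB's quadprog (Optimization Toolbox); X (m × n), y (m × 1 with entries ±1) and c are assumed to be given, and it is not the toolbox code benchmarked in Section 5.

```matlab
% Dual of (2): min 0.5*a'*Hqp*a - e'*a  s.t.  y'*a = 0,  0 <= a <= c
m   = size(X,1);
Hqp = (y*y') .* (X*X');                      % label-weighted Gram matrix
a   = quadprog(Hqp, -ones(m,1), [], [], y', 0, zeros(m,1), c*ones(m,1));
w   = X' * (a .* y);                         % normal vector of the hyperplane (1)
sv  = find(a > 1e-6 & a < c - 1e-6);         % unbounded support vectors
b   = mean(y(sv) - X(sv,:)*w);               % bias from the KKT conditions
yhat = sign(X*w + b);                        % decision function (3) on the data
```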
2.2. Least square support vector machine (LS-SVM)
The drawback that the training time of SVM grows quickly with the number of training patterns was addressed by Suykens et al. [16,17]. They proposed a least squares version of SVM by formulating the classification problem as

$$ \min_{\omega,\,b,\,\xi}\;\; \tfrac{1}{2}\|\omega\|^2 + \tfrac{c}{2}\sum_{i=1}^{m}\xi_i^2 \qquad \text{s.t.}\;\; y_i(\omega^T x_i + b) = 1 - \xi_i,\;\; i = 1,\dots,m. \tag{4} $$

Problem (4) finds the same hyperplane (1) by solving a set of m linear equations. The solution of (4) requires the inversion of a matrix of order m, and its computational complexity depends on the algorithm used: Gaussian elimination leads to a complexity of O(m³), whereas the conjugate gradient method needs less than O(m³) [17]. Thus LS-SVM is computationally cheaper than SVM and its classification ability is comparable with that of SVM.
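The single linear system behind (4) can be written down directly. The sketch below assumes the standard LS-SVM dual formulation of Suykens et al. [16,17] for a linear kernel; the variable names are ours and this is not the LS-SVM toolbox code [38] benchmarked in Section 5.

```matlab
% LS-SVM: one (m+1) x (m+1) linear system instead of a QP
m     = size(X,1);
Omega = (y*y') .* (X*X');                    % label-weighted linear kernel matrix
Asys  = [0, y'; y, Omega + eye(m)/c];        % KKT system of problem (4)
sol   = Asys \ [0; ones(m,1)];               % unknowns [b; alpha]
b     = sol(1);   alpha = sol(2:end);
w     = X' * (alpha .* y);                   % explicit normal vector (linear kernel)
yhat  = sign(X*w + b);
```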
2.3. Proximal support vector machine (PSVM)

Fung and Mangasarian proposed PSVM [18], which classifies data by its proximity to one of the two parallel planes ω^T x + b = ±1, obtained by solving the following optimization problem:

$$ \min_{(\omega,\,b,\,\xi)\in R^{n+1+m}}\;\; \tfrac{c}{2}\|\xi\|^2 + \tfrac{1}{2}(\omega^T\omega + b^2) \qquad \text{s.t.}\;\; D(A\omega + e b) + \xi = e, \tag{5} $$
where D ∈ R^{m×m} is a diagonal matrix containing +1 or −1 along its diagonal, representing the classes of the m training samples, A ∈ R^{m×n} is the matrix containing the m training vectors in n-dimensional space, and e ∈ R^m is a vector of ones. The solution of (5) involves the inversion of a matrix of order (n+1) × (n+1), and thus its complexity is approximately O(n³), where n is the input space dimension. Therefore PSVM is much faster than SVM while performing as well as SVM.
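Because (5) becomes an unconstrained regularized least-squares problem once ξ is eliminated, it has the closed form [ω; b] = (I/c + E^T E)^{−1} E^T D e with E = [A e]. A minimal sketch of this solution, with our own variable names (X for the m × n data matrix and d for the ±1 label vector), is:

```matlab
% PSVM closed-form solution of (5): proximity to the planes w'x + b = +/-1
[m, n] = size(X);
E = [X, ones(m,1)];                          % augmented data matrix [A e]
u = (eye(n+1)/c + E'*E) \ (E'*d);            % E'*D*e = E'*d, so u = [w; b]
w = u(1:n);   b = u(n+1);
yhat = sign(X*w + b);                        % classify by the sign of w'x + b
```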
2.4. Generalized eigenvalue proximal support vector machine (GEPSVM)

In GEPSVM [21], Mangasarian and Wild allowed the parallel planes of PSVM [18] to become nonparallel, and the data are classified by their proximity to one of two nonparallel planes in R^n. Thus GEPSVM seeks two nonparallel planes

$$ \omega_1^T x + b_1 = 0, \qquad \omega_2^T x + b_2 = 0, \tag{6} $$

where the first plane is closest to the points of class +1 and furthest from the points of class −1, and the second plane has the opposite property. This leads to solving the following two optimization problems:

$$ \min_{(\omega_1,b_1)\neq 0}\;\; \frac{\|A\omega_1 + e_1 b_1\|^2 / \|[\omega_1;\, b_1]\|^2}{\|B\omega_1 + e_2 b_1\|^2 / \|[\omega_1;\, b_1]\|^2}, \qquad \text{where } B\omega_1 + e_2 b_1 \neq 0, \tag{7} $$

and

$$ \min_{(\omega_2,b_2)\neq 0}\;\; \frac{\|B\omega_2 + e_2 b_2\|^2 / \|[\omega_2;\, b_2]\|^2}{\|A\omega_2 + e_1 b_2\|^2 / \|[\omega_2;\, b_2]\|^2}, \qquad \text{where } A\omega_2 + e_1 b_2 \neq 0. \tag{8} $$

Here the matrices A ∈ R^{m1×n} and B ∈ R^{m2×n} contain the m1 and m2 training patterns of classes +1 and −1, respectively, in n-dimensional space, ω1, ω2 ∈ R^n are the normal vectors and b1, b2 ∈ R the bias terms of the respective planes, and e1 ∈ R^{m1}, e2 ∈ R^{m2} are vectors of ones. The solution of (7) and (8) leads to two generalized eigenvalue problems in R^{(n+1)×(n+1)}, so the complexity is of the order O(n³), where n is the input space dimension. Linear GEPSVM is therefore as fast as the PSVM classifier.
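With the Tikhonov regularization term δ used in [21], problem (7) becomes a Rayleigh-quotient minimization and reduces to picking the eigenvector associated with a smallest generalized eigenvalue. A minimal sketch (our variable names; A, B and delta are assumed given, and the second plane is obtained by exchanging the roles of A and B):

```matlab
% Linear GEPSVM: plane clustered around class +1 (regularized version of (7))
H1 = [A, ones(size(A,1),1)];                 % [A e1]
H2 = [B, ones(size(B,1),1)];                 % [B e2]
Gn = H1'*H1 + delta*eye(size(H1,2));         % regularized numerator matrix
Gd = H2'*H2;                                 % denominator matrix
[V, L]    = eig(Gn, Gd);                     % generalized eigenvalue problem
[lmin, k] = min(diag(L));                    % smallest generalized eigenvalue
z  = V(:,k);                                 % z = [w1; b1] defines w1'x + b1 = 0
w1 = z(1:end-1);   b1 = z(end);
```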
2.5. Twin support vector machines (TWSVM)

TWSVM [22] also finds two nonparallel planes like GEPSVM for data classification, by optimizing the following pair of QPPs:

$$ \text{(TWSVM1)}\quad \min_{(\omega_1,b_1,\xi_2)}\;\; \tfrac{1}{2}\|A\omega_1 + e_1 b_1\|^2 + c_1 e_2^T \xi_2 \qquad \text{s.t.}\;\; -(B\omega_1 + e_2 b_1) + \xi_2 \ge e_2,\;\; \xi_2 \ge 0, \tag{9} $$

and

$$ \text{(TWSVM2)}\quad \min_{(\omega_2,b_2,\xi_1)}\;\; \tfrac{1}{2}\|B\omega_2 + e_2 b_2\|^2 + c_2 e_1^T \xi_1 \qquad \text{s.t.}\;\; (A\omega_2 + e_1 b_2) + \xi_1 \ge e_1,\;\; \xi_1 \ge 0, \tag{10} $$
where c1, c2 > 0 are the regularization parameters of TWSVM, e1 ∈ R^{m1} and e2 ∈ R^{m2} are vectors of ones, and ξ1 ∈ R^{m1}, ξ2 ∈ R^{m2} are the error variable vectors due to the class +1 and class −1 data, respectively. Thus the formulation of TWSVM is different from that of GEPSVM but similar to that of SVM, with the difference that it minimizes two smaller-sized optimization problems. If each class contains approximately m/2 patterns, the complexity of solving the two optimization problems of TWSVM is O(2 × (m/2)³), so the ratio of runtimes between SVM with formulation (2) and TWSVM with (9) and (10) is m³/(2 × (m/2)³) = 4. This makes TWSVM approximately four times faster than SVM [22], under the assumption that the two classes contain equal numbers of patterns.
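For completeness, (9) is normally solved through its Wolfe dual, which in [22] takes the form max_α e2^T α − (1/2) α^T G (H^T H)^{−1} G^T α with 0 ≤ α ≤ c1, where H = [A e1] and G = [B e2], and the plane is recovered as [ω1; b1] = −(H^T H)^{−1} G^T α. The sketch below assumes that dual form and uses MATLAB's quadprog with A, B and c1 given; it is not the "qp.dll" solver used in Section 5, and TWSVM2 is handled symmetrically.

```matlab
% TWSVM1 via its dual (dual form as reported in [22])
H      = [A, ones(size(A,1),1)];             % [A e1]
G      = [B, ones(size(B,1),1)];             % [B e2]
m2     = size(B,1);
HtHinv = inv(H'*H + 1e-7*eye(size(H,2)));    % small ridge guards against singularity
alpha  = quadprog(G*HtHinv*G', -ones(m2,1), [], [], [], [], ...
                  zeros(m2,1), c1*ones(m2,1));
z1 = -HtHinv*(G'*alpha);                     % [w1; b1]
w1 = z1(1:end-1);   b1 = z1(end);
```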
3. The NPPC formulation

In this section, we elaborate the formulation of the classifier that we name the nonparallel plane proximal classifier (NPPC). In the formulation of NPPC we apply the concepts of both TWSVM [22] and PSVM [18], with some modification, to find two nonparallel planes. To obtain the two nonparallel planes defined in (6), the linear NPPC (LNPPC) solves the following pair of QPPs:

$$ \text{(LNPPC1)}\quad \min_{(\omega_1,b_1,\xi_2)\in R^{n+1+m_2}}\;\; \tfrac{1}{2}\|A\omega_1 + e_1 b_1\|^2 + c_1 e_2^T \xi_2 + \tfrac{c_2}{2}\xi_2^T \xi_2 \qquad \text{s.t.}\;\; -(B\omega_1 + e_2 b_1) + \xi_2 = e_2 \tag{11} $$
and

$$ \text{(LNPPC2)}\quad \min_{(\omega_2,b_2,\xi_1)\in R^{n+1+m_1}}\;\; \tfrac{1}{2}\|B\omega_2 + e_2 b_2\|^2 + c_3 e_1^T \xi_1 + \tfrac{c_4}{2}\xi_1^T \xi_1 \qquad \text{s.t.}\;\; (A\omega_2 + e_1 b_2) + \xi_1 = e_1, \tag{12} $$

where the matrices A ∈ R^{m1×n} and B ∈ R^{m2×n} contain the m1 and m2 training patterns of classes +1 and −1, respectively, in n-dimensional space, ω1, ω2 ∈ R^n are the normal vectors and b1, b2 ∈ R the bias terms of the respective planes, e1 ∈ R^{m1}, e2 ∈ R^{m2} are vectors of ones, ξ1 ∈ R^{m1}, ξ2 ∈ R^{m2} are the error variable vectors due to the class +1 and class −1 data, respectively, and c1, c2, c3, c4 > 0 are the four regularization parameters of the NPPC.
This formulation of NPPC finds two nonparallel hyperplanes, one for each class. The first term of the objective function of LNPPC1 minimizes the sum of squared distances from the hyperplane to the patterns of class +1, and the constraint requires the patterns of class −1 to lie at unit distance from the hyperplane with soft errors. These errors are measured by the variables ξ2i, for i = 1, 2,…,m2, whenever the patterns of class −1 are not at unit distance from the hyperplane, and unlike TWSVM this error can be either positive or negative. Thus the error variable ξ2i measures the distance that must be added or subtracted to bring a pattern from its position to unit distance from the hyperplane. The second and third terms of the objective function of (11) constitute a general quadratic error function [24,25] of the form f(x) = P^T x + x^T Q x, where P, x ∈ R^k are vectors and Q ∈ R^{k×k} is a symmetric positive semidefinite matrix, which ensures convexity of f(x); the identification for LNPPC1 is spelled out after this paragraph. The motivation for such an error function is that the quadratic term is very sensitive to large errors while the linear term is less sensitive to them [26], so their combined effect better reduces the misclassification due to the patterns of class −1. Similarly, LNPPC2 finds a hyperplane clustered around the patterns of class −1 by minimizing the sum of squared distances of those patterns from it while keeping the patterns of class +1 at unit distance with the soft error variable ξ1. Thus each objective function belongs to one particular class and the constraints are formed by the patterns of the other class. The inclusion of both linear and quadratic error terms in the objective function tracks the error competently and helps the hyperplane to be precisely clustered around the respective class data.
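In the notation of [24,25] used above, the error part of the LNPPC1 objective is obtained by taking the variable to be ξ2 itself:

$$ c_1 e_2^T \xi_2 + \tfrac{c_2}{2}\,\xi_2^T \xi_2 \;=\; P^T \xi_2 + \xi_2^T Q\, \xi_2, \qquad P = c_1 e_2, \quad Q = \tfrac{c_2}{2} I_{m_2}, $$

and since Q is positive definite for c2 > 0, the objective of (11) is convex; the same identification, with (c3, c4), applies to (12).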
Further, the equality constraints make it feasible to obtain the solutions of (11) and (12) by solving two small systems of linear equations.

Thus the formulation of NPPC is similar to that of TWSVM [22], with the inequality constraints made equalities and the nonnegativity constraints on the error variables ξ1 and ξ2 removed. Furthermore, both linear and quadratic error terms are used in the objective functions, unlike TWSVM, which uses only a linear error term. The novelty of NPPC is therefore that, like TWSVM, it can fit nonparallel planes by minimizing two almost identical types of objective functions, as in PSVM [18], with different constant values but without any regularization term such as (1/2)(ω^T ω + b²). PSVM [18], by contrast, solves a single objective function to generate two parallel planes and can never fit nonparallel planes to a binary data set.

The Lagrangian of LNPPC1 is given by

$$ L(\omega_1, b_1, \xi_2, \alpha) = \tfrac{1}{2}\|A\omega_1 + e_1 b_1\|^2 + c_1 e_2^T \xi_2 + \tfrac{c_2}{2}\xi_2^T \xi_2 - \alpha^T\big[-(B\omega_1 + e_2 b_1) + \xi_2 - e_2\big], \tag{13} $$
where α ∈ R^{m2} is the vector of Lagrange multipliers. The Karush–Kuhn–Tucker (KKT) optimality conditions [27] for LNPPC1 are obtained by setting the partial derivatives of L to zero:

$$ \frac{\partial L}{\partial \omega_1} = A^T(A\omega_1 + e_1 b_1) + B^T\alpha = 0, \qquad \frac{\partial L}{\partial b_1} = e_1^T(A\omega_1 + e_1 b_1) + e_2^T\alpha = 0, $$
$$ \frac{\partial L}{\partial \xi_2} = c_1 e_2 + c_2 \xi_2 - \alpha = 0, \qquad -(B\omega_1 + e_2 b_1) + \xi_2 - e_2 = 0. \tag{14} $$
Combining the first two expressions of (14) leads to

$$ [A\;\; e_1]^T [A\;\; e_1] \begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix} + [B\;\; e_2]^T \alpha = 0. \tag{15} $$

Now let

$$ H = [A\;\; e_1], \qquad G = [B\;\; e_2], \qquad u = \begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix}. \tag{16} $$
Using these substitutions, (15) reduces to

$$ H^T H u + G^T \alpha = 0, \quad \text{i.e.,} \quad u = -(H^T H)^{-1} G^T \alpha. \tag{17} $$

Eq. (17) can be modified, if required, by adding a regularization term [21,22,28] δI, with δ > 0 and I an identity matrix of appropriate dimension, to avoid possible ill-conditioning of H^T H. The modified expression for u then becomes

$$ u = -(H^T H + \delta I)^{-1} G^T \alpha. \tag{18} $$

Using the notation of (16) and ξ2 = (α − c_1 e_2)/c_2 from the third expression of (14), and substituting (17) into the last expression of (14), we get

$$ \alpha = \Big[\tfrac{I}{c_2} + G(H^T H)^{-1} G^T\Big]^{-1}\Big(\tfrac{c_1}{c_2} + 1\Big) e_2. \tag{19} $$
Computing α from (19) entails the inversion of a massive m2 × m2 matrix and a few matrix multiplications. Therefore, to reduce the complexity of computing α by (19), we directly apply the Sherman–Morrison–Woodbury formula [29] for matrix inversion, as used in [18,21,30]. This yields

$$ \alpha = c_2\Big\{ I - G(H^T H)^{-1}\Big[\tfrac{I}{c_2} + G^T G (H^T H)^{-1}\Big]^{-1} G^T \Big\}\Big(\tfrac{c_1}{c_2} + 1\Big) e_2. \tag{20} $$

Computing this expression for α involves the inversion of a matrix of order (n+1) × (n+1), where n is the input space dimension, which is usually much smaller than m2 × m2, and completely solves the classification problem. An interesting fact observed from (20) is that if c1 = 0, i.e., if the linear error term is excluded from the objective function (11), the solution for α corresponds to considering only the quadratic error term. Thus the inclusion of both linear and quadratic error terms in the objective function offers an extra degree of freedom [31] for fine tuning the decision planes of the classifier by varying the ratio c1/c2.
Similarly, we construct the Lagrangian of LNPPC2 as

$$ L(\omega_2, b_2, \xi_1, \gamma) = \tfrac{1}{2}\|B\omega_2 + e_2 b_2\|^2 + c_3 e_1^T \xi_1 + \tfrac{c_4}{2}\xi_1^T \xi_1 - \gamma^T\big[(A\omega_2 + e_1 b_2) + \xi_1 - e_1\big], \tag{21} $$

where γ ∈ R^{m1} is the vector of Lagrange multipliers. Following the same procedure as above for the optimal solution of (12), we get the augmented vector

$$ v = \begin{bmatrix} \omega_2 \\ b_2 \end{bmatrix} = (G^T G + \delta I)^{-1} H^T \gamma \tag{22} $$

and

$$ \gamma = c_4\Big\{ I - H(G^T G)^{-1}\Big[\tfrac{I}{c_4} + H^T H (G^T G)^{-1}\Big]^{-1} H^T \Big\}\Big(\tfrac{c_3}{c_4} + 1\Big) e_1. \tag{23} $$
Once α and γ are computed from (20) and (23), respectively, the augmented vectors u and v, and hence (ω1, b1) and (ω2, b2), are obtained from (18) and (22). This completely solves the training of LNPPC and gives the expressions of the hyperplanes (6) used to classify new patterns. A new data sample x ∈ R^n is assigned to class k by comparing the following distance measures [22] from the two hyperplanes given by (6):

$$ \text{Class } k = \arg\min_{k=1,2}\; |x^T \omega_k + b_k|. \tag{24} $$
We now state our simple algorithm for implementing the linear NPPC.

Algorithm 1 (Linear nonparallel plane proximal classifier). Given m1 and m2 training patterns of classes +1 and −1, respectively, in n-dimensional space, represented by the matrices A ∈ R^{m1×n} and B ∈ R^{m2×n}, the linear NPPC is generated as follows:

(i) Define the augmented matrices H and G by (16), where e1 and e2 are m1 × 1 and m2 × 1 vectors of ones, respectively. With these, compute the Lagrange multipliers α and γ from Eqs. (20) and (23) with some positive values of c1, c2, c3 and c4. Typically these values are chosen by means of a tuning set.
(ii) Determine the augmented vectors u = [ω1; b1] and v = [ω2; b2] from (18) and (22), respectively, to obtain the two nonparallel planes (6).
(iii) Classify a new pattern x ∈ R^n by using (24).
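A direct MATLAB transcription of Algorithm 1 is sketched below. It is a minimal illustration of Eqs. (16), (18), (20), (22) and (23), with our own variable and function names, and is not the exact code benchmarked in Section 5.

```matlab
function [w1, b1, w2, b2] = lnppc_train(A, B, c1, c2, c3, c4, delta)
% Linear NPPC: A (m1 x n) holds the class +1 patterns, B (m2 x n) the class -1 patterns.
n  = size(A,2);
e1 = ones(size(A,1),1);   e2 = ones(size(B,1),1);
H  = [A e1];   G = [B e2];                          % Eq. (16)
HtHi = inv(H'*H + delta*eye(n+1));                  % (H'H + delta*I)^-1
GtGi = inv(G'*G + delta*eye(n+1));                  % (G'G + delta*I)^-1
% Eq. (20): alpha through the SMW form, inverting only an (n+1) x (n+1) matrix
alpha = c2*(c1/c2 + 1)*( e2 - G*(HtHi*((eye(n+1)/c2 + G'*G*HtHi) \ (G'*e2))) );
% Eq. (23): gamma, symmetrically
gamma = c4*(c3/c4 + 1)*( e1 - H*(GtGi*((eye(n+1)/c4 + H'*H*GtGi) \ (H'*e1))) );
u = -HtHi*(G'*alpha);   w1 = u(1:n);   b1 = u(n+1); % Eq. (18)
v =  GtGi*(H'*gamma);   w2 = v(1:n);   b2 = v(n+1); % Eq. (22)
end
```

A new pattern x (an n × 1 column) is then labelled by step (iii), for example as [d, k] = min(abs([x'*w1 + b1, x'*w2 + b2])), where k ∈ {1, 2} indicates the assigned class.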
At this point we compare NPPC with the other two nonparallel plane classifiers, namely TWSVM and GEPSVM, on two examples for visual illustration in two dimensions. All the classifiers are trained to obtain maximum training accuracy by varying their corresponding parameters. Fig. 1 shows the nonparallel planes learned by the above classifiers for a linearly non-separable data set containing 1241 patterns, with 765 patterns in class +1 and 476 patterns in class −1. Figs. 1(a) and (b) show the NPPC obtained with the linear error term set to zero in the objective functions and with both error terms included, respectively. While the former classifier gives a training set accuracy of 93.71%, the latter gives 96.21%. This improvement in training accuracy points toward the effectiveness of including both error terms in the objective functions. Figs. 1(c) and (d) show the classifiers learned by TWSVM and GEPSVM, respectively. These show that the training accuracies of NPPC (96.21%) and TWSVM (96.70%) are better than that of GEPSVM (77.20%). At the same time, the training times of the three classifiers show that NPPC is much faster than TWSVM and as fast as GEPSVM.
In Figs. 2(a)–(c) we show the same three classifiers learned on a noisy cross-plane data set. The training set accuracy of both NPPC and TWSVM is 93.55% for this example, whereas for GEPSVM it is only 80.65%. Thus linear NPPC proves its effectiveness by learning the tricky noisy cross-plane data set well. In Figs. 2(d)–(f) we show the effect of the linear and quadratic terms used in the loss function of the NPPC formulation. For this cross-plane data set the effect of the linear term is negligible, as seen from Fig. 2(d), where the position of the plane clustered around the class +1 data does not change as c1 is reduced to 0 from its value in Fig. 2(a) while the other parameters are kept fixed. But Figs. 2(e) and (f) show that as the value of c2 increases, keeping the other three parameters fixed, the plane gradually shifts away from the class +1 data. Our experience is therefore that if all the patterns of the opposite class are below the requisite unit distance, the effect of the linear error term is predominant over the quadratic term, and the reverse is true if the same patterns are far beyond the requisite distance. Thus a proper combination of the four parameters c1, c2, c3 and c4 enables the classifier to track the error competently whenever the hyperplane swings in either direction from the requisite unit distance, and thus to learn better. The better performance of the classifier may be explained physically by the fact that it minimizes both the average error and the sum squared error (an energy term) due to the opposite-class data whenever those patterns are not at unit distance while fitting a hyperplane around a particular class.
Fig. 1. A linearly non-separable data set learned by: (a) NPPC without the linear error term in the objective functions, obtained with c1 = 0, c2 = 2^2, c3 = 0, c4 = 2^0 and δ = 10^-7; training accuracy 93.71%, training time 0.0009 s. (b) NPPC with c1 = 2^4, c2 = 2^0, c3 = 2^3, c4 = 2^1 and δ = 10^-7; training accuracy 96.21%, training time 0.0009 s. (c) TWSVM classifier with c = 2^1 and δ = 10^-7; training accuracy 96.70%, training time 189.7516 s. (d) GEPSVM classifier with δ = 10^-4; training accuracy 77.20%, training time 0.0019 s. In all figures "+" and "o" signs represent class +1 and class −1 patterns, respectively; the solid line represents the plane clustered around class +1 patterns and the dotted line the plane clustered around class −1 patterns.

Fig. 2. A cross-plane data set with random noise learned by: (a) NPPC with c1 = 2^3, c2 = 2^-6, c3 = 2^1, c4 = 2^8 and δ = 10^-7; training accuracy 93.55%. (b) TWSVM classifier with c = 2^5 and δ = 10^-7; training accuracy 93.55%. (c) GEPSVM classifier with δ = 10^-3; training accuracy 80.65%. (d) Effect of changing only c1 of the NPPC in (a) to 0. (e)–(f) Shifting of the hyperplane clustered around the class +1 data due to changing only c2 of the NPPC in (a) to 2^-5 and 2^-4, respectively. In all figures "+" and "o" signs represent class +1 and class −1 patterns, respectively; the solid line represents the plane clustered around class +1 patterns and the dotted line the plane clustered around class −1 patterns.
4. Nonlinear kernel nonparallel plane proximal classifier (NKNPPC) formulation

In this section we extend our formulation to nonlinear classifiers by considering kernel-generated surfaces instead of planes [18,21,22]. For the nonlinearly separable case, the input data are first projected into a kernel-generated feature space of the same or higher dimension than the input space. To apply this transformation, let K(·,·) be a nonlinear kernel function and define the augmented matrix

$$ C = \begin{bmatrix} A \\ B \end{bmatrix} \in R^{m\times n}, \tag{25} $$

where m1 + m2 = m is the total number of patterns in the training set. We now construct the nonlinear kernel NPPC (NKNPPC) objective functions as follows:

$$ \text{(NKNPPC1)}\quad \min_{(y_1,b_1,\xi_2)\in R^{m+1+m_2}}\;\; \tfrac{1}{2}\|K(A,C^T)y_1 + e_1 b_1\|^2 + c_1 e_2^T \xi_2 + \tfrac{c_2}{2}\xi_2^T \xi_2 \qquad \text{s.t.}\;\; -(K(B,C^T)y_1 + e_2 b_1) + \xi_2 = e_2 \tag{26} $$

and

$$ \text{(NKNPPC2)}\quad \min_{(y_2,b_2,\xi_1)\in R^{m+1+m_1}}\;\; \tfrac{1}{2}\|K(B,C^T)y_2 + e_2 b_2\|^2 + c_3 e_1^T \xi_1 + \tfrac{c_4}{2}\xi_1^T \xi_1 \qquad \text{s.t.}\;\; (K(A,C^T)y_2 + e_1 b_2) + \xi_1 = e_1, \tag{27} $$

where y_1, y_2 ∈ R^m are the normal vectors to the kernel-generated surfaces of NKNPPC1 and NKNPPC2, respectively, and the remaining parameters are as described in Sections 2 and 3 for the linear case. Constructing the Lagrangians of problems (26) and (27) and defining the matrices P ∈ R^{m1×(m+1)}, Q ∈ R^{m2×(m+1)} and the augmented vectors λ1, λ2 ∈ R^{m+1} as

$$ P = [K(A,C^T)\;\; e_1], \qquad Q = [K(B,C^T)\;\; e_2], \qquad \lambda_1 = \begin{bmatrix} y_1 \\ b_1 \end{bmatrix}, \qquad \lambda_2 = \begin{bmatrix} y_2 \\ b_2 \end{bmatrix}, \tag{28} $$

the expressions for λ1, λ2 and the Lagrange multipliers α of (26) and γ of (27) can be written as

$$ \lambda_1 = -(P^T P)^{-1} Q^T \alpha, \tag{29} $$

$$ \lambda_2 = (Q^T Q)^{-1} P^T \gamma, \tag{30} $$
$$ \alpha = \Big[\tfrac{I}{c_2} + Q(P^T P)^{-1} Q^T\Big]^{-1}\Big(\tfrac{c_1}{c_2} + 1\Big) e_2, \tag{31} $$

$$ \gamma = \Big[\tfrac{I}{c_4} + P(Q^T Q)^{-1} P^T\Big]^{-1}\Big(\tfrac{c_3}{c_4} + 1\Big) e_1. \tag{32} $$
At this stage it is worth mentioning that, unlike the situation with linear kernels, the Sherman–Morrison–Woodbury formula is not suitable for (31) and (32), because Q(P^T P)^{-1}Q^T and P(Q^T Q)^{-1}P^T are square matrices of size m2 × m2 and m1 × m1, respectively, so the inversions must take place in the potentially high-dimensional spaces R^{m2} and R^{m1}. Furthermore, if the number of patterns in either class becomes very large, the reduced kernel technique [18,21,22,32] may be applied to reduce the dimensionality of NKNPPC1 and NKNPPC2. Further, if necessary, we can introduce a regularization term δI, δ > 0, while inverting (P^T P) in (29)
and (31), or (Q^T Q) in (30) and (32), respectively. In practice we have inserted a term δI in those expressions for the tests with nonlinear kernels. Once the values of λ1 and λ2 are obtained from (29) and (30), respectively, the kernel-generated surfaces of NKNPPC are given by

$$ K(x^T, C^T)\, y_1 + b_1 = 0 \qquad \text{and} \qquad K(x^T, C^T)\, y_2 + b_2 = 0. \tag{33} $$

A new data point x ∈ R^n is assigned to a class k (k = 1, 2) by comparing its distance from the two kernel-generated surfaces given by (33), after projecting it into the kernel-generated feature space, i.e.,

$$ \text{Class } k = \arg\min_{k=1,2}\; |K(x^T, C^T)\, y_k + b_k|. \tag{34} $$
We now state an explicit procedure for the nonlinear kernel NPPC, analogous to Algorithm 1.

Algorithm 2 (Nonlinear kernel nonparallel plane proximal classifier). Given m1 and m2 training patterns of classes +1 and −1, respectively, in n-dimensional space, represented by the matrices A ∈ R^{m1×n} and B ∈ R^{m2×n}, the nonlinear kernel NPPC is generated as follows:

(i) Form the matrix C ∈ R^{m×n} as defined in (25), considering all the rows of A and B or taking a small fraction (usually 1–10%) of the training data randomly for a reduced kernel [32].
(ii) Choose a kernel function K(A, C^T).
(iii) Define the augmented matrices P and Q as in (28), where e1 and e2 are m1 × 1 and m2 × 1 vectors of ones, respectively. With these, compute the Lagrange multipliers α and γ from Eqs. (31) and (32) with positive values of the regularization parameters c1, c2, c3 and c4 and of the kernel parameters of the selected kernel. Typically these values are chosen by means of a tuning set.
(iv) Compute the augmented vectors λ1 = [y1; b1] and λ2 = [y2; b2] by (29) and (30), respectively, to obtain the nonlinear surfaces (33).
(v) Classify a new pattern x ∈ R^n by using (34).
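The same algebra as in the linear case, applied to P and Q of (28), gives a compact implementation of Algorithm 2. The sketch below uses the Gaussian kernel of Section 1 and our own variable and helper names (gausskern is not a built-in function); it inverts P^T P and Q^T Q directly, with the δI term mentioned above, and is meant only to illustrate Eqs. (29)–(32), not the benchmarked code.

```matlab
function [lam1, lam2] = nknppc_train(A, B, mu, c1, c2, c3, c4, delta)
% Nonlinear kernel NPPC with the Gaussian kernel K(x,z) = exp(-mu*||x - z||^2).
C  = [A; B];   m = size(C,1);
e1 = ones(size(A,1),1);   e2 = ones(size(B,1),1);
P  = [gausskern(A, C, mu) e1];                       % Eq. (28)
Q  = [gausskern(B, C, mu) e2];
PtPi = inv(P'*P + delta*eye(m+1));
QtQi = inv(Q'*Q + delta*eye(m+1));
alpha = (eye(size(B,1))/c2 + Q*PtPi*Q') \ ((c1/c2 + 1)*e2);   % Eq. (31)
gamma = (eye(size(A,1))/c4 + P*QtQi*P') \ ((c3/c4 + 1)*e1);   % Eq. (32)
lam1 = -PtPi*(Q'*alpha);                             % Eq. (29): [y1; b1]
lam2 =  QtQi*(P'*gamma);                             % Eq. (30): [y2; b2]
end

function K = gausskern(U, V, mu)
% Gaussian kernel matrix: K(i,j) = exp(-mu*||U(i,:) - V(j,:)||^2)
sq = bsxfun(@plus, sum(U.^2,2), sum(V.^2,2)') - 2*(U*V');
K  = exp(-mu*sq);
end
```

A test pattern x (a 1 × n row vector) is then assigned by (34), for example as kx = [gausskern(x, C, mu) 1]; [d, k] = min(abs([kx*lam1, kx*lam2])).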
In order to illustrate graphically the effectiveness of the nonlinear kernel NPPC, we test its ability to learn the checkerboard data set [33]. This data set consists of 1000 points in R² of black and white points taken from 16 black and white squares of a checkerboard, and it is a tricky test case for evaluating the performance of nonlinear classifiers in data mining. NKNPPC with the Gaussian kernel and appropriate parameters generates a clear distinction between the two types of data points, as shown in Fig. 3. This demonstrates the effectiveness of the NKNPPC formulation in learning highly nonlinear patterns.

Fig. 3. Checkerboard data set learned by NKNPPC with the Gaussian kernel. The classifier is obtained with kernel parameter μ = 2^-5 and regularization parameters c1 = 1, c2 = 0.01, c3 = 1, c4 = 0.01 and δ = 10^-7. Training accuracy 98.80%; CPU time for training 3.40 s. In the figure "+" and "o" signs represent class +1 and class −1 patterns, respectively.
5. Numerical testing and comparison

To compare the performance of our NPPC we report results, in terms of accuracy and execution time, on publicly available benchmark data sets from the UCI Repository [34], which are commonly used in testing machine learning algorithms. All the classification methods are implemented in MATLAB 7 [35] on Windows XP, running on a PC with an Intel P4 processor (3.06 GHz) and 1 GB of RAM. We compare both linear and nonlinear kernel classifiers using the NPPC, TWSVM [22], GEPSVM [21], PSVM [18], one-norm SVM with formulation (2), and LS-SVM [17] algorithms. For the CPU time and accuracy figures of PSVM, SVM and LS-SVM we have used the MATLAB codes from the SVM toolbox home page [36], the Gunn SVM toolbox [37] and the LS-SVM 1.5 advanced toolbox with "CMEX" implementation [38], respectively. For the implementation of TWSVM we have used the optimizer code "qp.dll" from the Gunn SVM toolbox [37]. NPPC and GEPSVM are implemented using simple MATLAB functions such as "inv" and "eig", respectively. The average testing accuracies of all the methods are calculated using the standard 10-fold cross-validation method ([39], pp. 111–112), as used for all the other methods.
The parameter δ for GEPSVM is selected from the set of values {10^i | i = −7, −6,…,7}, and the value of δ for both NPPC and TWSVM is fixed at 10^-7 in all cases. The optimal values of the regularization parameters of all methods are selected from the set of values {2^i | i = −7, −6,…,15} by tuning on a set comprising a random 10% of the training data. Once the parameters are selected, the tuning set is returned to the training set to learn the final classifier. In order to calculate the accuracy, random permutations of the data points are also performed before proceeding to the 10-fold testing. Besides the average 10-fold testing accuracy, we also report the average CPU time for training of each method and the p-value at the 5% significance level. The p-value is calculated by performing a paired t-test ([39], p. 148) comparing each of the other classifiers with NPPC, under the null hypothesis that there is no difference between the test set accuracy distributions.
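The parameter selection described above can be reproduced with a simple grid search. The sketch below is only a schematic of the procedure: X and y are assumed to hold the full data matrix and the ±1 labels, lnppc_train is the sketch given with Algorithm 1, nppc_accuracy is a hypothetical helper that applies rule (24) and returns the fraction of correct labels, and, to keep the sketch short, the parameters are tied as c1 = c3 and c2 = c4, whereas the experiments tune all four values.

```matlab
% Grid search over {2^i | i = -7,...,15} on a random 10% tuning split
idx  = randperm(size(X,1));
nt   = round(0.1*size(X,1));
tune = idx(1:nt);   trn = idx(nt+1:end);
Atr  = X(trn(y(trn) ==  1), :);              % class +1 training patterns
Btr  = X(trn(y(trn) == -1), :);              % class -1 training patterns
best = -inf;
for i = -7:15
  for j = -7:15
    [w1,b1,w2,b2] = lnppc_train(Atr, Btr, 2^i, 2^j, 2^i, 2^j, 1e-7);
    acc = nppc_accuracy(X(tune,:), y(tune), w1, b1, w2, b2);
    if acc > best, best = acc; cbest = [2^i, 2^j]; end
  end
end
% The tuning set is then returned to the training set and the final classifier
% is trained with cbest before the 10-fold testing.
```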
Table 1. Ten-fold testing comparison of average CPU time for training (s) and average test set accuracy (%) of linear classifiers, with p-values comparing all methods to NPPC.

WPBC (110 × 32): NPPC 0.0024 s, 79.84±9.70; TWSVM 0.9250 s, 78.32±8.89, p = 0.091898; GEPSVM 0.0077 s, 70.76±10.55, p = 0.038464; PSVM 0.0006 s, 79.21±9.04, p = 0.944048; SVM 2.8472 s, 78.37±8.55, p = 0.846563; LS-SVM 0.2224 s, 80.29±11.32, p = 0.285851.
Ionosphere (351 × 34): NPPC 0.0029 s, 87.44±6.30; TWSVM 3.1574 s, 81.48±5.74, p = 0.001467; GEPSVM 0.0096 s, 76.00±15.71, p = 0.000536; PSVM 0.0006 s, 86.04±6.93, p = 0.132871; SVM 12.6057 s, 88.2±4.51, p = 0.147621; LS-SVM 0.3954 s, 84.01±10.56, p = 0.024780.
Heart-c (303 × 24): NPPC 0.0011 s, 83.33±3.41; TWSVM 1.2671 s, 83.70±3.78, p = 0.084655; GEPSVM 0.0021 s, 66.67±8.92, p = 0.000019; PSVM 0.0039 s, 84.81±5.60, p = 0.814601; SVM 5.7702 s, 83.70±3.39, p = 0.897386; LS-SVM 0.0938 s, 84.44±4.32, p = 0.677108.
Pima-Indian (768 × 8): NPPC 0.0014 s, 77.20±5.29; TWSVM 26.4315 s, 77.85±4.85, p = 0.177352; GEPSVM 0.0016 s, 74.09±4.64, p = 0.003197; PSVM 0.0006 s, 77.07±6.42, p = 0.966127; SVM 108.5596 s, 77.34±4.37, p = 0.481374; LS-SVM 0.5076 s, 77.20±5.42, p = 0.998165.
BUPA liver (345 × 6): NPPC 0.0010 s, 70.21±10.67; TWSVM 2.6117 s, 66.40±7.74, p = 0.118958; GEPSVM 0.0012 s, 56.29±7.35, p = 0.000174; PSVM 0.0031 s, 68.66±7.40, p = 0.786537; SVM 11.2190 s, 67.78±5.51, p = 0.580362; LS-SVM 0.0882 s, 69.34±10.48, p = 0.273269.
Votes (435 × 16): NPPC 0.0015 s, 95.63±3.36; TWSVM 5.2407 s, 96.32±3.63, p = 0.083672; GEPSVM 0.0018 s, 91.03±6.13, p = 0.638405; PSVM 0.0025 s, 95.64±2.78, p = 0.224295; SVM 21.6651 s, 95.16±1.93, p = 0.948016; LS-SVM 0.3172 s, 95.63±3.36, p = 1.000000.
Breast cancer (699 × 16): NPPC 0.0013 s, 94.25±6.25; TWSVM 22.4713 s, 94.97±5.35, p = 0.212593; GEPSVM 0.0016 s, 73.65±11.58, p = 0.000168; PSVM 0.0006 s, 94.42±2.43, p = 0.946173; SVM 85.0442 s, 95.14±1.14, p = 0.703021; LS-SVM 0.4599 s, 94.25±6.25, p = 1.000000.
Heart-Statlog (270 × 14): NPPC 0.0011 s, 84.44±5.93; TWSVM 1.2181 s, 83.70±3.78, p = 0.679057; GEPSVM 0.0022 s, 66.67±8.92, p = 0.000136; PSVM 0.0006 s, 85.19±4.06, p = 0.763818; SVM 5.7694 s, 83.33±9.11, p = 0.703757; LS-SVM 0.0937 s, 84.81±4.52, p = 0.832052.
Sonar (208 × 60): NPPC 0.0058 s, 77.00±7.46; TWSVM 0.6432 s, 77.00±9.10, p = 0.948520; GEPSVM 0.0314 s, 68.69±10.02, p = 0.015912; PSVM 0.0006 s, 74.98±11.28, p = 0.751494; SVM 2.9825 s, 78.93±10.43, p = 0.704704; LS-SVM 0.0654 s, 78.36±4.42, p = 0.610881.
Australian (690 × 14): NPPC 0.0018 s, 86.23±5.47; TWSVM 16.6121 s, 85.94±5.84, p = 0.443456; GEPSVM 0.0024 s, 71.88±4.16, p = 0.000168; PSVM 0.0006 s, 85.22±4.24, p = 0.650593; SVM 86.8776 s, 85.51±4.85, p = 0.784918; LS-SVM 0.7082 s, 85.80±5.83, p = 0.278964.
German (1000 × 24): NPPC 0.0036 s, 76.30±5.75; TWSVM 76.9041 s, 75.70±5.85, p = 0.101038; GEPSVM 0.0056 s, 74.00±4.90, p = 0.902013; PSVM 0.0006 s, 76.70±2.90, p = 0.839176; SVM 251.1579 s, 75.60±4.34, p = 0.404828; LS-SVM 2.5300 s, 76.20±5.88, p = 0.474957.
In Table 1 we compare the linear classifiers on 11 data sets from the UCI Repository [34]. From Table 1 it is seen that there is no significant difference in classification accuracy between linear NPPC and TWSVM except on the Ionosphere data set, on which NPPC performs better, but there is a noteworthy improvement in the training time of NPPC over TWSVM. In contrast, the classification accuracy of GEPSVM is significantly lower than that of NPPC in all cases except the Votes and German data sets, as seen from the p-values.
Table 2. Ten-fold testing comparison of average CPU time for training (s) and average test set accuracy (%) of nonlinear kernel classifiers, with p-values comparing all methods to NPPC.

Haberman's survival (306 × 3): NPPC 0.1612 s, 77.43±7.11; TWSVM 4.9264 s, 78.72±8.26, p = 0.171591; GEPSVM 1.0523 s, 73.48±7.86, p = 0.071523; PSVM 0.0215 s, 73.52±6.25, p = 0.126987; SVM 8.0285 s, 73.17±10.61, p = 0.162121; LS-SVM 0.2714 s, 73.82±9.62, p = 0.158359.
Cleveland heart (297 × 13): NPPC 0.0625 s, 73.72±9.80; TWSVM 2.1756 s, 73.72±9.80, p = 1.00000; GEPSVM 0.4015 s, 42.79±19.67, p = 0.001178; PSVM 0.0162 s, 52.46±12.29, p = 0.000564; SVM 6.5248 s, 60.05±12.37, p = 0.008554; LS-SVM 0.1518 s, 59.37±9.53, p = 0.000004.
Tic-tac-toe (303 × 24): NPPC 3.0640 s, 89.33±1.16; TWSVM 93.9697 s, 89.33±1.16, p = 1.000000; GEPSVM 2.8059 s, 65.34±4.34, p = 0.000001; PSVM 0.3800 s, 94.47±2.76, p = 0.065232; SVM 244.5616 s, 90.92±2.18, p = 0.120003; LS-SVM 2.1627 s, 95.40±2.66, p = 0.021368.
Pima-Indian (768 × 8): NPPC 1.6753 s, 89.06±2.79; TWSVM 51.5236 s, 89.97±4.33, p = 0.429623; GEPSVM 17.3321 s, 65.11±5.14, p = 0.000000; PSVM 0.2129 s, 77.47±4.19, p = 0.000065; SVM 130.9067 s, 74.09±3.58, p = 0.000001; LS-SVM 2.3301 s, 76.96±4.15, p = 0.000018.
BUPA liver (345 × 6): NPPC 0.2068 s, 82.61±6.12; TWSVM 4.9233 s, 83.76±5.94, p = 0.423054; GEPSVM 1.5986 s, 57.93±9.73, p = 0.000045; PSVM 0.0285 s, 73.34±6.01, p = 0.000920; SVM 11.1348 s, 70.39±6.38, p = 0.002364; LS-SVM 0.2847 s, 72.18±6.49, p = 0.000255.
CMC (1473 × 9): NPPC 2.3735 s, 90.82±2.51; TWSVM 66.8564 s, 90.81±3.25, p = 0.989988; GEPSVM 1.0744 s, 62.29±5.94, p = 0.000001; PSVM 0.3043 s, 69.54±4.49, p = 0.000001; SVM 187.0435 s, 68.98±3.44, p = 0.000001; LS-SVM 2.7362 s, 68.27±4.75, p = 0.000001.
Breast cancer (699 × 16): NPPC 1.3228 s, 99.28±0.96; TWSVM 45.0518 s, 98.28±3.37, p = 0.167850; GEPSVM 17.3172 s, 82.93±25.79, p = 0.001102; PSVM 0.1665 s, 95.13±2.23, p = 0.000674; SVM 88.7292 s, 95.42±1.24, p = 0.000034; LS-SVM 1.0303 s, 95.26±4.79, p = 0.000124.
Spect (267 × 22): NPPC 0.1174 s, 95.90±3.49; TWSVM 4.2090 s, 94.03±5.27, p = 0.051576; GEPSVM 0.9097 s, 79.40±6.46, p = 0.000009; PSVM 0.0141 s, 86.17±5.21, p = 0.000536; SVM 5.8398 s, 85.10±7.18, p = 0.001685; LS-SVM 0.1061 s, 86.87±6.59, p = 0.000753.
On the other hand, the classification accuracies on all the data sets are comparable among PSVM, SVM, LS-SVM and NPPC, as the p-values in all these test cases are above 0.05, indicating no statistically significant difference in the results. Thus the proposed formulation of NPPC, with both linear and quadratic error terms in its objective functions, allows the classifier to learn the data sets well while reducing the generalization error. It is also noticeable from these tests that, for the linear kernel, the required training time of NPPC is extremely small compared with SVM and LS-SVM and comparable with that of PSVM and GEPSVM. It should be noted that the Gunn SVM toolbox [37] uses a ".dll" file for solving the optimization problem and that we used the "CMEX" implementation of LS-SVM from [38] for comparison; both of these implementations are faster than a pure MATLAB implementation.

Table 2 compares the performance of the different nonlinear kernel classifiers using a Gaussian kernel on eight data sets. For the nonlinear kernel classifiers we follow the same procedure to find the optimal parameter set as in the case of the linear classifiers. The parameter μ of the Gaussian kernel for all the methods was selected from the set {2^i | i = −7, −6,…,−1, 0}, while the regularization parameters were selected from the same set as used for the linear classifiers. The results of Table 2 show that there is no statistical difference in average classification accuracy between NPPC and TWSVM, as the p-values in all the test cases are above 0.05, but NPPC is much faster than TWSVM. This implies that NPPC can achieve the same level of accuracy as TWSVM by solving a simple MSE optimization problem instead of a costly quadratic optimization problem. On the other hand, the differences between NPPC and the other classifiers, except TWSVM, are very much
statistically significant, as the p-values are well below 0.05 in most cases. For example, except on the Tic-tac-toe and Haberman's survival data sets, the classification accuracy of NPPC is substantially higher than that of GEPSVM, PSVM, SVM and LS-SVM. In particular, for the Cleveland heart, Pima-Indian, BUPA liver, CMC and Spect data sets the average classification accuracy of NPPC differs by more than 10% from those of the above four methods. Thus the nonlinear kernel NPPC results clearly point toward its advantage over the other existing methods. The NPPC classifier will perform well on data sets where the patterns of the opposite classes are distributed along two different hyperplanes. Even if the patterns are not distributed in this way, we can choose an appropriate kernel function with proper kernel parameters to transform the data into a feature space with such a distribution, so that the nonlinear kernel NPPC can perform better on that data set.
In order to test the computational efficiency of linear NPPC on a large data set, we synthetically generated 3 million artificial data points in 10 dimensions [30]. For this we first created an arbitrary hyperplane in that dimension and generated data points with entries uniformly distributed in [−0.9, +0.9]; each point is assigned a class label according to which side of the plane it lies on. We chose the same regularization parameter value of 1 for all the methods compared and measured only the CPU training time of each method on all of these data. NPPC can train the classifier with 3 million data points in 2.2039 s, which is comparable to the training times of 0.3326 and 1.2824 s required by PSVM and GEPSVM, respectively, whereas TWSVM, SVM and LS-SVM fail to converge within 40 h. To train LS-SVM in this case we used the fixed-size LS-SVM, which works with sparse representations, approximate feature maps and estimation in the primal [17]. The computational results are recorded in Table 3. They clearly demonstrate the boost in computational efficiency of NPPC over TWSVM.
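The synthetic experiment can be reproduced along the following lines; the random hyperplane and the uniform entries in [−0.9, +0.9] follow the description above, while the variable names are our own.

```matlab
% 3 million points in 10 dimensions, labelled by an arbitrary hyperplane
m = 3e6;   n = 10;
w = randn(n,1);   b = randn;                 % arbitrary hyperplane w'x + b = 0
X = -0.9 + 1.8*rand(m, n);                   % entries uniform in [-0.9, +0.9]
y = sign(X*w + b);   y(y == 0) = 1;          % label = side of the plane
A = X(y ==  1, :);                           % class +1 block for lnppc_train
B = X(y == -1, :);                           % class -1 block (c1 = ... = c4 = 1)
```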
We also tested the CPU training time of all the methods with the nonlinear Gaussian kernel on the Spambase data set [34] under identical conditions. This data set contains 4601 training points with 57 attributes. The kernel parameter μ and the regularization parameter c were taken as 0.1 and 1, respectively, for all the methods compared in Table 4. The reduced kernel technique [32] was applied for the computation of the training time: a rectangular kernel [32] of size 4601 × 138 was used instead of the square 4601 × 4601 kernel for these methods. Table 4 shows that the GEPSVM classifier takes the least time (0.6678 s) to train the nonlinear classifier, whereas the nonlinear kernel NPPC trains the classifier in 18.7321 s, which is less than the 19.2882 s taken by the PSVM classifier. TWSVM, fixed-size LS-SVM and the Gunn SVM on this same problem take more than 2.6 h (9394 s), 3.15 h (11 296 s) and 7.5 h (27 043 s), respectively, to learn the nonlinear classifier. The large training time of fixed-size LS-SVM is due to the fact that it constructs a reduced set of support vectors based on an optimal kernel entropy criterion [17]. These results clearly indicate the computational efficiency of NKNPPC, PSVM and GEPSVM.
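The rectangular kernel used for this timing can be formed as sketched below: a random subset Cbar of about 138 training points replaces the full C of (25), so that K(A, Cbar^T) has only 138 columns instead of 4601. The helper gausskern is the one sketched with Algorithm 2, A and B are the class blocks, and this is an illustration of the reduced kernel idea of [32], not the exact benchmark code.

```matlab
% Reduced (rectangular) Gaussian kernel: use ~3% of the rows of C = [A; B]
mu    = 0.1;                                 % kernel parameter used in Section 5
C     = [A; B];
mbar  = 138;                                 % reduced set size used for Spambase
pick  = randperm(size(C,1));
Cbar  = C(pick(1:mbar), :);
Pbar  = [gausskern(A, Cbar, mu) ones(size(A,1),1)];  % m1 x (mbar+1) instead of m1 x (m+1)
Qbar  = [gausskern(B, Cbar, mu) ones(size(B,1),1)];  % m2 x (mbar+1)
% Pbar and Qbar simply replace P and Q of (28) in Algorithm 2.
```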
The training CPU times of all the classifiers tabulated in Tables 1–4 can be explained by the complexities of the six methods compared. In Appendix A we derive the computational complexities of the linear and nonlinear kernel NPPC in terms of the number of simple multiplication operations, although the actual complexity may differ depending on the algorithm used for the calculations. This shows that the complexity of training linear NPPC is approximately O(kn³), where n is the input space dimension and the value of k depends on the number of training data. Since the input space dimension is small in most cases (typically below 100), linear NPPC is very fast and comparable to linear kernel PSVM and GEPSVM. On the other hand, the complexity of nonlinear kernel NPPC is approximately O(km³), where m is the total number of patterns in the training set and the value of k depends on m. Thus the complexity of nonlinear NPPC is much higher than that of linear NPPC, as the classification is performed in a higher-dimensional feature space. To reduce the complexity of NKNPPC, the reduced kernel technique [32] may be used: a subset (usually 1–10%) of the original training data is selected prior to training the classifier, which leads to the computation of a thin rectangular kernel matrix instead of a large square kernel. With the reduced kernel technique the complexity of NKNPPC becomes O(k m̄³), where m̄ is only the 1–10% of the total training data selected randomly. In contrast, the complexities of TWSVM, SVM and LS-SVM depend on the number of training patterns, as mentioned in Section 2, so for large databases these methods take a long time to train the classifier, although one can apply the reduced kernel technique to TWSVM [22] and use fixed-size LS-SVM [17] for large data sets. Thus the performance of linear NPPC is comparable to that of all the other SVM classifiers, while the nonlinear kernel NPPC performs significantly better than all the other classifiers except TWSVM in terms of accuracy, with a training time significantly lower than that of TWSVM.
Table 3. CPU time to learn the different linear classifiers on the synthetically generated 3 million points in 10 dimensions.

NPPC: 2.2039 s; TWSVM: not converged in 40 h; GEPSVM: 1.2824 s; PSVM: 0.3326 s; SVM: not converged in 40 h; LS-SVM: not converged in 40 h.
Table 4. CPU time to learn the different nonlinear classifiers with the Gaussian kernel on the Spambase data set [34].

NPPC: 18.3721 s; TWSVM: 9394 s; GEPSVM: 0.6678 s; PSVM: 19.2882 s; SVM: 27 043 s; LS-SVM: 11 296 s.
6. Conclusion

In this paper we have essentially compared the performance of classifiers obtained by the margin maximization concept in an MSE framework with that of classifiers obtained by the standard SVM approach. The experimental results on benchmark data sets suggest that SVM may be replaced by simpler optimization problems in several cases, in which we do not have to consider support vectors and inequality constraints. The computational results given in Tables 1 and 2 indicate that MSE optimization is sufficient in most cases, compared with dealing with a costly SVM problem. However, further investigation is required into how classifiers obtained by the MSE optimization approach perform on noisy data sets, or on data sets from different applications, compared with the standard SVM. One drawback of NPPC is that it has four regularization parameters to be selected by the user; therefore the selection of an optimal parameter set for NPPC, together with kernel selection, may be an area of further research. Furthermore, this approach can be extended to multicategory classification and to incremental classification for large data sets.
Acknowledgements

The authors would like to thank the referees for their very useful comments and suggestions, which greatly improved the presentation of the paper. The authors are also grateful to Professor P. Mitra and Professor A. Routray of IIT Kharagpur for their help in the presentation of the paper. Santanu Ghorai acknowledges the financial support of the authority of MCKV Institute of Engineering, Liluah, Howrah 711204, W.B., India, and of the All India Council for Technical Education (AICTE, India), in the form of salary and scholarship, respectively, for pursuing his Ph.D. degree under the Quality Improvement Programme (QIP) at the Indian Institute of Technology, Kharagpur 721302, India.
Appendix A. Supplementary data

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.sigpro.2008.10.002.
References
[1] C. Cortes, V.N. Vapnik, Support vector networks, Machine Learning
20 (3) (1995) 273–297.
[2] V. Vapnik, The Nature of Statistical Learning Theory, New York,
Springer, 1995.
[3] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector
Machines, vol. 3, Cambridge University Press, Cambridge, MA, 2000
Chapter 6, pp. 113–145.
[4] C.J.C. Burges, A tutorial on support vector machines for pattern
recognition, Data Mining Knowledge Discovery 2 (2) (1998)
121–167.
[5] S. Lee, A. Verri, Pattern recognition with support vector machines,
in: First International Workshop, SVM 2002, Springer, Niagara Falls,
Canada, 2002.
[6] T. Joachims, C. Ndellec, C. Rouveriol, Text categorization with
support vector machines: learning with many relevant features,
in: Proceedings of the European Conference on Machine Learning
(ECML), Berlin, 1998, pp. 137–142.
[7] D. Lin, N. Cristianini, C. Sugne, T. Furey, M. Ares, M. Brown, W.
Grundy, D. Haussler, Knowledge-base Analysis of Microarray Gene
Expression Data by Using Support Vector Machines, PNAS, vol. 97,
Springer, London, UK, 2000 Chapter 1, 262–267.
[8] W.S. Noble, Kernel Methods in Computational Biology, Support
Vector Machine Applications in Computational Biology, MIT Press,
Cambridge, 2004, pp. 71–92.
[9] T. Ebrahimi, G.N. Garcia, J.M. Vesin, Joint time-frequency-space
classification of EEG in a brain–computer interface application,
J. Appl. Signal Process. 1 (7) (2003) 713–729.
[10] T.N. Lal, M. Schroder, T. Hinterberger, J. Weston, M. Bogdan, N.
Birbaumer, B. Scholkopf, Support vector channel selection in BCI,
IEEE Trans. Biomed. Eng. 51 (6) (2004) 1003–1010.
[11] H. Ince, T.B. Trafalis, Support vector machine for regression and
applications to financial forecasting, in: International Joint Conference on Neural Networks (IJCNN'02), Como, Italy, IEEE-INNS-ENNS, 2002.
[12] C.J. Hsu, W.H. Chen, S. Wuc, Z. Huang, H. Chen, Credit rating
analysis with support vector machines and neural networks: a
market comparative study, Decision Support Systems 37 (2004)
543–558.
[13] N. Cristianini, J. Shawe Taylor, Kernel Methods for Pattern Analysis,
Cambridge University Press, Cambridge, UK, 2004.
[14] T. Joachims, Making large-scale support vector machine learning
practical, in: B. Schölkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances
in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, 1999, pp. 169–184.
[15] S. Haykin, Neural Networks—A Comprehensive Foundation, second
ed., Pearson Education, 2006, Chapter 4, pp. 235–240.
[16] J.A.K. Suykens, J. Vandewalle, Least squares support vector machine
classifiers, Neural Process. Lett. 9 (3) (1999) 293–300.
[17] J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor,
J. Vandewalle, Least Squares Support Vector Machines, World
Scientific Publishing Co., Singapore, 2002.
[18] G. Fung, O.L. Mangasarian, Proximal support vector machine
classifiers, in: 7th International Proceedings on Knowledge Discovery and Data Mining, 2001, pp. 77–86.
[19] T. Evgeniou, M. Pontil, T. Poggio, Regularization networks and
support vector machines, Advances Comput. Math. 1 (13) (2000)
1–50.
[20] T. Evgeniou, M. Pontil, T. Poggio, Regularization networks and
support vector machines, in: A. Smola, P. Bartlett, B. Schölkopf, D.
Schuurmans (Eds.), Advances in Large Margin Classifiers, MIT Press,
Cambridge, MA, 2000, pp. 171–203.
[21] O.L. Mangasarian, E.W. Wild, Multisurface proximal support vector
classification via generalized eigenvalues, IEEE Trans. Pattern Anal.
Machine Intell. 28 (1) (2006) 69–74.
[22] Jayadeva, R. Khemchandani, S. Chandra, Twin support vector
machines for pattern classification, IEEE Trans. Pattern Anal.
Machine Intell. 29 (5) (2007) 905–910.
[23] A.K. Jain, R.P.W. Duin, J. Mao, Statistical pattern recognition:
a review, IEEE Trans. Pattern Anal. Machine Intell. 22 (1) (2000)
4–37.
[24] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, 2002 Chapter 4, p. 138.
[25] J. Nocedal, S. Wright, Numerical Optimization, second ed., Springer,
2006 Chapter 16, p. 449.
[26] J. Chambers, A. Avlonities, A robust mixed-norm adaptive filter
algorithm, IEEE Signal Process. Lett. 4 (2) (1997) 46–48.
[27] M.S. Bazarra, H.D. Sherali, C.M. Shetty, Nonlinear Programming—Theory and Algorithms, second ed., Wiley, 2004 Chapter 4,
pp. 149–172.
[28] A.N. Tikhonov, V.Y. Arsenin, Solutions of Ill-posed Problems, Wiley,
New York, 1977.
[29] G.H. Golub, C.F. Van Loan, Matrix Computations, third ed., The John
Hopkins University Press, Baltimore, 1996 Chapter 2, p. 50.
[30] K.S. Chua, Efficient computations for large least square support
vector machine classifiers, Pattern Recognition Lett. 24 (2003)
75–80.
[31] W. Pao, L. Lan, D. Yang, The mixed norm proximal support vector
classifier, Department of Electronics Engineering, National Yunlin
University of Science & Technology, Taiwan.
[32] Y.-J. Lee, O.L. Mangasarian, RSVM: reduced support vector machines, Technical Report 00-07, Data Mining Institute, Computer Science Department, University of Wisconsin, Madison, WI, USA, July 2000. Available from: ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/00-07.ps.
[33] Checker data set, ftp://ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/checker.
[34] C.L. Blake, C.J. Merz, UCI Repository for Machine Learning Databases, Department of Information and Computer Sciences, University of California, Irvine, 1998, http://www.ics.uci.edu/~mlearn/MLRepository.html.
[35] MATLAB, User's Guide, The MathWorks, Inc., 1994–2001, http://www.mathworks.com.
[36] G. Fung, O.L. Mangasarian, SVM toolbox home page, http://www.cs.wisc.edu/dmi/svm/psvm.
[37] S.R. Gunn, Support vector machine Matlab toolbox, 1998, http://www.isis.ecs.soton.ac.uk/resources/svminfo/.
[38] LS-SVM toolbox, version 1.5 advanced, http://www.esat.kuleuven.ac.be/sista/lssvmlab/.
[39] T.M. Mitchell, Machine Learning, McGraw-Hill International, Singapore, 1997.