Signal Processing 89 (2009) 510–522

Nonparallel plane proximal classifier

Santanu Ghorai, Anirban Mukherjee, Pranab K. Dutta
Electrical Engineering Department, Indian Institute of Technology, Kharagpur 721302, West Bengal, India

Article history: Received 22 February 2008; received in revised form 15 August 2008; accepted 3 October 2008; available online 18 October 2008.

Abstract

We observed that the two costly optimization problems of the twin support vector machine (TWSVM) classifier can be avoided by introducing a technique as used in the proximal support vector machine (PSVM) classifier. With this modus operandi we formulate a much simpler nonparallel plane proximal classifier (NPPC), which speeds up training by removing a significant computational burden of TWSVM. The formulation of NPPC for binary data classification is based on two identical mean square error (MSE) optimization problems, which lead to solving two small systems of linear equations in input space. It thus eliminates the need for any specialized software for solving quadratic programming problems (QPPs). The formulation is also extended to nonlinear kernel classifiers. Our computations show that a MATLAB implementation of NPPC can be trained on a data set of 3 million points with 10 attributes in less than 3 s. Computational results on synthetic as well as on several benchmark data sets indicate the advantages of the proposed classifier in both computational time and test accuracy. The experimental results also indicate that in many cases the performance of classifiers obtained by the MSE approach is sufficient compared with classifiers obtained by the standard SVM approach. © 2008 Elsevier B.V. All rights reserved.

Keywords: Nonparallel plane; Pattern classification; Proximal classifier; Support vector machines

1. Introduction

The support vector machine (SVM) algorithm is an excellent tool for binary data classification [1–4]. This learning strategy, introduced by Vapnik and co-workers [1], is a principled and very powerful method in machine learning. Within a few years of its introduction SVM has outperformed most other systems in a wide variety of applications, spanning a wide spectrum of research areas ranging from pattern recognition [5], text categorization [6], biomedicine [7,8] and brain–computer interfaces [9,10] to financial applications [11,12]. The theory of SVM, proposed by Vapnik, is based on the structural risk minimization (SRM) principle [1–3]. In its simplest form, SVM for a linearly separable two-class problem finds an optimal hyperplane that maximizes the separation between the two classes. The hyperplane is obtained by solving a quadratic optimization problem. For nonlinearly separable cases the input feature vectors are first mapped into a high-dimensional feature space by using a nonlinear kernel function [4,13,14]. A linear classifier is then implemented in that feature space to classify the data. One of the main challenges of the standard SVM is that it requires a large training time for huge databases, as it has to optimize a computationally expensive cost function.
The performance of a trained SVM classifier also depends on the optimal parameter set, which is usually found by cross-validation on a tuning set [15]. The large training time of SVM also prevents one from locating the optimal parameter set on a very fine grid of parameters over a large span. To remove this drawback, various versions of SVM with comparable classification ability have been reported by many researchers. The introduction of proximal types of SVM [16–18] eradicates the above shortcoming of the standard SVM classifier: these classifiers avoid the costly optimization problem of SVM and, as a result, are very fast. Such formulations of SVM can be interpreted as regularized least squares and considered in the much more general context of regularized networks [19,20].

All the above classifiers discriminate a pattern by determining in which half space it lies. Mangasarian and Wild [21] first proposed a classification method based on the proximity of patterns to one of two nonparallel planes, which they named the generalized eigenvalue proximal support vector machine (GEPSVM) classifier. Instead of finding a single hyperplane, GEPSVM finds two nonparallel hyperplanes such that each plane is clustered around the data of one particular class. For this, GEPSVM solves two related generalized eigenvalue problems. Although this approach is called an SVM, it is more akin to discriminating patterns using the Fisher information criterion [13,15], because by changing the two-class margin representation from "parallel" to "nonparallel" hyperplanes it switches from a binary to a potentially many-class approach. The linear kernel GEPSVM is very fast, as it solves two generalized eigenvalue problems of the order of the input space dimension, but its performance is only comparable with the standard SVM and in many cases it gives low classification rates. Recently, Jayadeva et al. [22] proposed the twin support vector machine (TWSVM) classifier. In TWSVM two nonparallel planes are also generated, similar to GEPSVM but with a different technique: it solves two smaller sized quadratic programming problems (QPPs) instead of one large one as in the standard SVM [2–4]. Although both TWSVM and GEPSVM classify data by two nonparallel planes, the former is closer to a typical SVM problem, which does not eliminate the basic assumption of selecting a minimum number of support vectors [23]. Although TWSVM achieves good classification accuracy, solving two optimization problems is undesirable in many cases, predominantly for large data sets, due to the higher learning time. This fact motivates us to formulate the proposed classifier so that it has classification ability as good as TWSVM [22] and, at the same time, is as computationally efficient as PSVM [18] or linear GEPSVM [21].

In this paper we propose a binary data classifier, named the nonparallel plane proximal classifier (NPPC). NPPC also classifies binary data by their proximity to one of two nonparallel planes. The formulation of NPPC is totally different from that of GEPSVM [21], but the objective functions of NPPC are similar to those of TWSVM [22], with a different loss function and equality constraints instead of inequality constraints. We call this formulation a nonparallel plane proximal classifier (NPPC) rather than an SVM classifier, as there is no SRM by margin maximization between the two classes as in the standard SVM.
Thus it can be interpreted as a classifier obtained by regularized mean square error (MSE) optimization. Finally, and most importantly, the computational results on several data sets show that the performance of such classifiers obtained by MSE optimization is comparable to, or even better than, that of the SVM classifiers, and in many cases eliminates the need for a computationally costly SVM classifier.

The rest of this paper is organized as follows. A brief introduction to all the SVM classifiers is given in Section 2. In Section 3 we formulate NPPC for the linear kernel, with two visual examples in two dimensions, and in Section 4 we extend the formulation to nonlinear kernels and demonstrate its performance visually with one example. In Section 5 the performance of our proposed NPPC is compared with other SVM classifiers for linear and nonlinear kernels. Finally, Section 6 concludes the paper.

A brief note on the notation used in this paper [21] is as follows. All vectors are considered column vectors; if not, they are transposed using a superscript T. The inner product of two vectors x and y in the n-dimensional real space ℝ^n is denoted by x^T y, and the two-norm of x is denoted by ||x||. The vector e represents a column vector of ones of appropriate dimension, while I stands for an identity matrix of appropriate dimension. For a matrix A ∈ ℝ^{m×n} containing feature vectors, the ith row A_i is a row vector in ℝ^n. Matrices A and B contain the feature vectors of classes +1 and −1, respectively. For A ∈ ℝ^{m1×n} and C ∈ ℝ^{n×m}, a kernel K(A, C) maps ℝ^{m1×n} × ℝ^{n×m} into ℝ^{m1×m}. Only the symmetry property of the kernel is assumed [21], without any use of Mercer's positive definiteness condition [2–4,13,14]. The ijth element of the Gaussian kernel [2] assumed for testing nonlinear classification is given by

\[ (K(A,C))_{ij} = \varepsilon^{-\mu \|A_i^T - C_{\cdot j}\|^2}, \quad i = 1,\dots,m_1,\ j = 1,\dots,m, \]

where μ is a positive constant, ε is the base of the natural logarithm, C_{·j} denotes the jth column of C, and A and C are as described above.

2. Brief introduction of SVM

2.1. The linear SVM

SVM is a state-of-the-art machine learning algorithm based on the guaranteed risk bounds of statistical learning theory [1,2], known as the SRM principle. Among the several tutorials in the SVM literature we refer to [4]. Given m training pairs (x_1, y_1), …, (x_m, y_m), where x_i ∈ ℝ^n is an input vector labeled by y_i ∈ {+1, −1} for i = 1, …, m, the linear SVM classifier searches for an optimal separating hyperplane

\[ \omega^T x + b = 0, \qquad (1) \]

where b ∈ ℝ is the bias term and ω ∈ ℝ^n is the normal vector to the hyperplane. Expression (1) is obtained by solving the following convex QPP:

\[ \min_{\omega,b,\xi}\ \frac{1}{2}\|\omega\|^2 + c\sum_{i=1}^{m}\xi_i \quad \text{s.t.} \quad y_i(\omega^T x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ i = 1,\dots,m, \qquad (2) \]

where ξ_i ∈ ℝ is the soft margin error of the ith training sample. The parameter c (> 0) is the regularization parameter that balances the importance of maximizing the margin width against minimizing the training error. The solution of the quadratic optimization problem (2) is obtained by finding the saddle point of the Lagrange function, and the decision function is of the form

\[ f(x) = \operatorname{sgn}(\omega^T x + b) = \operatorname{sgn}\!\left(\sum_{i=1}^{m}\alpha_i y_i x_i^T x + b\right), \qquad (3) \]

where the α_i (0 < α_i < c) are the Lagrange multipliers. The complexity of solving (2) is of the order of O(m^3). Thus, as the number of training patterns increases, the time needed to train the SVM classifier also increases.
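The Gaussian kernel matrix defined above can be computed directly from its definition. The following NumPy sketch is given purely for illustration (the experiments in this paper use MATLAB); it takes both arguments as row-wise pattern matrices, so in the paper's notation it computes K(A, C^T), and the function name and parameter names are ours.

```python
import numpy as np

def gaussian_kernel(A, C, mu):
    """(K)_ij = exp(-mu * ||A_i - C_j||^2) for row patterns A (m1 x n) and C (m x n)."""
    # Squared Euclidean distances between every row of A and every row of C.
    sq_dist = (np.sum(A**2, axis=1)[:, None]
               + np.sum(C**2, axis=1)[None, :]
               - 2.0 * A @ C.T)
    return np.exp(-mu * np.maximum(sq_dist, 0.0))  # clip tiny negatives from round-off
```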
2.2. Least squares support vector machine (LS-SVM)

The drawback of SVM that it requires a large training time as the number of training patterns increases was removed by Suykens et al. [16,17]. They proposed a least squares version of SVM by formulating the classification problem as

\[ \min_{\omega,b,\xi}\ \frac{1}{2}\|\omega\|^2 + \frac{c}{2}\sum_{i=1}^{m}\xi_i^2 \quad \text{s.t.} \quad y_i(\omega^T x_i + b) = 1 - \xi_i,\ i = 1,\dots,m. \qquad (4) \]

Problem (4) finds the same hyperplane (1) by solving a set of m linear equations. The solution of (4) requires the inversion of a matrix of order m, and its computational complexity depends on the algorithm used: for example, Gaussian elimination leads to a complexity of O(m^3), whereas for the conjugate gradient method it is less than O(m^3) [17]. Thus LS-SVM is computationally cheaper than SVM, and its classification ability is comparable with that of SVM.

2.3. Proximal support vector machine (PSVM)

Fung and Mangasarian proposed PSVM [18], which classifies data by their proximity to one of the two parallel planes ω^T x + b = ±1, obtained by solving the following optimization problem:

\[ \min_{(\omega,b,\xi)\in\mathbb{R}^{n+1+m}}\ \frac{c}{2}\|\xi\|^2 + \frac{1}{2}(\omega^T\omega + b^2) \quad \text{s.t.} \quad D(A\omega - e b) + \xi = e, \qquad (5) \]

where D ∈ ℝ^{m×m} is a diagonal matrix containing +1 or −1 along its diagonal, representing the classes of the m training samples, A ∈ ℝ^{m×n} is the matrix containing the m training vectors in n-dimensional space, and e ∈ ℝ^m is a vector of ones. The solution of (5) involves the inversion of a matrix of order (n+1) × (n+1), so its complexity is approximately O(n^3), where n is the input space dimension. Therefore PSVM is much faster than SVM and at the same time performs as well as SVM.

2.4. Generalized eigenvalue proximal support vector machine (GEPSVM)

In GEPSVM [21], Mangasarian and Wild allowed the parallel planes of PSVM [18] to become nonparallel, and the data are classified by their proximity to one of two nonparallel planes in ℝ^n. Thus GEPSVM seeks the two nonparallel planes

\[ \omega_1^T x + b_1 = 0, \qquad \omega_2^T x + b_2 = 0, \qquad (6) \]

where the first plane is closest to the points of class +1 and furthest from the points of class −1, and the second plane has the opposite property. This leads to solving the following two optimization problems:

\[ \min_{(\omega_1,b_1)\neq 0}\ \frac{\|A\omega_1 + e_1 b_1\|^2 / \|[\omega_1;\, b_1]\|^2}{\|B\omega_1 + e_2 b_1\|^2 / \|[\omega_1;\, b_1]\|^2}, \quad \text{where } B\omega_1 + e_2 b_1 \neq 0, \qquad (7) \]

and

\[ \min_{(\omega_2,b_2)\neq 0}\ \frac{\|B\omega_2 + e_2 b_2\|^2 / \|[\omega_2;\, b_2]\|^2}{\|A\omega_2 + e_1 b_2\|^2 / \|[\omega_2;\, b_2]\|^2}, \quad \text{where } A\omega_2 + e_1 b_2 \neq 0. \qquad (8) \]

Here the matrices A ∈ ℝ^{m1×n} and B ∈ ℝ^{m2×n} contain the m1 and m2 training patterns of classes +1 and −1, respectively, in n-dimensional space, ω_1, ω_2 ∈ ℝ^n are the normal vectors and b_1, b_2 ∈ ℝ the bias terms of the respective planes, and e_1 ∈ ℝ^{m1}, e_2 ∈ ℝ^{m2} are vectors of ones. The solution of (7) and (8) leads to solving two generalized eigenvalue problems in ℝ^{(n+1)×(n+1)}. Thus its complexity is of the order of O(n^3), where n is the input space dimension, so linear GEPSVM is as fast as the PSVM classifier.

2.5. Twin support vector machine (TWSVM)

TWSVM [22] also finds two nonparallel planes, like GEPSVM, for data classification, by optimizing the following pair of QPPs:

\[ (\text{TWSVM1}) \quad \min_{(\omega_1,b_1,\xi_2)}\ \frac{1}{2}\|A\omega_1 + e_1 b_1\|^2 + c_1 e_2^T \xi_2 \quad \text{s.t.} \quad -(B\omega_1 + e_2 b_1) + \xi_2 \ge e_2,\ \ \xi_2 \ge 0, \qquad (9) \]

and

\[ (\text{TWSVM2}) \quad \min_{(\omega_2,b_2,\xi_1)}\ \frac{1}{2}\|B\omega_2 + e_2 b_2\|^2 + c_2 e_1^T \xi_1 \quad \text{s.t.} \quad (A\omega_2 + e_1 b_2) + \xi_1 \ge e_1,\ \ \xi_1 \ge 0, \qquad (10) \]

where c_1, c_2 > 0 are the regularization parameters of TWSVM, e_1 ∈ ℝ^{m1} and e_2 ∈ ℝ^{m2} are vectors of ones, and ξ_1 ∈ ℝ^{m1}, ξ_2 ∈ ℝ^{m2} are the error variable vectors due to the data of classes +1 and −1, respectively.

Thus the formulation of TWSVM differs from that of GEPSVM but is similar to that of SVM, with the difference that it minimizes two smaller sized optimization problems. If the patterns of the two classes each number approximately m/2, the complexity of solving the two optimization problems of TWSVM is O(2 × (m/2)^3). The ratio of runtimes between SVM with formulation (2) and TWSVM with (9) and (10) is therefore m^3/(2 × (m/2)^3) = 4, which makes TWSVM approximately four times faster [22] than SVM under the assumption that the two classes contain equal numbers of patterns.

3. The NPPC formulation

In this section we elaborate the formulation of the classifier, which we name the nonparallel plane proximal classifier (NPPC). In the formulation of NPPC we apply the concepts of both TWSVM [22] and PSVM [18], with some modification, to find two nonparallel planes. To obtain the two nonparallel planes defined in (6), the linear NPPC (LNPPC) solves the following pair of QPPs:

\[ (\text{LNPPC1}) \quad \min_{(\omega_1,b_1,\xi_2)\in\mathbb{R}^{n+1+m_2}}\ \frac{1}{2}\|A\omega_1 + e_1 b_1\|^2 + c_1 e_2^T\xi_2 + \frac{c_2}{2}\xi_2^T\xi_2 \quad \text{s.t.} \quad -(B\omega_1 + e_2 b_1) + \xi_2 = e_2, \qquad (11) \]

and

\[ (\text{LNPPC2}) \quad \min_{(\omega_2,b_2,\xi_1)\in\mathbb{R}^{n+1+m_1}}\ \frac{1}{2}\|B\omega_2 + e_2 b_2\|^2 + c_3 e_1^T\xi_1 + \frac{c_4}{2}\xi_1^T\xi_1 \quad \text{s.t.} \quad (A\omega_2 + e_1 b_2) + \xi_1 = e_1, \qquad (12) \]

where the matrices A ∈ ℝ^{m1×n} and B ∈ ℝ^{m2×n} contain the m1 and m2 training patterns of classes +1 and −1, respectively, in n-dimensional space, ω_1, ω_2 ∈ ℝ^n are the normal vectors and b_1, b_2 ∈ ℝ the bias terms of the respective planes, e_1 ∈ ℝ^{m1}, e_2 ∈ ℝ^{m2} are vectors of ones, ξ_1 ∈ ℝ^{m1}, ξ_2 ∈ ℝ^{m2} are the error variable vectors due to the data of classes +1 and −1, respectively, and c_1, c_2, c_3, c_4 > 0 are the four regularization parameters of the NPPC.

This formulation of NPPC finds two nonparallel hyperplanes, one for each class. The first term of the objective function of LNPPC1 minimizes the sum of the squared distances from the hyperplane to the patterns of class +1, and the constraint requires the patterns of class −1 to be at a distance 1 from the hyperplane, with soft errors. These errors are measured by the variables ξ_{2i}, for i = 1, 2, …, m_2, whenever the patterns of class −1 are not at a distance 1 from the hyperplane; unlike in TWSVM, this error can be either positive or negative. Thus the error variable ξ_{2i} is a measure of the distance to be added or subtracted to bring a pattern from its position to a distance 1 from the hyperplane.
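The proximal formulations reviewed above are attractive precisely because, unlike (2), (9) and (10), they reduce to plain linear algebra. As a concrete illustration, eliminating ξ from the equality constraint of (5) turns PSVM into a regularized least-squares problem with a closed-form solution. The NumPy sketch below is our own derivation from (5), not the authors' code; the function and variable names are ours, and we follow the sign convention of (5), classifying by sign(ω^T x − b).

```python
import numpy as np

def train_psvm_linear(A, d, c):
    """Linear PSVM sketch: A is m x n (all patterns), d is the +/-1 label vector, c > 0.

    Eliminating xi from (5) gives the unconstrained problem
        min (c/2)||e - D(Aw - e b)||^2 + 1/2 (w'w + b^2),
    whose minimizer is [w; b] = (E'E + I/c)^(-1) E' D e with E = [A, -e].
    """
    m, n = A.shape
    E = np.hstack([A, -np.ones((m, 1))])   # augmented data matrix [A, -e]
    rhs = E.T @ d                          # E' D e = E' d, since D e = d
    z = np.linalg.solve(E.T @ E + np.eye(n + 1) / c, rhs)
    w, b = z[:n], z[n]
    return w, b                            # classify x by sign(w @ x - b)
```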
The second and third terms of the objective function of (11) constitute a general quadratic error function [24,25] of the form f(x) = P^T x + x^T Q x, where P, x ∈ ℝ^k are vectors and Q ∈ ℝ^{k×k} is a symmetric positive semidefinite matrix, which ensures convexity of f(x). The motivation for such an error function is that the quadratic term is very sensitive to large errors while the linear term is less sensitive to them [26], so their combined effect better reduces the misclassification due to the patterns of class −1. Similarly, LNPPC2 finds a hyperplane clustered around the patterns of class −1 by minimizing the sum of squared distances of those patterns from it, while keeping the patterns of class +1 at a distance 1 with the soft error variable ξ_1. Thus each objective function belongs to one particular class, and the constraints are formed by the patterns of the other class. The inclusion of both linear and quadratic error terms in the objective function tracks the error competently and helps each hyperplane to be clustered precisely around its respective class data. Furthermore, the equality constraints make it possible to obtain the solutions of (11) and (12) by solving two small systems of linear equations. The formulation of NPPC is thus similar to that of TWSVM [22], with the inequality constraints made equalities and the nonnegativity constraints on the error variables ξ_1 and ξ_2 removed. In addition, both linear and quadratic error terms are used in the objective function, unlike TWSVM, which uses only a linear error term. The novelty of NPPC is therefore that, like TWSVM, it can fit nonparallel planes by minimizing two almost identical objective functions, as in PSVM [18], with different constants but without any regularization term like (1/2)(ω^Tω + b^2). PSVM [18], in contrast, solves a single objective function to generate two parallel planes and can never fit nonparallel planes to a binary data set.

The Lagrangian of LNPPC1 is given by

\[ L(\omega_1, b_1, \xi_2, \alpha) = \frac{1}{2}\|A\omega_1 + e_1 b_1\|^2 + c_1 e_2^T\xi_2 + \frac{c_2}{2}\xi_2^T\xi_2 - \alpha^T\big[-(B\omega_1 + e_2 b_1) + \xi_2 - e_2\big], \qquad (13) \]

where α ∈ ℝ^{m2} is the vector of Lagrange multipliers. The Karush–Kuhn–Tucker (KKT) optimality conditions [27] for LNPPC1 are obtained by setting the partial derivatives of L to zero:

\[ \frac{\partial L}{\partial \omega_1} = A^T(A\omega_1 + e_1 b_1) + B^T\alpha = 0, \qquad \frac{\partial L}{\partial b_1} = e_1^T(A\omega_1 + e_1 b_1) + e_2^T\alpha = 0, \]
\[ \frac{\partial L}{\partial \xi_2} = c_1 e_2 + c_2 \xi_2 - \alpha = 0, \qquad -(B\omega_1 + e_2 b_1) + \xi_2 - e_2 = 0. \qquad (14) \]

Combining the first two expressions of (14) leads to

\[ [A \ \ e_1]^T [A \ \ e_1]\begin{bmatrix}\omega_1 \\ b_1\end{bmatrix} + [B \ \ e_2]^T\alpha = 0. \qquad (15) \]

Now let

\[ H = [A \ \ e_1], \qquad G = [B \ \ e_2], \qquad u = \begin{bmatrix}\omega_1 \\ b_1\end{bmatrix}. \qquad (16) \]

With these substitutions, (15) reduces to

\[ H^T H u + G^T\alpha = 0, \quad \text{i.e.,} \quad u = -(H^T H)^{-1} G^T\alpha. \qquad (17) \]

Eq. (17) can be modified, if required, by adding a regularization term [21,22,28] δI, with δ > 0 and I an identity matrix of appropriate dimension, to avoid possible ill-conditioning of H^T H, although it is positive definite. The modified expression for u then becomes

\[ u = -(H^T H + \delta I)^{-1} G^T\alpha. \qquad (18) \]

Using the notation of (16) and ξ_2 = (α − c_1 e_2)/c_2 from the third expression of (14), and substituting (17) into the last expression of (14), we get

\[ \alpha = \left(\frac{I}{c_2} + G(H^T H)^{-1}G^T\right)^{-1}\left(\frac{c_1}{c_2} + 1\right)e_2. \qquad (19) \]

Here α entails the inversion of a massive m_2 × m_2 matrix and a few matrix multiplications. Therefore, to reduce the computational cost of obtaining α from (19), we directly apply the Sherman–Morrison–Woodbury formula [29] for matrix inversion, as used in [18,21,30]. This results in the computation of α as

\[ \alpha = c_2\left\{I - G(H^T H)^{-1}\left[\frac{I}{c_2} + G^T G(H^T H)^{-1}\right]^{-1}G^T\right\}\left(\frac{c_1}{c_2} + 1\right)e_2. \qquad (20) \]

Computing α from this expression involves the inversion of a matrix of order (n+1) × (n+1), where the input space dimension n is usually much smaller than m_2, and completely solves the classification problem. An interesting fact observed from (20) is that if c_1 = 0, i.e., if the linear error term is excluded from the objective function (11), the solution for α reduces to that obtained with only the quadratic error term. Thus the inclusion of both linear and quadratic error terms in the objective function offers an extra degree of freedom [31] for fine-tuning the decision planes of the classifier by varying the ratio c_1/c_2.
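For completeness, the step from (19) to (20) uses the Sherman–Morrison–Woodbury formula [29] in the following instantiation (this intermediate identity is our own spelled-out step, not given explicitly in the reference):

\[ \left(\frac{I}{c_2} + G(H^{T}H)^{-1}G^{T}\right)^{-1} = c_2\left\{I - G(H^{T}H)^{-1}\left[\frac{I}{c_2} + G^{T}G(H^{T}H)^{-1}\right]^{-1}G^{T}\right\}, \]

so the m_2 × m_2 inverse required in (19) is traded for (n+1) × (n+1) inverses only, which is what makes (20) cheap when n ≪ m_2.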
Similarly, we construct the Lagrangian of LNPPC2 as

\[ L(\omega_2, b_2, \xi_1, \gamma) = \frac{1}{2}\|B\omega_2 + e_2 b_2\|^2 + c_3 e_1^T\xi_1 + \frac{c_4}{2}\xi_1^T\xi_1 - \gamma^T\big[(A\omega_2 + e_1 b_2) + \xi_1 - e_1\big], \qquad (21) \]

where γ ∈ ℝ^{m1} is the vector of Lagrange multipliers. Following the same procedure as above for the optimal solution of (12), we obtain the augmented vector

\[ v = \begin{bmatrix}\omega_2 \\ b_2\end{bmatrix} = (G^T G + \delta I)^{-1} H^T\gamma \qquad (22) \]

and

\[ \gamma = c_4\left\{I - H(G^T G)^{-1}\left[\frac{I}{c_4} + H^T H(G^T G)^{-1}\right]^{-1}H^T\right\}\left(\frac{c_3}{c_4} + 1\right)e_1. \qquad (23) \]

Once α and γ are computed from (20) and (23), respectively, the augmented vectors u and v, and hence (ω_1, b_1) and (ω_2, b_2), are obtained from (18) and (22). This completely solves the training of LNPPC, and we obtain the expressions of the hyperplanes (6) for classifying new patterns. A new data sample x ∈ ℝ^n is assigned to class k by comparing the following distance measure [22] of it from the two hyperplanes given by (6):

\[ \text{Class } k = \arg\min_{k=1,2}\ |x^T\omega_k + b_k|. \qquad (24) \]

We now state our simple algorithm for implementing a linear NPPC.

Algorithm 1. Linear nonparallel plane proximal classifier. Given m1 and m2 training patterns of classes +1 and −1, respectively, in n-dimensional space, represented by the matrices A ∈ ℝ^{m1×n} and B ∈ ℝ^{m2×n}, the linear NPPC is generated as follows:
(i) Define the augmented matrices H and G by (16), where e_1 and e_2 are m1 × 1 and m2 × 1 vectors of ones, respectively. With these values compute the Lagrange multipliers α and γ from Eqs. (20) and (23) with some positive values of c_1, c_2, c_3 and c_4. Typically these values are chosen by means of a tuning set.
(ii) Determine the augmented vectors u = [ω_1; b_1] and v = [ω_2; b_2] from (18) and (22), respectively, to obtain the two nonparallel planes (6).
(iii) Classify a new pattern x ∈ ℝ^n by using (24).
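As an illustration of Algorithm 1, the following NumPy sketch implements Eqs. (16), (20), (18), (23), (22) and (24). The experiments in this paper were run in MATLAB; the Python code, function names and variable names below are ours, and the regularization δI is applied to every (H^T H)^{-1} and (G^T G)^{-1}, as suggested after (17).

```python
import numpy as np

def train_linear_nppc(A, B, c1, c2, c3, c4, delta=1e-7):
    """Linear NPPC: A holds class +1 patterns (m1 x n), B holds class -1 patterns (m2 x n)."""
    m1, n = A.shape
    m2 = B.shape[0]
    H = np.hstack([A, np.ones((m1, 1))])          # H = [A  e1], Eq. (16)
    G = np.hstack([B, np.ones((m2, 1))])          # G = [B  e2]
    I_aug = np.eye(n + 1)
    HtH_inv = np.linalg.inv(H.T @ H + delta * I_aug)
    GtG_inv = np.linalg.inv(G.T @ G + delta * I_aug)

    # Eq. (20): alpha via the Sherman-Morrison-Woodbury form, (n+1)x(n+1) inverses only.
    S = np.linalg.inv(I_aug / c2 + G.T @ G @ HtH_inv)
    rhs2 = (c1 / c2 + 1.0) * np.ones(m2)
    alpha = c2 * (rhs2 - G @ (HtH_inv @ (S @ (G.T @ rhs2))))
    # Eq. (23): gamma, the symmetric expression for the second plane.
    T = np.linalg.inv(I_aug / c4 + H.T @ H @ GtG_inv)
    rhs1 = (c3 / c4 + 1.0) * np.ones(m1)
    gamma = c4 * (rhs1 - H @ (GtG_inv @ (T @ (H.T @ rhs1))))

    u = -HtH_inv @ G.T @ alpha                    # Eq. (18): u = [w1; b1]
    v = GtG_inv @ H.T @ gamma                     # Eq. (22): v = [w2; b2]
    return (u[:n], u[n]), (v[:n], v[n])

def predict_linear_nppc(X, plane1, plane2):
    """Eq. (24): assign each row of X to the closer of the two planes (+1 or -1)."""
    (w1, b1), (w2, b2) = plane1, plane2
    d1 = np.abs(X @ w1 + b1)
    d2 = np.abs(X @ w2 + b2)
    return np.where(d1 <= d2, 1, -1)
```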
At this point we put NPPC side by side with the other two nonparallel plane classifiers, the TWSVM and GEPSVM classifiers, using two examples for visual illustration in two dimensions. All the classifiers are trained to obtain maximum training accuracy by varying their corresponding parameters. Fig. 1 shows the nonparallel planes learned by the above classifiers on a linearly nonseparable data set containing 1241 patterns, with 765 patterns in class +1 and 476 patterns in class −1. Fig. 1(a) and (b) show the NPPC obtained with the linear error term set to zero in the objective functions and with both error terms included, respectively. While the former classifier gives a training set accuracy of 93.71%, the latter gives 96.21%; the improvement points toward the effectiveness of including both error terms in the objective functions. Fig. 1(c) and (d) show the classifiers learned by TWSVM and GEPSVM, respectively. Together these show that the training accuracies of NPPC (96.21%) and TWSVM (96.70%) are better than that of GEPSVM (77.20%). At the same time, the training times of the three classifiers show that NPPC is much faster than TWSVM and as fast as GEPSVM.

Fig. 1. A linearly non-separable data set learned by (a) NPPC without the linear error term in the objective functions, obtained with c1 = 0, c2 = 2^2, c3 = 0, c4 = 2^0 and δ = 10^−7; training accuracy 93.71%, training time 0.0009 s. (b) NPPC with c1 = 2^4, c2 = 2^0, c3 = 2^3, c4 = 2^1 and δ = 10^−7; training accuracy 96.21%, training time 0.0009 s. (c) TWSVM classifier with c = 2^1 and δ = 10^−7; training accuracy 96.70%, training time 189.7516 s. (d) GEPSVM classifier with δ = 10^4; training accuracy 77.20%, training time 0.0019 s. In all figures "+" and "o" signs represent class +1 and class −1 patterns, respectively; the solid line represents the plane clustered around class +1 patterns and the dotted line the plane clustered around class −1 patterns.

In Fig. 2(a)–(c) we show the same three classifiers learned on a noisy cross-plane data set. The training set accuracy of both NPPC and TWSVM is 93.55% for this example, whereas for GEPSVM it is only 80.65%. Thus linear NPPC proves its effectiveness by learning this tricky noisy cross-plane data set well. In Fig. 2(d)–(f) we show the effect of the linear and quadratic terms used in the loss function of the NPPC formulation. For this cross-plane data set the effect of the linear term is negligible, as seen from Fig. 2(d), where the position of the plane clustered around the class +1 data does not change as c1 is reduced to 0 from 2^3, keeping the other parameters fixed. But from Fig. 2(e) and (f) it is seen that as the value of c2 increases, keeping the other three parameters fixed, the plane gradually shifts away from the class −1 data. We therefore find that if all the patterns of the opposite class are below the requisite distance 1, the effect of the linear error term is more prominent than that of the quadratic term; the reverse is true if the same patterns are far beyond the requisite distance. A proper combination of the four parameters c1, c2, c3 and c4 thus allows the classifier to track the error competently whenever the hyperplane swings in either direction from the requisite distance of 1, and hence to learn better. The better performance of the classifier may be explained physically by the fact that it minimizes both the average error and the sum of squared errors (an energy term) due to the opposite-class data whenever those data are not at a distance 1 while fitting a hyperplane around a particular class.

Fig. 2. A cross-plane data set with random noise learned by (a) NPPC with c1 = 2^3, c2 = 2^−6, c3 = 2^1, c4 = 2^8 and δ = 10^−7; training accuracy 93.55%. (b) TWSVM classifier with c = 2^5 and δ = 10^−7; training accuracy 93.55%. (c) GEPSVM classifier with δ = 10^3; training accuracy 80.65%. (d) Effect of changing only c1 of the NPPC in (a) to 0. (e)–(f) Shifting of the hyperplane clustered around the class +1 data due to changing only c2 of the NPPC in (a) to 2^−5 and 2^−4, respectively. In all figures "+" and "o" signs represent class +1 and class −1 patterns, respectively; the solid line represents the plane clustered around class +1 patterns and the dotted line the plane clustered around class −1 patterns.

4. Nonlinear kernel nonparallel plane proximal classifier (NKNPPC) formulation

In this section we extend our formulation to nonlinear classifiers by considering kernel-generated surfaces instead of planes [18,21,22]. For the nonlinearly separable case, the input data are first projected into a kernel-generated feature space of the same or higher dimension than the input space. To apply this transformation, let K(·,·) be a nonlinear kernel function and define the augmented matrix

\[ C = \begin{bmatrix} A \\ B \end{bmatrix} \in \mathbb{R}^{m\times n}, \qquad (25) \]

where m1 + m2 = m, the total number of patterns in the training set. We now construct the nonlinear kernel NPPC (NKNPPC) objective functions as follows:

\[ (\text{NKNPPC1}) \quad \min_{(y_1,b_1,\xi_2)\in\mathbb{R}^{m+1+m_2}}\ \frac{1}{2}\|K(A, C^T)y_1 + e_1 b_1\|^2 + c_1 e_2^T\xi_2 + \frac{c_2}{2}\xi_2^T\xi_2 \quad \text{s.t.} \quad -(K(B, C^T)y_1 + e_2 b_1) + \xi_2 = e_2, \qquad (26) \]

and

\[ (\text{NKNPPC2}) \quad \min_{(y_2,b_2,\xi_1)\in\mathbb{R}^{m+1+m_1}}\ \frac{1}{2}\|K(B, C^T)y_2 + e_2 b_2\|^2 + c_3 e_1^T\xi_1 + \frac{c_4}{2}\xi_1^T\xi_1 \quad \text{s.t.} \quad (K(A, C^T)y_2 + e_1 b_2) + \xi_1 = e_1, \qquad (27) \]

where y_1, y_2 ∈ ℝ^m are the normal vectors to the kernel-generated surfaces of the objective functions NKNPPC1 and NKNPPC2, respectively, and the rest of the parameters are as described in Sections 2 and 3 for the linear case. Constructing the Lagrangians of problems (26) and (27) and defining the matrices P ∈ ℝ^{m1×(m+1)}, Q ∈ ℝ^{m2×(m+1)} and the augmented vectors λ_1, λ_2 ∈ ℝ^{m+1} as

\[ P = [K(A, C^T) \ \ e_1], \qquad Q = [K(B, C^T) \ \ e_2], \qquad \lambda_1 = \begin{bmatrix} y_1 \\ b_1 \end{bmatrix}, \qquad \lambda_2 = \begin{bmatrix} y_2 \\ b_2 \end{bmatrix}, \qquad (28) \]

the expressions for λ_1, λ_2 and the Lagrange multipliers α for (26) and γ for (27) can be written as

\[ \lambda_1 = -(P^T P)^{-1} Q^T\alpha, \qquad (29) \]
\[ \lambda_2 = (Q^T Q)^{-1} P^T\gamma, \qquad (30) \]
\[ \alpha = \left(\frac{I}{c_2} + Q(P^T P)^{-1}Q^T\right)^{-1}\left(\frac{c_1}{c_2} + 1\right)e_2, \qquad (31) \]
\[ \gamma = \left(\frac{I}{c_4} + P(Q^T Q)^{-1}P^T\right)^{-1}\left(\frac{c_3}{c_4} + 1\right)e_1. \qquad (32) \]

At this stage it is worth mentioning that, unlike the situation with linear kernels, the Sherman–Morrison–Woodbury formula is not suitable for (31) and (32), because Q(P^T P)^{-1}Q^T and P(Q^T Q)^{-1}P^T are square matrices of size m2 × m2 and m1 × m1, respectively, so the inversions must take place in the potentially high-dimensional spaces ℝ^{m2} and ℝ^{m1}. Furthermore, if the number of patterns in either class becomes very large, the reduced kernel technique [18,21,22,32] may be applied to reduce the dimensionality of NKNPPC1 and NKNPPC2. If necessary, we can also introduce a regularization term δI, δ > 0, while inverting (P^T P) in (29) and (31) or (Q^T Q) in (30) and (32); in practice we have inserted such a term in those expressions for testing with nonlinear kernels. Once the values of λ_1 and λ_2 are obtained from (29) and (30), respectively, the kernel-generated surfaces of NKNPPC are given by

\[ K(x^T, C^T)y_1 + b_1 = 0 \quad \text{and} \quad K(x^T, C^T)y_2 + b_2 = 0. \qquad (33) \]

A new data point x ∈ ℝ^n is assigned to class k (k = 1, 2) by comparing its distance from the two kernel-generated surfaces given by (33), after projecting it into the kernel-generated feature space:

\[ \text{Class } k = \arg\min_{k=1,2}\ |K(x^T, C^T)y_k + b_k|. \qquad (34) \]

We now state the explicit steps of the nonlinear kernel NPPC implementation, analogous to Algorithm 1.

Algorithm 2. Nonlinear kernel nonparallel plane proximal classifier. Given m1 and m2 training patterns of classes +1 and −1, respectively, in n-dimensional space, represented by the matrices A ∈ ℝ^{m1×n} and B ∈ ℝ^{m2×n}, the nonlinear kernel NPPC is generated as follows:
(i) Form the matrix C ∈ ℝ^{m×n} as defined in (25), considering all the rows of the A and B matrices, or taking a small fraction (usually 1–10%) of the training data randomly for a reduced kernel [32].
(ii) Choose a kernel function K(A, C^T).
(iii) Define the augmented matrices P and Q as in (28), where e_1 and e_2 are m1 × 1 and m2 × 1 vectors of ones, respectively. With these values compute the Lagrange multipliers α and γ from Eqs. (31) and (32) with some positive values of the regularization parameters c_1, c_2, c_3 and c_4, as well as the kernel parameters of the selected kernel. Typically these values are chosen by means of a tuning set.
(iv) Compute the augmented vectors λ_1 = [y_1; b_1] and λ_2 = [y_2; b_2] by (29) and (30), respectively, to obtain the nonlinear surfaces (33).
(v) Classify a new pattern x ∈ ℝ^n by using (34).
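A corresponding sketch of Algorithm 2, again written in NumPy purely for illustration and reusing the gaussian_kernel helper sketched in Section 2, is given below. The optional reduced-kernel branch follows the random-subset idea of [32]; all names and defaults are ours.

```python
import numpy as np

def train_kernel_nppc(A, B, c1, c2, c3, c4, mu, delta=1e-7, reduced_frac=None, rng=None):
    """Nonlinear kernel NPPC (Algorithm 2) with the Gaussian kernel of parameter mu."""
    m1, m2 = A.shape[0], B.shape[0]
    C = np.vstack([A, B])                                    # Eq. (25)
    if reduced_frac is not None:                             # reduced kernel: random 1-10% of rows [32]
        rng = rng or np.random.default_rng(0)
        keep = rng.choice(C.shape[0], size=max(1, int(reduced_frac * C.shape[0])), replace=False)
        C = C[keep]
    mbar = C.shape[0]

    P = np.hstack([gaussian_kernel(A, C, mu), np.ones((m1, 1))])   # P = [K(A, C') e1], Eq. (28)
    Q = np.hstack([gaussian_kernel(B, C, mu), np.ones((m2, 1))])   # Q = [K(B, C') e2]
    I_aug = np.eye(mbar + 1)
    PtP_inv = np.linalg.inv(P.T @ P + delta * I_aug)
    QtQ_inv = np.linalg.inv(Q.T @ Q + delta * I_aug)

    # Eqs. (31) and (32): m2 x m2 and m1 x m1 linear systems (no Sherman-Morrison-Woodbury here).
    alpha = np.linalg.solve(np.eye(m2) / c2 + Q @ PtP_inv @ Q.T, (c1 / c2 + 1.0) * np.ones(m2))
    gamma = np.linalg.solve(np.eye(m1) / c4 + P @ QtQ_inv @ P.T, (c3 / c4 + 1.0) * np.ones(m1))

    lam1 = -PtP_inv @ Q.T @ alpha                            # Eq. (29): [y1; b1]
    lam2 = QtQ_inv @ P.T @ gamma                             # Eq. (30): [y2; b2]
    return C, (lam1[:-1], lam1[-1]), (lam2[:-1], lam2[-1]), mu

def predict_kernel_nppc(X, model):
    """Eq. (34): assign each row of X to the closer kernel-generated surface (+1 or -1)."""
    C, (y1, b1), (y2, b2), mu = model
    Kx = gaussian_kernel(X, C, mu)
    return np.where(np.abs(Kx @ y1 + b1) <= np.abs(Kx @ y2 + b2), 1, -1)
```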
To illustrate graphically the effectiveness of the nonlinear kernel NPPC, we test its ability to learn the checkerboard data set [33]. This data set consists of 1000 points in ℝ^2, black and white points taken from the 16 black and white squares of a checkerboard. It is a tricky test case in data mining for evaluating the performance of nonlinear classifiers. NKNPPC with the Gaussian kernel and appropriate parameters nevertheless generates a clear distinction between the two types of data points, as shown in Fig. 3. This demonstrates the effectiveness of the NKNPPC formulation in learning highly nonlinear patterns.

Fig. 3. Checkerboard data set learned by NKNPPC with the Gaussian kernel. The classifier is obtained with kernel parameter μ = 25 and regularization parameters c1 = 1, c2 = 0.01, c3 = 1, c4 = 0.01 and δ = 10^−7. Training accuracy 98.80%; CPU time for training 3.40 s. In the figure "+" and "o" signs represent class +1 and class −1 patterns, respectively.

5. Numerical testing and comparison

To assess the performance of our NPPC we report results, in terms of accuracy and execution time, on publicly available benchmark data sets from the UCI Repository [34], which are commonly used for testing machine learning algorithms. All the classification methods are implemented in MATLAB 7 [35] on Windows XP running on a PC with an Intel P4 processor (3.06 GHz) and 1 GB of RAM. We compare both linear and nonlinear kernel classifiers using the NPPC, TWSVM [22], GEPSVM [21], PSVM [18], one-norm SVM with formulation (2), and LS-SVM [17] algorithms. For the CPU time and accuracy figures of PSVM, SVM and LS-SVM we used the MATLAB codes from the SVM toolbox home page [36], the Gunn SVM toolbox [37] and the LS-SVM 1.5 advanced toolbox with "CMEX" implementation [38], respectively. For the implementation of TWSVM we used the optimizer code "qp.dll" from the Gunn SVM toolbox [37]. NPPC and GEPSVM are implemented using simple MATLAB functions such as "inv" and "eig", respectively. The average testing accuracies of all the methods are calculated using the standard 10-fold cross-validation method ([39], pp. 111–112), as used for all the other methods. The parameter δ for GEPSVM is selected from the set of values {10^i | i = −7, −6, …, 7}, and the value of δ for both NPPC and TWSVM is set to 10^−7 in all cases. Further, the optimal values of the regularization parameters for all methods are selected from the set of values {2^i | i = −7, −6, …, 15} by tuning on a set comprising a random 10% of the training data. Once the parameters were selected, the tuning set was returned to the training set to learn the final classifier. To calculate the accuracy, random permutations of the data points are also performed before proceeding to the 10-fold testing. Besides the average 10-fold testing accuracy, we also report the average CPU time for training for each method, and the p-value at the 5% significance level. The p-value was calculated by performing a paired t-test ([39], p. 148) comparing each of the other classifiers with NPPC, under the null hypothesis that there is no difference between the test set accuracy distributions.
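The tuning and evaluation protocol described above can be sketched as follows, reusing the train_linear_nppc and predict_linear_nppc functions given after Algorithm 1. This is our own illustrative loop, not the authors' MATLAB scripts; for brevity the four regularization parameters are tied in pairs (c3 = c1, c4 = c2), which the paper does not require, and the random seeds are arbitrary.

```python
import numpy as np
from itertools import product

def tune_nppc(X, y, grid=tuple(2.0 ** i for i in range(-7, 16)), rng=None):
    """Select (c1, c2, c3, c4) by accuracy on a random 10% tuning split (c3 = c1, c4 = c2 here)."""
    rng = rng or np.random.default_rng(0)
    tune = rng.choice(len(y), size=max(1, len(y) // 10), replace=False)
    train = np.setdiff1d(np.arange(len(y)), tune)
    best, best_acc = None, -1.0
    for c1, c2 in product(grid, grid):
        planes = train_linear_nppc(X[train][y[train] == 1], X[train][y[train] == -1], c1, c2, c1, c2)
        acc = np.mean(predict_linear_nppc(X[tune], *planes) == y[tune])
        if acc > best_acc:
            best, best_acc = (c1, c2, c1, c2), acc
    return best

def tenfold_accuracy(X, y, params, k=10, rng=None):
    """Average k-fold test accuracy on the full (randomly permuted) data for the chosen parameters."""
    rng = rng or np.random.default_rng(0)
    folds = np.array_split(rng.permutation(len(y)), k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.hstack(folds[:i] + folds[i + 1:])
        planes = train_linear_nppc(X[train][y[train] == 1], X[train][y[train] == -1], *params)
        accs.append(np.mean(predict_linear_nppc(X[test], *planes) == y[test]))
    return float(np.mean(accs))
```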
Table 1. Ten-fold testing comparison of average CPU training time and average test set accuracy of the linear classifiers, with p-values comparing each method to NPPC. For each data set (size × attributes), the entries are training time in seconds and test accuracy in % (mean ± standard deviation).

WPBC (110 × 32): NPPC 0.0024 s, 79.84±9.70; TWSVM 0.9250 s, 78.32±8.89, p = 0.091898; GEPSVM 0.0077 s, 70.76±10.55, p = 0.038464; PSVM 0.0006 s, 79.21±9.04, p = 0.944048; SVM 2.8472 s, 78.37±8.55, p = 0.846563; LS-SVM 0.2224 s, 80.29±11.32, p = 0.285851.
Ionosphere (351 × 34): NPPC 0.0029 s, 87.44±6.30; TWSVM 3.1574 s, 81.48±5.74, p = 0.001467; GEPSVM 0.0096 s, 76.00±15.71, p = 0.000536; PSVM 0.0006 s, 86.04±6.93, p = 0.132871; SVM 12.6057 s, 88.2±4.51, p = 0.147621; LS-SVM 0.3954 s, 84.01±10.56, p = 0.024780.
Heart-c (303 × 24): NPPC 0.0011 s, 83.33±3.41; TWSVM 1.2671 s, 83.70±3.78, p = 0.084655; GEPSVM 0.0021 s, 66.67±8.92, p = 0.000019; PSVM 0.0039 s, 84.81±5.60, p = 0.814601; SVM 5.7702 s, 83.70±3.39, p = 0.897386; LS-SVM 0.0938 s, 84.44±4.32, p = 0.677108.
Pima-Indian (768 × 8): NPPC 0.0014 s, 77.20±5.29; TWSVM 26.4315 s, 77.85±4.85, p = 0.177352; GEPSVM 0.0016 s, 74.09±4.64, p = 0.003197; PSVM 0.0006 s, 77.07±6.42, p = 0.966127; SVM 108.5596 s, 77.34±4.37, p = 0.481374; LS-SVM 0.5076 s, 77.20±5.42, p = 0.998165.
BUPA liver (345 × 6): NPPC 0.0010 s, 70.21±10.67; TWSVM 2.6117 s, 66.40±7.74, p = 0.118958; GEPSVM 0.0012 s, 56.29±7.35, p = 0.000174; PSVM 0.0031 s, 68.66±7.40, p = 0.786537; SVM 11.2190 s, 67.78±5.51, p = 0.580362; LS-SVM 0.0882 s, 69.34±10.48, p = 0.273269.
Votes (435 × 16): NPPC 0.0015 s, 95.63±3.36; TWSVM 5.2407 s, 96.32±3.63, p = 0.083672; GEPSVM 0.0018 s, 91.03±6.13, p = 0.638405; PSVM 0.0025 s, 95.64±2.78, p = 0.224295; SVM 21.6651 s, 95.16±1.93, p = 0.948016; LS-SVM 0.3172 s, 95.63±3.36, p = 1.000000.
Breast cancer (699 × 16): NPPC 0.0013 s, 94.25±6.25; TWSVM 22.4713 s, 94.97±5.35, p = 0.212593; GEPSVM 0.0016 s, 73.65±11.58, p = 0.000168; PSVM 0.0006 s, 94.42±2.43, p = 0.946173; SVM 85.0442 s, 95.14±1.14, p = 0.703021; LS-SVM 0.4599 s, 94.25±6.25, p = 1.000000.
Heart-Statlog (270 × 14): NPPC 0.0011 s, 84.44±5.93; TWSVM 1.2181 s, 83.70±3.78, p = 0.679057; GEPSVM 0.0022 s, 66.67±8.92, p = 0.000136; PSVM 0.0006 s, 85.19±4.06, p = 0.763818; SVM 5.7694 s, 83.33±9.11, p = 0.703757; LS-SVM 0.0937 s, 84.81±4.52, p = 0.832052.
Sonar (208 × 60): NPPC 0.0058 s, 77.00±7.46; TWSVM 0.6432 s, 77.00±9.10, p = 0.948520; GEPSVM 0.0314 s, 68.69±10.02, p = 0.015912; PSVM 0.0006 s, 74.98±11.28, p = 0.751494; SVM 2.9825 s, 78.93±10.43, p = 0.704704; LS-SVM 0.0654 s, 78.36±4.42, p = 0.610881.
Australian (690 × 14): NPPC 0.0018 s, 86.23±5.47; TWSVM 16.6121 s, 85.94±5.84, p = 0.443456; GEPSVM 0.0024 s, 71.88±4.16, p = 0.000168; PSVM 0.0006 s, 85.22±4.24, p = 0.650593; SVM 86.8776 s, 85.51±4.85, p = 0.784918; LS-SVM 0.7082 s, 85.80±5.83, p = 0.278964.
German (1000 × 24): NPPC 0.0036 s, 76.30±5.75; TWSVM 76.9041 s, 75.70±5.85, p = 0.101038; GEPSVM 0.0056 s, 74.00±4.90, p = 0.902013; PSVM 0.0006 s, 76.70±2.90, p = 0.839176; SVM 251.1579 s, 75.60±4.34, p = 0.404828; LS-SVM 2.5300 s, 76.20±5.88, p = 0.474957.

In Table 1 we compare the linear classifiers on 11 data sets from the UCI Repository [34]. From Table 1 it is seen that there is no significant difference in classification accuracy between the linear NPPC and TWSVM classifiers, except for the Ionosphere data set, on which NPPC performs better. There is, however, a noteworthy improvement in training time for NPPC over TWSVM. In contrast, the classification accuracy of GEPSVM is significantly lower than that of NPPC in all cases except the Votes and German data sets, as seen from the p-values.
On the other hand, the classification accuracies of PSVM, SVM, LS-SVM and NPPC are comparable for all the data sets, as the p-values for all these test cases are above 0.05, indicating no statistical difference in the results. Thus the proposed formulation of NPPC, with both linear and quadratic error terms in its objective functions, allows the classifier to learn the data sets well while reducing the generalization error. It is also noticeable from the tests that, for the linear kernel, the required training time of NPPC is extremely small compared with SVM and LS-SVM, and comparable with that of PSVM and GEPSVM. It should be noted that the Gunn SVM toolbox [37] uses a ".dll" file for solving the optimization problem and that we used the "CMEX" implementation of LS-SVM from [38] for comparison; both of these implementations are faster than a pure MATLAB implementation.

Table 2 compares the performance of the different nonlinear kernel classifiers using a Gaussian kernel on eight data sets. For the nonlinear kernel classifiers we follow the same procedure to find the optimal parameter set as for the linear classifiers. The parameter μ of the Gaussian kernel for all the methods was selected from the set {2^i | i = −7, −6, …, −1, 0}, while the regularization parameters were selected from the same set as for the linear classifiers.

Table 2. Ten-fold testing comparison of average CPU training time and average test set accuracy of the nonlinear kernel classifiers, with p-values comparing each method to NPPC. Entries are as in Table 1.

Haberman's survival (306 × 3): NPPC 0.1612 s, 77.43±7.11; TWSVM 4.9264 s, 78.72±8.26, p = 0.171591; GEPSVM 1.0523 s, 73.48±7.86, p = 0.071523; PSVM 0.0215 s, 73.52±6.25, p = 0.126987; SVM 8.0285 s, 73.17±10.61, p = 0.162121; LS-SVM 0.2714 s, 73.82±9.62, p = 0.158359.
Cleveland heart (297 × 13): NPPC 0.0625 s, 73.72±9.80; TWSVM 2.1756 s, 73.72±9.80, p = 1.000000; GEPSVM 0.4015 s, 42.79±19.67, p = 0.001178; PSVM 0.0162 s, 52.46±12.29, p = 0.000564; SVM 6.5248 s, 60.05±12.37, p = 0.008554; LS-SVM 0.1518 s, 59.37±9.53, p = 0.000004.
Tic-tac-toe (303 × 24): NPPC 3.0640 s, 89.33±1.16; TWSVM 93.9697 s, 89.33±1.16, p = 1.000000; GEPSVM 2.8059 s, 65.34±4.34, p = 0.000001; PSVM 0.3800 s, 94.47±2.76, p = 0.065232; SVM 244.5616 s, 90.92±2.18, p = 0.120003; LS-SVM 2.1627 s, 95.40±2.66, p = 0.021368.
Pima-Indian (768 × 8): NPPC 1.6753 s, 89.06±2.79; TWSVM 51.5236 s, 89.97±4.33, p = 0.429623; GEPSVM 17.3321 s, 65.11±5.14, p = 0.000000; PSVM 0.2129 s, 77.47±4.19, p = 0.000065; SVM 130.9067 s, 74.09±3.58, p = 0.000001; LS-SVM 2.3301 s, 76.96±4.15, p = 0.000018.
BUPA liver (345 × 6): NPPC 0.2068 s, 82.61±6.12; TWSVM 4.9233 s, 83.76±5.94, p = 0.423054; GEPSVM 1.5986 s, 57.93±9.73, p = 0.000045; PSVM 0.0285 s, 73.34±6.01, p = 0.000920; SVM 11.1348 s, 70.39±6.38, p = 0.002364; LS-SVM 0.2847 s, 72.18±6.49, p = 0.000255.
CMC (1473 × 9): NPPC 2.3735 s, 90.82±2.51; TWSVM 66.8564 s, 90.81±3.25, p = 0.989988; GEPSVM 1.0744 s, 62.29±5.94, p = 0.000001; PSVM 0.3043 s, 69.54±4.49, p = 0.000001; SVM 187.0435 s, 68.98±3.44, p = 0.000001; LS-SVM 2.7362 s, 68.27±4.75, p = 0.000001.
Breast cancer (699 × 16): NPPC 1.3228 s, 99.28±0.96; TWSVM 45.0518 s, 98.28±3.37, p = 0.167850; GEPSVM 17.3172 s, 82.93±25.79, p = 0.001102; PSVM 0.1665 s, 95.13±2.23, p = 0.000674; SVM 88.7292 s, 95.42±1.24, p = 0.000034; LS-SVM 1.0303 s, 95.26±4.79, p = 0.000124.
Spect (267 × 22): NPPC 0.1174 s, 95.90±3.49; TWSVM 4.2090 s, 94.03±5.27, p = 0.051576; GEPSVM 0.9097 s, 79.40±6.46, p = 0.000009; PSVM 0.0141 s, 86.17±5.21, p = 0.000536; SVM 5.8398 s, 85.10±7.18, p = 0.001685; LS-SVM 0.1061 s, 86.87±6.59, p = 0.000753.

The results of Table 2 show that there is no statistical difference in average classification accuracy between NPPC and TWSVM, as the p-values are above 0.05 in all the test cases, but NPPC is much faster than TWSVM. This implies that NPPC can achieve the same level of accuracy as TWSVM without solving a costly quadratic optimization problem, only a simple MSE optimization problem. On the other hand, the differences between NPPC and the remaining classifiers are highly statistically significant, with p-values well below 0.05 in most cases. For example, except for the Tic-tac-toe and Haberman's survival data sets, the classification accuracy of NPPC is substantially higher than that of GEPSVM, PSVM, SVM and LS-SVM.
Particularly for the Cleveland heart, Pima-Indian, BUPA liver, CMC and Spect data sets, the average classification accuracy of NPPC differs by more than 10% from those of the above four methods. Thus the nonlinear kernel NPPC clearly shows its dominance over the other existing methods. The NPPC classifier will perform well on data sets in which the patterns of the opposite classes are distributed along two different hyperplanes. Even if the patterns are not distributed in this way, we can choose an appropriate kernel function with proper kernel parameters to transform the data into a feature space with such a distribution, so that the nonlinear kernel NPPC can perform better on that data set.

In order to test the computational efficiency of linear NPPC on a large data set, we synthetically generated 3 million artificial data points in 10 dimensions [30]. For this we first created an arbitrary hyperplane in that dimension and generated data points with entries uniformly distributed in [−0.9, +0.9]. Each point is assigned a class label according to the side of the plane on which it lies. We chose the same regularization parameter, 1, for all the methods compared and tested only the CPU training time on all these data. NPPC trains the classifier with 3 million data points in 2.2039 s, which is comparable to the training times of 0.3326 and 1.2824 s required by PSVM and GEPSVM, respectively, whereas TWSVM, SVM and LS-SVM fail to converge in 40 h. To train LS-SVM in this case we used fixed-size LS-SVM, which works with sparse representations, approximate feature maps and estimation in the primal [17]. The computational results are recorded in Table 3, and they clearly demonstrate the boost in computational efficiency of NPPC over TWSVM.

Table 3. CPU time to learn the different linear classifiers on the synthetically generated 3 million points in 10 dimensions: NPPC 2.2039 s; TWSVM not converged in 40 h; GEPSVM 1.2824 s; PSVM 0.3326 s; SVM not converged in 40 h; LS-SVM not converged in 40 h.

We also tested the CPU training time of all the methods with the nonlinear Gaussian kernel on the Spambase data set [34] under identical conditions. This data set contains 4601 training data with 57 attributes. The kernel parameter μ and the regularization parameter c were taken as 0.1 and 1, respectively, for all the methods compared in Table 4. The reduced kernel technique [32] was applied for the computation of the training time: a rectangular kernel [32] of size 4601 × 138 was used instead of the square 4601 × 4601 kernel for these methods. Table 4 shows that the GEPSVM classifier takes the least time (0.6678 s) to train the nonlinear classifier, whereas the nonlinear kernel NPPC trains the classifier in 18.7321 s, which is less than the 19.2882 s taken by the PSVM classifier. TWSVM, fixed-size LS-SVM and Gunn SVM on the same problem take more than 2.6 h (9394 s), 3.15 h (11 296 s) and 7.5 h (27 043 s), respectively, to learn the nonlinear classifier. The large training time of fixed-size LS-SVM is due to the fact that it constructs a reduced set of support vectors based on optimal kernel entropy [17]. These results certainly indicate the computational efficiency of NKNPPC, PSVM and GEPSVM.

Table 4. CPU time to learn the different nonlinear classifiers with the Gaussian kernel on the Spambase data set [34]: NPPC 18.3721 s; TWSVM 9394 s; GEPSVM 0.6678 s; PSVM 19.2882 s; SVM 27 043 s; LS-SVM 11 296 s.
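The large-scale synthetic data described above can be generated along the following lines; this is our own sketch, the uniform range [−0.9, +0.9] and the random labelling hyperplane follow the text, and the choice of a hyperplane through the origin and the random seed are assumptions.

```python
import numpy as np

def make_synthetic(m=3_000_000, n=10, seed=0):
    """m points with entries uniform in [-0.9, 0.9], labelled by the side of a random hyperplane."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n)              # arbitrary hyperplane through the origin (assumption)
    X = rng.uniform(-0.9, 0.9, size=(m, n))
    y = np.where(X @ w >= 0.0, 1, -1)
    return X, y

X, y = make_synthetic()
# Train linear NPPC with all regularization parameters set to 1, as in the timing test.
planes = train_linear_nppc(X[y == 1], X[y == -1], 1.0, 1.0, 1.0, 1.0)
```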
The CPU training times of all the classifiers tabulated in Tables 1–4 can be explained by the complexities of the six methods compared. In Appendix A we derive the computational complexities of linear and nonlinear kernel NPPC in terms of the number of simple multiplication operations, although the actual cost may differ depending on the algorithms used for the calculations. This shows that the complexity of training linear NPPC is approximately O(kn^3), where n is the input space dimension and the value of k depends on the number of training data. Since the input space dimension is small in most cases (typically below 100), linear NPPC is very fast, comparable to linear kernel PSVM and GEPSVM. On the other hand, the complexity of nonlinear kernel NPPC is approximately O(km^3), where m is the total number of patterns in the training set and the value of k depends on m. Thus the complexity of nonlinear NPPC is much higher than that of linear NPPC, as the classification is performed in a higher-dimensional feature space. To reduce the complexity of NKNPPC, the reduced kernel technique [32] may be used: a subset (usually 1–10%) of the original training data is selected prior to training the classifier, which leads to the computation of a thin rectangular kernel matrix instead of a large square kernel. With the reduced kernel technique the complexity of NKNPPC becomes O(k m̄^3), where m̄ is only 1–10% of the total training data, selected randomly. In contrast, the complexities of TWSVM, SVM and LS-SVM depend on the number of training patterns, as mentioned in Section 2, so for large databases these methods take a long time to train the classifier, although one can apply the reduced kernel technique to TWSVM [22] and use fixed-size LS-SVM [17] for large data sets. Thus the performance of linear NPPC is comparable to that of all the other SVM classifiers, while with nonlinear kernels NPPC performs significantly better than all the other classifiers except TWSVM in terms of accuracy, and its training time is significantly lower than that of TWSVM.

6. Conclusion

In this paper we have essentially compared the performance of classifiers obtained from the margin maximization concept in an MSE framework with classifiers obtained by the standard SVM approach. The experimental results on benchmark data sets suggest that SVM may be replaced by simpler optimization problems in several cases, in which we do not have to consider support vectors and inequality constraints. The computational results given in Tables 1 and 2 indicate that MSE optimization is sufficient in most cases, rather than dealing with a costly SVM problem. However, further investigation is required into how classifiers obtained by the MSE optimization approach perform on noisy data sets, or on data sets from different applications, compared with the standard SVM. One drawback of NPPC is that it has four regularization parameters to be selected by the user; the selection of the optimal parameter set for NPPC, together with kernel selection, is therefore one possible area of further research. This approach can also be extended to multicategory classification and to incremental classification for large data sets.

Acknowledgements

The authors would like to thank the referees for very useful comments and suggestions which greatly improved the presentation. The authors are also grateful to Professor P. Mitra and Professor A. Routray of IIT Kharagpur for their help in the presentation of the paper.
Santanu Ghorai acknowledges the financial support of the authority of MCKV Institute of Engineering, Liluah, Howrah 711204, W.B., India, and of the All India Council for Technical Education (AICTE, India), in the form of salary and scholarship, respectively, for pursuing his Ph.D. degree under the Quality Improvement Programme (QIP) at the Indian Institute of Technology, Kharagpur 721302, India.

Appendix A. Supplementary data

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.sigpro.2008.10.002.

References

[1] C. Cortes, V.N. Vapnik, Support vector networks, Machine Learning 20 (3) (1995) 273–297.
[2] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[3] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, Cambridge, 2000, Chapter 6, pp. 113–145.
[4] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2 (2) (1998) 121–167.
[5] S. Lee, A. Verri, Pattern recognition with support vector machines, in: First International Workshop, SVM 2002, Springer, Niagara Falls, Canada, 2002.
[6] T. Joachims, C. Ndellec, C. Rouveriol, Text categorization with support vector machines: learning with many relevant features, in: Proceedings of the European Conference on Machine Learning (ECML), Berlin, 1998, pp. 137–142.
[7] D. Lin, N. Cristianini, C. Sugne, T. Furey, M. Ares, M. Brown, W. Grundy, D. Haussler, Knowledge-based analysis of microarray gene expression data by using support vector machines, PNAS 97 (1) (2000) 262–267.
[8] W.S. Noble, Support vector machine applications in computational biology, in: Kernel Methods in Computational Biology, MIT Press, Cambridge, 2004, pp. 71–92.
[9] T. Ebrahimi, G.N. Garcia, J.M. Vesin, Joint time-frequency-space classification of EEG in a brain–computer interface application, J. Appl. Signal Process. 1 (7) (2003) 713–729.
[10] T.N. Lal, M. Schroder, T. Hinterberger, J. Weston, M. Bogdan, N. Birbaumer, B. Scholkopf, Support vector channel selection in BCI, IEEE Trans. Biomed. Eng. 51 (6) (2004) 1003–1010.
[11] H. Ince, T.B. Trafalis, Support vector machine for regression and applications to financial forecasting, in: International Joint Conference on Neural Networks (IJCNN'02), Como, Italy, IEEE-INNS-ENNS, 2002.
[12] C.J. Hsu, W.H. Chen, S. Wu, Z. Huang, H. Chen, Credit rating analysis with support vector machines and neural networks: a market comparative study, Decision Support Systems 37 (2004) 543–558.
[13] N. Cristianini, J. Shawe-Taylor, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, UK, 2004.
[14] T. Joachims, Making large-scale support vector machine learning practical, in: B. Schölkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA, 1999, pp. 169–184.
[15] S. Haykin, Neural Networks—A Comprehensive Foundation, second ed., Pearson Education, 2006, Chapter 4, pp. 235–240.
[16] J.A.K. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Process. Lett. 9 (3) (1999) 293–300.
[17] J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, J. Vandewalle, Least Squares Support Vector Machines, World Scientific Publishing Co., Singapore, 2002.
[18] G. Fung, O.L. Mangasarian, Proximal support vector machine classifiers, in: Seventh International Conference on Knowledge Discovery and Data Mining, 2001, pp. 77–86.
[19] T. Evgeniou, M. Pontil, T. Poggio, Regularization networks and support vector machines, Advances in Computational Mathematics 1 (13) (2000) 1–50.
[20] T. Evgeniou, M. Pontil, T. Poggio, Regularization networks and support vector machines, in: A. Smola, P. Bartlett, B. Schölkopf, D. Schuurmans (Eds.), Advances in Large Margin Classifiers, MIT Press, Cambridge, MA, 2000, pp. 171–203.
[21] O.L. Mangasarian, E.W. Wild, Multisurface proximal support vector classification via generalized eigenvalues, IEEE Trans. Pattern Anal. Machine Intell. 28 (1) (2006) 69–74.
[22] Jayadeva, R. Khemchandani, S. Chandra, Twin support vector machines for pattern classification, IEEE Trans. Pattern Anal. Machine Intell. 29 (5) (2007) 905–910.
[23] A.K. Jain, R.P.W. Duin, J. Mao, Statistical pattern recognition: a review, IEEE Trans. Pattern Anal. Machine Intell. 22 (1) (2000) 4–37.
[24] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, 2002, Chapter 4, p. 138.
[25] J. Nocedal, S. Wright, Numerical Optimization, second ed., Springer, 2006, Chapter 16, p. 449.
[26] J. Chambers, A. Avlonitis, A robust mixed-norm adaptive filter algorithm, IEEE Signal Process. Lett. 4 (2) (1997) 46–48.
[27] M.S. Bazaraa, H.D. Sherali, C.M. Shetty, Nonlinear Programming—Theory and Algorithms, second ed., Wiley, 2004, Chapter 4, pp. 149–172.
[28] A.N. Tikhonov, V.Y. Arsenin, Solutions of Ill-posed Problems, Wiley, New York, 1977.
[29] G.H. Golub, C.F. Van Loan, Matrix Computations, third ed., The Johns Hopkins University Press, Baltimore, 1996, Chapter 2, p. 50.
[30] K.S. Chua, Efficient computations for large least square support vector machine classifiers, Pattern Recognition Lett. 24 (2003) 75–80.
[31] W. Pao, L. Lan, D. Yang, The mixed norm proximal support vector classifier, Department of Electronics Engineering, National Yunlin University of Science & Technology, Taiwan.
[32] Y.-J. Lee, O.L. Mangasarian, RSVM: reduced support vector machines, Technical Report 00-07, Data Mining Institute, Computer Science Department, University of Wisconsin, Madison, WI, USA, July 2000. Available from: <ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/00-07.ps>.
[33] Checker data set, <ftp://ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/checker>.
[34] C.L. Blake, C.J. Merz, UCI Repository for Machine Learning Databases, Department of Information and Computer Sciences, University of California, Irvine, 1998, <http://www.ics.uci.edu/~mlearn/MLRepository.html>.
[35] MATLAB, User's Guide, The MathWorks, Inc., 1994–2001, <http://www.mathworks.com>.
[36] G. Fung, O.L. Mangasarian, SVM toolbox home page, <http://www.cs.wisc.edu/dmi/svm/psvm>.
[37] S.R. Gunn, Support vector machine Matlab toolbox, 1998, <http://www.isis.ecs.soton.ac.uk/resources/svminfo/>.
[38] LS-SVM toolbox, version 1.5 advanced, <http://www.esat.kuleuven.ac.be/sista/lssvmlab/>.
[39] T.M. Mitchell, Machine Learning, McGraw-Hill International, Singapore, 1997.