Linear Support Vector Approach for Pattern Recognition. Short Review.
Oleg S. Seredin, September 1999

This short text is a synopsis of the linear support vector (SV) method for pattern recognition. SV is a fairly new and popular branch of pattern recognition theory; the first publications appeared in 1992. The author of this theory, the famous Russian scientist Vladimir Vapnik, who now works at the AT&T Research Lab, combined several of his earlier ideas and invented this approach, which is so popular today. We shall discuss the elementary basics of linear SV, as implemented in the program Space5.2.

We solve the classical pattern recognition task: two-class training based on a set of samples [1,2]. The training set contains $N$ objects of two classes. Each object is represented by a vector of real features $x_j \in R^n$ and by a class membership index $g_j \in \{1, -1\}$. I denote the class indices as 1 and $-1$ because this will be convenient in the subsequent computations. We solve the so-called geometrical task, or, more correctly, the deterministic task: our goal is to find a hypersurface (a hyperplane) that separates the objects of the two classes.

The idea that Vapnik suggested [3,4] is very simple: find a direction vector $a$ of the separating hyperplane that maximizes the following goal function:

$J(a) = \min_{j:\, g_j = 1} a^T x_j - \max_{j:\, g_j = -1} a^T x_j \to \max$, under the constraint $a^T a = 1$.   (1)

This criterion reflects the reasonable desire to find a direction in which the gap between the classes is maximal in the separable case and the overlap is minimal in the non-separable case. If the objects of the two classes are not separable, criterion (1) is still valid, but the optimal value of the objective function will be negative. Such a goal function is not convenient for numerical solution, so let us consider another formulation.

First, suppose that the classes are linearly separable. In this case there exists a hyperplane $a^T x + b = 0$ such that

$a^T x_j + b \geq \delta$ for $g_j = 1$ and $a^T x_j + b \leq -\delta$ for $g_j = -1$, $j = 1, \ldots, N$,   (2)

where $\delta > 0$. In other words, there exists a margin between the two classes; it is equal to $2\delta$. Vapnik calls Optimal the hyperplane for which $\delta \to \max$ under constraints (2) and $a^T a = 1$.

Let us divide both constraints in (2) by $\delta$:

$\frac{1}{\delta} a^T x_j + \frac{1}{\delta} b \geq 1$, $\qquad \frac{1}{\delta} a^T x_j + \frac{1}{\delta} b \leq -1$.

Now denote $\frac{1}{\delta} a$ as $a$ and $\frac{1}{\delta} b$ as $b$: $a^T x_j + b \geq 1$ for $g_j = 1$ and $a^T x_j + b \leq -1$ for $g_j = -1$, or, in more compact form,

$g_j (a^T x_j + b) \geq 1$, $j = 1, \ldots, N$.   (3)

Our goal is to maximize $\delta$, which means we should minimize $\|a\|$. So we have the objective function

$a^T a \to \min$, under constraints $g_j (a^T x_j + b) \geq 1$, $j = 1, \ldots, N$.   (4)

Criterion (4) is a standard quadratic programming task.

Now suppose that the classes are not linearly separable. Reasoning in the same way, we must minimize the pattern overlap, which in this case gives the objective function

$a^T a \to \max$, under constraints $g_j (a^T x_j + b) \geq -1$, $j = 1, \ldots, N$.   (5)

It is known from numerical optimization theory that constrained optimization has two settings, the primal and the dual. Problems (4) and (5) are primal tasks. Vapnik noticed some interesting features of the dual task. It is not difficult to show how to obtain the dual task from the primal one: it is necessary to find the saddle point of the Lagrangian function, taking the Kuhn-Tucker conditions into account. In the primal task we have one vector variable (plus the threshold $b$) and $N$ constraints; in the dual task we have $N$ variables in the objective function and one constraint.
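Before turning to the dual, here is a minimal numerical sketch of the primal task (4). It is only an illustration under my own assumptions: the toy training set, the use of Python with scipy.optimize.minimize (SLSQP method), and all variable names are mine and are not taken from Space5.2 or from Vapnik's texts.

import numpy as np
from scipy.optimize import minimize

# Toy linearly separable training set: x_j in R^2, g_j in {+1, -1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
g = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = X.shape[1]

def objective(w):
    a = w[:n]                       # first n components are the direction vector a
    return a @ a                    # criterion (4): a^T a -> min

def margin_constraints(w):
    a, b = w[:n], w[n]
    return g * (X @ a + b) - 1.0    # constraints (3): g_j (a^T x_j + b) - 1 >= 0

res = minimize(objective, x0=np.ones(n + 1), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
a_opt, b_opt = res.x[:n], res.x[n]
print("a =", a_opt, "b =", b_opt)
print("margin 2/||a|| =", 2.0 / np.linalg.norm(a_opt))

For this toy set the solver should recover approximately a = (2/3, 2/3) and b = -5/3, i.e. the separating hyperplane x_1 + x_2 = 2.5, with margin 2/||a|| about 2.12.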
So for the primal task (4) we have the corresponding dual task:

$W(\lambda_1, \ldots, \lambda_N) = \sum_{j=1}^{N} \lambda_j - \frac{1}{2} \sum_{j=1}^{N} \sum_{k=1}^{N} (g_j g_k x_j^T x_k)\, \lambda_j \lambda_k \to \max$,
$\sum_{j=1}^{N} \lambda_j g_j = 0$, $\quad \lambda_j \geq 0$, $j = 1, \ldots, N$.   (6)

Here $\lambda_j$ are the so-called Lagrange multipliers, the working variables of the dual task (6). The formulas for converting the variables back to the primal ones are

$a = \sum_{j=1}^{N} \lambda_j g_j x_j$, $\qquad b = -\frac{\sum_{j=1}^{N} \lambda_j a^T x_j}{\sum_{j=1}^{N} \lambda_j}$.

Now let us return to formula (5) and note its drawback. The problem is that the objective function in (5) is convex, and to find the global extremum we must minimize a convex function or maximize a concave one. That is why Vapnik decided not to use criterion (5) for the non-separable case. He used a special trick: he suggested introducing additional non-negative variables $\delta_j$ into the criterion:

$a^T a + C \sum_{j=1}^{N} \delta_j \to \min$,   (5a)

where $C$ is a positive constant, with constraints $g_j (a^T x_j + b) \geq 1 - \delta_j$, $\delta_j \geq 0$, $j = 1, \ldots, N$. The idea of the trick is that we allow some objects to be "shifted" towards the side of the correct class. The dual quadratic task for the primal criterion (5a) is

$W(\lambda_1, \ldots, \lambda_N) = \sum_{j=1}^{N} \lambda_j - \frac{1}{2} \sum_{j=1}^{N} \sum_{k=1}^{N} (g_j g_k x_j^T x_k)\, \lambda_j \lambda_k \to \max$,
$\sum_{j=1}^{N} \lambda_j g_j = 0$, $\quad 0 \leq \lambda_j \leq C$, $j = 1, \ldots, N$.   (7)

The formulas for converting the variables are absolutely the same as in the separable case (a small numerical sketch of the dual task (7) and of this conversion is given after the literature list below).

It should be added that several variants of objective function (5a) exist; the difference is in how the $\delta_j$ are included. For instance, it is possible to add $\sum_{j=1}^{N} \delta_j^2$ or $(\sum_{j=1}^{N} \delta_j)^2$ instead of $\sum_{j=1}^{N} \delta_j$.

It is interesting that only the nonzero optimal variables of the dual task, $\lambda_j > 0$, contribute to the direction vector of the optimal hyperplane. The objects of the training set for which this holds are the support vectors (these are the objects that have the minimal projection onto the direction vector for the first class and the maximal projection for the second one, cf. (1)).

There is a lot of philosophy about the interpretation of support vectors and why they are so useful. Some points: (1) the pattern recognition task is formulated as an optimization criterion; (2) the number of support vectors gives the relative accuracy of the decision rule; (3) you do not need to store the remaining objects of the sample.

Literature
1. Duda R., Hart P. Pattern Classification and Scene Analysis. New York: Wiley, 1973.
2. Fukunaga K. Introduction to Statistical Pattern Recognition. Academic Press, 1990.
3. Cortes C., Vapnik V. Support-Vector Networks. Machine Learning, Vol. 20, No. 3, 1995.
4. Vapnik V. Statistical Learning Theory. John Wiley & Sons, Inc., 1998.
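As a numerical counterpart to the dual task (7) and the variable-conversion formulas above, here is the minimal sketch referred to before the literature list. Again, the toy data, the value C = 10, and the use of scipy.optimize.minimize (SLSQP) are my own assumptions; in practice a dedicated quadratic programming solver would normally be used instead.

import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
g = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
N, C = len(g), 10.0

K = (g[:, None] * g[None, :]) * (X @ X.T)    # matrix of g_j g_k x_j^T x_k

def neg_W(lam):                               # maximize W(lambda)  <=>  minimize -W(lambda)
    return -(lam.sum() - 0.5 * lam @ K @ lam)

res = minimize(neg_W, x0=np.zeros(N), method="SLSQP",
               bounds=[(0.0, C)] * N,
               constraints=[{"type": "eq", "fun": lambda lam: lam @ g}])
lam = res.x

# Variable conversion: a = sum_j lambda_j g_j x_j; b is recovered from the
# support vectors with 0 < lambda_j < C, for which g_j (a^T x_j + b) = 1.
a = (lam * g) @ X
sv = lam > 1e-6
margin_sv = sv & (lam < C - 1e-6)
b = np.mean(g[margin_sv] - X[margin_sv] @ a)

print("support vectors (lambda_j > 0):", np.where(sv)[0])
print("a =", a, "b =", b)
print("decision for the first object:", np.sign(a @ X[0] + b))

The printed indices of the nonzero lambda_j are the support vectors discussed above; all other objects of the sample can be discarded without changing the decision rule sign(a^T x + b).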