Multi-class Support Vector Machines
Technical report by J. Weston and C. Watkins
Presented by Viktoria Muravina

Slide 2: Introduction
- The solution of the binary classification problem using Support Vectors (SV) is well developed.
- Multi-class pattern recognition problems (k > 2 classes) are usually solved with voting-scheme methods based on combining many binary classification functions.
- The paper proposes two methods that solve the k-class problem in one step:
  1. a direct generalization of the binary SV method, and
  2. solving a linear program instead of a quadratic one.

Slide 3: What is the k-class Pattern Recognition Problem
- We need to construct a decision function given ℓ independent identically distributed samples of an unknown function:
  (x_1, y_1), ..., (x_ℓ, y_ℓ),
  where each x_i, i = 1, ..., ℓ, is an attribute vector and y_i ∈ {1, ..., k} is the class of the sample.
- The decision function f(x, α), which classifies a point x, is chosen from a set of functions defined by the parameter α.
- It is assumed that the set of functions is chosen beforehand.
- The goal is to choose the parameter α that minimizes
  \sum_{i=1}^{\ell} L(y_i, f(x_i, \alpha)), where L(y, f(x, \alpha)) = 0 if f(x, \alpha) = y and 1 otherwise.

Slide 4: Binary Classification SVM
- The Support Vector approach is well developed for binary (k = 2) pattern recognition.
- The main idea is to separate the 2 classes (labelled y ∈ {−1, 1}) so that the margin is maximal.
- This gives the following optimization problem: minimize
  \Phi(w, \xi) = \frac{1}{2} \langle w \cdot w \rangle + C \sum_{i=1}^{\ell} \xi_i
- with constraints
  y_i (\langle w \cdot x_i \rangle + b) \ge 1 - \xi_i and \xi_i \ge 0, for i = 1, ..., ℓ.

Slide 5: Binary SVM continued
- The solution of this problem is obtained by maximizing the quadratic form
  W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} y_i y_j \alpha_i \alpha_j K(x_i, x_j)
- with constraints 0 \le \alpha_i \le C, i = 1, ..., ℓ, and \sum_{i=1}^{\ell} \alpha_i y_i = 0,
- giving the decision function
  f(x) = \operatorname{sign}\Big( \sum_{i=1}^{\ell} \alpha_i y_i K(x, x_i) + b \Big).

Slide 6: Multi-Class Classification Using Binary SVM
- There are 2 main approaches to solving the multi-class pattern recognition problem using binary SVM:
  1. Consider the problem as a collection of binary classification problems (1-against-all).
     - k classifiers are constructed, one for each class.
     - The nth classifier constructs a hyperplane between class n and the k − 1 other classes.
     - A voting scheme is used to classify a new point.
  2. Construct k(k − 1)/2 hyperplanes (1-against-1).
     - Each hyperplane separates one class from one other class.
     - Some voting scheme is applied for classification.
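As a concrete illustration of the 1-against-all scheme just described, here is a minimal sketch using scikit-learn's binary SVC as the base classifier; the library, the kernel and the winner-takes-all tie-breaking are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np
from sklearn.svm import SVC

def one_against_all_fit(X, y, k, C=1.0, kernel="rbf"):
    """Train k binary SVMs; the n-th machine separates class n (+1)
    from the k-1 remaining classes (-1).  y is an integer array in {0..k-1}."""
    return [SVC(C=C, kernel=kernel).fit(X, np.where(y == n, 1, -1))
            for n in range(k)]

def one_against_all_predict(machines, X):
    """Classify each point by the machine with the largest real-valued output
    (one common way of realising the voting step)."""
    scores = np.column_stack([m.decision_function(X) for m in machines])
    return scores.argmax(axis=1)
```

The 1-against-1 variant would instead train one binary machine per pair of classes and count wins over the k(k − 1)/2 pairwise classifiers.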
Slide 7: k-class Support Vector Machines
- A more natural way to solve the k-class problem is to construct a decision function by considering all classes at once.
- One can generalize the binary optimization problem on slide 4 to the following: minimize
  \Phi(w, \xi) = \frac{1}{2} \sum_{n=1}^{k} \langle w_n \cdot w_n \rangle + C \sum_{i=1}^{\ell} \sum_{m \ne y_i} \xi_i^m
- with constraints
  \langle w_{y_i} \cdot x_i \rangle + b_{y_i} \ge \langle w_m \cdot x_i \rangle + b_m + 2 - \xi_i^m,
  \xi_i^m \ge 0, i = 1, ..., ℓ, m ∈ {1, ..., k} \ y_i.
- This gives the decision function
  f(x) = \arg\max_n \big( \langle w_n \cdot x \rangle + b_n \big), n = 1, ..., k.

Slide 8: k-class Support Vector Machines continued
- The solution of this optimization problem is the saddle point of the Lagrangian
  L(w, b, \xi, \alpha, \beta) = \frac{1}{2} \sum_{n=1}^{k} \langle w_n \cdot w_n \rangle + C \sum_{i=1}^{\ell} \sum_{m=1}^{k} \xi_i^m
    - \sum_{i=1}^{\ell} \sum_{m=1}^{k} \alpha_i^m \big[ \langle (w_{y_i} - w_m) \cdot x_i \rangle + b_{y_i} - b_m - 2 + \xi_i^m \big]
    - \sum_{i=1}^{\ell} \sum_{m=1}^{k} \beta_i^m \xi_i^m
- with the dummy variables \alpha_i^{y_i} = 0, \xi_i^{y_i} = 2, \beta_i^{y_i} = 0,
- and constraints \alpha_i^m \ge 0, \beta_i^m \ge 0, \xi_i^m \ge 0, i = 1, ..., ℓ, m ∈ {1, ..., k} \ y_i,
- which has to be maximized with respect to α and β and minimized with respect to w and b.

Slide 9: k-class Support Vector Machines cont.
- After taking partial derivatives of this Lagrangian and simplifying, we end up with
  L(\alpha) = 2 \sum_{i,m} \alpha_i^m + \sum_{i,j} \Big( -\frac{1}{2} c_j^{y_i} A_i A_j + \sum_m \big[ \alpha_i^{y_j} \alpha_j^m - \frac{1}{2} \alpha_i^m \alpha_j^m \big] \Big) \langle x_i \cdot x_j \rangle
- where c_i^m = 1 if y_i = m, c_i^m = 0 if y_i ≠ m, and A_i = \sum_{m=1}^{k} \alpha_i^m.
- This is a quadratic function in terms of α with the linear constraints
  \sum_{i=1}^{\ell} \alpha_i^m = \sum_{i=1}^{\ell} c_i^m A_i, m = 1, ..., k,
  and 0 \le \alpha_i^m \le C, \alpha_i^{y_i} = 0, i = 1, ..., ℓ, m ∈ {1, ..., k} \ y_i.
- Please see slides 19 to 22 for the complete derivation of L(α).

Slide 10: k-class Support Vector Machines cont.
- This gives the decision function
  f(x, \alpha) = \arg\max_n \Big[ \sum_{i=1}^{\ell} \big( c_i^n A_i - \alpha_i^n \big) \langle x_i \cdot x \rangle + b_n \Big].
- The inner product \langle x_i \cdot x \rangle can be replaced with a kernel function K(x_i, x).
- When k = 2 the resulting hyperplane is identical to the one obtained with the binary SVM.
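A minimal numpy sketch of evaluating this decision rule, assuming the dual variables α_i^n and the thresholds b_n have already been obtained from a QP solver; the (ℓ × k) array layout, the helper name and the example kernel are illustrative choices of mine, not the paper's.

```python
import numpy as np

def multiclass_sv_decision(x, X_train, y_train, alpha, b, kernel):
    """Decision rule from slide 10:
    f(x) = argmax_n [ sum_i (c_i^n A_i - alpha_i^n) K(x_i, x) + b_n ].

    alpha : (l, k) array of dual variables alpha_i^n (with alpha_i^{y_i} = 0)
    b     : (k,) array of thresholds b_n
    """
    l, k = alpha.shape
    A = alpha.sum(axis=1)                                # A_i = sum_m alpha_i^m
    c = np.zeros((l, k)); c[np.arange(l), y_train] = 1   # c_i^n = 1 iff y_i = n
    k_vec = np.array([kernel(X_train[i], x) for i in range(l)])  # K(x_i, x)
    scores = (c * A[:, None] - alpha).T @ k_vec + b      # one score per class n
    return int(scores.argmax())

# the plain inner product <x_i . x> is the simplest admissible kernel
linear_kernel = lambda u, v: float(u @ v)
```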
Slide 11: k-Class Linear Programming Machine
- Instead of viewing the decision function as a separating hyperplane, we can view each class as having its own decision function,
  - defined only by the training points belonging to that class.
- The decision rule is then to take the largest decision function value at the point x.
- For this method, minimize the following linear program:
  \sum_{i=1}^{\ell} \alpha_i + C \sum_{i=1}^{\ell} \sum_{m \ne y_i} \xi_i^m
- subject to the constraints
  \sum_{j: y_j = y_i} \alpha_j K(x_i, x_j) + b_{y_i} \ge \sum_{j: y_j = m} \alpha_j K(x_i, x_j) + b_m + 2 - \xi_i^m,
  \alpha_i \ge 0, \xi_i^m \ge 0, i = 1, ..., ℓ, m ∈ {1, ..., k} \ y_i,
- and use the decision rule
  f(x) = \arg\max_n \Big[ \sum_{i: y_i = n} \alpha_i K(x, x_i) + b_n \Big].
- (A sketch of assembling this linear program with a generic LP solver is given after slide 18.)

Slide 12: Further Analysis
- For the binary SVM, the expected probability of an error on the test set is bounded by the ratio of the expected number of support vectors to the number of vectors in the training set:
  E[P(error)] ≤ E[number of support vectors] / (number of training vectors − 1)
- This bound also holds in the multi-class case for the voting-scheme methods.
- It is important to note that while the 1-against-all solution is a feasible point of the multi-class SV optimization problem, it is not necessarily the optimal one.

Slide 13: Benchmark Data Set Experiments
- The 2 methods were tested on 5 benchmark problems from the UCI machine learning repository.
- Where no test set was provided, the data was split randomly 10 times, with a tenth of the data used as the test set.
- All 5 data sets chosen are small, because at the time of publication a decomposition algorithm for training on larger data sets was not available.

Slide 14: Description of the Datasets, part 1
- Iris dataset
  - Contains 3 classes of 50 instances each, where each class refers to a type of iris plant.
  - Each instance has 4 numerical attributes; each attribute is a continuous variable.
- Wine dataset
  - Results of a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars, which represent the classes.
  - Class 1 has 59 instances, Class 2 has 71 and Class 3 has 48, for a total of 178.
  - Each instance has 13 numerical attributes; each attribute is a continuous variable.
- Glass dataset
  - 7 classes for different types of glass.
  - Class 1 has 70 instances, Class 2 has 76, Class 3 has 17, Class 4 has 0, Class 5 has 13, Class 6 has 9 and Class 7 has 29, for a total of 214.
  - Each instance has 10 numerical attributes, of which 1 is an index and thus irrelevant; the 9 relevant attributes are continuous variables.

Slide 15: Description of the Datasets, part 2
- Soy dataset
  - 17 classes for different damage types to the soy plant.
  - Classes 1, 2, 3, 6, 7, 9, 10, 11, 13 have 10 instances each; Classes 2, 12 have 20 instances each; Classes 4, 8, 14, 15 have 40 instances each; Classes 16, 17 have 6 instances each, for a total of 302.
  - Because some instances have missing values, only 289 instances can be used.
  - Each instance has 35 categorical attributes encoded numerically.
  - After converting each categorical value into an individual attribute, we end up with 208 attributes, each taking the value 0 or 1.
- Vowel dataset
  - 11 classes for different vowels.
  - 48 instances per class, for a total of 528 in the training set.
  - 42 instances per class, for a total of 462 in the testing set.
  - Each instance has 10 numerical attributes; each attribute is a continuous variable.

Slide 16: Results of the Benchmark Experiments
- The table below summarizes the results of the experiments.

  Name   # pts  # att  # class |  1-a-a        |  1-a-1        |  qp-mc-sv     |  lp-mc-sv
                               |  %err    svs  |  %err    svs  |  %err    svs  |  %err    svs
  Iris     150      4       3  |  1.33     75  |  1.33     54  |  1.33     31  |   2.0     13
  Wine     178     13       3  |   5.6    398  |   5.6    268  |   3.6    135  |  10.8    110
  Glass    214      9       7  |  35.2    308  |  36.4    368  |  35.6    113  |  37.2     72
  Soy      289    208      17  |  2.43    406  |  2.43   1669  |  2.43    316  |  5.41    102
  Vowel    528     10      11  |  39.8   2170  |  38.7   3069  |  34.8   1249  |  39.6    258

- 1-a-a means 1-against-all.
- 1-a-1 means 1-against-1.
- qp-mc-sv is the quadratic multi-class SVM.
- lp-mc-sv is the linear-programming multi-class SVM.
- svs is the number of non-zero coefficients (support vectors).
- %err is the raw error percentage.

Slide 17: Results of the Benchmark Experiments cont.
- The quadratic multi-class SV method gave results comparable to 1-against-all, while reducing the number of support vectors.
- The linear programming method also gave reasonable results.
  - Even though its error rates are worse than those of the quadratic or 1-against-all methods, the number of support vectors was reduced significantly compared to all other methods.
  - A smaller number of support vectors means faster classification; classification speed has been a weakness of SV methods compared to other techniques.
- 1-against-1 performed worse than the quadratic method, and it also tended to have the most support vectors.

Slide 18: Limitations and Conclusions
- The optimization problem that has to be solved is very large.
  - For the quadratic method the objective is quadratic in (k − 1)·ℓ variables.
  - For the linear programming method the optimization is linear, with roughly k·ℓ variables and k·ℓ constraints.
- This can lead to slower training times than 1-against-all, especially for the quadratic method.
- The new methods do not outperform the 1-against-all and 1-against-1 methods; however, both reduce the number of support vectors needed for the decision function.
- Further research is needed to test how the methods perform on large datasets.
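Before the derivation appendix, here is a sketch of how the linear program from slide 11 could be assembled and solved with an off-the-shelf LP solver. The use of scipy.optimize.linprog, the variable ordering [α, b, ξ] and all helper names are illustrative assumptions; the paper does not prescribe an implementation.

```python
import numpy as np
from scipy.optimize import linprog

def fit_lp_multiclass(K, y, k, C=1.0):
    """Assemble and solve the linear program from slide 11.

    K : (l, l) kernel matrix on the training points
    y : (l,) integer class labels in {0, ..., k-1}
    Returns alpha (l,) and b (k,).
    """
    l = K.shape[0]
    n_xi = l * (k - 1)                      # one slack xi_i^m per (i, m != y_i)
    n_var = l + k + n_xi                    # variable vector z = [alpha, b, xi]

    # objective: sum_i alpha_i + C * sum of slacks (b is not penalised)
    obj = np.concatenate([np.ones(l), np.zeros(k), C * np.ones(n_xi)])

    A_ub, b_ub = [], []
    s = 0                                   # running index into the xi block
    for i in range(l):
        own = (y == y[i])
        for m in range(k):
            if m == y[i]:
                continue
            other = (y == m)
            row = np.zeros(n_var)
            # constraint rearranged into "<= -2" form:
            # -sum_{j:y_j=y_i} a_j K_ij - b_{y_i} + sum_{j:y_j=m} a_j K_ij + b_m - xi_i^m <= -2
            row[:l][own] = -K[i, own]
            row[:l][other] = K[i, other]
            row[l + y[i]] = -1.0
            row[l + m] = 1.0
            row[l + k + s] = -1.0
            A_ub.append(row)
            b_ub.append(-2.0)
            s += 1

    bounds = [(0, None)] * l + [(None, None)] * k + [(0, None)] * n_xi
    res = linprog(obj, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=bounds, method="highs")
    return res.x[:l], res.x[l:l + k]

def lp_decision(K_new, y_train, alpha, b, k):
    """Decision rule: f(x) = argmax_n [ sum_{i: y_i = n} alpha_i K(x, x_i) + b_n ].
    K_new is the (n_test, l) kernel matrix between new and training points."""
    scores = np.column_stack([K_new[:, y_train == n] @ alpha[y_train == n] + b[n]
                              for n in range(k)])
    return scores.argmax(axis=1)
```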
Slide 19: Derivation of L(α) on slide 9
- We need to find a saddle point, so we start with the Lagrangian
  L(w, b, \xi, \alpha, \beta) = \frac{1}{2} \sum_{n=1}^{k} \langle w_n \cdot w_n \rangle + C \sum_{i=1}^{\ell} \sum_{m=1}^{k} \xi_i^m
    - \sum_{i=1}^{\ell} \sum_{m=1}^{k} \alpha_i^m \big[ \langle (w_{y_i} - w_m) \cdot x_i \rangle + b_{y_i} - b_m - 2 + \xi_i^m \big]
    - \sum_{i=1}^{\ell} \sum_{m=1}^{k} \beta_i^m \xi_i^m,
- using the notation c_i^m = 1 if y_i = m, c_i^m = 0 if y_i ≠ m, and A_i = \sum_{m=1}^{k} \alpha_i^m.
- Take the partial derivatives with respect to w_n, b_n and \xi_i^m and set them equal to 0.

Slide 20: Derivation of L(α) on slide 9 cont.
- The derivatives are
  \frac{\partial L}{\partial w_n} = w_n + \sum_{i=1}^{\ell} \alpha_i^n x_i - \sum_{i=1}^{\ell} c_i^n A_i x_i = 0,
  \frac{\partial L}{\partial b_n} = -\sum_{i=1}^{\ell} c_i^n A_i + \sum_{i=1}^{\ell} \alpha_i^n = 0,
  \frac{\partial L}{\partial \xi_i^m} = -\alpha_i^m + C - \beta_i^m = 0.
- From this we get
  w_n = \sum_{i=1}^{\ell} \big( c_i^n A_i - \alpha_i^n \big) x_i,
  \sum_{i=1}^{\ell} \alpha_i^n = \sum_{i=1}^{\ell} c_i^n A_i,
  \beta_i^n + \alpha_i^n = C,
  and 0 \le \alpha_i^n \le C.

Slide 21: Derivation of L(α) on slide 9 cont.
- After substituting the equations from the previous slide into the Lagrangian we get
  L(\alpha) = 2 \sum_{i,m} \alpha_i^m + \sum_{i,j,m} \Big[ \frac{1}{2} c_i^m c_j^m A_i A_j - \frac{1}{2} c_i^m A_i \alpha_j^m - \frac{1}{2} c_j^m A_j \alpha_i^m + \frac{1}{2} \alpha_i^m \alpha_j^m - c_j^{y_i} A_j \alpha_i^m + \alpha_i^m \alpha_j^{y_i} + c_j^m A_j \alpha_i^m - \alpha_i^m \alpha_j^m \Big] \langle x_i \cdot x_j \rangle.

Slide 22: Derivation of L(α) on slide 9 cont.
- Due to the fact that
  \sum_{i,j,m} c_j^m A_j \alpha_i^m \langle x_i \cdot x_j \rangle = \sum_{i,j,m} c_i^m A_i \alpha_j^m \langle x_i \cdot x_j \rangle,
  we get
  L(\alpha) = 2 \sum_{i,m} \alpha_i^m + \sum_{i,j,m} \Big[ \frac{1}{2} c_i^m c_j^m A_i A_j - c_j^{y_i} A_j \alpha_i^m + \alpha_i^m \alpha_j^{y_i} - \frac{1}{2} \alpha_i^m \alpha_j^m \Big] \langle x_i \cdot x_j \rangle.
- Also \sum_{m} c_i^m c_j^m = c_j^{y_i} = c_i^{y_j}, thus
  L(\alpha) = 2 \sum_{i,m} \alpha_i^m + \sum_{i,j} \Big( -\frac{1}{2} c_j^{y_i} A_i A_j + \sum_m \big[ \alpha_i^{y_j} \alpha_j^m - \frac{1}{2} \alpha_i^m \alpha_j^m \big] \Big) \langle x_i \cdot x_j \rangle.
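The two simplification steps used above are easy to spot-check numerically. The short sketch below draws a small random problem (the sizes, the seed and the way α is filled in are arbitrary choices of mine) and verifies both the symmetry identity and the identity \sum_m c_i^m c_j^m = c_j^{y_i}.

```python
import numpy as np

rng = np.random.default_rng(0)
l, k = 6, 3                              # small random problem
y = rng.integers(0, k, size=l)           # class labels y_i
X = rng.normal(size=(l, 4))
K = X @ X.T                              # symmetric Gram matrix <x_i . x_j>

alpha = rng.uniform(0.0, 1.0, size=(l, k))
alpha[np.arange(l), y] = 0.0             # dummy variables alpha_i^{y_i} = 0
c = np.zeros((l, k)); c[np.arange(l), y] = 1.0   # c_i^m = 1 iff y_i = m
A = alpha.sum(axis=1)                    # A_i = sum_m alpha_i^m

# identity from slide 22 (holds because K is symmetric):
# sum_{i,j,m} c_j^m A_j alpha_i^m K_ij = sum_{i,j,m} c_i^m A_i alpha_j^m K_ij
lhs = np.einsum('jm,j,im,ij->', c, A, alpha, K)
rhs = np.einsum('im,i,jm,ij->', c, A, alpha, K)
print(np.isclose(lhs, rhs))              # True

# sum_m c_i^m c_j^m equals 1 exactly when y_i = y_j, i.e. c_j^{y_i}
print(np.array_equal(c @ c.T, (y[:, None] == y[None, :]).astype(float)))  # True
```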